
# Exercise 1

In this activity, you will work with the New York City Airbnb Open Data, a popular, messy, and realistic dataset used by data analysts and data scientists around the world. It contains 48,895 Airbnb listings along with details about hosts, locations, prices, reviews, and more.

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8Pmq3Gv7y7z_2Xun-OqMIk43kd5u0TXEatw&s"/>


To guide your work, you will follow the 6-Step Data Wrangling Process, a professional workflow used in industry:

- Discovering. Become familiar with the data by exploring its structure and observing patterns.

- Structuring. Fix issues related to format, column types, indices, and duplicated records.

- Cleaning. Handle missing values, outliers, and inconsistencies so the dataset becomes trustworthy.

- Enriching. Add new features, aggregate information, and enhance the dataset for deeper insights.

- Validating. Apply rule-based checks to ensure the data meets quality standards.

- Publishing. Export and prepare the cleaned dataset for downstream analysis or reporting.

Throughout this exercise, you will answer questions and fill in code cells based on these six steps. You will mimic what professional data scientists do when preparing data for business intelligence dashboards, machine learning models, or exploratory analysis.

In [306]:
import kagglehub
import os
import pandas as pd
import numpy as np

# Download latest version
path = kagglehub.dataset_download("dgomonov/new-york-city-airbnb-open-data")

print("Path to dataset files:", path)

if os.path.isdir(path):
  print(True)

contents = os.listdir(path)
contents

# Build the file path portably (assumes the first directory entry is the CSV)
mydataset = os.path.join(path, contents[0])
mydataset


df = pd.read_csv(mydataset)

Using Colab cache for faster access to the 'new-york-city-airbnb-open-data' dataset.
Path to dataset files: /kaggle/input/new-york-city-airbnb-open-data
True


## A. Discovering


Display the first 10 rows

In [307]:
df.head(10)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
5,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,2019-06-22,0.59,1,129
6,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,45,49,2017-10-05,0.4,1,0
7,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Hell's Kitchen,40.76489,-73.98493,Private room,79,2,430,2019-06-24,3.47,1,220
8,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.80178,-73.96723,Private room,79,2,118,2017-07-21,0.99,1,0
9,5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150,1,160,2019-06-09,1.33,4,188


Display the column names:

In [308]:
df.columns.tolist()

['id',
 'name',
 'host_id',
 'host_name',
 'neighbourhood_group',
 'neighbourhood',
 'latitude',
 'longitude',
 'room_type',
 'price',
 'minimum_nights',
 'number_of_reviews',
 'last_review',
 'reviews_per_month',
 'calculated_host_listings_count',
 'availability_365']

Display the dataframe shape:

In [309]:
df.shape

(48895, 16)

Identify the numerical features:

In [310]:
# Use the public select_dtypes API rather than the private _get_numeric_data()
numerical_cols = df.select_dtypes(include="number").columns.tolist()
print("Numerical features:", numerical_cols)


Numerical features: ['id', 'host_id', 'latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']


Identify the categorical features

In [311]:
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
print("Categorical features:", categorical_cols)


Categorical features: ['name', 'host_name', 'neighbourhood_group', 'neighbourhood', 'room_type', 'last_review']
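
As a minimal sketch of how dtype-based selection behaves, the toy frame below reuses a few column names from the dataset with made-up values. It suggests why `last_review` shows up in the categorical list above: until it is parsed, a date column is just text, so any non-numeric selection will pick it up.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset.
toy = pd.DataFrame({
    "price": [149, 225, 150],
    "latitude": [40.64749, 40.75362, 40.80902],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "last_review": ["2018-10-19", "2019-05-21", None],  # dates stored as text
})

# Numeric columns are selected by dtype family, everything else is left over.
numeric = toy.select_dtypes(include="number").columns.tolist()
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric)
print("Non-numeric:", non_numeric)
```

A column landing in the non-numeric group is therefore a prompt to inspect it, not proof that it is categorical.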


What are the top 5 questions this dataset might help answer?

In [312]:
questions = [
    "What is the typical listing price across neighborhoods?",
    "Which neighborhoods have the largest number of Airbnb listings?",
    "What relationship exists between listing prices and review counts?",
    "Which hosts operate the most Airbnb listings?",
    "How does the minimum-night rule impact booking activity?"
]

for i, question in enumerate(questions, 1):
    print(f"{i}. {question}")


1. What is the typical listing price across neighborhoods?
2. Which neighborhoods have the largest number of Airbnb listings?
3. What relationship exists between listing prices and review counts?
4. Which hosts operate the most Airbnb listings?
5. How does the minimum-night rule impact booking activity?


## B. Structuring

Check if the dataset has proper column types

In [313]:
df.dtypes

Unnamed: 0,0
id,int64
name,object
host_id,int64
host_name,object
neighbourhood_group,object
neighbourhood,object
latitude,float64
longitude,float64
room_type,object
price,int64


Check if the dataset has index issues

In [314]:
idx = df.index

print("Index type     :", type(idx))
print("Unique index?  :", idx.is_unique)
print("Index range    :", idx.min(), "to", idx.max())

idx


Index type     : <class 'pandas.core.indexes.range.RangeIndex'>
Unique index?  : True
Index range    : 0 to 48894


RangeIndex(start=0, stop=48895, step=1)

Check if the dataset has duplicated rows

In [315]:
dups = df.loc[df.duplicated()]
print(f"Duplicated rows: {dups.shape[0]}")

if dups.empty:
    print("No duplicated rows found.")
else:
    display(dups)


Duplicated rows: 0
No duplicated rows found.
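
`df.duplicated()` only flags rows that match on every column. The sketch below, on a hypothetical toy frame, shows how a repeated key can slip past that check and how the `subset` argument catches it. Whether `id` should be unique in the real dataset is an assumption worth verifying separately.

```python
import pandas as pd

# Hypothetical listings: id 2 appears twice with a slightly different name.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["A", "B", "B (edited)", "C"],
})

full_dups = int(toy.duplicated().sum())              # identical on all columns
id_dups = int(toy.duplicated(subset=["id"]).sum())   # repeated key only

# Keep the first occurrence of each id.
deduped = toy.drop_duplicates(subset=["id"], keep="first")

print(full_dups, id_dups, deduped["id"].tolist())
```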


Convert columns to correct types (e.g., dates → datetime).

In [316]:
print("Current dtypes:")
print(df.dtypes)

conversions = {
    'last_review': lambda x: pd.to_datetime(x, errors='coerce'),
    'price': lambda x: pd.to_numeric(x, errors='coerce')
}

for col, func in conversions.items():
    if col in df.columns:
        df[col] = func(df[col])

print("\nUpdated dtypes:")
print(df.dtypes)


Current dtypes:
id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

Updated dtypes:
id                                         int64
name                                      object
host_id                                    int64
host_name                                 object
neighbourhood_group                       object
neighbourhood                       
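
The conversions above rely on `errors='coerce'`. As a small illustration on made-up values (not taken from the dataset), coercion turns anything unparseable into a missing value instead of raising, which is why the missing count of a column should be checked again after converting it.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one junk string, one missing.
raw_dates = pd.Series(["2018-10-19", "not a date", None])
parsed = pd.to_datetime(raw_dates, errors="coerce")   # junk and None -> NaT

# Same idea for numbers: the junk string becomes NaN.
raw_prices = pd.Series(["149", "free"])
numbers = pd.to_numeric(raw_prices, errors="coerce")

print(parsed.isna().tolist())
print(numbers.tolist())
```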

Identify at least one structural issue and describe how to fix it.

In [317]:
print("Checking structural issues...\n")

issues = []
fixes = []

# -----------------------------------
# Price checks (ensure all positive)
# -----------------------------------
if 'price' in df.columns:

    # Step 1: Detect non-numeric price values
    if df['price'].dtype == 'object':
        numeric_price = pd.to_numeric(df['price'], errors='coerce')
        non_numeric_count = numeric_price.isna().sum()

        if non_numeric_count > 0:
            issues.append(f"Price has {non_numeric_count} non-numeric values")
            fixes.append("Convert to numeric and drop invalid rows")

            print(f"Issue: Price column has {non_numeric_count} non-numeric values")
            print("Fix: Converting price to numeric and removing invalid rows...\n")

        # Convert and replace
        df['price'] = pd.to_numeric(df['price'], errors='coerce')

    # Step 2: Remove rows where price <= 0 or price is NaN
    invalid_price = df['price'].isna().sum() + (df['price'] <= 0).sum()

    if invalid_price > 0:
        issues.append(f"Price contains {invalid_price} invalid values (non-numeric, NaN, or <= 0)")
        fixes.append("Keep only rows where price > 0")

        print(f"Issue: {invalid_price} price values are invalid")
        print("Fix: Removing rows with price <= 0 or NaN...\n")

        # Apply fix
        df = df[df['price'] > 0]


# -----------------------------------
# Date checks
# -----------------------------------
date_cols = [col for col in df.columns if 'date' in col.lower() or 'review' in col.lower()]

for col in date_cols:
    if df[col].dtype == 'object':
        issues.append(f"{col} stored as string")
        fixes.append(f"Convert {col} to datetime")

        print(f"Issue: {col} is stored as string")
        print(f"Fix: Converting {col} to datetime...\n")

        df[col] = pd.to_datetime(df[col], errors='coerce')


# -----------------------------------
# Index checks
# -----------------------------------
if not df.index.is_unique:
    issues.append("Index has duplicates")
    fixes.append("Reset index")
    print("Issue: Index has duplicate values")
    print("Fix: Reset index\n")

    df = df.reset_index(drop=True)

elif df.index.min() != 0:
    issues.append("Index not starting from 0")
    fixes.append("Reset index")
    print("Issue: Index is not sequential")
    print("Fix: Reset index\n")

    df = df.reset_index(drop=True)


# -----------------------------------
# Summary
# -----------------------------------
print("=" * 50)
print("Summary:")
if issues:
    for i, (issue, fix) in enumerate(zip(issues, fixes), 1):
        print(f"{i}. {issue}")
        print(f"   → {fix}")
else:
    print("No major structural issues found")

# Display the updated dataframe
df


Checking structural issues...

Issue: 11 price values are invalid
Fix: Removing rows with price <= 0 or NaN...

Summary:
1. Price contains 11 invalid values (non-numeric, NaN, or <= 0)
   → Keep only rows where price > 0


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,NaT,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,NaT,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,NaT,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,NaT,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,NaT,,6,2


## C. Cleaning



Check for missing values

In [318]:
missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percent
})

missing_df = missing_df[missing_df['Missing Count'] > 0] \
    .sort_values('Missing Count', ascending=False)

print("Missing values summary:")
missing_df


Missing values summary:


Unnamed: 0,Missing Count,Percentage
reviews_per_month,10051,20.56092
last_review,10051,20.56092
host_name,21,0.042959
name,16,0.032731


Check for outliers in `minimum_nights`

In [319]:
def analyze_minimum_nights(df):
    mn = df['minimum_nights']

    print("Minimum nights statistics:")
    print(mn.describe())

    Q1, Q3 = mn.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    lb, ub = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

    outliers = df[(mn < lb) | (mn > ub)]

    print(f"\nOutliers (IQR method):")
    print(f"Lower bound: {lb}, Upper bound: {ub}")
    print(f"Number of outliers: {len(outliers)}")

    return outliers[['minimum_nights']].head(10)

if 'minimum_nights' in df.columns:
    analyze_minimum_nights(df)


Minimum nights statistics:
count    48884.000000
mean         7.029887
std         20.512224
min          1.000000
25%          1.000000
50%          3.000000
75%          5.000000
max       1250.000000
Name: minimum_nights, dtype: float64

Outliers (IQR method):
Lower bound: -5.0, Upper bound: 11.0
Number of outliers: 6605
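
The bounds reported above can be reproduced by hand from the quartiles in the `describe()` output. This is only the arithmetic of the 1.5 × IQR rule, written out as a short sketch:

```python
# Quartiles taken from the describe() output above (25% and 75% rows).
q1, q3 = 1.0, 5.0

iqr = q3 - q1             # spread of the middle 50% of listings
lower = q1 - 1.5 * iqr    # values below this count as outliers
upper = q3 + 1.5 * iqr    # values above this count as outliers

print(f"IQR = {iqr}, bounds = [{lower}, {upper}]")
```

Because the smallest `minimum_nights` value is 1, nothing can fall below the lower bound of -5, so every flagged row lies above 11 nights. A 30-night minimum would be flagged too, even though it may be a legitimate monthly rental, which is one reason to treat the IQR result as a screening aid and not as a removal rule.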


Check for invalid values (e.g., price ≤ 0)

In [320]:
if 'price' in df.columns:
    invalid_prices = df[df['price'] <= 0]

    print(f"Number of rows with price <= 0: {len(invalid_prices)}")

    if len(invalid_prices) > 0:
        print("\nInvalid price rows:")
        display(invalid_prices[['id', 'name', 'price']].head(10))
    else:
        print("No invalid prices found.")


Number of rows with price <= 0: 0
No invalid prices found.


Apply a missing-value strategy (drop, fill, or flag)
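
The three strategies can be compared on a toy frame first. The sketch below uses made-up rows, and `has_reviews` is a hypothetical column name introduced only for illustration. It drops rows missing a required field, flags missingness before it is overwritten, and then fills the gap.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one listing lacks a name, one has never been reviewed.
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "number_of_reviews": [9, 3, 0],
    "reviews_per_month": [0.21, 0.10, np.nan],
})

# Drop: remove rows missing a required field; copy so later writes are safe.
kept = toy.dropna(subset=["name"]).copy()

# Flag: record which rows had no review rate before the value is replaced.
kept["has_reviews"] = kept["reviews_per_month"].notna()

# Fill: a listing with no reviews has a review rate of 0.
kept["reviews_per_month"] = kept["reviews_per_month"].fillna(0)

print(kept)
```

Flagging before filling keeps the distinction between "no reviews yet" and a genuinely low review rate available for later analysis.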


In [321]:
# Copy after dropping rows so that later column assignments modify an
# independent DataFrame and do not trigger SettingWithCopyWarning
df_cleaned = df.dropna(subset=['name', 'host_id', 'neighbourhood_group']).copy()

if 'reviews_per_month' in df_cleaned:
    df_cleaned['reviews_per_month'] = df_cleaned['reviews_per_month'].fillna(0)

print(f"Original rows: {len(df)}")
print(f"Cleaned rows: {len(df_cleaned)}")
print(f"Removed: {len(df) - len(df_cleaned)}")

df = df_cleaned


Original rows: 48884
Cleaned rows: 48868
Removed: 16


Apply an outlier-handling strategy

In [322]:
# Cap outliers instead of removing them
if 'minimum_nights' in df.columns:

    # Cap minimum_nights at 365 (1 year)
    df['minimum_nights'] = df['minimum_nights'].clip(upper=365)

    # Cap price at the 99th percentile if available
    if 'price' in df.columns:
        p99 = df['price'].quantile(0.99)
        df['price'] = df['price'].clip(upper=p99)

    print("Outliers have been capped.")
    print(f"Minimum nights max: {df['minimum_nights'].max()}")

    if 'price' in df.columns:
        print(f"Price max: {df['price'].max()}")


Outliers have been capped.
Minimum nights max: 365
Price max: 799
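
To see what capping does to individual values, here is a minimal sketch on hypothetical `minimum_nights` entries. It also counts how many rows the cap changed, which is a useful number to report alongside the new maximum.

```python
import pandas as pd

# Hypothetical minimum_nights values, including one extreme entry.
nights = pd.Series([1, 3, 30, 1250])

capped = nights.clip(upper=365)             # values above 365 become 365
n_changed = int((nights != capped).sum())   # how many rows the cap touched

print(capped.tolist(), n_changed)
```

Capping keeps every listing in the dataset and only limits the influence of extreme values, whereas dropping the rows would also discard their other, possibly valid, fields.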




## D. Enriching

Create at least 3 new variables:
- price_per_minimum_night
- review_rate_per_month (combine review count + review frequency)
- is_expensive (boolean flag based on price threshold)


In [323]:
# 1. price_per_minimum_night
if {'price', 'minimum_nights'}.issubset(df.columns):
    df['price_per_minimum_night'] = df['price'] / df['minimum_nights'].replace(0, 1)

# 2. review_rate_per_month (combined metric or fallback)
if {'number_of_reviews', 'reviews_per_month'}.issubset(df.columns):

    # Combined metric
    df['review_rate_per_month'] = (
        df['reviews_per_month'] *
        (df['number_of_reviews'] / (df['number_of_reviews'] + 1))
    )

    # Fallback: if reviews_per_month exists, ensure no missing values
    df['review_rate_per_month'] = df['review_rate_per_month'].fillna(0)

# 3. is_expensive (boolean flag based on median price)
if 'price' in df.columns:
    price_median = df['price'].median()
    df['is_expensive'] = df['price'] > price_median

# Show results
print("New variables created:")
cols_to_show = [
    'price_per_minimum_night',
    'review_rate_per_month',
    'is_expensive'
]
print(df[cols_to_show].head())


New variables created:
   price_per_minimum_night  review_rate_per_month  is_expensive
0                    149.0               0.189000          True
1                    225.0               0.371739          True
2                     50.0               0.000000          True
3                     89.0               4.622878         False
4                      8.0               0.090000         False
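
The `review_rate_per_month` values above can be checked by hand. The helper below (a hypothetical name, used only in this sketch) restates the formula from the cell: the monthly review rate is scaled by n / (n + 1), where n is the total review count, so listings with few reviews are discounted slightly and listings with none score 0.

```python
def review_rate(reviews_per_month: float, number_of_reviews: int) -> float:
    """Monthly review rate shrunk by n / (n + 1), as in the cell above."""
    return reviews_per_month * number_of_reviews / (number_of_reviews + 1)

row0 = review_rate(0.21, 9)     # first listing: 0.21 * 9/10
row1 = review_rate(0.38, 45)    # second listing: 0.38 * 45/46
no_reviews = review_rate(0.0, 0)

print(round(row0, 6), round(row1, 6), no_reviews)
```

The first two results agree with rows 0 and 1 of the printed table (0.189000 and 0.371739).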



Create 1 aggregated summary table

(e.g. average price per neighborhood, number of hosts per borough)

In [324]:
# Price and host summary per neighbourhood group (borough)
if {'neighbourhood_group', 'price'}.issubset(df.columns):

    summary = (
        df.groupby('neighbourhood_group')
          .agg(
              Avg_Price=('price', 'mean'),
              Median_Price=('price', 'median'),
              Number_of_Listings=('price', 'count'),
              Number_of_Hosts=('host_id', 'nunique')
          )
          .round(2)
          .sort_values('Avg_Price', ascending=False)
    )

    print("Summary: Average price and number of hosts per neighborhood group")
    display(summary)   # if in Jupyter; otherwise print(summary)


Summary: Average price and number of hosts per neighborhood group


neighbourhood_group,Avg_Price,Median_Price,Number_of_Listings,Number_of_Hosts
Manhattan,182.96,150.0,21651,16570
Brooklyn,119.43,90.0,20089,15959
Staten Island,101.8,75.0,373,256
Queens,96.1,75.0,5666,3983
Bronx,85.79,65.0,1089,788


## E. Validating

Create at least 3 validation checks:
- Price must be positive.
- Host ID should not be null.
- Minimum nights must be ≥ 1.

Generate a validation report:
- Count of rows violating each rule.

Decide whether invalid rows (if any) should be removed or corrected.

In [325]:
# Step 1: Create validation checks
validation_results = {}

# Rule 1: Price must be positive
if 'price' in df.columns:
    invalid_price = (df['price'] <= 0).sum()
    validation_results['Price must be positive'] = invalid_price

# Rule 2: Host ID should not be null
if 'host_id' in df.columns:
    null_host_id = df['host_id'].isnull().sum()
    validation_results['Host ID should not be null'] = null_host_id

# Rule 3: Minimum nights must be ≥ 1
if 'minimum_nights' in df.columns:
    invalid_min_nights = (df['minimum_nights'] < 1).sum()
    validation_results['Minimum nights must be ≥ 1'] = invalid_min_nights


# Step 2: Generate a validation report
validation_report = pd.DataFrame({
    'Validation Rule': list(validation_results.keys()),
    'Violations': list(validation_results.values())
})

validation_report['Status'] = validation_report['Violations'].apply(
    lambda x: 'PASS' if x == 0 else 'FAIL'
)

print("Validation Report:")
print(validation_report)

print(f"\nTotal rows in dataset: {len(df)}")

decision = (
    "Dataset is valid"
    if validation_report['Violations'].sum() == 0
    else "Some rows need to be removed or corrected"
)

print(f"\nDecision: {decision}")


# Step 3: Remove invalid rows if any violations exist
if validation_report['Violations'].sum() > 0:
    print("\nRemoving invalid rows...")

    before_rows = len(df)

    if 'price' in df.columns:
        df = df[df['price'] > 0]

    if 'host_id' in df.columns:
        df = df[df['host_id'].notna()]

    if 'minimum_nights' in df.columns:
        df = df[df['minimum_nights'] >= 1]

    after_rows = len(df)

    print(f"Rows before cleaning: {before_rows}")
    print(f"Rows after cleaning: {after_rows}")
    print(f"Rows removed: {before_rows - after_rows}")

# Final output
validation_report


Validation Report:
              Validation Rule  Violations Status
0      Price must be positive           0   PASS
1  Host ID should not be null           0   PASS
2  Minimum nights must be ≥ 1           0   PASS

Total rows in dataset: 48868

Decision: Dataset is valid


Unnamed: 0,Validation Rule,Violations,Status
0,Price must be positive,0,PASS
1,Host ID should not be null,0,PASS
2,Minimum nights must be ≥ 1,0,PASS
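
One possible variation, sketched here on a hypothetical toy frame, is to express each rule once as a boolean mask and derive both the violation counts and the row filter from the same definitions. This is an alternative layout, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical rows: row 1 breaks two rules, row 2 breaks one.
toy = pd.DataFrame({
    "host_id": [2787, None, 4632],
    "price": [149, 0, 150],
    "minimum_nights": [1, 2, 0],
})

# Each rule maps a label to a mask that is True where the row passes.
rules = {
    "Price must be positive": toy["price"] > 0,
    "Host ID should not be null": toy["host_id"].notna(),
    "Minimum nights must be >= 1": toy["minimum_nights"] >= 1,
}

# Report: number of failing rows per rule.
violations = {name: int((~ok).sum()) for name, ok in rules.items()}

# Filter: keep rows that pass every rule.
passes_all = pd.concat(rules, axis=1).all(axis=1)
valid_rows = toy[passes_all]

print(violations)
print(len(valid_rows))
```

Keeping the masks in one dictionary means the report and the clean-up step cannot drift apart when a rule is edited.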


## F. Publishing

Select final columns for publication.



In [326]:
# Define the key columns we want to include in the final published dataset
target_columns = [
    'id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
    'neighbourhood', 'latitude', 'longitude', 'room_type',
    'price', 'minimum_nights', 'number_of_reviews',
    'reviews_per_month', 'price_per_minimum_night',
    'review_rate_per_month', 'is_expensive'
]

# Keep only columns that are actually available in the dataframe
available_columns = [col for col in target_columns if col in df.columns]

# Create a new dataframe containing only the relevant columns
df_final = df[available_columns].copy()

print(f"Number of columns selected for publication: {len(available_columns)}")
print("Columns included:")
print(available_columns)

df_final.head()


Number of columns selected for publication: 16
Columns included:
['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'price_per_minimum_night', 'review_rate_per_month', 'is_expensive']


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,price_per_minimum_night,review_rate_per_month,is_expensive
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,149.0,0.189,True
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,225.0,0.371739,True
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,0.0,50.0,0.0,True
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,89.0,4.622878,False
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0.1,8.0,0.09,False


Export cleaned dataset as "cleaned_airbnb.csv"

In [327]:
# Export the final cleaned dataset to CSV
output_file = "cleaned_airbnb.csv"
df_final.to_csv(output_file, index=False)

print(f"Cleaned dataset has been saved as '{output_file}'.")
print(f"Dataset shape: {df_final.shape}")


Cleaned dataset has been saved as 'cleaned_airbnb.csv'.
Dataset shape: (48868, 16)
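
A quick way to gain confidence in an export is to read it back. The sketch below uses an in-memory buffer and made-up rows in place of the real file, and `round_trip` is a hypothetical helper. It also shows what `index=False` prevents: with the default setting, the index is written as an unnamed first column that comes back as `Unnamed: 0`.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the cleaned dataset.
toy = pd.DataFrame({
    "id": [2539, 2595],
    "price": [149, 225],
    "is_expensive": [True, True],
})

def round_trip(frame: pd.DataFrame, **to_csv_kwargs) -> pd.DataFrame:
    """Write frame to an in-memory CSV and read it straight back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

restored = round_trip(toy, index=False)   # same columns as the original
with_index = round_trip(toy)              # default index=True adds a column

print(restored.columns.tolist())
print(with_index.columns.tolist())
```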


Export summary reports (e.g., validation or aggregations) as CSV.

In [328]:
# Export validation report
validation_report.to_csv("validation_report.csv", index=False)

# Export neighborhood summary if available
if 'neighbourhood_group' in df.columns:
    summary.to_csv("summary_by_neighborhood.csv", index=True)

print("Summary reports exported successfully:")
print(" - validation_report.csv")
print(" - summary_by_neighborhood.csv")


Summary reports exported successfully:
 - validation_report.csv
 - summary_by_neighborhood.csv
