# Exercise 1

In this activity, you will work with the New York City Airbnb Open Data—a popular, messy, and highly realistic dataset used by data analysts and data scientists around the world. The dataset includes thousands of Airbnb listings along with details about hosts, locations, prices, reviews, and more.

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8Pmq3Gv7y7z_2Xun-OqMIk43kd5u0TXEatw&s"/>


To guide your work, you will follow the 6-Step Data Wrangling Process, a professional workflow used in industry:

- Discovering. Become familiar with the data by exploring its structure and observing patterns.

- Structuring. Fix issues related to format, column types, indices, and duplicated records.

- Cleaning. Handle missing values, outliers, and inconsistencies so the dataset becomes trustworthy.

- Enriching. Add new features, aggregate information, and enhance the dataset for deeper insights.

- Validating. Apply rule-based checks to ensure the data meets quality standards.

- Publishing. Export and prepare the cleaned dataset for downstream analysis or reporting.

Throughout this exercise, you will answer questions and fill in code cells based on these six steps. You will mimic what professional data scientists do when preparing data for business intelligence dashboards, machine learning models, or exploratory analysis.

In [479]:
import kagglehub
import os
import pandas as pd
import numpy as np

# Download latest version
path = kagglehub.dataset_download("dgomonov/new-york-city-airbnb-open-data")

print("Path to dataset files:", path)

if os.path.isdir(path):
  print(True)

contents = os.listdir(path)
contents

mydataset = os.path.join(path, contents[0])
mydataset


df = pd.read_csv(mydataset)

Using Colab cache for faster access to the 'new-york-city-airbnb-open-data' dataset.
Path to dataset files: /kaggle/input/new-york-city-airbnb-open-data
True


## A. Discovering


Display the first 10 rows

In [480]:
df.head(10)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
5,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,2019-06-22,0.59,1,129
6,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,45,49,2017-10-05,0.4,1,0
7,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Hell's Kitchen,40.76489,-73.98493,Private room,79,2,430,2019-06-24,3.47,1,220
8,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.80178,-73.96723,Private room,79,2,118,2017-07-21,0.99,1,0
9,5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150,1,160,2019-06-09,1.33,4,188


Display the column names:

In [481]:
df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

Display the dataframe shape:

In [482]:
df.shape

(48895, 16)

Identify the numerical features:

In [483]:
numeric_dtypes = df.dtypes[df.dtypes.apply(pd.api.types.is_numeric_dtype)]
print(numeric_dtypes)

id                                  int64
host_id                             int64
latitude                          float64
longitude                         float64
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object


Identify the categorical features

In [484]:
object_dtypes = df.select_dtypes(include=['object']).dtypes
print(object_dtypes)

name                   object
host_name              object
neighbourhood_group    object
neighbourhood          object
room_type              object
last_review            object
dtype: object


In [485]:
df.neighbourhood_group.unique()

array(['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx'],
      dtype=object)

In [486]:
df.neighbourhood.unique()

array(['Kensington', 'Midtown', 'Harlem', 'Clinton Hill', 'East Harlem',
       'Murray Hill', 'Bedford-Stuyvesant', "Hell's Kitchen",
       'Upper West Side', 'Chinatown', 'South Slope', 'West Village',
       'Williamsburg', 'Fort Greene', 'Chelsea', 'Crown Heights',
       'Park Slope', 'Windsor Terrace', 'Inwood', 'East Village',
       'Greenpoint', 'Bushwick', 'Flatbush', 'Lower East Side',
       'Prospect-Lefferts Gardens', 'Long Island City', 'Kips Bay',
       'SoHo', 'Upper East Side', 'Prospect Heights',
       'Washington Heights', 'Woodside', 'Brooklyn Heights',
       'Carroll Gardens', 'Gowanus', 'Flatlands', 'Cobble Hill',
       'Flushing', 'Boerum Hill', 'Sunnyside', 'DUMBO', 'St. George',
       'Highbridge', 'Financial District', 'Ridgewood',
       'Morningside Heights', 'Jamaica', 'Middle Village', 'NoHo',
       'Ditmars Steinway', 'Flatiron District', 'Roosevelt Island',
       'Greenwich Village', 'Little Italy', 'East Flatbush',
       'Tompkinsville', 'Asto

In [487]:
df.room_type.unique()

array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)
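
`unique()` lists the categories but not how common each one is. A minimal sketch of how `value_counts` could complement it, using a small made-up frame (the `toy` rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the real listings
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Shared room", "Entire home/apt", "Private room"],
})

# Absolute counts per category, most frequent first
counts = toy["room_type"].value_counts()

# Relative frequencies (shares that sum to 1)
shares = toy["room_type"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data the same two calls would be applied to `df["room_type"]` or `df["neighbourhood_group"]`.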

What are the top 5 questions this dataset might help answer?

In [520]:
# 1. Does location (borough or neighbourhood) affect the price?
# 2. Does location influence the number of reviews a listing receives?
# 3. Does room type affect the price?
# 4. Does price affect how quickly a listing gets booked?
# 5. Does a higher price affect the number of reviews per month?

## B. Structuring

Check if the dataset has proper column types

In [489]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [490]:
df.dtypes

Unnamed: 0,0
id,int64
name,object
host_id,int64
host_name,object
neighbourhood_group,object
neighbourhood,object
latitude,float64
longitude,float64
room_type,object
price,int64


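#### ``last_review`` should be a datetime column. It contains nulls, but `pd.to_datetime` converts missing entries to `NaT`, so only the rows that have a value receive an actual date and the nulls can be handled later in the Cleaning step.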

Check if the dataset has index issues

In [491]:
print(f"Is id unique: {df.id.is_unique}")
print(f"Is host_id unique: {df.host_id.is_unique}")

Is id unique: True
Is host_id unique: False


Check if the dataset has duplicated rows

In [492]:
print(f"Number of duplicated rows: {df.duplicated().sum()}")

Number of duplicated rows: 0
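
`df.duplicated()` only flags rows that match in every column. A hedged sketch of a looser check on a subset of key columns follows; the toy rows and the choice of `key_cols` are assumptions made for illustration, not a rule from the exercise:

```python
import pandas as pd

# Hypothetical toy data: two rows describe the same place under different names
toy = pd.DataFrame({
    "name": ["Loft A", "Loft A (copy)", "Studio B"],
    "host_id": [1, 1, 2],
    "latitude": [40.70, 40.70, 40.80],
    "longitude": [-73.90, -73.90, -73.95],
})

# Exact duplicates across all columns
full_dupes = toy.duplicated().sum()

# Possible duplicates: same host at the same coordinates (assumed key columns)
key_cols = ["host_id", "latitude", "longitude"]
possible_dupes = toy[toy.duplicated(subset=key_cols, keep=False)]

print(f"Exact duplicates: {full_dupes}")
print(possible_dupes)
```

Rows flagged this way are only candidates; a host can legitimately list several rooms at one address.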


Convert columns to correct types (e.g., dates → datetime).

In [493]:
df['last_review'] = pd.to_datetime(df['last_review'])

In [494]:
df.dtypes

Unnamed: 0,0
id,int64
name,object
host_id,int64
host_name,object
neighbourhood_group,object
neighbourhood,object
latitude,float64
longitude,float64
room_type,object
price,int64
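
The conversion above works even though `last_review` has missing values. A small sketch on made-up strings shows how `pd.to_datetime` treats missing and unparseable entries; `errors="coerce"` is an optional extra, not something the original cell used:

```python
import pandas as pd

# Hypothetical toy column with one missing value and one malformed value
toy = pd.DataFrame({"last_review": ["2018-10-19", None, "2019-07-05", "not a date"]})

# errors="coerce" turns missing or unparseable entries into NaT instead of raising
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

print(toy["last_review"].dtype)
print(f"NaT count: {toy['last_review'].isna().sum()}")
```

Without `errors="coerce"`, missing values still become `NaT`, but a malformed string would raise an error.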


Identify at least one structural issue and describe how to fix it.

#### The ``id`` column is redundant because the DataFrame already has a default index. If ``id`` adds no value to the analysis, it can be dropped and the default index used instead.

In [495]:
df = df.drop('id', axis=1)

In [496]:
df.head()

Unnamed: 0,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,NaT,,1,365
3,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
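
Dropping `id` is one option. An alternative sketch, in case the listing identifier is still needed downstream, is to promote it to the index. The three rows reuse values from the preview above; `indexed` is a name introduced here for illustration:

```python
import pandas as pd

# First three listings from the preview, reduced to a few columns
toy = pd.DataFrame({
    "id": [2539, 2595, 3647],
    "name": ["Clean & quiet apt home by the park",
             "Skylit Midtown Castle",
             "THE VILLAGE OF HARLEM....NEW YORK !"],
    "price": [149, 225, 150],
})

# verify_integrity=True raises a ValueError if any id is repeated
indexed = toy.set_index("id", verify_integrity=True)

# Rows can then be looked up directly by listing id
print(indexed.loc[2595, "price"])
```

This keeps the identifier available for joins with other tables while removing it from the feature columns.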


## C. Cleaning



Check for missing values

In [497]:
df.shape

(48895, 15)

In [498]:
df.isnull().sum()

Unnamed: 0,0
name,16
host_id,0
host_name,21
neighbourhood_group,0
neighbourhood,0
latitude,0
longitude,0
room_type,0
price,0
minimum_nights,0


In [499]:
df[df['name'].isna() | df['host_name'].isna()]

Unnamed: 0,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
360,Bienvenue,526653,,Queens,Queens Village,40.72413,-73.76133,Private room,50,1,43,2019-07-08,0.45,1,88
2700,Cozy Studio in Flatbush,7779204,,Brooklyn,Flatbush,40.64965,-73.96154,Entire home/apt,100,30,49,2017-01-02,0.69,1,342
2854,,6676776,Peter,Manhattan,Battery Park City,40.71239,-74.0162,Entire home/apt,400,1000,0,NaT,,1,362
3703,,11395220,Anna,Manhattan,East Village,40.73215,-73.98821,Entire home/apt,200,1,28,2015-06-08,0.45,1,341
5745,SPRING in the City!! Zen-Style Tranquil Bedroom,919218,,Manhattan,Harlem,40.80606,-73.95061,Private room,86,3,34,2019-05-23,1.0,1,359
5775,,20700823,Jesse,Manhattan,Greenwich Village,40.73473,-73.99244,Entire home/apt,225,1,1,2015-01-01,0.02,1,0
5975,,22686810,Michaël,Manhattan,Nolita,40.72046,-73.9955,Entire home/apt,215,7,5,2016-01-02,0.09,1,0
6075,Charming Room in Prospect Heights!,23077718,,Brooklyn,Crown Heights,40.67512,-73.96146,Private room,50,1,0,NaT,,1,0
6269,,21600904,Lucie,Brooklyn,Williamsburg,40.7137,-73.94378,Private room,150,1,0,NaT,,1,0
6567,,1832442,Carolina,Brooklyn,Bushwick,40.70046,-73.92825,Private room,70,1,0,NaT,,1,0


In [500]:
initial_rows = df.shape[0]
df.dropna(subset=['host_name', 'name'], how='any', inplace=True)
rows_after_drop = df.shape[0]
print(f"Number of rows before dropping: {initial_rows}")
print(f"Number of rows after dropping where 'host_name' or 'name' are null: {rows_after_drop}")

Number of rows before dropping: 48895
Number of rows after dropping where 'host_name' or 'name' are null: 48858


Check for outliers in `minimum_nights` and remove them

In [501]:
from scipy.stats import zscore

# Keep track of the initial shape
initial_shape = df.shape
print(f"Initial DataFrame shape: {initial_shape}")

# Calculate Z-scores for 'minimum_nights'
z_scores = np.abs(zscore(df['minimum_nights']))

# Define the threshold for outlier removal
threshold = 3

# Identify rows that are outliers
outlier_mask = z_scores >= threshold
outliers = df[outlier_mask].copy()

print(f"\nRows identified as outliers in 'minimum_nights' (Z-score >= {threshold}):")
display(outliers)

# Remove outliers from the DataFrame
df = df[~outlier_mask]

# Display the shape after outlier removal
final_shape = df.shape
print(f"\nDataFrame shape after removing outliers: {final_shape}")
print(f"Number of rows removed: {initial_shape[0] - final_shape[0]}")

Initial DataFrame shape: (48858, 15)

Rows identified as outliers in 'minimum_nights' (Z-score >= 3):


Unnamed: 0,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
14,West Village Nest - Superhost,11975,Alina,Manhattan,West Village,40.73530,-74.00525,Entire home/apt,120,90,27,2018-10-31,0.22,1,0
61,Sunny & Spacious Chelsea Apartment,60278,Petra,Manhattan,Chelsea,40.74623,-73.99530,Entire home/apt,375,180,5,2018-11-03,0.12,1,180
106,Large 2 Bedroom Great for Groups!,87773,Shelly,Brooklyn,Bedford-Stuyvesant,40.68373,-73.92377,Entire home/apt,125,90,162,2019-06-28,1.46,2,137
245,"Quiet, clean midtown apt w. elevato",23619,Anna/Fonzy,Manhattan,Midtown,40.75749,-73.96897,Entire home/apt,110,200,92,2019-04-30,0.90,1,140
349,LARGE 1BR (CONV 2BR) CROWN HEIGHTS,503800,Sadatu,Brooklyn,Crown Heights,40.67473,-73.94494,Entire home/apt,100,90,0,NaT,,1,365
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48043,Luxury TriBeCa Apartment at an amazing price,271248669,Jenny,Manhattan,Tribeca,40.71206,-74.00999,Entire home/apt,6500,180,0,NaT,,1,365
48145,Williamsburg Waterfront Apartment,212865,Waad,Brooklyn,Williamsburg,40.72045,-73.96313,Entire home/apt,150,90,0,NaT,,1,91
48205,Room with Amazing location in Manhattan,31846499,Veronika,Manhattan,East Harlem,40.78774,-73.94627,Private room,73,180,0,NaT,,1,0
48312,Furnished room for rent in Bronx SINGLE FEMALE,272247972,Kadeen,Bronx,Olinville,40.88116,-73.86547,Shared room,25,90,0,NaT,,1,190



DataFrame shape after removing outliers: (48531, 15)
Number of rows removed: 327
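
Z-scores depend on the mean and standard deviation, which extreme values themselves inflate. As a point of comparison, here is a sketch of the interquartile-range (IQR) rule on made-up values; the 1.5 multiplier is the conventional choice, not something the exercise prescribes:

```python
import pandas as pd

# Hypothetical minimum_nights values with two extreme entries
nights = pd.Series([1, 1, 2, 2, 3, 3, 4, 5, 30, 1000], name="minimum_nights")

# Quartiles and the interquartile range
q1 = nights.quantile(0.25)
q3 = nights.quantile(0.75)
iqr = q3 - q1

# Fences: values outside [lower, upper] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (nights < lower) | (nights > upper)

print(nights[is_outlier].tolist())  # [30, 1000]
```

On a heavily skewed column the IQR rule usually flags more rows than a z-score threshold of 3, so the choice depends on how many long-stay listings the analysis should keep.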


In [502]:
df.room_type.unique()

array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)

Check for outliers in `price` within each `room_type` and remove them

In [503]:
print(f"Initial row count: {len(df)}")

# Threshold for Z-score
threshold = 3

# Function to get rows that are NOT outliers per room type
def remove_outliers(group):
    # zscore was imported from scipy.stats in the previous cell
    z_scores = zscore(group['price'])
    abs_z_scores = np.abs(z_scores)
    return group[abs_z_scores <= threshold]

# Remove outliers (the result is stored in df_cleaned; df itself is unchanged)
df_cleaned = df.groupby('room_type', group_keys=False).apply(remove_outliers)

print(f"Row count after removing outliers: {len(df_cleaned)}")




Initial row count: 48531
Row count after removing outliers: 48162
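
The same per-group filter can be written without `groupby.apply`. The sketch below uses `transform` on made-up prices to compute a z-score for each row relative to its own room type; the toy frame and the name `kept` are illustrative assumptions:

```python
import pandas as pd

# Hypothetical toy data: each room type has one extreme price
toy = pd.DataFrame({
    "room_type": ["Private room"] * 12 + ["Shared room"] * 12,
    "price": [50] * 11 + [5000] + [30] * 11 + [900],
})

grouped = toy.groupby("room_type")["price"]

# Per-row group mean and population standard deviation (ddof=0, as scipy's zscore uses)
group_mean = grouped.transform("mean")
group_std = grouped.transform(lambda s: s.std(ddof=0))

# Z-score of each price relative to its own room type
z = (toy["price"] - group_mean) / group_std

# Keep rows within 3 standard deviations of their group mean
kept = toy[z.abs() <= 3]
print(f"Rows before: {len(toy)}, rows after: {len(kept)}")
```

Because `transform` returns a result aligned with the original index, the mask can be applied to the full frame directly.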




Check for invalid values (e.g., price ≤ 0)

In [504]:
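# This filtered frame is not displayed because it is not the last expression in the cell
df[df['price'] <= 0]
# Total row count before the invalid prices are removed
print(df.shape[0])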

48531


In [505]:
df = df[df['price'] > 0]
print(df.shape[0])

48520


Apply a missing-value strategy (drop, fill, or flag)


In [506]:
df.isnull().sum()

Unnamed: 0,0
name,0
host_id,0
host_name,0
neighbourhood_group,0
neighbourhood,0
latitude,0
longitude,0
room_type,0
price,0
minimum_nights,0


In [507]:
initial_rows = df.shape[0]
df = df.dropna(subset=['last_review', 'reviews_per_month'], how='any')
rows_after_drop = df.shape[0]
print(f"Number of rows before dropping nulls in 'last_review' or 'reviews_per_month': {initial_rows}")
print(f"Number of rows after dropping nulls: {rows_after_drop}")
print(f"Number of rows removed: {initial_rows - rows_after_drop}")

Number of rows before dropping nulls in 'last_review' or 'reviews_per_month': 48520
Number of rows after dropping nulls: 38645
Number of rows removed: 9875
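
Dropping removes every listing that has never been reviewed. A sketch of the "fill" and "flag" strategies as an alternative follows. It assumes the nulls occur only where `number_of_reviews` is 0, which the previewed rows suggest but which should be verified on the full data; `has_reviews` is a column name introduced here:

```python
import numpy as np
import pandas as pd

# Three listings from the preview, including one with no reviews
toy = pd.DataFrame({
    "number_of_reviews": [9, 0, 270],
    "last_review": pd.to_datetime(["2018-10-19", None, "2019-07-05"]),
    "reviews_per_month": [0.21, np.nan, 4.64],
})

# Flag: record which listings have ever been reviewed
toy["has_reviews"] = toy["number_of_reviews"] > 0

# Fill: a listing with zero reviews has a rate of zero reviews per month
no_reviews = toy["number_of_reviews"] == 0
toy.loc[no_reviews, "reviews_per_month"] = (
    toy.loc[no_reviews, "reviews_per_month"].fillna(0.0)
)

# last_review is left as NaT because there is no meaningful date to fill in
print(toy)
```

This keeps all rows, at the cost of a `last_review` column that still contains `NaT`.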


Apply an outlier-handling strategy

In [508]:
# Already applied above (z-score filtering with a threshold of 3)

## D. Enriching

Create at least 3 new variables:
- price_per_minimum_night
- review_rate_per_month (combine ratings + frequency)
- is_expensive (boolean flag based on price threshold)


In [509]:
# 1. price_per_minimum_night
df['price_per_minimum_night'] = df['price'] / df['minimum_nights']

# 2. review_rate_per_month (combining number_of_reviews and reviews_per_month)
# Assuming reviews_per_month is already a rate, multiplying by number_of_reviews can give a weighted activity score.
# NaNs in reviews_per_month should already be handled by previous steps (dropped).
df['review_rate_per_month'] = df['number_of_reviews'] * df['reviews_per_month']

# 3. is_expensive (boolean flag based on price threshold)
# Define threshold as the 75th percentile of the price
price_threshold = df['price'].quantile(0.75)
df['is_expensive'] = df['price'] > price_threshold

print("New variables created:")
display(df[['price_per_minimum_night', 'review_rate_per_month', 'is_expensive']].head())

New variables created:


Unnamed: 0,price_per_minimum_night,review_rate_per_month,is_expensive
0,149.0,1.89,False
1,225.0,17.1,True
3,89.0,1252.8,False
4,8.0,0.9,False
5,66.666667,43.66,True
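
A single global threshold for `is_expensive` tends to flag mostly entire homes, since they cost more than rooms. One possible variant, sketched on made-up prices, compares each listing with the 75th percentile of its own room type; the column name `is_expensive_for_type` is hypothetical:

```python
import pandas as pd

# Hypothetical toy prices for two room types
toy = pd.DataFrame({
    "room_type": ["Private room"] * 4 + ["Entire home/apt"] * 4,
    "price": [40, 60, 80, 100, 120, 160, 200, 240],
})

# 75th percentile of price within each room type, aligned to every row
group_threshold = toy.groupby("room_type")["price"].transform(
    lambda s: s.quantile(0.75)
)

# A listing is expensive if it exceeds the threshold of its own room type
toy["is_expensive_for_type"] = toy["price"] > group_threshold
print(toy)
```

In this toy frame the global 75th percentile is 170, so the global rule would flag no private rooms at all, while the per-type rule flags the priciest listing in each group.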


In [510]:
df.columns

Index(['name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood',
       'latitude', 'longitude', 'room_type', 'price', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365',
       'price_per_minimum_night', 'review_rate_per_month', 'is_expensive'],
      dtype='object')

Create 1 aggregated summary table

(e.g., average price per neighborhood, or number of hosts per borough)

In [511]:
average_price_per_neighborhood = df.groupby('neighbourhood')['price'].mean().reset_index()
print("Average price per neighborhood:")
display(average_price_per_neighborhood.head())

Average price per neighborhood:


Unnamed: 0,neighbourhood,price
0,Allerton,90.594595
1,Arden Heights,67.25
2,Arrochar,118.25
3,Arverne,158.515152
4,Astoria,102.258156
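
A mean alone can hide how many listings it is based on. The sketch below uses named aggregation to collect several statistics per borough in one table, including the host count mentioned in the prompt; the toy rows are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data; one Manhattan host owns two listings
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan", "Manhattan"],
    "host_id": [2787, 4869, 2845, 4632, 2845],
    "price": [149, 89, 225, 150, 200],
})

# Named aggregation: output_column=(input_column, function)
summary = (
    toy.groupby("neighbourhood_group")
    .agg(
        mean_price=("price", "mean"),
        median_price=("price", "median"),
        n_listings=("price", "size"),
        n_hosts=("host_id", "nunique"),
    )
    .reset_index()
)
print(summary)
```

Including `n_listings` makes it easier to spot neighbourhoods whose average rests on only a handful of rows.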


In [512]:
df.head()

Unnamed: 0,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,price_per_minimum_night,review_rate_per_month,is_expensive
0,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,149.0,1.89,False
1,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,225.0,17.1,True
3,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,89.0,1252.8,False
4,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0,8.0,0.9,False
5,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,2019-06-22,0.59,1,129,66.666667,43.66,True


## E. Validating

Create at least 3 validation checks:
- Price must be positive.
- Host ID should not be null.
- Minimum nights must be ≥ 1.

Generate a validation report:
- Count of rows violating each rule.

Decide whether invalid rows (if any) should be removed or corrected.

In [513]:
# Validation checks
violations = {}

# Price must be positive
price_violation_count = df[df['price'] <= 0].shape[0]
violations['price_not_positive'] = price_violation_count

# Host ID should not be null
# Since 'host_id' is an integer type (int64), it generally doesn't contain NaN values.
# If it were to contain NaN, its dtype would likely be float or object.
# However, we can still explicitly check for any non-finite values if there's a chance of corruption.
host_id_violation_count = df['host_id'].isnull().sum()
violations['host_id_is_null'] = host_id_violation_count

# Minimum nights must be >= 1
minimum_nights_violation_count = df[df['minimum_nights'] < 1].shape[0]
violations['minimum_nights_less_than_1'] = minimum_nights_violation_count

# Generate validation report
print("--- Validation Report ---")
for rule, count in violations.items():
    print(f"Rule: {rule.replace('_', ' ').capitalize()}")
    print(f"  Number of violations: {count}")
    print("-" * 30)

# Decide on a strategy for invalid rows
print("\n--- Strategy for Invalid Rows ---")
if all(count == 0 for count in violations.values()):
    print("No invalid rows found based on the defined rules. No action needed.")
else:
    print("Invalid rows were found. The common strategy for such violations, especially for core numerical data like price and minimum_nights, is to:")
    print("1. Remove rows with 'price <= 0' (already performed in previous cleaning steps).")
    print("2. Remove rows where 'host_id' is null (no violations found, but if there were, removing them is a sensible approach).")
    print("3. Remove rows where 'minimum_nights < 1'. These values are illogical for a rental listing.")
    print("\nApplying removal strategy for any remaining violations...")

    initial_rows = df.shape[0]

    # Remove rows where minimum_nights < 1
    if minimum_nights_violation_count > 0:
        df = df[df['minimum_nights'] >= 1]
        print(f"Removed {minimum_nights_violation_count} rows where 'minimum_nights' was less than 1.")

    # The 'price <= 0' check was already handled, but for completeness in this section:
    if price_violation_count > 0:
        df = df[df['price'] > 0]
        print(f"Removed {price_violation_count} rows where 'price' was not positive.")

    # The 'host_id_is_null' check showed no violations, but if it had:
    if host_id_violation_count > 0:
        df.dropna(subset=['host_id'], inplace=True)
        print(f"Removed {host_id_violation_count} rows where 'host_id' was null.")

    final_rows = df.shape[0]
    print(f"Total rows before validation strategy: {initial_rows}")
    print(f"Total rows after validation strategy: {final_rows}")
    print(f"Total rows removed during this validation step: {initial_rows - final_rows}")

--- Validation Report ---
Rule: Price not positive
  Number of violations: 0
------------------------------
Rule: Host id is null
  Number of violations: 0
------------------------------
Rule: Minimum nights less than 1
  Number of violations: 0
------------------------------

--- Strategy for Invalid Rows ---
No invalid rows found based on the defined rules. No action needed.
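
The same three rules can also be expressed as boolean masks, which keeps the report and the filtering in one place. This is a compact sketch on made-up rows that each break one rule; the names `rules`, `report`, and `valid_rows` are introduced here for illustration:

```python
import pandas as pd

# Hypothetical toy data: rows 1, 2 and 3 each violate one rule
toy = pd.DataFrame({
    "price": [149, 0, 80, 60],
    "host_id": [2787, 2845, None, 7356],
    "minimum_nights": [1, 3, 10, 0],
})

# Each mask is True where the row VIOLATES the rule
rules = {
    "price_not_positive": toy["price"] <= 0,
    "host_id_is_null": toy["host_id"].isna(),
    "minimum_nights_less_than_1": toy["minimum_nights"] < 1,
}

# Validation report: one row per rule with its violation count
report = (
    pd.Series({rule: int(mask.sum()) for rule, mask in rules.items()}, name="violations")
    .rename_axis("rule")
    .reset_index()
)
print(report)

# Keep only the rows that pass every rule
any_violation = pd.concat(rules, axis=1).any(axis=1)
valid_rows = toy[~any_violation]
print(f"Valid rows: {len(valid_rows)} of {len(toy)}")
```

Adding a new rule then only requires one more dictionary entry, and `report` is already a DataFrame that can be exported with `to_csv`.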


## F. Publishing

Select final columns for publication.



In [514]:
df.columns

Index(['name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood',
       'latitude', 'longitude', 'room_type', 'price', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365',
       'price_per_minimum_night', 'review_rate_per_month', 'is_expensive'],
      dtype='object')

In [515]:
df = df.drop('host_name', axis=1)
print("The 'host_name' column is removed as it will not be significant for the analysis")

The 'host_name' column is removed as it will not be significant for the analysis


Export cleaned dataset as "cleaned_airbnb.csv"

In [516]:
df.to_csv('cleaned_airbnb.csv', index=False)
print("Cleaned dataset exported to 'cleaned_airbnb.csv'")

Cleaned dataset exported to 'cleaned_airbnb.csv'
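
CSV files do not store column types, so the datetime conversion done earlier is lost when the file is read back unless the date column is parsed again. A small round-trip sketch, written to a temporary directory with a hypothetical file name:

```python
import os
import tempfile

import pandas as pd

# Two listings from the preview, reduced to a few columns
toy = pd.DataFrame({
    "price": [149, 225],
    "last_review": pd.to_datetime(["2018-10-19", "2019-05-21"]),
    "is_expensive": [False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name used only for this round-trip check
    out_path = os.path.join(tmp_dir, "cleaned_airbnb_toy.csv")
    toy.to_csv(out_path, index=False)

    # Read back without and with explicit date parsing
    plain = pd.read_csv(out_path)
    parsed = pd.read_csv(out_path, parse_dates=["last_review"])

print(plain["last_review"].dtype)
print(parsed["last_review"].dtype)
```

Anyone loading `cleaned_airbnb.csv` later would need the same `parse_dates` argument to get `last_review` back as a datetime column.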


In [517]:
df.shape

(38645, 17)

In [518]:
df.columns

Index(['name', 'host_id', 'neighbourhood_group', 'neighbourhood', 'latitude',
       'longitude', 'room_type', 'price', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365',
       'price_per_minimum_night', 'review_rate_per_month', 'is_expensive'],
      dtype='object')

Export summary reports (e.g., validation or aggregations) as CSV.

In [519]:
# Export aggregated summary: Average price per neighborhood
average_price_per_neighborhood.to_csv('average_price_per_neighborhood.csv', index=False)
print("Aggregated summary 'average_price_per_neighborhood.csv' exported.")

Aggregated summary 'average_price_per_neighborhood.csv' exported.
