# Data Wrangling Capstone 2

### Abstract

With the growing popularity of Airbnb in New York City, hosts face increasing challenges in pricing their listings competitively. This project aims to develop a predictive modeling framework that helps both prospective and current Airbnb hosts make data-driven pricing decisions. Using historical listing data, including property features, neighborhood demographics, and pricing trends, we will perform data cleaning and exploratory analysis to uncover the factors that influence rental prices. A regression-based model will be built to predict optimal listing prices, supported by additional models for pricing optimization and demand forecasting. The end result should give hosts actionable insights to improve their market entry strategy and maximize rental performance in NYC’s competitive short-term rental landscape.

In [49]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

### Initial Inspection

Before cleaning the data, it is important to inspect the structure of the dataset. We check the data types, non-null counts, and basic statistics to identify missing values, column types, and any potential outliers or anomalies.

In [52]:
df = pd.read_csv('C:\\Users\\jjani\\Downloads\\AB_NYC_2019.csv')

In [53]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [55]:
df.shape

(48895, 16)

In [56]:
df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

In [57]:
df.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

In [58]:
df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
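
The raw counts above are easier to judge as a share of all rows (for instance, 10052 of 48895 rows is roughly 20.6%). The following is a minimal sketch of such a summary, run on a small made-up frame; the helper name `missing_summary` and the `sample` values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are illustrative only)
sample = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "last_review": ["2019-05-21", None, None, "2018-10-19"],
    "reviews_per_month": [0.38, np.nan, np.nan, 0.21],
    "price": [225, 150, 89, 80],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(sample)
print(summary)
```

Applied to the real data, the same helper could be called as `missing_summary(df)`.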

### Drop Irrelevant Columns

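Columns such as IDs, names, and free-text fields (e.g., name, host_name) are not useful for modeling without extensive text processing, so they are candidates for removal before the modeling stage. In this step we keep every column and instead remove the few rows where name or host_name is missing (16 and 21 rows, respectively), then label missing last_review values as 'No Review'.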

In [62]:
# Drop rows with missing names or host names; only a handful of rows are affected
df = df.dropna(subset=['name', 'host_name'])

# Fill missing values in 'last_review' with 'No Review'
df['last_review'] = df['last_review'].fillna('No Review')
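
If the identifier and free-text columns are to be removed before modeling, one possible approach is `DataFrame.drop`. This is a minimal sketch on a made-up two-row frame; the list `cols_to_drop` is an assumption chosen for illustration and may not match the columns the final model excludes.

```python
import pandas as pd

# Two illustrative rows with a subset of the dataset's columns
sample = pd.DataFrame({
    "id": [2539, 2595],
    "name": ["Clean & quiet apt home by the park", "Skylit Midtown Castle"],
    "host_id": [2787, 2845],
    "host_name": ["John", "Jennifer"],
    "room_type": ["Private room", "Entire home/apt"],
    "price": [149, 225],
})

# Hypothetical choice of columns to exclude from modeling
cols_to_drop = ["id", "name", "host_id", "host_name"]

# errors="ignore" skips any listed column that is not present in the frame
model_df = sample.drop(columns=cols_to_drop, errors="ignore")
print(model_df.columns.tolist())
```

Because `drop` returns a new frame, the original columns remain available in `sample` for later text processing.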

### Fill Missing Values

The reviews_per_month column has missing values for listings with no reviews. Rather than dropping these rows, we assume no reviews means zero reviews per month, and fill the missing values with 0.

In [65]:

# Fill missing values in 'reviews_per_month' with 0.0 since no reviews means 0/month
df['reviews_per_month'] = df['reviews_per_month'].fillna(0.0)

# Final check: this should return an empty series
print(df.isnull().sum()[df.isnull().sum() > 0])


Series([], dtype: int64)
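
The fill with 0 rests on the assumption that reviews_per_month is missing exactly where a listing has no reviews. A minimal sketch of how that assumption could be checked before filling is shown below, using a small made-up frame; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 2 and 4 have no reviews and a missing monthly rate
sample = pd.DataFrame({
    "number_of_reviews": [9, 45, 0, 270, 0],
    "reviews_per_month": [0.21, 0.38, np.nan, 4.64, np.nan],
})

missing_rate = sample["reviews_per_month"].isnull()
no_reviews = sample["number_of_reviews"] == 0

# True only if the two conditions agree on every row
assumption_holds = bool((missing_rate == no_reviews).all())
print(assumption_holds)

# Fill only when the assumption is confirmed
if assumption_holds:
    sample["reviews_per_month"] = sample["reviews_per_month"].fillna(0.0)
```

If the check fails on the real data, the mismatching rows would be worth inspecting before deciding how to fill them.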


### Remove Invalid/Extreme Prices

We remove listings with zero or negative prices, as these are not realistic. Listings priced above $1000 per night are treated as outliers that could skew the analysis, so they are removed as well. Filtering on price in this way keeps the data representative of typical listings.

In [68]:

# Remove listings with invalid or extreme prices
df = df[df['price'] > 0]
df = df[df['price'] <= 1000]
df.reset_index(drop=True, inplace=True)
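
It can be useful to record how many rows a price filter removes. The sketch below wraps the same two conditions in a helper and counts the dropped rows on a made-up frame; the function name `filter_prices` and its parameters are assumptions for illustration.

```python
import pandas as pd

# Toy prices covering both boundaries of the filter
sample = pd.DataFrame({"price": [0, 80, 149, 1000, 1001, 5000]})

def filter_prices(frame: pd.DataFrame, low: int = 0, high: int = 1000) -> pd.DataFrame:
    """Keep rows with low < price <= high and reset the index."""
    mask = (frame["price"] > low) & (frame["price"] <= high)
    return frame.loc[mask].reset_index(drop=True)

filtered = filter_prices(sample)
removed = len(sample) - len(filtered)
print(f"Removed {removed} of {len(sample)} rows")
```

Keeping the thresholds as parameters makes it easy to test how sensitive later results are to the $1000 cutoff.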


### Final Inspection of the Cleaned Dataset

After completing the data wrangling steps, we inspect the cleaned dataset to ensure all transformations were applied correctly. We check for any remaining missing values and verify that the dataset is now ready for analysis.

In [85]:

# Final inspection
df.isnull().sum()


id                                0
name                              0
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64
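
Beyond reading the output by eye, the same checks can be expressed as code so they fail loudly if a later change breaks the cleaning steps. This is a minimal sketch; the helper `check_clean` and the toy frames are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame, max_price: int = 1000) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if frame.isnull().values.any():
        problems.append("missing values remain")
    if not ((frame["price"] > 0) & (frame["price"] <= max_price)).all():
        problems.append("price outside (0, max_price]")
    if list(frame.index) != list(range(len(frame))):
        problems.append("index was not reset")
    return problems

# A frame that passes every check
clean = pd.DataFrame({"price": [149, 225, 80], "reviews_per_month": [0.21, 0.38, 0.0]})

# A frame with a zero price, a missing value, and a non-contiguous index
dirty = pd.DataFrame(
    {"price": [149, 0, 80], "reviews_per_month": [0.21, None, 0.1]},
    index=[0, 2, 5],
)

print(check_clean(clean))
print(check_clean(dirty))
```

On the real data, `check_clean(df)` would be expected to return an empty list after the steps above.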

In [87]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,No Review,0.0,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
