# Exploratory Data Analysis on New York City Airbnb Listings 2024 Dataset


## Acknowledgements
This EDA uses the Airbnb listings dataset for New York City, dated 5 January 2024.

The original dataset is available from [Inside Airbnb](https://insideairbnb.com/get-the-data/).

## Objectives
- Exploration and cleaning of the dataset
- Visualization of the relationships among different attributes

## Loading Data

In [None]:
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
# load the dataset from CSV (adjust the path to wherever the file is stored)
fpath = '/content/airbnb_listings_nyc.csv'
nyc = pd.read_csv(fpath)

## Understanding and Cleaning Data

In [None]:
# display the shape of the dataset
nyc.shape

In [None]:
# display the first few rows of the dataset
nyc.head()

In [None]:
# display columns and their data types
nyc.dtypes

In [None]:
# rename 'neighbourhood_group' (the NYC boroughs) to 'town'
nyc = nyc.rename(columns={'neighbourhood_group': 'town'})

In [None]:
# review the columns
nyc.columns

In [None]:
# report the number of duplicate rows, then remove them
print("Duplicate rows:", nyc.duplicated().sum())
nyc = nyc.drop_duplicates()

In [None]:
# find out if there are any null values in the dataset
nyc.isnull().sum()
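
Raw null counts are easier to judge as a share of all rows. The sketch below is one possible way to summarise them; `missing_summary` is a hypothetical helper, and the toy frame uses made-up values standing in for `nyc`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of nulls per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # keep only columns that actually have missing values
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame standing in for `nyc`; the values are invented for illustration.
toy = pd.DataFrame({
    "price": [100.0, 250.0, 80.0, 120.0],
    "reviews_per_month": [1.2, np.nan, np.nan, 0.4],
    "license": [None, "Exempt", None, None],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the call would be `missing_summary(nyc)`.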

In [None]:
# drop the insignificant columns
nyc.drop(['id', 'name', 'host_name', 'last_review'], axis=1, inplace=True)

# review changes
nyc.head()

In [None]:
# replace null values in 'reviews_per_month' with 0
nyc.fillna({'reviews_per_month': 0}, inplace=True)

In [None]:
# replace null values in 'license' with 'No License'
nyc.fillna({'license': 'No License'}, inplace=True)

In [None]:
# review dataset for null values
nyc.isnull().sum()

In [None]:
# get concise information about the dataset
nyc.info()

In [None]:
# examine continuous variables
nyc.describe()

In [None]:
# display unique values of town
nyc.town.unique()

In [None]:
# examine the number of unique values of neighbourhood
len(nyc.neighbourhood.unique())

In [None]:
# display unique values of room_type
nyc.room_type.unique()

In [None]:
# examine the number of unique values of license
len(nyc.license.unique())

## Visualizing Data

In [None]:
# display the number of listings in each town
nyc.town.value_counts()

In [None]:
# count plot for distribution of listings in each town
plt.figure(figsize=(10, 6))
sns.countplot(x='town', data=nyc, palette='magma')
plt.title('Distribution of Listings in Each Town')
plt.xlabel('Town')
plt.ylabel('Number of Listings')
plt.show()

In [None]:
# display the count of each room_type
nyc.room_type.value_counts()

In [None]:
# count plot for distribution of room_type
plt.figure(figsize=(10, 6))
sns.countplot(x='room_type', data=nyc, palette='crest')
plt.title('Distribution of Room Types')
plt.xlabel('Room Type')
plt.ylabel('Number of Listings')
plt.show()

In [None]:
# box plot for price per town
plt.figure(figsize=(10, 6))
sns.boxplot(x='town', y='price', data=nyc, palette='viridis')
plt.title('Price Distribution Per Town')
plt.xlabel('Town')
plt.ylabel('Price')
plt.show()

In [None]:
# display summary stats for price
print("Mean = %s" % (np.mean(nyc.price)))
print("Max = %s" % (np.max(nyc.price)))
print("Min = %s" % (np.min(nyc.price)))
print("Median = %s" % (np.median(nyc.price)))
print("Standard Dev = %s" % (np.std(nyc.price)))

In [None]:
# remove extreme outliers (price >= 20000) and plot again on a log scale,
# which spreads out the bulk of the prices
nyc = nyc.query("price < 20000").copy()  # .copy() avoids SettingWithCopyWarning
nyc['log_price'] = np.log(nyc['price'])  # note: a price of 0 would map to -inf
plt.figure(figsize=(10, 6))
sns.boxplot(x='town', y='log_price', data=nyc, palette='viridis')
plt.title('Price Distribution Per Town (Excluding Outliers)')
plt.xlabel('Town')
plt.ylabel('Log of Price')
plt.tight_layout()
plt.show()
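
The fixed cutoff of 20000 used above is a judgment call. A data-driven alternative, which this notebook does not use, is Tukey's upper fence (Q3 + 1.5 × IQR). The sketch below shows it on made-up prices; `iqr_upper_fence` is a hypothetical helper, and `np.log1p` is swapped in for `np.log` so that a price of 0 stays finite.

```python
import numpy as np
import pandas as pd

def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper Tukey fence for outlier detection."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

# Made-up prices for illustration, including a zero and one extreme value.
toy = pd.DataFrame({"price": [0, 50, 80, 100, 120, 150, 200, 100000]})

fence = iqr_upper_fence(toy["price"])
trimmed = toy[toy["price"] <= fence].copy()

# log1p(x) = log(1 + x), so a zero price maps to 0 instead of -inf
trimmed["log_price"] = np.log1p(trimmed["price"])
print(f"upper fence = {fence}, rows kept = {len(trimmed)}")
```

The 1.5 multiplier is conventional rather than mandatory. For heavily right-skewed prices a larger `k` may be more appropriate, since the standard fence can discard many legitimate high-end listings.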

In [None]:
# bar plot for average price of listings in each town
plt.figure(figsize=(10, 6))
sns.barplot(x='town', y='price', data=nyc, palette='magma', errorbar=None)
plt.title('Average Price of Listings in Each Town')
plt.xlabel('Town')
plt.ylabel('Average Price')
plt.show()
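
The bar plot shows means, which a few expensive listings can pull upward. The sketch below tabulates the mean next to the median per town so the two can be compared. The toy frame and its values are made up for illustration; only the column names `town` and `price` come from the notebook.

```python
import pandas as pd

# Made-up listings standing in for `nyc`.
toy = pd.DataFrame({
    "town": ["Manhattan", "Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "price": [150, 250, 2000, 90, 110, 70],
})

# count, mean and median price per town, most expensive (by mean) first
price_by_town = (
    toy.groupby("town")["price"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)
print(price_by_town)
```

A large gap between mean and median for a town suggests that its average is driven by a small number of high-priced listings.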

In [None]:
# count the listings priced at or above the mean price
(nyc.price >= nyc.price.mean()).sum()

In [None]:
# box plot for correlation between room_type and price
plt.figure(figsize=(10, 6))
sns.boxplot(x='room_type', y='log_price', data=nyc, palette='crest')
plt.title('Price Distribution Per Room Type')
plt.xlabel('Room Type')
plt.ylabel('Log of Price')
plt.show()

In [None]:
# scatter plot for correlation between price and number_of_reviews
plt.figure(figsize=(10, 6))
sns.scatterplot(x='log_price', y='number_of_reviews', data=nyc, alpha=0.6, color='blue')
plt.title('Correlation Between Price and Number of Reviews')
plt.xlabel('Log of Price')
plt.ylabel('Number of Reviews')
plt.show()

In [None]:
# scatter plot for correlation between price and reviews_per_month
plt.figure(figsize=(10, 6))
sns.scatterplot(x='log_price', y='reviews_per_month', data=nyc, alpha=0.6, color='green')
plt.title('Correlation Between Price and Reviews Per Month')
plt.xlabel('Log of Price')
plt.ylabel('Reviews Per Month')
plt.show()

In [None]:
# group the listings based on license: keep 'No License' and 'Exempt' as they are
# and label every other value 'License Permitted'
known = ["No License", "Exempt"]
nyc["has_license"] = nyc["license"].where(nyc["license"].isin(known), "License Permitted")
nyc.has_license.value_counts()

In [None]:
# pie chart for distribution of license among the listings
license_data = nyc.has_license.value_counts()
# take the labels from the index so each slice is labelled with its own category
plt.pie(license_data, labels=license_data.index, autopct='%.2f%%', colors=['skyblue', 'orchid', 'salmon'])
plt.title('Distribution of License')
plt.show()

In [None]:
# box plot for correlation between price and license
plt.figure(figsize=(10, 6))
sns.boxplot(data=nyc, x="has_license", y="log_price")
plt.title('Correlation Between Price and License')
plt.xlabel('License')
plt.ylabel('Log of Price')
plt.show()

In [None]:
# heat map for correlation between numerical attributes
# (text columns are excluded; recent pandas versions raise an error on them)
nyc_map = nyc.iloc[:, 0:14].select_dtypes(include='number')
corr = nyc_map.corr(method='kendall')
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, cmap='GnBu')
plt.title('Correlation Matrix of Numerical Features')
plt.show()
nyc.columns
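
The heat map uses Kendall's rank correlation, which compares the ordering of values rather than their magnitudes. To show what the coefficient measures, here is a minimal sketch of the tie-free variant (tau-a). `kendall_tau_a` is a hypothetical helper and the numbers are made up. Because the sketch ignores tie corrections, it should only be expected to match library results when no values are tied.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:        # both variables move in the same direction
            concordant += 1
        elif sign < 0:      # the variables move in opposite directions
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up values: reviews mostly fall as price rises.
price = [60, 90, 120, 200, 350]
reviews = [40, 55, 30, 12, 5]
tau = kendall_tau_a(price, reviews)
print(tau)
```

Values near +1 or -1 indicate that the two rankings agree or oppose each other almost perfectly, while values near 0 indicate little monotonic association.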

## References
- [Kaggle Dataset 1](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data)
- [Kaggle Dataset 2](https://www.kaggle.com/datasets/vrindakallu/new-york-dataset/data)

## Author
[Sarthak Kumar Das](https://www.linkedin.com/in/sarthak-kumar-das/)