# Airbnb Listings: Price Analysis

## Project overview
This project analyzes the New York City Airbnb listings dataset to identify the factors that affect rental prices. The analysis covers data cleaning, exploratory data analysis (EDA), and visualization.

## Import required libraries and load dataset
This section imports the required libraries and loads the Airbnb dataset.

In [1]:
import pandas as pd

In [2]:
# Silence pandas' SettingWithCopyWarning for chained assignments
pd.options.mode.chained_assignment = None

In [3]:
data = pd.read_csv(r'D:\REGINA\DA DS\airbnb_listings\Airbnb_Open_Data.csv', low_memory=False)

In [4]:
data.head()

,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,...,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,...,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,...,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,...,$124,3.0,0.0,,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and...",
3,1002755,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,...,$74,30.0,270.0,7/5/2019,4.64,4.0,1.0,322.0,,
4,1003689,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,...,$41,10.0,9.0,11/19/2018,0.1,3.0,1.0,289.0,"Please no smoking in the house, porch or on th...",


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102599 entries, 0 to 102598
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              102599 non-null  int64  
 1   NAME                            102349 non-null  object 
 2   host id                         102599 non-null  int64  
 3   host_identity_verified          102310 non-null  object 
 4   host name                       102193 non-null  object 
 5   neighbourhood group             102570 non-null  object 
 6   neighbourhood                   102583 non-null  object 
 7   lat                             102591 non-null  float64
 8   long                            102591 non-null  float64
 9   country                         102067 non-null  object 
 10  country code                    102468 non-null  object 
 11  instant_bookable                102494 non-null  object 
 12  cancellation_pol

## Data cleaning and preprocessing
This section cleans the dataset to prepare it for analysis: it removes duplicate rows, handles missing values, and converts data types.

In [6]:
# Handle duplicates: report the number of fully duplicated rows, then drop them
print(data.duplicated().sum())
data = data.drop_duplicates()
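
To illustrate what this cell does, the sketch below runs the same two calls on a toy DataFrame. The `toy` frame and its values are made up for demonstration and are not part of the Airbnb data. `duplicated()` flags rows that repeat an earlier row in every column, so its sum is the number of rows that `drop_duplicates()` removes.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical in every column.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": ["$10", "$20", "$20", "$30"],
})

n_duplicates = int(toy.duplicated().sum())  # rows that repeat an earlier row
deduplicated = toy.drop_duplicates()        # keeps the first occurrence of each row

print(n_duplicates)
print(len(deduplicated))
```

Note that `drop_duplicates()` keeps the original index labels, which is one reason a later `reset_index()` can be useful.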

In [7]:
# Clean column names
data = data.rename(columns={
    'NAME': 'name',
    'host id': 'host_id',
    'host name': 'host_name',
    'neighbourhood group': 'neighbourhood_group',
    'country code': 'country_code',
    'room type': 'room_type',
    'Construction year': 'construction_year',
    'service fee': 'service_fee',
    'minimum nights': 'minimum_nights',
    'number of reviews': 'number_of_reviews',
    'last review': 'last_review',
    'reviews per month': 'reviews_per_month',
    'review rate number': 'review_rate_number',
    'calculated host listings count': 'calculated_host_listings_count',
    'availability 365': 'availability_365',
})
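
The explicit mapping above follows one pattern: lower-case the name and replace spaces with underscores. As an alternative sketch, not the approach used in this notebook, the same pattern can be applied to every column at once with the string methods on the column index. The `toy` frame below is hypothetical and only carries a few of the original column names.

```python
import pandas as pd

# Hypothetical empty frame with a handful of the original column names.
toy = pd.DataFrame(columns=["NAME", "host id", "Construction year", "availability 365", "lat"])

# Trim surrounding whitespace, lower-case, and replace spaces with underscores.
toy.columns = (
    toy.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(toy.columns))
```

The explicit dictionary is more verbose but documents every rename. The programmatic version is shorter but silently changes any column that matches the pattern.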

In [8]:
# Percentage of missing values per column
data.isna().sum() / len(data) * 100

id                                 0.000000
name                               0.244959
host_id                            0.000000
host_identity_verified             0.283172
host_name                          0.395853
neighbourhood_group                0.028415
neighbourhood                      0.015677
lat                                0.007839
long                               0.007839
country                            0.521272
country_code                       0.128358
instant_bookable                   0.102883
cancellation_policy                0.074467
room_type                          0.000000
construction_year                  0.209685
price                              0.242019
service_fee                        0.267495
minimum_nights                     0.391934
number_of_reviews                  0.179310
last_review                       15.512748
reviews_per_month                 15.499030
review_rate_number                 0.312567
calculated_host_listings_count  
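
As a small sketch of the calculation in this cell, the toy example below computes the same percentages on made-up data. Because `isna()` yields booleans, the column mean equals the missing count divided by the number of rows, so `isna().mean() * 100` is an equivalent form. Sorting the result is an optional addition that puts the most incomplete columns first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": ["$10", None, "$30", "$40"],
    "reviews_per_month": [0.2, np.nan, np.nan, 1.5],
})

# Form used in the notebook: missing count divided by row count.
pct_from_sum = toy.isna().sum() / len(toy) * 100

# Equivalent form based on the mean of the boolean mask, sorted for readability.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

print(missing_pct)
```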

In [9]:
# Handle missing values
# Categorical and text columns: fill gaps with an explicit placeholder
data['host_identity_verified'] = data['host_identity_verified'].fillna('unconfirmed')
data['host_name'] = data['host_name'].fillna('No data')
data['neighbourhood_group'] = data['neighbourhood_group'].fillna('No data')
data['neighbourhood'] = data['neighbourhood'].fillna('No data')
data['country'] = data['country'].fillna('Outside of the United States')
data['country_code'] = data['country_code'].fillna('Outside US')
data['cancellation_policy'] = data['cancellation_policy'].fillna('No data')
data['last_review'] = data['last_review'].fillna('No data')
data['house_rules'] = data['house_rules'].fillna('No data')
# The text placeholder turns construction_year into an object (mixed-type) column
data['construction_year'] = data['construction_year'].fillna('No data')
# lat and long are not filled so that they stay numeric; the few rows with
# missing coordinates are removed by dropna() below
# Count-like columns: treat a missing value as zero
data['availability_365'] = data['availability_365'].fillna(0)
data['calculated_host_listings_count'] = data['calculated_host_listings_count'].fillna(0)
data['number_of_reviews'] = data['number_of_reviews'].fillna(0)
data['review_rate_number'] = data['review_rate_number'].fillna(0)
data['reviews_per_month'] = data['reviews_per_month'].fillna(0)
# minimum_nights: impute with the median
data['minimum_nights'] = data['minimum_nights'].fillna(data['minimum_nights'].median())
# Drop the license column
data = data.drop('license', axis=1)
# Drop the rows that still have a missing value in any remaining column
data = data.dropna()
# Renumber the rows; the old index is kept as a new 'index' column
data = data.reset_index()
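
How a gap in a numeric column is handled affects the column's dtype, which matters for later numeric work such as plotting coordinates. The sketch below uses a made-up `toy` frame (not the Airbnb data) to compare filling a float column with a text placeholder against dropping the incomplete row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column of latitudes with one missing value.
toy = pd.DataFrame({"lat": [40.64749, np.nan, 40.80902]})

# Filling a float column with a string placeholder forces the object dtype,
# so the column now mixes floats and text.
filled_with_text = toy["lat"].fillna("No data")

# Dropping the incomplete row instead keeps the column numeric.
dropped = toy.dropna(subset=["lat"])

print(filled_with_text.dtype)
print(dropped["lat"].dtype)
```

Which option fits depends on the analysis: a placeholder keeps every row, while dropping keeps numeric operations such as `mean()` available without further conversion.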

In [10]:
# Recheck for missing values
data.isna().sum()

index                             0
id                                0
name                              0
host_id                           0
host_identity_verified            0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
lat                               0
long                              0
country                           0
country_code                      0
instant_bookable                  0
cancellation_policy               0
room_type                         0
construction_year                 0
price                             0
service_fee                       0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
review_rate_number                0
calculated_host_listings_count    0
availability_365                  0
house_rules                       0
dtype: int64

In [11]:
# Data type conversion
data['host_identity_verified'] = data['host_identity_verified'].map({'unconfirmed':False, 'verified':True})
data['instant_bookable'] = data['instant_bookable'].astype(bool)
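
Both conversions above depend on the values actually present in the column. The sketch below uses made-up Series, not the Airbnb data, to show two behaviours worth checking: `map()` returns NaN for any label that is missing from the dictionary, and `astype(bool)` on an object column treats every non-empty string as `True`. The `"pending"` label and the string-typed `"TRUE"`/`"FALSE"` flags are assumptions for illustration; if `instant_bookable` already holds Python booleans, `astype(bool)` behaves as intended.

```python
import pandas as pd

# Hypothetical labels; "pending" is a made-up value absent from the mapping.
verified = pd.Series(["verified", "unconfirmed", "pending"], dtype=object)
# Hypothetical flags stored as text rather than as booleans.
flags_as_text = pd.Series(["TRUE", "FALSE"], dtype=object)

# map() yields NaN for labels that are not keys of the dictionary.
mapped = verified.map({"unconfirmed": False, "verified": True})

# astype(bool) uses Python truthiness: any non-empty string becomes True.
naive = flags_as_text.astype(bool)

# An explicit mapping converts text flags to the intended booleans.
explicit = flags_as_text.map({"TRUE": True, "FALSE": False})

print(mapped.isna().sum())
print(naive.tolist())
print(explicit.tolist())
```

A quick `value_counts(dropna=False)` on the column before converting shows which of these cases applies.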