dataset link : https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/version/3

### New York City Airbnb Dataset 

#### Context
Since 2008, guests and hosts have used Airbnb to expand on traveling possibilities and present more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.

#### Content
This data file includes all needed information to find out more about hosts, geographical availability, necessary metrics to make predictions and draw conclusions.

#### Acknowledgements
This public dataset is part of Airbnb, and the original source can be found on this website.

#### Inspiration
What can we learn about different hosts and areas?
What can we learn from predictions? (ex: locations, prices, reviews, etc)
Which hosts are the busiest and why?
Is there any noticeable difference of traffic among different areas and what could be the reason for it?

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("AB_NYC_2019.csv")

In [3]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [4]:
df.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

In [6]:
# drop duplicates if any

df.duplicated().sum()
df.drop_duplicates(inplace=True)
df.shape

(48895, 16)

In [7]:
df.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

In [8]:
# convert to datetime

df["last_review"] = pd.to_datetime(df.last_review)

In [9]:
df.fillna(df.mean(),inplace=True)
df.last_review.fillna(method="ffill", inplace=True)

  df.fillna(df.mean(),inplace=True)
  df.fillna(df.mean(),inplace=True)


In [10]:
df.dtypes

id                                         int64
name                                      object
host_id                                    int64
host_name                                 object
neighbourhood_group                       object
neighbourhood                             object
latitude                                 float64
longitude                                float64
room_type                                 object
price                                      int64
minimum_nights                             int64
number_of_reviews                          int64
last_review                       datetime64[ns]
reviews_per_month                        float64
calculated_host_listings_count             int64
availability_365                           int64
dtype: object

In [11]:
# drop unnecessary columns
df.drop(['name','id','host_name','last_review'], axis=1, inplace=True)


In [12]:
df.isnull().sum()

host_id                           0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

In [13]:
df.shape

(48895, 12)

In [14]:

df.head()

Unnamed: 0,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,2787,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,2845,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
2,4632,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,1.373221,1,365
3,4869,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
4,7192,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0.1,1,0


## One Hot Encoding

In [15]:
one_hot_roomtype = pd.get_dummies(df.room_type).add_prefix('room_type-')
one_hot_neighbourhood = pd.get_dummies(df.neighbourhood).add_prefix('neighbourhood-')
one_hot_neighbourhood_group = pd.get_dummies(df.neighbourhood_group).add_prefix('neighbourhoodGroup-')


In [16]:
df = df.drop(['neighbourhood_group','neighbourhood','room_type','host_id'],axis = 1)
df

Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,40.64749,-73.97237,149,1,9,0.210000,6,365
1,40.75362,-73.98377,225,1,45,0.380000,2,355
2,40.80902,-73.94190,150,3,0,1.373221,1,365
3,40.68514,-73.95976,89,1,270,4.640000,1,194
4,40.79851,-73.94399,80,10,9,0.100000,1,0
...,...,...,...,...,...,...,...,...
48890,40.67853,-73.94995,70,2,0,1.373221,2,9
48891,40.70184,-73.93317,40,4,0,1.373221,2,36
48892,40.81475,-73.94867,115,10,0,1.373221,1,27
48893,40.75751,-73.99112,55,1,0,1.373221,6,2


In [17]:
df = pd.concat([df,one_hot_neighbourhood,one_hot_roomtype,one_hot_neighbourhood],axis='columns')

In [18]:
df.head()

Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood-Allerton,neighbourhood-Arden Heights,...,neighbourhood-Westerleigh,neighbourhood-Whitestone,neighbourhood-Williamsbridge,neighbourhood-Williamsburg,neighbourhood-Willowbrook,neighbourhood-Windsor Terrace,neighbourhood-Woodhaven,neighbourhood-Woodlawn,neighbourhood-Woodrow,neighbourhood-Woodside
0,40.64749,-73.97237,149,1,9,0.21,6,365,0,0,...,0,0,0,0,0,0,0,0,0,0
1,40.75362,-73.98377,225,1,45,0.38,2,355,0,0,...,0,0,0,0,0,0,0,0,0,0
2,40.80902,-73.9419,150,3,0,1.373221,1,365,0,0,...,0,0,0,0,0,0,0,0,0,0
3,40.68514,-73.95976,89,1,270,4.64,1,194,0,0,...,0,0,0,0,0,0,0,0,0,0
4,40.79851,-73.94399,80,10,9,0.1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### DATA NORMALIZATION

### Normalizing data to ensure all data are in same range while plotting data

In [22]:
# copy the data
df_min_max_scaled = df.copy()
df_min_max_scaled.drop(['price'], axis = 1,inplace=True)
# apply normalization techniques
for column in df_min_max_scaled.columns:
   
        df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min()) 
      
# view normalized data
display(df_min_max_scaled)

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood-Allerton,neighbourhood-Arden Heights,neighbourhood-Arrochar,...,neighbourhood-Westerleigh,neighbourhood-Whitestone,neighbourhood-Williamsbridge,neighbourhood-Williamsburg,neighbourhood-Willowbrook,neighbourhood-Windsor Terrace,neighbourhood-Woodhaven,neighbourhood-Woodlawn,neighbourhood-Woodrow,neighbourhood-Woodside
0,0.357393,0.511921,0.000000,0.014308,0.003419,0.015337,1.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.614199,0.490469,0.000000,0.071542,0.006326,0.003067,0.972603,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.748252,0.569257,0.001601,0.000000,0.023307,0.000000,1.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.448496,0.535649,0.000000,0.429253,0.079159,0.000000,0.531507,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.722820,0.565324,0.007206,0.014308,0.001539,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,0.432502,0.554109,0.000801,0.000000,0.023307,0.003067,0.024658,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48891,0.488906,0.585684,0.002402,0.000000,0.023307,0.003067,0.098630,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48892,0.762117,0.556517,0.007206,0.000000,0.023307,0.000000,0.073973,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48893,0.623612,0.476639,0.000000,0.000000,0.023307,0.015337,0.005479,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
