# Airbnb NYC 2019


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

Load data and perform first exploration

In [None]:
df_airbnb = pd.read_csv('./Data/AB_NYC_2019.csv')

In [None]:
df_airbnb.head()

In [None]:
print('Number of entries: {} \nNumber of features: {}'.format(df_airbnb.shape[0], df_airbnb.shape[1]))

A first look suggests that columns such as **id**, **name**, **host_id** and **host_name** can be discarded from the analysis.

The column **name** could later be used to derive extra features (for example, keywords appearing in the listing name), but it will be left out of the first simple models.
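
As a rough illustration of the keyword idea, the sketch below adds one 0/1 flag per keyword found in the listing name. The toy listings, the keyword list and the helper `add_keyword_flags` are all hypothetical and only meant to show the mechanics; which keywords are actually informative would have to be checked on the real data.

```python
import pandas as pd

# Toy listings standing in for df_airbnb; the names are made up for illustration.
toy = pd.DataFrame({
    'name': ['Cozy studio near Central Park', 'Luxury loft in SoHo', None,
             'COZY room, great view'],
})

# Hypothetical keyword list (an assumption, not derived from the data).
keywords = ['cozy', 'luxury', 'loft']

def add_keyword_flags(df, keywords, column='name'):
    """Return a copy of df with one 0/1 column per keyword found in `column`."""
    out = df.copy()
    # Missing names become empty strings so they never match a keyword.
    names = out[column].fillna('').str.lower()
    for kw in keywords:
        out['name_has_' + kw] = names.str.contains(kw, regex=False).astype(int)
    return out

toy_flags = add_keyword_flags(toy, keywords)
print(toy_flags)
```

Lower-casing before matching keeps the flags case-insensitive, and filling missing names first avoids NaN values in the new columns.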

In [None]:
df_airbnb.isnull().sum()

We have missing data for the following features:
* name --> not a big deal, we can ignore it for the moment
* host_name --> not a big deal, we will ignore this variable
* last_review --> comes from entries without any review
* reviews_per_month --> comes from entries without any review; we will replace it by 0
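
The claim that the missing review fields come from listings without reviews can be checked directly. The sketch below uses a small made-up frame with the same column names as the dataset; on the real data the same comparison would be run on `df_airbnb`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_airbnb; the values are made up.
toy = pd.DataFrame({
    'number_of_reviews': [0, 12, 0, 3],
    'last_review': [None, '2019-05-01', None, '2018-11-20'],
    'reviews_per_month': [np.nan, 0.8, np.nan, 0.1],
})

# Listings that never received a review.
no_reviews = toy['number_of_reviews'] == 0

# True only if the missing values line up exactly with the zero-review rows.
missing_matches = bool(
    (toy['reviews_per_month'].isnull() == no_reviews).all()
    and (toy['last_review'].isnull() == no_reviews).all()
)
print('Missing review fields match zero-review listings:', missing_matches)
```

If the check holds, filling `reviews_per_month` with 0 is a natural choice, since a listing with no reviews has zero reviews per month.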

In [None]:
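# replace NaN by 0 for column reviews_per_month
# (assign the result back instead of calling fillna in place on an attribute)
df_airbnb['reviews_per_month'] = df_airbnb['reviews_per_month'].fillna(0)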

# drop features id, host_id and host_name
df_airbnb.drop(['id', 'host_id', 'host_name'], axis=1, inplace=True)

## First data exploration

# describe sth here! 

Explore cases where there may be an error with the price. 

In [None]:
df_airbnb[df_airbnb.price > 1000].sort_values('price')

We will drop the listings priced below \\$10, which covers the \\$0 entries, since they probably correspond to errors.

In [None]:
df_airbnb.drop(df_airbnb[df_airbnb.price < 10].index, axis=0, inplace=True)

In [None]:
plt.figure(figsize=(10,5))
df_airbnb['price'].plot(kind='hist', bins=80, logx=True, title='Price distribution')
plt.xlabel('Price $')
plt.show()

In [None]:
df_airbnb['price'].describe()

In [None]:
def print_pct_price(price):
    # share of listings priced above the given threshold, in percent
    pct = 100 * (df_airbnb.price > price).mean()
    print('Apartments with price over ${}: {:.2f}%'.format(price, pct))

print_pct_price(200)
print_pct_price(300)
print_pct_price(500)
print_pct_price(1000)

In [None]:
plt.figure(figsize=(10,5))
df_airbnb[df_airbnb['price'] < 300].price.plot(kind='hist', bins=100, title='Price distribution')
plt.xlabel('Price $')
plt.show()

Insights: 
* Mean price is \\$153 and median price is \\$106. The gap reflects large outliers that pull the mean up.
* Min price is \\$10 while max price is \\$10k.
* Around 95% of the apartments have a price lower than \\$300. Of these, many are concentrated below \\$100. We can also observe peaks at round prices (100, 150, etc.).
* Apartments over \\$1k represent less than 0.5\% of the cases.
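
The figures in these insights can be gathered in one place. The sketch below computes the mean, the median and the share of listings under or over a threshold; the prices are made up to show the calculation, so the numbers it prints are not those of the dataset.

```python
import pandas as pd

# Made-up prices standing in for df_airbnb['price'] (not the real data).
prices = pd.Series([40, 60, 75, 90, 100, 120, 150, 200, 350, 5000], name='price')

summary = {
    'mean': prices.mean(),
    'median': prices.median(),
    # The mean of a boolean mask is the fraction of True values.
    'pct_under_300': 100 * (prices < 300).mean(),
    'pct_over_1000': 100 * (prices > 1000).mean(),
}

for key, value in summary.items():
    print('{}: {:.2f}'.format(key, value))
```

A mean well above the median, as in this toy series, is the same signature of a few very expensive listings described above.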

In [None]:
f, ax = plt.subplots(2,2, figsize=(20,12))
ax = ax.flatten()

df_airbnb.plot(kind='scatter', x='calculated_host_listings_count', y='price', ax=ax[0],
               title='Price vs Number of properties per host')
df_airbnb.plot(kind='scatter', x='number_of_reviews', y='price', ax=ax[1],
               title='Price vs Number of reviews per property')
df_airbnb.plot(kind='scatter', x='minimum_nights', y='price', ax=ax[2],
               title='Price vs Number of min nights')
df_airbnb.plot(kind='scatter', x='availability_365', y='price', ax=ax[3],
               title='Price vs Availability')
plt.show()

In [None]:
f, ax = plt.subplots(1, 2, figsize=(20, 5))

# restrict to listings under $300 so the violins are not stretched by outliers
df_cheap = df_airbnb[df_airbnb.price < 300]

# titles are set on the axes rather than passed to violinplot
sns.violinplot(data=df_cheap, x='neighbourhood_group', y='price', ax=ax[0])
ax[0].set_title('Price vs Neighbourhood')
sns.violinplot(data=df_cheap, x='room_type', y='price', ax=ax[1])
ax[1].set_title('Price vs Type of room')
plt.show()

## TODO THINGS:

* First models based on numerical and categorical data --> drop name (host_name was already dropped above)
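
One possible way to prepare numerical and categorical features for such a first model is one-hot encoding with `pd.get_dummies`. The sketch below is only an assumption about how this step could look: it uses a made-up frame with a few of the dataset's column names, and the choice of columns is illustrative rather than a decision taken in this notebook.

```python
import pandas as pd

# Toy rows standing in for df_airbnb; the values are made up.
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Brooklyn', 'Queens', 'Brooklyn'],
    'room_type': ['Entire home/apt', 'Private room', 'Private room', 'Shared room'],
    'minimum_nights': [1, 3, 2, 30],
    'number_of_reviews': [10, 0, 45, 2],
    'price': [225, 89, 70, 40],
})

# Illustrative column choice (an assumption, not the notebook's final selection).
categorical = ['neighbourhood_group', 'room_type']
numerical = ['minimum_nights', 'number_of_reviews']

# One 0/1 column per category value; numerical columns pass through unchanged.
X = pd.get_dummies(toy[categorical + numerical], columns=categorical, dtype=int)
y = toy['price']

print(X.head())
```

The resulting `X` and `y` can then be passed to whichever model is tried first.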


In [None]:
df_airbnb.dtypes