# Airbnb Data Analysis
### How to make your home the best on the platform?
---
This notebook is part of Udacity's Data Scientist Nanodegree program.

In this project, I will be investigating Airbnb data and answering relevant questions using the **CRISP-DM** process:

1. Business Understanding
2. Data Understanding
3. Data Preparation 
4. Data Modeling
5. Results & Evaluation

# 1. Business Understanding

1. Do more expensive houses have higher review scores?
2. What are the main features that influence the review scores? What about the prices?
3. Which city has the best listings? Which one has the more expensive ones? Is there a connection between the two?

# 2. Data Understanding

### Importing Necessary Libraries

In [93]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler, OneHotEncoder, normalize
from sklearn.model_selection import train_test_split, cross_val_score

pd.set_option('display.max_columns', 500)
%matplotlib inline

### Reading Data
**Boston Airbnb Data**

In [2]:
df_bos_cal = pd.read_csv('BostonData/calendar.csv')
df_bos_lis = pd.read_csv('BostonData/listings.csv')
df_bos_rev = pd.read_csv('BostonData/reviews.csv')

**Seattle Airbnb Data**

In [3]:
df_sea_cal = pd.read_csv('SeattleData/calendar.csv')
df_sea_lis = pd.read_csv('SeattleData/listings.csv')
df_sea_rev = pd.read_csv('SeattleData/reviews.csv')

In [4]:
display(df_bos_cal.head(), df_sea_cal.head())


Unnamed: 0,listing_id,date,available,price
0,12147973,2017-09-05,f,
1,12147973,2017-09-04,f,
2,12147973,2017-09-03,f,
3,12147973,2017-09-02,f,
4,12147973,2017-09-01,f,


Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,t,$85.00
1,241032,2016-01-05,t,$85.00
2,241032,2016-01-06,f,
3,241032,2016-01-07,f,
4,241032,2016-01-08,f,


In [5]:
display(df_bos_lis.head(), df_sea_lis.head())

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,10.0,f,,WASHINGTON,f,strict,t,t,6,1.48
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,...,10.0,f,,WASHINGTON,f,strict,f,f,2,1.15
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,...,,f,,WASHINGTON,f,flexible,f,f,1,
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,...,9.0,f,,WASHINGTON,f,strict,f,f,1,0.89


In [6]:
display(df_bos_rev.head(), df_sea_rev.head())

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,1178162,4724140,2013-05-21,4298113,Olivier,My stay at islam's place was really cool! Good...
1,1178162,4869189,2013-05-29,6452964,Charlotte,Great location for both airport and city - gre...
2,1178162,5003196,2013-06-06,6449554,Sebastian,We really enjoyed our stay at Islams house. Fr...
3,1178162,5150351,2013-06-15,2215611,Marine,The room was nice and clean and so were the co...
4,1178162,5171140,2013-06-16,6848427,Andrew,Great location. Just 5 mins walk from the Airp...


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...


# 3. Data Preparation

- Since both cities have similar datasets, I'll create a pipeline to clean both of them the same way

**Calendar Dataset**
- My goal with this dataset is to get the price for each house
    - Since each house has several prices (depending on the date), I'll get the average price for each one
    - This will help me answer questions 1 and 2

In [7]:
# making a copy of the original dataset
df_bos_cal_c = df_bos_cal.copy()

# displaying the info 
df_bos_cal_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1308890 entries, 0 to 1308889
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   listing_id  1308890 non-null  int64 
 1   date        1308890 non-null  object
 2   available   1308890 non-null  object
 3   price       643037 non-null   object
dtypes: int64(1), object(3)
memory usage: 39.9+ MB


- As we can see, only 643,037 of the 1,308,890 rows have a value in the price column. Since the price is what I need from this dataset, I'll drop the rows where it is missing

In [8]:
# dropping null values 
df_bos_cal_c.dropna(inplace=True)

# checking null values 
df_bos_cal_c.isnull().sum()

listing_id    0
date          0
available     0
price         0
dtype: int64
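
Calling `dropna()` without arguments removes a row when any column is missing. Here only `price` has nulls, so the result is the same, but restricting the check to `price` makes the intent explicit. A minimal sketch on a toy frame (the values below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# toy calendar frame; values are illustrative, not from the real dataset
toy_cal = pd.DataFrame({
    'listing_id': [1, 1, 2, 2],
    'date': ['2017-09-01', '2017-09-02', '2017-09-01', '2017-09-02'],
    'available': ['f', 't', 't', 'f'],
    'price': [None, '$65.00', '$1,250.00', None],
})

# drop only the rows where price is missing; other columns are not checked
toy_cal_priced = toy_cal.dropna(subset=['price'])

print(toy_cal_priced)
```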

The datatypes are also wrong:
- The date column should be datetime64
- The price column should be float64

**For that, first I'll need to format the price column to only contain numbers:**

In [9]:
# removing the $ sign and the ',' thousands separator
# (replacing ',' with '.' would turn '$1,350.00' into 1.35 instead of 1350)
df_bos_cal_c.price = df_bos_cal_c.price.apply(lambda x: x.replace('$', ''))
df_bos_cal_c.price = df_bos_cal_c.price.apply(lambda x: x.replace(',', ''))


# removing the trailing '.00' (assumes every price ends in '.00')
df_bos_cal_c.price = df_bos_cal_c.price.apply(lambda x: x[:-3])

# checking column
df_bos_cal_c.price

365        65
366        65
367        65
368        75
369        75
           ..
1308875    62
1308876    62
1308877    62
1308878    62
1308879    62
Name: price, Length: 643037, dtype: object
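
As an alternative sketch, the same formatting can be done with pandas' vectorized string methods. This version removes the `$` sign and any thousands separator (as in `$1,250.00`) in one step and needs no slicing. The sample strings are illustrative:

```python
import pandas as pd

# illustrative price strings in the same format as the calendar data
raw_prices = pd.Series(['$65.00', '$75.00', '$1,250.00'])

# strip the dollar sign and the thousands separator, then convert to numbers
clean_prices = pd.to_numeric(raw_prices.str.replace(r'[$,]', '', regex=True))

print(clean_prices.tolist())
```

Since `pd.to_numeric` parses the decimal part directly, the result is already a float column and no separate `astype` call is needed.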

**Now I can just fix the datatypes**

In [10]:
# fixing price datatype to float
df_bos_cal_c.price = df_bos_cal_c.price.astype('float64')

# fixing date datatype to datetime
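# (pd.to_datetime is used because recent pandas versions reject the unit-less 'datetime64' dtype in astype)
df_bos_cal_c.date = pd.to_datetime(df_bos_cal_c.date)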

**Dropping the available column since it's not useful**

In [11]:
# dropping column
df_bos_cal_c.drop(columns=['available'], inplace=True)

# checking dataset
df_bos_cal_c.head()

Unnamed: 0,listing_id,date,price
365,3075044,2017-08-22,65.0
366,3075044,2017-08-21,65.0
367,3075044,2017-08-20,65.0
368,3075044,2017-08-19,75.0
369,3075044,2017-08-18,75.0


**Getting the average price for each house**

In [12]:
df_bos_cal_c.groupby('listing_id')['price'].mean().to_frame().reset_index()

Unnamed: 0,listing_id,price
0,3353,35.204819
1,5506,147.267442
2,6695,197.407407
3,6976,65.000000
4,8792,154.000000
...,...,...
2901,14924831,169.515152
2902,14928000,55.000000
2903,14928333,105.380531
2904,14933380,49.000000


Looks great. Now I have the average price for each listing!
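
Since the Seattle calendar has the same columns, the steps above could be wrapped in one function and applied to both cities. The sketch below is one possible way to do it; the function name `average_price_per_listing` is hypothetical and the toy values are made up:

```python
import pandas as pd

def average_price_per_listing(calendar):
    """Return one row per listing_id with its mean price (hypothetical helper)."""
    # keep only the rows that have a price
    priced = calendar.dropna(subset=['price']).copy()
    # strip '$' and thousands separators, then convert to numbers
    priced['price'] = pd.to_numeric(
        priced['price'].str.replace(r'[$,]', '', regex=True)
    )
    # average the daily prices of each listing
    return priced.groupby('listing_id', as_index=False)['price'].mean()

# toy calendar; values are illustrative
toy_cal = pd.DataFrame({
    'listing_id': [1, 1, 2, 2],
    'price': ['$60.00', '$70.00', None, '$1,000.00'],
})
avg_price = average_price_per_listing(toy_cal)
print(avg_price)
```

With the real data it would be called as `average_price_per_listing(df_bos_cal)` and `average_price_per_listing(df_sea_cal)`.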

**Reviews Dataset**
- My objective with this dataset is to get the number of reviews for each listing
    - This dataset will also help me with questions 1 and 2

In [13]:
# checking first rows of the dataset
df_bos_rev.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,1178162,4724140,2013-05-21,4298113,Olivier,My stay at islam's place was really cool! Good...
1,1178162,4869189,2013-05-29,6452964,Charlotte,Great location for both airport and city - gre...
2,1178162,5003196,2013-06-06,6449554,Sebastian,We really enjoyed our stay at Islams house. Fr...
3,1178162,5150351,2013-06-15,2215611,Marine,The room was nice and clean and so were the co...
4,1178162,5171140,2013-06-16,6848427,Andrew,Great location. Just 5 mins walk from the Airp...


In [14]:
# making a copy of the original dataset
df_bos_rev_c = df_bos_rev.copy()

# checking the dataset info
df_bos_rev_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68275 entries, 0 to 68274
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   listing_id     68275 non-null  int64 
 1   id             68275 non-null  int64 
 2   date           68275 non-null  object
 3   reviewer_id    68275 non-null  int64 
 4   reviewer_name  68275 non-null  object
 5   comments       68222 non-null  object
dtypes: int64(3), object(3)
memory usage: 3.1+ MB


**Getting the number of reviews for each house**

In [58]:
df_bos_rev_c.groupby(['listing_id'])['id'].count().to_frame().reset_index().rename(columns={'id':'reviews'})

Unnamed: 0,listing_id,reviews
0,3353,34
1,5506,36
2,6695,47
3,6976,41
4,8792,18
...,...,...
2824,14813006,1
2825,14823724,1
2826,14842237,1
2827,14843050,2
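
Listings without any review do not appear in a `groupby` count. If this table is later combined with the average prices (an assumption about the next steps, not something shown above), a left join is one way to keep those listings and give them a count of 0. A minimal sketch with made-up values:

```python
import pandas as pd

# toy inputs; all values are illustrative
reviews = pd.DataFrame({'listing_id': [1, 1, 2], 'id': [10, 11, 12]})
avg_price = pd.DataFrame({'listing_id': [1, 2, 3], 'price': [65.0, 150.0, 90.0]})

# one row per listing with its number of reviews
review_counts = reviews.groupby('listing_id').size().reset_index(name='reviews')

# left join keeps listings that have a price but no reviews
price_reviews = avg_price.merge(review_counts, on='listing_id', how='left')
price_reviews['reviews'] = price_reviews['reviews'].fillna(0).astype(int)

print(price_reviews)
```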


**Listings Dataset**
- My objective with this dataset is to get the main features of each listing and its review status
    - I will use this dataset to answer all 3 questions

In [89]:
# creating copy of the dataset
df_bos_lis_c = df_bos_lis.copy()

In [90]:
df_bos_lis_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3585 entries, 0 to 3584
Data columns (total 95 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3585 non-null   int64  
 1   listing_url                       3585 non-null   object 
 2   scrape_id                         3585 non-null   int64  
 3   last_scraped                      3585 non-null   object 
 4   name                              3585 non-null   object 
 5   summary                           3442 non-null   object 
 6   space                             2528 non-null   object 
 7   description                       3585 non-null   object 
 8   experiences_offered               3585 non-null   object 
 9   neighborhood_overview             2170 non-null   object 
 10  notes                             1610 non-null   object 
 11  transit                           2295 non-null   object 
 12  access

**Getting only necessary columns**

In [96]:
# selecting only necessary columns
df_bos_lis_c = df_bos_lis_c[['id', 'market', 'host_id', 'host_is_superhost', 'neighbourhood_cleansed', 
                             'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 
                             'bed_type', 'amenities', 'price', 'weekly_price', 'monthly_price', 'cleaning_fee', 
                             'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
                             'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 
                             'review_scores_location', 'review_scores_value']]

df_bos_lis_c.head()

Unnamed: 0,id,market,host_id,host_is_superhost,neighbourhood_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,price,weekly_price,monthly_price,cleaning_fee,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value
0,12147973,Boston,31303940,f,Roslindale,House,Entire home/apt,4,1.5,2.0,3.0,Real Bed,"{TV,""Wireless Internet"",Kitchen,""Free Parking ...",$250.00,,,$35.00,0,,,,,,,
1,3075044,Boston,2572247,f,Roslindale,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,"{TV,Internet,""Wireless Internet"",""Air Conditio...",$65.00,$400.00,,$10.00,36,94.0,10.0,9.0,10.0,10.0,9.0,9.0
2,6976,Boston,16701,t,Roslindale,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",""Wireless Internet"",""Air Condit...",$65.00,$395.00,"$1,350.00",,41,98.0,10.0,9.0,10.0,10.0,9.0,10.0
3,1436513,Boston,6031442,f,Roslindale,House,Private room,4,1.0,1.0,2.0,Real Bed,"{TV,Internet,""Wireless Internet"",""Air Conditio...",$75.00,,,$50.00,1,100.0,10.0,10.0,10.0,10.0,10.0,10.0
4,7651065,Boston,15396970,t,Roslindale,House,Private room,2,1.5,1.0,2.0,Real Bed,"{Internet,""Wireless Internet"",""Air Conditionin...",$79.00,,,$15.00,29,99.0,10.0,10.0,10.0,10.0,9.0,10.0


**Checking null values**

In [97]:
df_bos_lis_c.isnull().sum()

id                                0
market                           14
host_id                           0
host_is_superhost                 0
neighbourhood_cleansed            0
property_type                     3
room_type                         0
accommodates                      0
bathrooms                        14
bedrooms                         10
beds                              9
bed_type                          0
amenities                         0
price                             0
weekly_price                   2693
monthly_price                  2697
cleaning_fee                   1107
number_of_reviews                 0
review_scores_rating            813
review_scores_accuracy          823
review_scores_cleanliness       818
review_scores_checkin           820
review_scores_communication     818
review_scores_location          822
review_scores_value             821
dtype: int64
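
Raw counts are easier to judge as a share of the 3,585 listings: `weekly_price` and `monthly_price` are missing in roughly three quarters of the rows, while the review score columns are missing in a little under a quarter. A small sketch of how that share could be computed, shown on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# toy listings frame; values are illustrative
toy_lis = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'weekly_price': [np.nan, '$400.00', np.nan, np.nan],
    'review_scores_rating': [94.0, np.nan, 98.0, 100.0],
})

# fraction of missing values per column, largest first
null_share = toy_lis.isnull().mean().sort_values(ascending=False)

print(null_share)
```

On the real data the equivalent call would be `df_bos_lis_c.isnull().mean()`.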