# Airbnb Paris
## by Mathieu Rella

# I. Business Understanding

We will explore Airbnb Paris data to try to answer the following questions:

- Where is it good to rent on Airbnb in Paris?
- Which season is the most profitable for hosts?
- What do guests really think of Paris listings?
- Can we predict the price of a listing?

# II. Data Understanding

In [83]:
# import all packages and set plots to be embedded inline
import json
import math
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly
import plotly.express as px
import plotly.graph_objects as go
import qgrid
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot

# scikit-learn modules
import sklearn.metrics as mtr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder, OneHotEncoder, StandardScaler

# render plotly figures inside the notebook
init_notebook_mode(connected=True)

%matplotlib inline

# suppress warnings from final output
warnings.simplefilter("ignore")

In [84]:
# load each dataset into its own pandas dataframe

df_list = pd.read_csv('Data/listings.csv')
df_rev = pd.read_csv('Data/Reviews.csv')
df_cal = pd.read_csv('Data/calendar.csv')

## 1. Listings Dataframe

In [85]:
df_list.head(3)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2577,https://www.airbnb.com/rooms/2577,20200911161645,2020-09-13,Loft for 4 by Canal Saint Martin,"100 m2 loft (1100 sq feet) with high ceiling, ...",,https://a0.muscache.com/pictures/09da057c-0120...,2827,https://www.airbnb.com/users/show/2827,...,10.0,10.0,10.0,,t,1,1,0,0,0.05
1,3109,https://www.airbnb.com/rooms/3109,20200911161645,2020-09-13,zen and calm,<b>The space</b><br />I bedroom appartment in ...,Good restaurants<br />very close the Montparna...,https://a0.muscache.com/pictures/baeae9e2-cd53...,3631,https://www.airbnb.com/users/show/3631,...,10.0,10.0,10.0,,f,1,1,0,0,0.2
2,4886,https://www.airbnb.com/rooms/4886,20200911161645,2020-09-13,Country-Style Studio Hip Area FREE CRUISE & WIFI,Bright and Cozy Studio Apartment for 2 Guests...,2 Free River Cruise Tix with your booking ! M...,https://a0.muscache.com/pictures/395578/e7f46d...,6792,https://www.airbnb.com/users/show/6792,...,9.0,10.0,9.0,7511101570436.0,f,8,8,0,0,0.19


In [86]:
df_list.shape

(67565, 74)

The Paris Airbnb listings dataframe is large, with 74 columns and 67,565 rows. Each row seems to correspond to one listing in Paris.

#### B. Checking Null Values

In [87]:
set(df_list.columns[df_list.isnull().mean()==0])

{'accommodates',
 'amenities',
 'availability_30',
 'availability_365',
 'availability_60',
 'availability_90',
 'calculated_host_listings_count',
 'calculated_host_listings_count_entire_homes',
 'calculated_host_listings_count_private_rooms',
 'calculated_host_listings_count_shared_rooms',
 'calendar_last_scraped',
 'has_availability',
 'host_id',
 'host_url',
 'host_verifications',
 'id',
 'instant_bookable',
 'last_scraped',
 'latitude',
 'listing_url',
 'longitude',
 'maximum_nights',
 'minimum_nights',
 'neighbourhood_cleansed',
 'number_of_reviews',
 'number_of_reviews_l30d',
 'number_of_reviews_ltm',
 'price',
 'property_type',
 'room_type',
 'scrape_id'}

In [88]:
set(df_list.columns[df_list.isnull().mean()==1])

{'bathrooms', 'calendar_updated', 'neighbourhood_group_cleansed'}

In [89]:
df_list.isnull().sum()

id                                                  0
listing_url                                         0
scrape_id                                           0
last_scraped                                        0
name                                               64
description                                      1374
neighborhood_overview                           26467
picture_url                                         1
host_id                                             0
host_url                                            0
host_name                                          10
host_since                                         10
host_location                                     155
host_about                                      32494
host_response_time                              39784
host_response_rate                              39784
host_acceptance_rate                            25126
host_is_superhost                                  10
host_thumbnail_url          

In [90]:
df_list_null = df_list.isnull().sum()
print(df_list_null[df_list_null > 0].count())

43


43 of the 74 columns have missing values.
3 of them (`bathrooms`, `calendar_updated` and `neighbourhood_group_cleansed`) are completely empty and need to be removed.
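
As a minimal sketch of how the completely empty columns could be removed, the snippet below uses a small made-up frame instead of the real `df_list`. The helper name `drop_empty_columns` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_empty_columns(df):
    """Return a copy of df without the columns that contain only nulls."""
    return df.dropna(axis=1, how="all")


# Toy stand-in for df_list (values are illustrative only)
toy_list = pd.DataFrame({
    "id": [2577, 3109, 4886],
    "bathrooms": [np.nan, np.nan, np.nan],  # completely empty, will be dropped
    "name": ["Loft", None, "Studio"],       # partially empty, will be kept
})

cleaned = drop_empty_columns(toy_list)
print(list(cleaned.columns))
```

On the real data the same call would presumably be `df_list = drop_empty_columns(df_list)`; columns with only some missing values are left untouched and would need a separate decision.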

## 2. Review Dataframe

In [111]:
df_rev.head(3)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2577,366217274,2019-01-02,28047930,Kate,Beautiful apartment in a really handy location...
1,3109,123127969,2016-12-27,12389804,Sophie,The host canceled this reservation the day bef...
2,3109,123274144,2016-12-28,67553494,Tom'S,The host canceled this reservation 2 days befo...


In [112]:
df_rev.shape

(1308133, 6)

In [113]:
df_rev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1308133 entries, 0 to 1308132
Data columns (total 6 columns):
listing_id       1308133 non-null int64
id               1308133 non-null int64
date             1308133 non-null object
reviewer_id      1308133 non-null int64
reviewer_name    1308133 non-null object
comments         1307602 non-null object
dtypes: int64(3), object(3)
memory usage: 59.9+ MB


Given the size of the Paris Airbnb listings dataframe, it seems normal to see this many reviews: 1,308,133 rows and 6 columns.

#### B. Checking Null Values

In [122]:
set(df_rev.columns[df_rev.isnull().mean()==0])

{'comments',
 'date',
 'id',
 'listing_id',
 'neighbourhood',
 'polarity_sentiment',
 'reviewer_id',
 'reviewer_name',
 'textBlob_polarity_analysis'}

In [123]:
df_rev.isnull().sum()

listing_id                    0
id                            0
date                          0
reviewer_id                   0
reviewer_name                 0
comments                      0
polarity_sentiment            0
textBlob_polarity_analysis    0
neighbourhood                 0
dtype: int64

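The `comments` feature is the only one with missing values: 531 (1,308,133 rows minus the 1,307,602 non-null comments reported by `df_rev.info()`).
This is negligible given the total of over 1.3 million rows.
The null counts printed just above show 0 for every column and list extra columns, so they appear to come from a later run of the notebook, after cleaning.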
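
To put a number on how small this share is (531 out of 1,308,133 is about 0.04%), the fraction of nulls per column can be computed directly. The sketch below runs on a made-up review frame; `missing_share` is a hypothetical helper name, and dropping the affected rows is only one possible way to handle them.

```python
import pandas as pd


def missing_share(df):
    """Fraction of null values per column, keeping only columns with at least one null."""
    share = df.isnull().mean()
    return share[share > 0].sort_values(ascending=False)


# Toy stand-in for df_rev (values are illustrative only)
toy_rev = pd.DataFrame({
    "listing_id": [2577, 3109, 3109, 4886],
    "comments": ["Great", None, "Nice", "Good"],
})

share = missing_share(toy_rev)
print(share)

# One possible treatment: drop the few reviews without any comment text
toy_rev_clean = toy_rev.dropna(subset=["comments"])
```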

## 3. Calendar Dataframe

#### A. Calendar Overview

In [124]:
df_cal.head(3)

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,2577,2020-09-13,f,$125.00,$125.00,3.0,1125.0
1,24260,2020-09-12,f,$76.00,$76.00,10.0,180.0
2,24260,2020-09-13,f,$76.00,$76.00,10.0,180.0


In [125]:
df_cal.shape

(24661709, 7)

In [126]:
df_cal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24661709 entries, 0 to 24661708
Data columns (total 7 columns):
listing_id        int64
date              object
available         object
price             object
adjusted_price    object
minimum_nights    float64
maximum_nights    float64
dtypes: float64(2), int64(1), object(4)
memory usage: 1.3+ GB


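The calendar dataframe seems to hold one row per listing per day, with an `available` flag and the price for that day: 24,661,709 rows is roughly 365 days for each of the 67,565 listings.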
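
One way to check what the calendar actually contains is to count the rows per listing and the share of days flagged as available. This is a sketch on a made-up frame that mimics the `listing_id`, `date` and `available` columns shown above; the toy values are assumptions.

```python
import pandas as pd

# Toy stand-in for df_cal (values are illustrative only)
toy_cal = pd.DataFrame({
    "listing_id": [2577, 2577, 2577, 24260, 24260, 24260],
    "date": ["2020-09-13", "2020-09-14", "2020-09-15",
             "2020-09-12", "2020-09-13", "2020-09-14"],
    "available": ["f", "f", "t", "f", "f", "f"],
})

# Number of calendar days recorded for each listing
rows_per_listing = toy_cal.groupby("listing_id").size()

# Share of those days on which the listing is flagged available ('t')
available_share = toy_cal["available"].eq("t").groupby(toy_cal["listing_id"]).mean()

print(rows_per_listing)
print(available_share)
```

If `rows_per_listing` is close to 365 for most listings in the real data, the calendar covers every day of the coming year rather than only the unavailable ones.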

#### B. Checking Null Values

In [127]:
set(df_cal.columns[df_cal.isnull().mean()==0])

{'adjusted_price', 'available', 'date', 'listing_id', 'price'}

In [128]:
df_cal.isnull().sum()

listing_id           0
date                 0
available            0
price                0
adjusted_price       0
minimum_nights    1110
maximum_nights    1110
dtype: int64

The `minimum_nights` and `maximum_nights` features have the same number of null values: 1,110.
This is negligible given the total of over 24 million rows, so these missing values need no specific data preparation.

In [132]:
df_cal.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights,year,month
0,2577,2020-09-13,False,125,125,3.0,1125.0,2020,9
1,24260,2020-09-12,False,76,76,10.0,180.0,2020,9
2,24260,2020-09-13,False,76,76,10.0,180.0,2020,9
3,24260,2020-09-14,False,76,76,10.0,180.0,2020,9
4,24260,2020-09-15,False,76,76,10.0,180.0,2020,9
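
The `df_cal.head()` output above shows `available` as a boolean, `price` as a number, and added `year` and `month` columns, while the raw file stores `t`/`f` flags and strings such as `$125.00`. The notebook does not show the conversion at this point, so the following is only a sketch of one way to get there, on a made-up frame. The function name `prepare_calendar` is hypothetical, and the handling of thousands separators is an assumption about the raw price format.

```python
import pandas as pd


def prepare_calendar(df):
    """Return a copy of df with typed available/price/date columns plus year and month."""
    out = df.copy()
    # 't' / 'f' flags become real booleans
    out["available"] = out["available"].map({"t": True, "f": False})
    # Strip the currency sign (and any thousands separator, an assumption) before casting
    out["price"] = out["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    # Parse the date once, then derive year and month from it
    out["date"] = pd.to_datetime(out["date"])
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    return out


# Toy stand-in for the raw df_cal (values copied from the preview rows above)
raw_cal = pd.DataFrame({
    "listing_id": [2577, 24260],
    "date": ["2020-09-13", "2020-09-12"],
    "available": ["f", "f"],
    "price": ["$125.00", "$76.00"],
})

prepared = prepare_calendar(raw_cal)
print(prepared)
```

Working on a copy keeps the raw frame unchanged, which makes it easier to re-run the cell without converting already converted columns.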
