# Airbnb Paris
## by Mathieu Rella

# I. Business Understanding

We will be exploring Airbnb paris data to try to find answers to some questions like :

- Where is it good to rent on airbnb in paris ?
- Which season is the more profitable for the host ?
- What do really believe the guest of paris listing ?
- Can we predict the price of a listing ?

In [2]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import qgrid
import plotly.graph_objects as go

import plotly
plotly.__version__
import json
from plotly.offline import download_plotlyjs, init_notebook_mode,  iplot
init_notebook_mode(connected=True)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
# Sklearn ML Modules
from sklearn.preprocessing import MultiLabelBinarizer,LabelEncoder,OneHotEncoder,StandardScaler 
import sklearn.metrics as mtr
import math

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [3]:
# load all the dataset into a pandas dataframe

df_list = pd.read_csv('Data/listings.csv')
df_rev = pd.read_csv('Data/Reviews.csv')
df_cal = pd.read_csv('Data/calendar.csv')

# IV. Modelling

## Predicting Price

In [144]:
# Data type of price column is object.
print("- The price datatype is an", df_list['price'].dtypes, ".")
print("- As seen before the price column have",df_list['price'].isnull().sum(), "missing value, wich means we do not need to impute certain row. Because we want it as a float we need to convert th datatypes as a float and replace special characters as $ signs.")
# datatypes is converted to float
df_list['price'] = df_list['price'].replace(r'[$,%]', '', regex = True).astype(float)
print("- The new price datatype is a",df_list['price'].dtypes)

- The price datatype is an object .
- As seen before the price column have 0 missing value, wich means we do not need to impute certain row. Because we want it as a float we need to convert th datatypes as a float and replace special characters as $ signs.
- The new price datatype is a float64


Now we need to list our columns of interest, the one that we believe impact the most the price of a listing

In [145]:
df_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67565 entries, 0 to 67564
Data columns (total 74 columns):
id                                              67565 non-null int64
listing_url                                     67565 non-null object
scrape_id                                       67565 non-null int64
last_scraped                                    67565 non-null object
name                                            67501 non-null object
description                                     66191 non-null object
neighborhood_overview                           41098 non-null object
picture_url                                     67564 non-null object
host_id                                         67565 non-null int64
host_url                                        67565 non-null object
host_name                                       67555 non-null object
host_since                                      67555 non-null object
host_location                                   67410 

In [146]:
# I pass all the interesting feature into an array
interest_col = ['price','number_of_reviews','reviews_per_month',
                'review_scores_cleanliness','review_scores_checkin',
                'review_scores_communication','review_scores_location',
                'review_scores_value','bathrooms_text','bedrooms','beds',
                'availability_30','availability_60','availability_90',
                'review_scores_accuracy','availability_365','review_scores_rating',
                'host_is_superhost','host_identity_verified',
                'host_listings_count','host_has_profile_pic']

# Delete all the features except the one of interest
list_modeling = df_list.copy()
list_modeling.drop(list_modeling.columns.difference(interest_col), 1, inplace=True)
list_modeling

Unnamed: 0,host_is_superhost,host_listings_count,host_has_profile_pic,host_identity_verified,bathrooms_text,bedrooms,beds,price,availability_30,availability_60,...,availability_365,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month
0,f,2.0,t,t,2 baths,2.0,2.0,125.00,0,0,...,285,1,100.0,10.0,10.0,10.0,10.0,10.0,10.0,0.05
1,f,1.0,t,f,1 bath,,1.0,60.00,22,52,...,357,9,100.0,10.0,10.0,10.0,10.0,10.0,10.0,0.20
2,f,9.0,t,f,1 bath,,2.0,89.00,3,33,...,63,25,90.0,9.0,10.0,9.0,9.0,10.0,9.0,0.19
3,f,9.0,t,f,1 bath,1.0,2.0,102.68,0,11,...,41,22,96.0,10.0,10.0,10.0,10.0,10.0,10.0,0.21
4,f,1.0,t,t,1 bath,,1.0,60.00,24,54,...,84,225,90.0,9.0,9.0,9.0,10.0,10.0,8.0,1.65
5,t,4.0,t,t,1 bath,2.0,2.0,90.00,0,19,...,288,269,94.0,10.0,9.0,10.0,10.0,10.0,10.0,2.34
6,f,0.0,t,t,1 bath,1.0,1.0,130.00,15,45,...,350,6,96.0,10.0,10.0,10.0,10.0,10.0,10.0,0.05
7,f,3.0,t,t,1 bath,1.0,1.0,75.00,0,0,...,196,0,,,,,,,,
8,t,1.0,t,t,1 bath,1.0,1.0,75.00,27,56,...,343,25,98.0,10.0,10.0,10.0,10.0,10.0,10.0,0.27
9,t,4.0,t,t,1 bath,,2.0,80.00,0,0,...,13,47,97.0,10.0,10.0,10.0,10.0,9.0,9.0,0.36


In [147]:
# Let's clean those column to make it usable in our model

# Bathrooms_text features returns a string we need to modify to only keep the # of bath
list_modeling['bathrooms_text'] = list_modeling['bathrooms_text'].str.replace(r"[a-zA-Z]",'').str.replace(' ', '').str.replace('-', '')
#replace the bathrooms outlier
list_modeling['bathrooms_text'] = list_modeling['bathrooms_text'].str.replace('50', '5').str.replace('1.5', '1').str.replace('2.5', '2').str.replace('3.5', '3').str.replace('4.5', '4').str.replace('5.5', '5').str.replace('6.5', '6').str.replace('7.5', '7')
list_modeling['bathrooms_text'] = pd.to_numeric(list_modeling['bathrooms_text'], downcast="float")
list_modeling['bathrooms_text'] = list_modeling['bathrooms_text'].fillna(0)

# fill 0 for Nan values in all ratings features
ratings_features = ['review_scores_rating','review_scores_accuracy','review_scores_cleanliness',
                    'review_scores_checkin','review_scores_communication','review_scores_location',
                    'review_scores_value','reviews_per_month']
for col in ratings_features:
    list_modeling[col] = list_modeling[col].fillna(list_modeling[col].mean())
    
# Clean the price features
#list_modeling["price"] = list_modeling["price"].str.replace('[\$\,]|\.\d*', '').astype(int)

# fill 1 for Nan values in bedrooms features
list_modeling['bedrooms'] = list_modeling['bedrooms'].fillna(1)

# fill 1 for Nan values in beds
list_modeling['beds'] = list_modeling['bedrooms'].fillna(1)

# drop the row with null value only 10
list_modeling = list_modeling.dropna(how='any',axis=0)



# Change those features to boolean
list_modeling['host_is_superhost'] = list_modeling['host_is_superhost'].str.replace('t','1').str.replace('f', '0').astype(int)
list_modeling['host_has_profile_pic'] = list_modeling['host_has_profile_pic'].str.replace('t','1').str.replace('f', '0').astype(int)
list_modeling['host_identity_verified'] = list_modeling['host_identity_verified'].str.replace('t','1').str.replace('f', '0').astype(int)

list_modeling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67555 entries, 0 to 67564
Data columns (total 21 columns):
host_is_superhost              67555 non-null int64
host_listings_count            67555 non-null float64
host_has_profile_pic           67555 non-null int64
host_identity_verified         67555 non-null int64
bathrooms_text                 67555 non-null float32
bedrooms                       67555 non-null float64
beds                           67555 non-null float64
price                          67555 non-null float64
availability_30                67555 non-null int64
availability_60                67555 non-null int64
availability_90                67555 non-null int64
availability_365               67555 non-null int64
number_of_reviews              67555 non-null int64
review_scores_rating           67555 non-null float64
review_scores_accuracy         67555 non-null float64
review_scores_cleanliness      67555 non-null float64
review_scores_checkin          67555 non-nu

In [148]:
# X as all the column of interest except the one we want to predict.
X = list_modeling.loc[:, list_modeling.columns != 'price']
# We want to predict the price so y = price
y = list_modeling['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42)

#Instantiate
lm_model = LinearRegression(normalize=True) 

#Fitting the model
lm_model.fit(X_train, y_train) 

#Predict using your model
y_test_preds = lm_model.predict(X_test)
y_train_preds = lm_model.predict(X_train)

#Scoring our model
test_score = r2_score(y_test, y_test_preds)
train_score = r2_score(y_train, y_train_preds)

In [149]:
print(test_score)
print(train_score)

0.07877492745907355
0.07597810404274774


I used the features of the data list that I think best define the price of a listings
Once the linear regression model was run, I obtained an rsquared of 0.078 on the test dataset and 0.075 on the train dataset.
The model explains 7.8% of the price variation in the test dataset and 7.5% in the training dataset.
The results are not satisfactory enough to conclude that a linear regression model with the features that’s seems interesting can predict the price of a listing.

# V. Conclusions

Thanks to this analysis, we were able to assess the environment of the Paris Airbnb . We found that some neighbourhoods are more popular on airbnb than others and that this follows a logic adopted long before the creation of airbnb itself where the center will be more exclusive than the surrounding neighbourhoods.
As far as the host are concerned, due to legislation and availability, it was more profitable for them to rent during spring and summer season and to reserve the winter months to carry out all the maintenance necessary for the listings.
Finally, it was noted that the 1st district,Hotel de Ville, was the most appreciated district by tourists, no doubt because of its proximity to many of the cultural and historical places that make Paris.
Finally, I tried unsuccessfully to create a model to predict the price but it seems that a simple linear regression is not enough.