# Airbnb Paris
## by Mathieu Rella

# I. Business Understanding

We explore Airbnb data for Paris to answer questions such as:

- Where in Paris is it good to rent on Airbnb?
- Which season is the most profitable for hosts?
- What do guests really think of Paris listings?
- Can we predict the price of a listing?

In [83]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import qgrid
import plotly.graph_objects as go

import plotly
plotly.__version__
import json
from plotly.offline import download_plotlyjs, init_notebook_mode,  iplot
init_notebook_mode(connected=True)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
# Sklearn ML Modules
from sklearn.preprocessing import MultiLabelBinarizer,LabelEncoder,OneHotEncoder,StandardScaler 
import sklearn.metrics as mtr
import math

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [84]:
# load all the datasets into pandas dataframes

df_list = pd.read_csv('Data/listings.csv')
df_rev = pd.read_csv('Data/Reviews.csv')
df_cal = pd.read_csv('Data/calendar.csv')

# III. Data Preparation

#### A. Paris Choropleth

In [91]:
df_list['neighbourhood_cleansed'].value_counts()

Buttes-Montmartre      7591
Popincourt             6635
Vaugirard              5094
Entrepôt               4674
Batignolles-Monceau    4536
Ménilmontant           3859
Buttes-Chaumont        3837
Passy                  3350
Opéra                  3227
Temple                 3185
Reuilly                2733
Observatoire           2583
Bourse                 2388
Gobelins               2358
Panthéon               2250
Hôtel-de-Ville         2081
Luxembourg             2012
Palais-Bourbon         1902
Élysée                 1848
Louvre                 1422
Name: neighbourhood_cleansed, dtype: int64

The 'neighbourhood_cleansed' feature lists the neighbourhoods of Paris. There are 20 of them, matching the 20 arrondissements (districts) of the city.

In [92]:
paris_url = 'https://raw.githubusercontent.com/mathieurella/Udacity_DataScientist_Airbnb-Paris/main/paris.geojson'

In [93]:
import urllib.request

def read_geojson(url):
    # Download the GeoJSON file and parse it into a Python dict
    with urllib.request.urlopen(url) as response:
        jdata = json.loads(response.read().decode())
    return jdata

In [94]:
jdata = read_geojson(paris_url)

In [95]:
# Retrieve all the neighbourhood names from the geojson url
locations = list(range(100))
text = [feat['properties']['name'] for feat in jdata['features'] if feat['id'] in locations]  # neighbourhood names
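To make the `feat['id']` and `feat['properties']['name']` lookup concrete, here is a minimal sketch on a tiny hand-written FeatureCollection. The two features, their ids, and the helper name `feature_names` are assumptions for illustration; the structure of the real `paris.geojson` file is not shown in this notebook.

```python
import json

# Hypothetical two-feature GeoJSON (ids and null geometry are made up for the example)
sample = json.loads("""
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature", "id": 0, "properties": {"name": "Louvre"}, "geometry": null},
   {"type": "Feature", "id": 1, "properties": {"name": "Bourse"}, "geometry": null}
 ]}
""")

def feature_names(geojson, wanted_ids):
    """Return the names of the features whose id is in wanted_ids, in file order."""
    wanted = set(wanted_ids)
    return [f["properties"]["name"] for f in geojson["features"] if f["id"] in wanted]

names = feature_names(sample, range(100))
print(names)
```

The names come back in the order the features appear in the file, which is why the listing counts need to be re-ordered to match before plotting.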

In [96]:
# Create a New Dataframe with the neighbourhood and listing
df_list_neighbourhood = pd.DataFrame(df_list['neighbourhood_cleansed'].value_counts().reset_index().values, columns=["neighbourhood", "# Listing"])

# Order the neighbourhood column to match the order of `text`
df_list_neighbourhood.neighbourhood = df_list_neighbourhood.neighbourhood.astype("category")
df_list_neighbourhood.neighbourhood = df_list_neighbourhood.neighbourhood.cat.set_categories(text)
df_list_neighbourhood = df_list_neighbourhood.sort_values(["neighbourhood"])

In [97]:
# Check that both lists are identical
df_neighbourhood_list = df_list_neighbourhood['neighbourhood'].tolist()

if text == df_neighbourhood_list:
    print("The lists are identical")
else:
    print("The lists are not identical")

The lists are identical
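The re-ordering step can be sketched in isolation as follows. This is a minimal example, assuming a three-neighbourhood subset of the counts above and a made-up `target_order` standing in for the GeoJSON order; it builds the categorical in one call instead of using `set_categories`.

```python
import pandas as pd

# Three counts taken from the value_counts output above
counts = pd.DataFrame({
    "neighbourhood": ["Opéra", "Louvre", "Bourse"],
    "listings": [3227, 1422, 2388],
})

# Assumed target order (stands in for the order of names in the GeoJSON file)
target_order = ["Louvre", "Bourse", "Opéra"]

# A categorical sorts by the position of each value in `categories`, not alphabetically
counts["neighbourhood"] = pd.Categorical(
    counts["neighbourhood"], categories=target_order, ordered=True
)
ordered = counts.sort_values("neighbourhood").reset_index(drop=True)
print(ordered)
```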


In [98]:
z = df_list_neighbourhood['# Listing'].tolist()
neigh = df_list_neighbourhood['neighbourhood'].tolist()

In [99]:
mapboxt = open(".mapbox_token").read().rstrip() #my mapbox_access_token  must be used only for special mapbox style

In [100]:
fig= go.Figure(go.Choroplethmapbox(z=z,
                            locations=locations,
                            colorscale='reds',
                            colorbar=dict(thickness=20, ticklen=3),
                            geojson=jdata,
                            text=neigh,
                            hoverinfo='all',
                            marker_line_width=1, marker_opacity=0.75))
                            
                            
fig.update_layout(title_text= 'Choroplethmapbox',
                  title_x=0.5, width = 700,# height=700,
                  mapbox = dict(center= dict(lat=48.8566,  lon=2.3522),
                                 accesstoken= mapboxt,
                                 style='light',
                                 zoom=10,
                               ));

#fig.show()

In [101]:
fig.data[0].hovertemplate =  '<b>neighbourhood</b>: <b>%{text}</b>'+\
                             '<br> <b># Listings </b>: %{z}<br>'
fig.update_layout(title_text= "# Listing per Neighbourhood");
iplot(fig)

Paris is made up of 20 arrondissements. The central ones are small and hold relatively few listings, and the arrondissements get larger as you approach the Boulevard Périphérique (the ring road).

#### B. Average Price per Neighbourhood

In [102]:
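# Create a lighter copy of df_list
df_list_avg_price = df_list[['neighbourhood_cleansed', 'price', 'id']].copy()
# Strip "$", "," and the decimal part, then cast to int (regex=True is required in pandas >= 2.0)
df_list_avg_price["price"] = df_list_avg_price["price"].str.replace(r'[\$,]|\.\d*', '', regex=True).astype(int)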
df_list_avg_price = df_list_avg_price.groupby(['neighbourhood_cleansed'])[['price']].mean()
df_list_avg_price = df_list_avg_price.reset_index()
df_list_avg_price = df_list_avg_price.sort_values(["price"], ascending=False)

# Bar Chart representing the average Price per Neighbourhood
fig = px.bar(df_list_avg_price, x='neighbourhood_cleansed', y='price')
fig.show()

This bar chart shows the average listing price per neighbourhood. As expected, the price depends on the neighbourhood itself, and it also appears to be related to the number of listings per neighbourhood.
Élysée and Louvre are among the most valued neighbourhoods in Paris: both are small and very sought after by tourists. Ménilmontant, on the other hand, is a more working-class neighbourhood with fewer tourist attractions, and the average price of an accommodation there is much lower.
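The price-cleaning pattern used before computing these averages can be checked on a few sample strings. This is a small sketch with made-up price values; note that the pattern drops the decimal part, so prices are truncated rather than rounded.

```python
import pandas as pd

# Made-up price strings in the "$1,250.00" format of the price column
raw = pd.Series(["$1,250.00", "$85.00", "$104.50"])

# [\$,] removes the dollar sign and thousands separator,
# \.\d* removes the decimal point and everything after it
clean = raw.str.replace(r"[\$,]|\.\d*", "", regex=True).astype(int)
print(clean.tolist())
```

If cents matter for the analysis, one option would be to remove only `$` and `,` and cast to `float` instead.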

#### C. Property type Proportion in Paris

In [103]:
df_list_avg_price_prp = df_list[['neighbourhood_cleansed', 'price', 'id', 'property_type']].copy()
df_list_avg_price_prp["price"] = df_list_avg_price_prp["price"].str.replace(r'[\$,]|\.\d*', '', regex=True).astype(int)
#df_list_avg_price_prp = df_list_avg_price.groupby(['neighbourhood_cleansed', 'property_type'])[['price']].mean()

In [104]:
# Proportion of property type in Paris
y = (df_list_avg_price_prp['property_type'].value_counts()/df_list_avg_price_prp.shape[0]).head(10)
x = df_list_avg_price_prp['property_type'].value_counts().head(10).index.tolist()
text = df_list_avg_price_prp['property_type'].value_counts().head(10)


fig = px.bar( y=y, x=x, text=text, labels=dict(x="Property Type", y="% of property type listing", color="Place"))
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

Entire apartment is by far the most common property type, with over 80% of the listings, followed by Private room in apartment with only 5%.
Let's compare these statistics with the neighbourhood and the average price.
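The share of each property type can also be obtained directly with `value_counts(normalize=True)`, which divides each count by the total. The sketch below uses a made-up ten-row series, not the real listings.

```python
import pandas as pd

# Made-up sample: 8 entire apartments, 1 private room, 1 loft
types = pd.Series(
    ["Entire apartment"] * 8 + ["Private room in apartment"] + ["Entire loft"]
)

# normalize=True returns proportions instead of raw counts
share = types.value_counts(normalize=True)
print(share)
```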

#### D. Average Price per Neighbourhood and property type

In [105]:
values_keep_prop = ('Entire apartment','Private room in apartment','Entire condominium', 'Room in boutique hotel',
'Entire loft')
df_list_avg_price_prp2 = df_list_avg_price_prp.loc[df_list_avg_price_prp['property_type'].isin(values_keep_prop)]

df_list_avg_price_prp = df_list_avg_price_prp2.groupby(['neighbourhood_cleansed','property_type'])[['price']].mean().reset_index()
df_list_avg_price_prp

| | neighbourhood_cleansed | property_type | price |
|---|---|---|---|
| 0 | Batignolles-Monceau | Entire apartment | 104.527297 |
| 1 | Batignolles-Monceau | Entire condominium | 103.472727 |
| 2 | Batignolles-Monceau | Entire loft | 123.853659 |
| 3 | Batignolles-Monceau | Private room in apartment | 57.249284 |
| 4 | Batignolles-Monceau | Room in boutique hotel | 170.757353 |
| 5 | Bourse | Entire apartment | 145.050772 |
| 6 | Bourse | Entire condominium | 101.750000 |
| 7 | Bourse | Entire loft | 190.980769 |
| 8 | Bourse | Private room in apartment | 71.174194 |
| 9 | Bourse | Room in boutique hotel | 213.878788 |


The average price of the Room in boutique hotel property type is higher than that of the other property types, followed by Entire loft, Entire condominium, Entire apartment and finally Private room in apartment.
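One way to compare property types side by side is to pivot the grouped result so that each neighbourhood becomes a column. The sketch below uses only the ten rows displayed above (two neighbourhoods), so the ranking it prints applies to that subset and not to the whole city.

```python
import pandas as pd

# The ten rows shown in the output above
rows = [
    ("Batignolles-Monceau", "Entire apartment", 104.527297),
    ("Batignolles-Monceau", "Entire condominium", 103.472727),
    ("Batignolles-Monceau", "Entire loft", 123.853659),
    ("Batignolles-Monceau", "Private room in apartment", 57.249284),
    ("Batignolles-Monceau", "Room in boutique hotel", 170.757353),
    ("Bourse", "Entire apartment", 145.050772),
    ("Bourse", "Entire condominium", 101.750000),
    ("Bourse", "Entire loft", 190.980769),
    ("Bourse", "Private room in apartment", 71.174194),
    ("Bourse", "Room in boutique hotel", 213.878788),
]
df = pd.DataFrame(rows, columns=["neighbourhood_cleansed", "property_type", "price"])

# One row per property type, one column per neighbourhood
wide = df.pivot(index="property_type", columns="neighbourhood_cleansed", values="price")

# Unweighted mean across the two neighbourhoods, highest first
ranked = wide.mean(axis=1).sort_values(ascending=False)
print(ranked)
```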

#### E. Most Common Amenities

In [106]:
# Check the amenities columns
df_list['amenities'].head()

0    ["Host greets you", "Washer", "Essentials", "L...
1    ["Washer", "Smart lock", "Laptop-friendly work...
2    ["Cable TV", "Host greets you", "Essentials", ...
3    ["Cable TV", "Washer", "Host greets you", "Ess...
4    ["Host greets you", "Washer", "Cooking basics"...
Name: amenities, dtype: object

For each row, the amenities column holds a list of the amenities available for the listing. To make it readable, we turn this column into a new dataframe where each column represents one amenity.

In [107]:
amenities = df_list['amenities'].apply(lambda x: [amenity.replace('"', "").replace("[", "").replace("]", "") 
                                               for amenity in x.split(",")])

mlb = MultiLabelBinarizer()
df_amenities_final = pd.DataFrame(mlb.fit_transform(amenities), columns=mlb.classes_, index=amenities.index)
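Splitting on `","` keeps the space that follows each comma, so `"Washer"` in first position and `" Washer"` in any later position become two different labels. Since the values look like JSON arrays, an alternative is to parse them with `json.loads`. The sketch below compares both approaches on two shortened, hand-written rows, and it assumes every value in the column is valid JSON.

```python
import json
from collections import Counter

# Two shortened example rows in the same format as the amenities column
raw = [
    '["Host greets you", "Washer", "Essentials"]',
    '["Washer", "Smart lock"]',
]

# Comma split as in the cell above: labels after the first keep a leading space
naive = [
    [a.replace('"', "").replace("[", "").replace("]", "") for a in row.split(",")]
    for row in raw
]

# JSON parsing: each row becomes a clean list of strings
parsed = [json.loads(row) for row in raw]

naive_counts = Counter(a for row in naive for a in row)
parsed_counts = Counter(a for row in parsed for a in row)

print(naive_counts["Washer"], naive_counts[" Washer"])
print(parsed_counts["Washer"])
```

The parsed lists can be passed to `MultiLabelBinarizer` in the same way as before. This may also explain why a listing whose amenities include `"Washer"` in a non-leading position can show 0 in the `Washer` column.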

In [108]:
df_amenities_final.head()

Unnamed: 0,Unnamed: 1,linens,toiletries,Air conditioning,Airport shuttle,Alarm system,BBQ grill,Baby bath,Baby monitor,Babysitter recommendations,...,Smart lock,Smoke alarm,Stove,Suitable for events,Sun loungers,TV,Trash compactor,Washer,Waterfront,Wifi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [109]:
test = df_amenities_final.sum().sort_values(ascending=False)
test = test.rename_axis('amenities').reset_index(name='counts')
test.head(10)

| | amenities | counts |
|---|---|---|
| 0 | Wifi | 62988 |
| 1 | Kitchen | 61489 |
| 2 | Essentials | 60513 |
| 3 | Heating | 58215 |
| 4 | Smoke alarm | 48317 |
| 5 | Hangers | 47313 |
| 6 | Hair dryer | 46584 |
| 7 | Iron | 45984 |
| 8 | TV | 44360 |
| 9 | Washer | 41773 |


In [110]:
fig = px.bar( y=test['counts'].head(), x=test['amenities'].head(), text=test['counts'].head(),labels=dict(x="Amenities", y="Count of listing with amenities", color="Place"))
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

We observe that Wifi, Kitchen, Essentials, Heating and Smoke alarm are the 5 most common amenities in Paris Airbnb listings.