## Airbnb Rio

Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
%matplotlib inline

Reading data

In [2]:
df = pd.read_csv('./listings.csv')
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,17878,https://www.airbnb.com/rooms/17878,20210321015201,2021-03-25,"Very Nice 2Br in Copacabana w. balcony, fast WiFi",Discounts for long term stays. <br />- Large b...,This is the one of the bests spots in Rio. Bec...,https://a0.muscache.com/pictures/65320518/3069...,68997,https://www.airbnb.com/users/show/68997,...,10.0,10.0,9.0,,t,1,1,0,0,2.0
1,24480,https://www.airbnb.com/rooms/24480,20210321015201,2021-03-22,Nice and cozy near Ipanema Beach,My studio is located in the best of Ipanema. ...,"The beach, the lagoon, Ipanema is a great loca...",https://a0.muscache.com/pictures/11955612/b28e...,99249,https://www.airbnb.com/users/show/99249,...,10.0,10.0,9.0,,f,1,1,0,0,0.67
2,35636,https://www.airbnb.com/rooms/35636,20210321015201,2021-03-25,Cosy flat close to Ipanema beach,This cosy apartment is just a few steps away ...,The apartment street is very quiet and safe ....,https://a0.muscache.com/pictures/20009355/38b6...,153232,https://www.airbnb.com/users/show/153232,...,10.0,10.0,9.0,,f,1,1,0,0,2.0
3,35764,https://www.airbnb.com/rooms/35764,20210321015201,2021-03-25,COPACABANA SEA BREEZE - RIO - 20 X Superhost,Our newly renovated studio is located in the b...,Our guests will experience living with a local...,https://a0.muscache.com/pictures/23782972/1d3e...,153691,https://www.airbnb.com/users/show/153691,...,10.0,10.0,10.0,,f,1,1,0,0,2.79
4,41198,https://www.airbnb.com/rooms/41198,20210321015201,2021-03-22,"Modern 2bed,Top end of Copacabana","<b>The space</b><br />Stay in this, Modern,cle...",,https://a0.muscache.com/pictures/3576716/2d6a9...,178975,https://www.airbnb.com/users/show/178975,...,9.0,9.0,9.0,,f,2,2,0,0,0.18


### Neighbourhood Overview

#### What neighbourhoods have the biggest number of properties?

First, we will analyze the number of properties in each neighbourhood.

In [18]:
# normalize=True gives each neighbourhood's share of all listings
fig = px.bar(df.neighbourhood_cleansed.value_counts(normalize=True).nlargest(10),
             orientation='h',
             title="Top 10 neighbourhoods by share of properties",
             template='simple_white')
fig.update_yaxes(categoryorder='total ascending')
fig.update_layout(showlegend=False)

Copacabana accounts for almost 30% of the properties in Rio de Janeiro, while Barra da Tijuca and Ipanema account for 10% each.
Together, these three neighbourhoods represent almost 50% of the properties in Rio.
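
The combined share quoted above can be checked directly rather than read off the chart. The following is a minimal sketch on toy data; the `neighbourhoods` series is a hypothetical stand-in for `df.neighbourhood_cleansed`, and its values are not the real listing counts.

```python
import pandas as pd

# Toy stand-in for df.neighbourhood_cleansed (hypothetical values, not the real data)
neighbourhoods = pd.Series(
    ["Copacabana"] * 3
    + ["Ipanema"] * 2
    + ["Barra da Tijuca"] * 2
    + ["Leblon", "Centro", "Botafogo"]
)

# normalize=True returns each neighbourhood's share of all listings
shares = neighbourhoods.value_counts(normalize=True)

# Combined share of the three largest neighbourhoods
top3_share = shares.nlargest(3).sum()

print(shares.head(3))
print(f"Top 3 combined share: {top3_share:.0%}")
```

On the real data, replacing `neighbourhoods` with `df.neighbourhood_cleansed` should give the figure that the "almost 50%" statement refers to.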

#### What are the most expensive neighbourhoods?

Next, we will analyze the neighbourhoods with the highest prices.

In [4]:
df['price'].value_counts()

$200.00      908
$250.00      899
$150.00      878
$300.00      797
$500.00      654
            ... 
$4,433.00      1
$901.00        1
$1,142.00      1
$706.00        1
$761.00        1
Name: price, Length: 1611, dtype: int64

In [5]:
df['price'].head()

0    $200.00
1    $307.00
2    $275.00
3    $120.00
4    $494.00
Name: price, dtype: object

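The 'price' column is stored as text (dtype `object`) with a dollar sign and thousands separators, so we need to clean it and convert it to a number before we can answer the question.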

In [7]:
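# Drop the leading '$' and the trailing '.00', remove thousands separators, then convert to int.
# This assumes every price has exactly two decimal places; any cents are discarded.
df['price'] = df['price'].map(lambda p: int(p[1:-3].replace(",", "")))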
df['price'].value_counts()

200     908
250     899
150     878
300     797
500     654
       ... 
477       1
4308      1
525       1
6670      1
2015      1
Name: price, Length: 1611, dtype: int64
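
The slicing approach above depends on every value having the exact `$x.xx` shape and would raise an error on a missing value. As a hedged alternative, here is a sketch that strips the symbols with a regular expression and lets pandas do the conversion. The `raw` series is a hypothetical sample that mimics the strings shown earlier; it is not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'price' strings, including one missing value
raw = pd.Series(["$200.00", "$1,142.00", "$4,433.00", None])

# Remove '$' and thousands separators, then convert to a number.
# errors="coerce" turns anything that still cannot be parsed into NaN.
clean = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True), errors="coerce")

print(clean.tolist())
```

This version returns floats rather than integers, so cents are preserved and missing prices stay missing instead of stopping the cell.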

Now, we can understand the distribution of the variable. The boxplot below shows the outliers.

In [8]:
fig = px.box(df['price'])
fig.show()

In [13]:
fig = px.box(df, x='neighbourhood_cleansed', y='price')
# The neighbourhoods are on the x axis, so the ordering must be applied there
fig.update_xaxes(categoryorder='median descending')
fig.show()

We will compare neighbourhoods by median price rather than mean price, because the median is much less sensitive to outliers.

In [12]:
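# Select 'price' before aggregating: calling median() on the whole frame fails on text columns in recent pandas
fig = px.bar(df.groupby('neighbourhood_cleansed')['price'].median().nlargest(10),
             orientation='h',
             title="Top 10 neighbourhoods by median price",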
            template='simple_white')
fig.update_yaxes(categoryorder='total ascending')
fig.update_layout(showlegend=False)
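
To illustrate why the median was chosen, the sketch below compares the two statistics on a toy frame. The prices are hypothetical; neighbourhood "A" contains a single extreme listing.

```python
import pandas as pd

# Hypothetical prices: one extreme listing in neighbourhood "A"
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "B", "B", "B"],
    "price": [100, 120, 5000, 200, 210, 220],
})

# Mean and median price per neighbourhood, side by side
summary = toy.groupby("neighbourhood_cleansed")["price"].agg(["mean", "median"])

print(summary)
```

Ranked by mean, "A" looks far more expensive than "B" because of one listing; ranked by median, "B" comes first, which better reflects a typical listing.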

Even so, we will filter out the outliers. Based on the boxplot above, we keep only listings priced above 9 and below 1114. The histogram below shows the distribution of prices under the upper limit.

In [9]:
fig = px.histogram(df[(df.price<1114)]['price'])
fig.show()

In [10]:
df_adj_price = df[(df.price<1114) & (df.price>9)]
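
The limits of 9 and 1114 used above are hard-coded after reading them from the boxplot. One common way to derive such limits from the data is the 1.5 × IQR rule that boxplot whiskers are usually based on. The sketch below is an assumption about how comparable limits could be computed, not necessarily how the values above were obtained; `tukey_fences` and `toy_prices` are hypothetical names.

```python
import pandas as pd

def tukey_fences(prices, k=1.5):
    """Return (lower, upper) limits placed k * IQR beyond the quartiles."""
    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical prices with one extreme value
toy_prices = pd.Series([100, 150, 200, 250, 300, 5000])

lower, upper = tukey_fences(toy_prices)

# Keep only the values strictly inside the fences
filtered = toy_prices[(toy_prices > lower) & (toy_prices < upper)]

print(lower, upper)
print(filtered.tolist())
```

The fences plotly draws may differ slightly from these, since quartiles can be computed with different interpolation methods. A computed lower fence can also be negative for skewed price data, in which case a manual floor such as the one used above is still needed to drop implausibly cheap listings.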

In [11]:
# Same chart as before, now computed on the filtered data
fig = px.bar(df_adj_price.groupby('neighbourhood_cleansed')['price'].median().nlargest(10),
             orientation='h',
             title="Top 10 neighbourhoods by median price (outliers removed)",
            template='simple_white')
fig.update_yaxes(categoryorder='total ascending')
fig.update_layout(showlegend=False)