<a href="https://colab.research.google.com/github/pranilthorat/capstone-project-almabetter/blob/main/github_airbnb_capstone_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analyzing the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding the behavior and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more. </b>

## <b>This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. </b>

## <b> Explore and analyze the data to discover key insights (not limited to these), such as: 
* What can we learn about different hosts and areas?
* What can we learn from predictions? (e.g., locations, prices, reviews)
* Which hosts are the busiest and why?
* Is there any noticeable difference in traffic among different areas, and what could be the reason for it? </b>

* **AIRBNB** is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk and Joe Gebbia. Airbnb is a shortened version of its original name, AirBedandBreakfast.com.

In [3]:
#Importing all Necessary Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import plotly.express as px
from shapely.geometry import Point,Polygon
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from shapely import wkt
import warnings
warnings.filterwarnings(action='ignore')
%matplotlib inline

Data Frame of Airbnb

In [4]:
#mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
#loading the CSV file into airbnb_df
airbnb="/content/drive/MyDrive/Airbnb NYC 2019.csv"
airbnb_df=pd.read_csv(airbnb)


In [6]:
#head of airbnb_df
airbnb_df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [7]:
#tail of airbnb_df
airbnb_df.tail()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2
48894,36487245,Trendy duplex in the very heart of Hell's Kitchen,68119814,Christophe,Manhattan,Hell's Kitchen,40.76404,-73.98933,Private room,90,7,0,,,1,23


In [8]:
#checking the size of the airbnb_df
airbnb_df.size

782320

In [9]:
#checking shape of airbnb_df
airbnb_df.shape

(48895, 16)


Cleaning the data

In [10]:
#checking null values or unknown data 
airbnb_df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [11]:
#selecting the columns required for the analysis of hosts
#.copy() makes the later in-place cleaning act on an independent frame, not on airbnb_df
airbnb_df_new=airbnb_df.loc[:,['host_id','host_name','price','number_of_reviews','reviews_per_month','availability_365','neighbourhood_group','neighbourhood']].copy()
airbnb_df_new

Unnamed: 0,host_id,host_name,price,number_of_reviews,reviews_per_month,availability_365,neighbourhood_group,neighbourhood
0,2787,John,149,9,0.21,365,Brooklyn,Kensington
1,2845,Jennifer,225,45,0.38,355,Manhattan,Midtown
2,4632,Elisabeth,150,0,,365,Manhattan,Harlem
3,4869,LisaRoxanne,89,270,4.64,194,Brooklyn,Clinton Hill
4,7192,Laura,80,9,0.10,0,Manhattan,East Harlem
...,...,...,...,...,...,...,...,...
48890,8232441,Sabrina,70,0,,9,Brooklyn,Bedford-Stuyvesant
48891,6570630,Marisol,40,0,,36,Brooklyn,Bushwick
48892,23492952,Ilgar & Aysel,115,0,,27,Manhattan,Harlem
48893,30985759,Taz,55,0,,2,Manhattan,Hell's Kitchen


In [12]:
#handling missing values: listings without reviews get 0 reviews per month, and rows without a host name are dropped
airbnb_df_new.fillna({'reviews_per_month':0}, inplace=True)
airbnb_df_new.dropna(subset=['host_name'],inplace=True)

In [13]:
#length of airbnb_df_new
len(airbnb_df_new)

48874

In [14]:
#checking null values or missing data 
airbnb_df_new.isnull().sum()

host_id                0
host_name              0
price                  0
number_of_reviews      0
reviews_per_month      0
availability_365       0
neighbourhood_group    0
neighbourhood          0
dtype: int64
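
The two cleaning steps above can be sketched on a small toy frame. The host names and values below are hypothetical and only illustrate the pattern; the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names follow the notebook.
toy = pd.DataFrame({
    "host_name": ["Ann", None, "Raj", "Mei"],
    "number_of_reviews": [3, 0, 0, 12],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# A listing with no reviews has no monthly rate, so 0 is a natural fill value.
cleaned = toy.fillna({"reviews_per_month": 0})
# Rows without a host name cannot be attributed to a host, so they are dropped.
cleaned = cleaned.dropna(subset=["host_name"])

print(len(toy), len(cleaned))
print(cleaned.isnull().sum())
```

Filling only `reviews_per_month` keeps every listing in the table, while dropping on `host_name` removes just the rows that cannot be tied to a named host.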

In [15]:
#we have clean data now; note that type() of any single column is always a pandas Series
print(type(airbnb_df_new["host_id"]))
print(type(airbnb_df_new['host_name']))
print(type(airbnb_df_new["availability_365"]))
print(type(airbnb_df_new["number_of_reviews"]))
print(type(airbnb_df_new["reviews_per_month"]))
print(type(airbnb_df_new["price"]))
print(type(airbnb_df_new['neighbourhood_group']))
print(type(airbnb_df_new['neighbourhood']))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
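
Because `type()` reports `Series` for every column, it says nothing about what the columns contain. A minimal sketch of a more informative check, assuming the goal is to see each column's element type, uses `dtypes`. The two rows are taken from the head of the table above.

```python
import pandas as pd

# Two rows copied from the head of the dataset shown earlier.
toy = pd.DataFrame({
    "host_id": [2787, 2845],
    "host_name": ["John", "Jennifer"],
    "reviews_per_month": [0.21, 0.38],
})

# type() describes the container, which is a Series for any column.
print(type(toy["host_id"]))

# dtypes describes the values stored in each column.
print(toy.dtypes)
```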


In [16]:
#head of airbnb_df_new
airbnb_df_new.head()

Unnamed: 0,host_id,host_name,price,number_of_reviews,reviews_per_month,availability_365,neighbourhood_group,neighbourhood
0,2787,John,149,9,0.21,365,Brooklyn,Kensington
1,2845,Jennifer,225,45,0.38,355,Manhattan,Midtown
2,4632,Elisabeth,150,0,0.0,365,Manhattan,Harlem
3,4869,LisaRoxanne,89,270,4.64,194,Brooklyn,Clinton Hill
4,7192,Laura,80,9,0.1,0,Manhattan,East Harlem


In [17]:
#tail of airbnb_df_new
airbnb_df_new.tail()

Unnamed: 0,host_id,host_name,price,number_of_reviews,reviews_per_month,availability_365,neighbourhood_group,neighbourhood
48890,8232441,Sabrina,70,0,0.0,9,Brooklyn,Bedford-Stuyvesant
48891,6570630,Marisol,40,0,0.0,36,Brooklyn,Bushwick
48892,23492952,Ilgar & Aysel,115,0,0.0,27,Manhattan,Harlem
48893,30985759,Taz,55,0,0.0,2,Manhattan,Hell's Kitchen
48894,68119814,Christophe,90,0,0.0,23,Manhattan,Hell's Kitchen


###<font color=blue>__1. What can we learn about different hosts and areas?__</font>

__i) How many hosts have a listing that is available all 365 days of the year?__

In [18]:
#printing the maximum availability
airbnb_df_new["availability_365"].max()

365

In [19]:
#Selecting the host_id and availability_365 columns
availability_365_days=airbnb_df_new.loc[:,['host_id','availability_365']]
#Keeping only listings with the maximum availability of 365 days
host_available_365_days=availability_365_days[availability_365_days['availability_365']>364]
#Counting distinct hosts, since one host can have several such listings
how_many_host_available_for_365_days=host_available_365_days['host_id'].nunique()
print(f"There are {how_many_host_available_for_365_days} hosts with a listing available for 365 days")


There are 894 hosts with a listing available for 365 days
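
The count above is a count of distinct hosts, not of listings. A small sketch with hypothetical host ids shows how the two differ when one host has several listings that are available all year:

```python
import pandas as pd

# Hypothetical listings: host 1 has two listings available all year.
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3],
    "availability_365": [365, 365, 120, 365, 0],
})

full_year = toy[toy["availability_365"] == 365]

n_listings = len(full_year)               # rows that are available all year
n_hosts = full_year["host_id"].nunique()  # distinct hosts behind those rows

print(n_listings, n_hosts)
```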


__ii) Which host has the highest number of reviews?__

In [20]:
#Grouping by host_id and host_name and summing number_of_reviews across each host's listings
#Selecting the column before .sum() avoids aggregating the text columns as well
groupby_host_id_and_host_name=airbnb_df_new.groupby(['host_id','host_name'])['number_of_reviews'].sum().reset_index()
#Keeping the 10 hosts with the most reviews
highest_number_of_reviews=groupby_host_id_and_host_name.sort_values('number_of_reviews',ascending=False).head(10)

In [21]:
#Appending an underscore to each host_id so that Plotly treats it as a category rather than an integer
highest_number_of_reviews['host_id']=highest_number_of_reviews['host_id'].astype('string')+"_"
#plotting bar graph with one color per host_id
fig = px.bar(highest_number_of_reviews, y='number_of_reviews', x='host_name', text='number_of_reviews',
             color='host_id',opacity=.8)
#Updating traces and layout to beautify the plot and setting the font size
fig.update_traces(textfont=dict(size=15,color='White'))
#The axis title font is set through title=dict(font=...), since the older titlefont shortcut is deprecated
fig.update_layout(title='Top 10 hosts by number of reviews',xaxis=dict(title=dict(font=dict(size=15)),tickfont = dict(size=14)),yaxis=dict(title=dict(font=dict(size=15)),tickfont = dict(size=13),
        showgrid=True,gridcolor='rgb(26, 173, 102)',
        showticklabels=True),plot_bgcolor='black')
#show figure
fig.show()

<u>From above plot</u>
1. ___Maya___ has the ___highest number of reviews___, with __2273__.
2. ___Brooklyn and Breakfast Len___ has the ___second highest___ number of reviews, with __2205__.
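
The ranking above comes from summing reviews over all listings of each host. A minimal sketch of that aggregation on hypothetical data is shown below. Grouping on both `host_id` and `host_name` keeps hosts that share a name apart, and `nlargest` is used here as a compact alternative to sorting and taking the head.

```python
import pandas as pd

# Hypothetical listings: two different hosts are both called "Ann".
toy = pd.DataFrame({
    "host_id": [10, 10, 20, 30],
    "host_name": ["Ann", "Ann", "Raj", "Ann"],
    "number_of_reviews": [5, 7, 20, 1],
    "neighbourhood": ["A", "B", "A", "C"],
})

# Select the numeric column first so that only it is summed.
totals = (
    toy.groupby(["host_id", "host_name"])["number_of_reviews"]
    .sum()
    .reset_index()
)

# Keep the two hosts with the most reviews.
top2 = totals.nlargest(2, "number_of_reviews")
print(top2)
```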

__iii) Which host has the highest number of reviews per month?__

In [22]:
#Grouping by host_id and host_name and summing reviews_per_month across each host's listings
groupby_host_id_and_host_name=airbnb_df_new.groupby(['host_id','host_name'])['reviews_per_month'].sum().reset_index()
highest_number_of_reviews_per_month=groupby_host_id_and_host_name.sort_values('reviews_per_month',ascending=False).head(10)
#Rounding to two decimals for display
highest_number_of_reviews_per_month['reviews_per_month']=highest_number_of_reviews_per_month['reviews_per_month'].round(2)

In [23]:
#plotting bar graph and setting continuous color scale
fig = px.bar(highest_number_of_reviews_per_month, y='reviews_per_month', x='host_name',text='reviews_per_month',color='reviews_per_month',opacity=.8,color_continuous_scale='tealgrn')
#Updating traces and layout to beautify the plot and setting the font size
fig.update_traces(textfont=dict(size=15,color='White'))
#The axis title font is set through title=dict(font=...), since the older titlefont shortcut is deprecated
fig.update_layout(title='Top 10 hosts by number of reviews per month',xaxis=dict(title=dict(font=dict(size=15)),tickfont = dict(size=14)),yaxis=dict(title=dict(font=dict(size=15)),tickfont = dict(size=13),
        showgrid=True,gridcolor='rgb(26, 173, 102)',
        showticklabels=True),plot_bgcolor='black')
#show figure
fig.show()

<u>From above plot</u>
1. ___Sonder (NYC)___ has the ___highest number___ of reviews per month.
2. The hosts with the ___next three highest___ numbers of reviews per month are ___Row NYC, Lakshmee and Danielle___.

_Now let us turn to the areas. We can learn the following points about them:_ 


In [24]:
#Extracting the columns required for the analysis of areas
new_airbnb_df=airbnb_df_new.loc[:,['neighbourhood_group','neighbourhood','price']]

In [25]:
#Grouping by neighbourhood group and computing the mean price
result_group=new_airbnb_df.groupby(['neighbourhood_group'])['price'].mean().reset_index()
result_group['price']=result_group['price'].round(2)
#Sorting the results from most to least expensive
result_group.sort_values('price',ascending=False,inplace=True)

In [26]:
# Computing Cumulative Percentage
result_group['cum_percent'] = 100*(result_group['price'].cumsum() / result_group['price'].sum())

In [27]:
#plotting bar graph price vs neighbourhood_group 
trace1 = go.Bar(
    x=result_group['neighbourhood_group'],
    y=result_group['price'],
    text=result_group['price'],
    name='price',
    marker=dict(
        color='rgb(34,163,192)'
               ),opacity=.80
)
#plotting Scatter graph cumulative percent vs neighbourhood_group 
trace2 = go.Scatter(
    x=result_group['neighbourhood_group'],
    y=result_group['cum_percent'],
    name='cum_percent',
    yaxis='y2'

)
#Combining the two traces into a Pareto chart to find the most expensive and the cheapest neighbourhood group
fig = make_subplots(specs=[[{"secondary_y": True}]])
#Adding the traces, then updating the layout to beautify the plot and set the font size
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 1000, title = 'Average price distribution of Neighbourhood group',xaxis=dict(tickfont = dict(size=14),
      tickangle=-90
    ),plot_bgcolor='black')
#show figure
fig.show()

<u>From above plot</u>
1. ___Manhattan___ is the ___most expensive___ neighbourhood group, with an average room price of __196.9__, which is __31.59%__ of the sum of the average prices across all neighbourhood groups.
2. ___Bronx___ is the ___cheapest___ one, with an average room price of __87.51__.
3. ___Manhattan, Brooklyn and Staten Island___ together account for a cumulative __69.98%__ of that sum.
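
The percentages quoted above are shares of the sum of the group averages, accumulated from the most expensive group down. A minimal sketch of that calculation follows. The area labels and prices are hypothetical round numbers chosen so the arithmetic is easy to follow, and `share_percent` is an extra column added here for illustration.

```python
import pandas as pd

# Hypothetical average prices for four areas.
avg_price = pd.DataFrame({
    "area": ["C", "A", "D", "B"],
    "price": [20.0, 40.0, 10.0, 30.0],
})

# A Pareto chart needs the bars ordered from largest to smallest.
avg_price = avg_price.sort_values("price", ascending=False)

total = avg_price["price"].sum()
# Share of each area in the sum of averages.
avg_price["share_percent"] = 100 * avg_price["price"] / total
# Running total of those shares, which is the line drawn on the secondary axis.
avg_price["cum_percent"] = 100 * avg_price["price"].cumsum() / total

print(avg_price)
```

The first value of `cum_percent` equals the share of the most expensive area, and the last value is always 100.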

__i) Which is the cheapest area, and what is its price distribution?__

In [28]:
#Grouping by neighbourhood group and neighbourhood and computing the mean price
result_bronx=new_airbnb_df.groupby(['neighbourhood_group','neighbourhood'])['price'].mean().reset_index()
result_bronx['price']=result_bronx['price'].round(2)
#Keeping only the cheapest neighbourhood group, the Bronx
result_bronx=result_bronx.loc[result_bronx['neighbourhood_group']=='Bronx']
result_bronx=result_bronx[['neighbourhood','price']]
#Sorting the results
result_bronx.sort_values('price',ascending=False,inplace=True)

In [29]:
#Computing Cumulative Percentage
result_bronx['cum_percent'] = 100*(result_bronx['price'].cumsum() / result_bronx['price'].sum())

In [30]:
#plotting bar graph price vs neighbourhood
trace1 = go.Bar(
    x=result_bronx['neighbourhood'],
    y=result_bronx['price'],text=result_bronx['price'],
    name='price',
    marker=dict(
        color='rgb(252, 227, 3)'
               ),opacity=.80
)
#plotting Scatter graph cumulative percent vs neighbourhood
trace2 = go.Scatter(
    x=result_bronx['neighbourhood'],
    y=result_bronx['cum_percent'],
    name='cum_percent',
    yaxis='y2'
)
#Combining the two traces into a Pareto chart to find the cheapest neighbourhood
fig = make_subplots(specs=[[{"secondary_y": True}]])
#Adding the traces, then updating the layout to beautify the plot and set the font size
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 800, width = 1600,xaxis=dict(
      tickangle=-90
    ))
fig.update_layout(title='Average price distribution in the Bronx, the cheapest neighbourhood group',autosize=False,plot_bgcolor='black',xaxis=dict(tickfont = dict(size=14)))
#show figure
fig.show()

<u>From above plot</u>
1. ___Riverdale___ is the ___most expensive___ neighbourhood in the ___Bronx___, with an average room price of __442.09__, which is __9.84%__ of the sum of the average prices across all Bronx neighbourhoods.
2. ___Hunts Point___ is the ___cheapest___ one, with an average room price of __50.5__.
3. ___Riverdale, City Island, Spuyten Duyvil, Eastchester, Unionport, Westchester Square and West Farms___ together account for a cumulative __28.79%__ of that sum.

__ii) Which is the most expensive area, and what is its price distribution?__

In [31]:
#Grouping by neighbourhood group and neighbourhood and computing the mean price
result_manhattan=new_airbnb_df.groupby(['neighbourhood_group','neighbourhood'])['price'].mean().reset_index()
result_manhattan['price']=result_manhattan['price'].round(2)

In [32]:
#Keeping only the most expensive neighbourhood group, Manhattan, and sorting the results
result_manhattan=result_manhattan.loc[result_manhattan['neighbourhood_group']=='Manhattan']
result_manhattan=result_manhattan[['neighbourhood','price']]
result_manhattan.sort_values('price',ascending=False,inplace=True)

In [33]:
# Computing Cumulative Percentage
result_manhattan['cum_percent'] = 100*(result_manhattan['price'].cumsum() / result_manhattan['price'].sum())

In [34]:
#plotting bar graph price vs neighbourhood
trace1 = go.Bar(
    x=result_manhattan['neighbourhood'],
    y=result_manhattan['price'],text=result_manhattan['price'],
    name='price',
    marker=dict(
        color='rgb(252, 227, 3)'
               ),opacity=.80
)
#plotting Scatter graph cumulative percent vs neighbourhood
trace2 = go.Scatter(
    x=result_manhattan['neighbourhood'],
    y=result_manhattan['cum_percent'],
    name='cum_percent',
    yaxis='y2'
)
#Combining the two traces into a Pareto chart to find the most expensive neighbourhood
fig = make_subplots(specs=[[{"secondary_y": True}]])
#Adding the traces, then updating the layout to beautify the plot and set the font size
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 800, width = 1400,xaxis=dict(
      tickangle=-90
    ))
fig.update_layout(title='Average price distribution in Manhattan, the most expensive neighbourhood group',autosize=False,plot_bgcolor='black',xaxis=dict(tickfont = dict(size=14)))
#show figure
fig.show()

<u>From above plot</u>
1. ___Tribeca___ is the ___most expensive___ neighbourhood in ___Manhattan___, with an average room price of __490.64__, which is __7.22%__ of the sum of the average prices across all Manhattan neighbourhoods.
2. ___Inwood___ is the ___cheapest___ one, with an average room price of __88.9__.
3. ___Tribeca, Battery Park City, Flatiron District, NoHo and SoHo___ together account for a cumulative __26.24%__ of that sum.
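
The Bronx and Manhattan tables are built with the same steps, so they could be produced by one small helper. The function below is a hypothetical sketch, not part of the original notebook: `neighbourhood_price_table` is a name introduced here, and the toy neighbourhoods and prices are made up.

```python
import pandas as pd

def neighbourhood_price_table(df, group):
    """Average price per neighbourhood in one neighbourhood_group,
    sorted from most to least expensive, with a cumulative share."""
    subset = df.loc[df["neighbourhood_group"] == group]
    table = (
        subset.groupby("neighbourhood")["price"]
        .mean()
        .round(2)
        .sort_values(ascending=False)
        .reset_index()
    )
    # Cumulative share of the sum of neighbourhood averages, in percent.
    table["cum_percent"] = 100 * table["price"].cumsum() / table["price"].sum()
    return table

# Hypothetical listings; X, Y and Z are placeholder neighbourhood names.
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx", "Manhattan"],
    "neighbourhood": ["X", "X", "Y", "Z"],
    "price": [100, 200, 50, 400],
})

bronx = neighbourhood_price_table(toy, "Bronx")
print(bronx)
```

With the notebook's data, calling the helper once with `'Bronx'` and once with `'Manhattan'` would give the two tables that feed the Pareto charts above.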