
## <b> Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized worldwide. Analyzing the millions of listings on Airbnb is crucial for the company. These listings generate a large amount of data that can be analyzed and used for security, business decisions, understanding the behavior and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more. </b>

## <b>This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. </b>

## <b> Explore and analyze the data to discover key insights (not limited to these), such as:
* What can we learn about different hosts and areas?
* What can we learn from predictions? (e.g., locations, prices, reviews)
* Which hosts are the busiest and why?
* Is there any noticeable difference in traffic among different areas, and what could be the reason for it? </b>

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data  = pd.read_csv('/content/drive/MyDrive/AlmaBetter: Capstone Projects/Airbnb/Airbnb NYC 2019.csv')

In [None]:
# Let's take a quick look at the data
data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [None]:
data.shape

(48895, 16)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

In [None]:
# Check the unique values in neighbourhood_group
data['neighbourhood_group'].unique()

array(['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx'],
      dtype=object)

In [None]:
# Find the cheapest listings; a price of 0 is treated as missing

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
data['price'] = data['price'].replace(0, np.nan)
data[['name', 'price']].sort_values(by='price').reset_index(drop=True).dropna()

Unnamed: 0,name,price
0,"Girls only, cozy room one block from Times Square",10.0
1,Gigantic Sunny Room in Park Slope-Private Back...,10.0
2,Voted #1 Airbnb In NYC,10.0
3,"Very Spacious bedroom, steps from CENTRAL PARK.",10.0
4,"Charming, bright and brand new Bed-Stuy home",10.0
...,...,...
48879,2br - The Heart of NYC: Manhattans Lower East ...,9999.0
48880,Spanish Harlem Apt,9999.0
48881,1-BR Lincoln Center,10000.0
48882,Furnished room in Astoria apartment,10000.0
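
The cell above masks zero prices before sorting. As a minimal sketch of the same idea, the toy frame below (hypothetical listing names and prices, not rows from the dataset) counts the zero-priced rows first, so that it is clear how many listings the mask removes, and then picks the cheapest remaining ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; these are not rows from the Airbnb dataset.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "price": [0, 10, 250, 0],
})

# Count the zero-priced rows before masking them.
n_zero = int((toy["price"] == 0).sum())

# Treat a price of 0 as missing, as in the cell above.
toy["price"] = toy["price"].replace(0, np.nan)

# Drop rows with a missing price only, then take the two cheapest listings.
cheapest = toy.dropna(subset=["price"]).sort_values("price").head(2)

print(n_zero)
print(cheapest)
```

Using `dropna(subset=["price"])` keeps rows whose `name` is missing, whereas a bare `dropna()` would drop them as well.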


In [None]:
# Average price for each neighbourhood group
data.groupby('neighbourhood_group')['price'].mean().sort_values()

neighbourhood_group
Bronx             87.577064
Queens            99.517649
Staten Island    114.812332
Brooklyn         124.438915
Manhattan        196.884903
Name: price, dtype: float64

*Average prices are highest in Manhattan (about 197) and lowest in the Bronx (about 88).*
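
Because prices in this dataset range from 10 to 10,000, a few extreme listings can pull the mean upwards. The sketch below uses a hypothetical toy frame (the prices are made up for illustration) to show how the median can be reported alongside the mean for each group.

```python
import pandas as pd

# Hypothetical prices, chosen so that one extreme listing skews the mean.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Bronx"] * 3,
    "price": [100, 120, 140, 10000, 60, 80, 100],
})

# Mean, median and number of listings per neighbourhood group.
summary = (
    toy.groupby("neighbourhood_group")["price"]
       .agg(["mean", "median", "count"])
)
print(summary)
```

In the toy data the single 10,000 listing moves the Manhattan mean far above its median, while the Bronx mean and median coincide. The same `agg` call could be applied to `data` to check whether the ranking of the groups changes.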

In [None]:
# Count the number of listings per host, using host_id
data['host_id'].value_counts().sort_values(ascending=False)

219517861    327
107434423    232
30283594     121
137358866    103
16098958      96
            ... 
271916367      1
271928929      1
271925782      1
242376689      1
68119814       1
Name: host_id, Length: 37457, dtype: int64

In [None]:
# Find the three hosts with the most listings (these are listing counts, not bookings)
data['host_id'].value_counts().sort_values(ascending=False).head(3)

219517861    327
107434423    232
30283594     121
Name: host_id, dtype: int64
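
The counts above identify hosts only by `host_id`. A minimal sketch of how the host names could be shown next to the listing counts is given below; the toy frame and its names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per listing.
toy = pd.DataFrame({
    "host_id": [1, 1, 1, 2, 2, 3],
    "host_name": ["Ana", "Ana", "Ana", "Ben", "Ben", "Cy"],
})

# Count listings per (host_id, host_name) pair and keep the top two hosts.
top_hosts = (
    toy.groupby(["host_id", "host_name"])
       .size()
       .rename("listings")
       .sort_values(ascending=False)
       .head(2)
       .reset_index()
)
print(top_hosts)
```

Note that `groupby` drops rows with a missing key by default, so on the real data the few listings without a `host_name` would be left out of these counts.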

In [None]:
data.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

In [None]:
# Check which neighbourhood group has the highest average availability
df1 = data.groupby('neighbourhood_group')['availability_365'].mean().round(2).dropna().sort_values(ascending=False)
#data.groupby('neighbourhood')['reviews_per_month'].mean().round(2).dropna().sort_values(ascending=False)
df1

neighbourhood_group
Staten Island    199.68
Bronx            165.76
Queens           144.45
Manhattan        111.98
Brooklyn         100.23
Name: availability_365, dtype: float64

*Staten Island has the highest average availability (about 200 days a year) and Brooklyn the lowest (about 100 days).*

In [None]:
df2 = data.groupby('neighbourhood_group')['price'].mean().round(2).dropna().sort_values()
df2

neighbourhood_group
Bronx             87.58
Queens            99.52
Staten Island    114.81
Brooklyn         124.44
Manhattan        196.88
Name: price, dtype: float64

In [None]:
# Combine average availability and average price per neighbourhood group to compare them
df3 = pd.merge(df1,df2, how='inner', right_on='neighbourhood_group',left_on='neighbourhood_group').reset_index()
df3

Unnamed: 0,neighbourhood_group,availability_365,price
0,Staten Island,199.68,114.81
1,Bronx,165.76,87.58
2,Queens,144.45,99.52
3,Manhattan,111.98,196.88
4,Brooklyn,100.23,124.44
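
To put a number on the relationship in the merged table, the sketch below rebuilds the five rows shown above and computes the Pearson correlation between the two columns. With only five groups this is a rough indication, not a statistical conclusion.

```python
import pandas as pd

# Values copied from the merged table above (one row per neighbourhood group).
df3 = pd.DataFrame({
    "neighbourhood_group": ["Staten Island", "Bronx", "Queens", "Manhattan", "Brooklyn"],
    "availability_365": [199.68, 165.76, 144.45, 111.98, 100.23],
    "price": [114.81, 87.58, 99.52, 196.88, 124.44],
})

# Pearson correlation between average availability and average price.
r = df3["availability_365"].corr(df3["price"])
print(round(r, 2))
```

A negative value would mean that groups with higher average availability tend to have lower average prices.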


In [None]:
# Check the average price of each room type

data.groupby('room_type')['price'].mean()

room_type
Entire home/apt    211.810918
Private room        89.809131
Shared room         70.248705
Name: price, dtype: float64

*The data shows that an entire home/apartment is the most expensive on average, followed by a private room, while a shared room is the cheapest.*
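
Room type and location both affect price, so it can help to look at them together. The sketch below is one possible way to do that with `pivot_table`, using a hypothetical toy frame with made-up prices.

```python
import pandas as pd

# Hypothetical toy listings; the prices are made up for illustration.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Bronx", "Bronx", "Bronx"],
    "room_type": ["Entire home/apt", "Private room", "Private room",
                  "Entire home/apt", "Private room", "Shared room"],
    "price": [200, 100, 120, 90, 50, 30],
})

# Average price for every (neighbourhood group, room type) combination.
table = toy.pivot_table(
    index="neighbourhood_group",
    columns="room_type",
    values="price",
    aggfunc="mean",
)
print(table)
```

Combinations with no listings (here, shared rooms in Manhattan) appear as NaN in the resulting table.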

In [None]:
data['room_type'].value_counts()

Entire home/apt    25409
Private room       22326
Shared room         1160
Name: room_type, dtype: int64

*Entire homes/apartments are the most common listing type, followed by private rooms; shared rooms are comparatively rare.*

In [None]:
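# Keep listings with above-average price and reviews per month that are available at least one day a year
data2 = data[(data['price']>data['price'].mean()) & (data['reviews_per_month']>data['reviews_per_month'].mean()) & (data['availability_365']>0)]
# Drop rows with any missing value
data2 = data2.dropna()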

In [None]:
# Average reviews, availability and price per host, sorted by reviews (desc), availability (desc), price (asc)
cols = ['number_of_reviews', 'availability_365', 'price']
data2.groupby('host_id')[cols].mean().round(2).sort_values(cols, ascending=[False, False, True]).dropna()

host_id,number_of_reviews,availability_365,price
12949460,488.0,269.0,160.0
273174,447.0,207.0,575.0
3587751,404.0,341.0,220.0
627217,403.0,201.0,189.0
20116872,401.0,178.0,195.0
...,...,...,...
11814933,2.0,11.0,198.0
74841734,2.0,11.0,200.0
264101088,2.0,11.0,255.0
70715527,2.0,10.0,191.0


*In the table above, the hosts with the most reviews tend to have higher availability and lower prices than the others.*

***The review count decreases as availability decreases and price increases.***
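
A sorted table can only suggest a trend, so one way to check it is to compute a correlation matrix directly. The sketch below is an assumption about how that could be done, not part of the original analysis: it ranks each column and takes the Pearson correlation of the ranks, which is the Spearman rank correlation. The toy values are hypothetical and deliberately constructed to be perfectly monotone.

```python
import pandas as pd

# Hypothetical toy hosts: reviews rise with availability and fall with price.
toy = pd.DataFrame({
    "number_of_reviews": [5, 40, 12, 80, 3],
    "availability_365": [30, 200, 90, 300, 10],
    "price": [400, 150, 220, 90, 500],
})

# Pearson correlation of the ranks equals the Spearman rank correlation.
rank_corr = toy.rank().corr()
print(rank_corr.round(2))
```

On the real data, the same two calls applied to the three columns of `data2` would give values between -1 and 1; the closer they are to 0, the weaker the support for the trend stated above.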

In [None]:
data[['host_id','availa']]