<a href="https://colab.research.google.com/github/kiran9615/EDA_airbnb-hosting-data-analysis/blob/main/Airbnb_Bookings_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Analysis of the millions of listings provided through Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers' and hosts' behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.***

# ***This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.***

***Explore and analyze the data to discover key insights (not limited to these), such as:***

*What can we learn about different hosts and areas?*

*What can we learn from predictions? (e.g., locations, prices, reviews)*

*Which hosts are the busiest and why?*

*Is there any noticeable difference in traffic among different areas, and what could be the reason for it?*

In [1]:
#importing all the necessary modules needed during analysis
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
# mounting google drive in colab 
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#reading airbnb data 
file_path = '/content/drive/MyDrive/Datascience/Datasets/Airbnb NYC 2019.csv'
airbnbNY_data = pd.read_csv(file_path)

# ***Part 1: Basic Data Preprocessing and Cleaning***

In [5]:
#----------------------Block 1----------------------------------

#checking for no. of observations and features in our dataset
airbnbNY_data.shape

(48895, 16)

*The dataset contains 48,895 listings (observations) with 16 features.*

In [6]:
#----------------------Block 2----------------------------------

# visualising first 5 observations from our dataset 
airbnbNY_data.head()

,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


***Data Dictionary***

*1. ID - A unique identifier for each listing (row).*

*2. Name - The listing name that a user sees while booking (analogous to a hotel name).*

*3. Host ID - The unique ID that Airbnb assigns to every host.*

*4. Host name - Name of the host.*

*5. Neighbourhood_group - New York City is divided into 5 neighbourhood groups (boroughs). This column shows the borough in which a listing is located.*

*6. Neighbourhood - Each borough is further subdivided into neighbourhoods. This column shows the neighbourhood of a listing.*

*7. Latitude and longitude - These show the geographical position of a rental.*

*8. Room_Type - The type of room a rental offers.*

*9. Price - The price per night that a rental charges.*

*10. Minimum Nights - The minimum number of nights for which a rental can be booked.*

*11. Number of reviews - The number of reviews a rental has received so far.*

*12. Reviews per month - The average number of reviews a rental receives per month.*

*13. Last review - The date on which a rental received its most recent review.*

*14. calculated_host_listings_count - The number of rentals a host has in the dataset.*

*15. Availability 365 - The number of days in the year for which a rental is available to book.*

*NOTE - I have used the words rentals, listings and hostings interchangeably here. They all mean the same thing.*
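
To illustrate item 14, the sketch below recomputes the per-host listing count on a small made-up frame and compares it with the stored column. The toy rows and the name `recomputed_count` are assumptions for illustration only. On the real data the stored value may come from a different snapshot, so a mismatch there would not necessarily be an error.

```python
import pandas as pd

# Toy stand-in for the listings table (values are made up for illustration)
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'host_id': [10, 10, 20, 30],
    'calculated_host_listings_count': [2, 2, 1, 1],
})

# Count how many listings each host has, aligned back to every row
recomputed_count = toy.groupby('host_id')['id'].transform('count')

# True where the stored column agrees with the recomputed count
matches = recomputed_count == toy['calculated_host_listings_count']
print(matches.all())
```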

In [7]:
#----------------------Block 3----------------------------------

#getting basic information about dataset
airbnbNY_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

*Outcomes*

*1. Data type of each column*

*2. Memory usage of the dataset*

*3. Number of non-null values in each column*
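
The three outcomes above can also be pulled out one at a time when only one of them is needed. The following is a minimal sketch on a made-up frame; `toy` and its values are assumptions, not part of the Airbnb data.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (values are made up for illustration)
toy = pd.DataFrame({
    'price': [149, 225, 150],
    'room_type': ['Private room', 'Entire home/apt', 'Private room'],
    'reviews_per_month': [0.21, 0.38, np.nan],
})

dtypes = toy.dtypes                                # 1. data type of each column
memory_bytes = toy.memory_usage(deep=True).sum()   # 2. total memory usage in bytes
non_null = toy.notnull().sum()                     # 3. non-null count per column

print(dtypes, memory_bytes, non_null, sep='\n\n')
```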


In [8]:
#----------------------Block 4----------------------------------

# getting the count of null values in each column
airbnbNY_data.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

*1. Name and host_name have null values, but these columns are not significant for our analysis, so we won't worry about them.*

*2. last_review and reviews_per_month have the same number of null values because they are interdependent. last_review holds the date of a listing's most recent review. If it is null, the listing hasn't received any reviews yet, and for the same reason reviews_per_month is also null.*

*3. For the observations where reviews_per_month and last_review are null, number_of_reviews should logically be zero. So, let's verify this.*



In [9]:
#----------------------Block 5----------------------------------

# checking the values of number_of_reviews where both last_review and reviews_per_month are null
no_review_mask = airbnbNY_data['last_review'].isnull() & airbnbNY_data['reviews_per_month'].isnull()
airbnbNY_data.loc[no_review_mask, 'number_of_reviews'].value_counts()

0    10052
Name: number_of_reviews, dtype: int64

*We will drop the last_review column from the dataframe, as it is not very useful for further analysis, and set reviews_per_month to 0 wherever it is null.*

In [10]:
#----------------------Block 6----------------------------------

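airbnbNY_data.drop(['last_review'], axis=1, inplace=True)   # dropping last_review column
# filling null values with 0 in reviews_per_month; assigning back avoids the chained inplace call, which newer pandas versions warn about
airbnbNY_data['reviews_per_month'] = airbnbNY_data['reviews_per_month'].fillna(0)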

*The listing name and host name are also not of significant use to us, so we'll drop them as well.*

In [11]:
#----------------------Block 7----------------------------------

airbnbNY_data.drop(['name', 'host_name'], axis=1, inplace=True)   # dropping name and host_name columns

In [12]:
#----------------------Block 8----------------------------------

#null values check
airbnbNY_data.isnull().sum()

id                                0
host_id                           0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

*We have treated the null values and dropped all unnecessary columns.*
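
For reuse, the cleaning steps in this part could be gathered into one function that returns a new frame and leaves the input untouched. This is only a sketch: the function name `clean_listings` and the toy rows are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def clean_listings(df):
    """Return a cleaned copy: drop unused columns, fill missing review rates with 0."""
    # drop() without inplace returns a new frame, so df itself is not modified
    cleaned = df.drop(columns=['last_review', 'name', 'host_name'])
    cleaned['reviews_per_month'] = cleaned['reviews_per_month'].fillna(0)
    return cleaned

# Toy rows mimicking the null pattern seen above (values are made up)
toy = pd.DataFrame({
    'name': ['Cozy room', None],
    'host_name': ['John', 'Laura'],
    'number_of_reviews': [9, 0],
    'last_review': ['2018-10-19', None],
    'reviews_per_month': [0.21, np.nan],
})

cleaned = clean_listings(toy)
print(cleaned.isnull().sum())
```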

In [13]:
#----------------------Block 9----------------------------------

#checking for descriptive summary of data
airbnbNY_data.drop(['id','host_id','latitude','longitude'], axis =1).describe()

,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0
mean,152.720687,7.029962,23.274466,1.09091,7.143982,112.781327
std,240.15417,20.51055,44.550582,1.597283,32.952519,131.622289
min,0.0,1.0,0.0,0.0,1.0,0.0
25%,69.0,1.0,1.0,0.04,1.0,0.0
50%,106.0,3.0,5.0,0.37,1.0,45.0
75%,175.0,5.0,24.0,1.58,2.0,227.0
max,10000.0,1250.0,629.0,58.5,327.0,365.0


*The descriptive table shows that every feature except availability_365 has serious outliers. For example, the maximum price is 10,000 while the 75th percentile is 175, and the maximum minimum_nights is 1,250 while the 75th percentile is 5. The minimum price of 0 also looks suspicious.*

*Since this project only performs EDA, outlier treatment may not be strictly necessary. In some specific cases, however, outliers do affect the EDA, so we'll apply a temporary treatment at those points.*
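
One possible form of such a temporary treatment is an IQR-based filter that returns a trimmed copy and leaves the original dataframe untouched. The sketch below assumes the common 1.5 × IQR rule. The helper name `without_outliers`, the multiplier `k`, and the toy prices are illustrative assumptions, not necessarily the treatment used later in this analysis.

```python
import pandas as pd

def without_outliers(df, column, k=1.5):
    """Return a filtered copy keeping rows within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Boolean indexing builds a new frame, so df itself is not modified
    return df[df[column].between(lower, upper)]

# Toy prices with one extreme value (made up for illustration)
toy = pd.DataFrame({'price': [69, 80, 106, 150, 175, 10000]})
trimmed = without_outliers(toy, 'price')
print(len(toy), len(trimmed))
```

Because the filter returns a copy, it can be applied just before a single plot or summary without changing the data used elsewhere.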
