<a href="https://colab.research.google.com/github/piyushbg/AirBnb/blob/main/AirBnb_booking_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AirBnb Bookings Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 - Piyush Sanjay Bagul**


# **Project Summary -**

Airbnb has been utilized by both hosts and guests to create personalized and unique travel experiences, leading to its recognition as a one-of-a-kind service worldwide. The vast amount of data generated by millions of listings on the platform is crucial for Airbnb's operations. This data can be analyzed to inform decision-making on security, business strategies, customer and host behavior, marketing initiatives, and the implementation of innovative services, among other things. The dataset used for analysis contains approximately 49,000 observations across 16 columns, containing a combination of categorical and numeric values.

I conducted an in-depth exploratory analysis of the dataset to identify ways in which Airbnb could enhance its business. I began by performing data cleaning tasks such as removing duplicate values, handling outliers, and dealing with missing values. I then visualized the data and identified several key insights by examining the relationships among the variables. Based on these insights, I developed potential solutions that could help Airbnb improve its business.

The solutions I have put forward are based on my analysis and understanding of the dataset. To gain a better understanding of customer behavior, additional variables such as ratings and user reviews would be beneficial. In addition, attributes such as amenities could also have a significant impact on customer behavior. Overall, the dataset contains a wealth of attributes that can be analyzed beyond the scope of my findings.

# **GitHub Link -**

https://github.com/piyushbg/AirBnb

# **Problem Statement**


Since its inception in 2008, Airbnb has revolutionized the way in which guests and hosts travel and experience the world. Today, it is a one-of-a-kind service used by people all over the world. With millions of listings on its platform, data analysis has become a crucial factor in Airbnb's success. The vast amount of data generated by these listings can be analyzed to enhance security, inform business decisions, understand customer and provider behavior, introduce new services, guide marketing initiatives, and much more. With competitors such as OYO in the market, Airbnb must keep improving to hold its position.

The dataset used in this analysis consists of approximately 49,000 observations (48,895 rows) with 16 columns, containing a mix of categorical and numerical values.





#### **Define Your Business Objective?**

To make informed business decisions by gaining insights into the behavior of both customers and providers on the platform, or during their stays.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # standard library module; numpy.math was only an alias and is removed in NumPy 2.0
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore') # setting ignore as a parameter


### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
file_path="/content/drive/MyDrive/Colab Notebooks/AirBnbcap"
data=pd.read_csv(file_path +"/Airbnb NYC 2019.csv")
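
The guidelines above ask for exception handling, so here is a minimal sketch of a guarded load. The helper name `load_listings` is hypothetical and not part of the original notebook; it only wraps `pd.read_csv` with an explicit file check.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path):
    """Read the listings CSV, raising a clear error if the file is absent."""
    path = Path(csv_path)
    if not path.is_file():
        # Fail early with a readable message instead of a long traceback
        raise FileNotFoundError(f"Dataset not found at: {path}")
    return pd.read_csv(path)


# Hypothetical usage; the path mirrors the one used in the cell above
# data = load_listings("/content/drive/MyDrive/Colab Notebooks/AirBnbcap/Airbnb NYC 2019.csv")
```

Checking the path first keeps the notebook runnable "in one go" with a clear message when the Drive folder is not mounted.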

### Dataset First View

In [None]:
# Dataset First Look
data.head()


In [None]:
data.tail(2)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(data[data.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull(), cbar=False)
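
The heatmap shows where the nulls are; a small table of counts and percentages can make the same information easier to quote. The sketch below uses a made-up toy frame (`sample`) rather than the real data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
sample = pd.DataFrame({
    "host_name": ["Ann", None, "Raj", "Lee"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [100, 80, 150, 60],
})

# Null count and percentage per column, largest first
missing = pd.DataFrame({
    "null_count": sample.isnull().sum(),
    "null_pct": (sample.isnull().mean() * 100).round(1),
}).sort_values("null_count", ascending=False)
print(missing)
```

On the notebook's data, replacing `sample` with `data` should give the same kind of table for all 16 columns.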

### What did you know about your dataset?

 1. The dataset has 48,895 rows and 16 columns, and it is a mix of categorical and numerical values.
 
 2. neighbourhood_group, neighbourhood, and room_type are categorical columns.
 
 3. id, host_id, latitude, longitude, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count, and availability_365 are numerical columns. last_review is a date stored as text.
 
 4. There are no duplicate rows, but columns such as last_review and reviews_per_month contain a large number of null values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

In [None]:
len(data["host_name"].unique())

In [None]:
data["neighbourhood_group"].unique()

### Variables Description 

* **id** - unique id
* **name** - description of the property
* **host_id** - unique id for host
* **host_name** - host name. There are 11,452 unique host names across 48,895 listings (21 listings have a null host name), which suggests that a single host can own multiple properties
* **neighbourhood_group** - location. There are 5 unique locations (Manhattan, Brooklyn, Queens, Bronx, Staten Island)
* **neighbourhood** - area within the neighbourhood group
* **longitude** - location
* **latitude** - location
* **room_type** - room type (Private room, Shared room, Entire home/apt)
* **price** - price of the room
* **minimum_nights** - minimum number of nights to be paid for
* **number_of_reviews** - total count of reviews of that listing
* **last_review** - last review date of that listing
* **reviews_per_month** - number of reviews per month of that listing
* **calculated_host_listings_count** - total number of listings registered under the host
* **availability_365** - number of days the listing is available out of 365
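
A note on the host_name count: `len(series.unique())` includes NaN as one distinct value, while `nunique()` leaves it out by default. The sketch below shows the difference on a made-up series; the names are illustrative only.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Ann", "Raj", "Ann", np.nan, np.nan])

with_nan = len(names.unique())   # NaN counts as one distinct value -> 3
without_nan = names.nunique()    # NaN is excluded by default -> 2
n_missing = names.isnull().sum() # number of missing entries -> 2
print(with_nan, without_nan, n_missing)
```

This is why the two ways of counting hosts can differ by one when the column has nulls.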

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
column_list = data.columns.values.tolist()
for column_name in column_list:
  print("unique values in ",column_name,"is",data[column_name].nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df=data.copy()

In [None]:
df.isnull().sum()

In [None]:
# Dropping the last_review column, which is not needed for this analysis
df.drop(["last_review"],axis=1,inplace=True)

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
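# Fill missing values: 0 for reviews_per_month, 'anonymous' for host_name
# (assignment is used because inplace fillna on a selected column is unreliable in newer pandas)
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
df['host_name'] = df['host_name'].fillna('anonymous')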

In [None]:
df.isnull().sum()

# Some rows have a price of zero, which cannot be valid, so those rows are deleted

In [None]:
df.describe() # the minimum price is 0, which is illogical

In [None]:
# 11 listings have a price of 0, so those rows are dropped
print(len(df[df['price']==0]))
df.drop(df[df['price']==0].index,inplace=True)



In [None]:
len(df[df['price']==0]) # now no zero-price rows remain
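
The fill and filter steps above can also be collected into one function so they are easy to rerun. This is a sketch under assumptions: `clean_listings` is a hypothetical helper, the toy frame is made up, and unlike the cells above it resets the index and does not drop last_review.

```python
import numpy as np
import pandas as pd


def clean_listings(raw):
    """Return a cleaned copy; the input frame is left untouched."""
    cleaned = raw.copy()
    # No review history is treated as zero reviews per month
    cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)
    # Placeholder label for listings without a host name
    cleaned["host_name"] = cleaned["host_name"].fillna("anonymous")
    # A price of zero is not a valid nightly rate, so drop those rows
    cleaned = cleaned[cleaned["price"] > 0].reset_index(drop=True)
    return cleaned


toy = pd.DataFrame({
    "host_name": ["Ann", None, "Raj"],
    "reviews_per_month": [0.5, np.nan, 1.2],
    "price": [100, 0, 150],
})
print(clean_listings(toy))
```

Returning a copy avoids changing the raw data by accident, which matches the `data.copy()` step at the start of this section.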

# 1. Which location and which room type are preferable?

In [None]:
# Number of listings per area and room type (count() tallies rows, so the minimum_nights column holds the listing count)
prf_areas = df.groupby(['neighbourhood_group','room_type'])['minimum_nights'].count().reset_index()
prf_areas = prf_areas.sort_values(by='minimum_nights', ascending=False)
prf_areas.head(3)

# 2.Area wise price distribution

In [None]:
#area wise maximum price
max_price_df = df.groupby('neighbourhood_group',as_index=False)['price'].max().sort_values(['price'],ascending = False).rename(columns = {'price':'Maximum price','neighbourhood_group':'Area'})
max_price_df

In [None]:
#area wise minimum price
min_price_df = df.groupby('neighbourhood_group',as_index=False)['price'].min().sort_values(['price'],ascending = True).rename(columns = {'price':'Minimum price','neighbourhood_group':'Area'})
min_price_df

In [None]:
merge_price_df = pd.merge(max_price_df, min_price_df, on='Area') #merge of max and min price

In [None]:
merge_price_df
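
As an alternative to building two frames and merging them, named aggregation can produce both columns in a single groupby. The sketch below runs on a made-up toy frame; the output column names `min_price` and `max_price` are illustrative choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "price": [40, 90, 35, 120, 60],
})

# Named aggregation yields both columns in one groupby, so no merge is needed
price_range = (
    toy.groupby("neighbourhood_group", as_index=False)
       .agg(min_price=("price", "min"), max_price=("price", "max"))
       .rename(columns={"neighbourhood_group": "Area"})
)
print(price_range)
```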

# 3. What insights do we get from host and area?

In [None]:
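# Top 10 hosts by number of listings in an area
# count() tallies the rows (listings) per host and area; sum() would add the host's total once per listing
Host_area=df.groupby(["host_name","neighbourhood_group"])["calculated_host_listings_count"].count().reset_index()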

In [None]:
top_10=Host_area.sort_values(by='calculated_host_listings_count',ascending=False).head(10)
top_10
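
One point to keep in mind when ranking hosts: calculated_host_listings_count already holds the host's total on every row, so summing it across rows inflates the figure. The sketch below illustrates this on made-up data and shows row counting as one possible alternative; grouping by host_id as well is an assumption made here to keep hosts with the same name apart.

```python
import pandas as pd

# Toy data: host 1 has three listings, host 2 has one (values are made up)
toy = pd.DataFrame({
    "host_id": [1, 1, 1, 2],
    "host_name": ["Ann", "Ann", "Ann", "Raj"],
    "neighbourhood_group": ["Queens", "Queens", "Bronx", "Bronx"],
    "calculated_host_listings_count": [3, 3, 3, 1],
})

# Summing the per-row host total counts it once per listing: 3 + 3 + 3 = 9
summed = toy.groupby("host_id")["calculated_host_listings_count"].sum()

# Counting rows gives the actual number of listings per host and area
per_area = (
    toy.groupby(["host_id", "host_name", "neighbourhood_group"])
       .size()
       .reset_index(name="listings")
       .sort_values("listings", ascending=False)
)
print(summed)
print(per_area)
```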

# 4. Top 10 hosts with the most reviews

In [None]:
#top 10 hosts who have most reviews
most_reviews = df.groupby(['host_name','room_type','neighbourhood_group','neighbourhood'])['number_of_reviews'].max().reset_index()
most_reviews= most_reviews.sort_values(by='number_of_reviews', ascending=False).head(10)
most_reviews

# 5.Average price of property based on location and  Room type

In [None]:
# Average price of property based on location and room type
avg_price = df.groupby(['neighbourhood_group','room_type'])['price'].mean().reset_index().rename(columns={"neighbourhood_group":"Area"})
avg_price

In [None]:
areas_reviews = df.groupby(['neighbourhood_group'])['number_of_reviews'].max().reset_index()
areas_reviews

In [None]:
area_price =df.groupby(['price'])['number_of_reviews'].max().reset_index()
area_price.head(5)

### What all manipulations have you done and insights you found?

For data wrangling, I worked on a copy of the raw data so that the original stayed intact. I dropped the last_review column, which was not needed for this analysis and contained many null values. I filled the null values in reviews_per_month with 0 and in host_name with "anonymous", keeping in mind that such fill values should not skew the analysis. I also removed the 11 listings with a price of zero, since a zero price is not valid.

I then grouped the data by neighbourhood group, room type, and host to find the most common room types and locations, the price range in each area, the hosts with the most listings, and the hosts with the most reviews. These observations give useful insight into the trends in the Airbnb market in New York, but they are limited to what this dataset records. Other factors that the data does not cover, such as the location of tourist attractions or the availability of public transportation, may also influence these trends.

### ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
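# Chart - 1 visualization code
# prf_areas['minimum_nights'] holds the number of listings per (area, room type) pair
fig = plt.figure(figsize = (10, 5))
 
# grouped bar plot: one bar per room type within each neighbourhood group,
# so bars for the same room type no longer overlap
sns.barplot(data=prf_areas, x='neighbourhood_group', y='minimum_nights', hue='room_type')
 
plt.xlabel("Neighbourhood group")
plt.ylabel("Number of listings")
plt.title("Listings by Neighbourhood Group and Room Type")
plt.show()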

##### 1. Why did you pick the specific chart?
A bar chart shows which groups are the largest or most common and how the other groups compare against them.
##### 2. What is/are the insight(s) found from the chart?
From the above bar graph, we can say that people prefer entire homes/apartments or private rooms, mainly in Manhattan, Brooklyn, and Queens, and that they appear to prefer lower-priced listings.
##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.
1. The graph shows that entire homes/apartments and private rooms are preferred by guests, so Airbnb can increase the number of these room types.

2. Shared rooms are less preferred, possibly because their price is very close to that of a private room.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

merge_price_df.plot(x="Area", y='Minimum price', kind="bar",color='blue',width=0.1)
merge_price_df.plot(x="Area", y='Maximum price', kind="bar",color='maroon',width=0.1)


##### 1. Why did you pick the specific chart?

To compare the maximum price and the minimum price in each area.

##### 2. What is/are the insight(s) found from the chart?

There is less demand in Staten Island, and its minimum price is high compared to the other areas. I think this is because, being a tourist place, it has all sorts of pricing. Manhattan and Brooklyn have lavish homes and are chosen more often than the other areas, as Manhattan is among the costliest places in the world.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Airbnb should offer rooms in different price categories so that people at every income level can afford a stay, which lets Airbnb reach a larger section of customers.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
host_names = top_10["host_name"]
host_listings_count = top_10['calculated_host_listings_count']
fig = plt.figure(figsize =(10, 7))
# pie chart of the listing counts of the top 10 hosts
plt.pie(host_listings_count,labels=host_names)
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart shows how the listings are shared among the hosts with the most properties.


##### 2. What is/are the insight(s) found from the chart

The top hosts are hotel chains rather than individual hosts.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

I think this will help Airbnb offer unique features and discounts to attract customers, because these are easy to implement by collaborating with hotel chains.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
host_name = list(most_reviews['host_name'])
review_count = list(most_reviews['number_of_reviews'])
host_name.reverse()
review_count.reverse()

plt.title('most reviews', {'fontsize':14})
plt.xlabel('Number of reviews',{'fontsize':14})
plt.ylabel('Host name',{'fontsize':14})
plt.barh(host_name, review_count,color='blue')


##### 1. Why did you pick the specific chart?

To visualize which hosts have the most reviews.

##### 2. What is/are the insight(s) found from the chart?

Dona, an individual host, appears to be the busiest host. This suggests that hotel chains with more properties are not being reviewed as much, perhaps because their footfall is low compared to their capacity.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

My assumption is that if Airbnb ignores this, given that organizations hold the largest number of listings, guests may shift to other hosts or to another platform.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.style.use('classic')
plt.figure(figsize=(13,7))
plt.title("Neighbourhood Group vs. Availability Room")
sns.boxplot(data=df, x='neighbourhood_group',y='availability_365')
plt.show()

##### 1. Why did you pick the specific chart?

To understand the availability of rooms throughout the year in different areas.

##### 2. What is/are the insight(s) found from the chart?

From the above box plot, rooms in Staten Island are available for nearly 250 days of the year, while the remaining areas come close to 150 days. This suggests that there is not much demand in Staten Island compared to the other areas.
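
The centre line of each box is the median, so the reading above can be checked numerically with a groupby. The sketch below uses a made-up toy frame, so its numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood_group": ["Staten Island"] * 3 + ["Brooklyn"] * 3,
    "availability_365": [200, 250, 300, 0, 30, 365],
})

grouped = toy.groupby("neighbourhood_group")["availability_365"]

# Median per area, which is the centre line of each box in the plot
medians = grouped.median().sort_values(ascending=False)
# describe() adds the quartiles that form the box edges
quartiles = grouped.describe()[["25%", "50%", "75%"]]
print(medians)
print(quartiles)
```

Running the same two lines on `df` would give the exact medians behind the box plot.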

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Since Airbnb already has good business here, it can focus on other features instead of marketing.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# area_price holds, for each distinct price, the highest review count among listings at that price
price = area_price['price']
max_reviews = area_price['number_of_reviews']

fig = plt.figure(figsize = (10, 5))
 
# creating the scatter plot
plt.scatter(price, max_reviews)
 
plt.xlabel("Price")
plt.ylabel("Maximum number of reviews")
plt.title("Price vs Number of Reviews")
plt.show()

##### 1. Why did you pick the specific chart?

To analyze the price range at which the number of reviews saturates.

##### 2. What is/are the insight(s) found from the chart?

A high number of reviews means high footfall, so the chart suggests that people prefer cheaper listings.
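
One way to back this reading with numbers is to bin prices into bands and compare the typical review count per band. This is a sketch under assumptions: the band edges are hypothetical and the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [40, 60, 90, 150, 220, 480, 900],
    "number_of_reviews": [120, 80, 95, 40, 30, 5, 2],
})

# Hypothetical band edges chosen for illustration only
bands = pd.cut(toy["price"], bins=[0, 100, 250, 1000],
               labels=["<=100", "101-250", "251-1000"])

# observed=True keeps only the bands that actually contain listings
by_band = toy.groupby(bands, observed=True)["number_of_reviews"].agg(["count", "median"])
print(by_band)
```

The median is used instead of the maximum because a single heavily reviewed listing would otherwise dominate its band.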

#### Chart - 7

In [None]:
plt.figure(figsize=(12,8))
# keyword arguments are required by recent seaborn versions
sns.scatterplot(data=df, x='longitude', y='latitude', hue='neighbourhood_group')
plt.title('New York City Map')
plt.show()

##### 1. Why did you pick the specific chart?

To plot the neighbourhood groups by latitude and longitude and see where they fall on the map.

##### 2. What is/are the insight(s) found from the chart?

It shows where the listings of each neighbourhood group are located geographically.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

It helps to analyze the data by place, which makes the business interpretation more meaningful.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(12,8))
plt.title("Room Type in a Neighbourhood Group")
# keyword arguments are required by recent seaborn versions
sns.countplot(data=df, x='neighbourhood_group', hue='room_type')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot is a bar chart that shows the number of observations in each category of a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

Private rooms are most common in Brooklyn, followed by Manhattan, and least common in Staten Island. Shared rooms are few in all regions. Manhattan has the largest number of entire homes/apartments, followed by Brooklyn, and Staten Island has the fewest.

#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
corr = df.corr(numeric_only=True)  # skip text columns such as name and room_type
plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap shows the pairwise correlation between the numerical variables at a glance, which makes the analysis more reliable. Each value lies in the range [-1, 1].

##### 2. What is/are the insight(s) found from the chart?

There is a correlation between id and host_id, and also between number_of_reviews and reviews_per_month. For each correlated pair, I can therefore analyze either variable in place of the other.
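
As a small illustration of how such a correlation is computed, the sketch below builds a toy frame with one text column and two numeric columns. The values are made up, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
    "number_of_reviews": [10, 40, 5, 25],
    "reviews_per_month": [0.5, 2.0, 0.2, 1.1],
})

# numeric_only=True skips text columns such as room_type
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

In recent pandas versions, leaving out `numeric_only=True` on a frame with text columns raises an error instead of silently dropping them.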

#### Chart - 10 - Pair Plot 

In [None]:
# Pair Plot visualization code
sns.pairplot(df,hue='number_of_reviews')

##### 1. Why did you pick the specific chart?

Seaborn's pairplot plots the pairwise relationships between the variables in a dataset and gives them in a single large picture. Here I have used the number_of_reviews variable (as my objective is to understand guest and host behaviour) to check its relationship with the other variables. This is essentially a way to get to know the data and see how the target variable is related to the rest of the variables.

## **5. Solution to Business Objective**





1. There is more demand for private rooms and entire homes/apartments, so Airbnb can increase the number of these room types and limit shared rooms. This would help it grow its business and get more people to stay.
2. A few attributes, such as amenities, are missing. They could help analyze guest behaviour further.
3. Manhattan has the most listings, followed by Brooklyn and Queens. Staten Island has the fewest listings.
4. Individual hosts are busier than corporate hosts. Airbnb should look into this and encourage corporate hosts to provide the same facilities that individual hosts provide.
5. Reviews need to be more specific to make sense of customer behaviour. In this project, reviews are assumed to be positive.
6. Airbnb should give incentives to busy individual hosts to encourage them to maintain their performance.












# **Conclusion**

After analyzing data from the Airbnb rental market, it was found that Manhattan and Brooklyn are the most popular areas for hosts to do business. Customers are willing to pay the highest prices in these areas, where the maximum listing price reaches approximately $10,000. On the other hand, the lowest listing price is around $10.

The majority of customers prefer private rooms and entire homes. Organizations are the top hosts by number of listings; however, customers prefer individual hosts over company-hosted properties.

Interestingly, Staten Island properties are more available throughout the year than properties in other areas. Lastly, the hosts with the most listings are Sonder, Blueground, and Kara.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***