<a href="https://colab.research.google.com/github/navinsinghdo/AirBnb1/blob/main/AirBnB_Booking_Analysis_(_EDA)_Capstone_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **AirBnB Booking Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **GitHub Link -**

https://github.com/navinsinghdo/capstone.git

# **Problem Statement**


Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Analyzing the millions of listings provided through Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers' and providers' (hosts') behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.
This dataset has around 49,000 observations and 14 columns, with a mix of categorical and numeric values.

**Explore and analyze the data to discover key insights (not limited to these) such as:**

What can we learn about different hosts and areas?

What can we learn from predictions? (e.g., locations, prices, reviews, etc.)

Which hosts are the busiest and why?

Is there any noticeable difference in traffic among different areas, and what could be the reason for it?

#### **Define Your Business Objective?**

To understand the customer demographics and how the business model can be made more profitable.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from prettytable import PrettyTable
from tabulate import tabulate

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
airbnb_df = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Airbnb_NYC_2019.csv')

### Dataset First View

In [None]:
# Dataset First Look
airbnb_df.head()

In [None]:
airbnb_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows:", airbnb_df.shape[0])
print("Number of columns:", airbnb_df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
airbnb_df.info()

In [None]:
# check the data types of the variables
airbnb_df.dtypes

When it comes to data, there are many different sorts of quality issues, which is why data cleaning is one of the most time-consuming aspects of data analysis.

Formatting issues (e.g., rows and columns merged), missing data, duplicated rows, spelling discrepancies, and so on could all be present. These difficulties could make data analysis difficult, resulting in inaccuracies or inappropriate results. As a result, these issues must be addressed before data can be analyzed. Data cleaning is frequently done in an unplanned, difficult-to-define manner.

#### Duplicate Values

Duplicate values are observations that are repeated more than once. They add no value to the analysis and bias it towards the repeated values, so they should be removed.

In [None]:
# check for duplicates present in the dataset
airbnb_df.duplicated().sum()
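
The check above only counts fully duplicated rows. A minimal sketch of how such rows could be inspected and dropped is shown below. It runs on a small made-up frame; `toy_df` and its values are illustrative assumptions, not rows from the Airbnb data.

```python
import pandas as pd

# Toy data (hypothetical values) with one fully duplicated row
toy_df = pd.DataFrame({
    "host_id": [1, 2, 2, 3],
    "price": [100, 80, 80, 120],
})

# Count rows that repeat an earlier row exactly
n_duplicates = toy_df.duplicated().sum()

# Keep the first occurrence of each row and drop the repeats
deduplicated_df = toy_df.drop_duplicates(keep="first").reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print(deduplicated_df)
```

`keep="first"` is the pandas default; it is spelled out here only to make the choice visible.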

#### Missing Values/Null Values

Missing values are observations for which no value is recorded. They need to be treated for our analysis to give correct results. Missing values are usually represented as NaN, null, or None in the dataset. The df.info() function reports the non-null count of each column, which reveals where values are missing.
Missing values can be dealt with by:
1. Deleting the columns with missing data
2. Deleting the rows with missing data
3. Imputing the missing data with an appropriate value
4. Imputing the missing data and flagging it with an additional indicator column
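
The four options listed above can be sketched as follows. This is only an illustration on made-up data: the frame `toy_df`, the 50% threshold, and the `price_was_missing` column name are assumptions chosen for the example, not part of the Airbnb dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in several columns
toy_df = pd.DataFrame({
    "host_name": ["Ann", None, "Raj", "Mei"],
    "price": [100.0, 80.0, np.nan, 120.0],
    "notes": [None, None, None, "quiet"],
})

# 1. Drop columns where most values are missing (assumed cut-off: more than 50%)
mostly_missing = toy_df.columns[toy_df.isna().mean() > 0.5]
without_sparse_cols = toy_df.drop(columns=mostly_missing)

# 2. Drop rows that lack a value in a column we cannot do without
without_incomplete_rows = toy_df.dropna(subset=["price"])

# 3. Impute with an appropriate value (median for numbers, placeholder for text)
imputed = toy_df.copy()
imputed["price"] = imputed["price"].fillna(imputed["price"].median())
imputed["host_name"] = imputed["host_name"].fillna("No Name")

# 4. Add an indicator column that records where a value was missing
imputed["price_was_missing"] = toy_df["price"].isna()

print(imputed)
```

Which option fits depends on how much data is missing and how important the column is to the analysis.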

In [None]:
# check for missing values present in the dataset
airbnb_df.isna().any()

In [None]:
# Count of missing values per column
airbnb_df.isna().sum()

### What did you know about your dataset?

Two columns contain missing values.

The **id** and **name** columns are irrelevant for our analysis, so we can drop them.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airbnb_df.columns

In [None]:
# Dataset Describe
airbnb_df.describe()

### Variables Description

**id**: A unique ID given to each property listed on Airbnb NYC (numerical variable).

**name**: The name of the listed property (categorical variable).

**host_id**: A unique ID given to the host of the property (numerical variable).

**host_name**: The name of the host of the listed property (categorical variable).

**neighbourhood_group**: A large area that contains many smaller neighbourhoods (categorical variable). There are 5 neighbourhood groups in the data:

Manhattan

Brooklyn

Staten Island

Queens

Bronx

**neighbourhood**: The smaller neighbourhoods within NYC (categorical variable).

**latitude**: Latitude coordinate of the property.

**longitude**: Longitude coordinate of the property.

**room_type**: The type of room in the listed property (categorical variable). There are three room types in the data: Entire home/apt, Private room, and Shared room.

**price**: The price per day of stay in the listed property (numerical variable).

**minimum_nights**: The minimum number of nights a guest must book or pay for (numerical variable).

**number_of_reviews**: The number of reviews the property has received (numerical variable).

**calculated_host_listings_count**: The number of listings the host has in NYC (numerical variable).

**availability_365**: The number of days in a year (out of 365) on which the property is available (numerical variable).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in airbnb_df.columns.tolist():
  print("No. of unique values in ",i,"is",airbnb_df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

**Missing value treatment**


In [None]:
# Write your code to make your dataset analysis ready.

# drop the unnecessary columns from the dataframe
airbnb_df.drop(['id', 'name'], axis=1, inplace=True)
airbnb_df.head()

In [None]:
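# replace all NaN values in host_name with 'No Name'
# (assign the result back instead of using inplace on an attribute-accessed column)
airbnb_df['host_name'] = airbnb_df['host_name'].fillna('No Name')
airbnb_df.head()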

In [None]:
# check for any missing value
airbnb_df.isnull().any()

In [None]:
airbnb_df.info()

### What all manipulations have you done and insights you found?

The dataset was checked for duplicate rows, the **id** and **name** columns were dropped, and missing **host_name** values were replaced with 'No Name'. The dataset is now clean and ready for exploration and analysis.

In [None]:
# description of the clean dataset
airbnb_df.describe()

We can see that a few properties have a listed price of 0. This is probably a data collection error, since we do not expect anyone to offer Airbnb stays for free. We will exclude listings with a price of 0 from our analysis.

In [None]:
# exclude the records which have price as zero
airbnb_df = airbnb_df.loc[airbnb_df['price'] > 0]

In [None]:
airbnb_df.shape

In [None]:
# check the description
airbnb_df.describe()

The price looks reasonable now. Although the maximum price is $10,000, we treat it as a genuine observation rather than an outlier: a few listings have very high values for minimum nights, which may explain the higher price for those stays.
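
One way to examine this judgement is to flag unusually high prices and look at their minimum nights before deciding what to do with them. The sketch below applies Tukey's 1.5 × IQR rule, which is a swapped-in convention and not something used elsewhere in this notebook, to made-up values; `toy_df` is a hypothetical frame.

```python
import pandas as pd

# Toy data (hypothetical values): one very expensive listing with a long minimum stay
toy_df = pd.DataFrame({
    "price": [50, 60, 70, 80, 90, 100, 110, 10000],
    "minimum_nights": [1, 2, 1, 3, 2, 1, 2, 100],
})

# Tukey's rule: values above Q3 + 1.5 * IQR are candidate outliers
q1 = toy_df["price"].quantile(0.25)
q3 = toy_df["price"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Inspect the flagged rows instead of dropping them blindly
high_priced = toy_df[toy_df["price"] > upper_fence]

print("Upper fence:", upper_fence)
print(high_priced)
```

If the flagged rows also show long minimum stays, that supports keeping them; if not, they may deserve a closer look.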

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

The dataset is now ready to be explored.
We will be doing some univariate, bivariate and multivariate analysis to find out interesting insights.

We will uncover a few insights from the dataset, such as:

Which hosts have the most listings in NYC?

Which hosts are the busiest?

Which neighbourhood group has the most listings in NYC?

Which neighbourhood group and neighbourhood are the most expensive and the most affordable?

What room types are available in the different neighbourhood groups?

Which neighbourhood group or neighbourhood has the highest availability out of 365 days?

What is the cost of each room type in NYC?

And many more.

#### Chart - 1

**What is the range of prices of the Airbnb listings in NYC?**

In [None]:
# check the distribution of price
# displot is figure-level and creates its own figure, so size it via height/aspect
sns.displot(airbnb_df['price'], bins=100, height=6, aspect=2)
plt.show()

##### 1. Why did you pick the specific chart?

We picked this chart to understand how listing prices are distributed across NYC.

##### 2. What is/are the insight(s) found from the chart?

The price column is heavily right-skewed, with most prices between $10 and $200.

A very small number of listings have a very high minimum-nights requirement, and for those the price can go as high as $10,000. Hence, they are not treated as outliers here.
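
The skew described above can also be expressed as a number. The following sketch uses made-up prices (the `prices` series is an assumption for illustration) to compare the skewness before and after a `log1p` transform, a common way to make a long right tail easier to read.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices (hypothetical values)
prices = pd.Series([40, 60, 75, 90, 110, 150, 200, 10000], name="price")

# Positive skewness means a long right tail; the mean is pulled above the median
print("Skewness:", prices.skew())
print("Mean:", prices.mean(), "Median:", prices.median())

# log1p compresses the tail, which makes a histogram easier to read
log_prices = np.log1p(prices)
print("Skewness after log1p:", log_prices.skew())
```

The same idea could be applied to the histogram by plotting the transformed values instead of the raw prices.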

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights indicate how prices are distributed, which can inform pricing decisions.

#### Chart - 2

**How many unique Airbnb hosts are there in NYC?**

In [None]:
# find unique hosts
airbnb_df[['host_id']].nunique()

There are 37455 unique Airbnb hosts in New York City.

Comparing the number of hosts with the number of listings shows that some hosts have multiple properties listed on Airbnb NYC.
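
The comparison between hosts and listings can be sketched as follows on a small made-up frame (`toy_df` and its values are assumptions for illustration only).

```python
import pandas as pd

# Toy listings (hypothetical values): host 1 owns three listings, host 3 owns two
toy_df = pd.DataFrame({
    "host_id": [1, 1, 1, 2, 3, 3],
    "host_name": ["Ann", "Ann", "Ann", "Raj", "Mei", "Mei"],
})

n_listings = len(toy_df)
n_hosts = toy_df["host_id"].nunique()

# Listings per host; any value above 1 marks a multi-listing host
listings_per_host = toy_df["host_id"].value_counts()
multi_listing_hosts = listings_per_host[listings_per_host > 1]

print(f"{n_listings} listings from {n_hosts} hosts")
print(f"{len(multi_listing_hosts)} hosts have more than one listing")
```

Whenever the number of listings exceeds the number of unique hosts, at least one host must have more than one listing.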

Who are the hosts with the most property listings on Airbnb NYC?

In [None]:
# top 20 hosts on the basis of count of listings
top_hosts_listings = airbnb_df.groupby(['host_id','host_name'])['host_id'].count().sort_values(ascending=False)[:20]
print('The top 20 hosts with the most property listings in NYC are:\n')
top_hosts_listings_df = top_hosts_listings.reset_index(name='Listing Count')
table = tabulate(top_hosts_listings_df, headers=['Host ID', 'Host Name', 'Listing Count'], tablefmt='pretty')
print(table)

# plot the top 20 hosts on the basis of count of listings
top_hosts_listings.plot.barh().invert_yaxis()
plt.xlabel('Number of properties listed')
plt.ylabel('Host Name and Host ID')
plt.title('Top 20 Hosts having the highest number of properties listed')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a bar plot to identify the hosts with the most listings because the host is a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

Some hosts have more than 100 properties listed on Airbnb NYC.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can see from the above plot that Sonder (NYC) has the highest number of listings, with a total of 327 properties on Airbnb NYC. Providing an official badge for hosts who have 50+ properties registered on Airbnb would recognize those hosts and help customers identify popular ones.

#### Chart - 3

**Which hosts have the most reviews?**

In [None]:
# top 10 hosts on the basis of reviews
top_hosts_reviews = airbnb_df.groupby(['host_id','host_name'])['number_of_reviews'].sum().sort_values(ascending=False)[:10]
print('The top 10 hosts with the most reviews:\n')
top_hosts_reviews_df = top_hosts_reviews.reset_index(name='review_count')
table = tabulate(top_hosts_reviews_df, headers=['Host ID','Host Name','Review Count'],tablefmt='fancy_grid')
print(table)

# plot the top 10 hosts on the basis of reviews
top_hosts_reviews.plot.barh().invert_yaxis()
plt.ylabel('Host Name and Host ID')
plt.xlabel('Total number of reviews')
plt.title('Top 10 Hosts having the highest number of reviews')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a bar plot to identify the most popular hosts in terms of reviews because the host is a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the 10 hosts who have received the highest total number of reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help create a positive business impact, as a high number of reviews often indicates high customer satisfaction and a positive customer experience. This can help hosts create a more welcoming atmosphere, attract more guests, and generate more revenue.

#### Chart - 4

**Which is the most affordable borough?**

In [None]:
# Median price per borough, sorted from most to least affordable
most_affordable_borough = airbnb_df.groupby('neighbourhood_group')['price'].median().sort_values(ascending=True).reset_index()
print("Median price per borough, most affordable first:\n")
table = tabulate(most_affordable_borough,headers=['Neighbourhood Group','Price'],tablefmt='grid')
print(table)

# Plotting the most affordable borough per day
sns.barplot(x='price',y='neighbourhood_group', data=most_affordable_borough)
plt.xlabel("Price")
plt.ylabel("Different Boroughs of NYC")
plt.title("Most affordable borough in NYC per day")
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart was chosen to compare a numerical variable (price) across a categorical one (borough), showing which NYC boroughs are the most affordable.

##### 2. What is/are the insight(s) found from the chart?

It shows which of the five NYC boroughs is the most affordable, based on the median price.
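
Because prices are skewed, the choice between the median and the mean changes how a borough looks. The sketch below illustrates this on made-up data; the group labels "A" and "B" and all prices are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values): one extreme price in group "A"
toy_df = pd.DataFrame({
    "neighbourhood_group": ["A", "A", "A", "B", "B", "B"],
    "price": [100, 120, 5000, 90, 110, 130],
})

# Compare both summaries side by side, cheapest median first
price_summary = (
    toy_df.groupby("neighbourhood_group")["price"]
    .agg(["median", "mean"])
    .sort_values("median")
)

print(price_summary)
```

A single extreme listing moves the mean of group "A" far away from its typical price, while the median barely changes.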

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart helps the business understand which boroughs are the most affordable, so marketing and pricing strategies can be targeted accordingly.

#### Chart - 5

**What are the different types of room available in the properties in NYC?**

In [None]:
# Room types in different neighbourhood groups
room_type_neighbourhood_groups = airbnb_df.groupby(['neighbourhood_group', 'room_type'])['room_type'].count()
print('The different room types available in different neighbourhood groups: \n')
room_type_neighbourhood_groups_df = room_type_neighbourhood_groups.reset_index(name='Count')
table = tabulate(room_type_neighbourhood_groups_df, headers=['Neighbourhood Group', 'Room Type', 'Count'], tablefmt='pretty')
print(table)

# Plot the distribution
plt.figure()
sns.countplot(data=airbnb_df, x='neighbourhood_group', hue='room_type')
plt.title('Room types in different neighbourhood groups')
plt.show()

#### Chart - 6

**Types of rooms available around NYC**

In [None]:
# different room types
different_room_types = airbnb_df['room_type'].value_counts()
df = pd.DataFrame({'Room Types': different_room_types.index, 'Count': different_room_types.values})
print('Number of listings for each room type:\n')
table = tabulate(df, headers=["Room Types","Count"],tablefmt='fancy_grid')
print(table)

# reuse the counts computed above for the pie chart
plt.figure()
plt.title('Share of listings by room type')
plt.pie(different_room_types, labels=different_room_types.index, autopct='%1.1f%%', startangle=180)
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart compares the counts of each room type across the NYC boroughs, and the pie chart shows each room type's share of all listings in NYC.

##### 2. What is/are the insight(s) found from the chart?

It shows which room types are available across NYC and how common each one is.
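
The percentages behind both charts can also be computed directly. The sketch below does this on made-up listings; `toy_df` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy listings (hypothetical values)
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Queens"] * 4,
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Entire home/apt", "Private room",
        "Private room", "Private room", "Private room", "Shared room",
    ],
})

# Overall share of each room type, in percent (what a pie chart displays)
overall_share = toy_df["room_type"].value_counts(normalize=True) * 100

# Share of each room type within each borough; every row sums to 100
share_by_borough = pd.crosstab(
    toy_df["neighbourhood_group"], toy_df["room_type"], normalize="index"
) * 100

print(overall_share.round(1))
print(share_by_borough.round(1))
```

Normalizing within each borough makes boroughs of very different sizes comparable, which raw counts do not.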

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights show how room types are distributed across the NYC boroughs. Pricing strategy can be set or adjusted according to the availability of each room type.

#### Chart - 7

**Host-listed prices in each borough**

In [None]:
# restrict to listings priced under 100 dollars
airbnb_df_1 = airbnb_df.loc[airbnb_df['price']<100].sort_values(['price'])

plt.figure()
plt.title('Price distribution (under 100 USD) by neighbourhood group')
sns.boxplot(data=airbnb_df_1, x='neighbourhood_group', y='price')
plt.show()

##### 1. Why did you pick the specific chart?

This chart was chosen to show the price distribution in each borough for listings priced under 100 dollars.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how prices under 100 dollars are distributed within each NYC borough.
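
The numbers that a box plot draws can be tabulated directly. The sketch below computes the quartiles per group on made-up data; the group labels and prices are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "neighbourhood_group": ["A"] * 5 + ["B"] * 5,
    "price": [40, 50, 60, 70, 80, 20, 30, 40, 50, 60],
})

# The box in a box plot spans the 25th to 75th percentile, with a line at the median
quartiles = (
    toy_df.groupby("neighbourhood_group")["price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

print(quartiles)
```

A table like this makes it easier to quote exact price ranges than reading them off the chart.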

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It helps to differentiate the price ranges of the different boroughs and helps customers plan a budget stay according to their needs.

#### Chart - 8

**Locations/Map of Neighbourhood Groups**

In [None]:

# plot neighbourhood groups on the basis of latitude and longitude
plt.figure(figsize=(10, 8))
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group', data=airbnb_df)
plt.title('Locations of different neighbourhood groups in NYC over a map')
plt.show()

Percentage of properties around NYC

In [None]:
# distribution of the properties around NYC through a pie chart
neighbourhood_groups = airbnb_df['neighbourhood_group'].value_counts()
df = pd.DataFrame({'Neighbourhood Group': neighbourhood_groups.index, 'Count': neighbourhood_groups.values})
print('Number of properties listed in each neighbourhood group:\n')
table = tabulate(df, headers=["Neighbourhood Group","Count"],tablefmt='fancy_grid')
print(table)

# reuse the counts computed above for the pie chart
plt.figure(figsize=(10, 8))
plt.title('Distribution of properties across NYC boroughs')
plt.pie(neighbourhood_groups, labels=neighbourhood_groups.index, autopct='%1.1f%%', startangle=180)
plt.show()

##### 1. Why did you pick the specific chart?

The scatter plot chart was selected to visualize the distribution of the properties based upon the geographical location across NYC.

The pie chart shows each borough's share of all properties listed in NYC.

##### 2. What is/are the insight(s) found from the chart?

It shows how many properties are listed in each borough. As seen from the chart, Manhattan has the most properties listed, followed by Brooklyn, Queens, the Bronx, and Staten Island.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the pie chart can help adjust listing prices according to demand in boroughs with fewer listings, such as Staten Island and the Bronx, and let business partners know which areas are popular based on the density of properties.

#### Chart - 9

**Which neighbourhoods are the most expensive and the most affordable?**

In [None]:
# find the top 10 costly neighbourhoods in NYC
high_priced_neighbourhoods = airbnb_df.groupby(['neighbourhood'])['price'].mean().sort_values(ascending=False)[:10]
print('The most expensive neighbourhoods: \n')
high_priced_neighbourhoods_df = high_priced_neighbourhoods.reset_index(name='Average Price')
table = tabulate(high_priced_neighbourhoods_df, headers = ['Neighbourhood','Average Price'],tablefmt='grid')
print(table)

# plot the costliest neighbourhoods
sns.barplot(y=high_priced_neighbourhoods.index,x=high_priced_neighbourhoods.values)
plt.ylabel('Neighbourhoods')
plt.xlabel('Average Price per day')
plt.title('Top 10 costliest neighbourhoods')
plt.show()

In [None]:
# find the 10 cheapest neighbourhoods in NYC
low_priced_neighbourhoods = airbnb_df.groupby(['neighbourhood'])['price'].mean().sort_values(ascending=True)[:10]
print('The most affordable neighbourhoods: \n')
low_priced_neighbourhoods_df = low_priced_neighbourhoods.reset_index(name='Average Price')
table = tabulate(low_priced_neighbourhoods_df, headers = ['Neighbourhood','Average Price'],tablefmt='grid')
print(table)

# plot the cheaper neighbourhoods
sns.barplot(y=low_priced_neighbourhoods.index,x=low_priced_neighbourhoods.values)
plt.ylabel('Neighbourhoods')
plt.xlabel('Average Price per day')
plt.title('Top 10 cheapest neighbourhoods')
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are used to visualize the 10 costliest and the 10 cheapest neighbourhoods in NYC based on the average price per day.

##### 2. What is/are the insight(s) found from the chart?

The first chart shows the 10 neighbourhoods with the highest average price per day.

The second chart shows the 10 neighbourhoods with the lowest average price per day.
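
An average price can be misleading when a neighbourhood has only a handful of listings. The sketch below reports the listing count next to each average and ranks only neighbourhoods above a minimum count. The data and the `MIN_LISTINGS` threshold are assumptions made for illustration; the notebook itself ranks all neighbourhoods.

```python
import pandas as pd

# Toy data (hypothetical values): "Tiny" has a single, very expensive listing
toy_df = pd.DataFrame({
    "neighbourhood": ["Tiny", "Midtown", "Midtown", "Midtown",
                      "Astoria", "Astoria", "Astoria"],
    "price": [800, 200, 250, 300, 90, 100, 110],
})

MIN_LISTINGS = 3  # assumed threshold, chosen only for illustration

# Average price together with the number of listings behind it
neighbourhood_stats = toy_df.groupby("neighbourhood")["price"].agg(
    average_price="mean", listings="count"
)

# Rank only neighbourhoods with enough listings for the mean to be meaningful
reliable = neighbourhood_stats[neighbourhood_stats["listings"] >= MIN_LISTINGS]
ranked = reliable.sort_values("average_price", ascending=False)

print(ranked)
```

Without the filter, "Tiny" would top the ranking on the strength of one listing.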

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help adjust pricing strategies based on the average prices in these neighbourhoods, so that listings remain competitive and attract the right customers. Other businesses can also use these insights to establish a presence in a neighbourhood once they understand its price range.

#### Chart - 10

**Hosts offering the longest stays for the price paid**

In [None]:
df = airbnb_df.sort_values('minimum_nights',ascending=False)[:20]
df[['host_id','host_name','neighbourhood_group','neighbourhood','room_type','minimum_nights','price']]

Neighbourhoods that offer longer stays

In [None]:
top_10_neighbourhoods_min_nights = airbnb_df.groupby(['neighbourhood']).agg({'minimum_nights':'mean','price':'mean'}).sort_values('minimum_nights',ascending=False)[:10]
print('The neighbourhoods preferred for longer stays: \n')
top_10_neighbourhoods_min_nights

##### 1. Why did you pick the specific chart?

The data is presented as a table because a tabular format is clearer here.

##### 2. What is/are the insight(s) found from the chart?

Host Genevieve (Entire home/apt) offers the longest stay for the price paid.

Spuyten Duyvil is the neighbourhood that offers the longest stays for the price paid.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can recommend to customers that Entire home/apt listings are available for long stays, based on the price.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**Price Analysis and Recommendations**: This analysis can help Airbnb guests and property owners understand price trends, and owners can adjust their prices accordingly. Hosts in expensive neighbourhoods can be advised to emphasize the benefits they offer, and hosts in lower-priced neighbourhoods to attract budget travellers.

**Host Performance and Market Insights**: This insight can be valuable to Airbnb itself or to other stakeholders who want to understand host performance and identify successful hosts.

**Borough-level analysis**: This information can be useful for travellers or visitors looking for more affordable options. Guests can be given recommendations on the most affordable borough to stay in, highlighting the potential cost savings and benefits of those areas.

**Room Type and Neighbourhood Group Analysis:** This provides insight into how room types are distributed across the neighbourhood groups. It can help guests understand their options and make the right choice based on their preferences. Guests can be recommended the most common room types in each neighbourhood group, allowing them to choose the type of accommodation they need.

**Geospatial Analysis:** This geospatial analysis helps visitors visualize how listings are distributed across the neighbourhood groups in NYC. Hosts can be encouraged to list more properties in Staten Island and the Bronx.

# **Conclusion**

The EDA of the Airbnb NYC dataset has provided several perspectives on pricing dynamics and popular neighbourhoods.
The data shows that NYC suits both budget and luxury travellers.
The neighbourhood analysis shows which locations are popular and which room types are available in those locations.
The room type analysis shows which room types are widespread and which are scarce.
The minimum nights analysis indicates which areas are most suitable for long, medium, and short stays.
Through these analyses, Airbnb can make better recommendations to more customers in the future.
We have reached the end of our analysis of Airbnb listings in NYC. We have explored and visualized most of the features and uncovered insights that can assist the company in making decisions to attract more tourists. These insights can help customers, hosts, and the company optimize their offerings so that every party benefits.
