# **Project Name**    - Airbnb Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name       -** Krishna kumar singh

# **Project Summary -**

Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Data analysis on the millions of listings provided through Airbnb is a crucial factor for the company. These millions of listings generate a lot of data that can be analysed and used for security, business decisions, understanding of customers' and providers' (hosts') behaviour and performance on the platform, guiding marketing initiatives, implementation of innovative additional services and much more. This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. The goal is to explore and analyse the data to discover key insights.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Since 2008, guests and hosts have used Airbnb to expand their travelling possibilities and to find a more unique, personalised way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Data analysis on the millions of listings provided through Airbnb is a crucial factor for the company. These millions of listings generate a lot of data that can be analysed and used for security, business decisions, understanding of customers' and providers' (hosts') behaviour and performance on the platform, guiding marketing initiatives, implementation of innovative additional services and much more. This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. The goal is to explore and analyse the data to discover key insights.**

#### **Define Your Business Objective?**

This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. The objective is to handle missing values, dropping or filling them where possible, and to build visualizations from the dataset that show where the business is growing or losing ground.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive')


In [None]:
file_path = "/content/drive/MyDrive/project_data/Airbnb NYC 2019.csv"
Airbnb_df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look

Airbnb_df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
Airbnb_df.shape

# Shape gives count of row and column, here rows = 48895 and columns = 16

### Dataset Information

In [None]:
# Dataset Info
Airbnb_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = Airbnb_df.duplicated().sum()
print(f"Duplicate rows: {duplicate_count}")

# drop_duplicates removes any fully duplicated rows from the dataset
Airbnb_df.drop_duplicates(inplace=True)

unique_df = Airbnb_df.shape[0]
unique_df
# The number of rows is unchanged after dropping duplicates, which means there are no duplicate rows in this dataset

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

missing_value = Airbnb_df.isnull().sum().sort_values(ascending = False)
missing_value
# name has 16 null values, host_name has 21, and last_review and reviews_per_month each have 10052 null values

In [None]:
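# Visualizing the missing values
# Each highlighted cell in the heatmap marks a missing value in that row and column
plt.figure(figsize=(10, 6))
sns.heatmap(Airbnb_df.isnull(), cbar=False, yticklabels=False)
plt.title('Missing Values by Column')
plt.show()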

### What did you know about your dataset?

This dataset contains information about Airbnb listings in New York City (NYC). It has 48895 rows and 16 columns. Basically, the data can show which neighbourhoods and neighbourhood groups are good places to stay, based on price, room type, reviews and availability. It also gives the geographical location of each listing through its latitude and longitude.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
Airbnb_df.columns

In [None]:
# Dataset Describe

Airbnb_df.describe()

### Variables Description





#####  **The column and the data it represents**

#####  **1.  id**  - Unique listing ID
#####  **2.  name** - Name of the listings
#####  **3.  host_id** - Unique host ID
#####  **4.  host_name** - Name of the host
#####  **5.  neighbourhood_group** - Location
#####  **6.  neighbourhood** - Area
#####  **7.  latitude** - Latitude range
#####  **8.  longitude** - Longitude range
#####  **9.  room_type** - Type of listing
#####  **10. price** - Price of listing
#####  **11. minimum_nights** - Minimum nights to be paid for
#####  **12. number_of_reviews** - Number of reviews
#####  **13. last_review** - Date of the last review
#####  **14. reviews_per_month** - Number of reviews per month
#####  **15. calculated_host_listings_count** - Total number of listings the host has
#####  **16. availability_365** - Availability around the year
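
A quick way to confirm that a loaded file matches the columns described above is to compare its header against the expected names. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the toy frame is made up for illustration.

```python
import pandas as pd

# Column names as they are used in this notebook's code and the list above
EXPECTED_COLUMNS = [
    'id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood',
    'latitude', 'longitude', 'room_type', 'price', 'minimum_nights',
    'number_of_reviews', 'last_review', 'reviews_per_month',
    'calculated_host_listings_count', 'availability_365',
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Made-up, empty frame with only three of the columns, for illustration only
toy = pd.DataFrame(columns=['id', 'name', 'price'])
print(missing_columns(toy))
```

In the notebook itself, `missing_columns(Airbnb_df)` would be expected to return an empty list if the file loaded with all 16 columns.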

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

print(Airbnb_df.apply(lambda col:col.unique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
missing_value[:4]

In [None]:
# As checked above, last_review is the only column with the wrong data type, so first change it from object to datetime

Airbnb_df['last_review'] = pd.to_datetime(Airbnb_df['last_review'])

# Now fill last_review; we should not drop the whole column when only 10052 of 48895 values are null

average_date = Airbnb_df['last_review'].mean()
# Assign the result back instead of using inplace=True on a selected column, which newer pandas versions warn about
Airbnb_df['last_review'] = Airbnb_df['last_review'].fillna(average_date)

In [None]:
# For reviews_per_month we fill the missing values with 0

Airbnb_df['reviews_per_month'] = Airbnb_df['reviews_per_month'].fillna(0)

In [None]:
# For host_name we fill the missing values with 'unknown'
Airbnb_df['host_name'] = Airbnb_df['host_name'].fillna('unknown')

In [None]:
# For name we fill the missing values with 'unknown'
Airbnb_df['name'] = Airbnb_df['name'].fillna('unknown')

In [None]:
# Now we check the count of null values

Airbnb_df.isnull().sum()

In [None]:
# Again here we will check data type of all columns given

Airbnb_df.info()

### What all manipulations have you done and insights you found?

First we inspected the data and found four columns with missing/null values. Checking the data types with .info() showed that only last_review had the wrong type, so we converted it from object to datetime format and filled its missing values with the mean review date. In the reviews_per_month column, null values were replaced with zero; similarly, null values in the host_name and name columns were replaced with "unknown". Finally, .isnull().sum() confirms that the count of null values is now zero, and .info() shows the data type of every column.
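
The cleaning steps described above can be collected into a single function, which makes them easier to re-run and check. This is a minimal sketch under stated assumptions: `clean_listings` is a hypothetical helper name, the toy rows are made up, and only the four columns with missing values are modelled.

```python
import numpy as np
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df; the input frame is left unchanged."""
    cleaned = df.copy()
    # Parse review dates; missing values become NaT
    cleaned['last_review'] = pd.to_datetime(cleaned['last_review'])
    # Fill missing dates with the mean date, as in the wrangling cells above
    cleaned['last_review'] = cleaned['last_review'].fillna(cleaned['last_review'].mean())
    # A listing without reviews is treated as having zero reviews per month
    cleaned['reviews_per_month'] = cleaned['reviews_per_month'].fillna(0)
    # Text columns get an explicit placeholder
    for col in ('name', 'host_name'):
        cleaned[col] = cleaned[col].fillna('unknown')
    return cleaned


# Made-up rows for illustration only, not taken from the real dataset
toy = pd.DataFrame({
    'name': ['Cozy room', None, 'Loft'],
    'host_name': ['Ana', 'Ben', None],
    'last_review': ['2019-01-01', None, '2019-01-03'],
    'reviews_per_month': [1.5, np.nan, 0.2],
})
cleaned_toy = clean_listings(toy)
print(cleaned_toy.isnull().sum())
```

Working on a copy keeps the raw data available for comparison, which is one reasonable design choice; the notebook cells above modify `Airbnb_df` directly instead.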

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 -  Popular neighbourhood group

In [None]:
# Chart - 1 visualization code
neighbourhood_reviews = Airbnb_df.groupby('neighbourhood_group')['number_of_reviews'].sum().reset_index()

#  Sort the neighborhoods based on the number of reviews
neighbourhood_reviews_sorted = neighbourhood_reviews.sort_values(by='number_of_reviews', ascending=False)

#  Plot the bar chart
plt.figure(figsize=(10, 6))
plt.bar(neighbourhood_reviews_sorted['neighbourhood_group'], neighbourhood_reviews_sorted['number_of_reviews'], color='skyblue')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Number of Reviews')
plt.title('Most Preferred Neighbourhoods in Airbnb Bookings')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used here to show which neighbourhood group has the highest number of reviews. Bar charts are preferred when we need to compare a value across different categories.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Brooklyn has the maximum number of reviews, which suggests it is the most popular neighbourhood group of all.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From the visualization above, we can say that most tourists choose or like to stay in Brooklyn and Manhattan.

The Bronx and Staten Island have the fewest reviews, which indicates that they are less popular and that most tourists do not prefer to stay there.
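
Total reviews depend partly on how many listings a neighbourhood group has, so one way to check the popularity reading is to look at reviews per listing as well. The sketch below uses made-up numbers purely to show the calculation; it says nothing about the real ranking.

```python
import pandas as pd

# Made-up rows for illustration only, not taken from the real dataset
toy = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn'] * 4 + ['Queens'] * 2,
    'number_of_reviews': [10, 20, 30, 40, 40, 50],
})

# Total reviews, number of listings and average reviews per listing for each group
summary = toy.groupby('neighbourhood_group')['number_of_reviews'].agg(
    total_reviews='sum', listings='count', reviews_per_listing='mean'
)
print(summary.sort_values('reviews_per_listing', ascending=False))
```

In this toy example the group with the most total reviews is not the one with the most reviews per listing, which is why the two measures are worth reading together.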

#### Chart - 2 - Price Distribution (Histogram)

In [None]:
# Chart - 2 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(data=Airbnb_df, x='price', bins=50, kde=True)
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()






##### 1. Why did you pick the specific chart?

The histogram with the KDE curve provides a comprehensive view of how Airbnb prices are distributed, allowing for easy identification of central tendencies, spread, and any potential skewness or outliers in the data.








##### 2. What is/are the insight(s) found from the chart?

It identifies the most common price ranges for Airbnb listings. It also shows whether there are any outliers (extremely high or low prices) and whether the price distribution is skewed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The histogram shows how Airbnb prices are distributed. By looking at the height of the bars, you can see which price ranges have the most listings. The width of the histogram and the shape of the KDE curve provide insights into the spread of prices. A wider spread indicates more variability in prices. Looking at the graph, we can say that the number of listings falls as the price rises. A histogram counts listings rather than bookings, so this suggests, but does not prove, that prices should not be set much higher than the common range if the business is to grow.
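
Skewness and outliers can also be checked numerically rather than read off the chart. The following is a minimal sketch with made-up prices; it uses the sample skewness from pandas and the common 1.5 × IQR rule, which is one convention among several for flagging outliers.

```python
import pandas as pd

# Made-up prices for illustration only, not taken from the real dataset
prices = pd.Series([50, 60, 75, 80, 100, 120, 150, 200, 1000], name='price')

# Positive skewness means a long tail toward high prices
skewness = prices.skew()

# 1.5 * IQR rule: flag values that lie far outside the middle 50% of prices
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(f"skewness={skewness:.2f}, upper fence={upper_fence:.1f}")
print(outliers.tolist())
```

On the real data the same lines could be applied to `Airbnb_df['price']`.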

#### Chart - 3 - Number of listings in Neighbourhood group

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(10, 6))
sns.countplot(data=Airbnb_df, x='neighbourhood_group')
plt.title('Number of Listings per Neighborhood group')
plt.xlabel('Neighborhood group')
plt.ylabel('Number of Listings')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is usually preferred when we want to compare counts across different categories.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how the number of listings varies across neighbourhood groups, with Manhattan having the maximum number of listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

On the positive side, the chart suggests that people are most likely to want to spend time in Manhattan and Brooklyn. On the negative side, Staten Island has the fewest listings, so it needs to do some work to attract people to visit.

#### Chart - 4 - Price distribution by neighbourhood groups(Box Plot)

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(10, 6))
sns.boxplot(data=Airbnb_df, x='neighbourhood_group', y='price')
plt.title('Price Distribution by Neighborhood group')
plt.xlabel('Neighborhood group')
plt.ylabel('Price')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

A box plot (or box-and-whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It can also highlight outliers.
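
The five-number summary behind a box plot can be computed directly, which makes it easier to quote exact values. This is a minimal sketch with made-up prices; the column labels assigned to `five_number` are names chosen here for readability.

```python
import pandas as pd

# Made-up rows for illustration only, not taken from the real dataset
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan'] * 5 + ['Bronx'] * 5,
    'price': [100, 150, 200, 250, 300, 40, 50, 60, 70, 80],
})

# Minimum, Q1, median, Q3 and maximum of price for each neighbourhood group
five_number = (
    toy.groupby('neighbourhood_group')['price']
       .quantile([0, 0.25, 0.5, 0.75, 1])
       .unstack()
)
five_number.columns = ['min', 'q1', 'median', 'q3', 'max']
# Interquartile range: the height of the box in a box plot
five_number['iqr'] = five_number['q3'] - five_number['q1']
print(five_number)
```

Note that the whiskers drawn by matplotlib and seaborn stop at 1.5 × IQR from the box by default, so on data with outliers they do not reach the true minimum and maximum.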

##### 2. What is/are the insight(s) found from the chart?

Each box represents the distribution of prices within a specific neighborhood group. The length of the box shows the interquartile range (IQR), the range within which the middle 50% of prices fall.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The line inside each box represents the median price for that neighborhood group, which gives an idea of the typical price. Comparing the medians shows which neighbourhood groups are more expensive and which are more affordable.


#### Chart - 5 - Price vs Number of reviews (Scatter Plot)

In [None]:
# Chart - 5 visualization code

plt.figure(figsize=(10, 6))
sns.scatterplot(data=Airbnb_df, x='number_of_reviews', y='price')
plt.title('Price vs. Number of Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('Price')
plt.show()


##### 1. Why did you pick the specific chart?

The scatter plot created by this code will show individual data points representing the relationship between the number of reviews and the price of Airbnb listings.

##### 2. What is/are the insight(s) found from the chart?

By examining the scatter plot, you can see if there is any apparent relationship or trend between the number of reviews and the price. For example:

* If points tend to rise as the number of reviews increases, it may suggest that more reviewed listings are more expensive.
* If points are scattered without any discernible pattern, it may suggest no clear relationship between the number of reviews and the price.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

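The plot is read here as showing points that rise as the number of reviews rises, which would mean that listings with more reviews have higher prices. Because a scatter plot with this many overlapping points is hard to read, this impression should be checked against the correlation value in the heatmap (Chart 11) before acting on it.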
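
A correlation coefficient gives a numeric check on what the scatter plot appears to show. The sketch below uses made-up values; Spearman correlation is computed here as the Pearson correlation of the ranks, which is its standard definition and avoids extra dependencies.

```python
import pandas as pd

# Made-up rows for illustration only, not taken from the real dataset
toy = pd.DataFrame({
    'number_of_reviews': [0, 5, 10, 20, 40, 80],
    'price': [300, 120, 150, 90, 100, 60],
})

# Pearson measures linear association between the two columns
pearson = toy['number_of_reviews'].corr(toy['price'])

# Spearman works on ranks, so it is less sensitive to a few extreme prices
spearman = toy['number_of_reviews'].rank().corr(toy['price'].rank())

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

A value near zero would indicate no clear relationship, while a clearly positive or negative value would support or contradict the reading of the plot.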

#### Chart - 6 - Room type distribution (Pie chart)

In [None]:
# Chart - 6 visualization code

room_type_counts = Airbnb_df['room_type'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(room_type_counts, labels=room_type_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Room Type Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is used here to show the percentage distribution of the different room types.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Entire home/apt is the most common room type, possibly because many guests book it for family vacations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart suggests that mostly families or couples visit these neighbourhood groups, which would explain why most listings are either Entire home/apt or Private room, while shared rooms are very few.

#### Chart - 7 - Price distribution by Neighbourhood group (Violin Plot)

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(10, 6))
sns.violinplot(data=Airbnb_df, x='neighbourhood_group', y='price')
plt.title('Price Distribution by Neighborhood group')
plt.xlabel('Neighborhood group')
plt.ylabel('Price')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

A violin plot is a method of plotting numeric data and can be used to visualize the distribution of the data across different categories. In this case, the violin plot shows the distribution of Airbnb prices for different neighborhood groups.

##### 2. What is/are the insight(s) found from the chart?


The width of each violin at a given y-axis level represents the density of data points at that price level, so a wider section indicates a higher density of listings with that price. Each violin is symmetric along its central axis, which makes the distribution easy to see on both sides.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By comparing the shape of the violins, you can determine which neighborhood groups have generally higher or lower prices. For instance, if one violin is taller or has a wider spread at higher prices, it indicates that neighborhood group tends to have more expensive listings.

#### Chart - 8 - Availability by room type (Box plot)

In [None]:
# Chart - 8 visualization code

plt.figure(figsize=(10, 6))
sns.boxplot(data=Airbnb_df, x='room_type', y='availability_365')
plt.title('Availability by Room Type')
plt.xlabel('Room Type')
plt.ylabel('Availability (365 days)')
plt.show()


##### 1. Why did you pick the specific chart?

A box plot shows the median, the quartiles, the upper and lower whiskers, and any outliers. In this case it is used to compare the availability of each room type over 365 days.

##### 2. What is/are the insight(s) found from the chart?

The box plot shows that shared rooms tend to be available for more of the year, probably because most people prefer to stay in an Entire home/apt or in a Private room.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights suggest that, to grow the business in these neighbourhood groups, a host could convert shared rooms into private rooms or entire-home listings.

#### Chart - 9 - Number of reviews over time (Line chart)

In [None]:
# Chart - 9 visualization code

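Airbnb_df['last_review'] = pd.to_datetime(Airbnb_df['last_review'])
# Aggregate on a temporary index so that Airbnb_df keeps its last_review column for later cells
reviews_by_month = Airbnb_df.set_index('last_review')['number_of_reviews'].resample('M').sum()
# Note: pandas 2.2 and later prefer the alias 'ME' for month-end resampling
reviews_by_month.plot(kind='line', figsize=(10, 6))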
plt.title('Number of Reviews Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

A line chart is chosen to show the trend over time.

##### 2. What is/are the insight(s) found from the chart?

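The chart shows that the monthly totals are zero or very low in the earlier years and then rise sharply around 2019, maybe because of digitalization or an increase in online booking. Because each listing's full review count is assigned to the month of its last review, the chart also reflects how recently listings were reviewed.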
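
To make clear what the line chart aggregates, the sketch below repeats the monthly sum on three made-up listings. It groups by calendar month with `dt.to_period('M')`, a swapped-in alternative to the `resample` call in the chart code that, unlike `resample`, only lists months that actually occur in the data.

```python
import pandas as pd

# Made-up rows for illustration only, not taken from the real dataset
toy = pd.DataFrame({
    'last_review': pd.to_datetime(['2019-01-05', '2019-01-20', '2019-03-02']),
    'number_of_reviews': [3, 4, 10],
})

# Sum each listing's total review count under the month of its last review
monthly = toy.groupby(toy['last_review'].dt.to_period('M'))['number_of_reviews'].sum()
print(monthly)
```

The third listing contributes all 10 of its reviews to March 2019, even though those reviews may have been written over a much longer period.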

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The increase in the number of reviews over time suggests that people are now more used to digital services and more willing to share feedback about their stay. More reviews, whether positive or negative, increase the reach of a listing and its location.

#### Chart - 10 - Top 10 Hosts by Number of Listings

In [None]:
# Chart - 10 visualization code

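top_hosts = Airbnb_df['host_id'].value_counts().nlargest(10)
plt.figure(figsize=(10, 6))
# Convert host IDs to strings so the bars stay ordered by listing count rather than by ID value
sns.barplot(x=top_hosts.index.astype(str), y=top_hosts.values)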
plt.title('Top 10 Hosts by Number of Listings')
plt.xlabel('Host ID')
plt.ylabel('Number of Listings')
plt.xticks(rotation=90)
plt.show()



##### 1. Why did you pick the specific chart?

A bar chart is used to compare values across different categories.

##### 2. What is/are the insight(s) found from the chart?

The bar chart shows the top 10 hosts by number of listings; the x-axis shows the host ID and the y-axis shows the number of listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A high number of listings shows that a host has a large presence on the platform, which may reflect the popularity of that host in the neighbourhood group.

#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

numerical_df = Airbnb_df.select_dtypes(include=['float64', 'int64'])
plt.figure(figsize=(10, 6))
correlation_matrix = numerical_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is used to show the correlation between numerical features.

##### 2. What is/are the insight(s) found from the chart?

The values range from -1 to 1.

* 1 indicates a perfect positive correlation.
* -1 indicates a perfect negative correlation.
* 0 indicates no correlation.
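
To read specific values out of a correlation matrix rather than judging them by colour, the pairs can be listed and ranked by strength. This is a minimal sketch on made-up data in which two pairs are constructed to be perfectly correlated; the variable names `mask` and `pairs` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only, not taken from the real dataset
toy = pd.DataFrame({
    'price': [100, 150, 200, 250, 300],
    'number_of_reviews': [50, 40, 30, 20, 10],  # falls exactly as price rises
    'minimum_nights': [1, 2, 3, 4, 5],          # rises exactly with price
    'availability_365': [10, 300, 20, 250, 30],
})

corr = toy.corr()

# Keep only the upper triangle so each pair of columns appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by the absolute strength of their correlation
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

The same steps could be applied to `correlation_matrix` from the heatmap cell to list the strongest relationships in the real data.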

#### Chart - 12 - Pair Plot

In [None]:
# Pair Plot visualization code
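sns.pairplot(numerical_df)

# pairplot builds its own grid of subplots, so set a figure-level title;
# plt.title would only label the last subplot
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()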


##### 1. Why did you pick the specific chart?

By creating this pair plot, we can visually inspect the relationships between the numerical variables in the Airbnb dataset, which helps us identify any correlations or patterns that may exist.

##### 2. What is/are the insight(s) found from the chart?

Pair plots are useful for visualizing relationships between multiple numerical variables at once, as well as understanding their distributions. In the case of price and number of reviews, for example, a visible positive correlation would suggest that listings with higher prices tend to have more reviews, which might indicate higher popularity or perceived value.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

From the data and the visualizations above, it appears that to grow the business a host needs a good number of reviews and a price that is not higher than that of other hosts. Here are some suggestions that can support business growth:
1. Implement a dynamic pricing strategy that adjusts prices based on demand, seasonality, and competitor pricing. Use historical price data to identify trends and set optimal pricing.
2. Analyze the price distribution across different neighborhoods to understand the market segments and set competitive prices.
3. Offer discounts during low-demand periods to increase occupancy rates.
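
As a starting point for the competitive-pricing suggestion, each listing's price can be compared with the median price of similar listings. This is a minimal sketch on made-up rows; treating listings with the same neighbourhood group and room type as comparable is an assumption made here, and `segment_median` and `price_ratio` are hypothetical column names.

```python
import pandas as pd

# Made-up rows for illustration only, not taken from the real dataset
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan'] * 3 + ['Bronx'] * 3,
    'room_type': ['Private room'] * 6,
    'price': [100, 120, 200, 40, 50, 90],
})

# Median price of comparable listings (same neighbourhood group and room type)
toy['segment_median'] = (
    toy.groupby(['neighbourhood_group', 'room_type'])['price'].transform('median')
)

# A ratio above 1 means the listing is priced above the median of its segment
toy['price_ratio'] = toy['price'] / toy['segment_median']
print(toy)
```

Listings with a high ratio could be candidates for a price review or for discounts in low-demand periods, subject to factors such as size and amenities that this dataset does not record.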

# **Conclusion**

In conclusion, the EDA provides a comprehensive understanding of the Airbnb market dynamics, offering valuable insights that can inform strategic decisions. Continuous analysis and adaptation are essential to staying competitive and maximizing revenue in the ever-evolving hospitality market.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***