 **Airbnb Booking Analysis :-**




 **Project Type**    - EDA


**Contribution -** Individual

 **Project Summary -**

Airbnb is an online marketplace that connects people who want to rent out their homes with people looking for accommodations in that locale. NYC is the most populous city in the United States, and one of the most popular tourism and business places globally.







Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. It provides rental options for different customer segments: depending on budget, guests can book an entire house, a private room, or a shared room. Today, Airbnb is a one-of-a-kind service used and recognized worldwide. Analyzing the millions of listings on the platform is crucial for the company. These listings generate a lot of data that can be used for security, business decisions, understanding customer and host behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

This dataset has around 49,000 observations in it with 16 columns and it is a mix between categorical and numeric values.

**GitHub Link -**

https://github.com/prashant4431/EDA_Airbnb_Booking_Analysis

 **Problem Statement**



In this project I have analysed Airbnb’s New York City (NYC) data from 2019. NYC
is not only one of the most famous cities in the world but also a top global destination for
visitors drawn to its museums, entertainment, restaurants and commerce.

 Our main objective is to find out the key metrics that influence the listing of
properties on the platform. For this, we will explore and visualize the dataset
from Airbnb in NYC using basic exploratory data analysis (EDA) techniques.

We will be finding out the distribution of every Airbnb listing based on their
location, including their price range, room type, listing name, and other

 **Business Objective**

To connect travellers with hosts who provide accommodation through the Airbnb platform, which offers a variety of options to choose from, ultimately leading to customer satisfaction and making it the most reliable booking platform.

**Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

***Mounting Drive***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Reading DataSet**

In [None]:
airbnb_path = "/content/drive/MyDrive/Almabetter/EDA_Airbnb_Booking_Analysis/dat11/Airbnb NYC 2019.csv"
df = pd.read_csv(airbnb_path)


 Dataset First View

In [None]:
df


**Description of the Columns/Variables:-**

**id** - Unique ID

**name** - Name of the listing

**host_id** - Unique host_id

**host_name** - Name of the host

**neighbourhood_group** - Borough the listing is in

**neighbourhood** - Area within the borough

**latitude** - Latitude coordinate

**longitude** - Longitude coordinate

**room_type** - Type of listing

**price** - Price of listing

**minimum_nights** - Minimum nights to be paid for

**number_of_reviews** - Number of reviews

**last_review** - Date of the last review

**reviews_per_month** - Number of reviews per month

**calculated_host_listings_count** - Total number of listings the host has

**availability_365** - Availability around the year



In [None]:
#Lets check the shape
df.shape

 **Dataset Information**

In [None]:
#Checking variables
df.columns

In [None]:
#Lets Rename Some Columns to have better information about variables
rename_col = {'id':'listing_id','name':'listed_name','number_of_reviews':'total_reviews','calculated_host_listings_count':'host_listings_count'}
df = df.rename(columns = rename_col)
df.head(1)

In [None]:
df.info()

Categorical variables - listed_name, host_name, neighbourhood_group, neighbourhood and room_type

Numerical variables - listing_id, host_id, latitude, longitude, price, minimum_nights, total_reviews, reviews_per_month, host_listings_count, availability_365

Date variable - last_review (stored as text)

In [None]:
# Checking Duplicates
df = df.drop_duplicates()
df.count()

So, there are no duplicates.
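Comparing row counts before and after `drop_duplicates()` works, but counting the duplicated rows directly makes the check explicit. Here is a minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not the Airbnb dataset):

```python
import pandas as pd

# Toy frame standing in for the Airbnb data (values are made up)
toy = pd.DataFrame({
    "listing_id": [1, 2, 2, 3],
    "price": [100, 80, 80, 150],
})

# Count fully duplicated rows before dropping anything
n_duplicates = int(toy.duplicated().sum())
print(f"duplicate rows: {n_duplicates}")

# Drop them and compare the row counts
deduped = toy.drop_duplicates()
print(f"rows before: {len(toy)}, rows after: {len(deduped)}")
```

On the real DataFrame, `df.duplicated().sum()` returning 0 would confirm that there are no duplicates.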

 **Missing Values/Null Values**

In [None]:
df.isnull().sum()

In [None]:
#Let's fill the null values of listed_name and host_name with placeholders
df['listed_name'] = df['listed_name'].fillna('not_mentioned')
df['host_name'] = df['host_name'].fillna('unknown')

In [None]:
# Check that no null values remain in these columns
df[['listed_name','host_name']].isnull().sum()


In [None]:
#Let's remove the last_review column as it is of least importance.
df = df.drop(['last_review'], axis=1)

In [None]:
# Let's replace the NaN values of reviews_per_month with 0
# Note: the int64 cast drops the fractional part of each rate (e.g. 0.21 becomes 0)
df['reviews_per_month'] = df['reviews_per_month'].replace(to_replace=np.nan,value=0).astype('int64')

In [None]:
df['reviews_per_month'].isnull().sum()
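The cast to `int64` used above drops the fractional part of every monthly rate. A small sketch with made-up values (the `rates` series is an assumption for illustration) shows the difference between casting and simply filling the missing values:

```python
import numpy as np
import pandas as pd

# Made-up monthly review rates, including one missing value
rates = pd.Series([0.21, 1.8, np.nan, 4.64], name="reviews_per_month")

# Filling and casting to int64 drops the fractional part
as_int = rates.fillna(0).astype("int64")

# Filling only keeps the original precision
as_float = rates.fillna(0)

print(pd.DataFrame({"as_int": as_int, "as_float": as_float}))
```

Keeping the column as float may be preferable when the exact rates matter, for example in the correlation analysis.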

In [None]:
#Let's have the dataframe information now
df.info()

**Let's check for some unique values of variables.**

In [None]:
#listing/property Ids
df['listing_id'].nunique()

In [None]:
# neighborhood
df['neighbourhood'].nunique()


In [None]:
#neighborhood_group
df['neighbourhood_group'].nunique()

In [None]:
#hosts
df['host_name'].nunique()

**Describing the DataFrame**

In [None]:
df.describe()

In [None]:
#Let's see the mean price of room types.
df.groupby(['room_type'])['price'].mean()

**Let's Visualize the data :**

**Total Listing/Property count in Each Neighborhood Group**

In [None]:

counts = df['neighbourhood_group'].value_counts()

plt.figure(figsize=(9, 5))
plt.bar(counts.index, counts.values)
plt.title('Neighbourhood_group Listing Counts in NYC', fontsize=10)
plt.xlabel('Neighbourhood_Group', fontsize=10)
plt.ylabel('total listings counts', fontsize=10)
plt.show()


Manhattan and Brooklyn have the highest number of listings on Airbnb.



Staten Island has the fewest listings.

The distribution of listings across the neighborhood groups is skewed, with a concentration of listings in Manhattan and Brooklyn.

Listing counts measure supply rather than demand, but the concentration suggests that hosts expect the most bookings in Manhattan and Brooklyn.



**Top Neighborhoods by Listing/property using Bar plot**

In [None]:
# Create a new DataFrame that displays the top 10 neighborhoods in the Airbnb NYC dataset based on the number of listings in each neighborhood
Top_Neighborhoods = df['neighbourhood'].value_counts()[:10].reset_index()

# rename the columns of the resulting DataFrame to 'Top_Neighborhoods' and 'Listing_Counts'
Top_Neighborhoods.columns = ['Top_Neighborhoods', 'Listing_Counts']

# display the resulting DataFrame
Top_Neighborhoods

In [None]:
# Get the top 10 neighborhoods by listing count
top_10_neigbourhoods = df['neighbourhood'].value_counts().nlargest(10)

# Create a list of colors to use for the bars
colors = ['Violet', 'Indigo', 'Blue', 'Green', 'Yellow', 'Orange', 'Red', '#B4F0B4', 'Cyan', 'olive']

# Create a bar plot of the top 10 neighborhoods using the specified colors
top_10_neigbourhoods.plot(kind='bar', figsize=(15, 6), color = colors)

# Set the x-axis label
plt.xlabel('Neighbourhood', fontsize=14)

# Set the y-axis label
plt.ylabel('Total Listing Counts', fontsize=14)

# Set the title of the plot
plt.title('Listings by Top Neighborhoods in NYC', fontsize=15)

# Show the plot
plt.show()


The top neighborhoods in New York City in terms of listing counts are Williamsburg, Bedford-Stuyvesant, Harlem, Bushwick, and the Upper West Side.

The top neighborhoods are primarily located in Brooklyn and Manhattan. This may be because these boroughs have a higher overall population and a higher demand for housing.

**Total Counts Of Each Room Type.**

In [None]:
# create a new DataFrame that displays the number of listings of each room type in the Airbnb NYC dataset
top_room_type = df['room_type'].value_counts().reset_index()

# rename the columns of the resulting DataFrame to 'Room_Type' and 'Total_counts'
top_room_type.columns = ['Room_Type', 'Total_counts']

# display the resulting DataFrame
top_room_type



In [None]:
# Set the figure size
plt.figure(figsize=(10, 6))

# Get the room type counts
room_type_counts = df['room_type'].value_counts()

# Set the labels and sizes for the pie chart
labels = room_type_counts.index
sizes = room_type_counts.values

# Create the pie chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%')

# Add a legend to the chart
plt.legend(title='Room Type', bbox_to_anchor=(0.8, 0, 0.5, 1), fontsize='12')

# Show the plot
plt.show()

Entire home/apt listings account for 52.0% of all listings, and private rooms for 45.7%. Shared rooms make up only the remaining 2.3% or so, which puts entire homes at more than 20 times the shared-room count.

The data suggests that hosts, and most likely the travelers they serve, favor an entire home/apt or a private room over a shared room.
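The percentages on the pie chart can also be computed directly with `value_counts(normalize=True)`. Below is a minimal sketch on made-up data (the `toy_rooms` series and its proportions are assumptions for illustration, not the real distribution):

```python
import pandas as pd

# Made-up room types: 5 entire homes, 4 private rooms, 1 shared room
toy_rooms = pd.Series(
    ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"] * 1,
    name="room_type",
)

# Share of each room type as a percentage of all listings
shares = toy_rooms.value_counts(normalize=True).mul(100).round(1)
print(shares)

# How many times more common entire homes are than shared rooms
ratio = shares["Entire home/apt"] / shares["Shared room"]
print(f"entire home vs shared room: {ratio:.1f}x")
```

Running the same two lines on `df['room_type']` would give the exact shares behind the chart.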

**Now we will look at how availability varies across neighbourhood groups and room types.**

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(24, 8))
ax = axes.flatten()

sns.lineplot(data=df, x='neighbourhood_group', y='availability_365', hue='room_type', ax=ax[0])
ax[0].set_title('Room Availability throughout Neighbourhood/Room Type')
sns.scatterplot(data=df[df['price'] < 500], x="availability_365", y='price', hue='room_type', alpha=.9, palette="muted", ax=ax[1])
ax[1].set_title('Price vs Availability (Range $ 0 - 500)')

sns.countplot(data=df[df['availability_365']  == 365], x='neighbourhood_group', hue='room_type', palette='GnBu_d', ax=ax[2])
ax[2].set_title('Property Available 365 days')
sns.despine(fig)

From the first and third graphs, Staten Island has the busiest shared rooms as well as the most available private rooms. Brooklyn and Manhattan have quite similar availability.

On the other hand, the Price vs Availability graph shows a wide price range for both private rooms and entire homes.

The count plot of listings available all 365 days shows that Manhattan has the most such listings, followed by Brooklyn.

**Stay Requirement counts by Minimum Nights using Bar chart**

In [None]:
# Group the DataFrame by the minimum_nights column and count the number of rows in each group
min_nights_count = df.groupby('minimum_nights').size().reset_index(name = 'count')

# Sort the resulting DataFrame in descending order by the count column
min_nights_count = min_nights_count.sort_values('count', ascending=False)

# Select the top 15 rows
min_nights_count = min_nights_count.head(15)

# Reset the index
min_nights_count = min_nights_count.reset_index(drop=True)

# Display the resulting DataFrame
min_nights_count

In [None]:
# Extract the minimum_nights and count columns from the DataFrame
minimum_nights = min_nights_count['minimum_nights']
count = min_nights_count['count']

# Set the figure size
plt.figure(figsize=(12, 4))

# Create the bar plot
plt.bar(minimum_nights, count)

# Add axis labels and a title
plt.xlabel('Minimum Nights', fontsize='14')
plt.ylabel('Count', fontsize='14')
plt.title('Stay Requirement by Minimum Nights', fontsize='15')

# Show the plot
plt.show()

About half of the listings have a minimum stay requirement of 1 or 2 nights, with 12720 and 11696 listings, respectively.

The number of listings generally decreases as the minimum stay increases, with 7999 listings requiring 3 nights.

A minimum stay of 30 nights is less common, at 3760 listings, though it is still a sizable group.
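To put these counts in proportion, we can express them as a share of all listings. This is a rough sketch that uses the four counts quoted above and assumes a total of about 49,000 listings (the approximate dataset size stated in the summary, not an exact figure):

```python
import pandas as pd

# Listing counts per minimum-night requirement, as quoted in the text
min_night_counts = pd.Series({1: 12720, 2: 11696, 3: 7999, 30: 3760}, name="count")

# Approximate dataset size (assumption: rounded figure, not the exact row count)
approx_total = 49_000

# Share of all listings for each requirement, in percent
share_pct = (min_night_counts / approx_total * 100).round(1)
print(share_pct)

# Combined share of listings that require only 1 or 2 nights
short_stay_share = round(min_night_counts.loc[[1, 2]].sum() / approx_total * 100, 1)
print(f"1-2 night minimum: {short_stay_share}% of listings")
```

With the real data, `df['minimum_nights'].value_counts(normalize=True)` would give the exact shares.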

**Neighborhoods with the Lowest Average Price**

In [None]:
# create a new DataFrame that displays the average price of Airbnb rentals in each neighborhood
neighbourhood_avg_price = df.groupby("neighbourhood")["price"].mean().reset_index().rename(columns={"price": "avg_price"})


# select the 15 neighborhoods with the lowest average prices
neighbourhood_avg_price = neighbourhood_avg_price.sort_values("avg_price").head(15)

# join the resulting DataFrame with the 'neighbourhood_group' column from the Airbnb NYC dataset, dropping any duplicate entries
neighbourhood_avg_price_sorted_with_group = neighbourhood_avg_price.join(df[['neighbourhood', 'neighbourhood_group']].drop_duplicates().set_index('neighbourhood'),
                                                                         on='neighbourhood')

# Display the resulting data (Styler.hide_index() was removed in pandas 2.0; hide(axis='index') replaces it)
display(neighbourhood_avg_price_sorted_with_group.style.hide(axis='index'))

In [None]:
# Group the data by neighborhood and calculate the average price
neighbourhood_avg_price = df.groupby("neighbourhood")["price"].mean()

# Create a new DataFrame with the average price for each neighborhood
neighbourhood_prices = pd.DataFrame({"neighbourhood": neighbourhood_avg_price.index, "avg_price": neighbourhood_avg_price.values})

# Merge the average price data with the original DataFrame
df_merged = df.merge(neighbourhood_prices, on="neighbourhood")

# Convert 'avg_price' column to numeric
df_merged['avg_price'] = pd.to_numeric(df_merged['avg_price'])

# Create a scatter plot of listing coordinates, colored by the neighborhood's average price
fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(1, 1, 1)
scatter = ax.scatter(df_merged["longitude"], df_merged["latitude"], c=df_merged["avg_price"], cmap="inferno")
ax.set_title("Average Airbnb Price by Neighborhoods in New York City")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
fig.colorbar(scatter, label="Average Price")
plt.show()

In [None]:
# Extract the values from the dataset
neighborhoods = neighbourhood_avg_price_sorted_with_group['neighbourhood']
prices = neighbourhood_avg_price_sorted_with_group['avg_price']

# Create the bar plot
plt.figure(figsize=(15,5))
plt.bar(neighborhoods, prices,width=0.5, color = 'lightsalmon')
plt.xlabel('Neighborhood')
plt.ylabel('Average Price')
plt.title('Average Price by Neighborhood')

# Show the plot
plt.show()



Most of these neighborhoods are located in the Bronx and Staten Island, where average listing prices are lower than in Manhattan and Brooklyn.

These neighborhoods may be attractive to guests looking for more affordable accommodation in the New York City area.

**Number of Max. Reviews by Each Neighborhood Group**

In [None]:
# Group the Airbnb data by neighbourhood group
reviews_by_neighbourhood_group = df.groupby("neighbourhood_group")["total_reviews"].max()

# Create a pie chart to visualize the distribution of maximum number of reviews among different neighbourhood groups
plt.pie(reviews_by_neighbourhood_group, labels=reviews_by_neighbourhood_group.index, autopct='%1.1f%%')

# Add a title to the chart
plt.title("Number of maximum Reviews by Neighborhood Group in NYC", fontsize='15')

# Display the chart
plt.show()

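The pie chart compares the single most-reviewed listing in each neighbourhood group, so each slice is that group's maximum as a share of the five maxima combined. Queens and Manhattan lead with 26.5% and 25.5%.

Brooklyn follows with 20.5%. These figures show where the most-reviewed individual listings are; they do not measure the total review volume of each group.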
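Note that `max()` picks out the single most-reviewed listing in each group, which can rank the groups differently from the total number of reviews. A small sketch on made-up numbers (the `toy` frame is an assumption for illustration) shows the two aggregations side by side:

```python
import pandas as pd

# Made-up review counts for two neighbourhood groups
toy = pd.DataFrame({
    "neighbourhood_group": ["Queens", "Queens", "Manhattan", "Manhattan", "Manhattan"],
    "total_reviews": [600, 10, 500, 300, 200],
})

# Maximum and total reviews per neighbourhood group
summary = toy.groupby("neighbourhood_group")["total_reviews"].agg(["max", "sum"])
print(summary)

# The leading group depends on which aggregation is used
print("highest max:", summary["max"].idxmax())
print("highest sum:", summary["sum"].idxmax())
```

In this toy example Queens leads on the maximum while Manhattan leads on the total, so it is worth stating which measure a chart is based on.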


**Correlation Heatmap Visualization**

In [None]:
# Select the columns of interest
columns_of_interest = ['listing_id','host_id','latitude','longitude','price','minimum_nights','total_reviews','reviews_per_month','host_listings_count','availability_365']


# Create a new DataFrame with only the columns of interest
df_selected = df[columns_of_interest]

# Calculate the correlation matrix
corr = df_selected.corr()

# Display the correlation between columns
corr









In [None]:
# Set the figure size
plt.figure(figsize=(10,5))

# Visualize correlations as a heatmap
sns.heatmap(corr, cmap='crest',annot=True)

# Display heatmap
plt.show()


There is a moderate positive correlation (0.59) between the **host_id** and **listing_id** columns. Both are identifiers rather than measurements, so this most likely reflects that newer hosts tend to own newer listings (assuming IDs grow over time); it says nothing about pricing or demand.

There is a very weak positive correlation (0.057) between the **price** column and the **host_listings_count** column. This is too small to conclude that hosts with more listings charge higher prices.

There is a weak positive correlation (0.23) between the **host_listings_count** column and the **availability_365** column, which suggests that hosts with more listings tend to have somewhat more days of availability in the next 365 days.

There is a moderate positive correlation (0.56) between the **total_reviews** column and the **reviews_per_month** column, which suggests that listings with more total reviews tend to have more reviews per month.
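Reading pairs off a heatmap by eye is error-prone, so it can help to list the pairs sorted by the absolute value of the correlation. This is a minimal sketch on a made-up toy frame (the `toy` values are assumptions for illustration); it keeps only the upper triangle of the matrix so that each pair appears once:

```python
import numpy as np
import pandas as pd

# Made-up values standing in for three numeric columns
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 150],
    "total_reviews": [10, 40, 5, 2, 30],
    "reviews_per_month": [1, 3, 0, 0, 2],
})

corr = toy.corr()

# Keep the upper triangle only (k=1 also excludes the diagonal of 1.0 values)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per column pair, strongest relationship first
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Applied to the notebook's `corr` matrix, the same three lines would list every column pair from strongest to weakest.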



**Solution to Business Objective :**

Manhattan and Brooklyn have by far the largest number of listings, which suggests that demand for Airbnb rentals is strongest there. This could make them attractive areas for hosts to invest in property.

Brooklyn comes in second, with a significant number of listings and cheaper prices than Manhattan. Most of its listings are located in Williamsburg and Bedford-Stuyvesant.

Williamsburg, Bedford-Stuyvesant, Harlem, Bushwick, and the Upper West Side are the top neighborhoods in terms of listing counts, indicating strong demand for Airbnb rentals in these areas.

The average price of a listing in New York City is higher in the center of the city (Manhattan) compared to the outer boroughs. This could indicate that investing in property in Manhattan may be more lucrative for Airbnb rentals.

Manhattan and Brooklyn have the largest number of hosts, indicating a high level of competition in these boroughs.

The data suggests that Airbnb rentals are primarily used for short-term stays, with relatively few listings requiring a minimum stay of 30 nights or more. Hosts may want to consider investing in property that can accommodate shorter stays in order to maximize their occupancy rate.

The majority of listings on Airbnb are for entire homes or apartments and private rooms, with relatively few listings for shared rooms. This suggests that travelers using Airbnb have a wide range of accommodation options to choose from, and hosts may want to consider investing in property that can accommodate multiple guests.

The data indicates that the availability of Airbnb rentals varies significantly across neighborhoods, with some neighborhoods having a high concentration of listings and others having relatively few.

The data indicates that there is a high level of competition among Airbnb hosts, with a small number of hosts dominating a large portion of the market. Hosts may want to consider investing in property in areas with relatively fewer listings in order to differentiate themselves from the competition.

Queens has the largest share in the maximum-reviews chart even though it ranks third in number of listings, behind Manhattan and Brooklyn. Its most-reviewed listing has more reviews than any listing in the larger boroughs, which suggests that Queens can attract steady guest traffic despite having fewer listings.

---

# **Thank You !**


