# **Project Name**    - Capstone Project: Exploratory Data Analysis (AIRBNB)



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Since 2008, Airbnb has revolutionized the way people travel, offering a unique and personalized approach to experiencing the world. As Airbnb expanded globally, it became a household name, serving as a unique platform for both guests and hosts. The vast amount of data generated by millions of listings on Airbnb is invaluable for the company's growth and decision-making processes. This data is a treasure trove of insights that can be harnessed for various purposes, including enhancing security, making informed business decisions, understanding customer and host behavior, measuring performance on the platform, guiding marketing strategies, and developing innovative additional services.

The dataset under consideration contains approximately 49,000 observations across 16 columns, featuring a mix of categorical and numeric values. This project aims to explore and analyze this dataset to extract meaningful insights and drive key understandings that can benefit Airbnb.

# **GitHub Link -**

https://github.com/pratish219/Alma_Better/blob/main/Capstone_Project_Exploratory_Data_Analysis_(AIRBNB).ipynb

# **Problem Statement**


**To analyze Airbnb's extensive dataset, consisting of approximately 49,000 observations across 16 columns with a mix of categorical and numeric values, to extract valuable insights that can inform and improve Airbnb's operations, customer experience, and business strategies.**

#### **Define Your Business Objective?**

Airbnb, a global leader in the short-term accommodation and travel industry, faces the challenge of continually improving its platform to meet the evolving needs and preferences of both guests and hosts. In this context, the Airbnb dataset serves as a valuable resource for data-driven decision-making to enhance the company's operations, customer experience, and business strategies.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Airbnb NYC 2019.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.columns

In [None]:
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

### What did you know about your dataset?

The dataset has 16 columns and 48,895 observations.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
cols = list(df.columns)  # call list() to get the column names; list[...] only builds a type alias
cols


In [None]:
# Dataset Describe
df.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values for column '{column}': {unique_values}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.drop(['latitude','longitude','last_review','reviews_per_month'], inplace = True , axis=1)

In [None]:
df.head()

### What all manipulations have you done and insights you found?

The null counts above show that **last_review** and **reviews_per_month** have a large number of missing values, so we drop both columns. The same step also drops **latitude** and **longitude**, which none of the charts below use.
Before dropping a column, consider how important it is to the analysis. Columns that carry critical information or contribute significantly to understanding the dataset should be retained, even if they have missing values.
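
One way to make the drop decision explicit is to measure the share of missing values per column first. The following is a minimal sketch on a made-up toy frame; the `THRESHOLD` cut-off and the toy values are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Airbnb data (values are made up).
toy = pd.DataFrame({
    "price": [100, 150, 80, 200],
    "last_review": ["2019-01-01", None, None, None],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# Share of missing values per column, as a fraction between 0 and 1.
null_share = toy.isnull().mean()

# Hypothetical cut-off: flag columns where at least half the values are missing.
THRESHOLD = 0.5
to_drop = null_share[null_share >= THRESHOLD].index.tolist()

# Drop only the flagged columns and keep everything else.
cleaned = toy.drop(columns=to_drop)

print(null_share)
print("Columns to drop:", to_drop)
```

On the real dataset, the same two lines (`df.isnull().mean()` and the threshold filter) would show how much information each dropped column actually loses.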

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
area_review = df.groupby('neighbourhood_group')['number_of_reviews'].sum().reset_index()
area_review

In [None]:
area = area_review['neighbourhood_group']
review = area_review['number_of_reviews']
fig = plt.figure(figsize=(10,5))
plt.barh(area,review,color = sns.color_palette("deep"))
plt.ylabel('Neighborhood')
plt.xlabel('No Of Reviews')
plt.title('No of Reviews in each Neighborhood')
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart (horizontal bar plot) was chosen because it effectively compares the number of reviews across different neighborhoods (represented on the y-axis) in a visually straightforward manner.
Since the comparison involves categorical data (neighborhoods) and a quantitative variable (number of reviews), a bar chart is a suitable choice. Additionally, horizontal bar charts are particularly useful when dealing with long category labels, making them easier to read.

##### 2. What is/are the insight(s) found from the chart?

The chart provides a clear comparison of the number of reviews for each neighborhood in the dataset.
It allows viewers to quickly identify neighborhoods with the highest and lowest numbers of reviews.
Viewers can also observe any significant disparities between neighborhoods in terms of review counts.
Identifying neighborhoods with high review counts could indicate popular areas, while those with low review counts might suggest less-visited or newly developed neighborhoods.
The chart can aid in understanding the distribution of reviews across different neighborhood groups, which can inform various decisions such as marketing strategies, resource allocation for property management, or identifying areas for improvement based on customer feedback.
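
Total review counts also grow with the number of listings in an area, so a per-listing average can help separate "many listings" from "heavily reviewed listings". The sketch below uses made-up toy values; the output column names `total_reviews`, `listings`, and `reviews_per_listing` are hypothetical labels chosen for illustration.

```python
import pandas as pd

# Toy data standing in for the Airbnb listings (values are made up).
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Queens", "Queens"],
    "number_of_reviews": [10, 20, 30, 40, 50],
})

# Total reviews, listing count, and average reviews per listing for each group.
summary = (
    toy.groupby("neighbourhood_group")["number_of_reviews"]
    .agg(total_reviews="sum", listings="count", reviews_per_listing="mean")
    .reset_index()
)

print(summary)
```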

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Positive Business Impacts:

Targeted Marketing Strategies: Understanding which neighborhoods attract the most reviews can help businesses target their marketing efforts more effectively. They can focus promotional campaigns or offers on these popular neighborhoods to attract more customers.

Improving Customer Experience: Identifying areas with high review counts allows businesses to prioritize resources for enhancing customer experiences in those neighborhoods. Addressing any issues raised in reviews can lead to increased customer satisfaction and loyalty.

Identifying Growth Opportunities: Areas with lower review counts may represent untapped markets or neighborhoods with potential for growth. Businesses can develop strategies to attract more visitors or customers to these areas, potentially leading to expansion opportunities.

Negative Growth Considerations:

Negative Publicity and Reputation Management: If certain neighborhoods consistently receive negative reviews or have significantly lower review counts compared to others, it could impact the overall reputation of the business or the perceived desirability of those areas. Negative publicity can deter potential customers and hinder business growth.

Resource Allocation Challenges: Focusing too heavily on neighborhoods with high review counts may result in neglecting areas with lower review counts. This could lead to disparities in service quality or customer experiences across different neighborhoods, ultimately affecting customer satisfaction and loyalty negatively.

Missed Opportunities for Improvement: Ignoring neighborhoods with lower review counts may mean overlooking valuable feedback and improvement opportunities. Businesses should proactively address issues raised in reviews from all neighborhoods to continuously enhance their offerings and customer experiences.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
price = df.groupby('number_of_reviews')['price'].max().reset_index()
price

In [None]:
reviews_price = price['number_of_reviews']
price_review =  price['price']
fig = plt.figure(figsize=(10,5))
plt.scatter(price_review,reviews_price)
plt.xticks(np.arange(0, 11000, 1000))
plt.xlabel('Price')
plt.ylabel('No Of Reviews')
plt.title('No Of Reviews Distributed Over The Price Horizon')
plt.show()

##### 1. Why did you pick the specific chart?

The scatter plot was chosen because it allows for the visualization of the relationship between two numeric variables: price and number of reviews.
Scatter plots are ideal for identifying patterns, trends, or correlations between variables, making them suitable for exploring the relationship between price and review counts.


##### 2. What is/are the insight(s) found from the chart?

Each point pairs a review count with the highest price among the listings that share that count, so the plot shows how the maximum price varies with review count rather than showing every individual listing.
It helps identify whether there is any correlation or trend between the price of a listing and the number of reviews it receives.
From the plot, one can observe whether there are clusters of points indicating price ranges that attract more reviews.
Additionally, outliers, such as listings with exceptionally many reviews for their price, can be identified.
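
To put a number on the relationship, a correlation coefficient can be computed across all listings rather than on the per-review-count maximum. This is a sketch on made-up toy values, not a result from the dataset; the rank-based (Spearman-style) version is computed here by ranking both columns and taking the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy data standing in for the Airbnb listings (values are made up).
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 500],
    "number_of_reviews": [90, 60, 40, 10, 2],
})

# Pearson correlation: strength of the linear relationship.
pearson = toy["price"].corr(toy["number_of_reviews"])

# Rank correlation: Pearson on the ranks, which is less sensitive to
# extreme prices than Pearson on the raw values.
spearman = toy["price"].rank().corr(toy["number_of_reviews"].rank())

print(f"Pearson: {pearson:.2f}, rank correlation: {spearman:.2f}")
```

A rank-based measure is a reasonable choice when one variable has a long tail, as prices often do.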

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Identifying price points that correlate with higher review counts can help hosts optimize pricing strategies. Listings priced within these ranges may attract more guests and generate higher revenue.
Understanding the relationship between price and review counts can inform decisions related to product positioning, competitive pricing, and value proposition, leading to enhanced customer satisfaction and loyalty.
By leveraging insights from the scatter plot, businesses can focus on areas where they can maximize return on investment, allocate resources effectively, and tailor marketing strategies to target price points that resonate with customers.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='neighbourhood_group', palette='Set3')
plt.xlabel('Neighborhood Group')
plt.ylabel('Count')
plt.title('Distribution of Listings Across Neighborhoods')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

The count plot (a variation of a bar plot) was chosen because it effectively visualizes the distribution of categorical data, specifically the number of listings across different neighborhood groups.
Count plots are suitable for showing the frequency of observations within categorical variables, making them ideal for analyzing the distribution of listings across neighborhoods.


##### 2. What is/are the insight(s) found from the chart?

The count plot provides a clear representation of the distribution of Airbnb listings across different neighborhood groups.
It allows viewers to quickly identify which neighborhood groups have the highest and lowest numbers of listings.
From the plot, one can observe any disparities in the distribution of listings among neighborhood groups. This insight can be valuable for understanding the popularity or demand for accommodations in different areas.
Additionally, the plot can highlight any neighborhoods that stand out in terms of the number of listings, indicating areas with potentially higher competition or greater opportunities for business growth.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Understanding the distribution of listings across neighborhood groups can help businesses, such as Airbnb hosts or property management companies, make informed decisions about property investments, pricing strategies, and marketing efforts.
Hosts can use insights from the plot to identify neighborhoods with high demand and adjust their pricing or promotional strategies accordingly to attract more guests and maximize occupancy rates.
Property management companies can use the information to allocate resources effectively, prioritize property management efforts, and identify areas for expansion or investment based on demand trends.
By leveraging insights from the count plot, businesses can tailor their offerings and services to meet the needs and preferences of customers in different neighborhood groups, ultimately leading to enhanced customer satisfaction and positive business outcomes.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='neighbourhood_group', hue='room_type', palette='Set3')
plt.xlabel('Neighborhood Group')
plt.ylabel('Count')
plt.title('Distribution of Room Types Across Neighborhoods')


plt.legend(title='Room Type', title_fontsize='12')
plt.show()

##### 1. Why did you pick the specific chart?

Like the previous chart, this is a count plot, but it adds the hue parameter, which brings in a second categorical variable (room type) within each category of the main variable (neighborhood group).
The count plot with hue differentiation is an excellent choice when we want to compare the distribution of multiple categories within each level of another categorical variable.

##### 2. What is/are the insight(s) found from the chart?

This plot provides insights into the distribution of different room types across neighborhood groups.
By observing the count of each room type within each neighborhood group, viewers can understand the variety and availability of accommodations in different areas.
The plot helps identify whether certain room types are more prevalent in specific neighborhood groups. For example, certain areas might have more entire homes/apartments available, while others might have a higher concentration of private rooms or shared rooms.
Comparing the distribution of room types across neighborhood groups can reveal patterns in the types of accommodations preferred or available in different areas.
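
Because the neighbourhood groups differ a lot in size, raw counts can hide the mix within each group. A row-normalized cross-tabulation is one way to compare room-type shares directly; the sketch below uses made-up toy values.

```python
import pandas as pd

# Toy data standing in for the Airbnb listings (values are made up).
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4,
    "room_type": (
        ["Entire home/apt"] * 3 + ["Private room"]
        + ["Entire home/apt"] + ["Private room"] * 3
    ),
})

# normalize="index" makes each row sum to 1, giving the share of each
# room type within a neighbourhood group.
shares = pd.crosstab(
    toy["neighbourhood_group"], toy["room_type"], normalize="index"
)

print(shares)
```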

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Understanding the distribution of room types across neighborhood groups can help businesses, such as Airbnb hosts or property management companies, tailor their offerings to meet the diverse preferences of guests.
Hosts can use insights from the plot to optimize their listings by offering room types that are in high demand in specific neighborhoods, potentially increasing booking rates and revenue.
Property management companies can use the information to identify areas with underserved markets for certain room types and adjust their property acquisition or management strategies accordingly to meet demand.
By leveraging insights from the count plot with hue differentiation, businesses can enhance the guest experience, improve occupancy rates, and ultimately drive positive business outcomes through strategic decision-making and targeted marketing efforts.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
name_hosts = df.groupby('host_name')['number_of_reviews'].max().reset_index()
name_hosts.sort_values(by='number_of_reviews',ascending = False).head(10)

##### 1. Why did you pick the specific chart?

The bar plot was chosen because it effectively compares quantitative data (number of reviews) across different categories (host names).
Since the goal is to identify the top 10 busiest hosts based on the number of reviews they have received, a bar plot allows for easy visualization of this comparison.
The choice of a horizontal bar plot (with hosts' names on the y-axis and the number of reviews on the x-axis) provides a clear representation of the busiest hosts at a glance.
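
The horizontal bar plot described above could be drawn along these lines. This is a sketch on made-up toy values, and it makes two assumptions that differ from the table code: it groups by `host_id` as well as `host_name`, since different hosts can share a name, and it sums reviews across a host's listings instead of taking the single-listing maximum. `TOP_N` is a hypothetical parameter.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for the Airbnb listings (values are made up).
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 4],
    "host_name": ["Ana", "Ana", "Ben", "Ana", "Ana", "Cy"],
    "number_of_reviews": [100, 50, 120, 30, 10, 5],
})

TOP_N = 3  # hypothetical; the notebook uses the top 10

# Total reviews per host, keeping hosts with the same name separate.
top_hosts = (
    toy.groupby(["host_id", "host_name"], as_index=False)["number_of_reviews"]
    .sum()
    .sort_values("number_of_reviews", ascending=False)
    .head(TOP_N)
)

# Label each bar with the name and id so duplicate names stay distinguishable.
labels = top_hosts["host_name"] + " (" + top_hosts["host_id"].astype(str) + ")"

fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(labels, top_hosts["number_of_reviews"])
ax.invert_yaxis()  # put the busiest host at the top
ax.set_xlabel("No Of Reviews")
ax.set_ylabel("Host")
ax.set_title(f"Top {TOP_N} Hosts by Number of Reviews")
plt.show()
```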

##### 2. What is/are the insight(s) found from the chart?

The chart highlights the top 10 hosts who have received the most reviews, indicating their popularity or level of activity in hosting guests.
Viewers can quickly identify which hosts are the busiest and have likely garnered positive feedback from guests.
Comparing the number of reviews across hosts allows for insights into host performance, guest satisfaction, and potentially the quality of accommodations or services offered.
Hosts with a high number of reviews may have established a strong reputation in the marketplace and may be perceived as reliable and trustworthy by potential guests.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

The insights gained from identifying the top 10 busiest hosts can help businesses, such as Airbnb hosts or property management companies, understand the characteristics and behaviors of successful hosts.
Hosts can leverage their high review counts to attract more guests, increase booking rates, and potentially command higher prices for their accommodations.
Property management companies can learn from successful hosts' strategies and offer guidance or support to less active hosts to improve their performance and maximize their potential earnings.
By understanding the factors contributing to the success of the busiest hosts, businesses can develop targeted marketing strategies, allocate resources effectively, and enhance overall customer satisfaction, leading to positive business impacts and growth opportunities.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?

**Optimize Listing Distribution and Pricing:**

Utilize the insights on the distribution of listings across neighborhoods to optimize the allocation of resources and marketing efforts.
Consider adjusting pricing strategies based on room types and neighborhood demand to maximize revenue while remaining competitive.

**Host Engagement and Support:**

Provide support and resources to hosts, especially those with lower review counts, to help improve their performance and enhance guest satisfaction.
Offer training sessions or workshops on effective hosting practices, customer service, and property management to empower hosts to succeed on the platform.

**Enhance Customer Experience:**

Focus on enhancing the overall guest experience by encouraging hosts to maintain high standards of cleanliness, communication, and hospitality.
Implement guest feedback mechanisms to identify areas for improvement and address any issues promptly to ensure positive reviews and repeat bookings.

**Neighborhood Development and Expansion:**

Explore opportunities for expansion into neighborhoods with lower listing counts, potentially tapping into underserved markets and niche segments.
Collaborate with local authorities and community stakeholders to promote responsible tourism and sustainable growth in emerging neighborhoods.

**Data-Driven Decision Making:**

Continue conducting comprehensive analyses of the Airbnb dataset to identify trends, patterns, and opportunities for innovation.
Invest in data analytics tools and capabilities to enable real-time monitoring of market dynamics, customer preferences, and competitor strategies.

**Marketing and Branding:**

Develop targeted marketing campaigns tailored to specific neighborhoods and customer segments to increase brand awareness and attract new users.
Leverage success stories and testimonials from top hosts to build trust and credibility among potential guests and hosts.

**Regulatory Compliance and Community Engagement:**

Stay abreast of regulatory developments and compliance requirements in each market to ensure adherence to local laws and regulations.
Foster positive relationships with local communities through transparent communication, responsible hosting practices, and community engagement initiatives.

By implementing these strategies, the client can achieve their business objectives of enhancing the overall guest experience, optimizing host performance, and driving sustainable growth and profitability in the Airbnb ecosystem. These efforts will contribute to the long-term success and resilience of the platform while fostering positive relationships with hosts, guests, and local communities.

# **Conclusion**

In this EDA project, we delved into a comprehensive analysis of the Airbnb dataset. Our primary objective was to gain valuable insights into the distribution of listings, room types, host behavior, and the role of different neighborhoods on the Airbnb platform. Here's a summary of the key findings and conclusions:

**Distribution of Listings Across Neighborhoods:**

We observed that the distribution of listings across neighborhoods is not uniform.
Certain neighborhoods, such as Manhattan and Brooklyn, have a significantly higher number of listings, suggesting their popularity and high demand.
Other neighborhoods, like the Bronx and Staten Island, have a relatively lower number of listings, indicating they may be less popular or serve niche markets.

**Distribution of Room Types:**

The distribution of room types revealed a diverse range of accommodations on the Airbnb platform.
Entire homes or apartments were the most common room type, followed by private rooms, and shared rooms.
This diversity reflects the preferences and requirements of Airbnb guests, as well as host offerings.

**Insights into the Busiest Hosts:**

We analyzed the busiest hosts based on the number of reviews and found that some hosts have a substantial presence on the platform.
These busy hosts likely possess extensive hosting experience, strong reputations, and efficient management skills.
They may have developed effective strategies for maintaining high occupancy rates and delivering excellent customer service, although review counts alone cannot confirm this.

**Pricing and Preferences:**

The dataset allowed us to explore the relationship between room types and pricing, with entire homes often commanding higher prices.
Traveler preferences and market trends play a crucial role in shaping the distribution of room types and pricing dynamics.
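
A comparison of price by room type can be sketched as follows. The values are made up for illustration and are not results from the dataset; the median is used here because a few very expensive listings can pull the mean upward.

```python
import pandas as pd

# Toy data standing in for the Airbnb listings (values are made up).
toy = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Entire home/apt",
        "Private room", "Private room",
        "Shared room",
    ],
    "price": [200, 300, 80, 120, 40],
})

# Median price per room type, highest first.
price_by_room = (
    toy.groupby("room_type")["price"].median().sort_values(ascending=False)
)

print(price_by_room)
```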

In conclusion, this EDA project provided valuable insights into the Airbnb dataset, offering a glimpse into the behavior of hosts, the preferences of travelers, and the role of different neighborhoods in the Airbnb ecosystem.

These insights can inform strategic decisions for both hosts and Airbnb itself, enhancing the overall experience for users of the platform. Additionally, the findings can serve as a foundation for more in-depth analyses and data-driven decision-making in the context of the short-term rental market.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***