# **Project Name**    -



**Project Type** :  EDA

**Title** : AirBnb Bookings Analysis

**Contribution** : Individual


# **Project Summary -**

Since its inception in 2008, Airbnb has transformed the way people travel by offering a unique and personalized approach to experiencing the world. Today, Airbnb stands as a one-of-a-kind global service, recognized and embraced by people from all corners of the globe. The wealth of data generated by the millions of listings on the platform has become a cornerstone of Airbnb's operations. This data holds the key to unlocking a multitude of insights that can be harnessed for purposes ranging from security and business decisions to understanding customer and host behavior, guiding marketing strategies, and facilitating the implementation of innovative additional services.

The dataset comprises around 49,000 observations across 16 columns, with a mix of categorical and numeric values. It serves as a resource for examining key aspects of how Airbnb operates.

One of the primary objectives is to enhance security measures on the platform. By meticulously analyzing the data, it is possible to identify trends and patterns related to security incidents and problematic host or guest behavior. Such insights are indispensable for improving the overall safety and trustworthiness of the Airbnb ecosystem.

Moreover, this dataset can serve as a compass for making informed business decisions. It can shed light on various aspects such as the most popular types of listings, optimal pricing strategies, and the geographical areas with the highest demand. Armed with this knowledge, Airbnb can make strategic decisions to cater to the evolving needs of both hosts and guests.

Understanding user behavior is another crucial dimension. The dataset can unravel booking patterns, the most sought-after destinations, and the factors influencing user decisions. Airbnb can use these insights to tailor its services and offerings, ultimately enhancing the overall user experience.

Host performance is a pivotal aspect of the Airbnb platform, and the dataset provides an avenue for evaluating this performance. By examining metrics like host ratings, response times, and property types, Airbnb can offer support and guidance to hosts to improve their service quality and, by extension, the overall platform's reputation.

Geographical analysis is another facet of this exploration. It can help in identifying popular regions, areas with high demand, and regional trends that can steer Airbnb's business decisions. This insight can be invaluable in optimizing the allocation of resources and tailoring services to suit specific geographic needs.

Effective marketing initiatives also stand to benefit from this dataset. By identifying the amenities, features, and property characteristics that correlate with high demand, Airbnb can create marketing campaigns that resonate with its target audience. This knowledge ensures that promotional efforts are not just effective but also cater to what guests are seeking.

Lastly, the data can offer a springboard for implementing innovative additional services. By identifying trends and unmet needs within the dataset, Airbnb can expand its offerings, providing more value to both hosts and guests. This is essential for Airbnb's continuous growth and evolution.

In conclusion, this dataset provides an invaluable resource for Airbnb to enhance its security, make data-driven business decisions, understand and cater to user behavior, guide marketing strategies, and pave the way for innovative services. By exploring and analyzing this data, Airbnb can continue to shape and redefine the way people experience the world, ensuring its place as a trailblazer in the travel industry.

# **GitHub Link -**

https://github.com/gunagreeshma/EDA_PYTHON/blob/main/EDA_Capstone_project.ipynb

# **Problem Statement**


Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalised way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analysing the data from the millions of listings on the platform is crucial for the company. These listings generate a large volume of data that can be analysed and used for security, business decisions, understanding customers' and hosts' behaviour and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more. This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. The task is to explore and analyse the data to discover key insights.

#### **Define Your Business Objective?**

The overarching business objective for Airbnb's data analysis of the dataset with approximately 49,000 observations and 16 columns, comprising a mix of categorical and numeric values, can be summarized as follows:

**To Optimize and Innovate the Airbnb Platform for Enhanced Customer Experience and Sustainable Growth**

This business objective can be further broken down into specific goals:

1. **Security Enhancement:** Utilize data analysis to identify security trends and patterns, thereby enhancing safety and trust within the Airbnb ecosystem. The goal is to create a secure environment for both hosts and guests, reducing incidents of fraud and other security concerns.

2. **Business Decision Support:** Leverage data insights to make informed, data-driven decisions in areas such as pricing strategies, property types, and geographical focus. The objective is to maximize revenue and improve the efficiency of Airbnb's operations.

3. **User Behavior Understanding:** Analyze user behavior patterns to understand booking preferences, popular destinations, and user decision-making factors. The aim is to tailor the platform to better meet the needs and desires of Airbnb's diverse user base.

4. **Host Performance Improvement:** Evaluate host performance by examining metrics such as ratings, response times, and property quality. The business objective is to provide guidance and support to hosts to enhance the quality of their listings and improve the overall reputation of the platform.

5. **Marketing Strategy Enhancement:** Identify amenities, features, and property characteristics that correlate with high demand, and use this information to develop marketing campaigns that resonate with the target audience. The goal is to increase the effectiveness of marketing efforts and attract more users.

6. **Innovative Service Implementation:** Use data analysis to uncover trends and unmet needs within the platform, which can lead to the implementation of innovative additional services. The objective is to expand Airbnb's offerings, providing more value to both hosts and guests and ensuring the platform's continuous growth and competitiveness.

In summary, the primary business objective is to leverage data analysis to improve the safety, efficiency, and overall user experience of the Airbnb platform. By doing so, Airbnb aims to maintain its position as a global leader in the travel industry, attract more users, and continually innovate to meet the evolving needs and expectations of its diverse user base.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
pip install pandas #install the pandas library

In [None]:
# Import Libraries
import pandas as pd #Pandas allows for quick exploration of data.
import numpy as np #useful for performing operations on entire columns or rows of a dataset without the need for explicit loops.

The short form "pandas" comes from the term "Panel Data," which is a type of multidimensional, structured dataset commonly used in statistics and econometrics. The pandas library in Python is designed for data manipulation and analysis, especially for working with tabular and labeled data, making it a fitting name for the library.




NumPy is short for Numerical Python. The name reflects its primary purpose—to provide a powerful foundation for numerical operations and mathematical functions in the Python programming language.

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
filepath= "/content/Airbnb NYC 2019 (1).csv"
auto_df=pd.read_csv(filepath) # Reads a comma-separated values (CSV) file into a DataFrame.

### Dataset First View

In [None]:
# Dataset First Look
auto_df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns counts
count_rows, count_columns = auto_df.shape  # shape returns (number of rows, number of columns)
print(f"Rows: {count_rows}, Columns: {count_columns}")


### Dataset Information

In [None]:
# Dataset Info
auto_df.info()

In [None]:
# Display the first n rows of the DataFrame (default n=5)
auto_df.head()


In [None]:
# Display the last  n rows of the DataFrame (default n=5)
auto_df.tail()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Use the duplicated() method to identify duplicate rows
duplicates = auto_df[auto_df.duplicated(keep='first')]

# Count the duplicate rows
duplicate_count = duplicates.shape[0]
duplicate_count

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Use the isna() or isnull() method to identify missing values
missing_values = auto_df.isna()

# Count the missing values in each column
missing_count = missing_values.sum()


In [None]:
# Visualizing the missing values
missing_count

### What did you know about your dataset?

id, host_id, neighbourhood_group, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, calculated_host_listings_count, and availability_365: These columns have no missing values (the missing count is 0), so every row in the dataset has a value for them.

**name:** The "name" column has 16 missing values. You have 16 rows where the "name" field is not provided or is empty.

**host_name:** The "host_name" column has 21 missing values. You have 21 rows where the "host_name" field is not provided or is empty.

**last_review and reviews_per_month** : Both "last_review" and "reviews_per_month" columns have 10,052 missing values, so a significant number of rows have no data in these columns. This could be due to properties that have never received reviews, or the data for these columns might be missing for some other reason.
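To put the missing counts in proportion, it can help to show them next to the percentage of rows affected. The following is a minimal sketch on a small made-up frame; `toy_df` and its values are hypothetical stand-ins for `auto_df`, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for auto_df (values are made up)
toy_df = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [120, 80, 95, 150],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# Count of missing values per column
missing_count = toy_df.isna().sum()

# Share of rows that are missing, as a percentage (mean of a boolean mask)
missing_pct = (toy_df.isna().mean() * 100).round(1)

# Combine both views into one summary table
missing_summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(missing_summary)
```

The same three lines should work on `auto_df` by replacing `toy_df`.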

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
columns = auto_df.columns # Get the column names of the DataFrame 'auto_df' and store them in the 'columns' variable
columns


To get a list of all the column names in the Airbnb dataset, use auto_df.columns. It returns the names of all the columns in the dataset, which you can then use in data manipulation and analysis operations.

The **auto_df.describe()** function is typically used with a Pandas DataFrame, and it provides a summary of the descriptive statistics for the numeric columns in the DataFrame auto_df. Here's what it does:

In [None]:
# Dataset Describe
# Calculate descriptive statistics for the DataFrame 'auto_df' using the describe() method
description = auto_df.describe()

# Display the summary statistics, including count, mean, std, min, 25%, 50%, 75%, and max
print(description)

### Variables Description

The "Airbnb NYC 2019" dataset contains the following variables, each describing an aspect of an Airbnb listing:

ID: A unique identifier for each Airbnb listing.

Name: The name or title of the listing.

Host ID: A unique identifier for the host of the listing.

Host Name: The name of the host.

Neighbourhood Group: The borough in which the listing is located (for example, Manhattan or Brooklyn).

Neighborhood: The neighborhood or area where the listing is located.

Latitude: The latitude coordinates of the listing's location.

Longitude: The longitude coordinates of the listing's location.

Room Type: The type of room or accommodation.

Price: The nightly price for renting the listing.

Minimum Nights: The minimum number of nights required for booking.

Number of Reviews: The total number of reviews for the listing.

Last Review Date: The date of the last review.

Reviews Per Month: The average number of reviews per month.

Calculated Host Listings Count: A calculated count of the host's listings.

Availability 365: The number of days the listing is available in a year.

### Check Unique Values for each variable.

**auto_df.nunique()** is a Pandas DataFrame method that is used to count the number of unique values in each column of the DataFrame auto_df. The result is a Pandas Series where the index corresponds to the column names, and the values represent the count of distinct (unique) values in each column.

In [None]:
# Check Unique Values for each variable.
# Use the nunique() method to count unique values for each column
unique_values = auto_df.nunique()
unique_values

## 3. ***Data Wrangling***

### Data Wrangling Code

Handling missing values

Handling missing values in Python involves identifying and dealing with data points or entries in a dataset that are missing or undefined. Missing values are a common issue in real-world datasets, and addressing them is crucial for accurate and reliable data analysis and modeling. The approach used here is to drop the rows that have missing values in the affected columns.

In [None]:
# Remove rows with null values in specified columns
auto_df.dropna(subset=['name', 'host_name', 'last_review', 'reviews_per_month'], inplace=True)

# Check if there are any remaining null values in the specified columns
print(auto_df[['name', 'host_name', 'last_review', 'reviews_per_month']].isnull().sum())


auto_df.dropna(subset=['name', 'host_name', 'last_review', 'reviews_per_month'], inplace=True):

The dropna method is used to remove rows with null values from the DataFrame (auto_df).

The subset parameter is set to a list of columns ('name', 'host_name', 'last_review', 'reviews_per_month') where null values should be checked.

The inplace=True parameter ensures that the changes are made directly to the existing DataFrame (auto_df) without creating a new one.

print(auto_df[['name', 'host_name', 'last_review', 'reviews_per_month']].isnull().sum()):

The isnull().sum() part calculates the count of null values for each specified column ('name', 'host_name', 'last_review', 'reviews_per_month').

In summary, this code removes rows with null values in the specified columns from the DataFrame and then checks if there are any remaining null values in those columns, providing a summary of the null value counts after the removal.
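Dropping rows is only one option. An alternative, which is not what this notebook does, is to fill the gaps so that listings without reviews stay in the analysis. The sketch below compares both approaches on a hypothetical `toy_df`; the fill values (`0` and `"Unknown"`) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for auto_df (values are made up)
toy_df = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room"],
    "number_of_reviews": [12, 3, 0],
    "last_review": ["2019-05-01", "2019-06-10", None],
    "reviews_per_month": [0.8, 0.2, np.nan],
})

# Option 1 (the notebook's approach): drop rows with a null in any listed column
dropped = toy_df.dropna(subset=["name", "last_review", "reviews_per_month"])

# Option 2 (assumed alternative): keep every row and fill the gaps instead
filled = toy_df.copy()
filled["reviews_per_month"] = filled["reviews_per_month"].fillna(0)
filled["name"] = filled["name"].fillna("Unknown")

print(f"original: {len(toy_df)}, after dropna: {len(dropped)}, after fillna: {len(filled)}")
```

Which option is appropriate depends on the question being asked: dropping shrinks the dataset to reviewed listings only, while filling keeps unreviewed listings at the cost of an assumed value.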

In [None]:
missing_values = auto_df.isna()

# Count the missing values in each column
missing_count = missing_values.sum()
missing_count

In [None]:
auto_df

OUTLIERS

Outliers are data points that significantly differ from the majority of the data in a dataset. They are values that are unusually high or low when compared to the rest of the data and can be considered as extreme observations. Outliers can arise due to various reasons, including measurement errors, data entry errors, natural variation in data, or even as genuine, unusual observations.
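One common way to make "unusually high or low" concrete is the 1.5 × IQR rule, which is also the default whisker rule in a seaborn box plot. The sketch below applies it to a made-up price series; it illustrates the rule and is not the filter this notebook uses.

```python
import pandas as pd

# Hypothetical prices with one extreme value (made-up data)
prices = pd.Series([50, 60, 65, 70, 75, 80, 90, 100, 1000], name="price")

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(f"fences: [{lower_fence}, {upper_fence}]")
print(outliers)
```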


Seaborn is a data visualization library based on Matplotlib that provides a high-level interface for creating informative and attractive statistical graphics.

Once you have imported seaborn using this line, you can use its functions to create various types of statistical visualizations with minimal code. Seaborn simplifies the process of creating visually appealing plots and is often used in conjunction with pandas DataFrames

The line import matplotlib.pyplot as plt is used to import the pyplot module from the matplotlib library in Python and alias it as plt. Matplotlib is a powerful plotting library, and pyplot provides a convenient interface to create various types of plots and visualizations.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot of the price feature
sns.boxplot(x="price", data=auto_df)

# Display the plot
plt.show()


In [None]:
# Calculate the 25th and 95th percentiles of the 'price' column
lower_percentile = auto_df['price'].quantile(0.25)
upper_percentile = auto_df['price'].quantile(0.95)

# Create a boolean mask to filter rows where the 'price' is between the 25th and 95th percentiles
auto_df = auto_df[(auto_df['price'] >= lower_percentile) &
                  (auto_df['price'] <= upper_percentile)]

# The resulting DataFrame 'auto_df' now contains rows where the 'price' is within the specified percentile range.
auto_df

### What all manipulations have you done and insights you found?

Two manipulations were applied. First, rows with missing values in 'name', 'host_name', 'last_review', or 'reviews_per_month' were dropped. Second, the 'price' column was filtered to keep only the rows between its 25th and 95th percentiles. Box plots are a valuable tool for identifying outliers: they show the data's distribution visually, so points that differ significantly from the majority of the data, and that can distort statistical analyses, are easy to spot. The box plot below shows the 'price' distribution after the filter.
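The choice of percentile band determines how many rows survive the filter. The sketch below compares two bands on a made-up series; `keep_between_percentiles` is a hypothetical helper written for this illustration.

```python
import pandas as pd

# Hypothetical prices 1..100 (made-up data, evenly spread)
prices = pd.Series(range(1, 101), name="price")

def keep_between_percentiles(s, lower_q, upper_q):
    """Return the values of s that lie between the two quantiles (inclusive)."""
    lo, hi = s.quantile(lower_q), s.quantile(upper_q)
    return s[s.between(lo, hi)]

kept_25_95 = keep_between_percentiles(prices, 0.25, 0.95)
kept_05_95 = keep_between_percentiles(prices, 0.05, 0.95)

print(f"25th-95th keeps {len(kept_25_95)} of {len(prices)} rows")
print(f"5th-95th keeps {len(kept_05_95)} of {len(prices)} rows")
```

A lower bound at the 25th percentile removes roughly the cheapest quarter of the rows, so it is worth checking the retained row count before relying on the filtered data.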

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot of the price feature
sns.boxplot(x="price", data=auto_df)

# Display the plot
plt.show()

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
auto_df

In [None]:
# Chart - 1 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Barplot of price vs. neighbourhood_group
sns.barplot(x='neighbourhood_group', y='price', data=auto_df)

# Set the title of the plot
plt.title('Average Price by Neighbourhood Group', fontsize=15)

# Set the x-axis label
plt.xlabel('Neighbourhood Group', fontsize=14)

# Set the y-axis label
plt.ylabel('Price', fontsize=14)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar plot of 'price' across 'neighbourhood_group' to visually compare the average price in each neighbourhood group, making it easy to identify differences and trends.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that Manhattan has the highest average prices, followed by Brooklyn, indicating varying pricing dynamics across neighborhoods.
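The bar heights in a seaborn bar plot are group means by default, so the ranking can be checked numerically with a `groupby`. This is a minimal sketch on a hypothetical `toy_df` with made-up prices.

```python
import pandas as pd

# Hypothetical toy frame standing in for auto_df (values are made up)
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "price": [200, 150, 120, 100, 80],
})

# Mean price per neighbourhood group, highest first
avg_price = (
    toy_df.groupby("neighbourhood_group")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_price)
```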

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the price distribution across neighborhoods can inform pricing strategies, marketing, and customer targeting, potentially leading to increased revenue and customer satisfaction.


While higher prices in Manhattan might attract luxury-seeking guests, it could potentially limit affordability for budget-conscious travelers, impacting overall booking volume. However, this depends on the target market and business strategy.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Histplot for price
sns.histplot(x='price', data=auto_df)

# Set the title of the plot
plt.title('Distribution of Prices', fontsize=15)

# Set the x-axis label
plt.xlabel('Price', fontsize=14)

# Set the y-axis label
plt.ylabel('Frequency', fontsize=14)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram for 'price' to visualize the distribution of prices, providing insights into the overall pricing structure and identifying common price ranges.

##### 2. What is/are the insight(s) found from the chart?

The histogram indicates that the majority of prices are concentrated within a specific range, with a right-skewed distribution suggesting a few higher-priced listings.
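Right skew can also be checked numerically: a positive skewness coefficient and a mean above the median both point to a long right tail. The sketch below uses a made-up price series, not values from the dataset.

```python
import pandas as pd

# Hypothetical prices with a few expensive listings (made-up data)
prices = pd.Series([60, 70, 75, 80, 85, 90, 100, 150, 400], name="price")

skewness = prices.skew()        # sample skewness; > 0 means right-skewed
mean_price = prices.mean()
median_price = prices.median()

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_price:.1f}, median: {median_price:.1f}")
```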

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the price distribution can guide pricing strategies, helping to set competitive rates and optimize revenue potential.

A concentration of listings at lower price points may suggest increased competition and potential pressure on profitability. However, this depends on market demand and business objectives, as lower prices could also attract a larger customer base.

#### Chart - 3

In [None]:
import plotly.graph_objects as go
import pandas as pd

# Calculate the counts of each 'neighbourhood_group'
counts = auto_df['neighbourhood_group'].value_counts()

# Create a DataFrame for the counts
table_data = pd.DataFrame({'Neighbourhood Group': counts.index, 'Number of Listings': counts.values})

# Create a table chart
fig = go.Figure(data=[go.Table(
    header=dict(values=['Neighbourhood Group', 'Number of Listings']),
    cells=dict(values=[table_data['Neighbourhood Group'], table_data['Number of Listings']])
)])

# Set the title of the plot
fig.update_layout(title='Distribution of Host Listings by Neighbourhood Group (Table Chart)', title_font_size=15)

# Show the plot
fig.show()


##### 1. Why did you pick the specific chart?

I chose a table chart to clearly present the counts of host listings in each neighborhood, providing a precise and easily interpretable summary.

##### 2. What is/are the insight(s) found from the chart?

The table chart indicates that Manhattan and Brooklyn have a higher number of listings compared to other neighborhoods, suggesting they are key areas for Airbnb host activity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can guide business decisions, allowing targeted marketing efforts, resource allocation, and service enhancements in areas with higher demand.

There are no immediate insights suggesting negative growth. However, sustained high demand in specific neighborhoods may lead to increased competition and potential challenges in maintaining consistent pricing strategies. Continuous monitoring and adaptation are crucial to address such scenarios.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Average availability (days per year) for each room type, computed from the dataset
# instead of hard-coded values
availability_by_room = auto_df.groupby('room_type')['availability_365'].mean()

# Create a pie plot
plt.figure(figsize=(6, 6))  # Optional: set the figure size
plt.pie(availability_by_room, labels=availability_by_room.index, autopct='%1.1f%%', startangle=90)

# Customize the plot (optional)
plt.title('Average Availability (Days per Year) by Room Type')

# Show the plot
plt.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.
plt.show()






##### 1. Why did you pick the specific chart?

I chose a pie chart to compare availability over 365 days across room types. Each slice of the pie represents a room type, and the size of the slice corresponds to that room type's share of the availability values plotted.

##### 2. What is/are the insight(s) found from the chart?

The chart can offer insights into how availability differs between room types. For instance, if "private room" has the largest slice, it suggests that private rooms are available for more days of the year than the other room types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

While the pie chart itself doesn't have a direct impact, the insights gained from it can influence decision-making. The pie chart is a useful visualization for understanding how availability over 365 days is distributed across room types. These insights can be valuable for both hosts and travelers, helping them make informed decisions about Airbnb listings or bookings.

#### Chart - 6

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Count the listings in each 'neighbourhood' and keep the 10 largest
top_10_neigbourhoods = auto_df['neighbourhood'].value_counts().nlargest(10)

# Define colors for the donut chart
colors = sns.color_palette('pastel')

# Create a donut chart for the top 10 neighborhoods
plt.figure(figsize=(8, 8))
plt.pie(top_10_neigbourhoods, labels=top_10_neigbourhoods.index, autopct='%1.1f%%', colors=colors, startangle=90, wedgeprops=dict(width=0.4))

# Draw a circle in the center to create a donut chart
centre_circle = plt.Circle((0,0),0.3,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Set the title of the plot
plt.title('Percentage of Listings by Top 10  Neighborhoods  (Donut Chart)', fontsize=15)

# Show the plot
plt.show()





##### 1. Why did you pick the specific chart?

I chose a donut chart to effectively convey the proportion of listings in the top neighborhoods, highlighting Williamsburg and Bedford-Stuyvesant as the leading areas with their respective percentages.

##### 2. What is/are the insight(s) found from the chart?

The donut chart indicates that Williamsburg has the highest share of listings among the top 10 neighborhoods at 17.9%, followed by Bedford-Stuyvesant at 14.4%, providing a clear visual hierarchy of the top neighborhoods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can inform targeted marketing strategies, focusing on popular neighborhoods like Williamsburg and Bedford-Stuyvesant to attract more guests and potentially increase revenue.


There are no immediate insights suggesting negative growth. However, over-reliance on a few popular neighborhoods may pose challenges if demand decreases or if competition intensifies, emphasizing the need for diversification and ongoing market analysis.
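Because the donut chart is drawn from the top 10 neighborhoods only, its percentages are shares of that subset rather than of all listings. The sketch below shows the difference on made-up data; the neighborhood labels `A` to `D` are hypothetical.

```python
import pandas as pd

# Hypothetical neighbourhood labels (made-up data): A x4, B x3, C x2, D x1
neighbourhoods = pd.Series(["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"] * 1)

# Keep the two largest groups, mirroring value_counts().nlargest(10) in the notebook
top = neighbourhoods.value_counts().nlargest(2)

# Share within the top groups (what a pie of `top` displays)
share_within_top = (top / top.sum() * 100).round(1)

# Share of all listings
share_of_all = (top / len(neighbourhoods) * 100).round(1)

print(pd.DataFrame({"within_top": share_within_top, "of_all": share_of_all}))
```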

#### Chart - 7

In [None]:
# Chart - 7 visualization code
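# Group the data by neighbourhood_group and count the host_id entries in each group.
# Note: the dataset has one row per listing, so a host with several listings is counted
# once per listing; use .nunique() instead of .count() to count distinct hosts.
hosts_per_location = auto_df.groupby('neighbourhood_group')['host_id'].count()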

# Get the list of neighbourhood_group names
locations = hosts_per_location.index

# Get the list of host counts for each neighbourhood_group
host_counts = hosts_per_location.values

# Set the figure size
plt.figure(figsize=(12, 5))

# Create the line chart with some experiments using marker function
plt.plot(locations, host_counts, marker='o', ms=12, mew=4, mec='r')

# Add a title and labels to the x-axis and y-axis
plt.title('Number of Active Hosts per Location', fontsize='15')
plt.xlabel('Location', fontsize='14')
plt.ylabel('Number of Active Hosts', fontsize='14')

# Show the plot
plt.show()






##### 1. Why did you pick the specific chart?

I chose a line chart to effectively visualize and compare the number of active hosts across different neighborhood groups. The line chart allows for a clear representation of the trends and variations in host counts.

##### 2. What is/are the insight(s) found from the chart?

The line chart highlights that Manhattan has the highest number of hosts, followed by Brooklyn. The significant difference in host counts between Brooklyn and Manhattan compared to Queens and the Bronx suggests a concentration of Airbnb hosting activity in Manhattan and Brooklyn. Staten Island has the fewest hosts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:

Yes, the insights can guide business strategies such as targeted marketing efforts, service enhancements, and resource allocation to focus on the high-demand areas like Manhattan and Brooklyn. This targeted approach can positively impact customer engagement and revenue.

Negative growth:


There isn't immediate evidence of negative growth in the chart. However, if the concentration of hosts in Manhattan and Brooklyn leads to increased competition or if demand decreases in those areas, it could potentially impact the growth of hosts. Diversification and monitoring market trends are crucial to mitigate such risks and foster sustainable growth.
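When a dataset has one row per listing, counting `host_id` values per group gives the number of listings, while `nunique()` gives the number of distinct hosts. The sketch below shows the two side by side on a hypothetical `toy_df`.

```python
import pandas as pd

# Hypothetical toy frame: host 1 owns two listings in Manhattan (made-up data)
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "host_id": [1, 1, 2, 3, 4],
})

grouped = toy_df.groupby("neighbourhood_group")["host_id"]

listings_per_group = grouped.count()    # one per row, i.e. per listing
hosts_per_group = grouped.nunique()     # distinct host_id values

print(pd.DataFrame({"listings": listings_per_group, "hosts": hosts_per_group}))
```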








#### Chart - 8

In [None]:
# Chart - 8 visualization code


# Count the listings in each neighbourhood group
neighbourhood_group_counts = auto_df['neighbourhood_group'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(neighbourhood_group_counts, labels=neighbourhood_group_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of neighbourhood_group')
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is an effective choice when you want to show the proportion of different categories within a whole. In this case, it represents the share of listings in each neighbourhood group relative to the total.

Neighbourhood groups are categorical data, and a pie chart is a suitable way to display the relative sizes of these categories.

##### 2. What is/are the insight(s) found from the chart?

In terms of listing counts, Manhattan and Brooklyn are the two largest neighbourhood groups, each accounting for about 42% of the listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The listing shares of Manhattan and Brooklyn, each about 42%, don't offer enough detail on their own to assess positive or negative business impact. To determine impact, additional factors such as overall market demand, competitor analysis, and specific business goals would be needed. Without this context, it's challenging to conclude positive or negative impacts definitively.








#### Chart - 9

In [None]:
# Group the DataFrame by the minimum_nights column and count the number of rows in each group
min_nights_count = auto_df.groupby('minimum_nights').size().reset_index(name='count')

# Sort the resulting DataFrame in descending order by the count column
min_nights_count = min_nights_count.sort_values('count', ascending=False)

# Select the top 7 rows
min_nights_count = min_nights_count.head(7)

# Reset the index
min_nights_count = min_nights_count.reset_index(drop=True)

# Set the figure size
plt.figure(figsize=(12, 4))

# Create the bar plot
plt.bar(min_nights_count['minimum_nights'], min_nights_count['count'])

# Add axis labels and a modified title
plt.xlabel('Minimum Nights', fontsize='14')
plt.ylabel('Count', fontsize='14')
plt.title('Distribution of Minimum Nights Requirements (Top 7)', fontsize='15')

# Show the plot
plt.show()




##### 1. Why did you pick the specific chart?

The bar chart was chosen because it effectively illustrates the distribution of minimum nights requirements, providing a clear visual representation of the prevalence of specific stay durations.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals a significant concentration of listings with minimum nights set between 0 and 5, suggesting a strong orientation toward shorter stays in the Airbnb dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can positively impact business by guiding marketing strategies to emphasize and cater to the demand for short stays. Offering promotions, highlighting flexibility, and ensuring seamless short-term experiences can attract a broader audience and enhance customer satisfaction. However, there may be negative growth if the business heavily relies on longer-term rentals, as the high count in the 0 to 5 range might overshadow these offerings, requiring a strategic balance to maintain diversity in stay durations.
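The concentration of short minimum stays can also be expressed as a single number. The sketch below assumes a hypothetical `toy_df` with invented `minimum_nights` values, and the cut-off of 5 nights is taken from the range discussed above rather than from any rule in the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for auto_df; the values are invented.
toy_df = pd.DataFrame({"minimum_nights": [1, 1, 2, 3, 5, 7, 30, 30, 2, 1]})

# Assumed cut-off for a "short stay", mirroring the 0 to 5 range in the text.
SHORT_STAY_MAX_NIGHTS = 5

# Percentage of listings whose minimum stay is at or below the cut-off.
is_short = toy_df["minimum_nights"] <= SHORT_STAY_MAX_NIGHTS
short_stay_share = round(is_short.mean() * 100, 1)

# The most common minimum-night values, ordered from most to least frequent.
top_values = toy_df["minimum_nights"].value_counts().head(7)

print(f"Listings with minimum_nights <= {SHORT_STAY_MAX_NIGHTS}: {short_stay_share}%")
print(top_values)
```

Reporting the share alongside the bar chart makes the size of the short-stay segment explicit instead of leaving it to be read off the bars.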

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Group the data by neighborhood group and calculate the total number of reviews
reviews_by_neighbourhood_group = auto_df.groupby("neighbourhood_group")["number_of_reviews"].sum()

# Create a pie chart
plt.pie(reviews_by_neighbourhood_group, labels=reviews_by_neighbourhood_group.index, autopct='%1.1f%%')
plt.title("Number of Reviews by Neighborhood Group", fontsize='15')

# Display the chart
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart was chosen for its ability to visually represent the proportion of total reviews contributed by each neighborhood group, making it easy to understand the distribution of reviews across different areas.

##### 2. What is/are the insight(s) found from the chart?

The pie chart reveals that Manhattan and Brooklyn collectively account for a significant majority of total Airbnb reviews, highlighting their popularity among guests. Queens also contributes a noteworthy share, while the Bronx and Staten Island have comparatively smaller portions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: The insights can positively impact business by guiding marketing efforts and service enhancements in Manhattan and Brooklyn, where the majority of reviews are concentrated. Strategies such as targeted promotions, improved services, and guest experiences can be implemented to capitalize on these popular areas.

Negative Growth: There might be potential negative growth if the business focuses solely on areas with smaller review shares (Bronx and Staten Island) without considering the more popular neighborhoods. Overemphasis on less-reviewed areas might lead to resource misallocation and could negatively impact overall growth. It's essential to balance strategies and investments across all neighborhood groups for sustained growth.
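Total reviews grow with the number of listings, so a group's share of reviews can simply reflect its size. As a hedged sketch, the table below puts the review share next to the average reviews per listing. The `toy_df` frame and its numbers are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for auto_df; the values are invented.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens", "Bronx"],
    "number_of_reviews": [10, 30, 20, 40, 60, 40],
})

# Total reviews and number of listings per neighbourhood group.
summary = toy_df.groupby("neighbourhood_group")["number_of_reviews"].agg(
    total_reviews="sum", listings="count"
)

# Share of all reviews (what the pie chart shows), in percent.
summary["review_share_pct"] = (
    summary["total_reviews"] / summary["total_reviews"].sum() * 100
).round(1)

# Average reviews per listing, which removes the effect of group size.
summary["reviews_per_listing"] = (summary["total_reviews"] / summary["listings"]).round(1)

print(summary)
```

In this toy example the group with a single listing has the highest reviews per listing even though its share of reviews is not the largest, which is the kind of difference the pie chart alone cannot show.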

#### Chart - 11

In [None]:
# Create a figure with a default size of (10, 8)
f, ax = plt.subplots(figsize=(10, 8))

# Create a boxplot to display the distribution of reviews per month for each room type in the Airbnb NYC dataset
ax = sns.boxplot(x='room_type', y='reviews_per_month', hue='neighbourhood_group', dodge=True, data=auto_df, palette='Set1')

# Set the title of the plot
ax.set_title('Distribution of Reviews per Month for each Room Type in Neighbourhood Groups', fontsize='14')

# Display the plot
plt.show()






##### 1. Why did you pick the specific chart?

The box plot was chosen because it effectively displays the distribution of reviews per month for each room type across different neighborhood groups. It allows for easy comparison of central tendency, spread, and potential outliers in the data.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests that Private rooms generally receive the highest number of reviews per month, particularly in Manhattan, where Private room values extend beyond 50 reviews/month. This points to strong demand for Private rooms in Manhattan.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: The gained insights can have a positive impact on business strategies by directing marketing efforts and resource allocation towards Private rooms, especially in Manhattan. Tailoring promotions, enhancing services, and optimizing availability for Private rooms in Manhattan may attract more guests and improve overall customer satisfaction.

Negative Growth: If the business heavily invests resources in areas or room types with lower review rates, it may experience negative growth as it might not align with customer preferences. Focusing on areas or room types with higher review rates, such as Private rooms in Manhattan, is crucial for positive growth. Striking a balance between customer demand and business capabilities is essential for sustained success.
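A box plot mixes typical values (the box and median line) with individual extreme points, so it can help to separate the two numerically. The sketch below uses a hypothetical `toy_df` with invented values and applies the 1.5 × IQR rule that box plots use by default to flag outliers.

```python
import pandas as pd

# Hypothetical toy data standing in for auto_df; the values are invented.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 10,
    "room_type": ["Private room"] * 5 + ["Entire home/apt"] * 5,
    "reviews_per_month": [0.5, 1.0, 1.5, 2.0, 55.0, 0.4, 0.8, 1.2, 1.6, 2.0],
})

grouped = toy_df.groupby(["neighbourhood_group", "room_type"])["reviews_per_month"]

# Median per group: the line drawn inside each box.
medians = grouped.median()

# Quartiles, broadcast back to every row of the matching group.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))

# Points above Q3 + 1.5 * IQR are drawn as individual outlier markers by default.
upper_fence = q3 + 1.5 * (q3 - q1)
toy_df["is_outlier"] = toy_df["reviews_per_month"] > upper_fence

print(medians)
print(toy_df[toy_df["is_outlier"]])
```

If very high values turn out to be flagged outliers while the medians stay low, the demand signal rests on a few listings rather than on the room type as a whole.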

#### Chart - 12

In [None]:
# Chart - 12 visualization code

In [None]:
# Get the top 5 hosts by listing count
top_hosts = auto_df['host_name'].value_counts().head(5)

# Create a bar plot of the top 5 hosts
top_hosts.plot(kind='bar', color='peru', figsize=(18, 7))

# Set the x-axis label
plt.xlabel('Host Name', fontsize=14)

# Set the y-axis label
plt.ylabel('Total Listings', fontsize=14)

# Set the title of the plot
plt.title('Top 5 Hosts by Number of Listings', fontsize=15)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the top 5 hosts by listing count because it's an effective way to compare listing counts across hosts and identify the most heavily represented ones.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the variation in listing counts among the top hosts, which may indicate potential opportunities for property owners and hosts to consider.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Insights gained from this chart can help property owners and hosts make informed decisions about where to invest in additional listings. They can focus on neighborhoods with high demand to maximize bookings and revenue.

Negative Growth: If the chart shows that certain neighborhoods have very low listing counts despite their potential, it may indicate missed opportunities for growth. Property owners may consider expanding their offerings in these areas to capture more demand.
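The chart code counts listings by `host_name`, which merges different hosts who share a first name. A hedged alternative, assuming the dataset also carries a `host_id` column (an assumption, since the notebook code only uses `host_name`), is to count by identifier and report what share of listings the largest hosts hold. The `toy_df` values are invented.

```python
import pandas as pd

# Hypothetical toy data; host_id is assumed to exist alongside host_name.
toy_df = pd.DataFrame({
    "host_id": [1, 1, 1, 2, 2, 3, 4, 5],
    "host_name": ["Alex", "Alex", "Alex", "Sam", "Sam", "Alex", "Kim", "Lee"],
})

# Counting by name merges the two different hosts called "Alex".
by_name = toy_df["host_name"].value_counts().head(5)

# Counting by (host_id, host_name) keeps them apart.
by_id = (
    toy_df.groupby(["host_id", "host_name"])
    .size()
    .sort_values(ascending=False)
    .head(5)
)

# Share of all listings held by the two largest hosts, in percent.
top_share_pct = round(by_id.head(2).sum() / len(toy_df) * 100, 1)

print(by_name)
print(by_id)
print(f"Top 2 hosts hold {top_share_pct}% of listings")
```

Here "Alex" appears to own four listings when counted by name but only three when counted by identifier, which shows how name-based counts can overstate a host's footprint.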

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Set custom colors for each room type
room_type_colors = {'Entire home/apt': 'skyblue', 'Private room': 'lightgreen', 'Shared room': 'lightcoral'}

# Set the size of the plot
plt.rcParams['figure.figsize'] = (10, 6)

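# Count listings per neighbourhood group and room type, then create a stacked bar chart with pandas plotting
room_counts = auto_df.groupby(['neighbourhood_group', 'room_type']).size().unstack(fill_value=0)
# Look up colors by column name so each room type gets its intended color
ax = room_counts.plot(kind='bar', stacked=True, color=[room_type_colors[room] for room in room_counts.columns])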

# Calculate the total number of room_type values
total = len(auto_df['room_type'])

# Add percentage labels to each bar in the plot
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2
    y = p.get_y() + p.get_height() / 2
    ax.annotate(percentage, (x, y), ha='center', va='center')

# Add a title to the plot
plt.title('Count of Each Room Type in Neighbourhood Groups in NYC', fontsize='15')

# Add a label to the x-axis
plt.xlabel('Neighbourhood Group', fontsize='14')

# Rotate the x-tick labels
plt.xticks(rotation=0)

# Add a label to the y-axis
plt.ylabel('Rooms', fontsize='14')

# Display the plot
plt.show()




##### 1. Why did you pick the specific chart?

The stacked bar chart was chosen to visualize the distribution of room types across neighbourhood groups, allowing for a clear comparison of each room type's share of all listings, particularly within the two crucial neighbourhood groups, Manhattan and Brooklyn.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that in Manhattan, Entire home/apt dominates with 38.6%, indicating a preference for full accommodations. In Brooklyn, while Entire home/apt is also popular, the percentage is lower at 30.2%, and Private room is the second-highest choice at 13.8%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can inform targeted marketing and property management strategies, catering to the prevalent room preferences in each neighborhood. For instance, in Manhattan, focusing on Entire home/apt listings may lead to positive business impact. However, negative growth could occur if the business does not adapt to changing preferences or if there's an oversaturation of a particular room type. Adjusting strategies based on customer preferences is key.
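The labels in the chart divide each segment by the total number of listings, so they describe shares of the whole dataset rather than shares within a neighbourhood group. A minimal sketch of both views using `pd.crosstab`, with a hypothetical `toy_df` and invented values:

```python
import pandas as pd

# Hypothetical toy data standing in for auto_df; the values are invented.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Queens"] * 2,
    "room_type": ["Entire home/apt"] * 3 + ["Private room"]
                 + ["Entire home/apt"] * 2 + ["Private room"] * 2
                 + ["Private room"] * 2,
})

# Each cell as a percentage of ALL listings (what the chart labels show).
overall_pct = pd.crosstab(
    toy_df["neighbourhood_group"], toy_df["room_type"], normalize="all"
).mul(100).round(1)

# Each cell as a percentage of listings WITHIN its neighbourhood group.
within_pct = pd.crosstab(
    toy_df["neighbourhood_group"], toy_df["room_type"], normalize="index"
).mul(100).round(1)

print(overall_pct)
print(within_pct)
```

The within-group table is the one to use when comparing room-type preferences between groups of different sizes, since every row sums to 100%.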

#### Chart - 14 - Correlation Heatmap

In [None]:
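# Calculate pairwise correlations between numeric columns only
# (numeric_only=True skips text columns, which recent pandas versions would otherwise raise an error on)
corr = auto_df.corr(numeric_only=True)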

# Set the figure size
plt.figure(figsize=(12, 6))

# Visualize correlations as a heatmap with a different colormap
sns.heatmap(corr, cmap='coolwarm', annot=True)

# Add title to the plot
plt.title('Correlation Heatmap of Columns in auto_df', fontsize=15)

# Display the heatmap
plt.show()



##### 1. Why did you pick the specific chart?

The correlation heatmap was chosen to visually represent the relationships between numerical columns in the 'auto_df' dataset. The heatmap provides a quick and intuitive way to identify patterns, trends, and potential dependencies between variables.

##### 2. What is/are the insight(s) found from the chart?

The diagonal line from the top-left to the bottom-right represents the correlation of each variable with itself, resulting in a perfect correlation of 1. This is expected, as any variable perfectly correlates with itself. Additionally, other non-diagonal high correlation values indicate strong relationships between specific pairs of variables. For instance, if two variables have a correlation close to 1, it suggests a strong positive linear relationship, while a correlation close to -1 indicates a strong negative linear relationship. Identifying these correlations is crucial for understanding how variables in the dataset are related to each other.
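To move from a general reading of the heatmap to specific pairs, the off-diagonal correlations can be listed and ranked by absolute value. The sketch below uses a hypothetical `toy_df` with invented numbers; the column names are assumed examples of numeric fields and may not match every column in `auto_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for auto_df; the values are invented.
toy_df = pd.DataFrame({
    "price": [100, 150, 200, 250, 300],
    "number_of_reviews": [10, 20, 30, 40, 50],
    "reviews_per_month": [1.0, 2.1, 2.9, 4.2, 5.0],
    "availability_365": [300, 200, 250, 100, 50],
    "room_type": ["Private room", "Shared room", "Private room",
                  "Entire home/apt", "Entire home/apt"],
})

# Correlation matrix over numeric columns only (room_type is skipped).
corr = toy_df.corr(numeric_only=True)

# Boolean mask for the upper triangle; k=1 excludes the diagonal of 1s.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per variable pair, ranked by the strength of the relationship.
pairs = corr.where(upper).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

Keeping only the upper triangle avoids listing every pair twice, and sorting by absolute value surfaces strong negative relationships as well as positive ones.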

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


# create a pairplot using the seaborn library to visualize the relationships between different variables in the Airbnb NYC dataset
sns.pairplot(auto_df)

# show the plot
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pair plot to visualize the relationships and distributions between multiple numerical columns in the dataset because it's a comprehensive way to examine how these variables interact with each other.

##### 2. What is/are the insight(s) found from the chart?

It can be used to visualize relationships between multiple variables and to identify patterns in the data.

# **Conclusion**

In conclusion, for businesses looking to invest in the Airbnb market in New York City, focusing on Manhattan and Brooklyn would align with high demand. However, considering the intense competition in these boroughs, strategic decisions are crucial. Investing in properties in neighborhoods like Williamsburg, Bedford-Stuyvesant, Harlem, Bushwick, and the Upper West Side can be advantageous, given their significant listing counts and robust demand.

Understanding that the majority of Airbnb users prefer short-term stays, businesses should tailor their properties to accommodate this trend, ensuring flexibility in booking durations. Additionally, recognizing the popularity of entire homes or apartments and private rooms in the market, businesses may consider diversifying their offerings to meet various traveler preferences.

Analyzing the competitive landscape, businesses might explore less saturated neighborhoods to carve a niche and differentiate themselves. This strategic move can help overcome the challenge posed by a small number of hosts dominating a large portion of the market.

Furthermore, recognizing the appeal of neighborhoods near Queens' airports for short-term stays, businesses could explore opportunities to cater to this specific demographic. Providing convenient and appealing options for travelers passing through these areas aligns with potential market trends.

While prices are higher in Manhattan and Brooklyn, businesses should weigh the competitive factors and explore pricing strategies that balance profitability and market demand. In essence, a nuanced understanding of the diverse market trends and strategic decision-making will empower businesses to navigate the dynamic landscape of the New York City Airbnb market effectively.
