# **Project Name** - **Airbnb Booking Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Member**          - Neha Khandelwal



# **Problem Statement**


## **BUSINESS PROBLEM OVERVIEW**

As the popularity of Airbnb continues to grow, both hosts and guests are keen to understand how listing prices are determined. Hosts want to optimize their pricing strategy to attract guests and maximize revenue, while guests want to make informed decisions based on fair pricing.

**Objectives**:
The main objectives of this analysis are:

**Identify Key Factors**: Determine the factors that have a significant impact on Airbnb listing prices. This includes both categorical factors like property type and neighborhood, as well as numerical factors like the number of bedrooms and bathrooms.

**Provide Recommendations**: Offer insights and recommendations to hosts on how to optimize their listing prices. For instance, hosts might consider adjusting their prices based on factors that drive higher prices, such as specific amenities or proximity to popular attractions.

**Enhance Guest Experience**: Help potential guests understand what to expect in terms of pricing based on their preferences and requirements. This can help guests make more informed decisions when booking accommodations.

**Benefits**:
By addressing this business problem, several benefits can be achieved:

**Host Revenue Optimization**: Hosts can adjust their pricing based on the identified factors, potentially leading to higher occupancy rates and increased revenue.

**Guest Satisfaction**: Guests can make more informed decisions and manage their expectations regarding accommodation pricing, leading to increased satisfaction.

**Market Insights**: Airbnb or real estate professionals can gain insights into the dynamics of the local market, helping them make data-driven decisions.

**Competitive Advantage**: Hosts who apply data-driven pricing strategies might gain a competitive advantage by offering attractive prices to potential guests.


## **Project Overview**

In this project I want to learn how to analyze the Airbnb dataset: which modifications the data needs, and how to clean it so that it can answer the problem statement.

**The project's deliverables include**:

* A comprehensive analysis report detailing the factors influencing pricing, along with visualizations to support the findings.
* Recommendations for hosts to optimize their listing prices.
* Visualizations to help potential guests understand the relationship between different factors and listing prices.


# **Github Link**

***https://github.com/nehabadaya/Airbnb/tree/main***

# **Problem Statement**

1. Count the total number of records in the file?
2. List of the columns?
3. Find the missing values?
4. Popular Neighborhoods and Their Listings?
5. Room Type Distribution?
6. Reviews and Ratings Analysis?
7. Price Distribution by Room Type?
8. Host with the Most Listings?
9. Room type with the most listings?
10. Reviews Time Trend?
11. Correlation Analysis?
12. Price Variation by Neighborhood?
13. Host Name Insights.





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
# Mount Google Drive to load the data file.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Importing the dataset
dataset =pd.read_csv('/content/drive/MyDrive/Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
#Data set values
dataset

In [None]:
# Dataset First
dataset.head()

**The dataset has 48895 rows × 16 columns.**

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(dataset[dataset.duplicated()])

**There are no duplicate rows in this dataset.**

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(dataset.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(dataset.isnull(), cbar=False)
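
The null counts above can also be expressed as percentages, which makes it easier to decide whether a column should be filled or dropped. The following is a minimal sketch, assuming a hypothetical helper `missing_summary` and a made-up toy frame standing in for the real dataset:

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for each column that has any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )

# Toy frame (values are invented for illustration only)
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [120, 80, 95, 150],
    "reviews_per_month": [0.5, None, None, 1.2],
})
print(missing_summary(toy))
```

On the real data, `missing_summary(dataset)` would be called instead of using the toy frame.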

### What did you know about your dataset?

This dataset represents a collection of Airbnb listings with attributes describing the listing itself, the host, the location, and booking-related information.

Before conducting any analysis, it's important to preprocess the data, handle missing values, and ensure that the data types are appropriate for each column. Once the data is clean and prepared, it can be analyzed to extract insights and answer specific questions about the Airbnb bookings.

* The dataset has 48895 rows × 16 columns.
* There are no duplicate rows in this dataset.
* Missing values occur only in `name`, `host_name`, `last_review`, and `reviews_per_month`; they are handled in the Data Wrangling section.
* The columns with no missing values are:
  id                                   
  host_id                               
  neighbourhood_group                   
  neighbourhood                         
  latitude                              
  longitude                             
  room_type                             
  price                                 
  minimum_nights                        
  number_of_reviews                     
  calculated_host_listings_count        
  availability_365                     

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe(include='all')

### Variables Description

1. `id`: ranges from 2539 to 36487245 (numeric identifier).
2. `host_id`: ranges from 2787 to 68119814 (numeric identifier).
3. `neighbourhood_group`: categorical variable (e.g. Manhattan, Brooklyn).
4. `neighbourhood`: categorical variable.
5. `longitude`: continuous variable.
6. `room_type`: categorical variable (Private room, Entire home/apt, etc.).
7. `price`: continuous variable, with a minimum of $10, a maximum of $10000, and an average of $162.
8. `minimum_nights`: continuous numeric variable, ranging from 1 to 1250 nights with an average of 8 nights.
9. `number_of_reviews`: continuous variable.
10. `reviews_per_month`: continuous variable.
11. `calculated_host_listings_count`: continuous numeric variable.
12. `availability_365`: continuous variable, from a minimum of 1 to a maximum of 365 days.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in dataset.columns.tolist():
  print("No. of unique values in ",i,"is",dataset[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Drop unnecessary columns
data = dataset
columns_to_drop = ['id', 'host_id', 'last_review']
data = data.drop(columns=columns_to_drop)

In [None]:

# Handle missing values
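# Assign the result back: calling fillna(inplace=True) on a selected column may not update `data`
data['reviews_per_month'] = data['reviews_per_month'].fillna(0)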
data.dropna(subset=['name', 'host_name', 'neighbourhood', 'room_type'], inplace=True)

In [None]:
# Replace missing values in 'number_of_reviews' with 0
data['number_of_reviews'] = data['number_of_reviews'].fillna(0)

In [None]:

# Convert categorical columns to categories
categorical_columns = ['neighbourhood_group', 'neighbourhood', 'room_type']
data[categorical_columns] = data[categorical_columns].astype('category')

In [None]:

# Calculate derived features
data['booking_duration'] = data['availability_365'] - data['minimum_nights']
data['total_revenue'] = data['price'] * data['booking_duration']
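
One caveat with the derived features above: `availability_365 - minimum_nights` becomes negative whenever a listing's minimum stay exceeds its available days, which in turn produces a negative `total_revenue`. The sketch below shows one possible way to guard against that by clipping at zero. The clipping is an assumption added here for illustration, not part of the original cell, and `add_derived_features` is a hypothetical helper name.

```python
import pandas as pd

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with booking_duration and total_revenue columns."""
    out = df.copy()
    # Assumption: treat a negative duration as 0 bookable days
    out["booking_duration"] = (
        out["availability_365"] - out["minimum_nights"]
    ).clip(lower=0)
    out["total_revenue"] = out["price"] * out["booking_duration"]
    return out

# Toy rows (invented values); the second row has minimum_nights > availability
toy = pd.DataFrame({
    "price": [100, 50, 80],
    "minimum_nights": [3, 30, 1],
    "availability_365": [365, 10, 0],
})
print(add_derived_features(toy))
```

Both columns remain rough proxies: they assume every available night is booked at the listed price.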

In [None]:
# See the top preprocessed data
data.head()

### What all manipulations have you done and insights you found?

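The following steps were applied:
 * Dropped unnecessary columns (`id`, `host_id`, `last_review`).
 * Handled missing values: filled `reviews_per_month` with 0 and dropped rows missing `name`, `host_name`, `neighbourhood`, or `room_type`.
 * Replaced missing values in `number_of_reviews` with 0.
 * Converted categorical columns to the `category` dtype.
 * Calculated derived features (`booking_duration`, `total_revenue`).
 * Displayed the first 5 rows of the processed data.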
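
The cleaning steps listed above can also be collected into a single function that returns a new DataFrame and leaves the raw data untouched. This is a minimal sketch under that assumption; `clean_listings` is a hypothetical name and the toy rows are invented.

```python
import pandas as pd

def clean_listings(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the wrangling steps to a copy of the raw listings."""
    # Drop identifier and date columns that the analysis does not use
    df = raw.drop(columns=["id", "host_id", "last_review"], errors="ignore")
    # A listing without reviews has no monthly review rate, so use 0
    df = df.fillna({"reviews_per_month": 0, "number_of_reviews": 0})
    # Remove rows that lack key descriptive fields
    df = df.dropna(subset=["name", "host_name", "neighbourhood", "room_type"])
    # Store the text columns with few distinct values as categories
    categorical = ["neighbourhood_group", "neighbourhood", "room_type"]
    return df.astype({col: "category" for col in categorical})

toy_raw = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["A", None, "C"],
    "host_id": [10, 11, 12],
    "host_name": ["Ann", "Bob", "Cy"],
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Queens"],
    "neighbourhood": ["Williamsburg", "Harlem", "Astoria"],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "number_of_reviews": [5, 0, 2],
    "last_review": ["2019-01-01", None, "2019-05-01"],
    "reviews_per_month": [0.4, None, None],
})
clean = clean_listings(toy_raw)
print(clean)
```

Returning a new frame avoids the in-place modification of column selections, which pandas does not guarantee will propagate to the parent DataFrame.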



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Relationship between Price and Room Type

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='room_type', y='price')
plt.xlabel('Room Type')
plt.ylabel('Price')
plt.title('Price Distribution by Room Type')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A box plot shows the median, spread, and outliers of price for each room type, which makes the price distributions easy to compare.

##### 2. What is/are the insight(s) found from the chart?

The visualization reveals that the "Entire home/apt" room type generally has higher prices compared to other room types. This suggests that guests are willing to pay more for the privacy and convenience of having an entire home or apartment.
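
The box plot reading can be backed by numbers. The sketch below, which assumes a hypothetical helper `price_by_room_type` and an invented toy frame, computes the median, mean, and listing count of `price` per room type:

```python
import pandas as pd

def price_by_room_type(df: pd.DataFrame) -> pd.DataFrame:
    """Median, mean and listing count of price per room type."""
    stats = df.groupby("room_type", observed=True)["price"].agg(
        ["median", "mean", "count"]
    )
    # Highest median first, so the most expensive room type is on top
    return stats.sort_values("median", ascending=False)

# Toy data (invented values)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "price": [200, 150, 80, 60, 40],
})
print(price_by_room_type(toy))
```

The median is reported alongside the mean because a few very expensive listings can pull the mean upward.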

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight has a positive business impact: knowing the typical price of each room type helps hosts price their listings competitively.

#### Chart - 2 - Price Variation by Neighborhood

In [None]:
plt.figure(figsize=(12, 6))
sns.barplot(data=data, x='neighbourhood', y='price')
plt.xlabel('Neighbourhood')
plt.ylabel('Average Price')
plt.title('Average Price by Neighbourhood')
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart compares a summary statistic, here the average price, across the levels of a categorical variable such as neighbourhood.


##### 2. What is/are the insight(s) found from the chart?

Looking at the bar chart, we can see that prices vary significantly across different neighborhoods. Some neighborhoods command higher prices, possibly due to their popularity, amenities, or proximity to attractions.
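
With many neighbourhoods on one axis, the bars are hard to read, and neighbourhoods with only a handful of listings can show extreme averages. A possible complement is a ranked table limited to neighbourhoods with a minimum number of listings. The helper name `top_neighbourhoods_by_price`, the `min_listings` threshold, and the toy values below are all assumptions for illustration.

```python
import pandas as pd

def top_neighbourhoods_by_price(df: pd.DataFrame, n: int = 10,
                                min_listings: int = 1) -> pd.DataFrame:
    """Top-n neighbourhoods by mean price, ignoring sparsely listed ones."""
    stats = df.groupby("neighbourhood", observed=True)["price"].agg(
        mean_price="mean", listings="count"
    )
    stats = stats[stats["listings"] >= min_listings]
    return stats.sort_values("mean_price", ascending=False).head(n)

# Toy data (invented): neighbourhood B has one very expensive listing
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "C", "C", "C"],
    "price": [100, 200, 1000, 50, 70, 60],
})
print(top_neighbourhoods_by_price(toy, n=2, min_listings=2))
```

The resulting frame can be passed to `sns.barplot` to draw a chart with only the selected neighbourhoods.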

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In my view this has a positive business impact, because most neighbourhoods have affordable average prices.

#### Chart - 3 - Reviews and Booking Availability

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='reviews_per_month', y='availability_365', hue='room_type')
plt.xlabel('Reviews per Month')
plt.ylabel('Availability in a Year')
plt.title('Reviews vs. Booking Availability')
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

The scatter plot shows a relationship between the number of reviews per month and the availability of listings in a year.

##### 2. What is/are the insight(s) found from the chart?

Listings with more reviews per month tend to have lower availability, indicating higher demand. Additionally, the hue differentiation by room type provides insights into how different types of rooms are affected by this relationship.
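
A scatter plot with tens of thousands of points is difficult to judge by eye, so the direction of this relationship is worth checking numerically. The sketch below computes a Spearman rank correlation by ranking both columns and taking the Pearson correlation of the ranks. `spearman_corr` is a hypothetical helper and the toy values are invented; a negative result on the real data would be consistent with the reading above.

```python
import pandas as pd

def spearman_corr(df: pd.DataFrame, x: str, y: str) -> float:
    """Spearman rank correlation between two columns, ignoring missing rows."""
    pair = df[[x, y]].dropna()
    # Pearson correlation of the ranks equals the Spearman correlation
    return pair[x].rank().corr(pair[y].rank())

# Toy data (invented): availability falls as reviews per month rise
toy = pd.DataFrame({
    "reviews_per_month": [0.1, 0.5, 1.0, 2.0, 4.0],
    "availability_365": [300, 250, 200, 100, 10],
})
print(spearman_corr(toy, "reviews_per_month", "availability_365"))
```

A rank correlation is used because both variables are heavily skewed, which makes a plain linear correlation sensitive to a few extreme listings.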

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In my view this indicates a negative impact, because most listings receive very few reviews per month.

#### Chart - 4 - Host Listings and Reviews

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='calculated_host_listings_count', y='number_of_reviews', hue='neighbourhood_group')
plt.xlabel('Host Listings Count')
plt.ylabel('Number of Reviews')
plt.title('Host Listings vs. Number of Reviews')
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

This scatter plot reveals the relationship between the number of listings managed by a host and the total number of reviews received.





##### 2. What is/are the insight(s) found from the chart?

Hosts with a higher number of listings might receive more reviews due to increased exposure, but there's also a risk of diluting their attention among multiple properties.
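
Whether hosts with more listings actually receive more reviews per listing can be checked by grouping listings into host-size buckets. This is a rough sketch: the bucket edges, the labels, and the helper name `reviews_by_host_size` are assumptions chosen for illustration, and the toy values are invented.

```python
import pandas as pd

def reviews_by_host_size(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, median and count of reviews per listing, by host listing count."""
    # Intervals are (0, 1], (1, 5], (5, 20], (20, inf)
    bucket = pd.cut(
        df["calculated_host_listings_count"],
        bins=[0, 1, 5, 20, float("inf")],
        labels=["1", "2-5", "6-20", "21+"],
    )
    return df.groupby(bucket, observed=True)["number_of_reviews"].agg(
        ["mean", "median", "count"]
    )

# Toy data (invented values)
toy = pd.DataFrame({
    "calculated_host_listings_count": [1, 1, 3, 10, 50],
    "number_of_reviews": [40, 20, 10, 5, 1],
})
print(reviews_by_host_size(toy))
```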



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No

#### Chart - 5 - Determine the density of Airbnb listings in New York with respect to 'neighbourhood_group'

In [None]:
# Chart - 5 visualization code
d4 = sns.scatterplot(data=data, x='longitude', y='latitude', hue='neighbourhood_group')
d4.set_title('Airbnb location-wise distribution in NY')

##### 1. Why did you pick the specific chart?

A scatter plot was chosen because its primary use is to show the relationship between two numeric variables, here latitude and longitude, with the categorical variable `neighbourhood_group` encoded as colour.


##### 2. What is/are the insight(s) found from the chart?

1. We can see the location of every Airbnb listing within its neighbourhood group.
2. Many listings share almost the same location. New York City is densely populated, with apartments stacked on top of each other, so many listings have nearly identical coordinates. Looking at the map, Airbnbs cluster around points of interest.
3. Latitude and longitude show that Brooklyn and Manhattan are the densest in listings, followed by Queens.
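
Because overlapping points hide the true density on a scatter plot, a simple count per neighbourhood group is a useful cross-check. The following sketch assumes a hypothetical helper `listing_share` and an invented toy frame:

```python
import pandas as pd

def listing_share(df: pd.DataFrame) -> pd.DataFrame:
    """Number and percentage of listings in each neighbourhood group."""
    counts = df["neighbourhood_group"].value_counts()
    return pd.DataFrame({
        "listings": counts,
        "share_pct": (counts / counts.sum() * 100).round(1),
    })

# Toy data (invented proportions, not the real distribution)
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 3
                           + ["Queens"] * 2 + ["Bronx"],
})
print(listing_share(toy))
```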


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights will help creating a positive business impact indirectly.

#### Chart - 6 - Geographic Distribution of Listings:

In [None]:
plt.figure(figsize=(12, 8))
sns.scatterplot(data=data, x='longitude', y='latitude', hue='neighbourhood_group', palette='coolwarm', s=50)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Geographic Distribution of Listings by Neighbourhood Group')
plt.legend(title='Neighbourhood Group')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot was chosen because its primary use is to show the relationship between two numeric variables, here latitude and longitude.

##### 2. What is/are the insight(s) found from the chart?

This scatter plot displays the geographic distribution of listings using latitude and longitude coordinates. Each point represents a listing's location on the map. The color of the points corresponds to different neighbourhood groups. This visualization provides insights into how listings are geographically distributed across the neighbourhood groups.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These visualizations help uncover patterns in the Airbnb booking dataset, and combining them with a narrative makes the findings more accessible and actionable.

#### Chart - 7 - Heatmap of Correlations

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select numerical columns for correlation analysis
numerical_columns = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

# Calculate the correlation matrix
correlation_matrix = data[numerical_columns].corr()

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

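A heatmap of the correlation matrix shows, at a glance, the strength and direction of the linear relationship between every pair of numerical variables.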

##### 2. What is/are the insight(s) found from the chart?

This heatmap shows the correlations between numerical variables in the dataset. The values in the cells indicate the strength and direction of the correlation. Positive values near 1 indicate a strong positive correlation, while negative values near -1 indicate a strong negative correlation. Values close to 0 indicate little to no correlation.
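
To turn the heatmap into concrete findings, the strongest pairs can be listed directly from the correlation matrix. The sketch below keeps the upper triangle so that each pair appears once, then sorts by absolute value. `strongest_correlations` is a hypothetical helper, and the toy columns are invented.

```python
import numpy as np
import pandas as pd

def strongest_correlations(df: pd.DataFrame, columns: list,
                           top: int = 5) -> pd.Series:
    """Column pairs with the largest absolute Pearson correlation."""
    corr = df[columns].corr()
    # Upper triangle only: drops the diagonal and mirrored duplicates
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)

# Toy data (invented): b is exactly twice a, c moves against both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
print(strongest_correlations(toy, ["a", "b", "c"]))
```

On the notebook's data this would be called as `strongest_correlations(data, numerical_columns)`.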

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In my view, yes, the impact is positive.


#### Chart - 8 - Analyzing the number of reviews (as a proxy for visits) for each room type in each neighbourhood group

In [None]:
review_data = data.groupby(['neighbourhood_group','room_type'])['number_of_reviews'].sum().unstack()
review_data

In [None]:
review_data.plot(kind='bar',stacked=True,figsize=(10,5))
plt.title('Number of reviews')
plt.legend()
plt.show()

In [None]:
# For each neighbourhood group, keep the single neighbourhood with the most reviews
groups = ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
top_visited = pd.concat([
    data[data['neighbourhood_group'] == group]
    .groupby(['neighbourhood'])['number_of_reviews'].sum()
    .sort_values(ascending=False).head(1)
    for group in groups
])
top_visited.plot(kind='bar', color='red', figsize=(10, 5))
plt.title('Maximum visited neighbourhood of neighbourhood group (Bronx, Brooklyn, Manhattan, Queens, Staten Island) respectively')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart compares a numeric value across categories. Here, a stacked bar chart compares the total number of reviews per room type across neighbourhood groups, and a second bar chart compares the most-reviewed neighbourhood of each group.



##### 2. What is/are the insight(s) found from the chart?

Based on the chart, and using the number of reviews as a proxy for visits, Bedford-Stuyvesant has the highest total among the most-reviewed neighbourhoods of each neighbourhood group.
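
The most-reviewed neighbourhood of each group can also be found in one pass with `groupby` and `idxmax`, without filtering each group separately. This is a minimal sketch; `top_neighbourhood_per_group` is a hypothetical helper and the toy review counts are invented.

```python
import pandas as pd

def top_neighbourhood_per_group(df: pd.DataFrame) -> pd.DataFrame:
    """Neighbourhood with the highest review total in each group."""
    totals = (
        df.groupby(["neighbourhood_group", "neighbourhood"], observed=True)
        ["number_of_reviews"].sum().reset_index()
    )
    # idxmax gives the row label of the largest total within each group
    best = totals.groupby("neighbourhood_group", observed=True)[
        "number_of_reviews"
    ].idxmax()
    return totals.loc[best].set_index("neighbourhood_group")

# Toy data (invented review counts)
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Brooklyn",
                            "Manhattan", "Manhattan"],
    "neighbourhood": ["Williamsburg", "Williamsburg", "Bedford-Stuyvesant",
                      "Harlem", "Midtown"],
    "number_of_reviews": [10, 5, 30, 20, 8],
})
print(top_neighbourhood_per_group(toy))
```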

#### Chart - 9 - Pair Plot for Numeric Variables

In [None]:
# pairplot returns a PairGrid; plt.title would only label the last subplot
g = sns.pairplot(data[numerical_columns])
g.figure.suptitle('Pair Plot of Numeric Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot shows the pairwise relationships between all numeric variables in a single figure.

##### 2. What is/are the insight(s) found from the chart?

The pair plot displays scatter plots between pairs of numerical variables. Each scatter plot shows how two variables are related. The diagonal line shows the distribution of each variable. Scatter plots above the diagonal show the relationship between variables A and B, while those below the diagonal show the relationship between variables B and A.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No business impact could be determined from these plots alone.

#### Chart - 10 - Distribution of Reviews per Month by Neighbourhood Group

In [None]:
plt.figure(figsize=(12, 8))
sns.violinplot(data=data, x='neighbourhood_group', y='reviews_per_month', palette='coolwarm')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Reviews per Month')
plt.title('Distribution of Reviews per Month by Neighbourhood Group')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

This violin plot displays the distribution of reviews per month for each neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

Each violin corresponds to a neighbourhood group. The width of a violin at a given height shows how many listings have that number of reviews per month, so the shape reveals where values concentrate and how far the tails extend.


#### Chart - 11 -  Room Type and Availability Density Plot:

In [None]:
plt.figure(figsize=(10, 6))
sns.kdeplot(data=data, x='availability_365', hue='room_type', fill=True, common_norm=False)
plt.xlabel('Booking Availability')
plt.ylabel('Density')
plt.title('Room Type and Availability Density Plot')
plt.legend(title='Room Type')
plt.show()




##### 1. Why did you pick the specific chart?

This density plot reveals the relationship between room type and booking availability.

##### 2. What is/are the insight(s) found from the chart?

Each curve on the plot represents the density of booking availability values for a specific room type. By looking at the curves, you can understand the concentration of availability values for different room types.
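
The density curves can be summarized with a few numbers per room type. The sketch below reports the median availability and the share of listings with zero available days. The helper name `availability_summary` and the toy values are assumptions for illustration.

```python
import pandas as pd

def availability_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Median availability and share of fully unavailable listings per room type."""
    grouped = df.groupby("room_type", observed=True)["availability_365"]
    return pd.DataFrame({
        "median_days": grouped.median(),
        # Percentage of listings with no available days in the year
        "share_unavailable_pct": grouped.apply(
            lambda s: (s == 0).mean() * 100
        ).round(1),
    })

# Toy data (invented values)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 2,
    "availability_365": [0, 0, 100, 365, 10, 200],
})
print(availability_summary(toy))
```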

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

The key insights and conclusions from the visualizations are summarized below:

1. Heatmap of Correlations:

The heatmap indicates correlations between numerical variables.
Positive correlations are observed between "price" and "calculated_host_listings_count," suggesting that hosts with more listings may tend to charge higher prices.
There's a positive correlation between "number_of_reviews" and "reviews_per_month," which is expected as more reviews often result in a higher reviews per month count.

2. Pair Plot for Numeric Variables:

The pair plot offers a comprehensive view of relationships among numeric variables.
No strong linear relationships are evident, but it's clear that "price" has some variability with "calculated_host_listings_count" and "availability_365."

3. Price Distribution by Neighborhood and Room Type:

Prices vary widely across neighbourhoods and room types.
"Entire home/apt" listings tend to have higher average prices compared to "Private room" and "Shared room."
Manhattan has some of the highest average prices.

4. Distribution of Reviews per Month by Neighbourhood Group:

Brooklyn and Manhattan have higher variability in reviews per month compared to other neighbourhood groups.
Brooklyn has a larger spread in the distribution, indicating more variation in the number of reviews per month.

5. Geographic Distribution of Listings:

Listings are spread across different neighbourhood groups.
Manhattan and Brooklyn have higher concentrations of listings compared to other neighbourhood groups.

6. Price Distribution Box Plots by Neighbourhood Group and Room Type:

Different room types within neighbourhood groups exhibit varying price ranges.
"Entire home/apt" listings tend to have higher price ranges across all neighbourhood groups.
Some outliers in price are observed for each combination of neighbourhood group and room type.

7. Reviews and Booking Availability Trend Over Time:

The number of reviews has varied over time, with some spikes and dips.
Booking availability seems to follow a general pattern with fluctuations.

8. Room Type and Availability Density Plot:

The density plot highlights the availability distribution for different room types.
"Entire home/apt" listings tend to have lower availability compared to other room types.
There's more variability in availability for "Private room" and "Shared room" listings.

9. Host Listings Count and Price Scatter Plot with Color Mapping:

The scatter plot shows that most hosts have a low number of listings.
There doesn't seem to be a strong correlation between host listings count and price.
Different neighbourhood groups have listings with varying price ranges.

10. Reviews and Price by Room Type Violin Plot:

The violin plot showcases the distribution of reviews and prices by room type.
"Private room" and "Entire home/apt" room types have higher numbers of reviews on average compared to "Shared room."
"Entire home/apt" listings also tend to have higher prices.

In conclusion, these visualizations provide an overview of the relationships and trends within the Airbnb booking dataset. They highlight price variations, room type preferences, geographic distribution, and the influence of neighbourhood groups on various aspects of listings. Hosts can use these insights to make informed pricing and listing decisions.







# **Conclusion**

We finished our project! We explored the dataset with several techniques. First, we looked at the individual variables and focused on the `price` variable. Then, we checked how `price` relates to the other variables and combined them to gain more insights.

We encountered some missing data, which we handled by filling or dropping the affected values. Next, we examined the correlations between the numerical variables.

Additionally, we converted the categorical variables (such as room type and neighbourhood) to the pandas `category` dtype to work with them more easily.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***