# **Project Name**    - Airbnb Booking Analysis (EDA)



##### **Project Type**    - EDA
##### **Contribution**    - Team
##### **Team Member 1 -** Manasvi Save
##### **Team Member 2 -** Sameer Rudani


# **Project Summary -**

Airbnb, which stands for "Air Bed and Breakfast," is a platform that lets owners of real estate lease out their apartments to vacationers in need of somewhere to stay.

In 2008, Brian Chesky and Joe Gebbia, who were based in San Francisco, California, founded Airbnb. Both a website and a mobile app are available for the platform.

Since its launch in 2008, Airbnb has allowed both hosts and visitors to extend their travel options and offer a more distinctive, customised method of seeing the world.

Millions of listings create a vast amount of data that can be analysed and utilised for a variety of purposes, including security, business decision-making, understanding customer and provider behaviour and performance on the platform, directing marketing campaigns, and putting innovative new services into place.

EDA's primary goal is to assist in examining data before drawing any conclusions. It can help in locating glaring errors, better understanding data patterns, spotting outliers or unusual occurrences, and discovering intriguing correlations between the variables.


# **GitHub Link -**

https://github.com/msave121/Airbnb-Booking-Analysis-EDA

# **Problem Statement**


**Airbnb was founded in 2008 with the goal of providing travellers and hosts with additional options for travel and a more distinctive and intimate travel experience. These days, Airbnb is a distinct service that is utilised and acknowledged globally. An essential component of Airbnb's business model is the analysis of data from millions of listings, which produce an enormous amount of data.
We will be analysing data from Airbnb NYC in 2019.
Our primary study objectives cover themes such as Host Learnings, Areas, Prices, Ratings, and Locations, but we won't stop there; we'll also aim to investigate some other concepts.**

#### **Define Your Business Objective?**

The business objective for Airbnb revolves around providing a platform that connects hosts with travelers, offering unique and authentic travel experiences worldwide.

1.	Facilitate Accommodation
2.	Enhance User Experience
3.	Global Expansion
4.	Quality Assurance
5.	Partnerships and Collaborations
6.	Customer Loyalty and Retention


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # standard-library module; numpy.math is a deprecated alias that newer NumPy releases no longer provide
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Airbnb_NYC_2019.csv')
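The guidelines above ask for exception handling and a notebook that runs in one go. A minimal sketch of a guarded loader is shown below; `load_listings` is a hypothetical helper name that does not appear in the original notebook, and the commented path simply mirrors the one used in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path):
    """Read the listings CSV, failing early with a readable message."""
    path = Path(csv_path)
    if not path.is_file():
        # Raise a short, clear error instead of a long pandas traceback.
        raise FileNotFoundError(f"Dataset not found at: {path}")
    return pd.read_csv(path)


# Hypothetical usage, assuming Google Drive is already mounted:
# df = load_listings('/content/drive/MyDrive/Airbnb_NYC_2019.csv')
```

Checking the path first keeps the failure message close to its cause, which helps when the Drive mount or the file name differs between machines.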

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

Through its online marketplace, Airbnb links visitors with nearby hosts.
On the one hand, people can list their spare space on the platform and earn additional income in the form of rent.
On the other hand, travellers can save money and get to know locals by using Airbnb to book distinctive homestays from local hosts.
With operations in more than 190 nations worldwide, Airbnb serves the on-demand tourism sector.

This dataset consists of 16 columns and 48,895 observations, with a combination of numeric and categorical values. There are no fully duplicated rows, but the `name`, `host_name`, `last_review`, and `reviews_per_month` columns contain missing values, which are handled in the data wrangling section.
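A per-column summary can be easier to read than the raw `isnull().sum()` output. The sketch below is illustrative only: `missing_summary` is a hypothetical helper, and the small `toy` frame stands in for the real listings table.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values.
    affected = summary[summary['missing_count'] > 0]
    return affected.sort_values('missing_count', ascending=False)


# Toy data standing in for the real dataset (assumed values).
toy = pd.DataFrame({
    'name': ['Loft', None, 'Studio', 'Room'],
    'price': [120, 80, 95, 60],
    'reviews_per_month': [0.5, np.nan, np.nan, 1.2],
})
print(missing_summary(toy))
```

On the real data, the same call would be `missing_summary(df)`.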

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

1. id: Unique identifier for each listing.

2. name: The title or name of the listing.

3. host_id: Unique identifier for the host of the property.

4. host_name: Name of the host.

5. neighbourhood_group: The broader area or group that the neighbourhood belongs to.

6. neighbourhood: Specific neighbourhood where the property is located.

7. latitude: Latitude coordinate of the property.

8. longitude: Longitude coordinate of the property.

9. room_type: Type of room (e.g., Private room, Entire home/apt, Shared room).

10. price: Price of the listing per night.

11. minimum_nights: Minimum number of nights required for booking.

12. number_of_reviews: Total number of reviews received for the listing.

13. last_review: Date of the last review.

14. reviews_per_month: Average number of reviews per month.

15. calculated_host_listings_count: Number of properties available on Airbnb owned by the host.

16. availability_365: Number of days the listing is available for booking in a year.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Let's check the column 'last review'
df['last_review']

In [None]:
df['last_review'] = pd.to_datetime(df['last_review'])

In [None]:
# Let's check the min and max timestamps

df['last_review'].min(), df['last_review'].max()

In [None]:
# Now let's impute the null values with the median date in the dataset

df.loc[df['last_review'].isnull(), 'last_review'] = df['last_review'].median()
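The median-date imputation above can be sketched on a toy column. The dates are made up for illustration, and the `last_review_missing` flag is an optional extra (not part of the original notebook) that records which rows were filled.

```python
import pandas as pd

# Toy column standing in for df['last_review'] (assumed values).
toy = pd.DataFrame(
    {'last_review': ['2019-01-01', None, '2019-01-03', '2019-01-11']}
)
toy['last_review'] = pd.to_datetime(toy['last_review'])

# Remember which rows were originally empty before filling them.
toy['last_review_missing'] = toy['last_review'].isnull()

# The median of a datetime column is itself a timestamp.
median_date = toy['last_review'].median()
toy['last_review'] = toy['last_review'].fillna(median_date)
print(toy)
```

Keeping the flag makes it possible to separate genuinely reviewed listings from imputed ones later, since a filled date can otherwise look like real review activity.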

In [None]:
# Let's find fully duplicated listings, if any

df[df.duplicated()]

In [None]:
# There could be near-duplicates that differ only in id or name. Let's use the host name, latitude, longitude, and price combination to find such cases

df.duplicated(subset=['host_name', 'latitude', 'longitude', 'price']).sum()

In [None]:
temp = df.loc[df.duplicated(subset=['host_name', 'latitude', 'longitude', 'price'], keep=False)].copy()
temp = temp.groupby(['host_name', 'latitude', 'longitude', 'price'])
for key, subdf in temp:
    print(key)
    print(pd.DataFrame(subdf), '\n')
    break

In [None]:
# Drop duplicate data
df.drop_duplicates(subset=['host_name', 'latitude', 'longitude', 'price'], inplace=True)
df.info()
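The difference between inspecting duplicate groups and dropping repeats comes down to the `keep` argument. A small sketch on made-up rows (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Toy rows: the first two share host, coordinates, and price.
toy = pd.DataFrame({
    'host_name': ['Ana', 'Ana', 'Ben', 'Ana'],
    'latitude': [40.70, 40.70, 40.80, 40.70],
    'longitude': [-73.90, -73.90, -73.95, -73.90],
    'price': [100, 100, 150, 120],
})
key_cols = ['host_name', 'latitude', 'longitude', 'price']

# keep=False flags every member of a duplicate group (useful for inspection).
all_members = toy[toy.duplicated(subset=key_cols, keep=False)]

# The default keep='first' flags only the later repeats,
# which are exactly the rows drop_duplicates removes.
n_removed = toy.duplicated(subset=key_cols).sum()
deduped = toy.drop_duplicates(subset=key_cols)

print(len(all_members), n_removed, len(deduped))
```

Here two rows belong to a duplicate group, one of them is removed, and three rows remain. The fourth row survives because its price differs.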

In [None]:
# Let's check the null counts once again

df.isnull().sum().sort_values(ascending=False)

In [None]:
# Now let's impute the null values in reviews_per_month with the column median

df.loc[df['reviews_per_month'].isnull(), 'reviews_per_month'] = df['reviews_per_month'].median()

In [None]:
# Let's check the null counts once again

df.isnull().sum().sort_values(ascending=False)

In [None]:
#There is duplicate data in name
df.duplicated(subset=['name']).sum()

In [None]:
temp = df.loc[df.duplicated(subset=['name'], keep=False)].copy()
temp = temp.groupby(['name'])
for key, subdf in temp:
    print(key)
    print(pd.DataFrame(subdf), '\n')
    break

In [None]:
#Drop duplicate data
del temp, subdf
df.drop_duplicates(subset=['name'], inplace=True)
df.info()

In [None]:
# Let's check the null counts once again

df.isnull().sum().sort_values(ascending=False)

In [None]:
# Impute name column with 'blank'
df.loc[df['name'].isnull(), 'name'] = 'blank'    # or use df['name'] = df['name'].fillna('blank')

# Impute host_name with 'blank'
df.loc[df['host_name'].isnull(), 'host_name'] = 'blank'

In [None]:
# Let's check the null counts once again

df.isnull().sum().sort_values(ascending=False)

In [None]:
df.to_csv('Airbnb.csv', encoding='utf-8',index=False)

In [None]:
df.shape


### What all manipulations have you done and insights you found?

***Manipulation :***

According to our idea, hosts can use the host dashboard on the Airbnb platform to manage their properties and add unique elements to their listings.

**Details of the listing:** Property details that hosts can change include the kind of property, number of bedrooms, number of bathrooms, amenities, and house rules.

**Establish house rules:**
In addition to general instructions, hosts might provide specifics like whether smoking is permitted and when visitors should arrive and leave.

**Modify pricing:** In addition to charging for additional services, such as cleaning or additional guests, hosts can determine the nightly rate for their listing.

**Reviews:** To foster trust in the Airbnb community, hosts are able to publish reviews about their guests, and guests are able to post feedback about their stays.

***Insights found:***

Due to its size and variety as a platform for experiences and short-term rentals, Airbnb produces a huge amount of data. Data insights from Airbnb can offer both hosts and the firm itself useful information.

**Trends in Supply and Demand:**
Recognise the need for lodging at various times and in various places.
Examine the availability of listings in different areas to determine which ones are saturated.

**Strategies for Pricing:**
Examine market patterns in order to assist hosts in establishing fair fees.
Determine when demand is highest and modify prices accordingly.
Examine how amenities, location, and the kind of property affect costs.

**Analysis of User Behaviour:**
Examine user reviews to find recurring themes and potential areas for development.
Recognise the user



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Use a separate frame so the filter does not shrink `df` for the later charts
one_night = df[df['minimum_nights'] == 1]
# Mean nightly price for every (room type, neighbourhood group) pair
df1 = one_night.groupby(['room_type', 'neighbourhood_group'])['price'].mean().sort_values(ascending=True)
plt.figure(figsize=(12, 8))
df1.plot(kind='bar')
plt.title('Average Price for rooms in neighbourhood group')
plt.ylabel('Average Daily Price')
plt.xlabel('Room Type, Neighbourhood Group')
plt.show()
print('List of Average Price per night based on the neighbourhood group')
pd.DataFrame(df1).sort_values(by='room_type')

##### 1. Why did you pick the specific chart?

A bar chart was chosen because bar charts are straightforward and easy to interpret, which makes them a popular choice for displaying categorical data and comparing groups or categories. They give a clear representation of quantities and can be understood by a wide audience.

In the context of Airbnb booking analysis, a bar chart might be chosen for various reasons:
1.	Comparison of Quantities
2.	Categorical Data
3.	Simple Representation
4.	Time Series Analysis

##### 2. What is/are the insight(s) found from the chart?

Among listings with a one-night minimum stay, the average price is highest for Entire home/apt listings in Manhattan, at **293.580439**.
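The highest-priced combination can also be read off programmatically rather than from the bar heights. A minimal sketch on made-up prices (the `toy` frame is an assumption; on the real data the same calls would apply to the filtered listings):

```python
import pandas as pd

# Toy listings standing in for the real data (assumed values).
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Entire home/apt',
                  'Private room', 'Private room', 'Private room'],
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn',
                            'Manhattan', 'Brooklyn', 'Brooklyn'],
    'price': [300, 280, 180, 120, 80, 100],
})

# Mean price for every (room type, neighbourhood group) pair.
mean_price = toy.groupby(['room_type', 'neighbourhood_group'])['price'].mean()

# Wide table: one row per room type, one column per neighbourhood group.
table = mean_price.unstack('neighbourhood_group')

# Index label of the most expensive combination.
top_pair = mean_price.idxmax()
print(table)
print(top_pair, mean_price.max())
```

The wide table is convenient for reports, while `idxmax` returns the exact pair without reading it from the plot.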

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Indeed, using Airbnb data for exploratory data analysis (EDA) may yield insightful findings that have a favourable effect on the company. There exist multiple avenues in which EDA might enhance Airbnb's business operations:

Comprehending Customer Behaviour: EDA facilitates the examination of user conduct, inclinations, and reservation trends. Business strategies can be informed by insights into the features, locations, or amenities that guests find most appealing. This enables Airbnb to optimise its platform for maximum user happiness.

Airbnb can create efficient pricing plans by examining demand changes, pricing patterns, and the effects of various factors on rental prices. By doing this, hosts may optimise their profits and maintain their competitiveness in the market.

Analysis of Supply and Demand: EDA can shed light on the dynamics of Supply and Demand for various

#### Chart - 2

In [None]:
# Chart - 2 visualization code

neighbourhood_group_counts=df.neighbourhood_group.value_counts()
group_names = neighbourhood_group_counts.index
colors = ['#008fd5','#fc4f30','#e5ae38','#6d904f','#8b8b8b']
explode = (0.05,0,0,0,0)
group_counts = neighbourhood_group_counts.values

plt.figure(figsize=(8,8))
plt.pie(group_counts, explode = explode, labels=group_counts, colors= colors, autopct = '%1.1f%%', startangle=0,)
plt.legend(group_names)
plt.title('Neighbourhood Group wise listing counts')
plt.show()
pd.DataFrame(neighbourhood_group_counts)

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. The share of each category is easy to compare through the area it covers in the circle, with each category shown in a different colour. Pie charts are frequently used for percentage comparisons, so a pie chart was used here to compare the share of listings in each neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

Manhattan (42.7%) has the largest number of listings, followed by Brooklyn (36.5%).
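The percentages printed on the pie slices can be reproduced as a table with `value_counts(normalize=True)`. The sketch below uses a made-up series, so its shares are not the ones reported above.

```python
import pandas as pd

# Toy column standing in for df['neighbourhood_group'] (assumed values).
groups = pd.Series(
    ['Manhattan', 'Manhattan', 'Brooklyn', 'Queens'],
    name='neighbourhood_group',
)

# normalize=True returns fractions; multiply by 100 for percentages.
share = (groups.value_counts(normalize=True) * 100).round(1)
print(share)
```

Having the shares as numbers makes them easy to quote in the write-up and to compare across filtered versions of the data.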

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Manhattan (42.7%) has the most listings, followed by Brooklyn (36.5%). The pie chart shows the distribution of listings across neighbourhood groups, with both the counts and the percentages.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

viz_Airbnb_map = px.scatter_mapbox(df, lat="latitude", lon="longitude",
                                   color="neighbourhood_group",
                                   color_discrete_map={
                                                      'Bronx': '#222A2A',
                                                      'Brooklyn': '#2E91E5',
                                                      'Manhattan': '#FC0080',
                                                      'Queens': '#750D86',
                                                      'Staten Island': '#0000EE'
                                                      }, zoom=10, height=780,
                                                      width =1000)
viz_Airbnb_map.update_layout(mapbox_style="open-street-map")
viz_Airbnb_map.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
viz_Airbnb_map.show()

##### 1. Why did you pick the specific chart?

`scatter_mapbox` is a Plotly Express function for creating interactive map visualizations. It plots each listing at its latitude and longitude, which makes the geographic distribution of listings across neighbourhood groups easy to see.

##### 2. What is/are the insight(s) found from the chart?

Listings are considerably more numerous in the 'Manhattan' and 'Brooklyn' neighbourhood groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Listings are considerably more numerous in the 'Manhattan' and 'Brooklyn' neighbourhood groups, which suggests higher demand for Airbnbs there and that these areas attract the most tourists. The average price in these neighbourhood groups may also be higher than in the others because of this demand.


#### Chart - 4

In [None]:
# Chart - 4 visualization code

neighbourhood_group_counts=df.neighbourhood_group.value_counts()
x= 'neighbourhood_group'
y= 'price'
title = 'Price per Neighbourhood Group'
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=x, y=y, data=df)
plt.title(title)
plt.show()
pd.DataFrame(neighbourhood_group_counts)

##### 1. Why did you pick the specific chart?

Box plots are widely used in EDA for several reasons:

1.	Summary of Distribution
2.	Outlier Detection
3.	Space Efficiency

##### 2. What is/are the insight(s) found from the chart?

The larger the box, the greater the spread of the data. Manhattan has the largest spread, i.e. **5295**.
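Spread can be quantified per group instead of judged by eye. The sketch below assumes "spread" is read either as the height of the box (interquartile range) or as the full range; the helper names `iqr` and `value_range` and the toy prices are illustrative, not taken from the notebook.

```python
import pandas as pd

# Toy prices standing in for the real listings (assumed values).
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan'] * 5 + ['Bronx'] * 5,
    'price': [100, 150, 200, 300, 1000, 50, 60, 70, 80, 90],
})


def iqr(prices):
    """Interquartile range: the height of the box in a box plot."""
    return prices.quantile(0.75) - prices.quantile(0.25)


def value_range(prices):
    """Full range, which is dominated by the most extreme listing."""
    return prices.max() - prices.min()


spread = toy.groupby('neighbourhood_group')['price'].agg(
    iqr=iqr, value_range=value_range
)
print(spread)
```

The two measures can tell different stories: a single very expensive listing inflates the range but barely moves the interquartile range.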

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

sns.set(rc={'figure.figsize':(15,8)})
viz_neigh_grp_price = sns.violinplot(data=df[df.price < 630],
                                     x='neighbourhood_group', y='price')
viz_neigh_grp_price.set_title('Density and distribution of price for each neighbourhood_group');
plt.show()


##### 1. Why did you pick the specific chart?

Violin plots combine aspects of box plots and kernel density plots. They are particularly useful for displaying the distribution of a dataset and comparing multiple distributions.
Some reasons for choosing a violin plot:

1. Distribution Comparison
2. Data Symmetry
3. Handling Unequal Sample Sizes

##### 2. What is/are the insight(s) found from the chart?

As seen in the violin plot, given the price distribution across different neighbourhoods, the median price is a better estimate for comparisons. The mean value would be influenced by the outliers.
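A tiny worked example shows why the median is the safer summary when a few listings are extremely expensive. The prices below are made up for illustration.

```python
import pandas as pd

# Five ordinary listings and one extreme outlier (assumed values).
prices = pd.Series([80, 90, 100, 110, 120, 5000])

mean_price = prices.mean()      # pulled upward by the single outlier
median_price = prices.median()  # stays near the typical listing

print(round(mean_price, 2), median_price)
```

The single outlier pushes the mean above 900, while the median stays at 105, close to what a typical listing costs.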



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from violin plots can have a positive business impact in various ways:

1.	Data Distribution Understanding
2.	Comparison Between Groups
3.	Trend Analysis
4.	Decision Support


#### Chart - 6

In [None]:
# Chart - 6 visualization code

sns.set(rc={'figure.figsize':(16,8)})
sns.histplot(df[df['minimum_nights']<=30].minimum_nights);
plt.show()

##### 1. Why did you pick the specific chart?

Histograms are a type of data visualization that is commonly used to represent the distribution of a dataset. In the context of Airbnb booking analysis, histograms can be useful for understanding patterns and trends in booking data.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows that most listings require a minimum stay of fewer than 15 nights.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

For all listings with a minimum stay of up to 15 nights, we divide them into two groups, 1 to 7 nights and 8 to 15 nights, and analyse the median price distribution in different neighbourhoods.
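The two-band split described above can be sketched with `pd.cut`. The `toy` frame, the `stay_band` column name, and the choice to include 15 nights in the upper band are assumptions for illustration.

```python
import pandas as pd

# Toy listings standing in for the real data (assumed values).
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan',
                            'Brooklyn', 'Brooklyn', 'Brooklyn'],
    'minimum_nights': [1, 3, 10, 2, 14, 30],
    'price': [200, 180, 150, 100, 90, 70],
})

# Keep listings whose minimum stay is at most 15 nights.
short = toy[toy['minimum_nights'] <= 15].copy()

# Bins are right-inclusive: (0, 7] becomes '1-7', (7, 15] becomes '8-15'.
short['stay_band'] = pd.cut(
    short['minimum_nights'], bins=[0, 7, 15], labels=['1-7', '8-15']
)

# Median price per neighbourhood group and stay band.
median_price = (
    short.groupby(['neighbourhood_group', 'stay_band'], observed=True)['price']
    .median()
    .unstack()
)
print(median_price)
```

Using the median here follows the earlier observation that the mean is sensitive to outliers.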

#### Chart - 7

In [None]:
# Chart - 7 visualization code

sns.set_palette("muted")
x = 'minimum_nights'
y = 'price'

title = 'Price relation to minimum_nights for Properties under $550'
data_filtered =df[(df['minimum_nights']<=30) & (df['price'] <= 550)]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=data_filtered)
plt.title(title)
plt.ioff()
plt.show()

##### 1. Why did you pick the specific chart?

One kind of visualisation that shows data as a collection of points is a scatter plot. Each point's location on the plot indicates the values of two different variables. In order to see how two variables relate to one another and to spot any patterns or trends in the data, scatter plots are frequently utilised.

##### 2. What is/are the insight(s) found from the chart?

A scatter plot of minimum nights against price shows that the first half of the plot is denser, hinting at the wide range of prices available for shorter minimum stays. Around 25 nights the price range narrows, suggesting that listings requiring a minimum stay of about 25 nights tend to be cheaper.
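The visual impression from a scatter plot can be backed by a correlation coefficient. The sketch below computes Pearson's coefficient and a rank-based (Spearman-style) coefficient by correlating the ranks. The toy values are made up, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy listings standing in for the filtered data (assumed values).
toy = pd.DataFrame({
    'minimum_nights': [1, 2, 3, 7, 14, 30],
    'price': [220, 200, 210, 150, 120, 90],
})

# Pearson measures linear association between the raw values.
pearson = toy['minimum_nights'].corr(toy['price'])

# Correlating the ranks gives Spearman's coefficient, which captures
# any monotonic trend and is less sensitive to extreme prices.
spearman = toy['minimum_nights'].rank().corr(toy['price'].rank())

print(round(pearson, 3), round(spearman, 3))
```

A negative coefficient would be consistent with longer minimum stays going with lower prices, but it describes association only, not cause.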

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Scatter diagrams are an effective way to reveal non-linear patterns.
They also make the range of the data visible, such as the maximum and minimum values.
Plotting scatter diagrams therefore supports better project decisions.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

For Airbnb, achieving business goals requires a blend of technological implementation, ongoing improvement, and strategic planning.

**Clear Business Goals:** Clearly state the goals the client hopes to accomplish with Airbnb. These objectives can be raising occupancy rates, breaking into untapped markets, raising client satisfaction levels, or refining pricing policies.

**Optimized Listings:** Make sure Airbnb listings are optimised. This consists of excellent images, thorough and precise descriptions, and reasonable prices. Invest in high-quality photographs to properly present homes.

**Competitive Pricing:** Examine local market price patterns and modify pricing tactics as necessary. Offering lower prices can draw in more visitors, particularly at the busiest times of year.

**Enhanced Guest Experience:** Pay close attention to delivering a first-rate visitor experience. This includes thoughtful facilities, tidy and well-maintained buildings, and timely communication. Bookings in the future might be greatly impacted by positive visitor reviews.

**Partnerships and Collaborations:** To improve the entire visitor experience, look into forming alliances with nearby companies or tourism associations. Increased visibility and cross-promotion are possible outcomes of collaborations.

**Customer relationship management (CRM):** To handle interactions, preferences, and comments from visitors, put in place a CRM system. Enhancing client pleasure and personalising future interactions can be achieved through this.

**Investing in Technology:** Take advantage of solutions like automated messaging platforms, channel managers, and property management systems (PMS) that can help to optimise operations. These innovations can improve productivity and visitor happiness.

**Continuous Improvement:** Look for areas for improvement and assess performance indicators on a regular basis. Remain flexible and modify your tactics in response to changing consumer demands and market trends.


# **Conclusion**

To put it briefly, in this notebook we handled missing values, removed duplicate listings, and filtered out extreme values before plotting. We also conducted some exploratory data analysis.

Our data clearly shows that Manhattan and Brooklyn are the neighbourhood groups with the greatest number of Airbnb listings. Additionally, the most common room types are private rooms and entire homes. When it comes to prices, the greatest median price is found in Manhattan, followed by Brooklyn. We also observed that the number of reviews in the various areas supports our conclusion that Manhattan and Brooklyn are the most popular. Finally, we looked at the price ranges available depending on the minimum number of nights.

A new host who wants to use Airbnb for business purposes may find this analysis beneficial. They can quickly determine which locations are in higher demand and which kinds of rooms are most popular with guests, and focus their business setup on those places. It might also be advantageous for current hosts to understand which neighbourhoods are lucrative. A visitor to New York would also benefit from this analysis, which gives an understanding of the price range for various room types in various neighbourhoods, so they can select the locations that best suit their needs. In subsequent work, this data could be used to train a machine learning regression model to forecast Airbnb prices.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***