<a href="https://colab.research.google.com/github/nikita13-hub/-EDA-AirBnb-Bookings-Analysis-/blob/main/Copy_of_Sample_EDA_Submission_Template__Nikita.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AirBnb Bookings Analysis




##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Individual


# **Project Summary -**

This project is about analyzing a dataset with around 49,000 entries to find useful insights that can help management and stakeholders make better decisions. The goal is to discover ways to grow and improve the business. By looking for patterns in the data, we aim to provide suggestions that will benefit both guests and hosts.


The dataset includes information such as:

* Number of listings in different neighborhood groups.
* Prices for the listings.
* Reviews that show guest satisfaction.
* Room types that guests prefer.

Using this data, we will answer important questions like:

* What types of rooms are most popular with guests?
* What price ranges are preferred?
* Which neighborhoods are the most in demand?
* How can we match guests with listings based on their budget and preferences?

We will also create a system to rank hosts based on how well they meet guest needs. This will help identify top-performing hosts and guide others on how to improve their services. By focusing on what customers want and offering personalized options, this project will improve guest satisfaction and help hosts attract more bookings.

For the analysis, we will use tools like:

* Pandas to clean and organize the data so it’s easier to understand.
* NumPy to do calculations, especially for ranking hosts and analyzing numbers.
* Matplotlib and Seaborn to create graphs and charts that show the results clearly.

This project is not just about analyzing data; it’s also a great learning opportunity. It will help me understand how businesses in this field work and how to solve real-world problems. I’ll improve my ability to identify issues, think critically, and create effective solutions.


From a data science perspective, I’ll get better at using Python libraries and applying advanced techniques to make sense of data. I’ll also learn how to turn data into a story that makes it easy for stakeholders to see what actions they should take.


By working on this project, I’ll gain valuable skills and experience in handling real-life challenges. It will help me improve my technical knowledge, problem-solving abilities, and understanding of business needs.


In short, this project will provide useful insights to grow the business, improve guest experiences, and help hosts perform better. At the same time, it will help me develop important skills and knowledge that I can use in future projects.





# **GitHub Link -**

# **Problem Statement**


The task of this project is to derive insights from the given dataset that stakeholders can use to improve the business.

#### **Define Your Business Objective?**


The objective of this project is to identify areas for improvement and uncover patterns and insights into customer preferences. These insights will guide strategies aimed at enhancing customer satisfaction and fostering long-term loyalty.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
from matplotlib import pyplot as plt  # For data visualization
import math  # For mathematical operations
import seaborn as sns  # For statistical data visualization




### Dataset Loading

In [None]:
# Load Dataset
# Create a DataFrame from the given dataset.
# pd.read_csv already returns a DataFrame, so no extra wrapping is needed.
airbnb_df = pd.read_csv('/content/Airbnb NYC 2019 (1).csv')
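Since the guidelines ask for exception handling, the load step could be wrapped so that a missing file or an unexpected schema fails early with a readable message. This is only a minimal sketch: `load_listings` is a hypothetical helper that is not part of the notebook, and the default list of required columns is an assumption about what later cells need.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path, required_columns=("id", "name", "price")):
    """Read the listings CSV, failing early with a readable message.

    Hypothetical helper: the required columns are an assumption.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {path}")
    df = pd.read_csv(path)
    # Verify that the columns later cells rely on are actually present.
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing expected columns: {missing}")
    return df
```

In this notebook it would be called with the same path as the cell above, for example `load_listings('/content/Airbnb NYC 2019 (1).csv')`.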

### Dataset First View

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Dataset First Look
airbnb_df.head(5)

In [None]:
# Let's check for all the columns we have in our dataset.
airbnb_df.columns

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airbnb_df.shape


### Dataset Information

In [None]:
# Dataset Info
airbnb_df.info()

###### Here we can see how the columns of our dataset split between categorical and numerical values:-

* 3 columns have float64 data values (Numerical)
* 7 columns have int64 data type values (Numerical)
* 6 columns have object data type values (Categorical)

#### Duplicate Values

In [None]:
# Dataset Duplicate Value count
duplicated_values = airbnb_df[airbnb_df.duplicated()]
duplicated_values


In [None]:
# Let us first check for 'id' column.
duplicated_id = airbnb_df[airbnb_df.duplicated(subset=['id'])]
duplicated_id

In [None]:
# Now let us check for 'name' column.
duplicated_name = airbnb_df[airbnb_df.duplicated(subset=['name'])]
duplicated_name

In [None]:
# We can check for any specific row having the name value included in the duplicated data, using the query command.
# Let's check for where name is 'Superior @ Box House', which is in our duplicated data.
airbnb_df.query('name == "Superior @ Box House"')

In [None]:
# To tackle this issue we will work on a copy of the data frame that lists
# explicitly the columns we use (all 16 columns are kept at this stage).
airbnb_df = airbnb_df[['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365']].copy()

In [None]:
# Let's check for the duplicated values of any other row.
airbnb_df.query('name == "Loft w/ Terrace @ Box House Hotel"')

In [None]:
# In this case we will look for duplicated values in those columns which
# are a concern and may affect the data if duplicated.

# Checking for the values where - 'name', 'host_name','neighbourhood_group', 'neighbourhood', 'room_type' are duplicated.
airbnb_df.loc[airbnb_df.duplicated(subset= ['name', 'host_name', 'neighbourhood_group', 'neighbourhood', 'room_type'])]

In [None]:
# Let's check for any specific value.
airbnb_df.query('name == "✿✿✿ COUNTRY COTTAGE IN THE CITY✿✿✿"')

In [None]:
# Converting the 'last_review' column in a datetime format.
airbnb_df['last_review'] = pd.to_datetime(airbnb_df['last_review'])

In [None]:
# Now let us sort the data by the last_review column
airbnb_df = airbnb_df.sort_values(by='last_review', ascending=False).reset_index(drop=True)

In [None]:
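# Replacing NA values with the latest date in the column.
# Assigning the result back avoids replace(..., inplace=True) on a column
# selection, which newer pandas versions warn about.
airbnb_df['last_review'] = airbnb_df['last_review'].fillna(airbnb_df['last_review'].max())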

# Let's sort the values again
airbnb_df = airbnb_df.sort_values(by='last_review', ascending=False).reset_index(drop=True)

In [None]:
# Dropping duplicated values
airbnb_df = airbnb_df.drop_duplicates(subset=['name', 'host_name', 'neighbourhood_group', 'neighbourhood', 'room_type'], keep='first').reset_index(drop=True)

# Now that the duplicated rows have been dropped, let's check the current shape of our data frame.
airbnb_df.shape # 48655 rows remain; all 16 columns are still present

# Let's check if there are any duplicated values now
airbnb_df[airbnb_df.duplicated(subset=['name', 'host_name', 'neighbourhood_group', 'neighbourhood', 'room_type'])] # No values

airbnb_df.query('name == "✿✿✿ COUNTRY COTTAGE IN THE CITY✿✿✿"') # The latest value shows up

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values count
null_values = airbnb_df.isnull()
null_value_count = null_values.sum()


In [None]:
# Visualizing the missing values
null_value_count
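Raw null counts can be easier to judge next to percentages. The sketch below shows one way to build such a summary; `toy_df` is made-up data standing in for `airbnb_df`, and the `missing_summary` layout is an illustrative choice, not something the notebook itself computes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for airbnb_df (values are made up).
toy_df = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [100, 80, 60, 40],
})

missing_summary = pd.DataFrame({
    "missing": toy_df.isnull().sum(),                    # null count per column
    "percent": (toy_df.isnull().mean() * 100).round(2),  # share of rows that are null
})

# Keep only the columns that actually have gaps, largest first.
missing_summary = missing_summary[missing_summary["missing"] > 0]
missing_summary = missing_summary.sort_values(by="missing", ascending=False)
print(missing_summary)
```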

In [None]:
# To check the outliers we need to check the columns which are having numerical data.
airbnb_df.describe()

In [None]:
# As we see above, the column name 'calculated_host_listings_count' is quite long.
# Let's change it to 'listings'
airbnb_df.rename(columns={
    "calculated_host_listings_count": 'listings'
}, inplace=True)

# Lets check the price columns once
airbnb_df['price'].describe()

# As we can see the min price is 0, which is not likely to happen.
# According to the current website the price range starts from $25.
# In this case we will raise every price below $25 to 25 so that we can
# handle the outliers in the price column.

airbnb_df['price'] = airbnb_df['price'].clip(lower=25) # flooring values at 25
airbnb_df['price'].describe()
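The effect of a price floor is easy to check on a handful of values. The sketch below uses `Series.clip` on made-up prices; the `PRICE_FLOOR` name is an assumption introduced here, and the value 25 is the floor chosen in this notebook.

```python
import pandas as pd

# Toy prices (made up); the real column is airbnb_df['price'].
prices = pd.Series([0, 10, 24, 25, 150], name="price")

PRICE_FLOOR = 25  # hypothetical constant name; 25 is the floor used in this notebook

# Count how many values the floor will change before applying it.
n_changed = int((prices < PRICE_FLOOR).sum())

# clip(lower=...) raises every value below the floor and leaves the rest alone.
floored = prices.clip(lower=PRICE_FLOOR)

print(n_changed)         # 3
print(floored.tolist())  # [25, 25, 25, 25, 150]
```

Counting the affected rows first makes it clear how much of the data the floor actually touches.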

### What did you know about your dataset?

The Airbnb NYC 2019 dataset has 48,895 rows and 16 columns.

The columns split between categorical and numerical values as follows:-

* 3 columns have float64 data type values (Numerical)
* 7 columns have int64 data type values (Numerical)
* 6 columns have object data type values (Categorical)


The dataset contains both numerical and categorical data.


The primary key of our dataset is the "id" column, which holds a unique ID for each listing.


Here is the list of columns:- ['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']


This dataset had no exact duplicate rows; however, 998 rows had a duplicated value in the name column. In these rows almost all the values were the same except for the price, so it is possible that the price was changed over time to follow market conditions while the listing itself stayed the same.


To handle these rows we sorted our dataframe so that the latest entries come first, which required a timestamp. We were able to use the 'last_review' column as that timestamp.


After these operations we kept only the most recent row for each duplicated listing.
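The keep-the-latest-row idea described above can be sketched on toy data. The listings and dates below are made up, and unlike the notebook this sketch leaves missing review dates as they are, relying on `sort_values` placing them last by default.

```python
import pandas as pd

# Toy listings (made up): "Loft A" appears twice with different prices.
toy_df = pd.DataFrame({
    "name": ["Loft A", "Loft A", "Studio B"],
    "host_name": ["Ann", "Ann", "Bob"],
    "price": [120, 150, 80],
    "last_review": ["2018-05-01", "2019-06-01", None],
})
toy_df["last_review"] = pd.to_datetime(toy_df["last_review"])

# Newest review first; rows without a review date sort last by default.
toy_df = toy_df.sort_values(by="last_review", ascending=False)

# keep='first' now retains the most recently reviewed copy of each listing.
deduped = toy_df.drop_duplicates(subset=["name", "host_name"], keep="first")
deduped = deduped.reset_index(drop=True)
print(deduped)
```

Because the sort happens before `drop_duplicates`, the row that survives for "Loft A" is the one reviewed in 2019.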

These are the columns with the most null values:-


* last_review = 10052
* reviews_per_month = 10052

As the date column was missing values, we replaced the NA values with the latest date present in our dataset so that the column could still serve as a timestamp.


A few "Names" and "Host Names" are missing as well:-

* name = 16
* host_name = 21

We have now successfully formatted our dataframe and it is now ready for data wrangling.

## ***2. Understanding Your Variables***

In [None]:
# Looking on our data once
airbnb_df.head()

In [None]:
# Dataset Columns
df_columns = airbnb_df.columns
df_columns # All the columns of our cleaned data.

Variables can be grouped into 3 broad types according to their roles:-


Numerical Variables: These variables represent quantitative data and can be further categorized into:-


* Continuous Variables: These variables can take any value within a given range, for example latitude or reviews_per_month.

* Discrete Variables: These variables take only separate, countable values, for example number_of_reviews or minimum_nights.

Categorical Variables: These variables represent qualitative data and can be further categorized into:-


* Nominal Variables: These variables have no inherent order or ranking, for example room_type or neighbourhood_group.

* Ordinal Variables: These variables follow an order and can therefore be ranked.

Time Variables: These variables hold dates and times, for example last_review.
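The three roles above can be recovered roughly from column dtypes. The sketch below is one possible approach on made-up data; `toy_df` stands in for `airbnb_df`, and treating every remaining column as categorical is a simplifying assumption.

```python
import pandas as pd

# Toy frame standing in for airbnb_df (values are made up).
toy_df = pd.DataFrame({
    "price": [100, 250],
    "reviews_per_month": [0.5, 1.2],
    "room_type": ["Private room", "Shared room"],
    "last_review": pd.to_datetime(["2019-06-01", "2019-07-01"]),
})

# Numerical variables: integer and float columns.
numerical = toy_df.select_dtypes(include="number").columns.tolist()

# Time variables: datetime columns.
time_cols = toy_df.select_dtypes(include="datetime").columns.tolist()

# Assumption: everything else is treated as categorical.
categorical = [c for c in toy_df.columns if c not in numerical + time_cols]

print(numerical, categorical, time_cols)
```

Whether a numerical column is continuous or discrete, and whether a categorical one is nominal or ordinal, still has to be decided by looking at what the column means.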

In [None]:
# Dataset Describe
df_describe = airbnb_df.describe()
df_describe # These are all the numerical variables in our dataset.

### Variables Description

The summary above gives several measures from which we can draw conclusions about the data. For example, the dataset covers a wide range of prices, but the average price is around 150, which gives us an idea of the typical budget of our guests.

### Check Unique Values for each variable.

In [None]:
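# Check Unique Values for each variable.
def unique_value(df):
  """Print the unique values found in every column of df."""
  for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values for '{column}': {unique_values}")

# The function above is only a definition, so call it to see the output.
unique_value(airbnb_df)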
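Printing every unique value is verbose for columns such as `id`, where almost every row is distinct. A compact alternative is to count the distinct values per column with `DataFrame.nunique`. The sketch below uses made-up data, and `toy_df` is a stand-in for `airbnb_df`.

```python
import pandas as pd

# Toy frame standing in for airbnb_df (values are made up).
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "room_type": ["Private room", "Shared room", "Private room", "Private room"],
    "neighbourhood_group": ["Bronx", "Queens", "Bronx", "Bronx"],
})

# Number of distinct values per column, smallest first.
unique_counts = toy_df.nunique().sort_values()
print(unique_counts)
```

Columns with only a handful of distinct values are the natural candidates for grouping later on.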

## 3. ***Data Wrangling***

### Data Wrangling Code
As our data cleaning, data transformation, and outlier handling have been completed, we will now work on "Feature Engineering".



In [None]:
# Let's create a new column with the price range distribution.
# We will use it while working with the price column.
start = 0
end = 10000
breakpoints = np.linspace(start, end, num=101)
breakpoints = breakpoints.astype(int)
def price_range(amt):
    bp = breakpoints
    for i in range(len(breakpoints)-1):
        if bp[i] <= amt <= bp[i+1]:
            return f"{bp[i]} - {bp[i+1]}"


airbnb_df['price_range'] = airbnb_df['price'].apply(price_range)
airbnb_df['price_range']

Creating this price range column gives us these benefits:-

Data Summarization: It provides an effective summary of the price distribution in our dataset.

Visualization: Plotting price ranges is much more effective than plotting the individual prices.

Segmentation and Analysis: Because it divides the prices into segments, it becomes easy to compare one price segment or range with another.

Decision Making: It can support decisions in different scenarios; for example, we can use it to recommend listings to customers according to their budget and requirements.

Communicating Insights: Price ranges communicate the budget and preferences of our customers clearly, and they show where the majority of our customers sit in terms of purchasing power.
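The same binning can also be done with `pd.cut`, which returns an ordered categorical so the ranges sort in numeric order. This is a sketch of an alternative, not the notebook's method. It reuses the 0 to 10000 breakpoints in steps of 100 from the cell above, and the sample prices are made up.

```python
import numpy as np
import pandas as pd

# Same bin edges as the notebook: 0, 100, 200, ..., 10000.
breakpoints = np.linspace(0, 10000, num=101).astype(int)
labels = [f"{lo} - {hi}" for lo, hi in zip(breakpoints[:-1], breakpoints[1:])]

# Toy prices (made up) chosen to hit the bin edges.
prices = pd.Series([25, 100, 101, 10000], name="price")

# right=True puts an edge value such as 100 into the lower bin ("0 - 100");
# include_lowest=True makes the first bin accept 0 as well.
price_range = pd.cut(prices, bins=breakpoints, labels=labels,
                     right=True, include_lowest=True)

print(price_range.tolist())
# ['0 - 100', '0 - 100', '100 - 200', '9900 - 10000']
```

Because the result is an ordered categorical, sorting or grouping by it follows the bin order rather than the alphabetical order of the label strings.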

In [None]:
# Let's take a look at our dataset once.
airbnb_df.head()

# As we can see our dataset is sorted according to the last review, let us sort our dataset according to the price.
airbnb_df.sort_values(by='price', ascending=False, inplace=True)

# Let's take a look at our price sorted dataset once.
airbnb_df.head()

In [None]:
# Let us check our columns and filter those columns that will work as our features.
airbnb_df.columns

We can consider these columns to be our features so that we can work with these:-

* id
* name
* host_id
* neighbourhood_group
* neighbourhood
* room_type
* price
* price_range
* minimum_nights
* number_of_reviews
* reviews_per_month
* listings
* availability_365

In [None]:
# Let's filter the dataframe down to only the required columns.
feature_df = airbnb_df[['id', 'name', 'host_id', 'neighbourhood_group',
       'neighbourhood', 'room_type', 'price', 'minimum_nights',
       'number_of_reviews', 'reviews_per_month', 'listings',
       'availability_365', 'price_range']]

feature_df = feature_df.reset_index(drop=True)
feature_df.head()

Our dataset contains a few natural groups; let us check what insights we can derive by looking at the data group by group.

In [None]:
# Let's check the top-performing hosts by total listing count.
# We group by 'host_id' because 'name' is the listing title, not the host.
host_groups = feature_df.groupby('host_id')
hosts = []
no_of_listings = []
host_prices = []
for host, data in host_groups:
  hosts.append(host)
  # 'listings' already stores the host's listing count on every row,
  # so we read it once instead of summing it.
  no_of_listings.append(data['listings'].iloc[0])
  host_prices.append(data['price'].mean())

host_df = pd.DataFrame({
    'Host ID': hosts,
    'Total Listings': no_of_listings,
    'Price': host_prices
})
host_df = host_df.sort_values(by='Total Listings', ascending=False).reset_index(drop=True)
top_10_hosts = host_df.head(10)

# Rough revenue proxy: listing count times the host's average listed price.
host_df['Revenue'] = (host_df['Total Listings'])*(host_df['Price'])
top_host_revenue = host_df.sort_values(by='Revenue', ascending=False).reset_index(drop=True)
top_host_revenue.head(10)

In [None]:
# First let's check how many neighbourhood_groups are there.
feature_df['neighbourhood_group'].unique() # There are 5 different groups.

# ['Brooklyn', 'Queens', 'Manhattan', 'Staten Island', 'Bronx']
#  let's group our dataset according to the neighbourhood groups.

n_groups = feature_df.groupby('neighbourhood_group')

# Let's check which group is most preferred as per the listings and the no. of reviews.
groups = [] # To save the groups
listings = []
reviews = []
max_price = []
min_price = []
for group, data in n_groups:
  groups.append(group)
  listings.append(data['listings'].sum())
  reviews.append(data['number_of_reviews'].sum())
  max_price.append(data['price'].max())
  min_price.append(data['price'].min())

group_feat_df = pd.DataFrame({
    'Group': groups,
    'Listing Count': listings,
    'No._of_reviews': reviews,
    'Min Price': min_price,
    'Max Price': max_price
})

group_feat_df
# Here we can see that the Manhattan group is the most preferred group as per the listings.
# However, the most reviews are given to the Brooklyn group.

# Now let's divide our dataset as per the room types, and create their groups.
# First let's check how many room types there are.

feature_df['room_type'].unique() # There are 3 room types.

# ['Entire home/apt', 'Private room', 'Shared room']
#  let's group our dataset according to the room types.

room_groups = feature_df.groupby('room_type')
rooms = []
room_listings = []
room_reviews = []
max_room_price = []
min_room_price = []
for room_type, room_data in room_groups:
    rooms.append(room_type)
    room_listings.append(room_data['listings'].sum())
    room_reviews.append(room_data['number_of_reviews'].sum()) # Needed for the review comparison below
    max_room_price.append(room_data['price'].max())
    min_room_price.append(room_data['price'].min())

room_feat_df = pd.DataFrame({
    'Group': rooms,
    'Listing Count': room_listings,
    'No._of_reviews': room_reviews,
    'Min Price': min_room_price,
    'Max Price': max_room_price
})

room_feat_df
# Here we can see the most preferred room type is the Entire home/apt.
# The most reviews are also given to the Entire home/apt.
# Shared rooms are the least preferred.

In [None]:
# Now let us check how many different neighbourhoods there are in total.
# nunique() counts distinct neighbourhoods; count() would only return the number of rows.
feature_df['neighbourhood'].nunique()

# Now let us check the top 10 most preferred neighbourhoods.
# Also, let's check their average pricing.
area_groups = feature_df.groupby('neighbourhood')
areas = []
listing_count = []
avg_price = []
for area, n_data in area_groups:
  areas.append(area)
  listing_count.append(n_data['listings'].sum())
  avg_price.append(round(n_data['price'].mean(), 2))

area_feat_df = pd.DataFrame({
    'Area': areas,
    'Listing Count': listing_count,
    'Average Price': avg_price,
})

area_feat_df = area_feat_df.sort_values(by='Listing Count', ascending=False).reset_index(drop=True)
area_feat_df.head(10)

In the manipulations above we created and worked with a few groups present in our dataset and derived a few insights from them.
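The loop-and-append pattern used for these groups can also be written as a single `groupby(...).agg(...)` call with named aggregations. The sketch below is an equivalent style shown on made-up data, not the notebook's own code; `toy_df` and the output column names are assumptions.

```python
import pandas as pd

# Toy frame standing in for feature_df (values are made up).
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "price": [60, 80, 100, 40, 90],
    "number_of_reviews": [5, 0, 12, 3, 7],
})

group_summary = (
    toy_df.groupby("neighbourhood_group")
    .agg(
        row_count=("price", "size"),                 # listings (rows) per group
        total_reviews=("number_of_reviews", "sum"),  # reviews per group
        min_price=("price", "min"),
        max_price=("price", "max"),
        avg_price=("price", "mean"),
    )
    .reset_index()
)
print(group_summary)
```

Each keyword in `agg` names an output column and pairs a source column with an aggregation, so one call replaces the separate lists and the manual DataFrame construction.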

In [None]:
# Let us now work with relationships.
# Relationship: Price vs. Room Type
# Checking the distribution of prices for different room types.
# "room_groups" is our grouped dataframe we will be using this.

avg_room_price = []
for room, data in room_groups:
  avg_room_price.append(data['price'].mean())

room_vs_price = pd.DataFrame({
    'Room Type': rooms,
    'Avg_Price': avg_room_price
})

room_vs_price

# As we can see here the Entire home/apt is having the highest pricing.

In [None]:
# Relationship: Reviews per Month vs. Room Type
# Comparing the distribution of reviews per month for different room types.

total_reviews = []
for room, data in room_groups:
  total_reviews.append(data['reviews_per_month'].sum())

room_vs_reviews = pd.DataFrame({
    'Room Type': rooms,
    'Reviews': total_reviews
})

room_vs_reviews
# As we can see here the Entire home/apt is having the highest no. of reviews per month.

### What all manipulations have you done and insights you found?
This dataset had a good amount of well-distributed information, which was a plus point for our study. As this is an Airbnb dataset, location plays a crucial role in these values, and the information was well distributed across the locations and their respective areas.

We performed a number of manipulations to find information that can help stakeholders make decisions for the business. Our hosts will also get a good idea of their customers' preferences, as we promised in our agenda.

We divided the manipulations into 2 major parts, features and relationships, so that we can draw insights accordingly.

These are the manipulations we followed:-

* Created a new price range column: This column helps us categorize the price distribution and makes it simpler to judge the budget of our customers. We tried to keep the ranges as precise as possible.

* Sorted the dataset by price: The earlier ordering by last review was not meaningful for this analysis, so we sorted the data by price, keeping the most expensive listings at the top.


* Filtered the data: We kept only the columns required for the data analysis and feature engineering, so that the data is simpler and easier to work with.


* Divided the data according to the categorical groups: This helped us identify outcomes in a structured manner and gave us more specific information about each group individually.

These are the different groups we created and worked with:-


*   Created neighbourhood groups: This grouping helped us get an idea of which neighbourhood group is the most preferred of all.


*  Created room type groups: This grouping helped us identify which room type is the most preferred by our customers.


*  Created neighbourhood area groups: This grouping helped us identify which area is the most preferred within the neighbourhoods.


We also worked on a few relationships that support comparisons our stakeholders can use to make decisions. These are the relationships we worked with:-


Relationship: Price vs. Room Type


* Checked the distribution of prices for different room types.

* Determined which room type has the highest prices.

Relationship: Reviews per Month vs. Room Type

* Compared the distribution of reviews per month for different room types.

* Explored the level of engagement and satisfaction of our customers as per different room types.

Relationship: Price vs. Neighbourhood

* Compared the distribution of prices across different neighbourhoods.

* Identified neighbourhoods with higher or lower average prices and explored price variations as per the areas.

These are the manipulations we did to derive useful information that can contribute to the growth of our business and our hosts' business, and ultimately improve the travelling experience for our customers.

The specific insights are mentioned alongside the visualisations, so that they can be read and seen at the same time.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Let's visualize the most preferred neighbourhood group with a bar chart,
# using the group_feat_df dataframe we created earlier.

plt.figure(figsize=(10,5))
colors = ['red', 'blue', 'green', 'orange', 'purple'] # To make each bar with a different color
sns.barplot(x="Group", y='Listing Count', data=group_feat_df, hue='Group', palette=colors, legend=False)
plt.xlabel('Groups')
plt.ylabel('Listing Counts')
plt.title('Most preferred neighbourhood')
for x, y in zip(range(len(groups)), listings):
    plt.text(x, y, f'{y}', ha='center', va='bottom') # To annotate each bar with the exact value
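Annotating each bar by hand works, but Matplotlib 3.4 and later also offer `Axes.bar_label`, which places the value above every bar in one call. The sketch below uses plain Matplotlib with made-up categories and counts, so the numbers are not results from this dataset.

```python
import matplotlib.pyplot as plt

groups = ["Bronx", "Brooklyn", "Queens"]  # toy categories (made up)
counts = [3, 12, 7]                       # toy values, not real results

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(groups, counts)

# bar_label writes each bar's height above it and returns the Text objects.
annotations = ax.bar_label(bars)

ax.set_xlabel("Groups")
ax.set_ylabel("Listing Counts")
ax.set_title("Bar chart with automatic value labels")
```

With seaborn's `barplot`, the bar containers are available through `ax.containers` on the returned axes, so the same call can be applied there.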

##### 1. Why did you pick the specific chart?

We need to see which group has the highest preference, which means comparing the value reached by each group against all the others. A bar chart is the best option for that kind of comparison.

##### 2. What is/are the insight(s) found from the chart?

Manhattan is clearly the most preferred group, outperforming all the other groups by a large margin. Brooklyn is our customers' second preference after Manhattan.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely. Knowing our customers' preferences lets us focus on what is most in demand and meet their requirements. Since the most preferred group here is Manhattan, we can target our customers with the availability in that area.

As for negative growth, it does show up in the other neighbourhood groups, where demand is low. If we find out why Manhattan is the most preferred group, we can identify the cause of the low demand in those areas, so we need to focus on the reasons behind the difference to get to the root of it.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Let's visualize the most preferred room type using a pie chart

room_feat_df # Our grouped df created earlier

# Plotting the data into a pie chart
plt.pie(room_feat_df['Listing Count'], labels=room_feat_df['Group'], autopct="%1i%%", explode=(0.1, 0, 0.1), shadow=True)
plt.show()

##### 1. Why did you pick the specific chart?

As we have only 3 room types, a pie chart makes it easy to see how the listings are distributed and how the room types differ. It also shows the difference as percentages, which gives us a wider view of the outcome.

##### 2. What is/are the insight(s) found from the chart?

The "Entire home/apt" room type is clearly the most preferred one. This suggests that people prioritise their privacy and would rather have the place to themselves. Shared rooms are the least preferred; they are probably chosen mostly by guests such as visiting students who need affordable accommodation. This information can be used for more targeted approaches.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Because we can see what the majority of our customers prefer, we can spend advertising money more efficiently and reduce wasted cost. We can also work out which kinds of customers prefer the other room types and target them with what they need, which can grow our market even in the low-demand segments.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Now let's check which price range has the most listings; it gives us an idea of the most preferred price range of our customers.
# Let's plot the top 10 most preferred price ranges on a chart.
prefered_range = airbnb_df['price_range'].value_counts().head(10)
plt.figure(figsize=(10,5))

# Use the same sorted series for the dots and the line so they share one x-axis order.
prefered_range_sorted = prefered_range.sort_index()
plt.scatter(prefered_range_sorted.index, prefered_range_sorted, color='blue', marker='o')
plt.plot(prefered_range_sorted.index, prefered_range_sorted, color='black', linestyle='-', linewidth=2, label='Line Connecting Dots')

for x, y in zip(prefered_range_sorted.index, prefered_range_sorted): # To annotate each plotted value
    plt.text(x, y, f'{y}', ha='center', va='bottom')
plt.title("Top 10 Preferred Price Range")
plt.xlabel("Price Range")
plt.ylabel("Listings")
plt.grid(color='black')
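One detail to keep in mind with these labels: they are strings, so a plain sort orders them character by character rather than by price. The sketch below shows the difference and one way to sort by the numeric lower bound. `range_lower_bound` is a hypothetical helper, and the labels are examples in the same "low - high" format as `price_range`.

```python
def range_lower_bound(label):
    """Return the numeric lower bound of a label such as '100 - 200'."""
    return int(label.split(" - ")[0])


labels = ["1000 - 1100", "0 - 100", "200 - 300", "100 - 200"]

# Plain string sort: '1000 - 1100' lands before '200 - 300'.
lexicographic = sorted(labels)

# Sorting on the numeric lower bound gives the order a reader expects.
numeric = sorted(labels, key=range_lower_bound)

print(lexicographic)  # ['0 - 100', '100 - 200', '1000 - 1100', '200 - 300']
print(numeric)        # ['0 - 100', '100 - 200', '200 - 300', '1000 - 1100']
```

The same key can be used to order the x-axis whenever ranges of 1000 and above appear in a chart.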

##### 1. Why did you pick the specific chart?

A line chart gives a graphical view of our customers' price preferences and how they change from one price range to the next. The chart shows preference falling as the price range increases.



##### 2. What is/are the insight(s) found from the chart?

Looking at this chart we can clearly see the budget preferences of our customers: the most preferred price range is 0 - 100. Note that it contains no 0 values, because prices below $25 were already replaced with 25. This gives us an idea of the spending power of our customers, which will help us set better and more specific prices. At the time of listing we can also recommend options to customers within their preferred price ranges.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, definitely. Customers like personalized interfaces and options, so if we show them the options that fit them, they will appreciate not having to search through everything to find what they are looking for.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Let us now visualize the average price division with room types.

data = {
    'Room Type': room_vs_price['Room Type'],
    'Avg_Price': room_vs_price['Avg_Price'],
    'Listings': ((room_feat_df['Listing Count']/room_feat_df['Listing Count'].sum())*100)
}

# Creating a DataFrame
df = pd.DataFrame(data)

df_melted = pd.melt(df, id_vars='Room Type', var_name='Attribute', value_name='Value')
# Creating a grouped bar plot
sns.barplot(x='Room Type', y='Value', data=df_melted, hue="Attribute", palette=['blue', 'grey'], legend=True)

# Applying annotations on the values
for p in plt.gca().patches:
    plt.gca().annotate('{:.1f}'.format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                       ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                       textcoords='offset points')

plt.show()


##### 1. Why did you pick the specific chart?

As we have only 3 room types, a bar chart is a good way to show the average price and the percentage of listings for each type side by side.




##### 2. What is/are the insight(s) found from the chart?

The chart shows a direct relationship between average price and listing share: the Entire home/apt room type has both the highest average price and the largest share of listings. This can be quite useful when making decisions such as pricing.
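
The two quantities compared in this chart can be computed in a single grouped aggregation. This is a small sketch on invented rows; the column names `room_type` and `price` and the `toy_df` frame are assumptions, and the numbers are not taken from the dataset.

```python
import pandas as pd

# Hypothetical listings, for illustration only.
toy_df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "price": [200, 180, 220, 90, 70, 50],
})

# Average price and number of listings for each room type.
summary = toy_df.groupby("room_type").agg(
    avg_price=("price", "mean"),
    listing_count=("price", "size"),
)

# Share of all listings held by each room type, in percent.
summary["listing_share_pct"] = (
    summary["listing_count"] / summary["listing_count"].sum() * 100
)
summary = summary.sort_values("avg_price", ascending=False)

print(summary.round(1))
```

Putting both columns in one table makes it easy to check whether the room type with the highest average price is also the one with the largest share of listings.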



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Since we can see what percentage of customers is willing to pay which amount for their preferred room type, we can build better pricing strategies, which will support our growth.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Let's visualize the relationship between listing price and the number of reviews received.
# As there are a large number of rows, we take only the top 10 price ranges.
# For the listing prices we use price_range.

sorted_range = feature_df.sort_values(by='price_range', ascending=False)
sorted_range = sorted_range.groupby('price_range')
range_group = []
review_count = []
# Loop variables are named price_range/group to avoid shadowing the built-in range().
for price_range, group in sorted_range:
  range_group.append(price_range)
  review_count.append(group['number_of_reviews'].sum())

range_df = pd.DataFrame({
    'Range': range_group,
    'Review Count': review_count
})
plt.figure(figsize=(15, 5))
sns.scatterplot(x='Range', y='Review Count', data=range_df.head(10), color='blue', legend=False, marker='o')
plt.plot(range_df['Range'].head(10), range_df['Review Count'].head(10), color='grey', linestyle='--', linewidth=2, label='Line Connecting Dots')
plt.title("Price vs Reviews of the top 10 most preferred price ranges")
for x, y in zip(range_df['Range'].head(10), range_df['Review Count'].head(10)):
  plt.text(x, y, f'{y}', ha='center', va='bottom')

In [None]:
# Let us also check the 10 least preferred price ranges
plt.figure(figsize=(15, 5))
sns.scatterplot(x='Range', y='Review Count', data=range_df.tail(10), color='blue', legend=False, marker='o')
plt.plot(range_df['Range'].tail(10), range_df['Review Count'].tail(10), color='grey', linestyle='--', linewidth=2, label='Line Connecting Dots')
plt.title("Price vs Reviews of the 10 least preferred price ranges")
for x, y in zip(range_df['Range'].tail(10), range_df['Review Count'].tail(10)):
  plt.text(x, y, f'{y}', ha='center', va='bottom')

##### 1. Why did you pick the specific chart?

Since the difference in the number of reviews across price ranges is very large, a line plot with markers is quite handy: each point can be seen clearly along with its value and its difference from the others.

##### 2. What is/are the insight(s) found from the chart?

The price range and the number of reviews have an inverse relationship, which clearly indicates that the budget of the majority of our customer base lies in the 0-200 range.

There are also a few preferences in the higher-budget section, where there is competition in prices.
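
One way to put a number on this inverse relationship is a rank correlation between the price level of each range and its total reviews. The sketch below uses invented data; `toy_df`, the `"low-high"` label format, and the use of each range's lower bound as a stand-in for price level are all assumptions.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy_df = pd.DataFrame({
    "price_range": ["0-100", "0-100", "100-200", "100-200", "200-300", "300-400"],
    "number_of_reviews": [120, 80, 60, 40, 30, 5],
})

# Total reviews per price range.
by_range = toy_df.groupby("price_range", as_index=False)["number_of_reviews"].sum()

# Use the lower bound of each range as a numeric proxy for its price level.
by_range["lower_bound"] = by_range["price_range"].str.split("-").str[0].astype(int)

# Spearman correlation computed as the Pearson correlation of the ranks.
rank_corr = by_range["lower_bound"].rank().corr(by_range["number_of_reviews"].rank())

print(by_range)
print("Rank correlation:", rank_corr)
```

A value close to -1 would support the reading that reviews fall as the price range rises; a value near 0 would not.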

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This suggests that we can make our pricing policies more targeted and specific, growing the customer base by offering prices that fit within customers' budgets.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Let us find the price range with the most listings to get a more specific idea of price preference.


# Reused from Chart 5:
#   sorted_range - feature_df grouped by price_range
#   range_group  - list of all the price range labels
range_listing_count = []

# Loop variables are named price_range/group to avoid shadowing the built-in range().
for price_range, group in sorted_range:
  range_listing_count.append(group['listings'].sum())


range_list_df = pd.DataFrame({
    'Range': range_group,
    'Listing Count': range_listing_count
})

range_list_df = range_list_df.sort_values(by='Listing Count', ascending=False).reset_index(drop=True)

# We will take only the top 5 preferred ranges.
plt.pie(range_list_df['Listing Count'].head(), labels=range_list_df['Range'].head(), autopct="%1i%%", explode=(0.1, 0.1, 0.1, 0, 0), shadow=True)
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is an effective chart for seeing distributions and preferences clearly; it makes the division of listings by price range easy to notice.


##### 2. What is/are the insight(s) found from the chart?

As we can see from the chart, the top 3 price ranges with the most listings are (percentages are shares of the five ranges plotted):

* 200-300: 33%
* 100-200: 28%
* 0-100: 22%
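
Because the pie chart is drawn from only the top five ranges, the percentages it displays are shares of those five, not of all listings. The sketch below illustrates the difference with invented counts; the `counts` series and its values are assumptions, not figures from the dataset.

```python
import pandas as pd

# Hypothetical listing counts per price range, for illustration only.
counts = pd.Series(
    {"0-100": 220, "100-200": 280, "200-300": 330,
     "300-400": 90, "400-500": 50, "500-600": 30},
    name="listing_count",
)

top5 = counts.nlargest(5)

# What a pie chart of the top five shows: shares that sum to 100 within the five.
share_of_top5 = top5 / top5.sum() * 100

# Share of every listing in the data, including ranges left out of the pie.
share_of_all = top5 / counts.sum() * 100

comparison = pd.DataFrame({
    "share_of_top5": share_of_top5,
    "share_of_all": share_of_all,
})
print(comparison.round(1))
```

When the omitted ranges are small the two columns stay close, but reporting `share_of_all` avoids overstating how dominant the top ranges are.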

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The listings are closely divided among these ranges, so this can be a great way to keep track of all the listings within each price range and create recommendations according to customers' preferred neighbourhoods.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# As now we are having the price ranges with the most listings let us check the neighbourhoods with the most listings.

areas = []
area_listing = []
area_group = feature_df.groupby('neighbourhood')
for area, data in area_group:
    areas.append(area)
    area_listing.append(data['listings'].sum())

area_data_df = pd.DataFrame({
    "Area": areas,
    "Values": area_listing
})

area_data_df = area_data_df.sort_values(by='Values', ascending=False).reset_index(drop=True)
plt.figure(figsize=(17,5))
# Only 5 colors are given for 10 bars, so the colors repeat.
colors = ['red', 'blue', 'green', 'orange', 'purple']
sns.barplot(x="Area", y='Values', data=area_data_df.head(10), hue="Area", palette=colors, legend=False)
plt.title('Top 10 preferred neighbourhood areas')
plt.xlabel('Area')
plt.ylabel('Listing Counts')
for x, y in zip(area_data_df['Area'].head(10), area_data_df['Values'].head(10)):
    plt.text(x, y, f'{y}', ha='center', va='bottom')

##### 1. Why did you pick the specific chart?

The bar chart clearly reflects the difference between the areas and the gaps between them in terms of listing counts; it shows the top ten most listed neighbourhoods.

##### 2. What is/are the insight(s) found from the chart?

The most listed neighbourhood is the 'Financial District', with a major gap over the others in the list. This suggests that the Financial District is the most preferred neighbourhood among our customers.
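
The ranking and the size of the gap can also be read off numerically. This is a minimal sketch on invented rows; it mirrors the notebook's approach of summing a `listings` column per neighbourhood, but `toy_df` and its values are assumptions.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy_df = pd.DataFrame({
    "neighbourhood": ["Financial District", "Financial District", "Harlem",
                      "Williamsburg", "Harlem", "Financial District"],
    "listings": [120, 90, 40, 35, 20, 60],
})

# Total listings per neighbourhood, keeping the three largest.
top_areas = toy_df.groupby("neighbourhood")["listings"].sum().nlargest(3)

# How far ahead the leader is, as a ratio over the runner-up.
lead_ratio = top_areas.iloc[0] / top_areas.iloc[1]

print(top_areas)
print("Leader / runner-up:", lead_ratio)
```

A ratio well above 1 backs up the statement that the leading neighbourhood is far ahead of the rest.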

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight points to the areas with the greatest scope for business, where we can create more opportunities and run our advertisements in a more targeted manner.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Reviews vs Review per month
sns.scatterplot(x='number_of_reviews', y='reviews_per_month', data=feature_df)

##### 1. Why did you pick the specific chart?

A scatter plot easily shows how the points are distributed: here the review traffic is not widely scattered but rather clustered together.

##### 2. What is/are the insight(s) found from the chart?

There is an outlier, but the relationship between the number of reviews and reviews per month is quite direct. Where the number of reviews is higher, this suggests that customers are satisfied and that the host has good reach in the consumer market.
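
To check how much a single outlier affects the apparent relationship, the correlation can be computed with and without it. The sketch below uses invented values and the common 1.5 x IQR rule to flag outliers; both the data and the choice of rule are assumptions.

```python
import pandas as pd

# Hypothetical rows, for illustration only; the last row is an artificial outlier.
toy_df = pd.DataFrame({
    "number_of_reviews": [5, 12, 30, 45, 60, 80, 100, 10],
    "reviews_per_month": [0.2, 0.5, 1.1, 1.6, 2.0, 2.9, 3.5, 20.0],
})

# Pearson correlation on all rows.
pearson_all = toy_df["number_of_reviews"].corr(toy_df["reviews_per_month"])

# Flag high outliers in reviews_per_month with the 1.5 * IQR rule.
q1, q3 = toy_df["reviews_per_month"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
is_outlier = toy_df["reviews_per_month"] > upper_fence

# Pearson correlation after removing the flagged rows.
trimmed = toy_df[~is_outlier]
pearson_trimmed = trimmed["number_of_reviews"].corr(trimmed["reviews_per_month"])

print("Outliers flagged:", int(is_outlier.sum()))
print("Correlation (all rows):", round(pearson_all, 3))
print("Correlation (trimmed): ", round(pearson_trimmed, 3))
```

If the two correlations differ a lot, the outlier is driving the picture and the "direct relationship" should be stated for the trimmed data only.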

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. We can study the differences in the reviews received and make specific recommendations to our hosts about the changes they can make to increase their reach to customers. This could be one of our premium services that widely helps our hosts.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
feature_df.describe()

# using corr() function to check the correlation between numeric columns.
corr_df = feature_df[['price', 'minimum_nights', 'number_of_reviews',
             'reviews_per_month', 'listings', 'availability_365']].corr()

# Visualizing correlation using seaborn heatmap, using annotations.
sns.heatmap(corr_df, annot=True)

##### 1. Why did you pick the specific chart?

A heatmap is a great option for checking correlations between numeric columns: it encodes the strength of each correlation as a color, from lighter to darker, and gives a one-look visual summary of the data we want to check.

##### 2. What is/are the insight(s) found from the chart?

This chart helped us to identify which pairs of variables have strong positive or negative correlations.

It also helped us understand which variables are most closely related to each other.

It also highlights potential multicollinearity issues (high correlations between independent variables), which matters if we plan to use regression analysis.
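
The strongly correlated pairs can be listed directly from the correlation matrix instead of being read off the colors. The sketch below runs on synthetic data; the `toy_df` columns, the random values, and the 0.8 cut-off are assumptions for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data, for illustration only: reviews_per_month is built to track
# number_of_reviews, while price is generated independently.
rng = np.random.default_rng(0)
n = 200
reviews = rng.integers(0, 300, size=n)
toy_df = pd.DataFrame({
    "number_of_reviews": reviews,
    "reviews_per_month": reviews / 60 + rng.normal(0, 0.2, size=n),
    "price": rng.integers(30, 500, size=n),
})

corr = toy_df.corr()
threshold = 0.8  # assumed cut-off for a "strong" correlation

# Visit each unordered pair of columns once and keep the strong ones.
strong_pairs = {
    (a, b): corr.loc[a, b]
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
}

for pair, value in strong_pairs.items():
    print(pair, round(value, 3))
```

Pairs that appear in this list are the candidates for multicollinearity if both columns were used as predictors in a regression.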

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Let's create a pairplot to know the relationship between our few variables.

pair_df = feature_df[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'listings', 'availability_365', 'price_range']]
sns.pairplot(pair_df)

##### 1. Why did you pick the specific chart?

Here we can easily see the distribution of our numerical variables. The pair plot also gives us an overview of the dataset and the relationships between our variables.

##### 2. What is/are the insight(s) found from the chart?

We can see that the vast majority of the data set lies within the price range of 0-2500.
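
A statement like this can be backed by an exact share and a couple of quantiles. This is a small sketch on invented prices; the `toy_prices` values are assumptions, and only the 2500 threshold comes from the text above.

```python
import pandas as pd

# Hypothetical prices, for illustration only.
toy_prices = pd.Series([40, 75, 120, 150, 200, 260, 310, 480, 900, 6000], name="price")

# Percentage of listings priced at or below the threshold.
threshold = 2500
share_below = (toy_prices <= threshold).mean() * 100

# Median and 95th percentile give a sense of where the bulk of prices sits.
p50, p95 = toy_prices.quantile([0.5, 0.95])

print(f"Share at or below {threshold}: {share_below:.1f}%")
print("Median price:", p50)
print("95th percentile:", p95)
```

Reporting the median alongside the share helps show that most prices sit far below the upper end of the plotted range.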

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.



Coming to the solution of our business objective: we have thoroughly studied the whole data set and drawn some conclusions from it. Here are the points we can take into consideration.


* 0-300 is the most preferred price range, which gives us a clear basis for setting our pricing.

* Entire home/apt being the most preferred room type suggests that customers care a great deal about their privacy and safety. We can create more relevant policies regarding their safety and privacy, ultimately gaining their trust and creating goodwill.

* The most preferred area is the Financial District, which is in the Manhattan neighbourhood group, so we can create recommendations based on pricing and area preferences.

* The same area also receives the most reviews. We can identify the points that proved most satisfying for customers there and, where possible, make similar changes in other neighbourhoods, which should bring us more business in those areas as well.

* I personally feel that this dataset should have one more column for the customer's occupation. It would let us group customers by occupation, determine the preferences of people coming from different sectors, and identify which group prefers which kind of accommodation.

* This dataset also gave us a good idea of the budget of the majority of our consumer market, and beyond that, we were able to see the ratio of preferred pricing.

* We could clearly see where most of our business lies and which consumer group generates most of it. We can make improvements there and also focus on the other sectors so that we can expand our business even more.
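
The recommendation idea in the points above can be sketched as a simple filter on budget and preferences. The helper `recommend_listings`, its parameters, and the `toy_listings` data are hypothetical and only illustrate the idea; ranking by review count is one possible choice, not the notebook's method.

```python
import pandas as pd


def recommend_listings(listings, max_price, neighbourhood_group=None,
                       room_type=None, top_n=5):
    """Hypothetical helper: filter by budget and preferences, rank by review count."""
    mask = listings["price"] <= max_price
    if neighbourhood_group is not None:
        mask &= listings["neighbourhood_group"] == neighbourhood_group
    if room_type is not None:
        mask &= listings["room_type"] == room_type
    # Most-reviewed listings first, as a rough proxy for popularity.
    return listings[mask].sort_values("number_of_reviews", ascending=False).head(top_n)


# Hypothetical listings, for illustration only.
toy_listings = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Manhattan"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Entire home/apt"],
    "price": [250, 90, 180, 400],
    "number_of_reviews": [40, 120, 75, 10],
})

picks = recommend_listings(toy_listings, max_price=300, neighbourhood_group="Manhattan")
print(picks[["name", "price", "number_of_reviews"]])
```

Keeping the preferences optional lets the same function serve a guest who only states a budget as well as one who also names an area or room type.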

# **Conclusion**

The conclusion of this EDA project is simple and crisp: we have enough data to see what our customers like and what they are demanding. It is clear that around 90% of our business is concentrated on one side, whether in area preference or pricing, and we can use that information to check what works for our customers and make those changes in other areas as well.
