# **Project Name**    - Zomato Project



##### **Project Type**    - Unsupervised / Clustering
##### **Contribution**    - Individual
##### **Member       - Puneeth Sai Satvik**

# **Project Summary -**

This project focuses on analyzing two core datasets obtained from Zomato. The first dataset, Restaurant Names, includes structured information about 105 restaurants such as the cost per person, types of cuisines offered, operating timings, and more. The second dataset, Restaurant Reviews, contains over 10,000 user-generated reviews that provide qualitative insights into customer satisfaction and restaurant performance.

The analysis begins with data importing and initial exploration. Important characteristics such as dataset dimensions, column names, data types, and summary statistics are reviewed. Null values are detected and visualized using a heatmap for the restaurant metadata and a bar chart for the review data. Duplicate rows are also checked to ensure data integrity. Following this, data wrangling is performed where missing values are either removed or replaced with appropriate substitutes, and duplicate entries are dropped. Columns such as cost and rating, originally in string format, are converted to numeric types to facilitate further analysis and visualization.

Once the data is cleaned and preprocessed, exploratory data analysis is conducted through the generation of 15 visual charts. These visualizations compare various attributes and uncover key patterns and trends. The insights gathered from this phase form the basis for formulating several business-relevant hypotheses, which are then tested using statistical hypothesis testing techniques to validate assumptions based on the data.

In preparation for clustering, additional feature engineering steps are carried out. Since clustering algorithms require numerical input, categorical columns like cuisines and timings are transformed using one-hot encoding. The review text data is processed using sentiment analysis to classify each review as positive, negative, or neutral. Each sentiment is then assigned a numerical score. The two datasets are subsequently merged, and for each restaurant, the average rating, number of pictures, and sentiment score are computed. This merged dataset is then scaled to ensure that all features contribute equally during clustering.
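The feature engineering steps described above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the tables, the `SentimentScore` column name, and the -1/0/1 sentiment scale are all hypothetical.

```python
import pandas as pd

# Hypothetical per-restaurant table; names and values are illustrative only
restaurants = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Cost': [500.0, 1500.0, 800.0],
    'Cuisines': ['North Indian, Chinese', 'Sushi, Japanese', 'Chinese'],
})

# Hypothetical per-review table with a sentiment label already assigned
reviews = pd.DataFrame({
    'Restaurant': ['A', 'A', 'B', 'C'],
    'Rating': [4.0, 5.0, 3.0, 4.0],
    'Pictures': [0, 2, 1, 3],
    'Sentiment': ['Positive', 'Positive', 'Negative', 'Neutral'],
})

# Map each sentiment label to a numeric score (assumed scale: -1, 0, 1)
score_map = {'Negative': -1, 'Neutral': 0, 'Positive': 1}
reviews['SentimentScore'] = reviews['Sentiment'].map(score_map)

# Average the review-level features for each restaurant
per_restaurant = reviews.groupby('Restaurant')[['Rating', 'Pictures', 'SentimentScore']].mean()

# One-hot encode the comma-separated cuisine lists
cuisine_dummies = restaurants.set_index('Name')['Cuisines'].str.get_dummies(sep=', ')

# Combine cost, aggregated review features, and cuisine indicators
features = restaurants.set_index('Name')[['Cost']].join(per_restaurant).join(cuisine_dummies)

# Standardize every column to zero mean and unit variance
scaled = (features - features.mean()) / features.std(ddof=0)
print(scaled.round(2))
```

Standardizing by hand keeps the sketch dependent on pandas only; a library scaler would serve the same purpose.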

Clustering is performed using three unsupervised machine learning algorithms: KMeans Clustering, Agglomerative (Hierarchical) Clustering, and DBSCAN. To determine the most appropriate number of clusters, silhouette scores are calculated for various values of K, with K = 2 emerging as optimal. Each algorithm is then applied using this parameter, and the clustering outputs are visualized using PCA projections for interpretability. The clustering results reveal two major groups of restaurants that differ based on their cost, average rating, number of pictures, and cuisine offerings.
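The silhouette-based choice of K can be illustrated as follows. This is a sketch on synthetic data, assuming scikit-learn is available; the two-blob matrix `X` is a stand-in for the scaled restaurant features, and the candidate range 2 to 6 is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

# Fit KMeans for each candidate K and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest silhouette score is taken as the best candidate
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the scores for neighbouring values of K can be close, so it helps to inspect the whole `scores` dictionary rather than only the maximum.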

Among the three models, KMeans Clustering provided the best performance. It produced the highest silhouette score and showed the clearest visual separation between clusters. The clusters generated by KMeans were also the most interpretable and aligned well with practical business segmentation. Therefore, KMeans was selected as the final model for this project due to its effectiveness, efficiency, and relevance to actionable business insights.

# **GitHub Link -**

https://github.com/puneethsai001/Zomato-Clustering.git

# **Problem Statement**


This project serves both customers and the company. The task is to analyze the sentiment of the customer reviews in the data and present useful conclusions as visualizations, and also to cluster the Zomato restaurants into different segments. The data is visualized because charts make it easy to analyze at a glance. The analysis also addresses business cases that can help customers find the best restaurant in their locality and help the company grow by working on the areas where it currently lags. The data contains valuable information about cuisine and cost, which can be used for cost vs. benefit analysis, and the review text can be used for sentiment analysis. The reviewer metadata can also be used to identify the critics in the industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
restNames = pd.read_csv('/content/drive/MyDrive/Datasets/Zomato Restaurant names and Metadata.csv')
restReviews = pd.read_csv('/content/drive/MyDrive/Datasets/Zomato Restaurant reviews.csv')

### Dataset First View

In [None]:
# Dataset 1 First Look
restNames.head()

In [None]:
# Dataset 2 First Look
restReviews.head()

### Dataset Rows & Columns count

In [None]:
# Dataset 1 Rows & Columns count
restNames.shape

In [None]:
# Dataset 2 Rows & Columns count
restReviews.shape

### Dataset Information

In [None]:
# Dataset 1 Info
restNames.info()

In [None]:
# Dataset 2 Info
restReviews.info()

#### Duplicate Values

In [None]:
# Dataset 1 Duplicate Value Count
restNames.duplicated().sum()

In [None]:
# Dataset 2 Duplicate Value Count
restReviews.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count for dataset 1
restNames.isnull().sum()

In [None]:
# Visualizing the missing values for dataset 1
plt.figure(figsize=(8, 6))
sns.heatmap(restNames.isnull(), cbar=False, cmap='Paired')
plt.title('Restaurant Names Missing Values')
plt.show()

In [None]:
# Missing Values/Null Values Count for dataset 2
restReviews.isnull().sum()

In [None]:
# Visualizing the missing values for dataset 2
plt.figure(figsize=(8, 6))
sns.barplot(x=restReviews.isnull().sum().index, y=restReviews.isnull().sum().values, color='skyblue')
plt.xticks(rotation=90)
plt.title("Restaurant Reviews Missing Values")
plt.show()

### What did you know about your dataset?

Answer: There are two datasets. The restaurant dataset has 105 records with restaurant names, cost, cuisines, and timings. The reviews dataset has around 10,000 records with the reviewer, review, restaurant, and so on. Both datasets contain null values that must be handled, but only the reviews dataset contains duplicates, which must be discarded.
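The checks behind this answer (row count, duplicate rows, nulls per column) can be gathered into one helper. This is a minimal sketch; `quality_report` and the sample table are hypothetical names and data, not part of the notebook.

```python
import pandas as pd

def quality_report(df):
    """Return row count, duplicate-row count, and per-column null counts."""
    return {
        'rows': len(df),
        'duplicates': int(df.duplicated().sum()),
        'nulls': df.isnull().sum().to_dict(),
    }

# Hypothetical miniature of the reviews table
sample = pd.DataFrame({
    'Restaurant': ['A', 'A', 'B', 'B'],
    'Review': ['good', 'good', None, 'bad'],
})

report = quality_report(sample)
print(report)
```

Calling the same helper on `restNames` and `restReviews` before and after wrangling gives a quick before/after comparison.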

## ***2. Understanding Your Variables***

In [None]:
# Dataset 1 Columns
restNames.columns

In [None]:
# Dataset 1 Describe
restNames.describe(include='all')

In [None]:
# Dataset 2 Columns
restReviews.columns

In [None]:
# Dataset 2 Describe
restReviews.describe(include='all')

### Variables Description

Answer: Describing the datasets gave insight into the variables. It can be observed that all the records in the restaurant dataset are unique, as there are 105 different names and links. Rating and cost are identified as non-numeric data types, because the mean, min, and max values cannot be calculated for them.
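One way to see which raw strings keep a column from being numeric is to coerce it and inspect the entries that became NaN. The sample values below are hypothetical; the real columns may contain different strings.

```python
import pandas as pd

# Hypothetical raw values stored as strings
raw = pd.DataFrame({
    'Cost': ['800', '1,200', '500'],
    'Rating': ['5', '4.5', 'Like'],
})

# Entries that fail numeric parsing become NaN under errors='coerce'
cost_numeric = pd.to_numeric(raw['Cost'], errors='coerce')
rating_numeric = pd.to_numeric(raw['Rating'], errors='coerce')

# List the original strings that could not be parsed as numbers
bad_cost = raw.loc[cost_numeric.isna(), 'Cost'].tolist()
bad_rating = raw.loc[rating_numeric.isna(), 'Rating'].tolist()
print(bad_cost, bad_rating)
```

Strings that only fail because of formatting (such as a thousands separator) can be cleaned before conversion, while non-numeric labels have to be dropped or mapped.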

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
restNames.nunique()

In [None]:
restReviews.nunique()

## 3. ***Data Wrangling***




### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Dataset 1

# Handling the Null Values

# Filling missing collection as uncategorized if Null
restNames['Collections'] = restNames['Collections'].fillna('Uncategorized')

# Filling missing timings as Not Available if Null
restNames['Timings'] = restNames['Timings'].fillna('Not Available')

In [None]:
# Convert the cost to numeric
restNames['Cost'] = restNames['Cost'].astype(str)
restNames['Cost'] = restNames['Cost'].str.replace(r'[^\d.]', '', regex=True)
restNames['Cost'] = pd.to_numeric(restNames['Cost'], errors='coerce')

In [None]:
# Dataset 2

# Handling the Null Values

# Drop the row if review or rating is Null
restReviews = restReviews.dropna(subset=['Review', 'Rating'])

# Replace reviewer with anonymous if Null
restReviews['Reviewer'] = restReviews['Reviewer'].fillna('Anonymous')

# Replace time with Unknown if Null
restReviews['Time'] = restReviews['Time'].fillna('Unknown')

# Replace Metadata with 0 Reviews 0 Followers if Null
restReviews['Metadata'] = restReviews['Metadata'].fillna('0 Reviews, 0 Followers')

In [None]:
# Drop duplicate rows
restReviews = restReviews.drop_duplicates()

In [None]:
# Convert rating to numeric
restReviews['Rating'] = pd.to_numeric(restReviews['Rating'], errors='coerce')
restReviews.dropna(subset=['Rating'], inplace=True)

In [None]:
# Convert pictures to numeric
restReviews['Pictures'] = pd.to_numeric(restReviews['Pictures'], errors='coerce')

In [None]:
# Reset the index
restNames.reset_index(drop=True, inplace=True)
restReviews.reset_index(drop=True, inplace=True)

### What all manipulations have you done and insights you found?

In [None]:
# Shape after data wrangling
print(restNames.shape)
print(restReviews.shape)

In [None]:
# Merge the datasets
merged = pd.merge(restReviews, restNames, left_on='Restaurant', right_on='Name', how='inner')
# Drop the redundant name column; drop() returns a new DataFrame, so reassign it
merged = merged.drop(columns=['Name'])

Answer: Both datasets contained null values. The replaceable ones, such as the reviewer, time, metadata, collections, and timings, were filled with placeholder values, while rows missing the crucial, irreplaceable values (review and rating) were dropped. No rows were deleted from the restaurant names dataset, as every row has a unique restaurant name that is too important to discard.

Next, the duplicates in the reviews dataset (around 36 of them) were deleted, and the rating and cost variables were converted to numeric types for easier operations. For sentiment analysis, the review text is cleaned by removing all special characters and converting it to lower case. Finally, after all these rows are deleted, the index is reset. The final shape of each dataset is displayed above.
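The review cleaning mentioned above does not appear in the wrangling cells, so the following is only a sketch of what such a step could look like. `clean_review` is a hypothetical helper, and treating everything except letters, digits, and whitespace as a special character is an assumption.

```python
import re

def clean_review(text):
    """Lowercase the text and replace special characters with spaces."""
    text = text.lower()
    # Anything other than a-z, 0-9, or whitespace counts as a special character here
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # Collapse the runs of whitespace left behind by the substitution
    return re.sub(r'\s+', ' ', text).strip()

print(clean_review("Great FOOD!!! Loved the biryani :)"))
```

Applied to a DataFrame, this would be something like `restReviews['Review'].apply(clean_review)`, stored in a new column so the original text stays available.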

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1: Scatter Plot of Cost vs Average Rating

# Group by restaurant to get average rating and cost
cost_vs_rating = merged.groupby('Restaurant', as_index=False)[['Rating', 'Cost']].mean()

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(cost_vs_rating['Cost'], cost_vs_rating['Rating'], alpha=0.6)
plt.title('Cost vs Rating')
plt.xlabel('Cost')
plt.ylabel('Average Rating')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Answer: The scatter plot visualizes the relationship between the cost of a restaurant and its average rating. It helps identify whether higher cost goes with better customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Answer: There is no strong correlation between cost and rating.
Some mid-cost restaurants have very high ratings.
A few high-cost restaurants have moderate to poor ratings, which is unexpected.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Zomato can highlight budget-friendly high-rated restaurants as great value picks for customers. Expensive restaurants with low ratings are hurting customer trust and may need quality or service improvements.

#### Chart - 2

In [None]:
# Chart - 2: Bar Chart of Average Rating per Cuisine

cuisine_data = merged[['Cuisines', 'Rating']].copy()

# Split and explode cuisines
cuisine_data['Cuisines'] = cuisine_data['Cuisines'].str.split(', ')
cuisine_data = cuisine_data.explode('Cuisines')

# Group by cuisine and calculate average rating
cuisine_rating = cuisine_data.groupby('Cuisines')['Rating'].mean().sort_values(ascending=False)

# Plot
plt.figure(figsize=(12, 12))
sns.barplot(x=cuisine_rating.values, y=cuisine_rating.index, palette='Spectral')
plt.title('Cuisines by Average Rating')
plt.xlabel('Average Rating')
plt.ylabel('Cuisine')
plt.xlim(0, 5)
plt.grid(axis='x')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: To understand which cuisines consistently perform well in terms of customer satisfaction. A bar chart is the best option for comparing an average across many categories.

##### 2. What is/are the insight(s) found from the chart?

Answer: Mediterranean, Modern Indian, and European cuisines have the highest ratings. Fast food, street food, pizza, and healthy food are rated the lowest on average.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Zomato can promote high-rated cuisines on the platform. Low-rated cuisines may signal poor quality or customer mismatch—improvement is needed or listings should be re-evaluated.

#### Chart - 3

In [None]:
# Chart - 3: Bar Chart of Number of Restaurants per Cuisine

cuisine_data = restNames[['Name', 'Cuisines']].rename(columns={'Name': 'Restaurant'})

# Split and explode cuisines
cuisine_data['Cuisines'] = cuisine_data['Cuisines'].str.split(', ')
cuisine_data = cuisine_data.explode('Cuisines')

# Count number of restaurants per cuisine
cuisine_count = cuisine_data['Cuisines'].value_counts().reset_index()
cuisine_count.columns = ['Cuisine', 'Count']

# Plot
plt.figure(figsize=(12, 12))
sns.barplot(x='Count', y='Cuisine', data=cuisine_count, palette='Spectral')
plt.title('Number of Restaurants per Cuisine')
plt.xlabel('Number of Restaurants')
plt.ylabel('Cuisine')
plt.grid(axis='x')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: To show the popularity of each cuisine type based on the restaurants serving it.

##### 2. What is/are the insight(s) found from the chart?

Answer: North Indian, Chinese and Continental dominate the restaurant scene with a significant market share.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: There is an opportunity to promote underrepresented cuisines to diversify the market. Oversaturation in North Indian and Chinese cuisines may lead to stiff competition and diluted customer attention.

#### Chart - 4

In [None]:
# Chart - 4: Pie Chart of Sentiment
from textblob import TextBlob

# Define sentiment function
def get_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        return 'Positive'
    elif polarity < -0.1:
        return 'Negative'
    else:
        return 'Neutral'

# Apply sentiment analysis
reviews_sentiment = restReviews[['Review']].copy()
reviews_sentiment['Sentiment'] = reviews_sentiment['Review'].apply(get_sentiment)

# Count sentiments
sentiment_counts = reviews_sentiment['Sentiment'].value_counts()

# Plot
plt.figure(figsize=(6, 6))
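# Map each label to a fixed color so the colors do not depend on the count order
color_map = {'Positive': 'lime', 'Neutral': 'orange', 'Negative': 'red'}
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', colors=[color_map[label] for label in sentiment_counts.index])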
plt.title('Review Sentiment Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

Answer: To break down customer review text into positive, neutral, and negative sentiment, providing a quick view of overall satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Answer: 68.3% of reviews are positive. Around 14.7% are negative and neutral feedback sits around 17%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Reinforces high customer satisfaction, which is a good sign for brand trust.

#### Chart - 5

In [None]:
# Chart - 5: Bar Chart of Restaurant Reviews

# Count reviews per restaurant
review_counts = restReviews.groupby('Restaurant')['Review'].count().sort_values(ascending=False)

# Plot
plt.figure(figsize=(12, 20))
sns.barplot(x=review_counts.values, y=review_counts.index, palette='flare')
plt.title('Restaurant Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('Restaurant Name')
plt.grid(axis='x')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: To identify the restaurants that users engage with the most, which indicates influence and visibility.

##### 2. What is/are the insight(s) found from the chart?

Answer: Almost all restaurants, except American Wild Wings and Arena Eleven, have the same number of reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Restaurants with few reviews should promote customer engagement to gain deeper insights into their business from the reviews.

#### Chart - 6

In [None]:
# Chart - 6: Histogram of Cost Distribution

# Plot
plt.figure(figsize=(10, 6))
sns.histplot(restNames['Cost'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Cost per Person')
plt.xlabel('Cost (INR)')
plt.ylabel('Number of Restaurants')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Answer: To analyze the distribution of cost per person across restaurants and see common cost brackets.

##### 2. What is/are the insight(s) found from the chart?

Answer: Most restaurants fall in the 300 to 800 Rs range, showing that the platform is budget-centric.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: For restaurants in high-cost ranges, if not paired with high ratings, this could create pricing dissatisfaction.

#### Chart - 7

In [None]:
# Chart - 7: Bar Chart of Cuisines by Average Cost

# Create cuisine data
cuisine_data = merged[['Cuisines', 'Cost']].copy()

# Split and explode cuisines
cuisine_data['Cuisines'] = cuisine_data['Cuisines'].str.split(', ')
cuisine_data = cuisine_data.explode('Cuisines')

# Group by cuisine and calculate average cost
cuisine_cost = cuisine_data.groupby('Cuisines')['Cost'].mean().sort_values(ascending=False)

# Plot
plt.figure(figsize=(12, 20))
sns.barplot(x=cuisine_cost.values, y=cuisine_cost.index, palette='crest')
plt.title('Cuisines by Average Cost')
plt.xlabel('Average Cost per Person (INR)')
plt.ylabel('Cuisine')
plt.grid(axis='x')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: To identify which cuisines are high-end or luxury and which are budget-friendly, which helps customers set price expectations.

##### 2. What is/are the insight(s) found from the chart?

Answer: Modern Indian, Japanese, and Sushi cuisines are among the most expensive (1500 - 2000 Rs range). These are likely premium dining experiences, often found in upscale locations or niche restaurants.

Mithai, Ice Cream, Street Food, Lebanese, and Pizza are at the bottom of the cost spectrum. These are typically snacks, quick bites, or lower overhead food types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Zomato can offer premium filters for high-cost cuisines like Japanese or Modern Indian. Run budget campaigns for Street Food or Ice Cream, especially in student-heavy areas. Owners can compare their pricing with the cuisine average and adjust to stay competitive.

#### Chart - 8

In [None]:
# Chart - 8: Box Plot by Pictures Taken

# Create pic data
pic_data = merged[['Rating', 'Pictures']].copy()

# Group pictures into categories
pic_data['Picture Group'] = pd.cut(
    pic_data['Pictures'],
    bins=[-1, 0, 2, 5, float('inf')],
    labels=['0', '1–2', '3–5', '6+']
)

# Plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Picture Group', y='Rating', data=pic_data, palette='pastel')
plt.title('Rating Distribution by Picture Group')
plt.xlabel('Number of Pictures in Review')
plt.ylabel('Rating')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Answer: This box plot visualizes whether users who upload different numbers of pictures with their reviews tend to give higher or lower ratings.

##### 2. What is/are the insight(s) found from the chart?

Answer: Group 0 (No Pictures):

*   Has the widest spread, with ratings ranging from 1 to 5.
*   Includes the lowest outliers, possibly negative feedback or spam.
*   Median rating is lower than in all other groups.

Groups 1-2, 3-5, 6+ (With Pictures):

*   Consistently higher and tighter rating distribution.
*   Most reviews fall in the 4-5 range, suggesting satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Reviews with pictures often come with higher ratings and may appear more trustworthy. Zomato can encourage users to upload pictures with their reviews to improve sentiment and user engagement.
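The picture groups in this chart depend on how `pd.cut` treats the bin edges: each interval is open on the left and closed on the right by default. A small check with counts placed on each edge (the sample counts are illustrative, and the labels use plain hyphens here):

```python
import pandas as pd

# Picture counts chosen to sit on each bin edge
counts = pd.Series([0, 1, 2, 3, 5, 6, 20])

# Same bin edges as the chart: (-1, 0], (0, 2], (2, 5], (5, inf)
groups = pd.cut(
    counts,
    bins=[-1, 0, 2, 5, float('inf')],
    labels=['0', '1-2', '3-5', '6+'],
)
print(groups.tolist())
```

Starting the first bin at -1 is what lets a count of exactly 0 fall into the `'0'` group.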

#### Chart - 9

In [None]:
# Chart - 9: Review Volume Over Time (Line Plot)

# Create a copy of review timestamps
time_data = merged[['Time']].copy()

# Convert Time to datetime
time_data['Time'] = pd.to_datetime(time_data['Time'], errors='coerce')

# Extract just the date (not time)
time_data['Date'] = time_data['Time'].dt.date

# Count reviews per date
review_trend = time_data['Date'].value_counts().sort_index()

# Plot
plt.figure(figsize=(16, 6))
sns.lineplot(x=review_trend.index, y=review_trend.values, marker='o')
plt.title('Review Activity Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer: This line plot shows the number of reviews posted per day over a span of 3 years, reflecting user activity and interaction. It helps track temporal patterns in customer engagement.

##### 2. What is/are the insight(s) found from the chart?

Answer: There is a sharp increase in review activity starting in mid-2018, possibly due to a marketing push, a new feature rollout, influencer campaigns, or increased app usage or awareness in a specific region. Peaks around festive periods (Diwali, New Year, etc.) suggest that customers dine out more and leave more reviews at those times. A few high spikes may correspond to major promotions or trending restaurants.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Zomato can time campaigns or feature launches to coincide with natural review spikes, and monitor service performance during high-traffic periods (e.g., servers, delivery, restaurant prep).

#### Chart - 10

In [None]:
# Chart - 10 Strip Plot of Cost vs Rating per Cuisine

# Prepare cuisine vs cost-rating data
cuisine_data = merged[['Cuisines', 'Cost', 'Rating']].copy()
cuisine_data['Cuisines'] = cuisine_data['Cuisines'].str.split(', ')
cuisine_data = cuisine_data.explode('Cuisines')

# Plot
plt.figure(figsize=(14, 8))
sns.stripplot(data=cuisine_data, x='Rating', y='Cuisines', hue='Cost', palette='viridis', size=5, jitter=True)
plt.title('Cost vs Rating per Cuisine')
plt.xlabel('Rating')
plt.ylabel('Cuisine')
plt.legend(title='Cost', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer: This chart helps visualize:

*   The distribution of ratings for each cuisine.
*   How cost levels (shown as color/hue) align with customer satisfaction.
*   Whether expensive cuisines are rated better, or whether cheap eats outperform premium options.

This makes it well suited to identifying pricing mismatches and opportunities for value positioning.

##### 2. What is/are the insight(s) found from the chart?

Answer: Some cuisines like BBQ, Sushi, and Modern Indian have high-cost dots (lighter green) even around ratings 2-3, which is a red flag. Cuisines like North Indian, Biryani, Chinese, and Continental have low-cost (dark purple) options rated highly (4-5). These are value-for-money champions that can be promoted to price-sensitive users.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Promote low-cost, high-rated cuisines on banners, filters, and discovery sections. Use insights to recommend affordable restaurants with high ratings. High-cost, low-rated cuisines (e.g., Sushi, BBQ) may indicate poor service, poor food quality, or mismatched expectations.

#### Chart - 11

In [None]:
# Chart - 11: Line Chart of Average Rating by Hour of the Day

# Convert 'Time' column to datetime format
restReviews['Time'] = pd.to_datetime(restReviews['Time'], errors='coerce')

# Extract hour from time
restReviews['Hour'] = restReviews['Time'].dt.hour

# Group by hour and calculate average rating
hourly_rating = restReviews.groupby('Hour')['Rating'].mean().reset_index()

# Plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=hourly_rating, x='Hour', y='Rating', marker='o', color='teal')
plt.xticks(range(0, 24))
plt.title('Average Rating by Hour of the Day')
plt.xlabel('Hour of Day (24-Hour Format)')
plt.ylabel('Average Rating')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Answer: This chart displays the average user rating of restaurants by the hour of the day when the review was posted. The x-axis represents the hour (0-23) in 24-hour format, while the y-axis shows the average rating given during that hour.

##### 2. What is/are the insight(s) found from the chart?

Answer: Ratings peak in the early morning, between 4 AM and 6 AM, at around 3.85 or higher. Ratings drop during the morning rush between 7 AM and 9 AM, reaching their lowest point at around 8 AM (~3.43).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Maintain high service consistency during peak dining hours.
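An hourly average is only as reliable as the number of reviews behind it, so it can help to report the count next to the mean. This is a sketch on hypothetical data; the `MIN_REVIEWS` threshold is an assumption, not a value from the notebook.

```python
import pandas as pd

# Hypothetical review timestamps and ratings
reviews = pd.DataFrame({
    'Time': pd.to_datetime([
        '2019-05-01 05:10', '2019-05-01 13:00', '2019-05-02 13:30',
        '2019-05-02 20:15', '2019-05-03 20:45', '2019-05-04 20:05',
    ]),
    'Rating': [5.0, 4.0, 3.0, 4.0, 2.0, 3.0],
})

# Mean rating and number of reviews for each posting hour
hourly = (
    reviews.groupby(reviews['Time'].dt.hour)['Rating']
    .agg(['mean', 'count'])
    .rename_axis('Hour')
)

# Keep only hours backed by a minimum number of reviews (assumed threshold)
MIN_REVIEWS = 2
reliable = hourly[hourly['count'] >= MIN_REVIEWS]
print(reliable)
```

In this toy table the 5 AM hour has the highest mean but rests on a single review, which is the kind of case the count column makes visible.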

#### Chart - 12

In [None]:
# Chart - 12: Bar Chart of Top Reviewers by Number of Reviews

# Count reviews per reviewer
top_reviewers = restReviews['Reviewer'].value_counts().head(20)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=top_reviewers.values, y=top_reviewers.index, palette='viridis')
plt.title('Top 20 Reviewers by Number of Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('Reviewer')
plt.grid(axis='x')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: The chart displays the top 20 reviewers on Zomato based on the number of reviews they have written. Each bar represents a reviewer, and the bar length shows how many reviews they've submitted.

##### 2. What is/are the insight(s) found from the chart?

Answer: Parijat Ray is the most active reviewer with 13 reviews, closely followed by Ankita and Kiran. A few users have contributed significantly more reviews than the rest, indicating power users or frequent diners.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Identifying top reviewers helps in recognizing influential customers whose opinions may shape public perception. These reviewers can also be key stakeholders for early feedback on new dishes, services, or locations.

#### Chart - 13

In [None]:
# Chart - 13: Bar Chart of Top Collections by Review Volume

# Extract needed columns from merged table
collection_data = merged[['Collections', 'Restaurant']].copy()

# Explode multiple collections into separate rows
collection_data['Collections'] = collection_data['Collections'].str.split(', ')
collection_data = collection_data.explode('Collections')

# Remove 'Uncategorized'
collection_data = collection_data[collection_data['Collections'] != 'Uncategorized']

# Count reviews per collection
collection_review_volume = collection_data['Collections'].value_counts().head(15)

# Plot
plt.figure(figsize=(12, 8))
sns.barplot(x=collection_review_volume.values, y=collection_review_volume.index, palette='plasma')
plt.title('Top 15 Collections by Review Volume (Excluding Uncategorized)')
plt.xlabel('Total Number of Reviews')
plt.ylabel('Collection')
plt.grid(axis='x')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer: This bar chart was chosen to identify which Zomato collections (like Great Buffets or Live Sports Screenings) receive the highest number of reviews, giving a direct indication of user interest and engagement with thematic restaurant groupings.

##### 2. What is/are the insight(s) found from the chart?

Answer: Great Buffets dominates all other categories, indicating strong customer interest in buffet-style dining experiences. More niche or time-sensitive collections like Sunday Brunches, Happy Hours, and Rooftops are comparatively lower, possibly reflecting limited demand or time-based appeal.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: Restaurants can align themselves with top-performing collections (e.g., start offering buffet options or participate in “Food Hygiene Rated” programs) to attract more visibility and footfall.

Over-reliance on a few collections (e.g., buffets, hygiene-rated lists) may cause market saturation or less innovation in underperforming categories.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Select only numeric columns for correlation
numeric_cols = merged[['Cost', 'Rating', 'Pictures']].copy()


# Compute correlation matrix
correlation_matrix = numeric_cols.corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Heatmap of Key Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: The correlation heatmap provides a concise visual summary of the relationships between key numerical features (Cost, Rating, Pictures). It helps identify whether there are strong linear associations that could influence customer behavior, satisfaction, or business strategies.

##### 2. What is/are the insight(s) found from the chart?

Answer: Cost vs Rating - 0.14: Very weak positive correlation — expensive restaurants may have slightly better ratings, but not significantly.

Pictures vs Rating - 0.08: Almost no correlation — the number of pictures in a review does not strongly impact the rating.

Pictures vs Cost - 0.11: Slight positive correlation — higher-cost restaurants may encourage reviewers to post more pictures.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

plot_data = merged[['Rating', 'Cost']].copy()

# Create the pair plot
sns.pairplot(plot_data, diag_kind='hist', corner=True)
plt.suptitle("Pair Plot of Cost and Rating", y=1.02)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer: This pair plot visually explores the relationship between Cost per person and Rating, two key numerical features that directly influence customer satisfaction and pricing strategy.

##### 2. What is/are the insight(s) found from the chart?

Answer: No Strong Linear Correlation - The scatter distribution does not show a strong trend, confirming the low correlation coefficient of 0.14 seen earlier.

High ratings (4.0 - 5.0) are observed even at both low and high cost levels, suggesting price is not the only determinant of customer satisfaction.

Very few entries exist in the low rating (1-2) zone, indicating that most restaurants maintain decent quality, regardless of cost.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer: The three hypothesis statements based on analyzing the charts are:

* Higher-cost restaurants receive better average ratings than lower-cost restaurants.
* Restaurants with reviews containing more pictures receive higher ratings on average.
* Cuisines with fewer restaurants (less popular) tend to have higher average ratings than common cuisines.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer: Null Hypothesis: There is no correlation between the cost per person and average rating of a restaurant.

Alternate Hypothesis: There is a significant positive correlation between the cost per person and average rating of a restaurant.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Perform Pearson Correlation Test
correlation_coef, p_value = pearsonr(merged['Cost'], merged['Rating'])

print(f"Pearson Correlation Coefficient: {correlation_coef:.3f}")
print(f"P-Value: {p_value:.4f}")

##### Which statistical test have you done to obtain P-Value?

Answer: Pearson Correlation Test.

##### Why did you choose the specific statistical test?

Answer: Both variables are numerical and we are testing for a linear relationship between them.
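The alternate hypothesis above is one-sided (a positive correlation), while `pearsonr` reports a two-sided p-value by default and does not skip missing values. The sketch below shows one way to handle both points. The `demo` table and its values are synthetic stand-ins, not the notebook's `merged` data.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical stand-in for the merged table; all values are synthetic.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Cost": rng.uniform(200, 2000, size=200)})
demo["Rating"] = 3.5 + 0.0003 * demo["Cost"] + rng.normal(0, 0.5, size=200)
demo.loc[::25, "Rating"] = np.nan  # simulate a few missing ratings

# pearsonr does not skip missing values, so keep only complete pairs.
pairs = demo[["Cost", "Rating"]].dropna()
r, p_two_sided = pearsonr(pairs["Cost"], pairs["Rating"])

# For H1: r > 0, convert the two-sided p-value to a one-sided one.
p_one_sided = p_two_sided / 2 if r > 0 else 1 - p_two_sided / 2

print(f"r={r:.3f}, two-sided p={p_two_sided:.4f}, one-sided p={p_one_sided:.4f}")
```

The same conversion could be applied to the p-value printed by the cell above, provided `Cost` and `Rating` contain no missing values.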

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer: Null Hypothesis: There is no significant difference in average ratings across picture-count groups.

Alternate Hypothesis: There is a significant difference in ratings depending on how many pictures are included in the review.

#### 2. Perform an appropriate statistical test.

In [None]:
# Categorize picture count

def categorize(val):
    if val == 0:
        return '0'
    elif 1 <= val <= 2:
        return '1-2'
    elif 3 <= val <= 5:
        return '3-5'
    else:
        return '6+'

merged['Picture_Group'] = merged['Pictures'].apply(categorize)

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import f_oneway

group_0 = merged[merged['Picture_Group'] == '0']['Rating']
group_1_2 = merged[merged['Picture_Group'] == '1-2']['Rating']
group_3_5 = merged[merged['Picture_Group'] == '3-5']['Rating']
group_6_plus = merged[merged['Picture_Group'] == '6+']['Rating']

# Run One-Way ANOVA
f_stat, p_value = f_oneway(group_0, group_1_2, group_3_5, group_6_plus)

print(f"F-Statistic: {f_stat:.3f}")
print(f"P-Value: {p_value:.4f}")

##### Which statistical test have you done to obtain P-Value?

Answer: One-Way ANOVA (Analysis of Variance) Test.

##### Why did you choose the specific statistical test?

Answer: We are comparing average ratings across multiple groups based on picture count. The dependent variable (Rating) is continuous, and the independent variable (Picture Group) is categorical with more than two levels.
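One-way ANOVA also assumes roughly equal variances and approximately normal residuals within each group. A minimal sketch of how those assumptions could be checked is shown below, with Levene's test for equal variances and the Kruskal-Wallis test as a rank-based alternative. The four groups are synthetic and only mimic the unequal group sizes one might expect for the picture-count bins.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal, levene

# Synthetic ratings for four hypothetical picture-count groups (not project data).
rng = np.random.default_rng(1)
group_specs = [(3.5, 300), (3.7, 120), (3.8, 60), (3.9, 30)]  # (mean, size)
groups = [np.clip(rng.normal(mean, 0.8, size=n), 1, 5) for mean, n in group_specs]

# Levene's test: a small p-value suggests the group variances differ.
lev_stat, lev_p = levene(*groups)

# One-way ANOVA compares group means.
f_stat, anova_p = f_oneway(*groups)

# Kruskal-Wallis compares groups by ranks and does not assume normality.
h_stat, kruskal_p = kruskal(*groups)

print(f"Levene p={lev_p:.4f}, ANOVA p={anova_p:.4f}, Kruskal-Wallis p={kruskal_p:.4f}")
```

If the two tests disagree, the rank-based result is usually the safer one to report for bounded, skewed rating data.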

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer: Null Hypothesis: There is no significant difference in average ratings between popular cuisines and niche (less popular) cuisines.

Alternate Hypothesis: Niche cuisines (with fewer restaurants) have higher average ratings than popular cuisines.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import mannwhitneyu

# Get cuisine counts
cuisine_counts = merged.groupby('Cuisines')['Name'].count()

# Finding thresholds
popular_threshold = cuisine_counts.quantile(0.75)
niche_threshold = cuisine_counts.quantile(0.25)

# Tag cuisines
popular_cuisines = cuisine_counts[cuisine_counts >= popular_threshold].index
niche_cuisines = cuisine_counts[cuisine_counts <= niche_threshold].index

# Filter data for the two groups
popular_ratings = merged[merged['Cuisines'].isin(popular_cuisines)]['Rating']
niche_ratings = merged[merged['Cuisines'].isin(niche_cuisines)]['Rating']

# Perform Mann–Whitney U test
u_stat, p_value = mannwhitneyu(niche_ratings, popular_ratings, alternative='greater')

print(f"U-Statistic: {u_stat:.3f}")
print(f"P-Value: {p_value:.4f}")

##### Which statistical test have you done to obtain P-Value?

Answer: Mann–Whitney U Test.

##### Why did you choose the specific statistical test?

Answer: It compares two independent groups and does not assume a normal distribution. We are testing whether ratings in one group tend to be higher than in the other (often summarized as a higher median).
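To make the direction of the one-sided test concrete, here is a small sketch on made-up ratings. With `alternative='greater'`, the first sample passed to `mannwhitneyu` is the one hypothesized to be larger. The two lists are illustrative values, not taken from the dataset.

```python
from scipy.stats import mannwhitneyu

# Made-up ratings for illustration only.
niche = [4.2, 4.5, 4.1, 4.8, 4.4, 4.6]
popular = [3.6, 3.9, 3.4, 4.0, 3.7, 3.8]

# H1: values in `niche` tend to be greater than values in `popular`.
u_stat, p_value = mannwhitneyu(niche, popular, alternative="greater")

# Every niche value exceeds every popular value, so U equals 6 * 6 = 36.
print(f"U={u_stat:.1f}, p={p_value:.4f}")
```

Swapping the argument order while keeping `alternative='greater'` would test the opposite claim, so the order matters.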

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Missing values or null Values are already handled previously

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer: In dataset 1 (Restaurant Names dataset), null values were present only in the Collections column, plus one null value in the Timings column. Missing Collections values are replaced with "Uncategorized" and the missing Timings value is replaced with "Not Available".

In dataset 2 (Restaurant Reviews dataset), rows where the review or the rating is missing were deleted. A missing reviewer is replaced with "anonymous", a missing Time is replaced with "Unknown", and missing Metadata is set to 0 reviews and 0 followers.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

def outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]

# Outliers in restaurant data
cost_outliers = outliers_iqr(restNames, 'Cost')
print(f"Cost Outliers: {len(cost_outliers)}")

# Outliers in review data
rating_outliers = outliers_iqr(restReviews, 'Rating')
print(f"Rating Outliers: {len(rating_outliers)}")

picture_outliers = outliers_iqr(restReviews, 'Pictures')
print(f"Picture Outliers: {len(picture_outliers)}")

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer: The interquartile range (IQR) method is used because it quickly identifies extreme values without assuming an underlying distribution. The code above only counts the flagged rows; it does not remove or cap them.
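The IQR rule can be traced by hand on a tiny example. The sketch below uses a made-up cost series, then shows capping with `clip` as one possible treatment. Capping is an assumption added for illustration; the notebook itself only detects outliers.

```python
import pandas as pd

# Made-up costs for illustration only.
costs = pd.Series([400, 500, 600, 700, 5000], name="Cost")

q1 = costs.quantile(0.25)   # 500
q3 = costs.quantile(0.75)   # 700
iqr = q3 - q1               # 200
lower = q1 - 1.5 * iqr      # 200
upper = q3 + 1.5 * iqr      # 1000

# Rows outside [lower, upper] are flagged as outliers.
flagged = costs[(costs < lower) | (costs > upper)]

# Optional treatment (not done in the notebook): cap values at the bounds.
capped = costs.clip(lower=lower, upper=upper)

print(flagged.tolist(), capped.tolist())
```

Capping keeps every restaurant in the table, which matters when the dataset has only about a hundred rows.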

### 3. Categorical Encoding

In [None]:
# Dataset 1

from sklearn.preprocessing import MultiLabelBinarizer
import re

restNames['Cuisine_List'] = restNames['Cuisines'].apply(lambda x: [i.strip() for i in x.split(',')])

# One-hot encode cuisines
mlb = MultiLabelBinarizer()
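# Reuse restNames' index so the later pd.concat(axis=1) aligns rows correctly
cuisine_encoded = pd.DataFrame(mlb.fit_transform(restNames['Cuisine_List']), columns=mlb.classes_, index=restNames.index)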

# Time encoding function
def encode_timing(t):
    if pd.isnull(t):
        return 'Unknown'
    t = t.lower()
    if 'am' in t and 'pm' in t:
        return 'All day'
    elif re.search(r'(\d{1,2})(:?\d{0,2})?\s*am', t):
        return 'Morning'
    elif re.search(r'(\d{1,2})(:?\d{0,2})?\s*pm', t):
        return 'Evening'
    return 'Unknown'

# Apply the time encoding
restNames['Timing_Category'] = restNames['Timings'].apply(encode_timing)

# Appending to new encoded table
restNames_encoded = pd.concat([restNames.drop(columns=['Cuisine_List']), cuisine_encoded], axis=1)
restNames_encoded = pd.concat([restNames_encoded, pd.get_dummies(restNames_encoded['Timing_Category'], prefix='Timing').astype(int)], axis=1)

restNames_encoded.drop(columns=['Timing_Category'], inplace=True)

In [None]:
# Dataset 2

restReviews_encoded = restReviews.copy()

# Extracting followers and reviews
def extract_metadata(meta):
    review_count = 0
    follower_count = 0
    if isinstance(meta, str):
        parts = meta.split(',')
        for p in parts:
            if 'review' in p.lower():
                review_count = int(''.join(filter(str.isdigit, p)))
            elif 'follower' in p.lower():
                follower_count = int(''.join(filter(str.isdigit, p)))
    return pd.Series([review_count, follower_count])

# Appending to new encoded table
restReviews_encoded[['Review_Count', 'Follower_Count']] = restReviews_encoded['Metadata'].apply(extract_metadata)


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer: One-hot encoding is used for the timing category because it is a single non-ordinal label (Morning, Evening, etc.) that needs to be split into binary indicators. Multi-label one-hot encoding is used for cuisines: each unique cuisine becomes a column with values 0 or 1, indicating whether that cuisine applies to the restaurant.
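The difference between the two encodings is easiest to see on a toy table. The sketch below uses three made-up restaurants; the column names mirror the notebook, but the rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Three made-up restaurants for illustration only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Cuisines": ["Chinese, North Indian", "Chinese", "Biryani, North Indian"],
    "Timing_Category": ["All day", "Evening", "All day"],
})

# Multi-label encoding: one row can switch on several cuisine columns.
cuisine_lists = toy["Cuisines"].apply(lambda s: [c.strip() for c in s.split(",")])
mlb = MultiLabelBinarizer()
cuisine_encoded = pd.DataFrame(
    mlb.fit_transform(cuisine_lists), columns=mlb.classes_, index=toy.index
)

# Single-label one-hot encoding: exactly one timing column is 1 per row.
timing_encoded = pd.get_dummies(toy["Timing_Category"], prefix="Timing").astype(int)

encoded = pd.concat([toy[["Name"]], cuisine_encoded, timing_encoded], axis=1)
print(encoded)
```

Restaurant A ends up with a 1 in both `Chinese` and `North Indian`, while each row has exactly one `Timing_*` column set.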

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import re

restReviews_encoded['processed_review'] = restReviews_encoded['Review']

# Expand contractions
contractions_dict = {
    "can't": "cannot", "won't": "will not", "n't": " not", "'re": " are",
    "'s": " is", "'d": " would", "'ll": " will", "'t": " not",
    "'ve": " have", "'m": " am"
}

def expand_contractions(text):
    text = str(text)
    for contraction, expanded in contractions_dict.items():
        text = re.sub(contraction, expanded, text)
    return text

restReviews_encoded['processed_review'] = restReviews_encoded['processed_review'].apply(expand_contractions)


#### 2. Lower Casing

In [None]:
# Lower Casing
restReviews_encoded['processed_review'] = restReviews_encoded['processed_review'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

# Function to remove all characters except letters and whitespace
def remove_punctuation(text):
    return re.sub(r'[^a-z\s]', '', text)

# Apply to lowercase review
restReviews_encoded['processed_review'] = restReviews_encoded['processed_review'].apply(remove_punctuation)

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

def clean_urls_digits(text):
    text = re.sub(r'http\S+|www\S+', '', text)
    text = ' '.join([word for word in text.split() if not any(char.isdigit() for char in word)])
    return text

restReviews_encoded['processed_review'] = restReviews_encoded['processed_review'].apply(clean_urls_digits)

In [None]:
from textblob import TextBlob

# Function to get sentiment label and score
def get_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return pd.Series(['positive', 1])
    elif polarity < 0:
        return pd.Series(['negative', -1])
    else:
        return pd.Series(['neutral', 0])

# Apply to processed_review column
restReviews_encoded[['Sentiment', 'Sentiment_Score']] = restReviews_encoded['processed_review'].apply(get_sentiment)

### 4. Feature Selection

In [None]:
restNames_encoded.head()

In [None]:
restNames_encoded = restNames_encoded.drop(columns=['Links', 'Collections', 'Cuisines', 'Timings'])

In [None]:
restReviews_encoded.head()

In [None]:
restReviews_encoded = restReviews_encoded.drop(columns=['Metadata', 'Review', 'processed_review', 'Reviewer'])

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, the data needs to be transformed: it must be merged into one table containing only numerical features that are fit for clustering. Since the review table contains more than one row per restaurant, the reviews must first be aggregated to one row per restaurant.

In [None]:
# Transform Your data

# Step 1: Aggregate restReviews_encoded
agg_reviews = restReviews_encoded.groupby('Restaurant').agg({
    'Rating': 'mean',
    'Pictures': 'sum',
    'Review_Count': 'mean',
    'Follower_Count': 'mean',
    'Sentiment_Score': 'mean'
}).reset_index()

# Rename Restaurant to Name for merging
agg_reviews.rename(columns={'Restaurant': 'Name'}, inplace=True)

# Merge with restNames_encoded
cluster_ready_df = pd.merge(restNames_encoded, agg_reviews, on='Name', how='inner')
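The aggregate-then-merge step above can be checked on a toy example. The sketch below uses made-up reviews and restaurants. It also shows that an inner merge silently drops restaurants that have no reviews, which is worth verifying on the real tables.

```python
import pandas as pd

# Made-up review rows: several reviews per restaurant.
reviews = pd.DataFrame({
    "Restaurant": ["A", "A", "B", "B", "B"],
    "Rating": [4.0, 5.0, 3.0, 4.0, 2.0],
    "Pictures": [1, 0, 2, 0, 1],
    "Sentiment_Score": [1, 1, -1, 1, 0],
})

# Made-up restaurant table; "C" has no reviews.
names = pd.DataFrame({"Name": ["A", "B", "C"], "Cost": [800, 500, 1200]})

# Collapse the reviews to one row per restaurant.
agg = (
    reviews.groupby("Restaurant")
    .agg(
        Rating=("Rating", "mean"),
        Pictures=("Pictures", "sum"),
        Sentiment_Score=("Sentiment_Score", "mean"),
    )
    .reset_index()
    .rename(columns={"Restaurant": "Name"})
)

# Inner merge keeps only restaurants present in both tables.
merged_demo = names.merge(agg, on="Name", how="inner")
print(merged_demo)
```

Comparing `len(names)` with `len(merged_demo)` is a quick way to see how many restaurants were dropped by the join.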


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Drop identifier column
X = cluster_ready_df.drop(columns=['Name'])

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


scores = []
ks = range(2, 11)
for k in ks:
    model = KMeans(n_clusters=k, random_state=42)
    labels = model.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))


# Plot to find optimal K value
plt.plot(ks, scores, marker='o')
plt.title("KMeans - Silhouette Score vs K")
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Silhouette Score")
plt.grid(True)
plt.show()

In [None]:
# ML Model - 1 Implementation

from sklearn.cluster import KMeans

# Apply KMeans clustering

kmeans = KMeans(n_clusters=2, random_state=42)
cluster_ready_df['KMeans_Cluster'] = kmeans.fit_predict(X_scaled)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Answer: KMeans is an unsupervised clustering algorithm that partitions the data into K clusters based on distance to cluster centroids. We used KMeans with K=2, selected using the silhouette score plotted above. The model groups restaurants based on cost, cuisines, ratings, timing, etc.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Reduce to 2D using PCA for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plotting KMeans clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_ready_df['KMeans_Cluster'], cmap='rainbow', s=40, alpha=0.7)
plt.title("KMeans Results (PCA Projection)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.grid(True)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross- Validation & Hyperparameter Tuning

for k in range(2, 11):
    model = KMeans(n_clusters=k, random_state=42)
    labels = model.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f"K={k}, Silhouette Score={score:.3f}")

##### Which hyperparameter optimization technique have you used and why?

Answer: We used the Silhouette Score to select the optimal number of clusters K. Silhouette measures the compactness and separation of clusters — a good metric for unsupervised model selection.
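The selection rule can be sketched end to end on synthetic data where the right answer is known. The three blob centers below are arbitrary assumptions chosen to be well separated; `n_init=10` is set explicitly so the result does not depend on the scikit-learn version's default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not project data).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=42,
)

# Fit KMeans for each candidate K and record the silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Pick the K with the highest score.
best_k = max(scores, key=scores.get)
print(scores, "best K:", best_k)
```

On real, high-dimensional one-hot data the scores are usually much lower and flatter than in this toy case, so the chosen K should also be checked for interpretability.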

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer: The silhouette score was plotted against the K values, and the K with the highest score was chosen to cluster the restaurants.

In [None]:
from IPython.display import display, Markdown

# Group and display each cluster as a separate table
for cluster_id in sorted(cluster_ready_df['KMeans_Cluster'].unique()):
    display(Markdown(f"KMeans Cluster {cluster_id}"))
    cluster_df = cluster_ready_df[cluster_ready_df['KMeans_Cluster'] == cluster_id][['Name', 'Cost', 'Rating', 'Review_Count']]
    display(cluster_df.reset_index(drop=True))



### ML Model - 2

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Apply Agglomerative Clustering
agglo = AgglomerativeClustering(n_clusters=2)
cluster_ready_df['Agglo_Cluster'] = agglo.fit_predict(X_scaled)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Agglomerative Clustering is a bottom-up hierarchical clustering technique that merges the closest clusters iteratively based on linkage criteria (ex: ward, average).
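The linkage criterion is itself a tunable choice. A minimal sketch comparing the linkage options that `AgglomerativeClustering` accepts is shown below, on synthetic blobs rather than the restaurant features. The notebook's model uses the default, which is `ward`.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only.
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=0)

# Fit one model per linkage criterion and compare silhouette scores.
linkage_scores = {}
for linkage in ["ward", "average", "complete", "single"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X_demo)
    linkage_scores[linkage] = silhouette_score(X_demo, labels)

for linkage, score in linkage_scores.items():
    print(f"{linkage:>8}: {score:.3f}")
```

Ward linkage tends to produce compact, similarly sized clusters, whereas single linkage can chain points together, so the scores may differ noticeably on real data.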

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.decomposition import PCA

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot PCA projection colored by Agglomerative Clustering labels
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_ready_df['Agglo_Cluster'], cmap='rainbow', s=40, alpha=0.7)
plt.title('Agglomerative Clustering Results (PCA Projection)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.grid(True)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Try different K values and evaluate silhouette score
for k in range(2, 11):
    model = AgglomerativeClustering(n_clusters=k)
    labels = model.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f"K={k}, Silhouette Score={score:.3f}")

##### Which hyperparameter optimization technique have you used and why?

Answer: We used Silhouette Score to determine the optimal number of clusters (n_clusters). Since Agglomerative Clustering doesn’t require a random seed, we only tuned the number of clusters to maximize cohesion and separation.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer: Since clustering is unsupervised, we don't use accuracy or F1-score. The primary evaluation metric is Silhouette Score. Silhouette Score measures how similar a data point is to its own cluster compared to other clusters. A higher score means well-separated, cohesive clusters.

High silhouette score means:

Restaurants are grouped based on meaningful similarities like cost, rating, cuisine, and timing.

This allows for targeted marketing, menu personalization, or collection curation for each cluster.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

from sklearn.cluster import DBSCAN

# Apply DBSCAN
dbscan = DBSCAN(eps=0.09, min_samples=5)
cluster_ready_df['DBSCAN_Cluster'] = dbscan.fit_predict(X_scaled)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

DBSCAN is a density-based clustering algorithm that identifies core samples and expands clusters based on neighborhood density. It is particularly good for identifying outliers and clusters of irregular shape.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot PCA projection colored by DBSCAN Clustering labels
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_ready_df['DBSCAN_Cluster'], cmap='rainbow', s=40, alpha=0.7)
plt.title("DBSCAN Clustering Results (PCA Projection)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.grid(True)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.metrics import silhouette_score

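# Try different eps values and check silhouette score
for eps in [1.0, 1.5, 2.0, 2.5, 3.0]:
    model = DBSCAN(eps=eps, min_samples=5)
    labels = model.fit_predict(X_scaled)
    # Exclude noise points (label -1); silhouette needs at least two clusters
    mask = labels != -1
    n_clusters = len(set(labels[mask]))
    if n_clusters > 1:
        score = silhouette_score(X_scaled[mask], labels[mask])
        print(f"eps={eps}, clusters={n_clusters}, Silhouette Score={score:.3f}")
    else:
        print(f"eps={eps}: fewer than two clusters, silhouette not defined")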

##### Which hyperparameter optimization technique have you used and why?

Answer: Since DBSCAN does not require specifying the number of clusters (K), we tuned the eps value manually based on the silhouette score. By adjusting eps, we control the radius within which points are considered part of the same cluster.
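A common complement to manual tuning is the k-distance heuristic: sort each point's distance to its `min_samples`-th nearest neighbor and look for the bend in that curve. The sketch below illustrates it on synthetic, standardized data. Using the 90th percentile as a starting value is an arbitrary assumption, not a rule.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature data, standardized like X_scaled (not project data).
X_raw, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=0)
X_demo = StandardScaler().fit_transform(X_raw)

min_samples = 5

# Querying the training data returns each point as its own first neighbor,
# which matches DBSCAN's convention that min_samples includes the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Distance to the farthest of those neighbors, sorted ascending for plotting.
k_distances = np.sort(distances[:, -1])

# Rough starting point for eps (assumption): the 90th percentile of the curve.
eps_candidate = float(np.percentile(k_distances, 90))
print(f"eps candidate: {eps_candidate:.3f}")
```

Plotting `k_distances` and reading off the bend by eye is usually more reliable than any fixed percentile.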

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer: I would choose KMeans Clustering Algorithm because of the following reasons:

1. Best Evaluation Performance
2. Visually Distinct Clusters
3. Business Interpretability
4. Low Complexity and High Speed.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer: KMeans groups restaurants based on numeric features into K distinct clusters by minimizing the distance between data points and their assigned cluster centroids. It is an unsupervised learning algorithm, so there is no ground-truth label, but we can still interpret which features influenced the clusters the most using explainability tools.
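One lightweight way to see which features separate the clusters is to compare the centroids directly: on standardized data, a feature whose centroid values differ widely contributes most to the split. The sketch below is a simple centroid comparison, not a dedicated explainability library, and it runs on a synthetic table where only `Cost` differs between the two groups.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic table: two cost tiers, with rating and sentiment drawn identically.
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "Cost": np.concatenate([rng.normal(400, 50, n), rng.normal(1500, 50, n)]),
    "Rating": rng.normal(3.8, 0.3, 2 * n),
    "Sentiment_Score": rng.normal(0.5, 0.2, 2 * n),
})

X_demo = StandardScaler().fit_transform(demo)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_demo)

# Centroids in standardized units, one row per cluster.
centroids = pd.DataFrame(km.cluster_centers_, columns=demo.columns)

# Gap between the largest and smallest centroid value for each feature.
spread = (centroids.max() - centroids.min()).sort_values(ascending=False)
print(spread)
```

Applied to the notebook's data, the same comparison on `kmeans.cluster_centers_` with the columns of `X` would rank the cuisine, timing, and numeric features by how far apart the two centroids sit.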

In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

To conclude, after comparing the three clustering algorithms, KMeans clustering was found to be the best for segmenting the restaurants based on cuisine, timing, rating, reviews, and cost.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***

In [None]:
from google.colab import drive
drive.mount('/content/drive')