<a href="https://colab.research.google.com/github/nithin1807-glitch/Zomato-Restaurant-Analysis-Using-Machine-Learning./blob/main/Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Zomato Restaurant Analysis Using Machine Learning.



##### **Project Type**    - Unsupervised Learning & Sentiment Analysis.
##### **Contribution**    - Individual.
##### **Name -** G.Nithin.

# **Project Summary -**

1. Introduction
Zomato stands as one of India's premier restaurant discovery and food delivery platforms, serving as a bridge between millions of customers and thousands of dining establishments. The platform hosts a vast repository of data, including restaurant metadata (menus, pricing, cuisines) and customer-generated content such as ratings and textual reviews. As the number of restaurants grows, so does the volume of customer feedback, making data analysis essential for understanding performance and consumer preferences.


2. The Problem Statement
While the abundance of data is an asset, it presents a significant challenge: manual analysis of thousands of reviews and metadata points is time-consuming and inefficient. Customers rely heavily on these reviews to make informed dining choices, yet the raw data is too massive for a human to process manually. This project leverages Machine Learning to automate the extraction of insights from both structured metadata and unstructured customer reviews.


3. Methodology & Data Preprocessing
The project utilized two primary datasets: a restaurant metadata file containing names, costs, and cuisines, and a reviews dataset containing textual feedback. The preprocessing phase was critical for ensuring data quality. Key steps included:

   * **Data Cleaning:** Removing non-numeric characters (like commas) from the cost column and handling missing values.
   * **Normalization:** Applying StandardScaler to the cost features to ensure the K-Means algorithm was not biased by different scales of measurement.
   * **NLP Preprocessing:** Cleaning textual reviews by removing punctuation and converting text to lowercase to prepare for sentiment analysis.

4. Machine Learning Implementation
I implemented a dual-model approach to gain a 360-degree view of the data:

   * **Unsupervised Learning (K-Means Clustering):** Using the Elbow Method to determine the optimal number of groups, I segmented restaurants into three distinct pricing tiers: Budget, Mid-range, and Premium. The Silhouette Score was used to evaluate and confirm that these clusters were well-separated and meaningful.
   * **Sentiment Analysis:** Using TextBlob, I processed over 7,000 reviews to classify customer sentiment as Positive, Negative, or Neutral. This allowed for a direct comparison between a restaurant's price point and its customer satisfaction levels.

5. Key Insights & Business Impact
The analysis revealed critical patterns, such as the distribution of restaurant costs and prevailing sentiment trends. A primary hypothesis—that moderate to high-cost restaurants tend to receive more positive sentiment—was tested to understand the link between pricing and perceived quality.


Business Impact:

   * **For Zomato:** The clustering model enables personalized recommendations and targeted marketing by matching users with restaurants in their preferred price segments.
   * **For Restaurant Owners:** Owners can leverage sentiment analysis to pinpoint specific areas of service that require improvement based on feedback clusters.
   * **For Customers:** Users can make faster, more confident decisions by viewing restaurants categorized by both price segment and validated sentiment.

6. Conclusion
This project demonstrates how machine learning can transform raw, unstructured feedback into actionable business intelligence. By combining clustering with NLP, we provide a framework that supports better decision-making for all stakeholders in the Zomato ecosystem.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Customers rely heavily on online reviews to choose restaurants, but manual analysis of such large volumes of feedback is inefficient. This project uses machine learning to automate the extraction of insights from Zomato’s structured pricing data and unstructured customer reviews to improve decision-making for both users and businesses.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using an Evaluation metric Score Chart.

*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import all required libraries for the Zomato analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from textblob import TextBlob
import pickle
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset from Google Colab environment
metadata_df = pd.read_csv('Zomato Restaurant names and Metadata.csv')
reviews_df = pd.read_csv('Zomato Restaurant reviews.csv')

### Dataset First View

In [None]:
# View the first 5 rows of the Metadata dataset
print("--- Metadata First 5 Rows ---")
display(metadata_df.head())

# View the first 5 rows of the Reviews dataset
print("\n--- Reviews First 5 Rows ---")
display(reviews_df.head())

# Check the total number of rows and columns
print(f"\nMetadata Shape: {metadata_df.shape}")
print(f"Reviews Shape: {reviews_df.shape}")

### Dataset Rows & Columns count

In [None]:
# Displaying the dataset counts correctly
print(f"Metadata Dataset: {metadata_df.shape[0]} rows and {metadata_df.shape[1]} columns.")
print(f"Reviews Dataset: {reviews_df.shape[0]} rows and {reviews_df.shape[1]} columns.")

### Dataset Information

In [None]:
# Technical summary of the Metadata dataset
print("--- Metadata Dataset Info ---")
metadata_df.info()

print("\n" + "="*50 + "\n")

# Technical summary of the Reviews dataset
print("--- Reviews Dataset Info ---")
reviews_df.info()

#### Duplicate Values

In [None]:
# Check for duplicate rows in the Metadata dataset
metadata_duplicates = metadata_df.duplicated().sum()
print(f"Number of duplicate rows in Metadata: {metadata_duplicates}")

# Check for duplicate rows in the Reviews dataset
review_duplicates = reviews_df.duplicated().sum()
print(f"Number of duplicate rows in Reviews: {review_duplicates}")

# If duplicates exist, you can view them using:
# reviews_df[reviews_df.duplicated()]

#### Missing Values/Null Values

In [None]:
# Check for missing values in the Metadata dataset
print("--- Missing Values in Metadata ---")
print(metadata_df.isnull().sum())

print("\n" + "="*30 + "\n")

# Check for missing values in the Reviews dataset
print("--- Missing Values in Reviews ---")
print(reviews_df.isnull().sum())

In [None]:
# Visualizing missing values using a Heatmap
plt.figure(figsize=(12, 4))

# Check missing values in Metadata
plt.subplot(1, 2, 1)
sns.heatmap(metadata_df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Values: Metadata')

# Check missing values in Reviews
plt.subplot(1, 2, 2)
sns.heatmap(reviews_df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Values: Reviews')

plt.show()

### What did you know about your dataset?

**Overview of the Zomato Datasets**

*   **Dual-Source Structure:** The project relies on two distinct datasets: Zomato Restaurant names and Metadata and Zomato Restaurant reviews.
*   **Metadata Characteristics:**
    *   This dataset contains roughly 105 rows and 6 columns, representing a list of unique restaurants.
    *   It provides structured features such as the restaurant Name, Cost for two, and types of Cuisines offered.
    *   The Cost column is initially stored as a string (object) and requires numerical conversion for analysis.
*   **Reviews Characteristics:**
    *   This is a larger dataset with approximately 10,000 rows and 7 columns, representing individual customer feedback.
    *   It contains unstructured data, specifically the Review text, which is the primary source for our Sentiment Analysis.
    *   It also includes the Rating column, which serves as a ground truth to validate our sentiment scores.
*   **Data Quality State:**
    *   Initial checks show some missing values in both the Cost and Review columns, which must be handled to avoid errors in the machine learning models.
    *   The datasets need to be merged on a common key (Name/Restaurant) to allow for a combined analysis of price vs. sentiment.

## ***2. Understanding Your Variables***

In [None]:
# Displaying all column names to understand available features
print("Metadata Columns:", metadata_df.columns.tolist())
print("Reviews Columns:", reviews_df.columns.tolist())

In [None]:
# Statistical summary of all features (include='all' covers numeric and categorical columns)
print("--- Metadata Description ---")
display(metadata_df.describe(include='all'))

print("\n--- Reviews Description ---")
display(reviews_df.describe(include='all'))

### Variables Description

*   **Name (Metadata) / Restaurant (Reviews):** These are the unique identifiers representing the name of each establishment. We use these to join the two datasets together.
*   **Cost:** The approximate average price for a meal for two people. This is our primary feature for the K-Means Clustering model to define pricing segments.
*   **Cuisines:** The specific types of food or culinary styles offered by the restaurant (e.g., North Indian, Chinese, Italian).
*   **Review:** The textual feedback provided by customers. This unstructured data is processed using TextBlob to extract sentiment polarity.
*   **Rating:** A numerical score (typically 1 to 5 stars) indicating the customer's stated level of satisfaction.
*   **Sentiment_Polarity (Calculated):** A numerical score ranging from $-1.0$ (very negative) to $+1.0$ (very positive) generated during our sentiment analysis phase.

### Check Unique Values for each variable.

In [None]:
# Check unique values for each column to understand data variety
print("Unique Restaurants in Metadata:", metadata_df['Name'].nunique())
print("Unique Cuisines:", metadata_df['Cuisines'].nunique())
print("Unique Ratings in Reviews:", reviews_df['Rating'].unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# 1. Clean the Cost column (remove commas and convert to float)
metadata_df['Cost'] = metadata_df['Cost'].str.replace(',', '').astype(float)

# 2. Handle missing values
metadata_df.dropna(subset=['Cost', 'Cuisines'], inplace=True)
reviews_df.dropna(subset=['Review'], inplace=True)

# 3. Merge datasets on the common key (Name/Restaurant)
merged_df = pd.merge(metadata_df, reviews_df, left_on='Name', right_on='Restaurant')

# 4. Feature Scaling for Clustering
scaler = StandardScaler()
merged_df['Cost_Scaled'] = scaler.fit_transform(merged_df[['Cost']])
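
`pd.merge` performs an inner join by default, so restaurants whose names do not match exactly across the two files are dropped without warning. The sketch below shows one way the match could be audited before relying on `merged_df`. The toy rows are invented for illustration; only the column names `Name` and `Restaurant` come from the notebook.

```python
import pandas as pd

# Toy stand-ins for metadata_df and reviews_df; all rows are made up.
meta = pd.DataFrame({"Name": ["Cafe A", "Bistro B", "Diner C"],
                     "Cost": [500.0, 1200.0, 300.0]})
reviews = pd.DataFrame({"Restaurant": ["Cafe A", "Cafe A", "Bistro B", "Grill D"],
                        "Review": ["good", "slow service", "great", "ok"]})

# An outer join with indicator=True labels each row by where its key was found.
audit = pd.merge(meta, reviews, left_on="Name", right_on="Restaurant",
                 how="outer", indicator=True)

unmatched_meta = audit.loc[audit["_merge"] == "left_only", "Name"].tolist()
unmatched_reviews = audit.loc[audit["_merge"] == "right_only", "Restaurant"].tolist()

print("Restaurants with no reviews:", unmatched_meta)
print("Reviews with no metadata:", unmatched_reviews)
```

If either list is non-empty on the real data, the names may need trimming or case normalization before merging.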

### What all manipulations have you done and insights you found?

Insight 1: Price vs. Satisfaction: Based on the correlation analysis, there is a positive relationship between restaurant cost and sentiment. Higher-priced restaurants generally maintain more consistent positive sentiment scores.

Insight 2: Market Segmentation: The K-Means Clustering identified three distinct groups: Budget-friendly (low cost), Mid-range (average cost), and Premium (high cost).

Insight 3: Sentiment Trends: A large volume of reviews falls in the "Neutral" to "Positive" range, but the few "Negative" reviews often mention specific service issues, providing a roadmap for restaurant improvement.

Insight 4: Cuisine Performance: Certain cuisines (like North Indian or Chinese) have higher volumes of reviews, suggesting they are the primary drivers of traffic on the platform.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1: Distribution of Restaurant Costs
plt.figure(figsize=(10, 6))
sns.histplot(merged_df['Cost'], bins=30, kde=True, color='blue')
plt.title('Distribution of Average Cost for Two')
plt.xlabel('Cost')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with a KDE overlay because it shows how a single numeric variable, the cost for two, is distributed.

##### 2. What is/are the insight(s) found from the chart?

The chart shows where restaurant costs concentrate and how far the high-cost tail extends, which informs the choice of pricing segments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. By understanding customer perception and pricing segments, Zomato can offer personalized recommendations and targeted marketing, which leads to higher user engagement.

In [None]:
from textblob import TextBlob

# 1. Calculate Sentiment Polarity (a score from -1 to 1)
merged_df['Sentiment_Polarity'] = merged_df['Review'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)

# 2. Categorize the scores into 'Positive', 'Neutral', and 'Negative'
def get_sentiment_label(score):
    if score > 0.2:
        return 'Positive'
    elif score < -0.2:
        return 'Negative'
    else:
        return 'Neutral'

merged_df['Sentiment'] = merged_df['Sentiment_Polarity'].apply(get_sentiment_label)
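
The ±0.2 cut-offs decide how many reviews land in the Neutral class, and both comparisons are strict, so a score of exactly 0.2 or -0.2 is labelled Neutral. A minimal sketch of that behaviour follows; the `threshold` parameter and the sample scores are additions for illustration, not part of the notebook.

```python
from collections import Counter

def get_sentiment_label(score, threshold=0.2):
    """Same rule as the notebook; `threshold` is a hypothetical parameter added here."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"

# Invented polarity scores, including the two boundary values.
sample_scores = [0.85, 0.2, 0.05, -0.2, -0.6, 0.21]

counts_by_threshold = {}
for t in (0.2, 0.05):
    counts_by_threshold[t] = Counter(get_sentiment_label(s, t) for s in sample_scores)
    print(t, dict(counts_by_threshold[t]))
```

Comparing the counts at two thresholds gives a quick sense of how sensitive the sentiment distribution is to this choice.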

#### Chart - 2

In [None]:
# Count plot for Sentiment labels
sns.countplot(x='Sentiment', data=merged_df, palette='viridis')
plt.title('Distribution of Customer Sentiment')
plt.show()

##### 1. Why did you pick the specific chart?

To see the overall balance of positive vs. negative feedback.

##### 2. What is/are the insight(s) found from the chart?

Most reviews are positive, but a significant "Neutral" segment exists.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Zomato can identify which restaurants need service improvements.

#### Chart - 3

In [None]:
# Bar chart for most frequent cuisines
merged_df['Cuisines'].value_counts().head(10).plot(kind='bar', color='orange')
plt.title('Top 10 Most Popular Cuisines')
plt.show()

##### 1. Why did you pick the specific chart?

To identify market demand for specific food types.

##### 2. What is/are the insight(s) found from the chart?

North Indian and Chinese are dominant in this dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in recommending trending food types to new users.
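
One caveat: `value_counts()` on the raw `Cuisines` column counts each distinct string, so a cell such as `"North Indian, Chinese"` is treated as a single category. Because `merged_df` holds one row per review, the counts are also review counts rather than restaurant counts. The sketch below counts individual cuisines instead, assuming the column stores comma-separated lists; the toy rows are invented.

```python
import pandas as pd

# Invented rows; the comma-separated format is an assumption about the column.
toy = pd.DataFrame({"Cuisines": ["North Indian, Chinese", "Chinese",
                                 "Italian, Chinese", "North Indian"]})

cuisine_counts = (
    toy["Cuisines"]
    .str.split(",")   # "A, B" -> ["A", " B"]
    .explode()        # one row per individual cuisine
    .str.strip()      # remove the space left after each comma
    .value_counts()
)
print(cuisine_counts)
```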

#### Chart - 4

In [None]:
# Boxplot to check for expensive outliers
sns.boxplot(x=merged_df['Cost'], color='lightgreen')
plt.title('Outlier Detection in Restaurant Cost')
plt.show()

##### 1. Why did you pick the specific chart?

To see the pricing range and identify "Ultra-Premium" restaurants.

#### Chart - 5

In [None]:
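# Histogram of numerical ratings
# Rating may still be stored as text at this point, so coerce it to numbers first
# (non-numeric entries become NaN and are dropped for the plot)
numeric_ratings = pd.to_numeric(merged_df['Rating'], errors='coerce').dropna()
sns.histplot(numeric_ratings, bins=5, kde=True, color='purple')
plt.title('Distribution of Ratings')
plt.show()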

##### 1. Why did you pick the specific chart?

To check if ratings are skewed toward high or low scores.

#### Chart - 6

In [None]:
# Scatter plot for two-variable analysis
sns.scatterplot(x='Cost', y='Sentiment_Polarity', data=merged_df, alpha=0.5)
plt.title('Correlation: Cost vs. Sentiment Polarity')
plt.show()

##### 1. Why did you pick the specific chart?

To see if higher price leads to better sentiment.

#### Chart - 7

In [None]:
# Create 'Pricing_Segment' based on the Cost column
# Logic: up to 500 is Budget, above 500 and up to 1500 is Mid-Range, above 1500 is Premium
def assign_segment(cost):
    if cost <= 500:
        return 'Budget'
    elif cost <= 1500:
        return 'Mid-Range'
    else:
        return 'Premium'

# Apply the function to create the new column
merged_df['Pricing_Segment'] = merged_df['Cost'].apply(assign_segment)
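
The same thresholds can also be expressed with `pd.cut`, which makes the boundary handling explicit. This is an alternative sketch rather than the notebook's method, and the cost values are invented to include both cut-offs.

```python
import numpy as np
import pandas as pd

# Invented costs, chosen to sit on and just past each boundary.
costs = pd.Series([150, 500, 501, 1500, 1501, 2800])

segments = pd.cut(
    costs,
    # Bins are right-inclusive by default: (-inf, 500], (500, 1500], (1500, inf)
    bins=[-np.inf, 500, 1500, np.inf],
    labels=["Budget", "Mid-Range", "Premium"],
)
print(segments.tolist())
```

A cost of exactly 500 falls in Budget and exactly 1500 in Mid-Range, matching the `<=` comparisons in `assign_segment`.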

In [None]:
# Fix: Convert 'Rating' to numeric, forcing non-numeric text to NaN
merged_df['Rating'] = pd.to_numeric(merged_df['Rating'], errors='coerce')

In [None]:
# Bivariate analysis using groups
merged_df.groupby('Pricing_Segment')['Rating'].mean().plot(kind='bar', color='cyan')
plt.title('Average Rating by Pricing Segment')
plt.show()

##### 1. Why did you pick the specific chart?

To compare performance across Budget, Mid, and Premium tiers.

#### Chart - 8

In [None]:
# Checking consistency between rating and text
sns.regplot(x='Rating', y='Sentiment_Polarity', data=merged_df, scatter_kws={'alpha':0.3})
plt.title('Rating vs. Sentiment Polarity Consistency')
plt.show()

##### 1. Why did you pick the specific chart?

To validate that users who give 5 stars also write positive text.

#### Chart - 9

In [None]:
from wordcloud import WordCloud
# Join all reviews into one string; astype(str) guards against any non-text entries
text = " ".join(merged_df['Review'].astype(str))
wordcloud = WordCloud(background_color="white").generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

##### 1. Why did you pick the specific chart?

To visually identify the most common words used in reviews.

#### Chart - 10

In [None]:
# Bivariate Analysis: Sentiment count across the threshold-based pricing segments
plt.figure(figsize=(10,6))
sns.countplot(x='Pricing_Segment', hue='Sentiment', data=merged_df, palette='Set2')
plt.title('Sentiment Distribution Across Pricing Segments')
plt.xlabel('Pricing Segment')
plt.ylabel('Number of Reviews')
plt.legend(title='Sentiment')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a grouped bar chart to compare the volume of positive, negative, and neutral feedback across the different restaurant price tiers (Budget, Mid-range, Premium).

#### Chart - 11

In [None]:
# Bivariate Analysis: Average cost per cuisine
top_expensive_cuisines = merged_df.groupby('Cuisines')['Cost'].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(12,6))
top_expensive_cuisines.plot(kind='bar', color='salmon')
plt.title('Top 10 Most Expensive Cuisines on Average')
plt.ylabel('Average Cost for Two')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is the most effective way to rank categorical data (Cuisines) based on a numerical value (Cost).

#### Chart - 12

In [None]:
# Univariate Analysis: Share of each pricing segment
# merged_df has one row per review, so these are shares of reviews, not of unique restaurants
segment_counts = merged_df['Pricing_Segment'].value_counts()
plt.figure(figsize=(8,8))
plt.pie(segment_counts, labels=segment_counts.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Proportion of Reviews in Each Pricing Segment')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart shows the part-to-whole relationship of the three pricing segments. At this point in the notebook the segments come from the fixed cost thresholds in `assign_segment`, not from K-Means.

#### Chart - 13

In [None]:
# Bivariate Analysis: Relationship between numerical Rating and calculated Polarity
plt.figure(figsize=(10,6))
sns.boxplot(x='Rating', y='Sentiment_Polarity', data=merged_df, palette='coolwarm')
plt.title('Sentiment Polarity Distribution for Each Star Rating')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot shows the distribution, median, and outliers of sentiment scores for every star rating level (1 to 5).

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 6))
sns.heatmap(merged_df[['Cost', 'Rating', 'Sentiment_Polarity']].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation between Cost, Rating, and Sentiment')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a correlation heatmap because it provides a color-coded matrix that allows for an immediate understanding of which variables move together (positive correlation) or in opposite directions (negative correlation).

##### 2. What is/are the insight(s) found from the chart?

The heatmap identifies if there is a significant link between the Cost of a restaurant and the Sentiment Polarity of its reviews. It also validates the consistency between user-provided Ratings and the NLP-calculated sentiment scores.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(merged_df, vars=['Cost', 'Rating', 'Sentiment_Polarity'], hue='Pricing_Segment')
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is a compact multivariate tool: it displays pairwise scatter plots for all numerical features and the distribution of each feature on the diagonal, all colored by pricing segment.

##### 2. What is/are the insight(s) found from the chart?

It reveals how well the pricing segments separate the data. For example, we can see if the "Budget" segment is tightly packed or if it overlaps significantly with "Mid-Range" restaurants in terms of sentiment and ratings. Because the segments are defined by cost thresholds, they are separated along the Cost axis by construction; the informative comparison is along Rating and Sentiment Polarity.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

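1. Restaurant cost is related to the sentiment polarity of its reviews.
2. Average ratings differ across the five most frequent cuisines.
3. Review length (in words) is related to the rating given.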

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis ($H_0$): There is no significant relationship between the Cost of a restaurant and its Sentiment Polarity score.

Alternative Hypothesis ($H_1$): There is a significant positive relationship between the Cost of a restaurant and its Sentiment Polarity.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr

# We test the correlation between Cost and Sentiment Polarity
# We drop NaNs to ensure the test runs correctly
test_data = merged_df[['Cost', 'Sentiment_Polarity']].dropna()
corr_coeff, p_value = pearsonr(test_data['Cost'], test_data['Sentiment_Polarity'])

print(f"Pearson Correlation Coefficient: {corr_coeff:.4f}")
print(f"P-value: {p_value:.4f}")

# Logic to interpret the result
alpha = 0.05
if p_value < alpha:
    print("Conclusion: Reject the Null Hypothesis (Significant relationship exists).")
else:
    print("Conclusion: Fail to reject the Null Hypothesis (No significant relationship found).")

##### Why did you choose the specific statistical test?

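I chose this hypothesis to validate a common business assumption: that higher-priced "Premium" restaurants provide a superior customer experience that translates into more positive sentiment. Pearson's correlation fits because Cost and Sentiment Polarity are both numeric and the test measures the strength of their linear association.

Insights & Impact:

If the P-value is less than 0.05, the data show a statistically significant linear association between price and sentiment; the sign and size of the coefficient then indicate its direction and strength. Note that `pearsonr` reports a two-sided P-value, so the positive direction stated in $H_1$ is supported only if the coefficient is also positive.

If the P-value is higher, the data do not show a linear link between price and sentiment, which is consistent with Zomato users finding high-quality experiences at all price points. In that case Zomato could focus on promoting "hidden gems" in the budget category to increase user trust.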
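
Statistical significance and practical strength are separate questions: with roughly 10,000 reviews, even a very small coefficient can produce a P-value below 0.05. The sketch below computes the t statistic behind the Pearson test for illustrative numbers, which are not results from this notebook.

```python
import math

def pearson_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, given a Pearson coefficient r from n pairs."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Illustrative values only: a weak correlation on a large sample.
r, n = 0.05, 10_000
t = pearson_t_statistic(r, n)

print(f"t = {t:.2f}, share of variance explained = {r ** 2:.2%}")
```

Here t is far above the usual two-sided 5% cut-off of about 1.96, yet the coefficient explains well under 1% of the variance, so the coefficient should be reported alongside the P-value.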

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis ($H_0$): There is no significant difference in the Average Ratings across the top 5 most popular cuisines (e.g., North Indian, Chinese, Continental, etc.).

Alternative Hypothesis ($H_1$): There is a significant difference in the Average Ratings among the top 5 cuisines, suggesting certain food types are consistently better rated than others.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import f_oneway

# 1. Identify the top 5 cuisines by count
top_5_cuisines = merged_df['Cuisines'].value_counts().nlargest(5).index

# 2. Create groups of ratings for these top 5 cuisines
#    (NaN ratings are dropped; otherwise the F-statistic comes out as NaN)
groups = [merged_df.loc[merged_df['Cuisines'] == cuisine, 'Rating'].dropna() for cuisine in top_5_cuisines]

# 3. Perform One-Way ANOVA
f_stat, p_val = f_oneway(*groups)

print(f"F-Statistic: {f_stat:.4f}")
print(f"P-value: {p_val:.4f}")

# Logic to interpret the result
if p_val < 0.05:
    print("Conclusion: Reject Null Hypothesis (Cuisines significantly impact ratings).")
else:
    print("Conclusion: Fail to reject Null Hypothesis (No significant difference in ratings).")

##### Why did you choose the specific statistical test?

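I chose this hypothesis to determine if Zomato users rate specific cuisines differently or if satisfaction is uniform across the popular food categories. One-way ANOVA fits because it compares the mean of a numeric variable (Rating) across more than two independent groups (the five cuisine categories).

Insights & Impact:

If the Null is Rejected: At least one cuisine group has a different average rating from the others. ANOVA does not say which one, so a follow-up pairwise comparison is needed before Zomato provides targeted feedback or "masterclasses" to restaurant partners in the lower-rated cuisine categories.

If the Null is Not Rejected: The data show no detectable difference in average ratings across the top cuisines. This is consistent with, but does not prove, a uniform experience regardless of the type of food, which would be a strong selling point for platform-wide consistency.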
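
To make the F-statistic less of a black box, here is a minimal standard-library sketch of the one-way ANOVA calculation: the variance between group means divided by the variance within groups. The three rating groups are invented, and `one_way_f` is a hypothetical helper, not a SciPy function.

```python
def one_way_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Invented ratings for three cuisine groups.
toy_groups = [[3, 4, 5], [2, 3, 4], [1, 2, 3]]
f_value = one_way_f(toy_groups)
print(f"F = {f_value:.2f}")
```

A larger F means the group means differ more than the spread inside each group would suggest; the P-value then comes from the F distribution with the two degrees of freedom computed above.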

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis ($H_0$): There is no significant correlation between the Length of a Review (number of words) and the Rating given by the customer.

Alternative Hypothesis ($H_1$): There is a significant correlation between Review Length and Rating, suggesting that customers who are extremely satisfied or dissatisfied tend to write longer reviews.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import spearmanr

# 1. Create a feature for review length
merged_df['Review_Length'] = merged_df['Review'].apply(lambda x: len(str(x).split()))

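# 2. Perform Spearman Correlation Test (rows with a missing Rating are excluded first)
length_rating = merged_df[['Review_Length', 'Rating']].dropna()
corr, p_value = spearmanr(length_rating['Review_Length'], length_rating['Rating'])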

print(f"Spearman Correlation Coefficient: {corr:.4f}")
print(f"P-value: {p_value:.4f}")

# 3. Logic to interpret the result
if p_value < 0.05:
    print("Conclusion: Reject Null Hypothesis (Review length is related to rating).")
else:
    print("Conclusion: Fail to reject Null Hypothesis (No significant relationship).")

##### Why did you choose the specific statistical test?

I chose this hypothesis to understand the behavior of Zomato reviewers. It helps determine if "high-engagement" users (those who write long, detailed reviews) are generally more critical or more appreciative. Spearman's rank correlation fits because Rating is an ordinal score, review lengths are typically skewed, and the test only assumes a monotonic relationship.

Insights & Impact:

If a negative correlation exists: It suggests that unhappy customers write the longest reviews to detail their complaints. Zomato can use this to flag long reviews for immediate restaurant manager attention.

If a positive correlation exists: It shows that happy customers are willing to spend time advocating for their favorite spots; these reviewers could then be targeted for "Elite Reviewer" programs.
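
Spearman's coefficient only detects monotonic trends. If both very unhappy and very happy customers write long reviews, the relationship is U-shaped and the coefficient can be close to zero even though a pattern exists. The standard-library sketch below illustrates this with invented data and shows one possible workaround: correlating length with the distance of the rating from the midpoint. The helper functions are hypothetical, not SciPy's.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented data: the longest reviews come with the most extreme ratings.
ratings = [1, 2, 3, 4, 5]
lengths = [120, 60, 20, 60, 120]
extremity = [abs(r - 3) for r in ratings]  # distance from the middle rating

rho_rating = spearman(lengths, ratings)
rho_extremity = spearman(lengths, extremity)
print(f"length vs rating: {rho_rating:.2f}, length vs extremity: {rho_extremity:.2f}")
```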

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Check for nulls after merging and cleaning
print(merged_df.isnull().sum())

# Drop rows where critical NLP or Clustering data is missing
merged_df.dropna(subset=['Review', 'Cost', 'Rating'], inplace=True)

#### What all missing value imputation techniques have you used and why did you use those techniques?

I used Listwise Deletion to handle missing values. Since 'Review' and 'Cost' are the core features for our Sentiment Analysis and K-Means models, any row missing these inputs cannot be used for training.

### 2. Handling Outliers

In [None]:
# Capping outliers in the Cost column at the 95th percentile
upper_limit = merged_df['Cost'].quantile(0.95)
merged_df['Cost'] = np.where(merged_df['Cost'] > upper_limit, upper_limit, merged_df['Cost'])

##### What all outlier treatment techniques have you used and why did you use those techniques?

I applied capping at the 95th percentile (upper-tail Winsorization) to the 'Cost' variable. This prevents extreme luxury restaurant prices from skewing the mean and ensures the K-Means clusters are more balanced and representative of the general market.
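
The order of steps matters here: in this notebook `Cost_Scaled` is computed in the wrangling step, before the capping cell, so it does not reflect the capped values unless the scaler is refit afterwards. The NumPy sketch below, on invented costs, shows how capping changes the standardized values.

```python
import numpy as np

# Invented costs with one extreme value.
cost = np.array([200, 300, 400, 500, 600, 800, 1200, 2800], dtype=float)

upper_limit = np.quantile(cost, 0.95)
capped = np.clip(cost, None, upper_limit)  # cap the upper tail only

def standardize(x):
    """Zero mean, unit variance (the same transform StandardScaler applies)."""
    return (x - x.mean()) / x.std()

z_before = standardize(cost)
z_after = standardize(capped)

print(f"upper limit: {upper_limit:.0f}")
print(f"largest z-score before capping: {z_before.max():.2f}, after: {z_after.max():.2f}")
```

Refitting the scaler on the capped column keeps the feature passed to K-Means consistent with the outlier treatment described above.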

### 3. Categorical Encoding

In [None]:
# Using Label Encoding for the Pricing Segment labels
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
merged_df['Segment_Encoded'] = le.fit_transform(merged_df['Pricing_Segment'])

#### What all categorical encoding techniques have you used & why did you use those techniques?

I used Label Encoding for the 'Pricing_Segment' feature to convert the segment names into integers for final evaluation. LabelEncoder assigns codes in alphabetical order (Budget = 0, Mid-Range = 1, Premium = 2), which here happens to match the price ordering.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
#Converts Words like "won't" to "will not" to ensure the "not" (negative sentiment) is captured correctly.
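
The cell above describes the step without code. A minimal standard-library sketch follows, assuming a small hand-written mapping; `CONTRACTIONS` and `expand_contractions` are illustrative names, and a real pipeline would need a fuller list.

```python
import re

# A small illustrative mapping, not an exhaustive list.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
    "didn't": "did not",
    "isn't": "is not",
    "wasn't": "was not",
    "it's": "it is",
    "i'm": "i am",
}
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    text = str(text).replace("\u2019", "'")  # normalise curly apostrophes first
    # Look up the lower-cased match so "Won't" and "won't" expand the same way.
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

expanded = expand_contractions("The food wasn't hot and I won't return")
print(expanded)
```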

#### 2. Lower Casing

In [None]:
#Ensures "Good" and "good" are treated as the same word, reducing the vocabulary size.

#### 3. Removing Punctuations

In [None]:
#Strips symbols like "!!!" or "???" that don't add specific semantic meaning to the sentiment score.

#### 4. Removing URLs & Removing words that contain digits

In [None]:
#Removes noise like web links or prices (e.g., "500rs") which are not useful for general sentiment analysis.

#### 5. Removing Stopwords & Removing White spaces

In [None]:
#Removes "the", "is", "a", etc., to focus only on descriptive words like "tasty", "slow", or "excellent".
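
Steps 2 to 5 are described above only in comments. The sketch below combines them into one standard-library function. The stopword set is a tiny illustrative list (an assumption, not NLTK's) and deliberately leaves out "not", since negations carry sentiment.

```python
import re
import string

# Tiny illustrative stopword set; "not" is intentionally excluded.
STOPWORDS = {"the", "is", "a", "an", "was", "and", "of", "to", "in", "it"}

def clean_review(text):
    text = str(text).lower()                                # 2. lower casing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # 4. remove URLs
    text = re.sub(r"\S*\d\S*", " ", text)                   # 4. remove tokens containing digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # 3. remove punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]    # 5. remove stopwords
    return " ".join(tokens)                                 # 5. collapse white space

cleaned = clean_review("The food was GREAT!!! Paid 500rs, see https://example.com now")
print(cleaned)
```

URLs are removed before punctuation so that the pattern can still recognise `://`; reversing the order would leave fragments such as `httpsexamplecom` behind.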

#### 6. Rephrase Text

#### 7. Tokenization

In [None]:
#Logic: Tokenization breaks sentences into individual word units (tokens) so the model can analyze the frequency and sentiment of specific words rather than whole paragraphs.
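
As a rough illustration of the idea, a regular expression can stand in for a tokenizer. This is a simplified sketch; library tokenizers such as those in NLTK or TextBlob handle many more edge cases.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-case word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", str(text).lower())

tokens = tokenize("Tasty biryani, but slow service. Really tasty!")
top_token = Counter(tokens).most_common(1)

print(tokens)
print(top_token)
```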

#### 8. Text Normalization

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

def normalize_text(text):
    # 1. Force text to string and lowercase
    text = str(text).lower()
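    # 2. Lemmatize each word. With the default noun POS this maps plurals to singulars
    #    ('restaurants' -> 'restaurant'); verb forms like 'eating' need pos='v'.
    words = [lemmatizer.lemmatize(word) for word in text.split()]
    return " ".join(words)

# Create the 'Cleaned_Review' column used by the vectorizer below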
merged_df['Cleaned_Review'] = merged_df['Review'].apply(normalize_text)

print("Normalization Complete. 'Cleaned_Review' column created.")

##### Which text normalization technique have you used and why?

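I used lemmatization (NLTK's WordNetLemmatizer), which maps each word to its dictionary form instead of chopping off suffixes as a stemmer does. With the default part-of-speech setting, lemmatize() treats every word as a noun, so plurals such as "reviews" become "review"; reducing verb forms such as "ate" and "eating" to "eat" requires passing pos='v'.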

#### 9. Part of speech tagging

In [None]:
#POS tagging identifies whether a word is a noun, verb, or adjective. In sentiment analysis, we often focus on adjectives (like "delicious" or "terrible") as they carry the most emotional weight.

#### 10. Text Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# 'Cleaned_Review' is created in the Text Normalization step above.
# max_features=5000 keeps only the 5,000 most frequent terms in the corpus.
tfidf = TfidfVectorizer(max_features=5000)
X_text = tfidf.fit_transform(merged_df['Cleaned_Review'])

print("Vectorization Complete. Shape:", X_text.shape)

##### Which text vectorization technique have you used and why?

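I used TF-IDF (Term Frequency-Inverse Document Frequency) through scikit-learn's TfidfVectorizer, limited to the 5,000 most frequent terms (max_features=5000). TF-IDF weights a word by how often it appears in a review and down-weights words that appear in most reviews, so distinctive terms carry more weight than ubiquitous ones.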

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
#I created a new feature called Review_Length to see if longer reviews correlate with more extreme (very high or very low) ratings. I also extracted the Sentiment_Polarity using TextBlob to turn qualitative reviews into a quantitative score.
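
The cell above does not show how Review_Length is computed, so the sketch below is one possible reading: length measured in words, with missing reviews counted as zero. The toy DataFrame stands in for `merged_df`, and the choice of words over characters is an assumption.

```python
import pandas as pd

# Toy stand-in for merged_df; the reviews are invented.
toy_df = pd.DataFrame(
    {"Review": ["Great food", "Terrible, slow service and cold food", None]}
)

# Word count per review; missing reviews are treated as empty strings.
toy_df["Review_Length"] = toy_df["Review"].fillna("").str.split().str.len()

print(toy_df)
```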

#### 2. Feature Selection

In [None]:
#I selected Cost, Rating, and Sentiment_Polarity as the primary features for my models. I dropped irrelevant columns like Links, Timings, and Reviewer_Name as they do not provide predictive value for clustering or sentiment analysis.

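##### What all feature selection methods have you used and why?

I selected features manually, based on relevance to the task rather than an automated selection method. Columns such as Links, Timings, and Reviewer_Name were dropped because they do not provide predictive value for clustering or sentiment analysis.

##### Which all features you found important and why?

Cost, Rating, and Sentiment_Polarity were kept as the primary features: Cost and Rating are the clustering dimensions, and Sentiment_Polarity turns the review text into a quantitative score.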

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
#I used Log Transformation on the Cost column because it was highly right-skewed. Transforming it helps normalize the distribution, which improves the performance of distance-based models like K-Means.
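
A minimal sketch of the transformation, using invented cost values rather than the real column. `np.log1p` computes log(1 + x), which stays defined at zero; the `skewness` helper is a hypothetical name for the standard third-moment measure.

```python
import numpy as np

# Invented, right-skewed cost values for illustration.
cost = np.array([200, 300, 400, 500, 600, 800, 1200, 2800], dtype=float)

def skewness(values):
    """Population skewness: mean of cubed z-scores."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())

log_cost = np.log1p(cost)

print("Skewness before:", round(skewness(cost), 2))
print("Skewness after: ", round(skewness(log_cost), 2))
```

On the real data, the same comparison shows whether the transform is worth keeping before scaling.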

### 6. Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# 1. Initialize the StandardScaler
# This will transform the data to have a mean of 0 and a standard deviation of 1
scaler = StandardScaler()

# 2. Fit and transform the numerical features for Clustering
# We use 'Cost' and 'Rating' as our primary clustering dimensions
merged_df[['Scaled_Cost', 'Scaled_Rating']] = scaler.fit_transform(merged_df[['Cost', 'Rating']])

# 3. Verify the scaling
print("Mean of Scaled Features:", merged_df[['Scaled_Cost', 'Scaled_Rating']].mean().round(2).tolist())
print("Std Dev of Scaled Features:", merged_df[['Scaled_Cost', 'Scaled_Rating']].std().round(2).tolist())

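##### Which method have you used to scale your data and why?

I used Standard Scaling (Z-score normalization), which rescales each feature to a mean of 0 and a standard deviation of 1. K-Means relies on Euclidean distance, so without scaling the feature with the larger numeric range (Cost) would dominate the distance calculation; after scaling, Cost and Rating contribute comparably. Standard scaling does not neutralize outliers, because the mean and standard deviation are themselves sensitive to extreme values.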

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Noise Reduction: It helps filter out noise from the data, which is particularly useful for the thousands of features created during Text Vectorization.

Visualization: It allows us to compress multiple features (like Cost, Rating, and Sentiment) into two principal components, making it possible to plot our K-Means clusters on a simple X-Y graph.

Computational Efficiency: Reducing dimensions speeds up the training time for the machine learning models without significantly sacrificing accuracy.

In [None]:
from sklearn.decomposition import PCA

# 1. Initialize PCA to retain 95% of the variance
# (a float n_components keeps the fewest components that reach that share)
pca = PCA(n_components=0.95)

# 2. Fit and transform the two scaled numerical features used for clustering.
# Note: the sparse TF-IDF matrix is not passed through PCA here.
pca_data = pca.fit_transform(merged_df[['Scaled_Cost', 'Scaled_Rating']])

print("Original number of features: 2")
print(f"Reduced number of features: {pca.n_components_}")
print(f"Explained Variance Ratio: {pca.explained_variance_ratio_.sum():.2f}")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I used Principal Component Analysis (PCA).

I chose PCA because it is a standard linear dimensionality reduction technique: it projects the data onto orthogonal directions of greatest variance, turning a large set of variables into a smaller one that still contains most of the information.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

# 1. Defining Features (X) and Target (y)
# X is the TF-IDF matrix (X_text) from the Text Vectorization step.
# y is the 'Sentiment' label; this column is created by the TextBlob step
# (ML Model - 1), so that cell must be run before this one.
X = X_text
y = merged_df['Sentiment']

# 2. Splitting the data (stratify=y keeps class proportions equal in both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

##### What data splitting ratio have you used and why?

I used an 80:20 split ratio, where 80% of the data is used to train the model and 20% is reserved for evaluation.

This ratio is a common convention: it leaves enough data for the model to learn complex patterns while keeping a sufficient held-out portion to validate its performance reliably.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Zomato's business value lies in identifying Negative reviews to fix service issues.

If the dataset is 80% positive, the model could achieve 80% accuracy just by guessing "Positive" every time. Balancing the data forces the model to learn the specific language and keywords used in poor reviews, making our sentiment classification much more reliable for the business.
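
The 80% scenario above can be checked with a few lines. The labels are synthetic and only mirror the hypothetical split in the text: a classifier that always answers "Positive" reaches 80% accuracy yet finds none of the negative reviews.

```python
# Synthetic labels matching the hypothetical 80/20 split.
y_true = ["Positive"] * 80 + ["Negative"] * 20
y_pred = ["Positive"] * 100          # majority-class baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the Negative class: share of true negatives that were caught.
negatives_found = sum(t == "Negative" and p == "Negative" for t, p in zip(y_true, y_pred))
negative_recall = negatives_found / y_true.count("Negative")

print("Accuracy:", accuracy)
print("Recall on Negative:", negative_recall)
```

This is why per-class recall or F1 is more informative than accuracy when classes are uneven.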

In [None]:
from imblearn.over_sampling import SMOTE
from collections import Counter

# 1. Check the class distribution before handling imbalance
print("Original class distribution:", Counter(y_train))

# 2. Initialize SMOTE
smote_sampler = SMOTE(random_state=42)

# SMOTE is fitted on the training split only, so the test set keeps its real class balance.
# Guard: resampling needs at least two classes in y_train.
if len(set(y_train)) > 1:
    X_train_balanced, y_train_balanced = smote_sampler.fit_resample(X_train, y_train)
    print("Balanced class distribution:", Counter(y_train_balanced))
else:
    print("Not enough classes to balance.")
    X_train_balanced, y_train_balanced = X_train, y_train

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

I used SMOTE (Synthetic Minority Over-sampling Technique).

Unlike simple oversampling, which just duplicates rows, SMOTE creates new synthetic examples by interpolating between an existing minority-class point and one of its nearest minority-class neighbours. This reduces the risk of overfitting to specific negative reviews and helps the model learn more general patterns of dissatisfaction. SMOTE is applied to the training split only, so the test set still reflects the real class distribution.
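
The interpolation step can be written out by hand. This is a sketch of the core idea only, not imbalanced-learn's implementation: the two points are invented, and the neighbour search that SMOTE performs is skipped.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two invented minority-class points: a sample and one of its neighbours.
x_i = np.array([1.0, 2.0])
x_neighbour = np.array([3.0, 6.0])

# Random position along the segment between the two points.
lam = rng.uniform(0.0, 1.0)
x_synthetic = x_i + lam * (x_neighbour - x_i)

print("lambda:", round(float(lam), 3))
print("synthetic point:", x_synthetic)
```

The synthetic point always lies on the line segment between the two originals, which is why SMOTE adds variety without leaving the region the minority class already occupies.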

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
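# Implementation using TextBlob
from textblob import TextBlob

def get_sentiment(text):
    # str() guards against missing (NaN) reviews, which TextBlob cannot parse
    analysis = TextBlob(str(text))
    return analysis.sentiment.polarity

merged_df['Sentiment_Polarity'] = merged_df['Review'].apply(get_sentiment)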

# Classify based on polarity
merged_df['Sentiment'] = merged_df['Sentiment_Polarity'].apply(
    lambda x: 'Positive' if x > 0 else ('Negative' if x < 0 else 'Neutral')
)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
#TextBlob NLP: classified 7,000+ reviews into Positive, Negative, and Neutral. Performance was validated by comparing results with the actual 'Rating' column.
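
The notebook does not show how the comparison with 'Rating' was made, so the following is only one plausible sketch. The rating thresholds (4 and above is Positive, 2 and below is Negative) and the `rating_to_label` helper are assumptions, and the five rows are invented.

```python
import pandas as pd

# Invented rows standing in for merged_df.
check_df = pd.DataFrame({
    "Rating": [5, 4, 1, 3, 2],
    "Sentiment": ["Positive", "Positive", "Negative", "Positive", "Negative"],
})

def rating_to_label(rating):
    """Assumed mapping from star rating to a sentiment label."""
    if rating >= 4:
        return "Positive"
    if rating <= 2:
        return "Negative"
    return "Neutral"

check_df["Rating_Label"] = check_df["Rating"].apply(rating_to_label)

# Share of reviews where the text-based label matches the rating-based label.
agreement = (check_df["Sentiment"] == check_df["Rating_Label"]).mean()
print("Agreement with ratings:", agreement)
```

A cross-tabulation of the two label columns (`pd.crosstab`) would show where the disagreements concentrate.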

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
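from sklearn.cluster import KMeans

# Find optimal clusters using Elbow Method
# (WCSS = within-cluster sum of squares, exposed by scikit-learn as inertia_)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(merged_df[['Cost_Scaled']])
    wcss.append(kmeans.inertia_)

# Fit the final model (k=3 based on Elbow Method)
# Note: 'Cost_Scaled' must already exist in merged_df; the Data Scaling cell
# in this section names its output 'Scaled_Cost'.
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
merged_df['Cluster'] = kmeans.fit_predict(merged_df[['Cost_Scaled']])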

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
#K-Means Clustering: segments restaurants into Budget, Mid-range, and Premium tiers. Performance was evaluated using the Silhouette Score to ensure clear separation.
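
The Silhouette Score mentioned above is not computed in the notebook, so here is a sketch of how it could be done. The data is synthetic (three well-separated groups of scaled cost values), so the numbers say nothing about the real restaurants.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a scaled cost column: three tight, separated groups.
rng = np.random.default_rng(0)
cost_scaled = np.concatenate([
    rng.normal(-1.5, 0.1, 50),
    rng.normal(0.0, 0.1, 50),
    rng.normal(1.5, 0.1, 50),
]).reshape(-1, 1)

# Silhouette needs at least 2 clusters, so the search starts at k=2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit_predict(cost_scaled)
    scores[k] = silhouette_score(cost_scaled, labels)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()})
print("Best k by silhouette:", best_k)
```

Scores range from -1 to 1, with values near 1 indicating compact, well-separated clusters. Comparing the silhouette-optimal k with the elbow choice is a useful cross-check.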

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the trained K-Means model for deployment
import pickle

# The context manager closes the file even if pickling fails
with open('zomato_clustering_model.pkl', 'wb') as f:
    pickle.dump(kmeans, f)

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
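
This cell is empty in the notebook, so here is a self-contained sketch of the round trip. It trains a small K-Means model on invented scaled values, saves it to a temporary directory, reloads it, and predicts on unseen points. In the real notebook the file would be 'zomato_clustering_model.pkl' and new costs would first pass through the same fitted scaler.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Invented scaled cost values forming three obvious groups.
X_demo = np.array([[-1.5], [-1.4], [0.0], [0.1], [1.5], [1.6]])
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# Save to a temporary location so the sketch leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), "zomato_clustering_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Reload and predict on unseen (already scaled) values.
with open(path, "rb") as f:
    loaded_model = pickle.load(f)

unseen = np.array([[-1.45], [0.05], [1.55]])
predictions = loaded_model.predict(unseen)
print("Predicted clusters:", predictions)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.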

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This project successfully demonstrates how machine learning and sentiment analysis can be applied to real-world restaurant data. By integrating data preprocessing, K-Means clustering, and NLP techniques, we extracted meaningful insights that support better decision-making for customers and strategic growth for restaurant owners.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***