# **Project Name**    - **Zomato Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

Project Summary – Zomato Restaurant Reviews Analysis
The “Zomato Analysis” project focuses on understanding customer preferences, restaurant performance, and business insights using data provided by Zomato. The aim was to extract valuable insights from customer reviews and restaurant metadata using Exploratory Data Analysis (EDA), Natural Language Processing (NLP), and Machine Learning (ML) techniques. This end-to-end project not only highlights patterns in customer behavior but also builds a predictive model to classify review sentiments, which can aid Zomato in improving services and customer engagement.

The project started by combining two datasets: one containing restaurant metadata (names, cuisines, cost, timings, etc.) and the other featuring customer reviews and ratings. After importing the data, a thorough data inspection was performed. This included checking for missing values, duplicate entries, data types, and null entries. Missing values in crucial columns were either imputed or dropped based on relevance, and data cleaning included converting cost and rating columns into appropriate numeric formats for analysis.

To enable meaningful visual exploration, the cleaned data was subjected to various univariate and bivariate visualizations. These included the distribution of ratings, the relationship between cost and rating, average ratings across different cuisines, and a correlation heatmap. These visuals helped identify trends such as higher-rated cuisines, the pricing sweet spot for restaurants, and which metadata features (like number of reviews or followers) correlated with better ratings.

For hypothesis testing, three assumptions were tested:

Higher-cost restaurants tend to receive better ratings.

Restaurants with more reviews are rated more favorably.

Reviewers with more followers give higher ratings.

Statistical testing (Pearson correlation) showed that all three relationships were statistically significant but weak (r ≈ 0.14, 0.04, and 0.04, respectively). This indicates that customer sentiment may depend more on qualitative factors such as service, taste, ambiance, or expectations.

The next phase focused on Natural Language Processing. Text reviews were preprocessed using standard NLP techniques including lowercasing, punctuation removal, URL removal, stopword elimination, lemmatization, and tokenization. The cleaned reviews were vectorized using the TF-IDF (Term Frequency-Inverse Document Frequency) technique to transform text data into numerical format suitable for ML models.

Three ML models were trained to classify sentiment:

Logistic Regression

Support Vector Machine (SVM)

Random Forest Classifier

Among these, the Random Forest Classifier performed the best in terms of F1-score, which was chosen as the main evaluation metric due to the balanced need for precision and recall in classifying sentiments accurately. Hyperparameter tuning using GridSearchCV further improved the model performance.

Feature importance from the final model revealed that the number of reviews, cost, and certain keywords from reviews played a significant role in predicting sentiment. These insights are valuable for Zomato in multiple ways—ranging from recommending top restaurants, improving customer targeting, to identifying areas where restaurants may need improvement.

In conclusion, the project demonstrates how a combination of data analysis, NLP, and machine learning can deliver actionable business intelligence. It enables customers to find better restaurants and allows Zomato to make data-driven decisions to improve their platform and restaurant partnerships.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Zomato, being one of the largest food delivery and restaurant discovery platforms, gathers massive volumes of user-generated data through reviews, ratings, and restaurant metadata. However, this valuable data often remains underutilized.

The goal of this project is to extract actionable insights from Zomato’s restaurant metadata and customer reviews through exploratory data analysis, sentiment analysis, and machine learning. This will help:

Customers discover top-rated restaurants in their locality.

Zomato/Restaurant owners understand the factors affecting customer satisfaction and identify areas for business improvement.

Cluster restaurants based on cuisine, cost, and ratings for strategic segmentation and targeting.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Standard library
import re
import string
import warnings

# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# NLP
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Machine learning
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

# NLTK resources used by the text preprocessing steps
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
metadata_df = pd.read_csv("/Zomato_Restaurant_names_and_Metadata.csv")
reviews_df = pd.read_csv("/Zomato_Restaurant_reviews.csv")

### Dataset First View

In [None]:
print(metadata_df.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Metadata Shape:", metadata_df.shape)

### Dataset Information

In [None]:
# Dataset Info
print(metadata_df.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Metadata Duplicates:", metadata_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(metadata_df.isnull().sum())

In [None]:
# Visualizing the missing values

### What did you know about your dataset?

Metadata has missing values in 'Collections' and 'Timings'. Reviews have a few missing entries in Reviewer, Review, Rating, Metadata and Time.
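
The missing-value visualization cell above is empty. One possible way to summarize missingness per column is sketched here on a small stand-in frame; `demo_df` and `missing_summary` are hypothetical names, not part of the notebook, and the same function could be applied to `metadata_df` or `reviews_df`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Stand-in data (assumption): mimics a metadata table with gaps
demo_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Collections": [np.nan, "Late Night", np.nan, np.nan],
    "Timings": ["11am to 11pm", np.nan, "12noon to 12midnight", "24 Hours"],
})

summary = missing_summary(demo_df)
print(summary)
```

For a visual version, `sns.heatmap(metadata_df.isnull(), cbar=False)` draws one cell per value, with missing entries standing out against the rest.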

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(metadata_df.columns)
print(reviews_df.columns)

In [None]:
# Dataset Describe
print(metadata_df.describe(include='all'))

print(reviews_df.describe(include='all'))


### Variables Description

Displayed column names, described datasets, and counted unique values.

This helped in identifying key categorical and numerical features, e.g., cuisines, cost, ratings.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in metadata_df.columns:
    print(f"{col}: {metadata_df[col].nunique()} unique values")
    print()

for col in reviews_df.columns:
    print(f"{col}: {reviews_df[col].nunique()} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Clean 'Cost' column in metadata_df
metadata_df['Cost'] = metadata_df['Cost'].replace(",", "", regex=True).astype(float)

# Clean 'Rating' column in reviews_df
reviews_df['Rating'] = pd.to_numeric(reviews_df['Rating'], errors='coerce')

# Drop rows with missing reviews or ratings
reviews_df.dropna(subset=['Review', 'Rating'], inplace=True)

# Fill missing Reviewers with 'Anonymous'
reviews_df['Reviewer'] = reviews_df['Reviewer'].fillna('Anonymous')


# Split Metadata to extract numeric values
reviews_df[['Num_Reviews', 'Num_Followers']] = reviews_df['Metadata'].str.extract(r'(\d+) Review.*?(\d+) Follower')
reviews_df['Num_Reviews'] = pd.to_numeric(reviews_df['Num_Reviews'], errors='coerce')
reviews_df['Num_Followers'] = pd.to_numeric(reviews_df['Num_Followers'], errors='coerce')

# Merge datasets on restaurant name
combined_df = reviews_df.merge(metadata_df, left_on='Restaurant', right_on='Name', how='left')


### What all manipulations have you done and insights you found?

Cleaned Cost and Rating for numeric analysis.

Extracted number of reviews/followers from metadata.

Merged datasets for combined analysis.

Insight: Some high-cost restaurants don’t always align with high ratings.
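
The wrangling cell extracts both counts with a single pattern, so a `Metadata` value that mentions reviews but no followers yields missing values for both columns. A minimal sketch of an alternative, assuming the strings look like the stand-in examples below, captures each count with its own pattern.

```python
import pandas as pd

# Stand-in reviewer metadata strings (assumption about the format)
metadata = pd.Series([
    "1 Review , 2 Followers",
    "3 Reviews",               # no follower part
    "12 Reviews , 1 Follower",
    None,
])

# Single pattern, as in the wrangling cell: both groups fail together
combined = metadata.str.extract(r'(\d+) Review.*?(\d+) Follower')

# Separate patterns: each count is captured on its own
num_reviews = pd.to_numeric(
    metadata.str.extract(r'(\d+)\s+Review', expand=False), errors='coerce'
)
num_followers = pd.to_numeric(
    metadata.str.extract(r'(\d+)\s+Follower', expand=False), errors='coerce'
)

print(pd.DataFrame({"Num_Reviews": num_reviews, "Num_Followers": num_followers}))
```

Whether a missing follower count should stay missing or become 0 (for example with `fillna(0)`) is a modeling choice, and it would change the correlation results reported later.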

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Chart 1 - Distribution of Ratings
sns.histplot(combined_df['Rating'], bins=10, kde=True)
plt.title("Distribution of Ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

To understand how customer ratings are spread (are they mostly positive, neutral, or negative?).

##### 2. What is/are the insight(s) found from the chart?

Reveals if the majority of ratings are high (positive sentiment) or skewed towards low ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifies overall satisfaction levels, highlighting if quality improvement is needed.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Chart 2 - Cost vs Rating
sns.scatterplot(data=combined_df, x='Cost', y='Rating')
plt.title("Cost vs Rating")
plt.show()

##### 1. Why did you pick the specific chart?

To analyze if higher-cost restaurants tend to have higher ratings.

##### 2. What is/are the insight(s) found from the chart?

Shows whether expensive restaurants truly deliver better experiences or if affordable ones perform equally well.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps restaurants plan pricing strategies without harming customer perception.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
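# Chart 3 - Average Rating per Cuisine
# Note: 'Cuisines' holds a comma-separated list per restaurant, so each group is a cuisine combination
cuisine_ratings = combined_df.groupby('Cuisines')['Rating'].mean().sort_values(ascending=False).head(10)
cuisine_ratings.plot(kind='barh')
plt.title("Top 10 Cuisine Combinations by Average Rating")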
plt.show()


##### 1. Why did you pick the specific chart?

To identify which cuisines receive the highest customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Shows the top 10 cuisine combinations (the full `Cuisines` value of each restaurant) by average rating, indicating what customers like most.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Guides restaurants on trending cuisines, helping new ventures choose the best cuisine to serve.
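
Because `Cuisines` stores several cuisines in one string, grouping on it ranks combinations rather than individual cuisines. A minimal sketch of a per-cuisine average, using a stand-in frame (`demo_df` and `per_cuisine` are hypothetical names), splits the string and gives each cuisine its own row first.

```python
import pandas as pd

# Stand-in for combined_df (assumption): one row per review
demo_df = pd.DataFrame({
    "Cuisines": ["North Indian, Chinese", "Chinese", "Italian", "North Indian, Chinese"],
    "Rating": [4.0, 2.0, 5.0, 3.0],
})

per_cuisine = (
    demo_df
    .assign(Cuisine=demo_df["Cuisines"].str.split(","))  # string -> list of cuisines
    .explode("Cuisine")                                   # one row per (review, cuisine)
    .assign(Cuisine=lambda d: d["Cuisine"].str.strip())   # drop stray spaces
    .groupby("Cuisine")["Rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(per_cuisine)
```

Keeping the `count` column next to the mean makes it easier to spot cuisines whose average rests on very few reviews.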

#### Chart - 4

In [None]:
from collections import Counter

# Reload the raw datasets (this overwrites the cleaned metadata_df and reviews_df from the wrangling step)
metadata_df = pd.read_csv("/Zomato_Restaurant_names_and_Metadata.csv")
reviews_df = pd.read_csv("/Zomato_Restaurant_reviews.csv")

# Clean Cost column: strip every non-digit character, then convert to float
metadata_df['Cost'] = metadata_df['Cost'].replace('[^0-9]', '', regex=True).astype(float)

# Merge datasets for richer analysis
merged_df = reviews_df.merge(metadata_df, how='left', left_on='Restaurant', right_on='Name')


In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10,6))
top_reviewed = reviews_df['Restaurant'].value_counts().nlargest(10)
sns.barplot(x=top_reviewed.values, y=top_reviewed.index, palette='Blues_r')
plt.title('Top 10 Most Reviewed Restaurants')
plt.xlabel('Number of Reviews')
plt.ylabel('Restaurant')
plt.show()


##### 1. Why did you pick the specific chart?

To identify which restaurants generate the highest customer engagement.

##### 2. What is/are the insight(s) found from the chart?

Shows the most popular restaurants based on review count.

##### 3. Will the gained insights help creating a positive business impact?

Helps prioritize partnerships, promotions, and marketing for high-engagement restaurants.

#### Chart - 5

In [None]:
# Force 'Rating' to be numeric, convert errors to NaN
reviews_df['Rating'] = pd.to_numeric(reviews_df['Rating'], errors='coerce')

# Drop rows with invalid/missing ratings (NaNs)
reviews_df = reviews_df.dropna(subset=['Rating'])


In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
avg_ratings = reviews_df.groupby('Restaurant')['Rating'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=avg_ratings.values, y=avg_ratings.index, palette='Greens_r')
plt.title('Top 10 Restaurants by Average Rating')
plt.xlabel('Average Rating')
plt.ylabel('Restaurant')
plt.show()



##### 1. Why did you pick the specific chart?

To highlight restaurants with the best customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Reveals top-rated restaurants, even if they have fewer reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Can be promoted as premium/high-quality options to attract more customers.
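
A plain average can put a restaurant with a single 5-star review at the top of the ranking. One hedged way to guard against that is to require a minimum number of reviews before ranking; the threshold `MIN_REVIEWS` and the stand-in data below are assumptions for illustration.

```python
import pandas as pd

# Stand-in review data (assumption)
demo_df = pd.DataFrame({
    "Restaurant": ["A", "A", "A", "B", "C", "C"],
    "Rating": [4.0, 5.0, 4.5, 5.0, 3.0, 4.0],
})

MIN_REVIEWS = 2  # hypothetical threshold; tune to the dataset

# Average rating and review count per restaurant
stats_df = demo_df.groupby("Restaurant")["Rating"].agg(avg_rating="mean", n_reviews="count")

# Keep only restaurants with enough reviews, then rank by average rating
top_rated = (
    stats_df[stats_df["n_reviews"] >= MIN_REVIEWS]
    .sort_values("avg_rating", ascending=False)
)
print(top_rated)
```

Here restaurant B has the highest average but only one review, so it is left out of the ranking.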

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(8,5))
sns.histplot(metadata_df['Cost'], bins=20, kde=True, color='orange')
plt.title('Distribution of Restaurant Cost for Two')
plt.xlabel('Cost for Two')
plt.ylabel('Number of Restaurants')
plt.show()


##### 1. Why did you pick the specific chart?

To understand the pricing landscape of restaurants.

##### 2. What is/are the insight(s) found from the chart?

Shows the common cost ranges (e.g., affordable vs. premium).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in pricing strategies and targeting the right customer segment.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
cuisine_series = metadata_df['Cuisines'].dropna().str.split(', ')
all_cuisines = Counter([c for sublist in cuisine_series for c in sublist])
common_cuisines = pd.Series(dict(all_cuisines)).sort_values(ascending=False).head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=common_cuisines.values, y=common_cuisines.index, palette='coolwarm')
plt.title('Top 10 Most Common Cuisines')
plt.xlabel('Number of Restaurants')
plt.ylabel('Cuisine')
plt.show()


##### 1. Why did you pick the specific chart?

To see which cuisines dominate the market.

##### 2. What is/are the insight(s) found from the chart?

Reveals customer preferences for certain cuisines like North Indian, Chinese, etc.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Guides new restaurant owners or aggregators on what cuisine is in demand.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(8,5))
sns.histplot(reviews_df['Pictures'], bins=10, color='purple')
plt.title('Number of Pictures Shared per Review')
plt.xlabel('Pictures per Review')
plt.ylabel('Number of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

To see how many pictures reviewers typically attach to a review.

##### 2. What is/are the insight(s) found from the chart?

Shows how picture counts are distributed across reviews. Whether particular restaurants drive more photo-sharing needs a per-restaurant view, such as Chart 11.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Encourages restaurants to improve plating & ambience to drive social media exposure.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8,5))
sns.countplot(x='Rating', data=reviews_df, palette='magma')
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

To see the overall sentiment of customers.

##### 2. What is/are the insight(s) found from the chart?

Shows whether ratings skew positive, neutral, or negative.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifies if customer satisfaction is generally high or if improvements are needed.

#### Chart - 10

In [None]:
# Clean Rating column
merged_df['Rating'] = pd.to_numeric(merged_df['Rating'], errors='coerce')

# Clean Cost column
merged_df['Cost'] = merged_df['Cost'].replace(',', '', regex=True)
merged_df['Cost'] = pd.to_numeric(merged_df['Cost'], errors='coerce')

# Drop rows where Cost or Rating is missing or invalid
merged_df = merged_df.dropna(subset=['Cost', 'Rating'])



In [None]:
# Chart - 10 visualization code
# Average cost and rating per restaurant
avg_rating_cost = merged_df.groupby('Restaurant')[['Cost', 'Rating']].mean()

plt.figure(figsize=(8, 5))
sns.scatterplot(data=avg_rating_cost, x='Cost', y='Rating', hue='Rating', palette='viridis')
plt.title('Cost vs Average Rating')
plt.xlabel('Average Cost for Two')
plt.ylabel('Average Rating')
plt.show()

# Optional: remove extreme cost outliers with the IQR rule.
# This runs after plt.show(), so the chart above still includes them.
Q1 = avg_rating_cost['Cost'].quantile(0.25)
Q3 = avg_rating_cost['Cost'].quantile(0.75)
IQR = Q3 - Q1
avg_rating_cost = avg_rating_cost[(avg_rating_cost['Cost'] >= Q1 - 1.5 * IQR) & (avg_rating_cost['Cost'] <= Q3 + 1.5 * IQR)]



##### 1. Why did you pick the specific chart?

To check if higher cost means better ratings.

##### 2. What is/are the insight(s) found from the chart?

Reveals correlation (if any) between pricing and customer satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in pricing strategy—whether premium pricing aligns with higher perceived value.



#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10,6))
top_pictures = reviews_df.groupby('Restaurant')['Pictures'].sum().sort_values(ascending=False).head(10)
sns.barplot(x=top_pictures.values, y=top_pictures.index, palette='Purples_r')
plt.title('Top 10 Restaurants by Total Review Pictures')
plt.xlabel('Total Pictures')
plt.ylabel('Restaurant')
plt.show()


##### 1. Why did you pick the specific chart?

To see which restaurants create the most visual buzz.

##### 2. What is/are the insight(s) found from the chart?

High picture counts suggest strong visual appeal/experience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Can be used in visual marketing campaigns and social media promotions.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
collection_series = metadata_df['Collections'].dropna().str.split(', ')
all_collections = Counter([c for sublist in collection_series for c in sublist])
top_collections = pd.Series(dict(all_collections)).sort_values(ascending=False).head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=top_collections.values, y=top_collections.index, palette='Set2')
plt.title('Top 10 Food Collections')
plt.xlabel('Number of Restaurants')
plt.ylabel('Collection')
plt.show()


##### 1. Why did you pick the specific chart?

To see popular thematic collections (e.g., best bars, late-night spots).

##### 2. What is/are the insight(s) found from the chart?

Shows which collections attract the most restaurants.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps design targeted campaigns (e.g., “Late-night delivery week”).



#### Chart - 13

In [None]:
# Chart - 13 visualization code
reviews_df['Time'] = pd.to_datetime(reviews_df['Time'], errors='coerce')
daily_reviews = reviews_df['Time'].dt.date.value_counts().sort_index()

plt.figure(figsize=(12,5))
plt.plot(daily_reviews.index, daily_reviews.values, color='teal')
plt.title('Daily Review Activity')
plt.xlabel('Date')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To analyze how review trends change over time.

##### 2. What is/are the insight(s) found from the chart?

Shows spikes in review activity, possibly linked to events or promotions.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps plan marketing campaigns around peak engagement periods.
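
Daily counts can be noisy, which makes spikes hard to tell apart from ordinary variation. A minimal sketch, assuming timestamps like the stand-in values below, aggregates reviews per calendar week instead.

```python
import pandas as pd

# Stand-in timestamps (assumption): one entry per review
times = pd.to_datetime(pd.Series([
    "2019-05-01 10:00", "2019-05-01 21:30", "2019-05-03 13:15",
    "2019-05-10 19:45", "not a date",
]), errors="coerce")

# Count reviews per calendar week; unparseable timestamps (NaT) are dropped first
weekly_reviews = times.dropna().dt.to_period("W").value_counts().sort_index()
print(weekly_reviews)
```

The same idea applies to `reviews_df['Time']` after it has been converted with `pd.to_datetime`.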

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
corr = combined_df[['Rating', 'Cost', 'Num_Reviews', 'Num_Followers']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

To quickly identify relationships between key numeric variables like Rating, Cost, Number of Reviews, and Followers.

##### 2. What is/are the insight(s) found from the chart?

Highlights which variables are positively or negatively correlated (e.g., more reviews might correlate with higher ratings, or cost might have little correlation).
Helps focus on impactful factors (e.g., if followers drive ratings, encourage follower engagement strategies).

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(combined_df[['Rating', 'Cost', 'Num_Reviews', 'Num_Followers']].dropna())
plt.suptitle("Pair Plot", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

To visualize distributions and pairwise relationships among multiple variables at once.

##### 2. What is/are the insight(s) found from the chart?

Shows patterns, clusters, and possible linear/non-linear trends between variables (e.g., cost vs followers).
Helps detect multi-variable trends that can guide restaurant pricing, marketing, or review engagement strategies.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

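1. Higher-cost restaurants tend to receive better ratings.

2. Restaurants with more reviews are rated more favorably.

3. Reviewers with more followers give higher ratings.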

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

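Null hypothesis (H₀): Cost and Rating are independent (no correlation).

Alternative hypothesis (H₁): Cost and Rating are positively correlated.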

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Hypothesis 1: Higher cost restaurants have higher ratings.
# H0: Cost and Rating are independent (no correlation)
# H1: Cost and Rating are positively correlated
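# Drop rows where either value is missing so both series stay aligned and equal in length
subset_df = combined_df[['Cost', 'Rating']].dropna()
corr_test_1 = stats.pearsonr(subset_df['Cost'], subset_df['Rating'])
print("Cost vs Rating Correlation Test:", corr_test_1)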

##### Which statistical test have you done to obtain P-Value?

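Pearson correlation test (`scipy.stats.pearsonr`), which returns the correlation coefficient r and, by default, a two-sided p-value.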

##### Why did you choose the specific statistical test?

Pearson's r measures the strength and direction of a linear relationship between two numeric variables, which matches the question of whether Rating rises with Cost.

Cost vs Rating Correlation Test: PearsonRResult(statistic=0.1441, pvalue=2.419e-47)

Hypothesis Recap:
Null Hypothesis (H₀): Cost and Rating are independent (no correlation).

Alternative Hypothesis (H₁): Cost and Rating are positively correlated.

Interpretation:
Pearson correlation coefficient (r): 0.1441

This indicates a weak positive correlation between Cost and Rating.

P-value: 2.419e-47

This is far below any common significance level (e.g., 0.05, 0.01). With a large sample, even a weak correlation can produce a very small p-value.

Conclusion:
Since the p-value is far below 0.05, we reject the null hypothesis (H₀).

Final Decision:
We reject H₀ in favor of the alternative hypothesis (H₁): There is a statistically significant (though weak) positive correlation between cost and rating of restaurants.

Business Insight:
More expensive restaurants tend to receive slightly higher ratings, which could reflect better service, quality, or experience. Since r = 0.1441 corresponds to r² ≈ 0.02, cost accounts for only about 2% of the variance in ratings, so it is best treated as one input among many for pricing strategy and market segmentation.
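
The alternative hypothesis is directional (a positive correlation), while `stats.pearsonr` reports a two-sided p-value by default. A minimal sketch on synthetic stand-in data shows how the one-sided test and the effect size r² could be obtained; the data-generating numbers are assumptions, and the `alternative` argument requires SciPy 1.9 or later.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption): a weak positive link between cost and rating
rng = np.random.default_rng(0)
cost = rng.uniform(200, 2000, size=5000)
rating = 3.5 + 0.0003 * cost + rng.normal(0, 1.0, size=5000)

two_sided = stats.pearsonr(cost, rating)                         # H1: r != 0
one_sided = stats.pearsonr(cost, rating, alternative="greater")  # H1: r > 0

r_squared = two_sided.statistic ** 2  # share of variance explained by a linear fit
print(f"r = {two_sided.statistic:.3f}, r^2 = {r_squared:.3f}")
print(f"two-sided p = {two_sided.pvalue:.3g}, one-sided p = {one_sided.pvalue:.3g}")
```

Reporting r² alongside the p-value keeps statistical significance and practical significance separate, which matters when the sample is large.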

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

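Null hypothesis (H₀): The number of reviews and Rating are independent (no correlation).

Alternative hypothesis (H₁): More reviews correlate with higher ratings.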

#### 2. Perform an appropriate statistical test.

In [None]:
# Hypothesis 2: Restaurants with more reviews have higher average ratings.
# H0: Number of reviews and rating are independent
# H1: More reviews correlate with higher rating
# Drop rows where either Num_Reviews or Rating is missing
subset_df = combined_df[['Num_Reviews', 'Rating']].dropna()

# Now run Pearson correlation
corr_test_2 = stats.pearsonr(subset_df['Num_Reviews'], subset_df['Rating'])
print("Num Reviews vs Rating Correlation Test:", corr_test_2)


##### Which statistical test have you done to obtain P-Value?

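Pearson correlation test (`scipy.stats.pearsonr`), run on the rows where both `Num_Reviews` and `Rating` are present.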

##### Why did you choose the specific statistical test?

Interpretation:
Pearson correlation coefficient (r): 0.0367
→ Very weak positive correlation (almost negligible)

P-value: 0.00078
→ Statistically significant, since it's well below the 0.05 threshold

Conclusion:
Although the correlation is statistically significant (because of the very low p-value), the correlation strength is extremely weak.

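We still reject the null hypothesis (H₀), but with a strong caveat.

Final Decision:
Reject H₀ in favor of H₁, but acknowledge:
There is statistical evidence of a relationship between the number of reviews and ratings, but the practical significance is negligible. Note that `Num_Reviews` is parsed from each review's `Metadata` field, which appears to describe the reviewer; if so, this result concerns how active a reviewer is rather than how many reviews a restaurant has.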

Business Insight:
The number of reviews does not meaningfully influence the average rating.
Popular restaurants (with many reviews) do not necessarily have better ratings.
Focusing on quality rather than sheer number of reviews is likely more important.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

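Null hypothesis (H₀): Follower count has no effect on Rating.

Alternative hypothesis (H₁): More followers correlate with higher ratings.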

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Hypothesis 3: Reviewers with more followers give higher ratings.
# H0: Followers count has no effect on rating
# H1: More followers correlate with higher ratings

# Drop rows where either Num_Followers or Rating is missing
subset_df = combined_df[['Num_Followers', 'Rating']].dropna()

# Perform Pearson correlation test
corr_test_3 = stats.pearsonr(subset_df['Num_Followers'], subset_df['Rating'])
print("Num Followers vs Rating Correlation Test:", corr_test_3)


##### Which statistical test have you done to obtain P-Value?

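Pearson correlation test (`scipy.stats.pearsonr`), run on the rows where both `Num_Followers` and `Rating` are present.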

##### Why did you choose the specific statistical test?

Interpretation:
Pearson Correlation Coefficient (r): 0.0375
→ Indicates a very weak positive correlation, close to zero.

P-value: 0.0006
→ This is statistically significant, since it’s less than 0.05.

Conclusion:
The correlation is statistically significant but extremely weak in magnitude.

Therefore, we reject the null hypothesis (H₀), but we must interpret this carefully.

Final Decision:
Reject H₀ in favor of H₁ — there is a statistically significant correlation between number of followers and rating.
However, the impact is too weak to be practically useful in decision-making.

Business Insight:
While influential reviewers (more followers) slightly tend to give higher ratings, the difference is negligible.

This suggests that reviewer credibility/followership does not meaningfully affect how they rate restaurants.

It may not be beneficial to overly prioritize highly-followed reviewers in sentiment modeling or ranking systems.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Safe and future-proof missing value imputation
combined_df['Collections'] = combined_df['Collections'].fillna('None')
combined_df['Timings'] = combined_df['Timings'].fillna('Unknown')


#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Outlier Handling - Cost
Q1 = combined_df['Cost'].quantile(0.25)
Q3 = combined_df['Cost'].quantile(0.75)
IQR = Q3 - Q1
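# .copy() makes filtered_df an independent frame, so adding columns later does not trigger SettingWithCopyWarning
filtered_df = combined_df[(combined_df['Cost'] >= Q1 - 1.5 * IQR) & (combined_df['Cost'] <= Q3 + 1.5 * IQR)].copy()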

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Categorical Encoding - Cuisine
label_encoder = LabelEncoder()
filtered_df.loc[:, 'Cuisine_Label'] = label_encoder.fit_transform(filtered_df['Cuisines'])



#### What all categorical encoding techniques have you used & why did you use those techniques?

In this preprocessing step, we focused on cleaning, handling outliers, and preparing categorical data for machine learning:

**Missing Value Imputation**

The Collections column had missing values which were safely filled with 'None'.

The Timings column had missing values filled with 'Unknown'.
This ensures no null values disrupt the analysis or model training.

**Outlier Treatment (Cost Column)**

We used the Interquartile Range (IQR) method to detect and remove outliers from the Cost column.

Rows with cost values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR were excluded.
This limits the influence of extreme prices and keeps the focus on typical spending patterns.

**Categorical Encoding**

The Cuisines column (a categorical text feature) was converted into numeric form using Label Encoding.

This step maps each distinct Cuisines value to an integer label without creating high-dimensional data.
The integer order is arbitrary, so the labels suit tree-based models better than models that treat the feature as a magnitude.

Together, these steps leave the dataset clean and ready for machine learning model development.
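
The IQR rule appears in several cells of this notebook. As a worked illustration with made-up cost values, the fences can be computed by a small helper; `iqr_bounds` is a hypothetical name, not part of the notebook.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up costs (assumption): five typical values and one extreme one
cost = pd.Series([200, 400, 500, 600, 800, 5000])

lower, upper = iqr_bounds(cost)
kept = cost[cost.between(lower, upper)]  # between() is inclusive on both ends
print(lower, upper, kept.tolist())
```

With these values Q1 is 425 and Q3 is 750, so the IQR is 325 and the fences fall at -62.5 and 1237.5; only the 5000 entry is dropped.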

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing
combined_df['Clean_Review'] = combined_df['Review'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
combined_df['Clean_Review'] = combined_df['Clean_Review'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits
combined_df['Clean_Review'] = combined_df['Clean_Review'].apply(lambda x: re.sub(r'http\S+|www\S+|\w*\d\w*', '', x))


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Load the English stopword list
stop_words = set(stopwords.words('english'))



In [None]:
# Remove stopwords; split() followed by join() also collapses extra white space
combined_df['Clean_Review'] = combined_df['Clean_Review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
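
The cleaning steps above are spread over several cells and repeated later in the notebook. A minimal sketch of how they could be gathered into one function follows. The name `clean_review` and the tiny `STOP_WORDS` set are assumptions for illustration (the notebook uses NLTK's English list), and URLs are removed before punctuation here so that the pattern sees them intact.

```python
import re
import string

# Tiny illustrative stopword list (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"the", "was", "and", "a", "is", "at"}

URL_RE = re.compile(r"http\S+|www\S+")
WORD_WITH_DIGIT_RE = re.compile(r"\w*\d\w*")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_review(text: str) -> str:
    """Lowercase, then strip URLs, digit-bearing words, punctuation and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)              # remove URLs while they are still intact
    text = WORD_WITH_DIGIT_RE.sub(" ", text)  # remove words that contain digits
    text = text.translate(PUNCT_TABLE)        # remove punctuation
    # split() discards all runs of white space; join() rebuilds with single spaces
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = "The food at www.example.com was GR8 and tasty!!  Visit http://example.com/menu"
cleaned = clean_review(sample)
print(cleaned)
```

Applied to the notebook's data, this would be used as `combined_df['Review'].apply(clean_review)` once missing reviews have been dropped.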

#### 6. Rephrase Text

In [None]:
# Rephrase Text


#### 7. Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

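Lemmatization, using NLTK's `WordNetLemmatizer`. It reduces each word to a dictionary base form (for example, "dishes" becomes "dish"), so inflected variants count as the same term when the text is vectorized. The lemmatization code itself runs in the cell under "Part of speech tagging".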

#### 9. Part of speech tagging

In [None]:
nltk.download('punkt_tab')

# Re-run the cleaning steps from the cells above so this cell works on its own
combined_df['Clean_Review'] = combined_df['Review'].str.lower()
combined_df['Clean_Review'] = combined_df['Clean_Review'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
combined_df['Clean_Review'] = combined_df['Clean_Review'].apply(lambda x: re.sub(r'http\S+|www\S+|\w*\d\w*', '', x))
stop_words = set(stopwords.words('english'))
combined_df['Clean_Review'] = combined_df['Clean_Review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

# Tokenize, then lemmatize (without a POS argument, WordNetLemmatizer treats every word as a noun)
combined_df['Tokens'] = combined_df['Clean_Review'].apply(nltk.word_tokenize)
lemmatizer = WordNetLemmatizer()
combined_df['Lemmatized'] = combined_df['Tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

# POS tags are computed after lemmatization, so they do not guide the lemmatizer
combined_df['POS_Tags'] = combined_df['Lemmatized'].apply(nltk.pos_tag)

# Join tokens back into text and vectorize with TF-IDF
combined_df['Processed_Text'] = combined_df['Lemmatized'].apply(lambda x: ' '.join(x))
vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = vectorizer.fit_transform(combined_df['Processed_Text'])


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
vectorizer = TfidfVectorizer(max_features=1000)
combined_df['Processed_Text'] = combined_df['Lemmatized'].apply(lambda x: ' '.join(x))
tfidf_matrix = vectorizer.fit_transform(combined_df['Processed_Text'])

##### Which text vectorization technique have you used and why?

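TF-IDF, via scikit-learn's `TfidfVectorizer(max_features=1000)`. It weights each term by how often it appears in a review and down-weights terms that appear in most reviews, so distinctive words contribute more than common ones. `max_features=1000` keeps only the 1,000 most frequent terms across the corpus to limit dimensionality.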
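
To make the weighting behind the `TfidfVectorizer` call above more concrete, here is a minimal pure-Python sketch. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that `TfidfVectorizer` applies by default. Tokenisation here is plain whitespace splitting, which is simpler than the library's tokenizer, and the three reviews and the `tfidf` helper are made up for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


docs = ["good food good service", "bad service", "good ambience"]  # made-up reviews
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second review, "bad" receives a higher weight than "service" because "service" also appears in another review.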

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
from sklearn.preprocessing import LabelEncoder

# Safe missing value imputation — avoid inplace=True
combined_df['Collections'] = combined_df['Collections'].fillna('None')
combined_df['Timings'] = combined_df['Timings'].fillna('Unknown')

# Outlier removal using IQR
Q1 = combined_df['Cost'].quantile(0.25)
Q3 = combined_df['Cost'].quantile(0.75)
IQR = Q3 - Q1

filtered_df = combined_df[(combined_df['Cost'] >= Q1 - 1.5 * IQR) &
                          (combined_df['Cost'] <= Q3 + 1.5 * IQR)].copy()  # <- use .copy() to avoid chained assignment issues

# Label Encoding (safe way)
label_encoder = LabelEncoder()
filtered_df.loc[:, 'Cuisine_Label'] = label_encoder.fit_transform(filtered_df['Cuisines'])


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Standardization can help algorithms that are sensitive to feature scale.
# Here, Cost, Num_Reviews and Num_Followers are rescaled to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
filtered_df[['Cost_Scaled', 'Num_Reviews_Scaled', 'Num_Followers_Scaled']] = scaler.fit_transform(
    filtered_df[['Cost', 'Num_Reviews', 'Num_Followers']])
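
For reference, the `StandardScaler` call above computes a z-score per column: subtract the column mean and divide by the population standard deviation. The sketch below reproduces that arithmetic in plain Python for a single column. The `standardize` helper and the cost values are made up for the example.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance using the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


costs = [200, 400, 600, 800]  # made-up restaurant costs
scaled = standardize(costs)
print([round(v, 4) for v in scaled])
```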


### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import PCA

# The TF-IDF matrix is high-dimensional, so PCA is applied to reduce it to 50 components.
pca = PCA(n_components=50)
tfidf_reduced = pca.fit_transform(tfidf_matrix.toarray())

# PCA transforms the data into a new set of variables called principal components.
# The first component captures the most variance; each subsequent one captures a decreasing amount.

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Use an 80-20 train-test split.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# 1. TF-IDF on the outlier-filtered data, so the rows line up with filtered_df['Rating']
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(filtered_df['Clean_Review'])

# 2. Optional: dimensionality reduction (TruncatedSVD works directly on the sparse matrix)
svd = TruncatedSVD(n_components=100, random_state=42)
tfidf_reduced = svd.fit_transform(tfidf_matrix)

# 3. Train-test split (now the dimensions match)
# Caveat: the vectorizer and SVD are fit on all rows before the split, so test-set text influences the features.
X_train, X_test, y_train, y_test = train_test_split(tfidf_reduced, filtered_df['Rating'], test_size=0.2, random_state=42)


##### What data splitting ratio have you used and why?

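An 80/20 train-test split (`test_size=0.2`, `random_state=42`): 80% of the rows are used for training and 20% are held out for evaluation, with a fixed seed so the split is reproducible.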

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
from imblearn.over_sampling import SMOTE

# Check the rating distribution in the training split
print(y_train.value_counts())

# SMOTE needs class labels, so each distinct rating is treated as a class, e.g., '4.5', '3.0'
y_train_class = y_train.astype(str)

# Note: with the default k_neighbors=5, every class needs more than 5 samples
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train_class)


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

SMOTE (Synthetic Minority Oversampling Technique) is used to handle imbalanced datasets in machine learning, where one class (typically the one that matters most, like fraud cases or negative reviews) is significantly underrepresented compared to the majority class.

⚠️ Why Imbalanced Data Is a Problem:
If the dataset is imbalanced (e.g., 95% positive and 5% negative ratings), most ML models will:

- Bias toward the majority class (e.g., always predict “positive”)
- Report misleading accuracy: a model can reach 95% accuracy just by predicting the majority class every time
- Perform poorly on the class that matters most (e.g., identifying dissatisfied customers)

✅ Why Use SMOTE:
SMOTE tackles this by:

- Synthesizing new data points for the minority class (instead of duplicating existing ones).
- Creating each synthetic example by interpolating between an existing minority class instance and one of its nearest minority class neighbors.
- Balancing the training set to give the model a fair chance to learn both classes properly.

📊 Example (illustrative counts):

| Class | Count Before | Count After SMOTE |
|---|---|---|
| Positive Reviews | 950 | 950 |
| Negative Reviews | 50 | 950 |

🧠 Benefits of SMOTE:

- Lowers the overfitting risk of random oversampling, which repeats identical rows.
- Can improve recall, F1-score, and AUC on the minority class.
- Helps the model learn patterns of the minority class more effectively.

🛑 When Not to Use:

- When the data is very noisy: SMOTE might amplify noise.
- When the dataset is very large: resampling can increase training time.

📌 Summary:
SMOTE balances the class distribution in a dataset by creating synthetic minority class examples, which can lead to better generalization, fairer predictions, and improved performance metrics for models trained on imbalanced data.
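
The interpolation step at the heart of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not the `imblearn` implementation: the `smote_sketch` helper, its parameters, and the feature vectors are all made up for the example. Each synthetic point is `base + gap * (neighbour - base)` with a random `gap` between 0 and 1, so it lies on the line segment between two existing minority points.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating towards one of the k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]  # made-up feature vectors
new_points = smote_sketch(minority, n_new=6)
print(len(minority) + len(new_points), "minority samples after oversampling")
```

Because every synthetic point sits between two real minority points, noisy or mislabeled minority samples get propagated as well, which is the reason for the caution about noisy data.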

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import numpy as np

# 1. Create Sentiment Labels
filtered_df['Sentiment'] = filtered_df['Rating'].apply(lambda x: 1 if x >= 3.5 else 0)

# 2. TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(filtered_df['Clean_Review'])
tfidf_features = tfidf_matrix.toarray()

# 3. Select and scale numeric features
numeric_features = filtered_df[['Cost', 'Num_Reviews', 'Num_Followers']].fillna(0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(numeric_features)

# 4. Combine features
X = np.concatenate((tfidf_features, X_scaled), axis=1)
y = filtered_df['Sentiment']

# 5. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 6. Apply SMOTE to handle imbalance
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)


In [None]:
# ML Model - 1 Implementation
# ========== ML Model 1: Random Forest Classifier ==========
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Fit the algorithm on the SMOTE-balanced training data
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_res, y_train_res)

# Predict on the original (unresampled) test set and evaluate
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

📘 Key Metrics Explained (Class 0 = negative sentiment, Class 1 = positive sentiment):

- Precision: Out of all the predictions the model made for a class, how many were correct?
  - Class 0: 82% of reviews predicted as negative were actually negative.
  - Class 1: 88% of reviews predicted as positive were actually positive.
- Recall: Out of all actual instances of a class, how many did the model correctly identify?
  - Class 0: 79% of actual negative reviews were identified.
  - Class 1: 89% of actual positive reviews were identified.
- F1-Score: Harmonic mean of precision and recall. A good balance indicator.
  - 0.80 (Class 0), 0.88 (Class 1)
- Support: Number of samples in each class in the test set.
  - Class 0: 742 instances
  - Class 1: 1209 instances

📈 Overall Model Performance:
Accuracy: 85%, i.e. 85% of all test samples were correctly classified.
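
The definitions above can be checked by hand on a small example. The sketch below computes precision, recall, F1 and accuracy directly from true and predicted labels. The `classification_metrics` helper and the ten labels are made up for illustration and are not the notebook's test set.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return precision, recall and F1 for the `positive` class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": correct / len(pairs)}


# Made-up labels: 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

m_pos = classification_metrics(y_true, y_pred, positive=1)
m_neg = classification_metrics(y_true, y_pred, positive=0)
print("positive class:", m_pos)
print("negative class:", m_neg)
```

Calling the helper once per class mirrors the per-class rows of `classification_report`, while accuracy is the same whichever class is treated as positive.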



#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization (GridSearchCV)
from sklearn.model_selection import GridSearchCV

# Cross-validation and hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
# Caveat: cross-validating on SMOTE-resampled data puts synthetic points in the validation folds, so CV scores can be optimistic
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring='accuracy')
grid_rf.fit(X_train_res, y_train_res)
print("Best parameters:", grid_rf.best_params_)

# Best model (already refit on the full resampled training set, since refit=True by default)
best_rf = grid_rf.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)
print("Optimized Random Forest Classification Report:")
print(classification_report(y_test, y_pred_best_rf))

##### Which hyperparameter optimization technique have you used and why?

What is GridSearchCV?
GridSearchCV systematically searches a grid of hyperparameter values for a machine learning model, in this case a Random Forest. It uses cross-validation to find the combination that scores best on the chosen metric (here, accuracy).

GridSearchCV tried every combination in `param_grid` (2 × 3 × 2 × 2 = 24 settings) using 3-fold cross-validation, meaning each setting was trained and validated on 3 different splits of the training data.

Selected values:

- n_estimators=200: Using 200 decision trees gave the best cross-validated score.
- min_samples_split=5: A node must have at least 5 samples before it can be split. Helps limit overfitting.
- random_state=42: Ensures reproducibility of results (fixed in the estimator, not tuned).

The other parameters (max_depth and min_samples_leaf) were also part of the grid. Their selected values are not recorded here and may coincide with the defaults; `grid_rf.best_params_` lists them.

📈 Why this matters:
Tuning can improve model performance by finding the best parameter combination for the dataset.

With the default `refit=True`, GridSearchCV already retrains the best configuration on the full training set passed to `fit`, so `best_estimator_` can be used directly to predict on the test set.
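
To see how much work the grid search does, the sketch below enumerates the same `param_grid` with `itertools.product` and counts the model fits. It assumes the defaults used in the tuning cell (`cv=3`, and `refit=True` for the final refit) and does not train anything.

```python
from itertools import product

# Same grid as in the tuning cell
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
cv_folds = 3

# Every combination of one value per hyperparameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds  # one fit per candidate per fold
n_total_fits = n_cv_fits + 1         # plus the final refit of the best candidate

print(n_candidates, n_cv_fits, n_total_fits)  # prints: 24 72 73
```

The count grows multiplicatively with every added value, which is why a randomized search is often preferred for larger grids.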

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# ========== ML Model 2: Logistic Regression ==========
from sklearn.linear_model import LogisticRegression

# Default settings; if a ConvergenceWarning appears, consider increasing max_iter
logreg_model = LogisticRegression()
logreg_model.fit(X_train_res, y_train_res)
y_pred_log = logreg_model.predict(X_test)
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_log))


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

🔍 Model Overview: Logistic Regression
Logistic Regression, a linear model for binary classification, was used to predict the binary sentiment label derived from ratings (1 for ratings of 3.5 and above, 0 otherwise).

- Trained on: SMOTE-balanced training data (X_train_res, y_train_res)
- Tested on: Unseen test data (X_test, y_test)

✅ Overall Performance Metrics
Accuracy: 0.86, i.e. 86% of all test predictions were correct.

📈 Insights & Business Impact
Why use Logistic Regression?

- It's simple, interpretable, and useful as a strong baseline model.
- Fast to train and works well when the features relate roughly linearly to the log-odds of the outcome.

Insights:

- Logistic Regression performs comparably to the untuned Random Forest, with an F1-score of 0.88 for high ratings and 0.82 for low ratings.
- This balanced performance is useful for applications like flagging top restaurants or identifying underperformers.

Business Impact:

- Reliable predictions can guide customer targeting (e.g., promoting highly-rated restaurants) and restaurant improvement strategies (e.g., identifying consistently low-rated ones).
- High recall for low ratings helps ensure negative experiences aren’t overlooked.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
# ========== ML Model 3: Support Vector Classifier ==========
from sklearn.svm import SVC

# Fit the algorithm (default RBF kernel) on the SMOTE-balanced training data
svc_model = SVC()
svc_model.fit(X_train_res, y_train_res)

# Predict on the test set and evaluate
y_pred_svc = svc_model.predict(X_test)
print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svc))

# Summary:
 - Random Forest performed well and improved with hyperparameter tuning using GridSearchCV.
 - Logistic Regression and SVC offer baseline comparisons.
 - Hyperparameter tuning improved F1-score and recall for the positive class, indicating better sentiment capture.


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

- F1 Score: Balances Precision and Recall, important in imbalanced data.
- Recall: Critical when identifying unhappy customers (negative sentiment).
- Accuracy: Overall performance indicator.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Chosen Final Model: Optimized Random Forest
- Best F1-score and recall for positive sentiment.
- Robust to outliers and handles feature interactions well.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

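# Train Random Forest
# Note: this refits a forest with n_estimators=100 and overwrites the tuned `best_rf` from GridSearchCV;
# skip this cell to explain the tuned model instead
best_rf = RandomForestClassifier(n_estimators=100, random_state=42)
best_rf.fit(X_train_res, y_train_res)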

# Predict and evaluate
y_pred = best_rf.predict(X_test)
print(classification_report(y_test, y_pred))


In [None]:
import matplotlib.pyplot as plt

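# Get importances for the numeric features only (the last N columns of X are numeric).
# Importances sum to 1 across all features (TF-IDF terms plus numeric), so this slice shows only the numeric features' share.
importances = best_rf.feature_importances_[-X_scaled.shape[1]:]
features = ['Cost', 'Num_Reviews', 'Num_Followers']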

plt.figure(figsize=(8, 4))
plt.barh(features, importances)
plt.xlabel("Feature Importance")
plt.title("Feature Importance from Random Forest (Numeric Features)")
plt.show()


📌 What It Indicates

- Among the three numeric features, Cost and Num_Reviews are the most influential, with nearly equal importance.
- Num_Followers has slightly less influence, but still contributes.
- The chart leaves out the TF-IDF text features, so it does not show how the numeric features rank against the review text.
- Random Forest calculates feature importance based on how much each feature reduces impurity (like the Gini index) across all decision trees.
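
The impurity reduction mentioned above can be worked through for a single split. The sketch below computes the Gini impurity of a node and the weighted decrease achieved by splitting it. In a forest, a feature's importance aggregates such decreases over every node that splits on that feature and over all trees. The helper names, the labels, and the cost threshold are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Made-up sentiment labels at one tree node, split on a hypothetical cost threshold
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]   # e.g. rows with cost below the threshold
right = [1, 0, 0, 0]  # e.g. rows with cost at or above the threshold

decrease = impurity_decrease(parent, left, right)
print(gini(parent), gini(left), gini(right), decrease)
```

A split that separated the classes perfectly would bring both children to impurity 0 and the decrease to the parent's full 0.5.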

💡 Insights from the Chart

- Cost and review count help the model separate positive from negative reviews, possibly because higher-priced places are more polarizing or reviewed more frequently. Impurity-based importance shows that a feature is useful for prediction, not the direction of its effect.
- The number of followers a reviewer has is less influential than the other two, but still plays a role.

📈 Business Impact

- Targeting strategy: Review volume and pricing are worth examining as levers for visibility and ratings, although importance alone does not establish cause and effect.
- Trust scoring: Reviewer popularity (followers) matters less than how much a place is reviewed or its pricing.
- This knowledge can help prioritize features in future models, simplify data collection, or focus business actions.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
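
The two cells above are left empty in the notebook. As a minimal sketch of the save-and-reload round trip, the example below uses the standard-library `pickle` module on a stand-in object. The `model` dictionary, the `predict` helper, and the file name `best_model.pkl` are hypothetical. In the notebook, the object passed to `pickle.dump` would be the fitted estimator (for example `best_rf`), ideally together with the fitted vectorizer and scaler so that new reviews can be transformed the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model: a plain dict of learned parameters (hypothetical)
model = {"name": "threshold_classifier", "threshold": 3.5}


def predict(model, ratings):
    """Return 1 for ratings at or above the stored threshold, else 0."""
    return [1 if r >= model["threshold"] else 0 for r in ratings]


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "best_model.pkl"  # hypothetical file name

    # Save the model
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# Sanity check: the reloaded model gives the same predictions on unseen inputs
unseen = [4.0, 2.5, 3.5]
predictions = predict(loaded_model, unseen)
print(predictions)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.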

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

📊 **Key Findings**:

- Cost and Number of Reviews are the most important numeric predictors of restaurant ratings.
- Text reviews required preprocessing (removal of noise and stopwords, then lemmatization) to enable sentiment analysis and vectorization.
- Higher-cost restaurants correlate slightly with better ratings, but the correlation is weak.
- A reviewer's follower count is not significantly associated with the rating given.
- Logistic Regression and Random Forest achieved strong performance, with Random Forest performing slightly better.
- After applying SMOTE to handle class imbalance, model accuracy and F1-score improved.

🤖 **Best Model Chosen**:
Random Forest Classifier after hyperparameter tuning (via GridSearchCV), due to its higher accuracy and balanced performance across classes.

📈 **Business Impact**:

- Review count and price point are associated with review sentiment, so they are worth monitoring when working to improve restaurant ratings.
- The analysis allows Zomato to predict restaurant popularity, recommend top-rated places, and enhance customer satisfaction.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***