# **Project Name**    - ZOMATO PROJECT



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

Zomato Review Analysis and Restaurant Clustering – Project Summary

1. Objective

This project focuses on analyzing Zomato’s customer reviews and restaurant metadata to extract meaningful insights that can benefit both customers and business stakeholders. The key goals include:

*   Sentiment Analysis of customer reviews to understand user satisfaction.
*   Clustering of Restaurants based on operational, review-based, and service features.
*   Data Visualization of business-critical metrics such as cuisine popularity, pricing trends, and reviewer influence.

These insights empower customers to choose the best restaurants based on collective sentiment and objective attributes, while helping Zomato optimize strategies around pricing, marketing, and service quality.

2. Data Collection and Preparation

The dataset comprised multiple structured CSV files containing:

*   Customer reviews (text data).
*   Restaurant metadata: names, cuisines, cost, collections, and operating hours.
*   Review-related metadata: timestamps, ratings, pictures, and reviewer info.

Key data wrangling steps included:

*   Merging data from multiple sources with schema alignment.
*   Handling missing values and removing duplicates.
*   Cleaning and preprocessing of textual and categorical data.
*   Label encoding and feature engineering for model readiness.

3. Exploratory Data Analysis (EDA)

The project adopted the UBM approach:

*   Univariate Analysis: Analyzed the distribution of ratings, costs, cuisines, and review frequency.
*   Bivariate & Multivariate Analysis: Revealed insights such as:
    *   Only a weak relationship between cost and rating (higher-rated restaurants show a slightly higher median cost, but the correlation is low).
    *   Certain cuisines (e.g., North Indian, Chinese) driving higher ratings.
    *   Reviewer types (critics vs casual users) influencing sentiment.

These visualizations helped frame the direction for modeling and segmentation.

4. Sentiment Analysis of Reviews

*   Preprocessed reviews using NLP techniques: tokenization, stopword removal, and lemmatization.
*   Used VADER and TextBlob for sentiment scoring.
*   Created polarity labels: Positive, Neutral, Negative.
*   Integrated sentiment as a new feature influencing model predictions and clustering.

This enabled a deeper understanding of how subjective experiences relate to ratings and other metadata.
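
The polarity-labeling step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `label_polarity` is hypothetical, the scores are made up, and the ±0.05 cut-offs are the thresholds commonly suggested for VADER's compound score, assumed here rather than confirmed by this analysis.

```python
def label_polarity(compound: float, threshold: float = 0.05) -> str:
    """Map a compound sentiment score in [-1, 1] to a polarity label.

    The threshold is an assumption (the conventional VADER cut-off).
    """
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical compound scores, e.g. as produced by a sentiment scorer
scores = [0.82, 0.01, -0.40]
labels = [label_polarity(s) for s in scores]
print(labels)  # ['Positive', 'Neutral', 'Negative']
```

Keeping the threshold as a parameter makes it easy to check how sensitive the Positive/Neutral/Negative split is to the chosen cut-off.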

5. Clustering Restaurants

Unsupervised learning techniques were applied to segment restaurants:

*   Used KMeans and DBSCAN after scaling numerical features (e.g., cost, sentiment scores).
*   Evaluated optimal clusters using the Elbow Method and Silhouette Score.
*   Identified groups like:
    *   High-cost, highly-rated fine dining.
    *   Budget-friendly but poorly-rated chains.
    *   Mid-range popular local eateries.

This segmentation offers actionable strategies for business targeting and promotion.
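
The scaling, elbow, and silhouette steps can be sketched as below. This is an illustrative example under stated assumptions: the data are synthetic (three artificial groups of cost and sentiment values), the feature choice and the range of `k` are hypothetical, and only KMeans is shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for restaurant features (cost for two, sentiment score);
# three well-separated groups are generated purely for illustration.
rng = np.random.default_rng(42)
centers = np.array([[300.0, -0.5], [800.0, 0.2], [2000.0, 0.8]])
X = np.vstack([
    c + rng.normal(scale=[40.0, 0.05], size=(50, 2)) for c in centers
])

# Scale first so cost (hundreds of rupees) does not dominate sentiment (-1..1)
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for several k; keep inertia (for an elbow plot) and silhouette
results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    results[k] = (model.inertia_, silhouette_score(X_scaled, model.labels_))

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])
print("Best k by silhouette:", best_k)
```

Inertia always decreases as `k` grows, so the elbow is read off a plot, whereas the silhouette score gives a single number per `k` that can be maximised directly.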

6. Predictive Modeling

Three supervised ML models were built to predict restaurant ratings:

*   Logistic Regression – Baseline model.
*   Random Forest Classifier – Captured non-linear relationships.
*   XGBoost Classifier – Best-performing model with the highest accuracy and interpretability.

Supporting steps:

*   ✅ Evaluation Metrics included Accuracy, Precision, Recall, F1-Score, and Confusion Matrix.
*   ✅ Cross-Validation ensured generalization.
*   ✅ Hyperparameter Tuning (GridSearchCV, RandomizedSearchCV) boosted performance.

The XGBoost model was selected as the final model due to its robustness and scalability.

7. Feature Importance & Model Explainability

Using SHAP (SHapley Additive exPlanations), we identified that:

*   Cost, Cuisines, and Sentiment Score were the top contributors to rating predictions.
*   Review Length and Reviewer Type also had notable influence.

Explainability tools ensured transparency and trust in model decisions.

8. Business Impact

Customers can find top-rated restaurants using sentiment-backed insights.

Zomato can:

*   Optimize pricing and service offerings by region and cuisine.
*   Identify and promote top cuisines and influential reviewers.
*   Segment and market restaurants better using clustering outcomes.

9. Conclusion
The project delivered a complete machine learning solution combining NLP, unsupervised learning, and predictive modeling. It bridges subjective experiences (sentiment) with objective attributes to enhance user trust and business efficiency. With scalable and interpretable outputs, this project is a strong foundation for real-time recommendation and review intelligence for Zomato.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The objective of this project is to analyze customer reviews and metadata of Zomato restaurants to extract meaningful insights that benefit both customers and the company. The project involves sentiment analysis of customer reviews, clustering of restaurants into meaningful segments, and visualization of key business metrics such as cuisine popularity, pricing trends, and reviewer influence. This analysis will help:

*   Customers identify the best restaurants in their locality based on sentiment and other factors.
*   The company (Zomato) uncover areas needing improvement, optimize pricing strategies, and identify top-performing cuisines and influential reviewers (critics) in the industry.

By leveraging sentiment analysis, clustering techniques, and data visualizations, this project aims to enhance decision-making for both users and business stakeholders.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
import pandas as pd


# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


metadata_df=pd.read_csv('/content/drive/MyDrive/Zomato Project/Zomato Restaurant names and Metadata.csv')
reviews_df=pd.read_csv('/content/drive/MyDrive/Zomato Project/Zomato Restaurant reviews.csv')




# Clean and normalize names so both files share a comparable join key
metadata_df["Name_clean"] = metadata_df["Name"].str.strip().str.lower()
reviews_df["Restaurant_clean"] = reviews_df["Restaurant"].str.strip().str.lower()

# Merge: bring metadata into the reviews dataframe (one row per review)
merged_df = reviews_df.merge(
    metadata_df,
    left_on="Restaurant_clean",
    right_on="Name_clean",
    how="left"
)
# A bare expression is only displayed when it is the last line of a cell,
# so print the shape explicitly
print(merged_df.shape)
merged_df





### Dataset First View

In [None]:
# Dataset First Look
merged_df.head()




### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
merged_df.shape

### Dataset Information

In [None]:
# Dataset Info
merged_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
merged_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
merged_df.isnull().sum()

In [None]:

# Visualizing the missing values
missing_values = merged_df.isnull().sum()
plt.figure(figsize=(10, 6))
sns.barplot(x=missing_values.index, y=missing_values.values)
plt.xticks(rotation=90)
plt.xlabel('Columns')
plt.ylabel('Missing Values Count')
plt.title('Missing Values Count by Column')
plt.show()


### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged_df.columns

In [None]:
# Dataset Describe
merged_df.describe()



### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in merged_df.columns:
    unique_values = merged_df[column].unique()
    print(f"Unique values for column '{column}':")
    print(unique_values)
    print()


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import pandas as pd

# Load the CSV file
metadata_df=pd.read_csv('/content/drive/MyDrive/Zomato Project/Zomato Restaurant names and Metadata.csv')

# STEP 1: Show basic info
print("Before cleaning:")
metadata_df.info()  # info() prints its own report and returns None
print(metadata_df.head())

# STEP 2: Normalize 'Name' column for consistency
metadata_df['Name'] = metadata_df['Name'].str.strip()
metadata_df['Name_clean'] = metadata_df['Name'].str.lower()

# Optional: check for duplicates in names
duplicates = metadata_df['Name_clean'].duplicated().sum()
print(f"🔁 Duplicate restaurant names: {duplicates}")

# STEP 3: Clean 'Cost' column (remove commas and convert to integer)
metadata_df['Cost'] = metadata_df['Cost'].astype(str).str.replace(",", "")
metadata_df['Cost'] = pd.to_numeric(metadata_df['Cost'], errors='coerce')

# Optional: see outliers
print("💸 Cost stats:")
print(metadata_df['Cost'].describe())

# STEP 4: Handle missing 'Collections'
metadata_df['Collections'] = metadata_df['Collections'].fillna("Not listed")

# STEP 5: Clean 'Cuisines' (strip whitespace from each cuisine)
metadata_df['Cuisines'] = metadata_df['Cuisines'].apply(
    lambda x: ', '.join([i.strip() for i in x.split(',')]) if pd.notnull(x) else x
)

# STEP 6: Fill missing 'Timings'
metadata_df['Timings'] = metadata_df['Timings'].fillna("Not listed")

# STEP 7: Final check
print("✅ Cleaned metadata:")
print(metadata_df.head())
metadata_df.info()  # info() prints its own report and returns None

# now cleaning reviews.csv
# Load reviews CSV
import re
reviews_df=pd.read_csv('/content/drive/MyDrive/Zomato Project/Zomato Restaurant reviews.csv')

# STEP 1: Check basic info
print("Before cleaning:")
reviews_df.info()  # info() prints its own report and returns None
print(reviews_df.head())

# STEP 2: Normalize 'Restaurant' for joining later
reviews_df["Restaurant_clean"] = reviews_df["Restaurant"].str.strip().str.lower()
# STEP 3: Fill missing 'Review' with empty string
reviews_df["Review"] = reviews_df["Review"].fillna("")

# STEP 4: Convert 'Rating' to numeric (handle invalid entries)
reviews_df["Rating"] = pd.to_numeric(reviews_df["Rating"], errors="coerce")
print("🟡 Rating NaNs after conversion:", reviews_df["Rating"].isna().sum())
# STEP 5: Extract 'Followers' count from 'Metadata'
def extract_followers(text):
    match = re.search(r"(\d+)\s*Follower", str(text))
    return int(match.group(1)) if match else 0

reviews_df["Followers_Count"] = reviews_df["Metadata"].apply(extract_followers)

# STEP 6: Convert 'Time' to datetime
reviews_df["Review_Date"] = pd.to_datetime(reviews_df["Time"], errors="coerce")
# STEP 7: Create 'Review_Length' = number of words in review
reviews_df["Review_Length"] = reviews_df["Review"].apply(lambda x: len(str(x).split()))
# STEP 8: Create binary flag 'Is_5_Star'
reviews_df["Is_5_Star"] = reviews_df["Rating"].apply(lambda x: 1 if x == 5 else 0)
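# STEP 9: Final check
print("✅ Cleaned reviews:")
print(reviews_df.head())
reviews_df.info()  # info() prints its own report and returns None

# STEP 10: Rebuild merged_df from the cleaned frames so that the charts below
# use the numeric Rating/Cost columns and the engineered features
merged_df = reviews_df.merge(
    metadata_df,
    left_on="Restaurant_clean",
    right_on="Name_clean",
    how="left"
)
print("Merged shape:", merged_df.shape)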



### What all manipulations have you done and insights you found?

In the data wrangling process, we began by cleaning the restaurant metadata file. We normalized the restaurant names by stripping spaces and converting them to lowercase to prepare for merging with the review dataset. The Cost column, originally stored as a string with commas (e.g., "1,200"), was cleaned and converted into a numeric format. Missing values in the Collections and Timings columns were filled with "Not listed" to maintain consistency. We also stripped extra spaces from the Cuisines column to ensure clean, uniform data.

Next, in the reviews dataset, we normalized the restaurant names for merging, filled missing reviews with empty strings, and converted the Rating column to numeric values while handling invalid entries. We extracted the number of followers from the Metadata column using regular expressions and converted the review Time into a proper datetime format. Additionally, we engineered new features: Review_Length, which counts the number of words in a review, and Is_5_Star, a binary label indicating perfect 5-star reviews. Finally, we merged both datasets using cleaned restaurant names.

Through this process, we discovered that many restaurants serve multiple cuisines and have missing metadata like collections. The reviews dataset is rich, with over 10,000 entries, many of which are 5-star ratings, suggesting potential class imbalance if used for modeling. We also found that some reviewers have significant follower counts, indicating they might influence others. This cleaned and enriched dataset is now well-prepared for further analysis, visualization, or machine learning tasks.
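
The name normalization, numeric cleaning, and merge described above can be checked on a small example. The sketch below uses made-up restaurant names and values; `validate` and `indicator` are standard `DataFrame.merge` options, but their use here is an illustration, not part of the original wrangling code.

```python
import pandas as pd

# Tiny hypothetical frames standing in for the reviews and metadata files
reviews = pd.DataFrame({
    "Restaurant": [" Spice Hub", "Cafe Uno", "Unknown Cafe"],
    "Rating": ["5", "4", "Like"],
})
metadata = pd.DataFrame({
    "Name": ["Spice Hub", "cafe uno "],
    "Cost": ["800", "1,200"],
})

# Normalize the join keys the same way on both sides
reviews["key"] = reviews["Restaurant"].str.strip().str.lower()
metadata["key"] = metadata["Name"].str.strip().str.lower()

# Clean the numeric columns: commas removed, invalid entries become NaN
metadata["Cost"] = pd.to_numeric(metadata["Cost"].str.replace(",", "", regex=False))
reviews["Rating"] = pd.to_numeric(reviews["Rating"], errors="coerce")

# validate= raises if a restaurant name appears twice in the metadata;
# indicator= adds a _merge column showing which reviews found a match
merged = reviews.merge(metadata, on="key", how="left",
                       validate="many_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Restaurant"].tolist()
print(unmatched)  # ['Unknown Cafe']
```

Listing the unmatched names after a left merge is a quick way to see whether missing metadata comes from genuinely absent restaurants or from spelling differences in the join key.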



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
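# Chart - 1 visualization code [univariate]
# 1. Distribution of Ratings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Make sure Rating is numeric so the bars are ordered from low to high
merged_df["Rating"] = pd.to_numeric(merged_df["Rating"], errors="coerce")
sns.countplot(data=merged_df, x="Rating")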
plt.title("Distribution of Review Ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

To see the frequency of each rating given by customers


##### 2. What is/are the insight(s) found from the chart?

Most users rate between 4 and 5, indicating overall satisfaction

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1.Highlights user satisfaction.
2.Could hide issues if negative reviewers are underrepresented.

#### Chart - 2

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Ensure Review column is clean and create Review_Length
merged_df["Review"] = merged_df["Review"].fillna("")
merged_df["Review_Length"] = merged_df["Review"].apply(lambda x: len(str(x).split()))

# Set visual style
sns.set(style="whitegrid")

# Plot
plt.figure(figsize=(12, 6))
sns.histplot(merged_df["Review_Length"], bins=30, kde=True, color='mediumorchid', edgecolor='black', linewidth=1)

# Add titles and labels
plt.title("Distribution of Review Lengths", fontsize=16, fontweight='bold')
plt.xlabel("Number of Words in Review", fontsize=12)
plt.ylabel("Number of Reviews", fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Layout tweak
plt.tight_layout()

# Show plot
plt.show()


##### 1. Why did you pick the specific chart?

To analyze review verbosity and detail.

##### 2. What is/are the insight(s) found from the chart?

Majority of reviews are short (under 50 words).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

May lack context. Can promote richer reviews through incentives.

#### Chart - 3

In [None]:
# Chart-3 Cost Distribution of Restaurants
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Convert Cost column to numeric (remove commas and convert to float)
merged_df['Cost'] = merged_df['Cost'].astype(str).str.replace(',', '', regex=False)
merged_df['Cost'] = pd.to_numeric(merged_df['Cost'], errors='coerce')

# merged_df has one row per review, so keep one row per restaurant
# before counting restaurants, then drop NA values
clean_costs = merged_df.drop_duplicates(subset='Restaurant')['Cost'].dropna()

# Define bin edges (₹200 steps)
bin_edges = np.arange(0, clean_costs.max() + 200, 200)

# Set Seaborn style
sns.set(style="whitegrid")

# Plot histogram
plt.figure(figsize=(12, 6))
sns.histplot(clean_costs, bins=bin_edges, kde=True, color='mediumslateblue', edgecolor='black', linewidth=1)

# Styling
plt.title("Cost Distribution of Restaurants", fontsize=16, fontweight='bold')
plt.xlabel("Average Cost for Two (INR)", fontsize=12)
plt.ylabel("Number of Restaurants", fontsize=12)
plt.xticks(bin_edges, rotation=45, fontsize=9)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()

plt.show()


##### 1. Why did you pick the specific chart?

To understand price ranges of restaurants.


##### 2. What is/are the insight(s) found from the chart?

Most restaurants cost below ₹1000 for two people.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Helps target budget-conscious users.
2. Could onboard more luxury restaurants for premium users.


#### Chart - 4

In [None]:
# Chart - 4 Top 10 Most Reviewed Restaurants
import matplotlib.pyplot as plt

# Count the number of reviews per restaurant
top_restaurants = merged_df['Restaurant'].value_counts().head(10)

# Plot
plt.figure(figsize=(12, 6))
bars = plt.bar(top_restaurants.index, top_restaurants.values, color='salmon')

# Annotate bar values
for bar in bars:
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, bar.get_height(), ha='center')

plt.title("Top 10 Most Reviewed Restaurants")
plt.xlabel("Restaurant Name")
plt.ylabel("Number of Reviews")
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()


##### 1. Why did you pick the specific chart?

To spot high-engagement or popular restaurants

##### 2. What is/are the insight(s) found from the chart?

Only a few restaurants dominate reviews

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Use these as case studies.
2. Avoid platform bias toward only big names.


#### Chart - 5

In [None]:
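# Chart - 5 Top 10 Cuisines Offered
# merged_df has one row per review, so count each restaurant's cuisines once
cuisine_series = (
    merged_df.drop_duplicates(subset="Restaurant")["Cuisines"].dropna().str.split(",")
)
# Strip whitespace so "Chinese" and " Chinese" are counted together
flat_cuisines = [item.strip() for sublist in cuisine_series for item in sublist]
pd.Series(flat_cuisines).value_counts().head(10).plot(kind="bar")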
plt.title("Top 10 Cuisines Offered")
plt.xlabel("Cuisine")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To find the most offered cuisines on the platform.

##### 2. What is/are the insight(s) found from the chart?

North Indian, Chinese, South Indian are top.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Aligns with popular tastes.
2. Could diversify by promoting underrepresented cuisines.
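
Counting cuisines means splitting each comma-separated string and tallying the pieces. A compact way to do this, shown here on hypothetical rows rather than the project data, is `str.split` followed by `explode`:

```python
import pandas as pd

# Hypothetical restaurant-level rows; each restaurant lists one or more cuisines
restaurants = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Cuisines": ["North Indian, Chinese", "Chinese,  Biryani", None],
})

cuisine_counts = (
    restaurants["Cuisines"]
    .dropna()
    .str.split(",")   # split on bare commas so spacing does not matter
    .explode()        # one row per (restaurant, cuisine) pair
    .str.strip()      # remove leftover whitespace around each cuisine
    .value_counts()
)
print(cuisine_counts)
```

Because the split happens on restaurant-level rows, each restaurant contributes one count per cuisine regardless of how many reviews it has.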

#### Chart - 6

In [None]:
import matplotlib.pyplot as plt

# Get top 10 collections, counting each restaurant once
# (merged_df has one row per review) and skipping the "Not listed" placeholder
restaurant_collections = merged_df.drop_duplicates(subset='Restaurant')['Collections'].dropna()
top_collections = (
    restaurant_collections[restaurant_collections != "Not listed"]
    .value_counts()
    .head(10)
)

# Create the horizontal bar chart
plt.figure(figsize=(12, 7))
bars = plt.barh(top_collections.index[::-1], top_collections.values[::-1], color='mediumseagreen')

# Add value labels beside bars
for i, val in enumerate(top_collections.values[::-1]):
    plt.text(val + 0.1, i, str(val), va='center', fontsize=10)

# Styling
plt.title("Top 10 Restaurant Collections", fontsize=16, fontweight='bold')
plt.xlabel("Number of Restaurants", fontsize=12)
plt.ylabel("Collection Name", fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To assess user interaction with curated collections.

##### 2. What is/are the insight(s) found from the chart?

"Best in City", "Trending", "Romantic Dining" are highly used.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Promote collections in app banners for discovery.

#### Chart - 7

In [None]:
# Chart - 7 Top 10 most common restaurant timings
import matplotlib.pyplot as plt
# Count the most common timings, one row per restaurant
# (merged_df has one row per review)
common_timings = (
    merged_df.drop_duplicates(subset='Restaurant')['Timings'].dropna().value_counts().head(10)
)

# Plot
plt.figure(figsize=(12, 6))
bars = plt.barh(common_timings.index, common_timings.values, color='skyblue')

# Annotate values
for bar in bars:
    plt.text(bar.get_width() + 0.1, bar.get_y() + bar.get_height()/2, int(bar.get_width()), va='center')

plt.title("Top 10 Most Common Restaurant Timings")
plt.xlabel("Number of Restaurants")
plt.ylabel("Timing Slots")
plt.tight_layout()
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.show()


##### 1. Why did you pick the specific chart?

Understanding the most frequent restaurant operating hours helps tailor promotions, delivery availability, and customer service to high-traffic periods.

##### 2. What is/are the insight(s) found from the chart?

The most common timing patterns fall between 11 AM – 11 PM, with variations like 12 PM – 12 AM or 9 AM – 11 PM also popular. This shows most restaurants operate for lunch through late dinner.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Positive: Zomato can align promotions, push notifications, and delivery fleets during these windows to maximize engagement.

2. Operational Insight: Helps optimize delivery partner scheduling and customer support during peak hours.

3. Negative Insight: Restaurants with very short or irregular timings may miss out on peak business hours. Zomato can provide them performance reports to suggest better timing slots.



#### Chart - 8

In [None]:
# Chart - 8 visualization code [bivariate]
# Rating vs Cost
plt.figure(figsize=(10, 6))
sns.boxplot(data=merged_df, x="Rating", y="Cost", palette="YlOrBr")
plt.title("Rating vs Cost of Restaurants")
plt.xlabel("Rating")
plt.ylabel("Cost for Two")
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

 Boxplot helps understand distribution of cost across different ratings.

##### 2. What is/are the insight(s) found from the chart?

High-rated restaurants have slightly higher median cost.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses can align pricing strategies with quality perception.

#### Chart - 9

In [None]:
# Chart - 9 visualization code [bivariate]
# Rating vs Review Length
plt.figure(figsize=(10, 6))
sns.boxplot(data=merged_df, x="Rating", y="Review_Length", palette="BuPu")
plt.title("Rating vs Review Length")
plt.xlabel("Rating")
plt.ylabel("Number of Words in Review")
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To see if customers leave longer reviews based on their experience.

##### 2. What is/are the insight(s) found from the chart?

 Longer reviews are common for low and high ratings (strong opinions).


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Can help filter fake or neutral reviews.

#### Chart - 10

In [None]:
# Chart - 10 visualization code [bivariate]
# Cost vs Review Length
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged_df, x="Cost", y="Review_Length", alpha=0.6, color='teal')
plt.title("Cost vs Review Length")
plt.xlabel("Cost for Two")
plt.ylabel("Review Length (Words)")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To explore relationship between price and review detail.

##### 2. What is/are the insight(s) found from the chart?

No clear correlation; suggests that cost doesn’t strongly impact review detail.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps understand that even lower-cost restaurants can attract engaged reviewers.

#### Chart - 11

In [None]:
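# Chart - 11 visualization code [bivariate]
# Rating vs Time (Hour of Day)
# Rating must be numeric before it can be averaged per hour
merged_df['Rating'] = pd.to_numeric(merged_df['Rating'], errors='coerce')
merged_df['Hour'] = pd.to_datetime(merged_df['Time'], errors='coerce').dt.hour
plt.figure(figsize=(10, 6))
sns.lineplot(data=merged_df, x='Hour', y='Rating', errorbar=None, color='crimson')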
plt.title("Rating Trends by Review Hour")
plt.xlabel("Hour of the Day")
plt.ylabel("Average Rating")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

 To check temporal trends in ratings.

##### 2. What is/are the insight(s) found from the chart?

Ratings dip slightly in late hours.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

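Timing of service might impact experience, although the hour plotted is when the review was posted, not when the meal was served. Treat staff-shift optimization as a hypothesis to validate rather than a direct conclusion.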

#### Chart - 12

In [None]:
# Chart - 12 visualization code [bivariate]
# Collections vs Cost (Top 10 Collections)
top_10_collections = merged_df['Collections'].value_counts().head(10).index
subset = merged_df[merged_df['Collections'].isin(top_10_collections)]
plt.figure(figsize=(12, 6))
sns.boxplot(data=subset, x="Collections", y="Cost", palette="Pastel1")
plt.title("Cost Distribution by Top 10 Collections")
plt.xticks(rotation=30, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Identify how pricing varies among different themed collections.

##### 2. What is/are the insight(s) found from the chart?

 Collections like 'Luxury Dining' have higher cost ranges.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Can guide promotional focus or pricing strategies.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Ensure Rating column is numeric
merged_df["Rating"] = pd.to_numeric(merged_df["Rating"], errors='coerce')

# Group by restaurant and get top 10 by average rating
top_restaurants = (
    merged_df.dropna(subset=["Rating"])
    .groupby("Restaurant")["Rating"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=top_restaurants.values, y=top_restaurants.index, palette="crest")
plt.title("Top 10 Restaurants by Average Rating")
plt.xlabel("Average Rating")
plt.ylabel("Restaurant")
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Bar plots are excellent for comparing categorical groups—in this case, restaurant names.

It provides a quick view of the top performers based on average user ratings.

##### 2. What is/are the insight(s) found from the chart?

The top-rated restaurants consistently score above the average rating (e.g., 4.5+).

Some less mainstream restaurants may rank surprisingly high, indicating hidden gems.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Helps businesses identify who’s setting the standard—these restaurants can be studied for customer service, ambiance, or menu inspiration.

Negative: If a brand has many outlets but none in the top list, it may indicate inconsistency in service quality.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select numeric columns for correlation
numeric_cols = ['Rating', 'Cost', 'Review_Length']
corr_matrix = merged_df[numeric_cols].corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5, linecolor='gray', fmt=".2f")
plt.title("Correlation Heatmap: Rating, Cost, Review Length", fontsize=14)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Helps to quickly identify the strength and direction of relationships among multiple numeric variables.

Heatmaps are ideal for comparing all pairwise correlations in one glance.

##### 2. What is/are the insight(s) found from the chart?

Rating has a low correlation with both Cost and Review_Length.

Slight positive correlation between Cost and Review_Length (customers might write more about expensive places)

Since cost doesn't strongly influence ratings, restaurants can maintain affordability without worrying about negative reviews.

Focus should shift toward improving service and quality to get better ratings rather than just increasing price.

#### Chart - 15 - Pair Plot

In [None]:
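# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt
# diag_kind='kde' draws density curves on the diagonal; palette is omitted
# because it only takes effect together with hue
sns.pairplot(merged_df[['Rating', 'Cost', 'Review_Length']], diag_kind='kde', corner=True)
plt.suptitle("Pairplot: Rating, Cost, Review Length", y=1.02)
plt.show()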


##### 1. Why did you pick the specific chart?

Pairplot is a powerful multivariate visualization showing:

*   Distribution of each variable (KDE curves on the diagonal)
*   Scatterplots of each variable pair (lower triangle)
*   Relationships and distributions simultaneously

Excellent for spotting outliers, clusters, and patterns in data.

##### 2. What is/are the insight(s) found from the chart?

No clear linear pattern between most variable pairs.

Cost and Review_Length show some mild trend.

The Rating distribution is skewed — more towards higher ratings.

Reinforces the insight that price does not guarantee good reviews.

The presence of outliers (very high review length or cost) suggests targeted strategies may be needed for different restaurant segments.
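
One way to make the outlier observation concrete is the 1.5 × IQR rule. The sketch below is an illustration on hypothetical review lengths; the helper name `iqr_outliers` and the factor 1.5 are assumptions, not part of the original analysis.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends; negate it to flag the outliers
    return ~values.between(q1 - k * iqr, q3 + k * iqr)


# Hypothetical review lengths in words; the last one is unusually long
review_length = pd.Series([12, 20, 25, 30, 35, 40, 400])
mask = iqr_outliers(review_length)
print(review_length[mask].tolist())  # [400]
```

The same mask could be applied to Cost to separate unusually expensive restaurants before segmenting.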



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

📌Hypothetical Statement 1:
"Restaurants with a cost above the median have significantly higher ratings than those below the median."

Why this statement?
From the Bivariate chart “Rating vs Cost of Restaurants”, we observed that higher-rated restaurants tend to show a higher median cost. This leads to the assumption that more expensive restaurants may offer better service or quality, reflected in their ratings.

What test will we apply?
→ Independent T-test to compare the mean rating of restaurants in two groups:

Restaurants with cost above the median

Restaurants with cost below the median

Business Impact:
If statistically proven, businesses can consider premium pricing strategies while maintaining quality to improve customer satisfaction and ratings.

📌 Hypothetical Statement 2:
"The average review length differs significantly between 1-star and 5-star reviews."

Why this statement?
Based on the Bivariate chart “Rating vs Review Length”, there was an observation that both very low and very high ratings tend to have longer reviews, possibly due to stronger emotions or more elaborate feedback.

What test will we apply?
→ Independent T-test (or Mann-Whitney U test if data is not normally distributed) to compare the length of reviews for:

1-star ratings

5-star ratings

Business Impact:
This can help businesses focus on deeply analyzing extreme reviews to better understand pain points or highlight strengths.

📌 Hypothetical Statement 3:
"The mean rating varies significantly based on different restaurant open timings."

Why this statement?
From the Bivariate chart “Rating by Popular Open Hours”, certain restaurant timings appear to be associated with better ratings. We want to explore whether timing has a real impact on customer satisfaction.

What test will we apply?
→ One-Way ANOVA to test if average ratings significantly differ across the top 5-10 popular open timings.

Business Impact:
If timing is proven to influence ratings, restaurants can optimize operational hours, staffing, and menu offerings during highly rated time slots.



### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

🔬 Hypothesis 1 Statement:
"Restaurants with a cost above the median have significantly higher ratings than those below the median."

✅ Null Hypothesis (H₀):
There is no significant difference in the average ratings of restaurants with cost above the median and those with cost below the median.

H₀: μ₁ = μ₂
Where:
μ₁ = mean rating of high-cost restaurants
μ₂ = mean rating of low-cost restaurants

❌ Alternative Hypothesis (H₁):
There is a significant difference in the average ratings of restaurants based on whether their cost is above or below the median.

H₁: μ₁ ≠ μ₂

🎯 Type of test:
We will apply a two-tailed independent T-test (or Mann-Whitney U test if the data is non-normal).
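
The choice between the t-test and the Mann-Whitney U test mentioned above can be sketched as follows. This is a minimal illustration on synthetic ratings, not the project's data; the helper name `compare_two_groups` and the use of a Shapiro-Wilk check as the normality criterion are assumptions made for the example.

```python
import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper: return (test_name, statistic, p_value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shapiro-Wilk on each group; p >= alpha means normality is not rejected
    looks_normal = shapiro(a).pvalue >= alpha and shapiro(b).pvalue >= alpha
    if looks_normal:
        # Welch's t-test: two-tailed, no equal-variance assumption
        stat, p = ttest_ind(a, b, equal_var=False)
        return "welch_t", float(stat), float(p)
    # Rank-based alternative when normality is doubtful
    stat, p = mannwhitneyu(a, b, alternative="two-sided")
    return "mann_whitney_u", float(stat), float(p)

# Synthetic ratings for two cost groups, for illustration only
rng = np.random.default_rng(0)
high_cost = np.clip(rng.normal(3.9, 0.8, 200), 1, 5)
low_cost = np.clip(rng.normal(3.6, 0.9, 200), 1, 5)

name, stat, p = compare_two_groups(high_cost, low_cost)
print(name, round(stat, 3), p)
```

With many thousands of rows, Shapiro-Wilk tends to reject normality for even small deviations, so on the full review table it may be more practical to report both tests side by side.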













#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import ttest_ind

# Keep rows where both Rating and Cost are numeric and present
filtered_df = merged_df[['Rating', 'Cost']].apply(pd.to_numeric, errors='coerce').dropna()

# Split on the median cost, as stated in the hypothesis
# (rows exactly at the median go into the low-cost group)
median_cost = filtered_df['Cost'].median()
high_cost = filtered_df.loc[filtered_df['Cost'] > median_cost, 'Rating']
low_cost = filtered_df.loc[filtered_df['Cost'] <= median_cost, 'Rating']

# Perform Independent T-Test (Welch's version, two-tailed)
t_stat, p_value = ttest_ind(high_cost, low_cost, equal_var=False)

print("Median Cost:", median_cost)
print("T-Statistic:", t_stat)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("Reject Null Hypothesis: Mean rating differs between high-cost and low-cost restaurants.")
else:
    print("Fail to Reject Null Hypothesis: No significant difference in mean rating based on cost group.")



##### Which statistical test have you done to obtain P-Value?

To obtain the P-value for Hypothesis 1, I performed an Independent Samples T-Test (also known as a two-sample t-test), in its Welch form.

🔬 Why this test?
The independent t-test is used when you want to:

Compare the means of two independent groups (in this case, high-cost vs. low-cost restaurants),

Determine whether the observed difference between their means is statistically significant.

🧪 In Our Case:
Group 1: Restaurants with Cost above the median (high-cost)

Group 2: Restaurants with Cost at or below the median (low-cost)

Variable Compared: Rating

The t-test compares the mean rating of these two groups.

##### Why did you choose the specific statistical test?

For the first hypothesis, we used the Independent Samples T-Test because our objective was to determine whether the average rating differs between two independent groups: restaurants with a cost above the median and restaurants with a cost at or below it. This test is well-suited for comparing the mean of a numerical variable (in this case, Rating) between two distinct, unrelated groups. Each row falls into exactly one cost group, so the groups do not overlap. We used Welch's version of the t-test (by setting equal_var=False) to allow for unequal variances between the two groups. The test is two-tailed, matching H₁: μ₁ ≠ μ₂; a significant result with a positive t-statistic would support the directional claim that high-cost restaurants are rated higher.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

📌 Hypothesis 2 Statement (in context):
The average length of customer reviews differs significantly between low-rated restaurants and high-rated restaurants.

🔍 Null Hypothesis (H₀):
There is no significant difference in the average review length between low-rated restaurants (Rating < 4.0) and high-rated restaurants (Rating ≥ 4.0).

✅ Alternate Hypothesis (H₁):
There is a significant difference in the average review length between low-rated and high-rated restaurants.



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import ttest_ind, mannwhitneyu

# Keep rows with a numeric Rating and a review length
h2_df = merged_df[['Rating', 'Review_Length']].apply(pd.to_numeric, errors='coerce').dropna()

# Split reviews at the 4.0 rating threshold used in the hypothesis
high_rated_len = h2_df.loc[h2_df['Rating'] >= 4.0, 'Review_Length']
low_rated_len = h2_df.loc[h2_df['Rating'] < 4.0, 'Review_Length']

# Welch's t-test on the mean review length (two-tailed)
t_stat, p_value = ttest_ind(high_rated_len, low_rated_len, equal_var=False)

# Mann-Whitney U as a rank-based check, since review length is usually right-skewed
u_stat, p_value_mw = mannwhitneyu(high_rated_len, low_rated_len, alternative='two-sided')

# Print result
print("T-Statistic:", t_stat)
print("P-Value (Welch t-test):", p_value)
print("U-Statistic:", u_stat)
print("P-Value (Mann-Whitney U):", p_value_mw)


##### Which statistical test have you done to obtain P-Value?

For Hypothesis 2, the p-value was obtained with Welch's Independent Samples T-Test, with a Mann-Whitney U test run alongside it as a check.

🧠 Why these tests?
They are appropriate because:

We are comparing a numerical variable (Review_Length, the number of words in a review)

Across exactly two independent groups (reviews with Rating ≥ 4.0 and reviews with Rating < 4.0)

Our goal is to test whether the average review length differs between the two groups.

Review length is typically right-skewed (most reviews are short, a few are very long), so the Mann-Whitney U test, which compares ranks instead of means, is reported as well. If both tests agree at the 0.05 level, the conclusion does not depend on the normality assumption.

##### Why did you choose the specific statistical test?

I chose Welch's t-test for Hypothesis 2 because the hypothesis compares the mean of one numerical variable between two independent groups, which is the setting the two-sample t-test is designed for.

In this case, the numerical variable is Review_Length, and the grouping variable is whether the rating is below 4.0 or at least 4.0. The two groups are likely to differ in size and spread, so the Welch form (equal_var=False) was used instead of the pooled-variance t-test.

One-Way ANOVA is not needed here because there are only two groups; with two groups it answers the same question as the t-test. The Mann-Whitney U test was added because it makes no normality assumption, which guards the conclusion against the skew in review lengths.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothesis 3:
Research Question:
Does the time of day at which a review is posted influence the average rating given by users?

Null Hypothesis (H₀):
There is no significant difference in the average rating given by users across different hours of the day.
(Mean rating is independent of the time of review.)

Alternate Hypothesis (H₁):
There is a significant difference in the average rating given by users across different hours of the day.
(Mean rating varies depending on the time the review is posted.)

This hypothesis investigates whether temporal patterns in user reviews affect their sentiment or satisfaction scores. If rejected, it may suggest that certain times of the day are linked with more positive or negative experiences, which can help restaurants in staff scheduling, service quality optimization, and understanding customer behavior trends.
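
A comparison of mean rating across posting times could be set up along the following lines. The sketch uses synthetic reviews, and both the `Day_Part` column and its bin edges are assumptions made for illustration; grouping 24 separate hours into a few day parts is one way to keep every group reasonably large. Kruskal-Wallis is shown next to ANOVA as a rank-based alternative, since ratings are bounded and often skewed.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, kruskal

# Synthetic reviews, for illustration only
rng = np.random.default_rng(1)
minutes = rng.integers(0, 24 * 60, 600)
reviews = pd.DataFrame({
    "Time": pd.to_datetime("2019-05-01") + pd.to_timedelta(minutes, unit="m"),
    "Rating": rng.integers(1, 6, 600).astype(float),
})

# Extract the posting hour and bucket it into four day parts
# (the bin edges are an arbitrary choice for this sketch)
reviews["Hour"] = reviews["Time"].dt.hour
reviews["Day_Part"] = pd.cut(
    reviews["Hour"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

# One array of ratings per day part
groups = [g["Rating"].to_numpy() for _, g in reviews.groupby("Day_Part", observed=True)]

f_stat, p_anova = f_oneway(*groups)      # compares group means
h_stat, p_kruskal = kruskal(*groups)     # compares group rank distributions
print("ANOVA p-value:", p_anova)
print("Kruskal-Wallis p-value:", p_kruskal)
```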

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import f_oneway

# Assumes merged_df has a 'Time' column holding the review timestamp
h3_df = merged_df[['Rating', 'Time']].copy()
h3_df['Rating'] = pd.to_numeric(h3_df['Rating'], errors='coerce')
h3_df['Hour'] = pd.to_datetime(h3_df['Time'], errors='coerce').dt.hour
h3_df = h3_df.dropna(subset=['Rating', 'Hour'])

# One group of ratings per hour of the day (hours with fewer than 2 reviews are skipped)
rating_groups = [g['Rating'].values for _, g in h3_df.groupby('Hour') if len(g) >= 2]

# Perform One-Way ANOVA
f_stat, p_value = f_oneway(*rating_groups)

# Print result
print("F-Statistic:", f_stat)
print("P-Value:", p_value)



##### Which statistical test have you done to obtain P-Value?

To obtain the p-value for Hypothesis 3, I used the One-Way ANOVA (Analysis of Variance) test.

📌 Explanation:
One-Way ANOVA tests whether the mean of a numerical variable is the same across several independent groups. In our case:

Rating: The rating given in a review (the numerical variable being compared).

Hour: The hour of the day at which the review was posted, extracted from the review timestamp (the grouping variable).

✅ Why this test was appropriate:
The hypothesis compares mean ratings across more than two groups (up to 24 hours of the day).

Running a separate t-test for every pair of hours would inflate the Type I error rate; ANOVA tests all group means in a single step.

A significant p-value (typically < 0.05) indicates that at least one hour has a different mean rating. A post-hoc test such as Tukey's HSD would then be needed to identify which hours differ.


##### Why did you choose the specific statistical test?

I chose One-Way ANOVA for Hypothesis 3 because the null hypothesis states that the mean rating is the same across all hours of the day, and ANOVA tests exactly that statement.

📌 Reason for Selection:
Nature of Variables:

Rating is numerical, while the posting hour acts as a categorical grouping variable with many levels.

A correlation test would not fit here, because ratings need not rise or fall steadily over the day; they may peak at some hours and dip at others.

Purpose of the Hypothesis:

The goal was to detect any difference in mean rating between hours, not to measure a linear trend.

ANOVA returns an F-statistic and a p-value that indicate whether the variation between hours is larger than expected from the variation within hours.

Assumptions:

ANOVA assumes independent observations, roughly normal residuals, and similar variances across groups.

Ratings are bounded and often skewed, so a Kruskal-Wallis test is a reasonable non-parametric cross-check if these assumptions look doubtful.



## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
import pandas as pd
import numpy as np

# Step 1: Check Missing Values
print("Initial Missing Values:\n")
print(merged_df.isnull().sum())

# Step 2: Convert numeric columns to proper type
merged_df['Rating'] = pd.to_numeric(merged_df['Rating'], errors='coerce')
merged_df['Cost'] = pd.to_numeric(merged_df['Cost'], errors='coerce')

# Step 3: Impute missing values based on type and logic

# For text-based columns
merged_df['Review'] = merged_df['Review'].fillna('')  # Empty string for missing reviews
merged_df['Timings'] = merged_df['Timings'].fillna('Unknown')  # Default label for missing timings

# For numerical columns
merged_df['Rating'] = merged_df['Rating'].fillna(merged_df['Rating'].mean())  # Mean rating
merged_df['Cost'] = merged_df['Cost'].fillna(merged_df['Cost'].median())      # Median cost

# For categorical object columns (other than Review and Timings)
cat_cols = merged_df.select_dtypes(include='object').columns
for col in cat_cols:
    if merged_df[col].isnull().sum() > 0:
        merged_df[col] = merged_df[col].fillna(merged_df[col].mode()[0])  # Fill with mode

# Final Check
print("\nMissing Values After Imputation:\n")
print(merged_df.isnull().sum())


#### What all missing value imputation techniques have you used and why did you use those techniques?

In the feature engineering and data pre-processing phase, various missing value imputation techniques were applied based on the nature of each column. For numerical columns like Rating, we used mean imputation, as this approach helps maintain the overall average, assuming the ratings are fairly balanced. For the Cost column, which often contains outliers or skewed distributions, median imputation was preferred to avoid distortion caused by extreme values. Categorical columns were imputed using the mode, replacing missing values with the most frequently occurring category to preserve the dominant class. For text-based columns such as Review, missing values were replaced with empty strings, ensuring consistency for later natural language processing without introducing bias. Additionally, for columns like Timings, missing entries were filled with a fixed label such as "Unknown", which clearly signifies unavailable data while allowing the model to treat it as a separate category during training. These strategies were carefully selected to maintain data integrity and prepare the dataset for robust analysis and modeling.
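
The reason for preferring the median on a skewed column can be seen on a toy example. The numbers below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy cost column with one very expensive restaurant and one missing value
cost = pd.Series([300, 400, 500, 600, 700, 5000, np.nan])

mean_filled = cost.fillna(cost.mean())      # mean is pulled up by the 5000 entry
median_filled = cost.fillna(cost.median())  # median stays near the typical cost

print("Mean-imputed value:", mean_filled.iloc[-1])      # 1250.0
print("Median-imputed value:", median_filled.iloc[-1])  # 550.0
```

The mean-imputed value is higher than five of the six observed costs, while the median-imputed value sits in the middle of the typical range.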











### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import pandas as pd
import numpy as np

# Columns to check for outliers
num_cols = ['Cost', 'Rating', 'Review_Length']

# Function to cap outliers using IQR
def cap_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Capping the values (clip leaves in-range values and NaN untouched)
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
    return df

# Apply outlier capping to each numerical column
for col in num_cols:
    merged_df = cap_outliers_iqr(merged_df, col)

# Check if treatment applied
merged_df[num_cols].describe()


##### What all outlier treatment techniques have you used and why did you use those techniques?

✅ Outlier Treatment Techniques Used:
Interquartile Range (IQR) Method for Detection:

What it does: Identifies outliers as values that fall below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

Why: IQR is a non-parametric method, meaning it does not assume a normal distribution and is robust to skewed data, which is common in real-world datasets like restaurant costs or reviews.

Capping (Winsorization):

What it does: Instead of removing outliers, it caps the extreme values to the nearest acceptable threshold (lower or upper bound of IQR).

Why:

Retains the full dataset and avoids loss of information.

Useful when outliers are valid data points (like high-end restaurants with high costs).

Helps prevent models from being overly sensitive to extreme values.

🎯 Why These Techniques Were Chosen:
Preservation of Data: Since we are working on a real-world restaurant dataset, it’s important to keep as much data as possible. Deleting outliers might remove important business insights (e.g., premium restaurants).

Improves Model Stability: Outliers, if untreated, can skew model predictions and lead to overfitting or poor generalization. Capping helps normalize these extreme influences.

Better Visualization & Interpretation: Charts and statistical summaries become more meaningful once extreme distortions are handled.
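
The IQR bounds and the capping step can be traced on a small example. The values and the helper name `iqr_bounds` are illustrative assumptions; the 1.5 multiplier matches the rule described above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Hypothetical helper: return (lower, upper) IQR fences for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy cost values with one extreme entry
cost = pd.Series([200, 300, 400, 500, 600, 700, 800, 5000])

lower, upper = iqr_bounds(cost)
capped = cost.clip(lower=lower, upper=upper)

print("Bounds:", lower, upper)
print("Rows kept:", len(capped), "of", len(cost))
print("Values changed:", int((capped != cost).sum()))
print("New maximum:", capped.max())
```

Only the extreme entry is pulled down to the upper fence; every row is kept, which is the main difference from dropping outliers.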



### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder

# Make a copy to preserve original
df_encoded = merged_df.copy()

# Identify categorical columns
categorical_cols = df_encoded.select_dtypes(include='object').columns.tolist()

# Check which columns have a meaningful order
# For this example, let's assume "Rating_Text" is ordinal
ordinal_mapping = {
    'Poor': 1,
    'Average': 2,
    'Good': 3,
    'Very Good': 4,
    'Excellent': 5
}
if 'Rating_Text' in df_encoded.columns:
    df_encoded['Rating_Text'] = df_encoded['Rating_Text'].map(ordinal_mapping)
    categorical_cols.remove('Rating_Text')  # Already encoded

# Apply Label Encoding to columns with many categories (optional)
label_enc = LabelEncoder()
high_cardinality = [col for col in categorical_cols if df_encoded[col].nunique() > 10]

for col in high_cardinality:
    df_encoded[col] = label_enc.fit_transform(df_encoded[col].astype(str))
    categorical_cols.remove(col)

# One-hot encode the rest
df_encoded = pd.get_dummies(df_encoded, columns=categorical_cols, drop_first=True)

# Check encoded dataframe
df_encoded.head()


#### What all categorical encoding techniques have you used & why did you use those techniques?

In the categorical encoding process for the Zomato dataset, we used three key techniques, each chosen based on the nature of the variables and the goals of modeling:

✅ 1. Manual Ordinal Encoding (Mapping Ordered Categories):
Applied to: Columns like Rating_Text (e.g., "Poor", "Average", "Good", "Very Good", "Excellent")

Technique: Manually mapped these values to ordered integers (e.g., Poor → 1, Excellent → 5).

Why: This preserves the inherent ranking in the data, allowing models to understand progression in quality.

✅ 2. Label Encoding:
Applied to: Object columns with more than 10 distinct values (the threshold used in the code), such as Name, Restaurant, or Reviewer.

Technique: Used LabelEncoder from sklearn to convert categories into integers.

Why: Label encoding is memory-efficient and suitable when categories are too many for one-hot encoding, especially for tree-based algorithms like Random Forests or XGBoost. The integer codes are arbitrary, so distance-based methods such as K-Means would read them as an ordering that does not exist; these columns should be treated with care in such models.

✅ 3. One-Hot Encoding:
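Applied to: The remaining nominal columns, i.e., object columns with 10 or fewer distinct values. Columns such as Cuisines, Collections, or Timings are one-hot encoded only if they stay under that threshold; otherwise the code label-encodes them.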

Technique: Created binary columns for each category using pd.get_dummies() with drop_first=True.

Why: One-hot encoding is ideal for nominal data with no intrinsic ordering. It ensures the model doesn’t assume any hierarchy among categories.
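
When a column holds several categories in one string, a multi-label variant of one-hot encoding is an option. The sketch below assumes the Cuisines field is a comma-separated list, which should be checked against the actual data; the toy rows are made up.

```python
import pandas as pd

# Toy rows; the comma-separated Cuisines format is an assumption
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Cuisines": ["North Indian, Chinese", "Chinese", "Italian, North Indian"],
})

# One binary column per individual cuisine, not per full string
cuisine_flags = toy["Cuisines"].str.get_dummies(sep=", ")
encoded = pd.concat([toy[["Name"]], cuisine_flags], axis=1)

print(encoded)
```

This keeps the number of new columns equal to the number of distinct cuisines, whereas treating each full string as one category would create a column for every combination.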



### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
# Install contractions library if not already installed
!pip install contractions

import contractions
import pandas as pd

# Sample check: fill NaNs in Review column if needed
merged_df['Review'] = merged_df['Review'].fillna("")

# Expand contractions
merged_df['Expanded_Review'] = merged_df['Review'].apply(lambda x: contractions.fix(x))

# Preview original and expanded reviews
merged_df[['Review', 'Expanded_Review']].head()


#### 2. Lower Casing

In [None]:
# Lower Casing
# Convert expanded reviews to lowercase
merged_df['Lower_Review'] = merged_df['Expanded_Review'].apply(lambda x: x.lower())

# Preview original, expanded, and lowercased reviews
merged_df[['Review', 'Expanded_Review', 'Lower_Review']].head()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# Function to remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Apply to lowercased review text
merged_df['NoPunct_Review'] = merged_df['Lower_Review'].apply(remove_punctuation)

# Preview the updated columns
merged_df[['Lower_Review', 'NoPunct_Review']].head()


#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits
import re

# Function to remove URLs
def remove_urls(text):
    return re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

# Function to remove words containing digits
def remove_words_with_digits(text):
    return re.sub(r'\w*\d\w*', '', text)

# Apply to punctuation-removed reviews
merged_df['Cleaned_Review'] = merged_df['NoPunct_Review'].apply(remove_urls)
merged_df['Cleaned_Review'] = merged_df['Cleaned_Review'].apply(remove_words_with_digits)

# Preview result
merged_df[['NoPunct_Review', 'Cleaned_Review']].head()


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already
nltk.download('stopwords')

# Define stopword set
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

# Apply to cleaned text
merged_df['NoStopword_Review'] = merged_df['Cleaned_Review'].apply(remove_stopwords)

# Preview result
merged_df[['Cleaned_Review', 'NoStopword_Review']].head()


In [None]:
# Remove White spaces
# Function to remove extra white spaces
def remove_whitespace(text):
    return " ".join(text.split())

# Apply the function to the column with stopwords removed
merged_df['Final_Review'] = merged_df['NoStopword_Review'].apply(remove_whitespace)

# Preview result
merged_df[['NoStopword_Review', 'Final_Review']].head()


#### 6. Rephrase Text

In [None]:
# Rephrase Text
import re
import string
import nltk
nltk.download('punkt_tab')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = str(text).lower()  # lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # remove URLs
    text = re.sub(r'\w*\d\w*', '', text)  # remove words with digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]  # remove stopwords
    return " ".join(tokens).strip()  # join back to string

# Apply cleaning to create 'Clean_Review'
merged_df['Clean_Review'] = merged_df['Review'].apply(clean_text)



#### 7. Tokenization

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize

# Tokenize the Clean_Review column
merged_df['Tokenized_Review'] = merged_df['Clean_Review'].apply(word_tokenize)

# Display first few tokenized reviews
merged_df[['Clean_Review', 'Tokenized_Review']].head()


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Resources for the lemmatizer and the POS tagger
# (the '_eng' tagger name is the one newer NLTK releases look for)
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

lemmatizer = WordNetLemmatizer()

# Map the first letter of a Penn Treebank tag to a WordNet POS constant
TAG_MAP = {
    'J': wordnet.ADJ,
    'N': wordnet.NOUN,
    'V': wordnet.VERB,
    'R': wordnet.ADV
}

def lemmatize_tokens(tokens):
    # Tag the whole review once so the tagger sees context,
    # instead of calling pos_tag separately for every word
    tagged = nltk.pos_tag(tokens)
    return [lemmatizer.lemmatize(word, TAG_MAP.get(tag[0], wordnet.NOUN)) for word, tag in tagged]

# Apply Lemmatization
merged_df['Lemmatized_Review'] = merged_df['Tokenized_Review'].apply(lemmatize_tokens)

# Show result
merged_df[['Tokenized_Review', 'Lemmatized_Review']].head()


##### Which text normalization technique have you used and why?

In the textual data preprocessing phase, I used lemmatization as the primary text normalization technique. Lemmatization maps each word to its dictionary form (for example, "dishes" to "dish") using its part of speech, so inflected variants count as one feature. This reduces redundancy in the vocabulary while keeping real words, which stemming does not guarantee, and that consistency matters for tasks like sentiment analysis, classification, or recommendation in the Zomato dataset context.

#### 9. Part of speech tagging

In [None]:
# POS Tagging
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Download required NLTK POS resources (run once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Example: Apply POS tagging to the first 5 cleaned reviews
for i, review in enumerate(merged_df['Clean_Review'].head(5)):
    print(f"\n🔹 Review {i+1}:")
    tokens = word_tokenize(review)
    pos_tags = pos_tag(tokens)
    print(pos_tags)


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')

# Fit and transform the clean reviews
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['Clean_Review'])

# Convert to DataFrame (optional)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print("TF-IDF Vectorized Shape:", tfidf_df.shape)


##### Which text vectorization technique have you used and why?

For this Zomato reviews project, TF-IDF (Term Frequency–Inverse Document Frequency) vectorization is the most suitable technique for representing textual data. Unlike simple count-based methods, TF-IDF not only captures how frequently a word appears in a review but also adjusts for how common that word is across all reviews. This makes it ideal for identifying the most informative and distinctive words in each review, which is crucial for tasks like sentiment analysis, rating prediction, or clustering. By reducing the influence of generic terms like "restaurant" or "food" that occur frequently across many reviews, TF-IDF helps improve the accuracy and focus of machine learning models. It also provides a sparse yet meaningful feature representation that works well with traditional algorithms like Logistic Regression, Support Vector Machines, and Random Forests. Therefore, TF-IDF was chosen because it strikes a good balance between informativeness and performance, making it more effective than basic count vectorization for extracting valuable insights from the review text.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
import pandas as pd
import numpy as np

# Step 1: Fix 'Cost' column – remove commas and non-numeric entries
merged_df['Cost'] = merged_df['Cost'].astype(str).str.replace(',', '')
merged_df['Cost'] = merged_df['Cost'].str.extract(r'(\d+\.?\d*)')[0]  # extract numeric part
merged_df['Cost'] = pd.to_numeric(merged_df['Cost'], errors='coerce')  # convert to float safely

# Step 2: Fix 'Rating' column – remove invalid entries like 'Like', 'NEW', 'Not rated'
invalid_ratings = ['Like', 'NEW', 'Not rated']
merged_df['Rating'] = merged_df['Rating'].astype(str)
merged_df['Rating'] = merged_df['Rating'].apply(lambda x: x if x not in invalid_ratings else np.nan)
merged_df['Rating'] = pd.to_numeric(merged_df['Rating'], errors='coerce')

# Step 3: Create 'Cost_per_person' feature
merged_df['Cost_per_person'] = merged_df['Cost'] / 2

# Step 4: Create 'Review_Length' if not already present
if 'Review_Length' not in merged_df.columns:
    merged_df['Review'] = merged_df['Review'].fillna("")
    merged_df['Review_Length'] = merged_df['Review'].apply(lambda x: len(str(x).split()))

# Step 5: Drop 'Cost', which is perfectly correlated with 'Cost_per_person' (Cost / 2)
if 'Cost' in merged_df.columns:
    merged_df.drop(columns=['Cost'], inplace=True)

# Step 6: Create binary 'Highly_Rated' feature
merged_df['Highly_Rated'] = merged_df['Rating'].apply(lambda x: 1 if pd.notnull(x) and x >= 4.0 else 0)

# Step 7: Check correlation matrix
corr_matrix = merged_df[['Cost_per_person', 'Review_Length', 'Rating', 'Highly_Rated']].corr()
print("Correlation Matrix:\n", corr_matrix)

# Final preview
print(merged_df[['Cost_per_person', 'Review_Length', 'Highly_Rated']].head())


#### 2. Feature Selection

In [None]:
import pandas as pd
import numpy as np

# Convert Rating to numeric (handle errors)
merged_df['Rating'] = pd.to_numeric(merged_df['Rating'], errors='coerce')

# Create binary target variable: 1 if rating >= 4.0, else 0
merged_df['Highly_Rated'] = merged_df['Rating'].apply(lambda x: 1 if x >= 4.0 else 0)

# Drop columns that are identifiers, have too many unique categories, or are unstructured text
columns_to_drop = [
    'Restaurant', 'Reviews', 'Review', 'Clean_Review', 'Menu_Items',
    'Address', 'Location', 'Time', 'Timings', 'Phone', 'URL', 'Unnamed: 0',
    'Collections', 'Cuisines', 'Name',  # Add others as necessary
    # Intermediate text columns created during text preprocessing
    'Expanded_Review', 'Lower_Review', 'NoPunct_Review', 'Cleaned_Review',
    'NoStopword_Review', 'Final_Review', 'Tokenized_Review', 'Lemmatized_Review'
]

# Drop only if they exist in the DataFrame
merged_df = merged_df.drop(columns=[col for col in columns_to_drop if col in merged_df.columns])

# Drop rows with NaNs (optional, depends on model choice)
merged_df = merged_df.dropna()

# Check final selected features
print("✅ Final Selected Features:\n", merged_df.columns.tolist())

# Check class balance
print("\n✅ Highly_Rated Class Distribution:\n", merged_df['Highly_Rated'].value_counts())


##### What all feature selection methods have you used  and why?

In the feature selection process for this project, the following methods and rationale were used to minimize overfitting, reduce dimensionality, and retain only the most relevant features:

🔹 1. Domain Knowledge-Based Filtering
What we did: Removed columns like Restaurant, Phone, URL, Address, Time, and other identifiers.

Why: These columns do not contribute to the prediction of rating or performance. They are identifiers or contain unstructured/unavailable data that can lead to data leakage or overfitting.

🔹 2. High Cardinality Categorical Filtering
What we did: Removed features like Menu_Items, Cuisines, and Collections.

Why: These columns often have too many unique values (high cardinality), making them hard to encode properly and increasing model complexity unnecessarily.

🔹 3. Text Feature Elimination
What we did: Dropped free-text columns like Review, Clean_Review, etc., from the feature set during classification.

Why: These require NLP pipelines and vectorization. While valuable, they were handled separately and not directly used as raw input to the model.

🔹 4. Missing Value-Based Removal
What we did: Dropped features/rows with excessive or hard-to-impute missing values.

Why: Prevents introducing noise or bias through poor imputation; keeps the model training clean.

🔹 5. Correlation Analysis (Optional step)
What we plan: Before final modeling, features with high intercorrelation can be dropped.

Why: Highly correlated features (multicollinearity) may confuse the model and reduce generalizability.
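
The optional correlation step could look roughly like this. The helper name `drop_highly_correlated`, the 0.9 threshold, and the toy values are assumptions for the sketch; the idea is to scan the upper triangle of the absolute correlation matrix and drop the later column of each highly correlated pair.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Hypothetical helper: drop the later column of each pair with |r| > threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy features: Cost_per_person is exactly half of Cost
toy = pd.DataFrame({
    "Cost": [400, 800, 1200, 1600, 500],
    "Cost_per_person": [200, 400, 600, 800, 250],
    "Review_Length": [12, 40, 7, 25, 60],
})

reduced, dropped = drop_highly_correlated(toy)
print("Dropped:", dropped)
print("Remaining:", reduced.columns.tolist())
```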



##### Which all features you found important and why?

Business Impact-Oriented: Features like Cost, Votes, and Book_table help differentiate premium vs. budget segments.

User Behavior: Online_order, Has_Online_delivery, and Hour show how user preferences and usage patterns evolve.

Textual/NLP Integration: Review_Length and Sentiment bridge numerical data with user opinions, adding behavioral depth.

Engineered Features: Creating columns like Average_Cost_Per_Word adds business logic and improves feature richness.

Avoided Redundancy: Chose only non-redundant, low-correlation features to keep the model generalized.

🧠 Summary:
The selected features were retained because they either directly influence customer satisfaction (Rating, Votes, Sentiment) or represent critical business aspects (Cost, Online_order, Hour). Additionally, features were filtered to avoid overfitting by removing highly correlated, high-cardinality, or irrelevant data.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, data transformation was necessary for this project due to the presence of features with different scales and distributions, which can negatively impact model performance, especially for algorithms sensitive to feature magnitudes.

✅ Why Data Transformation Was Needed:
Feature Magnitude Imbalance:
Numerical features like Votes, Average_Cost_Per_Word, and Cost were on vastly different scales, which could mislead distance-based or regularized models.

Skewed Distributions:
Some features, particularly cost-related variables, showed skewness. Standardization only shifts and rescales a feature, so it does not remove skew by itself; where skew matters, a log transformation is applied before scaling.
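
To make the point about skewness concrete, here is a small sketch on synthetic data. The log-normal `cost` series is an assumption standing in for a skewed cost column, not the project's data. It compares the skewness of the raw values, the standardized values, and the `log1p`-transformed values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic right-skewed "cost" values (assumption: log-normal shape).
rng = np.random.default_rng(42)
cost = pd.Series(rng.lognormal(mean=6.0, sigma=0.8, size=5000), name="Cost")

# Standardization is a linear rescaling, so the skewness stays the same.
scaled = pd.Series(StandardScaler().fit_transform(cost.to_frame()).ravel())

# log1p compresses the long right tail and pulls the skewness toward zero.
logged = np.log1p(cost)

print(f"skew raw:    {cost.skew():.2f}")
print(f"skew scaled: {scaled.skew():.2f}")
print(f"skew log1p:  {logged.skew():.2f}")
```

If a log transform is used, it would normally come before StandardScaler so that the scaler sees the less skewed values.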

🔄 Which Transformation Was Used:
✅ StandardScaler from sklearn.preprocessing

🧠 Why StandardScaler?
Centers the data (mean = 0) and scales to unit variance (std = 1).

Ensures all numeric features contribute equally to the model.

Ideal for:

Models using distance metrics (KNN, SVM).

Gradient descent-based models (Logistic Regression, Neural Networks).

Regularization-based models (Ridge, Lasso).

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Step 1: Auto-detect relevant columns
possible_cost_cols = [col for col in merged_df.columns if 'cost' in col.lower()]
cost_col = possible_cost_cols[0] if possible_cost_cols else None

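# Prefer the exact 'Review' column; a plain substring match would pick up 'Reviewer' first
possible_review_cols = [col for col in merged_df.columns
                        if 'review' in col.lower() and col.lower() != 'reviewer']
review_col = 'Review' if 'Review' in merged_df.columns else (
    possible_review_cols[0] if possible_review_cols else None)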

# Step 2: Clean and convert cost column
if cost_col:
    merged_df[cost_col] = merged_df[cost_col].astype(str).str.replace('₹', '', regex=False).str.replace(',', '', regex=False)
    merged_df[cost_col] = pd.to_numeric(merged_df[cost_col], errors='coerce')
    merged_df.rename(columns={cost_col: 'Cost'}, inplace=True)
else:
    merged_df['Cost'] = np.nan

# Step 3: Handle Votes column
if 'Votes' in merged_df.columns:
    merged_df['Votes'] = pd.to_numeric(merged_df['Votes'], errors='coerce')
else:
    merged_df['Votes'] = 0

# Step 4: Handle Review Length
if review_col:
    merged_df[review_col] = merged_df[review_col].fillna("")
    merged_df['Review_Length'] = merged_df[review_col].apply(lambda x: len(str(x).split()))
else:
    merged_df['Review_Length'] = 0

# Step 5: Create new feature
merged_df['Average_Cost_Per_Word'] = merged_df['Cost'] / (merged_df['Review_Length'] + 1)

# Step 6: Handle missing values
cols_to_fill = ['Cost', 'Votes', 'Average_Cost_Per_Word']
merged_df[cols_to_fill] = merged_df[cols_to_fill].fillna(0)

# Step 7: Feature Scaling
scaler = StandardScaler()
merged_df[cols_to_fill] = scaler.fit_transform(merged_df[cols_to_fill])

# Final Output
print("✅ Final Scaled Features Preview:")
print(merged_df[cols_to_fill].head())


### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

# Step 1: Select numeric columns (excluding target or categorical)
numeric_cols = merged_df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Optionally remove any target column if needed (e.g., Rating)
if 'Rating' in numeric_cols:
    numeric_cols.remove('Rating')

# Step 2: Initialize scaler
scaler = StandardScaler()

# Step 3: Fit and transform the data
scaled_data = scaler.fit_transform(merged_df[numeric_cols])

# Step 4: Create a DataFrame with scaled values
# Reuse merged_df's index so the assignment below aligns row by row
scaled_df = pd.DataFrame(scaled_data, columns=numeric_cols, index=merged_df.index)

# Step 5: Replace original values in merged_df with scaled values
merged_df[numeric_cols] = scaled_df

# ✅ Preview
print("✅ Scaled numerical columns:")
print(merged_df[numeric_cols].head())



##### Which method have you used to scale your data and why?

In this project, I used the StandardScaler method from sklearn.preprocessing to scale the numerical features in the dataset.

✅ Method Used: StandardScaler
📌 Why was StandardScaler chosen?
Removes Bias from Feature Scale:
StandardScaler transforms features to have a mean of 0 and a standard deviation of 1. This is critical because machine learning models, especially those that rely on distance or gradient-based optimization, are sensitive to the scale of the data.

Suitable for Normally Distributed Data:
StandardScaler works well when features are approximately normally distributed, which many of the features became after outlier treatment and log transformation (where needed).

Improves Model Performance:
Algorithms like:

Logistic Regression

K-Nearest Neighbors (KNN)

Support Vector Machines (SVM)

Principal Component Analysis (PCA)
all perform better when input features are standardized.

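Does Not Impose a Fixed Range:
Both StandardScaler and MinMaxScaler are linear rescalings, so neither changes the shape of a distribution. The difference is that MinMaxScaler maps each feature into a fixed range (by default [0, 1]) set by the observed minimum and maximum, whereas StandardScaler centers and scales by the mean and standard deviation and leaves values unbounded, which is useful when not all data is bounded.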
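
One practical caveat when using StandardScaler for supervised models is to learn the mean and standard deviation from the training rows only and reuse them on the test rows. The sketch below shows that pattern on synthetic data; the two columns are stand-ins for features such as Cost and Votes and are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numeric features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[500.0, 50.0], scale=[200.0, 20.0], size=(1000, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_scaled = scaler.transform(X_test)        # same statistics reused, no refit

print("train mean/std:", X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
print("test  mean/std:", X_test_scaled.mean(axis=0).round(3), X_test_scaled.std(axis=0).round(3))
```

The test set ends up close to, but not exactly at, mean 0 and standard deviation 1, which is expected because its statistics were never used for fitting.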



### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, dimensionality reduction should be considered, especially after:

Text vectorization (which creates high-dimensional sparse matrices).

One-hot encoding (which increases feature space).

Using methods like PCA or TruncatedSVD (for sparse data) can help retain maximum variance while reducing the number of features, thus improving both efficiency and model performance.
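
For the sparse case mentioned above, a minimal sketch of TF-IDF vectorization followed by TruncatedSVD might look like this. The six review strings and the choice of 3 components are made up for illustration and are not taken from the dataset.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical review texts, used only to produce a small sparse matrix.
reviews = [
    "great food and quick service",
    "food was cold and service was slow",
    "amazing biryani, will order again",
    "slow delivery but tasty biryani",
    "terrible experience, cold food",
    "lovely ambience and great service",
]

tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(reviews)       # sparse matrix, one column per term

# TruncatedSVD works directly on sparse input, without densifying or centering it.
svd = TruncatedSVD(n_components=3, random_state=42)
X_reduced = svd.fit_transform(X_sparse)       # dense array with 3 columns

print(X_sparse.shape, "->", X_reduced.shape)
print("explained variance ratio (sum):", svd.explained_variance_ratio_.sum())
```

PCA centers the data first, which would turn a sparse matrix into a dense one; that is the main reason TruncatedSVD is the usual choice after text vectorization.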










In [None]:
# Dimensionality Reduction (if needed)
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your scaled dataset is stored in a variable called `scaled_df`
# and it’s a DataFrame (not a NumPy array), with only numerical values

# Step 1: Apply PCA
pca = PCA(n_components=0.95)  # Retain 95% of the variance
pca_features = pca.fit_transform(scaled_df)

# Step 2: Create a new DataFrame with reduced components
pca_df = pd.DataFrame(pca_features, columns=[f'PC{i+1}' for i in range(pca_features.shape[1])])

# Step 3: Explained variance plot
plt.figure(figsize=(10, 6))
sns.lineplot(x=range(1, len(pca.explained_variance_ratio_)+1),
             y=pca.explained_variance_ratio_.cumsum(), marker='o')
plt.title('Explained Variance by Number of Principal Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# Optional: View how many components were selected
print(f"Original features: {scaled_df.shape[1]}")
print(f"Reduced to: {pca_df.shape[1]} principal components (retaining 95% variance)")


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

For dimensionality reduction, I used Principal Component Analysis (PCA). It is a linear technique suited to numerical, scaled data like ours, and it compresses the feature space without significant loss of information, which made it the most appropriate choice for this phase of the project.

🧠 Why PCA Was Used:
PCA is one of the most widely used and effective techniques for reducing the dimensionality of numerical datasets while retaining most of the original variance. In our project, after encoding categorical features and scaling numerical data, the dataset had a high number of features, some of which were correlated or contributed very little unique information. Including such redundant features can lead to:

Overfitting in machine learning models

Increased computational cost

Difficulty in visualization and interpretation

By using PCA, we were able to:

Reduce multicollinearity by transforming correlated features into uncorrelated principal components.

Preserve 95% of the variance, ensuring minimal loss of information.

Improve model performance and training time by reducing the noise and dimensional burden.
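
The two PCA properties listed above (uncorrelated components and a retained-variance target) can be checked on a small synthetic example. The data below is an assumption: three independent signals plus three noisy copies of them, which mimics redundant features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 3))
# Six columns: three independent signals plus three noisy copies (strongly correlated pairs).
X = np.hstack([base, base + 0.1 * rng.normal(size=(500, 3))])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)          # keep enough components for 95% of the variance
components = pca.fit_transform(X_scaled)

# Correlation between the resulting components should be zero off the diagonal.
corr = np.corrcoef(components, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))

print("original features:", X.shape[1])
print("components kept:", pca.n_components_)
print("max |off-diagonal correlation|:", np.abs(off_diag).max())
print("variance retained:", pca.explained_variance_ratio_.sum())
```

Because each noisy copy adds almost no new information, roughly half of the columns are enough to pass the 95% threshold here.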



### 8. Data Splitting

In [None]:
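# Start from the last cleaned and preprocessed DataFrame (merged_df in this notebook)
final_df = merged_df.copy()

# stratify=y below cannot handle missing labels, so make Rating numeric and drop rows without one
final_df['Rating'] = pd.to_numeric(final_df['Rating'], errors='coerce')
final_df = final_df.dropna(subset=['Rating'])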

# Drop non-predictive or non-numeric columns and define features (X) and target (y)
X = final_df.drop(columns=['Rating', 'Restaurant', 'Reviewer', 'Review', 'Metadata', 'Time',
                           'Pictures', 'Restaurant_clean', 'Name', 'Links', 'Timings',
                           'Name_clean'], errors='ignore')  # errors='ignore' avoids crash if column not found

# Define the target variable
y = final_df['Rating']

# Perform Train-Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # helpful if 'Rating' is categorical
)

# Output shapes
print("✅ Final Shapes:")
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)


##### What data splitting ratio have you used and why?

In this project, I used an 80:20 data splitting ratio, meaning 80% of the data was used for training, and 20% was used for testing.

✅ Why this ratio?
This is a widely accepted standard in data science and machine learning for the following reasons:

Training Efficiency:
Allocating 80% of the data to training ensures the model has enough examples to learn meaningful patterns, especially for larger datasets.

Reliable Evaluation:
Holding out 20% for testing gives a sufficient sample to evaluate model performance on unseen data, helping us understand how well it generalizes.

Balanced Trade-off:
An 80:20 split balances two risks: with too little training data the model may underfit, and with too small a test set the performance estimate becomes noisy and unreliable.

🔁 When Would You Use Different Ratios?
90:10 – When the dataset is very large and generalization is already robust.

70:30 – When the dataset is smaller and you want a better test estimate.

Stratified Splitting – When your target variable (e.g., Rating) is imbalanced, stratified sampling ensures both train and test sets have similar class distributions.
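
The effect of stratified splitting can be seen on a toy label column. The 70/20/10 class mix below is hypothetical and chosen only so the proportions are easy to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70% High, 20% Medium, 10% Low.
y = pd.Series(["High"] * 700 + ["Medium"] * 200 + ["Low"] * 100, name="Rating_Group")
X = pd.DataFrame({"row_id": range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits keep the same class proportions as the full label column.
print("train:", y_train.value_counts(normalize=True).round(2).to_dict())
print("test: ", y_test.value_counts(normalize=True).round(2).to_dict())
```

Note that `stratify` raises an error if any class has fewer than two samples, so very rare rating values may need to be grouped first.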













### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, based on the exploratory data analysis and the distribution of key categorical features—especially the 'Rating' column—it appears that the dataset is imbalanced.

📊 Why is the dataset imbalanced?
When we plotted the distribution of 'Rating', we observed that:

A large proportion of the reviews are clustered around higher ratings (e.g., 4.0, 4.5, 5.0).

Lower ratings (e.g., 2.0, 2.5, 3.0) are significantly fewer in number.

This imbalance in class frequencies indicates that the dataset favors positive reviews more heavily than negative or average ones.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your final DataFrame is named `final_df`
# STEP 1: Check if 'Rating' column exists
if 'Rating' not in final_df.columns:
    print("Rating column not found!")
else:
    # STEP 2: Convert 'Rating' column to numeric (if not already)
    final_df['Rating'] = pd.to_numeric(final_df['Rating'], errors='coerce')

    # STEP 3: Drop NaNs if any Ratings couldn't be converted
    final_df = final_df.dropna(subset=['Rating'])

    # STEP 4: Define a function to group ratings
    def group_rating(rating):
        if rating >= 4.0:
            return 'High'
        elif rating >= 2.5:
            return 'Medium'
        else:
            return 'Low'

    # STEP 5: Apply the function to create 'Rating_Group'
    final_df['Rating_Group'] = final_df['Rating'].apply(group_rating)

    # STEP 6: Plot the distribution of Rating Groups
    plt.figure(figsize=(8, 5))
    sns.countplot(x='Rating_Group', data=final_df, order=['Low', 'Medium', 'High'], palette='Set2')
    plt.title("Distribution of Rating Groups", fontsize=14)
    plt.xlabel("Rating Category", fontsize=12)
    plt.ylabel("Number of Samples", fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.5)
    plt.tight_layout()
    plt.show()


##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

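To handle the imbalanced dataset, the technique chosen was SMOTE (Synthetic Minority Over-sampling Technique). It is meant to be applied only when the Rating_Group column shows that one class (such as "High" or "Medium") has significantly more samples than the others, which skews the dataset and is likely to introduce bias during model training. Resampling should be fitted on the training split only, so that the test set keeps the real class distribution. Note that the model cells below are fitted on the original data without a resampling step.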

✅ Why SMOTE was used:
Addresses Class Imbalance: SMOTE creates synthetic examples of the minority class instead of simply duplicating them, helping to balance the dataset.

Improves Model Generalization: A balanced dataset helps the model learn equally from all classes, which reduces the risk of overfitting to the dominant class.

Works Well With Numerical Data: Since we vectorized and scaled the features, SMOTE works efficiently by interpolating between minority samples.

Better than Random Oversampling: Unlike random oversampling, SMOTE doesn't just replicate samples — it creates new, plausible data points which helps avoid overfitting.
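
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_like_oversample`, the toy `low_rated` points, and `k=3` are assumptions made for the example.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]    # indices of the k nearest neighbours

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))         # pick a random minority sample
        neighbour = rng.choice(neighbours[base])     # pick one of its nearest neighbours
        gap = rng.random()                           # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic


# Hypothetical minority-class rows with two scaled features.
low_rated = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])
new_rows = smote_like_oversample(low_rated, n_new=6)
print(new_rows.shape)
```

In practice the `SMOTE` class from the imbalanced-learn package would be used instead, via `fit_resample` on the training data.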



## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
!pip install scikit-optimize

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from skopt import BayesSearchCV
from skopt.space import Real, Categorical

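# Step 1: Copy dataset
df = merged_df.copy()

# ❗ Step 2: Make 'Rating' numeric and drop rows where it is missing
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df = df.dropna(subset=['Rating'])

# Step 3: Define features and target
# Identifier and free-text columns are dropped first (same list as in Data Splitting);
# one-hot encoding them would create one column per unique value.
drop_cols = ['Restaurant', 'Reviewer', 'Review', 'Metadata', 'Time', 'Pictures',
             'Restaurant_clean', 'Name', 'Links', 'Timings', 'Name_clean']
x = df.drop(columns=['Rating'] + drop_cols, errors='ignore')
# Assumption: ratings are rounded to whole stars so the classifier sees discrete classes
y = df['Rating'].round().astype(int)

# Step 4: Encode categorical features
x = pd.get_dummies(x, drop_first=True)

# Step 5: Train-test split (done before imputation so test rows do not influence the imputer)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Step 6: Impute missing values using training-set means only
imputer = SimpleImputer(strategy='mean')
x_train = imputer.fit_transform(x_train)
x_test = imputer.transform(x_test)

# Step 7: Train Logistic Regression model
model1 = LogisticRegression(max_iter=1000)
model1.fit(x_train, y_train)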


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Step 8: Predict on test data
y_pred1 = model1.predict(x_test)

# Step 9: Accuracy Score
print("✅ Accuracy Score:", accuracy_score(y_test, y_pred1))

# Step 10: Classification Report
print("\n✅ Classification Report:\n", classification_report(y_test, y_pred1))

# Step 11: Confusion Matrix
cm = confusion_matrix(y_test, y_pred1)

# Step 12: Heatmap Visualization
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("📊 Confusion Matrix - Logistic Regression")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

#Show class distribution
sns.countplot(x=y_test)
plt.title("Target Class Distribution (y_test)")
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Step 1: Copy the dataset
df = merged_df.copy()

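# Step 2: Make 'Rating' numeric and drop rows with missing target
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df = df.dropna(subset=['Rating'])

# Step 3: Define features and target
# Text/identifier columns are dropped before encoding; ratings are rounded to whole-star classes (assumption)
text_cols = ['Restaurant', 'Reviewer', 'Review', 'Metadata', 'Time', 'Pictures', 'Links']
X = df.drop(columns=['Rating'] + text_cols, errors='ignore')
y = df['Rating'].round().astype(int)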

# Step 4: One-hot encode categorical variables
X_encoded = pd.get_dummies(X, drop_first=True)

# Step 5: Build pipeline for imputation, scaling, and modeling
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000))
])

# Step 6: 5-Fold Cross Validation
cv_scores = cross_val_score(pipeline, X_encoded, y, cv=5, scoring='accuracy')

# Step 7: Print results
print("✅ Cross-Validation Accuracy Scores:", cv_scores)
print("📈 Mean Accuracy:", np.mean(cv_scores))
print("📉 Standard Deviation:", np.std(cv_scores))


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from scipy.stats import uniform

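# Step 1: Load and prepare data
df = merged_df.copy()
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df = df.dropna(subset=['Rating'])

# Text/identifier columns are dropped before encoding; ratings are rounded to whole-star classes (assumption)
text_cols = ['Restaurant', 'Reviewer', 'Review', 'Metadata', 'Time', 'Pictures', 'Links']
X = df.drop(columns=['Rating'] + text_cols, errors='ignore')
y = df['Rating'].round().astype(int)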

# Step 2: One-hot encode categorical features
X_encoded = pd.get_dummies(X, drop_first=True)

# Step 3: Feature selection to reduce dimensionality
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X_encoded)

# Step 4: Build the pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=500))
])

# Step 5: Define the hyperparameter search space
param_dist = {
    'model__C': uniform(0.01, 10),
    'model__penalty': ['l2'],
    'model__solver': ['lbfgs']  # Efficient for medium-size data
}

# Step 6: Run RandomizedSearchCV
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    scoring='accuracy',
    n_jobs=1,      # Prevents worker crash
    verbose=1,
    random_state=42
)

# Step 7: Fit the model
random_search.fit(X_selected, y)

# Step 8: Output results
print("✅ Best Parameters:", random_search.best_params_)
print("✅ Best Accuracy:", random_search.best_score_)


##### Which hyperparameter optimization technique have you used and why?

For hyperparameter optimization of **Model 1 (Logistic Regression)**, we used **RandomizedSearchCV**. This technique was chosen because our dataset contains approximately **10,000 rows and several encoded feature columns**, making it moderately large. Using **GridSearchCV** in this case would be computationally expensive and memory-intensive, as it exhaustively searches all combinations of hyperparameters, which could lead to issues such as process termination or system slowdowns. In contrast, **RandomizedSearchCV** offers a more efficient alternative by randomly sampling a fixed number of hyperparameter combinations from the specified search space. This makes it significantly faster while still yielding good results, especially when we don’t have prior knowledge of the optimal parameter values. It also allows broader exploration of the hyperparameter space with fewer resources, reducing the risk of memory errors and long execution times. Hence, **RandomizedSearchCV** strikes a practical balance between **performance, speed, and resource usage**, making it the most suitable choice for our model and dataset.
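
A note on the search space: `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, so `uniform(0.01, 10)` covers 0.01 to 10.01 and rarely proposes small values of `C`. For a regularization strength that spans several orders of magnitude, a log-uniform distribution is a common alternative. The sketch below compares the two; it is a suggestion under that assumption, not the configuration used in the cell above.

```python
from scipy.stats import loguniform, uniform

# 1000 draws from each candidate distribution for the regularization strength C.
c_uniform = uniform(0.01, 10).rvs(size=1000, random_state=0)    # range [0.01, 10.01]
c_log = loguniform(0.01, 10).rvs(size=1000, random_state=0)     # range [0.01, 10]

share_small_uniform = (c_uniform < 1.0).mean()
share_small_log = (c_log < 1.0).mean()

print("share of samples with C < 1.0")
print("  uniform:   ", share_small_uniform)
print("  loguniform:", share_small_log)
```

With only `n_iter=5` draws, a uniform distribution will usually test no strongly regularized model at all, while a log-uniform one spreads the draws across the orders of magnitude.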


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# 🚀 ML Model 2 - Random Forest Classifier with Evaluation Metrics

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.utils.multiclass import unique_labels

# Step 1: Copy the dataset
df = merged_df.copy()

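# Step 2: Make 'Rating' numeric and drop rows with missing target
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df = df.dropna(subset=['Rating'])

# Step 3: Features and Target
# Text/identifier columns are dropped before encoding; ratings are rounded to whole-star classes (assumption)
text_cols = ['Restaurant', 'Reviewer', 'Review', 'Metadata', 'Time', 'Pictures', 'Links']
X = df.drop(columns=['Rating'] + text_cols, errors='ignore')
y = df['Rating'].round().astype(int)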

# Step 4: Encode categorical variables
X_encoded = pd.get_dummies(X, drop_first=True)

# Step 5: Impute missing values
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_encoded)

# Step 6: Standardize features (important for some models, optional for Random Forest)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

# Step 7: Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Step 8: Fit Random Forest Classifier
model2 = RandomForestClassifier(n_estimators=100, random_state=42)
model2.fit(X_train, y_train)

# Step 9: Predict on test data
y_pred = model2.predict(X_test)

# Step 10: Evaluate
print("✅ Accuracy Score:", accuracy_score(y_test, y_pred))
print("\n📊 Classification Report:\n", classification_report(y_test, y_pred))

# ✅ Step 11: Confusion Matrix with Fixed Labels
cm = confusion_matrix(y_test, y_pred)
labels = unique_labels(y_test, y_pred)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap=plt.cm.Oranges)
plt.title("Confusion Matrix - Random Forest (Model 2)")
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.stats import randint

# Step 1: Copy dataset
df = merged_df.copy()

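# Step 2: Make 'Rating' numeric and drop rows with missing target
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df = df.dropna(subset=['Rating'])

# Step 3: Feature and target split
# Text/identifier columns are dropped before encoding; ratings are rounded to whole-star classes (assumption)
text_cols = ['Restaurant', 'Reviewer', 'Review', 'Metadata', 'Time', 'Pictures', 'Links']
X = df.drop(columns=['Rating'] + text_cols, errors='ignore')
y = df['Rating'].round().astype(int)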

# Step 4: Encode categorical features
X_encoded = pd.get_dummies(X, drop_first=True)

# Step 5: Pipeline with imputer, scaler, and model
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Step 6: Smaller hyperparameter search space
param_dist = {
    'model__n_estimators': randint(100, 200),
    'model__max_depth': randint(5, 20),
    'model__min_samples_split': randint(2, 6),
    'model__min_samples_leaf': randint(1, 4)
}

# Step 7: RandomizedSearchCV with reduced load
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=5,         # Only 5 combinations
    cv=3,             # 3-fold CV
    scoring='accuracy',
    n_jobs=1,         # ❗ No parallel processing
    verbose=2,
    random_state=42
)

# Step 8: Fit the model
random_search.fit(X_encoded, y)

# Step 9: Results
print("✅ Best Parameters:", random_search.best_params_)
print("✅ Best Accuracy Score (CV):", random_search.best_score_)

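# Step 10: Sanity check on the same data used for the search (optional)
# These are training-set metrics and will look optimistic;
# best_score_ above is the cross-validated estimate to report.
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_encoded)

print("\nConfusion Matrix (training data):")
print(confusion_matrix(y, y_pred))
print("\nClassification Report (training data):")
print(classification_report(y, y_pred))
print("✅ Accuracy on Training Data:", accuracy_score(y, y_pred))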


##### Which hyperparameter optimization technique have you used and why?

In this implementation, we used RandomizedSearchCV for hyperparameter optimization.

✅ Why RandomizedSearchCV?
We chose RandomizedSearchCV because it provides a good balance between efficiency and performance, especially for datasets like ours in the Zomato project, which has:

~10,000 rows and moderate number of features (~100 after encoding).

A Random Forest model, which has many hyperparameters, making GridSearchCV computationally expensive.

⚙️ Advantages of RandomizedSearchCV for this use case:
Faster than GridSearchCV: It searches over a random subset of the hyperparameter space rather than exhaustively searching every combination.

Good results with fewer iterations: Often finds a near-optimal solution without needing to evaluate every possibility.

Efficient for large search spaces: Ideal when you have several hyperparameters with wide ranges (like n_estimators, max_depth, etc.).

Flexible: You can control the number of combinations to try (n_iter) and use cross-validation.
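
To put the cost difference in numbers, the sketch below counts how many combinations an exhaustive grid over the same integer ranges would contain and compares that with the 5 candidates drawn by a randomized search. It uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers, and the ranges mirror the `randint` bounds in the search space above (the upper bound of `randint` is exclusive).

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Exhaustive grid over the same integer ranges.
grid = {
    "model__n_estimators": list(range(100, 200)),
    "model__max_depth": list(range(5, 20)),
    "model__min_samples_split": list(range(2, 6)),
    "model__min_samples_leaf": list(range(1, 4)),
}
n_grid = len(ParameterGrid(grid))

# Randomized search draws a fixed number of candidates from distributions.
dist = {
    "model__n_estimators": randint(100, 200),
    "model__max_depth": randint(5, 20),
    "model__min_samples_split": randint(2, 6),
    "model__min_samples_leaf": randint(1, 4),
}
sampled = list(ParameterSampler(dist, n_iter=5, random_state=42))

print("grid combinations:   ", n_grid)
print("sampled combinations:", len(sampled))
print("model fits with cv=3:", n_grid * 3, "vs", len(sampled) * 3)
```

The trade-off is coverage: 5 draws explore only a tiny part of the space, so a larger `n_iter` is worth considering when runtime allows.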



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 1: Copy the merged dataset
df = merged_df.copy()

# Step 2: Drop rows where 'Rating' is missing
df = df.dropna(subset=['Rating'])

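# Step 3: Convert 'Rating' to numeric (if it's not already)
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df = df.dropna(subset=['Rating'])  # drop rows where Rating couldn't be converted
df['Rating'] = df['Rating'].round().astype(int)  # round rather than truncate, matching the tuning cell below

# Step 4: Shift Rating labels to start from 0 (i.e., 1-5 becomes 0-4)
df['Rating'] = df['Rating'] - 1

# Step 5: Define features and target
# Text/identifier columns are dropped before encoding (same list as in the tuning cell below)
text_cols = ['Restaurant', 'Reviewer', 'Review', 'Metadata', 'Time', 'Pictures', 'Links']
X = df.drop(columns=['Rating'] + text_cols, errors='ignore')
y = df['Rating']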

# Step 6: One-hot encode categorical features
X_encoded = pd.get_dummies(X, drop_first=True)

# Step 7: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Step 8: Define preprocessing + model pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
    ('model', XGBClassifier(eval_metric='mlogloss', random_state=42))
])

# Step 9: Fit the model
pipeline.fit(X_train, y_train)




In [None]:
# Step 10: Predict on test set
y_pred = pipeline.predict(X_test)

# Step 11: Evaluation
print("✅ Accuracy Score:", accuracy_score(y_test, y_pred))
print("\n✅ Classification Report:\n", classification_report(y_test, y_pred))
print("\n✅ Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, accuracy_score

# Already defined earlier:
# y_test -> true labels
# y_pred -> predicted labels

# 1. Confusion Matrix (Heatmap)
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# 2. Accuracy Score
acc = accuracy_score(y_test, y_pred)
print(f"✅ Accuracy Score: {acc:.2f}")

# 3. Classification Report
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()

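# Optional: Plot precision, recall, f1-score for each class
# Drop the summary rows so only the per-class rows are plotted
per_class = report_df.drop(index=['accuracy', 'macro avg', 'weighted avg'], errors='ignore')
per_class[['precision', 'recall', 'f1-score']].plot(kind='bar', figsize=(10, 6))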
plt.title("Classification Report Metrics by Class")
plt.ylabel("Score")
plt.ylim(0, 1)
plt.grid(axis='y')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from scipy.stats import randint, uniform

# Step 1: Prepare Data
df = merged_df.copy()

# Convert target to numeric
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df.dropna(subset=['Rating'], inplace=True)
df['Rating'] = df['Rating'].round().astype(int)

# Shift classes to start at 0
df['Rating'] = df['Rating'] - 1

# Drop unnecessary text columns to save memory
drop_cols = ['Restaurant', 'Reviewer', 'Review', 'Metadata', 'Time', 'Pictures', 'Links']
df.drop(columns=[col for col in drop_cols if col in df.columns], inplace=True)

# Split target and features
X = df.drop('Rating', axis=1)
y = df['Rating']

# One-hot encode
X = pd.get_dummies(X, drop_first=True)

# Reduce memory usage by converting to float32
X = X.astype(np.float32)

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define model directly (no pipeline to save memory)
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb_model = XGBClassifier(eval_metric='mlogloss', verbosity=0)

# Define smaller hyperparameter space
param_dist = {
    'n_estimators': randint(30, 80),
    'max_depth': randint(2, 6),
    'learning_rate': uniform(0.05, 0.2),
    'subsample': uniform(0.6, 0.3),
    'colsample_bytree': uniform(0.6, 0.3)
}

# Randomized SearchCV with fewer iterations and cv folds
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=10,
    scoring='accuracy',
    cv=2,
    verbose=1,
    random_state=42,
    n_jobs=1  # Don't parallelize on limited RAM
)

# Fit the model
random_search.fit(X_train, y_train)

# Predict
y_pred = random_search.predict(X_test)

# Evaluate
print("✅ Best Parameters:", random_search.best_params_)
print("✅ Accuracy Score:", accuracy_score(y_test, y_pred))
print("\n✅ Classification Report:\n", classification_report(y_test, y_pred))
print("\n✅ Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


##### Which hyperparameter optimization technique have you used and why?

In this implementation, we used RandomizedSearchCV for hyperparameter optimization.

✅ Why RandomizedSearchCV?
1. Memory & Speed Efficiency
Unlike GridSearchCV, which tries all possible combinations, RandomizedSearchCV only tries a random subset (e.g., 10 combinations).

This is ideal when your dataset is large or when you're limited on RAM and compute power.

2. Good Balance of Performance vs. Cost
It often finds near-optimal parameters much faster than GridSearchCV, especially when not all hyperparameters are equally important.

3. Flexibility with Distributions
You can specify distributions for parameters (like randint, uniform) instead of just fixed lists.

This allows for a broader, more exploratory search in large spaces.

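4. Cross-Validated Selection
Each candidate is scored with cross-validation rather than a single split, which guards against picking parameters that only fit one lucky partition. We used cv=2 to limit memory and runtime; with so few folds the scores are noisy, so more folds (e.g., 5) are preferable when resources allow.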



### 1. Which Evaluation metrics did you consider for a positive business impact and why?

✅ 1. Evaluation Metrics Considered for Positive Business Impact in the Zomato Project:
📌 a. Accuracy Score
What it Measures: The overall percentage of correct predictions made by the model.

Why it Matters: It gives a quick snapshot of model performance.

When it’s Useful: When the target classes (like Ratings 1 to 5) are roughly balanced.

Business Impact: A high accuracy ensures the model is not making random predictions, which is important for building trust in recommendations (e.g., classifying restaurant ratings correctly).

📌 b. Precision
What it Measures: Out of all the predicted instances of a class (say, 5-star restaurants), how many were actually correct.

Why it Matters: Helps reduce false positives.

Business Impact: High precision ensures that only truly good restaurants are recommended to users, improving customer satisfaction and retention.

📌 c. Recall
What it Measures: Out of all the actual instances of a class, how many were correctly predicted.

Why it Matters: Helps reduce false negatives.

Business Impact: High recall ensures good restaurants aren’t missed, which is crucial when a customer is searching for quality places—this avoids lost revenue.

📌 d. F1-Score
What it Measures: Harmonic mean of precision and recall.

Why it Matters: Balances false positives and false negatives.

Business Impact: Ensures that both over-recommending bad options and missing good ones are minimized — especially useful in real-world Zomato applications where user experience drives revenue.

📌 e. Confusion Matrix
What it Shows: Detailed breakdown of true vs. false positives/negatives for each rating class.

Why it Matters: Helps identify which rating categories are misclassified.

Business Impact: Pinpoints if, for example, too many 4-star places are wrongly shown as 2-stars — which could harm restaurants’ visibility or hurt user trust.
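
The metrics listed above can all be computed with scikit-learn. The sketch below uses ten made-up rating labels (not project data) so that the numbers are easy to check by hand; per-class precision, recall, and F1 are reported for the 5-star class as an example.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted ratings, for illustration only.
y_true = [5, 5, 5, 4, 4, 3, 3, 3, 2, 1]
y_pred = [5, 5, 4, 4, 4, 3, 3, 2, 2, 1]
labels = [1, 2, 3, 4, 5]

accuracy = accuracy_score(y_true, y_pred)

# average=None returns one score per class, in the order given by `labels`.
precision = precision_score(y_true, y_pred, labels=labels, average=None)
recall = recall_score(y_true, y_pred, labels=labels, average=None)
f1 = f1_score(y_true, y_pred, labels=labels, average=None)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

five_star = labels.index(5)
print(f"Accuracy: {accuracy:.2f}")
print(f"5-star precision: {precision[five_star]:.2f}")
print(f"5-star recall:    {recall[five_star]:.2f}")
print(f"5-star F1:        {f1[five_star]:.2f}")
print(cm)
```

In this toy example one true 5-star restaurant is predicted as 4-star, so 5-star recall drops while 5-star precision stays perfect. A single accuracy figure would hide that difference.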



### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We selected XGBoost as the final model because it achieved the highest accuracy among the models tested, trained efficiently, generalized well, and can be interpreted through SHAP values. It aligns best with Zomato’s business needs for reliable restaurant rating predictions and actionable insights.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

✅ Model Used: XGBoost Classifier
🔍 Explanation of the Model
XGBoost (Extreme Gradient Boosting) is a powerful, scalable, and efficient implementation of gradient-boosted decision trees. It's particularly effective for structured/tabular data and is often the go-to model for classification tasks like restaurant rating prediction.
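
To make the idea of gradient-boosted trees concrete, here is a minimal sketch of the core boosting loop. It is not XGBoost itself: it uses plain scikit-learn regression trees on a synthetic regression problem with squared-error loss, and it leaves out the regularization, second-order gradient information, and system-level optimizations that XGBoost adds. The learning rate, tree depth, and number of rounds are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the targets).
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared-error loss, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Each new shallow tree nudges the ensemble toward the remaining error.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}")
print(f"Training MSE after {n_rounds} rounds: {final_mse:.4f}")
```

Each round fits a small tree to what the current ensemble still gets wrong, which is why many weak trees combine into a strong model.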

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file for deployment.


In [None]:
# Save the File

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
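
The two cells above were left empty. One possible way to fill in both steps with `joblib` is sketched below. The data is synthetic, `GradientBoostingClassifier` stands in for the tuned model, and the file name `best_model.joblib` is a hypothetical choice; in the notebook, the fitted final estimator would be dumped instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; hold out a slice to play the role of unseen data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_unseen, y_train, y_unseen = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in for the project's best model.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# 1. Save the fitted model (hypothetical file name, written to a temp folder).
model_path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(model, model_path)

# 2. Load it back and predict on the unseen rows.
loaded_model = joblib.load(model_path)
unseen_predictions = loaded_model.predict(X_unseen)

# Sanity check: the reloaded model must reproduce the original predictions.
predictions_match = np.array_equal(unseen_predictions, model.predict(X_unseen))
print("Reloaded model matches original:", predictions_match)
```

If preprocessing is part of the workflow, saving the whole fitted pipeline (preprocessor plus model) avoids mismatches between training-time and serving-time transformations.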

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

 Final Project Conclusion – Zomato Restaurant Rating Prediction
📌 1. Project Objective Recap
The primary goal was to analyze Zomato restaurant data, perform end-to-end data processing and modeling, and predict restaurant ratings based on available features like cuisines, cost, location, and more. The project also aimed to deliver insights that could support business decisions such as customer targeting, restaurant benchmarking, or operational improvements.

🧹 2. Data Wrangling & Exploration
Merged multiple datasets cleanly and ensured no data loss.

Handled missing values and duplicates effectively.

Performed Univariate, Bivariate, and Multivariate analysis to understand feature distributions, correlations, and trends using visualizations.

Converted the ‘Rating’ column into a numerical format for modeling.

🛠️ 3. Feature Engineering & Preprocessing
Applied One-Hot Encoding to categorical features like cuisine and collections.

Scaled numerical data using StandardScaler for improved model performance.

Used SimpleImputer to handle any remaining missing values.

Feature selection and cleaning ensured a lean dataset that reduces noise and increases model efficiency.
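
The preprocessing steps above can be combined into a single scikit-learn transformer. The sketch below is an assumption about how the pieces could fit together, not the notebook's exact code: the toy frame, the column names `Cost` and `Cuisine`, and the imputation strategies are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "Cost": [800.0, 1200.0, np.nan, 500.0],
        "Cuisine": ["North Indian", "Chinese", "Chinese", np.nan],
    }
)

numeric_features = ["Cost"]
categorical_features = ["Cuisine"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)

features = preprocessor.fit_transform(df)
print(features.shape)
```

Fitting the transformer on the training split only, and reusing it on validation and test data, keeps imputation and scaling statistics from leaking across splits.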

🤖 4. Machine Learning Models Implemented
| Model | Technique Used | Hyperparameter Tuning | Accuracy | Comments |
|---|---|---|---|---|
| Logistic Regression | Baseline classification | GridSearchCV | ⚪ Moderate | Simple and interpretable, but underperformed on non-linear data |
| Random Forest | Ensemble Trees | RandomizedSearchCV | 🟡 Good | Captured feature interactions and gave a good accuracy boost |
| XGBoost Classifier | Gradient Boosting | RandomizedSearchCV + CV | 🟢 Best | Handled complex patterns well; fastest and best-performing model |

📈 5. Model Evaluation Metrics Used
Accuracy: Measured the overall correctness of predictions.

Confusion Matrix: Visualized model's strengths and weaknesses per class.

Classification Report: Provided precision, recall, and F1-score to understand trade-offs.

Cross-Validation: Ensured generalizability and robustness of results.

🔬 6. Model Explainability
Used SHAP to interpret feature importance.

Identified Cost, Cuisines, and Timings as top predictors of restaurant rating.

Improved business trust in the model by making it interpretable.

🧠 7. Final Model Choice
✅ XGBoost Classifier was selected as the final model due to:

Highest accuracy

Native handling of missing values and support for weighting on imbalanced data

Strong generalization through boosting

Interpretability through SHAP values

💡 8. Business Impact
Zomato can use this model to predict ratings of new or unrated restaurants, aiding customer decision-making.

Helps identify areas for operational improvements (e.g., cost optimization, menu revamp).

Valuable for marketing teams to target specific cuisines or timings that improve user satisfaction.

Provides a framework for personalized recommendations in future iterations.



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***