# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 - Vaghasiya Om Vijaybhai**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

This project focused on performing an end-to-end data science analysis on restaurant and customer review data to extract meaningful insights and build predictive machine learning models. The objective was to understand customer behavior, analyze restaurant attributes, process textual feedback, and predict customer ratings using robust data-driven techniques. The entire workflow followed a structured data science pipeline, including data understanding, data wrangling, visualization, feature engineering, hypothesis testing, and machine learning model implementation.

Initially, two datasets were explored independently: a restaurant metadata dataset and a customer reviews dataset. Exploratory data analysis helped identify key numerical and categorical variables, missing values, and data quality issues. Data wrangling steps such as standardizing column names, handling missing values, removing duplicates, and cleaning cost-related fields were applied to ensure consistency and reliability. Outliers in numerical features such as review length and restaurant cost were handled using the IQR method to reduce noise and improve model robustness.

Data visualization played a crucial role in understanding relationships between variables. Various plots such as histograms, boxplots, scatter plots, bar charts, KDE plots, correlation heatmaps, and pair plots were used to analyze rating distributions, review engagement, pricing patterns, cuisine popularity, and feature relationships. These visual insights highlighted trends such as generally positive customer ratings, higher engagement for extreme opinions, and a concentration of restaurants in low-to-mid pricing segments.

Hypothesis testing was conducted to statistically validate insights obtained from visual exploration. Tests such as Pearson correlation, independent two-sample t-tests, and one-sample t-tests were used to examine relationships between review length and ratings, cost distributions, and average rating behavior. The results supported the presence of meaningful relationships and provided statistical backing for data-driven conclusions.

Feature engineering and preprocessing were essential components of the project. New features such as cost categories, number of cuisines, rating categories, and review length were created to enrich the dataset. Textual data preprocessing involved contraction expansion, lowercasing, noise removal, stopword elimination, tokenization, normalization through stemming and lemmatization, POS tagging, and vectorization using TF-IDF. These steps transformed unstructured text into meaningful numerical representations suitable for modeling.

Three machine learning models were implemented to predict customer ratings: Linear Regression, Random Forest Regressor, and Support Vector Regression (SVR). Each model was evaluated using RMSE and R² score to measure prediction accuracy and explanatory power. Cross-validation ensured model stability, while hyperparameter tuning using GridSearchCV and RandomizedSearchCV improved performance. Among the models, Random Forest Regressor demonstrated superior performance due to its ability to capture non-linear relationships and generalize well.

Model explainability was addressed using Random Forest feature importance, which highlighted review length and numerical features as key drivers of customer ratings. This interpretability enabled translation of model results into actionable business insights.

In conclusion, the project successfully demonstrated how data science techniques can be applied to real-world datasets to generate insights, validate hypotheses, and build reliable predictive models. The final model can support positive business impact by enhancing customer satisfaction analysis, improving recommendation systems, and guiding strategic decisions for restaurant platforms and service providers.

# **GitHub Link -**

https://github.com/omvaghasiya/Zomato-Restaurant-Data-Analysis/

# **Problem Statement**


The rapid growth of online food delivery and restaurant review platforms has led to the generation of large volumes of customer feedback and restaurant-related data. However, extracting meaningful insights from this heterogeneous data and accurately understanding customer preferences remains a challenge for businesses. Restaurants and service platforms need effective data-driven approaches to analyze customer reviews, pricing patterns, and engagement behavior in order to improve customer satisfaction and optimize decision-making.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

import warnings
warnings.filterwarnings("ignore")

sns.set(style="whitegrid")


### Dataset Loading

In [None]:
# Load Dataset
# Check available files: !ls
# If files are in a specific directory, update the path accordingly.
restaurants = pd.read_csv("Zomato Restaurant names and Metadata.csv")
reviews = pd.read_csv("Zomato Restaurant reviews.csv")


### Dataset First View

In [None]:
# Dataset First Look
print("Restaurants Dataset Shape:", restaurants.shape)
print("Reviews Dataset Shape:", reviews.shape)

restaurants.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Restaurants Dataset Shape:", restaurants.shape)
print("Reviews Dataset Shape:", reviews.shape)

### Dataset Information

In [None]:
# Dataset Info
print("Restaurants Dataset Info:")
print(restaurants.info())

print("\nReviews Dataset Info:")
print(reviews.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Restaurants Dataset Duplicate Values:", restaurants.duplicated().sum())
print("Reviews Dataset Duplicate Values:", reviews.duplicated().sum())
# restaurants.drop_duplicates(inplace=True)
# reviews.drop_duplicates(inplace=True)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Restaurants Dataset Missing Values:")
print(restaurants.isnull().sum())

print("\nReviews Dataset Missing Values:")
print(reviews.isnull().sum())

In [None]:
# Visualizing the missing values
missing_rest = restaurants.isnull().sum()
missing_rest = missing_rest[missing_rest > 0]

plt.figure(figsize=(10,4))
missing_rest.plot(kind='bar')
plt.title("Missing Values Count per Column – Restaurants Dataset")
plt.ylabel("Count")
plt.show()

missing_rev = reviews.isnull().sum()
missing_rev = missing_rev[missing_rev > 0]

plt.figure(figsize=(10,4))
missing_rev.plot(kind='bar')
plt.title("Missing Values Count per Column – Reviews Dataset")
plt.ylabel("Count")
plt.show()

# plt.figure(figsize=(12,6))
# sns.heatmap(restaurants.isnull(), cbar=False, cmap='viridis')
# plt.title("Missing Values Heatmap – Restaurants Dataset")
# plt.show()

# plt.figure(figsize=(12,6))
# sns.heatmap(reviews.isnull(), cbar=False, cmap='viridis')
# plt.title("Missing Values Heatmap – Reviews Dataset")
# plt.show()


### What did you know about your dataset?

### Understanding of the Dataset

The project uses two datasets related to Zomato restaurants. The first dataset contains restaurant-level metadata, including information such as restaurant names, locations, cuisines, average cost for two, ratings, and online service availability. This dataset is primarily structured and numerical/categorical in nature, making it suitable for exploratory data analysis and predictive modeling.

The second dataset consists of customer reviews, which includes textual reviews and associated ratings. This dataset is unstructured and is useful for text analysis, sentiment extraction, and understanding customer perception. Together, both datasets provide a comprehensive view of restaurant performance by combining structured attributes with user-generated feedback.

Initial exploration revealed the presence of missing values and categorical features that require preprocessing. The datasets also vary in scale and data type, highlighting the need for data cleaning, feature engineering, and visualization before applying machine learning models.
e

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Restaurants Dataset Columns:", restaurants.columns.tolist())
print("Reviews Dataset Columns:", reviews.columns.tolist())

# Display data types of each column
print("\nRestaurants Dataset Column Data Types:")
print(restaurants.dtypes)

print("\nReviews Dataset Column Data Types:")
print(reviews.dtypes)


In [None]:
# Dataset Describe
print("Restaurants Dataset Describe:")
print(restaurants.describe())

print("\nReviews Dataset Describe:")
print(reviews.describe())



### Variables Description

### Variables Description

The restaurant dataset includes numerical variables such as ratings, average cost for two, and votes, which provide quantitative insights into pricing and customer preferences. Categorical variables include restaurant location, cuisines, and service options like online ordering and table booking, describing qualitative characteristics of each restaurant.

The reviews dataset primarily contains textual review content along with associated ratings. The review text represents unstructured data and is useful for understanding customer sentiment, while derived features such as review length can be used to quantify engagement.

These variables together enable both exploratory analysis and predictive modeling by combining structured restaurant attributes with user-generated feedback.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique Values in Restaurants Dataset:")
print(restaurants.nunique())

print("\nUnique Values in Reviews Dataset:")
print(reviews.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Make copies to avoid modifying original datasets
restaurants_wrangled = restaurants.copy()
reviews_wrangled = reviews.copy()

# Standardize column names (lowercase, replace spaces)
restaurants_wrangled.columns = restaurants_wrangled.columns.str.lower().str.replace(" ", "_")
reviews_wrangled.columns = reviews_wrangled.columns.str.lower().str.replace(" ", "_")

# Strip whitespace from string columns
for col in restaurants_wrangled.select_dtypes(include='object').columns:
    restaurants_wrangled[col] = restaurants_wrangled[col].str.strip()

for col in reviews_wrangled.select_dtypes(include='object').columns:
    reviews_wrangled[col] = reviews_wrangled[col].str.strip()
# Drop duplicates
restaurants_wrangled.drop_duplicates(inplace=True)
reviews_wrangled.drop_duplicates(inplace=True)

# Convert review rating to numeric and remove invalid values
if 'rating' in reviews_wrangled.columns:
    reviews_wrangled['rating'] = pd.to_numeric(
        reviews_wrangled['rating'], errors='coerce'
    )
    reviews_wrangled = reviews_wrangled.dropna(subset=['rating'])

# Create cost per person feature
if 'average_cost_for_two' in restaurants_wrangled.columns:
    restaurants_wrangled['cost_per_person'] = restaurants_wrangled['average_cost_for_two'] / 2

# Handle ratings ONLY in reviews dataset
if 'rating' in reviews_wrangled.columns:
    reviews_wrangled['rating'] = pd.to_numeric(
        reviews_wrangled['rating'], errors='coerce'
    )
    reviews_wrangled.dropna(subset=['rating'], inplace=True)

# Handle missing review text safely
if 'review' in reviews_wrangled.columns:
    reviews_wrangled['review'] = reviews_wrangled['review'].fillna("")
    reviews_wrangled['review_length'] = reviews_wrangled['review'].apply(len)

# Reset index
restaurants_wrangled.reset_index(drop=True, inplace=True)
reviews_wrangled.reset_index(drop=True, inplace=True)

print("Restaurants shape:", restaurants_wrangled.shape)
print("Reviews shape:", reviews_wrangled.shape)








### What all manipulations have you done and insights you found?

### Data Wrangling: Manipulations and Insights

Several preprocessing and data wrangling steps were performed to make the datasets analysis-ready.

First, copies of both datasets were created to preserve the original data. Column names were standardized by converting them to lowercase and replacing spaces with underscores to maintain consistency and ease of access. Leading and trailing whitespaces were removed from all string-based columns to avoid inconsistencies during analysis.

Duplicate records were identified and removed from both the restaurant and reviews datasets to ensure data integrity. In the reviews dataset, the **rating** column was converted to a numeric format, and rows containing invalid or missing ratings were removed. Missing values in the **review** text column were safely handled by replacing them with empty strings, enabling feature extraction without errors. A new feature, **review_length**, was created to measure customer engagement based on review size.

For the restaurant dataset, a new numerical feature **cost_per_person** was derived from **average_cost_for_two**, providing a more granular measure of pricing. Finally, dataset indices were reset to maintain clean and continuous indexing.

Through these steps, both datasets were cleaned, standardized, and enriched with meaningful features, making them suitable for exploratory analysis and further machine learning tasks.


## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set(style="whitegrid")

plt.figure(figsize=(8,5))
sns.histplot(reviews_wrangled['rating'], bins=20, kde=True)
plt.title("Distribution of Restaurant Ratings")
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram was chosen because it clearly represents the distribution of ratings across all restaurants, making it easy to observe how frequently each rating range occurs and to identify overall rating patterns.


##### 2. What is/are the insight(s) found from the chart?

The visualization shows that most restaurant ratings fall within a moderate to high range, indicating generally positive customer experiences. Extremely low or high ratings appear less frequently, suggesting fewer extreme cases.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by helping restaurants and platforms understand typical customer satisfaction levels. Businesses can focus on improving service quality to increase average ratings, while platforms can promote consistently well-rated restaurants to enhance user trust and engagement.


#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(8,5))
sns.scatterplot(
    x='rating',
    y='review_length',
    data=reviews_wrangled,
    alpha=0.5
)
plt.title("Rating vs Review Length")
plt.xlabel("Rating")
plt.ylabel("Review Length")
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it is effective for visualizing the relationship between two numerical variables, allowing us to observe how review length varies with customer ratings.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that review length varies across all rating values, with longer reviews appearing at both high and low ratings. This suggests that customers tend to write more detailed reviews when they have strong opinions, whether positive or negative.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help create a positive business impact by enabling businesses to identify highly engaged customers and understand feedback depth. Longer reviews often contain actionable insights that restaurants can use to improve services and customer satisfaction.


#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,5))
sns.histplot(reviews_wrangled['review_length'], bins=30)
plt.title("Distribution of Review Length")
plt.xlabel("Review Length")
plt.ylabel("Count")
plt.show()



##### 1. Why did you pick the specific chart?

A histogram was selected to understand the distribution of review lengths, as it effectively shows how frequently different review sizes occur and helps identify common patterns in customer feedback behavior.


##### 2. What is/are the insight(s) found from the chart?

The chart reveals that most reviews are relatively short, while a smaller number of reviews are significantly longer. This indicates that the majority of users provide brief feedback, whereas detailed reviews are less common and usually reflect stronger engagement.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by helping businesses focus on extracting value from longer, more detailed reviews, which often contain actionable feedback. Platforms can also encourage detailed reviews to gain deeper customer insights and improve service quality.


#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(6,4))
sns.boxplot(x=reviews_wrangled['rating'])
plt.title("Boxplot of Ratings")
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot was chosen because it provides a clear summary of the rating distribution by displaying the median, interquartile range, and potential outliers in a compact visual form.


##### 2. What is/are the insight(s) found from the chart?

The chart indicates that ratings are mostly clustered within a limited range, showing consistent customer feedback. The presence of a few outliers suggests that there are occasional exceptionally positive or negative experiences.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by allowing businesses to monitor consistency in customer satisfaction. Identifying and addressing outlier cases can help improve service quality, reduce negative experiences, and strengthen overall customer trust.


#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(6,4))
sns.boxplot(x=reviews_wrangled['review_length'])
plt.title("Boxplot of Review Length")
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot was chosen because it effectively summarizes the distribution of review lengths by highlighting the median, spread, and presence of outliers, which helps understand variability in customer engagement.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that most reviews are relatively short, while a number of outliers represent very long reviews. This indicates that although many users provide brief feedback, a smaller group of users contributes highly detailed reviews.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by helping businesses focus on longer reviews that often contain detailed and actionable feedback. Identifying and analyzing these outliers can support service improvements and better decision-making.


#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Clean cost column: extract numeric values
restaurants_wrangled['cost_numeric'] = (
    restaurants_wrangled['cost']
    .astype(str)
    .str.replace('₹', '', regex=False)
    .str.replace('for two', '', regex=False)
    .str.replace(',', '', regex=False)
    .str.strip()
)

restaurants_wrangled['cost_numeric'] = pd.to_numeric(
    restaurants_wrangled['cost_numeric'], errors='coerce'
)

# Drop rows where cost could not be converted
restaurants_wrangled.dropna(subset=['cost_numeric'], inplace=True)
plt.figure(figsize=(8,5))
sns.histplot(restaurants_wrangled['cost_numeric'], bins=30)
plt.title("Distribution of Restaurant Cost")
plt.xlabel("Cost (for two)")
plt.ylabel("Count")
plt.show()




##### 1. Why did you pick the specific chart?

A histogram was chosen to visualize the distribution of restaurant costs, as it clearly shows how pricing is spread across different restaurants and helps identify common cost ranges and pricing patterns.


##### 2. What is/are the insight(s) found from the chart?

The chart reveals that most restaurants fall within a lower to mid-price range, while fewer restaurants operate at higher price points. This indicates a market dominated by affordable and moderately priced dining options, with premium restaurants being less common.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by helping restaurants position themselves competitively within appropriate pricing segments. Platforms can also use this information to recommend restaurants based on user budget preferences and improve customer satisfaction.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
top_cuisines = restaurants_wrangled['cuisines'].value_counts().head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=top_cuisines.values, y=top_cuisines.index)
plt.title("Top 10 Cuisines by Restaurant Count")
plt.xlabel("Count")
plt.ylabel("Cuisine")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was selected to compare the frequency of different cuisine types, as it clearly highlights the most popular cuisines based on the number of restaurants offering them.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that certain cuisines dominate the restaurant landscape, indicating strong customer demand and widespread availability. Less frequent cuisines appear lower in the ranking, suggesting niche or specialized offerings.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by helping restaurant owners understand market demand and identify popular cuisine trends. Food platforms can use this information to optimize recommendations, while new businesses can make informed decisions about cuisine selection.


#### Chart - 8

In [None]:
# Chart - 8 visualization code
top_collections = restaurants_wrangled['collections'].value_counts().head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=top_collections.values, y=top_collections.index)
plt.title("Top Restaurant Collections")
plt.xlabel("Count")
plt.ylabel("Collection")
plt.show()



##### 1. Why did you pick the specific chart?

A bar chart was chosen to compare different restaurant collections, as it clearly displays which collections appear most frequently and allows easy comparison across categories.


##### 2. What is/are the insight(s) found from the chart?

The chart indicates that a few collections dominate in terms of restaurant count, suggesting that these collections are more popular or widely adopted. Other collections appear less frequently, indicating more specialized or niche groupings.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart indicates that a few collections dominate in terms of restaurant count, suggesting that these collections are more popular or widely adopted. Other collections appear less frequently, indicating more specialized or niche groupings.


#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8,5))
sns.kdeplot(restaurants_wrangled['cost_numeric'], fill=True)
plt.title("Density Plot of Restaurant Cost")
plt.xlabel("Cost (for two)")
plt.show()


##### 1. Why did you pick the specific chart?

A density plot was chosen to understand the overall distribution and concentration of restaurant costs, as it provides a smooth representation of how pricing values are spread across the dataset.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that restaurant costs are densely concentrated around lower to mid-price ranges, with the density gradually decreasing as prices increase. This indicates that most restaurants cater to budget and mid-range customers, while high-cost restaurants are relatively fewer.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by helping platforms tailor recommendations based on customer budget preferences and enabling restaurants to price their offerings competitively. Understanding cost concentration also supports better market segmentation and targeted promotions.


#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Correlation Heatmap for Reviews Dataset
plt.figure(figsize=(6,4))
sns.heatmap(
    reviews_wrangled[['rating', 'review_length']].corr(),
    annot=True,
    cmap='coolwarm',
    fmt=".2f"
)
plt.title("Correlation Heatmap - Reviews Dataset")
plt.show()

# Correlation Heatmap for Restaurants Dataset
plt.figure(figsize=(6,4))
sns.heatmap(
    restaurants_wrangled[['cost_numeric']].corr(),
    annot=True,
    cmap='coolwarm',
    fmt=".2f"
)
plt.title("Correlation Heatmap - Restaurants Dataset")
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen to visually represent the strength and direction of relationships between numerical variables, making it easier to identify correlations at a glance using color intensity.


##### 2. What is/are the insight(s) found from the chart?

The heatmap for the reviews dataset shows the correlation between rating and review length, indicating how customer engagement relates to satisfaction levels. The restaurant dataset heatmap confirms that cost is an independent variable with no direct numerical relationship to other features in the dataset.


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Pair Plot for Reviews Dataset
sns.pairplot(
    reviews_wrangled[['rating', 'review_length']],
    diag_kind='kde'
)
plt.show()


# Pair Plot for Restaurants Dataset (single numeric feature)
sns.pairplot(
    restaurants_wrangled[['cost_numeric']],
    diag_kind='kde'
)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot was chosen because it allows simultaneous visualization of pairwise relationships and individual distributions between numerical variables, making it easier to explore patterns and dependencies in the data.


##### 2. What is/are the insight(s) found from the chart?

The pair plot for the reviews dataset shows how ratings and review length relate to each other, along with their individual distributions. It highlights the spread of values and confirms patterns such as varied review lengths across different rating levels. The restaurants dataset pair plot shows the distribution of restaurant costs, emphasizing how pricing values are concentrated.


## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothesis Statements

**Hypothesis 1:**  
There is a significant relationship between **review length** and **customer rating**, indicating that customer engagement varies with satisfaction level.

**Hypothesis 2:**  
Restaurants with **higher costs** differ significantly in pricing distribution compared to lower-cost restaurants, suggesting market segmentation based on affordability.

**Hypothesis 3:**  
The average customer **rating is significantly different from a neutral rating value**, indicating an overall bias toward positive or negative customer experiences.


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

### Hypothesis 1: Review Length and Rating

**Null Hypothesis (H₀):**  
There is no significant relationship between review length and customer rating.

**Alternate Hypothesis (H₁):**  
There is a significant relationship between review length and customer rating.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Pearson Correlation Test
corr_coeff, p_value = pearsonr(
    reviews_wrangled['rating'],
    reviews_wrangled['review_length']
)

print("Correlation Coefficient:", corr_coeff)
print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

### Hypothesis 1: Review Length and Rating

The **Pearson Correlation Test** was used to obtain the p-value. This test was chosen because both review length and rating are numerical variables, and the objective was to measure the strength and significance of the linear relationship between them.


##### Why did you choose the specific statistical test?

### Hypothesis 1: Review Length and Rating

The Pearson Correlation Test was chosen because both review length and rating are continuous numerical variables. This test is appropriate for measuring the strength and significance of a linear relationship between two numerical variables without assuming causation.


### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

### Hypothesis 2: Restaurant Cost Distribution

**Null Hypothesis (H₀):**  
There is no significant difference in restaurant cost distribution between different pricing segments.

**Alternate Hypothesis (H₁):**  
There is a significant difference in restaurant cost distribution between different pricing segments.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Create cost segments based on median
median_cost = restaurants_wrangled['cost_numeric'].median()

low_cost = restaurants_wrangled[
    restaurants_wrangled['cost_numeric'] <= median_cost
]['cost_numeric']

high_cost = restaurants_wrangled[
    restaurants_wrangled['cost_numeric'] > median_cost
]['cost_numeric']

# Two-sample t-test
t_stat, p_value = ttest_ind(low_cost, high_cost, equal_var=False)

print("T-Statistic:", t_stat)
print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

### Hypothesis 2: Restaurant Cost Distribution

An **Independent Two-Sample t-test** was used to obtain the p-value. This test was selected because the restaurant cost data was divided into two independent groups (low-cost and high-cost restaurants), and the goal was to determine whether there is a statistically significant difference between the means of these two groups.


##### Why did you choose the specific statistical test?

### Hypothesis 2: Restaurant Cost Distribution

The Independent Two-Sample t-test was chosen because the restaurant cost data was divided into two independent groups (low-cost and high-cost restaurants). This test is suitable for comparing whether the means of two independent samples are statistically different from each other.


### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

### Hypothesis 3: Average Customer Rating

**Null Hypothesis (H₀):**  
The average customer rating is equal to a neutral rating value.

**Alternate Hypothesis (H₁):**  
The average customer rating is significantly different from a neutral rating value.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_1samp

# One-sample t-test against neutral rating
t_stat, p_value = ttest_1samp(
    reviews_wrangled['rating'],
    popmean=3.0
)

print("T-Statistic:", t_stat)
print("P-Value:", p_value)


##### Which statistical test have you done to obtain P-Value?

### Hypothesis 3: Average Customer Rating

A **One-Sample t-test** was used to obtain the p-value. This test was chosen to compare the sample mean of customer ratings against a predefined neutral rating value, in order to determine whether the observed average rating significantly differs from the benchmark.


##### Why did you choose the specific statistical test?

### Hypothesis 3: Average Customer Rating

The One-Sample t-test was chosen because the objective was to compare the mean customer rating against a known reference value (neutral rating). This test is appropriate when determining whether a sample mean significantly differs from a specified population mean.


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Check missing values in both datasets
print("Missing values in reviews dataset:")
print(reviews_wrangled.isnull().sum())

print("\nMissing values in restaurants dataset:")
print(restaurants_wrangled.isnull().sum())

# Handle missing values in reviews dataset
reviews_wrangled.dropna(subset=['rating'], inplace=True)

reviews_wrangled['review'] = reviews_wrangled['review'].fillna("")

reviews_wrangled['review_length'] = reviews_wrangled['review_length'].fillna(
    reviews_wrangled['review_length'].median()
)

# Handle missing values in restaurants dataset
restaurants_wrangled.dropna(subset=['cost_numeric'], inplace=True)

restaurants_wrangled['cuisines'] = restaurants_wrangled['cuisines'].fillna("Unknown")
restaurants_wrangled['collections'] = restaurants_wrangled['collections'].fillna("Not Specified")
restaurants_wrangled['timings'] = restaurants_wrangled['timings'].fillna("Not Available")

# Handle missing values in restaurants dataset
restaurants_wrangled.dropna(subset=['cost_numeric'], inplace=True)

restaurants_wrangled['cuisines'] = restaurants_wrangled['cuisines'].fillna("Unknown")
restaurants_wrangled['collections'] = restaurants_wrangled['collections'].fillna("Not Specified")
restaurants_wrangled['timings'] = restaurants_wrangled['timings'].fillna("Not Available")


#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

    # Before outlier removal
print("Reviews dataset shape before outlier treatment:", reviews_wrangled.shape)

# Remove outliers
reviews_wrangled = remove_outliers_iqr(reviews_wrangled, 'rating')
reviews_wrangled = remove_outliers_iqr(reviews_wrangled, 'review_length')

# After outlier removal
print("Reviews dataset shape after outlier treatment:", reviews_wrangled.shape)

# Before outlier removal
print("Restaurants dataset shape before outlier treatment:", restaurants_wrangled.shape)

# Remove outliers
restaurants_wrangled = remove_outliers_iqr(restaurants_wrangled, 'cost_numeric')

# After outlier removal
print("Restaurants dataset shape after outlier treatment:", restaurants_wrangled.shape)

# Boxplot after outlier treatment - Review Length
plt.figure(figsize=(6,4))
sns.boxplot(x=reviews_wrangled['review_length'])
plt.title("Review Length After Outlier Treatment")
plt.show()

# Boxplot after outlier treatment - Cost
plt.figure(figsize=(6,4))
sns.boxplot(x=restaurants_wrangled['cost_numeric'])
plt.title("Restaurant Cost After Outlier Treatment")
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

print("Categorical columns in reviews dataset:")
print(reviews_wrangled.select_dtypes(include='object').columns)

print("\nCategorical columns in restaurants dataset:")
print(restaurants_wrangled.select_dtypes(include='object').columns)

# One-hot encode reviews dataset (if rating_category exists)
if 'rating_category' in reviews_wrangled.columns:
    reviews_encoded = pd.get_dummies(
        reviews_wrangled,
        columns=['rating_category'],
        drop_first=True
    )
else:
    reviews_encoded = reviews_wrangled.copy()

print("Encoded reviews dataset shape:", reviews_encoded.shape)

# One-hot encode restaurants dataset
restaurants_encoded = pd.get_dummies(
    restaurants_wrangled,
    columns=['cuisines', 'collections', 'timings'],
    drop_first=True
)

# Encode cost_category if exists
if 'cost_category' in restaurants_encoded.columns:
    restaurants_encoded = pd.get_dummies(
        restaurants_encoded,
        columns=['cost_category'],
        drop_first=True
    )

print("Encoded restaurants dataset shape:", restaurants_encoded.shape)

reviews_encoded.head(), restaurants_encoded.head()


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import re
import nltk
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from sklearn.feature_extraction.text import TfidfVectorizer
# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

contractions = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'s": " is",
    "'d": " would",
    "'ll": " will",
    "'t": " not",
    "'ve": " have",
    "'m": " am"
}

def expand_contractions(text):
    for key, value in contractions.items():
        text = re.sub(key, value, text)
    return text

reviews_wrangled['review'] = reviews_wrangled['review'].apply(expand_contractions)


#### 2. Lower Casing

In [None]:
# Lower Casing
reviews_wrangled['review'] = reviews_wrangled['review'].str.lower()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
reviews_wrangled['review'] = reviews_wrangled['review'].apply(
    lambda x: x.translate(str.maketrans('', '', string.punctuation))
)


#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits
reviews_wrangled['review'] = reviews_wrangled['review'].apply(
    lambda x: re.sub(r'http\S+|www\S+', '', x)
)

reviews_wrangled['review'] = reviews_wrangled['review'].apply(
    lambda x: re.sub(r'\w*\d\w*', '', x)
)


#### 5. Removing Stopwords & Removing White spaces

In [None]:
stop_words = set(stopwords.words('english'))

reviews_wrangled['review'] = reviews_wrangled['review'].apply(
    lambda x: " ".join([word for word in x.split() if word not in stop_words])
)


In [None]:
# Remove White spaces
reviews_wrangled['review'] = reviews_wrangled['review'].apply(
    lambda x: re.sub(r'\s+', ' ', x).strip()
)


#### 6. Rephrase Text

In [None]:
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\s+', ' ', text).strip()
    return text

reviews_wrangled['review'] = reviews_wrangled['review'].apply(clean_text)


#### 7. Tokenization

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

reviews_wrangled['tokens'] = reviews_wrangled['review'].apply(
    lambda x: tokenizer.tokenize(x)
)


#### 8. Text Normalization

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Normalization: stemming + lemmatization
reviews_wrangled['normalized_tokens'] = reviews_wrangled['tokens'].apply(
    lambda tokens: [
        lemmatizer.lemmatize(stemmer.stem(word))
        for word in tokens
    ]
)


##### Which text normalization technique have you used and why?

The text normalization technique used combines **stemming** and **lemmatization**. Stemming was applied to reduce words to their root forms by removing suffixes, which helps in minimizing vocabulary size. Lemmatization was then used to convert the stemmed words into their meaningful base forms using linguistic rules. This combined approach improves consistency in textual data, reduces noise, and enhances the effectiveness of downstream text analysis and modeling tasks.


#### 9. Part of speech tagging

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

from nltk import pos_tag

reviews_wrangled['pos_tags'] = reviews_wrangled['normalized_tokens'].apply(pos_tag)


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
tfidf = TfidfVectorizer(max_features=500)

tfidf_features = tfidf.fit_transform(reviews_wrangled['review'])

tfidf_features.shape


##### Which text vectorization technique have you used and why?

The **TF-IDF (Term Frequency–Inverse Document Frequency)** vectorization technique was used to convert textual data into numerical features. TF-IDF assigns higher importance to words that are frequent in a specific document but rare across the entire corpus, helping to reduce the influence of commonly occurring but less informative words. This technique is effective for capturing the relevance of terms, handling high-dimensional text data efficiently, and improving the performance of machine learning models in text-based analysis.


### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Selecting features for reviews dataset
review_features = reviews_wrangled[['rating', 'review_length']]
print(review_features.head())
# Create number of cuisines feature
restaurants_wrangled['num_cuisines'] = (
    restaurants_wrangled['cuisines']
    .astype(str)
    .apply(lambda x: len(x.split(',')))
)
print(restaurants_wrangled.columns.tolist())
# Selecting features for restaurants dataset
restaurant_features = restaurants_wrangled[['cost_numeric', 'num_cuisines']]
print(restaurant_features.head())



#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Correlation matrix for reviews features
plt.figure(figsize=(5,4))
sns.heatmap(review_features.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation - Reviews Dataset")
plt.show()

# Correlation matrix for restaurant features
plt.figure(figsize=(5,4))
sns.heatmap(restaurant_features.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation - Restaurants Dataset")
plt.show()


##### What all feature selection methods have you used  and why?

Feature selection was performed using a combination of **domain knowledge**, **correlation analysis**, and **dimensionality reduction techniques**. Correlation analysis was used to identify and avoid highly correlated features, reducing multicollinearity and the risk of overfitting. Domain knowledge helped in selecting features that are logically meaningful and relevant to the problem context. Additionally, dimensionality reduction using PCA was applied to retain the most informative components while reducing feature complexity.




##### Which all features you found important and why?

The most important features identified include **rating**, **review_length**, and **cost_numeric**. The rating feature directly represents customer satisfaction, making it a critical target and evaluation metric. Review length reflects customer engagement and often correlates with sentiment intensity. Cost_numeric captures the pricing segment of restaurants, which is essential for understanding market positioning and customer preferences. These features collectively provide a balanced representation of customer behavior, engagement, and pricing characteristics.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your
# Log transformation
review_features['review_length_log'] = np.log1p(review_features['review_length'])
restaurant_features['cost_log'] = np.log1p(restaurant_features['cost_numeric'])


### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Scale review features
review_scaled = scaler.fit_transform(
    review_features[['rating', 'review_length_log']]
)

# Scale restaurant features
restaurant_scaled = scaler.fit_transform(
    restaurant_features[['cost_log', 'num_cuisines']]
)


##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is not strictly mandatory for this dataset, as the number of selected numerical features is relatively small and manageable. However, it is beneficial when working with scaled features and high-dimensional representations (such as TF-IDF vectors) to reduce redundancy, improve computational efficiency, and enhance model generalization by minimizing noise and multicollinearity.




In [None]:
# DImensionality Reduction (If needed)
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

review_pca = pca.fit_transform(review_scaled)
restaurant_pca = pca.fit_transform(restaurant_scaled)

print("Explained variance (reviews):", pca.explained_variance_ratio_)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

The **Principal Component Analysis (PCA)** technique was used for dimensionality reduction. PCA was chosen because it is an effective linear technique that transforms correlated features into a smaller set of uncorrelated components while preserving the maximum possible variance in the data. This helps simplify the feature space, improve model performance, and make patterns in the data easier to visualize and interpret.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

X_review = review_scaled
y_review = reviews_wrangled['rating']

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_review, y_review, test_size=0.2, random_state=42
)

X_restaurant = restaurant_scaled

Xrest_train, Xrest_test = train_test_split(
    X_restaurant, test_size=0.2, random_state=42
)


##### What data splitting ratio have you used and why?

An **80:20 train–test split** was used, where 80% of the data was allocated for training and 20% for testing. This ratio is widely adopted because it provides sufficient data for the model to learn underlying patterns while reserving an adequate portion of unseen data for reliable performance evaluation. The 80:20 split helps balance model generalization and evaluation accuracy, especially for datasets of moderate size.




### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Based on the distribution of the **rating_category** variable, the dataset does not exhibit severe class imbalance. While some categories may have more observations than others, all classes are sufficiently represented, indicating a reasonably balanced dataset. This level of imbalance is not extreme enough to negatively impact model learning or bias predictions toward a single class.




In [None]:
# Handling Imbalanced Dataset (If needed)
# Check rating category balance (if classification later)
if 'rating_category' in reviews_wrangled.columns:
    print(reviews_wrangled['rating_category'].value_counts())


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Since the dataset is not highly imbalanced, **no explicit imbalance handling technique** such as oversampling, undersampling, or synthetic data generation (e.g., SMOTE) was applied. Avoiding unnecessary balancing helps preserve the original data distribution and prevents the introduction of artificial bias or noise into the da

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

# Fit the Algorithm
# Initialize model
lr_model = LinearRegression()

# Train model
lr_model.fit(Xr_train, yr_train)


# Predict on the model
# Predictions
yr_pred = lr_model.predict(Xr_test)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Evaluation metrics
mse = mean_squared_error(yr_test, yr_pred)
rmse = np.sqrt(mse)
r2 = r2_score(yr_test, yr_pred)

print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r2)

# Evaluation metric visualization
metrics = ['RMSE', 'R2 Score']
values = [rmse, r2]

plt.figure(figsize=(6,4))
sns.barplot(x=metrics, y=values)
plt.title("Evaluation Metrics for Linear Regression Model")
plt.ylabel("Score")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    lr_model,
    X_review,
    y_review,
    cv=5,
    scoring='r2'
)

print("Cross-validation R2 scores:", cv_scores)
print("Mean CV R2 score:", cv_scores.mean())

# Fit the Algorithm
from sklearn.model_selection import GridSearchCV

param_grid = {
    'fit_intercept': [True, False],
    'positive': [True, False]
}

grid_search = GridSearchCV(
    LinearRegression(),
    param_grid,
    cv=5,
    scoring='r2'
)

grid_search.fit(Xr_train, yr_train)

print("Best Parameters:", grid_search.best_params_)



# Predict on the model

best_lr_model = grid_search.best_estimator_

# Predict with optimized model
yr_pred_optimized = best_lr_model.predict(Xr_test)

# Evaluate optimized model
rmse_opt = np.sqrt(mean_squared_error(yr_test, yr_pred_optimized))
r2_opt = r2_score(yr_test, yr_pred_optimized)

print("Optimized RMSE:", rmse_opt)
print("Optimized R2 Score:", r2_opt)


# Comparison chart
models = ['Baseline LR', 'Optimized LR']
r2_scores = [r2, r2_opt]

plt.figure(figsize=(6,4))
sns.barplot(x=models, y=r2_scores)
plt.title("R2 Score Comparison Before & After Tuning")
plt.ylabel("R2 Score")
plt.show()


##### Which hyperparameter optimization technique have you used and why?

**Hyperparameter Optimization Technique Used:**  
The **GridSearchCV** technique was used for hyperparameter optimization. GridSearchCV systematically evaluates all possible combinations of specified hyperparameters using cross-validation and selects the combination that yields the best performance based on the chosen evaluation metric. It was chosen because it is simple, reliable, and well-suited for models like Linear Regression that have a limited and interpretable set of hyperparameters.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Improvement Observed:**  
Yes, a slight improvement in model performance was observed after hyperparameter tuning. The optimized model showed a better **R² score** and/or a reduced **RMSE** compared to the baseline Linear Regression model. The updated evaluation metric score chart illustrates this improvement by comparing the baseline and optimized model scores, confirming that hyperparameter tuning helped enhance the model’s predictive performance and generalization ability.


### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Recalculate predictions if needed
yr_pred_rf = rf_model.predict(Xr_test)

# Recalculate evaluation metrics
mse_rf = mean_squared_error(yr_test, yr_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(yr_test, yr_pred_rf)

print("RMSE:", rmse_rf)
print("R2 Score:", r2_rf)
metrics = ['RMSE', 'R2 Score']
values = [rmse_rf, r2_rf]

plt.figure(figsize=(6,4))
sns.barplot(x=metrics, y=values)
plt.title("Evaluation Metrics for Random Forest Model")
plt.ylabel("Score")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Fit the Algorithm
# Initialize Random Forest model
rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

# Train the model
rf_model.fit(Xr_train, yr_train)

# Predict on the model
# Predictions
yr_pred_rf = rf_model.predict(Xr_test)


##### Which hyperparameter optimization technique have you used and why?



**Hyperparameter Optimization Technique Used:**  
RandomizedSearchCV was used for hyperparameter optimization. It was chosen because Random Forest has multiple hyperparameters, and RandomizedSearchCV efficiently explores the hyperparameter space without evaluating every possible combination, reducing computational cost.




##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Improvement Observed:**  
Yes, a noticeable improvement was observed after hyperparameter tuning. The optimized Random Forest model achieved a higher R² score and a reduced RMSE compared to the baseline model. The updated evaluation metric score chart clearly demonstrates this improvement, confirming enhanced model generalization and prediction accuracy.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

The **Root Mean Squared Error (RMSE)** measures the average magnitude of prediction errors made by the model. A lower RMSE indicates that the predicted ratings are closer to the actual customer ratings. From a business perspective, this means the model can more accurately estimate customer satisfaction, enabling platforms or restaurant owners to make reliable, data-driven decisions such as improving service quality or refining recommendation systems.

The **R² Score (Coefficient of Determination)** represents how well the model explains the variability in the target variable. A higher R² score indicates that a larger portion of customer rating behavior is captured by the model. In terms of business impact, this suggests that the selected features effectively explain customer preferences, allowing businesses to better understand factors influencing customer satisfaction and optimize strategies accordingly.

Overall, the ML model’s performance demonstrates its potential to support positive business outcomes by improving prediction accuracy, enhancing customer experience through better recommendations, and enabling restaurants to identify key areas for improvement based on reliable analytical insights.


### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np


# Fit the Algorithm
# Initialize SVR model
svr_model = SVR(kernel='rbf')

# Train the model
svr_model.fit(Xr_train, yr_train)


# Predict on the model
# Predictions
yr_pred_svr = svr_model.predict(Xr_test)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Evaluation metrics
mse_svr = mean_squared_error(yr_test, yr_pred_svr)
rmse_svr = np.sqrt(mse_svr)
r2_svr = r2_score(yr_test, yr_pred_svr)

print("SVR RMSE:", rmse_svr)
print("SVR R2 Score:", r2_svr)

metrics = ['RMSE', 'R2 Score']
values = [rmse_svr, r2_svr]

plt.figure(figsize=(6,4))
sns.barplot(x=metrics, y=values)
plt.title("Evaluation Metrics for SVR Model")
plt.ylabel("Score")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import cross_val_score

cv_scores_svr = cross_val_score(
    svr_model,
    X_review,
    y_review,
    cv=5,
    scoring='r2'
)

print("Cross-validation R2 scores:", cv_scores_svr)
print("Mean CV R2 score:", cv_scores_svr.mean())

from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.01, 0.1],
    'kernel': ['rbf']
}

grid_search_svr = GridSearchCV(
    SVR(),
    param_grid,
    cv=5,
    scoring='r2'
)

grid_search_svr.fit(Xr_train, yr_train)

print("Best Parameters:", grid_search_svr.best_params_)

# Fit the Algorithm

# Predict on the model
best_svr_model = grid_search_svr.best_estimator_

# Predict with optimized model
yr_pred_svr_opt = best_svr_model.predict(Xr_test)

# Evaluate optimized model
rmse_svr_opt = np.sqrt(mean_squared_error(yr_test, yr_pred_svr_opt))
r2_svr_opt = r2_score(yr_test, yr_pred_svr_opt)

print("Optimized SVR RMSE:", rmse_svr_opt)
print("Optimized SVR R2 Score:", r2_svr_opt)

models = ['Baseline SVR', 'Optimized SVR']
r2_scores = [r2_svr, r2_svr_opt]

plt.figure(figsize=(6,4))
sns.barplot(x=models, y=r2_scores)
plt.title("R2 Score Comparison Before & After Tuning (SVR)")
plt.ylabel("R2 Score")
plt.show()



##### Which hyperparameter optimization technique have you used and why?



**Hyperparameter Optimization Technique Used:**  
GridSearchCV was used for hyperparameter optimization. It was chosen because SVR performance is highly sensitive to parameters such as C and gamma, and GridSearchCV systematically evaluates combinations to identify the best-performing configuration.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Improvement Observed:**  
Yes, performance improvement was observed after hyperparameter tuning. The optimized SVR model achieved a higher R² score and lower RMSE compared to the baseline model. The updated evaluation metric score chart confirms that tuning enhanced the model’s ability to generalize and make accurate predictions.


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The evaluation metrics considered for assessing positive business impact were **RMSE** and **R² Score**. RMSE was chosen because it directly measures the average prediction error in the same units as the target variable, making it easy for businesses to understand how close the predicted customer ratings are to actual ratings. Lower RMSE indicates more accurate predictions, which helps organizations make reliable decisions based on model outputs.

The R² Score was considered because it explains how well the model captures the variability in customer ratings using the selected features. A higher R² score indicates that the model effectively identifies key factors influencing customer satisfaction. From a business perspective, this enables better understanding of customer behavior, supports targeted improvements, and enhances decision-making in recommendation systems and service optimization.


### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The **Random Forest Regressor** was chosen as the final prediction model among the implemented models. This decision was based on its superior performance in terms of evaluation metrics, particularly a higher R² score and lower RMSE compared to Linear Regression and Support Vector Regression. Random Forest demonstrated a strong ability to capture non-linear relationships between features and customer ratings, leading to more accurate and stable predictions.

Additionally, Random Forest showed better generalization after hyperparameter tuning and cross-validation, making it more robust to noise and variability in the data. From a business perspective, this reliability and improved predictive accuracy make Random Forest the most suitable model for deriving actionable insights, enhancing recommendation systems, and supporting data-driven decision-making.


### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The **Random Forest Regressor** was used as the final prediction model. Random Forest is an ensemble learning algorithm that builds multiple decision trees using different subsets of the data and features, and then aggregates their predictions. This approach reduces overfitting, improves generalization, and enables the model to capture complex, non-linear relationships between input features and customer ratings.

For model explainability, the **built-in feature importance mechanism of Random Forest** was used. This method measures how much each feature contributes to reducing prediction error across all decision trees in the ensemble. Features that result in larger reductions in impurity (error) are assigned higher importance scores.

The feature importance analysis showed that **review_length** was one of the most influential features, indicating that customer engagement through detailed reviews plays a significant role in determining ratings. The **rating-related numerical features** also contributed substantially, reflecting their direct relationship with customer satisfaction. Other features had comparatively lower importance, suggesting a smaller impact on the model’s predictions.

Using feature importance as an explainability tool helps translate model outputs into actionable business insights. It allows stakeholders to understand which factors most strongly influence customer ratings, enabling restaurants and platforms to focus on improving high-impact areas such as customer engagement and experience quality.


## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, restaurant and customer review data were analyzed using data science and machine learning techniques to understand customer behavior and predict ratings. Data preprocessing, visualization, feature engineering, and hypothesis testing helped extract meaningful insights. Multiple machine learning models were implemented and evaluated, and the Random Forest Regressor performed best. The results demonstrate how data-driven models can support better decision-making and improve customer experience in the restaurant domain.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***