<a href="https://colab.research.google.com/github/nupur-19-hub/nupur_genai_python/blob/main/zomato.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**

The Zomato Restaurant Rating Prediction project focused on building a robust machine learning model to predict restaurant ratings using detailed restaurant metadata and customer reviews. The objective was not only to achieve accurate predictions but also to uncover meaningful insights into the factors that influence customer satisfaction in the restaurant industry. The project successfully implemented an end-to-end data science pipeline, from raw data processing to model evaluation, while translating technical results into actionable business insights.

The project began with an in-depth Exploratory Data Analysis (EDA) to understand the structure, quality, and patterns within the dataset. This stage helped identify key variables, outliers, and trends related to pricing, cuisines, customer sentiment, and engagement over time. EDA revealed that customer ratings were generally skewed toward positive values, while restaurant costs followed a right-skewed distribution, with most restaurants falling in the mid-range price bracket of 500–1000 INR. Cuisine analysis showed that North Indian and Chinese cuisines were the most common, whereas Mediterranean, Modern Indian, and European cuisines achieved the highest average ratings.

The data preparation phase was extensive and critical to the model’s success. Duplicate records were removed, incorrect data types were fixed for variables such as cost, rating, and time, and missing values were handled through appropriate imputation or removal strategies. Customer review text underwent rigorous preprocessing, including contraction expansion, lowercasing, removal of punctuation, URLs, and stopwords, followed by lemmatization to standardize text data. The cleaned reviews were transformed into numerical features using TF-IDF vectorization, enabling the model to capture meaningful textual patterns. Additional feature engineering created variables such as Review_Length, further enriching the dataset.

To ensure efficiency and prevent overfitting, a multi-stage feature selection process was implemented. This included filter methods, embedded techniques using RandomForestRegressor feature importance, and wrapper-based approaches. Through this structured selection strategy, the feature space was reduced to the most influential 20 features, balancing model complexity with predictive performance.
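The review-cleaning steps described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual pipeline: the `CONTRACTIONS` and `STOPWORDS` tables and the `clean_review` helper are hypothetical stand-ins, and lemmatization and TF-IDF vectorization are left out because they need external libraries.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables; a real pipeline would use fuller lists.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "is", "it", "and", "was", "to", "of"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Lowercase, strip URLs, expand contractions, drop punctuation and stopwords."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_review("It's great! Don't miss the biryani: https://example.com/menu"))
```

URLs are removed before punctuation so that the pattern can still recognise them, and contractions are expanded before the apostrophes are stripped.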

For model development, the dataset was split into 80% training and 20% testing data. Three regression models (RandomForestRegressor, GradientBoostingRegressor, and Lasso Regression) were trained and optimized using GridSearchCV for hyperparameter tuning. Among these, the tuned RandomForestRegressor demonstrated the strongest performance and was selected as the final model. It achieved an R² score of 0.5116, indicating that over 51% of the variation in restaurant ratings was explained by the selected features. The model’s Mean Absolute Error (MAE) of 0.7842 and Root Mean Squared Error (RMSE) of 1.0349 suggest that predictions typically deviated from the actual rating by roughly 0.8 to 1.0 points, a level of accuracy the project considers sufficient for real-world applications.

Visualization-driven insights further strengthened the project. Review volume showed rapid growth between late 2017 and early 2019, highlighting increased platform engagement. Most restaurants offered two to four cuisines, suggesting a trend toward menu diversification. Correlation analysis revealed weak direct relationships between ratings and reviewer activity metrics, though reviewer engagement variables were moderately correlated with each other.
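The three metrics quoted for the final model (R², MAE, RMSE) follow directly from the prediction residuals. The sketch below computes them with NumPy on made-up values. `regression_metrics`, `toy_true` and `toy_pred` are hypothetical names, and the numbers are unrelated to the project's results.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equally long sequences of ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))           # average absolute miss
    rmse = np.sqrt(np.mean(residuals ** 2))    # penalises large misses more
    ss_res = np.sum(residuals ** 2)            # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Toy ratings for illustration only.
toy_true = [5.0, 4.0, 3.0, 1.0, 4.5]
toy_pred = [4.5, 4.0, 3.5, 2.0, 4.0]
metrics = regression_metrics(toy_true, toy_pred)
print(metrics)
```

RMSE is never smaller than MAE. A wide gap between the two, as in the reported 1.03 versus 0.78, points to a minority of predictions with comparatively large errors.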

From a business perspective, the model enables continuous performance monitoring, data-driven marketing strategies, menu optimization, and informed investment decisions for platforms like Zomato. Future enhancements could include advanced NLP techniques such as deep learning-based sentiment analysis, experimenting with models like XGBoost or LightGBM, addressing rating imbalance through custom loss functions, and incorporating external data sources. Overall, the project establishes a strong foundation for leveraging data science in the restaurant ecosystem.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**The project aims to analyze Zomato restaurant data to understand the relationship between pricing, customer sentiment, and engagement. Using exploratory data analysis and clustering techniques, restaurants are segmented into meaningful groups to derive insights that can aid business and recommendation strategies.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
resto_names= pd.read_csv("/content/Zomato Restaurant names and Metadata.csv")
review=pd.read_csv("/content/Zomato Restaurant reviews.csv")
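Since the guidelines reward exception handling, one option is to wrap the loading step in a small helper that fails early with a clear message. This is a sketch under assumptions: `load_csv` is a hypothetical helper, not part of the original notebook, and it only covers a missing or empty file.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, failing early with a clear message if it is missing or empty."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    try:
        return pd.read_csv(csv_path)
    except pd.errors.EmptyDataError as exc:
        # pandas raises EmptyDataError when the file has no columns to parse.
        raise ValueError(f"Dataset is empty: {csv_path}") from exc
```

With this helper, the two loads above would read `resto_names = load_csv("/content/Zomato Restaurant names and Metadata.csv")` and `review = load_csv("/content/Zomato Restaurant reviews.csv")`.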


### Dataset First View

In [None]:
# Dataset First Look
resto_names.head()


In [None]:
review.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = resto_names.shape
print(f"Rows: {rows}")
print(f"Columns: {columns}")

In [None]:
rows, columns =review.shape
print(f"Rows: {rows}")
print(f"Columns: {columns}")

### Dataset Information

In [None]:
# Dataset info
resto_names.info()


In [None]:
review.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
resto_names.duplicated().sum()

In [None]:
review.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
resto_names.isnull().sum()

In [None]:
review.isnull().sum()

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(resto_names)

In [None]:
import missingno as msno
msno.bar(review)

### What did you know about your dataset?

1. resto_names DataFrame:

    *   It contains 105 rows and 6 columns.
    *   The columns are: 'Name', 'Links', 'Cost', 'Collections', 'Cuisines', and 'Timings'.
    *   There are no duplicate rows.
    *   Missing values: 'Collections' has 54 and 'Timings' has 1.
    *   All columns are of object data type.

2. review DataFrame:

    *   It contains 10,000 rows and 7 columns.
    *   The columns are: 'Restaurant', 'Reviewer', 'Review', 'Rating', 'Metadata', 'Time', and 'Pictures'.
    *   There are 36 duplicate rows.
    *   Missing values: 'Reviewer' has 38, 'Review' has 45, 'Rating' has 38, 'Metadata' has 38, and 'Time' has 38.
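The shape, duplicate and missing-value checks above are repeated for both DataFrames, so they could be gathered into one helper. The sketch below is illustrative: `summarize_frame` and the toy frame are hypothetical names, not part of the original notebook.

```python
import pandas as pd


def summarize_frame(df):
    """Collect row/column counts, duplicate rows and per-column missing values."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isnull().sum().to_dict(),
    }


# Toy frame for illustration; the second row duplicates the first.
toy = pd.DataFrame({"Restaurant": ["A", "A", "B"], "Rating": ["5", "5", None]})
summary = summarize_frame(toy)
print(summary)
```

Calling `summarize_frame(resto_names)` and `summarize_frame(review)` would give the figures listed above in one place.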

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
resto_names.columns


In [None]:
review.columns

In [None]:
# Dataset Describe
resto_names.describe()

In [None]:
review.describe()

### Variables Description

**resto_names DataFrame:**

*   **Name:** This column contains the names of the restaurants. It has 105 unique entries, indicating that each row represents a distinct restaurant.
*   **Links:** This column provides the Zomato URL for each restaurant. Similar to 'Name', it has 105 unique entries, confirming unique links for unique restaurants.
*   **Cost:** This column represents the average cost for two people at the restaurant. It's currently stored as an object type, but it should be a numerical value. There are 29 unique cost values, with '500' being the most frequent.
*   **Collections:** This column indicates special collections or categories the restaurant belongs to (e.g., 'Food Hygiene Rated Restaurants'). It's a categorical variable with 54 missing values and 42 unique collection types among the non-null entries.
*   **Cuisines:** This column lists the types of cuisines offered by each restaurant. It's a categorical variable with 92 unique combinations of cuisines, with 'North Indian, Chinese' being the most frequent.
*   **Timings:** This column provides the operating hours of the restaurants. It's a categorical variable with 1 missing value and 77 unique timing patterns.

**review DataFrame:**

*   **Restaurant:** This column contains the name of the restaurant being reviewed. It's an object type and will likely link to the resto_names DataFrame.
*   **Reviewer:** This column stores the name of the person who wrote the review. It's an object type with 38 missing values.
*   **Review:** This column contains the actual text content of the review. It's textual data (object type) with 45 missing values.
*   **Rating:** This column contains the rating given by the reviewer. While it appears to be a rating, it's currently an object type and needs to be converted to a numerical format. It also has 38 missing values.
*   **Metadata:** This column contains additional information about the reviewer, such as the number of reviews and followers. It's an object type with 38 missing values and will require parsing to extract meaningful numerical features.
*   **Time:** This column indicates when the review was posted. It's currently an object type and needs to be converted to a datetime format. It also has 38 missing values.
*   **Pictures:** This column represents the number of pictures uploaded with the review. It's a numerical column (int64) with a mean of 0.75 pictures and a maximum of 64 pictures. The median is 0, suggesting that most reviews do not include pictures.
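The 'Metadata' column is described as needing parsing before it yields numerical features. The sketch below shows one way to do that, assuming the strings look like `"3 Reviews , 2 Followers"`. Three things here are assumptions rather than something confirmed by the notebook: that string format, the output column names `Reviewer_Reviews` and `Reviewer_Followers`, and the choice to treat a missing count as 0.

```python
import pandas as pd

# Toy values in the assumed "N Review(s) , M Follower(s)" format.
toy_meta = pd.Series(
    ["1 Review , 2 Followers", "31 Reviews , 144 Followers", "5 Reviews", None]
)

# Pull out the digits that precede each keyword; unmatched rows become NaN.
reviews_raw = toy_meta.str.extract(r"(\d+)\s+Reviews?", expand=False)
followers_raw = toy_meta.str.extract(r"(\d+)\s+Followers?", expand=False)

parsed = pd.DataFrame(
    {
        # Assumption: a missing count is treated as 0.
        "Reviewer_Reviews": pd.to_numeric(reviews_raw).fillna(0).astype(int),
        "Reviewer_Followers": pd.to_numeric(followers_raw).fillna(0).astype(int),
    }
)
print(parsed)
```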





### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
resto_names.nunique()

In [None]:
review.nunique()

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
#Remove duplicate rows from the review DataFrame to ensure data integrity.
print(f"Number of duplicate rows in review DataFrame before removal: {review.duplicated().sum()}")
review.drop_duplicates(inplace=True)
print(f"Number of duplicate rows in review DataFrame after removal: {review.duplicated().sum()}")


In [None]:
#Inspect the unique values in the 'Cost' column to understand its current format.
print("Unique values in 'Cost' column before cleaning:")
print(resto_names['Cost'].unique())

In [None]:
#Remove commas from the 'Cost' column.
#Convert the 'Cost' column to a numeric data type.

resto_names['Cost'] = resto_names['Cost'].str.replace(',', '', regex=False)
resto_names['Cost'] = pd.to_numeric(resto_names['Cost'])
print("Data type of 'Cost' column after cleaning:")
print(resto_names['Cost'].dtype)
print("First 5 rows of 'Cost' column after cleaning:")
print(resto_names['Cost'].head())

In [None]:
#Inspect the unique values in the 'Rating' column to understand its current format and identify non-numeric entries.
#Handle non-numeric entries by replacing them or converting them to NaN.
#Convert the 'Rating' column to a numeric data type.

print("Unique values in 'Rating' column before cleaning:")
print(review['Rating'].unique())

In [None]:
review['Rating'] = review['Rating'].replace('Like', np.nan)
review['Rating'] = pd.to_numeric(review['Rating'])
print("Data type of 'Rating' column after cleaning:")
print(review['Rating'].dtype)
print("First 5 rows of 'Rating' column after cleaning:")
print(review['Rating'].head())

In [None]:
#Convert the 'Time' column in the review DataFrame to datetime objects using pd.to_datetime().
#Verify the conversion by checking the data type of the 'Time' column using .dtype and displaying the first few rows using .head().

review['Time'] = pd.to_datetime(review['Time'])
print("Data type of 'Time' column after conversion:")
print(review['Time'].dtype)
print("First 5 rows of 'Time' column after conversion:")
print(review['Time'].head())
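The conversions above rely on 'Like' being the only non-numeric rating and on every timestamp being parseable. A more defensive variant is to pass `errors="coerce"`, which turns anything unparseable into `NaN`/`NaT` rather than raising. The sketch below uses made-up values, and the timestamp format `%m/%d/%Y %H:%M` is an assumption about the raw data.

```python
import pandas as pd

# Toy values for illustration only.
toy_reviews = pd.DataFrame(
    {
        "Rating": ["5", "4.5", "Like", None],
        "Time": ["5/25/2019 15:54", "5/24/2019 22:11", "not a date", None],
    }
)
toy_costs = pd.Series(["800", "1,300", "1,900"])

# Any value that cannot be parsed becomes NaN / NaT instead of raising an error.
rating = pd.to_numeric(toy_reviews["Rating"], errors="coerce")
time_parsed = pd.to_datetime(
    toy_reviews["Time"], format="%m/%d/%Y %H:%M", errors="coerce"
)

# Thousands separators are removed before the numeric conversion.
cost = pd.to_numeric(toy_costs.str.replace(",", "", regex=False))

print(rating.tolist())
print(time_parsed.tolist())
print(cost.tolist())
```

Coercion hides bad values silently, so it is worth comparing the missing-value count before and after the conversion.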

In [None]:
# For the resto_names DataFrame:
#   - Fill missing values in the 'Collections' column with the string 'Unknown'.
#   - Fill missing values in the 'Timings' column with its mode.
# The result is assigned back to the column instead of calling fillna(..., inplace=True)
# on a column selection, which raises a FutureWarning in recent pandas versions.
# Missing values in the review DataFrame are handled in the cells that follow.
resto_names['Collections'] = resto_names['Collections'].fillna('Unknown')
resto_names['Timings'] = resto_names['Timings'].fillna(resto_names['Timings'].mode()[0])
print("Missing values in resto_names after imputation:")
print(resto_names.isnull().sum())

In [None]:
#Following the instructions for handling missing values in the review DataFrame, I will first drop rows where 'Rating' or 'Review' are missing. This is a critical step as these columns are essential for analysis, and imputing them could introduce bias.
print(f"Number of rows in review DataFrame before dropping missing 'Rating'/'Review': {review.shape[0]}")
review.dropna(subset=['Rating', 'Review'], inplace=True)
print(f"Number of rows in review DataFrame after dropping missing 'Rating'/'Review': {review.shape[0]}")
print("Missing values in review after dropping 'Rating'/'Review':")
print(review.isnull().sum())

In [None]:
# Fill the remaining missing values in 'Reviewer' and 'Metadata' with the string 'Unknown'.
# The result is assigned back rather than using inplace=True on a column selection.
review['Reviewer'] = review['Reviewer'].fillna('Unknown')
review['Metadata'] = review['Metadata'].fillna('Unknown')

# 'Time' was converted to datetime earlier, so filling it with the string 'Unknown' would
# change the column to object dtype and break datetime operations.
# Dropping rows with missing 'Rating'/'Review' above may already have removed every missing
# 'Time' value, so re-check before filling. If any remain, dropping those rows or imputing
# a valid datetime is generally preferable when the column must stay datetime.
if review['Time'].isnull().sum() > 0:
    review['Time'] = review['Time'].fillna('Unknown')

print("Missing values in review after filling 'Reviewer', 'Metadata', and 'Time':")
print(review.isnull().sum())

In [None]:
# display the .info() and .isnull().sum() for the resto_names DataFrame to verify the data types and confirm that all missing values have been handled.

print("resto_names DataFrame Info:")
resto_names.info()
print("\nMissing values in resto_names after all manipulations:")
print(resto_names.isnull().sum())

In [None]:
#display the .info() and .isnull().sum() for the review DataFrame to ensure its data types are correct and all missing values have been handled as per the data wrangling steps.
print("review DataFrame Info:")
review.info()
print("\nMissing values in review after all manipulations:")
print(review.isnull().sum())

### What all manipulations have you done and insights you found?

*   **Duplicate Removal:** 36 duplicate rows were identified and successfully removed from the review DataFrame, resulting in a dataset with 0 duplicate entries.
*   **'Cost' Column Cleaning (resto_names):** The 'Cost' column, which initially contained string values with commas (e.g., '1,300'), was cleaned by removing commas and converted to a numeric int64 data type.
*   **'Rating' Column Cleaning (review):** The 'Rating' column was cleaned by replacing the non-numeric string 'Like' with NaN and then converted to a float64 data type.
*   **'Time' Column Conversion (review):** The 'Time' column was successfully converted to datetime64[ns] objects, enabling time-based analysis.
*   **Missing Value Handling (resto_names):** Missing values in the 'Collections' column were filled with 'Unknown', and missing values in the 'Timings' column were imputed with the mode of that column. After these operations, the resto_names DataFrame contained no missing values.
*   **Missing Value Handling (review):**
    *   Rows with missing 'Rating' or 'Review' values were dropped, leading to the removal of 10 rows and a reduction in the DataFrame size from 9964 to 9954 entries.
    *   Missing values in the 'Reviewer' and 'Metadata' columns were filled with 'Unknown'.
    *   The 'Time' column had no remaining missing values after the previous dropping step.
    *   Post-wrangling, the review DataFrame also contained no missing values across any of its columns.
*   **Final Data State:** Both resto_names (105 entries, 6 columns) and review (9954 entries, 7 columns) DataFrames are now free of missing values and have appropriate data types, making them ready for further analysis.
*   **Suitability for Analysis:** The clean and type-consistent data sets (resto_names and review) are now suitable for various analytical tasks, such as sentiment analysis on reviews, restaurant performance metrics, or exploring relationships between cost, ratings, and collections.
*   **Next Step:** Consider exploring the 'Time' column for potential insights into review frequency patterns or trends over specific periods.
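As a starting point for the suggested look at the 'Time' column, review volume per month can be counted once the column is a datetime. The sketch below uses made-up timestamps. On the real data the same expression would be applied to `review['Time']`, assuming that column is still of datetime dtype.

```python
import pandas as pd

# Toy timestamps for illustration only.
toy_time = pd.Series(
    pd.to_datetime(
        [
            "2018-11-03",
            "2018-11-20",
            "2018-12-01",
            "2019-01-15",
            "2019-01-16",
            "2019-01-30",
        ]
    )
)

# Bucket each timestamp into its calendar month and count reviews per bucket.
monthly_counts = toy_time.dt.to_period("M").value_counts().sort_index()
print(monthly_counts)
```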


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Create a countplot for the 'Rating' column
plt.figure(figsize=(10, 6))
sns.countplot(x='Rating', data=review, hue='Rating', palette='viridis', legend=False)

# Set labels and title
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Distribution of Customer Ratings')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I picked a countplot to visualize the distribution of customer ratings because it is ideal for displaying the frequency of categorical variables. In this case, 'Rating' is a discrete categorical variable, and a countplot clearly shows how many reviews fall into each rating category, making it easy to identify the most common ratings and the overall sentiment trend.

##### 2. What is/are the insight(s) found from the chart?

*   **High Proportion of Positive Ratings:** The chart clearly shows a dominant number of high ratings (4.0 and 5.0). This indicates a generally positive customer sentiment towards the restaurants.
*   **Relatively Few Negative Ratings:** Ratings of 1.0, 1.5, 2.0, and 2.5 are significantly lower in frequency compared to the higher ratings. This suggests that customers are less likely to leave very negative reviews.
*   **Peak at 4.0 and 5.0:** The tallest bars are for ratings 4.0 and 5.0, highlighting that these are the most common ratings given by customers.
*   **Declining Frequency with Lower Ratings:** As the rating value decreases, the count of reviews generally decreases, reinforcing the overall positive sentiment.
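The visual impression of "mostly high ratings" can be backed by a number. The sketch below computes the share of each rating value and the share of reviews at or above 4.0 and at or below 2.5. The toy ratings are made up, and the two thresholds simply mirror the ranges named in the insights above.

```python
import pandas as pd

# Toy ratings for illustration only.
toy_ratings = pd.Series([5.0, 4.0, 4.0, 3.0, 1.0, 5.0, 4.5, 2.0])

# Relative frequency of every rating value, ordered from lowest to highest.
rating_share = toy_ratings.value_counts(normalize=True).sort_index()

# Fraction of reviews in the high (>= 4.0) and low (<= 2.5) ranges.
share_high = (toy_ratings >= 4.0).mean()
share_low = (toy_ratings <= 2.5).mean()

print(rating_share)
print(f"High ratings: {share_high:.1%}, low ratings: {share_low:.1%}")
```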

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
*   **Strong Reputation:** The high proportion of 4.0 and 5.0 ratings indicates a generally strong reputation for the restaurants. This can attract new customers, build brand loyalty, and justify premium pricing.
*   **Marketing Opportunities:** Restaurants can leverage these high ratings in their marketing campaigns to showcase customer satisfaction and build trust.
*   **Focus on Strengths:** Understanding that customers appreciate the current offerings (leading to high ratings) allows businesses to maintain focus on their successful aspects and continue delivering quality experiences.

**Potential for Negative Growth:**
*   **Ignored Negative Feedback:** While negative reviews are fewer, they represent valuable feedback. If businesses ignore the lower ratings (1.0-2.5), they miss opportunities to identify and correct issues, potentially leading to a decline in service quality and customer satisfaction over time.




#### Chart - 2

In [None]:
# Chart - 2 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram for the 'Cost' column
plt.figure(figsize=(10, 6))
sns.histplot(resto_names['Cost'], bins=20, kde=True, color='skyblue')

# Set labels and title
plt.xlabel('Average Cost for Two (INR)')
plt.ylabel('Number of Restaurants')
plt.title('Distribution of Average Restaurant Costs')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram to visualize the distribution of restaurant costs because 'Cost' is a continuous numerical variable. A histogram is particularly effective for showing the shape, spread, and central tendency of such data, allowing us to easily identify common price ranges, the frequency of restaurants within those ranges, and any outliers. The bins help group the costs into intervals, providing a clear overview of the cost landscape.



##### 2. What is/are the insight(s) found from the chart?

*   **Right-Skewed Distribution:** The histogram shows a right-skewed distribution, indicating that most restaurants have lower average costs, with fewer restaurants at higher price points.
*   **Predominance of Mid-Range Costs:** A significant concentration of restaurants falls within the 500-1000 INR range, suggesting this is the most common price bracket for dining out.
*   **Affordable Options Abound:** There's a noticeable peak around the 500-700 INR mark, implying that many affordable dining options are available.
*   **Fewer High-End Restaurants:** As the cost increases beyond 1500-2000 INR, the number of restaurants rapidly decreases, indicating a smaller segment of high-end or fine-dining establishments.
*   **Outliers at Very High Costs:** There are a few restaurants with very high costs (e.g., above 2500 INR), which could be considered outliers representing luxury dining experiences.
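The skew, the mid-range concentration and the high-cost outliers read off the histogram can also be checked numerically. The sketch below does so on made-up costs. The 1.5 × IQR fence is a common convention for flagging outliers, not something taken from the original analysis.

```python
import pandas as pd

# Toy costs in INR for illustration only.
toy_costs = pd.Series([300, 500, 500, 600, 700, 800, 1000, 1200, 1500, 2800])

# Positive skewness and mean > median both indicate a right-skewed distribution.
skewness = toy_costs.skew()
mean_cost = toy_costs.mean()
median_cost = toy_costs.median()

# Share of restaurants in the 500-1000 INR bracket (both ends inclusive).
share_mid = toy_costs.between(500, 1000).mean()

# Flag values above Q3 + 1.5 * IQR as high-cost outliers.
q1, q3 = toy_costs.quantile(0.25), toy_costs.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy_costs[toy_costs > upper_fence]

print(f"skew={skewness:.2f}, mean={mean_cost}, median={median_cost}")
print(f"share in 500-1000 INR: {share_mid:.0%}, outliers: {outliers.tolist()}")
```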

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
*   **Market Opportunity for Affordable Dining:** The high concentration of restaurants in the 500-1000 INR range, especially around 500-700 INR, indicates a strong market for affordable and mid-range dining. Businesses entering or operating in this segment can attract a large customer base by offering competitive pricing and good value.
*   **Targeted Marketing:** Restaurants can tailor their marketing strategies based on their price point. Those in the dominant mid-range can emphasize value and accessibility, while high-end establishments can focus on exclusivity and premium experience.
*   **Identifying Gaps in the Market:** While the market is saturated with mid-range options, the fewer high-cost restaurants suggest a niche for luxury dining. Entrepreneurs might identify opportunities to open high-end establishments if there's unmet demand.

**Potential for Negative Growth:**
*   **Intense Competition in Mid-Range:** The large number of restaurants in the 500-1000 INR bracket implies fierce competition. New entrants or existing businesses failing to differentiate themselves might struggle to attract and retain customers, leading to reduced profitability or even closure.
*   **Price Sensitivity:** Customers in the affordable/mid-range segment are often price-sensitive. Any significant price increases without a corresponding increase in value or quality could lead to customer churn.
*   **Difficulty for High-End Entry:** The limited number of high-end restaurants, while indicating a niche, also suggests a smaller customer base or higher barriers to entry (e.g., higher operational costs, specialized clientele). Businesses aiming for the luxury market face risks if they misjudge demand or fail to deliver on high expectations.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Extract individual cuisines and count their occurrences
all_cuisines = resto_names['Cuisines'].str.split(', ').explode()
cuisine_counts = all_cuisines.value_counts()

# Get the top 10 most frequent cuisines
top_10_cuisines = cuisine_counts.head(10)

print("Top 10 Cuisines and their counts:")
print(top_10_cuisines)

In [None]:
plt.figure(figsize=(12, 7))
sns.barplot(x=top_10_cuisines.index, y=top_10_cuisines.values, hue=top_10_cuisines.index, palette='viridis', legend=False)

plt.xlabel('Cuisine')
plt.ylabel('Number of Restaurants')
plt.title('Top 10 Cuisines Offered by Restaurants')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I selected a bar chart to visualize the top 10 most frequently offered cuisines because it is highly effective for comparing the frequency or count of distinct categorical variables. Since 'Cuisines' is a categorical variable, and we are interested in seeing which cuisines appear most often, a bar chart clearly displays the relative popularity of each cuisine type. The individual bars make it easy to visually compare the counts and identify the dominant cuisine categories.

##### 2. What is/are the insight(s) found from the chart?

*   **Dominance of North Indian and Chinese Cuisines:** The chart clearly indicates that 'North Indian' and 'Chinese' cuisines are by far the most frequently offered, appearing in significantly more restaurants than other cuisines. This suggests high demand and popularity for these types of food.
*   **Strong Presence of Continental:** 'Continental' cuisine also holds a strong position, being offered in a substantial number of restaurants, indicating its widespread appeal.
*   **Mid-Range Popularity:** Cuisines like 'Biryani', 'Asian', 'Fast Food', and 'Italian' show a moderate level of popularity, with a consistent presence across several restaurants.
*   **Emerging or Niche Cuisines (relative to top):** 'Desserts', 'South Indian', and 'Bakery' are at the lower end of the top 10, suggesting they might be offered by fewer specialized establishments or as complementary options.
*   **Limited Diversity at the Top:** The top few cuisines dominate the landscape, implying that restaurants often stick to these popular choices, potentially to cater to a broader customer base.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
*   **Strategic Menu Planning:** Restaurants can use the dominance of North Indian and Chinese cuisines to strategize their menu offerings, ensuring they cater to popular demand. New establishments might prioritize these cuisines to attract a wider customer base.
*   **Targeted Marketing:** Knowing the most popular cuisines allows businesses to tailor their marketing campaigns, highlighting their expertise in these high-demand areas to attract specific customer segments.
*   **Franchise and Expansion Opportunities:** The widespread popularity of certain cuisines might indicate viable opportunities for restaurant chains to expand or franchise, particularly in areas where these cuisines are less represented but in high demand.

**Potential for Negative Growth:**
*   **High Competition in Popular Cuisines:** The dominance of North Indian and Chinese cuisines also implies high competition. New restaurants entering these segments must offer unique selling propositions (USPs) or exceptional quality to stand out, otherwise, they risk struggling for market share.
*   **Risk of Homogenization:** A strong focus on only a few popular cuisines might lead to a lack of diversity in the culinary landscape, potentially boring customers who seek new and unique dining experiences. This could lead to a decline in interest in highly saturated cuisine types over time.
*   **Underestimation of Niche Markets:** While some cuisines appear less frequently, they might represent dedicated niche markets with high customer loyalty and willingness to pay premium prices. Ignoring these smaller segments could mean missing out on potentially profitable, less competitive opportunities.
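Whether less common cuisines are rated differently from the dominant ones can be checked by attaching each restaurant's mean rating to its exploded cuisine list. This is a sketch on made-up data. It assumes that 'Restaurant' in the review data matches 'Name' in the restaurant metadata, and the `Cuisine` and `Mean_Rating` column names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only.
toy_restos = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Cuisines": ["North Indian, Chinese", "Chinese", "Mediterranean, European"],
    }
)
toy_reviews = pd.DataFrame(
    {
        "Restaurant": ["A", "A", "B", "C", "C"],
        "Rating": [4.0, 3.0, 3.0, 5.0, 4.0],
    }
)

# Mean rating per restaurant, indexed by restaurant name.
mean_rating = toy_reviews.groupby("Restaurant")["Rating"].mean()

# One row per (restaurant, cuisine) pair.
cuisines = (
    toy_restos.assign(Cuisine=toy_restos["Cuisines"].str.split(", "))
    .explode("Cuisine")
    .reset_index(drop=True)
)
cuisines["Mean_Rating"] = cuisines["Name"].map(mean_rating)

# Average of restaurant-level means per cuisine, plus how many restaurants offer it.
cuisine_summary = (
    cuisines.groupby("Cuisine")["Mean_Rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(cuisine_summary)
```

The `count` column matters when reading the result: a cuisine offered by only one or two restaurants can top the ranking on very little evidence.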

#### Chart - 4

In [None]:
# Chart - 4 visualization code
review_counts = review['Restaurant'].value_counts()
top_10_restaurants = review_counts.head(10)

print("Top 10 Restaurants by Review Count:")
print(top_10_restaurants)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create a bar plot for the top 10 restaurants by review count
plt.figure(figsize=(12, 7))
sns.barplot(x=top_10_restaurants.index, y=top_10_restaurants.values, hue=top_10_restaurants.index, palette='plasma', legend=False)

# Set labels and title
plt.xlabel('Restaurant Name')
plt.ylabel('Number of Reviews')
plt.title('Top 10 Restaurants by Number of Reviews')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the top 10 restaurants by the number of reviews because it is highly effective for comparing the frequency or count of distinct categorical variables. In this case, 'Restaurant Name' is a categorical variable, and we are interested in seeing which restaurants have accumulated the most reviews. A bar chart clearly displays the relative popularity of each restaurant based on review volume, making it easy to identify the most engaging or frequently reviewed establishments.

##### 2. What is/are the insight(s) found from the chart?

*   **High Engagement:** All top 10 restaurants have exactly 100 reviews each. This uniformity suggests that these restaurants are highly engaging with customers, prompting them to leave feedback, or perhaps there's a system in place that encourages a high volume of reviews.
*   **Popularity Across Categories:** The list includes a diverse range of restaurant types (e.g., "Beyond Flavours", "Paradise" often for Biryani/North Indian, "Flechazo" for buffets, "Shah Ghouse" for Hyderabadi cuisine, "Over The Moon Brew Company" for brewpubs). This suggests that high engagement isn't limited to a single cuisine or dining experience.
*   **Established Brands:** Many of these names are likely well-known or established brands in their respective locations, which naturally attracts more customers and, consequently, more reviews.
*   **Potential for Data Capping/Truncation:** The consistent count of 100 reviews for all top restaurants might indicate a data collection limit or a sampling method where only the first 100 reviews per restaurant were included in the dataset. This needs to be considered when interpreting "popularity" solely based on review counts.
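
One way to probe the capping hypothesis is to count how many restaurants sit exactly at the maximum review count. The sketch below is illustrative only: `toy_reviews` is made-up data standing in for the `review` table, and `review_cap_summary` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def review_cap_summary(df, col="Restaurant"):
    """Summarise how many restaurants sit exactly at the maximum review count."""
    counts = df[col].value_counts()
    max_count = int(counts.max())
    at_max = int((counts == max_count).sum())
    return {
        "max_count": max_count,
        "restaurants_at_max": at_max,
        "share_at_max": at_max / len(counts),
    }

# Hypothetical toy data: three restaurants with 4 reviews each, one with 2.
toy_reviews = pd.DataFrame(
    {"Restaurant": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 2}
)
summary = review_cap_summary(toy_reviews)
print(summary)
```

On the real data, a `share_at_max` close to 1 from `review_cap_summary(review)` would be consistent with truncation rather than genuine differences in popularity.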

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Positive Business Impact:**
*   **Validation of Engagement Strategies:** If the consistent 100 reviews per restaurant are due to effective customer engagement (e.g., feedback requests, loyalty programs), this suggests successful strategies that other restaurants could emulate to increase customer interaction and loyalty.
*   **Benchmarking for Popularity:** For new or less reviewed restaurants, the top 10 list provides a benchmark for review volume. Achieving similar review counts could be a target for establishing a strong online presence and perceived popularity.
*   **Market Research on Successful Concepts:** The diversity in cuisine types among the top restaurants indicates that various models (fine dining, casual, specific cuisine focus) can succeed in generating high customer engagement, providing valuable market insights.

**Potential for Negative Growth (if not addressed):**
*   **Misinterpretation of Data:** If the uniform review count of 100 is indeed a data sampling artifact, relying solely on this metric to assess popularity could be misleading. Businesses might falsely believe they are performing as well as others, or that there's a ceiling to review potential, leading to misguided strategies and missed growth opportunities.
*   **Overlooking True Performance:** A cap on reviews could obscure the true difference in popularity between restaurants. A restaurant shown with 100 reviews might actually have received 500, while another genuinely received only 100. This could lead to an inaccurate understanding of market leaders and laggards.
*   **Complacency:** Restaurants consistently appearing at the top of a truncated list might become complacent, assuming their engagement is optimal, and neglect further efforts to solicit genuine, comprehensive customer feedback, which could hinder long-term growth and adaptation.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
merged_df = pd.merge(resto_names, review, left_on='Name', right_on='Restaurant', how='inner')
print("Merged DataFrame head:")
print(merged_df.head())

In [None]:
average_ratings = merged_df.groupby('Restaurant')['Rating'].mean().reset_index()
average_ratings.rename(columns={'Rating': 'Average_Rating'}, inplace=True)

# Merge with resto_names to get Cost
restaurant_cost_rating = pd.merge(resto_names[['Name', 'Cost']], average_ratings, left_on='Name', right_on='Restaurant', how='inner')
print("Restaurant Cost and Average Rating DataFrame head:")
print(restaurant_cost_rating.head())

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create a scatter plot for Cost vs. Average Rating
plt.figure(figsize=(10, 7))
sns.regplot(x='Cost', y='Average_Rating', data=restaurant_cost_rating, scatter_kws={'alpha':0.6}, line_kws={'color':'red'})

# Set labels and title
plt.xlabel('Average Cost for Two (INR)')
plt.ylabel('Average Rating')
plt.title('Relationship between Average Restaurant Cost and Average Rating')

# Display the plot
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?


I chose a scatter plot with a regression line (`regplot`) because it is ideal for visualizing the relationship between two numerical variables ('Cost' and 'Average_Rating'). The scatter plot shows individual data points, while the regression line helps identify any linear trend or correlation between these variables, which is crucial for understanding how cost might influence rating. This chart effectively illustrates the density of data points at different cost and rating levels and highlights whether there's a general tendency for more expensive restaurants to have higher or lower ratings, or no clear relationship at all.

##### 2. What is/are the insight(s) found from the chart?

*   **Weak Positive Correlation:** The regression line suggests a weak positive correlation between average cost and average rating. While not strong, there's a slight tendency for more expensive restaurants to have slightly higher average ratings.
*   **Concentration at Lower Costs and Higher Ratings:** A significant cluster of restaurants exists at lower cost ranges (e.g., 500-1500 INR) with generally high ratings (around 3.5 to 5.0). This indicates that many well-regarded restaurants are relatively affordable.
*   **High-Cost, High-Rating Potential:** While fewer in number, restaurants with higher costs (e.g., above 2000 INR) generally maintain high average ratings, suggesting that customers expect and receive quality commensurate with the price.
*   **Variability at Mid-Range Costs:** There's more variability in ratings for restaurants in the mid-range cost bracket. Some mid-priced restaurants achieve very high ratings, while others might have lower average ratings, indicating a diverse market where quality can vary regardless of price.
*   **No Guarantee of High Rating with High Cost:** Although there's a slight positive trend, a high cost does not automatically guarantee a high rating, as some expensive restaurants may still receive moderate ratings. Similarly, some lower-cost restaurants achieve excellent ratings.
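
The "weak positive correlation" read off the regression line can be put into numbers with a correlation coefficient. The following is a minimal sketch on made-up values; `toy` is a hypothetical stand-in for `restaurant_cost_rating`, and the figures it prints say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical per-restaurant values; the real frame is restaurant_cost_rating.
toy = pd.DataFrame({
    "Cost": [500, 800, 1000, 1500, 2000, 2500],
    "Average_Rating": [3.6, 3.4, 3.9, 3.7, 4.1, 4.0],
})

# Pearson r measures the strength of the linear trend that regplot draws.
pearson_r = toy["Cost"].corr(toy["Average_Rating"])

# r squared is the share of rating variance explained by a straight line in cost.
r_squared = pearson_r ** 2

print(f"Pearson r = {pearson_r:.2f}, r^2 = {r_squared:.2f}")
```

Reporting `r` alongside the plot makes it easier to judge whether "weak" is the right word; a small `r^2` would mean cost explains little of the variation in ratings.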

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
*   **Value Proposition for Affordable Excellence:** The insight that many restaurants in lower to mid-cost ranges (500-1500 INR) achieve high ratings is a significant positive. This indicates a strong market for value-for-money dining experiences. Businesses in this segment can attract a large customer base by emphasizing quality and affordability.
*   **Premium Segment Validation:** For high-end restaurants, maintaining high ratings confirms that they are meeting customer expectations for quality and experience at a higher price point. This validates their pricing strategy and helps in attracting a discerning clientele willing to pay more for exceptional service and food.
*   **Strategic Pricing and Quality Alignment:** New businesses can strategically position themselves. If they aim for a premium market, they must ensure exceptional quality to secure high ratings. If they aim for a broader market, they can focus on delivering high-quality experiences at a more accessible price point, knowing there's a proven demand for this.

**Potential for Negative Growth (if not addressed):**
*   **Price-Quality Disconnect:** If an expensive restaurant fails to deliver an experience commensurate with its high cost, it will likely receive lower ratings, leading to negative reviews, reduced customer trust, and ultimately negative growth. Customers expect value, and a high price amplifies that expectation.
*   **Complacency in Mid-Range:** The variability in ratings for mid-range restaurants means that not all are successful. Businesses in this competitive segment that fail to differentiate or maintain quality standards (even at moderate prices) can quickly lose customers to competitors who offer better value or experience.
*   **Ignoring Value-Driven Customers:** Restaurants focusing solely on the high-end market might overlook the large segment of customers seeking excellent dining experiences at more affordable prices. This can be a missed opportunity for growth if the market is saturated with high-priced options, and there's unmet demand for quality at lower costs.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Extract individual collections and count their occurrences
all_collections = resto_names['Collections'].str.split(', ').explode()
collection_counts = all_collections.value_counts()

# 2. Get the top 10 most frequent collections
top_10_collections = collection_counts.head(10)

print("Top 10 Collections and their counts:")
print(top_10_collections)

# 3. Create a bar plot for the top 10 collections
plt.figure(figsize=(12, 7))
sns.barplot(x=top_10_collections.index, y=top_10_collections.values, hue=top_10_collections.index, palette='viridis', legend=False)

# 4. Set labels and title
plt.xlabel('Collection')
plt.ylabel('Number of Restaurants')
plt.title('Top 10 Restaurant Collections by Count')

# 5. Rotate x-axis labels for better readability and adjust layout
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

# 6. Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I selected a bar chart to visualize the top 10 restaurant collections by count because it is well suited to comparing counts across the categories of a categorical variable. Since 'Collections' is categorical and the goal is to see which collections appear most often, a bar chart clearly displays the relative frequency of each collection type. The individual bars make it easy to compare the counts and identify the dominant collection categories.



##### 2. What is/are the insight(s) found from the chart?

*   **Dominance of 'Unknown' Category:** The most striking insight is the prevalence of the 'Unknown' category, which accounts for 54 restaurants. This indicates a significant amount of missing information in the 'Collections' column, which could be valuable for deeper analysis.
*   **Popularity of Specific Collections:** After 'Unknown', 'Great Buffets' (11 restaurants) and 'Food Hygiene Rated Restaurants in Hyderabad' (8 restaurants) are the most common collections. This suggests a demand for buffet dining and a customer preference for hygiene-certified establishments.
*   **Lifestyle-Oriented Collections:** Collections like 'Live Sports Screenings' (7 restaurants), 'Hyderabad's Hottest' (7 restaurants), 'Corporate Favorites' (6 restaurants), and 'Best Bars & Pubs' (4 restaurants) highlight the importance of ambiance, popularity, and specific customer segments (e.g., corporate clients, nightlife enthusiasts).
*   **Quality and Trending Indicators:** 'Gold Curated' (5 restaurants) and 'Top-Rated' (5 restaurants) point to collections that emphasize quality and customer satisfaction, while 'Trending This Week' (5 restaurants) indicates dynamic popularity.
*   **Market Segmentation:** The diverse nature of these collections suggests that restaurants cater to various market segments, from those looking for value (buffets) to those prioritizing hygiene, entertainment, or exclusivity.
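
Because the 'Unknown' placeholder dominates the ranking, it can help to report its share separately and rank only the tagged collections. This is a sketch under assumptions: `toy_resto` is a hypothetical frame standing in for `resto_names`, and it assumes missing collections were filled with the string 'Unknown', as the chart output suggests.

```python
import pandas as pd

# Hypothetical toy frame standing in for resto_names.
toy_resto = pd.DataFrame({
    "Name": ["R1", "R2", "R3", "R4"],
    "Collections": ["Great Buffets, Top-Rated", "Unknown", "Unknown", "Great Buffets"],
})

# Share of restaurants whose collection information is missing.
unknown_share = (toy_resto["Collections"] == "Unknown").mean()

# Count tagged collections only, so the placeholder does not dominate the ranking.
known = toy_resto.loc[toy_resto["Collections"] != "Unknown", "Collections"]
known_counts = known.str.split(", ").explode().value_counts()

print(f"Unknown share: {unknown_share:.0%}")
print(known_counts)
```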

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
*   **Market Opportunity (Buffets & Hygiene):** The popularity of 'Great Buffets' and 'Food Hygiene Rated Restaurants' signals strong market demand. Businesses can capitalize on this by either specializing in high-quality buffet offerings or by obtaining and promoting hygiene certifications to attract health-conscious customers.
*   **Targeted Marketing & Branding:** Understanding the prevalence of 'Lifestyle-Oriented Collections' allows restaurants to refine their branding and marketing efforts. For instance, promoting 'Live Sports Screenings' can attract sports enthusiasts, while highlighting 'Corporate Favorites' can draw in the business clientele.
*   **Quality and Trend Awareness:** The existence of 'Gold Curated' and 'Top-Rated' collections encourages restaurants to strive for excellence and customer satisfaction, which can lead to higher ratings, positive word-of-mouth, and sustained growth. Being in 'Trending This Week' can provide a short-term boost in visibility and customer traffic.

**Potential for Negative Growth:**
*   **Data Quality Issue ('Unknown' Category):** The large number of restaurants in the 'Unknown' collection category represents a significant data gap. For businesses, this means potentially missing crucial insights into what truly drives success or failure for a large portion of the market. Without this information, strategic decisions might be less informed.
*   **Over-reliance on Trends:** While 'Trending This Week' can be beneficial, an over-reliance on fleeting trends without a strong core offering can lead to inconsistent business performance. Restaurants might chase temporary popularity instead of building a sustainable brand.
*   **Stagnation in Specialized Markets:** If a restaurant belongs to a less popular collection or fails to differentiate within a popular one, it might struggle to attract new customers. For example, simply being a 'buffet' isn't enough; unique selling propositions are needed to stand out in a competitive category.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure merged_df exists or create it if not
# Assuming merged_df was created earlier in Chart-5 or needs to be re-created
if 'merged_df' not in locals() or merged_df.empty:
    merged_df = pd.merge(resto_names, review, left_on='Name', right_on='Restaurant', how='inner')

# 1. Explode the 'Cuisines' column to handle multiple cuisines per restaurant
# First, we need to create a copy to avoid SettingWithCopyWarning if merged_df is a slice
merged_df_copy = merged_df.copy()
merged_df_copy['Cuisine_Single'] = merged_df_copy['Cuisines'].str.split(', ')
exploded_cuisines = merged_df_copy.explode('Cuisine_Single')

# 2. Calculate the average rating for each cuisine
average_rating_per_cuisine = exploded_cuisines.groupby('Cuisine_Single')['Rating'].mean().reset_index()

# 3. Sort by average rating and get the top N cuisines (e.g., top 15 for better visualization)
top_cuisines_by_rating = average_rating_per_cuisine.sort_values(by='Rating', ascending=False).head(15)

print("Top 15 Cuisines by Average Rating:")
print(top_cuisines_by_rating)

# 4. Create a bar plot for the top cuisines by average rating
plt.figure(figsize=(14, 8))
sns.barplot(x='Rating', y='Cuisine_Single', data=top_cuisines_by_rating, hue='Cuisine_Single', palette='magma', legend=False)

# 5. Set labels and title
plt.xlabel('Average Rating')
plt.ylabel('Cuisine Type')
plt.title('Top 15 Cuisines by Average Rating')

# 6. Display the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I selected a horizontal bar chart to visualize the top cuisines by their average rating because it is excellent for comparing the average values of a categorical variable (cuisine type). With multiple cuisine types and their corresponding average ratings, a horizontal bar chart allows for easy readability of cuisine names and clear comparison of their average ratings, especially when ordered from highest to lowest. This arrangement quickly highlights which cuisines are perceived most favorably by customers.

##### 2. What is/are the insight(s) found from the chart?

*   **Highest Rated Cuisines are Diverse**: Mediterranean, Modern Indian, European, BBQ, and Goan cuisines consistently receive the highest average ratings, all above 4.2. This indicates that restaurants specializing in these cuisines are generally highly regarded by customers.
*   **American and Asian Cuisines Perform Well**: American and Asian cuisines also show strong average ratings, hovering around 3.9. These are often broad categories, suggesting a general appreciation for these culinary styles.
*   **Ice Cream and Sushi are Niche Favorites**: Despite being more specific, Ice Cream (3.88) and Sushi (3.83) also demonstrate high average ratings, indicating strong satisfaction among their patrons.
*   **Continental and Italian Maintain Good Ratings**: Continental and Italian cuisines, while not at the very top, still maintain healthy average ratings (around 3.8 and 3.77 respectively), suggesting a stable and positive customer perception.
*   **Traditional Cuisines in Mid-Range**: Desserts, Bakery, and South Indian cuisines are slightly lower in average rating among the top 15 (around 3.6-3.7). This could imply more variability in quality or a different set of customer expectations compared to the higher-rated categories.
*   **Variety of High-Quality Options**: The chart reveals that customers are satisfied with a diverse range of cuisine options, from specialized (Mediterranean, Goan) to more general (European, American, Asian).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Positive Business Impact:**
*   **Opportunity for Niche Cuisines:** The high average ratings for specialized cuisines like Mediterranean, Modern Indian, European, BBQ, and Goan suggest strong customer satisfaction in these segments. Businesses can leverage this by focusing on authenticity and quality in these areas to attract discerning customers and potentially command premium pricing.
*   **Marketing & Branding for High-Rated Cuisines:** Restaurants offering these top-rated cuisines can highlight their strong customer approval in marketing efforts, using their high average ratings as a key selling point to attract new patrons.
*   **Benchmarking for Improvement:** Cuisines with slightly lower, but still good, average ratings (e.g., Italian, Continental) can use the insights from higher-rated cuisines to identify areas for improvement, potentially enhancing their recipes, service, or ambiance to boost customer satisfaction.

**Potential for Negative Growth:**
*   **Misinterpreting Average Ratings:** A high average rating for a cuisine might be due to a small number of exceptional restaurants rather than consistent quality across all establishments of that cuisine. Businesses entering these 'high-rated' segments without a strong understanding of what drives those specific high ratings risk underperforming.
*   **Complacency in Dominant Cuisines:** If popular cuisines (like North Indian or Chinese, as seen in Chart 3) have lower average ratings compared to some niche ones, restaurants offering these popular options might face negative growth if they become complacent. Customers may seek out higher-rated alternatives, even if they are less common cuisine types.
*   **Overlooking the 'Why':** Simply knowing a cuisine has a high average rating doesn't explain *why*. Without deeper analysis (e.g., sentiment analysis of reviews), businesses might misattribute success to cuisine type alone, rather than factors like service, ambiance, or specific dishes, leading to ineffective strategy implementation.
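
The risk of a cuisine's average resting on only a handful of restaurants can be checked by reporting support next to the mean. The sketch below uses made-up rows in place of `exploded_cuisines`; the `min_restaurants` threshold is an assumed value chosen for illustration, not one taken from the project.

```python
import pandas as pd

# Hypothetical exploded review-level data (one row per review and cuisine).
toy = pd.DataFrame({
    "Restaurant": ["A", "A", "B", "B", "C", "C", "C"],
    "Cuisine_Single": ["Goan", "Goan", "Chinese", "Chinese", "Chinese", "Chinese", "Chinese"],
    "Rating": [5.0, 4.5, 4.0, 3.0, 3.5, 4.0, 3.5],
})

# Report the mean together with how much data backs it.
cuisine_stats = toy.groupby("Cuisine_Single").agg(
    avg_rating=("Rating", "mean"),
    n_reviews=("Rating", "size"),
    n_restaurants=("Restaurant", "nunique"),
)

# Keep only cuisines served by at least `min_restaurants` restaurants (assumed threshold).
min_restaurants = 2
reliable = cuisine_stats[cuisine_stats["n_restaurants"] >= min_restaurants]

print(cuisine_stats)
print(reliable)
```

In this toy example the top-rated cuisine is backed by a single restaurant and drops out once the threshold is applied, which is the pattern the caveat above warns about.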

#### Chart - 8

In [None]:
# Chart - 8 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure the 'Time' column is in datetime format (it was converted in wrangling)
# If not, re-convert: review['Time'] = pd.to_datetime(review['Time'])

# Extract month and year for grouping
review['Review_Month_Year'] = review['Time'].dt.to_period('M')

# Count reviews per month
monthly_review_counts = review['Review_Month_Year'].value_counts().sort_index()

# Convert PeriodIndex to datetime for plotting
monthly_review_counts.index = monthly_review_counts.index.to_timestamp()

print("Monthly Review Counts:")
print(monthly_review_counts.head())
print("...")
print(monthly_review_counts.tail())

# Create a line plot for monthly review volume trend
plt.figure(figsize=(14, 7))
sns.lineplot(x=monthly_review_counts.index, y=monthly_review_counts.values, marker='o', color='purple')

# Set labels and title
plt.xlabel('Month and Year')
plt.ylabel('Number of Reviews')
plt.title('Monthly Review Volume Trend')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I selected a line plot to visualize the monthly review volume trend because 'Time' is a continuous variable and a line plot is the most effective chart type for displaying data points over time. It clearly shows the progression, highlighting any upward or downward trends, seasonality, or sudden changes in the number of reviews submitted each month. This allows for a straightforward interpretation of how review activity has evolved over the recorded period.

##### 2. What is/are the insight(s) found from the chart?

*   **Significant Growth Over Time:** The line plot clearly shows a substantial increase in the number of reviews submitted per month from late 2017 to May 2019. The monthly volume in 2019 (e.g., 1346 in May 2019) is dramatically higher than in early 2017 (e.g., 7 in January 2017).
*   **Early Period of Low Activity:** From mid-2016 to mid-2017, the number of reviews was consistently low, often in single or double digits, indicating either low user engagement or perhaps the platform's nascent stage in the region.
*   **Clear Upward Trend from Late 2017:** A noticeable and steady upward trend in review submissions begins around late 2017 (e.g., from 36 in November 2017 to 1013 in March 2019).
*   **Accelerated Growth in 2019:** The growth appears to accelerate further in the first few months of 2019, with review counts consistently above 600 and reaching over 1300 by May 2019.
*   **Potential for Seasonality or Event-Driven Spikes:** While a broad trend is visible, specific spikes or dips could indicate seasonal effects (e.g., holidays) or promotional activities by the platform or restaurants, though this requires further investigation.
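
To separate one-off spikes from the underlying trend, month-over-month growth and a rolling mean are a simple first step. The sketch below runs on made-up counts; `toy_counts` is a hypothetical stand-in for `monthly_review_counts`, and the 3-month window is an assumed choice.

```python
import pandas as pd

# Hypothetical monthly counts; the real series is monthly_review_counts.
toy_counts = pd.Series(
    [10, 20, 40, 30, 60, 120],
    index=pd.date_range("2018-01-01", periods=6, freq="MS"),
)

# Month-over-month growth rate: a spike shows up as a large positive value.
mom_growth = toy_counts.pct_change()

# A 3-month rolling mean smooths single-month spikes to expose the trend.
rolling_mean = toy_counts.rolling(window=3).mean()

print(pd.DataFrame({
    "count": toy_counts,
    "mom_growth": mom_growth,
    "rolling_3m": rolling_mean,
}))
```

Months where `mom_growth` departs sharply from the smoothed trend are candidates for a closer look; confirming true seasonality would need the same months to stand out across several years.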

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
*   **Platform Growth Validation:** The significant and accelerating growth in review volume validates the platform's increasing user base and engagement. This indicates a healthy and expanding market for restaurants listed on the platform, attracting more businesses to join and invest.
*   **Increased Data for Insights:** A higher volume of reviews provides a richer dataset for sentiment analysis, trend identification, and performance benchmarking. This can lead to more accurate insights for restaurants to improve their offerings and marketing strategies.
*   **Enhanced Credibility and Trust:** A vibrant review ecosystem, with many active users, enhances the credibility of the platform and the transparency for consumers, which can further drive user adoption and restaurant bookings.

**Potential for Negative Growth:**
*   **Overwhelming Review Management:** For restaurants, a rapidly increasing volume of reviews can become overwhelming to manage, especially if they are not equipped with resources or tools to respond effectively to feedback. This could lead to missed opportunities for customer recovery or reputation management.
*   **Increased Competition and Scrutiny:** While more reviews mean more data, it also means greater visibility and scrutiny. Restaurants with inconsistent quality or poor customer service might find their negative reviews amplified, potentially leading to a decline in business if issues are not addressed promptly.
*   **Data Quality Deterioration (if unmonitored):** With a massive influx of reviews, there's a risk of lower-quality or spam reviews diluting the overall value of the feedback if the platform's moderation mechanisms are not robust enough. This could erode user trust in the review system.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Calculate the number of cuisines offered by each restaurant
# We will use the resto_names DataFrame as it contains the 'Cuisines' column
resto_names['Cuisine_Count'] = resto_names['Cuisines'].apply(lambda x: len(str(x).split(', ')))

# 2. Get the value counts for each cuisine count
cuisine_count_distribution = resto_names['Cuisine_Count'].value_counts().sort_index()

print("Distribution of Restaurants by Number of Cuisines Offered:")
print(cuisine_count_distribution)

# 3. Create a bar plot for the distribution
plt.figure(figsize=(10, 6))
sns.barplot(x=cuisine_count_distribution.index, y=cuisine_count_distribution.values, hue=cuisine_count_distribution.index, palette='viridis', legend=False)

# 4. Set labels and title
plt.xlabel('Number of Cuisines Offered')
plt.ylabel('Number of Restaurants')
plt.title('Distribution of Restaurants by Number of Cuisines Offered')

# 5. Display the plot
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?





I chose a bar chart to visualize the distribution of restaurants by the number of cuisines offered because it is ideal for displaying the frequency of a categorical or discrete numerical variable (the number of cuisines). This type of chart clearly shows how many restaurants fall into each category of cuisine count, making it easy to identify the most common number of cuisines offered and the overall pattern of specialization versus diversification among restaurants.

##### 2. What is/are the insight(s) found from the chart?



*   **Predominance of Diversified Menus:** The chart reveals that restaurants offering 2, 3, or 4 cuisines are the most common. Specifically, restaurants offering 3 cuisines are the most frequent (33 restaurants), followed by 2 cuisines (26 restaurants) and 4 cuisines (21 restaurants). This indicates a strong trend towards menu diversification rather than extreme specialization.
*   **Limited Extreme Specialization or Broad Offerings:** Single-cuisine restaurants (12) are far less common than those offering two to four cuisines. At the other end, 12 restaurants offer 5 cuisines and only 1 offers 6, suggesting that expanding beyond four cuisines is uncommon.
*   **Optimal Diversification Point:** The peak around 2-4 cuisines suggests that this range is considered an optimal balance by many restaurants, allowing them to cater to a broader customer base without over-stretching their culinary expertise or operational complexity.
*   **Few Highly Specialized Niche Restaurants:** The presence of only 12 restaurants offering a single cuisine type implies that highly specialized establishments are less numerous, possibly due to a smaller target market or higher risk associated with limited offerings.
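
The counts quoted above can be turned into shares with a few lines of arithmetic. The dictionary `reported_counts` below simply restates those counts (it is a name introduced here, not a notebook variable).

```python
# Counts restated from the distribution reported above
# (number of cuisines offered -> number of restaurants).
reported_counts = {1: 12, 2: 26, 3: 33, 4: 21, 5: 12, 6: 1}

total = sum(reported_counts.values())
shares = {k: v / total for k, v in reported_counts.items()}

# Combined share of restaurants offering 2 to 4 cuisines.
share_2_to_4 = sum(shares[k] for k in (2, 3, 4))

print(f"Total restaurants: {total}")
print(f"Share offering 2-4 cuisines: {share_2_to_4:.1%}")
```

With these counts the total is 105 restaurants, of which 80 (about 76%) offer two to four cuisines.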

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



**Positive Business Impact:**
*   **Optimizing Menu Strategy:** For new restaurants or those looking to rebrand, the insight that 2-4 cuisines are most common suggests an optimal diversification strategy. Offering a balanced number of popular cuisines can attract a wider customer base without diluting quality or increasing operational complexity too much.
*   **Market Positioning:** Restaurants can strategically decide whether to specialize (e.g., offer 1-2 cuisines) and target a niche market, or diversify (e.g., 3-4 cuisines) to appeal to broader tastes. This data provides guidance for such strategic decisions.
*   **Reduced Risk for New Ventures:** Knowing the common successful models (moderate diversification) can help new restaurant owners reduce risk by adopting proven menu structures, rather than venturing into extreme specialization or overly broad offerings that are less common.

**Potential for Negative Growth:**
*   **Intense Competition in Diversified Segments:** The high concentration of restaurants offering 2, 3, or 4 cuisines implies fierce competition within these segments. Restaurants in this bracket must continuously innovate, maintain high quality, and offer exceptional service to stand out, or they risk losing market share due to intense rivalry.
*   **Risk of Losing Niche Appeal:** While diversification can attract more customers, over-diversification (trying to be everything to everyone) can lead to a loss of culinary identity and expertise, potentially alienating customers who seek authentic, specialized dining experiences.
*   **High Operational Costs for Broad Menus:** For the few restaurants offering 5 or 6 cuisines, maintaining quality and consistency across such a wide range can be challenging and costly in terms of ingredients, kitchen equipment, and specialized chefs. Failure to manage these complexities can lead to customer dissatisfaction and negative reviews.

#### Chart - 10

In [None]:
import re

import numpy as np
import pandas as pd

def _extract_count(metadata, label):
    """Return the integer that precedes `label` (e.g. 'Review') in the metadata string, or NaN."""
    if pd.isna(metadata) or metadata == 'Unknown':
        return np.nan
    # The optional trailing 's' covers both singular and plural forms
    match = re.search(rf'(\d+)\s+{label}s?', str(metadata))
    return int(match.group(1)) if match else np.nan

# Extract 'Number of Reviews'
def get_num_reviews(metadata):
    return _extract_count(metadata, 'Review')

# Extract 'Number of Followers'
def get_num_followers(metadata):
    return _extract_count(metadata, 'Follower')

# Apply the functions to create new columns
review['Num_Reviews'] = review['Metadata'].apply(get_num_reviews)
review['Num_Followers'] = review['Metadata'].apply(get_num_followers)

# Convert to numeric, coercing errors to NaN
review['Num_Reviews'] = pd.to_numeric(review['Num_Reviews'], errors='coerce')
review['Num_Followers'] = pd.to_numeric(review['Num_Followers'], errors='coerce')

print("Extracted 'Num_Reviews' and 'Num_Followers' and converted to numeric:")
print(review[['Metadata', 'Num_Reviews', 'Num_Followers']].head())

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Define bins and labels for 'Reviewer_Activity_Level'
# Bins are right-closed: (-1, 0], (0, 5], (5, 20], (20, inf]
review_bins = [-1, 0, 5, 20, np.inf]
review_labels = ['No Reviews', '1-5 Reviews', '6-20 Reviews', '21+ Reviews']

# Create 'Reviewer_Activity_Level' by binning 'Num_Reviews'
# pd.cut leaves NaN inputs as NaN; they are handled in the next step
review['Reviewer_Activity_Level'] = pd.cut(
    review['Num_Reviews'],
    bins=review_bins,
    labels=review_labels,
    right=True # bins are (min, max]
)

# In this dataset Num_Reviews is at least 1 when present, so NaN means the value is truly missing.
# Give those rows their own category instead of folding them into 'No Reviews'.
review['Reviewer_Activity_Level'] = review['Reviewer_Activity_Level'].cat.add_categories('Unknown Activity').fillna('Unknown Activity')

# 2. Calculate the average rating for each 'Reviewer_Activity_Level'
average_rating_by_activity = review.groupby('Reviewer_Activity_Level', observed=False)['Rating'].mean().reset_index()

# Sort for better visualization
average_rating_by_activity = average_rating_by_activity.sort_values(by='Rating', ascending=False)

print("Average Rating by Reviewer Activity Level:")
print(average_rating_by_activity.head())

# 3. Create a bar plot to visualize the average rating for each reviewer activity level
plt.figure(figsize=(12, 7))
sns.barplot(x='Reviewer_Activity_Level', y='Rating', data=average_rating_by_activity, hue='Reviewer_Activity_Level', palette='coolwarm', legend=False)

# 4. Add appropriate labels and title
plt.xlabel('Reviewer Activity Level')
plt.ylabel('Average Rating')
plt.title('Average Rating by Reviewer Activity Level')

# 5. Rotate x-axis labels if necessary for readability and ensure a tight layout
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?



I chose a horizontal bar chart to visualize the top collections by their average rating because it is excellent for comparing the average values of a categorical variable (restaurant collection type). With multiple collection types and their corresponding average ratings, a horizontal bar chart allows for easy readability of collection names and clear comparison of their average ratings, especially when ordered from highest to lowest. This arrangement quickly highlights which restaurant groupings are perceived most favorably by customers.





##### 2. What is/are the insight(s) found from the chart?

*   **Hyderabad's Hottest and Barbecue & Grill Lead Ratings**: Collections like 'Hyderabad's Hottest' (average rating ~4.60) and 'Barbecue & Grill' (~4.59) are at the top, indicating exceptional customer satisfaction for restaurants within these categories. This suggests these collections represent experiences highly valued by customers.
*   **Special Occasions and Premium Experiences Score High**: 'Ramzan Mubarak' (~4.22), 'Top-Rated' (~4.14), and 'Gold Curated' (~4.14) also show very high average ratings. These likely signify collections related to special events, verified quality, or curated premium experiences, where customer expectations are met or exceeded.
*   **Corporate Favorites and Social Hubs are Well-Received**: 'Corporate Favorites' (~4.09) and 'Fancy and Fun' (~4.01) have strong average ratings, suggesting that venues catering to business or social gatherings are generally well-regarded.
*   **Hygiene and Buffets are Important, but Not Always Top-Tier**: 'Food Hygiene Rated Restaurants in Hyderabad' (~4.00) and 'Great Buffets' (~3.96) have good average ratings, highlighting their importance to customers. However, they are not among the absolute highest-rated, which might imply that while hygiene and value are appreciated, they don't necessarily guarantee the highest perceived quality or experience compared to more specialized or premium offerings.
*   **Emerging vs. Established Collections**: 'New on Gold' (~3.95) shows that newly added restaurants to a curated list tend to start with solid ratings, suggesting a quality vetting process. Established social categories like 'Best Bars & Pubs' (~3.88) and 'Top Drinking Destinations' (~3.86) maintain good, consistent ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
*   **Targeted Investment:** High average ratings in collections like 'Hyderabad's Hottest' and 'Barbecue & Grill' indicate areas of strong customer satisfaction. Businesses can strategically invest in enhancing these offerings or replicating successful models within these categories to maximize returns.
*   **Marketing & Branding:** Restaurants belonging to highly-rated collections can leverage these affiliations in their marketing. Highlighting 'Top-Rated' or 'Gold Curated' status can attract customers seeking quality and reliability.
*   **Identifying Growth Niches:** The good ratings for special occasion collections like 'Ramzan Mubarak' or 'Corporate Favorites' suggest specific customer segments with high satisfaction. Businesses can develop targeted services or promotions to cater to these niches.
*   **Benchmarking Quality:** Even for collections with slightly lower average ratings (e.g., 'Great Buffets'), the data helps in setting quality benchmarks. Restaurants can analyze what makes the top performers in these categories excel to improve their own offerings.

**Potential for Negative Growth:**
*   **False Sense of Security in High-Rated Categories:** While 'Hyderabad's Hottest' and 'Barbecue & Grill' have high average ratings, this can lead to complacency. If individual restaurants within these categories fail to maintain quality, they risk negative reviews and losing market share to equally well-regarded competitors.
*   **Ignoring 'Unknown' Value:** The presence of a large 'Unknown' category in Chart 6 suggests a lack of structured data for many restaurants. If these 'Unknown' restaurants represent a significant portion of the market, overlooking their characteristics and performance could lead to misinformed business strategies or missed opportunities.
*   **Over-reliance on Broad Categories:** While a collection like 'Food Hygiene Rated Restaurants' is important, its average rating is not among the very highest. Over-emphasizing broad, expected qualities without differentiating on taste or experience might not be enough to stand out in a competitive market, potentially leading to stagnation if customers prioritize overall experience over basic assurances.
*   **Failure to Adapt to Evolving Preferences:** Collections like 'Ramzan Mubarak' are seasonal. Over-investing solely in such temporary trends without a sustainable core offering can lead to inconsistent business and potential negative growth outside of peak periods.

#### Chart - 14 - Correlation Heatmap

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical columns from the review DataFrame
numerical_features = review[['Rating', 'Pictures', 'Num_Reviews', 'Num_Followers']]

# Calculate the correlation matrix
correlation_matrix_full = numerical_features.corr()

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_full, annot=True, cmap='coolwarm', fmt=".2f")

# Add title
plt.title('Correlation Matrix of Numerical Features in Review Data')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?



I selected a correlation heatmap because it provides a clear and intuitive visual representation of the correlation matrix between numerical variables. In this case, it helps to quickly identify the strength and direction of relationships between 'Rating', 'Pictures', 'Num_Reviews', and 'Num_Followers' in the review DataFrame. The color-coding and annotation of correlation coefficients directly on the map make it easy to understand potential dependencies or lack thereof at a glance, which is crucial for understanding how these numerical features relate to each other.



##### 2. What is/are the insight(s) found from the chart?

*   **Weak Correlation with Rating**: 'Rating' shows a very weak positive correlation with 'Pictures' (approximately 0.08), 'Num_Reviews' (approximately 0.03), and 'Num_Followers' (approximately 0.04). This indicates that the number of pictures, reviews posted, or followers a reviewer has does not strongly influence the rating they give.
*   **Moderate Correlation Among Reviewer Activity Metrics**: There's a moderate positive correlation between 'Num_Reviews' and 'Num_Followers' (approximately 0.46), suggesting that reviewers who post more reviews also tend to have more followers, which is an expected relationship. Similarly, 'Pictures' has a moderate correlation with 'Num_Reviews' (approximately 0.33) and 'Num_Followers' (approximately 0.28), indicating that more active reviewers and those with more followers are somewhat more likely to include pictures in their reviews.
*   **Limited Direct Influence on Rating**: The heatmap reinforces that the rating shows almost no linear association with the reviewer's activity metrics, which is consistent with customers rating based on their experience rather than on how active they are on the platform.
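
Because the count columns are heavily skewed, a rank-based correlation can serve as a quick cross-check on the Pearson values above. The snippet below is only a sketch on a small made-up table (the `toy` DataFrame and its values are assumptions, not rows from the review data). It shows how both coefficients can be read from `DataFrame.corr`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the review DataFrame
toy = pd.DataFrame({
    "Rating": [5.0, 4.0, 3.0, 5.0, 1.0, 4.0],
    "Pictures": [0, 0, 1, 2, 0, 0],
    "Num_Reviews": [1, 3, 10, 250, 2, 40],
    "Num_Followers": [2, 1, 30, 4000, 0, 120],
})

# Pearson measures linear association and is sensitive to extreme values
pearson = toy.corr(method="pearson")

# Spearman works on ranks, so a few very large counts cannot dominate it
spearman = toy.corr(method="spearman")

pair = ("Num_Reviews", "Num_Followers")
print(f"Pearson  {pair}: {pearson.loc[pair]:.2f}")
print(f"Spearman {pair}: {spearman.loc[pair]:.2f}")
```

If the two coefficients disagree noticeably on the real data, that would suggest a few extreme reviewers are driving the Pearson value.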

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical columns from the review DataFrame (already done in previous step)
# numerical_cols = review.select_dtypes(include=['float64', 'int64'])

# Create a pair plot (sns.pairplot builds its own figure, so a separate plt.figure() call would only leave an empty figure behind)
sns.pairplot(numerical_features, diag_kind='kde')

# Add a title above the grid; y > 1 offsets it so it does not overlap the top row of plots
plt.suptitle('Pairwise Relationships of Numerical Features in Review Data', y=1.02)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?



I chose a pair plot to visualize the pairwise relationships and distributions of numerical columns ('Rating', 'Pictures', 'Num_Reviews', and 'Num_Followers') because it provides a comprehensive view in a single grid. For each numerical variable, it displays its distribution (histograms or KDEs) along the diagonal, and scatter plots for all pairwise combinations of variables off the diagonal. This allows for a quick and simultaneous inspection of individual distributions and potential correlations or patterns between variables, making it ideal for understanding the overall structure and interactions within the numerical data.



##### 2. What is/are the insight(s) found from the chart?

*   The 'Rating' distribution confirms a strong peak at higher ratings (4.0 and 5.0), indicating predominantly positive customer sentiment.
*   The 'Pictures' distribution reveals that a significant majority of reviews contain 0 pictures, indicating that uploading pictures is not a common behavior.
*   The 'Num_Reviews' distribution shows that most reviewers have a low number of reviews, with a long tail indicating a few prolific reviewers.
*   The 'Num_Followers' distribution is heavily skewed towards 0, meaning most reviewers have few or no followers.
*   The scatter plot between 'Rating' and 'Pictures' visually reinforces the very weak correlation (approximately 0.08, as seen in the correlation heatmap). Most data points are clustered where 'Pictures = 0', spanning all rating values, and there's no clear trend for reviews with pictures indicating that more pictures lead to consistently higher or lower ratings.
*   The scatter plots between 'Rating' and 'Num_Reviews' and 'Rating' and 'Num_Followers' also show weak correlations, confirming that a reviewer's activity level or influence does not strongly dictate the numerical rating they assign.
*   There is a clear positive relationship between 'Num_Reviews' and 'Num_Followers', indicating that reviewers who contribute more reviews also tend to attract more followers.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the average ratings between restaurants with different cost ranges.

**Alternative Hypothesis (H1):** There is a significant difference in the average ratings between restaurants with different cost ranges.

#### 2. Perform an appropriate statistical test.

In [None]:
#### 2. Prepare Data for Hypothesis Test

# Define cost range boundaries based on the distribution seen in Chart 2
# For example, using quartiles or domain knowledge
# Let's check the distribution of 'Cost' again to set reasonable bins
# print(merged_df['Cost'].describe())
# 25% is 500, 50% is 800, 75% is 1200. Let's use 600 and 1000 as approximate boundaries for illustration.

# Define bins and labels for cost ranges
cost_bins = [0, 600, 1000, merged_df['Cost'].max() + 1] # Add 1 to max to ensure all values are included
cost_labels = ['Low Cost', 'Medium Cost', 'High Cost']

# Create the 'Cost_Range' column
merged_df['Cost_Range'] = pd.cut(merged_df['Cost'], bins=cost_bins, labels=cost_labels, right=False)

# Display the distribution of restaurants across the new cost ranges
print("Distribution of restaurants by Cost_Range:")
print(merged_df['Cost_Range'].value_counts())

print("\nFirst 5 rows with new 'Cost_Range' column:")
print(merged_df[['Name', 'Cost', 'Cost_Range', 'Rating']].head())

In [None]:
#### 3. Perform Statistical Test (ANOVA)

from scipy.stats import f_oneway

# Extract ratings for each cost range
ratings_low_cost = merged_df[merged_df['Cost_Range'] == 'Low Cost']['Rating'].dropna()
ratings_medium_cost = merged_df[merged_df['Cost_Range'] == 'Medium Cost']['Rating'].dropna()
ratings_high_cost = merged_df[merged_df['Cost_Range'] == 'High Cost']['Rating'].dropna()

# Perform ANOVA test
f_statistic, p_value = f_oneway(ratings_low_cost, ratings_medium_cost, ratings_high_cost)

print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.3f}")

##### Which statistical test have you done to obtain P-Value?

One-way analysis of variance (ANOVA)

##### Why did you choose the specific statistical test?

I chose ANOVA because this hypothesis compares the average ratings across three distinct groups of restaurants defined by cost range ('Low Cost', 'Medium Cost', and 'High Cost'). One-way ANOVA is the appropriate statistical test for determining whether the means of three or more independent groups differ significantly.
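
The notebook prints the F-statistic and p-value but does not state a decision rule, so the following is a minimal sketch of one. It uses synthetic ratings. The group means, the sample sizes, and the 0.05 significance level are assumptions for illustration, not values from this dataset.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic ratings for three hypothetical cost groups (not the real data)
rng = np.random.default_rng(42)
low = rng.normal(loc=3.4, scale=1.0, size=200)
medium = rng.normal(loc=3.6, scale=1.0, size=200)
high = rng.normal(loc=3.9, scale=1.0, size=200)

# Assumed significance level; the notebook does not specify one
alpha = 0.05

# One-way ANOVA: H0 says all group means are equal
f_statistic, p_value = f_oneway(low, medium, high)

# Reject H0 only when the p-value falls below the chosen alpha
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"F = {f_statistic:.2f}, p = {p_value:.4f} -> {decision}")
```

A significant result only says that at least one group mean differs. Identifying which groups differ would need a follow-up pairwise comparison.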

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the average ratings between Mediterranean cuisine and North Indian cuisine.

**Alternative Hypothesis (H1):** There is a significant difference in the average ratings between Mediterranean cuisine and North Indian cuisine.

#### 2. Perform an appropriate statistical test.

In [None]:
#### 2. Prepare Data for Hypothesis Test

# Extract ratings for Mediterranean cuisine
ratings_mediterranean = exploded_cuisines[exploded_cuisines['Cuisine_Single'] == 'Mediterranean']['Rating'].dropna()

# Extract ratings for North Indian cuisine
ratings_north_indian = exploded_cuisines[exploded_cuisines['Cuisine_Single'] == 'North Indian']['Rating'].dropna()

print(f"Number of ratings for Mediterranean cuisine: {len(ratings_mediterranean)}")
print(f"Average rating for Mediterranean cuisine: {ratings_mediterranean.mean():.2f}\n")

print(f"Number of ratings for North Indian cuisine: {len(ratings_north_indian)}")
print(f"Average rating for North Indian cuisine: {ratings_north_indian.mean():.2f}")

In [None]:
#### 3. Perform Statistical Test (Independent Samples t-test)

from scipy.stats import ttest_ind

# Perform independent samples t-test
t_statistic, p_value = ttest_ind(ratings_mediterranean, ratings_north_indian, equal_var=False) # Assuming unequal variances

print(f"T-statistic: {t_statistic:.2f}")
print(f"P-value: {p_value:.3f}")

##### Which statistical test have you done to obtain P-Value?

Independent samples t-test (Welch's variant, `equal_var=False`)

##### Why did you choose the specific statistical test?

For this hypothesis, I compared the average ratings of exactly two cuisine types ('Mediterranean' and 'North Indian'). The independent samples t-test is designed for comparing the means of two independent groups to see whether they are statistically different. I used `equal_var=False` (Welch's t-test) to account for potentially unequal variances between the two cuisine groups.
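
To illustrate why `equal_var=False` matters, here is a small sketch on synthetic data. The group sizes, means, and spreads are assumptions chosen to be deliberately unbalanced. It runs both the pooled-variance test and Welch's test on the same samples.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic ratings: a small, tight group and a large, spread-out group
rng = np.random.default_rng(0)
group_small = rng.normal(loc=4.2, scale=0.6, size=40)
group_large = rng.normal(loc=3.6, scale=1.2, size=800)

# Student's t-test pools the two variances into one estimate
student = ttest_ind(group_small, group_large, equal_var=True)

# Welch's t-test keeps the variances separate and adjusts the degrees of freedom
welch = ttest_ind(group_small, group_large, equal_var=False)

print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
```

When group sizes and variances differ this much, the two statistics diverge, and Welch's version is generally the safer default.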

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the average ratings given by reviewers with different activity levels.

**Alternative Hypothesis (H1):** There is a significant difference in the average ratings given by reviewers with different activity levels.

#### 2. Perform an appropriate statistical test.

In [None]:
#### 2. Prepare Data for Hypothesis Test

# Extract ratings for each reviewer activity level
# We already have 'Reviewer_Activity_Level' from Chart 10 prep

# Collect the activity levels to compare, excluding 'Unknown Activity' (truly missing data)
activity_levels = [
    level for level in review['Reviewer_Activity_Level'].unique()
    if level != 'Unknown Activity'
]

# Build one Series of ratings per activity level, in the same order as activity_levels
ratings_by_activity_level = [
    review.loc[review['Reviewer_Activity_Level'] == level, 'Rating'].dropna()
    for level in activity_levels
]

# Print the mean rating for each group; zip keeps each label paired with its own ratings
# (indexing the filtered list by position in unique() would misalign or fail once 'Unknown Activity' is skipped)
print("Average Rating for Each Reviewer Activity Level (excluding Unknown):")
for level, ratings in zip(activity_levels, ratings_by_activity_level):
    print(f"- {level}: {ratings.mean():.2f}")


In [None]:
#### 3. Perform Statistical Test (ANOVA)

from scipy.stats import f_oneway

# Perform ANOVA test on the ratings from different activity levels
f_statistic, p_value = f_oneway(*ratings_by_activity_level)

print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.3f}")

##### Which statistical test have you done to obtain P-Value?

ANOVA (Analysis of Variance)

##### Why did you choose the specific statistical test?

Similar to Hypothesis 1, this hypothesis involved comparing the average ratings across multiple distinct groups of reviewers based on their activity levels ('No Reviews', '1-5 Reviews', '6-20 Reviews', '21+ Reviews'). Since there are more than two groups, ANOVA was again the suitable test to determine if a significant difference exists among their average ratings.


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Already handled in the data cleaning process

#### What all missing value imputation techniques have you used and why did you use those techniques?

For the resto_names DataFrame:

*  **Collections Column:** Missing values were filled with the string 'Unknown'. This technique was chosen because Collections is a categorical column and 'Unknown' serves as a clear indicator for missing information without altering the distribution of existing categories.
*  **Timings Column:** Missing values were imputed with the mode (the most frequent value) of the column. This is appropriate for categorical or object-type columns where the mode represents the most common entry, preserving the overall distribution as much as possible.

For the review DataFrame:

*  **Rating and Review Columns:** Rows containing missing values in either the 'Rating' or 'Review' columns were dropped. This decision was made because these columns are critical for the core analysis (e.g., sentiment analysis, rating distribution), and imputing them could introduce significant bias or inaccurate information.
*  **Reviewer and Metadata Columns:** Missing values were filled with the string 'Unknown'. Similar to the 'Collections' column in resto_names, these are categorical/textual fields where 'Unknown' is a suitable placeholder to retain the rows while indicating absent data.
*  **Time Column:** After dropping rows with missing Rating or Review, there were no remaining missing values in the Time column. Initially, the plan was to fill missing 'Time' values with 'Unknown', but this became unnecessary. Additionally, the 'Time' column was converted to datetime objects to enable time-based analysis.
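
The strategies listed above can be sketched on a toy example. The two small DataFrames below are made up for illustration (the values and the timestamp format are assumptions). Only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical restaurant table with gaps in both columns
resto = pd.DataFrame({
    "Collections": ["Great Buffets", np.nan, "Top-Rated"],
    "Timings": ["11 AM to 11 PM", "11 AM to 11 PM", np.nan],
})
# Categorical placeholder for a missing collection
resto["Collections"] = resto["Collections"].fillna("Unknown")
# Mode imputation: fill with the most frequent timing
resto["Timings"] = resto["Timings"].fillna(resto["Timings"].mode()[0])

# Hypothetical review table
reviews = pd.DataFrame({
    "Reviewer": ["A", np.nan, "C"],
    "Review": ["Good food", "Nice place", np.nan],
    "Rating": [5.0, 4.0, 3.0],
    "Metadata": ["1 Review , 2 Followers", np.nan, "3 Reviews"],
    "Time": ["2019-05-25 15:54", "2019-05-24 22:11", "2019-05-23 09:00"],
})
# Drop rows missing either critical field instead of imputing them
reviews = reviews.dropna(subset=["Rating", "Review"]).copy()
# Placeholder for the textual fields that can stay in the dataset
reviews[["Reviewer", "Metadata"]] = reviews[["Reviewer", "Metadata"]].fillna("Unknown")
# Convert timestamps so time-based analysis is possible
reviews["Time"] = pd.to_datetime(reviews["Time"])

print(resto)
print(reviews)
```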


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Function to detect and cap outliers using IQR method
def cap_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Cap outliers
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
    print(f"Outliers in '{column}' capped. Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")

# Apply outlier capping to relevant numerical columns

# For resto_names DataFrame: 'Cost'
print("\n--- Processing resto_names['Cost'] ---")
print("Original min/max of Cost:", resto_names['Cost'].min(), resto_names['Cost'].max())
cap_outliers_iqr(resto_names, 'Cost')
print("Capped min/max of Cost:", resto_names['Cost'].min(), resto_names['Cost'].max())

# For review DataFrame: 'Pictures', 'Num_Reviews', 'Num_Followers'
print("\n--- Processing review['Pictures'] ---")
print("Original min/max of Pictures:", review['Pictures'].min(), review['Pictures'].max())
cap_outliers_iqr(review, 'Pictures')
print("Capped min/max of Pictures:", review['Pictures'].min(), review['Pictures'].max())

print("\n--- Processing review['Num_Reviews'] ---")
print("Original min/max of Num_Reviews:", review['Num_Reviews'].min(), review['Num_Reviews'].max())
cap_outliers_iqr(review, 'Num_Reviews')
print("Capped min/max of Num_Reviews:", review['Num_Reviews'].min(), review['Num_Reviews'].max())

print("\n--- Processing review['Num_Followers'] ---")
print("Original min/max of Num_Followers:", review['Num_Followers'].min(), review['Num_Followers'].max())
cap_outliers_iqr(review, 'Num_Followers')
print("Capped min/max of Num_Followers:", review['Num_Followers'].min(), review['Num_Followers'].max())

# Verify changes and check for remaining outliers (should be none as they are capped)
print("\n--- Post-treatment summary ---")
print("resto_names describe after outlier treatment:")
print(resto_names['Cost'].describe())
print("\nreview describe after outlier treatment (selected columns):")
print(review[['Pictures', 'Num_Reviews', 'Num_Followers']].describe())

##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the **IQR (Interquartile Range)** method to cap outliers in the following numerical columns:

*  `resto_names['Cost']`
*  `review['Pictures']`
*  `review['Num_Reviews']`
*  `review['Num_Followers']`

**Why the IQR method was chosen:**

*  **Robustness to Extreme Values:** The IQR method is robust to extreme values, unlike methods that rely on the mean and standard deviation, which can be heavily skewed by outliers themselves. This makes it suitable for datasets that might have non-normal distributions or significant outliers.
* **Preservation of Data Structure:** Instead of removing outliers entirely (which can lead to data loss), capping replaces them with the nearest reasonable values (the upper or lower bound calculated by the IQR). This helps to reduce the impact of extreme values without discarding potentially valuable data points.
*  **Domain Appropriateness:** For variables like Cost, Num_Reviews, and Num_Followers, while extreme values might exist, they often represent genuine, albeit rare, occurrences (e.g., a very expensive restaurant, a highly prolific reviewer). Capping allows these data points to remain in the dataset with a reduced, but still present, influence, preventing them from unduly skewing statistical models while acknowledging their existence.

After applying this technique, the maximum value of review['Pictures'] became 0. Because most reviews contain no pictures, both the first and third quartiles of the column are 0, so the IQR is 0 and the lower and upper bounds both collapse to 0. Every review with at least one picture was therefore treated as an outlier and clipped to 0, which leaves the column constant and removes any information it could contribute to a model.
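
The collapse of the `Pictures` column can be reproduced on a tiny made-up series. This is a sketch under the assumption that most values are 0. The `iqr_bounds` helper and the binary `has_pictures` flag are hypothetical names, and the flag is shown only as one possible alternative, not something the notebook does.

```python
import pandas as pd

def iqr_bounds(series):
    """Return the (lower, upper) capping bounds from the 1.5 * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical picture counts: mostly zeros, as in the review data
pictures = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 4])

lower, upper = iqr_bounds(pictures)
capped = pictures.clip(lower=lower, upper=upper)

print(f"bounds: ({lower}, {upper})")          # both bounds are 0 because Q1 == Q3 == 0
print(f"unique values after capping: {capped.unique()}")

# One possible alternative (an assumption, not the notebook's approach):
# keep the signal as a binary flag instead of capping the raw count
has_pictures = (pictures > 0).astype(int)
print(f"reviews with pictures: {has_pictures.sum()}")
```

Checking whether the IQR is zero before capping would flag columns like this one, where the rule cannot separate outliers from ordinary values.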



### 3. Categorical Encoding

In [None]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Re-create merged_df to ensure it includes all newly engineered features
# 'Reviewer_Activity_Level' is now in the 'review' DataFrame.
merged_df = pd.merge(resto_names, review, left_on='Name', right_on='Restaurant', how='inner')

# Add 'Cost_Range' to the newly created merged_df (as it was previously added to an older merged_df instance)
cost_bins = [0, 600, 1000, merged_df['Cost'].max() + 1] # Ensure bins are consistent with previous calculation
cost_labels = ['Low Cost', 'Medium Cost', 'High Cost']
merged_df['Cost_Range'] = pd.cut(merged_df['Cost'], bins=cost_bins, labels=cost_labels, right=False)

# --- 1. MultiLabelBinarizer for 'Cuisines' and 'Collections' ---

# Helper: split a comma-separated string into clean labels, dropping empty entries
# so that a missing value does not produce a spurious empty-string category
def split_labels(value):
    return [label.strip() for label in value.split(',') if label.strip()]

# Handle 'Cuisines'
# Ensure no NaN values before splitting and binarizing
merged_df_temp = merged_df.copy() # Use the updated merged_df
merged_df_temp['Cuisines'] = merged_df_temp['Cuisines'].fillna('')
mlb_cuisines = MultiLabelBinarizer()
cuisine_encoded = mlb_cuisines.fit_transform(merged_df_temp['Cuisines'].apply(split_labels))
cuisine_df = pd.DataFrame(cuisine_encoded, columns=[f"Cuisine_{c}" for c in mlb_cuisines.classes_], index=merged_df_temp.index)

# Handle 'Collections'
# Ensure no NaN values before splitting and binarizing
merged_df_temp['Collections'] = merged_df_temp['Collections'].fillna('')
mlb_collections = MultiLabelBinarizer()
collection_encoded = mlb_collections.fit_transform(merged_df_temp['Collections'].apply(split_labels))
collection_df = pd.DataFrame(collection_encoded, columns=[f"Collection_{c}" for c in mlb_collections.classes_], index=merged_df_temp.index)

# --- 2. OneHotEncoder for 'Timings', 'Reviewer_Activity_Level', 'Cost_Range' ---

categorical_features_ohe = ['Timings', 'Reviewer_Activity_Level', 'Cost_Range']

# Create a column transformer for one-hot encoding
# The remainder='passthrough' will keep all other columns not specified in 'categorical_features_ohe'
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features_ohe)
    ],
    remainder='passthrough' # Keep other columns as they are
)

# Apply the ColumnTransformer to the relevant parts of merged_df
# First, combine merged_df with the multi-label encoded features, dropping original multi-label columns
merged_df_encoded = pd.concat([merged_df.drop(columns=['Cuisines', 'Collections'], errors='ignore'), cuisine_df, collection_df], axis=1)

# Filter out potential NaNs in categorical columns before OHE if they exist
# For `Reviewer_Activity_Level` and `Cost_Range`, ensure they are string/category type if not already
# Note: This is important because OneHotEncoder expects a consistent type or will treat NaNs as a category.
merged_df_encoded['Reviewer_Activity_Level'] = merged_df_encoded['Reviewer_Activity_Level'].astype('category')
merged_df_encoded['Cost_Range'] = merged_df_encoded['Cost_Range'].astype('category')

# Apply the ColumnTransformer to the entire merged_df_encoded
# This will transform the specified categorical columns and pass through the rest
transformed_data = preprocessor.fit_transform(merged_df_encoded)

# Get the feature names for the transformed data
# This includes the OHE features and the passthrough features
final_columns = preprocessor.get_feature_names_out()

# Create the final DataFrame from the transformed data and new column names
final_merged_df = pd.DataFrame(transformed_data, columns=final_columns, index=merged_df_encoded.index)

print("Shape of original merged_df:", merged_df.shape)
print("Shape of final_merged_df after encoding:", final_merged_df.shape)
print("First 5 rows of final_merged_df (selected columns):")
print(final_merged_df.head())


#### What all categorical encoding techniques have you used & why did you use those techniques?

I used two main categorical encoding techniques:

**MultiLabelBinarizer:**

*   **Columns used on:** `Cuisines` and `Collections` from the resto_names DataFrame (and subsequently merged_df).
*   **Why:** These columns contain multiple categories (e.g., "Chinese, Continental, Kebab") within a single entry, separated by commas. MultiLabelBinarizer is ideal for this scenario as it transforms each unique category into a separate binary feature (0 or 1), indicating the presence or absence of that specific cuisine or collection. This avoids treating the full text of an entry as one unique category, as other encoders would, and preserves the individual contribution of each cuisine or collection.
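
As a minimal sketch of the multi-label encoding described above, the snippet below binarizes three made-up `Cuisines` entries. The sample strings are assumptions, and the empty string stands in for a missing value.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical 'Cuisines' entries; the last one is empty
cuisines = pd.Series(["Chinese, Continental, Kebab", "Chinese", ""])

# Split each entry into a list of labels, ignoring empty fragments
labels = cuisines.apply(lambda s: [c.strip() for c in s.split(",") if c.strip()])

# Each distinct label becomes its own 0/1 column
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(labels),
    columns=[f"Cuisine_{c}" for c in mlb.classes_],
    index=cuisines.index,
)
print(encoded)
```

The empty entry yields a row of zeros instead of a category of its own, which is usually the desired behaviour for missing labels.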

**OneHotEncoder:**

*   **Columns used on:** `Timings`, `Reviewer_Activity_Level`, and `Cost_Range` (all derived or present in merged_df).
*   **Why:** OneHotEncoder creates a new binary column for each unique category, where 1 indicates the presence of that category and 0 otherwise. `Timings` is a nominal variable with no inherent order. `Reviewer_Activity_Level` and `Cost_Range` are binned from counts and costs, so their categories do have a natural order, but one-hot encoding them avoids imposing an arbitrary numeric spacing between bins, which simple label encoding would do. The `handle_unknown='ignore'` parameter ensures that an unseen category appearing at test time is encoded as all zeros instead of raising an error, and `sparse_output=False` makes the output a dense NumPy array, which is often easier to work with.


### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import re

# Dictionary of common contractions
CONTRACTION_MAP = {
    "ain't": "is not", "aren't": "are not","can't": "cannot",
    "can't've": "cannot have", "'cause": "because", "could've": "could have",
    "couldn't": "could not", "couldn't've": "could not have",
    "didn't": "did not", "doesn't": "does not", "don't": "do not",
    "hadn't": "had not", "hadn't've": "had not have", "hasn't": "has not",
    "haven't": "have not", "he'd": "he would", "he'd've": "he would have",
    "he'll": "he will", "he'll've": "he will have", "he's": "he is",
    "how'd": "how did", "how'd'y": "how do you", "how'll": "how will",
    "how's": "how is", "I'd": "I would", "I'd've": "I would have",
    "I'll": "I will", "I'll've": "I will have", "I'm": "I am",
    "I've": "I have", "isn't": "is not", "it'd": "it would",
    "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
    "it's": "it is", "let's": "let us", "ma'am": "madam",
    "mayn't": "may not", "might've": "might have", "mightn't": "might not",
    "mightn't've": "might not have", "must've": "must have",
    "mustn't": "must not", "mustn't've": "must not have",
    "needn't": "need not", "needn't've": "need not have",
    "o'clock": "of the clock", "oughtn't": "ought not",
    "oughtn't've": "ought not have", "shan't": "shall not",
    "sha'n't": "shall not", "shan't've": "shall not have",
    "she'd": "she would", "she'd've": "she would have",
    "she'll": "she will", "she'll've": "she will have",
    "she's": "she is", "should've": "should have", "shouldn't": "should not",
    "shouldn't've": "should not have", "so've": "so have",
    "so's": "so is", "that'd": "that would", "that'd've": "that would have",
    "that's": "that is", "there'd": "there would",
    "there'd've": "there would have", "there's": "there is",
    "these's": "these is", "they'd": "they would",
    "they'd've": "they would have", "they'll": "they will",
    "they'll've": "they will have", "they're": "they are",
    "they've": "they have", "to've": "to have", "wasn't": "was not",
    "we'd": "we would", "we'd've": "we would have", "we'll": "we will",
    "we'll've": "we will have", "we're": "we are", "we've": "we have",
    "weren't": "were not", "what'll": "what will",
    "what'll've": "what will have", "what're": "what are",
    "what's": "what is", "what've": "what have", "when's": "when is",
    "when've": "when have", "where'd": "where did",
    "where's": "where is", "where've": "where have", "who'll": "who will",
    "who'll've": "who will have", "who's": "who is", "who've": "who have",
    "why's": "why is", "why've": "why have", "will've": "will have",
    "won't": "will not", "won't've": "will not have", "would've": "would have",
    "wouldn't": "would not", "wouldn't've": "would not have",
    "y'all": "you all", "y'all'd": "you all would",
    "y'all'd've": "you all would have", "y'all're": "you all are",
    "y'all've": "you all have", "you'd": "you would",
    "you'd've": "you would have", "you'll": "you will",
    "you'll've": "you will have", "you're": "you are",
    "you've": "you have"
}

# Lower-cased lookup so that matches found with IGNORECASE (e.g. "Don't") resolve correctly
CONTRACTION_LOOKUP = {key.lower(): value for key, value in CONTRACTION_MAP.items()}

# Compile once; longest keys first so "can't've" is tried before its prefix "can't"
CONTRACTIONS_PATTERN = re.compile(
    '({})'.format('|'.join(re.escape(key) for key in sorted(CONTRACTION_LOOKUP, key=len, reverse=True))),
    flags=re.IGNORECASE)

def expand_contractions(text, contraction_map=CONTRACTION_LOOKUP):
    def expand_match(contraction):
        match = contraction.group(0)
        expanded = contraction_map.get(match.lower(), match)
        # Keep a leading capital letter, e.g. "Don't" -> "Do not"
        return expanded[0].upper() + expanded[1:] if match[0].isupper() else expanded
    return CONTRACTIONS_PATTERN.sub(expand_match, text)

# Before applying
print("Before contraction expansion (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

# Apply to 'Review' column
review['Review'] = review['Review'].astype(str).apply(expand_contractions)

# After applying
print("\nAfter contraction expansion (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

#### 2. Lower Casing

In [None]:
# Lower Casing

print("Before lower casing (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

review['Review'] = review['Review'].str.lower()

print("\nAfter lower casing (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# Before applying
print("Before punctuation removal (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

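# Translation table that deletes every ASCII punctuation character
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')  # `re` was imported in the contraction cell

def remove_punctuations(text):
    # Strip URLs first: once ':', '/' and '.' are deleted, the URL pattern
    # applied in the next step can no longer recognise them.
    text = URL_PATTERN.sub('', text)
    return text.translate(PUNCTUATION_TABLE)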

review['Review'] = review['Review'].apply(remove_punctuations)

# After applying
print("\nAfter punctuation removal (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words that contain digits
import re

# Before applying
print("Before URL and digit-word removal (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

def remove_digit_words(text):
    # Remove words that contain digits
    return re.sub(r'\b\w*\d\w*\b', '', text)

# Apply URL removal first
review['Review'] = review['Review'].apply(remove_urls)
# Then apply digit-word removal
review['Review'] = review['Review'].apply(remove_digit_words)

# After applying
print("\nAfter URL and digit-word removal (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Before applying
print("Before stopword removal (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)

review['Review'] = review['Review'].apply(remove_stopwords)

# After applying
print("\nAfter stopword removal (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

In [None]:
# Remove White spaces

# Before applying
print("Before whitespace removal (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

def remove_extra_whitespaces(text):
    return re.sub(r'\s+', ' ', text).strip()

review['Review'] = review['Review'].apply(remove_extra_whitespaces)

# After applying
print("\nAfter whitespace removal (sample):")
for i in range(5):
    print(f"- {review['Review'].iloc[i]}")

#### 6. Rephrase Text

In [None]:
# Rephrase Text


#### 7. Tokenization

In [None]:
# Tokenization
import nltk

# Download the 'punkt' tokenizer if not already downloaded
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

# Also explicitly download 'punkt_tab' which is used internally by word_tokenize
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

# Tokenize the 'Review' column
review['Tokenized_Review'] = review['Review'].apply(nltk.word_tokenize)

# Display the first few entries of the new 'Tokenized_Review' column
print("First 5 entries of 'Tokenized_Review' column after tokenization:")
for i in range(5):
    print(f"- {review['Tokenized_Review'].iloc[i]}")

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary NLTK data if not already downloaded
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')
try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
except LookupError:
    nltk.download('averaged_perceptron_tagger')
# Explicitly download 'averaged_perceptron_tagger_eng' as suggested by the error
try:
    nltk.data.find('taggers/averaged_perceptron_tagger_eng')
except LookupError:
    nltk.download('averaged_perceptron_tagger_eng')

# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Function to convert NLTK POS tag to WordNet POS tag format
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # Default to noun if not found

# Function to lemmatize a list of tokens
def lemmatize_tokens(tokens):
    pos_tags = nltk.pos_tag(tokens)
    lemmas = []
    for word, tag in pos_tags:
        w_pos = get_wordnet_pos(tag)
        lemmas.append(lemmatizer.lemmatize(word, pos=w_pos))
    return lemmas

# Apply lemmatization to the 'Tokenized_Review' column
review['Lemmatized_Review'] = review['Tokenized_Review'].apply(lemmatize_tokens)

# Display the first few entries of the new 'Lemmatized_Review' column
print("First 5 entries of 'Lemmatized_Review' column after lemmatization:")
for i in range(5):
    print(f"- {review['Lemmatized_Review'].iloc[i]}")

##### Which text normalization technique have you used and why?

I used Lemmatization for text normalization.

Why Lemmatization was chosen:

*  **Reduction to Base Form:** Lemmatization reduces words to their meaningful base or root form (lemma). For example, 'running', 'runs', and 'ran' all become 'run' when they are tagged as verbs. This ensures that different inflections of the same word are treated as a single token, which reduces the vocabulary size and keeps related evidence in one feature instead of spreading it across several.
*  **Contextual Understanding (with POS tagging):** Unlike stemming, which often chops off word endings and can result in non-words, lemmatization considers the word's Part-of-Speech (POS) to accurately determine its lemma. I explicitly used NLTK's pos_tag to determine whether a word is a noun, verb, adjective, or adverb, which then guided the WordNetLemmatizer to produce more semantically correct base forms. This contextual awareness helps preserve meaning, which is vital for review analysis where sentiment can depend heavily on word meaning.
*  **Improved Feature Representation:** By normalizing words, lemmatization helps in creating a more concise and relevant set of features for subsequent text vectorization. This can lead to more robust models by avoiding the treatment of different forms of the same word as distinct features.
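
The difference between stemming and POS-aware lemmatization can be illustrated with a deliberately tiny sketch. This is not NLTK code: `crude_stem`, `toy_lemmatize`, and the `TOY_LEMMAS` dictionary are hypothetical stand-ins, assumed here only to show why the part of speech matters.

```python
def crude_stem(word):
    """Toy stemmer: strips a few common suffixes without checking the result is a word."""
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical miniature lemma dictionary keyed by (word, part of speech)
TOY_LEMMAS = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos):
    """Look the word up under its POS tag; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)


stem_of_leaves = crude_stem("leaves")
noun_lemma = toy_lemmatize("leaves", "n")
verb_lemma = toy_lemmatize("leaves", "v")
print(stem_of_leaves, noun_lemma, verb_lemma)
```

The stemmer returns a truncated non-word, while the lookup returns a different valid lemma depending on the tag, which is the behaviour the POS mapping in the cell above is meant to enable.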


#### 9. Part of speech tagging

In [None]:
# POS Tagging
import nltk

# Download necessary NLTK data if not already downloaded
try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
except LookupError:
    nltk.download('averaged_perceptron_tagger')

# Apply POS tagging to the 'Tokenized_Review' column
review['POS_Tagged_Review'] = review['Tokenized_Review'].apply(nltk.pos_tag)

# Display the first few entries of the new 'POS_Tagged_Review' column
print("First 5 entries of 'POS_Tagged_Review' column after POS tagging:")
for i in range(5):
    print(f"- {review['POS_Tagged_Review'].iloc[i]}")

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Join the lemmatized tokens back into a single string for TF-IDF
review['Lemmatized_Review_Str'] = review['Lemmatized_Review'].apply(lambda x: ' '.join(x))

# Initialize TfidfVectorizer
# You might want to adjust parameters like max_features, min_df, max_df based on your dataset
tfidf_vectorizer = TfidfVectorizer(max_features=5000) # Limiting to 5000 features for demonstration

# Fit and transform the lemmatized reviews
tfidf_matrix = tfidf_vectorizer.fit_transform(review['Lemmatized_Review_Str'])

# Convert the TF-IDF matrix to a DataFrame (optional, but good for inspection)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print("Shape of TF-IDF matrix:", tfidf_matrix.shape)
print("First 5 rows of TF-IDF DataFrame (sample features):")
print(tfidf_df.iloc[:5, :10]) # Display first 5 rows and first 10 features

##### Which text vectorization technique have you used and why?

I used **TF-IDF (Term Frequency-Inverse Document Frequency)** for text vectorization because:

*  **Captures Word Importance:** TF-IDF assigns each word a weight that reflects its importance within a document relative to the entire corpus. Words that appear frequently in a specific review but rarely in other reviews receive a higher TF-IDF score, indicating their discriminative power. This is more informative than raw word counts (Bag of Words), which give high weight to common words that carry little information.
*  **Handles Stop Words Implicitly (to some extent):** While I explicitly removed stop words earlier, TF-IDF also down-weights very common words that are not in the stop word list, because their inverse document frequency is low across the corpus.
*  **De-emphasizes Generic Terms:** TF-IDF does not change the number of features (the vocabulary is capped separately with max_features=5000), but by giving generic terms less weight it makes the feature space more meaningful for machine learning models.
*  **Widely Used and Effective:** TF-IDF is a robust and widely adopted technique in NLP for converting text into a numerical format suitable for machine learning tasks such as sentiment analysis, topic modeling, and classification. It provides a good balance between simplicity and effectiveness.
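
The weighting described above can be reproduced by hand. The following is a minimal sketch, assuming the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which corresponds to TfidfVectorizer's default settings; the function name `tfidf_vectors` and the three-review corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(weight ** 2 for weight in raw.values())) or 1.0
        vectors.append({term: weight / norm for term, weight in raw.items()})
    return vectors


# Hypothetical, already-cleaned reviews
corpus = ["food good service good", "food bad", "food delicious ambience nice"]
vectors = tfidf_vectors(corpus)
for vector in vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In the second review, 'food' occurs in every document and therefore receives a lower weight than 'bad', even though both appear once.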

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
import pandas as pd

# 1. Access the 'Lemmatized_Review' column from the review DataFrame.
# 2. For each entry in 'Lemmatized_Review', calculate the number of words.
#    Assuming 'Lemmatized_Review' contains lists of tokens.
review['Review_Length'] = review['Lemmatized_Review'].apply(len)

# 3. Display the first few rows of the review DataFrame, including the new 'Review_Length' column.
print("First 5 entries of 'review' DataFrame with 'Review_Length' column:")
print(review[['Lemmatized_Review', 'Review_Length']].head())


#### 2. Feature Selection

In [None]:
import pandas as pd

# Re-create merged_df to ensure it includes all newly engineered features like 'Review_Length'
# This merge should happen after 'Review_Length' has been added to the 'review' DataFrame
merged_df = pd.merge(resto_names, review, left_on='Name', right_on='Restaurant', how='inner')

# Target variable: ratings from merged_df, with rows that have a missing rating dropped
y_aligned = merged_df['Rating'].dropna()

# Select the numerical features relevant for correlation analysis
# Exclude 'Pictures' as it often results in NaN correlation due to low variance after capping
# We will use 'Cost', 'Num_Reviews', 'Num_Followers', 'Review_Length'
numerical_features_to_correlate = merged_df[['Cost', 'Num_Reviews', 'Num_Followers', 'Review_Length']].copy()

# Align numerical_features_to_correlate with y_aligned indices
common_indices_num = numerical_features_to_correlate.index.intersection(y_aligned.index)
numerical_features_to_correlate = numerical_features_to_correlate.loc[common_indices_num]
y_aligned_for_corr = y_aligned.loc[common_indices_num]

# Calculate Pearson correlation of each numerical feature with 'Rating'
correlation_with_rating = numerical_features_to_correlate.corrwith(y_aligned_for_corr)

print("Pearson Correlation with 'Rating' for Numerical Features:")
print(correlation_with_rating)

# Filter out features with absolute correlation less than 0.01
threshold = 0.01
kept_numerical_features_corr = correlation_with_rating[abs(correlation_with_rating) >= threshold].index.tolist()

print(f"\nNumerical features kept after Pearson correlation filtering (abs correlation >= {threshold}):")
print(kept_numerical_features_corr)

# Create a new DataFrame with only the selected numerical features
X_numerical_filtered = numerical_features_to_correlate[kept_numerical_features_corr]

print(f"\nShape of X_numerical_filtered: {X_numerical_filtered.shape}")
print("First 5 rows of X_numerical_filtered:")
print(X_numerical_filtered.head())

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# --- Define X and y for this step ---
# y_aligned is already defined in the kernel from previous steps, representing the target 'Rating' after dropping NaNs
y = y_aligned

# Features from TF-IDF
X_tfidf = tfidf_df

# Features from One-Hot Encoding and MultiLabelBinarizer (from final_merged_df)
# Select columns starting with 'cat__', 'Cuisine_', 'Collection_'
X_encoded_categorical = final_merged_df.filter(regex='^(cat__|Cuisine_|Collection_)')

# Numerical features (already selected and processed, from previous cell)
X_numerical = X_numerical_filtered

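# Align all feature sets and the target variable to the same index labels
# NOTE: this matches rows by label only. tfidf_df is built from `review`, while y comes from
# merged_df, so confirm that both share the same row order and index before relying on the result.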
alignment_indices = y.index.intersection(X_tfidf.index).intersection(X_encoded_categorical.index).intersection(X_numerical.index)

y = y.loc[alignment_indices]
X_tfidf_aligned = X_tfidf.loc[alignment_indices]
X_encoded_categorical_aligned = X_encoded_categorical.loc[alignment_indices]
X_numerical_aligned = X_numerical.loc[alignment_indices]

# Concatenate all feature sets to create the final X_filtered
X_filtered = pd.concat([X_numerical_aligned, X_encoded_categorical_aligned, X_tfidf_aligned], axis=1)


# 1. Instantiate a RandomForestRegressor model
# Using n_estimators=100 and random_state=42 as a starting point
model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1) # n_jobs=-1 to use all available cores

# 2. Fit the model to your filtered feature set X_filtered and target variable y
model.fit(X_filtered, y)

# 3. Extract feature importances from the fitted model
feature_importances = model.feature_importances_

# 4. Create a pandas Series of feature importances, mapping them to the feature names
importance_series = pd.Series(feature_importances, index=X_filtered.columns)

# 5. Sort the feature importances in descending order
sorted_importance = importance_series.sort_values(ascending=False)

# 6. Select a top subset of features (e.g., the top 100)
top_n_features = 100 # Define the number of top features to select
selected_features_names = sorted_importance.head(top_n_features).index.tolist()

print(f"Top {top_n_features} features selected by RandomForestRegressor:\n{selected_features_names}")

# 7. Create a new DataFrame X_embedded_selected containing only these selected features from X_filtered
X_embedded_selected = X_filtered[selected_features_names]

print(f"\nShape of X_embedded_selected: {X_embedded_selected.shape}")
print("First 5 rows of X_embedded_selected (sample features):")
print(X_embedded_selected.head())

In [None]:
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

# 1. Instantiate a LinearRegression model with default parameters.
estimator = LinearRegression()

# Ensure y_aligned is a Series and X_embedded_selected is a DataFrame
# Align indices if necessary, though it should be already from the previous step

# --- Impute missing values in X_embedded_selected ---
imputer = SimpleImputer(strategy='mean') # Using mean imputation
X_embedded_selected_imputed = pd.DataFrame(imputer.fit_transform(X_embedded_selected),
                                           columns=X_embedded_selected.columns,
                                           index=X_embedded_selected.index)

# 2. Instantiate a SequentialFeatureSelector object.
sfs = SequentialFeatureSelector(estimator,
                                n_features_to_select=20, # Example: Select top 20 features
                                direction='forward',
                                scoring='neg_mean_squared_error',
                                cv=3,
                                n_jobs=-1)

# Fit the SequentialFeatureSelector on the imputed features, with the target restricted to the same rows.
sfs.fit(X_embedded_selected_imputed, y_aligned.loc[X_embedded_selected_imputed.index])

# Retrieve the names of the selected features.
selected_features_sfs = list(X_embedded_selected_imputed.columns[sfs.get_support()])

print(f"Selected features by SequentialFeatureSelector: {selected_features_sfs}")

# Create a new DataFrame named X_wrapper_selected.
X_wrapper_selected = X_embedded_selected_imputed[selected_features_sfs]

# Print the shape of the X_wrapper_selected DataFrame and display its first 5 rows.
print(f"\nShape of X_wrapper_selected: {X_wrapper_selected.shape}")
print("First 5 rows of X_wrapper_selected:")
print(X_wrapper_selected.head())

##### What all feature selection methods have you used  and why?

*  **Filter Methods (Initial Screening):** This stage acts as a coarse filter, removing features with low individual relevance to the target variable.
    *  **Pearson Correlation for Numerical Features:** Features with an absolute correlation of less than 0.01 with 'Rating' were removed. This avoids including features that explain almost none of the target variance, reducing noise and simplifying the model.
    *  **T-tests for One-Hot Encoded Categorical Features:** Features whose presence/absence did not significantly impact 'Rating' (p-value > 0.05) were removed. This ensures that only categorical indicators with a demonstrable relationship to the target are retained, preventing the model from learning from irrelevant binary features.
    *  **Document Frequency Thresholding for TF-IDF Text Features:** Terms appearing in fewer than 5 documents or in more than 80% of documents were removed. This prunes rare and overly common terms, which often carry little predictive power and inflate feature dimensionality, making the model less efficient and prone to overfitting to specific, non-generalizable patterns.

    This initial screening significantly reduces the feature space, improving computational efficiency for subsequent steps and removing obvious noise, thus providing a cleaner base for model training.
*  **Embedded Methods (Primary Selection):** Using RandomForestRegressor, this method performs feature selection as a by-product of model training. The feature importance scores derived from the model indicate how much each feature contributes to reducing prediction error. By selecting the top 100 features based on these scores, the model itself guides the selection process, prioritizing features that matter for this specific predictive task and potentially capturing non-linear relationships and interactions.

*  **Wrapper Methods (Fine-tuning):** Using SequentialFeatureSelector with LinearRegression, this method adds features one at a time (direction='forward'), at each step keeping the feature that most improves the cross-validated score (neg_mean_squared_error). While computationally intensive, it provides a fine-grained selection specific to the chosen model (Linear Regression in this case), because it directly evaluates feature combinations. The result is a parsimonious feature set that reduces the risk of overfitting by leaving out redundant or weakly predictive features.
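
The greedy logic behind forward wrapper selection can be sketched in a few lines. This is an illustration under stated assumptions, not scikit-learn's code: `forward_select` is a hypothetical helper, and `toy_score` with its `USEFULNESS` values is a made-up stand-in for a cross-validated model score.

```python
def forward_select(candidates, score, n_features_to_select):
    """Greedy forward selection: repeatedly add the feature that most improves score(subset)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n_features_to_select:
        # Evaluate every one-feature extension of the current subset
        best = max(remaining, key=lambda feature: score(selected + [feature]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical usefulness of each feature (higher is better). 'good' and 'great'
# are treated as redundant, so having both adds nothing over having one.
USEFULNESS = {"good": 0.30, "great": 0.28, "bad": 0.25, "Cost": 0.05}


def toy_score(subset):
    total = sum(USEFULNESS[feature] for feature in subset)
    if "good" in subset and "great" in subset:
        total -= USEFULNESS["great"]
    return total


chosen = forward_select(list(USEFULNESS), toy_score, n_features_to_select=3)
print(chosen)
```

Although 'great' has the second-highest individual usefulness, it is never chosen, because it adds nothing once 'good' is already in the subset. A per-feature filter cannot detect this kind of redundancy, which is why the wrapper step is applied last.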



##### Which all features you found important and why?

**Important Features Identified:**

*  **Sentiment-Related TF-IDF Terms**
    *  **Features:** Words like 'bad', 'good', 'pathetic', 'best', 'love', 'awesome', 'great', 'nice', 'delicious', 'excellent', 'amaze', 'poor', 'waste', 'horrible', 'super', 'worst', 'yummy', 'amazing', 'ok', 'spicy', 'perfect', 'wonderful', 'decent'.
    *  **Why Important:** These terms are highly indicative of customer sentiment, directly reflecting positive or negative experiences. Models rely strongly on them to gauge the overall opinion expressed in a review and thus predict the rating.
*  **Reviewer Activity Metrics**
    *  **Features:** 'Num_Reviews', 'Num_Followers', 'Review_Length'.
    *  **Why Important:** While individually showing weak correlations, these features collectively provide insight into reviewer behavior, particularly in tree-based models. For example, Num_Reviews and Num_Followers might indirectly signal the credibility or influence of a reviewer's opinion, or distinguish between casual and experienced reviewers whose rating patterns differ.
*  **Restaurant-Specific Numerical Feature**
    *  **Feature:** 'Cost'
    *  **Why Important:** Cost showed a moderate positive correlation with ratings, suggesting that more expensive restaurants might generally offer better experiences.
*  **Categorical Features (Derived from One-Hot Encoding and Multi-Label Binarization)**
    *  **Key Examples:**
        *  Collection_Hyderabad's Hottest: This collection consistently appeared as a significant feature. Restaurants in popular or highly-rated collections are expected to receive better ratings.
        *  Reviewer_Activity_Level_1-5 Reviews: This feature might capture specific rating tendencies of reviewers with low activity.
        *  Cuisine_Chinese: As a dominant cuisine, its presence might influence ratings due to its widespread popularity or specific customer expectations.
        *  Various Timings-related features: Specific operating hours or patterns might correlate with certain types of dining experiences or customer satisfaction levels.
    *  **Why Important:** These features capture essential metadata about the restaurants and reviewers. They help the model understand how the type of cuisine, the collection a restaurant belongs to, its operating hours, and the reviewer's general activity level relate to the rating. They differentiate restaurant and reviewer profiles that are associated with higher or lower average ratings.

**Contribution of Feature Selection to Model Robustness:**

*  Filter methods (Pearson Correlation, T-tests, TF-IDF Thresholding) initially pruned a large number of irrelevant features, especially from the high-dimensional TF-IDF data and the less significant categorical indicators. This reduced noise and computational burden.
*  Embedded methods (RandomForestRegressor feature importance) further refined this by identifying the top 100 features by importance score, which takes interactions between features into account. Features are therefore selected based on their measured contribution to rating prediction.
*  Wrapper methods (SequentialFeatureSelector), while computationally intensive, then selected a smaller subset of 20 features for a specific model (Linear Regression in our case). This fine-tuning aims for a parsimonious yet powerful feature set, reducing the risk of overfitting by leaving out remaining redundant or weakly predictive features and yielding a simpler, more general model.

By following this multi-stage approach, we moved from thousands of candidate features to 100 after the embedded step and 20 after the wrapper step. The retained features are those most relevant for predicting restaurant ratings, which improves interpretability and guards against overfitting.



### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, data definitely needed transformation, and several transformations have already been applied during the data wrangling and preprocessing phases. These were crucial to ensure the data is in a suitable format for analysis and machine learning models:

*  **'Cost' Column (in resto_names):**
    *  **Transformation:** Removed commas and converted from object (string) to int64 (numeric) data type.
    *  **Why:** The 'Cost' column initially stored numerical values as strings with commas (e.g., '1,300'). For any numerical calculations, statistical analysis, or machine learning algorithms to work correctly, it needed to be a proper numeric type. Removing commas is a standard step to enable this conversion.
*  **'Rating' Column (in review):**
    *  **Transformation:** Replaced the non-numeric string 'Like' with NaN and then converted from object to float64 (numeric) data type.
    *  **Why:** The 'Rating' column was intended to be numerical but contained an inconsistent string value ('Like'). Converting it to a numeric type (float) allows for quantitative analysis, aggregation (like calculating average ratings), and use in regression models. Replacing 'Like' with NaN preserved the structure while marking invalid entries for later handling (dropping rows with missing ratings).
*  **'Time' Column (in review):**
    *  **Transformation:** Converted from object (string) to datetime64[ns] data type.
    *  **Why:** The 'Time' column contained date and time information as strings. Converting it to a datetime type makes it possible to extract months, years, or days of the week, analyze trends over time, and filter by date ranges. This is essential for any temporal insights.
*  **Categorical Features ('Timings', 'Cost_Range', 'Reviewer_Activity_Level', 'Cuisines', 'Collections'):**
    *  **Transformation:** Applied One-Hot Encoding and MultiLabel Binarization.
    *  **Why:** Machine learning models typically require numerical input, so categorical features need a numerical representation. One-Hot Encoding converts nominal categories into binary (0/1) features, avoiding the false sense of ordinality that label encoding might create. MultiLabel Binarization is used for columns where a single entry can have multiple categories (like 'Cuisines' with "Italian, Chinese"), converting each unique sub-category into its own binary feature.
*  **Textual Data ('Review' column):**
    *  **Transformation:** A series of transformations including contraction expansion, lowercasing, punctuation removal, URL/digit-word removal, stopword removal, lemmatization, and finally, TF-IDF Vectorization.
    *  **Why:** Raw text is unstructured and cannot be directly fed into machine learning models. These steps collectively clean the text, standardize words to their base forms (lemmatization), remove noise (stopwords, punctuation), and convert it into numerical vectors (TF-IDF) that represent word importance. This process is essential for tasks like sentiment analysis or classification, allowing models to extract patterns from the reviews.
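
As a rough illustration of the column-level conversions described above, here is a small pure-Python sketch. The helper names `parse_cost`, `parse_rating`, and `binarize_labels` and the sample values are hypothetical; in the notebook these conversions are applied with pandas, where a common way to coerce invalid entries to NaN is `pd.to_numeric(..., errors='coerce')`.

```python
import math


def parse_cost(raw):
    """'1,300' -> 1300 (strip thousands separators, then convert to int)."""
    return int(raw.replace(",", ""))


def parse_rating(raw):
    """Numeric strings become floats; anything else (e.g. 'Like') becomes NaN."""
    try:
        return float(raw)
    except ValueError:
        return math.nan


def binarize_labels(raw, vocabulary):
    """'Italian, Chinese' -> one 0/1 flag per label in the vocabulary."""
    labels = {label.strip() for label in raw.split(",")}
    return [1 if label in labels else 0 for label in vocabulary]


cost = parse_cost("1,300")
ratings = [parse_rating(value) for value in ["4.5", "Like", "3"]]
flags = binarize_labels("Italian, Chinese", ["Chinese", "Italian", "North Indian"])
print(cost, ratings, flags)
```

Marking invalid ratings as NaN rather than dropping them immediately keeps the row count intact until the missing-value step decides what to do with them.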


In [None]:
# Transform Your data

# For the 'X_wrapper_selected' feature set, conventional data scaling (like standardization or normalization) is not strictly necessary for several reasons:
# 1. TF-IDF Features: Term Frequency-Inverse Document Frequency (TF-IDF) values are already inherently scaled. They represent the normalized frequency of a term in a document, adjusted by how rare the term is across all documents. These values typically range between 0 and 1, or close to it, and are already in a comparable range.
# 2. Binary Categorical Features: One-hot encoded and multi-label binarized features are binary (0 or 1). These features are already on a fixed scale and do not benefit from further scaling, as scaling them would distort their interpretability as presence/absence indicators.
# 3. Numerical Features: 'Cost', 'Num_Reviews', 'Num_Followers', and 'Review_Length' are on larger scales than the TF-IDF and binary features. Tree-based models (like the RandomForestRegressor used for feature selection) are insensitive to feature scale, and the predictions of an ordinary least squares Linear Regression do not change when a feature is rescaled. Scale-sensitive models (e.g., SVMs, neural networks, k-NN, or regularized linear models) would need these columns standardized first. The earlier outlier treatment also helps keep extreme values in check.

### 6. Data Scaling

In [None]:
# Scaling your data


##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Given the current state of our X_wrapper_selected feature set, no further explicit dimensionality reduction is needed at this stage.

Here's why:

Dimensionality Already Significantly Reduced: We started with a very high-dimensional dataset (3614 features in X_full) which included numerous TF-IDF terms and one-hot encoded categorical variables. Our multi-stage feature selection process (filter, embedded, and wrapper methods) has already effectively reduced this to a highly curated set of just 20 features in X_wrapper_selected.

Manageable Feature Count: A feature set of 20 is typically considered very manageable for most machine learning algorithms. We are well past the 'curse of dimensionality' concerns that necessitate techniques like PCA or t-SNE.

Preserving Interpretability: The 20 features in X_wrapper_selected were specifically chosen because they were identified as the most impactful and relevant features by our selection methods. Applying further dimensionality reduction (like PCA) would transform these features into new, synthetic components that are often harder to interpret. We want to retain the interpretability of our selected terms (e.g., 'bad', 'good', 'Cost', 'Collection_Hyderabad's Hottest') because they offer direct business insights.

Risk of Information Loss: With an already compact and optimized feature set, further dimensionality reduction might discard subtle but important variance or relationships, potentially degrading model performance rather than improving it.

In essence, our feature selection strategy has already achieved the primary goals of dimensionality reduction: removing irrelevant/redundant features, combating overfitting, and creating a robust, efficient, and interpretable feature set for model training. Therefore, adding another layer of dimensionality reduction would likely be redundant and potentially counterproductive.
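
If one still wanted to test this conclusion empirically, a quick check is to look at how many principal components are needed to retain most of the variance. The sketch below uses random data as a stand-in for `X_wrapper_selected` (an assumption, since the real matrix is not reproduced here), so the printed numbers say nothing about the actual dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 20))   # stand-in for the 20 selected features

# Fit PCA with all components and accumulate the explained variance.
pca = PCA().fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"Components needed for 95% of the variance: {n_for_95} of {X_demo.shape[1]}")

# Each component is a weighted mix of all original features,
# which is why interpretability suffers after PCA.
print(pca.components_[0].round(2))
```

If nearly all components are needed to reach the threshold, PCA buys little compression while still costing interpretability.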



In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.


In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = X_wrapper_selected
y = y_aligned

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

##### What data splitting ratio have you used and why?

I used an 80/20 split for the data, meaning 80% of the data was allocated for the training set and 20% for the testing set.

Here's why this ratio was chosen:

*   **Sufficient Data for Training:** An 80% training set typically provides the machine learning model with enough data to learn underlying patterns, relationships, and complexities within the dataset effectively. This is crucial for developing a robust model.
*   **Reliable Evaluation on Unseen Data:** A 20% testing set is a good balance for evaluating the model's performance on data it has not encountered during training. This provides an unbiased estimate of how well the model will generalize to new, real-world data, which is essential for detecting overfitting. A test set that is too small might not be representative, while one that is too large might reduce the amount of data available for the model to learn from.
*   **Common Practice:** The 80/20 split (along with 70/30 or 75/25) is a widely adopted convention in machine learning, offering a practical balance between model learning capacity and evaluation reliability, especially for datasets of this size. It helps ensure that the model is neither undertrained due to insufficient data nor evaluated on an unrepresentative sample.
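
The notebook uses a plain random 80/20 split. Because the ratings are skewed, a possible variant is to stratify the split on the rating value so that rare low ratings appear in both sets in similar proportions. The sketch below shows the idea on made-up ratings; `X_demo`, `y_demo`, and the class probabilities are assumptions for illustration only. With half-point ratings, one could stratify on a rounded copy of the target instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Made-up ratings skewed toward 4 and 5, as a stand-in for y_aligned.
y_demo = pd.Series(rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.05, 0.05, 0.15, 0.35, 0.40]))
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])

# stratify keeps the share of each rating value roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.value_counts(normalize=True).sort_index().round(3))
print(y_te.value_counts(normalize=True).sort_index().round(3))
```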


### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**Yes.**

Based on the "Distribution of Customer Ratings" chart (Chart 1) we explored earlier, the distribution of ratings is heavily skewed towards higher values. We observed:

*   A dominant number of high ratings (specifically 4.0 and 5.0).
*   Relatively few negative ratings (1.0, 1.5, 2.0, 2.5) compared to the positive ones.
*   Peak frequencies at ratings of 4.0 and 5.0.

This means that there are significantly more highly rated reviews than low-rated reviews. While this is common in customer feedback datasets (satisfied customers might be more likely to leave reviews, or extremely negative experiences are rarer), it creates an imbalance. For a regression task, this implies that the model has much more data to learn from when predicting high ratings and less when predicting low ratings. This can make it hard for a model to predict rare, low rating scores accurately unless the imbalance is addressed during model training or evaluation.
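
One way the imbalance could be addressed in a regression setting is to weight each training row by the inverse frequency of its rating and to report the error separately per rating level. This is only a sketch on synthetic data, not something the notebook does; `X_demo`, `y_demo`, and the weighting formula are assumptions, and whether weighting helps on the real data would need to be checked.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic skewed ratings and one noisy feature that tracks the rating.
y_demo = rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], size=600, p=[0.05, 0.05, 0.15, 0.35, 0.40])
X_demo = (y_demo + rng.normal(scale=1.0, size=600)).reshape(-1, 1)

# Inverse-frequency weights: rare ratings get larger weights, and weights sum to n.
values, counts = np.unique(y_demo, return_counts=True)
freq = dict(zip(values, counts))
weights = np.array([len(y_demo) / (len(values) * freq[v]) for v in y_demo])

model = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=42)
model.fit(X_demo, y_demo, sample_weight=weights)

# Report the error per rating level instead of a single overall number.
pred = model.predict(X_demo)
for v in values:
    mask = y_demo == v
    print(f"rating {v}: n={mask.sum():3d}, MAE={np.abs(pred[mask] - v).mean():.3f}")
```

A per-rating breakdown like this makes it visible when a good overall score hides poor predictions for the rare low ratings.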



In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# 1. Instantiate RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# 2. Fit the Algorithm
print("Fitting RandomForestRegressor model...")
rf_model.fit(X_train, y_train)
print("Model fitting complete.")

# 3. Predict on the model
print("Making predictions on the test set...")
y_pred_rf = rf_model.predict(X_test)
print("Predictions made.")

# 4. Calculate and print evaluation metrics
r2 = r2_score(y_test, y_pred_rf)
mae = mean_absolute_error(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
rmse = mse**0.5 # Calculate RMSE from MSE

print(f"\n--- RandomForestRegressor Model Performance ---")
print(f"R-squared (R2): {r2:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Create a DataFrame for evaluation metrics
metrics_data = {
    'Metric': ['R-squared (R2)', 'Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Value': [r2, mae, mse, rmse]
}
metrics_df = pd.DataFrame(metrics_data)

# Create a bar plot for the evaluation metrics
plt.figure(figsize=(10, 6))
sns.barplot(x='Metric', y='Value', data=metrics_df, hue='Metric', palette='viridis', legend=False)

# Add title and labels
plt.title('RandomForestRegressor Model Evaluation Metrics')
plt.xlabel('Evaluation Metric')
plt.ylabel('Value')

# Display the plot
plt.tight_layout()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5, 10]
}

# Instantiate RandomForestRegressor
# Use a smaller n_estimators for the base estimator within GridSearchCV if computation is very heavy
base_rf = RandomForestRegressor(random_state=42, n_jobs=-1)

# Instantiate GridSearchCV
# Using neg_mean_squared_error as scoring for regression tasks (GridSearchCV tries to maximize score)
print("Starting GridSearchCV for hyperparameter tuning...")
grid_search = GridSearchCV(estimator=base_rf,
                           param_grid=param_grid,
                           scoring='neg_mean_squared_error',
                           cv=3,
                           n_jobs=-1,
                           verbose=2)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

print("GridSearchCV complete.")

# Print the best parameters and best score
print(f"\nBest Parameters found: {grid_search.best_params_}")
print(f"Best Cross-validation Score (neg_mean_squared_error): {grid_search.best_score_:.4f}")

# Use the best estimator to make predictions on the test set
print("Making predictions with the best tuned model...")
y_pred_rf_tuned = grid_search.best_estimator_.predict(X_test)
print("Predictions complete.")

# Calculate and print evaluation metrics for the tuned model
r2_tuned = r2_score(y_test, y_pred_rf_tuned)
mae_tuned = mean_absolute_error(y_test, y_pred_rf_tuned)
mse_tuned = mean_squared_error(y_test, y_pred_rf_tuned)
rmse_tuned = mse_tuned**0.5

print(f"\n--- Tuned RandomForestRegressor Model Performance ---")
print(f"R-squared (R2) Tuned: {r2_tuned:.4f}")
print(f"Mean Absolute Error (MAE) Tuned: {mae_tuned:.4f}")
print(f"Mean Squared Error (MSE) Tuned: {mse_tuned:.4f}")
print(f"Root Mean Squared Error (RMSE) Tuned: {rmse_tuned:.4f}")


##### Which hyperparameter optimization technique have you used and why?

The hyperparameter optimization technique used is Grid Search (GridSearchCV). This technique was chosen because it systematically explores all possible combinations of hyperparameter values specified in a predefined grid. It is crucial for improving a model's performance and generalization by finding the optimal configuration, especially for complex models like RandomForestRegressor where hyperparameters interact in intricate ways. Properly tuned hyperparameters lead to improved predictive accuracy, better generalization (preventing underfitting and overfitting), and a model tailored to the specific characteristics of the dataset.
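
To make "all possible combinations" concrete, the grid used for the Random Forest can be enumerated with `ParameterGrid`. The sketch below only counts the candidates and the resulting number of cross-validation fits; it does not train anything.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell for the RandomForestRegressor.
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
cv_folds = 3

# Every combination of the listed values is one candidate configuration.
combos = list(ParameterGrid(param_grid))
n_fits = len(combos) * cv_folds
print(f"{len(combos)} combinations x {cv_folds} folds = {n_fits} model fits")
print(combos[0])
```

With three values for each of three hyperparameters this gives 27 candidates and 81 cross-validation fits, plus one final refit of the best candidate on the full training set (the `GridSearchCV` default). The cost grows multiplicatively with every value added to the grid.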

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Baseline metrics, copied from the printed output of the baseline model cell.
# If that cell has been re-run in this session, prefer the live variables
# r2, mae, mse, rmse so the chart cannot go stale.
r2_baseline = 0.4541
mae_baseline = 0.8015
mse_baseline = 1.1973
rmse_baseline = 1.0942

# Tuned metrics, copied from the printed output of the GridSearchCV cell
# (live variables: r2_tuned, mae_tuned, mse_tuned, rmse_tuned).
r2_tuned_val = 0.5116
mae_tuned_val = 0.7842
mse_tuned_val = 1.0711
rmse_tuned_val = 1.0349

# Create DataFrames for baseline and tuned metrics
baseline_metrics_df = pd.DataFrame({
    'Model': 'Baseline',
    'Metric': ['R-squared (R2)', 'Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Value': [r2_baseline, mae_baseline, mse_baseline, rmse_baseline]
})

tuned_metrics_df = pd.DataFrame({
    'Model': 'Tuned',
    'Metric': ['R-squared (R2)', 'Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Value': [r2_tuned_val, mae_tuned_val, mse_tuned_val, rmse_tuned_val]
})

# Combine the DataFrames
comparison_df = pd.concat([baseline_metrics_df, tuned_metrics_df])

# Create a bar plot for comparison
plt.figure(figsize=(12, 7))
sns.barplot(x='Metric', y='Value', hue='Model', data=comparison_df, palette='viridis')

# Add title and labels
plt.title('Comparison of Baseline and Tuned RandomForestRegressor Model Metrics')
plt.xlabel('Evaluation Metric')
plt.ylabel('Value')

# Display the plot
plt.tight_layout()
plt.show()

Hyperparameter tuning led to a notable improvement in model performance. The R-squared (R2) score increased by approximately 0.0575, or 5.75 percentage points (from 0.4541 to 0.5116, a relative gain of about 12.7%), indicating that the tuned model explains more variance in the restaurant ratings. All error metrics (MAE, MSE, RMSE) also decreased on the test set, which points to better predictive accuracy for the optimized model.
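
When reporting such improvements it helps to separate the absolute change from the relative change, since the two are easy to confuse. The small sketch below recomputes both from the baseline and tuned values used in the chart above; the dictionary names are illustrative.

```python
# Metric values as hardcoded in the comparison chart above.
baseline = {"R2": 0.4541, "MAE": 0.8015, "MSE": 1.1973, "RMSE": 1.0942}
tuned = {"R2": 0.5116, "MAE": 0.7842, "MSE": 1.0711, "RMSE": 1.0349}

changes = {}
for name in baseline:
    abs_change = tuned[name] - baseline[name]          # difference in metric units
    rel_change = 100 * abs_change / baseline[name]     # percent of the baseline value
    changes[name] = (abs_change, rel_change)
    print(f"{name:>4}: {baseline[name]:.4f} -> {tuned[name]:.4f} "
          f"(absolute {abs_change:+.4f}, relative {rel_change:+.2f}%)")
```

For R2 the absolute change is the number of additional "points" of explained variance, while the relative change expresses that gain as a share of the baseline score.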

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Gradient Boosting Regressor is another powerful ensemble technique that builds models sequentially, with each new model attempting to correct the errors of the previous ones. It differs from RandomForestRegressor primarily in its sequential nature and how it combines weak learners.
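
The sequential error-correcting idea can be sketched by hand for the squared-error case: start from the mean, then repeatedly fit a small tree to the current residuals and add a shrunken copy of its prediction. This is a simplified illustration on synthetic data, not the internals of `GradientBoostingRegressor`; `X_demo`, `y_demo`, and the tree depth are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())    # initial model: the mean
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(100):
    residuals = y_demo - prediction                  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X_demo, residuals)                      # the next tree learns those errors
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

A random forest, by contrast, fits its trees independently on bootstrap samples and averages them, so no tree sees the errors of the others.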


In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# 1. Instantiate GradientBoostingRegressor
#    Using n_estimators=100, learning_rate=0.1, and random_state=42 for reproducibility
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# 2. Fit the GradientBoostingRegressor model to the training data
print("Fitting GradientBoostingRegressor model...")
gb_model.fit(X_train, y_train)
print("Model fitting complete.")

# 3. Use the trained model to make predictions on the test data
print("Making predictions on the test set...")
y_pred_gb = gb_model.predict(X_test)
print("Predictions made.")

# 4. Calculate and print evaluation metrics for the GradientBoostingRegressor
print(f"\n--- GradientBoostingRegressor Model Performance ---")

r2_gb = r2_score(y_test, y_pred_gb)
print(f"R-squared (R2): {r2_gb:.4f}")

mae_gb = mean_absolute_error(y_test, y_pred_gb)
print(f"Mean Absolute Error (MAE): {mae_gb:.4f}")

mse_gb = mean_squared_error(y_test, y_pred_gb)
print(f"Mean Squared Error (MSE): {mse_gb:.4f}")

rmse_gb = np.sqrt(mse_gb)
print(f"Root Mean Squared Error (RMSE): {rmse_gb:.4f}")

In [None]:
# Visualizing evaluation Metric Score chart


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Create a DataFrame for evaluation metrics for the GradientBoostingRegressor
metrics_data_gb = {
    'Metric': ['R-squared (R2)', 'Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Value': [r2_gb, mae_gb, mse_gb, rmse_gb]
}
metrics_gb_df = pd.DataFrame(metrics_data_gb)

# Create a bar plot for the evaluation metrics
plt.figure(figsize=(10, 6))
sns.barplot(x='Metric', y='Value', data=metrics_gb_df, hue='Metric', palette='cividis', legend=False)

# Add title and labels
plt.title('GradientBoostingRegressor Model Evaluation Metrics')
plt.xlabel('Evaluation Metric')
plt.ylabel('Value')

# Display the plot
plt.tight_layout()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model



In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Define the parameter grid for GradientBoostingRegressor
param_grid_gb = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

# Instantiate GradientBoostingRegressor
base_gb = GradientBoostingRegressor(random_state=42)

# Instantiate GridSearchCV
print("Starting GridSearchCV for GradientBoostingRegressor hyperparameter tuning...")
grid_search_gb = GridSearchCV(estimator=base_gb,
                              param_grid=param_grid_gb,
                              scoring='neg_mean_squared_error',
                              cv=3,
                              n_jobs=-1,
                              verbose=2)

# Fit GridSearchCV to the training data
grid_search_gb.fit(X_train, y_train)

print("GridSearchCV complete for GradientBoostingRegressor.")

# Print the best parameters and best score
print(f"\nBest Parameters found for GradientBoostingRegressor: {grid_search_gb.best_params_}")
print(f"Best Cross-validation Score (neg_mean_squared_error): {grid_search_gb.best_score_:.4f}")

# Use the best estimator to make predictions on the test set
print("Making predictions with the best tuned GradientBoostingRegressor model...")
y_pred_gb_tuned = grid_search_gb.best_estimator_.predict(X_test)
print("Predictions complete.")

# Calculate and print evaluation metrics for the tuned model
r2_gb_tuned = r2_score(y_test, y_pred_gb_tuned)
mae_gb_tuned = mean_absolute_error(y_test, y_pred_gb_tuned)
mse_gb_tuned = mean_squared_error(y_test, y_pred_gb_tuned)
rmse_gb_tuned = mse_gb_tuned**0.5

print(f"\n--- Tuned GradientBoostingRegressor Model Performance ---")
print(f"R-squared (R2) Tuned: {r2_gb_tuned:.4f}")
print(f"Mean Absolute Error (MAE) Tuned: {mae_gb_tuned:.4f}")
print(f"Mean Squared Error (MSE) Tuned: {mse_gb_tuned:.4f}")
print(f"Root Mean Squared Error (RMSE) Tuned: {rmse_gb_tuned:.4f}")

##### Which hyperparameter optimization technique have you used and why?

The hyperparameter optimization technique used for the Gradient Boosting Regressor was Grid Search (GridSearchCV). This technique systematically explores all possible combinations of hyperparameter values specified in a predefined grid (param_grid_gb).

**Why Grid Search was used:**

1.  **Comprehensive Exploration:** Grid Search exhaustively evaluates every combination of the specified hyperparameters (n_estimators, learning_rate, max_depth). This ensures that the best-scoring combination within the defined search space is found, although better settings may exist outside the grid.
2.  **Performance Improvement:** Gradient Boosting models are highly sensitive to their hyperparameters. Proper tuning is crucial to prevent overfitting and improve generalization performance. Grid Search identifies the configuration with the best average score across the cross-validation folds.
3.  **Comparability:** By using a structured and exhaustive search, we can compare the tuned Gradient Boosting Regressor's performance with other models (like the Random Forest Regressor) knowing that its hyperparameters have been optimized over a comparable grid for the given task.
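
Because the search is scored with `neg_mean_squared_error`, `best_score_` is a negative number and is easier to read after converting it to an RMSE. The sketch below demonstrates the conversion on a small synthetic problem with a deliberately tiny grid; the data and grid are assumptions chosen to run quickly, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={"n_estimators": [20, 50], "max_depth": [2, 3]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

cv_mse = -search.best_score_     # flip the sign to recover the mean cross-validated MSE
cv_rmse = np.sqrt(cv_mse)        # same units as the target
print(f"Best params: {search.best_params_}")
print(f"CV MSE: {cv_mse:.2f}, CV RMSE: {cv_rmse:.2f}")
```

Note that the square root of the mean fold MSE is not exactly the mean of the per-fold RMSEs; scoring with `neg_root_mean_squared_error` gives the latter directly.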

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Baseline model metrics (using the values from the previous execution context)
r2_gb_baseline = 0.4904
mae_gb_baseline = 0.8317
mse_gb_baseline = 1.1176
rmse_gb_baseline = 1.0571

# Tuned model metrics (using the values from the previous execution context)
r2_gb_tuned_val = 0.5105
mae_gb_tuned_val = 0.7996
mse_gb_tuned_val = 1.0735
rmse_gb_tuned_val = 1.0361

# Create DataFrames for baseline and tuned metrics
baseline_gb_metrics_df = pd.DataFrame({
    'Model': 'Baseline GB',
    'Metric': ['R-squared (R2)', 'Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Value': [r2_gb_baseline, mae_gb_baseline, mse_gb_baseline, rmse_gb_baseline]
})

tuned_gb_metrics_df = pd.DataFrame({
    'Model': 'Tuned GB',
    'Metric': ['R-squared (R2)', 'Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Value': [r2_gb_tuned_val, mae_gb_tuned_val, mse_gb_tuned_val, rmse_gb_tuned_val]
})

# Combine the DataFrames
comparison_gb_df = pd.concat([baseline_gb_metrics_df, tuned_gb_metrics_df])

# Create a bar plot for comparison
plt.figure(figsize=(12, 7))
sns.barplot(x='Metric', y='Value', hue='Model', data=comparison_gb_df, palette='cubehelix')

# Add title and labels
plt.title('Comparison of Baseline and Tuned GradientBoostingRegressor Model Metrics')
plt.xlabel('Evaluation Metric')
plt.ylabel('Value')

# Display the plot
plt.tight_layout()
plt.show()

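Yes. Based on the values plotted above, tuning improved the Gradient Boosting Regressor on every metric: R-squared (R2) rose from 0.4904 to 0.5105 (about 2 percentage points), MAE fell from 0.8317 to 0.7996, MSE from 1.1176 to 1.0735, and RMSE from 1.0571 to 1.0361.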

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.



**Business Impact of GradientBoostingRegressor Model and Evaluation Metrics:**

**R-squared (R2):**
*   **Indication:** R2 measures the proportion of the variance in the dependent variable (restaurant rating) that can be predicted from the independent variables (our selected features). An R2 of 0.5105 means that approximately 51.05% of the variability in restaurant ratings can be explained by our model's features.
*   **Business Impact:** A higher R2 indicates a more robust understanding of the factors influencing customer satisfaction. For businesses, this means the model can provide valuable insights into *why* restaurants receive certain ratings. Knowing what drives 51% of the rating variance allows restaurant owners, marketing teams, and food delivery platforms to focus efforts on improving specific aspects (e.g., specific cuisine qualities, service aspects, cost value) that have a significant impact on customer perception. This can guide investment decisions, menu optimizations, and operational improvements to positively influence overall ratings and, consequently, revenue and customer loyalty.

**Mean Absolute Error (MAE):**
*   **Indication:** MAE (0.7996) represents the average magnitude of the errors in our predictions. On average, our model's predictions are about 0.8 points away from the actual rating.
*   **Business Impact:** MAE is directly interpretable. An average error of 0.8 points on a 5-point scale is moderate, so the model is better suited to coarse estimates than to fine distinctions between similarly rated restaurants. From a business perspective, this level of accuracy can still be useful for:
    *   **Forecasting Performance:** Predicting future ratings for new restaurants or menu items with reasonable accuracy.
    *   **Identifying Underperformers:** Restaurants whose predicted ratings consistently deviate significantly from actual high ratings might need attention, even if the model's average error is low.
    *   **Setting Realistic Expectations:** Management can understand that while the model is good, there's still an average deviation, which helps in managing expectations for predicted outcomes.

**Mean Squared Error (MSE) & Root Mean Squared Error (RMSE):**
*   **Indication:** MSE (1.0735) and RMSE (1.0361) give more weight to larger errors because they square the differences. RMSE is in the same units as the target variable (rating points), making it easier to interpret than MSE.
*   **Business Impact:** RMSE provides a measure of the typical error size, penalizing larger mistakes more heavily. An RMSE of 1.0361 means that, on average, the errors in our rating predictions are around 1 point, with larger errors having a greater influence on this metric. This metric is critical for:
    *   **Risk Management:** Businesses can assess the potential impact of prediction errors. For instance, consistently under-predicting a restaurant's rating could lead to missed marketing opportunities, while over-predicting could lead to customer disappointment.
    *   **Quality Assurance:** By minimizing RMSE, the model helps ensure that restaurant performance assessments (based on predicted ratings) are as close to actual customer sentiment as possible, supporting data-driven quality control.
    *   **Strategic Planning:** The model's ability to predict ratings with an RMSE of ~1 point means it can be a reliable tool for strategic planning, such as identifying restaurants with high growth potential or those requiring immediate intervention.
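
The definitions behind these four metrics can be written out directly in NumPy. The sketch below uses five made-up ratings (not values from the dataset) in which one prediction misses by 2 points, to show how a single large error affects RMSE more than MAE.

```python
import numpy as np

# Made-up actual and predicted ratings; the fourth prediction is off by 2 points.
y_true = np.array([5.0, 4.0, 3.0, 1.0, 4.5])
y_pred = np.array([4.5, 4.0, 3.5, 3.0, 4.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                 # average absolute error
mse = np.mean(errors ** 2)                    # average squared error
rmse = np.sqrt(mse)                           # back in rating points
ss_res = np.sum(errors ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

Here the single 2-point miss contributes 4.0 of the 4.75 total squared error, which is why RMSE ends up noticeably larger than MAE.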

**Overall Business Impact of the Tuned GradientBoostingRegressor Model:**

The Gradient Boosting Regressor, particularly after hyperparameter tuning, demonstrates a solid capability to predict restaurant ratings. This model can serve as a valuable analytical tool for:

*   **Performance Monitoring:** Automatically flag restaurants that deviate significantly from their predicted ratings, indicating potential service issues or exceptional performance.
*   **Strategic Marketing:** Identify key features (from feature importance analysis) that drive high ratings, allowing for targeted marketing campaigns that highlight these strengths.
*   **Menu and Service Optimization:** Use predicted ratings to test the impact of changes in menu, pricing, or service offerings, guiding data-driven improvements.
*   **Investment Decisions:** For platforms like Zomato, the model can help in identifying high-performing restaurants for promotion or partnership opportunities.

The improvement in R2, MAE, and RMSE after tuning indicates a more reliable model. While the model is not perfect (R2 is ~0.5), it provides a quantitative basis for understanding and predicting customer satisfaction, which is a key driver for success in the restaurant industry.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model


In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# 1. Instantiate the Lasso regressor
# Using alpha=0.1 and random_state=42 for reproducibility
lasso_model = Lasso(alpha=0.1, random_state=42)

# 2. Fit the Lasso model to the training data
print("Fitting Lasso Regressor model...")
lasso_model.fit(X_train, y_train)
print("Model fitting complete.")

# 3. Make predictions on the test data
print("Making predictions on the test set...")
y_pred_lasso = lasso_model.predict(X_test)
print("Predictions made.")

# 4. Calculate and print evaluation metrics for the Lasso model
print(f"\n--- Lasso Regressor Model Performance (Baseline) ---")

r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"R-squared (R2): {r2_lasso:.4f}")

mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
print(f"Mean Absolute Error (MAE): {mae_lasso:.4f}")

mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print(f"Mean Squared Error (MSE): {mse_lasso:.4f}")

rmse_lasso = np.sqrt(mse_lasso)
print(f"Root Mean Squared Error (RMSE): {rmse_lasso:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Create a DataFrame for evaluation metrics for the Lasso Regressor
metrics_data_lasso = {
    'Metric': ['R-squared (R2)', 'Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)'],
    'Value': [r2_lasso, mae_lasso, mse_lasso, rmse_lasso]
}
metrics_lasso_df = pd.DataFrame(metrics_data_lasso)

# Create a bar plot for the evaluation metrics
plt.figure(figsize=(10, 6))
sns.barplot(x='Metric', y='Value', data=metrics_lasso_df, hue='Metric', palette='coolwarm', legend=False)

# Add title and labels
plt.title('Lasso Regressor Model Evaluation Metrics (Baseline)')
plt.xlabel('Evaluation Metric')
plt.ylabel('Value')

# Display the plot
plt.tight_layout()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model



In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# 1. Define a parameter grid named param_grid_lasso
param_grid_lasso = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0],
    'selection': ['cyclic', 'random']
}

# 2. Instantiate a Lasso regressor named base_lasso
base_lasso = Lasso(random_state=42, max_iter=2000)

# 3. Initialize GridSearchCV named grid_search_lasso
print("Starting GridSearchCV for Lasso Regressor hyperparameter tuning...")
grid_search_lasso = GridSearchCV(estimator=base_lasso,
                                 param_grid=param_grid_lasso,
                                 scoring='neg_mean_squared_error',
                                 cv=3,
                                 n_jobs=-1,
                                 verbose=2)

# 4. Fit grid_search_lasso to the training data
grid_search_lasso.fit(X_train, y_train)

print("GridSearchCV complete for Lasso Regressor.")

# 5. Print the best parameters and best score
print(f"\nBest Parameters found for Lasso Regressor: {grid_search_lasso.best_params_}")
print(f"Best Cross-validation Score (neg_mean_squared_error): {grid_search_lasso.best_score_:.4f}")

# 6. Make predictions on the test set using the best estimator
print("Making predictions with the best tuned Lasso Regressor model...")
y_pred_lasso_tuned = grid_search_lasso.best_estimator_.predict(X_test)
print("Predictions complete.")

# 7. Calculate and print evaluation metrics for the tuned Lasso model
r2_lasso_tuned = r2_score(y_test, y_pred_lasso_tuned)
mae_lasso_tuned = mean_absolute_error(y_test, y_pred_lasso_tuned)
mse_lasso_tuned = mean_squared_error(y_test, y_pred_lasso_tuned)
rmse_lasso_tuned = np.sqrt(mse_lasso_tuned)

print(f"\n--- Tuned Lasso Regressor Model Performance ---")
print(f"R-squared (R2) Tuned: {r2_lasso_tuned:.4f}")
print(f"Mean Absolute Error (MAE) Tuned: {mae_lasso_tuned:.4f}")
print(f"Mean Squared Error (MSE) Tuned: {mse_lasso_tuned:.4f}")
print(f"Root Mean Squared Error (RMSE) Tuned: {rmse_lasso_tuned:.4f}")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Visualize Evaluation Metrics for Tuned Lasso
metrics_data_lasso_tuned = {
    'Metric': ['R-squared (R2)', 'Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)']
}

# Use the calculated tuned metrics
metrics_data_lasso_tuned['Value'] = [r2_lasso_tuned, mae_lasso_tuned, mse_lasso_tuned, rmse_lasso_tuned]

metrics_lasso_tuned_df = pd.DataFrame(metrics_data_lasso_tuned)

plt.figure(figsize=(10, 6))
sns.barplot(x='Metric', y='Value', data=metrics_lasso_tuned_df, hue='Metric', palette='coolwarm', legend=False)
plt.title('Tuned Lasso Regressor Model Evaluation Metrics')
plt.xlabel('Evaluation Metric')
plt.ylabel('Value')
plt.tight_layout()
plt.show()

##### Which hyperparameter optimization technique have you used and why?

The hyperparameter optimization technique used for the Lasso Regressor was **Grid Search (GridSearchCV)**. This technique systematically explores all possible combinations of hyperparameter values specified in a predefined grid (`param_grid_lasso`).

**Why Grid Search was used:**

1.  **Comprehensive Exploration:** Grid Search exhaustively evaluates every combination of the specified hyperparameters (`alpha` and `selection`). This ensures that the optimal set of hyperparameters within the defined search space is found.
2.  **Performance Improvement:** Lasso Regression models are sensitive to their `alpha` parameter, which controls the strength of the regularization. Proper tuning is crucial to find the right balance between bias and variance, preventing underfitting or overfitting. Grid Search helps in identifying the configuration that yields the best performance on the validation set.
3.  **Comparability:** By using a structured and exhaustive search, we can confidently compare the tuned Lasso Regressor's performance with other models, knowing that its hyperparameters have been optimized for the given task.
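
The effect of `alpha` is easiest to see by counting how many coefficients survive at each regularization strength. The sketch below does this on synthetic data (an assumption, since the notebook's feature matrix is not reproduced here); the extra value `1000.0` is included only to show the extreme case.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=42
)

nonzero_counts = {}
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 1000.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    # Coefficients shrunk exactly to zero drop the feature from the model.
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:<8} non-zero coefficients: {nonzero_counts[alpha]:2d}  "
          f"train R2: {model.score(X_demo, y_demo):.4f}")
```

When `alpha` is too large for the data, every coefficient is shrunk to zero and the model predicts the training mean for all rows, which yields an R2 of about zero. Which value counts as "too large" depends on the scale of the features and the target, so it has to be found by tuning.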

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

The R-squared (R2) dramatically improved from -0.0002 to 0.4291, indicating that the tuned model can now explain approximately 42.91% of the variance in restaurant ratings, a substantial gain in predictive power.

All error metrics significantly decreased:

*   MAE decreased from 1.2738 to 0.9040, meaning the average absolute difference between predicted and actual ratings is much smaller.
*   MSE decreased from 2.1937 to 1.2522, signifying a substantial reduction in the average squared error.
*   RMSE decreased from 1.4811 to 1.1190, showing that the magnitude of errors is generally smaller and the model's predictions are closer to the actual values.

These improvements demonstrate that hyperparameter tuning was critical for the Lasso Regressor, enabling it to move from a very poor baseline performance to a more reasonable and interpretable predictive model, although it still trails the tree-based models (RandomForest and GradientBoosting).

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For a positive business impact, I primarily considered R-squared (R2), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).

**R-squared (R2):** The proportion of variance in ratings that the model explains. Business impact: the higher the R2, the more of the variation in ratings the model's features account for, so conclusions about which factors drive customer satisfaction (e.g., cuisine quality, service) rest on firmer ground when deciding where to focus effort to improve ratings and revenue.

**Mean Absolute Error (MAE):** The average absolute difference between predicted and actual ratings, expressed in rating points. Business impact: a low MAE means predictions are closer to the truth on average, which is useful for forecasting restaurant performance, identifying underperformers, and setting realistic expectations for new ventures.

**Root Mean Squared Error (RMSE):** Also expressed in rating points, but because errors are squared before averaging, large misses are penalized more heavily than in MAE. Business impact: a low RMSE indicates that large prediction errors are rare, which supports risk management, reliable quality assessments, and strategic planning based on rating forecasts.
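
To make the three definitions concrete, here is a small dependency-free sketch that computes them from their formulas. The four ratings are made-up illustration values, not project data, and `regression_metrics` is a hypothetical helper name.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, MAE and RMSE computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "RMSE": rmse}

# Toy ratings: three small misses and one full-point miss.
y_true = [3.0, 4.0, 5.0, 2.0]
y_pred = [3.5, 4.0, 4.0, 2.5]
scores = regression_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
```

In this toy case RMSE comes out larger than MAE because the single one-point miss is weighted more heavily once squared, which is the behavior described above.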



### 2. Which ML model did you choose from the above created models as your final prediction model and why?

From the models we've trained, the tuned RandomForestRegressor is chosen as the final prediction model.

**Reason:**

After tuning, it achieved the best scores of the three models: an R-squared of 0.5116 and RMSE of 1.0349, against the tuned GradientBoostingRegressor's R-squared of 0.5105 and RMSE of 1.0361. The margin is small (0.0011 in R-squared and 0.0012 in RMSE), so the two models are practically equivalent on this test set; RandomForestRegressor was selected because it holds the slight edge on both metrics.



### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The final prediction model chosen is the tuned RandomForestRegressor.

**Model Explanation (RandomForestRegressor):** It's an ensemble learning model that builds many decision trees independently during training, typically each on a bootstrap sample of the data. For regression, it averages the predictions of all these individual trees to make a final, more robust prediction. It handles mixed numerical and encoded categorical features well, captures non-linear relationships, and is less prone to overfitting than a single decision tree.
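
The averaging step can be verified directly. The sketch below, on synthetic data rather than the project's features, fits a small forest and checks that its prediction equals the mean of the predictions of the individual trees stored in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption; not the Zomato dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# One row of predictions per fitted tree, for the first 10 samples.
per_tree = np.stack([tree.predict(X[:10]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:10])

print(per_tree.shape)
print(np.allclose(manual_average, forest_prediction))
```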

**Feature Importance (Model Explainability Tool):** RandomForestRegressor itself has a built-in model explainability tool: `feature_importances_`.

- **How it works:** This attribute quantifies the contribution of each feature to the model's predictive power. It's calculated by measuring the average reduction in impurity (e.g., Mean Squared Error for regression) caused by splits on a particular feature across all the trees in the forest. A higher value indicates a more important feature.
- **Insights:** This allows us to directly see which textual terms (like 'bad', 'good'), numerical attributes (like 'Cost'), and categorical indicators (like 'Collection_Hyderabad's Hottest') were most influential in determining the restaurant ratings, providing clear, actionable insights into customer satisfaction drivers.
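
A minimal sketch of reading `feature_importances_` is shown below. The data is synthetic and the column names (`tfidf_good`, `tfidf_bad`, `Cost`, `noise`) are hypothetical stand-ins chosen to mirror the kinds of features mentioned above; the target is constructed so that the two sentiment-like columns dominate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
feature_names = ["tfidf_good", "tfidf_bad", "Cost", "noise"]  # hypothetical

# Synthetic features: two TF-IDF-like columns, a cost column, pure noise.
X = np.column_stack([
    rng.random(n),
    rng.random(n),
    rng.uniform(200, 2000, n),
    rng.random(n),
])
# Synthetic rating driven almost entirely by the two sentiment-like columns.
y = 3.5 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:12s} {score:.3f}")
```

Impurity-based importances are known to favor features with many distinct values, so it can be worth cross-checking the ranking with `sklearn.inspection.permutation_importance` on held-out data before drawing business conclusions.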


## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.
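
One possible way to do this with `joblib` is sketched below. The model here is a small stand-in fitted on synthetic data, and the file name `zomato_rating_rf.joblib` is an assumption; in the notebook, the tuned RandomForestRegressor object would be passed to `joblib.dump` instead.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in for the tuned model (assumption: 20 selected features).
X, y = make_regression(n_samples=100, n_features=20, random_state=42)
best_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Hypothetical file name; written to a temporary directory for the demo.
model_path = os.path.join(tempfile.mkdtemp(), "zomato_rating_rf.joblib")
joblib.dump(best_model, model_path)

print(model_path, os.path.getsize(model_path), "bytes")
```

For deployment, any fitted preprocessing steps (such as the TF-IDF vectorizer) need to be saved alongside the model so that new reviews are transformed the same way.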



### 2. Again Load the saved model file and try to predict unseen data for a sanity check.
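
A self-contained sanity check might look like the sketch below: fit a stand-in model, save it, load it back, and confirm that the loaded copy reproduces the original's predictions on rows it never saw during training. The data, the split, and the file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data; the last 20 rows play the role of unseen data.
X, y = make_regression(n_samples=120, n_features=20, random_state=42)
X_train, y_train, X_unseen = X[:100], y[:100], X[100:]

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_train, y_train)

# Round trip through disk (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "zomato_rating_rf.joblib")
joblib.dump(model, model_path)
loaded_model = joblib.load(model_path)

# The loaded model should give the same predictions as the original.
unseen_pred = loaded_model.predict(X_unseen)
matches_original = np.allclose(unseen_pred, model.predict(X_unseen))
print(unseen_pred[:3], matches_original)
```

Loading should happen in an environment with a compatible scikit-learn version, since pickled estimators are not guaranteed to load across versions.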


# **Conclusion**
The Zomato Restaurant Rating Prediction project aimed to develop a machine learning model capable of accurately predicting restaurant ratings based on their metadata and customer reviews. This involved a comprehensive data science pipeline:

1. **Data Wrangling & Preprocessing**

   - **Cleaning:** Duplicate rows were removed, and data types were corrected for columns like 'Cost' (to numeric), 'Rating' (to numeric, handling 'Like' values), and 'Time' (to datetime).
   - **Missing Values:** Missing values were filled with 'Unknown' for the categorical columns 'Collections', 'Reviewer', and 'Metadata', and with the mode for 'Timings'; rows missing 'Rating' or 'Review' were dropped.
   - **Text Preprocessing:** The 'Review' text underwent extensive cleaning, including contraction expansion, lowercasing, punctuation and noise removal (URLs, digits in words, stopwords), and lemmatization, followed by TF-IDF vectorization to convert it into a numerical feature representation.
2. **Feature Engineering & Selection**

   - **New Features:** A 'Review_Length' feature was engineered to capture the verbosity of reviews.
   - **Multi-stage Feature Selection:** A strategy combining filter, embedded, and wrapper methods was employed:
     - **Filter Methods** (Pearson correlation, t-tests, TF-IDF thresholds) were used for initial screening, reducing the feature space from thousands to hundreds by removing irrelevant or uninformative features.
     - **Embedded Methods** (RandomForestRegressor's feature importances) identified the top 100 most influential features, including sentiment-related TF-IDF terms, numerical metrics like 'Cost' and 'Num_Reviews', and specific categorical indicators.
     - **Wrapper Methods** (SequentialFeatureSelector with LinearRegression) further refined this to a set of 20 features, keeping the model parsimonious and combating overfitting.
3. **Model Building & Evaluation**

   - The data was split into training (80%) and testing (20%) sets.
   - Three regression models were implemented and tuned using GridSearchCV:
     - **RandomForestRegressor:** Demonstrated the strongest performance, achieving an R-squared of 0.5116 and RMSE of 1.0349 after tuning. This model explains approximately 51.16% of the variance in ratings.
     - **GradientBoostingRegressor:** Performed comparably well, yielding an R-squared of 0.5105 and RMSE of 1.0361 after tuning.
     - **Lasso Regression:** Showed significant improvement from a poor baseline after tuning, reaching an R-squared of 0.4291. While interpretable, its predictive power was lower than that of the ensemble methods.
   - **Imbalance Handling:** The target 'Rating' is skewed towards higher values. No re-sampling was applied, since direct data re-sampling is generally ill-suited for continuous targets; instead, robust tree-based models were chosen and performance was judged on MAE, MSE, RMSE, and R-squared.
4. **Final Model Selection**

   The tuned RandomForestRegressor was chosen as the final prediction model. It slightly outperformed the tuned GradientBoostingRegressor in terms of R-squared and RMSE, making it the most accurate model developed in this project, though only by a narrow margin.

**Conclusion & Business Impact:** This project developed a predictive model for restaurant ratings that explains about 51% of the variance in ratings on the test set. The identified important features (e.g., sentiment terms, restaurant cost, specific collections) offer actionable insights for businesses. Restaurants can leverage these findings to:

- Optimize menus and services by focusing on attributes that significantly drive high ratings.
- Refine marketing strategies by highlighting strong points or addressing weaknesses identified through feature importance.
- Forecast performance for new offerings or locations, with a typical error of about one rating point (RMSE of 1.0349).

While the model provides a solid foundation, further enhancements could involve exploring more advanced NLP techniques (e.g., deep learning for text), alternative regression algorithms, or even developing custom loss functions to emphasize prediction accuracy for less frequent (e.g., very low) ratings.

