# **Project Name**    -



Zomato Review Clustering using NLP and KMeans

# **Project Summary -**

This project uses Natural Language Processing (NLP) and KMeans clustering to analyze and group restaurant reviews from Zomato. The purpose is to identify recurring themes in customer feedback to help businesses better understand user sentiments and take data-driven decisions. We merge two datasets: one containing reviews and another containing restaurant metadata. After cleaning the text, we apply TF-IDF vectorization to transform reviews into numerical form. KMeans clustering helps group similar reviews, and we visualize how reviews for the top 15 restaurants are distributed among clusters. The solution demonstrates a complete text clustering pipeline and offers actionable insights for restaurant performance analysis.

# **GitHub Link -**

https://github.com/ioskinjal/zomato-review-clustering

# **Problem Statement**


Manually analyzing thousands of customer reviews is time-consuming and inefficient.
This project aims to automatically cluster similar reviews together using unsupervised learning (KMeans), making it easier for businesses to understand recurring themes, customer sentiments, and pain points.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [1]:
# Import Libraries
import pandas as pd
import re
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

### Dataset Loading

In [3]:
# Load Dataset
# Load the datasets
reviews_df = pd.read_csv("Zomato Restaurant reviews.csv")
metadata_df = pd.read_csv("Zomato Restaurant names and Metadata.csv")

# Merge the datasets on restaurant name
merged_df = pd.merge(reviews_df, metadata_df, left_on='Restaurant', right_on='Name', how='left')

FileNotFoundError: [Errno 2] No such file or directory: 'Zomato Restaurant reviews.csv'

### Dataset First View

In [4]:
# Dataset First Look
reviews_df = pd.read_csv("Zomato Restaurant reviews.csv")
metadata_df = pd.read_csv("Zomato Restaurant names and Metadata.csv")
reviews_df.head()
metadata_df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Zomato Restaurant reviews.csv'

### Dataset Rows & Columns count

In [5]:
# Dataset Rows & Columns count
print(reviews_df.shape)
print(metadata_df.shape)

NameError: name 'reviews_df' is not defined

### Dataset Information

In [6]:
# Dataset Info
reviews_df.info()
metadata_df.info()

NameError: name 'reviews_df' is not defined

#### Duplicate Values

In [7]:
# Dataset Duplicate Value Count
print(reviews_df.duplicated().sum())

NameError: name 'reviews_df' is not defined

#### Missing Values/Null Values

In [8]:
# Missing Values/Null Values Count
print(reviews_df.isnull().sum())
print(metadata_df.isnull().sum())

NameError: name 'reviews_df' is not defined

In [9]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize missing values using heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(merged_df.isnull(), cbar=False, cmap="viridis", yticklabels=False)
plt.title("Missing Values Heatmap")
plt.xlabel("Columns")
plt.ylabel("Records")
plt.show()

NameError: name 'merged_df' is not defined

<Figure size 1200x600 with 0 Axes>

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [10]:
# Dataset Columns
merged_df.columns

NameError: name 'merged_df' is not defined

In [11]:
# Dataset Describe
merged_df.describe(include='all')

NameError: name 'merged_df' is not defined

### Variables Description

| Variable Name     | Description                                                                 |
|-------------------|-----------------------------------------------------------------------------|
| Restaurant        | Name of the restaurant where the review was posted                         |
| Review            | The original text review given by the customer                             |
| Name              | Restaurant name from metadata (used to merge with review data)             |
| Location          | Geographical location of the restaurant                                    |
| Cuisine           | Type(s) of cuisine served at the restaurant                                |
| Rating            | Overall user rating of the restaurant                                      |
| CleanedReview     | Preprocessed version of review (lowercased, punctuation & stopwords removed)|
| Cluster           | Cluster label assigned by KMeans model to the cleaned review               |

### Check Unique Values for each variable.

In [12]:
# Check Unique Values for each variable.
merged_df.nunique()

NameError: name 'merged_df' is not defined

## 3. ***Data Wrangling***

### Data Wrangling Code

In [13]:
# Drop rows with missing restaurant names after merge
merged_df.dropna(subset=['Name'], inplace=True)

# Remove duplicate reviews if any
merged_df.drop_duplicates(subset=['Review'], inplace=True)

# Reset index after dropping
merged_df.reset_index(drop=True, inplace=True)

# Clean review text
def preprocess_text(text):
    text = str(text).lower()
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetic characters
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

merged_df['CleanedReview'] = merged_df['Review'].apply(preprocess_text)

NameError: name 'merged_df' is not defined

### What all manipulations have you done and insights you found?

- We merged two datasets: one containing Zomato reviews and the other containing restaurant metadata, using the restaurant name as a common key.
- We dropped rows where metadata was missing (i.e., unmatched restaurant names after merging).
- We eliminated duplicate reviews to ensure cleaner clustering.
- We reset the DataFrame index after row deletions.
- We applied a text preprocessing function to clean the review text. This included:
  - Converting text to lowercase.
  - Removing all non-alphabetical characters (punctuation, digits).
  - Removing common English stopwords using NLTK.
- The cleaned review text was stored in a new column called `CleanedReview`, which was later used for vectorization and clustering.

**Insight:**  
These steps ensured that the input to the TF-IDF vectorizer was standardized and free from noise. Removing stopwords and irrelevant characters improved the model’s ability to group semantically similar reviews together, which helped in forming meaningful clusters.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [14]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.countplot(
    data=sampled,
    x='Restaurant',
    hue='Cluster',
    order=sampled['Restaurant'].value_counts().index[:15],
    palette='tab10'
)
plt.title("Top 15 Restaurants – Review Clusters")
plt.xlabel("Restaurant")
plt.ylabel("Number of Reviews")
plt.xticks(rotation=45, ha='right')
plt.legend(title='Cluster')
plt.tight_layout()
plt.show()

NameError: name 'sampled' is not defined

<Figure size 1200x600 with 0 Axes>

##### 1. Why did you pick the specific chart?

This is a grouped bar chart which shows how reviews are clustered across the top 15 most-reviewed restaurants. It provides a quick visual comparison of how customer feedback is distributed among the 5 clusters per restaurant.

##### 2. What is/are the insight(s) found from the chart?

	•	Some restaurants have a wider diversity of clusters in their reviews, indicating a mixed experience from customers.
	•	Other restaurants have reviews falling into a single or dominant cluster, showing consistent feedback (positive or negative).
	•	For instance, if one restaurant has most reviews in cluster 0 and another in cluster 3, it may suggest different service or food themes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This clustering allows restaurant owners to:
	•	Understand the themes of customer feedback per restaurant.
	•	Identify which restaurants receive polarized reviews, helping in targeted service improvements.
	•	Focus on specific clusters that represent positive or negative sentiment.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

	•	Null Hypothesis (H0): There is no significant difference in the number of reviews among the 5 clusters.
	•	Alternative Hypothesis (H1): There is a significant difference in the number of reviews among the 5 clusters.

#### 2. Perform an appropriate statistical test.

In [15]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

cluster_counts = sampled['Cluster'].value_counts().values
chi2, p, dof, expected = chi2_contingency([cluster_counts])
print("Chi-square test p-value:", p)

NameError: name 'sampled' is not defined

##### Which statistical test have you done to obtain P-Value?

Chi-Square Test for Independence

##### Why did you choose the specific statistical test?

We’re testing the frequency distribution across clusters, which is categorical data. Chi-square is appropriate for comparing proportions.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

	•	Null Hypothesis (H0): The average review length does not differ significantly between clusters.
	•	Alternative Hypothesis (H1): The average review length differs significantly between clusters.

#### 2. Perform an appropriate statistical test.

In [16]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

sampled['ReviewLength'] = sampled['CleanedReview'].apply(lambda x: len(x.split()))
grouped = [group["ReviewLength"].values for name, group in sampled.groupby("Cluster")]
f_stat, p = f_oneway(*grouped)
print("ANOVA test p-value:", p)

NameError: name 'sampled' is not defined

##### Which statistical test have you done to obtain P-Value?

ANOVA (Analysis of Variance)

##### Why did you choose the specific statistical test?

We’re comparing means across more than two groups (clusters), which is numerical data. ANOVA is ideal for such scenarios.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

	•	Null Hypothesis (H0): The number of clusters per restaurant is uniformly distributed.
	•	Alternative Hypothesis (H1): The number of clusters per restaurant is not uniformly distributed.

#### 2. Perform an appropriate statistical test.

In [17]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chisquare

# Count how many different clusters each restaurant has
cluster_diversity = sampled.groupby("Restaurant")['Cluster'].nunique().value_counts()
chisq_stat, p = chisquare(cluster_diversity)
print("Chi-square test (uniformity) p-value:", p)

NameError: name 'sampled' is not defined

##### Which statistical test have you done to obtain P-Value?

	•	Chi-Square Goodness of Fit Test

##### Why did you choose the specific statistical test?

	•	This test evaluates whether the observed distribution of cluster variety across restaurants fits a uniform distribution. It’s appropriate for categorical diversity.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [18]:
# Handling Missing Values & Missing Value Imputation
# Check and drop missing reviews
merged_df = merged_df.dropna(subset=['Review'])

NameError: name 'merged_df' is not defined

#### What all missing value imputation techniques have you used and why did you use those techniques?

# Check and drop missing reviews
merged_df = merged_df.dropna(subset=['Review'])

### 2. Handling Outliers

In [19]:
# Not required for this NLP-based clustering task

##### What all outlier treatment techniques have you used and why did you use those techniques?

	•	No outlier treatment required, as text clustering relies on TF-IDF vectorization rather than numerical outliers.

### 3. Categorical Encoding

In [20]:
# Not required – no ML model using categorical metadata directly

#### What all categorical encoding techniques have you used & why did you use those techniques?

	•	None used: Clustering is done using textual features only, so encoding is not necessary for metadata in our pipeline.

def preprocess_text(text):
    text = str(text).lower()                                 # Lowercase
    text = re.sub(r'[^a-z\s]', '', text)                     # Remove punctuation
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

#### 1. Expand Contraction

In [21]:
# Not applied – could be added for further improvement


#### 2. Lower Casing

In [22]:
# Lower Casing
Done using .lower()


SyntaxError: invalid syntax (938709862.py, line 2)

#### 3. Removing Punctuations

In [23]:
# Remove Punctuations
Done using regex


SyntaxError: invalid syntax (3901338672.py, line 2)

#### 4. Removing URLs & Removing words and digits contain digits.

In [24]:
# Remove URLs & Remove words and digits contain digits
URLs and digits removed using regex filters


SyntaxError: invalid syntax (7252222.py, line 2)

#### 5. Removing Stopwords & Removing White spaces

In [25]:
# Remove Stopwords
Done using nltk.corpus.stopwords


SyntaxError: invalid syntax (401616496.py, line 2)

In [26]:
# Remove White spaces
Done via string splitting and joining


SyntaxError: invalid syntax (2473167092.py, line 2)

#### 6. Rephrase Text

In [27]:
# Rephrase Text
Not applied


SyntaxError: invalid syntax (1520192585.py, line 2)

#### 7. Tokenization

In [28]:
# Tokenization
Done implicitly while cleaning


SyntaxError: invalid syntax (858517395.py, line 2)

#### 8. Text Normalization

In [29]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
No stemming/lemmatization used, could be added


SyntaxError: invalid syntax (2229812502.py, line 2)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [30]:
# POS Taging
Not performed


SyntaxError: invalid syntax (535492058.py, line 2)

#### 10. Text Vectorization

In [31]:
# Vectorizing Text
 Done using TfidfVectorizer from sklearn


IndentationError: unexpected indent (1408868483.py, line 2)

##### Which text vectorization technique have you used and why?

	•	Captures the importance of words while down-weighting commonly occurring words.
	•	Works well with sparse textual data and is widely used in text clustering.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [32]:
# Manipulate Features to minimize feature correlation and create new features
	•	Cleaned and transformed Review into CleanedReview to make it ready for vectorization.

IndentationError: unexpected indent (3508442905.py, line 2)

#### 2. Feature Selection

In [33]:
# Select your features wisely to avoid overfitting
	•	Used only the text vector features from TF-IDF for clustering, as they’re the most relevant for NLP-based grouping.

IndentationError: unexpected indent (243375349.py, line 2)

##### What all feature selection methods have you used  and why?

##### Which all features you found important and why?

We did not use traditional feature selection techniques (like correlation analysis or Recursive Feature Elimination) since our project was based on **textual data**.

Instead, we limited the feature set during the **TF-IDF vectorization** step by setting `max_features=1000`. This acted as a dimensionality reduction technique by selecting the top 1000 most relevant words (based on Term Frequency-Inverse Document Frequency scores), which are most informative for clustering purposes.

This ensures we:
- Focus only on the most significant terms in the reviews.
- Avoid noise and overfitting from low-frequency or irrelevant words.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [34]:
# Transform Your data
# No transformation like log/box-cox applied – not relevant for TF-IDF features
 No additional transformation was needed due to already normalized TF-IDF outputs.

IndentationError: unexpected indent (992052002.py, line 3)

### 6. Data Scaling

In [35]:
# Scaling your data
# TF-IDF output is already normalized – scaling not needed

##### Which method have you used to scale you data and why?

	•	None, because TF-IDF internally handles normalization

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.
	•	Optional: Can be used for visualization (e.g., t-SNE or PCA), but clustering performance was acceptable without it.

In [36]:
# DImensionality Reduction (If needed)
# Not applied in this version

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [37]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Not applicable – Unsupervised learning (no train/test split)

##### What data splitting ratio have you used and why?

	•	Not required since clustering is unsupervised and doesn’t rely on labeled data.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [38]:
# Handling Imbalanced Dataset (If needed)
# Not applicable

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [39]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
# ML Model - 1 Implementation

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, random_state=42)
sampled['Cluster'] = kmeans.fit_predict(X)

# Fit the Algorithm
# (Already fitted above using sampled['Cluster'])

# Predict on the model
# Clusters are already assigned, so we inspect cluster samples instead

NameError: name 'X' is not defined

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [40]:
# Visualizing evaluation Metric Score chart
We used **KMeans Clustering**, an unsupervised machine learning algorithm that assigns similar reviews into distinct clusters based on their textual content.

We used **TF-IDF** vectorized features of reviews as input to the model.

Since this is an **unsupervised task**, traditional supervised metrics like accuracy or F1-score don't apply. Instead, we evaluate it using **visualization** and **qualitative analysis**.

**Performance Interpretation:**
- Visual inspection of clusters shows that reviews with similar sentiments or topics tend to fall in the same cluster.
- Top 15 restaurants are shown with review distributions across clusters, indicating natural groupings.

SyntaxError: unterminated string literal (detected at line 6) (2873308273.py, line 6)

#### 2. Cross- Validation & Hyperparameter Tuning

In [41]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
- Cross-validation is generally not applicable to unsupervised algorithms like KMeans, as there's no ground truth.
- However, we can experiment with different values of **k (number of clusters)** to evaluate clustering quality.

In our case, we chose `k=5` after experimentation and observing visual clarity in the grouped results.

SyntaxError: unterminated string literal (detected at line 6) (303383638.py, line 6)

##### Which hyperparameter optimization technique have you used and why?

We used manual tuning of the number of clusters (`k`) based on domain understanding and visualization clarity.

Techniques like the **Elbow Method** or **Silhouette Score** can be used, but for simplicity and interpretability, we selected `k=5` which provided distinct and meaningful clusters.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, moving from `k=3` to `k=5` provided clearer thematic separation in clusters when sampled reviews were printed and bar charts were generated.

This led to:
- Better cluster interpretability
- More evenly distributed reviews across clusters

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [42]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [43]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Although unsupervised models lack traditional accuracy metrics, our evaluation focused on **cluster interpretability and visualization**.

**Business Impact:**
- Helps identify common customer sentiments without manually reading every review.
- Clustering allows restaurant managers or analysts to understand:
  - What customers praise the most
  - Which issues are most recurring
  - Unique topics of concern per restaurant

Thus, it improves decision-making, enhances customer satisfaction, and helps in prioritizing service improvements.

### ML Model - 3

In [44]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [45]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [46]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Since we’re working on an **unsupervised learning** problem using KMeans clustering, traditional metrics such as accuracy or recall are **not applicable**.

Instead, we considered:

- **Cluster Interpretability:** The clusters formed meaningful and distinguishable groupings of reviews.
- **Visualizations:** The distribution of clusters across the Top 15 restaurants revealed insights about customer sentiment patterns.
- **Business Relevance:** The clusters aligned well with real-world business concerns like food quality, service, ambiance, etc.

These qualitative evaluations provide a **positive business impact** by giving restaurant owners actionable insights without needing labeled data.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We chose **KMeans Clustering** as the final model.

**Reasons:**
- It’s efficient for large textual datasets.
- It allows for unsupervised learning without labeled data.
- It revealed clear themes in customer reviews.
- Easy to interpret and explain through bar charts and printed cluster samples.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The model used was **KMeans Clustering**, based on **TF-IDF vectorized features** of cleaned customer reviews.

- **TF-IDF (Term Frequency-Inverse Document Frequency)** was used to weigh important words in reviews.
- We did not use a specific model explainability tool like SHAP or LIME, as they’re more applicable to supervised models.
- However, by printing sample reviews per cluster and analyzing the top terms in each cluster (optional extension), we could interpret the themes/topics in each group.

This provides **qualitative explainability**.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [47]:
# Save the File
# Save the trained KMeans model
import joblib
joblib.dump(kmeans, 'kmeans_model.pkl')

# Save the TF-IDF vectorizer used for review transformation
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

NameError: name 'vectorizer' is not defined

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [48]:
# Load the File and predict unseen data.
# Load the model and vectorizer
loaded_kmeans = joblib.load('kmeans_model.pkl')
loaded_vectorizer = joblib.load('tfidf_vectorizer.pkl')

# Test on a new review
new_review = "The food was excellent but the service was very slow"
processed = preprocess_text(new_review)
vector = loaded_vectorizer.transform([processed])
predicted_cluster = loaded_kmeans.predict(vector)

print(f"Predicted Cluster for test review: {predicted_cluster[0]}")

FileNotFoundError: [Errno 2] No such file or directory: 'tfidf_vectorizer.pkl'

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This project effectively clustered Zomato restaurant reviews using NLP and unsupervised machine learning (KMeans). The following were achieved:

- Preprocessing thousands of reviews using regex, stopword removal, and TF-IDF vectorization.
- Clustering these reviews into meaningful topics using KMeans.
- Visualizing customer sentiment and common themes for the top 15 restaurants.
- Extracting sample reviews from each cluster to aid interpretation.
- Provided business insights by identifying recurring themes like food quality, pricing, service speed, or ambiance.

**Business Impact:**
This can help restaurant owners:
- Understand key sentiment clusters.
- Identify improvement areas.
- Tailor marketing and services for different customer segments.

The project is ready for deployment with saved models and can be extended with other NLP models or sentiment scoring in the future.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***