# **CSCI 111 - Zomato Recommendation - Data Analysis**

## **Authors of this Repository**
- **Binwag, Louis G. III** - [@louisuwie](https://github.com/louis-uwie)
- **Lozada, Godwyn Idris L.** - [@godwynlozada](https://github.com/godwynlozada)

<!-- CSCI 111 is a class that teaches how Artificial Intelligence (AI) thinks.
The class first delves into the basic data structures and search algorithms behind an AI's decision-making processes (e.g., Breadth-First Search, Depth-First Search, the A* algorithm).

These make it easier to understand and visualize how "problem-solving Agents" think. Agents are classified by the kinds of problems they can solve (e.g., Problem-Based Agents, Supervised or Unsupervised Agents) ...

After problem-solving Agents, we move into the basics of Machine Learning, which include Data Transformation, Decision Trees, K-Nearest Neighbours (kNN), and Clustering. This is where we first use Jupyter to write scripts that apply various models to transform, manipulate, and analyse datasets.

Lastly, we tackle Logical Agents. This adds logic to the problem-solving Agents covered earlier: we use propositional logic, inference, and entailment to represent how an Agent might look at a problem (e.g., truth tables, inference diagrams, logical sentences, a Knowledge Base (KB)). -->

# **I. Initial Set Up**

## **Project on Evaluating Machine Learning Models.**

**Instructions.** Select a dataset from [UCI](https://archive.ics.uci.edu/ml/datasets.php) or [Google](https://datasetsearch.research.google.com/), formulate a machine learning problem (supervised or unsupervised), and build and evaluate two models (different methods) that solve the problem. Any programming language may be used.
- You may also use other legitimate sources at the same level of the UCI and Google sites provided.
- You may use methods not taught in class. KNN is not an option.
- You may also use a portion of the dataset if its size causes problems (e.g. reduce the number of rows)

**Deliverables.** In a Google Drive folder that I can access, submit the following:
- Source code and executables
- Instructions on how to use your resources (i.e. your program)
- Slide deck explaining your work
- Recorded video presentation of your work (approx 20-30mins)

**Expected Output.**
- Jupyter Notebook (.ipynb)
- Resources (csv unclean and cleaned)
- Video Presentation
- Slide Deck Presentation

---

# **II. Data Set**

**Dataset Overview.**

The dataset contains raw information sourced from the Zomato Recommendation Platform for restaurants based in Pune, India, covering the year 2023. Each row corresponds to a single restaurant entry and includes a variety of attributes such as the restaurant’s name, multiple types of cuisine offered (up to eight slots), its categorized food type, the average cost for two people, the locality within Pune, and the average customer dining rating.

This dataset provides a foundation for predictive modeling and exploratory analysis, as it blends both categorical (e.g., cuisine types, locality) and numerical (e.g., rating, pricing) data. Through this structure, we can investigate patterns in consumer preferences, identify key factors influencing restaurant ratings, and evaluate the performance of machine learning models like Decision Trees and Mixed Naive Bayes in classifying highly rated restaurants.

| **Features**              | **Short Explanation**                                                         | **Possible Values / Example**                 |
| ------------------------ | ----------------------------------------------------------------------------- | --------------------------------------------- |
| `Restaurant_Name`        | Name of the restaurant listed on Zomato                                       | `"Le Plaisir"`, `"Savya Rasa"`                |
| `Cuisine1` to `Cuisine8` | Different types of cuisines offered by the restaurant, in order of prominence | `"South Indian"`, `"Desserts"`, `"MISSING"`   |
| `Category`               | Grouped categories combining all cuisine types into a readable list           | `"Cafe, Italian, Continental..."`             |
| `Pricing_for_2`          | Approximate cost for two people, in INR                                       | `600`, `1200`, `2100`                         |
| `Locality in Pune`       | Location/neighborhood of the restaurant in Pune                               | `"Koregaon Park"`, `"Baner"`, `"Viman Nagar"` |
| `Dining_Rating`          | Average customer rating of the restaurant (out of 5)                          | `4.2`, `3.8`, `4.9`                           |
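
Because unused cuisine slots hold the `"MISSING"` placeholder, it helps to know how much of each slot is actually filled. The sketch below is a minimal illustration on a hypothetical three-row frame that only mimics the column layout in the table above; the rows and the name `toy` are assumptions, not data from the Zomato file.

```python
import pandas as pd

# Hypothetical rows that only mimic the column layout described above;
# these values are NOT taken from the real Zomato file.
toy = pd.DataFrame({
    "Restaurant_Name": ["A", "B", "C"],
    "Cuisine1": ["Cafe", "South Indian", "Desserts"],
    "Cuisine2": ["Italian", "MISSING", "MISSING"],
    "Cuisine3": ["MISSING", "MISSING", "MISSING"],
    "Pricing_for_2": [1200, 600, 400],
    "Locality in Pune": ["Baner", "Koregaon Park", "Baner"],
    "Dining_Rating": [4.2, 3.8, 3.5],
})

cuisine_cols = [c for c in toy.columns if c.startswith("Cuisine")]

# Fraction of rows whose slot holds the "MISSING" placeholder.
missing_share = (toy[cuisine_cols] == "MISSING").mean()
print(missing_share)
```

Running the same two lines on the real frame would show how sparse the later cuisine slots are.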


---


# **III. Ideal Pipeline**

Our goal for this analysis is to determine which model more accurately predicts the top restaurants in the locale (possibly depending on cuisine, locality, or average price). <!-- Expound -->

**1. Data Preprocessing**
- Load and inspect the data.
- Clean the data (e.g., using Tableau). <!-- care of Godwyn -->

**2. Exploratory Data Analysis (EDA)**
- This step focuses on understanding which features influence the rest of the features.
- It identifies which features drive the patterns in the data. From there, we implement the models.

**3. Decision Tree Implementation 1 (DT1)**
- Together with the EDA, this serves as one of the initial baselines for our models.

**4. Apply Decision Tree Implementation 2 (DT2)**
- The second Decision Tree implementation uses the dataset after omitting certain features (to be identified soon; _e.g., MISSING values, certain irrelevant features_) based on our domain knowledge.
- Comparing it with Decision Tree Implementation 1 may let us justify that omitting certain "junk" features makes the Decision Tree model more accurate.

**5. Apply Mixed Naive Bayes (MNB)**
- The final model we use in this study is the Mixed Naive Bayes (MNB) classifier. This model is a variation of the standard Naive Bayes algorithm that handles both categorical and continuous features, which makes it especially well-suited for real-world datasets like Zomato’s, where variables such as cuisine type (categorical) and average price (numerical) coexist.

**6. Conclusion**
- Generally, through ***Exploratory Data Analysis (EDA) and both Decision Tree implementations***, you may conclude that certain features—such as Cuisine type, Locality, or Average Price—have a strong influence on whether a restaurant receives high ratings. _Features like 'MISSING' or non-informative columns could be confirmed as noise, negatively affecting model accuracy._
- Comparing ***Decision Tree 1 (all features) with Decision Tree 2 (cleaned features)***, you might find that:
    - Removing irrelevant or noisy features leads to higher accuracy and simpler tree structures.
    - This supports the idea that domain knowledge-based feature pruning improves model performance.
- ***Mixed Naive Bayes (MNB) might perform competitively or better on some metrics*** (like precision or recall) compared to Decision Trees, especially in cases where feature independence is mostly true. However, MNB might underperform if features are highly correlated, where Decision Trees can better handle interactions.
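
To make the Mixed Naive Bayes idea from step 5 concrete, the sketch below hand-rolls the technique in NumPy: one categorical feature is scored with Laplace-smoothed class frequencies and one continuous feature with a per-class Gaussian, and the log-probabilities are summed under the independence assumption. It is not the `mixed-naive-bayes` package's implementation; the function names `fit_mixed_nb` and `predict_mixed_nb` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def fit_mixed_nb(X_cat, X_num, y, n_categories, alpha=1.0):
    """Fit per-class priors, smoothed category frequencies, and Gaussian stats."""
    classes = np.unique(y)
    model = {"classes": classes, "prior": [], "cat": [], "mean": [], "var": []}
    for c in classes:
        rows = (y == c)
        model["prior"].append(rows.mean())
        # Laplace-smoothed frequency of each category index within the class.
        counts = np.bincount(X_cat[rows], minlength=n_categories) + alpha
        model["cat"].append(counts / counts.sum())
        model["mean"].append(X_num[rows].mean())
        model["var"].append(X_num[rows].var() + 1e-9)
    return model

def predict_mixed_nb(model, X_cat, X_num):
    """Pick the class with the highest summed log-probability."""
    scores = []
    for k in range(len(model["classes"])):
        log_prior = np.log(model["prior"][k])
        log_cat = np.log(model["cat"][k][X_cat])
        mean, var = model["mean"][k], model["var"][k]
        log_num = -0.5 * np.log(2 * np.pi * var) - (X_num - mean) ** 2 / (2 * var)
        scores.append(log_prior + log_cat + log_num)
    return model["classes"][np.argmax(np.vstack(scores), axis=0)]

# Hypothetical toy data: one categorical feature (cuisine index) and one
# continuous feature (price for two); labels are 1 = high rating, 0 = not.
cuisine_idx = np.array([0, 0, 1, 1, 2, 2])
price = np.array([1500.0, 1400.0, 500.0, 450.0, 1300.0, 480.0])
is_high = np.array([1, 1, 0, 0, 1, 0])

model = fit_mixed_nb(cuisine_idx, price, is_high, n_categories=3)
pred = predict_mixed_nb(model, cuisine_idx, price)
print(pred)
```

Summing the two log terms is exactly where the independence assumption enters, which is why strongly correlated features can hurt this model more than a Decision Tree.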

---

# **IV. Data Preprocessing**

< This section covers importing and inspecting the data, as well as cleaning it of null or duplicated values. > <!-- Expound more -->

In [1]:
## Assume that we do not have the necessary libraries installed.
# Install the libraries needed to run the code (scipy and statsmodels are imported below).
%pip install pandas numpy matplotlib seaborn scikit-learn scipy statsmodels mixed-naive-bayes graphviz
%pip install --upgrade pip  # Updates pip

# tkinter is only needed if the file dialog imports below are re-enabled.
# For mac: brew install python-tk

import math

import numpy as np
import pandas as pd
# import tkinter as tk
# from tkinter import filedialog

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Patch

from scipy import stats
from scipy.stats import f, percentileofscore
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.preprocessing import LabelEncoder, MinMaxScaler, MultiLabelBinarizer
from sklearn.model_selection import KFold, cross_val_score, train_test_split ## https://www.geeksforgeeks.org/cross-validation-machine-learning/
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    confusion_matrix,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

from mixed_naive_bayes import MixedNB



Collecting mixed-naive-bayes
  Downloading mixed_naive_bayes-0.0.3-py3-none-any.whl.metadata (8.8 kB)
Downloading mixed_naive_bayes-0.0.3-py3-none-any.whl (11 kB)
Installing collected packages: mixed-naive-bayes
Successfully installed mixed-naive-bayes-0.0.3
Collecting pip
  Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 12.3 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.1.1


In [None]:
## Google Colab only: upload the dataset through the browser.
## Skip this cell when running locally (Jupyter Notebook / VS Code).
import io

from google.colab import files
uploaded = files.upload()

Run the cell above before uploading the input file.

If you uploaded the file before, make sure to update the file name in the file path below so that it matches.

If you have not, kindly upload the data file.

In [None]:
## If using Jupyter Notebook / Run Locally via VS Code. Import the file LOZADA, BINWAG, CSCI 211 Zomato Dataset Pune.csv
# Hardcoded file path to your dataset

file_path = "LOZADA, BINWAG, CSCI 211 Zomato Dataset Pune.csv"
zomato_pune = pd.read_csv(file_path)

# Create a working copy for analysis
zomato_for_eda = zomato_pune.copy()

# Display the data
zomato_for_eda.head()

In [None]:
# Display the data
zomato_for_eda.tail()

## **Restaurant Count per Locality**

In [None]:
# Count restaurants per locality
locality_counts = zomato_for_eda['Locality in Pune'].value_counts()

# Plot the top 15
plt.figure(figsize=(12, 6))
sns.barplot(x=locality_counts.head(15).values, y=locality_counts.head(15).index, palette="viridis")

plt.title("Top 15 Localities by Number of Restaurants")
plt.xlabel("Number of Restaurants")
plt.ylabel("Locality")
plt.tight_layout()
plt.show()

## **Listing All Cuisines**

In [None]:
# List of cuisine columns
cuisine_cols = [f'Cuisine{i}' for i in range(1, 9)]

# Flatten the eight cuisine columns and collect the unique values
all_cuisines = pd.unique(
    zomato_for_eda[cuisine_cols]
    .values
    .ravel()
)

# Clean list: drop NaN and the "MISSING" placeholder, then sort
unique_cuisines = sorted([c for c in all_cuisines if pd.notna(c) and c != 'MISSING'])

# Display
print("Number of unique cuisines:", len(unique_cuisines))
print(unique_cuisines)


## **Correlation between Pricing_for_2 and Dining_Rating.**
This checks whether "cheaper" restaurants tend to receive better ratings. It is only a shallow experiment, since pricing cannot be the only factor behind a high rating.
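
As a reminder of what the correlation figure measures, the sketch below computes Pearson's r by hand and compares it with `np.corrcoef`. The price and rating arrays are hypothetical values chosen for illustration, not rows from the dataset; `DataFrame.corr()` uses the same Pearson coefficient by default.

```python
import numpy as np

# Hypothetical prices (INR for two) and ratings, for illustration only.
price = np.array([400, 600, 900, 1200, 2100], dtype=float)
rating = np.array([3.6, 3.9, 3.8, 4.2, 4.4])

# Pearson r = co-deviation / sqrt(product of squared deviations), written out by hand.
dx = price - price.mean()
dy = rating - rating.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns the full 2x2 matrix; the off-diagonal entry is r.
r_numpy = np.corrcoef(price, rating)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

A value near 0 would mean price carries little linear information about rating, even if a non-linear pattern shows up in the hexbin plot.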

In [None]:
plt.hexbin(zomato_for_eda['Pricing_for_2'], zomato_for_eda['Dining_Rating'], gridsize=30, cmap='Blues')
plt.colorbar(label='Count in Bin')
plt.xlabel('Pricing for 2')
plt.ylabel('Dining Rating')
plt.title('Hexbin: Price vs Rating')
plt.show()

In [None]:
correlation = zomato_for_eda[['Pricing_for_2', 'Dining_Rating']].corr()
print("Correlation between Pricing and Rating:")
print(correlation)

## **Correlation between Locality and Cuisine (1-8) to Dining_Rating.**
We next examine whether the rating of a cuisine depends on the locality where it is served. For instance, `Mediterranean` and `European` cuisines served in `Koregaon Park` might receive high ratings while `Coffee` and `Desserts` served in the same locality receive low ones.

In [None]:
# Reshape cuisine columns into one
cuisine_cols = [f'Cuisine{i}' for i in range(1, 9)]

# Melt cuisine columns
long_df = zomato_for_eda.melt(
    id_vars=['Dining_Rating', 'Locality in Pune'],
    value_vars=cuisine_cols,
    var_name='CuisineCol',
    value_name='Cuisine'
)

# Drop missing cuisines (both NaN and the "MISSING" placeholder)
long_df = long_df[long_df['Cuisine'] != 'MISSING'].dropna(subset=['Cuisine'])

# Grouping
rating_by_combo = (
    long_df
    .groupby(['Cuisine', 'Locality in Pune'])['Dining_Rating']
    .mean()
    .reset_index()
    .rename(columns={'Dining_Rating': 'Avg_Rating'})
)

# Top 15 highest-rated cuisine-locality combos.
# This keeps each row's original index.
rating_by_combo.sort_values('Avg_Rating', ascending=False).head(15)

# To renumber the index from 0 instead, uncomment the two lines below.
# cleaned_top = rating_by_combo.sort_values('Avg_Rating', ascending=False).head(15).reset_index(drop=True)
# cleaned_top # Prints



In [None]:
# Filter to threshold. Only Avg_Rating over 3.8
filtered = rating_by_combo[rating_by_combo['Avg_Rating'] > 3.8]

# Keep only top cuisines and localities by frequency
top_cuisines = (
    long_df['Cuisine'].value_counts()
    .loc[lambda x: x.index != 'MISSING']
    .head(10).index
)

top_localities = long_df['Locality in Pune'].value_counts().head(10).index

# Apply filter
filtered = filtered[
    (filtered['Cuisine'].isin(top_cuisines)) &
    (filtered['Locality in Pune'].isin(top_localities))
]

# Pivot
heatmap_data = filtered.pivot(
    index="Cuisine",
    columns="Locality in Pune",
    values="Avg_Rating"
)

# Plot
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, fmt=".1f", cmap="YlGnBu", linewidths=0.5, linecolor='gray')
plt.title("Dining Rating > 3.8: Top 10 Cuisines x Top 10 Localities")
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Define the features and filter the DataFrame
input_features = ["Cuisine1", "Cuisine2", "Cuisine3", "Cuisine4",
                  "Cuisine5", "Cuisine6", "Cuisine7", "Cuisine8",
                  "Pricing_for_2", "Locality in Pune"]

X = zomato_for_eda.filter(items=input_features)

# all cuisines
cuisine_cols = [f'Cuisine{i}' for i in range(1, 9)]
cuisines_flat = pd.Series(X[cuisine_cols].values.ravel())

# Clean values
cuisines_flat = cuisines_flat.replace("MISSING", np.nan).dropna()

# Count top 15
cuisines_freq = cuisines_flat.value_counts().head(15)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=cuisines_freq.values, y=cuisines_freq.index)
plt.title("Top 15 Most Common Cuisines")
plt.xlabel("Number of Occurrences")
plt.tight_layout()
plt.show()


## **More EDA**

In [None]:
# Set rating threshold
rating_threshold = 3.75 ## Bhatia et al. 2023

# First-level helper functions
def find_percent(df, feature_group, feature):
    feature_filter = df.loc[df[feature_group] == feature]
    if not feature_filter.empty:
        percent_above_cutoff = 100 - percentileofscore(
            feature_filter['Dining_Rating'], rating_threshold, kind='strict'
        )
    else:
        percent_above_cutoff = 0
    return percent_above_cutoff

def find_mean_rating(df, feature_group, feature):
    feature_filter = df.loc[df[feature_group] == feature]
    return feature_filter['Dining_Rating'].mean() if not feature_filter.empty else np.nan

# Second-level EDA helper
def eda_resto_data_numerical(df, feature_group):
    feature_list = pd.unique(df[feature_group]).tolist()
    feature_list2 = [feature for feature in feature_list if feature != 'MISSING' and pd.notnull(feature)]

    percent_exceeding_cutoff_feature = []
    average_per_cutoff_feature = []

    for feature in feature_list2:
        above_threshold = round(find_percent(df, feature_group, feature), 4)
        percent_exceeding_cutoff_feature.append(above_threshold)

        mean_rating = round(find_mean_rating(df, feature_group, feature), 4)
        average_per_cutoff_feature.append(mean_rating)

    # MinMax scaling
    scaler = MinMaxScaler()
    mean_scaled = scaler.fit_transform(np.array(average_per_cutoff_feature).reshape(-1, 1)).flatten()
    mean_scaled = [round(score, 4) for score in mean_scaled]

    # Create summary DataFrame
    df_mean_feature = pd.DataFrame({
        feature_group: feature_list2,
        '% Above 3.75': percent_exceeding_cutoff_feature,
        'Mean Rating': average_per_cutoff_feature,
        'MinMax Scale Score': mean_scaled
    })

    df_mean_feature = df_mean_feature.sort_values(by=['Mean Rating', feature_group], ascending=[False, True]).reset_index(drop=True)
    df_mean_feature['Rank'] = df_mean_feature.index + 1

    return df_mean_feature
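
The two numbers these helpers produce can be checked by hand. The sketch below is a plain-Python illustration with hypothetical ratings: it reproduces the "percent above cutoff" logic (100 minus the strict percentile, that is, the share of ratings at or above the threshold) and the min-max scaling of group means. It also shows that, if ratings are recorded to one decimal place as the examples in the feature table suggest, the `3.75` cutoff here and the `3.8` cutoff used elsewhere in the notebook select the same rows.

```python
# Hypothetical ratings for one locality, for illustration only.
ratings = [3.5, 3.8, 3.8, 4.1, 4.4]
threshold = 3.75

# kind='strict' percentile = share of values strictly below the score,
# so 100 minus it is the share of ratings at or above the threshold.
strictly_below = sum(r < threshold for r in ratings)
percent_at_or_above = 100 - 100 * strictly_below / len(ratings)

# With one-decimal ratings, ">= 3.75" and ">= 3.8" select the same rows.
same_rows = [r >= 3.75 for r in ratings] == [r >= 3.8 for r in ratings]

# Min-max scaling maps the lowest group mean to 0 and the highest to 1.
group_means = [3.6, 3.9, 4.2]
lo, hi = min(group_means), max(group_means)
scaled = [(m - lo) / (hi - lo) for m in group_means]
print(percent_at_or_above, same_rows, scaled)
```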

In [None]:
eda_locality = eda_resto_data_numerical(zomato_for_eda, 'Locality in Pune')
eda_locality.head()

In [None]:
eda_resto_cuisine1 = eda_resto_data_numerical(zomato_for_eda, 'Cuisine1')
eda_resto_cuisine1.head()

In [None]:
eda_resto_cuisine2 = eda_resto_data_numerical(zomato_for_eda, 'Cuisine2')
eda_resto_cuisine2.head()

In [None]:
eda_resto_cuisine3 = eda_resto_data_numerical(zomato_for_eda, 'Cuisine3')
eda_resto_cuisine3.head()

In [None]:
eda_resto_cuisine4 = eda_resto_data_numerical(zomato_for_eda, 'Cuisine4')
eda_resto_cuisine4.head()

In [None]:
eda_resto_cuisine5 = eda_resto_data_numerical(zomato_for_eda, 'Cuisine5')
eda_resto_cuisine5.head()

In [None]:
eda_resto_cuisine6 = eda_resto_data_numerical(zomato_for_eda, 'Cuisine6')
eda_resto_cuisine6.head()

In [None]:
eda_resto_cuisine7 = eda_resto_data_numerical(zomato_for_eda, 'Cuisine7')
eda_resto_cuisine7.head()

In [None]:
eda_resto_cuisine8 = eda_resto_data_numerical(zomato_for_eda, 'Cuisine8')
eda_resto_cuisine8.head()

In [None]:
zomato_pune_yes = zomato_for_eda[zomato_for_eda['Dining_Rating'] >= 3.8]
zomato_pune_yes.head()

In [None]:
zomato_pune_no = zomato_for_eda[zomato_for_eda['Dining_Rating'] < 3.8]
zomato_pune_no.head()

In [None]:
min_cost = zomato_pune_yes['Pricing_for_2'].min()
max_cost = zomato_pune_yes['Pricing_for_2'].max()

# One bin per 50 INR of the price range
num_bins = int((max_cost - min_cost) / 50) + 1

plt.figure(figsize=(15, 5))
sns.histplot(zomato_pune_yes['Pricing_for_2'], bins=num_bins, color='magenta')
plt.title("Meal Cost Distribution of the Zomato Dataset in Pune (if rating >= 3.8)")
plt.xlabel("Cost of Meal for 2 (INR)")
plt.xlim(0, 5000)
plt.ylabel("Frequency")
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

In [None]:
min_cost = zomato_pune_no['Pricing_for_2'].min()
max_cost = zomato_pune_no['Pricing_for_2'].max()

# One bin per 50 INR of the price range
num_bins = int((max_cost - min_cost) / 50) + 1

plt.figure(figsize=(15, 5))
sns.histplot(zomato_pune_no['Pricing_for_2'], bins=num_bins, color='red')
plt.title("Meal Cost Distribution of the Zomato Dataset in Pune (if rating < 3.8)")
plt.xlabel("Cost of Meal for 2 (INR)")
plt.xlim(0, 5000)
plt.ylabel("Frequency")
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

## **Descriptive and Inferential statistics**
For the two subgroups of the dataset (rating >= 3.8 and rating < 3.8), we take the mean price (in Indian Rupees), the standard deviation, and the number of instances per group. Then we run the unequal-variances (Welch's) t-test on the prices.
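
For reference, the unequal-variances test statistic can be written out with the standard library alone. The sketch below is a minimal illustration on hypothetical price samples; the helper name `welch_t` is an assumption. It computes the same t statistic that `stats.ttest_ind(..., equal_var=False)` reports, together with the Welch-Satterthwaite degrees of freedom.

```python
import math
import statistics

# Hypothetical price samples (INR for two), for illustration only.
prices_high = [1200.0, 1500.0, 900.0, 2100.0, 1300.0]
prices_low = [500.0, 700.0, 600.0, 800.0]

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances (N - 1)
    se2_a, se2_b = va / len(a), vb / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

t_stat, dof = welch_t(prices_high, prices_low)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Because the two groups can differ a lot in size and spread, the degrees of freedom come out lower than the pooled value of `n1 + n2 - 2`.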

In [None]:
zomato_price_overall = zomato_for_eda['Pricing_for_2']
zomato_prices_yes = np.array(zomato_pune_yes['Pricing_for_2'])
zomato_prices_no = np.array(zomato_pune_no['Pricing_for_2'])

mean_prices_overall = np.mean(zomato_price_overall)
mean_prices_yes = np.mean(zomato_prices_yes)
mean_prices_no = np.mean(zomato_prices_no)

stdev_prices_overall = np.std(zomato_price_overall, ddof=1) ## Sample standard deviation hence N - 1 degrees of freedom
stdev_prices_yes = np.std(zomato_prices_yes, ddof=1) ## Sample standard deviation hence N - 1 degrees of freedom
stdev_prices_no = np.std(zomato_prices_no, ddof=1)

count_prices_overall = len(zomato_price_overall)
count_prices_yes = len(zomato_prices_yes)
count_prices_no = len(zomato_prices_no)


print("Overall distribution")
print(f"Mean Price: {mean_prices_overall:.2f}")
print(f"Standard Deviation: {stdev_prices_overall:.2f}")
print(f"Number of Instances: {count_prices_overall}")

print("\nFor Pune Restaurants with rating >= 3.8")
print(f"Mean Price: {mean_prices_yes:.2f}")
print(f"Standard Deviation: {stdev_prices_yes:.2f}")
print(f"Number of Instances: {count_prices_yes}")

print("\nFor Pune Restaurants with rating < 3.8")
print(f"Mean Price: {mean_prices_no:.2f}")
print(f"Standard Deviation: {stdev_prices_no:.2f}")
print(f"Number of Instances: {count_prices_no}")

t_value, p_value = stats.ttest_ind(zomato_prices_yes, zomato_prices_no, equal_var=False)
print(f"p-Value: {p_value}")

In [None]:
## Map each feature value to a numerical index according to its rank.
## Instead of mapping 'MISSING' to zero,
## we now map "MISSING" as the (N + 1)th feature.

def feature_map(feature_df, row_feature):
    mapping = {}

    for _, row in feature_df.iterrows():
        feature_name = row[row_feature]
        rank = row['Rank']

        mapping[feature_name] = rank - 1 ## Zero-based index, to account for the requirements of Mixed Naive Bayes

    mapping['MISSING'] = len(mapping)
    ## Heads up: From now on, MISSING will be rank N + 1, but encoded as index N
    ## This is to ensure that the behavior of mixed Naive Bayes will function as intended.
    return mapping

# Create mappings for cuisine 1 to cuisine 8
cuisine1_map = feature_map(eda_resto_cuisine1, 'Cuisine1')
cuisine2_map = feature_map(eda_resto_cuisine2, 'Cuisine2')
cuisine3_map = feature_map(eda_resto_cuisine3, 'Cuisine3')
cuisine4_map = feature_map(eda_resto_cuisine4, 'Cuisine4')
cuisine5_map = feature_map(eda_resto_cuisine5, 'Cuisine5')
cuisine6_map = feature_map(eda_resto_cuisine6, 'Cuisine6')
cuisine7_map = feature_map(eda_resto_cuisine7, 'Cuisine7')
cuisine8_map = feature_map(eda_resto_cuisine8, 'Cuisine8')

# Create mapping for locality
locality_map = feature_map(eda_locality, 'Locality in Pune')

# map the columns for cuisine
zomato_for_eda['Cuisine1'] = zomato_for_eda['Cuisine1'].map(cuisine1_map)
zomato_for_eda['Cuisine2'] = zomato_for_eda['Cuisine2'].map(cuisine2_map)
zomato_for_eda['Cuisine3'] = zomato_for_eda['Cuisine3'].map(cuisine3_map)
zomato_for_eda['Cuisine4'] = zomato_for_eda['Cuisine4'].map(cuisine4_map)
zomato_for_eda['Cuisine5'] = zomato_for_eda['Cuisine5'].map(cuisine5_map)
zomato_for_eda['Cuisine6'] = zomato_for_eda['Cuisine6'].map(cuisine6_map)
zomato_for_eda['Cuisine7'] = zomato_for_eda['Cuisine7'].map(cuisine7_map)
zomato_for_eda['Cuisine8'] = zomato_for_eda['Cuisine8'].map(cuisine8_map)

# Map the columns for locality in Pune
zomato_for_eda['Locality in Pune'] = zomato_for_eda['Locality in Pune'].map(locality_map)

# Create binary classification column
zomato_for_eda['isHighRating'] = (zomato_for_eda['Dining_Rating'] >= rating_threshold).astype(int)

# Drop the original numerical rating column
zomato_for_eda = zomato_for_eda.drop(columns=['Dining_Rating'])

# Reorder columns to ensure binary classification is at the rightmost position
zomato_for_eda = zomato_for_eda[[col for col in zomato_for_eda.columns if col not in ['isHighRating']] + ['isHighRating']]
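
The encoding produced by `feature_map` can be seen on a tiny example. The sketch below is a plain-Python illustration with a hypothetical ranking of three cuisines (the names and their order are assumptions): ranks become zero-based indices, and the `MISSING` placeholder takes the next free index.

```python
# Hypothetical ranking of three cuisines (rank 1 = highest mean rating).
ranked_cuisines = ["Mediterranean", "Cafe", "Desserts"]

# Rank r becomes index r - 1, so indices run from 0 to N - 1.
toy_map = {name: rank for rank, name in enumerate(ranked_cuisines)}

# The placeholder takes the next free index, N, rather than 0.
toy_map["MISSING"] = len(toy_map)
print(toy_map)
```

Note that index `0` is therefore a real category (the top-ranked one), while `MISSING` is the largest index in each column.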

In [None]:
input_features = ["Cuisine1", "Cuisine2", "Cuisine3", "Cuisine4", "Cuisine5", "Cuisine6", "Cuisine7", "Cuisine8", "Pricing_for_2", "Locality in Pune"]

X = zomato_for_eda.filter(items = input_features)
X

In [None]:
y = zomato_for_eda["isHighRating"]
y

### **Post-EDA**

After testing various statistical methods for EDA, ...

---

# **V. Decision Tree Implementation 1 (DT1)**

In [None]:
## 70% train, 30% held out ("other") for evaluation
X_train, X_other, y_train, y_other = train_test_split(X, y, test_size = 0.3, random_state = 45)

dec_tree = DecisionTreeClassifier(random_state = 45)
dec_tree.fit(X_train, y_train)

print("Training features (MISSING encoded as the last index of each column):")
print(X_train.head())

In [None]:
y_pred = dec_tree.predict(X_other)

cm = confusion_matrix(y_other, y_pred)
cm_display = ConfusionMatrixDisplay(cm, display_labels = dec_tree.classes_)
cm_display.plot(cmap = plt.cm.Blues)
plt.show()

In [None]:
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn) if (tp + fn) != 0 else 0
specificity = tn / (tn + fp) if (tn + fp) != 0 else 0
accuracy = accuracy_score(y_other, y_pred)

print(f"True Positives: {tp}")
print(f"Sensitivity: {sensitivity*100:.2f} %")
print(f"Specificity: {specificity*100:.2f} %")
print(f"Accuracy: {accuracy*100:.2f} %")

In [None]:
def DecisionTreeImplementor1(instance, random_state, accuracy_list, sensitivity_list, specificity_list):
  # `instance` is the run index; it is accepted for the loop below but not used here.

  X_train, X_other, y_train, y_other = train_test_split(X, y, test_size = 0.3, random_state = random_state)

  dec_tree = DecisionTreeClassifier(random_state = random_state)
  dec_tree.fit(X_train, y_train)

  y_pred = dec_tree.predict(X_other)
  cm = confusion_matrix(y_other, y_pred)
  cm_display = ConfusionMatrixDisplay(cm, display_labels = dec_tree.classes_)
  cm_display.plot(cmap = plt.cm.Blues)
  plt.show()

  tn, fp, fn, tp = cm.ravel()
  sensitivity = tp / (tp + fn) if (tp + fn) != 0 else 0
  specificity = tn / (tn + fp) if (tn + fp) != 0 else 0
  accuracy = accuracy_score(y_other, y_pred)
  print(f"True Positives: {tp}")
  print(f"Sensitivity: {sensitivity*100:.2f} %")
  print(f"Specificity: {specificity*100:.2f} %")
  print(f"Accuracy: {accuracy*100:.2f} %")

  # The lists are appended to in place and returned for convenience
  accuracy_list.append(round(100*accuracy, 2))
  sensitivity_list.append(float(round(100*sensitivity, 2)))
  specificity_list.append(float(round(100*specificity, 2)))

  return accuracy_list, sensitivity_list, specificity_list


random_state_list = [45, 65, 105, 225, 335]

accuracy_list_DT1 = []
sensitivity_list_DT1 = []
specificity_list_DT1 = []

for i in range(len(random_state_list)):
  accuracy_list_DT1, sensitivity_list_DT1, specificity_list_DT1 = DecisionTreeImplementor1(i, random_state_list[i], accuracy_list_DT1, sensitivity_list_DT1, specificity_list_DT1)

mean_accuracy_DT1 = round(np.mean(accuracy_list_DT1), 2)
mean_sensitivity_DT1 = round(np.mean(sensitivity_list_DT1), 2)
mean_specificity_DT1 = round(np.mean(specificity_list_DT1), 2)

## Sample standard deviation
standev_accuracy_DT1 = round(np.std(accuracy_list_DT1, ddof = 1), 2)
standev_sensitivity_DT1 = round(np.std(sensitivity_list_DT1, ddof = 1), 2)
standev_specificity_DT1 = round(np.std(specificity_list_DT1, ddof = 1), 2)

print(f"For {len(random_state_list)} iterations of Decision Tree Classifier")
print(f"Accuracy — mean: {mean_accuracy_DT1}%, standard deviation: {standev_accuracy_DT1}%")
print(f"Sensitivity — mean: {mean_sensitivity_DT1}%, standard deviation: {standev_sensitivity_DT1}%")
print(f"Specificity — mean: {mean_specificity_DT1}%, standard deviation: {standev_specificity_DT1}%")

The current Decision Tree (DT1) implementation uses the basic cleaned data. This means that we decided to keep the `MISSING` values that arise from the way the data was collected. While many restaurants serve only one or two cuisines, some serve more, which is why the dataset has the columns `Cuisine1` to `Cuisine8`. As such, `MISSING` values may cause false negatives when predicting a high `Dining_Rating`.

We trained Decision Tree classifiers across several random states: `45`, `65`, `105`, `225`, `335`. In this case, DT1 has a mean accuracy of `63.50%`.
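
The reported metrics all come from the four cells of the confusion matrix, averaged over the splits. The sketch below is a standard-library illustration; the helper name `rates` and the confusion counts are hypothetical and are not results from this notebook.

```python
import statistics

def rates(tn, fp, fn, tp):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    sensitivity = 100 * tp / (tp + fn) if (tp + fn) else 0.0
    specificity = 100 * tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = 100 * (tp + tn) / (tn + fp + fn + tp)
    return sensitivity, specificity, accuracy

# Hypothetical confusion counts for three splits (not results from this notebook).
runs = [(50, 30, 25, 45), (55, 25, 30, 40), (48, 32, 20, 50)]
accuracies = [rates(*run)[2] for run in runs]

mean_acc = statistics.mean(accuracies)
sd_acc = statistics.stdev(accuracies)  # sample standard deviation (N - 1)
print(f"accuracy: mean {mean_acc:.2f}%, sd {sd_acc:.2f}%")
```

Reporting the standard deviation alongside the mean shows how much of the accuracy figure depends on the particular split.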

# **VI. Decision Tree Implementation 2 (DT2)**

For DT2, the dataset is cleaned further by recoding the `MISSING` values (as `-1`) and by making sure that no _"junk"_ data remains that could make the Decision Tree lose accuracy.
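
Since the mapping cell encodes `MISSING` as the largest index of each column rather than as `0`, the recoding has to look up that index per column. The sketch below illustrates the idea on a hypothetical two-column frame; the frame `encoded` and the dictionary `missing_code` are assumptions standing in for the real encoded data and for each map's `'MISSING'` entry.

```python
import pandas as pd

# Hypothetical encoded columns; in each one the largest index stands for MISSING.
encoded = pd.DataFrame({"Cuisine1": [0, 2, 3], "Cuisine2": [1, 4, 4]})
missing_code = {"Cuisine1": 3, "Cuisine2": 4}  # assumed per-column MISSING index

recoded = encoded.copy()
for col, code in missing_code.items():
    # Only the placeholder index changes; real categories such as 0 stay as they are.
    recoded[col] = recoded[col].replace(code, -1)
print(recoded)
```

For a Decision Tree this moves `MISSING` from the top of each column's value range to the bottom; it does not remove the rows.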

In [None]:
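# Replace the 'MISSING' code with -1 (it is the last index of each cuisine map, not 0)
cuisine_cols = [f'Cuisine{i}' for i in range(1, 9)]
cuisine_maps = [cuisine1_map, cuisine2_map, cuisine3_map, cuisine4_map,
                cuisine5_map, cuisine6_map, cuisine7_map, cuisine8_map]
X_all = zomato_for_eda.drop(columns=['isHighRating']).copy()
for col, cuisine_map in zip(cuisine_cols, cuisine_maps):
    X_all[col] = X_all[col].replace(cuisine_map['MISSING'], -1)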

# Keep only numeric columns (drop Restaurant_Name etc.)
X_clean = X_all.select_dtypes(include=[np.number])
y_clean = zomato_for_eda['isHighRating']

print("Cleaned DataFrame with MISSING features replaced with -1:")
print(X_clean.head())


In [None]:
# Train-test split
X_train2, X_other2, y_train2, y_other2 = train_test_split(X_clean, y_clean, test_size=0.3, random_state=45)

# Train the model
dec_tree2 = DecisionTreeClassifier(random_state=45)
dec_tree2.fit(X_train2, y_train2)

# Predict and evaluate
y_pred2 = dec_tree2.predict(X_other2)
cm2 = confusion_matrix(y_other2, y_pred2)
cm_display2 = ConfusionMatrixDisplay(cm2, display_labels=dec_tree2.classes_)
cm_display2.plot(cmap=plt.cm.Greens)
plt.title("Confusion Matrix for Decision Tree 2")
plt.show()

In [None]:
# Metrics
tn2, fp2, fn2, tp2 = cm2.ravel()
sensitivity2 = tp2 / (tp2 + fn2) if (tp2 + fn2) != 0 else 0
specificity2 = tn2 / (tn2 + fp2) if (tn2 + fp2) != 0 else 0
accuracy2 = accuracy_score(y_other2, y_pred2)

print(f"Decision Tree 2 Metrics (Ignoring MISSING features):")
print(f"True Positives: {tp2}")
print(f"Sensitivity: {sensitivity2*100:.2f} %")
print(f"Specificity: {specificity2*100:.2f} %")
print(f"Accuracy: {accuracy2*100:.2f} %")

In [None]:
feature_importances = pd.Series(dec_tree2.feature_importances_, index=X_clean.columns)
feature_importances = feature_importances.sort_values(ascending=False)
print(feature_importances.head(10))

## **Decision Tree Implementation 3 (DT3)**

Will removing `Cuisine3` to `Cuisine8` make the model more accurate? These columns have the least importance, and most of their values are missing.
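
One way to get a feel for this question is a toy comparison. The sketch below uses synthetic data (not the Zomato dataset): one column carries signal, three columns are pure noise, and 20% of the labels are flipped. The helper `holdout_accuracy` and all array names are assumptions. On data like this the noisy columns often cost the tree some accuracy, but the outcome depends on the data, so the same comparison has to be run on the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: column 0 carries signal, columns 1-3 are pure noise.
signal = rng.integers(0, 5, size=400)
noise = rng.integers(0, 5, size=(400, 3))
X_full = np.column_stack([signal, noise])

# Labels follow the signal column, with 20% flipped to imitate noisy ratings.
flip = rng.random(400) < 0.2
y_toy = ((signal >= 3) ^ flip).astype(int)

def holdout_accuracy(X, y, seed=45):
    """Train on 70% of the rows and return accuracy on the remaining 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    return tree.score(X_te, y_te)

acc_full = holdout_accuracy(X_full, y_toy)
acc_subset = holdout_accuracy(X_full[:, :1], y_toy)
print(f"all columns: {acc_full:.3f}, signal column only: {acc_subset:.3f}")
```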

In [None]:
# Keep only Cuisine1, Cuisine2, Pricing_for_2, Locality
selected_cuisine_cols = ['Cuisine1', 'Cuisine2']
keep_cols = selected_cuisine_cols + ['Pricing_for_2', 'Locality in Pune']

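X_subset = zomato_for_eda[keep_cols].copy()
# MISSING is encoded as the last index of each cuisine map, so recode that value (not 0) to -1
X_subset['Cuisine1'] = X_subset['Cuisine1'].replace(cuisine1_map['MISSING'], -1)
X_subset['Cuisine2'] = X_subset['Cuisine2'].replace(cuisine2_map['MISSING'], -1)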

# Keep only numeric columns
X_dt3 = X_subset.select_dtypes(include=[np.number])
y_dt3 = zomato_for_eda['isHighRating']

# Split 70/30
X_train3, X_other3, y_train3, y_other3 = train_test_split(X_dt3, y_dt3, test_size=0.3, random_state=45)

# Train DT3
dec_tree3 = DecisionTreeClassifier(random_state=45)
dec_tree3.fit(X_train3, y_train3)

# Predict + Confusion Matrix
y_pred3 = dec_tree3.predict(X_other3)
cm3 = confusion_matrix(y_other3, y_pred3)
cm_display3 = ConfusionMatrixDisplay(cm3, display_labels=dec_tree3.classes_)
cm_display3.plot(cmap=plt.cm.Oranges)
plt.title("Confusion Matrix for Decision Tree 3 (Cuisine 1 & 2 only)")
plt.show()

# Metrics
tn3, fp3, fn3, tp3 = cm3.ravel()
sensitivity3 = tp3 / (tp3 + fn3) if (tp3 + fn3) != 0 else 0
specificity3 = tn3 / (tn3 + fp3) if (tn3 + fp3) != 0 else 0
accuracy3 = accuracy_score(y_other3, y_pred3)

print(f"Decision Tree 3 Metrics (Only Cuisine1 & Cuisine2):")
print(f"True Positives: {tp3}")
print(f"Sensitivity: {sensitivity3*100:.2f} %")
print(f"Specificity: {specificity3*100:.2f} %")
print(f"Accuracy: {accuracy3*100:.2f} %")


In [None]:
def DecisionTreeImplementor3(instance, random_state, accuracy_list, sensitivity_list, specificity_list):

  # Step 1: Keep only Cuisine1, Cuisine2, Pricing_for_2, Locality
  selected_cuisine_cols = ['Cuisine1', 'Cuisine2']
  keep_cols = selected_cuisine_cols + ['Pricing_for_2', 'Locality in Pune']

  X_subset = zomato_for_eda[keep_cols].copy()
  X_subset[selected_cuisine_cols] = X_subset[selected_cuisine_cols].replace(0, -1)

  # Step 2: Keep only numeric columns
  X_dt3 = X_subset.select_dtypes(include=[np.number])
  y_dt3 = zomato_for_eda['isHighRating']

  # Step 3: Split 70/30
  X_train3, X_other3, y_train3, y_other3 = train_test_split(X_dt3, y_dt3, test_size=0.3, random_state=random_state)

  # Step 4: Train DT3
  dec_tree3 = DecisionTreeClassifier(random_state=45)
  dec_tree3.fit(X_train3, y_train3)

  y_pred3 = dec_tree3.predict(X_other3)
  cm3 = confusion_matrix(y_other3, y_pred3)
  cm_display3 = ConfusionMatrixDisplay(cm3, display_labels=dec_tree3.classes_)
  cm_display3.plot(cmap=plt.cm.Oranges)
  plt.title("Confusion Matrix for Decision Tree 3 (Cuisine 1 & 2 only)")
  plt.show()

  tn, fp, fn, tp = cm3.ravel()
  sensitivity = tp / (tp + fn) if (tp + fn) != 0 else 0
  specificity = tn / (tn + fp) if (tn + fp) != 0 else 0
  accuracy = accuracy_score(y_other3, y_pred3)
  print(f"True Positives: {tp}")
  print(f"Sensitivity: {sensitivity*100:.2f} %")
  print(f"Specificity: {specificity*100:.2f} %")
  print(f"Accuracy: {accuracy*100:.2f} %")

  accuracy_list.append(round(100*accuracy, 2))
  sensitivity_list.append(float(round(100*sensitivity, 2)))
  specificity_list.append(float(round(100*specificity, 2)))

  # The lists are mutated in place by append(); return them for convenience
  return accuracy_list, sensitivity_list, specificity_list


random_state_list = [45, 65, 105, 225, 335]

accuracy_list_DT3 = []
sensitivity_list_DT3 = []
specificity_list_DT3 = []

for i in range(len(random_state_list)):
  accuracy_list, sensitivity_list, specificity_list = DecisionTreeImplementor3(i, random_state_list[i], accuracy_list_DT3, sensitivity_list_DT3, specificity_list_DT3)

mean_accuracy_DT3 = round(np.mean(accuracy_list_DT3), 2)
mean_sensitivity_DT3 = round(np.mean(sensitivity_list_DT3), 2)
mean_specificity_DT3 = round(np.mean(specificity_list_DT3), 2)

## Sample standard deviation
standev_accuracy_DT3 = round(np.std(accuracy_list_DT3, ddof = 1), 2)
standev_sensitivity_DT3 = round(np.std(sensitivity_list_DT3, ddof = 1), 2)
standev_specificity_DT3 = round(np.std(specificity_list_DT3, ddof = 1), 2)

print(f"For {len(random_state_list)} iterations of Decision Tree Classifier")
print(f"Accuracy — mean: {mean_accuracy_DT3}%, standard deviation: {standev_accuracy_DT3}%")
print(f"Sensitivity — mean: {mean_sensitivity_DT3}%, standard deviation: {standev_sensitivity_DT3}%")
print(f"Specificity — mean: {mean_specificity_DT3}%, standard deviation: {standev_specificity_DT3}%")

In [None]:
## Assume equal variances (equal_var=True) since we are using the entire sample and train_test_split anyway
print("For Decision Trees 1 and 3")
t_value_accuracy, p_value_accuracy = stats.ttest_ind(accuracy_list_DT1, accuracy_list_DT3, equal_var=True)
print(f"p-Value for accuracy: {round(p_value_accuracy, 6)}")

t_value_sensitivity, p_value_sensitivity = stats.ttest_ind(sensitivity_list_DT1, sensitivity_list_DT3, equal_var=True)
print(f"p-Value for sensitivity: {round(p_value_sensitivity, 6)}")

t_value_specificity, p_value_specificity = stats.ttest_ind(specificity_list_DT1, specificity_list_DT3, equal_var=True)
print(f"p-Value for specificity: {round(p_value_specificity, 6)}")
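
The `equal_var=True` choice above selects the pooled-variance (Student) t-test, while `equal_var=False` would select Welch's t-test. As a rough illustration of what that choice changes, the sketch below computes both test statistics and their degrees of freedom by hand. The two accuracy lists are hypothetical placeholders, not results from this notebook, and the helper name `t_statistics` is ours. With equal sample sizes (five runs each, as here), the two t statistics coincide and only the degrees of freedom differ.

```python
import math
import statistics

def t_statistics(sample_a, sample_b):
    """Return (t, df) for the pooled-variance (Student) and Welch two-sample t-tests."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (ddof = 1)
    var_b = statistics.variance(sample_b)

    # Student: pool the two variances, df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    t_student = mean_diff / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    df_student = n_a + n_b - 2

    # Welch: keep the variances separate, df from the Welch-Satterthwaite formula
    se_a, se_b = var_a / n_a, var_b / n_b
    t_welch = mean_diff / math.sqrt(se_a + se_b)
    df_welch = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))

    return (t_student, df_student), (t_welch, df_welch)

# Hypothetical accuracy lists (percent), five runs each; NOT results from this notebook
runs_a = [63.1, 65.4, 61.2, 64.8, 62.9]
runs_b = [62.0, 66.1, 60.5, 63.7, 64.2]

student, welch = t_statistics(runs_a, runs_b)
print(f"Student: t = {student[0]:.4f}, df = {student[1]}")
print(f"Welch:   t = {welch[0]:.4f}, df = {welch[1]:.2f}")
```

The p-value is then read from the t distribution with the corresponding degrees of freedom, so a smaller Welch df gives a slightly more conservative p-value for the same t.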

## **DT Plotting**

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(360, 90))  # Resize for visibility
plot_tree(dec_tree,
          feature_names=list(X_train.columns),  # a plain list is accepted across scikit-learn versions
          class_names=["Low", "High"],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title("Decision Tree Visualization")
plt.show()


### **Post Decision Tree Implementation (DT1, DT2, and DT3)**

The DT2 implementation shows that even omitting the `MISSING` values yields the same metrics as DT1. Furthermore, removing `Cuisines 3-8` and focusing on `Cuisine 1-2` made no significant change to the accuracy.

Effectively, for this study, we can _ignore_ DT2 and focus on DT1 and DT3.

---


# **VII. Mixed Naive Bayes (MNB)**

## **Implementing MNB on the cleaned dataset (same as the cleaned dataset of DT3)**

Mixed Naive Bayes is an implementation of the Naive Bayes classifier where both categorical (Boolean or otherwise) and numerical data are taken together. We assume that the numerical data follow a Gaussian distribution.

A Naive Bayes classifier is a classifier that assumes the features are conditionally independent of each other given the class label.

For binary classification...

If the feature is categorical, then p(desired label | categorical feature) is the complement of p(undesired label | categorical feature). In shorthand, $$ p(Y = 1 | x_i) = 1 - p(Y = 0 | x_i) $$

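Numerical features have likelihood functions that follow the general Gaussian form $$ L(x_i | Y) = \frac{1}{\sigma_i\sqrt{2\pi}} e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}} $$ where $ \mu_i $ and $ \sigma_i $ are the mean and standard deviation of feature $ x_i $ within class $ Y $.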

So, for a Naive Bayes classifier, the proportional (unnormalized) probability of the desired label, given the features in question, is $$ p(Y = 1 | all \ features \ x) = p(Y = 1 | x_{cat_1}, \ldots, x_{cat_{n_1}}, x_{num_1}, \ldots, x_{num_{n_2}}) \propto p(Y = 1) \cdot \prod_{i=1}^{n_1} p(x_{cat_i} | Y = 1) \cdot \prod_{j=1}^{n_2} L(x_{num_j} | Y = 1) $$

We compute the proportional probability $ p(Y = 0 | all \ features \ x) $ analogously to the formula above.

Compare the proportional values of $ p(Y = 1 | all \ features \ x) $ and $ p(Y = 0 | all \ features \ x) $, and assign the label with the larger value to get the classification.
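
The classification rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this notebook: the function names (`fit_mixed_nb`, `unnormalized_posterior`, `predict`) and the toy rows are hypothetical, categorical probabilities are plain relative frequencies with no smoothing, and each numerical feature is assumed to have non-zero variance within each class.

```python
import math
from collections import Counter

def gaussian_likelihood(x, mu, sigma):
    """Gaussian density L(x | Y) with class-conditional mean mu and std sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_mixed_nb(rows, labels, cat_idx, num_idx):
    """Estimate the prior, categorical frequencies, and Gaussian parameters per class."""
    model = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        cat_probs = {}
        for i in cat_idx:
            counts = Counter(r[i] for r in members)
            cat_probs[i] = {value: c / len(members) for value, c in counts.items()}
        num_params = {}
        for j in num_idx:
            values = [r[j] for r in members]
            mu = sum(values) / len(values)
            var = sum((v - mu) ** 2 for v in values) / len(values)
            num_params[j] = (mu, math.sqrt(var))
        model[label] = {"prior": len(members) / len(labels),
                        "cat": cat_probs, "num": num_params}
    return model

def unnormalized_posterior(model, label, row):
    """p(Y) * prod p(x_cat | Y) * prod L(x_num | Y), i.e. the proportional probability."""
    params = model[label]
    score = params["prior"]
    for i, probs in params["cat"].items():
        score *= probs.get(row[i], 0.0)  # unseen category -> probability 0 (no smoothing)
    for j, (mu, sigma) in params["num"].items():
        score *= gaussian_likelihood(row[j], mu, sigma)
    return score

def predict(model, row):
    """Pick the label with the larger proportional probability."""
    return max(model, key=lambda label: unnormalized_posterior(model, label, row))

# Hypothetical toy data: (cuisine code, price for two); label 1 = high rating
rows = [("A", 300), ("A", 500), ("B", 400), ("B", 1200), ("A", 1500), ("B", 1000)]
labels = [0, 0, 0, 1, 1, 1]

model = fit_mixed_nb(rows, labels, cat_idx=[0], num_idx=[1])
print(predict(model, ("A", 450)))
print(predict(model, ("B", 1100)))
```

Because only the larger of the two proportional values matters, the normalizing constant never needs to be computed. A practical implementation would usually add smoothing for unseen categories and work with log-probabilities to avoid underflow.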

# **VIII. General Finding and Analysis**

## **Initial Data**

**Decision Trees (DT)**
The data was tested with the following specifications:

- **DT1** - Basic preprocessing transformed the data to numerical values, with all `Missing` entries set to 0 (not `NULL`).
- **DT2** - This was dropped because transforming the `MISSING` values to `-1` did not produce any difference from DT1, contrary to what was initially hypothesized.
- **DT3** - The dataset ignores / removes `Cuisines 3-8` because the majority of the restaurants only have `Cuisines 1-2`.
- Each dataset was trained with varying `random_states = [45, 65, 105, 225, 335]`.

<br> **True Values**

| Metric                 | DT 1             | DT 3             |
|------------------------|------------------|------------------|
| **Random State = 45**  |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |
| **Random State = 65**  |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |
| **Random State = 105** |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |
| **Random State = 225** |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |
| **Random State = 335** |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |


<br> **Confusion Matrix for DT**

| Metric                 | DT 1             | DT 3             |
|------------------------|------------------|------------------|
| **Random State = 45**  |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |
| **Random State = 65**  |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |
| **Random State = 105** |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |
| **Random State = 225** |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |
| **Random State = 335** |                  |                  |
| Accuracy               | a                | B                |
| Specificity            | a                | B                |
| Sensitivity            | a                | B                |

<br> **For 5 iterations of Decision Tree Classifier**

| DT 1 and 2   |   Mean    | Standard Deviation |
|--------------|-----------|--------------------|
| Accuracy     | 63.5%     | 1.83%              |
| Sensitivity  | 47.41%    | 3.27%              |
| Specificity  | 71.9%     | 1.77%              |


<br> **P-Values for DT Classifiers 1 and 3**

| Metric       |  p-Values for DT 1 and 3  |
|--------------|---------------------------|
| Accuracy     | 0.69297                   |
| Sensitivity  | 0.352116                  |
| Specificity  | 0.244064                  |

<br>

---



<br> **Mixed Naive Bayes (MNB)**

| Metric       | MNB 1            | MNB 2            |
|--------------|------------------|------------------|
| Means        |                  |                  |  
| Accuracy     | a                | B                |
| Specificity  | a                | B                |
| Sensitivity  | a                | B                |

<br> **Confusion Matrix for MNB**

| Metric       | MNB 1            | MNB 2            |
|--------------|------------------|------------------|
| Means        |                  |                  |  
| Accuracy     | a                | B                |
| Specificity  | a                | B                |
| Sensitivity  | a                | B                |


<br> **P-Values for MNB Classifiers 1 and 2**

| Metric       |  p-Values for MNB 1 and 2  |
|--------------|----------------------------|
| Accuracy     | a                          |
| Sensitivity  | a                          |
| Specificity  | a                          |


<br>

---

1) Mention the statistics of my previous work using Decision Trees and Naive Bayes Classifiers. Make note of why decision trees vary wildly in terms of behavior but the Naive Bayes Classifiers have a pattern of high specificity and lower accuracy.

2) Then, if we are up to it, analyze the Naive Bayes classifier, especially when it comes to misclassifications. We can analyze 10 false positives and 10 false negatives chosen at random and link them back to the low-sensitivity, high-specificity behavior. I know 10 sounds arbitrary, but I would like us to explore why Naive Bayes acts the way it does.

3) Only if time permits, do the same for the decision tree (of course, the better performing one)
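
The random sampling of misclassifications described in item 2 could be done along these lines. This is only a sketch: the helper name `sample_misclassified`, the seed, and the toy labels are hypothetical, and it assumes labels are encoded as 1 (high rating) and 0 (low rating).

```python
import random

def sample_misclassified(y_true, y_pred, k=10, seed=45):
    """Return up to k random positions each of false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    false_pos = [i for i, (t, p) in enumerate(pairs) if t == 0 and p == 1]
    false_neg = [i for i, (t, p) in enumerate(pairs) if t == 1 and p == 0]
    rng = random.Random(seed)  # fixed seed so the same cases are drawn on every run
    fp_sample = rng.sample(false_pos, min(k, len(false_pos)))
    fn_sample = rng.sample(false_neg, min(k, len(false_neg)))
    return sorted(fp_sample), sorted(fn_sample)

# Hypothetical toy labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

fp_idx, fn_idx = sample_misclassified(y_true, y_pred, k=2)
print("False positives at positions:", fp_idx)
print("False negatives at positions:", fn_idx)
```

The returned values are positions in the prediction array, so with a pandas test set they would be looked up with `.iloc` (for example `X_other3.iloc[fp_idx]`) to inspect the feature values of each misclassified restaurant.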

### **Exploratory Data Analysis (EDA)**
EDA presented...


### **Decision Tree Implementations (DT1, DT2, and DT3)**
DT1 and DT2 unfortunately ...

### **Mixed Naive Bayes Implementation (MNB1 & MNB2)**
MNB presented...

# **IX. Conclusion**

Based on the findings, it is evident that DT2 and MNB were able to accurately predict... as such, we think it would be more accurate to remove _"junk"_ data such as `MISSING, NULL, -12931931, 13412131` that can affect the model's predictions.

In the case of predicting `dining ratings`, a better model may exist, since **Decision Trees** and **Mixed Naive Bayes** only reach `x%` and `y%`. Judging from the dataset, it may even be viable to use **Ensemble Learning Models (random forest)**, or it may be necessary to heavily manipulate the data to accommodate `MISSING` values and so on...

# **X. References**