# Task: Cuisine Classification
Objective: Develop a machine learning model to
classify restaurants based on their cuisines.
Steps:
Preprocess the dataset by handling missing values
and encoding categorical variables.
Split the data into training and testing sets.
Select a classification algorithm (e.g., logistic
regression, random forest) and train it on the
training data.
Evaluate the model's performance using
appropriate classification metrics (e.g., accuracy,
precision, recall) on the testing data.
Analyze the model's performance across different
cuisines and identify any challenges or biases.

In [410]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

In [411]:
import warnings
warnings.filterwarnings("ignore")

In [412]:
df=pd.read_csv("C:\\Users\\Mpatt\\OneDrive\\Desktop\\AssignmentDATASet\\Dataset.csv")
print(df)

      Restaurant ID           Restaurant Name  Country Code              City  \
0           6317637          Le Petit Souffle           162       Makati City   
1           6304287          Izakaya Kikufuji           162       Makati City   
2           6300002    Heat - Edsa Shangri-La           162  Mandaluyong City   
3           6318506                      Ooma           162  Mandaluyong City   
4           6314302               Sambo Kojin           162  Mandaluyong City   
...             ...                       ...           ...               ...   
9546        5915730               Naml۱ Gurme           208         ��stanbul   
9547        5908749              Ceviz A��ac۱           208         ��stanbul   
9548        5915807                     Huqqa           208         ��stanbul   
9549        5916112               A���k Kahve           208         ��stanbul   
9550        5927402  Walter's Coffee Roastery           208         ��stanbul   

                           

In [413]:
df.isnull().sum()

Restaurant ID           0
Restaurant Name         0
Country Code            0
City                    0
Address                 0
Locality                0
Locality Verbose        0
Longitude               0
Latitude                0
Cuisines                9
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Switch to order menu    0
Price range             0
Aggregate rating        0
Rating color            0
Rating text             0
Votes                   0
dtype: int64

In [414]:
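# Handle missing values: forward-fill the 9 missing 'Cuisines' entries
df.ffill(inplace=True)
# Strip the Unicode replacement character left by damaged text in the source file
df = df.replace("�", "", regex=True)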
print(df)

      Restaurant ID           Restaurant Name  Country Code              City  \
0           6317637          Le Petit Souffle           162       Makati City   
1           6304287          Izakaya Kikufuji           162       Makati City   
2           6300002    Heat - Edsa Shangri-La           162  Mandaluyong City   
3           6318506                      Ooma           162  Mandaluyong City   
4           6314302               Sambo Kojin           162  Mandaluyong City   
...             ...                       ...           ...               ...   
9546        5915730               Naml۱ Gurme           208           stanbul   
9547        5908749                Ceviz Aac۱           208           stanbul   
9548        5915807                     Huqqa           208           stanbul   
9549        5916112                  Ak Kahve           208           stanbul   
9550        5927402  Walter's Coffee Roastery           208           stanbul   
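
Forward-filling copies the previous row's value into each gap, so for `Cuisines` (the column later used as the target) it assigns a neighbouring restaurant's label to the nine incomplete rows. A minimal sketch of one alternative, assuming it is acceptable to drop those rows instead. The toy frame and its values are hypothetical and only stand in for the real data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the restaurant data.
toy = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C", "D"],
    "Cuisines": ["Japanese", None, "Cafe", None],
})

# Forward fill copies the label of the previous row into each gap.
filled = toy.ffill()

# Dropping rows keeps only restaurants whose cuisine is actually known.
dropped = toy.dropna(subset=["Cuisines"]).reset_index(drop=True)

print(filled["Cuisines"].tolist())
print(len(dropped))
```

Dropping loses a few rows but avoids training on labels that were never observed; which trade-off is better depends on the data.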

                           

In [415]:
# Data preprocessing: drop the columns that are not used for modelling
df = df.drop(
    ['Restaurant ID', 'Country Code', 'City', 'Address', 'Locality',
     'Locality Verbose', 'Longitude', 'Latitude', 'Currency',
     'Has Table booking', 'Has Online delivery', 'Is delivering now',
     'Switch to order menu', 'Price range', 'Aggregate rating',
     'Rating color', 'Votes'],
    axis=1,
)
print(df.columns)

Index(['Restaurant Name', 'Cuisines', 'Average Cost for two', 'Rating text'], dtype='object')


In [416]:
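df.shape
# Without parentheses, df.info evaluates to the bound method (its repr is shown below);
# calling df.info() would print the column dtypes and non-null counts instead.
df.info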

<bound method DataFrame.info of                Restaurant Name                          Cuisines  \
0             Le Petit Souffle        French, Japanese, Desserts   
1             Izakaya Kikufuji                          Japanese   
2       Heat - Edsa Shangri-La  Seafood, Asian, Filipino, Indian   
3                         Ooma                   Japanese, Sushi   
4                  Sambo Kojin                  Japanese, Korean   
...                        ...                               ...   
9546               Naml۱ Gurme                           Turkish   
9547                Ceviz Aac۱   World Cuisine, Patisserie, Cafe   
9548                     Huqqa            Italian, World Cuisine   
9549                  Ak Kahve                   Restaurant Cafe   
9550  Walter's Coffee Roastery                              Cafe   

      Average Cost for two Rating text  
0                     1100   Excellent  
1                     1200   Excellent  
2                     4000  

In [417]:
df.describe()

       Average Cost for two
count                9551.0
mean            1199.210763
std            16121.183073
min                     0.0
25%                   250.0
50%                   400.0
75%                   700.0
max                800000.0


In [418]:
df.isnull().sum()

Restaurant Name         0
Cuisines                0
Average Cost for two    0
Rating text             0
dtype: int64

In [419]:
# Check missing values for each column
missing_values = df.isna().sum()
# Per-column counts, kept in one dict instead of overwriting a single variable
columns_to_check = ['Restaurant Name', 'Cuisines', 'Rating text', 'Average Cost for two']
missing_by_column = {col: df[col].isna().sum() for col in columns_to_check}

In [420]:
# Drop rows with a missing value in any of the four remaining columns
df_cleaned = df.dropna(
    subset=['Restaurant Name', 'Cuisines', 'Rating text', 'Average Cost for two']
)

In [421]:
df.describe(include="all")

        Restaurant Name      Cuisines  Average Cost for two  Rating text
count              9551          9551                9551.0         9551
unique             7446          1825                   NaN            6
top     Cafe Coffee Day  North Indian                   NaN      Average
freq                 83           936                   NaN         3737
mean                NaN           NaN           1199.210763          NaN
std                 NaN           NaN          16121.183073          NaN
min                 NaN           NaN                   0.0          NaN
25%                 NaN           NaN                 250.0          NaN
50%                 NaN           NaN                 400.0          NaN
75%                 NaN           NaN                 700.0          NaN


In [422]:
# Convert categorical data to numerical codes
from sklearn.preprocessing import LabelEncoder

# Use one encoder per column so each mapping can be inverted later
name_encoder = LabelEncoder()
cuisine_encoder = LabelEncoder()
df['Restaurant Name'] = name_encoder.fit_transform(df['Restaurant Name'])
df['Cuisines'] = cuisine_encoder.fit_transform(df['Cuisines'])
df

      Restaurant Name  Cuisines  Average Cost for two Rating text
0                3748       920                  1100   Excellent
1                3172      1111                  1200   Excellent
2                2897      1671                  4000   Very Good
3                4707      1126                  1500   Excellent
4                5523      1122                  1500   Excellent
...               ...       ...                   ...         ...
9546             4443      1813                    80   Very Good
9547             1311      1824                   105   Very Good
9548             3069      1110                   170        Good
9549              210      1657                   120   Very Good
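
Encoding the full `Cuisines` string turns every distinct combination into its own class, which is why the target has 1825 labels. A minimal sketch of one way to shrink the label space, assuming the first cuisine listed can be treated as the restaurant's primary cuisine (an assumption, not something the dataset states). The sample strings are taken from the rows printed earlier, and it would have to run on the raw strings before label encoding:

```python
import pandas as pd

# Sample of raw (not yet encoded) 'Cuisines' strings.
cuisines = pd.Series([
    "French, Japanese, Desserts",
    "Japanese",
    "Japanese, Sushi",
    "Italian, World Cuisine",
])

# Keep only the first listed cuisine as the class label.
primary = cuisines.str.split(",").str[0].str.strip()

print(cuisines.nunique(), "->", primary.nunique())
```

Fewer, larger classes give each label more training examples, at the cost of discarding the secondary cuisines.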


In [423]:
rating_mapping = {
    "Excellent": 5,
    "Very Good": 4,
    "Good": 3,
    "Average": 2,
    "Poor": 1,
    "Not rated": 0,
}
df["Rating Numerical"] = df["Rating text"].map(rating_mapping)

print(df)


      Restaurant Name  Cuisines  Average Cost for two Rating text  \
0                3748       920                  1100   Excellent   
1                3172      1111                  1200   Excellent   
2                2897      1671                  4000   Very Good   
3                4707      1126                  1500   Excellent   
4                5523      1122                  1500   Excellent   
...               ...       ...                   ...         ...   
9546             4443      1813                    80   Very Good   
9547             1311      1824                   105   Very Good   
9548             3069      1110                   170        Good   
9549              210      1657                   120   Very Good   
9550             7240       331                    55   Very Good   

      Rating Numerical  
0                    5  
1                    5  
2                    4  
3                    5  
4                    5  
...                ..

In [424]:
df.isnull().sum()

Restaurant Name         0
Cuisines                0
Average Cost for two    0
Rating text             0
Rating Numerical        0
dtype: int64

# Building Model

In [425]:
'''x= df[['Restaurant Name', 'Rating Numerical']] 
y = df['Cuisines']

from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
scaler.fit(x)
x= scaler.transform(x)

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=30,random_state=10)'''
                                                    


In [426]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# 'Restaurant Name' already holds integer codes from the earlier encoding step,
# so re-encoding it here leaves the values unchanged
df['Restaurant Name Encoded'] = LabelEncoder().fit_transform(df['Restaurant Name'])

# Target variable: the encoded 'Cuisines' labels
y = LabelEncoder().fit_transform(df['Cuisines'])

# Define features (numeric columns only)
x = df[['Restaurant Name Encoded', 'Rating Numerical']]

# Standardize numerical features
scaler = StandardScaler()
x = scaler.fit_transform(x)

# Split dataset into training and testing sets (30% test size)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)
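
The cell above fits the scaler on the full dataset before splitting, so statistics from the test rows influence the training features. A minimal sketch of a split that avoids this by putting the scaler inside a scikit-learn pipeline and stratifying on the labels. The data is synthetic and the names (`X_demo`, `y_demo`, `pipe`) are hypothetical. Note that `stratify` needs at least two samples per class, so on the real data rare cuisine labels would first have to be merged or removed:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 2 numeric features, 4 balanced classes.
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(200, 2))
y_demo = np.repeat(np.arange(4), 50)

# stratify keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, stratify=y_demo
)

# The pipeline fits the scaler on the training split only.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=10))
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```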


# Random Forest

In [427]:
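# Train the model
# RandomForestRegressor treats the encoded cuisine labels as continuous values,
# so the predictions and scores below are regression outputs, not class labels;
# RandomForestClassifier is the classification counterpart.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=42)
model.fit(x_train, y_train)
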
# Predict
y_pred = model.predict(x_test)
print(y_pred)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R² Score: {r2}")


[1227.57666667  943.98        899.07       ... 1034.37        686.78
  938.23      ]
Mean Squared Error: 243199.4296328044
R² Score: 0.06664638587921401


# DecisionTreeClassifier

In [428]:
# predicting using Decision Tree Classifier.
from sklearn.tree import DecisionTreeClassifier

model_DT = DecisionTreeClassifier(random_state=10,
                                   criterion="gini")

# fit the model on data and predict the values
model_DT.fit(x_train,y_train)      # fit is the function that is used for training the data
y_pred = model_DT.predict(x_test) # Validation Data
#print(Y_pred)
print(y_pred)

[1287 1015  630 ... 1031  177  891]


In [429]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
# cfm = confusion_matrix(y_test, y_pred)
# print(cfm)
#
# print("modelDT report: ")
# print(classification_report(y_test, y_pred))
 
acc=accuracy_score(y_test,y_pred)
print("Accuracy of the model: ",acc)

Accuracy of the model:  0.19050942079553385


# Logistic Regression

In [430]:
from sklearn.linear_model import LogisticRegression
#create a model object
classifier = LogisticRegression(multi_class="multinomial")
#train the model object
classifier.fit(x_train,y_train)     
y_pred = classifier.predict(x_test)
print(y_pred)

[1306 1031 1306 ...  331 1306  331]


In [431]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
cfm=confusion_matrix(y_test,y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(y_test,y_pred))
 
acc=accuracy_score(y_test,y_pred)
print("Accuracy of the model: ",acc)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Classification report: 
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           4       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         5
           7       0.00      0.00      0.00         1
           8       0.00      0.00      0.00         1
           9       0.00      0.00      0.00         1
          11       0.00      0.00      0.00         2
          12       0.00      0.00      0.00         1
          13       0.00      0.00      0.00         1
          15       0.00      0.00      0.00         1
          18       0.00      0.00      0.00         2
          21       0.00      0.00      0.00         1
          22       0.00      0.00      0.00         1
          27       0.00      0.00      0.00         1
          29       0.00      0.00      0.00  
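
With this many classes the printed report is hard to read. A minimal sketch of loading the per-class metrics into a DataFrame and sorting them, so the weakest cuisines appear first. The labels are hypothetical, and `zero_division=0` only silences the warning for classes that are never predicted:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true and predicted cuisine labels.
y_true = ["Cafe", "Cafe", "Cafe", "Japanese", "Japanese", "Turkish"]
y_hat = ["Cafe", "Cafe", "Japanese", "Japanese", "Cafe", "Cafe"]

# output_dict=True returns the report as nested dicts instead of text.
report = classification_report(y_true, y_hat, output_dict=True, zero_division=0)

# Keep only the per-class rows and sort by recall, worst first.
classes = sorted(set(y_true))
per_class = pd.DataFrame({c: report[c] for c in classes}).T.sort_values("recall")

print(per_class[["precision", "recall", "support"]])
```

Sorting by `support` instead shows how many cuisines are represented by only a handful of test rows.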

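Comparing the two classifiers, DecisionTreeClassifier performs better than Logistic Regression, although its test accuracy of about 0.19 is still low. The Logistic Regression classification report shows precision and recall of 0.00 for the listed cuisine classes, which have only one to five test samples each. This suggests that the large number of rare cuisine combinations (1825 distinct labels) is a main challenge.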
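
To put an accuracy figure in context, it helps to compare it with a trivial baseline that always predicts the most frequent class. In the summary statistics above, the most common label ("North Indian") covers 936 of 9551 rows, so such a baseline would score roughly 0.098 on the full dataset. A minimal sketch using scikit-learn's `DummyClassifier`, on hypothetical imbalanced labels rather than the notebook's split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: class 0 dominates.
y_train_demo = np.array([0] * 70 + [1] * 20 + [2] * 10)
y_test_demo = np.array([0] * 30 + [1] * 12 + [2] * 8)

# The baseline ignores the features, so placeholder columns are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(baseline_acc)
```

A model is only useful to the extent that it beats this number on the same test split.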