# Level 3
---
### Task: Predictive Modeling
- Build a regression model to predict the
aggregate rating of a restaurant based on
available features.

- Split the dataset into training and testing sets
and evaluate the model's performance using
appropriate metrics.

- Experiment with different algorithms (e.g.,
linear regression, decision trees, random
forest) and compare their performance.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
df = pd.read_csv('Dataset.csv')

In [None]:
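# Dropping identifier and free-text columns, plus "Rating color" and "Rating text",
# which encode the rating itself and would leak the target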
df_cleaned = df.drop(columns=[
    "Restaurant ID", "Restaurant Name", "Address", "Locality", "Locality Verbose",
    "Rating color", "Rating text", "Currency", "Switch to order menu"
])

In [None]:
# Dropping rows with a missing "Cuisines" value
df_cleaned = df_cleaned.dropna(subset=["Cuisines"])

In [None]:
# Binary Encoding for binary columns
df_cleaned["Has Table booking"] = df_cleaned["Has Table booking"].map({"Yes": 1, "No": 0})
df_cleaned["Has Online delivery"] = df_cleaned["Has Online delivery"].map({"Yes": 1, "No": 0})
df_cleaned["Is delivering now"] = df_cleaned["Is delivering now"].map({"Yes": 1, "No": 0})

# One-hot encoding categorical features (City, Cuisines)
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded_features = encoder.fit_transform(df_cleaned[["City", "Cuisines"]])
encoded_feature_names = encoder.get_feature_names_out(["City", "Cuisines"])

# Converting encoded features to a DataFrame with the same index as df_cleaned
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names, index=df_cleaned.index)

df_cleaned = pd.concat([df_cleaned, encoded_df], axis=1)

df_cleaned.drop(columns=["City", "Cuisines"], inplace=True)
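
The cell above fits the encoder on the full dataset before the train/test split. One possible variation (a sketch, not what this notebook does) is to wrap the encoder and the model in a `Pipeline`, so the encoder is fit on the training rows only and `handle_unknown="ignore"` covers categories first seen at test time. The `toy` frame and the names `preprocess` and `pipe` are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame standing in for the restaurant data (values are illustrative only)
toy = pd.DataFrame({
    "City": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "Cuisines": ["Thai", "Cafe", "Cafe", "Thai", "Pizza", "Pizza", "Cafe", "Thai"],
    "Votes": [10, 250, 40, 5, 300, 120, 60, 15],
    "Aggregate rating": [3.1, 4.2, 3.5, 2.9, 4.4, 3.9, 3.6, 3.0],
})

X_toy = toy.drop(columns=["Aggregate rating"])
y_toy = toy["Aggregate rating"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

# One-hot encode the categorical columns; pass the numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["City", "Cuisines"]),
    ],
    remainder="passthrough",
)

# The encoder is fit inside pipe.fit, so it only ever sees the training rows
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])

pipe.fit(X_tr, y_tr)
toy_pred = pipe.predict(X_te)
print(toy_pred)
```

With this layout, categories that appear only in the test rows are encoded as all zeros instead of raising an error.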

In [None]:
# Defining features (X) and target variable (Aggregate rating)
X = df_cleaned.drop(columns=["Aggregate rating"])
y = df_cleaned["Aggregate rating"]

In [None]:
# Splitting into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
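# Standardizing all features (including the one-hot columns);
# the scaler is fit on the training set only to avoid leaking test statistics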
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    return rmse, r2
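
As a quick check on what `evaluate_model` returns, the two metrics can be computed by hand and compared with scikit-learn. The arrays below are made-up numbers, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative true and predicted ratings (not taken from Dataset.csv)
y_true = np.array([3.0, 4.0, 2.5, 4.5])
y_hat = np.array([3.2, 3.8, 2.9, 4.1])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"manual : RMSE={rmse_manual:.4f}, R2={r2_manual:.4f}")
print(f"sklearn: RMSE={np.sqrt(mean_squared_error(y_true, y_hat)):.4f}, "
      f"R2={r2_score(y_true, y_hat):.4f}")
```

Because R² compares the model against always predicting the mean of `y_true`, it turns negative whenever the model's squared error exceeds that baseline.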

In [None]:
# Initializing models
models = {
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(n_estimators=100, random_state=42),
}

In [None]:
# Evaluating each model
results = {}
for name, model in models.items():
    rmse, r2 = evaluate_model(model, X_train_scaled, y_train, X_test_scaled, y_test)
    results[name] = {"RMSE": rmse, "R² Score": r2}
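
The loop above scores each model on a single 80/20 split. A hedged alternative, not used in this notebook, is k-fold cross-validation, which averages the score over several splits. The sketch below runs on synthetic data from `make_regression`; the names `candidates` and `cv_rmse` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the restaurant features
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

candidates = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in candidates.items():
    # scikit-learn returns negated RMSE so that "higher is better"; flip the sign back
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.2f}")
```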

In [12]:
for model_name, metrics in results.items():
    print(f"Model: {model_name}")
    print(f"RMSE: {metrics['RMSE']:.3f}")
    print(f"R² Score: {metrics['R² Score']:.3f}")
    print("-" * 30)

Model: Random Forest Regressor
RMSE: 0.301
R² Score: 0.960
------------------------------
Model: Decision Tree Regressor
RMSE: 0.405
R² Score: 0.928
------------------------------
Model: Gradient Boosting Regressor
RMSE: 0.314
R² Score: 0.957
------------------------------


---
#### Why Linear Regression is Not Useful Here:

Linear Regression assumes that the target is a **weighted sum of the features** (a straight-line relationship). Restaurant ratings, however, depend on complex, **non-linear interactions** among features such as city, cuisine type, and online delivery availability.  

#### Problems with Linear Regression:
- **Failed Completely**: It produced a **negative R² score** (worse than always predicting the mean rating) and an RMSE in the trillions.  
- **Can’t Handle Non-Linearity**: Ratings don’t follow a simple linear pattern.  
- **Sensitive to Feature Encoding**: One-hot encoding `City` and `Cuisines` created many new, mostly sparse columns, which made the least-squares fit unstable.  
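
The linear regression run itself is not shown in this notebook. The sketch below is one way such a comparison could be set up, on synthetic data with a high-cardinality one-hot column and a numeric column fully determined by it. `Ridge` is added here only as an illustration of a regularized linear model; it is not part of the original analysis, and every name and value in the block is an assumption. Whether plain least squares actually blows up depends on the data and the split, so no expected output is given.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_rows, n_levels = 400, 150

# Synthetic stand-in: one high-cardinality category plus a numeric column
# that is fully determined by it (a collinear pair of features)
category = rng.integers(0, n_levels, size=n_rows)
synthetic = pd.DataFrame({
    "category": category.astype(str),
    "group_code": category // 10,
    "votes": rng.poisson(50, size=n_rows),
})
target = 3.0 + 0.01 * synthetic["votes"] + rng.normal(0, 0.3, size=n_rows)

# One-hot encode, split, then scale with statistics from the training rows only
features = pd.get_dummies(synthetic, columns=["category"], drop_first=True, dtype=float)
X_tr, X_te, y_tr, y_te = train_test_split(features, target, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

comparison = {}
linear_models = {"LinearRegression": LinearRegression(), "Ridge(alpha=1.0)": Ridge(alpha=1.0)}
for label, model in linear_models.items():
    model.fit(X_tr_s, y_tr)
    pred = model.predict(X_te_s)
    comparison[label] = {
        "RMSE": float(np.sqrt(mean_squared_error(y_te, pred))),
        "R2": float(r2_score(y_te, pred)),
    }

for label, scores in comparison.items():
    print(f"{label}: RMSE={scores['RMSE']:.3f}, R2={scores['R2']:.3f}")
```

The L2 penalty in `Ridge` keeps coefficients bounded when columns are collinear or nearly empty, which is why it is a common first thing to try when an unregularized fit is unstable.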

#### Why These Models Are Better:
| Model                           | RMSE  | R² Score |
|---------------------------------|-------|----------|
| **Random Forest Regressor**     | 0.301 | 0.960    |
| **Decision Tree Regressor**     | 0.405 | 0.928    |
| **Gradient Boosting Regressor** | 0.314 | 0.957    |

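✅ **Random Forest & Gradient Boosting** perform best here (RMSE 0.301 and 0.314, R² 0.960 and 0.957), which suggests they capture the complex patterns in this structured data well.  
✅ **Decision Tree**, while not as strong (RMSE 0.405, R² 0.928), still handles non-linearity better than Linear Regression.  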