<h2 style='text-align: center;'> Data Science Technology and Systems </h2>
<h3 style='text-align: center;'> Assignment 1: Predictive Modelling of Eating-Out Problem </h3>
<h3 style='text-align: center;'> Part B - Predictive Modelling </h3>
<h4 style='text-align: center;'> Pauline Armamento - u3246782 </h4>

## Introduction

This project uses a real-world dataset of more than 10,000 Sydney restaurants from 2018 to predict restaurant ratings. It covers Exploratory Data Analysis (EDA), feature engineering, the development of regression and classification models, and practical deployment.

### Libraries used to retrieve, explore, process and present the data

In [2]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import ast
import geopandas as gpd
from shapely.geometry import Point
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

%matplotlib inline

os.getcwd()

'C:\\Users\\pauar\\Desktop\\UC\\DSTS\\DSTS Assignment'

### Feature Engineering

In [3]:
# Load the data
df = pd.read_csv("data/zomato_df_final_data.csv")

# Display initial information about the DataFrame
print(df.info())

# 1. Handle Missing Values
# Check for missing values
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

# Option 1: Drop every row that has a missing value in any column
# (this also removes restaurants without a rating)
df_cleaned = df.dropna()

# Option 2: Impute missing values (example: fill with mean for numeric columns)
# df['cost'] = df['cost'].fillna(df['cost'].mean())
# df['rating_number'] = df['rating_number'].fillna(df['rating_number'].mean())

# 2. Remove duplicates (copy so later column assignments act on df_cleaned itself)
df_cleaned = df_cleaned.drop_duplicates().copy()

# 3. Convert data types if necessary
# 'cost' is already float64 in df.info(); this is a safeguard in case it is read as object
df_cleaned['cost'] = pd.to_numeric(df_cleaned['cost'], errors='coerce')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10500 entries, 0 to 10499
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   address        10500 non-null  object 
 1   cost           10154 non-null  float64
 2   cuisine        10500 non-null  object 
 3   lat            10308 non-null  float64
 4   link           10500 non-null  object 
 5   lng            10308 non-null  float64
 6   phone          10500 non-null  object 
 7   rating_number  7184 non-null   float64
 8   rating_text    7184 non-null   object 
 9   subzone        10500 non-null  object 
 10  title          10500 non-null  object 
 11  type           10452 non-null  object 
 12  votes          7184 non-null   float64
 13  groupon        10500 non-null  bool   
 14  color          10500 non-null  object 
 15  cost_2         10154 non-null  float64
 16  cuisine_color  10500 non-null  object 
dtypes: bool(1), float64(6), object(10)
memory usage: 1
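
Dropping every row with any missing value discards a large share of the data. As an alternative, the sketch below keeps only rows where the target is known and imputes the remaining gaps. It runs on a small made-up frame (`toy`), and the choice of median imputation and the `"Unknown"` placeholder are assumptions for illustration, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking a few columns of the restaurant dataset
toy = pd.DataFrame({
    "cost": [50.0, np.nan, 120.0, 80.0],
    "rating_number": [4.0, 3.5, np.nan, 4.2],
    "type": ["['Pub']", None, "['Cafe']", "['Cafe']"],
})

# Keep only rows where the target (rating_number) is present
toy_clean = toy.dropna(subset=["rating_number"]).copy()

# Impute the remaining gaps: median for numeric cost, a placeholder for type
toy_clean["cost"] = toy_clean["cost"].fillna(toy_clean["cost"].median())
toy_clean["type"] = toy_clean["type"].fillna("Unknown")

print(toy_clean)
```

In a modelling workflow, imputation statistics such as the median would normally be computed on the training split only, so that no information from the test set leaks into training.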

In [4]:


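# Select features for encoding
categorical_features = ['cuisine', 'subzone', 'title', 'type', 'groupon']

# 1. One-hot encoding for nominal features
# Note: columns with many distinct values (for example 'title') make this step
# create thousands of dummy columns
df_encoded = pd.get_dummies(df_cleaned, columns=categorical_features, drop_first=True)

# 2. Label encoding for 'rating_text'
# Note: LabelEncoder assigns integer codes in alphabetical order of the labels,
# so the codes do not follow the rating order even if 'rating_text' is ordinal
label_encoder = LabelEncoder()
df_encoded['rating_text'] = label_encoder.fit_transform(df_encoded['rating_text'])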

# Display the cleaned and encoded DataFrame
print(df_encoded.head())


                                             address   cost        lat  \
0                      371A Pitt Street, CBD, Sydney   50.0 -33.876059   
1      Shop 7A, 2 Huntley Street, Alexandria, Sydney   80.0 -33.910999   
2   Level G, The Darling at the Star, 80 Pyrmont ...  120.0 -33.867971   
3   Sydney Opera House, Bennelong Point, Circular...  270.0 -33.856784   
4              20 Campbell Street, Chinatown, Sydney   55.0 -33.879035   

                                                link         lng  \
0    https://www.zomato.com/sydney/sydney-madang-cbd  151.207605   
1  https://www.zomato.com/sydney/the-grounds-of-a...  151.193793   
2        https://www.zomato.com/sydney/sokyo-pyrmont  151.195210   
3  https://www.zomato.com/sydney/bennelong-restau...  151.215297   
4  https://www.zomato.com/sydney/chat-thai-chinatown  151.206409   

          phone  rating_number  rating_text   votes    color  ...  \
0  02 8318 0406            4.0            4  1311.0  #e15307  ...   
1  02 96
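
Because `LabelEncoder` sorts labels alphabetically, its codes do not reflect the ranking of the rating categories. A minimal sketch of an explicit ordinal mapping follows. The label names in `rating_order` are assumed for illustration and should be checked against the actual values in `rating_text`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed category labels, listed from lowest to highest rating
rating_order = ["Poor", "Average", "Good", "Very Good", "Excellent"]

# Hypothetical sample of rating_text values
toy_text = pd.Series(["Good", "Excellent", "Poor", "Very Good", "Average"])

# LabelEncoder: codes follow alphabetical order of the labels
alphabetical_codes = LabelEncoder().fit_transform(toy_text)

# Ordered categorical: codes follow the order given in rating_order
ordinal_codes = pd.Categorical(toy_text, categories=rating_order, ordered=True).codes

print(list(alphabetical_codes))
print(list(ordinal_codes))
```

With the ordered categorical, a higher code always means a better rating category, which is the property a linear model needs to use the feature sensibly.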

### Regression

### Model 1

In [22]:
# Define features (X) and target variable (y)
X = df_encoded.drop(columns=['rating_number'])  # All columns except the target
y = df_encoded['rating_number']                  # The target variable

In [23]:
# Check data types of columns in the features DataFrame
print(X.dtypes)

address                                object
cost                                  float64
lat                                   float64
link                                   object
lng                                   float64
                                       ...   
type_['Pub', 'Wine Bar']                uint8
type_['Pub']                            uint8
type_['Wine Bar', 'Casual Dining']      uint8
type_['Wine Bar']                       uint8
groupon_True                            uint8
Length: 8836, dtype: object


In [24]:
# Drop irrelevant columns that can't be used for prediction
X = X.drop(columns=['address', 'link', 'phone', 'color', 'cuisine_color']) 


In [25]:
print(X.dtypes)

cost                                  float64
lat                                   float64
lng                                   float64
rating_text                             int32
votes                                 float64
                                       ...   
type_['Pub', 'Wine Bar']                uint8
type_['Pub']                            uint8
type_['Wine Bar', 'Casual Dining']      uint8
type_['Wine Bar']                       uint8
groupon_True                            uint8
Length: 8831, dtype: object


In [26]:
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Build the Linear Regression Model
model_regression_1 = LinearRegression()
model_regression_1.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model_regression_1.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

# Display the coefficients
coefficients = pd.DataFrame(model_regression_1.coef_, X.columns, columns=['Coefficient'])
print(coefficients)


Mean Squared Error: 0.11
R-squared: 0.47
                                    Coefficient
cost                                   0.000850
lat                                    0.001169
lng                                    0.000434
rating_text                            0.158797
votes                                  0.000765
...                                         ...
type_['Pub', 'Wine Bar']               0.131269
type_['Pub']                          -0.078813
type_['Wine Bar', 'Casual Dining']     0.081050
type_['Wine Bar']                      0.142112
groupon_True                          -0.063953

[8831 rows x 1 columns]
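
The MSE is expressed in squared rating units, so its square root (RMSE) is easier to read on the rating scale. A short worked calculation using the MSE printed above:

```python
import math

mse_reported = 0.11  # Mean Squared Error printed for Model 1
rmse = math.sqrt(mse_reported)

print(f"RMSE: {rmse:.2f}")  # prints "RMSE: 0.33"
```

An RMSE of about 0.33 means the typical prediction error is roughly a third of a rating point.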


### Model 2


In [30]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_regression_2 = SGDRegressor(loss="squared_error", max_iter=1000, random_state=0)
model_regression_2.fit(X_train_scaled, y_train)

y_pred_model2 = model_regression_2.predict(X_test_scaled)

mse_model2 = mean_squared_error(y_test, y_pred_model2)
r2_model2 = r2_score(y_test, y_pred_model2)

print(f"Mean Squared Error: {mse_model2}")
print(f"R-squared: {r2_model2}")

Mean Squared Error: 5.099746192133137e+18
R-squared: -2.440971479996034e+19


### Classification