### Data Preparation
Here we import the required libraries and load our datasets. We use `pandas` for data manipulation, `numpy` for numerical operations, and `matplotlib` for plotting. Additionally, we use scikit-learn for encoding categorical variables and splitting the data, and TensorFlow with Keras for building the neural network model. After loading, we pre-process the data by selecting the relevant columns and encoding the user and business IDs.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import hashlib

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Flatten, Input, Dot, Concatenate, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

pd.set_option('display.max_columns', None)

In [2]:
review = pd.read_csv('review.csv')
business = pd.read_csv('business.csv')


In [3]:
review = review[['user_id','business_id','stars']]

In [4]:
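# Copy the splits so later column assignments don't raise SettingWithCopyWarning.
train_data, test_data = [df.copy() for df in train_test_split(review, test_size=0.2)]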

### Encoding User and Business IDs
The model's embedding layers expect integer indices, so we convert each user and business ID string to an integer with label encoding. The encoders are fit on the training set only. Test rows whose user or business never appears in training are dropped, because the encoder cannot transform unseen labels, and the remaining test rows are encoded with the same fitted encoders so that indices match across both sets.
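
As a minimal sketch of what label encoding does, the snippet below reproduces the mapping with NumPy on made-up IDs: `LabelEncoder` assigns codes by the sorted order of the unique values seen during fitting. The toy IDs and variable names are illustrative assumptions, not values from the dataset.

```python
import numpy as np

# Hypothetical IDs standing in for train_data['user_id'] and test_data['user_id'].
train_ids = np.array(["u_b", "u_a", "u_b", "u_c"])
test_ids = np.array(["u_a", "u_z", "u_c"])  # "u_z" never appears in training

# Fit: the sorted unique labels become the classes; codes index into that array.
classes, train_codes = np.unique(train_ids, return_inverse=True)

# Transform: keep only labels seen in training, then look up their codes.
seen = np.isin(test_ids, classes)
test_codes = np.searchsorted(classes, test_ids[seen])

print(classes.tolist())      # ['u_a', 'u_b', 'u_c']
print(train_codes.tolist())  # [1, 0, 1, 2]
print(test_codes.tolist())   # [0, 2]
```

The `seen` mask plays the same role as the `isin(...classes_)` filter applied to the test set in the cells that follow.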


In [5]:

user_encoder = LabelEncoder()
business_encoder = LabelEncoder()

train_data['user_id_encoded'] = user_encoder.fit_transform(train_data['user_id'])
train_data['business_id_encoded'] = business_encoder.fit_transform(train_data['business_id'])

In [6]:
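# Keep only test rows whose user and business were both seen during training;
# LabelEncoder.transform raises ValueError on unseen labels.
known = test_data['user_id'].isin(user_encoder.classes_) & test_data['business_id'].isin(business_encoder.classes_)
test_data = test_data[known].copy()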

In [7]:
test_data['user_id_encoded'] = user_encoder.transform(test_data['user_id'])
test_data['business_id_encoded'] = business_encoder.transform(test_data['business_id'])

In [10]:
num_users = len(user_encoder.classes_)
num_businesses = len(business_encoder.classes_)

print(f"Unique Users: {num_users}, Unique Businesses: {num_businesses}")

Unique Users: 75298, Unique Businesses: 1705


### Defining the Neural Network Model
Here we define the model architecture. Each user and business index is mapped to a 32-dimensional embedding that captures latent factors. The two embeddings are flattened, concatenated, and batch-normalized, then passed through a 128-unit ReLU layer with 40% dropout and a single linear output that predicts the star rating. The model is trained with a mean squared error loss and the Adam optimizer, and we track mean absolute error as an additional metric.
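
To make the data flow concrete, here is a rough NumPy sketch of the forward pass: embedding lookup, concatenation, one ReLU layer, and a linear output. It is not the Keras model itself. The sizes are toy values, the weights are random and untrained, and batch normalization and dropout are left out for brevity (dropout is inactive at prediction time anyway).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration; the notebook uses 32 and 128).
num_users, num_businesses, embedding_dim, hidden_units = 5, 3, 4, 8

# An embedding layer is a lookup table with one row per ID.
user_table = rng.normal(size=(num_users, embedding_dim))
business_table = rng.normal(size=(num_businesses, embedding_dim))

# Dense layer parameters (random here, learned in the real model).
w_hidden = rng.normal(size=(2 * embedding_dim, hidden_units))
b_hidden = np.zeros(hidden_units)
w_out = rng.normal(size=(hidden_units, 1))
b_out = np.zeros(1)

def forward(user_idx, business_idx):
    """Return one predicted rating per (user, business) index pair."""
    user_vec = user_table[user_idx]              # (batch, embedding_dim)
    business_vec = business_table[business_idx]  # (batch, embedding_dim)
    merged = np.concatenate([user_vec, business_vec], axis=1)
    hidden = np.maximum(merged @ w_hidden + b_hidden, 0.0)  # ReLU
    return hidden @ w_out + b_out                # linear output, (batch, 1)

preds = forward(np.array([0, 4]), np.array([2, 1]))
print(preds.shape)  # (2, 1)
```

Concatenating the two vectors, rather than taking their dot product, lets the dense layer learn arbitrary interactions between user and business factors.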


In [11]:
embedding_dim=32

user_input = Input(shape=(1,), name='user_input')
business_input = Input(shape=(1,), name='business_input')

user_embedding = Embedding(input_dim=num_users, output_dim=embedding_dim, embeddings_regularizer=l2(1e-6))(user_input)
business_embedding = Embedding(input_dim=num_businesses, output_dim=embedding_dim, embeddings_regularizer=l2(1e-6))(business_input)

user_flatten = Flatten()(user_embedding)
business_flatten = Flatten()(business_embedding)

merged = Concatenate()([user_flatten, business_flatten])
merged = BatchNormalization()(merged)

dense_layer = Dense(128, activation='relu')(merged)
dropout = Dropout(0.4)(dense_layer)
output_layer = Dense(1, activation='linear')(dropout)

model = Model(inputs=[user_input, business_input], outputs=output_layer)
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
model.summary()
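
The summary output is not reproduced here, but the parameter counts can be worked out by hand from the layer sizes. The sketch below assumes the ID counts printed above (75,298 users and 1,705 businesses) and default Keras layer settings, under which batch normalization keeps four values per feature, two of them non-trainable.

```python
num_users, num_businesses = 75298, 1705
embedding_dim, hidden_units = 32, 128
merged_dim = 2 * embedding_dim  # concatenated user + business embeddings

user_embedding = num_users * embedding_dim            # 2,409,536
business_embedding = num_businesses * embedding_dim   # 54,560
batch_norm = 4 * merged_dim                           # gamma, beta, moving mean, moving variance
dense = merged_dim * hidden_units + hidden_units      # weights + biases
output = hidden_units * 1 + 1                         # weights + bias

total = user_embedding + business_embedding + batch_norm + dense + output
non_trainable = 2 * merged_dim                        # moving mean and variance
print(total, total - non_trainable)  # 2472801 2472673
```

Nearly all of the parameters sit in the user embedding table, which helps explain why the model can fit the training ratings far more closely than the validation ratings.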

### Model Training
We fit the model with a batch size of 128 for up to 20 epochs, holding out 20% of the training data for validation (`validation_split=0.2`). Two callbacks monitor the validation loss: early stopping halts training once the loss has not improved for 5 consecutive epochs and restores the best weights, and a model checkpoint saves the best weights to disk.
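
One detail worth knowing about `validation_split` in `model.fit`: Keras takes the validation samples from the end of the arrays, before any shuffling. The helper below is a hypothetical stand-in that mimics that slicing on toy arrays. It is a sketch of the behaviour, not Keras code.

```python
import numpy as np

def split_like_validation_split(x, y, validation_split):
    """Hold out the last fraction of the samples, with no shuffling."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(10)       # toy feature array
y = np.arange(10) * 10  # toy targets

(train_x, train_y), (val_x, val_y) = split_like_validation_split(x, y, 0.2)
print(train_x.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_x.tolist())    # [8, 9]
```

Because `train_test_split` shuffles rows by default, the tail of `train_data` is already a random sample here. With data sorted by time or by user, the same setting would validate on a non-random slice.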


In [12]:
batch_size = 128
epochs = 20

user_ids = train_data['user_id_encoded'].values
business_ids = train_data['business_id_encoded'].values
stars = train_data['stars'].values


In [13]:
model_checkpoint = ModelCheckpoint('./model/model.weights.h5',
                             monitor='val_loss',   # Monitor validation loss
                             save_best_only=True,  # Save only the best model
                             save_weights_only=True,
                             mode='min'            # Mode of monitoring (minimize validation loss)
                            )

early_stopping = EarlyStopping(monitor='val_loss',
                               patience=5,
                               restore_best_weights=True
                              )

In [14]:
history = model.fit(
    [user_ids, business_ids],
    stars,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
    callbacks=[early_stopping, model_checkpoint]
)

Epoch 1/20
1222/1222 ━━━━━━━━━━━━━━━━━━━━ 20s 15ms/step - loss: 3.5553 - mae: 1.5008 - val_loss: 1.7296 - val_mae: 1.0813
Epoch 2/20
1222/1222 ━━━━━━━━━━━━━━━━━━━━ 21s 17ms/step - loss: 1.4982 - mae: 0.9843 - val_loss: 1.7730 - val_mae: 1.0836
Epoch 3/20
1222/1222 ━━━━━━━━━━━━━━━━━━━━ 24s 19ms/step - loss: 1.0635 - mae: 0.8124 - val_loss: 1.8116 - val_mae: 1.0967
Epoch 4/20
1222/1222 ━━━━━━━━━━━━━━━━━━━━ 21s 17ms/step - loss: 0.8521 - mae: 0.7193 - val_loss: 1.8284 - val_mae: 1.0879
Epoch 5/20
1222/1222 ━━━━━━━━━━━━━━━━━━━━ 23s 19ms/step - loss: 0.7334 - mae: 0.6627 - val_loss: 1.8340 - val_mae: 1.0910
Epoch 6/20
1222/1222 ━━━━━━━━━━━━━━━━━━━━ 23s 19ms/step - loss: 0.6563 - mae: 0.6251 - val_loss: 1.8366 - val_mae: 1.0868

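In the log above, validation loss is lowest after the first epoch and rises afterwards while training loss keeps falling, so training stops after epoch 6 of 20. The sketch below replays the logged `val_loss` values through a simplified version of the patience rule. The function is a hypothetical illustration, not the Keras `EarlyStopping` implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss values copied from the training log above.
logged_val_loss = [1.7296, 1.7730, 1.8116, 1.8284, 1.8340, 1.8366]
print(early_stopping_epoch(logged_val_loss, patience=5))  # (6, 1)
```

With `restore_best_weights=True`, the model that is evaluated afterwards carries the weights from the best epoch, which in this run is epoch 1.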

### Model Evaluation
Once training is complete, we evaluate the model on the test set. We predict ratings for the test pairs and compute the mean squared error against the actual ratings. Because test rows with users or businesses unseen in training were dropped earlier, this score covers only known IDs.


In [15]:
test_user_ids = test_data['user_id_encoded'].values
test_business_ids = test_data['business_id_encoded'].values
test_stars = test_data['stars'].values



In [16]:
predictions = model.predict([test_user_ids, test_business_ids])

1138/1138 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step


In [17]:
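# scikit-learn's argument order is (y_true, y_pred); flatten the (n, 1) predictions to match.
mean_squared_error(test_stars, predictions.ravel())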

1.5855795689528172
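
For readers who want to compute the error metrics by hand, here is a small NumPy sketch on made-up ratings. The arrays are illustrative assumptions. The point is the shape handling: `model.predict` returns an array of shape `(n, 1)`, and subtracting a shape `(n,)` array from it without flattening would broadcast to `(n, n)`.

```python
import numpy as np

y_true = np.array([5.0, 3.0, 4.0, 1.0])          # shape (4,), like test_stars
y_pred = np.array([[4.5], [3.5], [4.0], [2.0]])  # shape (4, 1), like model.predict output

# Pitfall: without flattening, broadcasting yields a 4x4 matrix of differences.
print((y_pred - y_true).shape)  # (4, 4)

errors = y_pred.ravel() - y_true
mse = float(np.mean(errors ** 2))
rmse = float(np.sqrt(mse))           # same units as the ratings (stars)
mae = float(np.mean(np.abs(errors)))
print(mse, mae)  # 0.375 0.5
```

RMSE is often easier to interpret than MSE because it is expressed in stars. Applied to the test MSE above, it comes to roughly 1.26 stars.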

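### Retraining on the Full Dataset
With the held-out evaluation done, we refit the encoders on every review and train a fresh model of the same architecture on the full dataset. The user count rises from 75,298 to 87,090 because users who appeared only in the test split are now included.

In [20]: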
%%time
user_encoder = LabelEncoder()
business_encoder = LabelEncoder()

review['user_id_encoded'] = user_encoder.fit_transform(review['user_id'])
review['business_id_encoded'] = business_encoder.fit_transform(review['business_id'])

CPU times: user 218 ms, sys: 18.3 ms, total: 236 ms
Wall time: 241 ms


In [21]:
num_users = len(user_encoder.classes_)
num_businesses = len(business_encoder.classes_)

print(f"Unique Users: {num_users}, Unique Businesses: {num_businesses}")

Unique Users: 87090, Unique Businesses: 1705


In [22]:
embedding_dim=32

user_input = Input(shape=(1,), name='user_input')
business_input = Input(shape=(1,), name='business_input')

user_embedding = Embedding(input_dim=num_users, output_dim=embedding_dim, embeddings_regularizer=l2(1e-6))(user_input)
business_embedding = Embedding(input_dim=num_businesses, output_dim=embedding_dim, embeddings_regularizer=l2(1e-6))(business_input)

user_flatten = Flatten()(user_embedding)
business_flatten = Flatten()(business_embedding)

merged = Concatenate()([user_flatten, business_flatten])
merged = BatchNormalization()(merged)

dense_layer = Dense(128, activation='relu')(merged)
dropout = Dropout(0.4)(dense_layer)
output_layer = Dense(1, activation='linear')(dropout)

model = Model(inputs=[user_input, business_input], outputs=output_layer)
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
model.summary()

In [23]:
batch_size = 128
epochs = 10

user_ids = review['user_id_encoded'].values
business_ids = review['business_id_encoded'].values
stars = review['stars'].values


### Saving Encoders and Model
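During this final fit, the checkpoint callback writes the best weights (lowest validation loss) to `./model/model.weights.h5`. We then pickle the label encoders so that raw IDs can be mapped to the same integer indices at prediction time. Because only the weights are saved, reloading requires rebuilding the same architecture before calling `model.load_weights`.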


In [24]:
model_checkpoint = ModelCheckpoint('./model/model.weights.h5',
                             monitor='val_loss',   # Monitor validation loss
                             save_best_only=True,  # Save only the best model
                             save_weights_only=True,
                             mode='min'            # Mode of monitoring (minimize validation loss)
                            )

early_stopping = EarlyStopping(monitor='val_loss',
                               patience=1,
                               restore_best_weights=True
                              )

In [25]:
history = model.fit(
    [user_ids, business_ids],
    stars,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
    callbacks=[early_stopping, model_checkpoint]
)

Epoch 1/10
1528/1528 ━━━━━━━━━━━━━━━━━━━━ 32s 20ms/step - loss: 3.3225 - mae: 1.4495 - val_loss: 2.6059 - val_mae: 1.3838
Epoch 2/10
1528/1528 ━━━━━━━━━━━━━━━━━━━━ 31s 20ms/step - loss: 1.4873 - mae: 0.9781 - val_loss: 2.4224 - val_mae: 1.3149
Epoch 3/10
1528/1528 ━━━━━━━━━━━━━━━━━━━━ 34s 22ms/step - loss: 1.0785 - mae: 0.8163 - val_loss: 2.3259 - val_mae: 1.2788
Epoch 4/10
1528/1528 ━━━━━━━━━━━━━━━━━━━━ 30s 20ms/step - loss: 0.8733 - mae: 0.7263 - val_loss: 2.2459 - val_mae: 1.2485
Epoch 5/10
1528/1528 ━━━━━━━━━━━━━━━━━━━━ 30s 20ms/step - loss: 0.7631 - mae: 0.6715 - val_loss: 2.2721 - val_mae: 1.2565


In [26]:
import pickle

with open('./model/user_encoder.pickle', 'wb') as f:
    pickle.dump(user_encoder, f)
    
with open('./model/business_encoder.pickle', 'wb') as f:
    pickle.dump(business_encoder, f)
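
To use the saved encoders later, they have to be loaded back and applied to raw IDs, including IDs that were not present at fit time. The sketch below shows one possible pattern with the standard library only. The helper names are hypothetical, and a plain list stands in for `user_encoder.classes_` so the example runs without the trained artifacts.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    # Only unpickle files you created yourself: pickle can execute arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)

def build_lookup(classes):
    """Map each raw ID to its integer code, in the order given by the encoder's classes."""
    return {label: code for code, label in enumerate(classes)}

# Stand-in for user_encoder.classes_ (hypothetical IDs).
classes = ["u_a", "u_b", "u_c"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "user_classes.pickle")
    save_pickle(classes, path)
    restored = load_pickle(path)

lookup = build_lookup(restored)
print(lookup.get("u_b"))  # 1
print(lookup.get("u_z"))  # None, an ID unseen during fitting
```

A dictionary lookup returns `None` for an unknown ID, whereas `LabelEncoder.transform` raises a `ValueError`. That makes it easier to fall back to a default prediction for new users or businesses.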