## Predicting a Yelp User's Average Rating of Italian Restaurants
This model uses a two-step process to predict a Yelp user's average rating of Italian restaurants.

The first step is to cluster Yelp restaurants based on their categories. Since we are predicting Italian restaurant ratings, Italian restaurants are excluded from the clustering step. Clustering is performed with DBSCAN using the L1 (Manhattan) metric and an epsilon of 1 (i.e., two restaurants whose category lists differ by at most one category are considered to be in the same "neighborhood").
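As a rough illustration of how the L1 metric and epsilon interact, the sketch below clusters four made-up restaurants. The category names and rows are hypothetical, and `min_samples=2` is an assumption chosen only so that such a tiny example can form a cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical one-hot rows; columns are Pizza, Sushi, Bars, Burgers
toy = np.array([
    [1, 0, 0, 0],  # A: Pizza
    [1, 0, 1, 0],  # B: Pizza, Bars
    [1, 0, 1, 1],  # C: Pizza, Bars, Burgers
    [0, 1, 0, 0],  # D: Sushi
])

# On one-hot vectors, the L1 distance counts the categories on which two restaurants differ
dist_ab = int(np.abs(toy[0] - toy[1]).sum())
dist_ac = int(np.abs(toy[0] - toy[2]).sum())

# min_samples=2 is an assumption for this toy data only
toy_labels = DBSCAN(eps=1, metric='l1', min_samples=2).fit(toy).labels_
print(dist_ab, dist_ac, toy_labels)
```

Here A and C differ by two categories, yet they share a cluster because B sits within distance 1 of both. D has no neighbor within distance 1 and is labeled as noise (-1).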

Then, a training set is built from users who have reviewed at least 5 Italian restaurants. A portion of this set is held out to test the accuracy of the model. Each user's average rating is calculated per cluster, and these per-cluster averages are used as inputs to train a neural network model whose output is the user's average rating of Italian restaurants.
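The feature construction described above can be sketched on a few made-up reviews. The user IDs and star values are hypothetical. The label -1 for Italian restaurants and the fill value of 3 for clusters a user has not reviewed follow the conventions used later in this notebook.

```python
import pandas as pd

# Hypothetical reviews; label -1 marks Italian restaurants, other labels are clusters
toy_reviews = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'label':   [-1,   1,    1,    -1,   2],
    'stars':   [4,    5,    3,    2,    4],
})

# One row per user, one column per label, each cell the user's mean rating
per_user = toy_reviews.groupby(['user_id', 'label'])['stars'].mean().unstack('label')

# Inputs: cluster averages, with a neutral 3 where the user has no reviews
features = per_user.drop(columns=-1).fillna(3)
# Output: the user's average rating of Italian restaurants
target = per_user[-1]
print(features)
print(target)
```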

In [None]:
import json
import pandas as pd
import tensorflow as tf
import warnings
warnings.simplefilter("ignore")

from matplotlib import pyplot as plt
from pprint import pprint
from sklearn.cluster import DBSCAN
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

Loading in restaurant data

In [None]:
restaurants = []
with open('I:/yelp_dataset/restaurant_data/restaurant.json', encoding='utf-8') as f:
    for line in f:
        restaurant = json.loads(line)
        restaurant['categories'] = [x.strip() for x in restaurant['categories'].split(',')]
        restaurants.append(restaurant)

In [None]:
italian_restaurants = [x for x in restaurants if 'Italian' in x['categories']]
other_restaurants = [x for x in restaurants if 'Italian' not in x['categories']]

Getting restaurant categories

In [None]:
categories = set()
for restaurant in restaurants:
    for category in restaurant['categories']:
        categories.add(category)
print(len(categories))

Saving restaurant categories

In [None]:
with open('I:/yelp_dataset/restaurant_data/categories.json', mode='w') as f:
    json.dump(list(categories), f)

In [None]:
def restaurant_to_row(restaurant, categories):
    row = {i: 0 for i in categories}
    row['business_id'] = restaurant['business_id']
    for category in restaurant['categories']:
        row[category] = 1
    return row
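For comparison, the same one-hot encoding can be sketched with scikit-learn's `MultiLabelBinarizer`. This is an alternative technique, not the one this notebook uses, and the two restaurants below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical restaurants with already-split category lists
toy_restaurants = [
    {'business_id': 'a1', 'categories': ['Restaurants', 'Pizza']},
    {'business_id': 'b2', 'categories': ['Restaurants', 'Sushi Bars', 'Japanese']},
]

# fit_transform learns the category vocabulary and returns a 0/1 matrix
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform([r['categories'] for r in toy_restaurants])

# classes_ holds the learned categories in sorted order
toy_df = pd.DataFrame(encoded, columns=mlb.classes_)
toy_df.insert(0, 'business_id', [r['business_id'] for r in toy_restaurants])
print(toy_df)
```

One difference to keep in mind: `MultiLabelBinarizer` only creates columns for categories it sees during fitting, whereas `restaurant_to_row` creates a column for every entry in `categories`, including categories that appear only on Italian restaurants.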

In [None]:
other_restaurant_rows = [restaurant_to_row(x, categories) for x in other_restaurants]

In [None]:
df = pd.DataFrame(other_restaurant_rows)
X = df.drop('business_id', axis=1)

In [None]:
%%time
# CAUTION! This step may take several hours
# min_samples is left at scikit-learn's default of 5
model = DBSCAN(eps=1, metric='l1')
model.fit(X)

In [None]:
len(model.components_)

In [None]:
pd.Series(model.labels_).value_counts()

In [None]:
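# Drop the generic 'Restaurants' category column
labeled_df = df.drop('Restaurants', axis=1)
# Shift labels by 1 so that DBSCAN noise points (-1) become cluster 0
labeled_df['label'] = model.labels_ + 1
labeled_df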

In [None]:
labeled_df[['business_id', 'label']].to_json('I:/yelp_dataset/restaurant_data/business_clusters.json', orient='records')

In [None]:
# Mean of each one-hot column per cluster = share of the cluster's restaurants in that category
vectors = labeled_df.drop('business_id', axis=1).groupby('label').mean()
counts = labeled_df.groupby('label')['label'].count()

# Iterate over every label present, including the last cluster (label 0 holds noise points)
for cluster in counts.index:
    print(f'\nCluster {cluster}, count: {counts[cluster]}')
    temp_df = vectors.loc[cluster].sort_values(ascending=False)
    # Categories present in more than 90% of the cluster's restaurants
    identifying_categories = temp_df[temp_df > 0.9]
    if len(identifying_categories) > 0:
        for x in identifying_categories.index:
            print(x)
    else:
        print('()')
    print('\n')
    print(temp_df.head())
    print('\n' + '-'*40)

In [None]:
reviews = []
with open('I:/yelp_dataset/restaurant_data/review.json', encoding='utf-8') as f:
    for line in f:
        # Append each review; assigning here would overwrite the list with the last record
        reviews.append(json.loads(line))

In [None]:
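df2 = pd.DataFrame(reviews[:-1]).merge(labeled_df[['business_id','label']], on='business_id', how='left')
# Reviews with no cluster label belong to the Italian restaurants excluded from clustering; mark them -1
df2['label'] = df2['label'].fillna(-1)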

In [None]:
italian_restaurant_reviewers = df2[df2['label'] == -1].groupby('user_id').count()['label']
top_italian_restaurant_reviewers = italian_restaurant_reviewers[italian_restaurant_reviewers >= 5]
top_italian_restaurant_reviewers

In [None]:
italian_restaurant_ids = set([x['business_id'] for x in italian_restaurants])
italian_restaurant_reviewer_ids = [x['user_id'] for x in reviews[:-1] if x['business_id'] in italian_restaurant_ids]

In [None]:
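# Average only the 'stars' column; averaging every column fails on text fields in recent pandas versions
df3 = df2[df2['user_id'].isin(top_italian_restaurant_reviewers.index)].groupby(['user_id', 'label'])['stars'].mean().reset_index().set_index('user_id').pivot(columns='label', values='stars')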
df3

In [None]:
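# Features: average rating per non-Italian cluster; clusters a user never reviewed get a neutral 3
X = df3.drop(-1, axis=1).fillna(3)
# Target: average rating of Italian restaurants (label -1)
y = df3[-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# A single Dense unit with no activation, i.e. a linear model on the cluster averages
nn_model = tf.keras.Sequential()
nn_model.add(tf.keras.layers.Dense(units=1, input_dim=len(X.columns)))
nn_model.compile(loss="MSE", optimizer="adam", metrics=["mse", "mae"])
fit_model = nn_model.fit(X_train, y_train, epochs=50)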

In [None]:
nn_model.save("yelp_model.h5")

In [None]:
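# The loss is MSE, so plotting "loss" avoids depending on the metric key name, which varies across TensorFlow versions
pd.DataFrame(fit_model.history, index=range(1,len(fit_model.history["loss"])+1)).plot(y="loss")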

In [None]:
r2_score(y_test, nn_model.predict(X_test))

In [None]:
mean_absolute_error(y_test, nn_model.predict(X_test))
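To make the two metrics above more concrete, the sketch below computes them by hand on four made-up ratings and compares a model against a baseline that always predicts the mean. The numbers are hypothetical and unrelated to the Yelp results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted average ratings
y_true = np.array([4.0, 3.0, 5.0, 2.0])
y_pred = np.array([3.5, 3.0, 4.5, 2.5])

# MAE: average absolute gap between prediction and truth, in stars
mae_manual = np.abs(y_true - y_pred).mean()

# R-squared: 1 minus the ratio of residual variation to total variation
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

# A baseline that always predicts the mean rating scores an R-squared of 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
print(r2_baseline)
```

An R-squared above 0 therefore means the model explains some variation that a constant mean prediction does not.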

In [None]:
plt.scatter(nn_model.predict(X_test), y_test)

*Conclusion:*

While the R-squared of 0.35 shows that the model has room for improvement, its predictions are, on average, within 0.4 stars of a user's actual average rating (as measured by the mean absolute error).