## Singular Value Decomposition Recommendation Engine

### Project Description:

The below code seeks to use the yelp reviews data to build a SVD recommendation engine for all standards and reviews.

In [1]:
# Import packages
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
import surprise
from surprise import Dataset
from surprise import SVD
from surprise.model_selection import cross_validate
from surprise import accuracy
from surprise.model_selection import train_test_split

### Data Preprocessing

In [2]:
# Import data set
data = pd.read_csv('Data/Preprocessed_Reviews_Data.csv')
data = data.drop(['Unnamed: 0'], axis = 1)

# Print data summary
print('\n')
data.info()
data.head()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100763 entries, 0 to 100762
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   business_name  100763 non-null  object 
 1   user_id        100763 non-null  object 
 2   old_text       100763 non-null  object 
 3   stars          100763 non-null  float64
 4   new_text       100763 non-null  object 
 5   topic          100763 non-null  int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 4.6+ MB


Unnamed: 0,business_name,user_id,old_text,stars,new_text,topic
0,Levetto,TZQSUDDcA4ek5gBd6BzcjA,"In the heart of Chinatown, I discovered it enr...",4.0,heart chinatown discov enrout kensington marke...,3
1,Café La Gaffe,TZQSUDDcA4ek5gBd6BzcjA,One of my Baldwin Village favourites!\n\nIt's ...,5.0,one baldwin villag favourit well establish wel...,0
2,Niuda Hand-Pulled Noodles,TZQSUDDcA4ek5gBd6BzcjA,Great first experience.\n\nMy friend and I wer...,4.0,great first experi friend late dinner last wee...,1
3,Light Cafe,TZQSUDDcA4ek5gBd6BzcjA,Lots of new things to try on Baldwin this summ...,3.0,lot new thing tri baldwin summer includ new ki...,2
4,Raijin Ramen,TZQSUDDcA4ek5gBd6BzcjA,With the exponential growth of ramen joints in...,4.0,exponenti growth ramen joint citi one remain o...,1


The data set processed in the topic modeling analysis is imported into the notebook.

In [3]:
# Preprocess data set
reader = surprise.Reader(rating_scale = (1, 5))
data = Dataset.load_from_df(data[['user_id', 'business_name', 'stars']], reader)

The data set is preprocessed before sitting to the SVD algorithm.

### Model Creation and Testing

In [5]:
# Validate SVD algorithm via cross validation
algo = SVD()
cross_val = cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv = 10, verbose = False)
values = cross_val.values()
values_list = list(values)

# RMSE
rmse = values_list[0]
rmse_avg = round(sum(rmse) / len(rmse),2)
rmse_std = round(rmse.std(),2)

#MAE
mae = values_list[1]
mae_avg = round(sum(mae) / len(mae),2)
mae_std = round(mae.std(),2)

# Print results
print('\n' + 'rmse average: ' + str(rmse_avg) + '\n' + 'rmse standard deviation: ' + str(rmse_std) + '\n')
print('\n' + 'mae average: ' + str(mae_avg) + '\n' + 'mae standard deviation: ' + str(mae_std) + '\n')


rmse average: 0.94
rmse standard deviation: 0.01


mae average: 0.73
mae standard deviation: 0.01



Using cross validation the SVD algorithm appears to be generating fair results considering the star ratings are in range from 1 - 5.

In [6]:
# Split data into an 80/20 split
trainset, testset = train_test_split(data, test_size = 0.20, random_state = 100)
# Generate SVD Algorithm
algo = SVD(n_factors = 200, n_epochs = 200, random_state = 100)
# Fit trainset to SVD algorithm
algo.fit(trainset)
# Generate rating predictions
predictions_test = algo.test(testset)

# Print test results
print('\n')
print(str(round(accuracy.rmse(predictions_test),2)))
print(str(round(accuracy.mse(predictions_test),2)))
print(str(round(accuracy.mae(predictions_test),2)))
print('\n')
print(str(algo.qi.shape) + '\n')



RMSE: 0.9619
0.96
MSE: 0.9252
0.93
MAE:  0.7454
0.75


(911, 200)



The results for the test set are slight lower than the cross validation results. This is a good sign because the variation between the two are low.

In [19]:
# Save SVD model
surprise.dump.dump(file_name = 'SVD_Model', predictions = predictions_test, algo = algo, verbose = 1)

The dump has been saved as file SVD_Model


In [20]:
# Import SVD model
algo = surprise.dump.load(file_name = 'SVD_Model')[1]

### Recommendations

In [23]:
# Create recommendation function
def recommendation (rest1, rating1, rest2, rating2, rest3, rating3):
    
    # Cosine distance between vectors calculation
    def cosine_distance(vector_a = np.array, vector_b = np.array):
        return cosine(vector_a, vector_b)
    
    # Retrieve vectors by restaurant name
    def get_vector_by_rest_name(rest_name, trained_model):
        rest_row_idx = trained_model.trainset._raw2inner_id_items[rest_name]
        return trained_model.qi[rest_row_idx]
    
    # Get vectors by restaurant name for three restaurants
    vector1 = get_vector_by_rest_name(rest1, algo)
    score1 = rating1
    vector2 = get_vector_by_rest_name(rest2, algo)
    score2 = rating2
    vector3 = get_vector_by_rest_name(rest3, algo)
    score3 = rating3
    
    ##############################################################################################################
    
    # Calculate cosine similarity for all three chosen restaurants' vectors against all other restaurant vectors
    similarity_table1 = []
    for rest_name in algo.trainset._raw2inner_id_items.keys():
        rest_vector = get_vector_by_rest_name(rest_name, algo)
        similarity_score = cosine_distance(vector1, rest_vector)
        similarity_table1.append(((1-similarity_score), rest_name))
        
    # Convert similarity table into a data frame
    rest_rec1 = pd.DataFrame(similarity_table1, columns = ['similarity', 'restaurant name'])
    # Sort data set to descending
    rest_rec1 = rest_rec1.sort_values('similarity', ascending = False)
    # Scale cosine score by duplicates
    rest_rec1 = rest_rec1.groupby(by = "restaurant name").sum()
    # Scale cosine score by rating
    rest_rec1['similarity'] = rest_rec1['similarity'] * score1
    # Sort data set to descending
    rest_rec1 = rest_rec1.sort_values('similarity', ascending = False).reset_index()
    
    ##############################################################################################################
    
    similarity_table2 = []
    for rest_name in algo.trainset._raw2inner_id_items.keys():
        rest_vector = get_vector_by_rest_name(rest_name, algo)
        similarity_score = cosine_distance(vector2, rest_vector)
        similarity_table2.append(((1-similarity_score) * score2, rest_name))
        
    # Convert similarity table into a data frame
    rest_rec2 = pd.DataFrame(similarity_table2, columns = ['similarity', 'restaurant name'])
    # Sort data set to descending
    rest_rec2 = rest_rec2.sort_values('similarity', ascending = False)
    # Scale cosine score by duplicates
    rest_rec2 = rest_rec2.groupby(by = "restaurant name").sum()
    # Scale cosine score by rating
    rest_rec2['similarity'] = rest_rec2['similarity'] * score2
    # Sort data set to descending
    rest_rec2 = rest_rec2.sort_values('similarity', ascending = False).reset_index()
    
    ##############################################################################################################
    
    similarity_table3 = []
    for rest_name in algo.trainset._raw2inner_id_items.keys():
        rest_vector = get_vector_by_rest_name(rest_name, algo)
        similarity_score = cosine_distance(vector3, rest_vector)
        similarity_table3.append(((1-similarity_score) * score3, rest_name))
    
    # Convert similarity table into a data frame
    rest_rec3 = pd.DataFrame(similarity_table3, columns = ['similarity', 'restaurant name'])
    # Sort data set to descending
    rest_rec3 = rest_rec3.sort_values('similarity', ascending = False)
    # Scale cosine score by duplicates
    rest_rec3 = rest_rec3.groupby(by = "restaurant name").sum()
    # Scale cosine score by rating
    rest_rec3['similarity'] = rest_rec3['similarity'] * score2
    # Sort data set to descending
    rest_rec3 = rest_rec3.sort_values('similarity', ascending = False).reset_index()
    
    # Create a list of all data frames
    df_list = [rest_rec1, rest_rec2, rest_rec3]
    # Concatenate all data frames by axis 0
    rest_rec4 = pd.concat(df_list, axis = 0)
    # Remove all three chosen restaurants 
    rest_rec4 = rest_rec4.loc[(rest_rec4['restaurant name'] != rest1) & (rest_rec4['restaurant name'] != rest2) &
                              (rest_rec4['restaurant name'] != rest3)].reset_index(drop = True)
    # Scale cosine score by duplicates
    rest_rec4 = rest_rec4.groupby(by = "restaurant name").sum().reset_index()
    # Sort values by cosine values in descending order
    rest_rec4 = rest_rec4.sort_values('similarity', ascending = False).reset_index(drop = True)
    
    # Print recommendations
    print('\n')
    rest_rec4.info()
    return rest_rec4.head(10)   

Using cosine distances, three restaurant vectors are chosen to compare to the rest of the restaurant vectors among the the data set.

The cosine similarity between the vectors are measured returning a list of restaurant names with attached cosine values similar to the chosen restaurants. 

Duplicate restaurants' cosine values will be summed returning scaled results.

Further the chosen restaurants are scaled from 0 - 5 to represent how much the restaurant is liked and depending on these ratings, the list of restaurant names' values are scaled accordingly. The scaling will return similar restaurants that are liked in descending order. 

In [24]:
# Test recommendation engine function
recommendation("Uncle Tetsu's Japanese Cheesecake", 5, 
               "Kyoto House Japanese Restaurant", 5, 
               "Wheat Sheaf Tavern", 5)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 908 entries, 0 to 907
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   restaurant name  908 non-null    object 
 1   similarity       908 non-null    float64
dtypes: float64(1), object(1)
memory usage: 14.3+ KB


Unnamed: 0,restaurant name,similarity
0,The Shore Club - Toronto,7.809874
1,Buddha's Vegan Foods,7.175418
2,Café Pamenar,6.913647
3,Jumbo Empanadas,6.182954
4,Pizzeria Libretto Danforth,6.057171
5,Coco Rice Thai Cuisine,5.838682
6,Kit Kat Italian Bar & Grill,5.684728
7,Bar Fancy,5.430922
8,Sushi Inn,5.190064
9,Karaikudi,5.176134


the recommendation function is used and the top 10 restaurant recommendations are returned.