# Implementing a Recommender System for Recommending Restaurants on Yelp

Nowadays, recommender systems are used to personalize your experience on the web, telling you what to buy, where to eat, or even who you should be friends with. People's tastes vary, but they generally follow patterns: people tend to like things that are similar to other things they like, and they tend to share tastes with the people they are close to. Recommender systems try to capture these patterns to predict what else you might like. E-commerce, social media, video, and online news platforms actively deploy their own recommender systems to help customers choose products more efficiently, which benefits both the platform and the customer.

The two most common types of recommender systems are Content-Based and Collaborative Filtering (CF). Collaborative filtering produces recommendations based on users’ attitudes toward items; that is, it uses the “wisdom of the crowd” to recommend items. In contrast, content-based recommender systems focus on the attributes of the items and give you recommendations based on the similarity between them.
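To make the distinction concrete, here is a minimal sketch that computes restaurant-to-restaurant similarity in both ways. The toy attribute and rating arrays and the helper `cosine_similarity_matrix` are hypothetical and are not part of the Yelp data used in this notebook.

```python
import numpy as np

def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = rows / norms
    return unit @ unit.T

# Hypothetical toy data: 3 restaurants described by 3 binary attributes
# (serves_indian, serves_burgers, vegetarian_friendly).
item_attributes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
])

# Hypothetical toy ratings: rows are restaurants, columns are 4 users (0 = not rated).
item_ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [0, 1, 5, 4],
])

content_sim = cosine_similarity_matrix(item_attributes)  # content-based: compares attributes
cf_sim = cosine_similarity_matrix(item_ratings)          # collaborative: compares who rated what
```

In both views the first two restaurants come out as the most similar pair, but for different reasons: they share an attribute in the content-based view, and they were rated highly by the same users in the collaborative view.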

In general, collaborative filtering (CF) is the workhorse of recommender engines. The algorithm has the interesting property of being able to do feature learning on its own, meaning it can learn for itself which features to use. CF can be divided into Memory-Based Collaborative Filtering and Model-Based Collaborative Filtering. Here, I have implemented Model-Based CF using singular value decomposition (SVD).

I built the dataset by scraping Yelp for restaurants in Los Angeles, restricted to two cuisine types, Indian and American. Doing this taught me how to preprocess data, which is one of the main challenges with raw data collected this way.

In [31]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import scipy.sparse as sp
from scipy.sparse.linalg import svds
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from math import sqrt

### Preprocessed Data (see PreProceesing.ipynb)

After preprocessing the data, I created a reviews.txt file that contains the restaurant name, user id, user name, and the rating that the user gave to that restaurant.

In [3]:
df = pd.read_csv('reviews.txt', header= None,names=['restaurant', 'user', 'user_name', 'rating'])

In [4]:
df.shape

(52915, 4)

### Cleaning Data
In the section below, we remove duplicate rows, rows with no rating, and other junk values to reduce data sparsity.

In [5]:
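# Note: drop_duplicates() returns a new DataFrame; the result is only displayed here
# and is not assigned back, so df itself is unchanged by this cell.
df.drop_duplicates()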

Unnamed: 0,restaurant,user,user_name,rating
0,Samosa House Santa Monica,N,A,
1,Samosa House Santa Monica,WQkxoL5KRcjgTL-djd0uzQ,Rachel R.,4.0
2,Samosa House Santa Monica,z6HgK3tFehFP0oGxcHhLsA,Allie L.,3.0
3,Samosa House Santa Monica,CViY3KcIpdEeNGk_5uwhVg,Jason C.,4.0
4,Samosa House Santa Monica,T3d5LRpapvLfxQwCbg6-hw,Ike C.,4.0
5,Samosa House Santa Monica,rHb0CFwn5Y8nhdluj88QYA,Ben W.,5.0
6,Samosa House Santa Monica,YRMlaUOvUHP_EymcMVtxFQ,Sukraat C.,5.0
7,Samosa House Santa Monica,EjZcBBnYtw_XdMRg4YVEgw,Jackie O.,1.0
8,Samosa House Santa Monica,Tmuf79E1N2iasYBAxKoSow,Vaishali P.,1.0
9,Samosa House Santa Monica,dtU2TOk-sXvSzieNNPkQbw,Ernie L.,5.0


In [7]:
# Drop placeholder rows whose user id is 'N' (such as row 0 above, which has no rating)
df1 = df[df.user != 'N']
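The cleaning steps described in this section can be sketched as one chained expression. This is a minimal illustration on a hypothetical toy frame that mimics the layout of reviews.txt; it assumes that rows without a rating should be dropped with `dropna`, which is not necessarily what the original notebook did.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the columns of reviews.txt.
raw = pd.DataFrame({
    "restaurant": ["A", "A", "A", "B"],
    "user": ["N", "u1", "u1", "u2"],
    "user_name": ["A", "Rachel R.", "Rachel R.", "Ben W."],
    "rating": [np.nan, 4.0, 4.0, 5.0],
})

cleaned = (
    raw.drop_duplicates()            # returns a new frame, so the result must be kept
       .dropna(subset=["rating"])    # drop rows that carry no rating
       .reset_index(drop=True)       # renumber the remaining rows from 0
)
```

Assigning the chained result to a new name keeps the raw frame intact, which makes it easy to compare row counts before and after cleaning.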

Get a sneak peek of the first 5 rows in the dataset

In [8]:
df1.head()

Unnamed: 0,restaurant,user,user_name,rating
1,Samosa House Santa Monica,WQkxoL5KRcjgTL-djd0uzQ,Rachel R.,4.0
2,Samosa House Santa Monica,z6HgK3tFehFP0oGxcHhLsA,Allie L.,3.0
3,Samosa House Santa Monica,CViY3KcIpdEeNGk_5uwhVg,Jason C.,4.0
4,Samosa House Santa Monica,T3d5LRpapvLfxQwCbg6-hw,Ike C.,4.0
5,Samosa House Santa Monica,rHb0CFwn5Y8nhdluj88QYA,Ben W.,5.0


In [9]:
df2 = df1.drop(labels=['user_name'],axis=1)

In [10]:
df2.shape

(50310, 3)

In [11]:
users = np.unique(np.array(df2.user))

In [12]:
restaurants = np.unique(np.array(df2.restaurant))

In [10]:
len(restaurants)

129

In [11]:
len(users)

42162

In [13]:
train,test = train_test_split(df2, test_size = 0.01)

In [None]:
# users = np.unique(np.array(k2.user))
# restaurants = np.unique(np.array(k2.restaurant))
# len(users)

In [15]:
# len(restaurants)

101

### Creating a Matrix of Users versus the Restaurants They Have Rated

In [16]:
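# Build one column per restaurant, with one entry per user (0 = no rating).
# Note: this scans df2 once per (restaurant, user) pair, so it is slow on large data.
restaurants_dict = {}
for restaurant in restaurants:
    restaurants_dict[restaurant] = []
    for user in users:
        # .ix was removed in pandas 1.0; .loc performs the same boolean-mask selection
        rating = df2.loc[(df2['restaurant'] == restaurant) & (df2['user'] == user)].rating
        if rating.empty:
            rating = 0
        else:
            # .item() assumes exactly one rating per (restaurant, user) pair
            rating = rating.item()

        restaurants_dict[restaurant].append(rating)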

In [19]:
matrix_df =pd.DataFrame(restaurants_dict, index=users)
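The nested loop above visits every (restaurant, user) pair. As a possible alternative, the same user-by-restaurant layout can be sketched with `pivot_table`. The toy `ratings` frame is hypothetical, and averaging duplicate (user, restaurant) pairs with `aggfunc="mean"` is an assumption; the loop above would instead raise an error from `.item()` in that case.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as df2.
ratings = pd.DataFrame({
    "restaurant": ["A", "A", "B", "B", "B"],
    "user": ["u1", "u2", "u1", "u3", "u3"],
    "rating": [4.0, 3.0, 5.0, 1.0, 3.0],
})

# Rows are users, columns are restaurants; unrated pairs become 0.
# Assumption: duplicate (user, restaurant) ratings are averaged.
matrix_df_sketch = ratings.pivot_table(
    index="user",
    columns="restaurant",
    values="rating",
    aggfunc="mean",
    fill_value=0,
)
```

Because the reshaping happens inside pandas rather than in a Python loop, this form is usually much faster when there are tens of thousands of users.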

In [21]:
matrix_df.head()


Unnamed: 0,Addis Tandoor,Agra Indian Kitchen,Agra Tandoori,Akbar Cuisine of India,Al Watan Halal Restaurant,Al-Noor,All India Cafe - West LA,Anar Indian Restaurant,Anarbagh,Anarbagh Indian Cuisine,...,Stout Burgers & Beers,Taj Mahal,Taras Himalayan Cuisine,Taste of India,The Hungry Pig,The India Restaurant,The Kitchen,Urban Masala,Zaiqa Grill,un solo sol
-2n466i88rP141C_JyarnQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-Nu4itEKolLR_ELgnxm_Yg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-PFFaOrqO5P2oSx_sNC6yA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-RjJfXz8_qrk1Hyqd-gyaA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-TwywRYyoMDgnMak0u4Kfw,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [35]:
m = matrix_df.to_numpy()  # as_matrix() was removed in pandas 1.0

In [27]:
matrix = preprocessing.scale(matrix_df, axis=0, with_mean=True, with_std=True, copy=True)

### Evaluation
There are many evaluation metrics, but one of the most popular for evaluating the accuracy of predicted ratings is Root Mean Squared Error (RMSE).

Since I only want to consider predicted ratings for entries that actually have a rating, I filter out all other elements of the prediction matrix with prediction[ground_truth.nonzero()]. Note that this mask is only meaningful when unrated entries are stored as 0 in ground_truth.

In [32]:
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
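As a small worked example of the masking behaviour, the sketch below uses a NumPy-only equivalent of the function above (named `rmse_masked` here to avoid confusion) on a hypothetical 2×2 matrix in which two entries are unrated.

```python
import numpy as np

def rmse_masked(prediction, ground_truth):
    """RMSE over the entries where ground_truth is nonzero (i.e. rated)."""
    mask = ground_truth.nonzero()
    diff = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical toy data: zeros in `truth` mean "no rating".
truth = np.array([[5.0, 0.0],
                  [0.0, 3.0]])
pred = np.array([[4.0, 2.0],
                 [1.0, 5.0]])

# Only the two rated cells count: errors are (4 - 5) and (5 - 3),
# so the score is sqrt((1 + 4) / 2) = sqrt(2.5).
score = rmse_masked(pred, truth)
```

The predictions 2.0 and 1.0 for the unrated cells do not affect the score, which is the point of the nonzero mask.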

### Model Based Collaborative Filtering
Model-based Collaborative Filtering is based on matrix factorization (MF), which has received wide exposure, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used for recommender systems, where it can deal better with scalability and sparsity than Memory-based CF. The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (that is, to learn features that describe the characteristics of ratings), and then to predict the unknown ratings through the dot product of the latent features of users and items.

A well-known matrix factorization method is singular value decomposition (SVD). Collaborative filtering can be formulated as approximating a matrix $X$ by using singular value decomposition. The general equation can be expressed as follows: $X = U S V^T$

Given an $m \times n$ matrix $X$:

$U$ is an $m \times r$ matrix with orthonormal columns
$S$ is an $r \times r$ diagonal matrix with non-negative real numbers on the diagonal
$V^T$ is an $r \times n$ matrix with orthonormal rows
Elements on the diagonal of $S$ are known as the singular values of $X$.
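The decomposition can be checked numerically. The sketch below is a minimal illustration using `np.linalg.svd` on the small toy rating matrix that also appears in the commented-out cell at the end of this notebook; the choice of `k = 2` is arbitrary.

```python
import numpy as np

# Toy 5 x 4 rating matrix (0 = not rated).
X = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Reduced SVD: U is 5 x 4, s holds 4 singular values (descending), Vt is 4 x 4.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Using all singular values reproduces X up to floating-point error.
X_full = U @ np.diag(s) @ Vt

# Keeping only the k largest singular values gives a rank-k approximation.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius-norm error of the rank-k approximation equals the
# root of the sum of the squared discarded singular values.
approx_error = np.linalg.norm(X - X_k)
```

The rank-k matrix `X_k` is what plays the role of the predicted rating matrix: entries that were 0 in `X` receive nonzero values derived from the retained latent factors.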

In [33]:


# Get SVD components from the standardized rating matrix. Choose k.
u, s, vt = svds(matrix, k=20)
s_diag_matrix = np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD-based CF RMSE: ' + str(rmse(X_pred, matrix)))

SVD-based CF RMSE: 0.8905228061849477
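The cell above passes a dense, standardized array to `svds`. Since the rating matrix is mostly zeros, a possible variation is to store it as a SciPy sparse matrix first. The following is a minimal sketch on a hypothetical toy matrix, not a rerun of the experiment above; note that `svds` treats the stored zeros as actual zero values rather than as missing ratings.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Hypothetical toy rating matrix (0 = not rated).
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Compressed sparse row storage keeps only the nonzero entries.
R_sparse = sp.csr_matrix(R)

# svds computes only the k largest singular triplets; k must be
# smaller than the smallest dimension of the matrix.
u, s, vt = svds(R_sparse, k=2)

# Rank-2 reconstruction used as the predicted rating matrix.
R_pred = u @ np.diag(s) @ vt
```

Unlike `np.linalg.svd`, `svds` does not promise descending order for the singular values, but the product `u @ np.diag(s) @ vt` is unaffected by their order.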


In [37]:
# Original (unscaled) rating of the first user in the matrix for the restaurant in column index 19 (the 20th column)
m[0][19]

5.0

In [38]:
# Predicted value for the same user and restaurant; note that X_pred is on the standardized scale of `matrix`, not the original rating scale
X_pred[0][19]

5.7882914965909027
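Predicted scores like the one above can be turned into recommendations by ranking, for a given user, the restaurants that user has not rated yet. The helper `top_n_unrated` and the toy arrays below are hypothetical and only sketch the idea.

```python
import numpy as np

def top_n_unrated(pred_row, rated_row, names, n=2):
    """Return the names of the n highest-scoring items the user has not rated."""
    # Exclude items the user already rated by giving them a score of -inf.
    scores = np.where(rated_row > 0, -np.inf, pred_row)
    order = np.argsort(scores)[::-1][:n]
    return [names[i] for i in order if np.isfinite(scores[i])]

# Hypothetical toy data for a single user.
names = ["A", "B", "C", "D"]
rated = np.array([5.0, 0.0, 0.0, 1.0])  # original ratings, 0 = not rated
pred = np.array([4.8, 2.1, 3.7, 0.9])   # predicted scores for the same user

recs = top_n_unrated(pred, rated, names, n=2)
```

Only the relative order of the predicted scores matters for the ranking, so the sketch works whether the predictions are on the original or a standardized scale.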

In [None]:
# def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
#     Q = Q.T
#     for step in range(steps):
#         for i in range(len(R)):
#             for j in range(len(R[i])):
#                 if R[i][j] > 0:
#                     eij = R[i][j] - np.dot(P[i,:],Q[:,j])
#                     for k in range(K):
#                         P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
#                         Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
#         eR = np.dot(P,Q)
#         e = 0
#         for i in range(len(R)):
#             for j in range(len(R[i])):
#                 if R[i][j] > 0:
#                     e = e + pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
#                     for k in range(K):
#                         e = e + (beta/2) * ( pow(P[i][k],2) + pow(Q[k][j],2) )
#         if e < 0.001:
#             break
#     return P, Q.T

In [None]:
# R = [
#      [5,3,0,1],
#      [4,0,0,1],
#      [1,1,0,5],
#      [1,0,0,4],
#      [0,1,5,4],
#     ]

# R = np.array(R)

# N = len(R)
# M = len(R[0])
# K = 2

# P = np.random.rand(N,K)
# Q = np.random.rand(M,K)

# nP, nQ = matrix_factorization(R, P, Q, K)
# nR = np.dot(nP, nQ.T)
