These datasets are hosted on: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

They were originally published by: Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.

# Making Recommendations Based on Correlation

In [None]:
import numpy as np
import pandas as pd

In [None]:
# rating_final.csv
url = 'https://drive.google.com/file/d/1ptu4AlEXO4qQ8GytxKHoeuS1y4l_zWkC/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
frame = pd.read_csv(path)

# chefmozcuisine.csv
url = 'https://drive.google.com/file/d/1S0_EGSRERIkSKW4D8xHPGZMqvlhuUzp1/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
cuisine = pd.read_csv(path)

# 'geoplaces2.csv'
url = 'https://drive.google.com/file/d/1ee3ib7LqGsMUksY68SD9yBItRvTFELxo/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
geodata = pd.read_csv(path, encoding = 'CP1252') # change encoding to 'mbcs' in Windows

## Preparing Data for Correlation

We will look for restaurants that are similar to "Tortas Locas Hipocampo", the most popular restaurant from the last notebook. "Similarity" will be defined by how well other places correlate with "Tortas Locas" in the user-item matrix. In this matrix, users are the rows and restaurants are the columns. It contains many NaNs because each user has rated only a few of the restaurants; a mostly empty matrix like this is called a sparse matrix.
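To make the idea of a sparse user-item matrix concrete, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical; only the column names mirror `rating_final.csv`.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as rating_final.csv
toy = pd.DataFrame({
    "userID":  ["U1", "U1", "U2", "U3", "U3"],
    "placeID": [10, 20, 10, 20, 30],
    "rating":  [2, 1, 0, 2, 1],
})

# Users in rows, restaurants in columns; unrated (user, place) pairs become NaN
toy_matrix = pd.pivot_table(data=toy, values="rating", index="userID", columns="placeID")

# Share of empty cells: a rough measure of how sparse the matrix is
sparsity = toy_matrix.isna().to_numpy().mean()

print(toy_matrix)
print(f"share of missing cells: {sparsity:.2f}")
```

Even in this tiny example almost half of the cells are empty; with the real data the share of NaNs is much higher.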

In [None]:
frame.head()

Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2


In [None]:
user_item_df = pd.pivot_table(data=frame, values='rating', index='userID', columns='placeID')
user_item_df.head(10)

userID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
U1001,,,,,,,,,,,...,,,,0.0,,,,,,
U1002,,,,,,,,,,,...,,,,1.0,,,,1.0,,
U1003,,,,,,,,,,,...,2.0,,,,,,,,,
U1004,,,,,,,,,,,...,,,,,,,,2.0,,
U1005,,,,,,,,,,,...,,,,,,,,,,
U1006,,,,1.0,,,,,,,...,,,,,,,,,,
U1007,,,,1.0,,,,,,,...,,,,1.0,0.0,,,,1.0,
U1008,,,,,,,,,,,...,,,,,,,,,1.0,
U1009,,,,,,,,,,,...,,,,,,,,,,
U1010,,,,,,,,,,,...,,,,,,,,,,


Let's look at the users that have visited "Tortas Locas":

In [None]:
tortas_id = 135085

Tortas_ratings = user_item_df.loc[:,tortas_id]
Tortas_ratings[Tortas_ratings>=0] # exclude NaNs

userID
U1001    0.0
U1002    1.0
U1007    1.0
U1013    1.0
U1016    2.0
U1027    1.0
U1029    1.0
U1032    1.0
U1033    2.0
U1036    2.0
U1045    2.0
U1046    1.0
U1049    0.0
U1056    2.0
U1059    2.0
U1062    0.0
U1077    2.0
U1081    1.0
U1084    2.0
U1086    2.0
U1089    1.0
U1090    2.0
U1092    0.0
U1098    1.0
U1104    2.0
U1106    2.0
U1108    1.0
U1109    2.0
U1113    1.0
U1116    2.0
U1120    0.0
U1122    2.0
U1132    2.0
U1134    2.0
U1135    0.0
U1137    2.0
Name: 135085, dtype: float64

## Evaluating Similarity Based on Correlation

Now we will look at how well other restaurants correlate with Tortas Locas. A strong positive correlation between two restaurants indicates that users who liked one restaurant also liked the other. A negative correlation would mean that users who liked one restaurant did not like the other. So, we will look for strong, positive correlations to find similar restaurants.
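As a small worked example of what the correlation measures, the sketch below computes Pearson's r by hand for two restaurants, using only the users who rated both, and compares it with pandas. The two rating series are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings of two restaurants by five users (NaN = not visited)
users = ["U1", "U2", "U3", "U4", "U5"]
a = pd.Series([2, 1, np.nan, 0, 2], index=users)
b = pd.Series([2, 0, 1, np.nan, 1], index=users)

# Keep only the users who rated both restaurants
both = a.notna() & b.notna()
x, y = a[both].to_numpy(), b[both].to_numpy()

# Pearson r: co-variation of the two rating vectors, scaled by their spreads
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

# pandas drops incomplete pairs in the same way before computing r
r_pandas = a.corr(b)

print(r_manual, r_pandas)
```

Both values agree because only U1, U2 and U5 contribute; users who visited just one of the two restaurants carry no information about how the ratings move together.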

In [None]:
# NumPy emits RuntimeWarnings when a restaurant shares too few ratings with Tortas Locas to compute Pearson's r; those entries become NaN and the other results are still OK
similar_to_Tortas = user_item_df.corrwith(Tortas_ratings)
similar_to_Tortas

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


placeID
132560         NaN
132561         NaN
132564         NaN
132572   -0.428571
132583         NaN
            ...   
135088         NaN
135104         NaN
135106    0.454545
135108         NaN
135109         NaN
Length: 130, dtype: float64

Many restaurants get a NaN, because there are no users (or too few) that rated both that restaurant _and_ Tortas Locas. But some of them give us a correlation score. Let's drop NaNs and look at the valid results:
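The sketch below illustrates, on invented data, the situations in which `corrwith` returns NaN: no shared users, a single shared user, or shared ratings that never vary. The column names are hypothetical labels chosen for the example.

```python
import warnings

import numpy as np
import pandas as pd

users = ["U1", "U2", "U3", "U4"]
target = pd.Series([2, 1, 0, np.nan], index=users)  # stands in for Tortas_ratings

toy_matrix = pd.DataFrame(
    {
        "no_overlap":  [np.nan, np.nan, np.nan, 1],  # rated only by U4
        "one_overlap": [2, np.nan, np.nan, 1],       # shares only U1 with target
        "constant":    [1, 1, 1, np.nan],            # shared ratings never vary
        "usable":      [2, 0, 0, np.nan],            # three shared, varying ratings
    },
    index=users,
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence NumPy's RuntimeWarnings for this demo
    result = toy_matrix.corrwith(target)

print(result)
```

Only the `usable` column yields a number; the other three come back as NaN and are removed by `dropna`.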

In [None]:
corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
corr_Tortas.dropna(inplace=True)
corr_Tortas.head(12)

placeID,PearsonR
132572,-0.428571
132723,0.301511
132754,0.930261
132825,0.700745
132834,0.814823
132856,0.475191
132861,0.5
132862,0.559017
132872,0.840168
132921,0.493013


Some correlations are a perfect 1. This is probably because very few users rated both that restaurant and "Tortas Locas", and because there are only three rating options (0, 1 and 2).
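To see why a small overlap produces extreme scores, note that with exactly two shared users whose ratings differ, Pearson's r can only be 1 or -1. A minimal sketch with made-up ratings:

```python
import numpy as np
import pandas as pd

users = ["U1", "U2", "U3", "U4"]
target = pd.Series([2, 1, np.nan, np.nan], index=users)  # rated by U1 and U2 only

# Two hypothetical restaurants; only their U1 and U2 ratings overlap with target
other_same = pd.Series([1, 0, 2, 0], index=users)  # U1 > U2, like target
other_flip = pd.Series([0, 2, 2, 0], index=users)  # U1 < U2, opposite of target

r_same = target.corr(other_same)
r_flip = target.corr(other_flip)

print(r_same, r_flip)
```

Two points always lie on a straight line, so the score says nothing about how strong the relationship really is. This is the motivation for requiring a minimum amount of data before trusting a correlation.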

In [None]:
rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
rating['rating_count'] = frame.groupby('placeID')['rating'].count()

In [None]:
rating.head()

placeID,rating,rating_count
132560,0.5,4
132561,0.75,4
132564,1.25,4
132572,1.0,15
132583,1.0,4


In [None]:
Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])
Tortas_corr_summary.drop(tortas_id, inplace=True) # drop Tortas Locas itself
Tortas_corr_summary

placeID,PearsonR,rating_count
132572,-0.428571,15
132723,0.301511,12
132754,0.930261,13
132825,0.700745,32
132834,0.814823,25
132856,0.475191,14
132861,0.5,7
132862,0.559017,18
132872,0.840168,12
132921,0.493013,17


Let's filter out restaurants with a rating count below 10.

Then, take the top 10 restaurants in terms of similarity to Tortas:

In [None]:
top10 = Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)
top10

placeID,PearsonR,rating_count
135076,1.0,13
135066,1.0,12
132754,0.930261,13
135045,0.912871,13
135062,0.898933,21
135028,0.892218,15
135042,0.881409,20
135046,0.867722,11
132872,0.840168,12
135038,0.831513,24


In [None]:
places =  geodata[['placeID', 'name']]

In [None]:
top10 = top10.merge(places, left_index=True, right_on="placeID")
top10

Unnamed: 0,PearsonR,rating_count,placeID,name
13,1.0,13,135076,Restaurante Pueblo Bonito
52,1.0,12,135066,Restaurante Guerra
117,0.930261,13,132754,Cabana Huasteca
28,0.912871,13,135045,Restaurante la Gran Via
113,0.898933,21,135062,Restaurante El Cielo Potosino
120,0.892218,15,135028,La Virreina
25,0.881409,20,135042,Restaurant Oriental Express
42,0.867722,11,135046,Restaurante El Reyecito
90,0.840168,12,132872,Pizzeria Julios
60,0.831513,24,135038,Restaurant la Chalita


Let's look at the cuisine type (some restaurants do not have a cuisine type... but for the ones that do, here it is):

In [None]:
top10.merge(cuisine)

Unnamed: 0,PearsonR,rating_count,placeID,name,Rcuisine
0,0.930261,13,132754,Cabana Huasteca,Mexican
1,0.892218,15,135028,La Virreina,Mexican
2,0.881409,20,135042,Restaurant Oriental Express,Chinese
3,0.867722,11,135046,Restaurante El Reyecito,Fast_Food
4,0.840168,12,132872,Pizzeria Julios,American


## Challenge:

Create a function that takes as input a restaurant id and a number (n), and outputs the names of the top n restaurants most similar to the given one.

You can assume that the user-item matrix (user_item_df) is already created.

In [None]:
# Alex
def similar_restaurant(rest_id, n):

  rest_ratings = user_item_df.loc[:,rest_id]
  similar_to_rest = user_item_df.corrwith(rest_ratings)

  corr_rest_id = pd.DataFrame(similar_to_rest, columns=['PearsonR'])
  corr_rest_id.dropna(inplace=True)

  rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
  rating['rating_count'] = frame.groupby('placeID')['rating'].count()

  Rest_corr_summary = corr_rest_id.join(rating['rating_count'])
  Rest_corr_summary.drop(rest_id, inplace=True)
  topn = Rest_corr_summary[Rest_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(n)

  places =  geodata[['placeID', 'name']]

  topn = topn.merge(places, left_index=True, right_on="placeID")
  topn = topn.merge(cuisine)

  return topn

In [None]:
similar_restaurant(132754, 5)

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


Unnamed: 0,PearsonR,rating_count,placeID,name,Rcuisine
0,1.0,12,132723,Gordas de morales,Mexican
1,1.0,10,132951,VIPS,American
2,0.930261,36,135085,Tortas Locas Hipocampo,Fast_Food
3,0.866025,12,132872,Pizzeria Julios,American
4,0.845154,18,135058,Restaurante Tiberius,Pizzeria


In [None]:
# Ouss
def top_resto(n, placeID):
  resto_ratings = user_item_df.loc[:,placeID]
  # no NaN filtering is needed here: corrwith only uses users who rated both restaurants
  similar_to_resto = user_item_df.corrwith(resto_ratings)
  corr_resto = pd.DataFrame(similar_to_resto, columns=['PearsonR'])
  corr_resto.dropna(inplace=True)
  rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
  rating['rating_count'] = frame.groupby('placeID')['rating'].count()
  resto_corr_summary = corr_resto.join(rating['rating_count'])
  resto_corr_summary.drop(placeID, inplace=True)
  top_n = resto_corr_summary[resto_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(n)
  places =  geodata[['placeID', 'name']]
  top_n = top_n.merge(places, left_index=True, right_on="placeID")
  return top_n

In [None]:
top_resto(5, 132754)

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


Unnamed: 0,PearsonR,rating_count,placeID,name
62,1.0,12,132723,Gordas de morales
75,1.0,10,132951,VIPS
121,0.930261,36,135085,Tortas Locas Hipocampo
90,0.866025,12,132872,Pizzeria Julios
116,0.845154,18,135058,Restaurante Tiberius


### BONUS (Next iteration)
Instead of filtering out restaurants with a rating count below 10, let's consider a restaurant X as similar to Y only if at least 3 users have gone to both X and Y.

For example, users 143, 153, and 168 all went to both restaurants. It is not enough that 3 users visited X and a different 3 users visited Y.
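One possible starting point, sketched on a hypothetical toy matrix rather than on `user_item_df`: turn the matrix into 0/1 "has rated" flags and multiply it with its own transpose, which counts the users each pair of restaurants has in common. The names `toy_matrix`, `co_counts` and `min_shared` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: restaurants X, Y, Z rated by five users
toy_matrix = pd.DataFrame(
    {
        "X": [2, 1, 0, np.nan, np.nan],
        "Y": [2, 0, 1, 2, np.nan],
        "Z": [np.nan, np.nan, 2, 0, 1],
    },
    index=["U1", "U2", "U3", "U4", "U5"],
)

# 1 where a user rated a restaurant, 0 otherwise
rated = toy_matrix.notna().astype(int)

# co_counts.loc[a, b] = number of users who rated both a and b
co_counts = rated.T.dot(rated)

# Keep only restaurants sharing at least min_shared users with Y
min_shared = 3
enough = co_counts.loc["Y"] >= min_shared
corr_with_y = toy_matrix.corrwith(toy_matrix["Y"])
filtered = corr_with_y[enough].drop("Y")

print(co_counts)
print(filtered)
```

Here Z is discarded because it shares only two users with Y, while X is kept with its three shared users. For a single pair of restaurants, `Series.corr` also accepts a `min_periods` argument that returns NaN below a chosen number of shared observations.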

In [None]:
# your code here