# Making Restaurant Recommendations Based on Pearson Correlation

This is an item-based recommendation system, because the recommender compares items based on user reviews. In the dataset we are going to use, the items are different places to eat and the users are restaurant goers. Making recommendations based on correlation is a simple form of collaborative filtering, specifically item-to-item filtering. The items that are recommended are chosen based on similarities in user reviews.

In [39]:
#Importing libraries
import numpy as np
import pandas as pd

In [40]:
pip install parse



These datasets are hosted on: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

They were originally published by: Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys’11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.

In [41]:
#Reading the dataset to the Jupyter notebook
frame =  pd.read_csv('rating_final.csv')
cuisine = pd.read_csv('chefmozcuisine.csv')
geodata = pd.read_csv('geoplaces2.csv', encoding = 'ISO-8859-1')

In [42]:
frame.head()

Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2


In [43]:
geodata.head()

Unnamed: 0,placeID,latitude,longitude,the_geom_meter,name,address,city,state,country,fax,zip,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
0,134999,18.915421,-99.184871,0101000020957F000088568DE356715AC138C0A525FC46...,Kiku Cuernavaca,Revolucion,Cuernavaca,Morelos,Mexico,?,?,No_Alcohol_Served,none,informal,no_accessibility,medium,kikucuernavaca.com.mx,familiar,f,closed,none
1,132825,22.147392,-100.983092,0101000020957F00001AD016568C4858C1243261274BA5...,puesto de tacos,esquina santos degollado y leon guzman,s.l.p.,s.l.p.,mexico,?,78280,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,open,none
2,135106,22.149709,-100.976093,0101000020957F0000649D6F21634858C119AE9BF528A3...,El Rincón de San Francisco,Universidad 169,San Luis Potosi,San Luis Potosi,Mexico,?,78000,Wine-Beer,only at bar,informal,partially,medium,?,familiar,f,open,none
3,132667,23.752697,-99.163359,0101000020957F00005D67BCDDED8157C1222A2DC8D84D...,little pizza Emilio Portes Gil,calle emilio portes gil,victoria,tamaulipas,?,?,?,No_Alcohol_Served,none,informal,completely,low,?,familiar,t,closed,none
4,132613,23.752903,-99.165076,0101000020957F00008EBA2D06DC8157C194E03B7B504E...,carnitas_mata,lic. Emilio portes gil,victoria,Tamaulipas,Mexico,?,?,No_Alcohol_Served,permitted,informal,completely,medium,?,familiar,t,closed,none


We want this dataset because it provides a name for each of the unique places that have been reviewed. Since we don't need all the attributes in the dataframe, we subset it down to only placeID and name.

In [44]:
places =  geodata[['placeID', 'name']]
places.head()

Unnamed: 0,placeID,name
0,134999,Kiku Cuernavaca
1,132825,puesto de tacos
2,135106,El Rincón de San Francisco
3,132667,little pizza Emilio Portes Gil
4,132613,carnitas_mata


In [45]:
cuisine.head()

Unnamed: 0,placeID,Rcuisine
0,135110,Spanish
1,135109,Italian
2,135107,Latin_American
3,135106,Mexican
4,135105,Fast_Food


## Grouping and Ranking Data

Let's look at the ratings these places are getting. To do that we will look at the mean value of all the ratings that are given to each place.

In [46]:
rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
rating.head()

Unnamed: 0_level_0,rating
placeID,Unnamed: 1_level_1
132560,0.5
132561,0.75
132564,1.25
132572,1.0
132583,1.0


In addition to the mean value, we also want to look at how popular each of the places was. So to do this, let's add a column called rating count, and within that column we'll generate counts for how many reviews each place got.

In [47]:
rating['rating_count'] = pd.DataFrame(frame.groupby('placeID')['rating'].count())
rating.head()

Unnamed: 0_level_0,rating,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
132560,0.5,4
132561,0.75,4
132564,1.25,4
132572,1.0,15
132583,1.0,4
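
As an aside, the mean and the count can also be computed in a single groupby pass using named aggregation. The sketch below runs on a small made-up frame (the IDs and ratings are illustrative assumptions, not rows from rating_final.csv), so it only shows the shape of the call.

```python
import pandas as pd

# Made-up ratings for illustration only (not taken from rating_final.csv)
toy_frame = pd.DataFrame(
    {
        "userID": ["U1", "U1", "U2", "U2", "U3"],
        "placeID": [10, 20, 10, 20, 10],
        "rating": [2, 1, 1, 0, 0],
    }
)

# Named aggregation: one groupby pass produces both columns
toy_rating = toy_frame.groupby("placeID")["rating"].agg(
    rating="mean", rating_count="count"
)
print(toy_rating)
```

The result has the same two columns as the rating data frame built above, indexed by placeID.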


Now let's look at the statistical description of this rating data frame.

In [48]:
rating.describe()

Unnamed: 0,rating,rating_count
count,130.0,130.0
mean,1.179622,8.930769
std,0.349354,6.124279
min,0.25,3.0
25%,1.0,5.0
50%,1.181818,7.0
75%,1.4,11.0
max,2.0,36.0


The count row shows 130 for both columns, which indicates that 130 unique places have been reviewed in the rating data frame. The max value of rating_count comes out to 36, which means that the most popular place in the dataset has a total of 36 reviews. To see which place that is, we sort the data frame by rating_count in descending order.

In [49]:
rating.sort_values('rating_count', ascending=False).head()

Unnamed: 0_level_0,rating,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
135085,1.333333,36
132825,1.28125,32
135032,1.178571,28
135052,1.28,25
132834,1.0,25


We can see that the most popular place has got a place ID of 135085. So let's find the name of this place.

In [50]:
#Create a filter to find a true value of where the placeID is equal to 135085
places[places['placeID']==135085]

Unnamed: 0,placeID,name
121,135085,Tortas Locas Hipocampo


In [51]:
cuisine[cuisine['placeID']==135085]

Unnamed: 0,placeID,Rcuisine
44,135085,Fast_Food


Here we see that the most popular restaurant is a fast food restaurant named Tortas Locas Hipocampo, which we will refer to as Tortas from here on.

## Preparing Data For Analysis

The next thing we need to do is to build a user-by-item utility matrix, with one row per user, one column per place, and the ratings as values.

In [52]:
places_crosstab = pd.pivot_table(data=frame, values='rating', index='userID', columns='placeID')
places_crosstab.head()

placeID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,132626,132630,132654,132660,132663,132665,132667,132668,132706,132715,132717,132723,132732,132733,132740,132754,132755,132766,132767,132768,132773,132825,132830,132834,132845,132846,132847,132851,132854,132856,...,135044,135045,135046,135047,135048,135049,135050,135051,135052,135053,135054,135055,135057,135058,135059,135060,135062,135063,135064,135065,135066,135069,135070,135071,135072,135073,135074,135075,135076,135079,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
U1001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,1.0,,,,,,,,...,,1.0,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
U1002,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,,...,,,,,,,,,1.0,,,,,,1.0,,1.0,,,,,,,,,,,,,,,,,1.0,,,,1.0,,
U1003,,,,,,,,,,,,,,,,,,,,,,2.0,,,,2.0,2.0,,,,,2.0,,,,,,,,,...,,,,,,,,,,,,,,,2.0,,,,0.0,,,,,,,,,2.0,,2.0,2.0,,,,,,,,,
U1004,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,1.0,2.0,,,,,,,,,,,,,,,,,,,,,2.0,,
U1005,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,...,,,,,,,1.0,,,,,,1.0,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,,


The first thing you will notice is that the matrix is full of null values. That's because most people review only a few places, so most user-place pairs have no rating; hence the sparsity of this matrix. The numbers are the ratings that each user gave to the places they reviewed.
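
To put a number on that sparsity, one option is to compare the count of non-null cells with the total number of cells. The following is a minimal sketch on a made-up frame (the users, places, and ratings are assumptions for illustration); the same two lines at the end would apply to places_crosstab.

```python
import pandas as pd

# Made-up ratings for illustration only
toy_frame = pd.DataFrame(
    {
        "userID": ["U1", "U1", "U2", "U3"],
        "placeID": [10, 20, 10, 30],
        "rating": [2, 1, 1, 0],
    }
)

# Same pivot as in the notebook: users as rows, places as columns
toy_crosstab = pd.pivot_table(
    data=toy_frame, values="rating", index="userID", columns="placeID"
)

# Fraction of user-place cells that hold no rating
filled = toy_crosstab.notna().sum().sum()
sparsity = 1 - filled / toy_crosstab.size
print(f"{sparsity:.1%} of the cells are empty")
```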

We can use the data above to find places that are correlated. We need to first isolate the user ratings from our restaurant called Tortas.

In [53]:
Tortas_ratings = places_crosstab[135085]
Tortas_ratings[Tortas_ratings>=0]

userID
U1001    0.0
U1002    1.0
U1007    1.0
U1013    1.0
U1016    2.0
U1027    1.0
U1029    1.0
U1032    1.0
U1033    2.0
U1036    2.0
U1045    2.0
U1046    1.0
U1049    0.0
U1056    2.0
U1059    2.0
U1062    0.0
U1077    2.0
U1081    1.0
U1084    2.0
U1086    2.0
U1089    1.0
U1090    2.0
U1092    0.0
U1098    1.0
U1104    2.0
U1106    2.0
U1108    1.0
U1109    2.0
U1113    1.0
U1116    2.0
U1120    0.0
U1122    2.0
U1132    2.0
U1134    2.0
U1135    0.0
U1137    2.0
Name: 135085, dtype: float64

Here we got 36 review scores and they range between zero and two.

## Evaluating Similarity Based on Correlation


To find the correlation between each place and Tortas, we call the corrwith method on the places crosstab and pass it the Tortas ratings series. This generates a Pearson r coefficient between Tortas and every other place that has been reviewed in the dataset. Keep in mind that this correlation is based on similarities in the user reviews given to each place.
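
Before running this on the real matrix, it may help to see the mechanics on a toy example. Pearson's r is the covariance of two rating vectors divided by the product of their standard deviations, and for each column corrwith uses only the rows where both that column and the target have a value. The matrix below is made up for illustration (the names toy, A, B, and C are assumptions), and the manual check is just one way to confirm the pairwise behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical toy utility matrix: rows are users, columns are places.
# NaN means the user did not rate that place.
toy = pd.DataFrame(
    {
        "A": [2.0, 1.0, 0.0, 2.0, np.nan],
        "B": [2.0, 1.0, 0.0, np.nan, 1.0],
        "C": [0.0, 1.0, 2.0, 1.0, np.nan],
    },
    index=["u1", "u2", "u3", "u4", "u5"],
)

# Correlate every column with the target column A
target = toy["A"]
r_all = toy.corrwith(target)

# Manual check for B, using only the users who rated both A and B
both = toy[["A", "B"]].dropna()
r_manual = np.corrcoef(both["A"], both["B"])[0, 1]

print(r_all)
print(r_manual)
```

Here B agrees with A on every shared user, so its coefficient is 1, while C moves in the opposite direction and gets a negative coefficient.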

In [54]:
similar_to_Tortas = places_crosstab.corrwith(Tortas_ratings)

corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
corr_Tortas.dropna(inplace=True)
corr_Tortas.head()


Unnamed: 0_level_0,PearsonR
placeID,Unnamed: 1_level_1
132572,-0.428571
132723,0.301511
132754,0.930261
132825,0.700745
132834,0.814823


A place can correlate strongly with Tortas and still be a poor recommendation if it has only a handful of ratings: the correlation is then computed from very little data and is not statistically meaningful.

We therefore need to take stock of how popular each of these places is, in addition to how well its review scores correlate with the ratings given to Tortas.

In [55]:
#Join corr_Tortas dataframe with the rating dataframe
Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])

In [56]:
Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)

Unnamed: 0_level_0,PearsonR,rating_count
placeID,Unnamed: 1_level_1,Unnamed: 2_level_1
135076,1.0,13
135085,1.0,36
135066,1.0,12
132754,0.930261,13
135045,0.912871,13
135062,0.898933,21
135028,0.892218,15
135042,0.881409,20
135046,0.867722,11
132872,0.840168,12


We now have a list of the most-reviewed places that are most similar to Tortas. The places with a PearsonR value of 1 are not meaningful here. Place 135085 is Tortas itself, so it correlates perfectly with its own ratings. For the other two, a value of exactly 1 most likely means that very few users reviewed both places and their ratings happened to line up perfectly. With only two shared reviewers, for example, Pearson's r can only be +1 or -1 (or undefined if either place received two identical ratings), and with a single shared reviewer it is undefined.

A correlation based on so few shared ratings is not meaningful. The places need to have more reviewers in common, so we will throw these places out and keep Tortas only as a reference row.
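
One way to check this directly, rather than inferring it from the coefficient, is to count how many users rated both the target and each other place. The sketch below uses a made-up matrix, and the threshold of 3 shared reviewers is an arbitrary assumption for illustration, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy utility matrix (rows: users, columns: places)
toy = pd.DataFrame(
    {
        "A": [2.0, 1.0, 0.0, 2.0, np.nan],
        "B": [np.nan, 1.0, 0.0, np.nan, 1.0],
        "C": [0.0, 1.0, 2.0, 1.0, np.nan],
    },
    index=["u1", "u2", "u3", "u4", "u5"],
)
target = toy["A"]

# Keep only users who rated the target, then count non-null ratings per place
n_common = toy.loc[target.notna()].count()

summary = pd.DataFrame(
    {"PearsonR": toy.corrwith(target), "n_common": n_common}
)

# Assumed minimum number of shared reviewers (arbitrary, for illustration)
MIN_COMMON = 3
reliable = summary[summary["n_common"] >= MIN_COMMON]
print(summary)
print(reliable)
```

In this toy data, B gets a perfect coefficient from only two shared reviewers, which is exactly the kind of row such a filter would drop.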

In [57]:
#Take the top correlated results that remain and see if any of these places also serve fast food
places_corr_Tortas = pd.DataFrame([135085, 132754, 135045, 135062, 135028, 135042, 135046], index = np.arange(7), columns=['placeID'])

#Take a summary of each of the top correlated place IDs and the types of food they serve
summary = pd.merge(places_corr_Tortas, cuisine,on='placeID')
summary

Unnamed: 0,placeID,Rcuisine
0,135085,Fast_Food
1,132754,Mexican
2,135028,Mexican
3,135042,Chinese
4,135046,Fast_Food


Here we get only five results even though we included seven place IDs in the data frame. Two of the places (135045 and 135062) are not listed in the cuisine dataset, so the default inner merge drops them from the output table. Setting Tortas itself aside, four of the six most correlated places have a known cuisine, and one of them, 135046, also serves fast food.
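
To see which IDs have no cuisine entry instead of having them silently dropped, a left merge with an indicator column is one option. This is a sketch on made-up IDs and cuisines (top_places and toy_cuisine are hypothetical names), not a rerun of the cell above.

```python
import pandas as pd

# Hypothetical IDs and cuisines, for illustration only
top_places = pd.DataFrame({"placeID": [1, 2, 3, 4]})
toy_cuisine = pd.DataFrame(
    {"placeID": [1, 3, 5], "Rcuisine": ["Fast_Food", "Mexican", "Italian"]}
)

# how="left" keeps every place; indicator=True adds a _merge column
merged = pd.merge(
    top_places, toy_cuisine, on="placeID", how="left", indicator=True
)

# Places present on the left side only have no cuisine record
missing = merged.loc[merged["_merge"] == "left_only", "placeID"].tolist()
print(merged)
print("No cuisine listed for:", missing)
```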

In [58]:
#Look up the name of the correlated place that also serves fast food
places[places['placeID']==135046]

Unnamed: 0,placeID,name
42,135046,Restaurante El Reyecito


To evaluate how relevant the similarity metric really is, let's consider the entire set of possibilities, meaning how many cuisine types are served at places in this dataset.

In [59]:
cuisine['Rcuisine'].describe()

count         916
unique         59
top       Mexican
freq          239
Name: Rcuisine, dtype: object

According to our cuisine dataframe, 59 unique cuisine types are served. In our last analysis, we got back six places that are similar to Tortas based on correlation and popularity, and one of them also serves fast food. Given that any of 59 cuisine types could have been offered, getting another fast food place among the six most similar places suggests that the correlation-based recommendation system is on track. In this case, it would be reasonable to recommend Restaurante El Reyecito to users who also like Tortas.
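
The number of unique cuisine types does not by itself say how common fast food is, so a rough baseline can help put the result in context. The sketch below uses a made-up cuisine table and treats the places as independent random draws, which is a simplifying assumption; the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up cuisine table for illustration only
toy_cuisine = pd.DataFrame(
    {
        "placeID": [1, 2, 3, 4, 5, 6, 7, 8],
        "Rcuisine": [
            "Mexican", "Mexican", "Mexican", "Mexican",
            "Fast_Food", "Fast_Food", "Italian", "Chinese",
        ],
    }
)

# Share of rows per cuisine type
share = toy_cuisine["Rcuisine"].value_counts(normalize=True)
p_fast = share["Fast_Food"]

# Chance that at least one of k randomly picked places serves fast food,
# assuming independent picks (a simplification)
k = 4
p_at_least_one = 1 - (1 - p_fast) ** k
print(p_fast, p_at_least_one)
```

Running the same value_counts call on cuisine['Rcuisine'] would give the actual share of fast food entries to compare against.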