In [1]:
import numpy as np
import pandas as pd

from sklearn.decomposition import TruncatedSVD

In [3]:
df_recsys = pd.read_csv('./../data/sb_restaurants_clean.csv', index_col=0)
df_recsys.head(5)

Unnamed: 0,user_id,business_id,stars,name
0,IlWLPCRQp8iqX0X-ExQccQ,bdfZdB2MTXlT6-RBjSIpQg,4.0,Pho Bistro
1,WSMIRegvrsEgFGEraf_LwQ,bdfZdB2MTXlT6-RBjSIpQg,5.0,Pho Bistro
2,2odfcvFhkb8SedI3vCLpmQ,bdfZdB2MTXlT6-RBjSIpQg,5.0,Pho Bistro
3,OEKu0Rts0spELpbnucDKmA,bdfZdB2MTXlT6-RBjSIpQg,5.0,Pho Bistro
4,4MSEWnnxKhNdh7GgNHM-YQ,bdfZdB2MTXlT6-RBjSIpQg,4.0,Pho Bistro


In [4]:
df_recsys.shape 

(37830, 4)

We start by picking an arbitrary user and looking at which restaurants they have rated.

In [5]:
# Select one user by ID (hard-coded here rather than sampled at random)
idx = (df_recsys['user_id'] == '4MSEWnnxKhNdh7GgNHM-YQ')

# Print the user's reviews
print('Total restaurants rated by the user: ', idx.sum())
df_recsys[idx]

Total restaurants rated by the user:  5


Unnamed: 0,user_id,business_id,stars,name
4,4MSEWnnxKhNdh7GgNHM-YQ,bdfZdB2MTXlT6-RBjSIpQg,4.0,Pho Bistro
1586,4MSEWnnxKhNdh7GgNHM-YQ,ju4YP8SLdR_BmWr_-Xh83Q,3.0,Phamous Cafe
10500,4MSEWnnxKhNdh7GgNHM-YQ,zJpZ-uQ_F0XVgK1u98uWSg,5.0,Gloria's Gourmet Kitchen
25128,4MSEWnnxKhNdh7GgNHM-YQ,IoAubn7CpU2rqcURxUo1MA,4.0,Barb's Pies
34522,4MSEWnnxKhNdh7GgNHM-YQ,yM7eA2uuH3Ch7OYVj3PKSw,5.0,Wabi Sabi


Next we build a utility matrix, which holds the rating each user gave to each restaurant they have visited. Restaurants a user has not rated are filled with 0.

In [64]:
# Create a utility matrix
UtlMtrx = df_recsys.pivot_table(values='stars', index='user_id', columns='name', fill_value=0)

UtlMtrx.head() 

name,101 Deli,1114 Sports Bar & Games,4 Eggs & Pizza,A Slice of Woodstock's,AH Juice Organics,AR Catering,ASIE,Aksum Restaurant,Aladdin Cafe,Alcazar Tapas Bar,...,Yona Redz,Your Choice,Your Place Thai Restaurant,Yume Sushi Japanese Restaurant,Z's - Taphouse & Grill,Zaytoon,Zen Yai Thai Cuisine,Zizzo's Coffeehouse,Zodo's Bowlero Bowling & Beyond,Zookers Cafe
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
---zemaUC8WeJeWKqS6p9Q,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0
--2F5G5LKt3h2cAXJbZptg,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0
--7XOV5T9yZR5w1DIy_Dog,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0
--8YG4BoOWFGjdh9Fxop-w,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0
--_QWYpJ2In2IgW_bouSuQ,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0


- Most values in the matrix are 0. With 659 restaurants, each user can be expected to have rated only a handful, which explains the large number of zeros.
- In linear algebra, a matrix like this is called a sparse matrix.
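
As a side note on sparsity, a matrix like this can be stored compactly by keeping only its non-zero entries. The sketch below uses `scipy.sparse` on a small made-up matrix (the values are hypothetical, not taken from the dataset) to show the idea; `TruncatedSVD` also accepts SciPy sparse input.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy utility matrix: 4 users x 5 restaurants, mostly zeros.
dense = np.array([
    [5, 0, 0, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 4],
    [0, 2, 0, 0, 0],
], dtype=float)

# CSR format stores only the non-zero values and their positions.
compact = sparse.csr_matrix(dense)

# Fraction of cells that actually hold a rating.
density = compact.nnz / (dense.shape[0] * dense.shape[1])
print(compact.nnz, density)
```

Whether the conversion pays off depends on the matrix size; for the small matrices used in this notebook the dense form works, as the cells here show.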

In [7]:
print('Total elements in the utility matrix: %d' % (UtlMtrx.size))
print('Total non-zero elements: %d' % (np.count_nonzero(UtlMtrx)))
print('Percentage of non-zero elements: %.1f%%' % (100 * np.count_nonzero(UtlMtrx) / UtlMtrx.size))

Total elements in the utility matrix: 15465412
Total non-zero elements: 37166
Percentage of non-zero elements: 0.2%
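
The non-zero count (37166) is lower than the number of rows in the data (37830). One possible explanation, not verified here, is that `pivot_table` aggregates with the mean by default, so several rows that share the same `user_id` and `name` (for example, two branches of a chain with different `business_id` values) collapse into a single cell. The toy data below is made up purely to illustrate that behavior.

```python
import pandas as pd

# Hypothetical toy ratings: user u1 rated two branches that share a name.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "business_id": ["b1", "b2", "b3"],
    "stars": [4.0, 2.0, 5.0],
    "name": ["Subway", "Subway", "Cafe A"],
})

# Same call shape as in the notebook; the default aggfunc is the mean.
toy_utility = toy.pivot_table(values="stars", index="user_id",
                              columns="name", fill_value=0)
print(toy_utility)
```

Here the two `Subway` ratings from `u1` (4.0 and 2.0) end up as one cell holding their average, 3.0. Pivoting on `business_id` instead of `name` would keep branches separate, if that distinction matters.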


We will build the recommender system using SVD (Singular Value Decomposition), a dimensionality reduction method.

This kind of solution is commonly used as a first recommender system, based on the ratings users have given in the past.

To work with one vector per restaurant, we take the transpose of the utility matrix, so that restaurants become the rows.
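
To make the reduction step concrete, here is a minimal sketch on a made-up matrix (random values, not the real data). A truncated SVD approximates $X \approx U_k \Sigma_k V_k^T$ and describes each row with $k$ numbers; in scikit-learn, `transform` projects the rows onto the top-$k$ right singular vectors stored in `components_`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical toy matrix: 6 restaurants (rows) x 20 users (columns).
toy_X = rng.integers(0, 6, size=(6, 20)).astype(float)

k = 2  # number of singular vectors kept (illustrative choice)
toy_svd = TruncatedSVD(n_components=k, random_state=0)
reduced = toy_svd.fit_transform(toy_X)

# Each restaurant is now described by k numbers instead of 20.
print(reduced.shape)

# transform() is a projection onto the rows of components_.
projected = toy_X @ toy_svd.components_.T
print(np.allclose(toy_svd.transform(toy_X), projected))
```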

In [8]:
# Transpose the utility matrix
X = UtlMtrx.T
X.shape

(659, 23468)

In [65]:
X.head()

user_id,---zemaUC8WeJeWKqS6p9Q,--2F5G5LKt3h2cAXJbZptg,--7XOV5T9yZR5w1DIy_Dog,--8YG4BoOWFGjdh9Fxop-w,--_QWYpJ2In2IgW_bouSuQ,--kinfHwmtdjz03g8B8z8Q,-0-Cbo-4YzKmoJXpnckwbg,-0-TtVhV4PIUoDpUCOC0uQ,-06H6BQ5cS5A5Jpd_iiwCA,-09MjuD1Q0KwSYwdYbAzMw,...,zyP0-AxYAb61wVroBgR0ng,zyQcIz_DR9CM-Tah3B7Tfw,zyWX5BNC9JhEsrrzaK-X9g,zyakHeX-9wVqU33jn6NMtA,zym3VpXHS4YWRMXUJ-MxsQ,zyvAj-SNHqHnrTN0X5_adA,zyzSePGcKiiA15e1takl4g,zzkJpm2B4PNPHnXg-CMyog,zzudZ50i88aStmQ6Ft1cVA,zzzGgfvrSJ4AQeKtcgocIw
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
101 Deli,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1114 Sports Bar & Games,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Eggs & Pizza,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Slice of Woodstock's,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AH Juice Organics,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
SVD = TruncatedSVD(n_components=658, random_state=42)  
SVD.fit(X)

num_sv = 7

print('Share of the singular-value sum discarded when keeping only the first %d singular values:' % num_sv)
print('%.1f%%' %  (100 * (1- (SVD.singular_values_[0:num_sv]).sum() / (SVD.singular_values_).sum())))

Share of the singular-value sum discarded when keeping only the first 7 singular values:
95.8%
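
The figure above is one minus the share of the singular-value sum carried by the first few singular values, so it measures what is left out rather than what is kept. A different and commonly used summary is the cumulative explained variance ratio, which `TruncatedSVD` exposes as `explained_variance_ratio_`. The sketch below computes both on a made-up random matrix; the numbers it prints say nothing about the restaurant data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(1)
# Hypothetical toy matrix: 8 rows x 30 columns of integer "ratings".
toy_X = rng.integers(0, 6, size=(8, 30)).astype(float)

toy_svd = TruncatedSVD(n_components=5, random_state=1).fit(toy_X)

sv = toy_svd.singular_values_
# Share of the fitted singular-value sum kept by the first 2 values;
# the notebook cell prints 1 minus this kind of ratio.
kept_share = sv[:2].sum() / sv.sum()

# Cumulative fraction of variance explained by the first 1, 2, ... components.
cum_var = np.cumsum(toy_svd.explained_variance_ratio_)

print(round(float(kept_share), 3))
print(np.round(cum_var, 3))
```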



In [10]:
num_sv = 7

SVD = TruncatedSVD(n_components=num_sv, random_state=42)

resultant_matrix = SVD.fit_transform(X)
resultant_matrix.shape

(659, 7)

In [28]:
# Pearson correlation matrix between rows (one row per restaurant)
corrMtx = np.corrcoef(resultant_matrix)

# Show the top-left 5x5 block
corrMtx[0:5,0:5]


array([[1.        , 0.40414244, 0.2721704 , 0.59065268, 0.38875077],
       [0.40414244, 1.        , 0.98296528, 0.24337884, 0.96173764],
       [0.2721704 , 0.98296528, 1.        , 0.12296692, 0.93044831],
       [0.59065268, 0.24337884, 0.12296692, 1.        , 0.43032776],
       [0.38875077, 0.96173764, 0.93044831, 0.43032776, 1.        ]])
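
For reference, `np.corrcoef` treats each row of its input as one variable, which is why passing the `(659, 7)` matrix yields a restaurant-by-restaurant result. A minimal sketch with made-up rows:

```python
import numpy as np

# Three hypothetical row vectors: the second is twice the first,
# the third is the first reversed.
rows = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0],
])

corr = np.corrcoef(rows)
print(corr.shape)   # one row and one column per input row
print(corr[0, 1])   # perfectly correlated rows
print(corr[0, 2])   # perfectly anti-correlated rows
```

Rows that move together get a coefficient close to 1, rows that move in opposite directions get a coefficient close to -1, and the diagonal is always 1.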

We pick one restaurant and look at its correlation with every other restaurant in the matrix.

In [51]:
# Look for Taco Bell index

liked = 'Taco Bell'

names = UtlMtrx.columns 
names_list = list(names)
id_liked = names_list.index(liked)

id_liked

543

In [52]:
# Check the correlation with some restaurants
corr_recom = corrMtx[id_liked]
print(corr_recom.shape)
corr_recom[0:10]

(659,)


array([ 0.27722971,  0.84905127,  0.88626614, -0.23836612,  0.68905721,
        0.19158651,  0.83262829,  0.4679791 ,  0.84919995,  0.91376032])

We verify that the correlation of the chosen restaurant with itself is 1.

In [53]:
# Check restaurant correlation with itself
corr_recom[id_liked]

1.0

#### Recommendations for the selected restaurant

#### *Example 1*:

In [54]:
# Restaurant 'Taco Bell'

print('Recommendations: ')
list(names[(corr_recom > .97) & (corr_recom < .99)]) 

Recommendations: 


['El Rincon Bohemio',
 'Kanaloa Seafood',
 'La Salsa Fresh Mexican Grill',
 'Le Cafe Stella',
 "Louie's California Bistro",
 "Norton's Pastrami & Deli",
 "Rudy's Restaurant No 1",
 "Rusty's Pizza Parlor",
 'Santa Barbara Roasting Company',
 'Subway',
 'Sushi GoGo',
 'The Creekside Restaurant & Bar',
 'The Natural Cafe',
 'The Spot']

We sort them in descending order of correlation coefficient.

In [55]:
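# Keep restaurants whose correlation falls strictly between 0.97 and 0.99
ids = (corr_recom > .97) & (corr_recom < 0.99)

# Pair each correlation with its restaurant name
tmp = list(zip(corr_recom[ids], names[ids]))

# Sort by correlation, highest first
sorted(tmp, key=lambda x: x[0], reverse=True)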

[(0.9833773235273976, 'Sushi GoGo'),
 (0.981287684808783, 'Kanaloa Seafood'),
 (0.9798377810806902, 'El Rincon Bohemio'),
 (0.9784690140612011, "Rusty's Pizza Parlor"),
 (0.9747570172467663, 'The Spot'),
 (0.9741633073734431, 'The Creekside Restaurant & Bar'),
 (0.974039616483174, 'Le Cafe Stella'),
 (0.9739885560901294, "Norton's Pastrami & Deli"),
 (0.9736606168271407, "Louie's California Bistro"),
 (0.9725712765566373, 'La Salsa Fresh Mexican Grill'),
 (0.9725167542395202, "Rudy's Restaurant No 1"),
 (0.9722586533717934, 'The Natural Cafe'),
 (0.9714720542932522, 'Santa Barbara Roasting Company'),
 (0.9713876876267626, 'Subway')]
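
The 0.97 to 0.99 window is a manual choice, and it also leaves out any restaurant correlated above 0.99. An alternative, sketched here under the assumption that a fixed number of suggestions is wanted, is to rank all correlations and keep the top N while skipping the selected item itself. The helper name `top_n_similar` and the toy values are hypothetical.

```python
import numpy as np

def top_n_similar(corr_row, labels, self_idx, n=5):
    """Return the n (correlation, label) pairs with the highest
    correlation, skipping the item at position self_idx."""
    order = np.argsort(corr_row)[::-1]                # indices, highest first
    order = [i for i in order if i != self_idx][:n]   # drop the item itself
    return [(float(corr_row[i]), labels[i]) for i in order]

# Hypothetical correlations of item "B" with five items (including itself).
toy_corr = np.array([0.20, 1.00, 0.95, 0.99, -0.30])
toy_labels = ["A", "B", "C", "D", "E"]

print(top_n_similar(toy_corr, toy_labels, self_idx=1, n=2))
```

With the notebook's variables, the equivalent call would be `top_n_similar(corr_recom, names, id_liked, n=10)`.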

#### *Example 2:*

In [61]:
# Restaurant 'Presto Pasta'
liked = "Presto Pasta"
id_liked = names_list.index(liked)
corr_recom = corrMtx[id_liked]

print('Recommendations: ')
ids = (corr_recom > .98) & (corr_recom < 0.99)

# Pair each correlation with its restaurant name
tmp = list(zip(corr_recom[ids], names[ids]))

sorted(tmp, key=lambda x: x[0], reverse=True)

Recommendations: 


[(0.9895331095411122, "Ca' Dario Goleta"),
 (0.9894856825419752, 'Pizza Online Company'),
 (0.9893928162973604, 'Eureka!'),
 (0.9877970831852984, "Petrini's Italian Restaurant - Santa Barbara"),
 (0.9876517652603611, 'Dave’s Dogs'),
 (0.9861694750347899, 'On The Alley - Goleta'),
 (0.9851451845488123, "Lito's Take  Out"),
 (0.9848178253853797, 'South Coast Deli- Carrillo'),
 (0.9846829822461028, 'IHOP'),
 (0.9827862975267132, "Guicho's Eatery"),
 (0.9826138192217945, 'Java Station'),
 (0.9817918052692886, 'JJ’s Diner'),
 (0.9817349443452706, 'Zen Yai Thai Cuisine'),
 (0.9814142632448193, 'La Tapatia'),
 (0.9814033472917097, 'China Pavilion'),
 (0.9810185527577012, 'Ichiban'),
 (0.9804089530385499, "Big Joe's Tacos"),
 (0.9803203307117494, "Super Cuca's Restaurant"),
 (0.9800081323765931, 'Nutbelly Pizzeria and Deli')]

_ _ _

#### User-based model

In [66]:
Xu = UtlMtrx
Xu.shape

(23468, 659)

In [67]:
Xu.head()

name,101 Deli,1114 Sports Bar & Games,4 Eggs & Pizza,A Slice of Woodstock's,AH Juice Organics,AR Catering,ASIE,Aksum Restaurant,Aladdin Cafe,Alcazar Tapas Bar,...,Yona Redz,Your Choice,Your Place Thai Restaurant,Yume Sushi Japanese Restaurant,Z's - Taphouse & Grill,Zaytoon,Zen Yai Thai Cuisine,Zizzo's Coffeehouse,Zodo's Bowlero Bowling & Beyond,Zookers Cafe
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
---zemaUC8WeJeWKqS6p9Q,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0
--2F5G5LKt3h2cAXJbZptg,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0
--7XOV5T9yZR5w1DIy_Dog,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0
--8YG4BoOWFGjdh9Fxop-w,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0
--_QWYpJ2In2IgW_bouSuQ,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.0,0,0,0


In [68]:
SVDu = TruncatedSVD(n_components=658, random_state=42)  
SVDu.fit(Xu)

num_sv = 10

print('Share of the singular-value sum discarded when keeping only the first %d singular values:' % num_sv)
print('%.1f%%' %  (100 * (1- (SVDu.singular_values_[0:num_sv]).sum() / (SVDu.singular_values_).sum())))

Share of the singular-value sum discarded when keeping only the first 10 singular values:
94.4%


In [69]:
num_sv = 10

SVDu = TruncatedSVD(n_components=num_sv, random_state=42)

resultant_umatx = SVDu.fit_transform(Xu)
resultant_umatx.shape

(23468, 10)

In [70]:
# Pearson correlation matrix
corrUmtx = np.corrcoef(resultant_umatx)
corrUmtx.shape  

(23468, 23468)
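
A 23468 x 23468 matrix of 64-bit floats takes roughly 4.4 GB of memory. Since only one user's row is used afterwards, a lighter option is to compute that single row directly. The helper below is a sketch (the name `corr_one_vs_all` is hypothetical) that assumes no row is constant, because a constant row would give a zero norm and a division by zero.

```python
import numpy as np

def corr_one_vs_all(M, i):
    """Pearson correlation of row i of M with every row of M,
    without building the full row-by-row correlation matrix."""
    centered = M - M.mean(axis=1, keepdims=True)   # subtract each row's mean
    norms = np.linalg.norm(centered, axis=1)       # per-row vector length
    return (centered @ centered[i]) / (norms * norms[i])

rng = np.random.default_rng(2)
toy = rng.normal(size=(50, 10))   # hypothetical reduced user matrix

row = corr_one_vs_all(toy, 3)
# Matches the corresponding row of the full correlation matrix.
print(np.allclose(row, np.corrcoef(toy)[3]))
```

With the notebook's variables this would be `corr_one_vs_all(resultant_umatx, user_id)`, once `user_id` has been looked up.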

We pick an arbitrary user.

In [71]:
user = '4MSEWnnxKhNdh7GgNHM-YQ'

userids = UtlMtrx.index 
users_list = list(userids)
user_id = users_list.index(user)

corr_urecom = corrUmtx[user_id]

In [72]:
uids = (corr_urecom > .98) & (corr_urecom < .99)

# Pair each correlation with its user ID
tmp = list(zip(corr_urecom[uids], userids[uids]))

print('Related users: ')
sorted(tmp, key=lambda x: x[0], reverse=True)

Related users: 


[(0.9870877598077107, 'KcZ52cNO_o9SR5WFkpFCBw'),
 (0.9840279154451025, 'jxFgDOyjI1u0vznjm3q2QQ'),
 (0.9836280647749515, 'BWjVF5476cdpwqXKh5r8Kw'),
 (0.982849830971292, 'wK0Q4UtcvMsZNLitf5hKTA'),
 (0.9825545775716569, 'z1HlPVSEde7iuaPtwTx6Fw'),
 (0.9822834953748576, 'knReN1F-2WbpQzyhElvnYQ'),
 (0.9809998598494641, 'QmF2rD3E4erx9471QxIUDQ'),
 (0.9802074702158957, 'HhW_eGhtUJryU2ZaI0uSpw')]

For more context, let's look at the ratings given by the first two users in the list.

In [73]:
idx = (df_recsys['user_id']== 'KcZ52cNO_o9SR5WFkpFCBw')
# print user's reviews
df_recsys[idx]

Unnamed: 0,user_id,business_id,stars,name
8613,KcZ52cNO_o9SR5WFkpFCBw,IZZ7J14VnJGyUqYH1tgT6w,5.0,Satellite
11387,KcZ52cNO_o9SR5WFkpFCBw,4h4Uskb7B8OTI7oDR7wImQ,5.0,Mesa Burger - Goleta
27245,KcZ52cNO_o9SR5WFkpFCBw,3rzlcCvPUJ56jvYWl7iGvA,5.0,Modern Times Kitchen
32656,KcZ52cNO_o9SR5WFkpFCBw,UQZe-qWOpGiyEPsTUko5Rg,5.0,Rascal's Vegan Food


In [74]:
idx = (df_recsys['user_id']== 'jxFgDOyjI1u0vznjm3q2QQ')
# print user's reviews
df_recsys[idx]

Unnamed: 0,user_id,business_id,stars,name
131,jxFgDOyjI1u0vznjm3q2QQ,MbzgGsMQpGyVrUJXi_Jw0Q,5.0,Dawn Patrol
9140,jxFgDOyjI1u0vznjm3q2QQ,E1R-xslwl7XeTN5lQ7mWPg,5.0,Little Kitchen
10351,jxFgDOyjI1u0vznjm3q2QQ,q36uZc2hQml10YGHazJaoQ,5.0,Secret Bao
15291,jxFgDOyjI1u0vznjm3q2QQ,L6nIOUwcTGgQKExGHmvTzQ,4.0,Noodle City
16239,jxFgDOyjI1u0vznjm3q2QQ,2MWEEuY3dtD2DNP99sNe_A,4.0,Elubia's Kitchen
21882,jxFgDOyjI1u0vznjm3q2QQ,ZJzlns_dFkapVajmdFHGuw,5.0,Embermill
26261,jxFgDOyjI1u0vznjm3q2QQ,O2lFAWD2WXXx5EKkzPutWQ,5.0,Your Place Thai Restaurant
31663,jxFgDOyjI1u0vznjm3q2QQ,bBA62rVklEG2WUnvi8RtrQ,5.0,El Sitio
32639,jxFgDOyjI1u0vznjm3q2QQ,UQZe-qWOpGiyEPsTUko5Rg,4.0,Rascal's Vegan Food
35370,jxFgDOyjI1u0vznjm3q2QQ,xkS_mRHVHuX0ko6kVIyBdA,5.0,Masala Spice Indian Cuisine


Both users gave a good rating to **Rascal's Vegan Food** (5.0 and 4.0). It is the only restaurant the two have in common: **Mesa Burger - Goleta** was rated only by the first user.
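
A quick way to check which restaurants two users have in common is a set intersection over the names each one has rated. The sketch below copies the names and stars from the two tables above into a small DataFrame; the helper name `common_restaurants` is hypothetical.

```python
import pandas as pd

user_a = "KcZ52cNO_o9SR5WFkpFCBw"
user_b = "jxFgDOyjI1u0vznjm3q2QQ"

# Ratings copied from the two tables above (name and stars only).
ratings = pd.DataFrame({
    "user_id": [user_a] * 4 + [user_b] * 10,
    "name": [
        "Satellite", "Mesa Burger - Goleta", "Modern Times Kitchen",
        "Rascal's Vegan Food",
        "Dawn Patrol", "Little Kitchen", "Secret Bao", "Noodle City",
        "Elubia's Kitchen", "Embermill", "Your Place Thai Restaurant",
        "El Sitio", "Rascal's Vegan Food", "Masala Spice Indian Cuisine",
    ],
    "stars": [5.0, 5.0, 5.0, 5.0,
              5.0, 5.0, 5.0, 4.0, 4.0, 5.0, 5.0, 5.0, 4.0, 5.0],
})

def common_restaurants(df, first, second):
    """Names rated by both users."""
    a = set(df.loc[df["user_id"] == first, "name"])
    b = set(df.loc[df["user_id"] == second, "name"])
    return a & b

print(common_restaurants(ratings, user_a, user_b))
```

On the full data, the same function can be called with `df_recsys` in place of `ratings`.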