# Content-Based Recommendation Model

**Objective:** Improve recommendation for new and low-popularity businesses.<br><br>
**Motivation:** While our baseline collaborative filtering models appear to make strong predictions for businesses with a lot of reviews, we observe the cold start problem; poor performance for businesses with few reviews.<br><br>
**Rationale:** By mapping all businesses into the same high-dimensional space, we believe a content-based model will allow us to leverage similarity between businesses. In other words, if a businesses with a small number of reviews is similar (in terms of "business category") to a business with many reviews, we can presumably recommend the low-popularity businesses to fans of the high-popularity business. We expect our content-based model to do this effectively.

In [0]:
# Import required packages:
import pandas as pd
import numpy as np
import scipy
from google.colab import files, drive
import time

In [0]:
# Connect to Google Drive (to load raw data)
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Prep development dataset

In [3]:
# Download data:

csv_files = {
  # 'rating': '1kUiiQxmaRevPbMVryD_9kxOT3hL0yXbT',
  # 'user': '1wgtFyCliBnHf3814uG6OWLD_JTtKOYj4',
  'business': '1lTKDsBNMMYJ9cthYQUvS6gqPauk_YGY5',
  # 'checkin': '1AVdM2PY56JOvDBUJimAZ7tHnKU2Dyczb',
  # 'tip': '1zQhiSZgadQyMFA0UjDkfti_AbDDkonU-',
  # 'photo': '129APnJqG63D_C_MDhJsBrlQFZzdNlU11',
  'train_set_new': '1-1vfc5jxrCggpVYDkf8Z2rVLjIIOh6bu',
}

# Create a dictionary of dataframes:
dfs = {}

for key, value in csv_files.items():
  csv_name = key + '.csv'
  downloaded = drive.CreateFile({'id': value})
  downloaded.GetContentFile(csv_name)
  dfs[key] = pd.read_csv(csv_name, low_memory=False)
  print("Done with: ", key)

b = dfs['business'][['business_id','categories','city']]
train_set_new = dfs['train_set_new'].drop(['review_id','date'], axis=1)

Done with:  business
Done with:  train_set_new


In [4]:
# Clean data: Most importantly, can't have multiple reviews per user/business combo for this method.
# Solution: Group by user_id and business_id, take mean rating (then join on your business categories)

content_based_train_data = pd.DataFrame(train_set_new.groupby(['user_id','business_id']).agg('mean')['rating'].reset_index())
content_based_train_data = pd.merge(content_based_train_data, b, on='business_id')
print(train_set_new.shape, content_based_train_data.shape)
content_based_train_data.head()

(3398090, 3) (3281121, 5)


Unnamed: 0,user_id,business_id,rating,categories,city
0,---1lKK3aKOuomHnwAkAow,--9e1ONYQuAa-CB_Rrw7Tw,4.0,"Cajun/Creole, Seafood, Steakhouses, Restaurants",Las Vegas
1,-9Wnr-QvOeDfuIRyaH6VRQ,--9e1ONYQuAa-CB_Rrw7Tw,1.0,"Cajun/Creole, Seafood, Steakhouses, Restaurants",Las Vegas
2,-CaPL8t5RmQM8sIxdlVCxQ,--9e1ONYQuAa-CB_Rrw7Tw,4.0,"Cajun/Creole, Seafood, Steakhouses, Restaurants",Las Vegas
3,-H9hVJDZM60kVvyXYtSwYQ,--9e1ONYQuAa-CB_Rrw7Tw,4.0,"Cajun/Creole, Seafood, Steakhouses, Restaurants",Las Vegas
4,-L1yBTxJ9O9HysxDNjHuog,--9e1ONYQuAa-CB_Rrw7Tw,5.0,"Cajun/Creole, Seafood, Steakhouses, Restaurants",Las Vegas


### Pre-process category data

In [5]:
# Make sure you have no NaNs in "categories"
content_based_train_data['categories'] = content_based_train_data['categories'].fillna('No_cat')

# Prepare your categories to be vectorized (make sure each category is a single 'word')
def category_pre_process(row):
  return str(row['categories']).replace(" ", "").replace("(", "_").replace(")", "_").replace("&", "_").replace("-", "_").replace("/", "_")

content_based_train_data['categories'] = content_based_train_data.apply(category_pre_process, axis=1)
content_based_train_data.head()

Unnamed: 0,user_id,business_id,rating,categories,city
0,---1lKK3aKOuomHnwAkAow,--9e1ONYQuAa-CB_Rrw7Tw,4.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas
1,-9Wnr-QvOeDfuIRyaH6VRQ,--9e1ONYQuAa-CB_Rrw7Tw,1.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas
2,-CaPL8t5RmQM8sIxdlVCxQ,--9e1ONYQuAa-CB_Rrw7Tw,4.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas
3,-H9hVJDZM60kVvyXYtSwYQ,--9e1ONYQuAa-CB_Rrw7Tw,4.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas
4,-L1yBTxJ9O9HysxDNjHuog,--9e1ONYQuAa-CB_Rrw7Tw,5.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas


### Quickly upload this csv so it can be quickly retrieved later:

In [0]:
def upload_csv_to_drive(destination_folder, dataframe, csv_filename):
  dataframe.to_csv(csv_filename, index=False)
  tmp = drive.CreateFile({"parents": [{"kind": "drive#fileLink", "id": destination_folder}]})
  tmp.SetContentFile(csv_filename)
  tmp.Upload()
  print("Upload complete for: ", csv_filename)

upload_csv_to_drive("138BMjfpGmQescIUmfX_4TA6A2l6g_eP_", content_based_train_data, "content_based_train_data.csv")

Upload complete for:  content_based_train_data.csv


### Quickly retrieve CSV:

In [6]:
# Download data:
csv_files = {
  'content_based_train_data': '1YCMG6R4dVz4Gaf-fqFNtHpsj-wYEQpod',
}

# Create a dictionary of dataframes:
dfs = {}

for key, value in csv_files.items():
  csv_name = key + '.csv'
  downloaded = drive.CreateFile({'id': value})
  downloaded.GetContentFile(csv_name)
  dfs[key] = pd.read_csv(csv_name, low_memory=False)
  print("Done with: ", key)

content_based_train_data = dfs['content_based_train_data']

Done with:  content_based_train_data


In [7]:
content_based_train_data.head()

Unnamed: 0,user_id,business_id,rating,categories,city
0,---1lKK3aKOuomHnwAkAow,--9e1ONYQuAa-CB_Rrw7Tw,4.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas
1,-9Wnr-QvOeDfuIRyaH6VRQ,--9e1ONYQuAa-CB_Rrw7Tw,1.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas
2,-CaPL8t5RmQM8sIxdlVCxQ,--9e1ONYQuAa-CB_Rrw7Tw,4.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas
3,-H9hVJDZM60kVvyXYtSwYQ,--9e1ONYQuAa-CB_Rrw7Tw,4.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas
4,-L1yBTxJ9O9HysxDNjHuog,--9e1ONYQuAa-CB_Rrw7Tw,5.0,"Cajun_Creole,Seafood,Steakhouses,Restaurants",Las Vegas


### Small development dataset (subset of training)


*   First try for a single city
*   Then try for full training data



In [8]:
# Random sample of 10k user-item pairs:
# oj = content_based_train_data.sample(frac=1).head(n=10000)
# print("oj shape: ", oj.shape, "Users: ", len(oj['user_id'].unique()), "Businesses: ", len(oj['business_id'].unique()))

# Toronto Only:
oj = content_based_train_data[content_based_train_data['city'] == 'Toronto']
print("oj shape: ", oj.shape, "Users: ", len(oj['user_id'].unique()), "Businesses: ", len(oj['business_id'].unique()))

# All data: This is too much data to be processed in a single iteration. 
# This method will need to be run city-by-city.
# oj = content_based_train_data
# print("oj shape: ", oj.shape, "Users: ", len(oj['user_id'].unique()), "Businesses: ", len(oj['business_id'].unique()))

del content_based_train_data
# All subsequent matrices should match these dimensions!
# With 10k rows and 8k unique businesses, not many users went to the same business. Maybe we should focus on a city? E.g. Toronto?

oj shape:  (302091, 5) Users:  31595 Businesses:  17839


## Implement content-based recommendation:
### 1) Setup data (user_pref, b)

In [0]:
# THIS IS KEY, B2 IS OUR SIMPLE BUSINESS DF
b2 = oj[['business_id','categories']].drop_duplicates()
user_pref2 = oj[['user_id','business_id','rating']]

print(b2.shape)
print(user_pref2.shape)

unique_users = set(user_pref2['user_id'].unique())
# unique_users


(17839, 2)
(302091, 3)


### 2) Vectorize biz data:

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf2 = TfidfVectorizer(sublinear_tf=True, min_df=1, norm='l2', encoding='latin-1', ngram_range=(1,1), stop_words='english') 

b2_ids = list(b2['business_id']) # use this later
b2_vec = tfidf2.fit_transform(b2['categories'])

print(b2_ids)
print("Businesses x categories: ", b2_vec.shape)
# print(tfidf.get_feature_names())

['9HAfloFDDOH0f8fmA5nkaw', 'GTNhbajbPNao5ITndlYy6Q', 'IiG1_hV_TyQgLzh2j8Zncg', 'brNYDrnZhjZjbef9iXQVQw', '-b94nkPVLQw95zgtDhcpYA', '1W_gZM_uuEvJqapbIl6z9Q', '41o1FUbCYKJv2djtnlkzlg', '4SyAKQevPuB4punANp41-Q', '5DDvJhkk3zsd9jBxDQpkow', '68PJJkcq_i0SlLqO6t7Qxw', '6BC87j5FxoIwa-atC77WYQ', 'BUcTdN-rNE8urCCQuxSOQA', 'DjRMmmVjz2UIH5y5-dt8ww', 'G8qvbhfbCyMAeZzYrbmZxA', 'GkY6UWWn0Fz2ehcuBp66pg', 'Ibp4hEKSE8JaX9OvfEiFqg', 'IflW1yOEcNQrB2SWxHKoSg', 'N93EYZy9R0sdlEvubu94ig', 'QaxDKkqYTtVYZJcqBNTnvQ', 'RMLdpPgaYUsa9LIS7UnTNQ', 'RPmgVYvtqg2MaKYSxUrchQ', 'UXHlrjrt72KLojnXZsz7gw', 'VMfEWlSwDt9fhwNN868NNA', 'Y8qupLl3mHmXeNSbsHEcrA', '_HqZL3gK98-Q4ObAoyM1aw', '_cjPEH9wXhKS-HQe_U3M4Q', 'dAs_epbGSYP0uT44fI7W-w', 'd_AcktF-fWL9zzvNPg7euQ', 'e1jI2-vU1fd4UKpyogtxuA', 'fh8a_k9oslEDSHbmJLzUrQ', 'hQtGXpMq4gyRWH4s2iNpUQ', 'iGEvDk6hsizigmXhDKs2Vg', 'iZJ5pdY558VodrEumGyVug', 'jrrNP2Ait97pp3Z6oVQtPA', 'kOFDVcnj-8fd3doIpCQ06A', 'klu0zF1rWAoNAhKPsFyUog', 'oJSa5HCiZXKLXxggQmecEQ', 'oOGLDf2rzeCPS7UQ8hhPlQ', 'r_BrIgzYcw

### 3) Take users, create a User x Business sparse matrix. This is simply a user-item rating matrix, but it is:

1.   A scipy CSR sparse matrix (important to avoid memory problems later)
2.   Guaranteed to have one column per business, even if no users rated that business (important for the dimensions of subsequent matrix math to work).

In [0]:
# Ensure that you have all of your businesses represented in your ratings matrix. 
# Do this by taking all unique business IDs and right joining them onto your user pref table:
unique_b = pd.DataFrame({'business_id': b2['business_id'].unique()})
rating_mat_precursor2 = pd.merge(user_pref2, unique_b, how='right', on='business_id')

# Now, if a user has been to a business multiple times, you won't be able to make the matrix wide.
# So if a user has rated a business multiple times, take the average rating.
# rating_mat_precursor2 = pd.DataFrame(rating_mat_precursor2.groupby(['user_id','business_id']).agg('mean')['stars'].reset_index())

# Pivot ratings matrix to wide.
# When pivoting, the first row is filled with NaN values, so this is dropped. 
# Remaning NaN values replaced with zeros, then matrix converted to scipy CSR sparse.
tmp2 = rating_mat_precursor2.pivot(index='user_id', columns='business_id', values='rating').fillna(0)

u2_ids = tmp2.reset_index()['user_id'] # Use this later
csr_rat2 = scipy.sparse.csr_matrix(tmp2.values)
del tmp2

# Check the results:

print("rating_mat_precursor2: ", rating_mat_precursor2)
# print("tmp2: ", tmp2)
print("Users x businesses: ", csr_rat2.shape)
# print(csr_rat2.toarray())
# print(csr_rat2)

rating_mat_precursor2:                         user_id             business_id  rating
0       --7gjElmOrthETJ8XqzMBw  9HAfloFDDOH0f8fmA5nkaw     3.0
1       1UBbtDQM1xX2_EsGzhuRhQ  9HAfloFDDOH0f8fmA5nkaw     2.0
2       8mUQTXD-R0Tbo0aZ_c1Nyg  9HAfloFDDOH0f8fmA5nkaw     4.0
3       I-PsFvYzyM6Mgc-IOe6WsA  9HAfloFDDOH0f8fmA5nkaw     4.0
4       Imzg-UhRBKoNY2yKKKKdcQ  9HAfloFDDOH0f8fmA5nkaw     4.0
...                        ...                     ...     ...
302086  zqV2O4uDVqx7Pp20Dc9kqw  wq9nslE5sdOunE1M4Bz_CA     1.0
302087  zrOHYc2fWlyMbc9cukTWCQ  bhFW5a_OD2kZ-YZN64tWSQ     4.0
302088  zsZVg16yjZu5NIiS0ayjrQ  a3cKEh8Ez0im7pLU__BUww     1.0
302089  zsZVg16yjZu5NIiS0ayjrQ  tiSn-OANuPMTyH4b_PZmEA     2.0
302090  zyYWUdaodH0h1jCZAvFRPg  3eKjj7VahnWjaNwpZnmYZQ     5.0

[302091 rows x 3 columns]
Users x businesses:  (31595, 17839)


### 4) Take dot product (User x Biz) x (Biz x category_dimensions) to create user embeddings

In [0]:
print(csr_rat2.shape, b2_vec.shape)

user_embed2 = np.dot(csr_rat2, b2_vec)
print("Results: \n", type(user_embed2))
print(user_embed2.toarray())
# print(user_embed2)

(31595, 17839) (17839, 883)
Results: 
 <class 'scipy.sparse.csr.csr_matrix'>
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         3.93868374 ... 0.         5.64177381 0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


### 5) Take matrix product of user embeddings (User x biz) with business embeddings transposed (biz x category_dimensions).T in order to get a ranked list for each user (User X biz):

In [0]:
print(user_embed2.shape, b2_vec.T.shape)

preds2 = np.dot(user_embed2, b2_vec.T)
print("Results: \n", type(preds2), preds2.shape)

(31595, 883) (883, 17839)
Results: 
 <class 'scipy.sparse.csr.csr_matrix'> (31595, 17839)


### 6) Now you have a (form of) predicted rating for every user/business pair (problem: not sure what the range of possible values is here). Need to:

1.   Make your data long-form, prepare for evaluation.
2.   For each user, can you map the ratings onto the range 1-5?

In [0]:
pred_wide_df2 = pd.DataFrame(preds2.toarray())
pred_wide_df2.columns = b2_ids
pred_wide_df2 = pd.DataFrame(u2_ids).join(pred_wide_df2) # Index by original users

pred_wide_df2.head()

Unnamed: 0,user_id,9HAfloFDDOH0f8fmA5nkaw,GTNhbajbPNao5ITndlYy6Q,IiG1_hV_TyQgLzh2j8Zncg,brNYDrnZhjZjbef9iXQVQw,-b94nkPVLQw95zgtDhcpYA,1W_gZM_uuEvJqapbIl6z9Q,41o1FUbCYKJv2djtnlkzlg,4SyAKQevPuB4punANp41-Q,5DDvJhkk3zsd9jBxDQpkow,68PJJkcq_i0SlLqO6t7Qxw,6BC87j5FxoIwa-atC77WYQ,BUcTdN-rNE8urCCQuxSOQA,DjRMmmVjz2UIH5y5-dt8ww,G8qvbhfbCyMAeZzYrbmZxA,GkY6UWWn0Fz2ehcuBp66pg,Ibp4hEKSE8JaX9OvfEiFqg,IflW1yOEcNQrB2SWxHKoSg,N93EYZy9R0sdlEvubu94ig,QaxDKkqYTtVYZJcqBNTnvQ,RMLdpPgaYUsa9LIS7UnTNQ,RPmgVYvtqg2MaKYSxUrchQ,UXHlrjrt72KLojnXZsz7gw,VMfEWlSwDt9fhwNN868NNA,Y8qupLl3mHmXeNSbsHEcrA,_HqZL3gK98-Q4ObAoyM1aw,_cjPEH9wXhKS-HQe_U3M4Q,dAs_epbGSYP0uT44fI7W-w,d_AcktF-fWL9zzvNPg7euQ,e1jI2-vU1fd4UKpyogtxuA,fh8a_k9oslEDSHbmJLzUrQ,hQtGXpMq4gyRWH4s2iNpUQ,iGEvDk6hsizigmXhDKs2Vg,iZJ5pdY558VodrEumGyVug,jrrNP2Ait97pp3Z6oVQtPA,kOFDVcnj-8fd3doIpCQ06A,klu0zF1rWAoNAhKPsFyUog,oJSa5HCiZXKLXxggQmecEQ,oOGLDf2rzeCPS7UQ8hhPlQ,r_BrIgzYcwo1NAuG9dLbpg,...,S8uOotWjk_ZmM5JIJCHKpw,_FzJ85Z7qYk0LEeCQqTjhQ,jeWsP1La31uBkqykfQmrUw,zVTGIoyDZOa2oydPzdis_w,GBmBY3yGlPbQfYRDIFVC_g,JFX1lf5vxJAMzAGJV_MMjQ,K17u6tHn4zvy1bMnSszO0w,ztEHmXTrWAfpFOOFktKzSQ,Hn5IEGWDvMgAON1GMNxglw,CaJBcoMR1oBcvKdSNvpMWQ,B121FD0U2KkK9KfQUgXrNg,a8uuOGWdjVWRCIXoQGVkIA,hwI6jmzI7lDMiT4lInQ18g,FM5fffpZZkr30mRfx3cFJA,NwP5VEf48zp_KJt_xovQ1Q,2atnjto__w6cqTDDgttDYA,-yjmwvcbk22t94j485BZ2g,A014v-_dW6y0rGRxnHvZHA,JRIfRgz6kX06SCzBzHImpg,KZEmp6MJQXGLjD1k-tA0Vw,v-peZTG-0pe8n8EDbbg91w,_anbtkLtkiZ0PlN7WqpxwA,HBfQjGoM2da8yjpigqhRSA,KMKZ2HK91lSySfs3C0a21Q,MyJBf8vhDN1-7uWy8vrULw,bgWl9oEVCnZrz0Gqz68DkQ,jEog9fZLV9M6ZHuhs5-Y2Q,wNhqJfD4nqVBCEsbct6I8Q,6l73cxX_tol8RsvTZZIvPg,v414rGFU9GX0wpcNXLbXOQ,3rtMbY6lpwi_rCtmIOBGFw,G3itRWcQprHJ7w-5vl1_2Q,Z-y_2gkGo8DmXxwWspWORQ,qBggpCHUpsTYxG1CEX4Z6Q,-aYhxDHPvGreHKc0rdvQZw,wq9nslE5sdOunE1M4Bz_CA,bhFW5a_OD2kZ-YZN64tWSQ,a3cKEh8Ez0im7pLU__BUww,tiSn-OANuPMTyH4b_PZmEA,3eKjj7VahnWjaNwpZnmYZQ
0,--7gjElmOrthETJ8XqzMBw,0.0,0.0,0.0,0.421554,0.0,0.0,0.0,0.0,0.0,1.417861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.277416,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.265456,1.562931,0.0,6.498464,0.0,0.217821,2.020382,0.551871,0.0,0.665425,3.148212,0.161278,0.0,1.869461,0.0,0.491033,0.0,0.622592,2.472623,0.0,0.0,0.0,0.328349,0.299681,0.58746,0.0,0.0,0.328349,0.34644,0.0,0.0,0.268491,0.0,2.319681,0.0,0.432089,0.0
1,--BumyUHiO_7YsHurb9Hkw,0.0,5.374316,1.790793,6.8599,4.11205,5.69419,2.677286,8.11413,9.080836,7.688167,5.231521,16.437914,10.137911,5.025272,6.946292,7.355049,11.912105,9.771407,7.123432,14.88152,16.829835,16.51986,3.992534,4.15678,6.642367,12.90056,4.897835,3.463918,12.528129,9.500777,6.886728,3.833873,2.852143,6.57136,6.578821,7.326924,3.243882,6.367614,9.343342,...,0.0,3.463694,3.050186,5.632631,3.885936,6.725622,2.943231,1.488423,0.973387,9.023365,2.903441,10.630845,7.589304,2.790912,1.940629,12.052485,2.560265,6.794135,4.439502,9.516271,3.627444,2.191998,17.358591,9.03565,6.725622,1.46731,4.763302,3.422757,5.66189,4.159162,1.46731,2.018485,9.84637,1.313695,2.893409,2.547193,2.641188,4.953043,3.906588,2.394118
2,--Qh8yKWAvIP4V4K8ZPfHA,5.597207,26.3594,19.031096,42.041109,51.190467,35.329624,32.11512,57.010801,53.881821,49.429105,45.257905,79.9187,47.00724,28.738361,30.926082,44.484211,69.745432,48.138326,40.546092,72.097638,85.679236,87.431349,41.357079,26.195687,48.384843,73.809374,30.788995,31.194319,80.619829,35.757123,68.114311,32.034771,30.192959,33.632116,38.472276,54.959852,27.721089,40.30237,52.913663,...,11.851601,7.190099,29.972894,52.119573,46.881309,51.979189,41.108454,7.997618,6.733794,14.008498,13.984838,55.725607,19.3105,32.583985,11.714806,47.647997,38.523475,10.295079,14.803108,42.035276,18.067483,29.864407,77.790214,52.135953,51.979189,8.45063,11.340734,22.041992,48.430734,37.574104,8.45063,22.754263,68.858514,6.38065,14.872779,9.940071,27.01452,40.948953,13.836887,9.110419
3,--UOvCH5qEgdNQ8lzR8QYQ,0.0,4.973486,0.31837,4.075332,0.731048,1.223084,0.26767,3.660657,3.743427,1.850454,0.685963,4.303609,3.602851,0.42834,3.85428,0.953694,1.608007,1.211884,1.350065,1.663245,1.448753,1.611861,0.295527,0.305299,0.548364,1.689269,2.072667,0.615822,5.330091,1.463179,0.545749,0.482374,0.285152,1.867789,0.810498,0.383815,0.471166,0.367146,6.3529,...,0.0,0.0,2.822695,1.922168,1.326098,2.603446,2.026241,0.0,0.0,0.0,0.0,2.605479,0.0,2.136826,0.379888,1.193904,0.582903,0.0,0.0,4.668426,0.0,1.678275,2.437991,1.120722,2.603446,0.0,0.0,0.0,0.673561,0.42393,0.0,0.0,1.985247,0.0,0.0,0.0,0.0,1.917292,0.0,0.0
4,--YhjyV-ce1nFLYxP49C5A,0.155347,4.539284,1.681816,8.071391,3.861815,4.326907,4.723149,8.452315,5.938776,6.297503,3.623652,15.521731,6.811717,3.822504,5.077376,6.922928,8.69687,4.950335,5.653694,8.323669,9.084033,9.224877,5.576845,1.612766,4.435128,8.985424,2.415892,3.253124,10.151095,6.836299,10.488689,3.451993,5.031625,8.726725,5.233354,9.067698,3.679682,2.774861,6.115357,...,0.453344,0.86509,2.576264,0.638611,0.440576,4.200836,0.60581,1.744399,0.937152,1.343109,1.34084,10.911017,2.079963,0.574458,1.435325,5.860164,0.526984,1.939877,2.048695,4.493869,1.946076,0.451183,11.966651,7.795078,4.200836,1.034036,0.910043,2.865422,3.558139,3.998337,1.034036,1.082891,11.816268,0.0,1.694904,0.0,1.143553,3.09368,1.232334,0.980033


In [0]:
# Melt data -- Runs out of memory at this stage for a full city (Toronto), but can do 10k users at a time.

test = pred_wide_df2.loc[:10000,]
pred_long_10k = pd.melt(test, id_vars=['user_id']).rename(columns={'variable': 'business_id', 'value': 'raw_prediction'})

# long_preds = pd.melt(pred_wide_df2, id_vars=['user_id']).rename(columns={'variable': 'business_id', 'value': 'raw_prediction'})
# long_preds

In [0]:
# Do your predictions make sense? See how you did for one power user:

pred_long_10k.head()

bob_pred = pred_long_10k[pred_long_10k['user_id'] == '--YhjyV-ce1nFLYxP49C5A']
bob_actual = oj[oj['user_id'] == '--YhjyV-ce1nFLYxP49C5A']

print(bob_pred.shape, bob_actual.shape)

side_by_side = pd.merge(bob_actual, bob_pred, on=['user_id', 'business_id'], how='inner')
side_by_side.sort_values(by='raw_prediction', ascending=False)

(17839, 3) (33, 5)


Unnamed: 0,user_id,business_id,rating,categories,city,raw_prediction
25,--YhjyV-ce1nFLYxP49C5A,hm0y7QxT-UUmQcF074qV0g,3.0,"Coffee_Tea,Bars,Food,Nightlife,Restaurants,Caf...",Toronto,13.195358
15,--YhjyV-ce1nFLYxP49C5A,LQ9WorDtNJXeEfA7GWIXTA,4.0,"Food,Coffee_Tea",Toronto,11.966651
21,--YhjyV-ce1nFLYxP49C5A,d_jxInosU_3cAYI0qi34UA,3.0,"Restaurants,Chinese",Toronto,10.911017
5,--YhjyV-ce1nFLYxP49C5A,4oWEqa3paBylDXRfTPB-bA,3.0,"Nightlife,American_Traditional_,Burgers,Bars,C...",Toronto,8.741912
29,--YhjyV-ce1nFLYxP49C5A,tzl4KHt6ZAwyUJIEyemrtQ,4.0,"Restaurants,Nightlife,Steakhouses,American_Tra...",Toronto,7.483753
17,--YhjyV-ce1nFLYxP49C5A,S2yp22ExErM1wtpUgPC3TQ,1.0,"MusicVenues,Arts_Entertainment,Nightlife,Dance...",Toronto,7.045748
18,--YhjyV-ce1nFLYxP49C5A,X2I47eENvYeVL6QlzAZ0wA,3.0,"Chinese,Noodles,Restaurants",Toronto,6.491775
13,--YhjyV-ce1nFLYxP49C5A,HawJbjbA70EtOOJFzMZoSA,4.0,"Creperies,Japanese,Restaurants,Chinese",Toronto,6.17816
0,--YhjyV-ce1nFLYxP49C5A,r_BrIgzYcwo1NAuG9dLbpg,4.0,"Restaurants,Food,Thai,EthnicFood,SpecialtyFood",Toronto,6.115357
8,--YhjyV-ce1nFLYxP49C5A,8I2XBrjf4rOEWx7pnKpVeg,3.0,"Thai,Food,Restaurants,FoodDeliveryServices",Toronto,6.015998


## Sanity Check: Implement toy example from internet
### https://www.analyticsvidhya.com/blog/2015/08/beginners-guide-learn-content-based-recommender-systems/
### 1) Setup data (user_pref, b)


In [0]:
articles = ['art1','art2','art3','art4','art5','art6']
text = ['bd,py,lp', 'rcode,py,ml', 'ml,lp', 'py,ml', 'rcode', 'bd,ml']

b = pd.DataFrame({'articles': articles, 'text': text})
print("b: \n", b)

uid = ['u1', 'u1', 'u1', 'u2', 'u2', 'u2']
bid = ['art1', 'art2', 'art6', 'art1', 'art2', 'art4']
rat = [1, -1, 1, -1, 1, 1]

user_pref = pd.DataFrame({'uid': uid, 'bid': bid, 'rat': rat})
print("user_pref: \n", user_pref, user_pref.shape)
# print("pivoted pref: \n", user_pref.pivot(index='uid', columns='bid', values='rat'))

b: 
   articles         text
0     art1     bd,py,lp
1     art2  rcode,py,ml
2     art3        ml,lp
3     art4        py,ml
4     art5        rcode
5     art6        bd,ml
user_pref: 
   uid   bid  rat
0  u1  art1    1
1  u1  art2   -1
2  u1  art6    1
3  u2  art1   -1
4  u2  art2    1
5  u2  art4    1 (6, 3)


### 2) Vectorize biz data:

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=1, norm='l2', encoding='latin-1', ngram_range=(1,1), stop_words='english') 

b_ids = list(b['articles']) # use this later
b_vec = tfidf.fit_transform(b['text'])

print(b_ids)
print(b_vec)

['art1', 'art2', 'art3', 'art4', 'art5', 'art6']
  (0, 1)	0.6071443239386358
  (0, 3)	0.5125929572459945
  (0, 0)	0.6071443239386358
  (1, 2)	0.48380155055600843
  (1, 4)	0.668719891595794
  (1, 3)	0.5645792825314363
  (2, 2)	0.5861569567966913
  (2, 1)	0.8101975203608325
  (3, 2)	0.6506955769621281
  (3, 3)	0.7593387031634324
  (4, 4)	1.0
  (5, 2)	0.5861569567966913
  (5, 0)	0.8101975203608325


### 3) Take users, create a User x Business sparse matrix. This is simply a user-item rating matrix, but it is:

1.   A scipy CSR sparse matrix (important to avoid memory problems later)
2.   Guaranteed to have one column per business, even if no users rated that business (important for the dimensions of subsequent matrix math to work).




In [0]:
# Ensure that you have all of your businesses represented in your ratings matrix:
rating_mat_precursor = pd.merge(user_pref, b, how='right', left_on='bid', right_on='articles').drop(['bid','text'], axis=1)
rating_mat_precursor

# Pivot ratings matrix to wide.
# When pivoting, the first row is filled with NaN values, so this is dropped. 
# Remaning NaN values replaced with zeros, then matrix converted to scipy CSR sparse.
tmp = rating_mat_precursor.pivot(index='uid', columns='articles', values='rat')[1:].fillna(0)
u_ids = tmp.reset_index()['uid'] # Use this later
csr_rat = scipy.sparse.csr_matrix(tmp.values)
# del tmp

# Check the results:
print(tmp)
print(csr_rat)
print(csr_rat.toarray())

articles  art1  art2  art3  art4  art5  art6
uid                                         
u1         1.0  -1.0   0.0   0.0   0.0   1.0
u2        -1.0   1.0   0.0   1.0   0.0   0.0
  (0, 0)	1.0
  (0, 1)	-1.0
  (0, 5)	1.0
  (1, 0)	-1.0
  (1, 1)	1.0
  (1, 3)	1.0
[[ 1. -1.  0.  0.  0.  1.]
 [-1.  1.  0.  1.  0.  0.]]


### 4) Take dot product (User x Biz) x (Biz x category_dimensions) to create user embeddings

In [0]:
print(csr_rat.shape, b_vec.shape)

user_embed = np.dot(csr_rat, b_vec)
print("Results: \n", type(user_embed))
print(user_embed.toarray())

(2, 6) (6, 5)
Results: 
 <class 'scipy.sparse.csr.csr_matrix'>
[[ 1.41734184  0.60714432  0.10235541 -0.05198633 -0.66871989]
 [-0.60714432 -0.60714432  1.13449713  0.81132503  0.66871989]]


### 5) Take matrix product of user embeddings (User x biz) with business embeddings transposed (biz x category_dimensions).T in order to get a ranked list for each user (User X biz):

In [0]:
print(b.T)

preds = np.dot(user_embed, b_vec.T)
print("Results: \n", type(preds))
print(preds.toarray())

                 0            1      2      3      4      5
articles      art1         art2   art3   art4   art5   art6
text      bd,py,lp  rcode,py,ml  ml,lp  py,ml  rcode  bd,ml
Results: 
 <class 'scipy.sparse.csr.csr_matrix'>
[[ 1.20250746 -0.42701699  0.55190316  0.02712698 -0.66871989  1.20832318]
 [-0.32136896  1.45411507  0.17308656  1.35428276  0.66871989  0.17308656]]


### 6) Now you have a (form of) predicted rating for every user/business pair (problem: not sure what the range of possible values is here). Need to:

1.   Make your data long-form, prepare for evaluation.
2.   For each user, can you map the ratings onto the range 1-5?




In [0]:
pred_wide_df = pd.DataFrame(preds.toarray())
pred_wide_df.columns = b_ids
pred_wide_df = pd.DataFrame(u_ids).join(pred_wide_df) # Index by original users

pred_wide_df

Unnamed: 0,uid,art1,art2,art3,art4,art5,art6
0,u1,1.202507,-0.427017,0.551903,0.027127,-0.66872,1.208323
1,u2,-0.321369,1.454115,0.173087,1.354283,0.66872,0.173087


In [0]:
# Melt data
pd.melt(pred_wide_df, id_vars=['uid'])

Unnamed: 0,uid,variable,value
0,u1,art1,1.202507
1,u2,art1,-0.321369
2,u1,art2,-0.427017
3,u2,art2,1.454115
4,u1,art3,0.551903
5,u2,art3,0.173087
6,u1,art4,0.027127
7,u2,art4,1.354283
8,u1,art5,-0.66872
9,u2,art5,0.66872
