In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# User similarity based predictions

Our goal in this notebook is to accomplish the following tasks using only ratings data:
* Come up with a meaningful way of measuring the similarity between two users.
* Obtain predictions for unrated items by using the ratings of similar users.

I'm going to use the toy dataset:

In [3]:
minidf = pd.read_pickle('../data/authlabrat.pk')
minidf.head()

Pair,ReviewerId,RecipeLabel,Rating,LabelId
"(8629, Easy Vidalia Onion Casserole)",8629,Easy Vidalia Onion Casserole,4.0,4741
"(4470, Veggie Soup)",4470,Veggie Soup,5.0,4450
"(6357, Mango Daiquiri)",6357,Mango Daiquiri,4.0,2577
"(6357, Butter Chicken)",6357,Butter Chicken,4.0,2560
"(6357, Baklava)",6357,Baklava,3.0,4245


In [4]:
author_ids = minidf.ReviewerId.unique()
recipe_labels = minidf.RecipeLabel.unique()

We're going to represent "ratings vectors" (the rows in the ratings matrix) as dictionaries.

In [116]:
def get_dictionaries(ratingsdf,usercol,itemcol,ratingcol):
    """Return ({user: {item: rating}}, {item: [users]}) built from a ratings DataFrame."""
    users = set(ratingsdf[usercol].values)
    items = set(ratingsdf[itemcol].values)
    ratings_by_user = {user_id:{} for user_id in users}
    users_by_item = {item_id:[] for item_id in items}
    # Walk the three columns in parallel, one review per step.
    for user_id,item_id,rating in zip(ratingsdf[usercol].values,
                                      ratingsdf[itemcol].values,
                                      ratingsdf[ratingcol].values):
        ratings_by_user[user_id][item_id] = rating
        users_by_item[item_id].append(user_id)
    return ratings_by_user,users_by_item
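
To make the shape of these two dictionaries concrete, here is a minimal sketch that builds them from a plain list of `(user, item, rating)` tuples instead of a DataFrame. The name `build_dictionaries` and the toy ratings are made up for illustration; they are not part of the dataset.

```python
def build_dictionaries(rows):
    """Hypothetical pure-Python analogue of get_dictionaries.

    rows: iterable of (user_id, item_id, rating) tuples.
    Returns ({user: {item: rating}}, {item: [users]}).
    """
    ratings_by_user = {}
    users_by_item = {}
    for user_id, item_id, rating in rows:
        # One "ratings vector" per user, stored sparsely as a dict.
        ratings_by_user.setdefault(user_id, {})[item_id] = rating
        # Reverse lookup: who has rated this item?
        users_by_item.setdefault(item_id, []).append(user_id)
    return ratings_by_user, users_by_item


# Toy data (invented): two users, two recipes.
toy_rows = [
    (1, "Baklava", 5.0),
    (1, "Veggie Soup", 3.0),
    (2, "Baklava", 4.0),
]
toy_ratings_by_user, toy_users_by_item = build_dictionaries(toy_rows)
print(toy_ratings_by_user)
print(toy_users_by_item)
```

Storing only the rated items keeps each ratings vector small, and the reverse dictionary lets us find candidate neighbours for an item without scanning every user.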


When we make predictions on our test set, we should only use ratings from the training set.

Thus, we do our train test split now and then compute the ratings vectors using only reviews in the training set.

However, we have to do this carefully - in order to make predictions for a user-item pair in the holdout set, we will need that user and that item to have appeared in a different pair in the training set. We can ensure that every user appears in the training set by stratifying the sampling using the user column.

In [15]:
from sklearn.model_selection import train_test_split

In [20]:
ratdf_train, ratdf_test = train_test_split(minidf,test_size=0.2, random_state=1729,stratify = minidf.ReviewerId.values)

In [26]:
ratings_by_author,reviewers_by_recipe = get_dictionaries(ratdf_train,'ReviewerId','RecipeLabel','Rating')

{'Twice Baked Potato Casserole': [195589,
  369715,
  55729,
  35965,
  92816,
  178427,
  54678,
  125640,
  1175362,
  55221,
  229850,
  211184,
  73836,
  1365025,
  355637,
  173579,
  35140,
  28177],
 'English Muffin Pizza': [284897,
  75497,
  157102,
  526666,
  179133,
  178427,
  537179,
  174096,
  199848,
  70073,
  222478,
  80353,
  428885,
  185105,
  186802,
  286566],
 'Chesapeake Bay Crab Cakes': [312577,
  5060,
  496803,
  88378,
  355637,
  173579,
  93006,
  269521,
  470351,
  250031,
  268058,
  240552],
 'No Bake Chocolate Oatmeal Cookies': [1179225,
  189616,
  242484,
  64203,
  341344,
  673444,
  693345,
  57042,
  32058,
  203325,
  250031,
  192974,
  166294,
  465829,
  173579,
  527607,
  130133,
  218535,
  839492,
  305531,
  88099,
  359220,
  369715,
  36128,
  158086,
  265954,
  17803,
  10033,
  18391,
  140132,
  164610,
  1133190,
  500860,
  126435,
  452940],
 'Nutty Irishman': [394085,
  199020,
  86318,
  498271,
  43083,
  106624,
  60989

In [67]:
recipes_in_training = list(reviewers_by_recipe.keys())
mean_ratings_train = {author_id:np.mean(list(ratings_by_author[author_id].values()))
                      for author_id in ratings_by_author}
std_ratings_train = {author_id:np.std(list(ratings_by_author[author_id].values()))
                      for author_id in ratings_by_author}

In [86]:
def normalize(data):
    value = data[0]
    mean = data[1]
    std = data[2]
    if std == 0:
        return 0
    else:
        return (value-mean)/std
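
As a quick sanity check of the normalization rule, the sketch below applies the same formula to two invented users: one whose ratings vary and one who always gives the same score. The helper `normalize_rating` is a hypothetical variant of `normalize` that takes three separate arguments. `statistics.pstdev` is the population standard deviation, which is the same convention as the default of `np.std`.

```python
import statistics


def normalize_rating(value, mean, std):
    """Hypothetical variant of normalize: z-score, or 0 when std is 0."""
    if std == 0:
        return 0
    return (value - mean) / std


# Invented ratings for two users.
varied_user = [5.0, 3.0, 5.0, 3.0]
constant_user = [5.0, 5.0, 5.0]

varied_mean = statistics.mean(varied_user)
varied_std = statistics.pstdev(varied_user)
varied_norm = [normalize_rating(r, varied_mean, varied_std) for r in varied_user]

constant_mean = statistics.mean(constant_user)
constant_std = statistics.pstdev(constant_user)
constant_norm = [normalize_rating(r, constant_mean, constant_std) for r in constant_user]

print(varied_norm)
print(constant_norm)
```

A user who gives every recipe the same score carries no information about relative preference, so mapping all of their ratings to 0 is a reasonable convention.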



In [184]:
ratdf_train_ext = ratdf_train.copy()
ratdf_train_ext['MeanRatingTrain'] = [mean_ratings_train[author_id] for author_id in ratdf_train_ext.ReviewerId]
ratdf_train_ext['StdRatingTrain'] = [std_ratings_train[author_id] for author_id in ratdf_train_ext.ReviewerId]
ratdf_train_ext['Normalized_Ratings']= [normalize(data) for data in ratdf_train_ext[['Rating','MeanRatingTrain','StdRatingTrain']].values]

ratdf_test_ext = ratdf_test.copy()
ratdf_test_ext['MeanRatingTrain'] = [mean_ratings_train[author_id] for author_id in ratdf_test_ext.ReviewerId]
ratdf_test_ext['StdRatingTrain'] = [std_ratings_train[author_id] for author_id in ratdf_test_ext.ReviewerId]
ratdf_test_ext['Normalized_Ratings']= [normalize(data) for data in ratdf_test_ext[['Rating','MeanRatingTrain','StdRatingTrain']].values]

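Note that not every recipe in the test set appears in the training set, so we cannot make predictions for those pairs. A k-fold split would be one way to deal with this; here we simply restrict attention to the test pairs whose recipe appears in the training set.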

In [42]:
recs_test_not_train = list(set(ratdf_test['RecipeLabel'].values)-set(recipes_in_training))
pairs_in_testtrain = [pair for pair in ratdf_test.index if pair[1] not in recs_test_not_train]

In [46]:
print(len(pairs_in_testtrain))
print(len(ratdf_test))

14183
14318


## Similarity scores
Given two such dictionaries that represent user ratings, we will need a way of quantifying how similarly two users rate recipes.

* If user1 and user2 have rated the exact same set of recipes (for instance, if both have rated every recipe), we will use the usual formula for cosine similarity: we normalize the ratings so that they have mean 0 and standard deviation 1, and compute the cosine similarity of those vectors.
* If user1 and user2 have rated none of the same recipes, we will not assign them a number (we could say 0 but I prefer to say "No comment"). The `similarity` function below flags this case by returning -2, a value outside the range of a cosine.

Now, when two users have some overlap, we will do something that differs slightly from cosine similarity:
* First, we normalize the ratings of user1 and user2 so that the average rating is 0 for both when considering ALL recipes rated by each of those respective users.
* Then, we drop from consideration all recipes that haven't been rated by both users. That is, let $V$ be the space with basis indexed by all recipes, and let $U$ be the subspace spanned by the recipes that have been rated by both user1 and user2. We write $u_1, u_2$ to denote the projections of the (normalized) ratings vectors onto $U$.
* We will use the cosine similarity of $u_1, u_2$ as our measure of similarity between user1 and user2.

Important distinctions:
* I'm not renormalizing $u_1, u_2$ so that they have mean 0. So, it's like I'm doing cosine similarity in $V$.
* However, I'm not dividing by the actual length of the vectors in $V$ - I'm using the length of the projection to $U$.

Here is why these changes make sense:
* Say user1 gave above average scores to all recipes they have in common with user2, but user2 gave them below average scores. Then $u_1 \cdot u_2$ is going to be negative, and we would like it to be negative, since user1 and user2 clearly have opposite opinions on everything. If we translated $u_1, u_2$ so that they have mean 0, we would get a number that quantifies how user1 and user2 feel about that restricted set of recipes but not how they feel about recipes in general. Therefore, it makes more sense to ensure the global mean rating is 0 and not translate the ratings beyond that.
* Now, we could just say we're doing cosine similarity in $V$. The distinction between that and what we're doing is that the lengths of the ratings vectors in $V$ are bigger than the lengths of the ratings vectors in $U$. To see how this might affect things, suppose we have a third user user3. If user3 and user2 have the same overlap with user1, and the same ratings on that overlap, then we would like the similarity score of user1,user2 and user1,user3 to be the same. In particular, it shouldn't matter whether user3 has rated a million other recipes or 5 other recipes. Now, if we use the length in $V$, then users will be penalized for having rated other recipes - this doesn't make sense. So we will use the length of $u_1, u_2$, rather than the length of the original vector in $V$.
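
Written out as a formula, the score described above looks as follows. The notation is introduced here for illustration: $R_i$ is the set of recipes rated by user $i$, $r_{i,k}$ is the rating user $i$ gave recipe $k$, $\bar r_i$ is the mean of $r_{i,k}$ over all $k \in R_i$, and $C = R_1 \cap R_2$ is the set of recipes rated by both users.

$$
\mathrm{sim}(1,2) \;=\; \frac{u_1 \cdot u_2}{\lVert u_1 \rVert \, \lVert u_2 \rVert}
\;=\; \frac{\sum_{k \in C} (r_{1,k} - \bar r_1)(r_{2,k} - \bar r_2)}
{\sqrt{\sum_{k \in C} (r_{1,k} - \bar r_1)^2}\;\sqrt{\sum_{k \in C} (r_{2,k} - \bar r_2)^2}}
$$

The means are taken over $R_1$ and $R_2$ (all recipes each user has rated), while every sum runs only over the overlap $C$. Dividing the ratings by each user's standard deviation would not change this value, since the cosine is invariant under positive rescaling of either vector.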

In [7]:
def similarity(ratings1,ratings2):
    user1mean = np.mean(list(ratings1.values()))
    user2mean = np.mean(list(ratings2.values()))
    common_recipes = set(ratings1.keys()).intersection(set(ratings2.keys()))
    if len(common_recipes)== 0:
        # No overlap: return -2 (outside [-1, 1]) as the "No comment" flag.
        return -2
    l1 = 0
    l2 = 0
    dot = 0
    for recipe_id in common_recipes:
        r1 = ratings1[recipe_id]-user1mean
        r2 = ratings2[recipe_id]-user2mean
        l1+=r1**2
        l2+=r2**2
        dot+=r1*r2
    # If the normalized ratings are all 0 for one of the users,
    # the corresponding length will be 0. We may as well return 0.
    if l1*l2 == 0:
        return 0
    else:
        return dot/np.sqrt(l1*l2)
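
The argument about user2 and user3 can be checked on a small invented example. The sketch below is pure Python and self-contained; `overlap_similarity` mirrors the rule used in `similarity`, while `full_length_cosine` is a hypothetical alternative that divides by the lengths in $V$ instead. Both return `None` when there is no overlap, which is a simplification of the -2 flag. The ratings are chosen so that user2 and user3 have the same mean and agree on the overlap with user1, but user3 has rated more recipes.

```python
import math


def centered(ratings):
    """Subtract the user's mean over ALL recipes they rated."""
    mean = sum(ratings.values()) / len(ratings)
    return {recipe: r - mean for recipe, r in ratings.items()}


def overlap_similarity(ratings1, ratings2):
    """Cosine of the centered vectors, with lengths taken on the overlap only."""
    c1, c2 = centered(ratings1), centered(ratings2)
    common = set(c1) & set(c2)
    if not common:
        return None
    dot = sum(c1[k] * c2[k] for k in common)
    l1 = sum(c1[k] ** 2 for k in common)
    l2 = sum(c2[k] ** 2 for k in common)
    return 0.0 if l1 * l2 == 0 else dot / math.sqrt(l1 * l2)


def full_length_cosine(ratings1, ratings2):
    """Hypothetical alternative: same dot product, lengths taken in V."""
    c1, c2 = centered(ratings1), centered(ratings2)
    common = set(c1) & set(c2)
    if not common:
        return None
    dot = sum(c1[k] * c2[k] for k in common)
    l1 = sum(v ** 2 for v in c1.values())
    l2 = sum(v ** 2 for v in c2.values())
    return 0.0 if l1 * l2 == 0 else dot / math.sqrt(l1 * l2)


# Invented ratings. user2 and user3 both have mean 4 and agree on A and B.
user1 = {"A": 4.0, "B": 2.0, "W": 3.0}
user2 = {"A": 5.0, "B": 3.0, "X": 4.0}
user3 = {"A": 5.0, "B": 3.0, "X": 5.0, "Y": 3.0}

sim_12 = overlap_similarity(user1, user2)
sim_13 = overlap_similarity(user1, user3)
full_12 = full_length_cosine(user1, user2)
full_13 = full_length_cosine(user1, user3)
print(sim_12, sim_13)
print(full_12, full_13)
```

With lengths taken on the overlap, user2 and user3 get the same score against user1. With lengths taken in $V$, user3's score drops purely because of the extra recipes they rated.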

## Related ratings

Given a user and an unrated item (chosen from the testing set), we want to predict the rating using the training data.

We will do the following:
* First, we obtain a list of the users that have rated the item we're interested in.
* Next, we check whether any of those users have rated any of the same items as our user. If there are none, we do not make a prediction.
* Otherwise, we will make a prediction based on the ratings given by those users and their similarity to the original user.

Before trying to make any predictions, let's just try to collect the necessary data and see how many pairs we are even able to make a prediction for.

In [51]:
def get_related_ratings_info(author_id,recipe_label):
    rating_dic = ratings_by_author[author_id]
    other_recipes = set(rating_dic.keys())
    recipe_reviewers = reviewers_by_recipe[recipe_label]
    relevant_data = {}
    for author_id1 in recipe_reviewers:
        rating_dic1 = ratings_by_author[author_id1]
        # no_common_recipes is the NUMBER of recipes rated by both users.
        no_common_recipes = len(set(rating_dic1.keys()).intersection(other_recipes))
        if no_common_recipes>0:
            simscore = similarity(rating_dic,rating_dic1)
            truerating = rating_dic1[recipe_label]
            normrating = truerating-np.mean(list(rating_dic1.values()))
            # A nonzero deviation implies a nonzero std, so this division is safe.
            if normrating != 0:
                normrating = normrating/np.std(list(rating_dic1.values()))
            relevant_data[author_id1] = {'Similarity':simscore,
                                         'No_common_revs':no_common_recipes,
                                         'True_rating':truerating,
                                         'Norm_rating':normrating}
    return relevant_data

def make_prediction(pair,min_com_revs,min_sim):
    author_id = pair[0]
    recipe_label = pair[1]
    related_user_data = get_related_ratings_info(author_id,recipe_label)
    norm_ratings = []
    weights = []
    for author2 in related_user_data:
        author2_data = related_user_data[author2]
        if author2_data['No_common_revs']>= min_com_revs and author2_data['Similarity']>min_sim:
            norm_ratings.append(author2_data['Norm_rating'])
            weights.append(author2_data['No_common_revs']*author2_data['Similarity'])
    if len(norm_ratings)>0:
        return np.average(norm_ratings,weights = weights)
    else:
        return 'No prediction'
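
To see the weighting rule in isolation, here is a minimal sketch on invented neighbour data. The function `weighted_prediction` is hypothetical: it takes a list of `(similarity, n_common, norm_rating)` tuples directly, and returns `None` rather than the string `'No prediction'` when no neighbour passes the filters. It assumes `min_sim >= 0`, so that every kept weight is positive.

```python
def weighted_prediction(neighbors, min_com_revs, min_sim):
    """Weighted average of normalized ratings.

    neighbors: list of (similarity, n_common, norm_rating) tuples.
    Each kept neighbour is weighted by n_common * similarity.
    Assumes min_sim >= 0, so the weights are strictly positive.
    """
    kept = [
        (sim * n_common, norm_rating)
        for sim, n_common, norm_rating in neighbors
        if n_common >= min_com_revs and sim > min_sim
    ]
    if not kept:
        return None
    total_weight = sum(weight for weight, _ in kept)
    return sum(weight * norm for weight, norm in kept) / total_weight


# Invented neighbours: (similarity, number of common recipes, normalized rating).
toy_neighbors = [
    (0.5, 4, 1.0),    # weight 2.0
    (1.0, 1, -1.0),   # weight 1.0
    (0.05, 10, 2.0),  # dropped: similarity is not above 0.1
]
toy_pred = weighted_prediction(toy_neighbors, min_com_revs=1, min_sim=0.1)
print(toy_pred)
```

The third neighbour shares many recipes with the user but is filtered out by the similarity threshold, so the prediction is (2.0 * 1.0 + 1.0 * (-1.0)) / 3.0.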

In [177]:
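# Caveat: for a training pair, the author is one of the recipe's reviewers,
# so the author's own rating enters the weighted average
# (self-similarity is 1 unless all of their ratings are equal).
pred_train_data = {pair:make_prediction(pair,1,0.1) for pair in ratdf_train.index}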

In [178]:
ratdf_train_ext['Pred'] = [pred_train_data[pair] for pair in ratdf_train_ext.index]

In [179]:
ratdf_train_ext.head()

Pair,ReviewerId,RecipeLabel,Rating,LabelId,Pred0,Pred1,MeanRatingTrain,StdRatingTrain,Normalized_Ratings,Pred
"(195589, Twice Baked Potato Casserole)",195589,Twice Baked Potato Casserole,5.0,5228,0.381119,0.275118,4.940678,0.345488,0.171705,0.275118
"(284897, English Muffin Pizza)",284897,English Muffin Pizza,5.0,1738,0.280439,0.21002,4.966667,0.179505,0.185695,0.21002
"(312577, Chesapeake Bay Crab Cakes)",312577,Chesapeake Bay Crab Cakes,5.0,1396,0.42372,0.192172,4.947368,0.510282,0.103142,0.192172
"(1179225, No Bake Chocolate Oatmeal Cookies)",1179225,No Bake Chocolate Oatmeal Cookies,3.0,3390,-1.582706,-2.206435,4.689542,0.641493,-2.633765,-2.206435
"(394085, Nutty Irishman)",394085,Nutty Irishman,5.0,4380,0.443806,0.732812,4.633333,0.473169,0.774917,0.732812


In [170]:
len(pred_train_data)

57269

In [144]:
from sklearn.metrics import mean_squared_error

In [182]:
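# Keep only the pairs for which a prediction was made.
has_pred = ratdf_train_ext.Pred!='No prediction'
predlist = ratdf_train_ext.loc[has_pred,'Pred'].astype(float).values
truelist = ratdf_train_ext.loc[has_pred,'Normalized_Ratings'].values
# All-zero baseline: a normalized rating of 0 is the user's own mean rating.
zerolist = np.zeros(len(predlist))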

In [183]:
mean_squared_error(truelist,predlist)

0.02012060502491485

In [53]:
pred_data = {pair:get_related_ratings_info(pair[0],pair[1]) for pair in pairs_in_testtrain}

In [190]:
preds_test_dic = {pair:make_prediction(pair,1,0.1) for pair in pairs_in_testtrain}
preds_test_list = [preds_test_dic[pair] for pair in preds_test_dic if preds_test_dic[pair]!= 'No prediction']
test_true_norm_ratings = [ratdf_test_ext['Normalized_Ratings'][pair] for pair in preds_test_dic if preds_test_dic[pair]!= 'No prediction']

In [191]:
mean_squared_error(preds_test_list,test_true_norm_ratings)

1.8233612182661854

In [192]:
mean_squared_error([0 for a in test_true_norm_ratings],test_true_norm_ratings)

1.5657822174475704

In [194]:
def mse_dics(ratings1:dict,ratings2:dict):
    common_keys = set(ratings1.keys()).intersection(set(ratings2.keys()))
    if len(common_keys) == 0:
        return 'No common keys'
    else:
        return np.mean([(ratings1[key]-ratings2[key])**2 for key in common_keys])