## Conceptual introduction

Recommender systems typically apply an algorithm to the so-called **user-item ratings matrix** of size $N \times M$, where $N$ is the number of users and $M$ is the number of items. This matrix is sparse, because each user rates only a small fraction of the items. That is actually good news: our goal is to recommend products to users, and the missing cells are exactly the items we can still recommend.

The value in a cell can be a rating (e.g. on a scale from 1 to 5 in a star-based system) or a flag saying whether the user rated the item positively or negatively (the thumbs-up / thumbs-down approach). This type of information is called **explicit**: we know the actual attitude of a particular user toward an item. The second type of information used in recommender systems is **implicit**: we only know that a user browsed an item, watched a movie, listened to a song, etc. In that case we cannot tell directly whether the user liked it; we can only assume so. A common assumption is that the more times a user interacted with an item, the more confident we can be that they like it.

The implementation presented below assumes we only deal with explicit ratings.

| UserId \ ItemId | Item 1 | Item 2 | Item 3 | Item 4 | ... | Item M |
|:---------------:|--------|--------|--------|--------|-----|--------|
| **User 1**          |    2   |    4   |    4   |    3   | ... |   4.5  |
| **User 2**          |   3.5  |    5   |   4.5  |   2.5  | ... |    4   |
| **User 3**          |    1   |    2   |    5   |   1.5  | ... |    5   |
| **...**             |   ...  |   ...  |   ...  |   ...  | ... |   ...  |
| **User N**          |    4   |   3.5  |    4   |   4.5  | ... |   2.5  |

<div style="text-align: center">Example of user-item ratings matrix</div>
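
As a minimal sketch of how such a matrix can be held in memory, the snippet below builds a tiny ratings matrix in NumPy, with `np.nan` standing for "not rated yet". The values and the name `R_toy` are made up for illustration and are not part of the MovieLens data used later.

```python
import numpy as np

# Toy 3x4 ratings matrix (hypothetical values); np.nan marks "not rated yet"
R_toy = np.array([
    [2.0, np.nan, 4.0, np.nan],
    [np.nan, 5.0, np.nan, 2.5],
    [1.0, np.nan, np.nan, np.nan],
])

known = ~np.isnan(R_toy)              # boolean mask of observed ratings
n_known = int(known.sum())            # number of observed ratings
sparsity = 1 - n_known / R_toy.size   # fraction of missing cells

print(f"{n_known} known ratings, sparsity = {sparsity:.2f}")
# items that user 0 has not rated yet, i.e. candidates for recommendation
print(np.flatnonzero(~known[0]))
```

The missing cells of a row are the items we could still recommend to that user.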


In [4]:
import numpy as np
import pandas as pd
import random
from itertools import product
from datetime import datetime

## Data

The main goal of this notebook is to explain the intuition behind the most common recommender-system algorithms rather than to provide an efficient implementation, so the dataset is kept small. It is based on the well-known MovieLens dataset available [here](https://grouplens.org/datasets/movielens/). The dataset was filtered to include only the 1000 most popular movies and the 5000 most active users. Thus the user-item matrix has dimensions $5000 \times 1000$, but the algorithm would work just as well on a larger dataset (at the cost of longer training time).

In [None]:
all_ratings = pd.read_csv('/home/kuba/Desktop/colab_filt_data/rating.csv', usecols=[0,1,2])

# calculate 1000 most popular movies
mov_chosen = all_ratings.groupby('movieId')['userId'].nunique().sort_values()
mov_chosen = mov_chosen.iloc[-1000:].index.values

# calculate top 5000 users who rated the most movies
user_chosen = all_ratings.groupby('userId')['movieId'].nunique().sort_values()
user_chosen = user_chosen.iloc[-5000:].index.values

# filter a data frame to selected movies and users
# we call copy() so that `ratings` is an independent data frame rather than a slice of all_ratings;
# this lets us modify it later (e.g. overwrite the indexes) without pandas' SettingWithCopyWarning
ratings = all_ratings[all_ratings.userId.isin(user_chosen) & all_ratings.movieId.isin(mov_chosen)].copy()
del all_ratings

# save the filtered data frame
ratings.to_csv('/home/kuba/Desktop/colab_filt_data/ratings_filtered.csv', index=False)

Now that the dataset is ready, we split it into training and test sets (70% of the ratings go to the training set).

In [17]:
ratings = pd.read_csv("/home/kuba/Desktop/colab_filt_data/new_ramka.csv")
random.seed(2019)

# train-test-split: 70% split
orig_ind = ratings.index.tolist()
tr_ind = random.sample(orig_ind, int(0.7*ratings.shape[0]))

X_train = ratings.loc[tr_ind, :]
X_test = ratings.loc[list(set(orig_ind)-set(tr_ind)), :]

# what does the training data look like?
print("We have " + str(X_train.shape[0]) + " distinct ratings in the training set.")
X_train.head()

We have 1647602 distinct ratings in the training set


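| Unnamed: 0 | userId | movieId | rating |
|:----------:|--------|---------|--------|
| 648382     | 37319  | 5266    | 3.0    |
| 1016091    | 59407  | 8376    | 2.5    |
| 2085030    | 122178 | 1387    | 3.5    |
| 673590     | 38899  | 2717    | 3.5    |
| 1028128    | 60020  | 1199    | 4.5    |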


The next step is to convert the data frame into an $N \times M$ ratings matrix.

In [34]:
R_mat = X_train.pivot(index = 'userId', columns = 'movieId', values = 'rating')

# pivot sorts the row and column labels, so we can map matrix positions to user and movie ids
user_map = dict(zip(range(R_mat.shape[0]), R_mat.index.values))
movies_map = dict(zip(range(R_mat.shape[1]), R_mat.columns.values))
user_map_rev = {v: k for k, v in user_map.items()}
movies_map_rev = {v: k for k, v in movies_map.items()}

# create numpy array - we could use scipy sparse matrices in case of larger matrix size
R_mat = R_mat.values.astype(np.float64)

## Collaborative filtering

At a high level, the main goal of collaborative filtering is to predict what rating user $i$ would give to item $j$. Let's call the prediction $\hat{r}(i,j)$ and the actual rating $r(i,j)$ (written $\hat{r}_{ij}$ and $r_{ij}$ for short). To evaluate the predictions we calculate the mean squared error:
$$MSE = \frac{1}{|\Omega|}\sum_{(i,j) \in \Omega} (r_{ij} - \hat{r}_{ij})^2$$
where $\Omega$ is the set of all non-missing user-item rating pairs.
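
To make the role of $\Omega$ concrete, here is a small sketch that computes the MSE only over the observed cells. The arrays `R_true` and `R_pred` hold made-up numbers chosen for illustration.

```python
import numpy as np

# Hypothetical true ratings (np.nan = missing) and predictions for a 2x3 matrix
R_true = np.array([[4.0, np.nan, 2.0],
                   [np.nan, 5.0, 3.0]])
R_pred = np.array([[3.5, 4.0, 2.0],
                   [1.0, 4.0, 4.0]])

omega = ~np.isnan(R_true)                     # the set Omega as a boolean mask
sq_err = (R_true[omega] - R_pred[omega])**2   # squared errors on observed cells only
mse = sq_err.sum() / omega.sum()              # divide by |Omega|, not by N*M
print(mse)  # 0.5625
```

Predictions for cells without a known rating (such as the `4.0` and `1.0` above) do not enter the error at all.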

Diving deeper, for each user we need to find the users who are similar to them (by calculating the correlation of ratings for each pair of users). Intuitively, we should pay more attention to the ratings of users who are more similar to a given user. When calculating a prediction for user $i$ and item $j$, we take the ratings given to item $j$ by all users who rated it (denoted by $\Omega_j$) and weight them by the correlation between users (denoted by $w_{ii'}$), which we want to be high for similar users and low for dissimilar users:
$$\hat{r}(i,j) = \frac{\sum_{i' \in \Omega_j} w_{ii'}r_{i'j}}{\sum_{i' \in \Omega_j} w_{ii'}}$$
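
A quick numeric illustration of this weighted average, using three made-up neighbours (the weights and ratings are assumptions for the example only):

```python
import numpy as np

# Hypothetical similarities between user i and three users who rated item j
w = np.array([0.9, 0.5, 0.1])
# the ratings those three users gave to item j
r_j = np.array([5.0, 3.0, 1.0])

r_hat = np.sum(w * r_j) / np.sum(w)   # similarity-weighted average
print(round(r_hat, 3))                # 4.067
print(r_j.mean())                     # 3.0, the unweighted average
```

The prediction is pulled toward the rating of the most similar user, which is what the weighting is meant to achieve.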


The other thing we need to take care of is the bias that naturally exists in ratings: some users have a high average rating (it's easy to make them happy), while others are very picky (they rarely give a high rating and are generally dissatisfied). We account for this by analyzing relative ratings, which we obtain by subtracting each user's mean rating $\overline{r}_i$ from their ratings.

In [35]:
# calculate average ratings for each user
mean_ratings = np.nanmean(R_mat, axis=1).reshape(-1,1)
print("The most demanding user's average rating:" + str(min(mean_ratings)))
print("The least demanding user's average rating:" + str(max(mean_ratings)))

# subtract average ratings from known ratings
R_mat -= mean_ratings

The most demanding user's average rating:[1.22659176]
The least demanding user's average rating:[4.89194915]


We subtracted the average rating of each user from their known ratings, so we need to account for this when making a prediction. The final prediction is the sum of two parts: the given user's average rating $\overline{r}_i$, and a weighted average of the deviations $r_{i'j}-\overline{r}_{i'}$ in the ratings of item $j$ over every user $i'$ who rated item $j$ (the set $\Omega_j$). The weights are the similarity coefficients $w_{i,i'}$ of the pairs of users $(i, i')$. The denominator uses $|w_{i,i'}|$ because correlations can be negative.
$$\hat{r}(i,j) = \overline{r}_i + \frac{\sum_{i' \in \Omega_j} w_{i,i'}(r_{i'j}-\overline{r}_{i'})}{\sum_{i' \in \Omega_j} |w_{i,i'}|} $$

The intuition behind this equation is that we add to our own bias (how we rate items in general) the weighted average of the differences between the rating that each user $i'$ gave to item $j$ and that user's bias. If user $i'$ likes item $j$ very much compared with their usual ratings and is also very similar to us (large $w_{i,i'}$ coefficient), they will influence our predicted score greatly.
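
The following sketch works through the equation for one target user and two neighbours. All numbers are invented for illustration; one neighbour is deliberately negatively correlated with the target user.

```python
import numpy as np

r_bar_i = 3.0                        # average rating of the target user i
w = np.array([0.8, -0.4])            # similarities w_{i,i'} to two neighbours
r_neigh = np.array([4.5, 2.0])       # the neighbours' ratings of item j
r_bar_neigh = np.array([3.5, 4.0])   # the neighbours' own average ratings

dev = r_neigh - r_bar_neigh          # deviations from each neighbour's mean: [1.0, -2.0]
r_hat = r_bar_i + np.sum(w * dev) / np.sum(np.abs(w))
print(round(r_hat, 3))               # 4.333
```

The second neighbour rated the item well below their own average, but because their taste is negatively correlated with ours, this pushes our prediction up rather than down.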

One final question remains: how do we compute the correlation between users? We need to calculate ${N\choose 2}$ correlations (one per pair of distinct users), since $corr(i,i')=corr(i',i)$. It is a simple Pearson correlation, but restricted to the data we have:
$$w_{i,i'} = \frac{\sum_{j \in \Omega_{i,i'}} (r_{ij}-\overline{r}_i)(r_{i'j}-\overline{r}_{i'})}{\sqrt{\sum_{j \in \Omega_i} (r_{ij}-\overline{r}_i)^2} \sqrt{\sum_{j \in \Omega_{i'}} (r_{i'j}-\overline{r}_{i'})^2}}$$
where:
* $\Omega_{i,i'}$ - set of items that both user $i$ and $i'$ have rated
* $\Omega_i$ - set of items that user $i$ has rated
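
As a compact illustration of this formula, the sketch below computes all pairwise weights for a toy matrix with matrix products instead of an explicit loop over pairs. It relies on one observation: once the ratings are mean-centered, replacing a missing rating with 0 removes it from every sum in the formula. The toy data and the names `R_toy`, `Z` and `W` are assumptions made for this example.

```python
import numpy as np

# Toy ratings for 3 users x 4 items (hypothetical values, np.nan = not rated)
R_toy = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, 2.0, 1.0],
    [1.0, 5.0, 4.0, np.nan],
])

# subtract each user's mean, computed over the items that user rated
centered = R_toy - np.nanmean(R_toy, axis=1, keepdims=True)
# a missing rating should contribute nothing to any sum, so set it to 0
Z = np.where(np.isnan(centered), 0.0, centered)

num = Z @ Z.T                          # numerators: sums over items rated by both users
norms = np.sqrt((Z**2).sum(axis=1))    # per-user norms over that user's own items
W = num / np.outer(norms, norms)       # W[i, i'] = w_{i,i'}

print(np.round(W, 3))
```

The resulting matrix is symmetric with ones on the diagonal, which are the same sanity checks used for the loop-based version. Note that a user whose ratings are all identical has a zero norm and would produce a division by zero here.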


In [None]:
# calculating Pearson correlations between each pair of users
n_users = R_mat.shape[0]
us_range = range(n_users)
weights = np.empty((n_users, n_users))

all_pairs = list(product(us_range, us_range))
# do we have all the pairs?
assert len(all_pairs) == n_users * n_users

# instead of scanning matrix rows we could create dicts user2movie, movie2user, usermovie2rating
# looping through an array is O(NM), looping through dict is O(|omega|), omega is set of all ratings
t1 = datetime.now()
for p1, p2 in all_pairs:
    # which movies has each user rated?
    mov1 = np.flatnonzero(~np.isnan(R_mat[p1, :]))
    mov2 = np.flatnonzero(~np.isnan(R_mat[p2, :]))
    u1 = R_mat[p1, mov1]
    u2 = R_mat[p2, mov2]
    # which movies have both users rated?
    mov_both = np.intersect1d(mov1, mov2)
    u_both = R_mat[np.array([p1, p2])[:, None], mov_both]
    # numerator: R_mat is already mean-centered, so no bias has to be subtracted here
    num = np.sum(u_both[0, :] * u_both[1, :])
    # denominator: each user's norm over all the movies that user rated
    den = np.sqrt((u1**2).sum()) * np.sqrt((u2**2).sum())
    weights[p1, p2] = num / den
    # optional progress report
    # if (p1 + 1) % 500 == 0 and p2 == 0:
    #     print(f'{p1+1} rows done!')
print(f'Execution took: {datetime.now()-t1}')

# sanity checks:
print(weights[10, 10]) # must be equal to 1
print(weights[10, 15] == weights[15, 10]) # must return True

np.save('/home/kuba/Desktop/colab_filt_data/weights', weights)

In [None]:
# select K top neighbors for each user
K = 50
neigh = {}
for user in range(weights.shape[0]):
    # indices of the K users with the largest weights (this includes the user itself)
    neigh[user] = np.argsort(-weights[user])[:K]


# PREDICTION
def predict(i=10, j=23):
    # users who rated movie j
    who_watched = np.flatnonzero(~np.isnan(R_mat[:, j]))
    # R_mat is already mean-centered, so the neighbours' means are not subtracted again
    numer = (weights[i, who_watched] * R_mat[who_watched, j]).sum()
    denom = np.abs(weights[i, who_watched]).sum()
    s_ij = mean_ratings[i, 0] + numer / denom
    return s_ij

predict()

# defining MSE function
def MSE(true, pred):
    true = np.asarray(true, dtype=float).ravel()
    pred = np.asarray(pred, dtype=float).ravel()
    return np.sum((true - pred)**2) / len(true)

# calculating MSE on train set, which pairs to check?
train_pred_pairs = []
for ind, row in X_train.iterrows():
    tmp = (user_map_rev[row['userId']], movies_map_rev[row['movieId']])
    train_pred_pairs.append(tmp)

# calculate score for train set (first 1000 pairs only)
train_scores = np.array([predict(x, y) for x, y in train_pred_pairs[:1000]])
true_train = X_train['rating'].values
train_MSE = MSE(true_train[:1000], train_scores)


# calculating MSE on test set
test_pred_pairs = []
for ind, row in X_test.iterrows():
    tmp = (user_map_rev[row['userId']], movies_map_rev[row['movieId']])
    test_pred_pairs.append(tmp)

test_scores = [predict(x, y) for x, y in test_pred_pairs]
test_MSE = MSE(X_test['rating'].values, test_scores)
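
The cell above selects the top K neighbours but the prediction still sums over every user who rated the movie. One possible way to combine the two ideas is sketched below on toy data. The function `predict_top_k`, its parameters and all the arrays are hypothetical and not part of the notebook. Two further assumptions are made here: the target user's own rating is excluded, and the function falls back to the user's mean when no usable neighbour exists.

```python
import numpy as np

def predict_top_k(i, j, R_centered, user_means, weights, k=2):
    """Predict r_hat(i, j) from the k most similar users who rated item j."""
    raters = np.flatnonzero(~np.isnan(R_centered[:, j]))
    raters = raters[raters != i]                 # do not use i's own rating
    if raters.size == 0:
        return float(user_means[i])              # nobody else rated j
    # keep the k raters with the largest weights (descending sort)
    order = np.argsort(-weights[i, raters])[:k]
    top = raters[order]
    w = weights[i, top]
    denom = np.abs(w).sum()
    if denom == 0:
        return float(user_means[i])
    return float(user_means[i] + np.sum(w * R_centered[top, j]) / denom)

# toy data: 4 users, 2 items, ratings already mean-centered
user_means = np.array([3.0, 4.0, 2.0, 3.5])
R_centered = np.array([
    [np.nan, 1.0],
    [1.0, np.nan],
    [-1.0, 0.5],
    [0.5, -0.5],
])
weights = np.array([
    [1.0, 0.9, -0.6, 0.1],
    [0.9, 1.0, 0.0, 0.2],
    [-0.6, 0.0, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
])

result = predict_top_k(0, 0, R_centered, user_means, weights, k=2)
print(round(result, 3))
```

With `k=2` only users 1 and 3 (weights 0.9 and 0.1) contribute to the prediction for user 0, so the negatively correlated user 2 is ignored.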


## Matrix factorization