# Recommender Systems - notebook 1 - Traditional Approaches


In [1]:
"""
(Practical tip) Table of contents can be compiled directly in jupyter notebooks using the following code:
I set an exception: if the package is in your installation you can import it otherwise you download it 
then import it.
"""
try:
    from jyquickhelper import add_notebook_menu 
except:
    !pip install jyquickhelper
    from jyquickhelper import add_notebook_menu
    
"""
Output Table of contents to navigate easily in the notebook. 
For interested readers, the package also includes Ipython magic commands to go back to this cell
wherever you are in the notebook to look for cells faster
"""
add_notebook_menu()

## Imports


In [2]:
import ssl

ssl._create_default_https_context = ssl._create_unverified_context


In [3]:
from tqdm import tqdm
import pandas as pd
import numpy as np


## Dataset description


We use here the MovieLens 100K Dataset. It contain 100,000 ratings from 1000 users on 1700 movies.

- u.train / u.test part of the original u.data information
  - The full u data set, 100000 ratings by 943 users on 1682 items.
    Each user has rated at least 20 movies. Users and items are
    numbered consecutively from 1. The data is randomly
    ordered. This is a tab separated list of
    user id | item id | rating | timestamp.
    The time stamps are unix seconds since 1/1/1970 UTC
- u.info
  - The number of users, items, and ratings in the u data set.
- u.item
  - Information about the items (movies); this is a tab separated
    list of
    movie id | movie title | release date | video release date |
    IMDb URL | unknown | Action | Adventure | Animation |
    Children's | Comedy | Crime | Documentary | Drama | Fantasy |
    Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
    Thriller | War | Western |
    The last 19 fields are the genres, a 1 indicates the movie
    is of that genre, a 0 indicates it is not; movies can be in
    several genres at once.
    The movie ids are the ones used in the u.data data set.
- u.genre
  - A list of the genres.
- u.user
  - Demographic information about the users; this is a tab
    separated list of
    user id | age | gender | occupation | zip code
    The user ids are the ones used in the u.data data set.


In [4]:
path = "https://www.i3s.unice.fr/~riveill/dataset/dataset_movilens_100K/"


Before we build our model, it is important to understand the distinction between implicit and explicit feedback, and why modern recommender systems are built on implicit feedback.

- **Explicit Feedback:** in the context of recommender systems, explicit feedback are direct and quantitative data collected from users.
- **Implicit Feedback:** on the other hand, implicit feedback are collected indirectly from user interactions, and they act as a proxy for user preference.

The advantage of implicit feedback is that it is abundant. Recommender systems built using implicit feedback allow recommendations to be adapted in real time, with each click and interaction.

Today, online recommender systems are built using implicit feedback.


### Data preprocessing


In [5]:
# Load data
np.random.seed(123)

ratings = pd.read_csv(
    path + "u.data",
    sep="\t",
    header=None,
    names=["userId", "movieId", "rating", "timestamp"],
)
ratings = ratings.sort_values(["timestamp"], ascending=True)
print("Nb ratings:", len(ratings))
ratings


Nb ratings: 100000


Unnamed: 0,userId,movieId,rating,timestamp
214,259,255,4,874724710
83965,259,286,4,874724727
43027,259,298,4,874724754
21396,259,185,4,874724781
82655,259,173,4,874724843
...,...,...,...,...
46773,729,689,4,893286638
73008,729,313,3,893286638
46574,729,328,3,893286638
64312,729,748,4,893286638


### Data splitting

Separating the dataset between train and test in a random fashion would not be fair, as we could potentially use a user's recent evaluations for training and previous evaluations. This introduces a data leakage with an anticipation bias, and the performance of the trained model would not be generalizable to real world performance.

Therefore, we need to slice the train and test based on the timestamp


In [6]:
# Split dataset
train_ratings, test_ratings = np.split(ratings, [int(0.9 * len(ratings))])

max(train_ratings["timestamp"]) <= min(test_ratings["timestamp"])


True

In [7]:
# drop columns that we no longer need
train_ratings = train_ratings[["userId", "movieId", "rating"]]
test_ratings = test_ratings[["userId", "movieId", "rating"]]

len(train_ratings), len(test_ratings)


(90000, 10000)

In [8]:
# Get a list of all movie IDs
all_movieIds = ratings["movieId"].unique()


### Build pivot table


In [9]:
""" Pivot table for train set """
train_pivot = pd.pivot_table(
    data=train_ratings,
    values="rating",
    index="userId",
    columns="movieId",
)
print("Nb users: ", train_pivot.shape[0])
print("Nb movies:", train_pivot.shape[1])
train_pivot


Nb users:  867
Nb movies: 1637


movieId,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,
6,4.0,,,,,,2.0,4.0,4.0,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,,,,,,,,,5.0,,...,,,,,,,,,,
940,,,,2.0,,,4.0,5.0,3.0,,...,,,,,,,,,,
941,5.0,,,,,,4.0,,,,...,,,,,,,,,,
942,,,,,,,,,,,...,,,,,,,,,,


In [10]:
train_users = train_pivot.index
train_movies = train_pivot.columns


## Collaborative filtering based on Users similarity

This approach uses scores that have been assigned by other users to calculate predictions.

In pivot table

- Rows are users, $u, v$
- Columns are items, $i, j$

$$pred(u, i) = \frac{\sum_v sim(u,v)*r_{v,i}}{\sum_v sim(u,v)}$$

Wich similarity function:

- Euclidean distance $[0,1]$: $sim(a,b)=\frac{1}{1+\sqrt{\sum_i (r_{a,i}-r_{b,i})^2}}$
- Pearson correlation $[-1,1]$: $sim(a,b)=\frac{\sum_i (r_{a,i}-r_a)(r_{b,i}-r_b)}{{\sum_i (r_{a,i}-r_a)^2}{\sum_i (r_{b,i}-r_b)^2}}$
- Cosine similarity $[-1,1]$: $sim(a, b)=\frac{a.b}{|a|.|b|}$

Which function should we use? The answer is that there is no fixed recipe; but there are some issues we can take into account when choosing the proper similarity function. On the one hand:

- Pearson correlation usually works better than Euclidean distance since it is based more on the ranking than on the values. So, two users who usually like more the same set of items, although their rating is on different scales, will come out as similar users with Pearson correlation but not with Euclidean distance.
- On the other hand, when dealing with binary/unary data, i.e., like versus not like or buy versus not buy, instead of scalar or real data like ratings, cosine distance is usually used.


### Build predictor


In [11]:
# Step 1: build the similarity matrix between users
correlation_matrix = train_pivot.transpose().corr("pearson")
correlation_matrix


userId,1,2,3,5,6,7,8,9,10,12,...,934,935,936,937,938,939,940,941,942,943
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.000000,1.608412e-01,0.112780,0.420809,0.287159,1.237128e-01,0.692086,-0.102062,-0.092344,4.219129e-02,...,0.061695,-2.602417e-01,0.383733,2.899974e-02,0.326744,5.343904e-01,0.263289,0.205616,-0.180784,0.067549
2,0.160841,1.000000e+00,0.067420,0.327327,0.446269,4.807341e-01,0.585491,0.242536,0.668145,3.532708e-16,...,0.021007,-2.711631e-01,0.214017,5.616449e-01,0.331587,-7.671236e-18,-0.011682,-0.062017,0.085960,0.479702
3,0.112780,6.741999e-02,1.000000,,-0.109109,-5.037555e-17,0.291937,,0.311086,1.889822e-01,...,,,-0.045162,-5.233642e-17,-0.137523,,-0.104678,1.000000,-0.011792,
5,0.420809,3.273268e-01,,1.000000,0.241817,1.490373e-01,0.537400,0.577350,0.087343,1.169812e-01,...,0.229532,-5.000000e-01,0.439286,6.085806e-01,0.484211,8.807048e-01,0.027038,0.468521,0.318163,0.346234
6,0.287159,4.462695e-01,-0.109109,0.241817,1.000000,1.758687e-01,0.687745,0.132353,0.273987,1.121937e-01,...,-0.064033,-2.642921e-01,0.392760,2.706660e-01,0.229918,2.063147e-01,-0.024419,0.399186,0.092349,0.109833
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.534390,-7.671236e-18,,0.880705,0.206315,1.425665e-01,-0.333333,,0.316228,,...,0.374351,-3.305898e-02,0.471172,-2.758386e-01,-0.073374,1.000000e+00,-0.534522,-0.131306,-0.500000,-0.187317
940,0.263289,-1.168173e-02,-0.104678,0.027038,-0.024419,3.142734e-02,0.320487,0.171499,0.158976,8.530245e-02,...,-0.125059,4.352858e-01,-0.338327,-1.486075e-01,0.110022,-5.345225e-01,1.000000,0.632746,-0.022813,0.332497
941,0.205616,-6.201737e-02,1.000000,0.468521,0.399186,0.000000e+00,0.166667,1.000000,0.420084,,...,-0.500000,-2.355139e-16,0.273060,3.929526e-01,-0.214147,-1.313064e-01,0.632746,1.000000,-0.577350,-0.395285
942,-0.180784,8.596024e-02,-0.011792,0.318163,0.092349,4.548076e-01,0.201328,0.707107,0.408994,3.236481e-01,...,0.438252,-8.703883e-01,-0.216119,4.472136e-01,0.244989,-5.000000e-01,-0.022813,-0.577350,1.000000,0.277433


In [12]:
# Step2 build rating function
# We want to calculate the rating that a user could have given for an item.

# Il est plus efficace de travailler avec numpy qu'avec pandas.
# On transforme donc la matrice pivot en numpy
pivot = train_pivot.to_numpy()
# idem pour la matrice de correlation
corr = correlation_matrix.to_numpy()
# Malheureusement, on doit utiliser 2 dictionnaires pour passer
# Du nom de la colonne movieId dans son indice en numpy
movie2column = {j: i for i, j in enumerate(train_pivot.columns)}
# Du nom de la ligne userId dans son indice en numpy
user2row = {j: i for i, j in enumerate(train_pivot.index)}


def predict(pivot, corr, userId, movieId):
    if movieId in movie2column.keys():
        movie = movie2column[movieId]
    else:
        return 2.5
    if userId in user2row.keys():
        user = user2row[userId]
    else:
        return 2.5

    # Normalement le rating est inconnu
    if np.isnan(pivot[user, movie]):
        num = 0
        den = 0
        for u in range(len(corr)):
            if not np.isnan(pivot[u, movie]) and not np.isnan(corr[user, u]):
                # Si l'utilisateur u a déjà vu le film movie
                # et si les deux utilisateurs ont au moins vu un même film
                den += abs(corr[user, u])
                num += corr[user, u] * pivot[u, movie]
        if den != 0:
            return num / den
        else:
            return 2.5  # default value
    else:
        # le film a déjà été vu
        print(
            f"l'utilisateur {userId} a déjà vu le film {movieId}",
            f"et lui a donné la note de {pivot[user, movie]}",
        )
        return pivot[user, movie]


predict(pivot=pivot, corr=corr, userId=1, movieId=1)
predict(pivot=pivot, corr=corr, userId=3, movieId=28)


l'utilisateur 1 a déjà vu le film 1 et lui a donné la note de 5.0


1.8527972301545377

### Predict


In [13]:
# Step 3 add the predicted rating to the test set

test_ratings["User based"] = [
    predict(pivot, corr, userId, movieId)
    for _, userId, movieId, _ in tqdm(
        test_ratings[["userId", "movieId", "rating"]].itertuples()
    )
]
test_ratings


10000it [00:06, 1578.53it/s]


Unnamed: 0,userId,movieId,rating,User based
557,90,900,4,0.657995
6675,90,269,5,0.987827
562,90,289,3,0.572064
62660,90,270,4,1.077609
68756,90,268,4,1.202310
...,...,...,...,...
46773,729,689,4,2.500000
73008,729,313,3,2.500000
46574,729,328,3,2.500000
64312,729,748,4,2.500000


### Evaluate the predictor

Now that we have trained our model, assigned a value to each pair (userId, movieId), we are ready to evaluate it.


#### Evaluation with classical metrics: RMSE

In traditional machine learning projects, we evaluate our models using measures such as accuracy (for classification problems) and RMSE (for regression problems). This is what we will do in the first instance.


In [14]:
test_ratings["rating"]


557      4
6675     5
562      3
62660    4
68756    4
        ..
46773    4
73008    3
46574    3
64312    4
79208    4
Name: rating, Length: 10000, dtype: int64

In [15]:
# Step 4 evaluate the resulte : with classical metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error

print(
    "RMSE:",
    np.sqrt(mean_squared_error(test_ratings["rating"], test_ratings["User based"])),
)


RMSE: 1.7798966446725806


#### Hit Ratio @ K

However, a measure such as RMSE does not provide a satisfactory evaluation of recommender systems. To design a good metric for evaluating recommender systems, we need to first understand how modern recommender systems are used.

Amazon, Netflix and others uses a list of recommendations. The key here is that we don’t need the user to interact with every single item in the list of recommendations. Instead, we just need the user to interact with at least one item on the list — as long as the user does that, the recommendations have worked.

To simulate this, let’s run the following evaluation protocol to generate a list of top 10 recommended items for each user.

- For each user, randomly select 99 items that the user has not interacted with.
- Combine these 99 items with the test item (the actual item that the user last interacted with). We now have 100 items.
- Run the model on these 100 items, and rank them according to their predicted probabilities.
- Select the top 10 items from the list of 100 items. If the test item is present within the top 10 items, then we say that this is a hit.
- Repeat the process for all users. The Hit Ratio is then the average hits.

This evaluation protocol is known as **Hit Ratio @ K**, and it is commonly used to evaluate recommender systems.


<font color='red'>
$TO DO - Students$

- Fill the gaps
  </font>


In [16]:
# Step 2 with hit ratio
def HRatio(test_ratings, predictor, K=10, predict_func=predict):
    # User-item pairs for testing
    test_user_item_set = set(
        list(set(zip(test_ratings["userId"], test_ratings["movieId"])))[:1000]
    )

    # Dict of all items that are interacted with by each user
    user_interacted_items = ratings.groupby("userId")["movieId"].apply(list).to_dict()

    hits = []
    for (u, i) in tqdm(test_user_item_set):
        interacted_items = user_interacted_items[u]
        not_interacted_items = set(all_movieIds) - set(interacted_items)
        selected_not_interacted = list(np.random.choice(list(not_interacted_items), 99))
        test_items = selected_not_interacted + [i]
        predicted_labels = predictor(
            pairs=[np.array([u] * 100), np.array(test_items)],
            predict_func=predict_func,  # added to be able to pass custom predict functions
        ).reshape(-1)
        topK_items = [test_items[i] for i in np.argsort(predicted_labels)[-K:].tolist()]

        if i in topK_items:
            hits.append(1)
        else:
            hits.append(0)
    hr = np.average(hits)
    print("The Hit Ratio @ {} is {:.2f}".format(K, hr))
    return hr


In [17]:
def predictor(
    pairs,
    predict_func=predict,  # allows to pass custom predict functions
):
    pred = []
    for userId, movieId in zip(pairs[0], pairs[1]):
        pred += [predict_func(pivot, corr, userId, movieId)]
    return np.array(pred)


HR = dict()
hr = HRatio(
    test_ratings=test_ratings,
    predictor=predictor,
    K=25,
)
HR["User based"] = hr


100%|██████████| 1000/1000 [00:58<00:00, 17.22it/s]

The Hit Ratio @ 25 is 0.78





## Improve the rating


### Trick 1:

Since humans do not usually act the same as critics, i.e., some people usually rank movies higher or lower than others, this prediction function can be easily improved by taking into account the user mean as follows:

$$pred(u, i) = \overline{r_u} + \frac{\sum_v sim(u,v)*(r_{v,i} - \overline{r_v})}{\sum_v sim(u,v)}$$


<font color='red'>
$TO DO - Students$

- Modify the previous code in order to implement "Trick 1"
  </font>


In [18]:
def user_center(pivot):
    """
    Compute train_pivot user centered (uc), which removes the mean of every user
    """
    user_mean = pivot.transpose().mean()
    return (pivot.transpose() - user_mean).transpose()


train_pivot_uc = user_center(
    pivot=train_pivot
)  # user centered version of `train_pivot`
correlation_matrix = train_pivot_uc.transpose().corr("pearson")


def predict_uc(pivot, corr, userId, movieId):
    if movieId in movie2column.keys():
        movie = movie2column[movieId]
    else:
        return 2.5
    if userId in user2row.keys():
        user = user2row[userId]
    else:
        return 2.5

    # Normalement le rating est inconnu
    if np.isnan(pivot[user, movie]):
        num = 0
        den = 0
        for u in range(len(corr)):
            if not np.isnan(pivot[u, movie]) and not np.isnan(corr[user, u]):
                # Si l'utilisateur u a déjà vu le film movie
                # et si les deux utilisateurs ont au moins vu un même film
                den += abs(corr[user, u])

                # remove the mean of user rating
                num += corr[user, u] * (pivot[u, movie] - np.nanmean(pivot[u]))
        if den != 0:
            return (num / den) + np.nanmean(pivot[user])  # add user mean
        else:
            return 2.5  # default value
    else:
        # le film a déjà été vu
        print(
            f"l'utilisateur {userId} a déjà vu le film {movieId}",
            f"et lui a donné la note de {pivot[user, movie]}",
        )
        return pivot[user, movie]


predict_uc(pivot=pivot, corr=corr, userId=1, movieId=1)
predict_uc(pivot=pivot, corr=corr, userId=3, movieId=28)


l'utilisateur 1 a déjà vu le film 1 et lui a donné la note de 5.0


2.9400162942557957

In [19]:
# Step 3 add the predicted rating to the test set

test_ratings["User based_uc"] = [
    predict_uc(pivot, corr, userId, movieId)
    for _, userId, movieId, _ in tqdm(
        test_ratings[["userId", "movieId", "rating"]].itertuples()
    )
]
test_ratings


10000it [00:32, 308.22it/s]


Unnamed: 0,userId,movieId,rating,User based,User based_uc
557,90,900,4,0.657995,3.758684
6675,90,269,5,0.987827,3.845841
562,90,289,3,0.572064,3.687360
62660,90,270,4,1.077609,3.422702
68756,90,268,4,1.202310,3.836968
...,...,...,...,...,...
46773,729,689,4,2.500000,2.500000
73008,729,313,3,2.500000,2.500000
46574,729,328,3,2.500000,2.500000
64312,729,748,4,2.500000,2.500000


In [20]:
# Step 4 evaluate the resulte : with classical metrics

print(
    "RMSE:",
    np.sqrt(mean_squared_error(test_ratings["rating"], test_ratings["User based_uc"])),
)


RMSE: 1.4531244671937105


In [21]:
hr = HRatio(
    test_ratings=test_ratings,
    predictor=predictor,
    predict_func=predict_uc,
    K=25,
)
HR["User based_uc"] = hr


100%|██████████| 1000/1000 [02:32<00:00,  6.56it/s]

The Hit Ratio @ 25 is 0.80





### Trick 2:

If two users have very few items in common, let us imagine that there is only one, and the rating is the same, the user similarity will be really high; however, the confidence is really small. It's possible to add a ponderation coefficient.

$$newsim(a, b) = sim(a,b) * \frac{min(N, |P_{a,b}|)}{N}$$

where $|P_{a,b}|$ is the number of common items shared by user a and user b. The coefficient is $< 1$ if the number of common movies is $< N$ and $1$ otherwise.


In [22]:
# Count the number of common items shared by user 1 and user 2.
a = user2row[1]
b = user2row[2]

thresh_common_nb = 10  # N
common = pivot[a] + pivot[b]  # sum of np arrays propagates nan
common = np.count_nonzero(~np.isnan(common))
enough_common_coeff = np.min([thresh_common_nb, common]) / thresh_common_nb
enough_common_coeff


1.0

<font color='red'>
$TO DO - Students$

- Modify the previous code in order to implement "Trick 2"
  </font>


In [23]:
def predict_thresh(
    pivot,
    corr,
    userId,
    movieId,
    thresh_common_nb=10,  # N
):
    if movieId in movie2column.keys():
        movie = movie2column[movieId]
    else:
        return 2.5
    if userId in user2row.keys():
        user = user2row[userId]
    else:
        return 2.5

    # Normalement le rating est inconnu
    if np.isnan(pivot[user, movie]):
        num = 0
        den = 0
        for u in range(len(corr)):
            if not np.isnan(pivot[u, movie]) and not np.isnan(corr[user, u]):
                # Si l'utilisateur u a déjà vu le film movie
                # et si les deux utilisateurs ont au moins vu un même film
                common = pivot[user] + pivot[u]  # sum of np arrays propagates nan
                common = np.count_nonzero(~np.isnan(common))
                enough_common_coeff = np.min([thresh_common_nb, common]) / thresh_common_nb
                
                num += enough_common_coeff * corr[user, u] * pivot[u, movie]
                den += abs(enough_common_coeff * corr[user, u])

        if den != 0:
            return num / den

        else:
            return 2.5  # default value
    else:
        # le film a déjà été vu
        print(
            f"l'utilisateur {userId} a déjà vu le film {movieId}",
            f"et lui a donné la note de {pivot[user, movie]}",
        )
        return pivot[user, movie]


predict_thresh(pivot=pivot, corr=corr, userId=1, movieId=1)
predict_thresh(pivot=pivot, corr=corr, userId=3, movieId=28)


l'utilisateur 1 a déjà vu le film 1 et lui a donné la note de 5.0


2.082207668795514

In [24]:
for thresh_common_nb in [10, 15, 20, 30, 50]:
    # Step 3 add the predicted rating to the test set

    test_ratings[f"User based_thresh_{thresh_common_nb}"] = [
        predict_thresh(pivot, corr, userId, movieId, thresh_common_nb)
        for _, userId, movieId, _ in tqdm(
            test_ratings[["userId", "movieId", "rating"]].itertuples()
        )
    ]
    # test_ratings

    # Step 4 evaluate the resulte : with classical metrics

    print(
        f"RMSE for N={thresh_common_nb}:",
        np.sqrt(
            mean_squared_error(
                y_true=test_ratings["rating"],
                y_pred=test_ratings[f"User based_thresh_{thresh_common_nb}"],
            )
        ),
    )


10000it [00:18, 547.80it/s]


RMSE for N=10: 1.727123456965418


10000it [00:17, 582.31it/s]


RMSE for N=15: 1.710466778197879


10000it [00:17, 584.81it/s]


RMSE for N=20: 1.701306646530708


10000it [00:16, 593.53it/s]


RMSE for N=30: 1.6905802601250242


10000it [00:17, 575.09it/s]

RMSE for N=50: 1.6794334332143162





In [25]:
hr = HRatio(
    test_ratings=test_ratings,
    predictor=predictor,
    predict_func=predict_thresh,
    K=25,
)
HR["User based_thresh"] = hr


100%|██████████| 1000/1000 [01:37<00:00, 10.23it/s]

The Hit Ratio @ 25 is 0.78





In [26]:
test_ratings

Unnamed: 0,userId,movieId,rating,User based,User based_uc,User based_thresh_10,User based_thresh_15,User based_thresh_20,User based_thresh_30,User based_thresh_50
557,90,900,4,0.657995,3.758684,1.120209,1.204467,1.204467,1.204467,1.204467
6675,90,269,5,0.987827,3.845841,1.570607,1.585766,1.585766,1.585766,1.585766
562,90,289,3,0.572064,3.687360,1.100158,1.112312,1.112312,1.112312,1.112312
62660,90,270,4,1.077609,3.422702,1.241466,1.279309,1.279309,1.279309,1.279309
68756,90,268,4,1.202310,3.836968,1.587394,1.607269,1.607269,1.607269,1.607269
...,...,...,...,...,...,...,...,...,...,...
46773,729,689,4,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000
73008,729,313,3,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000
46574,729,328,3,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000
64312,729,748,4,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000


In [27]:
HR


{'User based': 0.78, 'User based_uc': 0.803, 'User based_thresh': 0.782}

## To go further

1. Do the same, but with correlation between items. It's Collaborative filtering based on Items similarity. It's also possible to use the 2 previous trick

1. Use Matrix factorization: decompose R in P, Q at rank k (i.e. if R is a m.n matrix, P is a m.k matrix and Q is a n.k matrix) the reconstruct R with P and Q (i.e. $\hat{R} = P Q^T$)

1. Use Matrix decomposition: do an truncated SVD decomposition in order to obtain U, S and V, build $\hat{R} = U S V^T$


<font color='red'>
$TO DO - Students$

- Choose, implement and evaluate one of the above strategies.
  </font>


### Collaborative filtering based on items similarity


In [28]:
""" Pivot table for train set """
train_pivot = pd.pivot_table(
    data=train_ratings,
    values="rating",
    index="movieId", # used to be userId
    columns="userId", # used to be movieId
) # the changes cause the axes to be reversed
print("Nb movies:", train_pivot.shape[0])
print("Nb users: ", train_pivot.shape[1])
train_pivot


Nb movies: 1637
Nb users:  867


userId,1,2,3,5,6,7,8,9,10,12,...,934,935,936,937,938,939,940,941,942,943
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,4.0,,4.0,4.0,,,,4.0,,...,2.0,3.0,4.0,,4.0,,,5.0,,
2,3.0,,,3.0,,,,,,,...,4.0,,,,,,,,,5.0
3,4.0,,,,,,,,,,...,,,4.0,,,,,,,
4,3.0,,,,,5.0,,,4.0,5.0,...,5.0,,,,,,2.0,,,
5,3.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,,,,,,,,,,,...,,,,,,,,,,
1679,,,,,,,,,,,...,,,,,,,,,,
1680,,,,,,,,,,,...,,,,,,,,,,
1681,,,,,,,,,,,...,,,,,,,,,,


In [29]:
# Step 1: build the similarity matrix between users

# no need to remove the transpose since we exchanged movieId and userId
# when making the pivot table
correlation_matrix = train_pivot.transpose().corr("pearson")
correlation_matrix


movieId,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.000000,0.198057,0.172936,0.128676,0.378934,0.529401,0.153225,0.272667,0.074523,0.204965,...,,,,,,,,,,
2,0.198057,1.000000,0.172189,0.187792,0.335075,-0.158114,0.140478,0.306391,-0.288128,0.174262,...,,,,,,,,,,
3,0.172936,0.172189,1.000000,-0.134625,0.177084,0.806226,0.017779,-0.182750,-0.089170,0.071563,...,,,,,,,,,,
4,0.128676,0.187792,-0.134625,1.000000,-0.190204,0.066625,0.186239,0.252612,0.273724,0.215495,...,,,,,,,,,,
5,0.378934,0.335075,0.177084,-0.190204,1.000000,1.000000,0.127930,0.233920,-0.043369,-0.855717,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,,,,,,,,,,,...,,,,,,,,,,
1679,,,,,,,,,,,...,,,,,,,,,,
1680,,,,,,,,,,,...,,,,,,,,,,
1681,,,,,,,,,,,...,,,,,,,,,,


In [30]:
# Step2 build rating function
# We want to calculate the rating that a user could have given for an item.

# Il est plus efficace de travailler avec numpy qu'avec pandas.
# On transforme donc la matrice pivot en numpy
pivot = train_pivot.to_numpy()
# idem pour la matrice de correlation
corr = correlation_matrix.to_numpy()
# Malheureusement, on doit utiliser 2 dictionnaires pour passer
# Du nom de la colonne movieId dans son indice en numpy
movie2column = {j: i for i, j in enumerate(train_pivot.columns)}
# Du nom de la ligne userId dans son indice en numpy
user2row = {j: i for i, j in enumerate(train_pivot.index)}


# the names of movieId and userId should be reversed
def predict(pivot, corr, userId, movieId):
    if movieId in movie2column.keys():
        movie = movie2column[movieId]
    else:
        return 2.5
    if userId in user2row.keys():
        user = user2row[userId]
    else:
        return 2.5

    # Normalement le rating est inconnu
    if np.isnan(pivot[user, movie]):
        num = 0
        den = 0
        for u in range(len(corr)):
            if not np.isnan(pivot[u, movie]) and not np.isnan(corr[user, u]):
                # Si l'utilisateur u a déjà vu le film movie
                # et si les deux utilisateurs ont au moins vu un même film
                den += abs(corr[user, u])
                num += corr[user, u] * pivot[u, movie]
        if den != 0:
            return num / den
        else:
            return 2.5  # default value
    else:
        # le film a déjà été vu
        print(
            f"le film {userId} a déjà été vu par l'utilisateur {movieId}",
            f"et a reçu la note de {pivot[user, movie]}",
        )
        return pivot[user, movie]


predict(pivot=pivot, corr=corr, userId=1, movieId=1)
predict(pivot=pivot, corr=corr, userId=3, movieId=28)


le film 1 a déjà été vu par l'utilisateur 1 et a reçu la note de 5.0


2.577331638036494

In [31]:
# Step 3 add the predicted rating to the test set

test_ratings["items_based"] = [
    predict(pivot, corr, userId, movieId)
    for _, userId, movieId, _ in tqdm(
        test_ratings[["movieId", "userId", "rating"]].itertuples() # invert movieId and userId
    )
]
test_ratings


10000it [00:11, 833.96it/s]


Unnamed: 0,userId,movieId,rating,User based,User based_uc,User based_thresh_10,User based_thresh_15,User based_thresh_20,User based_thresh_30,User based_thresh_50,items_based
557,90,900,4,0.657995,3.758684,1.120209,1.204467,1.204467,1.204467,1.204467,1.897624
6675,90,269,5,0.987827,3.845841,1.570607,1.585766,1.585766,1.585766,1.585766,3.232988
562,90,289,3,0.572064,3.687360,1.100158,1.112312,1.112312,1.112312,1.112312,3.032812
62660,90,270,4,1.077609,3.422702,1.241466,1.279309,1.279309,1.279309,1.279309,1.620365
68756,90,268,4,1.202310,3.836968,1.587394,1.607269,1.607269,1.607269,1.607269,2.928405
...,...,...,...,...,...,...,...,...,...,...,...
46773,729,689,4,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000
73008,729,313,3,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000
46574,729,328,3,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000
64312,729,748,4,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000,2.500000


In [32]:
# compute RMSE
print(
    "RMSE:",
    np.sqrt(mean_squared_error(test_ratings["rating"], test_ratings["items_based"])),
)


RMSE: 1.7554528828991547


### Matrix factorization

Same approach as in the slides.

<font color='blue'>
Below is a simple algorithm for factoring a matrix.
</font>


In [33]:
# Matrix factorization from scratch
def matrix_factorization(R, K, steps=10, alpha=0.005):
    """
    R: rating matrix
    K: latent features
    steps: iterations
    alpha: learning rate
    beta: regularization parameter"""

    # N: num of User
    N = R.shape[0]
    # M: num of Movie
    M = R.shape[1]

    # P: |U| * K (User features matrix)
    P = np.random.rand(N, K)
    # Q: |D| * K (Item features matrix)
    Q = np.random.rand(M, K).T

    for step in tqdm(range(steps)):
        for i in range(N):
            for j in range(M):
                if not np.isnan(R[i][j]):
                    # calculate error
                    eij = R[i][j] - np.dot(P[i, :], Q[:, j])

                    for k in range(K):
                        # calculate gradient with a and beta parameter
                        tmp = P[i][k] + alpha * (2 * eij * Q[k][j])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k])
                        tmp = P[i][k]

    return P, Q.T


In [34]:
# We try first on a toy example
# R: rating matrix
import math

R = [
    [5, 3, math.nan, 1],
    [4, math.nan, math.nan, 1],
    [1, 1, math.nan, 5],
    [1, math.nan, math.nan, 4],
    [0, 1, 5, 4],
    [2, 1, 3, math.nan],
]

R = np.array(R)
# Num of Features
K = 3

nP, nQ = matrix_factorization(R, K, steps=10)

nR = np.dot(nP, nQ.T)
nR


100%|██████████| 10/10 [00:00<00:00, 1784.28it/s]


array([[2.12758474, 1.84433709, 1.3047341 , 1.79528997],
       [1.62941137, 1.48932725, 1.00469   , 1.17875413],
       [1.41674085, 1.23374316, 0.87721332, 1.39813117],
       [1.06211232, 1.13392569, 0.70126885, 1.29508273],
       [1.27068877, 1.20998626, 0.79444241, 0.99844176],
       [1.83008845, 1.52090772, 1.09230563, 1.02466009]])

In [35]:
""" TRY to predict with matrix factorization """


' TRY to predict with matrix factorization '

In [36]:
""" Evaluate the result """


' Evaluate the result '

## Decomposition using latent factor.

We use SVD decomposition


In [37]:
pivot = train_pivot.fillna(0).to_numpy()
max_components = min(train_pivot.shape) - 1


In [38]:
from scipy.sparse.linalg import svds

k = 50
assert k < max_components

u, s, v_T = svds(pivot, k=k)
nR = u.dot(np.diag(s).dot(v_T))  # output of TruncatedSVD


In [39]:
s

array([ 57.25005481,  57.57480375,  57.75314656,  57.99880033,
        58.13254489,  58.42083719,  58.8588979 ,  59.07946095,
        59.47557093,  59.65841912,  59.81901285,  60.58211626,
        61.21112094,  61.2998157 ,  62.04973244,  62.38679731,
        62.57380979,  62.99998426,  63.59984122,  64.39513078,
        64.71710402,  65.03539222,  65.38927487,  65.61214988,
        66.80944062,  67.53227707,  69.31276991,  69.78436118,
        70.0279118 ,  71.31421275,  72.47720353,  73.71497721,
        74.67513991,  76.60176717,  78.43382523,  79.64080918,
        82.23187885,  85.53988698,  88.66338897,  90.01310144,
        96.65966513, 101.72448643, 116.74772225, 120.12556462,
       138.24673889, 145.94464114, 152.41581902, 203.78726783,
       230.63108475, 603.76784628])

In [40]:
""" TRY to predict with SVD decomposition """


' TRY to predict with SVD decomposition '

In [41]:
""" Evaluate the result """


' Evaluate the result '