## User-to-user collaborative filtering

*Source: https://github.com/nzhinusoftcm/review-on-collaborative-filtering/blob/master/2.User-basedCollaborativeFiltering.ipynb**

### Import requirements

In [12]:
import os

if not (os.path.exists("recsys.zip") or os.path.exists("recsys")):
    !wget https://github.com/nzhinusoftcm/review-on-collaborative-filtering/raw/master/recsys.zip    
    !unzip recsys.zip

--2024-01-06 10:53:46--  https://github.com/nzhinusoftcm/review-on-collaborative-filtering/raw/master/recsys.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nzhinusoftcm/review-on-collaborative-filtering/master/recsys.zip [following]
--2024-01-06 10:53:46--  https://raw.githubusercontent.com/nzhinusoftcm/review-on-collaborative-filtering/master/recsys.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15312323 (15M) [application/zip]
Saving to: ‘recsys.zip’


2024-01-06 10:53:47 (93.0 MB/s) - ‘recsys.zip’ saved [15312323/15312323]

Archive:  recsys.zip
   creating: recsys/
  inflating: recs

In [13]:
import numpy as np
import pandas as pd
import tqdm.notebook
from recsys.datasets import mlLatestSmall
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split as sklearn_train_test_split
import typing as tp
from functools import lru_cache

**Load MovieLens ratings**

In [101]:
ratings, movies = mlLatestSmall.load()

In [102]:
ratings = ratings.drop("timestamp", axis=1)
ratings.tail()

Unnamed: 0,userid,itemid,rating
100831,610,166534,4.0
100832,610,168248,5.0
100833,610,168250,5.0
100834,610,168252,5.0
100835,610,170875,3.0


My favourite movies:
1. Interstellar (2014)
2. Inception (2010)
3. The Intouchables (2011)
4. Howl's Moving Castle (2004)
5. Hachiko: A Dog's Story (2009)
6. Drive (2011)
7. The Prestige (2006)
8. A Beautiful Mind (2001)
9. Blade Runner 2049 (2017)
10. Avatar (2009)


In [None]:
# movies[movies['title'].str.contains('Drive')]

In [103]:
# itemid: rating
fav_movies = {
    109487: 5.0,
    79132: 5.0,
    92259: 5.0,
    31658: 5.0,
    73290: 5.0,
    88129: 4.5,
    48780: 5.0,
    4995: 4.5,
    176371: 4.5,
    72998: 5.0
}

userid = ratings.userid.max() + 1

for itemid, rating in fav_movies.items():
    new_record = pd.DataFrame([{'userid': userid, 'itemid': itemid, 'rating': rating}])
    ratings = pd.concat([ratings, new_record], ignore_index=True)

In [18]:
ratings.tail()

Unnamed: 0,userid,itemid,rating
100841,611,88129,4.5
100842,611,48780,5.0
100843,611,4995,4.5
100844,611,176371,4.5
100845,611,72998,5.0


**Userids and Itemids encoding**

In [104]:
def ids_encoder(ratings):
    users = sorted(ratings['userid'].unique())
    items = sorted(ratings['itemid'].unique())

    # create users and items encoders
    uencoder = LabelEncoder()
    iencoder = LabelEncoder()

    # fit users and items ids to the corresponding encoder
    uencoder.fit(users)
    iencoder.fit(items)

    # encode userids and itemids
    ratings.userid = uencoder.transform(ratings.userid.tolist())
    ratings.itemid = iencoder.transform(ratings.itemid.tolist())

    return ratings, uencoder, iencoder

# create the encoder
ratings, uencoder, iencoder = ids_encoder(ratings)

In [105]:
np_ratings = ratings.to_numpy()

In [106]:
np_ratings

array([[0.000e+00, 0.000e+00, 4.000e+00],
       [0.000e+00, 2.000e+00, 4.000e+00],
       [0.000e+00, 5.000e+00, 4.000e+00],
       ...,
       [6.100e+02, 3.635e+03, 4.500e+00],
       [6.100e+02, 9.586e+03, 4.500e+00],
       [6.100e+02, 7.195e+03, 5.000e+00]])

### Part 1
Implement similarity functions for user2user collaborative filtering using the following similarity metrics:
1. Jaccard's coefficient
2. Dot product of common ratings
3. Adjusted Pearson Correlation

#### 1. Jaccard's coefficient

In [22]:
def jaccard_similarity(
        np_ratings: np.array, i: int, j: int
)-> float:
    """
    np_ratings: array containing: (user_id, item_id, rating)
    i: index of the first user
    j: index of the second user

    Returns:
        Jaccard similarity between users i and j
    """
    if i == j:
        return 1.0
    
    @lru_cache(2000)
    def ratings_for_user(i):
        return np_ratings[np_ratings[:, 0] == i]
    
    ratings_i, ratings_j = ratings_for_user(i), ratings_for_user(j)
    intersection = len(np.intersect1d(ratings_i[:, 1], ratings_j[:, 1]))

    if intersection == 0:
        return -1.0
    
    union = (len(ratings_i) + len(ratings_j)) - intersection

    return float(intersection) / union if union else -1.0

In [23]:
jaccard_similarity(np_ratings, 1, 7)

0.013333333333333334

#### 2. Dot product of common ratings

In [24]:
def dot_similarity(
        np_ratings: np.array, i: int, j: int
)-> float:
    """
    np_ratings: array containing: (user_id, item_id, rating)
    i: index of the first user
    j: index of the second user

    Returns:
        Dot product similarity between users i and j
    """
    if i == j:
        return np.inf
    
    @lru_cache(2000)
    def ratings_for_user(i):
        return np_ratings[np_ratings[:, 0] == i]
    
    ratings_i, ratings_j = ratings_for_user(i), ratings_for_user(j)
    common_items = np.intersect1d(ratings_i[:, 1], ratings_j[:, 1])
    
    if len(common_items) == 0:
        return -1.0
    
    common_ratings_i = ratings_i[np.isin(ratings_i[:, 1], common_items)]
    common_ratings_j = ratings_j[np.isin(ratings_j[:, 1], common_items)]
    
    return np.dot(common_ratings_i[:, 2], common_ratings_j[:, 2])

In [25]:
dot_similarity(np_ratings, 1, 608)

12.0

#### 3. Adjusted Pearson Correlation

In [26]:
def cosine(x: np.array, y: np.array) -> float:
    if np.linalg.norm(x) == 0 or np.linalg.norm(y) == 0:
        return 0

    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

In [45]:
def normalize(ratings) -> pd.DataFrame:
    """Normalize ratings by item"""
    if type(ratings) == np.ndarray:
        ratings = pd.DataFrame(ratings)
        ratings.columns = ["userid", "itemid", "rating"]
    # calculate mean for every item
    mean = ratings.groupby(by="itemid", as_index=False)["rating"].mean()
    norm_ratings = pd.merge(ratings, mean, suffixes=("", "_mean"), on="itemid")

    # normalize each rating by substracting the mean rating of the corresponding item
    norm_ratings["norm_rating"] = norm_ratings["rating"] - norm_ratings["rating_mean"]

    return norm_ratings[ratings.columns.tolist() + ["norm_rating"]]

In [28]:
norm_ratings = normalize(ratings)
np_ratings = norm_ratings.to_numpy()

In [29]:
def pearson_similarity(
        np_ratings: np.array, i: int, j: int
)-> float:
    """
    np_ratings: array containing: (user_id, item_id, rating, norm_rating)
    i: index of the first user
    j: index of the second user

    Returns:
        adjusted pearson correlation between users i and j
    """
    if i == j:
        return 1.0

    @lru_cache(2000)
    def ratings_for_user(i):
        return np_ratings[np_ratings[:, 0] == i]
    
    ratings_i, ratings_j = ratings_for_user(i), ratings_for_user(j)
    common_items = np.intersect1d(ratings_i[:, 1], ratings_j[:, 1])
    
    if len(common_items) == 0:
        return -1.0
    
    common_ratings_i = ratings_i[np.isin(ratings_i[:, 1], common_items)]
    common_ratings_j = ratings_j[np.isin(ratings_j[:, 1], common_items)]
    x = common_ratings_i[:, 3]
    y = common_ratings_j[:, 3]

    return min(len(common_items) / 50, 1) * cosine(x, y)

In [30]:
pearson_similarity(np_ratings, 1, 610)

-0.031136626488094992

**train/test split**

In [31]:
def get_examples(dataframe, labels_column="rating"):
    examples = dataframe[['userid', 'itemid']].values
    labels = dataframe[f'{labels_column}'].values
    return examples, labels


def train_test_split(examples, labels, test_size=0.1, verbose=0):
    if verbose:
        print("Train/Test split ")
        print(100 - test_size * 100, "% of training data")
        print(test_size * 100, "% of testing data")    

    # split data into train and test sets
    train_examples, test_examples, train_labels, test_labels = sklearn_train_test_split(
        examples, 
        labels, 
        test_size=0.1, 
        random_state=42, 
        shuffle=True
    )

    # transform train and test examples to their corresponding one-hot representations
    train_users = train_examples[:, 0]
    test_users = test_examples[:, 0]

    train_items = train_examples[:, 1]
    test_items = test_examples[:, 1]

    # Final training and test set
    x_train = np.array(list(zip(train_users, train_items)))
    x_test = np.array(list(zip(test_users, test_items)))

    # Drop users which have less than 10 ratings
    x_test_val_counts = pd.Series(x_test[:, 0]).value_counts()
    drop_idx = x_test_val_counts[x_test_val_counts < 10].index
    drop_cond = ~np.isin(x_test[:, 0], drop_idx)
    x_test = x_test[drop_cond]
    
    y_train = train_labels
    y_test = test_labels[drop_cond]

    if verbose:
        print()
        print('number of training examples : ', x_train.shape)
        print('number of training labels : ', y_train.shape)
        print('number of test examples : ', x_test.shape)
        print('number of test labels : ', y_test.shape)

    return (x_train, x_test), (y_train, y_test)

In [32]:
# get examples as tuples of userids and itemids and labels
raw_examples, raw_labels = get_examples(ratings, labels_column='rating')

# train test split
(x_train, x_test), (y_train, y_test) = train_test_split(examples=raw_examples, labels=raw_labels)

In [107]:
np_ratings = np.hstack((x_train, y_train[..., np.newaxis]))
norm_ratings = normalize(np_ratings)
np_ratings = norm_ratings.to_numpy()

**Similarities calculation**

In [48]:
def calculate_similaritites(
    ratings, similarity_between_two: callable, n_neighbours=21
) -> tp.Tuple[np.array, np.array]:
    """Computes correlation for users pairs
    ratings: np.ndarray containing: (user_id, item_id, rating)
    similarity_between_two: function to calculate similarity
    """

    nb_users = np.unique(ratings[:, 0]).size
    similarities = np.full((nb_users, nb_users), -1, dtype=float)
    np.fill_diagonal(similarities, np.inf)

    users = sorted(set(map(int, np_ratings[:, 0])))

    with tqdm.notebook.tqdm(total=len(users) * (len(users) - 1) // 2) as pbar:
        for i in range(len(users)):
            for j in range(i + 1, len(users)):
                sim = similarity_between_two(ratings, users[i], users[j])
                similarities[users[i], users[j]] = sim
                similarities[users[j], users[i]] = sim
                pbar.update()

    assert np.all(
        similarities.T == similarities
    ), "Similarity matrix should be symmetrical"

    # get neighbors by their neighbors in decreasing order of similarities
    neighbors = np.flip(np.argsort(similarities), axis=1)

    # sort similarities in decreasing order
    similarities = np.flip(np.sort(similarities), axis=1)

    return similarities[:, 1:n_neighbours], neighbors[:, 1:n_neighbours]

In [49]:
similarities_jaccard, neighbors_jaccard = calculate_similaritites(np_ratings, jaccard_similarity)

  0%|          | 0/186355 [00:00<?, ?it/s]

In [50]:
similarities_jaccard, neighbors_jaccard

(array([[0.19672131, 0.1890411 , 0.17665615, ..., 0.14977307, 0.14723926,
         0.14647887],
        [0.17777778, 0.16949153, 0.15789474, ..., 0.09150327, 0.09090909,
         0.08928571],
        [0.04411765, 0.03846154, 0.03773585, ..., 0.02739726, 0.02727273,
         0.02702703],
        ...,
        [0.39534884, 0.37254902, 0.32653061, ..., 0.23809524, 0.2375    ,
         0.23728814],
        [0.2605042 , 0.23293608, 0.22642487, ..., 0.15571956, 0.15378671,
         0.14956522],
        [0.10714286, 0.0952381 , 0.08333333, ..., 0.05405405, 0.05376344,
         0.05357143]]),
 array([[312, 329, 265, ..., 216, 606, 353],
        [365, 377, 416, ..., 246,  29, 318],
        [160, 531, 553, ...,  71, 243, 377],
        ...,
        [339, 125, 497, ..., 484, 178,   7],
        [248, 379, 273, ..., 559,  61, 176],
        [458, 580, 299, ...,  64, 600, 122]]))

In [51]:
similarities_dot, neighbors_dot = calculate_similaritites(np_ratings, dot_similarity)

  0%|          | 0/186355 [00:00<?, ?it/s]

In [52]:
similarities_dot, neighbors_dot

(array([[2551.  , 2058.  , 1887.  , ..., 1463.5 , 1385.5 , 1380.  ],
        [ 324.5 ,  321.5 ,  290.25, ...,  214.5 ,  212.  ,  208.25],
        [  92.5 ,   87.75,   87.5 , ...,   55.5 ,   54.  ,   52.25],
        ...,
        [ 285.  ,  269.  ,  260.  , ...,  218.  ,  217.  ,  213.  ],
        [7691.  , 6472.  , 6365.5 , ..., 3376.5 , 3308.  , 3276.25],
        [ 171.  ,  158.25,  145.5 , ...,   97.5 ,   95.  ,   94.25]]),
 array([[413, 598, 287, ...,  44, 306, 602],
        [413, 248, 447, ..., 121, 494, 351],
        [312, 413, 609, ...,   0, 602, 376],
        ...,
        [413,  42, 469, ..., 173, 473,  57],
        [413, 248, 379, ..., 386,  17, 533],
        [338, 482, 304, ...,  14, 121, 317]]))

In [53]:
similarities_pearson, neighbors_pearson = calculate_similaritites(np_ratings, pearson_similarity)

  0%|          | 0/186355 [00:00<?, ?it/s]

In [54]:
similarities_pearson, neighbors_pearson

(array([[0.55741759, 0.55050629, 0.52861097, ..., 0.32074052, 0.3146402 ,
         0.29163171],
        [0.17516731, 0.15711062, 0.10913393, ..., 0.06405239, 0.0589808 ,
         0.05827584],
        [0.13600822, 0.11343786, 0.10670146, ..., 0.05241202, 0.05196966,
         0.05152555],
        ...,
        [0.24754421, 0.23682837, 0.17881734, ..., 0.11594025, 0.1155362 ,
         0.11543306],
        [0.63829059, 0.59091475, 0.54222693, ..., 0.32411931, 0.32384218,
         0.3182407 ],
        [0.13305503, 0.0975438 , 0.09459207, ..., 0.06      , 0.05835638,
         0.05800152]]),
 array([[451, 121, 596, ..., 233,  44, 533],
        [609, 110, 479, ..., 536, 550,   9],
        [598,  67,  27, ..., 327, 524, 291],
        ...,
        [599,  53, 346, ..., 329, 132, 306],
        [121,  94, 361, ..., 414, 185, 248],
        [482, 572, 494, ..., 153, 476, 458]]))

### Part 2
Implement collaborative filtering schemes based on similarities:
1. Simple nearest-neighbour averaging
2. Averaging taking into account the mean correction

#### 1. Simple nearest-neighbour averaging

\begin{equation}
     \hat{r}_{u,i}=\frac{\sum_{v\in G_u}r_{v,i}\cdot{w_{u,v}}}{\sum_{v\in G_u}|w_{u,v}|}.
\end{equation}

In [55]:
def avg_predict(np_ratings, similarities, neighbours, mean, userid, itemid):
    """
    predict what score userid would have given to itemid.
    
    :param
        - userid : user id for which we want to make prediction
        - itemid : item id on which we want to make prediction
        
    :return
        - r_hat : predicted rating of user userid on item itemid
    """
    user_similarities = similarities[userid]
    user_neighbors = neighbours[userid]

    # find users who rated item 'itemid'
    iratings = np_ratings[np_ratings[:, 1].astype('int') == itemid]
    
    # find similar users to 'userid' who rated item 'itemid'
    suri = iratings[np.isin(iratings[:, 0], user_neighbors)]

    # similar users who rated current item
    indexes = [np.where(user_neighbors == uid)[0][0] for uid in suri[:, 0].astype('int')]
    sims = user_similarities[indexes]
    
    num = np.dot(suri[:, 2], sims)
    den = np.sum(np.abs(sims))
    
    if num == 0 or den == 0:
        return mean[userid]
    
    r_hat = np.dot(suri[:, 2], sims) / np.sum(np.abs(sims))
    
    return r_hat

#### 2. Averaging taking into account the mean correction

\begin{equation}
    \hat{r}_{u,i}=\bar{r}_u + \frac{\sum_{v\in G_u}(r_{v,i}-\bar{r}_v)\cdot{w_{u,v}}}{\sum_{v\in G_u}|w_{u,v}|}.
\end{equation}

In [112]:
# mean ratings for each user
ratings = pd.DataFrame(np_ratings)
ratings.columns = ['userid', 'itemid', 'rating', 'norm_rating']
ratings.drop(columns=['norm_rating'], inplace=True)
mean = ratings.groupby(by='userid', as_index=False)['rating'].mean()
mean_ratings = pd.merge(ratings, mean, suffixes=('','_mean'), on='userid')

# normalized ratings for each items
mean_ratings['norm_rating'] = mean_ratings['rating'] - mean_ratings['rating_mean']

mean = mean.to_numpy()[:, 1]

In [113]:
np_ratings = mean_ratings.to_numpy()

In [116]:
def mean_avg_predict(np_ratings, similarities, neighbours, mean, userid, itemid):
    """
    predict what score userid would have given to itemid.
    
    :param
        - userid : user id for which we want to make prediction
        - itemid : item id on which we want to make prediction
        
    :return
        - r_hat : predicted rating of user userid on item itemid
    """
    user_similarities = similarities[userid]
    user_neighbors = neighbours[userid]

    # find users who rated item 'itemid'
    iratings = np_ratings[np_ratings[:, 1].astype('int') == itemid]
    
    # find similar users to 'userid' who rated item 'itemid'
    suri = iratings[np.isin(iratings[:, 0], user_neighbors)]

    # similar users who rated current item
    indexes = [np.where(user_neighbors == uid)[0][0] for uid in suri[:, 0].astype('int')]
    sims = user_similarities[indexes]

    num = np.dot(suri[:, 4], sims)
    den = np.sum(np.abs(sims))
    
    if num == 0 or den == 0:
        return mean[userid]
    
    r_hat = mean[userid] + np.dot(suri[:, 4], sims) / np.sum(np.abs(sims))
    
    return r_hat

In [117]:
mean_avg_predict(np_ratings, similarities_pearson, neighbors_pearson, mean, 120, 4)

2.9166036756096516

### Part 3
1. Build recommendations for users from the validation part
2. Select 10 to 50 of your favourite films (can be exported from Kinopoisk or IMDB)
3. Calculate the top 10 recommendations for each of the 6 methods.

**Find candidate items**

In [61]:
def find_candidate_items(ratings, neighbours, userid, n=10):
    """
    Find candidate items for an active user
    
    :param userid : active user
    :param neighbors : users similar to the active user        
    :return candidates : top n of candidate items
    """
    user_neighbors = neighbours[userid]
    activities = ratings.loc[ratings.userid.isin(user_neighbors)]
    
    # sort items in decreasing order of frequency
    frequency = activities.groupby('itemid')['rating'].count().reset_index(name='count').sort_values(['count'], ascending=False)
    Gu_items = frequency.itemid
    active_items = ratings.loc[ratings.userid == userid].itemid.to_list()
    candidates = np.setdiff1d(Gu_items, active_items, assume_unique=True)[:n]
        
    return candidates

**Predictions**

In [62]:
def user2userPredictions(
        users, ratings, np_ratings, similarities, neighbours, predict, n=10
    ):
    """
    Make rating prediction for all users on each candidate item
    """    

    # loop over users to make predictions
    preds = (
        (
            userid, 
            itemid, 
            predict(
                np_ratings, 
                similarities, 
                neighbours, 
                mean, 
                userid, 
                itemid
            )
        )
        for userid in np.unique(users)
        for itemid in find_candidate_items(ratings, neighbours, userid, n)
    )

    preds = pd.DataFrame(
            dict(zip(("userid", "itemid", "predicted_rating"), zip(*preds)))
        )

    return pd.merge(preds, movies[['itemid', 'title']], on='itemid', how='inner')

**Predictions for users from validation part**

In [119]:
preds = user2userPredictions(x_test[:, 0], ratings, np_ratings, similarities_jaccard, neighbors_jaccard, mean_avg_predict, n=10)

In [120]:
preds.sort_values("predicted_rating", ascending=False).groupby("userid").apply(print)

    userid  itemid  predicted_rating  \
44       0   507.0          4.995378   
40       0  1210.0          4.719032   
35       0  2096.0          4.495931   
0        0   337.0          4.294547   
18       0   334.0          3.726520   

                                                title  
44                            Perfect World, A (1993)  
40  Star Wars: Episode VI - Return of the Jedi (1983)  
35                             Sleeping Beauty (1959)  
0                  What's Eating Gilbert Grape (1993)  
18                        Vanya on 42nd Street (1994)  
     userid  itemid  predicted_rating  \
216       3   277.0          4.515797   
92        3   520.0          4.338936   
267       3   908.0          4.297460   
137       3  2144.0          4.168828   
75        3    46.0          4.094967   
167       3   910.0          3.679265   
212       3    32.0          3.662337   
120       3   835.0          3.645856   

                                         title  
216 

**Evaluation with NDCG**

In [76]:
from sklearn.metrics import ndcg_score

In [77]:
def evaluate(x_test, y_test, np_ratings, similarities, neighbours, predict):
    print('Evaluate the model on {} test data ...'.format(x_test.shape[0]))
    users = np.unique(x_test[:, 0])
    ndcg_scores = []
    for u in users:
        items = x_test[x_test[:, 0] == u, 1]
        y_score = [list(predict(np_ratings, similarities, neighbours, mean, u, i) for i in items)]
        y_true = [y_test[x_test[:, 0] == u]]
        ndcg_scores.append(ndcg_score(y_true, y_score, k=10))

    print('\nAverage NDCG :', np.mean(ndcg_scores))

**Simple nearest-neighbour averaging with jaccard similarity**

In [121]:
evaluate(x_test, y_test, np_ratings, similarities_jaccard, neighbors_jaccard, avg_predict)

Evaluate the model on 8560 test data ...

Average NDCG : 0.8700162755969616


**Simple nearest-neighbour averaging with dot product similarity**

In [122]:
evaluate(x_test, y_test, np_ratings, similarities_dot, neighbors_dot, avg_predict)

Evaluate the model on 8560 test data ...

Average NDCG : 0.8846237502700076


**Simple nearest-neighbour averaging with pearson similarity**

In [123]:
evaluate(x_test, y_test, np_ratings, similarities_pearson, neighbors_pearson, avg_predict)

Evaluate the model on 8560 test data ...

Average NDCG : 0.8731575485756966


**Averaging taking into account the mean correction with jaccard similarity**

In [124]:
evaluate(x_test, y_test, np_ratings, similarities_jaccard, neighbors_jaccard, mean_avg_predict)

Evaluate the model on 8560 test data ...

Average NDCG : 0.871534803820199


**Averaging taking into account the mean correction with dot product similarity**

In [125]:
evaluate(x_test, y_test, np_ratings, similarities_dot, neighbors_dot, mean_avg_predict)

Evaluate the model on 8560 test data ...

Average NDCG : 0.8838777529346217


**Averaging taking into account the mean correction with pearson similarity**

In [126]:
evaluate(x_test, y_test, np_ratings, similarities_pearson, neighbors_pearson, mean_avg_predict)

Evaluate the model on 8560 test data ...

Average NDCG : 0.8699603697286178


### Calculating the top 10 recommendations using each of the 6 methods for my user profile

In [127]:
user = np_ratings[:, 0].astype(int).max()
user

610

**Simple nearest-neighbour averaging with jaccard similarity**

In [128]:
preds = user2userPredictions(user, ratings, np_ratings, similarities_jaccard, neighbors_jaccard, avg_predict, n=10)
preds.sort_values("predicted_rating", ascending=False)

Unnamed: 0,userid,itemid,predicted_rating,title
3,610,461.0,4.496008,Go Fish (1994)
5,610,7022.0,4.382666,Battle Royale (Batoru rowaiaru) (2000)
1,610,3633.0,4.224269,On Her Majesty's Secret Service (1969)
0,610,1938.0,4.181072,"Lost Weekend, The (1945)"
2,610,4131.0,4.157123,Making Mr. Right (1987)
4,610,277.0,4.006365,Miracle on 34th Street (1994)


**Simple nearest-neighbour averaging with dot product similarity**

In [129]:
preds = user2userPredictions(user, ratings, np_ratings, similarities_dot, neighbors_dot, avg_predict, n=10)
preds.sort_values("predicted_rating", ascending=False)

Unnamed: 0,userid,itemid,predicted_rating,title
6,610,6755.0,4.574293,Bubba Ho-tep (2002)
0,610,277.0,4.438715,Miracle on 34th Street (1994)
1,610,8045.0,4.399481,Hamburger Hill (1987)
5,610,43.0,4.352351,Restoration (1995)
7,610,1502.0,4.348523,Kissed (1996)
2,610,4153.0,4.315848,Down to Earth (2001)
8,610,3633.0,4.205886,On Her Majesty's Secret Service (1969)
4,610,7022.0,4.164045,Battle Royale (Batoru rowaiaru) (2000)
3,610,4131.0,4.116172,Making Mr. Right (1987)


**Simple nearest-neighbour averaging with pearson similarity**

In [130]:
preds = user2userPredictions(user, ratings, np_ratings, similarities_pearson, neighbors_pearson, avg_predict, n=10)
preds.sort_values("predicted_rating", ascending=False)

Unnamed: 0,userid,itemid,predicted_rating,title
7,610,314.0,4.615036,"Secret of Roan Inish, The (1994)"
8,610,277.0,4.598156,Miracle on 34th Street (1994)
4,610,6755.0,4.564343,Bubba Ho-tep (2002)
0,610,1938.0,4.540696,"Lost Weekend, The (1945)"
6,610,4153.0,4.389718,Down to Earth (2001)
3,610,3633.0,4.224675,On Her Majesty's Secret Service (1969)
5,610,2077.0,4.220775,"Journey of Natty Gann, The (1985)"
1,610,4131.0,4.088517,Making Mr. Right (1987)
2,610,3189.0,3.903925,My Dog Skip (1999)


**Averaging taking into account the mean correction with jaccard similarity**

In [131]:
preds = user2userPredictions(user, ratings, np_ratings, similarities_jaccard, neighbors_jaccard, mean_avg_predict, n=10)
preds.sort_values("predicted_rating", ascending=False)

Unnamed: 0,userid,itemid,predicted_rating,title
3,610,461.0,5.079314,Go Fish (1994)
5,610,7022.0,5.023073,Battle Royale (Batoru rowaiaru) (2000)
1,610,3633.0,4.979636,On Her Majesty's Secret Service (1969)
2,610,4131.0,4.941626,Making Mr. Right (1987)
0,610,1938.0,4.919349,"Lost Weekend, The (1945)"
4,610,277.0,4.799313,Miracle on 34th Street (1994)


**Averaging taking into account the mean correction with dot product similarity**

In [132]:
preds = user2userPredictions(user, ratings, np_ratings, similarities_dot, neighbors_dot, mean_avg_predict, n=10)
preds.sort_values("predicted_rating", ascending=False)

Unnamed: 0,userid,itemid,predicted_rating,title
6,610,6755.0,5.504793,Bubba Ho-tep (2002)
0,610,277.0,5.481873,Miracle on 34th Street (1994)
1,610,8045.0,5.430696,Hamburger Hill (1987)
5,610,43.0,5.349343,Restoration (1995)
7,610,1502.0,5.345773,Kissed (1996)
2,610,4153.0,5.309285,Down to Earth (2001)
8,610,3633.0,5.237747,On Her Majesty's Secret Service (1969)
4,610,7022.0,5.156859,Battle Royale (Batoru rowaiaru) (2000)
3,610,4131.0,5.138244,Making Mr. Right (1987)


**Averaging taking into account the mean correction with pearson similarity**

In [133]:
preds = user2userPredictions(user, ratings, np_ratings, similarities_pearson, neighbors_pearson, mean_avg_predict, n=10)
preds.sort_values("predicted_rating", ascending=False)

Unnamed: 0,userid,itemid,predicted_rating,title
8,610,277.0,5.54999,Miracle on 34th Street (1994)
7,610,314.0,5.459759,"Secret of Roan Inish, The (1994)"
0,610,1938.0,5.404431,"Lost Weekend, The (1945)"
4,610,6755.0,5.402233,Bubba Ho-tep (2002)
6,610,4153.0,5.270895,Down to Earth (2001)
3,610,3633.0,5.116891,On Her Majesty's Secret Service (1969)
5,610,2077.0,5.106807,"Journey of Natty Gann, The (1985)"
1,610,4131.0,4.980733,Making Mr. Right (1987)
2,610,3189.0,4.809725,My Dog Skip (1999)
