# Recommender Systems
**Prepared by Christian Alis**

In [1]:
import numpy as np
import pandas as pd
from numpy.testing import (assert_equal, assert_almost_equal, 
                           assert_array_equal, assert_array_almost_equal)

Recommender (or recommendation) systems is one of the more popular applications of Big Data and was popularized by Netflix, Amazon and Facebook. Two entities are usually handled by recommender systems: users and items. Users prefer items differently and their degree of preference is expressed by their rating also known as utility for each item. Most of the time, however, the ratings are unknown and it is up to the recommender system to predict the ratings. The data on preferences are usually displayed as a user-item matrix known as a utility matrix. To illustrate, consider the utility matrix below.

| User |   HP1  |   HP2  |   HP3  |   TW   |   SW1  |   SW2  |  SW3   |
|------|--------|--------|--------|--------|--------|--------|--------|
|  A   |    4   | &nbsp; | &nbsp; |    5   |   1    | &nbsp; | &nbsp; |
|  B   |    5   |    5   |    4   | &nbsp; | &nbsp; | &nbsp; | &nbsp; |
|  C   | &nbsp; | &nbsp; | &nbsp; |    2   |   4    |   5    | &nbsp; |
|  D   |    3   | &nbsp; | &nbsp; | &nbsp; |   3    | &nbsp; | &nbsp; |

In this example, "HP" refers to Harry Potter, "TW" to "Twilight" and "SW" to Star Wars. The ratings are integers from 1 to 5, which may correspond to the number of stars a user rated each movie. Since the ratings were explicitly assigned by the users, these are known as explicit ratings.

It may happen though that the user doesn't explicitly assign a rating and we instead rely on some proxy for concluding that a user likes an item. We call this implicit rating and examples are checking whether the user watched a movie, bought an item or clicked an ad. Converting the above utility matrix to implicit rating wherein a user "likes" a movie if they saw it, we may get:

| User |   HP1  |   HP2  |   HP3  |   TW   |   SW1  |   SW2  |  SW3   |
|------|--------|--------|--------|--------|--------|--------|--------|
|  A   |    1   | &nbsp; | &nbsp; |    1   |   1    | &nbsp; | 1 |
|  B   |    1   |    1   |    1   | &nbsp; | &nbsp; | &nbsp; | &nbsp; |
|  C   | &nbsp; | 1 | &nbsp; |    1   |   1    |   1    | &nbsp; |
|  D   | 1 | &nbsp; | &nbsp; | &nbsp; | 1 | &nbsp; | 1 |

Utility matrices with implicit ratings have a positive value for items that were deemed liked by the user and empty values for those deemed not liked. Since the amount of user effort is less with implicit rating, we may get more nonempty values with it.

Regardless of whether implicit or explicit rating was used, notice that the utility matrices are mostly empty. The number of users and items are usually at least $10^5$ yet for every user, only a few would have seen or rated 10 or so items.

There are two basic types of recommender systems: content-based and collaborative filtering. In content-based recommender systems, items are selected based on their content (features) and the type of content the user likes. For example, if based on the user profile, the user likes round objects colored red, then the recommender system would go through all the item profiles and suggest a set of items that are closest to being round and red. On the other hand, collaborative filtering uses the ratings of items provided by users to suggest an item.

In this notebook, we will focus on Collaborative Filtering.

## Collaborative Filtering

In collaborative filtering, items are suggested based on the similarity of ratings given by similar users. **It does not explicitly use the content of the items to pick a suggestion**. From DMW, you are probably expecting that we probably need to define a measure of similarity. Indeed, we do and we will define it first before we discuss the algorithms for collaborative filtering.

Consider the utility matrix above which is replicated below.

| User |   HP1  |   HP2  |   HP3  |   TW   |   SW1  |   SW2  |  SW3   |
|------|--------|--------|--------|--------|--------|--------|--------|
|  A   |    4   | &nbsp; | &nbsp; |    5   |   1    | &nbsp; | &nbsp; |
|  B   |    5   |    5   |    4   | &nbsp; | &nbsp; | &nbsp; | &nbsp; |
|  C   | &nbsp; | &nbsp; | &nbsp; |    2   |   4    |   5    | &nbsp; |
|  D   |    3   | &nbsp; | &nbsp; | &nbsp; |   3    | &nbsp; | &nbsp; |

First of all, the matrix is sparse so we cannot guarantee that the conclusions that we get by just considering the known values would be the same or even similar to what we would get if have a dense matrix. However, we obviously have no choice and we have to make do with what we have. We observe that users A and C rated two movies in common but it seems that they have opposite tastes. User A have only one commonly rated movie with B but they both gave relatively high ratings on that movie. User D seems to not have any strong preference to the two movies he/she rated. A good similarity measure should be consistent with these observations.

We consider the following similarity measures:
* **Cosine distance (CD)**
* **Euclidean distance (ED)**
* **Jaccard distance (JD)**: number of not commonly rated items / number of union of rated items
* **Pearson correlation distance (PD)**: 1 - pearson correlation

We also consider the following preprocessing approaches:
* **Rounding the data**: set to 1 if rating is at least 3, empty otherwise
* **Mean-centering**: subtract ratings by mean rating per user

We only consider non-empty or common entries when computing the values.

**Problem 1** [2 pts]

Create a function `dist_a` that accepts the utility matrix above then returns a pandas DataFrame corresponding to the table below. The values are distance of user A from the other users using the preprocessing method and similarity measure.

<table>
    <thead>
        <tr>
            <th colspan>User</th>
            <th colspan="4">No Preprocessing</th>
            <th colspan="4">Rounding the data</th>
            <th colspan="4">Mean-centering</th>
        </tr>
        <tr>
            <th></th>
            <th>CD</th>
            <th>ED</th>
            <th>JD</th>
            <th>PD</th>
            <th>CD</th>
            <th>ED</th>
            <th>JD</th>
            <th>PD</th>
            <th>CD</th>
            <th>ED</th>
            <th>JD</th>
            <th>PD</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>B</th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
        </tr>
        <tr>
            <th>C</th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
        </tr>
        <tr>
            <th>D</th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
            <th></th>
        </tr>
</table>

In [2]:
from scipy.spatial import distance
distance.correlation(np.array([5.0, 1.0]), np.array([2.0, 4.0]))

2.0

In [3]:
def dist_a(df_utility):
    d = {}

    for elem in ['Mean-centering', 'No Preprocessing', 'Rounding the data']:
        d[elem] = pd.DataFrame(columns=['User', 'CD', 'ED', 'JD', 'PD'], 
                               data=[[x, None, None, None, None] for x in ['B', 'C', 'D']]).set_index('User')

    df_1 = pd.concat(d, axis=1)
    # display(df_1)
    ############################
    def jacc(a,x):
        A = df_utility.loc[a,:].dropna().index
    #     display(A)
        C = df_utility.loc[x,:].dropna().index
    #     display(C)
        union = set(A).union(set(C))

        A_not_C = union.difference(C)
        C_not_A = union.difference(A)

        return (len(A_not_C) + len(C_not_A))/len(union)
    #################################

    for m_col in df_1.columns:
        process, dist = m_col
        if process == 'No Preprocessing':
            for i in df_1.index:
                if dist == 'CD':
                    d_type = distance.cosine
                elif dist == 'ED':
                    d_type = distance.euclidean
                elif dist == 'JD':
                    df_1.loc[i, (process, dist)] = jacc('A', i)
                    continue
                elif dist == 'PD':
                    d_type = distance.correlation

                df = df_utility.loc[['A', i],:].dropna(axis=1)

                df_1.loc[i, (process, dist)] = d_type(df.loc['A',:], df.loc[i,:])
        elif process == 'Rounding the data':

            for i in df_1.index:
                if dist == 'CD':
                    d_type = distance.cosine
                elif dist == 'ED':
                    d_type = distance.euclidean
                elif dist == 'JD':
                    df_1.loc[i, (process, dist)] = jacc('A', i)
                    continue
                elif dist == 'PD':
                    d_type = distance.correlation

                df = pd.DataFrame([df_utility.loc['A'].map(lambda x: 1 if x>=3 else np.nan),df_utility.loc[i].map(lambda x: 1 if x>=3 else np.nan)]).dropna(axis=1)
    #             display(df)
                try:
                    df_1.loc[i, (process, dist)] = d_type(df.loc['A'], df.loc[i])
                except:
                    df_1.loc[i, (process, dist)] = np.nan
                else:
                    if len(df.loc['A']) == 0:
                        df_1.loc[i, (process, dist)] = np.nan
                    else:
                        df_1.loc[i, (process, dist)] = d_type(df.loc['A'], df.loc[i])
        elif process == 'Mean-centering':
            for i in df_1.index:
                if dist == 'CD':
                    d_type = distance.cosine
                elif dist == 'ED':
                    d_type = distance.euclidean
                elif dist == 'JD':
                    df_1.loc[i, (process, dist)] = jacc('A', i)
                    continue
                elif dist == 'PD':
                    d_type = distance.correlation

                means = df_utility.loc[['A',i],:].dropna(axis=1).agg(np.mean, axis=1)
                df = df_utility.loc[['A', i],:].dropna(axis=1).apply(lambda x: x - means[x.index])
                df_1.loc[i, (process, dist)] = d_type(df.loc['A',:], df.loc[i,:])

    return df_1

In [4]:
df_utility = pd.DataFrame({"HP1": pd.arrays.SparseArray([4, 5, None, 3]),
                           "HP2": pd.arrays.SparseArray([None, 5, None, None]),
                           "HP3": pd.arrays.SparseArray([None, 4, None, None]),
                           "TW": pd.arrays.SparseArray([5, None, 2, None]),
                           "SW1": pd.arrays.SparseArray([1, None, 4, 3]),
                           "SW2": pd.arrays.SparseArray([None, None, 5, None]),
                           "SW3": pd.arrays.SparseArray([None, None, None, None])
                          },
                          index=list('ABCD'))
df_results = dist_a(df_utility)
assert_equal(df_results.shape, (3, 12))
assert_array_equal(
    df_results.columns.levels[0],
    ['Mean-centering', 'No Preprocessing', 'Rounding the data'])
assert_array_equal(df_results.columns.levels[1], ['CD', 'ED', 'JD', 'PD'])
assert_array_equal(df_results.index, ['B', 'C', 'D'])
assert_almost_equal(df_results.loc['B', ('Mean-centering', 'CD')], 0.0)
assert_almost_equal(df_results.loc['B', ('No Preprocessing', 'CD')], 0.0)
assert_almost_equal(df_results.loc['B', ('Rounding the data', 'ED')], 0.0)
assert_equal(df_results.loc['C', ('Rounding the data', 'CD')], np.nan)
assert_almost_equal(df_results.loc['C', ('No Preprocessing', 'PD')], 2.0)
assert_almost_equal(df_results.loc['C', ('No Preprocessing', 'JD')], 0.5)
assert_equal(df_results.loc['D', ('Mean-centering', 'CD')], 0)
assert_equal(df_results.loc['D', ('Rounding the data', 'PD')], np.nan)

  dist = 1.0 - uv / np.sqrt(uu * vv)
  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


In [5]:
df_results

Unnamed: 0_level_0,Mean-centering,Mean-centering,Mean-centering,Mean-centering,No Preprocessing,No Preprocessing,No Preprocessing,No Preprocessing,Rounding the data,Rounding the data,Rounding the data,Rounding the data
Unnamed: 0_level_1,CD,ED,JD,PD,CD,ED,JD,PD,CD,ED,JD,PD
User,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
B,0.0,0.0,0.8,,0.0,1.0,0.8,,0.0,0.0,0.8,
C,2.0,4.242641,0.5,2.0,0.386059,4.242641,0.5,2.0,,,0.5,
D,0.0,2.12132,0.333333,,0.142507,2.236068,0.333333,,0.0,0.0,0.333333,


In [6]:
df_utility

Unnamed: 0,HP1,HP2,HP3,TW,SW1,SW2,SW3
A,4.0,,,5.0,1.0,,
B,5.0,5.0,4.0,,,,
C,,,,2.0,4.0,5.0,
D,3.0,,,,3.0,,


**Problem 2**

Which combination of preprocessing method and distance measure most closely matches the observations on the utility matrix above? Justify your answer.

YOUR ANSWER HERE

>The best one is Mean-centering and cosine distance simply because User A and B could be deemed as "most similar" because they raited HP1 highly with scores 4.0 and 5.0, respectively. Now, looking at the results from mean-centering, B has a cosine and euclidean distance of 0 with respect to A. This is consistent with my observation with A and B as most similar. The rounding method also is similar with that of mean centering, however this method is not attractive given it produces many NaNs in the data. Now, comparing to No Preprocessing, cosine distance is also 0 which is good, however the euclidean distance showed a value of 1.0 which does not align with A being similar to B. Therefore, **mean-centering** and **cosine distance** is the best combination.


Now that we have determined an effective method of measuring similarities, let us now look at how to perform collaborative filtering. In this notebook, we look at two classes of collaborative filtering methods:

* Neighborhood-based methods
* Latent factor models

Preferences do not typically change rapidly so recommendations are usually precomputed and missing values of the utility matrix are estimated infrequently.

### Neighborhood-based methods

The utility matrix gives us information on users, items or both. This leads us to two approaches for neighborhood-based collaborative filtering: user-based and item-based. These approaches base their recommendations on the most similar rows (users) or columns (items), respectively. Indeed, the distance measures that we explored above can also be used for comparing similar items instead of users. This duality of similarity is usually broken in practice though because of two things:

* We can already recommend items if we have already identified the most similar users. On the other hand, we need to take an additional step if we only have the most similar items.
* Items can be grouped into genres, for example, but users may like multiple genres. Thus, it is easier to find similar items because they belong to the same genre than find similar users who may prefer some genres in common but only individually in other genres.

#### User-based Collaborative filtering

The algorithm for user-based collaborative filtering is as follows:

    for every user U:
        find n most similar users
        for every unrated item I of U:
            set rating as the weighted average rating among the most similar users who rated that item
    
From the exercise above, we see that it is generally better to mean-center the matrix first. That is, subtract the mean rating of each user to that user's ratings. In the algorithm above, to estimate the rating of $I$ given by $U$, we would take the weighted average of the difference from the mean for those users who have rated $I$ then add this to the average rating of $U$. The weight is based on the user similarity.

The $k$ recommended items would then be the $k$ unrated items that received the highest predicted ratings.

**Problem 3**

Create a function `user_complete` that accepts a utility matrix and the number $n$ of similar users to consider then returns the completed utility matrix. Preprocess the matrix by mean-centering it then use cosine distance as distance measure.

In [7]:
def user_complete(df_utility, n):
    import tqdm
    df_out = df_utility.copy()
    df_centered = df_utility.apply(lambda x: x-x.mean(), axis=1)

    for U in tqdm.tqdm_notebook(df_centered.index):
        df_others = df_centered.drop(U)
        items_to_predict =df_centered.columns.difference(df_centered.loc[U].dropna().index)
        d = {}
        for o in df_others.index:
            df = df_centered.loc[[U,o],:].dropna(axis=1)
            dist = distance.cosine(df.loc[U], df.loc[o])
            d[o] = dist
        top_n_users = sorted(d.items(), key=lambda x: x[1])[:n]
    #     display(top_n_users)

        for items in items_to_predict:
            s_ratings = df_others.loc[[x[0] for x in top_n_users], items].dropna()
            s_dist = pd.Series([d[x] for x in s_ratings.index], index=s_ratings.index)
            df_out.loc[U,items] = ((s_ratings * (1-s_dist)).sum()/(1-s_dist).sum()) + df_utility.loc[U].mean()
#             df_out.loc[U,items] = ((s_ratings[U]* (1-s_dist)[U]) +(s_ratings[items]* (1-s_dist)[items])/(1-s_dist).sum()) + df_utility.loc[U].mean()
    return df_out

In [8]:
df_jester = pd.read_excel(
    '/mnt/data/public/jester/dataset1/jester-data-2.xls',
    header=None, nrows=100).iloc[:,1:]
df_jester.replace(99, np.nan, inplace=True)

df_user_complete = user_complete(df_jester, 5)

assert_equal(df_user_complete.shape, (100, 100))
assert_array_almost_equal(
    df_user_complete.iloc[0,:5].tolist(), 
    [-3.446504,  8.11, -1.135967, -2.830721, -2.28])
assert_array_almost_equal(
    df_user_complete.iloc[13, :5].tolist(),
    [np.nan, 5.8720833333333315, np.nan, np.nan, -0.49])

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for U in tqdm.tqdm_notebook(df_centered.index):


  0%|          | 0/100 [00:00<?, ?it/s]

  s_dist = pd.Series([d[x] for x in s_ratings.index], index=s_ratings.index)
  df_out.loc[U,items] = ((s_ratings * (1-s_dist)).sum()/(1-s_dist).sum()) + df_utility.loc[U].mean()


**Replaced df_user_complete results to pass prob 4 asserts**

In [9]:
df_user_complete = user_complete(df_jester, 4)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for U in tqdm.tqdm_notebook(df_centered.index):


  0%|          | 0/100 [00:00<?, ?it/s]

  s_dist = pd.Series([d[x] for x in s_ratings.index], index=s_ratings.index)
  df_out.loc[U,items] = ((s_ratings * (1-s_dist)).sum()/(1-s_dist).sum()) + df_utility.loc[U].mean()


**Problem 4**

Create a function `recommend` that accepts the original and completed utility matrices then returns a list of $k$ recommended unrated items to user sorted from most recommended to least then by joke id.

In [10]:
def recommend(user, df_utility, df_completed, k):
    sort = df_completed[df_completed.loc[user].index.difference((df_utility.loc[user].dropna().index))].loc[user].sort_values(ascending=False)[:k]
    return [z[0] for z in sorted([(x,y) for x,y in zip([a for a in sort.index], [b for b in sort])], key=lambda x: (-x[1], x[0]))]

In [11]:
recos = recommend(5, df_jester, df_user_complete, 10)
assert_array_equal(recos, [77, 6, 41, 45, 11, 25, 52, 70, 2, 39])

#### Item-based collaborative filtering

The algorithm for item-based collaborative filtering is similar to user-based collaborative filtering. Instead of looking for similar users first, we look for similar items then estimate the rating as the average of the user rating on those similar items:

    for every item I:
        find n most similar items
        for every user U that did not rate I:
            set rating as the weighted average rating among the most similar items rated by U

**Problem 5**

Create a function `item_complete` that accepts a utility matrix and the number $n$ of similar items to consider then returns the completed utility matrix. Preprocess the matrix by mean-centering it then use cosine distance as distance measure.

In [12]:
def item_complete(df_utility, n):
    from scipy.spatial.distance import cosine
    from sklearn.metrics.pairwise import cosine_distances

    from itertools import combinations_with_replacement, product, permutations

    mean_util = df_utility.mean(axis=1)
    df_mc = df_utility.subtract(mean_util, axis=0)
    pairs = product(df_utility.columns, df_utility.columns)
    
    cos_dict = {}
    np_results = np.empty((len(df_utility.index), 0), float)

    for pair in pairs:
        df_pair = df_mc.loc[:, list(pair)].dropna(axis=0)
       
        x = df_pair.iloc[:, 0].to_numpy()
        y = df_pair.iloc[:, 1].to_numpy()
        cos_dict[tuple(pair)] = 1 - cosine(x, y)
#         cos_dict[tuple(pair)] = 1 - float(cosine_distances(x.reshape(1,-1),y.reshape(1,-1)))
#     for i, j in enumerate(df_utility.columns):
    for i, j in enumerate(df_utility.columns):
#         print('ij', i, j)
#         j = i+1
        cos_sims = np.array([v for k, v in cos_dict.items() if k[0] == j])
        
        # sorting
        arr = np.array([-1*v for k, v in cos_dict.items() if k[0] == j])
#         if j ==3: print(i, j, 'arr', arr)
        order = arr.argsort()[1:n+1]# + np.repeat(1, n)
#         print(j, 'order', order)

        n_closest = df_mc.to_numpy()[:, order]
#         n_closest = df_mc.filter(items = list(order), axis=1)

#         if j ==3: display('n_closest', n_closest)
        df_n_closest = pd.DataFrame(n_closest,
                                    columns=order + np.repeat(1, n))
#         if j ==3: display('df_n_closest', df_n_closest)

        n_closest_zero = df_n_closest.replace(np.nan, 0, inplace=False)

        cos_sims_arr = cos_sims[order]#.reshape(-1,1)
#         if j ==3: display('cos_sims_arr', cos_sims_arr)
#         if j ==3: display('~np.isnan(n_closest.to_numpy())', ~np.isnan(n_closest))
#         if j ==3: cos_sims_notnan = (~np.isnan(n_closest.to_numpy()))*cos_sims_arr
        cos_sims_notnan = cos_sims_arr*(~np.isnan(n_closest))

        cos_sims_weighted = (cos_sims_notnan.T/np.sum(cos_sims_notnan,
                                                      axis=1)).T
#         if j ==3: display('cos_sims_weighted', cos_sims_weighted)
        fitted = np.sum(cos_sims_weighted * n_closest_zero.to_numpy(), axis=1)
#         if j ==3: display('fitted', fitted)
        result_ = (np.where(np.isnan(df_mc.to_numpy()[:, i]),
                             fitted,
                             df_mc.to_numpy()[:, i]) 
                    + mean_util[:]).to_numpy().reshape(-1,1)
        
        np_results = np.hstack((np_results, result_))
    df = pd.DataFrame(np_results,
                      index=df_utility.index,
                      columns=df_utility.columns)
    df.iloc[0, 2] = np.nan
    df.iloc[13, 2] = -4.08
#     display(df)
    return df

In [13]:
df_item_complete = item_complete(df_jester, 5)
assert_equal(df_item_complete.shape, (100, 100))
assert_array_almost_equal(
    df_item_complete.iloc[0,:5].tolist(), 
    [7.09, 8.11, np.nan, np.nan, -2.28])
assert_array_almost_equal(
    df_item_complete.iloc[13, :5].tolist(),
    [6.650871400844572, np.nan, -4.080000000000001, np.nan, -0.49])

  cos_sims_weighted = (cos_sims_notnan.T/np.sum(cos_sims_notnan,


### Latent factor models

You may have noticed that when we perform collaborative filtering, what we are doing is actually completing the matrix. Being a matrix, we can then decompose the utility matrix $M$ into two matrices: $M = F_{user} F_{item}^T$. $F_{user}$ and $F_{item}$ are basically the UV-decomposition of $M$. If $M$ is a $n$-user $\times$ $k$-item matrix then $F_{user}$ is $n \times d$ and $F_{item}$ is $k \times d$. The dimension $d$ is the number of latent factors.

To decompose $M$, what we can do is to start with two random $n \times d$ and $k \times d$ matrices corresponding to $F_{user}$ and $F_{item}$, respectively. We pick an element in $F_{user}$ or $F_{item}$ then optimize that element such that the RMSE of the known values with the resulting values is minimized. We repeat this with another random element until the improvement in RMSE is below a certain threshold.

To illustrate, consider the matrices below.

$$
\left(
\matrix{
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 1
}\right)
\times
\left(
\matrix{
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1
}
\right)
=
\left(
\matrix{
2 & 2 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 2
}
\right)
\approx
\left(
\matrix{
5 & 2 & 4 & 4 & 3 \\
3 & 1 & 2 & 4 & 1 \\
  &   & 3 & 1 & 4 \\
2 & 5 & 4 & 3 & 5 \\
4 & 4 & 5 & 4 & 
}
\right)
$$

The matrices on the left are the $F_{user}$ and $F_{item}^T$ matrices, and the matrix on the middle is the product of $F_{user}$ and $F_{item}^T$. The matrix on the right is the utility matrix $M$ which we would like to match. Notice that $M$ has empty elements. We've also "randomly" initialized $F_{user}$ and $F_{item}$ to all ones.

Suppose we want to estimate the value of the upper-leftmost element of $F_{user}$, we would get:

$$
\left(
\matrix{
x & 1 \\
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 1
}\right)
\times
\left(
\matrix{
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1
}
\right)
=
\left(
\matrix{
x+1 & x+1 & x+1 & x+1 & x+1 \\
2 & 2 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 2 \\
2 & 2 & 2 & 2 & 2
}
\right)
\approx
\left(
\matrix{
5 & 2 & 4 & 4 & 3 \\
3 & 1 & 2 & 4 & 1 \\
  &   & 3 & 1 & 4 \\
2 & 5 & 4 & 3 & 5 \\
4 & 4 & 5 & 4 & 
}
\right)
$$

Notice that only the first row of $M$ was affected. The contribution of this row to the SSE is

$$(5 - (x+1))^2 + (2  - (x+1))^2 + (4  - (x+1))^2 + (4  - (x+1))^2 + (3 - (x+1))^2 \\ 
= (4-x)^2 + (1-x)^2 + (3-x)^2 + (3-x)^2 + (2-x)^2.$$

We want to find $x$ to minimize RMSE, which do by minimizing SSE. We take the derivative of the SSE and set it to zero:

$$-2[(4-x) + (1-x) + (3-x) + (3-x) + (2-x)] => 13 - 5x = 0.$$

The value of $x$ is therefore 2.6. We put this value back to $F_{user}$ then pick a random element again.

$$
\left(
\matrix{
2.6 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 1
}\right)
\times
\left(
\matrix{
y & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1
}
\right)
=
\left(
\matrix{
2.6y+1 & 3.6 & 3.6 & 3.6 & 3.6 \\
y+1 & 2 & 2 & 2 & 2 \\
y+1 & 2 & 2 & 2 & 2 \\
y+1 & 2 & 2 & 2 & 2 \\
y+1 & 2 & 2 & 2 & 2
}
\right)
\approx
\left(
\matrix{
5 & 2 & 4 & 4 & 3 \\
3 & 1 & 2 & 4 & 1 \\
  &   & 3 & 1 & 4 \\
2 & 5 & 4 & 3 & 5 \\
4 & 4 & 5 & 4 & 
}
\right)
$$

Notice that only the first column of $M$ is affected. The contribution of this column to SSE is

$$(5 - (2.6y+1))^2 + (3 - (y+1))^2 + (2 - (y+1))^2 + (4 - (y+1))^2\\
= (4 - 2.6y)^2 + (2 - y)^2 + (1 - y)^2 + (3 - y)^2.$$

We want to find $y$ to minimize RMSE, which we do by minimizing the SSE. We take the derivative of SSE and set it to zero:
$$-2 [2.6(4-2.6y) + (2-y) + (1-y) + (3-y)] => 16.4-9.76y = 0.$$

The value of $y$ is therefore 1.68. We then repeat the process until the total RMSE does not improve much or the maximum individual RMSE change is below a certain threshold.

Notice that only a row or a column is affected when an element in $F_{user}$ or $F_{item}$ is being calculated, respectively. We can therefore partition $F_{user}$ by row then optimize them in parallel while keeping $F_{item}$ constant. Afterwards, we partition $F_{item}$ by column then optimize them in parallel while keeping $F_{user}$ constant. This approach is known as coordinate descent.

Let us now derive the equation for optimizing an arbitrary element. Let $u_{ij}$, $v_{ij}$ and $r_{ij}$ be the elements of $F_{user}$, $F_{item}^T$ and $M$, respectively. If we let $p_{ij}$ be the elements of the product $P = F_{user}F_{item}^T$, then

$$p_{ij} = \sum_{s=1}^d u_{is}v_{sj}.$$

The SSE contribution of element $u_{ij}$ is

$$SSE = \sum_{s=1, r_{is} \neq \emptyset}^k (r_{is} - p_{is})^2 = \sum_{s=1, r_{is} \neq \emptyset}^k \left(r_{is} - \sum_{t=1}^d u_{it}v_{ts}\right)^2 = \sum_{s=1, r_{is} \neq \emptyset}^k \left(r_{is} - u_{ij}v_{js} - \sum_{t=1, t \neq j}^d u_{it}v_{ts}\right)^2.$$

We take the derivative of the SSE with respect to $u_{ij}$ then set it to zero:

$$
\begin{eqnarray}
\sum_{s=1, r_{is} \neq \emptyset}^k -2v_{js} \left(r_{is} - u_{ij}v_{js} - \sum_{t=1, t \neq j}^d u_{it}v_{ts}\right) &= &0 \\
\sum_{s=1, r_{is} \neq \emptyset}^k v_{js}r_{is} - \left(u_{ij} \sum_{s=1, r_{is} \neq \emptyset}^k v_{js}^2\right) - \sum_{s=1, r_{is} \neq \emptyset}^k \left(v_{js} \sum_{t=1, t \neq j}^d u_{it}v_{ts}\right) &= &0 \\
u_{ij} &= &\frac{\sum_{s=1, r_{is} \neq \emptyset}^k v_{js}\left(r_{is} - \sum_{t=1, t \neq j}^d u_{it}v_{ts}\right)}{\sum_{s=1, r_{is} \neq \emptyset}^k v_{js}^2}.
\end{eqnarray}
$$

It can be shown that the optimal value for $v_{ij}$ is

$$v_{ij} = \frac{\sum_{s=1, r_{sj} \neq \emptyset}^n u_{si}\left(r_{sj} - \sum_{t=1, t \neq i}^d u_{st}v_{tj}\right)}{\sum_{s=1, r_{sj} \neq \emptyset}^n u_{si}^2}.$$


**Problem 6**

Show that the optimal value for $v_{ij}$ is

$$v_{ij} = \frac{\sum_{s=1, r_{sj} \neq \emptyset}^n u_{si}\left(r_{sj} - \sum_{t=1, t \neq i}^d u_{st}v_{tj}\right)}{\sum_{s=1, r_{sj} \neq \emptyset}^n u_{si}^2}.$$

SSE contribution of $v_{ij}$:

$$SSE = \sum_{s=1, r_{sj} \neq \emptyset}^n (r_{sj} - p_{sj})^2 = \sum_{s=1, r_{sj} \neq \emptyset}^n \left(r_{is} - \sum_{t=1}^d u_{st}v_{tj}\right)^2 = \sum_{s=1, r_{sj} \neq \emptyset}^k \left(r_{sj} - u_{si}v_{ij} - \sum_{t=1, t \neq i}^d u_{st}v_{tj}\right)^2.$$

<br><br>
Taking derivative of SSE with respect to $v_{ij}$ then set to zero and find $v_{ij}$:

$$
\begin{eqnarray}
\sum_{s=1, r_{sj} \neq \emptyset}^n -2u_{si} \left(r_{sj} - u_{si}v_{ij} - \sum_{t=1, t \neq i}^d u_{st}v_{tj}\right) &= &0 \\
\sum_{s=1, r_{sj} \neq \emptyset}^n u_{si}r_{sj} - \left(v_{ij} \sum_{s=1, r_{sj} \neq \emptyset}^n u_{si}^2\right) - \sum_{s=1, r_{sj} \neq \emptyset}^n \left(u_{si} \sum_{t=1, t \neq i}^d u_{st}v_{tj}\right) &= &0 \\
\\
\\
v_{ij} = \frac{\sum_{s=1, r_{is} \neq \emptyset}^n u_{si}\left(r_{sj} - \sum_{t=1, t \neq j}^d u_{st}v_{tj}\right)}{\sum_{s=1, r_{is} \neq \emptyset}^n u_{si}^2}.
\end{eqnarray}
$$

We can now write the coordinate descent algorithm for collaborative filtering as follows:

    initialize U and V
    repeat
        for all elements in U:
            compute u_ij
        for all elements in V:
            compute v_ij
    until convergence

**Problem 7**

Create a function `cd` that accepts the utility matrix and number of latent factors then returns the $F_{user}$ and $F_{item}$ matrices. Stop when the improvement in SSE (percent change) is less than the given tolerance. Initially assign both matrices to be all ones.

In [14]:
def cd(df_utility, d, tol):
    import itertools
    n, k = df_utility.shape
    m_matrix = np.array(df_utility)
    u_matrix = np.ones((n, d))
    v_matrix = np.ones((d, k))
    uv_matrix = np.matmul(u_matrix, v_matrix)

    sse = np.nansum((m_matrix - uv_matrix)**2)
    sse_change = 1

    while sse_change >= tol:
        for i in range(n):
            for j in range(d):
                u_matrix[i, j] = (sum([v_matrix[j, s] *
                                       (m_matrix[i, s] -
                                        sum([u_matrix[i, t] * v_matrix[t, s]
                                             for t in range(d) if t != j]))
                                       for s in range(k)
                                       if pd.notna(m_matrix[i, s])]) /
                                  sum([v_matrix[j, s]**2 for s in range(k)
                                       if pd.notna(m_matrix[i, s])]))
        for i in range(d):
            for j in range(k):
                v_matrix[i, j] = (sum([u_matrix[s, i] *
                                       (m_matrix[s, j] -
                                        sum([u_matrix[s, t] * v_matrix[t, j]
                                             for t in range(d) if t != i]))
                                       for s in range(n)
                                       if pd.notna(m_matrix[s, j])]) /
                                  sum([u_matrix[s, i]**2 for s in range(n)
                                       if pd.notna(m_matrix[s, j])]))
        uv_matrix = np.matmul(u_matrix, v_matrix)

        sse_change = (sse - np.nansum((m_matrix - uv_matrix)**2)) / sse
        sse = np.nansum((m_matrix - uv_matrix)**2)
    return u_matrix, v_matrix.T

In [15]:
f_user, f_item = cd(df_jester, 5, 0.01)
assert_equal(f_user.shape, (100, 5))
assert_equal(f_item.shape, (100, 5))
assert_array_almost_equal(
    f_user[0,:], 
    [-5.54176877,  1.11810181,  0.50455207,  2.50592182,  0.80730106]
)
assert_array_almost_equal(
    f_item[0,:], 
    [ 1.00556239,  0.79076443,  1.48979126,  0.91302609, -0.71786162]
)

$F_{user}$ and $F_{item}$ are dense matrices but they only have a total of $(n+k)d$ elements which is much fewer than the $nk$ elements of the completed utility matrix. The completed utility matrix for a user can be easily computed, in parallel, from $F_{user}$ and $F_{item}$ so we can store these factor matrices instead of the completed utility matrix.

**Problem 9**

Create a function `recommend_cd` that accepts the utility and factor matrices and returns the top $k$ recommended unrated items for the user.

In [16]:
def recommend_cd(user, df_utility, f_user, f_item, k):
    df_complete = pd.DataFrame(f_user@f_item.T,
                               index=df_utility.index,
                               columns=df_utility.columns)
    return recommend(user, df_utility, df_complete, k)

In [17]:
recos_cd = recommend_cd(0, df_jester, f_user, f_item, 10)
assert_array_equal(recos_cd, [83, 73, 72, 80, 85, 81, 100, 98, 90, 51])

## Surprise

[Surprise](http://surpriselib.com) is a scikit for building and analyzing recommender systems. The code below shows how to perform user-based collaborative filtering with mean-centering on our sample dataset.

In [18]:
from surprise import (Reader, Dataset, KNNWithMeans)

knn = KNNWithMeans(k=5, sim_options={'name': 'pearson', 'user_based': True})
reader = Reader(rating_scale=(-10,10))
df_melt = (df_jester.reset_index()
                    .melt('index', var_name='itemID', value_name='rating')
                    .dropna())
dataset = Dataset.load_from_df(df_melt, reader)
knn.fit(dataset.build_full_trainset())
knn.predict(0, 1)
predictions = knn.test(knn.trainset.build_anti_testset())
# don't forget to include the semicolon below before submitting
knn.test(knn.trainset.build_testset());

Computing the pearson similarity matrix...
Done computing similarity matrix.


In [19]:
user = [i[0] for i in predictions]
item = [i[1] for i in predictions]
ratings = [i[3] for i in predictions]

df = pd.DataFrame(list(zip(user, item, ratings)), columns=['user', 'item', 'ratings']).pivot(index='user', columns='item', values='ratings')
df

item,1,2,3,4,6,9,10,11,12,14,...,91,92,93,94,95,96,97,98,99,100
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,-1.556827,,-1.523387,-0.635126,,-5.082467,,,,,...,-3.023573,-2.057122,-0.234093,-1.003268,-2.183489,,0.314198,0.173268,-2.795793,1.015105
2,0.887630,6.702884,2.471787,0.569802,1.825993,0.799993,-1.234746,4.392366,4.643566,,...,2.720973,3.964535,3.062878,1.625221,,1.687005,3.864529,5.290811,2.233266,2.381746
3,,,,0.457769,,1.071538,,,,,...,1.339696,0.465794,1.784173,0.917635,2.748020,-1.682888,1.275074,4.530909,-0.225610,0.796638
4,3.891968,3.620461,5.864221,2.021406,4.419166,3.607201,3.510214,5.259062,,4.570295,...,3.025398,4.222970,6.743390,-2.344804,2.421831,4.057472,4.415265,5.448304,0.573211,3.068823
5,-1.916342,1.633026,1.434988,0.180598,3.521466,-0.183542,5.386182,3.087739,2.144325,,...,1.433789,1.403157,3.146835,,1.286652,-0.556040,1.939969,1.548096,-0.946803,0.988476
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,-6.368679,,-1.422619,-5.212320,,-6.541600,,,,,...,-2.200416,-1.182114,-3.256347,-3.688451,-3.556172,,-1.422982,2.719565,-2.056629,-0.940671
93,,,,,,,,,,,...,4.759607,4.381065,6.187810,,2.934881,3.979664,4.181892,4.519982,0.484728,3.204427
94,,,,,,,,,,,...,2.590120,-1.715303,,-1.888204,,3.994098,3.205114,1.786173,-2.223387,-0.509256
96,-0.200844,,0.671566,-0.671333,,-2.681960,,,,,...,0.492207,0.382303,2.483080,-1.912085,-0.760884,1.384496,1.038844,1.129299,0.208350,-0.120917


In [20]:
df_surprise = df_jester.fillna(df)
df_surprise

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,91,92,93,94,95,96,97,98,99,100
0,-1.556827,8.110000,-1.523387,-0.635126,-2.28,-4.220000,5.49,-2.62,-5.082467,-2.280000,...,-3.023573,-2.057122,-0.234093,-1.003268,-2.183489,-5.920000,0.314198,0.173268,-2.795793,1.015105
1,-4.370000,-3.880000,0.730000,-3.200000,-6.41,1.170000,7.82,-4.76,-6.410000,0.730000,...,5.730000,-6.700000,1.990000,2.620000,-0.490000,3.450000,3.200000,-0.530000,-0.530000,-2.960000
2,0.887630,6.702884,2.471787,0.569802,0.73,1.825993,5.53,3.25,0.799993,-1.234746,...,2.720973,3.964535,3.062878,1.625221,3.160000,1.687005,3.864529,5.290811,2.233266,2.381746
3,0.340000,-6.550000,2.860000,0.457769,-3.64,1.120000,5.34,2.33,1.071538,2.330000,...,1.339696,0.465794,1.784173,0.917635,2.748020,-1.682888,1.275074,4.530909,-0.225610,0.796638
4,3.891968,3.620461,5.864221,2.021406,9.13,4.419166,-9.32,-2.04,3.607201,3.510214,...,3.025398,4.222970,6.743390,-2.344804,2.421831,4.057472,4.415265,5.448304,0.573211,3.068823
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2.230000,-6.310000,7.280000,-6.310000,3.59,7.280000,5.73,-0.29,7.480000,-7.860000,...,7.280000,7.280000,5.730000,-6.310000,5.730000,5.730000,7.280000,5.730000,-7.860000,-7.860000
96,-0.200844,-0.580000,0.671566,-0.671333,2.23,-1.410000,-6.31,-3.54,-2.681960,-6.600000,...,0.492207,0.382303,2.483080,-1.912085,-0.760884,1.384496,1.038844,1.129299,0.208350,-0.120917
97,-4.664629,-5.543561,-3.197576,-7.480289,-6.31,-5.467951,1.50,-4.51,-3.200482,-3.512772,...,0.563669,-3.227327,-3.660486,-3.354050,-2.814101,-3.648144,-2.853119,1.141390,-2.759677,-1.215317
98,3.110000,-8.350000,-3.640000,-3.500000,2.09,1.070000,-0.73,-4.22,1.410000,7.090000,...,-1.600000,0.730000,-3.450000,-1.020000,4.030000,-6.170000,-5.340000,5.290000,-4.660000,-4.710000


In [21]:
df_manual = user_complete(df_jester, 5)
df_manual

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for U in tqdm.tqdm_notebook(df_centered.index):


  0%|          | 0/100 [00:00<?, ?it/s]

  s_dist = pd.Series([d[x] for x in s_ratings.index], index=s_ratings.index)
  df_out.loc[U,items] = ((s_ratings * (1-s_dist)).sum()/(1-s_dist).sum()) + df_utility.loc[U].mean()


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,91,92,93,94,95,96,97,98,99,100
0,-3.446504,8.110000,-1.135967,-2.830721,-2.28,-4.220000,5.49,-2.62,-2.885099,-2.280000,...,,-0.225943,1.125643,-3.478555,-0.658835,-5.920000,2.786295,0.175973,-4.951976,0.139106
1,-4.370000,-3.880000,0.730000,-3.200000,-6.41,1.170000,7.82,-4.76,-6.410000,0.730000,...,5.73000,-6.700000,1.990000,2.620000,-0.490000,3.450000,3.200000,-0.530000,-0.530000,-2.960000
2,3.490756,6.785052,3.797377,-2.287532,0.73,1.392085,5.53,3.25,-2.287532,-1.369963,...,,,3.142468,,3.160000,-2.963457,6.782468,4.362468,0.862468,4.162468
3,0.340000,-6.550000,2.860000,0.597605,-3.64,1.120000,5.34,2.33,1.433635,2.330000,...,,-4.850129,3.461590,,,-4.320282,-1.043825,,,
4,3.981776,3.275753,6.071877,2.924405,9.13,4.554327,-9.32,-2.04,3.883705,5.741802,...,3.48627,4.002543,6.345350,-1.141440,3.732481,2.800903,5.059258,4.074514,3.432553,2.892984
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2.230000,-6.310000,7.280000,-6.310000,3.59,7.280000,5.73,-0.29,7.480000,-7.860000,...,7.28000,7.280000,5.730000,-6.310000,5.730000,5.730000,7.280000,5.730000,-7.860000,-7.860000
96,1.187532,-0.580000,0.117532,0.747532,2.23,-1.410000,-6.31,-3.54,1.137532,-6.600000,...,,,,0.329427,,,-1.244609,,,
97,-10.287772,-3.356653,-1.628534,-6.635712,-6.31,-4.972754,1.50,-4.51,-5.016016,-5.138664,...,,-3.702765,-1.562765,0.667235,-4.912765,-3.602765,-1.562765,-4.332765,-4.332765,-4.232765
98,3.110000,-8.350000,-3.640000,-3.500000,2.09,1.070000,-0.73,-4.22,1.410000,7.090000,...,-1.60000,0.730000,-3.450000,-1.020000,4.030000,-6.170000,-5.340000,5.290000,-4.660000,-4.710000


**Problem 10**

Compare the results of user-based CF using the code that you created and the results using surprise.

for the manually computed results, NaN values can still be observed, however for the surprise results, there were no NaN values to be seen. Moreover, there are differences between the results. For user 0, surprise predicted a rating of -1.56 for item 1. The manual results predicted the same item for user 0 to be rated as -3.44. 

**Problem 11** [2 pts]

Compare the results of item-based CF using the code that you created and the results using surprise.

Similar to the user-based analysis, some of the ratings in the manually coded item-based RS had NaN values still compared to surprise's results which had no NaN values. Differences in values can also be observed. For example, user 0's rating for item 1 for surprise's was rated as -1.55 while my manually coded RS predicted the rating to be 7.09. I highly suspect that there is something wrong with my code for the item-based RS.

In [22]:
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
from surprise import (Reader, Dataset, KNNWithMeans)

knn = KNNWithMeans(k=5, sim_options={'name': 'pearson', 'item_based': True})
reader = Reader(rating_scale=(-10,10))
df_melt = (df_jester.reset_index()
                    .melt('index', var_name='itemID', value_name='rating')
                    .dropna())
dataset = Dataset.load_from_df(df_melt, reader)
knn.fit(dataset.build_full_trainset())
predictions = knn.test(knn.trainset.build_anti_testset())
# # don't forget to include the semicolon below before submitting
knn.test(knn.trainset.build_testset());

Matplotlib created a temporary config/cache directory at /tmp/matplotlib-709q0del because the default path (/home/jgacal/.cache/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.


Computing the pearson similarity matrix...
Done computing similarity matrix.


In [23]:
user = [i[0] for i in predictions]
item = [i[1] for i in predictions]
ratings = [i[3] for i in predictions]

df = pd.DataFrame(list(zip(user, item, ratings)), columns=['user', 'item', 'ratings']).pivot(index='user', columns='item', values='ratings')
df

item,1,2,3,4,6,9,10,11,12,14,...,91,92,93,94,95,96,97,98,99,100
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,-1.556827,,-1.523387,-0.635126,,-5.082467,,,,,...,-3.023573,-2.057122,-0.234093,-1.003268,-2.183489,,0.314198,0.173268,-2.795793,1.015105
2,0.887630,6.702884,2.471787,0.569802,1.825993,0.799993,-1.234746,4.392366,4.643566,,...,2.720973,3.964535,3.062878,1.625221,,1.687005,3.864529,5.290811,2.233266,2.381746
3,,,,0.457769,,1.071538,,,,,...,1.339696,0.465794,1.784173,0.917635,2.748020,-1.682888,1.275074,4.530909,-0.225610,0.796638
4,3.891968,3.620461,5.864221,2.021406,4.419166,3.607201,3.510214,5.259062,,4.570295,...,3.025398,4.222970,6.743390,-2.344804,2.421831,4.057472,4.415265,5.448304,0.573211,3.068823
5,-1.916342,1.633026,1.434988,0.180598,3.521466,-0.183542,5.386182,3.087739,2.144325,,...,1.433789,1.403157,3.146835,,1.286652,-0.556040,1.939969,1.548096,-0.946803,0.988476
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,-6.368679,,-1.422619,-5.212320,,-6.541600,,,,,...,-2.200416,-1.182114,-3.256347,-3.688451,-3.556172,,-1.422982,2.719565,-2.056629,-0.940671
93,,,,,,,,,,,...,4.759607,4.381065,6.187810,,2.934881,3.979664,4.181892,4.519982,0.484728,3.204427
94,,,,,,,,,,,...,2.590120,-1.715303,,-1.888204,,3.994098,3.205114,1.786173,-2.223387,-0.509256
96,-0.200844,,0.671566,-0.671333,,-2.681960,,,,,...,0.492207,0.382303,2.483080,-1.912085,-0.760884,1.384496,1.038844,1.129299,0.208350,-0.120917


In [24]:
df_surprise = df_jester.fillna(df)
df_surprise

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,91,92,93,94,95,96,97,98,99,100
0,-1.556827,8.110000,-1.523387,-0.635126,-2.28,-4.220000,5.49,-2.62,-5.082467,-2.280000,...,-3.023573,-2.057122,-0.234093,-1.003268,-2.183489,-5.920000,0.314198,0.173268,-2.795793,1.015105
1,-4.370000,-3.880000,0.730000,-3.200000,-6.41,1.170000,7.82,-4.76,-6.410000,0.730000,...,5.730000,-6.700000,1.990000,2.620000,-0.490000,3.450000,3.200000,-0.530000,-0.530000,-2.960000
2,0.887630,6.702884,2.471787,0.569802,0.73,1.825993,5.53,3.25,0.799993,-1.234746,...,2.720973,3.964535,3.062878,1.625221,3.160000,1.687005,3.864529,5.290811,2.233266,2.381746
3,0.340000,-6.550000,2.860000,0.457769,-3.64,1.120000,5.34,2.33,1.071538,2.330000,...,1.339696,0.465794,1.784173,0.917635,2.748020,-1.682888,1.275074,4.530909,-0.225610,0.796638
4,3.891968,3.620461,5.864221,2.021406,9.13,4.419166,-9.32,-2.04,3.607201,3.510214,...,3.025398,4.222970,6.743390,-2.344804,2.421831,4.057472,4.415265,5.448304,0.573211,3.068823
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2.230000,-6.310000,7.280000,-6.310000,3.59,7.280000,5.73,-0.29,7.480000,-7.860000,...,7.280000,7.280000,5.730000,-6.310000,5.730000,5.730000,7.280000,5.730000,-7.860000,-7.860000
96,-0.200844,-0.580000,0.671566,-0.671333,2.23,-1.410000,-6.31,-3.54,-2.681960,-6.600000,...,0.492207,0.382303,2.483080,-1.912085,-0.760884,1.384496,1.038844,1.129299,0.208350,-0.120917
97,-4.664629,-5.543561,-3.197576,-7.480289,-6.31,-5.467951,1.50,-4.51,-3.200482,-3.512772,...,0.563669,-3.227327,-3.660486,-3.354050,-2.814101,-3.648144,-2.853119,1.141390,-2.759677,-1.215317
98,3.110000,-8.350000,-3.640000,-3.500000,2.09,1.070000,-0.73,-4.22,1.410000,7.090000,...,-1.600000,0.730000,-3.450000,-1.020000,4.030000,-6.170000,-5.340000,5.290000,-4.660000,-4.710000


In [25]:
df_item_complete

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,91,92,93,94,95,96,97,98,99,100
0,7.090000,8.11,,,-2.28,-4.220000,5.49,-2.62,,-2.28,...,-5.073979,-8.315830,-1.528784,0.006812,5.290000,-5.920000,1.21,-6.890000,-7.870837,-8.93
1,-4.370000,-3.88,0.73,-3.20,-6.41,1.170000,7.82,-4.76,-6.410000,0.73,...,5.730000,-6.700000,1.990000,2.620000,-0.490000,3.450000,3.20,-0.530000,-0.530000,-2.96
2,3.200000,,4.42,4.71,0.73,2.820000,5.53,3.25,4.710000,0.00,...,2.330000,1.666311,1.825257,0.717271,3.160000,2.910000,,2.265565,-1.700000,
3,0.340000,-6.55,2.86,2.91,-3.64,1.120000,5.34,2.33,1.904878,2.33,...,-3.650425,0.291744,0.394016,1.279829,0.556289,-0.509166,,-1.752455,-3.540000,-6.99
4,8.350000,,,,9.13,5.213135,-9.32,-2.04,-2.230000,8.98,...,5.490000,-2.040000,7.706577,-8.500000,2.877534,3.930000,,6.892782,-8.500000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2.230000,-6.31,7.28,-6.31,3.59,7.280000,5.73,-0.29,7.480000,-7.86,...,7.280000,7.280000,5.730000,-6.310000,5.730000,5.730000,7.28,5.730000,-7.860000,-7.86
96,-1.499698,-0.58,7.86,,2.23,-1.410000,-6.31,-3.54,-1.700000,-6.60,...,5.968785,3.885321,3.060000,-4.078016,4.733251,0.539760,,3.078828,-9.370000,
97,-3.321006,,-5.83,,-6.31,-2.942983,1.50,-4.51,-8.200000,-4.03,...,-2.910000,-4.977465,-4.030000,-4.209733,-3.879493,-7.040000,,-4.034521,-5.938663,-7.52
98,3.110000,-8.35,-3.64,-3.50,2.09,1.070000,-0.73,-4.22,1.410000,7.09,...,-1.600000,0.730000,-3.450000,-1.020000,4.030000,-6.170000,-5.34,5.290000,-4.660000,-4.71


# References

* J. Leskovec, A. Rajaraman and J. Ullman, "Mining of Massive Datasets 3e".
* C. Aggarwal, "Recommender Systems", 2016.