# Item-to-Item Collaborative Filtering

## Idea
Let $u$ be the active user and $i$ the referenced item.
1. If $u$ liked items similar to $i$, they will probably like item $i$.
2. If $u$ disliked items similar to $i$, they will probably dislike item $i$ as well.

The idea is therefore to look at how the active user $u$ rated items similar to $i$ in order to estimate how they would rate item $i$.

## Advantages over user-based CF

1. <b> Stability </b> : Item ratings are more stable than user ratings. New ratings on items are unlikely to significantly change the similarity between two items, particularly when the items have many ratings <a href="https://dl.acm.org/doi/10.1561/1100000009">(Michael D. Ekstrand, <i>et al.</i> 2011)</a>. 
2. <b> Scalability </b> : because item ratings are stable, it is reasonable to pre-compute the similarities between items offline and store them in an item-item similarity matrix. This reduces the scalability concern of the algorithm <a href="https://dl.acm.org/doi/10.1145/371920.372071">(Sarwar <i>et al.</i> 2001)</a>, <a href="https://dl.acm.org/doi/10.1561/1100000009">(Michael D. Ekstrand, <i>et al.</i> 2011)</a>.

## Algorithm : item-to-item collaborative filtering

The item-based CF algorithm is described as follows <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.449.1171&rep=rep1&type=pdf">(B. Sarwar et al. 2001)</a><a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.554.1671&rep=rep1&type=pdf">(George Karypis 2001)</a> :

1. First identify the $k$ most similar items for each item in the catalogue and record the corresponding similarities. To compute the similarity between two items we can use the <i>Adjusted Cosine Similarity</i>, which was reported to perform better than the basic <i>Cosine similarity measure</i> used for user-based collaborative filtering <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.449.1171&rep=rep1&type=pdf">(B. Sarwar et al. 2001)</a>. The Adjusted Cosine similarity between two items $i$ and $j$ is computed as follows:

<center>
$
\large
 w_{i,j}= \frac{\sum_{u\in U}(r_{u,i}-\bar{r}_u)(r_{u,j}-\bar{r}_u)}{\sqrt{\sum_{u\in U} (r_{u,i}-\bar{r}_u)^2}\sqrt{\sum_{u\in U} (r_{u,j}-\bar{r}_u)^2}}
$
</center>

$w_{i,j}$ is the degree of similarity between items $i$ and $j$, and $\bar{r}_u$ is the mean rating of user $u$. The sums run over all users $u\in U$, where $U$ is the set of users that rated both items $i$ and $j$. Let's denote by $S^{(i)}$ the set of the $k$ most similar items to item $i$.


2. To produce top-N recommendations for a given user $u$ that has already purchased a set $I_u$ of items, do the following :

    a. Find the set $C$ of candidate items by taking the union of all $S^{(i)}, \forall i\in I_u$ and removing each of the items in the set $I_u$.
<center>
$
 C = \left(\bigcup_{i\in I_u} S^{(i)}\right)\smallsetminus I_u
$
</center>    
    b. $\forall c\in C$, compute the similarity between $c$ and the set $I_u$ as follows:
<center>
$
 w_{c,I_u} = \sum_{i\in I_u} w_{c,i}, \forall c \in C
$
</center>
    c. Sort items in $C$ in decreasing order of $w_{c,I_u}, \forall c \in C$, and return the first $N$ items as the Top-N recommendation list.
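
Steps a to c can be sketched on a toy example. The item names, the neighbor lists in `S` and the similarity values below are invented for illustration, and the helper `top_n` is hypothetical, not a function of this notebook. As in the implementation that follows, a pair $(c, i)$ contributes to the score only when $c$ belongs to the stored neighbor list $S^{(i)}$.

```python
# Hypothetical toy data: S[i] maps item i to its k=2 most similar items and their similarities
S = {
    "A": {"B": 0.9, "C": 0.4},
    "B": {"A": 0.9, "D": 0.7},
    "C": {"A": 0.4, "D": 0.2},
    "D": {"B": 0.7, "C": 0.2},
}
I_u = ["A", "B"]  # items already rated by the user


def top_n(I_u, S, N):
    # a. candidates: union of the neighbor sets of the items in I_u, minus I_u itself
    candidates = set()
    for i in I_u:
        candidates.update(S[i])
    candidates -= set(I_u)

    # b. w_{c,I_u}: sum of w_{c,i} over i in I_u (0 when c is not a stored neighbor of i)
    scores = {c: sum(S[i].get(c, 0.0) for i in I_u) for c in candidates}

    # c. sort candidates by decreasing score and keep the first N
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:N]


print(top_n(I_u, S, N=2))  # [('D', 0.7), ('C', 0.4)]
```

Here the candidates are `C` and `D`; `D` ranks first because it is a close neighbor of `B`, while `C` is only a weaker neighbor of `A`.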

Before returning the first $N$ items as the top-N recommendation list, we can predict the rating user $u$ would give to each item in the list, reorder the list in descending order of predicted rating and return it as the final recommendation list. Rating prediction for item-based CF is given by the following formula <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.449.1171&rep=rep1&type=pdf">(B. Sarwar et al. 2001)</a>:

<center>
$
\large
 \hat{r}_{u,i}=\frac{\sum_{j\in S^{(i)}}r_{u,j}\cdot w_{i,j}}{\sum_{j\in S^{(i)}}|w_{i,j}|}
$
</center>
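
As a worked illustration with made-up numbers (not taken from the dataset), suppose that only two items of $S^{(i)}$ were rated by $u$, with $r_{u,j_1}=5$, $w_{i,j_1}=0.6$ and $r_{u,j_2}=3$, $w_{i,j_2}=0.2$. Assuming both sums are restricted to these two rated neighbors, the prediction is

<center>
$
\large
 \hat{r}_{u,i}=\frac{5\cdot 0.6+3\cdot 0.2}{|0.6|+|0.2|}=\frac{3.6}{0.8}=4.5
$
</center>

The result lies between the two ratings and is pulled toward the rating of the more similar item.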

### Import useful requirements

In [1]:
from tools.utils import load_ratings, load_movies, download_data
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

import pandas as pd
import numpy as np
import os

Download the MovieLens data if it doesn't exist

In [2]:
data = os.path.join('tools', 'ml-latest-small')

if os.path.exists(data):
    print('Data already exists ...')
    ratings_csv, movies_csv = os.path.join(data, 'ratings.csv'), os.path.join(data, 'movies.csv')
else:
    ratings_csv, movies_csv = download_data()

Data already exists ...


### Load ratings

In [3]:
ratings, movies = load_ratings(ratings_csv), load_movies(movies_csv)

In [4]:
itemids = sorted(ratings['itemid'].unique())

In [5]:
itemids_to_idx = {itemid:idx for (itemid,idx) in zip(itemids, range(0, len(itemids)))}

In [6]:
idx_to_itemids = {idx:itemid for (idx,itemid) in zip(range(0,len(itemids)),itemids)}

In [7]:
ratings

Unnamed: 0,userid,itemid,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0
...,...,...,...
100831,610,166534,4.0
100832,610,168248,5.0
100833,610,168250,5.0
100834,610,168252,5.0


Let's implement the item-based collaborative filtering algorithm described above

### Step 1. Find similarities for each of the items

To compute the similarity between two items $i$ and $j$, we need to :

1. Find all users who rated both of them,
2. Normalize their ratings on items $i$ and $j$ by subtracting each user's mean rating,
3. Apply the cosine metric to the normalized ratings to compute the similarity between $i$ and $j$.
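
The three steps can be sketched in plain Python on a tiny example. The dictionary `toy_ratings` and the helper `adjusted_cosine` are hypothetical and only serve as an illustration; the user mean is assumed to be taken over all the ratings of that user.

```python
from math import sqrt

# Hypothetical toy ratings: toy_ratings[user][item] = rating
toy_ratings = {
    "u1": {"i": 5.0, "j": 4.0, "k": 1.0},
    "u2": {"i": 3.0, "j": 3.0, "k": 5.0},
    "u3": {"i": 4.0, "k": 2.0},  # u3 did not rate j, so u3 is ignored for the pair (i, j)
}


def adjusted_cosine(ratings, i, j):
    # 1. users who rated both items
    common = [u for u, r in ratings.items() if i in r and j in r]

    num = den_i = den_j = 0.0
    for u in common:
        r = ratings[u]
        # 2. normalize with the mean of all ratings given by user u
        mean_u = sum(r.values()) / len(r)
        d_i, d_j = r[i] - mean_u, r[j] - mean_u
        # 3. accumulate the terms of the cosine on the normalized ratings
        num += d_i * d_j
        den_i += d_i * d_i
        den_j += d_j * d_j

    if den_i == 0 or den_j == 0:
        return 0.0  # similarity is undefined here; 0 is an arbitrary convention for this sketch
    return num / (sqrt(den_i) * sqrt(den_j))


print(adjusted_cosine(toy_ratings, "i", "j"))
print(adjusted_cosine(toy_ratings, "i", "k"))
```

In this toy data, `i` and `j` are rated on the same side of each user's mean, so their similarity is positive, whereas `i` and `k` get opposite deviations and a negative similarity.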

Function ```normalize()``` processes the rating dataframe to normalize the ratings of all users

In [8]:
def normalize():
    """
    Mean-center the ratings of every user.
    
    :return
        - mean : dataframe with the mean rating of each user
        - norm_ratings : ratings with two extra columns, rating_mean and norm_rating
    """
    # compute mean rating for each user
    mean = ratings.groupby(by='userid', as_index=False)['rating'].mean()
    norm_ratings = pd.merge(ratings, mean, suffixes=('','_mean'), on='userid')
    
    # normalize each rating by subtracting the mean rating of the corresponding user
    norm_ratings['norm_rating'] = norm_ratings['rating'] - norm_ratings['rating_mean']
    
    return mean, norm_ratings

In [9]:
mean, norm_ratings = normalize()
norm_ratings.head()

Unnamed: 0,userid,itemid,rating,rating_mean,norm_rating
0,1,1,4.0,4.366379,-0.366379
1,1,3,4.0,4.366379,-0.366379
2,1,6,4.0,4.366379,-0.366379
3,1,47,5.0,4.366379,0.633621
4,1,50,5.0,4.366379,0.633621


Now that each rating has been normalized, we can represent each item by a vector of its normalized ratings

In [10]:
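def item_representation(norm_ratings):
    """
    Represent each item by the vector of its normalized ratings.
    
    :param
        - norm_ratings : dataframe of normalized ratings returned by normalize()
    
    :return
        - R : sparse (CSR) matrix of item vectors, one row per item
        - df : items x users dataframe of normalized ratings, missing values filled with 0
    """
    
    df = pd.crosstab(norm_ratings.itemid, norm_ratings.userid, norm_ratings.norm_rating, aggfunc='sum')
    df = df.fillna(0)

    R = csr_matrix(df.values)
    
    return R, df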

In [11]:
R, df = item_representation(norm_ratings)

In [12]:
df

itemid \ userid,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,-0.366379,0.0,0.0,0.0,0.363636,0.000000,1.269737,0.000000,0.0,0.0,...,-0.425743,0.000000,0.492047,-0.48,0.789593,-1.157399,0.213904,-0.634176,-0.27027,1.311444
2,0.000000,0.0,0.0,0.0,0.000000,0.506369,0.000000,0.425532,0.0,0.0,...,0.000000,0.607407,0.000000,1.52,0.289593,0.000000,0.000000,-1.134176,0.00000,0.000000
3,-0.366379,0.0,0.0,0.0,0.000000,1.506369,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,-1.134176,0.00000,0.000000
4,0.000000,0.0,0.0,0.0,0.000000,-0.493631,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000
5,0.000000,0.0,0.0,0.0,0.000000,1.506369,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,-0.48,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000
193583,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000
193585,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000
193587,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000


```df``` is a cross table dataframe where rows are items and columns are users. Each row represents the vector of normalized ratings attributed to the corresponding item. Missing values are filled with zeros (0). ```R``` is the sparse matrix (scipy ```csr_matrix```) built from ```df``` that will be fitted to a $k$-NN model.
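
One consequence of filling missing values with zeros is worth illustrating. The cosine computed on two zero-filled rows has the same numerator as $w_{i,j}$, since a product vanishes as soon as one of the two ratings is missing, but its norms run over every user who rated each item rather than only over the users who rated both. The sketch below compares the two on made-up vectors; `item_i`, `item_j` and `cosine` are hypothetical names used for this illustration only.

```python
from math import sqrt

# Hypothetical normalized ratings of two items by four users; None marks a missing rating
item_i = [1.0, -0.5, 2.0, None]
item_j = [0.5, -1.0, None, 1.5]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# zero-filled vectors, as in the rows of the cross table
zero_i = [0.0 if v is None else v for v in item_i]
zero_j = [0.0 if v is None else v for v in item_j]
w_zero_filled = cosine(zero_i, zero_j)

# users who rated both items only, as in the definition of w_{i,j}
pairs = [(a, b) for a, b in zip(item_i, item_j) if a is not None and b is not None]
w_co_rated = cosine([a for a, _ in pairs], [b for _, b in pairs])

print(round(w_zero_filled, 3), round(w_co_rated, 3))
```

The zero-filled value can never exceed the co-rated one in absolute value, because only the denominator grows. Items that share few raters are therefore pulled toward a similarity of 0.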

Let's build and fit our $k$-NN model using sklearn

In [13]:
def create_model(R, k):
    """
    :param
        R : sparse matrix of item representations
        k : number of nearest neighbors to return
    
    :return
        model : our knn model
    """
    
    # create the nearest neighbors model
    model = NearestNeighbors(metric='cosine', n_neighbors=k+1, algorithm='brute')
    
    # fit the model with ratings
    model.fit(R)
    
    return model

Function ```nearest_neighbors()``` computes the nearest neighbors of each item

In [14]:
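def nearest_neighbors(R, model):
    """
    compute the k most similar items for each item (k is set when the model is created).
    
    :param
        - R : items representations
        - model : nearest neighbors model
    
    :return
        - similarities : similarities[i][n] is the similarity between item i and its n-th neighbor
        - neighbors : neighbors[i][n] is the row index (not the itemid) of the n-th neighbor of item i
    """
    
    # kneighbors() returns cosine distances sorted in increasing order
    distances, neighbors = model.kneighbors(R)
    
    # convert cosine distances to cosine similarities
    similarities = 1 - distances
    
    # the first neighbor of each item is expected to be the item itself : set its similarity to 0
    similarities[:, 0] = 0
        
    return similarities, neighbors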

Let's call functions ```create_model()``` and ```nearest_neighbors()``` to respectively create the $k$-NN model (with $k=30$) and compute the nearest neighbors for all items in our database

In [15]:
model = create_model(R, k=30)

In [None]:
similarities, neighbors = nearest_neighbors(R, model)

In [None]:
print('neighbors shape : ', neighbors.shape)
print('similarities shape : ', similarities.shape)

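```neighbors``` and ```similarities``` are numpy arrays with one row per item. Each row holds $k+1=31$ entries : the item itself in first position (its similarity is set to $0$), followed by its $30$ nearest neighbors and their corresponding similarities. Note that ```neighbors``` stores row indices, not item ids.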
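
Because the $k$-NN model works on the rows of the item matrix, reading the neighbors of a given item means translating its item id to a row index first, then translating the neighbor indices back to item ids. A minimal sketch on made-up data, where all `toy_*` names and the helper `similar_items` are hypothetical:

```python
# Hypothetical toy catalogue : item ids are not contiguous, row indices are
toy_itemids = [10, 42, 77, 99]
toy_itemids_to_idx = {itemid: idx for idx, itemid in enumerate(toy_itemids)}
toy_idx_to_itemids = {idx: itemid for idx, itemid in enumerate(toy_itemids)}

# Toy k-NN output for k=2 : each row has k+1 entries, the first one being the item itself
toy_neighbors = [[0, 1, 3], [1, 0, 2], [2, 1, 3], [3, 0, 2]]
toy_similarities = [[0.0, 0.8, 0.3], [0.0, 0.8, 0.5], [0.0, 0.5, 0.1], [0.0, 0.3, 0.1]]


def similar_items(itemid):
    # item id -> row index
    row = toy_itemids_to_idx[itemid]
    # skip the first entry (the item itself) and map neighbor indices back to item ids
    return [
        (toy_idx_to_itemids[n], w)
        for n, w in zip(toy_neighbors[row][1:], toy_similarities[row][1:])
    ]


print(similar_items(42))  # [(10, 0.8), (77, 0.5)]
```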

### Step 2. Top N recommendation for a given user

Top-N recommendations are made for a user $u$ who has already rated a set of items $I_u$.

#### 2.a- Finding candidate items

To find candidate items for user $u$, we need to :

1. Find the set $I_u$ of items already rated by user $u$,
2. Take the union of the sets of similar items of all items in $I_u$ to form the set $C$,
3. Exclude from the set $C$ all items in $I_u$, to avoid recommending items the user has already purchased.

These steps are done in function ```candidate_items()```

In [None]:
def candidate_items(userid):
    """
    :param
        - userid : user id for which we wish to find candidate items
    
    :return
        - I_u : list of items already purchased by userid
        - candidates : list of candidate items
    """
    
    # 1. Finding the set I_u of items already rated by user userid
    I_u = ratings.loc[ratings.userid == userid].itemid.to_list()
    
    # 2. Taking the union of similar items for all items in I_u to form the set of candidate items
    C = set()
    
    for iid in I_u:        
        # get the index of item iid in the nearest neighbors set
        idx = itemids_to_idx[iid]
        
        # add the neighbors of item iid in the set of candidate items
        
        C.update([ idx_to_itemids[ix] for ix in neighbors[idx]])
        
    C = list(C)
    
    # 3. exclude from the set C all items in I_u.
    candidates = np.setdiff1d(C, I_u, assume_unique=True)
    
    return I_u, candidates

In [None]:
i_u, u_candidates = candidate_items(514)

In [None]:
print('number of items purchased by user 514 : ', len(i_u))
print('number of candidate items for user 514 : ', len(u_candidates))

In [None]:
u_candidates

#### 2.b- Find similarity between each candidate item and the set $I_u$

In [None]:
def similarity_with_Iu(c, I_u):
    """
    compute similarity between an item c and a set of items I_u. For each item i in I_u, get similarity between 
    i and c, if c exists in the set of items similar to i
    
    :param
        - c : itemid of a candidate item
        - I_u : set of items already purchased by a given user
    
    :return
        - w : similarity between c and I_u
    """
    # neighbors stores row indices, so convert the itemid c to its row index first
    c_idx = itemids_to_idx[c]
    
    w = 0    
    for i in I_u :        
        # idx of item i in nearest neighbors set
        idx = itemids_to_idx[i]
        
        # get similarity between i and c, if c is one of the k nearest neighbors of i
        position = np.where(neighbors[idx] == c_idx)[0]
        if position.size > 0 :
            w = w + similarities[idx][position[0]]
    
    return w

#### 2.c- Rank candidate items according to their similarities to $I_u$

In [None]:
def rank_candidates(candidates, I_u):
    """
    rank candidate items according to their similarities with I_u
    
    :param
        - candidates : list of candidate items
        - I_u : list of items purchased by the user
    
    :return:
        - ranked_candidates : dataframe of candidate items, ranked in descending order of similarities with I_u
    """
    
    # list of candidate items mapped to their corresponding similarities to I_u 
    mapping = []
    for c in candidates:
        # compute similarity between c and I_u
        w = similarity_with_Iu(c, I_u)
        mapping.append((c,w))

    ranked_candidates = pd.DataFrame(mapping, columns=['itemid','similarity_with_Iu'])
    
    # rank candidate items according to their similarities
    ranked_candidates = ranked_candidates.sort_values(by=['similarity_with_Iu'], ascending=False)
    
    return ranked_candidates
        

## Putting all together

Now that we defined all functions necessary to build our item to item top-N recommendation, let's define function ```item2item_topN()``` that makes top-$N$ recommendations for a given user 

In [None]:
def item2item_topN(userid, N=30):
    """
    Produce top-N recommendation for a given user
    
    :param
        - userid : user for which we produce top-N recommendation
        - N : length of the top-N recommendation list
    
    :return
        - topN_list
    """
    # find candidate items
    I_u, candidates = candidate_items(userid)
    
    # rank candidate items according to their similarities with I_u
    ranked_candidates = rank_candidates(candidates, I_u)
    
    # get the first N row of ranked_candidates to build the top N recommendation list
    topN_list = ranked_candidates.iloc[:N]
    
    topN_list = pd.merge(topN_list, movies, on='itemid', how='inner')
    
    return topN_list

In [None]:
topN = item2item_topN(514)

In [None]:
topN

The ```topN``` dataframe represents the top N recommendation list for a user. These items are sorted in decreasing order of similarities with $I_u$.

<b>Observation</b> : The recommended items are the most similar to the set $I_u$ of items already purchased by the user.

## Top N recommendation with predictions

Before recommending the previous list to the user, we can go further and predict the ratings the user would have given to each of these items, sort them in descending order of prediction and return the reordered list as the new top N recommendation list.

### Rating prediction

As stated earlier, the predicted rating $\hat{r}_{u,i}$ for a given user $u$ on an item $i$ is obtained by aggregating ratings given by $u$ on items similar to $i$ as follows:

<center>
$
\large
 \hat{r}_{u,i}=\frac{\sum_{j\in S^{(i)}}r_{u,j}\cdot w_{i,j}}{\sum_{j\in S^{(i)}}|w_{i,j}|}
$
</center>

In [None]:
def predict(userid, itemid):
    """
    Make rating prediction for user userid on item itemid
    
    :param
        - userid : id of the active user
        - itemid : id of the item for which we are making prediction
        
    :return
        - r_hat : predicted rating
    """
    
    # index of itemid in the nearest neighbors space
    idx = itemids_to_idx[itemid]
    
    # Get items similar to item itemid with their corresponding similarities
    i_neighbors = neighbors[idx][1:]
    i_similarities = similarities[idx][1:]
    
    # ratings already given by userid, indexed by itemid
    user_ratings = ratings[ratings.userid==userid].set_index('itemid').rating
    
    # initialize denominator
    W = 0
    
    # initialize numerator
    weighted_sum = 0
    
    for n_idx, w in zip(i_neighbors, i_similarities):
        # neighbors are stored as row indices : map the index back to its itemid
        iid = idx_to_itemids[n_idx]
        
        # use the rating of userid on iid if it exists
        if iid in user_ratings.index:
            r = user_ratings.loc[iid]
            
            # update denominator
            W = W + abs(w)
            
            # update numerator
            weighted_sum = weighted_sum + r * w
            
    if weighted_sum == 0:
        # no usable neighbor rating : fall back to the mean rating of the user
        r_hat = mean[mean.userid==userid].rating.values[0]
    else:    
        # predicted rating
        r_hat = weighted_sum / W
    
    return r_hat

Let's check if this function predicts reasonable scores ...

In [None]:
ratings.head()

In [None]:
predict(1,1)

In [None]:
predict(1,3)

In [None]:
predict(1,47)

Now let's use our ```predict()``` function to predict what ratings the user would have given to the previous top-$N$ list and return the reorganised list (in decreasing order of predictions) as the new top-$N$ list

In [None]:
def topN_with_prediction(userid):
    """
    :param
        - userid : id of the active user
    
    :return
        - topN_list : initial topN recommendations returned by the function item2item_topN
        - topN_predict : topN recommendations reordered according to rating predictions
    """
    # make top N recommendation for the active user
    topN_list = item2item_topN(userid)
    
    # get list of items of the top N list
    itemids = topN_list.itemid.to_list()
    
    predictions = []
    
    # make prediction for each item in the top N list
    for itemid in itemids:
        r = predict(userid, itemid)
        
        predictions.append((itemid,r))
    
    predictions = pd.DataFrame(predictions, columns=['itemid','prediction'])
    
    # merge the predictions to topN_list and rearrange the list according to predictions
    topN_predict = pd.merge(topN_list, predictions, on='itemid', how='inner')
    topN_predict = topN_predict.sort_values(by=['prediction'], ascending=False)
    
    return topN_list, topN_predict

Now, let's make recommendations for user 514 and compare the two lists

In [None]:
topN_list, topN_predict = topN_with_prediction(userid=514)

In [None]:
topN_list

In [None]:
topN_predict

As you will have noticed, the two lists are sorted in different ways. The second list is organized according to the predictions made for the user.

<b>Note</b>: When making predictions for user $u$ on item $i$, user $u$ may not have rated any of the $k$ most similar items to $i$. In this case, we use the mean rating of $u$ as the predicted value.
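
The fallback can be sketched independently of the dataset. The function `predict_rating` and its inputs are hypothetical and simplified for illustration : the user's ratings and the neighbor similarities are passed as plain dictionaries keyed by item id, and the user's mean is given explicitly.

```python
def predict_rating(user_ratings, neighbor_sims, user_mean):
    """
    user_ratings  : {itemid: rating} for the active user
    neighbor_sims : {itemid: similarity with the target item} for its k nearest neighbors
    user_mean     : mean rating of the active user, used as fallback
    """
    weighted_sum = 0.0
    W = 0.0
    for j, w in neighbor_sims.items():
        # only neighbors that the user has actually rated contribute
        if j in user_ratings:
            weighted_sum += user_ratings[j] * w
            W += abs(w)

    # no rated neighbor (or only zero similarities) : fall back to the user's mean
    if W == 0:
        return user_mean
    return weighted_sum / W


toy_user_ratings = {10: 4.0, 42: 2.0}

# two of the three neighbors were rated : (4*0.5 + 2*0.25) / (0.5 + 0.25)
print(predict_rating(toy_user_ratings, {10: 0.5, 42: 0.25, 77: 0.9}, user_mean=3.0))

# none of the neighbors was rated : the user's mean is returned
print(predict_rating(toy_user_ratings, {77: 0.9, 99: 0.4}, user_mean=3.0))  # 3.0
```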

## References

1. George Karypis (2001). <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.554.1671&rep=rep1&type=pdf">Evaluation of Item-Based Top-N Recommendation Algorithms</a>
2. Sarwar et al. (2001). <a href="https://dl.acm.org/doi/10.1145/371920.372071">Item-based collaborative filtering recommendation algorithms</a>
3. Michael D. Ekstrand, et al. (2011). <a href="https://dl.acm.org/doi/10.1561/1100000009">Collaborative Filtering Recommender Systems</a>
4. J. Bobadilla et al. (2013). <a href="https://romisatriawahono.net/lecture/rm/survey/information%20retrieval/Bobadilla%20-%20Recommender%20Systems%20-%202013.pdf">Recommender systems survey</a>
5. Greg Linden, Brent Smith, and Jeremy York (2003). <a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf">Amazon.com Recommendations : Item-to-Item Collaborative Filtering</a>

## Author

<a href="https://www.linkedin.com/in/carmel-wenga-871876178/">Carmel WENGA</a>, Applied Machine Learning Research Engineer | <a href="https://shoppinglist.cm/fr/">ShoppingList</a>, Nzhinusoft