# Item-Item Collaborative Filtering Assignment

In this notebook I present my solution for the *Item-Item CF Programming Assignment*. The assignment comes with a spreadsheet that contains four sheets: Ratings, NormRatings, Matrix and FilterMatrix. The first one holds all the ratings the users have given to the films they have watched. The second one holds the same ratings, normalized using the mean rating of each user. The Matrix sheet is the one we have to fill in with the similarity between each pair of items (in our case, films). Finally, the last one shrinks the negative similarities to zero, which is why it is called filtered.

Let's first load all those sheets into different Pandas DataFrames.

In [40]:
import pandas as pd
import numpy as np

# First we load the Ratings sheet and separate the two extra vectors added for the computation.
# Note: recent pandas versions use `sheet_name` (not `sheetname`) and require `axis` as a keyword in `drop`.
ratings = pd.read_excel('Assignment 5.xlsx', sheet_name='Ratings', index_col=0)
l2_norm = ratings.loc['L2'][:-1]
ratings = ratings.drop('L2', axis=0)
mean_rating = ratings.loc[:, 'Mean']
ratings = ratings.drop('Mean', axis=1)

# Now we will load the normalized ratings.
norm_ratings = pd.read_excel('Assignment 5.xlsx', sheet_name='NormRatings', index_col=0)
norm_l2_norm = norm_ratings.loc['L2']
norm_ratings = norm_ratings.drop('L2', axis=0)
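
As a side note on how the two sheets relate, here is a minimal sketch of the normalization on made-up data. It assumes that NormRatings was built by subtracting each user's mean rating from that user's ratings; the toy values and the `toy_*` names are hypothetical and do not come from the spreadsheet.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are items, NaN means "not rated".
toy_ratings = pd.DataFrame(
    {"item_a": [4.0, 2.0, np.nan],
     "item_b": [5.0, np.nan, 3.0],
     "item_c": [3.0, 4.0, 1.0]},
    index=["u1", "u2", "u3"],
)

# Mean of each user's observed ratings (NaN entries are skipped by default).
toy_user_mean = toy_ratings.mean(axis=1)

# Subtract each user's mean from that user's row; unrated cells stay NaN.
toy_norm_ratings = toy_ratings.sub(toy_user_mean, axis=0)

print(toy_norm_ratings)
```

After this step every user's observed ratings average to zero, so a positive value means "above this user's usual rating" and a negative one means "below it".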

Now that we have loaded all the rating matrices, we can start working on the task.

## Unnormalized Approach

First of all, we start with the unnormalized approach. We need the similarity between items, computed as the cosine similarity of the unnormalized ratings, so let's define a function to do so. Basically, it computes the dot product between the transposed ratings and the ratings, and then divides each entry by the product of the L2 norms of the two items involved. Remember that we have to shrink the negative similarities to zero.
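
Written out, the quantity described above is roughly the following. The notation is mine, not the assignment's: $r_{u,i}$ is the rating of user $u$ for item $i$, with a missing rating counted as 0, and $\lVert r_i \rVert_2$ is the L2 norm of item $i$ taken from the `L2` row of the sheet.

$$
\mathrm{sim}(i, j) = \max\!\left(0,\ \frac{\sum_{u} r_{u,i}\, r_{u,j}}{\lVert r_i \rVert_2\, \lVert r_j \rVert_2}\right)
$$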

In [86]:
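def get_item_similarities(ratings, l2_norm):
    # Treat missing ratings as 0 so they do not contribute to the dot products.
    filled = ratings.fillna(0)
    # Item-by-item dot products, divided by the L2 norm of the row item and of the column item.
    similarities = filled.transpose().dot(filled).divide(l2_norm, axis=0).divide(l2_norm, axis=1)
    # Shrink negative similarities to zero.
    similarities = similarities.clip(lower=0)

    return similarities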

get_item_similarities(ratings, l2_norm).loc['1: Toy Story (1995)'].nlargest(6)

1: Toy Story (1995)                               1.000000
260: Star Wars: Episode IV - A New Hope (1977)    0.747409
780: Independence Day (ID4) (1996)                0.690665
296: Pulp Fiction (1994)                          0.667846
318: Shawshank Redemption, The (1994)             0.667424
1265: Groundhog Day (1993)                        0.661016
Name: 1: Toy Story (1995), dtype: float64
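
To make the similarity computation easier to check by hand, here is a small sketch on made-up data. The toy matrix and the `toy_*` names are hypothetical. It also assumes that an item's L2 norm is taken over its observed ratings only, which may or may not match how the `L2` row of the spreadsheet was built.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are items, NaN means "not rated".
toy = pd.DataFrame(
    {"item_a": [4.0, 2.0, np.nan],
     "item_b": [5.0, np.nan, 3.0],
     "item_c": [1.0, 5.0, 2.0]},
    index=["u1", "u2", "u3"],
)

filled = toy.fillna(0)

# Assumption: the L2 norm of an item covers its observed ratings only,
# which equals the norm of the zero-filled column.
toy_l2 = np.sqrt((filled ** 2).sum(axis=0))

# Item-by-item dot products, scaled by the norm of the row item and of the column item.
toy_sim = filled.T.dot(filled).divide(toy_l2, axis=0).divide(toy_l2, axis=1)

# Shrink negative similarities to zero (a no-op here, since raw ratings are non-negative).
toy_sim = toy_sim.clip(lower=0)

print(toy_sim.round(3))
```

The result is a symmetric item-by-item matrix with ones on the diagonal, which is why each film appears as its own nearest neighbour in the output above.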

Now we are going to use these similarities to calculate the recommendations. The score of each candidate movie is an average of the user's ratings, weighted by the similarity between each rated movie and the candidate. Remember that we only have to consider the weights of the movies the user has rated.
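
As a formula, the weighted average reads as follows. The symbols are my own shorthand: $R_u$ is the set of movies rated by user $u$, $r_{u,j}$ is the user's rating for movie $j$, and $\mathrm{sim}(i, j)$ is the similarity between the candidate $i$ and the rated movie $j$.

$$
\mathrm{score}(u, i) = \frac{\sum_{j \in R_u} \mathrm{sim}(i, j)\, r_{u,j}}{\sum_{j \in R_u} \lvert \mathrm{sim}(i, j) \rvert}
$$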

In [65]:
def get_recommendations(ratings, l2_norm, user, n=5):
    similarities = get_item_similarities(ratings, l2_norm)
    ratings_user = ratings.loc[user]
    recommendations = pd.Series(np.zeros(similarities.shape[0]), index=similarities.index)
    
    for item in ratings.columns:
        weights_item = similarities.loc[item]
        numerator = 0
        denominator = 0
        for i in range(len(ratings_user)):
            rating = ratings_user.iloc[i]
            # Only items the user has actually rated contribute to the average.
            if not np.isnan(rating):
                numerator += weights_item.iloc[i] * rating
                denominator += abs(weights_item.iloc[i])
        recommendations[item] = numerator/denominator
    
    return recommendations.nlargest(n)

get_recommendations(ratings, l2_norm, 5277)

527: Schindler's List (1993)                      2.973883
1259: Stand by Me (1986)                          2.928801
260: Star Wars: Episode IV - A New Hope (1977)    2.922240
593: Silence of the Lambs, The (1991)             2.883304
2396: Shakespeare in Love (1998)                  2.852131
dtype: float64
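
The double loop above can also be expressed with a couple of matrix operations. The sketch below is one possible vectorized formulation, not the code that produced the output above: `score_items` is a hypothetical helper, and the toy similarity matrix and ratings are made up.

```python
import numpy as np
import pandas as pd

def score_items(similarities, user_ratings):
    """Similarity-weighted average of one user's ratings for every item.

    `similarities` is an item-by-item DataFrame and `user_ratings` a Series
    indexed by the same items, with NaN for unrated items.
    """
    rated = user_ratings.notna()
    # Keep only the similarity columns of the items the user has rated.
    weights = similarities.loc[:, rated]
    numerator = weights.dot(user_ratings[rated])
    denominator = weights.abs().sum(axis=1)
    return numerator / denominator

# Hypothetical toy inputs, not taken from the assignment spreadsheet.
items = ["item_a", "item_b", "item_c"]
toy_sim = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 1.0]],
    index=items, columns=items,
)
toy_user = pd.Series([4.0, np.nan, 2.0], index=items)

toy_scores = score_items(toy_sim, toy_user)
print(toy_scores.nlargest(2))
```

Like the loop version, this scores every item, including the ones the user has already rated, so a real recommender would probably filter those out before ranking.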

## Normalized Approach

So far we have seen how to make recommendations using an unnormalized Item-Item CF algorithm. Now we are going to normalize it to see the effect it has on the recommendations.

The first thing we need to do is to get the similarities between items. For that we can reuse the *get_item_similarities* function that we defined for the unnormalized approach, this time passing the normalized ratings and their L2 norms.

In [82]:
get_item_similarities(ratings=norm_ratings, l2_norm=norm_l2_norm).loc['1: Toy Story (1995)'].nlargest(6)

1: Toy Story (1995)                      1.000000
34: Babe (1995)                          0.554448
356: Forrest Gump (1994)                 0.355780
296: Pulp Fiction (1994)                 0.295013
318: Shawshank Redemption, The (1994)    0.215975
2028: Saving Private Ryan (1998)         0.192799
Name: 1: Toy Story (1995), dtype: float64

In [87]:
def get_norm_recommendations(ratings, l2_norm, user, n=5):
    similarities = get_item_similarities(ratings, l2_norm)
    ratings_user = ratings.loc[user]
    recommendations = pd.Series(np.zeros(similarities.shape[0]), index=similarities.index)
    
    for item in ratings.columns:
        weights_item_i = similarities.loc[item]
        numerator = 0
        denominator = 0
        for j in range(len(ratings_user)):
            rating = ratings_user.iloc[j]
            # Only rated items with a positive similarity to the candidate contribute.
            if not np.isnan(rating) and weights_item_i.iloc[j] > 0:
                weights_item_j = similarities.iloc[j]
                numerator += weights_item_i.iloc[j] * (rating - weights_item_j.mean())
                denominator += abs(weights_item_i.iloc[j])
        recommendations[item] = numerator/denominator + weights_item_i.mean()
    
    return recommendations.nlargest(n)

get_norm_recommendations(norm_ratings, norm_l2_norm, 5277)

260: Star Wars: Episode IV - A New Hope (1977)    1.326118
527: Schindler's List (1993)                      1.246056
1259: Stand by Me (1986)                          0.540536
593: Silence of the Lambs, The (1991)             0.404379
2396: Shakespeare in Love (1998)                  0.362240
dtype: float64
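
For reference, reading the formula off the code of *get_norm_recommendations*, the score it computes appears to be the following. The notation is mine: $s_{ij}$ is the filtered similarity between items $i$ and $j$, $\bar{s}_i$ is the mean of row $i$ of the similarity matrix (`weights_item_i.mean()`), $r_{u,j}$ is the normalized rating of user $u$ for item $j$, and $R_u$ is the set of items the user has rated.

$$
\mathrm{score}(u, i) = \bar{s}_i + \frac{\sum_{j \in R_u,\ s_{ij} > 0} s_{ij}\,\bigl(r_{u,j} - \bar{s}_j\bigr)}{\sum_{j \in R_u,\ s_{ij} > 0} \lvert s_{ij} \rvert}
$$

Note that the offsets $\bar{s}_i$ and $\bar{s}_j$ are means of similarity rows, not mean ratings, and that the scores are on the normalized scale, which is why they are much smaller than the ones from the unnormalized approach.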