# Item-based Collaborative Filtering using Pandas

In this notebook I explore a technique to recommend items, called __item-based collaborative filtering__.

The objective of the notebook is to create a system to recommend movies based on a movie that you have seen. The recommendations are made based on a dataset of movie ratings by people.

I use the __MovieLens 100K Dataset__ (https://grouplens.org/datasets/movielens/). Download the dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip, and unzip it into a folder.

Two files of the dataset are used:
- `ml-100k/u.data` contains the 100k ratings.
- `ml-100k/u.item` provides a mapping from a movie id to the name of the movie.

I implement the item-based collaborative filtering procedure with Pandas, but the approach is based on the excellent Udemy course "__Taming Big Data with Apache Spark and Python - Hands On!__" by Frank Kane (https://www.udemy.com/course/taming-big-data-with-apache-spark-hands-on/).

In [1]:
# data manipulation
import pandas as pd

# imports to construct the similarity function
from numpy import dot
from numpy.linalg import norm

In [2]:
# paths to the files. download them from http://files.grouplens.org/datasets/movielens/ml-100k.zip if you don't have them yet
MOVIE_RATINGS_PATH = 'ml-100k/u.data'
MOVIE_NAMES_PATH = 'ml-100k/u.item'

# Reading the Ratings Data

In [3]:
def read_ratings(path_to_ratings):
    """
    Read the raw data of the movie ratings.
    
    Returns a list of tuples:
    (user id, movie id, rating)
    """

    data = []
    with open(path_to_ratings) as f:
        for line in f:
            # user id | item id | rating | timestamp
            pieces = line.split()
            user_id = int(pieces[0])
            movie_id = int(pieces[1])
            rating = float(pieces[2])
            data.append((user_id, movie_id, rating))
        
    return data

In [4]:
def read_names(path_to_names):
    """
    Read the mapping of movie id -> movie name
    
    Returns a dictionary
    {movie id -> movie name}
    """

    data = {}
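    # u.item contains non-ASCII characters and is encoded as ISO-8859-1 (Latin-1)
    with open(path_to_names, encoding='ISO-8859-1') as f: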
        for line in f:
            # movie id | movie title | ...
            pieces = line.split('|')
            movie_id = int(pieces[0])
            title = pieces[1]
            data[movie_id] = title
        
    return data

In [5]:
ratings = read_ratings(MOVIE_RATINGS_PATH)
ratings = pd.DataFrame(data=ratings, columns=['user', 'movie', 'rating'])
ratings = ratings.astype(int)
ratings.head()

Unnamed: 0,user,movie,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1
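As an alternative to the hand-written reader, the same table can probably be built in one call with `pd.read_csv`. The snippet below is a minimal sketch: it parses three in-memory sample lines that follow the tab-separated `u.data` layout (the timestamps are illustrative), and the name `ratings_sample` is introduced here only for the example.

```python
import io
import pandas as pd

# Sample lines in the u.data layout: user id, item id, rating, timestamp (tab-separated).
# The timestamp values are illustrative.
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

# Name the four columns and keep only the three that the notebook uses.
ratings_sample = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["user", "movie", "rating", "timestamp"],
    usecols=["user", "movie", "rating"],
)
print(ratings_sample)
```

To read the real file, the `sample` object would be replaced by the path stored in `MOVIE_RATINGS_PATH`.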


# Creating Pairs of Movie Ratings

The first important idea of item-based collaborative filtering is to create pairs of movie ratings. If a person gives the same rating to two different movies, it means that this person likes the 2 movies more or less to the same degree.

For example, user X gives the following ratings of 3 movies:
- user X, movie 1, 5 stars
- user X, movie 2, 1 star
- user X, movie 3, 5 stars

In this example, movie 1 and movie 3 both get 5 stars. Perhaps this person likes science-fiction movies, and movie 1 and movie 3 are both science-fiction movies. That would explain why both movies received 5 stars from this user. On the other hand, perhaps this user hates westerns. If movie 2 is a western, that would explain why it received a rating of only 1 star.

The method that I use here to create pairs of movie ratings by the same user is the __self join__. The ratings table is merged with itself, using the user id as the key. The result of the self join is a new table with 5 columns instead of 3:

`user id | movie 1 id | rating 1 | movie 2 id | rating 2`

Remember that pairs of movie ratings are only created within the ratings of the same person, because we merge using the user id as the key.


The self-join operation can lead to a table with a very large number of rows. The number of created rows depends on how many ratings each user has given: a user with k ratings contributes k × k rows. In the extreme case where the table has N ratings from N different users, the self-join only pairs each rating with itself, so no pairs of two different movies can be created at all.
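A toy table makes the row count of the self join concrete. This is a minimal sketch with made-up ratings; the names `toy` and `toy_joined` are introduced only for this example. A user with k ratings contributes k × k rows to the joined table.

```python
import pandas as pd

# Made-up ratings: user 1 rated three movies, user 2 rated one movie.
toy = pd.DataFrame({
    "user":   [1, 1, 1, 2],
    "movie":  [10, 20, 30, 10],
    "rating": [5, 1, 5, 4],
})

# Self join on the user id, exactly like the join on the full ratings table.
toy_joined = toy.merge(toy, on="user", suffixes=("_1", "_2"))

# User 1 contributes 3 x 3 = 9 rows, user 2 contributes 1 x 1 = 1 row.
print(toy_joined.shape)
print(toy_joined)
```

The single row of user 2 pairs movie 10 with itself, so this user adds nothing that can be used to compare two different movies.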

In [6]:
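# keep=False drops every user that has more than one rating. Every user in
# MovieLens 100K has rated at least 20 movies, so no rows remain to be joined.
ratings_no_dups = ratings.drop_duplicates('user', keep=False)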
ratings_no_dups.merge(ratings_no_dups, on='user', suffixes=('_1', '_2')).shape

(0, 5)

In [7]:
# number of unique users
ratings.user.nunique()

943

In [8]:
ratings_joined = ratings.merge(ratings, on='user', suffixes=('_1', '_2'))
ratings_joined.head()

Unnamed: 0,user,movie_1,rating_1,movie_2,rating_2
0,196,242,3,242,3
1,196,242,3,393,4
2,196,242,3,381,4
3,196,242,3,251,3
4,196,242,3,655,5


In [9]:
ratings_joined.shape

(20200812, 5)

# Removing unhelpful movie ratings pairs

The self-join of the ratings data creates some movie rating pairs that are not helpful. Each movie is paired with itself, and every pair of different movies appears twice: once as (A, B) and once as (B, A).

Both kinds of unhelpful pairs are removed by keeping only the rows where the id of the first movie is smaller than the id of the second movie.
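The effect of comparing the two movie ids can be checked on the four rows that a self join produces for one user who rated two movies. This is a small sketch with made-up values; `pairs` and `kept` are names used only here.

```python
import pandas as pd

# The four rows a self join creates for one user who rated movies 10 and 20.
pairs = pd.DataFrame({
    "movie_1":  [10, 10, 20, 20],
    "rating_1": [5, 5, 1, 1],
    "movie_2":  [10, 20, 10, 20],
    "rating_2": [5, 1, 5, 1],
})

# Keeping only movie_1 < movie_2 removes the self pairs (10, 10) and (20, 20)
# as well as the mirrored pair (20, 10).
kept = pairs[pairs["movie_1"] < pairs["movie_2"]]
print(kept)
```

Only the row for the pair (10, 20) survives, so each pair of different movies is counted once per user.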

In [10]:
ratings_filtered = ratings_joined[ratings_joined['movie_1'] < ratings_joined['movie_2']]
ratings_filtered.head()

Unnamed: 0,user,movie_1,rating_1,movie_2,rating_2
1,196,242,3,393,4
2,196,242,3,381,4
3,196,242,3,251,3
4,196,242,3,655,5
6,196,242,3,306,4


In [11]:
ratings_filtered.shape

(10050406, 5)

# Discard the user column

Now that we have pairs of movie ratings made by the same user, I no longer need the user id column. While it seems a small detail in the overall picture, this step is an important part of the _item_-based collaborative filtering procedure. 

The algorithm focuses on movies. Not on users.

When we make recommendations based on a movie seen by a specific user, we ignore the preferences of the user that provided the movie. Our algorithm will provide recommendations based on the more general consensus of the users in our rating data, _not_ on the specific preferences of the user. 

In [12]:
ratings_filtered = ratings_filtered.drop(['user'], axis=1)
ratings_filtered.head()

Unnamed: 0,movie_1,rating_1,movie_2,rating_2
1,242,3,393,4
2,242,3,381,4
3,242,3,251,3
4,242,3,655,5
6,242,3,306,4


# Grouping the Movie Ratings Pairs


Now that we have a table of movie pair ratings, we want to group them per unique movie pair. The idea is that the ratings data contains ratings of the same movie pair from many users.

The grouped data looks like this:

`movie 1 | movie 2 | [(rating 11, rating 21), (rating 12, rating 22), (rating 13, rating 23), ...]`

The third column of the grouped data contains a list (a group) of ratings that all correspond to the same movie pair. Each element of the group corresponds to a specific user that rated both movie 1 and movie 2. 

If a movie pair has many elements in the corresponding group, it means that many different users have rated that specific movie pair.
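To make the grouped structure visible, the sketch below groups a few made-up pair ratings and collects the rating tuples of each movie pair into a list. The names `pair_ratings` and `rating_lists` are hypothetical and exist only in this example; the notebook itself keeps working with the pandas group object.

```python
import pandas as pd

# Made-up pair ratings: two users rated the pair (1, 2),
# one user rated (1, 3) and one user rated (2, 3).
pair_ratings = pd.DataFrame({
    "movie_1":  [1, 1, 1, 2],
    "rating_1": [5, 4, 3, 2],
    "movie_2":  [2, 2, 3, 3],
    "rating_2": [4, 4, 5, 1],
})

grouped = pair_ratings.groupby(["movie_1", "movie_2"])

# One list of (rating 1, rating 2) tuples per unique movie pair.
rating_lists = {
    pair: list(zip(group["rating_1"].tolist(), group["rating_2"].tolist()))
    for pair, group in grouped
}

for pair, tuples in rating_lists.items():
    print(pair, tuples)
```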

In [13]:
ratings_grouped = ratings_filtered.groupby(by=['movie_1', 'movie_2'])

In [14]:
ratings_grouped.head(2)

Unnamed: 0,movie_1,rating_1,movie_2,rating_2
1,242,3,393,4
2,242,3,381,4
3,242,3,251,3
4,242,3,655,5
6,242,3,306,4
8,242,3,663,5
10,242,3,580,2
12,242,3,286,5
14,242,3,692,5
16,242,3,428,4


In [15]:
len(ratings_grouped)

983206

# Counting the co-occurrences of grouped movie pair ratings

We can now count how many ratings we found for a pair of movies, aka the co-occurrences for movie pairs.

Co-occurrences indicate how many different users have rated the same pair of movies. If many people rated the same pair of movies, it may indicate the 2 movies share some similarities.
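The claim that the co-occurrence equals the number of users who rated both movies can be verified on a toy table. This is a sketch under the assumption that each user rates a movie at most once; all names and values are made up for the example.

```python
import pandas as pd

# Made-up ratings: users 1 and 2 rated both movies, user 3 rated only movie 10.
toy = pd.DataFrame({
    "user":   [1, 1, 2, 2, 3],
    "movie":  [10, 20, 10, 20, 10],
    "rating": [5, 4, 3, 3, 2],
})

# Same pipeline as in the notebook: self join, filter, group, count.
joined = toy.merge(toy, on="user", suffixes=("_1", "_2"))
joined = joined[joined["movie_1"] < joined["movie_2"]]
counts = joined.groupby(["movie_1", "movie_2"]).size().to_frame("cooccurrence")

# Direct check: the users that rated movie 10 and also rated movie 20.
users_10 = set(toy.loc[toy["movie"] == 10, "user"])
users_20 = set(toy.loc[toy["movie"] == 20, "user"])
print(counts)
print(len(users_10 & users_20))
```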

In [16]:
cooccurrence = ratings_grouped.size().to_frame('cooccurrence')

In [17]:
cooccurrence.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,cooccurrence
movie_1,movie_2,Unnamed: 2_level_1
1,2,104
1,3,78
1,4,149
1,5,57
1,6,14


In [18]:
cooccurrence.describe()

Unnamed: 0,cooccurrence
count,983206.0
mean,10.222076
std,18.824504
min,1.0
25%,1.0
50%,3.0
75%,10.0
max,480.0


# Computing movie similarity 

With a group of movie ratings, we can also compute the similarity in ratings. The idea is to quantify how similar the ratings of a pair of movies are across all the users that have rated them. If movies are rated very similarly across all the users, perhaps it means the movies are similar as well.

Here, we use cosine similarity to quantify the similarity between movie ratings. Remember that a group of movie pair ratings consists of the following data:

`[(rating 11, rating 21), (rating 12, rating 22), (rating 13, rating 23), ...]`

with the first element of each tuple corresponding to the rating of the first movie of the movie pair and the second element corresponding to the rating of the second movie of the same movie pair.

To compute the similarity, we first construct 2 vectors of ratings. The first vector holds all the ratings of the first movie. The second vector holds all the ratings of the second movie within the same movie pair.

`a = [rating 11, rating 12, rating 13, ...]`

`b = [rating 21, rating 22, rating 23, ...]`

The cosine similarity metric uses the following formula to compute the similarity:

\begin{equation*}
    cosine = \frac{a \cdot b}{||a|| \cdot ||b||}
\end{equation*}

with 
- $a \cdot b$ the dot product of the two vectors (the sum of the element-wise products)
- $||x||$ the L2-norm of a vector $x$

Since all the ratings are strictly positive, the cosine similarity metric takes a value between `0` (not similar at all) and `1` (very similar); a value of exactly `0` cannot occur.

More information on cosine similarity can be found here: https://en.wikipedia.org/wiki/Cosine_similarity
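A few hand-checkable rating vectors show how the formula behaves. This is a minimal sketch; the helper `cosine` and the example vectors are introduced only for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two equally long rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical rating vectors point in the same direction: similarity 1.
same = cosine([5, 3, 4], [5, 3, 4])

# Opposite tastes: dot product 5*1 + 1*5 = 10, both norms sqrt(26),
# so the similarity is 10 / 26, about 0.385.
opposite = cosine([5, 1], [1, 5])

# A single shared rating always gives 1, however different the two ratings are.
single = cosine([1], [5])

print(same, opposite, single)
```

The last case is worth keeping in mind: a movie pair that was rated by only one user gets a perfect similarity, regardless of the ratings.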


In [19]:
def similarity(group_of_movie_ratings):
    """
    Cosine similarity measure, implemented through numpy functions.

    Accepts a DataFrame with 2 columns: rating_1, rating_2

    Returns a float
    """
    # ratings of the first and the second movie of the pair, one entry per user
    a = group_of_movie_ratings['rating_1'].values
    b = group_of_movie_ratings['rating_2'].values

    cos_sim = dot(a, b)/(norm(a)*norm(b))

    return cos_sim

The similarity measure must be computed for each group of movie pair ratings. In this notebook, the similarity computation is the most computationally intensive operation.
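Because the cosine formula only needs three sums per movie pair (the products, and the squares of each rating column), the per-group function call can in principle be replaced by column arithmetic followed by a single grouped sum. The sketch below is one possible alternative, not the approach used in this notebook; it runs on made-up data, all names are hypothetical, and it cross-checks the result against a per-group function.

```python
import numpy as np
import pandas as pd

# Made-up pair ratings with the same columns as the filtered ratings table.
pair_ratings = pd.DataFrame({
    "movie_1":  [1, 1, 1, 2],
    "rating_1": [5, 4, 3, 2],
    "movie_2":  [2, 2, 3, 3],
    "rating_2": [4, 4, 5, 1],
})

# Per-row terms of the cosine formula.
terms = pd.DataFrame({
    "movie_1": pair_ratings["movie_1"],
    "movie_2": pair_ratings["movie_2"],
    "dot":  pair_ratings["rating_1"] * pair_ratings["rating_2"],
    "sq_1": pair_ratings["rating_1"] ** 2,
    "sq_2": pair_ratings["rating_2"] ** 2,
})

# Summing per movie pair gives a.b, ||a||^2 and ||b||^2 for every group at once.
sums = terms.groupby(["movie_1", "movie_2"]).sum()
fast_similarity = sums["dot"] / np.sqrt(sums["sq_1"] * sums["sq_2"])

# Cross-check with a per-group cosine similarity function.
def similarity(group):
    a = group["rating_1"].to_numpy()
    b = group["rating_2"].to_numpy()
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

slow_similarity = (
    pair_ratings.groupby(["movie_1", "movie_2"])[["rating_1", "rating_2"]]
    .apply(similarity)
)

print(fast_similarity)
print(slow_similarity)
```

Whether this is noticeably faster on the full ratings table depends on the machine and the pandas version, so it is worth timing both variants before switching.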

In [20]:
%%time

# calculate similarity for each group
ratings_similar = ratings_grouped.apply(similarity)

Wall time: 1min 14s


In [21]:
# convert Series with MultiIndex to a DataFrame, give column a name
ratings_similar = ratings_similar.to_frame('similarity')

In [22]:
ratings_similar.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,similarity
movie_1,movie_2,Unnamed: 2_level_1
1,2,0.948737
1,3,0.9133
1,4,0.942907
1,5,0.961364
1,6,0.955119


# Bundling the 2 similarity measures together

We now have 2 similarity measures:
- the number of co-occurrences for a movie pair
- the similarity measure of the ratings of a movie pair

We put them together into a table:

In [23]:
df = pd.concat([ratings_similar, cooccurrence], axis=1)
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,similarity,cooccurrence
movie_1,movie_2,Unnamed: 2_level_1,Unnamed: 3_level_1
1,2,0.948737,104
1,3,0.9133,78
1,4,0.942907,149
1,5,0.961364,57
1,6,0.955119,14


In [24]:
df.shape

(983206, 2)

In [25]:
df = df.reset_index().sort_values(by=['cooccurrence', 'similarity'], ascending=False)

In [26]:
df.head()

Unnamed: 0,movie_1,movie_2,similarity,cooccurrence
70757,50,181,0.985723,480
70676,50,100,0.954684,394
48,1,50,0.96722,381
70750,50,174,0.98176,380
70697,50,121,0.94773,362


In [27]:
df.to_csv('result.csv', index=False)

In [28]:
movie_names = read_names(MOVIE_NAMES_PATH)

# Recommending movies

We have a similarity and co-occurrence measure for every available pair of movies in our ratings data. We can use this data to recommend movies.

The recipe for recommendation goes as follows:
- Provide a movie that you have seen.
- Look up all the movie pairs this movie appears in.
- Sort the movie pairs of the seen movie based on the co-occurrence and similarity measures.
- Return a list of top n of movie pairs with the highest co-occurrence / similarity measures.

So, we have 2 measures that we can use to recommend movies: co-occurrence and cosine similarity. 

__Why don't we just use cosine similarity?__

Well, if the cosine similarity measure is based on a very low number of movie ratings, it is probably not trustworthy. So we only trust the similarity measure if enough users have rated both movies of the pair, and we use the co-occurrence count as a minimum threshold: a movie is only recommended if at least a minimum number of users rated it together with the seen movie.
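The role of the threshold can be illustrated on a small, made-up score table. In this sketch the pair with the highest similarity is backed by a single user and is dropped before ranking; the table `scores` and its values are hypothetical.

```python
import pandas as pd

# Made-up scores for three movie pairs that all contain movie 1.
scores = pd.DataFrame({
    "movie_1":      [1, 1, 1],
    "movie_2":      [2, 3, 4],
    "similarity":   [1.0, 0.97, 0.95],
    "cooccurrence": [1, 120, 60],
})

min_cooccurrence = 50

# First drop the pairs that too few users rated, then rank by similarity.
trusted = (
    scores[scores["cooccurrence"] >= min_cooccurrence]
    .sort_values("similarity", ascending=False)
)
print(trusted)
```

Without the threshold, movie 2 would be ranked first on the strength of one user's ratings.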

Other _edge cases_ of the recommender system:
- each movie that we want recommendations for must appear in the original set of movie ratings. Since the MovieLens 100K dataset was collected in the late '90s, movies released after that date do not appear in the ratings data. So, the recommender system cannot make recommendations for those movies.
- even if a movie appears in the original MovieLens dataset, there's no guarantee recommendations can be made for that movie. For example, if the movie has never been rated by a user that also rated other movies, no recommendation can be made. It is as if this movie is an outlier.

In [29]:
def recommend(
    movie_id,
    ratings,
    names,
    min_cooccurrence=50,
    top=5):
    """
    Make movie recommendations for the given movie.

    Input:
    - movie_id: the movie id for which you want recommendations
    - ratings: the movie pairs with their similarity and co-occurrence measures
    - names: a mapping of movie id -> movie name
    - min_cooccurrence: the minimum threshold for co-occurrence
    - top: the number of recommendations to make

    Returns:
    A DataFrame with the top recommendations,
    or None if no recommendation can be made.
    """

    # filter based on given movie, minimum co-occurrence
    recomm = ratings.loc[
        ((ratings.movie_1 == movie_id) |
         (ratings.movie_2 == movie_id)) &
        (ratings.cooccurrence >= min_cooccurrence)]

    # sort by similarity measure
    recomm = recomm.sort_values(by=['similarity'], ascending=False)

    # keep top n recommendations
    recomm = recomm.head(top).reset_index(drop=True)

    # edge case: no movie pair passes the filter
    if len(recomm) == 0:
        print('No recommendations for movie id {}...'.format(movie_id))
        return None

    # the recommended movie is the other movie of the pair
    recomm['recommended_movie'] = recomm.apply(
        lambda row: row.movie_2 if row.movie_1 == movie_id else row.movie_1,
        axis=1)

    recomm['recommended_movie'] = recomm['recommended_movie'].astype(int)

    # find movie title for the recommended movie id
    recomm['title'] = recomm['recommended_movie'].apply(lambda d: names[d])

    return recomm[['recommended_movie', 'title', 'similarity', 'cooccurrence']]

In [30]:
movie_id = 10
min_cooccurrence = 50
top = 5
recommendations = recommend(movie_id, df, movie_names, min_cooccurrence, top)

In [31]:
movie_names[movie_id]

'Richard III (1995)'

In [32]:
recommendations

Unnamed: 0,recommended_movie,title,similarity,cooccurrence
0,127,"Godfather, The (1972)",0.95237,55
1,100,Fargo (1996),0.950818,74
2,7,Twelve Monkeys (1995),0.949439,62
3,9,Dead Man Walking (1995),0.947611,50
4,174,Raiders of the Lost Ark (1981),0.946769,54
