# Item-Based Collaborative Filtering

As before, we'll start by importing the MovieLens 100K data set into a pandas DataFrame:

In [46]:
# Importing the pandas library for data manipulation
import pandas as pd

# Defining the column names for the ratings data
r_cols = ['user_id', 'movie_id', 'rating']

# Reading the ratings data from a tab-separated file
# 'usecols=range(3)' ensures only the first three columns (user_id, movie_id, rating) are imported
# 'encoding' specifies the character encoding for the file
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

# Defining the column names for the movie data
m_cols = ['movie_id', 'title']

# Reading the movie data from a pipe-separated file
# Here again, we're specifying the columns to read and the file encoding
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")

# Merging the 'movies' and 'ratings' DataFrames on the 'movie_id' column
# This adds the 'title' column to the 'ratings' DataFrame, allowing for more readable data
ratings = pd.merge(movies, ratings)

# Displaying the first five entries of the merged DataFrame
# This is a quick way to check the top rows of the dataset to ensure it looks as expected
ratings.head()

,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3


Now we'll pivot this table to construct a matrix with one row per user and one column per movie. NaN indicates missing data, that is, a movie that a given user did not rate:

In [47]:
# Pivot the 'ratings' dataframe to create a user-item rating matrix.
# 'index' sets 'user_id' as the row index.
# 'columns' uses movie titles for the columns.
# 'values' fills the table with user ratings for each movie.
# The result is each user's rating for each movie; NaN where a user hasn't rated a movie.
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')

# Display the first five rows of the pivot table to check its structure and contents.
# This will show what ratings (if any) the first five users gave to each movie.
userRatings.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,,,,,,,,,,,...,,,,,,,,,,
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
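To make the shape of that matrix concrete, here is a minimal sketch on a tiny hand-made table. The `toy` DataFrame and its values are hypothetical and are not taken from MovieLens.

```python
import pandas as pd

# Hypothetical toy ratings (not MovieLens data): three users, two titles.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title':   ['A', 'B', 'A', 'B'],
    'rating':  [5, 3, 4, 2],
})

# One row per user, one column per title.
# Any (user, title) pair with no rating becomes NaN.
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_matrix)
```

Note that pivot_table aggregates with the mean by default, so if the same user and title appear more than once, the cell holds the average of those ratings.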


Now the magic happens: pandas has a built-in corr() method that computes a correlation score for every pair of columns in the matrix! This gives us a correlation score between every pair of movies, computed over the users who rated both. Pairs with too few raters in common to compute a score show up as NaN. That's amazing!

In [48]:
# Assuming 'userRatings' is a DataFrame created earlier, which is a pivot table where
# rows are user IDs, columns are movie titles, and values are user ratings for the movies.

# Calculating the correlation matrix from the 'userRatings' pivot table.
# This matrix measures the pairwise correlation of all movies based on user ratings.
# It uses Pearson correlation by default.
# Each cell in the matrix represents the correlation coefficient between two movies.
corrMatrix = userRatings.corr()

# Displaying the first five rows of the correlation matrix.
# This provides a quick look at the correlation coefficients for a subset of movies.
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title,,,,,,,,,,,...,,,,,,,,,,
'Til There Was You (1997),1.0,,-1.0,-0.5,-0.5,0.522233,,-0.426401,,,...,,,,,,,,,,
1-900 (1994),,1.0,,,,,,-0.981981,,,...,,,,-0.944911,,,,,,
101 Dalmatians (1996),-1.0,,1.0,-0.04989,0.269191,0.048973,0.266928,-0.043407,,0.111111,...,,-1.0,,0.15884,0.119234,0.680414,-4.8756e-17,0.707107,,
12 Angry Men (1957),-0.5,,-0.04989,1.0,0.666667,0.256625,0.274772,0.178848,,0.457176,...,,,,0.096546,0.068944,-0.361961,0.1443376,1.0,1.0,
187 (1997),-0.5,,0.269191,0.666667,1.0,0.596644,,-0.5547,,1.0,...,,0.866025,,0.455233,-0.5,0.5,0.4753271,,,
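As a rough illustration of how corr() treats missing ratings, the sketch below uses a hypothetical three-movie matrix. Each pair is correlated over only the rows where both columns have a value, and a pair with no rows in common comes out as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix (rows are users, columns are movies).
toy = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0, np.nan],
    'B': [4.0, 5.0, 2.0, 1.0, np.nan],
    'C': [np.nan, np.nan, np.nan, np.nan, 3.0],
})

# Pearson correlation, computed pairwise over the users who rated both movies.
toy_corr = toy.corr()

print(toy_corr.loc['A', 'B'])  # four users rated both A and B
print(toy_corr.loc['A', 'C'])  # nobody rated both A and C, so this is NaN
```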


However, we want to avoid spurious results that come from just a handful of users who happened to rate the same pair of movies. To restrict our results to movies that lots of people rated together, and to get more popular results that are more easily recognizable, we'll use the min_periods argument to throw out results where fewer than 100 users rated a given movie pair:

In [49]:
# Assuming 'userRatings' is a DataFrame created earlier, which is a pivot table where
# rows are user IDs, columns are movie titles, and values are user ratings for the movies.

# Calculating the correlation matrix from the 'userRatings' pivot table with specific parameters:
# 'method' specifies the correlation coefficient to be calculated using Pearson's method.
# 'min_periods' sets the minimum number of observations required per pair of columns to have a valid result.
# In this case, at least 100 users must have rated both movies to include their correlation in the matrix.
corrMatrix = userRatings.corr(method='pearson', min_periods=100)

# Displaying the first five rows of the correlation matrix.
# This provides a quick look at the correlation coefficients for a subset of movies under the new parameters.
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
title,,,,,,,,,,,...,,,,,,,,,,
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,
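To see why so many cells (including some on the diagonal) turn into NaN, it can help to count how many users rated each pair. The sketch below does that on a hypothetical toy matrix; the `rated` and `co_counts` names are my own, and the threshold of 3 stands in for the 100 used above.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix; movie C was rated by only two users.
toy = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0],
    'B': [4.0, 5.0, 2.0, 1.0],
    'C': [5.0, 3.0, np.nan, np.nan],
})

# 1 where a user rated the movie, 0 otherwise.
rated = toy.notna().astype(int)

# Entry (i, j) is the number of users who rated both movie i and movie j.
co_counts = rated.T.dot(rated)

# Require at least 3 common raters per pair (a stand-in for 100).
strict = toy.corr(method='pearson', min_periods=3)

print(co_counts)
print(strict)
```

Because C has only two ratings in total, every pair involving C, including C with itself, falls below the threshold and becomes NaN.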


Now let's produce some movie recommendations for user ID 0, who I manually added to the data set as a test case. This guy really likes Star Wars and The Empire Strikes Back, but hated Gone with the Wind. I'll extract his ratings from the userRatings DataFrame, and use dropna() to get rid of missing data (leaving me with a Series of only the movies I actually rated):

In [50]:
# Assuming 'userRatings' is a DataFrame created earlier, which is a pivot table where
# rows are user IDs, columns are movie titles, and values are user ratings for the movies.

# Selecting the ratings given by the user with user_id 0.
# The 'loc[0]' indexer selects the row labeled 0, i.e. all ratings by this specific user.
myRatings = userRatings.loc[0]

# Dropping all NaN (Not a Number) values from this user's ratings.
# NaN values represent movies that this user has not rated.
# The 'dropna()' function removes these entries, leaving only movies that the user has rated.
myRatings = myRatings.dropna()

# Displaying the ratings given by user 0 to various movies.
# This will show a Series where the index consists of movie titles and the values are the user's ratings.
myRatings

title
Empire Strikes Back, The (1980)    5.0
Gone with the Wind (1939)          1.0
Star Wars (1977)                   5.0
Name: 0, dtype: float64

Now, let's go through each movie I rated one at a time, and build up a list of possible recommendations based on the movies similar to the ones I rated.

So for each movie I rated, I'll retrieve the list of similar movies from our correlation matrix. I'll then scale those correlation scores by how highly I rated the movie they are similar to, so movies similar to ones I liked count more than movies similar to ones I hated:

In [51]:
# Assuming 'myRatings' is a pandas Series containing movies rated by user 0.
# Also assuming 'corrMatrix' is a DataFrame containing movie-to-movie correlation coefficients.

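# Initializing an empty pandas Series to store similarity candidates.
# An explicit dtype keeps the empty Series as float64 rather than the object default of newer pandas.
simCandidates = pd.Series(dtype='float64')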

# Looping through each movie rated by user 0.
for i in range(0, len(myRatings.index)):
    print("Round:", i)
    print("Adding sims for " + myRatings.index[i] + "...")
    print("-----------------")

    # Retrieve similar movies to this one that user 0 rated.
    # Extracting the similarity scores for the current movie from the correlation matrix.
    # Dropping any missing values from the similarity Series.
    sims = corrMatrix[myRatings.index[i]].dropna()
    print(sims)
    print("-----------------")

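    # Now scale its similarity by how well user 0 rated this movie.
    # This step weights the similarity scores by how much the user liked the movie.
    # 'iloc[i]' fetches the rating by position; plain 'myRatings[i]' on a title-indexed Series is deprecated.
    sims = sims.map(lambda x: x * myRatings.iloc[i])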
    print(sims)
    print("-----------------")

    # Add the score to the list of similarity candidates.
    # Using 'pd.concat' to append the new similarities to the existing Series.
    # simCandidates = simCandidates.append(sims) # old pandas
    simCandidates = pd.concat([simCandidates, sims])
    print(simCandidates)
    print("*****************")

# Sorting the similarity candidates in descending order.
print("sorting...")
simCandidates.sort_values(inplace=True, ascending=False)

# Displaying the top 10 movie recommendations.
print(simCandidates.head(10))

Round: 0
Adding sims for Empire Strikes Back, The (1980)...
-----------------
title
2001: A Space Odyssey (1968)                    0.141598
Abyss, The (1989)                               0.277867
African Queen, The (1951)                       0.231657
Air Force One (1997)                            0.165620
Aladdin (1992)                                  0.311063
                                                  ...   
When Harry Met Sally... (1989)                  0.154222
While You Were Sleeping (1995)                  0.266557
Willy Wonka and the Chocolate Factory (1971)    0.191770
Wizard of Oz, The (1939)                        0.287675
Young Frankenstein (1974)                       0.185887
Name: Empire Strikes Back, The (1980), Length: 197, dtype: float64
-----------------
title
2001: A Space Odyssey (1968)                    0.707991
Abyss, The (1989)                               1.389334
African Queen, The (1951)                       1.158286
Air Force One (1997)       



This is starting to look like something useful! Note that some of the same movies came up more than once, because they were similar to more than one movie I rated. We'll use groupby() to add together the scores from movies that show up more than once, so they'll count more:

In [52]:
# Grouping the simCandidates Series by its index (movie titles) and summing up the similarity scores.
# This is necessary because the same movie might have appeared multiple times in the list with different scores,
# especially if it was similar to more than one movie that the user rated.
# Summing these scores gives a comprehensive similarity score for each movie.
simCandidates = simCandidates.groupby(simCandidates.index).sum()

In [53]:
# Sorting the 'simCandidates' Series in descending order.
# This rearranges the movies so that those with the highest total similarity scores are at the top.
simCandidates.sort_values(inplace=True, ascending=False)

# Displaying the top 10 entries from the sorted series.
# These are the top 10 movies recommended for the user, based on the user's previous ratings
# and the similarity of other movies to those ratings.
simCandidates.head(10)

Empire Strikes Back, The (1980)              8.877450
Star Wars (1977)                             8.870971
Return of the Jedi (1983)                    7.178172
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.488028
Bridge on the River Kwai, The (1957)         3.366616
Back to the Future (1985)                    3.357941
Sting, The (1973)                            3.329843
Cinderella (1950)                            3.245412
Field of Dreams (1989)                       3.222311
dtype: float64
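For comparison, the loop-then-groupby() approach can also be written as a single weighted sum. The sketch below is one possible alternative, not the notebook's method. It runs both versions on a hypothetical four-movie correlation matrix (`toy_corr`) and a two-movie ratings Series (`toy_ratings`) to check that they agree.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix; movie D has no usable correlations.
titles = ['A', 'B', 'C', 'D']
toy_corr = pd.DataFrame(
    [[1.0, 0.5, 0.2, np.nan],
     [0.5, 1.0, np.nan, np.nan],
     [0.2, np.nan, 1.0, np.nan],
     [np.nan, np.nan, np.nan, np.nan]],
    index=titles, columns=titles)

# Hypothetical ratings by one user.
toy_ratings = pd.Series({'A': 5.0, 'B': 1.0})

# Loop version: scale each rated movie's similarities, then sum duplicates.
parts = []
for title, rating in toy_ratings.items():
    parts.append(toy_corr[title].dropna() * rating)
looped = pd.concat(parts).groupby(level=0).sum().sort_values(ascending=False)

# Vectorized version: scale each rated movie's column by its rating,
# then add across columns. min_count=1 keeps all-NaN rows as NaN so
# dropna() can remove them instead of reporting a score of 0.
weighted = toy_corr[toy_ratings.index].mul(toy_ratings, axis='columns')
vectorized = (weighted.sum(axis='columns', min_count=1)
                      .dropna()
                      .sort_values(ascending=False))

print(looped)
print(vectorized)
```

On real data the vectorized form avoids the repeated pd.concat calls, though the loop is easier to follow step by step.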

The last thing we have to do is filter out movies I've already rated, as recommending a movie I've already watched isn't helpful:

In [54]:
# 'simCandidates' is a pandas Series with movie titles as the index and their aggregated similarity scores as values.
# 'myRatings.index' contains the indices (movie titles) of movies that user 0 has already rated.

# Dropping the movies that user 0 has already rated from the recommendation list.
# This is done to ensure the recommendations are for movies the user hasn't seen yet.
filteredSims = simCandidates.drop(myRatings.index)

# Displaying the top 10 entries from the filtered recommendation list.
# These are the top 10 movie recommendations excluding the ones already rated by the user.
filteredSims.head(10)

Return of the Jedi (1983)                    7.178172
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.488028
Bridge on the River Kwai, The (1957)         3.366616
Back to the Future (1985)                    3.357941
Sting, The (1973)                            3.329843
Cinderella (1950)                            3.245412
Field of Dreams (1989)                       3.222311
Wizard of Oz, The (1939)                     3.200268
Dumbo (1941)                                 2.981645
dtype: float64
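One caveat with this filter: by default, drop() raises a KeyError if any label is missing from the Series. That can happen when a movie the user rated has too few raters to survive the min_periods cut. A minimal sketch of a more forgiving version, using hypothetical scores and titles:

```python
import pandas as pd

# Hypothetical candidate scores and already-rated titles.
toy_scores = pd.Series({'A': 3.0, 'B': 2.0, 'C': 1.0})
already_rated = pd.Index(['A', 'Z'])  # 'Z' never made it into the scores

# errors='ignore' skips labels that are absent instead of raising KeyError.
filtered = toy_scores.drop(already_rated, errors='ignore')
print(filtered)
```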

There we have it!

## Exercise

Can you improve on these results? Perhaps a different method or min_periods value on the correlation computation would produce more interesting results.

Also, it looks like some movies similar to Gone with the Wind - which I hated - made it through to the final list of recommendations. Perhaps movies similar to ones the user rated poorly should actually be penalized, instead of just scaled down?
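As a starting point for that idea, one option is to center each rating on a neutral value before using it as a weight, so that low ratings contribute negative scores. The sketch below assumes a neutral point of 3 on the 1 to 5 scale. That choice, and the toy similarity values, are my own assumptions.

```python
import pandas as pd

# Hypothetical ratings by one user and similarities of two candidates (X, Y)
# to each rated movie.
toy_ratings = pd.Series({'Liked': 5.0, 'Hated': 1.0})
toy_sims = pd.DataFrame({
    'Liked': {'X': 0.6, 'Y': 0.1},
    'Hated': {'X': 0.1, 'Y': 0.7},
})

NEUTRAL = 3.0  # assumed midpoint of the 1-5 rating scale

# Original scheme: multiply by the raw rating, so a hated movie still adds a
# small positive amount to everything similar to it.
scaled = toy_sims.mul(toy_ratings, axis='columns').sum(axis='columns')

# Centered scheme: ratings below NEUTRAL become negative weights, so movies
# similar to a hated one are pushed down the list.
centered = toy_sims.mul(toy_ratings - NEUTRAL, axis='columns').sum(axis='columns')

print(scaled)
print(centered)
```

Here candidate Y, which mostly resembles the hated movie, keeps a positive score under the original scheme but goes negative once the ratings are centered.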

There are also probably some outliers in the user rating data set: some users may have rated a huge number of movies and have a disproportionate effect on the results. Go back to earlier lectures to learn how to identify these outliers, and see if removing them improves things.
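One possible way to flag heavy raters is to count ratings per user and drop anyone far above the average. The cutoff used below (mean plus two standard deviations) is an assumption on my part and may differ from the approach in the earlier lectures. The data is a hypothetical toy set.

```python
import pandas as pd

# Hypothetical ratings: users 1-10 rated 2 movies each, user 11 rated 50.
user_ids = [u for u in range(1, 11) for _ in range(2)] + [11] * 50
toy = pd.DataFrame({'user_id': user_ids, 'rating': 3})

# Number of ratings submitted by each user.
counts = toy.groupby('user_id')['rating'].count()

# Assumed cutoff: more than two standard deviations above the mean count.
limit = counts.mean() + 2 * counts.std()
heavy = counts[counts > limit].index

# Keep only the ratings from users below the cutoff.
filtered = toy[~toy['user_id'].isin(heavy)]

print(list(heavy))
print(len(toy), '->', len(filtered))
```

The same filter could be applied to the ratings DataFrame before pivoting, then the correlation matrix rebuilt to see whether the recommendations change.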

For an even bigger project: we're evaluating the result qualitatively here, but we could actually apply train/test and measure our ability to predict user ratings for movies they've already watched. Whether that's actually a measure of a "good" recommendation is debatable, though!
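A rough sketch of what such an evaluation could look like: hide a rating, predict it from the user's remaining ratings, and measure the error. The predictor here is a similarity-weighted average, a common item-based formulation but not what this notebook computes. The `predict_rating` helper and all values are hypothetical.

```python
import numpy as np
import pandas as pd

def predict_rating(user_ratings, movie_sims):
    """Similarity-weighted average of one user's known ratings.

    user_ratings: Series of ratings indexed by title (the training ratings).
    movie_sims:   Series of similarities between each title and the target movie.
    """
    common = user_ratings.index.intersection(movie_sims.dropna().index)
    # Ignore negative similarities so the weights stay non-negative.
    weights = movie_sims[common].clip(lower=0)
    if weights.sum() == 0:
        return np.nan
    return float((user_ratings[common] * weights).sum() / weights.sum())

# Ratings the model is allowed to see for one hypothetical user.
train = pd.Series({'A': 5.0, 'B': 1.0})

# Held-out movies: (similarities to the rated movies, the hidden true rating).
held_out = [
    (pd.Series({'A': 0.9, 'B': 0.1, 'D': 0.5}), 4.0),
    (pd.Series({'A': 0.2, 'B': 0.8}), 2.0),
]

errors = [predict_rating(train, sims) - actual for sims, actual in held_out]

# Root mean squared error over the held-out ratings.
rmse = float(np.sqrt(np.mean(np.square(errors))))
print(errors, rmse)
```

In a real experiment the similarities would come from a correlation matrix built on the training split only, so the held-out ratings never leak into the model.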