# Finding Similar Movies


We'll start by loading up the MovieLens dataset. Using Pandas, we can very quickly load the rows of the u.data and u.item files that we care about, and merge them together so we can work with movie names instead of IDs. (In a real production job, you'd stick with IDs and worry about the names at the display layer to make things more efficient. But this lets us understand what's going on better for now.)


In [71]:
# Importing the pandas library, which is a powerful tool for data manipulation and analysis
import pandas as pd

# Column names for the 'u.data' file
r_cols = ["user_id", "movie_id", "rating"]

# Reading the 'u.data' file
# - 'sep' specifies the delimiter, which is a tab character in this case ('\t')
# - 'names' assigns names to the columns in the dataframe
# - 'usecols' indicates that only the first three columns will be used
# - 'encoding' specifies the character encoding of the file
ratings = pd.read_csv(
    "ml-100k/u.data", sep="\t", names=r_cols, usecols=range(3), encoding="ISO-8859-1"
)

# Column names for the 'u.item' file
m_cols = ["movie_id", "title"]

# Reading the 'u.item' file
# - This file contains movie information
# - Similar to 'u.data', we specify the delimiter, column names, columns to use, and encoding
movies = pd.read_csv(
    "ml-100k/u.item", sep="|", names=m_cols, usecols=range(2), encoding="ISO-8859-1"
)

# Merging the 'movies' dataframe with the 'ratings' dataframe
# - This is done on the 'movie_id' column, which is common to both dataframes
# - The result is a combined dataframe where each movie rating is now associated with its movie title
ratings = pd.merge(movies, ratings)

In [72]:
ratings.head()

,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3
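
The parenthetical note about keeping IDs in production can be illustrated with a minimal sketch. It assumes a tiny hand-made stand-in for u.data and u.item (the `toy_*` names and values are hypothetical, not part of the dataset): the numeric work is keyed on `movie_id`, and titles are attached only at the very end for display.

```python
import pandas as pd

# Hypothetical toy stand-ins for u.data and u.item (not the real dataset)
toy_ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3],
        "movie_id": [50, 172, 50, 172, 50],
        "rating": [5, 4, 4, 5, 3],
    }
)
toy_movies = pd.DataFrame(
    {
        "movie_id": [50, 172],
        "title": ["Star Wars (1977)", "Empire Strikes Back, The (1980)"],
    }
)

# Do the numeric work keyed on movie_id only
mean_by_id = toy_ratings.groupby("movie_id")["rating"].mean()

# Attach titles at the display layer, once the computation is done
id_to_title = toy_movies.set_index("movie_id")["title"].to_dict()
display_table = mean_by_id.rename(index=id_to_title).rename_axis("title")
print(display_table)
```

Integer IDs are cheaper to compare and group than long strings, and they stay unique even if two movies share a title.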


Now the DataFrame's pivot_table function will construct a user / movie rating matrix. Note how NaN indicates missing data: movies that a specific user didn't rate.


In [73]:
movieRatings = ratings.pivot_table(
    index=["user_id"], columns=["title"], values="rating"
)
movieRatings.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,,,,,,,,,,,,,,,,,,,,,
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
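
To see what pivot_table is doing on something small enough to read at a glance, here is a sketch on a hypothetical five-row ratings table (the `toy` data is made up for illustration). Each (user, title) pair that has no rating becomes NaN in the resulting matrix.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, title) rating
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3, 3],
        "title": ["A", "B", "A", "B", "C"],
        "rating": [5, 3, 4, 2, 1],
    }
)

# Rows become users, columns become titles, cells hold the rating
toy_matrix = toy.pivot_table(index="user_id", columns="title", values="rating")
print(toy_matrix)

# Count the (user, title) pairs that have no rating
missing = int(toy_matrix.isna().sum().sum())
print("missing cells:", missing)
```

Note that pivot_table aggregates with the mean by default, so if a user had rated the same title twice, the cell would hold the average of those ratings.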


Let's extract a Series of every user's rating for Star Wars (NaN where a user didn't rate it):


In [74]:
# 'movieRatings' is assumed to be a DataFrame where each column is a movie title,
# and each row represents a user's rating for that movie.

# Extracting the ratings for "Star Wars (1977)" from the 'movieRatings' DataFrame.
# This creates a pandas Series where the index is user_ids and the values are their ratings for "Star Wars (1977)".
starWarsRatings = movieRatings["Star Wars (1977)"]

# Displaying the first 5 ratings for "Star Wars (1977)".
# This helps to quickly inspect the rating values given by the first few users.
starWarsRatings.head()

user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64

Pandas' corrwith function makes it really easy to compute the pairwise correlation of Star Wars' vector of user ratings with every other movie! After that, we'll drop any results that have no data, and construct a new DataFrame of movies and their correlation score (similarity) to Star Wars:


In [75]:
# Calculating the pairwise correlation of "Star Wars (1977)" ratings with every other movie in the 'movieRatings' DataFrame.
# The 'corrwith' method computes the correlation between the 'starWarsRatings' series and each column (movie) in 'movieRatings'.
# The resulting Series ('similarMovies') has movie titles as the index and correlation coefficients as values.
similarMovies = movieRatings.corrwith(starWarsRatings)

# Dropping any NaN values from the 'similarMovies' Series.
# NaN values occur when the correlation is undefined: for example, when too few users rated both
# "Star Wars (1977)" and the compared movie, or when the overlapping ratings have zero variance.
similarMovies = similarMovies.dropna()

# Converting the 'similarMovies' Series into a DataFrame for easier manipulation and potential future analysis.
# This DataFrame ('df') has one column representing the correlation of each movie with "Star Wars (1977)".
df = pd.DataFrame(similarMovies)

# Displaying the first 5 entries of the DataFrame.
# head() returns rows in index (alphabetical title) order, not ranked by correlation,
# so these are simply the first few movies with a defined correlation to "Star Wars (1977)".
df.head()

  c /= stddev[:, None]
  c /= stddev[None, :]
  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)
  c *= np.true_divide(1, fact)


title,0
'Til There Was You (1997),0.872872
1-900 (1994),-0.645497
101 Dalmatians (1996),0.211132
12 Angry Men (1957),0.184289
187 (1997),0.027398


(That warning is safe to ignore.) Let's sort the results by similarity score, and we should have the movies most similar to Star Wars! Except... we don't. These results make no sense at all! This is why it's important to know your data - clearly we missed something important.


In [76]:
similarMovies.sort_values(ascending=False)

title
Hollow Reed (1996)            1.0
Commandments (1997)           1.0
Cosi (1996)                   1.0
No Escape (1994)              1.0
Stripes (1981)                1.0
                             ... 
For Ever Mozart (1996)       -1.0
Frankie Starlight (1995)     -1.0
I Like It Like That (1994)   -1.0
American Dream (1990)        -1.0
Theodore Rex (1995)          -1.0
Length: 1410, dtype: float64

Our results are probably getting messed up by movies that were rated by only a handful of people who also happened to rate Star Wars. With so few shared ratings, a correlation of 1.0 or -1.0 can arise by chance, so we need to get rid of movies with too few ratings, which are producing spurious results. Let's construct a new DataFrame that counts up how many ratings exist for each movie, and also the average rating while we're at it, since that could come in handy later.
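
Before building those statistics, a small sketch shows why tiny overlaps are a problem. It assumes two hypothetical rating vectors (`popular` and `obscure`, made-up values) that share only two raters: any two points whose values differ lie on a straight line, so the correlation comes out as a perfect 1.0 or -1.0. The last line uses the `min_periods` argument of Series.corr as one alternative guard; the notebook itself filters on the total rating count instead.

```python
import numpy as np
import pandas as pd

users = pd.Index(range(1, 7), name="user_id")

# Hypothetical ratings: 'popular' is rated by everyone, 'obscure' by two users
popular = pd.Series([5.0, 4.0, 5.0, 3.0, 4.0, 2.0], index=users)
obscure = pd.Series([np.nan, 4.0, np.nan, 1.0, np.nan, np.nan], index=users)

# How many users rated both movies?
overlap = int((popular.notna() & obscure.notna()).sum())

# With only two shared raters the correlation is perfect by construction
spurious = popular.corr(obscure)

# Requiring at least three shared raters yields NaN instead
guarded = popular.corr(obscure, min_periods=3)

print(overlap, spurious, guarded)
```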


In [77]:
# 'ratings' is assumed to be a DataFrame with at least two columns: 'title' and 'rating'.
# 'title' contains movie titles, and 'rating' contains different user ratings for these movies.

# Grouping 'ratings' by 'title' and aggregating the 'rating' column for each movie.
# - "size" counts the ratings each movie received (effectively, the number of users who rated it).
# - "mean" calculates the average rating for each movie.
# Together these give a sense of both the popularity and the average reception of each movie.
# The aggregations are passed as strings rather than np.size / np.mean, which avoids the
# warning that recent pandas versions emit for NumPy callables in agg().
movieStats = ratings.groupby("title").agg({"rating": ["size", "mean"]})

# Displaying the first 50 rows of the resulting DataFrame.
# Each row represents a movie, with the number of ratings it received and its average rating.
movieStats.head(50)


,rating,rating
title,size,mean
'Til There Was You (1997),9,2.333333
1-900 (1994),5,2.6
101 Dalmatians (1996),109,2.908257
12 Angry Men (1957),125,4.344
187 (1997),41,3.02439
2 Days in the Valley (1996),93,3.225806
"20,000 Leagues Under the Sea (1954)",72,3.5
2001: A Space Odyssey (1968),259,3.969112
3 Ninjas: High Noon At Mega Mountain (1998),5,1.0
"39 Steps, The (1935)",59,4.050847


Let's get rid of any movies rated by fewer than 100 people, and check the top-rated ones that are left:


In [78]:
# Creating a boolean mask to identify popular movies.
# 'movieStats["rating"]["size"] >= 100' checks each movie to see if it has received 100 or more ratings.
# Movies with 100 or more ratings are considered 'popular'.
popularMovies = movieStats["rating"]["size"] >= 100

# Applying the 'popularMovies' filter to the 'movieStats' DataFrame.
# This step filters out movies that have fewer than 100 ratings, focusing only on popular movies.
# Then, it sorts these popular movies by their average rating in descending order (highest ratings first).
# The expression '[("rating", "mean")]' specifies that the sorting is to be done based on the average rating.
sortedPopularMovies = movieStats[popularMovies].sort_values(
    [("rating", "mean")], ascending=False
)

# Displaying the top 15 movies.
# This step shows the first 15 entries of the sorted DataFrame, which are the 15 most highly rated popular movies.
top15PopularMovies = sortedPopularMovies[:15]
top15PopularMovies

,rating,rating
title,size,mean
"Close Shave, A (1995)",112,4.491071
Schindler's List (1993),298,4.466443
"Wrong Trousers, The (1993)",118,4.466102
Casablanca (1942),243,4.45679
"Shawshank Redemption, The (1994)",283,4.44523
Rear Window (1954),209,4.38756
"Usual Suspects, The (1995)",267,4.385768
Star Wars (1977),584,4.359589
12 Angry Men (1957),125,4.344
Citizen Kane (1941),198,4.292929


100 might still be too low, but these results look pretty good as far as "well rated movies that people have heard of." Let's join this data with our original set of similar movies to Star Wars:


In [79]:
# Flatten the multi-level column index in 'movieStats'
print(movieStats.columns)
movieStats.columns = ["_".join(col).strip() for col in movieStats.columns.values]
print(movieStats.columns)

MultiIndex([('rating', 'size'),
            ('rating', 'mean')],
           )
Index(['rating_size', 'rating_mean'], dtype='object')
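
As an aside, pandas' named aggregation can produce the same flat column names directly, so no MultiIndex needs flattening afterwards. This is a sketch on hypothetical toy data, not a change to the notebook's `movieStats`:

```python
import pandas as pd

# Hypothetical ratings for two titles
toy_ratings = pd.DataFrame(
    {
        "title": ["A", "A", "A", "B", "B"],
        "rating": [5, 4, 3, 2, 4],
    }
)

# Each keyword becomes an output column: name=(input column, aggregation)
toy_stats = toy_ratings.groupby("title").agg(
    rating_size=("rating", "size"),
    rating_mean=("rating", "mean"),
)
print(toy_stats)
```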


In [80]:
# 'movieStats[popularMovies]' filters the 'movieStats' DataFrame to include only those movies
# which have 100 or more ratings. This was determined by the 'popularMovies' boolean mask.

# 'pd.DataFrame(similarMovies, columns=["similarity"])' converts the 'similarMovies' Series into a DataFrame.
# The series 'similarMovies' contains correlation coefficients that indicate how similarly each movie
# was rated compared to "Star Wars (1977)". By converting it into a DataFrame and naming the column "similarity",
# it becomes easier to join with the 'movieStats' DataFrame.

# The 'join' method is used to combine these two DataFrames.
# The 'movieStats' DataFrame (filtered for popular movies) is joined with the 'similarMovies' DataFrame.
# This join is performed on the index of both DataFrames, which, in this case, should be the movie titles.
# The result is a DataFrame ('df') that has the statistical data of popular movies along with their
# similarity score to "Star Wars (1977)".
df = movieStats[popularMovies].join(pd.DataFrame(similarMovies, columns=["similarity"]))

In [81]:
df.head()

title,rating_size,rating_mean,similarity
101 Dalmatians (1996),109,2.908257,0.211132
12 Angry Men (1957),125,4.344,0.184289
2001: A Space Odyssey (1968),259,3.969112,0.230884
Absolute Power (1997),127,3.370079,0.08544
"Abyss, The (1989)",151,3.589404,0.203709


Now sort these new results by similarity score. That's more like it!


In [82]:
df.sort_values(["similarity"], ascending=False)[:15]

title,rating_size,rating_mean,similarity
Star Wars (1977),584,4.359589,1.0
"Empire Strikes Back, The (1980)",368,4.206522,0.748353
Return of the Jedi (1983),507,4.00789,0.672556
Raiders of the Lost Ark (1981),420,4.252381,0.536117
Austin Powers: International Man of Mystery (1997),130,3.246154,0.377433
"Sting, The (1973)",241,4.058091,0.367538
Indiana Jones and the Last Crusade (1989),331,3.930514,0.350107
Pinocchio (1940),101,3.673267,0.347868
"Frighteners, The (1996)",115,3.234783,0.332729
L.A. Confidential (1997),297,4.161616,0.319065


Ideally we'd also filter out the movie we started from - of course Star Wars is 100% similar to itself. But otherwise these results aren't bad.
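
One way to filter out the starting movie is to drop its row by index label before sorting. The sketch below rebuilds the first three rows of the table above by hand (`toy_df` is a stand-in for the notebook's `df`) so that it runs on its own:

```python
import pandas as pd

# First three rows of the results table above, rebuilt by hand
toy_df = pd.DataFrame(
    {
        "rating_size": [584, 368, 507],
        "similarity": [1.0, 0.748353, 0.672556],
    },
    index=pd.Index(
        [
            "Star Wars (1977)",
            "Empire Strikes Back, The (1980)",
            "Return of the Jedi (1983)",
        ],
        name="title",
    ),
)

# Drop the movie we started from, then rank what remains
without_self = toy_df.drop(index="Star Wars (1977)", errors="ignore")
ranked = without_self.sort_values("similarity", ascending=False)
print(ranked)
```

Passing `errors="ignore"` keeps the call from failing if the title has already been filtered out, for example by a stricter rating-count cutoff.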


## Activity


100 was an arbitrarily chosen cutoff. Try different values. What effect do they have on the end results?
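
To make the experiment easier to repeat, the whole pipeline can be wrapped in a function that takes the cutoff as a parameter. This is a sketch only: `similar_to` and its `min_ratings` argument are hypothetical names, and it is shown on made-up toy data so that it runs without the MovieLens files. With the real data you would pass the merged `ratings` DataFrame and "Star Wars (1977)" instead.

```python
import pandas as pd


def similar_to(ratings, target, min_ratings):
    """Rank titles by correlation with `target`, keeping only titles
    that received at least `min_ratings` ratings."""
    matrix = ratings.pivot_table(index="user_id", columns="title", values="rating")
    similarity = matrix.corrwith(matrix[target]).dropna().rename("similarity")
    counts = ratings.groupby("title")["rating"].size().rename("rating_size")
    result = pd.concat([counts, similarity], axis=1, join="inner")
    result = result[result["rating_size"] >= min_ratings]
    result = result.drop(index=target, errors="ignore")
    return result.sort_values("similarity", ascending=False)


# Hypothetical toy ratings: "Niche" has only two raters, "Broad" has five
rows = [
    (1, "Target", 5), (2, "Target", 4), (3, "Target", 1),
    (4, "Target", 2), (5, "Target", 5), (6, "Target", 3),
    (1, "Niche", 4), (3, "Niche", 2),
    (1, "Broad", 4), (2, "Broad", 5), (3, "Broad", 2),
    (4, "Broad", 1), (5, "Broad", 4),
]
toy = pd.DataFrame(rows, columns=["user_id", "title", "rating"])

loose = similar_to(toy, "Target", min_ratings=1)
strict = similar_to(toy, "Target", min_ratings=3)
print(loose)
print(strict)
```

On the toy data, the loose cutoff lets the two-rater "Niche" title take first place with a perfect score, while the stricter cutoff leaves only "Broad".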
