# Movie Recommendation Engine

First of all, thanks to [kaggle/ROUNAKBANIK](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) for the dataset.\
I've chosen a qualitative approach for this project and decided to go with the User-Based Collaborative Filtering method. It recommends movies by computing similarity scores from users' ratings.\
\
Importing the dataset:

In [1]:
import pandas as pd

# Used ratings_small.csv instead of ratings.csv for higher performance and less memory usage
ratings = pd.read_csv("data/ratings_small.csv")
# To achieve more accurate final result, use ratings.csv

movies = pd.read_csv("data/movies_metadata.csv", low_memory=False)

In [2]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [3]:
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


Seems like we have a lot of unnecessary columns. Let's remove them:

In [4]:
ratings.drop(columns="timestamp", inplace=True)

# .copy() makes movies an independent frame, so the in-place rename below is safe
movies = movies[["title", "id"]].copy()

Correcting the mismatch between the movie ID column names and merging the data.

In [5]:
movies.rename(columns={"id": "movieId"}, inplace=True)

# Ensuring that all columns have correct data types:
movies["title"] = movies["title"].astype(str)
movies["movieId"] = movies["movieId"].astype(str)
ratings["movieId"] = ratings["movieId"].astype(str)
ratings["userId"] = ratings["userId"].astype(str)
# Note: casting to int truncates half-star ratings (e.g. 2.5 becomes 2)
ratings["rating"] = ratings["rating"].astype(int)

ratedMovies = pd.merge(movies, ratings, on="movieId", how="outer")

ratedMovies.head()

Unnamed: 0,title,movieId,userId,rating
0,Toy Story,862,,
1,Jumanji,8844,,
2,Grumpier Old Men,15602,,
3,Waiting to Exhale,31357,,
4,Father of the Bride Part II,11862,,
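
A possible cause worth ruling out before dropping rows: as far as I can tell, `id` in `movies_metadata.csv` is a TMDB identifier, whereas `movieId` in the ratings files is a MovieLens identifier, and the dataset's `links_small.csv` (columns `movieId`, `imdbId`, `tmdbId`) maps one to the other. If that holds, merging on the raw IDs pairs ratings with the wrong titles. The sketch below uses small made-up frames rather than the real files, and the `rated` name is only illustrative.

```python
import pandas as pd

# Toy stand-ins for movies_metadata.csv, links_small.csv and ratings_small.csv.
# All rows are illustrative; the column names are assumed to match the real files.
movies = pd.DataFrame({"id": ["862", "949"], "title": ["Toy Story", "Heat"]})
links = pd.DataFrame({"movieId": [1, 6], "tmdbId": [862.0, 949.0]})
ratings = pd.DataFrame(
    {"userId": [7, 7, 9], "movieId": [1, 6, 1], "rating": [4.0, 3.5, 5.0]}
)

# Bring tmdbId into the same string form as movies["id"]
links = links.dropna(subset=["tmdbId"]).assign(
    tmdbId=lambda d: d["tmdbId"].astype(int).astype(str)
)

# ratings -> links (MovieLens id) -> movies (TMDB id)
rated = (
    ratings.merge(links, on="movieId", how="inner")
    .merge(movies, left_on="tmdbId", right_on="id", how="inner")
    [["title", "movieId", "userId", "rating"]]
)
print(rated)
```

With a mapping like this, each rating row is matched to a title through the translation table instead of through a coincidental ID collision.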


It appears that some movies were not rated by any user in the current ratings file.\
Removing movies without a rating:

In [6]:
ratedMovies = ratedMovies.dropna()

ratedMovies.head()

Unnamed: 0,title,movieId,userId,rating
5,Heat,949,23,3.0
6,Heat,949,102,4.0
7,Heat,949,232,2.0
8,Heat,949,242,5.0
9,Heat,949,263,3.0


Creating a pivot table to construct the user/movie rating matrix.

In [7]:
movieRatings = ratedMovies.pivot_table(index=["userId"], columns=["title"], values="rating")
movieRatings.head()

title,!Women Art Revolution,'Gator Bait,'Twas the Night Before Christmas,...And God Created Woman,00 Schneider - Jagd auf Nihil Baxter,10 Items or Less,10 Things I Hate About You,"10,000 BC",11'09''01 - September 11,12 Angry Men,...,Zodiac,Zombie Flesh Eaters,Zombie Holocaust,Zozo,eXistenZ,xXx,¡Three Amigos!,À nos amours,Ödipussi,Şaban Oğlu Şaban
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
100,,,,,,,,,,,...,,,,,,,,,,
101,,,,,,,,,,,...,,,,,,,,,,
102,,,,,,,,,,,...,,,,,4.0,,,4.0,,
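
To make the shape of this matrix easier to see, here is a minimal sketch of the same `pivot_table` call on a tiny made-up frame (the `toy` and `matrix` names and all values are illustrative). Each user becomes a row, each title a column, and unrated combinations are left as `NaN`.

```python
import pandas as pd

# Illustrative long-format ratings: one row per (user, movie) pair
toy = pd.DataFrame(
    {
        "userId": ["1", "1", "2", "2", "3"],
        "title": ["Heat", "Cars", "Heat", "Zodiac", "Cars"],
        "rating": [3, 5, 4, 2, 4],
    }
)

# Rows are users, columns are titles, cells hold the (mean) rating
matrix = toy.pivot_table(index="userId", columns="title", values="rating")
print(matrix)
```

Note that `pivot_table` averages duplicate (user, title) pairs by default, which is harmless here since each pair occurs once.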


For testing purposes, we will get recommendations for the movie "10,000 BC".

In [8]:
tenKBC = movieRatings["10,000 BC"]
tenKBC.head()

userId
1     NaN
10    NaN
100   NaN
101   NaN
102   NaN
Name: 10,000 BC, dtype: float64

Computing the pairwise correlation between the user ratings of the chosen movie and those of every other movie.

In [9]:
similarMovies = movieRatings.corrwith(tenKBC)
similarMovies = similarMovies.dropna()
df = pd.DataFrame(similarMovies)
df.head(10)


title,0
"10,000 BC",1.0
"20,000 Leagues Under the Sea",1.0
Aelita: Queen of Mars,1.0
Arlington Road,-1.0
Bad Boys II,-1.0
Beverly Hills Cop III,-1.0
Blood: The Last Vampire,-0.5
Bonnie and Clyde,-1.0
Broken Blossoms,-0.5
Cars,1.0
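
The many values of exactly 1.0 and -1.0 deserve a closer look. The sketch below, on a made-up matrix with hypothetical column names, shows how `corrwith` behaves: each column is correlated with the target using only the users who rated both movies, so two shared raters always produce a perfect positive or negative correlation, and a single shared rater produces `NaN`.

```python
import numpy as np
import pandas as pd

# Illustrative user x movie matrix; NaN means "not rated"
matrix = pd.DataFrame(
    {
        "Target": [5.0, 3.0, 1.0, np.nan],
        "Agrees": [4.0, 3.0, 2.0, 5.0],            # three raters shared with Target
        "TwoRaters": [np.nan, 2.0, 4.0, 1.0],      # only two raters shared with Target
        "OneRater": [np.nan, np.nan, 3.0, np.nan], # a single shared rater
    },
    index=["u1", "u2", "u3", "u4"],
)

# Pearson correlation of every column with the Target column
scores = matrix.corrwith(matrix["Target"])
print(scores)
```

The single-rater column may also trigger NumPy runtime warnings, since a correlation cannot be computed from one point.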


Sorting results:

In [10]:
similarMovies.sort_values(ascending=False)

title
The Talented Mr. Ripley                                               1.000000
Superstar: The Karen Carpenter Story                                  1.000000
Young and Innocent                                                    1.000000
Dogtown and Z-Boys                                                    1.000000
Meet Me in St. Louis                                                  1.000000
Madagascar                                                            1.000000
Once in a Lifetime: The Extraordinary Story of the New York Cosmos    1.000000
Pleasantville                                                         1.000000
Rocky III                                                             1.000000
Hannibal Rising                                                       1.000000
Gleaming the Cube                                                     1.000000
The Bow                                                               1.000000
Edward Scissorhands                           

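The results look unreliable: many movies show a perfect correlation, most likely because they share only a handful of raters with the chosen movie. To address this, construct a new DataFrame that counts how many ratings exist for each movie, along with the average rating.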

In [11]:
# String aggregator names give the same "size" and "mean" columns and avoid the
# warning pandas emits when NumPy callables are passed to agg
movieStats = ratedMovies.groupby("title").agg({"rating": ["size", "mean"]})
movieStats.head()


,rating,rating
title,size,mean
!Women Art Revolution,2,3.0
'Gator Bait,1,0.0
'Twas the Night Before Christmas,2,3.5
...And God Created Woman,1,4.0
00 Schneider - Jagd auf Nihil Baxter,2,3.5


For better accuracy, we should get rid of movies rated by fewer than 200 people.

In [12]:
# Had to play around with the sizing condition to achieve better end results
popularMovies = movieStats["rating"]["size"] >= 200
movieStats[popularMovies].sort_values([("rating", "mean")], ascending=False)

,rating,rating
title,size,mean
The Million Dollar Hotel,311,4.405145
Sleepless in Seattle,200,4.395
Once Were Warriors,244,4.209016
Men in Black II,224,4.191964
Terminator 3: Rise of the Machines,324,4.157407
The 39 Steps,291,4.134021
Solaris,305,4.032787
5 Card Stud,200,4.03
License to Wed,202,4.029703
Dawn of the Dead,208,3.927885


Results are far better.\
Now we must join the data with the original set of similar movies.

In [13]:
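# Copy the filtered rows before renaming their columns
mappedColumnsMoviestat = movieStats[popularMovies].copy()
# Flatten the two-level column index into "rating|size" and "rating|mean"
mappedColumnsMoviestat.columns = [f"{i}|{j}" if j != "" else f"{i}" for i, j in mappedColumnsMoviestat.columns]
# Attach the correlation scores as a "similarity" column (left join on title)
df = mappedColumnsMoviestat.join(similarMovies.rename("similarity"))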

In [14]:
df.head()

title,rating|size,rating|mean,similarity
48 Hrs.,200,3.815,
5 Card Stud,200,4.03,
Batman Returns,200,3.675,
Dawn of the Dead,208,3.927885,0.866025
License to Wed,202,4.029703,
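
The column flattening in cell 13 could be skipped by using named aggregation, which produces flat column names directly. This is only a sketch on made-up data; the names `n_ratings`, `mean_rating` and `min_ratings` are my own and do not appear in the notebook.

```python
import pandas as pd

# Illustrative ratings in long format
rated = pd.DataFrame(
    {
        "title": ["Heat", "Heat", "Heat", "Cars", "Cars", "Zodiac"],
        "rating": [3, 4, 5, 2, 4, 5],
    }
)

# Named aggregation: keyword = output column, value = aggregator name
stats = rated.groupby("title")["rating"].agg(n_ratings="size", mean_rating="mean")

# Keep only movies with enough ratings (threshold is arbitrary here)
min_ratings = 2
popular = stats[stats["n_ratings"] >= min_ratings].sort_values(
    "mean_rating", ascending=False
)
print(popular)
```

The resulting frame can be joined with the similarity scores in the same way as before, without a MultiIndex in the columns.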


In [15]:
df.sort_values(["similarity"], ascending=False)

title,rating|size,rating|mean,similarity
The Passion of Joan of Arc,218,3.366972,1.0
Dawn of the Dead,208,3.927885,0.866025
Sleepless in Seattle,200,4.395,0.866025
Men in Black II,224,4.191964,-1.0
48 Hrs.,200,3.815,
5 Card Stud,200,4.03,
Batman Returns,200,3.675,
License to Wed,202,4.029703,
Monsoon Wedding,274,3.609489,
Once Were Warriors,244,4.209016,


For the final touch, let's divide the rating|mean column by the similarity column to obtain a recommendation coefficient, and keep only the rows where it is positive.

In [16]:
df["coefficient"] = df["rating|mean"].div(df["similarity"], axis=0)

df = df.query("coefficient > 0")
df.sort_values(["coefficient"], ascending=False)

title,rating|size,rating|mean,similarity,coefficient
Sleepless in Seattle,200,4.395,0.866025,5.074909
Dawn of the Dead,208,3.927885,0.866025,4.53553
The Passion of Joan of Arc,218,3.366972,1.0,3.366972


Now we can clearly see which movies should be recommended to a user who liked "10,000 BC".
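
One caveat about the coefficient: dividing by similarity makes the score shrink as similarity grows, so of two movies with the same average rating, the less similar one ranks higher. A possible alternative, which is my own assumption and not part of the method used above, is to multiply the two columns instead. The sketch below recomputes the score for the three rows of the final table; the `weighted` column name is hypothetical.

```python
import pandas as pd

# Values copied from the final result table
df = pd.DataFrame(
    {
        "rating|mean": [4.395, 3.927885, 3.366972],
        "similarity": [0.866025, 0.866025, 1.0],
    },
    index=pd.Index(
        ["Sleepless in Seattle", "Dawn of the Dead", "The Passion of Joan of Arc"],
        name="title",
    ),
)

# Weight the average rating by similarity so that both push the score up
df["weighted"] = df["rating|mean"] * df["similarity"]
ranked = df.sort_values("weighted", ascending=False)
print(ranked)
```

For these three movies the order happens to stay the same, but the gap between them narrows, and the two approaches can diverge on larger result sets.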