# Content Based Filtering

### Acquiring the data
Dataset from https://grouplens.org/datasets/movielens/

In [1]:
!wget -O moviedataset.zip https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%205/data/moviedataset.zip
print('unziping ...')
!unzip -o -j moviedataset.zip

--2025-08-12 22:30:53--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%205/data/moviedataset.zip
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 160301210 (153M) [application/zip]
Saving to: ‘moviedataset.zip’


2025-08-12 22:30:57 (48.1 MB/s) - ‘moviedataset.zip’ saved [160301210/160301210]

unziping ...
Archive:  moviedataset.zip
  inflating: links.csv               
  inflating: movies.csv              
  inflating: ratings.csv             
  inflating: README.txt              
  inflating: tags.csv                


### Preprocessing

In [2]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [43]:
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Remove the year from the title column and store it in a new year column.

In [44]:
movies_df['year'] = movies_df['title'].str.extract(r'\((\d{4})\)', expand=False)
movies_df['title'] = movies_df['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
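
As a quick check of the two regular expressions used above, here is a minimal sketch on a hand-made frame. The `toy` DataFrame and its rows are illustrative, not taken from the dataset. It also shows one edge case: a title without a four-digit year in parentheses ends up with `NaN` in `year`.

```python
import pandas as pd

# Illustrative rows only; the last title has no "(YYYY)" suffix
toy = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Untitled Film']})

# Capture the four digits inside the parentheses; no match yields NaN
toy['year'] = toy['title'].str.extract(r'\((\d{4})\)', expand=False)

# Remove the "(YYYY)" part, then trim the leftover whitespace
toy['title'] = toy['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

print(toy)
```

Note that `year` is stored as text, not as a number, which matches the `object` dtype reported by `info()` further down.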


In [45]:
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


Keeping genres as a list isn't convenient for a content-based recommender, so we use one-hot encoding to convert each list into a vector in which every column corresponds to one possible genre. Categorical data has to be encoded numerically like this before it can be used in the calculations that follow. Each genre column holds 1 if the movie has that genre and 0 if it doesn't. Let's also store this dataframe in another variable, since genres won't be important for our first recommendation system.

First case: won't contain the genre column

In [46]:
moviesWithGenres_df = movies_df.copy()

for index, row in movies_df.iterrows():
  for genre in row['genres']:
    moviesWithGenres_df.at[index, genre] = 1
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
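
The row-by-row loop above works, but it can be slow on tens of thousands of movies. A possible vectorized alternative is sketched below on a small hand-made frame (the `toy` data and variable names are assumptions for illustration). It joins each genre list back into a `|`-separated string and lets `Series.str.get_dummies` build the indicator columns.

```python
import pandas as pd

# Illustrative frame with genres already split into lists
toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
    'genres': [['Adventure', 'Animation'], ['Adventure'], ['Comedy', 'Romance']],
})

# Join each list into a '|'-separated string, then one-hot encode it
dummies = toy['genres'].str.join('|').str.get_dummies(sep='|')

# Attach the indicator columns to a copy of the original frame
toy_with_genres = pd.concat([toy, dummies], axis=1)
print(toy_with_genres)
```

Two differences from the loop are worth knowing: the genre columns come out as integers in alphabetical order rather than floats in order of first appearance, and no `fillna(0)` is needed, so missing values in other columns such as `year` are left untouched.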


In [47]:
moviesWithGenres_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34208 entries, 0 to 34207
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movieId             34208 non-null  int64  
 1   title               34208 non-null  object 
 2   genres              34208 non-null  object 
 3   year                34208 non-null  object 
 4   Adventure           34208 non-null  float64
 5   Animation           34208 non-null  float64
 6   Children            34208 non-null  float64
 7   Comedy              34208 non-null  float64
 8   Fantasy             34208 non-null  float64
 9   Romance             34208 non-null  float64
 10  Drama               34208 non-null  float64
 11  Action              34208 non-null  float64
 12  Crime               34208 non-null  float64
 13  Thriller            34208 non-null  float64
 14  Horror              34208 non-null  float64
 15  Mystery             34208 non-null  float64
 16  Sci-

In [48]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


In [49]:
ratings_df.drop('timestamp', axis=1, inplace=True)
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


### Content-Based recommendation system

In [50]:
userInput = [
    {'title': 'Back to the Future', 'rating': 5},
    {'title': 'Toy Story', 'rating': 5},
    {'title': 'Jumanji', 'rating': 2},
    {'title': 'Beetlejuice', 'rating': 1},
    {'title': 'Pride & Prejudice', 'rating': 4.5},
    {'title': 'Interstellar', 'rating': 5},
    {'title': 'Matrix, The', 'rating': 4.5},
    {'title': 'Godfather, The', 'rating': 2},
    {'title': 'Forrest Gump', 'rating': 3},
    {'title': 'Dark Knight, The', 'rating': 5},
    {'title': 'Titanic', 'rating': 3.5},
    {'title': 'Gladiator', 'rating': 4},
    {'title': 'Inception', 'rating': 5},
    {'title': 'Lion King, The', 'rating': 4},
    {'title': 'Fight Club', 'rating': 4.5},
    {'title': 'Shawshank Redemption, The', 'rating': 4},
    {'title': 'Avengers, The', 'rating': 4.5},
    {'title': 'Jurassic Park', 'rating': 3.5},
    {'title': 'Silence of the Lambs, The', 'rating': 4},
    {'title': 'Saving Private Ryan', 'rating': 4.5},
    {'title': 'Departed, The', 'rating': 1}
]

inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,Back to the Future,5.0
1,Toy Story,5.0
2,Jumanji,2.0
3,Beetlejuice,1.0
4,Pride & Prejudice,4.5
5,Interstellar,5.0
6,"Matrix, The",4.5
7,"Godfather, The",2.0
8,Forrest Gump,3.0
9,"Dark Knight, The",5.0


In [52]:
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

In [53]:
inputId

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
315,318,"Shawshank Redemption, The","[Crime, Drama]",1994
352,356,Forrest Gump,"[Comedy, Drama, Romance, War]",1994
360,364,"Lion King, The","[Adventure, Animation, Children, Drama, Musica...",1994
476,480,Jurassic Park,"[Action, Adventure, Sci-Fi, Thriller]",1993
587,593,"Silence of the Lambs, The","[Crime, Horror, Thriller]",1991
843,858,"Godfather, The","[Crime, Drama]",1972
1242,1270,Back to the Future,"[Adventure, Comedy, Sci-Fi]",1985
1661,1721,Titanic,"[Drama, Romance]",1997


In [54]:
inputMovies = pd.merge(inputId, inputMovies, on='title')


In [55]:
inputMovies = inputMovies.drop(['genres', 'year'], axis=1)

In [56]:
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,5.0
1,2,Jumanji,2.0
2,318,"Shawshank Redemption, The",4.0
3,356,Forrest Gump,3.0
4,364,"Lion King, The",4.0
5,480,Jurassic Park,3.5
6,593,"Silence of the Lambs, The",4.0
7,858,"Godfather, The",2.0
8,1270,Back to the Future,5.0
9,1721,Titanic,3.5
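
One caveat with matching on `title` alone: if the catalogue contains several films with the same title (remakes, for instance), a single rating is attached to all of them. The sketch below uses a hypothetical two-film catalogue to show the effect and one possible fix, which assumes the user input also carries a `year` field (it does not in the cells above).

```python
import pandas as pd

# Hypothetical catalogue in which two films share a title
catalogue = pd.DataFrame({
    'movieId': [10, 11, 12],
    'title': ['Titanic', 'Titanic', 'Toy Story'],
    'year': ['1953', '1997', '1995'],
})
ratings = pd.DataFrame({'title': ['Titanic', 'Toy Story'], 'rating': [3.5, 5.0]})

# Title-only merge: the single 'Titanic' rating lands on both films
by_title = pd.merge(catalogue, ratings, on='title')
print(len(by_title))

# Assumed extra field: adding the year to the input removes the ambiguity
ratings_with_year = ratings.assign(year=['1997', '1995'])
by_title_year = pd.merge(catalogue, ratings_with_year, on=['title', 'year'])
print(by_title_year['movieId'].tolist())
```

Checking `inputMovies['title'].duplicated().any()` after the merge is a cheap way to see whether this happened on the real data.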


#### Extracting the genres of the input movies

In [57]:
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
315,318,"Shawshank Redemption, The","[Crime, Drama]",1994,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
352,356,Forrest Gump,"[Comedy, Drama, Romance, War]",1994,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
360,364,"Lion King, The","[Adventure, Animation, Children, Drama, Musica...",1994,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
476,480,Jurassic Park,"[Action, Adventure, Sci-Fi, Thriller]",1993,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
587,593,"Silence of the Lambs, The","[Crime, Horror, Thriller]",1991,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
843,858,"Godfather, The","[Crime, Drama]",1972,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1242,1270,Back to the Future,"[Adventure, Comedy, Sci-Fi]",1985,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1661,1721,Titanic,"[Drama, Romance]",1997,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
userMovies = userMovies.reset_index(drop=True)
userGenreTable = userMovies.drop(['movieId','title','genres','year'], axis=1)
userGenreTable

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Learning the input's preferences

Discover how much the user likes each genre.   
Each genre is turned into a weight by taking the dot product of the transposed user genre table and the user's ratings.    
Multiply: (n_genres × n_movies) × (n_movies × 1) = (n_genres × 1)

In [59]:
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
userProfile

Unnamed: 0,0
Adventure,32.5
Animation,9.0
Children,11.0
Comedy,14.0
Fantasy,8.0
Romance,14.5
Drama,56.0
Action,51.0
Crime,25.5
Thriller,22.5


#### Recommending the movies

In [61]:
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])
genreTable.drop(['movieId','title','genres','year'], axis=1, inplace=True)
genreTable.head()

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
genreTable.shape

(34208, 20)

In [63]:
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df.head()

movieId,0
1,0.236133
2,0.163233
3,0.090333
4,0.267829
5,0.044374


In [64]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head()

movieId,0
115479,0.70523
71999,0.671949
79132,0.66878
81132,0.667195
459,0.640254


In [65]:
#The final recommendation table (isin() keeps movies_df order, so rows are not ranked by score)
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
455,459,"Getaway, The","[Action, Adventure, Crime, Drama, Romance, Thr...",1994
4625,4719,Osmosis Jones,"[Action, Animation, Comedy, Crime, Drama, Roma...",2001
4861,4956,"Stunt Man, The","[Action, Adventure, Comedy, Drama, Romance, Th...",1980
4923,5018,Motorama,"[Adventure, Comedy, Crime, Drama, Fantasy, Mys...",1991
9000,26701,Patlabor: The Movie (Kidô keisatsu patorebâ: T...,"[Action, Animation, Crime, Drama, Film-Noir, M...",1989
11494,49530,Blood Diamond,"[Action, Adventure, Crime, Drama, Thriller, War]",2006
11497,49593,She,"[Action, Adventure, Drama, Fantasy, Horror, Ro...",1965
12720,59844,"Honor Among Thieves (Adieu l'ami) (Farewell, F...","[Action, Adventure, Crime, Drama, Mystery, Thr...",1968
13250,64645,The Wrecking Crew,"[Action, Adventure, Comedy, Crime, Drama, Thri...",1968
13552,67070,Army of One (Joshua Tree),"[Action, Adventure, Crime, Drama, Mystery, Thr...",1993
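
The table above lists the top-scoring movies in catalogue order, so the best match is not necessarily on the first row. A minimal sketch of one way to keep the ranking, on made-up data (the `movies` and `scores` objects here are illustrative stand-ins for `movies_df` and `recommendationTable_df`):

```python
import pandas as pd

# Illustrative stand-ins for the catalogue and the score Series
movies = pd.DataFrame({'movieId': [1, 2, 3, 4], 'title': ['A', 'B', 'C', 'D']})
scores = pd.Series([0.2, 0.9, 0.5, 0.7], index=pd.Index([1, 2, 3, 4], name='movieId'))

top = scores.sort_values(ascending=False).head(3)

# isin() filters in catalogue order, so the ranking is lost
unranked = movies.loc[movies['movieId'].isin(top.index)]
print(unranked['movieId'].tolist())

# Selecting by movieId in the order of `top` keeps the ranking and the score
ranked = movies.set_index('movieId').loc[top.index].assign(score=top)
print(ranked)
```

Here `unranked` comes back as movies 2, 3, 4, while `ranked` lists them as 2, 4, 3 with their scores alongside.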


# Final Reflection

In this project, I learned how to build a simple content-based recommendation system. It takes the user's ratings and movie titles as input, and through a straightforward dot product with the movies' genres, the algorithm generates recommendations. A user profile was created by assigning a weight to each genre. These weights reminded me somewhat of neural network weights because of how weights influence neuron inputs. Since the course notebook uses some older pandas conventions, I had to adapt the code to work with newer versions.

Overall, the concepts were not too difficult to understand; however, I needed a bit more time to work through them thoroughly. This project serves as a good starting point to understand how these concepts work in practice.

I'm curious to explore how we can improve this by incorporating other variables, such as ratings from other users, profiles of similar users, and time spent watching specific movies.