<h1 align="center"><font size="5">SHOWMAX CONTENT-BASED FILTERING</font></h1>

Recommendation systems are a collection of algorithms that recommend items to users based on information gathered from those users. These systems have become ubiquitous and are commonly seen in online stores, movie databases and job finders. In this notebook, we will explore content-based recommendation systems and implement a simple version of one using Python and the pandas library.

In [325]:
# For creating and manipulating structured tabular data
import pandas as pd
import html
from IPython.display import HTML, display
from functools import lru_cache
import requests, io
OMDB_KEY = "fde11cf3"

Let's cap the number of rows displayed at any time at 20.

In [326]:
pd.set_option("display.max_rows", 20)

### Loading the raw files from GitHub

Both files are referenced as raw .csv URLs in the code cell below. If you would rather download the data directly from the source, follow this [link](https://grouplens.org/datasets/movielens/) and <br>
select the file named 'ml-latest-small.zip (size: 1 MB)'.

In [327]:
movies_data = 'https://raw.githubusercontent.com/kay102dev/showmax-movie-recommendation-system/refs/heads/main/data/movies.csv'
ratings_data = 'https://raw.githubusercontent.com/kay102dev/showmax-movie-recommendation-system/refs/heads/main/data/ratings.csv'

### Defining additional NaN values

In [328]:
missing_values = ['na','--','?','-','None','none','non']

### Reading the data to the data frame

In [329]:
movies_df = pd.read_csv(movies_data, na_values=missing_values)
ratings_df = pd.read_csv(ratings_data, na_values=missing_values)

In [330]:
print('Movies_df Shape:',movies_df.shape)
movies_df

Movies_df Shape: (9742, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [331]:
print('Ratings_df Shape:',ratings_df.shape)
ratings_df.head()

Ratings_df Shape: (100836, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### Let's first explore and prepare the movies_df

Let's remove the year from the title column and place it in its own column, using the handy [extract](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html) function of pandas, alongside python regex.

In [332]:
#Using regular expressions to find a year stored between parentheses
#We specify the parentheses so we don't conflict with movies that have years in their titles
movies_df["year"] = movies_df["title"].str.extract(r"\((\d{4})\)", expand=False)

movies_df.head(3)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
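
To see why the pattern anchors on parentheses, here is a small standalone check using Python's built-in `re` module with the same pattern. The sample titles and the `extract_year` helper are made up for illustration; they are not taken from the dataset or from pandas.

```python
import re

# Same pattern as above: exactly four digits wrapped in parentheses
YEAR_IN_PARENS = re.compile(r"\((\d{4})\)")

# Illustrative titles (assumed examples, not rows from movies.csv)
sample_titles = [
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",  # digits in the title itself are ignored
    "Untitled Project",              # no parenthesised year at all
]

def extract_year(title):
    """Return the parenthesised year as a string, or None if it is absent."""
    match = YEAR_IN_PARENS.search(title)
    return match.group(1) if match else None

years = [extract_year(t) for t in sample_titles]
print(years)
```

Titles without a parenthesised year come back empty, which is consistent with the handful of missing `year` values that show up later in the notebook.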


In [333]:
#Removing the years from the 'title' column
movies_df["title"] = movies_df["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()
movies_df.head(3)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995


In [334]:
#Applying the strip function to get rid of any ending whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995


With that, let's also split the values in the Genres column into a list of Genres to simplify future use. This can be achieved by applying Python's split string function on the correct column.

In [335]:
#Every genre is separated by a | so we simply have to call the split function on |
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


Let's view a summary of the data: its columns, data types and memory consumption.

In [336]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
 3   year     9729 non-null   object
dtypes: int64(1), object(3)
memory usage: 304.6+ KB


In [337]:
movies_df_original_mem = movies_df.memory_usage()
movies_df_original_mem

Index        128
movieId    77936
title      77936
genres     77936
year       77936
dtype: int64

In [338]:
# Let's convert the movieId column from int64 to int32 to save memory (the IDs go beyond the int16 range)
movies_df.movieId = movies_df.movieId.astype('int32')

Let's check for missing values

In [339]:
movies_df.isna().sum()

movieId     0
title       0
genres      0
year       13
dtype: int64

Let's fill the missing year values in movies_df with 0 to indicate that the year is not readily available. Only 13 rows are affected.

In [340]:
# Convert to numeric, force invalid strings to NaN
movies_df["year"] = pd.to_numeric(movies_df["year"], errors="coerce")
movies_df.fillna({'year': 0}, inplace=True)


In [341]:
# Let's now convert the year column from float64 to int16, since a four-digit year fits comfortably in 16 bits. This saves space.
movies_df.year = movies_df.year.astype('int16')
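
The coerce, fill and cast sequence above can be tried on a tiny series. This is a minimal sketch with made-up values; the nullable `Int16` variant at the end is an alternative the notebook does not use, and it assumes a reasonably recent pandas version with nullable integer dtypes.

```python
import pandas as pd

# Made-up year strings, including an invalid entry and a missing one
raw_years = pd.Series(["1995", "n/a", None, "2006"])

# Invalid strings and missing entries both become NaN (dtype float64)
numeric_years = pd.to_numeric(raw_years, errors="coerce")

# Approach used in this notebook: mark unknown years with 0, then cast
filled_years = numeric_years.fillna(0).astype("int16")

# Alternative (assumption, not used here): keep unknown years as <NA>
nullable_years = numeric_years.astype("Int16")

print(filled_years.tolist())
print(nullable_years.isna().sum())
```

Using 0 as a sentinel keeps the column a plain integer type, at the cost of having to remember that 0 means "unknown" in any later calculation on years.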

In [342]:
movies_df_new_mem = movies_df.memory_usage()

print(movies_df_original_mem)
print()
print(movies_df_new_mem)

Index        128
movieId    77936
title      77936
genres     77936
year       77936
dtype: int64

Index        128
movieId    38968
title      77936
genres     77936
year       19484
dtype: int64
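
The choice of int32 for movieId and int16 for year follows from the value ranges of the integer types. The sketch below uses NumPy's `np.iinfo` to pick the narrowest signed type for a given maximum; `smallest_signed_int` is a hypothetical helper written for this illustration, and the two maxima are the largest values visible in the outputs earlier in the notebook.

```python
import numpy as np

def smallest_signed_int(max_value):
    """Return the narrowest signed NumPy integer type that can hold max_value."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        if max_value <= np.iinfo(candidate).max:
            return candidate
    raise OverflowError(f"{max_value} does not fit in int64")

largest_movie_id = 193587  # largest movieId visible in the output earlier
largest_year = 2018        # latest year visible in the output earlier

movie_id_type = smallest_signed_int(largest_movie_id)
year_type = smallest_signed_int(largest_year)

print(movie_id_type.__name__, year_type.__name__)
```

The memory figures above agree with this: 9742 rows at 8 bytes each is 77936, at 4 bytes each is 38968, and at 2 bytes each is 19484.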


Let's see a summary of the data types again

In [343]:
movies_df.dtypes

movieId     int32
title      object
genres     object
year        int16
dtype: object

In [344]:
movies_df.head(3)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995


Now, let's one-hot encode the list of genres. This encoding is needed to use categorical data in numerical computations. In this case, we store every distinct genre in its own column containing either 1 or 0: 1 shows that a movie has that genre and 0 shows that it doesn't. Let's also store this dataframe in another variable, just in case we need the one without genre columns at some point.


In [345]:
# First let's make a copy of the movies_df
movies_with_genres = movies_df.copy(deep=True)

# Let's iterate through movies_df and, for each movie, set the column of every genre it belongs to to 1.
# Genres a movie does not belong to stay NaN for now and are filled with 0 afterwards.

x = []
for index, row in movies_df.iterrows():
    x.append(index)
    for genre in row['genres']:
        movies_with_genres.at[index, genre] = 1

# Confirm that every row has been iterated and acted upon
print(len(x) == len(movies_df))

movies_with_genres.head(3)

True


Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,Mzansi,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,,...,,,,,,,,,,
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,,1.0,,1.0,,...,,,,,,,,,,
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,,,,1.0,,1.0,...,,,,,,,,,,


In [346]:
#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
movies_with_genres = movies_with_genres.fillna(0)
movies_with_genres.head(3)

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,Mzansi,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
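
As an aside, the same 1/0 genre columns can be built without an explicit loop. This is a sketch of an alternative, not the approach this notebook uses: it joins each genre list back into a `|`-separated string and lets `Series.str.get_dummies` create the indicator columns. The three rows mirror the first movies shown above.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "genres": [
        ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
        ["Adventure", "Children", "Fantasy"],
        ["Comedy", "Romance"],
    ],
})

# Join each list into "A|B|C", then expand into one 0/1 column per genre
genre_flags = toy_movies["genres"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns to the original frame
toy_encoded = pd.concat([toy_movies, genre_flags], axis=1)
print(toy_encoded.drop(columns="genres"))
```

This variant yields integer 0/1 columns directly, so no `fillna(0)` step is needed afterwards.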


Let's look at the ratings data set now

In [347]:
# print out the shape and first five rows of ratings data.
print('Ratings_df shape:',ratings_df.shape)          
ratings_df.head()

Ratings_df shape: (100836, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [348]:
# Dropping the timestamp column
ratings_df.drop('timestamp', axis=1, inplace=True)

# Confirming the drop
ratings_df.head(3)

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0


In [349]:
# Let's confirm the right data types exist per column in ratings data_set

ratings_df.dtypes

userId       int64
movieId      int64
rating     float64
dtype: object

In [350]:
# Let's check for missing values

ratings_df.isna().sum()

userId     0
movieId    0
rating     0
dtype: int64

## Content Based recommender System

Now, let's implement a content-based (item-item) recommendation system. This technique attempts to figure out which aspects of an item a user likes most, and then recommends items that share those aspects.

Let's begin by creating an input user to recommend movies to. The user's name will be Lekgalwa, and we will assume Lekgalwa has rated the following movies with the following ratings:

Notice: To add more movies, simply add more elements to the Lekgalwa_movie_ratings list below. Be sure to capitalise each title as it appears in the dataset, and if a movie starts with "The", like "The Gods Must Be Crazy", write it like this: 'Gods Must Be Crazy, The'.

Step 1: Creating Lekgalwa's Profile

In [351]:
# On a scale of 0 to 5, with 0 the minimum and 5 the maximum, here are Lekgalwa's movie ratings
Lekgalwa_movie_ratings = [
            {'title':'Invictus', 'rating':4.9},
            {'title':'Sarafina', 'rating':4.9},
            {'title':'Mission Impossible', 'rating':4},
            {'title':'Tsotsi', 'rating':5.0},
            {'title':'Exorcist, The', 'rating':4.8},
            {'title':'Waiting to Exhale', 'rating':3.9},
            {'title':'Avengers, The', 'rating':4.5},
            {'title':'Omen, The', 'rating':5.0}
         ] 
Lekgalwa_movie_ratings = pd.DataFrame(Lekgalwa_movie_ratings)
Lekgalwa_movie_ratings

Unnamed: 0,title,rating
0,Invictus,4.9
1,Sarafina,4.9
2,Mission Impossible,4.0
3,Tsotsi,5.0
4,"Exorcist, The",4.8
5,Waiting to Exhale,3.9
6,"Avengers, The",4.5
7,"Omen, The",5.0


Add movieId to input user
With the input complete, let's extract the input movie's ID's from the movies dataframe and add them into it.

We can achieve this by first filtering out the rows that contain the input movie's title and then merging this subset with the input dataframe. We also drop unnecessary columns for the input to save memory space.

In [352]:
# Extracting movie Ids from movies_df and updating Lekgalwa_movie_ratings with movie Ids.

Lekgalwa_movie_Id = movies_df[movies_df['title'].isin(Lekgalwa_movie_ratings['title'])]

#Merging Lekgalwa's movie Ids and ratings into the Lekgalwa_movie_ratings data frame.
#The inner merge joins both data frames explicitly on the title column, so titles without a match are dropped.

Lekgalwa_movie_ratings = pd.merge(
    Lekgalwa_movie_Id, 
    Lekgalwa_movie_ratings, 
    on="title",
    how="inner"
)

#Display the merged and updated data frame.
Lekgalwa_movie_ratings

Unnamed: 0,movieId,title,genres,year,rating
0,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,3.9
1,1350,"Omen, The","[Horror, Mystery, Thriller]",1976,5.0
2,1997,"Exorcist, The","[Horror, Mystery]",1973,4.8
3,2153,"Avengers, The","[Action, Adventure]",1998,4.5
4,44204,Tsotsi,"[Crime, Drama, Thriller, Mzansi]",2005,5.0
5,45662,"Omen, The","[Horror, Thriller]",2006,5.0
6,72733,Invictus,"[Drama, Mzansi]",2009,4.9
7,89745,"Avengers, The","[Action, Adventure, Sci-Fi, IMAX]",2012,4.5
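
Note that the merged output above has no rows for 'Sarafina' or 'Mission Impossible', while 'Omen, The' and 'Avengers, The' each match two different films. A title-only inner merge hides both situations. The sketch below, on made-up miniature frames, shows one way to surface them with the `indicator=True` option of `merge`; the variable names are hypothetical.

```python
import pandas as pd

# Miniature catalogue with one title that appears twice (different years)
catalog = pd.DataFrame({
    "movieId": [4, 1350, 45662, 72733],
    "title": ["Waiting to Exhale", "Omen, The", "Omen, The", "Invictus"],
    "year": [1995, 1976, 2006, 2009],
})
user_input = pd.DataFrame({
    "title": ["Invictus", "Sarafina", "Omen, The"],
    "rating": [4.9, 4.9, 5.0],
})

# A left merge keeps every user title; _merge tells us whether it matched
check = user_input.merge(catalog, on="title", how="left", indicator=True)

# Titles the catalogue does not contain under that exact spelling
unmatched = check.loc[check["_merge"] == "left_only", "title"].tolist()

# Titles that matched more than one movieId
matched = check.loc[check["_merge"] == "both"]
match_counts = matched.groupby("title")["movieId"].nunique()
ambiguous = match_counts[match_counts > 1].index.tolist()

print("unmatched:", unmatched)
print("ambiguous:", ambiguous)
```

An ambiguous title gives its rating to every matching film, so it counts more than once in the profile; adding the year to the merge key would be one way to disambiguate.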


Let's drop the columns we do not need, such as genres and year.

In [353]:
#Dropping information we don't need such as year and genres
Lekgalwa_movie_ratings = Lekgalwa_movie_ratings.drop(['genres','year'], axis=1)
#Final input dataframe
Lekgalwa_movie_ratings

Unnamed: 0,movieId,title,rating
0,4,Waiting to Exhale,3.9
1,1350,"Omen, The",5.0
2,1997,"Exorcist, The",4.8
3,2153,"Avengers, The",4.5
4,44204,Tsotsi,5.0
5,45662,"Omen, The",5.0
6,72733,Invictus,4.9
7,89745,"Avengers, The",4.5


Step 2: Learning Lekgalwa's Profile

We're going to start by learning the input's preferences, so let's get the subset of movies that the input has watched from the Dataframe containing genres defined with binary values.

In [354]:
# Keep only the rows of movies_with_genres whose movieId appears in Lekgalwa_movie_ratings
Lekgalwa_genres_df = movies_with_genres[movies_with_genres.movieId.isin(Lekgalwa_movie_ratings.movieId)]
Lekgalwa_genres_df

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,Mzansi,(no genres listed)
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1038,1350,"Omen, The","[Horror, Mystery, Thriller]",1976,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1472,1997,"Exorcist, The","[Horror, Mystery]",1973,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1611,2153,"Avengers, The","[Action, Adventure]",1998,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6156,44204,Tsotsi,"[Crime, Drama, Thriller, Mzansi]",2005,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6216,45662,"Omen, The","[Horror, Thriller]",2006,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7206,72733,Invictus,"[Drama, Mzansi]",2009,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7693,89745,"Avengers, The","[Action, Adventure, Sci-Fi, IMAX]",2012,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


We'll only need the actual genre table, so let's clean this up a bit by resetting the index and dropping the movieId, title, genres and year columns.

In [355]:
# First, let's take an explicit copy so we modify our own frame rather than a slice of movies_with_genres.
Lekgalwa_genres_df = Lekgalwa_genres_df.copy()

# Next, let's reset the index to the default and drop the existing index.
Lekgalwa_genres_df.reset_index(drop=True, inplace=True)

# Finally, let's drop the redundant columns
Lekgalwa_genres_df.drop(['movieId','title','genres','year'], axis=1, inplace=True)


# Let's view changes

Lekgalwa_genres_df

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,...,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,Mzansi,(no genres listed)
0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


Step 3: Building Lekgalwa's Profile<br>
To do this, we're going to turn each genre into a weight by multiplying Lekgalwa's movie ratings by the Lekgalwa_genres_df table and then summing the resulting table by column. This operation is a dot product between a matrix and a vector.<br>
First, let's confirm the shapes of the data frames we have recently defined.

In [356]:
# let's confirm the shapes of our data frames to guide us as we do matrix multiplication

print('Shape of Lekgalwa_movie_ratings is:',Lekgalwa_movie_ratings.shape)
print('Shape of Lekgalwa_genres_df is:',Lekgalwa_genres_df.shape)

Shape of Lekgalwa_movie_ratings is: (8, 3)
Shape of Lekgalwa_genres_df is: (8, 21)


In [357]:
# Let's find the dot product of transpose of Lekgalwa_genres_df by Lekgalwa rating column
Lekgalwa_profile = Lekgalwa_genres_df.T.dot(Lekgalwa_movie_ratings.rating)

# Let's see the result
Lekgalwa_profile

Adventure             9.0
Animation             0.0
Children              0.0
Comedy                3.9
Fantasy               0.0
                     ... 
IMAX                  4.5
Western               0.0
Film-Noir             0.0
Mzansi                9.9
(no genres listed)    0.0
Length: 21, dtype: float64
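
To make the dot product concrete, here is a minimal sketch on a made-up 3-movie, 3-genre table. It checks that `genre_flags.T.dot(ratings)` gives the same result as multiplying each movie's row by its rating and summing per genre. All values are illustrative.

```python
import pandas as pd

# Rows are movies (0, 1, 2); columns are binary genre flags
genre_flags = pd.DataFrame({
    "Drama":    [1, 0, 1],
    "Horror":   [0, 1, 0],
    "Thriller": [0, 1, 1],
})
ratings = pd.Series([4.0, 5.0, 3.0])  # one rating per movie, same row order

# Dot product: (genres x movies) . (movies,) -> one weight per genre
profile_dot = genre_flags.T.dot(ratings)

# Same thing spelled out: scale each movie's row by its rating, then sum columns
profile_manual = genre_flags.mul(ratings, axis=0).sum()

print(profile_dot)
```

Here Drama receives 4.0 + 3.0 = 7.0, Horror 5.0 and Thriller 5.0 + 3.0 = 8.0. The result is only meaningful if the rows of the genre table and the ratings are in the same order, since the two are aligned by index.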

Just by eyeballing his profile, it is clear that Lekgalwa leans most towards 'Thriller', 'Horror' and 'Drama' movies, with weights of 15.0, 14.8 and 13.8 once the ratings in the merged table above are summed per genre.<br>
Now, we have the weights for all his preferences. This is known as the User Profile. We can now recommend movies that satisfy Lekgalwa.<br>
Let's start by editing the original movies_with_genres data frame that contains all movies and their genres columns.

Step 4: Deploying The Content-Based Recommender System.

In [358]:
# let's set the index to the movieId
movies_with_genres = movies_with_genres.set_index(movies_with_genres.movieId)

# let's view the head
movies_with_genres.head()

movieId,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,Mzansi,(no genres listed)
1,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's delete irrelevant columns from the movies_with_genres data frame that contains all 9742 movies and distinctive columns of genres.

In [359]:
# Deleting four unnecessary columns.
movies_with_genres.drop(['movieId','title','genres','year'], axis=1, inplace=True)

# Viewing changes.
movies_with_genres.head()

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,...,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,Mzansi,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With Lekgalwa's profile and the complete list of movies and their genres in hand, we're going to take the weighted average of every movie based on his profile and recommend the top twenty movies that match his preference.

In [360]:
# Multiply the genres by the weights and then take the weighted average.
recommendation_table_df = (movies_with_genres.dot(Lekgalwa_profile)) / Lekgalwa_profile.sum()

# Let's view the recommendation table
recommendation_table_df.head()

movieId
1    0.125121
2    0.087294
3    0.075655
4    0.209505
5    0.037827
dtype: float64
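
The scoring step can be illustrated on a toy example. This is a sketch with a hypothetical three-genre profile and two made-up movies; it applies the same formula as above, the sum of the profile weights for a movie's genres divided by the sum of all profile weights.

```python
import pandas as pd

# Hypothetical profile (genre -> weight); the numbers are illustrative only
profile = pd.Series({"Comedy": 2.0, "Drama": 3.0, "Thriller": 5.0})

# Two made-up movies with binary genre flags, indexed by a made-up movieId
movies = pd.DataFrame(
    {"Comedy": [1, 0], "Drama": [0, 1], "Thriller": [0, 1]},
    index=pd.Index([101, 102], name="movieId"),
)

# Movie 101: 2.0 / 10.0; movie 102: (3.0 + 5.0) / 10.0
scores = movies.dot(profile) / profile.sum()
print(scores)
```

The same arithmetic can be read off the real output: movieId 5 is tagged only as Comedy, whose weight is 3.9, so its score of 0.037827 implies that the profile weights sum to about 103.1. A movie carrying every genre with a non-zero weight would score 1.0, so movies tagged with many genres tend to rank higher under this scheme.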

Let's sort the recommendation table in descending order

In [361]:
# Let's sort values from largest to smallest
recommendation_table_df.sort_values(ascending=False, inplace=True)

#Just a peek at the values
recommendation_table_df.head(20)

movieId
81132     0.778855
43932     0.648885
5433      0.605238
36509     0.602328
49530     0.598448
79132     0.597478
7235      0.596508
22        0.566440
30894     0.561591
174053    0.561591
26887     0.561591
4210      0.558681
27317     0.555771
8830      0.553831
74685     0.553831
6395      0.553831
54771     0.553831
26701     0.553831
198       0.553831
26614     0.548982
dtype: float64

Now here's the recommendation table! Complete with movie details and genres for the top 20 movies that match Lekgalwa's profile.

In [370]:
# first we make a copy of the original movies_df
copy = movies_df.copy(deep=True)

# Then we set its index to movieId
copy = copy.set_index('movieId', drop=True)

# Next we enlist the top 20 recommended movieIds we defined above
top_20_index = recommendation_table_df.index[:20].tolist()

# finally we slice these indices from the copied movies df and save in a variable
recommended_movies = copy.loc[top_20_index, :]

# Now we can display the top 20 movies in descending order of preference
recommended_movies

movieId,title,genres,year
81132,Rubber,"[Action, Adventure, Comedy, Crime, Drama, Film...",2010
43932,Pulse,"[Action, Drama, Fantasy, Horror, Mystery, Sci-...",2006
5433,Silver Bullet (Stephen King's Silver Bullet),"[Adventure, Drama, Horror, Mystery, Thriller]",1985
36509,"Cave, The","[Action, Adventure, Horror, Mystery, Sci-Fi, T...",2005
49530,Blood Diamond,"[Action, Adventure, Crime, Drama, Thriller, Wa...",2006
79132,Inception,"[Action, Crime, Drama, Mystery, Sci-Fi, Thrill...",2010
7235,Ichi the Killer (Koroshiya 1),"[Action, Comedy, Crime, Drama, Horror, Thriller]",2001
22,Copycat,"[Crime, Drama, Horror, Mystery, Thriller]",1995
30894,White Noise,"[Drama, Horror, Mystery, Sci-Fi, Thriller]",2005
174053,Black Mirror: White Christmas,"[Drama, Horror, Mystery, Sci-Fi, Thriller]",2014
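
The sort-then-slice lookup can also be written with `Series.nlargest`, keeping each score next to its title. This is a minimal sketch on made-up data, not the notebook's own code; the ids, titles and scores are hypothetical.

```python
import pandas as pd

# Made-up scores indexed by movieId
scores = pd.Series({101: 0.2, 102: 0.8, 103: 0.5}, name="score")
scores.index.name = "movieId"

# Made-up catalogue indexed by movieId
titles = pd.DataFrame(
    {"movieId": [101, 102, 103], "title": ["A", "B", "C"]}
).set_index("movieId")

# Take the k best scores (already in descending order), then look up the titles
top = scores.nlargest(2)
top_movies = titles.loc[top.index].assign(score=top)
print(top_movies)
```

Carrying the score along makes ties visible, which matters in the real output, where several movies share exactly the same score.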


In [372]:
import re, unicodedata

_ARTICLE_TAIL = re.compile(r"\s*,\s*(the|an|a)\s*$", re.I)

def normalize_title_for_api(title: str, drop_year: bool = True) -> str:
    if title is None or (isinstance(title, float) and pd.isna(title)):
        return title
    s = unicodedata.normalize("NFKC", str(title)).strip().strip('"\'')

    # Remove parentheses that are NOT a pure (YYYY)
    s = re.sub(r"\s*\((?!\d{4}\))[^)]*\)", "", s)

    # Optionally remove a trailing (YYYY)
    if drop_year:
        s = re.sub(r"\s*\(\d{4}\)\s*$", "", s)

    # Move trailing ", The/An/A" to the front
    m = _ARTICLE_TAIL.search(s)
    if m:
        art = m.group(1).title()
        base = _ARTICLE_TAIL.sub("", s)
        s = f"{art} {base}"

    # Tidy spaces
    s = re.sub(r"\s+", " ", s).strip()
    return s

# Apply the normaliser to each title (element-wise via apply, not a truly vectorized operation)
recommended_movies["title_api"] = recommended_movies["title"].apply(normalize_title_for_api)
recommended_movies

movieId,title,genres,year,title_api
81132,Rubber,"[Action, Adventure, Comedy, Crime, Drama, Film...",2010,Rubber
43932,Pulse,"[Action, Drama, Fantasy, Horror, Mystery, Sci-...",2006,Pulse
5433,Silver Bullet (Stephen King's Silver Bullet),"[Adventure, Drama, Horror, Mystery, Thriller]",1985,Silver Bullet
36509,"Cave, The","[Action, Adventure, Horror, Mystery, Sci-Fi, T...",2005,The Cave
49530,Blood Diamond,"[Action, Adventure, Crime, Drama, Thriller, Wa...",2006,Blood Diamond
79132,Inception,"[Action, Crime, Drama, Mystery, Sci-Fi, Thrill...",2010,Inception
7235,Ichi the Killer (Koroshiya 1),"[Action, Comedy, Crime, Drama, Horror, Thriller]",2001,Ichi the Killer
22,Copycat,"[Crime, Drama, Horror, Mystery, Thriller]",1995,Copycat
30894,White Noise,"[Drama, Horror, Mystery, Sci-Fi, Thriller]",2005,White Noise
174053,Black Mirror: White Christmas,"[Drama, Horror, Mystery, Sci-Fi, Thriller]",2014,Black Mirror: White Christmas


In [373]:
@lru_cache(maxsize=4096)
def omdb_poster_url_cached(title, year=None):
    params = {"t": title, "apikey": OMDB_KEY}
    if year is not None and pd.notna(year):
        try:
            params["y"] = int(year)
        except Exception:
            pass
    try:
        r = requests.get("https://www.omdbapi.com/", params=params, timeout=10)
        data = r.json()
        if data.get("Response") == "True" and data.get("Poster") not in (None, "N/A"):
            return data["Poster"]
    except Exception:
        pass
    return None

def render_posters_row(df, title_col="title_api", year_col="year", n=20, card_w=140, card_h=210, label="Recommended"):
    items = []
    for _, row in df.head(n).iterrows():
        title = str(row[title_col])
        year  = row.get(year_col)
        url   = omdb_poster_url_cached(title, year)
        items.append((title, year, url))

    if not items:
        display(HTML("<b>No items to display.</b>"))
        return

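    # Local import: query strings should be URL-encoded rather than HTML-escaped
    from urllib.parse import quote_plus

    cards_html = []
    for t, y, u in items:
        tt = html.escape(t)

        if y is None or (isinstance(y, float) and pd.isna(y)):
            yy = ""
        else:
            try:
                yy = str(int(y))
            except Exception:
                yy = str(y)

        # Build the search URL from the raw title; quote_plus encodes spaces, '&', '#' and quotes
        q  = quote_plus(f"{t} {yy} imdb")
        link = f"https://www.google.com/search?q={q}"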

        if u:
            img_html = f'<img src="{u}" alt="{tt} poster" loading="lazy">'
        else:
            # Placeholder box instead of skipping
            img_html = f"""
            <div class="noimg" aria-label="No image available for {tt}">
              <div class="noimg-icon">🎬</div>
              <div class="noimg-text">No image</div>
            </div>
            """

        cards_html.append(f"""
        <a class="card" href="{link}" target="_blank" rel="noopener">
          <div class="imgwrap">
            {img_html}
          </div>
          <div class="meta">
            <div class="t" title="{tt}">{tt}</div>
            <div class="y">{yy}</div>
          </div>
        </a>
        """)

    html_code = f"""
    <style>
    .rail-wrap {{
        margin: 8px 0 18px 0;
        font-family: -apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Ubuntu,"Helvetica Neue",Arial,sans-serif;
        color: #f1f1f1;
    }}
    .rail-title {{
        font-weight: 700; font-size: 16px; margin: 0 0 8px 2px;
    }}
    .rail {{
        display: block;
        overflow-x: auto;
        overflow-y: hidden;
        white-space: nowrap;
        padding-bottom: 6px;
        scroll-snap-type: x mandatory;
        -webkit-overflow-scrolling: touch;
        background: linear-gradient(180deg,#0b0b0b, #0f0f10);
        border-radius: 12px;
        padding: 12px;
        box-shadow: 0 6px 20px rgba(0,0,0,.25) inset;
    }}
    .rail::-webkit-scrollbar {{ height: 8px; }}
    .rail::-webkit-scrollbar-thumb {{ background: rgba(255,255,255,.2); border-radius: 10px; }}
    .card {{
        display: inline-block;
        width: {card_w}px;
        margin-right: 12px;
        color: inherit; text-decoration: none;
        scroll-snap-align: start;
    }}
    .imgwrap {{
        width: {card_w}px; height: {card_h}px;
        border-radius: 12px; overflow: hidden;
        background:#1a1a1a;
        box-shadow: 0 4px 14px rgba(0,0,0,.35);
        transition: transform .15s ease, box-shadow .15s ease;
        display:flex; align-items:center; justify-content:center;
    }}
    .card:hover .imgwrap {{
        transform: translateY(-2px);
        box-shadow: 0 10px 24px rgba(0,0,0,.45);
    }}
    .imgwrap img {{
        width: 100%; height: 100%;
        object-fit: cover; display: block;
    }}
    .noimg {{
        width: 100%; height: 100%;
        background: repeating-linear-gradient(45deg, #222, #222 10px, #1A1A1A 10px, #1A1A1A 20px);
        display:flex; flex-direction:column;
        align-items:center; justify-content:center;
        gap: 6px; color:#cfcfcf;
    }}
    .noimg-icon {{ font-size: 22px; line-height:1; opacity:.9; }}
    .noimg-text {{ font-size: 11px; opacity:.85; letter-spacing:.2px; }}
    .meta {{
        display:flex; justify-content:space-between; align-items:baseline;
        padding: 6px 2px 0 2px; gap: 8px;
    }}
    .t {{
        font-size: 12px; font-weight: 600; line-height:1.2;
        white-space: nowrap; overflow: hidden; text-overflow: ellipsis;
        max-width: {card_w-36}px;
    }}
    .y {{ font-size: 11px; opacity: .7; }}
    </style>
    <div class="rail-wrap">
      <div class="rail-title">{html.escape(label)}</div>
      <div class="rail">
        {''.join(cards_html)}
      </div>
    </div>
    """
    display(HTML(html_code))

render_posters_row(recommended_movies, title_col="title_api", year_col="year",
                   n=20, card_w=140, card_h=210, label="Recommended Showmax Movies for you")
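
When building links like the search URL in each card, URL-encoding and HTML-escaping serve different purposes: text placed inside a query string is URL-encoded, while text shown inside HTML is HTML-escaped. The sketch below uses only the standard library; the title is an illustrative example, not a row from the recommendations.

```python
import html
from urllib.parse import quote_plus

title = "Fast & Furious"  # illustrative title containing a reserved character
year = 2009

# For the query string: spaces become '+', '&' becomes '%26'
query = quote_plus(f"{title} {year} imdb")
search_url = f"https://www.google.com/search?q={query}"

# For visible text or attribute values in HTML: '&' becomes '&amp;'
label = html.escape(title)

print(search_url)
print(label)
```

Without URL-encoding, a raw `&` in the title would be read as the start of a new query parameter and the search would be cut short.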


Kill the notebook kernel and free up memory in Colab

In [366]:
#import os, signal
#os.kill(os.getpid(), signal.SIGKILL)