In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# 0. Notebook description

In this notebook, we load the cleaned dataset and build a content-based recommender from the `Overview` column, which contains a summary of each movie's plot.

# 1. Load dataset

In [2]:
movies_df = pd.read_csv('datasets/imdb_top_1000_cleaned.csv', low_memory=False)

movies_df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Director,Star1,Star2,Star3,Star4,No_of_Votes
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142.0,Drama,9.3,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175.0,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152.0,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202.0,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96.0,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845


# 2. Prepare tf-idf model for the `Overview` column

The `Overview` column contains the plot summary of each movie, which we use to build a content-based recommendation system. Because the nearest-neighbour model needs numeric input, we transform the text into numeric vectors with a tf-idf vectorizer.

This process creates a matrix where each row represents a plot summary and each column represents a word from the corpus. The tf-idf weighting accounts for how often a word occurs across all summaries, so words that appear in many summaries receive less weight. We configure the vectorizer to drop English stop words, words that appear in more than 80% of the summaries (`max_df=0.8`), and words that appear in fewer than 2 summaries (`min_df=2`), which reduces noise in the dataset.

The resulting tf-idf matrix is the input to the recommendation system.

In [3]:
# print the Overview column of the first 10 movies
print(movies_df['Overview'].head(10))

0    Two imprisoned men bond over a number of years...
1    An organized crime dynasty's aging patriarch t...
2    When the menace known as the Joker wreaks havo...
3    The early life and career of Vito Corleone in ...
4    A jury holdout attempts to prevent a miscarria...
5    Gandalf and Aragorn lead the World of Men agai...
6    The lives of two mob hitmen, a boxer, a gangst...
7    In German-occupied Poland during World War II,...
8    A thief who steals corporate secrets through t...
9    An insomniac office worker and a devil-may-car...
Name: Overview, dtype: object


In [4]:
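# Drop English stop words, words in more than 80% of overviews, and words in fewer than 2 overviews
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=2)

# Replace missing overviews with empty strings so the vectorizer does not fail on NaN values
movies_df.Overview = movies_df.Overview.fillna('')
tfidf_model = vectorizer.fit_transform(movies_df.Overview)
print(f'Matrix contains {tfidf_model.shape[0]} movies and {tfidf_model.shape[1]} words')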

Matrix contains 999 movies and 2050 words


### Inspect the tf-idf model

Inspect the columns for terms that are common in movie overviews, such as 'love', 'young', and 'story'.

In [5]:
popular_terms = ['life', 'young', 'man', 'film', 'new', 'love', 'story', 'world']
columns = vectorizer.get_feature_names_out()
tfidf_model_df = pd.DataFrame.sparse.from_spmatrix(tfidf_model, columns=columns)
tfidf_model_df[popular_terms].head()

Unnamed: 0,life,young,man,film,new,love,story,world
0,0.0,0,0,0,0.0,0,0,0
1,0.0,0,0,0,0.0,0,0,0
2,0.0,0,0,0,0.0,0,0,0
3,0.175277,0,0,0,0.196541,0,0,0
4,0.0,0,0,0,0.0,0,0,0


# 3. Find similar movies

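To find similar movies, we use the k-nearest neighbours (KNN) algorithm with **cosine distance** (1 minus cosine similarity) as the metric, so the nearest neighbours are the movies whose overview vectors point in the most similar direction.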
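
To make the metric concrete, here is a small sketch of `NearestNeighbors` with `metric='cosine'` on hand-picked 2-D vectors. The array `toy_vectors` is hypothetical; in the notebook each row would be a tf-idf vector instead.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D vectors standing in for tf-idf rows
toy_vectors = np.array([
    [1.0, 0.0],   # row 0: the query
    [2.0, 0.0],   # row 1: same direction as row 0, twice the length
    [1.0, 1.0],   # row 2: 45 degrees away from row 0
    [0.0, 1.0],   # row 3: orthogonal to row 0
])

knn = NearestNeighbors(n_neighbors=4, metric='cosine')
knn.fit(toy_vectors)

# Query with row 0; neighbours come back sorted by increasing cosine distance
distances, indices = knn.kneighbors(toy_vectors[[0]])
print(indices[0])
print(distances[0])
```

Cosine distance ignores vector length, so rows 0 and 1 are both at distance 0 from the query and may come back in either order. This is one reason not to assume that the first neighbour is always the query itself.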

In [6]:
# Check that titles are unique, so a title lookup maps to exactly one row
movies_df['Series_Title'].is_unique

True

In [7]:
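def get_content_based_recommendation_overview(title, top_n=10, metric='cosine'):
    # Find the row whose title matches (case-insensitive) and fail clearly if there is none
    matches = movies_df[movies_df.Series_Title.str.lower() == title.lower()]
    if matches.empty:
        raise ValueError(f'No movie titled {title!r} in the dataset')
    # Convert the index label to a row position, because the tf-idf matrix is indexed by position
    idx = movies_df.index.get_loc(matches.index[0])

    # Request one extra neighbour, because the query movie is normally its own nearest neighbour
    model = NearestNeighbors(n_neighbors=top_n+1, metric=metric)
    model.fit(tfidf_model)
    similar_movies = model.kneighbors(tfidf_model[idx], return_distance=False)[0]
    # Drop the query movie by position rather than assuming it is always returned first
    similar_movies = similar_movies[similar_movies != idx][:top_n]

    # Return the top_n most similar movies, nearest first
    return movies_df.iloc[similar_movies]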

In [8]:
get_content_based_recommendation_overview('The Godfather')[['Series_Title', 'Genre', 'IMDB_Rating', 'No_of_Votes', 'Overview']]

Unnamed: 0,Series_Title,Genre,IMDB_Rating,No_of_Votes,Overview
462,Knives Out,"Comedy, Crime, Drama",7.9,454203,A detective investigates the death of a patria...
973,The Godfather: Part III,"Crime, Drama",7.6,359809,"Follows Michael Corleone, now in his 60s, as h..."
738,Nebraska,"Adventure, Comedy, Drama",7.7,112298,"An aging, booze-addled father makes the trip f..."
627,The Curious Case of Benjamin Button,"Drama, Fantasy, Romance",7.8,589160,"Tells the story of Benjamin Button, a man who ..."
3,The Godfather: Part II,"Crime, Drama",9.0,1129952,The early life and career of Vito Corleone in ...
912,Die Welle,"Drama, Thriller",7.6,102742,A high school teacher's experiment to demonstr...
183,Smultronstället,"Drama, Romance",8.2,96381,"After living a life marked by coldness, an agi..."
542,The Wild Bunch,"Action, Adventure, Western",7.9,77401,An aging group of outlaws look for one last bi...
441,The Night of the Hunter,"Crime, Drama, Film-Noir",8.0,81980,A religious fanatic marries a gullible widow w...
26,La vita è bella,"Comedy, Drama, Romance",8.6,623629,When an open-minded Jewish librarian and his s...


In [9]:
get_content_based_recommendation_overview('The Dark Knight')[['Series_Title', 'Genre', 'IMDB_Rating', 'No_of_Votes', 'Overview']]

Unnamed: 0,Series_Title,Genre,IMDB_Rating,No_of_Votes,Overview
63,The Dark Knight Rises,"Action, Adventure",8.4,1516346,Eight years after the Joker's reign of anarchy...
154,Batman Begins,"Action, Adventure",8.2,1308302,"After training with his mentor, Batman begins ..."
290,La battaglia di Algeri,"Drama, War",8.1,53089,"In the 1950s, fear and violence escalate as th..."
33,Joker,"Crime, Drama, Thriller",8.5,939252,"In Gotham City, mentally troubled comedian Art..."
240,Kill Bill: Vol. 1,"Action, Crime, Drama",8.1,1000639,"After awakening from a four-year coma, a forme..."
414,Jaws,"Adventure, Thriller",8.0,543388,When a killer shark unleashes chaos on a beach...
739,Wreck-It Ralph,"Animation, Adventure, Comedy",7.7,380195,A video game villain wants to be a hero and se...
951,The Hurricane,"Biography, Drama, Sport",7.6,91557,"The story of Rubin 'Hurricane' Carter, a boxer..."
927,300,"Action, Drama",7.6,732876,King Leonidas of Sparta and a force of 300 men...
259,Trois couleurs: Rouge,"Drama, Mystery, Romance",8.1,90729,A model discovers a retired judge is keen on i...
