In [10]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# 0. Notebook description

In this notebook, we load our cleaned dataset and build a content-based recommender from the `Genre` column, which lists one or more genres for each movie.

# 1. Load dataset

In [11]:
movies_df = pd.read_csv('datasets/imdb_top_1000_cleaned.csv', low_memory=False)

movies_df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142.0,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175.0,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152.0,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202.0,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96.0,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0


# 2. Prepare tf-idf model for the `Genre` column

The `Genre` column contains the genre(s) of each movie, which we use to build a content-based recommendation system. We transform the genres into numeric vectors using the tf-idf vectorizer.

Unlike a binary (one-hot) encoding, tf-idf accounts for how often each genre appears across the dataset and assigns a higher weight to less common genres. This lets the system distinguish movies by their rarer genres rather than by ubiquitous ones such as Drama. Each row in the resulting matrix represents a movie, and each column corresponds to a genre token, with the tf-idf score indicating how strongly the movie is associated with it. Note that the default tokenizer splits on punctuation and whitespace, so a compound genre can occupy more than one column (the vocabulary printed in the inspection step contains separate `film` and `noir` entries).

This genre-based tf-idf matrix is the input to the recommender, which suggests movies with similar genre profiles.

In [12]:
# print the Genre column of the first 10 movies
print(movies_df['Genre'].head(10))

0                        Drama
1                 Crime, Drama
2         Action, Crime, Drama
3                 Crime, Drama
4                 Crime, Drama
5     Action, Adventure, Drama
6                 Crime, Drama
7    Biography, Drama, History
8     Action, Adventure, SciFi
9                        Drama
Name: Genre, dtype: object


In [13]:
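# max_df=0.8 drops tokens found in more than 80% of movies;
# min_df=2 drops tokens found in only one movie
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=2)

# Treat missing genres as empty strings so every row can be vectorized
movies_df.Genre = movies_df.Genre.fillna('')
tfidf_model = vectorizer.fit_transform(movies_df.Genre)
print(f'Matrix contains {tfidf_model.shape[0]} movies and {tfidf_model.shape[1]} words')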

Matrix contains 1000 movies and 22 words


## Inspect the tf-idf model

Inspect the columns for a few common genres, such as 'drama', 'fantasy', 'adventure', 'comedy', 'biography' and 'crime'.

In [14]:
columns = vectorizer.get_feature_names_out()
print(columns)

['action' 'adventure' 'animation' 'biography' 'comedy' 'crime' 'drama'
 'family' 'fantasy' 'film' 'history' 'horror' 'music' 'musical' 'mystery'
 'noir' 'romance' 'scifi' 'sport' 'thriller' 'war' 'western']


In [15]:
genres = ['drama', 'fantasy', 'adventure', 'comedy', 'biography', 'crime']
columns = vectorizer.get_feature_names_out()
tfidf_model_df = pd.DataFrame.sparse.from_spmatrix(tfidf_model, columns=columns)
tfidf_model_df[genres].head()

Unnamed: 0,drama,fantasy,adventure,comedy,biography,crime
0,1.0,0,0,0,0,0.0
1,0.458764,0,0,0,0,0.888558
2,0.337068,0,0,0,0,0.652851
3,0.458764,0,0,0,0,0.888558
4,0.458764,0,0,0,0,0.888558


# 3. Find similar movies

To find similar movies, we run a k-nearest-neighbours (KNN) search with **cosine distance** (1 minus cosine similarity) as the metric.
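As a small worked example of the metric, the sketch below computes the cosine distance by hand for the first two rows of the tf-idf table in section 2, keeping only the `drama` and `crime` columns (all other entries are zero for these two movies). The helper `cosine_distance` and the variable names are illustrative.

```python
import numpy as np

# (drama, crime) tf-idf values copied from the table in section 2
shawshank = np.array([1.0, 0.0])            # Genre: Drama
godfather = np.array([0.458764, 0.888558])  # Genre: Crime, Drama

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - similarity

print(cosine_distance(shawshank, godfather))
print(cosine_distance(godfather, godfather))
```

Because the vectorizer L2-normalises each row by default, the cosine similarity of two rows equals their dot product, and two movies with identical genre lists are at distance 0.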

In [16]:
def get_content_based_recommendation_genre(title, top_n=10, metric='cosine'):
    # Find the movie whose title matches (case-insensitive); its index
    # locates the corresponding row in the tf-idf matrix
    matches = movies_df.index[movies_df.Series_Title.str.lower() == title.lower()]
    if len(matches) == 0:
        raise ValueError(f'No movie titled {title!r} in the dataset')
    idx = matches[0]

    # Ask for one extra neighbour because the query movie is its own nearest neighbour
    model = NearestNeighbors(n_neighbors=top_n+1, metric=metric)
    model.fit(tfidf_model)
    similar_movies = model.kneighbors(tfidf_model[idx], return_distance=False)[0]
    # Drop the query movie itself; with tied distances it is not guaranteed to come first
    similar_movies = [i for i in similar_movies if i != idx][:top_n]

    # Return the top_n most similar movies
    return movies_df.iloc[similar_movies]
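The function above refits the neighbour index on every call. A possible variant, sketched here on a hypothetical five-movie frame, fits the index once and also returns the distances, which makes ties between identical genre lists visible. All names (`toy_df`, `toy_knn`, `recommend_from_fitted_index`) are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data standing in for movies_df
toy_df = pd.DataFrame({
    "Series_Title": ["A", "B", "C", "D", "E"],
    "Genre": ["Crime, Drama", "Crime, Drama", "Action, Crime, Drama",
              "Comedy", "Comedy, Romance"],
})
toy_tfidf = TfidfVectorizer().fit_transform(toy_df["Genre"])

# Fit the index once; brute-force search supports cosine distance on sparse input
toy_knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_tfidf)

def recommend_from_fitted_index(title, top_n=2):
    """Return the top_n nearest movies to `title`, with their cosine distances."""
    matches = toy_df.index[toy_df["Series_Title"].str.lower() == title.lower()]
    if len(matches) == 0:
        raise ValueError(f"No movie titled {title!r} in the dataset")
    pos = toy_df.index.get_loc(matches[0])
    # One extra neighbour, capped at the number of rows, to account for the query itself
    n_neighbors = min(top_n + 1, toy_tfidf.shape[0])
    distances, positions = toy_knn.kneighbors(toy_tfidf[pos], n_neighbors=n_neighbors)
    result = toy_df.iloc[positions[0]].assign(distance=distances[0])
    # Remove the query movie explicitly instead of assuming it is listed first
    return result[result.index != matches[0]].head(top_n)

print(recommend_from_fitted_index("A"))
```

Movies with exactly the same genre list all sit at distance 0, so their relative order in the result is arbitrary; a secondary sort key such as `IMDB_Rating` could be used to break such ties.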

In [17]:
get_content_based_recommendation_genre('The Godfather')[['Series_Title', 'Genre', 'IMDB_Rating', 'No_of_Votes']]

Unnamed: 0,Series_Title,Genre,IMDB_Rating,No_of_Votes
3,The Godfather: Part II,"Crime, Drama",9.0,1129952
6,Pulp Fiction,"Crime, Drama",8.9,1826188
4,12 Angry Men,"Crime, Drama",9.0,689845
71,Once Upon a Time in America,"Crime, Drama",8.4,311365
974,The Godfather: Part III,"Crime, Drama",7.6,359809
165,Casino,"Crime, Drama",8.2,466276
669,Boyz n the Hood,"Crime, Drama",7.8,126082
299,Les quatre cents coups,"Crime, Drama",8.1,105291
639,Lilja 4-ever,"Crime, Drama",7.8,42673
708,À bout de souffle,"Crime, Drama",7.8,73251


In [18]:
get_content_based_recommendation_genre('The Dark Knight')[['Series_Title', 'Genre', 'IMDB_Rating', 'No_of_Votes']]

Unnamed: 0,Series_Title,Genre,IMDB_Rating,No_of_Votes
42,Léon,"Action, Crime, Drama",8.5,1035236
918,Eastern Promises,"Action, Crime, Drama",7.6,227760
931,Lord of War,"Action, Crime, Drama",7.6,294140
577,Udta Punjab,"Action, Crime, Drama",7.8,27175
605,Ajeossi,"Action, Crime, Drama",7.8,62848
893,Sicario,"Action, Crime, Drama",7.6,371291
602,Ang-ma-reul bo-at-da,"Action, Crime, Drama",7.8,111252
896,Hell or High Water,"Action, Crime, Drama",7.6,204175
901,End of Watch,"Action, Crime, Drama",7.6,228132
888,Baby Driver,"Action, Crime, Drama",7.6,439406
