In [56]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# 0. Notebook description

In this notebook, we load our cleaned dataset and perform a content-based recommendation based on the `Genre` column, which contains one or more movie genres.

# 1. Load dataset

In [57]:
movies_df = pd.read_csv('datasets/imdb_top_1000_cleaned.csv', low_memory=False)

movies_df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


# 2. Prepare tf-idf model for the `Genre` column

**!!!!!!!Müssen wir noch umschreiben!!!!!!!!**

The `Genre` column contains the **genre(s)** of a movie which we will use to build our content-based recommendation system.

In order to perform machine learning on the plot summaries, we need to **transform** them into vector representations such that we can apply numeric machine learning to them. This process is called **feature extraction** or in this case, simply, vectorization, and is an essential first step toward language-aware analysis. Every plot summary will be transformed from a sequence of words to a point in a **high-dimensional semantic space**. The simplest encoding of semantic space is the **bag-of-words (BOW)** model, another is the tf-idf model.

Both models are in fact tables where each row represents a plot summary and each column represents a word and are **count-based**. They count the number of times a word appears in a document and use that as a proxy for the importance of the word in that document. The difference between the two is that the BOW model simply counts the number of times a word appears in a document, while the tf-idf model also takes into account how often the word appears in all documents. The BOW model is a **sparse** representation, meaning that most of the entries in the vector are zero. The tf-idf model is more useful than the BOW model because it **downweights** words that appear frequently in a corpus and are therefore less informative than those that appear rarely.

 So, to sum up, every plot summary (or document) will be encoded as a single vector whose length is equal to the size of the vocabulary of all the plot summaries (the so-called corpus) and whose entries are some sort of counts of the words in that summary. This is because most words in the vocabulary do not appear in a given plot summary.

We'll use the tf-idf vectorizer to extract features from the plot summaries. The tf-idf vectorizer will transform the plot summaries into a matrix of tf-idf features. The tf-idf vectorizer will **ignore** words that occur in **more than 80%** of the movies and **ignore** words that occur in **less than 2 movies**. This will help us **reduce the noise** in the dataset.

In [58]:
# print the Genre column of the first 10 movies

print(movies_df['Genre'].head(10))

0                        Drama
1                 Crime, Drama
2         Action, Crime, Drama
3                 Crime, Drama
4                 Crime, Drama
5     Action, Adventure, Drama
6                 Crime, Drama
7    Biography, Drama, History
8     Action, Adventure, SciFi
9                        Drama
Name: Genre, dtype: object


In [59]:
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=2)

movies_df.Genre = movies_df.Genre.fillna('')
tfidf_model = vectorizer.fit_transform(movies_df.Genre)
print(f'Matrix contains {tfidf_model.shape[0]} movies and {tfidf_model.shape[1]} words')

Matrix contains 1000 movies and 22 words


### Inspect the tf-idf model

Inspect the columns with popular movie genres like 'drama', 'fantasy', 'adventure', 'comedy', 'biography', 'crime' etc.

In [60]:
columns = vectorizer.get_feature_names_out()
print(columns)

['action' 'adventure' 'animation' 'biography' 'comedy' 'crime' 'drama'
 'family' 'fantasy' 'film' 'history' 'horror' 'music' 'musical' 'mystery'
 'noir' 'romance' 'scifi' 'sport' 'thriller' 'war' 'western']


In [61]:
genres = ['drama', 'fantasy', 'adventure', 'comedy', 'biography', 'crime']
columns = vectorizer.get_feature_names_out()
tfidf_model_df = pd.DataFrame.sparse.from_spmatrix(tfidf_model, columns=columns)
tfidf_model_df[genres].head()

Unnamed: 0,drama,fantasy,adventure,comedy,biography,crime
0,1.0,0,0,0,0,0.0
1,0.458764,0,0,0,0,0.888558
2,0.337068,0,0,0,0,0.652851
3,0.458764,0,0,0,0,0.888558
4,0.458764,0,0,0,0,0.888558


# 3. Find similar movies

To find similar movies, we use the KNN algorithm with **cosine similarity** as a distance metric to find the nearest neighbours.

In [64]:
def get_content_based_recommendation_genre(title, top_n=10, metric='cosine'):
    # Get the index of the movie that matches the title
    # we'll use that index to locate the row in the tf-idf matrix that corresponds to that movie
    idx = movies_df[movies_df.Series_Title.str.lower() == title.lower()].index[0]

    model = NearestNeighbors(n_neighbors=top_n+1, metric=metric)
    model.fit(tfidf_model)
    similar_movies = model.kneighbors(tfidf_model[idx], return_distance=False)[0]
    similar_movies = similar_movies[1:]  # remove the first item which is the movie itself

    # Return the top 10 most similar movies
    return movies_df.iloc[similar_movies]

In [68]:
get_content_based_recommendation_genre('The Godfather')[['Series_Title', 'Genre', 'IMDB_Rating', 'No_of_Votes', 'Genre']]

Unnamed: 0,Series_Title,Genre,IMDB_Rating,No_of_Votes,Genre.1
108,Scarface,"Crime, Drama",8.3,740911,"Crime, Drama"
708,À bout de souffle,"Crime, Drama",7.8,73251,"Crime, Drama"
1,The Godfather,"Crime, Drama",9.2,1620367,"Crime, Drama"
22,Cidade de Deus,"Crime, Drama",8.6,699256,"Crime, Drama"
111,Taxi Driver,"Crime, Drama",8.3,724636,"Crime, Drama"
763,This Is England,"Crime, Drama",7.7,115576,"Crime, Drama"
974,The Godfather: Part III,"Crime, Drama",7.6,359809,"Crime, Drama"
71,Once Upon a Time in America,"Crime, Drama",8.4,311365,"Crime, Drama"
397,Bound by Honor,"Crime, Drama",8.0,28825,"Crime, Drama"
895,Leviafan,"Crime, Drama",7.6,49397,"Crime, Drama"


In [67]:
get_content_based_recommendation_genre('The Dark Knight')[['Series_Title', 'Genre', 'IMDB_Rating', 'No_of_Votes', 'Genre']]

Unnamed: 0,Series_Title,Genre,IMDB_Rating,No_of_Votes,Genre.1
602,Ang-ma-reul bo-at-da,"Action, Crime, Drama",7.8,111252,"Action, Crime, Drama"
605,Ajeossi,"Action, Crime, Drama",7.8,62848,"Action, Crime, Drama"
896,Hell or High Water,"Action, Crime, Drama",7.6,204175,"Action, Crime, Drama"
782,Man on Fire,"Action, Crime, Drama",7.7,329592,"Action, Crime, Drama"
893,Sicario,"Action, Crime, Drama",7.6,371291,"Action, Crime, Drama"
345,Tropa de Elite 2: O Inimigo Agora é Outro,"Action, Crime, Drama",8.0,79200,"Action, Crime, Drama"
768,Lucky Number Slevin,"Action, Crime, Drama",7.7,299524,"Action, Crime, Drama"
968,Falling Down,"Action, Crime, Drama",7.6,171640,"Action, Crime, Drama"
850,Enter the Dragon,"Action, Crime, Drama",7.7,96561,"Action, Crime, Drama"
308,White Heat,"Action, Crime, Drama",8.1,29807,"Action, Crime, Drama"
