`63070501061 S.RAKNA`

> 1.5 hrs.

# Content Based Filtering In Recommendation System

Follow the instructions in this article: <a href="https://medium.com/0xcode/content-based-filtering-in-recommendation-system-using-jupyter-colab-notebook-9d3e0520af8">Vincent T., <b>Content Based Filtering In Recommendation System Using Jupyter Colab Notebook</b></a>

> 20 points

<h2 style="font-size:18px;">
2.1 Perform <u>content-based filtering</u> on the <code>movies.csv</code> dataset, which can be found under <code>Lecture 8</code> of this course’s shared Google Drive folder.
<span style="color:red">You should be able to query the movie <code>Dark Knight</code> as in the article.</span>
</h2>

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
movies = pd.read_csv(
    'movies.csv',
    sep=',',
    encoding='latin-1',
)


In [3]:
movies['genres'] = movies['genres'].str.split('|')
movies['genres'] = movies['genres'].fillna("").astype('str')


In [4]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy..."
1,2,Jumanji (1995),"['Adventure', 'Children', 'Fantasy']"
2,3,Grumpier Old Men (1995),"['Comedy', 'Romance']"
3,4,Waiting to Exhale (1995),"['Comedy', 'Drama', 'Romance']"
4,5,Father of the Bride Part II (1995),['Comedy']
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),"['Action', 'Animation', 'Comedy', 'Fantasy']"
9738,193583,No Game No Life: Zero (2017),"['Animation', 'Comedy', 'Fantasy']"
9739,193585,Flint (2017),['Drama']
9740,193587,Bungo Stray Dogs: Dead Apple (2018),"['Action', 'Animation']"


Convert the genres into TF-IDF vectors of unigrams and bigrams (`ngram_range=(1, 2)`), excluding English stop words (e.g. ‘the’, ‘and’).

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

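tf = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=1,  # keep every term; recent scikit-learn versions reject an integer min_df of 0
    stop_words='english',
)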
tfidf_matrix = tf.fit_transform(movies['genres'])
tfidf_matrix.shape


(9742, 177)
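
To see which terms end up as columns, the following is a minimal pure-Python sketch of word-level unigram and bigram extraction. It assumes the vectorizer's default behaviour of lowercasing and keeping tokens of two or more word characters, and it leaves out stop-word removal; `word_ngrams` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import re


def word_ngrams(text, ngram_range=(1, 2)):
    """Split text into lowercase word tokens and return all n-grams in the range."""
    # Assumed tokenisation: runs of two or more word characters, punctuation dropped
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    shortest, longest = ngram_range
    grams = []
    for n in range(shortest, longest + 1):
        # Slide a window of n tokens across the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# A genres value as stored after .astype('str') in cell 3
print(word_ngrams("['Action', 'Sci-Fi', 'Thriller']"))
# ['action', 'sci', 'fi', 'thriller', 'action sci', 'sci fi', 'fi thriller']
```

Because the genres are stored as the string form of a list, punctuation is discarded and `Sci-Fi` splits into `sci` and `fi`. Bigrams pair genres that happen to be adjacent, which is why the matrix has far more columns than there are distinct genres.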

Compute the pairwise cosine similarity between all movies.

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim[:4, :4]

array([[1.        , 0.31379419, 0.0611029 , 0.05271111],
       [0.31379419, 1.        , 0.        , 0.        ],
       [0.0611029 , 0.        , 1.        , 0.35172407],
       [0.05271111, 0.        , 0.35172407, 1.        ]])
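
Cosine similarity between two vectors $u$ and $v$ is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$: it is 1 when the vectors point the same way and 0 when they share no terms. A minimal sketch with made-up four-term vectors (the numbers are illustrative and are not taken from `tfidf_matrix`):

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical term order: comedy, romance, drama, action
comedy_romance = [1.0, 1.0, 0.0, 0.0]
comedy_romance_drama = [1.0, 1.0, 1.0, 0.0]
action_only = [0.0, 0.0, 0.0, 1.0]

print(round(cosine(comedy_romance, comedy_romance_drama), 4))  # 0.8165
print(cosine(comedy_romance, action_only))                     # 0.0
```

This matches the shape of the matrix above: the diagonal is 1 because each movie is compared with itself, the matrix is symmetric, and a 0 entry means the two movies share no genre terms.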

Build a one-dimensional Series that maps each movie title to its row index.

In [7]:
titles = movies['title']
indices = pd.Series(movies.index, index=movies['title'])
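
A title lookup in a Series like this yields one row position when the title is unique, but a Series of several positions when the same title appears more than once. The sketch below uses a made-up three-row frame to show both cases; `toy`, `toy_indices` and `first_index` are hypothetical names, and taking the first match is just one possible policy.

```python
import pandas as pd

# Made-up data: "Alpha (1999)" deliberately appears twice
toy = pd.DataFrame({
    "title": ["Alpha (1999)", "Beta (2000)", "Alpha (1999)"],
    "genres": ["Drama", "Comedy", "Thriller"],
})
toy_indices = pd.Series(toy.index, index=toy["title"])


def first_index(lookup, title):
    """Return one row position for title, taking the first match if it repeats."""
    hit = lookup[title]
    if isinstance(hit, pd.Series):
        # Duplicate title: several positions came back, keep the first
        return int(hit.iloc[0])
    return int(hit)


print(first_index(toy_indices, "Beta (2000)"))   # 1
print(first_index(toy_indices, "Alpha (1999)"))  # 0
```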


In [8]:
def genre_recommendations(title):
    '''get movie recommendations that are based on the cosine similarity scores of the movie genres'''
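    idx = indices[title]
    # Pair every movie's row index with its similarity to the query movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 20 entries after the top one, which is assumed to be the query itself
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]

    return titles.iloc[movie_indices]  # type: ignore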


In [9]:
genre_recommendations('Dark Knight, The (2008)').head(20)

8387                          Need for Speed (2014)
8149      Grandmaster, The (Yi dai zong shi) (2013)
123                                Apollo 13 (1995)
8026                              Life of Pi (2012)
8396                                    Noah (2014)
38                           Dead Presidents (1995)
341                              Bad Company (1995)
347             Faster Pussycat! Kill! Kill! (1965)
430                        Menace II Society (1993)
568                          Substitute, The (1996)
665                          Nothing to Lose (1994)
1645                       Untouchables, The (1987)
1696                           Monument Ave. (1998)
2563                              Death Wish (1974)
2574                        Band of the Hand (1986)
3037                              Foxy Brown (1974)
3124    Harley Davidson and the Marlboro Man (1991)
3167                                Scarface (1983)
3217                               Swordfish (2001)
3301        

> 10 points

<h2 style="font-size:18px;">
2.2
<ul>
  <li>For each user query, also output the genres of the movie you are querying.</li>
  <li>For each similar movie in the output, in addition to the movie ID and movie name, please also show the genres and the similarity score.</li>
  <li>Show your results for these 2 movies: <code>Ghost (1990)</code> and <code>Terminator, The (1984)</code></li>
</ul>
</h2>

In [10]:
def genre_recommendations_extended(title):
    '''get movie recommendations that are based on the cosine similarity scores of the movie genres'''
    # SHOW genres, movieId, name, similarity score
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 20 most similar movies
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]

    # Get the movie genres
    genres = movies['genres'].iloc[movie_indices]

    # Get the movieId
    movieId = movies['movieId'].iloc[movie_indices]

    # Get the movie name
    name = movies['title'].iloc[movie_indices]

    # Get the similarity score
    similarity_score = [i[1] for i in sim_scores]

    # Create the dataframe
    result = pd.DataFrame(columns=['genres', 'movieId', 'name', 'similarity_score'])
    result['genres'] = genres
    result['movieId'] = movieId
    result['name'] = name
    result['similarity_score'] = similarity_score

    return result

In [11]:
genre_recommendations_extended("Ghost (1990)")

Unnamed: 0,genres,movieId,name,similarity_score
6905,"['Drama', 'Fantasy', 'Romance', 'Thriller']",63992,Twilight (2008),0.94052
1085,"['Comedy', 'Drama', 'Fantasy', 'Romance']",1409,Michael (1996),0.844785
1530,"['Comedy', 'Drama', 'Fantasy', 'Romance']",2065,"Purple Rose of Cairo, The (1985)",0.844785
2103,"['Comedy', 'Drama', 'Fantasy', 'Romance']",2797,Big (1988),0.844785
2350,"['Comedy', 'Drama', 'Fantasy', 'Romance']",3108,"Fisher King, The (1991)",0.844785
2510,"['Comedy', 'Drama', 'Fantasy', 'Romance']",3358,Defending Your Life (1991),0.844785
3097,"['Comedy', 'Drama', 'Fantasy', 'Romance']",4157,"Price of Milk, The (2000)",0.844785
3249,"['Comedy', 'Drama', 'Fantasy', 'Romance']",4392,Alice (1990),0.844785
4356,"['Comedy', 'Drama', 'Fantasy', 'Romance']",6373,Bruce Almighty (2003),0.844785
4744,"['Comedy', 'Drama', 'Fantasy', 'Romance']",7067,Juliet of the Spirits (Giulietta degli spiriti...,0.844785


In [12]:
genre_recommendations_extended("Terminator, The (1984)")

Unnamed: 0,genres,movieId,name,similarity_score
68,"['Action', 'Sci-Fi', 'Thriller']",76,Screamers (1995),1.0
144,"['Action', 'Sci-Fi', 'Thriller']",172,Johnny Mnemonic (1995),1.0
296,"['Action', 'Sci-Fi', 'Thriller']",338,Virtuosity (1995),1.0
336,"['Action', 'Sci-Fi', 'Thriller']",379,Timecop (1994),1.0
474,"['Action', 'Sci-Fi', 'Thriller']",541,Blade Runner (1982),1.0
567,"['Action', 'Sci-Fi', 'Thriller']",692,Solo (1996),1.0
601,"['Action', 'Sci-Fi', 'Thriller']",748,"Arrival, The (1996)",1.0
939,"['Action', 'Sci-Fi', 'Thriller']",1240,"Terminator, The (1984)",1.0
1373,"['Action', 'Sci-Fi', 'Thriller']",1882,Godzilla (1998),1.0
1939,"['Action', 'Sci-Fi', 'Thriller']",2571,"Matrix, The (1999)",1.0
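
In the output above, `Terminator, The (1984)` (row 939) appears among its own recommendations. Several movies tie at a score of 1.0, so the sorted list does not necessarily put the query movie first, and the `[1:21]` slice drops a different movie instead. One possible way around this is to filter out the query's row index before ranking. The sketch below is pure Python; `top_similar` is a hypothetical helper and the scores are made up.

```python
def top_similar(scores, query_idx, k=20):
    """Return up to k (index, score) pairs, best first, never including query_idx."""
    # Drop the query movie by position instead of assuming it sorts first
    candidates = [(i, s) for i, s in enumerate(scores) if i != query_idx]
    # sort() is stable, so tied scores keep their original row order
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


# Made-up similarity row: movies 0, 2 and 3 all tie with the query (movie 2)
scores = [1.0, 0.3, 1.0, 1.0]
print(top_similar(scores, query_idx=2, k=2))  # [(0, 1.0), (3, 1.0)]
```

Inside `genre_recommendations_extended`, a call such as `top_similar(cosine_sim[idx], idx)` could stand in for the `enumerate`, `sorted` and slice steps, assuming `idx` is a single integer.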
