In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/movierecommenderdataset/movies.csv
/kaggle/input/movierecommenderdataset/ratings.csv


In this notebook, we use TF-IDF (Term Frequency - Inverse Document Frequency) to recommend movies based on how similar they are to a chosen movie. We therefore only need the first dataset, which contains the movie IDs, titles and genres, to find the movies most similar to the chosen one. In another notebook, we will explore recommender systems based on collaborative filtering.

The content of this notebook is based on a lecture from the NLP course series of Lazy Programmer:

https://www.udemy.com/course/natural-language-processing-in-python
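
As a rough illustration of what TF-IDF computes, the sketch below rebuilds the weighting by hand on a made-up three-document corpus (the strings are for illustration only and are not read from the dataset) and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and rows scaled to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration (not read from movies.csv)
docs = [
    "Comedy Romance Grumpier Old Men",
    "Comedy Drama Romance Waiting to Exhale",
    "Comedy Father of the Bride",
]

# Term frequency: raw count of each term in each document
counts = CountVectorizer().fit_transform(docs).toarray()

# Document frequency: number of documents that contain each term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale every row to unit L2 norm
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

# The same corpus through TfidfVectorizer with default settings
reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

A term that occurs in every document, such as "comedy" here, gets the smallest IDF and therefore the smallest weight relative to its count.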

In [2]:
df = pd.read_csv('/kaggle/input/movierecommenderdataset/movies.csv')
dfr = pd.read_csv('/kaggle/input/movierecommenderdataset/ratings.csv')

In [3]:
#This dataset contains the movie IDs, titles and genres
df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [4]:
# This dataset has the user IDs and the ratings they gave to the movies above
dfr

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [5]:
print('The number of users is:', dfr['userId'].nunique())

The number of users is: 610


# Create TF-IDF matrix

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [7]:
#We will merge the genres and the title into a single string for each movie
#TF-IDF will be fitted on this 
def genres_and_titles_to_string(row):
    #genres are separated by '|'; replace the separators with spaces
    genres = ' '.join(row['genres'].split('|'))
    
    title = row['title']
    
    return "%s %s" % (genres, title)

In [8]:
df['string'] = df.apply(genres_and_titles_to_string, axis = 1)

In [9]:
df.head(4)

Unnamed: 0,movieId,title,genres,string
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Adventure Animation Children Comedy Fantasy To...
1,2,Jumanji (1995),Adventure|Children|Fantasy,Adventure Children Fantasy Jumanji (1995)
2,3,Grumpier Old Men (1995),Comedy|Romance,Comedy Romance Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Comedy Drama Romance Waiting to Exhale (1995)


In [10]:
#create a tf-idf vectorizer object
tfidf = TfidfVectorizer(max_features = 200)

In [11]:
#create the TF-IDF matrix on df['string']
X = tfidf.fit_transform(df['string'])
#The shape of X is NxF (N = # of movies, F = # of TF-IDF features)
X.shape

(9742, 200)
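
With `max_features = 200`, the vectorizer keeps only the 200 most frequent terms in the corpus and ignores every other word, so rare title words contribute nothing to a movie's vector. The sketch below shows the effect on a made-up corpus; the strings and the cap of 3 are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up strings in the same "genres title" layout as df['string']
toy_strings = [
    "Comedy Tag (2018)",
    "Comedy Blockers (2018)",
    "Horror Get Out (2017)",
    "Horror Comedy Zombieland (2009)",
]

# Keep only the 3 most frequent terms across the corpus
toy_tfidf = TfidfVectorizer(max_features=3)
toy_X = toy_tfidf.fit_transform(toy_strings)

# The surviving vocabulary; every other word is dropped
print(sorted(toy_tfidf.vocabulary_))

# Rows 0 and 1 differ only in dropped words, so their vectors are identical
print(np.allclose(toy_X[[0]].toarray(), toy_X[[1]].toarray()))
```

Movies whose strings differ only in dropped words end up with the same vector, which means they are indistinguishable to any similarity measure computed on `X`.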

# A function for recommending 

In what follows, we write a function that recommends movies similar to a chosen movie. It takes as inputs the title of the chosen movie and the number K of recommendations to return.

The function computes the cosine similarity between the TF-IDF vector of the chosen movie and the vector of every movie in the dataset, and returns the K movies with the highest scores.
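
To make the scoring concrete: the cosine similarity of two vectors is their dot product divided by the product of their lengths, so it depends only on direction. The sketch below checks a hand-written version against `cosine_similarity` on hypothetical vectors (the numbers are made up and are not TF-IDF values from this dataset).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature vectors, one row per item
vectors = np.array([
    [1.0, 1.0, 0.0],   # query
    [2.0, 2.0, 0.0],   # same direction as the query
    [1.0, 0.0, 0.0],   # partial overlap with the query
    [0.0, 0.0, 3.0],   # no shared features
])

def cosine(a, b):
    # dot product divided by the product of the vector lengths
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Score the query (row 0) against every row, by hand and with scikit-learn
manual = np.array([cosine(vectors[0], v) for v in vectors])
library = cosine_similarity(vectors[:1], vectors).flatten()

print(np.round(manual, 3))
print(np.allclose(manual, library))
```

Note that the second row scores exactly as high as the query itself, because it points in the same direction. Ties of this kind matter when ranking results.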

In [12]:
movie2idx = pd.Series(df.index, index = df['title'])
movie2idx

title
Toy Story (1995)                                0
Jumanji (1995)                                  1
Grumpier Old Men (1995)                         2
Waiting to Exhale (1995)                        3
Father of the Bride Part II (1995)              4
                                             ... 
Black Butler: Book of the Atlantic (2017)    9737
No Game No Life: Zero (2017)                 9738
Flint (2017)                                 9739
Bungo Stray Dogs: Dead Apple (2018)          9740
Andrew Dice Clay: Dice Rules (1991)          9741
Length: 9742, dtype: int64
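
A title-to-index lookup like this one returns a single integer for a unique title, but a Series when the same title labels more than one row. The sketch below shows both cases on made-up titles; `title2idx` and the titles themselves are hypothetical stand-ins for `movie2idx` and the real data.

```python
import pandas as pd

# Made-up titles; "Remake (2000)" appears twice on purpose
titles = pd.Series(["Original (1990)", "Remake (2000)", "Remake (2000)"])
title2idx = pd.Series(titles.index, index=titles)

unique_hit = title2idx["Original (1990)"]   # a single integer
repeated_hit = title2idx["Remake (2000)"]   # a Series with two entries

print(type(unique_hit).__name__, type(repeated_hit).__name__)

# Keep the first matching row when a title is repeated
if isinstance(repeated_hit, pd.Series):
    first_idx = repeated_hit.iloc[0]
else:
    first_idx = repeated_hit
print(first_idx)
```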

In [13]:

def recommend(movie, K):
    #get the row in the df for this movie
    idx = movie2idx[movie]
    #a title that appears more than once returns a Series; keep the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
        
    #calculate the similarity between this movie and every movie:
    movie_vector = X[idx]
    scores = cosine_similarity(movie_vector, X)
    
    #currently the array is 1xN, make it a 1D array
    scores = scores.flatten()
    
    #sort the indices from the highest to the lowest score
    #and take the first K recommendations
    #Index 0 is skipped on the assumption that it is the chosen movie itself;
    #movies with an identical TF-IDF vector tie with it at a score of 1,
    #in which case the chosen movie can show up in the results instead
    recommended_idx = (-scores).argsort()[1:K+1]
    
    #return the titles of the recommendations
    return df['title'].iloc[recommended_idx]

# Check the recommendations for several chosen examples

In [14]:
recommend('Toy Story (1995)',7)

2355                        Toy Story 2 (1999)
12                                Balto (1995)
1                               Jumanji (1995)
7355                        Toy Story 3 (2010)
599     Wallace & Gromit: A Close Shave (1995)
209                               Gordy (1995)
4604      Ninja Scroll (Jûbei ninpûchô) (1995)
Name: title, dtype: object

## Comedy genre

In [15]:
df[df['genres']=='Comedy']

Unnamed: 0,movieId,title,genres,string
4,5,Father of the Bride Part II (1995),Comedy,Comedy Father of the Bride Part II (1995)
17,18,Four Rooms (1995),Comedy,Comedy Four Rooms (1995)
18,19,Ace Ventura: When Nature Calls (1995),Comedy,Comedy Ace Ventura: When Nature Calls (1995)
58,65,Bio-Dome (1996),Comedy,Comedy Bio-Dome (1996)
61,69,Friday (1995),Comedy,Comedy Friday (1995)
...,...,...,...,...
9695,184791,Fred Armisen: Standup for Drummers (2018),Comedy,Comedy Fred Armisen: Standup for Drummers (2018)
9704,185473,Blockers (2018),Comedy,Comedy Blockers (2018)
9716,188797,Tag (2018),Comedy,Comedy Tag (2018)
9726,190209,Jeff Ross Roasts the Border (2017),Comedy,Comedy Jeff Ross Roasts the Border (2017)


In [16]:
recommend('Tag (2018)', 5)

9685    Tom Segura: Disgraceful (2018)
9716                        Tag (2018)
9718                 Boundaries (2018)
9684                The Clapper (2018)
9723             BlacKkKlansman (2018)
Name: title, dtype: object
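
In the output above, 'Tag (2018)' appears among its own recommendations. A likely cause is that several movies share an identical TF-IDF vector once the vocabulary is capped, so they tie with the chosen movie at a score of 1 and the sort may place one of them at position 0. One possible workaround, sketched here on a made-up corpus with a hypothetical helper `recommend_excluding_self`, is to remove the chosen index explicitly instead of slicing from position 1. The `toy_` names and the cap of 2 features are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-ins for df['title'] and df['string']
toy_df = pd.DataFrame({
    "title": ["Tag (2018)", "Blockers (2018)", "Get Out (2017)", "Friday (1995)"],
    "string": ["Comedy Tag (2018)", "Comedy Blockers (2018)",
               "Horror Get Out (2017)", "Comedy Friday (1995)"],
})
# A small vocabulary cap gives rows 0 and 1 identical vectors
toy_X = TfidfVectorizer(max_features=2).fit_transform(toy_df["string"])
toy_movie2idx = pd.Series(toy_df.index, index=toy_df["title"])

def recommend_excluding_self(movie, K):
    idx = toy_movie2idx[movie]
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    # Similarity of the chosen movie to every movie, as a 1D array
    scores = cosine_similarity(toy_X[[idx]], toy_X).flatten()
    # Rank all movies by score, then remove the chosen movie by index
    ranked = np.argsort(-scores, kind="stable")
    ranked = ranked[ranked != idx]
    return toy_df["title"].iloc[ranked[:K]]

print(recommend_excluding_self("Tag (2018)", 2))
```

Filtering by index keeps the movies that tie with the chosen one and drops only the chosen movie itself, so the function still returns K titles.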

## Horror genre

In [17]:
df[df['genres'] == 'Horror']

Unnamed: 0,movieId,title,genres,string
149,177,Lord of Illusions (1995),Horror,Horror Lord of Illusions (1995)
188,220,Castle Freak (1995),Horror,Horror Castle Freak (1995)
593,735,Cemetery Man (Dellamorte Dellamore) (1994),Horror,Horror Cemetery Man (Dellamorte Dellamore) (1994)
653,841,"Eyes Without a Face (Yeux sans visage, Les) (1...",Horror,"Horror Eyes Without a Face (Yeux sans visage, ..."
842,1105,Children of the Corn IV: The Gathering (1996),Horror,Horror Children of the Corn IV: The Gathering ...
...,...,...,...,...
9447,167538,Microwave Massacre (1983),Horror,Horror Microwave Massacre (1983)
9462,168250,Get Out (2017),Horror,Horror Get Out (2017)
9480,169670,The Void (2016),Horror,Horror The Void (2016)
9582,175199,Annabelle: Creation (2017),Horror,Horror Annabelle: Creation (2017)


In [18]:
recommend('Annabelle: Creation (2017)', 5)

9641      Creep 2 (2017)
9638       Mayhem (2017)
9610    Cage Dive (2017)
9434        Split (2017)
9739        Flint (2017)
Name: title, dtype: object

In [19]:
recommend('The Void (2016)', 4)

9320    The Conjuring 2 (2016)
9390                 31 (2016)
9231         Southbound (2016)
9344            Satanic (2016)
Name: title, dtype: object