## Content-based recommendation system

Content-based recommender systems do not use data from other users to recommend a movie. Instead, they rely on a descriptive set of attributes, such as the keywords or the summary of a movie. The disadvantage is that, for a given input, these systems always return the same recommendations, regardless of who is asking. However, they can be useful for recommending a movie that not many people have seen or rated.

In a content-based recommendation system, only the user's own input plays a role in the recommendation; the behaviour of other users is not considered. This method can also be combined with collaborative filtering methods.

In this notebook, the text content of the movies in the `merged` dataset is analyzed. The goal is to rank all the movies in the dataset by their similarity to an input movie, using cosine similarity as the similarity measure. The content comes from the movies' plots and possibly also the keywords. TF-IDF is used to down-weight the most common words. Finally, the input to the TF-IDF algorithm is the lemmatized text of each movie's content.
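
As a rough illustration of the pipeline described above, the following pure-Python sketch computes TF-IDF weights for a toy corpus and ranks the documents by cosine similarity to a query document. The corpus, the function names and the textbook formula `idf = log(N / df)` are assumptions made for this example; sklearn's `TfidfVectorizer` uses a smoothed variant of the IDF by default, so its numbers differ.

```python
import math
from collections import Counter

# Toy corpus: each "plot" is already tokenized (hypothetical data).
docs = {
    "boxing_1": ["boxer", "fight", "champion", "film"],
    "boxing_2": ["boxer", "train", "fight", "film"],
    "toys": ["toy", "boy", "friend", "film"],
}

def tfidf_vectors(docs):
    """Return {name: {term: tf * idf}} with the textbook idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = tfidf_vectors(docs)
query = "boxing_1"
# Rank every other document by its similarity to the query document.
ranking = sorted(
    ((name, cosine(vectors[query], vec)) for name, vec in vectors.items() if name != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

The word `film` occurs in every toy document, so its weight is zero and it does not contribute to any similarity score; this is the sense in which TF-IDF suppresses very common words.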

In [2]:
# Common libraries imports
import pandas as pd

In [3]:
# Less common libraries: imports and installation.
# !python3 -m pip install nltk ## For Linux, outside a virtual environment
# !pip install nltk
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
# nltk.download('corpus')  # 'corpus' is not an NLTK package id, so this call fails (see the error in the output below)
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Error loading corpus: Package 'corpus' not found in index
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Since it is a good idea to remove stop words from TF-IDF calculations, as also stated in [Chapter 1.3.1 MMDS](http://mmds.org/), a list of English stop words is created:

In [4]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Read the `merged` dataset, or its 'cleaned' version, which has duplicates removed:

In [5]:
data = pd.read_csv('../Data/merged.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,keywords,cast,crew,KeywCastDirGenre
0,0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id':16,'name':'animation'},{'id':35,'name':...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,...,American,johnlasseter,"Tim Allen, Tom Hanks (voices)",animated film,https://en.wikipedia.org/wiki/Toy_Story,In a world where toys are living things who pr...,"[{'id':931,'name':'jealousy'},{'id':4290,'name...","[{'cast_id':14,'character':'woody(voice)','cre...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",jealousy toy boy friendship friends rivalry bo...
1,1,False,,65000000,"[{'id':12,'name':'adventure'},{'id':14,'name':...",,8844,tt0113497,en,Jumanji,...,American,joejohnston,"Robin Williams, Bonnie Hunt, Kirsten Dunst, Br...","family, fantasy",https://en.wikipedia.org/wiki/Jumanji_(film),"In 1869, near Brantford, New Hampshire, two br...","[{'id':10090,'name':'boardgame'},{'id':10941,'...","[{'cast_id':1,'character':'alanparrish','credi...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",boardgame disappearance basedonchildren'sbook ...
2,2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id':10749,'name':'romance'},{'id':35,'name'...",,15602,tt0113228,en,Grumpier Old Men,...,American,howarddeutch,"Jack Lemmon, Walter Matthau, Ann-Margret, Soph...",comedy,https://en.wikipedia.org/wiki/Grumpier_Old_Men,The feud between Max (Walter Matthau) and John...,"[{'id':1495,'name':'fishing'},{'id':12392,'nam...","[{'cast_id':2,'character':'maxgoldman','credit...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",fishing bestfriend duringcreditsstinger oldmen...
3,3,False,,16000000,"[{'id':35,'name':'comedy'},{'id':18,'name':'dr...",,31357,tt0114885,en,Waiting to Exhale,...,American,forestwhitaker,"Whitney Houston, Angela Bassett, Loretta Devin...",drama,https://en.wikipedia.org/wiki/Waiting_to_Exhale,"""Friends are the People who let you be yoursel...","[{'id':818,'name':'basedonnovel'},{'id':10131,...","[{'cast_id':1,'character':""savannah'vannah'jac...","[{'credit_id': '52fe44779251416c91011acb', 'de...",basedonnovel interracialrelationship singlemot...
4,4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id':35,'name':'comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,...,American,charlesshyer,"Steve Martin, Diane Keaton, Martin Short, Kimb...",comedy,https://en.wikipedia.org/wiki/Father_of_the_Br...,The film begins five years after the events of...,"[{'id':1009,'name':'baby'},{'id':1599,'name':'...","[{'cast_id':1,'character':'georgebanks','credi...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",baby midlifecrisis confidence aging daughter m...


In [6]:
def print_info(index):
    '''
    Helper function used for an initial overview of the dataset
    '''
    print(f"Title:\n{data.iloc[index]['Title']}\n")
    print(f"Release Year:\n{data.iloc[index]['Release Year']}\n")
    print(f"Link:\n{data.iloc[index]['Wiki Page']}\n")
    print(f"Tagline:\n{data.iloc[index]['tagline']}\n")
    print(f"Overview:\n{data.iloc[index]['overview']}\n")
    print(f"Summary:\n{data.iloc[index]['Plot']}\n")

As an example, use `print_info` for a random movie:

In [7]:
print_info(12456)

Title:
Suicide Squad

Release Year:
2016

Link:
https://en.wikipedia.org/wiki/Suicide_Squad_(film)

Tagline:
Worst Heroes Ever

Overview:
From DC Comics comes the Suicide Squad, an antihero team of incarcerated supervillains who act as deniable assets for the United States government, undertaking high-risk black ops missions in exchange for commuted prison sentences.

Summary:
In the aftermath of Superman's death, intelligence officer Amanda Waller reaches Washington D.C for assembling Task Force X, and shows them to everyone in the White House a team of dangerous criminals imprisoned at Belle Reve Prison consisting of elite hitman Deadshot, former psychiatrist Harley Quinn, pyrokinetic ex-gangster El Diablo, opportunistic thief Captain Boomerang, genetic mutation Killer Croc, and specialized assassin Slipknot. They are placed under command of Colonel Rick Flag to be used as disposable assets in high-risk missions for the United States government. Each member has a nano bomb implanted 

Which text should we use as content? The candidates are `tagline`, which works as an alternative title, `overview`, a sentence that summarizes the movie, and `Plot`, the summary of the movie, which is in general the longest of the three. We can either use `Plot` alone or create, for each movie, a text document that contains the desired fields.

In the following, as a prototype, I am only using the `Plot`.

There are two ways to normalize text: stemming and lemmatization. The difference is explained [here](https://www.guru99.com/stemming-lemmatization-python-nltk.html). In the following, I use lemmatization.
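
To make the distinction concrete, here is a toy sketch: stemming strips suffixes by rule and may produce strings that are not words, while lemmatization maps a word to its dictionary form. The suffix list and the mini dictionary below are hypothetical and only serve the illustration; real stemmers and the WordNet lemmatizer used in this notebook are far more elaborate.

```python
# Toy illustration only, not the NLTK implementation.
SUFFIXES = ("ing", "ies", "ed", "s")  # hypothetical suffix list

# Hypothetical mini dictionary mapping inflected forms to their lemma.
LEMMAS = {"studies": "study", "studying": "study", "went": "go"}

def toy_stem(word):
    """Strip the first matching suffix, without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "studying", "went"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

In this sketch the stemmer turns `studies` into `stud` and cannot relate `went` to `go`, whereas the dictionary lookup handles both.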

The procedure is as follows: lemmatize each movie's text content, get the frequency of each movie's lemmas and then apply TF-IDF. To this end, the lemmas of each movie are stored as a list of words, alongside the movie title and the release year.

In [8]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import RegexpTokenizer

In [173]:
tokenizer = RegexpTokenizer(r'\w+') # Remove punctuation
wordnet_lemmatizer = WordNetLemmatizer() # Create lemmatizer


In [174]:
def tokenize(text):
    tokenization = tokenizer.tokenize(text.lower())
    tokens = [ wordnet_lemmatizer.lemmatize(token, wordnet.VERB) for token in tokenization if token not in stop_words and token.isalpha() ]
    return tokens

In [175]:
data['Tokens'] = data['Plot'].apply(tokenize)

In [12]:
data.Tokens

0        [world, toy, live, things, pretend, lifeless, ...
1        [near, brantford, new, hampshire, two, brother...
2        [feud, max, walter, matthau, john, jack, lemmo...
3        [friends, people, let, never, let, forget, wai...
4        [film, begin, five, years, events, first, one,...
                               ...                        
14701    [leave, permanent, residence, germany, famous,...
14702    [masha, krapivina, kristina, asmus, come, mosc...
14703    [biology, teacher, devki, vivacious, popular, ...
14704    [struggle, sculptor, marcel, de, lange, martin...
14705    [film, begin, miller, poach, deer, land, belon...
Name: Tokens, Length: 14706, dtype: object

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

The calculation of the cosine similarity:

In the following approach, the cosine similarity is computed in a simple way: the words of the respective field (e.g. `Plot`) of each movie are one-hot encoded, so that only the presence or absence of a word matters. In the example below, the cosine similarity between Rocky Balboa and Rocky II is 0.28. However, the computational cost of this approach (2 min for each movie) is very high when applied to all the available data and the various column combinations. For this reason, we chose the predefined `cosine_similarity` function from sklearn. Moreover, better results can be obtained by computing the similarity on TF-IDF scores, since these take the importance of each word into account. Using sklearn's optimized `TfidfVectorizer` also avoided the out-of-memory situation that we ran into with our hard-coded approach.
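
For one-hot (binary) vectors, the dot product counts the shared words and each squared norm counts the words of one document, so the cosine similarity reduces to $|A \cap B| / \sqrt{|A|\,|B|}$, where $A$ and $B$ are the two vocabularies. The sketch below checks this on two made-up sentences; the function name and the whitespace tokenization are assumptions for the example and stand in for the vocabulary that the notebook builds with `TfidfVectorizer`.

```python
import math

def binary_cosine(tokens_a, tokens_b):
    """Cosine similarity of the one-hot encodings of two token lists."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    if not vocab_a or not vocab_b:
        return 0.0
    # Shared words divided by the geometric mean of the vocabulary sizes.
    return len(vocab_a & vocab_b) / math.sqrt(len(vocab_a) * len(vocab_b))

# Hypothetical plots, tokenized on whitespace for simplicity.
plot_a = "a boxer trains for one last fight".split()
plot_b = "a young boxer wins the fight".split()
print(binary_cosine(plot_a, plot_b))
```

Here the two sentences share 3 words out of 7 and 6 respectively, so the result is 3 / sqrt(42).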

In [222]:
import numpy as np
data_movie1=data.loc[data.index==1622].Plot
data_movie2=data.loc[data.index==7007].Plot
A = TfidfVectorizer().fit(data_movie1)
B = TfidfVectorizer().fit(data_movie2)

a1 = list(A.vocabulary_.keys())
a2 = list(B.vocabulary_.keys())
tot_words = list(set( a1 + a2))

a1_oh = np.array([ 1 if tot_words[i] in a1 else 0 for i in range(len(tot_words))  ]).astype(int).reshape(-1,1)
a2_oh = np.array([ 1 if tot_words[i] in a2 else 0 for i in range(len(tot_words))  ]).astype(int).reshape(-1,1)

a1_n = np.sqrt( np.dot(a1_oh.T, a1_oh))
a2_n = np.sqrt( np.dot(a2_oh.T, a2_oh))

cos_sim = np.dot(a1_oh.T, a2_oh)/(a1_n * a2_n)
cos_sim

array([[0.28329498]])

In [223]:
from tqdm.autonotebook import tqdm




In [232]:
def get_cos_sim(doc1, doc2):
    '''
    Cosine similarity between the one-hot encoded vocabularies of two documents
    '''
    # TfidfVectorizer is used here only to tokenize and collect each vocabulary
    a1 = set(TfidfVectorizer().fit(doc1).vocabulary_)
    a2 = set(TfidfVectorizer().fit(doc2).vocabulary_)
    tot_words = list(a1 | a2)
    # One-hot column vectors over the joint vocabulary
    a1_oh = np.array([1 if w in a1 else 0 for w in tot_words]).reshape(-1, 1)
    a2_oh = np.array([1 if w in a2 else 0 for w in tot_words]).reshape(-1, 1)
    a1_n = np.sqrt(np.dot(a1_oh.T, a1_oh))
    a2_n = np.sqrt(np.dot(a2_oh.T, a2_oh))

    cos_sim = np.dot(a1_oh.T, a2_oh)/(a1_n * a2_n)
    return cos_sim

# The reference plot does not change, so select it once outside the loop
reference_plot = data.loc[data.Title=="Rocky Balboa"].Plot
cos_sim_dict = {}
for movie_dx in tqdm(range(len(data))):
    cs = get_cos_sim(reference_plot, data.loc[data.index==movie_dx].Plot)
    cos_sim_dict[data.loc[data.index==movie_dx].Title.item()] = cs


In [233]:
top_n = 10
sorted(cos_sim_dict.items(), key=lambda x: x[1], reverse=True)[1:top_n]

[('Rocky V', array([[0.28727653]])),
 ('Rocky II', array([[0.28329498]])),
 ('Fat City', array([[0.28086539]])),
 ('Girlfight', array([[0.27725272]])),
 ('Rocky', array([[0.276353]])),
 ('Girls Just Want to Have Fun', array([[0.27255818]])),
 ('Rocky III', array([[0.26744683]])),
 ('December Boys', array([[0.26743431]])),
 ('Rocky IV', array([[0.26612315]]))]

In [220]:
data2=data.copy()
print(data2['overview'].loc[data.index==8208])

8208    When young Lotus Flower sees an unconscious ma...
Name: overview, dtype: object


In [199]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

def movie_recom(titles,data2):
    title_ = titles
    data2=data2.astype(str)
    value = data2.apply(lambda x: ' '.join(x), axis=1)
    data3=pd.DataFrame({'Title':title_, 'value':value})
    tf = TfidfVectorizer().fit_transform(data3.value.dropna())
    cosine_sim = cosine_similarity(tf, tf)
    return cosine_sim

def get_recommendations(title, df, sim_measure ):
    df = df.copy()
    smd = df.reset_index()
    indices = pd.Series(smd.index, index=smd['Title'])
    idx = indices[title]
    if isinstance(idx, pd.core.series.Series):
        a = max(df.loc[df.Title==title]['Release Year'].to_list())
        idx = df.loc[ (df.Title==title) & (df['Release Year']==a) ].index[0]
    sim_scores = list(enumerate(sim_measure[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    sim_scores1 = [i[1] for i in sim_scores]
    #print(titles.iloc[movie_indices])
    return movie_indices,sim_scores1

In [23]:
from collections import defaultdict

def create_cosine_sim(combinations,data2):
    cosine_sim = defaultdict(list)
    for i in range(len(combinations[:])):
        cosine_sim_=movie_recom(data2['Title'],data2[combinations[i]])
        cosine_sim[i].append(pd.DataFrame(cosine_sim_))
    return cosine_sim


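Three different models are compared, each exploiting different columns of the original dataset. They are labelled 'Title & Plot', 'Title, Cast & Genre' and 'Keywords, Cast, Director & Genre'. Note that the column list passed for the second model below is `['Cast', 'Genre']`.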
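
The models differ only in which columns are concatenated into a single text document per movie before vectorization, which is what `movie_recom` does with `' '.join`. A minimal sketch of that step, using a hypothetical two-movie table and a hypothetical helper `build_documents` (plain dictionaries replace the DataFrame to keep the example self-contained):

```python
# Hypothetical two-movie table; the column names mirror those used in the notebook.
movies = [
    {"Title": "Movie A", "Cast": "Actor One, Actor Two", "Genre": "drama"},
    {"Title": "Movie B", "Cast": "Actor Three", "Genre": "comedy"},
]

def build_documents(rows, columns):
    """Concatenate the chosen columns of each row into one text document."""
    return [" ".join(str(row[col]) for col in columns) for row in rows]

# Each column combination yields a different set of documents, hence a different model.
combinations = [["Title", "Genre"], ["Cast", "Genre"]]
for columns in combinations:
    print(columns, build_documents(movies, columns))
```

Each list of documents would then be passed to a TF-IDF vectorizer and compared with cosine similarity, one similarity matrix per combination.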

In [228]:
def recommendation_engine(title,data2,cosine_sim):
    column=['Title & Plot', 'Title, Cast & Genre','Keywords, Cast, Director & Genre']
    recommendations_df= pd.DataFrame()
    score_df= pd.DataFrame()
    titles_df= pd.DataFrame()
    titles = data2['Title']
    for i,r in cosine_sim.items():
        recom,score = get_recommendations(title, data2, r[0])
        titles_=titles.iloc[recom]
        recom=pd.DataFrame(recom)
        score=pd.DataFrame(score)
        recommendations_df = pd.concat([ recommendations_df.reset_index(drop=True), recom.reset_index(drop=True)], axis=1, ignore_index=True).rename(columns={0: column[0], 1: column[1], 2: column[2]})
        score_df = pd.concat([ score_df.reset_index(drop=True), score.reset_index(drop=True)], axis=1, ignore_index=True).rename(columns={0: column[0], 1: column[1], 2: column[2]})
        titles_df = pd.concat([ titles_df.reset_index(drop=True), titles_.reset_index(drop=True)], axis=1, ignore_index=True).rename(columns={0: column[0], 1: column[1], 2: column[2]})
    return recommendations_df,score_df,titles_df

In [229]:
combinations=[['Title','Plot'],['Cast','Genre'],['KeywCastDirGenre']]
cosine_sim=create_cosine_sim(combinations,data2)

In [231]:
title=str("Rocky Balboa")
recommendations_df,score_df,titles_df=recommendation_engine(title,data2,cosine_sim)
titles_df

Unnamed: 0,Title & Plot,"Title, Cast & Genre","Keywords, Cast, Director & Genre"
0,Rocky II,Rocky,Rocky III
1,Rocky V,Rambo III,Rocky II
2,Rocky III,Assassins,Rocky
3,Mask,Rocky V,Fat City
4,Rocky,Lock Up,Rocky V
5,Angels with Dirty Faces,Killing Season,Rocky IV
6,Rocky IV,The Expendables 3,Body and Soul
7,Don't Breathe,Rocky II,Jason's Lyric
8,Roommates,Death Race 2000,The Set-Up
9,Red Light,Pathology,Raging Bull
