## Content-based recommendation system

Content-based recommender systems do not use data from other users to recommend a movie. Instead, they rely on descriptive attributes of the movie itself, such as its keywords or plot summary. The disadvantage is that, for a given input movie, such a system always returns the same recommendations. However, it can be useful for recommending movies that few people have seen or rated.

In a content-based recommendation system, only the user's own input plays a role in the recommendation. This method can also be combined with collaborative filtering methods.

In this notebook, the text content of the movies in the `merged` dataset is analyzed. The goal is to rank all the movies in the dataset by their similarity to an input movie, using cosine similarity as the measure. The content comes from the movies' plots and possibly also their keywords. TF-IDF is used to down-weight the most common words. Finally, the input to the TF-IDF algorithm will be the lemmatized text of each movie's content.

In [1]:
# Common libraries imports
import pandas as pd

In [25]:
# Less common libraries: installation and imports.
# !python3 -m pip install nltk  # On Linux, outside a virtual environment
# !pip install nltk
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

Since it is a good idea to remove stop words from TF-IDF calculations, as also stated in [Chapter 1.3.1 of MMDS](http://mmds.org/), a list of English stop words is created:

In [26]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Read the `merged` dataset, or its 'cleaned' version that has duplicates removed:

In [27]:
# a = data.sort_values('Release Year', ascending=False).drop_duplicates(subset=['Title', 'Release Year'], keep='last')
# a.loc[a.Title=='The Mask']['release_date']

In [28]:
data = pd.read_csv('../Data/merged.csv')

In [29]:
def print_info(index):
    '''
    Helper function used for an initial overview of the dataset
    '''
    print(f"Title:\n{data.iloc[index]['Title']}\n")
    print(f"Release Year:\n{data.iloc[index]['Release Year']}\n")
    print(f"Link:\n{data.iloc[index]['Wiki Page']}\n")
    print(f"Tagline:\n{data.iloc[index]['tagline']}\n")
    print(f"Overview:\n{data.iloc[index]['overview']}\n")
    print(f"Summary:\n{data.iloc[index]['Plot']}\n")

As an example, use `print_info` for a random movie:

In [30]:
print_info(12456)

Title:
Suicide Squad

Release Year:
2016

Link:
https://en.wikipedia.org/wiki/Suicide_Squad_(film)

Tagline:
Worst Heroes Ever

Overview:
From DC Comics comes the Suicide Squad, an antihero team of incarcerated supervillains who act as deniable assets for the United States government, undertaking high-risk black ops missions in exchange for commuted prison sentences.

Summary:
In the aftermath of Superman's death, intelligence officer Amanda Waller reaches Washington D.C for assembling Task Force X, and shows them to everyone in the White House a team of dangerous criminals imprisoned at Belle Reve Prison consisting of elite hitman Deadshot, former psychiatrist Harley Quinn, pyrokinetic ex-gangster El Diablo, opportunistic thief Captain Boomerang, genetic mutation Killer Croc, and specialized assassin Slipknot. They are placed under command of Colonel Rick Flag to be used as disposable assets in high-risk missions for the United States government. Each member has a nano bomb implanted 

Which text should we use as content? The candidates are `tagline`, which acts as an alternative title; `overview`, a sentence that summarizes the movie; and `Plot`, the summary of the movie, which is in general a longer text. We can either use `Plot` alone or, for each movie, create a text document that combines the desired fields.

In the following, as a prototype, I am only using the `Plot`.

There are two ways to normalize text: stemming and lemmatization. The difference is explained [here](https://www.guru99.com/stemming-lemmatization-python-nltk.html). In the following I use lemmatization.

The procedure is as follows: lemmatize each movie's text content, compute the frequency of each lemma per movie, and then apply TF-IDF. To this end, I create a dataframe that stores the movie title, the release year and the lemmas of each movie as a list of words.
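The weighting step of this procedure can be sketched in plain Python. The sketch below assumes the TF-IDF definition from Chapter 1.3.1 of MMDS (term frequency normalized by the most frequent term in the document, inverse document frequency as log2(N / n_i)). The function names and the toy token lists are hypothetical and only serve as an illustration. Note that scikit-learn's `TfidfVectorizer` uses a somewhat different weighting (raw counts, a smoothed IDF and L2 normalization), so the numbers will not match it exactly.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """TF of each token: its count divided by the count of the most frequent token."""
    counts = Counter(tokens)
    max_count = max(counts.values())  # assumes the document is not empty
    return {token: count / max_count for token, count in counts.items()}

def inverse_document_frequencies(documents):
    """IDF of each token: log2(N / n_i), with n_i the number of documents containing it."""
    n_docs = len(documents)
    doc_counts = Counter(token for tokens in documents for token in set(tokens))
    return {token: math.log2(n_docs / n) for token, n in doc_counts.items()}

def tf_idf(documents):
    """Return one {token: weight} dictionary per document."""
    idf = inverse_document_frequencies(documents)
    return [
        {token: tf * idf[token] for token, tf in term_frequencies(doc).items()}
        for doc in documents
    ]

# Hypothetical, already lemmatized token lists standing in for three movie plots
toy_documents = [
    ["toy", "live", "toy", "boy"],
    ["brother", "live", "farm"],
    ["toy", "robber", "bank"],
]
weights = tf_idf(toy_documents)
print(weights[0])
```

In the first toy document, `boy` ends up with a higher weight than `toy` even though `toy` occurs twice, because `toy` also appears in another document while `boy` does not.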

In [31]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import RegexpTokenizer

In [32]:
tokenizer = RegexpTokenizer(r'\w+') # Remove punctuation
wordnet_lemmatizer = WordNetLemmatizer() # Create lemmatizer

# text = "studies studying cries cry"
# tokenization = nltk.word_tokenize(text)
# for w in tokenization:
#     print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))  

In [33]:
def create_lemmas_list(content_txt):
    '''
    Lowercase and tokenize the text, drop stop words and single characters,
    and return the remaining tokens lemmatized as verbs.
    '''
    lemmas = []
    tokenization = tokenizer.tokenize(content_txt.lower()) # Lowercase the whole text, to avoid dealing with case
    for w in tokenization:
        # Do not consider single characters. Can be resolved via tf-idf,
        # but maybe there are single characters due to wrong line breaks.
        if len(w) == 1 or w in stop_words:
            continue
        lemmas.append(wordnet_lemmatizer.lemmatize(w, wordnet.VERB))

    return lemmas

Test this:

In [34]:
txt = data.iloc[459].Plot
a = create_lemmas_list(txt)

In [35]:
data.columns

Index(['Unnamed: 0', 'adult', 'belongs_to_collection', 'budget', 'genres',
       'homepage', 'id', 'imdb_id', 'original_language', 'original_title',
       'overview', 'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'ReleaseAndTitle', 'Release Year',
       'Title', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page',
       'Plot', 'keywords', 'cast', 'crew', 'KeywCastDirGenre'],
      dtype='object')

In [36]:
# Create a df to hold the movies and the tokenized text.
# .copy() makes it an independent DataFrame, so that adding a column below
# does not trigger pandas' SettingWithCopyWarning.
movie_plots_tokens_df = data[['Title', 'Plot']].copy()

In [37]:
def tokenize(text):
    '''
    Lowercase and tokenize the text, keep alphabetic non-stop-word tokens
    and lemmatize them as verbs.
    '''
    tokenization = tokenizer.tokenize(text.lower())
    tokens = [ wordnet_lemmatizer.lemmatize(token, wordnet.VERB) for token in tokenization if token not in stop_words and token.isalpha() ]
    return tokens

In [38]:
movie_plots_tokens_df['Tokens'] = movie_plots_tokens_df['Plot'].apply(tokenize)


In [39]:
movie_plots_tokens_df.Tokens

0        [world, toy, live, things, pretend, lifeless, ...
1        [near, brantford, new, hampshire, two, brother...
2        [feud, max, walter, matthau, john, jack, lemmo...
3        [friends, people, let, never, let, forget, wai...
4        [film, begin, five, years, events, first, one,...
                               ...                        
14701    [leave, permanent, residence, germany, famous,...
14702    [masha, krapivina, kristina, asmus, come, mosc...
14703    [biology, teacher, devki, vivacious, popular, ...
14704    [struggle, sculptor, marcel, de, lange, martin...
14705    [film, begin, miller, poach, deer, land, belon...
Name: Tokens, Length: 14706, dtype: object

In [180]:
data.columns


Index(['Unnamed: 0', 'adult', 'belongs_to_collection', 'budget', 'genres',
       'homepage', 'id', 'imdb_id', 'original_language', 'original_title',
       'overview', 'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'ReleaseAndTitle', 'Release Year',
       'Title', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page',
       'Plot', 'keywords', 'cast', 'crew', 'KeywCastDirGenre', 'b', 'TTo',
       'TToP'],
      dtype='object')

In [151]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [195]:
data['TTo']=data['Title']+' '+data['Title']+' '+data['overview']+' '+data['Plot']
data['TDC']=data['Title']+' '+data['Director']
data['TDCP']=data['Title']+' '+data['Director']+' '+data['Plot']



In [196]:
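# NOTE: dropna() removes rows, so for columns with missing values the i-th row
# of the TF-IDF matrix no longer corresponds to data.iloc[i]; fillna('') would
# keep the rows aligned with the DataFrame.
# NOTE: 'TToP' is not created in the cell above; it comes from an earlier run.
tf = TfidfVectorizer().fit_transform(data.Plot)
tfov = TfidfVectorizer().fit_transform(data.overview.dropna())
tftit = TfidfVectorizer().fit_transform(data.Title)
tfTTo = TfidfVectorizer().fit_transform(data.TTo.dropna())
tfTToP = TfidfVectorizer().fit_transform(data.TToP.dropna())
tfTDC = TfidfVectorizer().fit_transform(data.TDC.dropna())
tfTDCP = TfidfVectorizer().fit_transform(data.TDCP.dropna())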
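The effect of `dropna()` on row positions can be seen on a tiny example. This is a minimal sketch with a hypothetical three-row stand-in for the `overview` column; it only illustrates why the rows of a matrix built from `dropna()` may stop lining up with the rows of `data`, and why `fillna('')` is a possible alternative.

```python
import pandas as pd

# Hypothetical miniature of the `overview` column with one missing value
overview = pd.Series(["a heist goes wrong", None, "toys come alive"])

dropped = overview.dropna()   # 2 rows: the third movie moves to position 1
filled = overview.fillna("")  # 3 rows: positions still match the original rows

print(len(dropped), len(filled))
print(dropped.iloc[1])        # text that sits at position 2 in `overview`
```

With `fillna('')`, a movie without text should simply receive an all-zero TF-IDF row and therefore a similarity of zero to every other movie, which is an assumption worth checking on the real data.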

In [197]:
tf[0,74013]

0.0

In [198]:
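# TfidfVectorizer L2-normalizes each row by default, so the linear kernel
# (plain dot product) equals the cosine similarity and is cheaper to compute.
cosine_sim = linear_kernel(tf, tf)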
cosine_simtit = linear_kernel(tftit, tftit)
cosine_simov = linear_kernel(tfov, tfov)
cosine_simtto = linear_kernel(tfTTo, tfTTo)
cosine_simttop = linear_kernel(tfTToP, tfTToP)
cosine_simTDC = linear_kernel(tfTDC, tfTDC)
cosine_simTDCP = linear_kernel(tfTDCP, tfTDCP)
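Although the variables above are named `cosine_sim...`, they are computed with `linear_kernel`. The following sketch, using three hypothetical plot strings, checks that both functions give the same result on `TfidfVectorizer` output with its default settings, where every row is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical plots, only for illustration
plots = [
    "a gang of thieves plans a diamond heist",
    "toys come alive when people leave the room",
    "thieves betray each other after the heist",
]
tfidf = TfidfVectorizer().fit_transform(plots)  # default norm='l2'

dot_products = linear_kernel(tfidf, tfidf)
cosines = cosine_similarity(tfidf, tfidf)
same = np.allclose(dot_products, cosines)
print(same)
```

If the vectorizer were created with `norm=None`, the two results would differ in general, and `cosine_similarity` would be the one to use.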


In [199]:
data2  = data.copy()

In [200]:
data.columns

Index(['Unnamed: 0', 'adult', 'belongs_to_collection', 'budget', 'genres',
       'homepage', 'id', 'imdb_id', 'original_language', 'original_title',
       'overview', 'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'ReleaseAndTitle', 'Release Year',
       'Title', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page',
       'Plot', 'keywords', 'cast', 'crew', 'KeywCastDirGenre', 'b', 'TTo',
       'TToP', 'TDC', 'TDCP'],
      dtype='object')

In [201]:


smd = data2.reset_index()
titles = smd['Title']
indices = pd.Series(smd.index, index=smd['Title'])



In [202]:


def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]



In [203]:
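def get_recommendations(title, df, sim_measure):
    '''
    Return the titles of the 30 movies most similar to `title`, according to
    the precomputed similarity matrix `sim_measure`. If several movies share
    the title, the most recent one is used.
    '''
    smd = df.reset_index(drop=True)  # positional index, to match the rows of sim_measure
    titles = smd['Title']
    indices = pd.Series(smd.index, index=smd['Title'])
    idx = indices[title]
    if isinstance(idx, pd.Series):
        # Duplicate titles: pick the most recent release
        latest = smd.loc[smd.Title == title, 'Release Year'].max()
        idx = smd.loc[(smd.Title == title) & (smd['Release Year'] == latest)].index[0]
    sim_scores = list(enumerate(sim_measure[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]  # skip the first entry, normally the movie itself
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]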
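The ranking step in `get_recommendations` can also be written with NumPy. The sketch below is an alternative, not the function used in this notebook: the helper name `top_k_similar` and the toy matrix are hypothetical. It removes the query row explicitly instead of assuming that it sorts first, which may matter when two movies have identical text and therefore identical scores.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=30):
    """Row positions of the k entries most similar to row idx, excluding idx itself."""
    scores = np.asarray(sim_matrix[idx])
    order = np.argsort(-scores, kind="stable")  # highest score first, ties keep row order
    return order[order != idx][:k]

# Hypothetical 4x4 similarity matrix
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
])
print(top_k_similar(toy_sim, idx=0, k=2))  # [2 1]
```

The returned positions can be passed to `titles.iloc[...]` in the same way as `movie_indices` above.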

In [204]:
print(data['overview'].loc[data.index==8208])

8208    When young Lotus Flower sees an unconscious ma...
Name: overview, dtype: object


In [228]:
title='Reservoir Dogs'

In [229]:
recom_plot = get_recommendations(title, data,cosine_sim)
recom_over = get_recommendations(title, data2,cosine_simov)
recom_title = get_recommendations(title, data,cosine_simtit)
recom_ttop = get_recommendations(title, data,cosine_simttop)
recom_tto = get_recommendations(title, data,cosine_simtto)
recom_TDC = get_recommendations(title, data,cosine_simTDC)
recom_TDCP = get_recommendations(title, data,cosine_simTDCP)


# Columns 0-6 of the result: overview, title, plot, TTo, TToP, TDC, TDCP
recoms = [recom_over, recom_title, recom_plot, recom_tto, recom_ttop, recom_TDC, recom_TDCP]
recommendations_df = pd.concat([r.reset_index(drop=True) for r in recoms], axis=1, ignore_index=True)
recommendations_df.head(50)

Unnamed: 0,0,1,2,3,4,5,6
0,The Broadway Melody,War Dogs,Exam,Four Sons,It's a Gift,Jackie Brown,Exam
1,The Matrix,Old Dogs,It's a Gift,Justice League,Exam,Pulp Fiction,It's a Gift
2,1920,Snow Dogs,A Beautiful Mind,What's Eating Gilbert Grape,A Beautiful Mind,Django Unchained,A Beautiful Mind
3,Logan's Run,The Dogs of War,Dreams That Money Can Buy,Westworld,Dreams That Money Can Buy,Inglourious Basterds,Dreams That Money Can Buy
4,Cattle Queen of Montana,Cats & Dogs,Snow White,Warlock,Snow White,War Dogs,Snow White
5,Trivisa,Hotel for Dogs,Dragonfly,A River Runs Through It,Dragonfly,My Best Friend's Birthday,Dragonfly
6,The Return of the Vampire,Straw Dogs,Ernest Goes to Jail,The Amazing Mr Blunden,Ernest Goes to Jail,The Dogs of War,Ernest Goes to Jail
7,Anchorman 2: The Legend Continues,Straw Dogs,Snow White and the Seven Dwarfs,Bright Star,Snow White and the Seven Dwarfs,Snow Dogs,Snow White and the Seven Dwarfs
8,Our Little Sister,Stray Dogs,DodgeBall: A True Underdog Story,Logan's Run,The Roaring Twenties,Old Dogs,The Roaring Twenties
9,He Who Gets Slapped,The Plague Dogs,The Roaring Twenties,Phas Gaye Re Obama,Super 8,Straw Dogs,Super 8


In [166]:
# def tf(document):
#     doc_tokens = create_lemmas_list(document)
#     freq_dist = nltk.FreqDist(doc_tokens)
    
#     # FreqDist return the dictionary sorted in descending order.
#     # I did not find this explicitly in the docs, so I find the max freq in the document
#     max_freq = sorted( list(freq_dist.values()), reverse=True )[0]
    
#     tfs = {token:freq/max_freq for (token, freq) in freq_dist.items() }
#     return tfs

In [97]:
# from tqdm.notebook import tqdm

# def df(data_frame):
#     df_dict = {}
#     for i in tqdm(range(len(data_frame))):
#         for token in set(data_frame.iloc[i].Tokens):
#             for j in range(len(data_frame)):
#                 document = set(data_frame.iloc[j].Tokens)
#                 if token in document:
#                     df_dict[token] = df_dict.get(token,[j]) + [j]
#     return df_dict

In [None]:
# df(movie_plots_tokens_df)