## Content based recommendation system

Content based recommender systems do not make use of data from other users to recommend a movie. Instead, they utilize a descriptive set of attributes such as keywords or the summary of a movie. The disadvantage is that these systems will recommend the same movie to the user, based on the input. However, they could be useful in recommending a movie that not many people have seen or rated.  

In the content based recommendation system it is only the user that plays a role in the recommendation. This method can also be combined with collaborative filtering methods.

In this notebook, the text content of the movies from the `merged` dataset is going to be alalyzed. The goal is to rank all the movies in the dataset based on a similarity measure with the input movie. For similarity measures, the cosine similarity will be used. Moreover, the content comes from the movies plots and possibly also the keywords. In order to remove the most common words, TF-IDF is used. Finally, the input to the TF-IDF algorithm will be the lemmatized text from each movie's content.

In [1]:
# Common libraries imports
import pandas as pd

In [2]:
# Not as common libraries imports and installation. 
# !python3 -m pip install nltk ## For linux and not environment
# !pip install nltk
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('corpus')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Error loading corpus: Package 'corpus' not found in index
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iok\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Since it is a good idea to remove stop words from tf idf calculations, as stated also in [Chapter 1.3.1 MMDS](http://mmds.org/), a list of English stop words is created:

In [3]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Read the `merged` dataset, or its 'cleaned' version, that has duplicates removed

In [6]:
data = pd.read_csv('../Data/merged.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,keywords,cast,crew,KeywCastDirGenre
0,0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id':16,'name':'animation'},{'id':35,'name':...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,...,American,johnlasseter,"Tim Allen, Tom Hanks (voices)",animated film,https://en.wikipedia.org/wiki/Toy_Story,In a world where toys are living things who pr...,"[{'id':931,'name':'jealousy'},{'id':4290,'name...","[{'cast_id':14,'character':'woody(voice)','cre...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",jealousy toy boy friendship friends rivalry bo...
1,1,False,,65000000,"[{'id':12,'name':'adventure'},{'id':14,'name':...",,8844,tt0113497,en,Jumanji,...,American,joejohnston,"Robin Williams, Bonnie Hunt, Kirsten Dunst, Br...","family, fantasy",https://en.wikipedia.org/wiki/Jumanji_(film),"In 1869, near Brantford, New Hampshire, two br...","[{'id':10090,'name':'boardgame'},{'id':10941,'...","[{'cast_id':1,'character':'alanparrish','credi...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",boardgame disappearance basedonchildren'sbook ...
2,2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id':10749,'name':'romance'},{'id':35,'name'...",,15602,tt0113228,en,Grumpier Old Men,...,American,howarddeutch,"Jack Lemmon, Walter Matthau, Ann-Margret, Soph...",comedy,https://en.wikipedia.org/wiki/Grumpier_Old_Men,The feud between Max (Walter Matthau) and John...,"[{'id':1495,'name':'fishing'},{'id':12392,'nam...","[{'cast_id':2,'character':'maxgoldman','credit...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",fishing bestfriend duringcreditsstinger oldmen...
3,3,False,,16000000,"[{'id':35,'name':'comedy'},{'id':18,'name':'dr...",,31357,tt0114885,en,Waiting to Exhale,...,American,forestwhitaker,"Whitney Houston, Angela Bassett, Loretta Devin...",drama,https://en.wikipedia.org/wiki/Waiting_to_Exhale,"""Friends are the People who let you be yoursel...","[{'id':818,'name':'basedonnovel'},{'id':10131,...","[{'cast_id':1,'character':""savannah'vannah'jac...","[{'credit_id': '52fe44779251416c91011acb', 'de...",basedonnovel interracialrelationship singlemot...
4,4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id':35,'name':'comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,...,American,charlesshyer,"Steve Martin, Diane Keaton, Martin Short, Kimb...",comedy,https://en.wikipedia.org/wiki/Father_of_the_Br...,The film begins five years after the events of...,"[{'id':1009,'name':'baby'},{'id':1599,'name':'...","[{'cast_id':1,'character':'georgebanks','credi...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",baby midlifecrisis confidence aging daughter m...


In [7]:
def print_info(index):
    '''
    Helper function used for an initial overview of the dataset
    '''
    print(f"Title:\n{data.iloc[index]['Title']}\n")
    print(f"Release Year:\n{data.iloc[index]['Release Year']}\n")
    print(f"Link:\n{data.iloc[index]['Wiki Page']}\n")
    print(f"Tagline:\n{data.iloc[index]['tagline']}\n")
    print(f"Overview:\n{data.iloc[index]['overview']}\n")
    print(f"Summary:\n{data.iloc[index]['Plot']}\n")

As an example, use `print_info` for a random movie:

In [30]:
print_info(12456)

Title:
Suicide Squad

Release Year:
2016

Link:
https://en.wikipedia.org/wiki/Suicide_Squad_(film)

Tagline:
Worst Heroes Ever

Overview:
From DC Comics comes the Suicide Squad, an antihero team of incarcerated supervillains who act as deniable assets for the United States government, undertaking high-risk black ops missions in exchange for commuted prison sentences.

Summary:
In the aftermath of Superman's death, intelligence officer Amanda Waller reaches Washington D.C for assembling Task Force X, and shows them to everyone in the White House a team of dangerous criminals imprisoned at Belle Reve Prison consisting of elite hitman Deadshot, former psychiatrist Harley Quinn, pyrokinetic ex-gangster El Diablo, opportunistic thief Captain Boomerang, genetic mutation Killer Croc, and specialized assassin Slipknot. They are placed under command of Colonel Rick Flag to be used as disposable assets in high-risk missions for the United States government. Each member has a nano bomb implanted 

Which text should we use as content? We can use `tagline` as an alternative title, `overview` which is a sentence that summarizes the movie and `Plot`, the summary of the movie. The latter is in general a longer text. We can use either the latter or for each movie create a txt document that contains the desired text.  

In the following, as a prototype, I am only using the `Plot`.

There are 2 ways to normalize text:Stemming and Lemmatization. The difference can be found [here](https://www.guru99.com/stemming-lemmatization-python-nltk.html). In the following I am using Lemmatization.  

The procedure is as follows. Lemmatize each movie's text content, get the frequency for each movie's lemmas and then use TF-IDF. To this end, I create a dataframe to store, the movie title, the relase year and the lemmas as list of words for each movie.

In [8]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import RegexpTokenizer

In [9]:
tokenizer = RegexpTokenizer(r'\w+') # Remove punctuation
wordnet_lemmatizer = WordNetLemmatizer() # Create lemmatizer


In [10]:
def create_lemmas_list(content_txt):
    lemmas = []    
    tokenization = tokenizer.tokenize(content_txt.lower()) # Lowercase the whole text, to avoid dealing with case
    for w in tokenization:
        # Do not consider single characters. Can be resolved via tf-idf,
        # but maybe there are single characters due to wrong line breaks.
        if w in stop_words:
            continue
        lemmas.append(wordnet_lemmatizer.lemmatize(w, wordnet.VERB))
    
    return lemmas

Test this:

In [11]:
# Create a df to hold the movies and the tokenized text
movie_plots_tokens_df = data[['Title', "Plot"]]

In [12]:
def tokenize(text):
    tokenization = tokenizer.tokenize(text.lower())
    tokens = [ wordnet_lemmatizer.lemmatize(token, wordnet.VERB) for token in tokenization if token not in stop_words and token.isalpha() ]
    return tokens

In [13]:
movie_plots_tokens_df['Tokens'] = movie_plots_tokens_df['Plot'].apply(tokenize)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movie_plots_tokens_df['Tokens'] = movie_plots_tokens_df['Plot'].apply(tokenize)


In [370]:
movie_plots_tokens_df.Tokens

0        [world, toy, live, things, pretend, lifeless, ...
1        [near, brantford, new, hampshire, two, brother...
2        [feud, max, walter, matthau, john, jack, lemmo...
3        [friends, people, let, never, let, forget, wai...
4        [film, begin, five, years, events, first, one,...
                               ...                        
14701    [leave, permanent, residence, germany, famous,...
14702    [masha, krapivina, kristina, asmus, come, mosc...
14703    [biology, teacher, devki, vivacious, popular, ...
14704    [struggle, sculptor, marcel, de, lange, martin...
14705    [film, begin, miller, poach, deer, land, belon...
Name: Tokens, Length: 14706, dtype: object

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [15]:
data['TTo']=data['Title']+' '+data['Title']+' '+data['overview']+' '+data['Plot']
data['TDC']=data['Title']+' '+data['Director']
data['TDCP']=data['Title']+' '+data['Director']+' '+data['Plot']

In [None]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [17]:
tf = TfidfVectorizer().fit_transform(data.Plot)
tfov = TfidfVectorizer().fit_transform(data.overview.dropna())
tftit = TfidfVectorizer().fit_transform(data.Title)
tfTTo = TfidfVectorizer().fit_transform(data.TTo.dropna())
tfTDC = TfidfVectorizer().fit_transform(data.TDC.dropna())
tfTDCP = TfidfVectorizer().fit_transform(data.TDCP.dropna())

tfkey = TfidfVectorizer().fit_transform(data.KeywCastDirGenre)

In [18]:
tfkey = TfidfVectorizer().fit_transform(data.KeywCastDirGenre)

cosine_simkey = linear_kernel(tfkey, tfkey)


In [19]:
cosine_sim = linear_kernel(tf, tf)
cosine_simtit = linear_kernel(tftit, tftit)
cosine_simov = linear_kernel(tfov, tfov)
cosine_simtto = linear_kernel(tfTTo, tfTTo)
cosine_simTDC = linear_kernel(tfTDC, tfTDC)
cosine_simTDCP = linear_kernel(tfTDCP, tfTDCP)




In [20]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tf, tf)
cosine_simtit = cosine_similarity(tftit, tftit)
cosine_simov = cosine_similarity(tfov, tfov)
cosine_simtto = cosine_similarity(tfTTo, tfTTo)
cosine_simTDC = cosine_similarity(tfTDC, tfTDC)
cosine_simTDCP = cosine_similarity(tfTDCP, tfTDCP)

In [None]:
cosine = np.sum(A*B, axis=1)/(norm(A, axis=1)*norm(B, axis=1))

In [21]:
data2  = data.copy()

In [372]:
data.columns

Index(['Unnamed: 0', 'adult', 'belongs_to_collection', 'budget', 'genres',
       'homepage', 'id', 'imdb_id', 'original_language', 'original_title',
       'overview', 'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'ReleaseAndTitle', 'Release Year',
       'Title', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page',
       'Plot', 'keywords', 'cast', 'crew', 'KeywCastDirGenre'],
      dtype='object')

In [22]:


smd = data2.reset_index()
titles = smd['Title']
indices = pd.Series(smd.index, index=smd['Title'])



In [23]:
def get_recommendations(title, df, sim_measure ):
    df = df.copy()
    smd = df.reset_index()
    titles = df['Title']
    indices = pd.Series(smd.index, index=smd['Title'])
    idx = indices[title]
    if isinstance(idx, pd.core.series.Series):
        a = max(df.loc[df.Title==title]['Release Year'].to_list())
        idx = df.loc[ (df.Title==title) & (df['Release Year']==a) ].index[0]
    sim_scores = list(enumerate(sim_measure[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    sim_scores1 = [i[1] for i in sim_scores]
    print(sim_scores1)
    return titles.iloc[movie_indices]

In [422]:
print(data2['overview'].loc[data.index==8208])

8208    When young Lotus Flower sees an unconscious ma...
Name: overview, dtype: object


In [24]:
title='The Shawshank Redemption'


In [28]:
recom_plot = get_recommendations(title, data,cosine_sim)
recom_over = get_recommendations(title, data2,cosine_simov)
recom_title = get_recommendations(title, data,cosine_simtit)
recom_tto = get_recommendations(title, data,cosine_simtto)
recom_TDC = get_recommendations(title, data,cosine_simTDC)
recom_TDCP = get_recommendations(title, data,cosine_simTDCP)
recom_key = get_recommendations(title, data,cosine_simkey)

recommendations_df = pd.concat([ recom_over.reset_index(drop=True), recom_title.reset_index(drop=True), recom_plot.reset_index(drop=True), recom_tto.reset_index(drop=True), recom_TDC.reset_index(drop=True), recom_TDCP.reset_index(drop=True), recom_key.reset_index(drop=True)], axis=1, ignore_index=True)
recommendations_df.head(50)

[0.5837637603141235, 0.5026779795755183, 0.4888916143448738, 0.47861650044363385, 0.4603017851538094, 0.45569408700912123, 0.4546801871365421, 0.445869670396173, 0.4356280165040722, 0.43392407787279125, 0.42936824907059823, 0.39651102962667284, 0.388496406790807, 0.37991845810853514, 0.3736742060493773, 0.3702018238367914, 0.35697882397713326, 0.35648392854921396, 0.3561766499197212, 0.35516659102263093, 0.34797267049311753, 0.3385098040990594, 0.33175022460549924, 0.3300663910430773, 0.32919236430247467, 0.32781041166457764, 0.3058039507922687, 0.29796733560975736, 0.2952589639603538, 0.29133224136873304]
[0.16911278222958165, 0.16517055671396413, 0.15371152654619724, 0.14718653853444771, 0.14553002879122287, 0.14028797427186343, 0.13051638369942797, 0.1229488770120693, 0.11947447887688742, 0.11829489318829504, 0.11639940227475869, 0.11542788671963809, 0.11489180840044513, 0.11392591676407274, 0.11125157411685556, 0.11078334644832033, 0.11039880525856458, 0.10969319547560938, 0.109421

Unnamed: 0,0,1,2,3,4,5,6
0,Shinjuku Incident,The V.I.P.s,Before the Devil Knows You're Dead,Confessions of a Nazi Spy,The Mist,Before the Devil Knows You're Dead,The Green Mile
1,Crawlspace,The D.I.,Real Time,Lawrence of Arabia,The Green Mile,Real Time,The Mist
2,Revenge of the Nerds,The Lord of the Rings: The Return of the King,Love Finds Andy Hardy,Dancing Lady,The Lord of the Rings: The Return of the King,Love Finds Andy Hardy,Berth Marks
3,Edward II,"The World, the Flesh and the Devil",The Devil Wears Prada,The Maze,The Lord of the Rings: The Fellowship of the Ring,The Devil Wears Prada,Memphis Belle
4,The Pace That Kills,The Island at the Top of the World,The One and Only,Blue Juice,The Island at the Top of the World,The One and Only,Innocent Blood
5,Aces High,The Lord of the Rings: The Fellowship of the Ring,Malice,Penelope,"The World, the Flesh and the Devil",Andy Hardy Meets Debutante,Stir Crazy
6,The Amityville Horror,The Man in the Moon,An American Werewolf in Paris,Terminal Island,"The Chronicles of Narnia: The Lion, the Witch ...",Malice,The Big House
7,Gamera vs. Barugon,The City of the Dead,Andy Hardy Meets Debutante,West of the Divide,The Light at the Edge of the World,Life Begins for Andy Hardy,Buffalo Soldiers
8,House of Dracula,The Light at the Edge of the World,Life Begins for Andy Hardy,Floating Clouds,The Dark at the Top of the Stairs,An American Werewolf in Paris,We're No Angels
9,The Wild One,The Man,The Guilt Trip,Deal,The Man in the Moon,The Guilt Trip,The Phantom


In [None]:
recom_plot = get_recommendations(title, data,cosine_sim)
recom_over = get_recommendations(title, data2,cosine_simov)
recom_title = get_recommendations(title, data,cosine_simtit)
recom_ttop = get_recommendations(title, data,cosine_simttop)
recom_tto = get_recommendations(title, data,cosine_simtto)
recom_TDC = get_recommendations(title, data,cosine_simTDC)
recom_TDCP = get_recommendations(title, data,cosine_simTDCP)


recommendations_df = pd.concat([ recom_over.reset_index(drop=True), recom_title.reset_index(drop=True), recom_plot.reset_index(drop=True), recom_tto.reset_index(drop=True), recom_ttop.reset_index(drop=True) , recom_TDC.reset_index(drop=True), recom_TDCP.reset_index(drop=True)], axis=1, ignore_index=True)
recommendations_df.head(50)

In [166]:
# def tf(document):
#     doc_tokens = create_lemmas_list(document)
#     freq_dist = nltk.FreqDist(doc_tokens)
    
#     # FreqDist return the dictionary sorted in descending order.
#     # I did not find this explicitly in the docs, so I find the max freq in the document
#     max_freq = sorted( list(freq_dist.values()), reverse=True )[0]
    
#     tfs = {token:freq/max_freq for (token, freq) in freq_dist.items() }
#     return tfs

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

def movie_recom(title,data2):
    title_ = data2['Title']
    value = data2.apply(lambda x: ' '.join(x), axis=1)
    data3=pd.DataFrame({'Title':title_, 'value':value})
    tf = TfidfVectorizer().fit_transform(data3.value.dropna())
    #cosine_sim = linear_kernel(tf, tf)
    cosine_sim = cosine_similarity(tf, tf)
    recom_TDCP = get_recommendations(title, data3, cosine_sim)
    return recom_TDCP


recommendations_df=movie_recom(title,data2[['Director','Title']])
recommendations_df=movie_recom(title,data2[['Plot','overview','Title']])
print(recommendations_df)

In [97]:
# from tqdm.notebook import tqdm

# def df(data_frame):
#     df_dict = {}
#     for i in tqdm(range(len(data_frame))):
#         for token in set(data_frame.iloc[i].Tokens):
#             for j in range(len(data_frame)):
#                 document = set(data_frame.iloc[j].Tokens)
#                 if token in document:
#                     df_dict[token] = df_dict.get(token,[j]) + [j]
#     return df_dict

In [None]:
# df(movie_plots_tokens_df)