## Content-based recommendation system

Content-based recommender systems do not use data from other users to recommend a movie. Instead, they rely on descriptive attributes of each movie, such as keywords or the plot summary. One disadvantage is that the recommendations are fully determined by the input: the same input movie always yields the same recommendations. On the other hand, these systems can recommend movies that few people have seen or rated.

In a content-based recommendation system, only the user's own input plays a role in the recommendation, not the behaviour of other users. The method can also be combined with collaborative filtering methods.

In this notebook, the text content of the movies in the `merged` dataset is analyzed. The goal is to rank all movies in the dataset by their similarity to an input movie, using cosine similarity as the measure. The content comes from the movies' plots and possibly also their keywords. TF-IDF is used to down-weight the most common words. Finally, the input to the TF-IDF algorithm will be the lemmatized text of each movie's content.
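As a rough illustration of the overall idea, the following sketch ranks a few movies by TF-IDF cosine similarity. The titles and plots are made up for the example, and `rank_similar` is a hypothetical helper, not a function from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy plots; the notebook itself works on the `Plot` column.
plots = {
    "Space Rescue": "astronauts repair a damaged space station",
    "Orbit": "astronauts stranded on a space station await rescue",
    "Bake Off": "two bakers compete in a village baking contest",
}
titles = list(plots)

# Rows are movies, columns are TF-IDF weighted terms.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(list(plots.values()))


def rank_similar(title):
    """Return all other titles, most similar first."""
    idx = titles.index(title)
    scores = cosine_similarity(tfidf[idx], tfidf).ravel()
    order = scores.argsort()[::-1]
    return [titles[i] for i in order if i != idx]


ranking = rank_similar("Space Rescue")
print(ranking)
```

"Orbit" shares several terms with the input plot and is ranked first, while "Bake Off" shares none and ends up last.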

In [1]:
# Common libraries imports
import pandas as pd

In [2]:
# Less common libraries: installation and imports.
# !python3 -m pip install nltk  # On Linux, outside a virtual environment
# !pip install nltk
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

Since it is a good idea to remove stop words before the TF-IDF calculations, as also stated in [Chapter 1.3.1 of MMDS](http://mmds.org/), a list of English stop words is created:
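To make the effect of stop-word removal concrete, here is a minimal sketch in plain Python. The stop-word set `toy_stop_words` and the helper `remove_stop_words` are assumptions for illustration; the notebook relies on the much longer NLTK list instead.

```python
import re

# Hypothetical, tiny stop-word set for illustration only.
toy_stop_words = {"a", "an", "the", "of", "to", "is", "in", "and"}


def remove_stop_words(text, stop_words=toy_stop_words):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


filtered = remove_stop_words("The hero returns to the city of Gotham")
print(filtered)  # ['hero', 'returns', 'city', 'gotham']
```

Storing the stop words in a set rather than a list keeps each membership test cheap, which can matter when many long plots are filtered.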

In [3]:
# Requires the NLTK 'stopwords' corpus; run nltk.download('stopwords') once if it is missing
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Read the `merged` dataset, or rather its cleaned version, which has duplicates removed:

In [4]:
# a = data.sort_values('Release Year', ascending=False).drop_duplicates(subset=['Title', 'Release Year'], keep='last')
# a.loc[a.Title=='The Mask']['release_date']

In [5]:
data = pd.read_csv('../Data/data_cleaned.csv')

In [6]:
def print_info(index):
    '''
    Helper function used for an initial overview of the dataset
    '''
    print(f"Title:\n{data.iloc[index]['Title']}\n")
    print(f"Release Year:\n{data.iloc[index]['Release Year']}\n")
    print(f"Link:\n{data.iloc[index]['Wiki Page']}\n")
    print(f"Tagline:\n{data.iloc[index]['tagline']}\n")
    print(f"Overview:\n{data.iloc[index]['overview']}\n")
    print(f"Summary:\n{data.iloc[index]['Plot']}\n")

As an example, use `print_info` for a random movie:

In [7]:
print_info(12456)

Title:
Me and the Colonel

Release Year:
1958

Link:
https://en.wikipedia.org/wiki/Me_and_the_Colonel

Tagline:
nan

Overview:
Jacobowsky, a Jewish refugee, flees from the Nazis with an aristocratic, anti-semitic Polish officer trying to get papers to England. Jurgens learns to appreciate Jacobowsky, despite their competition for the same woman, and together they outwit their pursuers

Summary:
In Paris during the World War II invasion of France by Nazi Germany, Jewish refugee S. L. Jacobowsky (Danny Kaye) seeks to leave the country before it falls. Meanwhile, Polish diplomat Dr. Szicki (Ludwig Stössel) gives antisemitic, autocratic Polish Colonel Prokoszny (Curt Jürgens) secret information that must be delivered to London by a certain date.
The resourceful Jacobowsky, who has had to flee from the Nazis several times previously, manages to "buy" an automobile from the absent Baron Rothschild's chauffeur. Prokoszny peremptorily requisitions the car, but finds he must accept an unwelcom

Which text should we use as content? The candidates are `tagline`, which acts as an alternative title; `overview`, a short summary of the movie; and `Plot`, the full plot summary, which is generally the longest text. We can either use `Plot` alone or create, for each movie, a single text document that contains all the desired fields.

In the following, as a prototype, I am only using the `Plot`.
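If several text fields were to be combined into one document per movie, a sketch along the following lines could be used. The two-row frame `toy` is made up, and the `content` column name is an assumption; only the column names `tagline`, `overview` and `Plot` come from the dataset.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the notebook's column names.
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "tagline": ["A tale of two cities", None],
    "overview": [None, "A short overview."],
    "Plot": ["Long plot text for A.", "Long plot text for B."],
})

text_columns = ["tagline", "overview", "Plot"]

# Missing values become empty strings, and empty parts are skipped when
# joining, so the combined text never contains 'nan' or doubled spaces.
toy["content"] = (
    toy[text_columns]
    .fillna("")
    .apply(lambda row: " ".join(part for part in row if part), axis=1)
)
print(toy["content"].tolist())
```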

There are two ways to normalize text: stemming and lemmatization. The difference is explained [here](https://www.guru99.com/stemming-lemmatization-python-nltk.html). In the following I am using lemmatization.

The procedure is as follows: lemmatize each movie's text content, compute the frequency of each lemma per movie, and then apply TF-IDF. To this end, I create a dataframe that stores the movie title, the plot, and the lemmas as a list of words for each movie.
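The procedure can be sketched by hand on a toy corpus. The example below assumes the definitions from MMDS, namely term frequency normalized by the most frequent term in the document and IDF as log2(N / n_i). scikit-learn's `TfidfVectorizer` uses a smoothed variant with L2 normalization, so its numbers will differ. The token lists and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-lemmatized documents, one token list per movie.
docs = [
    ["hero", "save", "city", "hero"],
    ["villain", "attack", "city"],
    ["hero", "fight", "villain"],
]


def term_frequencies(tokens):
    """TF = count of a term / count of the most frequent term in the document."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


def inverse_document_frequencies(documents):
    """IDF = log2(N / n_i), where n_i is the number of documents containing term i."""
    n_docs = len(documents)
    # Each document contributes at most once per term, in a single pass.
    doc_counts = Counter(term for tokens in documents for term in set(tokens))
    return {term: math.log2(n_docs / n) for term, n in doc_counts.items()}


idf = inverse_document_frequencies(docs)
tfidf = [
    {term: tf * idf[term] for term, tf in term_frequencies(tokens).items()}
    for tokens in docs
]
print(tfidf[0])
```

In the first document, "save" receives a higher weight than "hero" even though "hero" occurs twice, because "save" appears in only one of the three documents.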

In [8]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import RegexpTokenizer

In [9]:
tokenizer = RegexpTokenizer(r'\w+') # Remove punctuation
wordnet_lemmatizer = WordNetLemmatizer() # Create lemmatizer

# text = "studies studying cries cry"
# tokenization = nltk.word_tokenize(text)
# for w in tokenization:
#     print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))  

In [10]:
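def create_lemmas_list(content_txt):
    '''
    Lowercase and tokenize the text, drop stop words and single characters,
    and lemmatize the remaining tokens as verbs.
    '''
    lemmas = []
    tokenization = tokenizer.tokenize(content_txt.lower()) # Lowercase the whole text, to avoid dealing with case
    for w in tokenization:
        # Do not consider single characters. TF-IDF could also handle them,
        # but some single characters may stem from wrong line breaks.
        if len(w) < 2 or w in stop_words:
            continue
        # Every token is treated as a verb, so nouns such as 'friends' stay unchanged
        lemmas.append(wordnet_lemmatizer.lemmatize(w, wordnet.VERB))

    return lemmas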

Test this:

In [11]:
txt = data.iloc[459].Plot
a = create_lemmas_list(txt)

In [12]:
data.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'adult', 'belongs_to_collection',
       'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language',
       'original_title', 'overview', 'popularity', 'poster_path',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'video', 'vote_average', 'vote_count', 'Release Year', 'Title',
       'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page', 'Plot'],
      dtype='object')

In [13]:
# Create a df to hold the movies and the tokenized text.
# .copy() makes it an independent frame, so adding a column below
# does not trigger pandas' SettingWithCopyWarning.
movie_plots_tokens_df = data[['Title', 'Plot']].copy()

In [14]:
def tokenize(text):
    '''
    Lowercase and tokenize the text, keep alphabetic tokens that are not
    stop words, and lemmatize them as verbs.
    '''
    tokenization = tokenizer.tokenize(text.lower())
    tokens = [
        wordnet_lemmatizer.lemmatize(token, wordnet.VERB)
        for token in tokenization
        if token not in stop_words and token.isalpha()
    ]
    return tokens

In [15]:
movie_plots_tokens_df['Tokens'] = movie_plots_tokens_df['Plot'].apply(tokenize)


In [16]:
movie_plots_tokens_df.Tokens

0        [four, friends, jess, scarlett, johansson, ali...
1        [bhairava, kala, bhairava, telugu, version, vi...
2        [present, day, paris, diana, receive, photogra...
3        [du, qiu, chinese, lawyer, defeat, many, legal...
4        [feral, puppy, name, toby, whisk, away, dog, p...
                               ...                        
16224    [rarebit, fiend, gorge, welsh, rarebit, restau...
16225    [scenes, introduce, use, line, poem, santa, cl...
16226    [film, open, two, bandits, break, railroad, te...
16227    [alice, follow, large, white, rabbit, rabbit, ...
16228    [earliest, know, adaptation, classic, fairytal...
Name: Tokens, Length: 16229, dtype: object
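If the lemmatized tokens are meant to be the input to TF-IDF, as described in the introduction, one possible way is to hand the token lists to `TfidfVectorizer` through a callable `analyzer`. This is a sketch with made-up token lists, and `identity` is a hypothetical helper; it is not what the cells in this notebook do.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical token lists standing in for movie_plots_tokens_df['Tokens'].
token_lists = [
    ["friend", "travel", "paris"],
    ["lawyer", "travel", "china"],
]


def identity(tokens):
    """Return the tokens unchanged; tokenization has already been done."""
    return tokens


# A callable analyzer receives each document as-is (here a list of tokens)
# and returns its features, so sklearn's own preprocessing is bypassed.
vectorizer = TfidfVectorizer(analyzer=identity)
matrix = vectorizer.fit_transform(token_lists)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(matrix.shape)
```

With this setup the vocabulary consists of lemmas only, so the matrix would typically have fewer columns than one built from the raw plots.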

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [18]:
tf = TfidfVectorizer().fit_transform(data.Plot)

In [19]:
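# Fill missing overviews with empty strings instead of dropping them,
# so that the rows stay aligned with `data` and with the title lookup.
tfov = TfidfVectorizer().fit_transform(data.overview.fillna(''))
tftit = TfidfVectorizer().fit_transform(data.Title)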

In [20]:
tf.shape

(16229, 96675)

In [21]:
print(tf)

  (0, 74013)	0.034568082220997325
  (0, 81454)	0.033435595582795144
  (0, 74364)	0.013308550019604754
  (0, 56348)	0.027812437746473205
  (0, 85863)	0.010609621262672965
  (0, 23956)	0.012365860880066592
  (0, 62884)	0.03167092009260365
  (0, 63510)	0.08195537174637643
  (0, 66547)	0.029613003944778026
  (0, 26362)	0.023030368265445144
  (0, 79961)	0.017174055255996797
  (0, 50625)	0.0178905632788683
  (0, 4689)	0.01605887389499494
  (0, 27796)	0.01699276967933318
  (0, 59685)	0.010781347398769838
  (0, 25522)	0.016948493465067747
  (0, 19796)	0.020830729689209574
  (0, 66410)	0.020413866647601107
  (0, 70779)	0.04823092433632978
  (0, 57092)	0.011195283635910717
  (0, 52141)	0.011381525510227022
  (0, 38311)	0.022827290520951254
  (0, 34305)	0.013993858233721575
  (0, 71337)	0.02292776048399439
  (0, 93267)	0.012912602503844883
  :	:
  (16228, 90218)	0.02484857428075115
  (16228, 53856)	0.056155437364953466
  (16228, 25207)	0.06988218907834902
  (16228, 13689)	0.032924271705393984

In [22]:
tf[0,74013]

0.034568082220997325

In [23]:
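# TfidfVectorizer L2-normalizes each row by default, so the linear kernel
# (plain dot product) equals the cosine similarity.
cosine_sim = linear_kernel(tf, tf)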

In [24]:
cosine_simtit = linear_kernel(tftit, tftit)
cosine_simov = linear_kernel(tfov, tfov)
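A full similarity matrix for 16229 movies holds 16229 x 16229 values, roughly 2 GB in float64, and three such matrices are built here. A lighter alternative, sketched below on a made-up corpus, computes the similarities for a single query row only. The sketch also checks that the linear kernel matches the cosine similarity on TF-IDF rows. `top_k_similar` is a hypothetical helper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus.
corpus = [
    "a detective hunts a masked criminal",
    "a masked hero protects the city",
    "a family opens a bakery",
]
matrix = TfidfVectorizer().fit_transform(corpus)

# Rows are L2-normalized by default (norm='l2'), so both measures agree.
assert np.allclose(linear_kernel(matrix, matrix), cosine_similarity(matrix, matrix))


def top_k_similar(row, k):
    """Indices of the k rows most similar to `row`, excluding `row` itself."""
    scores = linear_kernel(matrix[row], matrix).ravel()  # one row, not N x N
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order if i != row][:k]


best = top_k_similar(0, 1)
print(best)
```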

In [25]:
data2  = data['overview'].dropna()

In [32]:
def get_recommendations(title, df, sim_measure):
    '''
    Return the titles of the 30 movies most similar to `title`.
    The rows of df must be in the same order as the rows of sim_measure.
    '''
    titles = df['Title'].reset_index(drop=True)
    indices = pd.Series(titles.index, index=titles)
    idx = indices[title]
    # Titles are not unique (several movies are called 'Batman'), so the
    # lookup may return several positions. Passing all of them on makes
    # sim_measure[idx] two-dimensional, and the sort below then fails with
    # "ValueError: The truth value of an array ... is ambiguous".
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]  # Use the first movie with this title
    sim_scores = list(enumerate(sim_measure[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]  # Skip the first entry, the input movie itself
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [33]:
title='Batman'

In [34]:
recom_plot = get_recommendations(title, data,cosine_sim)
recom_over = get_recommendations(title, data,cosine_simov)
recom_title = get_recommendations(title, data,cosine_simtit)

In [35]:
type(recom_plot)

In [None]:
# Use keys to label each column by the text field its ranking is based on
recommendations_df = pd.concat([ recom_over.reset_index(drop=True), recom_title.reset_index(drop=True), recom_plot.reset_index(drop=True) ], axis=1, keys=['Overview', 'Title', 'Plot'])

In [None]:
recommendations_df.head(50)

In [None]:
# def tf(document):
#     doc_tokens = create_lemmas_list(document)
#     freq_dist = nltk.FreqDist(doc_tokens)
    
#     # FreqDist return the dictionary sorted in descending order.
#     # I did not find this explicitly in the docs, so I find the max freq in the document
#     max_freq = sorted( list(freq_dist.values()), reverse=True )[0]
    
#     tfs = {token:freq/max_freq for (token, freq) in freq_dist.items() }
#     return tfs

In [None]:
# from tqdm.notebook import tqdm

# def df(data_frame):
#     df_dict = {}
#     for i in tqdm(range(len(data_frame))):
#         for token in set(data_frame.iloc[i].Tokens):
#             for j in range(len(data_frame)):
#                 document = set(data_frame.iloc[j].Tokens)
#                 if token in document:
#                     df_dict[token] = df_dict.get(token,[j]) + [j]
#     return df_dict

In [None]:
# df(movie_plots_tokens_df)