## Content-based recommendation system

In this notebook, the text content of the movies in the `merged` dataset is analyzed. The goal is to rank all movies in the dataset by their similarity to an input movie, using cosine similarity as the measure. The content comes from the movie plots and possibly also the keywords. TF-IDF is used to down-weight the most common words, and its input is the lemmatized text of each movie's content.
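To make the ranking step concrete, here is a minimal sketch of cosine similarity over sparse term-weight vectors. The titles and weights are made up for illustration, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical rather than part of this notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # An empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_title, vectors):
    """Return (title, score) pairs for all other movies, most similar first."""
    query = vectors[query_title]
    scores = [
        (title, cosine_similarity(query, vec))
        for title, vec in vectors.items()
        if title != query_title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Illustrative weights only; real vectors would come from TF-IDF
toy_vectors = {
    "Movie A": {"war": 0.8, "refugee": 0.6},
    "Movie B": {"war": 0.6, "soldier": 0.8},
    "Movie C": {"wedding": 1.0},
}
ranking = rank_by_similarity("Movie A", toy_vectors)
print(ranking)
```

Only terms shared by both movies contribute to the dot product, so movies with no vocabulary in common score zero.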

In [1]:
# Common libraries imports
import pandas as pd

In [18]:
# Less common libraries: installation and imports.
!python3 -m pip install nltk  # On Linux, when not using a virtual environment
# !pip install nltk
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

Defaulting to user installation because normal site-packages is not writeable


[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/user/nltk_data...


True

Read the `merged` dataset or, as here, its cleaned version with duplicates removed.

In [3]:
# a = data.sort_values('Release Year', ascending=False).drop_duplicates(subset=['Title', 'Release Year'], keep='last')
# a.loc[a.Title=='The Mask']['release_date']

In [4]:
data = pd.read_csv('../Data/data_cleaned.csv')

In [5]:
def print_info(index):
    '''
    Helper function used for an initial overview of the dataset.
    Prints the main text fields of the movie at positional row `index`.
    '''
    row = data.iloc[index]  # Look the row up once instead of once per field
    print(f"Title:\n{row['Title']}\n")
    print(f"Release Year:\n{row['Release Year']}\n")
    print(f"Link:\n{row['Wiki Page']}\n")
    print(f"Tagline:\n{row['tagline']}\n")
    print(f"Overview:\n{row['overview']}\n")
    print(f"Summary:\n{row['Plot']}\n")

As an example, use `print_info` for a random movie:

In [8]:
print_info(12456)

Title:
Me and the Colonel

Release Year:
1958

Link:
https://en.wikipedia.org/wiki/Me_and_the_Colonel

Tagline:
nan

Overview:
Jacobowsky, a Jewish refugee, flees from the Nazis with an aristocratic, anti-semitic Polish officer trying to get papers to England. Jurgens learns to appreciate Jacobowsky, despite their competition for the same woman, and together they outwit their pursuers

Summary:
In Paris during the World War II invasion of France by Nazi Germany, Jewish refugee S. L. Jacobowsky (Danny Kaye) seeks to leave the country before it falls. Meanwhile, Polish diplomat Dr. Szicki (Ludwig Stössel) gives antisemitic, autocratic Polish Colonel Prokoszny (Curt Jürgens) secret information that must be delivered to London by a certain date.
The resourceful Jacobowsky, who has had to flee from the Nazis several times previously, manages to "buy" an automobile from the absent Baron Rothschild's chauffeur. Prokoszny peremptorily requisitions the car, but finds he must accept an unwelcom

Which text should we use as content? The candidates are `tagline`, which can serve as an alternative title; `overview`, a short summary of the movie; and `Plot`, a summary that is in general longer. We can either use `Plot` alone or, for each movie, build a single text document that combines the desired fields.

In the following, as a prototype, I am only using the `Plot`.
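If the other fields were included later, one option would be to join the non-missing ones into a single string per movie. The sketch below assumes a hypothetical helper `build_content`. A plain dict with placeholder text stands in for a dataframe row, and missing values are assumed to be `None` or float NaN, as in the `nan` tagline printed above.

```python
import math


def build_content(row, fields=("tagline", "overview", "Plot")):
    """Join the non-missing text fields of one movie into a single string."""
    parts = []
    for field in fields:
        value = row.get(field)
        # Skip missing values: None, or the float NaN used for empty cells
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        parts.append(str(value).strip())
    return " ".join(parts)


# Placeholder text, not taken from the dataset
example_row = {
    "tagline": float("nan"),
    "overview": "A refugee flees.",
    "Plot": "In Paris, he buys a car.",
}
content = build_content(example_row)
print(content)
```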

There are two common ways to normalize text: stemming and lemmatization. The difference is explained [here](https://www.guru99.com/stemming-lemmatization-python-nltk.html). In the following, I use lemmatization.
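The contrast can be illustrated with a deliberately crude toy, which is not the NLTK implementation. `toy_stem` chops suffixes by rule, while `toy_lemmatize` looks words up in a tiny hand-written dictionary. Both functions and the dictionary are hypothetical stand-ins for illustration.

```python
def toy_stem(word):
    """Crude stemmer: chop the first matching suffix, with no dictionary."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


# Hand-written lookup standing in for a real lexical database
TOY_LEMMAS = {"studies": "study", "studying": "study", "cries": "cry"}


def toy_lemmatize(word):
    """Dictionary lookup: map a word form to its base form if it is known."""
    return TOY_LEMMAS.get(word, word)


words = ["studies", "studying", "cries", "cry"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The rule-based version can produce fragments that are not words and can split related forms apart, whereas the lookup maps related forms to the same dictionary entry. Real stemmers use far more careful rules than this toy.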

The procedure is as follows: lemmatize each movie's text content, compute the frequency of each movie's lemmas, and then apply TF-IDF. To this end, I create a dataframe that stores the movie title, the release year, and the lemmas of each movie as a list of words.
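The weighting step of the procedure above can be sketched in plain Python. This uses one common textbook variant, term frequency times `log(N / document frequency)`. Library implementations typically apply smoothing and normalization, so their numbers will differ. The function name `tf_idf` and the toy lemma lists are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: {title: list of lemmas}. Returns {title: {lemma: weight}}."""
    n_docs = len(documents)
    # Document frequency: in how many documents each lemma appears
    doc_freq = Counter()
    for lemmas in documents.values():
        doc_freq.update(set(lemmas))

    weights = {}
    for title, lemmas in documents.items():
        counts = Counter(lemmas)
        total = len(lemmas)
        weights[title] = {
            lemma: (count / total) * math.log(n_docs / doc_freq[lemma])
            for lemma, count in counts.items()
        }
    return weights


# Made-up lemma lists, kept as lists so that repeated lemmas are counted
toy_docs = {
    "Movie A": ["the", "soldier", "flee", "the", "war"],
    "Movie B": ["the", "wedding", "the", "bride"],
}
weights = tf_idf(toy_docs)
print(weights["Movie A"])
```

A lemma that occurs in every document, such as `the` here, gets weight zero, which is how TF-IDF suppresses the most common words without an explicit stop-word list.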

In [26]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

In [40]:
tokenizer = RegexpTokenizer(r'\w+') # Remove punctuation
wordnet_lemmatizer = WordNetLemmatizer() # Create lemmatizer

# text = "studies studying cries cry"
# tokenization = nltk.word_tokenize(text)
# for w in tokenization:
#     print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))  
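For intuition about the `\w+` pattern, the sketch below applies the same regular expression with the standard library `re` module instead of NLTK. The example sentence is made up. Punctuation is dropped, but an apostrophe also splits a word, which is one way single-character tokens can arise.

```python
import re

text = "Jacobowsky's car was bought in Paris."

# Every maximal run of word characters becomes one token
tokens = re.findall(r"\w+", text.lower())
print(tokens)

# Dropping tokens shorter than two characters removes the stray "s"
kept = [t for t in tokens if len(t) >= 2]
print(kept)
```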

In [61]:
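def create_lemmas_list(content_txt):
    '''
    Tokenize and lemmatize a text; return the set of unique lemmas.
    '''
    lemmas = []
    # Lowercase the whole text, to avoid dealing with case
    tokenization = tokenizer.tokenize(content_txt.lower())
    for w in tokenization:
        # Do not consider single characters. Can be resolved via tf-idf,
        # but maybe there are single characters due to wrong line breaks.
        # The check runs before lemmatization, so a longer token can still
        # be reduced to a one-letter lemma (the test output below contains 'a').
        if len(w) < 2:
            continue
        lemmas.append(wordnet_lemmatizer.lemmatize(w))

    # Since the text is lowercase, we can also return only the unique tokens.
    # Note that a set discards how often each lemma occurs in the text.
    return set(lemmas)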

Test this:

In [62]:
txt = data.iloc[459].Plot
a = create_lemmas_list(txt)
for el in a:
    print(el)

gain
grudgingly
revenge
becomes
occultist
favor
finger
beginning
fall
road
grueling
guardian
soldier
rent
agrees
kitchen
dark
go
punishing
special
about
cut
her
on
they
spell
fail
instruction
beatific
drive
serious
his
life
seven
down
hearing
have
quiet
kidnapped
some
explains
grant
both
joseph
ha
allowed
able
summon
brilliant
ceremony
discloses
out
another
disturbed
culminated
month
honest
seeing
awakens
he
angel
head
cpr
perimeter
order
including
crossed
asks
everything
solomon
peril
grave
insist
dishonesty
is
convince
complains
armor
a
because
unreadable
knife
rural
side
something
isolated
people
tormenting
impaling
sorry
their
begin
presence
infected
year
but
this
smile
after
the
wound
get
accuses
smash
one
complies
old
howl
sophia
tell
over
by
burial
increasingly
lengthy
afterward
fine
too
kind
retreat
work
car
ritual
medical
unpure
meager
from
bitter
rite
empty
forgive
do
then
sign
won
eventually
grieving
spend
start
basement
light
whoever
fill
awaiting
house
lead
tempered
must
d