***This notebook is based on the "Feature Engineering for NLP in Python for ML" course at DataCamp***

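The purpose of this code is to recommend films based only on an analysis of the plots available in the "wikipedia-movie-plots" dataset. Given a film title, it returns the films whose plots are most similar (the final function returns the five closest).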

Steps for implementation:
1. Import the dataset;

2. Check that plot information is available for all movies;

3. Apply nlp() to the Plot column to split each plot into tokens;

4. Pre-process each token (lemmatize, then drop stop words and non-alphabetic tokens);

5. Calculate the importance of each word relative to the whole corpus with TF-IDF;

6. Calculate the cosine similarity between every pair of plots;

7. Create a function that lists the films whose cosine similarity to the chosen film is highest.


Because of processing constraints, the chosen dataset had to be cut in half.

**Import libraries**

In [1]:
import numpy as np
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.metrics.pairwise import cosine_similarity 

**Import the dataset**

In [2]:
md = pd.read_csv('../input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')

**Check what the dataset contains**

In [3]:
md.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [4]:
md.describe()

Unnamed: 0,Release Year
count,34886.0
mean,1981.314252
std,27.815174
min,1901.0
25%,1957.0
50%,1988.0
75%,2007.0
max,2017.0


Select the 'Plot' column

In [5]:
md_plot = md['Plot']

In [6]:
md_plot.head()

0    A bartender is working at a saloon, serving dr...
1    The moon, painted with a smiling face hangs ov...
2    The film, just over a minute long, is composed...
3    Lasting just 61 seconds and consisting of two ...
4    The earliest known adaptation of the classic f...
Name: Plot, dtype: object

Check whether there are any missing values in the 'Plot' column

In [7]:
md_nan = md_plot.isna()
md_nan.sum()

0

**Pre-processing of the text**

Pre-processing using the spaCy library

In [8]:
nlp = spacy.load('en_core_web_sm') 

In [9]:
doc = nlp(md_plot[0]) 
print(doc) 

A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]


In [10]:
lemmas = [token.lemma_ for token in doc] 
print(lemmas)

['a', 'bartender', 'be', 'work', 'at', 'a', 'saloon', ',', 'serve', 'drink', 'to', 'customer', '.', 'after', '-PRON-', 'fill', 'a', 'stereotypically', 'irish', 'man', "'s", 'bucket', 'with', 'beer', ',', 'Carrie', 'Nation', 'and', '-PRON-', 'follower', 'burst', 'inside', '.', '-PRON-', 'assault', 'the', 'irish', 'man', ',', 'pull', '-PRON-', 'hat', 'over', '-PRON-', 'eye', 'and', 'then', 'dump', 'the', 'beer', 'over', '-PRON-', 'head', '.', 'the', 'group', 'then', 'begin', 'wreck', 'the', 'bar', ',', 'smash', 'the', 'fixture', ',', 'mirror', ',', 'and', 'break', 'the', 'cash', 'register', '.', 'the', 'bartender', 'then', 'spray', 'seltzer', 'water', 'in', 'Nation', "'s", 'face', 'before', 'a', 'group', 'of', 'policeman', 'appear', 'and', 'order', 'everybody', 'to', 'leave.[1', ']']


In [11]:
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() or lemma not in STOP_WORDS] 

print(a_lemmas)

['a', 'bartender', 'be', 'work', 'at', 'a', 'saloon', ',', 'serve', 'drink', 'to', 'customer', '.', 'after', '-PRON-', 'fill', 'a', 'stereotypically', 'irish', 'man', 'bucket', 'with', 'beer', ',', 'Carrie', 'Nation', 'and', '-PRON-', 'follower', 'burst', 'inside', '.', '-PRON-', 'assault', 'the', 'irish', 'man', ',', 'pull', '-PRON-', 'hat', 'over', '-PRON-', 'eye', 'and', 'then', 'dump', 'the', 'beer', 'over', '-PRON-', 'head', '.', 'the', 'group', 'then', 'begin', 'wreck', 'the', 'bar', ',', 'smash', 'the', 'fixture', ',', 'mirror', ',', 'and', 'break', 'the', 'cash', 'register', '.', 'the', 'bartender', 'then', 'spray', 'seltzer', 'water', 'in', 'Nation', 'face', 'before', 'a', 'group', 'of', 'policeman', 'appear', 'and', 'order', 'everybody', 'to', 'leave.[1', ']']


In [12]:
print(' '.join(a_lemmas))

a bartender be work at a saloon , serve drink to customer . after -PRON- fill a stereotypically irish man bucket with beer , Carrie Nation and -PRON- follower burst inside . -PRON- assault the irish man , pull -PRON- hat over -PRON- eye and then dump the beer over -PRON- head . the group then begin wreck the bar , smash the fixture , mirror , and break the cash register . the bartender then spray seltzer water in Nation face before a group of policeman appear and order everybody to leave.[1 ]
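The joined output above still contains stop words such as `a`, `the` and `and`. The condition in cell 11 combines its two tests with `or`, so a lemma is kept whenever it is alphabetic, whether or not it is a stop word. Below is a minimal sketch of the difference between `or` and `and`. It uses a tiny hand-picked stop-word set as a stand-in for spaCy's `STOP_WORDS`; both the set and the lemma list are illustrative assumptions.

```python
# Illustrative stand-in for spaCy's STOP_WORDS (assumption: a tiny subset).
stop_words = {"a", "be", "at", "the", "'s"}

lemmas = ["a", "bartender", "be", "work", "at", "the", "saloon", ",", "'s"]

# `or`: keeps anything alphabetic, so stop words survive.
# Only tokens that are both non-alphabetic and stop words are dropped.
kept_or = [lemma for lemma in lemmas
           if lemma.isalpha() or lemma not in stop_words]

# `and`: keeps only alphabetic lemmas that are not stop words.
kept_and = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stop_words]

print(kept_or)
print(kept_and)
```

With `or`, only `'s` disappears, which matches the real output above. With `and`, punctuation and stop words are both removed.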


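Create a function for this pre-processing and apply it to all other plots in the dataset. Note that the function combines the two conditions with `and` rather than the `or` used in the exploratory cell above, so stop words and non-alphabetic tokens are actually removed.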

In [13]:
def preprocess(text):
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in STOP_WORDS]
    
    return ' '.join(a_lemmas)

In [14]:
preprocess(md_plot[0])  # check the function's result (the return value is not displayed here)
print(md_plot[0])  # prints the raw plot, not the preprocessed text

A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]


Test the function for simple cases

In [15]:
md_plot_test = [[md_plot[0]], [md_plot[1]], [md_plot[2]]]  # select only a few rows
md_plot_test = pd.DataFrame(md_plot_test, columns=['Plot'])

md_plot_test['test'] = md_plot_test['Plot'].apply(preprocess)  # creates a new column

md_plot_test

Unnamed: 0,Plot,test
0,"A bartender is working at a saloon, serving dr...",bartender work saloon serve drink customer fil...
1,"The moon, painted with a smiling face hangs ov...",moon paint smile face hang park night young co...
2,"The film, just over a minute long, is composed...",film minute long compose shot girl sit base al...


Apply the function to the dataset. Here it was necessary to reduce the size of the dataset, so only the first half of the rows is processed.

In [16]:
md_half = md[:len(md)//2].copy()  # explicit copy, so the new column is not set on a slice of md

In [17]:
md_half['Plot_lemma'] = md_half['Plot'].apply(preprocess)


In [18]:
md_half  # check the new column

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,Plot_lemma
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr...",bartender work saloon serve drink customer fil...
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov...",moon paint smile face hang park night young co...
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed...",film minute long compose shot girl sit base al...
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...,second consist shot shot set wood winter actor...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...,earliest know adaptation classic fairytale fil...
...,...,...,...,...,...,...,...,...,...
17438,1976,End Play,Australian,Unknown,"George Mallaby, John Waters",unknown,https://en.wikipedia.org/wiki/End_Play,Hitchhiker Janine Talbort is picked up and mur...,Hitchhiker Janine Talbort pick murder unseen a...
17439,1976,Fantasm,Australian,Unknown,,unknown,https://en.wikipedia.org/wiki/Fantasm,German psychiatrist Professor Jurgen Notafreud...,german psychiatrist Professor Jurgen Notafreud...
17440,1976,The Fourth Wish,Australian,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Fourth_Wish,Casey learns that his 12-year-old son Sean has...,Casey learn old son Sean leukaemia die month C...
17441,1976,Illuminations,Australian,Unknown,,unknown,https://en.wikipedia.org/wiki/Illuminations_(f...,A couple living together have a tense relation...,couple live tense relationship woman father di...


In [19]:
md_half.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,Plot_lemma
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr...",bartender work saloon serve drink customer fil...
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov...",moon paint smile face hang park night young co...
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed...",film minute long compose shot girl sit base al...
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...,second consist shot shot set wood winter actor...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...,earliest know adaptation classic fairytale fil...


Save the result

In [20]:
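# Note: savez_compressed stores only the arrays passed to it; none are passed here,
# so md_half.npz holds no data and the CSV below is the actual saved result.
np.savez_compressed('md_half')
md_half.to_csv('csv_to_submit.csv', index=False)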

**Apply TF-IDF**

In [21]:
vectorizer = TfidfVectorizer()

A quick look at the lemmatized column

In [22]:
md_half_plot_lemma = md_half['Plot_lemma'] 
md_half_plot_lemma.head()
md_half_plot_lemma.shape

(17443,)

Check the TF-IDF matrix on the three-row test set

In [23]:
tfidf_matrix_teste = vectorizer.fit_transform(md_plot_test['test'])
print(tfidf_matrix_teste) 

  (0, 30)	0.1333442091361333
  (0, 66)	0.1333442091361333
  (0, 2)	0.1333442091361333
  (0, 71)	0.1333442091361333
  (0, 32)	0.07875523797761909
  (0, 97)	0.1333442091361333
  (0, 82)	0.1333442091361333
  (0, 90)	0.1333442091361333
  (0, 77)	0.1333442091361333
  (0, 20)	0.1333442091361333
  (0, 15)	0.1333442091361333
  (0, 62)	0.1333442091361333
  (0, 37)	0.1333442091361333
  (0, 88)	0.1333442091361333
  (0, 6)	0.1333442091361333
  (0, 101)	0.1333442091361333
  (0, 10)	0.1333442091361333
  (0, 43)	0.2666884182722666
  (0, 46)	0.1333442091361333
  (0, 28)	0.1333442091361333
  (0, 31)	0.1333442091361333
  (0, 45)	0.10141170805545374
  (0, 75)	0.1333442091361333
  (0, 5)	0.1333442091361333
  (0, 48)	0.1333442091361333
  :	:
  (2, 60)	0.14989686209292205
  (2, 98)	0.14989686209292205
  (2, 41)	0.14989686209292205
  (2, 50)	0.14989686209292205
  (2, 56)	0.14989686209292205
  (2, 0)	0.14989686209292205
  (2, 74)	0.14989686209292205
  (2, 73)	0.14989686209292205
  (2, 26)	0.14989686209292205


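Apply TF-IDF to the half dataset (the vectorizer is refitted, so the vocabulary learned from the test set is discarded)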

In [24]:
tfidf_matrix_half = vectorizer.fit_transform(md_half['Plot_lemma'])  # create the TF-IDF matrix

In [25]:
md_half_plot_lemma.shape

(17443,)

In [26]:
print(tfidf_matrix_half) 

  (0, 21867)	0.14631007738501167
  (0, 48658)	0.06627689408092892
  (0, 2631)	0.07766582627173153
  (0, 52148)	0.12445248875955761
  (0, 22335)	0.08159781528464904
  (0, 72899)	0.09722082119046897
  (0, 59918)	0.22562849671011898
  (0, 63342)	0.15705060793633624
  (0, 55546)	0.16029227812703717
  (0, 10598)	0.1198935709165304
  (0, 8240)	0.06500397858088504
  (0, 44250)	0.1354111152013188
  (0, 23665)	0.2148200504316633
  (0, 62083)	0.13291154595650542
  (0, 4703)	0.09803099082800035
  (0, 74528)	0.1384718967309977
  (0, 5585)	0.058382827880235355
  (0, 27946)	0.15690816649247308
  (0, 29533)	0.0681083113735678
  (0, 19640)	0.12356875340321877
  (0, 22258)	0.0999194023703423
  (0, 29327)	0.13961081167075853
  (0, 53654)	0.08804282183838569
  (0, 3437)	0.1149746058439725
  (0, 33009)	0.08693507613923547
  :	:
  (17441, 21232)	0.4163045293748532
  (17441, 66791)	0.37027766453022537
  (17441, 5154)	0.3724073041760828
  (17441, 55719)	0.18762762187904594
  (17441, 19387)	0.2785697818464236

**Apply cosine similarity**

Run it on the test set first

In [27]:
cosine_sim_test = cosine_similarity(tfidf_matrix_teste, tfidf_matrix_teste)
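Entry (i, j) of the resulting matrix is the cosine of the angle between the TF-IDF vectors of plots i and j: their dot product divided by the product of their lengths. A minimal sketch in plain Python, with hypothetical sparse vectors stored as dictionaries:

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


# Hypothetical TF-IDF-like rows (each already has length 1).
a = {"bartender": 0.6, "saloon": 0.8}
b = {"saloon": 1.0}
c = {"moon": 1.0}

print(cosine(a, a))  # a plot compared with itself
print(cosine(a, b))  # one shared term
print(cosine(a, c))  # no shared terms
```

Because `TfidfVectorizer` normalizes every row to length 1 by default, the denominator is 1 and the similarity reduces to a plain dot product. Under that assumption, `sklearn.metrics.pairwise.linear_kernel` should give the same matrix with less work.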

Now apply it to the half dataset

In [28]:
cosine_sim_half = cosine_similarity(tfidf_matrix_half, tfidf_matrix_half)

In [29]:
print(cosine_sim_half)

[[1.         0.03002355 0.00784485 ... 0.         0.         0.00976852]
 [0.03002355 1.         0.03582965 ... 0.00908275 0.02688147 0.00827609]
 [0.00784485 0.03582965 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.00908275 0.         ... 1.         0.01548907 0.01721219]
 [0.         0.02688147 0.         ... 0.01548907 1.         0.        ]
 [0.00976852 0.00827609 0.         ... 0.01721219 0.         1.        ]]


In [30]:
cosine_sim_half.shape

(17443, 17443)
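A dense matrix of this shape is large, which may be one reason the dataset had to be halved. A rough estimate of its size, assuming 8 bytes per value (float64, which is an assumption about the array's dtype):

```python
def dense_matrix_bytes(n_rows, bytes_per_value=8):
    """Bytes needed for a dense n_rows x n_rows matrix (float64 assumed)."""
    return n_rows * n_rows * bytes_per_value


half = dense_matrix_bytes(17443)   # rows in md_half
full = dense_matrix_bytes(34886)   # rows in the full dataset
print(f"half dataset: {half / 1e9:.2f} GB")  # half dataset: 2.43 GB
print(f"full dataset: {full / 1e9:.2f} GB")  # full dataset: 9.74 GB
```

Doubling the number of films quadruples the memory, because the matrix grows with the square of the row count.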

**Make the recommendations**

In [31]:
indices_half = pd.Series(md_half.index, index=md_half['Title']).drop_duplicates()  # map each film title to its row index (drop_duplicates acts on the index values, so repeated titles remain)
indices_half

Title
Kansas Saloon Smashers                  0
Love by the Light of the Moon           1
The Martyred Presidents                 2
Terrible Teddy, the Grizzly King        3
Jack and the Beanstalk                  4
                                    ...  
End Play                            17438
Fantasm                             17439
The Fourth Wish                     17440
Illuminations                       17441
Let the Balloon Go                  17442
Length: 17443, dtype: int64

In [32]:

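def get_recommendations(title, cosine_sim, indices):
    # Row index of the chosen film (a single integer only if the title is unique)
    idx = indices[title]
    # Pair every film index with its similarity to the chosen film
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (normally the film itself, similarity 1.0) and keep the next five
    sim_scores = sim_scores[1:6]
    movie_indices = [i[0] for i in sim_scores]
    # Note: titles are looked up in the global md_half
    return md_half['Title'].iloc[movie_indices]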
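The slice `sim_scores[1:6]` assumes the chosen film comes first after sorting. That holds as long as no other plot has a similarity of 1.0. If an earlier row has an identical plot, the stable sort puts that row first and the chosen film ends up among its own recommendations. A minimal sketch of an alternative that excludes the chosen index explicitly; the helper name `top_k_similar` and the toy matrix are hypothetical:

```python
import heapq


def top_k_similar(idx, sim_row, k=5):
    """Return the k indices most similar to idx, never including idx itself."""
    candidates = ((score, j) for j, score in enumerate(sim_row) if j != idx)
    best = heapq.nlargest(k, candidates)
    return [j for score, j in best]


# Hypothetical 4x4 similarity matrix; film 1 has exactly the same plot as film 0.
sim = [
    [1.0, 1.0, 0.2, 0.1],
    [1.0, 1.0, 0.2, 0.1],
    [0.2, 0.2, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
]

# Slice-based approach: drops film 0 and keeps film 1 itself.
naive = sorted(enumerate(sim[1]), key=lambda x: x[1], reverse=True)[1:3]
print([i for i, _ in naive])

# Explicit exclusion: film 1 is never recommended to itself.
print(top_k_similar(1, sim[1], k=2))
```

`heapq.nlargest` also avoids sorting the whole row when only a handful of recommendations is needed.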

In [33]:
print(get_recommendations('The Godfather', cosine_sim_half, indices_half))

11175           Family Business
7464      A Cold Wind in August
13512          Mickey Blue Eyes
9042      The Godfather Part II
11411    The Godfather Part III
Name: Title, dtype: object
