## Tf–idf: Term Frequency – Inverse Document Frequency

Tf–idf is a statistical measure of how relevant a word is to one document within a collection of documents.

It's mainly used in automated text analysis and for scoring words in machine learning algorithms for Natural Language Processing.

The method counts how many times each word appears in each document (term frequency) and then down-weights words that appear in many documents (inverse document frequency). Words that are very common across all documents, such as *or*, *it*, *is*, *the*, *a*, carry little insight about any specific document and are offset, whereas words that appear very frequently in one document but rarely in the others get a higher score.
 - For example, in the plot of *Interview with the Vampire*, **Lestat** is a word that appears many times but is unlikely to be common in other movie plots.

To wrap up: if a word appears many times in one document but rarely in the others, it is probably relevant to that document; a word that appears everywhere is probably not very relevant.

*For more info:* https://monkeylearn.com/blog/what-is-tf-idf/
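
The scoring idea described above can be sketched by hand on a toy corpus. This is a minimal illustration that assumes the textbook formula `tf * log(N / df)`; the three toy documents and the helper name `tf_idf` are made up for the example. scikit-learn's `TfidfVectorizer` uses a smoothed variant of the idf term and normalises each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented purely for illustration
docs = [
    "the vampire lestat meets the reporter",
    "the reporter writes the story",
    "the ring is the story",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tf_idf(term, tokens):
    """Textbook tf-idf: relative term frequency times log(N / df)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# "the" occurs in every document, so its idf (and score) is zero;
# "lestat" occurs in a single document, so it scores higher.
score_the = tf_idf("the", tokenized[0])
score_lestat = tf_idf("lestat", tokenized[0])
print(score_the, score_lestat)
```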

In [1]:
# Import nltk
import nltk

# import json to read the data file
import json

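# Import scikit-learn, including the submodules used below
import sklearn
import sklearn.feature_extraction.text
import sklearn.metrics.pairwise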

# The usual suspects
import pandas as pd
import numpy as np

In [2]:
# Download the punkt tokenizer models and the list of common stop words
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\guilh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\guilh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Data import and Preparation 
The base dataset is the movies.json file, which contains several fields of information for each of 7113 movies.

In [3]:
# Load the whole movies.json file into the movies variable
with open("movies.json", 'r') as f:
    movies = json.load(f)

In [4]:
# Create a list with the plot text of each movie
plots = [movie.get("plot") for movie in movies]
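
A note on robustness: `movie.get("plot")` returns `None` when a record has no plot, and the vectorizer expects strings. Whether any record in movies.json actually lacks a plot is not known from this notebook, so the following is only a defensive sketch. The `sample_movies` records are hypothetical and simply mirror the `"title"` and `"plot"` keys used here.

```python
# Hypothetical records mirroring the "title"/"plot" keys used in this notebook
sample_movies = [
    {"title": "Film A", "plot": "A vampire gives an interview."},
    {"title": "Film B", "plot": None},   # plot explicitly missing
    {"title": "Film C"},                 # no "plot" key at all
]

# Fall back to an empty string so every movie keeps its row position,
# which keeps matrix rows aligned with the original list of movies
safe_plots = [movie.get("plot") or "" for movie in sample_movies]
print(safe_plots)
```

Keeping one entry per movie, rather than dropping the empty ones, matters later because row `i` of the tf–idf matrix is looked up as `movies[i]`.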

### Training the model based on the English Stopwords
 - Create a vectorizer that takes into consideration the English stop words
 - Fit the vectorizer on the plots data
 - Transform the plots into a document-term matrix

In [5]:
# Creating the vectorizer based on the English stop words (NLTK file IDs are lowercase)
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(stop_words=nltk.corpus.stopwords.words("english"))

# Train the model with the plots information and based on the defined stopwords
trained_vectorizer = vectorizer.fit(plots)

# Transform to a document-term matrix
tfidf = trained_vectorizer.transform(plots)

In [6]:
tfidf.shape

(7113, 65108)
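
To make the shape above easier to read, here is a small sketch of the same fit and transform steps on a toy corpus. It assumes scikit-learn's built-in `stop_words="english"` list instead of the NLTK list used in this notebook, only so the example runs without downloads; the toy plots are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy plots, one string per "movie"
toy_plots = [
    "A vampire tells his story to a reporter",
    "A hobbit carries a ring to a volcano",
    "Children learn magic at a school",
]

# Assumption: scikit-learn's built-in English stop-word list, not NLTK's
toy_vectorizer = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and builds the matrix in one call
toy_tfidf = toy_vectorizer.fit_transform(toy_plots)

# One row per document, one column per vocabulary word kept after filtering
print(toy_tfidf.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

The result is a sparse matrix with one row per plot and one column per distinct word that survived stop-word removal, which is why the full dataset yields 7113 rows.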

### Using Cosine Similarity

Cosine similarity measures how closely two vectors point in the same direction, which makes it possible to compare how similar two rows of tf–idf values are: for tf–idf vectors the result lies between 0 and 1, and the closer it is to 1, the more similar the two plots.

 - Given a new plot, that is, a string containing part of the plot of a movie, we can build a tool that finds the movies in our database that are most similar to it
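
As a reference for what the similarity computes, the sketch below implements the cosine of the angle between two vectors with NumPy. The helper name `cosine` and the example vectors are illustrative only, and returning 0 for a zero vector is a convention assumed here.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Assumed convention: a zero vector is treated as similar to nothing
    return float(u @ v / denom) if denom else 0.0

# Vectors pointing the same way score (close to) 1, whatever their length
same_direction = cosine([1, 2, 0], [2, 4, 0])

# Vectors that share no non-zero component score 0
no_overlap = cosine([1, 0, 0], [0, 3, 0])
print(same_direction, no_overlap)
```

Because only the direction matters, a short query and a long plot can still score highly as long as they weight the same words.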

In [7]:
# Using the opening of the "Interview with the Vampire" plot as a test
new_plot = ["In modern-day San Francisco, reporter Daniel Molloy interviews Louis de Pointe du Lac, who claims to be a vampire. Louis describes his human life as a wealthy plantation owner in 1791 Louisiana"]

In [8]:
tfidf_new = trained_vectorizer.transform(new_plot)  # transform the new data with the model we have just trained
tfidf_new.shape

(1, 65108)

In [9]:
cosine_similarity = sklearn.metrics.pairwise.cosine_similarity(tfidf_new, tfidf).flatten()
indices = np.argsort(-cosine_similarity)[:10]
for idx in indices:
    print(movies[idx].get("title"))

Interview with the Vampire
The 9th Life of Louis Drax
Route Irish
The Man in the Iron Mask
Kangaroo Jack
The Extra Man
Daughter of Darkness
Foxcatcher
There Will Be Blood
The Believer
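
The ranking step in the cell above relies on `np.argsort(-scores)`, which returns the positions of the scores from highest to lowest. A minimal sketch of that step as a reusable helper follows; the name `top_k_titles` and the toy titles and scores are hypothetical.

```python
import numpy as np

def top_k_titles(scores, titles, k=10):
    """Return (rank, title, score) for the k highest scores."""
    scores = np.asarray(scores, dtype=float)
    # Negating the scores makes argsort order them from highest to lowest
    order = np.argsort(-scores, kind="stable")[:k]
    return [(rank, titles[i], float(scores[i]))
            for rank, i in enumerate(order, start=1)]

# Toy data, invented for illustration
toy_titles = ["Film A", "Film B", "Film C", "Film D"]
toy_scores = [0.10, 0.75, 0.00, 0.40]

ranking = top_k_titles(toy_scores, toy_titles, k=2)
for rank, title, score in ranking:
    print(rank, ".", title, score)
```

Returning the score alongside the title makes it easier to see how much weaker the later matches are than the first one.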


In [10]:
tfidf_new.shape
tfidf.shape

(7113, 65108)

### Wrapping-up for a more user-friendly perspective

 - Were the movies correctly identified?

In [11]:
## Example 1, trying to reach the Lord of the Rings saga

# Place the intended plot in here
new_plot = ["A grupo of people tries to return a ring. Mordor does not approve"]

tfidf_new = trained_vectorizer.transform(new_plot)  # transform the new data with the model we have just trained

cosine_similarity = sklearn.metrics.pairwise.cosine_similarity(tfidf_new, tfidf).flatten()
indices = np.argsort(-cosine_similarity)[:10]
for rank, movie_idx in enumerate(indices, start=1):
    print(rank, ".", movies[movie_idx].get("title"))

1 . The Lord of the Rings: The Two Towers
2 . The Lord of the Rings: The Fellowship of the Ring
3 . The Lord of the Rings: The Return of the King
4 . What's the Worst That Could Happen?
5 . St. Trinian's II: The Legend of Fritton's Gold
6 . O Brother, Where Art Thou?
7 . Without Evidence
8 . Blonde Fist
9 . All the Real Girls
10 . Twin Peaks: Fire Walk with Me


In [12]:
## Example 2, trying to reach the Harry Potter saga

# Place the intended plot in here
new_plot = ["Children go to magic school. Also, there's Hermione"]

tfidf_new = trained_vectorizer.transform(new_plot)  # transform the new data with the model we have just trained

cosine_similarity = sklearn.metrics.pairwise.cosine_similarity(tfidf_new, tfidf).flatten()
indices = np.argsort(-cosine_similarity)[:10]
for rank, movie_idx in enumerate(indices, start=1):
    print(rank, ".", movies[movie_idx].get("title"))

1 . Harry Potter and the Deathly Hallows: Part I
2 . Harry Potter and the Deathly Hallows: Part 1
3 . Harry Potter and the Sorcerer's Stone
4 . Harry Potter and the Philosopher's Stone
5 . Harry Potter and the Prisoner of Azkaban
6 . Harry Potter and the Order of the Phoenix
7 . Harry Potter and the Chamber of Secrets
8 . Village of the Damned
9 . Magic Island (film)
10 . Harry Potter and the Deathly Hallows: Part 2


In [13]:
## Example 3, trying to reach Titanic

# Place the intended plot in here
new_plot = ["A boat sinks on its first voyage. Romance between a man and a woman"]

tfidf_new = trained_vectorizer.transform(new_plot)  # transform the new data with the model we have just trained

cosine_similarity = sklearn.metrics.pairwise.cosine_similarity(tfidf_new, tfidf).flatten()
indices = np.argsort(-cosine_similarity)[:10]
for rank, movie_idx in enumerate(indices, start=1):
    print(rank, ".", movies[movie_idx].get("title"))

1 . All Is Lost
2 . Titanic
3 . Under the Skin
4 . Deep Rising
5 . Trust
6 . Gunmen
7 . 47 Meters Down
8 . The Road
9 . Only for You
10 . Paper Heart


### Final comments:
The method is far from perfect: it only scores each word or token by how often it appears in a plot relative to how common it is across the dataset that was given. The more specific a word is to a film, that is, frequent in its plot and rare elsewhere, the more likely it is that the word will correctly identify that film.

For example, searching for "ring" will not be as effective as searching for "Mordor" in identifying the LOTR saga.

The same happens with words such as "magic" or "wand", which by themselves will not identify the Harry Potter saga, whereas adding the word "Hermione" identifies it immediately.

In the last example, a more general description of the film still places the target movie in the top 10 (Titanic ranks second), but since the description is not specific enough the movie is not identified immediately; more distinctive key expressions would probably work better.
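
The effect of word specificity can be sketched with the smoothed idf formula that scikit-learn documents for its default settings, `ln((1 + n) / (1 + df)) + 1`. The corpus size is the 7113 movies reported above; the two document frequencies are hypothetical values chosen only to contrast a fairly common word with a rare one, not counts measured on this dataset.

```python
import math

N_DOCS = 7113  # corpus size reported in this notebook

def smoothed_idf(doc_freq, n_docs=N_DOCS):
    """Smoothed idf as documented for scikit-learn's default settings."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Hypothetical document frequencies, NOT measured on movies.json
idf_common_word = smoothed_idf(300)  # a word found in many plots
idf_rare_word = smoothed_idf(3)      # a word found in only a few plots

print(idf_common_word, idf_rare_word)
```

Under these assumed counts the rare word gets roughly twice the weight of the common one, which matches the intuition that a name found in only a few plots pins down a film far better than a generic word.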