# IMDb Movies similarity from plot summaries

### 1. Import and observe dataset

In [None]:
# Import modules
import numpy as np
import pandas as pd
import nltk

# Set seed for reproducibility
np.random.seed(5)

movies_df = pd.read_csv('../dataset/movie_info.csv')

# show the top 10 rows
movies_df.head(10)

### 2. Tokenization and Stemming
Tokenization breaks a text down into individual sentences or words, as needed. NLTK's tokenizers keep numbers and punctuation marks as tokens, so we also filter out any token that contains no letters.

Stemming reduces the different forms of a word to a common root, so that, for instance, "girl" and "girls" are treated as the same term. This lets us capture the shared meaning of those forms without handling each one separately.

First, let us perform tokenization on a small extract from Mean Girls 2.

In [None]:
# Tokenize a paragraph into sentences and store in sent_tokenized
sent_tokenized = [sent for sent in nltk.sent_tokenize("""
                        The Plastics are back in the long-awaited follow-up to the smash hit Mean Girls - and now the clique is more fashionable, funny, and ferocious than ever.
                        """)]

# Word Tokenize first sentence from sent_tokenized, save as words_tokenized
words_tokenized = [word for word in nltk.word_tokenize(sent_tokenized[0])]

# Remove tokens that do not contain any letters from words_tokenized
import re

filtered = [word for word in words_tokenized if re.search('[a-zA-Z]', word)]

# Display filtered words to observe words after tokenization
filtered

In [None]:
# Import the SnowballStemmer to perform stemming
from nltk.stem.snowball import SnowballStemmer

# Create an English language SnowballStemmer object
stemmer = SnowballStemmer("english")

# Print filtered to observe words without stemming
print("Without stemming: ", filtered)

# Stem the words from filtered and store in stemmed_words
stemmed_words = [stemmer.stem(word) for word in filtered]

# Print the stemmed_words to observe words after stemming
print("After stemming:   ", stemmed_words)

We can now tokenize and stem sentences. Because we will apply both steps, one after the other, to a large amount of text, it makes sense to wrap them in a single function that takes the text as its argument. We can then pass this function as the tokenizer argument when creating the TF-IDF vectorizer, so that the vectorizer stems as well as tokenizes.

First, let us define a function that performs both tokenization and stemming:

In [None]:
def tokenize_and_stem(text):
    
    # Tokenize by sentence, then by word
    tokens = [word for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]

    # Filter out raw tokens to remove noise
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    
    # Stem the filtered_tokens
    stems = [stemmer.stem(word) for word in filtered_tokens]
    
    return stems

Apply the function to the Mean Girls 2 extract as an example:


In [None]:
words_stemmed = tokenize_and_stem("The Plastics are back in the long-awaited follow-up to the smash hit Mean Girls - and now the clique is more fashionable, funny, and ferocious than ever.")
print(words_stemmed)

It gave us the same result as before, which means that the function is good to go.



### 3. Create TfidfVectorizer
Computers operate on numbers, not text, so we must convert our plot summaries to numeric vectors before any meaning can be extracted from them. One simple approach, implemented by scikit-learn's CountVectorizer, is to count the occurrences of each vocabulary word in a document and return the counts as a vector.

Term Frequency-Inverse Document Frequency (TF-IDF) addresses a shortcoming of raw counts: words that are frequent everywhere dominate the vectors while saying little about any particular document. The term frequency of a word measures how often it appears in a document, while the inverse document frequency reduces the weight of a word that appears in many documents.
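
The weighting can be worked through by hand on a toy corpus. The sketch below is illustrative only: the three sentences are made up, and it uses the textbook formula tf × log(N / df). scikit-learn's TfidfVectorizer applies a smoothed idf and normalises each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration (not taken from the dataset)
docs = [
    "the clique rules the school",
    "the detective hunts the killer",
    "the school hires a detective",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return {term: tf * idf} for one document, with idf = log(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]
print(weights[0])
```

In the first document, "the" occurs twice but appears in every document, so its weight is zero, while "clique" occurs once but nowhere else and receives the largest weight.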

In the simplest terms, TF-IDF highlights words that are frequent in a given document but rare across the rest of the collection. Let's create one for our purposes.

In [None]:
# Import TfidfVectorizer to create TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer object with stopwords and tokenizer
# parameters for efficient processing of text
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, max_features=200000,
                                 min_df=2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem,
                                 ngram_range=(1,3))

### 4. Fit transform TfidfVectorizer
Once we have created a TF-IDF vectorizer, the next step is to fit it to the text and then transform the text into the corresponding numeric form, which the computer can work with. To do both in one step, we use the fit_transform() method of the TfidfVectorizer object.

If we look at the TfidfVectorizer object we created, we come across the parameter stop_words. Stop words are words that do not contribute considerably to the meaning of a sentence and are generally grammatical filler words.

Setting stop_words to 'english' directs the vectorizer to drop all words found in scikit-learn's built-in list of English stop words. Another parameter, ngram_range, sets the minimum and maximum length of the n-grams formed while vectorizing the text; with (1, 3), the features are single tokens, pairs, and triples of consecutive tokens.
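
To make the effect of ngram_range concrete, here is a minimal sketch that expands a token list into n-grams. The helper name word_ngrams is an assumption made for illustration; it is not the vectorizer's own code, only a plain-Python picture of what the (1, 3) setting produces.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of width n over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["smash", "hit", "mean", "girl"]
ngrams = word_ngrams(tokens)
print(ngrams)
```

Four tokens yield four unigrams, three bigrams, and two trigrams, which is why the vocabulary grows quickly as the upper bound of ngram_range increases.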

In [None]:
# Fit and transform the tfidf_vectorizer with the "plot" of each movie
# to create a vector representation of the plot summaries
tfidf_matrix = tfidf_vectorizer.fit_transform([x for x in movies_df["plot"]])
print(tfidf_matrix.shape)

### 5. K-means clustering
To determine how closely movies are related to one another without labelled data, we can use clustering, an unsupervised learning technique. Clustering groups items together so that the items within a group exhibit similar properties. Depending on the measure of similarity desired, a given sample of items can form one or more clusters.

   *  look for the elbow to determine the optimal number of clusters
  
   *  check the number of samples per group to confirm we have balanced samples across k-means groups


#### 1. look for the elbow to determine the optimal number of clusters

The main idea of the Mini Batch K-means algorithm is to use small random batches of data of a fixed size, so that they can be held in memory. In each iteration, a new random sample is drawn from the dataset and used to update the clusters, and this is repeated until convergence. Each mini batch updates the clusters using a convex combination of the current prototypes (cluster centres) and the data, applying a learning rate that decreases as the iterations proceed. This learning rate is the inverse of the number of data points assigned to a cluster so far. As the number of iterations increases, the effect of new data shrinks, so convergence can be detected when the clusters do not change over several consecutive iterations.
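
The update rule described above can be sketched in plain Python. This is not scikit-learn's implementation: the function name, the hand-picked initial centres, and the fixed iteration count (used here in place of a convergence test) are all assumptions made for illustration.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mini_batch_kmeans(points, init_centres, batch_size=4, n_iter=100, seed=0):
    """Run a fixed number of mini-batch updates and return the final centres."""
    rng = random.Random(seed)
    centres = [list(c) for c in init_centres]
    counts = [0] * len(centres)  # points assigned to each centre so far
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # assign every batch point to its nearest current centre
        nearest = [
            min(range(len(centres)), key=lambda j: squared_distance(p, centres[j]))
            for p in batch
        ]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            rate = 1.0 / counts[j]  # learning rate: inverse of the assignment count
            # convex combination of the old centre and the new point
            centres[j] = [(1 - rate) * c + rate * x for c, x in zip(centres[j], p)]
    return centres

# Two well-separated toy groups of 2-D points
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centres = mini_batch_kmeans(points, init_centres=[(0, 0), (10, 10)])
print(centres)
```

Because the rate is 1 / count, each centre is the running mean of every point assigned to it so far, which is why later batches move the centres less and less.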

Probably the most well-known way to choose the number of clusters is the elbow method: the sum of squares is calculated and plotted for each number of clusters, and the user looks for the point where the slope changes from steep to shallow (an elbow). This method is inexact, but still potentially helpful.
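
As a small illustration of the quantity being plotted, the sketch below computes the sum of squared distances from each point to its nearest centre for hand-picked centres. The function name, the toy points, and the centres are assumptions chosen so that the arithmetic is easy to check.

```python
def within_cluster_sum_of_squares(points, centres):
    """Sum over all points of the squared distance to the nearest centre."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, centre))
            for centre in centres
        )
    return total

# Two tight pairs of points, far apart from each other
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one = within_cluster_sum_of_squares(points, [(5.0, 1.0)])
two = within_cluster_sum_of_squares(points, [(0.0, 1.0), (10.0, 1.0)])
four = within_cluster_sum_of_squares(points, points)
print(one, two, four)
```

Going from one centre to two removes almost all of the sum of squares, while going from two to four removes very little; that change of slope is the elbow, here at two clusters.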

In [None]:
from sklearn.cluster import MiniBatchKMeans

# init potential n_clusters
n_clusters_list = list(range(1, 60, 1))
# init scores
scores = []
# init models for cache
kms = {}
for n_clusters in n_clusters_list:
    km = MiniBatchKMeans(
        n_clusters=n_clusters,
        random_state=99
    ).fit(tfidf_matrix)
    # cache the fitted model
    kms[n_clusters] = km
    # score() returns the negative of the k-means objective, so flip the sign
    scores.append(-km.score(tfidf_matrix))

In [None]:
# look for elbow to determine optimal number of clusters
pd.DataFrame({'scores': scores}, index=n_clusters_list).plot(
    figsize=(16, 8),
    title='K-Means Objective Score vs. Number of Clusters',
    grid=True
)

In [None]:
# pick optimal K-means Model
n = 12
km = kms[n]

# the cached model is already fitted, so its labels can be read directly
clusters = km.labels_.tolist()

# create DataFrame df_clusters for clustering analysis
data_clusters = {
    'title': movies_df.title.values,
    'plot': movies_df['plot'].values,
    'run time': movies_df['run time /min'].values,
    'votes': movies_df['number of votes'].values,
    'rating':movies_df.rating.values,
    'cluster': clusters
}
# the cluster label is already a column, so keep the default integer index
df_clusters = pd.DataFrame(
    data_clusters,
    columns=['title', 'plot', 'run time', 'votes', 'rating', 'cluster']
)

df_clusters.head(5)

#### 2. check the number of samples per group to confirm we have balanced samples across k-means groups

In [None]:
print("Number of movies included in each cluster:")
df_clusters['cluster'].value_counts().to_frame()

### 6. Calculate Movie Similarity
#### Use cosine similarity to calculate similarity of movie plots
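
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the direction of the vectors and not on their magnitude. A minimal sketch in plain Python follows; the helper name cosine_sim and the choice to return 0.0 for an all-zero vector are assumptions made for illustration, while the notebook itself relies on scikit-learn's cosine_similarity.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

same_direction = cosine_sim([1, 2, 0], [2, 4, 0])  # one vector is a multiple of the other
no_overlap = cosine_sim([1, 0], [0, 1])            # no shared non-zero component
print(same_direction, no_overlap)
```

TF-IDF weights are never negative, so for plot vectors the similarity lies between 0 (no terms in common) and 1 (identical direction).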

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# get similarity matrix using tfidf matrix
sim_matrix = cosine_similarity(tfidf_matrix)

In [None]:
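def movie_recommender_cosine_similarity(movie_title, K):
    """Return the K rows of movies_df most similar to movie_title, or None if the title is unknown."""
    # row positions of the movies whose title matches exactly
    positions = np.flatnonzero((movies_df['title'] == movie_title).to_numpy())
    if positions.size == 0:
        print("Sorry, we don't have this movie in our database. But we will take it into consideration in the future, thank you!")
        return None
    # rank all movies by descending similarity to the first match;
    # the queried movie itself normally ranks first
    top_k = np.argsort(sim_matrix[positions[0]])[::-1][:K]
    return movies_df.iloc[top_k]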

#### Again, let's use 'Mean Girls 2' as an example:

In [None]:
movie_title = input("Which movie do you want to search for? ")
K = int(input("How many similar movies do you want to display? "))

movie_recommender_cosine_similarity(movie_title, K)

What if we enter a title that doesn't exist in the dataset, say "I am not a movie"?


In [None]:
movie_title = input("Which movie do you want to search for? ")
K = int(input("How many similar movies do you want to display? "))

movie_recommender_cosine_similarity(movie_title, K)