In [1]:
# Import required libraries
import pandas as pd
import string
import math
import numpy as np

## 0. Introduction

Comparing movie overviews directly is challenging because natural language is complex and variable. To measure similarity between movie overviews, we first need to transform them into a structured numerical format. This is where the Term Frequency-Inverse Document Frequency ([TF-IDF](https://link.springer.com/referenceworkentry/10.1007/978-0-387-30164-8_832)) algorithm comes into play.

In this notebook, we delve into the implementation and step-by-step explanation of the TF-IDF algorithm.

*TF-IDF, standing for Term Frequency-Inverse Document Frequency, is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is calculated by multiplying two metrics: how many times a word appears in a document (term frequency) and the inverse document frequency of the word across a set of documents. This approach diminishes the weight of commonly used words and amplifies that of unique terms, providing a more meaningful representation of text content.*

Through this notebook, we will compute TF-IDF vectors for movie overviews (documents). This process will yield a matrix where each column represents a unique word in our corpus (all the words appearing in at least one movie overview), and each row represents an individual movie. This structured representation is essential for our recommendation system.

## 1. Movie overview exploration
First, let us inspect the plots of a few movies.

In [2]:
# Read the movies metadata CSV file
movies_metadata = pd.read_csv('the-movies-dataset/movies_metadata.csv', low_memory=False, encoding='utf-8').dropna()

# Print plot overviews of the first 5 movies
movies_metadata['overview'].head(5)

9      James Bond must unmask the mysterious head of ...
68     Craig and Smokey are two guys in Los Angeles h...
69     Seth Gecko and his younger brother Richard are...
153    Auggie runs a small tobacco shop in Brooklyn, ...
178    Power up with six incredible teens who out-man...
Name: overview, dtype: object

## 2. Preprocessing and Cleaning Overviews
 
To enhance the quality of our TF-IDF analysis, it is crucial to preprocess and clean the movie overviews. This involves removing "stop words", which are commonly used words that do not contribute significantly to the overall meaning of the text (e.g., 'and', 'the', 'is'), and punctuation, as it is not informative for our analysis and can interfere with word comparison.

In [3]:
# Define the set of stop words to be excluded from the analysis
stop_words = set([
    # List of common stop words (ensure no duplicates and all necessary words are included)
    'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at',
    'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by',
    'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't",
    'down', 'during', 'each', 'few', 'for', 'from', 'further', 'arent', 
    'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how',
    'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', '', 'andy', 'such', 'just',
    'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself',
    'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own',
    're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such',
    't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too',
    'under', 'until', 'up',
    've', 'very',
    'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't",
    'wouldn', "wouldn't",
    'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'
])

# Create a translation table to remove punctuation
translator = str.maketrans('', '', string.punctuation)

# Initialize a set to hold all the unique cleaned words
cleaned_unique_words = set()

# Iterate over each overview to populate cleaned_unique_words
for overview in movies_metadata['overview']:
    # Clean the overview of punctuation and split into words
    cleaned_words = [word.lower().translate(translator) for word in overview.split()]
    # Update the set with words from this overview, excluding stopwords
    cleaned_unique_words.update(word for word in cleaned_words if word not in stop_words)

# cleaned_unique_words now contains all the unique words after removing stopwords and punctuation.
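
To see what the cleaning step does to a single overview, here is a minimal sketch on one made-up sentence. The names `demo_stop_words`, `demo_translator`, `clean_overview`, and `sample` are hypothetical and used only for illustration; the logic mirrors the loop above (lowercase, strip punctuation, then drop stop words).

```python
import string

# Hypothetical mini stop-word list, for illustration only
demo_stop_words = {'a', 'the', 'and', 'is', "don't", ''}
demo_translator = str.maketrans('', '', string.punctuation)

def clean_overview(text, stop_words, translator):
    """Lowercase each token, strip punctuation, then drop stop words."""
    tokens = [word.lower().translate(translator) for word in text.split()]
    return [token for token in tokens if token not in stop_words]

sample = "The spy, a hero - and a rogue - is back. Don't blink!"
print(clean_overview(sample, demo_stop_words, demo_translator))
# ['spy', 'hero', 'rogue', 'back', 'dont', 'blink']
```

Two details are worth noticing. Tokens that consist only of punctuation become empty strings, which is why the stop-word set contains `''`. Also, because punctuation is stripped before the stop-word check, a contraction such as "Don't" becomes `dont` and is no longer matched by the entry `"don't"`.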

## 3. Word frequency count 

After cleaning and preprocessing the movie overviews, our next step is to count the frequency of each word in these overviews. This is a crucial step for TF-IDF as it helps us understand the term frequency component. We create a list of dictionaries, where each dictionary corresponds to a movie overview, and the keys are the unique cleaned words with their frequency counts.

In [4]:
word_dicts = []

# Count the word frequency for each overview
for overview in movies_metadata['overview']:
    # Split the overview into words, clean them, and filter out stopwords
    cleaned = (word.lower().translate(translator) for word in overview.split())
    words = [word for word in cleaned if word not in stop_words]
    # Initialize the dictionary for this overview with all cleaned unique words
    word_dict = dict.fromkeys(cleaned_unique_words, 0)
    # Count the words in the current overview
    for word in words:
        word_dict[word] += 1
    # Add the word count dictionary to the list of dictionaries
    word_dicts.append(word_dict)
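
The same counting idea can be sketched on a tiny example with `collections.Counter`. The token lists and the names `demo_tokens`, `demo_vocabulary`, and `dense_counts` are hypothetical; the point is only to show that every dictionary ends up with one key per vocabulary word, with zeros for words the overview does not contain.

```python
from collections import Counter

# Hypothetical cleaned tokens for two overviews (illustration only)
demo_tokens = [
    ['bond', 'unmask', 'mysterious', 'head', 'bond'],
    ['craig', 'smokey', 'two', 'guys'],
]

# Vocabulary: every word seen in at least one overview
demo_vocabulary = sorted(set(word for tokens in demo_tokens for word in tokens))

# Dense counts, mirroring word_dicts: one key per vocabulary word
dense_counts = []
for tokens in demo_tokens:
    counts = dict.fromkeys(demo_vocabulary, 0)
    counts.update(Counter(tokens))  # overwrite the zeros for words that occur
    dense_counts.append(counts)

print(dense_counts[0]['bond'])  # 2: 'bond' occurs twice in the first overview
print(dense_counts[1]['bond'])  # 0: it never occurs in the second
```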

Let us inspect the `word_dicts` list by looking at the most frequent words in the first and second movie overviews, i.e. `word_dicts[0]` and `word_dicts[1]`. Each dictionary has one key per word in the vocabulary, so most of its counts are zero: the majority of words do not appear in any given overview.

In [5]:
first_dict = word_dicts[0]  
second_dict = word_dicts[1]  

# Function to get the top N words with the highest frequency
def get_top_n_words(word_dict, n=10):
    return sorted(word_dict.items(), key=lambda item: item[1], reverse=True)[:n]

print("Top words in the first overview:", get_top_n_words(first_dict))
print("Top words in the second overview:", get_top_n_words(second_dict))

print("Length of the first dictionary:", len(first_dict))
print("Length of the second dictionary:", len(second_dict))

Top words in the first overview: [('goldeneye', 1), ('leader', 1), ('weapons', 1), ('utilizing', 1), ('syndicate', 1), ('system', 1), ('revenge', 1), ('bond', 1), ('britain', 1), ('janus', 1)]
Top words in the second overview: [('something', 1), ('smoking', 1), ('craig', 1), ('afternoon', 1), ('hanging', 1), ('smokey', 1), ('angeles', 1), ('friday', 1), ('drinking', 1), ('guys', 1)]
Length of the first dictionary: 8025
Length of the second dictionary: 8025


Let us analyze the word frequency distribution across all movie overviews. We will calculate the total word count for each overview and then sum these counts to understand the overall word usage in our movies dataset.

In [6]:
# Calculate the total word count for each movie overview
total_counts_per_overview = [sum(word_dict.values()) for word_dict in word_dicts]
print("Total word count per overview:", total_counts_per_overview)

# Calculate the total number of words across all overviews
total_word_count = sum(total_counts_per_overview)
print("Total word count across all overviews:", total_word_count)

# Number of unique words across all overviews
num_unique_words = len(cleaned_unique_words)
print("Number of unique words across all overviews:", num_unique_words)

# Calculate the average frequency of each word across all overviews
average_word_frequency = total_word_count / num_unique_words
print(f"Average frequency of each word across all overviews: {average_word_frequency:.2f}")

Total word count per overview: [18, 14, 30, 31, 24, 23, 33, 33, 14, 36, 23, 39, 34, 26, 33, 27, 14, 14, 37, 32, 37, 31, 23, 21, 11, 33, 30, 40, 20, 25, 51, 40, 38, 34, 18, 29, 30, 18, 48, 59, 29, 22, 20, 42, 34, 38, 35, 12, 20, 29, 15, 32, 22, 40, 9, 14, 29, 30, 18, 29, 16, 39, 25, 16, 31, 15, 27, 45, 34, 21, 21, 18, 25, 20, 35, 37, 9, 25, 48, 31, 33, 36, 35, 24, 29, 22, 34, 23, 36, 40, 21, 23, 22, 35, 25, 12, 19, 32, 80, 36, 33, 37, 34, 34, 31, 40, 34, 33, 25, 13, 59, 41, 35, 59, 30, 27, 18, 9, 28, 25, 20, 32, 35, 30, 47, 26, 35, 24, 37, 29, 22, 45, 25, 22, 35, 19, 30, 34, 26, 40, 35, 36, 21, 48, 33, 21, 23, 21, 40, 32, 13, 27, 12, 22, 27, 29, 23, 38, 35, 64, 27, 43, 48, 45, 38, 45, 52, 43, 27, 32, 14, 33, 12, 31, 26, 34, 16, 30, 38, 70, 28, 25, 40, 15, 40, 21, 31, 27, 25, 37, 46, 15, 20, 37, 74, 15, 22, 37, 23, 26, 33, 39, 8, 30, 29, 16, 26, 29, 17, 32, 37, 48, 27, 11, 27, 31, 33, 22, 29, 30, 21, 12, 33, 55, 24, 19, 36, 20, 28, 47, 14, 19, 16, 13, 43, 19, 25, 38, 35, 39, 14, 15, 41, 

Our analysis shows how words are used across the dataset. The word count varies from one overview to the next, reflecting a range of content lengths. The vocabulary contains $8,025$ unique words, matching the length of each dictionary in `word_dicts`.

The average frequency of each word across all overviews is approximately $2.83$, which corresponds to a total of roughly $22,700$ words after cleaning. Because this average is above $1$, words are reused across overviews, and how widely a word is reused is what the IDF component of TF-IDF accounts for. Terms concentrated in a few overviews are likely to be more significant for those overviews.

In [7]:
df = pd.DataFrame(word_dicts)

print(f"Number of movie overviews in the DataFrame: {len(df)}")

print("Preview of the DataFrame:")
print(df.head())

# Extract and print all unique frequency counts in the DataFrame
unique_freq_counts = pd.Series(df.values.ravel()).unique()
print("Unique word frequency counts across all movie overviews:")
print(unique_freq_counts)

Number of movie overviews in the DataFrame: 693
Preview of the DataFrame:
   alesia  enthusiasts  everlasting  outfitted  periodically  almost  support  \
0       0            0            0          0             0       0        0   
1       0            0            0          0             0       0        0   
2       0            0            0          0             0       0        0   
3       0            0            0          0             0       0        0   
4       0            0            0          0             0       0        0   

   investigations  ordinary  j  ...  allies  klebb  famed  realised  edgy  \
0               0         0  0  ...       0      0      0         0     0   
1               0         0  0  ...       0      0      0         0     0   
2               0         0  0  ...       0      0      0         0     0   
3               0         0  0  ...       0      0      0         0     0   
4               0         0  0  ...       0      0    

After converting our word frequency data into a structured format, we now have a DataFrame (`df`) where each row represents a movie overview and each column corresponds to a unique word.

This analysis phase is essential in preparing us for the next steps, where we will calculate the TF-IDF scores.
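
As a small illustration of this layout, the sketch below builds a DataFrame from a hypothetical list of count dictionaries (`demo_word_dicts` is made up for this example) and measures how many of its cells are zero. On the real data, the share of zeros is expected to be much higher, since each overview uses only a few dozen of the thousands of vocabulary words.

```python
import pandas as pd

# Hypothetical word counts for three tiny overviews (illustration only)
demo_word_dicts = [
    {'bond': 2, 'spy': 1, 'friday': 0, 'shop': 0},
    {'bond': 0, 'spy': 0, 'friday': 1, 'shop': 0},
    {'bond': 0, 'spy': 0, 'friday': 0, 'shop': 3},
]
demo_df = pd.DataFrame(demo_word_dicts)

# Fraction of cells equal to zero (8 of the 12 cells here)
zero_fraction = (demo_df == 0).to_numpy().mean()
print(demo_df.shape)
print(f"Share of zero entries: {zero_fraction:.2f}")
```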

## 4. Compute TF

In the next step of our TF-IDF analysis, we focus on calculating the Term Frequencies (TF). Term frequency measures how frequently a term occurs in a movie overview. Since movie overviews differ in length, a term may appear many more times in a long overview than in a short one. The term frequency is therefore divided by the overview length (the total number of terms in the movie overview) as a way of normalization:

$$TF(t, d) = \frac{\text{Number of times term } t \text{ appears in movie overview } d}{\text{Total number of terms in movie overview } d}$$
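
A worked example may help here. The sketch below applies the normalization to a hypothetical count matrix (`demo_counts`, with made-up words and counts); it is an illustration of the formula, not the notebook's own function.

```python
import pandas as pd

# Hypothetical counts: rows are overviews, columns are words (illustration only)
demo_counts = pd.DataFrame(
    {'bond': [2, 0], 'spy': [1, 0], 'friday': [1, 4]},
    index=['overview_0', 'overview_1'],
)

# Overview length = sum of the counts in each row
overview_lengths = demo_counts.sum(axis=1)

# Divide every row by its own length
demo_tf = demo_counts.div(overview_lengths, axis=0)
print(demo_tf)
```

Both overviews contain four terms, so `bond` gets a term frequency of 2/4 = 0.5 in the first overview, while `friday` gets 4/4 = 1.0 in the second. Each row of `demo_tf` sums to 1.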

In [8]:
def computeTF(df):
    # Ensure the DataFrame is in the correct data type for floating-point division
    df = df.astype(float)

    # Calculate the total number of words in each document (row)
    total_words = df.sum(axis=1)

    # Avoid division by zero: an empty document gets NaN as its divisor,
    # and the resulting NaN term frequencies are then set to zero
    safe_totals = total_words.replace(0, np.nan)

    # Divide each row by its document length to get the term frequencies.
    # A new DataFrame is returned, so the original is not modified
    tf_df = df.div(safe_totals, axis=0).fillna(0.0)

    return tf_df

After defining the function, we apply it to our existing DataFrame (`df`) to compute the term frequencies and store the result in `Term_Frequency_Data_Frame`.

This step is crucial as it lays the groundwork for the next phase of our analysis, where these term frequencies will be used to calculate the Inverse Document Frequency (IDF) and subsequently, the TF-IDF scores.

In [9]:
# Compute term frequencies
# Term_Frequency_Data_Frame = computeTF(df)

Before proceeding to the calculation of the Inverse Document Frequency (IDF), it's insightful to examine the term frequency data further. Specifically, we want to identify the non-zero term frequencies within our DataFrame.
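
One possible way to do this is sketched below on a hypothetical term-frequency matrix (`demo_tf` stands in for `Term_Frequency_Data_Frame`; its values are made up). The matrix is reshaped into a long format with one value per (overview, word) pair, and only the positive values are kept.

```python
import pandas as pd

# Hypothetical term-frequency matrix standing in for Term_Frequency_Data_Frame
demo_tf = pd.DataFrame(
    {'bond': [0.5, 0.0], 'spy': [0.25, 0.0], 'friday': [0.25, 1.0]}
)

# Long format: one value per (overview, word) pair, keeping only non-zero values
stacked = demo_tf.stack()
non_zero_tf = stacked[stacked > 0]
print(non_zero_tf)

# Number of distinct words used in each overview
words_per_overview = demo_tf.gt(0).sum(axis=1)
print(words_per_overview)
```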

## 5. Compute IDF

Our next step in the TF-IDF analysis is to calculate the Inverse Document Frequency (IDF). The IDF is a measure of how important a word is within a corpus. The goal of the IDF is to diminish the weight of terms that appear very frequently in the document set and increase the weight of terms that appear rarely.

The IDF for a term is calculated as follows:

$$IDF(t) = \log_{10}\left(\frac{N + 1}{\texttt{df}(t) + 1}\right) + 1$$

where $N$ is the total number of movie overviews and $\texttt{df}(t)$ is the number of movie overviews that contain term $t$.

Adding $1$ to both $N$ and $\texttt{df}(t)$ smooths the IDF and prevents division by zero for a term that appears in no overview. Adding $1$ to the $\log$ term ensures that a term appearing in every overview still gets a non-zero weight.
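
To get a feel for the numbers, here is a small sketch that evaluates the smoothed, base-10 IDF used in `compute_idf` for a hypothetical corpus of 9 overviews. The word labels and document frequencies in `doc_freq` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical setting: 9 overviews, and the number of overviews containing each word
N = 9
doc_freq = pd.Series({'everywhere': 9, 'common': 4, 'rare': 1})

# Smoothed IDF with a base-10 logarithm, as in compute_idf
idf = np.log10((N + 1) / (doc_freq + 1)) + 1
print(idf.round(3))
```

A word that occurs in all 9 overviews gets log10(10/10) + 1 = 1, the lowest possible weight, while a word found in a single overview gets log10(10/2) + 1, roughly 1.699.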

In [10]:
def compute_idf(tf_df):
    # The number of documents
    N = len(tf_df)
    
    # Counting the number of documents that contain each word
    # Convert to binary to indicate presence or absence of a term
    binary_tf = tf_df.gt(0).astype(int)
    df = binary_tf.sum(axis=0)

    # Apply the IDF formula with smoothing. Using log base 10
    idf = np.log10((N + 1) / (df + 1)) + 1  # Added 1 to N and df to avoid division by zero
    
    # Converting to a DataFrame for easier handling
    idf_df = pd.DataFrame(idf, index=tf_df.columns).rename(columns={0: 'IDF'})
    
    return idf_df

In [11]:
# Compute the IDF
# Inverse_Document_Frequency_Data_Frame = compute_idf(Term_Frequency_Data_Frame)

## 6. Compute TF-IDF scores

Having computed both the Term Frequencies (TF) and the Inverse Document Frequencies (IDF), we now arrive at the final step of our TF-IDF analysis: calculating the TF-IDF scores. The TF-IDF score is the product of TF and IDF. It gives each word a weight that reflects its relevance to a specific movie overview relative to the entire dataset.

The TF-IDF score for a term in a document is calculated as follows:

$$\text{TF-IDF}(t, d) = TF(t, d) \times IDF(t)$$

where $TF(t, d)$ is the term frequency of term $t$ in the movie overview $d$ and $IDF(t)$ is the inverse document frequency of term $t$.

In [12]:
def compute_TFIDF(tf_df, idf_df):
    # Multiply each TF column by the IDF of its word. DataFrame.mul aligns on the
    # column labels, so the inputs are matched by word and tf_df is not modified
    tfidf_df = tf_df.mul(idf_df['IDF'], axis='columns')

    return tfidf_df

# Now call the function with your data
# tfidf_df = compute_TFIDF(Term_Frequency_Data_Frame, Inverse_Document_Frequency_Data_Frame)
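
The whole pipeline can be checked end to end on a toy corpus. The following is a minimal, self-contained sketch under stated assumptions: `demo_docs` is a hypothetical list of already cleaned overviews, and the steps follow the TF and smoothed base-10 IDF definitions used in this notebook rather than any library implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned overviews (illustration only)
demo_docs = [
    ['spy', 'bond', 'spy'],
    ['friday', 'friends'],
    ['spy', 'friends', 'shop', 'shop'],
]
vocabulary = sorted({word for doc in demo_docs for word in doc})

# Count matrix: one row per overview, one column per word
counts = pd.DataFrame(
    [[doc.count(word) for word in vocabulary] for doc in demo_docs],
    columns=vocabulary,
)

# TF: counts divided by overview length
tf = counts.div(counts.sum(axis=1), axis=0)

# Smoothed base-10 IDF, following the formula used in compute_idf
n_docs = len(counts)
doc_freq = counts.gt(0).sum(axis=0)
idf = np.log10((n_docs + 1) / (doc_freq + 1)) + 1

# TF-IDF: scale each column of TF by that word's IDF (aligned on column labels)
tfidf = tf.mul(idf, axis=1)
print(tfidf.round(3))
```

In the third overview, `shop` ends up with a higher score than `spy`: it occurs more often in that overview and appears in fewer overviews overall, so both factors push its weight up.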
