# Introduction

In this analysis, we address a RAG problem: how can a user get a movie recommendation based on the scripts of the movies themselves?

We see a distinct value proposition in judging movies by their literal content rather than by what reviews say about them. Reviews are valuable, and systems that recommend movies from reviews already exist. However, your tastes may differ from the reviewers', or you may prefer recommendations that do not depend on human intervention.

To do this, we apply deep learning models that take a free-text description of the desired movie and recommend movies whose scripts match the themes outlined in the query.

Specifically, this model employs **Transfer Learning**: we start from a model pretrained on English text and fine-tune it on movie scripts from 2 specific genres, Horror and Drama. In the future, this project could grow to include more movie scripts, but for the scope of this assignment we limit the analysis to these two genres.

Our basic strategy:

1. Employ a pre-trained English language model: -----. This model captures sentiment and similar features of English text, which lets it classify that text.
2. Fine-tune the model on our specific kind of data, movie scripts, so that it learns the nuances of how scripts are written and where to look. Our dataset consists of: -------
3. Build a front-end application on top to provide an interactive experience for users of the application and script fanatics!


In [6]:
import os
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments


In [7]:
# Path to the directory containing the script files
directory_path = "dropbox-archive/movie_scripts/"

# Function to read scripts from files and extract titles
def load_scripts_from_directory(directory_path):
    scripts = []
    titles = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".txt"):
            # Construct full file path
            file_path = os.path.join(directory_path, filename)
            # Read the content of the file
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
                scripts.append(content)
                # Extract title from the filename (remove "Script_" and ".txt")
                title = filename.replace("Script_", "").replace(".txt", "").replace("_", " ")
                titles.append(title)
    return titles, scripts

# Load scripts and titles
titles, scripts = load_scripts_from_directory(directory_path)

# Create dummy labels for binary classification: 0 for even indices, 1 for odd.
# These are placeholders tied to file order, not to each script's genre.
labels = [i % 2 for i in range(len(scripts))]
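
The alternating labels above are placeholders and carry no genre information. A minimal sketch of how genre labels might be assigned instead, assuming a hypothetical title-to-genre mapping (the notebook does not show where genre metadata would come from); `GENRE_TO_LABEL`, `build_labels`, and the example titles are illustrative names, not part of the original code:

```python
# Assumed label encoding for the two genres named in the introduction.
GENRE_TO_LABEL = {"Drama": 0, "Horror": 1}

def build_labels(titles, genre_by_title):
    """Return (kept_titles, labels), skipping titles with no known genre."""
    kept_titles, labels = [], []
    for title in titles:
        genre = genre_by_title.get(title)
        if genre in GENRE_TO_LABEL:
            kept_titles.append(title)
            labels.append(GENRE_TO_LABEL[genre])
    return kept_titles, labels

# Invented placeholder titles and genres, for illustration only.
example_titles = ["Example Horror Film", "Example Drama Film", "Unlabeled Film"]
example_genres = {"Example Horror Film": "Horror", "Example Drama Film": "Drama"}

kept, example_labels = build_labels(example_titles, example_genres)
print(kept, example_labels)
```

If titles are filtered this way, the corresponding entries in `scripts` would need to be filtered in the same order so that scripts and labels stay aligned.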

In [8]:
titles[:5], scripts[:5], labels[:5]

(['Lost Horizon',
  'Spider-Man',
  'Zero Dark Thirty',
  'Star Wars  The Phantom Menace',
  'Pandorum'],
  '                                  ZERO DARK THIRTY                                    Written by                                    Mark Boal                                                     October 3rd, 2011                                                           FROM BLACK, VOICES EMERGE--          We hear the actual recorded emergency calls made by World          Trade Center office workers to police and fire departments          after the planes struck on 9/11, just before the buildings          collapsed.          TITLE OVER: SEPTEMBER 11, 2001          We listen to fragments from a number of these calls...starting          with pleas for help, building to a panic, ending with the          caller\'s grim acceptance that help will not arrive, that the          situation is hopeless, that they are about to die.                         CUT TO:          TITLE OVER: TWO YEA

In [9]:
# Load the pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
# Define a custom dataset class
class MovieScriptDataset(Dataset):
    # Initialize the dataset
    def __init__(self, scripts, labels, tokenizer, max_length):
        self.scripts = scripts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    # Return the number of items in the dataset
    def __len__(self):
        return len(self.scripts)

    # Get the item at the specified index and return the script text, input IDs, attention mask, and label
    def __getitem__(self, index):
        script = str(self.scripts[index])
        label = self.labels[index]
        # Encode the text using the tokenizer
        encoding = self.tokenizer.encode_plus(
            script,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            #'title': title,
            'script_text': script,
            # Input IDs: tensor of shape (max_length,)
            'input_ids': encoding['input_ids'].flatten(),
            # Attention mask: tensor of shape (max_length,)
            'attention_mask': encoding['attention_mask'].flatten(),
            # Label: scalar (0-dimensional) tensor holding the class index
            'labels': torch.tensor(label, dtype=torch.long)
        }

In [11]:
# Create dataset and dataloaders
max_length = 512
batch_size = 4

dataset = MovieScriptDataset(scripts, labels, tokenizer, max_length)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
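
With `max_length = 512` and `truncation=True`, each script is cut to its first 512 tokens, so most of a feature-length script never reaches the model. One common workaround is to split each script into overlapping windows and treat every window as its own example. The sketch below shows only the windowing logic on a plain list of tokens; `chunk_tokens`, `window`, and `stride` are illustrative names, and whitespace splitting stands in for the BERT tokenizer.

```python
def chunk_tokens(tokens, window=510, stride=255):
    """Split a token list into overlapping windows.

    window=510 leaves room for BERT's [CLS] and [SEP] tokens within a
    512-token limit; stride controls how far each window advances.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(tokens):
            break
        start += stride
    return chunks

# Toy example: whitespace tokens stand in for real tokenizer output.
toy_tokens = "a b c d e f g h i j".split()
toy_chunks = chunk_tokens(toy_tokens, window=4, stride=2)
for chunk in toy_chunks:
    print(chunk)
```

Each window would inherit the label of the script it came from, and the train/validation split should then be made per script rather than per window, so that windows from one script do not land on both sides.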

In [12]:
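# Note: the Trainer used for training builds its own dataloaders from
# train_dataset and val_dataset, so these two are only needed for a manual loop
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)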

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
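
`warmup_steps=500` is large relative to this run: the progress bar in the training output shows 657 optimizer steps in total, so the learning rate is still ramping up for roughly three quarters of training. The sketch below reproduces a linear warmup followed by linear decay, assuming the `Trainer` defaults of a linear scheduler and a base learning rate of 5e-5 (neither is set explicitly in the notebook). `linear_schedule_lr` is an illustrative helper, not a library function.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=657):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (10, 100, 500, 657):
    print(step, linear_schedule_lr(step))
```

Under these assumptions, the values at steps 10 and 100 agree with the learning rates of about 1e-06 and 1e-05 reported in the training log, and the peak rate is reached only at step 500.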


In [20]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

  0%|          | 0/657 [00:00<?, ?it/s]

{'loss': 0.746, 'grad_norm': 7.989304065704346, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.05}
{'loss': 0.7009, 'grad_norm': 22.240758895874023, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.09}
{'loss': 0.7338, 'grad_norm': 7.654321670532227, 'learning_rate': 3e-06, 'epoch': 0.14}
{'loss': 0.7151, 'grad_norm': 24.089853286743164, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.18}
{'loss': 0.6902, 'grad_norm': 10.29809284210205, 'learning_rate': 5e-06, 'epoch': 0.23}
{'loss': 0.7433, 'grad_norm': 18.20090675354004, 'learning_rate': 6e-06, 'epoch': 0.27}
{'loss': 0.7144, 'grad_norm': 6.6054558753967285, 'learning_rate': 7.000000000000001e-06, 'epoch': 0.32}
{'loss': 0.7373, 'grad_norm': 10.335643768310547, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.37}
{'loss': 0.7089, 'grad_norm': 8.588286399841309, 'learning_rate': 9e-06, 'epoch': 0.41}
{'loss': 0.6832, 'grad_norm': 6.805417060852051, 'learning_rate': 1e-05, 'epoch': 0.46}
{'loss': 0.7423, 'grad_norm': 13.81

TrainOutput(global_step=657, training_loss=0.7118017698894715, metrics={'train_runtime': 22112.4558, 'train_samples_per_second': 0.118, 'train_steps_per_second': 0.03, 'total_flos': 689087853987840.0, 'train_loss': 0.7118017698894715, 'epoch': 3.0})
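
The training loss stays near 0.71 for the whole run. For comparison, a two-class classifier that assigns probability 0.5 to each class has a cross-entropy loss of ln 2 ≈ 0.693, so the model is performing at roughly chance level. That is consistent with labels assigned by alternating index rather than by genre. A short check of that baseline, with `uniform_cross_entropy` as an illustrative helper:

```python
import math

def uniform_cross_entropy(num_classes):
    """Cross-entropy loss of a classifier that predicts every class with equal probability."""
    return -math.log(1.0 / num_classes)

chance_loss = uniform_cross_entropy(2)
reported_loss = 0.7118017698894715  # training_loss from the TrainOutput above

print(f"chance-level loss: {chance_loss:.4f}")
print(f"reported loss:     {reported_loss:.4f}")
```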

In [13]:
# Save the model
model.save_pretrained("./movie_script_model")
tokenizer.save_pretrained("./movie_script_tokenizer")

('./movie_script_tokenizer/tokenizer_config.json',
 './movie_script_tokenizer/special_tokens_map.json',
 './movie_script_tokenizer/vocab.txt',
 './movie_script_tokenizer/added_tokens.json')

In [14]:
# Reload the fine-tuned model and tokenizer from disk
model = BertForSequenceClassification.from_pretrained("./movie_script_model")
tokenizer = BertTokenizer.from_pretrained("./movie_script_tokenizer")

In [15]:
# Inference function
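def recommend_movie(description, tokenizer, model, max_length):
    # Tokenize the description the same way the scripts were tokenized
    encoding = tokenizer.encode_plus(
        description,
        add_special_tokens=True,
        max_length=max_length,
        return_token_type_ids=False,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    # Disable dropout and gradient tracking for inference
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    # The classifier has two outputs, so the predicted class index is 0 or 1
    predicted_class = torch.argmax(outputs.logits, dim=1).item()
    # NOTE: indexing `titles` with a class index can only ever return
    # titles[0] or titles[1]; it is not a ranking over all movies
    return titles[predicted_class]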
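
Because the classifier only distinguishes two classes, it cannot rank individual movies. A retrieval-style alternative, closer to the goal stated in the introduction, is to represent the query and every script as vectors and return the titles whose vectors are most similar to the query. The sketch below swaps in a simple bag-of-words vector and cosine similarity purely for illustration; in practice the vectors would more likely come from a learned embedding model. `embed`, `cosine_similarity`, and `rank_titles` are hypothetical helpers, and the toy corpus is invented.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_titles(query, corpus_titles, corpus_texts, top_k=1):
    """Return the top_k titles whose text is most similar to the query."""
    query_vec = embed(query)
    scored = [
        (cosine_similarity(query_vec, embed(text)), title)
        for title, text in zip(corpus_titles, corpus_texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]

# Invented toy corpus, for illustration only.
toy_titles = ["Toy Script A", "Toy Script B"]
toy_texts = [
    "a haunted house where a ghost stalks the family",
    "a young painter leaves home to study art",
]
top_match = rank_titles("ghost in a haunted house", toy_titles, toy_texts)
print(top_match)
```

With this design, every movie in the corpus is a candidate, and adding a new script only requires computing one more vector rather than retraining a classifier.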

In [18]:
# Example usage
description = "Creative girl."
prediction = recommend_movie(description, tokenizer, model, max_length)
print(f"Recommended Movie: {prediction}")

Recommended Movie: Lost Horizon
