My first attempt at preprocessing resulted in screenplays that were still much too large for neural networks.  Here I'll try to aggressively cut down the size of the screenplays as much as possible, while hopefully still retaining relevant info.

# 0. Import Data

In [1]:
import pandas as pd
import numpy as np

root_path = r'C:\Users\bened\DataScience\ANLP\AT2\36118_NLP_Spring\CSVs'

df_aus = pd.read_csv(f'{root_path}\\df_aus.csv', index_col=0)
df_aus.head()

Unnamed: 0,imdbid,title,year,opening weekend,budget,age restrict,genres,screenplay,age restrict aus
0,120770,A Night at the Roxbury,1998,United States:,"$17,000,000 (estimated)","Argentina:13, Australia:M, Brazil:14, Canada:P...","Comedy, Music, Romance",\n\n\t\t\t A NIGHT AT THE ROXBURY \n\n\n\t\...,M
1,132512,At First Sight,1999,United States:,"$60,000,000 (estimated)","Argentina:13, Australia:M, Canada:PG::(Alberta...","Drama, Romance",AT FIRST SIGHT\n\nEXT. VALLEY - DUSK \nGold li...,M
2,118661,The Avengers,1998,"United States: $10,305,957, 16 Aug 1998","$60,000,000 (estimated)","Argentina:13, Australia:PG, Brazil:10, Canada:...","Action, Adventure, Sci-Fi, Thriller",\n\n\t\t\t\t\tTHE AVENGERS\n\n\t\t\t\tScreenpl...,PG
4,118715,The Big Lebowski,1998,"United States: $5,533,844, 08 Mar 1998","$15,000,000 (estimated)","Argentina:16, Argentina:18::(cable rating), Au...","Comedy, Crime, Sport",\n\n\t\t\tTHE BIG LEBOWSKI\n\nWe are floating ...,MA
5,112571,Boys on the Side,1995,,,"Argentina:13, Australia:MA, Canada:14A::(Manit...","Comedy, Drama",Boys on the Side\n\nSCENE 1\n\nJANE\nThank you...,MA


In [4]:
df_aus.columns

Index(['imdbid', 'title', 'year', 'opening weekend', 'budget', 'age restrict',
       'genres', 'screenplay', 'age restrict aus'],
      dtype='object')

# 1. Text Analysis

## Sentence Tokenizer

Before taking any additional steps, we're going to try to train a sentence tokenizer on the whole corpus, for later use.

In [3]:
! pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting click (from nltk)
  Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.9.11-cp311-cp311-win_amd64.whl.metadata (41 kB)
Collecting tqdm (from nltk)
  Downloading tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
   ---------------------------------------- 0.0/1.5 MB ? eta -:--:--
   ---------------------------------------- 1.5/1.5 MB 8.9 MB/s eta 0:00:00
Downloading regex-2024.9.11-cp311-cp311-win_amd64.whl (274 kB)
Using cached click-8.1.7-py3-none-any.whl (97 kB)
Downloading tqdm-4.66.5-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, click, nltk
Successfully installed click-8.1.7 nltk-3.9.1 regex-2024.9.11 tqdm-4.66.5


In [None]:
# build corpus

corpus = ("\n"*10).join(df_aus['screenplay'].values)
corpus

In [39]:
print(corpus[:1000])
print("-"*50)
print(corpus[-1000:])



			    A NIGHT AT THE ROXBURY 


					written by 

					Steve Koren 

					Will Ferrell 

						& 

					Chris Kattan 






							  June 2, 1997 




   FADE IN: 


   EXT. PANORAMIC VIEW OF LOS ANGELES - SUNSET 


   As we hear "What is Love" by HADDAWAY -- night falls and 
   partytime begins. 


   SUPERIMPOSE:  SUNSET BLVD., 11:03 PM 


							   CUT TO: 


   EXT. DANCE CLUBS - NIGHT 


   Coconut Teaser, The Palace, The Roxbury, Tatou, etc. 


							   CUT TO: 


   INT. DANCE CLUBS- QUICK SHOTS - NIGHT 


   Of random dancers -- gyrating, flirting, making out, drinking. 


							   CUT TO: 


   INT. PALACE - NIGHT 


   The CAMERA MOVES THROUGH a crowded dance floor -- and 
   SETTLES ON the rhythmically swaying backs of... 


   STEVE & DOUG BUTABI 


   Our heroes.  In their minds, Steve is tall, dark and handsome and DOUG is a little genius. Neither is correct 
   -- except for the tall and little part. 


   They simultaneously turn and scope the room.  In unison, 
  

In [None]:
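from nltk.tokenize.punkt import PunktSentenceTokenizer

# train an unsupervised Punkt sentence tokenizer on the full corpus
screenplay_tokenizer = PunktSentenceTokenizer(train_text=corpus)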

In [None]:
import pickle

with open('sentence_tokenizer.pkl', 'wb') as f:
  pickle.dump(screenplay_tokenizer, f)
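
To reuse the saved tokenizer in a later session, it can be loaded back with `pickle.load`. The sketch below is self-contained, so it pickles an untrained `PunktSentenceTokenizer` to a temporary file as a stand-in for `screenplay_tokenizer`; the temporary path and the names `stand_in`, `pkl_path` and `restored` are illustrative assumptions, not part of the notebook.

```python
import os
import pickle
import tempfile

from nltk.tokenize.punkt import PunktSentenceTokenizer

# Untrained stand-in for the corpus-trained screenplay_tokenizer
stand_in = PunktSentenceTokenizer()

# Hypothetical location; the notebook writes to 'sentence_tokenizer.pkl'
pkl_path = os.path.join(tempfile.mkdtemp(), 'sentence_tokenizer.pkl')

with open(pkl_path, 'wb') as f:
    pickle.dump(stand_in, f)

# Only unpickle files you created yourself: pickle can run arbitrary code on load
with open(pkl_path, 'rb') as f:
    restored = pickle.load(f)

print(type(restored).__name__)
```

Pickle files are tied to the installed library version, so the same nltk version should probably be used when loading.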

Let's take a look at the first screenplay in the data, 'A Night at the Roxbury', to get a sense of how to approach this.

In [6]:
roxbury_raw = df_aus.at[0, 'screenplay']
roxbury_raw[:1000]

'\n\n\t\t\t    A NIGHT AT THE ROXBURY \n\n\n\t\t\t\t\twritten by \n\n\t\t\t\t\tSteve Koren \n\n\t\t\t\t\tWill Ferrell \n\n\t\t\t\t\t\t& \n\n\t\t\t\t\tChris Kattan \n\n\n\n\n\n\n\t\t\t\t\t\t\t  June 2, 1997 \n\n\n\n\n   FADE IN: \n\n\n   EXT. PANORAMIC VIEW OF LOS ANGELES - SUNSET \n\n\n   As we hear "What is Love" by HADDAWAY -- night falls and \n   partytime begins. \n\n\n   SUPERIMPOSE:  SUNSET BLVD., 11:03 PM \n\n\n\t\t\t\t\t\t\t   CUT TO: \n\n\n   EXT. DANCE CLUBS - NIGHT \n\n\n   Coconut Teaser, The Palace, The Roxbury, Tatou, etc. \n\n\n\t\t\t\t\t\t\t   CUT TO: \n\n\n   INT. DANCE CLUBS- QUICK SHOTS - NIGHT \n\n\n   Of random dancers -- gyrating, flirting, making out, drinking. \n\n\n\t\t\t\t\t\t\t   CUT TO: \n\n\n   INT. PALACE - NIGHT \n\n\n   The CAMERA MOVES THROUGH a crowded dance floor -- and \n   SETTLES ON the rhythmically swaying backs of... \n\n\n   STEVE & DOUG BUTABI \n\n\n   Our heroes.  In their minds, Steve is tall, dark and handsome and DOUG is a little genius. Ne

- What's the first sequence here that might actually be relevant?  I would say perhaps the line "night falls and partytime begins."
- Formatting:  The first several lines are basically metadata.  This is signified by consecutive runs of \n and \t.
- The string "FADE IN" or "EXT." is roughly where the screenplay proper begins.

Let's look at a random sampling to determine a pattern.

In [9]:
import random

screenplay_texts = list(df_aus['screenplay'].values)

screenplay_sample = random.sample(screenplay_texts, 10)

# print first 1000 chars of each screenplay

for s in screenplay_sample:
  print(s[:1000])
  print("-"*50)

The script to
 Tim Burton's The Nightmare Before Christmas

NARRATOR
'Twas a long time ago, longer now than it seems, in a place 
that perhaps you've seen in your dreams.  For the story that you are 
about to be told, took place in the holiday worlds of old.  Now, you've 
probably wondered where holidays come from.  If you haven't, I'd say it's 
time you begun.

	This Is Halloween

SHADOW
Boys and girls of every age
Wouldn't you like to see something strange?

SIAMESE SHADOW
Come with us and you will see
This, our town of Halloween

PUMPKIN PATCH CHORUS
This is Halloween, this is Halloween
Pumpkins scream in the dead of night

GHOSTS
This is Halloween, everybody make a scene
Trick or treat till the neighbors gonna die of fright
It's our town, everybody scream
In this town of Halloween

CREATURE UNDER BED
I am the one hiding under your bed
Teeth ground sharp and eyes glowing red

MAN UNDER THE STAIRS
I am the one hiding under your stairs
Fingers like snakes and spiders in my hair

CORPS
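
Because `random.sample` draws a different set on every run, the screenplays printed above will change each time the cell is executed. A small sketch of a repeatable draw, assuming a seed of 42 and a hypothetical helper name `sample_texts` (neither is in the original notebook):

```python
import random

def sample_texts(texts, k=10, seed=42):
    """Return k texts chosen with a private, seeded random generator."""
    rng = random.Random(seed)  # separate generator, leaves the global random state alone
    texts = list(texts)
    return rng.sample(texts, min(k, len(texts)))

# Stand-in strings; in the notebook this would be screenplay_texts
demo_texts = [f'screenplay {i}' for i in range(50)]
first_draw = sample_texts(demo_texts)
second_draw = sample_texts(demo_texts)
print(first_draw == second_draw)
```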

- For actual screenplays, the action seems to begin with "EXT" or "INT".  Some of the data here are scripts, which don't follow this format.  So ideally, we truncate everything before EXT|INT, unless there's no match for either, in which case we truncate nothing.

## truncate opening metadata

In [15]:
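import re

# NOTE: this bare pattern also matches inside longer words such as
# INTERIOR or INTERNATIONAL, which cuts those words mid-way.
pat = re.compile(r'EXT|INT')

def truncate_metadata(screenplay):
  match = pat.search(screenplay)
  if match:
    # skip the matched marker plus the character right after it (usually the '.')
    cutoff = match.end() + 1
    # return everything after the first scene-heading marker
    return screenplay[cutoff:]
  # no marker found (e.g. a transcript-style script): return the whole text
  return screenplay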

# beta test
roxbury_truncated = truncate_metadata(roxbury_raw)
print(roxbury_truncated[:1000])

 PANORAMIC VIEW OF LOS ANGELES - SUNSET 


   As we hear "What is Love" by HADDAWAY -- night falls and 
   partytime begins. 


   SUPERIMPOSE:  SUNSET BLVD., 11:03 PM 


							   CUT TO: 


   EXT. DANCE CLUBS - NIGHT 


   Coconut Teaser, The Palace, The Roxbury, Tatou, etc. 


							   CUT TO: 


   INT. DANCE CLUBS- QUICK SHOTS - NIGHT 


   Of random dancers -- gyrating, flirting, making out, drinking. 


							   CUT TO: 


   INT. PALACE - NIGHT 


   The CAMERA MOVES THROUGH a crowded dance floor -- and 
   SETTLES ON the rhythmically swaying backs of... 


   STEVE & DOUG BUTABI 


   Our heroes.  In their minds, Steve is tall, dark and handsome and DOUG is a little genius. Neither is correct 
   -- except for the tall and little part. 


   They simultaneously turn and scope the room.  In unison, 
   their heads bop to the MUSIC.  Doug steps out from the 
   bar. 


					  DOUG 
				(to O.S. female) 
		   Hey!  You want to dance?  No? 
		   Yes?  Alright, don't worry about

We'll also keep updating the stopword bank as we go along.

In [18]:
# stop_words is defined in cell In [16] below, which was executed first
print(stop_words)

{'as', 'such', 'these', 'we', "you'd", 'him', 'to', 'out', 'nor', 've', 'hers', 'is', 'yours', 'being', 'were', 'mustn', 'which', "shan't", 'if', 'ain', 'has', "couldn't", 'they', 'until', 'further', 'how', 'because', "hadn't", 'under', 'y', "wouldn't", 'isn', 'd', 'won', 'in', 'with', 'it', 'from', 'shan', 'needn', 'or', 'after', 'down', 'your', 't', 'her', 're', 'he', 'myself', 'this', 'hasn', 'mightn', 'be', 'into', 'above', "mustn't", "you're", 'what', 'off', "aren't", 'now', 'll', 'itself', "haven't", 'ourselves', 'whom', 'about', 'shouldn', 'all', 'against', 'yourselves', 'having', 'why', 'those', 'ma', 'hadn', 'than', 'me', "weren't", 'no', "that'll", "hasn't", 'do', "it's", 'our', 'too', 'between', 'should', 'will', 'couldn', 'once', "isn't", 'am', "she's", 'only', 'through', 'for', 'where', 'himself', 'weren', "you'll", 'are', 'had', 'doesn', 'theirs', 'over', 'yourself', 'don', 'same', 'did', 'here', 'not', 'just', 'o', 'was', 'very', 'wasn', 'can', 'any', "won't", 'haven', '

In [16]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [13]:
# beta test for a transcript-style script with no EXT/INT scene headings
df_aus[df_aus['title'] == 'The Nightmare Before Christmas']  # lookup for the row label (787)
nightmare = df_aus.at[787, 'screenplay']
nightmare_truncated = truncate_metadata(nightmare)
print(nightmare_truncated[:1000])

The script to
 Tim Burton's The Nightmare Before Christmas

NARRATOR
'Twas a long time ago, longer now than it seems, in a place 
that perhaps you've seen in your dreams.  For the story that you are 
about to be told, took place in the holiday worlds of old.  Now, you've 
probably wondered where holidays come from.  If you haven't, I'd say it's 
time you begun.

	This Is Halloween

SHADOW
Boys and girls of every age
Wouldn't you like to see something strange?

SIAMESE SHADOW
Come with us and you will see
This, our town of Halloween

PUMPKIN PATCH CHORUS
This is Halloween, this is Halloween
Pumpkins scream in the dead of night

GHOSTS
This is Halloween, everybody make a scene
Trick or treat till the neighbors gonna die of fright
It's our town, everybody scream
In this town of Halloween

CREATURE UNDER BED
I am the one hiding under your bed
Teeth ground sharp and eyes glowing red

MAN UNDER THE STAIRS
I am the one hiding under your stairs
Fingers like snakes and spiders in my hair

CORPS

In [19]:
# seems to be working okay, so let's apply to all screenplays
truncated_screenplays = df_aus['screenplay'].apply(truncate_metadata)
truncated_screenplays

Unnamed: 0,screenplay
0,PANORAMIC VIEW OF LOS ANGELES - SUNSET \n\n\n...
1,VALLEY - DUSK \nGold light dappling across a ...
2,"- SALTFLATS - DAY\n\nA flat horizon, stretchin..."
4,"RIOR RALPH'S\n\nIt is late, the supermarket ..."
5,Boys on the Side\n\nSCENE 1\n\nJANE\nThank you...
...,...
2847,LIVING ROOM - HOLIDAY MORNING - 1990\nVIDEO F...
2849,QUEERS ~ LATE FALL ~ 1974\n\nA subway car rat...
2850,HIGHWAY 789 -- CENTRAL WYOMING -- DAWN.\n\nYe...
2851,RNATIONAL REVISED 4/11/97\nREVISED 5/15/97\nL...
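
Rows 4 and 2851 above begin with "RIOR RALPH'S" and "RNATIONAL REVISED", which suggests the pattern matched inside INTERIOR and INTERNATIONAL rather than at a scene heading. One possible tightening is sketched below. It assumes that headings are written as EXT, INT, EXTERIOR or INTERIOR as whole words, optionally followed by a period; the names `HEADING_PAT` and `truncate_metadata_strict` are hypothetical and the rule has not been checked against the full corpus.

```python
import re

# Assumed heading markers: EXT / INT / EXTERIOR / INTERIOR as whole words,
# with an optional trailing period. INTERNATIONAL no longer matches.
HEADING_PAT = re.compile(r'\b(?:EXT|INT)(?:ERIOR)?\b\.?')

def truncate_metadata_strict(screenplay):
    """Drop everything up to and including the first scene-heading marker."""
    match = HEADING_PAT.search(screenplay)
    if match is None:
        # no marker found: leave transcript-style scripts untouched
        return screenplay
    # cut right after the marker and trim the whitespace that follows it
    return screenplay[match.end():].lstrip()

print(truncate_metadata_strict("TITLE\n\nINTERIOR RALPH'S\n\nIt is late"))
```

Unlike the `match.end() + 1` cutoff, this version consumes the period as part of the match, so nothing is skipped when a heading has no period.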


Let's now see if there are any patterns at the end of the screenplays.

In [20]:
screenplay_sample = random.sample(screenplay_texts, 10)

# print last 1000 chars of each screenplay

for s in screenplay_sample:
  print(s[-1000:])
  print("-"*50)

es gently, leans in close
          and whispers the answer in his ear. Jonathan gazes at her a
          moment - whatever her name is, it must be perfect.

          JONATHAN
          So what now?
          She smiles up at him.

          S
          Well... are you free tonight?
          On Jonathan: treasuring these moments. He is truly free --
          perhaps for the first time in his life.

          JONATHAN
          Yes. I'm free.
          They gaze at each other.
          We leave Jonathan and "S" to their privacy, pulling back as
          they continue talking, continuing away until we're

          THROUGH A WINDOW AND INSIDE THE HOTEL ITSELF
          the window itself now a frame, letter-boxing the scene,
          people passing like fish in their tank... except for Jonathan
          and S, seated center-frame, two small figures side by side,
          face to face, talking intimately as we.

          FADE OUT.


                                      THE END




Judging by this sample, the endings carry little beyond "FADE OUT." and "THE END", so we wouldn't really be removing enough words here for it to be worthwhile.

## remove allcaps

In general, it seems that words in all capital letters are either character names or photographic directions.  Neither is really relevant to us. We will first tokenize into words, and then remove words that are all-caps.

In [21]:
# first I want to see how nltk.word_tokenize will behave
from nltk.tokenize import word_tokenize

roxbury_tokens = word_tokenize(roxbury_truncated, preserve_line=True)
print(roxbury_tokens[:1000])

['PANORAMIC', 'VIEW', 'OF', 'LOS', 'ANGELES', '-', 'SUNSET', 'As', 'we', 'hear', '``', 'What', 'is', 'Love', "''", 'by', 'HADDAWAY', '--', 'night', 'falls', 'and', 'partytime', 'begins.', 'SUPERIMPOSE', ':', 'SUNSET', 'BLVD.', ',', '11:03', 'PM', 'CUT', 'TO', ':', 'EXT.', 'DANCE', 'CLUBS', '-', 'NIGHT', 'Coconut', 'Teaser', ',', 'The', 'Palace', ',', 'The', 'Roxbury', ',', 'Tatou', ',', 'etc.', 'CUT', 'TO', ':', 'INT.', 'DANCE', 'CLUBS-', 'QUICK', 'SHOTS', '-', 'NIGHT', 'Of', 'random', 'dancers', '--', 'gyrating', ',', 'flirting', ',', 'making', 'out', ',', 'drinking.', 'CUT', 'TO', ':', 'INT.', 'PALACE', '-', 'NIGHT', 'The', 'CAMERA', 'MOVES', 'THROUGH', 'a', 'crowded', 'dance', 'floor', '--', 'and', 'SETTLES', 'ON', 'the', 'rhythmically', 'swaying', 'backs', 'of', '...', 'STEVE', '&', 'DOUG', 'BUTABI', 'Our', 'heroes.', 'In', 'their', 'minds', ',', 'Steve', 'is', 'tall', ',', 'dark', 'and', 'handsome', 'and', 'DOUG', 'is', 'a', 'little', 'genius.', 'Neither', 'is', 'correct', '--', '

In [23]:
# filter out all-caps tokens (character names, scene headings, camera directions)
def filter_allcaps(tokens):
  return [t for t in tokens if not t.isupper()]

roxbury_lower = filter_allcaps(roxbury_tokens)
print(roxbury_lower[:1000])

['-', 'As', 'we', 'hear', '``', 'What', 'is', 'Love', "''", 'by', '--', 'night', 'falls', 'and', 'partytime', 'begins.', ':', ',', '11:03', ':', '-', 'Coconut', 'Teaser', ',', 'The', 'Palace', ',', 'The', 'Roxbury', ',', 'Tatou', ',', 'etc.', ':', '-', 'Of', 'random', 'dancers', '--', 'gyrating', ',', 'flirting', ',', 'making', 'out', ',', 'drinking.', ':', '-', 'The', 'a', 'crowded', 'dance', 'floor', '--', 'and', 'the', 'rhythmically', 'swaying', 'backs', 'of', '...', '&', 'Our', 'heroes.', 'In', 'their', 'minds', ',', 'Steve', 'is', 'tall', ',', 'dark', 'and', 'handsome', 'and', 'is', 'a', 'little', 'genius.', 'Neither', 'is', 'correct', '--', 'except', 'for', 'the', 'tall', 'and', 'little', 'part.', 'They', 'simultaneously', 'turn', 'and', 'scope', 'the', 'room.', 'In', 'unison', ',', 'their', 'heads', 'bop', 'to', 'the', 'Doug', 'steps', 'out', 'from', 'the', 'bar.', '(', 'to', 'female', ')', 'Hey', '!', 'You', 'want', 'to', 'dance', '?', 'No', '?', 'Yes', '?', 'Alright', ',', 'do
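
The output above keeps tokens such as '-' and '11:03' but drops 'EXT.' and 'CLUBS-'. That follows from how `str.isupper()` works: it is true only when a string has at least one cased character and all cased characters are uppercase. A side effect is that the pronoun 'I' also counts as all-caps. The sketch below prints the behaviour for a few tokens and shows one possible variant that keeps single-character tokens; `filter_allcaps_keep_short` and its `min_len` parameter are hypothetical, and whether keeping 'I' is desirable here is an open question.

```python
sample_tokens = ['EXT.', 'CLUBS-', '-', '11:03', 'Steve', 'DOUG', 'I', "I'm"]

# Show which tokens str.isupper() treats as all-caps
for t in sample_tokens:
    print(f'{t!r:10} isupper={t.isupper()}')

def filter_allcaps_keep_short(tokens, min_len=2):
    """Drop all-caps tokens, but keep those shorter than min_len (e.g. 'I')."""
    return [t for t in tokens if not (t.isupper() and len(t) >= min_len)]

kept = filter_allcaps_keep_short(sample_tokens)
print(kept)
```

The trade-off is that a stray single capital from a heading (for example 'A') would also survive this variant.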

In [28]:
# now filter out punctuation tokens
import string

# a set gives fast membership checks; the extras are multi-character tokens emitted by word_tokenize
puncts = set(string.punctuation)
puncts.update(['``', '--', '...', "''"])
# print(puncts)

def filter_puncts(tokens):
  return [t for t in tokens if t not in puncts]

roxbury_unpunctuated = filter_puncts(roxbury_lower)
print(roxbury_unpunctuated[:1000])

['As', 'we', 'hear', 'What', 'is', 'Love', 'by', 'night', 'falls', 'and', 'partytime', 'begins.', '11:03', 'Coconut', 'Teaser', 'The', 'Palace', 'The', 'Roxbury', 'Tatou', 'etc.', 'Of', 'random', 'dancers', 'gyrating', 'flirting', 'making', 'out', 'drinking.', 'The', 'a', 'crowded', 'dance', 'floor', 'and', 'the', 'rhythmically', 'swaying', 'backs', 'of', 'Our', 'heroes.', 'In', 'their', 'minds', 'Steve', 'is', 'tall', 'dark', 'and', 'handsome', 'and', 'is', 'a', 'little', 'genius.', 'Neither', 'is', 'correct', 'except', 'for', 'the', 'tall', 'and', 'little', 'part.', 'They', 'simultaneously', 'turn', 'and', 'scope', 'the', 'room.', 'In', 'unison', 'their', 'heads', 'bop', 'to', 'the', 'Doug', 'steps', 'out', 'from', 'the', 'bar.', 'to', 'female', 'Hey', 'You', 'want', 'to', 'dance', 'No', 'Yes', 'Alright', 'do', "n't", 'worry', 'about', 'it.', 'Doug', 'rejected', 'steps', 'back', 'as', 'Steve', 'steps', 'out.', 'to', 'female', 'Do', 'you', 'want', 'to', 'dance', 'You', 'do', 'You', 'd

In [29]:
# filter out stopwords (case-sensitive, so capitalised forms such as 'The' and 'As' survive)
def remove_stopwords(tokens):
  return [t for t in tokens if t not in stop_words]

roxbury_nonstop = remove_stopwords(roxbury_unpunctuated)
print(roxbury_nonstop[:1000])

['As', 'hear', 'What', 'Love', 'night', 'falls', 'partytime', 'begins.', '11:03', 'Coconut', 'Teaser', 'The', 'Palace', 'The', 'Roxbury', 'Tatou', 'etc.', 'Of', 'random', 'dancers', 'gyrating', 'flirting', 'making', 'drinking.', 'The', 'crowded', 'dance', 'floor', 'rhythmically', 'swaying', 'backs', 'Our', 'heroes.', 'In', 'minds', 'Steve', 'tall', 'dark', 'handsome', 'little', 'genius.', 'Neither', 'correct', 'except', 'tall', 'little', 'part.', 'They', 'simultaneously', 'turn', 'scope', 'room.', 'In', 'unison', 'heads', 'bop', 'Doug', 'steps', 'bar.', 'female', 'Hey', 'You', 'want', 'dance', 'No', 'Yes', 'Alright', "n't", 'worry', 'it.', 'Doug', 'rejected', 'steps', 'back', 'Steve', 'steps', 'out.', 'female', 'Do', 'want', 'dance', 'You', 'You', "n't", 'Not', 'problem.', 'They', 'strangers', 'rejection', 'neither', 'fazed.', 'Doug', 'enthusiastically', 'steps', 'towards', 'two', 'attractive', 'girls.', 'Hey', 'wan', 'na', 'Two', 'attractive', 'girls', 'turn', 'backs', 'Doug.', 'remai

More will be removed after converting to lowercase, but I'm not sure I want to do this yet because I want to preserve sentence boundaries.
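
One way to catch capitalised stopwords without lowercasing the tokens themselves is to lowercase only for the membership test. This is a minimal sketch, not the notebook's method: `remove_stopwords_caseless` is a hypothetical name, and `STOP_WORDS_DEMO` is a tiny stand-in for the NLTK `stop_words` set so the example runs on its own.

```python
# Tiny stand-in for the NLTK stopword set used in the notebook
STOP_WORDS_DEMO = {'as', 'we', 'the', 'of', 'in', 'is', 'and', 'a'}

def remove_stopwords_caseless(tokens, stop_words):
    """Drop stopwords regardless of case, leaving surviving tokens unchanged."""
    return [t for t in tokens if t.lower() not in stop_words]

demo_tokens = ['As', 'we', 'hear', 'The', 'Palace', 'begins.']
filtered = remove_stopwords_caseless(demo_tokens, STOP_WORDS_DEMO)
print(filtered)
```

The surviving tokens keep their original capitalisation, so any later step that relies on case still has it available.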

First, though, we can remove all numbers.

In [30]:
# keep purely alphabetic tokens; besides numbers, this also drops tokens with
# attached punctuation, such as 'begins.' and "n't"
def remove_numbers(tokens):
  return [t for t in tokens if t.isalpha()]

roxbury_alpha = remove_numbers(roxbury_nonstop)
print(roxbury_alpha[:1000])

['As', 'hear', 'What', 'Love', 'night', 'falls', 'partytime', 'Coconut', 'Teaser', 'The', 'Palace', 'The', 'Roxbury', 'Tatou', 'Of', 'random', 'dancers', 'gyrating', 'flirting', 'making', 'The', 'crowded', 'dance', 'floor', 'rhythmically', 'swaying', 'backs', 'Our', 'In', 'minds', 'Steve', 'tall', 'dark', 'handsome', 'little', 'Neither', 'correct', 'except', 'tall', 'little', 'They', 'simultaneously', 'turn', 'scope', 'In', 'unison', 'heads', 'bop', 'Doug', 'steps', 'female', 'Hey', 'You', 'want', 'dance', 'No', 'Yes', 'Alright', 'worry', 'Doug', 'rejected', 'steps', 'back', 'Steve', 'steps', 'female', 'Do', 'want', 'dance', 'You', 'You', 'Not', 'They', 'strangers', 'rejection', 'neither', 'Doug', 'enthusiastically', 'steps', 'towards', 'two', 'attractive', 'Hey', 'wan', 'na', 'Two', 'attractive', 'girls', 'turn', 'backs', 'remaining', 'positive', 'No', 'Maybe', 'see', 'Doug', 'steps', 'Steve', 'spots', 'end', 'dances', 'Hey', 'want', 'dance', 'cheerfully', 'Alright', 'know', 'Steve', 
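
Comparing this output with the previous one, 'begins.', 'etc.' and 'heroes.' have disappeared along with '11:03', because `str.isalpha()` rejects any token containing a non-letter. If the goal is only to drop numbers, a narrower filter could test for digits instead. The helper name `remove_numeric_tokens` below is hypothetical:

```python
def remove_numeric_tokens(tokens):
    """Drop only tokens that contain at least one digit."""
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]

demo_tokens = ['partytime', 'begins.', '11:03', 'Coconut', "n't"]
digit_free = remove_numeric_tokens(demo_tokens)
print(digit_free)
```

This keeps sentence-final tokens such as 'begins.', which matters if sentence boundaries are to be recovered afterwards.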

Now we'll rejoin the text so we can sentence tokenize.

In [36]:
from nltk.tokenize.punkt import PunktSentenceTokenizer

sentence_tokenizer = PunktSentenceTokenizer()
# sentences_from_tokens returns a generator, and slicing it directly raised
# "TypeError: 'generator' object is not subscriptable", so materialise it first
roxbury_sentences = list(sentence_tokenizer.sentences_from_tokens(tokens=roxbury_alpha))
print(roxbury_sentences[:100])
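
By this point the token list has lost most of its sentence-final periods, so there is little boundary information left for Punkt to work with. One alternative ordering, sketched here as an assumption rather than the notebook's method, is to split the text into sentences first and filter tokens within each sentence. The helper `sentences_then_filter` is hypothetical, it uses an untrained `PunktSentenceTokenizer` in place of the corpus-trained one, and it splits on whitespace instead of calling `word_tokenize` to stay self-contained.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

def sentences_then_filter(text, tokenizer, token_filter):
    """Split text into sentences, then keep only tokens passing token_filter."""
    cleaned = []
    for sentence in tokenizer.tokenize(text):
        kept = [t for t in sentence.split() if token_filter(t)]
        if kept:  # skip sentences that end up empty
            cleaned.append(kept)
    return cleaned

demo_text = 'Our heroes. In their minds, Steve is tall. DOUG steps out from the bar.'
demo_sentences = sentences_then_filter(
    demo_text,
    PunktSentenceTokenizer(),      # untrained stand-in for screenplay_tokenizer
    lambda t: not t.isupper(),     # same all-caps rule as filter_allcaps
)
print(demo_sentences)
```

Each inner list is one sentence, so the sentence structure survives whatever token filters are applied afterwards.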