Here I'll take a look at the BERT-annotated screenplay data and see whether it can be used to predict film classifications.

# Australian Film Classification Guidelines

Helpful resources for how film classifications are determined can be found here: 

https://www.classification.gov.au/classification-ratings/how-rating-decided

Some relevant excerpts below: 

Approved classification tools use logic rules and algorithms to classify content.

The guidelines include 6 classifiable elements. These are:

- themes
- violence
- sex
- language
- drug use
- nudity.

Impact

The impact depends on the frequency, intensity and the overall effect of the content. The purpose, tone and style can affect impact.

Impact may increase where depictions are:

- detailed
- prolonged
- realistic
- interactive.

Impact may be lower where content is:

- implied rather than depicted
- not detailed
- short in duration
- verbal and not visual
- incidental and not direct.

The level of impact allowed in each classification category (rating):

| Rating | Impact level |
|---|---|
| G | Very mild |
| PG | Mild |
| M | Moderate |
| MA 15+ | Strong |
| R 18+ | High |
| RC | Very high |
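
Read literally, the table above implies that the rating is set by the highest impact level found across the classifiable elements. A minimal sketch of that reading follows. It deliberately ignores context, and the function name `rating_from_impacts` and the example scores are made up for illustration.

```python
# Impact levels in increasing order, paired with the rating that permits them
IMPACT_TO_RATING = [
    ("very mild", "G"),
    ("mild", "PG"),
    ("moderate", "M"),
    ("strong", "MA 15+"),
    ("high", "R 18+"),
    ("very high", "RC"),
]
IMPACT_ORDER = [level for level, _ in IMPACT_TO_RATING]


def rating_from_impacts(impacts):
    """Return the rating implied by the highest-impact element.

    `impacts` maps each classifiable element to an impact level string.
    This is a simplification: the real guidelines also weigh context.
    """
    highest = max(impacts.values(), key=IMPACT_ORDER.index)
    return dict(IMPACT_TO_RATING)[highest]


# hypothetical per-element impact levels for one film
example = {
    "themes": "mild", "violence": "moderate", "sex": "very mild",
    "language": "mild", "drug use": "very mild", "nudity": "very mild",
}
print(rating_from_impacts(example))
```

Here the single "moderate" element (violence) is enough to push the film to M, even though every other element would fit within PG.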

<i> So we have six classifiable elements, and six impact levels corresponding to the six ratings. The goal, then, is to build a model that scores the impact level of the screenplay on each of these six elements and uses those six scores to predict the classification. For example, we might expect a neural network architecture to look like:

embedding layer(vector for each {sentence, word} in screenplay)
- > 6 (?) neurons, one for each classifiable element
( -> context layer? )
- > 6 (?) neurons, an impact score for each element 
- > 1 output neuron for classification
OR 
- > an output layer per country classification desired? </i> 
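
To make the idea above concrete, here is a toy forward pass in plain Python. It is only a sketch under several assumptions that are not settled in this notebook: the weights are random and untrained, `EMBED_DIM` is an arbitrary embedding size, the output is a softmax over the six ratings rather than a single neuron, and there is no context layer.

```python
import math
import random

random.seed(0)  # reproducible toy weights

ELEMENTS = ["themes", "violence", "sex", "language", "drug use", "nudity"]
RATINGS = ["G", "PG", "M", "MA 15+", "R 18+", "RC"]
EMBED_DIM = 8  # hypothetical embedding size


def dense(n_in, n_out):
    """Create a randomly initialised weight matrix and a zero bias vector."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layer, x):
    """Affine transform: one output per row of the weight matrix."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-z)) for z in values]


def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in values]
    total = sum(exps)
    return [e / total for e in exps]


element_layer = dense(EMBED_DIM, len(ELEMENTS))      # one neuron per element
impact_layer = dense(len(ELEMENTS), len(ELEMENTS))   # one impact score per element
rating_layer = dense(len(ELEMENTS), len(RATINGS))    # one score per rating


def predict(embedding):
    """Map a screenplay embedding to per-element impact scores and rating probabilities."""
    element_scores = sigmoid(forward(element_layer, embedding))
    impact_scores = sigmoid(forward(impact_layer, element_scores))
    rating_probs = softmax(forward(rating_layer, impact_scores))
    return dict(zip(ELEMENTS, impact_scores)), dict(zip(RATINGS, rating_probs))


# stand-in for a real screenplay embedding
screenplay_embedding = [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
impacts, rating_probs = predict(screenplay_embedding)
print(impacts)
print(rating_probs)
```

Keeping the six impact scores as a named intermediate output is what would later allow the model to report which elements drove the rating.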

Context

Context determines whether the storyline justifies content. Content that falls into a particular rating in one context may fall outside it in a different context.

For example, the way the content deals with social issues may require a mature or adult perspective. Historical context may also justify certain depictions.

<i> Should there be a separate context layer then? </i>

Consumer advice

Under section 20 of the Act, a classification decision must include consumer advice.

Consumer advice gives information about the content. It usually describes the classifiable elements with the greatest impact.

For example, a film classified PG may have consumer advice of 'Mild violence and coarse language'. This means that the film is PG because the violence and coarse language are mild in impact.

<i> Ideally, our model would therefore also output this consumer advice. </i>
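
As a rough illustration of how consumer advice could be derived from per-element impact levels, the sketch below lists the elements that share the highest impact. The function name `consumer_advice` and the input values are hypothetical, and the wording is simplified: real advice uses phrases such as "coarse language", whereas this sketch just reuses the element names.

```python
IMPACT_ORDER = ["very mild", "mild", "moderate", "strong", "high", "very high"]


def consumer_advice(impacts):
    """Name the element(s) with the highest impact level, e.g. 'Mild violence and language'."""
    top = max(impacts.values(), key=IMPACT_ORDER.index)
    elements = [name for name, level in impacts.items() if level == top]
    if len(elements) > 1:
        listed = ", ".join(elements[:-1]) + " and " + elements[-1]
    else:
        listed = elements[0]
    return f"{top.capitalize()} {listed}"


# hypothetical impact levels for a PG film
advice = consumer_advice({
    "themes": "very mild", "violence": "mild", "sex": "very mild",
    "language": "mild", "drug use": "very mild", "nudity": "very mild",
})
print(advice)
```

For this input the function returns `Mild violence and language`, which mirrors the PG example quoted above.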





# 0. Import Data and Assemble DataFrame

In [19]:
# replace paths here
root_path = r'C:\\Users\bened\DataScience\ANLP\AT2'

import os 
import re

folder_path = f'{root_path}\\BERT_annotations'
screenplays_annot = {}

# match C0/C1 control characters, but leave \t, \n and \r alone:
# the parsing further down relies on newlines to separate annotation lines
hex_pat = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]')

# list all files in folder and iterate over them 
for file_name in os.listdir(folder_path):
    # get file_path by joining folder path with file_name
    file_path = os.path.join(folder_path, file_name)
    # ensure path points to an actual file
    if os.path.isfile(file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
            # remove control characters
            cleaned_content = hex_pat.sub('', content)
            # assign the cleaned content to its file_name
            screenplays_annot[file_name] = cleaned_content

# ensure files were imported correctly by printing a sample of the first ten files 
i = 0
for file_name, content in screenplays_annot.items():
    if i == 10:
        break
    else:
        print(f"Example of {file_name}:\n")
        print(content[:100])
        print("-"*50)
        i += 1

Example of 10 Cloverfield Lane_1179933_anno.txt:

dialog: The Cellar
dialog: by
dialog: Josh Campbell & Matt Stuecken
speaker_heading: DARKNESS
dialog
--------------------------------------------------
Example of 10 Things I Hate About You_0147800_anno.txt:

dialog: 
text: TEN THINGS I HATE ABOUT YOU
dialog: 
dialog: written by Karen McCullah Lutz &amp; Kir
--------------------------------------------------
Example of 12 Angry Men_0118528_anno.txt:

scene_heading: PLEASE COPY AND RETURN |
dialog: ———_————_
dialog: 
scene_heading: TWELVE ANGRY MEN
d
--------------------------------------------------
Example of 12 Monkeys_0114746_anno.txt:

dialog: 
speaker_heading: TWELVE MONKEYS
dialog: 
dialog: An original screenplay by
dialog: David Pe
--------------------------------------------------
Example of 12 Years a Slave_2024544_anno.txt:

dialog: 
speaker_heading: 12 YEARS A SLAVE
dialog: Written by
dialog: John Ridley
speaker_heading: C
--------------------------------------------------
Ex

In [20]:
# search values of screenplays_annot for data corruption pattern 
import re 

corruptions = set()
for val in screenplays_annot.values():
    corrupted = re.findall(hex_pat, repr(val))
    for c in corrupted:
        corruptions.add(c)

print(corruptions)

set()
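
One detail worth keeping in mind when reading the result above: `repr()` renders control characters as printable escape sequences, so a control-character pattern applied to `repr(val)` cannot match them. A small sketch with a made-up line (the variable names here are illustrative only):

```python
import re

control_pat = re.compile(r'[\x00-\x1F\x7F-\x9F]')
sample = "INT. HOUSE\x0c - NIGHT"  # made-up line containing a form feed

# repr() turns the form feed into the four printable characters \x0c
matches_in_repr = control_pat.findall(repr(sample))
# searching the raw string sees the actual control character
matches_in_raw = control_pat.findall(sample)

print(repr(sample))
print(matches_in_repr, matches_in_raw)
```

Searching the raw string is therefore the more direct test for control characters, while searching the `repr()` is only useful for patterns that target the escaped form.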


In [21]:
print(list(screenplays_annot.keys())[:10])

['10 Cloverfield Lane_1179933_anno.txt', '10 Things I Hate About You_0147800_anno.txt', '12 Angry Men_0118528_anno.txt', '12 Monkeys_0114746_anno.txt', '12 Years a Slave_2024544_anno.txt', '127 Hours_1542344_anno.txt', '13 13 13_2991516_anno.txt', '1408_0450385_anno.txt', '1492 Conquest of Paradise_0103594_anno.txt', '15 Minutes_0179626_anno.txt']


In [22]:
# find names of screenplays with hexadecimal chars 
counter = 0
for key, val in screenplays_annot.items():
    if re.search(hex_pat, repr(val)):
        print(key)
        counter += 1

print(f"{counter} files contain hexadecimal chars")

42_0453562_anno.txt
Ace Ventura Pet Detective_0109040_anno.txt
Airplane II The Sequel_0083530_anno.txt
Airplane_0080339_anno.txt
Alone in the Dark_0369226_anno.txt
Amadeus_0086879_anno.txt
American Psycho_0144084_anno.txt
American Shaolin_0101327_anno.txt
American Splendor_0305206_anno.txt
An Act of Murder_0040072_anno.txt
Annie Hall_0075686_anno.txt
April Fool s Day_0090655_anno.txt
Argo_1024648_anno.txt
At First Sight_0132512_anno.txt
Bad Lieutenant_0103759_anno.txt
Barton Fink_0101410_anno.txt
Batman Robin_0118688_anno.txt
Batman Year One_1672723_anno.txt
Bedlam_0038343_anno.txt
Being John Malkovich_0120601_anno.txt
Blade Trinity_0359013_anno.txt
Blade_0120611_anno.txt
Braveheart_0112573_anno.txt
Bridesmaids_1478338_anno.txt
Burn After Reading_0887883_anno.txt
Burning Annie_0307879_anno.txt
Candle to Water_2387411_anno.txt
Carrie_0074285_anno.txt
Cat People_0034587_anno.txt
Chasing Amy_0118842_anno.txt
Chinatown_0071315_anno.txt
Cinema Paradiso_0095765_anno.txt
Coco_2380307_anno.txt

## Global Functions

In [23]:
import numpy as np

In [24]:
def print_first_lines(dict_list, n):
    for idx, d in enumerate(dict_list):
        if idx == n:
            break
        else:
            print(d)

In [25]:
def find_avg_length(series):
    avg_length = np.mean([len(d) for d in series])
    print("Average length:", avg_length)

## 0.1 Join with metadata

In [27]:
! pip install pandas

Collecting pandas
  Using cached pandas-2.2.3-cp311-cp311-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.2.3-cp311-cp311-win_amd64.whl (11.6 MB)
Using cached pytz-2024.2-py2.py3-none-any.whl (508 kB)
Using cached tzdata-2024.2-py2.py3-none-any.whl (346 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.3 pytz-2024.2 tzdata-2024.2


In [28]:
import pandas as pd

meta_df = pd.read_csv(f'{root_path}\\movie_meta_data.csv')
meta_df.head()

Unnamed: 0,imdbid,title,akas,year,metascore,imdb user rating,number of imdb user votes,awards,opening weekend,producers,...,casting directors,cast,countries,age restrict,plot,plot outline,keywords,genres,taglines,synopsis
0,120770,A Night at the Roxbury,"Une nuit au Roxbury (France), Movida en el Rox...",1998,26,6,56537,,United States:,"Marie Cantin, Erin Fraser, Amy Heckerling, Ste...",...,Jeff Greenberg,"Will Ferrell, Chris Kattan, Raquel Gardner, Vi...",United States,"Argentina:13, Australia:M, Brazil:14, Canada:P...",Two dim-witted brothers dream of owning their ...,"The Roxbury Guys, Steve and Doug Butabi, want ...","woman-on-top, nightclub, car-accident, 1990s, ...","Comedy, Music, Romance",Score!,
1,132512,At First Sight,"Sight Unseen (United States), Premier regard (...",1999,40,6,12922,,United States:,"Rob Cowan, Roger Paradiso, Irwin Winkler",...,"Kerry Barden, Billy Hopkins, Suzanne Smith","Val Kilmer, Mira Sorvino, Kelly McGillis, Stev...",United States,"Argentina:13, Australia:M, Canada:PG::(Alberta...",A blind man has an operation to regain his sig...,First Sight is true to the title from start to...,"visual-agnosia, brother-sister-relationship, r...","Drama, Romance","Only Love Can Bring You To Your Senses., Scien...",
2,118661,The Avengers,"Chapeau melon et bottes de cuir (France), Mit ...",1998,12,3,40784,"FMCJ Award 1998, Golden Reel Award 1999, Razzi...","United States: $10,305,957, 16 Aug 1998","Susan Ekins, Jerry Weintraub",...,Susie Figgis,"Ralph Fiennes, Uma Thurman, Sean Connery, Patr...",United States,"Argentina:13, Australia:PG, Brazil:10, Canada:...",Two British Agents team up to stop Sir August ...,"British Ministry Agent John Steed, under direc...","good-versus-evil, heroine, evil-man, villain, ...","Action, Adventure, Sci-Fi, Thriller","Mrs. Peel, we're needed., Extraordinary crimes...",
3,215545,Bamboozled,"The Very Black Show (France), It's Showtime (G...",2000,54,6,10373,"Golden Berlin Bear 2001, Black Reel 2001, Imag...",United States:,"Jon Kilik, Spike Lee, Kisha Imani Cameron",...,Aisha Coley,"Damon Wayans, Savion Glover, Jada Pinkett Smit...",United States,"Australia:MA, Finland:K-15, France:Tous public...",A frustrated African-American TV writer propos...,"Dark, biting satire of the television industry...","television-industry, african-american, referen...","Comedy, Drama, Music",Starring the great negroe actors,"In a New York City residence, Pierre Delacroix..."
4,118715,The Big Lebowski,"El gran Lebowski (Spain), O Grande Lebowski (P...",1998,71,8,724388,"Honorable Mention 1998, ACCA 1998, Golden Berl...","United States: $5,533,844, 08 Mar 1998","Tim Bevan, John Cameron, Ethan Coen, Eric Fell...",...,John S. Lyons,"Jeff Bridges, John Goodman, Julianne Moore, St...","United States, United Kingdom","Argentina:16, Argentina:18::(cable rating), Au...","Jeff ""The Dude"" Lebowski, mistaken for a milli...","When ""the dude"" Lebowski is mistaken for a mil...","rug, nihilism, pornographer, bowling-alley, de...","Comedy, Crime, Sport",Hay quienes tratan de ganarse la vida sin move...,A tumbleweed rolls up a hillside just outside ...


In [29]:
# take a look at filename format

filenames = list(screenplays_annot.keys())
print(filenames[:10])

['10 Cloverfield Lane_1179933_anno.txt', '10 Things I Hate About You_0147800_anno.txt', '12 Angry Men_0118528_anno.txt', '12 Monkeys_0114746_anno.txt', '12 Years a Slave_2024544_anno.txt', '127 Hours_1542344_anno.txt', '13 13 13_2991516_anno.txt', '1408_0450385_anno.txt', '1492 Conquest of Paradise_0103594_anno.txt', '15 Minutes_0179626_anno.txt']


In [30]:
# filenames are formatted as <movie title>_<IMDb id>_anno.txt
filenames = list(screenplays_annot.keys())
movie_titles = []
ids = []
for f in filenames:
    # split from the right so an underscore inside a title cannot shift the fields
    parts = f.rsplit("_", 2)
    movie_title = parts[0]
    imdb_id = parts[1]
    movie_titles.append(movie_title)
    ids.append(imdb_id)
# print the first ten (title, id) pairs as a sanity check
for title, imdb_id in zip(movie_titles[:10], ids[:10]):
    print("Title:", title, " ID:", imdb_id)

Title: 10 Cloverfield Lane  ID: 1179933
Title: 10 Things I Hate About You  ID: 0147800
Title: 12 Angry Men  ID: 0118528
Title: 12 Monkeys  ID: 0114746
Title: 12 Years a Slave  ID: 2024544
Title: 127 Hours  ID: 1542344
Title: 13 13 13  ID: 2991516
Title: 1408  ID: 0450385
Title: 1492 Conquest of Paradise  ID: 0103594
Title: 15 Minutes  ID: 0179626


In [31]:
screenplays_df = pd.DataFrame({
    'imdbid': ids,
    'annot_screenplay': screenplays_annot.values()
})
screenplays_df.head()

Unnamed: 0,imdbid,annot_screenplay
0,1179933,dialog: The Cellar\ndialog: by\ndialog: Josh C...
1,147800,dialog: \ntext: TEN THINGS I HATE ABOUT YOU\nd...
2,118528,scene_heading: PLEASE COPY AND RETURN |\ndialo...
3,114746,dialog: \nspeaker_heading: TWELVE MONKEYS\ndia...
4,2024544,dialog: \nspeaker_heading: 12 YEARS A SLAVE\nd...


In [32]:
print(screenplays_df.info())
print(meta_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1998 entries, 0 to 1997
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   imdbid            1998 non-null   object
 1   annot_screenplay  1998 non-null   object
dtypes: object(2)
memory usage: 31.3+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2858 entries, 0 to 2857
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   imdbid                     2858 non-null   int64 
 1   title                      2858 non-null   object
 2   akas                       2652 non-null   object
 3   year                       2858 non-null   int64 
 4   metascore                  2858 non-null   int64 
 5   imdb user rating           2858 non-null   int64 
 6   number of imdb user votes  2858 non-null   int64 
 7   awards                     2243 non-null   object
 8   op

In [33]:
# convert screenplays imdbid to int
screenplays_df['imdbid'] = screenplays_df['imdbid'].astype(int)
df = meta_df.merge(screenplays_df, on='imdbid')
df.head()

Unnamed: 0,imdbid,title,akas,year,metascore,imdb user rating,number of imdb user votes,awards,opening weekend,producers,...,cast,countries,age restrict,plot,plot outline,keywords,genres,taglines,synopsis,annot_screenplay
0,120770,A Night at the Roxbury,"Une nuit au Roxbury (France), Movida en el Rox...",1998,26,6,56537,,United States:,"Marie Cantin, Erin Fraser, Amy Heckerling, Ste...",...,"Will Ferrell, Chris Kattan, Raquel Gardner, Vi...",United States,"Argentina:13, Australia:M, Brazil:14, Canada:P...",Two dim-witted brothers dream of owning their ...,"The Roxbury Guys, Steve and Doug Butabi, want ...","woman-on-top, nightclub, car-accident, 1990s, ...","Comedy, Music, Romance",Score!,,dialog: \ntext: A NIGHT AT THE ROXBURY\ndialog...
1,132512,At First Sight,"Sight Unseen (United States), Premier regard (...",1999,40,6,12922,,United States:,"Rob Cowan, Roger Paradiso, Irwin Winkler",...,"Val Kilmer, Mira Sorvino, Kelly McGillis, Stev...",United States,"Argentina:13, Australia:M, Canada:PG::(Alberta...",A blind man has an operation to regain his sig...,First Sight is true to the title from start to...,"visual-agnosia, brother-sister-relationship, r...","Drama, Romance","Only Love Can Bring You To Your Senses., Scien...",,scene_heading: AT FIRST SIGHT\nscene_heading: ...
2,118661,The Avengers,"Chapeau melon et bottes de cuir (France), Mit ...",1998,12,3,40784,"FMCJ Award 1998, Golden Reel Award 1999, Razzi...","United States: $10,305,957, 16 Aug 1998","Susan Ekins, Jerry Weintraub",...,"Ralph Fiennes, Uma Thurman, Sean Connery, Patr...",United States,"Argentina:13, Australia:PG, Brazil:10, Canada:...",Two British Agents team up to stop Sir August ...,"British Ministry Agent John Steed, under direc...","good-versus-evil, heroine, evil-man, villain, ...","Action, Adventure, Sci-Fi, Thriller","Mrs. Peel, we're needed., Extraordinary crimes...",,dialog: \nspeaker_heading: THE AVENGERS\ndialo...
3,215545,Bamboozled,"The Very Black Show (France), It's Showtime (G...",2000,54,6,10373,"Golden Berlin Bear 2001, Black Reel 2001, Imag...",United States:,"Jon Kilik, Spike Lee, Kisha Imani Cameron",...,"Damon Wayans, Savion Glover, Jada Pinkett Smit...",United States,"Australia:MA, Finland:K-15, France:Tous public...",A frustrated African-American TV writer propos...,"Dark, biting satire of the television industry...","television-industry, african-american, referen...","Comedy, Drama, Music",Starring the great negroe actors,"In a New York City residence, Pierre Delacroix...",dialog: Bamboozled\ndialog: by\ndialog: Spike ...
4,118715,The Big Lebowski,"El gran Lebowski (Spain), O Grande Lebowski (P...",1998,71,8,724388,"Honorable Mention 1998, ACCA 1998, Golden Berl...","United States: $5,533,844, 08 Mar 1998","Tim Bevan, John Cameron, Ethan Coen, Eric Fell...",...,"Jeff Bridges, John Goodman, Julianne Moore, St...","United States, United Kingdom","Argentina:16, Argentina:18::(cable rating), Au...","Jeff ""The Dude"" Lebowski, mistaken for a milli...","When ""the dude"" Lebowski is mistaken for a mil...","rug, nihilism, pornographer, bowling-alley, de...","Comedy, Crime, Sport",Hay quienes tratan de ganarse la vida sin move...,A tumbleweed rolls up a hillside just outside ...,dialog: \nscene_heading: THE BIG LEBOWSKI\ntex...
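
Two things happen in the cell above. The zero-padded ID strings from the filenames are converted to integers so that they match the `int64` IDs in the metadata, and `merge` performs an inner join by default, so rows without a partner in the other frame are dropped without warning. The set-based sketch below illustrates both points with made-up IDs. It is a stand-in for inspecting the real frames, not part of the pipeline.

```python
# hypothetical IDs: filename IDs are zero-padded strings, metadata IDs are ints
screenplay_ids = ["0147800", "1179933", "9999999"]
meta_ids = [147800, 1179933, 120770]

# leading zeros vanish on conversion, so the two sources become comparable
screenplay_int_ids = {int(s) for s in screenplay_ids}

matched = screenplay_int_ids & set(meta_ids)
unmatched_screenplays = screenplay_int_ids - set(meta_ids)  # lost in an inner join
unmatched_meta = set(meta_ids) - screenplay_int_ids         # also lost in an inner join

print(len(matched), unmatched_screenplays, unmatched_meta)
```

Comparing the size of `matched` with the row counts of both inputs is a quick way to see how many screenplays and metadata rows an inner join would discard.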


# 1. Data Annotations

In [34]:
roxbury_annot = df.at[0, 'annot_screenplay']
print(roxbury_annot[:1000])

dialog: 
text: A NIGHT AT THE ROXBURY
dialog: written by
dialog: Steve Koren
dialog: Will Ferrell
dialog: &
dialog: Chris Kattan
dialog: June 2, 1997
speaker_heading: FADE IN:
scene_heading: EXT. PANORAMIC VIEW OF LOS ANGELES - SUNSET
text: As we hear "What is Love" by HADDAWAY -- night falls and
text: partytime begins.
scene_heading: SUPERIMPOSE: SUNSET BLVD., 11:03 PM
speaker_heading: CUT TO:
scene_heading: EXT. DANCE CLUBS - NIGHT
dialog: Coconut Teaser, The Palace, The Roxbury, Tatou, etc.
speaker_heading: CUT TO:
scene_heading: INT. DANCE CLUBS- QUICK SHOTS - NIGHT
text: Of random dancers -- gyrating, flirting, making out, drinking.
speaker_heading: CUT TO:
scene_heading: INT. PALACE - NIGHT
text: The CAMERA MOVES THROUGH a crowded dance floor -- and
text: SETTLES ON the rhythmically swaying backs of...
speaker_heading: STEVE & DOUG BUTABI
text: Our heroes. In their minds, Steve is tall, dark and handsome and DOUG is a little genius. Neither is correct
dialog: -- except for the ta

Each newline (`\n`) starts a new `label: data` pair, so each screenplay can be parsed into a JSON-like list of records.

## 1.1 Format Text Data as JSONs

In [35]:
# define a function to format a screenplay as a JSON-like list of records
## output should look like e.g.:
## [{"label": "data"}, {"label": "data"}, ...]

def format_as_json(screenplay):
    # store results as a list of single-key {label: data} dicts
    screenplay_data = []
    # each line of the annotated screenplay holds one 'label: data' pair
    for line in screenplay.split("\n"):
        # split at the first colon only; the data itself may contain colons (e.g. 'CUT TO:')
        label, sep, data = line.partition(":")
        # skip lines without a colon, since they carry no label
        if sep:
            # drop the single space that follows the colon, if present
            screenplay_data.append({label: data.removeprefix(" ")})
    # return list
    return screenplay_data

# beta test on roxbury 
roxbury_data = format_as_json(roxbury_annot)
print(roxbury_data[:100])

[{'dialog': ''}, {'text': 'A NIGHT AT THE ROXBURY'}, {'dialog': 'written by'}, {'dialog': 'Steve Koren'}, {'dialog': 'Will Ferrell'}, {'dialog': '&'}, {'dialog': 'Chris Kattan'}, {'dialog': 'June 2, 1997'}, {'speaker_heading': 'FADE IN:'}, {'scene_heading': 'EXT. PANORAMIC VIEW OF LOS ANGELES - SUNSET'}, {'text': 'As we hear "What is Love" by HADDAWAY -- night falls and'}, {'text': 'partytime begins.'}, {'scene_heading': 'SUPERIMPOSE: SUNSET BLVD., 11:03 PM'}, {'speaker_heading': 'CUT TO:'}, {'scene_heading': 'EXT. DANCE CLUBS - NIGHT'}, {'dialog': 'Coconut Teaser, The Palace, The Roxbury, Tatou, etc.'}, {'speaker_heading': 'CUT TO:'}, {'scene_heading': 'INT. DANCE CLUBS- QUICK SHOTS - NIGHT'}, {'text': 'Of random dancers -- gyrating, flirting, making out, drinking.'}, {'speaker_heading': 'CUT TO:'}, {'scene_heading': 'INT. PALACE - NIGHT'}, {'text': 'The CAMERA MOVES THROUGH a crowded dance floor -- and'}, {'text': 'SETTLES ON the rhythmically swaying backs of...'}, {'speaker_heading'
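
Strictly speaking, the result above is a Python list of dicts rather than JSON text. If the records ever need to be written to disk or passed to another tool, the standard `json` module converts in both directions. A minimal sketch using three records copied from the output above:

```python
import json

# a few records in the same shape as the parsed screenplay
records = [
    {"text": "A NIGHT AT THE ROXBURY"},
    {"speaker_heading": "FADE IN:"},
    {"scene_heading": "SUPERIMPOSE: SUNSET BLVD., 11:03 PM"},
]

# serialise to a JSON string, then parse it back
as_json = json.dumps(records, indent=2)
restored = json.loads(as_json)

print(as_json)
```

The round trip is lossless for this structure because it contains only lists, dicts and strings.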

In [36]:
# now we'll apply this logic to the whole corpus to see if labels are the same

screenplay_jsons = df['annot_screenplay'].apply(format_as_json)

In [37]:
print(screenplay_jsons[10][:100])

[{'dialog': ''}, {'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by'}, {'dialog': 'Michael Mann &amp; Christopher Crowe'}, {'speaker_heading': 'FADE IN'}, {'text': 'The screen is a microcosm of leaf, crystal drops of precipitation, a stone,'}, {'text': "emerald green moss. It's a landscape in miniature. We HEAR the forest. Some"}, {'text': 'distant birds. Their sound seems to reverberate as if in a cavern. A piece of'}, {'text': 'sunlight refracts within the drops of water, paints a patch of moss yellow. The'}, {'text': 'whisper of wind is joined by another sound that mixes with it. A distant'}, {'text': "rustling. It gets closer and louder. It's shallow breathing. It gets ominous."}, {'text': "We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'dialog': ''}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running 

## 1.2 Analyze labels

In [38]:
# assess the unique keys (labels)
## iterate through list and return set of unique labels
unique_labels = {key for d in roxbury_data for key in d.keys()}
print(unique_labels)

{'text', 'speaker_heading', 'scene_heading', 'dialog'}
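
Beyond which labels occur, it can be useful to know how often each one occurs, for example to see whether a screenplay is dominated by dialog or by description. A small sketch with `collections.Counter` on toy records (the records are illustrative, not taken from the corpus):

```python
from collections import Counter

# toy records in the same single-key shape as the parsed screenplays
records = [
    {"scene_heading": "INT. PALACE - NIGHT"},
    {"text": "The CAMERA MOVES THROUGH a crowded dance floor -- and"},
    {"text": "SETTLES ON the rhythmically swaying backs of..."},
    {"speaker_heading": "STEVE & DOUG BUTABI"},
    {"dialog": "Hello."},
]

# count how many records carry each label
label_counts = Counter(label for record in records for label in record)
print(label_counts.most_common())
```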


In [39]:
def find_unique_labels(json):
    unique_labels = {key for d in json for key in d.keys()}
    return unique_labels

unique_labels_series = screenplay_jsons.apply(find_unique_labels)


## 1.3 Find and drop rows where data is empty

In [40]:
unequal_length = []
for series in unique_labels_series:
    if len(series) != 4:
        unequal_length.append(series)

print(unequal_length)
    

[{'speaker_heading', 'text', 'dialog'}, {'text', 'scene_heading', 'dialog'}, set(), set(), set()]


In [41]:
# some annotations have only three labels, which is fine, but others appear to be empty, which we should investigate
empty_series = []
for idx, series in enumerate(unique_labels_series):
    if len(series) == 0:
        empty_series.append(idx)

In [42]:
missing_imdbid = df.loc[empty_series, 'imdbid']
df.loc[empty_series]

Unnamed: 0,imdbid,title,akas,year,metascore,imdb user rating,number of imdb user votes,awards,opening weekend,producers,...,cast,countries,age restrict,plot,plot outline,keywords,genres,taglines,synopsis,annot_screenplay
1034,1837703,The Fifth Estate,"The 5ifth Estate (United States), The Man Who ...",2013,49,6,38595,"Britannia Award 2013, COFCA Award 2014, Audien...",United States:,"Leifur B. Dagfinnsson, Hilde De Laere, Steve G...",...,"Peter Capaldi, David Thewlis, Anatole Taubman,...","United States, India, Belgium","Argentina:13, Australia:M, Canada:PG::(British...",A dramatic thriller based on real events that ...,The story begins as WikiLeaks founder Julian A...,"internet, pantyhose, red-pantyhose, female-sto...","Biography, Crime, Drama, Thriller",You can't expose the world's secrets without e...,,
1477,99892,Joe Versus the Volcano,"Joe contre le volcan (France), Joe gegen den V...",1990,45,5,34277,Felix 2011,United States:,"Kathleen Kennedy, Frank Marshall, Roxanne Roge...",...,"Tom Hanks, Meg Ryan, Lloyd Bridges, Robert Sta...",United States,"Argentina:13, Australia:PG, Canada:PG, Canada:...","When a hypochondriac learns that he is dying, ...",Joe versus the Volcano is a fable which opens ...,"tom-hanks, surrealism, terminal-illness, suici...","Comedy, Romance","An Average Joe. An Adventurous Comedy., A stor...",,
1639,4364194,The Peanut Butter Falcon,"Le Cri du faucon (France), La familia que tú e...",2019,70,7,61007,"AFCA Award 2020, Audience Award 2019, Special ...",United States:,"Albert Berger, Carmella Casinelli, Manu Gargi,...",...,"Zack Gottsagen, Ann Owens, Dakota Johnson, Bru...",United States,"Australia:M, Austria:10, Belgium:KT/EA, Canada...",Zak runs away from his care home to make his d...,The Peanut Butter Falcon is an adventure story...,"down-syndrome, wrestling, friendship, bare-che...","Adventure, Comedy, Drama",,,


In [43]:
screenplays_df[screenplays_df['imdbid'].isin(missing_imdbid)]

Unnamed: 0,imdbid,annot_screenplay
769,99892,
1513,1837703,
1702,4364194,


If you look at the source data, you'll find the .txt files for these screenplays are simply empty.  We'll drop them.

In [44]:
df_clean = df.drop(empty_series)
df_clean.head()

Unnamed: 0,imdbid,title,akas,year,metascore,imdb user rating,number of imdb user votes,awards,opening weekend,producers,...,cast,countries,age restrict,plot,plot outline,keywords,genres,taglines,synopsis,annot_screenplay
0,120770,A Night at the Roxbury,"Une nuit au Roxbury (France), Movida en el Rox...",1998,26,6,56537,,United States:,"Marie Cantin, Erin Fraser, Amy Heckerling, Ste...",...,"Will Ferrell, Chris Kattan, Raquel Gardner, Vi...",United States,"Argentina:13, Australia:M, Brazil:14, Canada:P...",Two dim-witted brothers dream of owning their ...,"The Roxbury Guys, Steve and Doug Butabi, want ...","woman-on-top, nightclub, car-accident, 1990s, ...","Comedy, Music, Romance",Score!,,dialog: \ntext: A NIGHT AT THE ROXBURY\ndialog...
1,132512,At First Sight,"Sight Unseen (United States), Premier regard (...",1999,40,6,12922,,United States:,"Rob Cowan, Roger Paradiso, Irwin Winkler",...,"Val Kilmer, Mira Sorvino, Kelly McGillis, Stev...",United States,"Argentina:13, Australia:M, Canada:PG::(Alberta...",A blind man has an operation to regain his sig...,First Sight is true to the title from start to...,"visual-agnosia, brother-sister-relationship, r...","Drama, Romance","Only Love Can Bring You To Your Senses., Scien...",,scene_heading: AT FIRST SIGHT\nscene_heading: ...
2,118661,The Avengers,"Chapeau melon et bottes de cuir (France), Mit ...",1998,12,3,40784,"FMCJ Award 1998, Golden Reel Award 1999, Razzi...","United States: $10,305,957, 16 Aug 1998","Susan Ekins, Jerry Weintraub",...,"Ralph Fiennes, Uma Thurman, Sean Connery, Patr...",United States,"Argentina:13, Australia:PG, Brazil:10, Canada:...",Two British Agents team up to stop Sir August ...,"British Ministry Agent John Steed, under direc...","good-versus-evil, heroine, evil-man, villain, ...","Action, Adventure, Sci-Fi, Thriller","Mrs. Peel, we're needed., Extraordinary crimes...",,dialog: \nspeaker_heading: THE AVENGERS\ndialo...
3,215545,Bamboozled,"The Very Black Show (France), It's Showtime (G...",2000,54,6,10373,"Golden Berlin Bear 2001, Black Reel 2001, Imag...",United States:,"Jon Kilik, Spike Lee, Kisha Imani Cameron",...,"Damon Wayans, Savion Glover, Jada Pinkett Smit...",United States,"Australia:MA, Finland:K-15, France:Tous public...",A frustrated African-American TV writer propos...,"Dark, biting satire of the television industry...","television-industry, african-american, referen...","Comedy, Drama, Music",Starring the great negroe actors,"In a New York City residence, Pierre Delacroix...",dialog: Bamboozled\ndialog: by\ndialog: Spike ...
4,118715,The Big Lebowski,"El gran Lebowski (Spain), O Grande Lebowski (P...",1998,71,8,724388,"Honorable Mention 1998, ACCA 1998, Golden Berl...","United States: $5,533,844, 08 Mar 1998","Tim Bevan, John Cameron, Ethan Coen, Eric Fell...",...,"Jeff Bridges, John Goodman, Julianne Moore, St...","United States, United Kingdom","Argentina:16, Argentina:18::(cable rating), Au...","Jeff ""The Dude"" Lebowski, mistaken for a milli...","When ""the dude"" Lebowski is mistaken for a mil...","rug, nihilism, pornographer, bowling-alley, de...","Comedy, Crime, Sport",Hay quienes tratan de ganarse la vida sin move...,A tumbleweed rolls up a hillside just outside ...,dialog: \nscene_heading: THE BIG LEBOWSKI\ntex...


In [45]:
screenplay_jsons.drop(empty_series, inplace=True)
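
Empty screenplays could also be flagged before any parsing takes place, by testing the raw file contents directly. The sketch below shows the idea on a hypothetical dictionary shaped like `screenplays_annot`. The filenames and contents are invented.

```python
# hypothetical stand-in for the {file_name: content} dictionary
screenplays = {
    "Film A_0000001_anno.txt": "dialog: Hello\ntext: She waves.",
    "Film B_0000002_anno.txt": "",
    "Film C_0000003_anno.txt": "  \n",
}

# a file counts as empty if nothing but whitespace remains
empty_files = [name for name, content in screenplays.items() if not content.strip()]
print(empty_files)
```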

So we have 'scene_heading', 'speaker_heading', 'text' and 'dialog' labels.  Let's look at an example screenplay to see what might be worth removing. 

## 'text'

- 'text' is likely to be relevant, e.g. {'text': 'Of random dancers -- gyrating, flirting, making out, drinking.'}
- 'scene_heading' is conceivably relevant, e.g. {'scene_heading': 'INT. DANCE CLUBS- QUICK SHOTS - NIGHT'} -- a scene set in a nightclub might predict a higher classification. 

Let's look at a representative sample of 'text'. 

In [46]:
# iterate through jsons and return data for 'text' label 
roxbury_texts = [d.get('text') for d in roxbury_data if 'text' in d]
print(roxbury_texts[:100])

['A NIGHT AT THE ROXBURY', 'As we hear "What is Love" by HADDAWAY -- night falls and', 'partytime begins.', 'Of random dancers -- gyrating, flirting, making out, drinking.', 'The CAMERA MOVES THROUGH a crowded dance floor -- and', 'SETTLES ON the rhythmically swaying backs of...', 'Our heroes. In their minds, Steve is tall, dark and handsome and DOUG is a little genius. Neither is correct', 'They simultaneously turn and scope the room. In unison,', 'their heads bop to the MUSIC. Doug steps out from the', 'Doug, rejected, steps back as Steve steps out.', 'They are no strangers to rejection, so neither is fazed.', 'Doug enthusiastically steps towards two attractive girls.', 'Two attractive girls turn their backs to Doug.', 'Doug steps back. Steve spots GIRL AT end of BAR and', 'dances over to her.', 'Steve steps back. Doug sees a pretty woman on a balcony,', 'waving to someone.', 'Pretty woman waves them off, frustrated, and dissapears.', 'They turn around to the bar, bartender is standi

In [47]:
print(roxbury_texts[-100:])

['I got it. I walk down the aisle.', 'skirmishes.', 'heavy, I step in. Like a', 'FATHER WILLIAMS, a grey-haired priest, Phil Donahue-type,', 'walks over.', "He's in regular black priest garb. He exits, confused.", 'Steve walks up to Mr. Butabi, who is waiting with the', 'procession. WEDDING MARCH BEGINS.', 'Richard Grieco, in tux, walks down the aisle with a', 'frumpy BRIDESMAID.', 'Craig, the best man, begins walking down with the maid of', 'Craig takes pulse, looks at his watch. Grandma and', 'Grandma Butabi walk down the aisle. As they approach', 'Walk down the aisle. Steve is wearing a CD walkman --', 'Mr. Butabi notices and yanks it off. They pass a pretty', 'girl. Steve veers off course.', 'Mr. Butabi pulls Steve back on course.', 'and her parents walk down the aisle...', 'Emily goes back to walking gracefully. MARCH ENDS.', 'Nobody speaks. Beat.', 'Steve steps forward, takes out a piece of paper, reads.', "Steve steps back. Priest looks to see if Steve's done.", 'Mr. Butabi hand

'text' entries are most likely relevant. 

## Flattening Data

In some cases a single sentence is split across several consecutive entries that share the same key. The function below finds these contiguous entries and merges them into one value, which will make sentence tokenization more meaningful later on. 

In [48]:
# we'll define a more general function this time that takes a key input
def flatten_data(dict_list, key):
    flattened_data = []
    temp = ''
    for d in dict_list:
        if key in d:
            temp += ' ' + d[key] if temp else d[key]
        else:
            # if a key other than input is encountered and temp is not empty
            if temp:
                # append the concatenated string to text list 
                flattened_data.append({key:temp})
                # and reset temp 
                temp = ''
            # append non text dict to list 
            flattened_data.append(d)
    # after loop ends, concatenate what's left in temp if anything
    if temp:
        flattened_data.append({key:temp})
    # and return concatenated list
    return flattened_data
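
As an alternative sketch, `itertools.groupby` can merge contiguous runs for every label in a single pass instead of one key at a time. Note that this is not the function above: `merge_runs` is a hypothetical name, it merges all labels at once (including headings), and like the function above it discards runs that contain only empty strings.

```python
from itertools import groupby


def merge_runs(records):
    """Merge consecutive single-key dicts that share a label into one dict."""
    merged = []
    # group neighbouring records by their (only) key
    for label, run in groupby(records, key=lambda d: next(iter(d))):
        # keep non-empty values only, so blank lines add no stray spaces
        parts = [d[label] for d in run if d[label]]
        if parts:
            merged.append({label: " ".join(parts)})
    return merged


# toy records in the same shape as the parsed screenplays
toy = [
    {"dialog": ""},
    {"dialog": "Written by"},
    {"dialog": "John Orloff"},
    {"text": "TITLES BEGIN over the"},
    {"text": "SOUNDS of city traffic."},
    {"speaker_heading": "FADE UP:"},
]
result = merge_runs(toy)
print(result)
```

Whether merging every label at once is desirable depends on the downstream task. Keeping the per-key version gives finer control over which labels are flattened.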

In [49]:
# test function on the film 'Anonymous'
anon = screenplay_jsons[100]
anon_text_flattened = flatten_data(anon, 'text')
print(anon_text_flattened[:100])

[{'dialog': ''}, {'speaker_heading': 'ANONYMOUS'}, {'dialog': 'Written by'}, {'dialog': 'John Orloff'}, {'scene_heading': '1 BLACK SCREEN 1'}, {'text': 'TITLES BEGIN over the SOUNDS of city traffic.'}, {'speaker_heading': 'FADE UP:'}, {'scene_heading': '2 EXT. THEATER DISTRICT OF BROADWAY - DUSK 2'}, {'text': 'The sidewalks are filled with theater-goers heading for their shows. Cabs line the streets.'}, {'speaker_heading': 'SIDE ALLEY'}, {'text': 'A cab quickly turns into the alley, coming to a screeching halt. A Man in a Grey Suit jumps out and rushes to the side entrance of a theater. In the background we see that the title of the play, "Anonymous", is written on the theatre\'s marquee...'}, {'scene_heading': '3 INT. BROADWAY THEATER - BACKSTAGE - DUSK 3'}, {'text': 'We follow the Man in the Grey Suit as he rushes through narrow backstage hallways, passing several ACTORS dressing in Elizabethan costumes, applying their make-'}, {'dialog': 'up, etc...'}, {'scene_heading': 'TITLES CONT

In [50]:
# and now for dialog
anon_dialog_flattened = flatten_data(anon_text_flattened, 'dialog')
for i, d in enumerate(anon_dialog_flattened):
    if i == 50:
        break
    else:
        print(d)

{'speaker_heading': 'ANONYMOUS'}
{'dialog': 'Written by John Orloff'}
{'scene_heading': '1 BLACK SCREEN 1'}
{'text': 'TITLES BEGIN over the SOUNDS of city traffic.'}
{'speaker_heading': 'FADE UP:'}
{'scene_heading': '2 EXT. THEATER DISTRICT OF BROADWAY - DUSK 2'}
{'text': 'The sidewalks are filled with theater-goers heading for their shows. Cabs line the streets.'}
{'speaker_heading': 'SIDE ALLEY'}
{'text': 'A cab quickly turns into the alley, coming to a screeching halt. A Man in a Grey Suit jumps out and rushes to the side entrance of a theater. In the background we see that the title of the play, "Anonymous", is written on the theatre\'s marquee...'}
{'scene_heading': '3 INT. BROADWAY THEATER - BACKSTAGE - DUSK 3'}
{'text': 'We follow the Man in the Grey Suit as he rushes through narrow backstage hallways, passing several ACTORS dressing in Elizabethan costumes, applying their make-'}
{'dialog': 'up, etc...'}
{'scene_heading': 'TITLES CONTINUE.'}
{'scene_heading': '4 INT. BROADWAY T

This appears to have worked, so we'll apply it to all screenplays for both the 'text' and 'dialog' keys.

In [51]:
screenplays_flat_txt = screenplay_jsons.apply(flatten_data, key='text')
screenplays_flat = screenplays_flat_txt.apply(flatten_data, key='dialog')

In [52]:
print(screenplays_flat[10][:100])

[{'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}, {'speaker_heading': 'FADE IN'}, {'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathi

In [53]:
del screenplays_flat_txt

## speaker_heading

'speaker_heading' is likely irrelevant, but let's take a look

In [54]:
speaker_headings = [d.get('speaker_heading') for d in roxbury_data if 'speaker_heading' in d]
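
One way to actually look at a list like `speaker_headings` is to count its distinct values. The sketch below uses `collections.Counter` on a small hypothetical stand-in for `roxbury_data` (the sample entries are made up for illustration).

```python
from collections import Counter

# hypothetical stand-in for roxbury_data: a list of single-key dicts
sample = [
    {'speaker_heading': 'STEVE'},
    {'dialog': 'What is up?'},
    {'speaker_heading': 'DOUG'},
    {'dialog': 'Not much.'},
    {'speaker_heading': 'STEVE'},
]

# collect the speaker headings and count how often each one occurs
headings = [d['speaker_heading'] for d in sample if 'speaker_heading' in d]
heading_counts = Counter(headings)
print(heading_counts.most_common(2))
```

If the most frequent values are all character names or transitions, that supports dropping the key.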

'speaker_heading' is almost certainly irrelevant. 

In [55]:
# drop speaker_heading data 
def decapitate_speakers(json_list):
    decapitated = [d for d in json_list if 'speaker_heading' not in d]
    return decapitated 

# # test on roxbury 
# roxbury_decapitated = decapitate_speakers(roxbury_data)
# print(roxbury_decapitated[:100])

In [57]:
# free up RAM 
del df, filenames, content, meta_df, ids, movie_titles, roxbury_annot, roxbury_data, screenplays_annot, screenplays_df, speaker_headings, unique_labels_series

In [58]:
# apply to all texts 
decapitated_screenplays = screenplays_flat.apply(decapitate_speakers)

In [59]:
print(decapitated_screenplays[10][:100])

[{'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}, {'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathing. We'll learn later"}, {'dialo

In [60]:
del screenplays_flat

Note: we can likely remove any values that:
- are empty 
- contain ':' -- these seem to be camera directions
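
Before dropping everything that contains ':', it may be worth checking what the rule would catch. The sketch below splits a list of dicts on that condition; `split_on_colon` is a hypothetical helper, and the sample values are taken from outputs shown elsewhere in this notebook.

```python
def split_on_colon(dict_list):
    """Return (with_colon, without_colon) lists of dicts."""
    with_colon, without_colon = [], []
    for d in dict_list:
        # a dict counts as a match if any of its values contains ':'
        if any(':' in str(val) for val in d.values()):
            with_colon.append(d)
        else:
            without_colon.append(d)
    return with_colon, without_colon

sample = [
    {'scene_heading': 'SUPERIMPOSE: SUNSET BLVD., 11:03 PM'},
    {'scene_heading': 'INT. PALACE - NIGHT'},
    {'text': 'He stumbles over a TOY GIRAFFE -- it SQUEAKS, and Dave sleepily mumbles:'},
]
with_colon, without_colon = split_on_colon(sample)
print(len(with_colon), len(without_colon))
```

In this sample the rule also catches a narrative 'text' value that merely ends in a colon, so a blanket ':' filter would likely remove more than camera directions.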

### Remove Empty Strings

In [62]:
# remove empty strings

import string 
puncts = set(string.punctuation)

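def remove_nulls(json_list):
    non_nulls = []
    # use `d` rather than `dict` to avoid shadowing the built-in type
    for d in json_list:
        valid = True
        for val in d.values():
            # drop values that are empty or consist only of punctuation
            if val == '' or all(char in puncts for char in val):
                valid = False
                break
        if valid:
            non_nulls.append(d)
    return non_nulls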

# # beta test on roxbury 
# roxbury_nonna = remove_nulls(roxbury_decapitated)
# print(roxbury_nonna[:100])

In [63]:
print(decapitated_screenplays[10][:100])

[{'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}, {'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathing. We'll learn later"}, {'dialo

In [64]:
# apply to all data 
screenplays_nonna = decapitated_screenplays.apply(remove_nulls)
print(screenplays_nonna[10][:100])

[{'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}, {'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathing. We'll learn later"}, {'dialo

In [65]:
i = 0
for s in screenplays_nonna[10]:
    if i == 10:
        break
    print(s)
    i += 1

{'scene_heading': 'THE LAST OF THE MOHICANS'}
{'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}
{'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}
{'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}
{'text': 'rockets through the frame scaring us and ...'}
{'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}
{'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathing. We'll learn later"}
{'dialog': 'thi

It's possible that values in all caps are basically irrelevant. Let's collect these and take a look at them. 

In [66]:
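import re

# matches strings that contain no lowercase letters
pattern = re.compile(r'^[^a-z]+$')

def no_lower(dict_list):
    no_lowers = []
    for d in dict_list:
        # keep the dict if any of its values has no lowercase letters
        if any(pattern.match(val) for val in d.values()):
            no_lowers.append(d)
    return no_lowers

# test on mohicans
mohicans = screenplays_nonna[10]
mohicans_no_lowers = no_lower(mohicans)
for i, j in enumerate(mohicans_no_lowers):
    # stop after the first 20 entries (compare the index, not the dict)
    if i == 20:
        break
    print(j)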

{'scene_heading': 'THE LAST OF THE MOHICANS'}
{'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}
{'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}
{'dialog': 'CUT TO ...'}
{'scene_heading': 'ANOTHER PART OF THE FOREST - MASSIVE WAR CLUB - DAY'}
{'scene_heading': 'WIDE ANGLE: CHINGACHGOOK'}
{'dialog': 'CUT TO ...'}
{'scene_heading': 'ANOTHER PART OF THE FOREST - LONG BLACK HAIR - DAY'}
{'scene_heading': "HAWKEYE'S POV: A PIECE OF TAN"}
{'scene_heading': 'SUDDENLY HE STOPS'}
{'scene_heading': "HAWKEYE'S POV: RACK FOCUS THROUGH THE GUN SIGHT"}
{'scene_heading': 'EXTERIOR - INTERIOR CAMERON CABIN - JOHN CAMERON - NIGHT'}
{'scene_heading': 'EXTERIOR CAMERON CABIN, DOORWAY - CAMERON - NIGHT'}
{'scene_heading': 'FENCE: CHINGACHGOOK'}
{'dialog': 'CUT TO ...'}
{'scene_heading': 'INTERIOR CABIN - CHINGACHGOOK - EVENING (LATER)'}
{'dialog': 'CUT TO ...'}
{'dialog': 'CUT TO ...'}
{'scene_heading': 'EXTERIOR BRITISH ENCAMPMENT, PARADE GROUND - SIX HUNDRED'}
{'scene_heading': 'REGIMENTAL SG

There is potentially relevant information in here, e.g. 'GUN', 'MASSIVE WAR CLUB'. However, we can delete all strings that match 'CUT TO ...'. 
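
A plain search for the substring 'CUT' would also drop any value that merely contains those letters. As a hedged alternative, the sketch below anchors the pattern so that only standalone transition lines are removed. The name `delete_transitions`, the exact regex, and the third sample value are assumptions for illustration.

```python
import re

# assumption: a transition is a value consisting only of 'CUT TO',
# optionally followed by dots, colons or whitespace
TRANSITION = re.compile(r'^\s*CUT TO\b[\s.:]*$')

def delete_transitions(dict_list):
    """Keep dicts whose values are not standalone 'CUT TO' transitions."""
    return [
        d for d in dict_list
        if not any(TRANSITION.match(str(val)) for val in d.values())
    ]

sample = [
    {'dialog': 'CUT TO ...'},
    {'scene_heading': 'WIDE ANGLE: CHINGACHGOOK'},
    {'text': 'Hawkeye CUTS the rope.'},  # hypothetical line containing 'CUT'
]
kept = delete_transitions(sample)
print(kept)
```

Here only the first entry is removed; the narrative line survives even though it contains 'CUT'.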

In [67]:
print(screenplays_nonna[0][:10])

[{'text': 'A NIGHT AT THE ROXBURY'}, {'dialog': 'written by Steve Koren Will Ferrell & Chris Kattan June 2, 1997'}, {'scene_heading': 'EXT. PANORAMIC VIEW OF LOS ANGELES - SUNSET'}, {'text': 'As we hear "What is Love" by HADDAWAY -- night falls and partytime begins.'}, {'scene_heading': 'SUPERIMPOSE: SUNSET BLVD., 11:03 PM'}, {'scene_heading': 'EXT. DANCE CLUBS - NIGHT'}, {'dialog': 'Coconut Teaser, The Palace, The Roxbury, Tatou, etc.'}, {'scene_heading': 'INT. DANCE CLUBS- QUICK SHOTS - NIGHT'}, {'text': 'Of random dancers -- gyrating, flirting, making out, drinking.'}, {'scene_heading': 'INT. PALACE - NIGHT'}]


In [68]:
def delete_cuts(dict_list):
    # empty list for filtered dicts
    dicts_uncut = []
    for d in dict_list:
        # keep the dict only if none of its values contains the substring 'CUT'
        # (case-sensitive, so this also drops any value that merely contains 'CUT')
        if all(not re.search(r'CUT', str(val)) for val in d.values()):
            # then append to list
            dicts_uncut.append(d)
    return dicts_uncut 

# test on mohicans_no_lowers
mohicans_uncut = delete_cuts(mohicans_no_lowers)
for i, j in enumerate(mohicans_uncut):
    # stop after the first 20 entries (compare the index, not the dict)
    if i == 20:
        break
    print(j)

{'scene_heading': 'THE LAST OF THE MOHICANS'}
{'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}
{'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}
{'scene_heading': 'ANOTHER PART OF THE FOREST - MASSIVE WAR CLUB - DAY'}
{'scene_heading': 'WIDE ANGLE: CHINGACHGOOK'}
{'scene_heading': 'ANOTHER PART OF THE FOREST - LONG BLACK HAIR - DAY'}
{'scene_heading': "HAWKEYE'S POV: A PIECE OF TAN"}
{'scene_heading': 'SUDDENLY HE STOPS'}
{'scene_heading': "HAWKEYE'S POV: RACK FOCUS THROUGH THE GUN SIGHT"}
{'scene_heading': 'EXTERIOR - INTERIOR CAMERON CABIN - JOHN CAMERON - NIGHT'}
{'scene_heading': 'EXTERIOR CAMERON CABIN, DOORWAY - CAMERON - NIGHT'}
{'scene_heading': 'FENCE: CHINGACHGOOK'}
{'scene_heading': 'INTERIOR CABIN - CHINGACHGOOK - EVENING (LATER)'}
{'scene_heading': 'EXTERIOR BRITISH ENCAMPMENT, PARADE GROUND - SIX HUNDRED'}
{'scene_heading': 'REGIMENTAL SGT. MAJOR'}
{'scene_heading': 'EXTERIOR ROAD - HORSES GALLOP - DAY'}
{'scene_heading': 'INTERIOR COACH - MAJOR DUNCAN HEYWARD 

In [69]:
print(mohicans[:10])

[{'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}, {'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathing. We'll learn later"}, {'dialo

In [70]:
mohicans_uncut = delete_cuts(mohicans)

In [71]:
print(mohicans_uncut[:10])

[{'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}, {'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathing. We'll learn later"}, {'dialo

In [72]:
print(screenplays_nonna[10][:10])

[{'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}, {'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathing. We'll learn later"}, {'dialo

In [73]:
# apply all 
import numpy as np

# average length before filtering
avg_length_before = np.mean([len(s) for s in screenplays_nonna])
print("avg length before:", avg_length_before)

screenplays_uncut = screenplays_nonna.apply(delete_cuts)

# average length after filtering
avg_length_after = np.mean([len(s) for s in screenplays_uncut])
print("avg_length_after:", avg_length_after)

avg length before: 1852.1598997493734
avg_length_after: 1846.6005012531327


In [74]:
print(screenplays_uncut[10][:10])

[{'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}, {'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathing. We'll learn later"}, {'dialo

In [75]:
del screenplays_nonna

In [76]:
df_clean.loc[200]

imdbid                                                                 1488555
title                                                            The Change-Up
akas                         Échange standard (France), Wie ausgewechselt (...
year                                                                      2011
metascore                                                                   39
imdb user rating                                                             6
number of imdb user votes                                               165797
awards                                                 Young Artist Award 2012
opening weekend                        United States: $13,531,115, 07 Aug 2011
producers                    Joseph M. Caracciolo Jr., David Dobkin, Jeff K...
budget                                                 $52,000,000 (estimated)
script department                                  Gail Hunter, Debbie Walters
production companies         Universal Pictures, Rel

In [77]:
# print another random sample to see where we're at 
change_up = screenplays_uncut[200]
for i, s in enumerate(change_up):
    if i == 10:
        break
    print(s)

{'dialog': 'Written by Jon Lucas &amp; Scott Moore July 31, 2009 '}
{'scene_heading': 'OPEN ON: PEACEFUL BLACK STILLNESS'}
{'text': 'Then we hear a baby SCREAM BLOODY MURDER. Then a second baby joins in, even more shrill than the first. Finally, we hear'}
{'dialog': 'the worst two words a parent can ever hear:'}
{'dialog': 'Your turn.'}
{'dialog': 'Fuck.'}
{'scene_heading': 'INT. SUBURBAN HOUSE -- NIGHT'}
{'text': 'DAVE LOCKWOOD, 30, bleary-eyed father of three, shuffles through his well-appointed suburban home, passing a grandfather clock reading 3:45. He stumbles over a TOY GIRAFFE -- it SQUEAKS, and Dave sleepily mumbles:'}
{'dialog': 'Sorry Hank.'}
{'scene_heading': 'INT. NURSERY-- NIGHT'}


# Sentence Tokenization

We'll sentence tokenize the values first before removing punctuation marks etc. 

In [78]:
# ! pip install nltk

In [79]:
print(mohicans_uncut[:10])

[{'scene_heading': 'THE LAST OF THE MOHICANS'}, {'dialog': 'Written by Michael Mann &amp; Christopher Crowe'}, {'text': "The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss. It's a landscape in miniature. We HEAR the forest. Some distant birds. Their sound seems to reverberate as if in a cavern. A piece of sunlight refracts within the drops of water, paints a patch of moss yellow. The whisper of wind is joined by another sound that mixes with it. A distant rustling. It gets closer and louder. It's shallow breathing. It gets ominous. We're interlopers on the floor of the forest and something is coming."}, {'scene_heading': 'SUDDENLY: A MOCCASINED FOOT'}, {'text': 'rockets through the frame scaring us and ...'}, {'scene_heading': 'EXTREMELY CLOSE: PART OF AN INDIAN FACE'}, {'text': "running hard. His head shaved bald except for a scalp-lock. Tattoos. He's twenty-five. He seems tall and muscled. Heavy, even breathing. We'll learn later"}, {'dialo

In [81]:
! pip install nltk

Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting joblib (from nltk)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Using cached regex-2024.9.11-cp311-cp311-win_amd64.whl.metadata (41 kB)
Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Using cached regex-2024.9.11-cp311-cp311-win_amd64.whl (274 kB)
Using cached joblib-1.4.2-py3-none-any.whl (301 kB)
Installing collected packages: regex, joblib, nltk
Successfully installed joblib-1.4.2 nltk-3.9.1 regex-2024.9.11


In [83]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')

# try out first on mohicans
mohicans_sents = []

for d in mohicans_uncut:
    # empty dict for storing result 
    sents_dict = {}
    for key, value in d.items():
        # # if the value is a list, unpack the list first (needs to be debugged)
        # if isinstance(value, list):
        #     value = str(value)
        #     sents_dict[key] = sent_tokenize(value)
        #     mohicans_sents.append(sents_dict)
        # else:
        sents_dict[key] = sent_tokenize(value)
    # append once per dict, after all of its keys have been tokenized
    mohicans_sents.append(sents_dict)

for i, j in enumerate(mohicans_sents):
    if i == 10:
        break
    print(j)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\bened\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


{'scene_heading': ['THE LAST OF THE MOHICANS']}
{'dialog': ['Written by Michael Mann &amp; Christopher Crowe']}
{'text': ['The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss.', "It's a landscape in miniature.", 'We HEAR the forest.', 'Some distant birds.', 'Their sound seems to reverberate as if in a cavern.', 'A piece of sunlight refracts within the drops of water, paints a patch of moss yellow.', 'The whisper of wind is joined by another sound that mixes with it.', 'A distant rustling.', 'It gets closer and louder.', "It's shallow breathing.", 'It gets ominous.', "We're interlopers on the floor of the forest and something is coming."]}
{'scene_heading': ['SUDDENLY: A MOCCASINED FOOT']}
{'text': ['rockets through the frame scaring us and ...']}
{'scene_heading': ['EXTREMELY CLOSE: PART OF AN INDIAN FACE']}
{'text': ['running hard.', 'His head shaved bald except for a scalp-lock.', 'Tattoos.', "He's twenty-five.", 'He seems tall and muscled.'

This seems to work okay, although we now have to deal with a list of dicts of lists, including lists that hold only one sentence.

If this is run again, it may be better to make it a dict of dicts, with a structure like
{"screenplay":
    {"label":"data"},
    {"label":"data"},
    etc}
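
Since labels repeat within a screenplay, they cannot serve as unique dict keys. One possible alternative (a sketch only, not what this notebook does) is to flatten the structure into a list of `(label, sentence)` pairs. `to_records` is a hypothetical helper name.

```python
def to_records(sentence_dicts):
    """Flatten [{label: [sent, ...]}, ...] into [(label, sent), ...]."""
    return [
        (label, sent)
        for d in sentence_dicts          # each dict in the screenplay
        for label, sents in d.items()    # each label and its sentence list
        for sent in sents                # each sentence under that label
    ]

sample = [
    {'scene_heading': ['SUDDENLY: A MOCCASINED FOOT']},
    {'text': ['running hard.', 'Tattoos.']},
]
records = to_records(sample)
print(records)
```

This keeps the original order and removes one level of nesting, at the cost of repeating the label for every sentence.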

In [84]:
# delete unneeded variables before sentence tokenization 
del anon, anon_dialog_flattened

In [86]:
del anon_text_flattened, change_up, decapitated_screenplays, mohicans, mohicans_no_lowers, mohicans_sents, mohicans_uncut, roxbury_texts, screenplay_jsons

NameError: name 'anon_text_flattened' is not defined

In [87]:
# define as a general function
def sent_tokenize_dicts(dict_list):

    sentence_dicts = []

    for d in dict_list:
        # empty dict for storing result 
        sents_dict = {}
        for key, value in d.items():
            sents_dict[key] = sent_tokenize(value)
        # append once per dict, after all of its keys have been tokenized
        sentence_dicts.append(sents_dict)
    
    return sentence_dicts

In [88]:
# apply all 
screenplay_sents = screenplays_uncut.apply(sent_tokenize_dicts)

In [89]:
print_first_lines(screenplay_sents[50], 10)

{'text': ['TEN THINGS I HATE ABOUT YOU']}
{'dialog': ['written by Karen McCullah Lutz &amp; Kirsten Smith  based on \'Taming of the Shrew" by William Shakespeare  Revision November 12, 1997']}
{'scene_heading': ['PADUA HIGH SCHOOL - DAY']}
{'dialog': ['Welcome to Padua High School,, your typical urban-suburban high school in Portland, Oregon.', 'Smarties, Skids, Preppies,']}
{'text': ['Granolas.', 'Loners, Lovers, the In and the Out Crowd rub sleep out of their eyes and head for the main building.']}
{'scene_heading': ['PADUA HIGH PARKING LOT - DAY']}
{'text': ["KAT STRATFORD, eighteen, pretty -- but trying hard not to be -- in a baggy granny dress and glasses, balances a cup of coffee and a backpack as she climbs out of her battered, baby blue '75 Dodge Dart."]}
{'text': ['A stray SKATEBOARD clips her, causing her to stumble and spill her coffee, as well as the contents of her backpack.']}
{'text': ['The young RIDER dashes over to help, trembling when he sees who his board has hit.']}

looks okay

In [90]:
del screenplays_uncut

## Label Encoding

At this point we're going to encode our labels just to save on memory. 

In [91]:
label_map = {
    'scene_heading': 0,
    'text': 1,
    'dialog': 2
}

# check how it will work on ten things I hate about you 
ten_things_sents = screenplay_sents[50]
# iterate through dict list
ten_things_encoded = []
for d in ten_things_sents:
    encoded_dict = {np.int8(label_map[key]): value for key, value in d.items()}
    ten_things_encoded.append(encoded_dict)

print(ten_things_encoded[:10])


[{np.int8(1): ['TEN THINGS I HATE ABOUT YOU']}, {np.int8(2): ['written by Karen McCullah Lutz &amp; Kirsten Smith  based on \'Taming of the Shrew" by William Shakespeare  Revision November 12, 1997']}, {np.int8(0): ['PADUA HIGH SCHOOL - DAY']}, {np.int8(2): ['Welcome to Padua High School,, your typical urban-suburban high school in Portland, Oregon.', 'Smarties, Skids, Preppies,']}, {np.int8(1): ['Granolas.', 'Loners, Lovers, the In and the Out Crowd rub sleep out of their eyes and head for the main building.']}, {np.int8(0): ['PADUA HIGH PARKING LOT - DAY']}, {np.int8(1): ["KAT STRATFORD, eighteen, pretty -- but trying hard not to be -- in a baggy granny dress and glasses, balances a cup of coffee and a backpack as she climbs out of her battered, baby blue '75 Dodge Dart."]}, {np.int8(1): ['A stray SKATEBOARD clips her, causing her to stumble and spill her coffee, as well as the contents of her backpack.']}, {np.int8(1): ['The young RIDER dashes over to help, trembling when he sees wh

In [92]:
# define as function and apply all 

def encode_labels(dict_list):
    encoded_list = []
    for d in dict_list:
        encoded_dict = {np.int8(label_map[key]): value for key, value in d.items()}
        encoded_list.append(encoded_dict)
    return encoded_list

screenplays_encoded = screenplay_sents.apply(encode_labels)
print(screenplays_encoded[0][:10])

[{np.int8(1): ['A NIGHT AT THE ROXBURY']}, {np.int8(2): ['written by Steve Koren Will Ferrell & Chris Kattan June 2, 1997']}, {np.int8(0): ['EXT.', 'PANORAMIC VIEW OF LOS ANGELES - SUNSET']}, {np.int8(1): ['As we hear "What is Love" by HADDAWAY -- night falls and partytime begins.']}, {np.int8(0): ['SUPERIMPOSE: SUNSET BLVD., 11:03 PM']}, {np.int8(0): ['EXT.', 'DANCE CLUBS - NIGHT']}, {np.int8(2): ['Coconut Teaser, The Palace, The Roxbury, Tatou, etc.']}, {np.int8(0): ['INT.', 'DANCE CLUBS- QUICK SHOTS - NIGHT']}, {np.int8(1): ['Of random dancers -- gyrating, flirting, making out, drinking.']}, {np.int8(0): ['INT.', 'PALACE - NIGHT']}]
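
Once labels are encoded, it helps to be able to map them back for inspection. The sketch below builds an inverse of `label_map` and decodes a small sample. `inverse_label_map` and `decode_labels` are hypothetical names, and plain integers stand in for the `np.int8` keys so the example has no NumPy dependency.

```python
label_map = {'scene_heading': 0, 'text': 1, 'dialog': 2}

# invert the mapping: integer code -> label string
inverse_label_map = {code: label for label, code in label_map.items()}

def decode_labels(encoded_list):
    """Replace integer label codes with their original string labels."""
    return [
        # int(code) normalises integer-like keys before the lookup
        {inverse_label_map[int(code)]: value for code, value in d.items()}
        for d in encoded_list
    ]

encoded = [{1: ['A NIGHT AT THE ROXBURY']}, {0: ['PALACE - NIGHT']}]
decoded = decode_labels(encoded)
print(decoded)
```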


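## Remove 'EXT.' and 'INT.'

Sentence tokenization has split 'EXT.' and 'INT.' off the scene headings into sentences of their own. We can remove all sentences that consist only of 'EXT.' or 'INT.'.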

In [93]:
def remove_location(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [item for item in value if item not in ['EXT.', 'INT.']]
    return dict_list
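
Note that `remove_location` reassigns `d[key]` on the dicts it receives, so the input is modified in place. If the original data should stay untouched, a non-mutating variant could look like the sketch below. `remove_location_copy` and `LOCATION_TAGS` are hypothetical names.

```python
LOCATION_TAGS = {'EXT.', 'INT.'}

def remove_location_copy(dict_list):
    """Return new dicts without bare 'EXT.'/'INT.' sentences."""
    return [
        # build a fresh dict and fresh lists instead of editing the input
        {key: [s for s in sents if s not in LOCATION_TAGS]
         for key, sents in d.items()}
        for d in dict_list
    ]

original = [{0: ['EXT.', 'DANCE CLUBS - NIGHT']}, {1: ['Of random dancers.']}]
cleaned = remove_location_copy(original)
print(cleaned)
print(original)  # the input still contains 'EXT.'
```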

In [94]:
# test on roxbury (note: remove_location modifies the dicts in place)
roxbury_sents = screenplays_encoded[0]
roxbury_unlocated = remove_location(roxbury_sents)
print_first_lines(roxbury_unlocated, 10)

{np.int8(1): ['A NIGHT AT THE ROXBURY']}
{np.int8(2): ['written by Steve Koren Will Ferrell & Chris Kattan June 2, 1997']}
{np.int8(0): ['PANORAMIC VIEW OF LOS ANGELES - SUNSET']}
{np.int8(1): ['As we hear "What is Love" by HADDAWAY -- night falls and partytime begins.']}
{np.int8(0): ['SUPERIMPOSE: SUNSET BLVD., 11:03 PM']}
{np.int8(0): ['DANCE CLUBS - NIGHT']}
{np.int8(2): ['Coconut Teaser, The Palace, The Roxbury, Tatou, etc.']}
{np.int8(0): ['DANCE CLUBS- QUICK SHOTS - NIGHT']}
{np.int8(1): ['Of random dancers -- gyrating, flirting, making out, drinking.']}
{np.int8(0): ['PALACE - NIGHT']}


In [95]:
del screenplay_sents

In [96]:
# apply all
screenplays_unlocated = screenplays_encoded.apply(remove_location)
print(screenplays_unlocated[10][:10])

[{np.int8(0): ['THE LAST OF THE MOHICANS']}, {np.int8(2): ['Written by Michael Mann &amp; Christopher Crowe']}, {np.int8(1): ['The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss.', "It's a landscape in miniature.", 'We HEAR the forest.', 'Some distant birds.', 'Their sound seems to reverberate as if in a cavern.', 'A piece of sunlight refracts within the drops of water, paints a patch of moss yellow.', 'The whisper of wind is joined by another sound that mixes with it.', 'A distant rustling.', 'It gets closer and louder.', "It's shallow breathing.", 'It gets ominous.', "We're interlopers on the floor of the forest and something is coming."]}, {np.int8(0): ['SUDDENLY: A MOCCASINED FOOT']}, {np.int8(1): ['rockets through the frame scaring us and ...']}, {np.int8(0): ['EXTREMELY CLOSE: PART OF AN INDIAN FACE']}, {np.int8(1): ['running hard.', 'His head shaved bald except for a scalp-lock.', 'Tattoos.', "He's twenty-five.", 'He seems tall and mus

In [97]:
print_first_lines(screenplays_unlocated[200], 10)

{np.int8(2): ['Written by Jon Lucas &amp; Scott Moore July 31, 2009']}
{np.int8(0): ['OPEN ON: PEACEFUL BLACK STILLNESS']}
{np.int8(1): ['Then we hear a baby SCREAM BLOODY MURDER.', 'Then a second baby joins in, even more shrill than the first.', 'Finally, we hear']}
{np.int8(2): ['the worst two words a parent can ever hear:']}
{np.int8(2): ['Your turn.']}
{np.int8(2): ['Fuck.']}
{np.int8(0): ['SUBURBAN HOUSE -- NIGHT']}
{np.int8(1): ['DAVE LOCKWOOD, 30, bleary-eyed father of three, shuffles through his well-appointed suburban home, passing a grandfather clock reading 3:45.', 'He stumbles over a TOY GIRAFFE -- it SQUEAKS, and Dave sleepily mumbles:']}
{np.int8(2): ['Sorry Hank.']}
{np.int8(0): ['NURSERY-- NIGHT']}


## Word Tokenization

In [98]:
print(screenplays_unlocated[10][:10])

[{np.int8(0): ['THE LAST OF THE MOHICANS']}, {np.int8(2): ['Written by Michael Mann &amp; Christopher Crowe']}, {np.int8(1): ['The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss.', "It's a landscape in miniature.", 'We HEAR the forest.', 'Some distant birds.', 'Their sound seems to reverberate as if in a cavern.', 'A piece of sunlight refracts within the drops of water, paints a patch of moss yellow.', 'The whisper of wind is joined by another sound that mixes with it.', 'A distant rustling.', 'It gets closer and louder.', "It's shallow breathing.", 'It gets ominous.', "We're interlopers on the floor of the forest and something is coming."]}, {np.int8(0): ['SUDDENLY: A MOCCASINED FOOT']}, {np.int8(1): ['rockets through the frame scaring us and ...']}, {np.int8(0): ['EXTREMELY CLOSE: PART OF AN INDIAN FACE']}, {np.int8(1): ['running hard.', 'His head shaved bald except for a scalp-lock.', 'Tattoos.', "He's twenty-five.", 'He seems tall and mus

In [99]:
from nltk.tokenize import word_tokenize
import copy

# try out on mohicans 
mohicans_sents = copy.deepcopy(screenplays_unlocated[10])

def word_tokenize_dicts(dict_list):
    # iterate through dict list
    for d in dict_list:
        # iterate through keys and values 
        for key, value in d.items():
            d[key] = [word_tokenize(sent) for sent in value]
    return dict_list

mohicans_tokenized = word_tokenize_dicts(mohicans_sents)
print_first_lines(mohicans_tokenized, 10)


{np.int8(0): [['THE', 'LAST', 'OF', 'THE', 'MOHICANS']]}
{np.int8(2): [['Written', 'by', 'Michael', 'Mann', '&', 'amp', ';', 'Christopher', 'Crowe']]}
{np.int8(1): [['The', 'screen', 'is', 'a', 'microcosm', 'of', 'leaf', ',', 'crystal', 'drops', 'of', 'precipitation', ',', 'a', 'stone', ',', 'emerald', 'green', 'moss', '.'], ['It', "'s", 'a', 'landscape', 'in', 'miniature', '.'], ['We', 'HEAR', 'the', 'forest', '.'], ['Some', 'distant', 'birds', '.'], ['Their', 'sound', 'seems', 'to', 'reverberate', 'as', 'if', 'in', 'a', 'cavern', '.'], ['A', 'piece', 'of', 'sunlight', 'refracts', 'within', 'the', 'drops', 'of', 'water', ',', 'paints', 'a', 'patch', 'of', 'moss', 'yellow', '.'], ['The', 'whisper', 'of', 'wind', 'is', 'joined', 'by', 'another', 'sound', 'that', 'mixes', 'with', 'it', '.'], ['A', 'distant', 'rustling', '.'], ['It', 'gets', 'closer', 'and', 'louder', '.'], ['It', "'s", 'shallow', 'breathing', '.'], ['It', 'gets', 'ominous', '.'], ['We', "'re", 'interlopers', 'on', 'the',
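
The `&amp;` in the writer credit is an HTML entity, which `word_tokenize` splits into `'&'`, `'amp'` and `';'`. One option, sketched below on the assumption that the sentences are still plain strings at this point, is to decode entities with the standard library's `html.unescape` before tokenizing. The helper name `unescape_dicts` is hypothetical and not part of the notebook.

```python
import html

def unescape_dicts(dict_list):
    """Return a new list of dicts with HTML entities decoded in every sentence."""
    return [
        {key: [html.unescape(sent) for sent in value] for key, value in d.items()}
        for d in dict_list
    ]

sample = [{2: ['Written by Michael Mann &amp; Christopher Crowe']}]
print(unescape_dicts(sample))
```

Run before tokenization, this would keep `'amp'` out of the token stream in the first place.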

In [100]:
print(screenplays_unlocated[10][:10])

[{np.int8(0): ['THE LAST OF THE MOHICANS']}, {np.int8(2): ['Written by Michael Mann &amp; Christopher Crowe']}, {np.int8(1): ['The screen is a microcosm of leaf, crystal drops of precipitation, a stone, emerald green moss.', "It's a landscape in miniature.", 'We HEAR the forest.', 'Some distant birds.', 'Their sound seems to reverberate as if in a cavern.', 'A piece of sunlight refracts within the drops of water, paints a patch of moss yellow.', 'The whisper of wind is joined by another sound that mixes with it.', 'A distant rustling.', 'It gets closer and louder.', "It's shallow breathing.", 'It gets ominous.', "We're interlopers on the floor of the forest and something is coming."]}, {np.int8(0): ['SUDDENLY: A MOCCASINED FOOT']}, {np.int8(1): ['rockets through the frame scaring us and ...']}, {np.int8(0): ['EXTREMELY CLOSE: PART OF AN INDIAN FACE']}, {np.int8(1): ['running hard.', 'His head shaved bald except for a scalp-lock.', 'Tattoos.', "He's twenty-five.", 'He seems tall and mus

Unfortunately we're now dealing with lists of dicts of lists of lists :/ Not sure how to remedy that without losing sentence boundaries.
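
One possible way around the nesting, sketched here as an assumption rather than something the notebook does, is to flatten each screenplay into records that carry the block index, label code and sentence index alongside the tokens. Sentence boundaries then survive as fields instead of as list depth. `flatten_screenplay` is a hypothetical helper.

```python
def flatten_screenplay(dict_list):
    """Turn [{label: [sentence, ...]}, ...] into flat (block, label, sent, tokens) tuples."""
    records = []
    for block_idx, d in enumerate(dict_list):
        for label, sentences in d.items():
            for sent_idx, tokens in enumerate(sentences):
                records.append((block_idx, int(label), sent_idx, tokens))
    return records

sample = [
    {0: [['THE', 'LAST', 'OF', 'THE', 'MOHICANS']]},
    {1: [['We', 'HEAR', 'the', 'forest', '.'], ['Some', 'distant', 'birds', '.']]},
]
records = flatten_screenplay(sample)
for record in records:
    print(record)
```

A flat list like this would also load directly into a DataFrame with one row per sentence, if that turns out to be easier to work with.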

In [101]:
# apply all 
screenplays_tokenized = screenplays_unlocated.apply(word_tokenize_dicts)
print_first_lines(screenplays_tokenized[0], 10)

{np.int8(1): [['A', 'NIGHT', 'AT', 'THE', 'ROXBURY']]}
{np.int8(2): [['written', 'by', 'Steve', 'Koren', 'Will', 'Ferrell', '&', 'Chris', 'Kattan', 'June', '2', ',', '1997']]}
{np.int8(0): [['PANORAMIC', 'VIEW', 'OF', 'LOS', 'ANGELES', '-', 'SUNSET']]}
{np.int8(1): [['As', 'we', 'hear', '``', 'What', 'is', 'Love', "''", 'by', 'HADDAWAY', '--', 'night', 'falls', 'and', 'partytime', 'begins', '.']]}
{np.int8(0): [['SUPERIMPOSE', ':', 'SUNSET', 'BLVD.', ',', '11:03', 'PM']]}
{np.int8(0): [['DANCE', 'CLUBS', '-', 'NIGHT']]}
{np.int8(2): [['Coconut', 'Teaser', ',', 'The', 'Palace', ',', 'The', 'Roxbury', ',', 'Tatou', ',', 'etc', '.']]}
{np.int8(0): [['DANCE', 'CLUBS-', 'QUICK', 'SHOTS', '-', 'NIGHT']]}
{np.int8(1): [['Of', 'random', 'dancers', '--', 'gyrating', ',', 'flirting', ',', 'making', 'out', ',', 'drinking', '.']]}
{np.int8(0): [['PALACE', '-', 'NIGHT']]}
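
Note that `word_tokenize_dicts` rewrites the dicts it is given and returns the same list, so the Series returned by `.apply` most likely shares its objects with `screenplays_unlocated` rather than holding independent copies. That is why the test above works on a `deepcopy`. The sketch below shows the aliasing on a plain list, with `str.split` standing in for `word_tokenize` (an assumption, to keep the example dependency-free), next to a hypothetical non-mutating variant.

```python
def tokenize_in_place(dict_list):
    # same shape as word_tokenize_dicts, with str.split standing in for word_tokenize
    for d in dict_list:
        for key, value in d.items():
            d[key] = [sent.split() for sent in value]
    return dict_list

def tokenize_copy(dict_list):
    # builds new dicts and lists, leaving the input untouched
    return [
        {key: [sent.split() for sent in value] for key, value in d.items()}
        for d in dict_list
    ]

original = [{0: ['DANCE CLUBS - NIGHT']}]
copied = tokenize_copy(original)
print(original[0][0])        # still a list of strings

result = tokenize_in_place(original)
print(result is original)    # the "output" is the input object
print(original[0][0])        # the input now holds token lists
```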


## remove strings that contain no letters 

In [102]:
mohicans_tokenized = copy.deepcopy(screenplays_tokenized[10])
print_first_lines(mohicans_tokenized, 10)

{np.int8(0): [['THE', 'LAST', 'OF', 'THE', 'MOHICANS']]}
{np.int8(2): [['Written', 'by', 'Michael', 'Mann', '&', 'amp', ';', 'Christopher', 'Crowe']]}
{np.int8(1): [['The', 'screen', 'is', 'a', 'microcosm', 'of', 'leaf', ',', 'crystal', 'drops', 'of', 'precipitation', ',', 'a', 'stone', ',', 'emerald', 'green', 'moss', '.'], ['It', "'s", 'a', 'landscape', 'in', 'miniature', '.'], ['We', 'HEAR', 'the', 'forest', '.'], ['Some', 'distant', 'birds', '.'], ['Their', 'sound', 'seems', 'to', 'reverberate', 'as', 'if', 'in', 'a', 'cavern', '.'], ['A', 'piece', 'of', 'sunlight', 'refracts', 'within', 'the', 'drops', 'of', 'water', ',', 'paints', 'a', 'patch', 'of', 'moss', 'yellow', '.'], ['The', 'whisper', 'of', 'wind', 'is', 'joined', 'by', 'another', 'sound', 'that', 'mixes', 'with', 'it', '.'], ['A', 'distant', 'rustling', '.'], ['It', 'gets', 'closer', 'and', 'louder', '.'], ['It', "'s", 'shallow', 'breathing', '.'], ['It', 'gets', 'ominous', '.'], ['We', "'re", 'interlopers', 'on', 'the',

In [103]:
import string
puncts = list(string.punctuation)
print(puncts)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [104]:
import re 

def contains_letters(token):
    return bool(re.search(r'[a-zA-Z]', token))

def remove_non_letters(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [
                [t for t in sentence if contains_letters(t)]
                for sentence in value
            ]
    return dict_list 

# test on mohicans_tokenized
mohicans_alpha = remove_non_letters(mohicans_tokenized)
print_first_lines(mohicans_alpha, 10)

{np.int8(0): [['THE', 'LAST', 'OF', 'THE', 'MOHICANS']]}
{np.int8(2): [['Written', 'by', 'Michael', 'Mann', 'amp', 'Christopher', 'Crowe']]}
{np.int8(1): [['The', 'screen', 'is', 'a', 'microcosm', 'of', 'leaf', 'crystal', 'drops', 'of', 'precipitation', 'a', 'stone', 'emerald', 'green', 'moss'], ['It', "'s", 'a', 'landscape', 'in', 'miniature'], ['We', 'HEAR', 'the', 'forest'], ['Some', 'distant', 'birds'], ['Their', 'sound', 'seems', 'to', 'reverberate', 'as', 'if', 'in', 'a', 'cavern'], ['A', 'piece', 'of', 'sunlight', 'refracts', 'within', 'the', 'drops', 'of', 'water', 'paints', 'a', 'patch', 'of', 'moss', 'yellow'], ['The', 'whisper', 'of', 'wind', 'is', 'joined', 'by', 'another', 'sound', 'that', 'mixes', 'with', 'it'], ['A', 'distant', 'rustling'], ['It', 'gets', 'closer', 'and', 'louder'], ['It', "'s", 'shallow', 'breathing'], ['It', 'gets', 'ominous'], ['We', "'re", 'interlopers', 'on', 'the', 'floor', 'of', 'the', 'forest', 'and', 'something', 'is', 'coming']]}
{np.int8(0): [

In [105]:
del mohicans_alpha, mohicans_sents, mohicans_tokenized, roxbury_sents, roxbury_unlocated, screenplays_encoded, screenplays_unlocated, ten_things_encoded, ten_things_sents

In [106]:
# seems to work so apply all 
screenplays_alpha = screenplays_tokenized.apply(remove_non_letters)
print_first_lines(screenplays_alpha[0], 10)

{np.int8(1): [['A', 'NIGHT', 'AT', 'THE', 'ROXBURY']]}
{np.int8(2): [['written', 'by', 'Steve', 'Koren', 'Will', 'Ferrell', 'Chris', 'Kattan', 'June']]}
{np.int8(0): [['PANORAMIC', 'VIEW', 'OF', 'LOS', 'ANGELES', 'SUNSET']]}
{np.int8(1): [['As', 'we', 'hear', 'What', 'is', 'Love', 'by', 'HADDAWAY', 'night', 'falls', 'and', 'partytime', 'begins']]}
{np.int8(0): [['SUPERIMPOSE', 'SUNSET', 'BLVD.', 'PM']]}
{np.int8(0): [['DANCE', 'CLUBS', 'NIGHT']]}
{np.int8(2): [['Coconut', 'Teaser', 'The', 'Palace', 'The', 'Roxbury', 'Tatou', 'etc']]}
{np.int8(0): [['DANCE', 'CLUBS-', 'QUICK', 'SHOTS', 'NIGHT']]}
{np.int8(1): [['Of', 'random', 'dancers', 'gyrating', 'flirting', 'making', 'out', 'drinking']]}
{np.int8(0): [['PALACE', 'NIGHT']]}
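
`contains_letters` keeps any token with at least one letter, so tokens such as `'BLVD.'` and `'CLUBS-'` keep their attached punctuation. If that turns out to matter, one option is to strip punctuation from token edges only. This is a sketch, not part of the notebook: `strip_edge_punct` is a hypothetical helper, and the apostrophe is deliberately excluded so that clitics like `'s` stay intact.

```python
import string

# punctuation to trim from token edges; the apostrophe is kept so "'s" and "'re" survive
EDGE_PUNCT = string.punctuation.replace("'", "")

def strip_edge_punct(token):
    """Remove leading/trailing punctuation but leave internal characters alone."""
    return token.strip(EDGE_PUNCT)

for tok in ['BLVD.', 'CLUBS-', 'scalp-lock', "'s"]:
    print(tok, '->', strip_edge_punct(tok))
```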


## to lower

Since sentence boundaries are already marked, we can convert all characters to lowercase.

In [107]:
def convert_to_lower(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [
                [w.lower() for w in sentence]
                for sentence in value
            ]
    return dict_list

In [108]:
mo = copy.deepcopy(screenplays_alpha[10])
print_first_lines(mo, 10)

{np.int8(0): [['THE', 'LAST', 'OF', 'THE', 'MOHICANS']]}
{np.int8(2): [['Written', 'by', 'Michael', 'Mann', 'amp', 'Christopher', 'Crowe']]}
{np.int8(1): [['The', 'screen', 'is', 'a', 'microcosm', 'of', 'leaf', 'crystal', 'drops', 'of', 'precipitation', 'a', 'stone', 'emerald', 'green', 'moss'], ['It', "'s", 'a', 'landscape', 'in', 'miniature'], ['We', 'HEAR', 'the', 'forest'], ['Some', 'distant', 'birds'], ['Their', 'sound', 'seems', 'to', 'reverberate', 'as', 'if', 'in', 'a', 'cavern'], ['A', 'piece', 'of', 'sunlight', 'refracts', 'within', 'the', 'drops', 'of', 'water', 'paints', 'a', 'patch', 'of', 'moss', 'yellow'], ['The', 'whisper', 'of', 'wind', 'is', 'joined', 'by', 'another', 'sound', 'that', 'mixes', 'with', 'it'], ['A', 'distant', 'rustling'], ['It', 'gets', 'closer', 'and', 'louder'], ['It', "'s", 'shallow', 'breathing'], ['It', 'gets', 'ominous'], ['We', "'re", 'interlopers', 'on', 'the', 'floor', 'of', 'the', 'forest', 'and', 'something', 'is', 'coming']]}
{np.int8(0): [

In [109]:
# test on mohicans 
mo_lower = convert_to_lower(mo)
print_first_lines(mo_lower, 10)

{np.int8(0): [['the', 'last', 'of', 'the', 'mohicans']]}
{np.int8(2): [['written', 'by', 'michael', 'mann', 'amp', 'christopher', 'crowe']]}
{np.int8(1): [['the', 'screen', 'is', 'a', 'microcosm', 'of', 'leaf', 'crystal', 'drops', 'of', 'precipitation', 'a', 'stone', 'emerald', 'green', 'moss'], ['it', "'s", 'a', 'landscape', 'in', 'miniature'], ['we', 'hear', 'the', 'forest'], ['some', 'distant', 'birds'], ['their', 'sound', 'seems', 'to', 'reverberate', 'as', 'if', 'in', 'a', 'cavern'], ['a', 'piece', 'of', 'sunlight', 'refracts', 'within', 'the', 'drops', 'of', 'water', 'paints', 'a', 'patch', 'of', 'moss', 'yellow'], ['the', 'whisper', 'of', 'wind', 'is', 'joined', 'by', 'another', 'sound', 'that', 'mixes', 'with', 'it'], ['a', 'distant', 'rustling'], ['it', 'gets', 'closer', 'and', 'louder'], ['it', "'s", 'shallow', 'breathing'], ['it', 'gets', 'ominous'], ['we', "'re", 'interlopers', 'on', 'the', 'floor', 'of', 'the', 'forest', 'and', 'something', 'is', 'coming']]}
{np.int8(0): [

In [110]:
consideration = copy.deepcopy(screenplays_alpha[25])
print_first_lines(consideration, 10)

{np.int8(2): [['C', 'O', 'N', 'S', 'I', 'D', 'E', 'R', 'A', 'T', 'I', 'O', 'N']]}
{np.int8(0): [['NICOLE', 'HOLOFCENER', 'AND', 'JEFF', 'WHIT', 'T', 'Y']]}
{np.int8(1): [['BASED', 'ON', 'THE', 'BOOK', 'BY']]}
{np.int8(0): [['NICOLE', 'HOLOFCENER', 'AND', 'JEFF', 'WHIT', 'T', 'Y']]}
{np.int8(2): [['CAN', 'YOU', 'EVER', 'FORGIVE', 'ME'], ['Screenplay', 'by', 'Nicole', 'Holofcener', 'and', 'Jeff', 'Whitty', 'Based', 'on', 'the', 'book', 'CAN', 'YOU', 'EVER', 'FORGIVE', 'ME'], ['By', 'Lee', 'Israel', 'Final', 'Shooting', 'Script', 'March']]}
{np.int8(0): [['FOX', 'SEARCHLIGHT', 'PICTURES', 'INC']]}
{np.int8(2): [['Los', 'Angeles', 'CA']]}
{np.int8(0): [['ALL', 'RIGHTS', 'RESERVED'], ['COPYRIGHT', 'WILLOW', 'AND', 'OAK', 'INC.', 'NO']]}
{np.int8(0): [['PORTION', 'OF', 'THIS', 'SCRIPT', 'MAY', 'BE', 'PERFORMED', 'PUBLISHED', 'REPRODUCED']]}
{np.int8(1): [['SOLD', 'OR', 'DISTRIBUTED', 'BY', 'ANY', 'MEANS', 'OR', 'QUOTED', 'OR', 'PUBLISHED', 'IN', 'ANY']]}


In [111]:
# apply all
screenplays_lower = screenplays_alpha.apply(convert_to_lower)
print_first_lines(screenplays_lower[25], 10)

{np.int8(2): [['c', 'o', 'n', 's', 'i', 'd', 'e', 'r', 'a', 't', 'i', 'o', 'n']]}
{np.int8(0): [['nicole', 'holofcener', 'and', 'jeff', 'whit', 't', 'y']]}
{np.int8(1): [['based', 'on', 'the', 'book', 'by']]}
{np.int8(0): [['nicole', 'holofcener', 'and', 'jeff', 'whit', 't', 'y']]}
{np.int8(2): [['can', 'you', 'ever', 'forgive', 'me'], ['screenplay', 'by', 'nicole', 'holofcener', 'and', 'jeff', 'whitty', 'based', 'on', 'the', 'book', 'can', 'you', 'ever', 'forgive', 'me'], ['by', 'lee', 'israel', 'final', 'shooting', 'script', 'march']]}
{np.int8(0): [['fox', 'searchlight', 'pictures', 'inc']]}
{np.int8(2): [['los', 'angeles', 'ca']]}
{np.int8(0): [['all', 'rights', 'reserved'], ['copyright', 'willow', 'and', 'oak', 'inc.', 'no']]}
{np.int8(0): [['portion', 'of', 'this', 'script', 'may', 'be', 'performed', 'published', 'reproduced']]}
{np.int8(1): [['sold', 'or', 'distributed', 'by', 'any', 'means', 'or', 'quoted', 'or', 'published', 'in', 'any']]}


In [112]:
del mo, mo_lower, screenplay_jsons, screenplays_alpha, screenplays_tokenized

## remove tokens of length 1

In [113]:
def cut_single_chars(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [
                [w for w in sentence if len(w) > 1]
                for sentence in value]
    return dict_list

In [114]:
# test on consideration 
consideration = copy.deepcopy(screenplays_lower[25])
consideration_poly = cut_single_chars(consideration)
print_first_lines(consideration_poly, 10)

{np.int8(2): [[]]}
{np.int8(0): [['nicole', 'holofcener', 'and', 'jeff', 'whit']]}
{np.int8(1): [['based', 'on', 'the', 'book', 'by']]}
{np.int8(0): [['nicole', 'holofcener', 'and', 'jeff', 'whit']]}
{np.int8(2): [['can', 'you', 'ever', 'forgive', 'me'], ['screenplay', 'by', 'nicole', 'holofcener', 'and', 'jeff', 'whitty', 'based', 'on', 'the', 'book', 'can', 'you', 'ever', 'forgive', 'me'], ['by', 'lee', 'israel', 'final', 'shooting', 'script', 'march']]}
{np.int8(0): [['fox', 'searchlight', 'pictures', 'inc']]}
{np.int8(2): [['los', 'angeles', 'ca']]}
{np.int8(0): [['all', 'rights', 'reserved'], ['copyright', 'willow', 'and', 'oak', 'inc.', 'no']]}
{np.int8(0): [['portion', 'of', 'this', 'script', 'may', 'be', 'performed', 'published', 'reproduced']]}
{np.int8(1): [['sold', 'or', 'distributed', 'by', 'any', 'means', 'or', 'quoted', 'or', 'published', 'in', 'any']]}
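
Dropping single-character tokens wipes out letter-spaced titles: `C O N S I D E R A T I O N` becomes an empty list. A possible pre-step, sketched here as an assumption, is to glue runs of single letters back together before the cut. `merge_single_char_runs` and its `min_run` threshold are hypothetical; short runs such as the `'t', 'y'` left over from `WHIT T Y` are left alone because they cannot be rejoined reliably.

```python
def merge_single_char_runs(sentence, min_run=3):
    """Join runs of at least `min_run` consecutive single-letter tokens into one token."""
    merged, run = [], []
    for tok in list(sentence) + [None]:  # None is a sentinel that flushes the last run
        if tok is not None and len(tok) == 1 and tok.isalpha():
            run.append(tok)
            continue
        if len(run) >= min_run:
            merged.append(''.join(run))
        else:
            merged.extend(run)
        run = []
        if tok is not None:
            merged.append(tok)
    return merged

spaced = ['c', 'o', 'n', 's', 'i', 'd', 'e', 'r', 'a', 't', 'i', 'o', 'n']
names = ['nicole', 'holofcener', 'and', 'jeff', 'whit', 't', 'y']
print(merge_single_char_runs(spaced))
print(merge_single_char_runs(names))
```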


We'll remove the empty values after also removing stopwords.

In [115]:
# apply all 
screenplays_poly = screenplays_lower.apply(cut_single_chars)
print_first_lines(screenplays_poly[250], 10)

{np.int8(2): [['written', 'by', 'rhett', 'reese', 'amp', 'paul', 'wernick', 'final', 'shooting', 'script', 'november']]}
{np.int8(1): [['over', 'black'], ['low', 'volume', 'through', 'tinny', 'speaker', 'juice', 'newton', "'s", 'angel', 'of', 'the', 'morning']]}
{np.int8(0): [['ext./int'], ['taxi', 'cab', 'morning']]}
{np.int8(1): [['deadpool', 'in', 'full', 'dress', 'reds', 'and', 'mask', 'quietly', 'fidgets', 'in', 'the', 'back', 'seat', 'of', 'taxi', 'cab', 'as', 'it', 'proceeds', 'along', 'city', 'freeway'], ['deadpool', 'adjusts', 'the', 'two', 'katanas', 'strapped', 'to', 'his', 'back'], ['rolls', 'the', 'windows', 'up', 'down', 'up'], ['tries', 'futilely', 'to', 'untwist', 'the', 'seatbelt', 'then', 'lunges', 'forward', 'locking', 'it', 'up'], ['rifles', 'through', 'tourist', 'booklet', 'and', 'tears', 'out', 'haunted', 'segway', 'tour', 'coupon'], ['the', 'cabbie', 'young', 'thin', 'brown', 'glances', 'back', 'and', 'forth', 'from', 'the', 'rear', 'view', 'to', 'the', 'road', '

In [116]:
del screenplays_lower

## stopwords

In [2]:
from nltk.corpus import stopwords

stops = stopwords.words('english')
print(stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [3]:
# boilerplate tokens from title pages plus tokenizer leftovers;
# entries already in the NLTK list are skipped by the loop below
extra_stops = [
    'fox', 'searchlight', 'pictures', 'inc', 'los', 'angeles', 'ca',
    'all', 'rights', 'reserved', 'copyright', 'willow', 'and', 'oak', 'inc.', 'no',
    'portion', 'of', 'this', 'script', 'may', 'be', 'performed', 'published', 'reproduced',
    'sold', 'or', 'distributed', 'by', 'any', 'means', 'quoted', 'in',
    'ext./int', 'amp', "'ll", 'ext', 'int'
]

for s in extra_stops:
    if s not in stops:
        stops.append(s)

In [5]:
stops_path = r"C:\Users\bened\DataScience\ANLP\AT2\36118_NLP_Spring\front_end_pipeline\stops.txt"
with open(stops_path, 'w', encoding='utf-8') as f:
    for s in stops:
        f.write(s + '\n')
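
Wherever the saved list gets consumed, reading it back into a set is probably the most convenient form, since membership tests on a set are fast. A minimal sketch, assuming one stopword per line as written above; `load_stops` is a hypothetical helper and the example round-trips through a temporary file rather than the real `stops.txt`.

```python
import os
import tempfile

def load_stops(path):
    """Read one stopword per line into a set, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

# round trip through a temporary file instead of the real stops.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'stops_demo.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    for s in ['the', 'amp', "'ll"]:
        f.write(s + '\n')

loaded = load_stops(demo_path)
print(sorted(loaded))
```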

In [119]:
del consideration, consideration_poly

In [120]:
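# set lookup is constant time; checking against the list scans it for every token
stops_set = set(stops)

def remove_stops(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [
                [w for w in sentence if w not in stops_set]
                for sentence in value]
    return dict_list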

In [121]:
# test on deadpool 
deadpool = copy.deepcopy(screenplays_poly[250])
deadpool_nonstop = remove_stops(deadpool)
print_first_lines(deadpool_nonstop, 10)

{np.int8(2): [['written', 'rhett', 'reese', 'paul', 'wernick', 'final', 'shooting', 'november']]}
{np.int8(1): [['black'], ['low', 'volume', 'tinny', 'speaker', 'juice', 'newton', "'s", 'angel', 'morning']]}
{np.int8(0): [[], ['taxi', 'cab', 'morning']]}
{np.int8(1): [['deadpool', 'full', 'dress', 'reds', 'mask', 'quietly', 'fidgets', 'back', 'seat', 'taxi', 'cab', 'proceeds', 'along', 'city', 'freeway'], ['deadpool', 'adjusts', 'two', 'katanas', 'strapped', 'back'], ['rolls', 'windows'], ['tries', 'futilely', 'untwist', 'seatbelt', 'lunges', 'forward', 'locking'], ['rifles', 'tourist', 'booklet', 'tears', 'haunted', 'segway', 'tour', 'coupon'], ['cabbie', 'young', 'thin', 'brown', 'glances', 'back', 'forth', 'rear', 'view', 'road', 'rear', 'view']]}
{np.int8(2): [['kinda', 'lonesome', 'back']]}
{np.int8(2): [['little', 'help']]}
{np.int8(1): [['cabbie', 'grabs', 'deadpool', "'s", 'hand', 'pulls', 'front'], ['deadpool', "'s", 'head', 'rests', 'upside', 'bench', 'seat', 'maneuvers', 'le

In [122]:
# apply all
screenplays_nonstop = screenplays_poly.apply(remove_stops)
print_first_lines(screenplays_nonstop[0], 10)

{np.int8(1): [['night', 'roxbury']]}
{np.int8(2): [['written', 'steve', 'koren', 'ferrell', 'chris', 'kattan', 'june']]}
{np.int8(0): [['panoramic', 'view', 'sunset']]}
{np.int8(1): [['hear', 'love', 'haddaway', 'night', 'falls', 'partytime', 'begins']]}
{np.int8(0): [['superimpose', 'sunset', 'blvd.', 'pm']]}
{np.int8(0): [['dance', 'clubs', 'night']]}
{np.int8(2): [['coconut', 'teaser', 'palace', 'roxbury', 'tatou', 'etc']]}
{np.int8(0): [['dance', 'clubs-', 'quick', 'shots', 'night']]}
{np.int8(1): [['random', 'dancers', 'gyrating', 'flirting', 'making', 'drinking']]}
{np.int8(0): [['palace', 'night']]}


In [123]:
print_first_lines(screenplays_nonstop[10], 10)

{np.int8(0): [['last', 'mohicans']]}
{np.int8(2): [['written', 'michael', 'mann', 'christopher', 'crowe']]}
{np.int8(1): [['screen', 'microcosm', 'leaf', 'crystal', 'drops', 'precipitation', 'stone', 'emerald', 'green', 'moss'], ["'s", 'landscape', 'miniature'], ['hear', 'forest'], ['distant', 'birds'], ['sound', 'seems', 'reverberate', 'cavern'], ['piece', 'sunlight', 'refracts', 'within', 'drops', 'water', 'paints', 'patch', 'moss', 'yellow'], ['whisper', 'wind', 'joined', 'another', 'sound', 'mixes'], ['distant', 'rustling'], ['gets', 'closer', 'louder'], ["'s", 'shallow', 'breathing'], ['gets', 'ominous'], ["'re", 'interlopers', 'floor', 'forest', 'something', 'coming']]}
{np.int8(0): [['suddenly', 'moccasined', 'foot']]}
{np.int8(1): [['rockets', 'frame', 'scaring', 'us']]}
{np.int8(0): [['extremely', 'close', 'part', 'indian', 'face']]}
{np.int8(1): [['running', 'hard'], ['head', 'shaved', 'bald', 'except', 'scalp-lock'], ['tattoos'], ["'s", 'twenty-five'], ['seems', 'tall', 'mus
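
The `'s` and `'re` tokens above survive because the NLTK list stores contractions whole (`"it's"`, `"you're"`), while `word_tokenize` emits the clitic as its own token. One way to catch them is to add the clitic forms to the stop list, as was already done for `'ll`. The sketch below uses a hypothetical `CLITICS` set and a tiny stand-in stop set rather than the real `stops`.

```python
# clitics that a Treebank-style tokenizer splits off as separate tokens
CLITICS = {"'s", "'re", "'ve", "'ll", "'d", "'m", "n't"}

# stand-in for the full stop list, just for this example
demo_stops = {'it', 'is', 'we', 'on', 'the', 'of', 'and'} | CLITICS

sentence = ['we', "'re", 'interlopers', 'on', 'the', 'floor', 'of', 'the', 'forest']
filtered = [w for w in sentence if w not in demo_stops]
print(filtered)
```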

In [124]:
del screenplays_poly

## remove empty values 

In [125]:
def remove_empties(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [sent for sent in value if sent]
    return dict_list

# test on deadpool 
deadpool_cleaned = remove_empties(deadpool_nonstop)
print_first_lines(deadpool_cleaned, 10)

{np.int8(2): [['written', 'rhett', 'reese', 'paul', 'wernick', 'final', 'shooting', 'november']]}
{np.int8(1): [['black'], ['low', 'volume', 'tinny', 'speaker', 'juice', 'newton', "'s", 'angel', 'morning']]}
{np.int8(0): [['taxi', 'cab', 'morning']]}
{np.int8(1): [['deadpool', 'full', 'dress', 'reds', 'mask', 'quietly', 'fidgets', 'back', 'seat', 'taxi', 'cab', 'proceeds', 'along', 'city', 'freeway'], ['deadpool', 'adjusts', 'two', 'katanas', 'strapped', 'back'], ['rolls', 'windows'], ['tries', 'futilely', 'untwist', 'seatbelt', 'lunges', 'forward', 'locking'], ['rifles', 'tourist', 'booklet', 'tears', 'haunted', 'segway', 'tour', 'coupon'], ['cabbie', 'young', 'thin', 'brown', 'glances', 'back', 'forth', 'rear', 'view', 'road', 'rear', 'view']]}
{np.int8(2): [['kinda', 'lonesome', 'back']]}
{np.int8(2): [['little', 'help']]}
{np.int8(1): [['cabbie', 'grabs', 'deadpool', "'s", 'hand', 'pulls', 'front'], ['deadpool', "'s", 'head', 'rests', 'upside', 'bench', 'seat', 'maneuvers', 'legs']

In [126]:
# apply all
cleaned_screenplays = screenplays_nonstop.apply(remove_empties)

In [127]:
print_first_lines(cleaned_screenplays[12], 10)

{np.int8(2): [['fourth', 'draft', 'screenplay', 'james', 'baldwin', 'arnold', 'perl', 'spike', 'lee', 'based', 'autobiography', 'malcolm', 'told', 'alex', 'haley']]}
{np.int8(0): [['roxbury', 'street', 'war', 'years', 'day']]}
{np.int8(1): [['bright', 'sunny', 'day', 'crowded', 'street', 'black', 'side', 'boston'], ['people', 'kids', 'busy', 'things'], ['shorty', 'bops', 'way', 'street'], ['runty', 'dark', 'young', 'man', 'mission', 'smile', 'face'], ['wears', 'flamboyant', 'style', 'time', 'whole', 'zoot-suit', 'pegged', 'legs', 'wide', 'brim', 'hat', 'white', 'feather', 'stuck', 'hat', 'band']]}
{np.int8(0): [['street', 'day']]}
{np.int8(1): [['follow', 'shot'], ['shorty', 'dodges', 'crowd', 'packages'], ['smile', 'one', 'anticipation'], ['nods', 'pal', 'without', 'stopping', 'eyes', 'couple', 'chicks', 'dancing', 'street', 'dissuaded']]}
{np.int8(0): [['barber', 'shop', 'day']]}
{np.int8(1): [['shorty', 'jacket', 'hat', 'sleeves', 'rolled'], ['like', 'surgeon', 'preparing', 'operati

In [128]:
# remove stops again 
cleaned_screenplays = cleaned_screenplays.apply(remove_stops)
print_first_lines(cleaned_screenplays[12], 10)

{np.int8(2): [['fourth', 'draft', 'screenplay', 'james', 'baldwin', 'arnold', 'perl', 'spike', 'lee', 'based', 'autobiography', 'malcolm', 'told', 'alex', 'haley']]}
{np.int8(0): [['roxbury', 'street', 'war', 'years', 'day']]}
{np.int8(1): [['bright', 'sunny', 'day', 'crowded', 'street', 'black', 'side', 'boston'], ['people', 'kids', 'busy', 'things'], ['shorty', 'bops', 'way', 'street'], ['runty', 'dark', 'young', 'man', 'mission', 'smile', 'face'], ['wears', 'flamboyant', 'style', 'time', 'whole', 'zoot-suit', 'pegged', 'legs', 'wide', 'brim', 'hat', 'white', 'feather', 'stuck', 'hat', 'band']]}
{np.int8(0): [['street', 'day']]}
{np.int8(1): [['follow', 'shot'], ['shorty', 'dodges', 'crowd', 'packages'], ['smile', 'one', 'anticipation'], ['nods', 'pal', 'without', 'stopping', 'eyes', 'couple', 'chicks', 'dancing', 'street', 'dissuaded']]}
{np.int8(0): [['barber', 'shop', 'day']]}
{np.int8(1): [['shorty', 'jacket', 'hat', 'sleeves', 'rolled'], ['like', 'surgeon', 'preparing', 'operati

In [129]:
cleaned_screenplays = cleaned_screenplays.apply(remove_empties)
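
The cleaning steps above all walk the same nested structure. As a sketch of how they could be folded into one pass (not what the notebook does), the hypothetical `clean_screenplay` below applies the letter filter, lowercasing, length cut and stopword removal per token, then drops empty sentences, building new objects instead of mutating. The small `demo_stops` set stands in for the real list.

```python
import re

LETTER = re.compile(r'[a-zA-Z]')

def clean_screenplay(dict_list, stop_set):
    """Apply all token filters in a single traversal and return new dicts."""
    cleaned = []
    for d in dict_list:
        new_d = {}
        for label, sentences in d.items():
            kept = []
            for sentence in sentences:
                tokens = [
                    w.lower() for w in sentence
                    if LETTER.search(w) and len(w) > 1 and w.lower() not in stop_set
                ]
                if tokens:  # drop sentences that end up empty
                    kept.append(tokens)
            new_d[label] = kept
        cleaned.append(new_d)
    return cleaned

demo_stops = {'the', 'of', 'a', 'is', 'it'}
sample = [
    {0: [['THE', 'LAST', 'OF', 'THE', 'MOHICANS']]},
    {1: [['It', 'is', 'a', '.'], ['Some', 'distant', 'birds', '.']]},
]
result = clean_screenplay(sample, demo_stops)
print(result)
```

Because nothing is mutated, this variant would not need the `deepcopy` calls used for testing.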

# Save to JSON

In [130]:
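# convert series to a json
# note: the output is gzip-compressed despite the .json extension,
# so it has to be read back with compression='gzip'
cleaned_screenplays.to_json(
    f'{root_path}\\cleaned_screenplays.json', 
    force_ascii=False, 
    indent=2,
    compression='gzip')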
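
JSON object keys are always strings, so the integer label codes will come back as `'0'`, `'1'`, `'2'` when the file is reloaded. The sketch below uses only the standard library (`gzip` and `json`, an assumption, since the notebook itself writes with pandas) to show the round trip on a single toy screenplay, plus a hypothetical `restore_int_keys` helper. The real file wraps these lists in an outer object keyed by the Series index.

```python
import gzip
import json
import os
import tempfile

def restore_int_keys(dict_list):
    """Convert the string keys produced by JSON back into integer label codes."""
    return [{int(k): v for k, v in d.items()} for d in dict_list]

screenplay = [{0: [['palace', 'night']]}, {1: [['random', 'dancers']]}]

path = os.path.join(tempfile.mkdtemp(), 'demo.json.gz')
with gzip.open(path, 'wt', encoding='utf-8') as f:
    json.dump(screenplay, f, ensure_ascii=False)

with gzip.open(path, 'rt', encoding='utf-8') as f:
    reloaded = json.load(f)

print(reloaded[0])                     # keys are strings after the round trip
print(restore_int_keys(reloaded)[0])   # back to integer label codes
```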

TODO: 
- expand stops list 
- cut useless metadata if possible 
- lemmatize if possible 
- stem if not 
- reach a reasonable avg length target 
- apply phrases model
- try word vectorization. The eventual output vectors should look like:
  `{label code: sentence{ {v1}, {v2}, etc. }}`
- build a BERT annotator and use its annotations as supervision 
- run through NN pipeline, truncate aggressively, and use samples only for training 
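
For the average-length target, a first step might be simply counting tokens per screenplay in the cleaned structure. A minimal sketch on toy data; `count_tokens` is a hypothetical helper, and on the real Series something like `cleaned_screenplays.apply(count_tokens).mean()` would presumably give the same figure.

```python
from statistics import mean

def count_tokens(dict_list):
    """Total number of tokens across all blocks and sentences of one screenplay."""
    return sum(
        len(sentence)
        for d in dict_list
        for sentences in d.values()
        for sentence in sentences
    )

toy_corpus = [
    [{0: [['last', 'mohicans']]}, {1: [['hear', 'forest'], ['distant', 'birds']]}],
    [{0: [['palace', 'night']]}],
]
lengths = [count_tokens(sp) for sp in toy_corpus]
print(lengths, mean(lengths))
```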