Here I'll take a look at the BERT annotated data and see if we can do anything with it. 

# Australian Film Classification Guidelines

Helpful resources for how film classifications are determined can be found here: 

https://www.classification.gov.au/classification-ratings/how-rating-decided

Some relevant excerpts below: 

Approved classification tools use logic rules and algorithms to classify content.

The guidelines include 6 classifiable elements. These are:

- themes
- violence
- sex
- language
- drug use
- nudity.

Impact

The impact depends on the frequency, intensity and the overall effect of the content. The purpose, tone and style can affect impact.

Impact may increase where depictions are:

- detailed
- prolonged
- realistic
- interactive.

Impact may be lower where content is:

- implied rather than depicted
- not detailed
- short in duration
- verbal and not visual
- incidental and not direct.

The level of impact allowed in each classification category (rating):

| Rating | Impact level |
| --- | --- |
| G | Very mild |
| PG | Mild |
| M | Moderate |
| MA 15+ | Strong |
| R 18+ | High |
| RC | Very high |

<i> So we have six classifiable elements and six impact levels, one for each of the six ratings.  The goal is to build a model that scores the impact level of the screenplay on each of these six elements, then uses those six scores to predict the classification.  For example, we might expect a neural network architecture to look like:

embedding layer(vector for each {sentence, word} in screenplay)
- > 6 (?) neurons, one for each classifiable element
( -> context layer? )
- > 6 (?) neurons, an impact score for each element 
- > 1 output neuron for classification
OR 
- > an output layer per country classification desired? </i> 
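
The idea above can be sketched as a tiny forward pass. This is only an illustration under loud assumptions: the weights are random rather than trained, the embeddings are made up, mean pooling stands in for whatever the embedding layer would really produce, and the names `forward`, `w_elements` and `w_ratings` are hypothetical. The context layer is left out.

```python
import math
import random

ELEMENTS = ["themes", "violence", "sex", "language", "drug use", "nudity"]
RATINGS = ["G", "PG", "M", "MA 15+", "R 18+", "RC"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def forward(sentence_vectors, w_elements, w_ratings):
    dim = len(sentence_vectors[0])
    n = len(sentence_vectors)
    # mean-pool the sentence embeddings into one screenplay vector
    pooled = [sum(vec[i] for vec in sentence_vectors) / n for i in range(dim)]
    # one impact score in (0, 1) per classifiable element
    impact = [sigmoid(z) for z in matvec(w_elements, pooled)]
    # one logit per rating; the prediction is the arg-max
    logits = matvec(w_ratings, impact)
    rating = RATINGS[logits.index(max(logits))]
    return dict(zip(ELEMENTS, impact)), rating

# made-up inputs: 5 sentence embeddings of dimension 8, random (untrained) weights
random.seed(0)
dim = 8
sentence_vectors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
w_elements = [[random.gauss(0, 1) for _ in range(dim)] for _ in ELEMENTS]
w_ratings = [[random.gauss(0, 1) for _ in ELEMENTS] for _ in RATINGS]

impact_scores, predicted = forward(sentence_vectors, w_elements, w_ratings)
print(impact_scores)
print(predicted)
```

With untrained weights the prediction is meaningless; the point is only the shape of the pipeline: screenplay vector, six element scores, one rating.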

Context

Context determines whether the storyline justifies content. Content that falls into a particular rating in one context may fall outside it in a different context.

For example, the way the content deals with social issues may require a mature or adult perspective. Historical context may also justify certain depictions.

<i> Should there be a separate context layer then? </i>

Consumer advice

Under section 20 of the Act, a classification decision must include consumer advice.

Consumer advice gives information about the content. It usually describes the classifiable elements with the greatest impact.

For example, a film classified PG may have consumer advice of 'Mild violence and coarse language'. This means that the film is PG because the violence and coarse language are mild in impact.

<i> Ideally, our model would also therefore output this Consumer Advice </i> 
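
As a rough sketch of what that output could look like: if the model produced one impact level per element, the rating and the consumer advice could both be read off the highest level. This is a simplifying assumption on my part (it ignores context entirely), and `rate_with_advice` is a hypothetical helper.

```python
IMPACT_LEVELS = ["Very mild", "Mild", "Moderate", "Strong", "High", "Very high"]
RATINGS = ["G", "PG", "M", "MA 15+", "R 18+", "RC"]

def rate_with_advice(element_levels):
    """element_levels maps element name -> impact level index (0..5)."""
    top = max(element_levels.values())
    # assumption: the elements that reach the highest level drive the advice
    drivers = [name for name, level in element_levels.items() if level == top]
    advice = f"{IMPACT_LEVELS[top]} {' and '.join(drivers)}"
    return RATINGS[top], advice

# made-up scores for illustration
levels = {"themes": 0, "violence": 1, "sex": 0, "language": 1, "drug use": 0, "nudity": 0}
rating, advice = rate_with_advice(levels)
print(rating, "-", advice)
```

For these made-up scores the result is PG with the advice 'Mild violence and language', which mirrors the PG example quoted above.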





# 0. Import Data and Assemble DataFrame

In [None]:
# replace paths here
root_path = r'C:\Users\bened\DataScience\ANLP\AT2'

import os 

folder_path = f'{root_path}\\BERT_annotations'
screenplays_annot = {}

# list all files in folder and iterate over them 
for file_name in os.listdir(folder_path):
    # get file_path by joining folder path with file_name
    file_path = os.path.join(folder_path, file_name)
    # ensure path points to an actual file
    if os.path.isfile(file_path):
        with open(file_path, 'r', encoding='latin-1') as f:
            content = f.read()
            screenplays_annot[file_name] = content

# ensure files were imported correctly by printing a sample of the first ten files 
i = 0
for file_name, content in screenplays_annot.items():
    if i == 10:
        break
    else:
        print(f"Example of {file_name}:\n")
        print(content[:100])
        print("-"*50)
        i += 1

## Global Functions

In [None]:
import numpy as np

In [2]:
def print_first_lines(dict_list, n):
    for idx, d in enumerate(dict_list):
        if idx == n:
            break
        else:
            print(d)

In [None]:
def find_avg_length(series):
    avg_length = np.mean([len(d) for d in series])
    print("Average length:", avg_length)

## 0.1 Join with metadata

In [None]:
import pandas as pd

meta_df = pd.read_csv(f'{root_path}\\movie_meta_data.csv')
meta_df.head()

In [None]:
# take a look at filename format

filenames = list(screenplays_annot.keys())
print(filenames[:10])

In [None]:
# filenames are formatted as movietitle_IMDBid 
import re

filenames = list(screenplays_annot.keys())
movie_titles = []
ids = []
for f in filenames:
    # split on every "_": the first piece is the title, the second the IMDb ID
    # (this assumes titles contain no underscores)
    parts = f.split("_")
    movie_titles.append(parts[0])
    ids.append(parts[1])

# print the first ten pairs as a sanity check
for title, imdb_id in list(zip(movie_titles, ids))[:10]:
    print("Title:", title, " ID:", imdb_id)
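
If any titles turn out to contain underscores, splitting at the last underscore is safer than taking the first two pieces. A minimal sketch, assuming the ID always follows the final underscore and that a trailing '.txt' (if present) should be ignored; `split_title_and_id` and the sample file names are hypothetical.

```python
def split_title_and_id(file_name):
    """Split 'movietitle_IMDBid' (optionally ending in '.txt') at the LAST underscore."""
    stem = file_name[:-4] if file_name.endswith(".txt") else file_name
    title, sep, imdb_id = stem.rpartition("_")
    if not sep:
        raise ValueError(f"no underscore in file name: {file_name!r}")
    return title, imdb_id

# hypothetical file names, for illustration only
print(split_title_and_id("Some_Long_Title_0123456.txt"))
print(split_title_and_id("Title_42"))
```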

In [None]:
screenplays_df = pd.DataFrame({
    'imdbid': ids,
    'annot_screenplay': screenplays_annot.values()
})
screenplays_df.head()

In [None]:
print(screenplays_df.info())
print(meta_df.info())

In [None]:
# convert screenplays imdbid to int
screenplays_df['imdbid'] = screenplays_df['imdbid'].astype(int)
df = meta_df.merge(screenplays_df, on='imdbid')
df.head()

# 1. Data Annotations

In [None]:
roxbury_annot = df.at[0, 'annot_screenplay']
print(roxbury_annot[:1000])

Each \n starts a new 'label: data' pair, so we can try to parse each screenplay into a JSON-like list of {label: data} dicts.

## 1.1 Format Text Data as JSONs

In [None]:
# define a function to format screenplay as json
## output should look like e.g.:
## {"label": "data", "label":"data" etc.}

import re

def format_as_json(screenplay):
    # store results as list of key-value pairs
    screenplay_data = []
    # split screenplays by \n
    lines = screenplay.split("\n")
    # iterate through lines 
    for line in lines: 
        # take part of string up to : as label
        match = re.search(r':', line)
        if match:
            # take end of match as cutoff
            cutoff = match.end()
            label = line[:cutoff-1]
            # after cutoff is data 
            data = line[cutoff+1:]
            # store as dict
            line_info = {label:data}
            # append to list
            screenplay_data.append(line_info)
    # return list
    return screenplay_data

# beta test on roxbury 
roxbury_data = format_as_json(roxbury_annot)
print(roxbury_data[:100])
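
For reference, the same split can be written with `str.partition`, which cuts at the first colon only, so a colon later in the data is left alone. This is a sketch rather than a drop-in replacement: `parse_line` is a hypothetical helper, and it strips surrounding whitespace instead of skipping exactly one character after the colon.

```python
def parse_line(line):
    """Return (label, data) for a 'label: data' line, or None if there is no colon."""
    label, sep, data = line.partition(":")
    if not sep:
        return None
    return label.strip(), data.strip()

sample = (
    "scene_heading: INT. DANCE CLUBS- QUICK SHOTS - NIGHT\n"
    "text: Of random dancers -- gyrating, flirting, making out, drinking.\n"
    "a line without a label"
)
pairs = [p for p in map(parse_line, sample.split("\n")) if p is not None]
print(pairs)
```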

In [88]:
# now we'll apply this logic to the whole corpus to see if labels are the same

screenplay_jsons = df['annot_screenplay'].apply(format_as_json)

In [None]:
print(screenplay_jsons[10][:100])

## 1.2 Analyze labels

In [None]:
# assess the unique keys (labels)
## iterate through list and return set of unique labels
unique_labels = {key for d in roxbury_data for key in d.keys()}
print(unique_labels)

In [90]:
def find_unique_labels(json):
    unique_labels = {key for d in json for key in d.keys()}
    return unique_labels

unique_labels_series = screenplay_jsons.apply(find_unique_labels)
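
Besides the set of unique labels, it can help to count how often each label occurs, since a very rare label is more likely a parsing accident than a real annotation. A small sketch with `collections.Counter`; `count_labels` is a hypothetical helper and the sample is made up.

```python
from collections import Counter

def count_labels(dict_list):
    """Count how many single-key dicts carry each label."""
    return Counter(key for d in dict_list for key in d)

sample = [
    {"scene_heading": "INT. DANCE CLUBS- QUICK SHOTS - NIGHT"},
    {"text": "Of random dancers -- gyrating, flirting, making out, drinking."},
    {"text": ""},
]
label_counts = count_labels(sample)
print(label_counts)
```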


## 1.3 Find and drop rows where data is empty

In [None]:
unequal_length = []
for series in unique_labels_series:
    if len(series) != 4:
        unequal_length.append(series)

print(unequal_length)
    

In [92]:
# some annotations have only three labels, which is fine, but others appear to be empty, which we should investigate
empty_series = []
for idx, series in enumerate(unique_labels_series):
    if len(series) == 0:
        empty_series.append(idx)

In [None]:
missing_imdbid = df.loc[empty_series, 'imdbid']
df.loc[empty_series]

In [None]:
screenplays_df[screenplays_df['imdbid'].isin(missing_imdbid)]

If you look at the source data, you'll find the .txt files for these screenplays are simply empty.  We'll drop them.

In [None]:
df_clean = df.drop(empty_series)
df_clean.head()

In [20]:
screenplay_jsons.drop(empty_series, inplace=True)

So we have 'scene_heading', 'speaker_heading', 'text' and 'dialog' labels.  Let's look at an example screenplay to see what might be worth removing. 

## 'text'

- 'text' is likely to be relevant, e.g. {'text': 'Of random dancers -- gyrating, flirting, making out, drinking.'}
- 'scene_heading' is conceivably relevant, e.g. {'scene_heading': 'INT. DANCE CLUBS- QUICK SHOTS - NIGHT'} -- a scene set in a nightclub might predict a higher classification. 

Let's look at a representative sample of 'text'. 

In [None]:
# iterate through jsons and return data for 'text' label 
roxbury_texts = [d.get('text') for d in roxbury_data if 'text' in d]
print(roxbury_texts[:100])

In [None]:
# and the last 100 values
print(roxbury_texts[-100:])

The 'text' values are most likely relevant. 

## Flattening Data

In some cases a single sentence is split across several consecutive values that share the same key.  The function below finds these contiguous values and joins them into one value.  This will make sentence tokenization more meaningful later on. 

In [95]:
# we'll define a more general function this time that takes a key input
def flatten_data(dict_list, key):
    flattened_data = []
    temp = ''
    for d in dict_list:
        if key in d:
            temp += ' ' + d[key] if temp else d[key]
        else:
            # if a key other than input is encountered and temp is not empty
            if temp:
                # append the concatenated string to text list 
                flattened_data.append({key:temp})
                # and reset temp 
                temp = ''
            # append non text dict to list 
            flattened_data.append(d)
    # after loop ends, concatenate what's left in temp if anything
    if temp:
        flattened_data.append({key:temp})
    # and return concatenated list
    return flattened_data
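
The same merge can also be expressed with `itertools.groupby`, which groups consecutive items that share a property. This is an alternative sketch, not the function used in this notebook; `flatten_runs` is a hypothetical name and the sample lines are made up. One difference to be aware of: a run made up only of empty strings is kept here as a single empty value.

```python
from itertools import groupby

def flatten_runs(dict_list, key):
    """Merge consecutive {key: ...} dicts into one; leave other dicts untouched."""
    flattened = []
    for has_key, run in groupby(dict_list, key=lambda d: key in d):
        run = list(run)
        if has_key:
            # join the whole run of same-key values with single spaces
            flattened.append({key: " ".join(d[key] for d in run)})
        else:
            flattened.extend(run)
    return flattened

sample = [
    {"text": "He walks"},
    {"text": "to the door."},
    {"dialog": "Hello."},
    {"text": "She waves."},
]
merged = flatten_runs(sample, "text")
print(merged)
```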

In [None]:
# test function on the film 'Anonymous'
anon = screenplay_jsons[100]
anon_text_flattened = flatten_data(anon, 'text')
print(anon_text_flattened[:100])

In [None]:
# and now for dialog
anon_dialog_flattened = flatten_data(anon_text_flattened, 'dialog')
for i, d in enumerate(anon_dialog_flattened):
    if i == 50:
        break
    else:
        print(d)

This appears to have worked, so we'll apply it to the whole corpus for both the 'text' and 'dialog' keys.

In [98]:
screenplays_flat_txt = screenplay_jsons.apply(flatten_data, key='text')
screenplays_flat = screenplays_flat_txt.apply(flatten_data, key='dialog')

In [None]:
print(screenplays_flat[10][:100])

In [100]:
del screenplays_flat_txt

## speaker_heading

'speaker_heading' is likely irrelevant, but let's take a look

In [None]:
speaker_headings = [d.get('speaker_heading') for d in roxbury_data if 'speaker_heading' in d]

'speaker_heading' almost certainly irrelevant. 

In [102]:
# drop speaker_heading data 
def decapitate_speakers(json_list):
    decapitated = [d for d in json_list if 'speaker_heading' not in d]
    return decapitated 

# test on roxbury (the next cells use roxbury_decapitated)
roxbury_decapitated = decapitate_speakers(roxbury_data)
print(roxbury_decapitated[:100])

In [None]:
# check unique keys in roxbury_decapitated 
roxbury_decapitated_labels = find_unique_labels(roxbury_decapitated)
print(roxbury_decapitated_labels)

In [32]:
# free up RAM 
del df, filenames, content, meta_df, ids, movie_titles, roxbury_annot, roxbury_data, screenplays_annot, screenplays_df, speaker_headings, unique_labels_series

In [104]:
# apply to all texts 
decapitated_screenplays = screenplays_flat.apply(decapitate_speakers)

In [None]:
print(decapitated_screenplays[10][:100])

In [106]:
del screenplays_flat

note: we can likely remove any values:
- that are empty 
- containing ':' -- these seem to be camera directions

In [None]:
print(roxbury_decapitated[:100])

### Remove Empty Strings

In [109]:
# remove empty strings

import string 
puncts = set(string.punctuation)

def remove_nulls(json_list):
    non_nulls = []
    for d in json_list:
        # keep the dict only if every value has at least one non-punctuation character
        # (all() is True for an empty string, so this also rejects '')
        if not any(all(char in puncts for char in val) for val in d.values()):
            non_nulls.append(d)
    return non_nulls

# beta test on roxbury (roxbury_nonna is deleted again further down)
roxbury_nonna = remove_nulls(roxbury_decapitated)
print(roxbury_nonna[:100])

In [None]:
print(decapitated_screenplays[10][:100])

In [None]:
# apply to all data 
screenplays_nonna = decapitated_screenplays.apply(remove_nulls)
print(screenplays_nonna[10][:100])

In [None]:
i = 0
for s in screenplays_nonna[10]:
    if i == 10:
        break
    print(s)
    i += 1

It's possible that values in allcaps are basically irrelevant. Let's return these and take a look at them. 

In [None]:
pattern = re.compile(r'^[^a-z]+$')

def no_lower(dict_list):
    no_lowers = []
    for d in dict_list:
        for val in d.values():
            if re.match(pattern, val):
                no_lowers.append(d)
    return no_lowers

# test on mohicans
mohicans = screenplays_nonna[10]
mohicans_no_lowers = no_lower(mohicans)
# print the first 20 matches (compare the index, not the dict, with 20)
for i, d in enumerate(mohicans_no_lowers):
    if i == 20:
        break
    print(d)

There is potentially relevant info in here, e.g. 'GUN', 'MASSIVE WAR CLUB'.  However, we can delete all strings that match 'CUT TO ...'. 

In [None]:
print(screenplays_nonna[0][:10])

In [None]:
def delete_cuts(dict_list):
    # empty list for filtered dicts
    dicts_uncut = []
    for d in dict_list:
        # if none of the values in the dict match 'CUT'
        if all(not re.search(r'CUT', str(val)) for val in d.values()):
            # then append to list
            dicts_uncut.append(d)
    return dicts_uncut 

# test on mohicans_no_lowers
mohicans_uncut = delete_cuts(mohicans_no_lowers)
# print the first 20 survivors (compare the index, not the dict, with 20)
for i, d in enumerate(mohicans_uncut):
    if i == 20:
        break
    print(d)
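
Note that searching for 'CUT' anywhere also drops values such as dialog containing 'CUT IT OUT'. If that turns out to matter, a stricter pattern could target values that are nothing but a transition. A hedged sketch: the pattern `CUT_ONLY`, the helper `delete_cut_lines` and the sample values are all hypothetical and have not been checked against the corpus.

```python
import re

# hypothetical pattern: the whole value is a transition such as 'CUT TO:' or
# 'SMASH CUT TO BLACK.' (an optional leading word, 'CUT TO', then no lowercase text)
CUT_ONLY = re.compile(r"^\s*(?:[A-Z]+\s+)?CUT\s+TO\b[^a-z]*$")

def delete_cut_lines(dict_list):
    """Keep only dicts in which no value is a bare 'CUT TO' transition."""
    return [d for d in dict_list if not any(CUT_ONLY.match(str(v)) for v in d.values())]

# made-up values for illustration
sample = [
    {"text": "CUT TO:"},
    {"text": "SMASH CUT TO BLACK."},
    {"dialog": "CUT IT OUT, both of you."},
    {"text": "She picks up the CUTLASS."},
]
kept = delete_cut_lines(sample)
print(kept)
```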

In [None]:
print(mohicans[:10])

In [117]:
mohicans_uncut = delete_cuts(mohicans)

In [None]:
print(mohicans_uncut[:10])

In [None]:
print(screenplays_nonna[10][:10])

In [None]:
# apply all 
import numpy as np

# average length before filtering
avg_length_before = np.mean([len(s) for s in screenplays_nonna])
print("avg length before:", avg_length_before)

screenplays_uncut = screenplays_nonna.apply(delete_cuts)

# average length after filtering
avg_length_after = np.mean([len(s) for s in screenplays_uncut])
print("avg_length_after:", avg_length_after)

In [None]:
print(screenplays_uncut[10][:10])

In [122]:
del screenplays_nonna

In [None]:
df_clean.loc[200]

In [None]:
# print another random sample to see where we're at 
change_up = screenplays_uncut[200]
for i, s in enumerate(change_up):
    if i == 10:
        break
    print(s)

# Sentence Tokenization

We'll sentence tokenize the values first before removing punctuation marks etc. 

In [125]:
# ! pip install nltk

In [None]:
print(mohicans_uncut[:10])

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

# nltk.download('punkt')  # run once if the tokenizer models are missing (newer NLTK releases ask for 'punkt_tab')

# try out first on mohicans
mohicans_sents = []

for d in mohicans_uncut:
    # empty dict for storing result 
    sents_dict = {}
    for key, value in d.items():
        # # if the value is a list, unpack the list first (needs to be debugged)
        # if isinstance(value, list):
        #     value = str(value)
        #     sents_dict[key] = sent_tokenize(value)
        #     mohicans_sents.append(sents_dict)
        # else:
        sents_dict[key] = sent_tokenize(value)
    # append once per dict, after all of its keys have been tokenized
    mohicans_sents.append(sents_dict)

for i, j in enumerate(mohicans_sents):
    if i == 10:
        break
    print(j)

Seems to work okay, although now we have to deal with a list of dicts of lists, including lists that hold only one sentence.

If this is run again, should it be a dict of dicts instead? With a structure like:
{"screenplay":
    {"label":"data"},
    {"label":"data"},
    etc}
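
One caveat with a dict keyed by label: the same label occurs many times in a screenplay, and dict keys must be unique, so later values would overwrite earlier ones. An ordered list of (label, sentences) pairs avoids that while staying flat. A minimal sketch; `to_pairs` is a hypothetical helper and the sample is made up.

```python
def to_pairs(dict_list):
    """Turn [{label: value}, ...] into [(label, value), ...], keeping order."""
    return [(label, value) for d in dict_list for label, value in d.items()]

sample = [
    {"scene_heading": ["INT."]},
    {"text": ["He enters.", "He sits."]},
    {"text": ["She leaves."]},
]
pairs = to_pairs(sample)
for label, sentences in pairs:
    print(label, sentences)
```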

In [128]:
# delete unneeded variables before sentence tokenization 
del anon, anon_dialog_flattened

In [None]:
del anon_text_flattened, change_up, decapitated_screenplays, mohicans, mohicans_no_lowers, mohicans_sents, mohicans_uncut, roxbury_decapitated, roxbury_nonna, roxbury_texts, screenplay_jsons

In [130]:
# define as a general function
def sent_tokenize_dicts(dict_list):

    sentence_dicts = []

    for d in dict_list:
        # empty dict for storing result 
        sents_dict = {}
        for key, value in d.items():
            sents_dict[key] = sent_tokenize(value)
        # append once per dict, after all of its keys have been tokenized
        sentence_dicts.append(sents_dict)
    
    return sentence_dicts

In [131]:
# apply all 
screenplay_sents = screenplays_uncut.apply(sent_tokenize_dicts)

In [None]:
print_first_lines(screenplay_sents[50], 10)

looks okay

In [133]:
del screenplays_uncut

## Label Encoding

At this point we're going to encode our labels just to save on memory. 

In [None]:
label_map = {
    'scene_heading': 0,
    'text': 1,
    'dialog': 2
}

# check how it will work on ten things I hate about you 
ten_things_sents = screenplay_sents[50]
# iterate through dict list
ten_things_encoded = []
for d in ten_things_sents:
    encoded_dict = {np.int8(label_map[key]): value for key, value in d.items()}
    ten_things_encoded.append(encoded_dict)

print(ten_things_encoded[:10])


In [None]:
# define as function and apply all 

def encode_labels(dict_list):
    encoded_list = []
    for d in dict_list:
        encoded_dict = {np.int8(label_map[key]): value for key, value in d.items()}
        encoded_list.append(encoded_dict)
    return encoded_list

screenplays_encoded = screenplay_sents.apply(encode_labels)
print(screenplays_encoded[0][:10])

## remove 'EXT' and 'INT' 

We can remove all sentences which contain only EXT/INT

In [136]:
def remove_location(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [item for item in value if item not in ['EXT.', 'INT.']]
    return dict_list
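
The exact-match list above only catches 'EXT.' and 'INT.' as written, and tokens like 'ext', 'int' and 'ext./int' still show up in later outputs. A regex can cover the common variants. This is a sketch under assumptions: the pattern `LOCATION_ONLY` and the helper `drop_location_markers` are hypothetical, the sample strings are made up, and unlike `remove_location` this version returns new dicts instead of editing in place.

```python
import re

# a sentence that is nothing but a location marker,
# e.g. 'INT.', 'EXT', 'ext./int', 'INT./EXT.'
LOCATION_ONLY = re.compile(
    r"^\s*(?:int|ext)\.?(?:\s*/\s*(?:int|ext)\.?)?\s*$", re.IGNORECASE
)

def drop_location_markers(dict_list):
    """Return new dicts with marker-only sentences removed; the input is not modified."""
    return [
        {key: [s for s in sentences if not LOCATION_ONLY.match(s)]
         for key, sentences in d.items()}
        for d in dict_list
    ]

# made-up sentences for illustration
sample = [
    {0: ["EXT.", "ROXBURY STREET - DAY"]},
    {0: ["ext./int", "TAXI CAB - MORNING"]},
    {1: ["Interior designers argue."]},
]
unlocated = drop_location_markers(sample)
print(unlocated)
```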

In [None]:
# test on roxbury (note: remove_location edits the dicts in place)
roxbury_sents = screenplays_encoded[0]
roxbury_unlocated = remove_location(roxbury_sents)
print_first_lines(roxbury_unlocated, 10)

In [138]:
del screenplay_sents

In [None]:
# apply all
screenplays_unlocated = screenplays_encoded.apply(remove_location)
print(screenplays_unlocated[10][:10])

In [None]:
print_first_lines(screenplays_unlocated[200], 10)

## Word Tokenization

In [None]:
print(screenplays_unlocated[10][:10])

In [None]:
from nltk.tokenize import word_tokenize
import copy

# try out on mohicans 
mohicans_sents = copy.deepcopy(screenplays_unlocated[10])

def word_tokenize_dicts(dict_list):
    # iterate through dict list
    for d in dict_list:
        # iterate through keys and values 
        for key, value in d.items():
            d[key] = [word_tokenize(sent) for sent in value]
    return dict_list

mohicans_tokenized = word_tokenize_dicts(mohicans_sents)
print_first_lines(mohicans_tokenized, 10)


In [None]:
print(screenplays_unlocated[10][:10])

It's unfortunate that we're now dealing with lists of dicts of lists of lists, but it's not clear how to avoid that without losing sentence boundaries.

In [None]:
# apply all 
screenplays_tokenized = screenplays_unlocated.apply(word_tokenize_dicts)
print_first_lines(screenplays_tokenized[0], 10)

## remove strings that contain no letters 

In [None]:
mohicans_tokenized = copy.deepcopy(screenplays_tokenized[10])
print_first_lines(mohicans_tokenized, 10)

In [None]:
import string
puncts = list(string.punctuation)
print(puncts)

In [None]:
import re 

def contains_letters(token):
    return bool(re.search(r'[a-zA-Z]', token))

def remove_non_letters(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [
                [t for t in sentence if contains_letters(t)]
                for sentence in value
            ]
    return dict_list 

# test on mohicans_tokenized
mohicans_alpha = remove_non_letters(mohicans_tokenized)
print_first_lines(mohicans_alpha, 10)

In [148]:
del mohicans_alpha, mohicans_sents, mohicans_tokenized, roxbury_sents, roxbury_unlocated, screenplays_encoded, screenplays_unlocated, ten_things_encoded, ten_things_sents

In [None]:
# seems to work so apply all 
screenplays_alpha = screenplays_tokenized.apply(remove_non_letters)
print_first_lines(screenplays_alpha[0], 10)

## to lower

Since sentence boundaries are already marked, we can convert all chars to lowercase 

In [155]:
def convert_to_lower(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [
                [w.lower() for w in sentence]
                for sentence in value
            ]
    return dict_list

In [None]:
mo = copy.deepcopy(screenplays_alpha[10])
print_first_lines(mo, 10)

In [None]:
# test on mohicans 
mo_lower = convert_to_lower(mo)
print_first_lines(mo_lower, 10)

In [None]:
consideration = copy.deepcopy(screenplays_alpha[25])
print_first_lines(consideration, 10)

In [None]:
# apply all
screenplays_lower = screenplays_alpha.apply(convert_to_lower)
print_first_lines(screenplays_lower[25], 10)

In [159]:
# screenplay_jsons and unique_labels_series were already deleted above
del mo, mo_lower, screenplays_alpha, screenplays_tokenized

## remove tokens of length 1

In [162]:
def cut_single_chars(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [
                [w for w in sentence if len(w) > 1]
                for sentence in value]
    return dict_list

In [163]:
# test on consideration 
consideration = copy.deepcopy(screenplays_lower[25])
consideration_poly = cut_single_chars(consideration)
print_first_lines(consideration_poly, 10)

{2: [[]]}
{0: [['nicole', 'holofcener', 'and', 'jeff', 'whit']]}
{1: [['based', 'on', 'the', 'book', 'by']]}
{0: [['nicole', 'holofcener', 'and', 'jeff', 'whit']]}
{2: [['can', 'you', 'ever', 'forgive', 'me'], ['screenplay', 'by', 'nicole', 'holofcener', 'and', 'jeff', 'whitty', 'based', 'on', 'the', 'book', 'can', 'you', 'ever', 'forgive', 'me'], ['by', 'lee', 'israel', 'final', 'shooting', 'script', 'march']]}
{0: [['fox', 'searchlight', 'pictures', 'inc']]}
{2: [['los', 'angeles', 'ca']]}
{0: [['all', 'rights', 'reserved'], ['copyright', 'willow', 'and', 'oak', 'inc.', 'no']]}
{0: [['portion', 'of', 'this', 'script', 'may', 'be', 'performed', 'published', 'reproduced']]}
{1: [['sold', 'or', 'distributed', 'by', 'any', 'means', 'or', 'quoted', 'or', 'published', 'in', 'any']]}


We'll remove the empty values after removing stopwords as well.

In [164]:
# apply all 
screenplays_poly = screenplays_lower.apply(cut_single_chars)
print_first_lines(screenplays_poly[250], 10)

{2: [['written', 'by', 'rhett', 'reese', 'amp', 'paul', 'wernick', 'final', 'shooting', 'script', 'november']]}
{1: [['over', 'black'], ['low', 'volume', 'through', 'tinny', 'speaker', 'juice', 'newton', "'s", 'angel', 'of', 'the', 'morning']]}
{0: [['ext./int'], ['taxi', 'cab', 'morning']]}
{1: [['deadpool', 'in', 'full', 'dress', 'reds', 'and', 'mask', 'quietly', 'fidgets', 'in', 'the', 'back', 'seat', 'of', 'taxi', 'cab', 'as', 'it', 'proceeds', 'along', 'city', 'freeway'], ['deadpool', 'adjusts', 'the', 'two', 'katanas', 'strapped', 'to', 'his', 'back'], ['rolls', 'the', 'windows', 'up', 'down', 'up'], ['tries', 'futilely', 'to', 'untwist', 'the', 'seatbelt', 'then', 'lunges', 'forward', 'locking', 'it', 'up'], ['rifles', 'through', 'tourist', 'booklet', 'and', 'tears', 'out', 'haunted', 'segway', 'tour', 'coupon'], ['the', 'cabbie', 'young', 'thin', 'brown', 'glances', 'back', 'and', 'forth', 'from', 'the', 'rear', 'view', 'to', 'the', 'road', 'to', 'the', 'rear', 'view']]}
{2: [[

In [165]:
del screenplays_lower

## stopwords

In [168]:
from nltk.corpus import stopwords

stops = stopwords.words('english')
print(stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [180]:
# extra stopwords taken from the title-page and copyright boilerplate printed above,
# plus leftover location markers and tokenizer debris
extra_stops = [
'fox', 'searchlight', 'pictures', 'inc', 'los', 'angeles', 'ca',
'all', 'rights', 'reserved', 'copyright', 'willow', 'and', 'oak', 'inc.', 'no',
'portion', 'of', 'this', 'script', 'may', 'be', 'performed', 'published', 'reproduced',
'sold', 'or', 'distributed', 'by', 'any', 'means', 'or', 'quoted', 'or', 'published', 'in', 'any',
'ext./int', 'amp', "'ll", 'ext', 'int'
]

for s in extra_stops:
    if s not in stops:
        stops.append(s)

In [169]:
del consideration, consideration_poly


In [171]:
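def remove_stops(dict_list):
    # build a set once per call: membership tests on a list scan every element
    stop_set = set(stops)
    for d in dict_list:
        for key, value in d.items():
            d[key] = [
                [w for w in sentence if w not in stop_set]
                for sentence in value]
    return dict_list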

In [172]:
# test on deadpool 
deadpool = copy.deepcopy(screenplays_poly[250])
deadpool_nonstop = remove_stops(deadpool)
print_first_lines(deadpool_nonstop, 10)

{2: [['written', 'rhett', 'reese', 'paul', 'wernick', 'final', 'shooting', 'november']]}
{1: [['black'], ['low', 'volume', 'tinny', 'speaker', 'juice', 'newton', "'s", 'angel', 'morning']]}
{0: [[], ['taxi', 'cab', 'morning']]}
{1: [['deadpool', 'full', 'dress', 'reds', 'mask', 'quietly', 'fidgets', 'back', 'seat', 'taxi', 'cab', 'proceeds', 'along', 'city', 'freeway'], ['deadpool', 'adjusts', 'two', 'katanas', 'strapped', 'back'], ['rolls', 'windows'], ['tries', 'futilely', 'untwist', 'seatbelt', 'lunges', 'forward', 'locking'], ['rifles', 'tourist', 'booklet', 'tears', 'haunted', 'segway', 'tour', 'coupon'], ['cabbie', 'young', 'thin', 'brown', 'glances', 'back', 'forth', 'rear', 'view', 'road', 'rear', 'view']]}
{2: [['kinda', 'lonesome', 'back']]}
{2: [['little', 'help']]}
{1: [['cabbie', 'grabs', 'deadpool', "'s", 'hand', 'pulls', 'front'], ['deadpool', "'s", 'head', 'rests', 'upside', 'bench', 'seat', 'maneuvers', 'legs'], ['cabbie', 'turns', 'helping', 'hand', 'handshake', 'turn

In [175]:
# apply all
screenplays_nonstop = screenplays_poly.apply(remove_stops)
print_first_lines(screenplays_nonstop[0], 10)

{1: [['night', 'roxbury']]}
{2: [['written', 'steve', 'koren', 'ferrell', 'chris', 'kattan', 'june']]}
{0: [['panoramic', 'view', 'sunset']]}
{1: [['hear', 'love', 'haddaway', 'night', 'falls', 'partytime', 'begins']]}
{0: [['superimpose', 'sunset', 'blvd.', 'pm']]}
{0: [['dance', 'clubs', 'night']]}
{2: [['coconut', 'teaser', 'palace', 'roxbury', 'tatou', 'etc']]}
{0: [['dance', 'clubs-', 'quick', 'shots', 'night']]}
{1: [['random', 'dancers', 'gyrating', 'flirting', 'making', 'drinking']]}
{0: [['palace', 'night']]}


In [176]:
print_first_lines(screenplays_nonstop[10], 10)

{0: [['last', 'mohicans']]}
{2: [['written', 'michael', 'mann', 'christopher', 'crowe']]}
{1: [['screen', 'microcosm', 'leaf', 'crystal', 'drops', 'precipitation', 'stone', 'emerald', 'green', 'moss'], ["'s", 'landscape', 'miniature'], ['hear', 'forest'], ['distant', 'birds'], ['sound', 'seems', 'reverberate', 'cavern'], ['piece', 'sunlight', 'refracts', 'within', 'drops', 'water', 'paints', 'patch', 'moss', 'yellow'], ['whisper', 'wind', 'joined', 'another', 'sound', 'mixes'], ['distant', 'rustling'], ['gets', 'closer', 'louder'], ["'s", 'shallow', 'breathing'], ['gets', 'ominous'], ["'re", 'interlopers', 'floor', 'forest', 'something', 'coming']]}
{0: [['suddenly', 'moccasined', 'foot']]}
{1: [['rockets', 'frame', 'scaring', 'us']]}
{0: [['extremely', 'close', 'part', 'indian', 'face']]}
{1: [['running', 'hard'], ['head', 'shaved', 'bald', 'except', 'scalp-lock'], ['tattoos'], ["'s", 'twenty-five'], ['seems', 'tall', 'muscled'], ['heavy', 'even', 'breathing'], ["'ll", 'learn', 'later

In [177]:
del screenplays_poly

## remove empty values 

In [174]:
def remove_empties(dict_list):
    for d in dict_list:
        for key, value in d.items():
            d[key] = [sent for sent in value if sent]
    return dict_list

# test on deadpool 
deadpool_cleaned = remove_empties(deadpool_nonstop)
print_first_lines(deadpool_cleaned, 10)

{2: [['written', 'rhett', 'reese', 'paul', 'wernick', 'final', 'shooting', 'november']]}
{1: [['black'], ['low', 'volume', 'tinny', 'speaker', 'juice', 'newton', "'s", 'angel', 'morning']]}
{0: [['taxi', 'cab', 'morning']]}
{1: [['deadpool', 'full', 'dress', 'reds', 'mask', 'quietly', 'fidgets', 'back', 'seat', 'taxi', 'cab', 'proceeds', 'along', 'city', 'freeway'], ['deadpool', 'adjusts', 'two', 'katanas', 'strapped', 'back'], ['rolls', 'windows'], ['tries', 'futilely', 'untwist', 'seatbelt', 'lunges', 'forward', 'locking'], ['rifles', 'tourist', 'booklet', 'tears', 'haunted', 'segway', 'tour', 'coupon'], ['cabbie', 'young', 'thin', 'brown', 'glances', 'back', 'forth', 'rear', 'view', 'road', 'rear', 'view']]}
{2: [['kinda', 'lonesome', 'back']]}
{2: [['little', 'help']]}
{1: [['cabbie', 'grabs', 'deadpool', "'s", 'hand', 'pulls', 'front'], ['deadpool', "'s", 'head', 'rests', 'upside', 'bench', 'seat', 'maneuvers', 'legs'], ['cabbie', 'turns', 'helping', 'hand', 'handshake', 'turns', 
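
Note that `remove_empties` drops empty sentences but keeps a dict whose sentence list ends up empty, e.g. {2: [[]]} becomes {2: []}. If those should go too, a variant along these lines would do it. This is a sketch, not what was run here; `drop_empty` is a hypothetical name and it returns a new list instead of editing in place.

```python
def drop_empty(dict_list):
    """Remove empty sentences, then drop dicts left with no sentences at all."""
    cleaned = []
    for d in dict_list:
        # first drop the empty sentences inside each value
        kept = {key: [sent for sent in value if sent] for key, value in d.items()}
        # then drop keys whose sentence list is now empty
        kept = {key: value for key, value in kept.items() if value}
        if kept:
            cleaned.append(kept)
    return cleaned

sample = [
    {2: [[]]},
    {0: [[], ["taxi", "cab", "morning"]]},
    {2: [["little", "help"]]},
]
compact = drop_empty(sample)
print(compact)
```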

In [178]:
# apply all
cleaned_screenplays = screenplays_nonstop.apply(remove_empties)

In [179]:
print_first_lines(cleaned_screenplays[12], 10)

{2: [['fourth', 'draft', 'screenplay', 'james', 'baldwin', 'arnold', 'perl', 'spike', 'lee', 'based', 'autobiography', 'malcolm', 'told', 'alex', 'haley']]}
{0: [['ext'], ['roxbury', 'street', 'war', 'years', 'day']]}
{1: [['bright', 'sunny', 'day', 'crowded', 'street', 'black', 'side', 'boston'], ['people', 'kids', 'busy', 'things'], ['shorty', 'bops', 'way', 'street'], ['runty', 'dark', 'young', 'man', 'mission', 'smile', 'face'], ['wears', 'flamboyant', 'style', 'time', 'whole', 'zoot-suit', 'pegged', 'legs', 'wide', 'brim', 'hat', 'white', 'feather', 'stuck', 'hat', 'band']]}
{0: [['ext'], ['street', 'day']]}
{1: [['follow', 'shot'], ['shorty', 'dodges', 'crowd', 'packages'], ['smile', 'one', 'anticipation'], ['nods', 'pal', 'without', 'stopping', 'eyes', 'couple', 'chicks', 'dancing', 'street', 'dissuaded']]}
{0: [['int'], ['barber', 'shop', 'day']]}
{1: [['shorty', 'jacket', 'hat', 'sleeves', 'rolled'], ['like', 'surgeon', 'preparing', 'operation'], ['equipment', 'spread', 'table

In [181]:
# remove stops again, now that extra_stops has been added to the list
cleaned_screenplays = cleaned_screenplays.apply(remove_stops)
print_first_lines(cleaned_screenplays[12], 10)

{2: [['fourth', 'draft', 'screenplay', 'james', 'baldwin', 'arnold', 'perl', 'spike', 'lee', 'based', 'autobiography', 'malcolm', 'told', 'alex', 'haley']]}
{0: [[], ['roxbury', 'street', 'war', 'years', 'day']]}
{1: [['bright', 'sunny', 'day', 'crowded', 'street', 'black', 'side', 'boston'], ['people', 'kids', 'busy', 'things'], ['shorty', 'bops', 'way', 'street'], ['runty', 'dark', 'young', 'man', 'mission', 'smile', 'face'], ['wears', 'flamboyant', 'style', 'time', 'whole', 'zoot-suit', 'pegged', 'legs', 'wide', 'brim', 'hat', 'white', 'feather', 'stuck', 'hat', 'band']]}
{0: [[], ['street', 'day']]}
{1: [['follow', 'shot'], ['shorty', 'dodges', 'crowd', 'packages'], ['smile', 'one', 'anticipation'], ['nods', 'pal', 'without', 'stopping', 'eyes', 'couple', 'chicks', 'dancing', 'street', 'dissuaded']]}
{0: [[], ['barber', 'shop', 'day']]}
{1: [['shorty', 'jacket', 'hat', 'sleeves', 'rolled'], ['like', 'surgeon', 'preparing', 'operation'], ['equipment', 'spread', 'table', 'lye', 'larg

In [183]:
cleaned_screenplays = cleaned_screenplays.apply(remove_empties)

In [185]:
# convert series to a json
cleaned_screenplays.to_json(f'{root_path}\\cleaned_screenplays.json')

TODO: 
- expand the stopword list 
- cut useless metadata (e.g. title-page credits and copyright notices) if possible 
- lemmatize if possible; stem if not 
- reach a reasonable avg length target 
- apply a phrases model
- try word vectorization. The eventual output vectors should look like:
{label code: sentence{ {v1}, {v2} etc.}}
- Build a BERT annotator.  Use these annotations as supervision. 
- Run through NN pipeline.  Truncate aggressively.  Use samples only for training. 
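
For the truncation item, one possible starting point is to flatten the nested structure into (label code, token) pairs and stop at a fixed budget. This is only a sketch: keeping the first `max_tokens` tokens is an assumption, not a decision made in this notebook, and `flatten_tokens` is a hypothetical helper.

```python
def flatten_tokens(dict_list, max_tokens):
    """Flatten [{label: [[token, ...], ...]}, ...] to (label, token) pairs, capped at max_tokens."""
    pairs = []
    for d in dict_list:
        for label, sentences in d.items():
            for sentence in sentences:
                for token in sentence:
                    # stop as soon as the budget is used up
                    if len(pairs) == max_tokens:
                        return pairs
                    pairs.append((label, token))
    return pairs

sample = [
    {1: [["night", "roxbury"]]},
    {0: [["panoramic", "view", "sunset"]]},
]
truncated = flatten_tokens(sample, 4)
print(truncated)
```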