# Entry NLP4: Frequencies and Comparison

In the previous entries in this series, I loaded all the files in a directory, processed the data, and transformed it into ngrams. Now it's time to do math and analysis!

In [2]:
import pandas as pd
import os
from IPython.display import display

import string
import re
import itertools
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/julie.fisher/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Grab and store the data
def read_script(file_path):
    corpus = ''
    with open(file_path, 'r', encoding='latin-1') as l:
        for line in l:
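            # Keep lines that don't start with a digit or '+', aren't blank,
            # and aren't URLs or subtitle-sync credits
            if (re.match(r'[^\d+]', line)
               ) and (re.match(r'^(?!\s*$).+', line)
                      ) and not (re.match(r'(.*www.*)|(.*http:*)', line)
                                ) and not (re.match(r'Sync and correct*', line)):
                # Strip italic and font tags left over from the subtitle markup
                line = re.sub(r'</?i>|</?font.*>', '', line)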
                corpus = corpus + ' ' + line
    return corpus

def load_files_to_dict(file_path, return_dict):    
    for thing in os.scandir(file_path):
        if thing.is_dir():
            new_path = os.path.join(file_path, thing.name)
            new_dict = return_dict[thing.name] = {}
            load_files_to_dict(new_path, new_dict)
        elif thing.is_file():
            return_dict[thing.name] = read_script(f'{file_path}/{thing.name}')
    return return_dict

In [4]:
def convert_dict_df(script_dict):
    return pd.DataFrame.from_dict(script_dict, orient='index').reset_index().rename(columns={'index':'script_name', 0:'corpus'})

# Clean the text and create ngrams
def punct_tokens(df, text_col):
    newline_list = '\t\r\n'
    remove_newline = str.maketrans(' ', ' ', newline_list)
    punct_list = string.punctuation + '-‘_”'
    nopunct = str.maketrans('', '', punct_list)
    df['no_punct_tokens'] = df[text_col].fillna("").str.lower().str.translate(remove_newline).str.translate(nopunct).str.split()
    return df

def create_ngrams(df):
    stop = nltk.corpus.stopwords.words('english')
    df['unigrams'] = df['no_punct_tokens'].apply(lambda x: [item for item in x if item not in stop])
    df['bigrams'] = df['unigrams'].apply(lambda x:(list(nltk.bigrams(x))))
    df['trigrams'] = df['unigrams'].apply(lambda x:(list(nltk.trigrams(x))))
    return df

def create_ngram_df(script_dict, text_col):
    df = convert_dict_df(script_dict)
    df = punct_tokens(df, text_col)
    df = create_ngrams(df)
    return df

# Frequencies

Counting words is a common sample problem and can probably be considered the 'hello world' of NLP. When putting it into a dictionary data structure, the concept isn't difficult:

- For each word (or in our case, n-gram) in the corpus
- Insert the word if it's not there (the dictionary key)
- Add 1 to the count (the dictionary value)

```
frequency_dictionary = {}
for ngram in ngram_list:
    if ngram not in frequency_dictionary:
        frequency_dictionary[ngram] = 0
    frequency_dictionary[ngram] +=1
```

The question is, how to apply this general concept to my specific use case.

The n-grams have already been created, so I don't have to worry about longer n-grams (the bigrams, and I threw in trigrams because why not?) spilling from one script to another. This means I can concatenate all the n-grams of a specific category together (i.e., all the unigrams with each other, without mixing unigrams and bigrams).

In [6]:
auth_file_path = os.path.join(os.getcwd(), 'data', '1960s')
raw_auth_dict = load_files_to_dict(auth_file_path, {})

auth_ngram_df = create_ngram_df(raw_auth_dict, 'corpus')
auth_ngram_df.head()

Unnamed: 0,script_name,corpus,no_punct_tokens,unigrams,bigrams,trigrams
0,The Twilight Zone - 3x17 - One More Pallbearer...,You're traveling\n through another dimension-...,"[youre, traveling, through, another, dimension...","[youre, traveling, another, dimension, dimensi...","[(youre, traveling), (traveling, another), (an...","[(youre, traveling, another), (traveling, anot..."
1,The Twilight Zone - 3x05 - A Game of Pool.srt,You're traveling\n through another dimension-...,"[youre, traveling, through, another, dimension...","[youre, traveling, another, dimension, dimensi...","[(youre, traveling), (traveling, another), (an...","[(youre, traveling, another), (traveling, anot..."
2,The Twilight Zone - 2x03 - Nervous Man in a Fo...,You're traveling\n through another dimension-...,"[youre, traveling, through, another, dimension...","[youre, traveling, another, dimension, dimensi...","[(youre, traveling), (traveling, another), (an...","[(youre, traveling, another), (traveling, anot..."
3,The Twilight Zone - 4x05 - Mute.srt,You unlock this door\n with the key of imagin...,"[you, unlock, this, door, with, the, key, of, ...","[unlock, door, key, imagination, beyond, anoth...","[(unlock, door), (door, key), (key, imaginatio...","[(unlock, door, key), (door, key, imagination)..."
4,The Twilight Zone - 3x04 - The Passersby.srt,You're traveling\n through another dimension-...,"[youre, traveling, through, another, dimension...","[youre, traveling, another, dimension, dimensi...","[(youre, traveling), (traveling, another), (an...","[(youre, traveling, another), (traveling, anot..."


I already know I want to use the n-grams as my unique identifier, which means I'll need to create a separate dataframe for each set of frequencies - mixing unigrams with bigrams wouldn't let me do the analysis I want. This both simplifies and complicates the process, since I won't be able to just add on to the same dataframe anymore.

The `frequency_ct` and `dict_to_df` functions that I created in the previous solution to the homework still work. The only new aspect is that I need to put all the n-gram lists from the different scripts together. My initial thought was to use `list.extend`, but that would require looping through every row of the dataframe, which isn't the fastest or most memory-efficient solution.

Fortunately, there is an easy alternative: calling the `sum` method on the column, as described in this [StackOverflow answer](https://stackoverflow.com/a/42909969).

In [7]:
auth_ngram_df['unigrams'].sum()[:10]

['youre',
 'traveling',
 'another',
 'dimension',
 'dimension',
 'sight',
 'sound',
 'mind',
 'journey',
 'wondrous']
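
Summing a column of lists concatenates them pairwise and builds a new list at each step, so it may slow down as the number of scripts grows. As an alternative sketch (not what this notebook runs), `itertools.chain.from_iterable` flattens the lists in a single pass. The `script_unigrams` data below is a made-up stand-in for `auth_ngram_df['unigrams']`.

```python
import itertools

# Hypothetical stand-in for auth_ngram_df['unigrams']: one token list per script
script_unigrams = [
    ['youre', 'traveling', 'another', 'dimension'],
    ['unlock', 'door', 'key', 'imagination'],
]

# Flatten every per-script list into one list in a single pass
all_unigrams = list(itertools.chain.from_iterable(script_unigrams))
print(all_unigrams[:3])
```

The same call should accept a pandas column directly, since iterating over a Series yields its values.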

Now that all of the ngrams are in a single list, it's a simple matter of creating a function to process them.

In [8]:
def frequency_ct(ngram_list):
    freq_dict = {}
    for ngram in ngram_list:
        if ngram not in freq_dict:
            freq_dict[ngram] = 0
        freq_dict[ngram] +=1
    return freq_dict
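
For comparison, the standard library's `collections.Counter` performs the same tally. The sketch below uses a hypothetical function name, `frequency_ct_counter`, and a made-up sample list; it is an equivalent alternative, not the function used in the rest of this entry.

```python
from collections import Counter

def frequency_ct_counter(ngram_list):
    # Counter tallies any hashable item, so it works for unigram strings
    # as well as bigram and trigram tuples
    return dict(Counter(ngram_list))

# Made-up sample mixing a few unigrams with one bigram tuple
sample = ['zone', 'twilight', 'zone', ('twilight', 'zone')]
print(frequency_ct_counter(sample))
```

`Counter` also offers `most_common()`, which returns the items sorted by count if a dataframe isn't needed.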

In [9]:
test_freq = frequency_ct(auth_ngram_df['unigrams'].sum())
test_freq

{'youre': 1410,
 'traveling': 71,
 'another': 358,
 'dimension': 353,
 'sight': 131,
 'sound': 205,
 'mind': 422,
 'journey': 76,
 'wondrous': 72,
 'land': 180,
 'whose': 100,
 'boundaries': 60,
 'imagination': 138,
 'next': 390,
 'stop': 260,
 'twilight': 499,
 'zone': 506,
 'shes': 220,
 'set': 95,
 'mr': 1604,
 'radin': 27,
 'system': 36,
 'check': 118,
 'ready': 94,
 'go': 987,
 'dont': 2199,
 'know': 1777,
 'got': 1132,
 'effects': 8,
 'youd': 192,
 'swear': 29,
 'bomb': 52,
 'exploding': 3,
 'mean': 502,
 'big': 203,
 'thats': 1367,
 'precisely': 42,
 'way': 550,
 'supposed': 72,
 'quite': 171,
 'setup': 1,
 'part': 108,
 'illusion': 27,
 'room': 201,
 'venture': 3,
 'guess': 111,
 'best': 152,
 'designed': 23,
 'shelter': 22,
 'face': 121,
 'earth': 194,
 'knows': 94,
 'hydrogen': 5,
 'tonight': 160,
 'gags': 2,
 'huh': 256,
 'something': 662,
 'sort': 67,
 'practical': 9,
 'joke': 24,
 'lets': 338,
 'say': 657,
 'start': 115,
 'stuff': 71,
 'screen': 12,
 'world': 244,
 'gettin

Of course now that I have my counts, I want to sort the n-grams from most frequent to least frequent. My favorite method to do this? DataFrames.

Unlike the previous `convert_dict_df` function, this one will need to be more flexible. It needs to be able to handle the authentic 1960s corpus, all four of the modern corpora, and whichever n-grams I happen to be running. The addition of a couple of variables to handle column naming and a `sort_values` method takes care of it.

The `corpus_name` variable in particular is important later in the analysis. I'll need to compare the authentic corpus which was written in the 1960s about the 1960s to each of the corpora written in the 21st century about the 1960s. With the flow I've established, I'll need to merge dataframes to complete the analysis. This is most easily accomplished when the non-join-on columns have different names.

Example: If I merge two dataframes that both have the columns `['unigram', 'frequency']`, I'll end up with a single dataframe with the columns `['unigram', 'frequency_x', 'frequency_y']`. I find these `_x` and `_y` suffixes less than informative and prefer to name my columns explicitly.
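
A small sketch of that naming behaviour, using two tiny dataframes built from counts that appear in this entry's outputs. The variable names `left`, `right`, `default_cols`, and `named_cols` are introduced here for illustration only.

```python
import pandas as pd

# Two small frames that share the same non-key column name
left = pd.DataFrame({'unigram': ['well', 'dont'], 'frequency': [373, 379]})
right = pd.DataFrame({'unigram': ['well', 'dont'], 'frequency': [2272, 2199]})

# Same column name on both sides: pandas disambiguates with its default markers
default_cols = list(left.merge(right, on='unigram').columns)

# Distinct names chosen up front survive the merge unchanged
named_cols = list(
    left.rename(columns={'frequency': 'Pan_Am_frequency'})
        .merge(right.rename(columns={'frequency': 'authentic_frequency'}), on='unigram')
        .columns
)
print(default_cols)
print(named_cols)
```

`merge` also takes a `suffixes` argument, but naming the columns before merging keeps the corpus name attached to the data from the start.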

In [10]:
def dict_to_df(freq_dict, gram_name, corpus_name):
    # Fail fast instead of printing a message and carrying on
    if not (isinstance(gram_name, str) and isinstance(corpus_name, str)):
        raise TypeError('gram_name and corpus_name variables must be strings')
    freq_colname = corpus_name+'_frequency'
    df = pd.DataFrame.from_dict(freq_dict, orient='index'
                               ).reset_index().rename(columns={'index':gram_name, 0:freq_colname}
                                                     ).sort_values(freq_colname, ascending=False)
    return df

But why stop my function at just the frequency? I also need normalized frequencies. Normalized frequencies level the playing field of straight counts when comparing corpora. With simple counts, a larger corpus will have n-grams with larger counts simply because there are more words overall than a smaller corpus. It doesn't necessarily reflect any relevant comparison.

Also, the homework problem requires getting ratios of the normalized frequencies later in the analysis.
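
To make the size effect concrete, here is a minimal sketch with two made-up corpora that use the same words in the same proportions, one ten times larger than the other. The `normalize` helper and the counts are hypothetical and only illustrate the idea.

```python
def normalize(counts):
    # Divide each count by the corpus total so the values sum to 1
    total = sum(counts.values())
    return {ngram: count / total for ngram, count in counts.items()}

# Made-up corpora: the second is ten times larger, with identical proportions
small_corpus = {'well': 10, 'know': 30, 'mr': 60}
large_corpus = {'well': 100, 'know': 300, 'mr': 600}

# Raw counts differ by a factor of ten, normalized frequencies do not
print(small_corpus['well'], large_corpus['well'])
print(normalize(small_corpus)['well'], normalize(large_corpus)['well'])
```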

In [11]:
def normalized_freq(freq_df, corpus_name):
    freq_col_name = corpus_name + '_frequency'
    norm_col_name = corpus_name + '_norm_freq'
    total_ct = freq_df[freq_col_name].sum()
    freq_df[norm_col_name] = freq_df[freq_col_name]/total_ct
    return freq_df

def create_frequencies(ngram_list, gram_name, corpus_name):
    freq_dict = frequency_ct(ngram_list)
    freq_df = dict_to_df(freq_dict, gram_name, corpus_name)
    freq_df = normalized_freq(freq_df, corpus_name)
    return freq_df

In [12]:
auth_freq_df = create_frequencies(auth_ngram_df['unigrams'].sum(), 'unigram', 'authentic')
auth_freq_df.head()

Unnamed: 0,unigram,authentic_frequency,authentic_norm_freq
206,well,2272,0.012132
25,dont,2199,0.011742
175,im,1988,0.010616
26,know,1777,0.009489
19,mr,1604,0.008565


In [23]:
test_file_path = os.path.join(os.getcwd(), 'data', '21st-century')
raw_test_dict = load_files_to_dict(test_file_path, {})

test_ngram_dict = {}
for script_group in list(raw_test_dict.keys()):
    test_ngram_dict[script_group] = create_ngram_df(raw_test_dict[script_group], 'corpus')

test_freq_dict = {}
for script_group in list(test_ngram_dict.keys()):
    test_freq_dict[script_group] = create_frequencies(test_ngram_dict[script_group]['unigrams'].sum(), 'unigram', script_group)

test_freq_dict['Pan_Am'].head()

Unnamed: 0,unigram,Pan_Am_frequency,Pan_Am_norm_freq
67,im,489,0.015189
114,oh,407,0.012642
11,dont,379,0.011772
50,well,373,0.011586
119,know,323,0.010033


# Compare corpora



The last piece of this homework challenge is to compare the authentic corpus (written regarding the 1960s and penned in the 1960s) to the four test corpora (written regarding the 1960s but not penned until the 21st century).

To compare anything to anything, first I need to combine the different dataframes holding my test corpora with the authentic corpus. I decided to do this by merging the values for the authentic data into each of the dataframes holding the values for the test data.

In [24]:
compare_dict = {}
for script_group in list(test_freq_dict.keys()):
    compare_dict[script_group] = test_freq_dict[script_group].merge(auth_freq_df, on='unigram', how='outer').fillna(0)

In [26]:
compare_dict['Pan_Am'].head()

Unnamed: 0,unigram,Pan_Am_frequency,Pan_Am_norm_freq,authentic_frequency,authentic_norm_freq
0,im,489.0,0.015189,1988.0,0.010616
1,oh,407.0,0.012642,1580.0,0.008437
2,dont,379.0,0.011772,2199.0,0.011742
3,well,373.0,0.011586,2272.0,0.012132
4,know,323.0,0.010033,1777.0,0.009489


The equation I implemented in the previous solution to this homework was:

```
df['norm_freq_ratio'] = df.loc[(df['imitation_norm_freq'] != 0
                               ) & (df['authentic_norm_freq'] != 0), 'imitation_norm_freq'
                              ]/df.loc[(df['imitation_norm_freq'] != 0
                                       ) & (df['authentic_norm_freq'] != 0), 'authentic_norm_freq']
```

In order to implement this in the various dataframes, I'll need a way to identify the appropriate columns, regardless of which dataframe I'm working with. This can be done by looking for 'norm_freq' in the column names - which will pull out the normalized frequency for both the authentic and test data.

In [27]:
[compare_dict['Pan_Am'].columns[compare_dict['Pan_Am'].columns.str.contains('norm_freq')]]

[Index(['Pan_Am_norm_freq', 'authentic_norm_freq'], dtype='object')]

Referencing the dataframe through the dictionary and script group name every time is getting rather tedious, so I'll assign the dataframe I'm working with to its own variable. This has a much cleaner appearance and, more importantly, is easier to read. Code is read far more often than it is written, whether to improve, maintain, update, or repair it. My philosophy is to make code as easy to read as possible, so that my future self can decipher what I was thinking when I wrote it the first time around.

In [28]:
test = compare_dict['Pan_Am']
test_cols = test.columns[test.columns.str.contains('norm_freq')]
test_cols

Index(['Pan_Am_norm_freq', 'authentic_norm_freq'], dtype='object')

Now I can update my code to the more readable version. Since I use the test dataframe as the left object and the authentic dataframe as the right object in the join, I can count on the fact that the test:authentic columns will always be in the same order.

As an added bonus, I only have to write to the dictionary once instead of the initial write, then the update with the new columns.

In [30]:
compare_dict = {}
for script_group in list(test_freq_dict.keys()):
    df = test_freq_dict[script_group].merge(auth_freq_df, on='unigram', how='outer').fillna(0)
    freq_cols = df.columns[df.columns.str.contains('norm_freq')]
    df['norm_freq_ratio'] = df.loc[(df[freq_cols[0]]!=0) & (df[freq_cols[1]]!=0), freq_cols[0]] / df.loc[(df[freq_cols[0]]!=0) & (df[freq_cols[1]]!=0), freq_cols[1]]
    compare_dict[script_group] = df

In [32]:
compare_dict['Pan_Am'].head()

Unnamed: 0,unigram,Pan_Am_frequency,Pan_Am_norm_freq,authentic_frequency,authentic_norm_freq,norm_freq_ratio
0,im,489.0,0.015189,1988.0,0.010616,1.430801
1,oh,407.0,0.012642,1580.0,0.008437,1.498387
2,dont,379.0,0.011772,2199.0,0.011742,1.002538
3,well,373.0,0.011586,2272.0,0.012132,0.954965
4,know,323.0,0.010033,1777.0,0.009489,1.057309
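
The ratio expression repeats the same boolean filter twice. One possible tidy-up, shown here as a sketch on made-up numbers rather than the notebook's data, is to store the filter in a variable and reuse it. The name `both_present` and the column `test_norm_freq` are introduced for illustration.

```python
import pandas as pd

# Made-up normalized frequencies; zeros mimic n-grams missing from one corpus
df = pd.DataFrame({
    'unigram': ['im', 'dean', 'radin'],
    'test_norm_freq': [0.02, 0.003, 0.0],
    'authentic_norm_freq': [0.01, 0.0, 0.002],
})
freq_cols = df.columns[df.columns.str.contains('norm_freq')]

# Build the mask once: True where the n-gram appears in both corpora
both_present = (df[freq_cols[0]] != 0) & (df[freq_cols[1]] != 0)

# Rows outside the mask are left as NaN, as in the original expression
df.loc[both_present, 'norm_freq_ratio'] = (
    df.loc[both_present, freq_cols[0]] / df.loc[both_present, freq_cols[1]]
)
print(df)
```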


## High Ratios

High ratios for the normalized frequency show unigrams that were used commonly in the 21st-century scripts, but were extremely rare (but present) in 1960s scripts.

In [33]:
for script_group in compare_dict.keys():
    print(script_group)
    display(compare_dict[script_group].sort_values('norm_freq_ratio', ascending=False).head(50))
    print('\n')

Pan_Am


Unnamed: 0,unigram,Pan_Am_frequency,Pan_Am_norm_freq,authentic_frequency,authentic_norm_freq,norm_freq_ratio
51,dean,87.0,0.002702,1.0,5e-06,506.064637
18,pan,160.0,0.00497,3.0,1.6e-05,310.231195
162,amanda,32.0,0.000994,1.0,5e-06,186.138717
89,stewardess,54.0,0.001677,2.0,1.1e-05,157.054543
197,teddy,27.0,0.000839,1.0,5e-06,157.054543
281,stewardesses,19.0,0.00059,1.0,5e-06,110.519863
364,ryan,15.0,0.000466,1.0,5e-06,87.252524
456,cia,13.0,0.000404,1.0,5e-06,75.618854
452,ich,13.0,0.000404,1.0,5e-06,75.618854
491,monte,12.0,0.000373,1.0,5e-06,69.802019




Mad_Men


Unnamed: 0,unigram,Mad_Men_frequency,Mad_Men_norm_freq,authentic_frequency,authentic_norm_freq,norm_freq_ratio
138,sterling,170.0,0.001166,2.0,1.1e-05,109.15141
172,sally,143.0,0.000981,2.0,1.1e-05,91.815598
54,draper,365.0,0.002503,6.0,3.2e-05,78.118166
238,jesus,108.0,0.000741,2.0,1.1e-05,69.343249
553,francis,42.0,0.000288,1.0,5e-06,53.933638
317,clients,74.0,0.000507,2.0,1.1e-05,47.512967
187,joan,134.0,0.000919,4.0,2.1e-05,43.018497
195,betty,128.0,0.000878,4.0,2.1e-05,41.092295
435,jimmy,55.0,0.000377,2.0,1.1e-05,35.313691
457,ken,52.0,0.000357,2.0,1.1e-05,33.38749




X-Men_First_Class


Unnamed: 0,unigram,X-Men_First_Class_frequency,X-Men_First_Class_norm_freq,authentic_frequency,authentic_norm_freq,norm_freq_ratio
79,cia,10.0,0.002281,1.0,5e-06,427.076397
171,commands,5.0,0.00114,1.0,5e-06,213.538198
103,cuba,9.0,0.002052,2.0,1.1e-05,192.184379
210,sebastian,4.0,0.000912,1.0,5e-06,170.830559
211,shaws,4.0,0.000912,1.0,5e-06,170.830559
364,x,3.0,0.000684,1.0,5e-06,128.122919
326,presentation,3.0,0.000684,1.0,5e-06,128.122919
284,moscow,3.0,0.000684,1.0,5e-06,128.122919
264,threat,3.0,0.000684,1.0,5e-06,128.122919
370,homo,3.0,0.000684,1.0,5e-06,128.122919




The_Kennedys


Unnamed: 0,unigram,The_Kennedys_frequency,The_Kennedys_norm_freq,authentic_frequency,authentic_norm_freq,norm_freq_ratio
16,bobby,112.0,0.006192,3.0,1.6e-05,386.54975
86,khrushchev,30.0,0.001659,1.0,5e-06,310.620335
103,sighs,25.0,0.001382,1.0,5e-06,258.850279
165,rosemary,18.0,0.000995,1.0,5e-06,186.372201
12,kennedy,128.0,0.007077,9.0,4.8e-05,147.257048
101,cuba,25.0,0.001382,2.0,1.1e-05,129.42514
37,ii,60.0,0.003317,5.0,2.7e-05,124.248134
298,election,11.0,0.000608,1.0,5e-06,113.894123
163,ethel,18.0,0.000995,2.0,1.1e-05,93.186101
399,elected,8.0,0.000442,1.0,5e-06,82.832089






In [34]:
high_scores = []
for script_group in compare_dict.keys():
    top_ratios = compare_dict[script_group].sort_values('norm_freq_ratio', ascending=False).head(50)['norm_freq_ratio']
    high_scores.append({'script': script_group, 'score': top_ratios.sum()})
# DataFrame.append was removed in pandas 2.0, so build the dataframe from a list of rows
high_score_results = pd.DataFrame(high_scores, columns=['script', 'score'])
display(high_score_results.sort_values('score'))
print('Best performing corpus (lowest score) {}'.format(high_score_results.loc[high_score_results['score'].idxmin(), 'script']))
print('Worst performing corpus (highest score) {}'.format(high_score_results.loc[high_score_results['score'].idxmax(), 'script']))

Unnamed: 0,script,score
1,Mad_Men,1456.975643
0,Pan_Am,3336.81119
3,The_Kennedys,3980.829683
2,X-Men_First_Class,4282.152672


Best performing corpus (lowest score) Mad_Men
Worst performing corpus (highest score) X-Men_First_Class


## Low Ratios

Low ratios for the normalized frequency show unigrams that were used commonly in the 1960s scripts, but were rare in the 21st-century scripts.

In [35]:
for script_group in compare_dict.keys():
    print(script_group)
    display(compare_dict[script_group].sort_values('norm_freq_ratio').head(50))
    print('\n')

Pan_Am


Unnamed: 0,unigram,Pan_Am_frequency,Pan_Am_norm_freq,authentic_frequency,authentic_norm_freq,norm_freq_ratio
5337,honey,1.0,3.1e-05,152.0,0.000812,0.038269
3797,imagination,1.0,3.1e-05,138.0,0.000737,0.042151
3761,ship,1.0,3.1e-05,137.0,0.000732,0.042459
3260,human,1.0,3.1e-05,101.0,0.000539,0.057592
3247,major,1.0,3.1e-05,76.0,0.000406,0.076537
4218,machine,1.0,3.1e-05,75.0,0.0004,0.077558
3352,radio,1.0,3.1e-05,74.0,0.000395,0.078606
3118,jerry,1.0,3.1e-05,69.0,0.000368,0.084302
4146,shadow,1.0,3.1e-05,68.0,0.000363,0.085542
3089,martin,1.0,3.1e-05,66.0,0.000352,0.088134




Mad_Men


Unnamed: 0,unigram,Mad_Men_frequency,Mad_Men_norm_freq,authentic_frequency,authentic_norm_freq,norm_freq_ratio
3985,twilight,3.0,2.1e-05,499.0,0.002665,0.00772
3764,zone,4.0,2.7e-05,506.0,0.002702,0.010151
12149,doc,1.0,7e-06,57.0,0.000304,0.022529
3302,captain,4.0,2.7e-05,208.0,0.001111,0.024695
10348,commander,1.0,7e-06,52.0,0.000278,0.024695
9954,emma,1.0,7e-06,48.0,0.000256,0.026753
10925,ace,1.0,7e-06,47.0,0.000251,0.027322
10578,schmidt,1.0,7e-06,46.0,0.000246,0.027916
7970,base,1.0,7e-06,45.0,0.00024,0.028536
4642,sight,3.0,2.1e-05,131.0,0.0007,0.029408




X-Men_First_Class


Unnamed: 0,unigram,X-Men_First_Class_frequency,X-Men_First_Class_norm_freq,authentic_frequency,authentic_norm_freq,norm_freq_ratio
212,mr,4.0,0.000912,1604.0,0.008565,0.106503
1516,boy,1.0,0.000228,311.0,0.001661,0.137324
1532,away,1.0,0.000228,305.0,0.001629,0.140025
874,hear,1.0,0.000228,297.0,0.001586,0.143797
782,long,1.0,0.000228,293.0,0.001565,0.14576
405,old,2.0,0.000456,526.0,0.002809,0.162386
1132,minute,1.0,0.000228,217.0,0.001159,0.196809
986,captain,1.0,0.000228,208.0,0.001111,0.205325
1374,room,1.0,0.000228,201.0,0.001073,0.212476
479,night,2.0,0.000456,394.0,0.002104,0.21679




The_Kennedys


Unnamed: 0,unigram,The_Kennedys_frequency,The_Kennedys_norm_freq,authentic_frequency,authentic_norm_freq,norm_freq_ratio
1774,zone,2.0,0.000111,506.0,0.002702,0.040925
1874,earth,1.0,5.5e-05,194.0,0.001036,0.053371
2646,game,1.0,5.5e-05,123.0,0.000657,0.084179
3250,guess,1.0,5.5e-05,111.0,0.000593,0.093279
3314,kill,1.0,5.5e-05,106.0,0.000566,0.097679
1328,sound,2.0,0.000111,205.0,0.001095,0.101015
2802,hot,1.0,5.5e-05,101.0,0.000539,0.102515
2239,key,1.0,5.5e-05,97.0,0.000518,0.106742
2959,space,1.0,5.5e-05,90.0,0.000481,0.115045
2352,ought,1.0,5.5e-05,88.0,0.00047,0.117659






In [36]:
low_scores = []
for script_group in compare_dict.keys():
    bottom_ratios = compare_dict[script_group].sort_values('norm_freq_ratio').head(50)['norm_freq_ratio']
    low_scores.append({'script': script_group, 'score': bottom_ratios.sum()})
# DataFrame.append was removed in pandas 2.0, so build the dataframe from a list of rows
low_score_results = pd.DataFrame(low_scores, columns=['script', 'score'])
display(low_score_results.sort_values('score', ascending=False))
print('Best performing corpus (highest score) {}'.format(low_score_results.loc[low_score_results['score'].idxmax(), 'script']))
print('Worst performing corpus (lowest score) {}'.format(low_score_results.loc[low_score_results['score'].idxmin(), 'script']))

Unnamed: 0,script,score
2,X-Men_First_Class,13.255571
3,The_Kennedys,7.791826
0,Pan_Am,6.533376
1,Mad_Men,2.309133


Best performing corpus (highest score) X-Men_First_Class
Worst performing corpus (lowest score) Mad_Men


# Ranking

Both sets of normalized frequency ratios capture ways a script departs from the authentic corpus:

- The 50 highest ratios are words that were used frequently in the 21st-century scripts, but were rare in the 1960s
- The 50 lowest ratios are words that were used frequently in the 1960s, but showed up rarely in the 21st-century scripts

In the high ratios set, the higher the ratio, the further the script is from the authentic corpus. In the low ratios set, the higher the ratio, the closer the script is to the authentic corpus. So to get my ranking, I'm going to subtract the sum of the low ratios from the sum of the high ratios for each corpus. The script corpora will then be sorted from lowest (best) to highest (worst) score.

In [37]:
rows = []
for script_group in compare_dict.keys():
    rows.append(
        {'script':script_group,
         'high_ratio':compare_dict[script_group].sort_values('norm_freq_ratio', ascending=False).head(50)['norm_freq_ratio'].sum(),
         'low_ratio':compare_dict[script_group].sort_values('norm_freq_ratio').head(50)['norm_freq_ratio'].sum()
        })
# Build the dataframe once (DataFrame.append was removed in pandas 2.0), then score, sort, and rank
results = pd.DataFrame(rows, columns=['script', 'high_ratio', 'low_ratio'])
results['combined_score'] = results['high_ratio'] - results['low_ratio']
results = results.sort_values('combined_score')
results['rank'] = range(1, 1+len(results))
display(results)

Unnamed: 0,script,high_ratio,low_ratio,combined_score,rank
1,Mad_Men,1456.975643,2.309133,1454.66651,1
0,Pan_Am,3336.81119,6.533376,3330.277814,2
3,The_Kennedys,3980.829683,7.791826,3973.037857,3
2,X-Men_First_Class,4282.152672,13.255571,4268.897101,4


The analysis for the unigrams is now complete. To see the clean code (including improvements to functions) and the results for unigrams, bigrams, and trigrams, see the accompanying notebook.

# Caveats

There are several problems with this exercise and the solution.

## Corpus data processing

The biggest initial problem for me was that punctuation wasn't removed, the n-grams were case-sensitive, and stopwords weren't removed. The first two mean that words aren't counted properly, especially words that appear with varying capitalization and punctuation. For example, in the initial solution I noticed 'daddy' written several ways. Here are several ways 'daddy' could appear in a script:

- Daddy.
- daddy.
- Daddy
- daddy
- Daddy!
- daddy!

That's six variations of a single word, which should all be counted together.
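
As a quick check, lower-casing and stripping punctuation collapses all of the variants listed above into one token. This sketch only uses ASCII punctuation from the standard library; the `punct_tokens` function earlier in this entry removes a few extra characters as well.

```python
import string

variants = ['Daddy.', 'daddy.', 'Daddy', 'daddy', 'Daddy!', 'daddy!']

# Lower-case each token and delete ASCII punctuation
strip_punct = str.maketrans('', '', string.punctuation)
normalized = {token.lower().translate(strip_punct) for token in variants}
print(normalized)
```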

The last point, that stopwords weren't removed, means that there's a lot of meaningless noise: words like 'the', 'a', 'an', 'of', 'for', etc. remain in the analysis.

## Proper nouns

Related to proper counting and stopwords are proper nouns. In a script or novel, the names of the characters in the story will show up a disproportionate amount of the time. With a large enough corpus this becomes moot because names common to the era will naturally show up more than modern names. However, these corpora aren't large enough for character names to average out this way. The same is true for place names: the location where the script is set is more likely to be mentioned.

## Ratio impact

As can be seen in the final results dataframe, the high ratios have a much larger impact on my ranking than the low ratios. This means that including words that were rare in the 1960s has a much bigger impact on the ranking than leaving out words that were common.
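
To put numbers on that imbalance, the sketch below reuses the sums from the final results table and compares each low-ratio sum with its high-ratio sum. The variable names are introduced here for illustration.

```python
# (high_ratio, low_ratio) sums copied from the final results table in this entry
results = {
    'Mad_Men': (1456.975643, 2.309133),
    'Pan_Am': (3336.81119, 6.533376),
    'The_Kennedys': (3980.829683, 7.791826),
    'X-Men_First_Class': (4282.152672, 13.255571),
}

# Share of the high-ratio sum that the low-ratio sum represents
low_share = {name: low / high for name, (high, low) in results.items()}

# Ranking by the combined score versus ranking by the high-ratio sum alone
by_combined = sorted(results, key=lambda name: results[name][0] - results[name][1])
by_high_only = sorted(results, key=lambda name: results[name][0])

for name, share in low_share.items():
    print(f'{name}: low-ratio sum is {share:.2%} of the high-ratio sum')
print(by_combined == by_high_only)
```

For these four corpora every low-ratio sum is well under 1% of the corresponding high-ratio sum, so the ranking would come out the same using the high ratios alone.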

## Repetition

The authentic 1960s corpus includes many, many The Twilight Zone episodes. Most, if not all, of The Twilight Zone episodes start with the same introduction. This means that words like 'traveling', 'another', 'dimension', 'sight', 'sound', 'mind', and 'journey' are disproportionately represented. An improvement to the analysis would be to account for and remove this repetition so that it's only represented once in the frequencies.
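
One way this could be approached is sketched below. It assumes the intro appears at the very start of a script and matches token for token, which real subtitle files may not guarantee. The function name `drop_repeated_intro`, the shortened intro, and the sample scripts are all hypothetical.

```python
def drop_repeated_intro(token_lists, intro_tokens):
    """Keep the shared intro in the first script that has it; strip it elsewhere."""
    n = len(intro_tokens)
    seen_intro = False
    cleaned = []
    for tokens in token_lists:
        if tokens[:n] == intro_tokens:
            if seen_intro:
                tokens = tokens[n:]  # later occurrences are dropped
            seen_intro = True
        cleaned.append(tokens)
    return cleaned

# Hypothetical shortened intro and scripts, for illustration only
intro = ['youre', 'traveling', 'another', 'dimension']
scripts = [
    intro + ['mr', 'radin'],
    intro + ['game', 'pool'],
    ['unlock', 'door', 'key'],
]
cleaned = drop_repeated_intro(scripts, intro)
print(cleaned)
```

Episodes with a different opening are left untouched, and the first occurrence of the intro still contributes once to the frequencies.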