# Gender Classification

To analyze how movie characters talk about male characters versus female characters, we try to classify the gender of the character being spoken about in each sentence of a movie line.

Here is an outline of our approach:
1. Process the `data/movie_lines.txt` file from the Cornell Movie Dialogs Corpus.
    * Perform sentence segmentation on each movie line to determine where the sentence boundaries are for each line
    * Separate each sentence of a movie line onto its own row in a Pandas dataframe.
    * Tokenize each movie line sentence to obtain a list of words for each sentence.
    * For each movie line sentence, determine whether there is a 'he' or 'she' pronoun in it.
2. Write all 23990 sentences with 'he' (and not both 'he' and 'she') to `data_processed/movie_line_sentences_tokenized_he.txt`.
3. Write all 10735 sentences with 'she' (and not both 'he' and 'she') to `data_processed/movie_line_sentences_tokenized_she.txt`.
4. To achieve gender parity, use all 10735 of the 'she' sentences, and randomly choose 10735 of the 'he' sentences to be in the dataset used for this classification task. Write those 10735 'he' sentences to `data_processed/movie_line_sentences_tokenized_he_selected.txt`.
5. Build a vocabulary of all the words in those 10735 'he' sentences and 10735 'she' sentences.
    * Write this vocabulary to `data_processed/movie_line_sentences_vocab.txt`, where each vocab word should appear on its own line.
6. Limit the vocab size to the top 10000 vocab words after removing the 100 most common words, which we treat as stop words.
    * 'he' and 'she' are among these 100 most common words, so this also keeps 'he' and 'she' out of the feature set.
    * 'he' is generally used to describe men and 'she' is generally used to describe women. If these two pronouns were the most important features in our model, the model would tell us little about the other words used to describe male characters versus female characters.
7. For each movie line sentence, obtain a feature vector. Use the counts of each unigram in the sentence as bag-of-words features. Populate a scikit-learn sparse feature matrix with the feature vectors for all 10735 'he' sentences and 10735 'she' sentences. The first 10735 feature vectors of the feature matrix are for the 'he' sentences, and the last 10735 feature vectors are for the 'she' sentences.
8. Create a ground truth list of 0 and 1 labels, where the first 10735 labels are 0's, and the last 10735 labels are 1's.
    * 0 is for 'he'
    * 1 is for 'she'
9. Create a scikit-learn Multinomial Naive Bayes classifier.
10. Perform 10-fold cross validation using scikit-learn's `cross_val_predict` function, passing in the feature matrix, the ground truth label list, and 10 as the number of folds. Write the binary predictions (0 or 1) and prediction probabilities for all 10735 'he' sentences and 10735 'she' sentences to `data_processed/movie_line_sentences_predictions.txt`.
11. Calculate accuracy, precision, recall, and F1 by comparing the binary predictions in `data_processed/movie_line_sentences_predictions.txt` to the ground truth labels in our ground truth label list, matching sentences by index (a sentence has the same index in the prediction file and in the ground truth label list).

## Step 1: Process the `data/movie_lines.txt` file from the Cornell Movie Dialogs Corpus

* Perform sentence segmentation on each movie line to determine where the sentence boundaries are for each line
* Separate each sentence of a movie line onto its own row in a Pandas dataframe.
* Tokenize each movie line sentence to obtain a list of words for each sentence.
* For each movie line sentence, determine whether there is a 'he' or 'she' pronoun in it.

In [1]:
import pandas as pd

In [2]:
# So we can see the entire movie line in the dataframe
# Important for writing movie line sentences to files later in this step!
pd.set_option('display.max_colwidth', None)

In [3]:
# Import movie_lines.txt data as pandas dataframe
movie_lines_features = ["LineID", "Character", "Movie", "Name", "Line"]

In [4]:
# Raw string so the regex escapes in the separator are passed through unchanged
movie_lines = pd.read_csv("data/movie_lines.txt", sep=r"\+\+\+\$\+\+\+", engine="python", encoding='ISO-8859-1', index_col=False, names=movie_lines_features)
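
To make the file format concrete, here is a minimal sketch of how a single raw line splits into the five columns. It assumes the fields are separated by `+++$+++` with one space on each side, which is consistent with the stray space that is stripped from `LineID` later in this step. The helper name `parse_movie_line` and the example line (rebuilt from the first row of the dataframe output) are illustrative and not part of the notebook.

```python
# Assumed separator: "+++$+++" padded with single spaces
SEPARATOR = " +++$+++ "
FIELDS = ["LineID", "Character", "Movie", "Name", "Line"]

def parse_movie_line(raw_line):
    """Split one raw line into a dict keyed by column name (hypothetical helper)."""
    # maxsplit keeps any later separator-like text inside the final "Line" field
    parts = raw_line.rstrip("\n").split(SEPARATOR, maxsplit=len(FIELDS) - 1)
    return dict(zip(FIELDS, parts))

# Example line reconstructed from the first dataframe row shown in this step
example = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
parsed = parse_movie_line(example)
print(parsed)
```

The `read_csv` call splits on the bare `+++$+++` pattern instead, which is why the surrounding spaces end up inside the column values and `LineID` needs `str.strip`.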

In [5]:
movie_lines.head()

Unnamed: 0,LineID,Character,Movie,Name,Line
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,CAMERON,They do to!
2,L985,u0,m0,BIANCA,I hope so.
3,L984,u2,m0,CAMERON,She okay?
4,L925,u0,m0,BIANCA,Let's go.


In [6]:
# Strip the space from "LineID" for further usage
movie_lines["LineID"] = movie_lines["LineID"].apply(str.strip)

# Change the datatype of "Line" to string and lowercase "Line"
movie_lines["Line"] = movie_lines["Line"].apply(str)
movie_lines["Line"] = movie_lines["Line"].apply(str.lower)

In [7]:
movie_lines.head()

Unnamed: 0,LineID,Character,Movie,Name,Line
0,L1045,u0,m0,BIANCA,they do not!
1,L1044,u2,m0,CAMERON,they do to!
2,L985,u0,m0,BIANCA,i hope so.
3,L984,u2,m0,CAMERON,she okay?
4,L925,u0,m0,BIANCA,let's go.


In [8]:
# nltk will be used for sentence segmentation and word tokenization
import nltk
from nltk import sent_tokenize, word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/michellelum/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
# Perform sentence segmentation on each movie line
# Segmented_Line column will contain a list of the sentences in the movie line
movie_lines["Segmented_Line"] = movie_lines["Line"].apply(sent_tokenize)

In [10]:
# The next few cells modify movie_lines dataframe so each movie line sentence is on its own row
df_temp = pd.DataFrame(columns=movie_lines.columns)

In [11]:
df_temp.head()

Unnamed: 0,LineID,Character,Movie,Name,Line,Segmented_Line


In [12]:
# Collect one dict per sentence, then build the dataframe once at the end
# (DataFrame.append was removed in pandas 2.0, and appending row by row is slow)
rows = []
for _, row in movie_lines.iterrows():
    for sentence in row["Segmented_Line"]:
        rows.append({"LineID": row["LineID"], "Character": row["Character"],
                     "Movie": row["Movie"], "Name": row["Name"], "Line": row["Line"],
                     "Segmented_Line": sentence})
df_temp = pd.DataFrame(rows, columns=movie_lines.columns)
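
The one-sentence-per-row reshaping done by the loop above can also be expressed with pandas' `DataFrame.explode`. This is a sketch on a toy dataframe (the `toy` data is made up for illustration), not the code that produced the outputs in this notebook.

```python
import pandas as pd

# Toy stand-in for movie_lines: each row holds a list of sentences
toy = pd.DataFrame({
    "LineID": ["L1", "L2"],
    "Segmented_Line": [["where did he go?", "he was just here."], ["she okay?"]],
})

# explode() repeats the other columns once per list element;
# ignore_index=True renumbers the rows 0..n-1
one_sentence_per_row = toy.explode("Segmented_Line", ignore_index=True)
print(one_sentence_per_row)
```

One difference to keep in mind: for a row whose list is empty, `explode` keeps the row with a missing value, whereas the loop drops it.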

In [13]:
movie_lines = df_temp

In [15]:
movie_lines.head()

Unnamed: 0,LineID,Character,Movie,Name,Line,Segmented_Line
0,L1045,u0,m0,BIANCA,they do not!,they do not!
1,L1044,u2,m0,CAMERON,they do to!,they do to!
2,L985,u0,m0,BIANCA,i hope so.,i hope so.
3,L984,u2,m0,CAMERON,she okay?,she okay?
4,L925,u0,m0,BIANCA,let's go.,let's go.


In [16]:
movie_lines.shape

(510511, 6)

In [17]:
# Tokenize each movie line sentence
# Tokenized_Line column will contain a list of the words in the movie line sentence in that row
movie_lines["Tokenized_Line"] = movie_lines["Segmented_Line"].apply(word_tokenize)

In [18]:
movie_lines.head()

Unnamed: 0,LineID,Character,Movie,Name,Line,Segmented_Line,Tokenized_Line
0,L1045,u0,m0,BIANCA,they do not!,they do not!,"[they, do, not, !]"
1,L1044,u2,m0,CAMERON,they do to!,they do to!,"[they, do, to, !]"
2,L985,u0,m0,BIANCA,i hope so.,i hope so.,"[i, hope, so, .]"
3,L984,u2,m0,CAMERON,she okay?,she okay?,"[she, okay, ?]"
4,L925,u0,m0,BIANCA,let's go.,let's go.,"[let, 's, go, .]"


In [19]:
movie_lines.shape

(510511, 7)

In [20]:
# Function that takes in a tokenized_line as a list of words
# Returns 'he & she' if both 'he' and 'she' are in tokenized_line
# Returns 'he' if only 'he' is in tokenized_line
# Returns 'she' if only 'she' is in tokenized_line
# Returns 'none' if neither 'he' nor 'she' is in tokenized_line
def get_pronoun(tokenized_line):
    if "he" in tokenized_line and "she" in tokenized_line:
        return "he & she"
    elif "he" in tokenized_line:
        return "he"
    elif "she" in tokenized_line:
        return "she"
    else:
        return "none"
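
A short sketch of why the check runs on the token list and not on the raw sentence string: a substring test would find 'he' inside words such as 'they'. The function below follows the same decision table as `get_pronoun`, rewritten with a set lookup; the example sentence is the first row of the dataframe.

```python
def get_pronoun(tokenized_line):
    """Same four outcomes as the notebook's get_pronoun, using a set lookup."""
    tokens = set(tokenized_line)
    has_he, has_she = "he" in tokens, "she" in tokens
    if has_he and has_she:
        return "he & she"
    if has_he:
        return "he"
    if has_she:
        return "she"
    return "none"

sentence = "they do not!"
tokens = ["they", "do", "not", "!"]

substring_match = "he" in sentence   # substring test matches the "he" inside "they"
token_match = get_pronoun(tokens)    # token test looks for the whole word only

print(substring_match, token_match)
```

Because the lines were lowercased earlier in this step, a sentence-initial 'He' or 'She' is also caught by the token test.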

In [21]:
# Find 'he' or 'she' pronoun(s) in each movie line sentence
# Pronoun column will contain
# 'he & she' if both 'he' and 'she' are in tokenized_line
# 'he' if only 'he' is in tokenized_line
# 'she' if only 'she' is in tokenized_line
# 'none' if neither 'he' nor 'she' is in tokenized_line
movie_lines["Pronoun"] = movie_lines["Tokenized_Line"].apply(get_pronoun)

In [22]:
movie_lines.head()

Unnamed: 0,LineID,Character,Movie,Name,Line,Segmented_Line,Tokenized_Line,Pronoun
0,L1045,u0,m0,BIANCA,they do not!,they do not!,"[they, do, not, !]",none
1,L1044,u2,m0,CAMERON,they do to!,they do to!,"[they, do, to, !]",none
2,L985,u0,m0,BIANCA,i hope so.,i hope so.,"[i, hope, so, .]",none
3,L984,u2,m0,CAMERON,she okay?,she okay?,"[she, okay, ?]",she
4,L925,u0,m0,BIANCA,let's go.,let's go.,"[let, 's, go, .]",none


In [23]:
movie_lines.shape

(510511, 8)

In [26]:
# note from michelle: this doesn't seem to work?

# Write the whole movie_lines dataframe to a txt file
# with open('data_processed/movie_lines_df.txt', 'w') as f:
#    movie_lines_string = movie_lines.to_string(header=False, index=False)
#    f.write(movie_lines_string)

In [24]:
# Write each movie line sentence to its own line in movie_line_sentences.txt
with open('data_processed/movie_line_sentences.txt', 'w') as f:
    movie_lines_string = movie_lines['Segmented_Line'].to_string(header=False, index=False)
    f.write(movie_lines_string)

In [25]:
# Write each tokenized movie line sentence to its own line in movie_line_sentences_tokenized.txt
with open('data_processed/movie_line_sentences_tokenized.txt', 'w') as f:
    movie_lines_string = movie_lines['Tokenized_Line'].to_string(header=False, index=False)
    f.write(movie_lines_string)

In [27]:
# Drop all rows where the movie line sentence does not contain either 'he' or 'she'
movie_lines = movie_lines.loc[movie_lines["Pronoun"] != "none"]

In [28]:
movie_lines.head()

Unnamed: 0,LineID,Character,Movie,Name,Line,Segmented_Line,Tokenized_Line,Pronoun
3,L984,u2,m0,CAMERON,she okay?,she okay?,"[she, okay, ?]",she
38,L407,u0,m0,BIANCA,who knows? all i've ever heard her say is that she'd dip before dating a guy that smokes.,all i've ever heard her say is that she'd dip before dating a guy that smokes.,"[all, i, 've, ever, heard, her, say, is, that, she, 'd, dip, before, dating, a, guy, that, smokes, .]",she
39,L406,u2,m0,CAMERON,so that's the kind of guy she likes? pretty ones?,so that's the kind of guy she likes?,"[so, that, 's, the, kind, of, guy, she, likes, ?]",she
43,L405,u0,m0,BIANCA,"lesbian? no. i found a picture of jared leto in one of her drawers, so i'm pretty sure she's not harboring same-sex tendencies.","i found a picture of jared leto in one of her drawers, so i'm pretty sure she's not harboring same-sex tendencies.","[i, found, a, picture, of, jared, leto, in, one, of, her, drawers, ,, so, i, 'm, pretty, sure, she, 's, not, harboring, same-sex, tendencies, .]",she
44,L404,u2,m0,CAMERON,she's not a...,she's not a...,"[she, 's, not, a, ...]",she


In [29]:
movie_lines.shape

(34941, 8)

In [31]:
# Write each movie line sentence with 'he' or 'she' or both to its own line in movie_line_sentences_he_she_both.txt
with open('data_processed/movie_line_sentences_he_she_both.txt', 'w') as f:
    movie_lines_string = movie_lines['Segmented_Line'].to_string(header=False, index=False)
    f.write(movie_lines_string)

In [30]:
# Write each tokenized movie line sentence with 'he' or 'she' or both to its own line in movie_line_sentences_tokenized_he_she_both.txt
with open('data_processed/movie_line_sentences_tokenized_he_she_both.txt', 'w') as f:
    movie_lines_string = movie_lines['Tokenized_Line'].to_string(header=False, index=False)
    f.write(movie_lines_string)

In [32]:
# Drop all rows where the movie line sentence contains both 'he' and 'she'
movie_lines = movie_lines.loc[movie_lines["Pronoun"] != "he & she"]

In [33]:
movie_lines.head()

Unnamed: 0,LineID,Character,Movie,Name,Line,Segmented_Line,Tokenized_Line,Pronoun
3,L984,u2,m0,CAMERON,she okay?,she okay?,"[she, okay, ?]",she
38,L407,u0,m0,BIANCA,who knows? all i've ever heard her say is that she'd dip before dating a guy that smokes.,all i've ever heard her say is that she'd dip before dating a guy that smokes.,"[all, i, 've, ever, heard, her, say, is, that, she, 'd, dip, before, dating, a, guy, that, smokes, .]",she
39,L406,u2,m0,CAMERON,so that's the kind of guy she likes? pretty ones?,so that's the kind of guy she likes?,"[so, that, 's, the, kind, of, guy, she, likes, ?]",she
43,L405,u0,m0,BIANCA,"lesbian? no. i found a picture of jared leto in one of her drawers, so i'm pretty sure she's not harboring same-sex tendencies.","i found a picture of jared leto in one of her drawers, so i'm pretty sure she's not harboring same-sex tendencies.","[i, found, a, picture, of, jared, leto, in, one, of, her, drawers, ,, so, i, 'm, pretty, sure, she, 's, not, harboring, same-sex, tendencies, .]",she
44,L404,u2,m0,CAMERON,she's not a...,she's not a...,"[she, 's, not, a, ...]",she


In [34]:
movie_lines.shape

(34725, 8)

In [36]:
# Write each movie line sentence with 'he' or 'she' (but not both) to its own line in movie_line_sentences_he_she.txt
with open('data_processed/movie_line_sentences_he_she.txt', 'w') as f:
    movie_lines_string = movie_lines['Segmented_Line'].to_string(header=False, index=False)
    f.write(movie_lines_string)

In [35]:
# Write each tokenized movie line sentence with 'he' or 'she' (but not both) to its own line in movie_line_sentences_tokenized_he_she.txt
with open('data_processed/movie_line_sentences_tokenized_he_she.txt', 'w') as f:
    movie_lines_string = movie_lines['Tokenized_Line'].to_string(header=False, index=False)
    f.write(movie_lines_string)

## Step 2: Write all 23990 sentences with 'he' (and not both 'he' and 'she') to `data_processed/movie_line_sentences_tokenized_he.txt`

In [37]:
# Create new movie_lines_he dataframe containing only movie line sentences with the pronoun 'he' (not both 'he' and 'she')
movie_lines_he = movie_lines.loc[movie_lines["Pronoun"] == "he"]

In [39]:
movie_lines_he.head()

Unnamed: 0,LineID,Character,Movie,Name,Line,Segmented_Line,Tokenized_Line,Pronoun
110,L597,u0,m0,BIANCA,combination. i don't know -- i thought he'd be different. more of a gentleman...,i don't know -- i thought he'd be different.,"[i, do, n't, know, --, i, thought, he, 'd, be, different, .]",he
112,L596,u3,m0,CHASTITY,is he oily or dry?,is he oily or dry?,"[is, he, oily, or, dry, ?]",he
113,L595,u0,m0,BIANCA,"he practically proposed when he found out we had the same dermatologist. i mean. dr. bonchowski is great an all, but he's not exactly relevant party conversation.",he practically proposed when he found out we had the same dermatologist.,"[he, practically, proposed, when, he, found, out, we, had, the, same, dermatologist, .]",he
115,L595,u0,m0,BIANCA,"he practically proposed when he found out we had the same dermatologist. i mean. dr. bonchowski is great an all, but he's not exactly relevant party conversation.","dr. bonchowski is great an all, but he's not exactly relevant party conversation.","[dr., bonchowski, is, great, an, all, ,, but, he, 's, not, exactly, relevant, party, conversation, .]",he
120,L571,u0,m0,BIANCA,where did he go? he was just here.,where did he go?,"[where, did, he, go, ?]",he


In [38]:
movie_lines_he.shape

(23990, 8)

In [41]:
# Write each movie line sentence with 'he' (not both 'he' and 'she') to its own line in movie_line_sentences_he.txt
with open('data_processed/movie_line_sentences_he.txt', 'w') as f:
    movie_lines_he_string = movie_lines_he['Segmented_Line'].to_string(header=False, index=False)
    f.write(movie_lines_he_string)

In [40]:
# Write each tokenized movie line sentence with 'he' (not both 'he' and 'she') to its own line in movie_line_sentences_tokenized_he.txt
with open('data_processed/movie_line_sentences_tokenized_he.txt', 'w') as f:
    movie_lines_he_string = movie_lines_he['Tokenized_Line'].to_string(header=False, index=False)
    f.write(movie_lines_he_string)

## Step 3: Write all 10735 sentences with 'she' (and not both 'he' and 'she') to `data_processed/movie_line_sentences_tokenized_she.txt`

In [42]:
# Create new movie_lines_she dataframe containing only movie line sentences with the pronoun 'she' (not both 'he' and 'she')
movie_lines_she = movie_lines.loc[movie_lines["Pronoun"] == "she"]

In [43]:
movie_lines_she.head()

Unnamed: 0,LineID,Character,Movie,Name,Line,Segmented_Line,Tokenized_Line,Pronoun
3,L984,u2,m0,CAMERON,she okay?,she okay?,"[she, okay, ?]",she
38,L407,u0,m0,BIANCA,who knows? all i've ever heard her say is that she'd dip before dating a guy that smokes.,all i've ever heard her say is that she'd dip before dating a guy that smokes.,"[all, i, 've, ever, heard, her, say, is, that, she, 'd, dip, before, dating, a, guy, that, smokes, .]",she
39,L406,u2,m0,CAMERON,so that's the kind of guy she likes? pretty ones?,so that's the kind of guy she likes?,"[so, that, 's, the, kind, of, guy, she, likes, ?]",she
43,L405,u0,m0,BIANCA,"lesbian? no. i found a picture of jared leto in one of her drawers, so i'm pretty sure she's not harboring same-sex tendencies.","i found a picture of jared leto in one of her drawers, so i'm pretty sure she's not harboring same-sex tendencies.","[i, found, a, picture, of, jared, leto, in, one, of, her, drawers, ,, so, i, 'm, pretty, sure, she, 's, not, harboring, same-sex, tendencies, .]",she
44,L404,u2,m0,CAMERON,she's not a...,she's not a...,"[she, 's, not, a, ...]",she


In [44]:
movie_lines_she.shape

(10735, 8)

In [45]:
# Write each movie line sentence with 'she' (not both 'he' and 'she') to its own line in movie_line_sentences_she.txt
with open('data_processed/movie_line_sentences_she.txt', 'w') as f:
    movie_lines_she_string = movie_lines_she['Segmented_Line'].to_string(header=False, index=False)
    f.write(movie_lines_she_string)

In [46]:
# Write each tokenized movie line sentence with 'she' (not both 'he' and 'she') to its own line in movie_line_sentences_tokenized_she.txt
with open('data_processed/movie_line_sentences_tokenized_she.txt', 'w') as f:
    movie_lines_she_string = movie_lines_she['Tokenized_Line'].to_string(header=False, index=False)
    f.write(movie_lines_she_string)

## Step 4: To achieve gender parity, use all 10735 of the 'she' sentences, and randomly choose 10735 of the 'he' sentences to be in the dataset for this classification task.

Write those 10735 'he' sentences to `data_processed/movie_line_sentences_tokenized_he_selected.txt`.

In [8]:
# Check number of movie line sentences with 'he'
with open('data_processed/movie_line_sentences_he.txt') as f:
    count = sum(1 for _ in f)

print(count, "lines with 'he'")

23990 lines with 'he'


In [7]:
with open('data_processed/movie_line_sentences_she.txt') as f:
    count = sum(1 for _ in f)

print(count, "lines with 'she'")

10735 lines with 'she'


In [40]:
# Take random sample of 10,735 'he' sentences to ensure gender parity
# In the future, always work with this same smaller dataset of 'he' sentences,
# which we write to data_processed/movie_line_sentences_tokenized_he_selected.txt
import random

# Choose 10735 distinct random indices between 0 and 23989, inclusive,
# since 23990 is the total number of 'he' sentences, while we only have 10735 'she' sentences.
random_indices = set(random.sample(range(23990), 10735))

print(len(random_indices))

# Write the 'he' sentences at these 10,735 randomly selected indices to a new txt file
# enumerate() yields 0-based line indices, matching the range the indices were drawn from
with open('data_processed/movie_line_sentences_tokenized_he.txt') as f1:
    with open('data_processed/movie_line_sentences_tokenized_he_selected.txt', 'w') as f2:
        for index, line in enumerate(f1):
            if index in random_indices:
                f2.write(line)

10735
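
Since later steps always reuse the same selected 'he' sentences, it can help to make the selection repeatable. Below is a minimal sketch using a seeded `random.Random` instance; the helper name `choose_indices` and the seed value are assumptions for illustration, and the notebook itself does not set a seed.

```python
import random

def choose_indices(population_size, sample_size, seed):
    """Return sample_size distinct 0-based indices, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    return sorted(rng.sample(range(population_size), sample_size))

# 23990 'he' sentences in total, 10735 needed to match the 'she' sentences
selected = choose_indices(23990, 10735, seed=0)
print(len(selected), selected[0] >= 0, selected[-1] <= 23989)
```

Calling `choose_indices` again with the same seed returns the same indices, so the selected file could be regenerated if it were lost.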


## Step 5: Build a vocabulary of all the words in the 10735 'he' sentences and 10735 'she' sentences.

* Write this vocabulary to `data_processed/movie_line_sentences_vocab.txt`, where each vocab word should appear on its own line.

In [21]:
# Build vocabulary from all the 'he' and 'she' sentences
from collections import Counter

vocab = Counter()

with open('data_processed/movie_line_sentences_tokenized_he_selected.txt', 'r') as f:
    he_lines = f.readlines()

with open('data_processed/movie_line_sentences_tokenized_she.txt', 'r') as f:
    she_lines = f.readlines()

for line in he_lines:
    line = line.strip()
    line = line.strip('[]')     # remove brackets (present due to the python word tokens list data structure)
    for word in line.split():
        if word[-1] == ',':
            word = word[:-1]    # remove comma after the word (the comma acted as a delimiter for the list)
        vocab[word] += 1

for line in she_lines:
    line = line.strip()
    line = line.strip('[]')     # remove brackets (present due to the python word tokens list data structure)
    for word in line.split():
        if word[-1] == ',':
            word = word[:-1]    # remove comma after the word (the comma acted as a delimiter for the list)
        vocab[word] += 1

print(len(vocab))

# Write vocabulary to txt file, with the most common vocab words appearing first
with open('data_processed/movie_line_sentences_vocab.txt', 'w') as f:
    for word, freq in vocab.most_common():
        f.write(word + "\n")

12901
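
The bracket-and-comma cleanup above is repeated for both files and again in Step 7. As a sketch, the same logic can be pulled into one helper (the name `parse_token_line` is hypothetical) and checked against lines in the format written in Step 1:

```python
from collections import Counter

def parse_token_line(line):
    """Recover the tokens from a line such as "[let, 's, go, .]" (same cleanup as the loops above)."""
    line = line.strip().strip("[]")   # drop the list brackets
    tokens = []
    for word in line.split():
        if word.endswith(","):
            word = word[:-1]          # drop the comma that delimited the list items
        tokens.append(word)
    return tokens

simple = parse_token_line("[let, 's, go, .]\n")
with_comma_token = parse_token_line("[her, drawers, ,, so]\n")
# Edge case of this format: a bare "," token in last position loses its only character
trailing_comma_token = parse_token_line("[well, ,]\n")

toy_vocab = Counter(simple + with_comma_token)
print(simple, with_comma_token, trailing_comma_token)
```

A comma token in the middle of a sentence survives because it is written as `,,` and only the delimiter is removed. `strip('[]')` would also remove a literal bracket token at the very start or end of a sentence.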


In [22]:
print(vocab.most_common(105))

[('.', 15409), ('he', 12136), ('she', 11956), (',', 9696), ("'s", 7701), ('the', 5426), ('to', 4873), ('a', 4704), ('i', 4180), ('you', 4022), ('and', 3831), ('?', 3676), ("n't", 3280), ('...', 3015), ('was', 2999), ('in', 2374), ('it', 2324), ('me', 2240), ('is', 2168), ('of', 2086), ('that', 2080), ('her', 1912), ('--', 1660), ('did', 1501), ('what', 1453), ('do', 1347), ('but', 1342), ('for', 1301), ('be', 1254), ('with', 1252), ('know', 1215), ('if', 1121), ('his', 1098), ('on', 1088), ('him', 1054), ('!', 1032), ('my', 991), ("'ll", 974), ('out', 908), ('not', 900), ('got', 890), ('like', 887), ('does', 886), ('just', 870), ('have', 864), ('this', 806), ('all', 803), ('so', 792), ('when', 789), ('up', 752), ('said', 745), ('at', 730), ('think', 729), ('had', 709), ('about', 678), ('has', 657), ('here', 656), ("'d", 653), ('we', 651), ("'", 649), ('your', 632), ('there', 629), ('as', 615), ('one', 600), ('no', 568), ('would', 567), ('how', 566), ('where', 542), ('well', 540), ('can

## Step 6: Limit the vocab size to the top 10000 vocab words after removing the 100 most common stop words.

* 'he' and 'she' are among these 100 most common words, so this also keeps 'he' and 'she' out of the feature set.
* 'he' is generally used to describe men and 'she' is generally used to describe women. If these two pronouns were the most important features in our model, the model would tell us little about the other words used to describe male characters versus female characters.

In [23]:
# Limit the vocab size to the top 10000 vocab words after removing the 100 most common words as stop words
# 'he' and 'she' are included in the 100 most common stop words, so we also don't use 'he' or 'she' as features
from itertools import islice

vocab_file = 'data_processed/movie_line_sentences_vocab.txt'
vocab_size = 10000
num_stop_words = 100

start_index = num_stop_words
end_index = start_index + vocab_size

with open(vocab_file) as f:
    vocab_words = [w.strip() for w in islice(f, start_index, end_index)]
    vocab_words_to_indices = dict([(w, i) for (i, w) in enumerate(vocab_words)])
        
print(len(vocab_words))
print(vocab_words[:5]) # see first 5 most common vocab words (these should be the last 5 words printed out in the last cell)
print(len(vocab_words_to_indices))

10000
['``', 'will', 'says', 'time', "''"]
10000
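
To illustrate how the `islice` window works, here is a small sketch on a toy frequency-ranked word list. The list and the sizes are made up for the example; the real cell uses 100 stop words and a vocab size of 10000.

```python
from itertools import islice

# Toy vocab file contents, most frequent word first
ranked_words = [".", "he", "she", ",", "the", "will", "says", "time"]

num_stop_words = 5   # skip the 5 most frequent entries
vocab_size = 2       # then keep the next 2

start_index = num_stop_words
end_index = start_index + vocab_size

# islice(iterable, start, stop) yields items at positions start .. stop-1
kept = list(islice(ranked_words, start_index, end_index))
word_to_index = {w: i for i, w in enumerate(kept)}

print(kept, word_to_index)
```

The feature indices start again at 0 for the first kept word, so they are positions within the trimmed vocab, not ranks in the full frequency list.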


## Step 7: For each movie line sentence, obtain a feature vector.

* Use the counts of each unigram in the sentence as bag-of-words features.
* Populate a scikit-learn sparse feature matrix with the feature vectors for all 10735 'he' sentences and 10735 'she' sentences.
* The first 10735 feature vectors of the feature matrix are for the 'he' sentences, and the last 10735 feature vectors are for the 'she' sentences.

In [40]:
# Create a feature matrix consisting of a feature vector for each sentence
# Use counts of unigrams as bag-of-words features
from collections import Counter
from scipy import sparse
from scipy.sparse import vstack

class_size = 10735    # 10735 'he' sentences and 10735 'she' sentences
num_features = 10000  # vocab size of 10000

def get_features(movie_line_sentence, vocab_dict):
    """
    takes in a movie line sentence
    and also a dictionary mapping vocab words (which are the features) to their feature indices
    returns a feature vector of bag-of-words features (counts of unigrams) for the sentence
    """
    # create counter of the words in the movie line sentence
    word_counter = Counter()
    movie_line_sentence = movie_line_sentence.strip()
    movie_line_sentence = movie_line_sentence.strip('[]')   # remove brackets (present due to the python word tokens list data structure)
    for word in movie_line_sentence.split():
        if word[-1] == ',':
            word = word[:-1]     # remove comma after the word (the comma acted as a delimiter for the list)
        word_counter[word] += 1
    
    # here, the features are the counts of unigrams (single words) in the sentence
    # we're populating a sparse matrix, so only add non-zero counts to our features list
    features = []
    for word in word_counter:
        # only include counts for words in our vocab
        if word in vocab_dict:
            word_feature_index = vocab_dict[word]
            word_count = word_counter[word]
            features.append((word_feature_index, word_count))   # keep the vocab word index so the count lands in the right column of the feature matrix
    return features

def get_feature_matrix(movie_lines_file, vocab_dict):
    """
    takes in a txt file of movie line sentences
    and also a dictionary mapping vocab words (which are the features) to their feature indices
    returns a sparse lil matrix as the feature matrix for all the sentences in the movie line sentence file
    """
    X = sparse.lil_matrix((class_size, num_features), dtype='uint8')
    with open(movie_lines_file, 'r') as f:
        lines = f.readlines()
        line_index = 0
        for line in lines:
            for feature_index, value in get_features(line, vocab_dict):
                # add each count to the feature vector at the column given by its vocab word index
                X[line_index,feature_index] = value
            line_index += 1
    return X

feature_matrix_he = get_feature_matrix('data_processed/movie_line_sentences_tokenized_he_selected.txt', vocab_words_to_indices)
feature_matrix_she = get_feature_matrix('data_processed/movie_line_sentences_tokenized_she.txt', vocab_words_to_indices)

# Concatenate the two matrices.
# Put the 'he' sentence feature vectors first (in the first half of the feature matrix),
# and then the 'she' sentence feature vectors after (in the second half of the feature matrix).
feature_matrix = vstack([feature_matrix_he, feature_matrix_she]).toarray()

print(feature_matrix.shape)

(21470, 10000)
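
As a worked example of the bag-of-words features, the sketch below builds the feature vector for one toy sentence as a plain dense list. The three-word vocab and the helper name `sentence_to_dense_vector` are illustrative; the notebook stores the same counts in a sparse matrix.

```python
from collections import Counter

def sentence_to_dense_vector(tokens, vocab_dict):
    """Count each in-vocab token and place the count at its feature index."""
    vector = [0] * len(vocab_dict)
    for word, count in Counter(tokens).items():
        if word in vocab_dict:          # out-of-vocab words (including stop words) are ignored
            vector[vocab_dict[word]] = count
    return vector

toy_vocab = {"will": 0, "says": 1, "time": 2}
tokens = ["he", "says", "time", "after", "time"]

vector = sentence_to_dense_vector(tokens, toy_vocab)
print(vector)
```

'he' and 'after' contribute nothing because they are not in the toy vocab, which mirrors how the stop-word cutoff in Step 6 keeps the pronouns out of the features.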


## Step 8: Create a ground truth list of 0 and 1 labels, where the first 10735 labels are 0's, and the last 10735 labels are 1's.
* 0 is for 'he'
* 1 is for 'she'

In [27]:
# Mapping for binary labels — 0 is for 'he', and 1 is for 'she'
labels = {0: 'he', 1: 'she'}

class_size = 10735 # 10735 'he' sentences and 10735 'she' sentences

# A list of ground truth binary labels for the two classes ('he' and 'she') 
# We'll put the 'he' sentences first, and then the 'she' sentences after,
# since that's the same as the order in which our feature vectors are stored in the feature matrix.
data_labels = [0] * class_size + [1] * class_size
# print(data_labels)

## Step 9: Create a scikit-learn Multinomial Naive Bayes classifier.

## Step 10: Perform 10-fold cross validation.
* Use scikit-learn's `cross_val_predict` function, passing in the feature matrix, the ground truth label list, and 10 as the number of folds.
* Write the binary predictions (0 or 1) and prediction probabilities for all 10735 'he' sentences and 10735 'she' sentences to `data_processed/movie_line_sentences_predictions.txt`.


In [35]:
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB

# Create Multinomial Naive Bayes classifier using scikit learn
classifier = MultinomialNB()

# Perform 10-fold cross validation on the training data,
# getting predictions (and probabilities) for each instance in the training set
num_folds = 10
test_predictions = cross_val_predict(classifier, feature_matrix, data_labels, cv=num_folds, method='predict')
test_probabilities = cross_val_predict(classifier, feature_matrix, data_labels, cv=num_folds, method='predict_proba')

# Write predictions and prediction probabilities to txt file
# Each movie line sentence prediction is on its own line, with a space separating the predicted label from the prediction probability
with open('data_processed/movie_line_sentences_predictions.txt', 'w') as f:
    for i in range(len(test_predictions)):
        f.write(str(test_predictions[i]) + " " + str(max(test_probabilities[i])) + "\n")
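
Because the labels are sorted (all 0's, then all 1's), the way the folds are drawn matters. According to the scikit-learn documentation, an integer `cv` with a classifier and binary labels uses `StratifiedKFold`, so every test fold should contain both classes. The sketch below checks this on a toy label list with the same sorted layout; the sizes and the placeholder feature matrix are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels laid out like data_labels: first half 0 ('he'), second half 1 ('she')
toy_labels = np.array([0] * 10 + [1] * 10)
toy_features = np.zeros((len(toy_labels), 1))   # placeholder, contents do not affect the split

splitter = StratifiedKFold(n_splits=5)
fold_class_counts = []
for train_idx, test_idx in splitter.split(toy_features, toy_labels):
    fold_labels = toy_labels[test_idx]
    # (number of 'he' test sentences, number of 'she' test sentences) in this fold
    fold_class_counts.append((int((fold_labels == 0).sum()), int((fold_labels == 1).sum())))

print(fold_class_counts)
```

With a plain unshuffled k-fold split on sorted labels, each test fold would hold only one class, so stratification is what keeps the per-fold evaluation meaningful here.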

## Step 11: Calculate accuracy, precision, recall, and F1
* Compare the binary predictions in `data_processed/movie_line_sentences_predictions.txt` to the ground truth labels in our ground truth label list (`data_labels`), matching sentences by index (a sentence has the same index in the prediction file and in the ground truth label list).

In [39]:
import warnings
from collections import Counter

ground_truth = data_labels
with open('data_processed/movie_line_sentences_predictions.txt') as f:
    c = Counter()
    sentence_num = 0
    for line in f: 
        values = line.rstrip('\n').split()
        prediction = int(values[0])  # the 0's and 1's will be read in as strings, so need to convert to ints
        
        c[(prediction, ground_truth[sentence_num])] += 1
        sentence_num += 1

    if sum(c.values()) < len(ground_truth):
        warnings.warn("Missing {} predictions".format(len(ground_truth) - sum(c.values())), UserWarning)

    # treat 'she' (1) as the positive class, and 'he' (0) as the negative class
    tp = c[(1, 1)] # predicted 'she', and sentence was actually referring to 'she' 
    tn = c[(0, 0)] # predicted 'he', and sentence was actually referring to 'he'
    fp = c[(1, 0)] # predicted 'she', and sentence was actually referring to 'he' 
    fn = c[(0, 1)] # predicted 'he', and sentence was actually referring to 'she' 
    
    accuracy  = (tp + tn) / sum(c.values())
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    results = {"truePositives": tp, "trueNegatives": tn, "falsePositives": fp, "falseNegatives": fn,
               "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
    print(results)

{'truePositives': 7073, 'trueNegatives': 4664, 'falsePositives': 6071, 'falseNegatives': 3662, 'accuracy': 0.5466697717745692, 'precision': 0.5381162507608034, 'recall': 0.6588728458313926, 'f1': 0.5924033669751665}
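
As a sanity check, the metrics can be recomputed directly from the four counts printed above. Since the two classes are balanced (10735 sentences each), always predicting one class would give an accuracy of 0.5, so the accuracy here is only modestly above that baseline.

```python
# Confusion counts copied from the results printed above ('she' is the positive class)
tp, tn, fp, fn = 7073, 4664, 6071, 3662

total = tp + tn + fp + fn              # should equal 2 * 10735 sentences
accuracy = (tp + tn) / total
precision = tp / (tp + fp)             # of the sentences predicted 'she', how many were 'she'
recall = tp / (tp + fn)                # of the true 'she' sentences, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(total, round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

The gap between recall and precision, together with the 6071 false positives, indicates that the classifier predicts 'she' more often than 'he'.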


## Next steps?
* The dataset has a lot of very short sentences, which produce very sparse feature vectors (vectors with lots of 0's). This may be why the classifier has difficulty labeling sentences as being about a 'he' or a 'she'. Let's try restricting the dataset further to sentences that have at least some minimum number of words.

* Find the most important features used by our classifier and analyze them:
    - https://stackoverflow.com/questions/50526898/how-to-get-feature-importance-in-naive-bayes
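
One possible way to approach the feature analysis, sketched on toy data: a fitted `MultinomialNB` exposes `feature_log_prob_`, the log probability of each feature given each class, and the difference between the two class rows ranks words by how strongly they lean towards 'she' or 'he'. The four-word vocab and the count matrix below are invented for illustration and say nothing about the real dataset.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical vocab and unigram counts; rows are sentences, columns follow toy_vocab
toy_vocab = ["pretty", "strong", "dress", "car"]
X = np.array([
    [0, 3, 0, 1],   # 'he' sentence
    [0, 1, 0, 1],   # 'he' sentence
    [2, 0, 1, 0],   # 'she' sentence
    [3, 0, 1, 0],   # 'she' sentence
])
y = [0, 0, 1, 1]    # 0 is for 'he', 1 is for 'she', as in Step 8

classifier = MultinomialNB().fit(X, y)

# Row 0 holds log P(word | 'he'), row 1 holds log P(word | 'she')
log_ratio = classifier.feature_log_prob_[1] - classifier.feature_log_prob_[0]

# Most 'she'-leaning words first, most 'he'-leaning words last
ranked = [toy_vocab[i] for i in np.argsort(-log_ratio)]
print(ranked)
```

On the real data, the same ranking could be applied to `vocab_words` after fitting the classifier on the full feature matrix, assuming the column order still matches `vocab_words_to_indices`.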