# Word2Vec on the DocSouth Corpus

This script creates the word embedding model on the full DocSouth corpus, as well as the 40 models trained on random resamples of the corpus, which are used to construct the confidence intervals in the paper "Leveraging the Alignment between Machine Learning and Intersectionality: Using Word Embeddings to Measure Intersectional Experiences of the Nineteenth Century U.S. South."

## 0. Prep

In [11]:
#import the necessary libraries

#Data Wrangling
import pandas
import numpy as np
import string
import os
from nltk.tokenize import word_tokenize, sent_tokenize
from random import choices

import gensim #library needed for word2vec


## Define some functions

In [4]:
def get_path(pathname):
    # Return the full path of every file in the directory `pathname`
    allFiles = os.listdir(pathname)
    allFiles = [os.path.join(pathname, file) for file in allFiles]
    return allFiles

In [9]:
def fast_tokenize(text):
    
    # Get a list of punctuation marks
    punct = string.punctuation + '“' + '”' + '‘' + "’"
    
    lower_case = text.lower()
    lower_case = lower_case.replace('—', ' ').replace('\n', ' ')
    
    # Iterate through text removing punctuation characters
    no_punct = "".join([char for char in lower_case if char not in punct])
    
    # Split text over whitespace into list of words
    tokens = no_punct.split()
    
    return tokens
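
As a quick check of what `fast_tokenize` does, the sketch below repeats the same logic so it can run on its own and applies it to a made-up sample string (the sample text is an illustration, not taken from the corpus). Note that because the straight apostrophe is part of `string.punctuation`, contractions such as "don't" collapse into a single token.

```python
import string

def fast_tokenize(text):
    # Same logic as the function above, repeated so the example is self-contained
    punct = string.punctuation + '“' + '”' + '‘' + '’'
    lower_case = text.lower().replace('—', ' ').replace('\n', ' ')
    no_punct = "".join(char for char in lower_case if char not in punct)
    return no_punct.split()

# Made-up sample with curly quotes, a long dash, a contraction, and a newline
sample = "“Call me Ishmael,” he said—don't ask why.\nSome years ago!"
tokens = fast_tokenize(sample)
print(tokens)
```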

## 1. Import and Pre-Processing

### Corpus Description

The corpus description can be found [here](https://docsouth.unc.edu/docsouthdata/).

### Import Data

Read the metadata for both collections into a single pandas DataFrame, drop duplicate files and multiple editions of the same narrative, and then read the remaining .txt files from the two folders into a list of texts.

In [13]:
meta_fpn = pandas.read_csv('../data/first-person-narratives-american-south/data/toc.csv', encoding = 'utf-8')
meta_neh = pandas.read_csv('../data/na-slave-narratives/data/toc.csv', encoding = 'utf-8')
meta = pandas.concat([meta_fpn, meta_neh]).reset_index()

#Dropping multiple editions from the same autobiography
#Keeping the autobiography with the latest date

#349 and 270 = Frederick Douglass
#363 = William Wells Brown
meta.drop([349, 270, 363], inplace=True)

meta.drop_duplicates(subset = 'Filename', inplace=True)

#read in all the data, with some cleaning
path_fpn = get_path('../data/first-person-narratives-american-south/data/texts/') # indicate the local path where files are stored
path_neh = get_path('../data/na-slave-narratives/data/texts/')
path_all = path_fpn + path_neh

#remove duplicate files and multiple editions of same narrative
keep = meta['Filename'].tolist()
keep = [name.replace('.xml', '.txt') for name in keep]
filenames = []
path = []

for p in path_all:
    name = os.path.basename(p)
    if (name not in filenames) and (name in keep):
        filenames.append(name)
        path.append(p)

data = []

for file in path:
    with open(file, encoding='utf-8') as myfile:
        data.append(myfile.read())

### Pre-Processing

Word2Vec learns about the relationships among words by observing them in context. This means that we want to split our texts into word-units. We also want to maintain sentence boundaries, since the last word of the previous sentence might skew the meaning of the next sentence. For that reason, each text is first split into sentences, and each sentence is then tokenized into words with the punctuation removed.

You can split your text into sentences using `nltk.tokenize.sent_tokenize()`.

In [10]:
sentences = [sentence for text in data for sentence in sent_tokenize(text)]
words_by_sentence = [fast_tokenize(sentence) for sentence in sentences]
words_by_sentence = [sentence for sentence in words_by_sentence if sentence != []]
words_by_sentence[0]

['title', 'page', 'image', 'introduction']

## 2. Word2Vec

### Word Embedding
Word2Vec is one of the most prominent word embedding algorithms. Word embedding generally attempts to identify semantic relationships between words by observing them in context.

Imagine that each word in a novel has its meaning determined by the ones that surround it in a limited window. For example, in Moby Dick's first sentence, “me” is paired on either side by “Call” and “Ishmael.” After observing the windows around every word in the novel (or many novels), the computer will notice a pattern in which “me” falls between similar pairs of words to “her,” “him,” or “them.” Of course, the computer has gone through a similar process over the words “Call” and “Ishmael,” for which “me” is reciprocally part of their contexts. This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretative frameworks of language.

The two main flavors of Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. Skip-Gram takes a word of interest as its input (e.g. "me") and tries to learn how to predict its context words ("Call","Ishmael"). CBOW does the opposite: it takes the context words ("Call","Ishmael") as a single input and tries to predict the word of interest ("me").
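
To make the difference in inputs and outputs concrete, here is a minimal sketch that builds the training examples each flavor would see for "call me ishmael" with a window of one word. The helper names (`context_window`, `skipgram_pairs`, `cbow_pairs`) are made up for this illustration and are not part of gensim. The sketch covers only how words are paired, not how the actual training (sampling, weight updates) works.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window):
    # Skip-Gram: one (input word, context word to predict) pair per context word
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

def cbow_pairs(tokens, window):
    # CBOW: one (all context words as input, word to predict) pair per position
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

tokens = ["call", "me", "ishmael"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_pairs(tokens, window=1)
print(sg)
print(cbow)
```

For the word of interest "me", Skip-Gram produces two separate examples, `('me', 'call')` and `('me', 'ishmael')`, while CBOW produces the single example `(['call', 'ishmael'], 'me')`.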

In general, CBOW is faster and does well with frequent words, while Skip-Gram potentially represents rare words better.

### Word2Vec Features
<ul>
<li>size: Number of dimensions of the word vectors</li>
<li>window: Maximum number of context words to observe in each direction</li>
<li>min_count: Minimum frequency for a word to be included in the model</li>
<li>sg (Skip-Gram): '0' indicates CBOW model; '1' indicates Skip-Gram</li>
<li>alpha: Initial learning rate; it decreases during training, which prevents the model from over-correcting and enables finer tuning</li>
<li>iter: Number of passes through the dataset</li>
<li>batch_words: Target number of words in each batch of sentences passed to a worker thread</li>
<li>workers: Number of worker threads; using a single worker (workers=1) is one of the requirements for a reproducible run</li>
</ul>

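Note: The script uses gensim's default value for each argument except min_count (10 rather than 5), sg (1 rather than 0), and workers (1 rather than 3).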
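
The keyword names used in this script (`size`, `iter`) follow gensim 3.x. In gensim 4.0 and later these two arguments are called `vector_size` and `epochs`. The sketch below shows one possible way to keep a single parameter dictionary working under either version. `translate_w2v_kwargs` is a hypothetical helper written for this illustration, not a gensim function.

```python
# Hypothetical helper (not part of gensim): rename gensim 3.x keyword
# arguments to their gensim 4.x equivalents.
RENAMED_IN_GENSIM_4 = {"size": "vector_size", "iter": "epochs"}

def translate_w2v_kwargs(kwargs, gensim_major_version):
    """Return a copy of `kwargs` with names suited to the given gensim major version."""
    if gensim_major_version < 4:
        return dict(kwargs)
    return {RENAMED_IN_GENSIM_4.get(name, name): value
            for name, value in kwargs.items()}

# The parameter values used in this script
params = dict(size=100, window=5, min_count=10, sg=1, alpha=0.025,
              iter=5, batch_words=10000, workers=1)

new_params = translate_w2v_kwargs(params, gensim_major_version=4)
old_params = translate_w2v_kwargs(params, gensim_major_version=3)
print(new_params)
```

Assuming gensim is installed, the major version could be read with `int(gensim.__version__.split('.')[0])` and the result passed on as `gensim.models.Word2Vec(words_by_sentence, **new_params)`.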

### Training, or fitting

In [19]:
model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5,
                               min_count=10, sg=1, alpha=0.025, iter=5, batch_words=10000, workers=1)

# Save model for later use
model.wv.save_word2vec_format('../data/word2vec_all_clean.txt')

In [None]:
#create 40 random models for constructing confidence intervals

def gen_model(words_by_sentence, num):
    """
    Takes a list of sentences (each a list of words) and a number (for naming the file) as input
    Saves a word2vec model in the word2vec_robust folder
    """

    model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5,
                                   min_count=10, sg=1, alpha=0.025, iter=5, batch_words=10000, workers=1)
    
    
    model.wv.save_word2vec_format('../data/word2vec_robust/model%d.txt' % num)
    
#Number of sentences, used as the size of each random sample
num_sent = len(words_by_sentence)

for num in range(0,40):
    print(num)
    
    #extract random sample of sentences with replacement, 
    #equal to total number of sentences in the full corpus
    gen_model(choices(words_by_sentence, k = num_sent), num)
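
This script only trains and saves the 40 resampled models; how the confidence intervals are computed from them is not shown here. As a rough illustration of the general idea, the sketch below resamples a toy list of sentences with replacement, computes a statistic on each resample, and takes the 2.5th and 97.5th percentiles of the 40 estimates. The toy sentences, the placeholder statistic (mean sentence length), and the percentile method are all assumptions made for this example. In the real workflow the statistic would presumably be a quantity measured on each saved model.

```python
from random import Random
import numpy as np

rng = Random(0)  # seeded only so that this sketch is repeatable

# Toy stand-in for words_by_sentence (made-up data)
toy_sentences = [["call", "me", "ishmael"],
                 ["some", "years", "ago"],
                 ["never", "mind", "how", "long", "precisely"],
                 ["title", "page"]]

def statistic(sentences):
    # Placeholder statistic: mean sentence length in words
    return sum(len(s) for s in sentences) / len(sentences)

estimates = []
for _ in range(40):
    # Sample with replacement, same size as the original list
    resample = rng.choices(toy_sentences, k=len(toy_sentences))
    estimates.append(statistic(resample))

# Percentile interval over the 40 bootstrap estimates (an assumed method)
lower, upper = np.percentile(estimates, [2.5, 97.5])
print(lower, upper)
```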