# NLP on SEC Forms Using Doc2vec with Gensim
## Introduction
Throughout this notebook, we reference <a href="https://arxiv.org/pdf/1405.4053.pdf">Le and Mikolov 2014</a>. 

### Bag-of-words Model
Traditional document representations are commonly based on the <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">bag-of-words model</a>, which represents each input document as a fixed-length vector. For example, borrowing from the Wikipedia article, the two documents  
(1) `John likes to watch movies. Mary likes movies too.`  
(2) `John also likes to watch football games.`  
are used to construct a length 10 list of words  
`["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]`  
so then we can represent the two documents as fixed length vectors whose elements are the frequencies of the corresponding words in our list  
(1) `[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]`  
(2) `[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]`  
Bag-of-words models are surprisingly effective but lose information about word order. Bag of <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> models consider word phrases of length n to capture local word order but suffer from data sparsity and high dimensionality.
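
The example above can be reproduced with a few lines of plain Python. This is a minimal sketch: the `tokenize` helper is a deliberately naive assumption (it only strips periods and splits on whitespace), not the tokenizer used later in this notebook.

```python
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def tokenize(text):
    # Naive tokenizer (assumption): drop periods, split on whitespace
    return text.replace(".", "").split()

# Build the vocabulary in order of first appearance across all documents
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

def bow_vector(text):
    # Count how often each vocabulary word occurs in the document
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]
print(vocab)
print(vectors)
```

Both vectors have length 10 regardless of document length, and shuffling the words of a document leaves its vector unchanged, which is exactly the loss of word order mentioned above.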

### Word2vec Model
Word2vec uses a shallow neural network to embed words in a high-dimensional vector space. In the resulting vector space, close word vectors have similar contextual meanings and distant word vectors have different contextual meanings. For example, `strong` and `powerful` would be closer together than `strong` and `Paris`. Word2vec models can be trained using two prediction tasks which represent the skip-gram and continuous-bag-of-words models.


#### Word2vec - Skip-gram Model
The Skip-gram <a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">Word2vec</a> Model takes in pairs (word1, word2) generated by moving a window across text data and trains a 1-hidden-layer neural network on a fake task: given an input word, predict a probability distribution over the words that appear near it. The input-to-hidden weights of the network become the word embeddings, so if the hidden layer has 300 neurons, the network gives us 300-dimensional word embeddings. Words are fed to the network using <a href="https://en.wikipedia.org/wiki/One-hot">one-hot</a> encoding.
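
To make the windowing step concrete, here is a small sketch of how (word1, word2) pairs could be generated. The function name `skipgram_pairs` and the `window` parameter are illustrative assumptions, not part of gensim.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every word within `window` positions of the center."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield (center, tokens[j])

tokens = "john likes to watch movies".split()
pairs = list(skipgram_pairs(tokens, window=1))
print(pairs)
```

Each pair is one training example for the fake task: the first element is the input word and the second is a nearby word the network should assign high probability to.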

#### Word2vec - Continuous-bag-of-words Model
The Continuous-bag-of-words Word2vec Model is also a 1-hidden-layer neural network. This time, the fake task is to predict the center word based on context words in a window around the center word. The input-to-hidden weights become the word embeddings, and words are again one-hot encoded.
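
The training examples for this task can be sketched as follows. The helper `cbow_examples` is a hypothetical name used only for illustration.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, center_word) examples from a sliding window."""
    for i, center in enumerate(tokens):
        # Words to the left and right of position i, clipped at the sequence boundaries
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            yield (context, center)

tokens = "john likes to watch movies".split()
cbow = list(cbow_examples(tokens, window=1))
print(cbow)
```

Compared with skip-gram, the roles are reversed: the context words are the input and the center word is the prediction target.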

### Paragraph Vector
Le and Mikolov 2014 introduce the <i>Paragraph Vector</i>, which outperforms representing a document by averaging or concatenating its word vectors. We determine the embedding of a paragraph in its vector space in the same way as for words: by training a shallow neural network on a fake task. Paragraph Vectors take local word order into account and also give us a dense vector representation.
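
For comparison, the averaging baseline mentioned above can be sketched in a few lines. The 3-dimensional vectors below are made-up values for illustration only, and `average_vector` is a hypothetical helper.

```python
# Toy word vectors (made-up values, for illustration only)
word_vectors = {
    "strong": [0.9, 0.1, 0.0],
    "powerful": [0.8, 0.2, 0.0],
    "paris": [0.0, 0.1, 0.9],
}

def average_vector(words, vectors):
    """Represent a document as the element-wise mean of its known word vectors."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]

a = average_vector(["strong", "paris"], word_vectors)
b = average_vector(["paris", "strong"], word_vectors)
print(a == b)
```

The two documents contain the same words in a different order and receive the same vector, so this baseline discards word order entirely.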

#### Paragraph Vector - Distributed Memory (PV-DM)
This is the Paragraph Vector Model analogous to the Continuous-bag-of-words Word2vec Model. Paragraph Vectors are obtained by training a neural network on the fake task of inferring a center word based on context words and a context paragraph. A paragraph is a context for all words in the paragraph.
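
A rough sketch of the PV-DM training examples follows, assuming each paragraph is identified by its index in the corpus. The name `pv_dm_examples` is hypothetical.

```python
def pv_dm_examples(paragraphs, window=1):
    """Yield (paragraph_id, context_words, center_word) training examples."""
    for paragraph_id, tokens in enumerate(paragraphs):
        for i, center in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
            # The same paragraph_id accompanies every window drawn from that paragraph
            yield (paragraph_id, context, center)

paragraphs = [
    "john likes movies".split(),
    "mary likes football".split(),
]
dm_examples = list(pv_dm_examples(paragraphs))
for example in dm_examples:
    print(example)
```

The paragraph ID acts like an extra context word that is shared by all windows of the paragraph, which is how the paragraph vector comes to summarize what the local context alone is missing.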

#### Paragraph Vector - Distributed Bag of Words (PV-DBOW)
This is the Paragraph Vector Model analogous to the Skip-gram Word2vec Model. Paragraph Vectors are obtained by training a neural network on the fake task of predicting words randomly sampled from a paragraph, given only that paragraph's vector as input. Context words are ignored on the input side.
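
As described by Le and Mikolov, PV-DBOW feeds only the paragraph ID to the network and asks it to predict words sampled from that paragraph. A minimal sketch of such training pairs, with the hypothetical helper `pv_dbow_examples` and a fixed seed for reproducibility:

```python
import random

def pv_dbow_examples(paragraphs, samples_per_paragraph=3, seed=0):
    """Yield (paragraph_id, sampled_word) pairs; the paragraph ID is the only input."""
    rng = random.Random(seed)
    for paragraph_id, tokens in enumerate(paragraphs):
        for _ in range(samples_per_paragraph):
            yield (paragraph_id, rng.choice(tokens))

paragraphs = [
    "john likes movies".split(),
    "mary likes football".split(),
]
dbow_examples = list(pv_dbow_examples(paragraphs))
print(dbow_examples)
```

Because no context words are needed on the input side, this variant stores less data per example than PV-DM.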

### Requirements
The following Python packages are dependencies for this notebook:
* spacy  
* smart_open
* testfixtures
* sklearn
* gensim

The spaCy English model must also be downloaded by running `python -m spacy download en`.

In [1]:
from multiprocessing import Pool
import smart_open
import os.path
import spacy
import time
import glob

nlp = spacy.load('en')
dirname = 'data'

In [2]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
from contextlib import contextmanager
from collections import defaultdict
from collections import OrderedDict
from collections import namedtuple
from gensim.models import Doc2Vec
from IPython.display import HTML
from timeit import default_timer
import gensim.models.doc2vec
import statsmodels.api as sm
from random import shuffle
from random import sample
import multiprocessing
from os import remove
import pandas as pd
import numpy as np
import itertools
import datetime
import locale
import gensim
import random
import sys
import re
#import sklearn
from sklearn import linear_model

Using TensorFlow backend.


### PrepData - OOV Removal, No Lemmatization

In [None]:
print("Process started...")

# Redirect output to a text file
orig_stdout = sys.stdout
out_f = open('out_prepdata.txt', 'w')
sys.stdout = out_f

def processOne(txt):
    with smart_open.smart_open(txt, "rb") as t:
        doc = nlp.make_doc(t.read().decode("utf-8"))
        # Keep lowercase alphabetic, non-stop, in-vocabulary tokens, then drop the
        # first 500 of them, which are approximately the header of an SEC form
        tokens = [token.lower_ for token in doc
                  if token.is_alpha and not token.is_stop and not token.is_oov][500:]
        return " ".join(tokens)
def prepData():
    folders = ['train/pos', 'train/neg', 'test/pos', 'test/neg'] 
    print("Preparing dataset...")
    pool = Pool()
    tick_counter = 0
    for fol in folders:
        temp = u''
        txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))
        batch_size = 200
        print("Processing {0} files, {1} at a time in {2}".format(len(txt_files), batch_size, fol))
        for i in range(0, len(txt_files), batch_size):
            if (tick_counter != 0 and tick_counter % 10==0):
                start_time = time.time()
                sys.stdout = orig_stdout
                print("Finished with {0} files".format(tick_counter*batch_size))
                sys.stdout = out_f
            if (i%1000==0):
                print("Finished processing {0} files".format(i))
            # Clamp the last batch to the end of the file list
            end = min(i + batch_size, len(txt_files))
            results = pool.map(processOne, txt_files[i:end])
            temp += '\n'.join(results)
            temp += '\n'
            if (tick_counter != 0 and tick_counter % 10==0):
                end_time = time.time()
                sys.stdout = orig_stdout
                print("Last file size {0} batch processing running time was {1}".format(batch_size, end_time-start_time))
                sys.stdout = out_f
            tick_counter += 1
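            # Flush every 5000 files, and also after the final batch so the tail is not dropped
            # (i never equals len(txt_files) inside range(), so compare against end instead)
            if ((i != 0 and i % 5000==0) or end==len(txt_files)):
                output = 'aggregated-{0}-{1}.txt'.format(fol.replace('/', '-'), i)
                with smart_open.smart_open(os.path.join(dirname, output), "wb") as f:
                    # splitlines() avoids writing an empty document for the trailing newline
                    for idx, line in enumerate(temp.splitlines()):
                        #num_line = u"_*{0} {1}\n".format(idx, line)
                        num_line = u"{0}\n".format(line)
                        f.write(num_line.encode("UTF-8"))
                temp = u''
                print("{} aggregated".format(os.path.join(dirname, output)))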
        
start = time.time()
prepData()
end = time.time()
print("Total running time: {0}".format(end-start))

sys.stdout = orig_stdout
out_f.close()

print("Process completed")
print("Total running time: {0}".format(end-start))

Process started...
Finished with 2000 files
Last file size 200 batch processing running time was 5.919139385223389
Finished with 4000 files
Last file size 200 batch processing running time was 4.711833715438843
Finished with 6000 files
Last file size 200 batch processing running time was 4.077444314956665
Finished with 8000 files
Last file size 200 batch processing running time was 4.98043155670166
Finished with 10000 files
Last file size 200 batch processing running time was 4.967067241668701
Finished with 12000 files
Last file size 200 batch processing running time was 4.60893988609314
Finished with 14000 files
Last file size 200 batch processing running time was 4.950732231140137
Finished with 16000 files
Last file size 200 batch processing running time was 4.94968843460083
Finished with 18000 files
Last file size 200 batch processing running time was 5.979551315307617
Finished with 20000 files
Last file size 200 batch processing running time was 5.809570550918579
Finished with 2200

In [17]:
print("Process started...")

def aggregate_data(name, out):
    # Sort the matches so the line order of the output is reproducible
    txt_files = sorted(glob.glob(os.path.join(dirname, name)))
    print(len(txt_files))
    # Open in write mode so re-running the cell overwrites the output instead of appending to it
    with smart_open.smart_open(os.path.join(dirname, out), 'wb') as f:
        for txt in txt_files:
            with open(txt, 'rb') as src:
                for line in src:
                    # Lines are already UTF-8 bytes ending in a newline, so copy them unchanged
                    f.write(line)
    print("{0} aggregated".format(out))
    
start = time.time()
aggregate_data('aggregated-*.txt', 'alldata-id.txt')
aggregate_data('aggregated-train*.txt', 'train-all.txt')
aggregate_data('aggregated-test*.txt', 'test-all.txt')
aggregate_data('aggregated-train-pos*.txt', 'train-pos.txt')
aggregate_data('aggregated-train-neg*.txt', 'train-neg.txt')
aggregate_data('aggregated-test-pos*.txt', 'test-pos.txt')
aggregate_data('aggregated-test-neg*.txt', 'test-neg.txt')
end = time.time()
print ("Total running time: {0}".format(end-start))

print("Processed completed")

Process started...
28
alldata-id.txt aggregated
20
train-all.txt aggregated
8
test-all.txt aggregated
10
train-pos.txt aggregated
10
train-neg.txt aggregated
4
test-pos.txt aggregated
4
test-neg.txt aggregated
Total running time: 515.8943219184875
Processed completed


In [20]:
def file_len(fname):
    # Count the lines in a file; returns 0 for an empty file
    with open(fname) as f:
        return sum(1 for _ in f)

# num_lines_alldata =  file_len(os.path.join(dirname, 'alldata-id.txt'))
# num_lines_train =  file_len(os.path.join(dirname, 'train-all.txt'))
# num_lines_test =  file_len(os.path.join(dirname, 'test-all.txt'))
# num_lines_train_pos =  file_len(os.path.join(dirname, 'train-pos.txt'))
# num_lines_train_neg =  file_len(os.path.join(dirname, 'train-neg.txt'))
# num_lines_test_pos =  file_len(os.path.join(dirname, 'test-pos.txt'))
# num_lines_test_neg =  file_len(os.path.join(dirname, 'test-neg.txt'))

# print("Total number of paragraphs is {}".format(num_lines_alldata))
# print("Total number of training docs is {}".format(num_lines_train))
# print("Total number of test docs is {}".format(num_lines_test))
# print("Total number of pos training docs is {}".format(num_lines_train_pos))
# print("Total number of neg training docs is {}".format(num_lines_train_neg))
# print("Total number of pos test docs is {}".format(num_lines_test_pos))
# print("Total number of neg test docs is {}".format(num_lines_test_neg))

## Reading the Corpus

References: [doc2vec-lee](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb)

In [22]:
def read_corpus(fname):
    with open(fname, encoding='utf-8') as f:
        for i, line in enumerate(f):
            # Tag each document with its line number; words must be a list of tokens, not a string
            yield gensim.models.doc2vec.TaggedDocument(line.split(), [i])

SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

def read_labelled_corpus(fname, split, sentiment):
    # Validate the labels before reading any lines
    assert split in ['train', 'test', 'extra']
    assert sentiment in [1.0, 0.0, None]
    with open(fname, encoding='utf-8') as f:
        for i, line in enumerate(f):
            # Lines carry no leading ID token, so every token is a word
            words = gensim.utils.to_unicode(line).split()
            # The tag is the line number within this file
            tags = [i]
            yield SentimentDocument(words, tags, split, sentiment)

#### Set-up Doc2Vec Training & Evaluation Models
References: [Le & Mikolov 2014](http://cs.stanford.edu/~quocle/paragraph_vector.pdf), [go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ)  
Le and Mikolov note that combining a paragraph vector from Distributed Bag of Words (DBOW) with one from Distributed Memory (DM) improves performance. We follow their approach and pair the models together for evaluation by concatenating the paragraph vectors obtained from each model.
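
Conceptually, concatenation just joins the two vectors of a document end to end. The class below is a hypothetical stand-in written for illustration (it is not gensim's `ConcatenatedDoc2Vec`), and the vectors are made-up values.

```python
class ConcatenatedVectors:
    """Hypothetical stand-in: look up a tag in several vector tables and join the results."""

    def __init__(self, tables):
        self.tables = tables

    def __getitem__(self, tag):
        joined = []
        for table in self.tables:
            joined.extend(table[tag])
        return joined

# Made-up 2-dimensional document vectors from two models, keyed by document tag
dbow_vectors = {0: [0.1, 0.2], 1: [0.3, 0.4]}
dm_vectors = {0: [0.5, 0.6], 1: [0.7, 0.8]}

combined = ConcatenatedVectors([dbow_vectors, dm_vectors])
print(combined[0])
```

The combined vector has as many dimensions as the two source vectors together, so a downstream classifier sees the features learned by both models.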

In [23]:
cores = multiprocessing.cpu_count()
print("{} cores".format(cores))
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    # Every 10 million word types need about 1GB of RAM (For setting max_vocab_size)
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, max_vocab_size=100000, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, max_vocab_size=100000, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, max_vocab_size=100000, negative=5, hs=0, min_count=2, workers=cores),
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
simple_models[0].build_vocab(read_corpus(os.path.join(dirname, 'alldata-id.txt')))  # PV-DM w/ concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])

models_by_name = OrderedDict((str(model), model) for model in simple_models)

models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[2]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[0]])
for model in models_by_name:
    print(model)

8 cores


KeyboardInterrupt: 

## Predictive Evaluation Methods

Let's define some helper methods for evaluating the performance of our Doc2vec models. We will classify document sentiment with a logistic regression model trained on the paragraph embeddings, and then compare the error rates obtained with the paragraph embeddings from each of our Doc2vec models.
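
The error rate itself is simply the fraction of misclassified test documents. A minimal sketch with a hypothetical helper `error_rate` and made-up labels:

```python
def error_rate(predictions, targets):
    """Return (error_rate, error_count) for two equally long label sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    errors = sum(1 for p, t in zip(predictions, targets) if p != t)
    return errors / len(targets), errors

rate, count = error_rate([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0])
print(rate, count)
```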

In [None]:
@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end-start

def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""
    
    train_targets, train_regressors = zip(*[(doc.sentiment, test_model.docvecs[doc.tags[0]]) for doc in train_set])
    logistic = linear_model.LogisticRegression(C=1e5)
    logistic.fit(train_regressors, train_targets)

    test_data = test_set
    if infer:
        if infer_subsample < 1.0:
            test_data = sample(test_data, int(infer_subsample * len(test_data)))
        test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]
    else:
        test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_data]
    
    # Predict & evaluate
    test_predictions = logistic.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), logistic)

## Bulk Training
We use an explicit multiple-pass, alpha-reduction approach as sketched in this [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/) with added shuffling of corpus on each pass.
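
The alpha-reduction idea can be sketched on its own: the learning rate starts at `alpha` and is lowered by a constant step after each pass. The function name `alpha_schedule` is hypothetical; the default values mirror the ones used in the training cell of this notebook.

```python
def alpha_schedule(alpha=0.025, min_alpha=0.001, passes=20):
    """Return the learning rate used in each pass under a linear decay."""
    delta = (alpha - min_alpha) / passes
    return [alpha - epoch * delta for epoch in range(passes)]

schedule = alpha_schedule()
for epoch, rate in enumerate(schedule, start=1):
    print("pass %2d: alpha = %.4f" % (epoch, rate))
```

With this scheme the rate is reduced after each pass, so the last pass still trains slightly above `min_alpha`.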

Note that vector training is occurring on *all* documents of the dataset, which includes all TRAIN/TEST/DEV docs.

We evaluate each model's sentiment predictive power based on error rate, and the evaluation is repeated after each pass so we can see the rates of relative improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. 

In [None]:
best_error = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [None]:
alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())
start = time.time()

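# Materialize the corpora as lists: shuffle() and len() need sequences, and the
# labelled sets are iterated again on every pass
doc_list = list(read_corpus(os.path.join(dirname, 'alldata-id.txt')))
# read_labelled_corpus requires a split and a sentiment label, so the pos/neg files are read separately.
# Caveat: tags are per-file line numbers; they only match the trained docvecs if the
# files share the line order of alldata-id.txt
train_docs = (list(read_labelled_corpus(os.path.join(dirname, 'train-pos.txt'), 'train', 1.0)) +
              list(read_labelled_corpus(os.path.join(dirname, 'train-neg.txt'), 'train', 0.0)))
test_docs = (list(read_labelled_corpus(os.path.join(dirname, 'test-pos.txt'), 'test', 1.0)) +
             list(read_labelled_corpus(os.path.join(dirname, 'test-neg.txt'), 'test', 0.0)))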

for epoch in range(passes):
    shuffle(doc_list)  # Shuffling gets best results
    
    for name, train_model in models_by_name.items():
        # Train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)
            #train_model.train(alldocs, total_examples=len(alldocs), epochs=1)
            duration = '%.1f' % elapsed()
            
        # Evaluate
        eval_duration = ''
        with elapsed_timer() as eval_elapsed:
            err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs)
        eval_duration = '%.1f' % eval_elapsed()
        best_indicator = ' '
        if err <= best_error[name]:
            best_error[name] = err
            best_indicator = '*'
        print("%s%f : %i passes : %s %ss %ss" % (best_indicator, err, epoch + 1, name, duration, eval_duration))

        if ((epoch + 1) % 5) == 0 or epoch == 0:
            eval_duration = ''
            with elapsed_timer() as eval_elapsed:
                infer_err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs, infer=True)
            eval_duration = '%.1f' % eval_elapsed()
            best_indicator = ' '
            if infer_err < best_error[name + '_inferred']:
                best_error[name + '_inferred'] = infer_err
                best_indicator = '*'
            print("%s%f : %i passes : %s %ss %ss" % (best_indicator, infer_err, epoch + 1, name + '_inferred', duration, eval_duration))
        
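        if (epoch == passes-1 and isinstance(train_model, Doc2Vec)):
            # Save each trained model under its own file name (derived from the model name);
            # the ConcatenatedDoc2Vec wrappers are skipped
            train_model.save(re.sub(r'\W+', '_', name).strip('_') + '.model')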
            
    print('Completed pass %i at alpha %f' % (epoch + 1, alpha))
    alpha -= alpha_delta

train_docs = None
test_docs = None 
doc_list = None # We're done using this
    
print("END %s" % str(datetime.datetime.now()))
end = time.time()
print("Time elapsed: {0}".format(end-start))

## Achieved Sentiment-Prediction Accuracy

In [None]:
# Print best error rates achieved
print("Err rate Model")
for rate, name in sorted((rate, name) for name, rate in best_error.items()):
    print("%f %s" % (rate, name))

## Examining Results

### Are inferred vectors close to the precalculated ones?

In [None]:
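# Reload the corpus in line order so that alldocs[doc_id] matches the tag doc_id
alldocs = list(read_corpus(os.path.join(dirname, 'alldata-id.txt')))
doc_id = np.random.randint(simple_models[0].docvecs.count)  # Pick random doc; re-run cell for more examples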
print('for doc %d...' % doc_id)
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))

### Do close documents seem more related than distant ones?

In [None]:
doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples
model = random.choice(simple_models)  # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))

### Do the word vectors show useful similarities?

In [None]:
word_models = simple_models[:]

In [None]:
# pick a random word with a suitable number of occurrences
while True:
    word = random.choice(word_models[0].wv.index2word)
    if word_models[0].wv.vocab[word].count > 10:
        break
# or uncomment below line, to just pick a word from the relevant domain:
#word = 'comedy/drama'
similars_per_model = [str(model.wv.most_similar(word, topn=20)).replace('), ','),<br>\n') for model in word_models]
similar_table = ("<table><tr><th>" +
    "</th><th>".join([str(model) for model in word_models]) + 
    "</th></tr><tr><td>" +
    "</td><td>".join(similars_per_model) +
    "</td></tr></table>")
print("most similar words for '%s' (%d occurrences)" % (word, simple_models[0].wv.vocab[word].count))
HTML(similar_table)

DBOW words look meaningless because the DBOW model doesn't train word vectors; they remain at their random initial values (unless you set `dbow_words=1`, which slows training with little improvement). The DM models, on the other hand, show meaningfully similar words when there are enough examples.
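
Similarity between embeddings is commonly measured with cosine similarity. The sketch below ranks words in a toy vocabulary by cosine similarity to a query word; the vectors are made-up values and `most_similar_words` is a hypothetical helper, not the gensim method.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_words(query, vectors, topn=3):
    """Rank all other words by cosine similarity to the query word."""
    scores = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy word vectors (made-up values, for illustration only)
toy_vectors = {
    "strong": [0.9, 0.1, 0.0],
    "powerful": [0.8, 0.2, 0.0],
    "paris": [0.0, 0.1, 0.9],
}

ranked = most_similar_words("strong", toy_vectors)
print(ranked)
```

Randomly initialized vectors have no such structure, which is why neighbors taken from an untrained word-vector table look arbitrary.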

### Do documents have useful similarities?

In [None]:
# pick random doc
doc_id = np.random.randint(simple_models[0].docvecs.count)
similars_per_model = [str([(' '.join(alldocs[a_sim[0]].words), a_sim[1]) for a_sim in model.docvecs.most_similar(doc_id, topn=1)]).replace('), ','),<br>\n') for model in simple_models]
similar_table = ("<table><tr><th>" +
    "</th><th>".join(["Original doc"] + ([str(model) for model in word_models])) + 
    "</th></tr><tr><td style=\"vertical-align:top\">" +
    "</td><td style=\"vertical-align:top\">".join([' '.join(alldocs[doc_id].words)] + (similars_per_model)) +
    "</td></tr></table>")
#print("most similar words for {}".format(' '.join(alldocs[doc_id].words)))
HTML(similar_table)