# Doc2Vec

Code adapted from doc [gensim doc2vec & IMDB sentiment dataset](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb)

## Load corpus

Fetch and prep exactly as in Mikolov's go.sh shell script. (Note this cell tests for existence of required files, so steps won't repeat once the final summary file (`aclImdb/alldata-id.txt`) is available alongside this notebook.)

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
import numpy as np
import gensim
from gensim.models import Word2Vec
from nltk.data import find
from nltk.corpus import wordnet as wn
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import random

data = pd.read_csv('../nodups_combined_jokes.csv', sep = ',', index_col = 0)
# word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
# model = gensim.models.Word2Vec.load_word2vec_format(word2vec_sample, binary=False)
# tfidfvectorizer = TfidfVectorizer(max_features=2000, stop_words='english')
# tfidf_counts = tfidfvectorizer.fit_transform(data['text'].values.astype('U'))


In [4]:
import locale
import glob
import os.path
import requests

# Convert text to lower-case and strip punctuation/symbols from words
def normalize_text(text):
    norm_text = text.lower()

    # Replace breaks with spaces
    norm_text = norm_text.replace('<br />', ' ')

    # Pad punctuation with spaces on both sides
    for char in ['.', '"', ',', '(', ')', '!', '?', ';', ':']:
        norm_text = norm_text.replace(char, ' ' + char + ' ')

    return norm_text

In [5]:
# import os.path
# assert os.path.isfile("aclImdb/alldata-id.txt"), "alldata-id.txt unavailable"

The data is small enough to be read into memory. 

In [6]:
import gensim
from gensim.models.doc2vec import TaggedDocument
from collections import namedtuple
alldocs = []
SentimentDocument = namedtuple('SentimentDocument', 'words tags split')
for line_no, line in enumerate(data['text']):
# for line_no, line in enumerate(data2['text']):
    tokens = gensim.utils.to_unicode(normalize_text(line)).split()
    words = tokens
    tags = [line_no] # `tags = [tokens[0]]` would also work at extra memory cost
#     split = ['train','test','extra','extra'][line_no//25000]  # 25k train, 25k test, 25k extra
    split = ['train']
    alldocs.append(SentimentDocument(words, tags, split))

train_docs = [doc for doc in alldocs if doc.split == 'train']
# test_docs = [doc for doc in alldocs if doc.split == 'test']
doc_list = alldocs[:]  # for reshuffling per pass

print('%d docs: %d train-sentiment' % (len(doc_list), len(train_docs)))

9076 docs: 0 train-sentiment


## Set-up Doc2Vec Training & Evaluation Models

Approximating experiment of Le & Mikolov ["Distributed Representations of Sentences and Documents"](http://cs.stanford.edu/~quocle/paragraph_vector.pdf), also with guidance from Mikolov's [example go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ):

`./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1`

Parameter choices below vary:

* 100-dimensional vectors, as the 400d vectors of the paper don't seem to offer much benefit on this task
* similarly, frequent word subsampling seems to decrease sentiment-prediction accuracy, so it's left out
* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`
* added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)
* a `min_count=2` saves quite a bit of model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each doc vectors themselves)

In [7]:
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "this will be painfully slow otherwise"

simple_models = [
    # PV-DM w/concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=200, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=200, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/average
    Doc2Vec(dm=1, dm_mean=1, size=200, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

# speed setup by sharing results of 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM/concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

Doc2Vec(dm/c,d200,n5,w5,mc2,t8)
Doc2Vec(dbow,d200,n5,mc2,t8)
Doc2Vec(dm/m,d200,n5,w10,mc2,t8)


Following the paper, we also evaluate models in pairs. These wrappers return the concatenation of the vectors from each model. (Only the singular models are trained.)

In [8]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[2]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[0]])

## Predictive Evaluation Methods

Helper methods for evaluating error rate.

In [9]:
import numpy as np
import statsmodels.api as sm
from random import sample

# for timing
from contextlib import contextmanager
from timeit import default_timer
import time 

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end-start
    
def logistic_predictor_from_data(train_targets, train_regressors):
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    #print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets, train_regressors = zip(*[(doc.sentiment, test_model.docvecs[doc.tags[0]]) for doc in train_set])
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_data = test_set
    if infer:
        if infer_subsample < 1.0:
            test_data = sample(test_data, int(infer_subsample * len(test_data)))
        test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]
    else:
        test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_docs]
    test_regressors = sm.add_constant(test_regressors)
    
    # predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)


## Bulk Training

Using explicit multiple-pass, alpha-reduction approach as sketched in [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/) – with added shuffling of corpus on each pass.

Note that vector training is occurring on *all* documents of the dataset, which includes all TRAIN/TEST/DEV docs.

Evaluation of each model's sentiment-predictive power is repeated after each pass, as an error rate (lower is better), to see the rates-of-relative-improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. 

(On a 4-core 2.6Ghz Intel Core i7, these 20 passes training and evaluating 3 main models takes about an hour.)

In [10]:
from collections import defaultdict
best_error = defaultdict(lambda :1.0)  # to selectively-print only best errors achieved

In [11]:
from random import shuffle
import datetime

alpha, min_alpha, passes = (0.025, 0.001, 30)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)  # shuffling gets best results
    
    for name, train_model in models_by_name.items():
        # train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:
            train_model.train(doc_list)
            duration = '%.1f' % elapsed()
    print('completed pass %i at alpha %f' % (epoch + 1, alpha))
    alpha -= alpha_delta
    
print("END %s" % str(datetime.datetime.now()))

START 2016-12-06 18:52:31.699664
completed pass 1 at alpha 0.025000
completed pass 2 at alpha 0.024200
completed pass 3 at alpha 0.023400
completed pass 4 at alpha 0.022600
completed pass 5 at alpha 0.021800
completed pass 6 at alpha 0.021000
completed pass 7 at alpha 0.020200
completed pass 8 at alpha 0.019400
completed pass 9 at alpha 0.018600
completed pass 10 at alpha 0.017800
completed pass 11 at alpha 0.017000
completed pass 12 at alpha 0.016200
completed pass 13 at alpha 0.015400
completed pass 14 at alpha 0.014600
completed pass 15 at alpha 0.013800
completed pass 16 at alpha 0.013000
completed pass 17 at alpha 0.012200
completed pass 18 at alpha 0.011400
completed pass 19 at alpha 0.010600
completed pass 20 at alpha 0.009800
completed pass 21 at alpha 0.009000
completed pass 22 at alpha 0.008200
completed pass 23 at alpha 0.007400
completed pass 24 at alpha 0.006600
completed pass 25 at alpha 0.005800
completed pass 26 at alpha 0.005000
completed pass 27 at alpha 0.004200
comp

## Achieved Sentiment-Prediction Accuracy

In [12]:
# print best error rates achieved
for rate, name in sorted((rate, name) for name, rate in best_error.items()):
    print("%f %s" % (rate, name))

In my testing, unlike the paper's report, DBOW performs best. Concatenating vectors from different models only offers a small predictive improvement. The best results I've seen are still just under 10% error rate, still a ways from the paper's 7.42%.


### Do close documents seem more related than distant ones?

In [44]:
import random

doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples
# 0 for dm/c, 1 dbow, 2 dm/m
model = simple_models[1]
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 2), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))


TARGET (1562): «i hope you know cpr because you're taking my breath away .»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dbow,d200,n5,mc2,t8):

MOST (8987, 0.6442590951919556): «you better hope you marry rich .»

MEDIAN (4486, 0.27110743522644043): «i saw a transvestite holding one of those fancy handbags that said 'guess . ' i said , 'you're a man ? '»

LEAST (2515, -0.019454311579465866): «an army major visits the sick soldiers , goes up to one private and asks :»



In [106]:
sample_text = ("Yo mama's like a brick. dirty, flat on both sides, and always getting laid by Mexicans.")
tokens = gensim.utils.to_unicode(normalize_text(sample_text)).split()
vector = model.infer_vector(tokens)
# print(dir(model))
sims = model.docvecs.most_similar([vector], topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET: «%s»\n' % (sample_text))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 2), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))


TARGET: «Yo mama's like a brick. dirty, flat on both sides, and always getting laid by Mexicans.»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dbow,d200,n5,mc2,t8):

MOST (3778, 0.5133495926856995): «yo mama's so fat , she uses a semi-trailer as a couch .»

MEDIAN (1571, 0.196953684091568): «you're like a dictionary -- you add meaning to my life .»

LEAST (4585, -0.04875617474317551): «an old woman says to an old man at the rest home , " i can guess your age . " the man doesn't believe her , but tells her to go ahead and try . " pull down your pants , " she says . she inspects his rear end for a few minutes and then says , " you're 84 years old . " " that's amazing , " the man says . " how did you know ? " " you told me yesterday . "»



In [108]:
model.save('../doc2vecmodel.pkl')

(Somewhat, in terms of reviewer tone, movie genre, etc... the MOST cosine-similar docs usually seem more like the TARGET than the MEDIAN or LEAST.)

In [109]:
# save docvecs
model = simple_models[1]
docvecs = model.docvecs
# dir(docvecs)
# docvecs.__dict__

# np.save('docvec.npy', docvecs.doctag_syn0norm)
np.save('../docvec_unique.npy', docvecs.doctag_syn0)

In [15]:
docvecs.__dict__

{'count': 9076,
 'doctag_syn0': array([[ 0.41157216, -0.15710752,  0.21310677, ...,  0.01677105,
          0.28926131, -0.65804493],
        [ 0.3731648 , -0.45764539,  0.15154262, ...,  0.20910114,
         -0.09889825, -0.91377634],
        [ 0.12960051,  0.08168242, -0.12865862, ...,  0.12778267,
         -0.01339548, -0.26391193],
        ..., 
        [ 0.011427  , -0.03433529,  0.17223276, ...,  0.11492898,
          0.38475436, -0.31939203],
        [-0.1054318 ,  0.4818863 , -0.11470713, ..., -0.29730418,
         -0.27723277, -0.20750199],
        [ 0.25216168,  0.27897939, -0.15056613, ...,  0.06536528,
          0.14018208, -0.12054388]], dtype=float32),
 'doctag_syn0_lockf': array([ 1.,  1.,  1., ...,  1.,  1.,  1.], dtype=float32),
 'doctag_syn0norm': None,
 'doctags': {},
 'mapfile_path': None,
 'max_rawint': -1,
 'offset2doctag': []}

In [None]:
# Structure stuff
def joke_structure(joke):
    '''This function takes in a joke as a single string. Then it calculates the length of the joke (in words),
    the number of sentences in the joke, and whether (1) or not (0) the joke involves a question.'''
    # one additional thought is to catch whether it is a joke AT somebody (uses "you") or if it's a general, 
    # situational joke that is always in the third person.
    structure = []
    
    # first, calculate the length of the joke (I'll define this in words):
    structure.append(count_words(joke))
    
    # then, calculate the number of sentences/segments in the joke:
    structure.append(count_sents(joke))
    
    # finally, determine whether or not the joke involves a question:
    structure.append(is_a_question(joke))
    
    return structure
    
def count_words(joke):
    words = nltk.word_tokenize(joke)
    return len(words)
    
def count_sents(joke):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sents = sent_tokenizer.tokenize(joke) # Split text into sentences    
    return len(sents)

def is_a_question(joke):
    words = nltk.word_tokenize(joke)
    if "?" in words:
        return 1
    else:
        return 0

In [None]:
# append these features to an existing array
data = pd.read_csv('../combined_jokes_unique.csv', sep = ',', index_col = 0)
new_features = np.zeros((len(data),3))
for i, joke in enumerate(data):
    try:
        new_features[i] = np.asarray(joke_structure(joke))
    except:
        pass
existing_features = np.load('../docvec_unique.npy')
print(existing_features.shape)
print(new_features.shape)
combined_features = np.append(existing_features, new_features, axis=1)
np.save('../combined_features_unique.npy', combined_features)