In [2]:
import numpy as np
import pandas as pd
import multiprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix, log_loss
from sklearn.linear_model import LogisticRegression
import gensim
from gensim import utils
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"
from tqdm import tqdm
from random import shuffle
from utils import labelize_reviews, get_learned_vectors
import matplotlib.pyplot as plt

tqdm.pandas(desc="progress-bar")

## Paragraph Vector (Doc2Vec)

In this notebook, we'll explore the [Paragraph Vector](https://cs.stanford.edu/~quocle/paragraph_vector.pdf) algorithm, a.k.a. Doc2Vec, on ~3 million Yelp reviews. Doc2Vec is an extension of word2vec for learning document embeddings. It acts as if each document has an additional floating, word-like vector that contributes to all training predictions and is updated like the other word-vectors; we will call it a doc-vector. Gensim's Doc2Vec class implements this algorithm.

To recap, Word2Vec is a model from 2013 that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors that are close together in vector space have similar meanings based on context, and word-vectors that are distant from each other have differing meanings.
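
"Close together" is usually measured with cosine similarity. The following is a minimal sketch of that measure in plain Python; the three toy vectors are invented for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "pasta": [0.9, 0.1, 0.0],
    "spaghetti": [0.8, 0.2, 0.1],
    "parking": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["pasta"], toy_vectors["spaghetti"])
sim_far = cosine_similarity(toy_vectors["pasta"], toy_vectors["parking"])
print(f"pasta vs spaghetti: {sim_close:.3f}")
print(f"pasta vs parking:   {sim_far:.3f}")
```

Words used in similar contexts should score close to 1, while unrelated words should score near 0 or below.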

There are two approaches within `doc2vec`: `dbow` and `dmpv`.

`dbow (Paragraph Vector - Distributed Bag of Words)` works in the same way as `skip-gram` in word2vec, except that the input is replaced by a special token representing the document (i.e. $v_{wI}$ is a vector representing the document). In this architecture, the order of words in the document is ignored; hence the name distributed bag of words. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)

`dmpv (Paragraph Vector - Distributed Memory)` works in a similar way to `cbow` in word2vec. For the input, dmpv introduces a document token in addition to multiple context words (i.e. $v_{wI}$ combines the document token and several context words). The objective is again to predict a target word given the combined document and word vectors: the doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word from both the context word-vectors and the full document's doc-vector. The input vectors can be either averaged or concatenated, so there are 2 DM models, specifically:
*  one which averages context vectors (dm_mean)
*  one which concatenates them (dm_concat, resulting in a much larger, slower, more data-hungry model)
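
The difference between the architectures can be sketched by listing the (input, target) training examples each one would build from a single review. This is a simplified illustration in plain Python, not Gensim's implementation: the function names are hypothetical, and details such as negative sampling and window shrinking are left out. The input-width helper assumes one doc tag per document.

```python
def dbow_examples(doc_tag, words):
    """PV-DBOW: the doc tag alone is the input; every word in the document is a target."""
    return [(doc_tag, target) for target in words]

def dm_examples(doc_tag, words, window=2):
    """PV-DM: the doc tag plus the surrounding context words predict each target word."""
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        examples.append(((doc_tag, left + right), target))
    return examples

def dm_input_size(vector_size, window, concat):
    """Width of the DM input layer: averaging keeps vector_size, concatenation multiplies it."""
    if concat:
        return vector_size * (1 + 2 * window)  # one doc-vector plus 2 * window word-vectors
    return vector_size

words = "food is still great".split()
print(dbow_examples("all_2", words))
print(dm_examples("all_2", words, window=1))
print(dm_input_size(200, window=5, concat=False))
print(dm_input_size(200, window=3, concat=True))
```

The last two lines show why the concatenating variant is described as much larger: under these assumptions its input layer is several times wider than the averaging one.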


In [2]:
df = pd.read_csv('allcat_clean_reviews.csv',index_col=0)
df.head()

Unnamed: 0,reviews,target
0,the rooms are big but the hotel is not good as...,0
1,second time with ocp saturday night pm not bus...,0
2,food is still great since they remodeled but t...,0
3,dirty location and very high prices but they d...,0
4,so first the off stood outside for mins to try...,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3085663 entries, 0 to 3086007
Data columns (total 2 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   reviews  object
 1   target   int64 
dtypes: int64(1), object(1)
memory usage: 70.6+ MB


In [4]:
SEED = 1000

x = df.reviews
y = df.target

#defining our training, validation and test set
x_train, x_validation_test, y_train, y_validation_test = train_test_split(x, y, test_size=.06, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_test, y_validation_test, test_size=.5, random_state=SEED)

In [5]:

print('The Training set has {0} reviews with {1:.2f}% negative, {2:.2f}% positive reviews'.format(
    len(x_train),
    len(x_train[y_train == 0]) / len(x_train) * 100,
    len(x_train[y_train == 1]) / len(x_train) * 100))

print('The Validation set has {0} entries with {1:.2f}% negative, {2:.2f}% positive reviews'.format(
    len(x_validation),
    len(x_validation[y_validation == 0]) / len(x_validation) * 100,
    len(x_validation[y_validation == 1]) / len(x_validation) * 100))

print('The test set has a total of {0} reviews with {1:.2f}% negative, {2:.2f}% positive reviews'.format(
    len(x_test),
    len(x_test[y_test == 0]) / len(x_test) * 100,
    len(x_test[y_test == 1]) / len(x_test) * 100))

The Training set has 2900523 reviews with 50.00% negative, 50.00% positive reviews
The Validation set has 92570 entries with 50.06% negative, 49.94% positive reviews
The test set has a total of 92570 reviews with 49.94% negative, 50.06% positive reviews
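
The set sizes above follow from the two splits: 6% of the data is held out, and that held-out part is then halved. The sketch below reproduces the arithmetic in plain Python. It assumes the held-out count is rounded up, which is consistent with the counts printed above; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), assuming the held-out count is rounded up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

n_total = 3085663  # number of entries reported by df.info()
n_train, n_holdout = split_sizes(n_total, 0.06)
n_validation, n_test = split_sizes(n_holdout, 0.5)
print(n_train, n_validation, n_test)
```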


Now, we tag each review with a unique ID using Gensim's `TaggedDocument` class. Then, we'll concatenate the training, validation and test sets for learning the representations. For training, I have decided to use the whole data set. The rationale behind this is that Doc2Vec training is completely unsupervised (it never sees the sentiment labels), so there is no need to hold out any data.
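
The `labelize_reviews` helper lives in a local `utils` module that is not shown in this notebook. The sketch below is one plausible shape for it, under the assumption that each tag is built as `<prefix>_<index>`. To keep the example runnable without Gensim, it uses a stand-in named tuple with the same two fields (`words`, `tags`) as Gensim's `TaggedDocument`.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim.models.doc2vec.TaggedDocument,
# used here only so the sketch runs without Gensim installed.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

def labelize_reviews_sketch(reviews, prefix):
    """Yield one tagged document per (index, text) pair, tagged '<prefix>_<index>'."""
    for index, text in reviews:
        yield TaggedDocument(words=text.split(), tags=[f"{prefix}_{index}"])

# Toy (index, text) pairs; with a pandas Series these would come from Series.items().
toy_reviews = [(0, "the rooms are big"), (3, "dirty location and very high prices")]
tagged = list(labelize_reviews_sketch(toy_reviews, "all"))
print(tagged[1].tags, tagged[1].words[:2])
```

Keeping the original row index inside the tag is what would later allow the vector of a given review to be looked up again after training.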

In [6]:
df = pd.DataFrame()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Empty DataFrame

In [7]:
%%time
from utils import labelize_reviews
full = pd.concat([x_train,x_validation,x_test])
full_tagged = list(labelize_reviews(full,'all'))

Wall time: 2min 29s


In [8]:
%%time
cores = multiprocessing.cpu_count() #12

init_kwargs = dict(
    vector_size=150, epochs=10, min_count=2,
    sample=0, workers=cores, negative=5, hs=0,
    alpha=0.05, min_alpha=0.0001, window=5
)
# The learning rate, alpha, decreases linearly from the initial rate to the minimum rate over the course of training. I will use alpha = 0.05 and min_alpha = 0.0001.
#plain DBOW
model_dbow = Doc2Vec(dm=0, **init_kwargs)

model_dbow.build_vocab(full_tagged)

Wall time: 13min 36s
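
To make the learning-rate settings concrete, here is a minimal sketch of a linear decay from `alpha=0.05` to `min_alpha=0.0001` over 10 epochs. It only illustrates the schedule at epoch boundaries; `linear_alpha` is a hypothetical helper, and Gensim updates the rate more finely inside each epoch.

```python
def linear_alpha(progress, alpha=0.05, min_alpha=0.0001):
    """Learning rate once a fraction `progress` (0.0 to 1.0) of training is done."""
    return max(min_alpha, alpha - (alpha - min_alpha) * progress)

epochs = 10
for epoch in range(epochs + 1):
    print(f"after epoch {epoch:2d}: alpha = {linear_alpha(epoch / epochs):.5f}")
```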


In [9]:
%%time
model_dbow.train(full_tagged, total_examples=len(full_tagged), epochs=model_dbow.epochs)

Wall time: 1h 29min 23s


In [10]:
model_dbow.save("dbow/dbow.model")

In [8]:
%%time
cores = multiprocessing.cpu_count() #12

dmm_kwargs = dict(
    vector_size=200, epochs=10, min_count=2,
    sample=0, workers=cores, negative=5, hs=0,
    alpha=0.05, min_alpha=0.0001, window=5
)

dmc_kwargs = dict(
    vector_size=200, epochs=10, min_count=2,
    sample=0, workers=cores, negative=5, hs=0,
    alpha=0.05, min_alpha=0.0001, window=3
)
#Distributed Memory (mean)
model_dmm = Doc2Vec(dm=1, dm_mean=1, **dmm_kwargs)
    
# Distributed Memory(Concatenation)
model_dmc = Doc2Vec(dm=1, dm_concat=1, **dmc_kwargs)

model_dmm.build_vocab(full_tagged)
model_dmc.build_vocab(full_tagged)

Wall time: 26min 13s


In [16]:
%%time
model_dmm.train(full_tagged, total_examples=len(full_tagged), epochs=model_dmm.epochs)

Wall time: 2h 30min 7s


In [19]:
model_dmm.save("dmm/dmm.model")

In [17]:
%%time
model_dmc.train(full_tagged, total_examples=len(full_tagged), epochs=model_dmc.epochs)

Wall time: 1h 41min 21s


In [20]:
model_dmc.save("dmc/dmc.model")

# Sentiment Classification with DBOW, DMM (Mean), DMC (Concatenation)

Given a document, our Doc2Vec models output a vector representation of the document. How useful is a particular model? In the case of sentiment classification, we want the output vector to reflect the sentiment of the input document. So, in vector space, positive documents should be distant from negative documents.



In [3]:
%%time
model_dbow = Doc2Vec.load("dbow/dbow.model")
model_dmm = Doc2Vec.load("dmm/dmm.model")
model_dmc = Doc2Vec.load("dmc/dmc.model")

Wall time: 1min 22s


# DBOW Unigram

In [25]:
%%time
train_vecs_dbow = get_learned_vectors(model_dbow, x_train)
validation_vecs_dbow = get_learned_vectors(model_dbow, x_validation)

clf = LogisticRegression(solver="liblinear")
clf.fit(train_vecs_dbow, y_train)

y_pred = clf.predict_proba(validation_vecs_dbow)

logloss_dbow = log_loss(y_validation, y_pred)
acc= clf.score(validation_vecs_dbow, y_validation)
print("Validation Logloss:", logloss_dbow, "\nValidation Accuracy:", acc)

Validation Logloss: 0.28415165036804385 
Validation Accuracy: 0.881981203413633
Wall time: 1min 19s
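
`get_learned_vectors` also comes from the local `utils` module and is not shown here. A minimal sketch of what such a helper might do, assuming the documents were tagged as `all_<index>`: look up the trained vector for each review index and return the vectors in order. A plain dict stands in for the trained model's doc-vector store (`model.dv` in Gensim 4.x, `model.docvecs` in 3.x).

```python
def get_learned_vectors_sketch(doc_vectors, indices, prefix="all"):
    """Collect the trained vector for each document index, preserving order."""
    return [doc_vectors[f"{prefix}_{i}"] for i in indices]

# Toy doc-vector store, invented for illustration.
toy_doc_vectors = {
    "all_5": [0.1, -0.3],
    "all_7": [0.4, 0.2],
    "all_9": [-0.2, 0.0],
}

# The indices would be x_train.index or x_validation.index in this notebook.
features = get_learned_vectors_sketch(toy_doc_vectors, [7, 5])
print(features)
```

Because the vectors are looked up rather than inferred, the classifier features for the validation reviews come from the same training run as those for the training reviews.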


# DMM Unigram

In [22]:
%%time
train_vecs_dmm = get_learned_vectors(model_dmm, x_train)
validation_vecs_dmm = get_learned_vectors(model_dmm, x_validation)

clf = LogisticRegression(solver="liblinear")
clf.fit(train_vecs_dmm, y_train)

y_pred = clf.predict_proba(validation_vecs_dmm)

logloss_dmm = log_loss(y_validation, y_pred)
acc_dmm = clf.score(validation_vecs_dmm, y_validation)
print("Validation Logloss:", logloss_dmm, "\nValidation Accuracy:", acc_dmm)

Validation Logloss: 0.2901750885047062 
Validation Accuracy: 0.8821756508588096
Wall time: 4min 29s


# DMC Unigram

In [23]:
%%time
train_vecs_dmc = get_learned_vectors(model_dmc, x_train)
validation_vecs_dmc = get_learned_vectors(model_dmc, x_validation)

clf = LogisticRegression(solver="liblinear")
clf.fit(train_vecs_dmc, y_train)

y_pred = clf.predict_proba(validation_vecs_dmc)

logloss_dmc = log_loss(y_validation, y_pred)
acc_dmc = clf.score(validation_vecs_dmc, y_validation)
print("Validation Logloss:", logloss_dmc, "\nValidation Accuracy:", acc_dmc)

Validation Logloss: 0.6930191077491353 
Validation Accuracy: 0.5046343307767095
Wall time: 2min 28s
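
The DMC numbers are worth a second look: a log loss of about 0.6930 is essentially ln(2), which is what a classifier scores when it predicts a probability of 0.5 for every review. Together with an accuracy of about 50% on a balanced set, this suggests that these DMC vectors carry almost no sentiment signal for the classifier. The sketch below checks the ln(2) baseline with a small hand-written log loss (a simplified stand-in for `sklearn.metrics.log_loss`).

```python
import math

def binary_log_loss(y_true, p_positive):
    """Mean negative log-likelihood of the true labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(y_true)

labels = [0, 1, 0, 1]
uninformative = binary_log_loss(labels, [0.5, 0.5, 0.5, 0.5])
print(f"log loss of always predicting 0.5: {uninformative:.6f}")
print(f"ln(2):                             {math.log(2):.6f}")
```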


#### Le and Mikolov note that combining a paragraph vector from Distributed Bag of Words (DBOW) with one from Distributed Memory (DM) improves performance. We will follow their approach and pair the models together for evaluation: we'll concatenate the paragraph vectors obtained from each model using the `ConcatenatedDoc2Vec` class from gensim's test utilities.

In [3]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
dbow_dmm = ConcatenatedDoc2Vec([model_dbow, model_dmm])
dbow_dmc = ConcatenatedDoc2Vec([model_dbow, model_dmc])
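
Conceptually, the combined model returns, for a given document tag, the vectors of the individual models joined end to end. The sketch below shows that idea with plain dicts standing in for the trained models; `concatenate_doc_vectors` is a hypothetical helper and the values are invented.

```python
def concatenate_doc_vectors(lookups, tag):
    """Join the vectors that each model-like lookup holds for the same document tag."""
    combined = []
    for lookup in lookups:
        combined.extend(lookup[tag])
    return combined

# Toy lookups standing in for two trained models (values invented for illustration).
toy_dbow = {"all_0": [0.1, 0.2, 0.3]}
toy_dmm = {"all_0": [0.4, 0.5, 0.6, 0.7]}

vector = concatenate_doc_vectors([toy_dbow, toy_dmm], "all_0")
print(len(vector), vector)
```

With the settings used in this notebook, pairing the 150-dimensional DBOW vectors with 200-dimensional DM vectors would give the classifier 350 features per review.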

## DBOW + DMM

In [33]:
train_vecs_dbow_dmm = get_learned_vectors(dbow_dmm,x_train)
validation_vecs_dbow_dmm = get_learned_vectors(dbow_dmm, x_validation)

clf = LogisticRegression(solver="liblinear")
clf.fit(train_vecs_dbow_dmm,y_train)

y_pred = clf.predict_proba(validation_vecs_dbow_dmm)
logloss_dbowdmm = log_loss(y_validation,y_pred)
acc_dbowdmm = clf.score(validation_vecs_dbow_dmm, y_validation)
print("Validation Logloss:", logloss_dbowdmm, "\nValidation Accuracy:", acc_dbowdmm)

Validation Logloss: 0.21934084701063036 
Validation Accuracy: 0.9125418602138922


## DBOW + DMC

In [35]:
train_vecs_dbow_dmc = get_learned_vectors(dbow_dmc,x_train)
validation_vecs_dbow_dmc = get_learned_vectors(dbow_dmc, x_validation)

clf = LogisticRegression(solver="liblinear")
clf.fit(train_vecs_dbow_dmc,y_train)

y_pred = clf.predict_proba(validation_vecs_dbow_dmc)
logloss_dbowdmc = log_loss(y_validation,y_pred)
acc_dbowdmc = clf.score(validation_vecs_dbow_dmc, y_validation)
print("Validation Logloss:", logloss_dbowdmc, "\nValidation Accuracy:", acc_dbowdmc)

Validation Logloss: 0.28430528204900746 
Validation Accuracy: 0.881916387598574


#### From the results above, we see that concatenating the DBOW and DMM vectors increased our validation accuracy by ~3 percentage points, to 91.25%. The best validation accuracy I got from a single model is the DMM model at 88.22%, but the result is very similar to the DBOW model at 88.20%. Pairing DBOW with DMC (88.19%) gave no improvement over DBOW alone.


Now that we have trained several models, let's take a look at the awesomeness of a trained word embedding! An incredible property of embeddings is that they support analogies: we can add and subtract word embeddings and arrive at interesting results. For example, we can check whether the model has learnt the relationship between dinner, lunch and breakfast by seeing whether dinner - afternoon lands near lunch. Since all the models were trained on the Yelp dataset, which consists mostly of food reviews, let's look at some syntactic/semantic NLP word tasks with the trained vectors.

In [39]:
model_dmm.wv.most_similar(positive=['dinner'], negative=['afternoon'])[0:2]

[('lunch', 0.48653391003608704), ('breakfast', 0.48273783922195435)]
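
Roughly speaking, a query like this builds a signed average of the unit-length input vectors (plus for `positive`, minus for `negative`) and ranks all other words by cosine similarity to it. The following is a simplified sketch of that idea in plain Python, not Gensim's implementation; the 2-dimensional toy vectors are invented so the arithmetic is easy to follow.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def most_similar_sketch(vectors, positive, negative, topn=2):
    """Rank words by similarity to the signed mean of the query words."""
    signed = [(word, 1.0) for word in positive] + [(word, -1.0) for word in negative]
    query = [0.0] * len(vectors[signed[0][0]])
    for word, sign in signed:
        for i, a in enumerate(unit(vectors[word])):
            query[i] += sign * a / len(signed)
    # Score every word that was not part of the query, best match first.
    scores = [(word, cosine(query, vec)) for word, vec in vectors.items()
              if word not in positive and word not in negative]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors: axis 0 ~ "is a meal", axis 1 ~ "later in the day" (invented).
toy = {
    "dinner": [1.0, 1.0],
    "afternoon": [0.0, 1.0],
    "lunch": [1.0, 0.0],
    "breakfast": [1.0, 0.3],
    "parking": [-0.2, 0.8],
}

result = most_similar_sketch(toy, positive=["dinner"], negative=["afternoon"])
print(result)
```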

In [40]:
#it's interesting to see that the model is able to recognize the different type of
model_dmm.wv.most_similar('pasta')

[('fettuccine', 0.7643482685089111),
 ('gnocchi', 0.7500354647636414),
 ('risotto', 0.7493999004364014),
 ('spaghetti', 0.7333984375),
 ('linguine', 0.7279767990112305),
 ('linguini', 0.7070115804672241),
 ('penne', 0.7034958600997925),
 ('lasagna', 0.686015248298645),
 ('fettuccini', 0.6845107078552246),
 ('ravioli', 0.6838340759277344)]

In [41]:
# Pick the odd one out!
model_dmm.wv.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'

**Result Summary for different variations of Doc2Vec on Validation set**

| Models     | Unigram accuracy |
|------------|------------------|
| DBOW       |  88.20%          |
| DMM        |  88.22%          |
| DMC        |  50.46%          |
| DBOW + DMM |  91.25%          |
| DBOW + DMC |  88.19%          |

In summary, we see that the combined DBOW + DMM model performed the best and, as Le and Mikolov mentioned, combining two models can indeed improve performance. The DMC model on its own was no better than chance, and pairing it with DBOW did not improve on DBOW alone. In the next notebook, `bigram_Doc2Vec.ipynb`, we will explore the concept of phrase modelling, which is essentially the detection of common phrases. More information on this can be found [here](https://radimrehurek.com/gensim/models/phrases.html).