In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter 
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix, log_loss
from gensim import utils
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from tqdm import tqdm
import multiprocessing
from random import shuffle
import os
from path import general_path
from utils import labelize_reviews_bg

from sklearn.linear_model import LogisticRegression

## Paragraph Vector (Doc2Vec)

In this notebook, we'll explore the [Paragraph Vector](https://cs.stanford.edu/~quocle/paragraph_vector.pdf) algorithm, a.k.a. Doc2Vec, on ~3 million Yelp reviews. Doc2Vec extends word2vec to learn document embeddings: each document gets an additional floating, word-like vector that contributes to all training predictions and is updated like the other word-vectors. We will call it a doc-vector. Gensim’s Doc2Vec class implements this algorithm.

To recap, Word2Vec is a model from 2013 that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors in which vectors close together have similar meanings based on context, and vectors far apart have differing meanings.

There are two approaches within `doc2vec`: `dbow` and `dmpv`.

`dbow (Paragraph Vector - Distributed Bag of Words)` works in the same way as `skip-gram` in word2vec, except that the input is replaced by a special token representing the document (i.e. $v_{wI}$ is a vector representing the document). In this architecture, the order of words in the document is ignored; hence the name distributed bag of words. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)

`dmpv (Paragraph Vector - Distributed Memory)` works in a similar way to `cbow` in word2vec. For the input, dmpv introduces an additional document token alongside the context words. In the original formulation these vectors are concatenated rather than summed (i.e. $v_{wI}$ is a concatenated vector containing the document token and several context words). The objective is to predict the target word given the combined document and word vectors. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document's doc-vector. There are 2 DM models, specifically:
*  one which averages context vectors (dm_mean)
*  one which concatenates them (dm_concat, resulting in a much larger, slower, more data-hungry model)
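
The difference between the three variants is easiest to see in the size of the input that is fed to the prediction layer. The sketch below is only an illustration with made-up toy vectors; the function names `dbow_input`, `dm_mean_input` and `dm_concat_input` are hypothetical and are not part of Gensim. In Gensim's `Doc2Vec` class the corresponding settings are `dm=0`, `dm=1, dm_mean=1` and `dm=1, dm_concat=1`.

```python
# Toy vectors; the numbers are arbitrary and only illustrate shapes.
doc_vec = [0.1, 0.2, 0.3]            # one doc-vector (dimension 3)
context_vecs = [[1.0, 0.0, 0.0],     # word-vectors of a 4-word context window
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 1.0]]

def dbow_input(doc_vec, context_vecs):
    """PV-DBOW: only the doc-vector is used; context words are ignored."""
    return list(doc_vec)

def dm_mean_input(doc_vec, context_vecs):
    """PV-DM (mean): average the doc-vector with the context word-vectors."""
    rows = [doc_vec] + context_vecs
    return [sum(col) / len(rows) for col in zip(*rows)]

def dm_concat_input(doc_vec, context_vecs):
    """PV-DM (concat): join the doc-vector and all context word-vectors end to end."""
    out = list(doc_vec)
    for vec in context_vecs:
        out.extend(vec)
    return out

print(len(dbow_input(doc_vec, context_vecs)))       # 3
print(len(dm_mean_input(doc_vec, context_vecs)))    # 3
print(len(dm_concat_input(doc_vec, context_vecs)))  # 15
```

The concatenated input grows with the window size (here 3 + 4 × 3 = 15 values), which is why the concatenating model is larger and slower than the averaging one.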


In [2]:
df = pd.read_csv('allcat_clean_reviews.csv',index_col=0)
df.head()

Unnamed: 0,reviews,target
0,the rooms are big but the hotel is not good as...,0
1,second time with ocp saturday night pm not bus...,0
2,food is still great since they remodeled but t...,0
3,dirty location and very high prices but they d...,0
4,so first the off stood outside for mins to try...,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3085663 entries, 0 to 3086007
Data columns (total 2 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   reviews  object
 1   target   int64 
dtypes: int64(1), object(1)
memory usage: 70.6+ MB


In [3]:
SEED = 1000
x = df.reviews
y = df.target

# Define the training, validation and test sets (94% / 3% / 3% of the data)
x_train, x_validation_test, y_train, y_validation_test = train_test_split(x, y, test_size=.06, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_test, y_validation_test, test_size=.5, random_state=SEED)

In [4]:

# Report the size and class balance of each split
print('The Training set has {0} reviews with {1:.2f}% negative, {2:.2f}% positive reviews'.format(
    len(x_train), (y_train == 0).mean() * 100, (y_train == 1).mean() * 100))

print('The Validation set has {0} entries with {1:.2f}% negative, {2:.2f}% positive reviews'.format(
    len(x_validation), (y_validation == 0).mean() * 100, (y_validation == 1).mean() * 100))

print('The test set has a total of {0} reviews with {1:.2f}% negative, {2:.2f}% positive reviews'.format(
    len(x_test), (y_test == 0).mean() * 100, (y_test == 1).mean() * 100))

The Training set has 2900523 reviews with 50.00% negative, 50.00% positive reviews
The Validation set has 92570 entries with 50.06% negative, 49.94% positive reviews
The test set has a total of 92570 reviews with 49.94% negative, 50.06% positive reviews


In [5]:
%%time
from utils import labelize_reviews
full = pd.concat([x_train,x_validation,x_test])
full_tagged = list(labelize_reviews(full,"all"))

Wall time: 118 ms


# Phrase Modeling
Gensim can also detect phrases. Phrase detection is similar to extracting n-grams, but instead of collecting every n-gram with a sliding window, it detects frequently used phrases and joins them into single tokens. A pair of adjacent tokens A B is accepted as a phrase when its score exceeds a threshold:

$$\frac{count(A\ B) - count_{min}}{count(A) \times count(B)} \times N \gt \text{threshold}$$

where:

count(A) is the number of times token A appears in the corpus <br/>
count(B) is the number of times token B appears in the corpus <br/>
count(A B) is the number of times the tokens A B appear in the corpus in order <br/>
N is the total size of the corpus vocabulary <br/>
$count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times <br/>
threshold is a user-defined parameter that controls how strong a relationship between two tokens the model requires before accepting them as a phrase (the default threshold in Gensim's `Phrases` class is 10.0)
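
The scoring rule can be sketched in plain Python as follows. This is a minimal illustration rather than Gensim's implementation: the function name `phrase_scores`, the toy corpus and the lowered threshold of 1.0 are assumptions made for the example, and N is taken to be the number of distinct tokens in the corpus.

```python
from collections import Counter

def phrase_scores(sentences, min_count=1, threshold=10.0):
    """Return {(A, B): score} for adjacent token pairs whose score exceeds threshold."""
    unigram = Counter()
    bigram = Counter()
    for tokens in sentences:
        unigram.update(tokens)                   # count(A), count(B)
        bigram.update(zip(tokens, tokens[1:]))   # count(A B), in order
    vocab_size = len(unigram)                    # N (assumed: distinct tokens)

    accepted = {}
    for (a, b), count_ab in bigram.items():
        score = (count_ab - min_count) / (unigram[a] * unigram[b]) * vocab_size
        if score > threshold:
            accepted[(a, b)] = score
    return accepted

# Toy corpus (made up); a real corpus would be far larger.
corpus = [
    "i love ice cream".split(),
    "ice cream is great".split(),
    "i love new york".split(),
    "ice cream in new york".split(),
]

# The threshold is lowered because scores on a tiny corpus are small.
accepted = phrase_scores(corpus, min_count=1, threshold=1.0)
for pair, score in sorted(accepted.items()):
    print(pair, round(score, 2))
```

Pairs that occur only once score zero here because subtracting $count_{min}$ cancels their count, which is how the minimum-count parameter filters out rare pairs.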

Once our phrase model has been trained on our corpus, we can apply it to new text. When the model encounters two adjacent tokens in the new text that it identifies as a phrase, it merges them into a single new token.

Phrase modelling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so *new york* becomes *new_york*). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as *happy hour*), to become phrases in the model.
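
Applying a trained phrase model to new text amounts to scanning the tokens from left to right and joining every accepted pair. The sketch below is a simplified stand-in for that step, not Gensim's code: `merge_phrases` is a hypothetical helper, and the set of accepted phrases is hard-coded for illustration instead of being learned.

```python
def merge_phrases(tokens, phrases, delimiter="_"):
    """Join adjacent token pairs found in `phrases` into single tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2          # skip both tokens of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hard-coded phrases (assumed for the example).
known_phrases = {("happy", "hour"), ("new", "york")}
print(merge_phrases("happy hour in new york".split(), known_phrases))
```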

In [12]:
from gensim.models.phrases import Phrases, Phraser

In [13]:
tokens_trained = (tokens.split() for tokens in x_train)

In [14]:
%%time
bigram_model_path = os.path.join(general_path, 'YELP', 'model')
phrases_bigram = Phrases(tokens_trained, min_count=1)

# Turn the finished Phrases model into a "Phraser" object that is optimized for speed and memory use
bigram_phrases = Phraser(phrases_bigram)
#bigram_phrases.save(bigram_model_path)

Wall time: 10min 55s


In the example below, we can see that the model has learned that *ice cream* is a frequently used term and has joined the two tokens into a single word.

In [15]:
ex = [u'i', u'love', u'ice', u'cream']
bigram_phrases[ex]

['i', 'love', 'ice_cream']

Now, we concatenate the training, validation and test sets and label each review with a unique ID using Gensim's `TaggedDocument` class. We can train the document representations on all three sets because doc2vec training is completely unsupervised: it never uses the sentiment labels, so there is no need to hold out any data.
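
The tagging helper used in this notebook lives in a local `utils` module that is not shown here. As a rough idea of what such a helper might do, the sketch below tags each review with a prefix and a running index. The name `tag_reviews` and its signature are hypothetical, and the fallback `namedtuple` is only there so the example runs without Gensim installed (it has the same `words` and `tags` fields as Gensim's `TaggedDocument`).

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, used only if Gensim is unavailable.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")

def tag_reviews(reviews, prefix, phraser=None):
    """Yield one TaggedDocument per review, tagged '<prefix>_<index>'.

    `phraser` is optional; if given, it is assumed to support `phraser[tokens]`
    the way a trained Gensim Phraser does.
    """
    for i, text in enumerate(reviews):
        tokens = text.split()
        if phraser is not None:
            tokens = phraser[tokens]
        yield TaggedDocument(words=tokens, tags=["{}_{}".format(prefix, i)])

docs = list(tag_reviews(["good food", "bad service"], "all"))
print(docs[0].words, docs[0].tags)
```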

In [17]:
%%time
full = pd.concat([x_train,x_validation,x_test])
full_tagged = list(labelize_reviews_bg(full,'all', bigram_phrases))