# Kaggle Bag of Words Meets Popcorn Part 2: Word Vectors

Following: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors

## Introducing Distributed Word Vectors

[Word2Vec](https://code.google.com/archive/p/word2vec/) is a neural network model that learns [distributed representations](http://www.cs.toronto.edu/~bonner/courses/2014s/csc321/lectures/lec5.pdf) of words. One big advantage of Word2Vec over similar models is that it trains very quickly.

Word2Vec does not need labels to create meaningful representations. Given enough data, it produces word vectors with useful properties: words with similar meanings appear in clusters, and the clusters are spaced so that some word relationships (e.g., analogies) can be reproduced with vector arithmetic (e.g., 'king - man + woman' lands close to 'queen').
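
To make the vector arithmetic concrete, here is a minimal sketch using hand-made two-dimensional toy vectors. The vectors, the `cosine` helper, and the `analogy` function are all illustrative assumptions, not anything learned by Word2Vec; `gensim` exposes a comparable query through `most_similar(positive=[...], negative=[...])`.

```python
import numpy as np

# Hand-made toy vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# king - man + woman
result = analogy(['king', 'woman'], ['man'], toy_vectors)
print(result)
```

With real embeddings the match is only approximate, which is why the nearest neighbour by cosine similarity is used rather than an exact equality test.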

Google's papers and [this presentation](https://docs.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit?pref=2&pli=1) are helpful for understanding Word2Vec. Researchers at Stanford have also [applied deep learning to sentiment analysis](http://nlp.stanford.edu/sentiment/), but their approach relies on sentence parsing and cannot be applied straightforwardly to paragraphs of arbitrary length.

## Using word2vec in Python

Word2Vec is implemented in the `gensim` package. `gensim` also needs `cython` installed to train in a reasonable amount of time (minutes rather than days!).

In [1]:
import gensim

## Preparing to Train a Model

Because Word2Vec does not need labels, we can now use the unlabeled training data as well as the labeled set.

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

In [3]:
train = pd.read_csv('./labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
test = pd.read_csv('./testData.tsv', header=0, delimiter='\t', quoting=3)
unlabeled_train = pd.read_csv('./unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3)

print('read %d labeled, %d test, and %d unlabeled' % (train['review'].size,
                                                      test['review'].size,
                                                      unlabeled_train['review'].size))

read 25000 labeled, 25000 test, and 50000 unlabeled


When training Word2Vec, it is generally better to leave stopwords in place, because the algorithm relies on the broader context of each sentence to establish word vectors. It may also be worth preserving numbers; the function below discards them.

In [4]:
# no need to recompute this set for every function call
stops = set(stopwords.words('english'))

def review_to_wordlist(review_text, remove_stopwords=False):
    """
    Convert one sentence into a list of lower-case words. HTML has already
    been removed at this point, before tokenizing the review into sentences.
    """
    # remove non-letters
    review_text = re.sub(r'[^a-zA-Z]', ' ', review_text)
    # convert to lower-case and split
    words = review_text.lower().split()
    # remove stopwords?
    if remove_stopwords:
        words = [w for w in words if w not in stops]
    # return a list of words
    return words
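
If numbers should be preserved, one option is a variant of the function above that also keeps digits. This is only a sketch: the name `review_to_wordlist_keep_digits` and the `keep_numbers` flag are hypothetical and not part of the original tutorial.

```python
import re

def review_to_wordlist_keep_digits(review_text, keep_numbers=True):
    """Hypothetical variant of review_to_wordlist that can keep digits."""
    # keep letters (and optionally digits); everything else becomes a space
    pattern = r'[^a-zA-Z0-9]' if keep_numbers else r'[^a-zA-Z]'
    cleaned = re.sub(pattern, ' ', review_text)
    # convert to lower-case and split on whitespace
    return cleaned.lower().split()

sample = 'I saw it 3 times in 1988!'
with_numbers = review_to_wordlist_keep_digits(sample)
without_numbers = review_to_wordlist_keep_digits(sample, keep_numbers=False)
print(with_numbers)
print(without_numbers)
```

Whether keeping digits helps is an empirical question; years and ratings such as '10' may carry signal in movie reviews, but they also enlarge the vocabulary.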

Word2Vec expects its input as single sentences, each one a list of words, so the input format is a list of lists. Splitting a paragraph into sentences is not straightforward, so we will use NLTK's `punkt` tokenizer for sentence splitting.
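
To make the target shape concrete before bringing in `punkt`, here is a sketch that builds a list of lists with a deliberately naive regex splitter. The `naive_sentences` helper is an illustrative stand-in, not the tokenizer used in this notebook; it would split incorrectly on abbreviations such as 'Mr.', which is the kind of case `punkt` is meant to handle.

```python
import re

def naive_sentences(paragraph):
    """Naive stand-in for punkt: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    # tokenize each sentence: letters only, lower-case, split on whitespace
    return [re.sub(r'[^a-zA-Z]', ' ', p).lower().split() for p in parts if p]

paragraph = 'Great movie. I loved it! Would you watch it again?'
shaped = naive_sentences(paragraph)
print(shaped)
```

Each inner list is one sentence and the outer list holds all sentences, which is the structure passed to Word2Vec.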

In [5]:
import nltk.data

# load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [6]:
def review_to_sentences(review, tokenizer, remove_stopwords=False, parser='lxml'):
    # remove html first
    review_text = BeautifulSoup(review, parser).get_text()
    # use tokenizer to split paragraph into sentences
    raw_sentences = tokenizer.tokenize(review_text.strip())
    # loop over sentences...
    sentences = []
    for raw_sentence in raw_sentences:
        # skip empty
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
            
    return sentences

In [7]:
review_to_sentences(train['review'][0], tokenizer)

[[u'with',
  u'all',
  u'this',
  u'stuff',
  u'going',
  u'down',
  u'at',
  u'the',
  u'moment',
  u'with',
  u'mj',
  u'i',
  u've',
  u'started',
  u'listening',
  u'to',
  u'his',
  u'music',
  u'watching',
  u'the',
  u'odd',
  u'documentary',
  u'here',
  u'and',
  u'there',
  u'watched',
  u'the',
  u'wiz',
  u'and',
  u'watched',
  u'moonwalker',
  u'again'],
 [u'maybe',
  u'i',
  u'just',
  u'want',
  u'to',
  u'get',
  u'a',
  u'certain',
  u'insight',
  u'into',
  u'this',
  u'guy',
  u'who',
  u'i',
  u'thought',
  u'was',
  u'really',
  u'cool',
  u'in',
  u'the',
  u'eighties',
  u'just',
  u'to',
  u'maybe',
  u'make',
  u'up',
  u'my',
  u'mind',
  u'whether',
  u'he',
  u'is',
  u'guilty',
  u'or',
  u'innocent'],
 [u'moonwalker',
  u'is',
  u'part',
  u'biography',
  u'part',
  u'feature',
  u'film',
  u'which',
  u'i',
  u'remember',
  u'going',
  u'to',
  u'see',
  u'at',
  u'the',
  u'cinema',
  u'when',
  u'it',
  u'was',
  u'originally',
  u'released'],
 ...]

Now we may apply this function to prepare our data for input to Word2Vec (it will take a few minutes):

In [8]:
# note we use `+=` to join the lists of sentences: `append()` would add each review's
# list of sentences as a single nested element instead of adding the sentences themselves

sentences = []
print('Parsing sentences for training set')
for review in train['review']:
    sentences += review_to_sentences(review, tokenizer)
    
print('Parsing sentences for unlabeled training set')
for review in unlabeled_train['review']:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences for training set
Parsing sentences for unlabeled training set
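
The difference between `+=` and `append()` for lists of lists can be seen on a small made-up example (the two toy "reviews" here are invented for illustration):

```python
review_a = [['great', 'movie'], ['loved', 'it']]   # two sentences from one review
review_b = [['awful', 'plot']]                     # one sentence from another review

# `+=` extends the list: the result is a flat list of sentences
extended = []
extended += review_a
extended += review_b

# `append()` nests: the result is a list of reviews, each a list of sentences
appended = []
appended.append(review_a)
appended.append(review_b)

print(len(extended))   # number of sentences
print(len(appended))   # number of reviews
print(extended[0])
print(appended[0])
```

Only the `extended` form matches the list-of-sentences input that Word2Vec expects.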


In [9]:
print(len(sentences))

797215


In [10]:
print(sentences[0])

[u'with', u'all', u'this', u'stuff', u'going', u'down', u'at', u'the', u'moment', u'with', u'mj', u'i', u've', u'started', u'listening', u'to', u'his', u'music', u'watching', u'the', u'odd', u'documentary', u'here', u'and', u'there', u'watched', u'the', u'wiz', u'and', u'watched', u'moonwalker', u'again']


## Training and Saving Your Model

Before choosing parameters, it is worth reading the Word2Vec [API documentation](http://radimrehurek.com/gensim/models/word2vec.html) in `gensim` and the [Google documentation](https://code.google.com/archive/p/word2vec/). The main choices are:

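* architecture: skip-gram or continuous bag of words (CBOW)
* training algorithm: hierarchical softmax or negative sampling
* downsampling of frequent words: Google recommends values between 0.00001 and 0.001
* word vector dimensionality: the number of features per word
* context / window size: how many surrounding words the training algorithm takes into account
* worker threads: the number of threads to run in parallel
* minimum word count: words that occur fewer times than this are dropped from the vocabulary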
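
For orientation, the choices above correspond roughly to the following keyword arguments of `gensim`'s `Word2Vec` constructor. This is a sketch: the values are illustrative rather than recommended or default settings, and the keyword names are those of the `gensim` releases contemporary with this notebook (gensim 4.0 renamed `size` to `vector_size`).

```python
# Illustrative values only; check the gensim documentation for your installed version.
word2vec_params = {
    'sg': 0,          # architecture: 0 = continuous bag of words, 1 = skip-gram
    'hs': 0,          # training algorithm: 1 = hierarchical softmax
    'negative': 5,    # training algorithm: > 0 = negative sampling with this many noise words
    'sample': 1e-3,   # downsampling threshold for frequent words
    'size': 300,      # word vector dimensionality (`vector_size` in gensim >= 4.0)
    'window': 10,     # context / window size
    'workers': 4,     # worker threads
    'min_count': 40,  # minimum word count
}

# The dictionary could then be unpacked into the constructor, e.g.
#   model = word2vec.Word2Vec(sentences, **word2vec_params)
for name in sorted(word2vec_params):
    print('%s = %r' % (name, word2vec_params[name]))
```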

Choosing parameters is subtle, but with parameters in hand, the model is easy to build:

In [11]:
import logging

In [12]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [13]:
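num_features = 300     # word vector dimensionality
min_word_count = 40    # minimum word count
num_workers = 4        # number of threads to run in parallel
context = 10           # context window size
downsampling = 1e-3    # downsampling threshold for frequent words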
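
To get a feel for what a downsampling threshold of `1e-3` means, the sketch below evaluates the discard probability given in the Word2Vec paper, `1 - sqrt(t / f)`, where `t` is the threshold and `f` is the word's relative frequency in the corpus. The function name is hypothetical, and actual implementations (including `gensim`) may use a slightly different expression, so treat the numbers as indicative only.

```python
import math

def discard_probability(word_frequency, threshold=1e-3):
    """Paper formula: P(discard) = 1 - sqrt(threshold / frequency), clipped at 0."""
    if word_frequency <= 0:
        raise ValueError('word_frequency must be positive')
    return max(0.0, 1.0 - math.sqrt(threshold / word_frequency))

common = discard_probability(0.05)    # a word making up 5% of all tokens
rare = discard_probability(0.0005)    # a word rarer than the threshold is never dropped
print(round(common, 4))
print(rare)
```

Under this formula, very frequent words such as 'the' are dropped from most of their occurrences, while words rarer than the threshold are always kept.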

In [14]:
from gensim.models import word2vec

In [15]:
# note: gensim 4.0 renamed `size` to `vector_size`; this call uses the older keyword
model = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

In [16]:
# if we don't plan any further training, keeping only the normalized vectors saves memory:
model.init_sims(replace=True)

In [17]:
# let's save the model with a meaningful name - we can load it later with `Word2Vec.load()`
model_name = '300features_40minwords_10context'
model.save(model_name)

## Exploring the Model Results

In [18]:
# `doesnt_match()` tries to deduce which word in a set is most dissimilar from the others
# (in recent gensim releases these query methods live on `model.wv`)
model.doesnt_match('man woman child kitchen'.split())

'kitchen'

In [19]:
model.doesnt_match('france england germany berlin'.split())

'berlin'

In [20]:
# with a relatively small training set, the model isn't perfect: 'austria' is the expected answer
model.doesnt_match('paris berlin london austria'.split())

'paris'
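
The idea behind an odd-one-out query can be sketched with toy vectors: normalize each word vector, average them, and return the word least similar to that average. This reflects my understanding of how `doesnt_match()` works rather than a copy of the `gensim` code; the `odd_one_out` function and the vectors are invented for illustration.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean of the group."""
    unit = np.vstack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    # cosine similarity of each word to the group mean
    similarities = unit.dot(mean)
    return words[int(np.argmin(similarities))]

# Hand-made toy vectors (NOT learned by Word2Vec)
toy_vectors = {
    'man':     np.array([1.0, 0.1]),
    'woman':   np.array([0.9, 0.2]),
    'child':   np.array([0.8, 0.3]),
    'kitchen': np.array([0.1, 1.0]),
}
odd = odd_one_out('man woman child kitchen'.split(), toy_vectors)
print(odd)
```

Because the answer depends entirely on where the vectors sit, a model trained on movie reviews can easily pick 'paris' over 'austria' when the city and country vectors are not well separated.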

In [21]:
model.most_similar('man')

[(u'woman', 0.6359847187995911),
 (u'guy', 0.5018903017044067),
 (u'person', 0.4842371344566345),
 (u'girl', 0.4584445357322693),
 (u'boy', 0.45680925250053406),
 (u'lady', 0.45362645387649536),
 (u'men', 0.4079422950744629),
 (u'he', 0.39524075388908386),
 (u'himself', 0.3905450105667114),
 (u'kid', 0.3840121030807495)]

In [22]:
model.most_similar('queen')

[(u'princess', 0.5241572260856628),
 (u'latifah', 0.4557710886001587),
 (u'bee', 0.454694926738739),
 (u'victoria', 0.4427083134651184),
 (u'prince', 0.4377889037132263),
 (u'king', 0.43018147349357605),
 (u'margaret', 0.38042858242988586),
 (u'stepmother', 0.37996935844421387),
 (u'selena', 0.3768102824687958),
 (u'maid', 0.3498295545578003)]

In [23]:
# more relevant for sentiment analysis: words similar to 'awful'
model.most_similar('awful')

[(u'terrible', 0.6530819535255432),
 (u'horrible', 0.6271060705184937),
 (u'dreadful', 0.5867845416069031),
 (u'atrocious', 0.5674465894699097),
 (u'abysmal', 0.5667630434036255),
 (u'laughable', 0.5436089634895325),
 (u'horrendous', 0.5300754308700562),
 (u'embarrassing', 0.5102269649505615),
 (u'appalling', 0.5048348903656006),
 (u'amateurish', 0.498172402381897)]