<table align="left">
<tr>

<th style="background-color:white">
<img src="https://github.com/mlgill/ODSC_East_2017_PythonNLP/blob/master/assets/logo.png?raw=true" width="140" height="100">
</th>

<th style="background-color:white">
<div align="left">
<h1>Learning from Text: <br> Introduction to Natural Language Processing with Python</h1>  
<h2>Michelle L. Gill, Ph.D.</h2>     
Senior Data Scientist, Metis  
ODSC East  
May 3, 2017 
</div>
</th>

</tr>
</table>  

## Word2Vec Walkthrough and Exercises

Begin by loading Google's pre-trained Word2Vec model.

In [1]:
import nltk
import gensim
from accessory_functions import google_vec_file, nltk_path

# Setup nltk corpora path
nltk.data.path.insert(0, nltk_path)

In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format(google_vec_file, binary=True)

Google's model contains an extensive vocabulary.

In [None]:
# note: gensim 4.0 and later replace model.vocab with model.key_to_index
type(model.vocab)

In [None]:
vocab_list = model.vocab.keys()
len(vocab_list)

## Vocabulary Features

Each word in the vocabulary is represented by a vector of 300 features.

In [None]:
len(model.word_vec('cat'))

In [None]:
model.word_vec('cat')[:20]

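The cosine similarity between two word vectors measures how closely related the words are, and the values follow intuitive trends: a word compared with itself scores 1, and related words score higher than unrelated ones.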

In [None]:
print(model.similarity('cat', 'cat'))
print(model.similarity('cat', 'dog'))
print(model.similarity('cat', 'car'))

In [None]:
print(model.similarity('car', 'truck'))
print(model.similarity('car', 'drive'))
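
To make the measure concrete, here is a minimal sketch of cosine similarity in NumPy. The helper `cosine_similarity` and the three-dimensional toy vectors are assumptions made up for illustration; they are not real Word2Vec values and this is not gensim's implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors invented for this example (hypothetical values)
toy_cat = np.array([1.0, 2.0, 0.0])
toy_dog = np.array([2.0, 3.0, 1.0])
toy_car = np.array([0.0, -1.0, 3.0])

print(cosine_similarity(toy_cat, toy_cat))  # a vector compared with itself
print(cosine_similarity(toy_cat, toy_dog))  # vectors pointing in similar directions
print(cosine_similarity(toy_cat, toy_car))  # vectors pointing in different directions
```

Because the measure depends only on the angle between the vectors, rescaling either vector leaves the result unchanged.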

Word2Vec captures some interesting relationships between words, such as the analogy **man --> king** and **woman --> queen**. The query below asks for the words closest to `king - man + woman`.

In [None]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)
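
The arithmetic behind such an analogy query can be sketched with plain NumPy. Everything here is an assumption for illustration: the two-dimensional `toy_vectors` are invented (axis 0 loosely "royalty", axis 1 loosely "gender"), and `analogy` is a hypothetical helper that approximates the idea rather than reproducing gensim's code.

```python
import numpy as np

# Toy 2-d embeddings, made up for illustration only
toy_vectors = {
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(unit(vectors[w]) for w in positive)
    target = target - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    # the query words themselves are excluded from the candidates
    exclude = set(positive) | set(negative)
    scores = [(w, float(np.dot(unit(v), target)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(analogy(['woman', 'king'], ['man'], toy_vectors, topn=2))
```

With these toy values, `queen` ranks first because it shares the "royalty" component of `king` and the "gender" component of `woman`.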

It can also pick out the word that doesn't belong in a group.

In [None]:
model.doesnt_match("breakfast cereal dinner lunch".split())
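
One way to find the odd word out is to compare each word with the average of the group and return the least similar one. The sketch below follows that idea; `odd_one_out` and the two-dimensional `toy_vectors` are hypothetical, and the approach is assumed to resemble what the library does rather than copied from it.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    unit_vectors = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    center = unit_vectors.mean(axis=0)
    center = center / np.linalg.norm(center)
    # cosine similarity of each word to the group center
    similarities = np.dot(unit_vectors, center)
    return words[int(np.argmin(similarities))]

# Toy 2-d embeddings, invented for this example
toy_vectors = {
    'breakfast': np.array([1.0, 0.1]),
    'lunch':     np.array([0.9, 0.2]),
    'dinner':    np.array([1.0, 0.3]),
    'cereal':    np.array([0.2, 1.0]),
}

print(odd_one_out("breakfast cereal dinner lunch".split(), toy_vectors))
```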

## Word2Vec in Models

Let's load the spam/ham classification data set, split it into train/test sets, and add Word2Vec features.

In [None]:
import pandas as pd
from accessory_functions import preprocess_series_text, nltk_path

data = pd.read_csv('../data/spam.csv', sep='\t')
data['text'] = preprocess_series_text(data.text, nltk_path=nltk_path)

data.head()

Split the data.

In [None]:
from sklearn.model_selection import train_test_split

X = data.text
y = data.label

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                test_size=0.3,
                                random_state=42)

Create a document-term matrix.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train)
X_test_cv  = cv.transform(X_test)
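
As a small illustration of what the document-term matrix holds, the sketch below fits a `CountVectorizer` on two made-up documents. The names `toy_docs`, `toy_cv`, `toy_dtm`, and `toy_columns` are assumptions for this example and are not part of the workshop data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only
toy_docs = ["free prize now", "see you now"]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index; sort the words by that index
toy_columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(toy_columns)
print(toy_dtm.toarray())
```

Each row counts how often each vocabulary word occurs in one document, so a word shared by both documents has a nonzero entry in both rows.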

Get the Word2Vec vector for each word in the document-term vocabulary that also appears in the Word2Vec vocabulary. Store the vectors in a dictionary for faster retrieval.

In [None]:
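# note: scikit-learn 1.0 and later provide get_feature_names_out();
# get_feature_names() was removed in version 1.2
feature_list = cv.get_feature_names()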
len(feature_list)

In [None]:
feature_dict = {x: model.word_vec(x) for x in feature_list
                if x in vocab_list}

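For each document, average the vectors of its words to get a single 300-feature embedding. Words without a vector are skipped, and a document with no known words gets a vector of zeros.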

In [None]:
from nltk.tokenize import word_tokenize
import numpy as np

def embed_words(document, feature_dict):
    """Return the mean Word2Vec vector of the words in a document."""
    # split the document into words
    words = word_tokenize(document)

    # collect the vector embeddings of all words that have one
    vector_list = [feature_dict[w] for w in words if w in feature_dict]

    # return mean value of vector embeddings, or zeros if no word has one
    if len(vector_list) > 0:
        vector = np.mean(vector_list, axis=0)
    else:
        vector = np.zeros(300)

    return vector


# create vector embeddings
X_train_embed = X_train.apply(lambda x: embed_words(x, feature_dict))
X_test_embed  = X_test.apply(lambda x: embed_words(x, feature_dict))

# force into two-dimensional numpy array
X_train_embed = np.array([x for x in X_train_embed])
X_test_embed = np.array([x for x in X_test_embed])

print(X_train_embed.shape, X_test_embed.shape)
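
The averaging step can be checked by hand on a tiny example. In this sketch the two-dimensional `toy_lookup` and the helper `mean_embedding` are hypothetical stand-ins, and plain `str.split` is assumed in place of `word_tokenize` so the example needs no NLTK data.

```python
import numpy as np

# Toy 2-d embeddings, invented for illustration
toy_lookup = {
    'free':  np.array([1.0, 0.0]),
    'prize': np.array([0.0, 1.0]),
}

def mean_embedding(tokens, lookup, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [lookup[t] for t in tokens if t in lookup]
    if known:
        return np.mean(known, axis=0)
    return np.zeros(dim)

# 'zzz' has no vector, so only 'free' and 'prize' contribute to the mean
print(mean_embedding("free prize zzz".split(), toy_lookup, dim=2))

# no token has a vector, so the result is all zeros
print(mean_embedding("zzz qqq".split(), toy_lookup, dim=2))
```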

These embeddings can be used as classifier features on their own or concatenated with the document-term matrix.

In [None]:
# combine the document-term matrix and the word2vec embeddings

X_train_comb = np.hstack((X_train_cv.toarray(), X_train_embed))
X_test_comb = np.hstack((X_test_cv.toarray(), X_test_embed))
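
Calling `.toarray()` turns the sparse document-term matrix into a dense array, which can use a lot of memory for a large vocabulary. A possible alternative, sketched here on toy stand-ins (`toy_counts` and `toy_embed` are assumptions, not the workshop matrices), is to keep the combined matrix sparse with `scipy.sparse.hstack`.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for a sparse document-term matrix and dense embeddings
toy_counts = sparse.csr_matrix(np.array([[1, 0, 2],
                                         [0, 1, 0]]))
toy_embed = np.array([[0.5, -0.5],
                      [0.25, 0.75]])

# convert the dense block to sparse, then join the columns side by side
toy_comb = sparse.hstack([toy_counts, sparse.csr_matrix(toy_embed)]).tocsr()

print(toy_comb.shape)
```

The result holds the same values as the dense `np.hstack` version, and scikit-learn estimators such as logistic regression accept sparse input.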

## Question

* Using logistic regression, fit a model on each of the three feature matrices: the document-term matrix alone, the Word2Vec embeddings alone, and both combined.
* Compare the accuracy of each of these models.