<table align="left">
<tr>

<th style="background-color:white">
<img src="https://github.com/mlgill/ODSC_East_2017_PythonNLP/blob/master/assets/logo.png?raw=true" width="140" height="100">
</th>

<th style="background-color:white">
<div align="left">
<h1>Learning from Text: <br> Introduction to Natural Language Processing with Python</h1>  
<h2>Michelle L. Gill, Ph.D.</h2>     
Senior Data Scientist, Metis  
ODSC East  
May 3, 2017 
</div>
</th>

</tr>
</table>  

## Word2Vec Walkthrough and Exercise Answers

Begin by loading Google's pre-trained Word2Vec model.

In [1]:
import nltk
import gensim
from accessory_functions import google_vec_file, nltk_path

# Setup nltk corpora path
nltk.data.path.insert(0, nltk_path)

In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format(google_vec_file, binary=True)

Google's model contains an extensive vocabulary.

In [3]:
type(model.vocab)

dict

In [4]:
vocab_list = model.vocab.keys()
len(vocab_list)

3000000

## Vocabulary Features

Each word is represented by a vector of 300 features.

In [5]:
len(model.word_vec('cat'))

300

In [6]:
model.word_vec('cat')[:20]

array([ 0.0123291 ,  0.20410156, -0.28515625,  0.21679688,  0.11816406,
        0.08300781,  0.04980469, -0.00952148,  0.22070312, -0.12597656,
        0.08056641, -0.5859375 , -0.00445557, -0.296875  , -0.01312256,
       -0.08349609,  0.05053711,  0.15136719, -0.44921875, -0.0135498 ], dtype=float32)

The cosine similarity between two word vectors can be computed, and the values follow intuitive trends: related words score higher than unrelated ones.

In [7]:
print(model.similarity('cat', 'cat'))
print(model.similarity('cat', 'dog'))
print(model.similarity('cat', 'car'))

1.0
0.760945708978
0.215281850364


In [8]:
print(model.similarity('car', 'truck'))
print(model.similarity('car', 'drive'))

0.673579016963
0.313944691624
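
As a minimal sketch of what a cosine similarity measures, the snippet below computes it directly with NumPy. The three-dimensional vectors and the helper name `cosine_similarity` are made up for illustration; they are not taken from the Word2Vec model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional vectors, chosen only for illustration
toy_cat = [1.0, 2.0, 0.0]
toy_dog = [2.0, 3.0, 1.0]
toy_car = [0.0, -1.0, 3.0]

print(cosine_similarity(toy_cat, toy_cat))  # a vector compared with itself
print(cosine_similarity(toy_cat, toy_dog))  # vectors pointing in similar directions
print(cosine_similarity(toy_cat, toy_car))  # vectors pointing in different directions
```

The score depends only on the direction of the vectors, not on their length, which is why a word compared with itself always scores 1 (up to floating-point rounding).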


Word2Vec captures some interesting similarities between words, such as the relationship between **man --> king** and **woman --> queen**.

In [9]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431607246399)]
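
The analogy query can be pictured as vector arithmetic followed by a similarity search. The sketch below is an assumption-laden illustration of that idea, not gensim's implementation: the two-dimensional vectors, the `toy_analogy` helper, and the axis meanings are all hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def toy_analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to the mean of +positive and -negative."""
    # Combine unit-length input vectors: add positives, subtract negatives
    parts = [unit(vectors[w]) for w in positive]
    parts += [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(parts, axis=0))

    # Score every word that was not part of the query, best match first
    exclude = set(positive) | set(negative)
    scores = [(w, float(np.dot(unit(v), query)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-d vectors: axis 0 loosely "royalty", axis 1 loosely "gender"
toy_vectors = {
    'man':   [0.1,  1.0],
    'woman': [0.1, -1.0],
    'king':  [1.0,  1.0],
    'queen': [1.0, -1.0],
    'apple': [-1.0, 0.1],
}

result = toy_analogy(toy_vectors, positive=['woman', 'king'], negative=['man'])
print(result)
```

With these toy vectors, `queen` ranks first because subtracting `man` from `king` and adding `woman` flips the "gender" axis while keeping the "royalty" axis.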

It can also pick out the word that doesn't belong in a group.

In [10]:
model.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'
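
One way to find the odd word out is to compare each word with the average direction of the group and return the least similar one. The following is a sketch of that approach under assumed toy vectors; `toy_doesnt_match` is a hypothetical helper, not the gensim method.

```python
import numpy as np

def toy_doesnt_match(vectors, words):
    """Return the word least similar to the mean direction of the group."""
    # Normalize each word vector to unit length
    unit_vecs = np.array([np.asarray(vectors[w], dtype=float) for w in words])
    unit_vecs /= np.linalg.norm(unit_vecs, axis=1, keepdims=True)

    # Unit-length mean direction of the whole group
    center = unit_vecs.mean(axis=0)
    center /= np.linalg.norm(center)

    # Cosine similarity of each word to the center; the lowest is the outlier
    sims = unit_vecs @ center
    return words[int(np.argmin(sims))]

# Hypothetical 2-d vectors, chosen only for illustration
toy_vectors = {
    'breakfast': [1.0, 0.1],
    'lunch':     [0.9, 0.2],
    'dinner':    [1.0, 0.3],
    'cereal':    [0.3, 1.0],
}

odd = toy_doesnt_match(toy_vectors, ['breakfast', 'cereal', 'dinner', 'lunch'])
print(odd)
```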

## Word2Vec in Models

Let's load the spam/ham classification data set, split it into train/test sets, and add Word2Vec features.

In [11]:
import pandas as pd
from accessory_functions import preprocess_series_text, nltk_path

data = pd.read_csv('../data/spam.csv', sep='\t')
data['text'] = preprocess_series_text(data.text, nltk_path=nltk_path)

data.head()

|   | label | text |
|---|-------|------|
| 0 | ham | early bird purchase yet |
| 1 | spam | hi mandy sullivan call hotmix fm choose receiv... |
| 2 | ham | heart empty without love mind empty without wi... |
| 3 | ham | yes start send request make pain come back bac... |
| 4 | ham | see swing bit get thing take care firsg |


Split the data.

In [12]:
from sklearn.model_selection import train_test_split

X = data.text
y = data.label

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                test_size=0.3,
                                random_state=42)

Create a document-term matrix.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train)
X_test_cv  = cv.transform(X_test)
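
To make the fit/transform split concrete, here is a plain-Python sketch of a document-term matrix. It is a simplification and not how `CountVectorizer` is implemented (for example, it splits on whitespace only); the helper names and toy documents are hypothetical.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each distinct whitespace-separated token to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_count_rows(documents, vocabulary):
    """Build one row of token counts per document, one column per vocabulary token."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        # Tokens that are not in the vocabulary are ignored
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

toy_train = ['free prize call now', 'call me later']
toy_test = ['free call call tomorrow']

# The vocabulary is learned from the training documents only
toy_vocab = fit_vocabulary(toy_train)
train_rows = to_count_rows(toy_train, toy_vocab)
test_rows = to_count_rows(toy_test, toy_vocab)

print(list(toy_vocab))
print(train_rows)
print(test_rows)
```

Because the vocabulary comes from the training set alone, the unseen test token `tomorrow` gets no column, which mirrors calling `fit_transform` on the training data and only `transform` on the test data.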

Get the Word2Vec vector for each word in the document-term vocabulary that also appears in the Word2Vec model, and store the vectors in a dictionary for fast retrieval.

In [14]:
feature_list = cv.get_feature_names()
len(feature_list)

5482

In [15]:
# test membership against the vocab dict itself, which is a constant-time lookup
feature_dict = {x: model.word_vec(x) for x in feature_list
                if x in model.vocab}
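
Words without a Word2Vec entry are silently left out of the dictionary, so it can be worth checking how much of the vocabulary is covered. The sketch below uses hypothetical stand-ins for `feature_list` and `feature_dict`; the helper `embedding_coverage` is not part of the original notebook.

```python
def embedding_coverage(feature_list, feature_dict):
    """Return the covered fraction of the vocabulary and the uncovered words."""
    missing = [w for w in feature_list if w not in feature_dict]
    covered = 1.0 - len(missing) / len(feature_list)
    return covered, missing

# Hypothetical stand-ins for the real feature_list and feature_dict
toy_feature_list = ['call', 'free', 'firsg', 'prize']
toy_feature_dict = {'call': [0.1], 'free': [0.2], 'prize': [0.3]}

coverage, missing = embedding_coverage(toy_feature_list, toy_feature_dict)
print(coverage, missing)
```

Misspellings and heavily stemmed tokens are the likely candidates for the missing list, since they are less likely to appear in a pre-trained vocabulary.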

For each document, average the vectors of the words that have an embedding.

In [16]:
from nltk.tokenize import word_tokenize
import numpy as np

def embed_words(document, feature_dict):
    # split the document into words
    words = word_tokenize(document)

    # collect the vector embedding of each word that has one
    vector_list = list()
    for w in words:
        if w in feature_dict:
            vector_list.append(feature_dict[w])
    
    # return mean value of vector embeddings
    if len(vector_list) > 0:
        vector = np.mean(vector_list, axis=0)
    else:
        vector = np.zeros(300)
        
    return vector


# create vector embeddings
X_train_embed = X_train.apply(lambda x: embed_words(x, feature_dict))
X_test_embed  = X_test.apply(lambda x: embed_words(x, feature_dict))

# force into two-dimensional numpy array
X_train_embed = np.array([x for x in X_train_embed])
X_test_embed = np.array([x for x in X_test_embed])

print(X_train_embed.shape, X_test_embed.shape)

(3900, 300) (1672, 300)
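
A toy run helps show what averaging does to a document. This sketch re-implements the idea with whitespace splitting and two-dimensional vectors; `mean_embedding` and the toy dictionary are hypothetical and stand in for `embed_words` and `feature_dict`.

```python
import numpy as np

def mean_embedding(document, feature_dict, size):
    """Average the vectors of the known words in a document."""
    vectors = [feature_dict[w] for w in document.split() if w in feature_dict]
    if vectors:
        return np.mean(vectors, axis=0)
    # No known words: fall back to a zero vector
    return np.zeros(size)

# Hypothetical 2-d embeddings, chosen only for illustration
toy_feature_dict = {
    'free':  np.array([1.0, 0.0]),
    'prize': np.array([0.0, 1.0]),
}

vec_a = mean_embedding('free prize', toy_feature_dict, size=2)
vec_b = mean_embedding('prize free', toy_feature_dict, size=2)
vec_c = mean_embedding('free free prize', toy_feature_dict, size=2)
vec_d = mean_embedding('hello there', toy_feature_dict, size=2)
print(vec_a, vec_b, vec_c, vec_d)
```

Three properties follow from the averaging: word order is lost (`vec_a` equals `vec_b`), repeated words pull the average toward themselves (`vec_c`), and a document with no known words maps to the zero vector (`vec_d`).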


These embeddings can be used in a machine learning classifier alone or joined to a document-term matrix.

In [17]:
# combine the document-term matrix and the word2vec embeddings

X_train_comb = np.hstack((X_train_cv.toarray(), X_train_embed))
X_test_comb = np.hstack((X_test_cv.toarray(), X_test_embed))
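
Calling `.toarray()` turns the sparse document-term matrix into a dense one, which is fine at this size but can use a lot of memory with larger vocabularies. A possible alternative, sketched here with small hypothetical matrices, is to stack the two blocks with `scipy.sparse.hstack` and keep the result sparse.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for X_train_cv (sparse counts) and X_train_embed (dense)
toy_counts = sparse.csr_matrix(np.array([[1, 0, 2],
                                         [0, 0, 1]]))
toy_embed = np.array([[0.5, -0.5],
                      [0.1,  0.2]])

# Dense route, as in the cell above
dense_comb = np.hstack((toy_counts.toarray(), toy_embed))

# Sparse route: wrap the dense block and stack without densifying the counts
sparse_comb = sparse.hstack((toy_counts, sparse.csr_matrix(toy_embed)),
                            format='csr')

print(dense_comb.shape, sparse_comb.shape)
```

Both routes hold the same values; the sparse result can usually be passed to scikit-learn estimators that accept CSR input.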

## Question

* Using logistic regression, fit models using the three feature matrices: document-term alone, word2vec embeddings alone, and both combined.
* Compare the accuracy of each of these models.

First, fit the models.

In [18]:
from sklearn.linear_model import LogisticRegressionCV

lr = LogisticRegressionCV()

# document-term alone
lr.fit(X_train_cv, y_train)
y_pred_cv = lr.predict(X_test_cv)

# word2vec embeddings
lr.fit(X_train_embed, y_train)
y_pred_embed = lr.predict(X_test_embed)

# combined
lr.fit(X_train_comb, y_train)
y_pred_comb = lr.predict(X_test_comb)

Evaluate the accuracy of each model.

In [19]:
from sklearn.metrics import accuracy_score

# store each accuracy in a list
accuracy_list = list()

for lab, pred in zip(['document-term', 'word2vec', 'combined'],
                    [y_pred_cv, y_pred_embed, y_pred_comb]):
    accuracy_list.append((lab, accuracy_score(y_test, pred)))
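
For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels; `toy_accuracy` is a hypothetical helper, shown only to make the metric explicit.

```python
import numpy as np

def toy_accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels: the "model" here always predicts 'ham'
toy_true = ['ham', 'spam', 'ham', 'ham']
toy_pred = ['ham', 'ham', 'ham', 'ham']

print(toy_accuracy(toy_true, toy_pred))  # prints 0.75
```

Note that always predicting `ham` still scores 0.75 in this toy case, so when one class dominates, accuracy is best read alongside the class balance.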

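On this data set, the combined features give a small improvement in accuracy over the document-term matrix alone (0.983 versus 0.980), while the Word2Vec embeddings alone score lower (0.958).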

In [20]:
pd.DataFrame(accuracy_list,
             columns=['model', 'accuracy'])

|   | model | accuracy |
|---|-------|----------|
| 0 | document-term | 0.979665 |
| 1 | word2vec | 0.958134 |
| 2 | combined | 0.983254 |
