# Transforming Text into Features Using Word Embeddings

In this demo, you will see how to train a Word2Vec model to obtain word embeddings for spam message classification. We will train a logistic regression model using the embeddings to create our feature vectors. We will then use our model to predict whether a new email is spam.

A Word2Vec model should be trained on a large text corpus that is relevant to the domain in which you will later use a machine learning model to make predictions. For example, if you will be working with medical data, your Word2Vec model should not be trained on text data originating from product reviews, but rather should be trained on medical data.

You can train a Word2Vec model using an existing tool, such as Gensim, which is a Python package used for natural language processing. After the embeddings have been produced, you will use them to create feature vectors for your machine learning model. For example, we can represent each training example as a feature vector by taking the average of the Word2Vec embeddings of the words in that training example. Once we have created feature vectors for all of our training examples, we can fit our machine learning model to the training data and use our trained model to make predictions. 

This demo we will walk you through the following steps:

1. Load the spam message data set. The spam email dataset contains email subject lines.

2. Create labeled examples out of the data: We will have one text feature and one label. Each feature will contain one email subject line.

3. Preprocess the text features: We will preprocess the data by removing stop words, converting all text to lowercase, removing punctuation, etc. from the email subject lines. 

4. Create training and test datasets: Each example in the training and test datasets will now contain one preprocessed email subject line as a feature.

5. Train the Word2Vec model using the training dataset's features: we will train a Word2Vec model using Gensim. The model will figure out how to represent words as vectors based on how they are used in the spam email dataset. You will inspect the resulting word embeddings to develop a better understanding of what they can tell you about a particular word.

6. Create feature vectors out of the training and test data: After training the Word2Vec model, we will represent the features in every training and test example (recall, each example contains an email subject line) as a vector by taking the average of the Word2Vec embeddings of the words in the example. 

7. Train a logistic regression model using the feature vectors as input features, and evaluate the model's performance.

**<font color='red'>Note: Some of the code cells in this notebook may take a while to run.</font>**

### Import Packages

Before you get started, import a few packages. Run the code cell below. 

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib as plt
import seaborn as sns

We will also import the scikit-learn `LogisticRegression class`, the `train_test_split()` function for splitting the data into training and test sets, and the `roc_auc_score()` function to evaluate the model.


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

## Step 1: Build Your DataFrame and Define Your ML Problem

We will work with a dataset containing emails that are labeled as either spam or not spam.

#### Load a Data Set and Save it as a Pandas DataFrame

In [3]:
filename = os.path.join(os.getcwd(), "data", "spamDataset.csv")
df = pd.read_csv(filename, header=0)

df.head()

Unnamed: 0,email_text,spam
0,Subject: enron methanol ; meter # : 988291\r\n...,False
1,"Subject: hpl nom for january 9 , 2001\r\n( see...",False
2,"Subject: neon retreat\r\nho ho ho , we ' re ar...",False
3,"Subject: photoshop , windows , office . cheap ...",True
4,Subject: re : indian springs\r\nthis deal is t...,False


#### Define the Label

This is a binary classification problem in which we will predict whether an email is a spam or not spam. The label is the `spam` column.

#### Identify Features

We only have one feature. The feature is the `email_text` column.


## Step 2: Create Labeled Examples from the Data Set

Let's create labeled examples from our dataset. We will have one text feature and one label. 
The code cell below carries out the following steps:

* Gets the `spam` column from DataFrame `df` and assigns it to the variable `y`. This will be our label. Note that the label contains True or False values that indicate whether an email is spam or not spam.
* Gets the column `email_text` from DataFrame `df` and assigns it to the variable `X`. This will our feature. Note that the `email_text` feature contains the subject line of the email.


In [4]:
y = df['spam'] 
X = df['email_text'] #contains subject line of email

X.shape

(5170,)

In [5]:
X.head()

0    Subject: enron methanol ; meter # : 988291\r\n...
1    Subject: hpl nom for january 9 , 2001\r\n( see...
2    Subject: neon retreat\r\nho ho ho , we ' re ar...
3    Subject: photoshop , windows , office . cheap ...
4    Subject: re : indian springs\r\nthis deal is t...
Name: email_text, dtype: object

Let's take a look at an example of a spam email and an email that is not spam.

In [6]:
print('A spam email: \n\n', X[67])
print('A non-spam email: \n\n', X[135])

A spam email: 

 Subject: re : husband soup would be
as you know election time is not the best thing for the economy .
economy is in a very unstable condition , as you can see gas prices
are going up along with the m o rtgvage rat e s . once the
r a te goes up you will not have a chance to s av e money again
for a very long time .
it is your last chance . get r e f inanced at 4 . 2 point !
http : / / www . fintod . com /
- -
despoil , compote a amende
the me orbital irruption
gfawn a ax henrietta
a the in boatswain
out whither the accompanist lint macintosh

A non-spam email: 

 Subject: re : tuesday , december 26 th
i will be here tuesday , also .
mark mccoy
12 / 20 / 2000 09 : 04 am
to : michael olsen / na / enron @ enron , tom acton / corp / enron @ enron , clem
cernosek / hou / ect @ ect , robert cotten / hou / ect @ ect , jackie young / hou / ect @ ect ,
sabrae zajac / hou / ect @ ect , carlos j rodriguez / hou / ect @ ect , mark
mccoy / corp / enron @ enron ,

## Step 3: Preprocess the Text

The next step is to preprocess the text. Preprocessing technqiues can include cleaning the data, converting all text to lowercase, removing special characters, removing stopwords from the text, tokenizing the text (split it into smaller chunks), and lemmatizing the text (converting a word to its root word).

You can perform preprocessing on your own or use tools to accomplish this. One common tool is NLTK [Natural Language Toolkit](https://www.nltk.org/). However, for this demo, we will use the built-in function from Gensim to preprocess the text. This function will remove some stop words, covert all text to lowercase, remove punctuation and tokenize the text.

Let's import the Gensim package.

In [7]:
import gensim

Before we preprocess our data, let's look at an example output of preprocessed text. Let's take a simple sentence and perform preprocessing. Run the code cell below and inspect the results.

In [8]:
sentence = "I went to the market to buy some apples for my pet horse." 

list(gensim.utils.simple_preprocess(sentence))

['went',
 'to',
 'the',
 'market',
 'to',
 'buy',
 'some',
 'apples',
 'for',
 'my',
 'pet',
 'horse']

Now let's perform preprocessing on our `email_text` feature and compare the difference between the original text and the preprocessed text

In [9]:
original_X = X
X = X.apply(lambda row: gensim.utils.simple_preprocess(row))

In [10]:
X.head()

0    [subject, enron, methanol, meter, this, is, fo...
1    [subject, hpl, nom, for, january, see, attache...
2    [subject, neon, retreat, ho, ho, ho, we, re, a...
3    [subject, photoshop, windows, office, cheap, m...
4    [subject, re, indian, springs, this, deal, is,...
Name: email_text, dtype: object

In [11]:
original_X.head()

0    Subject: enron methanol ; meter # : 988291\r\n...
1    Subject: hpl nom for january 9 , 2001\r\n( see...
2    Subject: neon retreat\r\nho ho ho , we ' re ar...
3    Subject: photoshop , windows , office . cheap ...
4    Subject: re : indian springs\r\nthis deal is t...
Name: email_text, dtype: object

## Step 4: Create Training and Test Data Sets

Let's take our preprocessed text data set and split the data into training and test sets with 80% of the data being the training set.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, random_state=1234)
#once again, larger training size than non-NLP tasks
X_train.head()

562     [subject, hpl, nom, for, sept, see, attached, ...
148     [subject, copanno, changes, forwarded, by, ami...
4831    [subject, re, maynard, oil, revised, nom, dare...
3385    [subject, deal, for, december, can, either, of...
2389    [subject, heisse, sx, action, hallo, mein, lie...
Name: email_text, dtype: object

## Step 5: Training the Word2Vec Model and Inspecting the Word Embeddings

Now that the data has been preprocessed and we have our training data, we will train a Word2Vec model using Gensim using the training data `X_train`. For more information on Gensim, consult the online [documentation](https://radimrehurek.com/gensim/models/word2vec.html). The model will produce word embeddings. That is, words represented as numerical vectors based on how these words are used in the email spam dataset.
    
We will specify the following parameters:

* size: dimension of each resulting vector that will represent each word.
* window: the number of words behind or ahead of a target word that will be used to provide context for that word
* min_count: the number of times a word must appear in our text document in order to create a word vector. The model will ignore words that do not satisfy the `min_count` specification, therefore ignoring wors that are not important.



In [13]:
print("Begin")
word2vec_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100,
                                   window=5,
                                   min_count=2)

print("End")

Begin
End


You can check the size of a particular model with the `len()` method. 

In [14]:
len(word2vec_model.wv.key_to_index)  # retrieve vocabulary and measure its size

18015

While 18,015 words may seem like a lot, this is actually quite small for most real-world applications. In practice, you can use models with larger vector sizes (say 300) and much larger vocabulary.

To check if a word is included the vocabulary, you can use the `key_to_index` attribute. 

In [15]:
'cornell' in word2vec_model.wv.key_to_index

False

In [16]:
'dog' in word2vec_model.wv.key_to_index

True

The word `'dog'` is in our vocabulary. Let's inspect its word vector. The code cell belows retrieves the vector for the word `'dog'`. 

In [17]:
word2vec_model.wv['dog']

array([-0.08800697,  0.2969751 ,  0.03915569, -0.1692794 ,  0.01895426,
       -0.19426493,  0.13264418,  0.44111824, -0.22261326, -0.18814257,
       -0.19595255, -0.34056878,  0.00196441, -0.0728687 ,  0.20262288,
       -0.31712624,  0.13925686, -0.20498897, -0.05148866, -0.3215945 ,
        0.0848152 ,  0.06962517,  0.1255184 , -0.2155873 , -0.01691423,
        0.07290611, -0.22520283, -0.11448094, -0.23458217, -0.00737267,
        0.1459848 , -0.03765708,  0.06731088, -0.27333966, -0.11757497,
        0.06647122, -0.07528554, -0.06889766, -0.16014363, -0.22595166,
        0.03138285, -0.04476285, -0.12286556,  0.06627252,  0.19952278,
       -0.16769838,  0.00510299, -0.19203351,  0.07441817,  0.2601467 ,
        0.0676915 , -0.08695829, -0.22357376, -0.12076445, -0.2661733 ,
        0.00076911, -0.00507512,  0.02597741, -0.20201504,  0.0400296 ,
        0.03207266, -0.00737297,  0.09984277, -0.00743724, -0.2046644 ,
        0.13766515,  0.19104104,  0.17909253, -0.27455446,  0.18

Examine the vector, noting each of the following important characteristics: 

1. It contains 100 values somewhere between -0.39333 and 0.4495
2. There are no zeros, so it can be considered a **dense vector** (a vector comprised of mostly non-zero values)
3. Each value represents a dimension, but is not necessarily interpretable by humans
4. Larger (in magnitude) values may relate to categories in which `'dog'` has a presence
5. Many broader categories are also likely to be represented by a combination of these dimensions
6. A few values are relatively close to zero. This implies that these dimensions do not carry "significant information" about `'dog'`


Let's find the most similar words for `'dog'`.


In [18]:
word2vec_model.wv.most_similar('dog')

[('falls', 0.9899334907531738),
 ('cosmetic', 0.9887095093727112),
 ('accredit', 0.988670289516449),
 ('necrosis', 0.9886582493782043),
 ('cheney', 0.9881411790847778),
 ('castanet', 0.987744152545929),
 ('debenture', 0.9877417683601379),
 ('barge', 0.9875988960266113),
 ('cannister', 0.9875513315200806),
 ('gestapo', 0.987534761428833)]

The code cell below outputs the first few 25 words that our Word2Vec model learned a vector for.

In [19]:
top25 = word2vec_model.wv.index_to_key[:25]
top25

['the',
 'to',
 'ect',
 'and',
 'for',
 'of',
 'you',
 'subject',
 'in',
 'on',
 'is',
 'this',
 'hou',
 'enron',
 'be',
 'that',
 'we',
 'from',
 'your',
 'will',
 'have',
 'with',
 'at',
 'are',
 'it']

You can go a step further and print vectors associated with each of these words. Use the `background_gradient` method to wrap these vectors into a DataFrame as rows with the corresponding words as row indices. Let's spice it up with colors.

In [20]:
pd.DataFrame({w:word2vec_model.wv[w] for w in top25}).T.style.background_gradient(cmap='coolwarm').set_precision(2)
#first step is dictionary, "w" is probably word, value for dict is list comprenehnsion replacing the vector with the word,
#and then calling T to transpose, and then background gradient to wrap vectors

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
the,-0.89,1.12,-0.39,0.66,-0.74,-2.39,0.6,0.87,-0.2,-1.41,0.86,-0.62,-0.46,1.57,0.1,1.03,-0.54,-1.43,0.42,-1.53,1.75,0.13,1.18,-0.06,-0.81,-0.72,-0.0,0.55,0.26,-0.39,-0.42,-0.31,0.33,-0.39,1.2,0.98,0.98,-0.21,-0.3,-0.43,-0.5,-0.76,-1.65,0.18,0.66,-0.16,-1.33,1.27,1.73,1.11,-0.2,1.0,0.09,1.66,1.16,0.63,1.59,-0.33,-1.08,1.47,-0.25,1.35,-1.57,-0.59,0.16,-0.29,-0.24,1.02,0.62,0.61,0.1,0.46,-0.18,0.64,0.37,-0.4,-0.26,-0.27,-0.09,-1.16,-0.5,0.36,0.53,0.22,1.08,0.59,0.44,0.12,0.41,1.87,1.4,0.25,-1.43,0.2,1.98,1.96,-0.57,-1.33,-0.09,0.31
to,-1.14,1.77,-0.14,2.25,-0.99,-1.19,1.61,1.23,0.65,-0.54,0.96,-0.15,-0.47,0.96,-0.62,0.97,1.83,0.26,0.06,-1.01,0.79,-0.47,-0.56,-1.39,0.42,0.12,-0.41,1.11,-0.72,1.39,2.06,0.43,0.97,-0.49,0.5,1.78,0.81,-0.11,0.31,-0.76,1.28,-1.54,-0.72,0.5,-0.26,0.66,-0.27,-0.51,1.32,-0.46,0.9,-1.11,-0.07,1.78,0.01,0.86,0.58,0.01,-0.57,0.19,0.23,0.19,0.31,-1.03,0.6,0.34,-0.42,0.01,-0.73,-0.02,0.51,-0.27,0.17,1.74,1.31,-1.86,0.21,-0.87,1.38,-0.34,-1.39,0.78,0.23,-0.89,0.95,-0.09,0.54,-1.2,0.21,1.43,2.92,1.9,-0.54,-0.49,0.42,0.11,-0.81,0.38,-0.25,0.4
ect,-0.63,0.35,-1.99,0.32,-0.71,2.44,0.14,2.39,-0.84,-0.86,-0.33,-1.04,-1.22,-1.54,-0.1,-1.17,1.08,3.77,0.57,-1.35,-2.08,0.35,0.65,-1.91,0.5,2.26,0.45,1.81,-2.47,1.85,-1.03,0.38,0.61,-0.27,-1.28,0.04,-2.0,0.63,-0.88,-0.91,-0.24,-0.19,-2.24,0.83,4.71,-3.24,1.39,-3.81,0.29,0.86,0.05,-0.78,-0.11,-1.31,-0.96,-1.43,-0.68,-2.78,-1.3,-1.78,0.4,-0.77,2.8,-0.02,-3.29,3.39,-0.12,2.09,-5.89,0.8,-1.23,3.34,2.6,0.86,0.92,3.54,2.86,-0.42,0.23,-0.67,-1.9,-1.74,-1.56,1.1,-0.12,-1.9,-0.18,-2.32,1.14,2.24,0.91,1.02,2.28,0.12,4.32,1.42,2.41,2.35,0.68,-2.49
and,-0.82,1.55,-0.41,0.66,-0.67,-1.37,0.54,1.29,0.04,-0.75,0.55,-0.77,-0.17,0.24,0.52,0.33,0.3,-1.44,-0.04,-0.79,0.72,0.21,0.79,-0.02,-0.62,0.12,-0.37,0.18,0.47,0.25,0.48,-0.05,0.09,-0.69,0.5,0.78,1.03,-0.43,-0.16,-0.02,0.48,-0.98,-0.74,0.65,0.64,0.19,-0.87,0.01,0.77,-0.05,0.59,-0.01,-0.12,0.48,-0.11,0.09,1.13,0.44,-0.23,0.3,0.35,0.33,-0.87,-0.42,0.68,0.05,0.22,0.49,-0.22,0.44,0.02,-0.62,-0.08,0.78,0.57,-0.71,-0.69,-0.33,0.16,-0.32,-0.13,0.46,0.29,-0.21,0.43,0.3,0.31,0.5,0.22,0.36,1.02,-0.21,-0.81,-0.49,1.48,0.39,0.18,-0.95,-0.55,-0.36
for,-1.73,0.93,-0.02,1.5,-0.83,-1.49,1.0,0.9,-0.36,-0.92,0.64,0.14,-0.67,0.48,-0.53,0.81,-0.18,-1.16,0.5,-0.88,1.2,-0.19,0.85,0.18,-0.46,0.4,-0.17,0.2,-0.62,-0.05,0.48,-0.04,0.03,-0.39,1.53,1.75,0.71,1.02,-0.06,-0.57,1.53,-1.38,-1.29,0.52,0.36,-0.56,-0.28,0.52,1.65,0.46,0.43,0.18,0.39,1.66,-0.01,1.14,0.25,-0.16,-1.2,0.16,0.12,0.34,-0.0,-0.97,-0.69,-0.24,-0.67,0.96,0.43,0.59,-0.39,0.29,-0.53,1.17,1.58,0.39,-0.85,-0.37,1.1,-0.66,-1.11,0.64,1.26,0.44,-0.98,-0.64,0.41,1.48,0.21,0.9,0.96,-0.25,-1.35,0.96,1.0,-0.2,-0.3,-0.07,-0.06,-0.17
of,-1.91,1.67,0.2,-0.63,-1.41,-1.88,0.66,1.0,-1.0,-1.39,-0.57,-1.06,-0.54,-0.03,0.57,-0.18,-1.48,-2.86,0.51,-0.58,0.66,0.24,1.83,1.11,-1.26,0.85,-0.96,-0.93,0.83,0.05,1.21,-1.84,-0.92,-0.53,0.45,1.41,1.39,0.57,0.68,-0.56,0.79,-1.69,-1.0,0.97,-0.14,0.32,-0.5,0.8,1.02,0.51,1.12,-0.07,0.49,1.59,-0.75,1.3,2.34,-0.15,-1.41,-0.25,-0.2,0.9,-0.76,-0.42,0.68,-0.48,-0.46,0.36,0.54,0.67,-0.48,-0.26,0.65,1.29,-0.01,0.72,-1.23,0.48,0.05,-0.02,0.53,0.11,1.84,-0.22,0.21,0.17,-0.73,2.18,0.07,0.95,1.61,0.46,-1.05,-0.03,1.8,1.28,0.0,-1.14,0.05,-0.26
you,1.09,-0.88,-0.82,0.86,-1.61,-1.63,-0.01,2.15,0.3,-0.05,1.29,0.39,0.68,2.61,-1.41,0.34,0.5,-0.83,-0.72,-1.7,2.5,0.22,0.13,1.67,-1.78,-0.44,0.35,-0.89,0.24,-0.37,-0.32,0.73,0.57,-0.83,0.07,3.03,-0.08,-1.27,-0.67,-2.66,1.44,-2.3,0.77,-0.18,0.69,1.87,-1.7,-0.01,0.76,-0.3,-0.33,-2.13,0.69,0.32,0.48,0.67,-0.52,-0.45,-1.78,-0.68,0.13,-0.43,0.56,0.13,0.31,1.66,-1.7,1.01,-1.11,0.17,0.8,0.06,-0.06,1.82,2.49,-1.13,0.09,-0.22,-0.03,0.79,-0.02,0.74,-1.16,-0.39,-0.57,-0.04,1.27,1.33,0.15,2.39,3.62,-0.4,-0.07,1.45,0.55,-0.02,-1.93,0.45,-2.1,0.89
subject,-0.11,-1.24,1.06,1.89,-0.14,1.85,1.05,-0.06,-1.24,-0.87,0.05,0.1,-0.62,-0.39,-0.7,2.41,0.66,2.08,-0.24,-1.98,0.23,1.3,0.22,-0.72,-0.65,2.19,-0.79,1.09,-1.84,1.9,-1.93,1.53,0.26,-1.19,0.22,1.55,-1.29,0.94,-1.17,0.1,1.51,0.01,-1.91,1.07,2.48,-2.62,-0.02,-0.85,0.46,0.1,-1.02,-1.04,0.94,-0.3,0.64,0.94,-1.56,0.13,0.27,0.28,0.25,-0.6,0.09,-0.06,-1.1,0.51,-0.16,2.12,-0.5,-0.36,0.44,1.46,-0.8,0.21,1.36,2.05,0.16,-0.83,1.36,-1.38,-1.05,-0.22,-0.85,1.22,-1.72,-0.56,0.39,1.18,1.76,-0.04,0.43,-0.31,-0.58,1.84,1.31,-0.91,0.43,-0.2,2.49,-1.16
in,-1.66,2.18,0.14,0.64,-0.9,-2.05,1.26,1.22,-0.68,-1.15,0.61,-0.97,0.38,-0.15,1.07,0.46,-0.21,-2.06,0.35,-0.68,0.17,0.23,1.11,0.22,-0.82,0.62,-0.74,-0.41,0.16,-0.34,0.98,-1.19,-0.16,-0.16,0.51,1.92,0.67,-0.26,0.01,-0.35,0.65,-1.35,-0.93,0.93,0.34,0.22,-0.58,0.8,0.68,0.07,1.09,0.2,-0.2,0.92,-0.3,0.66,1.08,-0.41,0.01,0.54,-0.17,0.74,-0.26,-0.12,0.66,0.25,-0.72,0.79,0.82,0.13,-0.22,-0.62,0.22,1.34,0.42,0.02,-0.78,-0.99,0.32,-0.6,0.33,0.21,1.12,-0.36,0.37,0.17,-0.04,0.62,0.12,1.18,1.53,0.37,-0.94,-0.12,1.69,0.65,-0.15,-0.09,-0.26,-0.9
on,-0.94,1.49,-1.21,2.14,-2.37,-1.75,1.26,0.31,0.36,-0.49,-0.09,-0.58,-0.33,-0.72,-0.63,0.3,1.33,-0.8,0.55,-0.44,0.59,-0.39,-0.24,-0.53,0.59,-0.11,-0.92,0.78,-0.96,-0.51,1.82,0.2,1.36,0.22,-0.22,0.71,1.0,0.52,-0.33,0.73,1.46,-1.26,-1.34,0.38,0.12,0.2,0.58,-0.75,0.39,0.35,0.75,0.51,1.31,2.02,-0.78,0.12,0.28,-1.28,-0.96,0.4,0.45,0.24,0.81,-0.34,-0.1,-0.07,-1.42,-0.1,-0.26,-0.74,-0.33,-0.3,0.37,0.11,0.95,-0.23,-0.25,-0.49,1.84,0.23,-1.47,0.06,0.13,-0.2,0.04,-0.57,-0.1,-0.51,0.76,0.98,1.26,1.85,-0.18,0.53,1.8,0.36,-0.82,1.03,-0.58,-0.77


Note that the 100 columns you see here are dimensions you specified earlier. We do not know what each dimension signifies, although this is an active area of research. However, if you examine each dimension you'll notice more red in some columns and more blue in others. Some of the words are very common and might be stop words we would normally remove. However, if we load more specific groups of words, we might make good guesses about what some of the dimensions mean.

The code cells below compute the similarities between two words in the vocabulary.

In [21]:
word2vec_model.wv.similarity(w1='dog', w2='person')

0.35236177

In [22]:
word2vec_model.wv.similarity(w1='dog', w2='cat')

0.9652891

## Step 6: Create Feature Vectors out of Word Embeddings for a Classifier

Now let's convert the features in our training and test datasets into feature vectors using our word embeddings. We will use this  to train a logistic regression model. Let's first inspect our original training and test datasets:

In [23]:
X_train.head()

562     [subject, hpl, nom, for, sept, see, attached, ...
148     [subject, copanno, changes, forwarded, by, ami...
4831    [subject, re, maynard, oil, revised, nom, dare...
3385    [subject, deal, for, december, can, either, of...
2389    [subject, heisse, sx, action, hallo, mein, lie...
Name: email_text, dtype: object

In [24]:
X_test.head()

5075    [subject, hpl, nom, for, may, see, attached, f...
3817    [subject, cleburne, tenaska, iv, plant, daren,...
2967    [subject, re, producer, connects, on, the, con...
3014    [subject, mail, hey, daren, is, this, your, ma...
2185    [subject, revision, forest, oil, november, gas...
Name: email_text, dtype: object

The code cell below walks through every example in both the training and test datasets and replaces every word contained in the `email_text` feature with its corresponding word embedding. Original words that do not have a corresponding word embedding will not be part of the training and test sets. For example, stop words that were removed when creating the word embeddings will not appear in the training and test sets.

<b>Note</b>: This may take a while to run.

In [25]:
words = set(word2vec_model.wv.index_to_key) #a set is an unordered collection of unique elements
#index_to_key to obtain indices?
print('Begin transforming X_train')
X_train_word_embeddings = np.array([np.array([word2vec_model.wv[word] for word in words if word in training_example])
                        for training_example in X_train], dtype=object)
print('Finish transforming X_train')

print('Begin transforming X_test')
X_test_word_embeddings = np.array([np.array([word2vec_model.wv[word] for word in words if word in training_example])
                        for training_example in X_test], dtype=object)
print('Finish transforming X_test')


Begin transforming X_train
Finish transforming X_train
Begin transforming X_test
Finish transforming X_test


In [26]:
print('Number of words in first training example: {0}'.format(len(X_train.iloc[0])))
print('First word in first training example: {0}'.format(X_train.iloc[0][0]))
print('Second word in first training example: {0}\n'.format(X_train.iloc[0][1]))

print('Number of word vectors in first training example: {0}'.format(len(X_train_word_embeddings[0])))
print('First word vector in first training example:\n {0}'.format(X_train_word_embeddings[0][0]))
print('\nSecond word vector in first training example: \n {0}\n'.format(X_train_word_embeddings[0][1]))


Number of words in first training example: 12
First word in first training example: subject
Second word in first training example: hpl

Number of word vectors in first training example: 10
First word vector in first training example:
 [-0.49532762  0.14175564 -1.5668895   0.24423969 -0.94954234  0.01721316
  0.20677774  0.02985767  0.09828629 -0.23114619  0.32010263 -0.54806477
 -1.9139943  -0.7508413  -0.87572557  1.420324    0.25443664  1.1923311
  1.3937329  -1.1350873   2.6584358  -0.8781236   0.9395594   0.09562054
 -0.1237262   0.7475929   0.01806622 -1.1826546   0.07537986 -0.6200768
  0.18233465  2.0059676   1.565773   -0.62722814  2.1923568   0.22282617
 -0.18763624  1.2370421  -0.7312534  -1.5394593   0.71885914 -0.08247944
 -1.754546   -0.23928364  1.8177963  -2.1507998   0.93718654 -1.3639182
  1.5151078   0.51757383  0.89508265  0.16342057  1.0827389   1.4818457
 -0.09418495  1.6387844   0.27591452 -0.8857176  -1.1679686  -0.11514526
  0.8464545   0.78790414 -1.0009046  -0

After replacing the `email_text` feature in our training and test data with word embeddings, each example in our training and test data now has a different number of features, each corresponding to a word vector:

In [27]:
print('Number of word vectors in first five examples in training set:')
for w in range(0, 5):
    print(len(X_train_word_embeddings[w]))

print('Number of word vectors in first five examples in test set:')
for w in range(0, 5):
    print(len(X_test_word_embeddings[w]))

Number of word vectors in first five examples in training set:
10
23
37
57
30
Number of word vectors in first five examples in test set:
10
31
59
8
59


This will cause an error when we train our model. We have to create feature vectors that will provide our classifier with a consistent set of features per example.

We can take an element-wise average of the word embeddings of the words contained in each training and test example. This makes feature vector representations that can be used as training and test features for our classifier. 

In [28]:
X_train_feature_vector = [] #creating new matrix on which to train the model
for w in X_train_word_embeddings: #for each word, which might contain multiple word vectors
    if w.size:
        X_train_feature_vector.append(w.mean(axis=0))
    else:
        X_train_feature_vector.append(np.zeros(100, dtype=float)) #add zero if no corresponding mean
        
X_test_feature_vector = []
for w in X_test_word_embeddings:
    if w.size:
        X_test_feature_vector.append(w.mean(axis=0))
    else:
        X_test_feature_vector.append(np.zeros(100, dtype=float))

Each example now consists of one feature, which is a numerical feature vector of length 100. Run the code cell below to inspect the first five training examples.

In [29]:
for w in range(0, 5):
    print('Length of training example {0}: {1}'.format(w, len(X_train_feature_vector[w])))
    
print('First training example\'s feature vector: \n{0}'.format(X_train_feature_vector[0]))

Length of training example 0: 100
Length of training example 1: 100
Length of training example 2: 100
Length of training example 3: 100
Length of training example 4: 100
First training example's feature vector: 
[-1.1049316  -0.00495676 -0.71286833  0.01581209 -0.22380786 -0.2819681
  0.30092195  0.9385498  -0.10728385 -0.5707252  -0.03826191 -0.6150856
 -0.46273813 -0.60411924 -0.55337846  0.3294662  -0.07696322  0.05543231
  0.8247735  -0.9024153   2.0328298   0.6258118   0.38968202 -0.06787213
 -0.15417658  0.6521823  -0.9909808  -0.72445536 -0.4518307   0.10237423
 -0.94199294  1.4847244   0.6714918  -0.5642654   1.5965024   1.1199354
 -0.39535072  0.47974354 -0.09820977 -1.2816906   0.44184595  0.1856958
 -1.4144332   0.4187681   1.3257167  -1.5528185  -0.09832243 -0.68504333
  0.86033535  0.5125574   0.33904094 -0.07332923  0.1359521   0.5930164
  0.35865253  1.0714424   0.6495768   0.03501019 -1.0122678   0.2621215
 -0.2146981   0.35157427 -0.08845037 -0.60187906 -1.4178727  -0.

## Step 7: Fit a Logistic Regression Model to the Training Data and Evaluate the Model

Now we can train our model on our transformed data. The code cell below trains a logistic regression model and computes the AUC on the test set.

In [30]:
# 1. Create a LogisticRegression model object, and fit a Logistic Regression model to the transformed training data
model = LogisticRegression(max_iter=200)
model.fit(X_train_feature_vector, y_train)

# 2. Make predictions on the transformed test data using the predict_proba() method and 
# save the values of the second column
probability_predictions = model.predict_proba(X_test_feature_vector)[:,1]

# 3. Make predictions on the transformed test data using the predict() method 
class_label_predictions = model.predict(X_test_feature_vector)

# 4. Compute the Area Under the ROC curve (AUC) for the test data. Note that this time we are using one 
# function 'roc_auc_score()' to compute the auc rather than using both 'roc_curve()' and 'auc()' as we have 
# done in the past
auc = roc_auc_score(y_test, probability_predictions)
print('AUC on the test data: {:.4f}'.format(auc))

AUC on the test data: 0.9875


Let's check two emails and see if our model properly predicted whether an email is spam or not spam.

In [31]:
print('Email #1:\n')
print(original_X[14])

print('\nPrediction: Is this a spam email? {}\n'.format(class_label_predictions[14])) 

print('Actual: Is this a spam email? {}\n'.format(y_test.to_numpy()[14]))


Email #1:

Subject: tenaska iv july
darren :
please remove the price on the tenaska iv sale , deal 384258 , for july and enter the demand fee . the amount should be $ 3 , 902 , 687 . 50 .
thanks ,
megan

Prediction: Is this a spam email? True

Actual: Is this a spam email? True



In [32]:
print('Email #2:\n')
print(original_X[132])

print('\nPrediction: Is this a spam email? {}\n'.format(class_label_predictions[132])) 

print('Actual: Is this a good spam email? {}\n'.format(y_test.to_numpy()[132]))

Email #2:

Subject: re : noms / actual flow for 3 / 29 / 01
we agree with the nomination .
" eileen ponton " on 03 / 30 / 2001 10 : 05 : 40 am
to : david avila / lsp / enserch / us @ tu , charlie stone / texas utilities @ tu , melissa
jones / texas utilities @ tu , hpl . scheduling @ enron . com ,
liz . bellamy @ enron . com
cc :
subject : noms / actual flow for 3 / 29 / 01
nom mcf mmbtu
24 , 583 24 , 999 25 , 674
btu = 1 . 027

Prediction: Is this a spam email? False

Actual: Is this a good spam email? False

