<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Classification-With-Word-Embeddings---Codealong" data-toc-modified-id="Classification-With-Word-Embeddings---Codealong-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Classification With Word Embeddings - Codealong</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Getting-Started" data-toc-modified-id="Getting-Started-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Getting Started</a></span></li><li><span><a href="#Loading-A-Pretrained-GloVe-Model" data-toc-modified-id="Loading-A-Pretrained-GloVe-Model-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Loading A Pretrained GloVe Model</a></span><ul class="toc-item"><li><span><a href="#Getting-the-Total-Vocabulary" data-toc-modified-id="Getting-the-Total-Vocabulary-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>Getting the Total Vocabulary</a></span></li></ul></li><li><span><a href="#Creating-Mean-Word-Embeddings" data-toc-modified-id="Creating-Mean-Word-Embeddings-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Creating Mean Word Embeddings</a></span></li><li><span><a href="#Using-Pipelines" data-toc-modified-id="Using-Pipelines-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Using Pipelines</a></span></li><li><span><a href="#Deep-Learning-With-Word-Embeddings" data-toc-modified-id="Deep-Learning-With-Word-Embeddings-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Deep Learning With Word Embeddings</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Summary</a></span></li></ul></li></ul></div>

# Classification With Word Embeddings - Codealong

## Introduction

In this lesson, you'll use everything you've learned in this section to perform text classification using word embeddings!

## Objectives

You will be able to:

- Effectively incorporate embedding layers into neural networks using Keras
- Import and use pretrained word embeddings from popular pretrained models such as GloVe
- Understand and explain the concept of a mean word embedding, and how this can be used to vectorize text at the sentence, paragraph, or document level


## Getting Started

Load the data, and all the frameworks and libraries. 

In [1]:
import pandas as pd
import numpy as np
np.random.seed(0)
from nltk import word_tokenize
from gensim.models import word2vec

Now, load the dataset. You'll be working with the same dataset you worked with in the previous lab for this section, which you'll find inside `News_Category_Dataset_v2.zip`.  **_Go into the repo and unzip this file before continuing._**

Once you've unzipped this dataset, go ahead and use pandas to read the data stored in `News_Category_Dataset_v2.json` in the cell below. Then, display the head of the DataFrame to ensure everything worked correctly. 

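**_NOTE:_** When using the `pd.read_json()` function, be sure to include the `lines=True` parameter. The file stores one JSON object per line rather than a single JSON document, so pandas cannot parse it without this parameter and will raise an error.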
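
To see what `lines=True` does, here is a minimal sketch that reads a two-record, in-memory stand-in for the real file. The records and the name `toy_df` are made up for illustration.

```python
import io
import pandas as pd

# Two made-up records in JSON Lines format: one JSON object per line
raw = (
    '{"category": "TRAVEL", "headline": "A weekend in Berlin"}\n'
    '{"category": "POLITICS", "headline": "Budget vote delayed"}\n'
)

# lines=True tells pandas to parse each line as its own record
toy_df = pd.read_json(io.StringIO(raw), lines=True)
print(toy_df.shape)
```

Each line becomes one row, and the keys of the JSON objects become the columns.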

In [2]:
df = pd.read_json('News_Category_Dataset_v2.json', lines=True)
df = df.sample(frac=0.2)
print(len(df))
df.head()

40171


| | authors | category | date | headline | link | short_description |
|---|---|---|---|---|---|---|
| 23341 | | POLITICS | 2017-06-21 | Jared Kushner Arrives In Israel For Whirlwind ... | https://www.huffingtonpost.com/entry/jared-kus... | It remains unclear what approach the White Hou... |
| 100639 | JamesMichael Nichols | QUEER VOICES | 2015-01-23 | 'The Best Thing Is To See How Much Love Can Do... | https://www.huffingtonpost.com/entry/stacy-hol... | |
| 184179 | Party Earth, Contributor\nContributor | TRAVEL | 2012-07-25 | Berlin's Nightlife: 48 Hours You Might Not Rem... | https://www.huffingtonpost.com/entry/berlins-n... | If you think spending time boozing and schmooz... |
| 136649 | Shelly Ulaj, Contributor\nFounder and CEO of W... | DIVORCE | 2013-12-13 | Finding Strength to Stand on Your Own | https://www.huffingtonpost.com/entry/finding-s... | I was so used to being taken care of by family... |
| 196185 | Ellie Krupnick | STYLE & BEAUTY | 2012-03-18 | Alexander Wang Lawsuit Will Move To Federal Co... | https://www.huffingtonpost.com/entry/alexander... | Representatives of Alexander Wang's brand cont... |


Now, let's transform the dataset, as you did in the previous lab. 

In the cell below:

* Store the column that will be the target, `category`, in the variable `target`.
* Combine the `headline` and `short_description` columns and store the result in a column called `combined_text`. When concatenating these two columns, make sure they are separated by a space character (`' '`)!
* Use the `combined_text` column's `.map()` method to apply the `word_tokenize` function to every piece of text.
* Store the `.values` from the newly tokenized `combined_text` column inside the variable `data`.

In [3]:
target = df.category
df['combined_text'] = df.headline + ' ' + df.short_description
data = df['combined_text'].map(word_tokenize).values

## Loading A Pretrained GloVe Model

For this lab, you'll be loading the pretrained weights from **_GloVe_** (short for _Global Vectors for Word Representation_) from the [Stanford NLP Group](https://nlp.stanford.edu/projects/glove/). These are commonly accepted as some of the best pretrained word vectors available, and they're open source, so you can get them for free! Even the smallest file is still over 800 MB, so you'll need to download this file manually.

Note that there are several different sets of pretrained word vectors available for download from the page linked above. For our purposes, you'll only need the smallest one, which was trained on a corpus of 6 billion tokens and contains vectors for a vocabulary of 400,000 words! To download this file, follow the link above and select the file called `glove.6B.zip`. For simplicity's sake, you can also start the download by clicking [this link](http://nlp.stanford.edu/data/glove.6B.zip). The archive contains vectors in several dimensionalities, and you'll be using the 50-dimensional ones. Once you've downloaded the file, unzip it, and move the file `glove.6B.50d.txt` into the same directory as this Jupyter notebook.

### Getting the Total Vocabulary

Although the pretrained GloVe file contains vectors for hundreds of thousands of words, you don't need all of them. Instead, you only need the vectors for the words that appear in the dataset. If a word doesn't appear in the dataset, then there's no reason to waste memory storing the vector for that word.

This means that you need to start by computing the total vocabulary of the dataset. You can do this by adding every word in the dataset into a Python `set` object. This is easy, since you've already tokenized each piece of text stored within `data`.

In the cell below, add every token from every tokenized text in `data` into a set, and store the set in the variable `total_vocabulary`.

**_HINT_**: Even though this takes a loop within a loop, you can still do this in one line by passing a generator expression to `set()`!
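
The nested-loop pattern from the hint can be sketched on toy data first. The names `toy_data` and `toy_vocabulary` are made up for illustration and stand in for `data` and `total_vocabulary`.

```python
# Toy stand-in for `data`: each item is one already-tokenized text
toy_data = [
    ["Berlin", "nightlife", "guide"],
    ["A", "guide", "to", "Berlin"],
]

# Outer loop over texts, inner loop over tokens, collected into a set
toy_vocabulary = set(word for tokens in toy_data for word in tokens)

print(sorted(toy_vocabulary))
print(len(toy_vocabulary))  # 5
```

Because a set keeps only unique items, the repeated tokens `'Berlin'` and `'guide'` are counted once. Note that tokens are case-sensitive, so `'Guide'` and `'guide'` would be two separate entries.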

In [7]:
total_vocabulary = set(word for headline in data for word in headline)
total_vocabulary

{'thereby',
 'Rohingyas',
 'www.salon.com',
 'Painless',
 'cheeseburgers',
 'Warmth',
 'pets',
 'writers',
 'bow',
 'Fogerty',
 'descendants',
 'Vocation',
 'fearlessness',
 'Sekulow',
 "'Makers",
 'Tia',
 'owned',
 'Rev',
 'Gutierrez',
 'Auld',
 'Lieutenant',
 'carton',
 'Honk',
 'cOoL',
 'Hattar',
 'suffused',
 'Gowns',
 'Braden',
 'six-digit',
 'mudslinging',
 'littler',
 'Whale',
 'gamely',
 'shamed',
 'decade',
 'tastings',
 'Preparation',
 'desert',
 'Appearances',
 'melodramatic',
 'league',
 'Caravaggio',
 'wide-eyed',
 'Reformists',
 '4.5',
 'Unbeaten',
 'instinct',
 'Bi',
 'brighten',
 'Checker',
 'KimYe',
 'Feeders',
 'extends',
 'Kroger',
 'Jihadism',
 'Schnabel',
 'behemoth',
 'Bjork',
 'graciously',
 'Disk',
 'KD',
 'Zoolander',
 'earns',
 'palate',
 'cardiologist',
 'Disappears',
 'lattes',
 'interface',
 'wildest',
 'Pleasantly',
 "'Godfather",
 'hypotheticals',
 'Boyega',
 'Debika',
 'fantastical',
 'energised',
 'Orwellian',
 'anesthetic',
 'Burnt',
 'Refugee',
 'Coll

In [5]:
print("There are {} unique tokens in the dataset.".format(len(total_vocabulary)))

There are 71173 unique tokens in the dataset.


Now that you have gotten the total vocabulary, you can get the appropriate vectors out of the GloVe file. 

For the sake of expediency, the code to read the appropriate vectors from the file is included below. 

In [8]:
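glove = {}
# Read the file as bytes and decode each word explicitly as UTF-8
with open('glove.6B.50d.txt', 'rb') as f:
    for line in f:
        # Each line holds a word followed by the components of its vector
        parts = line.split()
        word = parts[0].decode('utf-8')
        # Keep only the vectors for words that appear in the dataset
        if word in total_vocabulary:
            vector = np.array(parts[1:], dtype=np.float32)
            glove[word] = vector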

After running the cell above, you now have every word that appears in both the dataset and the GloVe file, along with its corresponding vector, stored within the dictionary `glove` as key/value pairs.
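
To make the file layout concrete, here is a minimal sketch that applies the same filtering logic to an in-memory stand-in for the GloVe file. The three "vectors" are made up and only 3-dimensional, and the names `fake_glove_file`, `toy_vocabulary`, and `toy_glove` are hypothetical.

```python
import io
import numpy as np

# Made-up 3-dimensional vectors in the GloVe text layout: a word, then floats
fake_glove_file = io.BytesIO(
    b"school 0.1 0.2 0.3\n"
    b"pets -0.5 0.0 0.5\n"
    b"zebra 1.0 1.0 1.0\n"
)

toy_vocabulary = {"school", "pets", "School"}

toy_glove = {}
for line in fake_glove_file:
    # Decode the whole line, then split into the word and its components
    parts = line.decode("utf-8").split()
    word = parts[0]
    if word in toy_vocabulary:
        toy_glove[word] = np.array(parts[1:], dtype=np.float32)

print(sorted(toy_glove))   # ['pets', 'school']
print(toy_glove["school"].shape)  # (3,)
```

The word `zebra` is skipped because it is not in the vocabulary. The lookup is also case-sensitive: the token `'School'` gets no vector here because the file only contains the spelling `'school'`.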

Double-check that everything worked by getting the vector for a word from the `glove` dictionary. It's probably safe to assume that the word 'school' will be mentioned in at least one news headline, so let's get the vector for it. 

Get the vector for the word `'school'` from `glove` in the cell below. 

In [9]:
glove['school']

array([-0.90629  ,  1.2485   , -0.79692  , -1.4027   , -0.038458 ,
       -0.25177  , -1.2838   , -0.58413  , -0.11179  , -0.56908  ,
       -0.34842  , -0.39626  , -0.0090178, -1.0691   , -0.35368  ,
       -0.052826 , -0.37056  ,  1.0931   , -0.19205  ,  0.44648  ,
        0.45169  ,  0.72104  , -0.61103  ,  0.6315   , -0.49044  ,
       -1.7517   ,  0.055979 , -0.52281  , -1.0248   , -0.89142  ,
        3.0695   ,  0.14483  , -0.13938  , -1.3907   ,  1.2123   ,
        0.40173  ,  0.4171   ,  0.27364  ,  0.98673  ,  0.027599 ,
       -0.8724   , -0.51648  , -0.30662  ,  0.37784  ,  0.016734 ,
        0.23813  ,  0.49411  , -0.56643  , -0.18744  ,  0.62809  ],
      dtype=float32)

Great, it worked! Now that you've gotten the word vectors for the words in the dataset, the next step is to combine all the vectors for a given headline into a **_Mean Embedding_** by finding the average of all the vectors in that headline.

## Creating Mean Word Embeddings

For this step, it's worth the extra effort to write your own mean embedding vectorizer class, so that you can make use of pipelines from scikit-learn. Using pipelines will save us time and make the code a bit cleaner. 

The code for a mean embedding vectorizer class is included below, with comments explaining what each step is doing. Take a minute to examine it and try to understand what the code is doing. 

In [10]:
class W2vVectorizer(object):
    
    def __init__(self, w2v):
        # Takes in a dictionary of words and vectors as input
        self.w2v = w2v
        if len(w2v) == 0:
            self.dimensions = 0
        else:
            # Infer the vector size from any entry of the dictionary passed in
            self.dimensions = len(w2v[next(iter(w2v))])
    
    # Even though it doesn't do anything, this object must implement a fit method,
    # or else it can't be used in a scikit-learn Pipeline.
    def fit(self, X, y=None):
        return self
            
    def transform(self, X):
        # Average the vectors of the known words in each text, and
        # fall back to a zero vector when none of the words has a vector
        return np.array([
            np.mean([self.w2v[w] for w in words if w in self.w2v]
                   or [np.zeros(self.dimensions)], axis=0) for words in X])
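
The averaging step inside `transform` can be sketched on its own with tiny, made-up vectors. The function `mean_embedding` and the dictionary `toy_vectors` are hypothetical names used only for this illustration.

```python
import numpy as np

# Made-up 2-dimensional vectors, for illustration only
toy_vectors = {
    "good": np.array([1.0, 3.0]),
    "news": np.array([3.0, 1.0]),
}

def mean_embedding(tokens, vectors, dimensions):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dimensions)
    return np.mean(known, axis=0)

print(mean_embedding(["good", "news", "today"], toy_vectors, 2))  # [2. 2.]
print(mean_embedding(["unseen", "words"], toy_vectors, 2))        # [0. 0.]
```

The token `'today'` has no vector, so it is simply ignored, and a text with no known tokens maps to the zero vector. Every text therefore ends up as a vector of the same length, regardless of how many words it contains.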

## Using Pipelines

Since you've created a mean vectorizer class, you can pass this in as the first step in the pipeline, and then follow it up with the model you'll feed the data into for classification. 

Run the cell below to create pipeline objects that make use of the mean embedding vectorizer that you built above. 

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

rf =  Pipeline([("Word2Vec Vectorizer", W2vVectorizer(glove)),
              ("Random Forest", RandomForestClassifier(n_estimators=100, verbose=True))])
svc = Pipeline([("Word2Vec Vectorizer", W2vVectorizer(glove)),
                ('Support Vector Machine', SVC())])
lr = Pipeline([("Word2Vec Vectorizer", W2vVectorizer(glove)),
              ('Logistic Regression', LogisticRegression())])
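
To see how a custom transformer and a classifier cooperate inside a pipeline, here is a minimal, self-contained sketch on made-up data. `TokenCountVectorizer` is a hypothetical stand-in for the mean embedding vectorizer: it turns each token list into a single feature, its length.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

class TokenCountVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in transformer: maps each token list to [length]."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(tokens)] for tokens in X], dtype=float)

# Made-up data: short texts are class 0, long texts are class 1
toy_X = [["a"] * n for n in [1, 2, 1, 2, 8, 9, 8, 9]]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ("vectorizer", TokenCountVectorizer()),
    ("classifier", LogisticRegression()),
])

# cross_val_score refits the whole pipeline on each training fold
toy_scores = cross_val_score(toy_pipeline, toy_X, toy_y, cv=2)
print(toy_scores.mean())
```

The pipeline calls `fit` and `transform` on the first step and passes the result to the classifier, which is why the vectorizer needs both methods.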

Now, you'll create a list that contains a tuple for each pipeline, where the first item in the tuple is a name, and the second item in the tuple is the actual pipeline object.

In [12]:
models = [('Random Forest', rf),
          ("Support Vector Machine", svc),
          ("Logistic Regression", lr)]

You can then use the list you've created above, as well as the `cross_val_score` function from scikit-learn, to train all the models and store each model's mean cross-validation score in a list.

**_NOTE:_** Running the cell below may take a few minutes!

In [13]:
scores = [(name, cross_val_score(model, data, target, cv=2).mean()) for name, model in models]

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   24.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    1.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   22.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.9s finished


In [14]:
scores

[('Random Forest', 0.31960910174587964),
 ('Support Vector Machine', 0.3036012008096788),
 ('Logistic Regression', 0.3255087322734529)]

These scores may seem pretty low, but remember that there are 41 possible categories that headlines could be classified into. Guessing uniformly at random would achieve an accuracy of only about 0.024 (1 in 41)! Our models have plenty of room for improvement, but they do work!
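
The random-guessing figure is simple arithmetic. The sketch below also shows a second common baseline, always predicting the most frequent class, computed on made-up labels because its value for the real dataset depends on the class distribution and is not reported here.

```python
from collections import Counter

n_classes = 41
print(round(1 / n_classes, 4))  # 0.0244

# Made-up, imbalanced labels to illustrate a majority-class baseline
toy_labels = ["POLITICS"] * 5 + ["TRAVEL"] * 3 + ["DIVORCE"] * 2
majority_count = Counter(toy_labels).most_common(1)[0][1]
print(majority_count / len(toy_labels))  # 0.5
```

When classes are imbalanced, the majority-class baseline is higher than uniform guessing, so it is usually the fairer number to compare a model against.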

## Deep Learning With Word Embeddings

To end, you'll see an example of how you can use an **_Embedding Layer_** inside of a Deep Neural Network to compute your own word embedding vectors on the fly, right inside the model!

Don't worry if you don't understand the code below just yet&mdash;you'll be learning all about **_Sequence Models_** like the one below in the next section!

Run the cells below.

First, you'll import everything you'll need from Keras. 

In [15]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, LSTM, Embedding
from keras.layers import Dropout, Activation, Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.preprocessing import text, sequence

Using TensorFlow backend.


Next, you'll convert the labels to a one-hot encoded format.

In [16]:
y = pd.get_dummies(target).values
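
As a quick illustration of what one-hot encoding produces, here is the same call on a made-up four-item target (`toy_target` and `toy_y` are hypothetical names).

```python
import pandas as pd

toy_target = pd.Series(["TRAVEL", "POLITICS", "TRAVEL", "DIVORCE"])
toy_y = pd.get_dummies(toy_target)

print(list(toy_y.columns))  # ['DIVORCE', 'POLITICS', 'TRAVEL']
print(toy_y.values.astype(int))
```

Each row contains a single 1 in the column of its class, and the columns are ordered alphabetically by class name. Depending on the pandas version, the raw values may be booleans rather than integers, which is why the sketch casts them with `astype(int)` before printing.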

Now, you'll preprocess the text data. To do this, you start from the step where you combined the headlines and short description. You'll then use Keras's preprocessing tools to tokenize each example, convert them to sequences, and then pad the sequences so they're all the same length. 

Note how during the tokenization step, you set a parameter to tell the tokenizer to limit the overall vocabulary size to the `20000` most frequent words.

In [17]:
tokenizer = text.Tokenizer(num_words=20000)
tokenizer.fit_on_texts(list(df.combined_text))
list_tokenized_headlines = tokenizer.texts_to_sequences(df.combined_text)
X_t = sequence.pad_sequences(list_tokenized_headlines, maxlen=100)
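
To show what padding does to the integer sequences, here is a NumPy sketch that mimics the default behavior of `pad_sequences`, which pads and truncates at the start of each sequence. `pad_pre` is a hypothetical helper written for this illustration, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # leave the row as all padding
        trimmed = seq[-maxlen:]  # keep only the last maxlen items
        padded[row, -len(trimmed):] = trimmed
    return padded

toy_sequences = [[5, 2], [7, 1, 4, 9, 3]]
print(pad_pre(toy_sequences, maxlen=4))
# [[0 0 5 2]
#  [1 4 9 3]]
```

The short sequence is filled with zeros on the left, and the long one loses its first item, so every row ends up with exactly `maxlen` entries.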

Now, construct the neural network. Notice how the **_Embedding Layer_** comes second, after the input layer. In the Embedding Layer, you specify the size of the vocabulary and the size you want the word vectors to be. The embedding size you specify here is 128, and the vocabulary size should match the number of words the tokenizer keeps. Since you limited the vocab to 20000, that's the size you choose for the embedding layer.
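
Conceptually, an embedding layer is a lookup table with one row per word index. The NumPy sketch below illustrates that idea with small, made-up sizes and random values; it is not how Keras implements the layer internally, and the real layer's rows are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 10, 4  # small made-up sizes

# One row per word index (randomly initialized here)
embedding_matrix = rng.normal(size=(vocab_size, embedding_size))

# A padded sequence of word indices, like one row of the padded input
sequence = np.array([0, 0, 7, 3])

# The lookup is plain row indexing: one vector per position
embedded = embedding_matrix[sequence]
print(embedded.shape)  # (4, 4)
```

A sequence of 4 indices becomes a 4 x 4 array here. With the sizes used in this lesson, a sequence of 100 indices becomes a 100 x 128 array.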

Once the data has passed through the embedding layer, you feed it into an LSTM layer, followed by a global max pooling layer, a Dense layer, and finally the output layer. You also add Dropout layers after the pooling layer and the Dense layer, to help fight overfitting.

Our output layer is a Dense layer with 41 neurons, which corresponds to the 41 possible classes in the labels. You set the activation function for this output layer to `'softmax'`, so that the network will output a vector of predictions, where each element's value is the predicted probability that the example belongs to the class that corresponds to that element, and where the sum of all elements in the output vector is 1.
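
The softmax function itself is short enough to write out. This is a minimal NumPy sketch with made-up scores for 3 classes instead of 41.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

toy_logits = np.array([2.0, 1.0, 0.1])  # made-up scores for 3 classes
probs = softmax(toy_logits)
print(probs.round(3))
print(probs.sum())
```

The largest score receives the largest probability, and the predicted class is the index of the largest element.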

In [18]:
embedding_size = 128
input_ = Input(shape=(100,))
x = Embedding(20000, embedding_size)(input_)
x = LSTM(25, return_sequences=True)(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.5)(x)
x = Dense(50, activation='relu')(x)
x = Dropout(0.5)(x)
# There are 41 different possible classes, so we use 41 neurons in our output layer
x = Dense(41, activation='softmax')(x)

model = Model(inputs=input_, outputs=x)

Once you have designed the model, you still have to compile it, and provide important parameters such as the loss function to use (`'categorical_crossentropy'`, since this is a multiclass classification problem), and the optimizer to use. 

In [19]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

After compiling the model, you quickly check the summary of the model to see what the model looks like, and make sure the output shapes line up with what you expect. 

In [20]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100, 25)           15400     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 25)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 25)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                1300      
_________________________________________________________________
dropout_2 (Dropout)          (None, 50)                0         
__________
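
The parameter counts shown in the summary can be checked by hand. The sketch below uses the standard parameter-count formulas for these layer types together with the sizes from the model definition.

```python
vocab_size, embedding_size = 20000, 128
lstm_units, dense_units = 25, 50

# Embedding: one vector of length embedding_size per vocabulary entry
embedding_params = vocab_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per input/unit pair, plus one bias per unit
dense_params = lstm_units * dense_units + dense_units

print(embedding_params, lstm_params, dense_params)  # 2560000 15400 1300
```

The embedding layer holds the vast majority of the parameters, which is typical for models with a large vocabulary.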

Finally, you can fit the model by passing in the data, the labels, and setting some other hyperparameters such as the batch size, the number of epochs to train for, and what percentage of the training data to use for validation data. 
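
As a rough sketch of how the validation split divides the data, the arithmetic below assumes the split point is the number of samples times one minus the validation fraction, truncated to an integer. That rule is an assumption about Keras's behavior rather than something stated in this lesson, but it reproduces the sample counts Keras reports when training starts.

```python
n_samples = 40171  # rows in the 20% sample loaded earlier
validation_split = 0.1

# Assumed rule: truncate n_samples * (1 - validation_split) to get the split point
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train
print(n_train, n_val)  # 36153 4018
```

Keras takes the validation samples from the end of the arrays you pass in, before any shuffling, so the order of the rows determines which examples are held out.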

If trained for 3 epochs, you'll find the model achieves a validation accuracy of almost 41%.

Run the cell below, which trains for 2 epochs. Note that this is a large network, so the training will take some time!

In [21]:
model.fit(X_t, y, epochs=2, batch_size=32, validation_split=0.1)

Train on 36153 samples, validate on 4018 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1a53eb5a90>

After 1 epoch, the model does about as well as the shallow algorithms you tried above. However, the LSTM Network was able to achieve a validation accuracy of over 40% after only 3 epochs of training. It's likely that if you trained for more epochs or added in the rest of the data, the performance would improve even further (but the run time would get much, much longer). 

It's common to add embedding layers in LSTM networks, because both are specialized tools commonly used for text data. The embedding layer learns its own vectors based on the language in the text data it trains on, and then passes that information on to the LSTM layer one word at a time. You'll learn more about LSTMs and other kinds of **_Recurrent Neural Networks_** in the next section!

## Summary

In this codealong, you used everything you know about word embeddings to perform text classification, and then you built an LSTM network that incorporated a word embedding layer in its own architecture!