# Keras Tutorial on usage of pre-trained embeddings in topic classification

Most code in this notebook is taken from the [excellent tutorial page](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html). The goal of this notebook is to understand, in depth, how CNNs are used for text classification; I had never used the two in combination before. The notebook contains notes, pointers, and learnings, and is largely for self-learning.

# Basic Pre-processing

In [1]:
# Extracting files - One off
"""
import tarfile
import zipfile

tf = tarfile.open("news20.tar.gz")
tf.extractall()
tf.close()

with zipfile.ZipFile("glove.6B.zip","r") as zip_ref:
    zip_ref.extractall()
"""
    
"""Basic Pre-processing"""
import os
import sys
import numpy as np
# Look Ma, no pandas!

TEXT_DATA_DIR = 'data'

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                # the context manager closes the file even if reading fails
                with open(fpath, encoding='latin-1') as f:
                    t = f.read()
                i = t.find('\n\n')  # skip header
                if i > 0:
                    t = t[i:]
                texts.append(t)
                labels.append(label_id)

print('Found %s texts.' % len(texts))

Found 19997 texts.
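
To see what the `t.find('\n\n')` step does, here is a minimal sketch on a made-up message (the message and the helper name `strip_header` are mine, not part of the dataset or the tutorial). Note that the slice keeps the blank line itself at the start of the body.

```python
# Hypothetical message, invented for illustration; real files are longer.
raw = "From: someone@example.com\nSubject: test\n\nBody line one.\nBody line two.\n"

def strip_header(text):
    """Drop everything before the first blank line, as the loop above does."""
    i = text.find('\n\n')
    # find() returns -1 when there is no blank line; the text is then kept whole.
    return text[i:] if i > 0 else text

body = strip_header(raw)
print(repr(body))
```

A text without any blank line is returned unchanged, which mirrors the `if i > 0` guard in the loading loop.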


# Formatting the problem to suit Keras

In [2]:
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, Input, Conv1D, MaxPooling1D, Flatten, Dense, Conv2D
from keras.models import Sequential, Model
from keras.utils import plot_model

MAX_NB_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000
VALIDATION_SPLIT = 0.2

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

Using TensorFlow backend.


Found 174074 unique tokens.
Shape of data tensor: (19997, 1000)
Shape of label tensor: (19997, 20)
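
A note to self on what the formatting step is doing. Below is a small NumPy sketch that mimics the default behaviour of `pad_sequences` (pad on the left, truncate from the left) and of `to_categorical`. The helpers `pad_pre` and `one_hot` and the toy ids are my own stand-ins for illustration, not the Keras implementations.

```python
import numpy as np

def pad_pre(seq, maxlen):
    """Left-pad with 0, or keep only the last `maxlen` ids."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

def one_hot(label_ids, num_classes):
    """Turn integer labels into rows containing a single 1."""
    out = np.zeros((len(label_ids), num_classes), dtype='float32')
    out[np.arange(len(label_ids)), label_ids] = 1.0
    return out

toy_sequences = [[4, 7], [1, 2, 3, 4, 5, 6]]   # hypothetical word ids
toy_data = np.array([pad_pre(s, 4) for s in toy_sequences])
toy_labels = one_hot([0, 2], num_classes=3)

n_val = int(0.2 * 19997)   # same arithmetic as nb_validation_samples
print(toy_data)
print(toy_labels)
print(19997 - n_val, n_val)
```

The last line reproduces the split sizes used in this notebook: 15998 training samples and 3999 validation samples.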


# Creating the embeddings matrix

In [3]:
GLOVE_DIR = 'vectors'
EMBEDDING_DIM = 100

embeddings_index = {}
# each line of the GloVe file is: word, then EMBEDDING_DIM space-separated floats
with open(os.path.join(GLOVE_DIR, 'glove.6B.' + str(EMBEDDING_DIM) + 'd.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
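
For reference, this is how a single line of the GloVe file gets parsed. The line below is invented and only 4-dimensional; the real file has EMBEDDING_DIM numbers after each word.

```python
import numpy as np

# Hypothetical 4-dimensional line, made up for illustration.
line = "the 0.1 -0.2 0.3 0.4\n"

values = line.split()                              # split on whitespace
word = values[0]                                   # first field is the token
coefs = np.asarray(values[1:], dtype='float32')    # the rest are the vector
print(word, coefs.shape)
```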


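The embedding matrix has one row for each word in *word_index*, holding that word's embedding vector. Row 0 stays all-zeros because the Tokenizer assigns word indices starting from 1, which is why the matrix has `len(word_index)+1` rows.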

In [4]:
embedding_matrix = np.zeros((len(word_index)+1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # words not found in the embedding index keep their all-zeros row
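
The same loop on a toy vocabulary makes the zero rows visible. All names and vectors here are hypothetical: `zzyzx` stands in for a word that GloVe does not know.

```python
import numpy as np

toy_word_index = {'the': 1, 'cat': 2, 'zzyzx': 3}   # hypothetical Tokenizer output
toy_embeddings_index = {                             # hypothetical 2-d vectors
    'the': np.array([0.1, 0.2], dtype='float32'),
    'cat': np.array([0.3, 0.4], dtype='float32'),
}

toy_matrix = np.zeros((len(toy_word_index) + 1, 2))
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector

print(toy_matrix)
```

Row 0 (the padding index) and row 3 (the unknown word) both remain zeros.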

# Going through the model, layer-by-layer

### We start with Input layer...

From the [docs](https://keras.io/layers/core/)

1st argument is **shape** - A shape tuple, *not including the batch size*. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors.

A less-used alternative is **batch_shape**: A shape tuple, *including the batch size*. For instance, batch_shape=(10, 32) indicates that the expected input will be batches of 10 32-dimensional vectors.  batch_shape=(None, 32) indicates batches of an arbitrary number of 32-dimensional vectors.

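This looks odd at first. `shape=(32,)` is simply a one-element Python tuple (the trailing comma is what makes it a tuple), and the batch dimension is implied in front of it. Writing `(,32)` might feel more natural for "any batch size, 32 features", but that is not valid Python.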

In [5]:
"""Input Layer"""
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
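
A plain-Python sketch of the tuple business, with no Keras involved. The way I combine the two tuples is only my mental model of how `shape` relates to `batch_shape`, not Keras internals.

```python
shape = (32,)        # one-element tuple: each sample is a 32-dimensional vector
not_a_tuple = (32)   # just the integer 32; the trailing comma is what makes a tuple

# Conceptually, the batch axis is prepended to `shape`; None means "any batch size".
batch_shape = (None,) + shape

print(type(shape).__name__, type(not_a_tuple).__name__, batch_shape)
```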

### Embedding layer...

Creating a non-trainable word-embedding layer. From the [reference](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html):

> All that the Embedding layer does is to map the integer inputs to the vectors found at the corresponding index in the embedding matrix, i.e. the sequence [1, 2] would be converted to [embeddings[1], embeddings[2]]. This means that the output of the Embedding layer will be a 3D tensor of shape (samples, sequence_length, embedding_dim).

The trial in the code below makes this clearer.

In [6]:
embedding_layer = Embedding(len(word_index)+1, EMBEDDING_DIM, weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH, trainable=False)

"""Trying out the embedding layer to understand input / output"""
trial_samples = 20
trial_model = Sequential()
trial_model.add(embedding_layer)

trial_model.compile('adam', 'mse')
output_array = trial_model.predict(x_train[:trial_samples])
# assert output_array.shape == (trial_samples, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
print (x_train[:trial_samples].shape, output_array.shape)

(20, 1000) (20, 1000, 100)
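
The lookup can also be imitated with plain NumPy indexing, which is how I think about it. The matrix and the word ids below are random or made up; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(6, 3))     # hypothetical: 6 vocabulary rows, 3-d vectors
toy_batch = np.array([[0, 1, 2, 5],      # hypothetical padded word ids,
                      [0, 0, 4, 3]])     # shape (samples=2, sequence_length=4)

toy_output = toy_matrix[toy_batch]       # integer-array indexing does the lookup
print(toy_batch.shape, toy_output.shape)
```

Each integer is replaced by its row of the matrix, so a (2, 4) input becomes a (2, 4, 3) output, the same pattern as (20, 1000) becoming (20, 1000, 100) above.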


In [7]:
"""Adding Embeddings Layer to the model, which kind-of takes Input Layer as input"""
embedded_sequences = embedding_layer(sequence_input)

### Convolutional Layers...

Conv1D basically is a convolutional layer, applying 1D filters. For intuition, 2D filters are typically the ones applied for images. This is similar, but in 1D.

The argument **filters** is the number of filters to use. In image processing, if we use, say, 2 filters with a *kernel_size* of (3, 3) to check for certain edges / patterns in images, they could look like this (the numbers are for illustration only; in practice they are weights that have to be trained):

Filter 1 | Filter 2
- | -
![filter1](img/filter1.PNG) | ![filter2](img/filter2.PNG)

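In the 1D case, the filters slide along just one axis: the sequence. In the layer below we use 128 filters with a kernel size of 5, i.e. each filter looks at 5 consecutive words at a time (across all embedding channels), and its weights are learned during training.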

**Important**:

*Input shape* - 3D tensor with shape: (batch, steps, channels). In our case, (batch_size, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)

*Output shape* - 3D tensor with shape: (batch, new_steps, filters). In our case, (batch_size, MAX_SEQUENCE_LENGTH - kernel_size + 1, 128), assuming stride = 1!

To make the input / output shapes more clear, here's an example from Andrew Ng's course describing one part of the LeNet architecture:

![LeNet](img/shape_clarity.PNG)

You can see that the input image of 14x14 has 6 channels, to which 16 5x5 filters are applied with stride 1, leading to a shape of 10x10 with 16 channels. What is happening is that each filter is of a 5x5x6 dimension, and each filter outputs a 10x10 matrix, and all these matrices are stacked together to yield a 16-deep output tensor. Thus, for our 1D case, each of our input rows is of dimension MAX_SEQUENCE_LENGTH with EMBEDDING_DIM channels, which get changed to what was mentioned earlier, using the same logic as for the image example in LeNet.

Also, for the reasons above, the number of weights to train for this layer is 128 (# filters) * 5 (kernel_size) * 100 (EMBEDDING_DIM) = 64,000. Adding the 128 bias parameters (one per filter) brings the total to 64,128, which matches the model summary further down.

In [8]:
"""Convolutional Layer 1"""
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
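
To convince myself of the shapes and the parameter count, here is a slow but explicit NumPy sketch of a 1D convolution with stride 1 and no padding, followed by ReLU. The function `conv1d_valid` is my own illustration, the weights are random stand-ins for trained ones, and this is not how Keras computes it internally.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """x: (steps, channels); kernels: (kernel_size, channels, filters); bias: (filters,)."""
    kernel_size = kernels.shape[0]
    new_steps = x.shape[0] - kernel_size + 1
    out = np.empty((new_steps, kernels.shape[2]))
    for t in range(new_steps):
        window = x[t:t + kernel_size]                  # (kernel_size, channels)
        # each filter is multiplied element-wise with the window and summed
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)                        # ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 100))          # one sample: MAX_SEQUENCE_LENGTH x EMBEDDING_DIM
kernels = rng.normal(size=(5, 100, 128))  # random stand-ins for the trained weights
bias = np.zeros(128)

y = conv1d_valid(x, kernels, bias)
n_params = kernels.size + bias.size
print(y.shape, n_params)
```

Every filter spans 5 steps and all 100 channels, which is where the 5 * 100 weights per filter come from.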

### Max Pooling...

Pretty straightforward: it splits the 996-step sequence into non-overlapping chunks of 5 consecutive steps and takes the maximum over each chunk, yielding int(996 / 5) = 199 steps (the one leftover step is dropped). **The number of channels does not change**. Thus, the output shape is (batch_size, 199, 128).

In [9]:
x = MaxPooling1D(5)(x)
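
A NumPy sketch of the same pooling on a toy input with 2 channels. `max_pool1d` is my own helper, assuming non-overlapping windows (stride equal to the pool size), which is what `MaxPooling1D(5)` does by default.

```python
import numpy as np

def max_pool1d(x, pool_size):
    """x: (steps, channels). Non-overlapping windows; leftover steps are dropped."""
    new_steps = x.shape[0] // pool_size
    trimmed = x[:new_steps * pool_size]
    return trimmed.reshape(new_steps, pool_size, x.shape[1]).max(axis=1)

x = np.arange(996 * 2, dtype='float32').reshape(996, 2)   # toy input with 2 channels
pooled = max_pool1d(x, 5)
print(pooled.shape)
```

The channel axis passes through untouched; only the step axis shrinks from 996 to 199.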

### Conv + MaxPool Layers, on repeat...

In [10]:
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling, since input shape at this point is (batch_size, 35, 128)
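
To double-check the 35 in the last pooling layer, the sequence length can be traced through the whole stack with two one-line formulas. The helper names are mine; the formulas assume stride 1 and no padding for the convolutions and non-overlapping pooling windows.

```python
def conv_steps(steps, kernel_size):
    return steps - kernel_size + 1      # 'valid' padding, stride 1

def pool_steps(steps, pool_size):
    return steps // pool_size           # stride equal to pool size, leftovers dropped

steps = 1000                            # MAX_SEQUENCE_LENGTH
trace = [steps]
layers = [('conv', 5), ('pool', 5), ('conv', 5), ('pool', 5), ('conv', 5), ('pool', 35)]
for kind, size in layers:
    steps = conv_steps(steps, size) if kind == 'conv' else pool_steps(steps, size)
    trace.append(steps)

print(trace)
```

The trace ends at a single step, so the final pooling really does act as a global max over the remaining 35 steps.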

### Flattening Layer...

*One block of code is worth a thousand words!*

In [11]:
"""Tester Code from Docs"""
_trial_model = Sequential()
_trial_model.add(Conv2D(64, (3, 3), input_shape=(32, 32, 3), padding='same'))
# now: model.output_shape == (None, 32, 32, 64), assuming the default channels_last data format

_trial_model.add(Flatten())
# now: model.output_shape == (None, 65536)

In [12]:
"""Back to our model"""
x = Flatten()(x)
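
In NumPy terms, flattening is just a reshape that keeps the batch axis and merges everything else. The arrays below are zeros with a made-up batch size of 2; only the shapes matter.

```python
import numpy as np

conv_out = np.zeros((2, 32, 32, 64))      # (batch, height, width, channels)
pooled_out = np.zeros((2, 1, 128))        # shape after the final pooling layer above

flat_conv = conv_out.reshape(conv_out.shape[0], -1)
flat_pooled = pooled_out.reshape(pooled_out.shape[0], -1)
print(flat_conv.shape, flat_pooled.shape)
```

For our model the flatten step only removes the length-1 step axis, turning (batch_size, 1, 128) into (batch_size, 128).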

### Dense...

A regular fully connected layer with 128 hidden units; its input is also of shape (batch_size, 128), which seems like a bit of overkill! The second dense layer outputs the predicted class probabilities via softmax and has 20 units, one per class.

In [13]:
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)
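
A quick sketch of the parameter arithmetic for the two dense layers, plus a hand-rolled softmax to see what the final activation does. The helper names and the toy scores are my own; the counts are just weights plus biases for a fully connected layer.

```python
import numpy as np

def dense_params(n_in, n_out):
    return n_in * n_out + n_out          # weight matrix plus one bias per unit

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # subtract the max for stability
    return e / e.sum(axis=-1, keepdims=True)

hidden = dense_params(128, 128)          # flattened input of 128 -> 128 units
output = dense_params(128, 20)           # 128 units -> 20 classes

probs = softmax(np.array([[2.0, 1.0, 0.1]]))   # hypothetical scores for 3 classes
print(hidden, output, probs.sum())
```

Softmax turns the raw scores into values that sum to 1, so the largest score becomes the most probable class.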

### Model...

Nothing new here

In [20]:
model = Model(inputs = sequence_input, outputs = preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 1000)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1000, 100)         17407500  
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 996, 128)          64128     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 199, 128)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 195, 128)          82048     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 39, 128)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 35, 128)           82048     
__________

# Learning the parameters!

In [31]:
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)

Train on 15998 samples, validate on 3999 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f596d0c76d8>

In [32]:
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=3, batch_size=128)

Train on 15998 samples, validate on 3999 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f596d0e2e80>

So, after 5 epochs in total (2 + 3, since the second `fit` call continues from the weights learned in the first), we get about 66.5% accuracy on the validation set.

Next, I'd have really liked to understand what the intermediate Conv1D layers output, but I'm not sure how to interpret that output. It doesn't seem as straightforward as with CNNs for images, where packages exist that allow one to easily "peek into" the network. Oh, well!
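
One idea I might try, sketched here on invented data: for the first Conv1D layer, position t of a filter's output depends only on tokens t to t+4, so the strongest activation of a filter points at a specific 5-word window. The function `top_window`, the tokens, and the activation values are all hypothetical. With the real model, the activations would have to come from a second Keras `Model` that shares the input and ends at the Conv1D layer's output, which I have not done here. The mapping also only holds for the first convolution, since the later ones operate on pooled steps.

```python
import numpy as np

def top_window(activations, tokens, filter_id, kernel_size=5):
    """Return the tokens under the strongest activation of one filter.

    activations: (new_steps, filters) output of the first Conv1D for one sample.
    tokens: the sample's tokens, aligned with the padded input sequence.
    """
    t = int(activations[:, filter_id].argmax())
    return tokens[t:t + kernel_size]

# Hypothetical data: 8 tokens, so 8 - 5 + 1 = 4 window positions, and 2 filters.
tokens = ['the', 'space', 'shuttle', 'launch', 'was', 'delayed', 'again', 'today']
activations = np.array([[0.1, 0.0],
                        [0.9, 0.2],
                        [0.3, 0.7],
                        [0.0, 0.1]])

print(top_window(activations, tokens, filter_id=0))
print(top_window(activations, tokens, filter_id=1))
```

Collecting such windows over many documents would give a rough picture of what each filter responds to.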