# Convolutional and Recurrent Neural Networks

In the previous lesson we took a dive into Natural Language Processing using neural networks. Using sentiment analysis as our first task, we tried several models to tackle it as well as we could. We did so incrementally, trying to simplify the problem as much as possible to ensure that it would fit in the neural networks we have seen so far. **This approach of coming up with the simplest possible solution is usually a good idea, because complicated models tend to be harder to develop and understand, as well as more expensive to train.** However, we finished the lesson with the clear understanding that some problems might need more involved machinery.

We also left four exercises to the reader, which you hopefully have put some time into. Since they were mostly exploratory in nature, we will not solve them here. However, we will sketch what you might have gotten from them so we can keep going:

1. On the impact of dataset size, **the only clear-cut answer is that we really need training data!** After trying a few partition sizes for training and evaluating, you might have noticed that the learned weights were a lot noisier for smaller training sizes. This is normal: **the fewer training examples, the less likely it is that we will learn a proper representation.** Our dataset was very small, which accentuates this problem.
2. Using a `Bag of Words` model that also encodes the frequency of words in a document might have some small effect, but for our dataset it did not matter much. **It is a good idea to revisit this approach for some of the more involved problems we will see now.**
3. Because of the small dataset, **we need a high ratio of negative samples for the embeddings to make sense.** Lower ratios meant less powerful embeddings. On the other hand, higher ratios also mean more samples, and that has a direct impact on training speed!
4. Finally, alternative word embeddings can better capture syntactic over semantic characteristics or vice-versa. As such, **they can outperform one another depending on the problem we are trying to solve!**

This is already too much of a retrospective! To set the stage, let's look back at the neural networks we have been working with so far:

<img src="images/NeuralNetSchematic.png" width="500" alt="An abstracted neural network with inputs, hidden layers and outputs"></img>

In code, we would be able to create similar arbitrary (deep) neural architectures with a simple function:

In [None]:
from keras.models import Model
from keras.layers import Input, Dense

def make_network(input_size, output_size, hidden_sizes, hidden_activations, output_activation):
    '''Creates a neural network with input, output and hidden sizes and the activations for the hidden and output'''
    inp = Input((input_size,) )
    hid = inp
    for (size, act) in zip(hidden_sizes, hidden_activations):
        hid = Dense(size, activation=act)(hid)
    out = Dense(output_size, activation=output_activation)(hid)
    return Model(inputs=inp, outputs=out)

# create a sample network!
print('Creating an example network: ')
mdl = make_network(10, 1, [10, 12, 10], ['relu', 'relu', 'tanh'], 'sigmoid')
mdl.summary()

As you may remember from what we have seen before, there is still some plumbing left: we would need to specify our loss function and optimization algorithm in order to compile the network. However, we don't even need to cover that if we want to point out the limitations of the networks we are building! **The main problem is that we always assume our inputs are unstructured blobs, and ignore the patterns that may exist within them.**

This is a pretty big issue: a document is composed of sentences, a sentence is made up of words, and a word is formed by letters. **We will not be able to solve hard tasks such as Question Answering without a model that can cope with the underlying structures!**

Now, to a degree, we have already been cheating. By representing words using embedding vectors and averaging the embeddings of all words in a document, we are introducing structural information from a large corpus into our reduced problem. However, we are still squashing the relationships words have in the text, and the longer the text, the more we are diluting the contribution of each word, expression or sentence.

**At this point, we ask ourselves: how can we keep the structure of the data in our models?** A naïve first way is to apply a neural model on each word, and another model over the resulting representation. Our neural network will take a full document with $N$ words, each of which will be an embedded vector with $d$ components.

The input, then, will be an $N \times d$ matrix. On each of the vectors, we will apply a neural network with $k$ outputs, so we will get an $N \times k$ matrix, which we will then flatten and feed into a final set of layers up to the output. **This way, the network will work on the individual words and have access to the whole document!** Let's sketch that with some code:

In [None]:
from keras.layers import Flatten

# create the word-level model
embedding_size = 300
word_out_size = 5
word_out_activ = 'linear' # do nothing with the output
word_hidden_sizes = [50, 20]
word_hidden_activs = ['relu', 'relu']

print('Word model summary: ')
word_model = make_network(embedding_size, word_out_size, word_hidden_sizes, word_hidden_activs, word_out_activ)
word_model.summary()
print('\n\n')

# create the document model
document_words = 500
doc_input = Input((document_words, embedding_size)) 
doc_in_words = word_model(doc_input)
doc_flat = Flatten()(doc_in_words)
doc_dense = Dense(20, activation='relu')(doc_flat)
doc_dense = Dense(20, activation='relu')(doc_dense)
doc_output = Dense(1, activation='sigmoid')(doc_dense)

print('Document model summary: ')
doc_model = Model(inputs=doc_input, outputs=doc_output)
doc_model.summary()

**Keras manages applying a model on each of the inputs for us, so the parameters of our model are shared when processing each of the words.** However, this approach has a clear flaw, and it is a flaw we already saw when trying to represent words as vectors in our previous lesson. When we flatten out all the outputs of the word-level model, the resulting vector will be fed into a simple network.
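The parameter sharing mentioned above can be checked with plain NumPy. This is only a sketch with a single linear layer and made-up sizes (`N`, `d`, `k`, `W` and `b` are illustrative names, not part of the models in this lesson): applying the same weights to each word separately gives the same result as one matrix product over the whole document, and the number of parameters does not depend on the document length.

```python
import numpy as np

rng = np.random.RandomState(0)
N, d, k = 4, 3, 2               # words per document, embedding size, outputs per word
W = rng.normal(size=(d, k))     # one weight matrix, shared by every word
b = rng.normal(size=k)          # one bias vector, shared by every word
doc = rng.normal(size=(N, d))   # a toy "embedded document"

# apply the same tiny linear model to each word, one at a time
per_word = np.stack([np.dot(word, W) + b for word in doc])

# applying it to the whole document at once gives the same N x k matrix
all_at_once = np.dot(doc, W) + b

print(np.allclose(per_word, all_at_once))
print('shared parameters:', W.size + b.size)  # independent of N
```

A longer document only adds rows to the output; the weights stay the same.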

Let's consider sentiment analysis again. For the network to correctly classify the incoming inputs, it will have to learn that if `not` precedes a positive word such as `great`, the signal is actually negative. **However, since we have flattened the input, it will need to learn this over the whole vector, as if the components were isolated from one another!**

An alternative could be to feed the model words in pairs, so that we slide through all the bigrams as we go. Then, our word level model would produce word representations, which would then be fed into a bigram model and, finally, flattened and processed by a final network. Of course, when programming, this generalizes to any arbitrary ngram easily:

In [None]:
from keras.layers import Reshape, Activation

# create the ngram-level model
ngram_size = 2
ngram_flat_size = ngram_size * word_out_size
ngram_out_size = 3
ngram_out_activ = 'linear' # do nothing with the output
ngram_hidden_sizes = [10, 8]
ngram_hidden_activs = ['tanh', 'tanh']

print('N-gram model summary: ')
ngram_model = make_network(ngram_flat_size, ngram_out_size, ngram_hidden_sizes, ngram_hidden_activs, ngram_out_activ)
ngram_model.summary()
print('\n\n')

# create the document model
ng_doc_input = Input((document_words, ngram_size, embedding_size)) 
ng_bdoc_in_words = word_model(ng_doc_input) # word model over each word
ng_doc_flat = Reshape((document_words, ngram_flat_size))(ng_bdoc_in_words) # flatten the words in each ngram
ng_doc_in_ngrams = ngram_model(ng_doc_flat) # ngram model over the ngrams
ng_doc_flat = Flatten()(ng_doc_in_ngrams) # flatten the document
ng_doc_dense = Dense(20, activation='relu')(ng_doc_flat)
ng_doc_dense = Dense(20, activation='relu')(ng_doc_dense)
ng_doc_output = Dense(1, activation='sigmoid')(ng_doc_dense)

print('Document model summary: ')
ng_doc_model = Model(inputs=ng_doc_input, outputs=ng_doc_output)
ng_doc_model.summary()

The obvious problem with this approach is that we are hard-coding a specific kind of structure. Instead, we would like our architecture to generally learn those representations. Yes, we may still need to perform some architecture engineering, **but the pipeline for the input data would always be the same!** By following this approach, our only task would be to appropriately model the structure of our problem with our network.

After these preliminaries, we can see that vanilla neural networks are not enough for this—or rather, we need a way to generalize our ngram idea! 

# Dealing with structure: Convolutional Neural Networks

In a sense, we have hinted at our intentions all the way through. After all, why should a neural network take up a whole document, image or song as input? **Doesn't it make more sense to have small networks that learn the relationships between the parts that make up the whole?**

This simple idea is the fundamental leap with Convolutional Neural Networks (or CNNs for brevity). In short, what we try to learn are filters over a structured input. This structure can be plenty of things: a sequence of words, a matrix of colors, the frames of a video... The idea is that our neural network will learn to capture some property in the parts that form the input. Then, we will aggregate and compress the representation to produce a higher understanding of what we are dealing with. **By stacking layers upon layers that encode and compress with this structure, we reach the point where we can easily classify it.**

What we have presented is best understood with a couple of pictures. The terms we use to describe CNNs come from the field of computer vision, so we will first show an example of their application to image classification:

<img src="images/ImageIntoCNN.png" width="1000" alt="A single diagram explaining how a convolutional neural network will turn a structured input into many unstructured vector inputs."></img>

A convolutional neural network learns filters over small parts of the whole. In our picture example, this means little blocks of $3 \times 3$ pixels. With these filters, we can aggregate the information from the concrete into the abstract: from lines we go into shapes, from shapes into textures, from textures we see materials and from materials we learn to classify concrete objects. **The networks that we use to process the groups of information at each level are the simplest amongst what we have seen**: single layer neural networks with a simple activation such as ReLU. The input for these little networks is the vector shown at the rightmost side of the image, which is itself the flattened version of the third pixel patch on the top row.

Typically, we introduce additional operations in the middle to further simplify our problem. These operations add another degree of non-linearity over the structure our networks process without adding additional learnable parameters. An example of this is computing the average or maximum over chunks, which is called pooling. The combination of the convolutional layers, which learn filters, and the pooling layers, which aggregate over them, is applied repeatedly. **At some point, the output is reduced to a few features that, hopefully, capture the nature of our problem and can be flattened and passed to a vector-hungry neural network.**  The general principles of the underlying machinery remain the same. 
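To make pooling concrete, here is a minimal NumPy sketch that aggregates non-overlapping chunks of a small matrix. The helper `pool2d` is a made-up name for illustration and is not how Keras implements its pooling layers; it only shows the arithmetic.

```python
import numpy as np

def pool2d(image, size, reduce_fn):
    '''Aggregates non-overlapping (size, size) chunks of a 2D array with reduce_fn.'''
    h, w = image.shape
    # drop trailing rows and columns that do not fill a complete chunk
    h, w = h - h % size, w - w % size
    # axis 1 and axis 3 index the position inside each chunk
    chunks = image[:h, :w].reshape(h // size, size, w // size, size)
    return reduce_fn(chunks, axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(image, 2, np.max)
avg_pooled = pool2d(image, 2, np.mean)

print(image)
print(max_pooled)   # largest value in each 2 x 2 chunk
print(avg_pooled)   # mean of each 2 x 2 chunk
```

Note that neither operation has weights to learn: the $4 \times 4$ input is simply reduced to a $2 \times 2$ summary.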

The following is an artificial example of what we mean. We generate circles in a small image, with small variations in size and position. Those in the bottom left half of the image are positive examples (red), while those in the top right half are negative (blue):

In [None]:
%matplotlib inline
import numpy as np
from math import pi, atan2
from scipy.ndimage import gaussian_filter
import matplotlib.pyplot as plt


def make_circle_batch(N, d, r, pos=None, rad=None):
    '''Makes a dataset of N images, sized d * d
    containing circles with a smooth radius of up to r. 
    
    The circles are positive if they are in the 
    bottom left side of the image, negative otherwise.'''
    # select the points
    z = np.zeros((N, d, d))
    n = np.random.randint(0, d, (N, 3))
    n[:, 0] = np.arange(N)
    if pos is not None:
        n[:, 1:] = pos
    z[n[:, 0], n[:,1], n[:,2]] = 1
    
    # compute the class
    c = [1 if atan2(y, x) < pi / 4 else 0 for (x, y) in n[:, 1:]]
    
    # smoothen out the points
    r = max(2, r)
    for (i, f) in enumerate(z):
        g = np.random.randint(1, r) if rad is None else rad
        z[i] = gaussian_filter(f, g)
    return z, c


z, c = make_circle_batch(40, 64, 10)

# show the circle examples as a (4, 10) grid
rows = 4
columns = 10
plt.figure(figsize=(5, 2), dpi=160, facecolor='w', edgecolor='k')
for i in range(rows):
    for j in range(columns):
        index = i + j * rows
        plt.subplot(rows, columns, index + 1)
        label = c[index]
        color = 'Reds' if label else 'Blues'
        plt.imshow(z[index], cmap=color)
        plt.axis('off')
        
print('A look at the points we want to classify, colored as described before: ')
plt.show()

In [None]:
def circle_generator(batch_size, side_size, max_radius):
    '''Wraps our circle image function as a generator for training.'''
    while True:
        X, y = make_circle_batch(batch_size, side_size, max_radius)
        yield np.asarray(X).reshape(batch_size, side_size, side_size, 1), np.asarray(y)

Now onto the convolutional model we go! Convolutional networks are simple. As we said, at a given representation level we apply a series of filters of a given size, and then usually aggregate their results. In some cases, the windows for the filters may fall out of the ranges of the input. If they do, we can choose to discard them altogether or pad the values outside in some way. We can also choose to move the windows over each block in the input, or instead perform larger jumps; the size of the jump is called the `stride`. The best way to understand this is by looking at the following picture:

<img src="images/ConvStridesPadding.gif" width="280" alt="A convolutional filter being applied to an image. The filter is slid across the whole image, in intervals that may skip some pixels."></img>

Here we have a structured input, perhaps the image of a tree on a sunny day for a classifier that tells the seasons of the year apart. We slide a $3 \times 3$ pixel filter over the image (shaded overlay) to extract a series of features on a patch of the image (represented in cyan, with various features per 'piece'). The filters will be applied on all patches by sliding them over the whole image. This sliding can leap over pixels, as seen in the image. In this case we have a stride of 2, so the window jumps two pixels at a time, since the information one pixel apart is often not very significant. **Of course, the result of applying our filter over the image also has a spatial structure, so we can apply the same building blocks several times to learn increasingly complex patterns!**

So far, we have seen that Convolutional Networks are nothing fancy—we just apply the network over little groups of structured inputs! To really settle it down, let's look at some code:

In [None]:
from keras.layers import Convolution2D
from keras.layers import MaxPool2D, AvgPool2D

# training params
batch_size = 256
side_size = 40
max_radius = 6
batches_per_epoch = 100
num_epochs = 3
pooling_function = MaxPool2D

# our image is a square image with a single 'color'
def make_image_model(num_epochs):
    conv_input = Input((side_size, side_size, 1)) 

    # 50 filters, in windows of (3, 3) 
    conv_1 = Convolution2D(50, (3, 3), name='conv_1', padding='same')(conv_input) 

    # pool over a (2, 2) window (2 is shorthand for a square window!)
    pool_1 = pooling_function(2, name='pool_1')(conv_1) 

    # 30 filters, in windows of (2, 2), strided by (2, 2) so that we jump over every odd pixel
    conv_2 = Convolution2D(30, (2, 2), name='conv_2', strides=(2, 2))(pool_1)

    # business as usual from here!
    pool_2 = pooling_function(2, name='pool_2')(conv_2)
    conv_3 = Convolution2D(15, (2, 2), name='conv_3')(pool_2)
    pool_3 = pooling_function(2, name='pool_3')(conv_3)
    flatten = Flatten(name='flat')(pool_3)
    dense = Dense(20, activation='tanh')(flatten)
    output = Dense(1, activation='sigmoid')(dense)

    # compile the model as a binary classification problem
    conv_model = Model(inputs=conv_input, outputs=output)
    conv_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    conv_model.summary()

    # train on the generator
    train_gen = circle_generator(batch_size, side_size, max_radius)
    conv_model.fit_generator(train_gen, steps_per_epoch=batches_per_epoch, epochs=num_epochs)
    
    return conv_model

fully_trained_image_model = make_image_model(num_epochs)

We have managed to build our near perfect circle classifier in a very short time! Of course, this is a very simple problem: there is no noise of any kind and the two classes are linearly separable. **However, it already shows us how the network has to learn to extract the structure to a degree: in order to correctly classify a point, we have to know whether its center lies in the bottom left half of the image or not.** As such, the network has to extract the most salient value and identify where on the image it lies, and then accurately predict if we are dealing with a lower-left point or an upper-right one!

The code comes with a few surprises: we have new layers that we have not used before! They are pretty straightforward if you have followed the description up to this point: a **Convolution2D** layer applies a number of 2D filters over its input. A pooling layer computes some fixed aggregation function over a window: we have used **MaxPool2D**, which computes the maximum over a 2D window. The most common pooling functions are the maximum and the average.
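It can help to follow how the window size, stride and padding shrink the image as it moves through the layers. The sketch below uses the standard output-size arithmetic for `valid` and `same` padding; the helper `conv_output_size` is a made-up name, and it assumes that the pooling layers stride by their own window size with `valid` padding, which is the Keras default. The layer settings are copied from the image model above.

```python
import math

def conv_output_size(input_size, window, stride=1, padding='valid'):
    '''Output length along one axis of a convolution or pooling layer.'''
    if padding == 'same':
        # the input is padded so that every stride position yields an output
        return int(math.ceil(input_size / float(stride)))
    # 'valid': only windows that fit entirely inside the input are kept
    return (input_size - window) // stride + 1

# (name, window, stride, padding) for each layer of the image model
layers = [
    ('conv_1', 3, 1, 'same'),
    ('pool_1', 2, 2, 'valid'),
    ('conv_2', 2, 2, 'valid'),
    ('pool_2', 2, 2, 'valid'),
    ('conv_3', 2, 1, 'valid'),
    ('pool_3', 2, 2, 'valid'),
]

size = 40   # side_size of the input image
sizes = []
for name, window, stride, padding in layers:
    size = conv_output_size(size, window, stride, padding)
    sizes.append(size)
    print('{}: {} x {}'.format(name, size, size))
```

You can compare these sizes against the output shapes printed by `conv_model.summary()`.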

Let's take an interactive look at our model before jumping into natural language. We will generate sample images and try to find an instance that gets misclassified. For simplicity, retrain the model with just 1 training epoch in the code above. 

In [None]:
from ipywidgets import interact, widgets


image_model = make_image_model(1)
def draw_interactive_circle(x, y, r):
    '''Plot an image and show its class as:
    
    red or blue if we classify it correctly,
    green otherwise.'''
    x, y = max(0, min(side_size - 1, x)), max(0, min(side_size - 1, y))
    i, c = make_circle_batch(1, side_size, 10, [y, x], r)
    inp = i.reshape(1, side_size, side_size, 1)
    p = image_model.predict(inp).flatten().round()
    
    # red or blue if we predict correctly, green otherwise!
    true_color = 'Reds' if c[0] else 'Blues'
    disp_color = 'Greens' if p[0] != c[0] else true_color
    plt.imshow(i[0], cmap=disp_color)

half_side_size = side_size // 2  # integer division: the widgets expect integer values
interact(draw_interactive_circle, 
         x=widgets.IntText(value=half_side_size, step=1, description='X: '),
         y=widgets.IntText(value=half_side_size, step=1, description='Y: '),
         r=widgets.IntText(value=5, step=1, min=0, description='Radius: '));

Even with just 1 epoch of training, you probably needed some moving around to find a point that the model didn't classify correctly! Of course, we have to stress yet again that we are dealing with a very basic toy problem. **In any case, the results should convince you that CNNs are very powerful and widely applicable.**

How does this toy example apply to the written word? After all, we care about Natural Language Processing—and text is a different beast altogether! **In a text, words may influence words that appear before or after them, very far away, in subtle ways.** Some patterns are hard even for humans, as often happens with humor or irony!

**Ideally, we would be looking at real world examples for everything, but it just so happens that image classification problems tend to take a good amount of time to train!** In the case of text, we can just think of our embedded document matrix, the one that we discussed before for the ngram processing approach. **Instead of processing the ngrams that we have defined ourselves, the network will learn filters over windows of tokens.** The windows will be of a fixed size, but they will be able to learn the relationships between the embeddings in those windows, **and that includes unigrams, bigrams, and any sort of ngram up to the window length!**

Further, the beauty of this trick is that we can stack it up, just as we did with our circle classifier. Even if the first layers only capture relationships over groups of 3 words, we can apply another set of filters on top! **This lets us capture relationships that are farther away, weighting each of the words appropriately.** As always, we prefer drawings over explanations, so let's just take a look at what this would look like:

<img src="images/WordConvolution.png" width="700" alt="An idealized convolutional network for text: we apply filters over n-grams, which we then aggregate."></img>

# Convolutional Neural Networks for NLP!

**We are applying the same idea as with the images, but this time our structure is composed of word embeddings, laid out in a matrix!** After computing the filters on the n-grams, we typically apply a pooling function to aggregate them, just the same. Finally, to get the model to learn the details of how words play with one another, we will usually repeat this structure several times over the learned and aggregated filters.
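As a rough illustration of what a filter over a window of word embeddings computes, here is a NumPy sketch of a one-dimensional convolution without padding. The function `conv1d_valid`, the random "document" and the sizes are all made up for the example; a real layer would learn the filter weights instead of drawing them at random.

```python
import numpy as np

def conv1d_valid(embedded, filters, bias):
    '''Slides filters over windows of consecutive word vectors.

    embedded: (N, d) document matrix
    filters:  (window, d, k) weights for k filters
    bias:     (k,) one bias per filter
    returns:  (N - window + 1, k) features, one row per window position'''
    window = filters.shape[0]
    steps = embedded.shape[0] - window + 1
    out = np.empty((steps, filters.shape[2]))
    for t in range(steps):
        patch = embedded[t:t + window]   # the words covered by this window
        # each filter is a weighted sum over every component of every word in the patch
        out[t] = np.tensordot(patch, filters, axes=([0, 1], [0, 1])) + bias
    return out

rng = np.random.RandomState(0)
doc = rng.normal(size=(7, 3))          # 7 words, 3 features per word
filters = rng.normal(size=(5, 3, 6))   # 6 filters over windows of 5 words
bias = np.zeros(6)

features = conv1d_valid(doc, filters, bias)
pooled = features.mean(axis=0)         # average pooling over the whole document

print(features.shape)
print(pooled.shape)
```

Stacking a second call to `conv1d_valid` on top of `features` would combine information from windows that are further apart in the text.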

So far, we have been looking at hypotheticals and toy examples. **It is time to jump right into the action with a new sentiment analysis dataset!** We will use Keras' dataset module, which bundles a few useful small, real world datasets that you can generally use to prototype. Let's see what that looks like:

In [None]:
from keras.datasets import imdb

start_index = 3 
(X_train_raw, y_train), (X_test_raw, y_test) = imdb.load_data(index_from=start_index)

# build the word-to-index mapping, including the 3 'artificial' tokens
word_mapping = imdb.get_word_index()
word_mapping = {k: v + start_index for (k, v) in word_mapping.items()}
word_mapping["<PAD>"] = 0
word_mapping["<START>"] = 1
word_mapping["<UNK>"] = 2

# inverse the mapping for when we intend to print an example
inv_mapping = {v: k for (k, v) in word_mapping.items()}
vocab_size = max(word_mapping.values()) + 1  # index 3 is unused, so len(word_mapping) would be one short

In [None]:
label = ['negative', 'positive'][y_train[0]]
doc = ' '.join([inv_mapping[w] for w in X_train_raw[0]])

print('An example review, labelled "{}", is: \n\n{}'.format(label, doc))

Since our focus is to learn about convolutional networks, this time we won't be loading any embeddings. Instead, and remembering our work in the previous lesson, **we will train a sentiment classifier that tries to capture information about words in windows!** This will also help us come up with a different way of producing a `Bag of Words` using the embedding layer.

We have to create a convolutional network that takes an IMDB review of up to $N$ words and correctly classifies it. Each word will be represented by its word index. **This means that our input will simply be a vector of $N$ elements, where we will pad with zeros if the review is shorter.** In our case, we will stick to $N = 300$. Similarly to what we saw with the circle classifier, we will stack up a hierarchy of layers so that we can capture the interactions from nearby words. Let's get hacking:

In [None]:
maxlen = 300
from keras.preprocessing import sequence

# pad the sequences up to a maximum length
X_train = np.asarray(sequence.pad_sequences(X_train_raw, maxlen=maxlen))
X_test = np.asarray(sequence.pad_sequences(X_test_raw, maxlen=maxlen))
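To make the padding step concrete, here is a pure-Python sketch of what happens to a single review. It assumes the default behaviour of `pad_sequences`: zeros are added at the front, and reviews longer than `maxlen` lose their earliest tokens. The helper `pad_to_length` is a made-up name for illustration and expects a positive `maxlen`.

```python
def pad_to_length(tokens, maxlen, pad_value=0):
    '''Pads at the front, or keeps only the last maxlen tokens.'''
    tokens = list(tokens)[-maxlen:]   # truncate from the front if too long
    return [pad_value] * (maxlen - len(tokens)) + tokens

short_review = pad_to_length([5, 6, 7], 5)
long_review = pad_to_length([1, 2, 3, 4, 5, 6, 7], 5)

print(short_review)
print(long_review)
```

Every review ends up with exactly `maxlen` word indices, which is what lets us stack them into a single matrix.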

In [None]:
from keras.layers import Embedding
from keras.layers import Convolution1D
from keras.layers import MaxPool1D, AvgPool1D, GlobalMaxPool1D, GlobalAvgPool1D


# build the convolutional sentiment analysis model
inpt = Input((maxlen,))
embs = Embedding(vocab_size, 3, name='embedding')(inpt)
cv_1 = Convolution1D(6, 5, name='sent_conv_1', padding='same')(embs)
cv_2 = Convolution1D(1, 3, name='sent_conv_2', padding='same')(cv_1)
pool = GlobalAvgPool1D(name='global_avg_pool')(cv_2)
outp = Dense(1, activation='sigmoid')(pool)
conv_sent_model = Model(inputs=inpt, outputs=outp)
conv_sent_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
conv_sent_model.summary()

# fit the model
conv_sent_model.fit(X_train, y_train, epochs=20, batch_size=512, validation_data=(X_test, y_test))

The model we have created is simple: our embedding layer learns 3 features for every word in our corpus. Then, we compute 6 different filters over groups of 5 words to capture the interactions between words. Then, we compute a final filter over groups of 3 adjacent outputs from the previous layer and average out the results. Finally, we capture the predictions with a single sigmoid output.

How have we achieved this? Let's go step by step, as we have introduced a few new layers and defined a somewhat different model. The layers work in the same way as our image processing example, but we will go over the details to make sure we are not missing anything:

1. We are using **Convolution1D** layers—and as their name goes, they only run over a one-dimensional structure!
2. Similarly to the image classification problem, **we have pooling functions to apply over the 1D patches**. We can apply either **Average or Maximum pooling**. Since we want to capture the overall sentiment of the document while keeping the model very simple, **we use Global pooling, which means applying the pooling operation over the whole structure!**
3. Most of our parameters are coming from the embedding layer—the features we learn for each word. **The convolutional part of our model takes just a little over a hundred parameters!**
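The parameter counts in the last point can be verified by hand. The sketch below uses the usual formulas (one weight per window position, input channel and filter, plus one bias per filter); the helper names are made up, and the 1,000-word vocabulary is a hypothetical figure used only to show how quickly the embedding layer dominates.

```python
def conv1d_params(window, in_channels, filters):
    '''Weights for every (position, input channel, filter) plus one bias per filter.'''
    return window * in_channels * filters + filters

def dense_params(inputs, units):
    '''One weight per (input, unit) pair plus one bias per unit.'''
    return inputs * units + units

def embedding_params(vocabulary, features):
    '''One vector of learned features per word index.'''
    return vocabulary * features

conv_1 = conv1d_params(window=5, in_channels=3, filters=6)   # sent_conv_1
conv_2 = conv1d_params(window=3, in_channels=6, filters=1)   # sent_conv_2
output = dense_params(inputs=1, units=1)                     # final sigmoid unit

print('convolutional layers:', conv_1 + conv_2)
print('output layer:', output)
# hypothetical vocabulary of 1,000 words, 3 features each
print('embedding for 1,000 words:', embedding_params(1000, 3))
```

The exact embedding count for our model depends on `vocab_size`, and is the first figure reported by `conv_sent_model.summary()`.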

Finally, we have to discuss padding. We have set up the convolutional layers with `same` as their padding mode. This means that the learned filters will be computed with windows that will slide through the whole input, including any necessary zero padding to cover the edges. Alternatively, we could have used `valid`, which would have forced the windows to only contain elements that lie within the input. The two padding modes are more easily understood through images:

<img src="images/ConvPadding.png" width="400" alt="Same and valid padding illustrated."></img>

In the image, 2 convolution windows of 3 elements with a stride of 3 (that is, one after the other and without overlaps!) process a 5-element input. `valid` padding will mean we only take the output of the first, green window because the second, red one includes an element which lies out of the input. If we use `same` as our padding mode, we will take both, padding the missing item with a vector of zeros.
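The same situation can be reproduced in a few lines of Python. The helper `sliding_windows` is a made-up name for illustration; for `same` padding it assumes the TensorFlow convention of splitting the padding between both ends and placing any odd leftover at the end of the input.

```python
def sliding_windows(values, size, stride, padding='valid'):
    '''Lists the windows a filter would see over a 1D input.'''
    values = list(values)
    if padding == 'same':
        n_out = -(-len(values) // stride)   # ceiling division
        total_pad = max((n_out - 1) * stride + size - len(values), 0)
        left = total_pad // 2               # assumption: extra padding goes at the end
        values = [0] * left + values + [0] * (total_pad - left)
    return [values[i:i + size] for i in range(0, len(values) - size + 1, stride)]

inputs = [1, 2, 3, 4, 5]
valid_windows = sliding_windows(inputs, size=3, stride=3, padding='valid')
same_windows = sliding_windows(inputs, size=3, stride=3, padding='same')

print(valid_windows)   # only the window that fits inside the input
print(same_windows)    # the second window is completed with a zero
```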

We said we were done with toy examples but this instance of sentiment analysis remains a relatively simple problem. This is in part because we only have to learn if a word is mostly positive or negative, with no degrees in between. **Although in some cases sentence structure plays a role in the final classification for shorter texts, it is not a very good way of showing us the power of our fancy Convolutional Neural Networks.** Let's introduce a new problem: a problem with multiple separate classes, so that the overall task cannot easily be reduced to just a single value. **In particular, we will work with the 20 newsgroups dataset.**

This dataset contains messages posted to... newsgroups! Each of the newsgroups represents a different topical category, with a sort of language of its own. The task is to predict, given a document, the newsgroup it belongs to, and there are 20 to choose from! Let's get started:

In [None]:
import io
from collections import defaultdict

# load the 20 newsgroups dataset, simplified
groups = defaultdict(list)
with io.open('../Datasets/20news_simplified.tsv', mode='r', encoding='utf8') as f:
    for l in f:
        label, content = l.split(u'\t', 1)  # split on the first tab only
        content = content.strip()
        # filter out the line of the header and empty content!
        if label != 'label' and content:
            groups[label].append(content)

print('There are {} classes, with a total of {} non-empty documents.'.format(len(groups), sum(map(len, groups.values()))))
print('The classes have the following amounts of documents:\n')
print(u'\n'.join([u' • {}: {}'.format(c, len(d)) for (c, d) in groups.items()]))

The original dataset we are using can be found [here](http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html). We load a preprocessed version, prepared for this course, that already normalizes the text so that it is suitable for us to get going. If you want to learn how we did it, you can get the preprocessing script [here](../Datasets/prepare_newsgroups.py). 

It is a good time for a reminder: all throughout this course we have dealt with data that was in one way or another preprocessed for convenience. **In the real world, it is often the case that good data preparation takes the most effort!** Sadly, there is no good way of learning this through a course—you sharpen the blade by trying to get through as many problems as you can!

Let's prepare our training and validation splits and take a look at our data. Even with the preprocessing we performed, there will be obvious things we might still want to deal with. Let's code that up:

In [None]:
train_split_pct = 0.8

# the complete dataset is a sequence of (label, document) tuples
data = []
for label, documents in groups.items():
    for d in documents:
        data.append((label, d))

# shuffle the data using a fixed seed
np.random.seed(7)
np.random.shuffle(data)

# let's partition the data
total_docs = len(data)
total_train_docs = int(round(total_docs * train_split_pct))

data_train = data[:total_train_docs]
data_valid = data[total_train_docs:]
print('We split the data so that we have '
      '{} training documents and {} validation documents.\n'.format(
          len(data_train), 
          len(data_valid)))

# and look at a couple of examples
num_samples = 3
max_tokens = 250
print('\n\n'.join(['\t Document class: {} \n\n  {}'.format(l, ' '.join(d.split()[:max_tokens])) 
                   for l, d in data[:num_samples]]))

Just at a first glance it seems obvious we may have some problems with the usual subjects: numbers, tokens produced by the removal of punctuation and words specific to the dataset, such as newsgroup names. **We will do away with our worries and try to see if word embeddings stand up to the task!**

With this in mind, we will prepare a convolutional model using pretrained word embeddings. This time, we will be using the [GloVe 6B English embeddings](https://nlp.stanford.edu/projects/glove/) for the 20000 most common words in the dataset. Let's get that going:

In [None]:
from collections import Counter
from keras.utils import to_categorical

# compute the label mappings
label_mapping = {l: i for i, l in enumerate(groups.keys())}
inv_label_mapping = {v: k for k, v in label_mapping.items()}
total_classes = len(label_mapping)

# compute the most common words of the dataset
PAD_TOKEN = 0
NUM_WORDS = 20000
EMB_SIZE = 100
counts = Counter([t for (_, d) in data for t in d.split(' ')])
top_words = counts.most_common(NUM_WORDS)
word_mapping = {u'<PAD>': PAD_TOKEN}
for w, c in top_words:
    word_mapping[w] = len(word_mapping)
total_word_indices = len(word_mapping)

In [None]:
# load up the word embeddings
gv_emb_matrix = np.zeros((total_word_indices, EMB_SIZE))
with io.open('../Embeddings/glove.6B.100d.txt', 'r', encoding='utf-8') as fp:
    for l in fp:
        toks = l.strip().split(' ')
        word = toks[0]
        if word in word_mapping:
            # copy the pretrained vector into the row for this word;
            # words missing from GloVe keep their all-zeros row
            i = word_mapping[word]
            gv_emb_matrix[i] = [float(x) for x in toks[1:]]
            
# preprocessing function to get X, y
def vectorize_data(dataset, label_mapping, word_mapping, max_length=1000):
    X = np.asarray(sequence.pad_sequences(
                    [[word_mapping.get(t, PAD_TOKEN) for t in d.split(' ')]
                      for _, d in dataset], max_length), dtype=np.int32)
    y = to_categorical(np.asarray([label_mapping[l] for l, _ in dataset]))
    return X, y

# prepare the vectorized datasets
max_length = 1000
X_20ns_train, y_20ns_train = vectorize_data(data_train, label_mapping, word_mapping, max_length)
X_20ns_valid, y_20ns_valid = vectorize_data(data_valid, label_mapping, word_mapping, max_length)

The last step is to prepare our convolutional model. In this case, we are dealing with a problem where we have to predict a class out of many, so we will use `categorical_crossentropy` as our loss function. We will set up the model to use the pretrained GloVe embeddings, allowing them to be further trained as the learning goes. **Finally, we will try to avoid overfitting using Dropout layers.** We will describe those right after coding up our model:

#### Evaluating model parameters [[Exercise 1]](#Exercise-1:-Evaluating-model-parameters.)

In [None]:
from keras.layers import Dropout
from keras.initializers import Constant


def train_conv_20news(dropout_rate, use_weights, train_embeddings):
    '''Trains a convolutional neural network on the 20 newsgroups dataset.'''
    print(u'Training model with params:')
    print('')
    print(' • Dropout rate: {}'.format(dropout_rate))
    print(' • GloVe weights: {}'.format(use_weights))
    print(' • Retraining weights: {}'.format(train_embeddings))
    
    inp = Input((max_length,))
    
    # initialize the embeddings either with uniform floats or the pretrained weights
    emb_init = Constant(gv_emb_matrix) if use_weights else 'uniform'
    emb = Embedding(total_word_indices, 
                    EMB_SIZE, 
                    embeddings_initializer=emb_init, 
                    trainable=train_embeddings, 
                    name='embedding')(inp)

    # build the convolutional classification model
    cnv = Convolution1D(128, 5, activation='relu')(emb)
    cnv = MaxPool1D(5)(cnv)
    cnv = Convolution1D(128, 5, activation='relu')(cnv)
    cnv = MaxPool1D(5)(cnv)
    cnv = Dropout(dropout_rate)(cnv)
    cnv = Convolution1D(128, 5, activation='relu')(cnv)
    cnv = GlobalMaxPool1D()(cnv)  
    dns = Dense(128, activation='relu')(cnv)
    dns = Dropout(dropout_rate)(dns)
    out = Dense(total_classes, activation='softmax')(dns)
    conv_20ns_model = Model(inputs=inp, outputs=out)
    conv_20ns_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    conv_20ns_model.summary()

    # fit the model
    history = conv_20ns_model.fit(X_20ns_train, y_20ns_train, epochs=10, batch_size=100, validation_data=(X_20ns_valid, y_20ns_valid))
    return conv_20ns_model, history


# run the model with the following parameters
dropout_rate = 0.1
use_weights = True
train_embeddings = True

conv_20ns_model, conv_hist = train_conv_20news(dropout_rate, use_weights, train_embeddings)

The model we have designed does not surprise us! **We have simply put together a few of the layers we already know, our usual suspects.** Our convolutions this time are applied using `valid` padding, which we can easily see in the length of the output shapes changing. After stacking our bunch of convolutions, we build a final multi-class classifier. Remember that when we want to select a class out of many, we typically use a softmax output and categorical crossentropy as our loss function. **This way we build a model that predicts the individual probability of each class, in hopes that the highest probability will belong to the actual class!** 
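
The effect of `valid` padding on the output lengths can be checked by hand. The sketch below is a minimal calculation, assuming non-overlapping pooling windows (a stride equal to the pool size) and the `max_length` of 1000 used above; the helper names are made up for illustration:

```python
def conv1d_valid_length(length, kernel_size):
    # 'valid' padding: the kernel only visits positions where it fits entirely
    return length - kernel_size + 1


def max_pool_length(length, pool_size):
    # non-overlapping pooling windows; any leftover positions are dropped
    return length // pool_size


# follow the sequence length through the conv/pool stack of the model
lengths = [1000]  # max_length in the lesson
for layer, size in [('conv', 5), ('pool', 5), ('conv', 5), ('pool', 5), ('conv', 5)]:
    step = conv1d_valid_length if layer == 'conv' else max_pool_length
    lengths.append(step(lengths[-1], size))

print(lengths)  # [1000, 996, 199, 195, 39, 35]
```

The final global max pooling then collapses the remaining 35 positions into a single vector with one value per filter.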

We have included a new layer in the mix, though. This layer, called Dropout, is used to help our neural network generalize better. Dropout is one of many regularization techniques that improve generalization. Here, regularization means exactly that: achieving more general models. **Regularization techniques are helpful because very often our neural networks have so many parameters that they will pay too much attention to the training examples and overfit on them.** This means that the model they implement is not very general, because it will not apply beyond the training set. Using Dropout mitigates this by randomly dropping units at training time. Here, dropping means that they will not learn and that, instead, they will just pass a constant signal (typically zero). **This allows the model to learn more robust representations, since there must be multiple avenues for a signal to really be taken into account!** Dropout is easy to understand visually:

<img src="images/DropoutDiagram.png" width="900" alt="An example application of Dropout, showing the skipped units in a deep neural network with 2 hidden layers. Dropped units will simply forward zeros, i.e., produce no tangible output."></img>
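
To make the idea concrete, here is a minimal NumPy sketch of dropout applied to a batch of activations. It assumes the common "inverted" variant, where the surviving units are rescaled at training time so that nothing needs to change at prediction time; the function and its arguments are illustrative, not the Keras implementation:

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    '''Inverted dropout: zero a random subset of units and rescale the rest.'''
    if not training or rate == 0.0:
        # at prediction time every unit is kept untouched
        return activations
    keep_prob = 1.0 - rate
    # each unit survives independently with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    # rescaling keeps the expected value of every unit unchanged
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
acts = np.ones((4, 8))
dropped = dropout(acts, rate=0.5, rng=rng)
print(dropped)
```

With a rate of 0.5, every entry of `dropped` is either 0 (a dropped unit) or 2 (a kept unit, rescaled by `1 / 0.5`).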

Of course, dropping connections should make us think about the number of parameters in our model. **It should be clear from the picture that a model where we drop too many connections will not be able to learn.** Our model has 2 million parameters for the embeddings, but its convolutional components amount to just under a quarter million. Furthermore, we have kept the embeddings trainable. **Our embeddings being trainable means that they may not be shared with other tasks and that the pretrained vectors just act as background information when we begin training.** Is this setup enough to capture the classification problem? How much performance are we actually getting out of retraining the embeddings? Should we use a higher or lower dropout rate? 
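
The parameter counts quoted above can be reproduced with a bit of arithmetic. This is a sketch under two assumptions that are not stated explicitly in the code: that the dataset has at least 20000 distinct tokens (so the vocabulary has 20001 entries including `<PAD>`) and that there are 20 classes, one per newsgroup:

```python
def conv1d_params(in_channels, filters, kernel_size):
    # one kernel of shape (kernel_size, in_channels) per filter, plus one bias each
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    # a full weight matrix plus one bias per output unit
    return in_units * out_units + out_units


vocab_size = 20000 + 1  # NUM_WORDS plus the <PAD> entry (assumed)
emb_size = 100          # EMB_SIZE in the lesson
num_classes = 20        # assumed: one class per newsgroup

embedding = vocab_size * emb_size
classifier = (conv1d_params(emb_size, 128, 5)
              + conv1d_params(128, 128, 5)
              + conv1d_params(128, 128, 5)
              + dense_params(128, 128)
              + dense_params(128, num_classes))

print('Embedding parameters: {:,}'.format(embedding))    # 2,000,100
print('Classifier parameters: {:,}'.format(classifier))  # 247,316
```

Pooling and Dropout layers do not appear in the sum because they have no weights to learn.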

Generally, it is a good idea to answer those questions through experiments. Answering the first question will depend on what we expect from our models. Indeed, **"enough" will depend on how we measure performance and how far we want to get!** With that in mind, how do we measure how well we are classifying for a problem in which we have multiple classes? So far we have relied on accuracy, interpreted as the fraction of test inputs for which we get the prediction correct. **However, this measure poses some problems: if our dataset is not balanced, a model that predicts just a single class might seem good enough!** Likewise, in a multiclass setting, we would want to have a breakdown of how well we are doing for each class, as we might underperform for some kinds of inputs, and we would want to improve on them!
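
A tiny, made-up example shows how accuracy can mislead on unbalanced data. The labels below are invented for illustration and have nothing to do with the newsgroups dataset:

```python
# toy validation set: 95 documents of class 0 and only 5 of class 1
y_true = [0] * 95 + [1] * 5
# a "model" that ignores its input and always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
found_minority = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print('Accuracy: {:.2f}'.format(accuracy))
print('Class 1 documents found: {} out of {}'.format(found_minority, y_true.count(1)))
```

The accuracy looks excellent, yet not a single document of the smaller class is recognized, which is exactly what a per-class breakdown would reveal.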

**For this purpose, we may use precision, recall and their combination, F1 score.** Precision is the proportion of correctly classified instances out of all instances we say belong to a class. Recall is the proportion of correctly classified instances of a class given all instances in that class. Finally, the F1 measure is the harmonic mean of precision and recall, and can be understood as an aggregate metric that reduces both to a single value. **These metrics give us a per-class understanding of how well our model captures the class and how much it misses elements from it.** Let's compute them for our model:

In [None]:
from sklearn.metrics import precision_recall_fscore_support

model_preds = conv_20ns_model.predict(X_20ns_valid)
y_valid_true = y_20ns_valid.argmax(axis=1)
y_valid_pred = model_preds.argmax(axis=1)
prec, rec, f1, support = precision_recall_fscore_support(y_valid_true, y_valid_pred)

print('Average per-class precision: {0:.2f}'.format(100 * prec.mean()))
print('Average per-class recall: {0:.2f}'.format(100 * rec.mean()))
print('Average per-class F1 score: {0:.2f}'.format(100 * f1.mean()))
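
To see what these numbers mean, the three metrics can also be computed by hand from counts of true positives, false positives and false negatives. The sketch below uses a small invented set of labels and a hypothetical helper, `precision_recall_f1`; it is meant to illustrate the definitions, not to replace the scikit-learn function:

```python
def precision_recall_f1(y_true, y_pred, cls):
    '''Computes precision, recall and F1 for a single class.'''
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # correctly assigned to cls
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # wrongly assigned to cls
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # members of cls we missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# invented labels for three classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

for cls in sorted(set(y_true)):
    p, r, f = precision_recall_f1(y_true, y_pred, cls)
    print('class {}: precision={:.2f} recall={:.2f} f1={:.2f}'.format(cls, p, r, f))
```

Class 1 has perfect recall but lower precision (one document was wrongly assigned to it), while class 2 shows the opposite pattern.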

In [None]:
# plot the per-class metrics
def plot_class_measures(prec, rec, f1, inv_label_mapping):
    '''Plots per-class precision, recall and F1 scores for all classes.'''
    fig, ax = plt.subplots()
    fig.set_size_inches(12, 4)
    fig.dpi = 120
    
    total_classes = len(inv_label_mapping)
    ind = np.arange(total_classes)
    width = 0.25
    prec_bar = ax.bar(ind, prec, width, color='r', bottom=0)
    rec_bar = ax.bar(ind + width, rec, width, color='b', bottom=0)
    f1_bar = ax.bar(ind + 2 * width, f1, width, color='g', bottom=0)

    ax.set_title('Performance metrics by class')
    ax.set_xticks(ind + width)
    ax.set_xticklabels([inv_label_mapping[i] for i in range(total_classes)], rotation=90)

    ax.legend((prec_bar[0], rec_bar[0], f1_bar[0]), ('Precision', 'Recall', 'F1-score'))
    ax.autoscale_view()

    plt.show()
    
plot_class_measures(prec, rec, f1, inv_label_mapping)

We can see that in general we do well enough, but for some classes we are missing the mark a bit. We will not go into the details, but it is a good idea to try to improve performance on classes for which we do poorly. In any case, now that we have a good way of visualizing the performance of our model, we can go back to our previous questions. We were wondering:

1. **Is the impact of retraining the word embeddings worth it?**
2. **Could we improve our precision by using a higher or lower dropout rate?**

**We could muse, think and wonder, but in the end experiments are the best way to settle these questions.** It is a good idea to give that a shot, and try the 4 possible embedding setups and different dropout rates. 

Thinking is good for figuring out the limitations of our models, however. When we were using vanilla deep neural networks, our problem was that we could not capture the structure that an image or a document may have. **We solved this by introducing convolutional neural networks, which allow us to process the constituent pieces of some contiguous structure.** In that way, we can process chunks of pixels or groups of words... but is this enough?

For natural language, it seems like we would need something more. Very often, related words can be far apart. Consider the following sentence:

> The holiday resort, which has been open since the early 90s and only recently got renovated by its new management, sits on the side of a gorgeous polynesian beach.

If we wanted to ask `what sits on the side of a polynesian beach?`, we would have to capture the long-running relationship between the words that explain some fact about `the holiday resort`, our answer. Furthermore, we should understand that parts of that explanation are not the subject of sitting! As we would be looking at windows of tokens, it really should not surprise us if a model said that the `new management` were the ones sitting!
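
We can make this concrete by listing the token windows that a single convolution of width 5 would see. This is only a rough sketch: the tokenization is a simple regular expression chosen for the example, and stacked convolutions with pooling do see wider spans than a single window:

```python
import re

sentence = ('The holiday resort, which has been open since the early 90s and only '
            'recently got renovated by its new management, sits on the side of a '
            'gorgeous polynesian beach.')

# lowercase and keep only runs of word characters
tokens = re.findall(r'\w+', sentence.lower())

# every window of 5 consecutive tokens, as a width-5 convolution would see them
window = 5
windows = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]

# windows in which the verb appears
with_sits = [w for w in windows if 'sits' in w]
for w in with_sits:
    print(' '.join(w))

distance = tokens.index('sits') - tokens.index('resort')
print('Tokens between subject and verb: {}'.format(distance))
```

None of the windows containing `sits` also contains `resort`, while several of them contain `management`.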

The problem with CNNs for text is that the structure they assume is flat: we take words and fundamentally compute a weighted sum... **However, it is clear that our understanding of a word depends on how the word appears in the sentence.** In a sense, we could talk about the `state` of a word, that is, the way in which it plays with the words that came before. After all, holiday resorts probably are not sitting on chairs!

# Time is structure: Recurrent Neural Networks

Given how we ended the last section, you can probably guess where we are going. Although vanilla and convolutional neural networks are very powerful, in a sense we are still missing some basic building block. **We are dealing with sentences or documents, which are ordered lists of words, and the order matters!** Ideally, we would want a model that understands that `blue` coming before `sky` is not the same as it appearing after `feeling`! **More generally, our ideal model would keep track of what it has seen so far, and this in turn would let it appropriately "read" what we are trying to say.**

How could such a model be implemented? The models we have seen so far always take the data as it is, without any reference to their own output. However, when we are processing things that have an order, things that flow in some given direction, we need to be able to do that kind of remembering. This is what Recurrent Neural Networks do: keeping track of state. **Recurrent Neural Networks (or RNNs) extend the same plain old vanilla neural networks with input state vectors that represent our knowledge up to a point.** This is best seen visually:

<img src="images/RecurrentNetwork.png" width="1000" alt="A side-by-side look at a traditional neural network and a recurrent neural network. The recurrent neural network can be seen as a vanilla neural network whose output is used as part of its input in successive states!"></img>
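
The recurrence in the diagram can be sketched in a few lines of NumPy. This is a minimal vanilla RNN with a `tanh` activation and random, untrained weights; the names `W_x`, `W_h` and `b` are our own choices for illustration:

```python
import numpy as np


def simple_rnn(inputs, W_x, W_h, b):
    '''Runs a vanilla RNN over a (N, d) sequence and returns all N states.'''
    state = np.zeros(W_h.shape[0])  # we start without any knowledge
    states = []
    for x_t in inputs:
        # the new state mixes the current input with the previous state
        state = np.tanh(W_x @ x_t + W_h @ state + b)
        states.append(state)
    return np.stack(states)


rng = np.random.default_rng(0)
N, d, units = 6, 4, 3  # sequence length, input size, state size
inputs = rng.normal(size=(N, d))
W_x = rng.normal(scale=0.5, size=(units, d))
W_h = rng.normal(scale=0.5, size=(units, units))
b = np.zeros(units)

states = simple_rnn(inputs, W_x, W_h, b)
last_state = states[-1]
print(states.shape)  # (6, 3)
```

Because every state depends on the one before it, feeding the same vectors in a different order produces a different final state, which is exactly the sensitivity to order that we were missing.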

This diagram actually hides away a good chunk of complexity. **Although simple RNNs as such have been used, they typically have problems dealing with long sequences and can be generally hard to train.** This is because the propagation of the errors, due to the simple self-referential nature of the design, ends up either exploding (going to infinity) or vanishing (going to zero). Modern RNN architectures involve some other transformations on the input and state vectors to keep this from happening. Since we are focusing on using them, however, we just have to be happy that researchers have carried us to this point!
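
A deliberately simplified picture of the problem: assume that the error signal is just multiplied by the same scalar factor at every time step. Real networks multiply by matrices rather than a single number, so this is only meant to give an intuition:

```python
steps = 100  # number of time steps the error is propagated through

for factor in (0.9, 1.0, 1.1):
    signal = factor ** steps
    print('factor {:.1f} -> signal after {} steps: {:.3g}'.format(factor, steps, signal))

vanished = 0.9 ** steps
exploded = 1.1 ** steps
```

A factor only slightly below one leaves almost nothing after 100 steps, while a factor slightly above one blows up by several orders of magnitude.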

So with this idea in mind, we must give recurrent networks a shot! **Because of how they are defined, they are very well suited to our problem: the input to a recurrent network will be a sequence of vectors.** As we have been doing up to this point, this means we can use a padded $(N \times d)$ matrix of up to $N$ states with embedding word vectors of dimension $d$. Depending on whether our task is to talk about every state or about the whole document, we can then use the whole set of output states or just the last one. Let's look at our 20 newsgroups classification problem, using the same pretrained GloVe embeddings and a modern RNN:

**NOTE:** Training this model can take some time if you don't have a CUDA-compatible GPU. On a late 2015 MacBook Pro, it takes 1 hour of training time for 25 total epochs.

#### Unidirectional vs. Bidirectional RNNs [[Exercise 2]](#Exercise-2:-Unidirectional-vs.-Bidirectional-RNNs.)

In [None]:
from keras.layers import LSTM

# build the newsgroup classification model with pretrained embeddings
inpr = Input((max_length,))
embr = Embedding(total_word_indices, 
                EMB_SIZE, 
                embeddings_initializer=Constant(gv_emb_matrix), 
                trainable=True,
                name='embedding')(inpr)
lstm = LSTM(50, name='recurrent')(embr)
outr = Dense(total_classes, activation='softmax')(lstm)
rnn_20ns_model = Model(inputs=inpr, outputs=outr)
rnn_20ns_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
rnn_20ns_model.summary()

# fit the model
rnn_hist = rnn_20ns_model.fit(X_20ns_train, y_20ns_train, epochs=25, batch_size=100, validation_data=(X_20ns_valid, y_20ns_valid))

In [None]:
rnn_model_preds = rnn_20ns_model.predict(X_20ns_valid)
y_valid_true = y_20ns_valid.argmax(axis=1)
y_valid_pred_rnn = rnn_model_preds.argmax(axis=1)
prec, rec, f1, support = precision_recall_fscore_support(y_valid_true, y_valid_pred_rnn)

print('Average per-class precision: {0:.2f}'.format(100 * prec.mean()))
print('Average per-class recall: {0:.2f}'.format(100 * rec.mean()))
print('Average per-class F1 score: {0:.2f}'.format(100 * f1.mean()))
plot_class_measures(prec, rec, f1, inv_label_mapping)

Those results look quite promising! **Looking only at the recurrent part of the model, we are getting close to the performance of our convolutional classification model, with roughly 8 times fewer parameters!** Of course, you may have noticed that this network took longer to train. After all, we had to train for 25 epochs instead of the 10 epochs we needed for our Convolutional Network! This is because Recurrent Neural Networks perform their processing sequentially and have to learn to keep track of the state. **Their sequential nature means that they cannot be easily parallelized, which in turn has an obvious impact on training speed.** Furthermore, managing the interaction between states is no easy task, with vanilla RNNs usually having poor long-term memory. This means that longer sequences become hard to train on, since the network is not able to keep up with the information contained across the states.

This last problem is the reason why we have used the LSTM, which stands for Long Short-Term Memory. **LSTMs manage to process longer sequences by processing the inputs through a series of small networks, called gates, that let them keep and forget specific parts of the state, the input and the output.** There is no need to describe the details of the gating mechanism, but you may understand it as component-wise probabilities of keeping, forgetting or outputting each of the vector features. Another similarly designed and used network is the GRU, which stands for Gated Recurrent Unit. **The GRU simplifies the gating by merging some of these small networks together to further reduce the number of parameters.**
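
For the curious, a single step of a standard LSTM cell can be sketched as follows. The weights are random and untrained, and the layout of the stacked weight matrices is an assumption made for this example rather than a description of how Keras stores them:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    '''One LSTM step. W: (4 * units, d), U: (4 * units, units), b: (4 * units,).'''
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[:units])                # input gate: how much new content to write
    f = sigmoid(z[units:2 * units])       # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate content to write
    o = sigmoid(z[3 * units:])            # output gate: how much memory to expose
    c = f * c_prev + i * g                # updated memory cell
    h = o * np.tanh(c)                    # updated output state
    return h, c


rng = np.random.default_rng(0)
d, units = 100, 50  # embedding size and LSTM units used in the lesson
W = rng.normal(scale=0.1, size=(4 * units, d))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h, c = lstm_step(rng.normal(size=d), np.zeros(units), np.zeros(units), W, U, b)
print(h.shape, W.size + U.size + b.size)  # (50,) 30200
```

The gates take values between 0 and 1, which is why they can be read as component-wise decisions about what to keep, forget and output.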

In either case, it is a good idea to go over the idea of RNNs as an abstraction. **Recurrent nets go through a list of states sequentially, and they do so in one direction.** However, in some cases the relationships between words can be hard to capture if we are reading sequentially. One trivial extension is to read in both directions: this is aptly named a *bidirectional* RNN. Bidirectional RNNs are often used in Natural Language problems because the relationships between words can happen in either direction:

1. The `blue` car sped through the road.
2. The car was `blue` and sped through the road.

Granted, capturing that the car was `blue` should be simple in this case, but we humans like to make complicated sentences at times! Consider our dataset. **We have several different types of content, each of which will have its own specific ways in which words interact with each other.** Just to make sure we are in the clear, this is what we mean by 'bidirectional': 

<img src="images/Bidirectional.png" width="320" alt="A bidirectional RNN, processing the tokens in both directions!"></img>
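
In code, the idea amounts to running two independent recurrent networks, one over the sequence and one over its reverse, and combining their final states. The sketch below uses a vanilla RNN with random weights and joins the two states by concatenation, which is one common way of merging them:

```python
import numpy as np


def run_rnn(inputs, W_x, W_h):
    '''Runs a vanilla RNN over the sequence and returns its final state.'''
    state = np.zeros(W_h.shape[0])
    for x_t in inputs:
        state = np.tanh(W_x @ x_t + W_h @ state)
    return state


rng = np.random.default_rng(1)
N, d, units = 7, 4, 3  # sequence length, input size, state size
inputs = rng.normal(size=(N, d))

# each direction has its own, separately learned weights
fwd_weights = (rng.normal(scale=0.5, size=(units, d)), rng.normal(scale=0.5, size=(units, units)))
bwd_weights = (rng.normal(scale=0.5, size=(units, d)), rng.normal(scale=0.5, size=(units, units)))

forward_state = run_rnn(inputs, *fwd_weights)         # reads first token to last
backward_state = run_rnn(inputs[::-1], *bwd_weights)  # reads last token to first
bidirectional_state = np.concatenate([forward_state, backward_state])
print(bidirectional_state.shape)  # (6,)
```

Note that the combined state is twice as large as each individual one, and that we had to create two sets of weights to obtain it.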

**Does using a bidirectional RNN make sense for our newsgroups classification problem?** That seems like another great question to be answered experimentally, and indeed it will be another one of our exercises! But before jumping onto those, we have to have one final discussion about recurrent neural networks: can RNNs be composed? Our diagrams and our code give us different answers. In our abstract examples, we are processing each input one at a time and outputting something. **The result, then, should always be a sequence with as many vectors as there were inputs in the original sequence. However, looking at our code, the RNN layer is just producing a final output vector.** What gives? Well, it turns out both things make sense! Keras by default makes RNN layers return the last output they produced. However, we can make our RNNs return sequences very easily, by simply passing the `return_sequences=True` argument! Let's take a peek at some code:

In [None]:
# build the newsgroup classification model with pretrained embeddings
inpr = Input((max_length,))
embr = Embedding(total_word_indices, 
                EMB_SIZE, 
                embeddings_initializer=Constant(gv_emb_matrix), 
                trainable=True,
                name='embedding')(inpr)
lstm = LSTM(50, name='recurrent', return_sequences=True)(embr)
dens = Dense(100)(lstm) # transform each state into per-token data
pool = GlobalAvgPool1D()(dens) # average across all states -- probably not a good strategy!
outr = Dense(total_classes, activation='softmax')(pool)
rnn_seq_20ns_model = Model(inputs=inpr, outputs=outr)
rnn_seq_20ns_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
rnn_seq_20ns_model.summary()

This little model is the culmination of everything we have seen in this lesson... And in the course as a whole! **We are using pretrained word embeddings, a recurrent neural network, a pooling operation, and building a multiclass classification model!** We have built this shared understanding of what it means to work with text and natural language as a whole using neural networks. From the basic foundations up to the involved neural architectures that we are building now, we have travelled quite a bit, hopefully without too many hitches along the way! Before delving into the exercises, additional projects and the conclusions of the course, we have to make sure we answer the question about RNN composability. 

Looking at our last snippet should convince you that a recurrent layer in Keras can output whole sequences. **Those sequences could then be an input to another recurrent layer, learning ever-more complex relationships across the states, whose information would cross-pollinate and better capture our problem.** However, you should realize from our use of pooling that we can do something beyond this. We can combine the output of convolutional layers with a recurrent layer on top, and in that way attempt to capture both n-grams and long-term relationships. We could output not just single class predictions, but predictions over whole sequences, and then learn to translate, answer questions or generate text. In short, we really can do a lot now!

With the great toolkit we have developed through our lessons, you are fully equipped to take hard problems head-on. **From now on, it is up to you: the effort you have put into this journey of learning, the ability you have shown yourself by getting here alive, and the experiments you will carry out to solve interesting problems.** If you have enjoyed the course, please drop a line when you build something you are proud of. Until then, the floor is all yours!

# Exercises

---

### Exercise 1: Evaluating model parameters.
> Machine Learning models tend to have many possible options to control how the learning is performed and to adjust different settings of the underlying algorithms. These settings are called hyperparameters, and determine the actual parameters that the model learns — hence the name. **In the case of neural networks, we can choose activation functions, layers to use, number of layers, units per layer, the interaction between successive layers as a whole... That is, we have a ton of hyperparameters!** When we are designing our model for a particular task, we obviously want to make sure that it is well suited for it, and the best way to do this is through experiment.
>
> One common way of finding reasonable parameters is called [Grid search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search). **Grid search is a simple process: we perform an exhaustive search over some set of parameters, and select the best with respect to some metric.** Consider the convolutional neural network architecture that we designed, and assume that our layers, number of units per layer and layer interactions are fixed. Try to find reasonable values for the dropout rate, whether or not to use pretrained embeddings and whether or not to retrain those embeddings. Specifically, your task is to evaluate the following set of hyperparameters:
> > ```python
> > dropout_rates = [0.0, 0.1, 0.25, 0.5]
> > glove_options = [True, False]
> > retrain_options = [True, False]
> > ```
>
> **To compare each of the models, you could either use the validation accuracy or the validation loss.** In the case of accuracy, we will choose the model reaching the highest value of all of them. On the contrary, in the case of our loss function, we will choose the model that reaches the lowest possible value. As a reminder, the loss function is the error function we minimize during learning, so lower is better.
>
> **HINT #1:** You can do this by calling the `train_conv_20news` function with different parameters. Notice that the function returns both the model and a history object. You can look at the function definition [here](#Evaluating-model-parameters-[Exercise-1]).
>
> **HINT #2:** To access the training statistics, you can make use of the [`History`](https://keras.io/visualization/#training-history-visualization) object returned by Keras when calling `model.fit` or `model.fit_generator`: `hst = model.fit(X, y, ...)`. You can read the per-epoch metrics from the `history` dictionary in the object: `hst.history['val_acc']` holds the list of per-epoch validation accuracies (newer Keras releases name this key `val_accuracy`).

---


### Exercise 2: Unidirectional vs. Bidirectional RNNs.

> We have explored Recurrent Networks as models that process a sequence in one direction. **However, language is a sequence with references all over the place!** Indeed, a word can be talking about something that appears before or after it. **This structure is at times modelled as a tree, and linguists enjoy those syntax parse trees very much.** The problem is that often we do not have access to the parse structure, or want a model that learns the structure along the way!
>
> Bidirectional Recurrent Networks make sense in this case, because they can learn dependencies going in either direction. Of course, this means learning two Recurrent Networks, one in each direction! **In turn, this means that our model may take even longer to converge, because we are increasing the number of parameters to train, processing what amounts to two different sequences!** To get a sense of what we mean, modify our simple implementation of the LSTM-based Recurrent classifier to be bidirectional. **Remember: you may need to further increase the number of training epochs! How does your Bidirectional model compare with a unidirectional LSTM?**
>
> **OPTIONAL:** Do we now have too many parameters? Could we try to regularize the weights using the dropout options that the [LSTM layer](https://keras.io/layers/recurrent/#lstm) has to offer? Try to answer as many of these questions as you can after implementing your Bidirectional model: the more curious you get, the better your intuition will be!
>
> **HINT:** You can do this by importing [`Bidirectional`](https://keras.io/layers/wrappers/#bidirectional) from `keras.layers` and applying it over the recurrent layer (in our case, an LSTM) so that our model calls `Bidirectional(RNN(params...))(previous_layer)`. You can do this [here](#Unidirectional-vs.-Bidirectional-RNNs-[Exercise-2]).

---

In [None]:
raise Exception('You should try to find your own solutions first!')

# exercise 1 solution -- grid search over the data
from itertools import product

# possible parameters to evaluate
dropout_rates = [0.0, 0.1, 0.25, 0.5]
glove_options = [True, False]
retrain_options = [True, False]

# train all the possible configurations
histories = []
for t in product(dropout_rates, glove_options, retrain_options):
    model_name = 'model_drop{}-glove{}-retrain{}'.format(*t)
    learned_model, history = train_conv_20news(*t)
    histories.append((model_name, history, learned_model))
    print('Done with {}'.format(model_name))

# get the best performing model of the bunch
(model_name, model_hist, model) = min(histories, key=lambda x: min(x[1].history['val_loss']))
min_loss = min(model_hist.history['val_loss'])
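acc_key = 'val_acc' if 'val_acc' in model_hist.history else 'val_accuracy'  # key name depends on the Keras version
max_acc = max(model_hist.history[acc_key])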
print('The top model was {} with a validation loss of {} having reached a validation accuracy of {}.'.format(model_name, min_loss, max_acc))

# challenge: how do you programmatically stop learning when the model begins to overfit?
# hint: https://keras.io/callbacks/#earlystopping

In [None]:
raise Exception('You should try to find your own solutions first!')

# exercise 2 solution -- bidirectional LSTM (50 units in each direction)
from keras.layers import Bidirectional

# build the newsgroup classification model with pretrained embeddings
inpb = Input((max_length,))
embb = Embedding(total_word_indices, 
                EMB_SIZE, 
                embeddings_initializer=Constant(gv_emb_matrix), 
                trainable=True,
                name='embedding')(inpb)
brnn = Bidirectional(LSTM(50, name='recurrent'), name='bidir_rnn')(embb)
outb = Dense(total_classes, activation='softmax')(brnn)
birnn_20ns_model = Model(inputs=inpb, outputs=outb)
birnn_20ns_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
birnn_20ns_model.summary()

# fit the model
birnn_20ns_model.fit(X_20ns_train, y_20ns_train, epochs=30, batch_size=100, validation_data=(X_20ns_valid, y_20ns_valid))