In [0]:
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

In [0]:
import os
import sys
import math
import time
import itertools

import tensorflow as tf
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tensorflow import keras
from sklearn.preprocessing import OneHotEncoder

%matplotlib inline

# Text processing

Deep learning models are applied to a wide range of NLP tasks, such as:


*   Sentiment classification
*   POS tagging
*   Translation
*   Sequence labeling
*   Language inference
*   Question answering
*   Image captioning



## Text preprocessing before training

Before feeding text to a model, we have to preprocess it. Preprocessing usually involves steps such as:


1.   Convert text to lowercase
2.   Remove numbers or replace them with a *NUM* token
3.   Remove punctuation
4.   Tokenization
5.   Remove stop words
6.   Remove common / rare words
7.   Stemming / Lemmatization
8.   Split words into pieces (playing -> play + ing)
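
A few of the steps above (lowercasing, number replacement, punctuation removal, tokenization, stop-word removal) can be sketched in plain Python. This is only an illustration: the stop-word list and the helper name `preprocess` are made up for this example, and real pipelines usually rely on a dedicated NLP library.

```python
import re
import string

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "in"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, replace numbers with NUM, strip punctuation, tokenize, drop stop words."""
    text = text.lower()
    # Replace every number (optionally with a decimal part) by a placeholder token.
    text = re.sub(r"\d+(?:\.\d+)?", " NUM ", text)
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Simple whitespace tokenization.
    tokens = text.split()
    return [token for token in tokens if token not in stop_words]


print(preprocess("The movie was great, and I watched it 3 times in 2019!"))
# ['movie', 'great', 'i', 'watched', 'it', 'NUM', 'times', 'NUM']
```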



### Embeddings

Usually, after preprocessing, we map each word from the training corpus to an integer. This integer is then mapped to a vector representation called a word embedding.
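
The two mappings (word to integer, integer to vector) can be sketched without any framework. The toy corpus, the `<pad>` / `<unk>` convention and the random vectors below are assumptions made for the illustration; in a trained model the vectors would be learned.

```python
import random

# Toy corpus, already tokenized (hypothetical data).
corpus = [["the", "movie", "was", "great"], ["the", "plot", "was", "weak"]]

# Index 0 is reserved for padding and 1 for unknown words (an assumed convention).
word2index = {"<pad>": 0, "<unk>": 1}
for sentence in corpus:
    for word in sentence:
        word2index.setdefault(word, len(word2index))

EMBEDDING_DIM = 4
rng = random.Random(0)
# One row per vocabulary entry; random values stand in for learned weights.
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
                    for _ in word2index]


def encode(sentence):
    """Map words to integers, falling back to the <unk> index."""
    return [word2index.get(word, word2index["<unk>"]) for word in sentence]


def embed(indices):
    """Look up one embedding vector per index."""
    return [embedding_matrix[i] for i in indices]


indices = encode(["the", "movie", "was", "boring"])
vectors = embed(indices)
print(indices)  # [2, 3, 4, 1]
```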

Example embeddings that we could use in our models:

1.   One-hot-encoding / tf-idf
2.   LSA / LDA
3.   Word2vec / GloVe
4.   Trained embedding layer

![](https://cdn-images-1.medium.com/max/800/1*_kDJnuzDA5SiQ9N0tmJRbw.png)

In very simplistic terms, word embeddings are texts converted into numbers, and the same text may have different numerical representations. Good word embeddings capture the word analogies and word similarities found in the corpus.
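
Word similarity in an embedding space is commonly measured with cosine similarity. The sketch below uses hand-written 3-dimensional vectors that are purely hypothetical; real embeddings typically have tens or hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, chosen by hand for the illustration.
toy_embeddings = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "awful": [-0.7, 0.1, 0.2],
}


def most_similar(word, embeddings):
    """Return the other word whose vector has the highest cosine similarity."""
    candidates = (w for w in embeddings if w != word)
    return max(candidates,
               key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


print(most_similar("good", toy_embeddings))  # great
```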

![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/06062705/Word-Vectors.png)

In our tasks, we can use pretrained embeddings (usually trained in an unsupervised manner) or train them from scratch to obtain task-specific embeddings.

You could visualise embeddings in [Embedding Projector](https://projector.tensorflow.org/).

### Padding

All inputs to a neural network have to be of the same shape. To achieve this, we pad all our sentences with a *blank* token.
We can pad sequences before training (padding all sentences to the length of the longest sentence in the training dataset) or online, during training (padding all sentences in a batch to the length of the longest sentence in that batch).
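
The difference between the two strategies can be sketched in a few lines. The helper `pad_batch` and the choice of `0` as the *blank* token are assumptions made for this example.

```python
PAD = 0  # assumed index of the *blank* token


def pad_batch(sequences, max_len=None, pad_value=PAD):
    """Pad (and, if needed, truncate) sequences at the end to a common length."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    return [list(seq[:max_len]) + [pad_value] * (max_len - len(seq))
            for seq in sequences]


dataset = [[5, 8, 2], [7], [4, 9, 1, 3, 6], [2, 2]]

# Padding before training: every sequence gets the length of the longest one.
padded_dataset = pad_batch(dataset)
print(padded_dataset)
# [[5, 8, 2, 0, 0], [7, 0, 0, 0, 0], [4, 9, 1, 3, 6], [2, 2, 0, 0, 0]]

# Online padding: each batch is padded only to its own longest sequence.
BATCH_SIZE = 2
padded_batches = [pad_batch(dataset[start:start + BATCH_SIZE])
                  for start in range(0, len(dataset), BATCH_SIZE)]
print(padded_batches[0])
# [[5, 8, 2], [7, 0, 0]]
```

Online padding usually wastes fewer computations on blank tokens, at the cost of batches with varying shapes.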

## Deep learning models for NLP

### Recurrent neural networks

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)

### Convolutional neural networks

![](https://cdn-images-1.medium.com/max/1800/1*aBN2Ir7y2E-t2AbekOtEIw.png)

![](http://www.joshuakim.io/wp-content/uploads/2017/12/filtering2.jpg)

![](https://www.researchgate.net/profile/Qingcai_Chen/publication/273471942/figure/fig1/AS:281712618688515@1444176930410/The-over-all-architecture-of-the-convolutional-sentence-model-A-box-with-dashed-lines.png)

### Self attention based models

![](http://nlp.seas.harvard.edu/images/the-annotated-transformer_38_0.png)
![](http://nlp.seas.harvard.edu/images/the-annotated-transformer_33_0.png)

## IMDB Movie reviews dataset for sentiment classification

**Overview**

This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification.

**Dataset**

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg).

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets.
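
The labeling rule can be written as a small function. The function name and the convention of returning `None` for excluded reviews are assumptions for this sketch; the encoding 0 = negative, 1 = positive matches the binary labels used later in this notebook.

```python
def rating_to_label(rating):
    """Map a 1-10 review score to a binary sentiment label.

    Returns 0 (negative) for scores <= 4, 1 (positive) for scores >= 7,
    and None for neutral scores, which are excluded from the dataset.
    """
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None


print([rating_to_label(r) for r in (1, 4, 5, 6, 7, 10)])
# [0, 0, None, None, 1, 1]
```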

# Sentiment classification with 1D CNNs

## Load the data and make padding

Keras provides us with the [imdb dataset](https://keras.io/datasets/). The dataset is already preprocessed: each review is loaded as a list of integers (word indices), and the lists have different lengths. While loading, we can preprocess the data further, for example by restricting the sequence length or dropping rarely occurring words.

We can also use Keras to [pad](https://keras.io/preprocessing/sequence/) all our sequences.


In [0]:
# Use the dataset bundled with tf.keras to avoid mixing standalone keras with tf.keras.
from tensorflow.keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data()

x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=None, padding='post')

MAX_SEQUENCE_LEN = x_train.shape[1]
WORDS_IN_CORPORA = np.max(x_train) + 1

x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=MAX_SEQUENCE_LEN, padding='post', truncating='post')

In [0]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape, WORDS_IN_CORPORA

In [0]:
words2index = imdb.get_word_index()
index2word = {v: k for k,v in words2index.items()}
imdb_words = [v for k,v in sorted(index2word.items())]
len(imdb_words)

In [0]:
imdb_words[0:5]
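
One detail worth keeping in mind: with the default arguments of `imdb.load_data`, the integers in the loaded reviews are shifted by `index_from=3` relative to `imdb.get_word_index()`, and the indices 0, 1 and 2 are reserved for padding, the start-of-sequence marker and out-of-vocabulary words. The sketch below illustrates the decoding with a toy word index (the words and ranks are hypothetical, as are the helper names).

```python
INDEX_FROM = 3  # default `index_from` of imdb.load_data
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}


def build_decoder(word_index, index_from=INDEX_FROM):
    """Build a mapping from data indices back to words."""
    decoder = dict(SPECIAL_TOKENS)
    for word, rank in word_index.items():
        decoder[rank + index_from] = word
    return decoder


def decode_review(encoded, decoder):
    """Turn a list of data indices into a readable string."""
    return " ".join(decoder.get(i, "<oov>") for i in encoded)


# Toy stand-in for imdb.get_word_index().
toy_word_index = {"the": 1, "and": 2, "movie": 3}
decoder = build_decoder(toy_word_index)

print(decode_review([1, 4, 6, 2, 0, 0], decoder))
# <start> the movie <oov> <pad> <pad>
```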

## Define the network

Define the CNN for text classification that consists of:

1.   Input layer [keras.layers.Input](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Input) that will receive the input sentences; remember to set the proper input shape.
2.   Embedding layer [keras.layers.Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) for training the word embeddings. You should pass the proper input dim and length to the layer and specify the embedding dim.
3.   Some 1D convolutional layers [keras.layers.Conv1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D) with specified number of filters, kernel size, stride, activation function. After the convolution you can use the batch norm layer [keras.layers.BatchNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization).
4.   Together with 1D convolutions, you can also use the 1D max pooling layers [keras.layers.MaxPooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool1D), to decrease the input length.
5.   After some convolutions and poolings, you should use the 1D global max pooling [keras.layers.GlobalMaxPool1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalMaxPool1D) to be sure that each sequence has the specified shape (1 x num_channels).
6.   Layer that will flatten the result of convolutions [keras.layers.Flatten](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten).
7.   Some Dense layers [keras.layers.Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense), with specified number of units and activation function. After dense layers you could also use the batch norm layer.
8.   A final Dense layer with 2 neurons (one per class 0 & 1).

I used the following architecture and obtained ~89% accuracy. But you can experiment with your own architectures.

1.  Embedding layer with embedding dim equal to 100.
2.  The following sequence of layers, repeated 3 times:

    1.    1D conv with 128 filters, kernel size equal to 5, stride equal to 1, relu activation and same padding.
    2.    Batch norm
    3.   1D max pooling with pool_size equal to 5

3.   After flatten, before the final layer, I use a dense layer with 128 units and relu activation, followed by batch norm.
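
To get a feeling for how pooling shrinks the sequence, the lengths can be traced by hand. The sketch assumes `MaxPooling1D` with its default settings (stride equal to `pool_size`, `'valid'` padding) and convolutions with `'same'` padding and stride 1, which keep the length unchanged. The input length of 2500 is a hypothetical round number, not the actual `MAX_SEQUENCE_LEN`.

```python
def pooled_length(length, pool_size):
    """Output length of max pooling with stride == pool_size and 'valid' padding."""
    return (length - pool_size) // pool_size + 1


def trace_lengths(sequence_len, pool_sizes):
    """Sequence length after each conv ('same' padding) + pooling stage."""
    lengths = [sequence_len]
    for pool_size in pool_sizes:
        # The convolution keeps the length; only the pooling shrinks it.
        lengths.append(pooled_length(lengths[-1], pool_size))
    return lengths


print(trace_lengths(2500, [5, 5]))
# [2500, 500, 100]
```

Global max pooling then reduces whatever length remains to a single value per channel, so the dense layers always see a vector of size `num_channels`.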


**Define the input layer**

In [0]:
sequence_input = keras.layers.Input(shape=(MAX_SEQUENCE_LEN,), dtype='int32')

**Define the embedding layer**

In [0]:
embedded_sequences = keras.layers.Embedding(input_dim=WORDS_IN_CORPORA,
                                            output_dim=100,
                                            input_length=MAX_SEQUENCE_LEN)(sequence_input)

**Define all convolutional and pooling layers**

In [0]:
x = keras.layers.Conv1D(filters=128, kernel_size=5, activation='relu', padding='same')(embedded_sequences)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.MaxPooling1D(pool_size=5)(x)
x = keras.layers.Conv1D(filters=128, kernel_size=5, activation='relu', padding='same')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.MaxPooling1D(pool_size=5)(x)
x = keras.layers.Conv1D(filters=128, kernel_size=5, activation='relu', padding='same')(x)
x = keras.layers.BatchNormalization()(x)

**Define the global max pooling layer, together with flatten layer**

In [0]:
x = keras.layers.GlobalMaxPool1D()(x)
x = keras.layers.Flatten()(x)

**Define all dense layers**

In [0]:
x = keras.layers.Dense(128, activation='relu')(x)
x = keras.layers.BatchNormalization()(x)
sequence_output = keras.layers.Dense(2, activation='softmax')(x)

**Define keras model**

You should pass the proper input and output tensors to the initializer.

In [0]:
model = keras.models.Model(inputs=[sequence_input], outputs=[sequence_output])

**Check the model summary**

In [0]:
model.summary()

## Train the network

**Before training, you should compile the model with a proper loss function and optimizer**

You could experiment with different optimizers, with different learning rates and [lr schedulers](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler).

In [0]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', 
              metrics=["accuracy"])

**Train the model**

In [0]:
model.fit(x_train, y_train, epochs=5, batch_size=32,
          validation_data=(x_test, y_test),
          callbacks=[tf.keras.callbacks.LearningRateScheduler(schedule = lambda x: 0.001 if x == 0 else 0.0001)])
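
The schedule passed to `LearningRateScheduler` is just a function from the epoch index to a learning rate, so it can also be written as a named function and inspected before training. The step decay below is one possible choice; the constants and the function name are assumptions for this sketch, not tuned values.

```python
INITIAL_LR = 1e-3      # assumed starting learning rate
DROP_FACTOR = 0.1      # multiply the learning rate by this factor ...
EPOCHS_PER_DROP = 2    # ... every this many epochs


def step_decay(epoch, current_lr=None):
    """Learning rate for a zero-based epoch index.

    `current_lr` is accepted because the callback may pass the current
    learning rate as a second argument; it is ignored here.
    """
    return INITIAL_LR * DROP_FACTOR ** (epoch // EPOCHS_PER_DROP)


for epoch in range(5):
    print(epoch, step_decay(epoch))
```

Such a function could then be passed as `schedule=step_decay` to the callback used above.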

**Save word embeddings**

You could use the trained embedding layer to export word embeddings.
Then you could visualise the embedding space in, e.g., the [Embedding Projector](https://projector.tensorflow.org/).

In [0]:
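# Data indices are shifted by 3 relative to imdb.get_word_index()
# (0 = padding, 1 = start, 2 = unknown), so imdb_words[0] has data index 4.
words_sample = np.arange(4, 1004, dtype=np.int32)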

In [0]:
model.submodules

In [0]:
# In this functional model, layers[0] is the InputLayer and layers[1] is the Embedding layer.
words_embeddings = model.layers[1](words_sample)
words_embeddings

In [0]:
np.savetxt("words_embeddings.tsv", words_embeddings.numpy(), delimiter="\t")
np.savetxt("words.tsv", np.array(imdb_words[:1000]).reshape((-1,1)), fmt='%s')

# 1D CNNs with various kernel sizes

[Kim - Convolutional Neural Networks for Sentence Classification](https://www.aclweb.org/anthology/D14-1181)

Instead of working with convolutions with a single kernel size, we could also use convolutions with different kernel sizes to work with our text data. 
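
The idea can be illustrated on a toy scale: kernels of different sizes slide over the same sequence, each produces its own feature map, and the strongest response of each map is kept and concatenated (max-over-time pooling, as in Kim's paper). The sketch is deliberately simplified: it works on a sequence of scalars instead of embedding vectors, and the kernels are hand-written rather than learned.

```python
def conv1d(sequence, kernel):
    """'Valid' 1D convolution (cross-correlation) of a scalar sequence with one kernel."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]


def multi_kernel_features(sequence, kernels):
    """One feature per kernel: the maximum of its feature map."""
    return [max(conv1d(sequence, kernel)) for kernel in kernels]


sequence = [1, 0, 2, 3, 1]
# Hypothetical kernels of sizes 3, 4 and 5 (all-ones, i.e. sliding window sums).
kernels = [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1]]

print(conv1d(sequence, kernels[0]))              # [3, 5, 6]
print(multi_kernel_features(sequence, kernels))  # [6, 6, 7]
```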

![](https://richliao.github.io/images/YoonKim_ConvtextClassifier.png)

Define the CNN for text classification that consists of:

1.   Input layer [keras.layers.Input](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Input) that will receive the input sentences; remember to set the proper input shape.
2.   Embedding layer [keras.layers.Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) for training the word embeddings. You should pass the proper input dim and length to the layer and specify the embedding dim.
3.  Some 1D convolutional layers [keras.layers.Conv1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D) with specified number of filters, stride, activation function, but with **different kernel sizes**. After the convolution you can use the batch norm layer [keras.layers.BatchNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization).
4.   Together with 1D convolutions, you can also use the 1D max pooling layers [keras.layers.MaxPooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool1D), to decrease the input length.
5.   After these convolutions you should concatenate their outputs. You can do it with concatenate layer [keras.layers.Concatenate](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Concatenate).
6.   On the concatenated outputs you could again use 1D convolutional layers, with one or several kernel sizes, together with batch norm and pooling.
7.   After some convolutions and poolings, you should use the 1D global max pooling [keras.layers.GlobalMaxPool1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalMaxPool1D) to be sure that each sequence has the specified shape (1 x num_channels).
8.   Layer that will flatten the result of convolutions [keras.layers.Flatten](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten).
9.   Some Dense layers [keras.layers.Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense), with specified number of units and activation function. After dense layers you could also use the batch norm layer.
10.  A final Dense layer with 2 neurons (one per class 0 & 1).

I used the following architecture and obtained ~90% accuracy. But you can experiment with your own architectures.

1.  Embedding layer with embedding dim equal to 100.
2.  The following sequence of layers:
    1.   1D convolutions with filter sizes 3,4,5
    2.    Batch norm for the output of every convolution
    3.   1D max pooling with pool_size equal to 5 for the output of every convolution
    
3.  The following sequence of layers, repeated 2 times:

    1.    1D conv with 128 filters, kernel size equal to 5, stride equal to 1, relu activation and same padding.
    2.    Batch norm
    3.   1D max pooling with pool_size equal to 5

4.   After flatten, before the final layer, I use a dense layer with 128 units and relu activation, followed by batch norm.


**Define the input layer**

In [0]:
sequence_input = keras.layers.Input(shape=(MAX_SEQUENCE_LEN,), dtype='int32')

**Define the embedding layer**

In [0]:
embedded_sequences = keras.layers.Embedding(input_dim=WORDS_IN_CORPORA,
                                            output_dim=100,
                                            input_length=MAX_SEQUENCE_LEN)(sequence_input)

**Define convolutional layers with different kernel sizes**

In [0]:
filter_sizes = [3,4,5]
convs = []

for size in filter_sizes:
    x = keras.layers.Conv1D(filters=128, kernel_size=size, activation='relu', padding='same')(embedded_sequences)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.MaxPooling1D(pool_size=5)(x)
    convs.append(x)

x = keras.layers.Concatenate()(convs)


**Define all other convolutional and pooling layers**

In [0]:
x = keras.layers.Conv1D(filters=128, kernel_size=5, activation='relu', padding='same')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.MaxPooling1D(pool_size=5)(x)
x = keras.layers.Conv1D(filters=128, kernel_size=5, activation='relu', padding='same')(x)
x = keras.layers.BatchNormalization()(x)

**Define the global max pooling layer, together with flatten layer**

In [0]:
x = keras.layers.GlobalMaxPool1D()(x)
x = keras.layers.Flatten()(x)

**Define all dense layers**

In [0]:
x = keras.layers.Dense(128, activation='relu')(x)
x = keras.layers.BatchNormalization()(x)
sequence_output = keras.layers.Dense(2, activation='softmax')(x)

**Define keras model**

You should pass the proper input and output tensors to the initializer.

In [0]:
model = keras.models.Model(inputs=[sequence_input], outputs=[sequence_output])

**Check the model summary**

In [0]:
model.summary()

## Train the network

You could use the classic tf.keras training approach (model compile and fit).

You could experiment with different optimizers, with different learning rates and [lr schedulers](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler).

In [0]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', 
              metrics=["accuracy"])

In [0]:
model.fit(x_train, y_train, epochs=5, batch_size=32,
          validation_data=(x_test, y_test),
          callbacks=[tf.keras.callbacks.LearningRateScheduler(schedule = lambda x: 10 ** (-x-3))])

Or use a more [TensorFlow-like training loop](https://www.tensorflow.org/alpha/guide/keras/training_and_evaluation#part_ii_writing_your_own_training_evaluation_loops_from_scratch).

In [0]:
model = keras.models.Model(inputs=[sequence_input], outputs=[sequence_output])

In [0]:
# Instantiate an optimizer.
optimizer = keras.optimizers.SGD(learning_rate=1e-3)

# Instantiate a loss function.
loss_fn = keras.losses.SparseCategoricalCrossentropy()

In [0]:
# Prepare the training dataset.
batch_size = 32
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=25000).batch(batch_size)

In [0]:
# Iterate over epochs.
for epoch in range(5):
    print('Start of epoch %d' % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

        # Open a GradientTape to record the operations run
        # during the forward pass, which enables autodifferentiation.
        with tf.GradientTape() as tape:

            # Run the forward pass of the model in training mode
            # (so that batch norm uses batch statistics). The model ends
            # with a softmax, so its outputs are class probabilities.
            probs = model(x_batch_train, training=True)

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, probs)

        # Outside the tape context, retrieve the gradients of the loss
        # with respect to the trainable weights.
        grads = tape.gradient(loss_value, model.trainable_variables)

        # Run one step of gradient descent by updating
        # the value of the weights to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Log every 200 batches.
        if step % 200 == 0:
            print('Training loss (for one batch) at step %s: %s' % (step, float(loss_value)))
            print('Seen so far: %s samples' % ((step + 1) * batch_size))

# Images sources

The images used in this notebook come from the following web pages and papers:

1.   https://medium.com/datadriveninvestor/neural-networks-or-deep-learning-in-natural-language-processing-f7b534a14728
2.   http://www.joshuakim.io/understanding-how-convolutional-neural-network-cnn-perform-text-classification-with-word-embeddings/
3.   https://blog.goodaudience.com/introduction-to-1d-convolutional-neural-networks-in-keras-for-time-sequences-3a7ff801a2cf
4.   https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
5.   https://www.researchgate.net/figure/The-over-all-architecture-of-the-convolutional-sentence-model-A-box-with-dashed-lines_fig1_273471942
6.   http://colah.github.io/posts/2015-08-Understanding-LSTMs/
7.   http://nlp.seas.harvard.edu/2018/04/03/attention.html
8.   [Kim publication](https://www.aclweb.org/anthology/D14-1181)