# Keras and Natural Language Processing

### Sebastian Sierra - DL-NLP workshop
<img src="http://m.memegen.com/4j0k0i.jpg" />

## Outline
* What is Keras?
  * Installing Keras
* Models in Keras: Sequential vs Graph
* Recurrent layers in Keras
  * LSTM and bidirectional LSTM
  * GRU
* Creating new layers in Keras
* Ex: Using Keras and gensim to solve semantic similarity task

## What is Keras?

Keras is a deep learning library that runs on top of Theano or TensorFlow. It is built on four guiding principles:
* Modularity.
  * Neural layers, cost functions, optimizers, initialization schemes, activation functions and regularization schemes are all standalone modules.
* Minimalism.
* Easy extensibility.
* Work with Python.

Keras is well suited to fast prototyping. It supports both **convolutional neural networks** and **recurrent neural networks**, as well as combinations of the two. It also supports multi-input and multi-output training, and runs on either CPU or GPU.

**Further documentation** can be found on [Keras Docs](http://keras.io/)

### Installing Keras

Keras requirements are:
* numpy, scipy
* pyyaml
* HDF5 and h5py
* In case of using CNNs: cuDNN

In this case we are going to work with **Theano** as the backend, so the latest version of **Theano** should be installed:
```bash
sudo pip install git+https://github.com/Theano/Theano.git
```
Finally, pip install the latest version of Keras:
```bash
sudo pip install keras
```
Then we check that we have a recent version (>0.3):

In [None]:
import pkg_resources
pkg_resources.get_distribution("keras").version

## Models in Keras: Sequential vs Graph

Models are the main structure in Keras. There are two kinds of models: Sequential and Graph. A Sequential model is a linear stack of layers, applied in the exact order in which they were added. A Graph model is defined by its named nodes (inputs, layers and outputs) and the connections between them.

Sequential models can be easily created:
```python
from keras.models import Sequential
model = Sequential()
```
Then we can add the layers one by one. In this short example we create a network with an Embedding layer as the input layer, followed by an LSTM, a Dropout layer, a Dense layer (a standard fully connected layer) and finally an Activation layer that applies a sigmoid function.
```python
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

model.add(Embedding(input_dim, output_dim, input_length=maxlen))
model.add(LSTM(output_dim))
model.add(Dropout(prob))
model.add(Dense(1))
model.add(Activation('sigmoid'))
```

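One example of Keras' easy extensibility is that we can define our own functions, for instance a custom activation, and pass them directly to a layer:
```python
import theano.tensor as T

def tanh(x):
    # Custom activation built directly on the Theano backend
    return T.tanh(x)

model.add(Dense(64, activation=tanh))
model.add(Activation(tanh))
```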

On the other hand we have Graph models, which can be created like this:
```python
from keras.models import Graph
model = Graph()
```
In this case we are defining a bidirectional LSTM for a classification problem. Note that with a Graph model we first have to define the input explicitly. *maxlen* is the length of the input sequences the network will receive. The construction of a bidirectional LSTM is discussed in more detail later on. Apart from the node names and the explicit connections, the specification is very similar to that of the previous network.
```python
model.add_input(name='input', input_shape=(maxlen,), dtype=int)
model.add_node(Embedding(input_dim, output_dim, input_length=maxlen),
               name='embedding', input='input')
model.add_node(LSTM(output_dim), name='forward', input='embedding')
model.add_node(LSTM(output_dim, go_backwards=True), name='backward', input='embedding')
model.add_node(Dropout(prob), name='dropout', inputs=['forward', 'backward'])
model.add_node(Dense(1, activation='sigmoid'), name='sigmoid', input='dropout')
model.add_output(name='output', input='sigmoid')
```

## Recurrent Layers in Keras

Keras implements the LSTM, GRU and SimpleRNN recurrent layers. Each of them can be imported like this:
```python
from keras.layers.recurrent import LSTM, GRU, SimpleRNN
```
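The input is a 3D tensor with shape **(nb_samples, timesteps, input_dim)**. By default a recurrent layer returns only its last output, a 2D tensor with shape **(nb_samples, output_dim)**. With `return_sequences=True` it returns the full sequence, a 3D tensor with shape **(nb_samples, timesteps, output_dim)**.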
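
To make these shapes concrete, here is a minimal NumPy sketch of a plain tanh recurrence. It is not the Keras implementation: `simple_rnn_forward`, its weight names and its `return_sequences` flag are hypothetical stand-ins chosen only to mirror the shapes discussed above.

```python
import numpy as np

def simple_rnn_forward(X, W, U, b, return_sequences=False):
    """Toy tanh RNN. X has shape (nb_samples, timesteps, input_dim)."""
    nb_samples, timesteps, _ = X.shape
    output_dim = U.shape[0]
    h = np.zeros((nb_samples, output_dim))      # initial state
    outputs = []
    for t in range(timesteps):
        # New state from the current input and the previous state
        h = np.tanh(X[:, t, :].dot(W) + h.dot(U) + b)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)        # (nb_samples, timesteps, output_dim)
    return h                                    # (nb_samples, output_dim)

rng = np.random.RandomState(0)
nb_samples, timesteps, input_dim, output_dim = 4, 10, 8, 16
X = rng.randn(nb_samples, timesteps, input_dim)
W = rng.randn(input_dim, output_dim) * 0.1
U = rng.randn(output_dim, output_dim) * 0.1
b = np.zeros(output_dim)

last_state = simple_rnn_forward(X, W, U, b)
all_states = simple_rnn_forward(X, W, U, b, return_sequences=True)
print(last_state.shape)   # (4, 16)
print(all_states.shape)   # (4, 10, 16)
```

The last time step of the full sequence is the same array as the single returned state, which is why a classifier placed on top of the layer only needs the 2D output.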

By default Keras resets the state of a recurrent layer after every batch. In some cases we would like to enable statefulness, so that the final state computed for each sample in one batch is used as the initial state for the sample at the same position in the next batch. This can be done by specifying `stateful=True` in the layer constructor.
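
The idea behind statefulness can be sketched with a toy recurrence in NumPy. This only illustrates how the state is carried over, assuming a plain tanh cell; `run_batch` and the weight names are hypothetical and do not come from Keras.

```python
import numpy as np

def run_batch(X, W, U, h0):
    """Run a toy tanh RNN over one batch, starting from state h0."""
    h = h0
    for t in range(X.shape[1]):
        h = np.tanh(X[:, t, :].dot(W) + h.dot(U))
    return h

rng = np.random.RandomState(0)
W = rng.randn(3, 5) * 0.5
U = rng.randn(5, 5) * 0.5
batch_1 = rng.randn(2, 4, 3)   # (nb_samples, timesteps, input_dim)
batch_2 = rng.randn(2, 4, 3)
zeros = np.zeros((2, 5))

# Stateless: every batch starts from a zero state.
stateless = run_batch(batch_2, W, U, zeros)

# Stateful: batch_2 starts from the final state reached on batch_1.
h_after_1 = run_batch(batch_1, W, U, zeros)
stateful = run_batch(batch_2, W, U, h_after_1)

# Carrying the state is equivalent to reading both batches as one long sequence.
joined = run_batch(np.concatenate([batch_1, batch_2], axis=1), W, U, zeros)
print(np.allclose(stateful, joined))   # True
```

This is why statefulness is mainly useful when consecutive batches contain consecutive chunks of the same long sequences.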

We are going to see how an RNN can be used in a text classification task and compare the performance of three basic structures: LSTM, GRU and bidirectional LSTM. First, though, we have to get our data ready for Keras. Keras has a module with some standard datasets; in our case we will work with the sentiment analysis task of the IMDB reviews dataset.

### IMDB sentiment analysis task.
Sentiment analysis is a widely known text classification task. In 2011 a dataset of 25,000 movie reviews for training and 25,000 reviews for testing was released ([more info](http://ai.stanford.edu/~amaas/data/sentiment/)). As its authors state, the reviews are highly polar. The labels used for this dataset are 0 (negative review) and 1 (positive review). A negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. In addition, the training and testing sets contain disjoint sets of movies. We will try to predict whether a review is positive or negative.

In [None]:
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337)

from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.utils.np_utils import accuracy
from keras.models import Sequential, Graph
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
from keras.datasets import imdb
from six.moves import cPickle
import pandas as pd
import nltk
from nltk import FreqDist
from utils.helper_keras import sentence_to_wordlist, review_to_words, load_imdb

We have to define how many of the most frequent words our Embedding layer will consider; this number is *max_features*. Then we define *maxlen*, the maximum length of the input sequence, and the batch size.

In [None]:
max_features = 20000
maxlen = 100
batch_size = 32

Then we load the IMDB data, setting the fraction of the data held out for testing.

In [None]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features, test_split=0.2)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

In [None]:
X_train[1][:5]

In [None]:
y_train[1]

However, this data is not very interpretable. We are going to load the dataset manually to see how Keras encodes it.

In [None]:
acl_path = "/data1/aclImdb/"
processed_path = "/data1/IMDB/"
train = pd.read_csv(processed_path+"labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv(processed_path+"labeledTestSet.tsv", header=0, delimiter="\t", quoting=3)

Let's check what a review looks like.

In [None]:
print(train["review"][1])
print("Sentiment: %d" % (train["sentiment"][1]))

In [None]:
print(train["review"][10])
print("Sentiment: %d" % (train["sentiment"][10]))

Now we can create a list of reviews

In [None]:
clean_train_reviews = [sentence_to_wordlist(review.decode("utf8")) for review in train["review"][:]]
corpus_reviews = [sentence_to_wordlist(review.decode("utf8"), tokenized=False) for review in train["review"][:]]

In [None]:
clean_train_reviews[1]

First we calculate the frequency distribution of the terms in the document

In [None]:
whole_reviews=' '.join(corpus_reviews)
tokens = nltk.word_tokenize(whole_reviews)
fdist=FreqDist(tokens)

In [None]:
freq_df = pd.DataFrame(list(fdist.items()), columns=['Term', 'Frequency'])
ordered_freqdf = freq_df.sort_values("Frequency", ascending=False)
ordered_freqdf.head(10)

In [None]:
indexed_dict = {term: rank for rank, term in enumerate(ordered_freqdf["Term"])}

In [None]:
X = []
for review in clean_train_reviews:
    # Replace each known token by its frequency rank; unknown tokens are dropped
    tmp = []
    for x in review:
        if x in indexed_dict:
            tmp.append(indexed_dict[x])
    X.append(tmp)

In [None]:
X[1]
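
The indexing steps above can be sketched with the standard library alone. This is an illustrative equivalent rather than the notebook's code: `build_index` and `encode` are hypothetical helper names, and the tie-breaking rule (alphabetical order) is an assumption added to make the result deterministic.

```python
from collections import Counter

def build_index(tokenized_docs):
    """Map each term to its frequency rank (0 = most frequent)."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: rank for rank, term in enumerate(ranked)}

def encode(doc, index):
    """Replace known tokens by their rank and drop unknown ones."""
    return [index[tok] for tok in doc if tok in index]

docs = [["the", "movie", "was", "great"],
        ["the", "plot", "was", "thin"],
        ["the", "cast", "was", "great"]]
index = build_index(docs)
print(index["the"], index["was"], index["great"])          # 0 1 2
print(encode(["the", "sequel", "was", "great"], index))    # [0, 1, 2]
```

Because frequent words get small indices, keeping only indices below *max_features* is the same as keeping the most frequent words.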

Finally we can rebuild *X_train*, *y_train* and *X_test*, *y_test* from our own encoding:

In [None]:
(X_train, y_train), (X_test, y_test) = load_imdb(X, train["sentiment"].tolist(), nb_words=max_features, test_split=0.2)

Then the sequences are padded where they are shorter than 100 tokens and truncated where they are longer:

In [None]:
print("Pad sequences (samples x time)")
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
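
As a rough picture of what padding does, the NumPy sketch below pads short sequences with zeros on the left and keeps only the last `maxlen` tokens of long ones. To my understanding this matches the default behaviour of `pad_sequences`, but treat that as an assumption; `pad_or_truncate` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                          # an empty sequence stays all padding
        trimmed = seq[-maxlen:]               # drop tokens from the front if too long
        out[row, -len(trimmed):] = trimmed    # right-align, so padding sits on the left
    return out

print(pad_or_truncate([[1, 2], [3, 4, 5, 6, 7]], maxlen=4))
# [[0 0 1 2]
#  [4 5 6 7]]
```

Every row now has the same length, so the whole set can be stored as a single 2D integer array.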

Then we build the model as we have previously done.  
## LSTM
<img src="https://github.com/Element-Research/rnn/blob/master/doc/image/LSTM.png?raw=true" style="width: 50%; height: 50%"/>

In [None]:
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

With the *compile* function we can set the objective function, the optimizer and the class evaluation mode. The following objectives are available:
* mean_squared_error / mse
* root_mean_squared_error / rmse
* mean_absolute_error / mae
* mean_absolute_percentage_error / mape
* mean_squared_logarithmic_error / msle
* squared_hinge
* hinge
* binary_crossentropy: logloss.
* categorical_crossentropy: multiclass logloss
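
Since the models below are trained with `binary_crossentropy`, here is a small NumPy sketch of that objective. It is a plain restatement of the log loss formula, not the Keras source; the function name and the `eps` clipping value are illustrative choices.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean log loss for 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) when a prediction is exactly 0 or 1
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(losses))

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)   # True
```

The loss grows quickly when the model assigns a high probability to the wrong class, which is what makes it a good fit for a sigmoid output.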

As for optimizers, Keras provides these:
* SGD
* RMSprop
* Adagrad
* Adadelta
* Adam

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")

Finally we can use the *fit* function (in a scikit-learn fashion) to train the model. *evaluate* will show the performance of the model on the test set.

In [None]:
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=4, validation_data=(X_test, y_test), show_accuracy=True)
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)

We can easily use a GRU instead of an LSTM. Most of the code is the same as before.
## GRU
<img src="https://camo.githubusercontent.com/3ea758e7796a3e21d6b002f7aa588361d7e0bb7b/687474703a2f2f64336b62707a626d63796e6e6d782e636c6f756466726f6e742e6e65742f77702d636f6e74656e742f75706c6f6164732f323031352f31302f53637265656e2d53686f742d323031352d31302d32332d61742d31302e33362e35312d414d2e706e67" style="width: 75%; height: 75%" />

In [None]:
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(GRU(128))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")

print("----------------------")
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=4, validation_data=(X_test, y_test), show_accuracy=True)
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)

In the following case we will use a slightly more complicated structure: a bidirectional LSTM, built using a Graph model. The key is to declare two LSTM layers, one of which is set to read the sequence backwards (`go_backwards=True`). Unfortunately the documentation for this functionality is not clear.  
<img src="http://zhaoshuaijiang.com/paper_image/Bidirectional_RNN.png" />
M. Schuster and K. K. Paliwal. [Bidirectional Recurrent Neural Networks](http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf). IEEE Transactions on Signal
Processing, vol. 45, pp. 2673–2681, 1997.

In [None]:
model = Graph()
model.add_input(name='input', input_shape=(maxlen,), dtype=int)
model.add_node(Embedding(max_features, 128, input_length=maxlen),
               name='embedding', input='input')
model.add_node(LSTM(64), name='forward', input='embedding')
model.add_node(LSTM(64, go_backwards=True), name='backward', input='embedding')
model.add_node(Dropout(0.5), name='dropout', inputs=['forward', 'backward'])
model.add_node(Dense(1, activation='sigmoid'), name='sigmoid', input='dropout')
model.add_output(name='output', input='sigmoid')
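
The bidirectional idea can be sketched in NumPy with a toy tanh recurrence standing in for the LSTM. Two assumptions are made here: the two directions use separate weights, and their final states are merged by concatenation before the dropout node. All names below are hypothetical.

```python
import numpy as np

def last_state(X, W, U):
    """Final hidden state of a toy tanh RNN over X: (nb_samples, timesteps, input_dim)."""
    h = np.zeros((X.shape[0], U.shape[0]))
    for t in range(X.shape[1]):
        h = np.tanh(X[:, t, :].dot(W) + h.dot(U))
    return h

rng = np.random.RandomState(0)
X = rng.randn(2, 6, 4)
W_f, U_f = rng.randn(4, 3) * 0.1, rng.randn(3, 3) * 0.1   # forward weights
W_b, U_b = rng.randn(4, 3) * 0.1, rng.randn(3, 3) * 0.1   # backward weights

forward = last_state(X, W_f, U_f)                # reads t = 0 .. T-1
backward = last_state(X[:, ::-1, :], W_b, U_b)   # reads t = T-1 .. 0
merged = np.concatenate([forward, backward], axis=1)
print(merged.shape)   # (2, 6)
```

Under the concatenation assumption, two 64-unit LSTMs would give the final Dense layer a 128-dimensional input, the same width as in the unidirectional models.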

This time, instead of using the *evaluate* function, we will evaluate the model manually.

In [None]:
model.compile('adam', {'output': 'binary_crossentropy'})

print('--------------------')
model.fit({'input': X_train, 'output': y_train}, batch_size=batch_size, nb_epoch=4)
acc = accuracy(y_test, np.round(np.array(model.predict({'input': X_test},
                                               batch_size=batch_size)['output'])))
print('Test accuracy:', acc)
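
The manual evaluation boils down to thresholding the predicted probabilities and counting matches. A minimal NumPy sketch follows, with a hypothetical `binary_accuracy` helper and made-up probabilities; note that it uses an explicit 0.5 threshold rather than `np.round`.

```python
import numpy as np

def binary_accuracy(y_true, probabilities, threshold=0.5):
    """Share of samples whose thresholded probability matches the 0/1 label."""
    y_true = np.asarray(y_true).ravel()
    y_hat = (np.asarray(probabilities).ravel() >= threshold).astype(int)
    return float(np.mean(y_hat == y_true))

# Column vector of probabilities, shaped like the output of a Dense(1) layer
probs = np.array([[0.91], [0.20], [0.65], [0.40]])
labels = [1, 0, 0, 1]
print(binary_accuracy(labels, probs))   # 0.5
```

Flattening both arrays first avoids accidental broadcasting between a `(n, 1)` prediction array and a `(n,)` label array.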

### Making sense of the model

We are going to make predictions with this model. Let's take a negative and a positive example from the test dataset.

In [None]:
clean_test_reviews = [sentence_to_wordlist(review.decode("utf8")) for review in test["review"][:]]

In [None]:
print(' '.join(clean_test_reviews[2]))
print("Sentiment= %d" % (test["sentiment"][2]))
print(' '.join(clean_test_reviews[4]))
print("Sentiment= %d" % (test["sentiment"][4]))

We will add a neutral review to see how it behaves.

In [None]:
testing_X = []
additional_examples = "This movie was amazing, though, I didn't like when Tyrion dies."

for review in [clean_test_reviews[2], clean_test_reviews[4], sentence_to_wordlist(additional_examples)]:
    # Keep only the tokens that appear in the training vocabulary
    tmp = []
    for x in review:
        if x in indexed_dict:
            tmp.append(indexed_dict[x])
    testing_X.append(tmp)
(new_X, new_y), _ = load_imdb(testing_X, [1, 0, 1], nb_words=max_features, test_split=0.)
new_X = sequence.pad_sequences(new_X, maxlen=maxlen)

There is still a problem when you want to predict on sequences longer than 100 tokens: they are cut down to *maxlen* by the padding step.

In [None]:
len(new_X[1])

In [None]:
model.predict({'input': np.array(new_X)}, batch_size=batch_size)['output']

## Using Keras and gensim to solve Semantic Similarity task

Now we are going to apply Keras to another NLP task. Semantic similarity has become a central problem in NLP. A dataset named SICK (Sentences Involving Compositional Knowledge) was recently introduced for SemEval. Further info can be found at [SICK dataset](http://clic.cimec.unitn.it/composes/sick.html). The SICK data set consists of about 10,000 English sentence pairs, generated starting from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description data set. Each sentence pair was annotated for relatedness by means of crowdsourcing techniques. In the final set, the gold relatedness scores were distributed as follows: 923 pairs within the [1,2) range, 1373 pairs within the [2,3) range, 3872 pairs within the [3,4) range, and 3672 pairs within the [4,5] range. This exercise is taken from the Skip-thoughts work.

First we define the architecture to use.

In [None]:
import numpy as np
import copy       
from sklearn.metrics import mean_squared_error as mse
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from sklearn.utils import shuffle

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import Adam
from gensim import models
from gensim.models import Word2Vec
import re
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
sick_data="/home/datasets/datasets1/skip_thoughts_models/skip-thoughts/data/"
model_path = "/home/datasets/datasets1/word2vec-embeddings/GoogleNews-vectors-negative300.bin.gz"
def prepare_model(ninputs=9600, nclass=5):
    """
    Set up and compile the model architecture (Logistic regression)
    """
    lrmodel = Sequential()
    lrmodel.add(Dense(nclass, input_dim=ninputs))
    lrmodel.add(Activation('softmax'))
    lrmodel.compile(loss='categorical_crossentropy', optimizer='adam')
    return lrmodel

In [None]:
def train_model(lrmodel, X, Y, devX, devY, devscores):
    """
    Train model, using pearsonr on dev for early stopping
    """
    done = False
    best = -1.0
    r = np.arange(1,6)

    while not done:
        # Every 100 epochs, check Pearson on development set
        lrmodel.fit(X, Y, verbose=2, shuffle=False, validation_data=(devX, devY))
        yhat = np.dot(lrmodel.predict_proba(devX, verbose=2), r)
        score = pearsonr(yhat, devscores)[0]
        if score > best:
            print(score)
            best = score
            bestlrmodel = copy.deepcopy(lrmodel)
        else:
            done = True

    yhat = np.dot(bestlrmodel.predict_proba(devX, verbose=2), r)
    score = pearsonr(yhat, devscores)[0]
    print('Dev Pearson: ' + str(score))
    return bestlrmodel
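
The early stopping logic in `train_model` can be isolated from Keras. The sketch below uses a toy dictionary as the "model" and a fixed list of dev scores, all hypothetical, to show which copy is kept and when the loop stops.

```python
import copy

def fit_until_no_improvement(model, train_step, dev_score):
    """Call train_step(model) until dev_score(model) stops improving; return the best copy."""
    best_score = float("-inf")
    best_model = None
    while True:
        train_step(model)
        score = dev_score(model)
        if score <= best_score:
            break                          # first round without improvement
        best_score = score
        best_model = copy.deepcopy(model)  # snapshot, so later rounds cannot change it
    return best_model, best_score

# Toy stand-in: the dev score rises for three rounds, then dips.
scores = iter([0.50, 0.62, 0.70, 0.68, 0.90])
toy_model = {"rounds": 0, "score": None}

def train_step(m):
    m["rounds"] += 1
    m["score"] = next(scores)

best, best_score = fit_until_no_improvement(toy_model, train_step, lambda m: m["score"])
print(best["rounds"], best_score)   # 3 0.7
```

Note that the loop stops at the first dip, so the later score of 0.90 in the toy list is never reached; this is the price of such a simple stopping rule.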

In [None]:
def encode_labels(labels, nclass=5):
    """
    Label encoding from Tree LSTM paper (Tai, Socher, Manning)
    """
    Y = np.zeros((len(labels), nclass)).astype('float32')
    for j, y in enumerate(labels):
        for i in range(nclass):
            if i+1 == np.floor(y) + 1:
                Y[j,i] = y - np.floor(y)
            if i+1 == np.floor(y):
                Y[j,i] = np.floor(y) - y + 1
    return Y
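
A worked example may help to read this encoding: a score is split between its two neighbouring integer classes, so that the expected class value recovers the score. The sketch below re-implements the rule for a single score; `encode_score` is a hypothetical name, and the result is checked against the vector `r = [1, ..., 5]` used in `train_model`.

```python
import numpy as np

def encode_score(y, nclass=5):
    """Spread a score in [1, nclass] over its two neighbouring integer classes."""
    p = np.zeros(nclass)
    low = int(np.floor(y))        # lower class (1-based)
    p[low - 1] = low - y + 1      # weight on the lower class
    if low < nclass:
        p[low] = y - low          # weight on the upper class
    return p

r = np.arange(1, 6)
p = encode_score(3.6)
print([round(v, 2) for v in p.tolist()])   # [0.0, 0.0, 0.4, 0.6, 0.0]
print(round(float(p.dot(r)), 2))           # 3.6
```

This is why the predictions are later turned back into scores with `np.dot(probabilities, r)`.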

In [None]:
def load_data(loc=sick_data):
    """
    Load SICK
    """
    trainA, trainB, devA, devB, testA, testB = [],[],[],[],[],[]
    trainS, devS, testS = [],[],[]

    with open(loc + 'SICK_train.txt', 'rb') as f:
        for line in f:
            text = line.strip().split('\t')
            trainA.append(text[1])
            trainB.append(text[2])
            trainS.append(text[3])
    with open(loc + 'SICK_trial.txt', 'rb') as f:
        for line in f:
            text = line.strip().split('\t')
            devA.append(text[1])
            devB.append(text[2])
            devS.append(text[3])
    with open(loc + 'SICK_test_annotated.txt', 'rb') as f:
        for line in f:
            text = line.strip().split('\t')
            testA.append(text[1])
            testB.append(text[2])
            testS.append(text[3])

    trainS = [float(s) for s in trainS[1:]]
    devS = [float(s) for s in devS[1:]]
    testS = [float(s) for s in testS[1:]]

    return [trainA[1:], trainB[1:]], [devA[1:], devB[1:]], [testA[1:], testB[1:]], [trainS, devS, testS]

In [None]:
def encode_word2vec(model, dataset):
    # model = Word2Vec.load_word2vec_format(model_path, binary=True)
    # Replace every non-letter character with a space, lowercase everything and tokenize.
    # Pending: remove stop words.
    trainA = [word_tokenize(re.sub("[^a-zA-Z]", " ", t).lower()) for t in dataset[0][:]]
    trainB = [word_tokenize(re.sub("[^a-zA-Z]", " ", t).lower()) for t in dataset[1][:]]
    # Keep only the tokens that have a word2vec vector
    feat_trainA = [[model[t] for t in sentence if t in model.vocab] for sentence in trainA]
    feat_trainB = [[model[t] for t in sentence if t in model.vocab] for sentence in trainB]
    return feat_trainA, feat_trainB

In [None]:
train, dev, test, scores = load_data()                                                                                         
train[0], train[1], scores[0] = shuffle(train[0], train[1], scores[0], random_state=1234)
model = Word2Vec.load_word2vec_format(model_path, binary=True)
feat_trainA, feat_trainB = encode_word2vec(model, train)
feat_devA, feat_devB = encode_word2vec(model, dev)

In [None]:
agg_featA = np.array([np.sum(sentence, axis=0) for sentence in feat_trainA])
agg_featB = np.array([np.sum(sentence, axis=0) for sentence in feat_trainB])
agg_devA = np.array([np.sum(sentence, axis=0) for sentence in feat_devA])
agg_devB = np.array([np.sum(sentence, axis=0) for sentence in feat_devB])
trainF = np.c_[np.abs(agg_featA - agg_featB), agg_featA * agg_featB]
devF = np.c_[np.abs(agg_devA - agg_devB), agg_devA * agg_devB]
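
The pair features built here are the element-wise absolute difference and the element-wise product of the two sentence vectors, placed side by side. A tiny NumPy example with made-up 3-dimensional vectors (the `pair_features` name is hypothetical):

```python
import numpy as np

def pair_features(A, B):
    """Concatenate |a - b| and a * b for each row pair: (n, d), (n, d) -> (n, 2d)."""
    return np.c_[np.abs(A - B), A * B]

A = np.array([[1.0, 2.0, 3.0]])
B = np.array([[1.0, 0.0, -3.0]])
print(pair_features(A, B).tolist())   # [[0.0, 2.0, 6.0, 1.0, 0.0, -9.0]]
```

Both parts are symmetric in the two sentences, so swapping sentence A and sentence B gives the same feature vector. With 300-dimensional word2vec vectors the result has 600 columns, which is the `ninputs` passed to `prepare_model`.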

In [None]:
trainY = encode_labels(scores[0])
devY = encode_labels(scores[1])
lrmodel = prepare_model(ninputs=trainF.shape[1])
bestlrmodel = train_model(lrmodel, trainF, trainY, devF, devY, scores[1])

In [None]:
feat_testA, feat_testB = encode_word2vec(model, test)
agg_testA = np.array([np.sum(sentence, axis=0) for sentence in feat_testA])
agg_testB = np.array([np.sum(sentence, axis=0) for sentence in feat_testB])
testF = np.c_[np.abs(agg_testA - agg_testB), agg_testA * agg_testB]

print('Evaluating...')
r = np.arange(1,6)
yhat = np.dot(bestlrmodel.predict_proba(testF, verbose=2), r)
pr = pearsonr(yhat, scores[2])[0]
sr = spearmanr(yhat, scores[2])[0]
se = mse(yhat, scores[2])
print('Test Pearson: ' + str(pr))
print('Test Spearman: ' + str(sr))
print('Test MSE: ' + str(se))
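
Pearson correlation and MSE measure different things, which is why both are reported. The NumPy sketch below uses made-up predictions that are all 0.5 too high; the helper names are hypothetical, and the notebook itself relies on `scipy.stats.pearsonr` and scikit-learn's `mean_squared_error`.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc.dot(yc) / np.sqrt(xc.dot(xc) * yc.dot(yc)))

def mean_squared_error(x, y):
    """Average squared difference between two 1D sequences."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(diff ** 2))

gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 2.5, 3.5, 4.5, 5.5]   # perfectly ordered, but shifted by 0.5
print(round(pearson(pred, gold), 6))    # 1.0
print(mean_squared_error(pred, gold))   # 0.25
```

A constant offset leaves the correlation at 1.0 while the MSE still penalises it.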

## About us
<img src="https://sites.google.com/a/unal.edu.co/mindlab/_/rsrc/1353286903227/config/customLogo.gif?revision=10" />