# Text classification with Keras

This notebook uses Keras layers to train text classification models. It compares models built from different combinations of attention, LSTM and GRU layers. GloVe and FastText vectors are used to initialize the word embeddings.

In [1]:
import keras, tensorflow, sys
keras.__version__, tensorflow.__version__, sys.version

Using TensorFlow backend.


('2.2.4',
 '1.11.0',
 '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')

### Install two required Attention Packages

pip install keras_multi_head

Link to the package -> https://pypi.org/project/keras-multi-head/

pip install keras_self_attention

Link to the package -> https://pypi.org/project/keras-self-attention/

In [1]:
# import required packages
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

import keras
from keras.layers import CuDNNLSTM, CuDNNGRU, BatchNormalization, Dense, Dropout, Activation, Embedding, Input, Concatenate
from keras.layers import Bidirectional, SpatialDropout1D, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.models import Model
from keras.optimizers import Adam
import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils.np_utils import to_categorical
from keras_multi_head import MultiHeadAttention
from keras_self_attention import ScaledDotProductAttention, SeqSelfAttention

from sklearn.metrics import confusion_matrix,f1_score, precision_score, recall_score, roc_auc_score, accuracy_score
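# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split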

import pandas as pd
import numpy as np
import re
from glob import glob

import math

Using TensorFlow backend.


In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="1" 

## IMDB Dataset

The task here is to categorize the incoming comment as positive or negative. 

Find the dataset at the link -> http://ai.stanford.edu/~amaas/data/sentiment/

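The `load_imdb_dataset()` function loads the IMDB data. It expects the extracted `train/` and `test/` folders, each with `pos` and `neg` subfolders, in the current working directory.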

In [3]:
def load_imdb_dataset():

    # Load the dataset
    train = pd.DataFrame(columns=["text", "positive"])
    test = pd.DataFrame(columns=["text", "positive"])
    ctr = 0
    cte = 0
    for fil in ['train/', 'test/']:
        for cls in ['pos', 'neg']:
            dset_path = "./" + fil + cls
            for fname in sorted(os.listdir(dset_path)):
                if fname.endswith('.txt'):
                    with open(os.path.join(dset_path, fname), encoding="utf8") as f:
                        if fil == 'train/':
                            train.loc[ctr] = (f.read(), int(cls == "pos"))
                            ctr+=1
                        else:
                            test.loc[cte] = (f.read(), int(cls == "pos"))
                            cte+=1
                            
    return train, test

## Number of samples and their distribution in the test and train sets

In [4]:
train, test = load_imdb_dataset()

print ("Train data shape", train.shape)
print ("Test data shape", test.shape)

Train data shape (25000, 2)
Test data shape (25000, 2)


In [5]:
print("Train data class distbn", train.positive.value_counts())
print("Test data class distbn", test.positive.value_counts())

Train data class distbn 1    12500
0    12500
Name: positive, dtype: int64
Test data class distbn 1    12500
0    12500
Name: positive, dtype: int64


## A few sample comments from both the test and train sets

In [6]:
train.head()

|   | text | positive |
|---|------|----------|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | 1 |
| 1 | Homelessness (or Houselessness as George Carli... | 1 |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | 1 |
| 3 | This is easily the most underrated film inn th... | 1 |
| 4 | This is not the typical Mel Brooks film. It wa... | 1 |


In [7]:
test.head()

|   | text | positive |
|---|------|----------|
| 0 | I went and saw this movie last night after bei... | 1 |
| 1 | Actor turned director Bill Paxton follows up h... | 1 |
| 2 | As a recreational golfer with some knowledge o... | 1 |
| 3 | I saw this film in a sneak preview, and it is ... | 1 |
| 4 | Bill Paxton has taken the true story of the 19... | 1 |


In [8]:
# Average number of words per review 
tr_l = [len(x.split()) for x in train.text]
te_l = [len(x.split()) for x in test.text]
print("Train Sequence length distribution:\n")
print(pd.Series(tr_l).describe())
print("\n\nTest Sequence length distribution:\n")
print(pd.Series(te_l).describe())

Train Sequence length distribution:

count    25000.000000
mean       233.787200
std        173.733032
min         10.000000
25%        127.000000
50%        174.000000
75%        284.000000
max       2470.000000
dtype: float64


Test Sequence length distribution:

count    25000.000000
mean       228.526680
std        168.883693
min          4.000000
25%        126.000000
50%        172.000000
75%        277.000000
max       2278.000000
dtype: float64


In [9]:
# Number of unique words by finding the length of dictionary of words mapped with unique tokens (integers)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(train.text))
print("Vocab size", len(tokenizer.word_counts))

Vocab size 88582


Set the maximum sentence length to the median number of words per review in the train set (174 words).

In [10]:
embed_size = 300 

# The median (50th percentile) number of words per review in the train set is taken as the maximum sentence length.
max_sent_len = int(np.percentile(tr_l, 50)) 

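# Tokenizer ids start at 1, so with num_words equal to the vocabulary size the word with the
# largest id (== num_words) has no row of its own in an embedding table of num_words rows.
num_words = len(tokenizer.word_counts)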
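
As a quick check on what `np.percentile(..., 50)` returns, here is a minimal sketch on a made-up list of review lengths (hypothetical values, not the IMDB data). For a right-skewed distribution like the one printed earlier, the 50th percentile is the median and sits well below the mean.

```python
import numpy as np

# Hypothetical review lengths, right-skewed like the IMDB length distribution.
lengths = [10, 120, 130, 170, 180, 290, 2400]

median_len = int(np.percentile(lengths, 50))   # same expression as used for max_sent_len
mean_len = float(np.mean(lengths))

print("50th percentile:", median_len)
print("np.median:", np.median(lengths))
print("mean:", round(mean_len, 1))
```

Using the median as the cut-off means roughly half of the reviews are truncated and the other half are padded.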

## Tokenize and pad the text sequences

Tokenize -> replace each word with its integer id

Pad -> trim or pad with zeros so that all sentences have the same length

In [11]:
# Convert each sentence to a list of token ids, as required for training
X = tokenizer.texts_to_sequences(train.text)
X = pad_sequences(X, maxlen=max_sent_len)

x_test = tokenizer.texts_to_sequences(test.text)
x_test = pad_sequences(x_test, maxlen=max_sent_len)
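
To make the two steps concrete, here is a small pure-Python sketch. The vocabulary, the sentences and the helper `pad_or_trim` are hypothetical and only for illustration. The helper imitates the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`): short sequences get zeros on the left, and long sequences keep only their last `maxlen` ids.

```python
def pad_or_trim(seq, maxlen, value=0):
    """Illustrative helper (assumes maxlen > 0): left-pad or keep the last maxlen ids."""
    seq = list(seq)[-maxlen:]                      # trim from the front, keep the tail
    return [value] * (maxlen - len(seq)) + seq     # pad on the left with `value`

# Hypothetical word -> id mapping; ids start at 1 so that 0 can mean "padding".
word_to_id = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}

sentences = [
    "the movie was great",
    "boring",
    "the movie was boring the movie was great",
]
sequences = [[word_to_id[w] for w in s.split()] for s in sentences]
padded = [pad_or_trim(s, maxlen=4) for s in sequences]

for row in padded:
    print(row)
```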

## Get the validation data

In [12]:
# Split into train and validation data
x_train, x_val, y_train, y_val = train_test_split(X, train.positive, test_size=0.1, random_state=3)
x_train.shape, x_val.shape

((22500, 174), (2500, 174))

## Functions to load different Embedding files.

Load the embedding file and compute the mean and standard deviation over all values of the word vectors. Then, for every word in the vocabulary, initialize its row of the embedding matrix with the corresponding vector from the embedding file. Words that have no vector in the embedding file keep a random initialization drawn from a normal distribution with that mean and standard deviation.

In [13]:
def load_glove(word_index):
    EMBEDDING_FILE = '../embeddings/glove.840B.300d/glove.840B.300d.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8"))

    # np.stack needs a sequence; recent NumPy versions reject a dict view.
    all_embs = np.stack(list(embeddings_index.values()))
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(num_words, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= num_words: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
            
    return embedding_matrix 

def load_google_news(word_index):
    # The GoogleNews vectors are stored in the binary word2vec format, so they cannot be
    # parsed as space-separated text. This version assumes gensim is installed.
    from gensim.models import KeyedVectors
    EMBEDDING_FILE = '../embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
    word_vectors = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

    all_embs = word_vectors.vectors
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(num_words, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= num_words: continue
        if word in word_vectors: embedding_matrix[i] = word_vectors[word]

    return embedding_matrix 
    
def load_fasttext(word_index):    
    EMBEDDING_FILE = '../embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8") if len(o)>100)

    # np.stack needs a sequence; recent NumPy versions reject a dict view.
    all_embs = np.stack(list(embeddings_index.values()))
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(num_words, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= num_words: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector

    return embedding_matrix
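
The initialization scheme used by these functions can be illustrated on a toy example. The sketch below is not the notebook's code: the 4-dimensional "pretrained" vectors and the three-word index are made up. Unlike the functions above, the toy matrix is given one extra row so that the largest word id is covered as well; that sizing is an assumption of this sketch.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical 4-dimensional "pretrained" vectors standing in for an embedding file.
pretrained = {
    "good": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "bad": np.array([-0.1, -0.2, -0.3, -0.4], dtype="float32"),
}
# Tokenizer-style index: ids start at 1, row 0 is left for padding.
toy_word_index = {"good": 1, "bad": 2, "zzyzx": 3}

all_embs = np.stack(list(pretrained.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()    # scalars over every value

n_rows = len(toy_word_index) + 1                       # +1 so the largest id has a row
matrix = rng.normal(emb_mean, emb_std, (n_rows, all_embs.shape[1]))
for word, i in toy_word_index.items():
    vec = pretrained.get(word)
    if vec is not None:
        matrix[i] = vec                                # known word: copy its vector

print(matrix.shape)
print(matrix[3])   # "zzyzx" is not in the file, so its row stays random
```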

In [14]:
word_index = tokenizer.word_index


## Train Function

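This function trains the model. Its parameters decide which model is trained and which embeddings it uses. Each call trains five models with the same settings, prints the test accuracy of each, and then reports metrics for their majority vote. Note that the name `train` previously referred to the training DataFrame; after this cell it refers to the function.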

In [15]:
def train(embd=0, train_embd=True, bilstm=True, attn=0, avg_pool=False):
## embd -> 0 to use randomly initialized embeddings (no pretrained vectors)
#       -> 1 to use GloVe pretrained vectors
#       -> 2 to use FastText pretrained vectors
#       -> 3 to use the mean of the GloVe and FastText pretrained vectors
#
## train_embd -> True to fine-tune the Embedding layer, False to keep it frozen
#
## bilstm -> False to use a bidirectional GRU
#         -> True to use a bidirectional LSTM
#
## attn -> 0 no attention
#       -> 1 sequence self-attention
#       -> 2 multi-head attention
#       -> 3 both sequence self-attention and multi-head attention, with their pooled outputs concatenated
#
## avg_pool -> True to concatenate the global average pooling output with the global max pooling output
    
    pred_avg = []
    real = list(test.positive)
    
        # Train 5 models on the same train/validation split (repeated runs, not k-fold
        # cross-validation) so that the reported result is more stable.
    for cv in range(5):
        embedding_layer = None

        if (embd==0):
            embedding_layer = Embedding(input_dim=num_words, output_dim=embed_size, input_length=max_sent_len,
                                        trainable=train_embd)

        elif (embd==1):
            embedding_layer = Embedding(input_dim=num_words, output_dim=embed_size, input_length=max_sent_len,
                                        trainable=train_embd, weights=[load_glove(word_index)])
        elif (embd==2):
            embedding_layer = Embedding(input_dim=num_words, output_dim=embed_size, input_length=max_sent_len,
                                        trainable=train_embd, weights=[load_fasttext(word_index)])
        elif (embd==3):    
            embedding_layer = Embedding(input_dim=num_words, output_dim=embed_size, input_length=max_sent_len,
                                        trainable=train_embd, 
                                        weights=[np.mean([load_glove(word_index), load_fasttext(word_index)], axis = 0)])


        sequence_input = Input(shape=(max_sent_len,), dtype='int32')
        embedded_sequences = embedding_layer(sequence_input)
        x = SpatialDropout1D(rate=0.2)(embedded_sequences)
        # SpatialDropout1D drops entire feature channels (the same embedding dimension
        # at every time step) rather than dropping individual units independently.

        if bilstm:
            x = Bidirectional(CuDNNLSTM(units=64, return_sequences=True), merge_mode='concat')(x)
        else:
            x = Bidirectional(CuDNNGRU(units=64, return_sequences=True), merge_mode='concat')(x)


        if attn==1:
            x = SeqSelfAttention()(x)

        if attn==2:
            x = MultiHeadAttention(head_num=4)(x)

        if attn==3:
            x1 = SeqSelfAttention()(x)
            x2 = MultiHeadAttention(head_num=4)(x)
            if avg_pool:
                xg1 = GlobalMaxPooling1D()(x1)
                xa1 = GlobalAveragePooling1D()(x1)
                xg2 = GlobalMaxPooling1D()(x2)
                xa2 = GlobalAveragePooling1D()(x2)
                x = Concatenate()([xg1, xa1, xg2, xa2])
            else:
                x1 = GlobalMaxPooling1D()(x1)
                x2 = GlobalMaxPooling1D()(x2)
                x = Concatenate()([x1, x2])
        else:
            if avg_pool:
                xg = GlobalMaxPooling1D()(x)
                xa = GlobalAveragePooling1D()(x)
                x = Concatenate()([xg, xa])
            else:
                x = GlobalMaxPooling1D()(x)


        x = Dense(units=16, activation="relu", kernel_initializer="glorot_normal")(x)
        x = BatchNormalization()(x)
        x = Dropout(rate=0.4)(x)

        pred = Dense(units=1, activation="sigmoid", kernel_initializer="glorot_normal")(x)
        model = Model(sequence_input, pred)

        model.compile(loss="binary_crossentropy", optimizer=Adam(5e-5),metrics=['accuracy'])

        model.fit(x=x_train, y=y_train, validation_data=(x_val, y_val), epochs=30, batch_size=128, 
                  shuffle=True, verbose=0)

        pred = model.predict(x=x_test)
        pred = pred > 0.5
        pred = [int(p[0]) for p in pred]
        pred_avg.append(pred)
        print("Model:", cv, ", Accuracy_score:", accuracy_score(real, pred))
    
    pred = np.mean(pred_avg, axis=0)
    pred = pred > 0.5
    pred = [int(p) for p in pred]
    print("Confusion Matrix:\n", confusion_matrix(real, pred))
    print("f1_score:",f1_score(real, pred), "precision_score:",precision_score(real, pred),
          "recall_score:",recall_score(real, pred), "accuracy_score:",accuracy_score(real, pred))
    return
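
The last lines of `train()` average the hard 0/1 predictions of the five runs and threshold the average at 0.5, which for an odd number of runs is a majority vote. A minimal sketch with made-up predictions (the array below is hypothetical, not notebook output):

```python
import numpy as np

# Hypothetical hard predictions (0/1) from 5 runs on 6 test reviews.
run_preds = np.array([
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1],
])

vote_share = run_preds.mean(axis=0)           # fraction of runs voting "positive"
ensemble = (vote_share > 0.5).astype(int)     # positive if more than half agree

print(vote_share)
print(ensemble)
```

Averaging the predicted probabilities instead of the thresholded labels is a common alternative, but it is not what the function above does.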

## Results for different models trained on the IMDB dataset

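The results suggest the following:
1. Whether LSTM or GRU works better must be determined experimentally for a given dataset. Here GRU was ahead with randomly initialized embeddings (0.875 vs 0.846 accuracy), while LSTM was ahead with GloVe embeddings (0.880 vs 0.871 without fine-tuning). LSTM normally works better in tasks where long-term dependencies in the sequence matter for the classification.

2. The choice of embedding also depends on the task, and one may choose to combine embeddings. Here frozen GloVe (0.871), frozen FastText (0.868) and their mean (0.871) performed about the same with GRU. Concatenation of embeddings is not shown in this notebook, although embeddings can be combined that way; in general, results tend to deteriorate after concatenating multiple embeddings.

3. Fine-tuning the embeddings usually helps when there is enough data and the data contains a fairly good number of examples of words that were not present in the embedding vocabulary. Here it raised accuracy from 0.871 to 0.887 with GRU and from 0.880 to 0.893 with LSTM.

4. Adding an attention layer improved accuracy by more than 1% (from 0.871 to about 0.884 with frozen GloVe and GRU). Sequence self-attention and multi-head attention scored about the same on their own (0.884 vs 0.885), and combining the two gave comparable results (0.884 to 0.885). Self-attention was ahead once average pooling was added (0.898 vs 0.894).

5. Concatenating global average-pooled features with global max-pooled features may generally produce better results. In this case the runs with average pooling scored about 1% higher (0.898 and 0.894), although those runs also fine-tuned the embeddings, so the two effects are not separated.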
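
To illustrate what concatenating the two pooled outputs means, here is a NumPy sketch of global max pooling and global average pooling over the time axis, followed by concatenation. The input tensor is made up; in the model the same operations are performed by `GlobalMaxPooling1D`, `GlobalAveragePooling1D` and `Concatenate`.

```python
import numpy as np

# Hypothetical recurrent output for one review: 3 time steps, 2 features.
h = np.array([[[1.0, -2.0],
               [3.0,  0.0],
               [2.0,  4.0]]])                 # shape (batch=1, steps=3, features=2)

max_pool = h.max(axis=1)                      # strongest activation per feature
avg_pool = h.mean(axis=1)                     # average activation per feature
features = np.concatenate([max_pool, avg_pool], axis=-1)

print(features.shape)
print(features)
```

The concatenated vector has twice as many features as either pooled output, so the following `Dense` layer sees both the peak and the average response of each feature.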

In [16]:
# base model - embedding randomly initialized, Using LSTM, no attention.
train()

Model: 0 , Accuracy_score: 0.8382
Model: 1 , Accuracy_score: 0.83828
Model: 2 , Accuracy_score: 0.8282
Model: 3 , Accuracy_score: 0.83884
Model: 4 , Accuracy_score: 0.8394
Confusion Matrix:
 [[10870  1630]
 [ 2232 10268]]
f1_score: 0.8417083367489139 precision_score: 0.863002185241217 recall_score: 0.82144 accuracy_score: 0.84552
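
The scores printed for this run follow directly from its confusion matrix. As a sanity check, the sketch below recomputes them by hand, assuming scikit-learn's layout for `confusion_matrix` (rows are true labels, columns are predicted labels, class 0 first).

```python
# Confusion matrix of the base LSTM run: [[tn, fp], [fn, tp]].
tn, fp = 10870, 1630
fn, tp = 2232, 10268

precision = tp / (tp + fp)                    # share of predicted positives that are correct
recall = tp / (tp + fn)                       # share of true positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("f1:", round(f1, 4), "precision:", round(precision, 4),
      "recall:", round(recall, 4), "accuracy:", round(accuracy, 4))
```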


In [17]:
# base model - embedding randomly initialized, Using GRU, no attention.
train(bilstm=False)

Model: 0 , Accuracy_score: 0.86064
Model: 1 , Accuracy_score: 0.86112
Model: 2 , Accuracy_score: 0.85936
Model: 3 , Accuracy_score: 0.86092
Model: 4 , Accuracy_score: 0.8628
Confusion Matrix:
 [[10965  1535]
 [ 1597 10903]]
f1_score: 0.8744085331622424 precision_score: 0.8765878758642869 recall_score: 0.87224 accuracy_score: 0.87472


In [18]:
# Using embedding Glove, No fine tune, Using GRU, no attention.
train(embd=1, train_embd=False, bilstm=False)

Model: 0 , Accuracy_score: 0.86192
Model: 1 , Accuracy_score: 0.86748
Model: 2 , Accuracy_score: 0.86356
Model: 3 , Accuracy_score: 0.86416
Model: 4 , Accuracy_score: 0.86636
Confusion Matrix:
 [[10911  1589]
 [ 1628 10872]]
f1_score: 0.8711189455550659 precision_score: 0.8724821442901853 recall_score: 0.86976 accuracy_score: 0.87132


In [19]:
# Using embedding Glove, No fine tune, Using LSTM, no attention.
train(embd=1, train_embd=False, bilstm=True)

Model: 0 , Accuracy_score: 0.873
Model: 1 , Accuracy_score: 0.87424
Model: 2 , Accuracy_score: 0.87404
Model: 3 , Accuracy_score: 0.87176
Model: 4 , Accuracy_score: 0.87428
Confusion Matrix:
 [[10950  1550]
 [ 1459 11041]]
f1_score: 0.8800765214618788 precision_score: 0.8768961956953379 recall_score: 0.88328 accuracy_score: 0.87964


In [20]:
# Using embedding Glove, fine tuning, Using GRU, no attention.
train(embd=1, train_embd=True, bilstm=False)

Model: 0 , Accuracy_score: 0.87604
Model: 1 , Accuracy_score: 0.87756
Model: 2 , Accuracy_score: 0.87604
Model: 3 , Accuracy_score: 0.8814
Model: 4 , Accuracy_score: 0.8768
Confusion Matrix:
 [[11098  1402]
 [ 1434 11066]]
f1_score: 0.8864146107016981 precision_score: 0.8875521334616618 recall_score: 0.88528 accuracy_score: 0.88656


In [21]:
# Using embedding Glove, fine tuning, Using LSTM, no attention.
train(embd=1, train_embd=True, bilstm=True)

Model: 0 , Accuracy_score: 0.88644
Model: 1 , Accuracy_score: 0.888
Model: 2 , Accuracy_score: 0.8856
Model: 3 , Accuracy_score: 0.88808
Model: 4 , Accuracy_score: 0.88376
Confusion Matrix:
 [[11111  1389]
 [ 1288 11212]]
f1_score: 0.8933508625154376 precision_score: 0.889770653122768 recall_score: 0.89696 accuracy_score: 0.89292


In [22]:
# Using embedding FastText, No fine tune, Using GRU, no attention.
train(embd=2, train_embd=False, bilstm=False)

Model: 0 , Accuracy_score: 0.85296
Model: 1 , Accuracy_score: 0.85712
Model: 2 , Accuracy_score: 0.8652
Model: 3 , Accuracy_score: 0.8664
Model: 4 , Accuracy_score: 0.86064
Confusion Matrix:
 [[10383  2117]
 [ 1192 11308]]
f1_score: 0.8723625843780135 precision_score: 0.8423091247672253 recall_score: 0.90464 accuracy_score: 0.86764


In [24]:
# Using embedding as mean of Glove and FastText, No fine tune, Using GRU, no attention.
train(embd=3, train_embd=False, bilstm=False)

Model: 0 , Accuracy_score: 0.86636
Model: 1 , Accuracy_score: 0.863
Model: 2 , Accuracy_score: 0.86236
Model: 3 , Accuracy_score: 0.86484
Model: 4 , Accuracy_score: 0.86224
Confusion Matrix:
 [[10818  1682]
 [ 1540 10960]]
f1_score: 0.871847903905815 precision_score: 0.866951431735485 recall_score: 0.8768 accuracy_score: 0.87112


In [26]:
# Using embedding Glove, No fine tune, Using GRU, Sequence Self-Attention.
train(embd=1, train_embd=False, bilstm=False, attn = 1)

Model: 0 , Accuracy_score: 0.87948
Model: 1 , Accuracy_score: 0.88204
Model: 2 , Accuracy_score: 0.8778
Model: 3 , Accuracy_score: 0.87932
Model: 4 , Accuracy_score: 0.88364
Confusion Matrix:
 [[10997  1503]
 [ 1403 11097]]
f1_score: 0.884223107569721 precision_score: 0.8807142857142857 recall_score: 0.88776 accuracy_score: 0.88376


In [28]:
# Using embedding Glove, No fine tune, Using GRU, Multi-head Attention.
train(embd=1, train_embd=False, bilstm=False, attn = 2)

Model: 0 , Accuracy_score: 0.87672
Model: 1 , Accuracy_score: 0.8812
Model: 2 , Accuracy_score: 0.8842
Model: 3 , Accuracy_score: 0.87912
Model: 4 , Accuracy_score: 0.87592
Confusion Matrix:
 [[10918  1582]
 [ 1304 11196]]
f1_score: 0.8858295751246142 precision_score: 0.8761934575050868 recall_score: 0.89568 accuracy_score: 0.88456


In [30]:
# Using embedding Glove, No fine tune, Using GRU, Concat of self and Multi head Attentions.
train(embd=1, train_embd=False, bilstm=False, attn = 3)

Model: 0 , Accuracy_score: 0.873
Model: 1 , Accuracy_score: 0.88292
Model: 2 , Accuracy_score: 0.88172
Model: 3 , Accuracy_score: 0.88024
Model: 4 , Accuracy_score: 0.8806
Confusion Matrix:
 [[10771  1729]
 [ 1179 11321]]
f1_score: 0.886183953033268 precision_score: 0.8675095785440613 recall_score: 0.90568 accuracy_score: 0.88368


In [32]:
# Using embedding Glove, fine tuning (train_embd=True), Using GRU, Sequence Self-Attention, concat of average and max pool.
train(embd=1, train_embd=True, bilstm=False, attn = 1, avg_pool=True)

Model: 0 , Accuracy_score: 0.8928
Model: 1 , Accuracy_score: 0.89204
Model: 2 , Accuracy_score: 0.8928
Model: 3 , Accuracy_score: 0.8928
Model: 4 , Accuracy_score: 0.88996
Confusion Matrix:
 [[11235  1265]
 [ 1279 11221]]
f1_score: 0.8981829824701834 precision_score: 0.8986865289123819 recall_score: 0.89768 accuracy_score: 0.89824


In [34]:
# Using embedding Glove, fine tuning (train_embd=True), Using GRU, Multi-head Attention, concat of average and max pool.
train(embd=1, train_embd=True, bilstm=False, attn = 2, avg_pool=True)

Model: 0 , Accuracy_score: 0.88364
Model: 1 , Accuracy_score: 0.88896
Model: 2 , Accuracy_score: 0.88716
Model: 3 , Accuracy_score: 0.88964
Model: 4 , Accuracy_score: 0.88512
Confusion Matrix:
 [[11192  1308]
 [ 1350 11150]]
f1_score: 0.8935010818174534 precision_score: 0.8950072242735592 recall_score: 0.892 accuracy_score: 0.89368


In [18]:
# Using embedding Glove, No fine tune, Using GRU, Concat of self and Multi head attention (avg_pool is not set in this call, so only max pooling is used).
train(embd=1, train_embd=False, bilstm=False, attn = 3)

Model: 0 , Accuracy_score: 0.88268
Model: 1 , Accuracy_score: 0.88324
Model: 2 , Accuracy_score: 0.87912
Model: 3 , Accuracy_score: 0.88304
Model: 4 , Accuracy_score: 0.88188
Confusion Matrix:
 [[10923  1577]
 [ 1300 11200]]
f1_score: 0.8861811132650236 precision_score: 0.8765750958754012 recall_score: 0.896 accuracy_score: 0.88492
