## Text classification using Neural Networks

The goal of this notebook is to learn to use Neural Networks for text classification.

In this notebook, we will:
- Train a simple neural network, learning embeddings
- Download pre-trained embeddings from GloVe
- Use these pre-trained embeddings

However, keep in mind:
- Deep Learning can outperform simpler ML techniques on text classification, but only with very large datasets and well-designed, well-tuned models
- We won't be using the most computationally efficient techniques: Keras is good for prototyping but rather inefficient for training models on text

### 20 Newsgroups Dataset

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups: http://qwone.com/~jason/20Newsgroups/


In [55]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')


In [30]:
print(newsgroups_train["data"][1000])

From: dabl2@nlm.nih.gov (Don A.B. Lindbergh)
Subject: Diamond SS24X, Win 3.1, Mouse cursor
Organization: National Library of Medicine
Lines: 10


Anybody seen mouse cursor distortion running the Diamond 1024x768x256 driver?
Sorry, don't know the version of the driver (no indication in the menus) but it's a recently
delivered Gateway system.  Am going to try the latest drivers from Diamond BBS but wondered
if anyone else had seen this.

post or email

--Don Lindbergh
dabl2@lhc.nlm.nih.gov



In [36]:
# What are the target classes
print("class of previous message:", newsgroups_train["target_names"][newsgroups_train["target"][1000]])
print("all classes:", set(newsgroups_train["target_names"]))

class of previous message: comp.os.ms-windows.misc
all classes: {'comp.os.ms-windows.misc', 'sci.crypt', 'soc.religion.christian', 'comp.graphics', 'sci.med', 'talk.politics.guns', 'talk.politics.mideast', 'rec.sport.hockey', 'sci.space', 'talk.politics.misc', 'alt.atheism', 'comp.windows.x', 'comp.sys.ibm.pc.hardware', 'rec.motorcycles', 'sci.electronics', 'talk.religion.misc', 'comp.sys.mac.hardware', 'rec.sport.baseball', 'rec.autos', 'misc.forsale'}


### Preprocessing text for the CBOW model

We will implement a simple classification model in Keras. Raw text requires (sometimes a lot of) preprocessing.

The following cells use Keras to preprocess the text:
- using a Tokenizer (you may use a different tokenizer, e.g. from scikit-learn or NLTK). This converts the texts into sequences of indices representing the `20000` most frequent words
- sequences have different lengths, so we pad them (`pad_sequences` adds 0s at the beginning by default) until each sequence has length `1000`; longer sequences are truncated
- we convert the output classes to one-hot encodings
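
To make these three steps concrete, here is a minimal sketch that mimics them on a toy corpus without calling Keras. The helper names (`build_word_index`, `to_sequences`, `pad`, `one_hot`) and the toy sentences are assumptions made for illustration; the real `Tokenizer` also strips punctuation and handles more edge cases.

```python
from collections import Counter
import numpy as np

def build_word_index(texts):
    """Rank words by frequency: index 1 is the most frequent, 0 is kept for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # sort by descending count, then alphabetically for a deterministic order
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ranked)}

def to_sequences(texts, word_index, max_words):
    """Replace each word by its index, dropping words whose index is >= max_words."""
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < max_words]
            for t in texts]

def pad(sequences, maxlen):
    """Left-pad with 0s and keep only the last maxlen tokens of longer sequences."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]
        if seq:
            out[row, -len(seq):] = seq
    return out

def one_hot(labels, n_classes):
    """Turn integer class labels into one-hot rows."""
    return np.eye(n_classes, dtype="float32")[labels]

toy_texts = ["the cat sat", "the dog sat on the mat"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, max_words=5)
toy_x = pad(toy_sequences, maxlen=4)
toy_y = one_hot([0, 1], n_classes=3)
print(toy_x)
print(toy_y)
```

Rare words ("on", "mat") are dropped because their rank falls outside the vocabulary limit, and the shorter sentence is padded on the left with a 0.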

In [60]:
from keras.preprocessing.text import Tokenizer

MAX_NB_WORDS = 20000

# get the raw text data
texts_train = newsgroups_train["data"]
texts_test = newsgroups_test["data"]

# vectorize the text samples into lists of integer word indices
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts_train)
sequences = tokenizer.texts_to_sequences(texts_train)
sequences_test = tokenizer.texts_to_sequences(texts_test)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 134142 unique tokens.


In [39]:
seq_lens = [len(s) for s in sequences]
print("average length:", sum(seq_lens)/len(seq_lens))
print("max length:", max(seq_lens))

average length: 302.5179423722821
max length: 15363


In [63]:
from keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 1000

# pad shorter sequences with 0s and truncate longer ones to MAX_SEQUENCE_LENGTH
x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', x_train.shape)
print('Shape of data test tensor:', x_test.shape)

Shape of data tensor: (11314, 1000)
Shape of data test tensor: (7532, 1000)


In [64]:
from keras.utils.np_utils import to_categorical
y_train = newsgroups_train["target"]
y_test = newsgroups_test["target"]

labels_index = {k:v for (k,v) in enumerate(newsgroups_train["target_names"])}

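# one-hot encode the training labels; keep the test labels as integers
# so they can be compared with the argmax of the predictions
y_train = to_categorical(np.asarray(y_train))
y_test = np.asarray(y_test)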
print('Shape of label tensor:', y_train.shape)

Shape of label tensor: (11314, 20)


### A simple CBOW model in Keras

The following builds a very simple model, similar to the one described in FastText https://github.com/facebookresearch/fastText:
- build an embedding layer mapping each word to a vector representation
- compute the vector representation of all words in each sequence and average them
- add a dense layer to output 20 classes (+ softmax)
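
The three steps above can be sketched as a plain NumPy forward pass. This is only an illustration with random weights and made-up sizes (`vocab_size`, `embedding_dim`, `n_classes` are assumptions), not the Keras implementation; it shows what the embedding lookup, the averaging and the softmax layer compute.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, n_classes = 10, 4, 3  # toy sizes, chosen arbitrarily

E = rng.normal(size=(vocab_size, embedding_dim))  # embedding table, one row per word index
W = rng.normal(size=(embedding_dim, n_classes))   # dense layer weights
b = np.zeros(n_classes)                           # dense layer bias

def cbow_forward(x):
    """x: integer array of shape (batch, seq_len). Returns class probabilities."""
    embedded = E[x]                                # (batch, seq_len, embedding_dim)
    average = embedded.mean(axis=1)                # (batch, embedding_dim)
    logits = average @ W + b                       # (batch, n_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)    # softmax

x = np.array([[0, 0, 3, 7], [1, 2, 5, 9]])         # two padded toy sequences
probs = cbow_forward(x)
print(probs.shape)
```

Note that in this sketch the padding index 0 has its own embedding row and takes part in the average, which is also what happens in Keras when the embedding layer does not mask zeros.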

In [65]:
from keras.layers import Dense, Input, Flatten
from keras.layers import GlobalAveragePooling1D, Embedding
from keras.models import Model

EMBEDDING_DIM = 50

# the vocabulary size is the number of words kept by the tokenizer
embedding_layer = Embedding(MAX_NB_WORDS,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
average = GlobalAveragePooling1D()(embedded_sequences)
predictions = Dense(20, activation='softmax')(average)

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])

Training model.
Train on 10182 samples, validate on 1132 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f58e7a45780>

In [48]:
model.fit(x_train, y_train, validation_split=0.1,
          nb_epoch=10, batch_size=128, )

Train on 8146 samples, validate on 906 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f58eced7d68>

In [71]:
output_test = model.predict(x_test)
test_classes = np.argmax(output_test, axis=-1)
print("test accuracy:", np.mean(test_classes == y_test))

test accuracy: 0.433483802443


### Building more complex models

**Exercise**
- From the previous template, build more complex models using:
  - Recurrent neural networks through LSTM
  - 1d convolution and 1d maxpooling 

**Bonus**
- You may try different architectures with:
  - more intermediate layers, combining dense, convolutional and recurrent layers
  - different recurrent layers (GRU, simple RNN)
  - bidirectional LSTMs

Note: The goal is to build working models rather than to get the best test accuracy. Much better results would require more computation time and more data. Build your models and verify that they converge to reasonable results.
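
Before building the convolutional variant in Keras, it can help to see what a 1d convolution followed by 1d max pooling computes on a sequence of embeddings. The NumPy sketch below is an illustration under assumed toy shapes, not the exercise solution; the function names `conv1d_valid` and `max_pool1d` are hypothetical.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """1d convolution without padding, followed by a ReLU.

    x: (seq_len, dim), kernels: (width, dim, n_filters), bias: (n_filters,)
    returns: (seq_len - width + 1, n_filters)
    """
    width = kernels.shape[0]
    steps = x.shape[0] - width + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        window = x[t:t + width]  # (width, dim) slice of consecutive word vectors
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

def max_pool1d(x, pool):
    """Non-overlapping max pooling along the sequence axis."""
    steps = x.shape[0] // pool
    return x[:steps * pool].reshape(steps, pool, -1).max(axis=1)

rng = np.random.default_rng(0)
embedded = rng.normal(size=(12, 4))   # 12 words, 4-dimensional embeddings
kernels = rng.normal(size=(3, 4, 5))  # 5 filters looking at 3 consecutive words
features = conv1d_valid(embedded, kernels, np.zeros(5))
pooled = max_pool1d(features, pool=2)
print(features.shape, pooled.shape)
```

Each filter responds to a pattern over a few consecutive words, and pooling keeps the strongest response in each small window, which shortens the sequence passed to the next layer.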

In [None]:
from keras.layers import Dense, Input, Flatten
from keras.layers import GlobalAveragePooling1D, Embedding
from keras.models import Model

EMBEDDING_DIM = 50

embedding_layer = Embedding(MAX_NB_WORDS,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

# TODO

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])

In [None]:
# %load solutions/conv1d.py

In [79]:
# %load solutions/lstm.py

In [80]:
model.fit(x_train, y_train, validation_split=0.1,
          nb_epoch=10, batch_size=128, )

Train on 10182 samples, validate on 1132 samples
Epoch 1/10
 1024/10182 [==>...........................] - ETA: 80s - loss: 2.9986 - acc: 0.0596

KeyboardInterrupt: 

In [None]:
EMBEDDING_DIM = 100  # must match the dimension of the GloVe vectors in the file

# parse the GloVe file: each line holds a word followed by its vector coefficients
embeddings_index = {}
with open('glove100K.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Preparing embedding matrix.')

# prepare embedding matrix: row i holds the vector of the word with index i
nb_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
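
The cell above relies on the GloVe text format: one word per line, followed by its vector coefficients separated by spaces. Here is a small self-contained sketch of the same lookup on an in-memory toy file; the three-dimensional vectors and the `toy_` names are made up for illustration.

```python
import io
import numpy as np

# a tiny stand-in for a GloVe file (hypothetical 3-dimensional vectors)
toy_glove_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
)
toy_word_index = {"the": 1, "cat": 2, "zebra": 3}  # index 0 is reserved for padding
toy_dim = 3

# parse each line into word -> vector
toy_vectors = {}
for line in toy_glove_file:
    values = line.split()
    toy_vectors[values[0]] = np.asarray(values[1:], dtype="float32")

# row i of the matrix holds the vector of the word with index i
toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim), dtype="float32")
for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # words missing from the file keep an all-zero row

print(toy_matrix)
```

Here "zebra" has no pre-trained vector, so its row stays at zero, as does the padding row 0.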

In [96]:
# Build a layer with pre-trained embeddings
pretrained_embedding_layer = Embedding(nb_words,
                            100,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH)

In [97]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = pretrained_embedding_layer(sequence_input)
average = GlobalAveragePooling1D()(embedded_sequences)
predictions = Dense(20, activation='softmax')(average)

model = Model(sequence_input, predictions)
# We don't want to fine-tune embeddings
model.layers[1].trainable=False
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])

### Reality check

On small/medium datasets, simpler classification methods usually perform better, and are much more efficient to compute. Here are two resources to go further:
- Naive Bayes approach, using scikit-learn http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
- Alec Radford (OpenAI) gave a very interesting presentation, showing that you need a VERY large dataset to have real gains from GRU/LSTM in text classification https://www.slideshare.net/odsc/alec-radfordodsc-presentation

However, inspecting the learned features shows that classification with simple methods isn't very robust and won't generalize well to slightly different domains (e.g. from forum posts to emails).
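
As a rough idea of what the Naive Bayes baseline mentioned above looks like, here is a minimal scikit-learn sketch on a made-up four-document corpus. The toy texts and labels are assumptions for illustration only; the pipeline (TF-IDF features followed by `MultinomialNB`) is one common setup, not necessarily the exact one used in the linked page.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy training data (hypothetical), two documents per class
train_texts = [
    "the gpu driver crashed on windows",
    "mouse cursor driver bug on windows",
    "the team won the hockey game",
    "great goal in the hockey match",
]
train_labels = ["comp", "comp", "sport", "sport"]

# bag-of-words TF-IDF features fed into a multinomial Naive Bayes classifier
baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(train_texts, train_labels)

baseline_predictions = baseline.predict(["windows driver update", "hockey game tonight"])
print(baseline_predictions)
```

On the real dataset, the same pipeline could be fitted on `newsgroups_train["data"]` and `newsgroups_train["target"]` and evaluated with `baseline.score` on the test split.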