# Working with text

In this problem, we will load movie reviews from IMDB, a famous movie database and website, and we will try to predict whether each review is positive or negative.

First, we define the function we will use to diagnose the performance of our model:

In [1]:
%pylab inline
plt.style.use('seaborn-talk')

Populating the interactive namespace from numpy and matplotlib


In [2]:
def plot_metric(history, metric):
    """Plot a training metric per epoch, plus its validation counterpart if it was recorded."""
    history_dict = history.history
    values = history_dict[metric]
    val_key = 'val_' + metric
    has_val = val_key in history_dict

    epochs = range(1, len(values) + 1)

    # semilogy puts the y axis on a log scale; both curves share that axis
    if has_val:
        plt.semilogy(epochs, history_dict[val_key], label='Validation')
        plt.title('Training and validation %s' % metric)
    else:
        plt.title('Training %s' % metric)
    plt.semilogy(epochs, values, label='Training')

    plt.xlabel('Epochs')
    plt.ylabel(metric.capitalize())
    plt.legend()
    plt.grid()

    plt.show()

## Input data

In [3]:
from keras.datasets import imdb

Using TensorFlow backend.


In [4]:
train, test = imdb.load_data(num_words=10000)

In [5]:
train_text, train_labels = train
test_text, test_labels = test

In [6]:
train_labels

array([1, 0, 0, ..., 0, 1, 0])

In [7]:
train_text.shape

(25000,)

Why are these *texts* numbers?

In [8]:
train_text[4][0:10]  # we show only 10 numbers from this vector for brevity

[1, 249, 1323, 7, 61, 113, 10, 10, 13, 1637]

In [9]:
max(max(s) for s in train_text)

9999

These numbers are actually indices into a word index: every word in the vocabulary is identified by an integer.

In [10]:
word_index = imdb.get_word_index()

In [11]:
word_index['car']

516

In [12]:
reversed_word_index = dict((value, key) for (key, value) in word_index.items())

In [13]:
def get_text_from_vector(v):
    # Indices are shifted by 3: 0, 1 and 2 are reserved (padding, start of review, unknown word)
    return ' '.join(reversed_word_index.get(i-3, '?') for i in v)

In [14]:
get_text_from_vector(train_text[4][0:20])

'? worst mistake of my life br br i picked this movie up at target for 5 because i figured'
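The leading `?` in the decoded text comes from the index offset. The sketch below reproduces the idea on a toy vocabulary. It assumes the default conventions of `imdb.load_data` (0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words, and word ranks shifted by 3); the names `toy_word_index`, `encode` and `decode` are hypothetical and not part of Keras.

```python
# Toy vocabulary in the style of imdb.get_word_index(): word -> rank (1 = most frequent).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Assumption: same conventions as the imdb.load_data defaults.
PAD, START, OOV, OFFSET = 0, 1, 2, 3


def encode(words, word_index):
    """Turn a list of words into shifted indices, prefixed by the start marker."""
    return [START] + [word_index[w] + OFFSET if w in word_index else OOV for w in words]


def decode(indices, word_index):
    """Invert encode(); reserved indices have no word and are shown as '?'."""
    reverse = {rank: word for word, rank in word_index.items()}
    return " ".join(reverse.get(i - OFFSET, "?") for i in indices)


encoded = encode(["the", "movie", "was", "awful"], toy_word_index)
decoded = decode(encoded, toy_word_index)
print(encoded)
print(decoded)
```

The start marker and the unknown word both decode to `?`, because subtracting the offset from a reserved index never matches a word rank.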

### Prepare data for the network

We need to prepare the data to be an input to the neural network. The input must be a **tensor**. In our case, all vectors should be of the same length. But not all reviews are of the same size, so the vectors will have different sizes. How can we overcome this problem?

* We can zero-pad the vectors so that all of them have the same size, and then combine them in a tensor. We would need to add an *Embedding* layer to learn **word embeddings** (more on this later).
* Or we can use 1-HOT encoding.

With 1-HOT encoding, every review becomes a vector of size $10^4$ (the number of words in the vocabulary), with a 1 in the position of every word that appears in the review. Let's go with the 1-HOT encoding.
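To make the encoding concrete, here is a pure-Python sketch of the kind of matrix that a binary encoding produces. It is an illustration under the assumption that each row simply marks which indices occur in a sequence; `sequences_to_binary_matrix` is a hypothetical helper, not the Keras implementation.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Return one row per sequence, with 1.0 in column j if index j occurs in it."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for index in seq:
            if index < num_words:  # ignore indices outside the vocabulary
                row[index] = 1.0
    return matrix


# Two toy "reviews" over a vocabulary of 6 indices
toy_sequences = [[1, 4, 4, 2], [1, 3]]
toy_matrix = sequences_to_binary_matrix(toy_sequences, num_words=6)
for row in toy_matrix:
    print(row)
```

Every row has the same length regardless of how long the original sequence was, which is what lets us stack the rows into a single tensor.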

In [15]:
from keras.preprocessing.text import Tokenizer

In [16]:
tokenizer = Tokenizer(num_words=10000)

In [17]:
x_train = tokenizer.sequences_to_matrix(train_text, mode='binary')

In [18]:
x_train.shape  # 25k rows, one per review;  and 10k columns, one per word. Cells will be 1 or 0

(25000, 10000)

In [19]:
x_train[0:5,0:7]  # excerpt from the matrix

array([[0., 1., 1., 0., 1., 1., 1.],
       [0., 1., 1., 0., 1., 1., 1.],
       [0., 1., 1., 0., 1., 0., 1.],
       [0., 1., 1., 0., 1., 1., 1.],
       [0., 1., 1., 0., 1., 1., 1.]])

In [20]:
x_test = tokenizer.sequences_to_matrix(test_text, mode='binary')

**EXERCISE 1**. Can you see any problem with this approach? How would you solve it?

**EXERCISE 2**. Do we need to transform the labels? Why? Or why not?

## Let's build the model

In [6]:
from keras import layers
from keras import models

In [22]:
def build_model():
    m = models.Sequential()
    m.add(layers.Dense(128, activation='relu', input_shape=(10000,)))
    #m.add(layers.Dense(64, activation='relu'))
    #m.add(layers.Dense(32, activation='relu'))
    m.add(layers.Dense(16, activation='relu'))
    m.add(layers.Dense(1, activation='sigmoid'))
    return m

In [7]:
from keras import optimizers
from keras import losses
from keras import metrics

In [24]:
m = build_model()

In [25]:
m.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 128)               1280128   
_________________________________________________________________
dense_2 (Dense)              (None, 16)                2064      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 1,282,209
Trainable params: 1,282,209
Non-trainable params: 0
_________________________________________________________________
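The parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. A small sketch of that arithmetic, where `dense_params` is a hypothetical helper:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input size followed by the number of units of each Dense layer in build_model()
layer_sizes = [10000, 128, 16, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Almost all of the parameters sit in the first layer, because its input has one component per word in the vocabulary.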


In [26]:
m.compile(
    optimizer=optimizers.RMSprop(),
    loss=losses.binary_crossentropy,
    metrics=[metrics.binary_accuracy]
)

In [27]:
h = m.fit(x_train, train_labels, epochs=20, batch_size=1024, validation_split=.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
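The sample counts reported by `fit` follow from `validation_split` and `batch_size`. The sketch below redoes that arithmetic; it assumes, as the Keras documentation describes, that the validation samples are taken from the end of the training data before shuffling.

```python
import math

n_samples = 25000
validation_split = 0.2
batch_size = 1024

n_val = round(n_samples * validation_split)          # held out for validation
n_train = n_samples - n_val                          # actually used for training
steps_per_epoch = math.ceil(n_train / batch_size)    # weight updates per epoch
last_batch = n_train - (steps_per_epoch - 1) * batch_size

print(n_train, n_val, steps_per_epoch, last_batch)
```

With such a large batch size there are only a few weight updates per epoch, which is worth keeping in mind when reading the training curves.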


## Analyze performance

In [None]:
plot_metric(h, 'loss')

In [None]:
plot_metric(h, 'binary_accuracy')

We cannot find a satisfactory model with what we have learned so far. Is there any way to have a better representation of text that can provide better results?

In [None]:
m.evaluate(x_test, test_labels)

# Word embeddings

Using 1-HOT encoded vectors produces large and sparse tensors that are difficult for a neural network to learn from. Word embeddings are compact vectors that represent words in a vector space. These vectors are learnt by the neural network itself, in a layer of type *Embedding*. We can also use pre-trained word embeddings to improve our model.

![](./imgs/07_embeddings.png)

To generate an embedding, we need to tokenize the text, transforming words into indices, and then we use these lists of numbers to produce the vectorial representation:

![](./imgs/08_embeddings.png)

More info:
* http://www.offconvex.org/2015/12/12/word-embeddings-1/
* http://www.offconvex.org/2016/02/14/word-embeddings-2/
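Conceptually, an embedding layer is a lookup table with one row of weights per word index. The following pure-Python sketch illustrates that lookup with random, untrained weights and toy sizes; the names `embedding_table` and `embed` are hypothetical, and a real *Embedding* layer would learn the rows during training.

```python
import random

random.seed(0)
vocab_size = 10      # toy value; the notebook uses 10000
embedding_dim = 4    # toy value; the notebook uses 32

# One row per word index. Here the rows are random; in a model they are trained.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(sequence, table):
    """Replace every word index by its row in the table."""
    return [table[i] for i in sequence]


review = [1, 7, 3, 7]
vectors = embed(review, embedding_table)
print(len(vectors), len(vectors[0]))  # one vector per word, each of size embedding_dim
```

The same word index always maps to the same vector, and the table holds `vocab_size * embedding_dim` numbers in total.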

## Input data for word embeddings

In [8]:
max_words = 10000
max_len = 100  # maximum length of the reviews
embedding_dim = 32  # number of components of the embedding vector

In [9]:
from keras import preprocessing

In [10]:
x_train = preprocessing.sequence.pad_sequences(train_text, maxlen=max_len)

In [11]:
x_train.shape

(25000, 100)

In [12]:
x_train[0:5, 0:10]  # we are limiting the reviews to just 100 words, with a vocabulary of 10^4 words

array([[1415,   33,    6,   22,   12,  215,   28,   77,   52,    5],
       [ 163,   11, 3215,    2,    4, 1153,    9,  194,  775,    7],
       [1301,    4, 1873,   33,   89,   78,   12,   66,   16,    4],
       [  40,    2,   13,  188, 1076, 3222,   19,    4,    2,    7],
       [  13,   16,  131, 2073,  249,  114,  249,  229,  249,   20]],
      dtype=int32)

In [13]:
x_test = preprocessing.sequence.pad_sequences(test_text, maxlen=max_len)
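With its default arguments, `pad_sequences` pads and truncates at the beginning of each sequence, so a long review keeps its last `maxlen` words. The pure-Python sketch below mimics that default behavior on toy data; `pad_pre` is a hypothetical helper, not the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        tail = list(seq)[-maxlen:]                            # truncate at the front
        padded.append([value] * (maxlen - len(tail)) + tail)  # pad at the front
    return padded


toy = [[5, 6], [1, 2, 3, 4, 5, 6]]
toy_padded = pad_pre(toy, maxlen=4)
print(toy_padded)
```

This explains why the padded training matrix has exactly `max_len` columns, whatever the original length of each review.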

## Let's build the model with embeddings

In [14]:
# Try first with Dense layers,
# Then SimpleRNN without return_sequences
# Then with return_sequences
# Then several RNN layers
# Then show LSTM layers
def build_model():
    m = models.Sequential()
    m.add(layers.Embedding(max_words, embedding_dim))
    #m.add(layers.Dense(32, activation='relu'))
    #m.add(layers.SimpleRNN(32, return_sequences=True))        
    #m.add(layers.SimpleRNN(32, return_sequences=True))
    #m.add(layers.SimpleRNN(32, return_sequences=True))
    #m.add(layers.SimpleRNN(32, return_sequences=True))
    m.add(layers.LSTM(32, return_sequences=True))
    #m.add(layers.LSTM(32, return_sequences=True))
    m.add(layers.LSTM(32))
    m.add(layers.Dense(1, activation='sigmoid'))
    return m

In [15]:
m = build_model()

In [16]:
m.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 32)          8320      
_________________________________________________________________
lstm_2 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 336,673
Trainable params: 336,673
Non-trainable params: 0
_________________________________________________________________
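The parameter counts in this summary can also be reproduced by hand. An embedding has one vector per word, and a standard LSTM layer has four sets of input weights, recurrent weights and biases. A sketch of the arithmetic, with hypothetical helper names:

```python
def embedding_params(vocab_size, dim):
    """One vector of size `dim` per word in the vocabulary."""
    return vocab_size * dim


def lstm_params(n_inputs, n_units):
    """Four blocks (three gates and the cell candidate), each with
    input weights, recurrent weights and a bias."""
    return 4 * (n_inputs * n_units + n_units * n_units + n_units)


def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units


counts = {
    "embedding_1": embedding_params(10000, 32),
    "lstm_1": lstm_params(32, 32),
    "lstm_2": lstm_params(32, 32),
    "dense_1": dense_params(32, 1),
}
total_params = sum(counts.values())
print(counts, total_params)
```

Compared with the Dense model on 1-HOT vectors, this model has roughly a quarter of the parameters, and most of them belong to the embedding table.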


In [17]:
m.compile(
    optimizer=optimizers.RMSprop(),
    loss=losses.binary_crossentropy,
    metrics=[metrics.binary_accuracy]
)

In [18]:
h = m.fit(x_train, train_labels, epochs=10, batch_size=1024, validation_split=.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 3072/20000 [===>..........................] - ETA: 2:00 - loss: 0.6929 - binary_accuracy: 0.5280

KeyboardInterrupt: 

## Analyze performance

In [None]:
plot_metric(h, 'loss')

In [None]:
plot_metric(h, 'binary_accuracy')

Not bad: with just an embedding layer, we get $75\%$ accuracy.

In [None]:
loss, acc = m.evaluate(x_test, test_labels)

In [None]:
loss, acc

How many reviews will be misclassified?

In [None]:
(1-acc)*len(test_labels)

Let's check some of the predictions

In [None]:
# N = 123
N = 2344

In [None]:
m.predict(x_test[N:N+1,])[0][0] >= 0.5

In [None]:
test_labels[N]

So this prediction is correct. It says the review is negative. Let's have a look at the text:

In [None]:
get_text_from_vector(test_text[N])

Can we find all the reviews that are wrongly classified?

In [None]:
preds = m.predict(x_test)

In [None]:
preds.shape

In [None]:
preds[0:10]

In [None]:
preds_binary = (preds >= 0.5).reshape((len(preds),))

In [None]:
preds_binary.shape

In [None]:
test_labels.shape

In [None]:
wrong_pos = np.where(test_labels != preds_binary)[0]

In [None]:
wrong_pos.shape

In [None]:
wrong_pos[0:100]

In [None]:
preds_binary[3], preds[3][0]

In [None]:
test_labels[3]

In [None]:
get_text_from_vector(test_text[3])

Is the classifier symmetric, that is, does it produce as many false positives as false negatives?

In [None]:
fp_pos = wrong_pos[test_labels[wrong_pos] == 0]

In [None]:
fp_pos.shape

In [None]:
fn_pos = wrong_pos[test_labels[wrong_pos] == 1]

In [None]:
fn_pos.shape

In [None]:
plt.bar(['fp','fn'],[fp_pos.shape[0],fn_pos.shape[0]])

**EXERCISE** Can you construct the confusion matrix for this model? Can you calculate the precision and recall? How does it compare to accuracy?
* See https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
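As a starting point, here is how the four cells of a confusion matrix, and the metrics derived from them, can be computed on made-up labels. The data is a toy example and has nothing to do with the model's predictions; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# Toy labels: 4 positives and 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)            # of the predicted positives, how many are right
recall = tp / (tp + fn)               # of the actual positives, how many are found
accuracy = (tp + tn) / len(y_true)    # overall fraction of correct predictions
print(tp, tn, fp, fn, precision, recall, accuracy)
```

On this toy data the three metrics disagree, which is the kind of situation the exercise asks you to think about.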

**EXERCISE (more complex)** Keras decided some time ago to remove precision, recall and F1-score from the list of available metrics. Was it a good decision? Why? Why did the Keras authors not remove accuracy too?
* https://github.com/keras-team/keras/issues/5794
* https://github.com/keras-team/keras/issues/4592

**EXERCISE** What is the ROC curve? Could you build the ROC curve for this model? How would you use a ROC curve to evaluate a classifier?
* https://en.wikipedia.org/wiki/Receiver_operating_characteristic
* Help: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

In [None]:
from sklearn import metrics as sk_metrics  # alias, so that the Keras `metrics` module is not shadowed

In [None]:
roc = sk_metrics.roc_curve(test_labels, preds.ravel())

In [None]:
roc

In [None]:
auc = sk_metrics.roc_auc_score(test_labels, preds.ravel())

In [None]:
auc

In [None]:
fpr, tpr, _ = roc
plt.plot(fpr,tpr)
plt.plot([0,1], [0,1],'-')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.grid()
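To see where the points of the curve come from, the sketch below builds a ROC curve by hand: every distinct score is used as a threshold, and the false and true positive rates are computed at each one. It uses toy scores and is a simplified illustration, not scikit-learn's implementation; `roc_points` and `auc_trapezoid` are hypothetical helpers.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from the strictest to the loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the curve, using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc_trapezoid(points)
print(points, area)
```

A classifier that ranks every positive above every negative would reach the top-left corner and an area of 1, while the diagonal corresponds to random guessing.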

Let's analyze false positives and false negatives separately. Then we will try to find a relationship between the words and the misclassification, both for false positives and for false negatives.

In [None]:
fp_pos.shape

In [None]:
def words_hist(pos, texts):
    """Count how many times each word appears in the reviews at positions `pos`."""
    counts = {}

    for p in pos:
        for w in get_text_from_vector(texts[p]).split(' '):
            counts[w] = counts.get(w, 0) + 1

    return counts

In [None]:
fp_words = words_hist(fp_pos, test_text)

Now let's compare with the words of the true positives

In [None]:
# True positives: reviews labeled as positive that the model also predicts as positive
tp_pos = np.where((test_labels == 1) & preds_binary)[0]

In [None]:
tp_pos.shape

In [None]:
tp_words = words_hist(tp_pos, test_text)

In [None]:
import pandas as pd

In [None]:
fp_df = pd.DataFrame.from_dict(fp_words, orient='index')
tp_df = pd.DataFrame.from_dict(tp_words, orient='index')

In [None]:
fp_df.head()

In [None]:
fp_df.sort_values(by=0, ascending=False)[0:40].plot.bar()

In [None]:
tp_df.sort_values(by=0, ascending=False)[0:40].plot.bar()

So the most common words are very similar, which is not surprising. Let's calculate the relative frequency of each word, and then find the words with the largest difference in relative frequency.
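The idea can be sketched first on two tiny, made-up groups of words. This sketch uses plain dictionaries instead of pandas and treats a word that is missing from one group as having frequency zero there, which is an assumption of the example; `relative_freq` is a hypothetical helper.

```python
from collections import Counter


def relative_freq(counts):
    """Convert raw counts into percentages of the group's total."""
    total = sum(counts.values())
    return {w: 100 * c / total for w, c in counts.items()}


# Toy word counts for two groups of reviews
group_a = Counter("great film great cast good plot".split())
group_b = Counter("good film slow plot good cast".split())

freq_a = relative_freq(group_a)
freq_b = relative_freq(group_b)

# Difference in relative frequency; a missing word counts as 0 percent
all_words = set(freq_a) | set(freq_b)
diffs = {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in all_words}
top = sorted(diffs, key=diffs.get, reverse=True)
print(top[0], diffs[top[0]])
```

Words shared by both groups with similar frequency cancel out, so only the words that are characteristic of one group stand out.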

In [None]:
tp_df['f'] = tp_df[0]/tp_df[0].sum()*100
fp_df['f'] = fp_df[0]/fp_df[0].sum()*100

In [None]:
fp_df.head()

In [None]:
tp_df.head()

In [None]:
diffs = tp_df-fp_df

In [None]:
diffs.sort_values(by='f', ascending=False)[0:40]['f'].plot.bar()

We see words such as *great*, *best*, *excellent*, which have a large difference between the true and the false positives. So false positives seem to lack some extreme words, and the classifier is having a hard time trying to assign a category to those reviews.