In [None]:
# execute this cell before you start

import tensorflow as tf
from tensorflow.keras import layers

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

print(tf.__version__)
print(tf.keras.__version__)


#  CA2
## due on 8/03/2019

To submit the assignment, please do the following:

- Do `Cell -> All output -> Clear` to clear all your output.
- Save the notebook (CA3.ipynb).

# The Reuters newswire data

Consider the data in `keras.datasets.reuters` and train a network that reliably categorizes the newswires.

Load the data, restricting the vocabulary to the 10,000 most frequent words.

In [None]:
num_words=10000
reuters = keras.datasets.reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=num_words)



`num_classes` is the number of unique labels in the data

In [None]:
num_classes = len(set(train_labels))
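
Besides the number of classes, it can be useful to see how often each label occurs, since topic labels are often unevenly distributed. A minimal sketch on made-up labels (`toy_labels` is illustrative and is not the Reuters data):

```python
from collections import Counter

# Made-up labels for illustration only; in the notebook this would be train_labels.
toy_labels = [3, 4, 3, 1, 3, 4, 0]

# Same idea as num_classes above: count the distinct label values.
toy_num_classes = len(set(toy_labels))

# Count how many samples carry each label.
label_counts = Counter(toy_labels)
most_common_label, most_common_count = label_counts.most_common(1)[0]

print(toy_num_classes, most_common_label, most_common_count)
```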

Inspect the data

In [None]:
print(len(train_data), len(train_labels))
print(len(test_data), len(test_labels))

Get the word index map

In [None]:
word_index = reuters.get_word_index()

Index 0 is unused. Use it to set the reserved `<PAD>` string, which is used later on.

In [None]:
word_index["<PAD>"] = 0

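`word_index` maps words to word codes. Create a reverse map from word code to word and a `decode_article` function to view articles in human-readable form. By default, `load_data` shifts every code by `index_from=3` and reserves 0 for padding, 1 for the start of an article and 2 for out-of-vocabulary words, so the reverse map has to apply the same offset.

In [None]:
# Shift each code by 3 to match the encoding used by load_data (index_from=3).
reverse_word_index = {code + 3: word for word, code in word_index.items() if word != "<PAD>"}
reverse_word_index[0] = "<PAD>"

def decode_article(article):
    # Codes without an entry (such as the start marker 1 and the
    # out-of-vocabulary marker 2) are shown as "?".
    return " ".join(reverse_word_index.get(word_code, "?") for word_code in article)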
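
As a toy illustration of decoding, the sketch below uses a made-up three-word vocabulary (not the real Reuters index). It assumes the default `index_from=3` offset of `load_data`, under which the codes stored in the articles are 3 higher than the values returned by `get_word_index()`. The names `toy_word_index`, `toy_reverse` and `toy_decode` are hypothetical.

```python
# Hypothetical miniature vocabulary; the real one comes from reuters.get_word_index().
toy_word_index = {"the": 1, "oil": 2, "prices": 3}
INDEX_FROM = 3  # default offset applied by load_data (assumed unchanged)

# Reverse map from the shifted code to the word, with 0 reserved for padding.
toy_reverse = {code + INDEX_FROM: word for word, code in toy_word_index.items()}
toy_reverse[0] = "<PAD>"

def toy_decode(article):
    # Unknown codes (start marker 1, out-of-vocabulary marker 2) become "?".
    return " ".join(toy_reverse.get(code, "?") for code in article)

encoded = [1, 4, 5, 6, 2, 0, 0]
decoded = toy_decode(encoded)
print(decoded)
```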

To get uniform input data, we make every article exactly 256 words long: longer articles are truncated, and shorter ones are padded at the end with `<PAD>`.

In [None]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], maxlen=256, padding="post")
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index["<PAD>"], maxlen=256, padding="post")
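
Note that `pad_sequences` truncates from the front by default (`truncating="pre"`), so with the calls above an article longer than 256 words keeps its last 256 words, while padding is appended at the end (`padding="post"`). The plain-Python sketch below mimics that behavior for a single sequence; `pad_post` is a hypothetical helper for illustration, not the Keras implementation.

```python
def pad_post(sequence, maxlen, value=0):
    """Mimic padding="post" with the default truncating="pre" for one sequence."""
    # Keep only the last maxlen items (truncation removes items from the front).
    trimmed = list(sequence)[-maxlen:]
    # Append padding values until the sequence has exactly maxlen items.
    return trimmed + [value] * (maxlen - len(trimmed))

short_article = [5, 8, 13]
long_article = [1, 2, 3, 4, 5, 6]

padded_short = pad_post(short_article, maxlen=5)
padded_long = pad_post(long_article, maxlen=4)
print(padded_short, padded_long)
```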

Verify the padding and trimming by decoding the first article.

In [None]:
decode_article(train_data[0])

Now build the model and compile it.

In [None]:
vocab_size = num_words
model = keras.models.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(512, activation=tf.nn.relu))
model.add(keras.layers.Dense(num_classes, activation=tf.nn.softmax))

model.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])
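
To get a feel for the size of this model, the trainable parameters can be counted by hand. The sketch below assumes `num_classes` is 46, the number of topics documented for the Keras Reuters dataset; if the notebook computes a different value, the last layer changes accordingly. `model.summary()` should report the same per-layer counts.

```python
vocab_size = 10000
embedding_dim = 16
hidden_units = 512
num_classes = 46  # assumption: number of topics in the Keras Reuters dataset

# Embedding: one 16-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# GlobalAveragePooling1D only averages over the sequence axis; nothing to train.
pooling_params = 0
# Dense layers: weight matrix plus one bias per output unit.
hidden_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * num_classes + num_classes

total_params = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```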

Now train the model, holding out 30% of the training data for validation.

In [None]:
fit_result = model.fit(train_data, train_labels, epochs=40, validation_split=0.3)
history = fit_result.history
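
`validation_split` takes the validation samples from the end of the arrays passed to `fit`, before any shuffling. The sketch below shows that slicing on ten toy sample indices; `split_for_validation` is a hypothetical helper that mirrors the documented behavior and is not Keras code.

```python
def split_for_validation(samples, validation_split):
    """Split samples into a leading training part and a trailing validation part."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

toy_samples = list(range(10))
train_part, val_part = split_for_validation(toy_samples, 0.3)
print(train_part, val_part)
```

Because the held-out part is always the tail of the data, it is worth checking that the data is not ordered in a way that makes the tail unrepresentative.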

Let's analyze the model training.
First, we plot the accuracy as a function of the epoch.

In [None]:
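# The metric key is 'acc' in TensorFlow 1.x and 'accuracy' in TensorFlow 2.x.
acc_key = 'acc' if 'acc' in history else 'accuracy'
plt.plot(fit_result.epoch, history[acc_key], 'b', label='Training acc')
plt.plot(fit_result.epoch, history['val_' + acc_key], 'r', label='Validation acc')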
plt.title('Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

From the plot of `Epochs vs Accuracy`, we can see that the training and validation accuracy both increase initially.
After about 15 epochs, the validation accuracy stops increasing and in fact decreases slightly, while the training accuracy keeps rising. This gap suggests that the model is overfitting the training data, which is a hint that the hyperparameters (for example the number of epochs or the layer sizes) need to be tuned.

In [None]:
plt.plot(fit_result.epoch, history['loss'], 'b', label='Training loss')
plt.plot(fit_result.epoch, history['val_loss'], 'r', label='Validation loss')
plt.title('Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

The `Epochs vs Loss` plot above confirms the inferences made from the `Epochs vs Accuracy` plot.
After about 15 epochs, the training loss keeps decreasing, while the validation loss starts to increase.
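
One common response to this kind of divergence is early stopping: halt training once the validation loss has not improved for a few epochs. The sketch below illustrates the idea in plain Python on made-up loss values. `best_epoch_with_patience` is a hypothetical helper, not part of Keras, and it simplifies what a callback such as `keras.callbacks.EarlyStopping` does.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it occurred.
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    # Training ran to the end without triggering the stop condition.
    return len(val_losses) - 1, best_epoch

# Made-up validation losses that first fall and then rise again.
toy_val_losses = [2.1, 1.6, 1.3, 1.2, 1.25, 1.3, 1.4, 1.5]
stop_epoch, best_epoch = best_epoch_with_patience(toy_val_losses, patience=3)
print(stop_epoch, best_epoch)
```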

Now evaluate the model with the test data

In [None]:
test_loss, test_acc = model.evaluate(x=test_data, y=test_labels)
test_loss, test_acc

The test accuracy (`test_acc`) is about 70-75%, which leaves room for improvement. Together with the `Epochs vs Loss` plot, this tells us that the model needs further tuning.
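
For reference, the accuracy metric used here is the fraction of articles whose highest-probability class matches the true label. A minimal sketch on made-up softmax outputs, with three classes instead of the real `num_classes` (`top1_accuracy` is a hypothetical helper):

```python
def top1_accuracy(probabilities, labels):
    """Fraction of rows whose arg-max class equals the integer label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / len(labels)

# Made-up softmax outputs for four articles and three classes.
toy_probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
]
toy_labels = [0, 2, 1, 1]

toy_accuracy = top1_accuracy(toy_probabilities, toy_labels)
print(toy_accuracy)
```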