# Certificate in Data Science | Assignment 10 |  
> University of Washington, Seattle, WA    
> January 2020  
> N. Hicks

## Problem Statement

Your next generation search engine startup was successful in having the ability to search for images based on their content. As a result, the startup received its second round of funding to be able to search news articles based on their topic. As the lead data scientist, you are tasked to ***build a model that classifies the topic of each article or newswire***.

- Leverage the RNN_KERAS.ipynb lab in the lesson.

- Use the Keras Reuters newswire topics classification dataset.

This dataset contains 11,228 newswires from Reuters, labeled across 46 topics. Each wire is encoded as a sequence of word indexes. For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words". As a convention, "0" does not stand for a specific word; in this notebook it is reserved for padding, while out-of-vocabulary words are encoded as "2" (`oov_char=2`).
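The frequency-based filtering described above can be sketched in plain Python. This is only an illustration of the idea, assuming a hypothetical helper `filter_by_frequency` and a made-up `wire`; it ignores the `index_from` offset that the real loader applies, and it is not the library's own code.

```python
def filter_by_frequency(seq, num_words=None, skip_top=0, oov_char=2):
    """Replace indexes outside [skip_top, num_words) with oov_char.

    Hypothetical helper that mimics the effect of the `num_words` and
    `skip_top` arguments on a single encoded newswire.
    """
    upper = num_words if num_words is not None else float("inf")
    return [w if skip_top <= w < upper else oov_char for w in seq]


# Made-up sequence of word indexes (smaller index = more frequent word).
wire = [1, 8, 43, 10, 447, 5, 12000, 25]

# Keep the top 10,000 words but drop the 20 most common ones.
filtered = filter_by_frequency(wire, num_words=10000, skip_top=20)
print(filtered)
```

Indexes below 20 and the index 12000 are all mapped to the out-of-vocabulary marker, so only 43, 447, and 25 survive.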

## Assignment

Using the Keras dataset, create a new notebook, perform each of the following data preparation tasks, and answer the related questions:

- Read the Reuters dataset into both training and testing datasets.
- Prepare the dataset for modeling.
- Build/compile 3 different models using Keras LSTM, ideally improving the model with each iteration.
- Describe and explain your findings.


## Import the Libraries

In [1]:
# import the required libraries
import tensorflow as tf
from tensorflow import keras
from keras.datasets import reuters

import numpy as np
import matplotlib.pyplot as plt

Using TensorFlow backend.


## Functions for Scripting

In [2]:
'''
Report the Numpy n-dimensional array characteristics.
RETURN: None; print the characteristics.
'''
def print_array_attrs(arr, txt):
    print('--------------------------')
    print('DATASET                 {}'.format(txt))
    print('dType                   {}'.format(arr.dtype))      # the array data type
    print('num_dimensions          {}'.format(arr.ndim))       # number of dimensions
    print('shape                   {}'.format(arr.shape))      # the array shape
    print('stride                  {}'.format(arr.strides))    # the stride of the array
    print('total num_elements      {}\n'.format(arr.size))       # number of elements
    print('memory address          {}'.format(arr.data))       # the memory address
    print('element length, bytes   {}'.format(arr.itemsize))   # length of one array element, in bytes
    print('elements size, bytes    {}'.format(arr.nbytes))     # total bytes consumed of the elements
    print('memory layout\n{}'.format(arr.flags))      # memory layout

In [3]:
'''
Decode the mapped integers to the assigned words
'''
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

In [4]:
'''
Create a scale function for a single feature.
INPUT: pd.Series; a single attribute.
RETURN: a scaled column attribute.
'''
def scale(col):
    mean_col = np.mean(col)
    sd_col = np.std(col)
    std = (col - mean_col) / sd_col
    return std
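
As a quick check of what `scale` computes, here is a small worked example on a made-up numeric feature (the values in `feature` are hypothetical). Note that `np.std` uses the population standard deviation (`ddof=0`) by default, which is what the function above relies on.

```python
import numpy as np


def scale(col):
    """Standardize a 1-D feature to zero mean and unit variance."""
    col = np.asarray(col, dtype=float)
    return (col - np.mean(col)) / np.std(col)


# Hypothetical feature with mean 5 and population standard deviation 2.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = scale(feature)
print(z)
```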

## Import the Data

In [5]:
# import the dataset from Keras
num_of_words=10000
(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz",
                                                         num_words=num_of_words,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

## Evaluate the Data

In [6]:
# report the attributes (dtype, shape, strides, memory layout) of each array
arrays = [x_train, y_train, x_test, y_test]
labels = ['x_train', 'y_train', 'x_test', 'y_test']
for (arr, txt) in zip(arrays, labels):
    print_array_attrs(arr, txt)

--------------------------
DATASET                 x_train
dType                   object
num_dimensions          1
shape                   (8982,)
stride                  (8,)
total num_elements      8982

memory address          <memory at 0x000000001001EB88>
element length, bytes   8
elements size, bytes    71856
memory layout
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

--------------------------
DATASET                 y_train
dType                   int64
num_dimensions          1
shape                   (8982,)
stride                  (8,)
total num_elements      8982

memory address          <memory at 0x000000001001EB88>
element length, bytes   8
elements size, bytes    71856
memory layout
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

--------------------------
DATASET         

In [7]:
print(x_train[0])

[1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]


In [8]:
print(x_test[0])

[1, 4, 1378, 2025, 9, 697, 4622, 111, 8, 25, 109, 29, 3650, 11, 150, 244, 364, 33, 30, 30, 1398, 333, 6, 2, 159, 9, 1084, 363, 13, 2, 71, 9, 2, 71, 117, 4, 225, 78, 206, 10, 9, 1214, 8, 4, 270, 5, 2, 7, 748, 48, 9, 2, 7, 207, 1451, 966, 1864, 793, 97, 133, 336, 7, 4, 493, 98, 273, 104, 284, 25, 39, 338, 22, 905, 220, 3465, 644, 59, 20, 6, 119, 61, 11, 15, 58, 579, 26, 10, 67, 7, 4, 738, 98, 43, 88, 333, 722, 12, 20, 6, 19, 746, 35, 15, 10, 9, 1214, 855, 129, 783, 21, 4, 2280, 244, 364, 51, 16, 299, 452, 16, 515, 4, 99, 29, 5, 4, 364, 281, 48, 10, 9, 1214, 23, 644, 47, 20, 324, 27, 56, 2, 2, 5, 192, 510, 17, 12]


## Prepare the Data

### Map the Integers to Words

In [9]:
# A dictionary mapping words to an integer index
word_index = reuters.get_word_index(path="reuters_word_index.json")

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

### Decode the Mapped Words

In [10]:
# review the words
decode_review(x_train[0])

'<START> <UNK> <UNK> said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3'
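
The index shift used in the mapping above can be illustrated on a toy vocabulary. The three-word `toy_word_index` below is made up for the example; only the offset of 3 and the reserved tokens follow the notebook.

```python
# Hypothetical frequency-ranked vocabulary (1 = most frequent word).
toy_word_index = {"the": 1, "of": 2, "said": 3}

# Shift every rank by 3 so that indexes 0-3 are free for reserved tokens.
offset = 3
shifted = {word: rank + offset for word, rank in toy_word_index.items()}
shifted.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so that integers can be decoded back to words.
reverse = {index: word for word, index in shifted.items()}


def decode(seq):
    """Join the decoded tokens; unmapped integers become '?'."""
    return " ".join(reverse.get(i, "?") for i in seq)


decoded = decode([1, 6, 4, 2, 99])
print(decoded)
```

After the shift, "said" sits at index 6 and "the" at index 4, while 99 has no entry and falls back to `?`.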

## Build Models - Keras LSTM

### Scale the Data

In [11]:
# standardize the token ids of the first sequence in each split (only index 0 is modified)
x_train[0] = scale(x_train[0])
x_test[0] = scale(x_test[0])

In [12]:
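# Pad or truncate each newswire to 400 tokens; with the pad_sequences defaults,
# longer wires keep their last 400 tokens and shorter wires are left-padded with 0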
max_review_length = 400
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_review_length)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_review_length)
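
To make the padding step concrete, the following NumPy sketch imitates what happens to short and long sequences, assuming the library defaults `padding='pre'` and `truncating='pre'`. The helper `pad_pre` is hypothetical and is not the Keras implementation.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays fully padded
        kept = seq[-maxlen:]            # drop items from the front if too long
        out[row, -len(kept):] = kept    # right-align, leaving padding on the left
    return out


padded = pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

The first row gains one leading zero, while the second row loses its first two tokens.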

### LSTM(units=100, activation='tanh')

In [13]:
# Construct our model
embedding_vecor_length = 32
model_0 = keras.models.Sequential()
model_0.add(keras.layers.Embedding(num_of_words, embedding_vecor_length, input_length=max_review_length))
# basic default LSTM model
model_0.add(keras.layers.LSTM(units=100, activation='tanh'))
model_0.add(keras.layers.Dense(1, activation='sigmoid'))
model_0.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [14]:
print(model_0.summary())
x_train_0, y_train_0 = x_train.copy(), y_train.copy()
model_0.fit(x_train_0, y_train_0, validation_data=(x_test, y_test), epochs=3, batch_size=64)   # batch_size need not divide the sample count; Keras runs a smaller final batch

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 32)           320000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               53200     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 373,301
Trainable params: 373,301
Non-trainable params: 0
_________________________________________________________________
None
Train on 8982 samples, validate on 2246 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x119f3a08>

In [15]:
# Evaluate model
scores = model_0.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.6f%%" % (scores[1]*100))

Accuracy: 4.674978%


### LSTM(units=100, activation='sigmoid', dropout=0.2)

In [16]:
# Construct our model
embedding_vecor_length = 32
model_1 = keras.models.Sequential()
model_1.add(keras.layers.Embedding(num_of_words, embedding_vecor_length, input_length=max_review_length))
# update the model: switch the activation to sigmoid and add the hyperparameter dropout=0.2
model_1.add(keras.layers.LSTM(units=100, activation='sigmoid', dropout=0.2))
model_1.add(keras.layers.Dense(1, activation='sigmoid'))
model_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [17]:
print(model_1.summary())
x_train_1, y_train_1 = x_train.copy(), y_train.copy()
model_1.fit(x_train_1, y_train_1, validation_data=(x_test, y_test), epochs=3, batch_size=64)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 32)           320000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 373,301
Trainable params: 373,301
Non-trainable params: 0
_________________________________________________________________
None
Train on 8982 samples, validate on 2246 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x1b4dbe88>

In [18]:
# Evaluate model
scores = model_1.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.6f%%" % (scores[1]*100))

Accuracy: 4.674978%


### LSTM(units=100, activation='sigmoid', dropout=0.2, use_bias=True, unit_forget_bias=True)

### model_2.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=6, batch_size=138)

In [19]:
# Construct the model
embedding_vecor_length = 32
model_2 = keras.models.Sequential()
model_2.add(keras.layers.Embedding(num_of_words, embedding_vecor_length, input_length=max_review_length))
# add new LSTM hyperparameters: use_bias=True and unit_forget_bias=True
model_2.add(keras.layers.LSTM(units=100, activation='sigmoid', dropout=0.2, use_bias=True, unit_forget_bias=True))
model_2.add(keras.layers.Dense(1, activation='sigmoid'))
model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [20]:
print(model_2.summary())
x_train_2, y_train_2 = x_train.copy(), y_train.copy()
# update the model by epochs=6 and batch_size=138
model_2.fit(x_train_2, y_train_2, validation_data=(x_test, y_test), epochs=6, batch_size=138)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 400, 32)           320000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 373,301
Trainable params: 373,301
Non-trainable params: 0
_________________________________________________________________
None
Train on 8982 samples, validate on 2246 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<tensorflow.python.keras.callbacks.History at 0xafb03488>

In [21]:
# Evaluate model
scores = model_2.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.6f%%" % (scores[1]*100))

Accuracy: 4.674978%


In [22]:
scores

[-207.9033485881367, 0.046749778]
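
The negative loss in `scores` deserves a closer look. Binary cross-entropy is non-negative when labels are 0 or 1, but the Reuters labels are integers from 0 to 45. The sketch below, using made-up labels and a made-up constant prediction, shows how such labels can drive the loss below zero, and how a thresholded sigmoid output only ever matches label 1. This is one plausible reading of the numbers, not something verified in the notebook. For reference, 0.046749778 of the 2,246 test samples is about 105 samples.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, written out with NumPy."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))


def binary_accuracy(y_true, y_pred):
    """Fraction of samples whose rounded prediction equals the label."""
    return float(np.mean(np.round(y_pred) == y_true))


# Hypothetical multi-class labels and a saturated sigmoid output.
y_true = np.array([3, 4, 1, 3, 19, 1, 4, 3, 16, 3], dtype=float)
y_pred = np.full_like(y_true, 0.999)

loss = binary_crossentropy(y_true, y_pred)
acc = binary_accuracy(y_true, y_pred)
print(loss, acc)
```

Here the loss is negative because labels greater than 1 flip the sign of the `(1 - y_true)` term, and the accuracy equals the share of labels that are exactly 1.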

## Results

This assignment is a brief introduction to Recurrent Neural Networks, specifically the Long Short-Term Memory (LSTM) network. The notebook first imports the data and inspects the underlying NumPy arrays. The dataset is then prepared for ingestion into the RNN.

Preparation includes decoding the imported integer arrays back into words to confirm the encoding. Once this is accomplished, the first RNN-LSTM model is built with default hyperparameters.

In total, 3 RNN-LSTM architectures were implemented in an attempt at model tuning:

- `LSTM(units=100, activation='tanh')` (basic default LSTM model)
    * accuracy: 4.674978%
- `LSTM(units=100, activation='sigmoid', dropout=0.2)` (changes the activation to `sigmoid` and adds `dropout=0.2`)
    * accuracy: 4.674978%
- `LSTM(units=100, activation='sigmoid', dropout=0.2, use_bias=True, unit_forget_bias=True)` (adds `use_bias=True` and `unit_forget_bias=True`; trained with `epochs=6` and `batch_size=138`)
    * accuracy: 4.674978%



Contrary to the intended outcome, all three models return the same accuracy and show no improvement. Setting `unit_forget_bias=True` follows a recommendation in the Keras documentation for the LSTM layer. However, `True` is already the default for both `use_bias` and `unit_forget_bias`, so passing them explicitly leaves the layer unchanged, and the third model effectively differs from the second only in `epochs=6` and `batch_size=138`.

Adding `dropout=0.2` likewise did not change the accuracy.

Switching the `activation` from `tanh` to `sigmoid` also made no difference to the accuracy score. Nevertheless, the objectives of this assignment have been accomplished.
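
Since the labels span 46 topics, a softmax output with one unit per topic is the usual setup for this kind of task, whereas `Dense(1, activation='sigmoid')` with `binary_crossentropy` describes a two-class problem. A minimal sketch of such a multi-class head follows. The function `build_multiclass_lstm` and its argument names are hypothetical, and no accuracy is claimed for it because it was not part of the runs reported here.

```python
import tensorflow as tf


def build_multiclass_lstm(num_words=10000, embedding_dim=32, num_classes=46):
    """Embedding + LSTM with one softmax output unit per topic (a sketch)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(num_words, embedding_dim),
        tf.keras.layers.LSTM(100),
        # One probability per topic instead of a single sigmoid unit.
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Integer labels 0..45 pair with the sparse categorical loss.
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
```

With this head, the integer labels in `y_train` and `y_test` can be passed to `fit` and `evaluate` as they are, without one-hot encoding.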