# Lesson 10 Assignment

## Background

    Your next-generation search engine startup successfully built the ability to search for images based on their content. As a result, the startup received its second round of funding, this time to search news articles based on their topic. As the lead data scientist, you are tasked with building a model that classifies the topic of each article or newswire.
    
    For this assignment, you will build on the RNN_KERAS.ipynb lab in the lesson. You are tasked to use the Keras Reuters newswire topics classification dataset. This dataset contains 11,228 newswires from Reuters, labeled over 46 topics. Each wire is encoded as a sequence of word indexes. For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words". As a convention, "0" does not stand for a specific word. With the loader's default settings it is reserved for padding, and unknown (out-of-vocabulary) words are encoded as "2", which matches the `<PAD>` and `<UNK>` entries in the word index built in step (2).
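
The frequency-rank filter described above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: `filter_by_rank` is a hypothetical helper, and it assumes that out-of-range words are replaced by a placeholder rather than dropped. The placeholder is a parameter; 2 is used here to match the `<UNK>` entry in the notebook's word index.

```python
def filter_by_rank(sequence, num_words=10000, skip_top=20, oov=2):
    """Keep indices in [skip_top, num_words); replace the rest with `oov`.

    Hypothetical helper for illustration only.
    """
    return [w if skip_top <= w < num_words else oov for w in sequence]


# Toy newswire: 5 and 12 are too common, 15000 is too rare.
wire = [5, 43, 12, 447, 15000, 3095]
filtered = filter_by_rank(wire)
print(filtered)  # [2, 43, 2, 447, 2, 3095]
```

Replacing rather than deleting keeps every wire at its original length, so word positions stay aligned with the source text.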

## Steps

    Complete the lab exercises for this week before following these steps to complete your assignment.

    Using the Keras dataset, create a new notebook and perform each of the following data preparation tasks and answer the related questions:

    (1) Read Reuters dataset into training and testing 
    (2) Prepare dataset
    (3) Build and compile 3 different models using Keras LSTM, ideally improving the model at each iteration.
    (4) Describe and explain your findings.

### (1) Read Reuters dataset into training and testing 

In [1]:
# Importing packages

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
from keras.datasets import reuters

# Import NumPy
import numpy as np

from keras.utils import to_categorical

Using TensorFlow backend.


In [2]:
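# Helper that decodes a sequence of word indices back into text
# (relies on reverse_word_index, which is built in a later cell)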

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

In [3]:
# Read Reuters dataset into training and testing 
data = tf.keras.datasets.reuters
num_of_words = 10000
(x_train, y_train), (x_test, y_test) = data.load_data(num_words=num_of_words)

### (2) Prepare dataset

In [4]:
# Looking at data
                                                    
print(x_train[0])

[1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]


In [5]:
# A dictionary mapping words to an integer index

#word_index = tf.keras.datasets.reuters.get_word_index()
word_index = reuters.get_word_index(path="reuters_word_index.json")
# The first indices are reserved

word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
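
To see why the indices are shifted by 3, here is a minimal sketch with a toy vocabulary. `toy_index` and `decode` are made up for illustration; the real mapping comes from `reuters.get_word_index()`. The shift is assumed to mirror the loader's default reserved slots for padding, start, and unknown tokens.

```python
# Toy stand-in for the raw word index (word -> frequency rank, starting at 1).
toy_index = {"the": 1, "of": 2, "said": 3}

# Shift every rank by 3 so that 0..3 are free for the reserved tokens.
shifted = {word: rank + 3 for word, rank in toy_index.items()}
shifted["<PAD>"] = 0
shifted["<START>"] = 1
shifted["<UNK>"] = 2
shifted["<UNUSED>"] = 3

# Invert the mapping so encoded sequences can be turned back into text.
reverse = {index: word for word, index in shifted.items()}


def decode(sequence):
    """Map indices back to words; '?' marks an index missing from the mapping."""
    return " ".join(reverse.get(i, "?") for i in sequence)


decoded = decode([1, 4, 2, 6, 0, 99])
print(decoded)  # <START> the <UNK> said <PAD> ?
```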

In [6]:
# Reviewing Decoding Data

decode_review(x_train[0])

'<START> <UNK> <UNK> said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3'

In [7]:
# Pad or truncate every newswire to 400 tokens.
# pad_sequences pads and truncates at the front by default, so longer wires keep their last 400 tokens.

max_review_length = 400  
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_review_length)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_review_length)
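
The effect of padding to a fixed length can be sketched without Keras. The helper `pad_pre` below is hypothetical; it assumes the `pad_sequences` defaults, which pad with zeros at the front and, for sequences longer than `maxlen`, drop tokens from the front.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` tokens of long ones.

    Assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        tail = list(seq)[-maxlen:]  # front truncation: keep the end of the sequence
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded


batch = pad_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(batch)  # [[0, 0, 7, 8], [3, 4, 5, 6]]
```

Every row now has the same length, which is what lets the batch be stacked into a single 2-D array for the embedding layer.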

### (3) Build and compile 3 different models using Keras LSTM, ideally improving the model at each iteration.

In [None]:
# Construct model 1

embedding_vector_length = 46
model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
model.add(keras.layers.LSTM(100))
# Single sigmoid output unit: the network emits one value per wire, not one score per topic
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='kullback_leibler_divergence', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 46)           460000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               58800     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 518,901
Trainable params: 518,901
Non-trainable params: 0
_________________________________________________________________
None




Train on 8982 samples, validate on 2246 samples
Epoch 1/3

In [None]:
# Evaluate model 1

scores = model.evaluate(x_test, y_test, verbose=0)

In [None]:
# Construct model 2

embedding_vector_length = 46
model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
model.add(keras.layers.LSTM(100))
# Same single-unit sigmoid output as model 1; only the loss function changes below
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)

In [None]:
# Evaluate model 2

scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
# Construct model 3

embedding_vector_length = 46
model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
model.add(keras.layers.LSTM(100))
# 4579 output units (more than the 46 topic labels); the sparse loss below takes integer labels directly
model.add(keras.layers.Dense(4579, activation='sigmoid'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)

In [None]:
# Evaluate model 3

scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

### (4) Describe and explain your findings.

#### Comments:

    The first two models that I ran had extremely poor accuracy. The first model used a loss function of kullback_leibler_divergence, which led to an accuracy of around 4%. The second model used a loss function of binary_crossentropy, which did not improve the accuracy. The third and final model, which used sparse_categorical_crossentropy and a much wider output layer, produced the highest accuracy, 36%.
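
One likely contributor to the low scores (an interpretation, not something this notebook verified): the labels are integer topic ids from 0 to 45, while models 1 and 2 end in a single sigmoid unit and model 3 ends in 4579 units. A common setup for this kind of problem is a `Dense(46, activation='softmax')` output paired with `sparse_categorical_crossentropy`. The plain-Python sketch below shows what that loss computes for one sample; the function names and the toy scores are made up for illustration and are not taken from the notebook's runs.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label, probs):
    """Negative log-probability assigned to the true class (integer label)."""
    return -math.log(probs[label])


num_topics = 46
logits = [0.0] * num_topics
logits[3] = 2.0  # toy scores: the model favors topic 3

probs = softmax(logits)
loss_right = sparse_categorical_crossentropy(3, probs)   # true topic is 3
loss_wrong = sparse_categorical_crossentropy(10, probs)  # true topic is 10
print(loss_right < loss_wrong)
```

With one output per topic, the loss falls as the probability of the true topic rises, which is the signal a single sigmoid unit cannot provide for 46 classes.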