# Lesson 5b: Recurrent Neural Networks: Categorization

Recurrent Neural Networks (RNNs) can be used in many different ways, such as classification, single-step prediction, and the generation of an entire sequence. 

* **Classification**: the input is a sequence, and the output is a single category - this is the focus of this assignment. (Alternatively, a sequence of categories could be generated, one for each partial sequence as it is processed).

* **Prediction**: the input is a sequence, and the output is a prediction for the next element in the sequence. You will explore this in lesson 5b.

* **Sequence Generation** (Seq-to-Seq): both the input and the output are entire sequences. For example, RNN-based language translation may take in an input sequence (of characters or word tokens) in English, and generate as output a sequence (of characters or word tokens) in French.

RNNs can be used to process inputs that occur naturally in time (such as an audio recording of speech, or music represented as a stream of timestamped MIDI messages), but they can also be applied to material that has an order to it even if that order is not necessarily temporal, such as written text (which can be read one character or one word at a time) or written numbers and math equations (which can be read one digit or symbol at a time, from left to right, for instance). This is the problem we investigate today: looking at numbers such as "1423" as a sequence of digits ['1', '4', '2', '3'].
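As a minimal sketch of that idea, a number can be turned into a digit sequence by going through its string form. The helper name `number_to_digits` is made up for this illustration and is not part of the assignment code.

```python
def number_to_digits(number):
    """Split a non-negative integer into its base-10 digit characters."""
    # str() gives the decimal representation; list() splits it into characters.
    return list(str(number))

digits = number_to_digits(1423)
print(digits)  # ['1', '4', '2', '3']
```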

Our problem comes courtesy of Distinguished Professor Douglas R. Hofstadter of Indiana University, author of books such as _Gödel, Escher, Bach: an Eternal Golden Braid_. Hofstadter writes (private communication, shared with permission):

---

_Lately, I have been musing about the seeming power of deep neural nets.  They learn to recognize members of all sorts of categories, when those members (and non-members) are fed to them as patterns of symbols or of pixels.  So, how about the following challenges involving the natural numbers?_

* To recognize the even numbers, expressed in base 3.
     (Specifically, 0, 2, 11, 20, 22, 101, 110, 112,...)
* To recognize the multiples of 3, expressed in base 10.
* To recognize the multiples of 9, expressed in base 10.
* To recognize the multiples of 7, expressed in base 10.
* To recognize the multiples of 29, expressed in base 10.

_(I suppose that if a net can learn any particular one of the above list, it can learn all of them.  Just a guess...)_

 _Moving right along, how about the following somewhat harder challenges?_

* To recognize the correct integer additions, expressed either in base 2 or in base 10.  (For example, the string “12+29=41”.)
* To recognize the correct integer multiplications, expressed either in base 2 or in base 10.  (For example, the string “12x29=348”.)

_(The latter of this pair seems significantly harder than the former.)_
     
_And then, of course, the canonical challenge of this sort:_

* To recognize the prime numbers, expressed either in base 2 or in base 10.

_Each of the above challenges involves a number-theoretical category that can easily be described in purely syntactic terms (i.e., as a rule-based pattern of symbols).  It would be trivial to generate millions of examples of such categories mechanically, and then you just feed them to the neural net.  You can also feed the network lots of counterexamples -- marking them, of course, as non-members of the category.  Can a deep neural network learn any of these categories?  All of them?  Some of them?_

---
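To make the "generate examples mechanically" point concrete, here is a rough sketch for the addition challenge. It produces strings such as "12+29=41" together with a 1/0 label. The function names, the operand range, and the way wrong sums are produced are all assumptions chosen for illustration; they do not come from Hofstadter's message.

```python
import random

def make_addition_example(rng, max_operand=99):
    """Return (equation_string, label), where label is 1 if the sum is correct."""
    a = rng.randint(0, max_operand)
    b = rng.randint(0, max_operand)
    total = a + b
    is_correct = rng.random() < 0.5
    if not is_correct:
        # Shift the sum by a non-zero offset (keeping it non-negative)
        # so that the equation is guaranteed to be wrong.
        offsets = [o for o in range(-10, 11) if o != 0 and total + o >= 0]
        total += rng.choice(offsets)
    return "{}+{}={}".format(a, b, total), int(is_correct)

def is_valid_addition(text):
    """Check an 'a+b=c' string by actually doing the arithmetic."""
    left, right = text.split("=")
    a, b = left.split("+")
    return int(a) + int(b) == int(right)

rng = random.Random(0)  # fixed seed so the run is repeatable
examples = [make_addition_example(rng) for _ in range(200)]
for text, label in examples[:5]:
    print(text, label)
```

The same pattern (generate a candidate, decide its label with an exact rule, emit both) would carry over to the other categories in the list.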

In this assignment, you will use an RNN to try to solve the divisibility-by-3 problem (the rest are challenges you might want to try in your free time!): 

* **"To recognize the multiples of 3, expressed in base 10."**  Specifically, you must:
    * Design an RNN that takes a sequence of digits as input. 
    * Represent digits in base 10 by using a categorical, one-hot encoding, with one node for each digit from 0 through 9.
    * Train the RNN to categorize a number as True if it is evenly divisible by 3, False otherwise.
    * Test the RNN on a set of previously-unseen numbers, including numbers that are 4 digits long, such as 2225 and 3333.
    * Achieve an accuracy of at least 95% on the test set (report the accuracy in the cell marked below).
    * Answer the questions at the end of this notebook.
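The one-hot representation asked for above can be sketched in plain Python as follows. This is only an illustration using nested lists; the notebook itself works with NumPy arrays, and the name `one_hot_digits` is hypothetical.

```python
def one_hot_digits(number, num_categories=10):
    """Encode each base-10 digit of `number` as a one-hot row of length 10."""
    rows = []
    for ch in str(number):
        row = [0] * num_categories
        row[int(ch)] = 1  # switch on the node that stands for this digit
        rows.append(row)
    return rows

encoded = one_hot_digits(2225)
for row in encoded:
    print(row)
```

Each row has exactly one entry set to 1, and the number of rows equals the number of digits.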




- [One hot encoding](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/) 
- [Sequence Prediction](https://machinelearningmastery.com/sequence-prediction-problems-learning-lstm-recurrent-neural-networks/)
- [LSTM Tutorial Keras](https://adventuresinmachinelearning.com/keras-lstm-tutorial/)

# Setup
## Imports

- [Keras - Guide to the sequential model](https://keras.io/getting-started/sequential-model-guide/)
- [Input](https://keras.io/layers/core/)
- [GRU](https://keras.io/layers/recurrent/)
- [LSTM](https://keras.io/layers/recurrent/)
- [Dense](https://keras.io/layers/core/#Dense)
- [Masking](https://keras.io/layers/core/)
- [Dropout](https://keras.io/layers/core/)

In [23]:
%matplotlib inline
from keras.models import Model, Sequential, load_model
from keras.layers import Input, GRU, LSTM, Dense, Masking, Dropout, InputLayer, TimeDistributed
from keras.optimizers import Adam
from keras.preprocessing.sequence import pad_sequences
from keras.regularizers import l1_l2
from keras.utils import to_categorical
import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Constants

In [24]:
#Set up params for dataset.
DIVISIBILITY_NUMBER = 3         # We want to test for divisibility by 3.
TRAIN_TEST_SPLIT = 0.7          # Percentage of data in training set
NUM_EXAMPLES_PER_CLASS = 1000      # Generate the first 1000 multiples of 3 for training/testing
                                # Also generate 1000 non-multiples of 3.
NUM_CATEGORIES = 10             # 10 digits
MAX_DIGITS = 5                  # Number of digits allowed in input strings

# Neural net hyperparameters-- just an example. Adjust these as needed.
BATCH_SIZE = 200
NUM_LSTM_NODES = 100             
DROPOUT = 0.5 
LEARING_RATE = 0.3
NUM_EPOCHS = 100

# TODO: add/modify constants as needed


## Helper functions to generate the dataset of training/testing examples

In [25]:
from random import sample

def generate_example_numbers(base_number=DIVISIBILITY_NUMBER, num_examples_per_class=NUM_EXAMPLES_PER_CLASS):
    """Return a tuple of two lists: (list_of_multiples, list_of_nonmultiples).
    
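    The numbers are sampled at random, without replacement, from
    range(0, num_examples_per_class * 60), so neither list is sorted.
    Each list contains num_examples_per_class elements.
    """
    # Roughly 1 in base_number integers is a multiple, so drawing
    # num_examples_per_class * base_number * 2 candidates yields about
    # 2 * num_examples_per_class multiples on average, enough to subsample from.
    # Note: the pool holds only num_examples_per_class * 60 integers, so sample()
    # raises ValueError when base_number > 30.
    a = sample(range(0, num_examples_per_class * 60), num_examples_per_class * base_number * 2)
    multiples = []
    non_multiples = []
    for i in a:
        if i % base_number == 0:  # use the parameter rather than a hard-coded 3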
            multiples.append(i)
        else:
            non_multiples.append(i)
    
    multiples = sample(multiples, num_examples_per_class)
    non_multiples = sample(non_multiples, num_examples_per_class)
    
    return (multiples, non_multiples)

In [26]:
(a,b) = generate_example_numbers(base_number=3, num_examples_per_class=1000)
print('min {} max {} len {}'.format(min(a), max(a), len(a)))
print('min {} max {} len {}'.format(min(b), max(b), len(b)))

min 51 max 59883 len 1000
min 83 max 59980 len 1000


In [27]:
assert(sum([i % 3 for i in a]) == 0)

In [28]:
assert(0 not in [i % 3 for i in b])

In [29]:
def generate_labels(size_multiples, size_non_multiples):
    """Return two lists of labels: one for the True case (multiples) and one for the False case (nonmultiples).
    
    Represent True as 1, False as 0.
    For example, return ([1, 1, 1, 1.....], [0, 0, 0, 0, ....]) with each list the requested size.
    """
    return ([1] * size_multiples, [0] * size_non_multiples)

In [30]:
def digit_to_vector(digit):
    """Given a digit from 0-9, return a numpy array representing the digit using a 1-hot encoding.
    keras.utils.to_categorical may be useful.
    """
    # Passing num_classes fixes the vector length at 10, so no manual zero-padding is needed.
    return to_categorical(digit, num_classes=NUM_CATEGORIES, dtype='int')

In [31]:
def number_to_input_example(number, max_digits=MAX_DIGITS):
    """Given an integer number, return a numpy float array of 0.0s and 1.0s, of the correct shape to feed into the 
    neural net.
    
    For example, if you have a max of 5 digits then you should have a 2D numpy matrix: 5 rows (one for each
    sequence index), and 10 columns (1 for each digit).
    
    In order to train in "batch" mode, the RNN expects every example to have the same shape. So if you have a 2-digit
    number such as "42", you need to pad the example with a "padding" token somehow; for example, "???42", and then
    use keras.layers.Masking to ignore the leading digits. Or just pad with 0s, as in "00042". 
    keras.preprocessing.sequence.pad_sequences can help with this.
    """
    number_str = str(number)
    number_str = '0' * (max_digits - len(number_str)) + number_str
    tmp = []
    for i in number_str:
        tmp.append(digit_to_vector(int(i)))
    return np.array(tmp)
        

from random import shuffle
    
def generate_dataset(divisibility_number=DIVISIBILITY_NUMBER, train_test_split=TRAIN_TEST_SPLIT, 
                     num_examples_per_class=NUM_EXAMPLES_PER_CLASS):
    """Generate a dataset ready for training. Returns a list of tuples. Each tuple is of the form
    (input_array, label). The dataset should be shuffled either here or during the training process to
    mix divisible-by-DIVISIBILITY_NUMBER and not-divisible-by-DIVISIBILITY_NUMBER examples.
    The dataset should consist of NUM_EXAMPLES_PER_CLASS positive examples (e.g., 1000 examples of divisible-by-3), and
    also NUM_EXAMPLES_PER_CLASS negative examples (e.g., 1000 examples of not-divisible-by-3).
    """
    (multiples, non_multiples) = generate_example_numbers(base_number=divisibility_number, 
                                                          num_examples_per_class=num_examples_per_class)
    
    (l_multiples, l_non_multiples) = generate_labels(num_examples_per_class, num_examples_per_class)
    
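    # Pad every number to the length of the longest one in this dataset.
    # (A local name is used so the module-level MAX_DIGITS constant is not shadowed.)
    max_digits = len(str(max(max(multiples), max(non_multiples))))
    def input_f(x):
        return number_to_input_example(x, max_digits=max_digits)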
    
    
    (multiples, non_multiples) = (list(map(input_f, multiples)), list(map(input_f, non_multiples)))
    
    def helper(x, y):
        res = [(i, j) for i,j in zip(x, y)]
        return res
    
    mult = helper(multiples, l_multiples)
    non_mult = helper(non_multiples, l_non_multiples)
    result = mult + non_mult
    shuffle(result)
    return result

In [32]:
result = generate_dataset()

In [33]:
result[-1]

(array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 0)

In [34]:
len(result[-1][0])

5

## Helper functions to generate the model

In [35]:
# Build RNN model.
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

def build_model():
    model = Sequential()
    
    #TODO: Add/modify layers as desired.
    '''
    model.add(Input(shape=(784,))) # to edit 
    model.add(LSTM(NUM_LSTM_NODES, dropout=0.3, return_sequences=False)) 
    '''
    
    model.add(InputLayer(input_shape=(5,10))) # to edit 
    #model.add(Dropout(0.2))
    #model.add(Masking(mask_value=0.0))
    model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(LSTM(NUM_LSTM_NODES, dropout=0.1, return_sequences=False, recurrent_dropout=0.1 ))
    
    #model.add(LSTM(NUM_LSTM_NODES, dropout=0.3, return_sequences=False))
    #model.add(LSTM(100))
    # Use return_sequences=True for multiple hidden layers
    #model.add(Dense(NUM_LSTM_NODES,activation='relu') )
    #TODO: Add/modify layers as desired.
    # model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))  # Model should return 1 or 0 for divisible/not-divisible
    #model.build()
    return model

In [36]:
# Print the model configuration.
model = build_model()
adam = Adam(lr=LEARING_RATE)   # Modify learning algorithm as needed
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_2 (Conv1D)            (None, 5, 32)             992       
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 2, 32)             0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 54,293
Trainable params: 54,293
Non-trainable params: 0
_________________________________________________________________


# Generate dataset and model

In [37]:
data = generate_dataset()
X = [d[0] for d in data]
y = [d[1] for d in data]

In [38]:
from sklearn.model_selection import train_test_split
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(X, y, test_size=0.2, random_state=42)

In [39]:
train_inputs = np.array(train_inputs)
validation_inputs = np.array(validation_inputs)

train_labels = np.array(train_labels)
validation_labels = np.array(validation_labels)

In [40]:
#TODO: Configure for viewing validation loss/accuracy using the "validation_data" parameter.
model.fit(train_inputs, train_labels, validation_data=(validation_inputs, validation_labels), batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, shuffle=True, verbose=1)

Train on 1600 samples, validate on 400 samples
Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x1078fc080>

# Train the model

- [Keras - fit](https://keras.io/models/sequential/)
- [Recurrent Neural Networks by Example in Python](https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470)
- [Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/)

Report your final accuracy on the validation dataset below.

In [41]:
scores = model.evaluate(validation_inputs, validation_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 47.75%


## Examine model outputs

In [42]:
# Examine the outputs of the model on some test data.
[model.predict(np.expand_dims(validation_inputs[i], 0)) for i in range(10)]

[array([[0.40687916]], dtype=float32),
 array([[0.43077323]], dtype=float32),
 array([[0.3480372]], dtype=float32),
 array([[0.65829575]], dtype=float32),
 array([[0.40687916]], dtype=float32),
 array([[0.40687916]], dtype=float32),
 array([[0.40687916]], dtype=float32),
 array([[0.3480372]], dtype=float32),
 array([[0.40687916]], dtype=float32),
 array([[0.43077323]], dtype=float32)]

In [43]:
[validation_labels[i] for i in range(10)]

[1, 0, 0, 1, 1, 1, 1, 1, 1, 1]
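The sigmoid outputs above are values between 0 and 1, not class labels. A small sketch of turning them into predictions and comparing them with the labels follows; the 0.5 cutoff is an assumed convention, not something the notebook fixes, and the numbers are copied from the two outputs above.

```python
# Model outputs and true labels for the first 10 validation examples,
# copied from the cell outputs above.
outputs = [0.40687916, 0.43077323, 0.3480372, 0.65829575, 0.40687916,
           0.40687916, 0.40687916, 0.3480372, 0.40687916, 0.43077323]
labels = [1, 0, 0, 1, 1, 1, 1, 1, 1, 1]

THRESHOLD = 0.5  # assumed cutoff: at or above means "divisible"
predictions = [1 if p >= THRESHOLD else 0 for p in outputs]
num_correct = sum(int(p == t) for p, t in zip(predictions, labels))

print(predictions)
print("{}/{} correct".format(num_correct, len(labels)))
```

Only a handful of distinct output values appear, which suggests the model is not distinguishing between most inputs.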

In [44]:
# Plot some results
results = []
lo = 0
hi = 1000
rng = range(lo,hi)
for num in rng:
    # Hint: to run on a single example, use "np.expand_dims" to add an extra
    # dimension to the 2D array, in order to make a "batch" of 1.
    results.append(model.predict(np.expand_dims(number_to_input_example(num), 0))[0][0])
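The model's `predict` works on batches, so a single example of shape (5, 10) needs a leading batch axis before it can be passed in. The snippet below only illustrates the shape change with NumPy, using an all-zeros placeholder array rather than a real encoded number.

```python
import numpy as np

# Placeholder for one encoded number: 5 time steps, 10 one-hot columns.
example = np.zeros((5, 10))

# Add a batch axis in front, giving a "batch" that holds one example.
batch = np.expand_dims(example, axis=0)

print(example.shape)  # (5, 10)
print(batch.shape)    # (1, 5, 10)
```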

In [None]:
plt.plot(rng[:100], results[:100])
plt.show()

# Further Questions

1) What happens if you give a 5-digit number or a 6-digit number to the trained model, after training on 1-, 2-, 3-, and 4- digit numbers?

TODO

2) Pick another number from Hofstadter's list above, such as 9, 7, or 29. Train a model, and report the accuracy of your results. Did it work or not? Why or why not (your best guess)?

TODO

3) Record any other comments/insights from your model training process. What worked well? What caused trouble?

TODO

4) If you didn't have a training algorithm, how would you design an RNN-style system to recognize divisibility by 3?
Ignoring the details of the weights, what kind of state must be carried over from step to step as each digit is read in
successively?

TODO
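One possible hand-designed approach, offered as a sketch rather than the expected answer: read the digits left to right and carry only the remainder of the number read so far. The function name `is_divisible` and its `modulus` parameter are hypothetical and chosen for this illustration.

```python
def is_divisible(digits, modulus=3):
    """Read digit characters left to right, carrying only the running remainder."""
    remainder = 0  # the whole "hidden state": a value in 0..modulus-1
    for ch in digits:
        # Appending digit d turns the prefix value n into 10*n + d, so the
        # new remainder depends only on the old remainder and the new digit.
        remainder = (remainder * 10 + int(ch)) % modulus
    return remainder == 0

print(is_divisible("3333"))  # True
print(is_divisible("2225"))  # False
```

For a modulus of 3 the state takes only three values, and because 10 leaves remainder 1 when divided by 3, the update reduces to adding the new digit modulo 3. The same loop works for other moduli such as 7 or 29, at the cost of more states.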

5) BONUS (hard): Explain how the neural net you trained above works, with evidence from examining the node activations as the net runs. Does it do anything similar to what you would have designed as a human?

TODO

In [None]:



# LSTM and CNN for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
X_train

In [None]:
y_train

In [None]:
X_train[0]