# Lesson 5b: Recurrent Neural Networks: Categorization

Recurrent Neural Networks (RNNs) can be used in many different ways, such as classification, single-step prediction, and the generation of an entire sequence. 

* **Classification**: the input is a sequence, and the output is a single category. This is the focus of this assignment. (Alternatively, a sequence of categories could be generated, one for each partial sequence as it is processed.)

* **Prediction**: the input is a sequence, and the output is a prediction for the next element in the sequence. You will explore this in lesson 5b.

* **Sequence Generation** (Seq-to-Seq): both the input and the output are entire sequences. For example, RNN-based language translation may take in an input sequence (of characters or word tokens) in English, and generate as output a sequence (of characters or word tokens) in French.

RNNs can process inputs that occur naturally in time (such as an audio recording of speech, or music represented as a stream of timestamped MIDI messages), but they can also be applied to material that is ordered without necessarily being temporal in nature, such as written text (which can be read one character or one word at a time) or written numbers and math equations (which can be read one digit or symbol at a time, from left to right, for instance). This is the problem we investigate today: looking at numbers such as "1423" as a sequence of digits ['1', '4', '2', '3'].
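As a small illustration of reading a number as a sequence, the split into digits can be done with plain Python. This is only a sketch; the helper name `number_to_digit_sequence` is hypothetical and is not part of the assignment code.

```python
def number_to_digit_sequence(number):
    """Return the base-10 digits of a non-negative integer as one-character strings."""
    return list(str(number))


print(number_to_digit_sequence(1423))  # ['1', '4', '2', '3']
```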

Our problem comes courtesy of Distinguished Professor Douglas R. Hofstadter of Indiana University, author of books such as _Gödel, Escher, Bach: an Eternal Golden Braid_. Hofstadter writes (private communication, shared with permission):

---

_Lately, I have been musing about the seeming power of deep neural nets.  They learn to recognize members of all sorts of categories, when those members (and non-members) are fed to them as patterns of symbols or of pixels.  So, how about the following challenges involving the natural numbers?_

* To recognize the even numbers, expressed in base 3.
     (Specifically, 0, 2, 11, 20, 22, 101, 110, 112,...)
* To recognize the multiples of 3, expressed in base 10.
* To recognize the multiples of 9, expressed in base 10.
* To recognize the multiples of 7, expressed in base 10.
* To recognize the multiples of 29, expressed in base 10.

_(I suppose that if a net can learn any particular one of the above list, it can learn all of them.  Just a guess...)_

 _Moving right along, how about the following somewhat harder challenges?_

* To recognize the correct integer additions, expressed either in base 2 or in base 10.  (For example, the string “12+29=41”.)
* To recognize the correct integer multiplications, expressed either in base 2 or in base 10.  (For example, the string “12x29=348”.)

_(The latter of this pair seems significantly harder than the former.)_
     
_And then, of course, the canonical challenge of this sort:_

* To recognize the prime numbers, expressed either in base 2 or in base 10.

_Each of the above challenges involves a number-theoretical category that can easily be described in purely syntactic terms (i.e., as a rule-based pattern of symbols).  It would be trivial to generate millions of examples of such categories mechanically, and then you just feed them to the neural net.  You can also feed the network lots of counterexamples -- marking them, of course, as non-members of the category.  Can a deep neural network learn any of these categories?  All of them?  Some of them?_

---
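Hofstadter's remark that examples can be generated mechanically can be sketched in a few lines of Python. The helper names `to_base` and `labeled_examples` are hypothetical, and the sketch assumes bases from 2 to 10 so that every digit is a single character. It reproduces the base-3 even numbers listed in the first challenge.

```python
def to_base(number, base):
    """Return the digits of a non-negative integer in the given base (2-10) as a string."""
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        digits.append(str(number % base))
        number //= base
    return "".join(reversed(digits))


def labeled_examples(is_member, base, limit):
    """Return (string, label) pairs for 0..limit-1; label is True for category members."""
    return [(to_base(n, base), is_member(n)) for n in range(limit)]


# Even numbers written in base 3, as in the first challenge.
examples = labeled_examples(lambda n: n % 2 == 0, base=3, limit=15)
members = [text for text, label in examples if label]
print(members)  # ['0', '2', '11', '20', '22', '101', '110', '112']
```

The same function produces counterexamples as well: every pair whose label is `False` is a non-member of the category.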

In this assignment, you will use an RNN to try to solve the divisibility-by-3 problem (the rest are challenges you might want to try in your free time!): 

* **"To recognize the multiples of 3, expressed in base 10."**  Specifically, you must:
    * Design an RNN that takes a sequence of digits as input. 
    * Represent digits in base 10 by using a categorical, one-hot encoding, with one node for each digit from 0 through 9.
    * Train the RNN to categorize a number as True if it is evenly divisible by 3, False otherwise.
    * Test the RNN on a set of previously-unseen numbers, including numbers that are 4 digits long, such as 2225 and 3333.
    * Achieve an accuracy of at least 95% on the test set (report the accuracy in the cell marked below).
    * Answer the questions at the end of this notebook.




- [One hot encoding](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/) 
- [Sequence Prediction](https://machinelearningmastery.com/sequence-prediction-problems-learning-lstm-recurrent-neural-networks/)
- [LSTM Tutorial Keras](https://adventuresinmachinelearning.com/keras-lstm-tutorial/)

# Setup
## Imports

- [Keras - Guide to the sequential model](https://keras.io/getting-started/sequential-model-guide/)
- [Input](https://keras.io/layers/core/)
- [GRU](https://keras.io/layers/recurrent/)
- [LSTM](https://keras.io/layers/recurrent/)
- [Dense](https://keras.io/layers/core/#Dense)
- [Masking](https://keras.io/layers/core/)
- [Dropout](https://keras.io/layers/core/)

In [1]:
%matplotlib inline
from keras.models import Model, Sequential, load_model
from keras.layers import Input, GRU, LSTM, Dense, Masking, Dropout
from keras.optimizers import Adam
from keras.preprocessing.sequence import pad_sequences
from keras.regularizers import l1_l2
from keras.utils import to_categorical
import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


## Constants

In [2]:
# Set up params for dataset.
DIVISIBILITY_NUMBER = 3         # We want to test for divisibility by 3.
TRAIN_TEST_SPLIT = 0.7          # Fraction of data in training set
NUM_EXAMPLES_PER_CLASS = 1000   # Sample 1000 multiples of 3 for training/testing
                                # Also sample 1000 non-multiples of 3.
NUM_CATEGORIES = 10             # 10 digits
MAX_DIGITS = 5                  # Number of digits allowed in input strings

# Neural net hyperparameters: just an example. Adjust these as needed.
BATCH_SIZE = 32
NUM_LSTM_NODES = 10             
DROPOUT = 0.5 
LEARING_RATE = 0.001
NUM_EPOCHS = 50

# TODO: add/modify constants as needed


## Helper functions to generate the dataset of training/testing examples

In [3]:
from random import sample

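def generate_example_numbers(base_number=DIVISIBILITY_NUMBER, num_examples_per_class=NUM_EXAMPLES_PER_CLASS):
    """Return a tuple of two lists: (list_of_multiples, list_of_nonmultiples).
    
    The numbers are drawn at random, without replacement, from range(0, num_examples_per_class * 60).
    For example, with base_number=3: ([42, 59931, ...], [58, 59897, ...]).
    Each list contains num_examples_per_class elements.
    """
    # Draw enough candidates that, given the density of multiples among the integers
    # (about 1 in base_number), roughly 2 * num_examples_per_class of them are multiples.
    # Note: the sample cannot exceed the population, so this requires base_number <= 30.
    a = sample(range(0, num_examples_per_class * 60), num_examples_per_class * base_number * 2)
    multiples = []
    non_multiples = []
    for i in a:
        if i % base_number == 0:   # test against base_number, not a hard-coded 3
            multiples.append(i)
        else:
            non_multiples.append(i)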
    
    multiples = sample(multiples, num_examples_per_class)
    non_multiples = sample(non_multiples, num_examples_per_class)
    
    return (multiples, non_multiples)

In [4]:
(a,b) = generate_example_numbers(base_number=3, num_examples_per_class=1000)
print('min {} max {} len {}'.format(min(a), max(a), len(a)))
print('min {} max {} len {}'.format(min(b), max(b), len(b)))

min 42 max 59931 len 1000
min 58 max 59897 len 1000


In [5]:
assert(sum([i % 3 for i in a]) == 0)

In [6]:
assert(0 not in [i % 3 for i in b])

In [7]:
def generate_labels(size_multiples, size_non_multiples):
    """Return two lists of labels: one for the True case (multiples) and one for the False case (non-multiples).
    
    Represent True as 1, False as 0.
    For example, return ([1, 1, 1, 1.....], [0, 0, 0, 0, ....]) with each list the requested size.
    """
    return ([1] * size_multiples, [0] * size_non_multiples)

In [8]:
def digit_to_vector(digit):
    """Given a digit from 0-9, return a numpy array representing the digit using a 1-hot encoding.
    keras.utils.to_categorical may be useful.
    """
    # Passing num_classes fixes the vector length at 10, so no manual zero-padding is needed.
    return to_categorical(digit, num_classes=NUM_CATEGORIES, dtype='int')
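For comparison, the same one-hot encoding can be written with NumPy alone, which makes the result easy to inspect without Keras. This is a sketch rather than a replacement for `digit_to_vector`; the names `NUM_DIGIT_CLASSES` and `digit_to_vector_numpy` are hypothetical.

```python
import numpy as np

NUM_DIGIT_CLASSES = 10  # hypothetical name; plays the role of NUM_CATEGORIES


def digit_to_vector_numpy(digit):
    """Return a length-10 one-hot integer vector for a digit from 0 to 9."""
    # Row `digit` of the identity matrix has a single 1 in column `digit`.
    return np.eye(NUM_DIGIT_CLASSES, dtype=int)[digit]


print(digit_to_vector_numpy(3))  # [0 0 0 1 0 0 0 0 0 0]
```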

In [9]:
def number_to_input_example(number, max_digits=MAX_DIGITS):
    """Given an integer number, return a numpy float array of 0.0s and 1.0s, of the correct shape to feed into the 
    neural net.
    
    For example, if you have a max of 5 digits then you should have a 2D numpy matrix: 5 rows (one for each
    sequence index), and 10 columns (1 for each digit).
    
    In order to train in "batch" mode, the RNN expects every example to have the same shape. So if you have a 2-digit
    number such as "42", you need to pad the example with a "padding" token somehow; for example, "???42", and then
    use keras.layers.Masking to ignore the leading digits. Or just pad with 0s, as in "00042". 
    keras.preprocessing.sequence.pad_sequences can help with this.
    """
    number_str = str(number)
    number_str = '0' * (max_digits - len(number_str)) + number_str
    tmp = []
    for i in number_str:
        tmp.append(digit_to_vector(int(i)))
    return np.array(tmp)
        

from random import shuffle
    
def generate_dataset(divisibility_number=DIVISIBILITY_NUMBER, train_test_split=TRAIN_TEST_SPLIT, 
                     num_examples_per_class=NUM_EXAMPLES_PER_CLASS):
    """Generate a dataset ready for training. Returns a list of tuples. Each tuple is of the form
    (input_array, label). The dataset should be shuffled either here or during the training process to
    mix divisible-by-DIVISIBILITY_NUMBER and not-divisible-by-DIVISIBILITY_NUMBER examples.
    The dataset should consist of NUM_EXAMPLES_PER_CLASS positive examples (e.g., 1000 examples of divisible-by-3), and
    also NUM_EXAMPLES_PER_CLASS negative examples (e.g., 1000 examples of not-divisible-by-3).
    """
    (multiples, non_multiples) = generate_example_numbers(base_number=divisibility_number, 
                                                          num_examples_per_class=num_examples_per_class)
    
    (l_multiples, l_non_multiples) = generate_labels(num_examples_per_class, num_examples_per_class)
    
    # Pad every example to the length of the longest generated number. The lowercase local
    # name avoids shadowing the module-level MAX_DIGITS constant.
    max_digits = len(str(max(max(multiples), max(non_multiples))))
    def input_f(x):
        return number_to_input_example(x, max_digits=max_digits)
    
    
    (multiples, non_multiples) = (list(map(input_f, multiples)), list(map(input_f, non_multiples)))
    
    def helper(x, y):
        res = [(i, j) for i,j in zip(x, y)]
        return res
    
    mult = helper(multiples, l_multiples)
    non_mult = helper(non_multiples, l_non_multiples)
    result = mult + non_mult
    shuffle(result)
    return result
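The docstring of `number_to_input_example` mentions padding with a dedicated "padding" token instead of the digit 0. One way to sketch that idea, assuming a Keras `Masking(mask_value=0.0)` layer is placed before the recurrent layer, is to use all-zero rows for the padding positions, since `Masking` skips timesteps whose features all equal `mask_value`. The helper name `number_to_masked_example` is hypothetical, and the code below only builds the array; it does not touch the model.

```python
import numpy as np


def number_to_masked_example(number, max_digits=5):
    """One-hot encode the digits of `number`, left-padding with all-zero rows."""
    digits = [int(ch) for ch in str(number)]
    if len(digits) > max_digits:
        raise ValueError("number has more than max_digits digits")
    example = np.zeros((max_digits, 10), dtype=np.float32)
    offset = max_digits - len(digits)  # number of leading padding rows
    for position, digit in enumerate(digits):
        example[offset + position, digit] = 1.0
    return example


example = number_to_masked_example(42)
print(example.sum(axis=1))  # [0. 0. 0. 1. 1.]
```

Unlike padding with the digit 0, the padding rows here contain no 1 at all, so they cannot be confused with a real leading zero.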

In [10]:
result = generate_dataset()

In [11]:
result[-1]

(array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]), 0)
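A quick sanity check on an encoded example is to decode it back into an integer and compare its divisibility with the label. The sketch below does this for the array printed above (copied by hand); the helper name `input_example_to_number` is hypothetical.

```python
import numpy as np


def input_example_to_number(example):
    """Invert the one-hot encoding: argmax of each row gives the digit at that position."""
    digits = np.argmax(example, axis=1)
    return int("".join(str(d) for d in digits))


# The array from the output above.
example = np.array([
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
])
number = input_example_to_number(example)
print(number, number % 3 == 0)  # 34997 False
```

The decoded number, 34997, is not a multiple of 3, which agrees with the label 0 shown in the output.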

## Helper functions to generate the model

In [26]:
# Build RNN model.
def build_model():
    model = Sequential()
    
    #TODO: Add/modify layers as desired.
    # Each example is a sequence of MAX_DIGITS one-hot vectors of length NUM_CATEGORIES.
    # Giving input_shape to the first layer lets Keras build the model immediately,
    # which model.summary() requires.
    model.add(LSTM(NUM_LSTM_NODES, dropout=0.3, return_sequences=False,
                   input_shape=(MAX_DIGITS, NUM_CATEGORIES)))  # Use return_sequences=True for multiple hidden layers

    #TODO: Add/modify layers as desired.

    model.add(Dense(1, activation='sigmoid'))  # Model should output a value near 1 or 0 for divisible/not-divisible
    return model

In [27]:
model = build_model()
model.summary()

In [23]:
# Print the model configuration.
model = build_model()
adam = Adam(lr=LEARING_RATE)   # Modify learning algorithm as needed
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
#model.summary()

In [24]:
# An empty Sequential model has no layers and no input shape, so summary() cannot build it.
mod = Sequential()
mod.summary()

ValueError: This model has not yet been built. Build the model first by calling build() or calling fit() with some data. Or specify input_shape or batch_input_shape in the first layer for automatic build. 

# Generate dataset and model

In [22]:
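data = generate_dataset()
# Stack into numpy arrays: X has shape (num_examples, num_digits, 10), y has shape (num_examples,).
X = np.array([d[0] for d in data])
y = np.array([d[1] for d in data])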

In [None]:
# TODO: build train/validate datasets.
train_data = ...
validation_data = ...
train_inputs = ...
train_labels = ...
validation_inputs = ...
validation_labels = ...
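One possible way to approach the split, shown here only as a sketch on stand-in data, is to cut the already-shuffled arrays at a fixed fraction. The function name `split_dataset` and the dummy arrays are hypothetical; `sklearn.model_selection.train_test_split`, imported earlier, is an alternative that also shuffles.

```python
import numpy as np


def split_dataset(inputs, labels, train_fraction=0.7):
    """Split already-shuffled arrays into (train_inputs, train_labels, val_inputs, val_labels)."""
    inputs = np.asarray(inputs)
    labels = np.asarray(labels)
    cut = int(round(len(inputs) * train_fraction))  # index of the first validation example
    return inputs[:cut], labels[:cut], inputs[cut:], labels[cut:]


# Stand-in data with the same shapes as the real dataset: 2000 examples of 5 x 10.
dummy_inputs = np.zeros((2000, 5, 10))
dummy_labels = np.zeros(2000)
tr_x, tr_y, va_x, va_y = split_dataset(dummy_inputs, dummy_labels)
print(tr_x.shape, va_x.shape)  # (1400, 5, 10) (600, 5, 10)
```

Because `generate_dataset` shuffles its result, a simple cut like this should keep both classes mixed in each part, though it does not guarantee an exact 50/50 balance.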

In [None]:
# Compile model
adam = Adam(lr=LEARING_RATE)   # Modify learning algorithm as needed
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
model.summary()

# Train the model

In [None]:
#TODO: Configure for viewing validation loss/accuracy using the "validation_data" parameter.
model.fit(train_inputs, train_labels, validation_data=..., batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, shuffle=True, verbose=1)

Report your final accuracy on the validation dataset below.

TODO: accuracy = ??%

## Examine model outputs

In [None]:
# Examine the outputs of the model on some test data.
[model.predict(np.expand_dims(validation_inputs[i], 0)) for i in range(10)]

In [None]:
# Plot some results
results = []
lo = 0
hi = 1000
rng = range(lo,hi)
for num in rng:
    # Hint: to run on a single example, you can use "np.expand_dims" to add an extra 
    # dimension to a 2D array, in order to make a "batch" of 1.
    #
    # TODO something like this:
    results.append(model.predict(np.expand_dims(number_to_input_example(num), 0))[0][0])

In [None]:
plt.plot(rng[:100], results[:100])
plt.show()
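Besides plotting, the sigmoid outputs can be turned into True/False predictions by thresholding and then compared with the true divisibility of each number. The sketch below assumes a threshold of 0.5; the function name `accuracy_at_threshold` is hypothetical, and the probabilities are made up purely to exercise the function, not taken from a trained model.

```python
import numpy as np


def accuracy_at_threshold(probabilities, numbers, divisor=3, threshold=0.5):
    """Fraction of numbers whose thresholded prediction matches true divisibility."""
    predicted = np.asarray(probabilities) >= threshold
    actual = np.asarray(numbers) % divisor == 0
    return float(np.mean(predicted == actual))


# Made-up probabilities for the numbers 0..5 (not real model output).
toy_probabilities = [0.9, 0.2, 0.1, 0.8, 0.6, 0.3]
print(accuracy_at_threshold(toy_probabilities, range(6)))
```

With the real `results` list and `rng` from the cell above, the call would be `accuracy_at_threshold(results, rng)`.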

# Further Questions

1) What happens if you give a 5-digit number or a 6-digit number to the trained model, after training on 1-, 2-, 3-, and 4-digit numbers?

TODO

2) Pick another number from Hofstadter's list above, such as 9, 7, or 29. Train a model, and report the accuracy of your results. Did it work or not? Why or why not (your best guess)?

TODO

3) Record any other comments/insights from your model training process. What worked well? What caused trouble?

TODO

4) If you didn't have a training algorithm, how would you design an RNN-style system to recognize divisibility by 3?
Ignoring the details of the weights, what kind of state must be carried over from step to step as each digit is read in
successively?

TODO

5) BONUS (hard): Explain how the neural net you trained above works, with evidence from examining the node activations as the net runs. Does it do anything similar to what you would have designed as a human?

TODO