# Recurrent Neural Networks

RNNs capture patterns in sequences of data. In this case, the sequences are the characters in the names of chemical compounds, which are often composed of parts like prefixes (e.g. "mono-", "methyl-"), roots (e.g. "fluor", "nitr"), and suffixes (e.g. "-ate", "-ide"). We'll see if a simple RNN can pick up on these patterns and generate plausible new compound names.

## Objective

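This will be a "seq-to-seq" RNN, meaning it takes a sequence of inputs and produces a sequence of outputs, one per time step. We will train it to predict the next character at each step. This makes it a character-level *language model*. Because the training targets come from the input data itself, it is self-supervised, much like an autoencoder, although it predicts the next character rather than reconstructing the current one.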

During training, the correct label is the true next character, so the labels are obtained by offsetting the full string one position to the left (see the training figure below). During inference, a character is sampled from the network's output at each time step and passed to the next step as input.

In order to generate variable length sequences, we'll add characters to signify the start ('<') and end ('>') of a compound name. At inference time, we will pass '<' to the network to start the sequence, and stop generation when the network outputs '>'.
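
As a small illustration of the offset and the start/end markers described above, the sketch below wraps one name and builds the input/label pair by hand. The helper name `make_pair` is hypothetical and is not part of the notebook.

```python
def make_pair(name, start='<', end='>'):
    """Wrap a name in start/end markers and return (inputs, labels).

    The labels are the inputs shifted one step to the left, so the
    label at position t is the character at position t + 1.
    """
    wrapped = start + name + end
    return wrapped[:-1], wrapped[1:]


inputs, labels = make_pair('indium iodide')
# Show the first few (input character -> label character) pairs
for x, y in list(zip(inputs, labels))[:4]:
    print(repr(x), '->', repr(y))
```

The first input is always the start marker and the last label is always the end marker, which is what lets generation begin from '<' and stop at '>'.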

### Training

![train](https://drive.google.com/uc?id=1_dn3LMkMnWz8CMXT7xayGDdq5S9aVY3V)

### Inference

![inference](https://drive.google.com/uc?id=1kMKxCukw9M5K6CA2UxO-eQBxLqAa4O0B)

In [23]:
import tensorflow as tf
import numpy as np
import os

In [24]:
!gdown 1eBQlWh4npoIP3l9kWbfall6x85X4Jkls

Downloading...
From: https://drive.google.com/uc?id=1eBQlWh4npoIP3l9kWbfall6x85X4Jkls
To: /content/compounds.txt
100% 66.1k/66.1k [00:00<00:00, 106MB/s]


In [25]:
# Read the corpus as bytes, then decode it as UTF-8
with open('compounds.txt', 'rb') as f:
    text = f.read().decode(encoding='utf-8')

In [26]:
# The unique characters in the corpus
vocab = sorted(set(text))

# Append the end ('>') and start ('<') markers; they take the last two indices
vocab.append('>')
vocab.append('<')

print('Vocab:\n', vocab)
print('Vocab size:\n', len(vocab))

Vocab:
 ['\n', ' ', "'", '(', ')', ',', '-', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '>', '<']
Vocab size:
 48


Neural networks operate on vectors of continuous values, not lists of discrete values like characters. We will need to vectorize the input using "one-hot" encoding, in which the vector is the size of the "vocabulary" (in this case all characters we saw in the input). For each input character, the one-hot vector will have a 1 in the dimension representing that character and 0s everywhere else.
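
To make the encoding concrete, here is a minimal NumPy sketch on a toy five-character vocabulary. The names `toy_vocab` and `toy_char2idx` are assumptions for this example only; the notebook itself performs the encoding later with `tf.one_hot`.

```python
import numpy as np

toy_vocab = ['<', '>', 'a', 'b', 'c']  # toy vocabulary (assumption)
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}

# Integer code for each character of a short string
indices = np.array([toy_char2idx[ch] for ch in '<cab'])

# Row i of the identity matrix is the one-hot vector for index i
one_hot = np.eye(len(toy_vocab), dtype=np.float32)[indices]

print(indices)        # integer codes, one per character
print(one_hot.shape)  # (number of characters, vocabulary size)
print(one_hot)
```

Each row contains a single 1, and taking `argmax` along the last axis recovers the original integer codes.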

In [27]:
# Create a mapping from unique characters to indices
char2idx = {u: i for i, u in enumerate(vocab)}
# Keep the vocabulary as a NumPy array so predicted indices can be decoded back to characters
idx2char = np.array(vocab)
# Split the corpus into one name per line and wrap each name in start/end markers
lines = text.split('\n')
lines = ['<' + line + '>' for line in lines]
max_len = max(len(line) for line in lines)
print("Max length:", max_len)

Max length: 67


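We will also pad the training names with zeros so they are all the same length. Index 0 corresponds to '\n' in the vocabulary, which never occurs inside a name once the text has been split into lines, so it can double as the padding value.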

In [28]:
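source = []
target = []
for line in lines:
    # Zero-padded integer encoding of the wrapped name
    padded = np.zeros(max_len, np.int32)
    for i, c in enumerate(line):
        padded[i] = char2idx[c]
    source.append(padded[:-1])  # everything except the last position
    target.append(padded[1:])   # the same sequence shifted one step to the left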

Let's examine an encoded training example. Notice how the target is actually just the source, but shifted by one, because our network is being tasked with looking one time step into the future.

In [29]:
print("Text:")
print(lines[0])
print("\nVector encoded source:")
print(source[0])
print("\nVector encoded target:")
print(target[0])

Text:
<(1r,3r)-1,2,3-trimethylcyclopentane>

Vector encoded source:
[47  3  9 37  5 11 37  4  6  9  5 10  5 11  6 39 37 28 32 24 39 27 44 31
 22 44 22 31 34 35 24 33 39 20 33 24 46  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]

Vector encoded target:
[ 3  9 37  5 11 37  4  6  9  5 10  5 11  6 39 37 28 32 24 39 27 44 31 22
 44 22 31 34 35 24 33 39 20 33 24 46  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]


We'll create a TensorFlow dataset object to handle shuffling, batching, and one-hot encoding.

In [30]:
dataset = tf.data.Dataset.from_tensor_slices((source, target))

In [31]:
BUFFER_SIZE = 10000  # shuffle buffer size; elements are shuffled within this buffer
BATCH_SIZE = 1
vocab_size = len(vocab)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
# One-hot encode the inputs only; the labels stay as integer indices
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=vocab_size), Y_batch))
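
The comment about shuffling "within buffers" can be illustrated with a pure-Python sketch of a buffered shuffle. This is only a model of the idea (fill a buffer, emit a random element, refill the slot with the next item), not TensorFlow's actual implementation; the function name `buffered_shuffle` is hypothetical.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order randomized only within a sliding buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        j = rng.randrange(buffer_size)
        yield buffer[j]          # emit a random buffered element
        buffer[j] = item         # and replace it with the new item
    rng.shuffle(buffer)          # drain what is left in random order
    yield from buffer


out = list(buffered_shuffle(range(10), buffer_size=4))
print(out)
```

With a buffer smaller than the dataset, early outputs can only come from early inputs. A buffer at least as large as the dataset gives a full shuffle, which appears to be the case here: the training progress bars show 3542 batches of size 1, well under `BUFFER_SIZE`.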

**1. Create a Keras sequential model with a [SimpleRNN](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN) layer with 128 units and a dense output layer. Store the model in `model`.**
- Hint: The batch dimension is implied, so you will specify an input shape of two dimensions: the first is for the number of time steps (which can be None to allow any); the second is for the number of values in each time step (think about our one-hot input).
- Hint: Give the dense layer the appropriate size for predicting the next character. It does not need a softmax activation, however, since we will use `from_logits` in the loss.
- Hint: The RNN layer should return sequences, since this RNN will have an output at each time step.

In [32]:
# YOUR CODE HERE (1)
model = tf.keras.models.Sequential( # creating sequential model
    [ # layers within sequential model
      tf.keras.layers.SimpleRNN(128, input_shape=[None, vocab_size], return_sequences=True), # SimpleRNN 128 units
      tf.keras.layers.Dense(vocab_size) # dense output layer
    ]
)
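
To make the `return_sequences=True` hint concrete, here is a minimal NumPy sketch of the standard simple-RNN recurrence, h_t = tanh(x_t W_x + h_{t-1} W_h + b). It is an illustration under assumptions, not the Keras implementation: the function name `simple_rnn_forward` and the random weights are made up for this example.

```python
import numpy as np


def simple_rnn_forward(x, W_x, W_h, b):
    """Run a simple RNN over x of shape (time_steps, input_dim).

    Returns the hidden state at every time step, shape (time_steps, units).
    """
    h = np.zeros(W_h.shape[0])  # initial hidden state
    states = []
    for x_t in x:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
time_steps, input_dim, units = 5, 48, 128
x = rng.normal(size=(time_steps, input_dim))
W_x = rng.normal(scale=0.1, size=(input_dim, units))
W_h = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

all_states = simple_rnn_forward(x, W_x, W_h, b)
print(all_states.shape)      # one hidden state per time step
print(all_states[-1].shape)  # only the final state
```

Returning every state is what lets the dense layer produce a next-character prediction at each position; keeping only the last state would give a single prediction per sequence.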

In [33]:
model.summary()
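
The summary output is not reproduced here, but the parameter counts can be estimated by hand. The sketch below assumes the usual layout for these layers (an input kernel, a recurrent kernel, and a bias for the RNN; a kernel and a bias for the dense layer); the helper names are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    # input kernel + recurrent kernel + bias
    return input_dim * units + units * units + units


def dense_params(input_dim, units):
    # kernel + bias
    return input_dim * units + units


vocab_size, rnn_units = 48, 128
rnn_count = simple_rnn_params(vocab_size, rnn_units)
dense_count = dense_params(rnn_units, vocab_size)
print("SimpleRNN:", rnn_count)
print("Dense:", dense_count)
print("Total:", rnn_count + dense_count)
```

If the numbers reported by `model.summary()` differ, the layer configuration is probably not what was intended (for example, a different number of units or a wrong input size).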

We'll use sparse categorical cross entropy, since our labels are integers (not one-hot encoded).

In [34]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)
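
For intuition about what this loss computes, here is a NumPy sketch of sparse categorical cross entropy on raw logits: take a log-softmax over the vocabulary axis, then pick out the negative log-probability of the integer label. The function name `sparse_ce_from_logits` is an assumption for this example; the notebook uses the Keras function above.

```python
import numpy as np


def sparse_ce_from_logits(labels, logits):
    """Per-example cross entropy for integer labels and unnormalized logits."""
    # Numerically stable log-softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-probability assigned to the true label
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked[..., 0]


logits = np.array([[2.0, 0.5, -1.0],   # confident in class 0
                   [0.0, 0.0, 0.0]])   # uniform over 3 classes
labels = np.array([0, 2])
losses = sparse_ce_from_logits(labels, logits)
print(losses)
```

A uniform prediction over three classes costs ln(3), while a confident correct prediction costs much less. Note that, as written in the notebook, the loss is averaged over every position, including the zero padding.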

Let's make a function to generate text using our model.

In [35]:
def generate_text(model, max_preds):
    input_list = [char2idx['<']]  # start with the start character
    text_generated = []  # list of generated characters

    for i in range(max_preds - 1):  # generate at most max_preds - 1 characters
        # Feed the whole sequence so far; the model returns logits for every position
        predictions = model(tf.one_hot(tf.expand_dims(input_list, 0), depth=vocab_size))
        predictions = tf.squeeze(predictions, 0)  # remove the batch dimension
        # Sample a character index from the logits at the last time step
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # Append the sampled character so it becomes part of the next input.
        # No hidden state is carried between calls; the sequence is re-processed each time.
        input_list.append(predicted_id)
        # Skip the padding index (0, which maps to '\n')
        if predicted_id == 0:
            continue
        pred_char = idx2char[predicted_id]
        # Stop at the end marker
        if pred_char == '>':
            break
        text_generated.append(pred_char)

    return ''.join(text_generated)
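
The function above samples from the output distribution instead of always taking the most likely character. The NumPy sketch below shows the difference on a toy three-class example; `sample_from_logits` is a hypothetical stand-in for what `tf.random.categorical` does for a single row of logits.

```python
import numpy as np


def sample_from_logits(logits, rng):
    """Draw one class index with probability softmax(logits)."""
    shifted = logits - logits.max()  # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
logits = np.array([1.0, 2.0, 0.5])

samples = [sample_from_logits(logits, rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=len(logits))

print("greedy choice:", int(np.argmax(logits)))
print("sample counts:", counts)
```

Greedy decoding would return the same name every time for a given start character, whereas sampling lets less likely characters through occasionally, which is why repeated calls to `generate_text` produce different names.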

Let's train the network for 5 epochs (this will take a few minutes). We will generate a few examples each epoch to check progress.

In [36]:
epochs = 5
for epoch in range(epochs):
    print("EPOCH %d" % (epoch + 1))
    # The dataset is already batched, so batch_size is not passed to fit()
    model.fit(dataset, epochs=1)
    print("EXAMPLES:")
    for _ in range(5):
        generated_text = generate_text(model, max_preds=max_len)
        print(generated_text)
    print()

EPOCH 1
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 17s 4ms/step - loss: 0.7422
EXAMPLES:
iodisd motathydroxporodide
mickplocungurfsumate
stlatum(iii) ditrale
brom(di-umbinmothylrogen methaflane
cidelenu-fide

EPOCH 2
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 15s 4ms/step - loss: 0.4030
EXAMPLES:
canpiut arseliid retriuratite
sidinium tridiate
silver iu
nammerium tinan amanf
silver cyliodamerdifluoride

EPOCH 3
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 15s 4ms/step - loss: 0.3537
EXAMPLES:
sitvit arseiderhlophanane
monoflumate
iodide denthyltatidine
fopeniid hypronidine
obdrate hydophosihide

EPOCH 4
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 15s 4ms/step - loss: 0.3260
EXAMPLES:
nioron(i) orthopopsulfate
rubluvanidium(ii) sulfide
indium iodide
ioden tritrolite brosubid
licoplite

EPOCH 5
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 15s 4ms/step - loss: 0.3041
EXAMPLES:
trafluol

Try generating some more examples.

In [37]:
for i in range(10):
    generated_text = generate_text(model, max_preds=max_len)
    print(generated_text)

beryllium oxide zentane
1,4-benzorobide mendhoxane
titanglumoride
disulfltum(iii) oxide
bismathenic g
catcium nitrite
calcium tetraflufluoride
arseniceic acid
thomium(v) osterhon
neodonhya selesale


Let's try a more advanced network. LSTMs carry an additional state vector, the cell state, from one time step to the next alongside the hidden state, with learned gates controlling what flows into and out of it.
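
A single LSTM step can be sketched in NumPy to show how the gates act on the carried state. This follows the standard LSTM equations and is not the Keras implementation; the function name `lstm_step`, the gate ordering inside the weight matrices, and the random weights are assumptions for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step.

    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,).
    Returns the new hidden state h and the new cell state c.
    """
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)                   # input, forget, candidate, output
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates take values in (0, 1)
    g = np.tanh(g)                                # candidate values
    c = f * c_prev + i * g  # forget part of the old state, write part of the new
    h = o * np.tanh(c)      # expose a gated view of the cell state
    return h, c


rng = np.random.default_rng(0)
input_dim, units = 48, 64
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

Because the cell state is updated additively (scaled by the forget gate) instead of being fully rewritten at every step, information can persist across many time steps more easily than in the simple RNN.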

We'll also try stacking RNN layers.

**2. Create a model with two stacked [LSTM](https://keras.io/api/layers/recurrent_layers/lstm/) layers, each with 64 units, and a dense layer for output.**
  - Hint: Again, softmax is not needed for the output.
  - Hint: A recurrent layer that is passing activations to a subsequent recurrent layer always has to return sequences.

In [38]:
# YOUR CODE HERE (2)
model = tf.keras.models.Sequential( # creating sequential model
    [ # layers within sequential model
      # first layer
      tf.keras.layers.LSTM(64, return_sequences=True, input_shape=[None, vocab_size]), # LSTM
      # second layer
      tf.keras.layers.LSTM(64, return_sequences=True), # LSTM
      # dense layer
      tf.keras.layers.Dense(vocab_size) # dense output layer
    ]
)

In [39]:
model.summary()
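
As a sanity check on the summary, the parameter counts for the stacked model can be worked out by hand. The sketch assumes the standard LSTM layout of four gate blocks, each with an input kernel, a recurrent kernel, and a bias; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # four blocks (input, forget, candidate, output), each with
    # an input kernel, a recurrent kernel, and a bias
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    # kernel + bias
    return input_dim * units + units


vocab_size, lstm_units = 48, 64
first = lstm_params(vocab_size, lstm_units)   # sees one-hot characters
second = lstm_params(lstm_units, lstm_units)  # sees the first layer's outputs
dense = dense_params(lstm_units, vocab_size)
print("LSTM 1:", first)
print("LSTM 2:", second)
print("Dense:", dense)
print("Total:", first + second + dense)
```

Even with half as many units per layer, the two LSTM layers hold more weights than the single 128-unit SimpleRNN layer, because each LSTM layer has four sets of kernels.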

In [40]:
model.compile(optimizer='adam', loss=loss)

Train the new network for 5 epochs (this will take a few minutes).

In [41]:
epochs = 5
for epoch in range(epochs):
    print("EPOCH %d" % (epoch + 1))
    # The dataset is already batched, so batch_size is not passed to fit()
    model.fit(dataset, epochs=1)
    print("EXAMPLES:")
    for _ in range(5):
        generated_text = generate_text(model, max_preds=max_len)
        print(generated_text)
    print()

EPOCH 1
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 28s 7ms/step - loss: 0.9117
EXAMPLES:
42riybroc tanas
all(iodium arionide
matriug tel
ibumadium(iii) (iidi)(iim) bryridte
iparium(ii) hpuoride

EPOCH 2
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 26s 7ms/step - loss: 0.4411
EXAMPLES:
ansil(iii) dichloride
piros elinitt
cperium ise
flurom-tethalolen celicide
thalumanum(iii) alumic

EPOCH 3
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 26s 7ms/step - loss: 0.3303
EXAMPLES:
niuordium magnanate
silvar(ii) oxide
calcium acid
hydropholaten
leadium(iv) nitratetrovanale

EPOCH 4
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 26s 7ms/step - loss: 0.2931
EXAMPLES:
beracyl sulfate
nitrog el
zirconium(i) odimenionic acid
nilorostyl pommanate
lead vanadacal

EPOCH 5
3542/3542 ━━━━━━━━━━━━━━━━━━━━ 26s 7ms/step - loss: 0.2682
EXAMPLES:
aluminium monoxide
thyllius fluoride
mercure(ii) orth

Do the LSTM-generated compound names look any different from those of the simple RNN?

In [42]:
for i in range(10):
    generated_text = generate_text(model, max_preds=max_len)
    print(generated_text)

hydrofulminiamon
cerium(iv) oxide
copper(ii) bromide
phamorine
tungsten((ii) soline
lacolin
bismuth(ii) chloride
chromium tetrafluoride
mercuropropenta osycanidononic acid
tinmium(vi) acidrate
