# Deciphering Code with Character-Level RNN

In this notebook, we'll look at how to build a recurrent neural network and train it to decipher strings encrypted with a certain cipher.

This exercise will make you familiar with the techniques of preprocessing and model-building that will come in handy when you start building more advanced models for machine translation, text summarization, and beyond.

## Dataset
The dataset we have consists of 10,000 encrypted phrases and the plaintext version of each encrypted phrase.

Let's start by loading up the dataset to get more familiar with it.

In [1]:
import helper

codes = helper.load_data('cipher.txt')
plaintext = helper.load_data('plaintext.txt')

Now `codes` and `plaintext` are both arrays with each element being a phrase. The first five encoded phrases are:

In [2]:
codes[:5]

['YMJ QNRJ NX MJW QJFXY QNPJI KWZNY , GZY YMJ GFSFSF NX RD QJFXY QNPJI .',
 'MJ XFB F TQI DJQQTB YWZHP .',
 'NSINF NX WFNSD IZWNSL OZSJ , FSI NY NX XTRJYNRJX BFWR NS STAJRGJW .',
 'YMFY HFY BFX RD RTXY QTAJI FSNRFQ .',
 'MJ INXQNPJX LWFUJKWZNY , QNRJX , FSI QJRTSX .']

And their plaintext versions are:

In [3]:
plaintext[:5]

['THE LIME IS HER LEAST LIKED FRUIT , BUT THE BANANA IS MY LEAST LIKED .',
 'HE SAW A OLD YELLOW TRUCK .',
 'INDIA IS RAINY DURING JUNE , AND IT IS SOMETIMES WARM IN NOVEMBER .',
 'THAT CAT WAS MY MOST LOVED ANIMAL .',
 'HE DISLIKES GRAPEFRUIT , LIMES , AND LEMONS .']

## Model Overview: Character-Level RNN
The model we will use here is a character-level RNN since the cipher seems to work on the character level. In a machine translation scenario, a word-level RNN is the more common choice.

A character-level RNN will take as input an integer referring to a specific character and output another integer. To be able to get our model to work, we'll need to preprocess our dataset in the following steps:
 1. Isolating each character as an array element (instead of an entire phrase, or word being the element of the array)
 1. Tokenizing the characters so we can turn them from letters to integers and vice-versa
 1. Padding the strings so that all the inputs and outputs can fit in matrix form
 
To visualize this processing, let's assume either our source sequences (`codes` in this case) or target sequences (`plaintext` in this case) look like this (a list of strings):

<img src="list_1.png" />

Since this model will be working on the character level, we'll need to separate each string into a list of characters (implicitly done by the tokenizer in this notebook):

<img src="list_2.png" />

Then, the process of tokenization will turn each character into an integer. Note that when you're working on a word-level RNN (as in most machine translation examples), the tokenizer will assign an integer to each word rather than each letter, and each cell would represent a word rather than a character.

<img src="list_3.png" />

Most machine learning platforms expect the input to be a matrix rather than a list of lists. To turn the input into a matrix, we need to find the longest member of the list, and pad all shorter sequences with 0. Assuming 'and two' is the longest sequence in this example, the matrix ends up looking like this:

<img src="padded_list.png" />
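
The three steps above can be sketched in plain Python, without Keras. This is only an illustration under stated assumptions: the helper names (`build_vocab`, `encode`, `pad_post`) and the sample phrases are hypothetical, and ids are assigned by descending character frequency starting at 1, which resembles the tokenizer output shown later in this notebook but is not its actual implementation.

```python
from collections import Counter

def build_vocab(texts):
    """Map each character to an integer id; 0 stays reserved for padding."""
    counts = Counter(ch for text in texts for ch in text)
    # Most frequent character gets id 1, the next gets id 2, and so on.
    return {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common())}

def encode(text, vocab):
    """Step 1 and 2: split the string into characters and look up each id."""
    return [vocab[ch] for ch in text]

def pad_post(sequences, pad_id=0):
    """Step 3: append pad_id to shorter sequences so all rows share one length."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

# Hypothetical phrases; 'and two' is the longest, as in the figure above.
phrases = ['one', 'and two', 'three']
vocab = build_vocab(phrases)
matrix = pad_post([encode(p, vocab) for p in phrases])

for row in matrix:
    print(row)
```

Every row of `matrix` has the length of the longest phrase, and the zeros only ever appear at the end of a row.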
 
## Preprocessing (IMPLEMENTATION)
For a neural network to predict on text data, the text first has to be turned into data the network can understand. Text data like "dog" is a sequence of ASCII character encodings. Since a neural network is a series of multiplication and addition operations, the input data needs to be numeric.

We can turn each character into a number or each word into a number. These are called character ids and word ids, respectively. Character-level models use character ids and generate a prediction for each character. Word-level models use word ids and generate a prediction for each word. Word-level models tend to learn better on natural-language tasks, but because this cipher operates on individual characters, a character-level model fits here.

Turn each sentence into a sequence of character ids using Keras's [`Tokenizer`](https://keras.io/preprocessing/text/#tokenizer) class. Since we're working on the character level, make sure to set the `char_level` flag to the appropriate value. Then, fit the tokenizer on `x`.

In [9]:
from keras.preprocessing.text import Tokenizer

def tokenize(x):
    """function to tokenise x
    
    Parameters:
    - x: list of sentences/strings to be tokenised
    
    Returns:
    - Tuple of (tokenised x data, tokeniser used to tokenise x)
    """
    
    x_tk = Tokenizer(char_level=True)
    x_tk.fit_on_texts(x)
    
    return x_tk.texts_to_sequences(x), x_tk

In [10]:
# Tokenise example output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .'
]

text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index, '\n')

for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print(f'Sequence {sample_i + 1} in x')
    print(f'\tInput: {sent}')
    print(f'\tOutput: {token_sent}')

{' ': 1, 'e': 2, 'o': 3, 'i': 4, 's': 5, 'h': 6, 'r': 7, 'y': 8, 'u': 9, 'c': 10, 'n': 11, 't': 12, 'a': 13, 'p': 14, '.': 15, 'T': 16, 'q': 17, 'k': 18, 'w': 19, 'f': 20, 'x': 21, 'm': 22, 'v': 23, 'l': 24, 'z': 25, 'd': 26, 'g': 27, 'b': 28, 'j': 29, 'B': 30, 'J': 31, ',': 32} 

Sequence 1 in x
	Input: The quick brown fox jumps over the lazy dog .
	Output: [16, 6, 2, 1, 17, 9, 4, 10, 18, 1, 28, 7, 3, 19, 11, 1, 20, 3, 21, 1, 29, 9, 22, 14, 5, 1, 3, 23, 2, 7, 1, 12, 6, 2, 1, 24, 13, 25, 8, 1, 26, 3, 27, 1, 15]
Sequence 2 in x
	Input: By Jove , my quick study of lexicography won a prize .
	Output: [30, 8, 1, 31, 3, 23, 2, 1, 32, 1, 22, 8, 1, 17, 9, 4, 10, 18, 1, 5, 12, 9, 26, 8, 1, 3, 20, 1, 24, 2, 21, 4, 10, 3, 27, 7, 13, 14, 6, 8, 1, 19, 3, 11, 1, 13, 1, 14, 7, 4, 25, 2, 1, 15]
Sequence 3 in x
	Input: This is a short sentence .
	Output: [16, 6, 4, 5, 1, 4, 5, 1, 13, 1, 5, 6, 3, 7, 12, 1, 5, 2, 11, 12, 2, 11, 10, 2, 1, 15]
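
Going the other way, from integers back to characters, only needs the inverse of `word_index`. The sketch below uses a hand-copied subset of the mapping printed above so that it runs without Keras; `decode` is a hypothetical helper, not part of the notebook.

```python
# Subset of the word_index printed above, copied by hand for illustration.
word_index = {' ': 1, 'e': 2, 'i': 4, 's': 5, 'h': 6, 'T': 16}

# Invert the mapping: id -> character.
index_to_char = {idx: ch for ch, idx in word_index.items()}

def decode(ids):
    """Turn a list of character ids back into a string."""
    return ''.join(index_to_char[i] for i in ids)

print(decode([16, 6, 2]))              # first three ids of sequence 1 -> The
print(decode([16, 6, 4, 5, 1, 4, 5]))  # first seven ids of sequence 3 -> This is
```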


### Padding (IMPLEMENTATION)
When batching the sequences of character ids together, each sequence needs to be the same length. Since sentences vary in length, we can add padding to the end of the sequences to make them the same length.

Make sure all the cipher sequences have the same length and all the plaintext sequences have the same length by adding padding to the **end** of each sequence using Keras's [`pad_sequences`](https://keras.io/preprocessing/sequence/#pad_sequences) function.

In [16]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def pad(x, length=None):
    """function to pad list of sequences.
    
    Parameters:
    - x: List of sequences.
    - length: length to pad the sequence to. If None, use length of the longest sequence in x.
    
    Returns:
    - Padded numpy array of sequences.
    """
    
    # find the length of the longest string in the dataset
    if length is None: 
        length = max([len(sequence) for sequence in x])
    
    # pass it to pad_sequences as the maxlen parameter
    return pad_sequences(x, maxlen=length, padding='post')

In [17]:
# pad Tokenised output
test_pad = pad(text_tokenized)

for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print(f'Sequence {sample_i + 1} in x')
    print(f'\tInput: {np.array(token_sent)}')
    print(f'\tOutput: {pad_sent}')

Sequence 1 in x
	Input: [16  6  2  1 17  9  4 10 18  1 28  7  3 19 11  1 20  3 21  1 29  9 22 14  5
  1  3 23  2  7  1 12  6  2  1 24 13 25  8  1 26  3 27  1 15]
	Output: [16  6  2  1 17  9  4 10 18  1 28  7  3 19 11  1 20  3 21  1 29  9 22 14  5
  1  3 23  2  7  1 12  6  2  1 24 13 25  8  1 26  3 27  1 15  0  0  0  0  0
  0  0  0  0]
Sequence 2 in x
	Input: [30  8  1 31  3 23  2  1 32  1 22  8  1 17  9  4 10 18  1  5 12  9 26  8  1
  3 20  1 24  2 21  4 10  3 27  7 13 14  6  8  1 19  3 11  1 13  1 14  7  4
 25  2  1 15]
	Output: [30  8  1 31  3 23  2  1 32  1 22  8  1 17  9  4 10 18  1  5 12  9 26  8  1
  3 20  1 24  2 21  4 10  3 27  7 13 14  6  8  1 19  3 11  1 13  1 14  7  4
 25  2  1 15]
Sequence 3 in x
	Input: [16  6  4  5  1  4  5  1 13  1  5  6  3  7 12  1  5  2 11 12  2 11 10  2  1
 15]
	Output: [16  6  4  5  1  4  5  1 13  1  5  6  3  7 12  1  5  2 11 12  2 11 10  2  1
 15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0]


### Preprocess Pipeline
Your focus for this project is to build a neural network architecture, so we won't ask you to create a preprocessing pipeline. Instead, we've provided you with the implementation of the `preprocess` function.

In [18]:
def preprocess(x, y):
    """function to preprocess x and y.
    
    Parameters:
    - x: Feature list of sentences.
    - y: label list of sentences.
    
    Returns:
    - Tuple of (preprocessed x, preprocessed y, x tokeniser, y tokeniser)
    """
    
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)
    
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)
    
    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
    
    return preprocess_x, preprocess_y, x_tk, y_tk

In [19]:
preproc_code_sentences, preproc_plaintext_sentences, code_tokenizer, plaintext_tokenizer = \
    preprocess(codes, plaintext)

In [20]:
print('Data Preprocessed')

Data Preprocessed


In [21]:
preproc_code_sentences[0]

array([ 5, 14,  3,  1, 10,  2, 13,  3,  1,  2,  4,  1, 14,  3,  6,  1, 10,
        3,  8,  4,  5,  1, 10,  2, 25,  3, 11,  1, 20,  6,  9,  2,  5,  1,
       18,  1, 17,  9,  5,  1,  5, 14,  3,  1, 17,  8,  7,  8,  7,  8,  1,
        2,  4,  1, 13, 15,  1, 10,  3,  8,  4,  5,  1, 10,  2, 25,  3, 11,
        1, 19,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0], dtype=int32)

In [22]:
from keras.layers import GRU, Input, Dense, TimeDistributed
from keras.models import Model
from keras.layers import Activation
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

In [39]:
def simple_model(input_shape, output_sequence_length, code_vocab_size, plaintext_vocab_size):
    """function to build a basic RNN on x and y (training happens separately).
    
    Parameters:
    - input_shape: Tuple of input shape.
    - output_sequence_length: length of output sequence (unused in this simple model).
    - code_vocab_size: number of unique code characters in the dataset (unused in this simple model).
    - plaintext_vocab_size: number of unique plaintext characters in the dataset.
    
    Returns:
    - Keras model built, but not trained.
    """
    
    learning_rate = 1e-3
    
    input_seq = Input(input_shape[1:])
    rnn = GRU(64, return_sequences=True)(input_seq)
    logits = TimeDistributed(Dense(plaintext_vocab_size))(rnn)
    
    model = Model(input_seq, Activation('softmax')(logits))
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    return model
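
At each time step the model above produces one score per plaintext character. The softmax turns those scores into probabilities, and sparse categorical cross-entropy takes the negative log of the probability assigned to the true character id, which is why the labels can stay as integer ids instead of one-hot vectors. The following is a pure-Python sketch of that arithmetic for a single time step, with made-up scores. It is not Keras's implementation, which works on whole batches and adds numerical safeguards.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(probs, true_id):
    """Loss for one time step: -log of the probability of the true id."""
    return -math.log(probs[true_id])

# Made-up scores for a vocabulary of 4 ids at one time step.
scores = [0.5, 2.0, -1.0, 0.1]
probs = softmax(scores)
loss = sparse_cross_entropy(probs, true_id=1)

print(probs)
print(loss)
```

The loss shrinks toward 0 as the probability of the true id approaches 1, and for a uniform prediction over `n` ids it equals `log(n)`.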

In [32]:
# reshaping the input to work with a basic RNN
tmp_x = pad(preproc_code_sentences, preproc_plaintext_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_plaintext_sentences.shape[-2], 1))

In [40]:
# build the neural network (training happens in the next cell)
simple_rnn_model = simple_model(
    tmp_x.shape,
    preproc_plaintext_sentences.shape[1],
    len(code_tokenizer.word_index) + 1,
    len(plaintext_tokenizer.word_index) + 1
)

In [45]:
simple_rnn_model.fit(tmp_x, preproc_plaintext_sentences, batch_size=32, epochs=10, validation_split=0.2)

Train on 8000 samples, validate on 2001 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f47962bb908>

In [42]:
def logits_to_text(logits, tokenizer):
    """function to turn logits from a neural network into text using the tokenizer
    
    Parameters:
    - logits: per-timestep outputs from a neural network (here, softmax probabilities), shape (timesteps, vocab size).
    - tokenizer: Keras tokenizer fit on the labels.
    
    Returns:
    - String that represents the text of the logits.
    """
    
    index_to_words = {idx: word for (word, idx) in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, axis=1)])

In [77]:
print('`logits_to_text` function loaded.')
print(logits_to_text(simple_rnn_model.predict(tmp_x[4:5])[0], plaintext_tokenizer))

`logits_to_text` function loaded.
H E   D I S L I K E S   G R A P E F R U I T   ,   L I M E S   ,   A N D   L E M O N S   . <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


In [78]:
plaintext[4]

'HE DISLIKES GRAPEFRUIT , LIMES , AND LEMONS .'
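
The prediction above matches the target once the separator spaces and the `<PAD>` tokens are ignored. To compare the two strings directly, a variant of the decoding step can drop padding and join the characters without a separator. The sketch below works on plain lists of per-step probabilities so that it runs without a trained model; `probs_to_string` and the tiny vocabulary are hypothetical.

```python
def probs_to_string(prob_rows, index_to_char, pad_id=0):
    """Pick the most likely id at each step, skip padding, join the characters."""
    chars = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over the row
        if best != pad_id:  # note: drops padding wherever it occurs
            chars.append(index_to_char[best])
    return ''.join(chars)

# Hypothetical vocabulary and made-up probabilities for three time steps.
index_to_char = {1: ' ', 2: 'H', 3: 'E'}
prob_rows = [
    [0.1, 0.1, 0.7, 0.1],  # most likely id 2 -> 'H'
    [0.0, 0.1, 0.1, 0.8],  # most likely id 3 -> 'E'
    [0.9, 0.1, 0.0, 0.0],  # most likely id 0 -> padding, skipped
]

print(probs_to_string(prob_rows, index_to_char))  # HE
```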

And there it is. The RNN was able to learn this basic character-level cipher (which was a simple [Caesar cipher](https://en.wikipedia.org/wiki/Caesar_cipher)). If you want a bigger cryptography challenge, check out [Learning the Enigma with Recurrent Neural Networks](https://greydanus.github.io/2017/01/07/enigma-rnn/).
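
Comparing the sample pairs at the top of the notebook suggests the shift is 5 letters (`T` becomes `Y`, `H` becomes `M`, `E` becomes `J`). As a quick check, here is a minimal sketch of a Caesar shift applied by hand; `caesar_shift` is a hypothetical helper, and it only shifts uppercase A-Z, leaving spaces and punctuation untouched.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar_shift(text, shift):
    """Shift each uppercase letter by `shift` positions, wrapping around at Z."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return ''.join(out)

cipher = 'MJ XFB F TQI DJQQTB YWZHP .'  # second phrase in `codes`
print(caesar_shift(cipher, -5))  # HE SAW A OLD YELLOW TRUCK .
```

A fixed lookup like this solves the cipher exactly, which is what makes it a good sanity check for the RNN: the network has to discover the same mapping from examples alone.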