# Homework 09 - RNNs

In this homework you will learn how to generate text with an RNN. 

Defining an RNN in TensorFlow means following the conventions of the Keras RNN API. Therefore I will provide you with the correct model definition. Your task will be to understand how the model processes sequential data, what kind of data it returns, and how to train it.

You will train a stacked RNN to generate text passages similar to those in the Bible.
For that, make sure that the file 'bible.txt' (from Stud.IP) is in the same folder as this notebook.

In [None]:
import numpy as np
import tensorflow as tf
import random

### Load text.

In [None]:
# Load the text (the context manager closes the file again after reading).
with open("bible.txt", "r") as f:
    txt = f.read()
print("Text length: {}".format(len(txt)))
print('-------------------------------------')

# Inspect first lines:
print(txt[:500])

In [None]:
# Get the vocabulary of the text.
# Sorting makes the character-to-index mapping reproducible across runs.
vocab = sorted(set(txt))
print("Vocabulary: {}".format(vocab))
print('--------------------------')
vocab_size = len(vocab)
print("Vocabulary size: {}".format(vocab_size))

In [None]:
# Create dictionaries to switch between the indices of the characters and the characters themselves.
char2idx = {ch:i for i,ch in enumerate(vocab)}
idx2char = {i:ch for i,ch in enumerate(vocab)}

# Translate the text to indices.
txt_idx = [char2idx[ch] for ch in txt]
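
As a quick sanity check of the two dictionaries, the following minimal sketch builds them for a toy string and verifies that encoding followed by decoding returns the original text. The names prefixed with `toy_` are hypothetical stand-ins for `txt`, `vocab`, `char2idx` and `idx2char`, and sorting the vocabulary is an assumption of this sketch.

```python
# Toy stand-in for the real text; any string works.
toy_txt = "in the beginning"

# Sorted so that the index assignment is the same on every run.
toy_vocab = sorted(set(toy_txt))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = {i: ch for i, ch in enumerate(toy_vocab)}

# Encode the text to indices, then decode the indices back to characters.
encoded = [toy_char2idx[ch] for ch in toy_txt]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(encoded)
print(decoded)
```

If the round trip does not reproduce the text, the two dictionaries are not inverses of each other.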

### Prepare TensorFlow dataset.

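We will train our RNNs with truncated backpropagation through time, TBPTT(k1=20, k2=20): the text is processed in consecutive chunks of 20 timesteps, and gradients are backpropagated through the 20 timesteps of each chunk only.
In the following, we will turn the text into a suitable dataset.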

In [None]:
# First create a tensorflow dataset out of the text (in indices). (tf.data.Dataset.from_tensor_slices)
### YOUR CODE HERE ###

######################

In [None]:
# We will train on subsequences of length k = 20 and compute the loss for each timestep.
# Let's think about what a single training datapoint will look like.
# Example (20 characters each):
# Input sequence:  "Moses:  Called Genes"
# Target sequence: "oses:  Called Genesi"
# The target is the input shifted by one character, so both come from a chunk of k+1 = 21 characters.
# To create these pairs of sequences, we chunk the dataset into subsequences of length k+1.
# You can use .batch() for this.
# Make sure that all subsequences in the resulting dataset have length k+1 (see the
# parameter 'drop_remainder' of .batch()).

### YOUR CODE HERE ###

######################

In [None]:
# Now we have to map each sequence of length 21
# to an (input, target) pair.
# Given the following function, you can use the dataset method .map() here.
def input_target_split(seq):
    return seq[:-1], seq[1:]

### YOUR CODE HERE ###

######################
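
To make the chunk-and-split idea concrete, here is a plain-Python sketch on a toy sequence. It does not use `tf.data`; the helper `chunk` is a hypothetical stand-in that mimics grouping consecutive elements the way `.batch(..., drop_remainder=True)` does, and `k = 4` is a toy value chosen for readability.

```python
def chunk(seq, size, drop_remainder=True):
    """Group consecutive elements into chunks of `size` (hypothetical helper)."""
    chunks = [seq[i:i + size] for i in range(0, len(seq), size)]
    # Optionally discard a final chunk that is shorter than `size`.
    if drop_remainder and chunks and len(chunks[-1]) < size:
        chunks = chunks[:-1]
    return chunks

def input_target_split(seq):
    # Same logic as the function provided above.
    return seq[:-1], seq[1:]

k = 4                        # toy value; the homework uses k = 20
toy = list("abcdefghijkl")   # 12 elements

chunks = chunk(toy, k + 1)   # chunks of length k+1 = 5
pairs = [input_target_split(c) for c in chunks]

for inp, tgt in pairs:
    print("".join(inp), "->", "".join(tgt))
```

With these toy values, the two leftover elements `k` and `l` are discarded, and the remaining pairs are `abcd -> bcde` and `fghi -> ghij`. Every input and target has length k, and each target is its input shifted by one position.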

In [None]:
# Now as usual we shuffle our dataset and chunk it into batches of 64.
### YOUR CODE HERE ###

######################

In [None]:
# Provided definitions of Vanilla RNN cell and RNN model.

class VanillaRNNCell(tf.keras.layers.Layer):

    def __init__(self, input_dim, units):
        super(VanillaRNNCell, self).__init__()
        self.input_dim = input_dim
        self.units = units
        # The Keras RNN wrapper reads `state_size` to create the initial hidden state.
        self.state_size = units
    
    def build(self, input_shape):
        self.w_in = self.add_weight(
                            shape=(self.input_dim, self.units),
                            initializer='uniform'
                            )
        self.w_h = self.add_weight(
                            shape=(self.units, self.units),
                            initializer='uniform'
                            )
        self.b_h = self.add_weight(
                            shape=(self.units,),
                            initializer='zeros'
                            )       
            
    def call(self, inputs, hidden_states):
        h_prev = hidden_states[0]
        h_new = tf.nn.sigmoid(tf.matmul(inputs, self.w_in) + tf.matmul(h_prev, self.w_h) + self.b_h)
        return h_new, [h_new]

state_size_1 = 128
state_size_2 = 256

class RNN(tf.keras.layers.Layer):
    
    def __init__(self):
        super(RNN, self).__init__()
        self.cell_1 = VanillaRNNCell(input_dim=vocab_size, units=state_size_1)
        self.cell_2 = VanillaRNNCell(input_dim=state_size_1, units=state_size_2)
        self.cells = [self.cell_1, self.cell_2]
        self.rnn = tf.keras.layers.RNN(self.cells, return_sequences=True)
        self.output_layer = tf.keras.layers.Dense(units=vocab_size, activation=tf.nn.softmax)
        
    def call(self,x):
        seqs = self.rnn(x)
        output = self.output_layer(seqs)
        return output
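
For reference, the computation performed by the provided model at timestep $t$ can be written as below. The superscripts $(1)$ and $(2)$ for the two cells and the names $W_{out}$, $b_{out}$ for the parameters of the dense layer are notation introduced here, not identifiers from the code. $x_t$ is the one-hot input as a row vector, $\sigma$ is the logistic sigmoid, and the hidden states start at zero when no initial state is passed.

$$
\begin{aligned}
h_t^{(1)} &= \sigma\left(x_t W_{in}^{(1)} + h_{t-1}^{(1)} W_h^{(1)} + b_h^{(1)}\right) \\
h_t^{(2)} &= \sigma\left(h_t^{(1)} W_{in}^{(2)} + h_{t-1}^{(2)} W_h^{(2)} + b_h^{(2)}\right) \\
y_t &= \mathrm{softmax}\left(h_t^{(2)} W_{out} + b_{out}\right)
\end{aligned}
$$

Because `return_sequences=True`, the model returns $y_t$ for every timestep, so a batch of one-hot inputs of shape (batch, 20, vocab_size) yields probabilities of shape (batch, 20, vocab_size).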

In [None]:
# IT MIGHT MAKE SENSE TO FIRST CODE THE ACTUAL TRAINING CELL (NEXT ONE) AND THEN COME BACK HERE.

# Define the function for generating novel text samples.
# The function should take the sample below (of length k) and generate a text sequence of length k+n from it.

k = 20  # subsequence length, as used for training
sample = 'And God and Jesus go'
assert len(sample) == k

def generate_sample(sample, n):
    
    ### YOUR CODE HERE ###
    # Translate sample string into list of character indices.

    # Transform list into one-hot tensor of shape (1,20,vocab_size)
    
    ######################
    
    # Sample n new characters.
    for _ in range(n):
        
        ### YOUR CODE HERE ###
        # Feed sample sequence into RNN and get probabilities of next character.

        # Sample index for new character (use tf.random.categorical(); it expects logits,
        # i.e. unnormalized log-probabilities, so pass the log of the probabilities).

        # Translate to actual character and add it to sample string.

        # Create new sequence of 20 indices by deleting the first character of the old sequence
        # and adding the new character.

        ######################
        
    return sample
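
The two core steps of the loop, sampling an index and sliding the window, can be illustrated without TensorFlow. The sketch below is plain Python under stated assumptions: `probs` is a made-up stand-in for the model's softmax output at the last timestep, the window has length 4 instead of 20, and `sample_index` and `slide` are hypothetical helper names.

```python
import math
import random

def sample_index(probs, rng):
    """Draw one index from a categorical distribution given as probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def slide(window, new_idx):
    """Drop the oldest index and append the newest, keeping the length fixed."""
    return window[1:] + [new_idx]

rng = random.Random(0)
window = [0, 1, 2, 3]          # toy window of 4 indices (the homework uses 20)
probs = [0.1, 0.2, 0.3, 0.4]   # made-up next-character probabilities

new_idx = sample_index(probs, rng)
window = slide(window, new_idx)
print(window)

# Log-probabilities are valid logits: applying softmax to them recovers probs.
logits = [math.log(p) for p in probs]
recovered = [math.exp(l) / sum(math.exp(m) for m in logits) for l in logits]
```

Sampling from the distribution, rather than always taking the most likely character, is what keeps the generated text from collapsing into a repeating loop.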

In [None]:
tf.keras.backend.clear_session()

### YOUR CODE HERE ### 
# Initialize the RNN, use cross entropy as the loss function and Adam with learning rate 0.01 as the optimizer.

# Train for one epoch. Your loss should be around 1.4.
# Remember to encode the inputs and target values as one hots.

######################    
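
To get a feeling for the loss values, the following sketch computes the categorical cross entropy by hand for a single prediction. It assumes the natural logarithm, and the vocabulary size of 4 and the probabilities are made-up toy values. `one_hot` and `cross_entropy` are hypothetical helpers written for this illustration.

```python
import math

def one_hot(index, depth):
    """Return a one-hot list of length `depth` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

def cross_entropy(one_hot_target, probs):
    """Categorical cross entropy for a single prediction (natural log)."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

toy_vocab_size = 4                      # made-up value for illustration
target = one_hot(2, toy_vocab_size)     # the correct character has index 2
probs = [0.1, 0.2, 0.6, 0.1]            # made-up model output

loss = cross_entropy(target, probs)                        # equals -ln(0.6)
uniform_loss = cross_entropy(target, [1 / toy_vocab_size] * toy_vocab_size)  # equals ln(4)

# Probability of the correct character that corresponds to a loss of 1.4.
prob_at_loss_1_4 = math.exp(-1.4)

print(loss, uniform_loss, prob_at_loss_1_4)
```

Read this way, a mean loss of about 1.4 means the model assigns the correct next character a probability of roughly 0.25 (as a geometric mean over timesteps), while a model that guesses uniformly has a loss equal to the logarithm of the vocabulary size.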

In [None]:
# Feel free to generate some funny samples.
print(generate_sample(sample,1000))