Assignment 10: Learn to Write Like Shakespeare
==============================================


Microsoft Forms Document: https://forms.office.com/r/xs1Xb1pe3g

In this assignment we will implement a simple recurrent network with one hidden layer.
We train this network on a medium-sized poem, "The Sonnet" by William Shakespeare, and use it to auto-complete sentences and phrases.

The data that we will use is originally provided here: http://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt

In [None]:
import os
import random
import torch

# download the data file
filename = "shakespeare.txt"
if not os.path.exists(filename):
  url = "http://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/"
  import urllib.request
  urllib.request.urlretrieve(url+filename, filename)
  print ("Downloaded datafile", filename)

# select to run everything on CUDA
device = torch.device("cuda")

We need to parse the data and turn it into a representation from which we can learn.
First, we need to count the number of unique characters to obtain the dimension $D$ of our input and output.
Then, we need to obtain one-hot encoding vectors for each of the characters.
Finally, we need to build sequences and their corresponding targets, using zero-padding where required.

Task 1: Data Characteristics
----------------------------

Load all text data from the file `shakespeare.txt`.
Count the number of unique characters contained in the poem. 
Here, we consider only lower-case characters to reduce the alphabet size.
At the same time, we also store the complete poem in a data variable.

Please make sure that you handle the newline character at the end of each line correctly and consistently.
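
As a minimal sketch of the lower-casing and counting step, the following uses a short toy string instead of the actual file. The names `toy_text`, `toy_data` and `toy_characters` are hypothetical and only serve this illustration; keeping the newline as a regular character is one possible way of handling it consistently.

```python
# toy stand-in for the file content; the real data comes from shakespeare.txt
toy_text = "Shall I\ncompare thee\n"

# lower-case everything so that 'S' and 's' count as the same character
toy_data = toy_text.lower()

# sorted list of unique characters; the newline is kept as a regular character
toy_characters = sorted(set(toy_data))

print(len(toy_data), len(toy_characters))
```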


In [None]:
# load all data from the text file
data = ...

# extract a list of all unique characters
characters = ...

D = len(characters)
print (f"Collected a total of {len(data)} elements of {D} unique characters")

Task 2: One-hot Encoding
------------------------

Each of the characters needs to be represented by a one-hot encoding.
Create a dictionary that provides the encoding for each unique character.
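
One way to picture such a dictionary is the sketch below. It assumes a tiny hypothetical alphabet (`toy_characters`) and takes the rows of an identity matrix as one-hot vectors; other constructions are equally valid.

```python
import torch

# hypothetical toy alphabet; the assignment uses the characters found in the poem
toy_characters = ["\n", " ", "a", "b"]
toy_D = len(toy_characters)

# row i of the identity matrix is the one-hot vector for character i
identity = torch.eye(toy_D)
toy_one_hot = {c: identity[i] for i, c in enumerate(toy_characters)}

print(toy_one_hot["a"])
```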

In [None]:
one_hot = dict()
for c in characters:
  one_hot[c] = ...

Task 3: Sequence Coding
-----------------------

Write a function that provides the inputs and targets for a given sequence of the specified sequence length.
The last value of the target sequence should be the character of the given index.
If a character would be requested from outside of the data range, pad the inputs (and the targets) with zero vectors at the front.
Ensure that $\vec t^{\{s\}} = \vec x^{\{s+1\}}$ $\forall s<S$.
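
The index arithmetic can be illustrated on plain characters before any encoding is involved. The sketch below is only one possible reading of the convention: it assumes that the targets end at `index` and that the inputs are shifted one position to the left. The helper `toy_window` and the string `toy_data` are hypothetical, and `None` marks positions that would become zero vectors.

```python
toy_data = "abcdef"

def toy_window(index, S):
    # targets end at `index`; inputs are shifted one position to the left
    target_positions = range(index - S + 1, index + 1)
    input_positions = range(index - S, index)

    def pick(p):
        # None marks a position outside the data, which would become a zero vector
        return toy_data[p] if 0 <= p < len(toy_data) else None

    return [pick(p) for p in input_positions], [pick(p) for p in target_positions]

inputs, targets = toy_window(2, 5)
print(inputs)
print(targets)
```

In this reading, each target is the input of the next step, which is the shift property requested above.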

In [None]:
def sequence(index, S):
  # collect both input and target encodings
  inputs, targets = [], []
  # go through the sequence and turn characters into encodings
  ...
  return torch.stack(inputs), torch.stack(targets)

Test 1: Sequences
-----------------

Get a sequence of size 5 for index 2. Ensure that the input and target vectors are as desired, i.e., the first elements are zero vectors, followed by one-hot encoded data.

In [None]:
# get sequence
x,t = sequence(2,5)

# perform checks
...

We use the standard data loader with a batch size of $B=256$. Theoretically, each training sample could have its own sequence length $S$. To enable batch processing, the sequence size must be the same for each element in the batch (otherwise the samples cannot be stacked into one large tensor). Thus, our dataset needs to have a fixed sequence size $S$.

Task 4: Dataset and Data Loader
-------------------------------
Implement a dataset that takes the `data` (which determines the size $N$ of the dataset) and $S$ (size of the sequence) as parameters.
In the `__getitem__` function, return the `sequence` (using the function of Task 3) for the sample with the given index, i.e., both the input and the target sequence.


In [None]:
class Dataset(torch.utils.data.Dataset):
  def __init__(self, data, S):
    self.S = S
    ...

  def __getitem__(self, index):
    return ...

  def __len__(self):
    return ...

dataset = ...
data_loader = ...

Test 2: Data Sizes
------------------

Check that all samples in the dataset have the desired size and behavior.

In [None]:
for x,t in data_loader:
  # check that the data and targets are as expected
  ...

Task 5: Elman Network Implementation
------------------------------------

Manually implement an Elman network using one fully-connected layer each for the hidden, recurrent and output units.

Implement the processing of the input in the Elman network. Make sure that logit values are computed and returned for each element in the sequence. Try to use as much tensor processing as possible. Remember the shape of $X$ is $B\times S\times D$, and when going through the sequence, we need to process $\vec x^{\{s\}}$ separately, while working on all batch elements simultaneously.
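
For orientation, the per-step computation can be written as follows. This is a sketch of the standard Elman formulation, not necessarily the exact form intended here. It assumes that `W1`, `Wr` and `W2` correspond to $\mathbf W^{(1)}$, $\mathbf W^{(r)}$ and $\mathbf W^{(2)}$, that $g$ is the activation function, and that $\vec h^{\{0\}}$ is the zero vector; bias terms are omitted.

```latex
\begin{align}
  \vec a^{\{s\}} &= \mathbf W^{(1)} \vec x^{\{s\}} + \mathbf W^{(r)} \vec h^{\{s-1\}} \\
  \vec h^{\{s\}} &= g\bigl(\vec a^{\{s\}}\bigr) \\
  \vec z^{\{s\}} &= \mathbf W^{(2)} \vec h^{\{s\}}
\end{align}
```

Each of these lines is evaluated for all $B$ batch elements at once, while the loop runs over $s = 1, \ldots, S$.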

In [None]:
class ElmanNetwork(torch.nn.Module):
  def __init__(self, D, K):
    super(ElmanNetwork,self).__init__()
    self.W1 = ...
    self.Wr = ...
    self.W2 = ...
    self.activation = ...

  def forward(self, x):
    # get the shape of the data
    B, S, D = x.shape
    # initialize the hidden vector in the desired size with 0
    # remember to put it on the device
    h_s = ...
    # store all logits (we will need them in the loss function)
    Z = torch.empty(x.shape, device=device)
    # iterate over the sequence
    for s in range(S):
      # use current sequence item
      x_s = ...
      # compute recurrent activation
      a_s = ...
      # apply activation function
      h_s = ...
      # compute logit values
      z = ...
      # store logit value
      Z[:,s] = z
      
    # return logits for all sequence elements
    return Z

Test 3: Network Output
----------------------

Instantiate an Elman network with arbitrary numbers for $D$ and $K$.
Generate test samples in the shape $B\times S\times D$, forward them through the network and ensure that the results have the required dimensionality.

In [None]:
# instantiate test network
test_network = ...

# create test input in size BxSxD
test_input = ...
# get the network output
test_output = ...
# check that the network output size is as intended
...

To train the Elman network, we will use categorical cross-entropy loss, averaged over all samples in the sequence.
For each batch, we will use a different sequence size -- while the size inside a batch must stay the same.

According to the PyTorch documentation, the `CrossEntropyLoss` handles logits and targets in shape $B\times O\times\ldots$.
In our case, logits and targets are in dimension $B\times S\times O$.
Hence, we need to make sure that we re-order the indexes such that we fulfil the requirement; you might want to use the `permute` operator.

WARNING: `CrossEntropyLoss` will not complain when the order of the indexes is wrong; the results will simply be wrong.
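
The effect of the index order can be checked on random toy data. The sketch below assumes one-hot targets in the same $B\times S\times O$ layout as the logits; the sizes and variable names are hypothetical. Both calls run without an error, but only the permuted one treats $O$ as the class dimension.

```python
import torch

torch.manual_seed(0)
B, S, O = 4, 6, 3  # hypothetical toy sizes

# logits as a network would return them: B x S x O
logits = torch.randn(B, S, O)
# one-hot targets in the same B x S x O layout
labels = torch.randint(O, (B, S))
targets = torch.nn.functional.one_hot(labels, O).float()

loss = torch.nn.CrossEntropyLoss()

# move the class dimension to index 1: B x S x O -> B x O x S
J_correct = loss(logits.permute(0, 2, 1), targets.permute(0, 2, 1))
# without permute, S is silently treated as the class dimension
J_wrong = loss(logits, targets)

print(J_correct.item(), J_wrong.item())
```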


Task 6: Training Loop
---------------------
Instantiate the optimizer with an appropriate learning rate $\eta$ and the loss function.
Implement the training loop for 10 epochs -- more epochs will further improve the results.
Compute the average training loss per epoch.
Optionally, at the end of each batch, overwrite `dataset.S` with a value randomly sampled from $S\in[5,20]$.

Note that training for 10 epochs takes about 2 minutes on the GPU if implemented in an optimized way. Non-optimized training will take considerably longer.


In [None]:
network = ...
optimizer = ...
loss = ...

for epoch in range(10):
  # create random sequence
  train_loss = 0.

  for x, t in data_loader:
    # compute network output
    z = ...
    # compute loss, arrange order of logits and targets
    J = ...
    # compute gradient for this batch

    # compute average loss
    train_loss += ...
    # select a new sequence length S in [5,20]
    dataset.S = ...

  # print average training loss for this epoch
  print(f"\rEpoch {epoch+1}; train loss: {train_loss/len(data_loader):1.5f}")

Task 7: Text Encoding
---------------------
For a given text (a sequence of $S$ characters), provide the encoding $\mathcal X \in \mathbb R^{B\times S\times D}$.
Ensure that the batch dimension (with $B=1$) is added to the encoding, so that the network is able to handle it.
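
Adding the batch dimension can be sketched as below, assuming a hypothetical three-letter alphabet and dictionary `toy_one_hot`. `torch.stack` produces an $S\times D$ tensor, and `unsqueeze(0)` prepends the batch dimension.

```python
import torch

# hypothetical one-hot dictionary over a three-letter alphabet
identity = torch.eye(3)
toy_one_hot = {c: identity[i] for i, c in enumerate("abc")}

# stack the per-character vectors: S x D
stacked = torch.stack([toy_one_hot[c] for c in "cab"])

# add the batch dimension in front: 1 x S x D
batch = stacked.unsqueeze(0)

print(batch.shape)
```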

In [None]:
def encode(text):
  encoding = ...
  return encoding

Task 8: Next Element Prediction
-------------------------------

Implement a function that returns the next character from the logits returned by the network.
Note that the logits are in dimension $\mathcal Y \in \mathbb R^{B\times S\times D}$ with $B=1$, and we are generally only interested in the prediction for the last sequence item.

Select the character with the highest SoftMax probability, i.e., the one at index $\arg\max_o z^{\{S\}}_o$, and append this character to the `text`.
Alternatively, we can also randomly draw a character based on the SoftMax probability distribution $\vec y^{\{S\}}$. `random.choices` provides the possibility to pass a list of characters and a list of probabilities.
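
Both selection strategies can be tried on a toy example. The sketch below assumes a hypothetical three-letter alphabet and hand-picked logits for the last sequence item. Since SoftMax is monotonic, the largest logit and the largest probability share the same index.

```python
import random
import torch

random.seed(0)

toy_characters = ["a", "b", "c"]             # hypothetical alphabet
toy_logits = torch.tensor([0.5, 2.0, -1.0])  # logits of the last sequence item

# SoftMax turns logits into a probability distribution
probabilities = torch.softmax(toy_logits, dim=0)

# greedy choice: index of the largest logit (same index as the largest probability)
best_char = toy_characters[int(torch.argmax(toy_logits))]

# random choice: random.choices returns a list, so take its single element
sampled_char = random.choices(toy_characters, weights=probabilities.tolist(), k=1)[0]

print(best_char, sampled_char)
```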

In [None]:
def predict(z, use_best):
  # select the appropriate logits
  z_S = ...
  if use_best:
    # take character with maximum probability
    next_char = ...
  else:
    # sample character based on class probabilities
    next_char = ...
  return next_char

Task 9: Sequence Completion
---------------------------

Write a function that takes a `seed` text which it will complete with the given number of characters.
Write a loop that turns the current `text` into an encoded sequence of its characters using the function from Task 7.
Forward the text through the network and take the prediction of the last sequence item $\vec z^{\{S\}}$ using the function from Task 8.
Append this to the current text (remember that Python strings are immutable).
Repeat this loop `count` times (80 in our case), and return the resulting `text`.

In [None]:
def sequence_completion(seed, count, use_best):
  # we start with the given seed
  text = seed
  for i in range(count):
    # turn current text to one-hot batch
    x = ...
    # predict the next character
    next_char = ...
    # append character to text
    ...
    
  return text

Task 10: Text Production
------------------------

Select several seeds (such as `"the ", "beau", "mothe", "bloo"`) and let the network predict the following 80 characters, either by always taking the most probable character or by sampling from the predicted probabilities.
Write the completed sentences to the console.

In [None]:
seeds = ...

for seed in seeds:
  best = ...
  # print seed and text
  print (f"\"{seed}\" -> \"{best}\"")
  sampled = ...
  # print seed and text
  print (f"\"{seed}\" -> \"{sampled}\"")
  print()