# Exercise 10 (solution)

In this exercise we learn about simple RNNs as well as encoder-decoder RNNs. We only implement some components from scratch in numpy and leave out the training. By now, it would not be hard for you to implement this in `torch` and add the training.

In [1]:
import numpy as np
from scipy.special import softmax 
from dataclasses import dataclass

## Task 1: Tokenization and embeddings

Implement a simple **character level** tokenization and embedding algorithm. In contrast to what we did in an earlier lecture, we want to minimize the vocab size to just the characters that are present in a given text.

1. Write a function called `get_vocabulary(text)` that returns a sorted list of all characters that occur in the text
2. Write a function called `tokenize(text, vocabulary)` that takes a text and a vocabulary (a list of characters) and returns a list of ints, where each int is the position of a character in the vocabulary
3. Write a function called `embed(tokens, vocab_size)` that returns a numpy array of shape (n_tokens, vocab_size) where each row is a one-hot vector corresponding to a token
4. Call all the functions to create `in_embeddings` for our text

In [2]:
text = "hello"

In [3]:
def get_vocabulary(text):
    """Get a minimal character level vocabulary to tokenize the text."""
    text = text.lower()
    characters = sorted(set(text))
    return characters



vocabulary = get_vocabulary(text)
vocab_size = len(vocabulary)
vocabulary

['e', 'h', 'l', 'o']

In [4]:
def tokenize(text, vocabulary):
    """Tokenize the text, given the vocabulary."""
    text = text.lower()
    token_dict = {character: pos for pos, character in enumerate(vocabulary)}
    out = [token_dict[character] for character in text]
    return out

tokens = tokenize(text, vocabulary)
tokens

[1, 0, 2, 2, 3]

In [5]:
# embed
def embed(tokens, vocab_size):
    """Create input embeddings for each token."""
    out = np.zeros((len(tokens), vocab_size))
    out[np.arange(len(out)), tokens] = 1
    return out
    
    
in_embeddings = embed(tokens, vocab_size)
in_embeddings

array([[0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
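
As a quick sanity check, the one-hot embedding can be inverted: taking the argmax of each row recovers the token, and the vocabulary maps tokens back to characters. The sketch below hard-codes the vocabulary and tokens printed above, and `detokenize` is a hypothetical helper that is not part of the exercise.

```python
import numpy as np

# Vocabulary and tokens for "hello", copied from the outputs above.
vocabulary = ["e", "h", "l", "o"]
tokens = [1, 0, 2, 2, 3]

# One-hot embedding, same logic as `embed`.
one_hot = np.zeros((len(tokens), len(vocabulary)))
one_hot[np.arange(len(tokens)), tokens] = 1


def detokenize(tokens, vocabulary):
    """Hypothetical helper: map token ids back to characters."""
    return "".join(vocabulary[int(t)] for t in tokens)


# The argmax of each row is the position of the single 1, i.e. the token id.
recovered_tokens = one_hot.argmax(axis=1)
recovered_text = detokenize(recovered_tokens, vocabulary)
print(recovered_text)  # hello
```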

## Task 2: A Params class

1. Define a `dataclass` called `Params` that has the three attributes `w_xh`, `w_hh`, `w_hy`
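2. Create an instance of `Params` with weight matrices that have the correct shapes and are filled with uniform random values between -1 and 1. Note that the solution below calls `np.random.uniform` with its default bounds, so the printed weights lie in [0, 1); pass `low=-1, high=1` to get the range stated here.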

In [6]:
n_in = vocab_size
n_out = vocab_size
n_hidden = 3

In [7]:
@dataclass
class Params:
    w_xh: np.ndarray
    w_hh: np.ndarray
    w_hy: np.ndarray

In [8]:
np.random.seed(12345)

p = Params(
    w_xh = np.random.uniform(size=(n_hidden, n_in)),
    w_hh = np.random.uniform(size=(n_hidden, n_hidden)),
    w_hy = np.random.uniform(size=(n_out, n_hidden)),
)
p

Params(w_xh=array([[0.92961609, 0.31637555, 0.18391881, 0.20456028],
       [0.56772503, 0.5955447 , 0.96451452, 0.6531771 ],
       [0.74890664, 0.65356987, 0.74771481, 0.96130674]]), w_hh=array([[0.0083883 , 0.10644438, 0.29870371],
       [0.65641118, 0.80981255, 0.87217591],
       [0.9646476 , 0.72368535, 0.64247533]]), w_hy=array([[0.71745362, 0.46759901, 0.32558468],
       [0.43964461, 0.72968908, 0.99401459],
       [0.67687371, 0.79082252, 0.17091426],
       [0.02684928, 0.80037024, 0.90372254]]))
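
The task statement asks for values between -1 and 1, whereas `np.random.uniform(size=...)` samples from [0, 1) by default. A minimal sketch of the symmetric variant is shown below. It uses NumPy's `Generator` API instead of the global seed, so the numbers will differ from the ones printed above, and the sizes are copied from this notebook.

```python
import numpy as np

# Sizes as in the notebook: 4 characters in the vocabulary, 3 hidden units.
n_in, n_hidden, n_out = 4, 3, 4

# A local random generator; the seed value is arbitrary.
rng = np.random.default_rng(12345)

# Each matrix maps from its column dimension to its row dimension.
w_xh = rng.uniform(low=-1, high=1, size=(n_hidden, n_in))
w_hh = rng.uniform(low=-1, high=1, size=(n_hidden, n_hidden))
w_hy = rng.uniform(low=-1, high=1, size=(n_out, n_hidden))

print(w_xh.shape, w_hh.shape, w_hy.shape)  # (3, 4) (3, 3) (4, 3)
```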

## Task 3: Implement a Vanilla RNN (for Language Modelling)

1. Implement a function called `model_step(x, h, p)` where `x` is a one-hot vector, `h` is a vector that holds the internal state of the RNN and `p` is an instance of `Params`. It returns the updated `h` and the output `y`.
2. Implement a function called `model(embeddings, p)` that calls `model_step` internally and produces an array of logits. The output array has shape (len(embeddings) - 1, vocab_size). The function does roughly the following steps:
    - Initialize h to a vector of zeros
    - Call `model_step` in a loop
    - Collect all y in a list

In [9]:
def model_step(x, h, p):
    """Update the hidden state with one input and compute the output logits."""
    h = np.tanh(p.w_xh @ x + p.w_hh @ h)
    y = p.w_hy @ h
    return h, y
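
Because `x` is a one-hot vector, the product `p.w_xh @ x` simply picks out one column of `w_xh`. The toy example below illustrates this with a made-up 3x4 matrix; the values are for illustration only.

```python
import numpy as np

# Toy (n_hidden, n_in) matrix with easily recognizable entries.
w_xh = np.arange(12.0).reshape(3, 4)
# One-hot vector for token 2.
x = np.array([0.0, 0.0, 1.0, 0.0])

via_matmul = w_xh @ x    # matrix-vector product
via_lookup = w_xh[:, 2]  # direct column lookup

print(via_matmul)  # [ 2.  6. 10.]
```

This is why one-hot inputs followed by a weight matrix are often described as an embedding lookup.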

In [10]:
def model(embeddings, p):
    """Model that takes input_embeddings and produces logits."""
    h = np.zeros(len(p.w_hh))
    out = []
    # Skip the last embedding: there is no next token to predict after it.
    for x in embeddings[:-1]:
        h, y = model_step(x, h, p)
        out.append(y)
    return np.array(out)

In [11]:
logits = model(in_embeddings, p)
logits.shape

(4, 4)

In [12]:
softmax(logits, axis=1).round(1)

array([[0.2, 0.3, 0.2, 0.3],
       [0.2, 0.4, 0.2, 0.2],
       [0.2, 0.4, 0.2, 0.3],
       [0.2, 0.4, 0.2, 0.3]])
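
Each row above is a probability distribution over the vocabulary; because of the rounding to one decimal, the printed rows need not sum to exactly 1. For reference, here is a minimal numpy sketch of a row-wise softmax. `softmax_rows` is a hypothetical stand-in for `scipy.special.softmax(..., axis=1)`, not its actual implementation.

```python
import numpy as np


def softmax_rows(logits):
    """Row-wise softmax of a 2D array of logits."""
    # Subtracting the row maximum leaves the result unchanged
    # but avoids overflow in np.exp for large logits.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


toy_logits = np.array([
    [1.0, 2.0, 3.0],
    [1000.0, 1000.0, 1000.0],  # would overflow without the shift
])
probs = softmax_rows(toy_logits)
print(probs.round(3))
```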

## Task 4: Implement loss function

1. Create a list called `targets` that contains the target token for each output, i.e. the tokenized version of `"ello"`
2. Write a function called `cross_entropy_loss(logits, targets)`. This is basically the same function you wrote in lecture 8. The steps are roughly:
    - Take the softmax over the last axis
    - Use the indexing trick to get likelihoods
    - Return the negative mean of the log likelihoods

We are not using the loss function for training; I just want to make sure you understand what the loss function for language modelling is.
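
The indexing trick mentioned in the steps can be seen in isolation on a small example. Indexing with a pair of index lists picks one entry per row, namely the probability assigned to that row's target. The probabilities and targets below are made up.

```python
import numpy as np

# Two predictions over a vocabulary of size 3 (made-up values).
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.4],
])
targets = [1, 2]

# Row i is paired with column targets[i]: probs[0, 1] and probs[1, 2].
likelihoods = probs[np.arange(len(targets)), targets]
print(likelihoods)  # [0.7 0.4]

# Cross entropy is the negative mean of the log likelihoods.
loss = -np.log(likelihoods).mean()
```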

In [13]:
targets = tokens[1:]
targets

[0, 2, 2, 3]

In [14]:
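def cross_entropy_loss(logits, targets):
    """Mean negative log likelihood of the targets under softmax(logits)."""
    probs = softmax(logits, axis=1)
    likelihoods = probs[np.arange(len(targets)), targets]
    # The tiny constant guards against log(0).
    return -np.log(likelihoods + 1e-50).mean()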

In [15]:
cross_entropy_loss(logits, targets)

1.5206348038623518
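
A useful reference point for this number is the loss of a model that predicts every token with equal probability, which is `log(vocab_size)`. The sketch below reproduces that baseline; to stay self-contained it uses a small numpy softmax in place of the scipy one, which is an assumption of this example.

```python
import numpy as np


def cross_entropy_loss(logits, targets):
    """Same loss as above, with a numpy softmax instead of scipy's."""
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    likelihoods = probs[np.arange(len(targets)), targets]
    return -np.log(likelihoods + 1e-50).mean()


targets = [0, 2, 2, 3]  # tokens of "ello"
vocab_size = 4

# All-zero logits give a uniform distribution over the vocabulary.
uniform_logits = np.zeros((len(targets), vocab_size))
baseline = cross_entropy_loss(uniform_logits, targets)
print(baseline, np.log(vocab_size))
```

The baseline is about 1.39, so the loss of roughly 1.52 from the random weights is slightly worse than uniform guessing, which is what one would expect before any training.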

## Task 5: Implement a text-to-text model and use optimal weights

In this task I give you trained weights for the model. Those weights should enable the model to correctly return `"ello"` when prompted with `"hello"`.

The only thing you need to do is:

1. Write a function called `s2s_model(text, p, vocabulary)` that takes text and returns text. Inside, you have to do the following steps:
    - tokenize the text
    - embed the text
    - use the model to get logits
    - Get predicted tokens from the logits
    - Translate the tokens into text

In [16]:
w_xh_opt = np.array([
    [-13.8, 0.6, 2.7, 0.1],
    [4.7, -20.9, 1.6, 0.1],
    [1.6, 6.9, 10.9, 0. ],
])

w_hh_opt = np.array([
    [-2.1, -5.9,  7.2],
    [-5.9, -4.2,  0.8],
    [ 6. ,  7.5,  2.8],
])

w_hy_opt = np.array([
    [ -0.6, -24.2,  -0.7],
    [  3.4,   8.8, -12. ],
    [-12.5,  12.2,   9. ],
    [ 10. ,   3.2,   3.7]
])

p_opt = Params(
    w_xh=w_xh_opt,
    w_hh=w_hh_opt,
    w_hy=w_hy_opt,
)

In [17]:
def s2s_model(text, p, vocabulary):
    """Model that takes text and returns text."""
    vocab_size = len(vocabulary)
    tokens = tokenize(text, vocabulary)
    input_embeddings = embed(tokens, vocab_size)
    logits = model(input_embeddings, p)
    predictions = np.argmax(logits, axis=1)
    return "".join(vocabulary[pred] for pred in predictions)
    

s2s_model(text, p_opt, vocabulary)

'ello'

## Switching to word level embedding for machine translation

To learn about encoder-decoder RNNs we switch from character-level tokenization to word-level tokenization. Moreover, we add a start and end token. 

Since you already know how to write tokenizers, here is the code:

In [18]:
in_text = "Hello World"
out_text = "Hallo Welt"

def get_vocabulary(text):
    """Get a minimal vocabulary to tokenize the text."""
    text = text.lower().split()
    words = sorted(set(text)) + ["<SOS>", "<EOS>"]
    return words
    
def tokenize(text, vocabulary):
    """Tokenize the text, given the vocabulary."""
    text = ["<SOS>"] + text.lower().split() + ["<EOS>"]
    token_dict = {word: pos for pos, word in enumerate(vocabulary)}
    out = [token_dict[word] for word in text]
    return out


def embed(tokens, vocab_size):
    """Create input embeddings for each token."""
    out = np.zeros((len(tokens), vocab_size))
    out[np.arange(len(out)), tokens] = 1
    return out

in_vocabulary = get_vocabulary(in_text)
print("Input vocabulary:", in_vocabulary)
in_vocab_size = len(in_vocabulary)
in_tokens = tokenize(in_text, in_vocabulary)
print("Input tokens:", in_tokens)
in_embeddings = embed(in_tokens, in_vocab_size)

out_vocabulary = get_vocabulary(out_text)
print("Output vocabulary:", out_vocabulary)
out_vocab_size = len(out_vocabulary)
out_tokens = tokenize(out_text, out_vocabulary)
print("Output tokens:", out_tokens)
target_size = len(out_tokens)
print("Target size:", target_size)


n_in = in_vocab_size
n_out = out_vocab_size
n_hidden = 4


Input vocabulary: ['hello', 'world', '<SOS>', '<EOS>']
Input tokens: [2, 0, 1, 3]
Output vocabulary: ['hallo', 'welt', '<SOS>', '<EOS>']
Output tokens: [2, 0, 1, 3]
Target size: 4


Moreover, here is code for two parameter classes that you can use in your model:

In [19]:
@dataclass
class EncoderParams:
    w_xh: np.ndarray
    w_hh: np.ndarray


@dataclass
class DecoderParams:
    w_ss: np.ndarray
    w_ys: np.ndarray
    w_sy: np.ndarray

np.random.seed(1234)

p_enc = EncoderParams(
    w_xh=np.random.uniform(size=(n_hidden, n_in)),
    w_hh=np.random.uniform(size=(n_hidden, n_hidden)),
)

p_dec = DecoderParams(
    w_ss=np.random.uniform(size=(n_hidden, n_hidden)),
    w_ys=np.random.uniform(size=(n_hidden, n_out)),
    w_sy=np.random.uniform(size=(n_out, n_hidden)),
)

p_enc

EncoderParams(w_xh=array([[0.19151945, 0.62210877, 0.43772774, 0.78535858],
       [0.77997581, 0.27259261, 0.27646426, 0.80187218],
       [0.95813935, 0.87593263, 0.35781727, 0.50099513],
       [0.68346294, 0.71270203, 0.37025075, 0.56119619]]), w_hh=array([[0.50308317, 0.01376845, 0.77282662, 0.88264119],
       [0.36488598, 0.61539618, 0.07538124, 0.36882401],
       [0.9331401 , 0.65137814, 0.39720258, 0.78873014],
       [0.31683612, 0.56809865, 0.86912739, 0.43617342]]))

## Task 6: Implement encode and decode steps (for Machine Translation)

1. Write a function called `encode_step(x, h, p_enc)`
2. Write a function called `decode_step(s, y_prev, p_dec)`

The two functions together will play the same role as the `model_step` in the simple RNN.

In [20]:
def encode_step(x, h, p_enc):
    h = np.tanh(p_enc.w_xh @ x + p_enc.w_hh @ h)
    return h

In [21]:
def decode_step(s, y_prev, p_dec):
    s = np.tanh(p_dec.w_ss @ s + p_dec.w_ys @ y_prev)
    y = p_dec.w_sy @ s
    return s, y
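
To see how the shapes fit together in the two steps, the following sketch runs one encoder update and one decoder update with random weights. The sizes are deliberately chosen to differ from each other (an assumption of this example; in the notebook all three are 4), and the step formulas are inlined so that the block runs on its own.

```python
import numpy as np

# Illustrative sizes, chosen to be distinct so shape errors are visible.
n_in, n_out, n_hidden = 5, 6, 3
rng = np.random.default_rng(0)

w_xh = rng.uniform(size=(n_hidden, n_in))      # input  -> encoder state
w_hh = rng.uniform(size=(n_hidden, n_hidden))  # encoder state -> encoder state
w_ss = rng.uniform(size=(n_hidden, n_hidden))  # decoder state -> decoder state
w_ys = rng.uniform(size=(n_hidden, n_out))     # previous output -> decoder state
w_sy = rng.uniform(size=(n_out, n_hidden))     # decoder state -> logits

x = np.eye(n_in)[0]        # one-hot input token
y_prev = np.eye(n_out)[2]  # one-hot previous output token

# Encoder step, starting from a zero state.
h = np.tanh(w_xh @ x + w_hh @ np.zeros(n_hidden))
# Decoder step, starting from the encoder state.
s = np.tanh(w_ss @ h + w_ys @ y_prev)
y = w_sy @ s

print(h.shape, s.shape, y.shape)  # (3,) (3,) (6,)
```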

## Task 7: Implement the encoder-decoder model

1. Write a function called `model(in_embeddings, target_size, p_enc, p_dec)`. The function has the following steps:
    - Initialize h as a vector of zeros
    - Call the encode step in a loop to produce a final encoder state (h)
    - Use the final h as the initial decoder state s
    - Initialize `y_prev` to the embedding of the `<SOS>` token in the output vocabulary
    - Call the decode step `target_size` times, feeding each output y back in as the next `y_prev`
    - Collect the ys in a list and return them as an array

In [22]:
def model(in_embeddings, target_size, p_enc, p_dec):
    h = np.zeros(len(p_enc.w_hh))
    for x in in_embeddings:
        h = encode_step(x, h, p_enc)

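    # The final encoder state becomes the initial decoder state.
    s = h
    # One-hot embedding of <SOS>, which is second to last in the output vocabulary.
    y_prev = np.zeros(p_dec.w_ys.shape[1])
    y_prev[-2] = 1
    out = []
    for _ in range(target_size):
        s, y = decode_step(s, y_prev, p_dec)
        out.append(y)
        # Feed the raw logits (not a one-hot prediction) back as the next input.
        y_prev = y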
    
    return np.array(out)
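
The line `y_prev[-2] = 1` relies on `<SOS>` being the second-to-last entry of the output vocabulary, which holds for the `get_vocabulary` function given above. A slightly more robust sketch looks the position up by name; `sos_embedding` is a hypothetical helper and not part of the exercise.

```python
import numpy as np

# Output vocabulary as printed earlier in the notebook.
out_vocabulary = ["hallo", "welt", "<SOS>", "<EOS>"]


def sos_embedding(vocabulary, sos_token="<SOS>"):
    """Hypothetical helper: one-hot vector for the start token."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(sos_token)] = 1
    return vec


y_prev = sos_embedding(out_vocabulary)
print(y_prev)  # [0. 0. 1. 0.]
```

The model function itself only receives the weights, so using this variant would mean passing the vocabulary (or the start token's index) as an extra argument.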

## Task 8: Implement the encoder-decoder text-to-text model

1. Implement a function called `s2s_model(in_text, p_enc, p_dec, in_vocabulary, out_vocabulary, target_size)`. This is similar to the function you wrote above, but this time the input vocabulary and output vocabulary differ.

In [23]:
def s2s_model(
    in_text, 
    p_enc,
    p_dec,
    in_vocabulary=in_vocabulary,
    out_vocabulary=out_vocabulary,
    target_size=target_size,
):
    """Model that takes text and returns text."""
    in_vocab_size = len(in_vocabulary)
    in_tokens = tokenize(in_text, in_vocabulary)
    in_embeddings = embed(in_tokens, in_vocab_size)

    logits = model(in_embeddings, target_size, p_enc, p_dec)

    predictions = np.argmax(logits, axis=1)
    return " ".join(out_vocabulary[pred] for pred in predictions)

In [24]:
w_xh_opt = np.array([
    [ 1.1,  0.2, -0.4,  0.5],
    [ 0.7, -0.5, -0.6, -7.9],
    [ 0.4,  3.9,  0.3, -0.2],
    [ 1.2,  0.6,  0.1,  2.7],
])

w_hh_opt = np.array([
    [-0.6, -0.8,  0.4, -0.1],
    [-1.6,  2.7, -3.7, -3.8],
    [-1.2,  1.3,  1. ,  0.2],
    [-0. , -0.6,  0.6,  0.9],
])

w_ss_opt = np.array([
    [10.6, -0.5, -5.2,  0.7],
    [-6.7,  1.9, -5. ,  1.8],
    [10.9, -7.5, -6.2,  4.5],
    [ 1.5,  0.4,  4.1, -4.4],
])

w_sy_opt = np.array([
    [  3. , -10.6, -22.2,   4.2],
    [  4.4,  15.4,   9.6,  -7. ],
    [  4.8,  -4. ,  12.2,  -6. ],
    [-16.4, -10.1, -18.8,  14.1],
])


w_ys_opt = np.array([
    [ 0.5, -3. , 11.9,  4.2],
    [ 3. ,  0.8,  2. ,  1. ],
    [ 3.8, -1.1, 16.1,  8.3],
    [-3.2, -0.1, -7.4, -0.7],
])

p_enc_opt = EncoderParams(
    w_xh=w_xh_opt,
    w_hh=w_hh_opt,
)

p_dec_opt = DecoderParams(
    w_ss=w_ss_opt,
    w_ys=w_ys_opt,
    w_sy=w_sy_opt,
)

In [25]:
s2s_model(in_text, p_enc_opt, p_dec_opt)

'<SOS> hallo welt <EOS>'
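
The returned string still contains the start and end markers, because they are part of the target sequence the model was trained to reproduce. If only the translated words are wanted, they can be filtered out afterwards. The helper below is a hypothetical post-processing step, not part of the exercise.

```python
def strip_special_tokens(text, special=("<SOS>", "<EOS>")):
    """Hypothetical helper: drop start/end markers from a decoded string."""
    return " ".join(word for word in text.split() if word not in special)


translation = strip_special_tokens("<SOS> hallo welt <EOS>")
print(translation)  # hallo welt
```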