[Open this notebook on Colab](https://colab.research.google.com/github/probabll/ntmi-tutorials/blob/main/Encoders.ipynb)

# Guide

Neural networks (NNs) are functions that take real-valued vectors as inputs and produce real-valued vectors as outputs. They are quite flexible, but not flexible enough to process _symbolic_ data (such as language) without special treatment. 

In this notebook, we discuss techniques that we can use to map from a symbolic space, such as a finite countable set (e.g., the vocabulary of a language) to a real coordinate space (a real-valued vector space of fixed dimensionality). 

This notebook stands as lecture notes for the _Feature Learning_ module. 

## ILOs

After completing this lab you should be able to

* specify neural text encoders in PyTorch

## General Notes

* In this notebook you are expected to use $\LaTeX$. 
* Use python3.
* Use Torch. 
* This tutorial runs on CPU with no problem.

We will use a set of standard libraries that are often used in machine learning projects. If you are running this notebook on Google Colab, all libraries should be pre-installed. If you are running this notebook locally, you will need to install some additional packages; ask your TA for help if you have problems setting up.

If you need a short introduction to PyTorch [check this tutorial](https://github.com/probabll/ntmi-tutorials/blob/main/PyTorch.ipynb).


## Table of Contents



### Topics 

* [Text Encoders](#sec:Text_Encoders)
* [Vocabulary](#sec:Vocabulary)
* [From Tokens to Vectors](#sec:From_Tokens_to_Vectors)
	* [One-Hot Encoding](#sec:One-Hot_Encoding)
	* [Word embeddings](#sec:Word_embeddings)
* [Pooling from multiple vectors](#sec:Pooling_from_multiple_vectors)
	* [Sum pooling](#sec:Sum_pooling)
	* [Average pooling](#sec:Average_pooling)
* [Mapping from one real coordinate space to another](#sec:Mapping_from_one_real_coordinate_space_to_another)
	* [Linear transformation](#sec:Linear_transformation)
	* [Nonlinear activation functions](#sec:Nonlinear_activation_functions)
* [Composing multiple vectors](#sec:Composing_multiple_vectors)
	* [Concatenation](#sec:Concatenation)
	* [Feed forward network](#sec:Feed_forward_network)
	* [Recurrent neural network](#sec:Recurrent_neural_network)
	 	* [LSTM](#sec:LSTM)
	 	* [Bidirectional RNN encoder](#sec:Bidirectional_RNN_encoder)


### Table of ungraded exercises

1. [One-hot encoding](#ungraded-1)
1. [Token embedding](#ungraded-2)
1. [Sum pooling](#ungraded-3)
1. [Average pooling](#ungraded-4)
1. [Limitations of pooling](#ungraded-5)
1. [Linear transformation](#ungraded-6)
1. [Activation functions](#ungraded-7)
1. [Concatenation](#ungraded-8)
1. [FFNN](#ungraded-9)
1. [BiLSTM](#ungraded-10)

## Setting up

In [None]:
import random
import numpy as np
np.random.seed(42)
random.seed(42)

In [None]:
import torch

<a name='sec:Text_Encoders'></a>
# Text Encoders

In NLP applications, we often have to *encode* a piece of text, that is, map it to one (or more) vector(s) in some real coordinate space. For example, that is the case in text classification.

In this section we will discuss standard NN building blocks and how to use them to encode a document (i.e., turn a document into features) and then map that encoding to the parameters of our choice of probability mass function (pmf). 
Whenever a neural network has parameters of its own, these are initialised in some standard way (typically at random). At initialisation, these parameters are uninformative: we can use the NN, but its outputs are not optimised for any specific purpose. You will implement a training procedure in T4; for now, it is sufficient to focus on how to specify the NN functions and to make sure we understand their inputs, their outputs and the operations they perform. 

Throughout, we will focus on building blocks useful in the design of $C$-way text classifiers. We assume a *document* is a sequence $x=\langle w_1, \ldots, w_l \rangle$ of $l$ tokens, where each token comes from a vocabulary $\mathcal V$ of $V$ tokens. The label space $\mathcal C$ of our text classifier is made of $C$ classes. Hence, our goal is to map from any given $x$ to a $C$-dimensional probability vector $\boldsymbol \pi^{(x)} \in \Delta_{C-1}$.


The rough idea is as follows:
* we convert the tokens in a document to fixed-dimensional vectors;
* then, we map these vectors to a single vector representing the entire document (depending on how we design this operation, it may or may not discard information such as the order in which the tokens occurred);
* finally, we map this document encoding to a vector of $C$ logits (and softmax gives us $C$ probabilities), which then is used to parameterise the Categorical pmf.

In order to maximise the benefit of working with computation graphs, we will design our NNs to process batches of documents, where the documents in a batch may differ in length.
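
The three steps above can be sketched as follows. This is only a minimal illustration under our own assumptions: the sizes `V`, `D` and `C`, the hand-written token ids and the choice of masked average pooling are hypothetical, each building block is discussed in detail later in this notebook, and the parameters are untrained, so the resulting probabilities are not yet meaningful.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumptions): vocabulary, embedding dim, number of classes
V, D, C, PAD = 10, 2, 3, 0

emb = nn.Embedding(V, D)       # step 1: token ids -> D-dimensional vectors
to_logits = nn.Linear(D, C)    # step 3: document encoding -> C logits

# A batch of 2 documents (token ids), the second one padded with PAD=0
batch = torch.tensor([[7, 4, 9, 5, 6],
                      [8, 7, 0, 0, 0]])
mask = batch != PAD            # [B, L], True for valid positions

e = emb(batch)                                   # [B, L, D]
summed = (e * mask.unsqueeze(-1)).sum(dim=-2)    # [B, D], PAD positions zeroed out
u = summed / mask.sum(dim=-1, keepdim=True)      # step 2: average over valid tokens

logits = to_logits(u)                            # [B, C]
probs = F.softmax(logits, dim=-1)                # [B, C], each row sums to 1
pmf = torch.distributions.Categorical(logits=logits)

print(probs)
print(pmf.probs)  # same values as probs
```

Passing `logits` to `Categorical` lets torch apply the softmax internally, which is why `pmf.probs` matches `probs`.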

In [None]:
import torch
from torch import nn
import torch.nn.functional as F

<a name='sec:Vocabulary'></a>
# Vocabulary



The first thing we do when working with text is to map our tokens to unique 0-based integer identifiers (ids). It does not matter which token gets which id, so long as the correspondence between tokens and ids is fixed and unique.

Here's a toy example: a vocabulary of 10 known symbols and their unique 0-based integer identifiers. The first 4 symbols are reserved for special use, the remaining ones are words in our toy language.
```
id | symbol
---| -------
0  | -PAD-
1  | -BOS-
2  | -EOS-
3  | -UNK-
4  | and
5  | are
6  | awesome
7  | cats
8  | cute
9  | dogs
```

A vocabulary often contains some special symbols, which help us design good text encoders. 
* For example,  `-UNK-` is used in place of unknown words: if `otters` was never seen in training, but occurs at one point in  `otters are cute`, we change that document to `-UNK- are cute`.
* When working with batches of documents of different length, we extend shorter documents to match the length of the longest one, so they can be stacked together into something like a table:
```
|-------|------|-------|-------|---------|
| cats  | and  | dogs  | are   | awesome |
| cute  | cats | -PAD- | -PAD- | -PAD-   |
| -UNK- | are  | cute  | -PAD- | -PAD-   |
|-------|------|-------|-------|---------|
```
where the special symbol `-PAD-` identifies cells that are not part of any document.
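
As a rough sketch of how such a padded table can be built, here is one possible helper. The names `sym2id` and `pad_batch` are hypothetical (they are not part of this notebook's `Vocabulary` class), and the dictionary simply mirrors the toy vocabulary above.

```python
import torch

# Hypothetical toy mapping from symbols to ids (mirrors the table above)
sym2id = {"-PAD-": 0, "-BOS-": 1, "-EOS-": 2, "-UNK-": 3,
          "and": 4, "are": 5, "awesome": 6, "cats": 7, "cute": 8, "dogs": 9}

def pad_batch(documents, sym2id, pad="-PAD-", unk="-UNK-"):
    """Map tokenised documents to a [batch_size, max_length] tensor of token ids."""
    max_length = max(len(doc) for doc in documents)
    rows = []
    for doc in documents:
        # unknown symbols are replaced by the id of -UNK-
        ids = [sym2id.get(tok, sym2id[unk]) for tok in doc]
        # shorter documents are extended with the id of -PAD-
        ids = ids + [sym2id[pad]] * (max_length - len(ids))
        rows.append(ids)
    return torch.tensor(rows, dtype=torch.long)

docs = [s.split() for s in ["cats and dogs are awesome", "cute cats", "otters are cute"]]
padded = pad_batch(docs, sym2id)
print(padded)
```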

There are various implementations of a vocabulary; here's a very basic one (for most projects we need to adapt it to our needs, and in T4 we will adapt it to a concrete application).

In [None]:
class Vocabulary:
    """
    Constructs a 1-to-1 map between symbols (strings) and 0-based identifiers
    """
    
    def __init__(self, reserve: list, default=None):
        """
        reserve: reserve these symbols in order
        """
        if len(reserve) != len(set(reserve)):
            raise ValueError("Every reserved symbol must be unique")
        self._reserve = tuple(reserve)
        self._sym2id = dict()
        self._symbols = []        
        for sym in reserve:
            self.add(sym)
        self._default = default    
        if default is not None:
            self.add(default)        
        
    def add(self, sym: str):
        """Add a symbol (if it is unique) and return its index"""
        idx = self._sym2id.get(sym, None)
        if idx is None:
            idx = len(self._symbols)
            self._symbols.append(sym)
            self._sym2id[sym] = idx
        return idx
    
    def idx(self, sym):
        """Return the index of an existing symbol"""
        idx = self._sym2id.get(sym, None)
        if idx is None:
            if self._default is None:
                raise KeyError(f"Unknown symbol {sym}")
            else:
                idx = self._sym2id.get(self._default)
        return idx
    
    def symbol(self, idx):
        """Return the symbol associated with an index"""
        return self._symbols[idx]

    def __len__(self):
        """Vocabulary size"""
        return len(self._symbols)
    
    def items(self):
        """Items in the dictionary of symbols"""
        return self._sym2id.items()

In [None]:
vocabulary = Vocabulary(['-PAD-', '-BOS-', '-EOS-', '-UNK-'], default='-UNK-')
len(vocabulary)

In [None]:
for k, s in vocabulary.items():
    print(k, s)

In [None]:
print(vocabulary.add("and"))
print(vocabulary.add("are"))
print(vocabulary.add("awesome"))
print(vocabulary.add("cats")) 
print(vocabulary.add("cats")) # duplicates aren't added twice
print(vocabulary.add("cute"))
print(vocabulary.add("dogs"))

In [None]:
print(vocabulary.idx("dogs"))

In [None]:
print(vocabulary.symbol(0))

In [None]:
print(vocabulary.symbol(vocabulary.idx("otters")))

In [None]:
for k, s in vocabulary.items():
    print(s, k)

From now on, if we talk about a "token" we mean a _token id_ (i.e., the 0-based integer that identifies that symbol uniquely).

<a name='sec:From_Tokens_to_Vectors'></a>
# From Tokens to Vectors

Neural networks are functions that take real-valued vectors as inputs and produce real-valued vectors as outputs. They are quite flexible, but not flexible enough to process _symbolic_ data (such as language) without special treatment. 

In this section we discuss techniques that we can use to map from a symbolic space, such as the set of known words (i.e., the vocabulary) to a real coordinate space (a real-valued vector space of fixed dimensionality). 

<a name='sec:One-Hot_Encoding'></a>
## One-Hot Encoding


If we know a *finite* set $\mathcal V$ of tokens (e.g., words), and the total number of unique symbols in it is some number $V = |\mathcal V|$, then the simplest technique to map tokens to vectors is to map each token $t \in \mathcal V$ to a vector $\mathbf v = \mathrm{onehot}_V(t)$ such that $\mathbf v \in \mathbb R^V$, $v_t=1$ and $v_{d\neq t}=0$. 

This technique is called _one-hot encoding_ because it returns a vector whose coordinates are $0$ for all but one dimension (that which indicates the token we are encoding), which gets 1.

Example: using the vocabulary from the example, $\mathrm{onehot}_{10}(\texttt{awesome})= (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^\top$.
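
To make the definition concrete, here is a minimal hand-written version of it. The helper name `manual_one_hot` is our own (hypothetical), and we assume the id of `awesome` is 6, as in the toy vocabulary.

```python
import torch

def manual_one_hot(token_id: int, vocab_size: int) -> torch.Tensor:
    """Return a vocab_size-dimensional vector with a 1 at token_id and 0 elsewhere."""
    v = torch.zeros(vocab_size, dtype=torch.long)
    v[token_id] = 1
    return v

V = 10   # size of the toy vocabulary
t = 6    # id of 'awesome' in the toy vocabulary (assumption)
v = manual_one_hot(t, V)
print(v)
```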

Here we illustrate how to obtain one-hot encodings using `F.one_hot` from torch.

In [None]:
# This is the 10-dimensional encoding of 'awesome' 
F.one_hot(
    torch.tensor(vocabulary.idx("awesome")).long(), 
    len(vocabulary) # output dimensionality
)

Note that the output dimensionality is fixed, which means we can only encode the $V$ token ids from $0$ to $V-1$:

In [None]:
try: 
    F.one_hot(
        torch.tensor(-1).long(), 
        len(vocabulary) # output dimensionality
    )
except RuntimeError:
    print(f"Torch is 0-based, hence for a {len(vocabulary)}-dimensional space, we can encode from 0 to {len(vocabulary)-1}")
    
try: 
    F.one_hot(
        torch.tensor(len(vocabulary)).long(), 
        len(vocabulary) # output dimensionality
    )
except RuntimeError:
    print(f"Torch is 0-based, hence for a {len(vocabulary)}-dimensional space, we can encode from 0 to {len(vocabulary)-1}")

We can always work with batches of symbols, for example, a batch containing 3 documents of different length.

In [None]:
F.one_hot(
    torch.tensor(
        [
            [vocabulary.idx("cats"), vocabulary.idx("and"), vocabulary.idx("dogs"), vocabulary.idx("are"), vocabulary.idx("awesome")], # first document 
            [vocabulary.idx("cute"), vocabulary.idx("cats")] + [vocabulary.idx("-PAD-")] * 3, # second document
            [vocabulary.idx("otters"), vocabulary.idx("are"), vocabulary.idx("cute")] + [vocabulary.idx("-PAD-")] * 2 # third document
        ]
    ).long(), # three example documents (already expressed as sequences of token ids)
    len(vocabulary) # vocabulary size (pretend we only know 10 words)
)

<a name='ungraded-1'></a> **Ungraded Exercise 1 - One-hot encoding**

1. The output of `F.one_hot` above should have dimensionality [3, 5, 10]. Can you tell why?
2. In general, if the input batch has shape `[B, L]` and our vocabulary has `V` tokens, what's the expected output size of `F.one_hot(input_batch, V)`?
3. How many trainable parameters are needed to specify a one-hot encoding function for a vocabulary of size `V`?


<details>
    <summary> <b>Click to see a solution</b> </summary>
 
1. Every token id in the input should have been converted to a 10-dimensional one-hot vector.
The input has shape [3, 5], that is, we have 3 sequences of length 5 (counting PADs).
Hence, the output should be [3, 5, 10].
2. In general, for [B, L] inputs we expect [B, L, V] outputs.
3. None. The one-hot encoding function does not require trainable parameters, the output is "hard-coded" to be exactly the one-hot representation of a symbol, for each and every symbol in the vocabulary. 
    
---
    
</details>      


<a name='sec:Word_embeddings'></a>
## Word embeddings


Our next operation is a bit more interesting. It allows us to associate with each token a vector of trainable parameters. So, instead of encoding a symbol $t \in \mathcal V$ into a $V$-dimensional one-hot vector, we encode it into a $D$-dimensional vector of real-numbers. We normally refer to this operation as *embedding* the token into a $D$-dimensional space. Suppose we have a table of parameters $\mathbf E \in \mathbb R^{V \times D}$, with one $D$-dimensional row for each of the known symbols in the vocabulary $\mathcal V$. Then, for some symbol $t \in \mathcal V$, the embedding operation returns the vector $\mathbf e \in \mathbb R^D$ that corresponds to it.

Sometimes, we need to describe this operation in written form (for example, to sketch our model architecture). Here is a compact notation for it:
\begin{align}
    \mathbf e &= \mathrm{embed}_D(t; \mathbf E)
\end{align}

The subscript $_D$ indicates the output dimensionality. After `;` we have the trainable parameters of the operation.


If we had a sequence of symbols, for example, a document $w_{1:l}$, we could apply the embedding operation to each symbol in the sequence, and denote it this way:
\begin{align}
    \mathbf e_i &= \mathrm{embed}_D(w_i; \mathbf E) \quad \text{for }i=1,\ldots, l
\end{align}

Let's see how to use torch to specify an embedding layer. We will design a toy embedding layer, for a toy vocabulary:

In [None]:
# this creates the layer with untrained parameters
toy_emb_dim = 2
toy_vocab_size = len(vocabulary)
toy_emb = nn.Embedding(
    num_embeddings=toy_vocab_size, 
    embedding_dim=toy_emb_dim, 
)
toy_emb

See that PyTorch initialises all 10 vectors for us, each 2-dimensional:

In [None]:
print(toy_emb.weight)

Torch initialised those embeddings for us, at random. It is possible to initialise embeddings with meaningful features (you will see one example in T4).

For a 2-dimensional embedding layer, we can plot the embeddings:

In [None]:
import matplotlib.pyplot as plt

if toy_emb.embedding_dim == 2:
    _ = plt.scatter(toy_emb.weight[:,0].detach().numpy(), toy_emb.weight[:,1].detach().numpy(), marker='x')
    for t, i in vocabulary.items(): # pretending our vocabulary is this toy example
        _ = plt.annotate(t, toy_emb.weight[i].detach().numpy(), fontsize=12)
    _ = plt.xlabel("Embedding dimension 1")
    _ = plt.ylabel("Embedding dimension 2")    
    _ = plt.title("Embeddings at initialisation")

The forward method of the embedding module will embed every token in a batch of token ids.
Let's test it on a toy batch:

In [None]:
toy_batch = torch.tensor(
    [
        [vocabulary.idx(s) for s in "cats and dogs are awesome".split()],
        [vocabulary.idx(s) for s in "cute cats -PAD- -PAD- -PAD-".split()],
        [vocabulary.idx(s) for s in "otters are cute -PAD- -PAD-".split()],
        [vocabulary.idx(s) for s in "cute rabbits and cute otters".split()]
    ]
)
toy_batch

In [None]:
# this embeds the tokens in the sequences in the batch
e = toy_emb(toy_batch)
print(e.shape)
print(e)
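
One way to connect embeddings to one-hot encodings: looking up row $t$ of $\mathbf E$ gives the same result as multiplying the one-hot vector of $t$ by $\mathbf E$. The sketch below checks this numerically; the sizes and token ids are illustrative assumptions, and the layer is created here only to keep the example self-contained.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

V, D = 10, 2                      # illustrative vocabulary size and embedding dim
emb = nn.Embedding(num_embeddings=V, embedding_dim=D)
ids = torch.tensor([[7, 4, 9],
                    [8, 7, 0]])   # a [2, 3] batch of token ids

looked_up = emb(ids)                                   # table lookup: [2, 3, D]
via_matmul = F.one_hot(ids, V).float() @ emb.weight    # [2, 3, V] @ [V, D] -> [2, 3, D]

print(torch.allclose(looked_up, via_matmul))
```

In practice the lookup is preferred, since it avoids materialising `V`-dimensional one-hot vectors.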

It can be useful to count the number of parameters in a layer; here is some helper code:

In [None]:
def num_parameters(torch_module):
    """A helper to count the number of parameters in a torch module"""
    return sum(np.prod(theta.shape) for theta in torch_module.parameters())

<a name='ungraded-2'></a> **Ungraded Exercise 2 - Token embedding**

Assume the input batch has shape `[B, L]`, our vocabulary has size `V` and our embedding vectors are `D`-dimensional. 

1. Once we pass the input batch through the embedding layer, what's the output shape?
2. How many trainable parameters do we need in order to specify an embedding layer?


<details>
    <summary> <b>Click to see a solution</b> </summary>
 
1. The output shape is `[B, L, D]` because each one of the token ids is mapped to a `D`-dimensional vector.
2. We need $V \times D$ trainable parameters, that is, $D$ parameters for each of the $V$ symbols in the vocabulary.
    
---
    
</details>      



<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

# We can use this to verify that the output shape is correct

assert toy_emb(toy_batch).shape == toy_batch.shape + (toy_emb_dim,)

# We can use this to verify the size of the embedding layer

assert num_parameters(toy_emb) == toy_vocab_size * toy_emb_dim, "Embedding layers are built upon [V, D] matrices"

```

---
    
</details>      


<a name='sec:Pooling_from_multiple_vectors'></a>
# Pooling from multiple vectors

Sometimes we need to combine a variable number of $D$-dimensional vectors into a single $D$-dimensional vector; this is usually referred to as *pooling*. There are different pooling operations, some with and some without trainable parameters; for now we only cover those without trainable parameters.

General idea:
* Input: a collection $\mathbf e_1, \ldots, \mathbf e_l$ of $l > 0$ vectors, all $D$-dimensional.
* Output: a single $D$-dimensional vector $\mathbf u \in \mathbb R^D$.

*Batched* implementations of pooling operations must give special treatment to positions that should not affect the result (i.e., those that correspond to `-PAD-`), as we will discuss in each case below.


<a name='sec:Sum_pooling'></a>
## Sum pooling 

The output is the elementwise sum of the inputs.

* Input: a collection of $l > 0$ vectors $\mathbf e_1, \ldots, \mathbf e_l$, all of the same dimensionality $D$.
* Output $\mathbf u \in \mathbb R^D$ defined as

\begin{align}
    \mathbf u &= \sum_{i=1}^l \mathbf e_i
\end{align}

That is, for each dimension $d \in [D]$, we have
\begin{align}
    u_d &=  \sum_{i=1}^l e_{i,d}
\end{align}

For a *batched* implementation of this operation, input vectors in padded positions should be treated as if they were $\mathbf 0$.

In [None]:
def sum_pooling(input_sequences, sequence_mask):
    """
    Returns the sum of the vectors along the sequence dimension.
    
    input_sequences: [batch_size, max_length, D] a batch of sequences of D-dimensional vectors
    sequence_mask: [batch_size, max_length] indicates which positions are valid (i.e., not PAD)
        this is a boolean tensor: True for valid (not PAD) and False for PAD
    
    Output shape is [batch_size, D]
    """
    
    # here we replace padding positions by D-dimensional vectors of 0s, 
    #  this way those positions won't contribute to the sum
    # [batch_size, max_length, D]    
    masked = torch.where(
        # we create an extra axis at the end of the tensor
        sequence_mask.unsqueeze(-1),  # this has shape [batch_size, max_length, 1]
        input_sequences,  # this has shape [batch_size, max_length, D]
        torch.zeros_like(input_sequences)  # this has shape [batch_size, max_length, D]
    )
    
    # we sum, along the sequence dimension (second last),
    #  the valid vectors (those that are not PAD)
    # [batch_size, D]
    return torch.sum(masked, dim=-2) 

Let's test this using our toy batch. To obtain a sequence mask, we compare the tokens in the batch to the PAD id (we are using `0` for that):

In [None]:
toy_batch, toy_batch != vocabulary.idx("-PAD-")

In [None]:
sum_pooling(F.one_hot(toy_batch, toy_vocab_size), toy_batch != vocabulary.idx("-PAD-"))

<a name='ungraded-3'></a> **Ungraded Exercise 3 - Sum pooling**

Suppose we have a batch of documents with shape `[B, L]`, and we turn the tokens into one-hot vectors of dimensionality `V`. Then we apply sum pooling to this input batch. 

1. What's the shape of the output?
2. Can we interpret the result as a bag-of-words encoding of the documents in the batch?
3. How many trainable parameters are needed to specify the sum pooling operation?


<details>
    <summary> <b>Click to see a solution</b> </summary>
 
1. The input batch has shape `[B, L]`, after one-hot encoding, we get `[B, L, V]`, since each token becomes a V-dimensional one-hot vector. Finally, sum pooling works along the sequence dimension, hence we have output shape `[B, V]`.
2. Yes, that's precisely how we can interpret the output. You can see it in the toy example above. 
3. None. The sum pooling operation is an elementwise transformation of existing vectors and it does not require any trainable parameters. 
    
---
    
</details>      



<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

h = sum_pooling(F.one_hot(toy_batch, toy_vocab_size), toy_batch != vocabulary.idx("-PAD-"))
# Note how the 'step' or 'time' dimension is gone
assert h.shape == (toy_batch.shape[0], toy_vocab_size)
# In the last document h[-1], "cute" occurs twice:
assert h[-1, vocabulary.idx("cute")] == 2
# In the last document h[-1], we have 2 unknown words (rabbits and otters):
assert h[-1, vocabulary.idx("-UNK-")] == 2

```

---
    
</details>      
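
The bag-of-words interpretation can also be cross-checked by counting token ids directly. The sketch below is self-contained and uses a hand-written batch of ids (an assumption, with `0` as the PAD id); note that it masks by multiplication rather than with `torch.where`, which is just an alternative way of zeroing out padded positions.

```python
import torch
import torch.nn.functional as F

V, PAD = 10, 0   # illustrative vocabulary size and PAD id
batch = torch.tensor([[7, 4, 9, 5, 6],
                      [8, 7, 0, 0, 0],
                      [8, 3, 4, 8, 3]])
mask = batch != PAD                        # [B, L]

# Sum pooling over one-hot vectors, with padded positions zeroed out
one_hot = F.one_hot(batch, V)              # [B, L, V]
bow = (one_hot * mask.unsqueeze(-1)).sum(dim=-2)   # [B, V]

# Counting how often each id occurs among the valid tokens of each document
counts = torch.stack(
    [torch.bincount(row[m], minlength=V) for row, m in zip(batch, mask)]
)

print(bow)
print(torch.equal(bow, counts))
```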


<a name='sec:Average_pooling'></a>
## Average pooling 

When the output is the elementwise average of the inputs, this is known as *average pooling*. 

* Input: a collection of $l > 0$ vectors $\mathbf e_1, \ldots, \mathbf e_l$, all of the same dimensionality $D$.
* Output $\mathbf u \in \mathbb R^D$ defined as

\begin{align}
    \mathbf u &= \frac{1}{l} \sum_{i=1}^l \mathbf e_i
\end{align}

That is, for each dimension $d \in [D]$, we have
\begin{align}
    u_d &= \frac{1}{l} \sum_{i=1}^l e_{i,d}
\end{align}

A *batched* implementation of this operation must ignore padded inputs (treat them as $\mathbf 0$ and also count sequence length without them).

In [None]:
def average_pooling(input_sequences, sequence_mask):
    """
    Returns the average encoding of each sequence.
    
    input_sequences: [batch_size, max_length, D] a batch of sequences of D-dimensional vectors
    sequence_mask: [batch_size, max_length] indicates which positions are valid (i.e., not PAD)
        this is a boolean tensor: True for valid (not PAD) and False for PAD
    
    Output shape is [batch_size, D]
    """
    
    # here we replace padding positions by D-dimensional vectors of 0s, 
    #  this way those positions won't contribute to the sum
    # [batch_size, max_length, D]    
    masked = torch.where(
        # we create an extra axis at the end of the tensor
        sequence_mask.unsqueeze(-1),  # this has shape [batch_size, max_length, 1]
        input_sequences,  # this has shape [batch_size, max_length, D]
        torch.zeros_like(input_sequences)  # this has shape [batch_size, max_length, D]
    )
    
    # we sum, along the sequence dimension (second last),
    #  the valid vectors (those that are not PAD)
    # we also divide by sequence length
    # [batch_size, D]
    avg = torch.sum(masked, dim=-2) / torch.sum(sequence_mask.float(), dim=-1, keepdims=True)

    return avg

In [None]:
average_pooling(toy_emb(toy_batch), toy_batch != vocabulary.idx("-PAD-"))
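
To see why padded positions must be excluded from both the sum and the length, here is a small worked example with hand-picked numbers (all values are illustrative assumptions): one document with two valid vectors followed by two padded positions that hold arbitrary values.

```python
import torch

# One document: 2 valid vectors followed by 2 padded positions (D=2)
x = torch.tensor([[[1., 2.], [3., 4.], [9., 9.], [9., 9.]]])   # [1, 4, 2]
mask = torch.tensor([[True, True, False, False]])               # [1, 4]

# Naive average: padded positions leak into the result
naive = x.mean(dim=-2)                               # [1, 2]

# Masked average: zero out padded positions and divide by the true length
masked_sum = (x * mask.unsqueeze(-1)).sum(dim=-2)    # [1, 2]
lengths = mask.sum(dim=-1, keepdim=True)             # [1, 1]
masked_avg = masked_sum / lengths                    # [1, 2]

print(naive)       # affected by the padded positions
print(masked_avg)  # average of [1, 2] and [3, 4] only
```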

<a name='ungraded-4'></a> **Ungraded Exercise 4 - Average pooling**

Consider an input batch with shape `[B, L, D]`, where `D` is the embedding dimension and `L` is max sequence length.

1. What's the output shape if we do average pooling over the sequence dimension?
2. Does average pooling require trainable parameters?


<details>
    <summary> <b>Click to see a solution</b> </summary>
 
1. `[B, D]` since average pooling will eliminate the sequence dimension, by elementwise average of the vectors in the sequence
2. No, it simply operates over the input vectors.
    
---
    
</details>      



<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

# This checks the shape of the output
h = average_pooling(toy_emb(toy_batch), toy_batch != vocabulary.idx("-PAD-"))
assert h.shape == (toy_batch.shape[0], toy_emb_dim)

```

---
    
</details>      


<a name='ungraded-5'></a> **Ungraded Exercise 5 - Limitations of pooling**

Can you already recognise a big limitation of `sum_pooling` or `average_pooling` (in combination with one-hot or embeddings) as a means to represent a document?


<details>
    <summary> <b>Click to see a solution</b> </summary>
 
Sum and average pooling are not sensitive to the order in which the input vectors are presented. So, if the inputs to pooling do not preserve information about word order, then neither will the output. For example, the four documents in `another_toy_batch` (defined in the cell after this exercise) are permutations of one another, hence they all have the same BoW encoding and the same average embedding encoding.

Later, in this tutorial, we will address this limitation.
    
---
    
</details>      


In [None]:
another_toy_batch = torch.tensor(
    [
        [vocabulary.idx(s) for s in "dogs are awesome pets".split()],
        [vocabulary.idx(s) for s in "awesome pets are dogs".split()],
        [vocabulary.idx(s) for s in "dogs awesome pets are".split()],
        [vocabulary.idx(s) for s in "are dogs awesome pets".split()]
    ]
)
another_toy_batch


<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

bow = sum_pooling(F.one_hot(another_toy_batch, toy_vocab_size), another_toy_batch != vocabulary.idx("-PAD-"))
avgemb = average_pooling(toy_emb(another_toy_batch), another_toy_batch != vocabulary.idx("-PAD-"))
assert torch.all(bow[0] == bow[1])
assert torch.all(bow[0] == bow[2])
assert torch.all(bow[0] == bow[3])
assert torch.allclose(avgemb[0], avgemb[1])  # when comparing floating numbers, we use `allclose` rather than `all`
assert torch.allclose(avgemb[0], avgemb[2])  # when comparing floating numbers, we use `allclose` rather than `all`
assert torch.allclose(avgemb[0], avgemb[3])  # when comparing floating numbers, we use `allclose` rather than `all`

```

---
    
</details>      


<a name='sec:Mapping_from_one_real_coordinate_space_to_another'></a>
# Mapping from one real coordinate space to another

Sometimes we are working with vectors of a certain dimensionality $I$ and we need to convert them to vectors of another dimensionality $O$. This is very common, for example, when mapping a document encoding (e.g., average embedding, or BoW encoding) to $C$ logits.

<a name='sec:Linear_transformation'></a>
## Linear transformation 

A *linear transformation* is the standard way to get this done: with trainable parameters $\mathbf W \in \mathbb R^{O\times I}$ and $\mathbf b \in \mathbb R^O$ we can map any $I$-dimensional input $\mathbf u$ to an output $\mathbf v \in \mathbb R^O$ via $\mathbf v = \mathbf W \mathbf u + \mathbf b$.

In some situations, we prefer to use this transformation without a bias vector (i.e., setting $\mathbf b$ to a vector of $0$s); sometimes, this is also called a *projection*. 

Sometimes, we need to describe this operation in written form (for example, to sketch our model architecture). Here is a compact notation for it:
\begin{align}
    \mathbf v &= \mathrm{linear}_O(\mathbf u; \mathbf W, \mathbf b)
\end{align}


This is how we can construct a linear transformation with its trainable parameters:

In [None]:
toy_linear = nn.Linear(
    toy_emb_dim, # number of inputs
    3 # number of outputs
)

For such a toy example, you can inspect the initialised parameters:

In [None]:
toy_linear.weight, toy_linear.bias
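
As a sanity check, we can verify that `nn.Linear` computes $\mathbf W \mathbf u + \mathbf b$ for each input in a batch, and see how to drop the bias to obtain a projection. This is a minimal sketch with illustrative sizes and random inputs (assumptions); the layers are created here only to keep the example self-contained.

```python
import torch
from torch import nn

torch.manual_seed(0)

I, O = 2, 3                    # illustrative input and output dimensionalities
linear = nn.Linear(I, O)       # weight has shape [O, I], bias has shape [O]
u = torch.randn(4, I)          # a batch of 4 inputs

v = linear(u)                                  # [4, O]
manual = u @ linear.weight.T + linear.bias     # W u + b, applied to each row of the batch

print(torch.allclose(v, manual, atol=1e-6))

# A projection: the same transformation without a bias vector
projection = nn.Linear(I, O, bias=False)
print(projection.bias)  # None
```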

As expected, linear layers work on batched inputs too, so let's use the toy batch, convert its tokens to D-dimensional embeddings, combine all vectors in each sequence using average pooling, and then, finally, project each D-dimensional document encoding to 3 logits:

In [None]:
toy_linear(average_pooling(toy_emb(toy_batch), toy_batch != vocabulary.idx("-PAD-")))

<a name='ungraded-6'></a> **Ungraded Exercise 6 - Linear transformation**

If we have a batch of inputs with shape `[B, L]`, use a `D`-dimensional embedding layer to encode tokens,  average pooling to encode documents, and then a linear transformation to `K`-dimensional outputs:

1. What's the shape of the output?
2. How many trainable parameters are necessary to specify the linear layer?


<details>
    <summary> <b>Click to see a solution</b> </summary>
 
1. After the embedding layer we have `[B, L, D]`, after the pooling operation we have `[B, D]` and after the linear transformation we have `[B, K]`.
2. We need to store a matrix of shape `[K, D]` and a vector of `K` biases to linearly transform `D`-dimensional vectors into `K`-dimensional vectors.
    
---
    
</details>      


<a name='sec:Nonlinear_activation_functions'></a>
## Nonlinear activation functions



Sometimes we need to work on a _subspace_ of the real coordinate space, for example, where numbers are constrained to being positive, or strictly positive, or strictly positive and sum up to 1, etc.
We can achieve this by working with _activation functions_. An activation function does not change the dimensionality of its input and does not require any trainable parameters. Typically, an activation function is a formula that transforms a vector elementwise. 

For example, if $\mathbf u \in \mathbb R^D$

* $\exp(\mathbf u)$ applies $\exp(u_d)$ to each coordinate $u_d$ of $\mathbf u$, hence returning a vector of $D$ strictly positive numbers;
* $\mathrm{relu}(\mathbf u)$ applies $\max(0, u_d)$ to each coordinate $u_d$ of $\mathbf u$, hence returning a vector of $D$ non-negative numbers;
* $\tanh(\mathbf u)$ applies $\tanh(u_d)$ to each coordinate $u_d$ of $\mathbf u$, hence returning a vector of $D$ numbers in the interval $(-1, 1)$;
* $\mathrm{sigmoid}(\mathbf u)$ applies $\frac{1}{1+\exp(-u_d)}$ to each coordinate $u_d$ of $\mathbf u$, hence returning a vector of $D$ independently normalised probability values (i.e., each in the interval $(0, 1)$);
* $\mathrm{softplus}(\mathbf u)$ applies $\log(1+\exp(u_d))$ to each coordinate $u_d$ of $\mathbf u$, hence returning a vector of $D$ strictly positive numbers;
* $\mathrm{softmax}(\mathbf u)$ applies $\frac{\exp(u_d)}{\sum_{k=1}^D \exp(u_k)}$ to each coordinate $u_d$ of $\mathbf u$, hence returning a vector of $D$ strictly positive numbers that add up to 1 (i.e., a point in the simplex $\Delta_{D-1} \subset \mathbb R^D$).

<details>
<summary> <b> Remarks about sigmoid and softplus </b> (click to expand) </summary> 

The _sigmoid_ function is also known as the _logistic_ function (in statistics). In various reference texts, the sigmoid function is denoted by $\sigma(\cdot)$, that is, $\sigma(a) = \frac{1}{1+\exp(-a)}$. Here are some useful results about the sigmoid function: 
* $\sigma(a)= \frac{1}{1+\exp(-a)}=\frac{\exp(a)}{1+\exp(a)}=1-\sigma(-a)$ 
* $\frac{\mathrm{d}}{\mathrm{d}a}\sigma(a)=\sigma(a) \times (1-\sigma(a))$


The _softplus_ function can be thought of as a smooth approximation to the $\mathrm{relu}$ function. It is often used when we need strictly positive numbers while also needing more numerical stability than the $\exp$ function can offer. In particular, the derivative of the $\exp$ function is the $\exp$ function (i.e., $\frac{\mathrm{d}}{\mathrm{d}a}\exp(a)=\exp(a)$), while the derivative of the $\mathrm{softplus}$ function is the $\mathrm{sigmoid}$ function: 
* $\frac{\mathrm{d}}{\mathrm{d}a}\mathrm{softplus}(a) = \mathrm{sigmoid}(a)$

the main advantage being that $\exp$ can become very large, while $\mathrm{sigmoid}$ is always between $0$ and $1$.

</details>
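
The identities in the remarks above can be checked numerically. The following is a minimal sketch, assuming only `torch` and `torch.nn.functional`; the grid of test points `a` and the tolerance `atol=1e-6` are illustrative choices, not part of the original notebook.

```python
import torch
import torch.nn.functional as F

# A small grid of test points; requires_grad lets us ask autograd for derivatives
a = torch.linspace(-5.0, 5.0, steps=11, requires_grad=True)

# Three equivalent ways of writing the sigmoid
sig = torch.sigmoid(a)
same_as_definition = torch.allclose(sig, 1 / (1 + torch.exp(-a)), atol=1e-6)
same_as_exp_ratio = torch.allclose(sig, torch.exp(a) / (1 + torch.exp(a)), atol=1e-6)
same_as_symmetry = torch.allclose(sig, 1 - torch.sigmoid(-a), atol=1e-6)

# d/da sigmoid(a) = sigmoid(a) * (1 - sigmoid(a))
(dsig,) = torch.autograd.grad(torch.sigmoid(a).sum(), a)
sigmoid_grad_ok = torch.allclose(dsig, (sig * (1 - sig)).detach(), atol=1e-6)

# d/da softplus(a) = sigmoid(a)
(dsp,) = torch.autograd.grad(F.softplus(a).sum(), a)
softplus_grad_ok = torch.allclose(dsp, sig.detach(), atol=1e-6)

print(same_as_definition, same_as_exp_ratio, same_as_symmetry, sigmoid_grad_ok, softplus_grad_ok)
```

Summing before calling `torch.autograd.grad` is a convenient trick here: because both functions act elementwise, the gradient of the sum with respect to `a` is exactly the vector of per-coordinate derivatives.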


---



<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

torch.exp(torch.tensor([-2.5, -1., 0., 1, 2.5]))

```

---
    
</details>      


In [None]:
torch.relu(torch.tensor([-2.5, -1., 0., 1, 2.5]))

In [None]:
torch.tanh(torch.tensor([-2.5, -1., 0., 1, 2.5]))

Of course, these operations also work with batches:

In [None]:
torch.exp(torch.tensor([[-2.5, -1., 0., 1, 2.5], [-3., -2., -1, 0.5, 1.5]]))

<a name='ungraded-7'></a> **Ungraded Exercise 7 - Activation functions**

Play a bit with the following activations and explain what they do:
1. `torch.sigmoid`
2. `F.softplus`
3. `F.softmax`


<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

# sigmoid maps from R to (0, 1), it's appropriate to predict probability values
torch.sigmoid(torch.tensor([[-2.5, -1., 0., 1, 2.5], [-3., -2., -1, 0.5, 1.5]]))

```

---
    
</details>      



<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

# softplus maps from R to R+, it's appropriate to predict statistical parameters that must be strictly positive
# like rate, scale, concentration, etc.
F.softplus(torch.tensor([[-2.5, -1., 0., 1, 2.5], [-3., -2., -1, 0.5, 1.5]]))

```

---
    
</details>      



<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

# softmax maps from R^K to Simplex K-1, it's appropriate to predict probability vectors (that are normalised as to sum to 1)
print(F.softmax(torch.tensor([[-2.5, -1., 0., 1, 2.5], [-3., -2., -1, 0.5, 1.5]]), dim=-1))
assert torch.allclose(torch.sum(F.softmax(torch.tensor([[-2.5, -1., 0., 1, 2.5], [-3., -2., -1, 0.5, 1.5]]), dim=-1), -1), torch.ones(2))

```

---
    
</details>      


<a name='sec:Composing_multiple_vectors'></a>
# Composing multiple vectors


We now look into _composition functions_ that combine multiple vectors of fixed dimensionality into one (or more) vectors, while possibly changing the dimensionality of the output with respect to the dimensionality of the input(s). While pooling functions are necessarily discarding some information available in the input (e.g., sum discards order in the input collection, average discards order and size of the input collection, maximum discards inputs that are not the largest, etc.), we compose vectors  to a) _not_ discard anything important, and b) let the inputs _interact_ to create new, more complex features.

<a name='sec:Concatenation'></a>
## Concatenation


The simplest thing we can do, in order not to discard _any_ information, is to concatenate input vectors in their given order.
For example, we can concatenate an $I_1$-dimensional vector $\mathbf u$ with an $I_2$-dimensional vector $\mathbf v$ obtaining an $I_1+I_2$-dimensional vector:
\begin{align}
    \mathbf h &= \mathrm{concat}(\mathbf u, \mathbf v) = (u_1, \ldots, u_{I_1}, v_1, \ldots, v_{I_2})^\top~.
\end{align}

Another way to denote the concatenation of $l$ vectors $\mathbf u_1, \ldots, \mathbf u_l$, each of dimensionality $\operatorname{dim}(\mathbf u_i)$, is to write $[\mathbf u_1, \ldots, \mathbf u_l]$; the output vector then has dimensionality $\sum_{i=1}^l \operatorname{dim}(\mathbf u_i)$.
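
As a small illustration of the definition above, here is a sketch using `torch.cat`; the vectors `u`, `v` and the batched tensors `U`, `V` are made-up examples, not objects defined elsewhere in this notebook.

```python
import torch

# A 3-dimensional and a 2-dimensional vector
u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([10.0, 20.0])

# Concatenation keeps every coordinate, in the given order
h = torch.cat([u, v], dim=-1)
print(h, h.shape)

# With batches, we concatenate along the last (feature) axis,
# so the batch axis must have the same size in all inputs
U = torch.randn(4, 3)
V = torch.randn(4, 2)
H = torch.cat([U, V], dim=-1)
print(H.shape)
```

Using `dim=-1` means the same line of code works for single vectors and for batches of vectors.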


There are a few things to consider about concatenation. It does not require any trainable parameters, and can be applied to any number of input vectors. However, the more inputs we have, the more outputs we have. If in a certain context we need to deal with variable-length inputs, we would then have to deal with variable-length outputs, which is sometimes not possible. 
Besides, and perhaps most importantly, while not discarding any information, concatenation is unable to create new features (e.g., by letting input features interact). We address this limitation next.


<a name='ungraded-8'></a> **Ungraded Exercise 8 - Concatenation**

Concatenate the BoW encoding of the documents in the toy batch with their average embedding.


<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

bow = sum_pooling(F.one_hot(toy_batch, toy_vocab_size), toy_batch != vocabulary.idx("-PAD-"))
avgemb = average_pooling(toy_emb(toy_batch), toy_batch != vocabulary.idx("-PAD-"))
bow_avg = torch.cat([bow, avgemb], -1)
assert bow_avg.shape[-1] == toy_vocab_size + toy_emb_dim
print(bow_avg)

```

---
    
</details>      


<a name='sec:Feed_forward_network'></a>
## Feed forward network


The simplest feed-forward network (FFNN) combines two linear transformations with a non-linear activation function in between. 
The input features _interact_ forming the intermediate ("hidden") features.
It is also possible to make an FFNN _deeper_, by stacking additional linear transformations (again, with nonlinearities in between).

This is the simplest example:
\begin{align}
    \mathbf h &= a(\mathrm{linear}_H(\mathbf u; \theta_{\text{hid}})) \\
    \mathbf v &= \mathrm{linear}_O(\mathbf h; \theta_{\text{out}}) 
\end{align}
where $a(\cdot)$ is an elementwise nonlinearity (e.g., an activation function), typically $\tanh$ or $\mathrm{relu}$, $\theta_{\text{hid}}$ are the parameters of the first input-to-hidden layer (a weight matrix and a bias vector) and  $\theta_{\text{out}}$ are the parameters of the second hidden-to-output layer (another weight matrix and another bias vector). See that instead of writing down the parameters explicitly, we named them (using $\theta$ and a suggestive subscript); this is a convenient notation shortcut.
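
To connect the two equations above to code, here is a minimal sketch that writes out both affine transformations by hand and compares the result against torch's own layers. The sizes `I`, `H`, `O` and the random input `u` are illustrative assumptions; the activation is $\tanh$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
I, H, O = 5, 7, 3  # input, hidden and output sizes (arbitrary choices)

ffnn = nn.Sequential(nn.Linear(I, H), nn.Tanh(), nn.Linear(H, O))

# theta_hid and theta_out: each is a weight matrix and a bias vector
W_hid, b_hid = ffnn[0].weight, ffnn[0].bias  # shapes [H, I] and [H]
W_out, b_out = ffnn[2].weight, ffnn[2].bias  # shapes [O, H] and [O]

u = torch.randn(2, I)  # a batch of 2 input vectors

# h = a(linear_H(u; theta_hid)), then v = linear_O(h; theta_out)
h = torch.tanh(u @ W_hid.T + b_hid)
v = h @ W_out.T + b_out

print(torch.allclose(v, ffnn(u), atol=1e-6))
```

Note that `nn.Linear` stores its weight with shape `[out_features, in_features]`, which is why the manual version multiplies by the transpose.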

Here's a visual depiction of an FFNN


<img src="https://raw.githubusercontent.com/probabll/ntmi-tutorials/main/img/example-ffnn.png" width="250" />

A nice way to prescribe FFNNs in torch is to use the so-called `Sequential` API, which stacks transformations.

In the following example, we concatenate two views of each document: the bag-of-words encoding and the average token embedding.

We then map each such vector to 7 ReLU hidden units and then to 3 output units.

In [None]:
hidden_size = 7
output_size = 3
toy_ffnn = nn.Sequential(
    nn.Linear(toy_vocab_size + toy_emb_dim, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, output_size)
)
assert num_parameters(toy_ffnn) == ((toy_vocab_size + toy_emb_dim)*hidden_size + hidden_size + hidden_size*output_size + output_size)

As expected, an FFNN accepts batched inputs:

In [None]:
bow = sum_pooling(F.one_hot(toy_batch, toy_vocab_size), toy_batch != vocabulary.idx("-PAD-"))
avgemb = average_pooling(toy_emb(toy_batch), toy_batch != vocabulary.idx("-PAD-"))
bow_avg = torch.cat([bow, avgemb], -1)
assert toy_ffnn(bow_avg).shape == (bow_avg.shape[0], output_size)
toy_ffnn(bow_avg)

<a name='ungraded-9'></a> **Ungraded Exercise 9 - FFNN**

Use the Sequential API to specify an FFNN similar to the example above, but with 2 hidden layers: the first with 7 units and the second with 13 units. Use ReLU for the first hidden layer and Tanh for the second hidden layer. The output of your FFNN should have 3 units, like in the example above. Inspect the number of parameters of the network. Then, test it on the concatenation of bow and avg encodings, as we did above. Explain why the output shape is the same as in the example above.


<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

hidden_sizes = [7, 13]
output_size = 3
toy_ffnn2 = nn.Sequential(
    nn.Linear(toy_vocab_size + toy_emb_dim, hidden_sizes[0]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[0], hidden_sizes[1]),
    nn.Tanh(),
    nn.Linear(hidden_sizes[1], output_size)
)
assert num_parameters(toy_ffnn2) == ((toy_vocab_size + toy_emb_dim)*hidden_sizes[0] + hidden_sizes[0] + hidden_sizes[0]*hidden_sizes[1] + hidden_sizes[1] + hidden_sizes[1]*output_size + output_size)

assert toy_ffnn2(bow_avg).shape == (bow_avg.shape[0], output_size)

assert toy_ffnn2(bow_avg).shape == toy_ffnn(bow_avg).shape

# The output shape is the same as before 
# because the shape is determined by the output_size 
# and not by the number of hidden layers
toy_ffnn2(bow_avg)

```

---
    
</details>      


<a name='sec:Recurrent_neural_network'></a>
## Recurrent neural network 

A recurrent neural network (RNN) uses one or more feed-forward networks inside a for-loop to iterate over the steps of a sequence. At each step, the RNN combines a _recurrent_ state and the input at that step using an FFNN. Rather than using a different FFNN per step, the RNN reuses the same FFNN at every step. Suppose we have a sequence of vectors $\mathbf e_{1:l}$, all of which are $I$-dimensional; for example, these could be the token embeddings for a document $w_{1:l}$.

The RNN has an initial state $\mathbf u_0$, which is an $H$-dimensional vector of trainable parameters. Then, at each step $i \in [l]$ of the sequence, the RNN composes the preceding recurrent state $\mathbf u_{i-1}$ with the current input $\mathbf e_i$. This combination produces a new value $\mathbf u_i$ for the recurrent state, which will be used in the next step. See the figure below for an illustration.

<img src="https://raw.githubusercontent.com/probabll/ntmi-tutorials/main/img/example-rnn.png" width="600" />

We can describe the computation as follows:
\begin{align}
\mathbf u_i &= \mathrm{rnnstep}_H(\mathbf u_{i-1}, \mathbf e_{i}; \theta) \quad \text{for }i \in [l].
\end{align}

The simplest implementation of the block _rnnstep_ looks like this:
\begin{align}
\mathrm{rnnstep}_H(\mathbf u_{i-1}, \mathbf e_{i}; \theta) &= \tanh(\mathbf R \mathbf u_{i-1} + \mathbf W \mathbf e_{i}) 
\end{align}
where the trainable parameters $\theta = \{\mathbf u_0 \in \mathbb R^H, \mathbf R \in \mathbb R^{H\times H}, \mathbf W \in \mathbb R^{H\times I}\}$ are the initial state and two matrices that project the recurrent state and the current input to size $H$.
This is even simpler than employing a full FFNN, since we only have a single hidden layer.


If you pay close attention to the illustration, or to the formulae, you will see that the recurrent state at any one position $i$ can potentially store information from any of the inputs $\mathbf e_1, \ldots, \mathbf e_{i}$. Besides, due to the nonlinearity of the _rnnstep_ function, the state $\mathbf u_i$ is sensitive to the order in which the inputs are presented (that is, if we presented inputs in a different order, the numerical values of the coordinates of $\mathbf u_i$ might differ). This is an important aspect of a feature function for natural language processing, given that natural languages encode a lot of information in word order.

This form of RNN is also called an RNN _encoder_, in allusion to the fact that $\mathbf u_i$ encodes the vector sequence $\mathbf e_{1:i}$. In written form, a call to an RNN encoder can be denoted even more compactly as 
\begin{align}
\mathbf u_{1:l} &= \mathrm{rnnenc}_H(\mathbf e_{1:l}; \theta) ~.
\end{align}
This function takes a sequence of vectors as input and returns a sequence of $H$-dimensional vectors, each encoding a longer subsequence of the input sequence. The last vector $\mathbf u_l$ of the output sequence has the potential to store information about the entire input sequence (in its given order).
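
The for-loop view of the RNN encoder can be sketched in a few lines. This is a minimal illustration of the simple $\tanh$ step described above, not torch's `nn.RNN`; the sizes, the zero initial state `u0`, and the randomly drawn matrices `R` and `W` are assumptions made only for this example.

```python
import torch

torch.manual_seed(0)
I, H, l = 4, 3, 5  # input size, state size, sequence length (arbitrary)

# Parameters theta = {u0, R, W}; here they are random and untrained
u0 = torch.zeros(H)
R = 0.5 * torch.randn(H, H)
W = 0.5 * torch.randn(H, I)

def rnnstep(u_prev, e_i):
    """One step: combine the previous state with the current input."""
    return torch.tanh(R @ u_prev + W @ e_i)

def rnnenc(e_seq):
    """Run rnnstep over a sequence [l, I] and return all states [l, H]."""
    u = u0
    states = []
    for e_i in e_seq:
        u = rnnstep(u, e_i)  # the same parameters are reused at every step
        states.append(u)
    return torch.stack(states)

e = torch.randn(l, I)
u = rnnenc(e)

# The state at step 3 only depends on the first 3 inputs
prefix_ok = torch.allclose(rnnenc(e[:3])[-1], u[2])

# Presenting the same inputs in reversed order changes the final state
u_rev = rnnenc(torch.flip(e, dims=[0]))
order_matters = not torch.allclose(u[-1], u_rev[-1])

print(u.shape, prefix_ok, order_matters)
```

The contrast with sum or average pooling is that those give the same result for any ordering of the inputs, while the final RNN state generally does not.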




<a name='sec:LSTM'></a>
### LSTM


The simplest RNN suffers from certain instability issues (it struggles to retain information from long sequences and it leads to numerical problems in optimisation). 
A modern RNN-type architecture that largely mitigates these problems is the [Long Short-Term Memory](https://arxiv.org/pdf/1503.04069.pdf) (LSTM for short). [It's already implemented for us in torch](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html).
You do not need to study the LSTM paper for the applications in this course; what you need to know is explained in this notebook. For completeness, we briefly explain the internal design of the LSTM here, but this level of detail is much more than what we need in this course. 
The choice of letters we use in this part is internal to the LSTM, and these letters are not to be confused with letters used in other contexts.

For a step $t$, let $\mathbf e_t$ be an $I$-dimensional input to an LSTM (e.g., this may be an embedding for the token $w_t$ in a document).

At this point, the memory of an LSTM is made of two vectors, the _cell vector_ $\mathbf c_{t-1}$ and the _hidden state_ $\mathbf h_{t-1}$, each of which is $K$-dimensional. When we process the input $\mathbf e_t$ with an LSTM, these two vectors are updated as shown below:
\begin{align}
    \mathbf i_t &=\mathrm{sigmoid}(\mathrm{linear}_K(\mathbf e_t; \theta_1) + \mathrm{linear}_K(\mathbf h_{t-1}; \theta_2))\\
    \mathbf f_t &=\mathrm{sigmoid}(\mathrm{linear}_K(\mathbf e_t; \theta_3) + \mathrm{linear}_K(\mathbf h_{t-1}; \theta_4))\\
    \mathbf g_t &=\tanh(\mathrm{linear}_K(\mathbf e_t; \theta_5) + \mathrm{linear}_K(\mathbf h_{t-1}; \theta_6))\\
    \mathbf o_t&=\mathrm{sigmoid}(\mathrm{linear}_K(\mathbf e_t; \theta_7) + \mathrm{linear}_K(\mathbf h_{t-1}; \theta_8))\\
    \mathbf c_t &= \mathbf f_t \odot \mathbf c_{t-1} + \mathbf i_t \odot \mathbf g_t \\
    \mathbf h_t &= \mathbf o_t \odot \tanh(\mathbf c_t)
\end{align}
The first four steps compute the following using the input and the hidden state: the _input gate_ $\mathbf i_t$, then the _forget gate_ $\mathbf f_t$, the _draft cell_ $\mathbf g_t$, and the _output gate_ $\mathbf o_t$. 
These are all $K$-dimensional, and the linear transformations all have their own parameters (there are 8 such affine transformations in total; they map either from $I$ dimensions to $K$ dimensions, or from $K$ dimensions to $K$ dimensions, and each has its own bias vector). The last couple of steps finally update the cell and the hidden state by combining the intermediate gates and draft cell. The symbol $\odot$ denotes elementwise multiplication. After the update the LSTM memory is made of two states $\mathbf c_t$ and $\mathbf h_t$, each $K$-dimensional. The torch implementation gives us access to both of them, and we will see later how to use it.
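
To make the six update equations concrete, here is a sketch of one LSTM step written by hand and checked against torch's `nn.LSTMCell` (a single-step module that is not otherwise used in this notebook). It relies on the fact that torch stacks the four input-side and the four state-side affine transformations into two matrices, with the blocks ordered as input gate, forget gate, draft cell, output gate. The sizes and random inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
I, K = 4, 3  # input size and state size (arbitrary)
cell = nn.LSTMCell(I, K)

def manual_lstm_step(e_t, h_prev, c_prev):
    """One LSTM update, following the equations for i, f, g, o, c and h."""
    # All 8 affine transformations at once: 4 applied to e_t, 4 applied to h_prev
    pre = e_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
    pre_i, pre_f, pre_g, pre_o = pre.chunk(4, dim=-1)
    i_t = torch.sigmoid(pre_i)  # input gate
    f_t = torch.sigmoid(pre_f)  # forget gate
    g_t = torch.tanh(pre_g)     # draft cell
    o_t = torch.sigmoid(pre_o)  # output gate
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t

e_t = torch.randn(2, I)     # a batch of 2 inputs
h_prev = torch.randn(2, K)
c_prev = torch.randn(2, K)

h_t, c_t = manual_lstm_step(e_t, h_prev, c_prev)
h_ref, c_ref = cell(e_t, (h_prev, c_prev))
print(torch.allclose(h_t, h_ref, atol=1e-6), torch.allclose(c_t, c_ref, atol=1e-6))

# 4 maps from I to K and 4 maps from K to K, each with a K-dimensional bias
n_params = sum(p.numel() for p in cell.parameters())
print(n_params, 4 * (I * K + K) + 4 * (K * K + K))
```

The parameter count at the end follows directly from the description above: eight affine transformations, four of size $I \to K$ and four of size $K \to K$.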


Typically, we regard the cell states $\mathbf c_{1:l}$ as something internal to the LSTM and we rarely need to use them for anything outside of it. It is the hidden state $\mathbf h_i$ at each step that we normally want to use in applications (e.g., as a representation of the sequence $\mathbf e_{1:i}$).
Hence, in this notebook, it is sufficient to think of the LSTM encoder as a function 
\begin{align}
\mathbf h_{1:l} &= \mathrm{lstm}_K(\mathbf e_{1:l}; \theta) 
\end{align}
that returns a sequence $\mathbf h_{1:l}$ of hidden states encoding the input sequence $\mathbf e_{1:l}$. This also shows that an LSTM is indeed just a special type of RNN.

In [None]:
toy_hidden_size = 6
toy_lstm = nn.LSTM(
    input_size=toy_emb_dim, # size of the vectors in the input sequence
    hidden_size=toy_hidden_size, # size of the recurrent cell
    num_layers=1,
    batch_first=True,  # this is important: it tells nn.LSTM that the first axis of our tensors is the batch axis
    bidirectional=False,  # we will explain this argument in the next example
)
toy_lstm

In [None]:
num_parameters(toy_lstm)

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

In [None]:
# [batch_size, max_len, emb_dim]
e = toy_emb(toy_batch)
# [batch_size, max_len, hidden_dim]
# internally, the LSTM maintains two vectors in the memory
# the forward method of the LSTM class will return 
# a tensor which has the sequence of so called hidden states (this is usually what you want to use in a text encoder)
# and tuple of tensors that can be used in case you need access to the internal 
# mechanism of the LSTM cell

# For convenience, torch provides two auxiliary functions that help us deal with 
# batches of sequences that may differ in length
# First, we pack the padded sequences in a special way using `pack_padded_sequence`
#  for this to work correctly, torch needs to know the length of the sequences (discounting padding)
# [batch_size]
lengths = (toy_batch != vocabulary.idx("-PAD-")).long().sum(-1)
packed_seqs = pack_padded_sequence(e, lengths.cpu(), batch_first=True, enforce_sorted=False)        
# it's important to tell torch that the first axis of our tensors is for the batch (with `batch_first=True`)
# it's also important to tell torch that our sequences are _not_ sorted by length (with `enforce_sorted=False`)

# Next, we run the LSTM on packed sequences,
#  this returns, for every sequence, the states for all steps and a tuple (final state, final cell)
h, (last_h, last_c) = toy_lstm(packed_seqs)

# Finally, before going ahead and using `h`, we call `pad_packed_sequence`
#  this returns the tensor of states with padding positions zeroed out
h, _ = pad_packed_sequence(h, batch_first=True)

assert h.shape == toy_batch.shape + (toy_hidden_size,)

In [None]:
print(h)  # see how the state information is zeroed out for the -PAD- positions
# this is because we used torch's helper functions (pack_padded_sequence and pad_packed_sequence)

In [None]:
print(last_h) # this tensor, returned by the LSTM class gives us convenient access to the last state of each sequence

If we wanted to combine the LSTM outputs into a single vector, we could use average pooling, for example. This does not destroy the information about word order, since that information is already encoded in the LSTM outputs.
Alternatively, we could use the last state of the sequence. This is a practical choice, and there's no theory to support one option or the other.
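
Both options can be sketched on a self-contained toy example. The tensors `emb` and `lengths` below are made-up stand-ins for `toy_emb(toy_batch)` and the sequence lengths, and the masked average is written out by hand rather than with the notebook's `average_pooling` helper.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
batch_size, max_len, emb_dim, hidden = 3, 5, 4, 6  # arbitrary sizes
emb = torch.randn(batch_size, max_len, emb_dim)    # stand-in for token embeddings
lengths = torch.tensor([5, 2, 3])                  # true length of each sequence

lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
packed = pack_padded_sequence(emb, lengths, batch_first=True, enforce_sorted=False)
out, (last_h, last_c) = lstm(packed)
out, _ = pad_packed_sequence(out, batch_first=True)  # [batch_size, max_len, hidden]

# Option 1: average the states over the non-padding positions
mask = torch.arange(max_len)[None, :] < lengths[:, None]  # [batch_size, max_len]
avg = (out * mask[..., None]).sum(dim=1) / lengths[:, None]

# Option 2: take the state at the last non-padding position of each sequence
last = out[torch.arange(batch_size), lengths - 1]

print(avg.shape, last.shape)
print(torch.allclose(last, last_h[0], atol=1e-6))
```

For a unidirectional, single-layer LSTM run on packed sequences, the second option coincides with `last_h[0]`, so in practice we can read the last states directly from the LSTM's return value.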

<a name='sec:Bidirectional_RNN_encoder'></a>
### Bidirectional RNN encoder

When we encode a document, for example, in text classification, we have the entire document available to us and we are by no means constrained to processing the words from left-to-right. Why not also encode it from right-to-left, for example? 

Even better, why not do both? This way, whenever we look at a given position $i$, we can obtain information from its left (i.e., from $\mathbf e_1, \ldots, \mathbf e_i$) and from its right (i.e., from $\mathbf e_i, \ldots, \mathbf e_l$). This gives us a fully contextualised view of the token that sits at the $i$th position of a document.

An RNN cell, by design, makes computations in a single direction (e.g., left-to-right), but we can use 2 different RNN cells, one that reads the sequence in one order and another that reads the sequence in reversed order.

We don't need to invent a new RNN cell for this, we can simply reverse the inputs to a standard RNN cell:
\begin{align}
\mathbf r_{1:l} &= \mathrm{reverse}(\mathbf e_{1:l}) \\
\mathbf v_i &= \mathrm{rnnstep}_K(\mathbf v_{i-1}, \mathbf r_{i}; \theta_{\text{renc}}) ~.
\end{align}
Because the inputs have been reversed, the state $\mathbf v_i$ encodes the last $i$ inputs of the original sequence, read from right to left. For example, in a sentence of length $l=10$, $\mathbf v_2$ knows about $w_{9}$ through $\mathbf r_2$ and about $w_{>9}$ through $\mathbf v_{1}$. The last state $\mathbf v_l$ has information about the entire document $w_{1:l}$, but processed in reversed order.

Therefore a reversed RNN encoder can be denoted compactly as follows:
\begin{align}
\mathbf v_{1:l} &= \mathrm{rnnenc}_K(\mathrm{reverse}(\mathbf e_{1:l}); \theta_{\text{renc}})
\end{align}


Note that we named the parameter sets differently: $\theta_{\text{enc}}$ for the first RNN cell, and $\theta_{\text{renc}}$ for the second one. That's because we indeed want to have two different sets of parameters. If we used the same set of parameters for both directions, that probably would not work very well, as reading in one direction and reading in the other are conceptually two different operations.

The **bidirectional RNN encoder** is our preferred text encoder. It can be denoted as follows:
\begin{align}
\mathbf o_{1:l} &= \mathrm{birnn}_{2K}(\mathbf e_{1:l}; \theta_{\text{enc}} \cup \theta_{\text{renc}})
\end{align}
and here are the operations that it performs:
\begin{align}
\mathbf u_{1:l} &= \mathrm{rnnenc}_K(\mathbf e_{1:l}; \theta_{\text{enc}})\\
\mathbf v_{1:l} &= \mathrm{rnnenc}_K(\mathrm{reverse}(\mathbf e_{1:l}); \theta_{\text{renc}})\\
\mathbf o_{i} &= \mathrm{concat}(\mathbf u_i, \mathbf v_{l-i+1}) & \text{for }i \in \{1, \ldots, l\}
\end{align}

Its outputs are $2K$-dimensional because after processing the sequence from left-to-right with the first RNN encoder and from right-to-left with the second RNN encoder, it then concatenates the two views of the process in such a way that $\mathbf o_i$ has information about $w_i$, $w_{<i}$ and $w_{>i}$.
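
The three operations above can be reproduced with two ordinary (unidirectional) LSTMs and compared against torch's bidirectional implementation. This is a sketch under a few assumptions: all sequences in the batch have the same length (so no padding is involved), the sizes are arbitrary, and the two helper LSTMs `fwd` and `bwd` simply copy the parameters that `nn.LSTM` stores for each direction (the backward ones carry the suffix `_reverse`).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
I, K, l = 4, 3, 5  # input size, state size per direction, sequence length

bi = nn.LSTM(I, K, batch_first=True, bidirectional=True)
fwd = nn.LSTM(I, K, batch_first=True)  # will play the role of theta_enc
bwd = nn.LSTM(I, K, batch_first=True)  # will play the role of theta_renc

with torch.no_grad():
    # Copy each direction's parameters out of the bidirectional LSTM
    for name in ["weight_ih_l0", "weight_hh_l0", "bias_ih_l0", "bias_hh_l0"]:
        getattr(fwd, name).copy_(getattr(bi, name))
        getattr(bwd, name).copy_(getattr(bi, name + "_reverse"))

    e = torch.randn(2, l, I)  # [batch_size, l, I], no padding

    u, _ = fwd(e)                       # left-to-right states u_1, ..., u_l
    v, _ = bwd(torch.flip(e, dims=[1])) # states over the reversed inputs v_1, ..., v_l
    # o_i = concat(u_i, v_{l-i+1}): flipping v along time aligns it with u
    o_manual = torch.cat([u, torch.flip(v, dims=[1])], dim=-1)

    o, _ = bi(e)

print(o.shape, torch.allclose(o, o_manual, atol=1e-6))
```

With padded batches, a plain `torch.flip` would also move the padding to the front of each sequence, which is one reason to rely on the built-in `bidirectional=True` together with packed sequences.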

See the figure as an illustration of how the two RNN cells can be used to obtain the bidirectional RNN encoder: 

<img src="https://raw.githubusercontent.com/probabll/ntmi-tutorials/main/img/example-birnn.png" alt="drawing" width="600"/>


Luckily, `nn.LSTM` implements all that for us, and we don't really need to worry about reversing anything ourselves. See the example below.

In [None]:
toy_bilstm = nn.LSTM(
    input_size=toy_emb_dim,
    hidden_size=toy_hidden_size,
    num_layers=1,
    batch_first=True,
    bidirectional=True,  # now we employ one LSTM for each direction
)
toy_bilstm

In [None]:
num_parameters(toy_bilstm)

In [None]:
assert num_parameters(toy_bilstm) == 2*num_parameters(toy_lstm), "A BiLSTM is made of two LSTMs"

<a name='ungraded-10'></a> **Ungraded Exercise 10 - BiLSTM**

Use `toy_bilstm` to encode our `toy_batch` (remember that from an API point of view, a BiLSTM is just like an LSTM, hence you should use torch auxiliary functions `pack_padded_sequence` and `pad_packed_sequence`).


<details>
    <summary> <b>Click to see a solution</b> </summary>

If you double-click the cell, you will be able to copy the code:

```python

# [batch_size, max_len, 2*hidden_dim]
# as for the standard LSTM, we only use the first of its outputs, namely, 
# a tensor of states, this time the states are concatenated for two directions, 
# thus they will be twice as large

# As before, we use torch's helper functions to correctly deal with the variable length
# of our padded sequences
packed_seqs = pack_padded_sequence(e, lengths.cpu(), batch_first=True, enforce_sorted=False)        
h2, (last_h2, last_c2) = toy_bilstm(packed_seqs)
h2, _ = pad_packed_sequence(h2, batch_first=True)
assert h2.shape == toy_batch.shape + (2*toy_hidden_size,)  # BiLSTM outputs are the concatenation of 2 LSTM outputs

print("h shape:", h2.shape) # the shape here is as expected [batch_size, max_len, 2*hidden_dim]
print("last shape:", last_h2.shape)  # the shape here is a bit different: [num_layers * num_directions, batch_size, hidden_dim]
# we have 1 layer and 2 directions, hence the first axis has size 2

# If we wanted to use the final states we would have to concatenate
#  the states from different directions
# We first swap the first two axes using `permute`, obtaining [batch_size, num_directions, hidden_dim],
#  and then flatten the last two axes (num_directions, hidden_dim)
#  obtaining output shape [batch_size, num_directions * hidden_dim]
print("last states reshaped and concatenated:", torch.flatten(torch.permute(last_h2, (1, 0, 2)), 1, 2).shape)


```

---
    
</details>      


# What Next?

In T4 you will experiment using these blocks to develop text encoders for a text classifier. 
There, you will have training data (labelled documents) which you can use to estimate the parameters of the NN blocks (which in this notebook were never trained, they were only initialised at random). 

In T4, you will also learn some important tricks needed for effective optimisation of NN blocks, such as regularisation techniques. 


