# Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are another special type of neural network.
RNNs are mainly used for sequences, i.e. data whose elements are arranged in a fixed order. In these cases, the order of the individual elements is often crucial for the interpretation of the whole sequence.

Language is the classic example, because the order of the words in a sentence influences how the individual words are interpreted.

Example:

> I am not a fan of this movie.


The word "*fan*" has a positive connotation, but the "*not*" before it reverses the interpretation. That is, the word "*fan*" must be interpreted in the context of the whole sentence.
However, RNNs can also be used in chemistry and pharmacy. For example, SMILES strings or protein sequences are suitable inputs for RNNs.

In SMILES, parentheses `()` mark branches and therefore strongly influence how the individual parts of the string are interpreted:

`CCCC`|`CC(C)C`
------|--------
<img src="Img/rnn/mol1.png" width="200"/> |<img src="Img/rnn/mol2.png" width="200"/> 


The general concept of an RNN is relatively simple:
Word by word (or even character by character) a sentence (or Smiles) is passed through the network. 
The output layer is completely ignored at first, but after a word has passed through the network, the activations of the hidden layer ($h_1$) are stored.

The figure below uses the example sentence "*Hallo Welt*" (Hello World). Here, $h_1$ denotes the activations for the word "*Hallo*".

In the context of RNNs, we also refer to the activations of the hidden layer as **Hidden State**. $h_1$ is the hidden state for the word "*Hallo*".

Next, the second word is passed through the network. We want to calculate $h_2$, but to the activations of the word "Welt" we also add the activations $h_1$. So $h_2$ is a combination of the activations of "Welt", but also of "Hallo". The word "Welt" was interpreted together with the previous word.


<img src="Img/rnn/rnn_1.svg.png" width="200"/> 


If we had a third word, $h_3$ would be calculated from the activations of the third word and $h_2$. And since $h_2$ contains the information of both the second and the first word, both words influence the interpretation of the third word.


<img src="https://miro.medium.com/max/724/1*1U8H9EZiDqfylJU7Im23Ag.gif">
*Source: Michael Phi - An illustrated Guide to Recurrent Neural Networks*

In the GIF, you can see that the influence of the hidden state of "*What*" (black), the first word, decreases as we get closer to the end of the sentence. However, it still has an influence on the interpretation of the last word.

The hidden state of the last part of the sentence ("*?* "), called $O5$ ($h_5$) in the example, is a combination of all previous hidden states and the activations of "*?* ".

<img src="https://ichi.pro/assets/images/max/724/1*yQzlE7JseW32VVU-xlOUvQ.png">


We can use this hidden state as input to another network that makes its prediction based on this last hidden state.

Similar to how a CNN is used to convert an image into a vector, RNNs are used to convert sequences into vectors.


# Data Preparation:

Before we train our RNN, we need to get the data into the right format. Letters and words cannot simply be read by a neural network.
As with the labels from the MNIST dataset (0-9), we can one-hot encode words or, in the case of Smiles, characters.

Suppose we have two smiles:

`smiles = ["CCN=C=O", "NC(=O)CC(=O)O"]`

There are six different symbols in total:
`C`, `N`, `=`, `O`, `(`, `)` 

We can represent a `C` as a vector of length 6, which has a `1` at the first position and otherwise only zeros. We can also represent an `N` as a vector, except that we shift the `1` by one position.

We can do this for all symbols in the smiles:

```python
# One-hot vector for each symbol that occurs in the two SMILES
one_hot = {
    "C": [1,0,0,0,0,0],
    "N": [0,1,0,0,0,0],
    "=": [0,0,1,0,0,0],
    "O": [0,0,0,1,0,0],
    "(": [0,0,0,0,1,0],
    ")": [0,0,0,0,0,1],
}
```
These symbols are often called **tokens**.
Encoding a SMILES string with these rules results in a matrix:

```python
import numpy as np

# "CCN=C=O" encoded token by token: one row per token
encoded = np.array([[1,0,0,0,0,0],   # C
                    [1,0,0,0,0,0],   # C
                    [0,1,0,0,0,0],   # N
                    [0,0,1,0,0,0],   # =
                    [1,0,0,0,0,0],   # C
                    [0,0,1,0,0,0],   # =
                    [0,0,0,1,0,0]])  # O
```

The string `"CCN=C=O"` becomes a matrix in which each row represents one token and each column corresponds to one symbol of the vocabulary. The `1` in a row marks which symbol that token is.

With the following code you can automate this conversion.
Many of the functions used here were written by us. If you are interested in exactly what these functions look like, you can find the code in the file `../utils/utils.py`.

In [2]:
import torch
from torch import nn, optim
from rdkit.Chem import AllChem as Chem
import numpy as np
from sklearn.metrics import roc_auc_score
from torch.utils.data import DataLoader, TensorDataset
%run ../utils/utils.py

In [3]:
smiles = ["CCN=C=O","NC(=O)CC(=O)O"]

First, we need a kind of dictionary that stores all occurring symbols and assigns a number to them. This number also indicates at which position in the one-hot vector the `1` will appear. 

In [4]:
dictionary = create_dict(smiles)
dictionary

{'C': 0, 'O': 1, '=': 2, 'N': 3, ')': 4, '(': 5}

The `C` is assigned the number `0`, the `O` is assigned the number `1`, and so on.

With the function `tokenize()` we can convert the SMILES into sequences of numbers, so that each SMILES string is now represented by numbers.
The function only needs to be told which SMILES to encode and which `dictionary` to use.

In [5]:
tokenized_smiles = tokenize(smiles,dictionary)
tokenized_smiles

[[0, 0, 3, 2, 0, 2, 1], [3, 0, 5, 2, 1, 4, 0, 0, 5, 2, 1, 4, 1]]
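
The helpers `create_dict` and `tokenize` live in `../utils/utils.py` and are not shown here. To give a rough idea of what such helpers might do, here is a minimal sketch. The names `sketch_create_dict` and `sketch_tokenize` and the regular expression are assumptions made for illustration, not the actual implementation. In particular, this sketch assigns ids in order of first appearance, so the ids differ from the output above.

```python
import re

# Hypothetical tokenizer: the two-letter elements "Cl" and "Br" are kept together,
# every other character becomes its own token.
TOKEN_PATTERN = re.compile(r"Cl|Br|.")

def sketch_create_dict(smiles_list):
    """Assign an integer id to every token, in order of first appearance."""
    token_to_id = {}
    for smi in smiles_list:
        for token in TOKEN_PATTERN.findall(smi):
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def sketch_tokenize(smiles_list, token_to_id):
    """Replace every token of every SMILES string by its integer id."""
    return [[token_to_id[t] for t in TOKEN_PATTERN.findall(smi)]
            for smi in smiles_list]

example_smiles = ["CCN=C=O", "NC(=O)CC(=O)O"]
example_dict = sketch_create_dict(example_smiles)
example_tokens = sketch_tokenize(example_smiles, example_dict)
print(example_dict)
print(example_tokens[0])
```

Which id a token receives is arbitrary; what matters is that the same dictionary is used consistently for all SMILES.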

The Smiles are now represented as a simple sequence of numbers.
However, they are still of different lengths.

In [6]:
[len(x) for x in tokenized_smiles]

[7, 13]

The first SMILES consists of 7 tokens, the second of 13. This is a problem, because all sequences in a batch must have the same length to be stored in one tensor. Naturally, SMILES differ in length, because larger molecules have more symbols than smaller ones.
To solve the problem, we *pad* all sequences to the length of the longest SMILES.
To allow for *padding*, we need to add a new token to our dictionary: `"<pad>"`. This token is appended to each SMILES until it has the same length as the longest one.
The `"<pad>"` token tells the network that these positions are not part of the actual SMILES.

In [7]:
max_smiles_length = max([len(x) for x in tokenized_smiles])
max_smiles_length

13

In [8]:
dictionary["<pad>"] = len(dictionary)
dictionary

{'C': 0, 'O': 1, '=': 2, 'N': 3, ')': 4, '(': 5, '<pad>': 6}

Now we have added the token `<pad>` to our dictionary. The last thing we have to do is to append this token to our first SMILES, `tokenized_smiles[0]`, until it is as long as the second one.

In [10]:
num_missing_tokens = max_smiles_length-len(tokenized_smiles[0])
tokenized_smiles[0] += [dictionary["<pad>"]] * num_missing_tokens 
tokenized_smiles[0]

[0, 0, 3, 2, 0, 2, 1, 6, 6, 6, 6, 6, 6]

Now both Smiles have the same length.

In [13]:
[len(x) for x in tokenized_smiles]

[13, 13]
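
The padding step above handles only the first SMILES. For more than two sequences, the same idea can be wrapped in a small helper. This is a sketch; the function name `pad_token_lists` is an assumption and not part of `utils.py`.

```python
def pad_token_lists(token_lists, pad_id):
    """Return copies of all token lists, right-padded with pad_id to equal length."""
    target_length = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_id] * (target_length - len(tokens))
            for tokens in token_lists]

# Token ids taken from the example above; 6 is the id of "<pad>".
unpadded = [[0, 0, 3, 2, 0, 2, 1],
            [3, 0, 5, 2, 1, 4, 0, 0, 5, 2, 1, 4, 1]]
padded = pad_token_lists(unpadded, pad_id=6)
print([len(p) for p in padded])
```

Unlike the in-place `+=` used above, this version returns new lists and leaves the input untouched.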

Now that the SMILES have the same length, we can convert the numbers to one-hot encoded vectors.

In [14]:
vocabulary_length = len(dictionary)
print(vocabulary_length)

7


In total there are 7 symbols in our dictionary.
With the function `token_to_onehot` the `tokenized_smiles` become matrices.

In [15]:
onehot_tokens = token_to_onehot(tokenized_smiles, vocabulary_length)
print(onehot_tokens[0])
      
print(onehot_tokens.shape)

[[1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]]
(2, 13, 7)


`onehot_tokens` is a `np.array` with the dimensions `(2,13,7)` . The first dimension is the number of smiles (`2`). The second dimension is the length of the sequences (`13`). The third dimension is the number of different tokens (`7`).
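
The function `token_to_onehot` is defined in `utils.py` and not shown here. One possible way to implement such a conversion, assuming all token lists already have the same length, is to index the rows of an identity matrix. The name `sketch_token_to_onehot` is hypothetical.

```python
import numpy as np

def sketch_token_to_onehot(token_lists, vocabulary_length):
    """Turn equally long lists of token ids into an array of one-hot rows."""
    token_ids = np.asarray(token_lists)        # shape: (n_smiles, seq_len)
    identity = np.eye(vocabulary_length)       # row i is the one-hot vector of id i
    return identity[token_ids]                 # shape: (n_smiles, seq_len, vocab)

padded_tokens = [[0, 0, 3, 2, 0, 2, 1, 6, 6, 6, 6, 6, 6],
                 [3, 0, 5, 2, 1, 4, 0, 0, 5, 2, 1, 4, 1]]
onehot_sketch = sketch_token_to_onehot(padded_tokens, 7)
print(onehot_sketch.shape)
```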

By itself, our data would now be ready for an RNN. But instead of taking these one-hot encoded vectors as input, we first apply an *Embedding Layer*. 

# Word Embeddings

These one-hot encoded vectors are rarely used directly as input. Before being fed to an RNN, they pass through an embedding layer, which replaces each one-hot encoded vector with a vector of initially random numbers. To better understand what this means, let's first look at an embedding layer.

In [17]:
np.random.seed(1234)
embedding_layer = np.random.rand(7,4)
embedding_layer

array([[0.19151945, 0.62210877, 0.43772774, 0.78535858],
       [0.77997581, 0.27259261, 0.27646426, 0.80187218],
       [0.95813935, 0.87593263, 0.35781727, 0.50099513],
       [0.68346294, 0.71270203, 0.37025075, 0.56119619],
       [0.50308317, 0.01376845, 0.77282662, 0.88264119],
       [0.36488598, 0.61539618, 0.07538124, 0.36882401],
       [0.9331401 , 0.65137814, 0.39720258, 0.78873014]])

An embedding layer consists of a single weight matrix. Initially it contains random numbers. The number of rows corresponds exactly to the number of different tokens in our dictionary, and the number of columns (here `4`) is the embedding size.
An embedding layer simply exchanges all vectors that have a `1` at the first position (`[1,0,0,0,0,0,0]`) with the first row of the `embedding_layer`, which is `embedding_layer[0,:] = [0.19151945, 0.62210877, 0.43772774, 0.78535858]`.

To achieve this, we simply need to multiply the one-hot encoded smiles by the embedding layer:

In [18]:
token_embeddings = np.matmul(onehot_tokens,embedding_layer)
print(token_embeddings[0])

[[0.19151945 0.62210877 0.43772774 0.78535858]
 [0.19151945 0.62210877 0.43772774 0.78535858]
 [0.68346294 0.71270203 0.37025075 0.56119619]
 [0.95813935 0.87593263 0.35781727 0.50099513]
 [0.19151945 0.62210877 0.43772774 0.78535858]
 [0.95813935 0.87593263 0.35781727 0.50099513]
 [0.77997581 0.27259261 0.27646426 0.80187218]
 [0.9331401  0.65137814 0.39720258 0.78873014]
 [0.9331401  0.65137814 0.39720258 0.78873014]
 [0.9331401  0.65137814 0.39720258 0.78873014]
 [0.9331401  0.65137814 0.39720258 0.78873014]
 [0.9331401  0.65137814 0.39720258 0.78873014]
 [0.9331401  0.65137814 0.39720258 0.78873014]]


Above you can see the embeddings of the first SMILES.
Below you can see the first row of the one-hot encoded SMILES.

In [28]:
onehot_tokens[0,0,:]

array([1., 0., 0., 0., 0., 0., 0.])

If you now look at the first row (index `0`) in the weight matrix of the `embedding_layer`, you will notice that this vector has exactly the same values as the first row of `token_embeddings[0]`.

In [27]:
embedding_layer[0,:]

array([0.19151945, 0.62210877, 0.43772774, 0.78535858])

In [24]:
token_embeddings[0,0,:]

array([0.19151945, 0.62210877, 0.43772774, 0.78535858])

Put more simply:
An embedding layer converts one-hot encoded vectors into vectors of (initially random) weights.
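
Because every one-hot vector contains a single `1`, the matrix multiplication only selects rows of the weight matrix. The following sketch shows that plain row indexing gives the same result. The matrix `weights` is a random stand-in, not the `embedding_layer` from above.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.random((7, 4))                 # stand-in embedding matrix: 7 tokens, 4 values each
token_ids = np.array([0, 0, 3, 2, 0, 2, 1])  # "CCN=C=O" with the dictionary from above

onehot = np.eye(7)[token_ids]                # (7, 7) one-hot rows
via_matmul = onehot @ weights                # same operation as np.matmul in the cell above
via_lookup = weights[token_ids]              # plain row indexing

print(np.allclose(via_matmul, via_lookup))
```

This is why an embedding layer can work directly on token ids and never needs the one-hot vectors.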

*But why is this done?*

One advantage is that texts or even SMILES in most cases consist of more than just 7 symbols or words. For example, if we were to encode all the words that appear in a document, these one-hot encoded vectors would become very large. By "embedding" the vectors, we can reduce the size of these input vectors.

More importantly, the weights in the embedding layer can be learned. This means that these weights are updated during backpropagation.
Thus, the embeddings adapt during training. This is convenient because you expect similar words to have similar embeddings after training. For example, the words truck and car are more similar in usage than car and beach. 
If car and truck have similar embeddings, i.e., are described by similar vectors, then they can be processed more easily in the context of a sentence.


> A car drives on the road

> A truck is driving on the road

The two sentences describe two very similar situations, and if the numerical representations are also similar, it is easier for the network to learn.


In the case of smiles, it can be argued that the role of a nitrogen in a molecule is more like that of a carbon than a fluorine. This should also be reflected in the embeddings.


# RNNs

We have now converted the SMILES to the correct format. We just need to convert the `np.array` into a tensor. Note that we also use the `.permute()` function, which swaps the dimensions of a tensor. This is necessary because, by default, PyTorch expects the input tensor of an RNN to be arranged as follows:
`[length of smile, number of smiles, embedding size]`.

In [29]:
token_embeddings_tensor = torch.tensor(token_embeddings, dtype= torch.float).permute(1,0,2)
token_embeddings_tensor.shape

torch.Size([13, 2, 4])

The tensor `token_embeddings_tensor` has the above dimensions. Each smile consists of `13` tokens, our batch consists of `2` smiles and each token is described by `4` values (from the embedding layer). 
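
As a side note, `nn.RNN` also accepts the argument `batch_first=True`, in which case the input is expected as `[number of smiles, length of smile, embedding size]` and no `.permute()` is needed. This notebook keeps the default layout; the sketch below only illustrates that both variants give the same hidden states when they share the same weights. The random `batch` is a stand-in for the embeddings.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch = torch.rand(2, 13, 4)                       # (number of smiles, length, embedding size)

rnn_seq_first = nn.RNN(4, 10)                      # expects (length, batch, embedding size)
rnn_batch_first = nn.RNN(4, 10, batch_first=True)  # expects (batch, length, embedding size)
rnn_batch_first.load_state_dict(rnn_seq_first.state_dict())  # same weights in both

out_a, h_a = rnn_seq_first(batch.permute(1, 0, 2))
out_b, h_b = rnn_batch_first(batch)

print(out_a.shape, out_b.shape)
```

Note that `batch_first` only changes the layout of the input and of the first output; the final hidden state keeps the shape `[1, number of smiles, hidden size]` in both cases.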

Now we can define an RNN. As usual, there is an RNN class in the `torch.nn` module.
As always, we have to be careful when defining the dimensions of the RNN. The first argument is the size of the input vectors, that is, the embedding size (`4`). The second argument specifies the size of the hidden layer and thus also the size of the hidden state vectors.


In [32]:
torch.manual_seed(1234)
rnn = nn.RNN(4,10)

You can now simply pass the `token_embeddings_tensor` through the `rnn`.

In [33]:
output_rnn = rnn(token_embeddings_tensor)
len(output_rnn)

2

The output of the RNN (`output_rnn`) is a tuple of length two.
We first look at the first element of the output.

In [34]:
print(output_rnn[0])

tensor([[[-5.1854e-01,  3.2573e-01,  8.8806e-02,  3.2631e-01,  2.4860e-02,
          -2.3541e-01,  1.3345e-01,  2.1532e-01,  4.3254e-01, -1.3007e-02],
         [-6.0268e-01,  2.3149e-01,  1.1194e-01,  4.0218e-01, -5.5046e-02,
          -2.2365e-01,  2.9007e-01,  2.9465e-01,  4.7601e-01,  1.2947e-01]],

        [[-5.4682e-01,  3.2731e-01,  4.1438e-02,  4.7448e-01, -9.1607e-02,
          -1.4660e-01, -9.0391e-02,  1.3491e-01,  5.3045e-01, -1.9623e-01],
         [-5.0177e-01,  2.3414e-01,  2.3930e-02,  4.4177e-01, -6.7175e-02,
          -1.7767e-01, -1.7725e-01,  1.1185e-01,  4.8777e-01, -1.7003e-01]],

        [[-5.8859e-01,  2.7894e-01, -7.3078e-03,  6.2402e-01, -1.6086e-01,
          -1.6881e-01,  6.8062e-02,  2.3859e-01,  6.2905e-01, -1.3412e-01],
         [-4.5562e-01,  4.3098e-01, -1.1293e-01,  5.9627e-01, -1.3256e-01,
          -1.3781e-01, -4.3761e-02,  2.0347e-01,  5.7765e-01, -2.3168e-01]],

        [[-6.2021e-01,  1.5627e-01, -3.7084e-02,  6.8411e-01, -1.8842e-01,
          -1.

In [None]:
print(output_rnn[0].shape)

The output `output_rnn[0]` has the dimensions `[13, 2, 10]`. The only thing that has changed compared to the input is the last dimension. Instead of the dimension `4` it is now `10`. 

In fact, the first part of the RNN output contains the Hidden States of each token in the Smiles.

Think back to the GIF:
<img src="https://miro.medium.com/max/724/1*1U8H9EZiDqfylJU7Im23Ag.gif">
*Source: Michael Phi - An illustrated Guide to Recurrent Neural Networks*

In the GIF, `output_rnn[0]` corresponds to $O1$ to $O5$. Since our sequences have length 13, `output_rnn[0]` contains 13 hidden states per SMILES.

But what does `output_rnn[1]` contain?

In [35]:
output_rnn[1]

tensor([[[-0.5940,  0.0767, -0.0585,  0.7042, -0.1930, -0.1573,  0.0256,
           0.2622,  0.5856, -0.1866],
         [-0.5326,  0.0952, -0.0637,  0.6259, -0.2566, -0.0808, -0.1455,
           0.2833,  0.6471, -0.2102]]], grad_fn=<StackBackward>)

In [36]:
output_rnn[1].shape

torch.Size([1, 2, 10])

`output_rnn[1]` contains ONLY the last hidden state. In the GIF this is $O5$, for us it would be $O13$. This hidden state describes (theoretically) the complete sequence and is therefore very important.

We can also check that `output_rnn[0][-1]` and `output_rnn[1][0]` are identical:

In [38]:
print(output_rnn[0][-1])
output_rnn[1][0]

tensor([[-0.5940,  0.0767, -0.0585,  0.7042, -0.1930, -0.1573,  0.0256,  0.2622,
          0.5856, -0.1866],
        [-0.5326,  0.0952, -0.0637,  0.6259, -0.2566, -0.0808, -0.1455,  0.2833,
          0.6471, -0.2102]], grad_fn=<SelectBackward>)


tensor([[-0.5940,  0.0767, -0.0585,  0.7042, -0.1930, -0.1573,  0.0256,  0.2622,
          0.5856, -0.1866],
        [-0.5326,  0.0952, -0.0637,  0.6259, -0.2566, -0.0808, -0.1455,  0.2833,
          0.6471, -0.2102]], grad_fn=<SelectBackward>)
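
Instead of comparing the printed numbers by eye, the two outputs can be unpacked and compared with `torch.allclose`. The sketch below uses a freshly created RNN and random input as stand-ins, so the numbers differ from the ones above.

```python
import torch
from torch import nn

torch.manual_seed(0)
demo_rnn = nn.RNN(4, 10)
demo_input = torch.rand(13, 2, 4)            # (length, batch, embedding size)

# Unpack the tuple returned by the RNN
all_hidden_states, last_hidden_state = demo_rnn(demo_input)

# The last time step of the first output equals the second output
print(torch.allclose(all_hidden_states[-1], last_hidden_state[0]))
```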


To understand in more detail what is happening, we will write an RNN ourselves.


Suppose we have a sentence `sentence = ["Hello", "World"]`. We have this stored as two words in a list. 

We also define two simple linear layers. One maps the input from embedding size `4` to `10` dimensions. The other layer maps from `10` to `10` dimensions.

Through the first layer we send the first word `sentence[0]` and store the hidden state in `output_1`.


```python
sentence = ["Hello", "World"]

lin_1 = nn.Linear(4,10) 

lin_2 = nn.Linear(10,10)

# Schematic: each word stands for its 4-dimensional embedding vector
output_1 = lin_1(sentence[0])
```

Next, we also pass the second word `"World"` through the `lin_1`. But afterwards we also add the `lin_2(output_1)` to it. 

```python
sentence = ["Hello", "World"]

lin_1 = nn.Linear(4,10) 

lin_2 = nn.Linear(10,10)

output_1 = lin_1(sentence[0])

output_2 = lin_1(sentence[1]) + lin_2(output_1)
```

That is, the hidden state `output_2` is not determined by the word `"World"` alone, but the hidden state before it also has an influence. In fact, we also add a non-linear activation function. In RNNs, a Tanh function is used by default instead of a ReLU function.

```python
sentence = ["Hello", "World"]

lin_1 = nn.Linear(4,10) 

lin_2 = nn.Linear(10,10)

output_1 = torch.tanh(lin_1(sentence[0]))

output_2 = torch.tanh(lin_1(sentence[1]) + lin_2(output_1))
```

If we had a third word in the sentence (`sentence[2]`), then this step would repeat. This time we add not `output_1`  but `output_2`:

```python
sentence = ["Hello", "World", "Dude"]

lin_1 = nn.Linear(4,10) 

lin_2 = nn.Linear(10,10)

output_1 = torch.tanh(lin_1(sentence[0]))

output_2 = torch.tanh(lin_1(sentence[1]) + lin_2(output_1))

output_3 = torch.tanh(lin_1(sentence[2]) + lin_2(output_2))

```
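
The schematic code above cannot be executed as written, because strings cannot be passed through a linear layer. The following sketch makes the same idea runnable under two assumptions: each word is represented by a random 4-dimensional vector (a stand-in for a real embedding), and the recurrence starts from a hidden state of zeros.

```python
import torch
from torch import nn

torch.manual_seed(0)
words = ["Hello", "World", "Dude"]
# Stand-in embeddings: one random 4-dimensional vector per word (batch size 1)
embedded = {word: torch.rand(1, 4) for word in words}

lin_1 = nn.Linear(4, 10)     # transforms the current word
lin_2 = nn.Linear(10, 10)    # transforms the previous hidden state

hidden = torch.zeros(1, 10)  # assumed start: a hidden state of zeros
hidden_states = []
for word in words:
    # Same update for every word: current input plus previous hidden state
    hidden = torch.tanh(lin_1(embedded[word]) + lin_2(hidden))
    hidden_states.append(hidden)

print(len(hidden_states), hidden_states[-1].shape)
```

Writing the update as a loop shows that the same two layers are reused for every word; only the hidden state changes from step to step.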


To check whether this is actually how the PyTorch RNN works, we can write our own implementation using the same weights as the PyTorch RNN.
First we store the weights of the PyTorch `rnn`. We can now use these ourselves.
Remember that `nn.Linear()` does nothing other than compute `torch.mm(X,W.t())+b`.

In [40]:
w_1=list(rnn.parameters())[0]
w_2=list(rnn.parameters())[1]
b_1=list(rnn.parameters())[2]
b_2=list(rnn.parameters())[3]

With these weights you can now calculate the contribution of the first token of each SMILES to the hidden state (this corresponds to `lin_1`). The embeddings of the first tokens are located in `token_embeddings_tensor[0]`.

In [None]:
activations_jetzt = torch.mm(token_embeddings_tensor[0],____)+____
activations_jetzt

<details>
    <summary><b>Solution:</b></summary>

```python
activations_jetzt = torch.mm(token_embeddings_tensor[0],w_1.t())+b_1
activations_jetzt
```
</details>

Next we transform the hidden state of the previous token (this corresponds to `lin_2`).
However, at the moment we are at the first word/token, so there is no hidden state of a previous token yet. This detail was omitted in the explanation above: in fact, we start with a hidden state in which all values are zero, `h0 = torch.zeros(2,10)`.

In [None]:
h0 = torch.zeros(2,10)

activations_vorher = torch.mm(___,____)+____

<details>
    <summary><b>Solution:</b></summary>

```python
h0 = torch.zeros(2,10)

activations_vorher = torch.mm(h0,w_2.t())+b_2
```
</details>

In the last step, the two activations are added and a `torch.tanh` activation function is applied.

In [None]:
torch.tanh(___________+_____________)

<details>
    <summary><b>Solution:</b></summary>

```python
torch.tanh(activations_jetzt+activations_vorher)
```
</details>

This is the hidden state for the first token of the smile.
We can also compare this with the hidden state of the `nn.RNN` and see that they are identical.

In [None]:
output_rnn[0][0]

We want to calculate the hidden states not only for the first token, but for all tokens in the smiles. Therefore we need a `for-loop`. 

First we initialize the first hidden state with zeros. And then we write a `for-loop`, which iterates through all 13 tokens.

In [41]:
h0 = torch.zeros(2,10)
for i in range(max_smiles_length):
    activations_jetzt = _____  # When calculating, always make sure to select the i-th element of the inputs
    activations_vorher = ________
    h0 = torch.tanh(activations_jetzt+activations_vorher) # <-- The output is stored as h0, 
h0                                                        #     to use it in the next iteration as the new h0
                                                          #                   


<details>
    <summary><b>Solution:</b></summary>

```python
h0 = torch.zeros(2,10)
for i in range(max_smiles_length):
    activations_jetzt = torch.mm(token_embeddings_tensor[i],w_1.t())+b_1
    activations_vorher = torch.mm(h0,w_2.t())+b_2
    h0 = torch.tanh(activations_jetzt+activations_vorher) 
h0                                                     
```                                                          
</details>


`h0` now contains the final hidden state. Again, we can check if our result is identical to that of PyTorch's `nn.RNN`.

In [None]:
output_rnn[1]
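
The comparison can also be done numerically instead of by eye. The sketch below repeats the manual loop in a self-contained form and checks it against `nn.RNN` with `torch.allclose`. It uses a fresh RNN with random input as a stand-in for `rnn` and `token_embeddings_tensor`, and accesses the weights by their attribute names (`weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0`, `bias_hh_l0`) rather than by their position in `parameters()`.

```python
import torch
from torch import nn

torch.manual_seed(0)
check_rnn = nn.RNN(4, 10)
inputs = torch.rand(13, 2, 4)                 # (length, batch, embedding size)

# Named parameters of a single-layer nn.RNN
w_ih, w_hh = check_rnn.weight_ih_l0, check_rnn.weight_hh_l0
b_ih, b_hh = check_rnn.bias_ih_l0, check_rnn.bias_hh_l0

hidden = torch.zeros(2, 10)
manual_states = []
for step in range(inputs.shape[0]):
    from_input = torch.mm(inputs[step], w_ih.t()) + b_ih   # current token
    from_hidden = torch.mm(hidden, w_hh.t()) + b_hh        # previous hidden state
    hidden = torch.tanh(from_input + from_hidden)
    manual_states.append(hidden)
manual_states = torch.stack(manual_states)    # (13, 2, 10)

reference_states, reference_last = check_rnn(inputs)
print(torch.allclose(manual_states, reference_states, atol=1e-5))
print(torch.allclose(hidden, reference_last[0], atol=1e-5))
```

A small tolerance is used because the two computations may differ slightly in floating-point rounding.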

Of course, it is easier to use the PyTorch class rather than our own implementation. 
But programming it yourself should help you better understand what exactly happens in an RNN.

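Also, the code illustrates the biggest weakness of RNNs: the `for-loop`.
We cannot pass a sentence/SMILES through the network all at once.
Each word/symbol must be passed through the network one at a time, because every hidden state depends on the previous one. The time steps therefore cannot be computed in parallel, which makes RNNs slow on long sequences.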


# PyTorch RNN

PyTorch provides us not only with RNNs, but also with `nn.Embedding` layers. This is handy: for one thing, the embedding weights are updated automatically during backpropagation. In addition, we do not have to create one-hot encoded vectors, because the layer takes the tokenized SMILES `tokenized_smiles` directly as input.

In [42]:
tokenized_smiles

[[0, 0, 3, 2, 0, 2, 1, 6, 6, 6, 6, 6, 6],
 [3, 0, 5, 2, 1, 4, 0, 0, 5, 2, 1, 4, 1]]

In [43]:
emb = nn.Embedding(7,4, padding_idx = dictionary["<pad>"])

Here we defined a PyTorch embedding layer. Its first argument is the number of different symbols/tokens in our dataset, in our case `7`. The second argument specifies the size of the embedding vectors; we will stick with the size `4`. Finally, with `padding_idx` we can tell PyTorch which token, i.e. which number, represents the padding. PyTorch then sets the embedding of this token to zero.

In [44]:
emb(torch.tensor(tokenized_smiles)).shape

torch.Size([2, 13, 4])
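
The effect of `padding_idx` can be checked directly. The following sketch uses a separate layer `demo_emb` (a stand-in for `emb`) and the padded first SMILES from above. It shows that the padding positions receive zero vectors and that the padding row gets no gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)
pad_id = 6
demo_emb = nn.Embedding(7, 4, padding_idx=pad_id)

tokens = torch.tensor([[0, 0, 3, 2, 0, 2, 1, 6, 6, 6, 6, 6, 6]])
vectors = demo_emb(tokens)                    # shape (1, 13, 4)

print(vectors[0, 6])   # embedding of "O": random values
print(vectors[0, 7])   # embedding of "<pad>": all zeros

vectors.sum().backward()
print(demo_emb.weight.grad[pad_id])  # zero gradient, so the padding row is never updated
```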

The output of this embedding layer does not have the correct format yet. We still have to change the dimensions of the tensor with `Permute`. 
We can combine all these steps into an `nn.Sequential()` module. 

*There is no `permute` layer in PyTorch's `nn` module, so we wrote an adapted version that works in `nn.Sequential`. That is why there is no `nn.` in front of `Permute`.*
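
The actual `Permute` class is defined in `utils.py` and not shown here. A layer of this kind might look roughly like the following sketch; the class name `PermuteSketch` is hypothetical and chosen to avoid confusion with the real helper.

```python
import torch
from torch import nn

class PermuteSketch(nn.Module):
    """Hypothetical stand-in for the Permute layer from utils.py."""

    def __init__(self, *dims):
        super().__init__()
        self.dims = dims

    def forward(self, x):
        # Reorder the dimensions, e.g. (batch, length, emb) -> (length, batch, emb)
        return x.permute(*self.dims)

layer = PermuteSketch(1, 0, 2)
print(layer(torch.zeros(2, 13, 4)).shape)
```

Because the layer has no weights, it does not change what the network can learn; it only rearranges the tensor between two other layers.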

In [None]:
model = nn.Sequential(nn.Embedding(7,4, padding_idx = dictionary["<pad>"]),
                     Permute(1,0,2),
                     nn.RNN(4,10))

model

The `tokenized_smiles` can now be passed through the `model`. With `[1][0,:,:]` we can extract the final hidden states in the correct format. We can insert them directly into a linear layer. Since we need to index the output with `[1][0,:,:]`, we cannot use the linear layers directly in the same `nn.Sequential()` model. We need a second model that takes the `output_rnn` as input.

In [None]:
output_rnn = model(torch.tensor(tokenized_smiles))[1][0,:,:]

In [None]:
pred_ll = nn.Sequential(nn.Linear(10,1))

In [None]:
pred_ll(output_rnn)
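
An alternative to indexing the RNN output outside of `nn.Sequential`, not used in this notebook, is to write a small custom `nn.Module` that selects the final hidden state in its `forward` method. The class name `SmilesRNNSketch` and its arguments are assumptions made for illustration; the sketch also uses `batch_first=True` instead of a permute layer.

```python
import torch
from torch import nn

class SmilesRNNSketch(nn.Module):
    """Embedding, RNN and linear layer in one module (illustrative only)."""

    def __init__(self, vocab_size, emb_size, hidden_size, pad_id):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=pad_id)
        # batch_first=True lets the RNN read (batch, length, emb) directly
        self.rnn = nn.RNN(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, token_ids):
        _, last_hidden = self.rnn(self.emb(token_ids))  # (1, batch, hidden)
        return self.out(last_hidden[0])                 # (batch, 1)

torch.manual_seed(0)
sketch_model = SmilesRNNSketch(vocab_size=7, emb_size=4, hidden_size=10, pad_id=6)
tokens = torch.tensor([[0, 0, 3, 2, 0, 2, 1, 6, 6, 6, 6, 6, 6],
                       [3, 0, 5, 2, 1, 4, 0, 0, 5, 2, 1, 4, 1]])
print(sketch_model(tokens).shape)
```

With such a module, a single call to `parameters()` covers all layers, at the cost of writing a class instead of two `nn.Sequential` models.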

There is still a problem with `nn.RNN`. In the GIF you can clearly see that the first words in the sentence have less and less influence the longer the sentence gets. This can become a problem when sentences or SMILES become particularly long, especially if subordinate clauses or, in the case of SMILES, additional branches are inserted. The beginning of the sentence or SMILES may then be "forgotten" by the network.

For this reason, more complex RNN layers are usually used. They allow the network to retain information over longer sequences.

A popular alternative is the Gated Recurrent Unit (GRU). Combining hidden states is much more complex than with "vanilla RNNs", but in PyTorch `nn.RNN` can easily be replaced by `nn.GRU`. Nothing needs to be changed on the rest of the network.

RNN |GRU
------|--------
<img src="https://miro.medium.com/max/332/0*eRJCRsikdGGu8ffA.png" width="200"/> |<img src="https://miro.medium.com/max/700/1*RiOzdOVaaeKrUotY7-1a2A.png" width="300"/> 
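
To illustrate that `nn.GRU` is a drop-in replacement, the following sketch passes the same random stand-in input through both layer types and compares the output shapes and the number of parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)
sequence = torch.rand(13, 2, 4)               # (length, batch, embedding size)

shapes = {}
param_counts = {}
for layer_class in (nn.RNN, nn.GRU):
    layer = layer_class(4, 10)
    all_states, last_state = layer(sequence)
    name = layer_class.__name__
    shapes[name] = (tuple(all_states.shape), tuple(last_state.shape))
    param_counts[name] = sum(p.numel() for p in layer.parameters())
    print(name, shapes[name], param_counts[name])
```

Both layers return tensors of identical shape, so the rest of the network can stay unchanged. The GRU has three times as many parameters, because it learns separate weights for its two gates and its candidate hidden state.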


# Practice Exercise:

In the exercise, we will look at a new dataset. The Blood-Brain Barrier Penetration (BBBP) dataset records for 2000 molecules whether they can diffuse through the blood-brain barrier.

Most drugs and neurotransmitters cannot pass the blood-brain barrier. Being able to cross it is an important property for drugs that are intended to act in the central nervous system. Therefore, accurate prediction of this property is of great interest.
The original dataset was published in 2012. However, we use a slightly modified dataset: all stereochemistry information has already been removed from the SMILES, and the dataset contains only SMILES consisting of fewer than 75 tokens.
> Martins, Ines Filipa, et al. "A Bayesian approach to in silico blood-brain barrier penetration modeling." Journal of Chemical Information and Modeling 52.6 (2012): 1686-1697.



In [45]:
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from torch import nn, optim
from rdkit.Chem import AllChem as Chem
import numpy as np
from sklearn.metrics import roc_auc_score
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics.pairwise import cosine_similarity
from matplotlib import pyplot as plt
%run ../utils/utils.py

First, read in the dataset.

In [47]:
data_bbbp = pd.read_csv("../data/bbbp/bbbp_clean.csv")
data_bbbp.head()

Unnamed: 0,smiles,target
0,CC(C)NCC(O)COc1cccc2ccccc12,1.0
1,CC(C)(C)OC(=O)CCCc1ccc(N(CCCl)CCCl)cc1,1.0
2,CC1COc2c(N3CCN(C)CC3)c(F)cc3c(=O)c(C(=O)O)cn1c23,1.0
3,CC(=O)NCCCOc1cccc(CN2CCCCC2)c1,1.0
4,Cc1onc(-c2ccccc2Cl)c1C(=O)NC1C(=O)N2C1SC(C)(C)...,1.0


The `smiles` are given together with the `target`. A `1` indicates that these molecules can diffuse through the BBB. In the following cell we calculate the percentage of molecules that have this property in the data set.

In [46]:
np.sum(data_bbbp.target)/data_bbbp.shape[0]*100


Because the classes are strongly imbalanced, accuracy alone would be misleading, so we use the ROC-AUC as our metric.
But before we can turn our attention to the training, we must prepare the data.
First create a `dictionary` that assigns numbers to all symbols in the `smiles`.

In [48]:
dictionary = create_dict(data_bbbp.smiles)

In [49]:
dictionary

{'(': 0,
 'O': 1,
 ')': 2,
 '1': 3,
 'N': 4,
 'C': 5,
 'c': 6,
 '2': 7,
 'Cl': 8,
 '=': 9,
 '3': 10,
 'n': 11,
 'F': 12,
 '-': 13,
 'o': 14,
 'S': 15,
 'H': 16,
 '[': 17,
 ']': 18,
 '4': 19,
 's': 20,
 '#': 21,
 'Br': 22,
 'I': 23,
 '+': 24,
 '5': 25,
 'P': 26,
 '6': 27,
 'B': 28}

With this dictionary, you now convert the actual symbols of the Smiles to numbers.

In [56]:
tokenized_smiles = tokenize(data_bbbp.smiles,dictionary)

The problem is, as in the example, that the molecules and thus the `smiles` are of different lengths:

In [57]:
length_ll = np.array([len(x) for x in tokenized_smiles])
length_ll

array([27, 36, 48, ..., 43, 56, 49])

Therefore you must first bring all `tokenized_smiles` to the same length. This is the length of the longest smile. 

In [58]:
max_length = max(length_ll)
max_length

74

To all SMILES that consist of fewer than 74 tokens, we add additional `<pad>` tokens until they are 74 tokens long. We assign the `<pad>` token the value `len(dictionary)` since this is the next unused number.

In [59]:
print(len(dictionary))
dictionary["<pad>"]= len(dictionary)

30


The following code adds this padding token to all smiles.

In [60]:
for i, tok_smi in enumerate(tokenized_smiles):
    tokenized_smiles[i] = tok_smi+ [dictionary["<pad>"]]*(max_length - length_ll[i])

In [61]:
length_ll = [len(x) for x in tokenized_smiles]
length_ll

[74, 74, 74, ...]


Now all `tokenized_smiles` are of the same length and in the correct format. You still have to split the data into a training and a test set.
To do so, we first merge the `tokenized_smiles` with the targets from `data_bbbp`.

In [None]:
# Select the target column by name so that it ends up as the last column
data_bbbp_tokenized = np.hstack([np.array(tokenized_smiles), data_bbbp[["target"]].to_numpy()])
data_bbbp_tokenized

In [None]:
train, test = train_test_split(data_bbbp_tokenized,test_size=0.2,train_size=0.8, random_state=1234)

We separate the inputs and outputs from each other again. Important: The `targets` are in the last column.

In [None]:
train_x = torch.tensor(train[:,:-1], dtype=torch.long )
train_y = torch.tensor(train[:,-1], dtype=torch.float)
test_x = torch.tensor(test[:,:-1], dtype=torch.long)
test_y = torch.tensor(test[:,-1], dtype=torch.float)

Create the training and test loaders so we can train with minibatches.

In [None]:
train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=32)

test_dataset = TensorDataset(test_x, test_y)
test_loader = DataLoader(test_dataset, batch_size=32)

Then define the model which requires an embedding layer, a permute layer and an RNN. Here we use the GRU.

In [None]:
torch.manual_seed(1111)
model =nn.Sequential(nn.Embedding(len(dictionary),32, padding_idx = dictionary["<pad>"]),
                     Permute(1,0,2),
                     nn.GRU(32,64))


You also need a linear layer that makes predictions based on the output of the GRU. 
For this we create a second model called `pred_ll`.

Why do we need a second model?

This is because all RNNs in PyTorch have more than one output: one with all hidden states and one with only the final hidden state. `nn.Sequential` cannot know which of the two outputs should be passed from the RNN to the linear layer.

Therefore we need a second model `pred_ll`. Here we use batchnorm and dropout. Make sure that the dimensions of `BatchNorm1d` and `Linear` correspond to the output dimension of the `GRU`.

In [62]:
torch.manual_seed(1111)
pred_ll = nn.Sequential(nn.BatchNorm1d(64),nn.Dropout(0.2),nn.Linear(64,1))

Additionally, define a loss function and an optimizer. Remember that we have a binary classification problem.
Since we have two networks that we want to update together, we can combine the parameters of the two into one list and make them available to the optimizer.

In [None]:
loss_funktion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(list(model.parameters()) + list(pred_ll.parameters()), lr =0.001) 

In [None]:
for i in range(40):
    pred_ll.train()
    for input_, targets in train_loader:
        optimizer.zero_grad()
        rnn_output = model(input_)[1][0]
        output = pred_ll(rnn_output).flatten()
        
        loss = loss_funktion(output, targets)
        loss.backward()
        optimizer.step()
    
    pred_ll.eval()
    
    rnn_output = model(train_x)[1][0]    
    output = pred_ll(rnn_output).flatten()
    loss_train = loss_funktion(output, train_y)
    auc_train = roc_auc_score(train_y.numpy(),torch.sigmoid(output).detach().clone().numpy())
    
    rnn_output = model(test_x)[1][0]    
    output = pred_ll(rnn_output).flatten()
    loss_test = loss_funktion(output, test_y)
    auc_test = roc_auc_score(test_y.numpy(),torch.sigmoid(output).detach().clone().numpy())
    
    print("Training Loss: %.3f Training AUC: %.3f | Test Loss: %.3f Test AUC: %.3f"
        % (loss_train.item(), auc_train,loss_test.item(), auc_test ))
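
The evaluation part of the loop above runs the networks on the full training and test sets while autograd still records the computation. A common way to avoid this overhead is to wrap the evaluation in `torch.no_grad()`. The sketch below shows the pattern on made-up data; `encoder`, `head`, `evaluate` and the toy tensors are illustrative stand-ins, not the objects trained above.

```python
import torch
from torch import nn

torch.manual_seed(0)
encoder = nn.GRU(32, 64, batch_first=True)     # stand-in for the embedding + GRU model
head = nn.Sequential(nn.BatchNorm1d(64), nn.Dropout(0.2), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()

def evaluate(inputs, targets):
    """Compute loss and probabilities without building a computation graph."""
    encoder.eval()
    head.eval()                                # no dropout, batchnorm uses running statistics
    with torch.no_grad():
        last_hidden = encoder(inputs)[1][0]    # final hidden state, shape (batch, 64)
        logits = head(last_hidden).flatten()
        return loss_fn(logits, targets).item(), torch.sigmoid(logits)

toy_inputs = torch.rand(8, 74, 32)             # made-up, already embedded batch
toy_targets = torch.tensor([1., 0., 1., 1., 0., 1., 1., 1.])
toy_loss, toy_probs = evaluate(toy_inputs, toy_targets)
print(toy_loss, toy_probs.requires_grad)
```

Since no graph is built, the returned probabilities can be converted with `.numpy()` directly, without a preceding `.detach()`.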


You can see that an RNN can make good predictions. However, in practice, ECFP and classical neural networks often work better, especially on small datasets, since they are less complex.

Lastly, we look at the learned embeddings. For this, we store the weight matrix of the embedding layer.

In [63]:
embedding_weights = list(model[0].parameters())[0].detach().clone().numpy()
embedding_weights.round(2)


One way to analyze the embeddings is to compare the similarity of different tokens via the `cosine_similarity`. Tokens with similar function should have similar embeddings.
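
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The following sketch computes it by hand with NumPy on made-up vectors; the function `cosine_sim` is only an illustration and is not the scikit-learn function used below.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings", for illustration only
vec_a = np.array([1.0, 2.0, 0.5])
vec_b = np.array([2.0, 4.0, 1.0])     # same direction as vec_a, twice as long
vec_c = np.array([-1.0, -2.0, -0.5])  # opposite direction

print(cosine_sim(vec_a, vec_b), cosine_sim(vec_a, vec_c))
```

The measure ignores the length of the vectors and compares only their direction, which is why `vec_a` and `vec_b` count as maximally similar.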

As an example we calculate the similarity of the embeddings of a nitrogen in an aromatic ring (`n`).
We have to find which number belongs to `n` in the dictionary, and hence also the index of the row in the corresponding embedding matrix.


In [None]:
idx_n = dictionary["n"]
dictionary["n"]

We calculate the similarity of this embedding to all other embeddings. Afterwards a bar chart is created.

In [None]:
similarity_N = cosine_similarity(embedding_weights[idx_n:idx_n+1,:],embedding_weights)[0]
labels = [x for x in dictionary]

In [None]:
sorted_values=pd.DataFrame({"symbol": labels, "similarity":similarity_N}).sort_values("similarity", ascending =False)
sorted_values.plot.bar("symbol", "similarity")

The problem with such a small data set is that the embeddings are extremely dependent on the data set. Nevertheless, general trends can be identified. `n` is more similar to the aromatic atoms `o` or `c` than to the atoms outside an aromatic ring `C`,`N` and `O`. However, the exact embeddings can vary extremely from training to training.

You can also compare other symbols by looking up which index is assigned to a different token, for example:

`idx_n = dictionary["o"]`.

Choose another symbol.