![](https://i.imgur.com/oix2cc4.jpg)


# Advanced Deep Learning with Transformers

### DATE
21-22 October 2021

## General information

This page: [https://hackmd.io/@enccs/SJ4OWVdVK](https://hackmd.io/@enccs/SJ4OWVdVK)
Lesson material: https://enccs.github.io/gnn_transformers/
Workshop page: [https://enccs.se/events/2021/10/advanced-deep-learning/](https://enccs.se/events/2021/10/advanced-deep-learning/)
Zoom: [ZOOM-LINK](https://rise.zoom.us/j/63771005182?pwd=V3duZnQyR2ZEcXVNZUxBM3ltbWJ6Zz09)
Post-workshop survey: SURVEY-LINK


### Schedule

**21 October**

| Time | Section | 
| ---- | ------- |
|09:00-10:00| [Workshop and Graphs Introduction](https://docs.google.com/presentation/d/1I8s75w9kAGqnkZ1HejwZpMEuLDxOorezoQvWF9G1a1s/edit?usp=sharing) |
|10:00-12:00| Work on "Graph Basics" notebooks |
|12:00-13:00| Lunch break |
|13:00-13:30| [Intro to session 2](https://docs.google.com/presentation/d/1wxIXzxJRCAwLoMSb1Cvu75K1TzOrbkd_9GBw3CGk3Tk/edit?usp=sharing) |
|13:30-15:50| Work on "From molecules to PyTorch Tensors" notebooks |
|15:50-16:00| Recap of the day |

**22 October**

| Time | Section | 
| ---- | ------- |
|09:00-09:30| Recap of yesterday and overview of a GNN |
|09:30-12:00| Work on "Graph Neural Networks from the ground up" notebooks |
|12:00-13:00| Lunch break |
|13:00-13:30| Recap of implementing a GNN and overview of Transformers |
|13:30-15:40| Work on "GNNs to Transformers" notebook |
|15:40-16:00| Closing remarks |

---
### Cheat sheets

[Graphs cheat sheet](https://enccs.github.io/gnn_transformers/_downloads/caaa68c4683b66a395a78b6871b369e3/cs_graphs.pdf)

[Graph Neural Networks cheat sheet](https://enccs.github.io/gnn_transformers/_downloads/a3ac08b326fa81cefb9e3b1b04211bd7/cs_gnns.pdf)

---

### Instructors

- Erik Ylipää
- Leon Sütfeld

### Helpers

- Johan Broberg
- Maria Bånkestad
- Saptarshi Hazra
- Carlos Penichet
- Daniel Perez
- Shuai Zhu


### Code of conduct

We strive to follow the [Contributor Covenant Code of Conduct](https://www.contributor-covenant.org/)
to foster an inclusive and welcoming environment for everyone. 
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/ENCCS/event-organisation/blob/main/CODE_OF_CONDUCT.md)


Further information and contact details to report CoC violations can be [found here](https://github.com/ENCCS/event-organisation/blob/main/CODE_OF_CONDUCT.md).

---

:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::

---


## Questions, answers and information

**Question:** Where will the Zoom recordings be made available?
**Answer:** We will make them available on the enccs.se website under "Training resources". Expect this to take a week or two.

### October 21, Morning Session

---

**Question:** How do I start a GPU instance on Colab?
**Answer:** Go to Runtime -> Change runtime type and select GPU as the hardware accelerator.
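
Once the runtime has restarted, a quick check along these lines can confirm that PyTorch actually sees the GPU. This is only a sketch: `describe_device` is a hypothetical helper name, and the `ImportError` branch exists solely so the snippet also runs where PyTorch is not installed.

```python
def describe_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Index 0 is the first visible GPU.
        return "GPU: " + torch.cuda.get_device_name(0)
    return "CPU only (no GPU runtime selected)"


print(describe_device())
```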

---------

**Question:** Why should one-hot encoding be avoided in NNs?
**Answer:** There actually is a notebook on this linked in the overview for this session. It's optional, so don't worry if you don't get around to looking into it. The upshot is that the representation is inefficient and will result in a lot of zeros in the gradients when training the neural network.
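
To make the inefficiency a bit more concrete, here is a small sketch in plain Python (not taken from the optional notebook). It shows that multiplying a one-hot vector with a weight matrix only ever selects one row, which is what a direct lookup gives you without all the multiplications by zero. The toy weights and function names are made up for illustration.

```python
# Toy weight matrix: one row per category, 3 output dimensions.
weights = [
    [0.1, 0.2, 0.3],  # category 0
    [0.4, 0.5, 0.6],  # category 1
    [0.7, 0.8, 0.9],  # category 2
    [1.0, 1.1, 1.2],  # category 3
]


def one_hot(index, num_categories):
    return [1.0 if i == index else 0.0 for i in range(num_categories)]


def vector_matrix_product(vector, matrix):
    # Computes vector @ matrix with explicit loops.
    num_columns = len(matrix[0])
    return [
        sum(vector[row] * matrix[row][col] for row in range(len(matrix)))
        for col in range(num_columns)
    ]


category = 2
via_one_hot = vector_matrix_product(one_hot(category, len(weights)), weights)
via_lookup = weights[category]  # a plain row lookup, as an embedding layer does

print(via_one_hot)
print(via_lookup)
```

Only the selected row contributes to the output, so only that row receives a non-zero gradient during training.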

--------

**Question:** Is the input order of vectors important? It shouldn't be, correct?
**Answer:** Exactly right. In this problem we want to be _invariant_ to order (permutation invariant), so we have to make sure our neural network is as well. This will be covered in session 3.

--------
**Question:** I think we did too early a break out - now what? Break out session 3 at least.
**Answer:** The idea now is that you go through the notebooks together. It works best if one of you shares the screen and you go through the notebook in sync.

--------
**Question:** What do you mean that summing the vectors is preferable to concatenation, aren't these very different things?

**Question:** Could you repeat/clarify the part in which you said summing up random vectors in high dimensional space and how is that the reason why summing is preferred to concatenation?



**Answer:** Our claim that summing is better relies on three assumptions: that our embeddings are free variables, that they are random, and that they are high dimensional.

If they are, sums of these vectors actually retain a lot of the information about the constituents. We could pretty easily test whether a particular embedding vector is part of the sum by just computing a dot product (or the cosine similarity) with the sum. If the vector was not part of that sum, we're likely to get a value close to 0. This is also what the downstream matrix multiplication of our neural network can do: it can learn to have singular vectors aligned with the categorical values of interest.
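
The dot-product test can be tried out with a few lines of plain Python. This is a sketch under assumptions that are not from the notebook: the dimension (`DIM = 1000`), the use of unit-length Gaussian vectors, and all names are chosen for illustration only.

```python
import math
import random

random.seed(0)
DIM = 1000  # assumed embedding dimension, chosen to be clearly "high"


def random_unit_vector(dim):
    vec = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Six random embeddings; only the first three go into the sum.
embeddings = [random_unit_vector(DIM) for _ in range(6)]
members, non_members = embeddings[:3], embeddings[3:]
total = [sum(components) for components in zip(*members)]

member_scores = [dot(e, total) for e in members]
non_member_scores = [dot(e, total) for e in non_members]

print("members:    ", [round(s, 2) for s in member_scores])
print("non-members:", [round(s, 2) for s in non_member_scores])
```

The member scores should land near 1 and the non-member scores near 0. Lowering `DIM` makes the two groups harder to tell apart.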

Another way of thinking about it is the example we use in the notebook: we can implement concatenation by first padding two vectors with zeros and then summing them. This shows that concatenation can be thought of as a special case of summing vectors.
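
A minimal version of the padding argument (written for this page, not copied from the notebook) could look like this:

```python
x1 = [1.0, 2.0]
x2 = [3.0, 4.0, 5.0]

# Pad x1 with trailing zeros and x2 with leading zeros so that both
# have length len(x1) + len(x2).
x1_padded = x1 + [0.0] * len(x2)
x2_padded = [0.0] * len(x1) + x2

summed = [a + b for a, b in zip(x1_padded, x2_padded)]
concatenated = x1 + x2  # list concatenation

print(summed)        # [1.0, 2.0, 3.0, 4.0, 5.0]
print(concatenated)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```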

There's a better argument for summing instead of concatenating if we also assume that this sum will be multiplied with a matrix (as will be the case in this workshop).
In that case, let's say we concatenate the vectors of _free variables_ $\mathbf{x}_1$ and $\mathbf{x}_2$:

$$\mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}$$

Then the matrix multiplication $W \mathbf{x}$ is the same as 
$$
W \mathbf{x} = W_1 \mathbf{x}_1 + W_2 \mathbf{x}_2
$$

where $W = \begin{bmatrix} W_1 & W_2 \end{bmatrix}$.

If we use concatenation, the first layer that the concatenation encounters will essentially perform a sum of these concatenated vectors after they have been linearly transformed.
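
The block-matrix identity can be checked numerically. The sketch below uses small made-up matrices and a hand-written `matvec` helper so that it runs without any libraries:

```python
def matvec(matrix, vector):
    # Matrix-vector product with the matrix given as a list of rows.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# W_1 is 2x2 and W_2 is 2x3, so W = [W_1 W_2] is 2x5.
W1 = [[1.0, 2.0],
      [3.0, 4.0]]
W2 = [[0.5, -1.0, 2.0],
      [1.5, 0.0, -2.0]]
W = [row1 + row2 for row1, row2 in zip(W1, W2)]

x1 = [1.0, -1.0]
x2 = [2.0, 0.0, 1.0]
x = x1 + x2  # concatenation of x1 and x2

left = matvec(W, x)
right = [a + b for a, b in zip(matvec(W1, x1), matvec(W2, x2))]

print(left)   # [2.0, 0.0]
print(right)  # [2.0, 0.0]
```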

Since all the matrices and vectors are freely parameterized, we might just set $W_1 = W_2$ and not lose anything in terms of expressivity, especially if we increase the dimensionality of $\mathbf{x}_1$ and $\mathbf{x}_2$ to match the one the concatenation would have had.

**One argument for concatenation** is if we would like our combination of vectors to _not_ be permutation invariant. Let's say the vectors we want to combine are actually a sequence, where each position in the sequence will be represented by a vector, but this might be the _same_ vector.

$S_1$ = "a white cat visited the house"
$S_2$ = "a cat visited the white house"

In this case, concatenation makes more sense:

$$S_1 = \begin{bmatrix} \mathbf{x}_\text{a} \\ \mathbf{x}_\text{white} \\ \mathbf{x}_\text{cat} \\ \mathbf{x}_\text{visited} \\ \mathbf{x}_\text{the} \\
\mathbf{x}_\text{house}
\end{bmatrix}$$

$$S_2 = \begin{bmatrix} \mathbf{x}_\text{a} \\ \mathbf{x}_\text{cat} \\ \mathbf{x}_\text{visited} \\ \mathbf{x}_\text{the} \\ \mathbf{x}_\text{white} \\ 
\mathbf{x}_\text{house} 
\end{bmatrix}$$


$$
W S_1 = W_1 \mathbf{x}_\text{a} + W_2\mathbf{x}_\text{white} + W_3\mathbf{x}_\text{cat} + W_4 \mathbf{x}_\text{visited} + W_5\mathbf{x}_\text{the} +  W_6\mathbf{x}_\text{house} 
$$

$$
W S_2 = W_1 \mathbf{x}_\text{a} + W_2\mathbf{x}_\text{cat} + W_3\mathbf{x}_\text{visited} + W_4 \mathbf{x}_\text{the} + W_5\mathbf{x}_\text{white} +  W_6\mathbf{x}_\text{house} 
$$

So in this case, assuming that the $\mathbf{x}$'s are fixed, the model can learn to _project_ them differently depending on position in the input concatenation. The word vector $W_5\mathbf{x}_\text{white}$ will be different from $W_2\mathbf{x}_\text{white}$, even though it's the same word.

If we instead summed them, we could not separate these two sentence representations.
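
Here is a toy version of the two sentences. Everything in it is an assumption made for illustration: the 2-dimensional integer "embeddings" are invented, and each position-specific matrix $W_i$ is replaced by the very simple stand-in $W_i = i \cdot I$.

```python
# Made-up 2-dimensional word vectors.
embedding = {
    "a": [1, 0],
    "white": [0, 2],
    "cat": [3, 1],
    "visited": [1, 1],
    "the": [2, 0],
    "house": [0, 3],
}
s1 = "a white cat visited the house".split()
s2 = "a cat visited the white house".split()


def sum_representation(words):
    # Plain sum of the word vectors: the order of the words is lost.
    return [sum(parts) for parts in zip(*(embedding[w] for w in words))]


def positional_representation(words):
    # Stand-in for W S: position i (counting from 1) uses W_i = i * I.
    scaled = [[(i + 1) * v for v in embedding[w]] for i, w in enumerate(words)]
    return [sum(parts) for parts in zip(*scaled)]


print(sum_representation(s1), sum_representation(s2))                # [7, 7] [7, 7]
print(positional_representation(s1), positional_representation(s2))  # [24, 29] [18, 33]
```

The summed representations are identical, while the position-dependent projections keep the two sentences apart.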

Conversely, if we do want permutation invariance, concatenation is problematic since, in general,

$$
 W_1 \mathbf{x}_1 + W_2 \mathbf{x}_2 \neq W_1 \mathbf{x}_2 + W_2 \mathbf{x}_1
$$

So if the combination _should_ be permutation invariant, this will cause issues. The network can still _learn_ to treat the inputs in a permutation-invariant way by learning similar $W_1$ and $W_2$, but this will require much more training.
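
A small numeric check of this point, with made-up integer matrices: using separate $W_1$ and $W_2$, swapping the inputs changes the result, whereas a shared matrix gives the same result for both orders.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def combine(W_first, W_second, a, b):
    # Same as multiplying [W_first W_second] with the concatenation of a and b.
    return [p + q for p, q in zip(matvec(W_first, a), matvec(W_second, b))]


W1 = [[1, 2], [0, 1]]
W2 = [[3, 0], [1, 1]]
x1 = [1, 0]
x2 = [0, 1]

separate_a = combine(W1, W2, x1, x2)
separate_b = combine(W1, W2, x2, x1)  # inputs swapped
shared_a = combine(W1, W1, x1, x2)
shared_b = combine(W1, W1, x2, x1)    # inputs swapped

print(separate_a, separate_b)  # [1, 1] [5, 2]
print(shared_a, shared_b)      # [3, 1] [3, 1]
```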

**Question:** So, it works for high dimensional space. What do you consider high dimensional as a rule of thumb? In the examples we had, for example, dimension 4.

**Answer:**
This is really a matter of how much capacity we need to separate our vectors. If we only have 4 values in the categorical variable, a 4-dimensional embedding space would be able to create orthogonal representations for these. But if we generate them randomly, we might need more dimensions to get this property of _almost_ orthogonal vectors (let's say 10, pulling a number out of the air).

In the example we chose low values for clarity; in practice you would have more.

The property of random vectors being almost orthogonal to each other depends on how many random vectors we're considering in relation to the dimensionality.

In natural language processing it was for a long time customary to use 300 as the embedding dimension of word vectors.

I (Erik) would say that anything above 100 dimensions is high dimensional, but depending on context, 10 can be as well.

In practice, we typically set the embedding dimension to the same dimension as the vectors we pass through our neural network. It's often high enough that we don't have to worry about "collisions" in embedding space.
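
To get a feeling for these numbers, the sketch below estimates how far from orthogonal random vectors are at a few dimensions. The choice of Gaussian vectors, 20 vectors per dimension, and the function names are assumptions made for this illustration.

```python
import itertools
import math
import random

random.seed(0)


def random_vector(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_abs_cosine(dim, num_vectors=20):
    # Average |cosine similarity| over all pairs of random vectors.
    vectors = [random_vector(dim) for _ in range(num_vectors)]
    pairs = list(itertools.combinations(vectors, 2))
    return sum(abs(cosine(a, b)) for a, b in pairs) / len(pairs)


results = {dim: mean_abs_cosine(dim) for dim in (4, 10, 100, 300)}
for dim, value in results.items():
    print(f"dim={dim:>3}: mean |cosine| = {value:.3f}")
```

The average absolute cosine similarity shrinks as the dimension grows, which is the "almost orthogonal" behaviour described above.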

-------

**comment** the standard valence for C, O,  N would be 4, 2, 3 :D (sorry, couldn't help it)

Yeah, the valence we calculate is just the "explicit valence", essentially the number of explicit bonds (counting a double bond as two, etc.) in the molecule. Since the standard valences should be learnable from the atom symbol, we _hopefully_ don't have to explicitly include them.
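
As a rough illustration of the difference, the sketch below sums bond orders by hand for the heavy-atom graph of acetic acid. It is a simplified stand-in written for this page, not what RDKit does internally, and the data structures are made up.

```python
# Heavy-atom graph of acetic acid, CH3-C(=O)-OH, with hydrogens left implicit.
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]  # (atom index, atom index, bond order)

STANDARD_VALENCE = {"C": 4, "N": 3, "O": 2}

# Explicit valence: sum of the bond orders of the bonds written out above.
explicit_valence = [0] * len(atoms)
for a, b, order in bonds:
    explicit_valence[a] += order
    explicit_valence[b] += order

# The gap to the standard valence is filled by implicit hydrogens.
implied_hydrogens = [
    STANDARD_VALENCE[symbol] - explicit
    for symbol, explicit in zip(atoms, explicit_valence)
]

print(explicit_valence)   # [1, 4, 2, 1]
print(implied_hydrogens)  # [3, 0, 0, 1]
```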


----------

**Question:** Will we get the notebooks with solutions, since not all of us managed to finish everything?

**Answer:** We will make these available under "Teaching Resources" at enccs.se.

------

:::info
*Always ask questions at the very bottom of this document, right above this.*
::: 

### October 21, Afternoon Session

---

:::info
*Always ask questions at the very bottom of this document, right above this.*
::: 


### October 22, Morning Session

---

:::info
*Always ask questions at the very bottom of this document, right above this.*
::: 

### October 22, Afternoon Session

---
**Question:** Any general comments on the optimizer being AdamW?
**Answer:** No, it's just a common default in the Transformer field.

**Question** in ch "The downside" you have a "TODO, maximum?" 


**Fix for the wrong config reference**
In the cell that doesn't work, insert this code instead.

```python=
torch.manual_seed(1729)
d_model = 16
basic_encoder_config = BasicTransformerConfig(d_model=d_model, 
                                      n_layers=2, 
                                      ffn_dim=16,
                                      head_dim=8,
                                      layer_normalization=True,
                                      dropout_rate=0.1,
                                      residual_connections=True)
basic_transformer = BasicTransformerEncoder(config=basic_encoder_config, 
                                            continuous_node_variables=dataset.continuous_node_variables,
                                            categorical_node_variables=dataset.categorical_node_variables,
                                            continuous_edge_variables=dataset.continuous_edge_variables,
                                            categorical_edge_variables=dataset.categorical_edge_variables)

head_config = GraphPredictionHeadConfig(d_model=d_model, ffn_dim=32, pooling_type='sum')
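# A positional argument may not follow keyword arguments. The keyword name
# `config` is assumed here, mirroring BasicTransformerEncoder above.
prediction_head = GraphPredictionHead(input_dim=d_model, output_dim=1, config=head_config)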
```


**q:** Should `d_model = 16` be 32?
**a:** Pass the `d_model` variable to the config (`d_model=d_model`) instead of a hard-coded value.

```python 
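# New improved version: added _shortest_paths_ (KP)
# Assumes BOND_FEATURES, PATH_FEATURES, max_path_length and
# get_shortest_paths_bond_features are defined earlier in the notebook.
from rdkit.Chem.rdmolops import GetShortestPath
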
def get_pairwise_features(rd_mol, rd_atom_a, rd_atom_b):
  pairwise_features = {}
  # First we create the features for the bond (or missing such) between
  # the two atoms
  bond = rd_mol.GetBondBetweenAtoms(rd_atom_a.GetIdx(), rd_atom_b.GetIdx())
  bond_features = get_shortest_paths_bond_features(bond, **BOND_FEATURES)
  
  pairwise_features.update(bond_features)
  # Now we create bond features for the path between rd_atom_a and rd_atom_b
  # We iterate over atoms of the shortest path up till max_path_length
  # If the shortest path is shorter than max_path_length, we add None-valued
  # features for the remaining ones
  shortest_path = GetShortestPath(rd_mol, rd_atom_a.GetIdx(), rd_atom_b.GetIdx())
  for i in range(max_path_length):
    path_bond_variables = PATH_FEATURES[i]
    if i < (len(shortest_path) - 1):
      a, b = shortest_path[i], shortest_path[i+1]
      path_bond = rd_mol.GetBondBetweenAtoms(a, b)
    else:
      path_bond = None
    path_bond_features = get_shortest_paths_bond_features(path_bond, **path_bond_variables)
    pairwise_features.update(path_bond_features)

  
  return pairwise_features
```
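
The padding logic in the loop above can be tried out without RDKit. The sketch below is a stand-in under stated assumptions: the molecule is replaced by a plain adjacency dictionary, the shortest path is found with a simple breadth-first search (assuming the two atoms are connected), and `shortest_path` and `padded_path_bonds` are hypothetical names.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the node indices from start to goal."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbour in adjacency[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    # Walk back from the goal to the start, then reverse.
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


def padded_path_bonds(path, max_path_length):
    """Bonds along the path as (a, b) pairs, truncated or padded with None."""
    bonds = []
    for i in range(max_path_length):
        if i < len(path) - 1:
            bonds.append((path[i], path[i + 1]))
        else:
            bonds.append(None)
    return bonds


# A chain of five atoms: 0-1-2-3-4
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(padded_path_bonds(shortest_path(chain, 0, 2), 3))  # [(0, 1), (1, 2), None]
print(padded_path_bonds(shortest_path(chain, 0, 4), 3))  # [(0, 1), (1, 2), (2, 3)]
```

Short paths are padded with `None` entries and long paths are cut off at `max_path_length`, so every atom pair ends up with the same number of path features.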

:::info
*Always ask questions at the very bottom of this document, right above this.*
::: 
