# tl;dr

The transformer architecture was introduced in the paper ["Attention is all you need"](https://arxiv.org/abs/1706.03762), although attention mechanisms existed before it. Attention is a method that allows the model to focus on relevant parts of the input (context) in arbitrary (non-local) ways; high-capacity models using attention mechanisms, trained with sufficient data, have surpassed recurrent neural network architectures (LSTMs, etc.), and even convolutional ones, as the state of the art on many tasks. This is largely because transformers process all positions of a sequence in parallel during training, which allows them to use much larger training datasets than recurrent models.

Attention makes a direct connection between points in the input (e.g., a sequence). As a result, you can [view attention essentially like a graph](https://graphdeeplearning.github.io/post/transformers-are-gnns/) where each part of the input is connected to all others and the weight of each edge determines the "strength" of the interaction (how much attention to pay). Thus, transformers are a special case of graph neural networks!

Good Tutorials:
* [fast.ai lesson](https://course.fast.ai/Lessons/lesson24.html)
* [UvA tutorial](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html)

In [None]:
import sys
import torch
import deepchem
import sklearn
import rdkit
import simpletransformers
import transformers

import pandas as pd
import numpy as np

from IPython.display import YouTubeVideo
from transformers import RobertaModel, RobertaTokenizer

%load_ext autoreload
%autoreload 2

import watermark
%load_ext watermark
%watermark -t -m -v --iversions

# Attention Mechanisms

## Notes from CH. 16 in [Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow](https://github.com/ageron/handson-ml2)

Attention started with [this paper](https://arxiv.org/abs/1409.0473) in **2014**, where the decoder was able to focus on different words in the input dynamically, as needed. Thus, the "length of the path" from an input word to its translation was much shorter than in earlier recurrent models. This is called "concatenative", "additive", or "Bahdanau" attention and uses a small neural net to do the "alignment."
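
To make the "neural net that does alignment" concrete, here is a minimal NumPy sketch of an additive (Bahdanau-style) score. The weight names (`W_dec`, `W_enc`, `v`), the sizes, and the random values are assumptions for illustration only; in a real model these are learned.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def additive_attention(decoder_state, encoder_outputs, W_dec, W_enc, v):
    """decoder_state: (d_dec,), encoder_outputs: (T, d_enc)."""
    # One-hidden-layer "alignment" net: score_j = v . tanh(s W_dec + h_j W_enc)
    hidden = np.tanh(decoder_state @ W_dec + encoder_outputs @ W_enc)  # (T, d_att)
    scores = hidden @ v                                                # (T,)
    weights = softmax(scores)            # how much to attend to each input step
    context = weights @ encoder_outputs  # weighted sum of encoder outputs, (d_enc,)
    return context, weights


rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4  # illustrative sizes
encoder_outputs = rng.normal(size=(T, d_enc))
decoder_state = rng.normal(size=d_dec)
W_dec = rng.normal(size=(d_dec, d_att))
W_enc = rng.normal(size=(d_enc, d_att))
v = rng.normal(size=d_att)

context, weights = additive_attention(decoder_state, encoder_outputs, W_dec, W_enc, v)
print(weights.shape, context.shape)
```

The attention weights sum to 1 over the input steps, so the context vector is a convex combination of the encoder outputs.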

["Luong", or "multiplicative", attention](https://arxiv.org/abs/1508.04025) was proposed the following year, in **2015**. The goal of the attention mechanism is to measure the similarity between one of the encoder's outputs and the decoder's previous state, so the authors simply proposed a dot product of these vectors, which is the most commonly used type of attention today. It is usually faster to compute and performs better.

In [None]:
%%HTML
<img src="https://uvadlc-notebooks.readthedocs.io/en/latest/_images/scaled_dot_product_attn.svg"
width="200"/>

They also proposed a variant, called the "general" dot product approach, where the inputs are first sent through a linear transformation before the dot product is taken. I seem to see this a lot since it adds fittable parameters. Below is a visualization of the "multi-head" attention mechanism, which uses this approach. (Note **there is no activation function** on any of the linear layers applied below.)

In [None]:
%%HTML
<img src="https://uvadlc-notebooks.readthedocs.io/en/latest/_images/multihead_attention.svg"
width="200"/>
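
As a rough sketch of the two Luong-style scores discussed above, the snippet below computes the plain "dot" score and the "general" score (a linear transformation before the dot product). The function name `luong_attention`, the matrix `W`, and the sizes are illustrative assumptions; `W` would normally be a learned parameter.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def luong_attention(decoder_state, encoder_outputs, W=None):
    """decoder_state: (d,), encoder_outputs: (T, d), optional W: (d, d)."""
    # "dot":     score_j = h_j . s
    # "general": score_j = h_j . (W s), i.e. a linear map with no activation
    query = decoder_state if W is None else W @ decoder_state
    scores = encoder_outputs @ query     # (T,)
    weights = softmax(scores)
    context = weights @ encoder_outputs  # (d,)
    return context, weights


rng = np.random.default_rng(0)
T, d = 5, 8  # illustrative sizes
encoder_outputs = rng.normal(size=(T, d))
decoder_state = rng.normal(size=d)
W = rng.normal(size=(d, d))  # stands in for a learned matrix

context_dot, weights_dot = luong_attention(decoder_state, encoder_outputs)
context_gen, weights_gen = luong_attention(decoder_state, encoder_outputs, W)
print(weights_dot.round(3))
print(weights_gen.round(3))
```

With `W` set to the identity matrix the "general" score reduces to the "dot" score, which is one way to see that it only adds fittable parameters.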

## YouTube Introductions

In [None]:
YouTubeVideo("yGTUuEx3GkA", width=400)

In [None]:
YouTubeVideo("tIvKXrEDMhk", width=400)

In [None]:
YouTubeVideo("23XUv0T9L5c", width=400)

# Transformers

In **2017**, the ["Attention is all you need"](https://arxiv.org/abs/1706.03762) paper was published, which showed that attention mechanisms alone (without any recurrent layers) are sufficient to handle neural machine translation (NMT) tasks using the "transformer" architecture. Since then, transformers have achieved SOTA performance in many areas.

In [None]:
%%HTML
<img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png"
width="400"/>

Encoder
---
The left-hand side is the encoder. Its inputs have size (batch size, max input sentence length).
The encoder represents each token (or word) as an embedding with K dimensions, so its output has size (batch size, max input sentence length, K).
In the original architecture N = 6, so the encoder is a stack of 6 identical blocks.
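
The shapes above can be checked with a tiny NumPy sketch of the embedding lookup at the bottom of the encoder. The vocabulary size, K, and the random embedding table are made-up values for illustration; the encoder blocks themselves (not shown) keep the (batch size, max input sentence length, K) shape.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_len, vocab_size, K = 2, 7, 50, 16  # illustrative sizes

# Encoder input: integer token ids, shape (batch size, max input sentence length)
token_ids = rng.integers(0, vocab_size, size=(batch_size, max_len))

# Embedding table with one K-dimensional row per vocabulary entry
# (random here; learned in a real model).
embedding_table = rng.normal(size=(vocab_size, K))

# Fancy indexing looks up one row per token id.
embedded = embedding_table[token_ids]

print(token_ids.shape)  # (2, 7)
print(embedded.shape)   # (2, 7, 16)
```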

Feedforward parts are added in each block to help "post process" the new information. Essentially, each block keeps updating the information to provide better context.
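
A minimal sketch of such a feedforward part, assuming the two-layer ReLU form used in "Attention is all you need". The sizes and random weights are placeholders. The same weights are applied to every position independently, which the last lines check.

```python
import numpy as np


def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feedforward: Linear -> ReLU -> Linear, applied along the last axis.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


rng = np.random.default_rng(0)
batch_size, seq_len, d_model, d_ff = 2, 5, 8, 32  # illustrative sizes
x = rng.normal(size=(batch_size, seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)  # same shape as the input: (2, 5, 8)

# Processing a single position on its own gives the same result as
# processing the whole batch: positions do not interact in this layer.
single = feed_forward(x[0, 3], W1, b1, W2, b2)
print(np.allclose(out[0, 3], single))
```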

Decoder
---
The right-hand side is the decoder. During training, it accepts the target sentence (shifted to the right by 1 place) plus the encoder output and gives a probability for each possible next word at each position. Its output has size (batch size, max output sentence length, vocabulary length).
At inference time, the model is called repeatedly, each time predicting the next word in the sentence.
Note that the encoder's final output is actually fed to each of the N decoder blocks.
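
The "called repeatedly" part can be sketched as a greedy decoding loop. Here `toy_decoder` is a hypothetical stand-in for a trained decoder (it simply favors "previous token + 1"), and the `SOS`/`EOS` ids and vocabulary size are made up; only the loop structure is the point.

```python
import numpy as np

SOS, EOS, VOCAB_SIZE = 0, 4, 5  # hypothetical special tokens and vocabulary size


def toy_decoder(encoder_output, tokens):
    """Hypothetical stand-in for a trained decoder.

    Returns next-word probabilities for every position,
    shape (len(tokens), VOCAB_SIZE). It ignores encoder_output.
    """
    probs = np.full((len(tokens), VOCAB_SIZE), 0.05)
    for position, token in enumerate(tokens):
        probs[position, (token + 1) % VOCAB_SIZE] = 0.8
    return probs


def greedy_decode(encoder_output, max_len=10):
    tokens = [SOS]  # the decoder input starts with a start-of-sequence token
    for _ in range(max_len):
        probs = toy_decoder(encoder_output, tokens)
        # Only the prediction at the last position is new information.
        next_token = int(np.argmax(probs[-1]))
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


decoded = greedy_decode(encoder_output=None)
print(decoded)  # [0, 1, 2, 3, 4]
```

Greedy selection is just the simplest choice; sampling or beam search would replace the `argmax` line.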

The decoder's Masked MH Attention is used so that each word is only allowed to attend to itself and the words that came before it.
BERT uses an architecture similar to GPT's, but without the masking in its MH Attention layers (they behave like the encoder's), which makes it "bidirectional" (the "B" in BERT).
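
A small sketch of how the masking can be done, assuming the common trick of setting the scores of future positions to minus infinity before the softmax. The function name and the random scores are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_attention_weights(scores):
    """scores: (T, T) raw attention scores, rows = queries, columns = keys."""
    T = scores.shape[-1]
    # True strictly above the diagonal, i.e. for keys that come after the query.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # exp(-inf) = 0, so future positions receive exactly zero weight.
    return softmax(masked, axis=-1)


rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
weights = causal_attention_weights(scores)
print(weights.round(2))
```

The result is lower-triangular: the first token can only attend to itself, the second to the first two, and so on. Dropping the mask gives the "bidirectional" behavior.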

Attention
---
The attention mechanism is essentially Luong-style (dot product), but the scores are scaled by 1/sqrt(d_k), where d_k is the dimension of the keys, before the softmax is applied.
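
For reference, a minimal NumPy sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. The shapes and random inputs are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)."""
    d_k = Q.shape[-1]
    # Dot-product similarity, scaled so its variance does not grow with d_k.
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (T_q, T_k)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights                     # (T_q, d_v), (T_q, T_k)


rng = np.random.default_rng(0)
T_q, T_k, d_k, d_v = 3, 5, 8, 4  # illustrative sizes
Q = rng.normal(size=(T_q, d_k))
K = rng.normal(size=(T_k, d_k))
V = rng.normal(size=(T_k, d_v))

values, weights = scaled_dot_product_attention(Q, K, V)
print(values.shape, weights.shape)
```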

The K, Q, V and the attention operations are explained nicely (with examples) in:
1. [UvA tutorial](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html)
2. [dmol.pub](https://dmol.pub/dl/attention.html)

Multi-headed attention lets the attention mechanism project the embeddings into several different subspaces (one per head), so that different heads can interpret different parts of the input in different ways.
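
A compact sketch of multi-head self-attention in NumPy, assuming the usual layout where `d_model` is split evenly across heads and a final output projection `W_o` mixes the heads back together. All weight matrices are random placeholders for learned parameters, and batching is left out for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (T, d_model); all weight matrices: (d_model, d_model)."""
    T, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (T, d_model) -> (num_heads, T, d_head)
        return t.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    # Linear projections (no activation function), then one slice per head.
    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, T, T)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (num_heads, T, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ W_o, weights


rng = np.random.default_rng(0)
T, d_model, num_heads = 4, 8, 2  # illustrative sizes
x = rng.normal(size=(T, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(output.shape, weights.shape)
```

Each head has its own (T, T) attention pattern, which is what tools like bertviz display head by head.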

Other Notes
---
LayerNorms are used throughout to keep the layers' values on the same order as their inputs (since skip connections are used).
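
A sketch of the "Add & Norm" step, LayerNorm(x + Sublayer(x)), as described in "Attention is all you need". The learnable gain and bias of a real LayerNorm are omitted here, and the deliberately large `sublayer_output` is a made-up stand-in for an attention or feedforward output.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def add_and_norm(x, sublayer_output):
    # Skip connection followed by layer normalization.
    return layer_norm(x + sublayer_output)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                       # (tokens, features)
sublayer_output = 100.0 * rng.normal(size=(5, 16)) # much larger than x on purpose

out = add_and_norm(x, sublayer_output)
print(np.abs(x + sublayer_output).max().round(1))  # large values before the norm
print(out.mean(axis=-1).round(6))                  # approximately 0 per token
print(out.std(axis=-1).round(3))                   # approximately 1 per token
```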

Positional encodings (PE) are dense vectors, one per position, that are added to the embedding of the token at that position to impart information about where the token occurs in the sentence. This is necessary because attention is basically a graph and has no concept of order. See the [UvA tutorial](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html) for code and examples.
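
As one concrete choice, the sketch below builds the fixed sinusoidal encodings from "Attention is all you need", PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Learned positional embeddings are another option. The sizes and the random "embeddings" are illustrative, and `d_model` is assumed to be even.

```python
import numpy as np


def sinusoidal_positional_encoding(max_len, d_model):
    """Returns an array of shape (max_len, d_model); assumes d_model is even."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the "2i" values, (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even feature indices
    pe[:, 1::2] = np.cos(angles)  # odd feature indices
    return pe


max_len, d_model = 10, 16  # illustrative sizes
pe = sinusoidal_positional_encoding(max_len, d_model)

# The encodings are simply added to the token embeddings (broadcast over the batch).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2, max_len, d_model))
encoder_input = embeddings + pe[None, :, :]
print(pe.shape, encoder_input.shape)
```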

Pre-training allows the encodings (output of encoder blocks) to be very helpful with transfer learning.

## Architecture and Discussions

In [None]:
YouTubeVideo("iDulhoQ2pro", width=400)

In [None]:
YouTubeVideo("EXNBy8G43MM", width=400)

In [None]:
YouTubeVideo("-QH8fRhqFHM", width=400)

Also see the [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) blog post.

## Explainability

Attention probabilities can be used to visualize what the model attends to for a given input token. However, as warned [here](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html):

> "This helps us in understanding, and in a sense, explaining the model. However, the attention probabilities should be interpreted with a grain of salt as it does not necessarily reflect the true interpretation of the model (there is a series of papers about this, including [Attention is not Explanation](https://arxiv.org/abs/1902.10186) and [Attention is not not Explanation](https://arxiv.org/abs/1908.04626)"

Basically, this boils down to a [similar issue with using Grad-CAM to explain CNNs](https://arxiv.org/abs/2011.08891): if you have a GAP (global average pooling) layer followed by a single Dense layer at the end, you can take the gradient of the class score with respect to the final feature maps to reliably highlight the parts of the image that contribute positively to a classification. However, with other architectures there is no guarantee, because the mathematical transformations in the deeper layers at the end (the head) are not being explained.

The visualization code below is from: https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Transfer_Learning_With_ChemBERTa_Transformers.ipynb


In [None]:
%%javascript
require.config({
  paths: {
      d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
      jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
  }
});

In [None]:
import sys

# !git clone https://github.com/jessevig/bertviz bertviz_repo
# sys.path.append('bertviz_repo/')

In [None]:
def call_html():
    """Inject the require.js configuration that bertviz's JavaScript needs."""
    from IPython.display import HTML, display

    display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

from bertviz import head_view

model_version = 'seyonec/PubChem10M_SMILES_BPE_450k'
model = RobertaModel.from_pretrained(model_version, output_attentions=True)
tokenizer = RobertaTokenizer.from_pretrained(model_version)

smiles = "CCCCC[C@@H](Br)CC"
inputs = tokenizer.encode_plus(smiles, return_tensors='pt')
input_ids = inputs['input_ids']
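# With output_attentions=True, the last element of the model output holds the
# attention weights (one tensor per layer).
attention = model(input_ids)[-1]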
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)

call_html()

head_view(attention, tokens)

## BERTology

See the Hugging Face [BERTology documentation](https://huggingface.co/transformers/v3.0.2/bertology.html).

# Examples of transfer learning

## "Foundational" models

Pre-training transformers enables their encoder blocks to be very useful for fine-tuning on other tasks.

[BERT](https://arxiv.org/abs/1810.04805) is a great example.  However, it was shown this model was undertrained and needed some refinement - this is what [RoBERTa](https://arxiv.org/abs/1907.11692) is (just a better way to train BERT models).

[ChemBERTa](https://arxiv.org/abs/2010.09885) (also [v2](https://arxiv.org/abs/2209.01712)) is basically the application of RoBERTa principles to train a BERT model on SMILES strings so that its encodings are "foundational" - that is, they basically learn the fundamental aspects of chemistry and thus, are very useful for transfer learning applications.

## Code examples

The [transformers](https://huggingface.co/docs/transformers/installation) library by Hugging Face hosts the models and weights for many popular transformer architectures, enabling their use in transfer learning.

The [simpletransformers](https://simpletransformers.ai/) tool makes it even easier.

Other examples include:

* https://dmol.pub/dl/pretraining.html
    
* https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html

* https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Transfer_Learning_With_ChemBERTa_Transformers.ipynb

In [None]:
from simpletransformers.classification import ClassificationModel
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.metrics  # import the submodule explicitly; it is used for the MSE below

In [None]:
soldata = pd.read_csv(
    "https://github.com/whitead/dmol-book/raw/main/data/curated-solubility-dataset.csv"
)

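# Work with a random 10% subsample to keep training time short,
# then split it 80/20 into train and test sets.
N = int(len(soldata) * 0.1)
sample = soldata.sample(N, replace=False)
train = sample[: int(0.8 * N)]
test = sample[int(0.8 * N) :]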

train_dataset = train[["SMILES", "Solubility"]]
train_dataset = train_dataset.rename(columns={"Solubility": "labels", "SMILES": "text"})
test_dataset = test[["SMILES", "Solubility"]]
test_dataset = test_dataset.rename(columns={"Solubility": "labels", "SMILES": "text"})

In [None]:
model = ClassificationModel(
    "roberta",
    "seyonec/ChemBERTa_zinc250k_v2_40k",
    num_labels=1,
    args={
        "num_train_epochs": 5,
        "regression": True,
        "use_multiprocessing": False,
        "use_multiprocessing_for_evaluation": False,
    },
    use_cuda=False,
)

In [None]:
model.train_model(
    train_df=train_dataset,
    args={"num_train_epochs": 5},
)

In [None]:
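# Extra keyword arguments are treated as named metrics; the keyword is only a label,
# so call it "mse" rather than "acc" since this is a regression task.
result, model_outputs, wrong_predictions = model.eval_model(
    test_dataset, mse=sklearn.metrics.mean_squared_error
)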
print(result)

In [None]:
# make predictions and see how we do
predictions = model.predict(test_dataset["text"].tolist())[0]

# plot the predictions
plt.scatter(test_dataset["labels"].tolist(), predictions, color="C0")
plt.plot(test_dataset["labels"], test_dataset["labels"], color="C1")
plt.text(
    -10,
    0.0,
    f"Correlation coefficient: {np.corrcoef(test_dataset['labels'], predictions)[0,1]:.3f}",
)
plt.xlabel("Actual Solubility")
plt.ylabel("Predicted Solubility")
plt.show()

There is some confusion when doing transfer learning with BERT models concerning their output. Using the simpletransformers method above greatly simplifies the workflow, but hides details.

From the TensorFlow documentation [here](https://www.tensorflow.org/text/tutorials/classify_text_with_bert):
> "The BERT models return a map with 3 important keys: pooled_output, sequence_output, encoder_outputs:
>
>pooled_output represents each input sequence as a whole. The shape is [batch_size, H]. You can think of this as an embedding for the entire movie review.
>
>sequence_output represents each input token in the context. The shape is [batch_size, seq_length, H]. You can think of this as a contextual embedding for every token in the movie review.
>
>encoder_outputs are the intermediate activations of the L Transformer blocks. outputs["encoder_outputs"][i] is a Tensor of shape [batch_size, seq_length, 1024] with the outputs of the i-th Transformer block, for 0 <= i < L. The last value of the list is equal to sequence_output.
>
> For the fine-tuning you are going to use the pooled_output array."

* [stackoverflow discussion](https://stackoverflow.com/questions/69836422/bert-outputs-explained)
* [kaggle discussion](https://www.kaggle.com/discussions/questions-and-answers/86510)
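
To make the shapes in the quote concrete, here is a NumPy-only sketch of how a `pooled_output` can be derived from a `sequence_output`. It assumes the pooler used in the original BERT implementation (a dense layer with tanh applied to the first token's hidden state); the random arrays stand in for real model outputs and learned weights, and the one-unit regression head is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_length, H = 2, 6, 8  # illustrative sizes

# Stand-in for the last encoder block's output: one contextual embedding per token.
sequence_output = rng.normal(size=(batch_size, seq_length, H))

# Assumed BERT-style pooler: dense + tanh on the first ([CLS]) token.
W_pool, b_pool = rng.normal(size=(H, H)), np.zeros(H)
first_token = sequence_output[:, 0, :]                  # (batch_size, H)
pooled_output = np.tanh(first_token @ W_pool + b_pool)  # (batch_size, H)

# A fine-tuning head then only needs the pooled vector, e.g. one unit for regression.
W_head = rng.normal(size=(H, 1))
prediction = pooled_output @ W_head                     # (batch_size, 1)

print(sequence_output.shape, pooled_output.shape, prediction.shape)
```

If per-token predictions are needed instead (e.g., tagging each token), the head would be applied to `sequence_output` rather than `pooled_output`.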