# CS 195: Natural Language Processing
## Attention and Transformers

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F7_2_AttentionTransformers.ipynb)

## Reference

SLP: Attention, Section 9.8 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/9.pdf

SLP: Transformers and Pre-Trained Language Models, Chapter 10 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/10.pdf

How do Transformers Work?, Section 1.4 of Hugging Face NLP Course https://huggingface.co/learn/nlp-course/chapter1/4

## Reminder: Applied Exploration

The applied exploration for this fortnight will be a little different. I want everyone to get some experience fine-tuning an existing model, so this will be the task for the entire fortnight.

See the [workshop from last time](https://github.com/ericmanley/F23-CS195NLP/blob/main/F7_1_TransferLearning.ipynb)

Fine-tune an existing model with the following requirements:
* Choose a different starting model - you can use any Hugging Face model, but consider starting with a general one like BART or Llama2.
* Choose a different data set - think about something that would be good to include in an application that interests you
* Evaluate how well it performed. For a sequence-to-sequence model, try going back and using ROUGE from Fortnight 1.

The Hugging Face NLP course has [examples of fine-tuning for many different tasks](https://huggingface.co/learn/nlp-course/chapter7/1).

## Preliminary Discussion

We often represent things (word embeddings, context, hidden state, etc.) in neural networks as vectors.
* interpreted mathematically, vectors have *direction* and *magnitude*

Sometimes we want to be able to tell how *similar* two vectors are - for example, two similar words might have similar embedding vectors
* similar vectors point in the same direction
* we usually *normalize* them so that they're of length 1 but keep the same direction

Which of the following two pairs of vectors is more similar?

(0.383, 0.077, 0.920) and (0.477, 0.191, 0.858)

or

(0.383, 0.077, 0.920) and (0.759, 0.569, 0.316)

## Dot products

A **dot product** of two vectors is one way to measure similarity

Compute by
* multiplying corresponding entries
* adding them all together

In [1]:
(0.383*0.477) + (0.077*0.191) + (0.920*0.858)

0.986758

In [2]:
(0.383*0.759) + (0.077*0.569) + (0.920*0.316)

0.62523
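The two cells above can be wrapped into small reusable functions. This is a minimal sketch in plain Python; the helper names `dot` and `normalize` are chosen here for illustration and do not come from any library. The three example vectors are already approximately unit length, so their dot products can be compared directly.

```python
import math

def dot(u, v):
    """Multiply corresponding entries and add the products together."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Rescale v to length 1 while keeping its direction."""
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]

a = (0.383, 0.077, 0.920)
b = (0.477, 0.191, 0.858)
c = (0.759, 0.569, 0.316)

print(round(dot(a, b), 3))  # 0.987 - the first pair is more similar
print(round(dot(a, c), 3))  # 0.625

# A vector that is not unit length should be normalized before comparing
print(normalize((3.0, 4.0)))  # [0.6, 0.8]
```

A larger dot product between unit-length vectors means they point in more nearly the same direction.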

## Review: The Encoder-Decoder Recurrent Neural Network Architecture

<div>
    <center>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/encoder-decoder_detail.png?raw=1">
    </center>
</div>

### Discuss:
1. What do encoders do? How are they trained?
2. What do decoders do? How are they trained?
3. What is **c** in this diagram and what is its purpose?


image source: SLP Fig. 9.18, https://web.stanford.edu/~jurafsky/slp3/9.pdf

Notes:
* Encoders are trained using masked language tasks
* Ignore the output of the encoder and keep its trained weights - those are the valuable part of the encoder
* Decoders output the desired result based on the vector passed to them
* Decoders are typically trained using "predict the next word" tasks
* **c** is the context vector - the "essence" of the input that the encoder passes to the decoder



## Problem: Bottleneck

<div>
    <center>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/bottleneck.png?raw=1" width=700>
    </center>
</div>

The final hidden state in the encoder has to contain everything meaningful about the input text
* may not represent things from earlier in the input sequence
* even if you use LSTM nodes

image source: SLP Fig. 9.21, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## Attention



<div>
    <center>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/attention.png?raw=1" width=800>
    </center>
</div>

**Idea:** instead of just passing the final hidden encoder state to the decoder, pass a combination of all encoder states
* weighted sum: makes sure that the context vector is a fixed size (can be some other more complicated function)
* computed again for each decoder state $i$
* takes into account decoder state $i-1$ and all encoder states
* you can learn *which input words are most important* when generating the next word
* even better at retaining information from early in the input than relying on the final hidden state alone

image source: SLP Fig. 9.23, https://web.stanford.edu/~jurafsky/slp3/9.pdf

## Computing the Attention context vector

1. Compute the *score* for how relevant each encoder state is to each decoder state
    * can be simple - the dot product
    * this is a single number that represents how similar the two vectors are
    * tells us the relevance of each encoder state to the current step in the decoder
2. Normalize these with a softmax to create a vector of weights $\alpha_{ij}$ - this essentially turns the relevance for each encoder state into probabilities
3. Use these normalized relevance scores as weights in a weighted sum - this is our new context vector $$c_i = \sum_{j} \alpha_{ij}h^e_j$$
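The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (`softmax`, `context_vector`) and the tiny 2-dimensional hidden states are made up for the example, and the score is the simple dot product.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(prev_decoder_state, encoder_states):
    # Step 1: score each encoder state against the previous decoder state
    scores = [dot(prev_decoder_state, h) for h in encoder_states]
    # Step 2: normalize the scores into weights alpha_ij
    alphas = softmax(scores)
    # Step 3: weighted sum of the encoder states gives c_i
    dim = len(encoder_states[0])
    c = [sum(a * h[k] for a, h in zip(alphas, encoder_states)) for k in range(dim)]
    return c, alphas

# Made-up hidden states, for illustration only
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_decoder_state = [2.0, 0.0]

c, alphas = context_vector(prev_decoder_state, encoder_states)
print(alphas)  # the first and third encoder states get the largest weights
print(c)       # same size as one encoder state, however long the input is
```

Because the weights sum to 1, the context vector always has the same dimension as a single encoder state.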

## Attention is all you need

2017: Big breakthrough by researchers at Google - the **Transformer**
* You can use attention *without recurrent structures*
* recurrent structures: slow training - each hidden state depends on the previous one, so you have to compute them sequentially
* transformers: fast training - you can do it in parallel

<div>
    <center>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/attention_paper.png?raw=1">
    </center>
</div>

Paper available at https://arxiv.org/abs/1706.03762

## Transformers

Transformers include
* linear layers
* feedforward networks
* **self-attention** layers
    - directly extract and use information from arbitrarily large contexts
    


## A simple self-attention layer

<div>
    <center>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/simple_self_attention.png?raw=1">
    </center>
</div>

* Each output is based on the current input and all prior inputs

* Similar to the attention calculation in the recurrent encoder-decoder
    - compute similarity score of each pair $x_i$ and $x_j$ (can be just dot product)
    - normalize with softmax, call it $\alpha_{ij}$
    - generate output as weighted sum of inputs $$y_i = \sum_{j \leq i} \alpha_{ij}x_j$$
    
* **Note that these can be computed in parallel!**
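A minimal sketch of this simple self-attention layer, assuming dot-product scores and made-up 2-dimensional inputs. The function name `simple_self_attention` is hypothetical. Each pass through the loop uses only the inputs, never an earlier output, which is why the outputs could be computed in parallel.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(xs):
    """Compute y_i as the sum over j <= i of alpha_ij * x_j."""
    ys = []
    for i, x_i in enumerate(xs):
        visible = xs[: i + 1]  # the current input and everything before it
        alphas = softmax([dot(x_i, x_j) for x_j in visible])
        y_i = [sum(a * x_j[k] for a, x_j in zip(alphas, visible))
               for k in range(len(x_i))]
        ys.append(y_i)
    return ys

# Made-up inputs, for illustration only
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = simple_self_attention(xs)
print(ys[0])  # [1.0, 0.0] - the first output can only attend to the first input
print(ys[1])  # a blend of the first two inputs, weighted toward the second
```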

image source: SLP Fig. 10.1, https://web.stanford.edu/~jurafsky/slp3/10.pdf

## Fancier Embeddings

Transformers generate three new vectors from each word embedding representing different roles a word can play


<div>
    <center>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/query_key_value.png?raw=1" width=600>
    </center>
</div>

* **Query vector:** used to measure how much relevance a word should give to other words in a sentence
    - current focus of attention
* **Key vector:** the vector that a query vector is compared to - how much focus the query word should give to this word
    - preceding words being compared to the current focus
* **Value vector:** information from the word that should be passed to the other words
    - once query and key have been matched, output is mostly the value vector but guided by interaction of query and key
    
*The weights that produce the query, key, and value vectors are all learned as part of the training process!*
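The sketch below shows one way the three roles could fit together. It rests on several assumptions: the weight matrices `W_Q`, `W_K`, and `W_V` are hand-picked here rather than learned, the helper names are hypothetical, and the scores are divided by $\sqrt{d_k}$ as in the Transformer paper, a detail not covered in the notes above.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def qkv_self_attention(xs, W_Q, W_K, W_V):
    # Each input embedding is projected into its three roles
    queries = [project(W_Q, x) for x in xs]
    keys = [project(W_K, x) for x in xs]
    values = [project(W_V, x) for x in xs]
    d_k = len(keys[0])
    ys = []
    for i, q in enumerate(queries):
        # Compare the current query with the keys at positions j <= i
        scores = [dot(q, keys[j]) / math.sqrt(d_k) for j in range(i + 1)]
        alphas = softmax(scores)
        # The output is a weighted sum of the matching value vectors
        ys.append([sum(a * values[j][k] for j, a in enumerate(alphas))
                   for k in range(len(values[0]))])
    return ys

# Hand-picked weights for illustration; a real model learns these
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]
xs = [[1.0, 0.0], [0.0, 1.0]]

ys = qkv_self_attention(xs, W_Q, W_K, W_V)
print(ys[0])  # [2.0, 0.0] - just the first value vector
print(ys[1])  # a mix of both value vectors, guided by query-key scores
```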

image source: SLP Fig. 10.2, https://web.stanford.edu/~jurafsky/slp3/10.pdf

## Transformer Block

<div>
    <center>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/transformer_block.png?raw=1">
    </center>
</div>

**Residual connections:** Allow information to skip layers - improves learning, helps with the vanishing gradient problem

**Normalization layers:** Rescale the vectors so that they're all measured on the same scale (like z-score normalization in statistics) - also includes some learnable parameters called gain and offset for multiplying or adding on to the scaled values
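A minimal sketch of both ideas, under some simplifying assumptions: `gain` and `offset` are single numbers here (real models typically learn one per dimension), the small `eps` constant is added to avoid dividing by zero, and the `double` sublayer is a made-up stand-in for an attention or feedforward layer.

```python
import math

def layer_norm(x, gain=1.0, offset=0.0, eps=1e-5):
    """Z-score normalize the entries of x, then apply gain and offset."""
    mean = sum(x) / len(x)
    variance = sum((a - mean) ** 2 for a in x) / len(x)
    std = math.sqrt(variance + eps)
    return [gain * (a - mean) / std + offset for a in x]

def residual_then_norm(x, sublayer):
    # Residual connection: add the sublayer's input to its output,
    # so information from x can skip past the sublayer
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)

def double(x):
    """Stand-in sublayer, for illustration only."""
    return [2.0 * a for a in x]

out = residual_then_norm([1.0, 2.0, 3.0], double)
print(out)  # entries are centered on 0 with roughly unit variance
```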

image source: SLP Fig. 10.4, https://web.stanford.edu/~jurafsky/slp3/10.pdf

## Multi-Headed Attention

<div>
    <center>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/multiheaded_attention.png?raw=1">
    </center>
</div>

As we have seen, the same word can have many different senses (e.g., a river *bank* vs. a *bank* that is a financial institution)

This can happen at each level of abstraction in a neural network language model

Each **head** is a self-attention layer capable of handling different relationships between words
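The sketch below illustrates the idea with two heads. It is a toy example under stated assumptions: every weight matrix is hand-picked rather than learned, the helper names are hypothetical, and the head outputs are concatenated and passed through an output projection `W_O` as in the Transformer paper, a step not described in the notes above.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def attention_head(xs, W_Q, W_K, W_V):
    """One self-attention layer with its own query, key, and value weights."""
    queries = [project(W_Q, x) for x in xs]
    keys = [project(W_K, x) for x in xs]
    values = [project(W_V, x) for x in xs]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for i, q in enumerate(queries):
        alphas = softmax([dot(q, keys[j]) / scale for j in range(i + 1)])
        outputs.append([sum(a * values[j][k] for j, a in enumerate(alphas))
                        for k in range(len(values[0]))])
    return outputs

def multi_head_attention(xs, heads, W_O):
    # Every head looks at the same inputs through different weights
    per_head = [attention_head(xs, *weights) for weights in heads]
    ys = []
    for i in range(len(xs)):
        # Concatenate the heads' outputs for position i, then project
        concatenated = [value for head in per_head for value in head[i]]
        ys.append(project(W_O, concatenated))
    return ys

# Hand-picked weights for illustration; a real model learns these
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
heads = [(identity, identity, identity), (identity, swap, identity)]
# W_O maps the concatenated 4-dimensional vector back to 2 dimensions
W_O = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ys = multi_head_attention(xs, heads, W_O)
print(ys[0])  # [1.0, 0.0]
print(ys[1])  # the two heads weight the earlier words differently here
```

The two heads differ only in their key weights, which is enough for them to attend to different positions for the same query.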

image source: SLP Fig. 10.5, https://web.stanford.edu/~jurafsky/slp3/10.pdf

## Diagram of Encoder-Decoder Transformer

The encoder is on the left, the decoder on the right

<div>
    <center>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/transformer_encoder_decoder.svg?raw=1">
    </center>
</div>


image source: https://huggingface.co/learn/nlp-course/chapter1/4