<a href="https://colab.research.google.com/github/pnijsters/pytorch/blob/main/Notes_Generative_AI_with_LLMs_Course_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Language Models

Examples of foundation models (not ordered by parameter size):
1. GPT - decoder-only model
2. BLOOM (176B) - decoder-only model
3. FLAN-T5 - encoder-decoder model
4. PaLM
5. LLaMA - decoder-only model
6. BERT (110M) - encoder-only model

Parameters = the model's memory (the weights learned during training)

Prompt = input to the model

Context window = the maximum size of the input to the model, measured in tokens (typically a few thousand words)

Completion = output of a model


# Generative AI

GenAI is not new: it started with RNNs (Recurrent Neural Networks). RNNs are very compute and memory intensive and do not scale well enough. The Transformer architecture took generative AI to a whole new level in 2017.

Transformer architecture: introduced in the paper "Attention Is All You Need", from Google and the University of Toronto. It unlocked the recent progress because of the following capabilities:

- scales efficiently
- processes input in parallel
- pays attention to the meaning of the input

Four things that enabled the recent progress in ever larger LLMs are:

- Transformer architecture
- Availability of massive training data sets
- More powerful compute resources (GPUs)
- Massive amounts of CAPEX inflow in the hype cycle

Paper is posted here: https://arxiv.org/abs/1706.03762

## Tokenizer: translating text into numbers

Tokenization is 'translating' text into numbers. A "tokenizer" assigns each word, or part of a word, in the input text a number, which is basically an index into a long list of all the tokens the model supports (its vocabulary). The number assigned is called a "token ID". Tokenization happens before embedding.

*Tokens* are the words, character sets, or combinations of words and punctuation that large language models (LLMs) decompose text into. Each token is later represented as a vector (see Embedding).
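
As a minimal sketch of the idea, the toy tokenizer below maps whole words to token IDs. The vocabulary and the `encode`/`decode` helpers are hypothetical and only for illustration; real LLM tokenizers work on subword pieces and have vocabularies of tens of thousands of tokens.

```python
# Toy word-level tokenizer: a token ID is the index of the token in the vocabulary.
VOCAB = ["<unk>", "the", "teacher", "teaches", "student", "."]  # hypothetical vocabulary
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def encode(text: str) -> list:
    """Split on whitespace and map each piece to its token ID (0 for unknown pieces)."""
    return [TOKEN_TO_ID.get(piece, TOKEN_TO_ID["<unk>"]) for piece in text.lower().split()]


def decode(token_ids: list) -> str:
    """Map token IDs back to their text pieces."""
    return " ".join(VOCAB[i] for i in token_ids)


ids = encode("The teacher teaches the student .")
print(ids)          # the token IDs for the sentence
print(decode(ids))  # the text rebuilt from the token IDs
```

Words that are not in the vocabulary all collapse to `<unk>` here, which is one reason real tokenizers fall back to smaller subword pieces instead.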

## Embedding

Vague babble: TBD

Each token ID is mapped to a vector (its embedding). The values of these vectors are learned during training.
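
The lookup from token ID to vector can be sketched as a table with one row per token. The sizes and the random values below are assumptions for illustration; in a real model the table is learned and each vector has hundreds or thousands of dimensions.

```python
import random

VOCAB_SIZE = 6      # hypothetical vocabulary size
EMBEDDING_DIM = 4   # hypothetical; real models use far larger vectors

rng = random.Random(0)
# One row per token ID. Random here; learned during training in a real model.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids: list) -> list:
    """Look up the embedding vector for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([1, 2, 3, 1])
print(len(vectors), "vectors of size", len(vectors[0]))
```

The same token ID always maps to the same vector; context only enters later, in the attention layers.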

## Encoder
The encoder encodes inputs ("prompts") with contextual understanding and produces one vector per input token.

## Decoder
The decoder accepts input tokens and generates new tokens, one at a time.

# Prompt engineering

- In-context learning (ICL) is providing examples in the prompt.
- Zero-shot inference is when you provide only the input data in the prompt, without any examples.
- One-shot (single-shot) inference is when you provide one example alongside the input data in the prompt.
- Few-shot inference is when you provide multiple examples alongside the input data in the prompt.

Larger models tend to work well with zero-shot inference, while smaller models typically need one-shot or few-shot inference for the output to make sense.
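
The difference between the three styles is only in how the prompt is assembled. The sketch below uses a hypothetical `build_prompt` helper and a made-up sentiment task to show this; the prompt template itself is an assumption, not a required format.

```python
def build_prompt(task: str, examples: list, query: str) -> str:
    """Place zero, one, or several worked examples before the actual query."""
    parts = []
    for text, label in examples:
        parts.append(f"{task}\nReview: {text}\nSentiment: {label}\n")
    # The final block is left open so the model completes the sentiment.
    parts.append(f"{task}\nReview: {query}\nSentiment:")
    return "\n".join(parts)


TASK = "Classify this review."
QUERY = "I loved this movie!"
EXAMPLES = [("Great acting.", "Positive"), ("Too long and dull.", "Negative")]

zero_shot = build_prompt(TASK, [], QUERY)
one_shot = build_prompt(TASK, EXAMPLES[:1], QUERY)
few_shot = build_prompt(TASK, EXAMPLES, QUERY)
print(few_shot)
```

Every example consumes part of the context window, so the number of shots is limited by the model's maximum input size.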

# Generative configuration

- **Max new tokens**: limits the number of tokens the model generates in a completion

Greedy decoding (greedy sampling) means that the model always chooses the token in the vocabulary with the highest calculated probability. Random(-weighted) sampling selects a token at random, weighted by the probabilities across all tokens, which results in more creative, less repetitive output. Top K and Top P are methods to limit the randomness and avoid completely bizarre outputs.

- **Sample top K**: limit the model to only choose from the K most probable options instead of the entire vocabulary
- **Sample top P**: limit the model to only choose from the most probable options whose cumulative probability does not exceed P
- **Temperature**: influences the shape of the probability distribution the model uses to select a token. The higher the temperature, the higher the randomness
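
These decoding strategies can be sketched in plain Python. The vocabulary and probabilities below are made up, and the helper names are hypothetical. Libraries differ in the exact top P cut-off rule; this sketch follows the description above and always keeps at least one token.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature > 1 flattens, < 1 sharpens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero every token not in `keep` and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(ranked[:k]))


def top_p_filter(probs, p):
    """Keep the most probable tokens while their cumulative probability stays within p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        if keep and cumulative + probs[i] > p:
            break
        keep.add(i)
        cumulative += probs[i]
    return renormalize(probs, keep)


def greedy(probs):
    """Greedy decoding: always pick the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample(probs, rng):
    """Random-weighted sampling over the (possibly filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


VOCAB = ["cake", "donut", "banana", "apple"]  # hypothetical 4-token vocabulary
probs = [0.5, 0.3, 0.15, 0.05]                # made-up next-token probabilities
rng = random.Random(0)

print(VOCAB[greedy(probs)])
print(VOCAB[sample(top_k_filter(probs, k=2), rng)])
print(VOCAB[sample(top_p_filter(probs, p=0.85), rng)])
```

With these numbers, both `k=2` and `p=0.85` restrict the choice to "cake" and "donut"; temperature is applied earlier, when logits are converted to probabilities in `softmax`.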

# Training LLMs

**Encoder-only models or autoencoding models** are trained using Masked Language Modeling (MLM). MLM takes a sentence, removes or 'masks' random tokens, and lets the model predict them. The objective is to reconstruct the original text, which is called 'denoising'. The models look at the sentence bidirectionally (from left to right and from right to left) to take in the entire context of the sentence before predicting the masked token. Good use cases for these models:

- Sentiment analysis
- Named entity recognition
- Word classification
- Example models: BERT and RoBERTa
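
A minimal sketch of how MLM training pairs could be built is shown below. The `mask_tokens` helper, the `<MASK>` string, and the masking probability are assumptions for illustration; real implementations add further refinements.

```python
import random

MASK = "<MASK>"  # hypothetical mask token


def mask_tokens(tokens: list, mask_prob: float, rng: random.Random):
    """Return (masked input, labels). Labels hold the original token only where it was masked."""
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)  # the model must reconstruct this token
        else:
            masked.append(token)
            labels.append(None)   # no prediction is scored at unmasked positions
    return masked, labels


tokens = "the teacher teaches the student".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
print(masked)
print(labels)
```

Because the model sees the tokens on both sides of each mask, it can use the full bidirectional context to make its prediction.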

**Decoder-only models or autoregressive models** are trained using Causal Language Modeling (CLM). The objective is to predict the next token based on the previous sequence of tokens. The model works only from left to right: all tokens after the current position are masked, so the model has no knowledge or context of the tokens that follow. Good use cases for these models:

- Text generation
- Other emergent behaviors (these depend on the size of the model)
- Example models: GPT and BLOOM
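
The next-token objective can be sketched by listing the training pairs that a single sentence produces. The `causal_lm_examples` helper is hypothetical, and words stand in for tokens.

```python
def causal_lm_examples(tokens: list) -> list:
    """Each prefix is an input context; the token that follows it is the prediction target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the teacher teaches the student".split()
examples = causal_lm_examples(tokens)
for context, target in examples:
    print(context, "->", target)
```

Each context contains only the tokens to the left of the target, which is what makes the model unidirectional.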

**Encoder-decoder models or sequence-to-sequence models** are trained with objectives that vary from model to model. T5, for example, uses Span Corruption, where sequences of multiple tokens are masked and each masked sequence is replaced with a 'sentinel token' that represents it. Good use cases for these models:

- Translation
- Text summarization
- Question answering
- Example models: T5 and BART
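
Span corruption can be sketched as follows. The `corrupt_spans` helper is hypothetical, the sentinel names follow the `<extra_id_N>` style used by T5, and the target format is simplified.

```python
def corrupt_spans(tokens: list, spans: list):
    """Replace each (start, end) span with a sentinel token.

    The encoder input keeps the unmasked tokens plus sentinels; the decoder
    target lists each sentinel followed by the tokens it replaced.
    Spans must be sorted and non-overlapping; `end` is exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


tokens = "the teacher teaches the student".split()
inputs, targets = corrupt_spans(tokens, [(1, 3)])
print(inputs)
print(targets)
```

Here one sentinel stands in for the two-token span "teacher teaches", which the decoder has to reconstruct.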

## Quantization

Quantization reduces the memory required to store and train models by representing the weights with lower-precision numbers. Quantization-aware training (QAT) learns the quantization scaling factors during training.

Training a model of 1B parameters requires the following memory:

- 4 bytes for each parameter at 32-bit precision
- 8 bytes for 2 states in the Adam optimizer
- 4 bytes for the gradient
- 8 bytes for activations and temporary memory

Thus, each parameter requires about 24 bytes of memory during training. A 1B parameter model therefore requires a GPU with about 24 GB of memory.
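
The arithmetic above can be written out as a small calculation. The per-parameter byte counts are the approximate figures from the list above, and 1 GB is taken as 1e9 bytes; the function name is hypothetical.

```python
# Approximate bytes needed per parameter during full-precision training.
BYTES_PER_PARAMETER = {
    "weights (FP32)": 4,
    "Adam optimizer (2 states)": 8,
    "gradients": 4,
    "activations and temporary memory": 8,
}


def training_memory_gb(num_parameters: float) -> float:
    """Rough training memory in GB, with 1 GB = 1e9 bytes."""
    return num_parameters * sum(BYTES_PER_PARAMETER.values()) / 1e9


print(sum(BYTES_PER_PARAMETER.values()))  # 24
print(training_memory_gb(1e9))            # 24.0
```

Storing the weights alone takes only 4 of those 24 bytes, which is why a model that fits on a GPU for inference may still not fit for training.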

Quantization remaps 32-bit floating point (FP32) to 16-bit floating point (FP16) or 8-bit integer (INT8). BFLOAT16 (also written BF16 or BFP16, Brain Floating Point 16) was developed by the Google Brain team, is popular for LLMs, and is supported by newer NVIDIA GPUs such as the A100. BFP16 is also known as 'truncated FP32' because it keeps the full FP32 exponent range and simply truncates the fraction (precision) bits.

- **FP32**: 1 sign bit, 8 exponent bits, 23 fraction bits. Range: -3e38 to 3e38
- **FP16**: 1 sign bit, 5 exponent bits, 10 fraction bits. Range: -65k to 65k
- **BFP16**: 1 sign bit, 8 exponent bits, 7 fraction bits. Range: -3e38 to 3e38
- **INT8**: 1 sign bit, 0 exponent bits, 7 value bits. Range: -128 to 127
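
The 'truncated FP32' description can be illustrated with the standard `struct` module: keep the top 16 bits of the FP32 pattern (sign, 8 exponent bits, 7 fraction bits) and drop the rest. This is a sketch of the bit layout only; the helper names are hypothetical, and real hardware typically rounds rather than simply truncating.

```python
import struct


def fp32_bits(value: float) -> int:
    """Return the 32-bit IEEE 754 pattern of a value stored as FP32."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def to_bfloat16(value: float) -> float:
    """Keep the sign, the 8 exponent bits, and the top 7 fraction bits; zero the other 16 bits."""
    truncated = fp32_bits(value) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


pi = 3.141592653589793
print(to_bfloat16(pi))      # 3.140625: only about 3 significant digits survive
print(to_bfloat16(3.0e38))  # still finite: the FP32 range is preserved
```

The same value of 3.0e38 is far outside the FP16 range of roughly ±65k, which is the main reason the wider exponent is attractive for training.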

FLAN-T5 uses BFP16.

A 500B parameter model requires about 12,000 GB at 32-bit full precision (!). This cannot be done on a single GPU: you need multiple GPUs in parallel (sometimes hundreds!).

PyTorch options for multi-GPU training:

- Distributed Data Parallel (DDP)
- Fully Sharded Data Parallel (FSDP), which builds on the ZeRO technique

## Compute performance

1 "petaFLOP/s-day" = the number of floating point operations performed at a rate of 1 petaFLOP per second for one full day. 1 petaFLOP/s = 1,000,000,000,000,000 (one quadrillion) floating point operations per second.

1 petaFLOP/s-day ≈ 8 x NVIDIA V100 running at full efficiency for 24 hours ≈ 2 x NVIDIA A100 running for 24 hours

Training GPT-3 175B requires on the order of 5,000 petaFLOP/s-days (a round figure; the GPT-3 paper reports about 3,640), which corresponds to:
- 5,000 days at a sustained 1 petaFLOP/s: 2 x A100
- 50 days at a sustained 100 petaFLOP/s: 200 x A100
- 5 days at a sustained 1,000 petaFLOP/s: 2,000 x A100
- 0.5 days at a sustained 10,000 petaFLOP/s: 20,000 x A100
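
The list above is a simple division of the total compute budget by the sustained rate. The sketch below reproduces it using the round numbers from these notes (5,000 petaFLOP/s-days and 2 A100s per petaFLOP/s) and assumes perfect scaling across GPUs, which real clusters do not achieve.

```python
TOTAL_PFLOPS_DAYS = 5_000   # round training budget used in these notes
A100_PER_PFLOPS = 2         # rule of thumb from these notes: 2 x A100 per petaFLOP/s


def training_days(num_a100: int) -> float:
    """Days needed if every GPU runs at full efficiency and scaling is perfect."""
    sustained_pflops = num_a100 / A100_PER_PFLOPS
    return TOTAL_PFLOPS_DAYS / sustained_pflops


for gpus in (2, 200, 2_000, 20_000):
    print(f"{gpus:>6} x A100: {training_days(gpus)} days")
```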


Iron triangle:
- data set size
- model size
- compute budget

The Chinchilla scaling laws suggest that the compute-optimal training data set size is about 20 tokens per parameter. For example, a 50B parameter model needs about 50B x 20 = 1T training tokens. In reality, many of the very large models are *undertrained* because they have not been fed enough training data.
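
The rule of thumb is a single multiplication. The sketch below assumes the factor of 20 stated above; the function name is hypothetical.

```python
TOKENS_PER_PARAMETER = 20  # approximate ratio stated above


def compute_optimal_tokens(num_parameters: int) -> int:
    """Approximate number of training tokens for a compute-optimal model."""
    return num_parameters * TOKENS_PER_PARAMETER


tokens_needed = compute_optimal_tokens(50_000_000_000)
print(f"{tokens_needed:,} tokens")  # 1,000,000,000,000 tokens
```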

# Fine tuning

Fine-tuning is a supervised learning process in which the model is fed labeled examples (prompt-completion pairs) to update its weights.

Instruction fine-tuning trains the model on examples that demonstrate how it should respond to a specific instruction. The examples look like those used in prompt engineering, but here they are used to update the weights. Full fine-tuning is a fine-tuning process in which all of the model's parameters are updated; it results in a new version of the model with updated weights. Like pre-training, this approach needs enough memory and compute to store and update all weights, gradients, and optimizer states.

There are templates and tools available for building these fine-tuning prompts. As few as 500-1,000 examples can start to yield decent results.

One of the downsides of full fine-tuning is that you get a completely new version of the original LLM for every fine-tuned task, which can be prohibitive since every LLM is typically many GBs in size. Using parameter-efficient fine-tuning where possible addresses this downside.

## FLAN models

FLAN = Fine-tuned Language Net. A specific set of instructions was used to fine-tune these models. FLAN-T5 is the FLAN instruction-tuned version of the T5 base model. The FLAN paper is posted here: https://arxiv.org/abs/2210.11416


## Catastrophic forgetting
Fine-tuning can significantly increase the performance of a model on a specific task but at the same time have a severe negative impact on other tasks, because the weights of the model are being updated. Depending on the use case, catastrophic forgetting may not be an issue, for example when the model only needs to perform the one fine-tuned task.

Otherwise, fine-tune on multiple tasks at the same time, which requires larger data sets, or consider using **Parameter Efficient Fine Tuning (PEFT)**.

# Model evaluation metrics

Accuracy = correct predictions / total predictions.

This works well for traditional machine learning with supervised data sets, where the expected output is known and can be compared to the predicted value. For LLMs this is harder because the output is non-deterministic and many differently worded completions can be equally correct.

- "Mike really loves drinking tea" = "Mike adores sipping tea"
- "Mike does not drink coffee" != "Mike does drink coffee"

In both examples the two sentences differ by only a few words. However, the two tea sentences mean the same thing, whereas the two coffee sentences contradict each other, so simple word matching is not enough.

## Evaluation metrics for LLMs

- ROUGE: used for text summarization. Compares a summary to one or more reference summaries
- BLEU (Bilingual Evaluation Understudy) score: used for text translation. Compares to human-generated translations

These are both fairly simple methods, useful for initial evaluation and for evaluation during fine-tuning. For better insights, use common benchmarks:
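
As a rough illustration of how such a score works, the sketch below computes ROUGE-1 (unigram overlap) by hand for the tea and coffee sentences from earlier in these notes. It is a simplified, hand-rolled version with a hypothetical `rouge_1` helper; real evaluations use an established library and additional variants such as ROUGE-2 and ROUGE-L.

```python
from collections import Counter


def rouge_1(reference: str, generated: str) -> dict:
    """Unigram overlap between a reference and a generated text."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    overlap = sum((ref_counts & gen_counts).values())  # clipped unigram matches
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(gen_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


tea = rouge_1("Mike really loves drinking tea", "Mike adores sipping tea")
coffee = rouge_1("Mike does not drink coffee", "Mike does drink coffee")
print(tea)
print(coffee)
```

The contradictory coffee pair scores higher than the equivalent tea pair, which shows why these simple metrics are only a first diagnostic.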

- GLUE: General Language Understanding Evaluation (2018)
- SuperGLUE (2019)
- HELM: Holistic Evaluation of Language Models
- BIG-bench (2022)
- MMLU: Massive Multitask Language Understanding (2021)


# Parameter Efficient Fine-Tuning (PEFT)

Fine-tuning the entire parameter space of an LLM might be cost prohibitive. More efficient tuning can be done by either:

- Frozen Weights: fix a large part of the parameter space and only allow changes to a subset of the parameters (15-20% of total LLM weights)
- Additional Layers: add additional layers to the model and keep the parameters of the original layers fixed, only train the weights in the new layers

PEFT is less prone to catastrophic forgetting because only a small number of weights get updated.

PEFT is also more efficient for model storage. Each LLM is typically many GBs in size. With PEFT, only the delta of the newly trained parameters needs to be stored (typically on the order of MBs). These PEFT weights are combined with the original LLM weights at inference.

PEFT methods:

- Select subset of initial parameters to fine-tune
- Reparameterize model weights using a low-rank representation (LoRA)
- Add trainable layers or parameters to the model
    - Adapters
    - Soft Prompts: Prompt Tuning

## LoRA - Low Rank Adaptation

- Freeze most of the original LLM weights
- Inject a pair of low-rank decomposition matrices alongside the frozen weights
- Train only the weights of these smaller matrices
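
The saving can be sketched with a parameter count and a tiny merge of `W + B @ A`. The dimensions (a 512 x 64 weight matrix, rank 8) and the helper names are assumptions chosen for illustration; matrices are plain lists of rows to keep the example dependency-free.

```python
def lora_parameter_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k


def matmul(left: list, right: list) -> list:
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*right)] for row in left]


def merge(w: list, b: list, a: list) -> list:
    """Return W + B @ A: frozen weights plus the trained low-rank update."""
    update = matmul(b, a)
    return [[w_ij + u_ij for w_ij, u_ij in zip(w_row, u_row)] for w_row, u_row in zip(w, update)]


full = 512 * 64
lora = lora_parameter_count(d=512, k=64, r=8)
print(full, lora, f"{1 - lora / full:.0%} fewer trainable parameters")

# Tiny rank-1 example: W is 2 x 2, B is 2 x 1, A is 1 x 2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.5]]
merged = merge(W, B, A)
print(merged)
```

Because the update has the same shape as `W`, it can be merged into the original weights before inference, and a different pair of small matrices can be swapped in per task.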

## Prompt tuning

Prompt tuning != prompt engineering.

Prompt tuning adds trainable 'soft prompts' that are prepended to the embedded prompt, typically 20-100 virtual tokens. Only these virtual tokens are trained; they do not need to map to specific words in the embedding space.


# RLHF - Reinforcement Learning from Human Feedback

Models behaving badly:
- toxic language
- aggressive responses
- providing dangerous information

HHH principles, completions should be:
- Helpful
- Honest
- Harmless

## RAG - Retrieval Augmented Generation

Lets an LLM access external data sources at inference time.
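
A minimal sketch of the idea: retrieve the most relevant external text and place it in the prompt before the question. The documents, the word-overlap scoring, and the prompt template are all hypothetical; real systems typically retrieve by embedding similarity from a vector store.

```python
import re

# Hypothetical external knowledge the model was never trained on.
DOCUMENTS = [
    "The return policy allows refunds within 30 days of purchase.",
    "Our support line is open on weekdays from 9 to 5.",
    "Shipping to Europe takes about one week.",
]


def words(text: str) -> set:
    """Lowercased set of words, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, documents: list, top_n: int = 1) -> list:
    """Rank documents by how many words they share with the query."""
    query_words = words(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & words(doc)), reverse=True)
    return ranked[:top_n]


def build_augmented_prompt(query: str, documents: list) -> str:
    """Prepend the retrieved text so the LLM can ground its completion in it."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


prompt = build_augmented_prompt("What is the return policy?", DOCUMENTS)
print(prompt)
```

The model's weights stay unchanged; only the prompt is enriched, so the external data can be updated without retraining.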

## PAL - Program Aided Language Model

Pairs an LLM with an external code interpreter to perform calculations (typically Python)

## ReAct - Reasoning and Acting
Paper here: https://arxiv.org/abs/2210.03629

## LangChain

An orchestration framework for building applications that use techniques such as RAG, PAL and ReAct.


