# Large Language Models

### Outline

- LLMs and Transformers

- Characteristics of LLMs: Emergent Properties

- Pipeline of building an LLM

- Engineering considerations

### Large Language Models and Transformers

- Most LLMs are **decoder-only transformers**, i.e., they autoregressively predict the next token given the preceding tokens
    * Causal masking prevents each token from attending to future tokens

- **Lots of parameters**: GPT-3 has 175B parameters
    * 96 attention layers
    * 96 attention heads per layer, each of dimension 128
    * 12,288-dimensional token embeddings

- **Lots of data**: GPT-3 is trained on 300B tokens

- **Lots of compute**: an estimated 1024 GPUs for 34 days, costing about USD 4.6M

- **Lots of research** and hyperparameter choices to make 
    - Normalization type
    - Normalization position
    - Activation function
    - Learning rate 
    - ...

- **Lots of engineering** to maximize training throughput
    * Distributed training 
    * Memory optimization
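
As a rough sanity check of the numbers above, the parameter count of a decoder-only transformer is often approximated as $12 \cdot n_{layers} \cdot d_{model}^2$ (attention plus feed-forward weights, ignoring biases and normalization), and the training compute as roughly $6ND$ FLOPs. Both are rules of thumb, not exact accounting, and the vocabulary size of 50,257 is an assumption taken from the GPT-2/GPT-3 tokenizer rather than from this slide.

```python
def approx_transformer_params(n_layers, d_model, vocab_size=0):
    """Rule of thumb: 4*d^2 (attention) + 8*d^2 (feed-forward) weights per layer."""
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model  # token embedding table (assumed vocabulary size)
    return n_layers * per_layer + embeddings


def approx_training_flops(n_params, n_tokens):
    """Rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


gpt3_params = approx_transformer_params(n_layers=96, d_model=12288, vocab_size=50257)
gpt3_flops = approx_training_flops(n_params=175e9, n_tokens=300e9)
print(f"parameters: {gpt3_params / 1e9:.1f}B, training FLOPs: {gpt3_flops:.2e}")
```

The estimate lands close to the quoted 175B parameters, which suggests the layer count and embedding width account for almost all of the model size.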


### Characteristics of LLMs: Scaling Laws for pre-trained models

- **Scaling laws**: Empirical investigation of how loss decreases with model parameters ($N$), dataset size ($D$), and compute ($C$)

    * Help in choosing the model size given the available dataset size and compute
    * Many laws have been proposed, e.g., Kaplan et al. (2020)
    
    $$L(N) = \big(\frac{N_c}{N}\big)^{\alpha_N}, \alpha_N \sim 0.076, N_c \sim 8.8 \times 10^{13}$$
    
    $$L(D) = \big(\frac{D_c}{D}\big)^{\alpha_D}, \alpha_D \sim 0.095, D_c \sim 5.4 \times 10^{13}$$
    
    <img src="imgs/scaling-laws.png">
    
    * General agreement that loss decreases with model size
    * Other scaling laws exist as well, e.g., Hoffmann et al. (2022)
    
- **Caveat**: These laws are observed for decoder-only architecture
    
[[1]](https://arxiv.org/pdf/2001.08361) Kaplan et al. (2020) Scaling Laws for Neural Language Models

[[2]](https://arxiv.org/abs/2203.15556) Hoffmann et al. (2022) Training Compute-Optimal Large Language Models
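
The two power laws from Kaplan et al. can be evaluated directly. The sketch below only plugs numbers into the fitted forms $L(N)$ and $L(D)$ quoted on this slide; the function names are illustrative, and the constants apply only to the setup studied in that paper.

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) = (N_c / N)^alpha_N, the model-size law quoted above."""
    return (n_c / n_params) ** alpha_n


def loss_from_data(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """L(D) = (D_c / D)^alpha_D, the dataset-size law quoted above."""
    return (d_c / n_tokens) ** alpha_d


for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}  ->  L(N) = {loss_from_params(n):.3f}")

# Under this fit, every 10x increase in N multiplies the loss by 10^(-alpha_N).
ratio = loss_from_params(1e10) / loss_from_params(1e9)
print(f"loss ratio for 10x more parameters: {ratio:.3f}")
```

Because the exponents are small, each tenfold increase in parameters or data buys only a modest reduction in loss, which is why scaling requires orders of magnitude more resources.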

### Characteristics of LLMs: Emergent properties

- Larger models exhibit abilities that they were not explicitly trained for
    * No definitive theory as to why it happens

- **Instruction tuning** [1, 2]
    * Pre-trained language models (PLMs) are fine-tuned on a smaller, curated dataset of tasks phrased as instructions
    * Existing NLP benchmark datasets are reformatted so that each input is paired with a natural language task instruction
    * Results in lower loss and better generalization to unseen tasks
    
    <img width=750 src="imgs/flan_schematic.png">

[[1]](https://arxiv.org/abs/2109.01652) Wei et al. (2022) Finetuned Language Models Are Zero-Shot Learners

[[2]](https://arxiv.org/abs/2210.11416) Chung et al. (2022) Scaling Instruction-Finetuned Language Models

### Characteristics of LLMs: Emergent properties

- **In-context learning** or **Prompt engineering** (from users' perspective) [1]: 
    * Zero-shot learning: LLMs can perform a task from its description alone, without any examples of that task
    * Few-shot learning: LLMs can learn a task from a few demonstrations placed in the prompt, without any weight updates
    
    * This ability unlocks a new paradigm for building machine learning models: a task such as spam classification, for which collecting appropriate data could take months, can be prototyped with a prompt

    <img width=750 src="imgs/in-context-learning.png">

[[1]](https://arxiv.org/abs/2005.14165) Brown et al. (2020) Language models are few-shot learners
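
To make the zero-shot and few-shot settings concrete, here is a minimal sketch of how such a prompt might be assembled for the spam example. The template (`Message:` / `Label:`) and the function name are assumptions for illustration; real prompts vary by model and task.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a prompt; an empty demonstration list gives the zero-shot case."""
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines += [f"Message: {text}", f"Label: {label}", ""]
    # The final label is left blank for the model to complete.
    lines += [f"Message: {query}", "Label:"]
    return "\n".join(lines)


demos = [
    ("You won a free cruise, click here!", "spam"),
    ("Lunch at noon tomorrow?", "not spam"),
]
prompt = build_few_shot_prompt(
    "Classify each message as spam or not spam.", demos, "Claim your prize now"
)
print(prompt)
```

No parameters are updated in either setting: the demonstrations only change the text the model conditions on.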


### Characteristics of LLMs: Emergent properties

- **Chain-of-thought reasoning** [1]: 
    * Complex reasoning tasks are difficult for PLMs
    * LLMs can be prompted to reason through their answers, e.g., "Let's think step by step"
    
    <img width=750 src="imgs/cot.png">

[[1]](https://arxiv.org/abs/2201.11903) Wei et al. (2022) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models


### LLMs from start to finish

<img src="imgs/llm-outline.png">

### Pre-training LLM: Data Collection


<img src="imgs/data-pre-processing.png">

[[1]](https://arxiv.org/abs/2303.18223) Zhao et al. (2023) A survey of large language models

### Pre-training LLM: Data Collection


- **Corpus selection**

<img src="imgs/corpus.png">


[[1]](https://arxiv.org/abs/2303.18223) Zhao et al. (2023) A survey of large language models

### Pre-training LLM: Data Collection


<img src="imgs/data-pre-processing.png">

- **Filter** for good quality documents

- **De-duplication**: Remove duplicates at sentence level and document level. 
    * Important to ensure that there is no leak between training and validation set

- **Privacy redaction**: Remove any personally identifiable information (PII)

- **Tokenization**: Converting raw text into individual tokens that are fed as a sequence into the model
    * Several strategies have been proposed, e.g., Byte-Pair Encoding [2], WordPiece [3], Unigram [4]
    * It plays an important role in what LLMs can learn
    * The tokenizer vocabulary is itself learned from the corpus rather than hand-designed, e.g., the merge rules in Byte-Pair Encoding or the subword vocabulary in WordPiece
    * Library: [SentencePiece](https://github.com/google/sentencepiece) [4]
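
Document-level de-duplication can be sketched with exact hashing of normalized text. This is only an illustration of the idea: the normalization rule (lowercasing and collapsing whitespace) is an assumption, and production pipelines typically add fuzzy matching to catch near-duplicates as well.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["Hello  world", "hello world", "A different document"]
print(deduplicate(corpus))
```

Running the same check across the training and validation sets is one way to detect the leakage mentioned above.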

[[1]](https://arxiv.org/abs/2303.18223) Zhao et al. (2023) A survey of large language models

[[2]](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt) Byte-pair Encoding tokenization Tutorial

[[3]](https://arxiv.org/abs/2012.15524) Song et al. (2020) Fast WordPiece Tokenization

[[4]](https://arxiv.org/abs/1808.06226) Kudo et al. (2018) SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
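
The learning step of Byte-Pair Encoding can be illustrated on a toy corpus: repeatedly count adjacent symbol pairs and merge the most frequent one. This is a simplified sketch (characters as initial symbols, no end-of-word marker, a made-up word-frequency table), not the implementation used by any particular library.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merge rules and the final segmented vocabulary."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2)
print(merges)
print(vocab)
```

After two merges the frequent suffix shared by "newest" and "widest" becomes a single token, which shows how the vocabulary adapts to the corpus.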

### Pre-training LLM: Model Setup

- **Transformer architecture**
    * Encoder-only BERT [1] was one of the earliest models
    * Very few architectures are encoder-decoder models, e.g., T5 [2]
    * Most modern LLMs are decoder-only
    
- **Attention mechanism**: Memory footprint and computation speed are the major concerns, since standard attention scales quadratically with sequence length. Several works address this, e.g., FlashAttention, Performer, Sparse Attention
    * Prefer sub-quadratic mechanisms where possible
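
For reference, here is a minimal NumPy sketch of standard single-head causal self-attention, the baseline that the methods above try to make cheaper. It is not an implementation of FlashAttention, Performer, or sparse attention; it only shows where the $T \times T$ score matrix and the causal mask of a decoder-only model come from.

```python
import numpy as np


def causal_self_attention(q, k, v):
    """Single-head attention over (T, d) arrays with a causal mask."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # (T, T): memory and compute grow quadratically in T
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)  # block attention to future tokens
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out, weights = causal_self_attention(x, x, x)
print(np.round(weights, 2))
```

The printed weight matrix is lower triangular: each position attends only to itself and to earlier positions.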


[[1]](https://arxiv.org/abs/1810.04805) Devlin et al. (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    
    
[[2]](https://arxiv.org/abs/1910.10683) Raffel et al. (2019) Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer


### Pre-training LLM: Model Setup

- **Positional Encodings**: Without positional encodings, the tokens would be treated as a set rather than a sequence
    * Several options to embed positions, e.g., absolute position embedding, relative position embedding, Rotary Position Embedding (RoPE), ALiBi
    * **Extrapolation**: The ability of LLMs to handle sequences longer than those seen during training
    * ALiBi is often preferred for extrapolation, although RoPE and T5 bias have also been shown to exhibit it

- **Loss function**
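    * Language modeling is the main objective and the most commonly used; the log-likelihood below is maximized (equivalently, its negative is minimized as the loss)
        $$ \mathcal{L}_{LM} = \sum_{i=1}^{N}\log P_{\theta}(x_i \mid x_{<i})$$
    * Denoising autoencoding is also used by some LLMs, e.g., T5 [1]. Not easy to implement in decoder-only models. Here $x^{'}$ denotes the corrupted spans to be recovered and $x - x^{'}$ the remaining text
        $$ \mathcal{L}_{DAE} = \log P_{\theta}(x^{'} \mid x - x^{'})$$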
    
    * Mixture-of-Denoisers: Uses $\mathcal{L}_{LM}$ and $\mathcal{L}_{DAE}$ with different levels of corruption, e.g., PaLM 2, the successor of PaLM [2]


[[1]](https://arxiv.org/abs/1910.10683) Raffel et al. (2019) Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

[[2]](https://arxiv.org/abs/2204.02311) Chowdhery et al. (2022) PaLM: Scaling Language Modeling with Pathways
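
The language modeling objective $\mathcal{L}_{LM}$ can be computed from model outputs as follows. This is a sketch under the assumption that row $i$ of `logits` already holds the model's scores for token $x_i$ given $x_{<i}$; the random logits stand in for a real model.

```python
import numpy as np


def lm_log_likelihood(logits, tokens):
    """Sum of log P(x_i | x_<i) for logits of shape (T, V) and tokens of shape (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(log_probs[np.arange(len(tokens)), tokens].sum())


rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))  # stand-in for model outputs: 6 positions, 10 tokens
tokens = np.array([3, 1, 4, 1, 5, 9])
log_likelihood = lm_log_likelihood(logits, tokens)
print(f"log-likelihood: {log_likelihood:.3f}, loss per token: {-log_likelihood / len(tokens):.3f}")
```

Training frameworks usually minimize the negative of this quantity averaged over tokens, i.e., the cross-entropy.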


### Pre-training LLM: Model Setup

- Other choices
    * **Normalization**:
        * LayerNorm was used in the original Transformer
        * RMSNorm [1] and DeepNorm [2] have been proposed to make training faster and more stable
    * **Position of normalization**: Should the normalization come before or after each sublayer (attention or feedforward network)?
        * Post-norm ($\text{Norm}(x + \text{FeedForward}(x))$, the original "Add & Norm") was proposed first
        * Pre-norm ($\text{FeedForward}(\text{Norm}(x)) + x$) seems to work better
    * **Activation function**: GeLU [3] has been the most commonly used activation function

[[1]](https://arxiv.org/abs/1910.07467) Zhang et al. (2019) Root Mean Square Layer Normalization

[[2]](https://arxiv.org/abs/2203.00555) Wang et al. (2022) DeepNet: Scaling Transformers to 1,000 Layers

[[3]](https://arxiv.org/abs/1606.08415) Hendrycks et al. (2016) Gaussian Error Linear Units
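
The two placements of normalization can be compared in a few lines of NumPy. This is a simplified sketch: learnable gains and biases are omitted, GeLU uses the common tanh approximation, and only the feedforward sublayer is shown.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Center and scale each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def rms_norm(x, eps=1e-6):
    """Scale each row by its root mean square, without centering."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)


def gelu(x):
    """Tanh approximation of the Gaussian Error Linear Unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def feed_forward(x, w_in, w_out):
    return gelu(x @ w_in) @ w_out


def post_norm_block(x, w_in, w_out):
    """Original placement: Norm(x + FeedForward(x))."""
    return layer_norm(x + feed_forward(x, w_in, w_out))


def pre_norm_block(x, w_in, w_out):
    """Pre-norm placement: x + FeedForward(Norm(x)); the residual path is untouched."""
    return x + feed_forward(rms_norm(x), w_in, w_out)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
w_in, w_out = rng.normal(size=(16, 64)) * 0.1, rng.normal(size=(64, 16)) * 0.1
print(post_norm_block(x, w_in, w_out).shape, pre_norm_block(x, w_in, w_out).shape)
```

In the pre-norm block the input passes through the residual connection unchanged, which is one common explanation for its more stable training in deep stacks.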

### Pre-training LLM: Training

- **Batch size**: 
    * Standard size: 2048 examples or 4M tokens
    * Empirically, a schedule that gradually increases the batch size during training has been shown to stabilize the training

- **Learning rate**: Warm-up followed by cosine decay

- **Optimizer**: Adam, AdamW (GPT-3), Adafactor (PaLM, T5)

- Other tricks to stabilize the training
    * **Weight decay**: Most LLMs have been trained with a value of 0.1
    * **Gradient clipping**: Rescale the gradients whenever their norm exceeds a threshold, typically 1.0
    * **Loss spikes**: Recover from these by restarting from a checkpoint saved before the spike occurred and skipping the data that may have caused it
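
The learning-rate schedule and gradient clipping described above can be sketched as follows. The linear shape of the warm-up, the step counts, and the choice to clip by the global norm over all parameters are assumptions for illustration; the slide only specifies warm-up followed by cosine decay and a clipping threshold of 1.0.

```python
import math

import numpy as np


def learning_rate(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


for step in (0, 99, 100, 550, 1000):
    print(step, f"{learning_rate(step, max_lr=3e-4, warmup_steps=100, total_steps=1000):.2e}")

clipped, norm_before = clip_by_global_norm([np.array([3.0]), np.array([4.0])])
print(norm_before, clipped)
```

Clipping rescales every gradient by the same factor, so the update direction is preserved while its magnitude is bounded.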


### Pre-training LLM: Training
   
- Scaling up using **3D Parallelism**
    * **Data parallelism**: Replicate the model parameters and optimizer states across multiple GPUs and split each batch among the replicas
    * **Pipeline parallelism**: Distribute different layers of the LLM over multiple GPUs
    * **Tensor parallelism**: Split individual weight matrices so that each matrix multiplication is distributed over multiple GPUs
    * All three can be combined, e.g., BLOOM was trained on 384 GPUs with 8-way data, 4-way tensor, and 12-way pipeline parallelism [2]
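
The three degrees of parallelism multiply to give the number of GPUs, and every GPU can be identified by a triple of coordinates. The sketch below uses one possible rank ordering (tensor-parallel index varying fastest); this ordering and the function name are assumptions, and actual frameworks define their own layouts.

```python
def rank_to_coords(rank, data_parallel, tensor_parallel, pipeline_parallel):
    """Map a flat GPU rank to (data, pipeline, tensor) coordinates."""
    world_size = data_parallel * tensor_parallel * pipeline_parallel
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tensor_rank = rank % tensor_parallel
    pipeline_rank = (rank // tensor_parallel) % pipeline_parallel
    data_rank = rank // (tensor_parallel * pipeline_parallel)
    return data_rank, pipeline_rank, tensor_rank


DP, TP, PP = 8, 4, 12  # degrees quoted above for a 384-GPU run
coords = [rank_to_coords(r, DP, TP, PP) for r in range(DP * TP * PP)]
print(len(coords), coords[0], coords[-1])
```

GPUs sharing the same data and pipeline coordinates hold shards of the same layers, GPUs sharing the same data coordinate form one model replica, and replicas differ only in the data they see.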


Libraries that support all three forms of parallelism: [DeepSpeed](https://github.com/microsoft/DeepSpeed), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [Alpa](https://ai.googleblog.com/2022/05/alpa-automated-model-parallel-deep.html)

[[1]](https://arxiv.org/abs/2303.18223) Zhao et al. (2023) A survey of large language models

[[2]](https://arxiv.org/abs/2211.05100) BigScience Workshop (2022) BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

### Pre-training LLM: Training


- **Mixed-Precision Training**
    * 32-bit floating-point (FP32) numbers were the default for pre-training LMs
    * 16-bit floating-point (FP16) reduces memory usage and communication overhead
    * FP16 has been shown to result in loss of accuracy because of its limited numeric range
    * Brain Floating Point (BF16) allocates more exponent bits and fewer significand bits than FP16, trading precision for range
    * BF16 is widely used for pre-training LMs
    
- GPT-4 reportedly uses **much smaller models to predict the performance of the large model** before training it, so that unpromising runs can be stopped early
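
The range versus precision trade-off between FP16 and BF16 can be observed directly. NumPy has no native BF16 type, so the sketch below emulates it by keeping only the top 16 bits of an FP32 value (truncation rather than rounding, which is an assumption made for simplicity).

```python
import numpy as np


def to_bf16(x):
    """Emulate BF16 by zeroing the low 16 bits of FP32 (1 sign, 8 exponent, 7 fraction bits)."""
    x32 = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x32.view(np.uint32) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)


values = np.array([70000.0, 1.0 + 2.0 ** -10], dtype=np.float32)
with np.errstate(over="ignore"):  # the first value overflows in FP16
    as_fp16 = values.astype(np.float16)
as_bf16 = to_bf16(values)

print("FP32:", values)
print("FP16:", as_fp16)
print("BF16:", as_bf16)
```

The large value overflows to infinity in FP16 but survives in BF16 with reduced precision, while the small increment above 1.0 is kept by FP16 and lost by BF16.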


[[1]](https://arxiv.org/abs/2303.18223) Zhao et al. (2023) A survey of large language models

[[2]](https://arxiv.org/abs/2211.05100) BigScience Workshop (2022) BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

### Adaptation: *Instruction* Tuning (Data Collection)

- Data collection
    * Pre-defined NLP benchmarks formatted with natural language task descriptions
    * Collect from human interactions through chat
    * Synthetically generate using LLMs
    * The tasks need to be balanced so that no single type of task dominates the dataset
    
- Considerations
    * Too many examples or too few examples for a task could create problems in training LLMs
    * The formatting of the natural language task description matters, e.g., adding demonstrations for the task helps, whereas incorporating suggestions or things to avoid may hurt performance
    * Diversity and quality of instructions are very important
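
A benchmark example can be turned into an instruction-tuning example with a simple template, and a per-task cap is one straightforward way to keep the mixture balanced. The template, field names, and cap value below are hypothetical choices for illustration.

```python
from collections import defaultdict


def format_instruction_example(instruction, input_text, output_text, demonstrations=()):
    """Build a prompt/target pair: task description, optional demonstrations, then the input."""
    parts = [f"Instruction: {instruction}"]
    for demo_input, demo_output in demonstrations:
        parts.append(f"Input: {demo_input}\nOutput: {demo_output}")
    parts.append(f"Input: {input_text}\nOutput:")
    return {"prompt": "\n\n".join(parts), "target": " " + output_text}


def cap_examples_per_task(examples, max_per_task):
    """Keep at most max_per_task examples of each task so no task dominates."""
    kept, counts = [], defaultdict(int)
    for example in examples:
        if counts[example["task"]] < max_per_task:
            counts[example["task"]] += 1
            kept.append(example)
    return kept


example = format_instruction_example(
    "Decide whether the review is positive or negative.",
    "The plot was dull and far too long.",
    "negative",
    demonstrations=[("A delightful surprise from start to finish.", "positive")],
)
print(example["prompt"] + example["target"])

pool = [{"task": "sentiment"}] * 5 + [{"task": "translation"}] * 2
print(len(cap_examples_per_task(pool, max_per_task=3)))
```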


[[1]](https://arxiv.org/abs/2303.18223) Zhao et al. (2023) A survey of large language models

### Adaptation: *Alignment* Tuning (Data Collection)

- LLMs have been known to show unintended or harmful behaviors such as hallucinating, misleading, or being biased 

- How can we ensure that LLMs are aligned with human values, i.e., that they are helpful, honest, and harmless?

- We elicit responses from LLMs and penalize the ones that exhibit such unintended behaviors

- **Data Collection**
    * Collect LLM's responses for various prompts
    * Manually score them
        * Ranking approach: the human annotator ranks several responses to the same prompt
        * Rating approach: the human annotator scores the responses against some criterion
    * **Reward Model**: Train an LM (e.g., encoder-only) to predict the same ratings as humans

- Algorithm for fine-tuning: LLMs can be fine-tuned using reinforcement learning algorithms such as PPO [2] to produce outputs that maximize RM's output, i.e., reward
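
When the human feedback comes as rankings, one common way to train the reward model is a pairwise loss that pushes the reward of the preferred response above that of the rejected one. The slide does not specify a loss, so the form below, $-\log \sigma(r_{preferred} - r_{rejected})$, is an assumption, and the reward values are made up.

```python
import numpy as np


def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over response pairs."""
    margin = np.asarray(reward_preferred, dtype=float) - np.asarray(reward_rejected, dtype=float)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); logaddexp evaluates it stably.
    return float(np.mean(np.logaddexp(0.0, -margin)))


print(pairwise_ranking_loss([2.0, 1.5], [0.5, -1.0]))  # rankings respected: small loss
print(pairwise_ranking_loss([0.5, -1.0], [2.0, 1.5]))  # rankings violated: large loss
```

The trained reward model then supplies the scalar reward that the reinforcement learning step maximizes.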

[[1]](https://arxiv.org/abs/2303.18223) Zhao et al. (2023) A survey of large language models

[[2]](https://arxiv.org/abs/1707.06347) Schulman et al. (2017) Proximal Policy Optimization Algorithms

### Adaptation: Fine-tuning methods

- Full parameter tuning is computationally expensive

- Parameter-Efficient Fine-Tuning (PEFT) methods fine-tune LLMs by learning only a small set of parameters that modify the behavior of the original LLM
    * Add a few fine-tunable parameters while keeping the original weights frozen, thereby drastically reducing the number of trainable parameters

- Several techniques have been proposed
    * **Prefix tuning [1]**: Task-specific virtual tokens are learned and prepended as a prefix at every layer
    * **Prompt tuning [2]**: Task-specific prompt tokens are learned at the input layer only
    * **Low-Rank Adaptation (LoRA) [3]**: Add low-rank "update matrices" to attention blocks. Once trained, they can be merged into the original weights for inference

    <img src="imgs/peft.png">


Library: [PEFT](https://github.com/huggingface/peft) implements these fine-tuning methods


[[1]](https://arxiv.org/abs/2101.00190) Li et al. (2021) Prefix-Tuning: Optimizing Continuous Prompts for Generation

[[2]](https://arxiv.org/abs/2104.08691) Lester et al. (2021) The Power of Scale for Parameter-Efficient Prompt Tuning

[[3]](https://arxiv.org/abs/2106.09685) Hu et al. (2021) LoRA: Low-Rank Adaptation of Large Language Models
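
The LoRA idea can be sketched in NumPy for a single linear layer: the frozen weight $W$ is augmented with a low-rank product $BA$, and after training the product is folded back into $W$. The layer sizes, rank, scaling factor, and random values standing in for trained weights are all assumptions for illustration; this does not use the PEFT library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable, small random init
B = np.zeros((d_out, rank))                    # trainable, zero init: no change at start


def lora_forward(x, W, A, B, alpha, rank):
    """Frozen path plus scaled low-rank update: x W^T + (alpha / r) x A^T B^T."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


def merge_lora(W, A, B, alpha, rank):
    """Fold the update into W so inference needs a single matrix."""
    return W + (alpha / rank) * B @ A


x = rng.normal(size=(3, d_in))
B_trained = rng.normal(scale=0.01, size=(d_out, rank))  # stand-in for trained values
y_adapter = lora_forward(x, W, A, B_trained, alpha, rank)
y_merged = x @ merge_lora(W, A, B_trained, alpha, rank).T

print("trainable:", A.size + B.size, "vs full:", W.size)
print("max difference after merging:", np.abs(y_adapter - y_merged).max())
```

With rank 4, the adapter trains 512 values instead of 4,096 for this layer, and merging means the adapted model has no extra latency at inference.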

### Usage: Memory-efficient Model Adaptation

- Larger models demand more memory at inference time

- Quantization methods have been proposed to reduce the size of these models

- LLMs have been shown to have outliers in their activations. This makes quantization difficult.

- Several approaches have been proposed
    * Mixed-precision decomposition [1]: *LLM.int8()* is the most-commonly used method to quantize LLMs
    * Layerwise quantization [2]
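
To see why activation outliers make quantization difficult, consider plain absmax quantization to 8-bit integers. This is a basic sketch of symmetric quantization, not the LLM.int8() or layerwise methods cited above, and the synthetic data and outlier value are assumptions.

```python
import numpy as np


def quantize_absmax_int8(x):
    """Symmetric quantization: scale so the largest magnitude maps to 127."""
    scale = max(float(np.max(np.abs(x))), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
activations = rng.normal(size=1000)
with_outlier = np.append(activations, 60.0)  # a single large activation

errors = {}
for name, x in (("no outlier", activations), ("with outlier", with_outlier)):
    q, scale = quantize_absmax_int8(x)
    errors[name] = float(np.mean(np.abs(x - dequantize(q, scale))))
    print(f"{name}: scale = {scale:.4f}, mean abs error = {errors[name]:.4f}")
```

A single outlier stretches the scale, so the remaining values are squeezed into a few integer levels and the error grows. Mixed-precision decomposition addresses this by handling the outlier dimensions separately in higher precision.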
    

Libraries: [GPTQ-for-LLaMA](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)


[[1]](https://arxiv.org/abs/2208.07339) Dettmers et al. (2022) LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

[[2]](https://openreview.net/forum?id=ksVGCOlOEba) Frantar et al. (2022) Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning