# Pre-trained Large Models (PLMs)

## Motivation

Training neural machine translation models from scratch is expensive

Are large bilingual corpora necessary?

Many task-specific translation models must be built with scarce resources

Pre-trained large models have proved to be useful in many scenarios, but in MT?

  * Combination of pre-trained encoders and pre-trained decoders?
  * Multilingual pretrained language models

## Background

PLMs usually refer to Transformer-based models with billions of parameters, trained on massive amounts of data

PLMs exhibit strong capacities to understand natural language and solve complex tasks such as machine translation

### Scaling laws

Model performance/capacity as a function of model and data size, and computational budget

Given a limited compute budget, the KM scaling law favors allocating more of the budget to model size than to data
size, while the Chinchilla scaling law argues that the two sizes should be scaled up in equal proportion

* Predictable scaling: loss trends observed on smaller models carry over to larger models
* Task-level scaling: a decrease in loss does not always imply better performance on downstream tasks
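
To make the difference between the two allocation rules concrete, here is a minimal sketch that assumes power-law allocations $N \propto C^{a}$ for model size and $D \propto C^{b}$ for data size. The exponents used (about 0.73/0.27 for KM and about 0.5/0.5 for Chinchilla) are approximate values as commonly reported for these laws and are treated here as assumptions, not exact constants; the function name `allocation_growth` is hypothetical.

```python
def allocation_growth(compute_multiplier, model_exp, data_exp):
    """Return how much model size and data size grow when compute grows.

    Assumes N ∝ C**model_exp and D ∝ C**data_exp (power-law allocation).
    """
    model_growth = compute_multiplier ** model_exp
    data_growth = compute_multiplier ** data_exp
    return model_growth, data_growth


# Approximate exponents (assumptions, not exact constants).
KM_EXPONENTS = (0.73, 0.27)        # favors model size
CHINCHILLA_EXPONENTS = (0.5, 0.5)  # equal scaling

for name, (a, b) in [("KM", KM_EXPONENTS), ("Chinchilla", CHINCHILLA_EXPONENTS)]:
    n, d = allocation_growth(10.0, a, b)
    print(f"{name}: 10x compute -> model x{n:.2f}, data x{d:.2f}")
```

Under these assumed exponents, a tenfold compute increase goes mostly to model size with the KM rule and is split evenly with the Chinchilla rule.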

### Emergent abilities

Abilities that are not present in small models but arise in large models

* In-context learning. Given a natural language instruction and/or several task demonstrations, the LM can generate the expected output for test instances by completing the word sequence of the input text, without additional training or gradient updates

* Instruction following. By fine-tuning with a mixture of multi-task datasets formatted via natural language descriptions without using explicit examples, LLMs are shown to perform well on unseen tasks that are also described in the form of instructions.
<!--  According to the experiments in [67], instruction-tuned LaMDA-PT [68] started to significantly
outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study [69] found that a model size of 62B is at least required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU). -->

* Chain-of-thought (CoT): Step-by-step reasoning to solve complex tasks such as mathematical word problems.
<!-- CoT prompting can bring performance gains (on arithmetic reasoning benchmarks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over the standard prompting becomes more evident when the model size exceeds 100B -->
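
In-context learning for translation amounts to formatting an instruction, a few demonstrations, and the test input as a single text that the model completes. The sketch below shows one possible layout; the function name, the `Source:`/`Target:` labels and the example sentences are all hypothetical choices, not a prescribed format.

```python
def build_few_shot_prompt(instruction, demonstrations, test_input):
    """Format an instruction, demonstrations and a test input as one prompt."""
    lines = [instruction, ""]
    for source, target in demonstrations:
        lines.append(f"Source: {source}")
        lines.append(f"Target: {target}")
        lines.append("")
    # The prompt ends with an open "Target:" so that the model completes it.
    lines.append(f"Source: {test_input}")
    lines.append("Target:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate from English to Spanish.",
    [("Good morning.", "Buenos días."), ("Thank you.", "Gracias.")],
    "See you tomorrow.",
)
print(prompt)
```

No parameter is updated here: the demonstrations only condition the continuation that the model generates.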



## Resources of PLMs

### Publicly Available Model Checkpoints or APIs

Llama family (Meta): Latest version is Llama 3.1

* Largest version has 405B parameters, 15T training tokens, and an extended context window of 128K tokens
* It achieves competitive performance against prominent closed-source LLMs, such as GPT-4, on various benchmarks
* It has greatly advanced research on LLMs

Mistral family: Mistral Large 2 (123B parameters)

* Strong performance on various mainstream benchmarks (e.g., MMLU and GSM8k)

Gemma family (Google): Gemma

* Achieved excellent performance in multiple benchmarks (e.g., ARC-c, MMLU, and GSM8k)

Other families: Qwen, GLM, Baichuan, etc.



### Commonly Used Corpora for Pre-training

* Web pages: CommonCrawl, C4, RedPajama-Data, RefinedWeb, WebText

* Books & Academic Data: Book Data, Academic data

* Wikipedia

* Code: GitHub (Microsoft), StackOverflow, BigQuery (Google), The Stack (HuggingFace)

* Mixed data: The Pile, Dolma

### Commonly Used Datasets for Fine-tuning

* Instruction Tuning Datasets

* Alignment Datasets

### Library resources

* Deep learning frameworks: PyTorch, TensorFlow, JAX, Keras, etc.

* Transformers (HuggingFace)

* DeepSpeed (Microsoft)

* Megatron-LM (NVIDIA)

* Others: Colossal-AI, BMTrain, FastMoE, vLLM, DeepSpeed-MII, DeepSpeed-Chat




## Pre-training of PLMs

#### Data collection and preparation

* Data sources: General (web pages, conversation text, books), specialized (multilingual, scientific, code)

#### Data preprocessing

* Filtering and Selection: classifier-based (trained on high-quality data) and heuristic-based (lang id, perplexity, basic statistics, keyword filtering)

* De-duplication: Remove repetitive patterns at word, sentence, paragraph and document levels

* Privacy reduction: remove personally identifiable information (PII)

* Tokenization: Byte-Pair Encoding, WordPiece and Unigram tokenizations.
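
As an illustration of the heuristic filtering and de-duplication steps, the sketch below applies basic-statistics and keyword filters to each document and removes exact duplicates at document level. The thresholds, the blocklist and the function names are hypothetical; real pipelines typically add language identification, perplexity filtering and near-duplicate detection.

```python
import hashlib


def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3, blocklist=()):
    """Apply basic-statistics and keyword filters to one document."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if symbols / len(text) > max_symbol_ratio:
        return False
    lowered = text.lower()
    return not any(keyword in lowered for keyword in blocklist)


def deduplicate(documents):
    """Drop exact duplicates at document level after case/whitespace normalization."""
    seen = set()
    unique = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


corpus = [
    "The cat sat on the mat and looked at the dog.",
    "the cat sat on the mat   and looked at the dog.",
    "$$$ !!! ### click here",
    "Too short.",
]
kept = [d for d in deduplicate(corpus) if passes_heuristics(d, blocklist=("click here",))]
print(kept)
```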

#### Data scheduling

* Data mixture: proportion of each data source.

  * Sources are sampled according to the mixture proportions

  * Data source heterogeneity is critical for improving the downstream performance

* Data curriculum: order in which each data source is scheduled for training

  * Adaptive adjustment of data proportions for different sources during pre-training (easy samples first, then progressively introducing more challenging/specialized ones)
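
The two scheduling ideas can be sketched together: sampling sources according to a mixture, and moving the mixture over the course of training. The proportions and the linear interpolation below are hypothetical illustrations, not values or schedules taken from any particular model.

```python
import random


def sample_sources(mixture, num_samples, seed=0):
    """Draw source names according to the mixture proportions."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


def curriculum_mixture(start, end, progress):
    """Linearly interpolate between two mixtures; progress is in [0, 1]."""
    return {name: (1 - progress) * start[name] + progress * end[name] for name in start}


# Hypothetical proportions, for illustration only.
general_first = {"web": 0.8, "books": 0.15, "code": 0.05}
specialized_later = {"web": 0.5, "books": 0.2, "code": 0.3}

halfway = curriculum_mixture(general_first, specialized_later, 0.5)
batch_sources = sample_sources(halfway, num_samples=10_000)
print({name: batch_sources.count(name) / len(batch_sources) for name in halfway})
```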



### Architecture

Transformer is the common architecture behind PLMs

#### Encoder-decoder

Vanilla Transformer model with encoder and decoder stacks

Mainly for text generation

Examples: T5, BART, NLLB

#### Encoder

Training based on masked tokens

Mainly for classification tasks such as document classification or sentiment analysis.

Examples: BERT, ViT, etc.

#### Causal decoder

Unidirectional attention mask to guarantee that each input token can only attend to the past tokens and itself.

Mainly for text generation

Examples: GPT, Llama and Gemma families

#### Prefix (Non-causal) decoder

Revises the masking mechanism of causal decoders to enable bidirectional attention over the prefix tokens and
unidirectional attention only on generated tokens

Examples: GLM-130B and U-PaLM
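
The difference between the causal and the prefix decoder lies only in the attention mask. A minimal sketch, using plain Python lists where `mask[i][j]` is `True` when position `i` may attend to position `j` (the function names and the text rendering are hypothetical):

```python
def causal_mask(n):
    """Each position attends to itself and to past positions only (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Bidirectional attention inside the prefix, causal attention elsewhere."""
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as text: 'x' = attention allowed, '.' = masked out."""
    return "\n".join("".join("x" if allowed else "." for allowed in row) for row in mask)


print(show(causal_mask(4)))
print()
print(show(prefix_mask(4, prefix_len=2)))
```

In the prefix mask, the first `prefix_len` rows see the whole prefix, while later rows remain lower-triangular as in the causal case.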

#### Mixture-of-Experts

Only a subset of the network weights (the selected experts) is activated for each input

Examples: Switch Transformer and GLaM
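
Sparse activation can be sketched as a router that scores the experts and runs only the top-$k$ of them. The scalar "experts", the router logits and the renormalization of the selected gates are simplifying assumptions made for illustration; actual models differ in how gates are computed and combined.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, k=1):
    """Run only the top-k experts and combine their outputs with renormalized gates."""
    gates = softmax(router_logits)
    top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    output = sum(gates[i] / norm * experts[i](x) for i in top)
    return output, top


# Hypothetical scalar "experts"; real experts are feed-forward sub-networks.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
output, used = moe_forward(3.0, router_logits=[0.1, 2.0, -1.0], experts=experts, k=1)
print(output, used)
```

With `k=1` only one expert is evaluated per input, so the compute per token stays roughly constant while the total number of parameters grows with the number of experts.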

#### Emergent architectures

Most based on parameterized state space models (SSM) 

Examples: Mamba, RetNet, RWKV and Hyena

### Architecture

Detailed configuration: normalization, position embeddings, activation functions, and attention and bias

#### Configuration 

<!-- https://www.reddit.com/r/MachineLearning/comments/t5dznr/r_deepnet_scaling_transformers_to_1000_layers/ -->
* Normalization: methods (LayerNorm (LN), RMSNorm, DeepNorm) and positions (Post-LN, Pre-LN, Sandwich-LN)

<!-- GLU paper: https://arxiv.org/pdf/1612.08083v3 -->
* Activation functions: GELU, variants of GLU (SwiGLU, GeGLU)

* Position Embeddings: Absolute, relative, rotary (RoPE), ALiBi

* Attention: Full (Vanilla Transformer), Sparse (local based on position), multi-query (key and value matrices shared across all heads), grouped-query (key and value matrices shared across groups of heads), FlashAttention (exact attention with optimized GPU memory access)
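
To illustrate one of the configuration choices listed above, the sketch below contrasts LayerNorm and RMSNorm on a single vector, using plain Python lists. It is a didactic version under the usual definitions (LayerNorm re-centers and re-scales; RMSNorm only re-scales by the root mean square); the function names and the `eps` default are assumptions.

```python
import math


def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """LayerNorm: subtract the mean, divide by the standard deviation, then scale and shift."""
    n = len(x)
    gain = gain or [1.0] * n
    bias = bias or [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gain, bias)]


def rms_norm(x, gain=None, eps=1e-5):
    """RMSNorm: divide by the root mean square only (no re-centering, no bias)."""
    n = len(x)
    gain = gain or [1.0] * n
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    return [g * v / rms for v, g in zip(x, gain)]


hidden = [1.0, 2.0, 3.0, 6.0]
print(layer_norm(hidden))
print(rms_norm(hidden))
```

RMSNorm skips the mean computation and the bias term, which is one reason it is a popular lighter-weight alternative.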

### Objective functions

#### Language modeling (LM)

Given a sequence of tokens $x_1^J$, a general training objective is to maximize the following log-likelihood:
$$
\begin{align}
{\cal L}(x) &= \sum_{j=1}^J \log p(x_j\mid x_1^{j-1})
\end{align}
$$
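
The language modeling objective is a sum of per-token conditional log-probabilities, each conditioned on the preceding tokens. A minimal sketch with a hypothetical toy model (a fixed bigram table that only looks at the previous token; the probabilities are made up for illustration):

```python
import math


def sequence_log_likelihood(tokens, cond_prob):
    """Sum of log p(x_j | x_1 .. x_{j-1}) over all positions j."""
    total = 0.0
    for j, token in enumerate(tokens):
        total += math.log(cond_prob(token, tokens[:j]))
    return total


# Hypothetical toy model: a fixed bigram table, "<s>" marks the sequence start.
BIGRAM = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sleeps"): 0.1,
}


def toy_cond_prob(token, history):
    """Look up p(token | previous token), ignoring the rest of the history."""
    previous = history[-1] if history else "<s>"
    return BIGRAM[(previous, token)]


log_likelihood = sequence_log_likelihood(["the", "cat", "sleeps"], toy_cond_prob)
print(log_likelihood)
```

Training maximizes this quantity over the corpus, which is equivalent to minimizing the token-level cross-entropy.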

#### Denoising Autoencoding (DAE)

Given a sequence of tokens $x$ in which a subset $\tilde{X}$ has been corrupted to produce $\tilde{x}$, a corrupted version of $x$, the objective function maximizes the log-likelihood of recovering the corrupted tokens:

$$
\begin{align}
{\cal L}(x) &= \sum_{u \in \tilde{X}} \log p(u \mid \tilde{x})
\end{align}
$$
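
The denoising objective differs from language modeling in that only the corrupted tokens contribute to the sum, each conditioned on the corrupted sequence. In the sketch below, the `[MASK]` replacement, the function names and the uniform stand-in for a model are hypothetical choices made for illustration.

```python
import math


def corrupt(tokens, positions, mask_token="[MASK]"):
    """Replace the tokens at the given positions and return (corrupted, targets)."""
    corrupted = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return corrupted, targets


def dae_log_likelihood(targets, predict_prob, corrupted):
    """Sum log p(u | corrupted sequence) over the corrupted tokens only."""
    return sum(math.log(predict_prob(i, tok, corrupted)) for i, tok in targets.items())


tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, targets = corrupt(tokens, positions={1, 4})
print(corrupted)


def uniform_model(position, token, context):
    """Hypothetical stand-in for a model: every candidate token gets probability 0.25."""
    return 0.25


print(dae_log_likelihood(targets, uniform_model, corrupted))
```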

#### Mixture-of-Denoisers

The loss function is computed according to the type of denoising that is applied to the sample. 
Samples are prefixed with the denoising type. 
Denoisers differ in span length (number of tokens involved) and in the ratio of corrupted text:

* S-denoiser (LM)
* R-denoiser (DAE, short span and low corruption)
* X-denoiser (DAE, long span or high corruption)  

### Decoding strategies

Greedy and beam search with length penalty

$$
\begin{align}
\hat{y}_1^{\hat{I}} &= \operatorname*{arg\,max}_{y_1^I} P(y_1^I\mid x_1^J)
\end{align}
$$

Random sampling: randomly select the next token based on the probability distribution to enhance the randomness and diversity during generation

* Top-$k$ sampling: Randomly sample from the $k$ most probable tokens

* Top-$p$ sampling: Randomly sample from the smallest set of most probable tokens whose cumulative probability mass reaches $p$

* Other strategies: $\eta$-sampling, contrastive search and typical sampling
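
Top-$k$ and top-$p$ sampling can both be seen as truncating the next-token distribution, renormalizing it, and then sampling from what remains. The sketch below works on a hypothetical next-token distribution given as a dictionary; the function names and the probabilities are assumptions for illustration.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p, then renormalize."""
    kept, mass = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}


def sample(probs, rng):
    """Draw one token from a (renormalized) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution.
next_token = {"cat": 0.5, "dog": 0.3, "car": 0.15, "sky": 0.05}
rng = random.Random(0)
print(sample(top_k_filter(next_token, k=2), rng))
print(sample(top_p_filter(next_token, p=0.9), rng))
```

Top-$k$ keeps a fixed number of candidates, whereas top-$p$ adapts the number of candidates to how peaked the distribution is.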

## Overview of pre-trained large models

<img src="EvolutionaryTreeLLMs.jpg" width="640"/>

## Additional bibliography

<ol>
<li><a href="https://arxiv.org/pdf/2303.18223" target="_blank">W. X. Zhao et al. A Survey of Large Language Models, arXiv preprint arXiv:2303.18223 (September 2024).</a></li>
<li><a href="https://doi.org/10.1145/3649506" target="_blank">J. Yang et al. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond, ACM Trans. Knowl. Discov. Data 18, 6, Article 160 (July 2024).</a></li>
</ol>