## Individual Presentation - Research Paper
Jennifer Zhuang

### Equation for Cross-Entropy Loss
For a sequence of $N$ tokens:

$$L = - \frac{1}{N} \sum_{i=1}^N \log p(y_i | y_{<i})$$
where:

- $y_i$: the $i^{\text{th}}$ token in the sequence
- $p(y_i \mid y_{<i})$: the model's predicted probability of token $y_i$ given all previous tokens in the sequence ($y_{<i}$)

The loss is averaged over all $N$ tokens in the sequence. (Here $N$ counts tokens; the later sections reuse $N$ for model size.)
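As a rough illustration of the formula above, the sketch below averages the negative log-probabilities of the observed tokens. The function name `cross_entropy_loss` and the probability values are made up for this example; a real model would supply the probabilities.

```python
import math


def cross_entropy_loss(token_probs):
    """Average negative log-likelihood (in nats) over a sequence.

    token_probs[i] is the probability the model assigned to the token
    that actually appeared at position i, given the preceding tokens.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities for a 4-token sequence (illustrative only).
probs = [0.5, 0.25, 0.125, 0.125]
loss = cross_entropy_loss(probs)
print(f"loss = {loss:.4f} nats")
```

A model that assigned probability 1 to every observed token would score a loss of 0; lower probabilities push the loss up.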
 

### Power Laws
$$L(X) \propto \frac{1}{X^{\alpha_X}}$$

where $\alpha_X$ is the scaling exponent that determines how quickly the loss decreases as $X$ grows.
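One way to read the exponent: under a pure power law, multiplying $X$ by a factor $k$ multiplies the loss by $k^{-\alpha_X}$, whatever the starting point. The sketch below is a minimal illustration; `loss_ratio` is a hypothetical helper, and the exponent 0.076 is borrowed from the model-size fit quoted later in these notes.

```python
def loss_ratio(scale_factor, alpha):
    """Factor by which the loss changes when X is multiplied by
    scale_factor, assuming L(X) is proportional to X**(-alpha)."""
    return scale_factor ** (-alpha)


# Doubling X with alpha = 0.076 (exponent taken from the model-size fit).
ratio = loss_ratio(2.0, 0.076)
print(f"loss after doubling X = {ratio:.3f} x original loss")
```

With an exponent this small, each doubling trims the loss by only a few percent, which is why the fits span many orders of magnitude.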

### Scaling Factors
| **Factor**        | **Range/Details**                           |
|--------------------|---------------------------------------------|
| **Model Size**     | 768 to 1.5 billion non-embedding parameters |
| **Dataset Size**   | 22 million to 23 billion tokens            |
| **Shape**          | Depth, width, attention heads, feed-forward |
| **Context Length** | 1024 tokens (default), shorter contexts    |
| **Batch Size**     | $2^{19}$, varied for experiments            |

### Scaling Factors

| **Factor**          | **Range/Details**                                                                                        | **Notation**                                                                                          |
|----------------------|----------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------|
| **Model Size**       | 768 to 1.5 billion non-embedding parameters                                                              | $N$                   |
| **Dataset Size**     | 22 million to 23 billion tokens                                                                         | $D$                                                                   |
| **Shape**            | Depth, width, attention heads, feed-forward dimensions                                                 |  $d_{ff}$, $n_{layer}$, $n_{heads}$                                                                           |
| **Context Length**   | 1024 tokens (default), shorter contexts                                                                | $n_{ctx}$                                                                                                    |
| **Batch Size**       | $2^{19}$ tokens (default), varied for experiments                                                        | $B$                                                                                 |
| **Training Compute** | Estimated as $C \approx 6NBS$, where $S$ is the number of training steps; quoted in PF-days (1 PF-day = $8.64 \times 10^{19}$ FLOPs) | $C$                                                       |
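The training-compute estimate in the last row can be sketched as a small conversion. The function name and the example run (1.5 billion parameters, $2^{19}$-token batches, 250,000 steps) are assumptions chosen for illustration, not figures reported in these notes.

```python
PF_DAY_FLOPS = 8.64e19  # 1e15 FLOP/s sustained for 86,400 seconds


def training_compute_pf_days(n_params, batch_tokens, steps):
    """Estimate training compute C ~ 6 * N * B * S, converted to PF-days.

    n_params:     non-embedding parameter count N
    batch_tokens: batch size B, in tokens
    steps:        number of training steps S
    """
    flops = 6 * n_params * batch_tokens * steps
    return flops / PF_DAY_FLOPS


# Hypothetical run, for illustration only.
c = training_compute_pf_days(1.5e9, 2**19, 250_000)
print(f"C = {c:.2f} PF-days")
```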

### Parameterizing Transformers
Given:
- $n_{\text{layer}}$: number of layers
- $d_{\text{model}}$: dimension of the residual stream (main data flow)
- $d_{\text{ff}}$: dimension of the intermediate feed-forward layer
- $d_{\text{attn}}$: dimension of the attention output
- $n_{\text{heads}}$: number of attention heads per layer

#### Defining Model Size

We use $N$ to denote the model size, defined as the number of non-embedding parameters:
$$N \approx 2 \cdot d_{\text{model}} \cdot n_{\text{layer}} \cdot (2 \cdot d_{\text{attn}} + d_{\text{ff}})$$

Under the standard choice $d_{\text{attn}} = d_{\text{model}} = \frac{d_{\text{ff}}}{4}$, this simplifies to:
$$N \approx 12 \cdot n_{\text{layer}} \cdot d_{\text{model}}^2$$
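The simplification can be checked numerically. The sketch below evaluates the general expression and falls back to $d_{\text{attn}} = d_{\text{model}}$ and $d_{\text{ff}} = 4 \cdot d_{\text{model}}$ when those are not given; the function name and the 12-layer, 768-wide configuration are illustrative assumptions.

```python
def non_embedding_params(n_layer, d_model, d_attn=None, d_ff=None):
    """Approximate non-embedding parameter count:
    N ~ 2 * d_model * n_layer * (2 * d_attn + d_ff).

    Defaults follow the standard choice d_attn = d_model, d_ff = 4 * d_model.
    """
    if d_attn is None:
        d_attn = d_model
    if d_ff is None:
        d_ff = 4 * d_model
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)


# Hypothetical configuration: 12 layers, d_model = 768.
n = non_embedding_params(n_layer=12, d_model=768)
print(n, n == 12 * 12 * 768**2)
```

With the defaults, the bracket becomes $6 \cdot d_{\text{model}}$, which is where the factor of 12 comes from.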



### Compute for a Forward Pass

Evaluating a **forward pass** of the Transformer requires approximately the following number of operations per token:
$$C_{\text{forward}} \approx 2 \cdot N + 2 \cdot n_{\text{layer}} \cdot n_{\text{ctx}} \cdot d_{\text{model}}$$
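To get a feel for the two terms, the sketch below evaluates the estimate for a hypothetical 12-layer model with $d_{\text{model}} = 768$ and $n_{\text{ctx}} = 1024$, and reports how much of the total comes from the context-dependent term. All names and the configuration are assumptions made for illustration.

```python
def forward_compute_per_token(n_params, n_layer, n_ctx, d_model):
    """Approximate forward-pass operations per token:
    C_forward ~ 2 * N + 2 * n_layer * n_ctx * d_model."""
    return 2 * n_params + 2 * n_layer * n_ctx * d_model


# Hypothetical configuration; N uses the 12 * n_layer * d_model**2 estimate.
n_layer, d_model, n_ctx = 12, 768, 1024
n_params = 12 * n_layer * d_model**2

c_fwd = forward_compute_per_token(n_params, n_layer, n_ctx, d_model)
context_share = (2 * n_layer * n_ctx * d_model) / c_fwd
print(f"C_forward = {c_fwd:,} ops/token, context term share = {context_share:.1%}")
```

The ratio of the context term to $2N$ is $n_{\text{ctx}} / (12 \cdot d_{\text{model}})$, so the parameter term dominates whenever $d_{\text{model}}$ is much larger than $n_{\text{ctx}} / 12$.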

### Model Shape Definition
- $n_{\text{layer}}$: number of layers
- $d_{\text{ff}}$: dimension of the intermediate feed-forward layer
- $n_{\text{heads}}$: number of attention heads per layer

$$L = \left( \frac{C_{\text{min}}}{2.3 \times 10^8} \right)^{-0.050}$$
- $C_{\text{min}}$: The minimum compute, in PF-days, required to achieve a specific loss value.
- The exponent ($-0.050$) indicates how quickly the loss decreases as compute increases.
- The dashed line demonstrates that loss scales predictably with compute, following a power-law relationship.

$$L = \left( \frac{D}{5.4 \times 10^{13}} \right)^{-0.095}$$

$$L = \left( \frac{N}{8.8 \times 10^{13}} \right)^{-0.076}$$
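The three fits share the form $L = (X / X_c)^{-\alpha_X}$, with $C_{\text{min}}$ in PF-days, $D$ in tokens, and $N$ in non-embedding parameters. The sketch below plugs values into that form, using the constants from the equations above; the helper name, the dictionary keys, and the example inputs are assumptions, and each fit is best read as applying when the other two factors are not the limiting one.

```python
def power_law_loss(x, x_c, alpha):
    """Evaluate L = (x / x_c) ** (-alpha)."""
    return (x / x_c) ** (-alpha)


# (X_c, alpha) pairs copied from the three fitted equations above.
FITS = {
    "compute_pf_days": (2.3e8, 0.050),
    "dataset_tokens": (5.4e13, 0.095),
    "parameters": (8.8e13, 0.076),
}

# Hypothetical inputs, for illustration only.
loss_n = power_law_loss(1.5e9, *FITS["parameters"])
loss_d = power_law_loss(23e9, *FITS["dataset_tokens"])
loss_c = power_law_loss(1.0, *FITS["compute_pf_days"])
print(f"L(N) = {loss_n:.2f}, L(D) = {loss_d:.2f}, L(C_min) = {loss_c:.2f}")
```

Because every exponent is well below 1, large multiplicative increases in any factor translate into modest reductions in loss.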