In [1]:
%run Latex_macros.ipynb


# Bigger = Better ?  Scaling laws

There are many LLMs, with varying choices of
- number of parameters $N$
- size of training dataset $D$
- amount of compute for training $C$

Here is a table from the [GPT-3 paper](https://arxiv.org/pdf/2005.14165.pdf#page=46)

<img src="images/compute_by_model.png" width=90%>

We have already seen that some LLM properties
- like in-context learning (zero or few shot)
- "emerge" only when model size passes a threshold

This argues for bigger models.

<img src="images/LM_Few_Shot_Accuracy.png" width=80%>
                                           

There is also evidence that the emergence of ability to perform some in-context tasks
- is sudden
- rather than gradual
as the number of parameters increases.

<img src="images/arithmetic_LLM_by_size.png">

Attribution: [GPT-3 paper](https://arxiv.org/pdf/2005.14165.pdf#page=46)


Is bigger $N$ always better ?

Consider the costs.  Larger $N$
- entails more computation: larger $C$
- probably requires more training data: larger $D$

If we fix a "budget" for one choice (e.g., $C$) we can explore choices for $N, D$ that meet this budget.

Here are two models with roughly the same $C$ budget
- but vastly different $N$ and $D$


model | Compute (PF-days) | params (M) | training tokens (B)
:---|:---|:---|:---
RoBERTa-Large | 49.3 | 355 | 2000
GPT-3 2.7B    | 55.2 | 2650 | 300

Attribution: [GPT-3 paper](https://arxiv.org/pdf/2005.14165.pdf#page=46)
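
The compute column can be roughly reproduced from the other two columns. Here is a minimal sketch, assuming the common rule of thumb $C \approx 6ND$ FLOPs for training (about 6 floating-point operations per parameter per token). The table does not state this rule, and the helper name `train_pf_days` is hypothetical.

```python
# Assumption: training compute C ~= 6 * N * D FLOPs (a widely used rule of thumb).
SECONDS_PER_DAY = 24 * 60 * 60
FLOPS_PER_PF_DAY = 1e15 * SECONDS_PER_DAY  # one petaflop/s sustained for one day


def train_pf_days(n_params, n_tokens):
    """Approximate training compute in PF-days for N parameters and D tokens."""
    return 6 * n_params * n_tokens / FLOPS_PER_PF_DAY


# (N, D) taken from the table above: params in millions, tokens in billions.
models = {
    "RoBERTa-Large": (355e6, 2000e9),
    "GPT-3 2.7B": (2650e6, 300e9),
}
for name, (n, d) in models.items():
    print(f"{name}: {train_pf_days(n, d):.1f} PF-days")
```

Under this approximation the two results round to the 49.3 and 55.2 PF-days shown in the table, even though $N$ differs by about 7x in one direction and $D$ by about 7x in the other.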

Given these choices: how do we choose ?

One way to quantify the decision is by setting a goal
- to maximize "performance"
- where this is usually proxied by "minimizing test loss" $L$
    - Cross Entropy for the "predict the next token" task of the LLM
 

We can state some basic theories
- Increasing $N$ creates the *potential* for better performance $L$
- To *actualize* the potential
    - we need increased $C$
        - more parameters usually come from stacking more Transformer Blocks, each of which adds computation
    - we need increased $D$

But this still leaves many unanswered questions
- Can $L$ always be reduced ?
    - Does performance hit a "ceiling" 
    - For a fixed $N$: perhaps increasing $D$ or $C$ won't help
- What is the relationship between $N$ and $D$ ?
    - how much must $D$ be increased when $N$ increases
- For a fixed $D$: what is the best choice for $N$ ?
    - holding performance constant

# Scaling Laws

Fortunately, this
[paper](https://arxiv.org/pdf/2001.08361.pdf)
has 
- conducted an empirical study of models with varying $N, D, C$ and resulting $L$
- fit an empirical function (*Scaling Laws*) describing the dependency of $L$ on $N, D, C$.

We briefly summarize the results.

"Performance" (test loss $L$ ) depends on scale.

Scale consists of 3 components
- Compute $C$
- Dataset size $D$
- Parameters $N$

We can set a "budget" for any of the variables $L, N, D, C$
- and examine trade-offs among the non-fixed variables


The paper shows that 
- Increasing your budget for one of the scale factors
- improves performance (decreases loss)
- **provided** the other two factors don't become bottlenecks

<img src="images/scaling_loss_v_scale.png">

But bottlenecks are a worry:
- The potential performance of a model of fixed size $N$ hits a "ceiling"
- That can't be overcome by increasing compute $C$

<img src="images/scaling_loss_vs_compute.png">

**Observation**

For a fixed Compute $C$
- a smaller model (that has reached its asymptotic minimum) has better performance
- provided that there is enough training data

This is interesting in that more data $D$ may compensate for fewer parameters
- we may be able to create "small" models (fewer parameters)
- with performance equal to larger models
- given sufficient $D$


We can also set a performance budget $L$
- and examine the amount of training data $D$ to reach this budget
- as $N$ varies

<img src="images/scaling_loss_vs_datasize.png">

**Observation**

For a fixed $D$ 
- bigger models have better performance
- but at a higher $C$

Here is one graph that combines $N$ and $D$

<img src="images/scaling_loss_vs_D_and_N.png">

The [Scaling Laws](https://arxiv.org/pdf/2001.08361.pdf#page=4)
show that Loss follows a Power Law as a function of $N, C, D$.
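
As a rough illustration of what a power law implies, the sketch below evaluates the parameter-only form $L(N) = (N_c / N)^{\alpha_N}$. The constants $\alpha_N \approx 0.076$ and $N_c \approx 8.8 \times 10^{13}$ are approximate fitted values that should be checked against the paper; treat them, and the function name, as assumptions.

```python
# Approximate fitted constants (assumptions; verify against the paper's summary).
ALPHA_N = 0.076   # power-law exponent for the number of parameters
N_C = 8.8e13      # scale constant, in (non-embedding) parameters


def loss_from_params(n_params):
    """Power law L(N) = (N_c / N) ** alpha_N, assuming D and C are not bottlenecks."""
    return (N_C / n_params) ** ALPHA_N


for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}: L = {loss_from_params(n):.3f}")

# Under a power law, every 10x increase in N multiplies the loss by the same factor.
factor_per_10x = loss_from_params(1e10) / loss_from_params(1e9)
print(f"loss multiplier per 10x parameters: {factor_per_10x:.3f}")
```

A power law is a straight line on a log-log plot, with slope $-\alpha_N$. Because the exponent is small, each 10x increase in $N$ buys only a modest reduction in loss.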

[Here](https://arxiv.org/pdf/2001.08361.pdf#page=20) is a summary of the Scaling Laws.

<img src="images/scaling_power_laws_summary.png">

# More Recent results

A [more general approach](https://arxiv.org/pdf/2203.15556.pdf)
- to the same question as the [original paper](https://arxiv.org/pdf/2001.08361.pdf)
- leads to somewhat different conclusions

Stated more directly, the paper proposes an empirical function to estimate the optimal $N$ and $D$
- for a fixed compute budget $C$

$$
N_\text{opt}, D_\text{opt} = \argmin{ N, D \text{ s.t. } C=\text{FLOPS}(N,D)}{L(N,D)}
$$

where $L(N,D)$ is the early-stopped loss
- not trained to optimal converged $L$
- which would require more than the compute budget $C$
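
The constrained minimization above can be sketched numerically. This is an illustration only, under two assumptions: (1) $\text{FLOPS}(N,D) \approx 6ND$, and (2) a parametric loss of the form $L(N,D) = E + A/N^{\alpha} + B/D^{\beta}$, with approximate fitted constants that should be checked against the second paper. All function names are hypothetical.

```python
# Assumed parametric loss L(N, D) = E + A / N**ALPHA + B / D**BETA.
# The constants are approximate fitted values; treat them as assumptions.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def compute_optimal(c_flops, steps=20001):
    """Grid search over N (log-spaced from 1e6 to 1e13); D follows from C = 6*N*D."""
    best = None
    for i in range(steps):
        n = 10 ** (6 + 7 * i / (steps - 1))
        d = c_flops / (6 * n)          # spend the whole compute budget
        candidate = (loss(n, d), n, d)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


def closed_form_n_opt(c_flops):
    """Analytic minimizer of the same constrained problem, for comparison."""
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    return g * (c_flops / 6) ** (BETA / (ALPHA + BETA))


budget = 1e23  # hypothetical compute budget in FLOPs
l_opt, n_opt, d_opt = compute_optimal(budget)
print(f"N_opt ~ {n_opt:.3g}, D_opt ~ {d_opt:.3g}, L ~ {l_opt:.3f}")
print(f"closed-form N_opt ~ {closed_form_n_opt(budget):.3g}")
```

The optimum is sensitive to the fitted constants, so the numbers printed here need not agree with the optimal values quoted from the paper elsewhere in these notes.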

One point of departure between the two papers:
- the second paper uses a *learning rate schedule* that varies with $D$
    - decay to a fixed fraction of the initial rate, based on length of $D$
- versus a *fixed* learning rate schedule used by the first paper

The second paper contends that 
- failing to use a variable learning rate
- causes an *over-estimate* of $L$ when $D < 130B$

Using the overestimate in fitting an empirical function causes a difference in conclusions.
- The second paper concludes that $D$ should grow linearly with $N$
- rather than sub-linearly ($D \propto N^{0.74}$)
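
To see how much the two conclusions differ in practice, the following sketch compares how much the dataset must grow when the model grows by a factor `k`, under the linear rule versus the sub-linear exponent $0.74$. The function name and the values of `k` are illustrative.

```python
def data_growth(param_growth, exponent):
    """Factor by which D grows when N grows by `param_growth`, if D is proportional to N**exponent."""
    return param_growth ** exponent


for k in (2, 10, 100):
    linear = data_growth(k, 1.0)       # second paper: D grows linearly with N
    sublinear = data_growth(k, 0.74)   # first paper: D grows as N**0.74
    print(f"N x{k}: D x{linear:.1f} (linear) vs D x{sublinear:.1f} (sub-linear)")
```

For a 10x larger model, the linear rule asks for 10x more tokens, while the sub-linear rule asks for only about 5.5x.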


In comparing the optimal values for a variable (e.g., $C$) between paper 1 and paper 2
- we use subscript $j$ to refer to the value in paper $j \in \{1,2\}$

Here are some [conclusions](https://arxiv.org/pdf/2203.15556.pdf#page=7) offered in the second paper
- Most LLMs use an $N$ that is *too large* given their *compute budgets*
- For $N = 175B$ (GPT-3), an optimal version of GPT-3 
    - needs to be trained longer than the actual
        - $C_2 = 4.4 \times 10^{24}$ FLOPs versus actual $C_1 = 3.1 \times 10^{23}$
    - on more tokens $D$
        - $D_2 = 4.2T$ versus $D_1 = 0.3T$
- For current models much larger than GPT-3 ($N > 175B$)
    - the optimal $C_2$ and $D_2$ are not realistic in practical terms
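
A quick calculation makes the size of the gap for GPT-3 concrete. It uses only the values quoted in the list above; the variable names are illustrative.

```python
# Values quoted in the list above (parameters, FLOPs, tokens).
N = 175e9
C_ACTUAL, C_OPTIMAL = 3.1e23, 4.4e24
D_ACTUAL, D_OPTIMAL = 0.3e12, 4.2e12

compute_ratio = C_OPTIMAL / C_ACTUAL
data_ratio = D_OPTIMAL / D_ACTUAL
print(f"compute: {compute_ratio:.1f}x more, data: {data_ratio:.1f}x more")
print(f"tokens per parameter: actual {D_ACTUAL / N:.1f}, optimal {D_OPTIMAL / N:.1f}")
```

Both compute and data would need to grow by roughly 14x. That moves the model from under 2 training tokens per parameter to about 24.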

Using the projected optimal values
- the second paper started with a model called Gopher with $N = 280B$
- set a compute budget equal to that used for Gopher
- to derive an optimal $N_2 = 70B$ and $D_2 = 1.4T$
- and trained a smaller model called Chinchilla 

Chinchilla, although only $25\%$ as large as Gopher
- outperforms Gopher on many benchmarks
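
The Chinchilla configuration can be summarized with a few ratios. This sketch uses the values quoted above, plus the approximation $C \approx 6ND$ for training compute, which is an assumption not stated in these notes.

```python
# Values quoted above for Gopher and Chinchilla.
GOPHER_PARAMS = 280e9
CHINCHILLA_PARAMS = 70e9
CHINCHILLA_TOKENS = 1.4e12

size_fraction = CHINCHILLA_PARAMS / GOPHER_PARAMS
tokens_per_param = CHINCHILLA_TOKENS / CHINCHILLA_PARAMS
# Assumption: training compute ~= 6 * N * D FLOPs.
approx_train_flops = 6 * CHINCHILLA_PARAMS * CHINCHILLA_TOKENS

print(f"Chinchilla / Gopher size: {size_fraction:.0%}")
print(f"Chinchilla tokens per parameter: {tokens_per_param:.0f}")
print(f"Chinchilla approximate training compute: {approx_train_flops:.2e} FLOPs")
```

The ratio of about 20 training tokens per parameter is a convenient way to remember this result.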

So perhaps the future will see
- a trend to smaller models
- with more data

This may be particularly relevant
- with the use of non-parametric knowledge (external knowledge sources, like the Web)
- which naturally reduces the $N$ needed to hold knowledge in the parameters

# Test-time cost versus Train-time cost

We have been focused on the cost of *training*
- cost of a forward pass
- cost of a backward pass
- summed over many training examples


Post-training, at test time, the cost of prediction is
- cost of a forward pass

The way that $N$ (number of parameters) usually increases in a Transformer Architecture
- is by stacking an increasing number of Transformer blocks.

This increases the path length of a forward pass.

So making *predictions* using a bigger model will be more costly than doing so in a smaller model.

If you are running a prediction service at large scale (e.g., ChatGPT)
- you need increased compute
- to support the same number of predictions
- on a bigger model than a smaller one.

So smaller models have test-time as well as train-time advantages.
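
The serving-cost argument can be made concrete with a back-of-the-envelope sketch. It assumes a forward pass costs about 2 FLOPs per parameter per token, and ignores the attention cost that grows with context length. The traffic figure and function name are hypothetical; the model sizes are those of Gopher and Chinchilla from the previous section.

```python
# Assumption: forward pass ~= 2 FLOPs per parameter per token.
FORWARD_FLOPS_PER_PARAM = 2


def inference_flops(n_params, n_tokens):
    """Approximate FLOPs to process `n_tokens` tokens with a forward pass only."""
    return FORWARD_FLOPS_PER_PARAM * n_params * n_tokens


tokens_served = 1e9  # hypothetical daily traffic, in tokens
big = inference_flops(280e9, tokens_served)    # Gopher-sized model
small = inference_flops(70e9, tokens_served)   # Chinchilla-sized model
cost_ratio = big / small
print(f"big: {big:.2e} FLOPs, small: {small:.2e} FLOPs, ratio: {cost_ratio:.1f}x")
```

Under this approximation, test-time cost scales linearly with $N$. A model that is 4x smaller needs about 4x less compute to serve the same traffic.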

In [2]:
print("Done")

Done
