In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

# Introduction

The goal of Transfer Learning is to adapt a Pre-Trained model for a Source task
(the "base" model) to solve a new Target task.

Adapting a base model is typically performed by Fine-Tuning
- allowing the weights of the base model (and any additional "head") layers to adapt
- by training with a relatively small number of examples from the Target task.

Although Fine-Tuning is effective, there is a problem, especially with LLM base models
- LLM models can have a very large number $N$ of parameters
- They are increasingly deep: the number of stacked Transformer blocks $n_\text{layers}$ is growing
    - which increases latency in training
    
Even training on a small number of Target task examples is expensive in time and memory.

The question we address in this module
-  Can we adapt a base model *without* modifying *all* of the
parameters of the base model ?

We will refer to this problem as *Parameter Efficient Transfer Learning* 
- or *Parameter Efficient Fine-Tuning* when Fine-Tuning is used as the method for adaptation

We want the number of *adapted* parameters to be small relative to the total number of
base model parameters.

We will use this fraction as a metric in comparing adaptation methods.

We note that the number of parameters in a Transformer is $N = \OrderOf{ n_\text{layers} * d^2}$
- where $d$ is the internal dimension of the Transformer
- calculations may be found in [our notebook](Transformer.ipynb#Number-of-parameters) and [here](https://arxiv.org/pdf/2001.08361.pdf#page=6)
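
As a rough check of this estimate, the sketch below uses the approximation $N \approx 12 * n_\text{layers} * d^2$ from the second reference above (non-embedding parameters, assuming a feed-forward width of $4 d$). The function name is illustrative, and the value $n_\text{layers} = 96$ for GPT-3 is an assumption that is not stated in this notebook.

```python
def approx_transformer_params(n_layers: int, d: int) -> int:
    """Approximate non-embedding parameter count of a Transformer.

    Assumes, per block, 4*d^2 attention weights (Q, K, V and output
    projections) and 8*d^2 feed-forward weights (two matrices of shape
    d x 4d); biases and embeddings are ignored.
    """
    attention = 4 * d * d
    feed_forward = 8 * d * d
    return n_layers * (attention + feed_forward)


# Illustrative values for GPT-3 (n_layers = 96 is an assumption, see above)
n = approx_transformer_params(n_layers=96, d=12288)
print(f"{n:,}")  # 173,946,175,488
```

The result is close to the 175B parameters usually quoted for GPT-3, which suggests the $\OrderOf{ n_\text{layers} * d^2}$ estimate is adequate for comparing adaptation methods.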

# Motivation for Parameter Efficient Transfer Learning

A base model may have a large number of parameters (e.g., an LLM)
- Adapting *all* the parameters may require large quantities of time and space
- Reducing the number of adapted parameters may have efficiency advantages

Beyond the obvious efficiency advantage
- there is a space advantage
- the specialization of the Base Model to a Target Task can be represented by
the small number of adapted parameters

This means that the parameters of the same base model can be *shared*
- across models for different Target tasks
- with one set of separate (but small) adapted parameters for each Target

This is also potentially a way to enable per-user instances of a Target task
- with user-specific training examples kept private to each user's instance

# Adapters

**References**

- [Parameter Efficient Transfer Learning for NLP](https://arxiv.org/pdf/1902.00751.pdf)
- [LLM Adapters](https://arxiv.org/pdf/2304.01933.pdf)


Adapters are modules (implemented as Neural Networks)
- that are inserted into the existing modules (layers) of the base model.

In the general case: 
- we can insert one or more adapters *anywhere* within the NN comprising the base model.

Within a *single* Transformer block, typical arrangements are

- Series
    - Adapter inserted between modules
- Parallel
    - Adapter inserted parallel to a module
        - providing an alternate path *by-passing* the module
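
The two arrangements can be sketched as function composition. This is a simplified illustration, assuming scalar inputs and assuming that the Parallel path is combined with the module's output by addition; the names `series`, `parallel`, `module` and `adapter` are hypothetical.

```python
from typing import Callable

Fn = Callable[[float], float]


def series(module: Fn, adapter: Fn) -> Fn:
    """Adapter applied to the module's output: x -> adapter(module(x))."""
    return lambda x: adapter(module(x))


def parallel(module: Fn, adapter: Fn) -> Fn:
    """Adapter on a separate path whose output is added: x -> module(x) + adapter(x)."""
    return lambda x: module(x) + adapter(x)


# Hypothetical stand-ins for a frozen module and a small trainable adapter
module = lambda x: 2.0 * x
adapter = lambda x: 0.5 * x

print(series(module, adapter)(3.0))    # adapter(6.0) = 3.0
print(parallel(module, adapter)(3.0))  # 6.0 + 1.5 = 7.5
```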

<table>
    <center><strong>Various Adapter designs</strong></center>
    <img src="images/LLM_adapters.png" width=70%>
    <br>
    Attribution: https://arxiv.org/pdf/2304.01933.pdf#page=2
</table>

Here is a diagram of a common adapter
<br><br>
<table>
    <center><strong>Adapter</strong></center>
    <img src="images/Adapter_diag.png" width=50%>
</table>

The dimensions of the input and output of the adapter
- are the same $d$ (common vector dimension) used for all layers in a Transformer
- facilitates inserting adapters anywhere in the Transformer

The usual architecture
- two projection modules, with a bottleneck of dimension $a \lt d$
    - Project down to reduced dimension; Project up to original dimension
- skip connection around the two projection modules

We are already familiar with adaptation via Adapter-like modules
- adding a new "head" layer to a head-less base model
    - often a Classifier to adapt the base model to the particular Target classes
- [Feature based transfer learning](NLP_Language_Models.ipynb#Other-uses-of-a-Language-Model:-Feature-based-Transfer-Learning) 
    - feeding the representation created by the base model to another module.
- these are not technically adapters
    - input and output dimensions don't match
    - architecture may differ

Regardless of where Adapters are placed
- they derive a new function $g$ from the function $f$ computed by the base model

Formally:
- $f_\Theta$ denotes the function computed by the base model which is parameterized by $\Theta$ 
- $g_{\Theta, \Phi}(\x)$ denotes the function computed by the adapted model
    - $\Phi$ are the Adapter parameters
    - $\Theta$ are the base model parameters

*Adapter Tuning* occurs when we train only the parameters $\Phi$ of the Adapter modules
- on a small number of examples from the Target task
- freezing the parameters of the base model

During epoch $\tt$ of Adapter Tuning, we learn $\Phi_\tp$
- initializing $\Phi_{(0)}$ such that
$$g_{\Theta, \Phi_{(0)}}(\x) \approx f_\Theta(\x)$$
- can be achieved by setting $\Phi = 0$
    - because of the skip connection, the adapter output becomes $f_\Theta(\x)$
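
A minimal NumPy sketch of such an adapter is given below. It assumes a ReLU non-linearity between the two projections and zeroes only the up projection (which is already enough for the skip connection to pass the input through unchanged); the class name and these choices are illustrative, not the reference implementation.

```python
import numpy as np


class Adapter:
    """Bottleneck adapter: down-project, non-linearity, up-project, skip connection."""

    def __init__(self, d: int, a: int, rng: np.random.Generator):
        # Down projection (d -> a): small random values
        self.W_down = rng.normal(scale=0.01, size=(d, a))
        self.b_down = np.zeros(a)
        # Up projection (a -> d): zeros, so the adapter starts as the identity
        self.W_up = np.zeros((a, d))
        self.b_up = np.zeros(d)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        hidden = np.maximum(0.0, x @ self.W_down + self.b_down)  # ReLU (assumed)
        return x + hidden @ self.W_up + self.b_up                # skip connection


rng = np.random.default_rng(0)
adapter = Adapter(d=16, a=4, rng=rng)
x = rng.normal(size=(3, 16))        # 3 vectors of dimension d
print(np.allclose(adapter(x), x))   # True: output equals input at initialization
```

Because the adapter initially returns its input, inserting it anywhere in the base model leaves $f_\Theta$ unchanged until training begins.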

## Bottleneck size

Since Adapter Tuning does not change base model parameters $\Theta$,
- the space used depends on the size of $\Phi$
- this is the key to adapting the base model using a small number of parameters


The number of parameters of the projection components of an Adapter is $\OrderOf{ d*a }$, multiplied by the number $k$ of Adapters.

Recall that the number of parameters in a Transformer is $\OrderOf{n_\text{layers} * d^2}$.

Expressing the size of $\Phi$ as a fraction of the size of $\Theta$:

$$
\begin{array} \\
r & = & \frac{|\Phi|}{|\Theta|} \\
  & \approx & \frac{d * a * n_\text{layers}}{ n_\text{layers} * d^2 } & \text{since} \\
  & &                    | \Phi | = \OrderOf{ d*a * n_\text{layers}} \text{ assuming } k = n_\text{layers}\\
  & &                   | \Theta | = \OrderOf{n_\text{layers} * d^2} \text{ for a Transformer} \\
& \approx & \frac{ a  }{ d } \\
\end{array}
$$

For reference, $d = 12,288$ for GPT-3; $a$ is chosen to satisfy a target for $r$
- e.g., $r = 0.1 \%$, results in bottleneck size $a = 12$
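
The arithmetic can be checked with a few lines of Python. The helper names are hypothetical, and the ratio uses the approximation $r \approx a / d$ derived above (one Adapter per layer).

```python
def bottleneck_size(d: int, target_ratio: float) -> int:
    """Bottleneck dimension a such that a / d is approximately target_ratio."""
    return max(1, round(target_ratio * d))


def adapter_fraction(d: int, a: int) -> float:
    """Approximate ratio r = |Phi| / |Theta| = a / d (one adapter per layer)."""
    return a / d


a = bottleneck_size(d=12288, target_ratio=0.001)
print(a)                                     # 12
print(f"{adapter_fraction(12288, a):.4%}")   # 0.0977%
```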

In [experiments](https://arxiv.org/pdf/1902.00751.pdf#page=4), the bottleneck was varied
$$
a \in \{ 2, 4, 8, 16, 32, 64 \} 
$$
so typical $a$ is a fraction of $1 \%$.

[The effect of varying $a$](https://arxiv.org/pdf/1902.00751.pdf#page=7) is shown by the orange line
in the diagram below
- the horizontal axis is the total number of trainable parameters, which is linear in $a$
- it seems to show that increasing the size of the bottleneck does not impact performance greatly

The diagram also compares adaptation via Adapters to adaptation by Fine-Tuning only the top layers of the base model
- the total number of trainable parameters increases with the number of top layers fine-tuned
- the results show that adaptation via Adapters is better than Fine Tuning top layers
    - *unless* we Fine-Tune *many* top layers
    
<table>
    <center><strong>Adapter vs Fine Tuning</strong></center>
    <img src="images/Adapter_size_vs_FineTuning.png" width=70%>
    <br>
    Attribution: https://arxiv.org/pdf/1902.00751.pdf#page=7
</table>

## Adapter placement

Recall that Transformer blocks are usually stacked into $n_\text{layers}$ in a Transformer for an LLM.

Initially, Adapters were placed at *each* level of the stack.

However, [experiments](https://arxiv.org/pdf/1902.00751.pdf#page=8)  show that the most impactful adapters are located at the *top* of the stack.


In the study, adapters are *removed* within a span of levels of the stacked blocks.
- the models are **not re-trained** after removing the adapters

The horizontal/vertical axes index the *end/start* of the span.

Columns $7$ and beyond indicate that removing adapters does not decrease performance
- until the adapter at level 7 is removed

The last column indicates that the largest performance decrease occurs
- when removing the single adapter at the top level

<table>
    <center><strong>Adapter placement</strong></center>
    <img src="images/Adapter_layer_ablation.png" width=110%>
    <br>
    Attribution: https://arxiv.org/pdf/1902.00751.pdf#page=8
</table>

This is interesting
- Recall, our hypothesis of Deep learning is that increasing levels of abstraction of the inputs
are created as layers become deeper
- The early layers create representations that transfer across most tasks
- The deepest layer representations are most task-specific

The decrease in performance corresponding to deeper layers 
- may indicate that the Target task specific adaptation
- occurs in the region which we associate most with the Source task

# LoRA

**References**
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685.pdf)

**Additional reading**
- [Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning](https://arxiv.org/abs/2012.13255)
- [LoRA Learns Less and Forgets Less](https://arxiv.org/pdf/2405.09673)



The Adapter method of Fine Tuning uses a module involving
- Down projecting to a lower dimension
- Up projecting back to the original dimension
- with an intervening non-linearity
- where the projections are achieved via Dense layers

We now show the Low Rank Adaptation (LoRA) method that is similar
- Down and Up Projections
- without an intervening non-linearity
- where the projections are achieved via matrix multiplication


Let $\W$ denote the parameters of the Pre-Trained Model.

Fine-Tuning updates the parameters to
$$
\W' = \W + \Delta \W
$$

The usual method is to use Gradient Descent to create a sequence of parameter updates
- one per mini-batch
- equal to negative one times the learning-rate scaled  gradient of the Loss

$$
\begin{array} \\
\W_{(0)} & = & \W \\
\text{update}_\tt & = & - \alpha_\tt  
* \frac{\partial \loss_{\W_{(\tt-1)} }}{\partial \W_{(\tt-1)}} \\
\W_\tp & = & \W_{(\tt-1)} +  \text{update}_\tt \\
\\
\Delta \W & = & \sum_{\tt} \text{update}_\tt
\end{array}
$$

LoRA uses a different method 
- using Gradient Descent to approximate the *cumulative* change $\Delta \W$.
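
To make the cumulative change concrete, here is a toy sketch of plain Gradient Descent that records each update and confirms that the final weights equal the initial weights plus the sum of the updates. The quadratic loss, the target vector and the constant learning rate are assumptions chosen only for illustration.

```python
def grad(w: list[float], target: list[float]) -> list[float]:
    """Gradient of the toy loss 0.5 * sum((w - target)^2) with respect to w."""
    return [wi - ti for wi, ti in zip(w, target)]


w0 = [1.0, -2.0, 0.5]          # stand-in for the pre-trained weights W
target = [0.0, 1.0, 2.0]       # hypothetical optimum for the Target task
alpha = 0.1                    # learning rate, held constant here

w = list(w0)
delta = [0.0] * len(w0)        # running sum of updates, i.e. Delta W
for t in range(1, 51):
    update = [-alpha * g for g in grad(w, target)]
    w = [wi + ui for wi, ui in zip(w, update)]
    delta = [di + ui for di, ui in zip(delta, update)]

# The fine-tuned weights equal the original weights plus the cumulative change
print(all(abs(wi - (w0i + di)) < 1e-9 for wi, w0i, di in zip(w, w0, delta)))  # True
```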

## Computing $\Delta \W$


LoRA does not **learn** $\Delta \W$  directly.
- It factors $\Delta \W$ as the product of two *smaller* lower rank matrices $A, B$:
$$
\Delta \W = A * B
$$

out  &nbsp;  &nbsp;  &nbsp;  &nbsp; | &nbsp; | down project &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | up project &nbsp;  &nbsp;  &nbsp;|
:--:|:-:|:-:|:-:|:-:
$\Delta \W$ | = | $A$| * |$B$ |
$(d \times d)$ | | $(d \times r)$ | | $(r \times d)$

where $r \ll d$, so that $\text{rank}(\Delta \W) \le r$



Here is the architecture
<br>
<br>
<table>
    <center><strong>LoRA adapting Pre-Trained matrix W</strong></center>
    <img src="images/LoRA_arch.png" width=30%>
    <br>
    Attribution: https://arxiv.org/pdf/2106.09685.pdf#page=1
</table>

Given input $\x$ (treated as a row vector), this arrangement results in output $h$
$$
\begin{array} \\
h & = & \x * \W_0  & \text{the left branch} \\
  &   & \,\, + \, \x * A * B & \text{the right branch, added by the sum operator on top} \\
& = &  \x * \W_0 + \x * \Delta \W & \Delta \W = A * B \\
& = & \x * (\W_0 + \Delta \W) & \text{distributive property} \\
& = & \x * \W' & \W' = \W_0 + \Delta \W \\
\end{array}
$$

where $\W_0 = \W$ denotes the frozen Pre-Trained weights.

Thus, the output is $\x * \W'$, satisfying the goal of adapting $\W$ to $\W'$.
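
The equivalence of the two-branch computation and the single adapted matrix can be verified numerically. The sketch below treats $\x$ as a row vector (the convention of the product $\x * A * B$) and uses small random matrices; the dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2

W0 = rng.normal(size=(d, d))   # frozen pre-trained weights
A = rng.normal(size=(d, r))    # down projection
B = rng.normal(size=(r, d))    # up projection (non-zero here, as if already trained)
x = rng.normal(size=(1, d))    # one input, as a row vector

h_two_branches = x @ W0 + (x @ A) @ B   # left branch plus LoRA branch
h_merged = x @ (W0 + A @ B)             # single adapted matrix W'

print(np.allclose(h_two_branches, h_merged))  # True
```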

Note the computational advantage of computing
$$
(\x * A) * B
$$
over
$$
\x * (A * B)
$$

- We avoid constructing the $(d \times d)$ matrix $(A * B)$
- in favor of constructing *short* vectors


 down project &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | up project &nbsp;  &nbsp;  &nbsp;| |
:-:|:-:|:-:|:-|
 $(\x * A)$| * |$B$ |
 $(1 \times d) * (d \times r)$ | * | $(r \times d)$ |  dimensions
 $(1 \times r)$ | * | $(r \times d)$ | left product: dimensions $(1 \times r)$
 $(1 \times d)$ | | | final dimensions: $(1 \times d)$
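
A back-of-the-envelope count of scalar multiplications shows the size of the saving. The function names and the choice $r = 8$ are illustrative assumptions.

```python
def cost_project_then_expand(d: int, r: int) -> int:
    """Multiplications for (x * A) * B: a (1 x d)(d x r) product, then (1 x r)(r x d)."""
    return d * r + r * d


def cost_build_full_matrix(d: int, r: int) -> int:
    """Multiplications for x * (A * B): a (d x r)(r x d) product, then (1 x d)(d x d)."""
    return d * r * d + d * d


d, r = 12288, 8   # r = 8 is an illustrative choice
print(cost_project_then_expand(d, r))  # 196608
print(cost_build_full_matrix(d, r))    # 1358954496
```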

The two dimensions of $A$ and $B$ are $d$ and $r$.

Thus, the resulting number of parameters
- is $2 * d * r$ parameters
- rather than $d^2$

So, not only is the representation of $\Delta \W$ smaller, there are fewer parameters to Fine-Tune.
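
The reduction in trainable parameters for a single $(d \times d)$ matrix can be tabulated directly. The ranks below are illustrative, and the helper names are hypothetical.

```python
def lora_params(d: int, r: int) -> int:
    """Trainable parameters in A (d x r) and B (r x d)."""
    return 2 * d * r


def full_params(d: int) -> int:
    """Parameters in a full (d x d) update Delta W."""
    return d * d


d = 12288
for r in (1, 2, 8, 64):   # illustrative ranks
    frac = lora_params(d, r) / full_params(d)
    print(f"r={r:>2}: {lora_params(d, r):>9,} parameters, {frac:.4%} of d^2")
```

The fraction simplifies to $2 r / d$, so it grows only linearly in the chosen rank.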

Matrix $B$ is initialized to $0$ so that
- when Fine-Tuning begins

$$
\begin{array} \\
\W' & = & \W_0 &  + & (A*B) \\
& = & \W_0 & + & (A * 0) \\
& = & \W_0
\end{array}
$$
- the initial output is the *same* as that produced by the unmodified weights

$A, B$ get updated during Fine-Tuning
- by gradient descent on the elements of the matrices

The original weights $\W_0$ are **frozen** and not updated by Gradient Descent.
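
The training procedure can be sketched on a toy regression problem. Everything in this example is an assumption made for illustration: the synthetic Target task (whose ideal weight change happens to be of rank $r$), the mean squared error loss, the learning rate, and the hand-derived gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 6, 2, 64

W0 = rng.normal(size=(d, d)) / np.sqrt(d)     # frozen pre-trained weights
W0_before = W0.copy()
A = rng.normal(size=(d, r)) / np.sqrt(d)      # down projection: random init
B = np.zeros((r, d))                          # up projection: zero init

# Hypothetical Target task: the ideal change to W0 happens to be of rank r
delta_true = (rng.normal(size=(d, r)) @ rng.normal(size=(r, d))) / d
X = rng.normal(size=(n, d))
Y = X @ (W0 + delta_true)


def loss(A, B):
    return 0.5 * np.mean(np.sum((X @ W0 + (X @ A) @ B - Y) ** 2, axis=1))


initial_loss = loss(A, B)
lr = 0.05
for _ in range(500):
    err = (X @ W0 + (X @ A) @ B - Y) / n      # dLoss/dh, shape (n, d)
    grad_A = X.T @ (err @ B.T)                # shape (d, r)
    grad_B = (X @ A).T @ err                  # shape (r, d)
    A -= lr * grad_A                          # only A and B are updated
    B -= lr * grad_B

final_loss = loss(A, B)
print(final_loss < initial_loss)        # the low-rank update reduces the loss
print(np.array_equal(W0, W0_before))    # True: W0 was never modified
```

Note that with $B = 0$ the gradient for $A$ is zero on the first step, so learning starts through $B$ and $A$ begins to move only once $B$ is non-zero.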

Note the similarity to the Adapter used in a Parallel arrangement.

The advantage of the Parallel arrangement compared to a Series arrangement
- the Series introduces an added layer
- each time it appears
- which slows *inference*

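The Parallel arrangement used in LoRA does not introduce latency at inference time
- with no intervening non-linearity, the product $A * B$ can be added to $\W_0$ once, after training
- so inference uses the single matrix $\W' = \W_0 + \Delta \W$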

## How big does $r$ have to be ?

Not much ! Values of $r \le 2$ seem to do very well in an experiment

The accuracy reported when $r=2$ is almost the same as when $r = 64$
<br>
<table>
    <center><strong>LoRA: accuracy versus rank $r$</strong></center>
    <img src="images/LoRA_by_rank.png">
    <br>
        Attribution: https://arxiv.org/pdf/2106.09685.pdf#page=10
</table>



## Results

How do the various adaptation methods compare according to the authors ?

LoRA with 37.7 million parameters ($.02 \%$ of GPT-3) *outperforms* full Fine-Tuning.

<br>
<table>
    <center><strong>LoRA: Performance, by method of adaptation</strong></center>
    <img src="images/LoRA_results.png">
    <br>
        Attribution: https://arxiv.org/pdf/2106.09685.pdf#page=8
</table>

## Is LoRA as good as full Fine-Tuning ?

Compare the $\Delta \W$ of  LoRA to that of full Fine-Tuning
- $\Delta \W_\text{LoRA}$ is low rank
- $\Delta \W_\text{Fine tuning}$ is of unconstrained rank

It has been [shown](https://arxiv.org/pdf/2405.09673) that for some Target tasks
- $\Delta \W$ is of *high* rank
- so LoRA will under-perform Fine Tuning for these Target tasks

On the other hand:
- $\W'_\text{LoRA} = \W_0 + \Delta \W_\text{LoRA}$
- is more similar to $\W_0$
- than $\W'_\text{Fine tuning} = \W_0 + \Delta \W_\text{Fine tuning}$

so it has been found that LoRA is *less likely to forget* the Source Task than full Fine-Tuning.

**Summary**

LoRA
- learns less (Target task)
- forgets less (Source task)

### Technical aside: what does "similar" mean above ?

**Note**

"Similar" is used in a very loose manner above.

The relation between modified $\W'$, base $\W_0$, and "perturbation" $\Delta \W$ was evaluated
- Using SVD (recall: used in PCA)
- To determine the number of singular vectors required to capture $90 \%$ of the variance of each

For Fine-tuning
- The number of singular vectors required 
- was similar for $\W'_\text{Fine tuning}, \W_0$ and $\Delta \W_\text{Fine tuning}$

For LoRA
- the number of singular vectors for $\Delta \W_\text{LoRA}$ is much smaller, for low rank $r$
- the rank $r$ required 
    - in order for the number of singular vectors of $\Delta \W_\text{LoRA}$ 
    - to approach the number of singular vectors of $\W_0$
    - is 10 to 100 times greater than the typical (small) $r$ used in LoRA
    

So the low rank $r$ typically used for $\Delta \W_\text{LoRA}$
- is unable to capture the true (i.e., Fine-tuned) rank of $\W'_\text{Fine tuning}$

This is dependent on the Source Task (i.e., $\W_0$) and the Target Task (i.e., $\W'$)
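
The measurement described in this aside can be sketched with NumPy. The example assumes that "variance" means the sum of squared singular values, and it uses random matrices as stand-ins for the two kinds of $\Delta \W$; it does not reproduce the paper's data.

```python
import numpy as np


def n_singular_vectors(M: np.ndarray, threshold: float = 0.90) -> int:
    """Smallest k such that the top-k singular values capture `threshold`
    of the total squared singular values (used here as the "variance")."""
    s = np.linalg.svd(M, compute_uv=False)           # sorted, largest first
    captured = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(captured, threshold) + 1)


rng = np.random.default_rng(0)
d, r = 64, 2
delta_full = rng.normal(size=(d, d))                             # unconstrained rank
delta_lora = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))   # rank at most r

print(n_singular_vectors(delta_full))   # many singular vectors needed
print(n_singular_vectors(delta_lora))   # at most r = 2
```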

# BitFit

**References**
- [BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models](https://arxiv.org/pdf/2106.10199.pdf)



Our goal remains
- to fine-tune a base model
- without having to adapt many parameters

LoRA achieves this goal
- by leaving base model parameters unchanged
- adding Adapters
    - training only Adapter weights
    
This paper takes a different approach
- adapt a *small number* of base model parameters
   

Surprisingly: just fine-tuning the *bias* terms ("intercept") works pretty well !

To be specific: the bias parameters of Attention lookup layers are modified.

**Recall 1**

From the [Attention Lookup module](Attention_Lookup.ipynb#Projecting-queries,-keys-and-values)
- Attention creates queries, keys, and values
    - based on the sequences (states) produced by earlier layers of the Transformer
- Rather than using the raw states of the Transformer
as queries (resp., keys/values)
- we can map them through projection/embedding *matrices* $\W_Q, \W_K, \W_V$
    - each mapping matrix shape is $(d \times d)$
    - thus, the mapping preserves the shapes of $Q, K, V$


- Mapping through these matrices:

out  &nbsp;  &nbsp;  &nbsp;  &nbsp; | &nbsp; | left &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | right &nbsp;  &nbsp;  &nbsp;|
:--:|:-:|:-:|:-:|:-:
$Q$ | = | $Q$| * |$\W_Q$ |
$(T \times d)$ | | $(T \times d)$ | | $(d \times d)$
&nbsp;
$K$ | = | $K$| * |$\W_K$ |
$V$ | = | $V$| * |$\W_V$ |
$(\bar T \times d)$ | | $(\bar T \times d)$ | | $(d \times d)$

**Recall 2**

Our notational practice in dealing with the "bias" term
- when computing a dot product $\w \cdot \x$ we add
    - a constant "1" as first element of $\x$ (let's call the augmented vector $\x'$)
    - the bias parameter $b$ as the first element of $\w$ (let's  call this $\w'$)

So
$$
\w \cdot \x + b = \w' \cdot \x'
$$
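
A quick numeric check of this identity, using arbitrary illustrative values:

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


w = [0.5, -1.0, 2.0]
x = [3.0, 1.0, 0.25]
b = 0.75

w_aug = [b] + w        # w': bias as first element
x_aug = [1.0] + x      # x': constant 1 as first element

print(dot(w, x) + b)       # 1.75
print(dot(w_aug, x_aug))   # 1.75
```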

This paper
- keeps $\w$ frozen
- modifies $b$

where these terms are parts of  $\W_Q, \W_K, \W_V$.
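
A toy sketch of bias-only tuning follows. The single linear map, the synthetic Target task (which differs from the base model only by a bias shift), the loss and the learning rate are all assumptions for illustration; the example shows the mechanics of freezing $\w$ while training $b$, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 32

W = rng.normal(size=(d, d))          # frozen weights (e.g. one of W_Q, W_K, W_V)
W_before = W.copy()
b = np.zeros(d)                      # bias: the only trainable parameter

# Hypothetical Target task that differs from the base model by a bias shift
b_target = np.array([0.5, -1.0, 0.25, 2.0])
X = rng.normal(size=(n, d))
Y = X @ W + b_target

lr = 0.5
for _ in range(100):
    err = X @ W + b - Y              # shape (n, d)
    grad_b = err.mean(axis=0)        # gradient of 0.5 * mean squared error w.r.t. b
    b -= lr * grad_b                 # W receives no update

print(np.allclose(b, b_target))      # True
print(np.array_equal(W, W_before))   # True
```

Only $d$ values are trained here, compared with the $d^2$ entries of the frozen matrix.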

On small to medium fine-tuning datasets
- performance comparable to fine-tuning *all* parameters

On large fine-tuning datasets
- performance comparable to other sparse methods

# Conclusion: Fine-Tuning is easy for everyone !

Fine-Tuning a huge model like GPT-3 seemed out of the realm of possibility for individuals or small organizations.
- huge memory requirements
- time intensive
    - even with the *much smaller* number of examples in the Fine-Tuning dataset compared to the Pre-Training datasets
    
Parameter Efficient Transfer learning shows
- Fine-Tuning is now accessible on consumer grade hardware
- With negligible loss of performance (maybe even a gain) compared to full Fine-Tuning

Our module on [Transformer Scaling](Transformers_Scaling.ipynb)
- highlighted a trend
- to *smaller* Large Language Models
- with performance matching very large models (like GPT-3).

Combined with Parameter Efficient Fine-Tuning
- it is [now possible to Fine-Tune a model (LLaMA 7B)](https://arxiv.org/pdf/2303.16199.pdf)
- with performance equivalent to GPT-3 (175B parameters)
- using 8 A100 GPUs
- in one hour !




In [2]:
print("Done")

Done
