# What Model and Model Configuration (LoRA, etc.) should I Fine-Tune with?

## Picking a Model

### Model size
It depends on the use case, but if the breadth of the task is narrow to medium, there's no need to deal with the parallelism that much larger models require. Smaller models fit on a single GPU, and those GPUs are faster and easier to get.
- At time of writing, 7B parameter models are the most popular (the base models, not the instruction-tuned variants).

### Model family

Because axolotl abstracts the model away, it's very easy to try different models (especially if they fit on the same GPU). Even if you have to use a new instance, switching should be easy.

- Do whatever is fashionable. Take the recently released Llama 3 and use it! If a model is recently released and widely used, it's likely to be very good.

- To find such models, you can go to HuggingFace and sort by what's trending. You can also check community subreddits (r/LocalLLaMA).

- People overindex on this choice. If you've tried a couple of models, that's enough. Trying more models won't improve your results immensely.


### LoRA vs Full Fine Tune

Almost always use LoRA or some variant of it. As a practitioner, you will almost never need a full fine-tune. The reasons to prefer one are mostly theoretical.

## LoRA and QLoRA: Efficient Model Fine-Tuning Techniques

The vast majority of fine-tuning (that I know of) is done with LoRA or QLoRA. Let's take a look at both:

### LoRA in a Nutshell (Low-Rank Adaptation)

**Key**:
- **W** = Weight Matrix
- **R** = Set of Real Numbers
- **∈** = "an element of", "belongs to"

**Key Concepts:**
- **Weight Matrix (W):** A matrix whose entries are model parameters
- **Feed Forward Network:** Consists of layers transforming input embeddings into output vector representations, typically high-dimensional

**Goal:** Reduce the number of trainable parameters for efficient fine-tuning

**Method:** Freezes the pre-trained weight matrix and learns the update to it as the product of two much smaller matrices, reducing the trainable parameter count

**Benefit:** Less memory and computational resources used

Let's start to explain the LoRA approach with an example of a matrix with dimensions: 

- **Matrix W ∈ R^(4000×4000).** 

Fully fine-tuning this matrix in a feed forward network would require updating all 16 million parameters, which is expensive in compute and memory.

Instead of updating W directly, LoRA keeps W frozen and represents the change to W as the product of two smaller matrices, which reduces the number of parameters that need to be learned and updated. Let's say LoRA uses Matrix A and Matrix B below:

- **Matrix A ∈ R^(4000×16)**
- **Matrix B ∈ R^(16×4000)**
  
Number of parameters for A and B: 64k + 64k = 128k parameters. That is far fewer than the 16 million parameters in our original matrix W.
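The arithmetic above can be checked for other ranks too. The following is a minimal sketch, not part of any library: `lora_param_counts` is a hypothetical helper, and the ranks 8 and 64 are illustrative values added next to the 16 used in this example.

```python
def lora_param_counts(m, n, k):
    """Return (full, lora) trainable-parameter counts for an m x n weight matrix."""
    full = m * n          # a full fine-tune updates every entry of W
    lora = m * k + k * n  # LoRA trains only A (m x k) and B (k x n)
    return full, lora

for k in (8, 16, 64):
    full, lora = lora_param_counts(4000, 4000, k)
    # Print the rank, the LoRA parameter count, and its share of the full count.
    print(k, lora, f"{lora / full:.2%}")
# 8 64000 0.40%
# 16 128000 0.80%
# 64 512000 3.20%
```

Even at rank 64, the trainable parameters stay at a few percent of the full matrix.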

**Why LoRA works:**

Any matrix in R^(m×n) with rank at most k can be written exactly as the product of A ∈ R^(m×k) and B ∈ R^(k×n). LoRA assumes that the weight update needed for fine-tuning is well approximated by such a low-rank product. If k is much smaller than m and n, the parameter count is greatly reduced. The rank in our example is 16.

The sum of the parameters in A and B (128k) is the number of elements that actually need to be stored and updated during fine-tuning, not the number of elements in the product A × B.

Matrix multiplication rules dictate that if A ∈ R^(m×k) and B ∈ R^(k×n), then the resulting matrix C = A × B has shape m×n. The inner dimensions (16 in this case) must match for the multiplication to be defined. C therefore has shape 4000 × 4000, the same shape as W, so it can be added to W.

In other words, shape and parameter count are different things. The product A × B has shape 4000 × 4000, but the number of elements (parameters) we need to store and learn during fine-tuning is only 64k + 64k.
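The difference between shape and stored parameters can be seen directly in NumPy. This is only an illustration using the dimensions from the example above. The matrices are filled with zeros because only their shapes matter here.

```python
import numpy as np

m, n, k = 4000, 4000, 16
A = np.zeros((m, k), dtype=np.float32)  # 4000 x 16
B = np.zeros((k, n), dtype=np.float32)  # 16 x 4000

C = A @ B  # inner dimensions (16) match, so the product is defined

print(C.shape)          # (4000, 4000), the same shape as W
print(C.size)           # 16000000 entries in the product
print(A.size + B.size)  # 128000 entries actually stored and trained
```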

**Benefits**

With LoRA, the original model's parameters stay frozen, and the small A and B matrices capture task-specific adaptations without full re-training. In practice, enabling it is just a configuration flag. Because far fewer parameters are trained, it also **reduces the risk of overfitting**.

By factorizing the weight update into small, low-rank matrices, LoRA adapts pre-trained models to new tasks with less GPU memory and compute.
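To make the idea concrete, here is a minimal NumPy sketch of a LoRA-style forward pass. It is a sketch under stated assumptions, not how any particular library implements it: the sizes, the `alpha` scaling value, and the `lora_forward` helper are all hypothetical, and the layout follows this document's A (m×k), B (k×n) convention. One factor starts at zero so the adapter changes nothing before training.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 64, 64, 4  # small hypothetical sizes so the sketch runs instantly
alpha = 8            # hypothetical scaling hyperparameter

W = rng.normal(size=(m, n))              # pre-trained weights: frozen, never updated
A = rng.normal(scale=0.01, size=(m, k))  # trainable factor, small random init
B = np.zeros((k, n))                     # trainable factor, zero init so A @ B starts at 0

def lora_forward(x, W, A, B, alpha, k):
    """Frozen path plus scaled low-rank path. W itself is left untouched."""
    return x @ W + (alpha / k) * ((x @ A) @ B)

x = rng.normal(size=(2, m))
y = lora_forward(x, W, A, B, alpha, k)

print(y.shape)                # (2, 64)
print(np.allclose(y, x @ W))  # True: the adapter is a no-op before training
```

During training only A and B would receive gradient updates. Afterwards the learned update can be folded back in as `W + (alpha / k) * (A @ B)`, which has the same shape as W.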



### QLoRA: Quantized Low-Rank Adaptation

**Adding Quantization to LoRA:**

QLoRA reduces memory requirements further by quantizing the frozen base model weights, typically from 16- or 32-bit floating point down to 4 bits, while the small LoRA matrices are still trained in higher precision.

For example, 64k elements at 32 bits take 2,048,000 bits, while 64k elements at 8 bits take 512,000 bits (4x memory savings) and at 4 bits take 256,000 bits (8x memory savings).
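The storage arithmetic can be reproduced with a few lines of Python. `storage_bits` is a hypothetical helper, and the 4-bit row is added for comparison.

```python
def storage_bits(num_values, bits_per_value):
    """Total bits needed to store num_values at the given precision."""
    return num_values * bits_per_value

n = 64_000
for bits in (32, 8, 4):
    # Print the precision, the total storage, and the savings relative to 32 bits.
    print(bits, storage_bits(n, bits), f"{32 // bits}x")
# 32 2048000 1x
# 8 512000 4x
# 4 256000 8x
```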

**Bit Reduction** maps each value onto a smaller set of allowed values (e.g., 16 possible values for 4-bit storage).

This is most suitable for resource-constrained environments. Because quantization loses precision, it might have some negative impact on your results.
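As a rough illustration of mapping values onto 16 levels, here is a sketch of plain uniform (min-max) quantization in pure Python. This is an assumption made for simplicity: it is not the specific 4-bit scheme QLoRA uses, and the helper names and sample weights are hypothetical.

```python
def quantize_4bit(values):
    """Map floats onto 16 evenly spaced levels (codes 0..15) between min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 15 steps separate 16 levels; guard against hi == lo
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize_4bit(codes, lo, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]

weights = [-0.8, -0.22, 0.0, 0.31, 0.7]  # hypothetical weight values
codes, lo, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, lo, scale)

# Every code fits in 4 bits, and each restored value is within half a step of the original.
print(all(0 <= c <= 15 for c in codes))
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2 + 1e-9)
```

The rounding error in the restored values is the precision lost in exchange for the smaller storage footprint.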
