## How to fine-tune an LLM

### Fine-tuning an LLM
* Purpose: Fine-tuning adjusts a pre-trained LLM to perform better on a specific task or domain, using a smaller, domain-specific dataset.
* Advantages: Improves performance on domain-specific tasks and costs far less than training a model from scratch.

### Steps in Fine-tuning
1. Select a Pre-trained Model: Choose a base model aligned with the desired functionality (e.g. Microsoft's Phi-2, Meta's LLaMA).
2. Gather Relevant Dataset: Collect a relevant dataset, with the necessary labels or structure. This step is often considered the real challenge.
3. Preprocess Dataset: Clean the data and split it into training, validation, and test sets.
4. Fine-tuning: Adjust the model on the prepared dataset to specialize for the specific task.
5. Iterate: Further tailor the model so that it becomes more attuned to the specific requirements, language, and knowledge of the task or domain it will be used for.
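
As a rough illustration of step 3, the split into training, validation, and test sets can be sketched in plain Python. The `split_dataset` helper, the 80/10/10 proportions, and the toy prompt/response records are assumptions made for this example, not part of any particular library or of the workflow described above.

```python
import random

def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle examples and split them into train, validation, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical toy dataset of prompt/response pairs.
examples = [{"prompt": f"question {i}", "response": f"answer {i}"} for i in range(100)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Shuffling before splitting avoids putting all examples from one source or time period into a single set. Real datasets may call for a stratified or grouped split instead.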

### Fine-tuning methods

* Full Fine-tuning: Updates all model weights. This offers the most flexibility but also requires the most memory and compute.
* Parameter Efficient Fine-Tuning (PEFT): Updates only a subset of parameters, reducing memory requirements and preserving learned information.

### Popular Fine-tuning methods: LoRA and QLoRA 

#### LoRA (Low-Rank Adaptation)

Here's how it works:
* Weight Matrix Approximation: Instead of updating the entire weight matrix of the pre-trained model (which can be very large and resource-intensive), LoRA trains much smaller matrices. These smaller matrices are designed to approximate the changes to the larger weight matrix.
* LoRA Adapters: The smaller matrices, known as "LoRA adapters," are inserted into the model. These adapters capture the adjustments to the model's weights needed to adapt it to the specific task or domain.
* Efficiency and Performance: By fine-tuning only these smaller matrices, you can achieve performance improvements similar to those of full fine-tuning at significantly lower computational cost.
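
To get a feel for the savings, the number of trainable values can be compared directly. The sketch below assumes a single 4096 x 4096 weight matrix and an adapter rank of 8; both numbers are illustrative choices, not taken from a specific model.

```python
def full_params(d_out, d_in):
    """Trainable values when the whole d_out x d_in matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable values in two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 8  # assumed sizes, for illustration only
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

With these assumed sizes the adapters hold well under 1% of the values in the full matrix, which is where the memory savings come from.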

Here's a simple example to make LoRA more concrete:

Imagine you have a 6 x 6 weight matrix W, which stands in for a weight matrix of the pre-trained model. (In real LLMs, a single weight matrix typically has thousands of rows and columns, and the model as a whole has billions of parameters.)

```python
import numpy as np

# Stand-in for a pre-trained weight matrix
W = np.random.randint(1, 10, size=(6, 6))
W
```

```
array([[8, 4, 7, 6, 8, 2],
       [7, 4, 7, 6, 6, 2],
       [4, 2, 7, 6, 3, 6],
       [4, 3, 9, 3, 4, 8],
       [4, 6, 4, 1, 4, 1],
       [6, 5, 9, 2, 4, 9]])
```

In LoRA, instead of fine-tuning all the elements of W, we approximate changes to W using two smaller matrices, A and B. Let's say A is a 6x2 matrix and B is a 2x6 matrix:

```python
# Two small factor matrices (rank 2)
A = np.random.random((6,2))
B = np.random.random((2,6))
print(f"A: \n{A}")
print(f"B: \n{B}")
```

```
A: 
[[0.67149334 0.43698063]
 [0.76943609 0.13184095]
 [0.74918086 0.69805827]
 [0.74757621 0.01147748]
 [0.14243444 0.28975049]
 [0.17845079 0.60621866]]
B: 
[[0.52789592 0.62018729 0.07316205 0.6132555  0.48515602 0.10829241]
 [0.86703269 0.96251119 0.59437175 0.50047613 0.52704569 0.64197135]]
```


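Finding the matrices A and B is what the fine-tuning process does. The random values above are only for illustration. In the original LoRA formulation, one of the two matrices is initialized randomly and the other to zero, so that their product starts at zero and the model initially behaves exactly like the pre-trained one. Both matrices are then updated through training to better approximate the changes needed in the large weight matrix W for the specific task.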

The product of A and B, which we'll call C, is a 6x6 matrix.

```python
# Low-rank update: (6x2) @ (2x6) -> 6x6
C = np.dot(A, B)
print(f"C: \n{C}")
```

```
C: 
[[0.73335508 0.83705038 0.30885677 0.63049536 0.55608779 0.35324668]
 [0.52049258 0.60409287 0.13465606 0.53784417 0.44278276 0.1679622 ]
 [1.00072885 1.13652134 0.46971773 0.80880078 0.7313782  0.52926401]
 [0.40459378 0.47468447 0.0615161  0.46419943 0.36874026 0.08832505]
 [0.3264137  0.36722411 0.1826403  0.2323619  0.22181467 0.20143608]
 [0.61981483 0.69416515 0.37337507 0.4128339  0.40608141 0.40849987]]
```
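
Although the product is a full 6x6 matrix, it can never have rank higher than 2, because it is built from a 6x2 and a 2x6 factor. That is the "low-rank" in Low-Rank Adaptation. A quick check, using freshly generated random factors (the seed and the names `A_demo`, `B_demo`, `C_demo` are arbitrary choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
A_demo = rng.random((6, 2))
B_demo = rng.random((2, 6))
C_demo = A_demo @ B_demo

# 36 entries in the product, but only 6*2 + 2*6 = 24 free values behind them
rank = np.linalg.matrix_rank(C_demo)
print(C_demo.shape, rank)
```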


In LoRA, the original large matrix W is not directly modified during fine-tuning. Instead, the smaller matrices A and B are fine-tuned, and their product C is used to adjust W during the model's forward pass.

So, the model uses an effective weight matrix W', which is the sum of the original W and the low-rank approximation C:

$$
W' = W + C
$$
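
The whole idea can be tried end to end on the toy sizes used here. The following is a minimal sketch, not a real training loop: the synthetic task (targets produced by W plus a hidden rank-2 shift), the squared-error loss, the learning rate, and the step count are all assumptions made for illustration. W stays frozen, one factor starts at zero, and plain gradient descent updates only A and B.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 6, 2, 64  # matrix size, adapter rank, number of samples (all assumed)

W = rng.integers(1, 10, size=(d, d)).astype(float)  # frozen "pre-trained" weights
W_before = W.copy()

# Hypothetical task: targets come from W plus an unknown rank-2 shift.
true_shift = 0.5 * rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
X = rng.normal(size=(n, d))
Y = X @ (W + true_shift)

A = 0.1 * rng.normal(size=(d, r))  # random start
B = np.zeros((r, d))               # zero start, so A @ B = 0 and W' = W initially

def loss(A, B):
    """Mean squared error of the model that uses W' = W + A @ B."""
    residual = X @ (W + A @ B) - Y
    return 0.5 * np.mean(np.sum(residual ** 2, axis=1))

initial_loss = loss(A, B)
lr = 0.01
for _ in range(2000):
    residual = X @ (W + A @ B) - Y  # forward pass with the effective weights
    grad_C = X.T @ residual / n     # gradient with respect to C = A @ B
    grad_A = grad_C @ B.T           # chain rule through C = A @ B
    grad_B = A.T @ grad_C
    A -= lr * grad_A                # only the adapters are updated
    B -= lr * grad_B
final_loss = loss(A, B)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Only the 24 values in A and B change during training, while the 36 values of W are left untouched. Practical implementations often also multiply the product by a scaling factor, which this sketch omits.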