### Fine-tuning an LLM with mlx-lm

We use Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) method. Instead of updating a large weight matrix directly, LoRA freezes it and learns an update expressed as the product of two much smaller low-rank matrices, typically in the attention layers (the configuration below also targets the MLP projections).
This drastically reduces the number of parameters that need to be trained. Combining LoRA with a quantized base model (QLoRA) further reduces the memory footprint of the model.
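
To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. It is an illustration under stated assumptions, not mlx-lm's implementation: the helper names (`matvec`, `lora_forward`) and the toy dimensions are hypothetical, and the zero initialization of `B` follows the usual LoRA convention.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale):
    """Frozen base projection plus a scaled low-rank update: W x + scale * B (A x)."""
    base = matvec(W, x)                # frozen pretrained weights
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes for illustration only (hypothetical, far smaller than a real model).
d_in, d_out, rank = 8, 6, 2
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d_out)]                              # trainable, starts at zero
x = [rng.gauss(0, 1) for _ in range(d_in)]

full_params = d_in * d_out            # parameters in the frozen matrix: 48
lora_params = rank * (d_in + d_out)   # parameters actually trained: 28

y = lora_forward(W, A, B, x, scale=10.0)
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the base layer exactly; training then only moves `A` and `B`. The saving grows with matrix size, since the trained parameter count scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.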

Some of the most important parameters to consider when fine-tuning an LLM are:

```yaml
learning_rate: 1e-6

lora_parameters:
  keys: ['mlp.gate_proj', 'mlp.down_proj', 'self_attn.q_proj', 'mlp.up_proj', 'self_attn.o_proj', 'self_attn.v_proj', 'self_attn.k_proj']
  alpha: 256
  rank: 128
  scale: 10.0
  dropout: 0.1

lr_schedule:
  name: cosine_decay
  warmup: 100 # 0 for no warmup
  warmup_init: 1e-7 # 0 if not specified
  arguments: [1e-5, 10000, 1e-7] # passed to scheduler
```
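
The shape of this learning-rate schedule can be sketched in plain Python. This is an approximation for intuition, not mlx-lm's code: the function name `lr_at` is hypothetical, and I am assuming that `arguments` means (peak rate, decay steps, final rate), that warmup is linear from `warmup_init` to the peak, and that the cosine decay starts once warmup ends. The exact handling of the boundary step may differ in the library.

```python
import math


def lr_at(step, warmup=100, warmup_init=1e-7,
          peak=1e-5, decay_steps=10_000, end=1e-7):
    """Learning rate at a given step: linear warmup, then cosine decay (assumed semantics)."""
    if step < warmup:
        # Linear ramp from warmup_init up to the peak rate.
        return warmup_init + (peak - warmup_init) * step / warmup
    # Cosine decay from peak down to end, held at end after decay_steps.
    t = min(step - warmup, decay_steps)
    return end + 0.5 * (peak - end) * (1 + math.cos(math.pi * t / decay_steps))


for step in (0, 50, 100, 5_100, 10_100):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions the rate climbs from 1e-7 to 1e-5 over the first 100 steps and then falls smoothly back to 1e-7 over the next 10,000.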

```bash
mlx_lm.lora --config mlx-lora-llama3.1-8b.yaml
```

```text
Trainable parameters: 4.178% (335.544M/8030.261M)
```
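
The reported count can be reproduced by hand, which is a useful sanity check that the adapters were attached where intended. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, MLP intermediate size 14336, 8 key/value heads of dimension 128, 32 transformer layers) and assumes that all 32 layers are adapted; none of these values are stated in the configuration above, and the helper `lora_params` is hypothetical.

```python
# Assumed Llama 3.1 8B dimensions (not taken from the config shown above).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128        # 8 key/value heads x head dimension 128
NUM_LAYERS = 32         # assumes every layer receives adapters
RANK = 128              # lora_parameters.rank from the config

# (input dim, output dim) of each projection listed under lora_parameters.keys
SHAPES = {
    "self_attn.q_proj": (HIDDEN, HIDDEN),
    "self_attn.k_proj": (HIDDEN, KV_DIM),
    "self_attn.v_proj": (HIDDEN, KV_DIM),
    "self_attn.o_proj": (HIDDEN, HIDDEN),
    "mlp.gate_proj": (HIDDEN, INTERMEDIATE),
    "mlp.up_proj": (HIDDEN, INTERMEDIATE),
    "mlp.down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, rank):
    """Parameters in one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in SHAPES.values())
trainable = per_layer * NUM_LAYERS
pct = 100 * trainable / 8030.261e6  # total taken from the log line above

print(f"{trainable / 1e6:.3f}M")  # 335.544M
```

Under these assumptions each layer contributes 10,485,760 adapter parameters, and 32 layers give 335,544,320, matching the log.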

## Create a new model in GGUF format with Q4_0 quantization
### Modelfile
```text
FROM /tmp/Meta-Llama-3.1-8B-Instruct
ADAPTER /tmp/finetune/adapters/
```
With this Modelfile in the current directory, `ollama create` combines the base model with the fine-tuned adapter and quantizes the result to Q4_0:
```bash
ollama create llama3.1:8b-cyber -q q4_0
```