# Fine-tuning an LLM using QLoRA with axolotl

QLoRA (Quantized Low-Rank Adaptation) is a recent technique that makes it possible to fine-tune an LLM with reduced hardware requirements. In this notebook, we'll do an example run using the `axolotl` tool developed by the OpenAccess AI Collective.

## Basic Setup

In [1]:
!git clone https://github.com/OpenAccess-AI-Collective/axolotl

Cloning into 'axolotl'...
remote: Enumerating objects: 4056, done.
remote: Counting objects: 100% (1997/1997), done.
remote: Compressing objects: 100% (433/433), done.
remote: Total 4056 (delta 1624), reused 1759 (delta 1472), pack-reused 2059
Receiving objects: 100% (4056/4056), 1.58 MiB | 11.48 MiB/s, done.
Resolving deltas: 100% (2555/2555), done.


In [2]:
%cd axolotl/

!pip3 install -e .
!pip3 install -U git+https://github.com/huggingface/peft.git

/content/axolotl
Obtaining file:///content/axolotl
  Preparing metadata (setup.py) ... done
Collecting transformers@ git+https://github.com/huggingface/transformers.git (from axolotl==0.1)
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-install-xzfgiavj/transformers_3c38cf3aea0e44d7a48b6db0d94efbf3
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-install-xzfgiavj/transformers_3c38cf3aea0e44d7a48b6db0d94efbf3
  Resolved https://github.com/huggingface/transformers.git to commit 05cda5df3405e6a2ee4ecf8f7e1b2300ebda472e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bitsandbytes>=0.39.0 (from axolotl==0.1)
  Downloading bitsandbytes-0.41.0-py3-none-any.whl (92.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.6/92.6 MB 8.7 MB/s

## Edit the config file

We'll be running the example at `examples/openllama-3b/qlora.yml`. However, we'll make some modifications:

- Some fixes to make it work on a T4 GPU (disable `bf16` and `tf32`, and enable `fp16`)
- Add a connection to the wandb tool, a SaaS for logging, metrics, and monitoring of AI training runs. It also supports uploading checkpoints to save your intermediate work, and more
  - Remember to enable periodic checkpoint saving in the first place!
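The T4 fix in the first bullet comes down to GPU generation: `bf16` and `tf32` need an Ampere or newer GPU (CUDA compute capability 8.0 and up), while the T4 is a Turing card with compute capability 7.5. The sketch below illustrates that rule; the helper name `pick_float_type` is hypothetical and not part of axolotl.

```python
def pick_float_type(compute_capability):
    """Return "bf16" on Ampere or newer GPUs, otherwise "fp16".

    `compute_capability` is a (major, minor) tuple. On a machine with
    PyTorch and a CUDA GPU, torch.cuda.get_device_capability() returns
    one; here it is passed in by hand so the sketch runs anywhere.
    """
    major, _minor = compute_capability
    return "bf16" if major >= 8 else "fp16"


# T4 is Turing (7.5); A100 is Ampere (8.0).
print("T4:", pick_float_type((7, 5)))
print("A100:", pick_float_type((8, 0)))
```

The result corresponds to the `float_type` choice made when editing the config later in this notebook.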

Let's examine the content of the config file:

In [3]:
!cat examples/openllama-3b/qlora.yml

base_model: openlm-research/open_llama_3b
base_model_config: openlm-research/open_llama_3b
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: true
strict: false
push_dataset_to_hub:
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.01
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len: 2048
lora_r: 8
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: ./qlora-out
batch_size: 4
micro_batch_size: 4
num_epochs: 2
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize
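A note on `batch_size` and `micro_batch_size` in this config: axolotl's launch log describes the equivalent gradient accumulation as `batch_size / micro_batch_size / number of gpus`. The sketch below only illustrates that arithmetic; the function is a hypothetical helper, not axolotl's own code.

```python
def gradient_accumulation_steps(batch_size, micro_batch_size, num_gpus=1):
    """Apply the relation batch_size / micro_batch_size / number of GPUs."""
    steps, remainder = divmod(batch_size, micro_batch_size * num_gpus)
    if steps < 1 or remainder:
        raise ValueError(
            "batch_size should be a positive multiple of micro_batch_size * num_gpus"
        )
    return steps


# Values from this config on a single GPU: 4 / 4 / 1
print(gradient_accumulation_steps(batch_size=4, micro_batch_size=4))

# A hypothetical larger setup: 32 / 4 / 2
print(gradient_accumulation_steps(batch_size=32, micro_batch_size=4, num_gpus=2))
```

With the values used here, each optimizer step processes a single micro-batch, so no gradients are accumulated across batches.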

Next, we should log in to wandb:

In [4]:
!pip3 install wandb



In [5]:
import wandb

wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

Then make the modifications to the config file:

In [6]:
!pip3 install pyyaml



In [7]:
# @title Actual edit of config

# @markdown Enter your `wandb` project name
wandb_project = "qlora-retest" # @param {type:"string"}
# @markdown How often to save a checkpoint. With the default of 50 steps, a save happens about every 20 minutes
save_steps = 50 # @param {type:"integer"}
# @markdown Not yet tested in this version of the notebook. Enable to resume from a locally saved checkpoint instead of starting from scratch
resume_from_checkpoint = False # @param {type:"boolean"}
# @markdown Which floating-point type to use. Both `bf16` and `tf32` require an Ampere or newer GPU. On a T4, `fp16` is the only option that works
float_type = "fp16" # @param ["bf16", "fp16", "tf32"]


import yaml

filename = "examples/openllama-3b/qlora.yml"

with open(filename, "r") as file:
    qlora_config = yaml.safe_load(file)

# Reset all precision flags, then enable only the selected one
for key in ("bf16", "fp16", "tf32"):
    qlora_config[key] = key == float_type

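# Log to wandb, upload checkpoints there, and save every `save_steps` steps
qlora_config["wandb_project"] = wandb_project
qlora_config["wandb_log_model"] = "checkpoint"
qlora_config["save_steps"] = save_steps
qlora_config["resume_from_checkpoint"] = resume_from_checkpoint

# Write the config back in place (yaml.dump sorts keys alphabetically by default)
with open(filename, "w") as file:
    yaml.dump(qlora_config, file)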

!cat {filename}

adapter: qlora
base_model: openlm-research/open_llama_3b
base_model_config: openlm-research/open_llama_3b
batch_size: 4
bf16: false
dataset_prepared_path: last_run_prepared
datasets:
- path: teknium/GPT4-LLM-Cleaned
  type: alpaca
debug: null
deepspeed: null
early_stopping_patience: null
eval_steps: 20
flash_attention: null
fp16: true
fsdp: null
fsdp_config: null
gptq_groupsize: null
gptq_model_v1: null
gradient_checkpointing: true
group_by_length: true
learning_rate: 0.0002
load_in_4bit: true
load_in_8bit: false
local_rank: null
logging_steps: 1
lora_alpha: 32
lora_dropout: 0.05
lora_fan_in_fan_out: null
lora_model_dir: null
lora_r: 8
lora_target_linear: true
lora_target_modules: null
lr_scheduler: cosine
max_packed_sequence_len: 2048
micro_batch_size: 4
model_type: LlamaForCausalLM
num_epochs: 2
optimizer: paged_adamw_32bit
output_dir: ./qlora-out
push_dataset_to_hub: null
resume_from_checkpoint: false
save_steps: 50
sequence_len: 2048
special_tokens:
  bos_token: <s>
  eos_token: </
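The rewritten file looks different from the original even for keys that were not changed: the keys are now in alphabetical order and empty values appear as `null`. This is a side effect of round-tripping through PyYAML, not of axolotl. A minimal sketch, using a made-up three-key config rather than the real file:

```python
import yaml

# A tiny stand-in for the config file; the empty value parses to None.
original = "wandb_project:\nbf16: true\nadapter: qlora\n"
config = yaml.safe_load(original)

# Default behavior: keys are sorted and None is written as `null`.
sorted_text = yaml.dump(config, default_flow_style=False)

# sort_keys=False (PyYAML 5.1 or newer) keeps the original key order.
ordered_text = yaml.dump(config, default_flow_style=False, sort_keys=False)

print(sorted_text)
print(ordered_text)
```

Either form loads back to the same dictionary, so the reordering is cosmetic. Comments in the original file, if any, are lost in the round trip.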

## Let's go!

A complete training run takes roughly 16 hours, possibly up to 24. Since this is just an exercise to prove that it runs, feel free to stop the training partway through. Please be patient when you do: there will be some delay while it calls wandb to sync the data. Note that some extra wandb output is not shown in this notebook.
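To get a feel for the scale of a full run, the rough figures in this notebook (about 20 minutes per 50 steps, and 16 to 24 hours in total) can be combined into a back-of-the-envelope estimate. The helper below is hypothetical and the numbers are only as good as those rough figures.

```python
def run_estimate(total_hours, save_steps=50, minutes_per_save=20):
    """Return (approx_steps, approx_checkpoints) for a run of the given length."""
    checkpoints = int(total_hours * 60 // minutes_per_save)
    return checkpoints * save_steps, checkpoints


for hours in (16, 24):
    steps, checkpoints = run_estimate(hours)
    print(f"{hours} h: ~{steps} steps, ~{checkpoints} checkpoints")
```

Since each checkpoint is also uploaded to wandb, a larger `save_steps` value reduces both the number of saves and the amount of data synced.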


In [8]:
!accelerate launch scripts/finetune.py examples/openllama-3b/qlora.yml

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To calculate the equivalent gradient_accumulation_steps, divide batch_size / micro_batch_size / number of gpus.
[2023-07-30 21:48:29,409] [INFO] [axolotl.scripts.train:219] [PID:2744] loading tokenizer... openlm-research/open_llama_3b
Downloading tokenizer.model: 100% 534k/534k [00:00<00:00, 12.0MB/s]
Downloading (…)cial_tokens_map.json: 100% 330/330 [00:00<00:00, 2.46MB/s]
Downloading (…)okenizer_config.json: 100% 593/593 [00:00<00:00, 4.56MB/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at ht