<a href="https://colab.research.google.com/github/mgfrantz/CTME-llm-lecture-resources/blob/main/finetunint/fine_tuning_and_inference_with_axolotl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine tuning with Axolotl

[Axolotl](https://github.com/axolotl-ai-cloud/axolotl) is a convenient library for fine-tuning text generation models.
In this notebook, we will use `axolotl` to fine-tune a small LLM on a dataset we've created.
Then, we'll run our evaluation suite on that model to see how it compares to other solutions we've explored in the past.

## Environment setup and imports

In [1]:
import os
if os.path.exists("axolotl"):
  !rm -rf axolotl
!git clone https://github.com/axolotl-ai-cloud/axolotl
# This handles a mismatch between xformers torch requirements and that of other dependencies
with open('/content/axolotl/requirements.txt', 'r') as file:
    requirements = file.read()
    # replace xformers==0.0.27 with xformers
    requirements = requirements.replace('xformers==0.0.27', 'xformers')
with open('/content/axolotl/requirements.txt', 'w') as file:
    file.write(requirements)
!pip install -qqqq ninja packaging mlflow=="2.13.0"
!cd axolotl && pip install -qqqq -e ".[flash-attn,deepspeed]"

Cloning into 'axolotl'...
remote: Enumerating objects: 16083, done.
remote: Counting objects: 100% (5171/5171), done.
remote: Compressing objects: 100% (905/905), done.
remote: Total 16083 (delta 4521), reused 4586 (delta 4107), pack-reused 10912 (from 1)
Receiving objects: 100% (16083/16083), 6.02 MiB | 19.62 MiB/s, done.
Resolving deltas: 100% (10481/10481), done.
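
The setup cell above unpins `xformers` with an exact string match, which silently does nothing if the pinned version in `requirements.txt` changes. A minimal sketch of a version-agnostic alternative follows; `unpin` is a hypothetical helper, not part of axolotl, and it only handles simple one-line specifiers.

```python
import re


def unpin(requirements: str, package: str) -> str:
    """Drop the version specifier from lines that pin `package` (hypothetical helper)."""
    pattern = re.compile(
        # package name, optional [extras], then a comparison operator and its version
        rf"^([ \t]*{re.escape(package)}(?:\[[^\]]*\])?)[ \t]*(?:==|>=|<=|~=|!=|<|>)[^\n#;]*",
        flags=re.MULTILINE,
    )
    return pattern.sub(r"\1", requirements)


# Illustrative requirements text, not the real file contents
sample = "torch==2.3.1\nxformers==0.0.27\npeft\n"
patched = unpin(sample, "xformers")
print(patched)
```

In the notebook, the same function could be applied to the text read from `requirements.txt` before writing it back.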
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m25.0/25.0 MB[0m [31m73.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m307.2/307.2 kB[0m [31m22.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m233.2/233.2 kB[0m [31m19.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m147.8/147.8 kB[0m [31m13.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m207.3/207.3 kB[0m [31m18.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━

In [2]:
from google.colab import userdata
import os
token = userdata.get('HF_TOKEN')
os.environ['HF_TOKEN'] = token
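
If the secret is missing or empty, the failure tends to surface much later, when training tries to push to the Hub. A small guard can fail fast instead. This is a sketch using only the standard library; `require_env` is a hypothetical helper name and the error text is our own.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it under Colab secrets and re-run the cell above."
        )
    return value
```

Calling `require_env("HF_TOKEN")` right after the cell above would stop the notebook before any GPU time is spent.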

## Axolotl configuration

In [5]:
%%writefile test_axolotl.yaml

# Model config
adapter: qlora
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
bf16: auto

# HF hub config (push to huggingface)
# requires HF_TOKEN api key to be set (👈🔑secrets)
hf_use_auth_token: true
hub_model_id: mgfrantz/axolotl-test
mlflow_experiment_name: axolotl-test

# Data config
dataset_prepared_path: null
datasets:
- path: mhenrichsen/alpaca_2k_test
  type: alpaca

# Training config
debug: null
deepspeed: null
early_stopping_patience: null
eval_sample_packing: false
evals_per_epoch: 4
flash_attention: true
fp16: null
fsdp: null
fsdp_config: null
gradient_accumulation_steps: 4
gradient_checkpointing: true
group_by_length: false


learning_rate: 0.0002
load_in_4bit: true
load_in_8bit: false
local_rank: null
logging_steps: 1
lora_alpha: 16
lora_dropout: 0.05
lora_fan_in_fan_out: null
lora_model_dir: null
lora_r: 32
lora_target_linear: true
lora_target_modules: null
lr_scheduler: cosine
micro_batch_size: 8
model_type: LlamaForCausalLM
num_epochs: 4
optimizer: paged_adamw_32bit
output_dir: ./outputs/qlora-out
pad_to_sequence_len: true
resume_from_checkpoint: null
sample_packing: true
saves_per_epoch: 1
sequence_len: 4096
special_tokens: null
strict: false
tf32: false
tokenizer_type: LlamaTokenizer
train_on_inputs: false
val_set_size: 0.05
wandb_entity: null
wandb_log_model: null
wandb_name: null
wandb_project: null
wandb_watch: null
warmup_steps: 10
weight_decay: 0.0
xformers_attention: null


Overwriting test_axolotl.yaml
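
A few quantities implied by this config are worth working out by hand. The sketch below copies the relevant values into a plain dict and assumes a single process (the `accelerate` default reported in the training log) and the standard LoRA scaling factor of `lora_alpha / lora_r`.

```python
# Values copied from test_axolotl.yaml above
config = {
    "micro_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "sequence_len": 4096,
    "lora_alpha": 16,
    "lora_r": 32,
}
num_processes = 1  # assumption: one GPU, as in a default Colab runtime

# Sequences contributing to each optimizer step
effective_batch = (
    config["micro_batch_size"] * config["gradient_accumulation_steps"] * num_processes
)

# With sample packing and padding to sequence_len, each sequence holds up to 4096 tokens
max_tokens_per_step = effective_batch * config["sequence_len"]

# Standard LoRA scales the low-rank update by alpha / r
lora_scale = config["lora_alpha"] / config["lora_r"]

print(effective_batch, max_tokens_per_step, lora_scale)
```

Doubling `gradient_accumulation_steps` while halving `micro_batch_size` keeps the effective batch size the same and lowers peak memory, which is a common adjustment on smaller GPUs.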


## Fine tuning

In [6]:
# Prefixing a line with ! runs it as a shell command
!accelerate launch -m axolotl.cli.train /content/test_axolotl.yaml

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2024-09-27 20:53:32.180645: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-09-27 20:53:32.197597: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-27 20:53:32.218657: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-0
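
During this run the learning rate follows the `lr_scheduler: cosine` setting with `warmup_steps: 10` and a peak of `0.0002`. The sketch below shows the usual linear-warmup, cosine-decay shape for such a schedule. It is an illustration, not axolotl's own scheduler code, and `total_steps=100` is a made-up value since the real step count depends on the packed dataset.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 2e-4, warmup_steps: int = 10) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Hypothetical run length, for illustration only
for step in (0, 5, 10, 55, 100):
    print(step, lr_at(step, total_steps=100))
```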

In [9]:
!python3 -m axolotl.cli.merge_lora test_axolotl.yaml

2024-09-27 21:15:40.083798: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-09-27 21:15:40.101025: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-27 21:15:40.121847: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-27 21:15:40.128114: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-27 21:15:40.142996: I tensorflow/core/platform/cpu_feature_guar
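
Conceptually, merging folds each low-rank LoRA update back into the frozen weight it adapts, so the result can be loaded as an ordinary model with no adapter attached. The toy example below shows the arithmetic, `W' = W + (alpha / r) * B @ A`, on tiny hand-written matrices. It is a sketch of the idea under the standard LoRA formulation, not the code that `axolotl.cli.merge_lora` runs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) for list-of-rows matrices."""
    scale = alpha / r
    delta = matmul(B, A)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy values: a 2x2 weight with a rank-2 adapter and alpha / r = 0.5
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 2.0]]
A = [[2.0, 4.0], [6.0, 8.0]]
merged = merge_lora(W, A, B, alpha=1, r=2)
print(merged)
```

Applying `merged` to an input gives the same result as applying `W` and the scaled adapter path separately, which is why the merged checkpoint needs no PEFT wrapper at inference time.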

In [15]:
!huggingface-cli login --token $HF_TOKEN --add-to-git-credential

Token is valid (permission: write).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.[0m
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [19]:
import os
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'
!huggingface-cli upload mgfrantz/axolotl-test outputs/qlora-out/merged/ .

  0% 0/1 [00:00<?, ?it/s]
pytorch_model.bin:   0% 0.00/2.20G [00:00<?, ?B/s]

In [11]:
from google.colab import runtime
runtime.unassign()

# Inference

## Installs and imports

In [20]:
!pip install -qqqq peft transformers accelerate bitsandbytes

In [37]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from rich import print

## Load model

In [38]:
# Free any previously loaded model; the name must be quoted to test for its existence
if "model" in vars(): del model
BASE_CKPT = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
ADAPTER_CKPT = "mgfrantz/axolotl-test"
config = PeftConfig.from_pretrained(ADAPTER_CKPT)
base_model = AutoModelForCausalLM.from_pretrained(BASE_CKPT)
tokenizer = AutoTokenizer.from_pretrained(BASE_CKPT)
model = PeftModel.from_pretrained(base_model, ADAPTER_CKPT).eval().cuda()

In [24]:
# Alternative: load the checkpoint from the Hub directly, without a PEFT wrapper
if "model" in vars(): del model
model = AutoModelForCausalLM.from_pretrained(ADAPTER_CKPT).eval().cuda()

config.json:   0%|          | 0.00/750 [00:00<?, ?B/s]

In [39]:
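def format_instruction(instruction):
    # Alpaca-style prompt (no-input variant), assumed to match the `type: alpaca`
    # format used in training. Only the first two lines are joined with a backslash.
    text = f"""\
Below is an instruction that describes a task. \
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""
    return text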

In [40]:
def parse_response(text):
    return text.split("### Response:")[1].strip()
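
`parse_response` raises an `IndexError` if the marker is missing from the decoded text. A slightly more defensive variant is sketched below; `parse_response_safe` is a hypothetical name, and truncating at a follow-on `### Instruction:` block is our own choice for handling a model that keeps generating extra turns.

```python
RESPONSE_MARKER = "### Response:"
INSTRUCTION_MARKER = "### Instruction:"


def parse_response_safe(text: str) -> str:
    """Extract the first response from generated text, tolerating a missing marker."""
    _, found, tail = text.partition(RESPONSE_MARKER)
    if not found:
        # No marker at all: fall back to the whole text
        return text.strip()
    # Drop any extra instruction/response turns the model generated afterwards
    tail = tail.split(INSTRUCTION_MARKER, 1)[0]
    return tail.strip()


print(parse_response_safe("### Instruction: Say hi ### Response: Hi there!"))
```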

In [60]:
def do_inference(instruction):
    encoded = tokenizer(format_instruction(instruction), return_tensors="pt")
    encoded = {
        k: v.to("cuda") for k, v in encoded.items()
    }
    with torch.inference_mode():
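        # top_k only takes effect when sampling is enabled; without do_sample=True, decoding is greedy
        outputs = model.generate(**encoded, max_new_tokens=500, do_sample=True, top_k=3)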

    output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    return parse_response(output_text)


In [61]:
output = do_inference("Tell me how to bake a cake.")

In [62]:
print(output)