<a href="https://colab.research.google.com/github/prasanth-ntu/pookie-llm-finetuning-resources/blob/main/finetuning/unsloth/ascii_art_completion_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Completion finetuning using unsloth

For my notes, refer to https://prasanth.io/Talks/How-to-Fine-tune-LLMs-with-Unsloth---Complete-Guide-by-Pookie

**Google Colab T4 resources**
- System RAM: 12.7 GB
- GPU RAM: 15.0 GB
- Disk: 112.6 GB

In [None]:
# !top

In [None]:
# Sneak peek at hardware specs
import os
import subprocess

try:
    # Print System RAM
    print("System RAM:")
    ram_output = subprocess.check_output(["free", "-h"])
    print(ram_output.decode("utf-8"))

    # Print GPU RAM
    print("\nGPU RAM:")
    gpu_output_raw = subprocess.check_output("nvidia-smi | grep 'MiB' | awk '{print $9, $11}'", shell=True).decode("utf-8").strip()
    gpu_parts = gpu_output_raw.split()
    if len(gpu_parts) == 2:
        used_gpu = gpu_parts[0]
        total_gpu = gpu_parts[1]
        print(f"{used_gpu} (used) / {total_gpu} (total)")
    else:
        print(gpu_output_raw)


    # Print Disk Space
    print("\nDisk:")
    disk_output = subprocess.check_output("df -h / | awk 'NR==2{print $2, $3, $4, $5}'", shell=True)
    print(disk_output.decode("utf-8"))

except subprocess.CalledProcessError as e:
    print(f"Error executing command: {e}")
except FileNotFoundError as e:
    print(f"Command not found: {e}")

In [None]:
# GPU usage
!nvidia-smi

This notebook uses Unsloth to finetune a model for a completion task.
In this example we finetune the Llama 3.2 base model to generate ASCII art. I recommend Unsloth over using the Hugging Face libraries alone because it needs less memory and trains faster.

Adapted from the Unsloth notebooks. If something is broken, check:
https://unsloth.ai/

In [None]:
!ls

In [None]:
%%capture
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3  peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

In [None]:
!pip install -U datasets

### Load base model

- Model: [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
  - Parameters: 3B
  - Variant: pretrained
  - modality: text only

In [None]:
# Takes ~2 mins
from unsloth import FastLanguageModel # Makes finetuning LLMs faster and more efficient
import torch
from google.colab import userdata

# Load the pretrained model and the tokenizer from HF model repo
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-3B",
    # Sequences longer than this are truncated. A sequence includes both the input tokens (prompt)
    # and the output tokens (LLM-generated output)
    max_seq_length = 2048,
    # With None, Unsloth checks your hardware and picks the supported lower-precision
    # data type (float16 or bfloat16) that gives the best performance and memory efficiency
    dtype = None,
    # If enabled, model weights are loaded using 4-bit quantization to significantly reduce memory
    # usage by trading off precision
    load_in_4bit = False,
    token=userdata.get('HF_ACCESS_TOKEN')
)

In [None]:
# Disables the automatic removal of potentially extra spaces around punctuation during the
# tokenizer's decoding process, ensuring that whitespace is preserved exactly as tokenized.
# In some specific use cases, particularly when dealing with text where the exact spacing is
# important (like in the ASCII art example you have, or in code where whitespace matters),
# you want to preserve all whitespace precisely as the model generates it.
tokenizer.clean_up_tokenization_spaces = False
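
To see why this matters for ASCII art, the sketch below imitates a few of the punctuation rules that decoding cleanup applies when it is enabled. `simplified_cleanup` is a hypothetical stand-in written for this illustration, not the library function, and it covers only a subset of the rules.

```python
def simplified_cleanup(text: str) -> str:
    """Hypothetical imitation of decode-time cleanup: drop the space before some punctuation."""
    for spaced, tight in [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")]:
        text = text.replace(spaced, tight)
    return text

# A small cat whose face depends on exact spacing
cat = " /\\_/\\\n( o . o )\n > ^ <"
cleaned = simplified_cleanup(cat)

print("original:")
print(cat)
print("after cleanup:")
print(cleaned)
```

The face line loses a space once cleanup runs, which is enough to shift the drawing. Setting `clean_up_tokenization_spaces = False` avoids this kind of change.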

In [None]:
!ls

### Add LoRA to base model and patch with Unsloth

In [None]:
# More info about parameters: https://huggingface.co/docs/peft/v0.11.0/en/package_reference/lora#peft.LoraConfig
# LoRA works by injecting small, trainable low-rank matrices into specific layers of the pre-trained model
target_modules = [
    # query, key, and value projection layers in the self-attention mechanism
    "q_proj", "k_proj", "v_proj",
    # output projection layer in the self-attention mechanism
    "o_proj",
    # layers within the feed-forward network (MLP) block
    "gate_proj", "up_proj", "down_proj"
    ]

# When adding special tokens
train_embeddings = False

if train_embeddings:
  target_modules = target_modules + ["lm_head"]

model = FastLanguageModel.get_peft_model(
    model, # base language model
    r = 16, # rank of the LoRA matrices; according to the paper, a relatively low rank loses little quality
    target_modules = target_modules,  # Modules of the LLM where the LoRA adapters are inserted
    lora_alpha = 16, # scaling factor for the adapter update (higher = more influence on the base model), 16 was recommended on Reddit
    lora_dropout = 0, # The tutorial default was 0.05, but Unsloth says 0 is optimized
    bias = "none",    # "none" is optimized
    use_gradient_checkpointing = "unsloth", # "unsloth" for very long context, decreases VRAM usage
    random_state = 3407,
    use_rslora = False,  # rank-stabilized LoRA: scales the update by lora_alpha/sqrt(r) instead of lora_alpha/r, Hugging Face says this works better
    loftq_config = None, # LoftQ initialization is not used
)
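
As a rough check on how small the adapter is, the sketch below counts the LoRA parameters added for a given rank. Each adapted linear layer of shape `d_in x d_out` gains two matrices with `r * (d_in + d_out)` parameters in total. The layer shapes are assumptions about the Llama-3.2-3B configuration (hidden size 3072, 8 key/value heads of dimension 128, MLP size 8192, 28 layers), so verify them against `model.config` before relying on the number.

```python
# Assumed Llama-3.2-3B shapes; verify against model.config
HIDDEN = 3072
KV_DIM = 1024          # assumed: 8 key/value heads * head dimension 128
INTERMEDIATE = 8192
NUM_LAYERS = 28

# (input features, output features) of every linear layer that receives an adapter
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes=PROJ_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(16)
print(f"LoRA parameters at r=16: {trainable:,}")
print(f"Share of a 3B-parameter model: {trainable / 3e9:.2%}")
```

Under these assumptions the adapter stays below 1% of the base model's parameters, and doubling `r` doubles the count.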

In [None]:
empty_prompt = """
{ascii_art}
"""

# A special marker that signals the end of a sequence to the LLM.
# It's crucial for the model to learn where the generated output should end.
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func_no_prompt(examples):
  """
  Formats a batch of ASCII art examples into training prompts for a language model.

  Each ASCII art sample is wrapped in a simple template and appended with an
  end-of-sequence token.

  Args:
      examples (dict): A dictionary containing a batch of data, expected to have
                       an "ascii" key with a list of ASCII art strings.

  Returns:
      dict: A dictionary with a single key "text" containing a list of formatted
            training prompt strings.
  """
  print(f"len(examples): {len(examples)}")
  ascii_art_samples = examples["ascii"]
  print(f"len(ascii_art_samples): {len(ascii_art_samples)}")
  training_prompts = []
  for ascii_art in ascii_art_samples:
      training_prompt = empty_prompt.format(ascii_art=ascii_art) + EOS_TOKEN
      training_prompts.append(training_prompt)
  return { "text" : training_prompts, }


from datasets import load_dataset
dataset_org = load_dataset("pookie3000/ascii-cats", split = "train", download_mode="force_redownload")
dataset = dataset_org.map(formatting_prompts_func_no_prompt, batched = True)
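
The sketch below shows what the template does to a single sample, without loading the dataset or the tokenizer. The EOS string is a placeholder assumption; in the notebook the real value comes from `tokenizer.eos_token`.

```python
empty_prompt = """
{ascii_art}
"""

# Placeholder assumption; the notebook uses tokenizer.eos_token instead
EOS_PLACEHOLDER = "<|end_of_text|>"

sample = " /\\_/\\\n( o.o )"
text = empty_prompt.format(ascii_art=sample) + EOS_PLACEHOLDER

# repr() makes the newlines added by the template visible
print(repr(text))
```

The template adds one newline before and one after the art, and the EOS marker follows immediately, so the model learns to stop once the drawing is complete.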

### Visualize dataset

In [None]:
EOS_TOKEN

In [None]:
print(f"dataset_org: {dataset_org}")
print(f"dataset: {dataset}")

In [None]:
# Compare the "ascii" and "text" in raw format
dataset[0]["ascii"]

In [None]:
dataset[0]["text"]

In [None]:
# Compare the "dataset_org" and "dataset" for a random sample
import random
random_index = random.randint(0, len(dataset) - 1)
print(f"Random index: {random_index}")

print(f"\n=== dataset_org ===")
# print(f"Content at random index:\n{dataset_org[random_index]}")
for k,v in dataset_org[random_index].items():
  print (f"--- {k} ---\n{v}")

print(f"\n=== dataset ===")
# print(f"Content at random index:\n{dataset_org[random_index]}")
for k,v in dataset[random_index].items():
  print (f"--- {k} ---\n{v}")

In [None]:
for i, sample in enumerate(dataset):
    print(f"\n------ Sample {i + 1} ----")
    print(sample["text"])
    if i > 2:
      break

In [None]:
from trl import SFTTrainer # trl - library designed for finetuning LLMs on a given dataset
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Configure the training process by specifying the model, tokenizer, dataset, and all necessary
# hyperparameters for the SFTTrainer to begin finetuning the LoRA-adapted model.
trainer = SFTTrainer( # Finetuning LLMs on a supervised dataset
    model = model, # LLM to be finetuned (the LoRA-adapted model prepared earlier)
    tokenizer = tokenizer, # Tokenizer associated with the model
    train_dataset = dataset, # Formatted dataset we prepared earlier for training
    dataset_text_field = "text", # Col in the dataset that contains the text data for training
    max_seq_length = 2048, # Should match with setting when loading the model
    dataset_num_proc = 2, # Number of processes to use for processing the dataset
    # Define all the hyperparameters and configuration settings for the training process
    args = TrainingArguments(
        per_device_train_batch_size = 2, # batch size per GPU (or CPU if not using a GPU) during training
        # Gradients are accumulated over this many batches before a single optimization step (parameter update)
        # (2 samples per device * 4 accumulation steps = effective batch size of 8). This is useful when
        # GPU memory cannot fit a large batch directly
        gradient_accumulation_steps = 4, # process 4 batches before updating parameters (parameter update == step)
        # full passes through the training dataset
        num_train_epochs = 5, # 1 - 3 is commonly suggested to limit overfitting; 5 is used here
        learning_rate = 2e-4, # Initial learning rate for the optimizer
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1, # Log training information (like loss) every step
        optim = "adamw_8bit", # 8-bit AdamW optimizer, which is memory-efficient. Often used with techniques like LoRA
        weight_decay = 0.01, # A regularization parameter to prevent overfitting
        # A "linear" scheduler decreases the learning rate linearly from the initial value to 0 over the course of training
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs", # The directory where training outputs (like checkpoints and logs) will be saved.
        # Specifies where to report training progress (e.g., "tensorboard", "wandb", "none").
        report_to = "none", # Setting it to "none" disables reporting to external services.
    ),
)
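
The batch-size and epoch settings above determine how many optimizer steps the run takes. The sketch below works through that arithmetic. The dataset size of 200 is a hypothetical value for illustration, and the step count is an approximation, since the exact rounding depends on the trainer version.

```python
import math

def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Samples that contribute to one parameter update."""
    return per_device * grad_accum * num_devices

def approx_optimizer_steps(num_samples, per_device, grad_accum, epochs, num_devices=1):
    """Approximate number of parameter updates over the whole run."""
    batch = effective_batch_size(per_device, grad_accum, num_devices)
    return math.ceil(num_samples / batch) * epochs

batch = effective_batch_size(per_device=2, grad_accum=4)
steps = approx_optimizer_steps(num_samples=200, per_device=2, grad_accum=4, epochs=5)  # 200 is hypothetical

print(f"effective batch size: {batch}")
print(f"approximate optimizer steps: {steps}")
```

With `logging_steps = 1`, the training log shows roughly one loss line per optimizer step.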

In [None]:
# Takes ~4 mins if num_train_epochs = 5
trainer_stats = trainer.train()

In [None]:
# After training the model
!ls

In [None]:
!ls -a -l outputs/

In [None]:
!ls -a -l -h outputs/checkpoint-130/

### Inference

In [None]:
from transformers import TextStreamer # Streams the generated text from the model token by token

def generate_ascii_art(model):
    """
    Generates ASCII art using the finetuned language model.

    Applies Unsloth inference optimizations, prepares an empty input,
    sets up a text streamer for token-by-token output, and generates
    ASCII art using the model's generate method.

    Args:
        model: The finetuned language model object.
    """
    FastLanguageModel.for_inference(model) # Applies Unsloth's optimizations specifically for inference

    # Prepares the inputs for the model
    inputs = tokenizer(
        # Tokenizes an empty string. Since this model is finetuned for completion (generating ASCII
        # art without a specific text prompt), an empty string serves as the starting point
        "",
        return_tensors = "pt" # Specifies that the output should be PyTorch tensors
      ).to("cuda") #  Moves the input tensors to the GPU for faster processing.

    text_streamer = TextStreamer(tokenizer)

    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationMixin
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationConfig
    # generate() returns a tensor of token IDs (one row per generated sequence). The streamer
    # already prints the decoded text, so this loop additionally prints the raw IDs.
    for token in model.generate( # A versatile function for controlling text generation
        **inputs, # Unpacks the input tensors created earlier
        streamer = text_streamer,
        max_new_tokens = 100 # Model will generate up to 100 new tokens of ASCII art
      ):
        print(token)

In [None]:
for _ in range(3):
  print("-"*40)
  generate_ascii_art(model)

## Saving

### Save lora adapter

This is useful both for inference and for loading the model again later.

The code below only saves the **small LoRA adapter weights** and does not save the full base model. When you load this model later for inference or further finetuning, you would typically load the original base model (`meta-llama/Llama-3.2-3B`) and then load these LoRA weights on top of it. This is beneficial because the LoRA adapter files are much smaller than the full model, making them easier to store and share. This format is useful if you want to continue finetuning or use the model within frameworks that support loading LoRA adapters separately.

In [None]:
# Takes only a few seconds
model.push_to_hub(
    # "pookie3000/Llama-3.2-3B-ascii-cats-lora",
    "prasanthntu/Llama-3.2-3B-ascii-cats-lora",
    tokenizer,
    token = userdata.get('HF_ACCESS_TOKEN')
)

### Merge model with lora weights and save to gguf

You can then do inference locally with Ollama or llama.cpp

##### Popular quantization methods

- **q4_k_m**  
  4-bit quantization. Low memory. Many models pulled with Ollama default to a 4-bit quantization like this one.
- **q8_0**  
  8-bit quantization. Medium memory.
- **f16**  
  16-bit floating point. Many models are already stored in 16-bit, in which case no quantization happens.
- **not_quantized**  
  Often the same as f16.

This code snippet does two main things before saving:

- It **merges the LoRA adapter weights with the base model weights**. This creates a single, consolidated model where the finetuning changes are incorporated directly into the base model's parameters.
- It then **saves this merged model in the GGUF format**. GGUF is a binary format designed for efficient loading and inference of large language models on various hardware, particularly CPUs and consumer GPUs, using tools like llama.cpp and Ollama. The `quantization_method="q4_k_m"` argument specifies a quantization method (4-bit in this case) to further reduce the model size and improve inference speed, often with minimal loss in quality.
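
To get a feel for the size trade-off between these methods, the sketch below estimates file sizes from nominal bit widths. It is a back-of-the-envelope calculation that assumes a round 3 billion parameters; real GGUF files are somewhat larger because quantized formats also store per-block scales and keep some tensors at higher precision.

```python
# Nominal bits per weight; actual GGUF formats use slightly more
NOMINAL_BITS = {"q4_k_m": 4, "q8_0": 8, "f16": 16}

def nominal_size_gb(num_params, bits_per_weight):
    """Lower-bound file size in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 3e9  # assumption: round figure for a "3B" model

for name, bits in NOMINAL_BITS.items():
    print(f"{name}: ~{nominal_size_gb(NUM_PARAMS, bits):.1f} GB")
```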

In [None]:
# Took around 15 mins
model.push_to_hub_gguf(
    # "pookie3000/Llama-3.2-3B-ascii-cats-lora-q4_k_m-GGUF",
    "prasanthntu/Llama-3.2-3B-ascii-cats-lora-q4_k_m-GGUF",
    tokenizer,
    quantization_method="q4_k_m",
    token = userdata.get('HF_ACCESS_TOKEN')
)

### Load model and saved lora adapters
Use this if you want to continue finetuning or run inference with the model in safetensors format.

In [None]:
from unsloth import FastLanguageModel # Makes finetuning LLMs faster and more efficient
import torch
from google.colab import userdata

In [None]:
# Takes ~5 mins
# Note: If the code fails, restart the runtime, rerun the pip installations, and then run this cell directly
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name="pookie3000/Llama-3.2-3B-ascii-cats-lora",
    model_name="prasanthntu/Llama-3.2-3B-ascii-cats-lora",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
    token=userdata.get('HF_ACCESS_TOKEN')
)


def generate_ascii_art(model):
    FastLanguageModel.for_inference(model)
    inputs = tokenizer("", return_tensors = "pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationMixin
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationConfig
    for token in model.generate(**inputs, streamer = text_streamer, max_new_tokens = 100):
        print(token)
        pass


In [None]:
!ls

In [None]:
for i in range(3):
  print("-"*30)
  generate_ascii_art(model)

# Appendix

## Why set `max_seq_length = 2048` while the actual context length of the Llama 3.2 (text only) model is 128K?
While the Llama 3.2 model can handle a context length of up to 128K tokens, there are several reasons why you might set `max_seq_length` to a smaller value like 2048 in a finetuning notebook, especially when using libraries like Unsloth:

- **Computational Resources**: Processing very long sequences requires significantly more computational resources (GPU memory and time). Finetuning with extremely long sequences can be prohibitively expensive or even impossible on consumer-grade hardware or platforms with limited resources like Google Colab (even with a T4 GPU). A `max_seq_length` of 2048 is a common and manageable size for many tasks and hardware setups.
- **Memory Constraints**: The memory required for training grows with the sequence length. With a naive attention implementation the attention score matrices grow quadratically, and even with memory-efficient attention kernels the stored activations still grow with every extra token. Setting a lower `max_seq_length` is a crucial technique for reducing memory consumption and avoiding "out of memory" errors during training. Unsloth helps make training more memory efficient, but there are still limits based on the available GPU RAM.
- **Task Requirements**: The optimal `max_seq_length` also depends on the specific task you are finetuning the model for. For a task like generating ASCII art based on a short prompt, a context length of 2048 might be more than sufficient to capture the necessary information and generate the desired output. Using a much larger context might not provide any significant benefit for this particular task and would only increase training costs.
- **Dataset Characteristics**: The nature of your training data also plays a role. If your training examples typically consist of sequences much shorter than 128K tokens, setting `max_seq_length` to a value that accommodates most of your training data without excessive padding or truncation is a reasonable approach.
- **Training Efficiency**: Shorter sequence lengths generally lead to faster training iterations (steps), as less computation is required per token. This can allow for quicker experimentation and faster convergence during the finetuning process.

In summary, while the base model has a large potential context window, the `max_seq_length` during finetuning is often set based on a balance of available computational resources, memory constraints, the specific requirements of the finetuning task, and the characteristics of the training data. 2048 is a common and practical choice for many finetuning scenarios on typical hardware.
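
To make the memory argument concrete, the sketch below estimates the size of the attention score matrices for one layer under a naive attention implementation. The head count of 24 and the 2-byte element size are assumptions, and memory-efficient attention kernels avoid materializing these matrices, so treat the numbers as an illustration of the quadratic trend rather than a measurement.

```python
def attention_scores_bytes(seq_len, num_heads=24, bytes_per_element=2):
    """Bytes for one layer's (heads x seq_len x seq_len) score tensor, batch size 1."""
    return num_heads * seq_len * seq_len * bytes_per_element

short_ctx = attention_scores_bytes(2048)
long_ctx = attention_scores_bytes(128 * 1024)  # assumes 128K means 131,072 tokens

print(f"2048 tokens: {short_ctx / 2**20:.0f} MiB per layer")
print(f"128K tokens: {long_ctx / 2**30:.0f} GiB per layer")
print(f"ratio: {long_ctx // short_ctx}x for a 64x longer sequence")
```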

## `get_peft_model()` function explained

*   **`model`**: The base language model object you loaded previously (`meta-llama/Llama-3.2-3B` in this case).
*   **`r = 16`**: This is the "rank" of the low-rank matrices used in LoRA. A higher rank means more trainable parameters and potentially more expressiveness, but also higher memory usage and slower training. A value of 16 is a common starting point and often provides a good balance.
*   **`target_modules = target_modules`**: This specifies the list of module names where the LoRA adapters will be inserted.
*   **`lora_alpha = 16`**: This is a scaling factor for the LoRA weights. With standard LoRA the adapter update is multiplied by `lora_alpha / r`, so a higher `lora_alpha` gives more weight to the LoRA adaptations compared to the original pre-trained weights. A value equal to `r` (16 in this case) is a common practice.
*   **`lora_dropout = 0`**: This sets the dropout rate for the LoRA layers. Dropout is a regularization technique to prevent overfitting. Setting it to 0 means no dropout is applied to the LoRA layers. Unsloth documentation often suggests 0 for better performance.
*   **`bias = "none"`**: This specifies whether bias terms should be trained alongside the LoRA weights. Setting it to `"none"` means no bias terms are trained, which is often recommended for LoRA.
*   **`use_gradient_checkpointing = "unsloth"`**: This enables gradient checkpointing, a technique that reduces memory usage during training by recomputing intermediate activations during the backward pass instead of storing them all. Setting it to `"unsloth"` uses Unsloth's optimized implementation of gradient checkpointing, which is particularly useful for long sequences.
*   **`random_state = 3407`**: This sets the random seed for reproducibility.
*   **`use_rslora = False`**: rsLoRA (rank-stabilized LoRA) is a variation of LoRA that scales the adapter update by `lora_alpha / sqrt(r)` instead of `lora_alpha / r`. Setting it to `False` uses the standard LoRA scaling.
*   **`loftq_config = None`**: LoftQ is a quantization-aware way of initializing the LoRA weights when the base model is quantized. Setting it to `None` means LoftQ is not used.
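
The interaction between `r`, `lora_alpha`, and `use_rslora` comes down to one scaling factor applied to the adapter update. The sketch below computes it for a few ranks. `lora_scaling` is a helper written for this illustration and is not part of any library API.

```python
import math

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Factor that multiplies the low-rank update before it is added to the base weights."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r

for r in (8, 16, 64):
    standard = lora_scaling(16, r)
    rank_stabilized = lora_scaling(16, r, use_rslora=True)
    print(f"r={r:>2}  standard={standard:.2f}  rsLoRA={rank_stabilized:.2f}")
```

With standard scaling the factor shrinks quickly as `r` grows, which is why `lora_alpha` is often raised together with `r`. The rank-stabilized variant shrinks more slowly.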