# Introduction
In this tutorial we will walk through how to fine-tune a (as of 2024) medium-sized large language model: the 2 billion parameter version of Google's Gemma 2 model. OpenAI's GPT-4 is reported to be in the trillion-parameter range and Meta's largest Llama 3 models are in the hundred-billion range, but such models are likely overkill for many use cases and are too large to train on the GPUs available to us on Frontera for this course. From my qualitative assessment, the 8 billion parameter version of Llama 3 appears to be the most popular choice for fine-tuning, likely because it's one of the newest models to be released and, during training, it can just barely fit in the memory available on a single consumer GPU such as NVIDIA's RTX 4090.
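
To make the "just barely fit" remark concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone occupy when stored in bf16 (2 bytes per parameter). The parameter counts are nominal round numbers and `weight_memory_gb` is a helper name made up for this illustration; real training also needs memory for activations, gradients, and optimizer state.

```python
# Rough memory arithmetic only: this counts the weights and nothing else.
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 2 bytes


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Return the memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts, not exact figures for any specific checkpoint.
for name, params in [("2B model", 2e9), ("8B model", 8e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights in bf16")
```

An RTX 4090 has 24 GB of memory, so roughly 16 GB of bf16 weights for an 8B model leaves only a few GB for everything else. That is one reason parameter-efficient methods such as LoRA are attractive at that scale.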

# 1. Torchtune library initialization
The torchtune Python library we will be using to fine-tune our LLM in this tutorial has a collection of fine-tuning "recipes" for many of the most popular LLMs, including the Llama family and Mistral models. It should be straightforward to adapt what we cover here to these other models should your use case require it. The torchtune documentation can be viewed at https://pytorch.org/torchtune/stable/index.html.

First let's make sure torchtune is installed and can initialize. Running the following command will give a brief overview of the functionality of the torchtune library: 

In [1]:
! tune

usage: tune [-h] {download,ls,cp,run,validate} ...

Welcome to the torchtune CLI!

options:
  -h, --help            show this help message and exit

subcommands:
  {download,ls,cp,run,validate}
    download            Download a model from the Hugging Face Hub.
    ls                  List all built-in recipes and configs
    cp                  Copy a built-in recipe or config to a local path.
    run                 Run a recipe. For distributed recipes, this supports
                        all torchrun arguments.
    validate            Validate a config and ensure that it is well-formed.


# 2. Downloading model files
You can download models directly using torchtune with a single command; however, this requires you to set up an account with the model repository huggingface.co. For convenience, we have already downloaded the model weights and tokenizer for the 2 billion parameter version of Google's Gemma 2 model to the directory `/work2/10156/gj3385/frontera/ml_institute/models/gemma-2b/` for use in this course. The parent directory also contains a fine-tuned version of gemma-2b that we will use later on.

We will now copy the model files to the temp directory on our current node for fast file access:

In [1]:
! mkdir -p /tmp/models
! cp -r /scratch1/10156/gj3385/ml_institute/models/. /tmp/models

Let's check that our model files copied over. Running the following command should show two subdirectories:
+ gemma-2b (our base model)
+ gemma-2b-lora-finetuned-alpaca (base fine-tuned on alpaca dataset)

In [2]:
! ls /tmp/models

gemma-2b  gemma-2b-lora-finetuned-alpaca


----
**For when you want to download other models on your own time**
1. Make an account on https://huggingface.co/join and log in
2. Find the model you want, e.g. https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
3. Some models are "gated" and require you to submit information about yourself and accept a licensing agreement before being granted access. For the example above, there is a form you can fill out on the model card page.
4. You'll receive a confirmation email when your access has been granted.
5. On Hugging Face, click on your profile icon, go to “*Settings*” and then “*Access Tokens*”, and click “*Create New Token*”. Be sure to click the checkbox for “*read access to contents of all public gated repos you can access*”, then scroll to the bottom and click “*Create Token*”. You will use the provided token string in your torchtune download command.
6. If you lose your token string, you'll have to go back to your “*Access Tokens*” page, which lists your current tokens, find the token you just made, click the three little dots on the far right, and select “*invalidate and refresh*” to get a new token string (there's no way to view the old token string).

**Example download command in torchtune:** 

`! tune download meta-llama/Meta-Llama-3-8B-Instruct --output-dir [directory you want model to go] --hf-token [your hugging face token]`
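
If you would rather not paste your token directly into a notebook cell, one option is to read it from an environment variable and assemble the command in Python. The sketch below only builds the argument list using the flags shown above. The variable name `HF_TOKEN`, the helper `build_download_command`, and the output directory are assumptions made for this illustration.

```python
import os
import shlex


def build_download_command(repo_id: str, output_dir: str, hf_token: str) -> list[str]:
    """Assemble the `tune download` arguments shown above as a list of strings."""
    return ["tune", "download", repo_id, "--output-dir", output_dir, "--hf-token", hf_token]


# HF_TOKEN is an assumed variable name; the fallback is a placeholder, not a real token.
token = os.environ.get("HF_TOKEN", "<your-token>")
cmd = build_download_command(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "/tmp/models/llama3-8b-instruct",  # hypothetical output directory
    token,
)

# Print the command with the token masked so it does not end up in saved notebook output.
print(shlex.join(cmd[:-1] + ["****"]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` when you actually want to start the download.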
    
----

# 3. Model fine-tuning recipe selection
Running the command below will retrieve a list of the preconfigured training, generation, and evaluation recipes available to us in torchtune. In the RECIPE column, "full_finetune" means training all of the model weights, which is expensive in both memory and compute. We will instead be doing a "lora_finetune", which trains only the weights of a small LoRA (low-rank adaptation) adapter added to the model. The "single_device" recipes run on a single GPU, while the "distributed" recipes run the training across multiple GPUs. Model architectures differ, so the CONFIG column shows the default configuration files available for specific models under each recipe. In this course we will be running:

RECIPE: **lora_finetune_distributed**  &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; CONFIG: **gemma/2B_lora**
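
To get a feel for why a LoRA fine-tune is so much cheaper than a full fine-tune, the sketch below counts the trainable parameters for a single weight matrix. LoRA freezes the original matrix and trains two small low-rank factors instead. The layer size and rank here are illustrative values, not numbers read from the Gemma configuration, and `lora_param_count` is a name made up for this example.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


d_in = d_out = 2048  # illustrative layer size
rank = 8             # illustrative LoRA rank

full = d_in * d_out                         # parameters trained in a full fine-tune
lora = lora_param_count(d_in, d_out, rank)  # parameters trained with LoRA

print(f"full weight matrix: {full:,} parameters")
print(f"LoRA factors:       {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

For this one layer the LoRA factors are under 1% of the size of the full matrix, and the gradients and optimizer state shrink along with them.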

In [2]:
! tune ls

RECIPE                                   CONFIG                                  
full_finetune_single_device              llama2/7B_full_low_memory               
                                         code_llama2/7B_full_low_memory          
                                         llama3/8B_full_single_device            
                                         llama3_1/8B_full_single_device          
                                         mistral/7B_full_low_memory              
                                         phi3/mini_full_low_memory               
                                         qwen2/7B_full_single_device             
                                         qwen2/0.5B_full_single_device           
                                         qwen2/1.5B_full_single_device           
full_finetune_distributed                llama2/7B_full                          
                                         llama2/13B_full                         
    

# 4. Configuring the Training Dataset
In the following section we will look at a common example dataset called *alpaca* that comes preintegrated into torchtune as well as a method to build your own custom datasets.

<h2><center>4.1 Initialize the Tokenizer</center></h2>
First we need to initialize the tokenizer for the specific model we plan to fine-tune. The tokenizer breaks the input text into small groups of characters and assigns each grouping a unique number, which is referred to as a <em>token</em> or <em>token id</em>. Note that the mapping is not quite 1 word = 1 token: some words will be broken up into several tokens, and special characters often get their own token. The set of unique token ids the tokenizer can produce is often referred to as the <em>vocabulary</em> of the model. For this example, we will be using torchtune's implementation of the Gemma model family tokenizer.

In [19]:
from torchtune.models.gemma import gemma_tokenizer

# Initialize the gemma tokenizer from a saved file
tokenizer = gemma_tokenizer(
    path="/tmp/models/gemma-2b/tokenizer.model",
    max_seq_len=512)

# Let's feed some text to the tokenizer to see that it's working
text = "Llamas are awesome!"
tokenized_text = tokenizer.encode(text)
print("Token IDs:  ", tokenized_text)

# Let's decode the tokens back to normal text one by one to see what they each represent
chars_per_token = []
for token in tokenized_text:
    chars_per_token.append(tokenizer.decode(token))
    
print("Characters: ", chars_per_token)

Token IDs:   [2, 214614, 2616, 708, 10740, 235341, 1]
Characters:  ['', 'Lla', 'mas', ' are', ' awesome', '!', '']
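
To build some intuition for how a word like "Llamas" ends up as several tokens, here is a toy tokenizer that scans left to right and always takes the longest piece it can find in its vocabulary. This is only a sketch: the vocabulary and ids are invented, and the real Gemma tokenizer uses a trained model that does not work exactly this way.

```python
def greedy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Split text into the longest vocabulary pieces, scanning left to right."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


# Hypothetical toy vocabulary; the ids are arbitrary.
toy_vocab = {"Lla": 10, "mas": 11, " are": 12, " awesome": 13, "!": 14, "L": 15, "a": 16}
id_to_piece = {token_id: piece for piece, token_id in toy_vocab.items()}

ids = greedy_tokenize("Llamas are awesome!", toy_vocab)
print(ids)                            # [10, 11, 12, 13, 14]
print([id_to_piece[i] for i in ids])  # ['Lla', 'mas', ' are', ' awesome', '!']
```

In the real output above, the first and last ids (2 and 1) decode to empty strings, and nothing in this toy version corresponds to them.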


<br>
<h2><center>4.2 Sample Packing - The Alpaca Dataset</center></h2>
Now let's load the alpaca dataset so we can look at how sample packing works and what the samples in the dataset look like. Sample packing concatenates several short samples into a single sequence of up to max_seq_len tokens, so that less of each training batch is wasted on padding. For illustrative purposes, we'll load the dataset twice: once without sample packing, and a second time with it.

In [4]:
from torchtune.datasets import alpaca_dataset, PackedDataset

# Instantiate the alpaca dataset but don't pack the samples
dataset_no_packing = alpaca_dataset(
    tokenizer=tokenizer,
    packed=False,
)

# Instantiate the alpaca dataset again and pack the samples
dataset_packed = alpaca_dataset(
    tokenizer=tokenizer,
    packed=True,
)

Packing dataset: 100%|██████████| 52002/52002 [01:19<00:00, 655.69it/s] 
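
The basic idea behind packing can be sketched in a few lines of plain Python: keep appending tokenized samples to the current pack until the next one would push it past the maximum sequence length, then start a new pack. This is a simplified illustration, not torchtune's implementation, and `pack_samples` is a name made up here.

```python
def pack_samples(samples: list[list[int]], max_seq_len: int) -> list[list[int]]:
    """Greedily concatenate samples into packs of at most max_seq_len tokens."""
    packs = []
    current = []
    for sample in samples:
        sample = sample[:max_seq_len]  # truncate anything longer than one pack
        if current and len(current) + len(sample) > max_seq_len:
            packs.append(current)      # current pack is full, start a new one
            current = []
        current = current + sample
    if current:
        packs.append(current)
    return packs


# Four toy "tokenized samples" of different lengths.
toy_samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = pack_samples(toy_samples, max_seq_len=6)
print(packs)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

Four short samples become two packs, so the model sees fewer, fuller sequences per epoch.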


The dataset consists of many samples. Each sample contains an instruction (prompt) and a response (what we hope the model will generate). During training, we will feed the model one sample at a time with the response obscured and see how well the model can guess the expected response. Each sample in this dataset follows the same instruct format:

<div class="alert alert-block alert-info">
Below is an instruction that describes a task. Write a response that appropriately completes the request.
    
<br>\### Instruction:
{instruction}

\### Response:
{response}
</div>
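
As a small illustration of how a raw instruction and response are turned into one training string, here is a sketch that fills in the template with Python's `str.format`. The preamble text is copied from the decoded samples printed in this section. `ALPACA_TEMPLATE` and `format_sample` are names made up for this example and are not part of torchtune.

```python
# Template text taken from the decoded samples shown in this section.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)


def format_sample(instruction: str, response: str) -> str:
    """Fill the instruct template with one instruction/response pair."""
    return ALPACA_TEMPLATE.format(instruction=instruction, response=response)


text = format_sample(
    "What are the three primary colors?",
    "The three primary colors are red, blue, and yellow.",
)
print(text)
```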


Let's print the first two samples in the dataset. Note that the dataset has already been encoded using the tokenizer, so we have to decode each sample to get the human readable text back.

In [5]:
# Print the first 2 samples from the unpacked dataset
for i, sample in enumerate(dataset_no_packing):
    
    # break after i=2
    if i >= 2:
        break
        
    # print current sample #
    print(f"Sample {i + 1}:")
    
    # pull out tokens from current sample
    tokens = sample['tokens']
    
    # decode tokens and print resulting text
    print(f"{tokenizer.decode(tokens)}\n\n")
    

Sample 1:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.


Sample 2:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are the three primary colors?

### Response:
The three primary colors are red, blue, and yellow.




Was anyone else bothered by the lack of a space character after the "1." in the first sample? I was. Now let's print just the first sample from the packed dataset. Notice how many prompts and responses have been packed into this single sample! The packed dataset object also carries additional tensors for the positional encoding and attention masking, which ensure that when this packed sample is sent to the model, each prompt and response in the pack is handled independently of the others.

In [6]:
import torch

# Print the first sample from the packed dataset
for i, sample in enumerate(dataset_packed):
    
    # break after i=1
    if i >= 1:
        break
    
    # print current sample number
    print(f"Sample {i + 1}:")
    
    # pull out tokens for this sample, convert tensor to list before decoding
    tokens = sample['tokens'].tolist()

    # decode tokens and print resulting text
    print(f"{tokenizer.decode(tokens)}")


Sample 1:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are the three primary colors?

### Response:
The three primary colors are red, blue, and yellow.Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Describe the structure of an atom.

### Response:
An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a n
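
To see what the extra position and masking information has to accomplish, here is a toy sketch for a pack containing two samples of lengths 3 and 2. Positions restart at 0 for each sample, and the attention mask only lets a token look at earlier tokens from its own sample. This illustrates the idea in plain Python and is not torchtune's implementation; the function names are made up.

```python
def packed_positions(seq_lens: list[int]) -> list[int]:
    """Position ids that restart at 0 for every sample inside one pack."""
    return [pos for n in seq_lens for pos in range(n)]


def packed_causal_mask(seq_lens: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    # Label each token with the index of the sample it came from.
    owner = [idx for idx, n in enumerate(seq_lens) for _ in range(n)]
    total = len(owner)
    return [[owner[i] == owner[j] and j <= i for j in range(total)] for i in range(total)]


seq_lens = [3, 2]  # a pack holding a 3-token sample followed by a 2-token sample
positions = packed_positions(seq_lens)
mask = packed_causal_mask(seq_lens)

print(positions)  # [0, 1, 2, 0, 1]
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed mask is block-diagonal: the second sample's tokens cannot see the first sample at all, which is what keeps packed samples from leaking into each other.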

Next, let's look at building a custom dataset. Here is an example dataset file in JSON format.

In [7]:
! cat ./config_files/custom_json_dataset.json

[
	{
		"prompt": "What time is it in London?",
		"response": "It is 10:00 AM in London."
	},
	{
		"prompt": "How many bananas?",
		"response": "All the bananas",
	}
]
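
Before handing a hand-written dataset file to a training run, it can save time to check that it parses and that every record has the fields you expect. The sketch below does this with Python's standard `json` module, using the `prompt` and `response` keys from the example file. `validate_records` is a helper name made up for this illustration.

```python
import json

REQUIRED_KEYS = {"prompt", "response"}


def validate_records(raw: str) -> list[dict]:
    """Parse a JSON array and check every record has the expected keys."""
    records = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(records, list):
        raise ValueError("top-level JSON value must be a list of records")
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return records


raw = json.dumps([
    {"prompt": "What time is it in London?", "response": "It is 10:00 AM in London."},
    {"prompt": "How many bananas?", "response": "All the bananas"},
])
records = validate_records(raw)
print(f"{len(records)} valid records")

# Strict JSON does not allow a trailing comma after the last field of an object.
trailing_comma_ok = True
try:
    validate_records('[{"prompt": "hi", "response": "hello",}]')
except json.JSONDecodeError:
    trailing_comma_ok = False
print("trailing comma accepted:", trailing_comma_ok)
```

In a notebook you would pass the contents of your own file, for example `validate_records(open(path).read())`.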


The following code shows how to load the data from our custom dataset file into a torchtune dataset object, which can then be used to train any model.

In [20]:
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import instruct_dataset

# Define our prompt template which follows the format
# 
# {<role>: (<prepended text>,<appended text>),...}
# 
# The roles our dataset will recognize by default are system, user, and assistant
custom_template = {
    "system": ("System: ", "\\n\\n"),
    "user": ("User: ", "\\n\\n"),
    "assistant": ("Assistant: ", "\\n\\n"),
}

# Initialize the gemma tokenizer with our template
tokenizer = gemma_tokenizer(
    path="/tmp/models/gemma-2b/tokenizer.model",
    prompt_template=custom_template)

# Load our dataset as an instruct_dataset
# Note that the new_system_prompt variable will define our system message for all samples.
# The column_map will map key values from our json data to the input and output labels.
# The tokenizer will then map input to the user role, and output to the assistant role
# when using our prompt template (yes it's confusing)
custom_dataset = instruct_dataset(
    tokenizer=tokenizer,
    source="json",
    data_files="./config_files/custom_json_dataset.json",
    split="train",
    
    # By default, user prompt is ignored in loss. Set to True to include it
    train_on_input=True,
    
    # System message to use for every sample
    new_system_prompt="You are an AI assistant. ",
    
    # Map the key 'prompt' from our dataset to 'input', and the key 'response' to 'output'
    column_map={"input": "prompt", "output": "response"},
)

# Let's decode the samples in the dataset and print them to see that the formatting worked
for i, sample in enumerate(custom_dataset):
    
    # print current sample #
    print(f"Sample {i + 1}:")
    
    # pull out the token values for this sample, and the masking labels
    tokens, labels = sample["tokens"], sample["labels"]
    print(tokenizer.decode(tokens))


Sample 1:
System:You are an AI assistant.\n\nUser:What time is it in London?\n\nAssistant:It is 10:00 AM in London.\n\n
Sample 2:
System:You are an AI assistant.\n\nUser:How many bananas?\n\nAssistant:All the bananas\n\n
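
The `{role: (prepended text, appended text)}` template format is easier to follow with a tiny stand-alone example. The sketch below wraps each message in the text configured for its role and joins the results. It is a plain-Python illustration of the idea and not how torchtune applies templates internally; `render` is a made-up helper. It also uses real newline characters (`"\n\n"`), whereas a doubly escaped `"\\n\\n"` produces the literal characters backslash-n, which is what shows up in the decoded output above.

```python
def render(messages: list[tuple[str, str]], template: dict[str, tuple[str, str]]) -> str:
    """Wrap each (role, content) message in that role's prepend/append text."""
    parts = []
    for role, content in messages:
        prepend, append = template[role]
        parts.append(f"{prepend}{content}{append}")
    return "".join(parts)


toy_template = {
    "system": ("System: ", "\n\n"),
    "user": ("User: ", "\n\n"),
    "assistant": ("Assistant: ", "\n\n"),
}

messages = [
    ("system", "You are an AI assistant."),
    ("user", "How many bananas?"),
    ("assistant", "All the bananas"),
]

rendered = render(messages, toy_template)
print(rendered)
```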


# 5. Model Training
We will be using the command line interface (CLI) of torchtune to run our model training. The basic syntax is

`! tune run <torchrun args> <recipe name> --config <configuration file> <optional config overrides>`

As an example, 

`! tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config gemma/2B_lora epochs=1`

This would run a LoRA fine-tune on 1 node distributed across 4 GPUs using the default gemma/2B_lora configuration, with a configuration override to train for only 1 epoch. We'll provide pre-customized configuration files below, but if you want to make your own, it's easiest to copy the default configuration file and modify it to your needs. To do so, you can run the command

`! tune cp gemma/2B_lora /path/to/save/clever_config_name.yaml`

Note that you can change *gemma/2B_lora* to whatever recipe or config you want to copy.  We have already prepared configuration files for you. Let's take a look at the contents of the LoRA training configuration file by running the following command:

In [21]:
! cat ./config_files/custom_train_gemma2-2b-lora.yaml

# Config for multi-device LoRA finetuning in lora_finetune_distributed.py
# using a gemma 2B model
#
# To launch on 4 devices, run the following command from root:
#   tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config ./config_files/custom_train_gemma2-2b-lora.yaml
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
#   tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config ./config_files/custom_train_gemma2-2b-lora.yaml  checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works only when the model is being fine-tuned on 2+ GPUs.


# Tokenizer
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/models/gemma-2b/tokenizer.model
  max_seq_len: 512

# Dataset
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: true
seed: null
shuffle: True


# Model Arguments
mode
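
Command-line overrides such as `checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>` use dots to reach into the nested structure of the YAML file. The sketch below shows the idea on a plain Python dictionary. It is a simplified illustration under our own assumptions: `apply_override` is a made-up helper, the values stay as strings, and torchtune's real config handling does more than this.

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one 'dotted.key=value' override to a nested dict, in place."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the nesting, creating intermediate sections if needed.
        node = node.setdefault(name, {})
    node[leaf] = value  # kept as a string here; a real parser would infer types


# A small stand-in for a parsed YAML config.
config = {
    "tokenizer": {"max_seq_len": 512},
    "dataset": {"packed": True},
}

for item in ["epochs=1", "dataset.split=train[:1%]", "checkpointer.output_dir=/tmp/out"]:
    apply_override(config, item)

print(config)
```

Each override only touches the key it names, so everything else in the file keeps its configured value.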

<h2><center>5.1 Training Visualization Setup - Weights and Biases</center></h2>
By default, torchtune configuration files are set to log training progress to a local file, but you can also have torchtune upload training statistics to a popular online visualizer called <a href="https://wandb.ai/site">Weights &amp; Biases</a>. You will need to make an account on their website to use the visualizer; the free tier is sufficient for what we will be doing here. You can also apply for a free academic license. Once you have your account, you can get a copy of your API key here: <a href="https://wandb.ai/authorize">https://wandb.ai/authorize</a>.

The code below will authorize the connection between this notebook session and your wandb account.  Paste your API key into the box when prompted and hit ENTER.

In [22]:
import wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: gjaffe (gjaffe-university-of-texas-at-ausitn) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

<h2><center>5.2 Run LoRA fine-tune</center></h2>
The command below will run the LoRA fine-tune using our custom configuration file.  Note that it takes ~1.5 hours to complete 1 epoch on the full dataset, so for illustrative purposes we have added in the override

`dataset.split="train[:1%]"`

which will run the training on just 1% of the dataset. The recipe will first load the model into memory, pack the dataset, fine-tune the model, and then save the new model weights to a file.  For convenience, we've already run the fine-tune and saved the resulting weights, so you can skip ahead to the generation section if you don't want to wait for this to finish.

In [23]:
! tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
--config ./config_files/custom_train_gemma2-2b-lora.yaml \
metric_logger._component_=torchtune.training.metric_logging.WandBLogger \
metric_logger.project=torchtune \
dataset.split="train[:1%]" \
checkpointer.output_dir=$WORK/gemma2-lora/


Running with torchrun...
Running LoRAFinetuneRecipeDistributed with resolved config:

batch_size: 2
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/models/gemma-2b/
  checkpoint_files:
  - model-00001-of-00002.safetensors
  - model-00002-of-00002.safetensors
  model_type: GEMMA
  output_dir: /work2/10156/gj3385/frontera/gemma2-lora/
  recipe_checkpoint: null
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: true
  split: train[:1%]
device: cuda
dtype: bf16
enable_activation_checkpointing: true
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  log_dir: /tmp/alpaca-gemma-lora
  project: torchtune
mo

The trained model is saved to a folder in your work directory, which we can view with the command below.

In [5]:
! ls $WORK/gemma2-lora/

adapter_0.pt	     adapter_model.bin	hf_model_0001_0.pt
adapter_config.json  config.json	hf_model_0002_0.pt


The adapter_0.pt file contains just the LoRA weights. The two files hf_model_0001_0.pt and hf_model_0002_0.pt contain the full model weights with the LoRA weights merged in. You'll notice there is no tokenizer model in this folder, so if we want to use this trained model we'll need to reference the tokenizer from the original base model folder.
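
If you are wondering what it means for LoRA weights to be combined with the full model weights, the standard LoRA merge adds the scaled product of the two low-rank factors to the frozen base weight: W' = W + (alpha / rank) * B A. The sketch below does this on tiny hand-written matrices in plain Python. The numbers, `alpha`, and the helper names are all invented for illustration, and this is not the code torchtune runs when saving a checkpoint.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha: float, rank: int):
    """Return W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2 x 3 frozen base weight
lora_a = [[1.0, 2.0, 3.0]]                   # rank (1) x 3
lora_b = [[0.5], [1.0]]                      # 2 x rank (1)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0, 3.0], [2.0, 5.0, 6.0]]
```

Once merged, the model has exactly the same shape as the base model, which is why the merged checkpoint can be loaded without any adapter code.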