## Llama Demo Notebook

This notebook shows how to fine-tune a Llama 2 7B model on a single 32 GB V100 GPU using LoRA and quantization, and a Llama 3 8B model on two 32 GB V100 GPUs using LoRA and FSDP. The examples are modified versions of those in llama-recipes.

### Step 0: Install prerequisites and convert the checkpoint

The example uses the Hugging Face trainer and model, which means that the checkpoint has to be converted from its original format into the dedicated Hugging Face format.
The conversion can be done by running the `convert_llama_weights_to_hf.py` script provided with the `transformers` package.
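
As a rough sketch, the conversion command can be assembled in Python as shown below. The flag names (`--input_dir`, `--model_size`, `--output_dir`) follow the Hugging Face Llama documentation but may differ between `transformers` versions, so check the script's `--help` first. The helper name `build_conversion_command` and the input path `/your_project_dir/llama-2-7b` are placeholders introduced here for illustration.

```python
import shlex
import sys


def build_conversion_command(input_dir, output_dir, model_size="7B"):
    """Return the argv list that invokes the Llama conversion script as a module."""
    return [
        sys.executable, "-m", "transformers.models.llama.convert_llama_weights_to_hf",
        "--input_dir", input_dir,      # folder with the original Meta checkpoint (placeholder)
        "--model_size", model_size,    # assumed flag; verify with --help for your version
        "--output_dir", output_dir,    # folder that will hold the Hugging Face checkpoint
    ]


cmd = build_conversion_command(
    "/your_project_dir/llama-2-7b",      # hypothetical input directory
    "/your_project_dir/llama-2-7b-hf",   # directory used as model_id in Step 1
)
print(shlex.join(cmd))
# To actually run the conversion: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems when the project paths contain spaces.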

In [None]:
%%bash
pip install llama-recipes transformers datasets accelerate wandb sentencepiece protobuf==3.20 py7zr scipy peft bitsandbytes fire torch_tb_profiler

### Step 1: Load the model

Point `model_id` to the model weight folder. Make sure you replace "/your_project_dir/llama-2-7b-hf" with the correct Bridges-2 directory.

In [1]:
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id="/your_project_dir/llama-2-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_id)

model = LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto', torch_dtype=torch.float16)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
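
The deprecation warning above suggests passing a `BitsAndBytesConfig` object through `quantization_config` instead of `load_in_8bit=True`. A minimal sketch of that variant follows, assuming a `transformers` version that provides `BitsAndBytesConfig` and an installed `bitsandbytes`. The helper name `load_llama_8bit` is ours, not part of any library.

```python
def load_llama_8bit(model_dir):
    """Load a Llama checkpoint in 8-bit via quantization_config (sketch, not run here)."""
    # Imports live inside the function so that defining it does not require a GPU.
    import torch
    from transformers import BitsAndBytesConfig, LlamaForCausalLM

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    return LlamaForCausalLM.from_pretrained(
        model_dir,
        quantization_config=quant_config,  # replaces the deprecated load_in_8bit=True
        device_map="auto",
        torch_dtype=torch.float16,
    )


# Usage, with the same placeholder path as above:
# model = load_llama_8bit("/your_project_dir/llama-2-7b-hf")
```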

### Step 2: Load the preprocessed dataset

We load and preprocess the samsum dataset, which consists of curated pairs of dialogs and their summaries:

In [2]:
from llama_recipes.utils.dataset_utils import get_preprocessed_dataset
from llama_recipes.configs.datasets import samsum_dataset
from llama_recipes.data.concatenator import ConcatDataset

train_dataset = get_preprocessed_dataset(tokenizer, samsum_dataset, 'train')
train_dataset = ConcatDataset(train_dataset, chunk_size=4096)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
Preprocessing dataset: 100%|██████████| 14732/14732 [00:07<00:00, 2103.55it/s]
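
To illustrate what wrapping the dataset with `chunk_size=4096` does conceptually, here is a simplified packing sketch. It is not the llama-recipes implementation: it works on plain token lists only, and the function name `pack_into_chunks` is hypothetical. The idea is that short tokenized samples are concatenated and cut into fixed-length chunks, so every training sequence fills the full context length.

```python
def pack_into_chunks(samples, chunk_size):
    """Concatenate token lists and cut them into chunks of exactly chunk_size tokens.

    A trailing remainder shorter than chunk_size is dropped in this sketch.
    """
    buffer = []
    chunks = []
    for tokens in samples:
        buffer.extend(tokens)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunk_size:
            chunks.append(buffer[:chunk_size])
            buffer = buffer[chunk_size:]
    return chunks


# Toy token ids; a tiny chunk size keeps the result readable.
toy_samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
chunks = pack_into_chunks(toy_samples, chunk_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Packing avoids spending compute on padding tokens, at the cost of sample boundaries falling inside a chunk.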


### Step 3: Check base model

Run the base model on an example input:

In [3]:
eval_prompt = """
Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))
---
Summary:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100,pad_token_id=tokenizer.eos_token_id)[0], skip_special_tokens=True))




Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))
---
Summary:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?
B: What do you want to do?
A: I want to get a pu

We can see that the base model only repeats the conversation.
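
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, which is why the printed output starts with the full prompt. The sketch below separates the two parts so that only the continuation is inspected. It uses plain lists as stand-ins for token tensors, and the helper name `split_generation` is ours.

```python
def split_generation(output_ids, prompt_length):
    """Split one generated sequence into (prompt tokens, newly generated tokens)."""
    return output_ids[:prompt_length], output_ids[prompt_length:]


# Toy token ids standing in for model.generate(...)[0].
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43]
_, new_ids = split_generation(output_ids, len(prompt_ids))
print(new_ids)  # [42, 43]
```

With the real tensors, the prompt length would be `model_input["input_ids"].shape[1]`, and the continuation could then be passed to `tokenizer.decode(..., skip_special_tokens=True)`.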

### Step 4: Prepare model for PEFT

Let's prepare the model for Parameter-Efficient Fine-Tuning (PEFT):

In [4]:
model.train()

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_kbit_training,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules = ["q_proj", "v_proj"]
    )

    # prepare int-8 model for training
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model, peft_config

# create peft config
model, lora_config = create_peft_config(model)



trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
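
The trainable parameter count can be checked by hand. Each LoRA adapter adds two low-rank matrices: A with shape (r, in_features) and B with shape (out_features, r). The sketch below assumes the Llama 2 7B architecture values (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` projections); these are not printed anywhere in the notebook, so treat them as stated assumptions.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


hidden_size = 4096       # assumption: Llama 2 7B hidden size
num_layers = 32          # assumption: Llama 2 7B decoder layers
rank = 8                 # r from the LoraConfig above
targets_per_layer = 2    # q_proj and v_proj

trainable = num_layers * targets_per_layer * lora_param_count(hidden_size, hidden_size, rank)
total = 6_742_609_920    # "all params" as printed by print_trainable_parameters()

print(f"{trainable:,}")                   # 4,194,304
print(f"{100 * trainable / total:.4f}%")  # 0.0622%
```

Both numbers agree with the output of `print_trainable_parameters()` above.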


### Step 5: Fine-tune the model

Here, we fine-tune the model for a single epoch.

In [None]:
from transformers import default_data_collator, Trainer, TrainingArguments

output_dir = "tmp/llama2-output"
config = {
    'lora_config': lora_config,
    'learning_rate': 1e-4,
    'num_train_epochs': 1,
    'gradient_accumulation_steps': 2,
    'per_device_train_batch_size': 2,
    'gradient_checkpointing': False,
}
# Define training args
training_args = TrainingArguments(
    report_to="wandb",
    run_name="llama-demo",
    output_dir=output_dir,
    overwrite_output_dir=True,
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="no",
    optim="adamw_torch_fused",
    max_steps= -1,
    **{k:v for k,v in config.items() if k != 'lora_config'}
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
)

# Start training
trainer.train()

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[34m[1mwandb[0m: Currently logged in as: [33mmeiyu-physics[0m ([33mmeiyuw[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.17.0
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/ocean/projects/pscstaff/mwang7/bridges2-llm-examples/Llama/wandb/run-20240524_102702-jmwuo6v0[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mllama-demo[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/meiyuw/huggingface[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/meiyuw/huggingface/runs/jmwuo6v0[0m
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
1,1.432
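
The number of logged steps depends on the effective batch size, which is the product of `per_device_train_batch_size`, `gradient_accumulation_steps`, and the number of GPUs. The sketch below works this out for the values in the training cell. The step estimate is only approximate, since the Trainer's exact rounding may differ, and the dataset size of 1000 chunks is a made-up number for illustration.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_optimizer_steps(num_samples, per_device_batch, grad_accum_steps,
                           num_devices=1, epochs=1):
    """Rough count of optimizer updates; the Trainer may round differently."""
    per_step = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return epochs * math.ceil(num_samples / per_step)


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=2)
steps = approx_optimizer_steps(num_samples=1000, per_device_batch=2, grad_accum_steps=2)
print(batch)  # 4
print(steps)  # 250, for a hypothetical dataset of 1000 packed chunks
```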


### Step 6: Save the model checkpoint

In [8]:
model.save_pretrained(output_dir)
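
Calling `save_pretrained` on a PEFT model writes only the LoRA adapter weights and their configuration, not the full base model. To use the result later, the adapter has to be attached to a freshly loaded base model. A minimal sketch, assuming the `peft` package is installed; the helper name `attach_lora_adapter` is ours.

```python
def attach_lora_adapter(base_model, adapter_dir):
    """Attach a saved LoRA adapter to an already loaded base model (sketch, not run here)."""
    # Imported lazily so that defining the helper does not require peft to be loaded.
    from peft import PeftModel

    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()  # switch to inference mode
    return model


# Usage, after loading the base model as in Step 1:
# model = attach_lora_adapter(model, "tmp/llama2-output")
```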

### Step 7: Test the fine-tuned model

Try the fine-tuned model on the same example again to see the learning progress:

In [9]:
model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))



Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))
---
Summary:
A wants to get a puppy for his son. He took him to the animal shelter last Monday. He showed him one that he really liked. A will name it after his dead hamster - Lemmy.


### Additional: Fine-tune the Llama 2 7B model with the llama-recipes module

Here, we fine-tune the Llama 2 7B model for a single epoch with one V100 32GB GPU.

In [None]:
!python3 -m llama_recipes.finetuning --use_peft --peft_method lora --num_epochs 1 --use_fp16 --use_wandb --quantization --model_name /your_project_dir/llama-2-7b-hf --batch_size_training 1 --output_dir /your_project_dir/llama-2-7b-hf-finetuned

[34m[1mwandb[0m: Currently logged in as: [33mmeiyu-physics[0m ([33mmeiyuw[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.17.0
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/ocean/projects/pscstaff/mwang7/llama-demo/wandb/run-20240520_223735-6qb9u7lv[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mhappy-bee-9[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/meiyuw/llama_recipes[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/meiyuw/llama_recipes/runs/6qb9u7lv[0m
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:11<00:00,  5.69s/it]
--> Model /ocean/projects/pscstaff/mwang7/llama/llama-2-7b-hf

--> /ocean/pro

### Distributed training with FSDP for Llama 3 8B using the llama-recipes module

Here, we fine-tune the Llama 3 8B model on two V100 32GB GPUs with the command in the next cell.
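
To get a feel for why sharding helps, the sketch below estimates the memory needed for the model weights alone, with and without splitting them evenly across two GPUs. It assumes full sharding of the parameters and ignores activations, gradients, and optimizer state, so real usage is higher. The parameter count is the one reported in the log output of this run; the helper name `per_gpu_weight_gib` is ours.

```python
def per_gpu_weight_gib(num_params, bytes_per_param, world_size):
    """Weight memory per GPU in GiB when parameters are split evenly across ranks."""
    return num_params * bytes_per_param / world_size / 1024**3


num_params = 8_030_261_248  # "8030.261248 Million params" from the run log

for bytes_per_param, label in [(2, "16-bit"), (4, "32-bit")]:
    single = per_gpu_weight_gib(num_params, bytes_per_param, world_size=1)
    sharded = per_gpu_weight_gib(num_params, bytes_per_param, world_size=2)
    print(f"{label}: {single:.1f} GiB unsharded, {sharded:.1f} GiB per GPU on 2 GPUs")
```

In 32-bit precision the unsharded weights alone come close to the 32 GB of a single V100, which leaves little room for anything else.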

In [None]:
!OMP_NUM_THREADS=4 torchrun --nnodes 1 --nproc_per_node 2  -m llama_recipes.finetuning --enable_fsdp --use_peft --peft_method lora --model_name /your_project_dir/llama-3-8b-hf/ --batch_size_training 1 --output_dir /your_project_dir/llama-3-8b-hf-finetuned 

Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|██████████████████| 4/4 [02:51<00:00, 42.94s/it]
Loading checkpoint shards: 100%|██████████████████| 4/4 [02:51<00:00, 42.96s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--> Model /ocean/projects/pscstaff/mwang7/llama3/Meta-Llama-3-8B_hf

--> /ocean/projects/pscstaff/mwang7/llama3/Meta-Llama-3-8B_hf has 8030.261248 Million params

trainable params: 1,703,936 || all params: 8,031,965,184 || trainable%: 0.021214434586871833
trainable params: 1,703,936 || all params: 8,031,965,184 || trainable%: 0.021214434586871833
bFloat16 enabled for mixed precision - using bfSixteen policy
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
You can avoid this