<a href="https://colab.research.google.com/github/jayozer/ai_webinars/blob/main/Llama_3_Translation_Fine_tuning_Prompt_Engineering_Event.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Llama 3 for Translation

For our fine-tuning use-case, we'll be leveraging [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on a translation task in a particular style!



# Fine-tuning Example

We'll start, as we always do, with some dependencies!

In [None]:
!pip install -qU transformers peft trl accelerate bitsandbytes datasets


Next, let's set-up some data!

## Data

We'll be using a [dataset](https://huggingface.co/datasets/ai-maker-space/gen-z-translation) that pairs plain English sentences with their Gen-Z slang-laden English equivalents for this version of the model.

This dataset contains the following:

- English
- Gen-Z Language (still English)

We'll start by grabbing our dataset from Hugging Face!

In [None]:
! pip install huggingface-hub




In [None]:
from datasets import load_dataset

gen_z_dataset = load_dataset("ai-maker-space/gen-z-translation")

Let's see how many items we're working with in our dataset.

> NOTE: Keep in mind that this is a relatively simple example meant to showcase fine-tuning - in practice, we'd want to use somewhere in the neighbourhood of ~500-50,000 examples.

In [None]:
gen_z_dataset

DatasetDict({
    train: Dataset({
        features: ['English', 'Gen-Z'],
        num_rows: 105
    })
})

In [None]:
print(gen_z_dataset['train'][0:2])

{'English': ['That was really funny.', 'That looks really good.'], 'Gen-Z': ["I'm weak.", "That's bussin'."]}


Let's look at an example of an original English sentence and its Gen-Z translation!

In [None]:
print(f"English: {gen_z_dataset['train'][70]['English']}\n\nGen-Z: {gen_z_dataset['train'][70]['Gen-Z']}")

English: She's very good at manipulating people to get what she wants.

Gen-Z: She's got mad finesse, always getting her way.


Now, we mentioned earlier we were going to fine-tune meta-llama/Meta-Llama-3-8B-Instruct, which is important for our next step: Creating the instruction format.

Let's take a look at the instructions (so meta) for generating an instruction prompt from the [repository](https://github.com/meta-llama/llama3?tab=readme-ov-file#instruction-tuned-models):


> The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in [`ChatFormat`](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L202) needs to be followed: The prompt begins with a `<|begin_of_text|>` special token, after which one or more messages follow. Each message starts with the `<|start_header_id|>` tag, the role `system`, `user` or `assistant`, and the `<|end_header_id|>` tag. After a double newline "\n\n" the contents of the message follow. The end of each message is marked by the `<|eot_id|>` token.

Let's look at an example of how we might format our instruction - and then reproduce that in code.

```python
"""
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Gen-Z-ify<|eot_id|><|start_header_id|>user<|end_header_id|>

She's very good at manipulating people to get what she wants.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

She's got mad finesse, always getting her way.<|eot_id|>
"""
```

In [None]:
INSTRUCTION_PROMPT_TEMPLATE = """\
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Gen-Z-ify<|eot_id|><|start_header_id|>user<|end_header_id|>

{english}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

RESPONSE_TEMPLATE = """\
{gen_z_slang}<|eot_id|><|end_of_text|>"""

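The plain English sentence goes in through `{english}`, and the Gen-Z translation comes out through `{gen_z_slang}` in the response. Note that, unlike the format quoted above, this template puts no "\n\n" after the assistant header and appends `<|end_of_text|>` after the final `<|eot_id|>`.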
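
For comparison, here is a minimal sketch of a generic formatter that follows the quoted `ChatFormat` description literally, with the "\n\n" after every header, including the assistant one. The function name `format_llama3_chat` and the message dictionaries are illustrative assumptions and not part of the notebook or of any library.

```python
def format_llama3_chat(messages):
    """Render {"role", "content"} dicts in the Llama 3 chat layout (illustrative helper)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each message: header with the role, a blank line, the content, then <|eot_id|>.
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content']}<|eot_id|>"
    return prompt


example = format_llama3_chat([
    {"role": "system", "content": "Gen-Z-ify"},
    {"role": "user", "content": "That was really funny."},
    {"role": "assistant", "content": "I'm weak."},
])
print(example)
```

Whichever layout is used, the important thing is that training and inference prompts are built the same way.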

Now we can create a helper function that will convert our dataset row into the above prompt!

In [None]:
def create_instruction(sample, return_response=True):
  prompt = INSTRUCTION_PROMPT_TEMPLATE.format(
      english=sample["English"]
  )

  if return_response:
    prompt += RESPONSE_TEMPLATE.format(gen_z_slang=sample["Gen-Z"])

  return prompt

Let's try it out!

In [None]:
create_instruction(gen_z_dataset["train"][0])

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nGen-Z-ify<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThat was really funny.<|eot_id|><|start_header_id|>assistant<|end_header_id|>I'm weak.<|eot_id|><|end_of_text|>"

## Loading Our Model

Now we can move onto loading our model!

We're going to be dependent on two major technologies to allow us to train our model with <=16GB GPU RAM.

1. Quantization
2. LoRA

> NOTE: We've done some events on [LoRA](https://www.youtube.com/watch?v=kV8yXIUC5_4&list=PLrSHiQgy4VjGMzyXsSlvN-TjPaqFFsAGP&index=4) and [QLoRA](https://www.youtube.com/watch?v=XOb-djcw6hs&list=PLrSHiQgy4VjGMzyXsSlvN-TjPaqFFsAGP&index=5) for deeper dives into those respective technologies.
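
To get a rough feel for why quantization matters for the memory budget, here is a back-of-the-envelope sketch. It assumes roughly 8 billion weights and counts only the weights themselves; quantization constants, activations, LoRA adapters, and optimizer state are ignored, so real usage will be higher.

```python
ASSUMED_PARAMS = 8e9  # assumption: roughly 8 billion weights in the model


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(ASSUMED_PARAMS, 16)  # half precision
nf4_gb = weight_memory_gb(ASSUMED_PARAMS, 4)    # 4-bit quantized

print(f"fp16 weights:  ~{fp16_gb:.0f} GB")
print(f"4-bit weights: ~{nf4_gb:.0f} GB")
```

Under these assumptions the half-precision weights alone would already fill a 16GB card, while the 4-bit weights leave room for everything else training needs.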

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)


We'll load our tokenizer and do a few pre-processing steps to prepare it for training.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"



In [None]:
gen_z_dataset["train"][75]["English"]

'They are in a very complicated romantic relationship.'

In [None]:
gen_z_dataset["train"][75]["Gen-Z"] # expected output!!! directly from train

"They're in a whole situationship, it's messy."

In [None]:
from transformers import pipeline

base_model_pipe = pipeline("text-generation", model, tokenizer=tokenizer, max_new_tokens=256, return_full_text=False)

In [None]:
outputs = base_model_pipe(create_instruction(gen_z_dataset["train"][75], return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

In [None]:
outputs[0]["generated_text"]

'\n\nLet\'s Gen-Z-ify their complicated romantic relationship!\n\nImagine a relationship where the lines are blurred, and the rules are made up as they go along. It\'s like trying to navigate a TikTok algorithm, where the algorithm is their emotions.\n\nThey\'re stuck in a cycle of "are we or aren\'t we?" - a never-ending debate that\'s as exhausting as trying to keep up with the latest trends.\n\nOne minute, they\'re low-key, and the next, they\'re high-key. It\'s like they\'re playing a game of "Would You Rather," where the stakes are their hearts.\n\nThey\'re constantly wondering if they\'re just "frenemies" or if they\'re actually "soulmates." It\'s like trying to solve a puzzle, where the pieces keep shifting.\n\nTheir relationship is a messy, beautiful, confusing, and thrilling ride. It\'s like trying to keep up with a Kardashian\'s Instagram stories - you\'re never quite sure what\'s real and what\'s just for show.\n\nBut despite the drama, they\'re addicted to each other. It\'s

In [None]:
gen_z_dataset["train"][2]["English"]

'She looks very attractive.'

In [None]:
outputs = base_model_pipe(create_instruction(gen_z_dataset["train"][2], return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

In [None]:
outputs[0]["generated_text"]

"\n\nYou're saying she's got that Gen-Z-ified glow going on!"

So, as shown above, the base model's Gen-Z output is not that great: it rambles and pulls in a lot of unrelated material from its broader training. What we need is for it to provide short, direct translations. We want to make it tighter.

Now we can set up our LoRA configuration - which will let the TRL library know how to create our LoRA adapters!

In [None]:
from peft import LoraConfig

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
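
To see what `r=32` and `lora_alpha=64` mean in numbers, here is a rough parameter count. LoRA adds two small matrices per targeted linear layer, `A` (`r x d_in`) and `B` (`d_out x r`), and scales their product by `lora_alpha / r`. The layer shapes and block count below are assumptions about the Llama 3 8B architecture, and the sketch assumes `"all-linear"` covers the attention and MLP projections in every block but not the output head.

```python
R = 32
LORA_ALPHA = 64

# Assumed (d_in, d_out) shapes of the linear layers in one transformer block.
ASSUMED_LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
ASSUMED_NUM_BLOCKS = 32  # assumption: number of transformer blocks


def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


per_block = sum(lora_params(d_in, d_out, R) for d_in, d_out in ASSUMED_LAYER_SHAPES.values())
total = per_block * ASSUMED_NUM_BLOCKS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"LoRA parameters per block: {per_block:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Update scaling factor:     {scaling}")
```

Under these assumptions the adapters hold about 84 million trainable parameters, roughly 1% of the base model's weights.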

### Fine-tuning!

Now onto the star of today's show: Fine-tuning!

We're going to use the `SFTTrainer` or "Supervised Fine-tuning Trainer" from the [TRL](https://github.com/huggingface/trl/tree/main) library today.

In essence, this is a trainer that will handle most of the fine-tuning process for us - including but not limited to:

- Dataset packing
- LoRA initialization
- Tokenizing

Let's set up some training hyper-parameters through the transformers `TrainingArguments` class to get started. Here's a breakdown of which parameters are doing what:

- `output_dir` - indicates where we store the results of training locally
- `num_train_epochs` - how many epochs we will train for (somewhere ~3-4 is a good place to start)
- `per_device_train_batch_size` - how many samples go into each batch on each device. In this case, we only have one device - but we set this to a low value to keep memory consumption below 16GB GPU RAM
- `gradient_accumulation_steps` - how many steps we accumulate gradients for before updating the weights. This is a way to "cheat out" a larger batch size without extra memory consumption
- `gradient_checkpointing` - this lets us [trade off memory consumption for increased training time](https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing)
- `optim` - our optimizer! In this case, we're using a fused AdamW optimizer. The fused implementation is a faster version of AdamW but is reliant on CUDA (GPU). More information can be read [here](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html)
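
To make the gradient accumulation idea concrete, here is a tiny pure-Python sketch. It uses a made-up one-parameter model `y = w * x` with mean squared error, and shows that averaging the gradients of two micro-batches of 2 gives the same gradient as one batch of 4. The model, data, and helper name `mse_grad` are illustrative assumptions only.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.5
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# One large batch of 4 samples.
full_batch_grad = mse_grad(w, data)

# Two micro-batches of 2 samples, accumulated before a single weight update.
micro_batch_size = 2
accumulation_steps = 2
accumulated_grad = 0.0
for step in range(accumulation_steps):
    micro_batch = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Divide by the number of steps so the sum is an average over micro-batches.
    accumulated_grad += mse_grad(w, micro_batch) / accumulation_steps

effective_batch_size = micro_batch_size * accumulation_steps
print(full_batch_grad, accumulated_grad, effective_batch_size)
```

Only one micro-batch has to fit in memory at a time, which is why the larger effective batch size comes at the cost of extra steps rather than extra memory.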

The rest of the hyper-parameters are taken directly from the QLoRA [paper](https://arxiv.org/abs/2305.14314) and are discussed in more detail there!

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gen-z-translate-llama-3-instruct-v1",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    save_strategy="epoch",
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    push_to_hub=True,
)

Because we're going to automatically push our model to the hub, thanks to `push_to_hub=True`, we'll want to provide a Hugging Face Write token.

> NOTE: You can skip this step by commenting out `push_to_hub=True`

Now, finally, we can set up our `SFTTrainer`, which is going to help us fine-tune this model on the dataset we loaded at the beginning of the notebook!

We'll discuss a few parameters to clarify what they're doing:

- `formatting_func` - since we created a helper function to convert our dataset row into a Llama 3-style instruction prompt, we need to let TRL know to use it to create our prompts!
- `peft_config` - TRL will automatically wrap our model with LoRA adapters using this config.
- `packing` - this will let our model "pack" the context window to ensure we're not wasting precious memory on padding tokens where possible
- `dataset_kwargs` - because we already added the special tokens to our prompts, we want to ensure we don't "re-add" them!
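
The packing idea can be illustrated with a simplified sketch: concatenate the tokenized examples into one stream and cut it into fixed-length blocks, so no slots are spent on padding. This is not TRL's actual implementation; the helper name `pack_sequences`, the toy token ids, and the choice to drop the incomplete final block are assumptions made for illustration.

```python
def pack_sequences(tokenized_examples, max_seq_length):
    """Concatenate token-id lists and split them into full blocks of max_seq_length."""
    stream = [token for example in tokenized_examples for token in example]
    num_full_blocks = len(stream) // max_seq_length
    # Assumption of this sketch: a trailing block shorter than max_seq_length is dropped.
    return [
        stream[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(num_full_blocks)
    ]


toy_examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_sequences(toy_examples, max_seq_length=4)

# Padding every example to length 4 instead would use 4 * 4 = 16 slots for 10 real tokens.
padded_slots = len(toy_examples) * 4
real_tokens = sum(len(example) for example in toy_examples)
print(packed, padded_slots - real_tokens)
```

With short examples like the ones in this dataset, many of them fit into a single 2048-token sequence, so the number of training sequences is much smaller than the number of rows.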

With those parameters set - we're clear for training!

In [None]:
from trl import SFTTrainer

max_seq_length=2048

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=gen_z_dataset["train"],
    formatting_func=create_instruction,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens" : False,
        "append_concat_token" : False,
    }
)




All that's left to do is fine-tune our model!

In [None]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.






TrainOutput(global_step=10, training_loss=0.4649935245513916, metrics={'train_runtime': 64.331, 'train_samples_per_second': 0.155, 'train_steps_per_second': 0.155, 'total_flos': 932513065205760.0, 'train_loss': 0.4649935245513916, 'epoch': 10.0})

Now we can save it.

In [None]:
trainer.save_model()




Let's clear up memory so we can do inference while staying under our memory budget.

In [None]:
del model
del trainer
torch.cuda.empty_cache()

We'll need to load our model back as a PEFT model, due to the use of LoRA, and then merge the LoRA layers back into the original model for use in Hugging Face pipelines.
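
Conceptually, merging folds each low-rank update into the weight it was attached to: `W_merged = W + (lora_alpha / r) * B @ A`. The sketch below shows that arithmetic on tiny hand-written matrices in pure Python. The helper names and values are illustrative assumptions and do not reflect how PEFT implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A for a single linear layer."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w + scaling * d for w, d in zip(weight_row, delta_row)]
        for weight_row, delta_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # toy 2 x 2 base weight
lora_a = [[1.0, 2.0]]              # A: r x d_in with r = 1
lora_b = [[0.5], [-1.0]]           # B: d_out x r
merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```

After merging, the layer is an ordinary dense weight again, which is why the merged model can be used without the adapter machinery. The cell below performs the real merge with PEFT.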

In [None]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)

merged_model = model.merge_and_unload()



OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.17 GiB of which 106.88 MiB is free. Process 3647 has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 55.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
merged_model.push_to_hub("ai-maker-space/gen-z-translate-llama-3-instruct-v1")


CommitInfo(commit_url='https://huggingface.co/ai-maker-space/gen-z-translate-llama-3-instruct-v1/commit/c77d56682070d79868b80de51bdba97c08b5c873', commit_message='Upload LlamaForCausalLM', commit_description='', oid='c77d56682070d79868b80de51bdba97c08b5c873', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub("ai-maker-space/gen-z-translate-llama-3-instruct-v1")


CommitInfo(commit_url='https://huggingface.co/ai-maker-space/gen-z-translate-llama-3-instruct-v1/commit/ad6e1f14b5c33990614bd09ba9efd238b1be9e7f', commit_message='Upload tokenizer', commit_description='', oid='ad6e1f14b5c33990614bd09ba9efd238b1be9e7f', pr_url=None, pr_revision=None, pr_num=None)

Now we can load our pipeline for `text-generation`.

In [None]:
from transformers import pipeline

gen_z_pipe = pipeline("text-generation", merged_model, tokenizer=tokenizer, max_new_tokens=256, return_full_text=False)

## Testing Fine-tuned Model

Now that we've fine-tuned, let's see how we did!

In [None]:
plato_quote_one = "The greatest wealth is to live content with little."

In [None]:
outputs = gen_z_pipe(create_instruction({"English" : plato_quote_one}, return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

In [None]:
outputs[0]["generated_text"].split("\n")[-1]

'"The most valuable flex is being low-key rich in your own mind."'

Another example!

In [None]:
feynman_quote_one = "I was born not knowing and have had only a little time to change that here and there."

In [None]:
outputs = gen_z_pipe(create_instruction({"English" : feynman_quote_one}, return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

In [None]:
outputs[0]["generated_text"].split("\n")[-1]

"I was a noob, and I've only had a hot sec to upgrade that."

Overall, our fine-tuning did a great job and allowed our model to generate our desired output - all with <16GB GPU memory, and 10 epochs of fine-tuning!