# FINE TUNE LLAMA3.1 8B

**LINKS**

All these settings are based on the following Hugging Face notebook:

https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing#scrollTo=Ybeyl20n3dYH

In [5]:
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from transformers import AutoModelForCausalLM,AutoTokenizer,  BitsAndBytesConfig
from datetime import datetime
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
import torch

In [6]:
import gc
import torch
gc.collect()
gc.collect()
torch.cuda.empty_cache()

# MODEL,TOKENIZER AND PEFT SETTINGS

#### How much memory do you need for training 3B and 7B models?

In full precision (float32), every parameter of the model is stored in 32 bits, or 4 bytes. Hence 4 bytes / parameter * 7 billion parameters = 28 billion bytes = 28 GB of GPU memory required, for inference only. In half precision, each parameter is stored in 16 bits, or 2 bytes, so you would need 14 GB for inference. There are also 8-bit and 4-bit quantization schemes; with 4 bits (half a byte) per parameter you would need 3.5 GB of memory for inference. There is usually some additional overhead as you generate tokens; see this blog post: [Calculating GPU memory for serving LLMs | Substratus.AI](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm)
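
The arithmetic above can be reproduced with a few lines of Python. This is only a rough sketch: the helper name `weight_memory_gb` and the `BYTES_PER_PARAM` table are made up for illustration, 1 GB is taken as 1e9 bytes, and the estimate covers the weights only (no activations, KV cache, or framework overhead).

```python
# Weight-only memory estimate: bytes per parameter times parameter count.
# Hypothetical helper for illustration; ignores activations and KV cache.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"7B model in {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

Swapping `7e9` for `3e9` gives the corresponding figures for a 3B model.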

For training, it depends on the optimizer you use and whether you use full fine-tuning vs. PEFT (e.g. QLoRA).

With regular AdamW, the optimizer state alone takes 8 bytes per parameter, because the optimizer keeps two float32 values per parameter (running estimates of the first and second moments of the gradient) on top of the weights and gradients themselves. Hence, for a 7B model the optimizer state needs 8 bytes per parameter * 7 billion parameters = 56 GB of GPU memory. With AdaFactor it is roughly 4 bytes per parameter, or 28 GB of GPU memory. With the optimizers of bitsandbytes (like 8-bit AdamW), it is 2 bytes per parameter, or 14 GB of GPU memory.
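
The per-optimizer figures quoted above follow the same multiplication. The snippet below is a minimal sketch, assuming the bytes-per-parameter values from the text; the dictionary and function names are hypothetical and the result does not include weights, gradients, or activations.

```python
# Per-parameter optimizer cost, using the figures quoted in the text.
# Names are hypothetical; 1 GB is taken as 1e9 bytes.
OPTIMIZER_BYTES_PER_PARAM = {"adamw": 8, "adafactor": 4, "adamw_8bit": 2}


def optimizer_memory_gb(n_params: float, optimizer: str) -> float:
    """Return the optimizer memory for n_params parameters, in GB."""
    return n_params * OPTIMIZER_BYTES_PER_PARAM[optimizer] / 1e9


for name in OPTIMIZER_BYTES_PER_PARAM:
    print(f"{name}: {optimizer_memory_gb(7e9, name):.0f} GB for a 7B model")
```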

In case you use parameter-efficient methods like QLoRA, memory requirements are greatly reduced: [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes). Basically, one quantizes the base model to 8 or 4 bits and then trains adapters on top in float16.

I highly recommend this guide: [Methods and tools for efficient training on a single GPU](https://huggingface.co/docs/transformers/perf_train_gpu_one#anatomy-of-models-memory) which goes over all of this in much more detail.

More info in this [HF Thread](https://discuss.huggingface.co/t/llama-7b-gpu-memory-requirement/34323/8)

In [7]:
base_model_id = "numind/NuExtract"
dataset_name ="nymiz/nymiz-dataset-rel-pjcr-es-x"
hub_model_id="apolo/nymiz-lora-phi_3_mini_4k-pjcr-es-by_person"

#### FLASH ATTENTION IMPLEMENTATION

For the moment, Tesla V100 GPUs are not supported by the FlashAttention implementation.
[Git issue](https://github.com/Dao-AILab/flash-attention/issues/148)

In [8]:
# Place the whole model on GPU 0 when CUDA is available; None keeps the default placement.
device_map = {"": 0} if torch.cuda.is_available() else None

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    device = torch.device('cuda')
    gpu_name = torch.cuda.get_device_name(device)
    major, minor = torch.cuda.get_device_capability(device)
    if major == 8:
        # Compute capability 8.x: bf16 and FlashAttention-2 are both usable.
        print(f'GPU ({gpu_name}) is Ampere.')
        compute_dtype = torch.bfloat16
        attn_implementation = 'flash_attention_2'
    else:
        print('GPU is not Ampere.')
        compute_dtype = torch.float16
        attn_implementation = 'eager'
else:
    # No CUDA or no bf16 support (e.g. Tesla V100): fall back to fp16 and eager attention.
    print('GPU is not Ampere.')
    compute_dtype = torch.float16
    attn_implementation = 'eager'

# Report which attention implementation was selected.
print(f"The flash implementation is : {attn_implementation}")

GPU is not Ampere.
The flash implementation is : eager


#### SETTING TOKENIZER 

In [9]:
tokenizer = AutoTokenizer.from_pretrained(base_model_id,use_fast=True,trust_remote_code=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


#### *How does rank affect model performance?*

A higher rank means more trainable parameters, which makes fine-tuning more memory intensive. In exchange, the low-rank matrices can represent a richer update to the original weight matrix W, so the adapted model becomes more expressive. As the rank increases, LoRA essentially converges toward full fine-tuning.

#### *How does alpha affect model performance?*

In LoRA, the adapter update is scaled by alpha / rank before it is added to the output of the frozen layer. A higher “alpha” therefore gives the learned low-rank update more weight, while a lower “alpha” reduces its influence and makes the model rely more on the original parameters. Adjusting “alpha” controls how strongly fine-tuning can move the model away from its pretrained behavior.

As a rule of thumb, it is common to choose an alpha that is twice the rank when fine-tuning LLMs.
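
To make the effect of rank and alpha concrete, here is a small sketch. It assumes the standard LoRA formulation, where the update to a `d_out x d_in` weight is the product of two matrices `B` (`d_out x r`) and `A` (`r x d_in`), scaled by `alpha / r`. The helper names and the 4096 x 4096 layer size are hypothetical values chosen for illustration, not the dimensions of this model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


d_in = d_out = 4096  # hypothetical layer size, for illustration only
full_params = d_in * d_out

for rank in (8, 64, 1024):
    added = lora_param_count(d_in, d_out, rank)
    share = 100 * added / full_params
    scale = lora_scaling(alpha=2 * rank, rank=rank)
    print(f"rank={rank}: {added} adapter params ({share:.2f}% of the layer), scaling={scale}")
```

With alpha set to twice the rank, the scaling factor stays at 2.0 regardless of the rank, which is one reason this rule of thumb is convenient when sweeping over ranks.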

In [10]:
# Use PEFT
use_peft=True

# Lora settings
lora_rank = 1024 # Latest good value for small version : 1024 
lora_alpha = 2048 # Latest good value for small version: 2048
target_modules = ["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"]
lora_bias="none"
lora_task_type = "CAUSAL_LM"

# QLora settings
load_in_4bit=True
bnb_4bit_quant_type = "nf4"
bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
bnb_4bit_use_double_quant = True

In [11]:
if use_peft:

    # QLORA settings
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,
        bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
        bnb_4bit_quant_type=bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=bnb_4bit_compute_dtype
    )
    
    model = AutoModelForCausalLM.from_pretrained(
          base_model_id, quantization_config=bnb_config, trust_remote_code=True, device_map=device_map,
          attn_implementation=attn_implementation,temperature=0,torch_dtype=compute_dtype)
    
    
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    # LORA settings
    def print_trainable_parameters(model):
        """
        Prints the number of trainable parameters in the model.
        """
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            all_param += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()
        print(
            f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
        )
    peft_config = LoraConfig(
        r=lora_rank,
        lora_alpha=lora_alpha,
        target_modules=target_modules,
        bias=lora_bias,
        task_type=lora_task_type,
    )
    model = get_peft_model(model, peft_config)
    print_trainable_parameters(model)

else:
    model = AutoModelForCausalLM.from_pretrained(base_model_id,torch_dtype=compute_dtype, trust_remote_code=True, device_map=device_map,attn_implementation=attn_implementation,temperature=0)

Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.27it/s]


trainable params: 570425344 || all params: 2579565568 || trainable%: 22.11323298295785


# PREPARE DATASET

In [12]:
nunextract_template="""<|input|>\n### Template:\n{}\n### Text:\n{}\n<|output|>\n{}"""
json_template="""{"people":{"name":"","title":[],"role":[],"profession":[]}}"""


In [13]:
import json

EOS_TOKEN = tokenizer.eos_token


def formatting_prompts_func(examples):
  schema_template = json.dumps(json.loads(json_template), indent=4)
  texts = examples["text"]
  outputs = examples["people"]
  example_prompts = []
  for text,output in zip(texts,outputs):
    json_output = """{{'people': {}}}""".format(output).replace("'",'"')
    schema_output = json.dumps(json.loads(json_output), indent=4)
    example_prompt = nunextract_template.format(schema_template,text,schema_output) + f"\n{EOS_TOKEN}\n"
    example_prompts.append(example_prompt)
  return {"prompt":example_prompts}

def formatting_prompts_func_by_people(examples):
  texts = examples["text"]
  outputs = examples["people"]
  example_prompts = []
  example_texts = []
  example_people = []
  for text,output in zip(texts,outputs):
    for individual_info in output:
      name = individual_info["name"]
      title = individual_info["title"]
      role = individual_info["role"]
      profession = individual_info["profession"]
      example_json_template = json.dumps({"person":{"name":name,"title":[],"role":[],"profession":[]}},indent=4)
      output_json_template = json.dumps({"person":{"name":name,"title":title,"role":role,"profession":profession}},indent=4)
      example_prompt = nunextract_template.format(example_json_template,text,output_json_template) + f"\n{EOS_TOKEN}\n"
      example_prompts.append(example_prompt)
      example_texts.append(text)
      example_people.append(individual_info)
  return {"text":example_texts,"people":example_people,"prompt":example_prompts}


# load_dataset is already imported at the top; reuse the dataset_name defined earlier.
dataset = load_dataset(dataset_name)
dataset = dataset.map(formatting_prompts_func_by_people, batched = True)

In [14]:
print(dataset["train"][0]["prompt"])

<|input|>
### Template:
{
    "person": {
        "name": "Mayela",
        "title": [],
        "role": [],
        "profession": []
    }
}
### Text:
Mayela
<|output|>
{
    "person": {
        "name": "Mayela",
        "title": [],
        "role": [],
        "profession": []
    }
}
<|endoftext|>



In [15]:
print(dataset["test"][0]["prompt"])

<|input|>
### Template:
{
    "person": {
        "name": "MARIELOS MARIN RETANA",
        "title": [],
        "role": [],
        "profession": []
    }
}
### Text:
MEDIDAS DE PROTECCION CONTRA LA VIOLENCIA DOMESTICA, establecidas por MARIELOS MARIN RETANA, mayor, casada, cédula seis-ciento noventa y uno-setecientos sesenta, ama de casa, vecina de Pavas, conta OSCAR ALBERTO MARIN ESTRADA, mayor, casado, agente vendedor. Expediente tramitado ante la Alcaldía Mixta de Pavas, bajo el número 962-96. Conoce este Tribunal del presente proceso, en virtud del recurso de apelación interpuesto por la parte demandada, contra la resolución dictada a las dieciséis horas del veintiuno de marzo de mil novecientos noventa y siete.-
<|output|>
{
    "person": {
        "name": "MARIELOS MARIN RETANA",
        "title": [],
        "role": [],
        "profession": [
            "ama de casa"
        ]
    }
}
<|endoftext|>



# TRAINER

Tips:

- _Be careful using "fp16=True"_: Like fp16, bf16 uses 16 bits (instead of the 32 bits used in full precision). However, fp16 and bf16 cover different ranges of numbers: fp16 is limited to roughly [-65k, 65k], whereas bf16 has a vastly bigger range (roughly the same as fp32, but with coarser spacing between representable values). As a result, when you use fp16 to train a model that was pretrained in bf16, you frequently run into overflow, which produces inf/NaN values for the loss.

- _fp16 vs bf16_: If you own Ampere or newer hardware, you can use bf16 for training and evaluation. While bf16 has worse precision than fp16, it has a much bigger dynamic range. Therefore, if you previously experienced overflow issues while training, bf16 will prevent them most of the time. In fp16 the largest finite number is 65504, and larger values overflow to inf. A bf16 number can be as large as 3.39e+38 (!), which is about the same as fp32, because both use 8 bits for the exponent.
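
The ranges can be derived directly from the bit layouts. The sketch below assumes IEEE-style formats with fp16 = 5 exponent bits and 10 mantissa bits, bf16 = 8 and 7, fp32 = 8 and 23; the helper name `max_finite` is made up for illustration.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent equals the bias.
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias


formats = {"fp16": (5, 10), "bf16": (8, 7), "fp32": (8, 23)}
for name, (exp_bits, man_bits) in formats.items():
    print(f"{name}: max finite value = {max_finite(exp_bits, man_bits):.5g}")
```

The fp16 maximum comes out at 65504, while bf16 and fp32 both land near 3.4e+38: the dynamic range is set by the exponent width, and the mantissa width only affects precision.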

In [16]:
args = SFTConfig(
        hub_model_id=hub_model_id,
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        save_total_limit=5,
        warmup_steps = 5,
        eval_strategy = "epoch",
        do_eval=True,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim="paged_adamw_8bit", # Other options: "adamw_8bit"
        weight_decay = 0.01,
        lr_scheduler_type = "cosine", # Other options: "linear"
        seed = 2024,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        num_train_epochs=10,
        output_dir = f"../models/{base_model_id}/{datetime.now().strftime('%Y%m%d%H%M%S')}",
        save_strategy="epoch",
        push_to_hub=False,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    )
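
As a quick sanity check on these settings, the effective batch size and the number of optimizer updates can be worked out by hand. This sketch assumes a single GPU and 232 training examples (the count shown by the `Map` progress bar in the trainer output); both values are assumptions, not something the config itself states.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 10
num_devices = 1           # assumption: single-GPU run
num_train_examples = 232  # assumption: taken from the Map progress bar

# Gradients are accumulated over several micro-batches before each update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
updates_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_updates = updates_per_epoch * num_train_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer updates per epoch: {updates_per_epoch}")
print(f"optimizer updates for {num_train_epochs} epochs: {total_updates}")
```

Under these assumptions, `warmup_steps = 5` covers only a small fraction of the first epoch.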

In [17]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    peft_config= peft_config  if use_peft else None,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    dataset_text_field = "prompt",
    max_seq_length = 2048,
    args = args,

)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
Map: 100%|██████████| 232/232 [00:00<00:00, 3846.24 examples/s]
Map: 100%|██████████| 14/14 [00:00<00:00, 1869.71 examples/s]


# TRAIN

In [18]:
trainer.train()
trainer._load_best_model()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
You are not running the flash-attention implementation, expect numerical differences.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.3296 | 0.666264 |
| 2 | 0.1512 | 0.720121 |
| 3 | 0.099 | 0.672777 |
| 4 | 0.0376 | 0.719886 |
| 5 | 0.0237 | 0.795169 |
| 6 | 0.0199 | 0.809359 |




# FINAL EVALUATE

In [19]:
trainer.evaluate()

{'eval_loss': 0.6662644147872925,
 'eval_runtime': 4.4137,
 'eval_samples_per_second': 3.172,
 'eval_steps_per_second': 0.453,
 'epoch': 6.0}
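
The reported `eval_loss` matches the validation loss of epoch 1, which is what one would expect when the checkpoint with the lowest `eval_loss` is the one that gets reloaded. The snippet below is only an illustration of that selection rule using the values from the training log; it is not the Trainer's own code.

```python
# Validation loss per epoch, copied from the training log above.
eval_loss_by_epoch = {
    1: 0.666264,
    2: 0.720121,
    3: 0.672777,
    4: 0.719886,
    5: 0.795169,
    6: 0.809359,
}

# Lower is better for a loss metric, so pick the epoch with the minimum value.
best_epoch = min(eval_loss_by_epoch, key=eval_loss_by_epoch.get)
print(f"best epoch: {best_epoch} (eval_loss={eval_loss_by_epoch[best_epoch]})")
```

The training loss keeps falling after epoch 1 while the validation loss rises, so the later epochs mostly overfit the 232 training examples.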

# PUSH

In [20]:
trainer.push_to_hub()

training_args.bin: 100%|██████████| 5.50k/5.50k [00:00<00:00, 45.2kB/s]
tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 617kB/s]
adapter_model.safetensors: 100%|██████████| 2.28G/2.28G [00:52<00:00, 43.7MB/s]


Upload 3 LFS files: 100%|██████████| 3/3 [00:52<00:00, 17.49s/it]


CommitInfo(commit_url='https://huggingface.co/apolo/nymiz-lora-phi_3_mini_4k-pjcr-es-by_person/commit/52f4c44a1e19046557c27acf09f1d5d150fad01f', commit_message='End of training', commit_description='', oid='52f4c44a1e19046557c27acf09f1d5d150fad01f', pr_url=None, pr_revision=None, pr_num=None)

# CLEAN RESOURCES

In [21]:
del model
del trainer
import gc
gc.collect()
gc.collect()
torch.cuda.empty_cache()