# half precision

Instead of f32, we can use f16, which uses half as many bits as the default single precision, hence the name half precision.
For half precision, the smallest positive (subnormal) value is ~5.96e-8 and the largest finite value is 65504 (https://devblogs.microsoft.com/dotnet/introducing-the-half-type).

Half precision halves the memory needed to store the weights.
The trade-off is a narrower range and coarser rounding, so values can overflow or underflow; on newer GPUs that support it, use bf16 instead.
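
To make the overflow and rounding trade-off concrete, here is a minimal sketch that uses only the standard library. `to_half` and `to_bf16` are hypothetical helper names, not part of any library. bf16 is emulated here by rounding a float32 bit pattern to its top 16 bits, which is an approximation for illustration and does not handle NaN or infinity.

```python
import struct

def to_half(x: float) -> float:
    """Round x to IEEE 754 half precision (fp16) and back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 pattern.

    Uses round-to-nearest-even; NaN and infinity are not handled (sketch only).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for nearest-even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# fp16 cannot represent 70000 (max is 65504); struct raises OverflowError
try:
    to_half(70000.0)
    fp16_overflows = False
except OverflowError:
    fp16_overflows = True

print("fp16 overflows at 70000:", fp16_overflows)
print("bf16(70000):", to_bf16(70000.0))  # representable, but coarsely rounded

# bf16 pays for its wider range with fewer mantissa bits
print("fp16(1.001):", to_half(1.001))
print("bf16(1.001):", to_bf16(1.001))
```

The point of the comparison: fp16 keeps more digits but runs out of range quickly, while bf16 keeps the float32 exponent range at the cost of precision.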

To use it:
 * option1: model = model.half()
 * option2: XXXmodel.from_pretrained(ckp, torch_dtype=torch.half)

Option 2 requires fewer resources, since option 1 first loads the model in full precision and only then converts it to half precision.
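
As a rough back-of-the-envelope check of the difference between the two options, the sketch below counts only the bytes needed for the weights. The parameter count is the one printed for the 7B model later in this notebook; `weights_gb` is a hypothetical helper, and real peak usage can be higher because of buffers and temporary copies.

```python
def weights_gb(n_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 6_738_415_616  # parameter count of the 7B checkpoint used in this notebook

fp32_gb = weights_gb(N_PARAMS, 4)  # option 1 loads in fp32 before calling .half()
fp16_gb = weights_gb(N_PARAMS, 2)  # option 2 loads directly in fp16

print(f"option 1 (fp32 load, then .half()): at least ~{fp32_gb:.2f} GB while loading")
print(f"option 2 (load directly in fp16):   ~{fp16_gb:.2f} GB")
```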

In [11]:
# to illustrate the problem of underflow
# the smallest positive half-precision value is ~5.96e-8; values far enough below it round to 0

import torch

print("6e-8 is: ", torch.tensor(6e-8).half().item())
print("1e-8 is: ", torch.tensor(1e-8).half().item())

6e-8 is:  5.960464477539063e-08
1e-8 is:  0.0


## I. Llama 2

 - open-source pre-trained model from Meta
 - auto-regressive model
 - sizes: 7B, 13B, 70B
 

### 1. Data

In [2]:
# select which GPU to use
# I have 2 GPUs and only want to use one, so I set CUDA_VISIBLE_DEVICES
# this cell must run first, before any library initializes CUDA

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [3]:
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForSeq2Seq

ckp_data = "yahma/alpaca-cleaned"
ckp = "NousResearch/Llama-2-7b-chat-hf"



In [4]:
# load dataset

data = load_dataset(ckp_data, split="train[:1000]")
data

Dataset({
    features: ['output', 'instruction', 'input'],
    num_rows: 1000
})

### 2. Tokenizer

In [5]:
# load tokenizer

# With the Llama tokenizer used here, the default padding_side is "left".
# If we keep that default, the training loss drops to 0 after 1 or 2 steps.
# This is because the tokens that contribute to the loss are on the right (after the "Assistant" keyword).

# So we have to set this option to "right"

tokenizer = AutoTokenizer.from_pretrained(ckp)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token
tokenizer

LlamaTokenizerFast(name_or_path='NousResearch/Llama-2-7b-chat-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	32000: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
}
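
To show what `padding_side` actually changes, here is a small pure-Python sketch with toy token ids. `pad_batch` is a hypothetical helper, not the tokenizer's or the collator's implementation. The pad id 2 matches `</s>` in the tokenizer output above, and -100 is the label padding value that `DataCollatorForSeq2Seq` uses by default.

```python
def pad_batch(seqs, pad_id, side="right"):
    """Pad every list of ids to the longest length in the batch."""
    width = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [pad_id] * (width - len(s))
        padded.append(s + fill if side == "right" else fill + s)
    return padded

batch = [[5, 6, 7], [8, 9]]  # toy ids, not real tokens

right = pad_batch(batch, pad_id=2, side="right")
left = pad_batch(batch, pad_id=2, side="left")
print("right:", right)
print("left: ", left)

# labels are padded with -100 so that padding never contributes to the loss
print("labels:", pad_batch(batch, pad_id=-100, side="right"))
```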

In [6]:
# process data

def process(sample):

    MAX_LEN = 256

    human = tokenizer("Human: " + "\n".join([sample["instruction"], sample["input"]]).strip() + "\n\nAssistant: ", add_special_tokens=False)
    
    # if we set add_special_tokens to True, it will also add the bos token, which we don't want
    # if we appended the eos string directly to the sample's output, we would need a space between them,
    # otherwise the last word would become "{word}eos", which tokenizes differently from "{word} eos".
    # So the simplest way is to append the eos id directly to input_ids, as shown below.
    
    ml = tokenizer(sample["output"], add_special_tokens=False)

    input_ids = human["input_ids"] + ml["input_ids"] + [tokenizer.eos_token_id]
    attention_mask = human["attention_mask"] + ml["attention_mask"] + [1]
    labels = [-100] * len(human["input_ids"]) + ml["input_ids"] + [tokenizer.eos_token_id]

    if len(input_ids) > MAX_LEN:

        input_ids = input_ids[:MAX_LEN]
        attention_mask = attention_mask[:MAX_LEN]
        labels = labels[:MAX_LEN]

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
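
The label masking in `process` can be illustrated without a tokenizer. The sketch below uses toy ids and a hypothetical `build_example` helper. It assumes the usual convention that labels equal to -100 are ignored by the cross-entropy loss, so only the answer and the eos token are supervised.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default

def build_example(prompt_ids, answer_ids, eos_id, max_len):
    """Concatenate prompt + answer + eos and mask the prompt in the labels."""
    input_ids = prompt_ids + answer_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids + [eos_id]
    return {
        "input_ids": input_ids[:max_len],
        "attention_mask": [1] * min(len(input_ids), max_len),
        "labels": labels[:max_len],
    }

ex = build_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22], eos_id=2, max_len=8)
supervised = [t for t in ex["labels"] if t != IGNORE_INDEX]
print(ex)
print("tokens contributing to the loss:", supervised)

# with a short max_len the eos token is truncated away
short = build_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22], eos_id=2, max_len=4)
print(short["labels"])
```

One consequence worth noting: when a sample is longer than the maximum length, truncation removes the eos token, so the model never sees an end marker for that sample.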

In [7]:
# tokenize dataset

tokenized_data = data.map(process, remove_columns=data.column_names)
tokenized_data

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 1000
})

In [7]:
# the decoded sample ends with the eos token (</s>) that we appended in process()

tokenizer.decode(tokenized_data[1]["input_ids"])

'Human: What are the three primary colors?\n\nAssistant:  The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB).</s>'

In [8]:
print(tokenized_data[1]["input_ids"])

[12968, 29901, 1724, 526, 278, 2211, 7601, 11955, 29973, 13, 13, 7900, 22137, 29901, 29871, 450, 2211, 7601, 11955, 526, 2654, 29892, 7254, 29892, 322, 13328, 29889, 4525, 11955, 526, 2000, 7601, 1363, 896, 2609, 367, 2825, 491, 24907, 916, 11955, 322, 599, 916, 11955, 508, 367, 1754, 491, 29299, 963, 297, 5164, 12098, 1080, 29889, 512, 278, 788, 3321, 2927, 1788, 29892, 1304, 363, 3578, 29892, 278, 7601, 11955, 526, 2654, 29892, 7933, 29892, 322, 7254, 313, 28212, 467, 2]


In [10]:
type(tokenized_data[1]["input_ids"][0])

int

### 3. load model

In [8]:
# model
# after loading, about 13 GB of GPU memory was taken

import torch

model = AutoModelForCausalLM.from_pretrained(ckp, low_cpu_mem_usage=True, 
                                             torch_dtype=torch.half, 
                                             device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [9]:
# show the type of the model

model.dtype

torch.float16

In [10]:
# show more detailed types

for name, param in model.named_parameters():
    print(name, param.dtype)

model.embed_tokens.weight torch.float16
model.layers.0.self_attn.q_proj.weight torch.float16
model.layers.0.self_attn.k_proj.weight torch.float16
model.layers.0.self_attn.v_proj.weight torch.float16
model.layers.0.self_attn.o_proj.weight torch.float16
model.layers.0.mlp.gate_proj.weight torch.float16
model.layers.0.mlp.up_proj.weight torch.float16
model.layers.0.mlp.down_proj.weight torch.float16
model.layers.0.input_layernorm.weight torch.float16
model.layers.0.post_attention_layernorm.weight torch.float16
model.layers.1.self_attn.q_proj.weight torch.float16
model.layers.1.self_attn.k_proj.weight torch.float16
model.layers.1.self_attn.v_proj.weight torch.float16
model.layers.1.self_attn.o_proj.weight torch.float16
model.layers.1.mlp.gate_proj.weight torch.float16
model.layers.1.mlp.up_proj.weight torch.float16
model.layers.1.mlp.down_proj.weight torch.float16
model.layers.1.input_layernorm.weight torch.float16
model.layers.1.post_attention_layernorm.weight torch.float16
model.layers.2

In [12]:
# count model parameters

params = sum(param.numel() for param in model.parameters())
print("model size: ", params/1e9, "B parameters")
# rough full fine-tuning estimate: (1 + 1 + 3) copies of every parameter at 4 bytes (fp32) each
print("total required memory: ", round(params/1e9 * (1 + 1 + 3) * 4, 2), "GB")

model size:  6.738415616 B parameters
total required memory:  134.77 GB


### 4. peft

In [13]:
from peft import LoraConfig, TaskType, get_peft_model

# config = LoraConfig(task_type=TaskType.CAUSAL_LM, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
config = LoraConfig(task_type=TaskType.CAUSAL_LM)
config

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, inference_mode=False, r=8, target_modules=None, lora_alpha=8, lora_dropout=0.0, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None)

In [14]:
# get model

peft_model = get_peft_model(model, config)

In [15]:
# this is needed if we enable gradient_checkpointing

peft_model.enable_input_require_grads()

In [16]:
# show trainable parameter number

peft_model.print_trainable_parameters()

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
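
The trainable parameter count can be reproduced by hand. The sketch below assumes a hidden size of 4096 and 32 decoder layers for the 7B model (neither is stated in this notebook), rank r=8 from the LoraConfig output, and LoRA adapters on q_proj and v_proj only, as the parameter listing in this notebook shows.

```python
HIDDEN = 4096     # assumed hidden size of the 7B model (not stated in this notebook)
N_LAYERS = 32     # assumed number of decoder layers (not stated in this notebook)
R = 8             # LoRA rank, from the LoraConfig output
N_TARGETS = 2     # q_proj and v_proj carry lora_A / lora_B weights

# lora_A has shape (r, in_features), lora_B has shape (out_features, r)
per_module = R * HIDDEN + HIDDEN * R
lora_params = per_module * N_TARGETS * N_LAYERS

base_params = 6_738_415_616  # parameter count of the base model printed earlier
total = base_params + lora_params
percent = f"{100 * lora_params / total:.4f}"

print(f"trainable params: {lora_params:,} || all params: {total:,} || trainable%: {percent}")
```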


In [13]:
# show type of all layers

# make sure all layers are half precision

for name, param in peft_model.named_parameters():
    print(name, param.dtype)

base_model.model.model.embed_tokens.weight torch.float16
base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight torch.float16
base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight torch.float16
base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight torch.float16
base_model.model.model.layers.0.self_attn.k_proj.weight torch.float16
base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight torch.float16
base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight torch.float16
base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight torch.float16
base_model.model.model.layers.0.self_attn.o_proj.weight torch.float16
base_model.model.model.layers.0.mlp.gate_proj.weight torch.float16
base_model.model.model.layers.0.mlp.up_proj.weight torch.float16
base_model.model.model.layers.0.mlp.down_proj.weight torch.float16
base_model.model.model.layers.0.input_layernorm.weight torch.float16
base_model.model.model.layers.0.p

### 5. training

In [17]:
args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    logging_steps=50,
    gradient_checkpointing=True, # requires peft_model.enable_input_require_grads()
    adam_epsilon=1e-4 # the default 1e-8 underflows to 0 in half precision
)

In [18]:
trainer = Trainer(
    model=peft_model, 
    args=args,
    train_dataset=tokenized_data,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)
)

In [19]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
50,1.1329
100,0.9256
150,0.8627
200,0.9252
250,0.9253
300,0.9004
350,1.0435
400,1.0043
450,0.9407
500,0.9722




TrainOutput(global_step=1000, training_loss=0.9005191535949707, metrics={'train_runtime': 308.5116, 'train_samples_per_second': 3.241, 'train_steps_per_second': 3.241, 'total_flos': 5855336658862080.0, 'train_loss': 0.9005191535949707, 'epoch': 1.0})

### 6. inference

In [20]:
peft_model.eval()
text = "hello"
# use `inputs` rather than `input` to avoid shadowing the Python builtin
inputs = tokenizer("Human: " + text + "\n\nAssistant: ", return_tensors="pt").to(model.device)
output = peft_model.generate(**inputs, max_length=256, eos_token_id=tokenizer.eos_token_id)
tokenizer.decode(output[0], skip_special_tokens=True)

'Human: hello\n\nAssistant:  Hello! How can I assist you today?'
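
The decoded string still contains the prompt. A minimal sketch for keeping only the model's reply is shown below; `extract_reply` is a hypothetical helper that strips the known prompt prefix and falls back to the full string if the prefix is not found.

```python
def extract_reply(decoded: str, prompt: str) -> str:
    """Return the generated text that follows the prompt."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()  # fallback: prompt not found, return everything

prompt = "Human: hello\n\nAssistant: "
decoded = "Human: hello\n\nAssistant:  Hello! How can I assist you today?"
reply = extract_reply(decoded, prompt)
print(reply)
```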

## II. possible problems with abnormal loss

 1) If the loss (with batch size != 1) explodes and then becomes 0 after 1 or 2 steps, make sure the tokenizer's padding side is "right" instead of "left", which is the default value.
 
 2) If the loss still becomes 0 after 1), check whether all model layers are in half precision. If so, raise the optimizer's epsilon so that it is greater than 6e-8: the default epsilon is 1e-8, the smallest positive half-precision value is ~6e-8, and torch.tensor(1e-8).half() = 0.

 3) Use: tokenizer.pad_token = tokenizer.eos_token

 4) If inference keeps generating text after the eos, check the data processing.
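
For point 2, the effect of the epsilon value can be sketched in pure Python. This is a deliberately simplified model of the Adam update direction, m / (sqrt(v) + eps), with every intermediate rounded to fp16. It ignores bias correction and the learning rate, and `to_half` and `adam_direction` are hypothetical helpers, not the optimizer's real code.

```python
import math
import struct

def to_half(x: float) -> float:
    """Round x to fp16 and back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def adam_direction(m: float, v: float, eps: float) -> float:
    """Simplified Adam direction m / (sqrt(v) + eps), computed in fp16."""
    denom = to_half(to_half(math.sqrt(to_half(v))) + to_half(eps))
    if denom == 0.0:
        # division by zero: the update becomes inf (or nan when m is also 0)
        return math.nan if to_half(m) == 0.0 else math.inf
    return to_half(to_half(m) / denom)

g = 1e-5          # a small gradient; its square underflows to 0 in fp16
m, v = g, g * g   # first and second moment after one step (simplified)

bad = adam_direction(m, v, eps=1e-8)   # eps rounds to 0 in fp16
good = adam_direction(m, v, eps=1e-4)  # eps survives the rounding
print("eps=1e-8:", bad)
print("eps=1e-4:", good)
```

With the default epsilon both terms of the denominator round to 0, so the update is no longer finite; a larger epsilon keeps the denominator positive.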


Some other points:

  - make sure the tokenizer's max_length is long enough for the whole text
  - when loading the model, pass torch_dtype=torch.half, otherwise the model is loaded in full precision
  - when using gradient_checkpointing, call peft_model.enable_input_require_grads() first
