# Transformers

This notebook will:

    I. Summarize the transformers modules covered so far
    II. Explain memory usage
    III. Show how to optimize memory usage

## I. Summary

### The modules presented so far

 - **datasets**: download, build, and process data, from either online or offline datasets.

 - **tokenizer**: tokenize the dataset. It produces a dict-like structure containing "input_ids", "attention_mask" and
    possibly other fields, depending on the problem. The "labels" field is typically added during preprocessing.

 - **model**: build, download, and modify models.

 - **evaluate**: define and load metrics.

 - **trainer**: put everything together for training, with options to optimize training.
 
 - **pipeline**: build a process that takes inputs and returns results.

### Training steps

So far, we can summarize training and usage in 10 steps:

 1) import modules: Hugging Face provides transformers and datasets to access its own resources
 2) load datasets: we can use datasets from Hugging Face or from any other source. Depending
 on the data format and the application, we may need to clean, structure, split, or process them.
 3) split dataset
 4) tokenization: process and tokenize the datasets
 5) load model
 6) construct metrics
 7) training arguments: epochs, devices, logging...
 8) construct trainer: specify the model, tokenizer, datasets, metrics, data collator...
 9) train and evaluate
 10) inference: we can use the pipeline from Hugging Face, but sometimes we need to write our own.

 

## II. Memory Usage

Before running any inference or training, we should make sure that enough memory is available: a model can easily exceed the GPU memory limit, which can make the run very slow or crash the machine.

The following article explains where the memory goes when training a model: https://arxiv.org/pdf/1910.02054v3.

Hugging Face provides a tool that estimates these figures: https://huggingface.co/spaces/hf-accelerate/model-memory-usage.

In addition, there is further memory overhead that depends on the batch size, the input/output size, the hidden states, and other factors.

However, PyTorch does not report the full memory usage: torch.cuda.memory_allocated() only counts the tensors currently allocated by PyTorch in the current process, and it can return 0.

At runtime, we can use nvidia-smi (or another monitoring tool such as "nvtop") to check the overall GPU memory usage.
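
As a minimal sketch of such a runtime check, the snippet below calls `nvidia-smi` from Python and parses its CSV output. The helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical, and the code assumes that `nvidia-smi` is on the PATH; when it is not, an empty list is returned.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (in MiB) as printed by nvidia-smi's CSV mode."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append({"used_mib": used, "total_mib": total})
    return usage


def query_gpu_memory():
    """Return one dict per visible GPU, or [] when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for index, gpu in enumerate(query_gpu_memory()):
        print(f"GPU {index}: {gpu['used_mib']} / {gpu['total_mib']} MiB")
```

The figure reported this way covers everything resident on the GPU, including memory used by other processes, so it is best read on an otherwise idle card.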


Below, we present a simple example to show the calculation.

In [5]:
# we take a bert model as example

from transformers import BertModel
model = BertModel.from_pretrained("google-bert/bert-large-uncased")

In [7]:
# count the total number of parameters in the model

params = sum(param.numel() for param in model.parameters())

# this is a parameter count in millions (M), not a memory size
print("model parameter numbers: ", round(params/1e6), "M")

# for standard Adam training in fp32, we should account for the memory of:
#  - a copy of the model parameters
#    model parameters x 4 (bytes)

print("model: ", round(params/1e6) * 4, "MB")

#  - a copy of the gradients
#    model parameters x 4 (bytes)

print("gradients: ", round(params/1e6) * 4, "MB")

#  - the optimizer states (copy of parameters, momentum and variance)
#    model parameters x 12 (bytes)
#    (this figure follows the mixed-precision accounting of the article above;
#    a pure fp32 run keeps no separate parameter copy, so 12 bytes is conservative here)

print("optimizer: ", round(params/1e9 * 12, 2), "GB")

# so in total, the model states need:
print("total: ", round(params/1e9 * (4 + 4 + 12), 2), "GB")

model parameter numbers:  335 M
model:  1340 MB
gradients:  1340 MB
optimizer:  4.02 GB
total:  6.7 GB
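
The same arithmetic can be wrapped in a small helper to compare different setups. This is only a rough sketch: the function name `training_memory_gb` is hypothetical, 335e6 is the rounded parameter count printed above, and activations and other overheads are ignored. The cell above uses 4 + 4 + 12 bytes per parameter. For comparison, the sketch assumes that plain fp32 Adam keeps only momentum and variance (8 bytes), and that mixed precision follows the accounting of the article linked above (fp16 weights and gradients, plus 12 bytes of fp32 optimizer state).

```python
def training_memory_gb(n_params, weight_bytes, grad_bytes, optim_bytes):
    """Estimate model-state memory in GB (1 GB = 1e9 bytes), activations excluded."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9


N_PARAMS = 335e6  # rounded parameter count of bert-large-uncased from the cell above

# Accounting used in the cell above: fp32 weights, fp32 gradients, 12 bytes of optimizer state.
notebook_estimate = training_memory_gb(N_PARAMS, 4, 4, 12)

# Assumption: plain fp32 Adam stores only momentum and variance (4 + 4 bytes).
fp32_adam = training_memory_gb(N_PARAMS, 4, 4, 8)

# Assumption: mixed precision with fp16 weights and gradients, plus fp32 copies of
# the parameters, momentum and variance (12 bytes).
mixed_precision = training_memory_gb(N_PARAMS, 2, 2, 12)

print(f"notebook estimate: {notebook_estimate:.2f} GB")
print(f"fp32 Adam:         {fp32_adam:.2f} GB")
print(f"mixed precision:   {mixed_precision:.2f} GB")
```

Under these assumptions, mixed precision does not shrink the model states compared with plain fp32 Adam. Its savings come mainly from smaller activations, which this estimate leaves out.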


## III. Memory Optimizations

As models become larger and larger, the memory demand for training increases as well.

We illustrate how certain training settings can be changed to reduce the memory needed to run large models, and what effect each change has.

To reproduce the results below, run the cell below with the corresponding parameter changes. Restart the kernel before each run.

 - Model: google-bert/bert-large-uncased
   * 336M parameters
 - GPU: NVIDIA GeForce RTX 3090
   * CUDA Version: 12.3
   * 24GB

In the table, BS is the per-device training batch size and GA is the number of gradient accumulation steps.

| Num | Optimization | Part | Memory (GB) | Time (s) | Comment |
|--- |--- |--- |--- |--- |--- |
| 0 | Baseline (BS=32, Length=128) | | 8.8 | 90 | |
| 1 | Gradient Accumulation (BS=1, GA=32) | Forward | 5.7 | 747 | gradient_accumulation_steps=32 |
| 2 | Gradient Checkpointing (BS=1, GA=32) | Forward | 5.4 | 947 | gradient_checkpointing=True |
| 3 | Adafactor Optimizer (BS=1, GA=32) | Optimizer | 3.0 | 907 | optim="adafactor" |
| 4 | Freeze Model (BS=1, GA=32) | Forward+Gradient | 1.6 | 370 | set requires_grad to False on the model parameters |
| 5 | Data Length (BS=1, GA=32, Length=32) | Forward | 1.6 | 360 | decrease max_length in the tokenizer |
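
Row 1 of the table replaces one batch of 32 samples with 32 micro-batches of 1 sample whose gradients are accumulated before each update. The toy sketch below illustrates why this gives the same update. It uses a one-parameter model in pure Python, not the Trainer's actual implementation, and the function names are hypothetical. Only one micro-batch is processed at a time, which is where the memory saving comes from, at the cost of many more forward and backward passes per update.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of accumulation steps.

    Assumes len(xs) is a multiple of micro_batch_size.
    """
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [float(i) for i in range(1, 33)]   # 32 samples, like BS=32 in the baseline
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)              # one batch of 32
accum = accumulated_grad(w, xs, ys, 1)  # 32 micro-batches of 1 (BS=1, GA=32)
print(abs(full - accum) < 1e-9)
```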


Other methods can also help to run a large model on limited resources:
 - Reduce the batch size: the first method to consider.
 - Use a simpler model: smaller architectures require less memory. Depending on the problem, we may not always need a larger model.
 - Distributed training: spread the work over multiple GPUs.
 - Use mixed precision training: use lower-precision formats such as f8 or f16 for training (explored later).
 - Release unused tensors.
 - Other fine-tuning techniques (explored later).
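
To give a feel for what a lower-precision format trades away, the sketch below uses Python's standard `struct` module to round values to half precision (f16). It is only an illustration of the number format, not of how a training framework implements mixed precision, and the helper name `to_fp16` is hypothetical.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Storage per value: single precision takes 4 bytes, half precision takes 2.
print(struct.calcsize("f"), struct.calcsize("e"))

# Half precision keeps about 3 decimal digits, so 0.1 is stored with a visible error.
print(to_fp16(0.1))

# A small update to a weight close to 1.0 is rounded away entirely in half precision.
print(to_fp16(1.0 + 1e-4) == 1.0)
```

The last line hints at why mixed precision setups usually keep a full-precision copy of the parameters for the optimizer update.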

In [2]:
#############################
# code to produce the table #
#############################

# The table above was obtained by changing the code according to the numbers in the comments, e.g. "(1)".
# The memory and time figures may vary slightly depending on the machine and its state.


# This cell should be run before any other transformers calls

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs


# example used to get the table results above

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import evaluate
import numpy as np

ckp = "google-bert/bert-large-uncased"

# load data
data = load_dataset("sepidmnorozy/English_sentiment")

# tokenization
tokenizer = AutoTokenizer.from_pretrained(ckp)

# process dataset
def process(samples):

    toks = tokenizer(samples["text"], truncation=True, max_length=128, padding=True) # (5) change max_length=32
    toks["labels"] = samples["label"]

    return toks

tokenized_data = data.map(process, batched=True, remove_columns=data["train"].column_names)

# load model
model = AutoModelForSequenceClassification.from_pretrained(ckp)

# evaluate
acc_fct = evaluate.load("accuracy")
f1_fct = evaluate.load("f1")

def metric(pred):

    preds, refs = pred
    preds = preds.argmax(axis=-1)

    accuracy = acc_fct.compute(predictions=preds, references=refs)
    f1 = f1_fct.compute(predictions=preds, references=refs)

    accuracy.update(f1)

    return accuracy

# train args
# (os is already imported at the top of this cell)
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # disable tokenizer parallelism to avoid fork warnings

args = TrainingArguments(
    output_dir="./checkpoint",
    per_device_train_batch_size=32, # (1-5) change the batch size to 1
    per_device_eval_batch_size=1,
    # gradient_accumulation_steps=32, # (1) uncomment to use gradient accumulation
    # gradient_checkpointing=True, # (2) uncomment to use gradient checkpoint
    # optim="adafactor", # (3) uncomment to use adafactor optimization
    num_train_epochs=1,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    metric_for_best_model="f1",
    load_best_model_at_end=True
)

# The model is composed of a BertModel (available as model.bert) and a linear output layer.
# The idea is to freeze the BertModel parameters and only update the output layer parameters.

# (4) uncomment below to freeze the BertModel parameters
# for param in model.bert.parameters():
#     param.requires_grad = False

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=metric
)

trainer.train()


Map:   0%|          | 0/6920 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Map:   0%|          | 0/1821 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
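
The notebook does not state how the memory and time figures in the table were collected. One possible way, sketched below under that caveat, is a small context manager around the training call. The name `measure` is hypothetical. The sketch assumes PyTorch is installed for the memory figure and falls back to timing only when it is not. Note that `torch.cuda.max_memory_allocated()` counts only tensors allocated by PyTorch, so it is likely to read lower than the value shown by nvidia-smi.

```python
import time
from contextlib import contextmanager

try:
    import torch  # optional: only needed for the GPU memory figure
except ImportError:
    torch = None


@contextmanager
def measure(stats):
    """Record wall-clock time and, when CUDA is available, peak tensor memory."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    try:
        yield stats
    finally:
        stats["seconds"] = time.perf_counter() - start
        if use_cuda:
            stats["peak_gb"] = torch.cuda.max_memory_allocated() / 1e9


stats = {}
with measure(stats):
    # placeholder workload; in this notebook it would be trainer.train()
    sum(i * i for i in range(100_000))
print(stats)
```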
