# Week 5: Model Deployment & Training Efficiencies


In this lesson notebook we will apply this week's material to the family of GPT2 models with specific focus on memory consumption and model qualities.

**Note:** We should stress that using small (and older) models like GPT-2 is not necessarily representative of the effectiveness of the techniques for more recent models. Also, we are running only a few hundred steps for the training runs, obviously affecting the results. Hyperparameter tuning also wasn't done. **So the purpose of this notebook is to introduce and test the ideas, not to conduct a detailed comparison.**

We will use the Stanford Sentiment Treebank as a dataset to compare the models. This notebook also uses components and approaches from the '[Fine tuning Google Colab notebook](https://huggingface.co/blog/4bit-transformers-bitsandbytes)' discussing bitsandbytes, and from Hugging Face's [text classification example](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb) notebook.


Here is the structure of the lesson notebook and the points of interest:

0. Setup
1. Dataset Creation and Configuration: Sentiment Classification
2. The Base Model: GPT2-medium
    - Memory Consumption, pre- and during training
    - Fine-tuning Result
3. Quantization
    - Memory Consumption of the 4-bit quantized model
4. LoRA: Fine-tuning with Few Parameters I
    - Memory Size during training compared to base models
    - Fine-tuning Result
    - Size of LoRA parameters
    - Saving and loading
5. Soft Prompt Tuning: Fine-tuning with Few Parameters II
    - Memory Size during training compared to base models
    - Fine-tuning Result
    - Size of Soft Prompt parameters
6. QLoRA: Fine-tuning of GPT2-XL with Few Parameters & Aggressive Quantization
    - Memory Size during training compared to base and PEFT models
    - Fine-tuning Result

This notebook runs on a T4 GPU.

**Note:** If you want to look at memory consumption using the Resources tab, you may need to restart the session multiple times. If you do so, comment out the pip installs and rerun the Setup and Data Setup sections. Then continue from wherever you like.


## 0. Setup


Installs & Imports:

In [None]:
%%capture

!pip install datasets==2.21.0  transformers
!pip install accelerate -U            # Quantization, Distribution
!pip install -q peft                  # LoRA
!pip install -q evaluate
!pip install bitsandbytes             # QLoRA

In [None]:
import sys
import numpy as np
import torch

import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import BitsAndBytesConfig

from datasets import load_dataset, load_metric

from peft import LoraConfig, TaskType, PeftModel, get_peft_model
from peft import load_peft_weights, set_peft_model_state_dict
from peft import PromptEncoderConfig, prepare_model_for_kbit_training

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

import wandb
wandb.init(mode="disabled")

Some useful definitions (see Text Classification notebook):

In [None]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))


def preprocess_function(examples):
    return tokenizer(examples[sentence1_key], truncation=True)



def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

def show_currently_allocated_gpu_mem():
    # Release cached blocks first, then report the memory held by live tensors.
    torch.cuda.empty_cache()
    mem = torch.cuda.memory_allocated()
    print(f"Current GPU memory allocation (GB): {mem/1024**3}")

## 1. Data Setup

We use the GLUE dataset, loading the data for the Stanford Sentiment Treebank task. We will also right away define the tokenizer for our models (they all use the GPT2 tokenizer).

In [None]:
task = actual_task = "sst2"
tokenizer_model_name = "gpt2-medium"  # All GPT2 sizes share the same tokenizer; we load it from this checkpoint.
batch_size = 16

In [None]:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_name, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
dataset = load_dataset("glue", actual_task, trust_remote_code=True)
metric = load_metric('glue', actual_task, trust_remote_code=True)

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["train"][2]

{'sentence': 'that loves its characters and communicates something rather beautiful about human nature ',
 'label': 1,
 'idx': 2}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
show_random_elements(dataset["train"])

| | sentence | label | idx |
|---|---|---|---|
| 0 | the performers are so spot on , it is hard to conceive anyone else in their roles . | positive | 60501 |
| 1 | sympathetic characters | positive | 49927 |
| 2 | sacrificing the integrity of the opera | negative | 63283 |
| 3 | up not one but two flagrantly fake thunderstorms | negative | 58917 |
| 4 | real visual charge to the filmmaking | positive | 62882 |
| 5 | for the film 's publicists or for people who take as many drugs as the film 's characters | negative | 59083 |
| 6 | funny and also heartwarming without stooping to gooeyness | positive | 39669 |
| 7 | extends to his supple understanding of the role that brown played in american culture as an athlete , a movie star , and an image of black indomitability | positive | 56163 |
| 8 | i 've never seen or heard anything quite like this film , and | positive | 62126 |
| 9 | is no doubt that krawczyk deserves a huge amount of the credit for the film 's thoroughly winning tone | positive | 64673 |


The `metric` object loaded above exposes a `compute` method. You can call it with your predictions and labels directly, and it will return a dictionary with the metric value(s):

In [None]:
fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'accuracy': 0.5625}

In [None]:
sentence1_key, sentence2_key = ("sentence", None)

Following the [text classification notebook](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb), we construct a properly formatted dataset (for the `Trainer` class) using the preprocessing function defined above:

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

Lastly, we define for the future analysis the base model, the metric, and the key for the validation data in the encoded dataset:

In [None]:
metric_name = "accuracy"
base_model_name = "gpt2-medium"
validation_key = "validation"

## 2. Base Models

Now we can fine-tune our base models. We start with GPT2-medium, a 355M-parameter model.

In [None]:
medium_model = AutoModelForSequenceClassification.from_pretrained("gpt2-medium", num_labels=2)
medium_model.config.pad_token_id = medium_model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2-medium and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's make sure it works:

In [None]:
input_dict = tokenizer(['this is fun', 'this is nice'], return_tensors='pt')
input_dict['labels'] = torch.tensor([1, 0])
input_dict.to('cuda')

{'input_ids': tensor([[5661,  318, 1257],
        [5661,  318, 3621]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1],
        [1, 1, 1]], device='cuda:0'), 'labels': tensor([1, 0], device='cuda:0')}

In [None]:
medium_model.to('cuda')
preds = medium_model(**input_dict)

preds.loss

tensor(1.4613, device='cuda:0', grad_fn=<NllLossBackward0>)

In [None]:
medium_model.to('cuda')

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=3072, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=1024)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=4096, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=4096)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=1024, out_features=2, bias=False)
)

In [None]:
medium_model(**tokenizer(['this is fun', 'this is nice'], return_tensors='pt').to('cuda')).keys()

odict_keys(['logits', 'past_key_values'])

In [None]:
preds[0].mean()

tensor(1.4613, device='cuda:0', grad_fn=<MeanBackward0>)

Now look at the GPU memory consumption in the resource list! Is it as expected?

You can also query the current GPU memory consumption programmatically:



In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 1.387643814086914
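
As a rough sanity check on this reading, the memory needed for the weights alone can be estimated from the parameter count. The sketch below assumes 4 bytes per parameter (fp32) and uses the rounded 355M figure quoted above; the helper name `weights_gib` is made up for illustration.

```python
def weights_gib(n_params: int, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


# Rounded parameter count quoted above for GPT2-medium.
N_PARAMS_MEDIUM = 355_000_000

# Assumption: weights are stored in fp32, i.e. 4 bytes per parameter.
fp32_gib = weights_gib(N_PARAMS_MEDIUM, 4)
print(f"fp32 weights: {fp32_gib:.2f} GiB")
```

The measured value is a little higher than this estimate. A plausible reason is that non-parameter buffers and tensors from the test forward pass are still alive, but that is a guess rather than something this notebook verifies.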


Now we first define the trainer arguments and then the actual trainer for the base model:

In [None]:
args = TrainingArguments(
    f"full_{base_model_name}-finetuned-{task}",
    eval_strategy = "steps",
    eval_steps = 100,
    logging_strategy = "steps",
    logging_steps = 100,
    save_strategy = "no",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    max_steps=300,
    weight_decay=0.01,
    load_best_model_at_end=False,
    metric_for_best_model=metric_name,
    report_to="none",
    run_name="test1"
)

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
medium_trainer = Trainer(
    medium_model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

As stated in the referenced Text Classification notebook: "You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?)..."
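
To make the padding step concrete, here is a plain-Python sketch of padding a batch to its longest sequence. It is an illustration only, not the collator the `Trainer` actually uses: `pad_batch` is a hypothetical helper, and padding on the right is an assumption.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


# GPT-2's end-of-text id, which this notebook reuses as the pad token.
EOS_ID = 50256

# Two token-id sequences of different length ("this is fun" and "this").
ids, mask = pad_batch([[5661, 318, 1257], [5661]], EOS_ID)
print(ids)
print(mask)
```

The attention mask marks the padded positions with 0 so that the model can ignore them.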

Now we train the model simply by calling the `train` method. Note that we did not have to write our own training and evaluation loops; these are abstracted away and taken care of by the `Trainer` class.

In [None]:
medium_trainer.train()

| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 100 | 0.5792 | 0.295501 | 0.887615 |
| 200 | 0.3198 | 0.269023 | 0.911697 |
| 300 | 0.2769 | 0.283095 | 0.919725 |


TrainOutput(global_step=300, training_loss=0.39195878982543947, metrics={'train_runtime': 154.0396, 'train_samples_per_second': 31.161, 'train_steps_per_second': 1.948, 'total_flos': 291439861039104.0, 'train_loss': 0.39195878982543947, 'epoch': 0.07125890736342043})

Again, observe the memory consumption during training! Is it ~3-5x of the original amount?
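
Where could a factor in that range come from? A back-of-the-envelope sketch, assuming fp32 everywhere and an Adam-style optimizer that keeps two moment buffers per parameter (activations are not counted, so the real peak will be higher):

```python
GIB = 1024**3
n_params = 355_000_000   # rounded GPT2-medium parameter count
bytes_fp32 = 4

weights = n_params * bytes_fp32 / GIB
gradients = weights        # one gradient value per parameter
adam_states = 2 * weights  # first and second moment per parameter

# During a training step all three are resident at the same time.
during_step = weights + gradients + adam_states
# If gradients are released between steps, weights and optimizer state remain.
between_steps = weights + adam_states

print(f"weights + gradients + Adam states: {during_step:.2f} GiB")
print(f"weights + Adam states only:        {between_steps:.2f} GiB")
```

Under these assumptions the fixed cost is about 4x the weights during a step and about 3x afterwards; activation memory comes on top and depends on batch size and sequence length.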

Let us save the model to disk:

In [None]:
medium_trainer.save_model("./medium_model_base")

In [None]:
!ls -al ./medium_model_base

total 1390804
drwxr-xr-x 2 root root       4096 May 30 19:18 .
drwxr-xr-x 1 root root       4096 May 30 19:18 ..
-rw-r--r-- 1 root root       1021 May 30 19:18 config.json
-rw-r--r-- 1 root root     456318 May 30 19:18 merges.txt
-rw-r--r-- 1 root root 1419331144 May 30 19:18 model.safetensors
-rw-r--r-- 1 root root        131 May 30 19:18 special_tokens_map.json
-rw-r--r-- 1 root root        507 May 30 19:18 tokenizer_config.json
-rw-r--r-- 1 root root    3557680 May 30 19:18 tokenizer.json
-rw-r--r-- 1 root root       5304 May 30 19:18 training_args.bin
-rw-r--r-- 1 root root     798156 May 30 19:18 vocab.json


In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 4.03923225402832


What if we delete the trainer?

In [None]:
del medium_trainer

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 4.03923225402832


No change yet! (It may take a few seconds before the memory is freed.)
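
One possible explanation is that `del` only removes a name; an object that is part of a reference cycle stays alive until Python's cycle collector runs. The following pure-Python sketch illustrates that mechanism with stand-in classes (`FakeTrainer` and `FakeOptimizerState` are hypothetical, and whether the real `Trainer` sits in such a cycle is an assumption):

```python
import gc
import weakref


class FakeOptimizerState:
    """Stand-in for a large object owned by a trainer."""


class FakeTrainer:
    def __init__(self):
        self.state = FakeOptimizerState()
        self.owner = self  # reference cycle: the object points at itself


gc.disable()  # keep the automatic collector out of the way for the demo
trainer = FakeTrainer()
probe = weakref.ref(trainer.state)  # lets us observe the state without owning it

del trainer
alive_after_del = probe() is not None  # the cycle keeps everything alive

gc.collect()
alive_after_gc = probe() is not None   # the cycle collector has reclaimed it
gc.enable()

print(alive_after_del, alive_after_gc)
```

In the notebook, the analogous step would be calling `gc.collect()` and then `torch.cuda.empty_cache()` before measuring again. The weights themselves stay on the GPU for as long as `medium_model` still refers to them.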

## 3. Quantization of the Medium Model

We will now load the medium model from disk, but quantized to 4 bits. Pay attention to the memory consumption.

In [None]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

In [None]:
loaded_medium_model_4_bit = AutoModelForSequenceClassification.from_pretrained(
    "./medium_model_base",
    num_labels=2,
    #load_in_4bit=True
    quantization_config=quantization_config,
)

loaded_medium_model_4_bit(**tokenizer('this is fun', return_tensors='pt').to('cuda'))['logits']

tensor([[-0.8657, -1.9727]], device='cuda:0', dtype=torch.float16,
       grad_fn=<IndexBackward0>)

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 1.6750645637512207


How much memory is used now? Is it as expected?
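
To form an expectation, the sketch below counts the weights of the four `Conv1D` projections per block (shapes taken from the model printout above) and converts them to GiB at different bit widths. It assumes that only these projection matrices are quantized and ignores embeddings, layer norms, biases and quantization constants, so it is a lower bound rather than a prediction.

```python
HIDDEN = 1024
N_BLOCKS = 24

# (in_features, out_features) of the Conv1D projections in each GPT2Block.
projections = {
    "attn.c_attn": (HIDDEN, 3 * HIDDEN),
    "attn.c_proj": (HIDDEN, HIDDEN),
    "mlp.c_fc": (HIDDEN, 4 * HIDDEN),
    "mlp.c_proj": (4 * HIDDEN, HIDDEN),
}
weights_per_block = sum(n_in * n_out for n_in, n_out in projections.values())
block_weights = N_BLOCKS * weights_per_block


def gib(n_values: int, bits: int) -> float:
    """Storage for n_values numbers at the given bit width, in GiB."""
    return n_values * bits / 8 / 1024**3


for bits in (32, 8, 4):
    print(f"{bits:>2}-bit projection weights: {gib(block_weights, bits):.3f} GiB")
```

Keep in mind that the reading above reports everything currently allocated on the GPU, not only the newly loaded model, so the two numbers are not directly comparable.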

## 4. LoRA: Fine-tuning with Few Parameters I

We'll create a second GPT2-medium model which we will use for (some of) our PEFT trainings. This base model will **not** change, as we will only train adapters.


In [None]:
peft_base_model = AutoModelForSequenceClassification.from_pretrained("gpt2-medium", num_labels=2)
peft_base_model.config.pad_token_id = peft_base_model.config.eos_token_id

peft_base_model.to('cuda')
peft_base_model(**tokenizer('this is fun', return_tensors='pt').to('cuda'))['logits']


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2-medium and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([[ 1.9292, -7.6093]], device='cuda:0', grad_fn=<IndexBackward0>)

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 3.028148651123047


Is the memory increase as expected?

Next, we need to define the LoRA configuration, particularly *r*:

In [None]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=10,
    lora_alpha=200,
    lora_dropout=0.1
)

This configuration is used to create a PEFT model, which is defined through a base model and the configuration:

In [None]:
peft_lora_model = get_peft_model(peft_base_model, lora_config)



How many trainable parameters do we have?

In [None]:
peft_lora_model.print_trainable_parameters()

trainable params: 985,088 || all params: 355,810,304 || trainable%: 0.2769
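
The trainable count can be reproduced by hand. The sketch assumes that the adapter is attached only to the `c_attn` projection of every block and that the classification head `score` is trained as well; both are assumptions about PEFT's defaults, chosen because they reproduce the reported number.

```python
r = 10         # LoRA rank from the config above
hidden = 1024  # hidden size of GPT2-medium
n_blocks = 24

# c_attn maps 1024 -> 3072, see Conv1D(nf=3072, nx=1024) in the model printout.
in_features, out_features = hidden, 3 * hidden

# LoRA adds two low-rank matrices per adapted layer: A (r x in) and B (out x r).
lora_per_block = r * in_features + out_features * r
lora_total = n_blocks * lora_per_block

# Classification head: Linear(1024 -> 2, bias=False).
score_head = hidden * 2

trainable = lora_total + score_head
print(f"{trainable:,}")
```

The count grows linearly with `r`, which is why the rank is the main knob for the adapter size.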


In [None]:
args = TrainingArguments(
    f"lora_{base_model_name}-finetuned-{task}",
    eval_strategy = "steps",
    eval_steps = 100,
    save_strategy = "no",
    logging_strategy = "steps",
    logging_steps = 100,
    learning_rate=1.2e-4,   # set higher than for base model!
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    max_steps=300,
    weight_decay=0.01,
    load_best_model_at_end=False,
    metric_for_best_model=metric_name,
)


peft_lora_trainer = Trainer(
    peft_lora_model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
peft_lora_trainer.train()



| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 100 | 1.1433 | 0.907503 | 0.521789 |
| 200 | 0.6909 | 0.560754 | 0.691514 |
| 300 | 0.458 | 0.336548 | 0.872706 |


TrainOutput(global_step=300, training_loss=0.7640715535481771, metrics={'train_runtime': 98.5534, 'train_samples_per_second': 48.705, 'train_steps_per_second': 3.044, 'total_flos': 292389517393920.0, 'train_loss': 0.7640715535481771, 'epoch': 0.07125890736342043})

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 3.0391578674316406


What about the memory consumption? Is it as expected?



Great. Now what about saving and loading the LoRA (only!) parameters?

In [None]:
peft_model_path = "./my_lora_model"

peft_lora_trainer.model.save_pretrained(peft_model_path)

Let's look at the size of the saved adapter:

In [None]:
!ls -al my_lora_model/

total 3876
drwxr-xr-x 2 root root    4096 May 30 19:24 .
drwxr-xr-x 1 root root    4096 May 30 19:24 ..
-rw-r--r-- 1 root root     781 May 30 19:24 adapter_config.json
-rw-r--r-- 1 root root 3946624 May 30 19:24 adapter_model.safetensors
-rw-r--r-- 1 root root    5085 May 30 19:24 README.md


Good. Now we will load and use the LoRA model... twice (to look at the incremental memory consumptions). Note that loading the model requires the original base model and the adapters.

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 3.0391578674316406


In [None]:
loaded_peft_model_1 = PeftModel.from_pretrained(peft_base_model,
                                        peft_model_path,
                                        is_trainable=False)

loaded_peft_model_1(**tokenizer('this is fun', return_tensors='pt').to('cuda'))['logits']



tensor([[-0.5945, -3.6751]], device='cuda:0', grad_fn=<IndexBackward0>)

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 3.042832851409912


Look at the memory consumption.


Good. Essentially no change. Why? What if instead we loaded another base model?
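
A quick check of the two readings suggests an answer: the increase is about the size of one set of adapter weights, because the base model object is reused rather than loaded again. The sketch assumes that the adapter is held in fp32.

```python
before_gib = 3.0391578674316406  # reading before loading the adapter
after_gib = 3.042832851409912    # reading after loading the adapter
delta_mib = (after_gib - before_gib) * 1024

adapter_params = 985_088                    # trainable parameters reported for the LoRA model
adapter_mib = adapter_params * 4 / 1024**2  # assumption: 4 bytes per parameter

print(f"measured increase: {delta_mib:.2f} MiB")
print(f"adapter weights:   {adapter_mib:.2f} MiB")
```

Loading a second copy of the base model instead would add roughly the full weight footprint of GPT2-medium again.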

## 5. Soft Prompt Tuning

We will now repeat the procedure, but this time we tune the parameters of a Soft Prompt ('virtual tokens'):

In [None]:
peft_prompt_config = PromptEncoderConfig(task_type=TaskType.SEQ_CLS,
                                         num_virtual_tokens=10,
                                         encoder_hidden_size=384)

In [None]:
peft_prompt_model = get_peft_model(peft_base_model, peft_prompt_config)

peft_prompt_model.to('cuda')
peft_prompt_model(**tokenizer('this is fun', return_tensors='pt').to('cuda'))['logits']

GPT2ForSequenceClassification will not detect padding tokens in `inputs_embeds`. Results may be unexpected if using padding tokens in conjunction with `inputs_embeds.`


tensor([[ 0.3937, -0.7077]], device='cuda:0', grad_fn=<IndexBackward0>)

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 3.073418617248535


Is the memory increase as expected?

In [None]:
peft_prompt_model.print_trainable_parameters()

trainable params: 945,920 || all params: 356,756,224 || trainable%: 0.2651
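
This number can also be reconstructed. The sketch assumes that the prompt encoder consists of an embedding table for the virtual tokens followed by a three-layer MLP (1024 to 384, 384 to 384, 384 to 1024, each with a bias). That layout is an assumption about PEFT's default reparameterization, chosen because it reproduces the reported count.

```python
num_virtual_tokens = 10
token_dim = 1024      # hidden size of GPT2-medium
encoder_hidden = 384  # encoder_hidden_size from the config above


def linear(n_in: int, n_out: int) -> int:
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


embedding = num_virtual_tokens * token_dim
mlp = (
    linear(token_dim, encoder_hidden)
    + linear(encoder_hidden, encoder_hidden)
    + linear(encoder_hidden, token_dim)
)

trainable = embedding + mlp
print(f"{trainable:,}")
```

Under this assumption almost all trainable parameters sit in the MLP, while the virtual token embeddings themselves account for only 10,240 values.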


In [None]:
args = TrainingArguments(
    f"prompt_{base_model_name}-finetuned-{task}",
    eval_strategy = "steps",
    eval_steps = 100,
    logging_strategy = "steps",
    logging_steps = 100,
    save_strategy = "no",   # no saving of checkpoints
    learning_rate=8e-5,   # set higher than for base model!
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    max_steps=1000,
    weight_decay=0.01,
    load_best_model_at_end=False,
    metric_for_best_model=metric_name,
)

In [None]:
peft_prompt_trainer = Trainer(
    peft_prompt_model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Let's train!

In [None]:
# annoying lines will show up nevertheless. What could they mean? Let's discuss...

peft_prompt_trainer.train()

| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 100 | 0.8459 | 0.750252 | 0.563073 |
| 200 | 0.6385 | 0.609305 | 0.647936 |
| 300 | 0.5656 | 0.548404 | 0.733945 |
| 400 | 0.5209 | 0.436197 | 0.806193 |
| 500 | 0.4938 | 0.481236 | 0.797018 |
| 600 | 0.5008 | 0.409961 | 0.826835 |
| 700 | 0.4558 | 0.416095 | 0.834862 |
| 800 | 0.479 | 0.377737 | 0.861239 |
| 900 | 0.434 | 0.359722 | 0.865826 |
| 1000 | 0.4206 | 0.371403 | 0.858945 |


TrainOutput(global_step=1000, training_loss=0.5354920806884765, metrics={'train_runtime': 413.699, 'train_samples_per_second': 38.675, 'train_steps_per_second': 2.417, 'total_flos': 990490599751680.0, 'train_loss': 0.5354920806884765, 'epoch': 0.2375296912114014})

What about the size upon saving?

In [None]:
peft_prompt_model_path = "./my_prompt_model"

peft_prompt_trainer.model.save_pretrained(peft_prompt_model_path)

In [None]:
!ls -al my_prompt_model/

total 72
drwxr-xr-x 2 root root  4096 May 30 19:32 .
drwxr-xr-x 1 root root  4096 May 30 19:32 ..
-rw-r--r-- 1 root root   429 May 30 19:32 adapter_config.json
-rw-r--r-- 1 root root 49360 May 30 19:32 adapter_model.safetensors
-rw-r--r-- 1 root root  5085 May 30 19:32 README.md


In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 3.0804662704467773


## 6. QLoRA

Now let's use QLoRA to fine-tune a model that is quantized down to 4 bits. We first need to specify the bitsandbytes configuration, then the LoRA adapter, and then we'll train as before. This time, however, we will use the XL model with 1.5bn parameters. Fully fine-tuning it would **not** fit into the memory of our T4 GPU. Will it work with QLoRA? And how good will the results be?
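
A rough estimate shows why the XL model is out of reach for full fine-tuning on this GPU. The sketch assumes fp32 training with an Adam-style optimizer (weights, gradients and two moment buffers) and treats the T4's nominal 16 GB as 16 GiB; activations, adapters and quantization constants are not counted.

```python
GIB = 1024**3
n_params_xl = 1_500_000_000  # "1.5bn" as quoted above
t4_memory_gib = 16           # assumption: nominal T4 memory, treated as GiB

# Full fine-tuning: 4 bytes each for weight, gradient and two Adam moments.
full_ft_gib = n_params_xl * 4 * 4 / GIB
# 4-bit base weights: half a byte per parameter.
four_bit_gib = n_params_xl * 0.5 / GIB

print(f"full fine-tuning: {full_ft_gib:.1f} GiB, fits: {full_ft_gib < t4_memory_gib}")
print(f"4-bit weights:    {four_bit_gib:.1f} GiB, fits: {four_bit_gib < t4_memory_gib}")
```

The 4-bit figure covers only the frozen base weights; the memory actually used during QLoRA training also includes the adapter, its optimizer state and the activations.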

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
qlora_model = AutoModelForSequenceClassification.from_pretrained("gpt2-XL", quantization_config=bnb_config, device_map={"":0})

qlora_model.config.pad_token_id = qlora_model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2-XL and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
qlora_model(**tokenizer('this is fun', return_tensors='pt').to('cuda'))['logits']

tensor([[0.3022, 0.3508]], device='cuda:0', dtype=torch.float16,
       grad_fn=<IndexBackward0>)

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 4.000992298126221


We need to make a few more adjustments:

In [None]:
qlora_model.gradient_checkpointing_enable()
qlora_model = prepare_model_for_kbit_training(qlora_model)


In [None]:
config = LoraConfig(
    r=10,
    lora_alpha=32,
    #target_modules=["query_key_value"],
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)

qlora_model = get_peft_model(qlora_model, config)
qlora_model.print_trainable_parameters()

trainable params: 3,075,200 || all params: 1,560,689,600 || trainable%: 0.1970


In [None]:
#qlora_model.to('cuda')
qlora_model(**tokenizer('this is fun', return_tensors='pt').to('cuda'))['logits']

tensor([[0.3049, 0.3522]], device='cuda:0', grad_fn=<IndexBackward0>)

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 4.187507629394531


In [None]:
args = TrainingArguments(
    f"qlora_gpt2-XL-finetuned-{task}",
    eval_strategy = "steps",
    eval_steps = 100,
    save_strategy = "no",
    logging_strategy = "steps",
    logging_steps = 100,
    learning_rate=1.2e-4,   # set higher than for base model!
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    max_steps=300,
    weight_decay=0.01,
    load_best_model_at_end=False,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

qlora_trainer = Trainer(
    qlora_model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
qlora_trainer.train()

A ConfigError was raised whilst setting the number of model parameters in Weights & Biases config.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 100 | 0.5805 | 0.28958 | 0.891055 |
| 200 | 0.294 | 0.228402 | 0.920872 |
| 300 | 0.2242 | 0.244972 | 0.923165 |


TrainOutput(global_step=300, training_loss=0.36624647776285807, metrics={'train_runtime': 925.4561, 'train_samples_per_second': 5.187, 'train_steps_per_second': 0.324, 'total_flos': 1425456276480000.0, 'train_loss': 0.36624647776285807, 'epoch': 0.07125890736342043})

In [None]:
peft_qlora_model_path = "./my_qlora_model"

qlora_trainer.model.save_pretrained(peft_qlora_model_path)

In [None]:
!ls -al my_qlora_model/

total 12048
drwxr-xr-x 2 root root     4096 May 30 19:51 .
drwxr-xr-x 1 root root     4096 May 30 19:51 ..
-rw-r--r-- 1 root root      777 May 30 19:51 adapter_config.json
-rw-r--r-- 1 root root 12313312 May 30 19:51 adapter_model.safetensors
-rw-r--r-- 1 root root     5081 May 30 19:51 README.md


In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 4.210419654846191


In [None]:
del qlora_trainer

In [None]:
show_currently_allocated_gpu_mem()

Current GPU memory allocation (GB): 4.210419654846191


That's it! We hope that this gives you a good overview of all of the various approaches that allow you to deploy and train a model in much more efficient ways compared to 'training all parameters at full precision'.