# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX that introduces 4-bit quantization techniques with no performance degradation, for democratizing LLM inference and training.

In this notebook, we will learn together how to load a large model in 4-bit (`gpt-neo-x-20b` in the original walkthrough, `EleutherAI/polyglot-ko-12.8b` in the cells below) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

In the [general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [None]:
!nvidia-smi

Mon Sep 18 01:01:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


First, let's set up the model we are going to use. Note that GPT-NeoX-20B is around 40GB in half precision; the cells below mount Google Drive, load the dataset, and then load `EleutherAI/polyglot-ko-12.8b` in 4-bit.
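
As a rough back-of-the-envelope check of the 40GB figure, weight memory is approximately the parameter count times the bits stored per parameter. The sketch below is only an estimate under that assumption: it ignores quantization constants, activations, optimizer state, and the CUDA context, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9

# 20B parameters, the nominal size of GPT-NeoX-20B
half_precision_gb = weight_memory_gb(20e9, 16)
four_bit_gb = weight_memory_gb(20e9, 4)
print(half_precision_gb, four_bit_gb)
```

Under these assumptions, 4-bit storage needs roughly a quarter of the half-precision footprint, which is what makes a single 16GB Colab GPU a plausible target.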

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from datasets import load_dataset

lovenpiece = '/content/drive/MyDrive/NIKL_SC_2023/nikluge-sc-2023-train.jsonl'
lovenpiece_test = '/content/drive/MyDrive/NIKL_SC_2023/nikluge-sc-2023-test.jsonl'
lovenpiece_val = '/content/drive/MyDrive/NIKL_SC_2023/nikluge-sc-2023-dev.jsonl'
train_dataset = load_dataset("json", data_files=lovenpiece)
test_dataset = load_dataset("json", data_files=lovenpiece_test)
val_dataset = load_dataset("json", data_files=lovenpiece_val)


In [None]:
for i in train_dataset['train']:
  print(i)
  break

{'id': 'nikluge-2023-sc-train-000001', 'input': {'sentence1': '시은이는 다음 주의 여름 휴가 이전에 기분을 전환하고 싶었다.', 'sentence3': '예쁘게 꾸민 손톱을 보며 여행을 갈 생각에 한층 더 들떴다.'}, 'output': '그래서 네일샵에 가서 예쁘게 손톱을 칠했다.'}



In [None]:
sent=[]

In [None]:
from datasets import Dataset, concatenate_datasets
combined_dataset = concatenate_datasets([train_dataset['train'], val_dataset['train']])
combined_dataset

Dataset({
    features: ['id', 'input', 'output'],
    num_rows: 135157
})

In [None]:
import datasets

In [None]:
from datasets import Dataset, DatasetDict, ClassLabel
all_dataset = DatasetDict({'train': combined_dataset})

In [None]:
all_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'input', 'output'],
        num_rows: 135157
    })
})

In [None]:
true_keys=[]

In [None]:
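def findsent(datasets):
  # Collect the indices of rows whose "output" value equals True.
  # The column is a list, so iterate with enumerate() rather than .items(),
  # and return only after the whole column has been scanned.
  for key, value in enumerate(datasets['train']["output"]):
      if value == True:
          true_keys.append(key)
  return true_keys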

In [None]:
all_dataset['train']['output'][2]

'그래서 우진이가 다니는 독서실도 침수되었을까 봐 걱정됐다.'

In [None]:
all_dataset['train']['output'][0]

'그래서 네일샵에 가서 예쁘게 손톱을 칠했다.'

In [None]:
# # data  (labels in this template: 명령어 = instruction, 맥락 = context, 답변 = answer)
# data = data.map(
#     lambda x:
#     {'text': f"### 명령어: {x['instruction']}\n\n###맥락: {x['input']}\n\n### 답변: {x['output']}<|endoftext|>" }
#     if x['input'] else

#     {'text':f"### 명령어: {x['instruction']}\n\n### 답변: {x['output']}<|endoftext|>"},
# )

# data
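# English gloss of the Korean prompt: "You are a Korean language teacher. Given the first and third
# sentences, supply the second sentence that goes between them so that the text flows naturally."
# Labels: 첫번째 문장 = first sentence, 세번째 문장 = third sentence, 두번째 문장 = second sentence; </끝> = end marker.
data = all_dataset.map(
    lambda x: {'text': f"### 당신은 한국의 국어 교사로서 첫번째 문장과 세번째 문장이 주어지면 사이에 들어갈 두번째 문장을 맞추는 역할을 합니다. 첫번째 문장과 세번째 문장을 모두 고려하여 글의 문맥을 판단해서 첫번째 문장과 세번째 문장이 자연스럽게 이어지도록 결론을 내리시오. 첫번째 문장: {x['input']['sentence1']}\n\n### 세번째 문장: {x['input']['sentence3']}\n\n### 두번째 문장: {x['output']}</끝>"}
)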


In [None]:
data['train']['text'][0]

'### 당신은 한국의 국어 교사로서 첫번째 문장과 세번째 문장이 주어지면 사이에 들어갈 두번째 문장을 맞추는 역할을 합니다. 첫번째 문장과 세번째 문장을 모두 고려하여 글의 문맥을 판단해서 첫번째 문장과 세번째 문장이 자연스럽게 이어지도록 결론을 내리시오. 첫번째 문장: 시은이는 다음 주의 여름 휴가 이전에 기분을 전환하고 싶었다.\n\n### 세번째 문장: 예쁘게 꾸민 손톱을 보며 여행을 갈 생각에 한층 더 들떴다.\n\n### 두번째 문장: 그래서 네일샵에 가서 예쁘게 손톱을 칠했다.</끝>'
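
The training text above and the prompt used later at generation time need to share exactly the same prefix, otherwise the model sees a format at inference that it never saw during fine-tuning. A minimal sketch of keeping both in one place follows. The names `build_prompt`, `INSTRUCTION`, and `END_MARKER` are hypothetical, the long instruction is shortened to a placeholder, and the sample record is made up for illustration.

```python
INSTRUCTION = "### (instruction text) "  # placeholder for the long instruction used above
END_MARKER = "</끝>"                      # end-of-answer marker used in the training text

def build_prompt(record, with_answer):
    """Build the training text (with_answer=True) or the inference prompt (False)."""
    prompt = (
        f"{INSTRUCTION}첫번째 문장: {record['input']['sentence1']}\n\n"
        f"### 세번째 문장: {record['input']['sentence3']}\n\n"
        f"### 두번째 문장:"
    )
    if with_answer:
        # Training text: prompt, a space, the answer, then the end marker.
        return f"{prompt} {record['output']}{END_MARKER}"
    return prompt

# Hypothetical record with the same fields as the dataset rows shown above.
record = {"input": {"sentence1": "A", "sentence3": "C"}, "output": "B"}
train_text = build_prompt(record, with_answer=True)
infer_prompt = build_prompt(record, with_answer=False)
print(train_text.startswith(infer_prompt))
```

Generating both strings from one function removes the risk of the two templates drifting apart when one of them is edited.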

In [None]:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-12.8b")


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/polyglot-ko-12.8b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})


Then we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT. The next cell first tokenizes the prompt text; the model preparation follows right after it.

In [None]:
data = data.map(lambda samples: tokenizer(samples["text"]), batched=True)


In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 52428800 || all params: 6654576640 || trainable%: 0.7878607886917356
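
The trainable-parameter count printed above can be reproduced by hand. LoRA adds two matrices to each adapted linear layer, `A` of shape `r x in_features` and `B` of shape `out_features x r`, so each layer contributes `r * (in_features + out_features)` parameters. The sketch below assumes a hidden size of 5120 and 40 transformer layers for this model, and that `query_key_value` maps the hidden size to three times the hidden size. These values are assumptions chosen because they reproduce the printed number, not values read from the model config here.

```python
def lora_param_count(in_features, out_features, r, num_layers):
    """Parameters added by LoRA when one linear layer per block is adapted."""
    per_layer = r * (in_features + out_features)  # A: r x in, B: out x r
    return per_layer * num_layers

hidden = 5120     # assumed hidden size
num_layers = 40   # assumed number of transformer blocks
trainable = lora_param_count(hidden, 3 * hidden, r=64, num_layers=num_layers)

all_params = 6654576640  # "all params" value taken from the output above
print(trainable, 100 * trainable / all_params)
```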


The dataset for this run, the `nikluge-sc-2023` files loaded above, has already been tokenized, so what remains is to fine-tune the model on it.

Run the cell below to start training! For the sake of the demo, we ran it for only a few steps, just to showcase how to use this integration with existing tools in the HF ecosystem.

In [None]:
tokenizer.decode(2)

'<|endoftext|>'

In [None]:
tokenizer.encode('</끝>')

[422, 5568, 33]

In [None]:
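# Search the vocabulary for a single token that decodes to '<끝>'.
# Bound the scan by the vocabulary size instead of an arbitrary 1,000,000.
for i in range(len(tokenizer)):
  if tokenizer.decode([i])=='<끝>':
    print(i)
    break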

In [None]:
tokenizer.decode([6, 6, 6, 7990,29, 16922, 2])

'### 문장: 독성<|endoftext|>'

In [None]:
tokenizer.decode([6, 6, 6, 5716,29, 16922, 2])

'### 유형: 독성<|endoftext|>'

In [None]:
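# Id of the token that decodes to '문장' ("sentence") in the checks above;
# the trainer below uses its last occurrence to find where the answer starts.
tokenNum = 7990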

In [None]:
# import transformers
# from transformers import Trainer

# class maskTrainer(Trainer):
#   def __init__(self, *args, **kwargs):
#     super().__init__(*args, **kwargs)

#   def compute_loss(self, model, inputs, return_outputs=False):


#     maskIndex = torch.eq(inputs['input_ids'], tokenNum).nonzero()

#     for x in range(len(inputs['labels'])):
#       inputs['labels'][x][:maskIndex[:, 1][x]+2] = -100

#     outputs = model(**inputs)

#     loss = outputs['loss']

#     return (loss,outputs) if return_outputs else loss

In [None]:
# Modifying the compute_loss function to consider only the last occurrence of tokenNum
import transformers
from transformers import Trainer

class maskTrainerLastOccurrence(Trainer):
    def __init__(self, tokenNum, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenNum = tokenNum

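    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer Trainer versions may pass.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):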
        maskIndices = torch.eq(inputs['input_ids'], self.tokenNum).nonzero()

        for x in range(len(inputs['labels'])):
            # Find the last occurrence of tokenNum for the current input
            last_occurrences = [idx[1] for idx in maskIndices if idx[0] == x]
            if last_occurrences:
                last_occurrence = max(last_occurrences)
                inputs['labels'][x][:last_occurrence+2] = -100

        outputs = model(**inputs)
        loss = outputs['loss']

        return (loss, outputs) if return_outputs else loss

# Note: This code is written for illustrative purposes and may need adjustments based on the actual use case and data structure.
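
To make the masking rule in `compute_loss` concrete, here is a minimal pure-Python sketch of the same logic on a single sequence. Every label up to and including the last marker token plus the token after it (`offset=2`, covering the marker and the following `:`) is set to `-100`, the label value that the loss ignores. The function name `mask_prompt_labels` and the ids `101` and `102` are hypothetical; `6`, `7990`, `29`, and `2` follow the decode checks above.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_prompt_labels(input_ids, labels, marker_id, offset=2):
    """Return a copy of labels with everything before the answer masked out."""
    masked = list(labels)
    positions = [i for i, tok in enumerate(input_ids) if tok == marker_id]
    if positions:
        cutoff = positions[-1] + offset  # last marker, plus the ':' after it
        for i in range(min(cutoff, len(masked))):
            masked[i] = IGNORE_INDEX
    return masked

# '### 문장:' followed by two made-up answer tokens and the EOS id 2
ids = [6, 6, 6, 7990, 29, 101, 102, 2]
print(mask_prompt_labels(ids, ids, marker_id=7990))
# [-100, -100, -100, -100, -100, 101, 102, 2]
```

With this masking, only the answer tokens and the end-of-sequence token contribute to the loss, so the model is not trained to reproduce the prompt itself.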

In [None]:


# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

trainer = maskTrainerLastOccurrence(
    tokenNum = tokenNum,
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=16,
        gradient_accumulation_steps=1,
        # warmup_steps=200,
        #max_steps=3000, ## tiny run only: 10 steps = only 20 samples trained.
        fp16=True,
        output_dir="outputs",
        logging_steps=30,
        num_train_epochs = 3,
        learning_rate=4e-5,


        lr_scheduler_type= "cosine",
        #optim="paged_adamw_8bit"

    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
# trainer = transformers.Trainer(
#     model=model,
#     train_dataset=data["train"],
#     args=transformers.TrainingArguments(
#         per_device_train_batch_size=64,
#         gradient_accumulation_steps=1,
#         # warmup_steps=200,
#         #max_steps=3000, ## tiny run only: 10 steps = only 20 samples trained.
#         fp16=True,
#         output_dir="outputs",
#         logging_steps=30,
#         num_train_epochs = 3,
#         learning_rate=1.4e-3,


#         lr_scheduler_type= "cosine",
#         #optim="paged_adamw_8bit"

#     ),
#     data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
# )
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Step,Training Loss
30,1.1147
60,1.0773


KeyboardInterrupt: ignored

In [None]:
model.save_pretrained('./output')

In [None]:
model.eval()
model.config.use_cache = True  # re-enable the cache for inference (it was disabled for training)

In [None]:
test_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'input', 'output'],
        num_rows: 15018
    })
})

In [None]:
import torch

In [None]:
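from transformers import StoppingCriteria, StoppingCriteriaList

class StoppingCriteriaSub(StoppingCriteria):
    """Stop generation once the sequence ends with any of the given token-id sequences."""

    def __init__(self, stops=None, encounters=1):
        super().__init__()
        # Avoid a mutable default argument; `encounters` is kept for compatibility but unused.
        self.stops = list(stops) if stops is not None else []

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs):
        for stop in self.stops:
            stop = stop.to(input_ids.device)  # compare on the same device as the generated ids
            tail = input_ids[0][-len(stop):]
            # Only compare when the sequence is at least as long as the stop sequence.
            if len(tail) == len(stop) and torch.all(stop == tail).item():
                return True

        return False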

stop_words = ["</끝>"]
stop_words_ids = [tokenizer(stop_word, return_tensors='pt')['input_ids'].squeeze() for stop_word in stop_words]
stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
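
The stopping rule above boils down to a suffix check: generation stops as soon as the generated ids end with the ids of a stop word. Because `</끝>` encodes to three tokens (`[422, 5568, 33]` in the output earlier), a single `eos_token_id` cannot express it. A minimal pure-Python sketch of the check, with the hypothetical helper name `ends_with_stop`:

```python
def ends_with_stop(generated_ids, stop_sequences):
    """True if generated_ids ends with any of the stop sequences."""
    for stop in stop_sequences:
        if stop and len(generated_ids) >= len(stop) and generated_ids[-len(stop):] == stop:
            return True
    return False

END_IDS = [422, 5568, 33]  # ids of '</끝>' as printed by tokenizer.encode above

print(ends_with_stop([10, 11, 422, 5568, 33], [END_IDS]))  # marker is the suffix
print(ends_with_stop([422, 5568, 33, 7], [END_IDS]))       # marker present, but not at the end
```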

In [None]:
def gen_modified(x):
    # The prompt must match the training template exactly and stop right after the
    # "### 두번째 문장:" (second sentence) label, so the model fills in the answer.
    prompt = (f"### 당신은 한국의 국어 교사로서 첫번째 문장과 세번째 문장이 주어지면 사이에 들어갈 두번째 문장을 맞추는 역할을 합니다. "
              f"첫번째 문장과 세번째 문장을 모두 고려하여 글의 문맥을 판단해서 첫번째 문장과 세번째 문장이 자연스럽게 이어지도록 결론을 내리시오. "
              f"첫번째 문장: {x['input']['sentence1']}\n\n### 세번째 문장: {x['input']['sentence3']}\n\n### 두번째 문장:")

    gened = model.generate(
        **tokenizer(
            prompt,
            return_tensors='pt',
            return_token_type_ids=False
        ),
        max_new_tokens=30,
        temperature=0.001,
        no_repeat_ngram_size=10,
        early_stopping=True,
        do_sample=True,
        eos_token_id=2,
        stopping_criteria=stopping_criteria
        )
    # Strip the echoed prompt (plus the following space) from the decoded text.
    return tokenizer.decode(gened[0]).replace(prompt+ " ", "")
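
The post-processing is split between `gen_modified`, which strips the prompt, and its callers, which strip `</끝>`. A minimal sketch of doing both in one place is shown below. The helper name `extract_answer` is hypothetical, and the fallback branch assumes the decoded text may not always reproduce the prompt character for character.

```python
def extract_answer(decoded, prompt, end_marker="</끝>"):
    """Return the generated answer without the echoed prompt or the end marker."""
    # Drop the echoed prompt only if the decoded text really starts with it.
    text = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    # Keep what precedes the first end marker, if one was generated.
    text = text.split(end_marker, 1)[0]
    return text.strip()

print(extract_answer("PROMPT: an answer</끝>trailing", "PROMPT:"))
```

Cutting at the first end marker also discards anything the model might emit after it, which a plain `.replace('</끝>', '')` would keep.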

In [None]:
#gen("내가 진짜 올 해 안에 차 산다!")  # "I'm really going to buy a car this year!"

In [None]:
gen_modified(test_dataset['train'][33]).replace('</끝>','')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'교사는 학생들의 학교 생활을 지도하는 데에 힘썼다.'

In [None]:
test_dataset['train']['input'][330]

{'sentence1': '교사는 학생들을 지도하여 올바른 학교 생활을 할 수 있도록 도왔다.',
 'sentence3': '이 같은 노력으로 교사는 표창장도 받게 되었다.'}

In [None]:
#gen(test_dataset['train'][677]['input'])

In [None]:
answerlist = []

for i in test_dataset['train']:

  answerlist.append(gen_modified(i).replace('</끝>',''))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


KeyboardInterrupt: ignored

In [None]:
# "극좌는 이 비겁자층을 제대로 요리할 줄 안다..."
# "내가 진짜 올 해 안에 차 산다!"
# "선거 때마다 불장난 하는 못된 버릇 대대손손 배워가지고 그러고 까불어대면, 너 나중에 뒷덜미에 혹난다???"

In [None]:
import json
dataset_dict = test_dataset['train'].to_dict()
new_dataset = []
# Pair each generated answer with the id and input of the test row at the same index.
for i, answer in enumerate(answerlist):
    new_dataset.append({
        'id': dataset_dict['id'][i],
        'input': dataset_dict['input'][i],
        'output': answer,
    })

NameError: ignored

In [None]:
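# Write one JSON object per line; set the encoding explicitly because
# ensure_ascii=False emits the Korean text as-is.
with open("submission_lovenpiece.json", "w", encoding="utf-8") as file:
    for item in new_dataset:
      json.dump(item, file, ensure_ascii=False)
      file.write('\n')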

In [None]:

file_path = "/content/drive/MyDrive/submission_classification.json"
with open(file_path, "w", encoding="utf-8") as file:
    for item in new_dataset:
      json.dump(item, file, ensure_ascii=False)
      file.write('\n')    # end each record with a newline
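
Both cells above write one JSON object per line. A quick way to confirm such a file is well-formed is to read it back and compare it with what was written. The sketch below does this on a temporary file using only the standard library; the helper names, the file name `submission_example.jsonl`, and the sample row are hypothetical.

```python
import json
import os
import tempfile

def write_jsonl(path, rows):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up row with the same fields as the submission records.
rows = [{"id": "example-1", "input": {"sentence1": "가", "sentence3": "나"}, "output": "다"}]
path = os.path.join(tempfile.mkdtemp(), "submission_example.jsonl")
write_jsonl(path, rows)
print(read_jsonl(path) == rows)
```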

In [None]:
answerlist

['서영이는 승무원에게 기차표를 보여주며 자리를 확인했다.',
 '그 모습을 본 다른 사람이 그에게 제발 그만하라고 소리쳤다.',
 '앵무새가 물을 마시다가 그 사람을 보고 물을 뿜었다.',
 '하지만 언니가 결혼을 하면서 서울로 이사를 갔다.',
 '그러다 그는 물에 빠져 허우적댔다.']