# Bllossom-AICA Instruction Tuning Tutorial (Text Only)

## Library Import

In [None]:
!pip install datasets bitsandbytes peft

In [None]:
import os
from transformers import MllamaForConditionalGeneration, AutoTokenizer
from transformers import Trainer, TrainingArguments
import torch
from torch.nn.utils.rnn import pad_sequence
import datasets
from peft import LoraConfig, get_peft_model

os.environ['CUDA_VISIBLE_DEVICES']="0"

In [None]:
!huggingface-cli login

## Model & Tokenizer Load

In [None]:
model_id = 'Bllossom/llama-3.2-Korean-Bllossom-AICA-5B'

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='cuda:0',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/5.22k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/84.7k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/835M [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.58G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.9k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]
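
As a rough sanity check on the download log above, the three shard sizes can be turned into an approximate parameter count. This is a back-of-the-envelope sketch, not something the notebook computes: it assumes every weight in the checkpoint is stored in a 16-bit format (2 bytes per parameter) and that the progress bars report decimal units.

```python
# Shard sizes read off the download progress bars (assumed to be decimal bytes).
shard_bytes = [4.98e9, 4.58e9, 835e6]

# Assumption: weights are stored in a 16-bit format such as bfloat16.
BYTES_PER_PARAM = 2

total_bytes = sum(shard_bytes)
approx_params = total_bytes / BYTES_PER_PARAM

print(f"checkpoint size : {total_bytes / 1e9:.1f} GB")
print(f"approx. params  : {approx_params / 1e9:.1f} B")
```

Under those assumptions the estimate is roughly 5.2 billion parameters, in line with the "5B" in the model name. The same total is a lower bound on the GPU memory needed just to hold the weights when loading with `torch_dtype=torch.bfloat16`.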

In [None]:
# Before training
sample_message = [
    {'role':'user', 'content': [{'type':'text', 'text':'자연어처리 교과목에 대해 간략히 소개해줘'}]},
    {'role':'assistant', 'content': [{'type':'text', 'text':'Natural Language Processing is a domain of the computer scicence and it is mostly working on corpora based on '}]},
    {'role':'user', 'content': [{'type':'text', 'text':'자연어처리 대표 논문은 뭐가있어?'}]},
]

inputs = tokenizer.apply_chat_template(sample_message,
                              tokenize=True,
                              add_generation_prompt=True,
                              return_tensors='pt').to(model.device)

In [None]:
tokenizer.decode(inputs[0])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 May 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n자연어처리 교과목에 대해 간략히 소개해줘<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nNatural Language Processing is a domain of the computer scicence and it is mostly working on corpora based on <|eot_id|><|start_header_id|>user<|end_header_id|>\n\n자연어처리 대표 논문은 뭐가있어?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
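
The decoded string shows the layout the chat template produces: each turn is wrapped in header tokens and closed with `<|eot_id|>`, and `add_generation_prompt=True` appends an open assistant header. The sketch below rebuilds that layout by hand to make the structure explicit. `render_llama3_prompt` is a hypothetical helper written for illustration, and it is simplified: the real template also injects the system block with the date lines seen above.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Simplified rendering of the turn layout seen in the decoded prompt.

    Each message is a dict with 'role' and 'text'. The system/date block that
    the real chat template adds is intentionally left out.
    """
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['text']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open assistant header: the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo_messages = [
    {"role": "user", "text": "Hello"},
    {"role": "assistant", "text": "Hi there"},
    {"role": "user", "text": "Name one NLP paper"},
]
demo_prompt = render_llama3_prompt(demo_messages)
print(demo_prompt)
```

For real inputs, `tokenizer.apply_chat_template` remains the right tool, since it applies the exact template shipped with the model.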

In [None]:
inputs

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,    975,   3297,    220,   2366,     20,    271, 128009, 128006,
            882, 128007,    271,  26799, 101347,  32179, 102657,  29102, 101999,
          54780,  88708,  19954, 112107, 105131, 112469, 101709, 124827,  34983,
          59269,    246, 128009, 128006,  78191, 128007,    271,  55381,  11688,
          29225,    374,    264,   8106,    315,    279,   6500,   1156,    292,
            768,    323,    433,    374,  10213,   3318,    389,   8533,     64,
           3196,    389,    220, 128009, 128006,    882, 128007,    271,  26799,
         101347,  32179, 102657,  29102, 116865, 110709,  52688,  34804, 113792,
          20565, 105625,  32179,     30, 128009, 128006,  78191, 128007,    271]],
       device='cuda:0')

In [None]:
output = model.generate(inputs,max_new_tokens=512,eos_token_id=tokenizer.convert_tokens_to_ids('<|eot_id|>'))

print(tokenizer.decode(output[0]))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 14 May 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

자연어처리 교과목에 대해 간략히 소개해줘<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Natural Language Processing is a domain of the computer scicence and it is mostly working on corpora based on <|eot_id|><|start_header_id|>user<|end_header_id|>

자연어처리 대표 논문은 뭐가있어?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

자연어처리(NLP) 대표 논문은 많은 연구자들이 발표한 논문들입니다. 여기 몇 가지 중요한 논문을 소개합니다:

1. **"Transformers Are Good Solvers for Machine Translation"** by Vaswani et al. (2017) - 구글 딥러인 연구팀에서 발표한 논문으로, Transformer 모델(특히 BERT, GPT 등)을 통해 기계번역 문제에서 뛰어난 성능을 보인다고 주장합니다.

2. **"Attention is All You Need"** by Vaswani et al. (2017) - 위의 논문과 같은 출처로, Attention 메커니즘이 뉴런 네트워크를 통해 정보를 다층적으로 처리하는 과정을 설명합니다. 이는 문장 집합을 집약적으로 다루는 새로운 방법을 제시합니다.

3. **"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"** by Devlin

## Dataset Load

In [None]:
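# Load the training split and keep only the first 100 samples.
# Equivalent shorthand: datasets.load_dataset('beomi/KoAlpaca-v1.1a', split='train[:100]')
# If loading fails with "NotImplementedError: Loading a dataset cached in a LocalFileSystem
# is not supported", the installed datasets and fsspec versions are likely incompatible;
# upgrading datasets and restarting the runtime usually resolves it.
train_dataset = datasets.load_dataset('beomi/KoAlpaca-v1.1a', split='train')
train_dataset = train_dataset.select(range(100))
train_dataset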


NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

In [None]:
train_dataset[0]

NameError: name 'train_dataset' is not defined

## Data Preprocessing

In [None]:
def preprocessing_data(examples):
    input_ids = []
    attention_masks = []
    labels = []

    for instruction,response in zip(examples['instruction'],examples['output']):
        message = [
        {'role': 'user', 'content': [
            {'type':'text', 'text':instruction}
        ]},
        ]
        inputs = tokenizer.apply_chat_template(message, tokenize=True, add_generation_prompt=True)
        label = tokenizer(response+'<|eot_id|>',add_special_tokens=False)['input_ids']
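        # Pad (or truncate) every sequence to a fixed length of 4096 tokens.
        input_id = (inputs + label + [tokenizer.pad_token_id] * 4096)[:4096]
        # Prompt and padding positions are labeled -100 so the loss ignores them.
        label_id = ([-100] * len(inputs) + label + [-100] * 4096)[:4096]
        # Attend only to the real (non-padding) tokens.
        n_real = min(len(inputs) + len(label), 4096)
        attention_mask = [1] * n_real + [0] * (4096 - n_real)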

        input_ids.append(input_id)
        attention_masks.append(attention_mask)
        labels.append(label_id)

    return {
        'input_ids': input_ids,
        'attention_mask': attention_masks,
        'labels': labels
    }
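
The masking convention used in the preprocessing function can be seen more easily on toy token ids. The sketch below is an illustration with made-up ids and a hypothetical helper name (`build_example`); it assumes the usual convention that label value `-100` is skipped by the cross-entropy loss, and in this sketch padding positions in `labels` are masked with `-100` as well, so only response tokens contribute to the loss.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def build_example(prompt_ids, response_ids, pad_id, max_len):
    """Return (input_ids, attention_mask, labels), padded or truncated to max_len."""
    tokens = (prompt_ids + response_ids)[:max_len]
    n_pad = max_len - len(tokens)

    input_ids = tokens + [pad_id] * n_pad
    attention_mask = [1] * len(tokens) + [0] * n_pad
    # No loss on the prompt, loss on the response, no loss on padding.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    labels = labels + [IGNORE_INDEX] * n_pad
    return input_ids, attention_mask, labels


toy_ids, toy_mask, toy_labels = build_example(
    prompt_ids=[11, 12, 13], response_ids=[21, 22], pad_id=0, max_len=8
)
print(toy_ids)     # [11, 12, 13, 21, 22, 0, 0, 0]
print(toy_mask)    # [1, 1, 1, 1, 1, 0, 0, 0]
print(toy_labels)  # [-100, -100, -100, 21, 22, -100, -100, -100]
```

Building the attention mask from lengths rather than by comparing against the pad id avoids masking a real token that happens to share the pad id.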

In [None]:
train_dataset = train_dataset.map(
    preprocessing_data,
    num_proc=2,
    batched=True,
    remove_columns=['instruction','output','url']
)

train_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 100
})
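
With `batched=True`, the mapped function receives a dict of columns (each a list of values) rather than a single row, and returns a dict of new columns. The sketch below emulates that calling convention in plain Python so it can be tried without the library. `map_batched` is a hypothetical stand-in written for illustration, not the `datasets` implementation, and the toy rows are made up.

```python
def map_batched(columns, fn, batch_size=1000, remove_columns=()):
    """Emulate the batched calling convention: fn gets {column: [values...]}."""
    n_rows = len(next(iter(columns.values())))
    out = {k: [] for k in columns if k not in remove_columns}
    for start in range(0, n_rows, batch_size):
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        result = fn(batch)
        # Carry over original columns that are kept and not overwritten.
        for k in list(out):
            if k not in result:
                out[k].extend(batch[k])
        # Append the columns produced by fn.
        for k, v in result.items():
            out.setdefault(k, []).extend(v)
    return out


def count_chars(examples):
    # Same access pattern as preprocessing_data: zip over whole columns.
    return {
        "n_chars": [
            len(i) + len(o)
            for i, o in zip(examples["instruction"], examples["output"])
        ]
    }


toy_columns = {
    "instruction": ["a", "bb", "ccc"],
    "output": ["x", "y", "z"],
    "url": ["u1", "u2", "u3"],
}
toy_result = map_batched(
    toy_columns, count_chars, batch_size=2,
    remove_columns=["instruction", "output", "url"],
)
print(toy_result)  # {'n_chars': [2, 3, 4]}
```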

In [None]:
train_dataset[0]["input_ids"][:10]

[128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25]

In [None]:
tokenizer.decode(train_dataset[0]["input_ids"][:100])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 13 Dec 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n양파는 어떤 식물 부위인가요? 그리고 고구마는 뿌리인가요?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n양파는 잎이 아닌 식물의 줄기 부분입니다. 고구마는 식물의 뿌리 부분입니다. \n\n식물의 부위의 구분에 대해 궁금해'

## Data Collator

In [None]:
def DataCollator(examples):
    input_ids = torch.LongTensor([example['input_ids'] for example in examples])
    attention_mask = torch.LongTensor([example['attention_mask'] for example in examples])
    labels = torch.LongTensor([example['labels'] for example in examples])

    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels
    }
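
The collator above relies on every example already having the same fixed length (4096 tokens from the preprocessing step), since `torch.LongTensor` needs a rectangular nested list. An alternative, which is not what this notebook does, is to store unpadded examples and pad each batch only to its longest member. The sketch below shows that idea on plain lists; `pad_to_longest` is a hypothetical helper and the ids are made up.

```python
def pad_to_longest(examples, pad_id, ignore_index=-100):
    """Pad a list of unpadded examples to the longest sequence in the batch."""
    max_len = max(len(e["input_ids"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_pad = max_len - len(e["input_ids"])
        batch["input_ids"].append(e["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append(e["attention_mask"] + [0] * n_pad)
        # Padding gets the ignore index so it never contributes to the loss.
        batch["labels"].append(e["labels"] + [ignore_index] * n_pad)
    return batch


toy_batch = pad_to_longest(
    [
        {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [-100, 6, 7]},
        {"input_ids": [8], "attention_mask": [1], "labels": [8]},
    ],
    pad_id=0,
)
print(toy_batch["input_ids"])  # [[5, 6, 7], [8, 0, 0]]
print(toy_batch["labels"])     # [[-100, 6, 7], [8, -100, -100]]
```

To use this with `Trainer`, each list would still need to be wrapped in `torch.LongTensor(...)` before being returned, as the collator above does.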


## Configuring TrainingArguments

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    remove_unused_columns=False,
    report_to="none",
    optim="adamw_bnb_8bit",
    gradient_checkpointing=True,
    num_train_epochs=1,
    logging_strategy='steps',
    logging_steps=10,
    label_names=['labels'],
    torch_compile=True,
)
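
These settings determine how many optimizer steps and log lines to expect. The arithmetic below is a rough sketch with a hypothetical helper (`optimizer_steps`); the gradient-accumulation and device-count parameters are not set in this notebook and default to 1 here, and `Trainer`'s exact rounding with gradient accumulation may differ between versions.

```python
import math


def optimizer_steps(num_samples, per_device_batch_size, num_epochs,
                    grad_accum_steps=1, num_devices=1):
    """Approximate number of optimizer steps for a full training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * num_epochs


# 100 samples, per_device_train_batch_size=1, num_train_epochs=1
total_steps = optimizer_steps(num_samples=100, per_device_batch_size=1, num_epochs=1)
log_lines = total_steps // 10  # logging_steps=10

print(total_steps)  # 100
print(log_lines)    # 10
```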


## Configuring Trainable Parameters

In [None]:
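# Collect the language-model weights to adapt
# (cross-attention, token embeddings, lm_head, and norm layers are skipped).
target_modules = []
for n, p in model.named_parameters():
    if 'language_model' in n and 'cross_attn' not in n and 'embed_tokens' not in n and 'lm_head' not in n and 'norm' not in n:
        target_modules.append(n.replace('.weight', ''))
    # For full tuning of the LLM instead of LoRA, freeze everything else:
    # else:
    #     p.requires_grad = False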

# LoRA Tuning
peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=4,  # LoRA rank; lowering it reduces GPU memory use
    lora_alpha=8,  # LoRA scaling factor (the update is scaled by lora_alpha / r)
    target_modules=target_modules
)

lora_model = get_peft_model(model,peft_config)
lora_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MllamaForConditionalGeneration(
      (vision_model): MllamaVisionModel(
        (patch_embedding): Conv2d(3, 1280, kernel_size=(14, 14), stride=(14, 14), padding=valid, bias=False)
        (gated_positional_embedding): MllamaPrecomputedPositionEmbedding(
          (tile_embedding): Embedding(9, 8197120)
        )
        (pre_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
          (embedding): Embedding(9, 5120)
        )
        (post_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
          (embedding): Embedding(9, 5120)
        )
        (layernorm_pre): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (layernorm_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (transformer): MllamaVisionEncoder(
          (layers): ModuleList(
            (0-31): 32 x MllamaVisionEncoderLayer(
              (self_attn): MllamaVisionSdpaAttention(
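
To see why the rank `r` drives memory use, it helps to count the parameters LoRA adds. For one adapted weight of shape `d_out x d_in`, LoRA trains two small matrices, `A` (`r x d_in`) and `B` (`d_out x r`), and scales their product by `lora_alpha / r` (the default PEFT scaling). The sketch below uses a hypothetical 4096 x 4096 layer purely for illustration; the real layer shapes of this model are not shown in the notebook.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A (r x d_in) + B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


D = 4096  # hypothetical hidden size, for illustration only

full_params = D * D
params_r4 = lora_param_count(D, D, r=4)
params_r16 = lora_param_count(D, D, r=16)

print(full_params)      # 16777216
print(params_r4)        # 32768
print(params_r16)       # 131072
print(lora_scaling(lora_alpha=8, r=4))  # 2.0
```

The trainable parameter count grows linearly with `r`, while `lora_alpha` only rescales the update and adds no parameters.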
               

## Train

In [None]:
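# Full fine-tuning of the LLM uses about 30 GB of VRAM.
# With LoRA, about 18.6 GB is used (rank 16; token embeddings and lm_head not trained).
# If you run out of memory, adjust settings such as the LoRA rank.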
trainer = Trainer(
        model=lora_model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=DataCollator,
    )

trainer.train()

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
10,10.7148
20,7.7257
30,6.3684
40,5.944
50,5.738
60,5.6909
70,5.5949
80,5.615
90,5.6393
100,5.6155


TrainOutput(global_step=100, training_loss=6.464637794494629, metrics={'train_runtime': 370.3354, 'train_samples_per_second': 0.27, 'train_steps_per_second': 0.27, 'total_flos': 1.16953090646016e+16, 'train_loss': 6.464637794494629, 'epoch': 1.0})
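
The numbers in `TrainOutput` can be cross-checked against the step log. The sketch below assumes each logged value is the mean loss over the preceding 10 steps, so the mean of the ten log lines should match the reported `training_loss`; throughput is recomputed from the step count, batch size, and runtime.

```python
# Values copied from the training log and TrainOutput above.
logged_losses = [10.7148, 7.7257, 6.3684, 5.944, 5.738,
                 5.6909, 5.5949, 5.615, 5.6393, 5.6155]
reported_training_loss = 6.464637794494629
train_runtime = 370.3354  # seconds
global_step = 100
batch_size = 1

mean_logged_loss = sum(logged_losses) / len(logged_losses)
samples_per_second = global_step * batch_size / train_runtime

print(f"mean of logged losses: {mean_logged_loss:.4f}")
print(f"reported training loss: {reported_training_loss:.4f}")
print(f"samples per second: {samples_per_second:.2f}")
```

The two loss figures agree to about four decimal places, and the throughput rounds to the reported 0.27 samples per second. The log also shows the loss flattening at around 5.6 from step 60 onward.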

## Inference Test

In [None]:
# After training
sample_message = [
    {'role':'user', 'content': [
        {'type':'text', 'text':'서울의 유명 관광지에 대해 소개해줘'}
        ]},
]

inputs = tokenizer.apply_chat_template(sample_message,
                              tokenize=True,
                              add_generation_prompt=True,
                              return_tensors='pt').to(model.device)


output = lora_model.generate(inputs,max_new_tokens=512,eos_token_id=tokenizer.convert_tokens_to_ids('<|eot_id|>'))

print(tokenizer.decode(output[0]))
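
`generate` returns the prompt tokens followed by the newly generated ones, so decoding `output[0]` prints the whole conversation again. To show only the model's answer, the prompt prefix can be sliced off first. The sketch below demonstrates the slicing on plain lists with made-up ids; `generated_only` is a hypothetical helper, and 128009 is the `<|eot_id|>` id seen in the token dump earlier.

```python
EOT_ID = 128009  # <|eot_id|>


def generated_only(output_ids, prompt_len, eos_id=EOT_ID):
    """Drop the prompt prefix and everything from the first EOS token onward."""
    new_tokens = list(output_ids[prompt_len:])
    if eos_id in new_tokens:
        new_tokens = new_tokens[:new_tokens.index(eos_id)]
    return new_tokens


toy_prompt = [128000, 101, 102]
toy_output = toy_prompt + [501, 502, 503, EOT_ID]

answer_ids = generated_only(toy_output, prompt_len=len(toy_prompt))
print(answer_ids)  # [501, 502, 503]
```

With the tensors from the cell above, the equivalent is `tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True)`.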