# NLP Assignment

## Submission requirements
1. **File name**: `nlp-homework.ipynb`
2. **Contents**:
   - Q1: Use the `meta-llama/Llama-3.2-1B-Instruct` model
   - Q2: Train on 2,800 samples and validate on 200 samples
   - Q3: Upload the LoRA adapter to the Hugging Face Hub
   - Q4: Load the LoRA adapter from the Hugging Face Hub
   - Q5: Compute the BLEU score on the validation set

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Environment setup

In [2]:
!pip install transformers accelerate datasets peft trl bitsandbytes wandb sacrebleu



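# Imports and paths

## Q1: Use the `meta-llama/Llama-3.2-1B-Instruct` model

### Approach
Instead of the original `gemma-2b`, this notebook uses the `meta-llama/Llama-3.2-1B-Instruct` model hosted on Hugging Face.

At roughly 1B parameters it is smaller than `gemma-2b`, which keeps GPU memory use low on Colab, and it is already instruction-tuned.

#### Key steps
1. Load `meta-llama/Llama-3.2-1B-Instruct` with the Hugging Face `transformers` library.
2. Set `torch_dtype` to `bfloat16` and set `device_map` to `"auto"` so the weights are placed on the available GPU.
3. Apply 8-bit quantization with `bitsandbytes` to reduce memory use.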
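To give a rough sense of what the precision choices in the steps above buy, the sketch below estimates the memory taken by the weights alone. The parameter count is an assumed round figure for a 1B-class model, not a value taken from this notebook, and the estimate ignores activations, LoRA weights, optimizer state, and any layers that `bitsandbytes` leaves unquantized.

```python
def weight_memory_gib(n_params: int, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


# Assumption: about 1.24e9 parameters, a rounded figure for a 1B-class model.
N_PARAMS = 1_240_000_000

bf16_gib = weight_memory_gib(N_PARAMS, 2)  # bfloat16 stores 2 bytes per weight
int8_gib = weight_memory_gib(N_PARAMS, 1)  # 8-bit storage uses 1 byte per weight

print(f"bf16 weights: {bf16_gib:.2f} GiB")
print(f"int8 weights: {int8_gib:.2f} GiB")
```

Under these assumptions, 8-bit storage roughly halves the weight footprint compared with `bfloat16`.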


In [3]:
import os
import re
import torch
from datasets import load_dataset, Dataset
from peft import LoraConfig, PeftModel
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    logging as hf_logging,
)
from trl import SFTTrainer, SFTConfig
from trl.trainer import ConstantLengthDataset
import logging

hf_logging.set_verbosity_info()

# Settings
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
device_map = "auto"
torch_dtype = torch.bfloat16
output_dir = "/content/drive/MyDrive/NLP_Class/output"
dataset_name = "/content/drive/MyDrive/NLP_Class/llm-modeling-lab.jsonl"
seq_length = 512

# Hugging Face / wandb setup

### hf

In [None]:
from huggingface_hub import login

login("")  # Hugging Face access token


### wandb


In [None]:
import wandb

wandb.login(key="")  # WandB API key


wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: floralee782 (floralee_782). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

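# Dataset

## Q2: Split the dataset into training (2,800) and validation (200) sets

### Approach
After loading the JSONL dataset, 2,800 samples are used for training and 200 samples for validation.

#### Key steps
1. Load the dataset with the Hugging Face `datasets` library.
2. Split the dataset with the `Dataset.select()` method.
   - Training data: the first 2,800 samples.
   - Validation data: the following 200 samples.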
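The positional split described above can be sketched with plain Python lists. This is only an illustration of the index ranges that `Dataset.select(range(...))` picks out; the helper name `head_tail_split` and the generated records are hypothetical stand-ins for the real JSONL file.

```python
from typing import List, Sequence, Tuple


def head_tail_split(
    records: Sequence[dict], n_train: int, n_valid: int
) -> Tuple[List[dict], List[dict]]:
    """Take the first n_train records for training and the next n_valid for validation."""
    if n_train + n_valid > len(records):
        raise ValueError("not enough records for the requested split")
    train = list(records[:n_train])
    valid = list(records[n_train:n_train + n_valid])
    return train, valid


# Hypothetical stand-in for the 3,000-line JSONL file.
records = [{"input": f"order {i}", "output": f"label {i}"} for i in range(3000)]
train, valid = head_tail_split(records, 2800, 200)
print(len(train), len(valid))
```

Because the split is positional and unshuffled, the validation set is representative only if the file itself is not ordered by some property of the samples.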

## Tokenizer setup

In [6]:
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    use_fast=True,
)
tokenizer.padding_side = "right"

bnb_config = BitsAndBytesConfig(
    # 8-bit loading; quant-type and compute-dtype options exist only for the 4-bit path
    load_in_8bit=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch_dtype,
)

base_model.config.use_cache = False

if getattr(tokenizer, "pad_token", None) is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right"
if base_model.config.pad_token_id != tokenizer.pad_token_id:
    base_model.config.pad_token_id = tokenizer.pad_token_id


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/tokenizer.json
loading file tokenizer.model from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Ins

## Train / validation split

##### Train size: 2800 // Valid size: 200

In [7]:
full_dataset = Dataset.from_json(path_or_paths=dataset_name)

train_dataset = full_dataset.select(range(2800))
valid_dataset = full_dataset.select(range(2800, 3000))

print("Train size:", len(train_dataset))
print("Validation size:", len(valid_dataset))


Train size: 2800
Validation size: 200


## Preparing the data for fine-tuning

In [8]:
def function_prepare_sample_text(tokenizer, for_train=True):
    def _prepare_sample_text(example):
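        # Prompt (Korean): "You are an agent that analyzes the order sentence entered by the user.
        # Extract the food name, option name, and quantity from the order."
        system_prompt = "너는 사용자가 입력한 주문 문장을 분석하는 에이전트이다. 주문으로부터 음식명, 옵션명, 수량을 추출한다."
        # "### 주문 문장:" means "### Order sentence:"
        user_input = f"### 주문 문장:\n{example['input']}"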
        messages = [
            {"role": "system", "content": f"{system_prompt}"},
            {"role": "user", "content": f"{user_input}"},
        ]
        if for_train:
            messages.append({"role": "assistant", "content": f"{example['output']}"})

        text = ""
        for message in messages:
            if message['role'] == 'system':
                text += f"<s>[SYSTEM]\n{message['content']}\n[/SYSTEM]"
            elif message['role'] == 'user':
                text += f"\n[USER]\n{message['content']}\n[/USER]"
            elif message['role'] == 'assistant':
                text += f"\n[ASSISTANT]\n{message['content']}\n[/ASSISTANT]</s>"
        return text

    return _prepare_sample_text

def chars_token_ratio(dataset, tokenizer, prepare_sample_text, nb_examples=200):
    total_characters, total_tokens = 0, 0
    for _, example in tqdm(zip(range(min(nb_examples, len(dataset))), iter(dataset)), total=min(nb_examples, len(dataset))):
        text = prepare_sample_text(example)
        total_characters += len(text)
        if tokenizer.is_fast:
            total_tokens += len(tokenizer(text).tokens())
        else:
            total_tokens += len(tokenizer.tokenize(text))
    return total_characters / total_tokens

def create_datasets(tokenizer, dataset, seq_length):
    prepare_sample_text = function_prepare_sample_text(tokenizer)
    chars_per_token = chars_token_ratio(dataset, tokenizer, prepare_sample_text)
    print(f"The character to token ratio of the dataset is: {chars_per_token:.3f}")
    cl_dataset = ConstantLengthDataset(
        tokenizer,
        dataset,
        formatting_func=prepare_sample_text,
        infinite=True,
        seq_length=seq_length,
        chars_per_token=chars_per_token,
    )
    return cl_dataset

train_ds = create_datasets(tokenizer, train_dataset, seq_length)


100%|██████████| 200/200 [00:00<00:00, 1824.54it/s]

The character to token ratio of the dataset is: 1.612
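To make the bracketed prompt template used in the cell above easier to picture, here is a small standalone sketch that renders one sample for training and for inference. The function name `render_prompt`, the English prompt text, and the sample values are hypothetical; only the tag layout follows the notebook.

```python
from typing import Optional


def render_prompt(system: str, user: str, assistant: Optional[str] = None) -> str:
    """Render one sample with the same bracketed tags as the notebook's template."""
    text = f"<s>[SYSTEM]\n{system}\n[/SYSTEM]"
    text += f"\n[USER]\n{user}\n[/USER]"
    if assistant is not None:  # training samples also carry the target answer
        text += f"\n[ASSISTANT]\n{assistant}\n[/ASSISTANT]</s>"
    return text


# Hypothetical sample; the real dataset fields are `input` and `output`.
sample = {"input": "two iced americanos, less ice", "output": "<hypothetical label>"}
system_prompt = "Extract the food name, option name, and quantity from the order."

train_text = render_prompt(system_prompt, f"### Order:\n{sample['input']}", sample["output"])
infer_text = render_prompt(system_prompt, f"### Order:\n{sample['input']}")
print(train_text)
```

The inference prompt is a strict prefix of the training text, so at generation time the model is expected to continue with the `[ASSISTANT]` block on its own.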





# Model training

In [9]:
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "down_proj",
        "up_proj",
        "gate_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

sft_config = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,
    learning_rate=1e-4,
    warmup_ratio=0.1,
    max_grad_norm=0.3,
    weight_decay=0.05,
    num_train_epochs=1,
    logging_steps=20,
    evaluation_strategy="no",
    save_strategy="steps",
    save_steps=50,
    save_total_limit=2,
    max_seq_length=seq_length,
    report_to="wandb",
    run_name="llama-3.2-1b-fine-tuning"
)


PyTorch: setting up devices


##### The upgrade below resolves an error raised during training

In [10]:
!pip install --upgrade peft bitsandbytes




## Running the training

In [11]:
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_ds,
    eval_dataset=None,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=sft_config
)

trainer.train()


PyTorch: setting up devices
***** Running training *****
  Num examples = 2,800
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 1,400
  Number of trainable parameters = 5,636,096
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"




| Step | Training Loss |
|------|---------------|
| 20 | 2.0902 |
| 40 | 1.9464 |
| 60 | 1.5731 |
| 80 | 1.1767 |
| 100 | 0.9835 |
| 120 | 0.8871 |
| 140 | 0.8393 |
| 160 | 0.7686 |
| 180 | 0.7482 |
| 200 | 0.7281 |


Saving model checkpoint to /content/drive/MyDrive/NLP_Class/output/checkpoint-50
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "

TrainOutput(global_step=1400, training_loss=0.6581716074262347, metrics={'train_runtime': 1905.3589, 'train_samples_per_second': 1.47, 'train_steps_per_second': 0.735, 'total_flos': 8419093040332800.0, 'train_loss': 0.6581716074262347, 'epoch': 1.0})
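The trainable-parameter count and step count reported in the training log can be reproduced by hand. The sketch below is a back-of-the-envelope check, assuming each targeted linear layer gets one `r x in_features` matrix and one `out_features x r` matrix with no bias; the layer shapes come from the `LlamaConfig` printed in the log, while the helper names are made up for this illustration.

```python
import math

# Layer shapes read from the LlamaConfig printed in the training log.
HIDDEN = 2048
INTERMEDIATE = 8192
NUM_LAYERS = 16
KV_DIM = 8 * 64  # num_key_value_heads * head_dim

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """Count A (rank x in) plus B (out x rank) for every targeted module in every layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


def optimizer_steps(num_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Number of optimizer updates on a single device."""
    return math.ceil(num_examples / (batch_size * grad_accum)) * epochs


print(lora_trainable_params(rank=8))   # 5636096
print(optimizer_steps(2800, 2, 1, 1))  # 1400
```

Both values agree with the log: 5,636,096 trainable parameters and 1,400 optimization steps.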

#### Uploaded model information
- **Model name**: `shlee0/llama-3.2.1-finetuning-assign`
- **File format**: LoRA adapter in `Safetensors` format.
- **Model upload URL**: 'https://huggingface.co/shlee0/llama-3.2.1-finetuning-assign/resolve/main/adapter_model.safetensors'
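The notebook does not show the upload call itself, so the following is only one plausible way to do it, not necessarily what was run. It assumes the trained PEFT model and the tokenizer both expose `push_to_hub`, as `transformers` and `peft` objects do; the wrapper name `upload_adapter` is hypothetical.

```python
def upload_adapter(model, tokenizer, repo_id: str, private: bool = False) -> None:
    """Push the adapter weights and the tokenizer files to one Hub repository."""
    model.push_to_hub(repo_id, private=private)
    tokenizer.push_to_hub(repo_id, private=private)


# Possible usage in this notebook, after training and `login(...)`:
# upload_adapter(trainer.model, tokenizer, "shlee0/llama-3.2.1-finetuning-assign")
```

For a PEFT model this uploads only the adapter files, which is why the repository holds `adapter_model.safetensors` rather than the full base weights.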

##### Training run as viewed in wandb

- **Training run**: 'https://wandb.ai/floralee_782/huggingface?nw=nwuserfloralee782'

### Evaluating the fine-tuned model (test evaluation)

In [12]:
# Evaluate the fine-tuned model on the validation set
from sacrebleu import corpus_bleu

def generate_prediction(tokenizer, model, input_text):
    # Tokenize with an explicit attention mask; pad and EOS share an id, so it cannot be inferred
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded

# Preprocessing function for inference (no assistant turn appended)
inference_preprocessor = function_prepare_sample_text(tokenizer, for_train=False)

# Evaluate on the validation dataset
refs = []  # reference answers
hyps = []  # model predictions (not named `sys`, which would shadow the standard module)

for example in tqdm(valid_dataset):
    input_prompt = inference_preprocessor(example)
    prediction = generate_prediction(tokenizer, trainer.model, input_prompt)
    # Keep only the assistant's response
    if "[ASSISTANT]" in prediction:
        assistant_response = prediction.split("[ASSISTANT]")[-1].split("[/ASSISTANT]")[0].strip()
    else:
        assistant_response = prediction.strip()
    hyps.append(assistant_response)
    refs.append(example["output"])

# Compute the BLEU score
bleu_score = corpus_bleu(hyps, [refs])
print("BLEU score:", bleu_score.score)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 200/200 [39:22<00:00, 11.81s/it]

BLEU score: 87.31510507150453
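The response-extraction step in the evaluation loop can be isolated into a small helper, which makes its fallback behaviour easier to test. This is a sketch of the same split logic; the function name and the decoded string are hypothetical.

```python
def extract_assistant_response(decoded: str) -> str:
    """Return the text between the last [ASSISTANT] tag and the next [/ASSISTANT] tag."""
    if "[ASSISTANT]" not in decoded:
        return decoded.strip()  # fall back to the whole string when the tag is missing
    after_open = decoded.split("[ASSISTANT]")[-1]
    return after_open.split("[/ASSISTANT]")[0].strip()


# Hypothetical decoded output: the prompt followed by the generated answer.
decoded = (
    "<s>[SYSTEM]\n...\n[/SYSTEM]\n[USER]\n...\n[/USER]"
    "\n[ASSISTANT]\nanswer text\n[/ASSISTANT]</s>"
)
print(extract_assistant_response(decoded))
```

When the model never emits the opening tag, the whole decoded string (prompt included) is scored, which would lower BLEU for that sample.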





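# Evaluating the model loaded from the Hugging Face Hub

## Q4: Load the LoRA adapter from the Hugging Face Hub

### Approach
Load the uploaded LoRA adapter from the Hugging Face Hub and attach it to the original base model.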
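As a reminder of what attaching an adapter means numerically, a LoRA-wrapped linear layer behaves like the base weight plus a scaled low-rank update, `W + (alpha / r) * B @ A`. The toy example below uses plain Python lists and made-up 2x2 values purely for illustration; it is not how PEFT stores or applies the matrices internally.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, sufficient for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A, the weight a LoRA-wrapped layer behaves like."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 layer with rank 1 and scale 2.0; the notebook's r=8, lora_alpha=16 give the same scale.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # shape (r, in_features)
B = [[0.5], [0.0]]  # shape (out_features, r)

print(effective_weight(W, A, B, alpha=2.0, r=1))
```

Only `A` and `B` are stored in the adapter file, which is why the base model must be loaded first and the adapter applied on top.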

In [17]:
from huggingface_hub import HfApi

model_id = "shlee0/llama-3.2.1-finetuning-assign"

**Load the model from the Hub**

In [18]:
import gc
torch.cuda.empty_cache()
gc.collect()

# Load the base model
new_base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch_dtype,
)

new_tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
if getattr(new_tokenizer, "pad_token", None) is None:
    new_tokenizer.pad_token = new_tokenizer.eos_token
    new_tokenizer.pad_token_id = new_tokenizer.eos_token_id
new_tokenizer.padding_side = "right"

# Load the PEFT adapter
from peft import PeftModel
adapter_model = PeftModel.from_pretrained(new_base_model, model_id)  # Q4: load from the HF Hub


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/config.json
Model config LlamaConfig {
  "_name_or_path": "meta-llama/Llama-3.2-1B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "

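**BLEU score with the Hub-loaded model**

## Q5: Compute the BLEU score on the validation set

### Approach
Predictions are generated for the validation dataset and compared against the reference answers to compute the BLEU score.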
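For readers unfamiliar with the metric, the sketch below shows the core of corpus-level BLEU: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is a simplified illustration with whitespace tokens, one reference per sample, and no smoothing. It is not sacreBLEU's implementation, so its numbers will generally differ from `corpus_bleu`. The example sentences are made up.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hyps: List[str], refs: List[str], max_n: int = 4) -> float:
    """Corpus BLEU with whitespace tokens, one reference each, and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


example_hyps = ["the cat sat on the mat", "two iced lattes please"]
example_refs = ["the cat sat on the mat", "two iced lattes please"]
print(simple_corpus_bleu(example_hyps, example_refs))
```

Counts are pooled over the whole corpus before the precisions are computed, which is why a corpus score is not the average of per-sentence scores.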

In [16]:
# Load the trained model and evaluate it
from sacrebleu import corpus_bleu

def generate_prediction(tokenizer, model, input_text):
    # Tokenize with an explicit attention mask; pad and EOS share an id, so it cannot be inferred
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded

# Preprocessing for inference
inference_preprocessor = function_prepare_sample_text(tokenizer, for_train=False)

# Evaluation
refs = []  # reference answers
hyps = []  # model predictions (not named `sys`, which would shadow the standard module)

# As written, this cell scores `trainer.model` with `tokenizer`; pass `adapter_model`
# and `new_tokenizer` instead to score the adapter loaded from the Hub.
for example in tqdm(valid_dataset):
    input_prompt = inference_preprocessor(example)
    prediction = generate_prediction(tokenizer, trainer.model, input_prompt)

    if "[ASSISTANT]" in prediction:
        assistant_response = prediction.split("[ASSISTANT]")[-1].split("[/ASSISTANT]")[0].strip()
    else:
        assistant_response = prediction.strip()
    hyps.append(assistant_response)
    refs.append(example["output"])

# BLEU score
bleu_score = corpus_bleu(hyps, [refs])
print("BLEU score:", bleu_score.score)


100%|██████████| 200/200 [38:54<00:00, 11.67s/it]

BLEU score: 87.2813115602023






## References
- Hugging Face: [https://huggingface.co/docs](https://huggingface.co/docs)
- SacreBLEU: [https://github.com/mjpost/sacrebleu](https://github.com/mjpost/sacrebleu)
- PEFT and LoRA overview: [https://github.com/huggingface/peft](https://github.com/huggingface/peft)

# Additional work (for upload)

In [13]:
# Save the model and tokenizer for upload
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16"

('/content/drive/MyDrive/NLP_Class/output/tokenizer_config.json',
 '/content/drive/MyDrive/NLP_Class/output/special_tokens_map.json',
 '/content/drive/MyDrive/NLP_Class/output/tokenizer.json')