# SFT Training to Strengthen LLM Mathematical Reasoning

## Experiment Overview
- **Goal**: Strengthen LLM mathematical reasoning with supervised fine-tuning (SFT)
- **Training data**: grade-school-math-instructions
- **Evaluation data**: math-QA (lm-evaluation-harness)
- **Models**: Qwen2.5-0.5B, Qwen2.5-1.5B (base models)
- **Environment**: Colab T4 GPU

## 1. Environment Setup

### Dependency versions (tested)
| Package | Version | Purpose |
|---------|---------|---------|
| torch | 2.4.0+ | Deep learning framework |
| transformers | 4.45.0+ | Model loading |
| datasets | 2.21.0 | Data loading |
| accelerate | 1.7.0+ | Distributed training |
| peft | 0.12.0 | LoRA/QLoRA |
| bitsandbytes | 0.44.0+ | 4-bit quantization |
| trl | 0.9.6 | SFT Trainer |
| wandb | 0.17.5+ | Experiment tracking |
| flash-attn | 2.5.0+ | **Speed optimization (optional)** |
| scipy | 1.13.1 | Numerical computing |

### Speed optimization package
Flash Attention 2 must be installed separately:
```bash
pip install flash-attn --no-build-isolation
```

### ⚠️ If you hit a triton/GenerationMixin error (Colab)

If importing fails with a `triton.backends` or `GenerationMixin` error:

1. **Run the cell below** → upgrades triton and reinstalls transformers
2. **Restart the runtime** (Runtime → Restart runtime) ← **required!**
3. Re-run the notebook from the top

**If you hit a `KernelMetadata.cluster_dims` error**: `torch_compile=True` conflicts with PyTorch inductor/Triton. `get_training_args()` therefore sets `torch_compile=False`.
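The Triton conflicts described above come down to the installed `triton` differing from the version that `torch` pins. The following is a minimal sketch of how that mismatch could be detected before training; `find_pinned_version` and `triton_pin_status` are hypothetical helper names, not part of the original notebook, and the sketch assumes torch declares its Triton dependency as an `==` pin in its package metadata.

```python
import re
from importlib.metadata import PackageNotFoundError, requires, version


def find_pinned_version(requirements, package):
    """Return the version pinned with '==' for `package`, or None if there is no such pin."""
    pattern = re.compile(rf"^{re.escape(package)}\s*==\s*([^\s;,]+)", re.IGNORECASE)
    for requirement in requirements:
        match = pattern.match(requirement.strip())
        if match:
            return match.group(1)
    return None


def triton_pin_status():
    """Compare the installed triton with the version torch pins (hypothetical helper)."""
    try:
        pinned = find_pinned_version(requires("torch") or [], "triton")
        installed = version("triton")
    except PackageNotFoundError:
        return None  # torch or triton is not installed in this environment
    return {"pinned": pinned, "installed": installed, "ok": pinned in (None, installed)}


print(triton_pin_status())
```

If `ok` is `False`, the environment is in the state pip warns about, and keeping `torch_compile=False` is the safer choice.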

In [2]:
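# Install core libraries (versions pinned where noted)

# Base libraries. Quote the specifiers: an unquoted ">=" is parsed by the shell as an output redirect.
!pip install "transformers>=4.45.0" "bitsandbytes>=0.44.0"
!pip install --upgrade triton  # for torch 2.9 compatibility (pinning 2.2.0 causes the triton.backends error)
!pip install datasets==2.21.0
!pip install peft==0.12.0
!pip install trl==0.9.6
!pip install scipy==1.13.1
# !pip install numpy pandas
!pip install numpy --no-cache-dir
!pip install wandb
!pip install --upgrade "accelerate>=1.7.0"
!pip install --upgrade triton  # upgrade again: installing peft pulls triton back to 3.5.0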


Collecting triton
  Downloading triton-3.6.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.7 kB)
Installing collected packages: triton
  Attempting uninstall: triton
    Found existing installation: triton 3.5.0
    Uninstalling triton-3.5.0:
      Successfully uninstalled triton-3.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.9.0+cu126 requires triton==3.5.0; platform_system == "Linux", but you have triton 3.6.0 which is incompatible.
Successfully installed triton-3.6.0

Collecting datasets==2.21.0
Collecting fsspec<=2024.6.1,>=2023.1.0 (from fsspec[http]<=2024.6.1,>=2023.1.0->datasets==2.21.0)
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.0
    Uninstalling fsspec-2025.3.0:
      Successfully uninstalled fsspec-2025.3.0
  Attempting uninstall: datasets
    Found existing installation: datasets 4.0.0
    Uninstalling datasets-4.0.0:
      Successfully uninstalled datasets-4.0.0
ERROR: pip

Collecting peft==0.12.0
Collecting triton==3.5.0 (from torch>=1.13.0->peft==0.12.0)
Installing collected packages: triton, peft
  Attempting uninstall: triton
    Found existing installation: triton 3.6.0
    Uninstalling triton-3.6.0:
      Successfully uninstalled triton-3.6.0
  Attempting uninstall: peft
    Found existing installation: peft 0.18.1
    Uninstalling peft-0.18.1:
      Successfully uninstalled pe

Collecting trl==0.9.6
Collecting numpy<2.0.0,>=1.18.2 (from trl==0.9.6)
Collecting tyro>=0.5.11 (from trl==0.9.6)

Collecting scipy==1.13.1
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.16.3
    Uninstalling scipy-1.16.3:
      Successfully uninstalled scipy-1.16.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytensor 2.37.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
tobler 0.13.0 requires numpy>=2.0, but you have numpy 1.26.4

Collecting triton
  Using cached triton-3.6.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.7 kB)
Installing collected packages: triton
  Attempting uninstall: triton
    Found existing installation: triton 3.5.0
    Uninstalling triton-3.5.0:
      Successfully uninstalled triton-3.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.9.0+cu126 requires triton==3.5.0; platform_system == "Linux", but you have triton 3.6.0 which is incompatible.
Successfully installed triton-3.6.0


In [None]:
# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Memory allocated: {torch.cuda.memory_allocated(0)/1e9:.2f} GB")

In [4]:
# Check installed versions
!pip show torch transformers datasets accelerate peft bitsandbytes trl wandb flash-attn | grep -E "^(Name|Version)"

Name: torch
Version: 2.9.0+cu126
Name: transformers
Version: 4.57.6
Name: datasets
Version: 2.21.0
Name: accelerate
Version: 1.12.0
Name: peft
Version: 0.12.0
Name: bitsandbytes
Version: 0.49.1
Name: trl
Version: 0.9.6
Name: wandb
Version: 0.24.0


### Library imports and attention implementation

Import the required libraries and automatically select the attention implementation (Flash Attention 2, SDPA, or eager) that fits the GPU.

### WandB experiment tracking

Training is tracked with Weights & Biases. Log in with an API key and set the project name.

In [3]:
import os
import torch
import wandb
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTTrainer

# Choose the attention implementation
# - Flash Attention 2: requires Ampere (compute capability 8.0) or newer
# - SDPA: available in PyTorch 2.0+, also works on T4
# - eager: default implementation
def get_attn_implementation():
    if not torch.cuda.is_available():
        return "eager"

    major, _minor = torch.cuda.get_device_capability()

    # Use Flash Attention 2 on Ampere (8.0) or newer when the package is installed
    if major >= 8:
        try:
            import flash_attn
            print(f"✓ Flash Attention 2 available (v{flash_attn.__version__})")
            return "flash_attention_2"
        except ImportError:
            pass

    # Use SDPA when PyTorch provides it (2.0+; also works on T4).
    # Checking for the function avoids comparing version strings lexicographically.
    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
        print("✓ Using SDPA (Scaled Dot Product Attention)")
        return "sdpa"

    print("✓ Using eager attention")
    return "eager"

ATTN_IMPLEMENTATION = get_attn_implementation()
print(f"Attention Implementation: {ATTN_IMPLEMENTATION}")

✓ Using SDPA (Scaled Dot Product Attention)
Attention Implementation: sdpa


In [None]:
# Load HF_TOKEN, WANDB_API_KEY, WANDB_ENTITY from .env (copy .env.example to .env and fill in)
from dotenv import load_dotenv
load_dotenv()

### HuggingFace login and dataset loading

Log in to the HuggingFace Hub and load the grade-school-math-instructions dataset, which contains grade-school-level math problems with step-by-step solutions.

### Data preprocessing

The INSTRUCTION and RESPONSE fields are converted to a "### Question:" / "### Answer:" layout to produce training text.

### Train/validation split

The dataset is split 90% training, 10% validation.

In [2]:
import os
import wandb

# WANDB_API_KEY, WANDB_ENTITY from .env (see .env.example)
wandb_api_key = os.environ.get("WANDB_API_KEY")
if wandb_api_key:
    wandb.login(key=wandb_api_key)
else:
    wandb.login()

PROJECT_NAME = "llm-math-reasoning-sft"
ENTITY = os.environ.get("WANDB_ENTITY", "jungwoonshin")  # your wandb username or team name

wandb: [wandb.login()] Using explicit session credentials for https://api.wandb.ai.
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: jungwoonshin to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


## 2. Data Preparation

In [3]:

# HF_TOKEN from .env (see .env.example)
import os
HF_TOKEN = os.environ.get("HF_TOKEN")
from huggingface_hub import login

if HF_TOKEN:
    login(token=HF_TOKEN)

# Load the Grade School Math Instructions dataset
dataset = load_dataset("qwedsacf/grade-school-math-instructions")
print(f"Dataset structure: {dataset}")
print(f"\nSample data:")
print(dataset['train'][0])

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


Downloading readme:   0%|          | 0.00/852 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.55M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8792 [00:00<?, ? examples/s]

Dataset structure: DatasetDict({
    train: Dataset({
        features: ['INSTRUCTION', 'RESPONSE', 'SOURCE'],
        num_rows: 8792
    })
})

Sample data:
{'INSTRUCTION': 'This math problem has got me stumped: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?\nCan you show me the way?', 'RESPONSE': 'Natalia sold 48/2 = 24 clips in May.\nNatalia sold 48+24 = 72 clips altogether in April and May.', 'SOURCE': 'grade-school-math'}


In [4]:
# Preprocessing function
def format_instruction(sample):
    instruction = sample.get('INSTRUCTION', '')
    response = sample.get('RESPONSE', '')

    if not instruction or not response:
        # Incomplete samples get empty text; map() keeps the row, so filter afterwards if any exist
        return {"text": ""}

    text = f"### Question:\n{instruction}\n\n### Answer:\n{response}"
    return {"text": text}

# Convert the dataset
formatted_dataset = dataset['train'].map(format_instruction, remove_columns=dataset['train'].column_names)
print(f"Formatted dataset size: {len(formatted_dataset)}")
print(f"\nSample formatted data:")
print(formatted_dataset[0]['text'][:500])

Map:   0%|          | 0/8792 [00:00<?, ? examples/s]

Formatted dataset size: 8792

Sample formatted data:
### Question:
This math problem has got me stumped: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Can you show me the way?

### Answer:
Natalia sold 48/2 = 24 clips in May.
Natalia sold 48+24 = 72 clips altogether in April and May.
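Because the model is trained on this exact layout, prompts at inference time should presumably reproduce it up to and including the `### Answer:` line. The sketch below shows one way to do that; `build_prompt` and `extract_answer` are hypothetical helpers that do not appear in the notebook, and cutting the output at the next `### Question:` marker is an assumption about how a base model trained with packing may continue generating.

```python
PROMPT_TEMPLATE = "### Question:\n{question}\n\n### Answer:\n"


def build_prompt(question: str) -> str:
    """Format a question exactly like the training text, leaving the answer open."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def extract_answer(generated_text: str) -> str:
    """Return the text after the last '### Answer:' marker, cut at any following question."""
    answer = generated_text.split("### Answer:")[-1]
    answer = answer.split("### Question:")[0]
    return answer.strip()


prompt = build_prompt("Natalia sold 48 clips in April and half as many in May. How many in total?")
print(prompt)
```

The string returned by `build_prompt` would be passed to the tokenizer and `model.generate`, and `extract_answer` applied to the decoded output.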


### Loading Qwen2.5-0.5B

Load Qwen2.5-0.5B with 4-bit quantization (QLoRA) and attach LoRA adapters.

### WandB initialization and SFTTrainer setup

Initialize the WandB run and configure SFTTrainer. Sequence packing is enabled to speed up training.

### Training and saving the model

Train Qwen2.5-0.5B and save the final model.

### Merging LoRA and saving to Google Drive (0.5B)

Merge the LoRA adapter into the base model and save it to Google Drive.

### Memory cleanup

Free GPU memory before training the next model.

In [5]:
# Train/validation split (90/10)
split_dataset = formatted_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")

Training samples: 7912
Validation samples: 880
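The 7912/880 split of 8792 rows follows from how a fractional `test_size` is rounded. The sketch below reproduces the arithmetic, assuming the scikit-learn-style convention in which the test count is rounded up and the training set receives the remainder; `split_sizes` is a hypothetical helper, not a `datasets` API.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(8792, 0.1))
```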


### Loading Qwen2.5-1.5B

Load the larger 1.5B model with the same QLoRA configuration.

### WandB initialization and SFTTrainer setup (1.5B)

Initialize a WandB run for the 1.5B model and configure SFTTrainer with a larger batch size.

### Training the 1.5B model

Train Qwen2.5-1.5B and save it.

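## 3. Training Configuration

### Optimization strategy for the T4 GPU:

#### Memory optimization
- **QLoRA** (4-bit quantization + LoRA): memory-efficient training
- **Gradient Checkpointing**: lower memory usage

#### Speed optimization
- **SDPA** (Scaled Dot Product Attention): ~1.5x faster attention on T4
  - Flash Attention 2 is supported only on Ampere (A100, A10) or newer
  - On T4 (Turing), SDPA is selected automatically
- **torch.compile**: ~10-30% overall speedup, but disabled in this notebook (`torch_compile=False`) because it conflicts with Triton on Colab
- **Mixed Precision (fp16)**: faster training
- **Sequence Packing**: packs multiple samples into one sequence → 2-3x speedup
- **Parallel Data Loading**: 4 workers suit script runs; this notebook uses `dataloader_num_workers=0` so the progress bar displays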
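To make the sequence packing item concrete, here is a minimal sketch of the idea: tokenized samples are concatenated with an EOS separator and cut into fixed-length blocks, so little of each block is spent on padding. This illustrates the concept only; TRL's own packing differs in details, and `pack_sequences` is a hypothetical name.

```python
from typing import Iterable, Iterator


def pack_sequences(tokenized: Iterable[list[int]], seq_length: int, eos_id: int) -> Iterator[list[int]]:
    """Concatenate samples (each followed by EOS) and yield fixed-length blocks."""
    buffer: list[int] = []
    for tokens in tokenized:
        buffer.extend(tokens)
        buffer.append(eos_id)
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any trailing partial block is dropped in this sketch.


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_sequences(samples, seq_length=4, eos_id=0))
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that a block can contain the end of one sample and the start of the next, which is the trade-off packing makes for throughput.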

In [6]:
def setup_model_and_tokenizer(model_id, use_qlora=True):
    """
    Set up the model and tokenizer.
    Applies a QLoRA configuration tuned for the T4 GPU.

    Attention implementations:
    - Flash Attention 2: Ampere GPUs or newer (~2x faster)
    - SDPA: used on Turing GPUs such as T4 (~1.5x faster)
    - eager: default implementation
    """
    # Tokenizer setup
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    # Set a padding token if none is defined
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id

    if use_qlora:
        # 4-bit quantization config
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            # bfloat16 for stability; T4 (Turing) has no native bf16, so torch.float16 may be faster there
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config=bnb_config,
            device_map="auto",
            trust_remote_code=True,
            attn_implementation=ATTN_IMPLEMENTATION,  # use the auto-detected attention implementation
        )

        # Prepare the quantized model for QLoRA training
        model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True,
            attn_implementation=ATTN_IMPLEMENTATION,
        )

    # LoRA config
    lora_config = LoraConfig(
        r=16,  # LoRA rank
        lora_alpha=32,  # LoRA alpha (scaling = lora_alpha / r = 2)
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    return model, tokenizer

In [7]:
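def get_training_args(model_name, output_dir, fast_mode=False):
    """
    Training arguments tuned for the T4 GPU.

    Speed-related settings:
    - dataloader_num_workers: parallel data loading (0 here so the notebook progress bar displays)
    - dataloader_pin_memory: faster CPU→GPU transfer
    - torch_compile: ~10-30% speedup when enabled; kept False here because of the Triton conflict on Colab
    - fast_mode: lower logging/eval/save frequency to reduce overhead
    """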
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size = 16
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},  # non-reentrant checkpointing (faster)
        optim="paged_adamw_8bit",
        learning_rate=2e-4,
        weight_decay=0.01,
        fp16=True,
        bf16=False,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        logging_steps=25 if fast_mode else 10,  # log less often in fast mode
        save_strategy="steps",
        save_steps=200 if fast_mode else 100,  # save less often in fast mode
        eval_strategy="steps",
        eval_steps=200 if fast_mode else 100,  # evaluate less often in fast mode
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        report_to="wandb",
        run_name=f"{model_name}-math-sft",
        # Speed optimizations
        dataloader_num_workers=0,  # 0 = progress bar displays (Jupyter/Colab); 4 = parallel loading (for scripts)
        dataloader_pin_memory=True,  # faster host-to-device transfer
        torch_compile=False,  # Colab/Triton compatibility: True raises the KernelMetadata.cluster_dims error
        # dataloader_prefetch_factor=2,  # data prefetch
    )
    return args
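The combination `lr_scheduler_type="cosine"` with `warmup_ratio=0.03` can be sketched as a plain function: a linear ramp from 0 to the base learning rate over the warmup steps, followed by a half-cosine decay to 0. This is an illustrative re-implementation under the assumption that warmup steps are `ceil(total_steps * warmup_ratio)`; `cosine_lr` is a hypothetical name, and the 237-step total is borrowed from the 1.5B run summary later in this notebook.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 2e-4, warmup_ratio: float = 0.03) -> float:
    """Learning rate at `step` for linear warmup followed by half-cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 4, 8, 120, 237):
    print(step, f"{cosine_lr(step, total_steps=237):.2e}")
```

With 237 total steps the warmup lasts 8 steps, so the peak learning rate of 2e-4 is reached almost immediately and most of the run is spent on the decay.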

## 4. Training Qwen2.5-0.5B

In [8]:
# Qwen2.5-0.5B model setup (QLoRA)
MODEL_ID_05B = "Qwen/Qwen2.5-0.5B"
OUTPUT_DIR_05B = "./outputs/qwen2.5-0.5b-math-sft"

print(f"Loading model: {MODEL_ID_05B}")
print(f"Attention: {ATTN_IMPLEMENTATION}")

model_05b, tokenizer_05b = setup_model_and_tokenizer(MODEL_ID_05B, use_qlora=True)

Loading model: Qwen/Qwen2.5-0.5B
Attention: sdpa


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/681 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

trainable params: 8,798,208 || all params: 502,830,976 || trainable%: 1.7497


In [9]:
# Initialize the WandB run
wandb.init(
    project=PROJECT_NAME,
    entity=ENTITY,
    name="qwen2.5-0.5b-math-sft",
    config={
        "model": MODEL_ID_05B,
        "dataset": "qwedsacf/grade-school-math-instructions",
        "method": "QLoRA",
        "lora_r": 16,
        "lora_alpha": 32,
        "epochs": 3,  # matches num_train_epochs in get_training_args()
        "batch_size": 16,
        "learning_rate": 2e-4,
    }
)

# Training arguments
training_args_05b = get_training_args("qwen2.5-0.5b", OUTPUT_DIR_05B)

# Trainer setup
trainer_05b = SFTTrainer(
    model=model_05b,
    args=training_args_05b,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer_05b,
    dataset_text_field="text",
    max_seq_length=512,
    packing=True,  # pack multiple samples into one sequence → 2-3x speedup
)

print("Starting training for Qwen2.5-0.5B...")


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Starting training for Qwen2.5-0.5B...


  super().__init__(


In [None]:
# Run Qwen2.5-0.5B training
trainer_05b.train()

# Save the final model
trainer_05b.save_model(f"{OUTPUT_DIR_05B}/final")
tokenizer_05b.save_pretrained(f"{OUTPUT_DIR_05B}/final")

print(f"Model saved to {OUTPUT_DIR_05B}/final")
wandb.finish()




In [13]:

from peft import PeftModel
import gc

def merge_and_save_model(base_model_id, lora_path, output_path):
    """
    Merge the LoRA weights into the base model and save the result.
    """
    print(f"Loading base model: {base_model_id}")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )
    
    print(f"Loading LoRA weights from: {lora_path}")
    model = PeftModel.from_pretrained(base_model, lora_path)
    
    print("Merging weights...")
    merged_model = model.merge_and_unload()
    
    print(f"Saving merged model to: {output_path}")
    merged_model.save_pretrained(output_path)
    
    # Save the tokenizer alongside the model
    tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
    tokenizer.save_pretrained(output_path)

    print("Done!")

    # Free memory: collect Python garbage first so the CUDA cache can release the freed blocks
    del base_model, model, merged_model
    gc.collect()
    torch.cuda.empty_cache()

    return output_path

# Merge the Qwen2.5-0.5B model
merged_05b_path = merge_and_save_model(
    base_model_id="Qwen/Qwen2.5-0.5B",
    lora_path=f"{OUTPUT_DIR_05B}/final",
    output_path="./outputs/qwen2.5-0.5b-math-sft-merged"
)


# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create the destination directory
!mkdir -p /content/drive/MyDrive/llm-math-models/

# Copy the models
!cp -r ./outputs/qwen2.5-0.5b-math-sft-merged /content/drive/MyDrive/llm-math-models/
# !cp -r ./outputs/qwen2.5-1.5b-math-sft-merged /content/drive/MyDrive/llm-math-models/

print("Models saved to Google Drive!")

Loading base model: Qwen/Qwen2.5-0.5B
Loading LoRA weights from: ./outputs/qwen2.5-0.5b-math-sft/final
Merging weights...
Saving merged model to: ./outputs/qwen2.5-0.5b-math-sft-merged
Done!
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Models saved to Google Drive!


In [14]:
# Free memory
del model_05b, trainer_05b
torch.cuda.empty_cache()
import gc
gc.collect()

33274

## 5. Training Qwen2.5-1.5B

In [15]:
# Qwen2.5-1.5B model setup (QLoRA)
MODEL_ID_15B = "Qwen/Qwen2.5-1.5B"
OUTPUT_DIR_15B = "./outputs/qwen2.5-1.5b-math-sft"

print(f"Loading model: {MODEL_ID_15B}")
print(f"Attention: {ATTN_IMPLEMENTATION}")

model_15b, tokenizer_15b = setup_model_and_tokenizer(MODEL_ID_15B, use_qlora=True)

Loading model: Qwen/Qwen2.5-1.5B
Attention: sdpa


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820
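The trainable-parameter counts printed for both models can be checked by hand: a LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below applies this to the seven target modules. The layer dimensions are assumptions taken from the public Qwen2.5 model configs (hidden size, key/value projection width, MLP intermediate size, layer count), and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(hidden: int, kv_dim: int, intermediate: int, num_layers: int, r: int = 16) -> int:
    """Count LoRA weights for the q/k/v/o and gate/up/down projections of every layer."""
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed dimensions: (hidden, kv_dim, intermediate, num_layers)
print(lora_param_count(896, 128, 4864, 24))    # Qwen2.5-0.5B: 8798208
print(lora_param_count(1536, 256, 8960, 28))   # Qwen2.5-1.5B: 18464768
```

Both results match the `trainable params` lines printed by `print_trainable_parameters()` in this notebook.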


In [16]:
# Initialize the WandB run
wandb.init(
    project=PROJECT_NAME,
    entity=ENTITY,
    name="qwen2.5-1.5b-math-sft-fast",
    config={
        "model": MODEL_ID_15B,
        "dataset": "qwedsacf/grade-school-math-instructions",
        "method": f"QLoRA + {ATTN_IMPLEMENTATION}",
        "lora_r": 16,
        "lora_alpha": 32,
        "epochs": 3,
        "batch_size": 32,  # effective batch size
        "learning_rate": 2e-4,
    }
)

# Training arguments (speed-oriented settings for 1.5B)
training_args_15b = get_training_args("qwen2.5-1.5b", OUTPUT_DIR_15B, fast_mode=True)
training_args_15b.per_device_train_batch_size = 8  # larger batch size
training_args_15b.gradient_accumulation_steps = 4  # effective batch size = 32
training_args_15b.num_train_epochs = 3

# Trainer setup
trainer_15b = SFTTrainer(
    model=model_15b,
    args=training_args_15b,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer_15b,
    dataset_text_field="text",
    max_seq_length=512,
    packing=True,  # pack multiple samples into one sequence → 2-3x speedup
)

print("Starting training for Qwen2.5-1.5B (optimized for speed)...")
print("Speed optimizations enabled:")
print(f"  - {ATTN_IMPLEMENTATION.upper()}")
print("  - Sequence packing")
# torch.compile and parallel data loading are not listed:
# get_training_args() sets torch_compile=False and dataloader_num_workers=0.


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Starting training for Qwen2.5-1.5B (optimized for speed)...
Speed optimizations enabled:
  - SDPA
  - Sequence packing


  super().__init__(


In [18]:
# Run Qwen2.5-1.5B training
trainer_15b.train()

# Save the final model
trainer_15b.save_model(f"{OUTPUT_DIR_15B}/final")
tokenizer_15b.save_pretrained(f"{OUTPUT_DIR_15B}/final")

print(f"Model saved to {OUTPUT_DIR_15B}/final")
wandb.finish()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 200 | 0.4499 | 0.800456 |


Model saved to ./outputs/qwen2.5-1.5b-math-sft/final


| Metric | History |
|--------|---------|
| eval/loss | ▁█ |
| eval/runtime | █▁ |
| eval/samples_per_second | ▁█ |
| eval/steps_per_second | ▁█ |
| train/epoch | ▁▂▃▃▄▅▆▇▇▁▂▃▃▄▅▆▇▇██ |
| train/global_step | ▁▂▃▃▄▅▆▇▇▁▂▃▃▄▅▆▇▇██ |
| train/grad_norm | ▆▆▆▄▆▇▁▃█▅▅▄▇█▅▄▃ |
| train/learning_rate | ██▇▆▄▃▂▁██▇▆▄▃▂▁▁ |
| train/loss | █▆▅▄▃▄▃▃▃▃▃▂▂▂▁▁▁ |

| Metric | Final value |
|--------|-------------|
| eval/loss | 0.80046 |
| eval/runtime | 11.6877 |
| eval/samples_per_second | 23.443 |
| eval/steps_per_second | 5.904 |
| total_flos | 3.087692963040461e+16 |
| train/epoch | 3 |
| train/global_step | 237 |
| train/grad_norm | 0.30893 |
| train/learning_rate | 0.0 |
| train/loss | 0.4669 |
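The loss values in this summary are easier to compare when converted to perplexity. The sketch below assumes the reported losses are mean token-level cross-entropy in nats, which is the usual convention for causal language model training; `perplexity` is a hypothetical helper.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean token-level cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


print(f"eval perplexity:  {perplexity(0.80046):.3f}")
print(f"train perplexity: {perplexity(0.4669):.3f}")
```

The gap between the training and validation values gives a rough sense of how much the model has fit the training split beyond what generalizes.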


In [19]:
# Free memory
del model_15b, trainer_15b
torch.cuda.empty_cache()
gc.collect()

3042

## 6. Merging LoRA Weights into the Base Model

For evaluation, the LoRA weights are merged into the base model and saved as a complete standalone model.
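Merging works because a LoRA layer computes `W x + (alpha / r) * B (A x)`, which equals `(W + (alpha / r) * B A) x`. The pure-Python sketch below checks that identity on a tiny example. It illustrates the arithmetic only and is not how `merge_and_unload()` is implemented; `matmul` and `merge_lora` are hypothetical helpers. The example uses `alpha=2, r=1`, giving the same scale of 2 as `lora_alpha=32, r=16` in this notebook.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]   # base weight, shape (out=2, in=2)
A = [[0.5, -1.0]]              # LoRA A, shape (r=1, in=2)
B = [[2.0], [0.0]]             # LoRA B, shape (out=2, r=1)
x = [[1.0], [1.0]]             # input column vector

merged = merge_lora(W, A, B, alpha=2, r=1)

# Adapter path: W x + scale * B (A x), with scale = alpha / r = 2
adapter_out = [[wx[0] + 2.0 * bax[0]]
               for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))]
merged_out = matmul(merged, x)
print(merged_out, adapter_out)
```

Since the two outputs agree, the merged model can be evaluated without the `peft` wrapper.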

In [20]:
import gc

from peft import PeftModel

def merge_and_save_model(base_model_id, lora_path, output_path):
    """
    Merge the LoRA weights into the base model and save the result.
    """
    print(f"Loading base model: {base_model_id}")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )
    
    print(f"Loading LoRA weights from: {lora_path}")
    model = PeftModel.from_pretrained(base_model, lora_path)
    
    print("Merging weights...")
    merged_model = model.merge_and_unload()
    
    print(f"Saving merged model to: {output_path}")
    merged_model.save_pretrained(output_path)
    
    # Save the tokenizer alongside the model
    tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
    tokenizer.save_pretrained(output_path)

    print("Done!")

    # Free memory: collect Python garbage first so the CUDA cache can release the freed blocks
    del base_model, model, merged_model
    gc.collect()
    torch.cuda.empty_cache()
    
    return output_path

In [21]:
# Merge the Qwen2.5-0.5B model
merged_05b_path = merge_and_save_model(
    base_model_id="Qwen/Qwen2.5-0.5B",
    lora_path=f"{OUTPUT_DIR_05B}/final",
    output_path="./outputs/qwen2.5-0.5b-math-sft-merged"
)

# Merge the Qwen2.5-1.5B model
merged_15b_path = merge_and_save_model(
    base_model_id="Qwen/Qwen2.5-1.5B",
    lora_path=f"{OUTPUT_DIR_15B}/final",
    output_path="./outputs/qwen2.5-1.5b-math-sft-merged"
)

Loading base model: Qwen/Qwen2.5-0.5B
Loading LoRA weights from: ./outputs/qwen2.5-0.5b-math-sft/final
Merging weights...
Saving merged model to: ./outputs/qwen2.5-0.5b-math-sft-merged
Done!
Loading base model: Qwen/Qwen2.5-1.5B
Loading LoRA weights from: ./outputs/qwen2.5-1.5b-math-sft/final
Merging weights...
Saving merged model to: ./outputs/qwen2.5-1.5b-math-sft-merged
Done!




## 7. Saving Trained Models to Google Drive (optional)

Models are deleted when the Colab runtime shuts down, so copy them to Google Drive.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create the destination directory
!mkdir -p /content/drive/MyDrive/llm-math-models/

# Copy the models
# !cp -r ./outputs/qwen2.5-0.5b-math-sft-merged /content/drive/MyDrive/llm-math-models/
!cp -r ./outputs/qwen2.5-1.5b-math-sft-merged /content/drive/MyDrive/llm-math-models/

print("Models saved to Google Drive!")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Models saved to Google Drive!


## Training Complete

Next steps:
1. Run `02_evaluation.ipynb` to evaluate the models
2. Review the training runs on the WandB dashboard

In [25]:
dataset = load_dataset("qwedsacf/grade-school-math-instructions")
print(dataset['train'].column_names)  # check the actual field names
print(dataset['train'][0])  # inspect a sample

['INSTRUCTION', 'RESPONSE', 'SOURCE']
{'INSTRUCTION': 'This math problem has got me stumped: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?\nCan you show me the way?', 'RESPONSE': 'Natalia sold 48/2 = 24 clips in May.\nNatalia sold 48+24 = 72 clips altogether in April and May.', 'SOURCE': 'grade-school-math'}
