## üìã Setup & Installation

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
!pip install wandb 

In [2]:
import os
import json
import torch
import wandb
from datasets import Dataset, load_dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import pandas as pd
from sklearn.model_selection import train_test_split
import gc

# Check GPU
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")



ü¶• Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-11-28 08:30:50.823339: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764318651.280937      47 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764318651.398819      47 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


ü¶• Unsloth Zoo will now patch everything to make training faster!
GPU Available: True
GPU Name: Tesla T4
GPU Memory: 15.83 GB


## üîê WandB Login (for monitoring)

In [4]:
# Login to WandB for experiment tracking
# Get WandB API key from Kaggle Secrets
# In Kaggle: Add-ons ‚Üí Secrets ‚Üí Add new secret with key "WANDB_API_KEY"
# Get your API key from: https://wandb.ai/authorize

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
wandb_api_key = user_secrets.get_secret("WANDB_API_KEY")

# Login with API key from Kaggle Secrets
wandb.login(key=wandb_api_key)

# Initialize WandB project with detailed config
wandb.init(
    project="vietnamese-legal-ai",
    name="llama3.2-3b-traffic-law-v1",
    config={
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "dataset": "traffic_law_data.jsonl",
        "task": "legal_qa",
        "language": "vietnamese",
        "max_seq_length": 1536,
        "lora_r": 32,
        "lora_alpha": 32,
        "learning_rate": 2e-4,
        "num_epochs": 2,
        "batch_size": 4,
        "gradient_accumulation": 4,
        "effective_batch_size": 32,
    },
    settings=wandb.Settings(
        _disable_meta=False,
        _disable_stats=False,
    )
)

print("‚úÖ WandB initialized with detailed logging")

[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mmikeethanh04[0m ([33mmikeethanh04-student[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


‚úÖ WandB initialized with detailed logging


## ‚öôÔ∏è Model Configuration

### T·∫°i sao ch·ªçn Llama-3.2-3B-Instruct?
- ‚úÖ **3B parameters**: V·ª´a ƒë·ªß m·∫°nh, v·ª´a ti·∫øt ki·ªám GPU
- ‚úÖ **Multilingual support**: H·ªó tr·ª£ nhi·ªÅu ng√¥n ng·ªØ bao g·ªìm ti·∫øng Vi·ªát
- ‚úÖ **Instruct version**: ƒê√£ ƒë∆∞·ª£c train theo instruction format
- ‚úÖ **Fit Kaggle T4**: ~15GB VRAM v·ªõi 4-bit quantization
- ‚úÖ **Unsloth optimized**: H·ªó tr·ª£ t·ªët, train nhanh 2x
- ‚úÖ **Meta's latest**: Phi√™n b·∫£n m·ªõi nh·∫•t t·ª´ Meta (2024)

In [5]:
# Model configuration for Kaggle T4 (16GB VRAM)
max_seq_length = 1536  # Based on data analysis (covers 95% of samples)
dtype = None  # Auto-detect. Use Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage

# Alternative models (uncomment to try):
# model_name = "unsloth/Qwen2.5-3B-Instruct-bnb-4bit"  # Qwen 2.5 3B
# model_name = "unsloth/gemma-2-2b-it-bnb-4bit"  # Gemma 2B (smaller, faster)
# model_name = "unsloth/Phi-3-mini-4k-instruct"  # Microsoft Phi-3

model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"  # Meta Llama 3.2 - Pre-quantized by Unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print(f"‚úÖ Model loaded: {model_name}")
print(f"üìè Max sequence length: {max_seq_length}")
print(f"üî¢ 4-bit quantization: {load_in_4bit}")

==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

‚úÖ Model loaded: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
üìè Max sequence length: 1536
üî¢ 4-bit quantization: True


## üéØ LoRA Configuration

### LoRA Parameters Explained:
- **r (rank)**: 16-32 cho balance quality/speed. Higher = better but slower
- **lora_alpha**: Scaling factor, th∆∞·ªùng = r ho·∫∑c 2*r
- **target_modules**: Train all attention & MLP layers cho best result
- **lora_dropout**: 0 cho faster training (Unsloth optimized)
- **bias**: "none" cho faster & less overfitting

In [6]:
# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # LoRA rank - higher = more expressive but slower (16, 32, 64)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],  # All attention & MLP layers
    lora_alpha=32,  # LoRA scaling (usually = r or 2*r)
    lora_dropout=0,  # 0 is optimized by Unsloth
    bias="none",  # "none" is optimized
    use_gradient_checkpointing="unsloth",  # Unsloth's long context support
    random_state=3407,  # For reproducibility
    use_rslora=False,  # Rank stabilized LoRA
    loftq_config=None,  # LoftQ quantization
)

print("‚úÖ LoRA adapters applied")
print(f"üìä Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print(f"üìä Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"üí° Trainable ratio: {100 * sum(p.numel() for p in model.parameters() if p.requires_grad) / sum(p.numel() for p in model.parameters()):.2f}%")

Unsloth 2025.11.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


‚úÖ LoRA adapters applied
üìä Trainable parameters: 48,627,712
üìä Total parameters: 1,852,091,392
üí° Trainable ratio: 2.63%


## üìä Data Preparation

In [7]:
# Load data from Kaggle input (adjust path if uploading to Kaggle)
# For local testing, adjust the path
data_path = "/kaggle/input/traffic-law/traffic_law_data.jsonl"  # Kaggle path
# data_path = "../data/finetune_llm/traffic_law_data.jsonl"  # Local path

# Check if file exists
if not os.path.exists(data_path):
    print(f"‚ö†Ô∏è Data file not found at {data_path}")
    print("For Kaggle: Upload dataset or adjust path")
    print("For local: Make sure you're in the correct directory")
else:
    print(f"‚úÖ Found data at: {data_path}")

# Load JSONL data
data = []
with open(data_path, 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

print(f"üìä Total samples: {len(data):,}")

# Show sample
print("\nüìù Sample data:")
sample = data[0]
for key, value in sample.items():
    if key == 'output':
        print(f"{key}: {value[:200]}...")  # Truncate long output
    else:
        print(f"{key}: {value}")

‚úÖ Found data at: /kaggle/input/traffic-law/traffic_law_data.jsonl
üìä Total samples: 8,652

üìù Sample data:
instruction: Tr·∫£ l·ªùi c√¢u h·ªèi ph√°p lu·∫≠t sau:
input: Th·ªùi h·∫°n k√©o d√†i vi·ªác t·∫°m gi·ªØ tang v·∫≠t, ph∆∞∆°ng ti·ªán c√≥ t√≠nh ng√†y ngh·ªâ kh√¥ng?
output: Theo ƒêi·ªÅu 8 Lu·∫≠t X·ª≠ l√Ω vi ph·∫°m h√†nh ch√≠nh 2012 quy ƒë·ªãnh v·ªÅ c√°ch t√≠nh th·ªùi gian, th·ªùi h·∫°n, th·ªùi hi·ªáu trong x·ª≠ l√Ω vi ph·∫°m h√†nh ch√≠nh nh∆∞ sau: C√°ch t√≠nh th·ªùi gian, th·ªùi h·∫°n, th·ªùi hi·ªáu trong x·ª≠ l√Ω vi ph·∫°m...
domains: H√¨nh s·ª±, D√¢n s·ª±, Giao th√¥ng
traffic_type: Vi ph·∫°m giao th√¥ng
output_words: 194
complexity: Unknown


In [8]:
# Split data: 90% train, 5% validation, 5% test
train_data, temp_data = train_test_split(data, test_size=0.1, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

print(f"üìä Train: {len(train_data):,} samples")
print(f"üìä Validation: {len(val_data):,} samples")
print(f"üìä Test: {len(test_data):,} samples")

üìä Train: 7,786 samples
üìä Validation: 433 samples
üìä Test: 433 samples


## üìù Chat Template

S·ª≠ d·ª•ng chat template chu·∫©n c·ªßa Llama 3.2 Instruct ƒë·ªÉ ƒë·∫£m b·∫£o t∆∞∆°ng th√≠ch v·ªõi pretrained model:

In [9]:
from unsloth import apply_chat_template

# Chat template chu·∫©n cho Llama 3.2 Instruct
chat_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>"""

def format_data_for_chat_template(examples):
    """Format data for Llama 3.2 chat template"""
    conversations = []
    
    for instruction, input_text, output in zip(examples["instruction"], examples["input"], examples["output"]):
        # Combine instruction and input as the user message
        user_message = f"{instruction}\n\n{input_text}" if input_text else instruction
        
        conversation = [
            {"role": "system", "content": "B·∫°n l√† m·ªôt tr·ª£ l√Ω AI chuy√™n v·ªÅ lu·∫≠t giao th√¥ng Vi·ªát Nam. H√£y tr·∫£ l·ªùi ch√≠nh x√°c v√† chi ti·∫øt c√°c c√¢u h·ªèi v·ªÅ lu·∫≠t giao th√¥ng."},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": output}
        ]
        conversations.append(conversation)
    
    return {"conversations": conversations}

# Convert to HuggingFace Dataset
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)
test_dataset = Dataset.from_list(test_data)

# Format data for chat template
train_dataset = train_dataset.map(format_data_for_chat_template, batched=True)
val_dataset = val_dataset.map(format_data_for_chat_template, batched=True)

# Apply chat template using Unsloth
train_dataset = apply_chat_template(
    train_dataset,
    tokenizer=tokenizer,
    chat_template=chat_template,
    default_system_message="B·∫°n l√† m·ªôt tr·ª£ l√Ω AI chuy√™n v·ªÅ lu·∫≠t giao th√¥ng Vi·ªát Nam. H√£y tr·∫£ l·ªùi ch√≠nh x√°c v√† chi ti·∫øt c√°c c√¢u h·ªèi v·ªÅ lu·∫≠t giao th√¥ng."
)

val_dataset = apply_chat_template(
    val_dataset,
    tokenizer=tokenizer,
    chat_template=chat_template,
    default_system_message="B·∫°n l√† m·ªôt tr·ª£ l√Ω AI chuy√™n v·ªÅ lu·∫≠t giao th√¥ng Vi·ªát Nam. H√£y tr·∫£ l·ªùi ch√≠nh x√°c v√† chi ti·∫øt c√°c c√¢u h·ªèi v·ªÅ lu·∫≠t giao th√¥ng."
)

print("‚úÖ Data formatted with Llama 3.2 chat template")
print("\nüìù Example formatted conversation:")
print(train_dataset[0]['text'][:500] + "...")

Map:   0%|          | 0/7786 [00:00<?, ? examples/s]

Map:   0%|          | 0/433 [00:00<?, ? examples/s]

Map:   0%|          | 0/7786 [00:00<?, ? examples/s]

Map:   0%|          | 0/433 [00:00<?, ? examples/s]

‚úÖ Data formatted with Llama 3.2 chat template

üìù Example formatted conversation:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

B·∫°n l√† m·ªôt tr·ª£ l√Ω AI chuy√™n v·ªÅ lu·∫≠t giao th√¥ng Vi·ªát Nam. H√£y tr·∫£ l·ªùi ch√≠nh x√°c v√† chi ti·∫øt c√°c c√¢u h·ªèi v·ªÅ lu·∫≠t giao th√¥ng.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tr·∫£ l·ªùi c√¢u h·ªèi ph√°p lu·∫≠t sau:

T√†i s·∫£n n√†o c·ªßa c√° nh√¢n kh√¥ng ƒë∆∞·ª£c ti·∫øn h√†nh thi h√†nh √°n?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

CƒÉn c·ª© quy ƒë·ªãnh kho·∫£n 2 ƒêi·ªÅu 87 Lu·∫≠t Thi h√†nh √°n d√¢n s·ª± 2008 quy ƒë·ªãnh v·ªÅ t√†i s·∫£n c·ªßa c√° nh√¢n kh√¥ng ƒë∆∞·ª£c ti·∫øn h√†nh thi h...


## üéì Training Configuration

### Optimized for Kaggle T4 (30h/week limit):
- **Epochs**: 3 (sufficient for legal domain)
- **Batch size**: 4 per device (max for T4 16GB)
- **Gradient accumulation**: 4 steps (effective batch = 16)
- **Learning rate**: 2e-4 (standard for LoRA)
- **Warmup**: 10% of steps
- **FP16**: Enabled for speed
- **Gradient checkpointing**: Unsloth optimized

In [10]:
# Training arguments optimized for Kaggle T4
training_args = TrainingArguments(
    # Output & Logging
    output_dir="./outputs",
    run_name="llama3.2-3b-traffic-law-v1",
    
    # Training dynamics
    num_train_epochs=2,  # 2-3 epochs is usually enough
    per_device_train_batch_size=4,  # Reduced from 4 to better utilize VRAM
    gradient_accumulation_steps=4,  # Increased to maintain effective batch size = 16
    
    # Optimization
    optim="adamw_8bit",  # 8-bit AdamW for memory efficiency
    learning_rate=2e-4,  # Standard for LoRA fine-tuning
    weight_decay=0.01,
    warmup_ratio=0.1,  # 10% warmup
    lr_scheduler_type="cosine",  # Cosine annealing
    
    # Performance
    fp16=not torch.cuda.is_bf16_supported(),  # Use FP16 for T4
    bf16=torch.cuda.is_bf16_supported(),  # Use BF16 if supported (A100, H100)
    
    # Logging & Saving (more frequent for better monitoring)
    logging_steps=5,  # Log every 5 steps for better visibility
    logging_strategy="steps",
    logging_first_step=True,  # Log first step
    save_strategy="steps",
    save_steps=50,  # Save more frequently
    save_total_limit=3,  # Keep only 3 best checkpoints
    
    # Evaluation
    eval_strategy="steps",
    eval_steps=50,  # Evaluate more frequently
    eval_accumulation_steps=1,  # Accumulate eval predictions
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    
    # WandB integration with detailed logging
    report_to="wandb",
    logging_nan_inf_filter=True,  # Filter out NaN/Inf values
    include_inputs_for_metrics=False,  # Don't log inputs (save space)
    
    # Progress bar and output control
    disable_tqdm=False,  # Enable progress bar
    log_level="info",  # Show info messages
    log_level_replica="warning",
    log_on_each_node=True,
    dataloader_num_workers=2,
)

print("‚úÖ Training arguments configured")
print(f"üíæ Per device batch size: {training_args.per_device_train_batch_size}")
print(f"üìä Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"üìà Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"üìä Logging every {training_args.logging_steps} steps")
print(f"üìä Evaluating every {training_args.eval_steps} steps")
print(f"üïê Total training steps: {len(train_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps) * training_args.num_train_epochs}")

‚úÖ Training arguments configured
üíæ Per device batch size: 4
üìä Effective batch size: 16
üìà Gradient accumulation: 4
üìä Logging every 5 steps
üìä Evaluating every 50 steps
üïê Total training steps: 972


In [12]:
# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Can make training 5x faster for short sequences
    args=training_args,
)

print("‚úÖ Trainer initialized")

Unsloth: Tokenizing ["text"] (num_proc=8):   0%|          | 0/7786 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=8):   0%|          | 0/433 [00:00<?, ? examples/s]

Using auto half precision backend


‚úÖ Trainer initialized


## üöÄ Start Training!

**Estimated time on T4**: ~3-4 hours for 3 epochs  
**Memory usage**: ~14-15GB VRAM  
**Kaggle time budget**: ~4h / 30h week (leaves 26h for experiments)

In [13]:
# Show GPU stats before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"üñ•Ô∏è GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"üíæ {start_gpu_memory} GB of memory reserved.")

# Start training
print("\nüöÄ Starting training...\n")
trainer_stats = trainer.train()

# Show final stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)

print("\n" + "="*50)
print("‚úÖ TRAINING COMPLETED!")
print("="*50)
print(f"‚è±Ô∏è Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"üíæ Peak reserved memory: {used_memory} GB")
print(f"üìä Memory used for training: {used_memory_for_lora} GB")
print(f"üìà Percentage of max memory: {used_percentage}%")
print(f"üéØ Final train loss: {trainer_stats.metrics['train_loss']:.4f}")

üñ•Ô∏è GPU = Tesla T4. Max memory = 14.741 GB.
üíæ 3.07 GB of memory reserved.

üöÄ Starting training...



The following columns in the Training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: input, domains, output_words, instruction, traffic_type, output, text, attention_mask, conversations, complexity. If input, domains, output_words, instruction, traffic_type, output, text, attention_mask, conversations, complexity are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
skipped Embedding(128256, 3072, padding_idx=128004): 375.75M params
skipped: 375.75M params
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,786 | Num Epochs = 2 | Total steps = 488
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 48,627,712 of 3,261,377,536 (1.49% trained)
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLE

Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
50,1.4421,1.378838
100,1.2318,1.208347
150,1.1947,1.13906
200,1.1012,1.090938
250,1.0273,1.052491
300,0.9823,1.024429
350,0.953,1.002104
400,0.9856,0.989309
450,0.9709,0.984937


The following columns in the Evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: input, domains, output_words, instruction, traffic_type, output, text, attention_mask, conversations, complexity. If input, domains, output_words, instruction, traffic_type, output, text, attention_mask, conversations, complexity are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 433
  Batch size = 8
Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
Saving model checkpoint to ./outputs/checkpoint-50
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3.2-3b-instruct-bnb-4bit/snapshots/bb1d317a108579fb40e646af8924a5e7ec5604b1/config.json
Mo

0,1
eval/loss,‚ñà‚ñÖ‚ñÑ‚ñÉ‚ñÇ‚ñÇ‚ñÅ‚ñÅ‚ñÅ
eval/runtime,‚ñà‚ñÉ‚ñÑ‚ñÑ‚ñÅ‚ñÖ‚ñÖ‚ñÜ‚ñÇ
eval/samples_per_second,‚ñÅ‚ñÜ‚ñÖ‚ñÖ‚ñà‚ñÖ‚ñÑ‚ñÑ‚ñá
eval/steps_per_second,‚ñÅ‚ñÖ‚ñÖ‚ñÖ‚ñà‚ñÖ‚ñÖ‚ñÖ‚ñà
train/epoch,‚ñÅ‚ñÅ‚ñÅ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñá‚ñá‚ñà‚ñà‚ñà‚ñà‚ñà
train/global_step,‚ñÅ‚ñÅ‚ñÅ‚ñÅ‚ñÅ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÉ‚ñÉ‚ñÉ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÜ‚ñÜ‚ñÜ‚ñÜ‚ñá‚ñá‚ñá‚ñá‚ñá‚ñá‚ñá‚ñá‚ñà‚ñà‚ñà‚ñà
train/grad_norm,‚ñà‚ñÇ‚ñÇ‚ñÅ‚ñÇ‚ñÑ‚ñÅ‚ñÇ‚ñÅ‚ñÇ‚ñÅ‚ñÇ‚ñÅ‚ñÉ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÅ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÉ‚ñÇ‚ñÉ‚ñÇ‚ñÉ‚ñÇ‚ñÉ‚ñÇ‚ñÇ‚ñÉ‚ñÇ‚ñÇ
train/learning_rate,‚ñÅ‚ñÉ‚ñÑ‚ñÑ‚ñá‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñá‚ñá‚ñá‚ñÜ‚ñÜ‚ñÜ‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÑ‚ñÑ‚ñÑ‚ñÑ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÅ‚ñÅ‚ñÅ‚ñÅ‚ñÅ
train/loss,‚ñà‚ñÖ‚ñÖ‚ñÖ‚ñÖ‚ñÑ‚ñÑ‚ñÉ‚ñÉ‚ñÑ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÉ‚ñÇ‚ñÉ‚ñÉ‚ñÉ‚ñÇ‚ñÇ‚ñÅ‚ñÇ‚ñÇ‚ñÅ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÇ‚ñÅ‚ñÇ‚ñÇ‚ñÅ‚ñÅ‚ñÇ‚ñÅ

0,1
eval/loss,0.98494
eval/runtime,217.7863
eval/samples_per_second,1.988
eval/steps_per_second,0.253
total_flos,2.3628365978903347e+17
train/epoch,2.0
train/global_step,488.0
train/grad_norm,0.28392
train/learning_rate,0.0
train/loss,0.8984



‚úÖ TRAINING COMPLETED!
‚è±Ô∏è Training time: 17256.84 seconds
üíæ Peak reserved memory: 11.223 GB
üìä Memory used for training: 8.153 GB
üìà Percentage of max memory: 76.135%
üéØ Final train loss: 1.1240


## üìä Evaluation

In [14]:
# Evaluate on validation set
print("üìä Evaluating on validation set...\n")
eval_results = trainer.evaluate()

print("="*50)
print("VALIDATION RESULTS")
print("="*50)
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}" if isinstance(value, float) else f"{key}: {value}")

# Log to WandB
wandb.log({"final_eval_loss": eval_results['eval_loss']})

The following columns in the Evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: input, domains, output_words, instruction, traffic_type, output, text, attention_mask, conversations, complexity. If input, domains, output_words, instruction, traffic_type, output, text, attention_mask, conversations, complexity are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 433
  Batch size = 8


üìä Evaluating on validation set...



Error: You must call wandb.init() before wandb.log()

## üß™ Inference Testing

In [15]:
# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

def test_model(instruction, input_text, max_new_tokens=512):
    """Test model with chat template format"""
    # Create conversation format
    user_message = f"{instruction}\n\n{input_text}" if input_text else instruction
    
    messages = [
        {"role": "system", "content": "B·∫°n l√† m·ªôt tr·ª£ l√Ω AI chuy√™n v·ªÅ lu·∫≠t giao th√¥ng Vi·ªát Nam. H√£y tr·∫£ l·ªùi ch√≠nh x√°c v√† chi ti·∫øt c√°c c√¢u h·ªèi v·ªÅ lu·∫≠t giao th√¥ng."},
        {"role": "user", "content": user_message}
    ]
    
    # Apply chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        use_cache=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    response = tokenizer.batch_decode(outputs)[0]
    # Extract only the response part (after the last assistant header)
    if "<|start_header_id|>assistant<|end_header_id|>" in response:
        response = response.split("<|start_header_id|>assistant<|end_header_id|>")[-1]
        response = response.split("<|eot_id|>")[0].strip()
    
    return response

# Test with samples from test set
print("üß™ Testing model on random samples...\n")
print("="*80)

import random
test_samples = random.sample(test_data, 3)

for i, sample in enumerate(test_samples, 1):
    print(f"\n{'='*80}")
    print(f"TEST SAMPLE #{i}")
    print(f"{'='*80}")
    print(f"\nüìù Instruction: {sample['instruction']}")
    print(f"\n‚ùì Input: {sample['input']}")
    print(f"\nüéØ Expected Output:\n{sample['output'][:300]}...")
    
    # Generate response
    response = test_model(sample['instruction'], sample['input'])
    print(f"\nü§ñ Model Response:\n{response}")
    print(f"\n{'='*80}")

üß™ Testing model on random samples...


TEST SAMPLE #1

üìù Instruction: Tr·∫£ l·ªùi c√¢u h·ªèi ph√°p lu·∫≠t sau:

‚ùì Input: C√°c h√†nh vi n√†o b·ªã c·∫•m trong ho·∫°t ƒë·ªông qu·∫£ng c√°o?

üéØ Expected Output:
Theo quy ƒë·ªãnh t·∫°i ƒêi·ªÅu 8 Lu·∫≠t Qu·∫£ng c√°o 2012 th√¨ c√°c h√†nh vi c·∫•m trong ho·∫°t ƒë·ªông qu·∫£ng c√°o bao g·ªìm: - Qu·∫£ng c√°o nh·ªØng s·∫£n ph·∫©m, h√†ng h√≥a, d·ªãch v·ª• quy ƒë·ªãnh t·∫°i ƒêi·ªÅu 7 Lu·∫≠t Qu·∫£ng c√°o 2012. - Qu·∫£ng c√°o l√†m ti·∫øt l·ªô b√≠ m·∫≠t nh√† n∆∞·ªõc, ph∆∞∆°ng h·∫°i ƒë·∫øn ƒë·ªôc l·∫≠p, ch·ªß quy·ªÅn qu·ªëc gia, an ninh, qu·ªëc ph√≤ng. - Qu·∫£ng...

ü§ñ Model Response:
CƒÉn c·ª© ƒêi·ªÅu 8 Lu·∫≠t Qu·∫£ng c√°o 2012 quy ƒë·ªãnh c√°c h√†nh vi b·ªã c·∫•m trong ho·∫°t ƒë·ªông qu·∫£ng c√°o nh∆∞ sau: - Qu·∫£ng c√°o c√≥ n·ªôi dung x√¢m ph·∫°m l·ª£i √≠ch qu·ªëc gia, l·ª£i √≠ch c√¥ng c·ªông, an ninh, qu·ªëc ph√≤ng, tr·∫≠t t·ª±, an to√†n x√£ h·ªôi, nh√¢n ph·∫©m, danh d·ª±, nh√¢n ph·∫©m, uy t√≠n c·ªßa t·ªï ch·ª©c, danh d·ª±, nh√¢n ph·∫©m,

## üíæ Save Model

In [16]:
# Save LoRA adapters (only ~100-200MB!)
#model.save_pretrained("vietnamese_legal_lora")
#tokenizer.save_pretrained("vietnamese_legal_lora")

#rint("‚úÖ LoRA adapters saved to: vietnamese_legal_lora/")
#print("üì¶ Size: ~100-200MB (adapters only)")

# Optional: Save merged model (full size ~6GB)
model.save_pretrained_merged("vietnamese_legal_merged", tokenizer, save_method="merged_16bit")
print("‚úÖ Merged model saved to: vietnamese_legal_merged/")

config.json:   0%|          | 0.00/890 [00:00<?, ?B/s]

Configuration saved in vietnamese_legal_merged/config.json


Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Checking cache directory for required files...
Cache check failed: model-00001-of-00002.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  50%|‚ñà‚ñà‚ñà‚ñà‚ñà     | 1/2 [00:13<00:13, 13.25s/it]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2/2 [00:22<00:00, 11.41s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2/2 [00:41<00:00, 20.57s/it]


Unsloth: Merge process complete. Saved to `/kaggle/working/vietnamese_legal_merged`
‚úÖ Merged model saved to: vietnamese_legal_merged/


In [None]:
import os
import shutil
from google.colab import files

# Detect environment
is_colab = "COLAB_" in "".join(os.environ.keys())
is_kaggle = os.path.exists("/kaggle/working")

if is_colab:
    print("üîç Detected: Google Colab")
    print("üì¶ Compressing models for download...\n")
    
    from google.colab import files
    
    # Zip LoRA adapters
    if os.path.exists("vietnamese_legal_lora"):
        shutil.make_archive('vietnamese_legal_lora', 'zip', 'vietnamese_legal_lora')
        print("‚úÖ LoRA adapters zipped: vietnamese_legal_lora.zip (~100-200MB)")
        files.download('vietnamese_legal_lora.zip')
    
    # Zip merged model (if exists)
if os.path.exists("vietnamese_legal_merged"):
        print("\n‚ö†Ô∏è Merged model is ~6GB, this may take a while...")
        zip_path = shutil.make_archive('vietnamese_legal_merged', 'zip', 'vietnamese_legal_merged')
        print(f"‚úÖ Merged model zipped: {zip_path} (~6GB)")
        if files:
            files.download(zip_path)
        else:
            print(f"üìÇ File saved at: {os.path.abspath(zip_path)}")
    

elif is_kaggle:
    print("üîç Detected: Kaggle")
    print("üìÇ Models are saved in the output directory.")
    print("‚ÑπÔ∏è After notebook execution completes:")
    print("  1. Go to 'Output' tab on the right")
    print("  2. Download the following folders:")
    print("     - vietnamese_legal_lora/ (~100-200MB)")
    print("     - vietnamese_legal_merged/ (~6GB, if saved)")
    print("     - *.gguf files (if exported)")
    print("\nüí° Tip: Models in the output directory will be available for 7 days.")

else:
    print("üîç Detected: Local environment")
    print("üìÇ Models saved at:")
    
    if os.path.exists("vietnamese_legal_lora"):
        lora_path = os.path.abspath("vietnamese_legal_lora")
        print(f"  ‚úÖ LoRA adapters: {lora_path}")
    
    if os.path.exists("vietnamese_legal_merged"):
        merged_path = os.path.abspath("vietnamese_legal_merged")
        print(f"  ‚úÖ Merged model: {merged_path}")
    
    gguf_files = [f for f in os.listdir('.') if f.endswith('.gguf')]
    if gguf_files:
        print(f"  ‚úÖ GGUF models: {os.path.abspath('.')}")
        for gguf_file in gguf_files:
            print(f"     - {gguf_file}")
    
    print("\n‚úÖ Models are already on your local machine!")

In [19]:
# Upload merged model to HuggingFace Hub (gi·∫£i ph√°p cho file l·ªõn!)
# B∆∞·ªõc 1: T·∫°o HuggingFace account t·∫°i https://huggingface.co/join
# B∆∞·ªõc 2: T·∫°o token t·∫°i https://huggingface.co/settings/tokens (ch·ªçn "Write" permission)
# B∆∞·ªõc 3: Th√™m token v√†o Kaggle Secrets v·ªõi key "HF_TOKEN"

import os

if os.path.exists("/kaggle/working"):
    print("üöÄ Uploading model to HuggingFace Hub...")
    print("="*70)
    
    try:
        from kaggle_secrets import UserSecretsClient
        user_secrets = UserSecretsClient()
        hf_token = user_secrets.get_secret("HF_TOKEN")
        
        from huggingface_hub import HfApi, login
        
        # Login to HuggingFace
        login(token=hf_token)
        print("‚úÖ Logged in to HuggingFace")
        
        # Thay YOUR_USERNAME b·∫±ng username HuggingFace c·ªßa b·∫°n
        YOUR_HF_USERNAME = "mikeethanh"  # ‚ö†Ô∏è S·ª¨A D√íNG N√ÄY!
        repo_name = f"{YOUR_HF_USERNAME}/vietnamese-legal-llama3.2-3b-merged-sft-v1"
        
        print(f"\nüì§ Uploading to: {repo_name}")
        print("‚è≥ ƒêang upload ~6GB, c√≥ th·ªÉ m·∫•t 10-15 ph√∫t...\n")
        
        # Upload merged model
        if os.path.exists("vietnamese_legal_merged"):
            from huggingface_hub import create_repo, upload_folder
            
            # Create repo (public)
            try:
                create_repo(repo_name, repo_type="model", exist_ok=True)
                print(f"‚úÖ Repository created: https://huggingface.co/{repo_name}")
            except:
                print(f"‚ÑπÔ∏è Repository already exists: https://huggingface.co/{repo_name}")
            
            # Upload folder
            upload_folder(
                folder_path="vietnamese_legal_merged",
                repo_id=repo_name,
                commit_message="Vietnamese Legal AI - Llama 3.2 3B Merged Model",
            )
            
            print("\n" + "="*70)
            print("‚úÖ UPLOAD TH√ÄNH C√îNG!")
            print("="*70)
            print(f"\nüì• Download model v·ªÅ m√°y b·∫±ng c√°ch:")
            print(f"   git clone https://huggingface.co/{repo_name}")
            print(f"\nüåê Ho·∫∑c xem tr√™n web:")
            print(f"   https://huggingface.co/{repo_name}")
            print("\nüí° Model ƒë√£ public, ai c≈©ng c√≥ th·ªÉ download!")
        else:
            print("‚ö†Ô∏è Folder 'vietnamese_legal_merged' not found!")
            
    except Exception as e:
        print(f"‚ùå Error: {e}")
        print("\nüìù H∆∞·ªõng d·∫´n fix:")
        print("  1. T·∫°o account t·∫°i: https://huggingface.co/join")
        print("  2. T·∫°o token t·∫°i: https://huggingface.co/settings/tokens")
        print("  3. Kaggle: Add-ons ‚Üí Secrets ‚Üí Add 'HF_TOKEN'")
        print("  4. S·ª≠a YOUR_USERNAME trong code")
        
else:
    print("‚ÑπÔ∏è This cell only works on Kaggle")
    print("üí° For local, use: model.push_to_hub() directly")

üöÄ Uploading model to HuggingFace Hub...
‚úÖ Logged in to HuggingFace

üì§ Uploading to: mikeethanh/vietnamese-legal-llama3.2-3b-merged-sft-v1
‚è≥ ƒêang upload ~6GB, c√≥ th·ªÉ m·∫•t 10-15 ph√∫t...

‚úÖ Repository created: https://huggingface.co/mikeethanh/vietnamese-legal-llama3.2-3b-merged-sft-v1


Processing Files (0 / 0): |          |  0.00B /  0.00B            

New Data Upload: |          |  0.00B /  0.00B            


‚úÖ UPLOAD TH√ÄNH C√îNG!

üì• Download model v·ªÅ m√°y b·∫±ng c√°ch:
   git clone https://huggingface.co/mikeethanh/vietnamese-legal-llama3.2-3b-merged-sft-v1

üåê Ho·∫∑c xem tr√™n web:
   https://huggingface.co/mikeethanh/vietnamese-legal-llama3.2-3b-merged-sft-v1

üí° Model ƒë√£ public, ai c≈©ng c√≥ th·ªÉ download!


## üì§ Push to HuggingFace Hub (Optional)

In [None]:
# Uncomment to push to HuggingFace Hub
# You need to login first: huggingface-cli login

# model.push_to_hub(
#     "your-username/vietnamese-legal-llama3.2-3b-lora",
#     token="your_hf_token",
#     commit_message="Vietnamese Legal AI - Traffic Law QA"
# )
# tokenizer.push_to_hub(
#     "your-username/vietnamese-legal-llama3.2-3b-lora",
#     token="your_hf_token"
# )

# print("‚úÖ Model pushed to HuggingFace Hub!")

## üìä Quantization Export (for deployment)

In [None]:
# Export to GGUF for llama.cpp / Ollama deployment
# Uncomment the quantization method you want

quantization_methods = [
    "q8_0",    # Fast inference, good quality (recommended)
    # "q4_k_m",  # Smaller size, still good quality
    # "q5_k_m",  # Balance between size and quality
]

for method in quantization_methods:
    print(f"\nüì¶ Exporting to {method.upper()}...")
    model.save_pretrained_gguf(
        "vietnamese_legal_model",
        tokenizer,
        quantization_method=method,
    )
    print(f"‚úÖ Exported: vietnamese_legal_model-{method.upper()}.gguf")

print("\n‚úÖ All quantization exports completed!")
print("üìù You can now use these with Ollama or llama.cpp")

## üéâ Finish & Cleanup

In [None]:
# Finish WandB run
wandb.finish()

# Clear GPU memory
del model
del trainer
gc.collect()
torch.cuda.empty_cache()

print("‚úÖ Training completed successfully!")
print("\nüìä Summary:")
print(f"  - Model: Llama-3.2-3B-Instruct")
print(f"  - Training samples: {len(train_data):,}")
print(f"  - Validation samples: {len(val_data):,}")
print(f"  - Test samples: {len(test_data):,}")
print(f"  - Training time: ~{trainer_stats.metrics['train_runtime']/3600:.2f} hours")
print(f"  - Final eval loss: {eval_results['eval_loss']:.4f}")
print("\nüìÇ Saved outputs:")
print("  - LoRA adapters: vietnamese_legal_lora/")
print("  - GGUF models: vietnamese_legal_model-*.gguf")
print("\nüéØ Next steps:")
print("  1. Test model on more samples")
print("  2. Deploy with Ollama or llama.cpp")
print("  3. Collect feedback and iterate")

---

## üìö References & Resources

- **Unsloth**: https://github.com/unslothai/unsloth
- **Unsloth Docs**: https://docs.unsloth.ai
- **WandB**: https://wandb.ai
- **Llama 3.2**: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

## üí° Tips for Better Results

1. **More data**: Collect more Vietnamese legal Q&A pairs
2. **Data quality**: Clean and verify answers
3. **Hyperparameter tuning**: Try different learning rates (1e-4, 5e-5)
4. **Longer training**: Try 4-5 epochs if not overfitting
5. **Larger model**: Try Llama-3.2-11B if you have more GPU
6. **Domain adaptation**: Continue pretraining on legal documents first

## üêõ Troubleshooting

- **OOM (Out of Memory)**: Reduce batch size or max_seq_length
- **Slow training**: Enable packing=True for short sequences
- **Poor results**: Increase LoRA rank or training epochs
- **Overfitting**: Reduce epochs or add more data augmentation