# LLaDA Benchmark Playground

This notebook runs benchmark experiments for **quantitative evaluation**.

## Purpose
- A/B tests on academic benchmarks (GSM8K, MMLU)
- Comparison of baseline vs. experimental sampling strategies
- Metrics: Accuracy, Perplexity, Stability, Survival Rate, Correction Efficacy
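
As a rough orientation for the first two metrics, the sketch below computes accuracy and perplexity from per-sample data and takes the experimental-minus-baseline difference. The helper names and the sign convention for the delta are assumptions made for illustration; the definitions inside `experiment_utils` may differ, and the remaining metrics (Stability, Survival Rate, Correction Efficacy) are not covered because the notebook does not define them.

```python
import math


def accuracy(correct_flags):
    """Fraction of samples answered correctly (0.0 for an empty input)."""
    flags = list(correct_flags)
    return sum(flags) / len(flags) if flags else 0.0


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the given tokens."""
    lps = list(token_log_probs)
    if not lps:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(lps) / len(lps))


# Synthetic values for illustration only (not benchmark results).
base_flags = [True, False, True, False]
exp_flags = [True, True, True, False]

# Assumed convention: delta = experimental - baseline.
acc_delta = accuracy(exp_flags) - accuracy(base_flags)
ppl_uniform = perplexity([math.log(0.5)] * 4)  # every token has p = 0.5
```

Under this convention a positive accuracy delta favors the experimental strategy, while a positive perplexity delta would indicate less confident text.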

In [None]:
import os
import sys
import torch
import pandas as pd
from transformers import AutoTokenizer

# Add current directory to path
current_dir = os.getcwd()
if current_dir not in sys.path:
    sys.path.append(current_dir)

# Import local modules
from modeling_llada import LLaDAModelLM
from configuration_llada import LLaDAConfig
import experiment_utils
import decoding
from tta_uncertainty_sampling import generate_with_tta_uncertainty

print("Modules loaded successfully.")

## 1. Load Model

In [None]:
LOCAL_MODEL_PATH = "../Grok-1-LLaDA-8B"
HF_MODEL_ID = "GSAI-ML/LLaDA-8B-Base"

model_path = HF_MODEL_ID
if os.path.exists(LOCAL_MODEL_PATH):
    model_path = LOCAL_MODEL_PATH
    print(f"Using local model: {model_path}")
else:
    print(f"Using HuggingFace model: {model_path}")

config = LLaDAConfig.from_pretrained(model_path)
model = LLaDAModelLM.from_pretrained(model_path, config=config, torch_dtype="auto")

if torch.cuda.is_available():
    model.cuda()
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
print("Model loaded successfully.")

## 2. Run Academic Benchmark

This cell benchmarks **TTA Uncertainty Sampling** on the GSM8K and MMLU datasets.

### Configuration options:
- **Temporal Decay test**: `experimental_fn=None` (default)
- **TTA test**: `experimental_fn=generate_with_tta_uncertainty` (current setting)
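
The two options above follow a common pattern: an optional callable whose `None` default selects a built-in strategy. The sketch below illustrates that pattern with stand-in functions. `temporal_decay_sampler`, `tta_sampler`, and `run_one` are hypothetical names, and the real dispatch inside `experiment_utils.run_academic_benchmark` is not shown in this notebook and may work differently.

```python
from typing import Callable, Optional


def temporal_decay_sampler(prompt: str, threshold: float) -> str:
    # Hypothetical stand-in for the built-in default strategy.
    return f"temporal_decay(alpha={threshold}): {prompt}"


def tta_sampler(prompt: str, threshold: float) -> str:
    # Hypothetical stand-in for generate_with_tta_uncertainty.
    return f"tta(k={int(threshold)}): {prompt}"


def run_one(
    prompt: str,
    threshold: float,
    experimental_fn: Optional[Callable[[str, float], str]] = None,
) -> str:
    # None selects the default strategy; any callable overrides it.
    fn = temporal_decay_sampler if experimental_fn is None else experimental_fn
    return fn(prompt, threshold)


print(run_one("2 + 2 = ?", 0.9))
print(run_one("2 + 2 = ?", 3, experimental_fn=tta_sampler))
```

Note that the same swept value is handed to whichever strategy is active, so its meaning (a decay factor or a TTA pass count) depends on the function selected.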

In [None]:
# Benchmark Configuration
TTA_K_VALUES = [3, 5, 7]  # Number of TTA forward passes
N_SAMPLES = 20  # Number of samples per task (quick test; the full run uses 50)
STEPS = 32  # Quick test (the full run uses 64)
GEN_LENGTH = 32
BLOCK_LENGTH = 32

print("Starting TTA Uncertainty Sampling Benchmark")
print(f"TTA K Values: {TTA_K_VALUES}")
print(f"Samples per task: {N_SAMPLES}")
print(f"Expected time: ~{N_SAMPLES * len(TTA_K_VALUES) * 2} minutes")

results_df = experiment_utils.run_academic_benchmark(
    model=model,
    tokenizer=tokenizer,
    baseline_fn=None,  # Uses inspect_sampling (no remasking)
    experimental_fn=generate_with_tta_uncertainty,  # TTA algorithm
    thresholds=TTA_K_VALUES,  # Different tta_k values to test
    samples=N_SAMPLES,
    steps=STEPS,
    gen_length=GEN_LENGTH,
    block_length=BLOCK_LENGTH
)

print("\nBenchmark completed!")
print(f"Total results: {len(results_df)} rows")

## 3. Analyze Results

In [None]:
# Display comprehensive analysis
experiment_utils.analyze_icml_results(results_df)

## 4. Save Results

In [None]:
# Save to CSV for further analysis
output_file = "tta_benchmark_results.csv"
results_df.to_csv(output_file, index=False)
print(f"Results saved to {output_file}")

## 5. Detailed Inspection (Optional)

To inspect specific cases in detail, use the cell below.

In [None]:
# Filter by category; in this run the AlphaDecay column holds the TTA K value
print("\n=== Math Task Results ===")
math_results = results_df[results_df['Category'] == 'math']
print(math_results.groupby('AlphaDecay')[['Acc_Exp', 'PPL_Delta', 'Stability_Delta']].mean())

print("\n=== Logic Task Results ===")
logic_results = results_df[results_df['Category'] == 'logic']
print(logic_results.groupby('AlphaDecay')[['Acc_Exp', 'PPL_Delta', 'Stability_Delta']].mean())
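
The two per-category blocks above can also be expressed as a single grouped aggregation. The sketch below shows this on a small synthetic frame that reuses the notebook's column names; the values are invented purely to make the example runnable and are not benchmark output.

```python
import pandas as pd

# Synthetic rows for illustration only; not real benchmark output.
toy = pd.DataFrame(
    {
        "Category": ["math", "math", "math", "math", "logic", "logic"],
        "AlphaDecay": [3, 3, 5, 5, 3, 5],  # holds the TTA K value in this run
        "Acc_Exp": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
        "PPL_Delta": [-0.2, 0.4, -0.1, -0.3, 0.5, 0.1],
    }
)

# One row per (Category, AlphaDecay) pair, averaged over samples.
per_k = toy.groupby(["Category", "AlphaDecay"])[["Acc_Exp", "PPL_Delta"]].mean()
print(per_k)
```

Grouping by both keys at once avoids repeating the filter for every category and keeps the results in one table.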

## 6. TTA-Specific Analysis

This section analyzes the uncertainty metrics of the TTA algorithm.

In [None]:
# Summary by TTA K
print("\n=== Performance by TTA K ===")
summary = results_df.groupby('AlphaDecay')[['Acc_Exp', 'PPL_Delta', 'Stability_Delta', 'Survival', 'Correction_Eff']].mean()
print(summary)

# Best configuration
best_k = summary['Acc_Exp'].idxmax()
best_acc = summary['Acc_Exp'].max()
baseline_acc = results_df['Acc_Base'].mean()

print("\n=== Best Configuration ===")
print(f"Best TTA K: {best_k}")
print(f"Accuracy: {best_acc:.2%} (Baseline: {baseline_acc:.2%})")
print(f"Improvement: {best_acc - baseline_acc:+.2%}")
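
With only a few dozen samples per task, a difference in mean accuracy can fall within sampling noise. One way to gauge this is a paired percentile bootstrap over per-sample correctness, sketched below. This is not part of `experiment_utils`; `paired_bootstrap_ci` is a hypothetical helper, and it assumes that baseline and experimental correctness flags are available for the same samples in the same order.

```python
import random


def paired_bootstrap_ci(base_flags, exp_flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(exp) - mean(base) on paired samples."""
    if len(base_flags) != len(exp_flags) or not base_flags:
        raise ValueError("need two non-empty, equal-length sequences")
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(base_flags)
    # Per-sample difference: +1 improved, 0 unchanged, -1 regressed.
    diffs = [int(e) - int(b) for b, e in zip(base_flags, exp_flags)]
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Synthetic flags for illustration only (not benchmark results).
base = [True, False, False, True, False, True, False, False]
exp = [True, True, False, True, False, True, True, False]
low, high = paired_bootstrap_ci(base, exp)
print(f"95% CI for accuracy difference: [{low:+.2f}, {high:+.2f}]")
```

If the interval includes zero, the measured improvement is not clearly distinguishable from noise at that sample size, and rerunning with the full sample count may be worthwhile.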