<a href="https://colab.research.google.com/github/joo9906/AI_study/blob/main/coding_challenge(internship)_ipynb%EC%9D%98_%EC%82%AC%EB%B3%B8%EC%9D%98_%EC%82%AC%EB%B3%B8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Coding Challenge: LLM Performance Optimization

### Overview
Your task is to enhance the performance of a small language model (e.g. Qwen2.5-0.5B) on mathematical reasoning tasks. You'll have the freedom to explore various optimization techniques while maintaining reproducibility and providing clear documentation of your methodology.

### Challenge Requirements
* Improve the model's performance on the GSM8K benchmark using Qwen2.5-0.5B as your foundation model
* Document your experimental process and methodology thoroughly
* Ensure your solution is fully reproducible
* Submit all necessary code, model checkpoints, and documentation

### Available Optimization Approaches
You have flexibility in your approach and can explore techniques such as:
* Fine-tuning strategy optimization
* Custom architecture modifications
* Dataset curation and synthesis
* Hyperparameter optimization
* Template and tokenizer configuration adjustments

### Technical Guidelines
* While our baseline implementation uses liger-kernel, you're welcome to explore alternative optimization methods (e.g., PEFT, spectrum)
* You can implement custom components such as:
  * Custom dataset classes
  * Specialized data collators
  * Modified training loops
* You may leverage larger models (>7B) for data synthesis or knowledge distillation

### Evaluation Criteria
* Primary metric: GSM8K benchmark performance
  * Baseline score (Qwen2.5-0.5B-Instruct): 41.6
  * Evaluation using EleutherAI's lm-evaluation-harness

Note: Even if significant score improvements aren't achieved, strong technical analysis and well-reasoned experimentation will be valued highly.

### Deliverables
Required:
* Complete notebook (ipynb or Google Colab format)
* Final model weights and tokenizer (shared via HuggingFace Hub)

Optional:
* Supplementary analysis report (PDF)
* Additional experimental results and ablation studies

In [25]:
import os
folder = r"G:\SSAFY\About_Code\AI_study\Assignment\Allganize_model\results\merged"
print("\n".join(os.listdir(folder)))


added_tokens.json
chat_template.jinja
config.json
generation_config.json
merges.txt
model.safetensors
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.json


In [None]:
!python -m pip install -q --upgrade pip
!pip install -q -U datasets
!pip install -q -U transformers
!pip install -q -U trl
!pip install -q -U peft
!pip install -q -U bitsandbytes
!pip install -q -U accelerate
!pip install -q huggingface_hub==0.35.3
!pip install -q -U vllm
!pip install -q -U mlflow
!pip install -q -U python-dotenv

In [None]:
!pip show peft

In [None]:
from huggingface_hub import login
from dotenv import load_dotenv
load_dotenv()
import os

token = os.getenv("TOKEN")

login(token=token)

In [None]:
# # (Optional) Mount Google Drive. If you are not using Colab, leave the code below commented out.
# from google.colab import drive
# drive.mount('/gdrive', force_remount=True)
# drive.mount('/content/drive')

In [None]:
# # (Optional) If you use Google Drive, the code below caches the model to save time and lets you store training data on Drive.
# # If you are running Jupyter locally, set your local caching directory in `cache_dir`.
# import locale
# def getpreferredencoding(do_setlocale = True):
#     return "UTF-8"
# locale.getpreferredencoding = getpreferredencoding

import os
cache_dir = r"G:\SSAFY\About_Code\AI_study\Assignment\Allganize_model"
os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists

output_dir = os.path.join(cache_dir, "results")
ML_dir = os.path.join(cache_dir, "logs")

In [None]:
model_id = "Qwen/Qwen2.5-0.5B"

local_path = model_id
local_save_path = os.path.join(cache_dir, local_path)

In [None]:
from huggingface_hub import snapshot_download
import os

def download_model_repo(repo_id, local_dir):
    # Download the whole repository to the specified local directory
    repo_path = snapshot_download(repo_id=repo_id,
                                  cache_dir=local_dir,
                                  local_dir=local_dir,
                                  local_dir_use_symlinks=False)

    print(f"Repository is saved to: {repo_path}")

def main():
    download_model_repo(model_id, local_save_path)
    print()

if __name__ == "__main__":
    main()

In [None]:
from datasets import load_dataset

ds = load_dataset("AI-MO/NuminaMath-CoT", split="train")
ds[0]

In [None]:
# Randomly sample 5,000 examples
# (reduced from 20,000 so that training can run in a local environment)
sampled_ds = ds.shuffle(seed=42).select(range(5000))

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    dtype=torch.float16,
    cache_dir=cache_dir)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_id,
                                          padding_side='left',
                                          truncation_side='left')

#### Step-by-step reasoning is a standard way to improve mathematical reasoning, so the prompt asks for it. Few-shot examples are also included to encourage stepwise solutions.

In [None]:
chat_template = """
You are a math reasoning assistant.
You will try to solve GSM8K problems.

Solve problems by explaining step-by-step reasoning first.
Use the few-shot examples as a style guide, then solve the real problems.
Always end your solution with a single line:
Final Answer: <number>

Example 1:
Q: 2 + 3 = ?
A: 2 + 3 = 5.

Example 2:
Q: If a pen costs $2 and a notebook costs $4, total for 3 pens and 2 notebooks?
A: 3×2 + 2×4 = 6 + 8 = 14.

Now your turn.
{% for message in messages %}
{% if message['role'] == 'user' %}
Q: {{ message['content'] }}
Let's think step by step.
{% elif message['role'] == 'assistant' %}
A: {{ message['content'] }}
{% endif %}
{% endfor %}"""

tokenizer.chat_template = chat_template

tokenizer.eos_token = "<|im_end|>"
tokenizer.eos_token_id = 151645

if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.generation_config.eos_token_id = tokenizer.eos_token_id
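The template above asks the model to close every solution with a `Final Answer: <number>` line. As a minimal sketch of how that line could be checked during local spot checks, the hypothetical helpers below (`extract_final_answer`, `extract_gold`) pull the number out of a completion and out of a GSM8K reference solution, which ends in `#### <answer>`. They are not part of lm-evaluation-harness, whose own answer filters may behave differently.

```python
import re

# Matches e.g. "Final Answer: 14", "Final Answer: $1,234" or "Final Answer: -3.5"
FINAL_RE = re.compile(r"Final Answer:\s*\$?(-?[\d,]*\.?\d+)")


def extract_final_answer(completion):
    """Return the last 'Final Answer' number as a string, or None if absent."""
    matches = FINAL_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def extract_gold(solution):
    """GSM8K reference solutions end with '#### <answer>'."""
    return solution.split("####")[-1].strip().replace(",", "")


completion = "3*2 + 2*4 = 6 + 8 = 14.\nFinal Answer: 14"
gold = "3 pens cost 6 and 2 notebooks cost 8, so the total is 14.\n#### 14"
print(extract_final_answer(completion) == extract_gold(gold))
```

Comparing the two strings is deliberately simple; a stricter check might convert both to numbers first.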

In [None]:
def format_prompt_func(sample):
  # Render the conversation with the custom chat template and return the training string
  return tokenizer.apply_chat_template(sample['messages'], tokenize=False, add_generation_prompt=False)

In [None]:
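# Apply the prompt formatting (required before training)
# num_proc=1 avoids multiprocessing issues (the function references the global tokenizer)
# Dataset.map() needs a dict back, so the formatted string is wrapped in a 'text' column
sampled_ds = sampled_ds.map(lambda sample: {"text": format_prompt_func(sample)}, num_proc=1)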
sampled_ds = sampled_ds.train_test_split(test_size=0.1, seed=42)

In [None]:
from transformers import TrainingArguments, Trainer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
import mlflow
import os

# Initial mlflow setup; end any run that may still be active from a previous session
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("Allganize_optimize_qwen2.5B_Home")
mlflow.end_run()

peft_config = LoraConfig(
  r = 32,
  lora_alpha=32,
  target_modules=["q_proj", "v_proj"],
  lora_dropout=0.05,
  bias='none',
  task_type="CAUSAL_LM"
)


training_arguments = SFTConfig(
    dataset_text_field='text',
    output_dir=os.path.join(cache_dir, "results"),
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    save_strategy='epoch',
    eval_strategy='steps',  # needed for eval_steps to take effect
    eval_steps=0.1,
    logging_steps=10,
    learning_rate=2e-5,
    weight_decay=0.01,
    # max_seq_len=2048,   # no longer present in SFTConfig as of TRL 0.24.0
    max_grad_norm=1,
    max_steps=-1,
    warmup_ratio=0.05,
    packing=True,
    lr_scheduler_type="cosine",
    #use_liger=True,    # no longer present in SFTConfig as of TRL 0.24.0
    report_to=["mlflow"],   # track the run with mlflow
    bf16=False,
    fp16=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=sampled_ds['train'],
    eval_dataset=sampled_ds['test'],
    args=training_arguments,
    formatting_func=format_prompt_func,
    peft_config=peft_config,
)

with mlflow.start_run(run_name="Local_T4_Qwen_finetune"):
  mlflow.log_param("model_id", "Qwen/Qwen2.5-0.5B")
  mlflow.log_param("dataset", "AI-MO/NuminaMath-CoT")
  
  result = trainer.train()
  mlflow.log_metric("train_loss", result.training_loss)
  
  trainer.save_model(output_dir)
  tokenizer.save_pretrained(output_dir)

mlflow.end_run()
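For a sense of scale, the LoRA settings above (`r = 32` on `q_proj` and `v_proj`) add `r * (in_features + out_features)` parameters per adapted matrix. The sketch below tallies that total. It assumes the published Qwen2.5-0.5B shapes (hidden size 896, 24 layers, 14 attention heads and 2 key/value heads of dimension 64). These figures are not stated in this notebook and should be checked against the model's `config.json`. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """A LoRA pair is A (r x in_features) plus B (out_features x r)."""
    return r * (in_features + out_features)


# Assumed Qwen2.5-0.5B shapes; verify against config.json
HIDDEN = 896
NUM_LAYERS = 24
NUM_KV_HEADS = 2
HEAD_DIM = 64

# Values taken from the LoraConfig in this notebook
R = 32
LORA_ALPHA = 32

q_params = lora_param_count(HIDDEN, HIDDEN, R)                   # q_proj: 896 -> 896
v_params = lora_param_count(HIDDEN, NUM_KV_HEADS * HEAD_DIM, R)  # v_proj: 896 -> 128
total = NUM_LAYERS * (q_params + v_params)

print(f"trainable LoRA parameters: {total:,}")
print(f"scaling factor lora_alpha / r: {LORA_ALPHA / R}")
```

Under these assumptions the adapter trains roughly 2.2M parameters, a small fraction of the base model, and the `lora_alpha / r` scaling is 1.0.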

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

new_model_name = "noridorimari/qwen0.5b-allganize-jooyoung"

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    dtype=torch.float16,
    cache_dir=cache_dir
)

model = PeftModel.from_pretrained(base_model, output_dir)
merged_model = model.merge_and_unload()  

merged_model.config.use_cache = True
merged_model.save_pretrained(output_dir, safe_serialization=True)
tokenizer.save_pretrained(output_dir)
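`merge_and_unload()` folds each LoRA update into its base weight so the adapter is no longer needed at inference time. For standard LoRA the merged weight is `W + (lora_alpha / r) * B @ A`. The toy pure-Python illustration below (not PEFT's implementation; the names `matmul` and `merge_lora` are hypothetical) checks on a 2x2 example that the merged weight gives the same output as running the base and adapter paths separately.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (out=2, in=2)
A = [[1.0, 2.0]]               # LoRA A (r=1, in=2)
B = [[0.5], [-1.0]]            # LoRA B (out=2, r=1)
x = [[3.0], [1.0]]             # input column vector

merged = merge_lora(W, A, B, alpha=2, r=1)

# Output of the merged weight
y_merged = matmul(merged, x)

# Output of base path plus scaled adapter path
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_separate = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]

print(y_merged == y_separate)
```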

In [None]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness
!cd lm-evaluation-harness && pip install -e .

In [None]:
!pip install lm-eval[vllm]

## The evaluation was run in an Ubuntu environment

In [None]:
import os
eval_output_dir = os.path.join(output_dir, "submit", "gsm8k")
os.makedirs(eval_output_dir, exist_ok=True)

eval_output_path = os.path.join(eval_output_dir, "result-original.json")
tasks = "gsm8k"
local_model = output_dir
local_model_path = local_model.replace('\\', '/')
eval_output_path_normalized = eval_output_path.replace('\\', '/')

eval_cmd = f"lm_eval --model vllm --model_args pretrained={local_model_path},trust_remote_code=True,dtype=float16 --tasks {tasks} --device cuda:0 --batch_size auto:4 --output_path {eval_output_path_normalized}"

In [None]:
print(eval_cmd)

In [None]:
# # run an evaluation command
# import subprocess
# import shlex

# cmd_parts = shlex.split(eval_cmd, posix=False)
# print("Command:", " ".join(cmd_parts))
# result = subprocess.run(cmd_parts)
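One way to sidestep shell quoting differences between Windows and Ubuntu is to build the command as an argument list and hand that list directly to `subprocess.run`. The sketch below does this with the same flags as `eval_cmd` above. The helper name `build_eval_cmd` and the example paths are hypothetical, and the call that would launch the evaluation is left commented out.

```python
import shlex


def build_eval_cmd(model_path, output_path, tasks="gsm8k"):
    """Assemble the lm_eval invocation as a list of arguments."""
    model_args = f"pretrained={model_path},trust_remote_code=True,dtype=float16"
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--tasks", tasks,
        "--device", "cuda:0",
        "--batch_size", "auto:4",
        "--output_path", output_path,
    ]


# Hypothetical example paths
cmd = build_eval_cmd("results/merged", "results/submit/gsm8k/result-original.json")
print(shlex.join(cmd))

# import subprocess
# subprocess.run(cmd, check=True)  # uncomment to run the evaluation
```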

In [30]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
import torch
from pathlib import Path

# Folder paths (based on the current directory layout)
base_model_name = "Qwen/Qwen2.5-0.5B-Instruct"
adapter_dir = Path("G:/SSAFY/About_Code/AI_study/Assignment/Allganize_model/results")
save_dir = Path("G:/SSAFY/About_Code/AI_study/Assignment/Allganize_model/results/merged")

# Load the PEFT config (the base model name is read from the adapter config)
config = PeftConfig.from_pretrained(adapter_dir)

# Load the base model
base = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,  # base model recorded in adapter_config.json
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load the adapter
model = PeftModel.from_pretrained(base, adapter_dir)

# Merge the adapter into the base weights
model = model.merge_and_unload()

# Save the merged model
model.save_pretrained(save_dir)

# Copy the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.save_pretrained(save_dir)

print(f"Final model saved to: {save_dir}")


Final model saved to: G:\SSAFY\About_Code\AI_study\Assignment\Allganize_model\results\merged


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import os

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

peft_model = PeftModel.from_pretrained(
    base_model,
    r"G:\SSAFY\About_Code\AI_study\Assignment\Allganize_model\results", 
)

merged_model = peft_model.merge_and_unload()

save_dir = r"G:\SSAFY\About_Code\AI_study\Assignment\Allganize_model\results\merg"
os.makedirs(save_dir, exist_ok=True)
merged_model.save_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", trust_remote_code=True)
tokenizer.save_pretrained(save_dir)


('G:\\SSAFY\\About_Code\\AI_study\\Assignment\\Allganize_model\\results\\merg\\tokenizer_config.json',
 'G:\\SSAFY\\About_Code\\AI_study\\Assignment\\Allganize_model\\results\\merg\\special_tokens_map.json',
 'G:\\SSAFY\\About_Code\\AI_study\\Assignment\\Allganize_model\\results\\merg\\chat_template.jinja',
 'G:\\SSAFY\\About_Code\\AI_study\\Assignment\\Allganize_model\\results\\merg\\vocab.json',
 'G:\\SSAFY\\About_Code\\AI_study\\Assignment\\Allganize_model\\results\\merg\\merges.txt',
 'G:\\SSAFY\\About_Code\\AI_study\\Assignment\\Allganize_model\\results\\merg\\added_tokens.json',
 'G:\\SSAFY\\About_Code\\AI_study\\Assignment\\Allganize_model\\results\\merg\\tokenizer.json')

In [None]:
from peft import PeftModel
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    r"G:\SSAFY\About_Code\AI_study\Assignment\Allganize_model\results\merged",  # path to the merged model
    trust_remote_code=True
)

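try:
    model.print_trainable_parameters()
except AttributeError:
    # Only PEFT-wrapped models expose print_trainable_parameters()
    print("✅ No PEFT adapter found. This is a fully merged model.")


✅ No PEFT adapter found. This is a fully merged model.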


In [29]:
from huggingface_hub import HfApi
import os

api = HfApi(token=os.getenv("TOKEN"))
api.upload_folder(
    folder_path=r"G:\SSAFY\About_Code\AI_study\Assignment\Allganize_model\results\merged",
    repo_id="noridorimari/Allganize_last",
    repo_type="model",
)

- empty or missing yaml metadata in repo card


CommitInfo(commit_url='https://huggingface.co/noridorimari/Allganize_last/commit/dfca2be9ded500915fb9e8b241f4d7d836322140', commit_message='Upload folder using huggingface_hub', commit_description='', oid='dfca2be9ded500915fb9e8b241f4d7d836322140', pr_url=None, repo_url=RepoUrl('https://huggingface.co/noridorimari/Allganize_last', endpoint='https://huggingface.co', repo_type='model', repo_id='noridorimari/Allganize_last'), pr_revision=None, pr_num=None)

In [None]:
# Qwen2.5-0.5B-Instruct results for reference
#vllm (pretrained=Qwen/Qwen2.5-0.5B-Instruct,trust_remote_code=True,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4
#|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
#|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
#|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3442|±  |0.0131|
#|     |       |strict-match    |     5|exact_match|↑  |0.3169|±  |0.0128|
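The `Stderr` column in the reference table can be sanity-checked by hand. The sketch below assumes the score is a mean of 0/1 outcomes over the 1,319 problems of the GSM8K test split and applies the simple binomial formula `sqrt(p * (1 - p) / n)`. The harness may use a slightly different estimator, and `binomial_stderr` is a hypothetical helper.

```python
import math

N_TEST = 1319  # size of the GSM8K test split


def binomial_stderr(p, n):
    """Standard error of an accuracy p measured on n independent problems."""
    return math.sqrt(p * (1 - p) / n)


for name, acc in [("flexible-extract", 0.3442), ("strict-match", 0.3169)]:
    print(f"{name}: {acc:.4f} ± {binomial_stderr(acc, N_TEST):.4f}")
```

With an error of about 0.013, differences of one or two points between runs fall within roughly two standard errors and should be interpreted with care.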