# Prerequisites
This notebook (`ipynb`) was written with `Python=3.12`. Run the cell below to install the package dependencies.

## Install Python packages

In [None]:
%pip -q install -U datasets azure-storage-blob tqdm Pillow python-dotenv

## Load environment variables from a .env file
To avoid exposing secrets and to keep environment variables consistent across notebooks, this notebook loads them with `dotenv`.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv(override=True)

AZURE_STORAGE_ACCOUNT_BASE_URL = os.getenv("AZURE_STORAGE_ACCOUNT_BASE_URL")
AZURE_STORAGE_ACCOUNT_SAS_TOKEN = os.getenv("AZURE_STORAGE_ACCOUNT_SAS_TOKEN")
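
`os.getenv` returns `None` when a variable is missing, so a typo in the `.env` file would only surface later as an authentication or URL error. A minimal sketch of a fail-fast check is shown here; `require_env` is a hypothetical helper introduced for illustration, not part of `dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Fail early with a message that points at the likely cause.
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value
```

Under this assumption, `require_env("AZURE_STORAGE_ACCOUNT_BASE_URL")` could stand in for the corresponding `os.getenv` call.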

# Data preparation
Prepare the data needed for fine-tuning. The samples come from a Hugging Face dataset; the images are uploaded as blobs to an Azure Storage Account, and the fine-tuning data references them by URL.

In [None]:
from io import BytesIO
import json
from azure.core.credentials import AzureSasCredential
from azure.storage.blob import BlobServiceClient
from datasets import load_dataset
from tqdm import tqdm

AZURE_STORAGE_ACCOUNT_CONTAINER = "ai-foundry-finetuning"
AZURE_STORAGE_ACCOUNT_BLOB_PREFIX = "images/HuggingFaceM4-ChartQA"

container_client = (
    BlobServiceClient(account_url=AZURE_STORAGE_ACCOUNT_BASE_URL, credential=AzureSasCredential(AZURE_STORAGE_ACCOUNT_SAS_TOKEN))
    .get_container_client(AZURE_STORAGE_ACCOUNT_CONTAINER)
)

for split, n in [("train", 20), ("val", 5)]:
    dataset = load_dataset("HuggingFaceM4/ChartQA", split=split)
    n = min(1000, n, len(dataset))  # cap at 1000 samples and never exceed the split size
    print(f"{split} split size: {len(dataset)}, using first {n} samples.")
    with open(f"sft-vlm-{split}.jsonl", "w", encoding="utf-8") as file:
        for idx in tqdm(range(n), desc=f"{split}"):
            # The blob name must not contain the SAS token; the token belongs only in the URL query string.
            blob_name = f"{AZURE_STORAGE_ACCOUNT_BLOB_PREFIX}/{split}/{idx}.png"
            blob_url = f"{AZURE_STORAGE_ACCOUNT_BASE_URL}/{AZURE_STORAGE_ACCOUNT_CONTAINER}/{blob_name}?{AZURE_STORAGE_ACCOUNT_SAS_TOKEN}"

            # Serialize the PIL image to PNG in memory, then upload it as a blob.
            data = BytesIO()
            dataset[idx]["image"].save(data, format="PNG")
            data.seek(0)
            container_client.upload_blob(name=blob_name, data=data, overwrite=True)

            file.write(json.dumps({
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": f"{dataset[idx]["query"].strip()}\nShort answer required.",
                            },
                            {
                                "type": "image_url",
                                "image_url": {"url": blob_url},
                            },
                        ],
                    },
                    {"role": "assistant", "content": [{"type": "text", "text": str(dataset[idx]["label"][0]).strip()}]}],
            }, ensure_ascii=False) + "\n")
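
The image URL written to each JSONL record is made of the account base URL, the container, the blob path, and the SAS token as the query string. A small sketch of composing it defensively follows; `build_blob_url` is a hypothetical helper, and the handling of a trailing `/` or a leading `?` is an assumption about how the values might be stored in `.env`.

```python
def build_blob_url(base_url: str, container: str, blob_name: str, sas_token: str) -> str:
    """Compose a SAS-authenticated blob URL from its parts."""
    # Some tools emit the SAS token with a leading "?"; strip it so it is not doubled.
    token = sas_token.lstrip("?")
    # Strip a trailing "/" from the base URL to avoid "//" before the container name.
    return f"{base_url.rstrip('/')}/{container}/{blob_name}?{token}"
```

The blob path passed to `upload_blob` stays free of the token; only the URL stored in the training data carries it.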


Use the generated train/validation JSONL files to run fine-tuning in AI Foundry.
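
Before uploading the files, it can help to confirm that every line parses and has the shape this notebook writes. The sketch below is a local sanity check only, under the assumption that each record holds one user turn (text plus image) and one assistant turn; `validate_sft_jsonl` is a hypothetical helper and does not reproduce whatever validation AI Foundry performs.

```python
import json


def validate_sft_jsonl(path: str) -> int:
    """Check the record structure written by this notebook and return the record count."""
    count = 0
    with open(path, encoding="utf-8") as file:
        for line_no, line in enumerate(file, start=1):
            record = json.loads(line)
            messages = record["messages"]
            roles = [message["role"] for message in messages]
            if roles != ["user", "assistant"]:
                raise ValueError(f"line {line_no}: unexpected roles {roles}")
            # The user turn should carry both the question text and the chart image.
            part_types = {part["type"] for part in messages[0]["content"]}
            if part_types != {"text", "image_url"}:
                raise ValueError(f"line {line_no}: unexpected content types {sorted(part_types)}")
            count += 1
    return count
```

For example, `validate_sft_jsonl("sft-vlm-train.jsonl")` would return the number of records if all lines pass.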