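#### Fine-tuning Qwen3
LoRA fine-tuning of Qwen3-4B on an NVIDIA GeForce RTX 2070 SUPER (8 GB VRAM). Qwen3-8B would take just under 7 hours, so the next smaller model was used.
Triton could not be used because of Windows and the age of the GPU, so the installation steps and environment variables were adjusted to avoid it.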

[Unsloth official documentation](https://docs.unsloth.ai/)

Note: using a Colab local runtime
1. Run the following in the local environment:
   
   ```jupyter notebook --NotebookApp.allow_origin='https://colab.research.google.com' --port=8888 --NotebookApp.port_retries=0 --NotebookApp.allow_credentials=True```
2. Enter the displayed token in Colab.

#### Library installation and imports

In [None]:
import os
if "COLAB_" not in "".join(os.environ.keys()):
    print("Local environment")
    !pip install unsloth
    !pip uninstall -y cut_cross_entropy # Does not work well on Windows.
    os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"        # Do not use TorchDynamo / Triton (they do not work well on Windows).
    os.environ["UNSLOTH_COMPILE_IGNORE_ERRORS"] = "1"  # Fall back even if errors occur
    DATASE_NUM_PROC = 1
else:
    print("Google Colab environment")
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth
    DATASE_NUM_PROC = None
    

Local environment


In [None]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt
import pandas as pd
from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import TextStreamer

#### Unsloth
Load the model and tokenizer

In [None]:
# Model list (official documentation and Hugging Face)
# https://docs.unsloth.ai/get-started/all-our-models
# https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: OpenAI failed to import - ignoring for now.
🦥 Unsloth Zoo will now patch everything to make training faster!


  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"cuda:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.4.8: Fast Qwen3 patching. Transformers: 4.51.3.
   \\   /|    NVIDIA GeForce RTX 2070 SUPER. Num GPUs = 1. Max memory: 8.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.7.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Add LoRA adapters

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.4.8 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
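
The trainable-parameter count that the training log reports later (66,060,288) can be reproduced by hand from the LoRA settings above. The sketch below is only a sanity check: the layer dimensions are assumed values for Qwen3-4B that are not stated in this notebook (check the model's `config.json`), and `lora_params` is a hypothetical helper, not part of Unsloth or PEFT.

```python
# Assumed Qwen3-4B dimensions (not stated in this notebook; verify in config.json).
HIDDEN = 2560          # hidden_size
Q_OUT = 32 * 128       # num_attention_heads * head_dim
KV_OUT = 8 * 128       # num_key_value_heads * head_dim
INTERMEDIATE = 9728    # intermediate_size
NUM_LAYERS = 36        # matches "patched 36 layers" in the log
RANK = 32              # r passed to get_peft_model

# (in_features, out_features) of each linear layer listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two matrices per layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumed dimensions the total comes out to 66,060,288, which agrees with the figure in the log. Because the count scales linearly with `r`, halving the rank to 16 would halve the number of trainable parameters.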


<a id="Data"></a>
### Data preparation
Qwen3 has both a reasoning mode and a non-reasoning mode. Training only on a non-reasoning dataset may degrade its reasoning ability, so the training data combines both kinds of dataset:

1. We use the [Open Math Reasoning](https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini) dataset which was used to win the [AIMO](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard) (AI Mathematical Olympiad - Progress Prize 2) challenge! We sample 10% of verifiable reasoning traces that used DeepSeek R1, and which got > 95% accuracy.

2. We also leverage [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. But we need to convert it to HuggingFace's normal multiturn format as well.
   
##### Financial data
- https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train
- https://huggingface.co/datasets/FinGPT/fingpt-fiqa_qa
- https://github.com/czyssrs/FinQA

In [None]:
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

In [5]:
reasoning_dataset

Dataset({
    features: ['expected_answer', 'problem_type', 'problem_source', 'generation_model', 'pass_rate_72b_tir', 'problem', 'generated_solution', 'inference_mode'],
    num_rows: 19252
})

In [6]:
non_reasoning_dataset

Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})

Convert the reasoning dataset to conversation format

In [7]:
def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role" : "user",      "content" : problem},
            {"role" : "assistant", "content" : solution},
        ])
    return { "conversations": conversations, }
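
With `batched = True`, `Dataset.map` calls the function with a dict that maps each column name to a list of values, and expects a dict of lists back. The sketch below runs the same conversion on a made-up two-row batch without needing the real dataset; the rows in `toy_batch` are invented for illustration.

```python
def generate_conversation(examples):
    # `examples` is one batch: column name -> list of values.
    conversations = [
        [
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution},
        ]
        for problem, solution in zip(examples["problem"], examples["generated_solution"])
    ]
    return {"conversations": conversations}


# Made-up rows, only to show the input and output shapes.
toy_batch = {
    "problem": ["What is 1 + 1?", "What is 2 * 3?"],
    "generated_solution": [
        "<think>\n1 + 1 = 2\n</think>\n\n2",
        "<think>\n2 * 3 = 6\n</think>\n\n6",
    ],
}

result = generate_conversation(toy_batch)
print(len(result["conversations"]))   # one conversation per input row
print(result["conversations"][0])     # a [user, assistant] message pair
```

Each conversation is a list of role/content messages, which is the structure `tokenizer.apply_chat_template` accepts.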

In [8]:
reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
    tokenize = False,
)

In [26]:
print(reasoning_conversations[0])

<|im_start|>user
Given $\sqrt{x^2+165}-\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<|im_end|>
<|im_start|>assistant
<think>
Okay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.

First, let me write down the equation again to make sure I have it right:

√(x² + 165) - √(x² - 52) = 7.

Okay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:

√(x² + 165) = 7 + √(x² - 52).

Now, if I square both sides, maybe I can get rid of the square roots. Let's do that:

(√(x² + 165))² = (7 + √(x² - 52))².

Simplifying the left side:

x² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².

The right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14

Convert the non-reasoning dataset to conversation format

Use Unsloth's `standardize_sharegpt` function to fix the dataset format

In [None]:
dataset = standardize_sharegpt(non_reasoning_dataset)

non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize = False,
)

In [27]:
print(non_reasoning_conversations[0])

<|im_start|>user
Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. 

Furthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.

Finally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.<|im_end|>
<|im_start|>assistant
<think>

</think>

Bool

In [12]:
print(len(reasoning_conversations))
print(len(non_reasoning_conversations))

19252
100000


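This run was planned as chat-model-centered training with a reasoning : non-reasoning mix of 75% : 25%. Note that the sampling code takes non-reasoning rows equal to 25% of the reasoning set (4,813 rows), so the combined dataset of 24,065 rows is actually about 80% reasoning and 20% non-reasoning.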

In [13]:
chat_percentage = 0.75

In [None]:

non_reasoning_subset = pd.Series(non_reasoning_conversations)
non_reasoning_subset = non_reasoning_subset.sample(
    int(len(reasoning_conversations) * (1.0 - chat_percentage)),
    random_state = 2407,
)

In [None]:
data = pd.concat([
    pd.Series(reasoning_conversations),
    pd.Series(non_reasoning_subset)
])
data.name = "text"

combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)
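
The size of the combined dataset follows directly from the sampling expression above. The sketch below redoes that arithmetic with the row count printed earlier, and also shows one possible way to size the sample if an exact 75 : 25 split were wanted. The second formula is an illustrative alternative, not what this notebook ran.

```python
n_reasoning = 19252       # len(reasoning_conversations), from the earlier output
chat_percentage = 0.75    # same value as in the notebook

# What the notebook computes: 25% of the reasoning row count.
n_non_reasoning = int(n_reasoning * (1.0 - chat_percentage))
total = n_reasoning + n_non_reasoning
reasoning_share = n_reasoning / total
print(n_non_reasoning, total, round(reasoning_share, 3))

# Illustrative alternative (an assumption, not the notebook's method):
# choose the sample size so that reasoning rows are 75% of the final mix.
n_for_exact_split = int(n_reasoning * (1.0 - chat_percentage) / chat_percentage)
exact_share = n_reasoning / (n_reasoning + n_for_exact_split)
print(n_for_exact_split, round(exact_share, 3))
```

The first calculation gives the 24,065 examples that the tokenization progress bar and the training banner report.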

<a name="Train"></a>
### 学習
Uses `SFTTrainer` from Hugging Face TRL. [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer)

This run uses 30 steps (`max_steps = 30`). For a full run, set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

In [None]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        dataset_num_proc            = DATASE_NUM_PROC,
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/24065 [00:00<?, ? examples/s]

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 2070 SUPER. Max memory = 8.0 GB.
3.668 GB of memory reserved.


Note: to resume training, use `trainer.train(resume_from_checkpoint = True)`.

In [18]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 24,065 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 66,060,288/4,000,000,000 (1.65% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,0.6151
2,0.6781
3,0.8538
4,0.6942
5,0.5814
6,0.5485
7,0.5496
8,0.5069
9,0.4685
10,0.5755


In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

564.9573 seconds used for training.
9.42 minutes used for training.
Peak reserved memory = 10.889 GB.
Peak reserved memory for training = 7.221 GB.
Peak reserved memory % of max memory = 136.112 %.
Peak reserved memory for training % of max memory = 90.263 %.
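
The numbers in the training banner and the runtime above are enough for a rough estimate of how much data the 30-step run covered and how long one full epoch might take on this GPU. This is a back-of-the-envelope sketch: it assumes the time per step stays constant, and the Trainer's own rounding of steps per epoch may differ by one.

```python
import math

num_examples = 24065          # "Num examples" in the training banner
per_device_batch = 2          # per_device_train_batch_size
grad_accum = 4                # gradient_accumulation_steps
max_steps = 30                # steps actually run
train_runtime_s = 564.9573    # train_runtime printed above

effective_batch = per_device_batch * grad_accum      # examples per optimizer step
examples_seen = effective_batch * max_steps          # examples used in this run
fraction_seen = examples_seen / num_examples

steps_per_epoch = math.ceil(num_examples / effective_batch)
sec_per_step = train_runtime_s / max_steps
est_epoch_hours = steps_per_epoch * sec_per_step / 3600

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({fraction_seen:.1%} of the dataset)")
print(f"steps for one epoch: about {steps_per_epoch}")
print(f"estimated time for one epoch: about {est_epoch_hours:.1f} hours")
```

The 30-step run therefore touches only about 1% of the combined dataset, which is worth keeping in mind when judging the loss values.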


<a name="Inference"></a>
### 推論
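According to the Qwen3 team, the recommended settings for reasoning (thinking) inference are `temperature = 0.6, top_p = 0.95, top_k = 20`.

For normal chat-based (non-thinking) inference, they are `temperature = 0.7, top_p = 0.8, top_k = 20`.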
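
To avoid mixing up the two sets of values, they can be kept in one place. The helper below is a minimal sketch; `sampling_kwargs` is a hypothetical name introduced here, not a function from Unsloth or Transformers, and the values are simply the ones quoted above.

```python
def sampling_kwargs(thinking: bool) -> dict:
    """Return the sampling settings quoted above for the chosen mode."""
    if thinking:
        return {"temperature": 0.6, "top_p": 0.95, "top_k": 20}
    return {"temperature": 0.7, "top_p": 0.8, "top_k": 20}


print(sampling_kwargs(thinking=True))
print(sampling_kwargs(thinking=False))
```

The returned dict could then be unpacked into the generation call, for example `model.generate(**inputs, **sampling_kwargs(thinking=True), max_new_tokens=1024)`, with `enable_thinking` in `apply_chat_template` set to the same flag.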

In [None]:
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

To solve the equation \((x + 2)^2 = 0\), we start by taking the square root of both sides. This gives us:

\[
x + 2 = \pm \sqrt{0}
\]

Since the square root of 0 is 0, we have:

\[
x + 2 = 0
\]

Subtracting 2 from both sides, we find:

\[
x = -2
\]

So the solution is \(x = -2\).<|im_end|>


In [None]:
messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = True, # Enable thinking
)
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024, # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # For thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<think>
Okay, let's see. The equation is (x + 2) squared equals zero. Hmm, so (x + 2)^2 = 0. I remember that when you have something squared equal to zero, the only solution is when the inside part is zero because any real number squared is non-negative. So, if (x + 2)^2 is zero, then x + 2 must be zero. Let me check that. If x + 2 is zero, then x is -2. So, the solution is x equals -2. But wait, since it's squared, does that mean there's a multiplicity of two? But in terms of real solutions, it's just one solution, right? Because the equation is a quadratic, but it's a perfect square. So, the solution is x = -2 with multiplicity two. But the question just says to solve the equation, so maybe just stating x = -2 is sufficient. Let me make sure I didn't miss anything. The equation is a quadratic in standard form. If I expand it, it would be x^2 + 4x + 4 = 0. Then, using the quadratic formula: x = [-4 ± sqrt(16 - 16)] / 2 = [-4 ± 0]/2 = -2. So, only one solution, x = -2, repeated twice. 
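
In thinking mode the reasoning is wrapped in `<think>` ... `</think>` and the final answer follows the closing tag, as the output above shows. When the generated text is captured as a string rather than streamed, the two parts can be separated with plain string handling. The sketch below assumes that tag layout; `split_thinking` is a hypothetical helper, not part of any library used here.

```python
def split_thinking(generated: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer).

    Handles three cases: a complete <think> block, a block cut off by
    max_new_tokens (no closing tag), and output with no thinking at all.
    """
    text = generated.replace("<|im_end|>", "").strip()
    head, sep, tail = text.partition("</think>")
    if sep:
        # Complete block: reasoning sits before the closing tag.
        return head.replace("<think>", "", 1).strip(), tail.strip()
    if text.startswith("<think>"):
        # Truncated before the closing tag: everything is reasoning.
        return text[len("<think>"):].strip(), ""
    return "", text


reasoning, answer = split_thinking("<think>\nx + 2 = 0\n</think>\n\nx = -2<|im_end|>")
print(repr(reasoning))
print(repr(answer))
```

An empty answer together with non-empty reasoning is a hint that `max_new_tokens` was too small for the model to finish thinking.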

<a name="Save"></a>
### モデルの保存と読み込み
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for online saving or `save_pretrained` for local saving.

**[NOTE]** This saves only the LoRA adapters, not the full model. To save to 16bit or GGUF, see the sections further down.

In [22]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model\\tokenizer_config.json',
 'lora_model\\special_tokens_map.json',
 'lora_model\\vocab.json',
 'lora_model\\merges.txt',
 'lora_model\\added_tokens.json',
 'lora_model\\tokenizer.json')
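
The tuple above lists only the tokenizer files, so it can be useful to confirm that the adapter itself was written as well. The sketch below checks for the file names that PEFT normally writes for a LoRA adapter (`adapter_config.json` and `adapter_model.safetensors`). Treat those names as an assumption about the installed PEFT version; `missing_adapter_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed PEFT output names; older versions may write adapter_model.bin instead.
EXPECTED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(directory: str) -> list[str]:
    """Return the expected adapter files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_ADAPTER_FILES if not (root / name).is_file()]


missing = missing_adapter_files("lora_model")
if missing:
    print("Missing adapter files:", ", ".join(missing))
else:
    print("All expected adapter files are present.")
```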

To load the saved LoRA adapters for inference, change `False` to `True`.

In [23]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

### Saving to float16 for vLLM
Select `merged_16bit` for float16 or `merged_4bit` for int4.

Use `push_to_hub_merged` to upload to your Hugging Face account (token required).

In [24]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp 変換
Saving to `GGUF` / `llama.cpp` is now supported natively. Unsloth clones `llama.cpp` and saves to `q8_0` by default. All methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Supported quantization methods (full list on the [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

You can also try the [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb).

In [25]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now use the `model.gguf` file or the `model-Q4_K_M.gguf` file in llama.cpp, or in a UI-based system such as Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, there is a [Discord](https://discord.gg/unsloth) channel. Feel free to join if you find a bug, want to keep up with the latest LLM news, need help, or want to join the project.

Other links
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Save finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. For notebooks on DPO, ORPO, continued pre-training, conversational finetuning and more, see the [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks).