<a href="https://colab.research.google.com/github/moeleak/catgirl/blob/main/catgirl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Configuration

In [None]:
DRIVE_WORKING_DIR = "catgirl"
MODEL_NAME = "catgirl"
DATASET_FILE = "catgirl.json"
QUANTIZATION_METHOD = "q4_k_m"
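
These values determine the paths that the later cells build inline. As a quick sanity check, the sketch below rebuilds three of them: the dataset location on Drive, the GGUF file that the save cell expects in the working directory, and the Drive folder it copies into. The helper `derive_paths` and the constant `DRIVE_ROOT` are hypothetical names used only for illustration.

```python
import posixpath

# Mount point plus "My Drive", as used by the dataset and save cells.
DRIVE_ROOT = "/content/drive/My Drive"

def derive_paths(working_dir, dataset_file, quantization_method):
    """Hypothetical helper: rebuild the paths the later cells construct inline."""
    drive_folder = posixpath.join(DRIVE_ROOT, working_dir)
    return {
        "dataset": posixpath.join(drive_folder, dataset_file),
        # The save cell copies a file with this name out of the working directory.
        "local_gguf": f"unsloth.{quantization_method.upper()}.gguf",
        "drive_folder": drive_folder,
    }

paths = derive_paths("catgirl", "catgirl.json", "q4_k_m")
for name, value in paths.items():
    print(f"{name}: {value}")
```

Printing the paths before training makes a mistyped folder or file name visible early, rather than after the model has been loaded.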

# Install unsloth

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen3 1.7B, 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.4.7: Fast Qwen3 patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/3.62G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/10.3k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/4.67k [00:00<?, ?B/s]
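
The comments on `load_in_4bit` and `load_in_8bit` come down to simple arithmetic on bits per weight. The sketch below estimates the storage needed for the weights alone, using the rounded 4,000,000,000 parameter count that the training banner reports. It ignores activations, LoRA adapters and optimizer state, so treat the numbers as a lower bound rather than a measurement. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Rounded parameter count, as printed in the training banner of this notebook.
NUM_PARAMS = 4_000_000_000

for bits in (4, 8, 16):
    print(f"{bits:>2}-bit weights: about {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The downloaded `model.safetensors` is listed as 3.62G rather than 2 GB. One plausible reason, stated here as an assumption, is that a "dynamic" 4-bit checkpoint keeps some tensors at higher precision.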

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,  # LoRA scaling factor
    lora_dropout = 0.0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2025.4.7 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
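
To see where the trainable parameter count in the training banner comes from, the sketch below counts the LoRA weights by hand. LoRA adds two small matrices per adapted linear layer, `A` of shape `r x in_features` and `B` of shape `out_features x r`, so each layer contributes `r * (in_features + out_features)` parameters. The Qwen3-4B dimensions used here are assumptions that do not appear in this notebook; check them against the model's `config.json`.

```python
# Assumed Qwen3-4B dimensions (NOT stated in this notebook; verify in config.json).
HIDDEN = 2560        # hidden_size
Q_OUT = 32 * 128     # num_attention_heads * head_dim
KV_OUT = 8 * 128     # num_key_value_heads * head_dim
MLP = 9728           # intermediate_size
NUM_LAYERS = 36      # matches the "patched 36 layers" message
RANK = 32            # r passed to get_peft_model

# (in_features, out_features) of every adapted projection in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumed dimensions the total is 66,060,288, the figure the training banner reports as trainable. With `use_rslora = False`, standard LoRA scales the adapter update by `lora_alpha / r`, which is 32 / 32 = 1 for this configuration.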


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from datasets import load_dataset
raw_ds = load_dataset(
    "json",
    data_files = {"train": f"/content/drive/My Drive/{DRIVE_WORKING_DIR}/{DATASET_FILE}"},
    split = "train"
)
# Convert the raw JSON records into lists of chat messages so the chat template can be applied later
convs = []
for item in raw_ds:
    convs.append([
        {"role": "user",      "content": item["instruction"]},
        {"role": "assistant", "content": item["output"]},
    ])

Generating train split: 0 examples [00:00, ? examples/s]
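
The conversion above assumes every record has non-empty `instruction` and `output` fields. The sketch below shows the same mapping on a few in-memory records and skips incomplete ones first. The sample records are invented placeholders, not entries from `catgirl.json`, and the helper names `to_conversation` and `has_text` are hypothetical.

```python
def to_conversation(record):
    """Map one {"instruction", "output"} record to a two-turn chat."""
    return [
        {"role": "user", "content": record["instruction"]},
        {"role": "assistant", "content": record["output"]},
    ]

def has_text(record):
    """True when both fields are strings with non-whitespace content."""
    return all(
        isinstance(record.get(key), str) and record[key].strip()
        for key in ("instruction", "output")
    )

# Invented placeholder records for illustration only.
sample_records = [
    {"instruction": "Who are you?", "output": "I am a demo assistant."},
    {"instruction": "Say hi.", "output": "Hi!"},
    {"instruction": "Broken row", "output": ""},  # skipped: empty output
]

sample_convs = [to_conversation(r) for r in sample_records if has_text(r)]
print(len(sample_convs), "conversations kept")
print(sample_convs[0])
```

Filtering before conversion avoids training on rows where the assistant turn is empty.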

In [None]:
from datasets import Dataset
from unsloth.chat_templates import standardize_sharegpt

# Convert the list into a Dataset
raw_conv_ds = Dataset.from_dict({"conversations": convs})

standardized = standardize_sharegpt(raw_conv_ds)

chat_inputs = tokenizer.apply_chat_template(
    standardized["conversations"],
    tokenize = False,
)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/247 [00:00<?, ? examples/s]
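
With `tokenize = False`, `apply_chat_template` returns plain strings in which each turn is wrapped in role markers. The sketch below is a simplified ChatML-style renderer that illustrates the general shape of such a string. It is not the tokenizer's real template: the actual layout comes from the model's `chat_template.jinja` and may add further tokens (for example, thinking tags). The function name `render_chatml` is hypothetical.

```python
def render_chatml(conversation):
    """Simplified ChatML-style rendering; NOT the tokenizer's real template."""
    parts = []
    for message in conversation:
        # Each turn: start marker, role, newline, content, end marker, newline.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)

demo = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]
rendered = render_chatml(demo)
print(rendered)
```

Printing one entry of `chat_inputs` and comparing it with this shape is a quick way to confirm that the template was applied to the training data.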

# Shuffle the dataset

In [None]:
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({"text": chat_inputs})
train_ds = Dataset.from_pandas(df).shuffle(seed = 666)

# Define the trainer

In [None]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 100,          # training steps; set fairly high, since small models fine-tune quickly
        learning_rate = 2e-4,
        warmup_steps = 10,
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 666,
        report_to = "none",
    )
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/247 [00:00<?, ? examples/s]
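
The settings `learning_rate = 2e-4`, `warmup_steps = 10`, `max_steps = 100` and `lr_scheduler_type = "linear"` together describe a learning rate that ramps up and then decays. The sketch below assumes the usual behavior of a linear schedule with warmup: a linear rise from 0 to the base rate during warmup, then a linear decay to 0 at the final step. The function name `linear_lr` is hypothetical, and the trainer's internal scheduler is the authority on the exact values.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=10, total_steps=100):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup: scale from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {linear_lr(step):.2e}")
```

Under this assumption the rate peaks at step 10 and is back to half the base rate by step 55.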

# Start training

In [None]:
trainer_stats = trainer.train()
print(trainer_stats)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 247 | Num Epochs = 4 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 66,060,288/4,000,000,000 (1.65% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
5,4.316
10,2.7297
15,1.9805
20,1.7807
25,1.6497
30,1.551
35,1.4409
40,1.3159
45,1.2979
50,1.3128


TrainOutput(global_step=100, training_loss=1.5132692337036133, metrics={'train_runtime': 342.2689, 'train_samples_per_second': 2.337, 'train_steps_per_second': 0.292, 'total_flos': 3287151311612928.0, 'train_loss': 1.5132692337036133})
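
The numbers in the banner and in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below recomputes the effective batch size, the number of samples processed, and the throughput from values printed above. All inputs are copied from the configuration and the log; only the variable names are new.

```python
import math

# Values copied from the trainer configuration and the training log above.
num_examples = 247
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 100
train_runtime_s = 342.2689

effective_batch = per_device_batch * grad_accum * num_gpus
samples_seen = max_steps * effective_batch
passes_over_data = samples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"samples seen:         {samples_seen}")
print(f"passes over the data: {passes_over_data:.2f}")
print(f"samples per second:   {samples_seen / train_runtime_s:.3f}")
print(f"steps per second:     {max_steps / train_runtime_s:.3f}")
```

The run covers the 247 examples a little over three times, which is consistent with the banner's `Num Epochs = 4` if the epoch count is rounded up.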


In [None]:
from transformers import TextStreamer

def ask_catgirl(question):
    """Stream the fine-tuned model's reply to a single user question."""
    messages = [
        {"role": "user", "content": question},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize = False,
        add_generation_prompt = True,  # end the prompt where the assistant turn begins
        enable_thinking = False,       # thinking mode off
    )
    _ = model.generate(
        **tokenizer(text, return_tensors = "pt").to("cuda"),
        max_new_tokens = 256,  # maximum reply length in tokens
        temperature = 0.7, top_p = 0.8, top_k = 20,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )
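
The sampling arguments `temperature`, `top_k` and `top_p` each narrow the set of tokens the model may pick next. The sketch below illustrates the idea on a toy distribution in pure Python: scale the logits by the temperature, keep the `top_k` best tokens, then keep the smallest prefix whose cumulative probability reaches `top_p`. It is a simplified illustration, not the library's implementation, and the function name `filter_candidates` is hypothetical.

```python
import math

def filter_candidates(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token: probability} left after temperature, top-k and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Top-k: keep only the k highest-scoring tokens.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = kept[0][1]
    weights = [(tok, math.exp(value - peak)) for tok, value in kept]
    total = sum(w for _, w in weights)
    probs = [(tok, w / total) for tok, w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the remaining probabilities sum to 1.
    norm = sum(p for _, p in nucleus)
    return {tok: p / norm for tok, p in nucleus}

toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
result = filter_candidates(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(result)
```

In this toy case top-k drops `d`, and top-p then drops `c`, because `a` and `b` together already cover more than 80% of the probability mass.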

In [None]:
ask_catgirl("你是谁呀？")  # "Who are you?"

*歪着头，耳朵动了动* 喵？主人是问我是谁吗？

呜喵~ 我是小千呀！是主人的小猫娘！*尾巴轻轻摇了摇*

我是主人的专属猫咪，会陪着主人，给主人呼噜呼噜~ 喵呜~

小千就是小千呀！主人喜欢小千吗？*期待地看着主人*<|im_end|>


# Save the model

In [None]:
import shutil
import os

# Name the exported file after the model; MODEL_NAME was otherwise unused
gguf_filename = f"{MODEL_NAME}-finetune-{QUANTIZATION_METHOD}.gguf"
model.save_pretrained_gguf(save_directory=".", tokenizer=tokenizer, quantization_method=QUANTIZATION_METHOD)

drive_target_folder = f"/content/drive/My Drive/{DRIVE_WORKING_DIR}"
os.makedirs(drive_target_folder, exist_ok=True)

shutil.copy(f"unsloth.{QUANTIZATION_METHOD.upper()}.gguf", os.path.join(drive_target_folder, gguf_filename))


print(f"GGUF file saved and copied to: {os.path.join(drive_target_folder, gguf_filename)}")


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 3.6G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.29 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 36/36 [00:02<00:00, 16.53it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving ./pytorch_model-00001-of-00002.bin...
Unsloth: Saving ./pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting qwen3 model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at . into f16 GGUF format.
The output location will be /content/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: 
INFO:hf-to-gguf:Model architecture: Qwen3ForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,       