To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Qwen3 Guide](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install transformers==4.51.3
    !pip install --no-deps unsloth

### Unsloth

In [None]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit", # Llama 3.2 vision support
    "unsloth/Llama-3.2-11B-Vision-bnb-4bit",
    "unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit", # Can fit in a 80GB card!
    "unsloth/Llama-3.2-90B-Vision-bnb-4bit",

    "unsloth/Pixtral-12B-2409-bnb-4bit",              # Pixtral fits in 16GB!
    "unsloth/Pixtral-12B-Base-2409-bnb-4bit",         # Pixtral base model

    "unsloth/Qwen2-VL-2B-Instruct-bnb-4bit",          # Qwen2 VL support
    "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    "unsloth/Qwen2-VL-72B-Instruct-bnb-4bit",

    "unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit",      # Any Llava variant works!
    "unsloth/llava-1.5-7b-hf-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
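As a rough back-of-the-envelope check on why 4-bit loading reduces memory use, the sketch below estimates the memory taken by the weights alone. The 7 billion parameter count is an assumption read off the model name, and `weight_memory_gib` is a hypothetical helper, not part of Unsloth. Real usage will be higher because of quantization metadata, layers kept in higher precision, activations, and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3

n_params = 7e9  # assumption: "7B" in the model name means ~7 billion parameters

print(f"16-bit weights: {weight_memory_gib(n_params, 16):.1f} GiB")
print(f" 4-bit weights: {weight_memory_gib(n_params, 4):.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, which is what makes a 7B vision model practical on a 16 GB card.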

We now add LoRA adapters for parameter-efficient finetuning. This lets us train only about 1% of all parameters.

**[NEW]** We also support finetuning ONLY the vision part of the model, or ONLY the language part. Or you can select both! You can also select to finetune the attention or the MLP layers!

In [None]:

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 128,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha >= r; with r = 128 the LoRA scale alpha / r is only 0.125
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)
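To give a feel for why LoRA trains so few parameters, here is a minimal sketch that counts the extra weights LoRA adds to a single linear layer. The 4096 x 4096 layer size is a hypothetical value chosen for illustration (it is not read from the model), and `lora_extra_params` is a helper invented for this example.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

d_in = d_out = 4096  # hypothetical layer size, for illustration only
r = 128              # rank used in the cell above

frozen = d_in * d_out
trainable = lora_extra_params(d_in, d_out, r)
print(f"frozen weights:    {frozen:,}")
print(f"trainable weights: {trainable:,} ({trainable / frozen:.2%} of the layer)")
```

The share of trainable weights grows linearly with `r`, so a smaller rank brings the total closer to the 1% mentioned above, at the cost of adapter capacity.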

<a name="Data"></a>
### Data Prep
We'll be using a sampled dataset of handwritten maths formulas. The goal is to convert these images into a computer-readable form, i.e. LaTeX, so we can render them. This can be very useful for complex formulas.

You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).

Let's take a quick look at the dataset. We mount Google Drive, load the JSON file, and inspect the first sample.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
/content/drive/MyDrive/eventa/query

In [None]:
from datasets import load_dataset, Features, Image as HFImage, Value
import json
import os
json_file_path = "/content/drive/MyDrive/eventa/dataforQwen - Copy.json"
custom_features = Features({
    'image': Value('string'),      # Path to the image file, kept as a string (the image is opened later)
    'instruction': Value('string'),
    'output': Value('string'),
})
try:
    dataset = load_dataset("json", data_files=json_file_path, features=custom_features, split="train")
except Exception as e:
    print(f"Error while loading the dataset: {e}")
    print("Please check the JSON file format and the image paths.")
    raise  # Re-raise instead of exit(), which would kill the notebook kernel

In [None]:
subset = dataset.select(range(min(5000, len(dataset))))  # Cap at the dataset size to avoid an IndexError


In [None]:
subset[0]

In [None]:
import torch
from torchvision.transforms import ToTensor
import torchvision.transforms as transforms
from PIL import Image
from tqdm import tqdm
import os
import gc
import time

# === Convert GPU Tensor -> CPU PIL Image ===
def tensor_to_pil(tensor_img):
    if tensor_img.is_cuda:
        tensor_img = tensor_img.cpu()
    # Scale [0, 1] floats to [0, 255] and clamp BEFORE casting to uint8 to avoid wrap-around
    tensor_img = tensor_img.mul(255).clamp(0, 255).byte()
    return transforms.ToPILImage()(tensor_img)

# === Load + resize the image on the CPU ===
def load_and_resize_image(path, size=(512, 512)):
    try:
        if not os.path.exists(path):
            return None
        image = Image.open(path).convert("RGB")
        image = image.resize(size, Image.BILINEAR)
        return ToTensor()(image)  # [3, H, W], float in [0,1]
    except Exception as e:
        print(f"Error reading {path}:", e)
        return None

# === Process a batch of normalized images on the GPU ===
def process_images_batch_gpu(image_paths, device="cuda", target_size=(512, 512)):
    batch_tensors = []
    valid_paths = []

    for path in image_paths:
        tensor = load_and_resize_image(path, size=target_size)
        if tensor is not None:
            batch_tensors.append(tensor)
            valid_paths.append(path)

    if not batch_tensors:
        return []

    batch = torch.stack(batch_tensors).to(device)

    # (Optional) Resize again on the GPU if the CPU resize is not accurate enough
    # batch = torch.nn.functional.interpolate(batch, size=target_size, mode='bilinear', align_corners=False)

    pil_images = [tensor_to_pil(img) for img in batch]

    # Cleanup (call sparingly to avoid slowdowns)
    torch.cuda.empty_cache()

    return list(zip(valid_paths, pil_images))

# === Build conversation entries from a batch of images + rows ===
def make_dataset_entries_from_batch(rows, image_dict):
    entries = []
    for row in rows:
        img_path = row.get("image", "")
        if img_path not in image_dict:
            continue
        entry = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": row.get("instruction", "")},
                    {"type": "image", "image": image_dict[img_path]}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": row.get("output", "")}
                ]
            }
        ]
        entries.append(entry)
    return entries

# === Process the whole dataset in batches ===
def process_dataset_batch(subset, batch_size=64, device="cuda"):
    new_dataset = []

    for i in tqdm(range(0, len(subset), batch_size), desc="Processing in batches"):
        batch_rows_ds = subset.select(range(i, min(i + batch_size, len(subset))))
        batch_rows = batch_rows_ds.to_list()

        image_paths = [row.get("image", "") for row in batch_rows]
        image_pairs = process_images_batch_gpu(image_paths, device=device)

        image_dict = {path: img for path, img in image_pairs}
        batch_entries = make_dataset_entries_from_batch(batch_rows, image_dict)
        new_dataset.extend(batch_entries)

        # Call the GC less often to avoid slowing things down
        if i % (batch_size * 5) == 0:
            gc.collect()
            torch.cuda.empty_cache()

    return new_dataset

# === Run main ===
if __name__ == "__main__":
    from datasets import load_dataset
    device = "cuda" if torch.cuda.is_available() else "cpu"
    BATCH_SIZE = 128  # Try 128 on an A100 if the images are small

    # === Assumes you have already loaded the dataset from Hugging Face or a local file ===
    # dataset = load_dataset("path/to/your_dataset")
    # subset = dataset["train"].select(range(1000))

    # Debug test
    # print(subset[0])  # check that a sample has image, instruction, output

    start = time.perf_counter()
    new_dataset = process_dataset_batch(subset, batch_size=BATCH_SIZE, device=device)
    end = time.perf_counter()
    print(f"✅ Done! Number of samples: {len(new_dataset)} | Time: {end - start:.2f} seconds")


We can also render the LaTeX in the browser directly!

To format the dataset, all vision finetuning tasks should be formatted as follows:

```python
[
{ "role": "user",
  "content": [{"type": "text",  "text": Q}, {"type": "image", "image": image} ]
},
{ "role": "assistant",
  "content": [{"type": "text",  "text": A} ]
},
]
```
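As a minimal sketch of this format, the helper below wraps one record with `image`, `instruction`, and `output` fields (the column names used in this notebook's JSON file) into a two-turn conversation. `convert_to_conversation` is a name chosen for this example, and a plain `object()` stands in for the PIL image so the snippet runs without extra dependencies.

```python
def convert_to_conversation(sample: dict) -> list:
    """Wrap one record in the user/assistant chat format shown above."""
    return [
        {"role": "user",
         "content": [{"type": "text", "text": sample["instruction"]},
                     {"type": "image", "image": sample["image"]}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": sample["output"]}]},
    ]

# Placeholder record for illustration; a real one would hold a PIL image.
record = {
    "image": object(),
    "instruction": "Describe this image.",
    "output": "A short caption.",
}
conversation = convert_to_conversation(record)
print([turn["role"] for turn in conversation])
```

Each training example is then one such list, and the training set is simply a Python list of these conversations.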

The batch-processing cell above has already converted the dataset into this format, so let's check how many samples we ended up with:

In [None]:
len(new_dataset)

Before we do any finetuning, let's grab the image from the first example and see what the model outputs for it!

In [None]:
image = new_dataset[0][0]["content"][1]["image"]

In [None]:
image

In [None]:
FastVisionModel.for_inference(model) # Enable for inference!

# The prompt is split into two adjacent string literals: a double-quoted Python
# string cannot contain a raw line break, so the break is written as "\n".
instruction = (
    "Based on the paragraph describing the image and the article below describing\nCNN s On the Road series brings you a greater insight into countries around the world. This time we travel to Azerbaijan in the lead up to the European Games to explore the culture of sports in the country sitting on the Caspian Sea. CNN Forget middle aged men in Lycra Azerbaijan s new breed of cyclist is young, fit and fast.Take Aqshin Ismailov and Tural Isgandarov.\n\nSporting the latest high performance bikes, they stand proudly in front of the futuristic Heydar Aliyev Cultural Centre in Baku, hoping to do their country proud in the European Games and other major competitions maybe even the Tour de France.Both are members of Azerbaijan s Baku Project, which has the express aim of developing cyclists capable of competing on the international stage.\n\nAs in other sports, the squad is a mix of local riders and imported talent with the common goal of winning top class bike races.Read MoreOpening honorThe 27 year old Ismailov will be under particular scrutiny on the opening weekend of the European Games, when he is the first Azeri athlete in action.Ismailov now specializes in mountain biking and the men s cross country event is one of the opening events on Saturday June 13, the first full day of competition.\n\nI will be the first to start and that s why it s a great feeling, he told CNN. My parents can be there to support me, and it s a big responsibility for me. \n"
    "Ismailov has set himself a modest goal against more experienced competitors from countries in which cycling is an established major sport.\n\nFrankly, I don t expect to get medal because I m a beginner in mountain biking, but I hope and believe that I will finish in the top 10, he predicted.Tour influenceIsgandarov, a road cyclist, believes the staging of a Tour of Azerbaijan cycle race for each of the last four years, attracting top class professionals, has helped massively raise the profile of the sport in his country. I come across many people who see the race on television and now come and watch our racing, he told CNN.\n\nFour years ago this sport wasn t popular, but interest in cycling is now growing day by day, he said.Big investmentNone of this has happened by accident, with Azerbaijan investing heavily in sports and hosting events like the European Games, and next year a Formula 1 race, in Baku. We are positioning ourselves as a sports country.\n\nNot to be focused on 10 sports or 15 sports but many more, Azerbaijan s Sport and Youth minister Azad Rahimov, told CNN.Local riders will face tough competition across all cycling disciplines at the European Games, with famous riders such as Tom Boonen of Belgium and Italy s Elia Viviani among the big names on the start sheets.But if the enthusiasm of Ismailov and Isgandarov can be translated into medals, they will also be challenging for places on the podium in the future.\n\nEvery small kid likes to ride a bike. For example, in my family we have three kids and they also want to be a cyclist. I say No problem. I ll help you. Ismailov, who used to compete in the national sport of wrestling before switching disciplines, agrees."
)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for 3 full epochs (`num_train_epochs = 3`). For a quick test you can cap the run with `max_steps` instead, for example `max_steps = 60`. We also support TRL's `DPOTrainer`!

We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.

In [None]:
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Must use!
    train_dataset = new_dataset,
    args = SFTConfig(
        per_device_train_batch_size = 16,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        num_train_epochs = 3, # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",     # Use "wandb" for Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    ),
)
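To make the batch settings above concrete, the following sketch works out the effective batch size and an approximate step count. It assumes a single GPU and that all 5000 rows of the subset survived preprocessing. The step count is only an estimate, since the exact rounding depends on the trainer version.

```python
import math

per_device_train_batch_size = 16  # from the SFTConfig above
gradient_accumulation_steps = 4   # from the SFTConfig above
num_train_epochs = 3              # from the SFTConfig above
num_gpus = 1                      # assumption: a single GPU
num_samples = 5000                # assumption: every row of the subset was kept

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_samples / effective_batch)  # approximate
total_steps = steps_per_epoch * num_train_epochs

print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps: {steps_per_epoch} per epoch, {total_steps} in total")
```

Under these assumptions, `warmup_steps = 10` covers roughly the first 4% of training.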

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")


In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

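The pre-finetuning test above used `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why. Note that the captioning loop below passes `do_sample=False`, so it decodes greedily and the sampling parameters have no effect there.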
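To make the `min_p` setting concrete, here is a small pure-Python sketch of min-p filtering on a toy distribution. It is an illustration, not the library implementation: the logits are made up, `min_p_filter` is a hypothetical helper, and applying temperature before the min-p cutoff is an assumption of this sketch.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)              # cutoff scales with the top token's probability
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

toy_logits = [5.0, 4.0, 1.0, -2.0]  # made-up scores for a 4-token vocabulary
filtered = min_p_filter(toy_logits)
print([round(p, 3) for p in filtered])
```

A high temperature flattens the distribution, and the min-p cutoff then removes the low-probability tail that the flattening would otherwise make more likely to be sampled.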

In [None]:
import pandas as pd
import json

file_path = '/content/drive/MyDrive/eventa/test1000tokenQwen.json'
df = pd.read_json(file_path, lines=True)
dff = df[1000:2000]
image_files = df['image'].tolist()
image_files = image_files[1000:2000]


In [None]:
import os
import torch
import json
import gc
import pandas as pd
from PIL import Image
from io import BytesIO
from tqdm import tqdm


# Resize the image and re-encode it as PNG in memory
def pil_to_png_list(pil_img):
    pil_img = pil_img.resize((512, 512), resample=Image.Resampling.LANCZOS)
    buffer = BytesIO()
    pil_img.save(buffer, format="PNG")
    buffer.seek(0)
    img = Image.open(buffer).convert("RGB")
    return [img]

# Configure the image directory and batch size
image_dir = "/content/drive/MyDrive/eventa/query"
batch_size = 1


results = []

# Generate a caption for one image
def generate_caption(image_path, instruction):
    pil_image = Image.open(image_path).convert("RGB")
    image_list = pil_to_png_list(pil_image)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_list[0]},
                {"type": "text", "text": instruction}
            ]
        }
    ]

    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    inputs = tokenizer(
        image_list[0],
        input_text,
        add_special_tokens=False,
        return_tensors="pt"
    ).to("cuda")

    with torch.no_grad():
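        output = model.generate(
            **inputs,
            max_new_tokens=512,
            use_cache=True,
            # Greedy decoding: with do_sample=False, sampling parameters such as
            # temperature and top_p are ignored, so they are omitted here.
            do_sample=False,
        )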

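    # Decode only the newly generated tokens so the caption excludes the prompt
    prompt_length = inputs["input_ids"].shape[1]
    caption = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)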

    results.append({
        "filename": os.path.basename(image_path),
        "caption": caption.strip(),
    })

    del inputs, output
    torch.cuda.empty_cache()
    gc.collect()


for i in tqdm(range(0, len(image_files), batch_size), desc="Processing batches"):
    batch_files = image_files[i:i + batch_size]

    for j, filename in enumerate(batch_files):
        image_path = os.path.join(image_dir, filename)
        instruction = dff.iloc[i + j]['instruction']
        generate_caption(image_path, instruction)
output_file = "/content/drive/MyDrive/eventa/image_captions1.json"
# Save the results to a JSON file
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')

from huggingface_hub import login
login(HF_TOKEN)

model.push_to_hub("Nguyenhhh/Qwen-400M")
tokenizer.push_to_hub("Nguyenhhh/Qwen-400M")


In [None]:
from google.colab import userdata
userdata.get('secretName')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
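A minimal sketch of that reload step is shown below, assuming the adapters were pushed to the `Nguyenhhh/Qwen-400M` repository as in the cell above. The `LOAD_SAVED_ADAPTERS` flag is a name introduced for this example. It is left as `False`, so nothing is downloaded unless you flip it.

```python
LOAD_SAVED_ADAPTERS = False  # set to True to reload the saved LoRA adapters

if LOAD_SAVED_ADAPTERS:
    from unsloth import FastVisionModel

    model, tokenizer = FastVisionModel.from_pretrained(
        "Nguyenhhh/Qwen-400M",  # the repository pushed above
        load_in_4bit = True,    # match the setting used for training
    )
    FastVisionModel.for_inference(model)  # Enable for inference!
```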

In [None]:
import zipfile
import os

# Root directory to compress
base_dir = '/kaggle/working'
zip_filename = '/kaggle/working/working_directory.zip'

# Create a new zip file
with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, dirs, files in os.walk(base_dir):
        for file in files:
            abs_path = os.path.join(root, file)
            if abs_path == zip_filename:
                continue  # Skip the archive itself, which lives inside base_dir
            # Keep the directory structure relative to the filesystem root /
            arcname = abs_path.lstrip('/')  # or use os.path.relpath(abs_path, '/') to be safe
            zipf.write(abs_path, arcname=arcname)

print(f"Created zip file: {zip_filename}")
from IPython.display import FileLink
FileLink("/kaggle/working/working_directory.zip")