In [1]:
prompt = """
You are an expert image tagging and captioning AI specialized in creating training data for high-fidelity text-to-image models.

Your task is to write a comprehensive, dense, and visually descriptive caption for the provided image.

**Guidelines:**
1.  **Structure:**
* **Sentence 1:** Clearly identify the main subject, their action/pose, and the immediate context.
* **Sentences 2-3:** Describe the environment, background elements, and spatial relationships (e.g., "in the foreground," "to the left").
* **Sentence 4:** Describe the artistic medium (e.g., photograph, oil painting, 3D render), camera angle, lighting style (e.g., cinematic, volumetric), and color palette.
* **Sentence 5:** Mention specific textures, clothing details, or unique stylistic flourishes.

2.  **Tone & Style:**
* Be objective and direct.
* Use precise visual terminology (e.g., instead of "cool clothes," say "a distressed leather jacket with silver studs").
* Do NOT use filler phrases like "This image shows," "A picture of," or "In this scene." Start directly with the subject.

3.  **Detail Level:**
* Describe any text or signs visible in the image, if legible.
* Describe the emotion or atmosphere if it is visually distinct (e.g., "melancholic atmosphere," "chaotic energy").

**Output Format:**
Provide a single, flowing paragraph of 150-250 words.
"""
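
The prompt above sets a few mechanical constraints (150-250 words, a single paragraph, no filler opening). A minimal sketch of a post-generation check for those constraints follows. The function name `check_caption` and the constants are hypothetical helpers, not part of the original notebook, and the filler check only looks at the start of the caption, since phrases such as "a picture of" can appear legitimately later in a description.

```python
# Hypothetical constants mirroring the constraints stated in the prompt.
MIN_WORDS, MAX_WORDS = 150, 250
FILLER_PREFIXES = ("this image shows", "a picture of", "in this scene")


def check_caption(caption: str) -> list[str]:
    """Return a list of problems found; an empty list means the caption passes."""
    problems = []
    text = caption.strip()

    # Whitespace-separated tokens are used as an approximate word count.
    n_words = len(text.split())
    if not MIN_WORDS <= n_words <= MAX_WORDS:
        problems.append(f"word count {n_words} outside {MIN_WORDS}-{MAX_WORDS}")

    # Only the opening of the caption is checked for filler phrases.
    if text.lower().startswith(FILLER_PREFIXES):
        problems.append("starts with a filler phrase")

    # The prompt asks for a single flowing paragraph.
    if "\n" in text:
        problems.append("contains line breaks; expected a single paragraph")

    return problems
```

Once a caption has been generated, calling `check_caption(caption)` on the decoded string gives a quick way to flag outputs that should be regenerated or filtered out of the training set.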

In [2]:
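import torch
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

model_name = "Qwen/Qwen3-VL-30B-A3B-Instruct"

# The processor handles chat templating, image preprocessing, and tokenization.
processor = AutoProcessor.from_pretrained(model_name)

# Load the weights in bfloat16 on the first GPU.
# attn_implementation="flash_attention_2" requires the flash-attn package.
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda:0",
)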

Loading checkpoint shards: 100%|██████████| 13/13 [00:26<00:00,  2.03s/it]
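
The loading cell above hard-codes `attn_implementation="flash_attention_2"`, which fails if the `flash_attn` package is not installed. A minimal sketch of a fallback follows. The helper name `pick_attn_implementation` is hypothetical, and it assumes that `"sdpa"` (PyTorch scaled dot-product attention) is an acceptable alternative for this model, which is worth verifying for your installed transformers version.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    # find_spec returns None when a top-level package cannot be found.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumption: the model accepts the "sdpa" attention backend.
    return "sdpa"


attn_impl = pick_attn_implementation()
print(attn_impl)
```

The resulting string can be passed as `attn_implementation=attn_impl` to `from_pretrained` in place of the hard-coded value.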


In [3]:
from PIL import Image

image = Image.open('../data/tet.jpg')

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image", 
                "image": image
            },
            {
                "type": "text", 
                "text": prompt
            }
        ]
    }
]
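# Render the chat template, preprocess the image, and tokenize in one call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Sample a caption; max_new_tokens leaves headroom above the 250-word limit.
with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        top_p=0.8,
        top_k=20,
        max_new_tokens=500,
    )

# generate() returns prompt + completion, so drop the prompt tokens before decoding.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# batch_decode returns a list with one string per input sequence.
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)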


["A breathtaking aerial view captures a vast, snow-dusted mountain range from the perspective of an airplane window, with the aircraft's wingtip visible in the foreground. The scene unfolds in layers, with a dense, white cloud bank filling the lower portion of the frame, creating a sense of immense altitude and separation from the earth below. Beyond the clouds, a rugged alpine landscape stretches into the distance, its peaks and ridges blanketed in snow and ice, their sharp silhouettes contrasting against a pale blue sky. The lighting is bright and diffuse, characteristic of high-altitude daylight, which casts soft shadows and emphasizes the textures of the snow and rock. The image is a high-fidelity photograph, captured from a high-angle perspective, likely during a flight over a major mountain range. The color palette is dominated by cool tones of white, blue, and grey, with subtle hints of dark earthy tones on the lower slopes. The overall atmosphere is one of serene grandeur and v