# Image Text to Text

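A minimal example of image-text-to-text generation with the Hugging Face `transformers` library: load a vision-language model, give it an image and a question, and print the generated answer.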

## Imports

In [1]:
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

## Load Vision Model

In [None]:
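model_id = ""  # left empty in this notebook; set to the Hub ID or local path of an image-text-to-text model before running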
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

## Image Text Generation

In [3]:
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(url)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

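# Render the chat template, tokenize the prompt, and preprocess the image in one call
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# `outputs` holds the prompt tokens followed by the new tokens, so the decoded text repeats the prompt
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])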

Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


user

What is in this image?
assistant
This image shows a street scene with a stop sign, a traditional Chinese archway, and several storefronts. There are also people walking and a black SUV driving on the street. The archway has Chinese characters and decorative elements, indicating a location that might be in a Chinese-speaking region.
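
The decoded output above repeats the prompt (`user ... assistant`) before the answer, because the generated sequence contains the prompt tokens followed by the new ones. A minimal sketch of dropping the prompt before decoding is shown below. The helper name `strip_prompt_tokens` and the toy token IDs are hypothetical and stand in for the real `outputs` and prompt length; plain Python lists are used so the idea can be checked without loading a model.

```python
def strip_prompt_tokens(sequences, prompt_length):
    """Drop the first `prompt_length` token IDs from each generated sequence.

    Hypothetical helper: assumes every sequence starts with the same
    `prompt_length` prompt tokens, followed by the newly generated ones.
    """
    return [list(seq[prompt_length:]) for seq in sequences]


# Toy stand-in for `outputs`: three prompt tokens, then three generated tokens.
toy_outputs = [[101, 102, 103, 7, 8, 9]]
toy_prompt_length = 3

new_tokens = strip_prompt_tokens(toy_outputs, toy_prompt_length)
print(new_tokens)  # [[7, 8, 9]]
```

In the notebook itself, the equivalent step would likely be slicing the tensor with `outputs[:, inputs["input_ids"].shape[-1]:]` and passing the result to `processor.batch_decode`, assuming the model returns the prompt followed by the completion, as the output above suggests.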
