# Context-aware sensitive text extraction

### Mark Polokhov

In [1]:
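import os
# Set the cache location before importing transformers so the library picks it up.
# Recent transformers releases deprecate TRANSFORMERS_CACHE in favor of HF_HOME.
os.environ["TRANSFORMERS_CACHE"] = "X:/Programming/Models"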

In [2]:
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image



In [3]:
model_id = "Qwen/Qwen2-VL-2B-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    dtype=torch.float16,  # replaces the deprecated `torch_dtype` argument
    device_map="auto",
)

The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.

## 1

In [4]:
image = Image.open("image_1.png")

In [5]:
prompt = """
Extract ONLY sensitive text from the image.

Rules:
- Output ONLY text that appears verbatim in the image
- Sensitivity must depend on visual context
- One item per line
- No explanations
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ]
    }
]
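
The same prompt and message structure is reused for every image in this notebook. A minimal sketch of a helper that builds it once is shown here; the names `PROMPT` and `build_messages` are hypothetical and do not appear in the original cells.

```python
# Hypothetical helper: builds the single-turn chat message used for each image.
PROMPT = """
Extract ONLY sensitive text from the image.

Rules:
- Output ONLY text that appears verbatim in the image
- Sensitivity must depend on visual context
- One item per line
- No explanations
"""


def build_messages(image, prompt=PROMPT):
    """Return a one-message conversation pairing an image with the text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

With this in place, each section could call `messages = build_messages(image)` instead of repeating the literal.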

In [6]:
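# Render the chat messages into the prompt string the model expects.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize the prompt and preprocess the image into model-ready tensors.
inputs = processor(
    text=[text],
    images=image,
    return_tensors="pt"
).to(model.device)

# Greedy decoding (no sampling), capped at 128 new tokens.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
)

# outputs[0] holds the prompt tokens followed by the generated tokens,
# so the decoded text echoes the system and user turns before the answer.
print(processor.decode(outputs[0], skip_special_tokens=True))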

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


system
You are a helpful assistant.
user

Extract ONLY sensitive text from the image.

Rules:
- Output ONLY text that appears verbatim in the image
- Sensitivity must depend on visual context
- One item per line
- No explanations

assistant
Layer Norm
Causal Self-Attention
Fully Connected
Melody Conditioning
Positional Embedding
Convolutional Auto-encoder
EnCodec with RVQ
Residual Vector Quantization
Learned embedding table
Decoder
Text Conditioning
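
The printed text above repeats the system and user turns because `generate` returns the prompt tokens followed by the newly generated ones. One way to keep only the model's answer is to drop the first `len(prompt)` tokens of each sequence before decoding. The sketch below uses plain Python lists as stand-ins for the real tensors; `trim_prompt_tokens` is a hypothetical name, not part of the notebook or of transformers.

```python
def trim_prompt_tokens(input_ids, output_ids):
    """Drop the echoed prompt from each generated sequence.

    Assumes each output sequence starts with its prompt tokens,
    which is how decoder-only generation returns its results.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Toy token ids standing in for real tensors.
prompt_ids = [[101, 7, 8, 9]]
generated = [[101, 7, 8, 9, 42, 43, 2]]
print(trim_prompt_tokens(prompt_ids, generated))  # [[42, 43, 2]]
```

With the notebook's objects this would presumably look like `trimmed = trim_prompt_tokens(inputs["input_ids"], outputs)` followed by `processor.batch_decode(trimmed, skip_special_tokens=True)`, which should print only the extracted items.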


## 2

In [8]:
image = Image.open("image_2.jpg")

In [9]:
prompt = """
Extract ONLY sensitive text from the image.

Rules:
- Output ONLY text that appears verbatim in the image
- Sensitivity must depend on visual context
- One item per line
- No explanations
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ]
    }
]

In [10]:
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(
    text=[text],
    images=image,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
)

print(processor.decode(outputs[0], skip_special_tokens=True))

system
You are a helpful assistant.
user

Extract ONLY sensitive text from the image.

Rules:
- Output ONLY text that appears verbatim in the image
- Sensitivity must depend on visual context
- One item per line
- No explanations

assistant
Apache Kafka
Spark Streaming
Apache HBase
C
elasticsearch
MySQL
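
Since the prompt asks for one item per line, the decoded transcript can be turned into a Python list for downstream use. The following is a heuristic sketch, assuming the transcript format printed above, where a line containing only `assistant` precedes the answer; `parse_items` is a hypothetical helper. It would misfire if an extracted item were itself the single word `assistant`.

```python
def parse_items(decoded: str, marker: str = "assistant") -> list[str]:
    """Return the non-empty lines that follow the last `marker` line."""
    lines = decoded.splitlines()
    # Index of the last line that is exactly the role marker, or -1 if absent.
    start = max(
        (i for i, line in enumerate(lines) if line.strip() == marker),
        default=-1,
    )
    return [line.strip() for line in lines[start + 1:] if line.strip()]


sample = (
    "system\nYou are a helpful assistant.\nuser\n\n"
    "Extract ONLY sensitive text from the image.\n\n"
    "assistant\nApache Kafka\nSpark Streaming\nMySQL\n"
)
print(parse_items(sample))  # ['Apache Kafka', 'Spark Streaming', 'MySQL']
```

If no marker line is found, the function falls back to returning every non-empty line, so it also works on output that has already been trimmed to the answer.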


## Other 2

In [16]:
image = Image.open("image_other_2.png")

In [17]:
prompt = """
Extract ONLY sensitive text from the image.

Rules:
- Output ONLY text that appears verbatim in the image
- Sensitivity must depend on visual context
- One item per line
- No explanations
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ]
    }
]

In [None]:
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(
    text=[text],
    images=image,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
)

print(processor.decode(outputs[0], skip_special_tokens=True))

system
You are a helpful assistant.
user

Extract ONLY sensitive text from the image.

Rules:
- Output ONLY text that appears verbatim in the image
- Sensitivity must depend on visual context
- One item per line
- No explanations

assistant
Шерлок Холмс


## Safe

In [19]:
image = Image.open("image_safe.png")

In [20]:
prompt = """
Extract ONLY sensitive text from the image.

Rules:
- Output ONLY text that appears verbatim in the image
- Sensitivity must depend on visual context
- One item per line
- No explanations
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ]
    }
]

In [21]:
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(
    text=[text],
    images=image,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
)

print(processor.decode(outputs[0], skip_special_tokens=True))

system
You are a helpful assistant.
user

Extract ONLY sensitive text from the image.

Rules:
- Output ONLY text that appears verbatim in the image
- Sensitivity must depend on visual context
- One item per line
- No explanations

assistant
1. Centralized data collection and storage
2. Improve reporting and analytics
3. Reduce load on backend systems
4. Ensure data quality and compliance
