# NuExtract 2.0 Inference

In this notebook we will provide examples of how to use the NuExtract 2.0 models for inference.

First, let's load a model.

In [1]:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "numind/NuExtract-2.0-2B"
# model_name = "numind/NuExtract-2.0-8B"

processor = AutoProcessor.from_pretrained(model_name, 
                                          trust_remote_code=True, 
                                          padding_side='left',
                                          use_fast=True)
model = AutoModelForVision2Seq.from_pretrained(model_name, 
                                               trust_remote_code=True, 
                                               torch_dtype=torch.bfloat16,
                                               attn_implementation="flash_attention_2",
                                               device_map="auto")
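
The load call above requests `flash_attention_2`, which requires the separate `flash-attn` package to be installed. As a minimal sketch, the hypothetical helper `pick_attn_implementation()` below falls back to `"sdpa"` (PyTorch's scaled dot-product attention) when `flash_attn` cannot be imported. It assumes the model also runs with `"sdpa"`, which we have not verified here.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    # find_spec only checks that the package can be located; it does not import it
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as `attn_implementation=...` to `from_pretrained()`.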

## Preparing Model Inputs

Before using the model, we also need to make sure our prompts are properly formatted to work with NuExtract. NuExtract expects all input information to come as a single user chat prompt, formatted as follows:

```python
f"""
# Template:
{template}
# Context:
{text}
"""
```
and if in-context examples are provided:
```python
f"""
# Template:
{template}
# Examples:
## Input:
{input1}
## Output:
{output1}
## Input:
{input2}
## Output:
{output2}
# Context:
{text}
"""
```

If you are working with image inputs, you should use image placeholders for `text`, `input1`, etc. Later, we will inject tokens representing the actual image content in the location of these placeholders.

The following function will make this formatting more convenient for us.

In [None]:
def construct_messages(document, template, examples=None, image_placeholder="<|vision_start|><|image_pad|><|vision_end|>"):
    """
    Construct the individual NuExtract message texts, prior to chat template formatting.
    """
    images = []
    # add few-shot examples if needed
    if examples is not None and len(examples) > 0:
        icl = "# Examples:\n"
        for row in examples:
            example_input = row['input']
            
            if not isinstance(row['input'], str):
                example_input = image_placeholder
                images.append(row['input'])
                
            icl += f"## Input:\n{example_input}\n## Output:\n{row['output']}\n"
    else:
        icl = ""
        
    # if input document is an image, set text to an image placeholder
    text = document
    if not isinstance(document, str):
        text = image_placeholder
        images.append(document)
    text = f"""# Template:\n{template}\n{icl}# Context:\n{text}"""
    
    messages = [
        {
            "role": "system",
            "content": "You are NuExtract, an information extraction tool created by NuMind." 
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": text}] + images,
        }
    ]
    return messages

## Inference
### Basic Example

Now we are ready to run the model!

Let's start with a basic text-only example, where we want to extract people's names from a short text.

In [None]:
from qwen_vl_utils import process_vision_info

template = """{"names": ["verbatim-string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."

# prepare the user message content
messages = construct_messages(document, template)
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_vision_info(messages)[0]
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

Our NuExtract message has now been rendered with the chat template; the tokenized version (`inputs`) will be passed directly to the model.

In [4]:
print(text)

<|im_start|>user
# Template:
{"names": ["verbatim-string"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|>
<|im_start|>assistant



`image_inputs` is `None` in this case because this is a text-only example.

In [5]:
print(image_inputs)

None


Now let's actually run the model.

In [6]:
# we choose greedy decoding here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)



['{"names": ["John", "Mary", "James"]}']
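
The model returns its extraction as a JSON string, so in practice we usually want to parse it into a Python object. A minimal sketch, using the output string from the cell above as a literal so that it runs without the model:

```python
import json

# Output string copied from the cell above
output_text = ['{"names": ["John", "Mary", "James"]}']

# batch_decode returns one string per input, so we parse the first entry
result = json.loads(output_text[0])
print(result["names"])       # ['John', 'Mary', 'James']
print(len(result["names"]))  # 3
```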


Alternatively, you can pass the template and in-context examples directly to `processor.tokenizer.apply_chat_template()`, rather than manually preparing the prompt via `construct_messages()`.

In [None]:
template = """{"names": ["verbatim-string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."

# prepare the user message content
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template, # template is specified here
    tokenize=False,
    add_generation_prompt=True,
)

print(text)

image_inputs = process_vision_info(messages)[0]
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# we choose greedy decoding here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)

<|im_start|>user
# Template:
{"names": ["verbatim-string"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|>
<|im_start|>assistant

['{"names": ["John", "Mary", "James"]}']
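
The generation cells slice each output with `out_ids[len(in_ids):]` before decoding, because `generate()` returns the prompt tokens followed by the newly generated ones. The sketch below illustrates that slicing with plain Python lists. The token ids are made up, and the left-padded first row mimics what `padding_side='left'` produces when prompts in a batch have different lengths.

```python
PAD = 0  # hypothetical padding token id

# Two prompts padded on the left to the same length
prompt_ids = [
    [PAD, PAD, 11, 12],  # shorter prompt, left-padded
    [21, 22, 23, 24],
]

# generate() output: each row is the (padded) prompt plus new tokens
generated = [
    [PAD, PAD, 11, 12, 91, 92],
    [21, 22, 23, 24, 93, 94],
]

# Drop the prompt part of every row, keeping only the new tokens
trimmed = [out[len(inp):] for inp, out in zip(prompt_ids, generated)]
print(trimmed)  # [[91, 92], [93, 94]]
```

With left padding, every prompt row has the same length and the new tokens start at the same offset, so one slice per row is enough.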


### In-Context Examples

Sometimes the model might not perform as well as we want because our task is challenging or involves some degree of ambiguity. Alternatively, we may want the model to follow some specific formatting, or just give it a bit more help. In cases like this it can be valuable to provide "in-context examples" to help NuExtract better understand the task.

To do so, we can pass a list of input/output dictionaries as `examples` to `apply_chat_template()` (or `construct_messages()`). In the example below, we show the model that we want the extracted names to be in capital letters with `-` on either side (for the sake of illustration).

In [None]:
template = """{"names": ["verbatim-string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
    }
]

messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    examples=examples, # examples provided here
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_vision_info(messages)[0]
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

We can see below that the in-context example has now been included in the model prompt, specifically between the template and context components.

In [9]:
print(text)

<|im_start|>user
# Template:
{"names": ["verbatim-string"]}
# Examples:
## Input:
Stephen is the manager at Susan's store.
## Output:
{"names": ["-STEPHEN-", "-SUSAN-"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|>
<|im_start|>assistant



In [10]:
# we choose greedy decoding here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

['{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}']


Adding multiple in-context examples to your input can improve performance further.
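
When there are several examples, it can be convenient to keep the expected outputs as Python dictionaries and serialize them with `json.dumps`, so every `output` string is valid JSON. A minimal sketch; the second pair is a made-up example added for illustration.

```python
import json

# (input text, expected extraction) pairs; the second pair is hypothetical
pairs = [
    ("Stephen is the manager at Susan's store.", {"names": ["-STEPHEN-", "-SUSAN-"]}),
    ("Alice called Bob yesterday.", {"names": ["-ALICE-", "-BOB-"]}),
]

# Build the list of input/output dictionaries expected by `examples`
examples = [
    {"input": text, "output": json.dumps(expected)}
    for text, expected in pairs
]

print(examples[0]["output"])  # {"names": ["-STEPHEN-", "-SUSAN-"]}
```

The resulting `examples` list has the same structure as the one defined by hand above.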

### Image Inputs

To give NuExtract an image input instead of text, we provide a dictionary specifying the image file as the message content rather than a string, e.g. `{"type": "image", "image": "file://image.jpg"}`.

You can also specify an image URL (e.g. `{"type": "image", "image": "http://path/to/your/image.jpg"}`) or base64 encoding (e.g. `{"type": "image", "image": "data:image;base64,/9j/..."}`).
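
For the base64 form, the image entry can be built from raw bytes with the standard library. The sketch below uses a hypothetical helper, `image_bytes_to_message()`, and three stand-in bytes (the start of a JPEG file) instead of a real image.

```python
import base64


def image_bytes_to_message(image_bytes: bytes) -> dict:
    """Wrap raw image bytes as a base64 data-URI image entry."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image", "image": f"data:image;base64,{encoded}"}


# Stand-in bytes: the first three bytes of a JPEG file
entry = image_bytes_to_message(b"\xff\xd8\xff")
print(entry["image"])  # data:image;base64,/9j/
```

For a real file, the bytes could come from something like `pathlib.Path("image.jpg").read_bytes()`.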

First, we will need a modified version of `process_vision_info()` that handles image-based in-context examples as well as primary inputs.

In [11]:
def process_all_vision_info(messages, examples=None):
    """
    Process vision information from both messages and in-context examples, supporting batch processing.
    
    Args:
        messages: List of message dictionaries (single input) OR list of message lists (batch input)
        examples: Optional list of example dictionaries (single input) OR list of example lists (batch)
    
    Returns:
        A flat list of all images in the correct order:
        - For single input: example images followed by message images
        - For batch input: interleaved as (item1 examples, item1 input, item2 examples, item2 input, etc.)
        - Returns None if no images were found
    """
    from qwen_vl_utils import process_vision_info, fetch_image
    
    # Helper function to extract images from examples
    def extract_example_images(example_item):
        if not example_item:
            return []
            
        # Handle both list of examples and single example
        examples_to_process = example_item if isinstance(example_item, list) else [example_item]
        images = []
        
        for example in examples_to_process:
            if isinstance(example.get('input'), dict) and example['input'].get('type') == 'image':
                images.append(fetch_image(example['input']))
                
        return images
    
    # Normalize inputs to always be batched format
    is_batch = messages and isinstance(messages[0], list)
    messages_batch = messages if is_batch else [messages]
    is_batch_examples = examples and isinstance(examples, list) and (isinstance(examples[0], list) or examples[0] is None)
    examples_batch = examples if is_batch_examples else ([examples] if examples is not None else None)
    
    # Ensure examples batch matches messages batch if provided
    if examples and len(examples_batch) != len(messages_batch):
        if not is_batch and len(examples_batch) == 1:
            # Single example set for a single input is fine
            pass
        else:
            raise ValueError("Examples batch length must match messages batch length")
    
    # Process all inputs, maintaining correct order
    all_images = []
    for i, message_group in enumerate(messages_batch):
        # Get example images for this input
        if examples and i < len(examples_batch):
            input_example_images = extract_example_images(examples_batch[i])
            all_images.extend(input_example_images)
        
        # Get message images for this input
        input_message_images = process_vision_info(message_group)[0] or []
        all_images.extend(input_message_images)
    
    return all_images if all_images else None
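
The ordering described in the docstring matters because the images have to line up with the image placeholders in the prompts. The sketch below reproduces only that ordering logic, using stand-in strings instead of PIL images; the item names are hypothetical.

```python
# Stand-in strings instead of PIL images
batch = [
    {"examples": ["ex_a1", "ex_a2"], "input": ["in_a"]},  # image input, 2 image examples
    {"examples": [], "input": ["in_b"]},                  # image input, no examples
    {"examples": ["ex_c1"], "input": []},                 # text input, 1 image example
]

flat = []
for item in batch:
    flat.extend(item["examples"])  # example images come first
    flat.extend(item["input"])     # then the item's own input images

print(flat)  # ['ex_a1', 'ex_a2', 'in_a', 'in_b', 'ex_c1']
```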


In the example below, we give an image of a receipt (`data/1.jpg`) and ask the model to extract the name of the store. We also provide an ICL example of a receipt from Walmart (`data/0.jpg`).

In [12]:
template = """{"store": "verbatim-string"}"""
document = {"type": "image", "image": "file://data/1.jpg"}
examples = [
    {
        "input": {"type": "image", "image": "file://data/0.jpg"},
        "output": """{"store": "WALMART"}"""
    }
]

messages = [{"role": "user", "content": [document]}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    examples=examples,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages, examples)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

Just like in the text-only case above, our in-context example has been included in the prompt before the main context.

In [13]:
print(text)

<|im_start|>user
# Template:
{"store": "verbatim-string"}
# Examples:
## Input:
<|vision_start|><|image_pad|><|vision_end|>
## Output:
{"store": "WALMART"}
# Context:
<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant



Now if we look at `image_inputs` we will see that it contains actual images. When we pass this along with `text` to `processor()` it automatically encodes the images and injects a tokenized representation into the image placeholders within `text`.

In [14]:
print(image_inputs)

[<PIL.Image.Image image mode=RGB size=588x896 at 0x7F2E59587760>, <PIL.Image.Image image mode=RGB size=476x980 at 0x7F2E595867D0>]
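
A quick sanity check before calling `processor()` is to confirm that the prompt contains exactly one image placeholder per image. A minimal sketch, using the prompt printed above as a literal and two stand-in strings in place of the PIL images:

```python
# Prompt text copied from the printed output above
prompt = (
    "<|im_start|>user\n"
    "# Template:\n"
    '{"store": "verbatim-string"}\n'
    "# Examples:\n"
    "## Input:\n"
    "<|vision_start|><|image_pad|><|vision_end|>\n"
    "## Output:\n"
    '{"store": "WALMART"}\n'
    "# Context:\n"
    "<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Stand-ins for the two PIL images in `image_inputs`
images = ["example_image", "input_image"]

n_placeholders = prompt.count("<|image_pad|>")
assert n_placeholders == len(images), "placeholder/image count mismatch"
print(n_placeholders)  # 2
```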


In [15]:
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

['{"store": "TRADER JOE\'S"}']


### Batched Inference

Finally, we can run batched inference over a list of input examples, regardless of whether they contain text, images, and/or ICL examples.

In [None]:
inputs = [
    # image input with no ICL examples
    {
        "document": {"type": "image", "image": "file://data/0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
    },
    # image input with 1 ICL example
    {
        "document": {"type": "image", "image": "file://data/0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": {"type": "image", "image": "file://data/1.jpg"},
                "output": """{"store_name": "Trader Joe's"}""",
            }
        ],
    },
    # text input with no ICL examples
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["verbatim-string"]}""",
    },
    # text input with ICL example
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["verbatim-string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
            }
        ],
    },
]

# messages should be a list of lists for batch processing
messages = [[{"role": "user", "content": [x['document']]}] for x in inputs]

# apply chat template to each example individually
texts = [
    processor.tokenizer.apply_chat_template(
        messages[i],  # Now this is a list containing one message
        template=x['template'],
        examples=x.get('examples', None),
        tokenize=False, 
        add_generation_prompt=True)
    for i, x in enumerate(inputs)
]

image_inputs = process_all_vision_info(messages, [x.get('examples') for x in inputs])
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Batch Inference
generated_ids = model.generate(**inputs, **generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for y in output_texts:
    print(y)

{"store_name": "WAL-MART"}
{"store_name": "Walmart"}
{"names": ["John", "Mary", "James"]}
{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}
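
When processing many inputs, an individual output may occasionally fail to parse as JSON, for example if generation stops at `max_new_tokens` before the closing brace. The sketch below parses a batch defensively with a hypothetical helper, `parse_outputs()`, using the outputs above as literals.

```python
import json

# Output strings copied from the batch above
output_texts = [
    '{"store_name": "WAL-MART"}',
    '{"store_name": "Walmart"}',
    '{"names": ["John", "Mary", "James"]}',
    '{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}',
]


def parse_outputs(texts):
    """Parse each output as JSON; use None where parsing fails."""
    parsed = []
    for text in texts:
        try:
            parsed.append(json.loads(text))
        except json.JSONDecodeError:
            parsed.append(None)
    return parsed


results = parse_outputs(output_texts)
print(results[0]["store_name"])        # WAL-MART
print(parse_outputs(['{"names": ['])) # [None]
```

Keeping a `None` entry for each failure preserves the alignment between inputs and results.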
