# Extract data from images

We want to extract data from some [voter flow charts](https://www.tagesschau.de/wahl/archiv/2025-02-23-BT-DE/analyse-wanderung.shtml). Each chart shows, for one party at the center, how many voters it gained from or lost to other parties compared to the previous election. Traditional OCR won’t get us very far (or would require a great deal of effort). LLMs, however, can recognize elements such as parties, arrows, directions, and labeled numbers as parts of a coherent diagram, especially if we describe how the chart is structured. This allows the LLM to translate the graphic into structured data.

In [None]:
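# Colab setup: clone the repository and install its dependencies

import sys, pathlib

# Skip the clone if we are already inside the repository (e.g. when this cell is re-run)
if ('google.colab' in sys.modules) and (pathlib.Path.cwd().name != 'scicar-agents'):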
    !git clone https://github.com/marcelpauly/scicar-agents.git
    %cd scicar-agents
    %pip install -q -r requirements.txt

In [None]:
from datetime import datetime
import os

from pydantic import BaseModel, Field

from pydantic_ai import Agent, ImageUrl
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.models.anthropic import AnthropicModel
from pydantic_ai.models.google import GoogleModel

# Allow async code to run inside Jupyter Notebook's existing event loop
import nest_asyncio
nest_asyncio.apply()

In [None]:
models = {}

# OpenAI
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'
models['gpt'] = OpenAIResponsesModel('gpt-5-nano-2025-08-07')

# Anthropic
os.environ['ANTHROPIC_API_KEY'] = 'YOUR_ANTHROPIC_API_KEY'
models['claude'] = AnthropicModel('claude-3-5-haiku-20241022')

# Google
os.environ['GOOGLE_API_KEY'] = 'YOUR_GOOGLE_API_KEY'
# We use `gemini-2.5-flash` because `gemini-2.5-flash-lite` did not reliably assign the plus/minus signs in our example
models['gemini'] = GoogleModel('gemini-2.5-flash')

In [None]:
# Choose a model
model = models['gpt']

In [None]:
# Prompts

system_prompt = 'You are a specialist in structured information extraction from German election “voter flow” graphics.'

prompt = '''Analyze the provided image and extract data from it as described below. The graphic shows the voter flow between a party that is the focus and other parties.

The graphic has different sections. At the top, there is a headline, at the bottom, there are the source and the timestamp of the data. In between is the actual drawing area of the graphic, where the voter flow is depicted. This area is structured as follows:

1. Horizontally centered you find the focus party.
2. If there are source parties from which voters have moved to the focus party, these source parties are listed on the left side one below the other. To the right of them, slightly offset downward, is the number of voters who have moved from the source party to the focus party. However, it may also be the case that there are no source parties, in which case the left part of the drawing area is empty.
3. If there are recipient parties to which voters have moved from the focus party, these recipient parties are listed on the right side one below the other. To the left of them, slightly offset downward and left-aligned with the focus party, is the number of voters who have moved from the focus party to the recipient party. However, it may also be the case that there are no recipient parties, in which case the right part of the drawing area is empty.

The number of voters who have moved between two parties is always given as integers. They are displayed as strings in the graphic and use dots as thousands separators.

If it is an inflow to the focus party, the numerical value is to be considered positive. If it is an outflow from the focus party, the numerical value is to be considered negative, so you must prepend a minus (`-`) to the numerical value.

Convert the information in the graphic into the specified data structure.
'''

In [None]:
# Pydantic models to set the data structure
# (VoterFlows is our requested output type. Its field `relations` is a list of Relation objects, which are defined in their own class.)

class Relation(BaseModel):
    party: str = Field(description='The name of the other party whose relationship with the focus party is being described.')
    value: int = Field(
        description=(
            'Number of voters who switched between the focus party and the other party. '
            'Inflows (on the left side of the graphic) from the other party to the focus party must be positive. '
            'Outflows (on the right side of the graphic) from the focus party to the other party must be negative, so you must prepend a minus (`-`) to the numerical value. '
            'In the graphic, dots within numbers are thousands separators. Remove them, since we need integers. There are no decimal separators.'
        )
    )

class VoterFlows(BaseModel):
    party: str = Field(description='The name of the party that is the focus of the analysis.')
    timestamp: datetime = Field(description='The timestamp of the data, as indicated at the bottom of the graphic. (In the graphic it has the German date format with day first: `Stand: %d.%m.%Y, %H:%M Uhr`. You convert it into a regular datetime object.)')
    relations: list[Relation] = Field(description='List of voter flows to and from other parties.')
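
The timestamp format named in the field description can also be parsed deterministically, which may help when checking what the model returns. The following is a minimal sketch using only the standard library. `parse_stand` is a hypothetical helper that is not part of the notebook, and the sample string is made up for illustration.

```python
from datetime import datetime

def parse_stand(text: str) -> datetime:
    """Parse a label such as 'Stand: 24.02.2025, 10:15 Uhr' into a naive datetime."""
    # 'Stand: ' and ' Uhr' are matched literally; the rest follows the day-first German format
    return datetime.strptime(text.strip(), 'Stand: %d.%m.%Y, %H:%M Uhr')

# Illustrative value, not read from a real chart
print(parse_stand('Stand: 24.02.2025, 10:15 Uhr'))
```

The result carries no time zone information. If that matters, attach one explicitly after parsing.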

In [None]:
agent = Agent(
    model = model,
    output_type = VoterFlows,
    system_prompt = system_prompt,
)

def extract_voter_flows_from_image(image_url: str) -> dict:
    """Send the prompt and the chart image to the agent and return the validated result as a dict."""
    result = agent.run_sync([prompt, ImageUrl(url=image_url)])
    return result.output.model_dump()

In [None]:
url = 'https://www.tagesschau.de/wahl/archiv/2025-02-23-BT-DE/charts/analyse-wanderung/chart_1873611.jpg'
data = extract_voter_flows_from_image(url)
data
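
To process several charts, the call above could be wrapped in a loop that keeps going when a single image fails. This is only a sketch. `extract_many` and `fake_extract` are hypothetical names, and the example URLs are placeholders rather than real chart addresses. In the notebook, you would pass `extract_voter_flows_from_image` as the `extract` argument.

```python
from typing import Callable

def extract_many(urls: list, extract: Callable[[str], dict]) -> tuple:
    """Run `extract` on every URL and return (results, errors), both keyed by URL."""
    results, errors = {}, {}
    for url in urls:
        try:
            results[url] = extract(url)
        except Exception as exc:
            # Record the failure and continue with the next chart
            errors[url] = repr(exc)
    return results, errors

# Stand-in extractor so the sketch runs without an API key
def fake_extract(url: str) -> dict:
    if 'broken' in url:
        raise ValueError('could not read chart')
    return {'party': 'Example', 'timestamp': None, 'relations': []}

results, errors = extract_many(
    ['https://example.org/chart_a.jpg', 'https://example.org/broken.jpg'],
    fake_extract,
)
print(len(results), len(errors))
```

Collecting errors separately makes it easy to retry only the failed charts, possibly with a different model.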

**Possible next steps:**

- Validation
- Providing a tool for cleaning numbers (removing thousands separators, setting the sign +/-)
- Normalizing party names
- Looping over all parties
- Cross-checks (do the numbers from several charts match up?)
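
Two of these steps can be sketched with the standard library alone: cleaning the number labels and cross-checking charts against each other. The helpers `parse_flow` and `find_mismatches` are hypothetical and not part of the notebook. The cross-check assumes that a flow reported as `+n` in one party's chart should appear as `-n` in the counterpart chart, and that party names are already spelled consistently. The sample data is made up.

```python
import re

# Labels look like '1.230.000' (dots as thousands separators) or plain digits such as '500'
LABEL_PATTERN = re.compile(r'\d{1,3}(?:\.\d{3})*|\d+')

def parse_flow(label: str, inflow: bool) -> int:
    """Turn a chart label into a signed integer: positive for inflows, negative for outflows."""
    text = label.strip()
    if not LABEL_PATTERN.fullmatch(text):
        raise ValueError(f'Unexpected number format: {label!r}')
    value = int(text.replace('.', ''))
    return value if inflow else -value

def find_mismatches(charts: list) -> list:
    """Return (party_a, party_b, value_in_a, value_in_b) for chart pairs that disagree."""
    lookup = {
        (chart['party'], rel['party']): rel['value']
        for chart in charts
        for rel in chart['relations']
    }
    mismatches = []
    for (focus, other), value in lookup.items():
        mirrored = lookup.get((other, focus))
        # Skip pairs without a counterpart chart, and visit each pair only once
        if mirrored is None or focus > other:
            continue
        if mirrored != -value:
            mismatches.append((focus, other, value, mirrored))
    return mismatches

# Made-up sample data in the shape returned by `extract_voter_flows_from_image`
charts = [
    {'party': 'A', 'relations': [
        {'party': 'B', 'value': parse_flow('1.200.000', inflow=True)},
        {'party': 'C', 'value': parse_flow('50.000', inflow=False)},
    ]},
    {'party': 'B', 'relations': [{'party': 'A', 'value': parse_flow('1.200.000', inflow=False)}]},
    {'party': 'C', 'relations': [{'party': 'A', 'value': parse_flow('40.000', inflow=True)}]},
]
print(find_mismatches(charts))
```

Parsing the labels in code rather than in the prompt would let the model return the raw strings, so sign and separator handling no longer depend on the model.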