# How well can Multimodal LLMs interpret complex financial data? Let’s find out!

▶ I recently ran an experiment with 5 multimodal LLMs of different sizes to see how they handle interpreting a complex financial chart. The models tested were: **Llama-3.2-11B-Vision, Pixtral-12B, Qwen 2 VL 2B**, and the heavyweights: **Claude 3.5 Sonnet and GPT-4o**.

▶ The chart I used was taken from JP Morgan's 2022 report. It features multiple data types and visual elements like bar and line graphs, which makes it a real test of the models' ability to process intricate financial information.

▶ Why does this matter? In finance, being able to accurately interpret visuals and numerical data is critical. I wanted to assess:

*   How well these models handle mixed data formats.
*   Whether they can interpret complex financial values.
*   How feasible it is to use them in real-world financial analysis, despite varying model sizes and architectures.

▶ Even though these models differ in size and complexity, the ultimate goal was to determine their potential when working with numbers and visual data. Some fascinating insights emerged!

▶ How did I run them?

*   I ran a quantized version of Llama-3.2-11B-Vision on Google Colab with a GPU (Runtime ==> Change Runtime Type ==> GPU).

*   I used the Mistral AI API for Pixtral-12B (Mistral AI key + subscription to the free usage tier).

*   I used Hugging Face for Qwen2-VL.

*   I used the OpenAI and Anthropic APIs for GPT-4o and Claude 3.5 Sonnet.


▶ Key Takeaways:

*   The largest models excel at extracting the correct numbers from this complex chart. However, they sometimes miss part of the expected values (for example, they omit the first table, or they don't capture all the years shown in the chart).

*   The smallest models capture all the metrics included in the chart, but they do not yet extract accurate numbers. I believe they are good at describing images, but not yet at reading exact values.
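The two failure modes above (values that are missing versus values that are wrong) can be measured separately. Here is a minimal sketch of how one might score a model's output against hand-checked reference values. The function name `score_extraction`, the `(metric, year)` key layout, and the 1% tolerance are my own assumptions, and the numbers are placeholders, not values read from the chart.

```python
import math

def score_extraction(extracted, reference, rel_tol=0.01):
    """Compare extracted values with hand-checked reference values.

    Both arguments map (metric, year) -> number. Returns the keys the model
    skipped, the keys it got wrong, and the share of reference values that
    were extracted within the relative tolerance.
    """
    missing = [key for key in reference if key not in extracted]
    wrong = [
        key for key in reference
        if key in extracted
        and not math.isclose(extracted[key], reference[key], rel_tol=rel_tol)
    ]
    correct = len(reference) - len(missing) - len(wrong)
    accuracy = correct / len(reference) if reference else 0.0
    return {"missing": missing, "wrong": wrong, "accuracy": accuracy}


# Placeholder numbers for illustration only; they are NOT taken from the chart.
reference = {("metric_a", 2020): 10.0, ("metric_a", 2021): 12.5, ("metric_b", 2021): 3.0}
extracted = {("metric_a", 2020): 10.0, ("metric_a", 2021): 21.5}

report = score_extraction(extracted, reference)
print(report)
```

Reporting "missing" and "wrong" separately keeps the distinction between a model that skips the first table and a model that reads every metric but misreads the numbers.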

Here is the image from which we want multimodal LLMs to extract numbers:

In [None]:
# path_img = local_path
from IPython.display import Image
Image(path_img)

# Install Libs

In [None]:
# !pip install anthropic -q
# !pip install openai -q

# Specify Keys

In [None]:
from google.colab import userdata
CLAUDE_API_KEY = userdata.get('CLAUDE_API_KEY')
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
MISTRAL_API_KEY = userdata.get('MISTRAL_API_KEY')

import openai
openai.api_key = OPENAI_API_KEY

import anthropic

# Load Image

In [None]:
import io
import base64
from PIL import Image

# Convert the PNG image(s) to base64-encoded strings (a single example image here)
images = [Image.open(path_img)]

base64_encoded_pngs = []
max_size = (1024, 1024)
for image in images:
    # Downscale in place if the image exceeds the maximum size (keeps the aspect ratio)
    if image.size[0] > max_size[0] or image.size[1] > max_size[1]:
        image.thumbnail(max_size, Image.Resampling.LANCZOS)
    image_data = io.BytesIO()
    # PNG is lossless; `quality` applies to lossy formats such as JPEG, so it is not passed
    image.save(image_data, format='PNG', optimize=True)
    base64_encoded = base64.b64encode(image_data.getvalue()).decode('utf-8')
    base64_encoded_pngs.append(base64_encoded)
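The API calls later in this notebook pass the image as a data URL, and the MIME type declared in that URL should match the encoded bytes. As a sanity check, here is a small sketch that guesses the type from the first bytes of the payload. The helper names `sniff_mime_type` and `to_data_url` are hypothetical, and only PNG and JPEG signatures are covered.

```python
import base64

# Leading bytes ("magic numbers") that identify common image formats.
MAGIC_BYTES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}

def sniff_mime_type(b64_string):
    """Guess the MIME type of a base64-encoded image from its first bytes."""
    # 16 base64 characters decode to 12 bytes, enough for both signatures.
    head = base64.b64decode(b64_string[:16])
    for magic, mime in MAGIC_BYTES.items():
        if head.startswith(magic):
            return mime
    raise ValueError("Unrecognized image format")

def to_data_url(b64_string):
    """Build a data URL whose declared MIME type matches the payload."""
    return f"data:{sniff_mime_type(b64_string)};base64,{b64_string}"

# Synthetic payload: a PNG signature followed by filler bytes, not a real image.
fake_png = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8).decode("utf-8")
print(to_data_url(fake_png)[:22])
```

In the notebook, the same check could be run on `base64_encoded_pngs[0]` before building the request content.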

# GPT-4o

In [None]:
from openai import OpenAI

client_openai = OpenAI(api_key=OPENAI_API_KEY)
MODEL_NAME_GPT = "gpt-4o-mini"

def get_completion_gpt4o(messages, model_name):
    response = client_openai.chat.completions.create(
        model=model_name,
        # max_tokens=2048,
        temperature=0,
        messages=messages
    )
    print(response.model)
    return response.choices[0].message.content

def append_message(content, question):
    content.append({"type": "text", "text": question})
    messages = [
      {
          "role": 'user',
          "content": content
      }
    ]
    return messages

## Prompt 1: raw data without explicit format

🔽 ⬇ : It did not extract the first table: 🔽 ⬇

The data it did extract is correct

In [None]:
content = [{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_png}"}} for encoded_png in base64_encoded_pngs]
question = "Extract raw data from the image."
messages_gpt = append_message(content, question)

In [None]:
%%time
MODEL_NAME_GPT = "gpt-4o"
print(get_completion_gpt4o(messages_gpt, MODEL_NAME_GPT))

## Prompt: Json format

🔽 ⬇ All data are extracted :🔽 ⬇

✅ Numbers coming from the chart are **GOOD** ✅

In [None]:
content = [{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_png}"}} for encoded_png in base64_encoded_pngs]  # the payload is PNG, so the MIME type is image/png
question = "Extract ALL raw data from the image in a json format."
messages_gpt = append_message(content, question)
# MODEL_NAME_GPT = "gpt-4o"
print(get_completion_gpt4o(messages_gpt, MODEL_NAME_GPT))
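To check a JSON reply programmatically rather than by eye, the text has to be parsed first, and models often wrap JSON in a markdown code fence or add a sentence around it. Here is a minimal sketch of a tolerant parser. The name `parse_json_reply` is hypothetical, the sample reply is made up, and the approach (take the fenced block if any, then the outermost braces) is just one simple option.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker, built this way to keep this block readable

def parse_json_reply(reply):
    """Return the first JSON object found in a model reply, or None."""
    # Prefer the contents of a fenced block (optionally tagged "json") when present.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, reply, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Fall back to the outermost pair of braces in the candidate text.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

# Made-up reply for illustration; the key and value are placeholders.
reply = "Here is the data:\n" + FENCE + 'json\n{"2021": {"placeholder_metric": 1.5}}\n' + FENCE
print(parse_json_reply(reply))
```

Once parsed, the nested dictionary can be compared with reference values key by key instead of reading the printed output manually.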

One value from the first table is not correct:

"2021": ==> "net_income": 33.1,

## Prompt: Markdown format

In [None]:
content = [{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_png}"}} for encoded_png in base64_encoded_pngs]  # the payload is PNG, so the MIME type is image/png
question = "Extract raw data from the image in a markdown format when it's possible."
messages_gpt = append_message(content, question)

In [None]:
%%time
MODEL_NAME_GPT = "gpt-4o"
print(get_completion_gpt4o(messages_gpt, MODEL_NAME_GPT))
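A markdown reply is easy to read but harder to compare across models. Here is a rough sketch that turns the first pipe-delimited table of a reply into a list of row dictionaries. The function name `parse_markdown_table` and the sample table are my own placeholders, and the parser assumes a simple table where every row starts with `|`.

```python
def parse_markdown_table(text):
    """Parse the first pipe-delimited markdown table in text into row dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if header is not None:
                break  # the first table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        elif all(set(cell) <= set("-: ") for cell in cells):
            continue  # separator row such as |---|:---:|
        else:
            rows.append(dict(zip(header, cells)))
    return rows

# Made-up reply for illustration; the values are placeholders, not chart data.
sample_reply = """Some intro text.

| Year | Placeholder metric |
|------|--------------------|
| 2020 | 1.0 |
| 2021 | 2.0 |

Trailing note."""

table = parse_markdown_table(sample_reply)
print(table)
```

The cell values stay as strings, so they would still need converting (for example with `float`) before any numeric comparison.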

# GPT-4o-mini

Not good: it extracted some of the data (from the chart), but the values are inaccurate

In [None]:
%%time
MODEL_NAME_GPT = "gpt-4o-mini"
print(get_completion_gpt4o(messages_gpt, MODEL_NAME_GPT))

In [None]:
%%time
#I asked for markdown format
MODEL_NAME_GPT = "gpt-4o-mini"
print(get_completion_gpt4o(messages_gpt, MODEL_NAME_GPT))

# Claude Sonnet 3.5

**Very Good results!**

In [None]:
client_claude = anthropic.Anthropic(
    api_key=CLAUDE_API_KEY,
)

MODEL_NAME = "claude-3-5-sonnet-20240620"
def get_completion_claude(messages):
    response = client_claude.messages.create(
        model=MODEL_NAME,
        max_tokens=2048,
        temperature=0,
        messages=messages
    )
    print(response.model)
    return response.content[0].text

## Prompt 1

In [None]:
content = [{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": encoded_png}} for encoded_png in base64_encoded_pngs]
question = "Extract raw data information from the images."
messages_claude = append_message(content, question)

In [None]:
%%time
MODEL_NAME = "claude-3-5-sonnet-20240620"
chart_analysis = get_completion_claude(messages_claude)
print(chart_analysis)

In [None]:
content = [{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": encoded_png}} for encoded_png in base64_encoded_pngs]
question = "Extract raw data information from the images."
messages_claude = append_message(content, question)

MODEL_NAME = "claude-3-5-sonnet-20240620"
chart_analysis = get_completion_claude(messages_claude)
print(chart_analysis)

## Prompt 2 and 3

🔽 ⬇ : In the following experiments, I asked Claude 3.5 Sonnet for markdown format:

1. It forgot about the first table ==> I only got the chart data
==> It also stated that some data were deliberately left out for "simplicity"

2. Then I modified the prompt to explicitly ask it to gather **ALL** raw data from the image ==> It then succeeded in gathering all the numbers


In [None]:
content = [{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": encoded_png}} for encoded_png in base64_encoded_pngs]
question = "Extract raw data information from the images in markdown format."
messages_claude = append_message(content, question)

In [None]:
%%time
chart_analysis = get_completion_claude(messages_claude)
print(chart_analysis)

🔽 ⬇ "Some data are not included in this table for simplicity."

In [None]:
content = [{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": encoded_png}} for encoded_png in base64_encoded_pngs]
question = "Extract raw data from the images in a markdown format when it's possible."
messages_claude = append_message(content, question)
chart_analysis = get_completion_claude(messages_claude)
print(chart_analysis)

## Prompt 4

Prompt with "ALL" explicitly set:

In [None]:
content = [{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": encoded_png}} for encoded_png in base64_encoded_pngs]
question = "Extract ALL raw data from the images in a markdown format when it's possible."
messages_claude = append_message(content, question)
chart_analysis = get_completion_claude(messages_claude)
print(chart_analysis)

## Prompt 5

In [None]:
%%time
#"Describe the image" prompt
chart_analysis = get_completion_claude(messages_claude)
print(chart_analysis)

# Llama 3.2 11B - Vision

https://huggingface.co/meta-llama/Llama-3.2-11B-Vision



## Without Quantization

In [None]:
# Install this to be able to use : MllamaForConditionalGeneration ==> A simple pip install does not work
!pip install git+https://github.com/huggingface/transformers

In [None]:
!pip install --upgrade transformers -q

In [None]:
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision"

In [None]:
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, #auto
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
#processor.apply_chat_template  ==> does not work ==> it doesn't have chat_template

In [None]:
from PIL import Image
image = Image.open(path_img)

# prompt = "<|image|><|begin_of_text|>Extract raw data information from the image."

# <|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>
# <|image|>Extract raw data information from the image:<|eot_id|><|start_header_id|>assistant<|end_header_id|>

prompt="""
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>
<|image|>Extract raw data information from the image:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

In [None]:
%%time
output = model.generate(**inputs)
print(processor.decode(output[0]))

Results: **I'm not able to provide that information.**

In [None]:
%%time
# Inference time is very long ==> I stopped the run because it was not practical
output = model.generate(**inputs, max_new_tokens = 2048)
print(processor.decode(output[0]))

## With Quantization

The results are not good

In [None]:
!pip install git+https://github.com/huggingface/transformers
# accelerate bitsandbytes huggingface_hub

In [None]:
from transformers import BitsAndBytesConfig
from transformers import MllamaForConditionalGeneration, AutoProcessor
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "meta-llama/Llama-3.2-11B-Vision"

model_qtz = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config
)
processor = AutoProcessor.from_pretrained(model_id)

In [None]:
from PIL import Image
image = Image.open(path_img)

prompt="""
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>
<|image|>Extract raw data information from the image:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
inputs = processor(image, prompt, return_tensors="pt").to(model_qtz.device)

In [None]:
%%time
output = model_qtz.generate(**inputs, max_new_tokens = 2048)
print(processor.decode(output[0]))

# Llama 3.2 11B Vision -Instruct

## With Quantization to 4-bit

It extracted redundant information, and the numbers are inaccurate

In [None]:
!pip install git+https://github.com/huggingface/transformers bitsandbytes

In [None]:
%%time
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config
)
processor = AutoProcessor.from_pretrained(model_id)

BitsAndBytesConfig configures quantization when loading transformer models: the weights are stored in reduced precision, such as 8-bit or 4-bit types, instead of the standard 32-bit floating-point numbers, which makes the model cheaper to load and run.

In the configuration above, the weights are stored in 4 bits (`load_in_4bit=True`) while computations run in bfloat16 (`bnb_4bit_compute_dtype=torch.bfloat16`).


NF4 stands for "4-bit NormalFloat", a 4-bit data type designed for normally distributed weights.
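To give a feel for what the reduced precision buys, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes a nominal 11 billion parameters (taken from the model name) and ignores activations, the KV cache, and the small overhead of quantization constants, so real usage is higher.

```python
# Rough weight-memory estimate for an 11B-parameter model at different precisions.
# Only the weights are counted, so these numbers are a lower bound.
N_PARAMS = 11e9  # nominal size taken from the model name, an approximation

BITS_PER_PARAM = {"float32": 32, "bfloat16": 16, "int8": 8, "nf4": 4}

def weight_memory_gb(n_params, bits):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>8}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Going from bfloat16 to 4-bit cuts the weight footprint by a factor of four, which is the main reason the quantized runs are practical on a single GPU.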

In [None]:
%%time
from PIL import Image
image = Image.open(path_img)

# prompt = "<|image|><|begin_of_text|>Extract raw data information from the image."
# inputs = processor(image, prompt, return_tensors="pt").to(model.device)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract raw data information from the image:"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=2048)
print(processor.decode(output[0]))

# Qwen2-VL-2B-Instruct

https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct



It extracted numbers from the chart and identified the various metrics, but many of the numbers are inaccurate. There is a discrepancy between the bar values (Net Income), which are extracted fairly accurately, and the line values (EPS and ROTCE), which are incorrect for nearly all years.
It also did not extract the first table.


In [None]:
!pip install qwen_vl_utils -q

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

In [None]:
%%time
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": path_img,
            },
            {"type": "text", "text": "Extract raw data information from the image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

# Pixtral


https://docs.mistral.ai/capabilities/vision/

https://huggingface.co/mistralai/Pixtral-12B-2409

## MistralAI package

The model extracted information from the table and the charts, and it lists the different metrics included in the image.

The table values are mostly correct. However, while some of the values in the charts are accurate, the majority are incorrect.

I believe we can use it for image description, but when it comes to extracting numbers, we should rely on larger models.



In [None]:
%%time
import base64
from mistralai import Mistral

base64_image = base64_encoded_pngs[0]

# The API key was loaded from Colab's userdata in the "Specify Keys" cell
api_key = MISTRAL_API_KEY
model = "pixtral-12b-2409"
client = Mistral(api_key=api_key)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Extract raw data information from the charts."
            },
            {
                "type": "image_url",
                "image_url": f"data:image/png;base64,{base64_image}"
            }
        ]
    }
]

# Get the chat response
chat_response = client.chat.complete(
    model=model,
    messages=messages
)

print(chat_response.choices[0].message.content)