Text-based LLMs tend to struggle with analysing and understanding complex structures such as tables, plots and images included in scientific articles. Since in chemistry and materials science, in particular, much of the information about chemical components is contained in exactly these structures, they call for a different approach. Vision language models (VLMs) are one option, since they can analyse images alongside text. Several open and closed-source VLMs are available, e.g. [Vision models from OpenAI](https://platform.openai.com/docs/guides/vision), [Claude models](https://docs.anthropic.com/en/docs/vision) and [DeepSeek-VL](https://github.com/deepseek-ai/DeepSeek-VL). As an example, the extraction of information from page images with [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) is shown:

First, one has to convert the file into images. As an example, a PDF file obtained in [Section 1](../obtaining_data/data_mining.ipynb) is converted into one image per page.

In [1]:
from pdf2image import convert_from_path

file_path = '../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf'

# convert each page of the PDF file into an image
pdf_images = convert_from_path(file_path)

After that, one should preprocess the obtained images, for instance rotate pages with vertical text, since many models have problems with this. Next, one has to convert the images into Base64-encoded strings, which can be embedded directly in the API request.

In [2]:
from pytesseract import Output
import pytesseract
import imutils
import cv2
import os
import base64
from PIL import Image
import numpy as np

# process the images into a unified format that is better suited for a VLM
def process_image(image, max_size, output_folder, file_path, i):
    width, height = image.size
    resized_image = resize_image(image, max_size)
    rotate_image = correct_text_orientation(resized_image, output_folder, file_path, i)
    jpeg_image = convert_to_jpeg(rotate_image)
    base64_encoded_image = base64.b64encode(jpeg_image).decode("utf-8")
    return (
        base64_encoded_image,
        max(width, height), 
    )

# most VLMs struggle with rotated text; therefore, rotated text is detected and the page is rotated upright
def correct_text_orientation(image, save_directory, file_path, i):
    if isinstance(image, Image.Image):
        image = pil_to_cv2(image)

    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = pytesseract.image_to_osd(rgb, output_type=Output.DICT)

    rotated = imutils.rotate_bound(image, angle=results["rotate"])

    base_filename = os.path.basename(file_path)
    name_without_ext, _ = os.path.splitext(base_filename)
    new_filename = os.path.join(
        save_directory, f"corrected_{name_without_ext}_page{i+1}.png"
    )

    cv2.imwrite(new_filename, rotated)
    print(f"[INFO] {file_path} - corrected image saved as {new_filename}")
    return rotated

# the images get converted into JPEG format
def convert_to_jpeg(cv2_image):
    retval, buffer = cv2.imencode(".jpg", cv2_image)
    if not retval:
        # fail loudly instead of silently returning None
        raise ValueError("JPEG encoding failed")
    return buffer

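# conversion of the images from a Pillow (PIL) object to an OpenCV (BGR) array
def pil_to_cv2(image):
    # resize_image returns a grayscale ("L") image; convert it to RGB first,
    # because cv2.cvtColor with COLOR_RGB2BGR expects a 3-channel input
    np_image = np.array(image.convert("RGB"))
    cv2_image = cv2.cvtColor(np_image, cv2.COLOR_RGB2BGR)
    return cv2_image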

# the images get resized so that the longer side does not exceed a maximum dimension
def resize_image(image, max_dimension):
    width, height = image.size

    # Check if the image has a palette and convert it to true color mode
    if image.mode == "P":
        if "transparency" in image.info:
            image = image.convert("RGBA")
        else:
            image = image.convert("RGB")
    # convert to grayscale
    image = image.convert("L")

    if width > max_dimension or height > max_dimension:
        if width > height:
            new_width = max_dimension
            new_height = int(height * (max_dimension / width))
        else:
            new_height = max_dimension
            new_width = int(width * (max_dimension / height))
        image = image.resize((new_width, new_height), Image.LANCZOS)

    return image


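output_folder_images = './images'
# create the output folder if needed, since cv2.imwrite does not create missing folders
os.makedirs(output_folder_images, exist_ok=True)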

# all images get preprocessed
images_base64 = [process_image(image, 2048, output_folder_images, file_path, j)[0] for j, image in enumerate(pdf_images)]


[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page1.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page2.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page3.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page4.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page5.png
[INFO] ../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf - corrected image saved as ./images/corrected_10.26434_chemrxiv-2024-1l0sn_page6.png
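
The resizing rule used in `resize_image` can be checked without any image library. The following is a minimal sketch, assuming integer arithmetic instead of the float scaling used above, so results may differ by one pixel due to rounding. The helper name `scaled_size` is hypothetical and not part of the notebook.

```python
def scaled_size(width, height, max_dimension):
    """Return (width, height) scaled so the longer side is at most max_dimension."""
    # images that already fit are left untouched
    if width <= max_dimension and height <= max_dimension:
        return width, height
    if width > height:
        # landscape: the width becomes the limiting side
        return max_dimension, height * max_dimension // width
    # portrait or square: the height becomes the limiting side
    return width * max_dimension // height, max_dimension


# an A4 page rendered at 200 dpi is roughly 1654 x 2339 pixels
print(scaled_size(1654, 2339, 2048))
# a small image stays as it is
print(scaled_size(800, 600, 2048))
```

The aspect ratio is preserved in both branches, so text on the page is not distorted before it is passed to the model.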


As a next step, one can call the OpenAI API. For this, one needs an API key, since the calls are billed. Moreover, one needs to assemble the composite prompt from the images and the text prompt.

In [3]:
# the text prompt for the model call gets defined
prompt_text = 'Extract all the relevant information about Buchwald-Hartwig reactions included in these images.'

# the composite prompt is put together 
def get_prompt_vision_model(images_base64, prompt_text):
    content = []
    # the images get added in base64 format and in the end the text prompt will be added
    for data in images_base64:
        content.append(create_image_content(data))

    content.append({"type": "text", "text": prompt_text})
    return content

# each base64-encoded image gets wrapped into an image entry of the prompt
def create_image_content(image, detail="high"):
    return {
        "type": "image_url",
        # the level of detail is set to 'high' since the text in the images is mostly small
        "image_url": {"url": f"data:image/jpeg;base64,{image}", "detail": detail},
    }

# the composite prompt for the model call gets defined 
prompt = get_prompt_vision_model(images_base64, prompt_text)
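
To inspect the structure of the composite prompt without rendering a PDF, it can be reproduced with a dummy payload. This is a minimal sketch under the assumption that a few placeholder bytes stand in for a real JPEG page; `build_content` and `fake_pages` are hypothetical names that mirror `get_prompt_vision_model` above.

```python
import base64


def build_content(images_base64, prompt_text, detail="high"):
    """Build the user-message content: one image entry per page, then the text."""
    content = [
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{data}", "detail": detail},
        }
        for data in images_base64
    ]
    content.append({"type": "text", "text": prompt_text})
    return content


# placeholder bytes instead of real JPEG data (assumption for illustration only)
fake_pages = [b"page-1-bytes", b"page-2-bytes"]
encoded = [base64.b64encode(page).decode("utf-8") for page in fake_pages]

content = build_content(encoded, "Extract all the relevant information.")
print([entry["type"] for entry in content])

# the base64 string inside the data URL decodes back to the original bytes
payload = content[0]["image_url"]["url"].split(",", 1)[1]
print(base64.b64decode(payload) == fake_pages[0])
```

The text entry comes last, so the model receives all pages before the instruction that refers to them.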

In [5]:
from openai import OpenAI
from dotenv import load_dotenv

# the OpenAI API gets called; the temperature is set to 0 to keep the output as deterministic as possible
# the gpt-4o model is used since, at the time of writing, this is the cheapest and fastest OpenAI vision model
def call_openai(
    prompt, model="gpt-4o", temperature: float = 0.0, **kwargs
):
    """Call chat openai model

    Args:
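        prompt (list): Content of the user message (image entries followed by the text prompt)
        model (str, optional): Name of the model. Defaults to "gpt-4o".
        temperature (float, optional): Inference temperature. Defaults to 0.

    Returns:
        tuple: message content (JSON string), number of input tokens, number of output tokens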
    """
    client = OpenAI()
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a scientific assistant, extracting important information about polymerization conditions "
                "out of pdfs in valid json format. Extract just data which you are 100% confident about the "
                "accuracy. Keep the entries short without details. Be careful with numbers.",
            },
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        response_format={"type": "json_object"},
        **kwargs,
    )
    # the input and output tokens are reported in order to track the costs of the API calls
    input_tokens = completion.usage.prompt_tokens
    output_token = completion.usage.completion_tokens
    # the output of the model call is saved
    message_content = completion.choices[0].message.content
    return message_content, input_tokens, output_token

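# the OpenAI API key is loaded from the .env file into the environment;
# OpenAI() inside call_openai reads OPENAI_API_KEY from there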
dotenv_path = '../OPENAI_KEY.env'
load_dotenv(dotenv_path)
api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

# the OpenAI API is called, and the output and the number of tokens used are printed
output, input_token, output_token = call_openai(prompt=prompt)
print('Output: ', output)
print('Input-token used:', input_token, ' Output_token used: ', output_token)

Output:  {
  "Buchwald-Hartwig Reaction Conditions": [
    {
      "Reaction Type": "Cross-coupling",
      "Key Step": "Reaction of an α-amino-BODIPY and the respective halide",
      "Catalyst": "Pd(OAc)2",
      "Ligand": "(±)-BINAP",
      "Base": "Cs2CO3",
      "Solvent": "PhMe",
      "Temperature": "80 °C",
      "Time": "1.5 h",
      "Yield": "Up to 68% for unsymmetric BODIPYs"
    },
    {
      "Monomer": "α-chloro- and α-amino-BODIPYs",
      "Catalyst": "Pd(OAc)2",
      "Ligand": "(±)-BINAP",
      "Base": "Cs2CO3",
      "Solvent": "PhMe",
      "Temperature": "80 °C",
      "Yield": "Up to 68% for unsymmetric BODIPYs"
    },
    {
      "Monomer": "Br-Ar-mono-NH2",
      "Catalyst": "Pd(OAc)2",
      "Ligand": "(±)-BINAP",
      "Base": "Cs2CO3",
      "Solvent": "PhMe",
      "Temperature": "80 °C",
      "Time": "1.5 h",
      "Yield": "Up to 68% for unsymmetric BODIPYs"
    }
  ],
  "Additional Notes": [
    {
      "Note": "The reaction showed a trend of improvemen

Now one could use this structured output to build a database of Buchwald-Hartwig coupling reactions.
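
As a minimal sketch of this last step, the JSON string can be parsed and written to a small SQLite table. The sample string below is a shortened copy of two entries from the output shown above; the table name `reactions`, the column selection and the helper `store_reactions` are assumptions made for illustration. Keys that are missing in an entry (such as `Time`) are stored as `NULL`.

```python
import json
import sqlite3

# shortened sample mirroring the structure of the model output above
sample_output = """
{
  "Buchwald-Hartwig Reaction Conditions": [
    {"Catalyst": "Pd(OAc)2", "Ligand": "(±)-BINAP", "Base": "Cs2CO3",
     "Solvent": "PhMe", "Temperature": "80 °C", "Time": "1.5 h"},
    {"Monomer": "α-chloro- and α-amino-BODIPYs", "Catalyst": "Pd(OAc)2",
     "Ligand": "(±)-BINAP", "Base": "Cs2CO3", "Solvent": "PhMe",
     "Temperature": "80 °C"}
  ]
}
"""

# keys that are copied into the table (hypothetical selection)
COLUMNS = ["Catalyst", "Ligand", "Base", "Solvent", "Temperature", "Time"]


def store_reactions(json_text, connection):
    """Parse the model output and insert one row per reaction entry."""
    data = json.loads(json_text)
    entries = data.get("Buchwald-Hartwig Reaction Conditions", [])
    connection.execute(
        "CREATE TABLE IF NOT EXISTS reactions "
        "(catalyst TEXT, ligand TEXT, base TEXT, solvent TEXT, temperature TEXT, time TEXT)"
    )
    # entry.get returns None for missing keys, which SQLite stores as NULL
    rows = [tuple(entry.get(column) for column in COLUMNS) for entry in entries]
    connection.executemany("INSERT INTO reactions VALUES (?, ?, ?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)


connection = sqlite3.connect(":memory:")  # in-memory database for the demo
print(store_reactions(sample_output, connection))
for row in connection.execute("SELECT catalyst, solvent, time FROM reactions"):
    print(row)
```

Because the keys are chosen by the model, the top-level key and the field names may differ between runs; specifying a fixed schema in the prompt would likely make this step more robust.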