# Multi-modal eval: Image resolution

`Multi-modal slide decks` is a public dataset of question-answer pairs drawn from slide decks with visual content.

The question-answer pairs are derived from the visual content in the decks, testing the ability of RAG to perform visual reasoning.

GPT-4V can answer questions about visual slide content, but [image resolution](https://community.openai.com/t/400-errors-on-gpt-vision-api-since-today/534538/16) involves a trade-off:

* Higher resolution costs more tokens and can make the API flakier (`BadRequestError`)
* Lower resolution reduces cost and errors, but can sacrifice answer quality
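
The three resolutions compared later in this notebook (480x270, 720x405, 960x540) are all 16:9. A slide with a different aspect ratio would be stretched by a fixed-size resize. As a minimal sketch, the hypothetical helper `fit_within` below (not part of the notebook or of any library) computes a target size that stays inside the requested box while keeping the original proportions. It uses only arithmetic, so it can be tried without any image library.

```python
def fit_within(width, height, max_width, max_height):
    """Return (w, h) scaled to fit inside the box while keeping the aspect ratio.

    Note: this sketch also scales small images up to the box; cap the scale
    at 1.0 if upscaling is not wanted.
    """
    scale = min(max_width / width, max_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# A 4:3 slide (1024x768 is an illustrative size) keeps its shape in each 16:9 box.
for target in [(480, 270), (720, 405), (960, 540)]:
    print(target, "->", fit_within(1024, 768, *target))
```

The returned tuple could be passed as the `size` argument of a resize call in place of the fixed tuple, if preserving proportions matters for the deck at hand.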

## Pre-requisites

In [None]:
# %pip install -U langchain langsmith langchain_benchmarks
# %pip install -U openai chromadb pypdfium2 open-clip-torch pillow

In [None]:
import getpass
import os

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
env_vars = ["LANGCHAIN_API_KEY", "OPENAI_API_KEY"]
for var in env_vars:
    if var not in os.environ:
        os.environ[var] = getpass.getpass(prompt=f"Enter your {var}: ")

## Dataset

We can browse the available LangChain benchmark datasets for retrieval.

In [1]:
from langchain_benchmarks import clone_public_dataset, registry

registry = registry.filter(Type="RetrievalTask")
registry

Name,Type,Dataset ID,Description
LangChain Docs Q&A,RetrievalTask,452ccafc-18e1-4314-885b-edd735f17b9d,Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Semi-structured Reports,RetrievalTask,c47d9617-ab99-4d6e-a6e6-92b8daf85a7d,Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Multi-modal slide decks,RetrievalTask,40afc8e7-9d7e-44ed-8971-2cae1eb59731,This public dataset is a work-in-progress and will be extended over time.  Questions and answers based on slide decks containing visual tables and charts. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer.


In [2]:
task = registry["Multi-modal slide decks"]
task

0,1
Name,Multi-modal slide decks
Type,RetrievalTask
Dataset ID,40afc8e7-9d7e-44ed-8971-2cae1eb59731
Description,This public dataset is a work-in-progress and will be extended over time.  Questions and answers based on slide decks containing visual tables and charts. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer.
Retriever Factories,
Architecture Factories,
get_docs,{}


In [3]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Multi-modal slide decks already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74.


In [4]:
from langchain_benchmarks.rag.tasks.multi_modal_slide_decks import get_file_names

file_names = list(get_file_names())  # list of pathlib.Path objects

## Load

For each presentation, extract an image for each slide.

In [8]:
import base64
import io
import os
import uuid
from io import BytesIO
from pathlib import Path

import pypdfium2 as pdfium
from PIL import Image

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.schema.messages import HumanMessage
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI


def get_images(file):
    """
    Render each page of a PDF as a PIL image

    :param file: Path to the PDF file
    :return: A list of PIL images, one per page
    """

    # Get presentation
    pdf = pdfium.PdfDocument(file)
    n_pages = len(pdf)

    # Get images
    pil_images = []
    print(f"Extracting {n_pages} images for {file.name}")
    for page_number in range(n_pages):
        page = pdf.get_page(page_number)
        bitmap = page.render(scale=1, rotation=0, crop=(0, 0, 0, 0))
        pil_image = bitmap.to_pil()
        pil_images.append(pil_image)
    return pil_images


def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string

    :param base64_string: Base64 string
    :param size: Image size
    :return: Re-sized Base64 string
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))

    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)

    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)

    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def convert_to_base64(pil_image, size):
    """
    Convert PIL images to Base64 encoded strings

    :param pil_image: PIL image
    :param size: Tuple w/ img size
    :return: Re-sized Base64 string
    """

    buffered = BytesIO()
    pil_image.save(buffered, format="JPEG")  # You can change the format if needed
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    img_str = resize_base64_image(img_str, size=size)
    return img_str


def image_summarize(img_base64, prompt, llm):
    """
    Make image summary

    :param img_base64: Base64 encoded string for image
    :param prompt: Text prompt for summarization
    :param llm: Model key, either "gpt4v" or "gemini"
    :return: Image summary text

    """
    if llm == "gpt4v":
        chat = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=1024)
    elif llm == "gemini":
        chat = ChatGoogleGenerativeAI(model="gemini-pro-vision")
    else:
        # Fail early instead of hitting a NameError on `chat` below
        raise ValueError(f"Unsupported llm: {llm!r}")

    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content


def generate_img_summaries(img_base64_list, llm):
    """
    Generate summaries for images

    :param img_base64_list: Base64 encoded images
    :param llm: LLM
    :return: List of image summaries and processed images
    """

    # Store image summaries
    image_summaries = []
    processed_images = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a concise summary of the image that is well optimized for retrieval."""

    # Apply summarization to images
    for i, base64_image in enumerate(img_base64_list):
        try:
            image_summaries.append(image_summarize(base64_image, prompt, llm))
            processed_images.append(base64_image)
        except Exception as e:
            print(f"Error with image {i+1}: {e}")

    return image_summaries, processed_images


def create_multi_vector_retriever(vectorstore, image_summaries, images):
    """
    Create retriever that indexes summaries, but returns raw images or texts

    :param vectorstore: Vectorstore to store embedded image summaries
    :param image_summaries: Image summaries
    :param images: Base64 encoded images
    :return: Retriever
    """

    # Initialize the storage layer
    store = InMemoryStore()
    id_key = "doc_id"

    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))

    add_documents(retriever, image_summaries, images)

    return retriever


def prepare_images(docs):
    """
    Prepare images for prompt

    :param docs: A list of base64-encoded images from retriever.
    :return: Dict containing a list of base64-encoded strings.
    """
    b64_images = []
    for doc in docs:
        if isinstance(doc, Document):
            doc = doc.page_content
        b64_images.append(doc)
    return {"images": b64_images}


def img_prompt_func(data_dict, num_images=2):
    """
    GPT-4V prompt for image analysis.

    :param data_dict: A dict with images and a user-provided question.
    :param num_images: Number of images to include in the prompt.
    :return: A list containing message objects for each image and the text prompt.
    """
    messages = []
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"][:num_images]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)
    text_message = {
        "type": "text",
        "text": (
            "You are an analyst tasked with answering questions about visual content.\n"
            "You will be given a set of image(s) from a slide deck / presentation.\n"
            "Use this information to answer the user question. \n"
            f"User-provided question: {data_dict['question']}\n\n"
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]


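def multi_modal_rag_chain(retriever, llm):
    """
    Multi-modal RAG chain

    :param retriever: Retriever that returns base64-encoded images
    :param llm: Model key, either "gpt4v" or "gemini"
    :return: Runnable chain that maps a question string to an answer string
    """

    # Multi-modal LLM
    if llm == "gpt4v":
        model = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=1024)
    elif llm == "gemini":
        model = ChatGoogleGenerativeAI(model="gemini-pro-vision")
    else:
        # Fail early instead of hitting a NameError on `model` below
        raise ValueError(f"Unsupported llm: {llm!r}")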

    # RAG pipeline
    chain = (
        {
            "context": retriever | RunnableLambda(prepare_images),
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(img_prompt_func)
        | model
        | StrOutputParser()
    )

    return chain


# Images
images = []
for fi in file_names:
    images.extend(get_images(fi))

# Experiment configurations
experiments = [
    ((480, 270), "gpt4v-480-270", "gpt4v"),
    ((720, 405), "gpt4v-720-405", "gpt4v"),
    ((960, 540), "gpt4v-960-540", "gpt4v"),
    ((480, 270), "gemini-480-270", "gemini"),
    ((720, 405), "gemini-720-405", "gemini"),
    ((960, 540), "gemini-960-540", "gemini"),
]

stor_chain = {}
for img_resolution, expt, llm in experiments:
    # Base64 strings
    images_base_64 = [convert_to_base64(i, img_resolution) for i in images]

    # Image summaries
    image_summaries, images_base_64_processed = generate_img_summaries(
        images_base_64, llm
    )

    # Vectorstore to index the summaries
    vectorstore = Chroma(collection_name=expt, embedding_function=OpenAIEmbeddings())

    # Create retriever
    retriever = create_multi_vector_retriever(
        vectorstore,
        image_summaries,
        images_base_64_processed,
    )

    stor_chain[expt] = multi_modal_rag_chain(retriever, llm)

Extracting 30 images for DDOG_Q3_earnings_deck.pdf
Error with image 11: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'image_parse_error'}}
Error with image 5: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'image_parse_error'}}
Error with image 25: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'image_parse_error'}}
Error with image 5: Error code: 400 - {'error': {'message':

Gemini produced an empty response.
Unrecognized role: . Treating as a ChatMessage.
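
The 400 responses above state two constraints: the image must be below 20 MB and must be one of `png`, `jpeg`, `gif`, or `webp`. As a rough sketch, a local pre-flight check could test those two stated constraints before a request is sent. The helper names `sniff_image_format` and `check_base64_image` are hypothetical, and the reading of "20 MB" as 20 * 1024 * 1024 bytes is an assumption. Passing this check does not guarantee that the API accepts the image, and it does not explain why the requests in this notebook failed.

```python
import base64

# Assumption: the "20 MB" in the error message is read here as 20 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 20 * 1024 * 1024


def sniff_image_format(data):
    """Guess the image format from its leading magic bytes, or return None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def check_base64_image(b64_string):
    """Return (ok, detail): detail is the format name or the reason for rejection."""
    try:
        data = base64.b64decode(b64_string, validate=True)
    except ValueError:  # binascii.Error is a ValueError subclass
        return False, "not valid base64"
    if len(data) > MAX_IMAGE_BYTES:
        return False, "larger than 20 MB"
    fmt = sniff_image_format(data)
    if fmt is None:
        return False, "unrecognized image format"
    return True, fmt


# Hypothetical stub: only JPEG header bytes, not a decodable picture.
jpeg_stub = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 16).decode("utf-8")
print(check_base64_image(jpeg_stub))
print(check_base64_image("not base64!"))
```

A check like this only inspects the header and the payload size, so a truncated or corrupt image with a valid header would still pass.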


## Eval

Run evaluation on our dataset:

* `task.name` is the name of the dataset of QA pairs that we cloned
* `eval_config` specifies the [LangSmith evaluator](https://docs.smith.langchain.com/evaluation/evaluator-implementations#correctness-qa-evaluation) for our dataset, which will use GPT-4 as a grader
* The grader will evaluate the chain-generated answer to each question relative to ground truth

In [9]:
import uuid

from langchain.smith import RunEvalConfig
from langsmith.client import Client

# Evaluator configuration
client = Client()
eval_config = RunEvalConfig(
    evaluators=["cot_qa"],
)

# Experiments
chain_map = {
    "gpt4v-480-270": stor_chain["gpt4v-480-270"],
    "gpt4v-720-405": stor_chain["gpt4v-720-405"],
    "gpt4v-960-540": stor_chain["gpt4v-960-540"],
    "gemini-480-270": stor_chain["gemini-480-270"],
    "gemini-720-405": stor_chain["gemini-720-405"],
    "gemini-960-540": stor_chain["gemini-960-540"],
}

# Run evaluation
run_id = uuid.uuid4().hex[:4]
test_runs = {}
for project_name, chain in chain_map.items():
    test_runs[project_name] = client.run_on_dataset(
        dataset_name=task.name,
        llm_or_chain_factory=lambda: (lambda x: x["Question"]) | chain,
        evaluation=eval_config,
        verbose=True,
        project_name=f"{project_name}-{run_id}",
        project_metadata={"chain": project_name},
    )

View the evaluation results for project 'gpt4v-480-270-ceff' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74/compare?selectedSessions=0e53beb9-68da-4cf7-86b0-c9277f53246e

View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74
[------------------------------------------------->] 10/10

Unnamed: 0,output,feedback.COT Contextual Accuracy,error,execution_time,run_id
count,10,10.0,0.0,10.0,10
unique,10,,0.0,,10
top,As of the latest information provided in the i...,,,,972df170-09dc-4900-9d76-b197d0a4e3b5
freq,1,,,,1
mean,,0.5,,10.055602,
std,,0.527046,,3.083101,
min,,0.0,,6.434943,
25%,,0.0,,7.127925,
50%,,0.5,,10.379362,
75%,,1.0,,12.739249,


View the evaluation results for project 'gpt4v-720-405-ceff' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74/compare?selectedSessions=5b810cb3-25f9-45c1-91f6-8807e44e1e42

View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74
[>                                                 ] 0/10

Chain failed for example 4529633d-b9a8-4565-83d6-f65b1ff9dd80 with inputs {'Question': 'What is the projected TAM for observability expected for each year through 2026?'}
Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'image_parse_error'}}


[---->                                             ] 1/10

Chain failed for example 16d929e5-f2ef-43ea-a408-9192b00bd8e2 with inputs {'Question': "What is the projected cloud spend in $B's in 2026E?"}
Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'image_parse_error'}}


[--------------------------------------->          ] 8/10

Chain failed for example b682bce8-bc6d-4d87-95aa-e56546a57a33 with inputs {'Question': 'What was the % Y/Y growth in FY20, FY21, and FY22?'}
Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'image_parse_error'}}


[------------------------------------------------->] 10/10

Unnamed: 0,output,feedback.COT Contextual Accuracy,error,execution_time,run_id
count,7,7.0,3,10.0,10
unique,7,,3,,10
top,"As of the data provided in the slide, Datadog ...",,"Error code: 400 - {'error': {'message': ""You u...",,e9ea6a99-77fd-4a53-9c11-a46c5463daf4
freq,1,,1,,1
mean,,1.0,,10.191968,
std,,0.0,,2.482067,
min,,1.0,,7.256001,
25%,,1.0,,8.690568,
50%,,1.0,,9.492313,
75%,,1.0,,11.067811,


View the evaluation results for project 'gpt4v-960-540-ceff' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74/compare?selectedSessions=65ede05e-4460-43cd-b9ba-01ad52228a35

View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74
[>                                                 ] 0/10

Chain failed for example 16d929e5-f2ef-43ea-a408-9192b00bd8e2 with inputs {'Question': "What is the projected cloud spend in $B's in 2026E?"}
Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'image_parse_error'}}


[---->                                             ] 1/10

Chain failed for example 4529633d-b9a8-4565-83d6-f65b1ff9dd80 with inputs {'Question': 'What is the projected TAM for observability expected for each year through 2026?'}
Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image is below 20 MB in size and is of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'image_parse_error'}}


[------------------------------------------------->] 10/10

Unnamed: 0,output,feedback.COT Contextual Accuracy,error,execution_time,run_id
count,8,8.0,2,10.0,10
unique,8,,2,,10
top,"Based on the second image provided, Datadog ha...",,"Error code: 400 - {'error': {'message': ""You u...",,2f83f649-41c4-4f59-99d5-2b3a63fcbff0
freq,1,,1,,1
mean,,1.0,,12.496728,
std,,0.0,,1.784272,
min,,1.0,,8.60851,
25%,,1.0,,11.929827,
50%,,1.0,,12.544359,
75%,,1.0,,13.262577,


View the evaluation results for project 'gemini-480-270-ceff' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74/compare?selectedSessions=6e116420-5e6b-4a1a-9e2d-11b198556899

View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74
[------------------------------------------------->] 10/10

Unnamed: 0,output,feedback.COT Contextual Accuracy,error,execution_time,run_id
count,10,10.0,0.0,10.0,10
unique,10,,0.0,,10
top,"As of September 30, 2022, Datadog had 26,800 ...",,,,989b94d0-bddd-439a-ae93-594baab68263
freq,1,,,,1
mean,,0.6,,10.221055,
std,,0.516398,,0.643574,
min,,0.0,,9.430064,
25%,,0.0,,9.744385,
50%,,1.0,,10.118048,
75%,,1.0,,10.740429,


View the evaluation results for project 'gemini-720-405-ceff' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74/compare?selectedSessions=8fef994a-fe86-4252-b834-6032b0222a6f

View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74
[------------------------------------------------->] 10/10

Unnamed: 0,output,feedback.COT Contextual Accuracy,error,execution_time,run_id
count,10,10.0,0.0,10.0,10
unique,10,,0.0,,10
top,"As of September 30, 2022, Datadog had 26,800 ...",,,,1e3158d0-d8cc-4859-86f9-4765fd4d23ab
freq,1,,,,1
mean,,0.9,,12.710439,
std,,0.316228,,0.927829,
min,,0.0,,11.49637,
25%,,1.0,,12.256355,
50%,,1.0,,12.361182,
75%,,1.0,,12.90268,


View the evaluation results for project 'gemini-960-540-ceff' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74/compare?selectedSessions=aabcb791-febd-4a8e-9f80-9db2ab5a0960

View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/cd8e425b-5769-4b5e-a784-cbf8e2d72c74
[------------------------------------------------->] 10/10

Unnamed: 0,output,feedback.COT Contextual Accuracy,error,execution_time,run_id
count,10,10.0,0.0,10.0,10
unique,10,,0.0,,10
top,"As of September 30, 2022, Datadog had approxi...",,,,eebfb034-ca79-4128-90a4-ec8e93d50521
freq,1,,,,1
mean,,0.5,,15.657744,
std,,0.527046,,1.688655,
min,,0.0,,13.549547,
25%,,0.0,,14.640496,
50%,,0.5,,15.683292,
75%,,1.0,,16.265661,
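
The mean scores above are computed only over the runs that returned an answer, so the two GPT-4V runs with errors are averaged over 7 and 8 examples rather than 10. One way to put all six runs on the same footing is to count each errored example as incorrect. That scoring choice is an assumption made for this sketch, not something the evaluator does, and the function name `effective_accuracy` is hypothetical. The figures are copied from the tables above.

```python
# Figures copied from the result tables: (graded answers, mean score, errored runs).
results = {
    "gpt4v-480-270": (10, 0.5, 0),
    "gpt4v-720-405": (7, 1.0, 3),
    "gpt4v-960-540": (8, 1.0, 2),
    "gemini-480-270": (10, 0.6, 0),
    "gemini-720-405": (10, 0.9, 0),
    "gemini-960-540": (10, 0.5, 0),
}


def effective_accuracy(graded, mean_score, errors):
    """Mean score over all examples, counting errored runs as incorrect."""
    total = graded + errors
    return graded * mean_score / total


summary = {name: round(effective_accuracy(*row), 2) for name, row in results.items()}
for name, acc in summary.items():
    print(f"{name}: {acc:.2f}")
```

With only 10 examples per run, these figures should be read cautiously; a single question changes a score by 0.1.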
