## Semi-structured and Multi-modal RAG

Many documents contain a mixture of content types, including text, tables, and images.

Semi-structured data can be challenging for conventional RAG for at least two reasons:

* Text splitting may break up tables, corrupting the data in retrieval
* Embedding tables may pose challenges for semantic similarity search
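
The first point can be illustrated with a minimal sketch. It assumes a naive fixed-size character splitter (the helper `split_fixed` is hypothetical) and a made-up table; real splitters are smarter, but they can fail in the same way when a table is longer than the chunk size.

```python
# Toy table; the rows are made up for illustration.
table = (
    "| tile  | points |\n"
    "| red 1 | 1      |\n"
    "| red 2 | 2      |\n"
    "| joker | 30     |\n"
)


def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Naive splitter: cut every `chunk_size` characters, ignoring structure."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = split_fixed(table, chunk_size=30)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))

# Chunks that still contain the header row; later chunks hold
# row fragments with no column names attached.
chunks_with_header = [c for c in chunks if "points" in c]
```

Only the first chunk keeps the header, so a retrieved later chunk contains values without any indication of what they mean.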

In addition, the information captured in images is typically lost.

In this notebook we will try Option 2.

`Option 2:`

* Use a multimodal LLM (such as [GPT4-V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images
* Embed and retrieve text
* Pass text chunks to an LLM for answer synthesis

This notebook shows how we might tackle this:

- We will use Unstructured to parse images, text, and tables from documents (PDFs).
- We will use the multi-vector retriever to store raw tables, text, and (optionally) images along with their summaries for retrieval.

![ss_mm_rag.png](../../diagrams/ss_mm_rag.png)
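
The idea behind storing raw elements alongside their summaries can be sketched in plain Python. This is not the multi-vector retriever's API: the data, the `overlap` scoring function, and the `retrieve` helper are all hypothetical, and word overlap stands in for embedding similarity.

```python
import uuid

# Hypothetical raw elements and hand-written summaries (not taken from the PDF).
raw_elements = [
    "| tile | points |\n| joker | 30 |",
    "Players take turns placing tiles on the table in runs or groups.",
]
summaries = [
    "Table listing the point value of each tile",
    "Text explaining how players place tiles during a turn",
]

# Docstore: id -> raw element. Summary index: id -> summary (what gets searched).
ids = [str(uuid.uuid4()) for _ in raw_elements]
docstore = dict(zip(ids, raw_elements))
summary_index = dict(zip(ids, summaries))


def overlap(a: str, b: str) -> int:
    """Count shared lowercase words; a crude stand-in for embedding similarity."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve(query: str) -> str:
    """Search over the summaries, but return the raw element behind the best match."""
    best_id = max(summary_index, key=lambda i: overlap(query, summary_index[i]))
    return docstore[best_id]


result = retrieve("what is the point value of a tile")
```

The point of the design is that the summary is what gets matched against the query, while the answer-synthesis step receives the untouched raw table or text.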

In [1]:
import base64
import io
import os

from dotenv import load_dotenv
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama
from PIL import Image
from PIL.Image import Resampling
from unstructured.partition.pdf import partition_pdf

load_dotenv("../../.env.research")

True

In [2]:
raw_pdf_elements = partition_pdf(
    filename=os.getenv("DATA_DIR") + "/rummikub_rules_with_images.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=2500,
    new_after_n_chars=2200,
    combine_text_under_n_chars=1000,
    image_output_dir_path=os.getenv("DATA_DIR"),
)
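
The cell above builds the PDF path by concatenating `os.getenv("DATA_DIR")` with a string, which raises a `TypeError` when the variable is unset because `os.getenv` then returns `None`. A small guard, sketched here with a hypothetical helper named `data_file`, gives a clearer error message in that case.

```python
import os
from pathlib import Path


def data_file(name: str, env_var: str = "DATA_DIR") -> Path:
    """Return the path to `name` inside the directory named by `env_var`."""
    base = os.getenv(env_var)
    if not base:
        # Fail early with a readable message instead of a TypeError later on.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return Path(base) / name
```

In the notebook this could be used as `filename=str(data_file("rummikub_rules_with_images.pdf"))`.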

In [3]:
from unstructured.documents.elements import CompositeElement, Table

# Sort the parsed elements into tables and text chunks
tables = []
texts = []
for element in raw_pdf_elements:
    if isinstance(element, Table):
        tables.append(str(element))
    elif isinstance(element, CompositeElement):
        texts.append(str(element))

In [4]:
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

model = ChatOllama(model="llama3.2")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [6]:
# Summarize every table, running at most 5 requests concurrently
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

In [8]:
def resize_and_encode_image(image_path, size=(128, 128)):
    """
    Resize an image and encode it as Base64.

    Args:
    image_path (str): Path to the image.
    size (tuple): Desired size of the image (width, height).

    Returns:
    str: Base64 string of the resized image.
    """
    try:
        # Open and resize the image
        with Image.open(image_path) as img:
            img = img.convert("RGB")  # Ensure consistent format
            img.thumbnail(size, Resampling.LANCZOS)

            # Save resized image to a bytes buffer
            buffered = io.BytesIO()
            img.save(buffered, format="PNG")
            return base64.b64encode(buffered.getvalue()).decode("utf-8")
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return None
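
Because the function returns a Base64 string, it helps to know how long that string gets: Base64 encodes every 3 bytes as 4 characters, so the text is about a third longer than the binary data. The sketch below checks this arithmetic; the 30 kB payload is an assumed stand-in, not a measurement of the Rummikub images.

```python
import base64
import math


def base64_length(n_bytes: int) -> int:
    """Length of the padded Base64 encoding of `n_bytes` bytes."""
    return 4 * math.ceil(n_bytes / 3)


payload = bytes(30_000)  # stand-in for a 30 kB PNG; the size is an assumption
encoded = base64.b64encode(payload).decode("utf-8")
print(len(encoded), base64_length(len(payload)))  # prints "40000 40000"
```

If such a string is placed in a text prompt, every one of those characters has to be tokenized and processed as text.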

In [10]:
prompt_image = """You are an assistant that can interpret images and text. The images are all from the rules of the game Rummikub.
Here is an image in Base64 format:
{element}

Please describe this image in detail."""
# Use a separate name so the table/text prompt defined earlier is not overwritten
image_prompt = ChatPromptTemplate.from_template(prompt_image)

multi_model = ChatOllama(model="llava-llama3")

# Note: the Base64 string is interpolated into the prompt as plain text
image_summarize_chain = (
    {"element": lambda x: resize_and_encode_image(x)}
    | image_prompt
    | multi_model
    | StrOutputParser()
)
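
The chain above sends the Base64 string to the model as ordinary prompt text. An alternative worth trying is to attach the image as a separate content part. The sketch below only builds such a content list in plain Python; the helper `build_image_content` is hypothetical, and it assumes the chat model integration accepts the `image_url` content-part format, which should be checked against the `langchain_ollama` documentation.

```python
import base64


def build_image_content(image_b64: str, question: str) -> list[dict]:
    """Build a content list with one text part and one image part."""
    return [
        {"type": "text", "text": question},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_b64}"},
        },
    ]


# Placeholder bytes, not a real image; only used to show the structure.
fake_png = base64.b64encode(b"not a real image").decode("utf-8")
content = build_image_content(fake_png, "Please describe this image in detail.")

# With LangChain installed, this list would be wrapped as
# HumanMessage(content=content) and passed to the chat model.
```

Whether this changes the runtime for `llava-llama3` on local hardware is untested here.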

In [None]:
image_dir = os.getenv("IMG_DIR")
image_paths = [
    os.path.join(image_dir, img)
    for img in os.listdir(image_dir)
    if img.endswith((".jpg", ".png"))
]

image_summaries = image_summarize_chain.batch(image_paths, {"max_concurrency": 5})
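
The extension filter above is case-sensitive, and `os.listdir` returns entries in arbitrary order. A possible variant using `pathlib` is sketched below; the helper `list_images` and the extra `.jpeg` suffix are assumptions, and the temporary directory exists only to demonstrate the behaviour.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(image_dir: str) -> list[str]:
    """Return sorted paths of image files, matching suffixes case-insensitively."""
    return sorted(
        str(p)
        for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.PNG", "a.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [Path(p).name for p in list_images(tmp)]
```

A stable order also makes it easier to match summaries to images between runs.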

In [None]:
for img_path, summary in zip(image_paths, image_summaries):
    print(f"Image: {img_path}\nDescription: {summary}\n")

## Conclusion
Option 2 did not work in this setup, and Option 3 is unlikely to work either: even after resizing the images to reduce their size, generating a description for each image still takes too long. I suspect that either the local model is not powerful enough to handle the images, or the images are too complex to describe.
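
To put a number on "too long", the per-image time could be measured on a small sample before running the full batch. The sketch below uses a hypothetical helper `time_calls` and a fake describe function that only sleeps; it does not reproduce any timing from this notebook.

```python
import time
from typing import Callable


def time_calls(
    fn: Callable[[str], str], inputs: list[str]
) -> list[tuple[str, float]]:
    """Call `fn` on each input and record the elapsed wall-clock seconds."""
    timings = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        timings.append((item, time.perf_counter() - start))
    return timings


def fake_describe(path: str) -> str:
    """Stand-in for a model call; sleeps briefly instead of describing an image."""
    time.sleep(0.01)
    return f"description of {path}"


timings = time_calls(fake_describe, ["a.png", "b.png"])
for path, seconds in timings:
    print(f"{path}: {seconds:.3f}s")
```

In the notebook, the same helper could be called as `time_calls(image_summarize_chain.invoke, image_paths[:3])`.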