## Semi-structured and Multi-modal RAG

Many documents contain a mixture of content types, including text, tables, and images.

Semi-structured data can be challenging for conventional RAG for at least two reasons:

* Text splitting may break up tables, corrupting the data in retrieval
* Embedding tables may pose challenges for semantic similarity search

In addition, the information captured in images is typically lost.
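To make the first point concrete, here is a minimal sketch of how a naive fixed-size splitter can separate a table header from its rows. The document text, the `split_fixed` helper, and the chunk size are all hypothetical and chosen only for illustration; this is not the splitter used later in the notebook.

```python
# Toy document: a short paragraph followed by a small table (hypothetical data).
document = (
    "Scoring overview for the example game.\n"
    "| Tile  | Points |\n"
    "| ----- | ------ |\n"
    "| Joker | 30     |\n"
    "| Ten   | 10     |\n"
)


def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = split_fixed(document, chunk_size=60)

for number, chunk in enumerate(chunks):
    print(f"--- chunk {number} ---")
    print(chunk)

# A row is only interpretable if it is retrieved together with the header.
header = "| Tile  | Points |"
joker_row = "| Joker | 30     |"
header_and_row_together = any(header in c and joker_row in c for c in chunks)
print("Header and Joker row in the same chunk:", header_and_row_together)
```

With this chunk size the header lands in one chunk and the data rows in another, so a retriever that returns only the second chunk would hand the LLM numbers without column names.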

In this notebook we will try option 1.

`Option 1:`

* Use multimodal embeddings (such as [CLIP](https://openai.com/research/clip)) to embed images and text
* Retrieve both using similarity search
* Pass raw images and text chunks to a multimodal LLM for answer synthesis
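The retrieval step in Option 1 can be sketched without any model: once images and text are embedded into the same vector space, a single similarity search ranks both. The 3-dimensional vectors, the item names, and the `retrieve` helper below are made up for illustration; a real CLIP model produces much larger vectors, and the vector store handles the ranking.

```python
import math

# Hypothetical embeddings: images and text chunks share one vector space.
index = [
    {"kind": "text", "content": "Rules for the initial meld", "vector": [0.9, 0.1, 0.0]},
    {"kind": "image", "content": "figure-1.jpg", "vector": [0.8, 0.2, 0.1]},
    {"kind": "text", "content": "History of the game", "vector": [0.0, 0.2, 0.9]},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, items, k=2):
    """Return the k items most similar to the query, regardless of modality."""
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embedding of a text query about the initial meld.
query_vector = [1.0, 0.1, 0.0]
top = retrieve(query_vector, index, k=2)
for item in top:
    print(item["kind"], "->", item["content"])
```

Because the ranking only looks at vectors, the top results can mix text chunks and images, which is what lets a multimodal LLM receive both for answer synthesis.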

In this notebook we will:

* Use [Unstructured](https://unstructured.io/) to parse images, text, and tables from documents (PDFs).
* Use OpenCLIP multi-modal embeddings.
* Use [Chroma](https://www.trychroma.com/) with multi-modal support.

![chroma_multimodal](../../diagrams/chroma_multimodal.png)

In [1]:
import os

from langchain_chroma import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings
from unstructured.partition.pdf import partition_pdf

In [2]:
raw_pdf_elements = partition_pdf(
    filename="../../data/rummikub_rules_with_images.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=2500,
    new_after_n_chars=2200,
    combine_text_under_n_chars=1000,
    image_output_dir_path="../../data",
)

ModuleNotFoundError: No module named 'torch'

In [None]:
from unstructured.documents.elements import CompositeElement, Table

# Sort the parsed elements into tables and text chunks by element type
tables = []
texts = []
for element in raw_pdf_elements:
    if isinstance(element, Table):
        tables.append(str(element))
    elif isinstance(element, CompositeElement):
        texts.append(str(element))

In [None]:
vectorstore = Chroma(
    collection_name="mm_rag_clip_photos",
    embedding_function=OpenCLIPEmbeddings(
        model_name="ViT-g-14", checkpoint="laion2b_s34b_b88k"
    ),
)

image_uris = sorted(
    [
        os.path.join("../../figures", image_name)
        for image_name in os.listdir("../../figures")
        if image_name.endswith(".jpg")
    ]
)

# Add images
vectorstore.add_images(uris=image_uris)

# Add documents
vectorstore.add_texts(texts=texts)

# Add tables
vectorstore.add_texts(texts=tables)

# Make retriever
retriever = vectorstore.as_retriever()

In [None]:
retriever.invoke("Explain manipulation")

[Document(metadata={}, page_content='Another variation concerns the point at which manipulation can begin. Most sets of rules agree that as soon as you have laid down your initial sets and runs to a value of 30 or more points, you can in the same turn start manipulating the sets and runs on the table and adding further tiles to them. According to the Dutch Spelregelboek, however, manipulation can only begin on your next turn after the turn in which you laid down your initial meld. Manipulation on the same turn that you lay down your initial meld is, however, allowed in the "Dutch Sabra" variation in that book. The Lemada (1999) rules also appear not to allow manipulation on the turn in which you make your initial meld. The Spears (1988) rules explicitly do allow it ("once players have entered the game they can on the same turn \'play the table\'..."). The Pressman (1987 and 1998) and Goliath (1994) rules are somewhat ambiguous, but seem to allow manipulation to begin on the same turn a
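Before passing retrieved results to a multimodal LLM, images and text chunks need to be told apart. The sketch below assumes that the vector store returns images as base64-encoded strings in the same field as text (an assumption about how `add_images` stores them, not something confirmed in this notebook). The helper names `looks_like_base64_image` and `split_images_and_texts` are hypothetical, and the sample data stands in for real retriever output.

```python
import base64
import binascii


def looks_like_base64_image(value: str) -> bool:
    """Heuristic: the string is valid base64 and decodes to JPEG or PNG bytes."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return raw.startswith(b"\xff\xd8\xff") or raw.startswith(b"\x89PNG\r\n\x1a\n")


def split_images_and_texts(contents):
    """Group retrieved contents into base64 images and plain text chunks."""
    images, texts = [], []
    for content in contents:
        if looks_like_base64_image(content):
            images.append(content)
        else:
            texts.append(content)
    return {"images": images, "texts": texts}


# Stand-in for [doc.page_content for doc in retriever.invoke(...)]:
# a fake JPEG header encoded as base64, plus one ordinary text chunk.
fake_jpeg = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 16).decode("ascii")
retrieved = [fake_jpeg, "Manipulation can begin after the initial meld."]

result = split_images_and_texts(retrieved)
print(len(result["images"]), "image(s),", len(result["texts"]), "text chunk(s)")
```

Checking the decoded magic bytes rather than only the base64 alphabet avoids misclassifying short plain words that happen to be valid base64.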

## Conclusion
Option 1 runs, but only in Google Colab, because my PC is not powerful enough to run this code. I'm not sure whether the images are processed correctly. I think the embeddings are somewhat better than the Nomic embeddings, but they are computationally more expensive. Because we don't have much computational power, we can't use this option. This notebook was downloaded from Google Colab.