In [1]:
! pip install langchain langchain-community "unstructured[all-docs]" pydantic lxml chromadb sentence-transformers




In [3]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Path to save images
path = "/home/vqa/masterthesis/ourproject/"

# Get elements
raw_pdf_elements = partition_pdf(
    filename=path + "output.pdf",
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard max on chunk length
    max_characters=4000,
    # Attempt to start a new chunk after 3800 chars
    new_after_n_chars=3800,
    # Attempt to keep chunks above 2000 chars by combining smaller text blocks
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
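
The three chunking parameters above are easy to mix up. The sketch below is a toy illustration of the role each one plays. It is not unstructured's actual algorithm; the function name toy_chunk, the merge rule, and the sample sections are all assumptions made for illustration.

```python
def toy_chunk(sections, max_characters=4000, new_after_n_chars=3800,
              combine_text_under_n_chars=2000):
    """Toy illustration only; NOT how unstructured chunks internally."""
    chunks = []
    current = ""
    for section in sections:
        candidate = f"{current}\n\n{section}" if current else section
        # Combine: keep merging while the current chunk is still small and
        # the merged text stays within the soft limit.
        if (current and len(current) < combine_text_under_n_chars
                and len(candidate) <= new_after_n_chars):
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = section
        # Hard max: split anything that is still too long.
        while len(current) > max_characters:
            chunks.append(current[:max_characters])
            current = current[max_characters:]
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sections, one string per title-delimited block
sections = ["a" * 500, "b" * 500, "c" * 3000, "d" * 9000]
chunk_lengths = [len(c) for c in toy_chunk(sections)]
print(chunk_lengths)
```

In this toy version the two short sections are merged, the 3000-character section stands alone, and the 9000-character section is cut at the hard maximum.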


In [3]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element types
# (a TableChunk appears when a Table exceeds max_characters set above)
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 5}
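
The counting loop above can also be written with collections.Counter. A minimal sketch, using stand-in classes so it runs without unstructured installed; CompositeElement and Table here are hypothetical placeholders, not the library's classes.

```python
from collections import Counter


# Hypothetical stand-ins for unstructured element classes
class CompositeElement:
    pass


class Table:
    pass


elements = [CompositeElement(), CompositeElement(), Table()]

# Count elements by class name instead of by str(type(...))
category_counts = Counter(type(el).__name__ for el in elements)
print(dict(category_counts))
```

Counting by class name gives shorter keys than str(type(element)), at the cost of dropping the module path.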

In [4]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

0
5


Text and Table summaries

In [5]:
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

In [6]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatOllama(model="llama2:7b-chat")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [7]:
# Apply to text
texts = [i.text for i in text_elements if i.text != ""]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

In [8]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
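
The batch calls above cap the number of simultaneous model requests at 5. The sketch below shows roughly what bounded-concurrency batching means, using a thread pool and a stub in place of the model. The names fake_summarize and batch are hypothetical, and this is not LangChain's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_summarize(chunk: str) -> str:
    """Hypothetical stand-in for one prompt | model | parser call."""
    return chunk[:20].strip() + "..."


def batch(func, inputs, max_concurrency=5):
    # At most max_concurrency calls run at once; map() keeps input order.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(func, inputs))


chunks = [
    "First chunk of manual text about batteries.",
    "Second chunk about cleaning the player.",
]
summaries = batch(fake_summarize, chunks)
print(summaries)
```

Because results come back in input order, the i-th summary still lines up with the i-th chunk, which the vectorstore cells later rely on.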

Image summarization

In [9]:
%%bash

# Define the directory containing the images
IMG_DIR=/home/vqa/masterthesis/ourproject/figures-test

echo "Contents of the directory ${IMG_DIR}:"
ls -l "${IMG_DIR}"

# Loop through each image in the directory
for img in "${IMG_DIR}"/*.jpg; do
    # Extract the base name of the image without extension
    base_name=$(basename "$img" .jpg)

    # Define the output file name based on the image name
    output_file="${IMG_DIR}/${base_name}.txt"

    # Execute the command and save the output to the defined output file
    python3 /home/vqa/masterthesis/ourproject/runllava.py --i "$img" --p "Describe the image in detail. Be specific about graphs, such as bar plots." --o "$output_file"
done

Contents of the directory /home/vqa/masterthesis/ourproject/figures-test:
total 284
-rw-rw-r-- 1 vqa vqa  5558 feb 20 13:32 figure-10-8.jpg
-rw-rw-r-- 1 vqa vqa   481 feb 22 10:28 figure-10-8.txt
-rw-rw-r-- 1 vqa vqa 17142 feb 20 13:32 figure-11-10.jpg
-rw-rw-r-- 1 vqa vqa   433 feb 22 10:29 figure-11-10.txt
-rw-rw-r-- 1 vqa vqa  6872 feb 20 13:32 figure-11-9.jpg
-rw-rw-r-- 1 vqa vqa   742 feb 22 10:27 figure-11-9.txt
-rw-rw-r-- 1 vqa vqa  3541 feb 20 13:32 figure-13-11.jpg
-rw-rw-r-- 1 vqa vqa   272 feb 22 10:28 figure-13-11.txt
-rw-rw-r-- 1 vqa vqa 58014 feb 20 13:32 figure-2-1.jpg
-rw-rw-r-- 1 vqa vqa   427 feb 22 10:28 figure-2-1.txt
-rw-rw-r-- 1 vqa vqa  2298 feb 20 13:32 figure-2-2.jpg
-rw-rw-r-- 1 vqa vqa   247 feb 22 10:29 figure-2-2.txt
-rw-rw-r-- 1 vqa vqa 16408 feb 20 13:32 figure-4-3.jpg
-rw-rw-r-- 1 vqa vqa   450 feb 22 10:27 figure-4-3.txt
-rw-rw-r-- 1 vqa vqa 16593 feb 20 13:32 figure-4-4.jpg
-rw-rw-r-- 1 vqa vqa   413 feb 22 10:29 figure-4-4.txt
-rw-rw-r-- 1 vqa vqa 575

 Response: 1. The image features a map of the United States, specifically focusing on Alaska.
2. A small square or rectangle is placed in the top left corner of the map. This could be a representation of Play Modes.
Response saved to: /home/vqa/masterthesis/ourproject/figures-test/figure-10-8.txt
Starting Ollama server with LLaVa...
 Response:  The image features an electric device with a couple of batteries attached to it. One of the batteries is located on the left side, and the other one is on the right side of the object. These batteries are connected via wires, with one wire running horizontally across the middle of the scene. The arrangement of these components suggests that they form part of a power supply or an energy storage system.
Response saved to: /home/vqa/masterthesis/ourproject/figures-test/figure-11-10.txt
Starting Ollama server with LLaVa...
 Response:  The image features a brown background with white letters on it. There are two words written in large, capitalized te
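
The same loop can be driven from Python instead of bash. A minimal sketch, assuming runllava.py accepts the --i, --p and --o flags exactly as used in the bash cell; build_commands and describe_images are hypothetical helper names.

```python
import subprocess
from pathlib import Path

PROMPT = "Describe the image in detail. Be specific about graphs, such as bar plots."


def build_commands(img_dir, script="/home/vqa/masterthesis/ourproject/runllava.py"):
    """Return one runllava.py command per .jpg file in img_dir."""
    commands = []
    for img in sorted(Path(img_dir).glob("*.jpg")):
        out = img.with_suffix(".txt")  # figure-2-1.jpg -> figure-2-1.txt
        commands.append(
            ["python3", script, "--i", str(img), "--p", PROMPT, "--o", str(out)]
        )
    return commands


def describe_images(img_dir):
    # check=True raises if runllava.py exits with a non-zero status
    for cmd in build_commands(img_dir):
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with the prompt text.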

In [10]:
import glob
import os

path = "/home/vqa/masterthesis/ourproject/figures-test"
# Get all .txt files in the directory (sorted for a reproducible order)
file_paths = sorted(glob.glob(os.path.join(path, "*.txt")))

# Read each file and store its content in a list
img_summaries = []
for file_path in file_paths:
    with open(file_path, "r") as file:
        img_summaries.append(file.read())

print(img_summaries)

# Clean up residual logging
cleaned_img_summary = []
for s in img_summaries:
    split_result = s.split("clip_model_load: total allocated memory: 201.27 MB\n\n", 1)
    if len(split_result) > 1:
        cleaned_img_summary.append(split_result[1].strip())
    else:
        # Handle cases where split string is not found
        cleaned_img_summary.append(s.strip())
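
The clean-up above splits on one exact log line, including the memory figure 201.27 MB. If that figure can vary between runs (an assumption, not something the notebook shows), a regular expression is more tolerant. A minimal sketch; strip_llava_log is a hypothetical helper name.

```python
import re

# Matches the last log line before the response, with any memory figure
LOG_TAIL = re.compile(r"clip_model_load: total allocated memory: [\d.]+ MB\n\n")


def strip_llava_log(raw: str) -> str:
    """Drop everything up to and including the log line, if it is present."""
    parts = LOG_TAIL.split(raw, maxsplit=1)
    # parts is [raw] when the marker is absent, else [log, response]
    return parts[-1].strip()


sample = "clip_model_load: total allocated memory: 195.5 MB\n\n The image shows a map.\n"
cleaned = strip_llava_log(sample)
print(cleaned)
```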



Vectorstore

In [11]:
print(texts)



In [25]:
print("Length of text_summaries:", len(text_summaries))

Length of text_summaries: 5


In [24]:
print("Length of texts:", len(texts))

Length of texts: 5
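
The two length checks above matter because summaries and original chunks are later paired by position. A minimal sketch of making that check explicit; pair_summaries is a hypothetical helper, not part of LangChain.

```python
def pair_summaries(summaries, originals):
    """Pair each summary with its original chunk, failing loudly on a mismatch."""
    if len(summaries) != len(originals):
        raise ValueError(
            f"{len(summaries)} summaries but {len(originals)} original chunks"
        )
    return list(zip(summaries, originals))


pairs = pair_summaries(["summary A", "summary B"], ["chunk A", "chunk B"])
print(pairs)
```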


In [12]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    ),
)

# The storage layer for the parent documents
store = InMemoryStore()  # <- Can we extend this to images
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

In [13]:
# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# # Add tables
# table_ids = [str(uuid.uuid4()) for _ in tables]
# summary_tables = [
#     Document(page_content=s, metadata={id_key: table_ids[i]})
#     for i, s in enumerate(table_summaries)
# ]
# retriever.vectorstore.add_documents(summary_tables)
# retriever.docstore.mset(list(zip(table_ids, tables)))

# Add images
img_ids = [str(uuid.uuid4()) for _ in cleaned_img_summary]
summary_img = [
    Document(page_content=s, metadata={id_key: img_ids[i]})
    for i, s in enumerate(cleaned_img_summary)
]
retriever.vectorstore.add_documents(summary_img)
retriever.docstore.mset(
    list(zip(img_ids, cleaned_img_summary))
)  # Store the image summary as the raw document

In [14]:
retriever.get_relevant_documents("Images / mona lisa")[0]

'The image features a brown background with white letters on it. There are two words written in large, capitalized text: "CARE" and "COMMUNITY." These words appear to be the central focus of the image, standing out against the monochrome backdrop. The boldness of the lettering draws attention to the message being conveyed.'
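
The retriever searches over the summaries but returns the stored parent document that shares the same doc_id. The toy sketch below mimics that two-step lookup with word overlap in place of embeddings. ToyMultiVectorRetriever and its scoring are assumptions for illustration, not LangChain's implementation.

```python
import uuid


def tokenize(text):
    return set(text.lower().split())


class ToyMultiVectorRetriever:
    def __init__(self):
        self.index = []     # (summary tokens, doc_id): what gets searched
        self.docstore = {}  # doc_id -> original document: what gets returned

    def add(self, summaries, originals):
        for summary, original in zip(summaries, originals):
            doc_id = str(uuid.uuid4())
            self.index.append((tokenize(summary), doc_id))
            self.docstore[doc_id] = original

    def retrieve(self, query, k=1):
        query_tokens = tokenize(query)
        # Rank summaries by shared words, then follow doc_id to the original
        ranked = sorted(
            self.index, key=lambda item: len(query_tokens & item[0]), reverse=True
        )
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


toy = ToyMultiVectorRetriever()
toy.add(
    ["battery types and installation", "cleaning the player"],
    ["Insert two AA batteries into the compartment.", "Wipe the outside with a soft cloth."],
)
result = toy.retrieve("which battery")
print(result)
```

The query matches the first summary, but the value returned is the original text stored under the same id.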

RAG

In [15]:
from langchain_core.runnables import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Option 1: LLM
model = ChatOllama(model="llama2:7b-chat")
# Option 2: Multi-modal LLM
# model = LLaVA

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
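
The piped chain above does three things in order: retrieve context for the question, fill the prompt template, and send the prompt to the model. A plain-function sketch of that flow, with stubs in place of the retriever and the LLM; rag_answer, fake_retrieve and echo_llm are hypothetical names.

```python
TEMPLATE = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""


def rag_answer(question, retrieve, llm):
    context = retrieve(question)  # step 1: fetch relevant documents
    prompt = TEMPLATE.format(context=context, question=question)  # step 2
    return llm(prompt)  # step 3: generate the answer


def fake_retrieve(question):
    """Stub retriever that always returns one fixed document."""
    return ["The player runs on two AA batteries."]


def echo_llm(prompt):
    """Stub model that returns the prompt unchanged, to show what the LLM sees."""
    return prompt


filled_prompt = rag_answer("Which battery does the CD player use?", fake_retrieve, echo_llm)
print(filled_prompt)
```

Printing the filled prompt is a quick way to see which retrieved text the model is actually answering from.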

In [16]:
chain.invoke(
    "Which battery does the CD player use?"
)

'The CD player uses two AA (LRO6 alkaline) batteries.'

In [17]:
chain.invoke(
    "What can the CD player play?"
)

'The CD player can play compact discs.'

In [18]:
chain.invoke("Should I clean the CD player?")

'The manual does not recommend cleaning the CD player, stating that it is not necessary and that simply wiping the outside surfaces with a soft cloth will be sufficient as needed. Therefore, it is not recommended to clean the CD player.'

In [19]:
chain.invoke(
    "What is the performance of LLaVa across multiple image domains / subjects?"
)

"Based on the provided context, there is no direct answer to the question regarding the performance of LLaVa across multiple image domains/subjects. The provided context only refers to images featuring a map of the United States with a small square or rectangle placed in the top left corner, representing Play Modes. There are also images related to an anti-skip feature and a declaration of conformity for a Bose PM-1 Portable Compact Disc Player.\n\nHowever, it is possible to infer some information about LLaVa's performance across multiple image domains/subjects based on the context provided:\n\n1. The presence of images related to different subjects (e.g., maps, anti-skip features, declaration of conformity) suggests that LLaVa may be capable of processing and analyzing various types of visual content.\n2. The fact that the images are labeled with specific technical regulations (e.g., EN 55013, EN 55020, etc.) and accreditation information implies that LLaVa may be designed to operate 

In [20]:
chain.invoke(
    "Explain the chicken nugget picture."
)

'Ah, I see! Based on the context you provided, the "chicken nugget" picture is likely a humorous representation of a complex electronic device. The image features multiple buttons and cords, which are often associated with electronic devices. However, the addition of a chicken nugget in the center of the image creates a playful and absurd contrast, implying that the device is not just functional but also tasty!\n\nPerhaps the image is meant to represent a fictional gadget that combines the functions of a stereo, portable music player, and a snack dispenser. The buttons on top could be for playing music, pausing, or adjusting the volume, while the cords suggest that it needs to be plugged into an outlet to function. The chicken nugget, however, adds a whimsical touch, suggesting that this device is not just practical but also indulgent.\n\nOverall, the "chicken nugget" picture appears to be a humorous take on the typical electronic device diagram, using absurdity and exaggeration to cre

In [21]:
chain.invoke(
    "List the foods in the picture of the fridge."
)

"I apologize, but based on the context provided, there are no foods visible in the image. The scene depicts a group of batteries placed next to each other, along with two scissors. There is also a map of the United States and a detailed diagram of an electronic device's inner workings. However, there are no foods or any related elements visible in the images. Therefore, I cannot list any foods in the picture."

In [22]:
chain.invoke(
    "Explain the LLaVA architecture based on the picture."
)

' Based on the provided pictures, I can infer that the LLaVA architecture is likely a modular and hierarchical design, with multiple components working together to form a cohesive system. Here are some key observations and insights:\n\n1. Modular design: The image features multiple components, each with its own unique function or role. These components include the speaker, buttons, clock face, batteries, and wires. This suggests that LLaVA architecture is designed to be modular, with each component playing a specific role in the overall system.\n2. Hierarchical structure: The image shows multiple levels of organization within the device. For example, the battery compartment appears to be located inside the main body of the device, which suggests a hierarchical structure with the batteries serving as a subcomponent of the device. Similarly, the buttons and clock face are also organized in a hierarchical manner, with the buttons being part of the overall user interface and the clock face