In [1]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf



In [2]:
path = "/mnt/c/Users/vishal.tyagi/GenAi/multi_rag/"
# Get elements
raw_pdf_elements = partition_pdf(
    filename=path + "LLaVA.pdf",
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard max on chunk size, in characters
    max_characters=4000,
    # Attempt to start a new chunk once the current one reaches 3800 chars
    new_after_n_chars=3800,
    # Attempt to keep chunks > 2000 chars by combining smaller sections
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
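
The three chunking parameters passed to `partition_pdf` (hard maximum, soft maximum, and a combine threshold for small sections) can be easier to reason about with a toy model. The sketch below is a simplified pure-Python illustration of how such limits might interact; it is not the `unstructured` implementation, and `Block` and `chunk_by_title_sketch` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a parsed PDF element."""
    text: str
    is_title: bool = False


def chunk_by_title_sketch(blocks, max_characters=4000, new_after_n_chars=3800,
                          combine_text_under_n_chars=2000):
    """Toy approximation of title-based chunking with soft and hard limits."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    chunks, current = [], ""

    def flush():
        # Close the current chunk (if any) and start an empty one.
        nonlocal current
        if current:
            chunks.append(current)
        current = ""

    for block in blocks:
        # A title starts a new chunk unless the current chunk is still small.
        if block.is_title and len(current) >= combine_text_under_n_chars:
            flush()
        # Soft limit: stop growing a chunk that already reached the threshold.
        if len(current) >= new_after_n_chars:
            flush()
        text = block.text
        while text:
            sep = "\n\n" if current else ""
            room = max_characters - len(current) - len(sep)
            if room <= 0:
                flush()
                continue
            current += sep + text[:room]
            text = text[room:]
            if text:
                # Hard limit reached mid-block: close this chunk and continue.
                flush()
    flush()
    return chunks


# Small limits so the behavior is visible on short strings.
demo_blocks = [
    Block("A", is_title=True), Block("a" * 10),
    Block("B", is_title=True), Block("b" * 10),
    Block("C", is_title=True), Block("c" * 50),
]
demo_chunks = chunk_by_title_sketch(
    demo_blocks, max_characters=40, new_after_n_chars=30,
    combine_text_under_n_chars=20,
)
print([len(c) for c in demo_chunks])
```

In this toy run the short sections A and B are merged because the first chunk is still under the combine threshold when title B arrives, while section C is split because it exceeds the hard maximum.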


In [3]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element types seen
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 33,
 "<class 'unstructured.documents.elements.Table'>": 4}

In [4]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

4
33
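
The counting and categorizing cells above compare `str(type(element))` against class-name strings. A possible alternative, sketched here under assumptions, is to count with `collections.Counter` and to branch with `isinstance`. The classes below are hypothetical stand-ins so the block runs on its own; in the notebook the real classes would presumably be imported from `unstructured.documents.elements`, the module path shown in the output above.

```python
from collections import Counter


class Text:
    """Hypothetical stand-in for a generic text element."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


class CompositeElement(Text):
    """Hypothetical stand-in for a chunk of aggregated text."""


class Table(Text):
    """Hypothetical stand-in for a table element."""


def count_by_type(elements):
    # Counter does the "increment or initialize" bookkeeping in one pass.
    return Counter(type(el).__name__ for el in elements)


def categorize(elements):
    # isinstance avoids substring matching against str(type(...)).
    categorized = []
    for el in elements:
        if isinstance(el, Table):
            categorized.append(("table", str(el)))
        elif isinstance(el, CompositeElement):
            categorized.append(("text", str(el)))
    return categorized


demo_elements = [CompositeElement("intro"), Table("a | b"), CompositeElement("body")]
demo_counts = count_by_type(demo_elements)
demo_categorized = categorize(demo_elements)
print(dict(demo_counts))
print(demo_categorized)
```

Substring checks on `str(type(...))` also match any class whose qualified name merely starts with the same text, which is one reason `isinstance` may be the safer choice here.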


In [5]:
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

In [6]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
# model = ChatOllama(model="llama2:13b-chat")
model = ChatOllama(model="tinyllama:1.1b-chat")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [7]:
# Apply to text
texts = [i.text for i in text_elements if i.text != ""]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

In [9]:
table_summaries

['A summary of the table or text includes:\n\n- Conversation Detail description Complex reasoning: The conversation details describe complex and abstract ideas, with a focus on discussing relevant topics related to a particular topic. Conv + 5% Detail + 10% Complex: These chunks showcase the importance of specific details and complex reasoning in the conversation. - Conversation: This chunk includes an overview of the discussion, along with the topics discussed during the conversation. No Instruction Tuning: The chunk is included to indicate that there may be a lack of instruction or guidance throughout the conversation. It helps to summarize the main points and highlight areas where clarification would be helpful. - Instruction Tuning: This chunk shows any areas where further clarification or instruction might be necessary, but it does not necessarily imply a lack of focus on the topics discussed during the conversation.',
 'The table or text chunk given earlier is:\n\nTable/Text Chun

In [10]:
text_summaries

['In conclusion, our study introduces the first attempt to use machine-generated instruction-following data for language-only GPT-4 and LLaVA models that can generate multimodal languaure-image instruction-following data. Our experiments show that LLaVA outperforms GPT-4 on challenging application-oriented tasks by exhibiting the behavior of multimodal GP-4 on unseen images/instructions. The generated visual instruction tuning data, our model, and code are publicly available for future research on visual instruction following.',
 "The table or text is:\n\nHuman interaction with the world through vision and language has emerged as an active area of research in artificial intelligence, with various tasks and applications being solved by large vision models (LVMs) that can handle multi-modal signals. However, humans still interact with the world through visual and languaage instructions, which require a versatile general-purpose assistant capable of following them and completing various r

In [2]:
%%bash

# Define the directory containing the images
IMG_DIR=/mnt/c/Users/vishal.tyagi/GenAi/multi_rag/figures/

# Loop through each image in the directory
for img in "${IMG_DIR}"*.jpg; do
    # Extract the base name of the image without extension
    base_name=$(basename "$img" .jpg)

    # Define the output file name based on the image name
    output_file="${IMG_DIR}${base_name}.txt"

    # Execute the command and save the output to the defined output file
    ~/llama.cpp/llama-llava-cli -m ./models/ggml-model-q4_k.gguf --mmproj ./models/mmproj-model-f16.gguf --temp 0.1 -p "Describe the image in detail. Be specific about graphs, such as bar plots." --image "$img" > "$output_file"

done

Log start


llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/ggml-model-q4_k.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              
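
The bash loop above could also be driven from Python, which keeps path handling and file naming in one place. The sketch below only builds the argument lists and output paths; the binary path, model paths, prompt, and flags are copied from the bash cell, while `build_llava_command` and `plan_jobs` are hypothetical helper names. It does not run the model.

```python
from pathlib import Path

# Prompt copied from the bash cell above.
PROMPT = "Describe the image in detail. Be specific about graphs, such as bar plots."


def build_llava_command(image,
                        binary="~/llama.cpp/llama-llava-cli",
                        model="./models/ggml-model-q4_k.gguf",
                        mmproj="./models/mmproj-model-f16.gguf"):
    """Return the argument list for one image, mirroring the bash command."""
    return [
        str(Path(binary).expanduser()),
        "-m", model,
        "--mmproj", mmproj,
        "--temp", "0.1",
        "-p", PROMPT,
        "--image", str(image),
    ]


def plan_jobs(img_dir):
    """Pair each .jpg in img_dir with its command and its .txt output path."""
    jobs = []
    for image in sorted(Path(img_dir).glob("*.jpg")):
        jobs.append((build_llava_command(image), image.with_suffix(".txt")))
    return jobs
```

Each planned job could then be executed with something like `subprocess.run(cmd, stdout=fh, check=True)` inside `with open(out_path, "w") as fh:`, which corresponds to the `>` redirection in the bash version.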

In [11]:
import glob
import os

# Get all .txt files in the directory, sorted so the order is reproducible
file_paths = sorted(glob.glob(os.path.expanduser(os.path.join(path, "figures", "*.txt"))))

In [12]:
# Read each file and store its content in a list
cleaned_img_summary = []
for file_path in file_paths:
    with open(file_path, "r", encoding="utf-8") as file:
        cleaned_img_summary.append(file.read())

# Optional: clean up residual logging (raises IndexError if the marker is absent)
# cleaned_img_summary = [
#     s.split("clip_model_load: total allocated memory: 201.27 MB\n\n", 1)[1].strip()
#     for s in cleaned_img_summary
# ]
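
The commented-out cleanup above splits on one exact log line and indexes `[1]`, which fails when that line is missing. A more tolerant variant might look like the sketch below. `strip_log_prefix` is a hypothetical helper, and the default marker is taken from the commented code; the actual log lines depend on the llama.cpp build, so the marker is an assumption.

```python
def strip_log_prefix(raw, marker="clip_model_load: total allocated memory"):
    """Drop everything up to the last line containing marker.

    If the marker never appears, the text is returned unchanged
    apart from leading and trailing whitespace.
    """
    lines = raw.splitlines()
    # Index of the last line that contains the marker, or -1 if none does.
    last = max((i for i, line in enumerate(lines) if marker in line), default=-1)
    return "\n".join(lines[last + 1:]).strip()


demo_raw = (
    "clip_model_load: total allocated memory: 201.27 MB\n\n"
    "The image shows a bar plot.\n"
)
print(strip_log_prefix(demo_raw))
```

Applied to the notebook, this would be `cleaned_img_summary = [strip_log_prefix(s) for s in cleaned_img_summary]`.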

In [13]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_core.documents import Document




In [14]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="summaries", embedding_function=GPT4AllEmbeddings()
)

# The storage layer for the parent documents
store = InMemoryStore()  # <- Can we extend this to images
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

Failed to load libllamamodel-mainline-cuda.so: dlopen: libcudart.so.11.0: cannot open shared object file: No such file or directory
Failed to load libllamamodel-mainline-cuda-avxonly.so: dlopen: libcudart.so.11.0: cannot open shared object file: No such file or directory


In [15]:
# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

# Add images
img_ids = [str(uuid.uuid4()) for _ in cleaned_img_summary]
summary_img = [
    Document(page_content=s, metadata={id_key: img_ids[i]})
    for i, s in enumerate(cleaned_img_summary)
]
retriever.vectorstore.add_documents(summary_img)
retriever.docstore.mset(
    list(zip(img_ids, cleaned_img_summary))
)  # Store the image summary as the raw document
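
The cell above relies on one idea: summaries are what gets searched, and a shared `doc_id` links each summary back to the raw text, table, or image description that is returned. The toy index below illustrates that linkage in plain Python. `ToyMultiVectorIndex` is a hypothetical class, and word overlap stands in for embedding similarity, so this is not how `MultiVectorRetriever` or Chroma score documents.

```python
import uuid


def tokenize(text):
    # Crude stand-in for an embedding: a set of lowercase words.
    return set(text.lower().split())


class ToyMultiVectorIndex:
    """Search over summaries, return the originals they point to."""

    def __init__(self):
        self.summaries = []   # list of (doc_id, summary) pairs that get searched
        self.docstore = {}    # doc_id -> original content that gets returned

    def add(self, summaries, originals):
        # zip() would silently truncate on a length mismatch, so check first.
        if len(summaries) != len(originals):
            raise ValueError("summaries and originals must have the same length")
        ids = [str(uuid.uuid4()) for _ in originals]
        self.summaries.extend(zip(ids, summaries))
        self.docstore.update(zip(ids, originals))
        return ids

    def search(self, query, k=1):
        q = tokenize(query)
        ranked = sorted(
            self.summaries,
            key=lambda pair: len(q & tokenize(pair[1])),
            reverse=True,
        )
        # Follow each matching summary's id back to the stored original.
        return [self.docstore[doc_id] for doc_id, _ in ranked[:k]]


demo_index = ToyMultiVectorIndex()
demo_index.add(
    ["bar plot of accuracy per subject", "table of training hyperparameters"],
    ["RAW FIGURE TEXT", "RAW TABLE TEXT"],
)
demo_hits = demo_index.search("accuracy bar plot")
print(demo_hits)
```

The explicit length check is a design choice worth noting: in the notebook, ids are generated from `texts` but paired with `text_summaries` by position, so a summary that failed or was dropped would misalign the pairs without raising an error.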

In [17]:
texts

['2023\n\n3\n\n2\n\n0\n\n2 c e D 1 1 ] V C . s c [ 2 v 5 8 4 8 0 . 4 0 3 2 : v\n\narXiv\n\ni\n\nX\n\nr\n\na\n\nVisual Instruction Tuning\n\nHaotian Liu1∗, Chunyuan Li2∗, Qingyang Wu3, Yong Jae Lee1\n\n1University of Wisconsin–Madison 2Microsoft Research 3Columbia University https://llava-vl.github.io\n\nAbstract\n\nInstruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we in- troduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general- purpose visual and language understanding. To facilitate future research on visual instruction following, we construct two evaluation bench

In [20]:
retriever.invoke("Images / figures with playful and creative examples")

'B More Results\n\nWe present more qualitative results of LLaVA to analyze its emergent behaviors and observed weaknesses. For more quantitative results of LLaVA on academic benchmarks, please refer to the improved baselines with visual instruction tuning [32]. In Table 9, LLaVA demonstrates a similar behavior as GPT-4 in another example from its paper. Similar to the GPT-4 live demo by OpenAI, LLaVA is capable of generating the HTML/JS/CSS code for an interactive joke website based on a simplified user input sketch in Fig. 2, despite a minor error. As shown in Fig. 3, LLaVA can follow user’s instructions in a conversational style and provide detailed responses or creative writings. Furthermore, LLaVA is able to relate the visual content to the textual knowledge from the pretrained LLM, as demonstrated in Fig. 4 and Fig. 5.\n\nOne interesting emergent behavior of LLaVA is that it is able to understand visual contents that are not covered in the training. For example, in Fig. 6, it is a

In [21]:
from langchain_core.runnables import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Option 1: LLM
model = ChatOllama(model="tinyllama:1.1b-chat")
# Option 2: Multi-modal LLM
# model = LLaVA

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
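
In the chain above the retriever output feeds the `{context}` slot directly. Assuming the list of retrieved chunks is substituted as-is, it would appear in the prompt as a Python list repr with brackets and quotes. The sketch below shows one way to join the chunks into readable context first; `format_context` and `build_prompt` are hypothetical helpers, and plain `str.format` stands in for the prompt template.

```python
# Same wording as the prompt template in the cell above.
TEMPLATE = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""


def format_context(chunks, separator="\n\n---\n\n"):
    """Join non-empty retrieved chunks into one block of context."""
    cleaned = (str(chunk).strip() for chunk in chunks)
    return separator.join(text for text in cleaned if text)


def build_prompt(question, chunks):
    return TEMPLATE.format(context=format_context(chunks), question=question)


demo_retrieved = [
    "Table 1: accuracy by subject",
    "   ",
    "LLaVA connects a vision encoder and an LLM.",
]
demo_prompt = build_prompt("What does LLaVA connect?", demo_retrieved)
# For comparison: the list substituted without formatting.
raw_prompt = TEMPLATE.format(context=demo_retrieved, question="What does LLaVA connect?")
print(demo_prompt)
```

In the LangChain pipeline, a function like this could likely be placed right after the retriever, for example `{"context": retriever | format_context, "question": RunnablePassthrough()}`; treat that wiring as an untested suggestion.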

In [22]:
chain.invoke(
    "What is the performance of LLaVa across multiple image domains / subjects?"
)

"The study conducted by Gao et al. Compares and evaluates how well different language models (including text-only GPTAI, multi-modal chain-of-thoughtt (MM-CoT) and deep learning models such as LLaVA with a visual feature before the last layer) respond to multiple image domains, including scienceQA and BLIP-2.\n\nThe study finds that all three models can follow a user instruction to provide an appropriate answer for multimodal data based on their respective image domain knowledge. The GLVA model also performs better than GPTAI in terms of accuracy, especially in questions related to science subjects. LLaVa with an added multi-modal chain-of-thoughtt (MM-CoT) outperforms both GPTAI and GLVA models on this dataset by 7.52%, achieving a new SoTA score of 90.96% compared to 90.14%. The study also finds that combining GPTAI with the visual feature from an external model, such as LLaVA with MM-CoT, can improve the accuracy by about 7%, indicating the value of using multiple models and knowled

In [23]:
chain.invoke(
    "Explain any images / figures in the paper with playful and creative examples."
)

"The image features a blue circle with a white outline, which represents a person's head. Inside the circle, there is a speech bubble with a message inside. The overall design of the image is visually appealingly and easy to understand. The message is written in a playful and creative manner, possibly relating to a joke or a conversation between the person and someone else. The blue circle with white outline appears on a white background creating a clean and visually appealing design."

In [24]:
chain.invoke(
    "Explain any images / figures in the paper with playful and creative examples."
)

"Sure! Here's an example of a handwritten note on a piece of paper with playful and creative examples:\n\n[Handwritten note on a piece of paper]\n\nMy Joke Website\n\nFunny Joke\n\nPush to Revial Puccline?"