### Semi-structured RAG (Tables and Text)

Many documents mix text and tables. Conventional RAG can struggle with such semi-structured data for two main reasons: text splitting can break tables apart and corrupt what is retrieved, and embedding raw tables makes semantic similarity search less reliable.
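The first failure mode is easy to reproduce on a toy input. The sketch below is only an illustration: `split_by_chars` is a hypothetical fixed-size splitter written for this example (not a LangChain or Unstructured function), and the small table is made-up data.

```python
def split_by_chars(text: str, chunk_size: int) -> list[str]:
    """Naive splitter: cut every `chunk_size` characters, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical document: one sentence followed by a small table.
document = (
    "Results are listed below.\n"
    "model | layers | heads\n"
    "A | 32 | 32\n"
    "B | 40 | 40\n"
)

chunks = split_by_chars(document, chunk_size=40)
for n, chunk in enumerate(chunks):
    print(f"--- chunk {n} ---")
    print(chunk)

# A table row survives only if a single chunk contains the whole line.
rows = [line for line in document.splitlines() if "|" in line]
broken_rows = [row for row in rows if not any(row in chunk for chunk in chunks)]
print("Rows cut by the splitter:", broken_rows)
```

Here the header row is cut in half, so the chunk holding the data rows no longer carries the column names. Partitioning the PDF into whole table elements, as the cells below do, avoids this.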

In [2]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf



In [3]:
# Get elements
raw_pdf_elements = partition_pdf(
    filename="/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/llama.pdf",
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard limit of 4000 chars per chunk
    # Attempt to start a new chunk after 3800 chars
    # Attempt to merge text blocks shorter than 2000 chars into one chunk
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path="/home/heliya/Desktop/rag_approaches/src/rag_approaches/images",
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# Set of the distinct element type names seen above
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 39,
 "<class 'unstructured.documents.elements.Table'>": 16,
 "<class 'unstructured.documents.elements.TableChunk'>": 3}

In [6]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
# Note: the "Table" substring check also matches TableChunk elements,
# so both are stored as type="table"
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))


# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

19
39
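The table count of 19 is the sum of the 16 `Table` and 3 `TableChunk` elements reported earlier, because the check is a substring match on the type name. The short sketch below replays that check on the type strings copied from the counts output above; it does not import `unstructured`, and `needle` is just a name chosen for this example.

```python
# Type names and counts copied from the category_counts output above.
category_counts = {
    "<class 'unstructured.documents.elements.CompositeElement'>": 39,
    "<class 'unstructured.documents.elements.Table'>": 16,
    "<class 'unstructured.documents.elements.TableChunk'>": 3,
}

# Same substring test as in the categorization loop.
needle = "unstructured.documents.elements.Table"
table_total = sum(n for name, n in category_counts.items() if needle in name)

text_needle = "unstructured.documents.elements.CompositeElement"
text_total = sum(n for name, n in category_counts.items() if text_needle in name)

print(table_total)
print(text_total)
```

If `TableChunk` pieces should be handled differently from whole tables, the type names would need to be compared exactly rather than by substring.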


In [7]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

In [13]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [9]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

In [14]:
# Apply to texts
texts = [i.text for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

In [16]:
table_summaries

['The table provides information about different datasets and their properties. The CommonCrawl dataset has the highest sampling proportion at 67.0% and the largest disk size at 3.3 TB. The StackExchange dataset has the lowest sampling proportion at 2.0% and a disk size of 78 GB. The other datasets, C4, Github, Wikipedia, Books, and ArXiv, have sampling proportions ranging from 15.0% to 2.5% and disk sizes ranging from 783 GB to 83 GB.',
 'The table presents data on four different models with varying dimensions, heads, layers, and learning rates. The first model has a dimension of 4096, 32 heads, 32 layers, and a learning rate of 3.0e−4. The second model has a dimension of 5120, 40 heads, 40 layers, and the same learning rate as the first. The third model has a dimension of 6656, 52 heads, 60 layers, and a learning rate of 1.5e−4. The fourth model has a dimension of 8192, 64 heads, 80 layers, and the same learning rate as the third. All models have 4M parameters and a model size rangin

In [17]:
import uuid
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Initialize Chroma vector store with a single collection
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

# Add summarized texts to Chroma
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={"doc_id": doc_ids[i], "type": "text"})
    for i, s in enumerate(text_summaries)
]
vectorstore.add_documents(summary_texts)

# Add summarized tables to Chroma
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={"doc_id": table_ids[i], "type": "table"})
    for i, s in enumerate(table_summaries)
]
vectorstore.add_documents(summary_tables)

# Retrieve and display the original (unsummarized) content for each matching summary
def retrieve_original_content(query, vectorstore):
    # Run the similarity search over the summaries
    results = vectorstore.similarity_search(query)

    # Map each hit back to its source element via doc_id
    for result in results:
        doc_type = result.metadata.get("type")
        doc_id = result.metadata.get("doc_id")
        if doc_type == "text":
            original_content = texts[doc_ids.index(doc_id)]
        elif doc_type == "table":
            original_content = tables[table_ids.index(doc_id)]
        else:
            # Skip hits without a recognized type
            continue

        print(f"Original {doc_type.capitalize()} Content:")
        print(original_content)
        print("-" * 50)
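
The pattern in this cell is to search over summaries but return the original chunk linked by `doc_id`. A minimal, library-free sketch of that idea follows. Everything in it is an assumption made for illustration: the placeholder texts, the `docstore` dictionary, and the word-overlap score, which stands in for embedding similarity and is not how Chroma ranks results.

```python
import re
import uuid

# Hypothetical stand-ins for raw chunks and their LLM-written summaries.
originals = ["<full table text>", "<full section text>"]
summaries = [
    "Table of model sizes and learning rates.",
    "Section on training speed.",
]

# One id per original; the summary index stores only the id.
ids = [str(uuid.uuid4()) for _ in originals]
docstore = dict(zip(ids, originals))
index = [{"summary": s, "doc_id": i} for s, i in zip(summaries, ids)]


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t)


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank summaries against the query, then return the linked originals."""
    ranked = sorted(index, key=lambda e: overlap(query, e["summary"]), reverse=True)
    return [docstore[e["doc_id"]] for e in ranked[:k]]


print(retrieve("model learning rates"))
```

Keeping the originals in a dictionary keyed by `doc_id` makes each lookup a direct access, whereas `list.index` scans the id list on every hit.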


In [18]:
# Example query to retrieve documents
query = "What is the number of training tokens for LLaMA2?"
retrieve_original_content(query, vectorstore)

Original Text Content:
Table 3: Zero-shot performance on Common Sense Reasoning tasks.

reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022). Moreover, we also over- lap the computation of activations and the commu- nication between GPUs over the network (due to all_reduce operations) as much as possible.

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

3 Main results

We evaluate LLaMA on free-form generation tasks and multiple choice tasks. In the multiple choice tasks, the objective is to select the most appropriate completion among a set of given op- tions, based on a provided context. We select the completion with the highest likelihood given the provided context. We follow Gao et al. (2021) and use the likelihood normalized by the number of characters i