In [1]:
import pandas as pd
from lxml import html
from pydantic import BaseModel
from typing import Any, Optional
from unstructured.partition.pdf import partition_pdf

## Data Loading

### Partition PDF tables, text, and images with Unstructured

* Test on the `Llama 2` paper: https://arxiv.org/pdf/2307.09288.pdf
* Use `chunking_strategy="by_title"`, which rolls up subsequent non-Table elements under a Title into a CompositeElement
* Also supports [image extraction](https://github.com/Unstructured-IO/unstructured/pull/1371)
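
To make the "roll up under a Title" idea concrete, here is a toy sketch in plain Python. It is not Unstructured's implementation (which has further options, such as size limits); the `Item` class and `chunk_by_title` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    kind: str  # e.g. "Title", "NarrativeText", "Table" (illustrative labels)
    text: str


def chunk_by_title(items):
    """Toy grouping: a Title starts a new composite chunk; Tables stay separate."""
    chunks = []
    current = []  # texts collected for the composite chunk being built

    def flush():
        # Emit the pending composite chunk, if any
        if current:
            chunks.append(("CompositeElement", "\n\n".join(current)))
            current.clear()

    for item in items:
        if item.kind == "Table":
            flush()
            chunks.append(("Table", item.text))
        elif item.kind == "Title":
            flush()
            current.append(item.text)
        else:
            current.append(item.text)
    flush()
    return chunks


items = [
    Item("Title", "<title A>"),
    Item("NarrativeText", "<paragraph under A>"),
    Item("Table", "<table text>"),
    Item("Title", "<title B>"),
    Item("NarrativeText", "<paragraph under B>"),
]
chunks = chunk_by_title(items)
```

Each title and the non-table elements that follow it end up in one composite chunk, while the table is kept as its own element.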

In [3]:
img_path = "/Users/rlm/Desktop/Papers/"

In [4]:
# Get elements
raw_pdf_elements = partition_pdf(filename="/Users/rlm/Desktop/Papers/2307.09288.pdf",
                                 chunking_strategy="by_title",
                                 extract_images_in_pdf=True,
                                 infer_table_structure=True,
                                 image_output_dir_path=img_path)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
# Elements
unique_categories = {str(type(element)) for element in raw_pdf_elements}
unique_categories

{"<class 'unstructured.documents.elements.CompositeElement'>",
 "<class 'unstructured.documents.elements.Table'>",
 "<class 'unstructured.documents.elements.TableChunk'>"}

In [7]:
class Element(BaseModel):
    type: str
    text: Any

# Categorize by type
# Note: the "...elements.Table" substring also matches TableChunk, so table
# chunks are labeled "table" here and the final branch catches nothing.
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))
    else:
        categorized_elements.append(Element(type="table-chunk", text=str(element)))


# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

# Table Chunk
table_chunk_elements = [e for e in categorized_elements if e.type == "table-chunk"]
print(len(table_chunk_elements))

100
414
0
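
The `"...elements.Table" in str(type(element))` test above is a substring match, so it is also true for `TableChunk`. This is consistent with the table-chunk count of 0. Below is a minimal sketch of an exact-name comparison. The three classes are stand-ins defined here so the example runs on its own (they are not imported from Unstructured), and `categorize` is a hypothetical helper.

```python
# Stand-in classes so the sketch is self-contained (assumption: only the
# class names matter for this illustration).
class CompositeElement:
    pass


class Table:
    pass


class TableChunk:
    pass


def categorize(element):
    """Label an element by its exact class name rather than by substring."""
    name = type(element).__name__
    if name == "Table":
        return "table"
    if name == "TableChunk":
        return "table-chunk"
    if name == "CompositeElement":
        return "text"
    return "other"


elements = [CompositeElement(), Table(), TableChunk()]
labels = [categorize(e) for e in elements]

# For comparison: a substring test cannot tell Table and TableChunk apart
substring_hits = ["Table" in type(e).__name__ for e in elements]
```

With exact names, `TableChunk` elements get their own label and could be handled separately from whole tables.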


In [8]:
# *** Placeholder ***
# Ideally we get this from Unstructured parsing
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=4000,
    chunk_overlap=200,
)
texts = [i.text for i in text_elements]
all_text_concat = "".join(texts)
docs = text_splitter.split_text(all_text_concat)
len(docs)

20

## Multi-vector retriever

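Use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to index summaries for search while returning the raw text or table each summary was generated from.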

### Text and Table summaries

In [9]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

In [10]:
# Prompt 
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text) 

# Summary chain 
model = ChatOpenAI(temperature=0,model="gpt-4")
summarize_chain = {"element": lambda x:x} | prompt | model | StrOutputParser()

In [None]:
# Apply to text
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})



In [21]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})



### Image summaries 

* API: Use `GPT-4V`
* OSS: Use [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436) with [LLaVA or similar](https://huggingface.co/mys/ggml_llava-v1.5-7b/tree/main).

In [None]:
! pip install opencv-python

import os
import cv2

# Load every image that partition_pdf wrote to the output directory
images = []
for filename in os.listdir(img_path):
    if filename.endswith((".jpg", ".png")):
        image_path = os.path.join(img_path, filename)
        image = cv2.imread(image_path)
        images.append(image)

In [None]:
### Image -> Text summaries 
### Store Text summaries as shown below 
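
One possible first step for the image-to-text placeholder is to encode each image file as a base64 data URL, which multimodal chat APIs commonly accept. The sketch below uses only the standard library and makes no API call. The helper names and the content-part layout in `build_image_message` are assumptions modeled on OpenAI-style messages, so adapt them to whichever model you use.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Read an image file and return it as a base64 data URL."""
    mime, _ = mimetypes.guess_type(str(path))
    mime = mime or "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_message(path, instruction="Summarize this image for retrieval."):
    """Assumed message layout: one text part plus one image part."""
    return [
        {"type": "text", "text": instruction},
        {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
    ]
```

The resulting summaries could then be stored alongside the text and table summaries, keyed by the same kind of `doc_id`.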

### Add to vectorstore

Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries.

In [16]:
import uuid
from langchain.vectorstores import Chroma
from langchain.storage import InMemoryStore
from langchain.schema.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=OpenAIEmbeddings()
)

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore, 
    docstore=store, 
    id_key=id_key,
)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [Document(page_content=s,metadata={id_key: doc_ids[i]}) for i, s in enumerate(text_summaries)]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [Document(page_content=s,metadata={id_key: table_ids[i]}) for i, s in enumerate(table_summaries)]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
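
The pattern in the cell above is: search over summaries, then follow the shared `doc_id` back to the raw content. The toy sketch below shows that lookup in plain Python. It replaces embeddings with a simple word-overlap score, and `ToyMultiVectorIndex` is a hypothetical class for illustration, not how `MultiVectorRetriever` is implemented.

```python
import uuid


def tokenize(text):
    """Lowercase whitespace tokenization; a stand-in for real embeddings."""
    return set(text.lower().split())


class ToyMultiVectorIndex:
    def __init__(self):
        self.summaries = []  # list of (summary_text, doc_id): the searchable side
        self.docstore = {}   # doc_id -> raw content: what gets returned

    def add(self, raw_items, summaries):
        for raw, summary in zip(raw_items, summaries):
            doc_id = str(uuid.uuid4())
            self.summaries.append((summary, doc_id))
            self.docstore[doc_id] = raw

    def retrieve(self, query, k=1):
        # Rank summaries by word overlap with the query, then return the
        # raw parents of the top-k summaries.
        q = tokenize(query)
        ranked = sorted(
            self.summaries,
            key=lambda pair: len(q & tokenize(pair[0])),
            reverse=True,
        )
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


index = ToyMultiVectorIndex()
index.add(
    raw_items=["<raw table text>", "<raw paragraph text>"],
    summaries=[
        "table of model sizes and training tokens",
        "paragraph about the tokenizer",
    ],
)
result = index.retrieve("how many training tokens")
```

The query matches the table's summary, but the caller receives the raw table text, which is what is later passed to the LLM as context.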

### Sanity Check

The first table of the paper presents the `Llama 2 family of models`, with Params, Context Length, etc.

We correctly extract this:

In [28]:
tables[0]

'Training Data Params Context GQA Tokens LR Length 7B 2k 1.0T 3.0x 10-4 See Touvron et al. 13B 2k 1.0T 3.0 x 10-4 LiaMa 1 (2023) 33B 2k 14T 1.5 x 10-4 65B 2k 1.4T 1.5 x 10-4 7B 4k 2.0T 3.0x 10-4 Liama 2 A new mix of publicly 13B 4k 2.0T 3.0 x 10-4 available online data 34B 4k v 2.0T 1.5 x 10-4 70B 4k v 2.0T 1.5 x 10-4'

In [29]:
table_summaries[0]

'The table presents different training data parameters for various models. The parameters include the number of tokens (ranging from 7B to 70B), context (2k or 4k), GQA (from 1.0T to 14T), and learning rate (LR) (from 1.5 x 10-4 to 3.0 x 10-4). Some models are referenced, such as Touvron et al., LiaMa 1, LiaMa 2, and a new mix of publicly available online data.'

In [31]:
# We can retrieve this table
retriever.get_relevant_documents("What is the number of training tokens for LLaMA2?")

['Train PPL\n\n2.2 2.1 2.0 19 18 17 1.6   15 14 0 250 500 750 1000 1250 1500 1750 2000 Processed Tokens\n\n(Billions)\n\nFigure 5: Training Loss for Llama 2 models. We compare the training loss of the Llama 2 family of models. We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation.\n\nTokenizer. We use the same tokenizer as Llama 1; it employs a bytepair encoding (BPE) algorithm (Sennrich et al., 2016) using the implementation from SentencePiece (Kudo and Richardson, 2018). As with Llama 1, we split all numbers into individual digits and use bytes to decompose unknown UTF-8 characters. The total vocabulary size is 32k tokens.',
 'Training Data Params Context GQA Tokens LR Length 7B 2k 1.0T 3.0x 10-4 See Touvron et al. 13B 2k 1.0T 3.0 x 10-4 LiaMa 1 (2023) 33B 2k 14T 1.5 x 10-4 65B 2k 1.4T 1.5 x 10-4 7B 4k 2.0T 3.0x 10-4 Liama 2 A new mix of publicly 13B 4k 2.0T 3.0 x 10-4 available online data 34B 4k v 2.0T 1.5 x 10-4 70B 4k v 2.0T 1.5 x 10

## RAG

Run a [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval) on top of the retriever.

In [33]:
from operator import itemgetter
from langchain.schema.runnable import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
model = ChatOpenAI(temperature=0,model="gpt-4")

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()} 
    | prompt 
    | model 
    | StrOutputParser()
)
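
As a rough mental model of the chain above, the leading dict sends the same input to both branches. The retriever receives the question and fills `context`, while `RunnablePassthrough` forwards the question unchanged. The plain-Python sketch below mirrors those steps with stand-in functions. `run_rag`, `fake_retrieve`, and `echo_llm` are hypothetical names, and no LangChain or OpenAI call is made.

```python
TEMPLATE = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""


def run_rag(question, retrieve, llm):
    # Step 1: fan the single input out into the prompt variables
    inputs = {"context": retrieve(question), "question": question}
    # Step 2: fill the prompt template
    prompt_text = TEMPLATE.format(**inputs)
    # Step 3: call the model and return its text output
    return llm(prompt_text)


def fake_retrieve(question):
    # Stand-in for the multi-vector retriever: returns raw chunks
    return ["<retrieved chunk 1>", "<retrieved chunk 2>"]


def echo_llm(prompt_text):
    # Stand-in for the chat model: returns the prompt so it can be inspected
    return prompt_text


rendered = run_rag("<user question>", fake_retrieve, echo_llm)
```

Printing `rendered` shows exactly what the model would see, which is a quick way to confirm that the retrieved table text reaches the prompt.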

In [34]:
chain.invoke("What is the number of training tokens for LLaMA2?")

'The number of training tokens for LLaMA2 is 2 trillion.'

We can check the [trace](https://smith.langchain.com/public/322fd162-845b-4f82-a15c-898e94551967/r) to see which chunks were retrieved.

The retrieved chunks include our table:

```
Training Data Params Context GQA Tokens LR Length 7B 2k 1.0T 3.0x 10-4 See Touvron et al. 13B 2k 1.0T 3.0 x 10-4 LiaMa 1 (2023) 33B 2k 14T 1.5 x 10-4 65B 2k 1.4T 1.5 x 10-4 7B 4k 2.0T 3.0x 10-4 Liama 2 A new mix of publicly 13B 4k 2.0T 3.0 x 10-4 available online data 34B 4k v 2.0T 1.5 x 10-4 70B 4k v 2.0T 1.5 x 10-4
```