# Using Unstructured with LangChain & AstraDB

In this notebook, we show a basic RAG-style example that uses the Unstructured API to parse a PDF document, stores the parsed content in a vector store (`AstraDB`), and finally runs some basic queries against that store. The notebook is modeled after the quick start notebooks and is meant as a way of getting started with Unstructured, backed by a vector database.

To use Unstructured, you need an API key. Sign up for one here: https://unstructured.io/api-key-hosted. A key will be emailed to you.

### Requirements

In [1]:
# First, install the required dependencies
! pip3 install --quiet ragstack-ai

### Configuration

In [2]:
import os
from getpass import getpass

os.environ["UNSTRUCTURED_API_KEY"] = getpass("Enter your Unstructured API Key:")
os.environ["ASTRA_DB_ENDPOINT"] = input("Enter your Astra DB API Endpoint: ")
os.environ["ASTRA_DB_TOKEN"] = getpass("Enter your Astra DB Token: ")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")
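
Before calling any of the services, it can help to confirm that none of the four settings was left blank. The following is a minimal sketch, not part of the original notebook: `missing_settings` is a hypothetical helper, and only the variable names are taken from the cell above.

```python
import os

# Variable names taken from the configuration cell above.
REQUIRED_VARS = (
    "UNSTRUCTURED_API_KEY",
    "ASTRA_DB_ENDPOINT",
    "ASTRA_DB_TOKEN",
    "OPENAI_API_KEY",
)


def missing_settings(env=None):
    """Return the names of required settings that are unset or blank.

    `env` defaults to os.environ; passing a plain dict makes the
    helper easy to try out without touching real credentials.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_settings()` right after the configuration cell should return an empty list when everything was entered; any names it returns are the prompts that were skipped.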

### Using the Unstructured API to parse a PDF

In this example notebook, we focus on pages 9 and 10 of the paper "Attention Is All You Need", available at https://arxiv.org/pdf/1706.03762.pdf, to limit API usage.

#### Simple Parsing

First we will start with the most basic parsing mode. This works well if your document doesn't contain any complex formatting or tables.

In [3]:
from langchain_community.document_loaders import UnstructuredAPIFileLoader
import os

loader = UnstructuredAPIFileLoader(
    file_path="./data/1706.03762.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
)
simple_docs = loader.load()
len(simple_docs)

1

By default, the parser returns one document per PDF file. Let's examine some of the contents of the document:

In [4]:
print(simple_docs[0].page_content[0:400])

3 2 0 2

g u A 2

] L C . s c [

7 v 2 6 7 3 0 . 6 0 7 1 : v i X r a

Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

Attention Is All You Need

Ashish Vaswani∗ Google Brain avaswani@google.com

Noam Shazeer∗ Google Brain noam@google.com

Niki Parmar∗ Google Research nikip


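This sample of the document contents shows how little structure the basic mode preserves: the arXiv identifier printed in the page margin comes back as reversed, space-separated characters, and the title and author list are returned as plain text with nothing to distinguish them.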

#### Advanced Parsing

By changing the processing strategy and response mode, we can get more detailed document structure. Unstructured can break the document into elements of different types, which can be helpful for improving your RAG system.

For example, the `Table` element type includes the table formatted as simple HTML, which can help the LLM answer questions from the table data. We could also exclude elements of type `Footer` from our vector store.

A list of all the different element types can be found here: https://unstructured-io.github.io/unstructured/introduction/overview.html#id1

The returned metadata can also be helpful. For example, it includes the `page_number` within the input PDF and a `parent_id` property that helps define the nesting of text sections.
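
To make the use of element types and metadata concrete, here is a small sketch that drops noisy categories and groups the remaining text by page. It runs without any API call: `StubElement` is a hypothetical stand-in (not the Unstructured element class), the helper names are our own, and the sample texts and page numbers are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StubElement:
    """Hypothetical stand-in for a parsed element (not the Unstructured class)."""
    category: str
    text: str
    page_number: int


def keep_for_indexing(elements, skip=("Header", "Footer")):
    """Drop element categories that would only add noise to a vector store."""
    return [el for el in elements if el.category not in skip]


def text_by_page(elements):
    """Group element text by the page each element came from."""
    pages = defaultdict(list)
    for el in elements:
        pages[el.page_number].append(el.text)
    return dict(pages)


# Made-up sample data, for illustration only.
sample = [
    StubElement("Title", "Section heading", 1),
    StubElement("NarrativeText", "Body paragraph.", 1),
    StubElement("Footer", "1", 1),
    StubElement("Title", "Another heading", 2),
]
kept = keep_for_indexing(sample)
pages = text_by_page(kept)
```

With real elements, the same idea would read `el.category` and `el.metadata.page_number`, the attributes referred to in the text above.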

In [5]:
from langchain_community.document_loaders import unstructured

elements = unstructured.get_elements_from_api(
    file_path="./data/1706.03762.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    strategy="hi_res", # default "auto"
    pdf_infer_table_structure=True,
)

len(elements)

194

Instead of a single document returned from the PDF, we now have 194 elements. Below, we use element type and `parent_id` to show a clearer representation of the document structure.

In [7]:
from IPython.display import display, HTML

parents = {}

for el in elements:
    parents[el.id] = el.text

for el in elements:
    if el.category == "Table":
        display(HTML(el.metadata.text_as_html))
    elif el.metadata.parent_id:
        print(f"parent: '{parents[el.metadata.parent_id]}' content: {el.text}")
    else:
        print(el)

3 2 0 2 g u A 2 ] L C . s c [ 7 v 2 6 7 3 0 . 6 0 7 1 : v i X r a
parent: '3 2 0 2 g u A 2 ] L C . s c [ 7 v 2 6 7 3 0 . 6 0 7 1 : v i X r a' content: Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.
parent: '3 2 0 2 g u A 2 ] L C . s c [ 7 v 2 6 7 3 0 . 6 0 7 1 : v i X r a' content: Attention Is All You Need
parent: 'Attention Is All You Need' content: Ashish Vaswani∗ Google Brain avaswani@google.com
parent: 'Attention Is All You Need' content: Noam Shazeer∗ Google Brain noam@google.com
parent: 'Attention Is All You Need' content: Niki Parmar∗ Google Research nikip@google.com
parent: 'Attention Is All You Need' content: Jakob Uszkoreit∗ Google Research usz@google.com
parent: 'Attention Is All You Need' content: Llion Jones∗ Google Research llion@google.com
parent: 'Attention Is All You Need' content: Aidan N. Gomez∗ † University of Toronto aidan@cs.toronto.edu
p

Layer Type,Complexity per Layer,Sequential Operations,Maximum Path Length
Self-Attention,O(n?-d),0o(1),0o(1)
Recurrent,O(n-d?),O(n),O(n)
Convolutional,O(k-n-d?),()(1),O(logk(n))
Self-Attention (restricted),O(r-n-d),0(1),O(n/r)


3.5 Positional Encoding
parent: '3.5 Positional Encoding' content: Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
parent: '3.5 Positional Encoding' content: In this work, we use sine and cosine functions of different frequencies:
P E(pos,2i) = sin(pos/100002i/dmodel) P E(pos,2i+1) = cos(pos/100002i/dmodel)
where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function

Model,BLEU,BLEU,Training Cost (FLOPs),Training Cost (FLOPs),Unnamed: 5_level_0
Model,Unnamed: 1_level_1,EN-DE,EN-FR,EN-DE,EN-FR
ByteNet [18] 2375,ByteNet [18] 2375,ByteNet [18] 2375,ByteNet [18] 2375,ByteNet [18] 2375,
Deep-Att + PosUnk,,39.2,,1.0-10%,
GNMT + RL [38],24.6,39.92,2.3-10°,1.4-10%,
ConvS2S [9],25.16,40.46,9.6-10®,1.5-10%,
MoE,26.03,40.56,2.0-10°,1.2.10%,
Deep-Att + PosUnk Ensemble,,40.4,,8.0-10%°,
GNMT + RL Ensemble [38],2630,41.16,1.8-10%,1.1-10*,
ConvS2S Ensemble [9],26.36,41.29,7.7-101,1.2.10%,
Transformer (base model),273,38.1,,3.3.10'%,
Transformer (big),28.4,41.8,,2.3-101,


Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of Pdrop = 0.1.
Label Smoothing During training, we employed label smoothing of value ϵls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
6 Results
6.1 Machine Translation
parent: '6.1 Machine Translation' content: On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published mod

Unnamed: 0,N,dwss,dn,b,di,Pug,as,gon,| 08,BORT,PR,Unnamed: 12
base,| 6,512.0,2048,8,64,0.1,1.0,100K,| 492,258.0,65.0,
,,,,1,512,,,,529,249.0,,
,,,,4,128,,,,500,255.0,,
(A),,,,16,32,,,,491,258.0,,
,,,,32,16,,,,501,254.0,,
,,,,,16,,,,516,251.0,58.0,
®),,,,,32,,,,501,254.0,60.0,
©),2,,,,,,,,6.11,237.0,36.0,
©),,4.0,,,,,,,,519.0,253.0,50.0
©),,8.0,,,,,,,,488.0,255.0,80.0


development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.
In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
In Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.
6.3 English Constituency

Parser,Training,WSJ 23 F1
Vinyals & Kaiser el al. (2014),"WSIJ only, discriminative",88.3
Petrov et al. (2006),"WSIJ only, discriminative",90.4
Zhu et al. (2013) [40],"WSIJ only, discriminative",90.4
Dyer et al. (2016),"WSJ only, discriminative",91.7
Transformer (4 layers),"WSIJ only, discriminative",91.3
Zhu et al. (2013) [40],semi-supervised,91.3
Vinyals Transformer (4 layers),semi-supervised semi-supervised,92.7
Luong et al. (2015) \\,multi-task,93.0
Dyer et al. (2016),generative,93.3


increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both WSJ only and the semi-supervised setting.
Our results in Table 4 show that despite the lack of task-specific tuning our model performs sur- prisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley- Parser [29] even when training only on the WSJ training set of 40K sentences.
7 Conclusion
parent: '7 Conclusion' content: In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
parent: '7 Conclusion' content: For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers

Here we clearly see that Unstructured is parsing both table and document structure.
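
The `parent_id` links can also be turned into an indented outline, which is sometimes easier to scan than the flat listing. The sketch below is an assumption-laden illustration rather than part of the notebook: `StubElement`, `depth_of`, and `outline` are hypothetical names, and the stub stores `parent_id` directly instead of under `metadata`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubElement:
    """Hypothetical stand-in; real elements keep parent_id under metadata."""
    id: str
    text: str
    parent_id: Optional[str] = None


def depth_of(el, by_id):
    """Count an element's ancestors by following parent_id links."""
    depth = 0
    seen = {el.id}  # guards against accidental cycles
    while el.parent_id in by_id and el.parent_id not in seen:
        el = by_id[el.parent_id]
        seen.add(el.id)
        depth += 1
    return depth


def outline(elements, width=40):
    """Return one line per element, indented two spaces per nesting level."""
    by_id = {el.id: el for el in elements}
    return ["  " * depth_of(el, by_id) + el.text[:width] for el in elements]


sample = [
    StubElement("a", "7 Conclusion"),
    StubElement("b", "In this work, we presented ...", parent_id="a"),
    StubElement("c", "Orphan text", parent_id="missing"),
]
lines = outline(sample)
```

Elements whose parent is unknown, like the third one in the sample, are simply treated as top-level entries.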

### Storing into Astra DB

Now we will continue with the RAG process by creating embeddings for the PDF content and storing them in Astra DB.

In [None]:
from langchain_community.vectorstores import AstraDB
from langchain_openai import OpenAIEmbeddings

astra_db_store = AstraDB(
    collection_name="langchain_unstructured",
    embedding=OpenAIEmbeddings(),
    token=os.getenv("ASTRA_DB_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_ENDPOINT")
)

We will create LangChain Documents by splitting the text after `Table` elements and before `Title` elements. Additionally, we use the html output format for table data.

In [None]:
from langchain_core.documents import Document

documents = []
current_doc = None

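for el in elements:
    if el.category in ["Header", "Footer"]:
        continue # skip these
    if el.category == "Title" and current_doc is not None:
        # a new section starts: close the document in progress
        documents.append(current_doc)
        current_doc = None
    if current_doc is None:
        current_doc = Document(page_content="", metadata=el.metadata.to_dict())
    current_doc.page_content += el.metadata.text_as_html if el.category == "Table" else el.text
    if el.category == "Table":
        documents.append(current_doc)
        current_doc = None

# keep the trailing document, which no later Title or Table closes
if current_doc is not None:
    documents.append(current_doc)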

astra_db_store.add_documents(documents)

['cc7802558e4b431db3e683a9b6d0b892',
 '5778068445504853b9eba0c7c8b623eb',
 '4bf0439b4e484cc0851473f2838a5a9f',
 'fc479ac7fada4edeb672f309f451df47']
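
The splitting rule (close a chunk after every `Table` and before every `Title`, skipping headers and footers) can be tried out in isolation. This is a minimal sketch under stated assumptions: `StubElement` and `split_into_chunks` are hypothetical names, the `html` field stands in for `metadata.text_as_html`, and unlike the cell above it joins the pieces of a chunk with newlines.

```python
from dataclasses import dataclass


@dataclass
class StubElement:
    """Hypothetical stand-in; `html` plays the role of metadata.text_as_html."""
    category: str
    text: str
    html: str = ""


def split_into_chunks(elements):
    """Group element text into chunks that end after a Table or before a Title."""
    chunks, current = [], []
    for el in elements:
        if el.category in ("Header", "Footer"):
            continue  # page furniture is not indexed
        if el.category == "Title" and current:
            chunks.append("\n".join(current))  # a new section begins
            current = []
        current.append(el.html if el.category == "Table" else el.text)
        if el.category == "Table":
            chunks.append("\n".join(current))  # a table closes its chunk
            current = []
    if current:
        chunks.append("\n".join(current))  # keep whatever is left at the end
    return chunks


# Made-up sample data, for illustration only.
sample = [
    StubElement("Title", "A"),
    StubElement("NarrativeText", "a1"),
    StubElement("Table", "t", html="<table>t</table>"),
    StubElement("NarrativeText", "after"),
    StubElement("Footer", "9"),
    StubElement("Title", "B"),
    StubElement("NarrativeText", "b1"),
]
chunks = split_into_chunks(sample)
```

For this sample the function yields three chunks: the first section together with its table, the paragraph that follows the table, and the second section.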

### Querying

Now that we have populated our vector store, we will build a RAG pipeline and execute some queries.

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = """
Answer the question based only on the supplied context. If you don't know the answer, say "I don't know".
Context: {context}
Question: {question}
Your answer:
"""

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", streaming=False, temperature=0)

chain = (
    {"context": astra_db_store.as_retriever(), "question": RunnablePassthrough()}
    | PromptTemplate.from_template(prompt)
    | llm
    | StrOutputParser()
)
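
In the chain above, the retriever's output is a list of documents, and as far as we can tell it is placed into `{context}` as that list's string form, metadata included. A common refinement is to join only the page contents first. The sketch below shows such a formatter on a hypothetical `StubDocument`; the name `format_docs` is our own choice, not something the notebook defines.

```python
from dataclasses import dataclass


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing only the page_content attribute."""
    page_content: str


def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into a single context string."""
    return separator.join(doc.page_content for doc in docs)


context = format_docs([StubDocument("first chunk"), StubDocument("second chunk")])
```

Because LangChain coerces plain functions into runnables when they are piped, the context entry could likely be written as `astra_db_store.as_retriever() | format_docs`; treat that as a suggestion to verify against your installed version rather than as the notebook's own approach.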

First we can ask a question about some text in the document:

In [None]:
chain.invoke("What does reducing the attention key size do?")

'Reducing the attention key size hurts model quality.'

Next we can try to get a value from the English constituency parsing results table (Table 4 in the paper):

In [None]:
chain.invoke("For the transformer to English constituency results, what was the 'WSJ 23 F1' value for 'Dyer et al. (2016) (5]'?")

"The 'WSJ 23 F1' value for 'Dyer et al. (2016) (5]' was 91.7."

Finally, we can ask a question whose answer is not in our content, to confirm that the LLM declines to answer when the context does not support one.

In [None]:
# Query fails to be answered due to lack of context in Astra DB
chain.invoke("When was George Washington born?")

"I don't know. The context does not provide any information about George Washington's birthdate."