# Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search
### Install the required dependencies:

In [3]:
# Use %pip in Jupyter
%pip install -q cassio datasets langchain openai tiktoken

Note: you may need to restart the kernel to use updated packages.


#### Pre-requisites:

Set up a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to serve as the vector store. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you need a DB Token with the _Database Administrator_ role and your Database ID. These are the connection parameters you will need later.

If you don't already have one, request an OpenAI API key: [OpenAI API Key](https://cassio.org/start_here/#llm-access)

#### Next:
- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Run a question-answering loop that retrieves the relevant text chunks from the PDF and has an LLM construct the answer.

Import the packages:

In [1]:
%pip install -q -U langchain-community

# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings




In [2]:
# CassIO is the engine powering the Astra DB integration in LangChain;
# PyPDF2 is used to read the PDF file.
%pip install -q cassio PyPDF2
from PyPDF2 import PdfReader



In [3]:
%pip install -q mercury
import mercury as mr

# Preview the PDF inline; replace the path with the location of your own file.
# reportlab's canvas.Canvas starts a *new* PDF at the given path (and would
# overwrite the file if saved), so it is not needed to display an existing one.
mr.PDF(r"C:\Users\wangs\OneDrive\Desktop\TRUST_DOC.pdf")






### Setup

#### Provide key/secrets:

Replace the following placeholders with your Astra DB connection details and your OpenAI API key. Everyone's secrets are different. Remember to remove them from the notebook after use, and never commit or share a notebook that contains real values.

In [4]:
# Do not publicize these secrets: anyone who has them can bill usage to your account
# Astra DB (DataStax) and OpenAI credentials

ASTRA_DB_APPLICATION_TOKEN = "AstraCS:..."  # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "your-database-id"  # enter your Database ID
OPENAI_API_KEY = "sk-..."  # enter your OpenAI API key
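
Hard-coded secrets are easy to leak when a notebook is shared. A minimal sketch of one alternative, assuming the values have been exported as environment variables under the same names as the variables above. The helper `read_secret` is hypothetical and is not part of CassIO or LangChain:

```python
import os
from getpass import getpass


def read_secret(name, prompt_if_missing=True):
    """Return the secret called `name` from the environment.

    Falls back to an interactive, non-echoing prompt when the variable
    is unset (hypothetical helper, for illustration only).
    """
    value = os.environ.get(name, "").strip()
    if not value and prompt_if_missing:
        value = getpass(f"Enter {name}: ").strip()
    if not value:
        raise ValueError(f"Missing required secret: {name}")
    return value


# Assumed usage (commented out so that the cell never prompts unexpectedly):
# ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")
# ASTRA_DB_ID = read_secret("ASTRA_DB_ID")
# OPENAI_API_KEY = read_secret("OPENAI_API_KEY")
```

With this approach the notebook itself contains no credentials, so it can be shared without first scrubbing the cell.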

In [5]:
# provide the path of the Trust Document PDF file
pdfreader = PdfReader(r'C:\Users\wangs\OneDrive\Desktop\TRUST_DOC.pdf')


In [6]:
# Read the PDF page by page and concatenate the extracted text
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content

### Display the raw text below:

In [7]:
raw_text

'SAMPLE OF A REVOCABLE TRUST \nby \nKarin Sloan DeLaney,\n Esq.  \nBaldwinsville  NY\n113114DECLARATION OF TRUST  1\nJOHN CLIENT TRUST 2 \nTHIS DECLARATION , made the _______ day of November, 2015 by JOHN H. \nCLIENT , of 123 Main St., Syracuse, NY 13202 (hereinafte r referred to as "Grantor" and "Trustee"); \nW I T N E S S E T H :  \n1. TRUST PROPERTY.   The Grantor has this day delivered the property described in\nSchedule "A", attached hereto, to the Trustee and does hereby transfer ownership of such property .3\nThe Trustee agrees to act as Trustee of such as sets and to hold, administer and distribute the \nproperty, together with all additions thereto and all reinvestments thereof, as the principal of a trust \nestate for the benefit of Grantor in accordance wi th the terms and provision\ns hereinafter set out. \n1 Since the abolishment of the merger doctrine, an individual may create a trust with his \nown assets and act as sole Trustee.  If the document establishing such an ent

In [8]:
# Basic text statistics
num_of_chars = len(raw_text)  # len() of a string counts characters, not words
num_new_lines = raw_text.count("\n")
period_counts = raw_text.count(".")
print("Number of characters:", num_of_chars, "\n"
      "Number of new lines:", num_new_lines, "\n"
      "Number of periods:", period_counts)

Number of characters: 31751 
Number of new lines: 444 
Number of periods: 194
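
Note that `len(raw_text)` measures the length of the string in characters. If a word count is wanted, a rough estimate can be obtained by splitting on whitespace. The sketch below uses a hypothetical helper, `text_stats`, and a short sample string instead of the full document. Because PDF extraction sometimes breaks words apart (for example "hereinafte r"), the word count should be treated as approximate:

```python
def text_stats(text):
    """Return basic size statistics for a string."""
    return {
        "characters": len(text),
        # split() with no argument splits on any run of whitespace
        "words": len(text.split()),
        "new_lines": text.count("\n"),
        "periods": text.count("."),
    }


sample = "THIS DECLARATION , made the day of November, 2015 by JOHN H. \nCLIENT"
for key, value in text_stats(sample).items():
    print(f"{key}: {value}")
```

Calling `text_stats(raw_text)` would report the same statistics for the whole document.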


Initialize the connection to your database:

In [None]:
import cassio

cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)


Create the LangChain embedding and LLM objects for later usage:

In [None]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

Create your LangChain vector store ... backed by Astra DB!

In [None]:
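astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,   # None: use the connection set up by cassio.init()
    keyspace=None,  # None: use the default keyspace set up by cassio.init()
)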

In [None]:
from langchain.text_splitter import CharacterTextSplitter

# Split the text into chunks small enough to stay within the model's token limits
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,     # maximum characters per chunk (alternative: 1000)
    chunk_overlap=200,  # characters shared by neighboring chunks, so text cut at a boundary keeps its context
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
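
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that cuts fixed-width character windows, each starting `chunk_size - chunk_overlap` characters after the previous one. This is not how `CharacterTextSplitter` works internally (it splits on the separator and merges the pieces); the hypothetical `sliding_chunks` function only illustrates the idea of overlapping chunks:

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that overlap
    their predecessor by `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


for chunk in sliding_chunks("The Trustee agrees to act as Trustee of such assets", 20, 5):
    print(repr(chunk))
```

A larger overlap means more repeated text in the vector store, but a lower chance that a clause is split in a way that hides its meaning from the retriever.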

In [None]:
texts[:50]

['SAMPLE OF A REVOCABLE TRUST \nby \nKarin Sloan DeLaney,\n Esq.  \nBaldwinsville  NY\n113114DECLARATION OF TRUST  1\nJOHN CLIENT TRUST 2 \nTHIS DECLARATION , made the _______ day of November, 2015 by JOHN H. \nCLIENT , of 123 Main St., Syracuse, NY 13202 (hereinafte r referred to as "Grantor" and "Trustee"); \nW I T N E S S E T H :  \n1. TRUST PROPERTY.   The Grantor has this day delivered the property described in\nSchedule "A", attached hereto, to the Trustee and does hereby transfer ownership of such property .3\nThe Trustee agrees to act as Trustee of such as sets and to hold, administer and distribute the \nproperty, together with all additions thereto and all reinvestments thereof, as the principal of a trust \nestate for the benefit of Grantor in accordance wi th the terms and provision',
 'property, together with all additions thereto and all reinvestments thereof, as the principal of a trust \nestate for the benefit of Grantor in accordance wi th the terms and provision\ns he

### Load the dataset into the vector store



In [None]:
astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 chunks.
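
Running the insert cell more than once typically stores the same chunks again under new ids, which can later surface as duplicated search results. One way to make the insert repeatable is to derive a stable id from each chunk's content. The sketch below assumes that the vector store's `add_texts` accepts an `ids` argument (check this against your installed LangChain version); `chunk_id` is a hypothetical helper:

```python
import hashlib


def chunk_id(text):
    """Return a stable hex id derived from the chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


chunks = ["first chunk", "second chunk", "first chunk"]
ids = [chunk_id(c) for c in chunks]

# Identical chunks map to identical ids, so a repeated insert would
# overwrite the existing row instead of adding a duplicate.
print(ids[0] == ids[2])

# Assumed usage, not executed here:
# astra_vector_store.add_texts(texts[:50], ids=[chunk_id(t) for t in texts[:50]])
```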


### Run the QA cycle

Run the cells and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _For what purposes may the Trustee pay from the trust principal?_
- _Who may exercise the Grantor's right to withdraw principal from the trust?_


In [None]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=10):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


QUESTION: "I am a Trust Officer. John Client is no more. Donna Client has request $800,000 in order to purchase a new home. Should I approve this distribution? Can you read the document and help me make the decision according to the clauses of the Trust?"




ANSWER: "Based on the language in the document, it appears that the Trustee has the discretion to pay from the trust principal for the support, maintenance, education, and comfort of the Grantor. Additionally, the Grantor has the right to withdraw all or any part of the principal upon written notice to the Trustee. However, this right of withdrawal is personal to the Grantor and cannot be exercised by any other party, including an attorney-in-fact, guardian, conservator, or committee. 

Therefore, it is ultimately up to the Trustee to decide whether to approve the distribution for the purchase of a new home for Donna Client. The Trustee should consider whether this distribution aligns with the purposes outlined in the document, and may also want to review the financial status of the trust to ensure that it can support the distribution without negatively impacting the other beneficiaries or the overall purpose of the trust."

FIRST DOCUMENTS BY RELEVANCE:




    [0.9077] "3. DISPOSITION.  (A)  The Trustee may accumulate, or pay or apply the income of the  ..."
    [0.9077] "3. DISPOSITION.  (A)  The Trustee may accumulate, or pay or apply the income of the  ..."
    [0.9058] "115  
Grantor or his attorney-in-fact may add proper ty to the principal of this Tru ..."
    [0.9058] "115  
Grantor or his attorney-in-fact may add proper ty to the principal of this Tru ..."
    [0.9048] "SAMPLE OF A REVOCABLE TRUST 
by 
Karin Sloan DeLaney,
 Esq.  
Baldwinsville  NY
1131 ..."
    [0.9047] "SAMPLE OF A REVOCABLE TRUST 
by 
Karin Sloan DeLaney,
 Esq.  
Baldwinsville  NY
1131 ..."
    [0.9028] "to prevent abuse and an unwa nted battle of fiduciaries.   
 
10  See:  In the Matte ..."
    [0.9027] "to prevent abuse and an unwa nted battle of fiduciaries.   
 
10  See:  In the Matte ..."
    [0.9024] "furtherance of the interests of the beneficiaries hereunder; and to receive and reta ..."
    [0.9024] "furtherance of the interests of the beneficiaries 