# Querying a PDF With Astra DB and LangChain
### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As described in the [quickstart guide](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), create a DB Token with the _Database Administrator_ role and copy your Database ID. You will need both connection parameters shortly.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, extract the PDF text, create the LangChain vector store;
- Run a Question-Answering loop that retrieves the relevant text chunks and has an LLM construct the answer.

## Import Dependencies

In [19]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

# YAML parsing for the credentials file and PDF text extraction
import yaml
from PyPDF2 import PdfReader

### Astra credentials:
Provide `ASTRA_DB_APPLICATION_TOKEN`, `ASTRA_DB_ID`, and `OPENAI_API_KEY` in a YAML file at `/tmp/langchain-conf.yml`, structured as follows:
  ```yaml
  OpenAI:
    ASTRA_DB_APPLICATION_TOKEN: <ASTRA_DB_APPLICATION_TOKEN>
    ASTRA_DB_ID: <ASTRA_DB_ID>
    OPENAI_API_KEY: <OPENAI_API_KEY>
    API_Endpoint: <API_Endpoint>
  ```

In [25]:
# Specify the path to your YAML file
OpenAI_conf = '/tmp/langchain-conf.yml'

# Open the YAML file and load its contents
with open(OpenAI_conf, 'r') as file:
    config = yaml.safe_load(file)
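
A missing or empty credential only surfaces later as a connection or authentication error, so it can help to check the loaded dictionary first. The following is a minimal sketch, not part of the original notebook: the helper name `missing_config_keys` and the `REQUIRED_KEYS` tuple are hypothetical, and the stand-in dictionary only mimics the shape of the YAML file described above.

```python
# Keys the notebook reads from the "OpenAI" section of the YAML file.
REQUIRED_KEYS = ("ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID", "OPENAI_API_KEY")


def missing_config_keys(config, section="OpenAI", required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in config[section]."""
    values = (config or {}).get(section) or {}
    return [key for key in required if not values.get(key)]


# Stand-in for the dictionary returned by yaml.safe_load (placeholder values only).
sample = {"OpenAI": {"ASTRA_DB_APPLICATION_TOKEN": "<token>", "ASTRA_DB_ID": ""}}
print(missing_config_keys(sample))  # prints ['ASTRA_DB_ID', 'OPENAI_API_KEY']
```

In the notebook, calling `missing_config_keys(config)` right after loading the file and stopping when the list is non-empty would catch typos in the YAML before any network call is made.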


In [30]:
pdf_reader = PdfReader('./Attention Is All You Need.pdf')

In [14]:
pdf_reader.pages

<PyPDF2._page._VirtualList at 0x7de7412d4a60>

In [35]:
raw_text = ''
for i, page in enumerate(pdf_reader.pages):
  if i<11: # analyze only the first 11 pages (indices 0 to 10)
    content = page.extract_text()
    if content:
      raw_text+= content
len(raw_text)

33800
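
The extraction loop above can also be written as a small reusable helper. This is a sketch under assumptions: `collect_text` and `FakePage` are hypothetical names, and `FakePage` merely stands in for a page object exposing `extract_text()`, so that the example runs without a PDF file.

```python
from itertools import islice


def collect_text(pages, max_pages):
    """Concatenate extracted text from at most `max_pages` leading pages."""
    parts = []
    for page in islice(pages, max_pages):
        content = page.extract_text()
        if content:  # skip pages with no extractable text
            parts.append(content)
    return "".join(parts)


class FakePage:
    """Stand-in for a PDF page object, used only for this illustration."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("a"), FakePage(None), FakePage("b"), FakePage("c")]
print(collect_text(pages, max_pages=3))  # prints ab
```

With the real reader, `collect_text(pdf_reader.pages, max_pages=11)` should cover the same pages as the loop above, since `i<11` accepts indices 0 through 10.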

### Initialize the connection to your database:

In [36]:
cassio.init(token=config['OpenAI']['ASTRA_DB_APPLICATION_TOKEN'], database_id=config['OpenAI']['ASTRA_DB_ID'])


Create your LangChain vector store ... backed by Astra DB!

In [37]:
llm = OpenAI(openai_api_key=config['OpenAI']['OPENAI_API_KEY'])
embedding = OpenAIEmbeddings(openai_api_key=config['OpenAI']['OPENAI_API_KEY'])

In [None]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="attention_all_you_need_demo",
    session=None,
    keyspace=None,
)

In [None]:
# Split the text with CharacterTextSplitter so that each chunk stays small enough for the model's token limits
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)
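
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified, pure-Python chunker. It is only an illustration of the general idea and is not LangChain's implementation: `split_with_overlap` is a hypothetical function, and its exact chunk boundaries may differ from those `CharacterTextSplitter` produces.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedy, simplified chunker.

    Packs separator-delimited pieces into chunks of at most chunk_size
    characters, and repeats trailing pieces (up to chunk_overlap characters)
    at the start of the next chunk.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(items):
        return len(separator.join(items))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carry-over fits the overlap
            # budget and still leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4)
print(demo)  # prints ['aaaa\nbbbb', 'bbbb\ncccc', 'cccc\ndddd']
```

Each chunk shares one line with its neighbor, which is the point of the overlap: a sentence cut at a chunk boundary still appears whole in at least one chunk. In this sketch, a single piece longer than `chunk_size` is emitted as an oversized chunk rather than being cut.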

### Load the text chunks into the vector store


In [None]:

astra_vector_store.add_texts(texts)

print("Inserted %i chunks." % len(texts))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

### Run the QA cycle

Run the cell below and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- Please explain attention for a child.
- How many encoders and decoders are used in the article?

In [None]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))
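
Conceptually, the relevance ranking printed by the loop comes from comparing the query's embedding with the stored chunk embeddings. The sketch below illustrates that idea with a brute-force scan and cosine similarity. Both are assumptions made for illustration: the database uses its own index, its configured metric may differ, and the two-dimensional vectors are toy stand-ins for real embeddings. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the k (score, doc_id) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings" keyed by a made-up chunk id.
docs = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.0, 1.0],
    "chunk-c": [1.0, 1.0],
}
for score, doc_id in top_k([1.0, 0.0], docs, k=2):
    print(f"[{score:0.4f}] {doc_id}")
# prints:
# [1.0000] chunk-a
# [0.7071] chunk-c
```

A full scan like this costs time proportional to the number of stored vectors, which is why vector databases rely on approximate indexes instead.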