# Quickstart: Querying a PDF with Astra DB and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: you will need these connection parameters shortly.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Run a question-answering loop that retrieves the most relevant text chunks from the PDF and has an LLM construct the answer.

In [4]:
!pip install -q cassio datasets langchain openai tiktoken langchain-community

## Import the packages you'll need:

In [5]:
# LangChain components to use
from langchain_community.vectorstores import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [6]:
!pip install PyPDF2



In [7]:
from PyPDF2 import PdfReader

### Setup: database credentials and API key (Astra DB token, Astra DB ID, OpenAI key)

In [8]:
ASTRA_DB_APPLICATION_TOKEN = "<your Astra DB token>"  # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "<your Astra DB database ID>"  # enter your Database ID

OPENAI_API_KEY = "<your OpenAI API key>"  # enter your OpenAI key
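Hardcoding secrets in a notebook cell makes them easy to leak when the notebook is shared. As an alternative, the sketch below reads each value from an environment variable and falls back to a hidden prompt. The helper name `read_secret` is hypothetical and not part of CassIO or LangChain; it uses only the Python standard library.

```python
import os
from getpass import getpass


def read_secret(name, prompt=None):
    """Return a secret from the environment, or ask for it if it is not set.

    `read_secret` is an illustrative helper, not a library function.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # getpass hides the typed value so it does not appear in the cell output
        value = getpass(prompt or f"Enter {name}: ").strip()
    if not value:
        raise ValueError(f"{name} must not be empty")
    return value


# Usage in the notebook (uncomment to use instead of the literals above):
# ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")
# ASTRA_DB_ID = read_secret("ASTRA_DB_ID")
# OPENAI_API_KEY = read_secret("OPENAI_API_KEY")
```

With this approach the variables keep the same names, so the rest of the notebook runs unchanged.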

In [9]:

# Provide the path to your PDF file.
pdfreader = PdfReader('starai.pdf')

In [10]:
# Read the text of every page in the PDF, skipping pages with no extractable text
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content
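The extraction loop can also be wrapped in a small reusable function. The sketch below is one possible variant, not the notebook's own code: it inserts a newline between pages (an assumption, meant to keep the last word of one page from running into the first word of the next) and relies only on each page object exposing `extract_text()`. The `_FakePage` class is a hypothetical stand-in so the example runs without a PDF file.

```python
def extract_text(pages, separator="\n"):
    """Join the text of every page, skipping pages without extractable text."""
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:
            parts.append(content)
    return separator.join(parts)


class _FakePage:
    """Hypothetical stand-in for a PDF page object, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [_FakePage("first page"), _FakePage(""), _FakePage("second page")]
demo_text = extract_text(demo_pages)
print(repr(demo_text))
```

In the notebook, the call would be `raw_text = extract_text(pdfreader.pages)`.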

In [11]:
raw_text

'From Statistical Relational to Neural-Symbolic Artiﬁcial Intelligence\nLuc De Raedt1;2,Sebastijan Duman ˇci´c1,Robin Manhaeve1and Giuseppe Marra1\n1KU Leuven, Department of Computer Science and Leuven.AI\n2¨Orebro University, Center for Applied Autonomous Sensor Systems\nfluc.deraedt, sebastian.dumancic, robin.manhaeve, giuseppe.marrag@kuleuven.be\nAbstract\nNeural-symbolic and statistical relational artiﬁcial\nintelligence both integrate frameworks for learning\nwith logical reasoning. This survey identiﬁes sev-\neral parallels across seven different dimensions be-\ntween these two ﬁelds. These cannot only be used\nto characterize and position neural-symbolic artiﬁ-\ncial intelligence approaches but also to identify a\nnumber of directions for further research.\n1 Introduction\nThe integration of learning and reasoning is one of the key\nchallenges in artiﬁcial intelligence and machine learning today,\nand various communities have been addressing it. That is\nespecially true for the 

## Initialize the connection to your database:


In [12]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

## Create the LangChain embedding and LLM objects for later usage:

In [13]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)



## Create your LangChain vector store ... backed by Astra DB!

In [14]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [15]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that no single chunk exceeds the model's token limit.
# chunk_size and chunk_overlap are measured in characters here (length_function=len).
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
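To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified character-level sliding window. This is only a conceptual sketch: `CharacterTextSplitter` splits on the separator and merges the pieces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=200):
    """Cut text into fixed-size windows where consecutive windows share
    `chunk_overlap` characters. Conceptual sketch only."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
demo_windows = sliding_window_chunks(sample, chunk_size=12, chunk_overlap=4)
for window in demo_windows:
    print(repr(window))
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval.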

In [16]:
texts[:50]

['From Statistical Relational to Neural-Symbolic Artiﬁcial Intelligence\nLuc De Raedt1;2,Sebastijan Duman ˇci´c1,Robin Manhaeve1and Giuseppe Marra1\n1KU Leuven, Department of Computer Science and Leuven.AI\n2¨Orebro University, Center for Applied Autonomous Sensor Systems\nfluc.deraedt, sebastian.dumancic, robin.manhaeve, giuseppe.marrag@kuleuven.be\nAbstract\nNeural-symbolic and statistical relational artiﬁcial\nintelligence both integrate frameworks for learning\nwith logical reasoning. This survey identiﬁes sev-\neral parallels across seven different dimensions be-\ntween these two ﬁelds. These cannot only be used\nto characterize and position neural-symbolic artiﬁ-\ncial intelligence approaches but also to identify a\nnumber of directions for further research.\n1 Introduction',
 'to characterize and position neural-symbolic artiﬁ-\ncial intelligence approaches but also to identify a\nnumber of directions for further research.\n1 Introduction\nThe integration of learning and reasoni

### Load the text chunks into the vector store

In [17]:
astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 chunks.
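It can be useful to store a small metadata record with each chunk, for example to trace an answer back to its source file. The sketch below builds one dictionary per chunk; the helper name `build_chunk_metadata` and the field names are hypothetical choices. LangChain vector stores accept such a list through the `metadatas` argument of `add_texts`, shown here only as a commented-out call.

```python
def build_chunk_metadata(chunks, source):
    """Create one metadata dict per chunk (field names are illustrative)."""
    return [
        {"source": source, "chunk_index": index, "num_chars": len(chunk)}
        for index, chunk in enumerate(chunks)
    ]


example_chunks = ["alpha", "beta gamma"]
example_metadata = build_chunk_metadata(example_chunks, source="starai.pdf")
print(example_metadata)

# In the notebook this could be passed along with the texts, e.g.:
# astra_vector_store.add_texts(texts[:50], metadatas=build_chunk_metadata(texts[:50], "starai.pdf"))
```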


### Run the QA cycle

Simply run the cells and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

In [19]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break
    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:150]))


Enter your question (or type 'quit' to exit): How do Statistical Relational AI and Neuro-Symbolic AI differ in their fundamental goals?

QUESTION: "How do Statistical Relational AI and Neuro-Symbolic AI differ in their fundamental goals?"




ANSWER: "Statistical Relational AI and Neuro-Symbolic AI have different fundamental goals. The former focuses more on logical reasoning and is better suited for explainable AI, while the latter focuses more on sub-symbolic processing and is better suited for tasks like computer vision and natural language processing."

FIRST DOCUMENTS BY RELEVANCE:




    [0.9440] "From Statistical Relational to Neural-Symbolic Artiﬁcial Intelligence
Luc De Raedt1;2,Sebastijan Duman ˇci´c1,Robin Manhaeve1and Giuseppe Marra1
1KU L ..."
    [0.9301] "the former operates more at the symbolic level, lending itself
naturally to explainable AI, while the latter operates more
at the sub-symbolic level,  ..."
    [0.9251] "by positioning a wide variety of StarAI and NeSy systems
along these dimensions and pointing out analogies between
them. This provides not only new in ..."
    [0.9208] "best of both worlds. These ideas include using neural models
to guide the symbolic search [Kalyan et al., 2018; Ellis et
al., 2018a; Valkov et al., 20 ..."

What's your next question (or type 'quit' to exit): quit
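The bracketed numbers in the output are similarity scores between the question's embedding and each chunk's embedding. The sketch below shows the general idea of ranking by cosine similarity using toy two-dimensional vectors. It assumes a cosine-based metric; whether the vector store reports raw cosine similarity or a rescaled value is not confirmed here, so treat the numbers as relative rankings. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return (score, index) pairs for the k most similar document vectors."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings; real embeddings have many more dimensions
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_ranking = top_k([1.0, 0.1], toy_docs, k=2)
for score, index in toy_ranking:
    print("[%0.4f] document %d" % (score, index))
```

A real vector store performs this search with an approximate index rather than scanning every vector, but the ordering principle is the same.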
