# Quickstart: QA with Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: these connection parameters are needed momentarily.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Load data: you will populate the vector store with a number of "The Onion" headlines from a HuggingFace dataset;
- Run a Question-Answering loop retrieving the relevant headlines and having an LLM construct the answer.

In [1]:
# LangChain components to use
from langchain_community.vectorstores.cassandra import Cassandra
from langchain_community.llms import OpenAI
from langchain_community.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

from PyPDF2 import PdfReader
from dotenv import load_dotenv
import os

load_dotenv()

  from .autonotebook import tqdm as notebook_tqdm


True

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details and your OpenAI API key:

In [2]:
ASTRA_DB_APPLICATION_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")  
ASTRA_DB_ID = os.getenv("ASTRA_DB_ID") 

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") # enter your OpenAI key

In [3]:
reader = PdfReader('budget_speech.pdf')

In [4]:
from typing_extensions import Concatenate

raw_text = ''
for i, page in enumerate(reader.pages):
  content = page.extract_text()
  if content:
    raw_text += content

In [6]:
print(raw_text[:500])

GOVERNMENT OF INDIA
BUDGET 2025-2026
SPEECH
OF
NIRMALA SITHARAMAN
MINISTER OF FINANCE
February 1,  2025 
CONTENTS  
 
PART – A 
 Page No.  
Introduction  1 
Budget Theme  1 
Agriculture as the 1st engine  3 
MSMEs as the 2nd engine  6 
Investment as the 3rd engine  8 
A. Investing in People  8 
B. Investing in  the Economy  10 
C. Investing in Innovation  14 
Exports as the 4th engine  15 
Reforms as the Fuel  16 
Fiscal Policy  18 
 
 
PART – B 
Indirect taxes  20 
Direct Taxes   23 
 
Annexure


Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [7]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

Create the LangChain embedding and LLM objects for later usage:

In [8]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

  llm = OpenAI(openai_api_key=OPENAI_API_KEY)
  embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)


Create your LangChain vector store ... backed by Astra DB!

In [9]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [10]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len
)

texts = text_splitter.split_text(raw_text)

In [11]:
print(texts[:10])

['GOVERNMENT OF INDIA\nBUDGET 2025-2026\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2025 \nCONTENTS  \n \nPART – A \n Page No.  \nIntroduction  1 \nBudget Theme  1 \nAgriculture as the 1st engine  3 \nMSMEs as the 2nd engine  6 \nInvestment as the 3rd engine  8 \nA. Investing in People  8 \nB. Investing in  the Economy  10 \nC. Investing in Innovation  14 \nExports as the 4th engine  15 \nReforms as the Fuel  16 \nFiscal Policy  18 \n \n \nPART – B \nIndirect taxes  20 \nDirect Taxes   23 \n \nAnnexure to Part -A 29 \nAnnexure to Part -B 31 \n \n   \n \nBudget 202 5-2026 \n \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1 , 202 5 \nHon’ble Speaker,  \n I present the Budget for 2025 -26. \nIntroduction  \n1. This Budget continues our Government ’s efforts to:  \na) accelerate growth,', 'Minister of Finance  \nFebruary 1 , 202 5 \nHon’ble Speaker,  \n I present the Budget for 2025 -26. \nIntroduction  \n1. This Budget continues our Government 

### Load the dataset into the vector store

> Feel free to tweak the `NUM_HEADLINES` to a higher value for added fun (at a moderate additional expense in OpenAI usage and with a little more time to load)

In [12]:
astra_vector_store.add_texts(texts[:50])
print(f"Inserted {len(texts[:50])} headlines")

astra_retriever = astra_vector_store.as_retriever()

Inserted 50 headlines


In [14]:
from langchain_core.prompts import ChatPromptTemplate
try:
    from langchain.chains.combine_documents import create_stuff_documents_chain
    from langchain.chains.retrieval import create_retrieval_chain
except ImportError:
    from langchain_classic.chains.combine_documents import create_stuff_documents_chain
    from langchain_classic.chains.retrieval import create_retrieval_chain

In [16]:
# Importe a classe necessária
try:
    from langchain.chains import RetrievalQA
except ImportError:
    from langchain_classic.chains import RetrievalQA

prompt = ChatPromptTemplate.from_template("""
    Responda a pergunta a seguir com base apenas no contexto fornecido:
    
    <context>
    {context}
    </context>
    
    Pergunta: {input}
""")

# 3. Crie a Chain de Combinação de Documentos (O que faz o trabalho do 'stuff')
document_chain = create_stuff_documents_chain(llm, prompt)

# 4. Crie a Chain de Recuperação Final (O que conecta o retriever ao LLM)
# astra_retriever é o seu astra_vector_store.as_retriever()
qa_chain_lcel = create_retrieval_chain(astra_retriever, document_chain)

In [17]:
first_question = True

while True:
  if first_question:
    query_text = input("\nEnter your question (or type 'q' to exit): ").strip()
  else:
    query_text = input("\nWhat's your next question (or type 'q' to exit): ").strip()

  if query_text.lower() == 'q':
    break
  if query_text == '':
    continue

  first_question = False

  print(f"Question: {query_text}")
  result = qa_chain_lcel.invoke({"input": query_text})
  answer = result['answer'].strip()
    
  print(f"Answer: {answer}")

  print("FIRST DOCUMENTS BY RELEVANCE:")
  for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
    print(f"{score:.4f}: \"{doc.page_content[:84]}\"")

Question: How much the agriculture target will be increased to and what the focus will be?
Answer: Resposta: The agriculture target will be increased to 100 developing agri-districts and the focus will be on rural women, young farmers, rural youth, marginal and small farmers, and landless families.
FIRST DOCUMENTS BY RELEVANCE:
0.9137: "rural areas so that migration is an option, but not a necessity.  
12. The programme"
0.9137: "rural areas so that migration is an option, but not a necessity.  
12. The programme"
0.9137: "rural areas so that migration is an option, but not a necessity.  
12. The programme"
0.9137: "rural areas so that migration is an option, but not a necessity.  
12. The programme"
Question: What is the current GPD?
Answer: Resposta: Não é possível determinar o PIB atual com base no contexto fornecido.
FIRST DOCUMENTS BY RELEVANCE:
0.8745: "blended finance facility with contribution from the Government , banks and 
private "
0.8745: "blended finance facility with con

### Run the QA cycle

Simply run the cells and ask a question -- or `quit` to stop. (you can also stop execution with the "▪" button on the top toolbar)

Here are some suggested questions:
- _What are scientists doing with amoebas?_
- _Did ChatGPT take the bar exam?_
- _Are gas stoves a controversial item in a household?_

In [None]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit):  What are scientists doing with amoebas?



QUESTION: "What are scientists doing with amoebas?"
ANSWER: "They are torturing them in an attempt to extract information on where life came from."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9397] "Biologists Torture Amoeba For Information On Where Life Came From #~# CAMBRIDGE, MA— ..."
    [0.8773] "Expectant Couple Hoping For Human Baby #~# CONWAY, AR—Praying to be blessed with a c ..."
    [0.8772] "Dolphin Trained To Kill By U.S. Military In ’60s Now Lying Destitute In Street #~# S ..."
    [0.8740] "USDA Approves First Vaccine For Honeybees #~# The United States Department of Agricu ..."



What's your next question (or type 'quit' to exit):  Did ChatGPT take the bar exam?



QUESTION: "Did ChatGPT take the bar exam?"
ANSWER: "Yes, ChatGPT was reportedly forced to take the bar exam."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9327] "ChatGPT Forced To Take Bar Exam Even Though Dream Was To Be AI Art Bot #~# MINNEAPOL ..."
    [0.9033] "What To Know About ChatGPT #~# The artificially intelligent chatbot ChatGPT has rece ..."
    [0.8993] "CEOs Explain How They Will Use ChatGPT #~# ChatGPT, an AI-based program that creates ..."
    [0.8708] "Alito, Thomas Share Laugh After Discovering They Both Leaked Dobbs Decision #~# WASH ..."



What's your next question (or type 'quit' to exit):  Are gas stoves a controversial item in a household?



QUESTION: "Are gas stoves a controversial item in a household?"
ANSWER: "Yes, gas stoves have become a controversial item in households due to recent suggestions by the Consumer Product Safety Commission that they could be banned."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9431] "Conservatives Defend Their Right To Have Gas Stoves #~# Recently, a member of the Co ..."
    [0.9314] "Experts Warn Gas Stoves May Slowly Ingratiate Selves In Family To Kill And Take Plac ..."
    [0.8843] "Concerning Study Finds 1 In 10 Americans Lack Access To Adequate Food Eating Challen ..."
    [0.8722] "Parents Feel Safer Letting Kids Drink And Drive Under Their Roof #~# ASTORIA, OREGON ..."



What's your next question (or type 'quit' to exit):  quit
