# astra-vsearch-QA-for-documents
This demo guides you through setting up Astra DB with Vector Search, CassIO, and OpenAI to implement generative Q&A over your own documentation.

This Jupyter notebook for generative Q&A over documents is powered by [Astra Vector Search](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html), [OpenAI](https://github.com/openai/), and CassIO, an [open-source LLM integration with Cassandra and Astra DB](https://cassio.org/).

## Astra Vector Search
Astra vector search enables developers to search a database by context or meaning rather than keywords or literal values. This is done by using “embeddings”. Embeddings are a type of representation used in machine learning where high-dimensional or complex data is mapped onto vectors in a lower-dimensional space. These vectors capture the semantic properties of the input data, meaning that similar data points have similar embeddings.
Reference: [Astra Vector Search](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html)
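
To make "similar data points have similar embeddings" concrete, here is a minimal sketch that ranks texts by cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors and the `toy_embeddings` dictionary are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumption: hand-picked values, not model output)
toy_embeddings = {
    "wine cellar": [0.9, 0.1, 0.0],
    "vault of bottles": [0.8, 0.2, 0.1],
    "carnival costume": [0.1, 0.9, 0.3],
}

# Rank every text by its similarity to the query vector, most similar first
query = toy_embeddings["wine cellar"]
ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_similarity(query, toy_embeddings[name]),
    reverse=True,
)
print(ranked)
```

The text closest in meaning to the query ends up next to it in the ranking, while the unrelated one lands last.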

## CassIO
CassIO is a Python library that integrates Apache Cassandra® with generative artificial intelligence and other machine learning workloads. It simplifies access to the advanced features of the Cassandra database, including vector search, so that developers can concentrate on designing their AI systems rather than on the details of the Cassandra integration.
Reference: [CassIO](https://cassio.org/)

## OpenAI
OpenAI provides various tools and resources to implement your own Document QA Search system. This includes pre-trained language models like GPT-3.5, which can understand and generate human-like text. Additionally, OpenAI offers guidelines and APIs to leverage their models for document search and question-answering tasks, enabling developers to build powerful and intelligent Document QA Search applications.
Reference: [OpenAI](https://github.com/openai/)

## Demo Summary
ChatGPT excels at answering questions, but only on topics it remembers from its training data. It offers you a nice dialog interface to ask questions and get answers.

But what do you do when you have your own documents? How can you leverage GenAI and LLMs to get insights from them?

Think of a Q&A bot that you provide to your customers so they can ask questions about your product documentation.

To do so, you have to implement your own ChatGPT-like solution.
The implementation requires:
1. Analyzing your existing documents and storing the information
2. Providing search capabilities so that questions can be answered

This is solved by using LLMs. Ideally you embed the data as vectors, store them in a vector database, and then use the LLM on top of that database.

This notebook demonstrates a two-step Search-Ask method that enables GPT to answer questions from a library of your own reference documents, based on Astra DB vector search.
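
The two steps can be sketched without any external service. In the sketch below, a simple word-overlap score stands in for the vector similarity search, and the prompt is only built, not sent to an LLM. The function names `search` and `build_prompt` and the prompt wording are assumptions made for illustration, not part of CassIO, LangChain, or OpenAI.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def search(question, chunks, top_k=1):
    """Step 1 (Search): rank chunks by word overlap with the question.
    This is a stand-in for the vector similarity search in Astra DB."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context_chunks):
    """Step 2 (Ask): combine the retrieved text and the question into one
    prompt that would then be sent to the LLM."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = [
    "Luchesi cannot tell Amontillado from Sherry.",
    "The vaults are insufferably damp. They are encrusted with nitre.",
]
question = "What are the vaults encrusted with?"
best = search(question, chunks)
prompt_text = build_prompt(question, best)
print(prompt_text)
```

Because the LLM only sees the retrieved context, it can answer from documents that were never part of its training data.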




# Getting Started with this notebook

Complete these prerequisites before running this notebook:
- Create a new vector search enabled database in Astra.
- Create a keyspace
- Create a token with permissions to create tables
- Download your secure-connect-bundle.zip file
- Create an OpenAI account and create an API key

- When you run this notebook, it will ask you to provide the secure-connect-bundle.zip, a text file, the client ID and password of your token, and your OpenAI API key

# Setup

This Jupyter notebook was built on Colab. You need to install the following libraries.

In [None]:
# install required dependencies
! pip install \
    "cassandra-driver>=3.28.0" \
    "openai==0.27.7" \
    "tiktoken==0.4.0" \
    "langchain>=0.0.218" \
    "cassio==0.0.7"

Collecting cassandra-driver>=3.28.0
  Downloading cassandra_driver-3.28.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB)
Collecting openai==0.27.7
  Downloading openai-0.27.7-py3-none-any.whl (71 kB)
Collecting tiktoken==0.4.0
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting langchain>=0.0.218
  Downloading langchain-0.0.239-py3-none-any.whl (1.4 MB)
Collecting cassio==0.0.7
  Downloading cassio-0.0.7-py3-none-an

# Imports

In [None]:
# Imports for our environment and accessing Astra DB
import os

import getpass
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from google.colab import files

# Astra DB configuration, connection bundle and token secrets

You will need a secure connect bundle and a user with access permission. For demo purposes the "administrator" role will work fine.
More information about how to get the bundle can be found [here](https://docs.datastax.com/en/astra-serverless/docs/connect/secure-connect-bundle.html).

In [None]:
#upload secure connect bundle
print('Please upload your Secure Connect Bundle')
uploaded = files.upload()
if uploaded:
    astraBundleFileTitle = list(uploaded.keys())[0]
    SECURE_CONNECT_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
    )
#Alternatively upload to the environment and reference it here
#SECURE_CONNECT_BUNDLE_PATH = '/content/secure-connect-documentation.zip'


Please upload your Secure Connect Bundle


Saving secure-connect-vectordemo.zip to secure-connect-vectordemo.zip


In [None]:
ASTRA_DB_TOKEN_BASED_USERNAME = getpass.getpass('What Astra DB token username do you want to use? ')
#ASTRA_DB_TOKEN_BASED_USERNAME = '<<ENTER>>'

What Astra DB token username do you want to use? ··········


In [None]:
ASTRA_DB_TOKEN_BASED_PASSWORD = getpass.getpass('What Astra DB token password do you want to use? ')
#ASTRA_DB_TOKEN_BASED_PASSWORD = '<<ENTER>>'

What Astra DB token password do you want to use? ··········


In [None]:
ASTRA_DB_KEYSPACE = input('Which Astra DB keyspace do you want to use? ')
#ASTRA_DB_KEYSPACE = 'mykeyspace'

Which Astra DB keyspace do you want to use? vector_preview


# Provide Sample Data
If you want to provide your own documents, you can upload them below.
As a sample document, you can also download some text here:

In [None]:
# retrieve the text of a short story that will be indexed in the vector store
! curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
SAMPLEDATA_PATH="amontillado.txt"



In [None]:
# Alternatively, you can provide your own file. Consider customizing the queries at the end of the notebook to match your content.
# Note: the TextLoader used later in this notebook reads plain-text files.
print('Please upload your own sample file:')
uploaded = files.upload()
if uploaded:
    sampleDataFileTitle = list(uploaded.keys())[0]
    SAMPLEDATA_PATH = os.path.join(os.getcwd(), sampleDataFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Sample Data. Please re-run the cell.'
    )

Please upload your own sample file:


Saving Vector Search for Generative AI Apps.pdf to Vector Search for Generative AI Apps.pdf


# Connect to Astra DB

In [None]:
# make sure that you can connect to Astra DB - if you see errors, then have a look at the environment you configured earlier

cloud_config = {
   'secure_connect_bundle': SECURE_CONNECT_BUNDLE_PATH
}
auth_provider = PlainTextAuthProvider(ASTRA_DB_TOKEN_BASED_USERNAME, ASTRA_DB_TOKEN_BASED_PASSWORD)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
# Alternative: opt in to the beta protocol version (5/v5-beta) explicitly
#cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider, allow_beta_protocol_version=True)
session = cluster.connect()

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(132541626461840) a4c29bc6-b900-44c5-b980-584a0c4542b5-us-east1.db.astra.datastax.com:29042:c46c5c60-6472-473b-9160-a9c7d7812e51> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


# LLM Provider Setup
CassIO seamlessly integrates with LangChain, offering Cassandra-specific tools for many tasks. In our example we will use vector stores, indexers, embeddings and queries.

We will use OpenAI for our LLM services (see the prerequisites on [cassio.org](https://cassio.org/start_here/#llm-access) for more details).

In [None]:
# Set your secret(s) for LLM access:
# we will use OpenAI embeddings, so please provide your OpenAI API key
apiSecret = getpass.getpass('Your secret for LLM provider OpenAI: ')
#apiSecret = "<<ENTER>>"
os.environ['OPENAI_API_KEY'] = apiSecret

Your secret for LLM provider OpenAI: ··········


In [None]:
#Import the needed libraries and declare the LLM model
import langchain
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Cassandra
from langchain.document_loaders import TextLoader

# creation of the LLM resources
embedding_function = OpenAIEmbeddings()

# read the document and split it into a list of chunks called docs
loader = TextLoader(SAMPLEDATA_PATH)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

print(f'Docs created, it has {len(docs)} elements')

Docs created, it has 16 elements
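
To illustrate what splitting with `chunk_size=1000` and `chunk_overlap=0` roughly means, here is a simplified sketch that packs paragraphs greedily into chunks of a bounded size. It is not LangChain's implementation; the function `split_into_chunks`, its separator default, and its handling of oversized paragraphs are assumptions made for this example.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, without overlap between chunks.
    A single piece longer than chunk_size is kept whole in its own chunk."""
    chunks, current = [], ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces caused by repeated separators
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the open chunk
        else:
            chunks.append(current)  # close the open chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Three 40-character paragraphs with a 100-character limit
sample_text = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
sample_chunks = split_into_chunks(sample_text, chunk_size=100)
print([len(c) for c in sample_chunks])
```

Smaller chunks make each embedding more focused, while larger chunks give the LLM more context per match, so the chunk size is worth tuning for your own documents.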


# Define the Vector Store on Astra DB
LangChain and CassIO will automatically create the required tables and the Storage-Attached Index (SAI) in Astra DB for you, so no manual configuration is needed.

In [None]:
# Define the table name used to store our embeddings. CassIO will create the objects in Astra DB for you.
ASTRA_DB_TABLE_NAME = 'vdocuments'

cassVStore = Cassandra.from_documents(
    documents=docs,
    embedding=embedding_function,
    session=session,
    keyspace=ASTRA_DB_KEYSPACE,
    table_name=ASTRA_DB_TABLE_NAME,
)

In [None]:
# Only run this if the demo has been run several times and you want to clean up.
# It empties the table, so re-run the previous cell before querying.
cassVStore.clear()

# Now Query the Data and execute some "searches" against it
First we will start with a similarity search using the Vectorstore's implementation

In [None]:
# similarity search:
prompt = "What did Luchesi say about Nitro?"

# matched_docs is a list with the found documents from the similarity search
matched_docs = cassVStore.similarity_search(prompt)
# for each of the found documents, print the content
for i, d in enumerate(matched_docs):
    print(f"\n## Document {i}\n")
    print(d.page_content)


## Document 0

"As you are engaged, I am on my way to Luchesi.  If any one has a
critical turn, it is he.  He will tell me--"

"Luchesi cannot tell Amontillado from Sherry."

"And yet some fools will have it that his taste is a match for your
own."

"Come, let us go."

"Whither?"

"To your vaults."

"My friend, no; I will not impose upon your good nature.  I perceive
you have an engagement.  Luchesi--"

"I have no engagement;--come."

"My friend, no.  It is not the engagement, but the severe cold with
which I perceive you are afflicted.  The vaults are insufferably damp.
They are encrusted with nitre."

"Let us go, nevertheless.  The cold is merely nothing. Amontillado!
You have been imposed upon.  And as for Luchesi, he cannot distinguish
Sherry from Amontillado."

Thus speaking, Fortunato possessed himself of my arm. Putting on a mask
of black silk, and drawing a _roquelaire_ closely about my person, I
suffered him to hurry me to my palazzo.

## Document 1

"Proceed," I said; "herei

# Finally do a Q/A Search
To implement question answering over documents, we need three steps:

1. Create an index on top of our vector store
2. Create a retriever from that index
3. Ask questions (prompts)!


A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever.
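
The distinction can be sketched in a few lines. Below, `TinyStore` stores and searches texts, while `FunctionRetriever` stores nothing and only delegates to a search function. Both classes are hypothetical and use plain strings instead of document objects; only the method name `get_relevant_documents` is borrowed from LangChain's retriever interface.

```python
from typing import Callable, List, Protocol

class Retriever(Protocol):
    """Anything that can return relevant texts for a query."""
    def get_relevant_documents(self, query: str) -> List[str]:
        ...

class TinyStore:
    """Hypothetical in-memory store: it both stores texts and searches them."""
    def __init__(self, texts: List[str]):
        self.texts = texts

    def similarity_search(self, query: str) -> List[str]:
        # Word overlap stands in for a real vector similarity search
        words = set(query.lower().split())
        return [t for t in self.texts if words & set(t.lower().split())]

class FunctionRetriever:
    """Hypothetical retriever: stores nothing, only delegates to a search function."""
    def __init__(self, search_fn: Callable[[str], List[str]]):
        self._search_fn = search_fn

    def get_relevant_documents(self, query: str) -> List[str]:
        return self._search_fn(query)

store = TinyStore(["Luchesi is a wine critic", "The vaults are damp"])
retriever: Retriever = FunctionRetriever(store.similarity_search)
hits = retriever.get_relevant_documents("damp vaults")
print(hits)
```

Any other search backend could be plugged into `FunctionRetriever` in the same way, which is what makes the retriever interface more general than a vector store.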

Hint: The query method creates a RetrievalQA chain backed by an OpenAI LLM:

        llm = llm or OpenAI(temperature=0)
        chain = RetrievalQA.from_chain_type(
            llm, retriever=self.vectorstore.as_retriever(), **kwargs
        )

In [None]:
# Q/A LLM Search
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
index = VectorStoreIndexWrapper(vectorstore=cassVStore)

# Search within the document context for some text related information.
prompt = "Who is Luchesi?"
index.query(prompt)


' Luchesi is a connoisseur of wine who cannot tell Amontillado from Sherry.'

In [None]:
#Alternatively you can use a retrieval chain and some conversation history
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.llms import OpenAI
# keep the chat history in memory so that follow-up questions can refer to earlier turns
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
pdf_qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.8), cassVStore.as_retriever(), memory=memory)


query = "Why is Montresor upset with Fortunato?"
result = pdf_qa({"question": query})
print("Answer:")
result["answer"]

Answer:


' Montresor is upset with Fortunato because Fortunato insulted him and he vowed revenge.'
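
As a rough illustration of why the conversation history matters: a follow-up such as "How does he take that revenge?" is ambiguous on its own, so a conversational chain typically first asks the LLM to rewrite it as a standalone question using the earlier turns. The sketch below only builds such a rewrite prompt. `ChatBuffer`, `condense_prompt`, and the prompt wording are hypothetical and do not reproduce LangChain's classes or templates.

```python
class ChatBuffer:
    """Hypothetical stand-in for a conversation memory: it keeps every turn."""
    def __init__(self):
        self.turns = []  # list of (question, answer) tuples

    def add_turn(self, question, answer):
        self.turns.append((question, answer))

def condense_prompt(buffer, follow_up):
    """Build a prompt asking an LLM to rewrite the follow-up question
    so that it can be understood without the chat history."""
    history = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in buffer.turns)
    return (
        "Given the conversation below, rephrase the follow-up question "
        "so that it can be understood on its own.\n\n"
        f"{history}\n\nFollow-up: {follow_up}\nStandalone question:"
    )

buffer = ChatBuffer()
buffer.add_turn(
    "Why is Montresor upset with Fortunato?",
    "Fortunato insulted him and he vowed revenge.",
)
rewrite_prompt = condense_prompt(buffer, "How does he take that revenge?")
print(rewrite_prompt)
```

The rewritten question, rather than the raw follow-up, is what would then be used for the similarity search.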

# Other useful stuff:
For example, query the vector store to see what has been added to it and what happened to our documents.

In [None]:
cqlSelect = f'SELECT * FROM {ASTRA_DB_KEYSPACE}.{ASTRA_DB_TABLE_NAME} LIMIT 3;'  # (Not a production-optimized query ...)
rows = session.execute(cqlSelect)
for row_i, row in enumerate(rows):
    print(f'\nRow {row_i}:')
    print(f'    document_id:      {row.document_id}')
    print(f'    embedding_vector: {str(row.embedding_vector)[:64]} ...')
    print(f'    document:         {row.document[:64]} ...')
    print(f'    metadata_blob:    {row.metadata_blob}')

print('\n...')


Row 0:
    document_id:      f89128c3dd05495983d864d3f0b1657e
    embedding_vector: [-0.0009743598056957126, -0.010959136299788952, -0.0147337932139 ...
    document:         I had scarcely laid the first tier of the masonry when I discove ...
    metadata_blob:    {"source": "amontillado.txt"}

Row 1:
    document_id:      37213d160dc34728a459ca37ce3054d0
    embedding_vector: [-0.003442918648943305, -0.014413947239518166, 0.027327004820108 ...
    document:         He had a weak point--this Fortunato--although in other regards h ...
    metadata_blob:    {"source": "amontillado.txt"}

Row 2:
    document_id:      b7d0cf95728042bbba8a3dd7000f207f
    embedding_vector: [-0.00042468414176255465, -0.015043167397379875, 0.0231247935444 ...
    document:         The wine sparkled in his eyes and the bells jingled.  My own fan ...
    metadata_blob:    {"source": "amontillado.txt"}

...
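
The same table can also be searched by similarity directly in CQL. The sketch below only builds the statement text, using the `ORDER BY ... ANN OF` clause of Cassandra vector search and the column names shown in the output above. The helper `build_ann_query` is hypothetical, and the three-element vector is a placeholder: a real query vector must come from the same embedding model and have the same number of dimensions as the stored embeddings.

```python
def build_ann_query(keyspace, table, query_vector, limit=3):
    """Return a CQL statement that fetches the rows whose embedding_vector
    is nearest to query_vector (approximate nearest neighbour search)."""
    vector_literal = "[" + ", ".join(repr(float(x)) for x in query_vector) + "]"
    return (
        f"SELECT document_id, document FROM {keyspace}.{table} "
        f"ORDER BY embedding_vector ANN OF {vector_literal} LIMIT {int(limit)};"
    )

# Placeholder vector for illustration only (assumption: not a real embedding)
cql = build_ann_query("vector_preview", "vdocuments", [0.1, 0.2, 0.3])
print(cql)
```

In the notebook, the resulting string could be passed to `session.execute`, with the query vector obtained from `embedding_function.embed_query(...)`.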
