<a href="https://colab.research.google.com/github/likhia/astra-vsearch-QA-for-documents/blob/main/%5BShared%5D_astra_vsearch_QA_for_documents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# astra-vsearch-QA-for-documents
This demo guides you through setting up Astra DB with Vector Search, CassIO, and OpenAI to implement generative Q&A over your own documentation.

This Jupyter notebook for generative Q&A over documents is powered by [Astra Vector Search](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html), [OpenAI](https://github.com/openai/), and CassIO, an [open-source LLM integration with Cassandra and Astra DB](https://cassio.org/).

## Astra Vector Search
Astra vector search enables developers to search a database by context or meaning rather than keywords or literal values. This is done by using “embeddings”. Embeddings are a type of representation used in machine learning where high-dimensional or complex data is mapped onto vectors in a lower-dimensional space. These vectors capture the semantic properties of the input data, meaning that similar data points have similar embeddings.
Reference: [Astra Vector Search](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html)
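
To make "similar data points have similar embeddings" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors are made-up values for illustration only; real embeddings come from a model and have far more dimensions. Cosine similarity is one common choice of similarity measure, used here as an assumption rather than as a statement about how Astra is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings" (hypothetical values, not produced by any model)
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

# Compare every word against "cat"; related words should score higher
query = toy_embeddings["cat"]
for word, vector in toy_embeddings.items():
    print(f"{word:8s} {cosine_similarity(query, vector):.3f}")
```

In this toy setup, "kitten" scores much closer to "cat" than "invoice" does, which is the property a vector search relies on.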

## CassIO
CassIO is a Python library for integrating Apache Cassandra® with generative artificial intelligence and other machine learning workloads. It simplifies access to the advanced features of the Cassandra database, including vector search, so that developers can concentrate on designing and refining their AI systems rather than on the details of integrating with Cassandra.
Reference: [CassIO](https://cassio.org/)

## OpenAI
OpenAI provides various tools and resources to implement your own Document QA Search system. This includes pre-trained language models like GPT-3.5, which can understand and generate human-like text. Additionally, OpenAI offers guidelines and APIs to leverage their models for document search and question-answering tasks, enabling developers to build powerful and intelligent Document QA Search applications.
Reference: [OpenAI](https://github.com/openai/)

## Demo Summary
ChatGPT excels at answering questions, but only on topics it remembers from its training data. It offers you a nice dialog interface to ask questions and get answers.

But what do you do when you have your own documents? How can you leverage GenAI and LLMs to get insights from them?

Think of a Q&A bot that you want to provide to your customers so they can ask questions about your product documentation.

To do so, you have to implement your own ChatGPT-like solution.
The implementation requires:
1. Analyzing your existing documents and storing the information
2. Providing search capabilities so that your questions get answers

This is solved by using LLMs. Ideally, you embed the data as vectors, store them in a vector database, and then use the LLM on top of that database.

This notebook demonstrates a two-step Search-Ask method that enables GPT to answer questions using your own documentation as a reference library, backed by Astra DB vector search.
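
The Search-Ask flow can be sketched independently of any database or LLM. The function below is a simplified illustration, not the notebook's implementation: `search_then_ask`, `fake_search`, and `fake_ask` are hypothetical names, and the two callables stand in for the vector search and the LLM call that appear later in the notebook.

```python
from typing import Callable, List

def search_then_ask(
    question: str,
    search: Callable[[str], List[str]],
    ask: Callable[[str], str],
) -> str:
    """Step 1: retrieve relevant passages. Step 2: ask the LLM with them as context."""
    passages = search(question)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return ask(prompt)

# Stand-ins so the sketch runs without a database or an API key
def fake_search(question: str) -> List[str]:
    return ["Astra DB is a database service built by DataStax."]

def fake_ask(prompt: str) -> str:
    return f"(an LLM would answer here, given {len(prompt)} prompt characters)"

print(search_then_ask("What is Astra DB?", fake_search, fake_ask))
```

Keeping the two steps behind plain callables makes it easy to swap the retrieval backend or the model without touching the orchestration.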




# Getting Started with this notebook

These are the prerequisites you need to complete before running this notebook:
- Create a new vector search enabled database in Astra.
- Create a keyspace
- Create a token with permissions to create tables
- Download your secure-connect-bundle.zip file
- Create an OpenAI account and create an API key

- When you run this notebook, it will ask you to provide the secure-connect-bundle.zip, some sample files, your Astra DB token credentials, and your OpenAI API key


# Setup

This Jupyter notebook was built on Colab. You need to install the following libraries.

In [None]:
# install required dependencies
! pip install \
    "cassandra-driver>=3.28.0" \
    "openai==0.27.7" \
    "tiktoken==0.4.0" \
    "langchain>=0.0.218" \
    "cassio==0.0.7" \
    "pypdf"

Collecting cassandra-driver>=3.28.0
  Downloading cassandra_driver-3.28.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB)
Collecting openai==0.27.7
  Downloading openai-0.27.7-py3-none-any.whl (71 kB)
Collecting tiktoken==0.4.0
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting langchain>=0.0.218
  Downloading langchain-0.0.237-py3-none-any.whl (1.3 MB)
Collecting cassio==0.0.7
  Downloading cassio-0.0.7-py3-none-an

# Imports

In [None]:
# Imports for our environment and accessing Astra DB
import os

import getpass
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from google.colab import files
import openai

# Astra DB configuration, connection bundle and token secrets

You will need a secure connect bundle and a user with access permission. For demo purposes the "administrator" role will work fine.
More information about how to get the bundle can be found [here](https://docs.datastax.com/en/astra-serverless/docs/connect/secure-connect-bundle.html).

In [None]:
#upload secure connect bundle
print('Please upload your Secure Connect Bundle')
uploaded = files.upload()
if uploaded:
    astraBundleFileTitle = list(uploaded.keys())[0]
    SECURE_CONNECT_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
    )
#Alternatively upload to the environment and reference it here
#SECURE_CONNECT_BUNDLE_PATH = '/content/secure-connect-documentation.zip'


Please upload your Secure Connect Bundle


Saving secure-connect-vector2.zip to secure-connect-vector2.zip
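
Before connecting, it can help to confirm that the uploaded file really is a zip archive, since a wrong upload otherwise only surfaces later as a connection failure. The helper below is a small sketch; `looks_like_bundle` is a hypothetical name, and the check only verifies the archive format, not that the contents are a valid Astra bundle.

```python
import os
import tempfile
import zipfile

def looks_like_bundle(path):
    """Return True if `path` exists and is a readable zip archive."""
    return os.path.isfile(path) and zipfile.is_zipfile(path)

# Demonstration with a throwaway archive (placeholder contents, not a real bundle)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-bundle.zip")
    with zipfile.ZipFile(demo_path, "w") as archive:
        archive.writestr("placeholder.txt", "not a real bundle")
    print(looks_like_bundle(demo_path))
    print(looks_like_bundle(os.path.join(tmp, "missing.zip")))
```

In the notebook, the same check could be applied as `looks_like_bundle(SECURE_CONNECT_BUNDLE_PATH)` right after the upload.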


In [None]:
ASTRA_DB_TOKEN_BASED_USERNAME = getpass.getpass('What Astra DB token username do you want to use? ')
#ASTRA_DB_TOKEN_BASED_USERNAME = '<<ENTER>>'



In [None]:
ASTRA_DB_TOKEN_BASED_PASSWORD = getpass.getpass('What Astra DB token password do you want to use? ')
#ASTRA_DB_TOKEN_BASED_PASSWORD = '<<ENTER>>'


In [None]:
ASTRA_DB_KEYSPACE = input('Which Astra DB keyspace do you want to use? ')
#ASTRA_DB_KEYSPACE = 'mykeyspace'


# Provide Sample Data
If you want to provide your own documents, you can upload them here.
As a sample document, you can also download some text here:

In [None]:
# Skip this cell if you are not testing with the sample text file.
# Retrieve the text of a short story that will be indexed in the vector store
! curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
SAMPLEDATA = ["amontillado.txt"]

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 13022  100 13022    0     0   129k      0 --:--:-- --:--:-- --:--:--  129k


In [None]:
# Alternatively, you can provide your own files. Consider customizing the queries at the end of the notebook to match your content.

# Only run this cell if you want to test your own files

# Upload one or more sample files
print('Please upload your own sample file:')
uploaded = files.upload()
if uploaded:
    # files.upload() returns a dict keyed by file name; keep just the names
    SAMPLEDATA = list(uploaded.keys())
else:
    raise ValueError(
        'Cannot proceed without Sample Data. Please re-run the cell.'
    )

print('File Uploaded')

Please upload your own sample file:


Saving Whitepaper_AstraDB-Designing-Serverless-Cloud-Native-DBaaS_6141_07.22.21.pdf to Whitepaper_AstraDB-Designing-Serverless-Cloud-Native-DBaaS_6141_07.22.21.pdf
File Uploaded


# Connect to Astra DB

In [None]:
# make sure that you can connect to Astra DB - if you see errors, then have a look at the environment you configured earlier

cloud_config = {
   'secure_connect_bundle': SECURE_CONNECT_BUNDLE_PATH
}
auth_provider = PlainTextAuthProvider(ASTRA_DB_TOKEN_BASED_USERNAME, ASTRA_DB_TOKEN_BASED_PASSWORD)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(132580212726736) 204bdb00-f5c8-420a-a68d-9c26142b0d38-us-east1.db.astra.datastax.com:29042:1a3e9df8-9ad9-46b2-9c4d-2df267a906a9> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


# Read files, Create Embeddings, Store in Vector DB

CassIO seamlessly integrates with LangChain, offering Cassandra-specific tools for many tasks. In our example we will use vector stores, indexers, embeddings and queries.

And we will use OpenAI for our LLM services. (See Pre-requisites on [cassio.org](https://cassio.org/start_here/#llm-access) for more details).
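
Documents are split into smaller pieces before they are embedded, so that each stored vector represents a passage rather than a whole file. The function below is a deliberately naive sketch of fixed-size chunking with overlap. It is not what the LangChain loaders do internally (their splitters are more careful about separators), and `split_into_chunks` with its default sizes is a hypothetical helper for illustration.

```python
def split_into_chunks(text, chunk_size=1000, overlap=100):
    """Split `text` into windows of `chunk_size` characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Small example so the windows are easy to inspect
for chunk in split_into_chunks("abcdefghij", chunk_size=4, overlap=1):
    print(repr(chunk))
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.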

In [None]:
# Set your secret(s) for LLM access:
# we will use OpenAI embeddings, so please provide your OpenAI API key
apiSecret = getpass.getpass('Your secret for LLM provider OpenAI: ')

openai.api_key = apiSecret
os.environ['OPENAI_API_KEY'] = apiSecret

In [None]:
# Import the needed libraries and declare the embedding model
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Cassandra
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader

# define the embedding function to use
embedding_function = OpenAIEmbeddings()

# define the table name used to store our embeddings; CassIO will create the objects in Astra DB for you.
ASTRA_DB_TABLE_NAME = 'vdocuments'


# now loop through all uploaded files and process them
for filename in SAMPLEDATA:
  doc_path = os.path.join(os.getcwd(), filename)

  # pick a loader based on the file extension
  if filename.endswith(".pdf"):
    loader = PyPDFLoader(doc_path)
    file_type = "PDF"
  elif filename.endswith(".txt"):
    loader = TextLoader(doc_path)
    file_type = "TXT"
  else:
    # skip files with an unsupported extension
    print(f"Unsupported file type: {filename}")
    continue

  # load the file and split it into chunks ("pages")
  pages = loader.load_and_split()
  print(f"Processed {file_type} file: {filename}")

  # if anything was loaded, proceed
  if len(pages) > 0:

    # create a vector store object for the document; it embeds the chunks automatically
    cassVStore = Cassandra.from_documents(
      documents=pages,
      embedding=embedding_function,
      session=session,
      keyspace=ASTRA_DB_KEYSPACE,
      table_name=ASTRA_DB_TABLE_NAME,
    )

    # now clean up: delete the local copy of the file
    os.remove(doc_path)

# empty the list of file names, just in case this block is run twice.
SAMPLEDATA = []

print("\nProcessing done.")

Processed PDF file: Whitepaper_AstraDB-Designing-Serverless-Cloud-Native-DBaaS_6141_07.22.21.pdf

Processing done.


In [1]:
# just in case this demo runs multiple times and you want to clean up, run this:
# cassVStore.clear()

NameError: ignored

# Now Query the Data and execute some "searches" against it
First, we start with a similarity search using the vector store's own implementation.

In [None]:
# Please change the prompt based on the PDF that is uploaded.
# similarity search:
# prompt = "What is vector search?"
# prompt = "what is embedding?"
prompt = "What is Astra DB?"

# matched_docs is a list with the found documents from the similarity search
matched_docs = cassVStore.similarity_search(prompt)
# for each of the found documents, print the content
for i, d in enumerate(matched_docs):
    print(f"\n## Document {i}\n")
    print(d.page_content)


## Document 0

WHITEPAPER
D a t a S t a x  A s t r a  D B
Designing a Serverless Cloud-Native
Database-as-a-Service Based on Apache Cassandra™
Astra DB is a globally distributed, serverless, multi-model
database service built
by DataStax to satisfy the needs of users on their
cloud provider of choice. It is
the ﬁrst and only serverless and multi-region database
service that is based on an
open-source NoSQL database, namely Apache Cassandra.
In this paper, we share our experience, rationale
and lessons learned of adapting
Cassandra into a multi-tenant database to serve the
serverless needs of Astra users.
We present a novel microservices-based, cloud-native
architecture that integrates
natively with Kubernetes to bring true, safe, stateful
workloads to the cloud-native age.
This design enables ﬁne-grained, elastic scalability
of individual components to meet
the capacity demands of modern application workloads.
In this work, our main
contributions are: (i) novel microservices-based,
cl

# Perform Q/A Search in different ways


In [None]:
# VectorStoreIndexWrapper allows for easy querying of existing data in a vector store
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
index = VectorStoreIndexWrapper(vectorstore=cassVStore)

# Search within the document context for some text related information.
index.query(prompt)


' Astra DB is a globally distributed, serverless, multi-model database service built by DataStax to satisfy the needs of users on their cloud provider of choice. It is the first and only serverless and multi-region database service that is based on an open-source NoSQL database, namely Apache Cassandra.'

In [None]:
# Alternatively, you can use a retrieval chain with some conversation history
# (see https://betterprogramming.pub/build-a-chatbot-on-your-csv-data-with-langchain-and-openai-ed121f85f0cd)
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.llms import OpenAI
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

pdf_qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.8) , cassVStore.as_retriever(), memory=memory)

result = pdf_qa({"question": prompt})
print("Answer :", result["answer"])

Answer :  Astra DB is a globally distributed, serverless, multi-model database service built by DataStax to satisfy the needs of users on their cloud provider of choice. It is the first and only serverless and multi-region database service that is based on an open-source NoSQL database, namely Apache Cassandra.


In [None]:

message_objects = []

# The 'system' message tells the model how to behave and what kind of responses to give.
message_objects.append({"role":"system",
                        "content":"You're a chatbot to answer questions using the data provided."})

# The 'user' message carries the user's question.
message_objects.append({"role":"user",
                        "content": prompt})

answers_list = []

# The 'assistant' messages carry the results of the Astra vector search, which help the model answer the user's question.
# First compute the embedding for the prompt
print("Prompt : " , prompt)

embedding_query = embedding_function.embed_query(prompt)

# ANN OF orders rows by vector similarity to the prompt embedding; LIMIT keeps the 5 closest
cqlSelect = f'SELECT document FROM {ASTRA_DB_KEYSPACE}.{ASTRA_DB_TABLE_NAME} ORDER BY embedding_vector ANN OF {embedding_query} LIMIT 5;'
rows = session.execute(cqlSelect)
for row in rows:
    context_message = {'role': "assistant", "content": f"{row.document}"}
    answers_list.append(context_message)

message_objects.extend(answers_list)
message_objects.append({"role": "assistant", "content": "Here's my answer to your question."})

completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo-16k",
  messages=message_objects
)

print(completion.choices[0].message['content'])




Prompt :  What is vector search?
Vector search is a method to find related objects that have similar attributes or characteristics. It uses embeddings, which are mathematical representations of data that capture the meaning of the objects. By converting objects into vectors and using algorithms like Approximate Nearest Neighbor (ANN) search, vector search can efficiently and quickly find similar data without needing exact keywords or descriptions. Vector search is commonly used in applications like semantic search, recommendation systems, and generative AI.
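
The cell above passes every retrieved row to the model as its own message. A variation, shown here only as a sketch and not as the notebook's method, is to pack the passages into a single context message and cap its size so the request stays within the model's context window. `build_messages` and the `max_context_chars` budget are hypothetical; a character budget is a rough stand-in for a token count.

```python
def build_messages(question, passages, max_context_chars=6000):
    """Assemble chat messages, keeping passages in order until the budget is used up."""
    messages = [{
        "role": "system",
        "content": "You're a chatbot to answer questions using the data provided.",
    }]
    kept = []
    used = 0
    for passage in passages:
        if used + len(passage) > max_context_chars:
            break  # stop at the first passage that would exceed the budget
        kept.append(passage)
        used += len(passage)
    if kept:
        messages.append({"role": "system", "content": "Context:\n" + "\n---\n".join(kept)})
    messages.append({"role": "user", "content": question})
    return messages

# Example with toy passages and a deliberately tiny budget
for message in build_messages("What is Astra DB?", ["aaaa", "bbbb", "cccc"], max_context_chars=8):
    print(message["role"], "->", repr(message["content"]))
```

Because the search returns the closest matches first, dropping passages from the end discards the least relevant context first.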


# Content in Vector Search
For example, query the vector store to see what has been added to it and what happened to our documentation.

In [None]:
cqlSelect = f'SELECT * FROM {ASTRA_DB_KEYSPACE}.{ASTRA_DB_TABLE_NAME} LIMIT 3;'  # (Not a production-optimized query ...)
rows = session.execute(cqlSelect)
for row_i, row in enumerate(rows):
    print(f'\nRow {row_i}:')
    print(f'    document_id:      {row.document_id}')
    print(f'    embedding_vector: {str(row.embedding_vector)[:64]} ...')
    print(f'    document:         {row.document[:64]} ...')
    print(f'    metadata_blob:    {row.metadata_blob}')

print('\n...')


Row 0:
    document_id:      896b223d5e784110854890aeae293cf8
    embedding_vector: [-0.009008970111608505, -0.02013927698135376, 0.0010774513939395 ...
    document:         In Astra DB, there can be multiple Kubernetes clusters,
each ser ...
    metadata_blob:    {"source": "/content/Whitepaper_AstraDB-Designing-Serverless-Cloud-Native-DBaaS_6141_07.22.21.pdf", "page": 10}

Row 1:
    document_id:      5df6a10dac6342c7b0c344868d84761f
    embedding_vector: [-0.005672066938132048, 0.007558133453130722, 0.0276391934603452 ...
    document:         In
this
example,
the
vectors
have
two
dimensions,
and
the
entrie ...
    metadata_blob:    {"source": "/content/Vector Search for Generative AI Apps.pdf", "page": 5}

Row 2:
    document_id:      0a7237da5e394061b724d61bf229d644
    embedding_vector: [0.012285375036299229, 0.000801105925347656, -0.0165159143507480 ...
    document:         The high-level architecture of Astra DB is shown in
Figure 3.1.  ...
    metadata_blob:    {"source": "