# Vector Search with DataStax Enterprise 7 & RAGStack

This page provides a quick start for using [DataStax Enterprise 7](https://www.datastax.com/blog/introducing-vector-search-for-self-managed-datastax-enterprise) as a Vector Store.

Additionally, we're introducing [RAGStack](https://www.datastax.com/products/ragstack), an out-of-the-box solution that simplifies Retrieval Augmented Generation (RAG) in AI apps. RAGStack bundles a curated set of open-source libraries for implementing RAG, giving developers a comprehensive Gen AI stack built on LangChain, CassIO, and more.

***Running the full example requires access to a DSE 7 database and an OpenAI API key.***

In [None]:
import os
from getpass import getpass

cluster_external_ip = 'my-ip'
cass_user = 'my-user'

In [None]:
print(cluster_external_ip)

In [None]:
# Install dependencies
%pip install datasets pypdf ragstack-ai ipywidgets

In [None]:
from datasets import load_dataset

import langchain
# ChatOpenAI is the chat model used by the RAG chain later in this notebook
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import ChatPromptTemplate
from langchain.schema import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter

##### Paste your OpenAI API key into the prompt

In [None]:
os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")

In [None]:
embe = OpenAIEmbeddings(model="text-embedding-3-large")

In [None]:
cass_pass = getpass("DSE password = ")

In [None]:
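# Confirm that a password was entered without echoing the secret into the notebook output
print("DSE password captured." if cass_pass else "DSE password is empty.")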

## DataStax Enterprise 7 Session
`RAGStack` includes LangChain modules for both vector similarity search and vector database operations. Here we `import` the `Cassandra` vector store class, which also works with DataStax Enterprise.

In [None]:
from langchain_community.vectorstores import Cassandra

In [None]:
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# User name & password
auth_provider = PlainTextAuthProvider(username=cass_user, password=cass_pass)

cluster = Cluster(
    [cluster_external_ip],
    connect_timeout=30,
    auth_provider=auth_provider,
)
session = cluster.connect()

In [None]:
# Create a keyspace in the DSE 7 cluster
session.execute("CREATE KEYSPACE IF NOT EXISTS vector_keyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };")
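
The keyspace statement above is a fixed string. If you want to parameterize the keyspace name or replication factor, keep in mind that CQL bind markers apply to values, not identifiers, so the name has to be validated before it is placed into the statement. The following is a minimal sketch under that assumption; `build_keyspace_cql` and its validation pattern are hypothetical helpers written for this illustration, not part of RAGStack or the driver.

```python
import re

# Conservative pattern for an unquoted keyspace name (an assumption made
# for this sketch): a letter followed by letters, digits or underscores,
# at most 48 characters in total.
_KEYSPACE_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,47}")


def build_keyspace_cql(keyspace: str, replication_factor: int = 1) -> str:
    """Return a CREATE KEYSPACE statement after validating its inputs."""
    if not _KEYSPACE_RE.fullmatch(keyspace):
        raise ValueError(f"unsafe keyspace name: {keyspace!r}")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        "WITH REPLICATION = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )


cql = build_keyspace_cql("vector_keyspace")
print(cql)
```

The resulting string could then be passed to `session.execute(...)`. `SimpleStrategy` with a replication factor of 1 suits a single-node test cluster; multi-datacenter deployments typically use `NetworkTopologyStrategy` instead.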

In [None]:
import cassio

cassio.init(session=session, keyspace="vector_keyspace")

In [None]:
# Create the LangChain vector store object.
# session=None and keyspace=None fall back to the values set by cassio.init() above.
vstore = Cassandra(
    embedding=embe, table_name="cassandra_vector_demo", session=None, keyspace=None
)

### Load A Dataset
Convert each entry in the source dataset into a `Document`, then write them into the vector store:

In [None]:
philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]

docs = []
for entry in philo_dataset:
    metadata = {"author": entry["author"]}
    doc = Document(page_content=entry["quote"], metadata=metadata)
    docs.append(doc)

inserted_ids = vstore.add_documents(docs)
print(f"\nInserted {len(inserted_ids)} documents.")

In the above, `metadata` dictionaries are created from the source data and are part of the `Document`.

_Note: check the [Astra DB API Docs](https://docs.datastax.com/en/astra-serverless/docs/develop/dev-with-json.html#_json_api_limits) for the valid metadata field names: some characters are reserved and cannot be used._

In [None]:
texts = ["I think, therefore I am.", "To the things themselves!"]
metadatas = [{"author": "descartes"}, {"author": "husserl"}]
ids = ["desc_01", "huss_xy"]

inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)
print(f"\nInserted {len(inserted_ids_2)} documents.")

_Note: you may want to speed up the execution of `add_texts` and `add_documents` by increasing the concurrency level for these bulk operations. Check out the `*_concurrency` parameters in the class constructor and the `add_texts` docstrings for more details. Depending on the network and the client machine specifications, your best-performing choice of parameters may vary._

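
To see why a higher concurrency level can help, here is a toy sketch that does not touch the database at all. `fake_insert` is a hypothetical stand-in for one network-bound write, and the thread pool size plays the role of the concurrency setting; the actual parameter names and defaults are the ones documented in the vector store's docstrings, not the ones used here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def fake_insert(item: Tuple[int, str]) -> str:
    """Stand-in for one network-bound write; returns a made-up row ID."""
    index, _text = item
    time.sleep(0.01)  # simulate network latency
    return f"id_{index}"


def bulk_insert(texts: List[str], concurrency: int = 8) -> List[str]:
    """Run the simulated writes with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() returns results in input order, whatever the completion order
        return list(pool.map(fake_insert, enumerate(texts)))


texts = [f"quote number {i}" for i in range(40)]
for workers in (1, 8):
    start = time.perf_counter()
    ids = bulk_insert(texts, concurrency=workers)
    elapsed = time.perf_counter() - start
    print(f"concurrency={workers}: {len(ids)} inserts in {elapsed:.2f}s")
```

Because each simulated write mostly waits, running several at once shortens the total time. Real gains depend on the cluster, the network, and the client machine, so it is worth measuring a few settings.
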
### Run simple searches
This section demonstrates plain similarity search, metadata filtering, and getting the similarity scores back:

In [None]:
results = vstore.similarity_search("Our life is what we make of it", k=3)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

In [None]:
results_filtered = vstore.similarity_search(
    "Our life is what we make of it",
    k=3,
    filter={"author": "plato"},
)
for res in results_filtered:
    print(f"* {res.page_content} [{res.metadata}]")

In [None]:
results = vstore.similarity_search_with_score("Our life is what we make of it", k=3)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")
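
Conceptually, the three searches above rank stored vectors by their similarity to the query vector, optionally after restricting the candidates by metadata. The sketch below reproduces that idea in plain Python with hand-made 2-dimensional vectors and plain cosine similarity. All names and data are illustrative assumptions; the store itself uses an index on the database side and may scale its scores differently.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 2-dimensional "embeddings"; real embeddings have far more dimensions.
ROWS = [
    {"text": "quote A", "author": "plato", "vector": [1.0, 0.0]},
    {"text": "quote B", "author": "plato", "vector": [0.6, 0.8]},
    {"text": "quote C", "author": "aristotle", "vector": [0.8, 0.6]},
    {"text": "quote D", "author": "aristotle", "vector": [0.0, 1.0]},
]


def search(
    query_vector: List[float],
    k: int = 3,
    metadata_filter: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, float]]:
    """Return the k most similar rows as (text, score), best first."""
    candidates = [
        row
        for row in ROWS
        if metadata_filter is None
        or all(row.get(key) == value for key, value in metadata_filter.items())
    ]
    scored = [(row["text"], cosine(query_vector, row["vector"])) for row in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


query = [1.0, 0.0]
for text, score in search(query, k=3):
    print(f"* [SIM={score:.3f}] {text}")
for text, score in search(query, k=3, metadata_filter={"author": "plato"}):
    print(f"* [SIM={score:.3f}] {text} (plato only)")
```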

### MMR (Maximal-Marginal-Relevance) search

In [None]:
results = vstore.max_marginal_relevance_search(
    "Our life is what we make of it",
    k=3,
    filter={"author": "aristotle"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
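
MMR trades relevance to the query against redundancy among the results already picked. The following is a minimal greedy sketch of that idea on toy vectors, not the library's implementation. The `lambda_mult` weight is named after the similarly named LangChain option, and the vectors are made up for illustration.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query: List[float],
    vectors: List[List[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k indices, balancing relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(vectors)))
    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = cosine(query, vectors[i])
            # Similarity to the closest item picked so far (0.0 if none yet)
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


vectors = [
    [1.0, 0.0],    # 0: closest to the query
    [0.99, 0.14],  # 1: near-duplicate of 0
    [0.6, 0.8],    # 2: less relevant, but different
]
query = [1.0, 0.0]
print(mmr(query, vectors, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, vectors, k=2, lambda_mult=0.3))  # favors diversity
```

With `lambda_mult=1.0` the near-duplicate is selected second; with a lower weight the more distinct vector is preferred.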

### Deleting stored documents

In [None]:
delete_1 = vstore.delete(inserted_ids[:3])
print(f"all_succeed={delete_1}")  # True, all documents deleted

In [None]:
delete_2 = vstore.delete(inserted_ids[2:5])
print(f"some_succeeds={delete_2}")  # True, though some IDs were gone already

### Running A Minimal RAG Chain
The next cells will implement a simple RAG pipeline:
- download a sample PDF file and load it onto the store;
- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;
- run the question-answering chain.
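
Before looking at the LangChain version, the three steps above can be sketched in plain Python. Everything here is a hypothetical stand-in: `retrieve` ranks passages by word overlap instead of vector similarity, and `fake_llm` replaces the model call, so the sketch only shows how the pieces connect.

```python
from typing import Callable, List


def retrieve(question: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by how many question words they share."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Place the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"CONTEXT:\n{context}\n\nQUESTION: {question}\n\nYOUR ANSWER:"


def fake_llm(prompt: str) -> str:
    """Stand-in for a chat model call; it only reports the prompt size."""
    return f"(model answer based on a {len(prompt)}-character prompt)"


def rag_answer(
    question: str, corpus: List[str], llm: Callable[[str], str] = fake_llm
) -> str:
    """Retrieve, build the prompt, then ask the (stand-in) model."""
    passages = retrieve(question, corpus)
    return llm(build_prompt(question, passages))


corpus = [
    "Philosophy begins in wonder.",
    "The unexamined life is not worth living.",
    "Bread needs flour, water and yeast.",
]
question = "What is the unexamined life worth?"
print(retrieve(question, corpus, k=1))
print(rag_answer(question, corpus))
```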

In [None]:
!curl -L \
"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true" -o "what-is-philosophy.pdf"

In [None]:
# Load the PDF file
pdf_loader = PyPDFLoader("what-is-philosophy.pdf")

# Split it into overlapping chunks; embeddings are computed when the chunks are inserted
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)

print(f"Documents from PDF: {len(docs_from_pdf)}.")
inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)
print(f"Inserted {len(inserted_ids_from_pdf)} documents.")
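
The `chunk_size` and `chunk_overlap` settings control how much text each chunk holds and how much neighboring chunks share. The sketch below shows the overlap arithmetic with fixed-size character windows. This is a simplification made for illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so its chunks will not match these exactly.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> List[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(sample, chunk_size=512, chunk_overlap=64)
print([len(chunk) for chunk in chunks])
print(chunks[0][-64:] == chunks[1][:64])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.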

In [None]:
# Create the prompt and chain
retriever = vstore.as_retriever(search_kwargs={"k": 3})

philo_template = """
You are a philosopher that draws inspiration from great thinkers of the past
to craft well-thought answers to user questions. Use the provided context as the basis
for your answers and do not make up new reasoning paths - just mix-and-match what you are given.
Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.

CONTEXT:
{context}

QUESTION: {question}

YOUR ANSWER:"""

philo_prompt = ChatPromptTemplate.from_template(philo_template)

llm = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | philo_prompt
    | llm
    | StrOutputParser()
)
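
In the chain above, the retriever's output goes straight into the prompt, so `{context}` likely receives the string form of a list of `Document` objects, metadata included. A common refinement is to join the page contents first. The sketch below uses a stand-in `Doc` dataclass (an assumption made so that the example runs without LangChain) to show such a `format_docs` helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in for LangChain's Document, used only in this sketch."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_docs(docs: List[Doc]) -> str:
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("First passage.", {"page": "1"}),
    Doc("Second passage.", {"page": "2"}),
]
print(format_docs(docs))
```

In the notebook, the same function should work on real `Document` objects, for example as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, since LCEL can wrap plain functions as chain steps.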

In [None]:
# Run the whole chain and answer a question
chain.invoke("How does Russell elaborate on Peirce's idea of the security blanket?")

### Cleanup
If you want to completely delete the collection from your DSE 7 instance, run the following cell.

_(You will lose the data you stored in it.)_

In [None]:
vstore.delete_collection()

### Learn more

For extended quickstarts and additional usage examples of the LangChain `Cassandra` vector store, please visit the [CassIO documentation](https://cassio.org/frameworks/langchain/about/).