# **Setup**

In [None]:
!pip install "cassio>=0.1.0" llama-index

In [11]:
import os

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Document,
    StorageContext,
)
from llama_index.vector_stores import CassandraVectorStore

Database connection parameters and secrets

In [14]:
import os
from getpass import getpass

try:
    from google.colab import files
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False

# Your database's Secure Connect Bundle zip file is needed:
if IS_COLAB:
    print('Please upload your Secure Connect Bundle zipfile: ')
    uploaded = files.upload()
    if uploaded:
        astraBundleFileTitle = list(uploaded.keys())[0]
        ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
    else:
        raise ValueError(
            'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
        )
else:
    # you are running a local-jupyter notebook:
    ASTRA_DB_SECURE_BUNDLE_PATH = input("Please provide the full path to your Secure Connect Bundle zipfile: ")

Please upload your Secure Connect Bundle zipfile: 


Saving secure-connect-voldemort-vector.zip to secure-connect-voldemort-vector.zip


In [15]:
ASTRA_DB_APPLICATION_TOKEN = getpass("Please provide your Database Token ('AstraCS:...' string): ")

Please provide your Database Token ('AstraCS:...' string): ··········


In [16]:
ASTRA_DB_KEYSPACE = input("Please provide the Keyspace name for your Database: ")

Please provide the Keyspace name for your Database: LlamaIndex


In [17]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Establish connectivity to the Astra DB instance
cluster = Cluster(
    cloud={
        "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
    },
    auth_provider=PlainTextAuthProvider(
        "token",
        ASTRA_DB_APPLICATION_TOKEN,
    ),
)

session = cluster.connect()
keyspace = ASTRA_DB_KEYSPACE

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(137842539953120) b2d68f45-7d3e-4936-a50c-b0742b34b3f5-us-east-2.db.astra.datastax.com:29042:432a9432-577d-4222-b785-4faf40272fd6> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


In [18]:
OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")

Please enter your OpenAI API Key: ··········


In [19]:
import openai

openai.api_key = OPENAI_API_KEY

Creating and populating the Vector Store

In [22]:
# load documents
documents = SimpleDirectoryReader("/content/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    f"First document, text ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)

Total documents: 1
First document, id: 5364256d-d5cc-4a62-acd2-73fe7aa52784
First document, hash: 2e2d9629223c077019a6dde689049344ff2293d6c52372871420119ec049f25c
First document, text (75014 characters):


What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ma ...


Initialize the Cassandra Vector Store

Creation of the vector store entails creation of the underlying database table if it does not exist yet:

In [24]:
cassandra_store = CassandraVectorStore(
    session=session,
    keyspace=keyspace,
    table="cassandra_vector_table_1",
    embedding_dimension=1536,
)

Now wrap this store in a LlamaIndex index abstraction for later querying:


In [25]:
storage_context = StorageContext.from_defaults(vector_store=cassandra_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Note that the above from_documents call does several things at once: it splits the input documents into chunks of manageable size (“nodes”), computes embedding vectors for each node, and stores them all in the Cassandra Vector Store.
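
The three steps performed by that call can be pictured with a toy, dependency-free sketch. This is not how LlamaIndex implements them: `ToyNode`, `split_into_chunks`, `toy_embedding`, and `ingest` are hypothetical names, fixed-size character splitting stands in for the real text splitter, a two-number vector stands in for a 1536-dimensional embedding, and a plain Python list stands in for the Cassandra table.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """One chunk of a source document (a "node"). Hypothetical stand-in type."""

    ref_doc_id: str
    text: str
    embedding: list


def split_into_chunks(text, chunk_size):
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def toy_embedding(text):
    """Stand-in for an embedding model: [vowel count, count of everything else]."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(vowels), float(len(text) - vowels)]


def ingest(doc_id, text, store, chunk_size=20):
    """Chunk the text, embed each chunk, and append the resulting nodes to the store."""
    for chunk in split_into_chunks(text, chunk_size):
        store.append(ToyNode(doc_id, chunk, toy_embedding(chunk)))


essay_opening = (
    "Before college the two main things I worked on were writing and programming."
)
toy_store = []  # plays the role of the vector store table
ingest("doc-1", essay_opening, toy_store)
print(f"Stored {len(toy_store)} nodes, all with ref_doc_id 'doc-1'")
```

Every node keeps the identifier of the document it came from, which is what later makes it possible to remove a whole document in one operation.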

# **Querying the store**

Basic querying

In [26]:
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)

The author chose to work on AI because they were inspired by a novel called "The Moon is a Harsh Mistress" by Heinlein, which featured an intelligent computer called Mike. Additionally, they were influenced by a PBS documentary that showed Terry Winograd using SHRDLU, a program that the author believed could be improved by teaching it more words.


### **MMR-based queries**

The MMR (maximal marginal relevance) method fetches text chunks from the store that are relevant to the query while being as different as possible from each other. The goal is to provide broader context for building the final answer:

In [27]:
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)

The author chose to work on AI because they believed that it was a field that held promise and potential for advancing the understanding of natural language and intelligence. They were initially drawn to AI because of their fascination with the program SHRDLU, which they considered to be a step towards achieving intelligence. However, as they delved deeper into the field, they realized that the existing approaches to AI, which involved explicit data structures and formal representations, were not effective in truly understanding natural language. Despite this realization, the author still found value in working on AI and decided to focus on Lisp, a programming language associated with AI, as they believed it was interesting in its own right.
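
To make the trade-off between relevance and diversity concrete, here is a minimal sketch of greedy MMR selection in plain Python. It is an illustration under stated assumptions, not the code used by the Cassandra vector store: `cosine`, `mmr_select`, and the `lambda_mult` weight are hypothetical names, and the two-dimensional vectors are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr_select(query_vec, candidates, k, lambda_mult=0.5):
    """Greedily pick k candidate indices, balancing relevance and diversity.

    lambda_mult=1.0 ranks by relevance only; lower values penalize
    candidates that resemble the ones already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, candidates[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # most relevant
    [1.0, 0.2],   # nearly as relevant, but almost a duplicate of the first
    [1.0, -1.0],  # less relevant, but points in a different direction
]
print(mmr_select(query, candidates, k=2))                   # [0, 2]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
```

With the balanced weight, the near-duplicate is skipped in favor of the more distinct candidate; with relevance only, the two most similar vectors are returned.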


# **Connecting to an existing store**

Since this store is backed by Cassandra, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:

In [33]:
new_store_instance = CassandraVectorStore(
    session=session,
    keyspace=keyspace,
    table="cassandra_vector_table_1",
    embedding_dimension=1536,
)

# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(vector_store=new_store_instance)

# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query("What did the author study prior to working on AI?")

print(response.response)


The author studied art prior to working on AI.


# **Removing documents from the index**

First get an explicit list of pieces of a document, or “nodes”, from a Retriever spawned from the index:

In [34]:
retriever = new_index_instance.as_retriever(
    vector_store_query_mode="mmr",
    similarity_top_k=3,
    vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)

print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
    print(f"    [{idx}] score = {node_with_score.score}")
    print(f"        id    = {node_with_score.node.node_id}")
    print(f"        text  = {node_with_score.node.text[:90]} ...")

Found 3 nodes.
    [0] score = 0.4293121408693589
        id    = 57c1b2d3-0ef0-4c93-8707-24be53f3045a
        text  = What I Worked On

February 2021

Before college the two main things I worked on, outside o ...
    [1] score = 0.002232291606439618
        id    = d0509622-bf5d-482b-a70c-f357effc31a7
        text  = Now all I had to do was learn Italian.

Only stranieri (foreigners) had to take this entra ...
    [2] score = 0.02296870053065997
        id    = f6abf4c1-a1f2-428e-862a-407409edd3d8
        text  = All you had to do was teach SHRDLU more words.

There weren't any classes in AI at Cornell ...


Print the `ref_doc_id` of each node. They should all be the same, since only one document was inserted.

In [35]:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))

Nodes' ref_doc_id:
5364256d-d5cc-4a62-acd2-73fe7aa52784
5364256d-d5cc-4a62-acd2-73fe7aa52784
5364256d-d5cc-4a62-acd2-73fe7aa52784


To remove the uploaded document from the store, delete it by its `ref_doc_id`:

In [36]:
new_store_instance.delete(nodes_with_scores[0].node.ref_doc_id)

Repeat the same query and check the results. No nodes should be found now:

In [37]:
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)

print(f"Found {len(nodes_with_scores)} nodes.")

Found 0 nodes.
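
The query comes back empty because deletion is keyed on the source document, not on a single node: every node sharing that `ref_doc_id` goes away, and this store held only one document. A toy sketch of that behavior, using a plain list in place of the Cassandra table (`delete_by_ref_doc_id` and the sample nodes are hypothetical):

```python
def delete_by_ref_doc_id(nodes, ref_doc_id):
    """Return only the nodes that do NOT belong to the given source document."""
    return [node for node in nodes if node["ref_doc_id"] != ref_doc_id]


toy_nodes = [
    {"node_id": "n1", "ref_doc_id": "doc-A", "text": "first chunk of A"},
    {"node_id": "n2", "ref_doc_id": "doc-A", "text": "second chunk of A"},
    {"node_id": "n3", "ref_doc_id": "doc-B", "text": "only chunk of B"},
]

# Deleting via the ref_doc_id of ONE node removes ALL nodes of that document.
remaining = delete_by_ref_doc_id(toy_nodes, toy_nodes[0]["ref_doc_id"])
print(f"Remaining nodes: {[node['node_id'] for node in remaining]}")
# Remaining nodes: ['n3']
```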


# **Metadata filtering**

The Cassandra vector store supports metadata filtering in the form of exact-match key=value pairs at query time.
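
Conceptually, an exact-match filter keeps only the nodes whose metadata contains every requested key with exactly the requested value. The following plain-Python sketch illustrates that rule on in-memory dictionaries; `matches_all`, `filter_nodes`, and the sample nodes are hypothetical and say nothing about how the store evaluates filters internally.

```python
def matches_all(metadata, filters):
    """True if every key=value pair in filters appears unchanged in metadata."""
    return all(
        key in metadata and metadata[key] == value
        for key, value in filters.items()
    )


def filter_nodes(nodes, filters):
    """Keep only the nodes whose metadata satisfies all exact-match filters."""
    return [node for node in nodes if matches_all(node["metadata"], filters)]


sample_nodes = [
    {"text": "What I Worked On ...", "metadata": {"source_type": "essay"}},
    {"text": "Unrelated notes ...", "metadata": {"source_type": "other"}},
]

essay_nodes = filter_nodes(sample_nodes, {"source_type": "essay"})
dino_nodes = filter_nodes(sample_nodes, {"source_type": "dinos"})
print(len(essay_nodes), len(dino_nodes))  # 1 0
```

A filter value that no node carries yields an empty candidate set, so a query restricted that way has nothing to build an answer from.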

In this demo, a single source document is loaded (the paul_graham_essay.txt text file). We will attach some custom metadata to the document to illustrate how queries can be restricted with conditions on the metadata attached to the documents.

In [38]:
md_storage_context = StorageContext.from_defaults(
    vector_store=CassandraVectorStore(
        session=session,
        keyspace=keyspace,
        table="cassandra_vector_table_2_md",
        embedding_dimension=1536,
    )
)


def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}


# Load documents and build index
md_documents = SimpleDirectoryReader(
    "/content/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)

You can now add filtering to your query engine:

In [40]:
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="source_type", value="essay")]
    )
)
md_response = md_query_engine.query("How long did it take the author to write his thesis?")
print(md_response.response)


Empty Response


To check that the filter is actually applied, change it to match only "dinos" documents: no document in this demo carries that value, so there will be no answer.