# **Setup**

In [None]:
!pip install "cassio>=0.1.0" llama-index

In [18]:
import os

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Document,
    StorageContext,
)
from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine
from llama_index.vector_stores import CassandraVectorStore
from IPython.display import Markdown, display

Database connection parameters and secrets

In [4]:
import os
from getpass import getpass

try:
    from google.colab import files
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False

# Your database's Secure Connect Bundle zip file is needed:
if IS_COLAB:
    print('Please upload your Secure Connect Bundle zipfile: ')
    uploaded = files.upload()
    if uploaded:
        astraBundleFileTitle = list(uploaded.keys())[0]
        ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
    else:
        raise ValueError(
            'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
        )
else:
    # you are running a local-jupyter notebook:
    ASTRA_DB_SECURE_BUNDLE_PATH = input("Please provide the full path to your Secure Connect Bundle zipfile: ")

Please upload your Secure Connect Bundle zipfile: 


Saving secure-connect-voldemort-vector.zip to secure-connect-voldemort-vector.zip


In [19]:
ASTRA_DB_APPLICATION_TOKEN = getpass("Please provide your Database Token ('AstraCS:...' string): ")

Please provide your Database Token ('AstraCS:...' string): ··········


In [20]:
ASTRA_DB_KEYSPACE = input("Please provide the Keyspace name for your Database: ")

Please provide the Keyspace name for your Database: llamaindex


In [21]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Establish connectivity to the database using the Secure Connect Bundle and token
cluster = Cluster(
    cloud={
        "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
    },
    auth_provider=PlainTextAuthProvider(
        "token",
        ASTRA_DB_APPLICATION_TOKEN,
    ),
)

session = cluster.connect()
keyspace = ASTRA_DB_KEYSPACE

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(135010588228096) b2d68f45-7d3e-4936-a50c-b0742b34b3f5-us-east-2.db.astra.datastax.com:29042:2c823767-4e6e-4c32-ba6f-fed79fd419f1> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


In [22]:
OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")

Please enter your OpenAI API Key: ··········


In [23]:
import openai

openai.api_key = OPENAI_API_KEY

Creating and populating the Vector Store

In [24]:
# load documents
documents = SimpleDirectoryReader("/content/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    f"First document, text ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)

Total documents: 1
First document, id: cc1dab80-e247-4407-8e15-7cf878f592c1
First document, hash: 2e2d9629223c077019a6dde689049344ff2293d6c52372871420119ec049f25c
First document, text (75014 characters):


What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ma ...


Initialize the Cassandra Vector Store

Creation of the vector store entails creation of the underlying database table if it does not exist yet:

In [25]:
cassandra_store = CassandraVectorStore(
    session=session,
    keyspace=keyspace,
    table="cassandra_vector_table_1",
    embedding_dimension=1536,
)

Now wrap this store in a LlamaIndex index abstraction for later querying:


In [26]:
storage_context = StorageContext.from_defaults(vector_store=cassandra_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

Note that the above from_documents call does several things at once: it splits the input documents into chunks of manageable size (“nodes”), computes embedding vectors for each node, and stores them all in the Cassandra Vector Store.
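To make the three steps just described more tangible, here is a minimal, self-contained sketch of a chunk, embed, and store pipeline. It is only an illustration under stated assumptions: `split_into_chunks`, `toy_embed`, and `ingest` are hypothetical helpers, the "embedding" is a word-hashing toy rather than a real model, and the store is a plain dictionary. It does not reproduce how LlamaIndex or the Cassandra Vector Store implement these steps internally.

```python
import hashlib
import math


def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into overlapping character windows (a stand-in for real node parsing)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embed(text, dim=8):
    """Toy embedding (assumption): hash each word into one of `dim` buckets, then L2-normalize."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def ingest(doc_id, text, store, chunk_size=200, overlap=20):
    """Chunk a document, embed every chunk, and save each one as a 'node' in `store`."""
    for n, chunk in enumerate(split_into_chunks(text, chunk_size, overlap)):
        store[f"{doc_id}-{n}"] = {
            "ref_doc_id": doc_id,  # every node remembers its source document
            "text": chunk,
            "embedding": toy_embed(chunk),
        }
    return store


store = ingest("doc-1", "word " * 100, {})
print(f"Stored {len(store)} nodes for doc-1")
```

Each stored node keeps a reference to the document it came from, which is the same idea that makes per-document deletion possible later in this notebook.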

# **Querying the store**

Basic querying

In [29]:
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)

The author chose to work on AI because they were inspired by a novel called "The Moon is a Harsh Mistress" by Heinlein, which featured an intelligent computer called Mike. Additionally, they were influenced by a PBS documentary that showed Terry Winograd using SHRDLU, a program that they believed could be improved by teaching it more words.


### **MMR-based queries**

The MMR (maximal marginal relevance) method fetches text chunks from the store that are relevant to the query yet as different as possible from each other, with the goal of providing broader context for building the final answer:

In [28]:
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)

The author chose to work on AI because they believed that it was a field that had the potential to climb the lower slopes of intelligence. They were excited about the possibilities of teaching programs like SHRDLU to understand natural language and expand their concept of a program.
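The trade-off that MMR makes between relevance and diversity can be sketched in a few lines of plain Python. This is a generic greedy MMR selection, not the code that the Cassandra Vector Store runs. The function names, the `lambda_mult` parameter, and the two-dimensional example vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, balancing relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(angle_degrees):
    """Unit vector at the given angle, used to build easy-to-read toy data."""
    rad = math.radians(angle_degrees)
    return [math.cos(rad), math.sin(rad)]


query = unit(45)
candidates = [unit(40), unit(35), unit(80)]  # two near-duplicates and one outlier

print("MMR picks:        ", mmr_select(query, candidates, k=2, lambda_mult=0.5))
print("Pure similarity:  ", mmr_select(query, candidates, k=2, lambda_mult=1.0))
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain top-k similarity. Lower values penalize candidates that resemble what has already been picked.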


# **HyDE Query Transform**

HyDE (Hypothetical Document Embeddings) is a technique used in semantic search to find documents based on embedding similarity. It is a zero-shot technique, meaning it works without task-specific relevance labels or fine-tuning.

In the context of search, HyDE works by using a language model to generate a hypothetical answer to the query. This hypothetical answer is then embedded into the same vector space as the real documents, and real documents are retrieved based on their vector similarity to it. This can yield more relevant results even when the exact terms used in the search query are not present in the documents.

The aim of HyDE is to improve the quality of search results by focusing on the underlying intent of the search query, rather than just the exact words used. This makes it particularly useful for tasks like question-answering, where the goal is to find the most relevant information to answer a user’s question, rather than just finding documents that contain the exact words used in the question.
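The mechanism described above can be illustrated with a toy example that needs no LLM or embedding model. Everything in it is an assumption made for illustration: `fake_llm_answer` is a hypothetical stand-in that returns a canned "hypothetical document", the bag-of-words `bow_embed` replaces a real embedding model, and averaging the similarity scores is just this sketch's way of combining the hypothetical document with the original query.

```python
import math
from collections import Counter


def bow_embed(text):
    """Toy 'embedding' (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm_answer(query):
    """Hypothetical stand-in for an LLM call; returns a canned hypothetical answer."""
    return "after art school he moved to new york and worked as a painter"


def plain_retrieve(query, corpus):
    """Retrieve the document most similar to the raw query."""
    return max(corpus, key=lambda doc: cosine(bow_embed(query), bow_embed(doc)))


def hyde_retrieve(query, corpus, include_original=True):
    """Retrieve the document most similar to the hypothetical answer."""
    texts = [fake_llm_answer(query)]
    if include_original:
        texts.append(query)

    def score(doc):
        doc_vec = bow_embed(doc)
        return sum(cosine(bow_embed(t), doc_vec) for t in texts) / len(texts)

    return max(corpus, key=score)


corpus = [
    "he moved to new york and lived as a poor painter",
    "what did the students do in the painting class",
]
query = "what did he do after risd"

print("Plain query ->", plain_retrieve(query, corpus))
print("HyDE        ->", hyde_retrieve(query, corpus))
```

The raw query shares mostly question words with the second document, so plain retrieval is misled by surface overlap. The hypothetical answer is phrased like an actual answer and lands closer to the first document.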

### First, we query without transformation: the same query string is used for both embedding lookup and summarization.

In [31]:
query_str = "what did paul graham do after going to RISD"

In [32]:
query_engine = index.as_query_engine()
response = query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))

<b>After going to RISD, Paul Graham did freelance work for a group that did projects for customers.</b>

### Now, we use HyDEQueryTransform to generate a hypothetical document and use it for embedding lookup.

In [33]:
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response = hyde_query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))

<b>After going to RISD, Paul Graham dropped out and moved to New York. He then decided to write another book on Lisp and became a studio assistant for Idelle Weber, a painter. Additionally, he started a company with Robert Morris to put art galleries online, but the idea did not succeed.</b>

In this example, HyDE improves output quality significantly by hallucinating a plausible account of what Paul Graham did after RISD (see below), which improves the embedding quality and, in turn, the final output.

In [34]:
query_bundle = hyde(query_str)
hyde_doc = query_bundle.embedding_strs[0]
display(Markdown(f"<b>{hyde_doc}</b>"))

<b>After attending the Rhode Island School of Design (RISD), Paul Graham embarked on a remarkable journey that would shape his future as a successful entrepreneur and influential figure in the tech industry. Armed with a degree in painting, Graham initially pursued his passion for art, immersing himself in the vibrant art scene and exhibiting his work in various galleries.

However, Graham's insatiable curiosity and innate problem-solving abilities led him to explore the world of computer programming. Recognizing the immense potential of technology, he delved into coding and quickly became enamored with its limitless possibilities. This newfound passion prompted Graham to shift his focus and embark on a new path.

In 1995, Graham co-founded Viaweb, an early e-commerce platform that allowed users to build their online stores. This groundbreaking venture not only showcased Graham's entrepreneurial spirit but also demonstrated his ability to identify emerging trends and capitalize on them. Viaweb's success caught the attention of Yahoo, which acquired the company in 1998, solidifying Graham's reputation as a tech visionary.

Following the acquisition, Graham continued to make significant contributions to the tech industry. In 2001, he co-founded Y Combinator, a startup accelerator that provides funding and mentorship to early-stage companies. Y Combinator quickly gained recognition as one of the most prestigious and influential startup incubators, nurturing the growth of countless successful companies, including Dropbox, Airbnb, and Reddit.

Graham's impact extended beyond his role at Y Combinator. He became a prolific writer, sharing his insights and experiences through thought-provoking essays. His writings, which covered a wide range of topics including entrepreneurship, technology, and philosophy, garnered a large following and solidified his status as a respected intellectual.

In addition to his entrepreneurial endeavors and writing, Graham also dedicated himself to philanthropy. He and his wife, Jessica Livingston, established the Paul Graham Foundation, which supports various charitable causes, including education and poverty alleviation.

In summary, after attending RISD, Paul Graham transitioned from the world of art to the realm of technology. Through his ventures, such as Viaweb and Y Combinator, he demonstrated his entrepreneurial prowess and ability to identify and nurture promising startups. Graham's writings and philanthropic efforts further solidified his influence and impact, making him a prominent figure in both the tech industry and the intellectual community.</b>

# **Connecting to an existing store**

Since this store is backed by Cassandra, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:

In [30]:
new_store_instance = CassandraVectorStore(
    session=session,
    keyspace=keyspace,
    table="cassandra_vector_table_1",
    embedding_dimension=1536,
)

# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(vector_store=new_store_instance)

# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query("What did the author study prior to working on AI?")

print(response.response)


Prior to working on AI, the author studied painting and drawing at the Accademia.


# **Removing documents from the index**

First get an explicit list of pieces of a document, or “nodes”, from a Retriever spawned from the index:

In [None]:
retriever = new_index_instance.as_retriever(
    vector_store_query_mode="mmr",
    similarity_top_k=3,
    vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)

print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
    print(f"    [{idx}] score = {node_with_score.score}")
    print(f"        id    = {node_with_score.node.node_id}")
    print(f"        text  = {node_with_score.node.text[:90]} ...")

Found 3 nodes.
    [0] score = 0.4293121408693589
        id    = 57c1b2d3-0ef0-4c93-8707-24be53f3045a
        text  = What I Worked On

February 2021

Before college the two main things I worked on, outside o ...
    [1] score = 0.002232291606439618
        id    = d0509622-bf5d-482b-a70c-f357effc31a7
        text  = Now all I had to do was learn Italian.

Only stranieri (foreigners) had to take this entra ...
    [2] score = 0.02296870053065997
        id    = f6abf4c1-a1f2-428e-862a-407409edd3d8
        text  = All you had to do was teach SHRDLU more words.

There weren't any classes in AI at Cornell ...


Print the nodes' ref_doc_id values. They should all be the same, since we inserted only one document.

In [None]:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))

Nodes' ref_doc_id:
5364256d-d5cc-4a62-acd2-73fe7aa52784
5364256d-d5cc-4a62-acd2-73fe7aa52784
5364256d-d5cc-4a62-acd2-73fe7aa52784


To remove the document you inserted from the store (that is, all of the nodes derived from it), delete it by its ref_doc_id:

In [None]:
new_store_instance.delete(nodes_with_scores[0].node.ref_doc_id)

Repeat the very same query and check the results now. You should see no results being found:

In [None]:
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)

print(f"Found {len(nodes_with_scores)} nodes.")

Found 0 nodes.
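Why a single delete call leaves zero nodes can be pictured with a small in-memory model. This is only a conceptual sketch: the dictionary layout and the `delete_by_ref_doc_id` helper are hypothetical and are not how the Cassandra Vector Store is implemented. The point is simply that every node carries the id of its source document, so deleting by that id removes all of the document's nodes at once.

```python
# Hypothetical in-memory store: node id -> node record.
toy_store = {
    "node-a": {"ref_doc_id": "doc-1", "text": "What I Worked On ..."},
    "node-b": {"ref_doc_id": "doc-1", "text": "Now all I had to do was learn Italian ..."},
    "node-c": {"ref_doc_id": "doc-2", "text": "An unrelated document ..."},
}


def delete_by_ref_doc_id(store, ref_doc_id):
    """Remove every node whose source document matches; return how many were removed."""
    doomed = [node_id for node_id, node in store.items() if node["ref_doc_id"] == ref_doc_id]
    for node_id in doomed:
        del store[node_id]
    return len(doomed)


removed = delete_by_ref_doc_id(toy_store, "doc-1")
print(f"Removed {removed} nodes, {len(toy_store)} left")
```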


# **Metadata filtering**

The Cassandra vector store supports metadata filtering in the form of exact-match key=value pairs at query time.

In this demo, a single source document is loaded (the paul_graham_essay.txt text file). We will attach some custom metadata to the document to illustrate how queries can be restricted with conditions on the metadata attached to the documents.

In [None]:
md_storage_context = StorageContext.from_defaults(
    vector_store=CassandraVectorStore(
        session=session,
        keyspace=keyspace,
        table="cassandra_vector_table_2_md",
        embedding_dimension=1536,
    )
)


def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}


# Load documents and build index
md_documents = SimpleDirectoryReader(
    "/content/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)

You can now add filtering to your query engine:

In [None]:
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="source_type", value="essay")]
    )
)
md_response = md_query_engine.query("How long it took the author to write his thesis?")
print(md_response.response)


Empty Response


To verify that the filter is being applied, try changing it to match only "dinos" documents: there will be no answer.
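The semantics of exact-match filtering can be sketched independently of any database. The snippet below is a plain-Python illustration with hypothetical names (`matches`, `filter_nodes`, and the sample `nodes` list). It shows only the key=value matching logic, not how the store evaluates filters internally.

```python
def matches(metadata, filters):
    """True only if every filter key is present in the metadata with exactly that value."""
    return all(metadata.get(key) == value for key, value in filters.items())


def filter_nodes(nodes, filters):
    """Keep the nodes whose metadata satisfies all exact-match conditions."""
    return [node for node in nodes if matches(node["metadata"], filters)]


# Hypothetical sample nodes, tagged the way my_file_metadata would tag them.
nodes = [
    {"text": "Before college the two main things ...", "metadata": {"source_type": "essay"}},
    {"text": "I wrote what beginning writers ...", "metadata": {"source_type": "essay"}},
    {"text": "Some unrelated note ...", "metadata": {"source_type": "other"}},
]

essay_nodes = filter_nodes(nodes, {"source_type": "essay"})
dino_nodes = filter_nodes(nodes, {"source_type": "dinos"})

print(f"essay filter: {len(essay_nodes)} nodes")
print(f"dinos filter: {len(dino_nodes)} nodes")
```

A filter value that no node carries, such as "dinos" here, leaves nothing to retrieve, which is why the query engine has nothing to build an answer from.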