# Text and ANN Search on Astra DB (powered by cassandra)

This notebook demonstrates how Astra DB can combine vector similarity search with term search to improve performance, and increase relevance on Generative AI use cases.

## Dependencies

In [1]:
pip install datasets cassandra-driver

Note: you may need to restart the kernel to use updated packages.


## Astra DB Setup

In [2]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

In [3]:
import os
from getpass import getpass

try:
    from google.colab import files
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False

In [4]:
# Your database's Secure Connect Bundle zip file is needed:
if IS_COLAB:
    print('Please upload your Secure Connect Bundle zipfile: ')
    uploaded = files.upload()
    if uploaded:
        astraBundleFileTitle = list(uploaded.keys())[0]
        ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
    else:
        raise ValueError(
            'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
        )
else:
    # you are running a local-jupyter notebook:
    ASTRA_DB_SECURE_BUNDLE_PATH = input("Please provide the full path to your Secure Connect Bundle zipfile: ")

ASTRA_DB_APPLICATION_TOKEN = getpass("Please provide your Database Token ('AstraCS:...' string): ")
ASTRA_DB_KEYSPACE = input("Please provide the Keyspace name for your Database: ")

In [5]:
# Don't mind the "Closing connection" error after "downgrading protocol..." messages,
# it is really just a warning: the connection will work smoothly.
cluster = Cluster(
    cloud={
        "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
    },
    auth_provider=PlainTextAuthProvider(
        "token",
        ASTRA_DB_APPLICATION_TOKEN,
    ),
)

session = cluster.connect()
keyspace = ASTRA_DB_KEYSPACE

## Load Dataset

For this example we are going to be using a variation of the [yahoo_answers dataset](https://huggingface.co/datasets/yahoo_answers_topics/viewer/yahoo_answers_topics/train) from huggingface. For demonstration purposes, we included the label for topic (instead of its id), so that we can use term based search to filter our queries. This dataset has over 1,000,000 answers so it's great for demonstrating Astra DB's performance, scalability, and to leverage filtering capabilities with hybrid search.

In [6]:
from datasets import load_dataset
dataset = load_dataset("jdabello/yahoo_answers_topics", split="train")
dataset

Dataset({
    features: ['id', 'topic', 'question_title', 'question_content', 'best_answer'],
    num_rows: 1400000
})

Lets inspect one record.

In [7]:
dataset[0]

{'id': 0,
 'topic': 'Computers & Internet',
 'question_title': "why doesn't an optical mouse work on a glass table?",
 'question_content': 'or even on some surfaces?',
 'best_answer': 'Optical mice use an LED and a camera to rapidly capture images of the surface beneath the mouse.  The infomation from the camera is analyzed by a DSP (Digital Signal Processor) and used to detect imperfections in the underlying surface and determine motion. Some materials, such as glass, mirrors or other very shiny, uniform surfaces interfere with the ability of the DSP to accurately analyze the surface beneath the mouse.  \\nSince glass is transparent and very uniform, the mouse is unable to pick up enough imperfections in the underlying surface to determine motion.  Mirrored surfaces are also a problem, since they constantly reflect back the same image, causing the DSP not to recognize motion properly. When the system is unable to see surface changes associated with movement, the mouse will not work pr

## LLM Setup

In [8]:
OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")

In [9]:
import openai

openai.api_key = OPENAI_API_KEY

The yahoo answers dataset, has 2 separate columns for the Question Title, and the Question Content. For this excercise we are concatenating those into a single column that we will call Question. For demonstration purposes, we are going to get the embedding of the first question, and get the number of dimensions for that vector. This is needed to create the table.

In [10]:
embedding_model_name = "text-embedding-ada-002"

result = openai.Embedding.create(
    input=f"{dataset[0]['question_title']}. {'question_content'}",
    engine=embedding_model_name,
)
result

<OpenAIObject list at 0x28d9502f0> JSON: {
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        0.008803000673651695,
        -0.01650131121277809,
        0.0021489679347723722,
        -0.02678874135017395,
        -0.0017424763645976782,
        0.01860022358596325,
        -0.023888928815722466,
        -0.02329515665769577,
        -0.03457680717110634,
        -0.02220427617430687,
        0.015451855957508087,
        0.004460187163203955,
        -0.013877672143280506,
        -0.01717793568968773,
        -0.017868366092443466,
        0.02045057900249958,
        0.010149342939257622,
        0.02054724097251892,
        0.013856959529221058,
        -0.015879923477768898,
        -0.03532247245311737,
        0.009548666886985302,
        0.008913470432162285,
        -0.00653838599100709,
        -0.015258535742759705,
        -0.005633920896798372,
        0.004498160909861326,
        -0.03093132935464382,
     

In [11]:
#Let's get the number of dimensions for our embedding
len(result.data[0].embedding)

1536

In [13]:
# Run this to drop the table and indexes before starting over
session.execute(f"DROP TABLE IF EXISTS {ASTRA_DB_KEYSPACE}.yahoo_answers")

<cassandra.cluster.ResultSet at 0x1740efc10>

Now we are going to create our table. We will include the id of the answer, the topic, the Question (concatenated title, and content), the best answer, and the embedding of the question. That's what we will use for the ANN search. We now know that the embedding model we are using generates an embedding with 1536 dimensions.

In [14]:
mktable_cql = f"""CREATE TABLE {ASTRA_DB_KEYSPACE}.yahoo_answers (
answer_id int PRIMARY KEY,
topic text,
question text,
best_answer text,
question_embedding vector<float, 1536>
);
"""
session.execute(mktable_cql)

<cassandra.cluster.ResultSet at 0x28d440310>

In [15]:
session.execute(f"""CREATE CUSTOM INDEX ON {ASTRA_DB_KEYSPACE}.yahoo_answers(question_embedding)
                USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
                WITH OPTIONS = {{ 'similarity_function': 'dot_product' }}""")

<cassandra.cluster.ResultSet at 0x28d95c9d0>

In [16]:
session.execute(f"""
    CREATE CUSTOM INDEX ON {ASTRA_DB_KEYSPACE}.yahoo_answers(topic)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {{
    'index_analyzer': '{{
    "tokenizer" : {{"name" : "standard"}},
    "filters" : [{{"name" : "porterstem"}},{{"name" : "lowercase",	"args": {{}}}}]
    }}'}};""")

<cassandra.cluster.ResultSet at 0x132475410>

We created the embedding for a single record. Now let's do it for 50,000 questions. Note that you can slice the dataset to whatever size you want to test out. This process takes around 8 minutes. There are rate limits in place.

In [17]:
subset_train_data = dataset.select(range(50000))

In [18]:
embedding_model_name = "text-embedding-ada-002"

def get_embedding(input):
    embeddings=[]
    if(len(input)>=0):
        result = openai.Embedding.create(
            input=input,
            engine=embedding_model_name,
        )
        for result_data in result.data:
            embeddings.append(result_data.embedding)
    return embeddings

In [19]:
embedding_params = []


input=[]
for row in subset_train_data:
    input.append(f"{row['question_title']}. {row['question_content']}")
    if len(input) % 2000 == 0:  #sending batches of 2000 questions
        embedding_params=embedding_params+get_embedding(input)
        print(f"{len(embedding_params)} rows processed.")
        input=[]
embedding_params=embedding_params+get_embedding(input)
print(f"{len(embedding_params)} rows processed.")


2000 rows processed.
4000 rows processed.
6000 rows processed.
8000 rows processed.
10000 rows processed.
12000 rows processed.
14000 rows processed.
16000 rows processed.
18000 rows processed.
20000 rows processed.
22000 rows processed.
24000 rows processed.
26000 rows processed.
28000 rows processed.
30000 rows processed.
32000 rows processed.
34000 rows processed.
36000 rows processed.
38000 rows processed.
40000 rows processed.
42000 rows processed.
44000 rows processed.
46000 rows processed.
48000 rows processed.
50000 rows processed.


InvalidRequestError: '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.

In [20]:
params_list = []
i=0
for row in subset_train_data:
    params_list.append((row['id'],row['best_answer'],f"{row['question_title']}. {row['question_content']}",embedding_params[i], row['topic']))
    i=i+1

In [21]:
from cassandra.concurrent import execute_concurrent_with_args
request = session.prepare(
                    f"""
                INSERT INTO {ASTRA_DB_KEYSPACE}.yahoo_answers
                (answer_id, best_answer, question, question_embedding, topic)
                VALUES (?, ?, ?, ?, ?)
                """
)
execute_concurrent_with_args(session, request, params_list)

[ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c2b2b10>),
 ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c2b3790>),
 ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c2d5ed0>),
 ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c2d6c90>),
 ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c2b1b90>),
 ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c30f9d0>),
 ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c30fb90>),
 ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c2b2550>),
 ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c2b1bd0>),
 ExecutionResult(success=True, result_or_exc=<cassandra.cluster.ResultSet object at 0x38c2b2250>),
 Execution

## ANN Search

We are running an ambiguous query on purpose to show how the relevance would be improved by adding a filter on the topic as well. We'll start with just ANN search.

In [22]:
from cassandra.query import SimpleStatement
question = 'Is apple good?'

result = openai.Embedding.create(
    input=question,
    engine=embedding_model_name,
)
embedding=result.data[0].embedding

query = SimpleStatement(
    f"""
    SELECT question, best_answer
    FROM {ASTRA_DB_KEYSPACE}.yahoo_answers
    ORDER BY question_embedding ANN OF {embedding} LIMIT 10;
    """
    )

results = session.execute(query)
top_answers = results._current_rows

for row in top_answers:
  print(f"""{row.question}, {row.best_answer}\n""")

Which is better,a apple or a peach?. , PEACHES!

which is better - apple or orange ?. , I think that apples have many uses.  I've used them in a chicken stir-fry, pie, with tuna salad or chicken salad, on a spinach salad, baked, fried, peeled, sliced, as a garnish.\n\nI've also consumed them in juice, cider, hard cider, applesauce, and many types of deserts.\n\nApples are also more exciting because there are so many distinct varieties.  I like Gala apples the best.\n\nOranges are great, but other than for eating them plain, drinking orange juice, or using a little orange zest in a recipe, I don't have as many uses for them.  Plus, the peel makes them more time-consuming to eat.

Are iPods good or bad?. , Good. All your music and videos in your palm.

Why is the apple on Apple's (the IT co.) bitten??. , It's a forbiden fruit- tempting users to try it (Heaven's own ;). Inspiring your imagination (Jobs was/is one big marketing guy)\n+ One of the logos have had the "a" from apple fitting r

There are some answers that reference Apple's (the technology company headquartered in California), but the top results are actually referencing the fruit. Relevance can be improved by leveraging hybrid search and filtering by topic.

In [24]:
from cassandra.query import SimpleStatement
question = 'Is apple good?'

result = openai.Embedding.create(
    input=question,
    engine=embedding_model_name,
)
embedding=result.data[0].embedding

query = SimpleStatement(
    f"""
    SELECT question, best_answer
    FROM {ASTRA_DB_KEYSPACE}.yahoo_answers
    WHERE topic : 'computers'
    ORDER BY question_embedding ANN OF {embedding} LIMIT 10;
    """
    )

results = session.execute(query)
top_answers = results._current_rows

for row in top_answers:
  print(f"""{row.question}, {row.best_answer}\n""")

Why is the apple on Apple's (the IT co.) bitten??. , It's a forbiden fruit- tempting users to try it (Heaven's own ;). Inspiring your imagination (Jobs was/is one big marketing guy)\n+ One of the logos have had the "a" from apple fitting right over the apple.\n\n\nYou can find more about Apple's logo & history here:\nhttp://en.wikipedia.org/wiki/Apple_computer#Logo

dell or apple?. , I used to work for Dell, and in my honest opinion, Dell would be the better way to go.  When you buy a Dell, you get a lower price because it's direct from the manufactuer.  Plus, there customer service is a customer-comes-first focus, so you get top rated care, but remember it's only for hardware issues, not how-to-use-my-computer because-I'm-new issues.\n\nI've never tried apple, but I want to.  But I own a Dell, I have a dell axim and dell dj, and I Always get good service and a good price.

Is Apple Computers conceited?. They boast about themselves alot. Do they're products deserve the respect., Well, 