# **Getting Started with this notebook**

* Create a new vector search enabled database in Astra.
* For the easy path, name the keyspace in that database "**bike_rec**", the name used in this notebook's recorded run (otherwise be prepared to modify the CQL in this notebook)
* Create a token with permissions to create tables
* Download your **secure-connect-bundle** zip file.
* When you open this notebook in Google Colab or your own notebook server, drag-and-drop the secure connect bundle into the File Browser (on the left panel) of the notebook
* Update the Keys & Environment Variables cell in the notebook with information from the token you generated and the name of your secure connect bundle file.
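
If you would rather not type credentials every time, one option is to read each setting from an environment variable first and prompt only when it is missing. The sketch below is an illustration, not part of the original notebook: the helper name `read_setting` and the variable name `ASTRA_DB_KEYSPACE` as an environment key are assumptions.

```python
import os
from getpass import getpass


def read_setting(name, prompt, env=os.environ, secret=False):
    """Return env[name] if it is set and non-empty, otherwise prompt for it."""
    value = env.get(name, "").strip()
    if value:
        return value
    # Fall back to an interactive prompt; getpass hides secrets such as tokens.
    ask = getpass if secret else input
    return ask(prompt).strip()


# Example with an explicit mapping instead of the real environment (hypothetical values):
demo_env = {"ASTRA_DB_KEYSPACE": "bike_rec"}
keyspace_name = read_setting("ASTRA_DB_KEYSPACE", "Keyspace name: ", env=demo_env)
```

Passing `secret=True` would be the natural choice for the database token and the OpenAI key, so they are not echoed in the notebook output.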


# **Install libraries**

In [58]:
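# The embedding calls below use the pre-1.0 openai interface (openai.Embedding), so pin the version
!pip install "openai<1" pandas cassandra-driver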



# **Import Modules**

In [59]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory
from cassandra.query import SimpleStatement
import openai
import pandas as pd
import time
import requests

In [60]:
import os
from getpass import getpass

try:
    from google.colab import files
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False

# **Keys & Environment Variables**

In [61]:
# Your database's Secure Connect Bundle zip file is needed:
if IS_COLAB:
    print('Please upload your Secure Connect Bundle zipfile: ')
    uploaded = files.upload()
    if uploaded:
        astraBundleFileTitle = list(uploaded.keys())[0]
        ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
    else:
        raise ValueError(
            'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
        )
else:
    # you are running a local-jupyter notebook:
    ASTRA_DB_SECURE_BUNDLE_PATH = input("Please provide the full path to your Secure Connect Bundle zipfile: ")

Please upload your Secure Connect Bundle zipfile: 


Saving secure-connect-voldemort-vector.zip to secure-connect-voldemort-vector (1).zip
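
A mistyped path or a partially downloaded bundle only surfaces later, when the driver tries to connect. As a hedged sketch, the path could be checked up front with the standard library; `check_bundle` is a hypothetical helper name, not something the notebook or the driver provides.

```python
import os
import zipfile


def check_bundle(path):
    """Return the absolute path of the bundle, or raise if it is not a readable zip file."""
    full_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(full_path):
        raise FileNotFoundError(f"No file found at {full_path}")
    if not zipfile.is_zipfile(full_path):
        raise ValueError(f"{full_path} is not a zip archive")
    return full_path
```

In the cell above this would be used as `ASTRA_DB_SECURE_BUNDLE_PATH = check_bundle(ASTRA_DB_SECURE_BUNDLE_PATH)`.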


In [62]:
ASTRA_DB_APPLICATION_TOKEN = getpass("Please provide your Database Token ('AstraCS:...' string): ")

Please provide your Database Token ('AstraCS:...' string): ··········


In [69]:
ASTRA_DB_KEYSPACE = input("Please provide the Keyspace name for your Database: ")

Please provide the Keyspace name for your Database: bike_rec


In [70]:
#Establish Connectivity
cluster = Cluster(
    cloud={
        "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
    },
    auth_provider=PlainTextAuthProvider(
        "token",
        ASTRA_DB_APPLICATION_TOKEN,
    ),
)

session = cluster.connect()
keyspace = ASTRA_DB_KEYSPACE

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(138802813272464) b2d68f45-7d3e-4936-a50c-b0742b34b3f5-us-east-2.db.astra.datastax.com:29042:f03b8bff-1861-4185-8d91-4e5dc70ee886> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


# **LLM Setup**

In [65]:
OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")

Please enter your OpenAI API Key: ··········


In [66]:
import openai

openai.api_key = OPENAI_API_KEY

# **Select a model to compute embeddings**

In [67]:
model_id = "text-embedding-ada-002"

# **Connect to Astra with Vector Search**

# **Drop / Create Schema**

In [71]:
# only use this to reset the schema
session.execute(f"""DROP INDEX IF EXISTS {ASTRA_DB_KEYSPACE}.descriptions_vector_idx""")
session.execute(f"""DROP INDEX IF EXISTS {ASTRA_DB_KEYSPACE}.type_idx_analyzer""")
session.execute(f"""DROP TABLE IF EXISTS {ASTRA_DB_KEYSPACE}.bikes""")

<cassandra.cluster.ResultSet at 0x7e3d9f1ed2d0>

In [72]:
# # Create Table
session.execute(f"""CREATE TABLE IF NOT EXISTS {ASTRA_DB_KEYSPACE}.bikes
(model text,
  brand text,
  type text,
  price decimal,
  description text,
  description_embedding vector<float, 1536>,
  PRIMARY KEY (brand,model))""")

<cassandra.cluster.ResultSet at 0x7e3d8c69c490>
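
The `vector<float, 1536>` column only accepts vectors of exactly 1536 entries, which is the output size of `text-embedding-ada-002`. A small guard like the one sketched here can catch a mismatch (for example after switching models) before the INSERT fails; the names `EMBEDDING_DIMENSION` and `check_dimension` are illustrative assumptions.

```python
EMBEDDING_DIMENSION = 1536  # matches vector<float, 1536> in the table definition


def check_dimension(embedding, expected=EMBEDDING_DIMENSION):
    """Return the embedding unchanged, or raise if its length does not match the column."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} entries, but the column expects {expected}"
        )
    return embedding
```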

In [73]:
# # Create Index
session.execute(f"""CREATE CUSTOM INDEX IF NOT EXISTS descriptions_vector_idx ON {ASTRA_DB_KEYSPACE}.bikes (description_embedding) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex' WITH OPTIONS = {{'similarity_function':'dot_product'}}""")
session.execute(f"""CREATE CUSTOM INDEX IF NOT EXISTS type_idx_analyzer ON {ASTRA_DB_KEYSPACE}.bikes (type) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex' WITH OPTIONS = {{'index_analyzer': '{{"tokenizer" : {{"name" : "standard"}},"filters" : [{{"name" : "porterstem"}},{{"name" : "lowercase",	"args": {{}}}}]}}'}};""")

<cassandra.cluster.ResultSet at 0x7e3d8fa80220>
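
The doubled braces in the analyzer index statement are only there to escape the f-string, which makes the JSON hard to read. As an alternative sketch, the analyzer configuration can be kept as a Python dict and serialized with `json.dumps`. The function name `analyzer_index_cql` is hypothetical, and the keyspace value in the example is the one from this notebook's recorded run.

```python
import json

# Same analyzer settings as in the cell above, written as a plain dict
analyzer = {
    "tokenizer": {"name": "standard"},
    "filters": [
        {"name": "porterstem"},
        {"name": "lowercase", "args": {}},
    ],
}


def analyzer_index_cql(keyspace, table, column, index_name, analyzer_config):
    """Render a CREATE CUSTOM INDEX statement with an index_analyzer option."""
    analyzer_json = json.dumps(analyzer_config)
    return (
        f"CREATE CUSTOM INDEX IF NOT EXISTS {index_name} "
        f"ON {keyspace}.{table} ({column}) "
        "USING 'org.apache.cassandra.index.sai.StorageAttachedIndex' "
        f"WITH OPTIONS = {{'index_analyzer': '{analyzer_json}'}}"
    )


statement = analyzer_index_cql("bike_rec", "bikes", "type", "type_idx_analyzer", analyzer)
print(statement)
```

The resulting string can be passed to `session.execute(statement)` in place of the hand-escaped version.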

# **Building the Knowledge Base**

We start by constructing our knowledge base. We'll use a small sample dataset of bike descriptions, stored as a JSON file on GitHub. We download it and load it into a pandas DataFrame like so:

In [74]:
url = "https://raw.githubusercontent.com/bsbodden/redis_vss_getting_started/main/data/bikes.json"
response = requests.get(url)
bikes = response.json()
#print(bikes)
bikes = pd.DataFrame(bikes)
bikes

Unnamed: 0,model,brand,price,type,specs,description
0,Jigger,Velorim,270,Kids bikes,"{'material': 'aluminium', 'weight': '10'}","Small and powerful, the Jigger is the best rid..."
1,Hillcraft,Bicyk,1200,Kids Mountain Bikes,"{'material': 'carbon', 'weight': '11'}",Kids want to ride with as little weight as pos...
2,Chook air 5,Nord,815,Kids Mountain Bikes,"{'material': 'alloy', 'weight': '9.1'}",The Chook Air 5 gives kids aged six years and...
3,Eva 291,Eva,3400,Mountain Bikes,"{'material': 'carbon', 'weight': '9.1'}","The sister company to Nord, Eva launched in 20..."
4,Kahuna,Noka Bikes,3200,Mountain Bikes,"{'material': 'alloy', 'weight': '9.8'}",Whether you want to try your hand at XC racing...
5,XBN 2.1 Alloy,Breakout,810,Road Bikes,"{'material': 'alloy', 'weight': '7.2'}",The XBN 2.1 Alloy is our entry-level road bike...
6,WattBike,ScramBikes,2300,eBikes,"{'material': 'alloy', 'weight': '15'}",The WattBike is the best e-bike for people who...
7,Soothe Electric bike,Peaknetic,1950,eBikes,"{'material': 'alloy', 'weight': '14.7'}","The Soothe is an everyday electric bike, from ..."
8,Secto,Peaknetic,430,Commuter bikes,"{'material': 'aluminium', 'weight': '10.0'}",If you struggle with stiff fingers or a kinked...
9,Summit,nHill,1200,Mountain Bike,"{'material': 'alloy', 'weight': '11.3'}",This budget mountain bike from nHill performs ...


# **Load the table with data and create text embeddings**


In [75]:
import traceback

# Parameterized INSERT: the driver binds the values, so the statement is built once
query = SimpleStatement(f"""INSERT INTO {ASTRA_DB_KEYSPACE}.bikes(model, brand, price, type, description, description_embedding) VALUES (%s, %s, %s, %s, %s, %s)""")

counter = 0  # rows processed so far
total = 0    # rows processed since the last progress message
for idx in bikes.index:

  # Store the description with commas escaped as '\,'
  description = bikes['description'][idx].replace(',', '\\,')

  # The embedding is computed from the unescaped description of each bike row
  full_chunk = bikes['description'][idx]

  # NOTE: this guard skips the first row (counter is still 0), so that bike is never inserted.
  # Remove the guard to load every row.
  if counter > 0:
    embedding = openai.Embedding.create(input=full_chunk, model=model_id)['data'][0]['embedding']

    try:
      print(bikes['model'][idx], bikes['brand'][idx], bikes['price'][idx], bikes['type'][idx], description, embedding)
      session.execute(query, (bikes['model'][idx], bikes['brand'][idx], bikes['price'][idx], bikes['type'][idx], description, embedding), trace=True)
    except Exception as e:
      # Print the stack trace and stop at the first failure
      traceback.print_exc()
      print(e)
      break
    else:
      # The CQL statement executed successfully
      print("Embeddings were inserted successfully.")

  # On the OpenAI free trial the rate limit is 60 requests per minute; add pacing here if you exceed your own limit.
  counter += 1

  # Report progress after every 300 rows
  total += 1

  if total >= 300:
    display('total records inserted ')
    display(counter)
    total = 0

Hillcraft Bicyk 1200 Kids Mountain Bikes Kids want to ride with as little weight as possible. Especially on an incline! They may be at the age when a 27.5" wheel bike is just too clumsy coming off a 24" bike. The Hillcraft 26 is just the solution they need! Imagine 120mm travel. Boost front/rear.  You have NOTHING to tweak because it is easy to assemble right out of the box. The Hillcraft 26 is an efficient trail trekking machine. Up or down does not matter - dominate the trails going both down and up with this amazing bike. The name Monarch comes from Monarch trail in Colorado where we love to ride.  It’s a highly technical\, steep and rocky trail but the rip on the waydown is so fulfilling.  Don’t ride the trail on a hardtail! It is so much more fun on the full suspension Hillcraft!  Hit your local trail with the Hillcraft Monarch 26 to get to where you want to go.   [0.027309365570545197, 0.03310307115316391, 0.003910417668521404, -0.014049404300749302, 0.006506212055683136, 0.02435
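
The loading loop mentions a rate limit of 60 requests per minute but does not pace its requests. One way to respect such a limit is a small throttle that enforces a minimum interval between calls. This is a sketch under that assumption; the `Throttle` class is hypothetical and not part of the `openai` package.

```python
import time


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval, then record the call time."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

In the loop, creating `throttle = Throttle(per_minute=60)` once and calling `throttle.wait()` right before each embedding request would keep the request rate at or below the limit. The `clock` and `sleep` parameters exist only so the behaviour can be checked without real waiting.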

# **Convert a query string into a text embedding to use as part of the query**

Provide a question describing the kind of bike you are interested in, and see how it is answered with Vector Search and OpenAI embeddings.

Here we use the same API that we used to calculate embeddings for each row in the database, but this time we are using your input question to calculate a vector to use in a query.

In [76]:
# Question to find out the information that you need.
customer_input = "Bike for small kids"

# Create embedding based on same model
embedding = openai.Embedding.create(input=customer_input, model=model_id)['data'][0]['embedding']
#display(embedding)

Let's take a look at what a query against a vector index could look like. The query vector has the same dimensions (number of entries in the list) as the embeddings we generated a few steps ago for each row in the database.
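
To make the idea concrete, the sketch below ranks a few toy vectors by dot product, the similarity function chosen for the index. It computes the exact ranking by brute force, whereas `ANN OF` returns an approximation of it. The three-dimensional vectors and their labels are made up for illustration; real embeddings from this model have 1536 entries. OpenAI documents its embeddings as normalized to length 1, in which case dot product and cosine similarity give the same ranking.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_k(query, rows, k):
    """Exact nearest neighbours by dot product, highest score first."""
    scored = [(dot(query, vec), name) for name, vec in rows.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Toy data (hypothetical): one unit-length vector per row
toy_rows = {
    "kids bike": normalize([0.9, 0.1, 0.0]),
    "mountain bike": normalize([0.2, 0.9, 0.1]),
    "e-bike": normalize([0.0, 0.2, 0.9]),
}
toy_query = normalize([1.0, 0.2, 0.0])
print(top_k(toy_query, toy_rows, 2))
```

The dimension check in `dot` mirrors the requirement stated above: a query vector with a different number of entries cannot be compared with the stored embeddings.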

# **Find the top 5 results using ANN Similarity**

In [77]:
# Use the embedding to find the information nearest to the question asked.
query = SimpleStatement(
    f"""
    SELECT *
    FROM {ASTRA_DB_KEYSPACE}.bikes
    ORDER BY description_embedding ANN OF {embedding} LIMIT 5;
    """
    )
#display(query)

In [79]:
results = session.execute(query)
# Iterate the ResultSet instead of reading its private _current_rows attribute
top_results = list(results)

bikes_results = pd.DataFrame(top_results)
display(bikes_results)

Unnamed: 0,brand,model,description,description_embedding,price,type
0,Nord,Chook air 5,The Chook Air 5 gives kids aged six years and...,"[0.01237909309566021, -0.0003592295979615301, ...",815,Kids Mountain Bikes
1,ScramBikes,WattBike,The WattBike is the best e-bike for people who...,"[0.014994332566857338, -0.010562343522906303, ...",2300,eBikes
2,BikeShind,ThrillCycle,"An artsy\, retro-inspired bicycle that’s as f...","[0.004851819947361946, -0.012223890982568264, ...",815,Commuter Bikes
3,Bicyk,Hillcraft,Kids want to ride with as little weight as pos...,"[0.027309365570545197, 0.03310307115316391, 0....",1200,Kids Mountain Bikes
4,Noka Bikes,Kahuna,Whether you want to try your hand at XC racing...,"[-0.006983320694416761, -0.010508619248867035,...",3200,Mountain Bikes
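
The `description_embedding` column dominates the result table without telling a reader much. As a sketch, assuming the driver returns its default named-tuple rows, the column can be dropped before building the DataFrame. The `Row` type and `without_column` helper below are stand-ins for illustration, and the sample row copies values from the output above with a shortened vector.

```python
from collections import namedtuple

# Stand-in for the named-tuple rows returned by the driver
Row = namedtuple("Row", ["brand", "model", "price", "type", "description_embedding"])


def without_column(rows, column):
    """Convert named-tuple rows to dicts, leaving out one column."""
    return [
        {key: value for key, value in row._asdict().items() if key != column}
        for row in rows
    ]


sample = [Row("Nord", "Chook air 5", 815, "Kids Mountain Bikes", [0.01, -0.0003])]
compact = without_column(sample, "description_embedding")
```

`pd.DataFrame(compact)` then shows only the readable columns.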


## **Let's Add a Text Analyzer Filter to the ANN Search**

In [80]:
# Filter rows with the text analyzer index on type, then rank the matches by ANN similarity.
query = SimpleStatement(
    f"""
    SELECT *
    FROM {ASTRA_DB_KEYSPACE}.bikes
    WHERE type : 'Kids bikes'
    ORDER BY description_embedding ANN OF {embedding} LIMIT 5;
    """
    )
#display(query)

In [81]:
results = session.execute(query)
# Iterate the ResultSet instead of reading its private _current_rows attribute
top_results = list(results)

bikes_results = pd.DataFrame(top_results)
display(bikes_results)

Unnamed: 0,brand,model,description,description_embedding,price,type
0,Nord,Chook air 5,The Chook Air 5 gives kids aged six years and...,"[0.01237909309566021, -0.0003592295979615301, ...",815,Kids Mountain Bikes
1,Bicyk,Hillcraft,Kids want to ride with as little weight as pos...,"[0.027309365570545197, 0.03310307115316391, 0....",1200,Kids Mountain Bikes
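
The filter `type : 'Kids bikes'` matched rows whose type is "Kids Mountain Bikes" because the index analyzer tokenizes, lowercases, and stems both the column value and the query text before comparing terms. The toy model below imitates that behaviour under two loud assumptions: stripping a trailing "s" is only a crude stand-in for Porter stemming, and a row is assumed to match when every analyzed query term appears in the analyzed column value. It is not the database's implementation.

```python
import re


def analyze(text):
    """Lowercase, split on non-alphanumerics, and strip a trailing 's' (crude stand-in for Porter stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens}


def matches(column_value, query_text):
    """True when every analyzed query term also appears in the analyzed column value."""
    return analyze(query_text) <= analyze(column_value)


for bike_type in ["Kids Mountain Bikes", "eBikes", "Commuter bikes"]:
    print(bike_type, matches(bike_type, "Kids bikes"))
```

Under this model "Kids Mountain Bikes" matches because it contains both `kid` and `bike` after analysis, while "eBikes" and "Commuter bikes" lack the `kid` term.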
