# Astra DB with AstraPy

Learn how to use your Astra DB database with AstraPy.

In this quickstart, you'll create a vector collection, store a few documents in it, and run **vector searches** against it.

_Prerequisites:_ You need an Astra DB instance, along with its *Token* and *API Endpoint*
(read more [here](https://docs.datastax.com/en/astra/home/astra.html)).

## Setup

In [None]:
!pip install --quiet --upgrade astrapy openai langchain cassio tiktoken gradio

### Import needed libraries

In [None]:
import os, json
from getpass import getpass

from astrapy.db import AstraDB

### Provide database credentials

These are the connection parameters on your Astra dashboard. Example values:

- API Endpoint: `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`
- Token: `AstraCS:6gBhNmsk135...`


In [None]:
ASTRA_DB_API_ENDPOINT = ""
ASTRA_DB_APPLICATION_TOKEN = ""

In [None]:
os.environ['OPENAI_API_KEY'] = ""
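
Rather than pasting secrets into a cell, the values can be read from the environment, with an interactive prompt as a fallback. The following is a minimal sketch: `read_secret` is a hypothetical helper (not part of AstraPy or OpenAI), and the variable names it looks up are whatever you choose to pass in.

```python
import os
from getpass import getpass


def read_secret(name, prompt=getpass):
    """Return the environment variable `name`, prompting only if it is unset or empty.

    `read_secret` is a hypothetical helper, not part of any library used here.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # getpass hides the typed value, so it does not end up in the cell output
        value = prompt(f"{name}: ").strip()
    if not value:
        raise ValueError(f"{name} must not be empty")
    return value
```

For example, `ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")` would reuse an exported value if there is one and ask for it otherwise.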

## Create a collection

### Create the client

In [None]:
astra_db = AstraDB(
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    namespace='workspan'
)

### Create the collection

The `create_collection` method creates a new collection in your database.

In [None]:
collection = astra_db.create_collection("workspan_collection", dimension=1536)

Here, `dimension` is the vector dimension (or "size", i.e. how many numeric components your vector will have).

The value used here, 1536, matches the length of the vectors produced by OpenAI's `text-embedding-ada-002` model, which this quickstart uses to compute embeddings.

_Note:_ If the collection already exists and its parameters match, this method simply returns it. If you try to create a collection with the same name but a different configuration (such as a mismatched dimension), you will get an error instead.
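
Since every stored vector is expected to have exactly `dimension` components, a local check before inserting can surface a mismatch early. The sketch below is an illustration only: `check_dimension` is a hypothetical helper, and the list of zeros is a stand-in for a real embedding.

```python
def check_dimension(vector, dimension):
    """Raise ValueError if `vector` does not have exactly `dimension` components."""
    if len(vector) != dimension:
        raise ValueError(
            f"expected a vector of length {dimension}, got {len(vector)}"
        )
    return vector


EXPECTED_DIMENSION = 1536  # should match the `dimension` passed to create_collection

embedding = [0.0] * 1536  # stand-in for a real embedding
check_dimension(embedding, EXPECTED_DIMENSION)
```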

## Insert documents

When working with vector stores, your documents can have arbitrary fields, as long as their names use only letters, digits, and the `_` (underscore) character, preferably in `snake_case`.

In particular, note the reserved dollar sign in the field names `$vector` and `$similarity`.
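
The naming rule above can be checked before a document is sent. This is a minimal sketch under the assumption that only top-level keys matter and that `$vector` and `$similarity` are the only dollar-prefixed names in play; `invalid_field_names` is a hypothetical helper, not an AstraPy function.

```python
import re

# Plain field names: letters, digits and underscores only (the rule stated above).
FIELD_NAME = re.compile(r"[A-Za-z0-9_]+")
# Dollar-prefixed names that carry special meaning in this quickstart.
RESERVED = {"$vector", "$similarity"}


def invalid_field_names(document):
    """Return the top-level keys of `document` that break the naming rule."""
    return [
        key for key in document
        if key not in RESERVED and not FIELD_NAME.fullmatch(key)
    ]


print(invalid_field_names({"_id": "1", "customer_id": "CUS100", "$vector": [0.1]}))  # []
```

A key containing spaces or punctuation would show up in the returned list.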

In [None]:
from openai import OpenAI

client = OpenAI(
  api_key=os.environ['OPENAI_API_KEY'],  # this is also the default, it can be omitted
)

embedding_model_name = "text-embedding-ada-002"

In [None]:
import uuid

# record #1

uuid_rec1 = str(uuid.uuid1())

next_step = f"""
Action Items:
From Michael, confirmed deprioritize. From Anjaney, account executive interest to schedule meeting - Anjaney to schedule call with Nirav/Amy on R&D.
"""

cadence = f"""
Next Step:
08/16/2023 : Review partner information updates and update opportunity details. 8/17(LR) - connecting with Partner to offer co-sell support

Next Step History:
null;08/16/2023 : Review partner information updates and update opportunity details.;08/16/2023 : Review partner information updates and update opportunity details. 8/17(LR) - connecting with Partner to offer co-sell support
"""

metadata = {"customer_id": 'CUS100', "partner_id": 'AWS', "opportunity_id": 'WS-7202838a', "customer_name": 'Teradyne, Inc.' }
next_step_and_cadence_rec1 = "{} : {} : {}".format(metadata, next_step, cadence)

vector_embedding_rec1 = client.embeddings.create(
        input=next_step_and_cadence_rec1,
        model=embedding_model_name,
    ).data[0].embedding


# record #2

uuid_rec2 = str(uuid.uuid1())

next_step = f"""
Action Items:
From Autumn, send recording of last call and our discussed inputs from demo 8/28. Ramesh will provide to Caroline by early next week (of 9/11).
"""

cadence = f"""
REVIEW TECH & Economic Proposal
"""

metadata = {"customer_id": 'CUS100', "partner_id": 'AWS', "opportunity_id": 'WS-8a038b8a', "customer_name": 'Teradyne, Inc.' }
next_step_and_cadence_rec2 = "{} : {} : {}".format(metadata, next_step, cadence)

vector_embedding_rec2 = client.embeddings.create(
        input=next_step_and_cadence_rec2,
        model=embedding_model_name,
    ).data[0].embedding


# record #3

uuid_rec3 = str(uuid.uuid1())

next_step = f"""
Action Items:
Joint sync set for 9/7. Enablement session to follow + in person account mapping. Caroline / Michael to begin coordinating. EAI presence
"""

cadence = f"""
07/05/2023: Contact Federico Gandolfo,federico.hernan.gandolfo@abc.com,+54.911.3204.4871 to discuss Deal support
"""

metadata = {"customer_id": 'CUS100', "partner_id": 'AWS', "opportunity_id": 'WS-8a3b0348', "customer_name": 'Teradyne, Inc.' }
next_step_and_cadence_rec3 = "{} : {} : {}".format(metadata, next_step, cadence)

vector_embedding_rec3 = client.embeddings.create(
        input=next_step_and_cadence_rec3,
        model=embedding_model_name,
    ).data[0].embedding


# record #4

uuid_rec4 = str(uuid.uuid1())

next_step = f"""
Action Items:
From Caroline, user community engaged to respond to questions. @Dataiku - How can we get initial data from user community/pull together PoV for client? Action (Asan/Ken (sp?)): In-person outreach to Deloitte users and follow-up to 5 responses received.
"""

cadence = f"""
null;06/20/2023: Contact Federico Gandolfo,federico.hernan.gandolfo@abc.com,+54.911.3204.4871 to discuss Deal support;07/05/2023: Contact Federico Gandolfo,federico.hernan.gandolfo@abc.com,+54.911.3204.4871 to discuss Deal support
"""

metadata = {"customer_id": 'CUS100', "partner_id": 'AWS', "opportunity_id": 'WS-8a7128a3', "customer_name": 'Teradyne, Inc.' }
next_step_and_cadence_rec4 = "{} : {} : {}".format(metadata, next_step, cadence)

vector_embedding_rec4 = client.embeddings.create(
        input=next_step_and_cadence_rec4,
        model=embedding_model_name,
    ).data[0].embedding


# record #5

uuid_rec5 = str(uuid.uuid1())

next_step = f"""
Propsal did not go thru. No budget Left. Negative.
"""

cadence = f"""
No further follow up required.
"""

metadata = {"customer_id": 'CUS101', "partner_id": 'AWS', "opportunity_id": 'WS-8a7128a4', "customer_name": 'Cisco, Inc.' }
next_step_and_cadence_rec5 = "{} : {} : {}".format(metadata, next_step, cadence)

vector_embedding_rec5 = client.embeddings.create(
        input=next_step_and_cadence_rec5,
        model=embedding_model_name,
    ).data[0].embedding
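
The five records above follow the same pattern: format the metadata and text into one string, embed it, and keep the pieces for later insertion. A helper can capture that pattern once. The sketch below is an assumption-laden illustration: `build_document` is a hypothetical function, and `fake_embed` is a placeholder so the code runs without an API key.

```python
import uuid


def build_document(metadata, next_step, cadence, embed):
    """Assemble one document: metadata fields, the embedded text, and its vector.

    `embed` is any callable mapping a string to a list of floats.
    """
    description = "{} : {} : {}".format(metadata, next_step, cadence)
    return {
        "_id": str(uuid.uuid1()),
        **metadata,
        "description": description,
        "$vector": embed(description),
    }


def fake_embed(text):
    # Placeholder so the sketch runs offline; NOT a real embedding.
    return [float(len(text)), 0.0, 1.0]


doc = build_document(
    {"customer_id": "CUS100", "partner_id": "AWS"},
    "Joint sync set for 9/7.",
    "REVIEW TECH & Economic Proposal",
    embed=fake_embed,
)
print(sorted(doc))
```

With the OpenAI client from the earlier cell, `embed` could be `lambda text: client.embeddings.create(input=text, model=embedding_model_name).data[0].embedding`.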



### Insert multiple documents

In [None]:
v_doc_list = [
    {
        "_id": uuid_rec1,
        'customer_id': 'CUS100',
        'partner_id': 'AWS',
        'opportunity_id': 'WS-7202838a',
        'customer_name': 'Teradyne, Inc.',
        "description": next_step_and_cadence_rec1,
        "$vector": vector_embedding_rec1,
    },
    {
        "_id": uuid_rec2,
        'customer_id': 'CUS100',
        'partner_id': 'AWS',
        'opportunity_id': 'WS-8a038b8a',
        'customer_name': 'Teradyne, Inc.',
        "description": next_step_and_cadence_rec2,
        "$vector": vector_embedding_rec2,
    },
    {
        "_id": uuid_rec3,
        'customer_id': 'CUS100',
        'partner_id': 'AWS',
        'opportunity_id': 'WS-8a3b0348',
        'customer_name': 'Teradyne, Inc.',
        "description": next_step_and_cadence_rec3,
        "$vector": vector_embedding_rec3,
    },
    {
        "_id": uuid_rec4,
        'customer_id': 'CUS100',
        'partner_id': 'AWS',
        'opportunity_id': 'WS-8a7128a3',
        'customer_name': 'Teradyne, Inc.',
        "description": next_step_and_cadence_rec4,
        "$vector": vector_embedding_rec4,
    },
    {
        "_id": uuid_rec5,
        'customer_id': 'CUS101',
        'partner_id': 'AWS',
        'opportunity_id': 'WS-8a7128a4',
        'customer_name': 'Cisco, Inc.',
        'aws_partner_sales_stage': 'Technical Validation',
        "description": next_step_and_cadence_rec5,
        "$vector": vector_embedding_rec5,
    },
]

response = collection.insert_many(v_doc_list)
print(response)

{'status': {'insertedIds': ['f554f99a-8be6-11ee-ad16-0242ac1c000c', 'f578fa70-8be6-11ee-ad16-0242ac1c000c', 'f593e664-8be6-11ee-ad16-0242ac1c000c', 'f5ab0830-8be6-11ee-ad16-0242ac1c000c', 'f5c12480-8be6-11ee-ad16-0242ac1c000c']}}
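
The response lists the IDs that were written, so it can be compared against what was sent. A minimal sketch, assuming the response keeps the `{'status': {'insertedIds': [...]}}` shape shown above; `missing_ids` is a hypothetical helper and the data is a stand-in.

```python
def missing_ids(documents, response):
    """Return the `_id`s from `documents` that are absent from the response."""
    inserted = set(response.get("status", {}).get("insertedIds", []))
    return [doc["_id"] for doc in documents if doc["_id"] not in inserted]


# Stand-in data shaped like the output above.
docs = [{"_id": "a"}, {"_id": "b"}, {"_id": "c"}]
resp = {"status": {"insertedIds": ["a", "c"]}}
print(missing_ids(docs, resp))  # ['b']
```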


## Find documents

Find by `opportunity_id`:

In [None]:
document = collection.find_one(filter={"opportunity_id":"WS-8a7128a4"})
print(document)

The same `filter` argument accepts any (non-vector) filter clause, for example `{"customer_id": "CUS101"}`.

### Find by vector similarity

By default, the `$similarity` field is returned with each document (note the decreasing order):

In [None]:
query_vector = client.embeddings.create(
                    input="What are the next steps?",
                    model=embedding_model_name,
                ).data[0].embedding

documents = collection.vector_find(query_vector, limit=5)
for document in documents:
    print(f"\n{document}")
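
To make the "decreasing order" concrete, here is a brute-force illustration of similarity ranking on toy two-dimensional vectors. It assumes cosine similarity; the metric your collection actually uses, the scale of the `$similarity` value it reports, and the way the database performs the search may all differ. `cosine_similarity` and `rank` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, docs, limit):
    """Return up to `limit` docs, most similar first, each with a `$similarity` field."""
    scored = [
        {**doc, "$similarity": cosine_similarity(query, doc["$vector"])}
        for doc in docs
    ]
    scored.sort(key=lambda d: d["$similarity"], reverse=True)
    return scored[:limit]


toy_docs = [
    {"_id": "x", "$vector": [1.0, 0.0]},
    {"_id": "y", "$vector": [0.0, 1.0]},
    {"_id": "z", "$vector": [1.0, 1.0]},
]
for doc in rank([1.0, 0.1], toy_docs, limit=2):
    print(doc["_id"], round(doc["$similarity"], 3))
```

The query points almost along the first axis, so `x` ranks first, `z` second, and `y` is cut off by the limit.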

You can specify which **fields** you'll get back and/or whether you need the **similarity** as well:

In [None]:
documents = collection.vector_find(
    query_vector,
    limit=5,
    fields=["customer_id", "partner_id", "opportunity_id"],  # only these fields are returned
    include_similarity=False,
)
for document in documents:
    print(f"\n{document}")

You can compound with other `filter` clauses, effectively implementing **metadata filtering** on your vector searches:

In [None]:
documents = collection.vector_find(
    query_vector,
    limit=5,
    filter={"customer_id": "CUS100"},
)
for document in documents:
    print(f"\n{document}")
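
Conceptually, a metadata filter narrows the set of candidate documents, and the similarity ranking then applies to what remains. The sketch below illustrates that idea locally with equality-only clauses and a made-up `toy_score` field standing in for similarity; it says nothing about how the database evaluates filters internally, and `matches` and `filtered_search` are hypothetical helpers.

```python
def matches(doc, criteria):
    """True if every key in `criteria` equals the corresponding field of `doc`."""
    return all(doc.get(key) == value for key, value in criteria.items())


def filtered_search(docs, score, criteria, limit):
    """Keep docs matching `criteria`, then return the `limit` highest-scoring ones."""
    candidates = [doc for doc in docs if matches(doc, criteria)]
    return sorted(candidates, key=score, reverse=True)[:limit]


toy_docs = [
    {"_id": "a", "customer_id": "CUS100", "toy_score": 0.2},
    {"_id": "b", "customer_id": "CUS101", "toy_score": 0.9},
    {"_id": "c", "customer_id": "CUS100", "toy_score": 0.7},
]
hits = filtered_search(
    toy_docs,
    score=lambda d: d["toy_score"],
    criteria={"customer_id": "CUS100"},
    limit=5,
)
print([d["_id"] for d in hits])  # ['c', 'a']
```

Document `b` has the highest score but is excluded because its `customer_id` does not match.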

These options are supported for the `vector_find_one` method as well:

In [None]:
document = collection.vector_find_one(
    query_vector,
    fields=["opportunity_id"],
    include_similarity=True,  # not really necessary since True is the default
)
print(document)

## UI Demo

The `llm_response_openai` function implements the following steps:
1.   Generate an OpenAI embedding corresponding to the user's question.
2.   Retrieve records from Astra Vector DB based on vector similarity search for the generated embedding.
3.   Build a prompt for the Large Language Model (LLM).
4.   Generate LLM results based on the prompt.

In [None]:
prompt_template_str = """Human: Use the following pieces of context to provide a concise answer to the question at the end.
                      If you don't know the answer, just say that you don't know, don't try to make up an answer.

                      <context>
                      {context}
                      </context>

                      Question: {question}

                      Assistant:"""

def llm_response_openai(question: str, verbose: bool = False) -> str:
    if verbose:
        print(f"\n[answer_question] Question: {question}")
    # Retrieval of the most relevant stored documents from the vector store:

    query_vector = client.embeddings.create(
                    input=question,
                    model=embedding_model_name,
                ).data[0].embedding

    context_docs = collection.vector_find(
        query_vector,
        limit=5,
        fields=["description"],  # only the text is needed to build the context
        include_similarity=False,
    )
    context = "\n".join(doc['description'] for doc in context_docs)
    if verbose:
        print("\n[answer_question] Context:")
        print(context)
    # Filling the prompt template with the current values
    llm_prompt_str = prompt_template_str.format(
        context=context,
        question=question
    )

    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Use the instructions provided to process the information."},
            {"role": "user", "content": llm_prompt_str}
        ]
    )
    if verbose:
        print(response)
    return response.choices[0].message.content
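
The context-assembly and template-filling steps can be exercised on their own, without calling any API. The following is a minimal sketch with a shortened stand-in template; `build_prompt` is a hypothetical helper, and `max_chars` is an assumed, illustrative cap on context length rather than a model limit.

```python
TEMPLATE = """Use the following context to answer the question.

<context>
{context}
</context>

Question: {question}"""


def build_prompt(question, docs, max_chars=4000):
    """Join the `description` of each doc and fill the template.

    Docs without a `description` field are skipped.
    """
    context = "\n".join(doc["description"] for doc in docs if "description" in doc)
    # Braces inside `context` are safe: str.format does not re-parse argument values.
    return TEMPLATE.format(context=context[:max_chars], question=question)


sample_docs = [
    {"description": "{'customer_id': 'CUS100'} : Joint sync set for 9/7."},
    {"opportunity_id": "WS-8a3b0348"},  # no description, skipped
]
print(build_prompt("What are the next steps?", sample_docs))
```

Printing the filled prompt this way makes it easy to see exactly what text the model would receive for a given set of retrieved documents.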

### UI for testing sample queries

Sample queries:

1.   Can you list the opportunities for 'customer_id': 'CUS100'?
2.   Can you list the opportunities for 'customer_id': 'CUS101'?
3.   What are the next steps for each opportunity for the customer 'customer_id': 'CUS100'?
4.   What are the identified challenges for the customer 'customer_id': 'CUS100'?

In [None]:
import gradio as gr

def predict(message, history):
    response = llm_response_openai(message)
    return response

gr.ChatInterface(predict).launch()
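
The `predict` function above ignores `history`, so every question is answered in isolation. If earlier turns should be taken into account, they can be converted to chat messages first. This is a sketch under an assumption: it expects `history` as a list of `(user, assistant)` pairs, which some Gradio versions pass to `ChatInterface` callbacks, while others may pass a list of role/content dictionaries instead. `history_to_messages` is a hypothetical helper.

```python
def history_to_messages(history, message, system_prompt="You are a helpful assistant."):
    """Convert (user, assistant) pairs plus the new message into chat messages."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_turn, assistant_turn in history:
        messages.append({"role": "user", "content": user_turn})
        if assistant_turn is not None:
            # A pending turn may have no assistant reply yet
            messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": message})
    return messages


example = history_to_messages(
    [("List the opportunities for CUS100.", "There are four opportunities.")],
    "What are the next steps for each?",
)
print([m["role"] for m in example])  # ['system', 'user', 'assistant', 'user']
```

The resulting list has the same role/content structure as the `messages` argument used in `llm_response_openai`.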