# Philosophy with Vector Embeddings, OpenAI and Cassandra / Astra DB

In this quickstart you will learn how to build a "philosophy quote finder & generator" using OpenAI's vector embeddings and DataStax Astra DB (_or a vector-capable Apache Cassandra® cluster, if you prefer_) as the vector store for data persistence.

The basic workflow of this notebook is outlined below. You will evaluate and store the vector embeddings for a number of quotes by famous philosophers and later use them to build a powerful search engine and, after that, a generator of new quotes!

The notebook exemplifies some of the standard usage patterns of vector search -- and also shows how easy it is to get started with the [Vector capabilities of Astra DB](https://astra.datastax.com/).

_Choose-your-framework_

Please note that this notebook uses the [CassIO library](https://cassio.org), but we cover other choices of technology to accomplish the same task. Check out this folder's [README](https://github.com/openai/openai-cookbook/tree/main/examples/vector_databases/cassandra_astradb) for other options. This notebook can run either as a Colab notebook or as a regular Jupyter notebook.

Table of contents:
- Setup
- Get DB connection
- Connect to OpenAI
- Load quotes into the Vector Store
- Use case 1: **quote search engine**
- Use case 2: **quote generator**
- (Optional) exploit partitioning in the Vector Store

### How it works

**Indexing**

Each quote is made into an embedding vector with OpenAI's `Embedding`. These are saved in the Vector Store along with some metadata, including the author's name and a few other pre-computed tags, for later customization of the search.

![1_vector_indexing](https://user-images.githubusercontent.com/14221764/262085997-215c3854-a004-45f0-8afc-51b924b059a0.png)

**Search**

To find a quote similar to the provided search quote, the latter is made into an embedding vector on the fly, and this vector is used to query the store for similar vectors, i.e. similar quotes that were previously indexed. The search can optionally be constrained by additional metadata ("find me quotes by Spinoza similar to this one ...").

![2_vector_search](https://user-images.githubusercontent.com/14221764/262086005-5824d690-b8a4-4cbe-a6fd-a43fa785f8dc.png)
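To make the search step concrete, here is a minimal in-memory sketch of "filter by metadata, then rank by similarity". It is an illustration only: the names `toy_store` and `toy_search` and the three-dimensional vectors are made up, and a real vector store would use an approximate-nearest-neighbor index rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_search(store, query_vector, n, metadata=None):
    """Brute-force search: keep rows matching all metadata, rank by similarity."""
    metadata = metadata or {}
    candidates = [
        row for row in store
        if all(row["metadata"].get(key) == value for key, value in metadata.items())
    ]
    candidates.sort(
        key=lambda row: cosine_similarity(row["vector"], query_vector),
        reverse=True,
    )
    return [(row["body"], row["metadata"]["author"]) for row in candidates[:n]]

# Made-up 3-dimensional "embeddings", for illustration only
toy_store = [
    {"body": "quote A", "vector": [1.0, 0.0, 0.0],
     "metadata": {"author": "spinoza", "ethics": True}},
    {"body": "quote B", "vector": [0.0, 1.0, 0.0],
     "metadata": {"author": "kant"}},
    {"body": "quote C", "vector": [0.6, 0.8, 0.0],
     "metadata": {"author": "spinoza"}},
]

# Unconstrained search, then a search restricted to one author
print(toy_search(toy_store, [0.9, 0.1, 0.0], n=2))
print(toy_search(toy_store, [0.0, 1.0, 0.0], n=1, metadata={"author": "spinoza"}))
```

Note how the second query, although closest to "quote B", returns "quote C" because the author constraint is applied before ranking.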

The key point here is that "similar quotes" translates, in vector space, to vectors that are metrically close to each other: thus, vector similarity search effectively implements semantic similarity. _This is the key reason vector embeddings are so powerful._

The sketch below tries to convey this idea. Each quote, once it's made into a vector, is a point in space. Well, in this case it's on a sphere, since OpenAI's embedding vectors, like most others, are normalized to _unit length_. Oh, and the sphere does not actually sit in a three-dimensional space, but rather in a 1536-dimensional one!

![3_vector_space](https://user-images.githubusercontent.com/14221764/262086007-f417c44b-c048-47f8-9dbd-140472798b6d.png)
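A small numerical check may help here. For vectors of unit length, cosine similarity reduces to a plain dot product, and the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so "metrically close" and "high similarity" say the same thing. The sketch below uses two arbitrary three-dimensional vectors (not real embeddings) to verify this.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two arbitrary vectors, normalized so that they lie on the unit sphere
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

cos_sim = dot(u, v)  # for unit vectors, cosine similarity is just the dot product
dist = euclidean_distance(u, v)

# On the unit sphere: distance^2 = 2 - 2 * cosine_similarity
assert math.isclose(dist ** 2, 2 - 2 * cos_sim)
print(f"cosine similarity = {cos_sim:.4f}, distance = {dist:.4f}")
```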

**Generation**

Given a suggestion (a topic or a tentative quote), the search step is performed, and the first returned results (quotes) are fed into an LLM prompt which asks the generative model to invent a new text along the lines of the passed examples _and_ the initial suggestion.

![4_quote_generation](https://user-images.githubusercontent.com/14221764/262087157-784117ff-7c56-45bc-9c76-577d09aea19a.png)
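As a rough sketch of this step, the retrieved quotes and the initial suggestion are simply combined into one text prompt. The helper name `build_generation_prompt`, the prompt wording, and the placeholder quotes below are illustrative assumptions, not necessarily the exact prompt used later in the notebook.

```python
def build_generation_prompt(topic, example_quotes):
    """Combine the user's topic and the retrieved quotes into one LLM prompt."""
    examples = "\n".join(f"  - {quote}" for quote in example_quotes)
    return (
        "Generate a single short philosophical quote on the given topic,\n"
        "similar in spirit and form to the provided example quotes.\n\n"
        f'REFERENCE TOPIC: "{topic}"\n\n'
        f"EXAMPLES:\n{examples}\n"
    )

# Placeholder quotes stand in for the results of the search step
prompt = build_generation_prompt(
    "politics and virtue",
    ["First retrieved quote.", "Second retrieved quote."],
)
print(prompt)
```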

## Setup

First install some required packages:

In [1]:
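# The code in this notebook uses the pre-1.0 interface of the openai package
# (openai.Embedding, openai.ChatCompletion), hence the version constraint:
!pip install cassio "openai<1.0"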

## Get DB connection

A couple of secrets are required to create a `Session` object (a connection to your Astra DB instance).

_(Note: some steps will be slightly different on Google Colab and on local Jupyter, which is why we detect the runtime type.)_

In [2]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

In [3]:
try:
    from google.colab import files
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False

In [4]:
# We need your database's Secure Connect Bundle zip file:
from getpass import getpass
if IS_COLAB:
    # Upload your Secure Connect Bundle zipfile:
    import os
    from google.colab import files


    print('Please upload your Secure Connect Bundle zipfile: ')
    uploaded = files.upload()
    if uploaded:
        astraBundleFileTitle = list(uploaded.keys())[0]
        ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
    else:
        raise ValueError(
            'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
        )
else:
    # you are running a local-jupyter notebook:
    ASTRA_DB_SECURE_BUNDLE_PATH = input("Please provide the full path to your Secure Connect Bundle zipfile: ")

ASTRA_DB_APPLICATION_TOKEN = getpass("Please provide your Database Token ('AstraCS:...' string): ")
ASTRA_DB_KEYSPACE = input("Please provide the Keyspace name for your Database: ")

Please provide the full path to your Secure Connect Bundle zipfile:  /path/to/secure-connect-DATABASE.zip
Please provide your Database Token ('AstraCS:...' string):  ········
Please provide the Keyspace name for your Database:  my_keyspace


#### Creation of the DB connection

This is how you create a connection to Astra DB:

_(Incidentally, you could also use any Cassandra cluster (as long as it provides Vector capabilities), just by changing the parameters to the following `Cluster` instantiation.)_

In [5]:
cluster = Cluster(
    cloud={
        "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
    },
    auth_provider=PlainTextAuthProvider(
        "token",
        ASTRA_DB_APPLICATION_TOKEN,
    ),
)

session = cluster.connect()
keyspace = ASTRA_DB_KEYSPACE
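For a self-managed, vector-capable Cassandra cluster, the connection could look roughly like the sketch below. The helper name `connect_to_cassandra`, the address, and the credentials are placeholders; whether you need an auth provider at all depends on how your cluster is configured.

```python
def connect_to_cassandra(contact_points, username, password, port=9042):
    """Open a session to a self-managed Cassandra cluster (sketch)."""
    # Imported lazily so that defining this helper does not require the driver
    from cassandra.cluster import Cluster
    from cassandra.auth import PlainTextAuthProvider

    cluster = Cluster(
        contact_points=contact_points,
        port=port,
        auth_provider=PlainTextAuthProvider(username=username, password=password),
    )
    return cluster.connect()

# Hypothetical usage (address, credentials and keyspace are placeholders):
# session = connect_to_cassandra(["127.0.0.1"], "cassandra", "cassandra")
# keyspace = "my_keyspace"
```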

#### Creation of the Vector Store through CassIO

We need a table which supports vectors and is equipped with metadata. Let's call it "philosophers":

In [6]:
# create a vector store with cassIO
from cassio.table import MetadataVectorCassandraTable

In [7]:
v_table = MetadataVectorCassandraTable(
    session,
    keyspace,
    "philosophers",
    vector_dimension=1536,
)

## Connect to OpenAI

#### Set up your secret key

In [8]:
OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")

Please enter your OpenAI API Key:  ········


In [9]:
import openai

openai.api_key = OPENAI_API_KEY

### A test call for embeddings

Let us quickly check how one can get the embedding vectors for a list of input texts:

In [10]:
embedding_model_name = "text-embedding-ada-002"

result = openai.Embedding.create(
    input=[
        "This is a sentence",
        "A second sentence"
    ],
    engine=embedding_model_name,
)

In [11]:
print(f"len(result.data)              = {len(result.data)}")
print(f"result.data[1].embedding      = {str(result.data[1].embedding)[:55]}...")
print(f"len(result.data[1].embedding) = {len(result.data[1].embedding)}")

len(result.data)              = 2
result.data[1].embedding      = [-0.010772699490189552, 0.0013737495755776763, 0.003638...
len(result.data[1].embedding) = 1536
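The test call above goes through `openai.Embedding.create`, an interface that was removed in version 1.0 of the `openai` package. If you are on a newer release, a roughly equivalent call can be made through a client object, as in this sketch. The helper name `embed_texts` is hypothetical and not part of the notebook.

```python
def embed_texts(texts, api_key, model="text-embedding-ada-002"):
    """Return one embedding (a list of floats) per input text, using openai>=1.0."""
    from openai import OpenAI  # imported lazily; requires openai>=1.0

    client = OpenAI(api_key=api_key)
    result = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in result.data]

# Hypothetical usage:
# vectors = embed_texts(["This is a sentence", "A second sentence"], OPENAI_API_KEY)
# print(len(vectors), len(vectors[0]))
```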


## Load quotes into the Vector Store

Let's get a JSON file containing our quotes. We already prepared this collection and put it into this repo for quick loading.

_(Note: we adapted the following from a Kaggle dataset -- which we acknowledge -- and also added a few tags to each quote.)_

In [12]:
import json
import requests

if IS_COLAB:
    # load from Web request to (github) repo
    json_url = "https://raw.githubusercontent.com/hemidactylus/openai-cookbook/SL-cassandra_astra_vector/examples/vector_databases/cassandra_astradb/sources/philo_quotes.json"
    quote_dict = json.loads(requests.get(json_url).text)
else:
    # load from local repo (the file is closed as soon as it has been read)
    with open("./sources/philo_quotes.json") as quote_file:
        quote_dict = json.load(quote_file)

A quick inspection of the input data structure:

In [13]:
print(quote_dict["source"])

total_quotes = sum(len(quotes) for quotes in quote_dict["quotes"].values())
print(f"\nQuotes loaded: {total_quotes}.\nBy author:")
print("\n".join(f"  {author} ({len(quotes)})" for author, quotes in quote_dict["quotes"].items()))

print("\nSome examples:")
for author, quotes in list(quote_dict["quotes"].items())[:2]:
    print(f"  {author}:")
    for quote in quotes[:2]:
        print(f"    {quote['body'][:50]} ... (tags: {', '.join(quote['tags'])})")

Adapted from this Kaggle dataset: https://www.kaggle.com/datasets/mertbozkurt5/quotes-by-philosophers (License: CC BY-NC-SA 4.0)

Quotes loaded: 450.
By author:
  aristotle (50)
  freud (50)
  hegel (50)
  kant (50)
  nietzsche (50)
  plato (50)
  sartre (50)
  schopenhauer (50)
  spinoza (50)

Some examples:
  aristotle:
    True happiness comes from gaining insight and grow ... (tags: knowledge)
    The roots of education are bitter, but the fruit i ... (tags: education, knowledge)
  freud:
    We are what we are because we have been what we ha ... (tags: history)
    From error to error one discovers the entire truth ... (tags: )


#### Reduce dataset if you wish

If you want to run a smaller-scale demo, feel free to adjust the numbers below and run the cell -- it will simply reduce the number of quotes to insert.

In [14]:
# Set parameters and run this cell to run on a shortened data set
AUTHORS_TO_USE = 9  # all: 9
QUOTES_PER_AUTHOR = 6  # all: 50

quote_dict["quotes"] = {
    author: quotes[:QUOTES_PER_AUTHOR]
    for author, quotes in list(quote_dict["quotes"].items())[:AUTHORS_TO_USE]
}

### Insert quotes into vector store

We will compute the embeddings for the quotes and save them into the Vector Store, along with the text itself and the metadata we plan to use later. Note that we add the author as a metadata field in addition to the "tags" already found with the quote itself.

To speed things up and reduce the number of API calls, we batch the requests to the OpenAI embedding service, with one batch per author.

_(Note: for faster execution, Cassandra and CassIO would let you do concurrent inserts, which we don't do here in order to keep the demo code straightforward.)_

In [15]:
for philosopher, quotes in quote_dict["quotes"].items():
    print(f"{philosopher}: ", end="")
    result = openai.Embedding.create(
        input=[quote["body"] for quote in quotes],
        engine=embedding_model_name,
    )
    for quote_idx, (quote, q_data) in enumerate(zip(quotes, result.data)):
        v_table.put(
            row_id=f"q_{philosopher}_{quote_idx}",
            body_blob=quote["body"],
            vector=q_data.embedding,
            metadata={**{tag: True for tag in quote["tags"]}, **{"author": philosopher}},
        )
        print("*", end='')
    print(" Done.")
print("Finished inserting.")

aristotle: ****** Done.
freud: ****** Done.
hegel: ****** Done.
kant: ****** Done.
nietzsche: ****** Done.
plato: ****** Done.
sartre: ****** Done.
schopenhauer: ****** Done.
spinoza: ****** Done.
Finished inserting.
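As a sketch of what concurrent inserts could look like, the helper below fans the writes out over a thread pool from the standard library. Everything here is an assumption for illustration: `insert_concurrently` is a hypothetical helper, and `fake_put` merely stands in for `v_table.put` so that the example runs without a database. Before using this pattern with a real table, check the CassIO and driver documentation for their own concurrency facilities.

```python
from concurrent.futures import ThreadPoolExecutor

def insert_concurrently(put_row, rows, max_workers=8):
    """Call put_row(**row) for every row from a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(put_row, **row) for row in rows]
        # .result() re-raises any exception raised in the worker thread
        for future in futures:
            future.result()
    return len(futures)

# Stand-in for v_table.put, used here only to make the sketch runnable
stored = {}

def fake_put(row_id, body_blob, vector, metadata):
    stored[row_id] = (body_blob, vector, metadata)

rows = [
    {
        "row_id": f"q_demo_{i}",
        "body_blob": f"quote {i}",
        "vector": [0.0] * 4,
        "metadata": {"author": "demo"},
    }
    for i in range(10)
]
print(insert_concurrently(fake_put, rows))
```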


## Use case 1: **quote search engine**

For the quote-search functionality, we first need to make the input quote into a vector and then use it to query the store, passing any optional metadata constraints along with the search call.

Let's encapsulate the search-engine functionality into a function for ease of re-use:

In [16]:
def find_quote_and_author(query_quote, n, author=None, tags=None):
    query_vector = openai.Embedding.create(
        input=[query_quote],
        engine=embedding_model_name,
    ).data[0].embedding
    metadata = {}
    if author:
        metadata["author"] = author
    if tags:
        for tag in tags:
            metadata[tag] = True
    #
    results = v_table.ann_search(
        query_vector,
        n=n,
        metadata=metadata,
    )
    return [
        (result["body_blob"], result["metadata"]["author"])
        for result in results
    ]

#### Putting search to the test

Passing just a quote:

In [17]:
find_quote_and_author("We struggle all our life for nothing", 3)

[('To live is to suffer, to survive is to find some meaning in the suffering.',
  'nietzsche'),
 ('The meager satisfaction that man can extract from reality leaves him starving.',
  'freud'),
 ('The valor that struggles is better than the weakness that endures.',
  'hegel')]

Search restricted to an author:

In [18]:
find_quote_and_author("We struggle all our life for nothing", 2, author="nietzsche")

[('To live is to suffer, to survive is to find some meaning in the suffering.',
  'nietzsche'),
 ('Everything the State says is a lie, and everything it has it has stolen.',
  'nietzsche')]

Search constrained to a tag (out of those we saved earlier with the quotes):

In [19]:
find_quote_and_author("We struggle all our life for nothing", 2, tags=["politics"])

[('Everything the State says is a lie, and everything it has it has stolen.',
  'nietzsche'),
 ('He who seeks equality between unequals seeks an absurdity.', 'spinoza')]

## Use case 2: **quote generator**

For this task we need another component from OpenAI, namely an LLM to generate the quote for us (based on input we obtain by querying the Vector Store).

We also need a working template for the prompt that will be filled for the generate-quote LLM completion task.

In [20]:
completion_model_name = "gpt-3.5-turbo"

generation_prompt_template = """Generate a single short philosophical quote on the given topic,
similar in spirit and form to the provided actual example quotes.
Do not exceed 20-30 words in your quote.

REFERENCE TOPIC: "{topic}"

ACTUAL EXAMPLES:
{examples}
"""

As we did for search, this functionality is best wrapped into a handy function (which internally uses search):

In [21]:
def generate_quote(topic, n=2, author=None, tags=None):
    quotes = find_quote_and_author(query_quote=topic, n=n, author=author, tags=tags)
    if quotes:
        prompt = generation_prompt_template.format(
            topic=topic,
            examples="\n".join(f"  - {quote[0]}" for quote in quotes),
        )
        # a little logging:
        print("** quotes found:")
        for q, a in quotes:
            print(f"**    - {q} ({a})")
        print("** end of logging")
        #
        response = openai.ChatCompletion.create(
            model=completion_model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=320,
        )
        return response.choices[0].message.content.replace('"', '').strip()
    else:
        print("** no quotes found.")
        return None

#### Putting quote generation to the test

Just passing a text (a "quote", but we can actually just suggest a topic since its vector embedding will still end up at the right place in the vector space):

In [22]:
q_topic = generate_quote("politics and virtue")
print("\nA new generated quote:")
print(q_topic)

** quotes found:
**    - The roots of education are bitter, but the fruit is sweet. (aristotle)
**    - The valor that struggles is better than the weakness that endures. (hegel)
** end of logging

A new generated quote:
Power corrupts, but virtue is the compass that guides politics toward justice and progress.


Use inspiration from just a single philosopher:

In [23]:
q_topic = generate_quote("animals", author="schopenhauer")
print("\nA new generated quote:")
print(q_topic)

** quotes found:
**    - The assumption that animals are without rights, and the illusion that our treatment of them has no moral significance, is a positively outrageous example of Western crudity and barbarity. Universal compassion is the only guarantee of morality. (schopenhauer)
**    - It is difficult to keep quiet if you have nothing to do (schopenhauer)
** end of logging

A new generated quote:
The greatness of a society lies in its treatment of animals; our compassion towards them reflects our true morality.


## (Optional) Partitioning

TODO

## Conclusion

Congratulations! You have seen how to use OpenAI for vector embeddings and Astra DB / Cassandra for storage in order to build a sophisticated philosophical search engine and quote generator.

This example used [CassIO](https://cassio.org) to interface with the Vector Store - but this is not the only choice. Check the [README](https://github.com/openai/openai-cookbook/tree/main/examples/vector_databases/cassandra_astradb) for other options and integration with popular frameworks.

To find out more on how Astra DB can be a key ingredient in your ML/GenAI applications, visit the [Astra DB](https://astra.datastax.com/) web page.