# Using OceanBase V4.3.3 as a vector database for OpenAI embeddings

This notebook guides you step by step through using OceanBase as a vector database for OpenAI embeddings.

It presents an end-to-end process of:
1. Using precomputed embeddings created by the OpenAI API.
2. Storing the embeddings in a cloud instance of OceanBase.
3. Converting a raw text query to an embedding with the OpenAI API.
4. Using OceanBase to perform a nearest neighbour search in the created table.

### What is OceanBase V4.3.3

[OceanBase](https://www.oceanbase.com/) V4.3.3 is the first GA release of the 4.3 series and brings several key changes. First, it adds vector types and vector indexes on top of the relational database, aimed at AI analysis and processing. Second, it introduces a new column-store replica form to better isolate TP and AP resources in HTAP mixed-workload scenarios. Third, it significantly improves performance for AP query tasks. Other new features include support for complex data types in Array, faster Roaringbitmap computation, enhanced materialized view refresh, expanded foreign table functionality, faster foreign table import, and optimized plan generation and execution policies for AP SQL, all of which strengthen OLAP workloads. V4.3.3 also supports most features of V4.2.4 and earlier versions, and a converged version that is also applicable to OLTP services will be released in the future.


### Deployment options

- Follow the [OceanBase V4.3.3 Vector Database](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000001428560) documentation to deploy an instance quickly.

## Prerequisites

For the purposes of this exercise we need to prepare a couple of things:

1. An OceanBase cloud server instance.
2. The `mysql-connector-python` library (imported as `mysql.connector`) to interact with the vector database. Any other MySQL-compatible client library works as well.
3. An [OpenAI API key](https://beta.openai.com/account/api-keys).

Once the instance is running, we validate that it is reachable by running a simple `SELECT 1` query after connecting.

### Install requirements

This notebook requires the `openai` and `mysql-connector-python` packages (the latter provides the `mysql.connector` module), plus a few additional libraries. The following command installs them all:


In [None]:
! pip install openai mysql-connector-python pandas wget

### Prepare your OpenAI API key

The OpenAI API key is used to vectorize the documents and queries.

If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys.

Once you get your key, add it to your environment variables as `OPENAI_API_KEY`.

If you have any doubts about setting the API key through environment variables, please refer to [Best Practices for API Key Safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).

In [3]:
# Test that your OpenAI API key is correctly set as an environment variable
# Note: if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.
import os

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")

OPENAI_API_KEY is ready


## Connect to OceanBase

First, add the connection parameters to your environment variables, or change the `mysql.connector.connect` arguments below directly.

OceanBase is compatible with the MySQL protocol, so connecting to a running instance is easy with the `mysql.connector` Python library:

In [4]:
import os
import mysql.connector

# Note. alternatively you can set a temporary env variable like this:
# os.environ["OCEANBASE_HOST"] = "your_host"
# os.environ["OCEANBASE_PORT"] = "3306"
# os.environ["OCEANBASE_DATABASE"] = "your_database"
# os.environ["OCEANBASE_USER"] = "your_user"
# os.environ["OCEANBASE_PASSWORD"] = "your_password"

connection = mysql.connector.connect(
    host=os.environ.get("OCEANBASE_HOST", "localhost"),
    port=int(os.environ.get("OCEANBASE_PORT", "2881")),
    database=os.environ.get("OCEANBASE_DATABASE", "your_database"),
    user=os.environ.get("OCEANBASE_USER", "your_user"),
    password=os.environ.get("OCEANBASE_PASSWORD", "your_password")
)

# Create a new cursor object
cursor = connection.cursor()

# Make sure to close the connection and cursor after use
# cursor.close()
# connection.close()
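
The connection parameters above all come from environment variables with fallbacks. As a minimal sketch, they could be gathered in one place so that the port is always an integer; the helper name `oceanbase_config_from_env` is hypothetical and not part of any library.

```python
import os


def oceanbase_config_from_env(env=os.environ):
    """Collect connection settings from a mapping (hypothetical helper).

    Falls back to the same placeholder defaults used in this notebook.
    """
    return {
        "host": env.get("OCEANBASE_HOST", "localhost"),
        "port": int(env.get("OCEANBASE_PORT", "2881")),  # ports arrive as strings
        "database": env.get("OCEANBASE_DATABASE", "your_database"),
        "user": env.get("OCEANBASE_USER", "your_user"),
        "password": env.get("OCEANBASE_PASSWORD", "your_password"),
    }


# Example with an explicit mapping instead of the real environment
config = oceanbase_config_from_env({"OCEANBASE_HOST": "10.0.0.5", "OCEANBASE_PORT": "2883"})
print(config["host"], config["port"])
```

The resulting dictionary can be unpacked into the connect call, for example `mysql.connector.connect(**oceanbase_config_from_env())`.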

We can test the connection by running a simple query:

In [5]:
# Execute a simple query to test the connection
cursor.execute("SELECT 1;")
result = cursor.fetchone()

# Check the query result
if result == (1,):
    print("Connection successful!")
else:
    print("Connection failed.")

Connection successful!


Next, download the archive of precomputed Wikipedia article embeddings:

In [7]:
import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

'vector_database_wikipedia_articles_embedded.zip'

The downloaded archive then has to be extracted:

In [8]:
import zipfile
import os

current_directory = os.getcwd()
zip_file_path = os.path.join(current_directory, "vector_database_wikipedia_articles_embedded.zip")
output_directory = os.path.join(current_directory, "../../data")

# Extract the archive into the data directory
with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(output_directory)

# Check that the CSV file exists
file_name = "vector_database_wikipedia_articles_embedded.csv"
file_path = os.path.join(output_directory, file_name)


if os.path.exists(file_path):
    print(f"The file {file_name} exists in the data directory.")
else:
    print(f"The file {file_name} does not exist in the data directory.")

The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.


## Index data

OceanBase stores data in relational tables, and a table can carry one or more vector columns. Our table will be called **articles**, and each row will be described by both a **title** vector and a **content** vector.

We will start by creating the table with a vector index on both **title** and **content**, and then we will fill it with our precomputed embeddings.

In [6]:
create_table_sql = '''
CREATE TABLE IF NOT EXISTS articles (
    id INTEGER NOT NULL,
    url TEXT,
    title TEXT,
    content TEXT,
    title_vector vector(1536),
    content_vector vector(1536),
    vector_id INTEGER,
    PRIMARY KEY (`id`),
    VECTOR INDEX `content_vector` (`content_vector`) WITH (distance=L2, type=HNSW),
    VECTOR INDEX `title_vector` (`title_vector`) WITH (distance=L2, type=HNSW)
);

'''

# Execute the SQL statements
cursor.execute(create_table_sql)

# Commit the changes
connection.commit()
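
Both indexes above are declared with `distance=L2`, that is, Euclidean distance. The following plain-Python sketch only illustrates what that metric measures for two vectors of equal length; it is not OceanBase's implementation, and the function name is chosen here for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5
print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

A smaller distance means the two vectors are closer, so nearest neighbour search orders rows by ascending distance.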

## Load data

In this section we load the data prepared before this session, so you don't have to recompute the embeddings of the Wikipedia articles with your own credits.

In [None]:
import csv

# Path to your local CSV file
csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'

# Article text and embedding strings are long, so raise the csv field size limit
csv.field_size_limit(2**31 - 1)

# mysql.connector has no COPY / copy_expert, so we use parameterized batch inserts.
# The vector columns are passed as '[x1, x2, ...]' strings, as stored in the CSV.
insert_sql = '''
INSERT INTO articles (id, url, title, content, title_vector, content_vector, vector_id)
VALUES (%s, %s, %s, %s, %s, %s, %s)
'''

# Keep batches small: each row carries two 1536-dimensional vectors
batch_size = 20

# Stream the CSV and insert it in batches to keep memory usage bounded
with open(csv_file_path, 'r', encoding='utf-8', newline='') as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            cursor.executemany(insert_sql, batch)
            batch = []
    if batch:
        cursor.executemany(insert_sql, batch)

# Commit the changes
connection.commit()
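
Before or during loading, it can help to check that each embedding string parses to the 1536 dimensions declared by `vector(1536)` in the table definition. The sketch below assumes the CSV stores each vector as a JSON-style list such as `[0.1, -0.2, ...]`; `parse_vector` is a hypothetical helper, not part of any client library.

```python
import json

EXPECTED_DIM = 1536  # matches vector(1536) in the articles table


def parse_vector(text, expected_dim=EXPECTED_DIM):
    """Parse a '[x1, x2, ...]' string and verify its dimensionality."""
    values = json.loads(text)
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    return [float(v) for v in values]


# Small example with three dimensions instead of 1536
sample = "[0.5, -0.25, 1]"
print(parse_vector(sample, expected_dim=3))
```

A row whose vector fails this check would otherwise surface later as a database error in the middle of the load.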

In [9]:
# Check the table size to make sure all the rows have been stored
count_sql = """select count(*) from articles;"""
cursor.execute(count_sql)
result = cursor.fetchone()
print(f"Count:{result[0]}")

Count:25000


## Search data

Once the data is loaded into OceanBase, we can query the table for the closest vectors. The optional `vector_name` parameter switches between title-based and content-based search. Since the precomputed embeddings were created with the `text-embedding-3-small` OpenAI model, we also have to use it during search.

In [10]:
from openai import OpenAI

# Requires openai>=1.0; the client reads OPENAI_API_KEY from the environment
client = OpenAI()

def query_oceanbase(query, collection_name, vector_name="title_vector", top_k=20):

    # Creates embedding vector from user query
    embedded_query = client.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    ).data[0].embedding

    # Convert the embedded query to the '[x1,x2,...]' text format used for vector values
    embedded_query_str = "[" + ",".join(map(str, embedded_query)) + "]"

    # Create SQL query; APPROXIMATE asks OceanBase to use the vector index
    query_sql = f"""
    SELECT id, url, title, l2_distance({vector_name}, '{embedded_query_str}') AS distance
    FROM {collection_name}
    ORDER BY l2_distance({vector_name}, '{embedded_query_str}') APPROXIMATE
    LIMIT {top_k};
    """
    # Execute the query
    cursor.execute(query_sql)
    results = cursor.fetchall()

    return results
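
The function above interpolates `collection_name`, `vector_name`, and `top_k` directly into the SQL text, because identifiers cannot be passed as bound parameters. A minimal sketch of guarding those arguments with an allow-list follows; the names `ALLOWED_TABLES`, `ALLOWED_VECTOR_COLUMNS`, and `checked_search_args` are hypothetical and only illustrate the idea.

```python
# Identifiers that this notebook actually creates
ALLOWED_TABLES = {"articles"}
ALLOWED_VECTOR_COLUMNS = {"title_vector", "content_vector"}


def checked_search_args(table, vector_name, top_k):
    """Validate values that will be interpolated into a SQL string."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    if vector_name not in ALLOWED_VECTOR_COLUMNS:
        raise ValueError(f"unknown vector column: {vector_name!r}")
    top_k = int(top_k)
    if top_k <= 0:
        raise ValueError("top_k must be a positive integer")
    return table, vector_name, top_k


print(checked_search_args("articles", "content_vector", "5"))
```

Calling such a check at the top of the search function keeps unexpected strings out of the generated SQL.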

In [11]:
import openai

query_results = query_oceanbase("modern art in Europe", "articles")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. Museum of Modern Art (Score: 0.5)
2. Western Europe (Score: 0.485)
3. Renaissance art (Score: 0.479)
4. Pop art (Score: 0.472)
5. Northern Europe (Score: 0.461)
6. Hellenistic art (Score: 0.457)
7. Modernist literature (Score: 0.447)
8. Art film (Score: 0.44)
9. Central Europe (Score: 0.439)
10. European (Score: 0.437)
11. Art (Score: 0.437)
12. Byzantine art (Score: 0.436)
13. Postmodernism (Score: 0.434)
14. Eastern Europe (Score: 0.433)
15. Europe (Score: 0.433)
16. Cubism (Score: 0.432)
17. Impressionism (Score: 0.432)
18. Bauhaus (Score: 0.431)
19. Surrealism (Score: 0.429)
20. Expressionism (Score: 0.429)


In [12]:
# This time we'll query using content vector
query_results = query_oceanbase("Famous battles in Scottish history", "articles", "content_vector")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. Battle of Bannockburn (Score: 0.489)
2. Wars of Scottish Independence (Score: 0.474)
3. 1651 (Score: 0.457)
4. First War of Scottish Independence (Score: 0.452)
5. Robert I of Scotland (Score: 0.445)
6. 841 (Score: 0.441)
7. 1716 (Score: 0.441)
8. 1314 (Score: 0.429)
9. 1263 (Score: 0.428)
10. William Wallace (Score: 0.426)
11. Stirling (Score: 0.419)
12. 1306 (Score: 0.419)
13. 1746 (Score: 0.418)
14. 1040s (Score: 0.414)
15. 1106 (Score: 0.412)
16. 1304 (Score: 0.411)
17. David II of Scotland (Score: 0.408)
18. Braveheart (Score: 0.407)
19. 1124 (Score: 0.406)
20. July 27 (Score: 0.405)
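
The scores printed above are computed as `1 - result[3]`, that is, one minus the L2 distance, so a higher score means a closer match. For unit-length vectors (OpenAI embeddings are normalized to length 1), the L2 distance `d` and the cosine similarity are related by `d² = 2 - 2·cos`. The sketch below converts one into the other; the function name is hypothetical, and the formula only holds under the unit-length assumption.

```python
def cosine_from_l2(distance):
    """Cosine similarity implied by an L2 distance between two unit-length vectors."""
    return 1.0 - (distance ** 2) / 2.0


# A printed score of 0.5 corresponds to an L2 distance of 0.5
print(cosine_from_l2(0.5))  # 0.875
```

This gives a more familiar scale when comparing results with tools that report cosine similarity.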
