![Banner](banner.png)

# Exercise 1: Introduction to vector search and embeddings

This notebook deals with the implementation and management of vector searches in a database. Vector search is a powerful tool to find semantic similarities in large data sets. In contrast to classical search methods, which are based on exact matches, vector search makes it possible to find similar objects even if they are not identical. This method is used in applications such as image, text or product recommendation systems.

### Aim of this notebook
This exercise demonstrates how vector data can be stored in a database and made accessible through similarity searches. The main steps include:

- Setting up a dedicated tablespace and user for the vector database.
- Loading and managing ONNX models to generate embeddings.
- Implementing triggers to automatically generate embeddings when new data is inserted into the database. 
- Performing similarity searches to identify data sets that are similar to a given input value.
- Optimizing the vector index and analyzing performance.

### Data
The synthetic data used in this exercise represents real estate information. Each row in the table represents a property and contains the following attributes:

1. **PID:** A unique identifier (ID) for each property.
2. **TYP:** The type of property, e.g. "apartment building", "detached house", "flat", "bungalow", "semi-detached house" etc.
3. **PREIS:** The selling price of the property in euros.
4. **ZIMMER:** The number of bedrooms in the property.
5. **STADT:** The name of the city in which the property is located.
6. **LAND:** The federal state in which the property is located, e.g. "Schleswig-Holstein", "Baden-Württemberg", "Thuringia" etc.
7. **BESCHREIBUNG:** A textual description of the property and its features.

These attributes provide a variety of features that can be used for vector search. In this exercise, the `BESCHREIBUNG` attribute is converted into an embedding to enable a semantic search based on similarities between properties.

### Learning targets
After completing this exercise, you should:

1. understand how embeddings are generated and managed in a database.
2. be able to perform vector searches with high accuracy and efficiency.
3. know how to optimize storage resources and evaluate the performance of vector indices.

Have fun and good luck!

## Preparation: Setting up the environment and connecting to the Oracle database

To begin with, the working environment is set up and a connection to an Oracle database is established. This includes the installation of the necessary Python packages and the initialization of the Oracle client.

#### Installed packages:
- `oracledb`: Required to connect to Oracle databases and execute SQL queries.
- `ipython-sql`: Enables the use of SQL in Jupyter notebooks.
- `pandas`: Used for data processing and analysis. Here it is used to load and manage the results of SQL queries in DataFrames.

#### Code details:
- The Oracle client is initialized with `oracledb.init_oracle_client`, where the directory for the Oracle libraries is specified.
- The environment variables `HOST_NAME` and `PDB_NAME` are used to create the connection string (`dsn`) for the Oracle database. This contains the host name and the Pluggable Database Name (PDB), which are required to address the database.
- A connection to the Oracle database is established with the user `vector` and the password `vector`. The connection string combines host and PDB, which are stored in the environment variables.
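
The connection string described above is a plain concatenation of two environment variables, so a missing variable only shows up as a `TypeError` during concatenation. A minimal sketch of a helper that builds the string and fails early with a clearer message could look like this. The function name `build_dsn` is hypothetical and not part of `oracledb`, and the host names in the example are made up.

```python
import os


def build_dsn(env=None):
    """Build a 'host/service' connection string from HOST_NAME and PDB_NAME.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("HOST_NAME", "PDB_NAME") if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variable(s): {', '.join(missing)}")
    return f"{env['HOST_NAME']}/{env['PDB_NAME']}"


# Example with made-up values (not a real host)
print(build_dsn({"HOST_NAME": "dbhost.example.com", "PDB_NAME": "pdb1.example.com"}))
```

In the notebook, `cs = build_dsn()` would then take the place of the manual concatenation.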


In [2]:
!pip install oracledb 
!pip install ipython-sql
!pip install pandas

Collecting oracledb
  Downloading oracledb-2.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl.metadata (5.5 kB)
Downloading oracledb-2.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (2.4 MB)
Installing collected packages: oracledb
Successfully installed oracledb-2.5.1
Collecting ipython-sql
  Downloading ipython_sql-0.5.0-py3-none-any.whl.metadata (17 kB)
Collecting prettytable (from ipython-sql)
  Downloading prettytable-3.12.0-py3-none-any.whl.metadata (30 kB)
Collecting sqlparse (from ipython-sql)
  Downloading sqlparse-0.5.3-py3-none-any.whl.metadata (3.9 kB)
Downloading ipython_sql-0.5.0-py3-none-any.whl (20 kB)
Downloading prettytable-3.12.0-py3-none-any.whl (31 kB)
Downloading sqlparse-0.5.3-py3-none-any.whl (44 kB)

In [3]:
import oracledb
import pandas as pd
import os
import warnings
import time

warnings.filterwarnings('ignore')
pd.set_option('expand_frame_repr', False)
pd.options.display.max_colwidth = 800

d = '/home/jovyan/.jupyter/instantclient_23_5'
oracledb.init_oracle_client(lib_dir=d)
host = os.environ.get('HOST_NAME')
pdb = os.environ.get('PDB_NAME')
cs = host + '/' + pdb
print(cs)
# should be something like 'db23ai.subbb3fff175.quickcluster.oraclevcn.com/michael.subbb3fff175.quickcluster.oraclevcn.com'

db23ai.subbb3fff175.quickcluster.oraclevcn.com/marcel.subbb3fff175.quickcluster.oraclevcn.com


In [4]:
connection = oracledb.connect(user='vector', password='vector', dsn=cs)
print(connection)

<oracledb.Connection to vector@db23ai.subbb3fff175.quickcluster.oraclevcn.com/marcel.subbb3fff175.quickcluster.oraclevcn.com>


The database setup, which is not part of this notebook, sizes the memory for the in-memory column store (16 GB) and the vector store (12 GB) to enable fast queries and efficient vector searches.

## Script 01: Creating the tablespace and user - ⚠️ Script does NOT need to be executed ⚠️

In this section, the database environment is prepared by creating a new tablespace and a user.

- **Tablespace:** The tablespace `tbsvec` is created with an initial size of 40 GB and can expand automatically. 
- **User "vector":** The user `vector` is created and given full access to the tablespace `tbsvec`.
- **Roles and permissions:** The user `vector` is assigned specific roles, including permission to create mining models and access to the directory `vec_dump`.

<a id='modellverwaltung'></a>
## Script 02: Loading an ONNX model

In this section, the ONNX model `distil_v2` is loaded into the database. The model is used to create embeddings and works with cosine similarity as a distance function.

### What is ONNX?

**ONNX** (Open Neural Network Exchange) is an open format developed to enable interoperability between different deep learning frameworks. With ONNX, models can be seamlessly transferred between different tools and platforms. It supports a variety of operations and allows trained models to be used efficiently in production environments or optimized on other hardware platforms such as GPUs.

### Choice of model: distiluse-base-multilingual-cased-v2

In exercise 1, we use the **`distiluse-base-multilingual-cased-v2`** model from the `sentence-transformers` library. This model was chosen because it is optimized for multilingual use cases and is particularly well suited for semantic text similarity tasks. It supports multiple languages and offers high performance in the calculation of vector embeddings. 

### Note: Executing SQL code in a Jupyter notebook with Python

1. **Define SQL code**: SQL commands are saved as strings in Python variables.
2. **Execute SQL**: The SQL code is executed with `cursor.execute(sql)`.
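
Most code cells in this notebook repeat the same pattern: start a timer, open a cursor, execute the statement, print any warning, and print the elapsed time. As a sketch, that pattern can be wrapped in one small function. The name `run_sql` is hypothetical; the only thing assumed about the connection is that it offers `cursor()` as a context manager, as the cells in this notebook already rely on.

```python
import time


def run_sql(connection, sql, label="SQL execution"):
    """Execute one statement, report warnings, and return the elapsed seconds."""
    tic = time.perf_counter()
    with connection.cursor() as cursor:
        cursor.execute(sql)
        # The notebook cells read cursor.warning; fall back to None if absent.
        warning = getattr(cursor, "warning", None)
        print(warning if warning else "SQL execution successful")
    elapsed = time.perf_counter() - tic
    print(f"{label} took {elapsed:0.4f} seconds")
    return elapsed
```

A call such as `run_sql(connection, sql, "Loading model")` would then stand in for the timing and cursor boilerplate of a cell.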

In [None]:
sql = """
BEGIN
 DBMS_VECTOR.DROP_ONNX_MODEL(
  'distil_model',
  force => true 
)
;

 DBMS_VECTOR.LOAD_ONNX_MODEL( 
  'VEC_DUMP', 
  'distil_v2.onnx', 
  'distil_model', 
   JSON('{"function":"embedding","embeddingOutput":"embedding","input":{"input": ["DATA"]}}') 
)
;
END;
"""


tic = time.perf_counter()
with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")
toc = time.perf_counter()
print(f"Loading model took {toc - tic:0.4f} seconds")



## Script 02: Load another ONNX model "minilm"

In this step, the ONNX model `minilml6_v2` is loaded into the database. This model also generates embeddings and uses the same input configuration as the previous model.

> **Note:** The model `minilml6_model` is loaded in this notebook but is not used until **Exercise 2**. In this exercise, only the `distiluse-base-multilingual-cased-v2` model is used to calculate the embeddings.
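
The two model-loading cells in this section differ only in the file name and the model name. The following sketch generates the PL/SQL block from those two values and builds the metadata with `json.dumps` instead of a hand-written string. The function name `build_load_model_sql` and the validation patterns are assumptions made for this illustration; the PL/SQL calls are the ones used in the cells of this notebook.

```python
import json
import re

# Conservative patterns, chosen for this sketch only
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_$#]*$")
_FILE_NAME = re.compile(r"^[A-Za-z0-9_.\-]+$")


def build_load_model_sql(file_name, model_name, directory="VEC_DUMP"):
    """Return a PL/SQL block that drops and reloads one ONNX embedding model."""
    if not _IDENTIFIER.match(model_name) or not _IDENTIFIER.match(directory):
        raise ValueError("model_name and directory must be simple identifiers")
    if not _FILE_NAME.match(file_name):
        raise ValueError("file_name contains unexpected characters")
    metadata = json.dumps({
        "function": "embedding",
        "embeddingOutput": "embedding",
        "input": {"input": ["DATA"]},
    })
    return f"""
BEGIN
  DBMS_VECTOR.DROP_ONNX_MODEL('{model_name}', force => true);
  DBMS_VECTOR.LOAD_ONNX_MODEL('{directory}', '{file_name}', '{model_name}', JSON('{metadata}'));
END;
"""


print(build_load_model_sql("minilml6_v2.onnx", "minilml6_model"))
```

The names are validated because they are interpolated into the statement text rather than passed as bind variables.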

In [None]:
sql= """
BEGIN
DBMS_VECTOR.DROP_ONNX_MODEL( 
  'minilml6_model', 
  force => true 
)
;
 DBMS_VECTOR.LOAD_ONNX_MODEL( 
  'VEC_DUMP', 
  'minilml6_v2.onnx', 
  'minilml6_model', 
   JSON('{"function":"embedding","embeddingOutput":"embedding","input":{"input": ["DATA"]}}') 
)
;
END;
"""

tic = time.perf_counter()
with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")
toc = time.perf_counter()
print(f"Loading model took {toc - tic:0.4f} seconds")

## Script 02a: Query model information

In this section, SQL queries are executed to retrieve information about the loaded models. Details such as the model name, the mining function, the algorithm, the model size and specific attributes of the models are displayed.

In [None]:
sql= """
SELECT MODEL_NAME, MINING_FUNCTION, ALGORITHM,
ALGORITHM_TYPE, MODEL_SIZE/1024/1024 "SIZE [MB]"
FROM user_mining_models
ORDER BY MODEL_NAME
"""

df = pd.read_sql(sql=sql, con=connection)
display(df)

In [None]:
sql="""
SELECT model_name, attribute_name, attribute_type, data_type, vector_info
FROM user_mining_model_attributes
ORDER BY MODEL_NAME
"""

df = pd.read_sql(sql=sql, con=connection)
display(df)

## Script 03: Trigger for automatic embeddings

**Functionality:**
- Each time a row is inserted or updated in the `immobilien` table, the `beschreibung` field is used to generate an embedding.
- The generated embedding is saved in the corresponding `EMBED` column of the table.

In [None]:
# Hint: disable the trigger before large INSERT operations; otherwise the inserts are not parallelized.
sql="""
CREATE OR REPLACE TRIGGER immobilien_embed
BEFORE INSERT OR UPDATE ON immobilien
FOR EACH ROW
DECLARE
params clob;
BEGIN
  params := '{"provider":"database","model":"distil_model"}';
  :new.embed := DBMS_VECTOR.UTL_TO_EMBEDDING(:new.beschreibung,json(params));
END;
"""

with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")

## Script 04: Parallel calculation and storage of embeddings

This section executes a PL/SQL block that temporarily deactivates the `IMMOBILIEN_EMBED` trigger to enable parallel calculation and updating of embeddings in the `immobilien` table. For entries where the `EMBED` column is still empty, an embedding is generated from the `beschreibung` text field using `distil_model` and saved.

**Execution:**
1. The `IMMOBILIEN_EMBED` trigger is deactivated.
2. The `EMBED` column is filled with newly calculated embeddings.
3. In addition, parallel mode is forced for DDL operations to improve performance.
4. The `IMMOBILIEN_EMBED` trigger is reactivated.


**This step takes about 6 minutes at a degree of parallelism of 6.**
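
One caveat of the disable/update/enable sequence is that the trigger stays disabled if the update fails halfway. As a sketch of one way to guard against that from Python, the sequence can be wrapped in a context manager that always re-enables the trigger. The name `trigger_disabled` is hypothetical, and this is an alternative to the single PL/SQL block used in this notebook, not a description of it.

```python
from contextlib import contextmanager


@contextmanager
def trigger_disabled(connection, trigger_name):
    """Disable a trigger for the duration of a block and always re-enable it."""
    if not trigger_name.replace("_", "").isalnum():
        raise ValueError("trigger_name must be a simple identifier")
    with connection.cursor() as cursor:
        cursor.execute(f"ALTER TRIGGER {trigger_name} DISABLE")
    try:
        yield
    finally:
        # Runs even if the bulk update raises, so the trigger is not left off.
        with connection.cursor() as cursor:
            cursor.execute(f"ALTER TRIGGER {trigger_name} ENABLE")
```

The bulk `UPDATE` and `COMMIT` would then run inside `with trigger_disabled(connection, "IMMOBILIEN_EMBED"):`.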

In [None]:
sql="""
BEGIN
EXECUTE IMMEDIATE ('ALTER TRIGGER IMMOBILIEN_EMBED DISABLE');
UPDATE /*+ enable_parallel_dml parallel(tim,9) */ immobilien tim 
SET EMBED=TO_VECTOR(VECTOR_EMBEDDING(distil_model USING beschreibung AS DATA))
WHERE EMBED IS NULL;
COMMIT;
EXECUTE IMMEDIATE ('ALTER TRIGGER IMMOBILIEN_EMBED ENABLE');
END; """

tic = time.perf_counter()
with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")
toc = time.perf_counter()
print(f"Vectorizing data took {toc - tic:0.4f} seconds")

## Experiments:
* How long does it take to create embeddings with a different embedding model, such as minilml6?
* How long does it take to create embeddings with a different degree of parallelism, such as 12?
* How long does it take to create embeddings through an external GPU-based provider?

To save time, the amount of data processed is reduced to fewer than 200 rows.
The data will NOT be committed, i.e. not written to the database table.

Please keep in mind that you are not alone on the system. You are sharing CPU cores and one GPU with others.


In [50]:
sql="""
BEGIN
EXECUTE IMMEDIATE ('ALTER TRIGGER IMMOBILIEN_EMBED DISABLE');
UPDATE /*+ enable_parallel_dml parallel(tim,12) */ immobilien tim 
SET EMBED=TO_VECTOR(VECTOR_EMBEDDING(minilml6_model USING beschreibung AS DATA))
WHERE ROWNUM < 200;
ROLLBACK;
EXECUTE IMMEDIATE ('ALTER TRIGGER IMMOBILIEN_EMBED ENABLE');
END; """

tic = time.perf_counter()
with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")
toc = time.perf_counter()
print(f"Vectorizing data took {toc - tic:0.4f} seconds")

SQL execution successful
Vectorizing data took 12.1250 seconds


In [51]:
sql="""
DECLARE
  params clob;
BEGIN
params := '{"provider":"ollama",'||
  '"host"    :"local", '||
  '"url"     : "http://ollama.meinnetzwerk.com/api/embeddings", '||
  '"transfer_timeout": 120, '|| -- cannot do longer than the operating system
  '"model"   : "all-minilm:latest" '||
  '}';

EXECUTE IMMEDIATE ('ALTER TRIGGER IMMOBILIEN_EMBED DISABLE');
UPDATE IMMOBILIEN TIM
SET EMBED=DBMS_VECTOR.UTL_TO_EMBEDDING(beschreibung,json(params))
WHERE ROWNUM < 200;
ROLLBACK;
EXECUTE IMMEDIATE ('ALTER TRIGGER IMMOBILIEN_EMBED ENABLE');
END; """

tic = time.perf_counter()
with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")
toc = time.perf_counter()
print(f"Vectorizing data took {toc - tic:0.4f} seconds")

SQL execution successful
Vectorizing data took 31.7187 seconds
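
To compare the two runs, the measured times can be turned into a throughput figure. The short calculation below uses the two timings printed above and assumes that each statement updated 199 rows, which is the most that `ROWNUM < 200` allows. These are single measurements on a shared system, so the numbers are only a rough indication.

```python
def rows_per_second(rows, seconds):
    """Throughput of one timed run."""
    return rows / seconds


# ROWNUM < 200 touches at most 199 rows (assumed here).
rows = 199
in_db = rows_per_second(rows, 12.1250)     # minilml6_model inside the database
external = rows_per_second(rows, 31.7187)  # all-minilm via the external provider

print(f"in-database: {in_db:.1f} rows/s")
print(f"external:    {external:.1f} rows/s")
print(f"ratio:       {in_db / external:.2f}x")
```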


In [None]:
sql="""
BEGIN
EXECUTE IMMEDIATE ('drop index if exists IDX_IMMOBILIEN_EMBED');
EXECUTE IMMEDIATE ('ALTER SESSION FORCE PARALLEL DDL PARALLEL 4');
END;"""

with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")

## Script 04a: Creation and verification of the vector index
In this section, a vector index is created on the `EMBED` column of the `immobilien` table to perform efficient similarity searches based on vector data.

**Procedure:**
1. **Creation of the vector index:** The index `IDX_IMMOBILIEN_EMBED` is created to perform fast and precise searches.
2. **Check the index:** An SQL query is executed to check the newly created index and confirm its type and subtype. 
3. **Retrieving the index parameters:** The parameters of the vector index are retrieved as JSON data and displayed in a clear format.

The index creation and subsequent analysis ensure that the database is optimally prepared for the vector search.

### Theory: Cosine similarity and cosine distance

From mathematics, the scalar product of two vectors is

$$ \vec a \cdot \vec b = |\vec a| \cdot |\vec b| \cdot \cos\theta $$

Solving for $\cos\theta$ gives the cosine similarity:

$$ \cos\theta = \frac{\vec a \cdot \vec b}{|\vec a| \cdot |\vec b|} $$

The cosine distance is $1 - \cos\theta$. This is the `COSINE` metric used for the vector index and in the `VECTOR_DISTANCE` function in this notebook.

So the closer the cosine distance is to 0, the more similar the vectors.
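
The two formulas can be checked with a few lines of plain Python. This is a minimal sketch for small lists of numbers; the function names are chosen for this example and the database computes the distance itself.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """1 - cos(theta): 0 for the same direction, 2 for opposite directions."""
    return 1.0 - cosine_similarity(a, b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction  -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal      -> 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite        -> 2.0
```

Note that the length of the vectors does not matter: `[1.0, 0.0]` and `[2.0, 0.0]` have distance 0 because only the angle between them is measured.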

In [None]:
sql = """
create vector index IDX_IMMOBILIEN_EMBED on IMMOBILIEN(embed)
organization INMEMORY NEIGHBOR GRAPH
distance COSINE 
with target accuracy 95
"""

tic = time.perf_counter()
with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")
toc = time.perf_counter()
print(f"Vector index creation took {toc - tic:0.4f} seconds")

In [None]:
sql= """
SELECT INDEX_NAME, INDEX_TYPE, INDEX_SUBTYPE
FROM USER_INDEXES
WHERE INDEX_NAME='IDX_IMMOBILIEN_EMBED'
"""

df = pd.read_sql(sql=sql, con=connection)
display(df)

In [None]:
sql= """
SELECT JSON_SERIALIZE(IDX_PARAMS returning varchar2) "IDX_Params"
FROM VECSYS.VECTOR$INDEX
where IDX_NAME = 'IDX_IMMOBILIEN_EMBED'
"""

import json

with connection.cursor() as cursor:     
    cursor.execute(sql)
    res, =cursor.fetchone()
    json_obj = json.loads(res)
    pretty= json.dumps(json_obj, indent=4)
    print(pretty)

## Script 04b: Create vector index, partitioned by vector distance
This section describes the steps for creating a new vector index and then analyzing its parameters. The existing index is first removed to ensure that the new index can be created without conflicts.
These steps ensure that an optimized vector index is created that enables high performance in vector searches.
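
The idea behind a partition-based vector index can be illustrated with a toy example: vectors are grouped around centroids, and a query only scans the partitions whose centroids are closest to it. The sketch below is conceptual and does not reproduce Oracle's implementation. The centroids are hand-picked here, whereas a real index would derive them by clustering, and all function names and data are made up for this illustration.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def build_partitions(vectors, centroids):
    """Assign every vector id to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: cosine_distance(vec, centroids[i]))
        partitions[nearest].append(vid)
    return partitions


def search(query, vectors, centroids, partitions, k=2, n_probe=1):
    """Scan only the n_probe partitions whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: cosine_distance(query, centroids[i]))[:n_probe]
    candidates = [vid for i in probe for vid in partitions[i]]
    return sorted(candidates, key=lambda vid: cosine_distance(query, vectors[vid]))[:k]


vectors = {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [0.1, 1.0], "d": [0.2, 0.9]}
centroids = [[1.0, 0.0], [0.0, 1.0]]  # hand-picked for the example
partitions = build_partitions(vectors, centroids)
print(partitions)
print(search([1.0, 0.0], vectors, centroids, partitions, k=2, n_probe=1))
```

Probing fewer partitions is faster but can miss neighbors that fall into a partition that was not scanned, which is why such an index is approximate and is created with a target accuracy.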

In [None]:
sql="""
BEGIN
EXECUTE IMMEDIATE ('drop index if exists IDX_IMMOBILIEN_EMBED');
EXECUTE IMMEDIATE ('ALTER SESSION FORCE PARALLEL DDL PARALLEL 4');
END;"""

with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")

In [None]:
sql = """
create vector index IDX_IMMOBILIEN_EMBED on IMMOBILIEN(embed)
organization NEIGHBOR PARTITIONS 
distance COSINE 
with target accuracy 95
"""

tic = time.perf_counter()
with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")
toc = time.perf_counter()
print(f"Vector index creation took {toc - tic:0.4f} seconds")

In [None]:
sql= """
SELECT INDEX_NAME, INDEX_TYPE, INDEX_SUBTYPE
FROM USER_INDEXES
WHERE INDEX_NAME='IDX_IMMOBILIEN_EMBED'
"""

df = pd.read_sql(sql=sql, con=connection)
display(df)

In [None]:
sql= """
SELECT JSON_SERIALIZE(IDX_PARAMS returning varchar2) "IDX_Params"
FROM VECSYS.VECTOR$INDEX
where IDX_NAME = 'IDX_IMMOBILIEN_EMBED'
"""

with connection.cursor() as cursor:     
    cursor.execute(sql)
    res, =cursor.fetchone()
    json_obj = json.loads(res)
    pretty= json.dumps(json_obj, indent=4)
    print(pretty)

## Script 05: Similarity search based on embeddings

In this section, we perform a similarity search to find the five most similar single-family homes in the `immobilien` table based on an input text.

**Steps:**
1. A text input is stored in the variable `suchtext` and converted into an embedding.
2. The `immobilien` table is searched for entries of the type `Einfamilienhaus` (single-family house).
3. The entries are sorted by their similarity to the entered text, based on the cosine distance between the embeddings.
4. The five most similar entries are returned with a target accuracy of 80%.
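
The steps above can be sketched as an exact, brute-force search in plain Python: filter by type, compute the cosine distance to the query vector, sort, and keep the first `k` rows. The rows and two-dimensional vectors below are made up for illustration; in the notebook, the database performs this work and may use the vector index to approximate the result.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def top_k(rows, query_vector, k=5, typ=None):
    """Exact counterpart of the query: filter by type, sort by distance, take k."""
    scored = []
    for row in rows:
        if typ is not None and row["typ"] != typ:
            continue
        scored.append((cosine_distance(row["embed"], query_vector), row["pid"]))
    scored.sort()
    return [pid for _, pid in scored[:k]]


# Made-up rows with tiny embeddings
rows = [
    {"pid": 1, "typ": "Einfamilienhaus", "embed": [0.9, 0.1]},
    {"pid": 2, "typ": "Bungalow", "embed": [1.0, 0.0]},
    {"pid": 3, "typ": "Einfamilienhaus", "embed": [0.1, 0.9]},
    {"pid": 4, "typ": "Einfamilienhaus", "embed": [0.7, 0.7]},
]
print(top_k(rows, [1.0, 0.0], k=2, typ="Einfamilienhaus"))  # [1, 4]
```

Row 2 is the closest vector overall but is excluded by the type filter, which mirrors the `WHERE typ='Einfamilienhaus'` condition.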

In [35]:
suchtext = input("Insert search text here ")

Insert search text here with medical care and sauna


In [None]:
sql1="SELECT vector_embedding(distil_model using :suchtext as data) from dual"

sql2="""
select pid, typ, beschreibung, embed <=> :query_vector as cosine_dist
from immobilien
where typ='Einfamilienhaus'
order by cosine_dist
fetch approx first 5 rows only with target accuracy 80
"""

with connection.cursor() as cursor:     
    query_vector = cursor.var(oracledb.DB_TYPE_VECTOR)    
    cursor.execute(sql1, suchtext=suchtext)
    query_vector, =cursor.fetchone()
    
    df = pd.read_sql(sql2, params={'query_vector': query_vector}, con=connection)
    display(df)

## Excursus: Creating a full-text index and comparative query

This example shows that a full-text search works and is structured differently from a vector search. A full-text search finds individual words in texts, and with fuzzy search also word stems and similar words with different spellings. Synonyms or entire phrases such as "medical care" instead of "medical practices" must be maintained manually, e.g. by creating and using a thesaurus for each specialist area.

In [40]:
sql = '''
CREATE INDEX idx_text on immobilien(beschreibung) indextype is ctxsys.context
'''
tic = time.perf_counter()
with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")
toc = time.perf_counter()
print(f"Text index creation took {toc - tic:0.4f} seconds")

SQL execution successful
Text index creation took 2.1615 seconds


The index has been created. Now we create a simple thesaurus with one synonym and two narrower-term relations.

In [None]:
sql = '''
begin
  ctx_thes.create_thesaurus(
    name     => 'MEIN_THESAURUS',
    casesens => false
  );
  ctx_thes.create_relation('MEIN_THESAURUS', 'Whirlpool', 'SYN', 'Jacuzzi');
  ctx_thes.create_relation('MEIN_THESAURUS', 'Wellness', 'NT', 'Sauna');
  ctx_thes.create_relation('MEIN_THESAURUS', 'Wellness', 'NT', 'Jacuzzi');
end;
'''
with connection.cursor() as cursor:     
    cursor.execute(sql)
    if cursor.warning :
        print(cursor.warning)
    else :
        print("SQL execution successful")

And now we use a classic full-text search with thesaurus nesting. Individual words can be linked with AND / OR syntax or separated by commas. Free text works less well unless this has also been provided for.
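
Conceptually, a narrower-term expansion replaces the search term with the term itself plus everything below it in the thesaurus hierarchy, down to a given number of levels. The following sketch illustrates that idea with a plain dictionary. It is a simplified model, not Oracle Text's implementation, and it ignores synonyms and case handling.

```python
def narrower_terms(term, narrower, max_level=10):
    """Return `term` plus all narrower terms reachable within max_level steps."""
    found = [term]
    frontier = [term]
    for _ in range(max_level):
        next_frontier = []
        for parent in frontier:
            for child in narrower.get(parent, []):
                if child not in found:
                    found.append(child)
                    next_frontier.append(child)
        if not next_frontier:
            break
        frontier = next_frontier
    return found


# The two narrower-term relations defined in the thesaurus of this notebook
narrower = {"Wellness": ["Sauna", "Jacuzzi"]}
print(narrower_terms("Wellness", narrower))  # ['Wellness', 'Sauna', 'Jacuzzi']
```

A search for "Wellness" would therefore also match descriptions that only mention a sauna or a jacuzzi, but only because these relations were entered by hand.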

In [None]:
suchtext = input("enter search text here")
suchtext = "NT("+suchtext+", 10, MEIN_THESAURUS)"
sql = '''
SELECT pid, typ, beschreibung, score(1) FROM immobilien 
              WHERE CONTAINS(beschreibung, :suchtext, 1) > 0
              order by score(1)
'''
with connection.cursor() as cursor:     
    
    df = pd.read_sql(sql, params={'suchtext': suchtext}, con=connection)
    display(df)

## Script 05: Custom similarity search

In this section, the vector-based similarity search is further refined by only returning entries whose **cosine distance** to the search vector is less than or equal to 0.5. This ensures that only properties that are particularly similar to the search vector are displayed.

### Steps:
1. **Calculating the query vector:** The search text is used again to generate an embedding vector that serves as a reference for the search.
2. **Filtering by cosine distance:** In contrast to the previous search, only entries whose cosine distance to the search vector is 0.5 or less are returned here. This restricts the results to highly similar single-family homes.
3. **Top 5 results:** The five most similar entries based on this filter are returned, with a target accuracy of 80%.

This query makes it possible to find only those properties that are particularly close to the search vector in terms of the content of the description.

In [None]:
sql2="""
SELECT * FROM
(
SELECT pid, typ, beschreibung, VECTOR_DISTANCE(embed, :query_vector, COSINE) AS COSINE_DIST
FROM immobilien
)
WHERE typ='Einfamilienhaus' and COSINE_DIST <= 0.5
ORDER BY COSINE_DIST
FETCH APPROX FIRST 5 ROWS ONLY WITH TARGET ACCURACY 80
"""

df = pd.read_sql(sql2, params={'query_vector': query_vector}, con=connection)
display(df)

## Script 05: Advanced similarity search

In this third section, the similarity search is performed again, but this time with a relaxed condition for the **cosine distance**. While the previous search only considered entries with a maximum cosine distance of 0.5, this search expands the filter to include entries with a cosine distance of up to 0.9. This makes it possible to capture a larger number of potentially relevant entries. Unlike the previous queries, this one does not filter by property type and uses an exact fetch instead of an approximate one.

In [None]:
sql2="""
SELECT * FROM
(
SELECT pid, typ, beschreibung, VECTOR_DISTANCE(embed, :query_vector, COSINE) AS COSINE_DIST
FROM immobilien
)
WHERE COSINE_DIST <= 0.9
order by COSINE_DIST 
FETCH FIRST 5 ROWS ONLY
"""

df = pd.read_sql(sql2, params={'query_vector': query_vector}, con=connection)
display(df)


## Index accuracy check

In this section, the accuracy of the created vector index is checked. The Oracle function `dbms_vector.index_accuracy_query` is used to measure the accuracy that the index actually achieves in a similarity search and to compare it with the **target accuracy**.

### Steps:
1. **Querying the index:** A query is executed to check the accuracy of the vector index `IDX_IMMOBILIEN_EMBED` based on the specified search vector.
2. **Parameters:**
 - The search vector (`query_vector`) serves as the basis for the accuracy measurement. 
 - The query returns the accuracy for the top 10 results, where a target accuracy of 80% is set.
3. **Result:** The actual accuracy of the index is returned to assess whether the index meets the requirements.

This check is to ensure that the vector index provides sufficient accuracy when searching for similar properties.
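
The accuracy of an approximate search is commonly expressed as recall: the share of the exact top-K neighbors that the approximate search also returns. The sketch below shows that calculation with made-up IDs. It illustrates the concept only and makes no claim about how `index_accuracy_query` works internally.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k results that the approximate search also found."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# Made-up result lists for a top-10 query
exact = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
approx = [11, 12, 13, 14, 15, 16, 17, 18, 42, 43]
print(f"{recall_at_k(approx, exact):.0%}")  # 80%
```

With a target accuracy of 80 and `top_K => 10`, this example would sit exactly at the target: eight of the ten true nearest neighbors were found.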

In [None]:
sql="""
select dbms_vector.index_accuracy_query( 
  OWNER_NAME => 'VECTOR',
  INDEX_NAME => 'IDX_IMMOBILIEN_EMBED',
  qv => :query_vector,
  top_K => 10,
  target_accuracy => 80 ) as accuracy from dual
"""
with connection.cursor() as cursor:  
    cursor.execute(sql, query_vector=query_vector)
    res, = cursor.fetchone()
    print (res)

## Checking the vector memory pool

This section checks the memory status of the **vector memory pool** in the database. This query provides information about how much memory is allocated for the vector data and how much of it is actually used.

### Steps:
1. **Memory allocation and usage:** The query returns the **allocated** (ALLOC_BYTES_MB) and **used** (USED_BYTES_MB) amount of memory in megabytes for each memory pool. 
2. **Populate status:** The field `populate_status` indicates the progress of the memory initialization.
3. **Columns:**
 - `CON_ID`: Container ID of the database instance.
 - `POOL`: The name of the memory pool.
 - `ALLOC_BYTES_MB` and `USED_BYTES_MB`: Allocated and used memory in megabytes. 
 - `populate_status`: Status of the memory pool.

This query helps to monitor the memory resources for vector processing in the database and to identify bottlenecks.
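
To spot a bottleneck, the allocated and used bytes can be turned into a usage fraction per pool. The sketch below works on plain dictionaries with the column names used in the query; the function name `pool_usage`, the 90% warning threshold, and the sample numbers are assumptions made for this illustration.

```python
def pool_usage(rows, warn_at=0.9):
    """Return (pool, used_fraction, is_nearly_full) for each memory pool row."""
    report = []
    for row in rows:
        alloc = row["ALLOC_BYTES"]
        used_fraction = row["USED_BYTES"] / alloc if alloc else 0.0
        report.append((row["POOL"], used_fraction, used_fraction >= warn_at))
    return report


# Made-up numbers for illustration only
sample = [
    {"POOL": "pool A", "ALLOC_BYTES": 1000, "USED_BYTES": 950},
    {"POOL": "pool B", "ALLOC_BYTES": 2000, "USED_BYTES": 500},
]
for pool, fraction, nearly_full in pool_usage(sample):
    print(f"{pool}: {fraction:.0%} used{' (nearly full)' if nearly_full else ''}")
```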

In [None]:
sql="""
select CON_ID, POOL, 
round(ALLOC_BYTES/1024/1024,1) as ALLOC_BYTES_MB, 
round(USED_BYTES/1024/1024,1) as USED_BYTES_MB,
populate_status
from V$VECTOR_MEMORY_POOL 
order by 1,2
"""
df = pd.read_sql(sql, con=connection)
display(df)

## Search without vectors: result/difference

Traditional search methods, such as the keyword-based search, only compare texts on the basis of exact word matches. This means that only documents or data records containing the exact words entered are found. However, this method neglects the semantic relationship between terms. For example, a simple keyword search for "house" would possibly overlook "building" or "apartment" as these terms are not identical even though they have similar meanings.

In contrast, the **vector search** uses embeddings that capture the semantic context of words. This makes it possible to find data records that are similar in content, even if the exact terms do not match. This makes searches much more flexible and intelligent, as semantic relationships are recognized and used.

The difference in the results between a keyword-based search and a vector search is clear: While the conventional search only returns exact matches, the vector search can return content-relevant and similar entries that would remain undetected in a keyword-based search.

## Summary

This notebook provides a detailed introduction to the implementation and management of vector databases, from setting up the database environment to performing efficient similarity searches. Steps were taken to create a special tablespace and user, set up automatic generation of embeddings using triggers and enable parallel processing to optimize performance.

A central element was the creation and management of a vector index that supports the search for semantic similarities based on embeddings. Various queries showed how top results can be filtered and sorted based on cosine distance. In addition, the accuracy of the vector index was verified and memory resources were monitored to ensure efficient processing of large datasets.

The techniques demonstrated in this notebook illustrate the power of modern vector searches and show how the right indexing and memory configuration can make complex searches accurate and fast. These approaches are applicable in many areas, such as product recommendation, semantic text search or similarity search in image and real estate data.