# (Experimental) Generating, Indexing and Searching Embeddings

**WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.**

## Installing the Package

For testing purposes, let's install the latest development version:

In [1]:
%cd ../../../
!python3 -m pip install --upgrade .

/home/gpadmin/GreenplumPython
Defaulting to user installation because normal site-packages is not writeable
Processing /home/gpadmin/GreenplumPython
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: greenplum-python
  Building wheel for greenplum-python (PEP 517) ... done
  Created wheel for greenplum-python: filename=greenplum_python-1.0.1-py3-none-any.whl size=71903 sha256=305b83c461fb90310fafe09821f5778ef5235a439aee59ee3c1f304e349188d6
  Stored in directory: /tmp/pip-ephem-wheel-cache-w_h4u4oe/wheels/bb/1f/99/ff8594e48ec11df99af6e0ee8611a5e560e9f44d1a3fefb351
Successfully built greenplum-python
Installing collected packages: greenplum-python
Successfully installed greenplum-python-1.0.1


## Preparing Data

With GreenplumPython installed, let's create a table with some sample text data:

In [2]:
content = ["I have a dog.", "I like eating apples."]

import greenplumpython as gp

db = gp.database("postgresql://localhost:7000")
t = (
    db.create_dataframe(columns={"id": range(len(content)), "content": content})
    .save_as(
        table_name="text_sample",
        column_names=["id", "content"],
        distribution_key={"id"},
        distribution_type="hash",
        drop_if_exists=True,
    )
    .check_unique(columns={"id"})
)

## Generating and Indexing Embeddings

On the text sample table, we can now create an embedding index with the new `embedding` module:

In [3]:
import greenplumpython.experimental.embedding

t = t.embedding().create_index(column="content", model="all-MiniLM-L6-v2")
t

id,content
0,I have a dog.
1,I like eating apples.


This generates embeddings for the text data using the specified model and creates a vector index on the embeddings for fast k-NN search.
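To make the idea of an embedding concrete, here is a minimal sketch that compares texts by the cosine similarity of their vectors. The three-dimensional vectors and the names `toy_embeddings`, `query_embedding` and `cosine_similarity` are hand-written assumptions for illustration; they are not produced by `all-MiniLM-L6-v2` and are not part of GreenplumPython.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
# A real embedding model produces much higher-dimensional vectors.
toy_embeddings = {
    "I have a dog.": [0.9, 0.1, 0.0],
    "I like eating apples.": [0.1, 0.9, 0.2],
}
query_embedding = [0.0, 1.0, 0.1]  # stands in for the embedding of "apple"


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Score every text against the query and keep the closest one.
scores = {
    text: cosine_similarity(vector, query_embedding)
    for text, vector in toy_embeddings.items()
}
best_match = max(scores, key=scores.get)
print(best_match)
```

Texts whose vectors point in a similar direction score close to 1, which is why a query about apples can match a sentence that never contains the exact word "apple".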

## Semantic Search by Embeddings

With the embedding index, we can search for content based on semantic similarity:

In [4]:
t.embedding().search(column="content", query="apple", top_k=1)

id,content
1,I like eating apples.


This search is efficient because the vector index lets the database find the nearest neighbors without scanning every row.
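For contrast, the sketch below shows the exact, index-free baseline: a brute-force k-NN search that computes the distance from the query to every stored vector. The function `brute_force_top_k`, the toy `vectors` mapping and the choice of squared Euclidean distance are assumptions made for illustration; the tutorial does not state which distance metric the index uses.

```python
import heapq


def brute_force_top_k(query, vectors, top_k):
    """Exact k-NN by squared Euclidean distance; visits every stored vector."""

    def squared_distance(vector):
        return sum((x - y) ** 2 for x, y in zip(vector, query))

    # Iterating over the dict yields row ids; each id is ranked by its distance.
    return heapq.nsmallest(
        top_k, vectors, key=lambda row_id: squared_distance(vectors[row_id])
    )


# Hypothetical row id -> embedding mapping, for illustration only.
vectors = {0: [0.9, 0.1, 0.0], 1: [0.1, 0.9, 0.2], 2: [0.5, 0.5, 0.5]}
nearest_ids = brute_force_top_k([0.0, 1.0, 0.1], vectors, top_k=1)
print(nearest_ids)
```

The work done by this baseline grows linearly with the number of rows, which is the cost a vector index is meant to avoid.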

## Cleaning All at Once

To ease management, the dependency of the embedding index on the base table is recorded in the database.

As a result, trying to drop the base table alone will fail:

In [6]:
%reload_ext sql
%sql postgresql://localhost:7000
%sql DROP TABLE text_sample

 * postgresql://localhost:7000
(psycopg2.errors.DependentObjectsStillExist) cannot drop table text_sample because other objects depend on it
DETAIL:  table cte_32a769763ae94cd9b4036ceb590c4f0d depends on table text_sample
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

[SQL: DROP TABLE text_sample]
(Background on this error at: https://sqlalche.me/e/20/2j85)


To drop the base table, we need to also drop the embedding index. This can be achieved with `CASCADE`:

In [7]:
%%sql
DROP TABLE text_sample CASCADE;

SELECT oid, relname
FROM gp_dist_random('pg_class')
WHERE relname = 'cte_32a769763ae94cd9b4036ceb590c4f0d';

 * postgresql://localhost:7000
Done.
0 rows affected.


oid,relname


As we can see, after `DROP CASCADE`, the embedding index also gets dropped on all segments.
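The behavior in this section can be mimicked with a toy model of dependency tracking. The sketch below is only an illustration: `ToyCatalog` and the object name `embedding_table` are hypothetical, and the real bookkeeping is done by the database itself, not by Python code like this.

```python
class ToyCatalog:
    """Hypothetical stand-in for a catalog that records object dependencies."""

    def __init__(self):
        self.objects = set()
        self.dependents = {}  # object name -> names of objects depending on it

    def create(self, name, depends_on=None):
        self.objects.add(name)
        if depends_on is not None:
            self.dependents.setdefault(depends_on, set()).add(name)

    def drop(self, name, cascade=False):
        # Only dependents that still exist block the drop.
        children = self.dependents.get(name, set()) & self.objects
        if children and not cascade:
            raise RuntimeError(f"cannot drop {name}: other objects depend on it")
        for child in children:
            self.drop(child, cascade=True)
        self.objects.discard(name)
        self.dependents.pop(name, None)


catalog = ToyCatalog()
catalog.create("text_sample")
catalog.create("embedding_table", depends_on="text_sample")  # placeholder name

# Dropping the base table alone is refused while a dependent exists.
try:
    catalog.drop("text_sample")
    drop_without_cascade_failed = False
except RuntimeError:
    drop_without_cascade_failed = True

# With cascade, the dependent object is dropped together with the base table.
catalog.drop("text_sample", cascade=True)
remaining = sorted(catalog.objects)
print(drop_without_cascade_failed, remaining)
```

Recording the dependency is what makes a single cascading drop sufficient: nothing tied to the base table is left behind to clean up by hand.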