# Vector search

:::note
Vector search is very popular right now, so 
here is the simplest possible implementation of semantic 
text search with a `sentence_transformers` model, 
as an entry point to `pinnacledb`.

Note that `pinnacledb` does much more than vector search
on text. Explore the docs to read about classical machine learning, 
computer vision, LLMs, fine-tuning and much more!
:::


First, let's get some data. The data are the markdown files 
of the very same documentation you are reading!
You can download other sample datasets for testing `pinnacledb`
by following [these lines of code](../reusable_snippets/get_useful_sample_data).

In [1]:
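import json

import requests

# Download the sample data used in this walkthrough.
r = requests.get('https://pinnacledb-public-demo.s3.amazonaws.com/text.json')
r.raise_for_status()  # fail early if the download did not succeed

# Save the raw bytes locally, then parse the file as JSON.
with open('text.json', 'wb') as f:
    f.write(r.content)

with open('text.json', 'r') as f:
    data = json.load(f)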

print(data[0])

---
sidebar_position: 5
---

# Encoding data

In AI, typical types of data are:

- **Numbers** (integers, floats, etc.)
- **Text**
- **Images**
- **Audio**
- **Videos**
- **...bespoke in house data**

Most databases don't support any data other than numbers and text.
SuperDuperDB enables the use of these more interesting data-types using the `Document` wrapper.

### `Document`

The `Document` wrapper, wraps dictionaries, and is the container which is used whenever 
data is exchanged with your database. That means inputs, and queries, wrap dictionaries 
used with `Document` and also results are returned wrapped with `Document`.

Whenever the `Document` contains data which is in need of specialized serialization,
then the `Document` instance contains calls to `DataType` instances.

### `DataType`

The [`DataType` class](../apply_api/datatype), allows users to create and encoder custom datatypes, by providing 
their own encoder/decoder pairs.

Here is an example of applying an `DataType` 

Now we connect to SuperDuperDB, using MongoMock as the data backend.
Read more about connecting to SuperDuperDB [here](../core_api/connect), and see
a semi-exhaustive list of supported data backends [here](../reusable_snippets/connect_to_pinnacledb).

In [2]:
from pinnacledb import pinnacle, Document

db = pinnacle('mongomock://test')

_ = db['documents'].insert_many([Document({'txt': txt}) for txt in data]).execute()

2024-May-23 22:32:53.64| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.build:69   | Data Client is ready. mongomock.MongoClient('localhost', 27017)
2024-May-23 22:32:53.66| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.build:42   | Connecting to Metadata Client with engine:  mongomock.MongoClient('localhost', 27017)
2024-May-23 22:32:53.66| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.build:155  | Connecting to compute client: None
2024-May-23 22:32:53.66| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.datalayer:85   | Building Data Layer
2024-May-23 22:32:53.66| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.build:220  | Configuration: 
 +---------------+------------------+
| Configuration |      Value       |
+---------------+------------------+
|  Data Backend | mongomock://test |
+---------------+------------------+
2024-May-23 22:32:53.67| INFO     | Duncans-MBP.fritz.box| pinnacledb.backends.local.compute:37   | Submitting job. function:<function callable_job a

In [3]:
db.show()

[]

We are going to make these data searchable by activating a [`Model`](../apply_api/model) instance 
to compute vectors for each item inserted into the `"documents"` collection.
For that we'll use the [sentence-transformers](https://sbert.net/) integration for `pinnacledb`.
Read more about the `sentence_transformers` integration [here](../ai_integrations/sentence_transformers)
and [here](../../api/ext/sentence_transformers/).

In [4]:
from pinnacledb.ext.sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    identifier="test",
    predict_kwargs={"show_progress_bar": True},
    model="all-MiniLM-L6-v2",
    device="cpu",
    postprocess=lambda x: x.tolist(),
)



2024-May-23 22:33:00.27| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.component:386  | Initializing SentenceTransformer : test
2024-May-23 22:33:00.27| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.component:389  | Initialized  SentenceTransformer : test successfully


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

We can check that this model gives us what we want by evaluating an output 
on a single data-point. (Learn more about the various aspects of `Model` [here](../models/).)

In [5]:
model.predict_one(data[0])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[-0.0728381797671318,
 -0.04369897395372391,
 -0.053990256041288376,
 0.05244452506303787,
 -0.023977573961019516,
 0.01649916172027588,
 -0.011447322554886341,
 0.061035461723804474,
 -0.07156683504581451,
 -0.021972885355353355,
 0.01267794519662857,
 0.018208766356110573,
 0.05270218849182129,
 -0.020327100530266762,
 -0.019956670701503754,
 0.027658769860863686,
 0.05226463824510574,
 -0.09045840799808502,
 -0.05595366284251213,
 -0.015193621627986431,
 0.11809872835874557,
 0.006927163805812597,
 -0.042815908789634705,
 0.020163120701909065,
 -0.007551214192062616,
 0.05370991304516792,
 -0.06269364058971405,
 -0.015371082350611687,
 0.07905995100736618,
 0.01635877788066864,
 0.013246661052107811,
 0.05565343424677849,
 0.01678791269659996,
 0.08823869377374649,
 -0.06329561769962311,
 0.018252376466989517,
 0.01689964346587658,
 -0.09000741690397263,
 -0.013926311396062374,
 -0.054565709084272385,
 0.09763795882463455,
 -0.045446526259183884,
 -0.11169185489416122,
 -0.016722979

Now that we've verified that this model works, we can "activate" it for 
vector-search by creating a [`VectorIndex`](../apply_api/vector_index).

In [6]:
import pprint

vector_index = model.to_vector_index(select=db['documents'].find(), key='txt')

pprint.pprint(vector_index)

VectorIndex(identifier='test:vector_index',
            uuid='acd20227-14e2-4cee-9507-f738315f5d42',
            indexing_listener=Listener(identifier='component/listener/test/b335fc9c-ad9e-4495-8c39-6894c5b4f842',
                                       uuid='b335fc9c-ad9e-4495-8c39-6894c5b4f842',
                                       key='txt',
                                       model=SentenceTransformer(preferred_devices=('cuda',
                                                                                    'mps',
                                                                                    'cpu'),
                                                                 device='cpu',
                                                                 identifier='test',
                                                                 uuid='11063ea2-4afa-4cab-8a55-21d0c7ad2900',
                                                                 signature='singleton',
               

You will see that the `VectorIndex` contains a [`Listener`](../apply_api/listener) instance.
This instance wraps the model and configures it to compute outputs 
on data inserted into the `"documents"` collection, reading from the key `"txt"`.
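The listener idea described above can be sketched in plain Python. This is only a toy illustration under stated assumptions, not how `pinnacledb` implements `Listener`: the names `ToyCollection`, `add_listener`, `toy_model` and the output key `"_outputs"` are hypothetical and exist only for this example.

```python
class ToyCollection:
    """A hypothetical in-memory collection that runs models on inserted data."""

    def __init__(self):
        self.documents = []
        self.listeners = []  # (input key, model callable, output key)

    def add_listener(self, key, model, output_key):
        # Register the listener and backfill documents that already exist.
        self.listeners.append((key, model, output_key))
        for doc in self.documents:
            doc[output_key] = model(doc[key])

    def insert_many(self, docs):
        # Every registered listener computes its output for each new document.
        for doc in docs:
            doc = dict(doc)  # copy, so the caller's dict is not mutated
            for key, model, output_key in self.listeners:
                doc[output_key] = model(doc[key])
            self.documents.append(doc)


def toy_model(text):
    # Stand-in for an embedding model: [character count, space count].
    return [float(len(text)), float(text.count(" "))]


collection = ToyCollection()
collection.insert_many([{"txt": "hello world"}])         # inserted before the listener
collection.add_listener("txt", toy_model, "_outputs")    # existing data is backfilled
collection.insert_many([{"txt": "a b c"}])               # computed at insert time

for doc in collection.documents:
    print(doc)
```

In the sketch, data that was present before the listener was registered is processed at registration time, and later inserts are processed as they arrive.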

To activate this index, apply it to the database:

In [7]:
db.apply(vector_index)

2024-May-23 22:33:06.79| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.component:386  | Initializing DataType : dill_lazy
2024-May-23 22:33:06.79| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.component:389  | Initialized  DataType : dill_lazy successfully
2024-May-23 22:33:08.38| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.component:386  | Initializing DataType : dill
2024-May-23 22:33:08.38| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.component:389  | Initialized  DataType : dill successfully
2024-May-23 22:33:08.42| INFO     | Duncans-MBP.fritz.box| pinnacledb.backends.local.compute:37   | Submitting job. function:<function method_job at 0x1107caac0>


204it [00:00, 142844.41it/s]

2024-May-23 22:33:08.55| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.component:386  | Initializing SentenceTransformer : test
2024-May-23 22:33:08.55| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.component:389  | Initialized  SentenceTransformer : test successfully





Batches:   0%|          | 0/7 [00:00<?, ?it/s]

2024-May-23 22:33:12.78| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:783  | Adding 204 model outputs to `db`
2024-May-23 22:33:12.89| SUCCESS  | Duncans-MBP.fritz.box| pinnacledb.backends.local.compute:43   | Job submitted on <pinnacledb.backends.local.compute.LocalComputeBackend object at 0x15267d010>.  function:<function method_job at 0x1107caac0> future:3598065c-0bfb-4d94-9b25-6e7e82c09bd0
2024-May-23 22:33:12.90| INFO     | Duncans-MBP.fritz.box| pinnacledb.backends.local.compute:37   | Submitting job. function:<function callable_job at 0x1107caa20>
2024-May-23 22:33:12.98| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.datalayer:170  | Loading vectors of vector-index: 'test:vector_index'
2024-May-23 22:33:12.98| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.datalayer:180  | documents.find(documents[0], documents[1])


Loading vectors into vector-table...: 204it [00:00, 3148.10it/s]

2024-May-23 22:33:13.05| SUCCESS  | Duncans-MBP.fritz.box| pinnacledb.backends.local.compute:43   | Job submitted on <pinnacledb.backends.local.compute.LocalComputeBackend object at 0x15267d010>.  function:<function callable_job at 0x1107caa20> future:c355caeb-daab-4712-a269-6bfca8da2c09





([<pinnacledb.jobs.job.ComponentJob at 0x28d6f95d0>,
  <pinnacledb.jobs.job.FunctionJob at 0x28d757850>],
 VectorIndex(identifier='test:vector_index', uuid='acd20227-14e2-4cee-9507-f738315f5d42', indexing_listener=Listener(identifier='component/listener/test/b335fc9c-ad9e-4495-8c39-6894c5b4f842', uuid='b335fc9c-ad9e-4495-8c39-6894c5b4f842', key='txt', model=SentenceTransformer(preferred_devices=('cuda', 'mps', 'cpu'), device='cpu', identifier='test', uuid='11063ea2-4afa-4cab-8a55-21d0c7ad2900', signature='singleton', datatype=DataType(identifier='test/datatype', uuid='e46268dc-5c88-48dd-8595-f774c35a8f09', encoder=None, decoder=None, info=None, shape=(384,), directory=None, encodable='native', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermediate_type='bytes', media_type=None), output_schema=None, flatten=False, model_update_kwargs={}, predict_kwargs={'show_progress_bar': True}, compute_kwargs={}, validation=None, metric_values={}, object=SentenceTransformer(
   (0): Transformer(

The `db.apply` command is the universal command for activating AI components in SuperDuperDB.

It produces a lot of output: the model outputs (vectors) are computed, 
and the various parts of the `VectorIndex` are registered in the system.

You can verify this with:

In [8]:
db.show()

[{'identifier': 'test', 'type_id': 'model'},
 {'identifier': 'component/listener/test/b335fc9c-ad9e-4495-8c39-6894c5b4f842',
  'type_id': 'listener'},
 {'identifier': 'test:vector_index', 'type_id': 'vector_index'}]

In [9]:
db['documents'].find_one().execute().unpack()

{'txt': "---\nsidebar_position: 5\n---\n\n# Encoding data\n\nIn AI, typical types of data are:\n\n- **Numbers** (integers, floats, etc.)\n- **Text**\n- **Images**\n- **Audio**\n- **Videos**\n- **...bespoke in house data**\n\nMost databases don't support any data other than numbers and text.\nSuperDuperDB enables the use of these more interesting data-types using the `Document` wrapper.\n\n### `Document`\n\nThe `Document` wrapper, wraps dictionaries, and is the container which is used whenever \ndata is exchanged with your database. That means inputs, and queries, wrap dictionaries \nused with `Document` and also results are returned wrapped with `Document`.\n\nWhenever the `Document` contains data which is in need of specialized serialization,\nthen the `Document` instance contains calls to `DataType` instances.\n\n### `DataType`\n\nThe [`DataType` class](../apply_api/datatype), allows users to create and encoder custom datatypes, by providing \ntheir own encoder/decoder pairs.\n\nHere

To use the `VectorIndex`, we can execute a vector-search query:

In [11]:
query = db['documents'].like(
    {'txt': 'Tell me about vector-search'},
    vector_index=vector_index.identifier,
    n=3,
).find()
cursor = query.execute()

2024-May-23 22:33:16.62| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.datalayer:1095 | {}


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

This query returns a cursor of [`Document`](../fundamentals/document) instances.
To obtain the raw dictionaries, call the `.unpack()` method on each result:

In [12]:
for r in cursor:
    print('=' * 100)
    print(r.unpack()['txt'])
    print('=' * 100)

---
sidebar_position: 7
---

# Vector-search

SuperDuperDB allows users to implement vector-search in their database by either 
using in-database functionality, or via a sidecar implementation with `lance` and `FastAPI`.

## Philosophy

In `pinnacledb`, from a user point-of-view vector-search isn't a completely different beast than other ways of 
using the system:

- The vector-preparation is exactly the same as preparing outputs with any model, 
  with the special difference that the outputs are vectors, arrays or tensors.
- Vector-searches are just another type of database query which happen to use 
  the stored vectors.

## Algorithm

Here is a schematic of how vector-search works:

![](/img/vector-search.png)

## Explanation

A vector-search query has the schematic form:

```python
table_or_collection
    .like(Document(<dict-to-search-with>))      # the operand is vectorized using registered models
    .filter_results(*args, **kwargs)            # the results of vector-search are 

You should see that the documents returned are relevant to the `like` part of the 
query.
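To build intuition for why the returned documents are relevant, here is a minimal sketch of similarity ranking in plain Python. It assumes cosine similarity and a brute-force scan over stored vectors; the similarity measure and search strategy that `pinnacledb` actually uses may differ. The helper names `cosine_similarity` and `top_n`, and the three-dimensional toy vectors, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_n(query_vector, stored_vectors, n=3):
    """Return the n (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in stored_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy stand-ins for the vectors a model would compute for each document.
toy_vectors = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}

# The query is vectorized by the same model, then compared to every stored vector.
results = top_n([1.0, 0.0, 0.0], toy_vectors, n=2)
print([doc_id for doc_id, _ in results])
```

The `n` argument plays the same role as `n=3` in the `like` query: it limits how many of the closest matches are returned.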

Learn more about building queries with `pinnacledb` [here](../execute_api/overview.md).