# Vector-search with SuperDuperDB

In [None]:
!pip install superduperdb
!pip install sentence_transformers

Set your OpenAI API key as the `OPENAI_API_KEY` environment variable if it is not already set (for example, via your `.env` file).

In [None]:
import os

if 'OPENAI_API_KEY' not in os.environ:
    raise Exception('You need to set an OpenAI key as environment variable: "export OPENAI_API_KEY=sk-..."')

The next cell connects `superduperdb` to MongoDB. Under the hood, `superduperdb` configures where to store:
- models
- outputs
- metadata

In addition, `superduperdb` configures how vector search is performed.

In [1]:
import os

# Uncomment one of the following lines to use a bespoke MongoDB deployment.
# For testing, the default connection is to mongomock.

mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")
# mongodb_uri = "mongodb://localhost:27017/documents"
# mongodb_uri = "mongodb://superduper:superduper@mongodb:27017/documents"
# mongodb_uri = "mongodb://<user>:<pass>@<mongo_cluster>/<database>"
# mongodb_uri = "mongodb+srv://<username>:<password>@<atlas_cluster>/<database>"

# Super-Duper your Database!
from superduperdb import superduper
db = superduper(mongodb_uri, artifact_store='filesystem://./models')

INFO:root:Creating artifact store directory


In [2]:
db

<superduperdb.db.base.db.DB at 0x156cbe410>

We've prepared some data - it's the inline documentation of the `pymongo` API!

In [3]:
!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  120k  100  120k    0     0   268k      0 --:--:-- --:--:-- --:--:--  269k


We can insert this data into MongoDB using the `superduperdb` API, which supports `pymongo` commands.

In [4]:
import json
from superduperdb.db.mongodb.query import Collection
from superduperdb.container.document import Document as D

with open('pymongo.json') as f:
    data = json.load(f)

In [5]:
data[0]

{'key': 'pymongo.mongo_client.MongoClient',
 'parent': None,
 'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
 'document': 'mongo_client.md',
 'res': 'pymongo.mongo_client.MongoClient'}
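
Before inserting, it can help to confirm that every record carries the fields the rest of the notebook reads (`value` for embedding, `parent` and `res` for display). The following is a minimal sketch, not part of the `superduperdb` API: `missing_fields` and `REQUIRED_FIELDS` are hypothetical names, and `sample` simply copies the record printed above so the block runs on its own.

```python
# Hypothetical sanity check; `sample` copies the record shown above.
sample = [
    {
        'key': 'pymongo.mongo_client.MongoClient',
        'parent': None,
        'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
        'document': 'mongo_client.md',
        'res': 'pymongo.mongo_client.MongoClient',
    }
]

# Fields read later in the notebook (assumed sufficient for this walkthrough).
REQUIRED_FIELDS = ('value', 'parent', 'res')


def missing_fields(records, required=REQUIRED_FIELDS):
    """Return (index, missing field names) for each record lacking a required field."""
    problems = []
    for i, record in enumerate(records):
        missing = [name for name in required if name not in record]
        if missing:
            problems.append((i, missing))
    return problems


print(missing_fields(sample))
```

In the notebook itself you would call `missing_fields(data)` on the loaded list; an empty result means every record has the expected keys.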

In [6]:
db.execute(
    Collection('documents').insert_many([D(r) for r in data])
)

([ObjectId('65316786ebc8b1a0bd775b8e'),
  ObjectId('65316786ebc8b1a0bd775b8f'),
  ObjectId('65316786ebc8b1a0bd775b90'),
  ObjectId('65316786ebc8b1a0bd775b91'),
  ObjectId('65316786ebc8b1a0bd775b92'),
  ObjectId('65316786ebc8b1a0bd775b93'),
  ObjectId('65316786ebc8b1a0bd775b94'),
  ObjectId('65316786ebc8b1a0bd775b95'),
  ObjectId('65316786ebc8b1a0bd775b96'),
  ObjectId('65316786ebc8b1a0bd775b97'),
  ObjectId('65316786ebc8b1a0bd775b98'),
  ObjectId('65316786ebc8b1a0bd775b99'),
  ObjectId('65316786ebc8b1a0bd775b9a'),
  ObjectId('65316786ebc8b1a0bd775b9b'),
  ObjectId('65316786ebc8b1a0bd775b9c'),
  ObjectId('65316786ebc8b1a0bd775b9d'),
  ObjectId('65316786ebc8b1a0bd775b9e'),
  ObjectId('65316786ebc8b1a0bd775b9f'),
  ObjectId('65316786ebc8b1a0bd775ba0'),
  ObjectId('65316786ebc8b1a0bd775ba1'),
  ObjectId('65316786ebc8b1a0bd775ba2'),
  ObjectId('65316786ebc8b1a0bd775ba3'),
  ObjectId('65316786ebc8b1a0bd775ba4'),
  ObjectId('65316786ebc8b1a0bd775ba5'),
  ObjectId('65316786ebc8b1a0bd775ba6'),


In the remainder of the notebook, you can use either `openai` or `sentence_transformers` to
perform vector search: run one of the two cells below. Once the model wrapper is instantiated, the rest of the notebook is identical.

In [7]:
from superduperdb.ext.openai.model import OpenAIEmbedding

model = OpenAIEmbedding(model='text-embedding-ada-002')

In [None]:
import sentence_transformers
from superduperdb.container.model import Model
from superduperdb.ext.vector.encoder import vector

model = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=vector(shape=(384,)),
    predict_method='encode',
    postprocess=lambda x: x.tolist(),
    batch_predict=True,
)

In [8]:
model.predict('This is a test', one=True)

[-0.008059182204306126,
 -0.003603511257097125,
 -0.000528058095369488,
 -0.005753727629780769,
 -0.024468205869197845,
 0.016131576150655746,
 -0.014929304830729961,
 -0.004634029697626829,
 -0.0009636337636038661,
 -0.03445630520582199,
 0.015920188277959824,
 0.01726778782904148,
 -0.008997217752039433,
 0.0022311382927000523,
 0.008713165298104286,
 1.3005340406380128e-05,
 0.02448141761124134,
 0.0005771893775090575,
 0.008336629718542099,
 -0.007444834802299738,
 0.005446553695946932,
 0.0075637404806911945,
 -0.011547090485692024,
 0.02483813464641571,
 -0.028352467343211174,
 -0.02319987490773201,
 0.0035044229589402676,
 -0.03522258996963501,
 0.019421307370066643,
 -0.009941860102117062,
 0.021878696978092194,
 -0.0173470601439476,
 0.001747257076203823,
 -0.0363323800265789,
 0.0007807332440279424,
 -0.012676697224378586,
 -0.010609054937958717,
 -0.01729421131312847,
 0.00801954697817564,
 -0.010886501520872116,
 0.009162365458905697,
 0.016686471179127693,
 0.0071475696749

Now we can configure the Atlas vector-search index.
This command saves a model and sets it up to "listen" to a particular subfield (or the whole document) for
new text, converting it on the fly to vectors, which are then indexed by Atlas vector-search.

In [9]:
from superduperdb.container.vector_index import VectorIndex
from superduperdb.container.listener import Listener

db.add(
    VectorIndex(
        identifier=f'pymongo-docs-{model.identifier}',
        indexing_listener=Listener(
            model=model,
            key='value',
            select=Collection('documents').find(),
            predict_kwargs={'max_chunk_size': 1000},
        ),
    )
)

{
  "createSearchIndexes": "documents",
  "indexes": [
    {
      "name": "pymongo-docs-text-embedding-ada-002",
      "definition": {
        "mappings": {
          "dynamic": true,
          "fields": {
            "_outputs": {
              "fields": {
                "value": {
                  "fields": {
                    "text-embedding-ada-002": [
                      {
                        "dimensions": 1536,
                        "similarity": "cosine",
                        "type": "knnVector"
                      }
                    ]
                  },
                  "type": "document"
                }
              },
              "type": "document"
            }
          }
        }
      }
    }
  ]
}


INFO:root:Adding model text-embedding-ada-002 to db
INFO:root:Done.
527it [00:02, 225.58it/s]
INFO:root:Computing chunk 0/0
100%|██████████| 6/6 [00:05<00:00,  1.01it/s]


[]
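
The index definition printed above declares `"similarity": "cosine"`. As a rough illustration of what that measure computes between two embedding vectors, here is a minimal pure-Python sketch. The function name `cosine_similarity` is an assumption for illustration; it is not a `superduperdb` or Atlas call, and the actual index may scale or normalise the score differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative only)."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_a * norm_b)


# Identical directions give 1.0; orthogonal directions give 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because only the angle matters, vectors that differ in length but point the same way are treated as maximally similar.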

In [16]:
db.show('vector_index')

['my-index', 'pymongo-docs-text-embedding-ada-002']

In [17]:
db.show('listener')

['text-embedding-ada-002/txt', 'text-embedding-ada-002/value']

Now that the index is set up, we can use it in a query. `superduperdb` provides some syntactic sugar for
the `aggregate` search pipelines, which can trip developers up. It also handles
all conversion of inputs to vectors under the hood.

In [15]:
from superduperdb.db.mongodb.query import Collection
from superduperdb.container.document import Document as D
from IPython.display import Markdown, display

query = 'Perform analytics'

result = db.execute(
    Collection('documents')
        .like(D({'value': query}), vector_index=f'pymongo-docs-{model.identifier}', n=5)
        .find()
)

display(Markdown('---'))

for r in result:
    display(Markdown(f'### `{r["parent"] + "." if r["parent"] else ""}{r["res"]}`'))
    display(Markdown(r['value']))
    display(Markdown('---'))

---

### `db[collection_name] || db.collection_name.aggregate`


Perform a database-level aggregation.

See the [aggregation pipeline](https://mongodb.com/docs/manual/reference/operato

---

### `pymongo.client_options.ClientOptions.heartbeat_frequency`


The monitoring frequency in seconds.



---

### `c[name] || c.name.aggregate`


Perform an aggregation using the aggregation framework on this
collection.

The [`aggregate()`](#pymongo.collection.Col

---

### `pymongo.client_options.ClientOptions.event_listeners`


The event listeners registered for this client.

See [`monitoring`](monitoring.md#module-pymongo.monitoring) for detail

---

### `pymongo.results.BulkWriteResult.inserted_count`


The number of documents inserted.



---
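
For comparison, here is a rough sketch of the kind of raw `aggregate` pipeline that the `.like(...)` call spares you from writing. It assumes the Atlas `$vectorSearch` stage and the output path `_outputs.value.text-embedding-ada-002` taken from the index definition printed earlier; the notebook does not show which stage `superduperdb` actually emits, so treat this as an assumption. `build_vector_search_pipeline` is a hypothetical helper, and the placeholder query vector stands in for a real 1536-dimensional embedding.

```python
def build_vector_search_pipeline(query_vector, index_name, path, limit=5, num_candidates=50):
    """Build an Atlas-style vector-search pipeline as plain dicts (illustrative sketch)."""
    return [
        {
            '$vectorSearch': {
                'index': index_name,               # name of the search index
                'path': path,                      # field holding the stored vectors
                'queryVector': list(query_vector), # embedding of the query text
                'numCandidates': num_candidates,   # candidates considered before ranking
                'limit': limit,                    # number of results returned
            }
        },
        {
            # Keep the fields displayed in the notebook, plus the similarity score.
            '$project': {
                'value': 1,
                'parent': 1,
                'res': 1,
                'score': {'$meta': 'vectorSearchScore'},
            }
        },
    ]


pipeline = build_vector_search_pipeline(
    query_vector=[0.0] * 1536,  # placeholder; a real query needs an actual embedding
    index_name='pymongo-docs-text-embedding-ada-002',
    path='_outputs.value.text-embedding-ada-002',
)
print(pipeline[0]['$vectorSearch']['limit'])
```

With the hand-written form, you also have to embed the query text yourself before building the pipeline, which is the conversion step `superduperdb` performs for you.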