# SuperDuperDB: cluster usage

SuperDuperDB lets developers experiment with and set up models quickly in scripts and notebooks, and also deploy persistent services that are intended to be "always on". These persistent services are:

- Dask scheduler
- Dask workers
- Vector-searcher service
- Change-data-capture (CDC) service

![](../docs/hr/static/img/light.png)

To set up `pinnacledb` to use this cluster mode, it's necessary to add explicit configurations 
for each of these components. The following configuration does that, as well as enabling a pre-configured 
community edition MongoDB database:

```yaml
data_backend: mongodb://pinnacle:pinnacle@mongodb:27017/test_db
artifact_store: filesystem://./data
cluster:
  cdc: http://cdc:8001
  compute: dask+tcp://scheduler:8786
  vector_search: http://vector-search:8000
```

Add this configuration to `/.pinnacledb/config.yaml`, where `/` is the root of your project.
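
Before starting any containers, it can help to confirm that each cluster URI points at the host and port you expect. The following is a minimal sketch using only the Python standard library. The `cluster` dictionary simply mirrors the YAML above, and `service_endpoints` is a hypothetical helper, not part of `pinnacledb`.

```python
from urllib.parse import urlparse

# Mirrors the `cluster` section of the YAML configuration above.
cluster = {
    "cdc": "http://cdc:8001",
    "compute": "dask+tcp://scheduler:8786",
    "vector_search": "http://vector-search:8000",
}


def service_endpoints(cluster_config):
    """Map each service name to its (scheme, host, port) triple."""
    endpoints = {}
    for name, uri in cluster_config.items():
        parsed = urlparse(uri)
        endpoints[name] = (parsed.scheme, parsed.hostname, parsed.port)
    return endpoints


endpoints = service_endpoints(cluster)
for name, (scheme, host, port) in endpoints.items():
    print(f"{name}: scheme={scheme} host={host} port={port}")
```

The host names (`cdc`, `scheduler`, `vector-search`) are service names, so they are expected to resolve only inside the `docker-compose` network.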

Once this configuration has been added, you're ready to use the `pinnacledb` sandbox environment, which uses 
`docker-compose` to deploy:

- Standalone replica-set of MongoDB community edition
- Dask scheduler
- Dask workers
- Vector-searcher service
- Change-data-capture (CDC) service
- Jupyter notebook service

To set up this environment, navigate to your local copy of the `pinnacledb` repository, and build the image with:

```bash
make testenv_image pinnacleDB_EXTRAS=sandbox
```

Then start the environment with:

```bash
make testenv_init
```

This last command starts a container for each of the above services with `docker-compose`. You should see log output from each service, most of it from MongoDB.
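
If you want to check that the services are reachable before continuing, a plain TCP connection test is usually enough. The sketch below is an illustration under stated assumptions: the host names and ports are taken from the configuration above, `is_port_open` and `report` are hypothetical helpers, and the check only makes sense from a container attached to the same `docker-compose` network (for example, the Jupyter notebook service).

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


# Assumed endpoints, copied from the configuration shown earlier.
services = {
    "mongodb": ("mongodb", 27017),
    "scheduler": ("scheduler", 8786),
    "vector-search": ("vector-search", 8000),
    "cdc": ("cdc", 8001),
}


def report(services, timeout=1.0):
    """Return a dict mapping each service name to its reachability."""
    return {
        name: is_port_open(host, port, timeout)
        for name, (host, port) in services.items()
    }
```

Calling `report(services)` from inside the notebook container returns a dictionary of booleans. A `False` entry only tells you that the port did not accept a connection, not why.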

Once you have carried out these steps, you are ready to complete the rest of this notebook.

In [1]:
import os

# move to the root of the project (assumes the notebook starts in `/examples`)
os.chdir('../')

from pinnacledb import CFG

# check that config has been properly set-up
assert CFG.data_backend == 'mongodb://pinnacle:pinnacle@mongodb:27017/test_db'

In [2]:
from pinnacledb.backends.mongodb import Collection
from pinnacledb import pinnacle

db = pinnacle()
doc_collection = Collection('documents')

2023-Nov-30 09:08:53.41 | DEBUG | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.base.build:36 | Parsing data connection URI:mongodb://pinnacle:pinnacle@mongodb:27017/test_db
2023-Nov-30 09:08:53.41 | INFO | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.base.build:101 | Data Client is ready. MongoClient(host=['mongodb:27017'], document_class=dict, tz_aware=False, connect=True, serverselectiontimeoutms=5000)
2023-Nov-30 09:08:53.43 | INFO | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.base.datalayer:79 | Building Data Layer


In [3]:
!curl -O https://pinnacledb-public.s3.eu-west-1.amazonaws.com/pymongo.json

import json

with open('pymongo.json') as f:
    data = json.load(f)

data[0]

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  120k  100  120k    0     0   316k      0 --:--:-- --:--:-- --:--:--  315k


{'key': 'pymongo.mongo_client.MongoClient',
 'parent': None,
 'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
 'document': 'mongo_client.md',
 'res': 'pymongo.mongo_client.MongoClient'}

In [4]:
from pinnacledb import Document

out, G = db.execute(
    doc_collection.insert_many([Document(r) for r in data[:-100]])
)

2023-Nov-30 09:10:13.38 | INFO | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.base.datalayer:448 | CDC active, skipping refresh


In [5]:
db.metadata.show_jobs()

[]

In [6]:
import sentence_transformers
from pinnacledb import Model, vector

model = Model(
   identifier='all-MiniLM-L6-v2',
   object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
   encoder=vector(shape=(384,)),
   predict_method='encode',           # Specify the prediction method
   postprocess=lambda x: x.tolist(),  # Define postprocessing function
   batch_predict=True,                # Generate predictions for a set of observations all at once 
)

In [7]:
from pinnacledb import Listener, VectorIndex

jobs, vi = db.add(
    VectorIndex(
        identifier=f'pymongo-docs-{model.identifier}',
        indexing_listener=Listener(
            select=doc_collection.find(),
            key='value',
            model=model,
            predict_kwargs={'max_chunk_size': 1000},
        ),
    )
)

2023-Nov-30 09:14:39.80 | DEBUG | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.misc.server:26 | Trying to connect cdc at http://cdc:8001/handshake/config method: post
2023-Nov-30 09:14:39.81 | DEBUG | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.misc.server:26 | Trying to connect cdc at http://cdc:8001/listener/add method: get
2023-Nov-30 09:14:44.25 | INFO | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.components.model:224 | Adding model all-MiniLM-L6-v2 to db
2023-Nov-30 09:14:44.25 | DEBUG | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.base.datalayer:901 | model/all-MiniLM-L6-v2/0 already exists - doing nothing
2023-Nov-30 09:14:44.67

In [8]:
jobs[0].watch()

2023-Nov-30 09:14:49.94 | INFO | 46bda0c5357e | 05d945dd-6fdc-49f3-a6b3-d07c4fe0ad3c | pinnacledb.components.model:224 | Adding model all-MiniLM-L6-v2 to db
2023-Nov-30 09:14:49.94 | DEBUG | 46bda0c5357e | 05d945dd-6fdc-49f3-a6b3-d07c4fe0ad3c | pinnacledb.base.datalayer:901 | model/all-MiniLM-L6-v2/0 already exists - doing nothing
2023-Nov-30 09:14:49.95 | INFO | 46bda0c5357e | 05d945dd-6fdc-49f3-a6b3-d07c4fe0ad3c | pinnacledb.components.model:376 | Computing chunk 0/0
427it [00:00, 117156.26it/s]


In [9]:
db.execute(doc_collection.find_one())

Document({'_id': ObjectId('65685175cd6015a505aa832b'), 'key': 'pymongo.mongo_client.MongoClient', 'parent': None, 'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n', 'document': 'mongo_client.md', 'res': 'pymongo.mongo_client.MongoClient', '_fold': 'train', '_outputs': {'value': {'all-MiniLM-L6-v2': {'0': [-0.06886107474565506, 0.019031450152397156, -0.07293467968702316, 0.022899897769093513, -0.03625085949897766, -0.05900949984788895, -0.009488087147474289, 0.016871213912963867, 0.08023775368928909, 0.015818415209650993, -0.03495324030518532, 0.011837320402264595, 0.027356967329978943, 0.0005958160036243498, 0.020689714699983597, -0.02070797234773636, 0.09423939138650894, -0.03771770000457764, 0.08832316845655441, 0.03095955401659012, -0.06022973731160164, -0.0893627405166626, -0.013766525313258171, 0.061106398701667786, -0.04972735792398453, -0.06167677044868469, 0.01156203169375658, -0.014520982280373573, -0.031374141573905945, 0.023699967190623283,

In [10]:
from IPython.display import Markdown

result = sorted(db.execute(
    doc_collection
        .like(Document({'value': 'Aggregate'}), n=10, vector_index=f'pymongo-docs-{model.identifier}')
        .find({}, {'_outputs': 0})
), key=lambda r: -r['score'])

# Display a horizontal line
display(Markdown('---'))

# Iterate through the query results and display them
for r in result:
    # Display the document's parent and res values in a formatted way
    display(Markdown(f'### `{r["parent"] + "." if r["parent"] else ""}{r["res"]}`'))
    
    # Display the value of the document
    display(Markdown(r['value']))
    
    # Display a horizontal line
    display(Markdown('---'))

2023-Nov-30 09:15:08.60 | INFO | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.base.datalayer:112 | loading of vectors of vector-index: 'pymongo-docs-all-MiniLM-L6-v2'
2023-Nov-30 09:15:08.61 | DEBUG | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.misc.server:26 | Trying to connect vector_search at http://vector-search:8000/handshake/config method: post
2023-Nov-30 09:15:08.63 | DEBUG | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.misc.server:26 | Trying to connect vector_search at http://vector-search:8000/create/search method: get
2023-Nov-30 09:15:11.58 | DEBUG | demo | 953707b3-5fc7-4576-bf83-2f4e0c34638f | pinnacledb.misc.server:26 | Trying to connect vec

---

### `c[name] || c.name.aggregate`


Perform an aggregation using the aggregation framework on this
collection.

The [`aggregate()`](#pymongo.collection.Col

---

### `c[name] || c.name.watch`


Watch changes on this collection.

Performs an aggregation with an implicit initial `$changeStream`
stage and returns a

---

### `c[db_name] || c.db_name.watch`


Watch changes on this cluster.

Performs an aggregation with an implicit initial `$changeStream`
stage and returns a
[`

---

### `c[name] || c.name.count_documents`


Count the number of documents in this collection.



---

### `pymongo.results.BulkWriteResult.upserted_count`


The number of documents upserted.



---

### `pymongo.results.BulkWriteResult.inserted_count`


The number of documents inserted.



---

### `pymongo.results.BulkWriteResult.modified_count`


The number of documents modified.



---

### `pymongo.results.UpdateResult.modified_count`


The number of documents modified.



---

### `pymongo.client_session.ClientSession.commit_transaction`


Commit a multi-statement transaction.



---

### `pymongo.operations.SearchIndexModel.document`


The document for this index.



---
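
Conceptually, the `like` query above embeds the query text with the same model and returns the `n` stored vectors closest to it. The following is a minimal pure-Python sketch of that ranking step, assuming cosine similarity as the measure. The measure actually used by the vector-search service depends on how the `VectorIndex` is configured, and the toy vectors and the `top_n` helper here are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n(query, vectors, n=10):
    """Return the n (doc_id, score) pairs with the highest similarity to `query`."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: -pair[1])  # highest score first
    return scored[:n]


# Toy 3-dimensional "embeddings"; the real model produces 384 dimensions.
vectors = {
    "aggregate": [0.9, 0.1, 0.0],
    "watch": [0.6, 0.4, 0.1],
    "commit_transaction": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

ranking = top_n(query, vectors, n=2)
print([doc_id for doc_id, _ in ranking])  # ['aggregate', 'watch']
```

A dedicated vector-search service performs the same kind of ranking, but over an index kept in memory by a long-running process rather than a Python dictionary.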

In [None]:
db.drop(force=True)

The advantage of this production mode is that data can now be inserted via other MongoDB clients, even from other programming languages and applications.

We showcase this here by inserting the rest of the data using the official Python MongoDB driver, `pymongo`.

Because the services are persistent, this cell will cause the model outputs to be updated even if you restart the program:

In [14]:
import pymongo

coll = pymongo.MongoClient('mongodb://pinnacle:pinnacle@mongodb:27017/test_db').test_db.documents

coll.insert_many(data[-100:])

To get an idea of what is happening, you can view the logs of the CDC container on
your host by executing this command in a terminal:

```bash
docker logs -n 20 testenv_cdc_1
```

Note that this won't work from inside this notebook, since the notebook runs in its own container.

The CDC service should have captured the changes created by the `pymongo` insert and submitted one or more new jobs
to the `dask` cluster.
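
MongoDB exposes changes to a collection through change streams, which is the kind of mechanism a CDC service can build on. The sketch below is not the `pinnacledb` CDC implementation. It only illustrates the pattern with `Collection.watch` from `pymongo`, and `inserted_ids`, `watch_inserts`, and `on_insert` are hypothetical names. Change streams require a replica set, which is why the sandbox deploys MongoDB as one.

```python
def inserted_ids(change_stream):
    """Yield the `_id` of every insert event seen on a change stream."""
    for change in change_stream:
        # Each change event is a dict; inserts carry operationType == "insert".
        if change["operationType"] == "insert":
            yield change["documentKey"]["_id"]


def watch_inserts(collection, on_insert):
    """Call `on_insert(_id)` for each document inserted into `collection`.

    `collection` is expected to be a pymongo Collection (or any object with a
    compatible `watch` method). With a real collection this loop blocks until
    the stream is closed.
    """
    # Ask the server to send only insert events.
    pipeline = [{"$match": {"operationType": "insert"}}]
    with collection.watch(pipeline) as stream:
        for _id in inserted_ids(stream):
            on_insert(_id)
```

With the `coll` object from the cell above, `watch_inserts(coll, print)` would print the `_id` of each newly inserted document. A CDC service would replace `print` with a callback that submits a job for the new documents.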

You can confirm that another job has been created and executed:

In [18]:
db.metadata.show_jobs()

[{'identifier': '4a5a6ed9-7e97-4687-aaee-ebc612fb9d41',
  'time': datetime.datetime(2023, 11, 30, 9, 14, 44, 259000),
  'status': 'success'},
 {'identifier': '5ef246a0-2784-47f9-9603-5e988a0ebba1',
  'time': datetime.datetime(2023, 11, 30, 9, 17, 24, 104000),
  'status': 'success'},
 {'identifier': 'a5077d81-0e00-4004-b501-23af356e0234',
  'time': datetime.datetime(2023, 11, 30, 9, 17, 24, 111000),
  'status': 'success'}]
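
Rather than copying an identifier by hand, you can pick the most recent job programmatically. This sketch assumes that `show_jobs()` returns a list of dictionaries with the `identifier`, `time`, and `status` keys shown in the output above. The `jobs` list is copied from that output, and `latest_job` is a hypothetical helper.

```python
import datetime

# Copied from the `show_jobs()` output above.
jobs = [
    {"identifier": "4a5a6ed9-7e97-4687-aaee-ebc612fb9d41",
     "time": datetime.datetime(2023, 11, 30, 9, 14, 44, 259000),
     "status": "success"},
    {"identifier": "5ef246a0-2784-47f9-9603-5e988a0ebba1",
     "time": datetime.datetime(2023, 11, 30, 9, 17, 24, 104000),
     "status": "success"},
    {"identifier": "a5077d81-0e00-4004-b501-23af356e0234",
     "time": datetime.datetime(2023, 11, 30, 9, 17, 24, 111000),
     "status": "success"},
]


def latest_job(jobs):
    """Return the job dictionary with the latest `time` value."""
    return max(jobs, key=lambda job: job["time"])


latest = latest_job(jobs)
print(latest["identifier"])  # a5077d81-0e00-4004-b501-23af356e0234
```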

You can view the `stdout` of the most recent job with this command:

In [26]:
db.metadata.watch_job('a5077d81-0e00-4004-b501-23af356e0234')

2023-Nov-30 09:17:24.26 | INFO | 46bda0c5357e | 05d945dd-6fdc-49f3-a6b3-d07c4fe0ad3c | pinnacledb.components.model:224 | Adding model all-MiniLM-L6-v2 to db
2023-Nov-30 09:17:24.26 | DEBUG | 46bda0c5357e | 05d945dd-6fdc-49f3-a6b3-d07c4fe0ad3c | pinnacledb.base.datalayer:901 | model/all-MiniLM-L6-v2/0 already exists - doing nothing
2023-Nov-30 09:17:24.27 | INFO | 46bda0c5357e | 05d945dd-6fdc-49f3-a6b3-d07c4fe0ad3c | pinnacledb.components.model:376 | Computing chunk 0/0


In [23]:
db.execute(doc_collection.count_documents({'_outputs': {'$exists': 1}}))

527
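
This count is consistent with the two insert steps: the first job processed 427 documents (the `427it` progress line), and the `pymongo` insert added the remaining 100. The sketch below reproduces the arithmetic with a stand-in list in place of the real records. `split_tail` is a hypothetical helper that mirrors the `data[:-100]` and `data[-100:]` slices used earlier.

```python
def split_tail(records, tail=100):
    """Split `records` into (all but the last `tail` items, the last `tail` items)."""
    return records[:-tail], records[-tail:]


# Stand-in for the 527 records loaded from pymongo.json.
records = list(range(527))
first_batch, second_batch = split_tail(records)

total = len(first_batch) + len(second_batch)
print(len(first_batch), len(second_batch), total)  # 427 100 527
```

The two slices are disjoint and together cover the whole list, so no document is inserted twice or skipped.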