# Run computations on Dask

In this example, we show how to run computations on a Dask cluster, rather than in the process 
that submits the data. This allows compute to be scaled horizontally, and lets work be sent to 
workers that may use specialized hardware, such as GPUs.

To do this, we override the default configuration. We only need to specify the 
settings that diverge from the defaults. In particular, to use a Dask cluster, we set 
`CFG.distributed = True`:

In [None]:
!echo '{"distributed": true}' > configs.json
!cat configs.json
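
An override file with the same content can also be written from Python. The sketch below additionally illustrates the idea of specifying only the diverging settings: a hypothetical `deep_merge` helper layers the override onto a small set of defaults. The helper, the `DEFAULTS` dictionary, and the assumed default of `distributed: False` are illustrative assumptions and are not pinnacledb's actual loading logic.

```python
import json
import tempfile
from pathlib import Path

# Illustrative defaults only; 'distributed': False is an assumed default value.
DEFAULTS = {
    "distributed": False,
    "cdc": False,
    "dask": {"ip": "localhost", "port": 8786, "local": True},
}


def deep_merge(defaults, overrides):
    """Return a new dict where values in `overrides` replace those in `defaults`,
    recursing into nested dicts so untouched keys keep their default values."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Write the override to a temporary directory so no real configs.json is clobbered.
config_path = Path(tempfile.mkdtemp()) / "configs.json"
config_path.write_text(json.dumps({"distributed": True}))

# Load the override and layer it on top of the defaults.
overrides = json.loads(config_path.read_text())
effective = deep_merge(DEFAULTS, overrides)

print(effective["distributed"])
print(effective["dask"]["port"])
```

Only the `distributed` key changes here; the nested `dask` settings keep their default values because the override file does not mention them.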

We can now confirm, by importing the loaded configuration `CFG`, that `CFG.distributed == True`:

In [2]:
from pinnacledb import CFG

import pprint
pprint.pprint(CFG.dict())

{'apis': {'providers': {},
          'retry': {'stop_after_attempt': 2,
                    'wait_max': 10.0,
                    'wait_min': 4.0,
                    'wait_multiplier': 1.0}},
 'cdc': False,
 'dask': {'deserializers': [],
          'ip': 'localhost',
          'local': True,
          'password': '',
          'port': 8786,
          'serializers': [],
          'username': ''},
 'data_layers': {'artifact': {'cls': 'mongodb',
                              'connection': 'pymongo',
                              'kwargs': {'host': 'localhost',
                                         'password': 'testmongodbpassword',
                                         'port': 27018,
                                         'username': 'testmongodbuser'},
                              'name': '_filesystem:test_db'},
                 'data_backend': {'cls': 'mongodb',
                                  'connection': 'pymongo',
                                  'kwargs': {'host': 'loca

Now that we've set up the environment to use a Dask cluster, we can add some data to the `Datalayer`.

In [3]:
from pinnacledb.db.base.build import build_datalayer

db = build_datalayer()

INFO:faiss.loader:Loading faiss.
INFO:faiss.loader:Successfully loaded faiss.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
INFO:distributed.http.proxy:To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO:distributed.scheduler:State start
INFO:distributed.diskutils:Found stale lock file and directory '/var/folders/y9/b74b9yj906s_wtj0rrh2lf7c0000gn/T/dask-scratch-space/scheduler-zr4pij_d', purging
INFO:distributed.scheduler:  Scheduler at:     tcp://127.0.0.1:50788
INFO:distributed.scheduler:  dashboard at:  http://127.0.0.1:8787/status
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:50791'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:50792'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:50793'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:50794'
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:50799', name: 0, sta

In [4]:
# Remove any data left over from earlier runs so the example starts clean
db.db.client.drop_database('test_db')
db.db.client.drop_database('_filesystem:test_db')

As in the previous tutorials, we can wrap models from a range of AI frameworks to interoperate with the data, 
and insert data containing, for instance, tensors of a specific data type:

In [5]:
import pymongo
import torch

from pinnacledb import pinnacle
from pinnacledb.container.document import Document as D
from pinnacledb.ext.torch.tensor import tensor
from pinnacledb.db.mongodb.query import Collection

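# Wrap a PyTorch linear layer; its 7-dimensional float outputs are encoded as tensors
m = pinnacle(
    torch.nn.Linear(128, 7),
    encoder=tensor(torch.float, shape=(7,))
)

# Encoder for the 128-dimensional float input tensors
t32 = tensor(torch.float, shape=(128,))

# Insert 1000 documents, each holding a random tensor in field 'x'
output = db.execute(
    Collection('localcluster').insert_many(
        [D({'x': t32(torch.randn(128))}) for _ in range(1000)],
        encoders=(t32,)
    )
)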

INFO:faiss.loader:Loading faiss.
INFO:faiss.loader:Successfully loaded faiss.
INFO:root:found 0 uris


Now when we instruct the model to make predictions based on the `Datalayer`, the computations run on the Dask cluster. The `.predict` method returns a `Job` instance, which can be used to monitor the progress of the computation:

In [6]:
job = m.predict(
    X='x',
    db=db,
    select=Collection('localcluster').find(),
)

job.watch()

INFO:faiss.loader:Loading faiss.
INFO:faiss.loader:Successfully loaded faiss.



  0%|          | 0/1000 [00:00<?, ?it/s]
100%|##########| 1000/1000 [00:00<00:00, 19485.55it/s]
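
The submit-then-monitor pattern behind `job.watch()` can be illustrated without a cluster. The sketch below uses Python's standard `concurrent.futures` purely as a stand-in for Dask; `slow_predict` is a hypothetical placeholder for a model call, and none of this reflects how pinnacledb's `Job` is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_predict(batch):
    """Hypothetical stand-in for a model's forward pass: sums each row."""
    return [sum(row) for row in batch]


batch = [[1.0, 2.0], [3.0, 4.0]]

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns immediately with a handle to the pending work
    future = pool.submit(slow_predict, batch)
    # result() blocks until the computation has finished
    outputs = future.result()

print(future.done())
print(outputs)
```

The handle returned by `submit` plays a role loosely comparable to the `Job` above: the caller keeps a reference to work running elsewhere and decides when to wait for it.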


To check that the `Datalayer` has been populated with outputs, we can inspect the `"_outputs"` field of a record:

In [8]:
db.execute(Collection('localcluster').find_one()).unpack()

{'_id': ObjectId('64c9016978525ecbb9f364ee'),
 'x': tensor([-0.4790,  1.6279,  0.6850, -1.5799, -0.4191,  1.2377, -1.3383,  1.1423,
          0.8592,  0.3720,  0.6008,  0.5610,  0.5444,  0.5510, -1.6476,  0.5723,
         -1.7961,  1.3233, -1.8662,  1.5770,  0.6966, -0.6335, -0.0390, -0.6632,
         -1.1577, -0.3995,  0.1329,  0.7714,  1.7876,  0.0969, -0.6961,  0.7991,
         -0.5688,  0.2482,  0.9115,  1.1281, -0.6572, -0.4827, -0.1045,  1.2745,
          1.3540, -0.5822,  1.1297,  1.2904,  0.6449, -0.2306,  1.7910, -0.0457,
         -0.2778, -0.0755, -1.3594,  1.0392, -0.3065, -0.4266,  1.0180,  0.4784,
         -2.0998,  0.5551,  0.0905,  0.8975, -0.8218, -0.5843,  0.6310, -0.3477,
         -0.9706,  2.0773, -0.2294,  0.5185,  1.0339, -0.5785,  0.1760,  0.6875,
         -0.6930,  0.8977, -0.2261,  0.9070, -0.8581,  1.0713, -0.8523,  2.1395,
         -1.5788,  1.8719, -0.2903, -0.0486, -1.5856, -1.5521,  1.1228,  0.5800,
         -0.0910, -0.3383, -0.6915, -0.1575,  0.9333,  0.8