# Running computations on Dask

In this example, we show how to run computations on a Dask cluster rather than in the process that 
submits the data. This allows compute to scale horizontally and lets work be dispatched to 
workers that may use specialized hardware, including GPUs.

To do this, we override the default configuration, specifying only the settings that diverge 
from the defaults. In particular, to use a Dask cluster, we set 
`CFG.distributed = True`.
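
The idea of specifying only the settings that diverge from the defaults can be sketched as a recursive merge of an override dictionary over a defaults dictionary. This is a minimal illustration, not the library's actual loading code: `merge_overrides` is a hypothetical helper, the defaults shown are a small subset of the fields, and the default value `distributed: False` is an assumption.

```python
import copy


def merge_overrides(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections are merged key by key rather than replaced wholesale.
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative subset of defaults; `distributed: False` is assumed, not confirmed.
defaults = {
    "distributed": False,
    "cdc": False,
    "dask": {"ip": "localhost", "port": 8786, "local": True},
}
overrides = {"distributed": True}

config = merge_overrides(defaults, overrides)
print(config["distributed"])   # True
print(config["dask"]["port"])  # 8786
```

Only the overridden key changes; every other setting keeps its default value, and the `defaults` dictionary itself is left unmodified.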

In [10]:
!echo '{"distributed": true}' > configs.json
!cat configs.json

{"distributed": true}
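
On systems where the shell `echo` command is unavailable, the same override file could be written from Python with the standard `json` module. The sketch below is an assumption-laden alternative: `write_overrides` is a hypothetical helper, and the file is written to a temporary directory here so that the example has no side effects; in the notebook the target would be `configs.json` in the working directory.

```python
import json
import tempfile
from pathlib import Path


def write_overrides(path, overrides):
    """Serialize the override dictionary to `path` as JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps(overrides))
    return path


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_overrides(Path(tmp) / "configs.json", {"distributed": True})
    written = config_path.read_text()
    loaded = json.loads(written)

print(written)  # {"distributed": true}
```

Note that Python's `True` is serialized as the JSON literal `true`, which matches the file content shown above.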


We can now confirm, by inspecting the loaded configuration `CFG`, that `CFG.distributed == True`:

In [11]:
from pinnacledb import CFG

import pprint
pprint.pprint(CFG.dict())

{'apis': {'providers': {},
          'retry': {'stop_after_attempt': 2,
                    'wait_max': 10.0,
                    'wait_min': 4.0,
                    'wait_multiplier': 1.0}},
 'cdc': False,
 'dask': {'deserializers': [],
          'ip': 'localhost',
          'local': True,
          'password': '',
          'port': 8786,
          'serializers': [],
          'username': ''},
 'data_layers': {'artifact': {'cls': 'mongodb',
                              'connection': 'pymongo',
                              'kwargs': {'host': 'localhost',
                                         'password': 'testmongodbpassword',
                                         'port': 27018,
                                         'username': 'testmongodbuser'},
                              'name': '_filesystem:test_db'},
                 'data_backend': {'cls': 'mongodb',
                                  'connection': 'pymongo',
                                  'kwargs': {'host': 'loca

Now that we've set up the environment to use a Dask cluster, we can build the `Datalayer` and add some data to it.

In [12]:
from pinnacledb.datalayer.base.build import build_datalayer

db = build_datalayer()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 49921 instead
INFO:distributed.scheduler:State start
INFO:distributed.scheduler:  Scheduler at:     tcp://127.0.0.1:49922
INFO:distributed.scheduler:  dashboard at:  http://127.0.0.1:49921/status
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:49925'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:49926'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:49927'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:49928'
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:49935', name: 2, status: init, memory: 0, processing: 0>
INFO:distributed.scheduler:Starting worker compute stream, tcp://127.0.0.1:49935
INFO:distributed.core:Starting established connection to tcp://127.0.0.1:49939
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:49933', name: 1, status: init, memory: 0, processing: 0>
INFO:distributed.schedul
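
The log above shows a scheduler and several workers starting on `127.0.0.1`, which is consistent with the `'local': True` entry in the `dask` section of the printed configuration. How the library interprets the `ip` and `port` entries is not shown here, so the following is only a sketch under the assumption that a non-local setup would connect to a scheduler at `tcp://<ip>:<port>`, the address form that Dask schedulers report in their logs. `scheduler_address` is a hypothetical helper.

```python
def scheduler_address(dask_cfg):
    """Return a scheduler address, or None when a local cluster should be started."""
    if dask_cfg.get("local", True):
        # Assumed meaning of `local`: spin up a cluster on this machine.
        return None
    return f"tcp://{dask_cfg['ip']}:{dask_cfg['port']}"


# Values copied from the `dask` section of the configuration printed earlier.
dask_cfg = {"ip": "localhost", "port": 8786, "local": True}

print(scheduler_address(dask_cfg))                      # None
print(scheduler_address({**dask_cfg, "local": False}))  # tcp://localhost:8786
```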

In [15]:
db.db.client.drop_database('testdb')
db.db.client.drop_database('_filesystem:testdb')

As in the previous tutorials, we can wrap models from a range of AI frameworks to interoperate with the data, 
and insert data containing, for instance, tensors of a specific data type:

In [16]:
import pymongo
import torch

from pinnacledb import pinnacle
from pinnacledb.core.document import Document as D
from pinnacledb.encoders.torch.tensor import tensor
from pinnacledb.datalayer.mongodb.query import Collection

m = pinnacle(
    torch.nn.Linear(128, 7),
    encoder=tensor(torch.float, shape=(7,))
)

t32 = tensor(torch.float, shape=(128,))

output = db.execute(
    Collection('localcluster').insert_many(
        [D({'x': t32(torch.randn(128))}) for _ in range(1000)], 
        encoders=(t32,)
    )
)

INFO:root:found 0 uris
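
To make the role of an encoder with a fixed data type and shape more concrete, here is a minimal sketch of the general idea: a vector of 32-bit floats is validated against the declared shape, turned into bytes for storage, and restored on the way out. This is not the library's `tensor` encoder; `FloatVectorEncoder` is a hypothetical class that uses only the standard-library `array` module.

```python
from array import array


class FloatVectorEncoder:
    """Hypothetical encoder for fixed-length vectors of C floats (32-bit on common platforms)."""

    def __init__(self, shape):
        self.shape = shape

    def encode(self, values):
        # Reject vectors that do not match the declared shape.
        if len(values) != self.shape[0]:
            raise ValueError(f"expected {self.shape[0]} values, got {len(values)}")
        return array("f", values).tobytes()

    def decode(self, payload):
        values = array("f")
        values.frombytes(payload)
        if len(values) != self.shape[0]:
            raise ValueError(f"expected {self.shape[0]} values, got {len(values)}")
        return list(values)


encoder = FloatVectorEncoder(shape=(128,))
vector = [0.5 * i for i in range(128)]  # multiples of 0.5 are exact in float32

payload = encoder.encode(vector)
restored = encoder.decode(payload)
print(restored == vector)  # True
```

Declaring the shape up front lets malformed inputs fail at insertion time rather than when a model later reads them.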


Now when we instruct the model to make predictions based on the `Datalayer`, the computations run on the Dask cluster. The `.predict` method returns a `Job` instance, which can be used to monitor the progress of the computation:

In [17]:
job = m.predict(
    X='x',
    db=db,
    select=Collection('localcluster').find(),
)

job.watch()
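
The submit-then-monitor pattern behind a job handle can be illustrated with the standard-library `concurrent.futures` module, whose future interface Dask's own futures resemble. This is an analogy only and does not reproduce the `Job` class: `predict_batch` and `run_and_watch` are hypothetical names, and a thread pool stands in for the cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def predict_batch(rows):
    """Stand-in for model inference: return the sum of each input row."""
    return [sum(row) for row in rows]


def run_and_watch(rows, poll_interval=0.01):
    """Submit the work, poll until it finishes, then return the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future = pool.submit(predict_batch, rows)  # returns immediately
        polls = 0
        while not future.done():
            polls += 1
            time.sleep(poll_interval)
        return future.result(), polls


outputs, polls = run_and_watch([[1.0, 2.0], [3.0, 4.0]])
print(outputs)  # [3.0, 7.0]
```

The submitting process is free to do other work between polls, which is the main benefit of receiving a handle instead of blocking on the result.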

To check that the `Datalayer` has been populated with outputs, we can inspect the `"_outputs"` field of a record:

In [19]:
db.execute(Collection('localcluster').find_one())

Document({'_id': ObjectId('64c1d71719f67e1b48cdf3dd'), 'x': Encodable(x=tensor([-0.1185, -0.0404, -0.1605, -0.1824, -0.3947,  1.8372, -0.0539, -0.4879,
         0.8195,  1.5238, -1.4167, -2.0685, -0.9496, -0.1749,  1.2830,  0.8140,
         0.9339, -0.4342, -1.1220, -0.3847, -1.9544,  0.1534, -1.2628, -0.3808,
        -0.8816, -1.6340, -0.2665,  0.1103, -0.5372, -2.4823, -0.7087,  1.3672,
         1.2557, -0.5210,  0.9967, -1.0727,  1.2504,  0.5167,  2.3840, -1.0954,
         1.4559, -0.3013,  0.6051,  0.6426,  0.0433, -1.4484,  0.4398, -0.9044,
         1.2979, -1.0627,  0.0968, -0.7796,  1.4209, -0.1462, -1.5939, -1.6432,
        -0.7866,  1.5465,  0.1046,  1.0507, -0.3342,  0.7803, -0.2175, -1.3123,
        -0.7993, -0.3464, -1.0265, -2.1460,  0.1070, -0.9112,  0.3214, -0.0166,
         0.9422, -0.6901,  0.5440,  0.0243,  0.8935, -1.1099, -0.9086, -0.1164,
        -0.8712, -0.3519,  0.2826, -1.7652,  0.6794, -0.1562, -0.1362,  0.9918,
         0.0920,  0.9796,  0.0780, -0.2145, -1.3