# Distributing labelling across multiple people

One of the main challenges of labelling data is that it can take a lot of time.

To get around this, many people want to distribute the task across multiple people, potentially even outsourcing it to a crowd platform. This is difficult with a standard in-memory Python object.

In `superintendent`, you can do this with the `superintendent.distributed` submodule. Its labelling widgets effectively replicate the widgets in the main `superintendent` module, but use a database to store the "queue" of objects as well as the results of the labelling.


The distributed submodule stores and retrieves data from a SQL database, serialising / deserialising it along the way. You simply pass your data in the same way you do with `superintendent` widgets, and can retrieve the labels in the same way. In theory, other than having to set up the database, everything else should be the same.
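To make the serialising / deserialising step concrete, here is a minimal sketch of a JSON round trip for a numpy array. The `__type__` / `__content__` tags are copied from the stored row printed further down this page; the function names `to_json` and `from_json` are hypothetical, and `superintendent`'s actual implementation may differ.

```python
import json

import numpy as np

# Tag names mirror the stored row shown later on this page; the functions
# themselves are hypothetical and not part of superintendent's API.
TYPE_KEY = "__type__"
CONTENT_KEY = "__content__"
NDARRAY_TAG = "__np.ndarray__"


def to_json(value):
    """Serialise a numpy array (or any JSON-compatible value) to a string."""
    if isinstance(value, np.ndarray):
        return json.dumps({TYPE_KEY: NDARRAY_TAG, CONTENT_KEY: value.tolist()})
    return json.dumps(value)


def from_json(text):
    """Reverse to_json, rebuilding numpy arrays from tagged objects."""
    value = json.loads(text)
    if isinstance(value, dict) and value.get(TYPE_KEY) == NDARRAY_TAG:
        return np.array(value[CONTENT_KEY])
    return value


features = np.array([0.0, 5.0, 13.0])
stored = to_json(features)      # text that fits in a SQL column
restored = from_json(stored)    # back to a numpy array
```

Tagging the payload with its type is what allows arrays and plain labels to share the same text columns.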

The use case ultimately looks a bit like this:

![distributed diagram](superintendent-distributed-diagram.png)

This allows you to ask your colleagues to label data for you. Separating the labelling process from the active learning process also lets you scale the compute that does the active learning: for example, a server with GPUs can train complex models while the labelling user just uses a laptop.

Ultimately, the database architecture also means that you have more persistent storage, and are more robust to crashes.
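The persistence argument can be illustrated with a small database-backed queue built on the standard-library `sqlite3` module. This is a sketch under stated assumptions, not `superintendent`'s implementation: the column names are borrowed from the row printed later on this page, while the table name, the SQL and the helper functions (`connect`, `enqueue`, `pop`, `submit`) are hypothetical.

```python
import os
import sqlite3
import tempfile
from datetime import datetime, timezone

# Hypothetical schema: column names follow the row shown later on this page.
SCHEMA = """
CREATE TABLE IF NOT EXISTS queue (
    id INTEGER PRIMARY KEY,
    input TEXT NOT NULL,
    output TEXT,
    inserted_at TEXT NOT NULL,
    popped_at TEXT,
    completed_at TEXT
)
"""


def now():
    return datetime.now(timezone.utc).isoformat()


def connect(path):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, item):
    with conn:  # commits on success
        conn.execute(
            "INSERT INTO queue (input, inserted_at) VALUES (?, ?)", (item, now())
        )


def pop(conn):
    """Hand out the oldest item that has not been given to anyone yet.

    A real multi-user queue would also need locking so that two workers
    cannot pop the same row.
    """
    with conn:
        row = conn.execute(
            "SELECT id, input FROM queue WHERE popped_at IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE queue SET popped_at = ? WHERE id = ?", (now(), row[0])
            )
    return row


def submit(conn, item_id, label):
    with conn:
        conn.execute(
            "UPDATE queue SET output = ?, completed_at = ? WHERE id = ?",
            (label, now(), item_id),
        )


path = os.path.join(tempfile.mkdtemp(), "sketch.db")

# First "session": add two items and label one of them.
conn = connect(path)
enqueue(conn, "item-a")
enqueue(conn, "item-b")
item_id, item = pop(conn)
submit(conn, item_id, "cat")
conn.close()

# Second "session", e.g. after a crash: the state is still on disk.
conn = connect(path)
labelled = conn.execute(
    "SELECT input, output FROM queue WHERE output IS NOT NULL"
).fetchall()
remaining = pop(conn)
conn.close()
```

Because both the queue and the labels live in the database file, a second process can pick up exactly where the first one stopped.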

## Distributing the labelling of images across people

`superintendent` uses [SQLAlchemy](https://www.sqlalchemy.org/) to communicate with the database, and all you need to provide is a ["connection url"](https://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls).
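A connection URL follows the general pattern `dialect://user:password@host:port/database`, with file-based databases such as SQLite using `sqlite:///path`. The sketch below assembles such strings with the standard library only; `build_connection_url` is a hypothetical helper (not part of `superintendent` or SQLAlchemy), and the host, user and password are made-up placeholders.

```python
from urllib.parse import quote_plus


def build_connection_url(dialect, database, user=None, password=None,
                         host=None, port=None):
    """Assemble a 'dialect://user:password@host:port/database' URL string.

    Hypothetical helper for illustration only.
    """
    if host is None:
        # File-based databases such as SQLite have no host part.
        return f"{dialect}:///{database}"
    credentials = ""
    if user is not None:
        credentials = quote_plus(user)
        if password is not None:
            # Percent-encode special characters such as '@' or '/'.
            credentials += ":" + quote_plus(password)
        credentials += "@"
    location = host if port is None else f"{host}:{port}"
    return f"{dialect}://{credentials}{location}/{database}"


sqlite_url = build_connection_url("sqlite", "demo.db")
postgres_url = build_connection_url(
    "postgresql", "labelling", user="alice", password="p@ss/word",
    host="db.example.com", port=5432,
)
```

Percent-encoding the password matters because characters such as `@` would otherwise be read as part of the URL structure.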

First, we make sure that we are using a completely fresh database:


In [1]:
import os
if os.path.isfile("demo.db"):
    os.remove("demo.db")

In [2]:
from sklearn.datasets import load_digits
import numpy as np
digits = load_digits().data


In [3]:
from superintendent.distributed import SemiSupervisor

widget = SemiSupervisor.from_images(
    connection_string="sqlite:///demo.db",
    options=range(10)
)


We can then add data to the database. Because every widget that connects to the database shares the same data, we should only run this code once:

In [4]:
widget.add_features(digits[:500, :])

We can then start labelling data:

In [5]:
widget

VBox(children=(HBox(children=(FloatProgress(value=0.0, description='Progress:', max=1.0),)), Box(children=(Out…

You can inspect the database by using the `widget.queue` attribute, which encapsulates the database connection and the methods for retrieving and submitting data.

In [6]:
with widget.queue.session() as session:
    print(session.query(widget.queue.data).count())

500


In [7]:
from pprint import pprint

with widget.queue.session() as session:
    pprint(session.query(widget.queue.data).first().__dict__)

{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1a18b6f0b8>,
 'completed_at': datetime.datetime(2018, 11, 12, 10, 32, 32, 150793),
 'id': 1,
 'input': '{"__type__": "__np.ndarray__", "__content__": [0.0, 0.0, 5.0, 13.0, '
          '9.0, 1.0, 0.0, 0.0, 0.0, 0.0, 13.0, 15.0, 10.0, 15.0, 5.0, 0.0, '
          '0.0, 3.0, 15.0, 2.0, 0.0, 11.0, 8.0, 0.0, 0.0, 4.0, 12.0, 0.0, 0.0, '
          '8.0, 8.0, 0.0, 0.0, 5.0, 8.0, 0.0, 0.0, 9.0, 8.0, 0.0, 0.0, 4.0, '
          '11.0, 0.0, 1.0, 12.0, 7.0, 0.0, 0.0, 2.0, 14.0, 5.0, 10.0, 12.0, '
          '0.0, 0.0, 0.0, 0.0, 6.0, 13.0, 10.0, 0.0, 0.0, 0.0]}',
 'inserted_at': datetime.datetime(2018, 11, 12, 10, 32, 15, 560751),
 'output': '"0"',
 'popped_at': datetime.datetime(2018, 11, 12, 10, 32, 15, 672503),
 'priority': None,
 'worker_id': None}


As you can see, `superintendent` added 500 entries into the database. The format of this row is not necessarily important, as you can retrieve the data needed using `superintendent` itself.

## Retrieving data from the distributed widget

Any `superintendent` widget connected to the database can retrieve the labels using `widget.new_labels`:

In [8]:
pprint(widget.new_labels[:30])

['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9']
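The labels come back as strings, so they usually need converting before being used as training targets. The sketch below pairs feature rows with integer labels. It assumes that items without a label appear as `None`, which is not confirmed on this page; `labelled_pairs` is a hypothetical helper and the data are stand-ins for the digits and `widget.new_labels`.

```python
def labelled_pairs(features, labels):
    """Pair each feature row with its integer label, skipping unlabelled rows.

    Assumption: unlabelled items are represented as None.
    """
    return [
        (row, int(label))
        for row, label in zip(features, labels)
        if label is not None
    ]


# Stand-in data for illustration only.
features = [[0.0, 5.0], [1.0, 9.0], [3.0, 2.0]]
labels = ["0", None, "2"]
pairs = labelled_pairs(features, labels)
```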


## Doing active learning during distributed labelling

One of the great benefits of using the distributed submodule is that you can perform active learning, where the labelling of data and the training of the active learning model are split across different machines. You can achieve this by creating a widget object that you don't intend to use for labelling - only for orchestration of labelling by others:

In [9]:
from sklearn.linear_model import LogisticRegression

widget = SemiSupervisor(
    connection_string="sqlite:///demo.db",
    classifier=LogisticRegression(multi_class='auto', solver='lbfgs', max_iter=5000),
    reorder='margin'
)

In [None]:
widget.orchestrate()

Score: 0.96
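For intuition about `reorder='margin'`, the sketch below shows the usual margin-sampling heuristic: items whose two most likely classes have similar predicted probabilities are considered most ambiguous and are presented first. That `superintendent` computes its ordering exactly this way is an assumption; the function names and probabilities are made up for illustration.

```python
def margin(probabilities):
    """Difference between the two largest class probabilities."""
    top, runner_up = sorted(probabilities, reverse=True)[:2]
    return top - runner_up


def order_by_margin(predicted):
    """Return item indices, most ambiguous (smallest margin) first."""
    return sorted(range(len(predicted)), key=lambda i: margin(predicted[i]))


# One row of class probabilities per unlabelled item (made-up values).
predicted = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # ambiguous prediction, small margin
    [0.60, 0.30, 0.10],
]
order = order_by_margin(predicted)
```

Here the second item would be shown to a labeller first, since its top two classes are nearly tied.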
