# Labelling data with `superintendent`

Data labelling is often one of the most important stages of a machine learning project. If you start out with completely unlabelled data, you can either use unsupervised learning techniques, or label the data and then use semi-supervised or supervised techniques.

`superintendent` was built with this in mind: labelling data as part of a data science or machine learning project. Because the package aims to integrate with existing workflows as much as possible, it is designed to work in the interactive Python ecosystem: in Jupyter notebooks and JupyterLab.

The idea behind putting data labelling right in the notebook is that it lets users quickly prototype machine learning models with self-created labels, without leaving their working environment.

Since jupyter notebooks run in a web browser, all widgets inside superintendent can also be deployed as web applications, using [the notebook-to-website tool `voila`](https://github.com/voila-dashboards/voila).

Superintendent provides the following widgets for general-purpose class labelling:

1. `superintendent.ClassLabeller`, for assigning one label per data point
2. `superintendent.MultiClassLabeller`, for assigning multiple labels per data point
<!-- 3. `superintendent.ClusterSupervisor`, for assigning one label to many points (i.e. clusters) -->

Both of these classes support active learning, meaning you can use a machine learning model to reduce the number of labels you have to create before you reach a good level of accuracy.
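To make the idea of active learning concrete, here is a minimal sketch of one common strategy, uncertainty sampling: data points whose predicted class probabilities have the highest entropy are shown to the labeller first. This is an illustration in plain Python, not `superintendent`'s own implementation; the helper names `entropy` and `rank_by_uncertainty` and the example probabilities are made up for this sketch.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def rank_by_uncertainty(predicted_probabilities):
    """Return data point indices, most uncertain prediction first."""
    scores = [entropy(p) for p in predicted_probabilities]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical model output for three unlabelled data points and two classes.
predicted = [
    [0.95, 0.05],  # the model is confident
    [0.50, 0.50],  # the model is maximally uncertain
    [0.70, 0.30],  # somewhere in between
]

order = rank_by_uncertainty(predicted)
print(order)  # [1, 2, 0]
```

Labelling the most uncertain points first tends to give the model the most informative examples early on, which is why fewer labels may be needed overall.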

For detailed examples, take a look at the [examples gallery](examples/index.md).


## Getting started: labelling data points

A common use case is this: you want to assign one label to every data point. You can use the `ClassLabeller` widget for this:

In [1]:
from superintendent import ClassLabeller

For all superintendent widgets, you provide data using the `features` argument:

In [2]:
widget = ClassLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=[
        "First option",
        "Second option",
    ]
)

To display a superintendent widget, put it as the last statement in a Jupyter notebook cell.

In [3]:
widget



Once you have labelled points, you can get your new labels with the `new_labels` attribute:

In [4]:
widget.new_labels

['First option', 'Second option', None]
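In the output above, the third data point shows up as `None`. Assuming `None` marks data points that have not been labelled yet, a short sketch like the following separates labelled from unlabelled data before training a model. The `new_labels` list here is a stand-in for `widget.new_labels`, so the snippet runs without the widget.

```python
features = ["First datapoint", "Second datapoint", "Third datapoint"]

# Stand-in for `widget.new_labels`, copied from the output shown above.
new_labels = ["First option", "Second option", None]

# Keep only the (data point, label) pairs that actually have a label.
labelled = [(x, y) for x, y in zip(features, new_labels) if y is not None]

# Collect the data points that still need a label.
unlabelled = [x for x, y in zip(features, new_labels) if y is None]

print(labelled)
print(unlabelled)
```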

### Customising your input options

What options you provide to the person doing the labelling can have big effects on how your data will be labelled. `superintendent` therefore allows you to modify a lot of aspects of the "Input widget".

#### `Other` text field input

By default, a text field is available for users to submit additional labels. Once submitted, these labels are presented to the user as buttons as well.

If you want to prevent people from submitting an "Other" option, you can do so with the `allow_freetext` keyword argument:

In [5]:
ClassLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=[
        "First option",
        "Second option",
    ],
    allow_freetext=False,
)


#### Displaying options as buttons or dropdowns

You can provide as many options as you want for your labelling task. However, with too many options, buttons no longer display well, so by default `superintendent` switches to a dropdown once the number of options becomes large.


In [6]:
ClassLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=range(15),
)


### Customising how your data is displayed

By default, superintendent displays your data in the same way Jupyter notebooks display it when you return it at the end of a notebook cell: using the `IPython.display.display` function.

However, in many cases, you will want to customise the data that gets displayed. You can provide a custom display function to all superintendent widgets for this purpose.

For example, if you wanted the text to be displayed with a normal font, rather than a monospaced font, you could define a function to do so:

In [7]:
from IPython.display import display, Markdown

def display_text(data):
    display(Markdown(data))

In [8]:
ClassLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    display_func=display_text
)


You can also perform custom pre-processing by passing a function as the `display_preprocess` argument.

In [9]:
ClassLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    display_func=display_text,
    display_preprocess=lambda x: x.lower()
)


The display function can be anything that shows something in a jupyter notebook. For more ways of displaying data, you can take a look at the [IPython documentation](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html).
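As a further illustration of the `display_preprocess` argument described above, the sketch below defines a pre-processing function that collapses whitespace and truncates long text before it is shown. The function name `shorten_text` and the 60-character limit are arbitrary choices for this example, and it assumes the pre-processing function receives one data point at a time, as the `lambda` in the example above does.

```python
import textwrap


def shorten_text(text, max_chars=60):
    """Collapse runs of whitespace and truncate long text for display."""
    return textwrap.shorten(text, width=max_chars, placeholder="...")


# A deliberately long, messy data point.
long_text = "This   is a rather long data point   " * 5

short = shorten_text(long_text)
print(short)
```

Under that assumption, such a function could be passed to a widget as `display_preprocess=shorten_text`.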

## Assigning more than one label per data point

If you are, say, classifying text, a data point often falls into more than one category. In machine learning, this is often referred to as a "multi-output" problem: you'd like to eventually build models that can assign more than one label to each data point.

In `superintendent`, this is achieved with the `MultiClassLabeller` widget, which works like the `ClassLabeller` widget, except that it allows you to label each data point with multiple options.


In [11]:
from superintendent import MultiClassLabeller

widget = MultiClassLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=[
        "Option 1",
        "Option 2",
        "Option 3",
        "Option 4",
    ]
)

widget


In this case, the labels are a list of lists, with one inner list per labelled data point:

In [12]:
widget.new_labels

[['Option 1', 'Option 2'], ['Option 3'], None]
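Many machine learning libraries expect multi-label targets as a binary indicator matrix rather than a list of lists. The following plain-Python sketch converts the output shown above into that form, skipping data points that are still `None`. The helper name `to_indicator_rows` is hypothetical, and `new_labels` is a stand-in for `widget.new_labels`.

```python
options = ["Option 1", "Option 2", "Option 3", "Option 4"]

# Stand-in for `widget.new_labels`, copied from the output shown above.
new_labels = [["Option 1", "Option 2"], ["Option 3"], None]


def to_indicator_rows(labels, options):
    """Return (kept indices, one 0/1 row per labelled data point)."""
    kept, rows = [], []
    for index, point_labels in enumerate(labels):
        if point_labels is None:
            continue  # skip data points without labels
        kept.append(index)
        rows.append([int(option in point_labels) for option in options])
    return kept, rows


kept, indicator = to_indicator_rows(new_labels, options)
print(kept)       # [0, 1]
print(indicator)  # [[1, 1, 0, 0], [0, 0, 1, 0]]
```

The `kept` indices make it possible to select the matching rows from the original features.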

Similar to the single-label widget, when there are too many options to display neatly as buttons, the widget will change to a multi-select input:

In [13]:
from superintendent import MultiClassLabeller

MultiClassLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=[
        "Option {}".format(i)
        for i in range(15)
    ]
)
