# Labelling data with `superintendent`

Data labelling is often one of the most important stages of a machine learning project. If you start out with completely unlabelled data, you can either use unsupervised learning techniques, or label the data and then use semi-supervised or supervised techniques.

`superintendent` was built with this in mind: labelling data as part of a data science / machine learning project. Because the aim of this package is to integrate with existing workflows as much as possible, `superintendent` is designed to work in the interactive Python ecosystem: in Jupyter notebooks and JupyterLab.

`superintendent` provides the following widgets for labelling:

1. `superintendent.SemiSupervisor`, for assigning one label per data point
2. `superintendent.MultiLabeller`, for assigning multiple labels per data point
3. `superintendent.ClusterSupervisor`, for assigning one label to many points (i.e. clusters)

For detailed examples, take a look at the [examples gallery](examples/index.md).


## Getting started: labelling data points

The most common use case is probably assigning one label to every data point. You can use the `SemiSupervisor` widget for this:

In [1]:
from superintendent import SemiSupervisor

For all `superintendent` widgets, you provide the data using the `features` argument and the labels to choose from using the `options` argument:

In [21]:
widget = SemiSupervisor(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=[
        "First option",
        "Second option",
    ]
)

widget


Once you have labelled points, you can get your new labels with the `new_labels` attribute; points that have not been labelled yet are `None`:

In [22]:
widget.new_labels

['First option', 'Second option', None]
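
Because `new_labels` lines up with `features` by position, the labelled points can be separated from the unlabelled ones with plain Python. The sketch below assumes exactly that alignment and hard-codes the values from the example above instead of reading them from a live widget; `labelled` and `unlabelled` are names chosen here for illustration.

```python
# Values copied from the example above; with a live widget you would use
# the features you passed in and `widget.new_labels` instead.
features = ["First datapoint", "Second datapoint", "Third datapoint"]
new_labels = ["First option", "Second option", None]

# Keep (feature, label) pairs for points that already have a label ...
labelled = [
    (feature, label)
    for feature, label in zip(features, new_labels)
    if label is not None
]

# ... and collect the points that still need labelling.
unlabelled = [
    feature
    for feature, label in zip(features, new_labels)
    if label is None
]

print(labelled)
print(unlabelled)
```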

### Customising your input options

The options you offer to the person doing the labelling can strongly affect how your data ends up labelled. `superintendent` therefore lets you modify many aspects of the input widget.

#### `Other` text field input

By default, a text field is available for users to submit additional labels. Once submitted, these labels are presented to the user as buttons as well.

If you want to prevent people from submitting an "Other" option, you can do so with the `other_option` keyword argument:

In [3]:
SemiSupervisor(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=[
        "First option",
        "Second option",
    ],
    other_option=False,
)


#### Displaying options as buttons or dropdowns

You can provide as many options as you want for your labelling task. With too many options, however, buttons no longer display well, so by default `superintendent` switches to showing the options as a dropdown. You can control the threshold for this with the `max_buttons` argument, which is 12 by default:


In [4]:
SemiSupervisor(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=range(15),
)


In [5]:
SemiSupervisor(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=range(15),
    max_buttons=15
)

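
The switch between buttons and a dropdown can be pictured as a simple length check. The function below is a hypothetical illustration, not `superintendent`'s actual implementation: the name `choose_input_style` and the exact comparison (`<=`) are assumptions, chosen so that 15 options fit once `max_buttons` is raised to 15.

```python
def choose_input_style(options, max_buttons=12):
    """Return "buttons" if the options fit within max_buttons, else "dropdown"."""
    options = list(options)  # accept any iterable, e.g. range(15)
    return "buttons" if len(options) <= max_buttons else "dropdown"


print(choose_input_style(range(15)))                  # default threshold of 12
print(choose_input_style(range(15), max_buttons=15))  # raised threshold
```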

### Customising how your data is displayed

By default, `superintendent` displays your data the same way Jupyter notebooks do when you return it at the end of a notebook cell: using the `IPython.display.display` function. In many cases, however, you will want to customise what gets displayed. You can provide a custom display function to all `superintendent` widgets for this purpose.

For example, if you wanted the text to be displayed with a normal font, rather than a monospaced font, you could define a function to do so:

In [6]:
from IPython.display import display, Markdown

def display_text(data):
    display(Markdown(data))

In [7]:
SemiSupervisor(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    display_func=display_text
)


You can use this display function to your advantage: it can perform preprocessing. If, for example, your dataset contained data that isn't suitable for displaying visually, you could choose not to display it in your display function.

In general, you can take a look at different ways of displaying data in the [IPython documentation](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html).
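
As a sketch of the preprocessing idea, suppose each data point is a dictionary with a `title` and `text` worth showing and an `embedding` field that is not. The record layout and the names `format_record` and `display_record` are assumptions made for this example; only `display_func` comes from the section above.

```python
def format_record(record, max_chars=200):
    """Build the Markdown text to show for one record, hiding non-visual fields."""
    text = record["text"]
    if len(text) > max_chars:
        # Truncate very long texts so the widget stays readable.
        text = text[:max_chars].rstrip() + "..."
    return f"**{record['title']}**\n\n{text}"


def display_record(record):
    """Display function that could be passed as `display_func`."""
    # Imported here so that format_record can be used without IPython.
    from IPython.display import Markdown, display

    display(Markdown(format_record(record)))


example = {
    "title": "First datapoint",
    "text": "Some text to label.",
    "embedding": [0.1, 0.2, 0.3],  # present in the data, never displayed
}
print(format_record(example))
```

Assuming your features are a list of such dictionaries, you would then pass `display_func=display_record` when creating the widget.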

## Assigning more than one label per data point

When classifying text, for example, a data point often falls into more than one category. In machine learning, this is often referred to as a "multi-output" (or multi-label) problem: you would like to eventually build models that can assign more than one label to each data point.

In `superintendent`, this is achieved with the `MultiLabeller` widget, which works like the `SemiSupervisor` widget except that it lets you assign multiple options to each data point.


In [8]:
from superintendent import MultiLabeller

MultiLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=[
        "Option 1",
        "Option 2",
        "Option 3",
        "Option 4",
    ]
)


As with the single-label widget, when there are too many options to display neatly as buttons, the widget switches to a multi-select input:

In [9]:
from superintendent import MultiLabeller

MultiLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=[
        "Option 1",
        "Option 2",
        "Option 3",
        "Option 4",
    ],
    max_buttons=2
)

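
For the kind of model mentioned at the start of this section, multiple labels per data point are usually encoded as a binary indicator matrix with one column per option. The sketch below assumes the collected labels are a list with one list of option strings per data point (and `None` for unlabelled points); that output format is an assumption, and the values are made up for illustration.

```python
options = ["Option 1", "Option 2", "Option 3", "Option 4"]

# Hypothetical labels for three data points; the third is still unlabelled.
collected = [["Option 1", "Option 3"], ["Option 2"], None]


def to_indicator_rows(labels, options):
    """Turn per-point label lists into 0/1 rows, skipping unlabelled points."""
    rows = []
    for point_labels in labels:
        if point_labels is None:
            continue  # nothing to encode yet
        rows.append([1 if option in point_labels else 0 for option in options])
    return rows


indicator = to_indicator_rows(collected, options)
print(indicator)
```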

## Labelling clusters of points

Often, when you do unsupervised machine learning, you end up with many data points, each of which has been assigned a "cluster label". These cluster labels tend to be plain integers, because the clustering algorithm has no insight into what each cluster means. You therefore often want to go through each cluster and label it yourself.

To do this, `superintendent` provides a `ClusterSupervisor` widget:

In [10]:
from superintendent import ClusterSupervisor

To demonstrate, I am going to collect headlines from two major UK newspaper websites (the code for this comes from the amazing GitHub project [compare-headlines](https://github.com/isobelweinberg/compare-headlines/blob/master/scrape-headlines.ipynb) by [isobelweinberg](https://github.com/isobelweinberg)). For simplicity, each headline is then assigned to one of three clusters at random rather than by a real clustering algorithm.

In [15]:
import numpy as np
import requests
from bs4 import BeautifulSoup

headlines = []
labels = []

# Fetch the Guardian front page and keep the first 10 headlines
r = requests.get('https://www.theguardian.com/uk').text
soup = BeautifulSoup(r, 'html5lib')
headlines += [headline.text for headline in
              soup.find_all('span', class_='js-headline-text')][:10]
labels += ['guardian'] * (len(headlines) - len(labels))

# Do the same for the Daily Mail, cleaning up whitespace in each headline
r = requests.get('http://www.dailymail.co.uk/home/index.html').text
soup = BeautifulSoup(r, 'html5lib')
headlines += [headline.text.replace('\n', '').replace('\xa0', '').strip()
              for headline in soup.find_all(class_="linkro-darkred")][:10]
labels += ['daily mail'] * (len(headlines) - len(labels))

# Assign each headline to one of three clusters at random
# (a stand-in for a real clustering step)
cluster_labels = np.random.choice([1, 2, 3], size=len(headlines))

I am also going to use a custom display function to label the clusters, as I'd like to see the words defining each cluster in a word cloud:

In [19]:
from wordcloud import WordCloud
from IPython.display import display

def show_wordcloud(text, n_samples=None):
    text = ' '.join(text)
    display(
        WordCloud().generate(text).to_image()
    )
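
If the `wordcloud` package is not available, a text-only display function can give a similar overview. The version below is a sketch using only the standard library; `top_words` and `show_top_words` are hypothetical names, and it assumes, like `show_wordcloud`, that the function receives the list of texts belonging to one cluster.

```python
import re
from collections import Counter


def top_words(texts, n=5):
    """Return the n most common lower-cased words across a list of texts."""
    words = re.findall(r"[a-z']+", " ".join(texts).lower())
    return Counter(words).most_common(n)


def show_top_words(texts, n_samples=None):
    """Print the most frequent words of a cluster, one per line."""
    for word, count in top_words(texts):
        print(f"{word}: {count}")


show_top_words(["Rain expected today", "More rain expected", "Rain again"])
```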


In [20]:
labelling_widget = ClusterSupervisor(
    features=headlines,
    cluster_indices=cluster_labels,
    display_func=show_wordcloud,
)

labelling_widget


You can retrieve the mappings from cluster index to cluster label from the `new_clusters` attribute:

In [25]:
labelling_widget.new_clusters

{1: 'Cluster label 1', 3: 'Cluster label 2'}

And you can retrieve the cluster label for each data point using the `new_labels` attribute:

In [26]:
labelling_widget.new_labels

['Cluster label 1',
 'Cluster label 1',
 'Cluster label 1',
 'Cluster label 1',
 'Cluster label 1',
 'Cluster label 1',
 'Cluster label 1',
 'Cluster label 1',
 'Cluster label 1',
 'Cluster label 2',
 'Cluster label 2',
 'Cluster label 2',
 'Cluster label 2',
 'Cluster label 2',
 'Cluster label 2',
 'Cluster label 2',
 None,
 None,
 None,
 None]
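
The relationship between the two attributes can be pictured as a dictionary lookup: each point's cluster index is mapped to that cluster's label, and points in clusters without a label get `None`. This is a sketch of the idea rather than the widget's own code, and the cluster indices below are made up for illustration.

```python
# Mapping from cluster index to label, as in the `new_clusters` output above.
new_clusters = {1: "Cluster label 1", 3: "Cluster label 2"}

# Hypothetical cluster index for each of six data points.
cluster_indices = [1, 3, 2, 1, 3, 2]

# dict.get returns None for clusters that have not been labelled (here: 2).
point_labels = [new_clusters.get(index) for index in cluster_indices]
print(point_labels)
```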