## Labelling clusters of points

Unsupervised machine learning often leaves you with many data points, each of which has been assigned a "cluster label". These cluster labels tend to be plain integers, because the clustering algorithm has no insight into what makes a cluster a cluster. This means you often want to go through each cluster and give it a meaningful label yourself.

To do this, `superintendent` provides a `ClusterLabeller` widget:

In [2]:
from superintendent import ClusterSupervisor

ImportError: cannot import name 'ClusterSupervisor' from 'superintendent' (/Users/jan/personal-projects/oss/superintendent/.venv/lib/python3.7/site-packages/superintendent/__init__.py)

To demonstrate, I am going to collect headlines from two major UK newspaper websites (the code for this comes from the fantastic GitHub project [compare-headlines](https://github.com/isobelweinberg/compare-headlines/blob/master/scrape-headlines.ipynb) by [isobelweinberg](https://github.com/isobelweinberg)).

In [None]:
import requests
from bs4 import BeautifulSoup
import datetime
import numpy as np

headlines = []
labels = []

r = requests.get('https://www.theguardian.com/uk').text  # fetch the page HTML
soup = BeautifulSoup(r, 'html5lib')  # parse the HTML with Beautiful Soup
headlines += [headline.text for headline in
              soup.find_all('span', class_='js-headline-text')][:10]
labels += ['guardian'] * (len(headlines) - len(labels))

soup = BeautifulSoup(requests.get('http://www.dailymail.co.uk/home/index.html').text, 'html5lib')
headlines += [headline.text.replace('\n', '').replace('\xa0', '').strip()
              for headline in soup.find_all(class_="linkro-darkred")][:10]
labels += ['daily mail'] * (len(headlines) - len(labels))

# Random placeholder cluster assignments; no real clustering is run here
cluster_labels = np.random.choice([1, 2, 3], size=len(headlines))
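
To make the relationship between the headlines and their integer cluster labels concrete, here is a minimal sketch that groups items by cluster id. It uses only the standard library and small made-up lists (`example_headlines`, `example_cluster_labels`) in place of the scraped data; the helper `group_by_cluster` is my own illustration and not part of `superintendent`.

```python
from collections import defaultdict

# Hypothetical stand-in data; in the notebook these would be
# `headlines` and `cluster_labels`.
example_headlines = [
    "Budget announced",
    "Cup final tonight",
    "Tax changes explained",
    "Striker signs deal",
]
example_cluster_labels = [1, 2, 1, 2]


def group_by_cluster(items, cluster_ids):
    """Collect items into a dict keyed by their integer cluster id."""
    groups = defaultdict(list)
    for item, cluster_id in zip(items, cluster_ids):
        groups[int(cluster_id)].append(item)
    return dict(groups)


grouped = group_by_cluster(example_headlines, example_cluster_labels)
for cluster_id, members in sorted(grouped.items()):
    print(cluster_id, members)
```

Each group is what you would look at when deciding on a name for that cluster.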

I am also going to use a custom display function to label the clusters, as I'd like to see the words that define each cluster in a word cloud:

In [None]:
from wordcloud import WordCloud
from IPython.display import display

def show_wordcloud(text, n_samples=None):
    text = ' '.join(text)
    display(
        WordCloud().generate(text).to_image()
    )
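
A word cloud is essentially a picture of word frequencies. As a rough text-only sketch of the same idea, the snippet below counts the most common words across a list of strings. The `top_words` helper and the tiny `STOP_WORDS` set are assumptions made for illustration; they are not what `WordCloud` does internally.

```python
import re
from collections import Counter

# Illustrative stop-word list, not the one WordCloud uses.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "for", "on"}


def top_words(texts, n=3):
    """Return the n most frequent non-stop-words across a list of strings."""
    words = re.findall(r"[a-z']+", " ".join(texts).lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


sample = [
    "Tax rise for the budget",
    "Budget cuts to tax credits",
    "The budget in full",
]
result = top_words(sample, n=2)
print(result)
```

This can be handy when you want a quick look at a cluster without rendering an image.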


In [None]:
labelling_widget = ClusterSupervisor(
    features=headlines,
    cluster_indices=cluster_labels,
    display_func=show_wordcloud
)

labelling_widget

In [None]:
labelling_widget.new_clusters

In [None]:
labelling_widget.new_labels
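
Conceptually, labelling clusters means replacing each integer cluster id with a name a human has chosen, and then carrying that name over to every data point in the cluster. The sketch below shows that mapping with plain Python. The ids, the names in `cluster_names`, and the helper `apply_cluster_names` are all hypothetical; this is not how `superintendent` stores or computes its results.

```python
# Hypothetical inputs: one integer cluster id per data point,
# and the names a human chose for each cluster.
cluster_ids = [1, 3, 2, 1, 3]
cluster_names = {1: "politics", 2: "sport", 3: "celebrity"}


def apply_cluster_names(ids, names, default=None):
    """Map each point's cluster id to its name; unnamed clusters get `default`."""
    return [names.get(int(cluster_id), default) for cluster_id in ids]


point_labels = apply_cluster_names(cluster_ids, cluster_names)
print(point_labels)
```

Points in a cluster that has not been named yet come back as `None`, which makes unfinished labelling easy to spot.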