### Active labeling

Here I lightly document my approach to standing up active labeling. The D4AD case, with ongoing variation in abbreviation usage, skills, tools, and prereqs, is a good fit for active labeling: the number of classes grows and robustness "guarantees" are desired.

This captures a first approach. I suspect the best backend is Vowpal Wabbit combined with Superintendent, run within a Python notebook and captured in a simple docker-compose file that includes model saving and reloading. That's a fair bit of work.

Given the time constraints, I will try only Superintendent and save labeled cases to disk as a lookup table. The models will be saved and handled manually.
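A minimal sketch of the lookup-table idea, using pandas only. The helper names (`build_lookup`, `save_lookup`, `load_lookup`) and the `content`/`label` column names are hypothetical, and it assumes the labels come back in the same order as the texts that were shown (I have not verified that against Superintendent's docs). Skipped items (`None`) are dropped, and the CSV is read back with `keep_default_na=False` so that a literal label such as `'None'` is not silently turned into a missing value.

```python
import os
import tempfile

import pandas as pd


def build_lookup(texts, labels):
    """Pair each text with its label; drop skipped (None) items and duplicate texts."""
    table = pd.DataFrame({"content": list(texts), "label": list(labels)})
    table = table.dropna(subset=["label"])
    # If the same text was labelled twice, keep the most recent label.
    return table.drop_duplicates(subset=["content"], keep="last")


def save_lookup(table, path):
    """Write the lookup table to disk as a plain CSV."""
    table.to_csv(path, index=False)


def load_lookup(path):
    """Read the CSV back as a {text: label} dict, keeping strings like 'None' as-is."""
    table = pd.read_csv(path, dtype=str, keep_default_na=False)
    return table.set_index("content")["label"].to_dict()


# Toy usage with made-up prereq strings (not real D4AD rows).
example_texts = ["HS diploma", "GED req.", "see website"]
example_labels = ["High School", "GED", None]  # None = skipped in the widget

lookup_path = os.path.join(tempfile.mkdtemp(), "prereq_lookup.csv")
save_lookup(build_lookup(example_texts, example_labels), lookup_path)
lookup = load_lookup(lookup_path)
```

Keying on the raw text keeps the table usable without the model: any unseen string simply misses the lookup and can be queued for labeling.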


Superintendent needs: 

`jupyter nbextension enable --py --sys-prefix ipyevents`

for keybindings (not sure how this is going to work in my VS Code window). This is needed for every labelling run.

I think it also needs a database, but we can use the default SQL database that it launches for now.
For distributed labeling using docker-compose, see: https://superintendent.readthedocs.io/en/latest/examples/docker-compose/index.html

In [4]:
from superintendent import ClassLabeller

widget = ClassLabeller(
    features=[
        "First datapoint",
        "Second datapoint",
        "Third datapoint",
    ],
    options=[
        "First option",
        "Second option",
    ]
)
widget

# I think I need to be in the Jupyter notebook to see this

ClassLabeller(children=(HBox(children=(FloatProgress(value=0.0, description='Progress:', max=1.0),)), Box(chil…

In [5]:
widget.new_labels

['Second option', 'First option', 'Second option']

In [7]:
# let's try the stock example from https://superintendent.readthedocs.io/en/latest/examples/labelling-images-actively.html

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from superintendent import ClassLabeller

digits = load_digits()

data_labeller = ClassLabeller.from_images(
    canvas_size=(200, 200),
    features=digits.data[:500, :],
    model=LogisticRegression(solver="lbfgs", multi_class="multinomial", max_iter=5000),
    options=range(10),
    acquisition_function='entropy',
    display_preprocess=lambda x: x.reshape(8, 8)
)

data_labeller

ClassLabeller(children=(HBox(children=(HBox(children=(FloatProgress(value=0.0, description='Progress:', max=1.…

Okay, so the labeller works: it does what it says and it does capture data. The score went up a bit and then dropped quite a bit as labeling continued, but the examples also became extremely ambiguous.
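The increasingly ambiguous examples are roughly what uncertainty sampling is meant to produce: acquisition functions such as `'entropy'` and `'margin'` put the items the model is least sure about at the front of the queue. The sketch below uses the generic textbook definitions in NumPy; it is not Superintendent's internal code, and the function names and probability values are made up for illustration.

```python
import numpy as np


def entropy_scores(probabilities):
    """Shannon entropy per row; higher means the model is less certain."""
    p = np.clip(probabilities, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)


def margin_scores(probabilities):
    """Gap between the top two class probabilities; smaller means less certain."""
    ordered = np.sort(probabilities, axis=1)
    return ordered[:, -1] - ordered[:, -2]


# Three hypothetical predictions over three classes.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.35, 0.25],  # very unsure
    [0.70, 0.20, 0.10],  # somewhat unsure
])

entropy_order = np.argsort(-entropy_scores(probs))  # most uncertain first
margin_order = np.argsort(margin_scores(probs))     # smallest margin first
```

Both orderings put the second row first, so the labeller would be shown the hardest case before the easy ones.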

Even in the Jupyter notebook the hotkeys didn't work when I tried them, although I might have enabled the extension for the VS Code environment instead of the Jupyter server environment. Either way, it would probably work in the docker-compose environment.

In [15]:
import pandas as pd 

# Do a quick test run on prereqs
rootpath = "/hdd/work/d4ad_standardization/"
interimpath = "D4AD_Standardization/data/interim/"
filepath = "0_prereqs.csv"

df = pd.read_csv(rootpath+interimpath+filepath)

In [31]:
small_df = df.sample(n=100, random_state=42)
small_df.dropna(subset=["content"], inplace=True) # there are NA/NaN floats in the data

In [32]:
# this is crude but first we set up the pipeline, following
# https://superintendent.readthedocs.io/en/latest/examples/preprocessing-data.html
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from superintendent import ClassLabeller
from IPython.display import display, Markdown

pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1,2))),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
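One thing worth checking with this pipeline: `SGDClassifier()` defaults to hinge loss, which does not provide `predict_proba`. Acquisition functions like `'margin'` rank by class probabilities, so if Superintendent calls `predict_proba` on the model (an assumption I have not verified against its source), a probabilistic loss may be needed. The sketch below is a variant of the same pipeline with `loss="modified_huber"`, which does expose `predict_proba`; the toy strings and the name `proba_pipeline` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up prereq-like strings, just to smoke-test the pipeline.
toy_texts = ["HS diploma", "GED required", "high school diploma", "GED or equivalent"]
toy_labels = ["High School", "GED", "High School", "GED"]

proba_pipeline = Pipeline([
    ("vect", CountVectorizer(analyzer="char", ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    # modified_huber is one of the SGD losses that supports predict_proba
    ("clf", SGDClassifier(loss="modified_huber", random_state=0)),
])

proba_pipeline.fit(toy_texts, toy_labels)
probs = proba_pipeline.predict_proba(["GED"])  # one row, one column per class
```

The character n-gram features are unchanged; only the classifier's loss differs from the pipeline above.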


In [34]:
# ... Then we set up a display function so we know what we are labelling,
# ... and a preprocessor that hands the text column to the model,
# ... and let the actual transformation happen in the pipeline

def display_func(row):
    """
    The display function gets passed your data - in the
    case of a dataframe, it gets passed a row - and then
    has to "display" your data in whatever way you want.

    It doesn't need to return anything
    """
    display(Markdown(row["content"]))
    #display(Markdown("**At:** " + row["timestamp"]))

def preprocessor(x, y):
    # only pass the "content" column to the model, drop everything else
    return x["content"], y


labelling_widget = ClassLabeller(
    features=small_df,
    model=pipeline,
    model_preprocess=preprocessor,
    display_func=display_func,
    options=['professional', 'not professional'],
    acquisition_function='margin'
)

labelling_widget

ClassLabeller(children=(HBox(children=(HBox(children=(FloatProgress(value=0.0, description='Progress:', max=1.…

In [38]:
labelling_widget.new_labels

['not professional',
 'professional',
 'not professional',
 'professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 'not professional',
 None,
 'not professional',
 'not professional',
 'not professional',
 'professional',
 'professional',
 'professional',
 'professional',
 'professional',
 'GED',
 'Writing',
 'High School',
 None,
 'Driving',
 'High School',
 'Ability to Benefit',
 'GED',
 None,
 'GED',
 None,
 None,
 'Middle School',
 'High School',
 'Vaccinate',
 'GED',
 'Some College',
 'None',
 'Adult age',
 'None',
 'Familar with mathematics',
 'None',
 'High School',
 'High School',
 'Familar with mathematics',
 'GED',
 None,
 'High School',
 'GED',
 'GED',
 'GED',
 'Submit.',
 'High School',
 'GED',
 'Submi