# Preprocessing data before passing it to the model

You will often want to pass different parts of your data to your display function and to
your model. In general, superintendent does not provide "pre-model" hooks. Instead, any
pre-processing that is specific to your display function or your model can be specified in
the `display_func` or in a
[scikit-learn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
object.

For example, imagine you have a pandas dataframe of emails, which contains some metadata that
you would like to display during labelling. However, you want to build a model that is agnostic
to that information.

## Data

First, let's create some dummy data:


In [1]:
import pandas as pd

example_emails = [
    "Hi John,\nthis is just to say nice work yesterday.\nBest,\nJim",
    "Hi Mike,\nthis is just to say terrible work yesterday.\nBest,\nJim",
]

example_recipients = ["John", "Mike"]

example_timestamps = ["2018-02-01 15:00", "2018-02-01 15:03"]

example_df = pd.DataFrame({
    'email': example_emails,
    'recipient': example_recipients,
    'timestamp': example_timestamps
})

display(example_df)

| | email | recipient | timestamp |
|---|---|---|---|
| 0 | Hi John,\nthis is just to say nice work yester... | John | 2018-02-01 15:00 |
| 1 | Hi Mike,\nthis is just to say terrible work ye... | Mike | 2018-02-01 15:03 |


## Display function

In the display function, we'll re-format the data.

In [2]:
from IPython.display import display, Markdown

def display_email(row):
    """
    The display function is passed one item of your data (for a
    dataframe, a single row) and has to "display" it in whatever
    way you want.

    It does not need to return anything.
    """
    display(Markdown("**To:** " + row["recipient"]))
    display(Markdown("**At:** " + row["timestamp"]))

    display(Markdown(row["email"].replace("\n", "\n\n")))
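
The formatting logic can also be kept in a small pure function, which makes it easy to check outside a notebook. The sketch below is an assumption rather than part of superintendent: `format_email` is a hypothetical helper that builds the same Markdown strings as `display_email`, and a plain dict stands in for a dataframe row.

```python
def format_email(row):
    """Build the Markdown strings for one email (hypothetical helper)."""
    header = [
        "**To:** " + row["recipient"],
        "**At:** " + row["timestamp"],
    ]
    # Markdown treats a single newline as a soft line break, so double
    # each newline to render every line of the email as its own paragraph.
    body = row["email"].replace("\n", "\n\n")
    return header + [body]


# A plain dict stands in for a dataframe row here.
row = {
    "email": "Hi John,\nnice work yesterday.\nBest,\nJim",
    "recipient": "John",
    "timestamp": "2018-02-01 15:00",
}
parts = format_email(row)
```

Inside `display_email`, each of the returned strings could then be passed to `display(Markdown(...))`.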

## Model Pipeline

We only want to pass the email text to our model. To achieve this, we can write a small pre-processing function that is applied to **both** the features and the labels whenever the model is fit.

We can then write a model that uses scikit-learn's TF-IDF vectorizer (`TfidfVectorizer`) followed by a logistic regression.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def preprocessor(x, y):
    # keep only the "email" column of the features; pass the labels through unchanged
    return x["email"], y


model = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer()),
    ('logistic_regression', LogisticRegression())
])
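
Before wiring these pieces into the widget, the pre-processing function and the pipeline can be checked by hand. This is a minimal sketch, assuming the pre-processor is simply called on the features and labels before fitting (how superintendent invokes it internally is not shown here). The four-row dataframe and its labels are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def preprocessor(x, y):
    # keep only the "email" column; labels pass through unchanged
    return x["email"], y


# Made-up data for illustration only.
df = pd.DataFrame({
    "email": [
        "nice work yesterday",
        "terrible work yesterday",
        "really nice job",
        "a terrible result",
    ],
    "recipient": ["John", "Mike", "John", "Mike"],
})
labels = ["positive", "negative", "positive", "negative"]

model = Pipeline([
    ("tfidf_vectorizer", TfidfVectorizer()),
    ("logistic_regression", LogisticRegression()),
])

# Apply the pre-processing by hand, then fit on the text alone.
text, y = preprocessor(df, labels)
model.fit(text, y)

# The model only ever sees the email text, so prediction also needs
# the "email" column rather than the full dataframe.
predictions = model.predict(df["email"])
```

Because the pipeline is fit on text only, metadata such as the recipient never reaches the vectorizer.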

## The widget

Now that we have assembled the necessary components, we can create our widget:

In [4]:
from superintendent import ClassLabeller

widget = ClassLabeller(
    features=example_df,
    model=model,
    model_preprocess=preprocessor,
    display_func=display_email,
    options=['positive', 'negative'],
    acquisition_function='margin'
)

widget

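
The column selection can also live inside the scikit-learn pipeline itself, which is the other option mentioned at the top of this page. The following is a sketch of that alternative, not something the example above relies on: it assumes scikit-learn's `ColumnTransformer` is used to route only the `email` column to the vectorizer, so that the pipeline can be fit on the full dataframe. The data, labels, and the step name `select_email` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up data for illustration only.
df = pd.DataFrame({
    "email": [
        "nice work yesterday",
        "terrible work yesterday",
        "really nice job",
        "a terrible result",
    ],
    "recipient": ["John", "Mike", "John", "Mike"],
    "timestamp": ["2018-02-01 15:00"] * 4,
})
labels = ["positive", "negative", "positive", "negative"]

select_email = ColumnTransformer(
    # A plain string (not a list) hands the column to the transformer as
    # a 1D series of documents, which is what TfidfVectorizer expects.
    [("tfidf_vectorizer", TfidfVectorizer(), "email")],
    remainder="drop",  # recipient and timestamp never reach the model
)

pipeline_model = Pipeline([
    ("select_email", select_email),
    ("logistic_regression", LogisticRegression()),
])

# The pipeline accepts the whole dataframe and picks out the text itself.
pipeline_model.fit(df, labels)
predictions = pipeline_model.predict(df)
```

With this design the model object carries its own column selection, at the cost of tying the pipeline to the dataframe's column names.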