# Kili Tutorial: AutoML for faster labeling with Kili Technology
In this tutorial, we will show how to use [automated machine learning](https://en.wikipedia.org/wiki/Automated_machine_learning) (AutoML) to accelerate labeling in Kili Technology. We will apply it to text classification: given a tweet, we want to classify whether it is about a real disaster or not (as introduced in the [Kaggle NLP starter kit](https://www.kaggle.com/c/nlp-getting-started)).

Additionally, for an overview of Kili, visit kili-technology.com. You can also check out the Kili documentation at https://kili-technology.github.io/kili-docs.

The tutorial is divided into three parts:

1. AutoML
2. Integrate AutoML scikit-learn pipelines
3. Automating labeling in Kili Technology

## 1. AutoML
Automated machine learning (AutoML) is the process of automating both the choice and the training of a machine learning algorithm, including the optimization of its hyperparameters.

Many AutoML frameworks already exist:

- [H2O](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) provides an AutoML solution with both Python and R bindings
- [autosklearn](https://automl.github.io/auto-sklearn/master/) can be used for SKLearn pipelines
- [TPOT](http://epistasislab.github.io/tpot) uses genetic algorithms to automatically tune your algorithms
- [fasttext](https://fasttext.cc) has [its own AutoML module](https://fasttext.cc/docs/en/autotune.html) to find the best hyperparameters

We will cover the use of `autosklearn` for automated text classification. `autosklearn` explores the hyperparameter space defined by SKLearn, much as a human would when [tuning manually](https://scikit-learn.org/stable/modules/grid_search.html). Jobs can be run in parallel to speed up the exploration. To select the next set of hyperparameters to test at each step, `autosklearn` can use either [SMAC](http://ml.informatik.uni-freiburg.de/papers/11-LION5-SMAC.pdf) (Sequential Model-based Algorithm Configuration) or [random search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf).
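To make the difference between exhaustive and random exploration concrete, here is a minimal sketch in plain Python. It is not how `autosklearn` is implemented: `validation_score` is a hypothetical stand-in for "train a model with these hyperparameters and measure its validation accuracy", and the grid values are made up for illustration.

```python
import itertools
import random


def validation_score(params):
    """Toy stand-in for training a model and returning its validation accuracy."""
    # Hypothetical objective that peaks at C=1.0 and max_depth=5
    return 1.0 - 0.1 * abs(params["C"] - 1.0) - 0.02 * abs(params["max_depth"] - 5)


# Hypothetical hyperparameter grid
GRID = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "max_depth": [3, 5, 7, 9],
}


def grid_search(grid):
    """Evaluate every combination of the grid and keep the best one."""
    names = sorted(grid)
    candidates = [
        dict(zip(names, values))
        for values in itertools.product(*(grid[name] for name in names))
    ]
    return max(candidates, key=validation_score)


def random_search(grid, n_trials, seed=0):
    """Evaluate only n_trials randomly drawn combinations and keep the best one."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_trials)
    ]
    return max(candidates, key=validation_score)


best_grid = grid_search(GRID)
best_random = random_search(GRID, n_trials=5)
print("grid search:  ", best_grid, validation_score(best_grid))
print("random search:", best_random, validation_score(best_random))
```

Grid search evaluates all 16 combinations, while random search evaluates only 5, so it is cheaper but may miss the best configuration. Model-based methods such as SMAC instead use the scores already observed to decide which configuration to try next.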

Once AutoML has chosen and trained a classifier, we can use this classifier to make predictions. These predictions can then be inserted into Kili Technology, so that labelers see them before they start labeling. For complex tasks, this can considerably speed up labeling.

For instance, when annotating voice for [automatic speech recognition](https://en.wikipedia.org/wiki/Speech_recognition), using a model that pre-annotates by transcribing speech more than doubles annotation productivity:

<img src="./img/efficiency_comparison_with_without_model.png" alt="Drawing" style="width: 500px;"/>

## 2. Integrate AutoML scikit-learn pipelines

Specifically for text classification, the following pipeline retrieves labeled and unlabeled data from Kili, builds a classifier using AutoML, and then feeds its predictions back into Kili:

<img src="./img/automl_pipeline.png" alt="Drawing" style="width: 900px;"/>

After retrieving data, [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) turns each text into a vector of term weights. Words that appear in most documents (such as `the`, `a`, etc.) receive low weights, which makes the most informative features stand out. These pre-processed features are then fed to a classifier.
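The weighting idea can be sketched with the textbook formula `tf * log(N / df)` in plain Python. This is an illustration only: scikit-learn's `TfidfVectorizer` uses a smoothed variant of the formula and normalizes each vector by default, so its numbers differ. The three example tweets are made up.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up tweets for illustration
tweets = [
    "the forest fire is spreading",
    "the concert was fire",
    "the earthquake damaged the bridge",
]
weights = tfidf(tweets)
for tweet, scores in zip(tweets, weights):
    print(tweet, "->", {term: round(weight, 3) for term, weight in scores.items()})
```

With this formula, `the` appears in every tweet and gets a weight of zero, while a word that appears in a single tweet, such as `earthquake`, gets the highest weight.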

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from tqdm import tqdm

MIN_DOC_FREQ = 2
NGRAM_RANGE = (1, 2)
TOP_K = 20000
TOKEN_MODE = 'word'

def ngram_vectorize(train_texts, train_labels, val_texts):
    tfidf_vectorizer_params = {
        'ngram_range': NGRAM_RANGE,
        'dtype': 'int32',
        'strip_accents': 'unicode',
        'decode_error': 'replace',
        'analyzer': TOKEN_MODE,
        'min_df': MIN_DOC_FREQ,
    }

    # Learn vocab from train texts and vectorize train and val sets
    tfidf_vectorizer = TfidfVectorizer(**tfidf_vectorizer_params)
    x_train = tfidf_vectorizer.fit_transform(train_texts)
    x_val = tfidf_vectorizer.transform(val_texts)

    # Select k best features, with feature importance measured by f_classif
    selector = SelectKBest(f_classif, k=min(TOP_K, x_train.shape[1]))
    selector.fit(x_train, train_labels)
    x_train = selector.transform(x_train).astype('float32')
    x_val = selector.transform(x_val).astype('float32')

    return x_train, x_val

Labeled data is split into train and test sets for validation. Then, `autosklearn` chooses and trains a classifier within a limited time budget.

NB: macOS users can install `autosklearn` by [running](https://github.com/automl/auto-sklearn/issues/155#issuecomment-481957383):

```
conda install clang_osx-64
conda install clangxx_osx-64
pip install auto-sklearn
```

In [2]:
from tempfile import TemporaryDirectory

import autosklearn
import autosklearn.classification
from sklearn.model_selection import train_test_split

def automl_train_and_predict(X, y, X_to_predict):
    x, x_to_predict = ngram_vectorize(
        X, y, X_to_predict)
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=42)

    tmp_folder = TemporaryDirectory()
    output_folder = TemporaryDirectory()

    # Auto-tuning by autosklearn
    print('going here')
    cls = autosklearn.classification.AutoSklearnClassifier(n_jobs=4, time_left_for_this_task=200,
                                                           per_run_time_limit=20,
                                                           tmp_folder=tmp_folder.name,
                                                           output_folder=output_folder.name,
                                                           seed=10)
    print('going there...', x_train, y_train)
    cls.fit(x_train, y_train)
    print('going there 2...')
    assert x_train.shape[1] == x_to_predict.shape[1]

    # Performance metric
    print('going there 3...')
    predictions_test = cls.predict(x_test)
    print('Accuracy: {}'.format(accuracy_score(y_test, predictions_test)))

    # Generate predictions
    predictions = cls.predict(x_to_predict)
    return predictions



## 3. Automating labeling in Kili Technology
Let's now feed Kili data to the AutoML pipeline. To do so, you will need to [create a new](https://cloud.kili-technology.com/label/projects/create-project) `Text classification` project. Assets are taken from the Kaggle challenge `Real or Not? NLP with Disaster Tweets`. You can download them [here](https://www.kaggle.com/c/nlp-getting-started/data).

Connect to Kili Technology using `kili-playground` (Kili's official [Python SDK](https://github.com/kili-technology/kili-playground) to interact with Kili API):

In [3]:
# !pip install kili
from kili.authentication import KiliAuth
from kili.playground import Playground

email = 'pierre@kili-technology.com' # 'YOUR EMAIL'
password = '' # 'YOUR PASSWORD'
project_id = 'ck8hd0ab2lf2v0729aqhleqcc' # 'YOUR PROJECT ID'
api_endpoint = 'https://cloud.kili-technology.com/api/label/graphql'

kauth = KiliAuth(email=email, password=password, api_endpoint=api_endpoint)
playground = Playground(kauth)

Let's insert all assets into Kili. You can download the original `test.csv` directly [on Kaggle](https://www.kaggle.com/c/nlp-getting-started/data).

In [8]:
import pandas as pd

df = pd.read_csv('/Users/pmarcenac/Downloads/test.csv')
content_array = []
external_id_array = []
for index, row in df.iterrows():
    if index < 10:
        external_id_array.append(f'tweet_{index}')
        content_array.append(row['text'])

playground.append_many_to_dataset(project_id=project_id,
                                  content_array=content_array,
                                  external_id_array=external_id_array,
                                  is_honeypot_array=[False for _ in content_array],
                                  status_array=['TODO' for _ in content_array],
                                  json_metadata_array=[{} for asset in content_array])

{'id': 'ck8hd0ab2lf2v0729aqhleqcc'}

Retrieve the categories of the first job that you defined in Kili interface. Learn [here](https://kili-technology.github.io/kili-docs/docs/projects/customize-interfaces) what interfaces and jobs are in Kili.

In [4]:
tools = playground.get_tools(project_id=project_id)
assert len(tools) == 1

json_settings = tools[0]['jsonSettings']
jobs = json_settings['jobs']
jobs_list = list(jobs.keys())
assert len(jobs_list) == 1, 'More than one job was defined in the interface'

job_name = jobs_list[0]
job = jobs[job_name]
categories = list(job['content']['categories'].keys())
print(f'Categories are: {categories}')

Categories are: ['YES', 'NO']
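The classifier works with integer classes, so the order of `categories` matters: the position of a category in the list is used as its class index. The sketch below reproduces the lookup on a hypothetical `json_settings` dictionary. The job name `JOB_0` and the inner `name` values are assumptions made for illustration; only the `jobs` / `content` / `categories` nesting is taken from the cell above.

```python
# Hypothetical interface settings, shaped like what the cell above reads
json_settings = {
    "jobs": {
        "JOB_0": {
            "content": {
                "categories": {
                    "YES": {"name": "Real disaster"},
                    "NO": {"name": "Not a disaster"},
                },
            },
        },
    },
}

jobs = json_settings["jobs"]
job_name = next(iter(jobs))  # the single job defined in the interface
categories = list(jobs[job_name]["content"]["categories"].keys())

# Mappings between category names and the integer classes used for training
label_to_index = {name: index for index, name in enumerate(categories)}
index_to_label = dict(enumerate(categories))

print(categories)
print(label_to_index)
```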


We continuously fetch assets from Kili Technology and apply the AutoML pipeline. You can launch the next cell and then go to Kili to label. After you have labeled a few assets, you will see predictions automatically pop up in Kili!

In [None]:
import os
import time

SECONDS_BETWEEN_TRAININGS = 0

def extract_train_for_auto_ml(job_name, assets, categories, train_test_threshold=0.8):
    X = []
    y = []
    X_to_predict = []
    ids_X_to_predict = []
    for asset in assets:
        x = asset['content']
        labels = [l for l in asset['labels'] if l['labelType'] in ['DEFAULT', 'REVIEWED']]

        # If no label, add it to X_to_predict
        if len(labels) == 0:
            X_to_predict.append(x)
            ids_X_to_predict.append(asset['id'])

        # Otherwise add it to training examples X, y
        for label in labels:
            jsonResponse = label['jsonResponse'][job_name]
            is_empty_label = 'categories' not in jsonResponse or len(
                jsonResponse['categories']) != 1 or 'name' not in jsonResponse['categories'][0]
            if is_empty_label:
                continue
            X.append(x)
            y.append(categories.index(
                jsonResponse['categories'][0]['name']))
    return X, y, X_to_predict, ids_X_to_predict

while True:
    print('Export assets and labels...')
    # Only the first 100 assets are fetched here; adjust `first` and `skip` to cover the whole project
    assets = playground.get_assets(project_id=project_id, first=100, skip=0)
    X, y, X_to_predict, ids_X_to_predict = extract_train_for_auto_ml(job_name, assets, categories)

    # Train only once a few labels exist and there is something left to predict
    if len(X) > 5 and len(X_to_predict) > 0:
        print('AutoML is on its way...')
        predictions = automl_train_and_predict(X, y, X_to_predict)
        # Insert pre-annotations
        for i, prediction in enumerate(tqdm(predictions)):
            json_response = {
                job_name: {
                    'categories': [{
                        'name': categories[prediction],
                        'confidence': 100
                    }]
                }
            }
            asset_id = ids_X_to_predict[i]
            playground.create_prediction(asset_id=asset_id, json_response=json_response)
        print('Done.\n')
    time.sleep(SECONDS_BETWEEN_TRAININGS)

Export assets and labels...


100%|██████████| 100/100 [00:03<00:00, 27.82it/s]

AutoML is on its way...
going here
going there...   (0, 0)	1.0
  (2, 7)	0.70710677
  (2, 4)	0.70710677
  (3, 11)	0.33333334
  (3, 10)	0.33333334
  (3, 9)	0.33333334
  (3, 8)	0.33333334
  (3, 6)	0.33333334
  (3, 5)	0.33333334
  (3, 3)	0.33333334
  (3, 2)	0.33333334
  (3, 1)	0.33333334
  (7, 7)	0.70710677
  (7, 0)	0.70710677
  (8, 4)	1.0 [1, 0, 0, 0, 1, 0, 1, 0, 0]
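To see what the loop does with each asset, here is a minimal, self-contained sketch that runs without a Kili connection. The assets below are hypothetical and reduced to the fields the loop reads. The job name `JOB_0` is an assumption, and the `PREDICTION` label type is used only as an example of a label that was not created by a human.

```python
JOB_NAME = "JOB_0"          # hypothetical job name
CATEGORIES = ["YES", "NO"]  # same order as the categories printed earlier

# Hypothetical assets, reduced to the fields the loop reads
assets = [
    {"id": "a1", "content": "Forest fire near La Ronge", "labels": [
        {"labelType": "DEFAULT",
         "jsonResponse": {JOB_NAME: {"categories": [{"name": "YES"}]}}}]},
    {"id": "a2", "content": "I love this song", "labels": [
        {"labelType": "PREDICTION",
         "jsonResponse": {JOB_NAME: {"categories": [{"name": "NO"}]}}}]},
    {"id": "a3", "content": "Flooding downtown", "labels": []},
]


def split_assets(assets):
    """Separate human-labeled training examples from assets that still need a prediction."""
    texts, targets, to_predict = [], [], []
    for asset in assets:
        human_labels = [label for label in asset["labels"]
                        if label["labelType"] in ("DEFAULT", "REVIEWED")]
        if not human_labels:
            to_predict.append((asset["id"], asset["content"]))
            continue
        for label in human_labels:
            chosen = label["jsonResponse"][JOB_NAME].get("categories", [])
            # Keep only labels with exactly one named category
            if len(chosen) == 1 and "name" in chosen[0]:
                texts.append(asset["content"])
                targets.append(CATEGORIES.index(chosen[0]["name"]))
    return texts, targets, to_predict


def to_json_response(class_index):
    """Build the payload that the loop sends back for a predicted class."""
    return {JOB_NAME: {"categories": [{"name": CATEGORIES[class_index], "confidence": 100}]}}


texts, targets, to_predict = split_assets(assets)
print(texts, targets)
print(to_predict)
print(to_json_response(1))
```

Only the first asset becomes a training example. The second one carries a model prediction rather than a human label, so it stays in the pool of assets to predict, together with the unlabeled third asset.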






## Summary
In this tutorial, we introduced the concept of AutoML as well as several of the most-used AutoML frameworks, and we demonstrated how to leverage AutoML to automatically create predictions in Kili.

If you enjoyed this tutorial, check out the other Recipes, which include further demonstrations of how to use Kili.

You can also visit the Kili website or Kili documentation for more info!