# Kili Tutorial: How to leverage Counterfactually augmented data to have a more robust model

This recipe is inspired by the paper *Learning the Difference that Makes a Difference with Counterfactually-Augmented Data*, that you can find here on [arXiv](https://arxiv.org/abs/1909.12434)

In this study, the authors point out the difficulty for Machine Learning models to generalize the classification rules learned, because their decision rules, described as 'spurious patternes', often miss the key elements that affects most the class of a text. They thus decided to delete what can be considered as a confusion factor, by changing the label of an asset at the same time as changing the minimum amount of words so those **key-words** would be much easier for the model to spot.

We'll see in this tutorial :
1. How to create a project in Kili, both for [IMDB](##Data-Augmentation-on-IMDB-dataset) and [SNLI](##Data-Augmentation-on-SNLI-dataset) datasets, to reproduce such a data-augmentation task, in order to improve our model, and decrease its variance when used in production with unseen data.
2. We'll also try to [reproduce the results of the paper](##Reproducing-the-results), using similar models, to show how such a technique can be of key interest while working on a text-classification task.
We'll use the data of the study, both IMDB and Stanford NLI, publicly available [here](https://github.com/acmi-lab/counterfactually-augmented-data).

Additionally, for an overview of Kili, visit the [website](https://kili-technology.com), you can also check out the Kili [documentation](https://cloud.kili-technology.com/docs), or some other recipes.


![data augmentation](https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/data_collection_pipeline.png)

In [None]:
# Authentication
import os

# !pip install kili # uncomment if you don't have kili installed already
from kili.authentication import KiliAuth
from kili.playground import Playground

email = os.getenv('KILI_USER_EMAIL')
password = os.getenv('KILI_USER_PASSWORD')
api_endpoint = os.getenv('KILI_API_ENDPOINT') 
# If you use Kili SaaS, use the url 'https://cloud.kili-technology.com/api/label/graphql'

kauth = KiliAuth(email=email, password=password, api_endpoint=api_endpoint)
playground = Playground(kauth)
user_id = kauth.user_id

## Data Augmentation on IMDB dataset

The data consists in reviews of films, that are classified as positives or negatives. State-of-the-art models performance is often measured against this dataset, making it a reference. 

This is how our task would look like on Kili, into 2 different projects for each task, from Positive to Negative or Negative to Positive.

### Creating the projects

In [None]:
taskname = "NEW_REVIEW"
project_imdb_negative_to_positive = {
'title': 'Counterfactual data-augmentation - Negative to Positive',
'description': 'IMDB Sentiment Analysis',
'instructions': 'https://docs.google.com/document/d/1zhNaQrncBKc3aPKcnNa_mNpXlria28Ij7bfgUvJbyfw/edit?usp=sharing',
'input_type': 'TEXT',
'json_interface':{
    "filetype": "TEXT",
    "jobRendererWidth": 0.5,
    "jobs": {
        taskname : {
            "mlTask": "TRANSCRIPTION",
            "content": {
                "input": None
            },
            "required": 1,
            "isChild": False,
            "instruction": "Write here the new review modified to be POSITIVE. Please refer to the instructions above before starting"
        }
    }
}
}
project_imdb_positive_to_negative = {
'title': 'Counterfactual data-augmentation - Positive to Negative',
'description': 'IMDB Sentiment Analysis',
'instructions': 'https://docs.google.com/document/d/1zhNaQrncBKc3aPKcnNa_mNpXlria28Ij7bfgUvJbyfw/edit?usp=sharing',
'input_type': 'TEXT',
'json_interface':{
    "jobRendererWidth": 0.5,
    "jobs": {
        taskname : {
            "mlTask": "TRANSCRIPTION",
            "content": {
                "input": None
            },
            "required": 1,
            "isChild": False,
            "instruction": "Write here the new review modified to be NEGATIVE. Please refer to the instructions above before starting"
        }
    }
}
}

In [None]:
for project_imdb in [project_imdb_positive_to_negative,project_imdb_negative_to_positive] :
    project_imdb['id'] = playground.create_empty_project(user_id=user_id)['id']
    playground.update_properties_in_project(project_id=project_imdb['id'],
                                            title=project_imdb['title'],
                                            instructions=project_imdb['instructions'],
                                            description=project_imdb['description'],
                                            input_type=project_imdb['input_type'],
                                            json_interface=project_imdb['json_interface'])

We'll just create some useful functions for an improved readability :

In [None]:
def create_assets(dataframe, intro, objective, instructions, truth_label, target_label) :
    return((intro + dataframe[truth_label] + objective + dataframe[target_label] + instructions + dataframe['Text']).tolist())

def create_json_responses(taskname,df,field="Text") :
    return( [{taskname: { "text": df[field].iloc[k] }
          } for k in range(df.shape[0]) ])

### Importing the data into Kili

In [None]:
import pandas as pd
datasets = ['dev','train','test']

for dataset in datasets :
    url = f'https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/sentiment/combined/paired/{dataset}_paired.tsv'
    df = pd.read_csv(url, error_bad_lines=False, sep='\t')
    df = df[df.index%2 == 0] # keep only the original reviews as assets
    
    
    for review_type,project_imdb in zip(['Positive','Negative'],[project_imdb_positive_to_negative,project_imdb_negative_to_positive]) :
        dataframe = df[df['Sentiment']==review_type]
        reviews_to_import = dataframe['Text'].tolist()
        external_id_array = ('IMDB ' + review_type +' review ' + dataset + dataframe['batch_id'].astype('str')).tolist()
    
        playground.append_many_to_dataset(
            project_id=project_imdb['id'],
            content_array=reviews_to_import,
            external_id_array=external_id_array)

### Importing the labels into Kili 
We will fill-in with the results of the study, as if they were predictions. In a real annotation project, we could fill in with the sentences as well so the labeler just has to write the changes. 

In [None]:
model_name = 'results-arxiv:1909.12434'

for dataset in datasets :
    url = f'https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/sentiment/combined/paired/{dataset}_paired.tsv'
    df = pd.read_csv(url, error_bad_lines=False, sep='\t')
    df = df[df.index%2 == 1] # keep only the modified reviews as predictions
    
    for review_type,project_imdb in zip(['Positive','Negative'],[project_imdb_positive_to_negative,project_imdb_negative_to_positive]) :
        dataframe = df[df['Sentiment']!=review_type]

        external_id_array = ('IMDB ' + review_type +' review ' + dataset + dataframe['batch_id'].astype('str')).tolist()
        json_response_array = create_json_responses(taskname,dataframe)
    
        playground.create_predictions(project_id=project_imdb['id'],
            external_id_array=external_id_array,
            model_name_array=[model_name]*len(external_id_array),
            json_response_array=json_response_array)

This is how our interface looks in the end, allowing to quickly perform the task at hand

![IMDB](./img/imdb_review.png)

## Data Augmentation on SNLI dataset

The data consists in a 3-class dataset, where, provided with two phrases, a premise and an hypothesis, the machine-learning task is to find the correct relation between those two sentences, that can be either entailment, contradiction or neutral.

Here is an example of a premise, and three sentences that could be the hypothesis for the three categories :
![examples](https://licor.me/post/img/robust-nlu/SNLI_annotation.png)

This is how our task would look like on Kili, this time keeping it as a single project. To do so, we strongly remind the instructions at each labeler.

### Creating the project

In [None]:
taskname = "SENTENCE_MODIFIED"
project_snli={
'title': 'Counterfactual data-augmentation NLI',
'description': 'Stanford Natural language Inference',
'instructions': '',
'input_type': 'TEXT',
'json_interface':{
    "jobRendererWidth": 0.5,
    "jobs": {
        taskname: {
            "mlTask": "TRANSCRIPTION",
            "content": {
                "input": None
            },
            "required": 1,
            "isChild": False,
            "instruction": "Write here the modified sentence. Please refer to the instructions above before starting"
        }
    }
}
}

In [None]:
project_snli['id'] = playground.create_empty_project(user_id=user_id)['id']
print('Created project ' + project_snli["id"])
playground.update_properties_in_project(project_id=project_snli['id'],
                                        title=project_snli['title'],
                                        instructions=project_snli['instructions'],
                                        description=project_snli['description'],
                                        input_type=project_snli['input_type'],
                                        json_interface=project_snli['json_interface'])

Again, we'll factorize our code a little, to merge datasets and differentiate properly all the cases of sentences : 

In [None]:
def merge_datasets(dataset, sentence_modified) :
    url_original = f'https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/NLI/original/{dataset}.tsv'
    url_revised = f'https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/NLI/revised_{sentence_modified}/{dataset}.tsv'
    df_original = pd.read_csv(url_original, error_bad_lines=False, sep='\t')
    df_original = df_original[df_original.duplicated(keep='first')== False]
    df_original['id'] = df_original.index.astype(str)
    
    df_revised = pd.read_csv(url_revised, error_bad_lines=False, sep='\t')
    axis_merge = 'sentence2' if sentence_modified=='premise' else 'sentence1'
    # keep only one label per set of sentences
    df_revised = df_revised[df_revised[[axis_merge,'gold_label']].duplicated(keep='first')== False]

    df_merged = df_original.merge(df_revised, how='inner', left_on=axis_merge, right_on=axis_merge)
    
    if sentence_modified ==  'premise' :
        df_merged['Text'] = df_merged['sentence1_x'] + '\nSENTENCE 2 :\n' + df_merged['sentence2']
        instructions = " relation, by making a small number of changes in the FIRST SENTENCE\
        such that the document remains coherent and the new label accurately describes the revised passage :\n\n\n\
        SENTENCE 1 :\n"
    else : 
        df_merged['Text'] = df_merged['sentence1'] + '\nSENTENCE 2 :\n' + df_merged['sentence2_x']
        instructions = " relation, by making a small number of changes in the SECOND SENTENCE\
        such that the document remains coherent and the new label accurately describes the revised passage :\n\n\n\
        SENTENCE 1 : \n"
    return(df_merged, instructions)

def create_external_ids(dataset,dataframe, sentence_modified):
    return(('NLI ' + dataset + ' ' + dataframe['gold_label_x'] + ' to ' + dataframe['gold_label_y'] + ' ' + sentence_modified + ' modified ' + dataframe['id']).tolist())


### Importing the data into Kili
We'll add before each set of sentences a small precision of the task for the labeler :

In [None]:
datasets = ['dev','train','test']
sentences_modified = ['premise', 'hypothesis']
intro = "Those two sentences' relation is classified as "
objective = " to convert to a "

for dataset in datasets :
    for sentence_modified in sentences_modified :
        df,instructions = merge_datasets(dataset, sentence_modified)

        sentences_to_import = create_assets(df, intro, objective, instructions, 'gold_label_x', 'gold_label_y')
        external_id_array = create_external_ids(dataset, df, sentence_modified)
    
        playground.append_many_to_dataset(project_id=project_snli['id'],
            content_array=sentences_to_import,
            external_id_array=external_id_array)

### Importing the labels into Kili 
We will fill-in with the results of the study, as if they were predictions.

In [None]:
model_name = 'results-arxiv:1909.12434'

for dataset in datasets :
    for sentence_modified in sentences_modified :
        axis_changed = 'sentence1_y' if sentence_modified=='premise' else 'sentence2_y'
        df,instructions = merge_datasets(dataset, sentence_modified)

        external_id_array = create_external_ids(dataset, df, sentence_modified)
        json_response_array = create_json_responses(taskname,df,axis_changed) 
    
        playground.create_predictions(project_id=project_snli['id'],
            external_id_array=external_id_array,
            model_name_array=[model_name]*len(external_id_array),
            json_response_array=json_response_array)

![NLI](./img/snli_ex1.png)
![NLI](./img/snli_ex2.png)

## Reproducing the results

The study focus on comparing the performance of 5 models, when trained on the original dataset, or when trained on the new entire dataset.
Those five models are SVM, Naïve Bayes, bi-LSTMs, ELMo-LSTM & BERT. 
### SVM & Naïve Bayes models

In [77]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

maxlen=300
datasets = ['train','dev', 'test']

# Read train, validation and test data, and train TF-IDF matrix
def prepare_dataset_imdb(which_data, for_sklearn=True) :
    for dataset in datasets :
        url = f'https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/sentiment/combined/paired/{dataset}_paired.tsv'
        df = pd.read_csv(url, error_bad_lines=False, sep='\t')
        if which_data == 'original' :
            df = df[df.index%2 == 0] # keep only the original reviews
        elif which_data == 'revised' :
            df = df[df.index%2 == 1] # keep only the original reviews
        if dataset == 'train':
            y_train = df['Sentiment'].tolist()
            if for_sklearn :
                vec = TfidfVectorizer(decode_error='ignore', strip_accents='unicode', encoding='utf-8', min_df=10, max_df=1000)
                X_train = vec.fit_transform(df['Text'])
            else :
                tokenizer = Tokenizer(num_words=20001,oov_token='UNK')
                tokenizer.fit_on_texts(df['Text'].tolist())
                print('Found %s unique tokens.' % len(tokenizer.word_index))
                sequences = tokenizer.texts_to_sequences(df['Text'].tolist())
                X_train = pad_sequences(sequences, maxlen=maxlen)
                y_train = np.array([int(y == 'Positive') for y in y_train]).reshape((-1,1))
            print("Train matrix dimensionality: ", X_train.shape)
        elif dataset == 'dev': 
            y_dev = df['Sentiment'].tolist()
            if for_sklearn :
                X_dev = vec.transform(df['Text'])
            if not for_sklearn :
                sequences = tokenizer.texts_to_sequences(df['Text'].tolist())
                X_dev = pad_sequences(sequences, maxlen=maxlen)
                y_dev = np.array([int(y == 'Positive') for y in y_dev]).reshape((-1,1))
            print("Dev matrix dimensionality: ", X_dev.shape)
        else :  
            y_test = df['Sentiment'].tolist()
            if for_sklearn :
                X_test = vec.transform(df['Text'])
            if not for_sklearn :
                sequences = tokenizer.texts_to_sequences(df['Text'].tolist())
                X_test = pad_sequences(sequences, maxlen=maxlen)
                y_test = np.array([int(y == 'Positive') for y in y_test]).reshape((-1,1))
            print("Test matrix dimensionality: ", X_test.shape)
            return(X_train,y_train, X_dev,y_dev, X_test, y_test)


# Use pipeline to search parameters for both TFIDF and SVM :
#steps = [('TFIDF', StandardScaler()), ('SVM', SVC())]
#from sklearn.pipeline import Pipeline
#pipeline = Pipeline(steps) # define the pipeline object.


In [5]:
(X_train,y_train, X_dev,y_dev, X_test,y_test) = prepare_dataset_imdb('original')
model_SVM = SVC()
model_NaiveBayes = MultinomialNB()
for model in (model_SVM,model_NaiveBayes) :
    model.fit(X_train, y_train)
    print(f'Train Accuracy for {model} : {accuracy_score(y_train,model.predict(X_train))}')
    print(f'Dev Accuracy for {model} : {accuracy_score(y_dev,model.predict(X_dev))}')
    print(f'Test Accuracy for {model} : {accuracy_score(y_test,model.predict(X_test))}')

Train Accuracy for SVC() : 0.9964850615114236
Dev Accuracy for SVC() : 0.8775510204081632
Test Accuracy for SVC() : 0.8442622950819673
Train Accuracy for MultinomialNB() : 0.9086115992970123
Dev Accuracy for MultinomialNB() : 0.8857142857142857
Test Accuracy for MultinomialNB() : 0.8463114754098361


We now run grid search to optimize parameters

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid_SVM = {
    'kernel':('linear', 'rbf','poly'),
    'C':[0.001,0.01,0.1,1,10,100,10e5],
    'gamma':[0.1,0.01,0.01,0.001]}
param_grid_NB  = parameters = {
    'alpha': (10,1, 0.1, 0.01, 0.001, 0.0001, 0.00001)  
    }
for model,param_grid in zip((model_SVM,model_NaiveBayes),(param_grid_SVM,param_grid_NB)) :
    grid = GridSearchCV(model, param_grid=param_grid, scoring='accuracy',cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    #If there are multiple better models :
    np.asarray(grid.cv_results_[0]['params'])[grid.cv_results_[0]['rank_test_score']==1]
    print("score DEV = %3.2f" %(grid.score(X_dev,y_dev)))
    print("score TEST = %3.2f" %(grid.score(X_test,y_test)))                  
    print(grid.best_params_)    
    print(grid.best_score_)

### bi-LSTM

In [9]:
#from 'text-classification/models' import bi-LSTM
import sys
# insert at 1, 0 is the script path (or '' in REPL)
sys.path.insert(1, 'text-classification')
import models.py


In [46]:
import numpy as np

vocabulary_size = 20000
maxlen = 300
batch_size = 32
lr = 1e-3

(X_train_nn,y_train_nn, X_dev_nn,y_dev_nn, X_test_nn,y_test_nn) = prepare_dataset_imdb('original', for_sklearn=False)


Found 19981 unique tokens.
Train matrix dimensionality:  (1707, 300)
Dev matrix dimensionality:  (245, 300)
Test matrix dimensionality:  (488, 300)


In [95]:
model = bi_LSTM()
# try using different optimizers and different optimizer configs
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

print('Train...')
model.fit(X_train_nn[0:10], y_train_nn[0:10],
          batch_size=batch_size,
          epochs=20,
          validation_data=[X_dev_nn, y_dev_nn])

Train...
Epoch 1/20

AttributeError: in user code:

    /usr/local/Caskroom/miniconda/base/envs/kilienv/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:941 test_function  *
        outputs = self.distribute_strategy.run(
    <ipython-input-89-e0d6bd7b1bcc>:19 call  *
        x = self.embedding(inputs)
    /usr/local/Caskroom/miniconda/base/envs/kilienv/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py:927 __call__  **
        outputs = call_fn(cast_inputs, *args, **kwargs)
    /usr/local/Caskroom/miniconda/base/envs/kilienv/lib/python3.8/site-packages/tensorflow/python/keras/layers/embeddings.py:181 call
        dtype = K.dtype(inputs)
    /usr/local/Caskroom/miniconda/base/envs/kilienv/lib/python3.8/site-packages/tensorflow/python/keras/backend.py:1268 dtype
        return x.dtype.base_dtype.name

    AttributeError: 'tuple' object has no attribute 'dtype'


In [137]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional,GlobalMaxPooling1D
from keras.optimizers import Adam

bi_LSTM = Sequential()
bi_LSTM.add(Embedding(20000, 50, input_length=300))
#model.add(GlobalMaxPooling1D())
bi_LSTM.add(Bidirectional(LSTM(50,recurrent_dropout=0.5,
            recurrent_activation='tanh')))
bi_LSTM.add(Dense(50, activation='relu'))
bi_LSTM.add(Dense(1, activation='sigmoid'))

opt = Adam(learning_rate=lr)
bi_LSTM.compile(opt, 'binary_crossentropy', metrics=['accuracy'])

callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
print('Train...')
bi_LSTM.fit(X_train_nn, y_train_nn,
          batch_size=32,callbacks=[callback],
          epochs=20,
          validation_data=[X_dev_nn, y_dev_nn])

Train...
Train on 1707 samples, validate on 245 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20


<keras.callbacks.callbacks.History at 0x143c5a880>

In [122]:
model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 300, 50)           1000000   
_________________________________________________________________
bidirectional_8 (Bidirection (None, 100)               40400     
_________________________________________________________________
dense_11 (Dense)             (None, 50)                5050      
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 51        
Total params: 1,045,501
Trainable params: 1,045,501
Non-trainable params: 0
_________________________________________________________________


In [121]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional,GlobalMaxPooling1D
from keras.optimizers import Adam
import tensorflow_hub as hub


elmo = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True)
def ELMoEmbedding(input_text):
    return elmo(tf.reshape(tf.cast(input_text, tf.string), [-1]), signature="default", as_dict=True)["elmo"]

model = Sequential()
model.add(Embedding(20000, 50, input_length=300))
#model.add(GlobalMaxPooling1D())
ELMo_LSTM.add(Bidirectional(LSTM(50,recurrent_dropout=0.5,
            recurrent_activation='tanh')))
ELMo_LSTM.add(Dense(50, activation='relu'))
ELMo_LSTM.add(Dense(1, activation='sigmoid'))

opt = Adam(learning_rate=lr)
ELMo_LSTM.compile(opt, 'binary_crossentropy', metrics=['accuracy'])

def build_model():
    input_layer = Input(shape=(1,), dtype="string", name="Input_layer")
    embedding_layer = Lambda(ELMoEmbedding, output_shape=(1024, ), name="Elmo_Embedding")(input_layer)
    BiLSTM = Bidirectional(layers.LSTM(1024, return_sequences= False, recurrent_dropout=0.2, dropout=0.2), name="BiLSTM")(embedding_layer)
    Dense_layer_1 = Dense(8336, activation='relu')(BiLSTM)
    Dropout_layer_1 = Dropout(0.5)(Dense_layer_1)
    Dense_layer_2 = Dense(4168, activation='relu')(Dropout_layer_1)
    Dropout_layer_2 = Dropout(0.5)(Dense_layer_2)
    output_layer = Dense(1, activation='sigmoid')(Dropout_layer_2)
    model = Model(inputs=[input_layer], outputs=output_layer, name="BiLSTM with ELMo Embeddings")
    model.summary()
    model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
    return model
elmo_BiDirectional_model = build_model()
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
print('Train...')
ELMo_LSTM.fit(X_train_nn, y_train_nn,
          batch_size=32,callbacks=[callback],
          epochs=20,
          validation_data=[X_dev_nn, y_dev_nn])

array([[0],
       [0],
       [0],
       ...,
       [1],
       [1],
       [1]])

In [91]:


class bi_LSTM(tf.keras.Model):

    def __init__(self, input_review_size=300, vocabulary_size=20000, n_word_embedding=50, n_hidden=50):
        super(bi_LSTM, self).__init__()
        self.embedding = tf.keras.layers.Embedding(
            input_dim=vocabulary_size,
            output_dim=n_word_embedding,
            input_length=input_review_size)
        self.max_pooling = tf.keras.layers.GlobalMaxPooling1D()
        self.lstm = tf.keras.layers.LSTM(
            n_hidden,
            recurrent_dropout=0.5,
            recurrent_activation=tf.nn.relu)
        self.bidirectional_lstm = tf.keras.layers.Bidirectional(self.lstm)
        self.dense = tf.keras.layers.Dense(50, activation=tf.nn.relu)
        self.out = tf.keras.layers.Dense(1, activation=tf.nn.softmax)

    def call(self, inputs, training=False):
        x = self.embedding(inputs)
        #x = self.max_pooling(x)
        x = self.bidirectional_lstm(x)
        x = self.dense(x)
        out = self.out(x)
        return(out)


## Conclusion
In this tutorial, we learned how Kili can be a great help in your data augmentation task, as it allows to set a simple and easy to use interface, with proper instructions for your task.

For the study, the quality of the labeling was a key feature in this complicated task, what Kili allows very simply. To monitor the quality of the results, we could set-up a consensus on a part or all of the annotations, or even keep a part of the dataset as ground truth to measure the performance of every labeler.

For an overview of Kili, visit [kili-technology.com](https://kili-technology.com). You can also check out [Kili documentation](https://cloud.kili-technology.com/docs).