# Kili Tutorial: Importing inference labels

In this tutorial, we will walk through the process of using Kili to evaluate the performance of a machine learning model in production. The goal is to illustrate how to push the model's predictions to Kili as inference labels, and how to visualize the quality of those predicted labels.

For an overview of Kili, visit https://kili-technology.com. You can also check out the Kili documentation at https://docs.kili-technology.com/docs.

The tutorial is divided into two parts:

1. Giving a bit of context
2. How to make use of inference labels in practice

This next cell connects the notebook to the Kili API. Before running it, set your API key in the `KILI_API_KEY` environment variable.

In [None]:
#!pip install kili
import os

from kili.client import Kili

# Credentials are read from the environment so that they never appear in the notebook
api_key = os.getenv('KILI_API_KEY')
api_endpoint = os.getenv('KILI_API_ENDPOINT')
# If you use Kili SaaS, use the url 'https://cloud.kili-technology.com/api/label/v2/graphql'

kili = Kili(api_key=api_key, api_endpoint=api_endpoint)

# 1 - Context

## 1.1 Agreement

Let's say you have a trained machine learning model $m$ which, given a data point $x$, outputs a prediction (i.e., an inference label) $l^i = m(x)$.

You will probably want to monitor the quality of these predictions as the model evolves. Kili helps you monitor and iterate on your model through the concept of agreement. An agreement is a quantitative measure of similarity between two different labels. In Kili, there are three main features derived from agreement:

- [Consensus](https://docs.kili-technology.com/docs/consensus-overview), which is the agreement between two labelers.
- [Honeypot](https://docs.kili-technology.com/docs/honeypot-overview), which is the agreement between a "super human annotator" and a labeler.
- **Inference**, which is the agreement between a machine learning inference label and a human.

These numbers can be monitored from the [queue page](https://docs.kili-technology.com/docs/queue-page) or the [analytics page](https://docs.kili-technology.com/docs/analytics-page). You can find how the agreement is computed [here](https://docs.kili-technology.com/docs/calculation-rules-for-quality-metrics).

In this tutorial, we will put an emphasis on **Inference**.
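To make the notion of agreement concrete, here is a minimal sketch for a single-job classification label. It is an illustration only and not Kili's actual formula (the calculation rules linked above are the reference): the helper `category_agreement` is hypothetical, and it scores two labels by the intersection over union of their category names.

```python
def category_agreement(label_a, label_b, job="CLASSIFICATION_JOB"):
    """Hypothetical helper: intersection over union of two labels' category names."""
    names_a = {category["name"] for category in label_a[job]["categories"]}
    names_b = {category["name"] for category in label_b[job]["categories"]}
    union = names_a | names_b
    if not union:
        return 1.0  # two empty labels agree trivially
    return len(names_a & names_b) / len(union)


# Made-up labels, in the same JSON shape as the ones pushed later in this tutorial
model_label = {"CLASSIFICATION_JOB": {"categories": [{"name": "WHITE"}]}}
human_label = {"CLASSIFICATION_JOB": {"categories": [{"name": "BLACK"}]}}

print(category_agreement(model_label, model_label))  # identical labels
print(category_agreement(model_label, human_label))  # disjoint labels
```

With a single-choice job such as the one used below, this score is either 0 or 1; it only takes intermediate values when a label may hold several categories.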

## 1.2 Use cases 

We identify two main use cases for **inference**:

1. **You have a model in production**. When it receives assets, it automatically feeds a Kili project with both the asset and the predicted label. **You also have a human workforce, whose job is to monitor the quality of the model**. They regularly label by hand some of the data seen by the model.
    - When a human submits a label, the inference score for that label is automatically computed using the predicted label.
    - Low inference scores can indicate either a model performing badly on some kind of data, or a disagreement between humans and the model. This can help you to:
        - `Detect data drift`
        - `Identify data on which the model needs improvement`

2. **You used Kili to label data**, and you have **the first iteration of your model**. You can use **a part of the dataset as testing data**, and quickly get **test scores**. You could of course use your own metrics (rather than our definition of agreement), but using Kili allows you to quickly filter and identify the assets where your model differs most from the ground truth.
    - When you push an inference label on an asset, the inference score is automatically computed for the most recent label of each person who labeled this asset.
    - You can filter on low inference scores to understand why your model is failing and how to fix it (getting more data, splitting or merging categories, etc.).


Using Kili to monitor or develop your model lets you iterate quickly on the data used to train it, so you get a better model faster.
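As a rough sketch of the "filter on low inference scores" step common to both use cases, the snippet below ranks labels by their `inferenceMark` field and keeps the ones under a threshold. The helper name `lowest_agreement`, the threshold of 0.5, and the example records are all made up for illustration; treating a missing mark as `None` is also an assumption.

```python
def lowest_agreement(labels, threshold=0.5):
    """Hypothetical helper: labels whose inferenceMark is below threshold, worst first."""
    # Skip labels that have no inference score yet
    scored = [label for label in labels if label.get("inferenceMark") is not None]
    flagged = [label for label in scored if label["inferenceMark"] < threshold]
    return sorted(flagged, key=lambda label: label["inferenceMark"])


# Made-up records shaped like the dictionaries fetched later in this tutorial
example_labels = [
    {"id": "label-1", "inferenceMark": 1.0},
    {"id": "label-2", "inferenceMark": 0.0},
    {"id": "label-3", "inferenceMark": None},
    {"id": "label-4", "inferenceMark": 0.3},
]

for label in lowest_agreement(example_labels):
    print(label["id"], label["inferenceMark"])
```

The assets behind the flagged labels are the ones worth inspecting first, whether the goal is spotting drift in production or debugging a first model iteration.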

# 2 - In practice

## 2.1 Use case 1

We start by creating a project. The model, which returns a category for a given asset input x, is simulated further down by a fixed lookup table.

In [None]:
json_interface = {
    "jobs": {
        "CLASSIFICATION_JOB": {
            "mlTask": "CLASSIFICATION",
            "content": {
                "categories": {
                    "RED": {"name": "Red"},
                    "BLACK": {"name": "Black"},
                    "WHITE": {"name": "White"},
                    "GREY": {"name": "Grey"}
                },
                "input": "radio"
            },
            "required": 0,
            "isChild": False,
            "instruction": "Color"
        }
    }
}

project_id = kili.create_project(
    title='Project demo inference',
    input_type='IMAGE',
    json_interface=json_interface
)['id']


Then we can simulate our model being in production. Each time it receives an asset, it pushes both the asset and the predicted label to the project.

In [None]:
stream_of_assets = [
    {
        'url': "https://storage.googleapis.com/label-public-staging/recipes/inference/black_car.jpg",
        'external_id': 'black_car.jpg'
    },
    {
        'url': "https://storage.googleapis.com/label-public-staging/recipes/inference/grey_car.jpg",
        'external_id': 'grey_car.jpg'
    },
    {
        'url': "https://storage.googleapis.com/label-public-staging/recipes/inference/white_car.jpg",
        'external_id': 'white_car.jpg'
    },
    {
        'url': "https://storage.googleapis.com/label-public-staging/recipes/inference/red_car.jpg",
        'external_id': 'red_car.jpg'
    }
]

In [None]:
predictions = {
    'black_car.jpg': 'WHITE',
    'grey_car.jpg': 'GREY',
    'white_car.jpg': 'RED',
    'red_car.jpg': 'BLACK',
}

# Maps Kili asset ids back to external ids (used by the checks further down)
id_to_external_id = {}

for asset in stream_of_assets:
    # Upload the asset to the project
    kili.append_many_to_dataset(
        project_id=project_id,
        content_array=[asset['url']],
        external_id_array=[asset['external_id']]
    )
    # Retrieve the id that Kili assigned to the asset we just uploaded
    asset_id = list(kili.assets(
        project_id=project_id,
        external_id_contains=[asset['external_id']],
        fields=["id"],
        disable_tqdm=True
    ))[0]["id"]
    id_to_external_id[asset_id] = asset['external_id']
    # Push the model's prediction as an inference label
    predicted_category = predictions[asset['external_id']]
    inference_label = {
        "CLASSIFICATION_JOB": {
            "categories": [{"name": predicted_category}]
        }
    }
    kili.append_to_labels(
        json_response=inference_label,
        label_asset_id=asset_id,
        label_type='INFERENCE',
    )

Then, human labelers can annotate a subsample of the assets pushed to Kili. 

Note: you can even automatically [prioritize assets](https://docs.kili-technology.com/docs/queue-prioritization) for human review by using the model's uncertainty. When the model is unsure of its predictions, the predicted labels are more likely to be wrong.
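One possible way to turn model uncertainty into a review order is sketched below. It assumes the model exposes a probability per category (the toy model in this tutorial does not) and uses normalized entropy as the uncertainty measure; the probabilities and the helper `uncertainty` are invented for the example. Sending the resulting order to Kili as asset priorities is left out, since that call is not covered in this tutorial.

```python
import math


def uncertainty(probabilities):
    """Normalized Shannon entropy of a class distribution (needs at least two classes)."""
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0)
    return entropy / math.log(len(probabilities))


# Made-up class probabilities, keyed by external id
model_probabilities = {
    "black_car.jpg": [0.40, 0.30, 0.20, 0.10],
    "grey_car.jpg": [0.97, 0.01, 0.01, 0.01],
    "white_car.jpg": [0.25, 0.25, 0.25, 0.25],
}

# Most uncertain assets first: these are the ones a human should look at first
review_order = sorted(
    model_probabilities,
    key=lambda external_id: uncertainty(model_probabilities[external_id]),
    reverse=True,
)
print(review_order)
```

Entropy is only one choice; `1 - max(probabilities)` or the margin between the top two categories would rank assets in a similar spirit.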

In [None]:
assets = kili.assets(project_id=project_id, fields=['id', 'externalId'])
ground_truths = {
    'black_car.jpg': 'BLACK',
    'grey_car.jpg': 'GREY',
    'white_car.jpg': 'WHITE',
    'red_car.jpg': 'RED'
}

for asset in assets:
    human_label = {
        "CLASSIFICATION_JOB": {
            "categories": [{"name": ground_truths[asset["externalId"]]}]
        }
    }
    kili.append_to_labels(
        json_response=human_label,
        label_asset_id=asset['id'],
        label_type='DEFAULT',
    )

You can now fetch the agreement between the human and the model for each human label:

In [None]:
labels = kili.labels(project_id=project_id,
            fields=['inferenceMark', 'id', 'labelOf.id'], type_in=['DEFAULT'])
labels

In [None]:
# For testing

for label in labels:
    external_id = id_to_external_id[label["labelOf"]["id"]]
    if predictions[external_id] == ground_truths[external_id]:
        assert label['inferenceMark'] == 1
    else:
        assert label['inferenceMark'] < 1

This allows you to identify problems:

In [None]:
for label in labels:
    if label['inferenceMark'] < 1:
        # Fetch the model's label and the human's label for the same asset
        inference_label = list(kili.labels(project_id=project_id, asset_id=label['labelOf']['id'], type_in=['INFERENCE'], disable_tqdm=True))[0]
        human_label = list(kili.labels(project_id=project_id, label_id=label['id'], disable_tqdm=True))[0]
        inference_category = inference_label['jsonResponse']['CLASSIFICATION_JOB']['categories'][0]['name']
        human_category = human_label['jsonResponse']['CLASSIFICATION_JOB']['categories'][0]['name']
        print(f'The model predicted {inference_category} but the human chose {human_category}')

You can also find the assets with the most disagreement directly from the interface with the filter "Human/Model IOU". A low IOU indicates low agreement:

![inference](https://storage.googleapis.com/label-public-staging/recipes/inference/inference_filter.png)
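The disagreements can also be summarized offline. The sketch below is independent of the Kili API: it restates the `predictions` and `ground_truths` dictionaries from this section and counts each (human category, predicted category) pair on which the two differ, which is essentially the off-diagonal part of a confusion matrix.

```python
from collections import Counter

# Same values as the dictionaries defined earlier in this section
predictions = {
    "black_car.jpg": "WHITE",
    "grey_car.jpg": "GREY",
    "white_car.jpg": "RED",
    "red_car.jpg": "BLACK",
}
ground_truths = {
    "black_car.jpg": "BLACK",
    "grey_car.jpg": "GREY",
    "white_car.jpg": "WHITE",
    "red_car.jpg": "RED",
}

# Count (human category, predicted category) pairs where the model and the human differ
confusions = Counter(
    (ground_truths[external_id], predictions[external_id])
    for external_id in ground_truths
    if predictions[external_id] != ground_truths[external_id]
)

for (human_category, predicted_category), count in confusions.most_common():
    print(f"{human_category} labeled as {predicted_category} by the model: {count}")
```

On a real project, a pair that dominates this count is a hint that two categories are hard to tell apart or that the model lacks examples of one of them.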

## 2.2 Use case 2

We can invert the previous use case. We start with a human-labeled dataset and insert model predictions, to simulate testing our model on test data.

In [None]:
project_id = kili.create_project(
    title='Project demo inference 2',
    input_type='IMAGE',
    json_interface=json_interface
)['id']

In [None]:
labeled_assets = [
    {
        'url': "https://storage.googleapis.com/label-public-staging/recipes/inference/black_car.jpg",
        'external_id': 'black_car.jpg'
    },
    {
        'url': "https://storage.googleapis.com/label-public-staging/recipes/inference/grey_car.jpg",
        'external_id': 'grey_car.jpg'
    },
    {
        'url': "https://storage.googleapis.com/label-public-staging/recipes/inference/white_car.jpg",
        'external_id': 'white_car.jpg'
    },
    {
        'url': "https://storage.googleapis.com/label-public-staging/recipes/inference/red_car.jpg",
        'external_id': 'red_car.jpg'
    }
]
ground_truths = {
    'black_car.jpg': 'BLACK',
    'grey_car.jpg': 'GREY',
    'white_car.jpg': 'WHITE',
    'red_car.jpg': 'RED'
}

In [None]:
id_to_external_id = {}
for asset in labeled_assets:
    kili.append_many_to_dataset(
        project_id=project_id,
        content_array=[asset['url']],
        external_id_array=[asset['external_id']]
    )
    # Look the asset up by external id rather than by position, which depends on the order assets are returned in
    asset_id = list(kili.assets(project_id=project_id, external_id_contains=[asset['external_id']], fields=['id'], disable_tqdm=True))[0]['id']
    id_to_external_id[asset_id] = asset['external_id']
    human_label = {
        "CLASSIFICATION_JOB": {
            "categories": [{"name": ground_truths[asset['external_id']]}]
        }
    }
    kili.append_to_labels(
        json_response=human_label,
        label_asset_id=asset_id,
        label_type='DEFAULT',
    )

Suppose our model is then fit on, say, 80% of the labeled data. We can run it against the remaining 20% and upload its predictions to Kili:

In [None]:
assets = kili.assets(project_id=project_id, fields=['id', "externalId"])

for asset in assets:
    test_label = {
        "CLASSIFICATION_JOB": {
            "categories": [{"name": predictions[asset["externalId"]]}]
        }
    }
    kili.append_to_labels(
        json_response=test_label,
        label_asset_id=asset['id'],
        label_type='INFERENCE',
    )
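The 80/20 split mentioned above is not performed in this tutorial, where predictions are uploaded for every asset. As a minimal sketch of how it could be done on external ids, assuming a simple random split is acceptable (the helper `split_external_ids` and the file names are hypothetical):

```python
import random


def split_external_ids(external_ids, test_fraction=0.2, seed=0):
    """Hypothetical helper: reproducible random split into (train_ids, test_ids)."""
    shuffled = sorted(external_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up external ids
external_ids = [f"car_{index}.jpg" for index in range(10)]
train_ids, test_ids = split_external_ids(external_ids)
print(len(train_ids), len(test_ids))
```

Only the assets in `test_ids` would then receive an inference label, so that the inference scores measure performance on data the model was not trained on.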

In [None]:
labels = kili.labels(project_id=project_id,
            fields=['inferenceMark', 'id', 'labelOf.id'], type_in=['DEFAULT'])
labels

In [None]:
# For testing
for label in labels:
    external_id = id_to_external_id[label["labelOf"]["id"]]
    if predictions[external_id] == ground_truths[external_id]:
        assert label['inferenceMark'] == 1
    else:
        assert label['inferenceMark'] < 1

In [None]:
for label in labels:
    if label['inferenceMark'] < 1:
        inference_label = list(kili.labels(project_id=project_id, asset_id=label['labelOf']['id'], type_in=['INFERENCE'], disable_tqdm=True))[0]
        human_label = list(kili.labels(project_id=project_id, label_id=label['id'], disable_tqdm=True))[0]
        inference_category = inference_label['jsonResponse']['CLASSIFICATION_JOB']['categories'][0]['name']
        human_category = human_label['jsonResponse']['CLASSIFICATION_JOB']['categories'][0]['name']
        print(f'The human predicted {human_category} but the model predicted {inference_category}')

You can also find, directly from the interface, the assets where the prediction and the human label disagree most, using the filter "Human/Model IOU":

![inference](https://storage.googleapis.com/label-public-staging/recipes/inference/inference_test_filter.png)
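To condense the per-label scores into a single test score, one simple option is to average the `inferenceMark` values of the human labels. The sketch below does this on made-up records; the helper `mean_inference_mark` is hypothetical, and whether a plain mean is the right metric depends on your task.

```python
from statistics import mean


def mean_inference_mark(labels):
    """Hypothetical helper: average inferenceMark over labels that have one."""
    marks = [label["inferenceMark"] for label in labels if label.get("inferenceMark") is not None]
    return mean(marks) if marks else None


# Made-up records shaped like the dictionaries fetched above
example_labels = [
    {"id": "label-1", "inferenceMark": 1.0},
    {"id": "label-2", "inferenceMark": 0.0},
    {"id": "label-3", "inferenceMark": 0.0},
    {"id": "label-4", "inferenceMark": 0.0},
]

print(mean_inference_mark(example_labels))
```

Tracking this number across model iterations gives a quick view of whether changes to the training data are paying off.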

In this tutorial, we introduced the concept of Kili inference labels and showed how to make use of them in two practical use cases.

You can also visit the Kili website or Kili documentation for more info!