<a href="https://colab.research.google.com/github/kili-technology/kili-python-sdk/blob/2.45.0/recipes/counterfactual_data_augmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kili Tutorial: How to leverage counterfactually augmented data to build a more robust model

This recipe is inspired by the paper *Learning the Difference that Makes a Difference with Counterfactually-Augmented Data*, which you can find on [arXiv](https://arxiv.org/abs/1909.12434).

In this study, the authors point out that machine learning models struggle to generalize the classification rules they learn, because their decision rules often rely on 'spurious patterns' that miss the key elements that most affect the class of a text. To remove these confounding factors, they revise each asset so that its label changes while as few words as possible are modified, which makes those **key words** much easier for the model to spot.
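
To make the idea of a minimal edit concrete, here is a small sketch that lists the words that differ between an original text and its counterfactual revision. The two sentences and the `changed_words` helper are made up for illustration and do not come from the paper's data; only Python's standard `difflib` module is used.

```python
import difflib


def changed_words(original, revised):
    """Return (old, new) word spans that differ between two texts."""
    a, b = original.split(), revised.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Hypothetical review and its label-flipping revision.
original = "The plot was gripping and the acting was superb"
revised = "The plot was dull and the acting was terrible"
print(changed_words(original, revised))
# [('gripping', 'dull'), ('superb', 'terrible')]
```

Only two words change, yet the sentiment flips: these are the words we would like the model to rely on.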

In this tutorial, we'll:
1. Create projects in Kili for both the IMDB and SNLI datasets to reproduce this data-augmentation task, with the aim of improving our model and decreasing its variance on unseen production data.
2. Try to reproduce the results of the paper with similar models, to show how valuable this technique can be for a text-classification task.

We'll use the [publicly available data of the study](https://github.com/acmi-lab/counterfactually-augmented-data) for both IMDB and Stanford NLI.

For an overview of Kili, visit our [website](https://kili-technology.com) or check out the Kili [documentation](https://docs.kili-technology.com).
For a more hands-on experience, you can run some of the other recipes.


![data augmentation](https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/data_collection_pipeline.png)

In [None]:
# Authentication
!pip install kili

import os

from kili.client import Kili

# Don't forget to set your 'KILI_API_KEY' environment variable with your API Key: os.environ['KILI_API_KEY'] = "<YOUR_API_KEY>"
# If you use Kili on-premise, be sure to set your 'KILI_API_ENDPOINT' environment variable.
api_endpoint = "https://cloud.kili-technology.com/api/label/v2/graphql"

kili = Kili()
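
If you want a readable error when the key is missing, a small check like the one below can run before the client is created. This is only a sketch: `require_api_key` is a hypothetical helper written for this recipe, not part of the Kili SDK.

```python
import os


def require_api_key(env=os.environ, name="KILI_API_KEY"):
    """Return the API key stored in `env`, or raise a readable error."""
    api_key = env.get(name, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {name} environment variable before creating the client.")
    return api_key
```

Calling `require_api_key()` just before `Kili()` makes a missing or empty key fail early with an explicit message.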

## Data Augmentation on IMDB dataset

The data consists of film reviews that are classified as positive or negative. State-of-the-art model performance is often measured against this reference dataset.

In Kili, we'll use two different projects, one for each task:
- Negative to Positive
- Positive to Negative

### Creating the projects

In [None]:
taskname = "NEW_REVIEW"
project_imdb_negative_to_positive = {
    "title": "Counterfactual data-augmentation - Negative to Positive",
    "description": "IMDB Sentiment Analysis",
    "input_type": "TEXT",
    "json_interface": {
        "filetype": "TEXT",
        "jobs": {
            taskname: {
                "mlTask": "TRANSCRIPTION",
                "content": {"input": None},
                "required": 1,
                "isChild": False,
                "instruction": "Write here the new review modified to be POSITIVE. Please refer to the instructions above before starting",
            }
        },
    },
}
project_imdb_positive_to_negative = {
    "title": "Counterfactual data-augmentation - Positive to Negative",
    "description": "IMDB Sentiment Analysis",
    "input_type": "TEXT",
    "json_interface": {
        "jobs": {
            taskname: {
                "mlTask": "TRANSCRIPTION",
                "content": {"input": None},
                "required": 1,
                "isChild": False,
                "instruction": "Write here the new review modified to be NEGATIVE. Please refer to the instructions above before starting",
            }
        }
    },
}

In [None]:
for project_imdb in [project_imdb_positive_to_negative, project_imdb_negative_to_positive]:
    project_imdb["id"] = kili.create_project(
        title=project_imdb["title"],
        description=project_imdb["description"],
        input_type=project_imdb["input_type"],
        json_interface=project_imdb["json_interface"],
    )["id"]

Now, let's add instructions to our newly created projects:

In [None]:
for project_imdb in [project_imdb_positive_to_negative, project_imdb_negative_to_positive]:
    kili.update_properties_in_project(
        project_id=project_imdb["id"],
        instructions="https://docs.google.com/document/d/1zhNaQrncBKc3aPKcnNa_mNpXlria28Ij7bfgUvJbyfw/edit?usp=sharing",
    )

Now, we'll create some useful functions, for improved readability:

In [None]:
def create_assets(dataframe, intro, objective, instructions, truth_label, target_label):
    return (
        intro
        + dataframe[truth_label]
        + objective
        + dataframe[target_label]
        + instructions
        + dataframe["Text"]
    ).tolist()


def create_json_responses(taskname, df, field="Text"):
    return [{taskname: {"text": df[field].iloc[k]}} for k in range(df.shape[0])]
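
To see the shape of the payload that `create_json_responses` builds, here is a sketch on a two-row toy DataFrame. The review texts are made up; the helper is repeated so the snippet runs on its own.

```python
import pandas as pd


def create_json_responses(taskname, df, field="Text"):
    # Same helper as above, repeated so this snippet is self-contained.
    return [{taskname: {"text": df[field].iloc[k]}} for k in range(df.shape[0])]


# Hypothetical reviews, for illustration only.
toy_df = pd.DataFrame({"Text": ["A wonderful film.", "A dreadful film."]})
responses = create_json_responses("NEW_REVIEW", toy_df)
print(responses[0])
# {'NEW_REVIEW': {'text': 'A wonderful film.'}}
```

Each row becomes one dictionary keyed by the job name, which is the structure later passed as `json_response_array`.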

### Importing the data into Kili

In [None]:
import pandas as pd

datasets = ["dev", "train", "test"]

for dataset in datasets:
    url = f"https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/sentiment/combined/paired/{dataset}_paired.tsv"
    df = pd.read_csv(url, on_bad_lines="skip", sep="\t")
    df = df[df.index % 2 == 0]  # keep only the original reviews as assets

    for review_type, project_imdb in zip(
        ["Positive", "Negative"],
        [project_imdb_positive_to_negative, project_imdb_negative_to_positive],
    ):
        dataframe = df[df["Sentiment"] == review_type]
        reviews_to_import = dataframe["Text"].tolist()
        external_id_array = (
            "IMDB " + review_type + " review " + dataset + dataframe["batch_id"].astype("str")
        ).tolist()

        kili.append_many_to_dataset(
            project_id=project_imdb["id"],
            content_array=reviews_to_import,
            external_id_array=external_id_array,
        )

### Importing the labels into Kili 
We will use the results of the study as if they were predictions. In a real annotation project, we could also pre-fill each transcription with the original review content, so the labeler would only have to enter the changes.

In [None]:
model_name = "results-arxiv:1909.12434"

for dataset in datasets:
    url = f"https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/sentiment/combined/paired/{dataset}_paired.tsv"
    df = pd.read_csv(url, on_bad_lines="skip", sep="\t")
    df = df[df.index % 2 == 1]  # keep only the modified reviews as predictions

    for review_type, project_imdb in zip(
        ["Positive", "Negative"],
        [project_imdb_positive_to_negative, project_imdb_negative_to_positive],
    ):
        dataframe = df[df["Sentiment"] != review_type]

        external_id_array = (
            "IMDB " + review_type + " review " + dataset + dataframe["batch_id"].astype("str")
        ).tolist()
        json_response_array = create_json_responses(taskname, dataframe)

        kili.create_predictions(
            project_id=project_imdb["id"],
            external_id_array=external_id_array,
            json_response_array=json_response_array,
            model_name=model_name,
        )
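
The two cells above only work if each prediction gets the same external ID as the asset it revises. The sketch below checks this on a toy frame, assuming (as the cells above do) that rows alternate between an original review and its revision and that both share a `batch_id`. The data and the `external_ids` helper are made up for illustration.

```python
import pandas as pd

# Toy stand-in for one paired TSV: even rows are originals, odd rows are revisions.
paired = pd.DataFrame(
    {
        "Sentiment": ["Positive", "Negative", "Negative", "Positive"],
        "Text": ["Loved it.", "Hated it.", "Boring plot.", "Gripping plot."],
        "batch_id": [1, 1, 2, 2],
    }
)
originals = paired[paired.index % 2 == 0]
revised = paired[paired.index % 2 == 1]


def external_ids(frame, review_type, dataset="dev"):
    # Same ID pattern as in the import cells above.
    return ("IMDB " + review_type + " review " + dataset + frame["batch_id"].astype("str")).tolist()


matches = {}
for review_type in ["Positive", "Negative"]:
    asset_ids = external_ids(originals[originals["Sentiment"] == review_type], review_type)
    # A revision carries the opposite sentiment of the review it was written from.
    prediction_ids = external_ids(revised[revised["Sentiment"] != review_type], review_type)
    matches[review_type] = asset_ids == prediction_ids
    print(review_type, asset_ids, prediction_ids)
```

A check like this on the real data, before calling `create_predictions`, would surface any row that was skipped by `on_bad_lines="skip"` and left a review without its pair.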

This is how our interface looks in the end. This will allow us to quickly perform the task at hand.

![IMDB](./img/imdb_review_new.png)

## Data Augmentation on SNLI dataset

SNLI is a 3-class dataset. Given two sentences (a **premise** and a **hypothesis**), the machine learning model's task is to find the correct relation between them. The relation can be either "entailment", "contradiction" or "neutral".

Here is an example of a premise, and three sentences that could be the hypothesis for the three categories:
![examples](https://licor.me/post/img/robust-nlu/SNLI_annotation.png)
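
As a rough sketch of the data structure involved, an NLI example can be represented as a premise, a hypothesis and one of the three labels. The `NLIExample` class and the sentences below are made up for illustration; they are not taken from SNLI.

```python
from dataclasses import dataclass

LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class NLIExample:
    premise: str
    hypothesis: str
    gold_label: str

    def __post_init__(self):
        # Reject anything outside the three SNLI classes.
        if self.gold_label not in LABELS:
            raise ValueError(f"Unknown label: {self.gold_label!r}")


# Hypothetical premise with one hypothesis per class.
premise = "A man is playing a guitar on stage."
examples = [
    NLIExample(premise, "A person is playing an instrument.", "entailment"),
    NLIExample(premise, "A man is asleep in bed.", "contradiction"),
    NLIExample(premise, "A man is performing for a large crowd.", "neutral"),
]
for example in examples:
    print(f"{example.gold_label}: {example.hypothesis}")
```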

This time we'll keep it as a single project.

### Creating the project

In [None]:
taskname = "SENTENCE_MODIFIED"
project_snli = {
    "title": "Counterfactual data-augmentation NLI",
    "description": "Stanford Natural Language Inference",
    "input_type": "TEXT",
    "json_interface": {
        "jobs": {
            taskname: {
                "mlTask": "TRANSCRIPTION",
                "content": {"input": None},
                "required": 1,
                "isChild": False,
                "instruction": "Write here the modified sentence. Please refer to the instructions above before starting",
            }
        }
    },
}

In [None]:
project_snli["id"] = kili.create_project(
    title=project_snli["title"],
    description=project_snli["description"],
    input_type=project_snli["input_type"],
    json_interface=project_snli["json_interface"],
)["id"]
print(f'Created project {project_snli["id"]}')

Again, we'll factor our code a little, to merge the datasets and properly distinguish which sentence was modified:

In [None]:
def merge_datasets(dataset, sentence_modified):
    url_original = f"https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/NLI/original/{dataset}.tsv"
    url_revised = f"https://raw.githubusercontent.com/acmi-lab/counterfactually-augmented-data/master/NLI/revised_{sentence_modified}/{dataset}.tsv"
    df_original = pd.read_csv(url_original, on_bad_lines="skip", sep="\t")
    # drop exact duplicate rows, keeping the first occurrence (and its original index)
    df_original = df_original.drop_duplicates(keep="first")
    df_original["id"] = df_original.index.astype(str)

    df_revised = pd.read_csv(url_revised, on_bad_lines="skip", sep="\t")
    axis_merge = "sentence2" if sentence_modified == "premise" else "sentence1"
    # keep only one label per set of sentences
    df_revised = df_revised.drop_duplicates(subset=[axis_merge, "gold_label"], keep="first")

    df_merged = df_original.merge(df_revised, how="inner", on=axis_merge)

    if sentence_modified == "premise":
        df_merged["Text"] = df_merged["sentence1_x"] + "\nSENTENCE 2:\n" + df_merged["sentence2"]
        changed_sentence = "FIRST"
    else:
        df_merged["Text"] = df_merged["sentence1"] + "\nSENTENCE 2:\n" + df_merged["sentence2_x"]
        changed_sentence = "SECOND"
    # implicit string concatenation keeps source indentation out of the labeler's prompt
    instructions = (
        f" relation, by making a small number of changes in the {changed_sentence} SENTENCE."
        "\nMake sure that the document remains coherent and the new label accurately "
        "describes the revised passage:\n\n\n\nSENTENCE 1:\n"
    )
    return (df_merged, instructions)


def create_external_ids(dataset, dataframe, sentence_modified):
    return (
        "NLI "
        + dataset
        + " "
        + dataframe["gold_label_x"]
        + " to "
        + dataframe["gold_label_y"]
        + " "
        + sentence_modified
        + " modified "
        + dataframe["id"]
    ).tolist()
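
The `_x` and `_y` column names used in these helpers come from pandas: when two frames are merged and share column names other than the join key, the left frame's columns get the default `_x` suffix and the right frame's get `_y`. Here is a sketch on toy frames, with made-up sentences, for the case where the premise is modified and the join key is therefore `sentence2`.

```python
import pandas as pd

# Hypothetical original pair and two revisions of its premise.
original = pd.DataFrame(
    {
        "sentence1": ["A man is playing a guitar."],
        "sentence2": ["A person is making music."],
        "gold_label": ["entailment"],
    }
)
revised = pd.DataFrame(
    {
        "sentence1": ["A man is holding a guitar.", "A man is sitting in silence."],
        "sentence2": ["A person is making music.", "A person is making music."],
        "gold_label": ["neutral", "contradiction"],
    }
)

# The unchanged sentence is the join key; the other columns are suffixed.
merged = original.merge(revised, how="inner", on="sentence2")
print(sorted(merged.columns))
print(merged[["gold_label_x", "gold_label_y"]].values.tolist())
```

One original pair yields one merged row per revision, with `gold_label_x` holding the original label and `gold_label_y` the target label.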

### Importing the data into Kili
Before each set of sentences, we'll add extra information for the labeler:

In [None]:
datasets = ["dev", "train", "test"]
sentences_modified = ["premise", "hypothesis"]
intro = "The relation of these two sentences is classified as "
objective = " to convert to a "

for dataset in datasets:
    for sentence_modified in sentences_modified:
        df, instructions = merge_datasets(dataset, sentence_modified)

        sentences_to_import = create_assets(
            df, intro, objective, instructions, "gold_label_x", "gold_label_y"
        )
        external_id_array = create_external_ids(dataset, df, sentence_modified)

        kili.append_many_to_dataset(
            project_id=project_snli["id"],
            content_array=sentences_to_import,
            external_id_array=external_id_array,
        )
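
To preview what a labeler will read, the sketch below assembles one asset from a single toy row. The sentences are made up and the instruction string is shortened; `create_assets` is repeated so the snippet runs on its own.

```python
import pandas as pd


def create_assets(dataframe, intro, objective, instructions, truth_label, target_label):
    # Same helper as above, repeated so this snippet is self-contained.
    return (
        intro
        + dataframe[truth_label]
        + objective
        + dataframe[target_label]
        + instructions
        + dataframe["Text"]
    ).tolist()


# Hypothetical merged row, for illustration only.
toy = pd.DataFrame(
    {
        "gold_label_x": ["entailment"],
        "gold_label_y": ["neutral"],
        "Text": ["A man is playing a guitar.\nSENTENCE 2:\nA person is making music."],
    }
)
asset = create_assets(
    toy,
    "The relation of these two sentences is classified as ",
    " to convert to a ",
    " relation, by making a small number of changes in the FIRST SENTENCE.\nSENTENCE 1:\n",
    "gold_label_x",
    "gold_label_y",
)[0]
print(asset)
```

The printed asset starts with the current label and the target label, followed by the instruction and then the two sentences.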

### Importing the labels into Kili 
We will use the results of the study, as if they were predictions.

In [None]:
model_name = "results-arxiv:1909.12434"

for dataset in datasets:
    for sentence_modified in sentences_modified:
        axis_changed = "sentence1_y" if sentence_modified == "premise" else "sentence2_y"
        df, instructions = merge_datasets(dataset, sentence_modified)

        external_id_array = create_external_ids(dataset, df, sentence_modified)
        json_response_array = create_json_responses(taskname, df, axis_changed)

        kili.create_predictions(
            project_id=project_snli["id"],
            external_id_array=external_id_array,
            json_response_array=json_response_array,
            model_name=model_name,
        )

![NLI](./img/snli_ex1_new.png)
![NLI](./img/snli_ex2_new.png)

## Cleanup

In [None]:
# for project in [project_imdb_positive_to_negative, project_imdb_negative_to_positive, project_snli]:
#     kili.delete_project(project['id'])

## Conclusion
In this tutorial, we learned how adding clear instructions in Kili's easy-to-use interface can help with a data augmentation task.

In the study, the quality of labeling was a very important factor. Luckily, with Kili, you can easily monitor quality. You could set up **consensus** on a portion of or all of the annotations, or even keep a part of the dataset as ground truth (**honeypot**) to measure the performance of every labeler.

For an overview of Kili, visit our [website](https://kili-technology.com) or check out the Kili [documentation](https://docs.kili-technology.com).
For a more hands-on experience, you can run some of the other recipes.