<a href="https://colab.research.google.com/github/PierreLeveau/automl/blob/main/notebooks/image_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification Using AutoML

In this notebook, we use AutoML to train a text classification model and pre-annotate a dataset on the [Kili Platform](https://cloud.kili-technology.com/label/).

## Install

We first follow the install procedure explained in the [README.md](https://github.com/kili-technology/automl/blob/main/README.md). 

In [None]:
!git clone https://github.com/kili-technology/automl.git

In [None]:
%cd automl

Install the packages. This should take less than a minute. 

In [None]:
%%capture
!git submodule update --init
!pip install -r requirements.txt -r kiliautoml/utils/ultralytics/yolov5/requirements.txt
!pip install -e .
!pip install kili

## Imports

In [None]:
import os
from getpass import getpass
from tqdm.autonotebook import tqdm

from kili.client import Kili

from datasets import load_dataset

Set up `PYTHONPATH` so that kiliautoml can be found.

In [None]:
KILI_URL="https://cloud.kili-technology.com/"
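# Append to PYTHONPATH without raising a KeyError when the variable is not set yet
os.environ["PYTHONPATH"] = os.environ.get("PYTHONPATH", "") + ":/content/automl/"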

After getting your API key from the Kili platform, you can set up your environment variables.

In [None]:
api_key = getpass("Add your API Key here: ")
# Build the URL with string operations: os.path.join is meant for file paths, not URLs
api_endpoint = KILI_URL.rstrip("/") + "/api/label/v2/graphql"  # If you are not using Kili SaaS, change the endpoint to your configuration
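
If you run your own Kili deployment, the endpoint can be derived from its base URL. The snippet below is a minimal sketch: `build_endpoint` is a hypothetical helper (not part of the Kili client), `https://kili.example.com` is a placeholder, and it assumes your installation exposes the same `api/label/v2/graphql` path as the SaaS, which you should confirm for your setup.

```python
def build_endpoint(base_url, path="api/label/v2/graphql"):
    """Join a base URL and an API path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


# Works whether or not the base URL ends with a slash.
print(build_endpoint("https://cloud.kili-technology.com/"))
print(build_endpoint("https://kili.example.com"))  # placeholder host
```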

## Set up a mock Kili project

Set up the Kili connection.

In [None]:
kili = Kili(api_key=api_key, api_endpoint=api_endpoint)

### Create the project

In [None]:
json_interface = {
    "jobRendererWidth": 0.2,
    "jobs": {
        "CLASSIFICATION_JOB": {
            "mlTask": "CLASSIFICATION",
            "content": {
                "categories": {
                    "POSITIVE": {
                        "name": "positive"
                    },
                    "NEGATIVE": {
                        "name": "negative"
                    }
                },
                "input": "radio"
            },
            "required": 0,
            "isChild": False,
            "instruction": "Sentiment Class"
        }
    }
}
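
The keys under `categories` (`POSITIVE`, `NEGATIVE`) are the names that labels refer to later in the notebook. As a rough sketch, they can be read back from the interface instead of being retyped. `category_keys` is a hypothetical helper, and the dictionary is a trimmed copy of the interface so the snippet runs on its own.

```python
# Trimmed copy of the interface, repeated so the snippet is self-contained.
interface = {
    "jobs": {
        "CLASSIFICATION_JOB": {
            "mlTask": "CLASSIFICATION",
            "content": {
                "categories": {
                    "POSITIVE": {"name": "positive"},
                    "NEGATIVE": {"name": "negative"},
                },
                "input": "radio",
            },
        }
    }
}


def category_keys(interface, job_name):
    """Return the sorted category keys declared for one job."""
    categories = interface["jobs"][job_name]["content"]["categories"]
    return sorted(categories)


print(category_keys(interface, "CLASSIFICATION_JOB"))  # ['NEGATIVE', 'POSITIVE']
```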

In [None]:
project_id = kili.create_project(
        title="Sentiment Analysis IMDB",
        description="Classify sentiment in IMDB Reviews",
        input_type="TEXT",
        json_interface=json_interface
)["id"]

### Add assets

In [None]:
from datasets import load_dataset

training_dataset = load_dataset("imdb", split="train")
unlabeled_dataset = load_dataset("imdb", split="unsupervised")

Shuffle and downsample the datasets.

In [None]:
NUMBER_OF_SAMPLES = 100

In [None]:
shuffled_training_dataset = training_dataset.shuffle(seed=42)
shuffled_unlabeled_dataset = unlabeled_dataset.shuffle(seed=42)

downsampled_training_dataset = shuffled_training_dataset[:NUMBER_OF_SAMPLES]
downsampled_unlabeled_dataset = shuffled_unlabeled_dataset[:NUMBER_OF_SAMPLES]
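
With only 100 training samples, it can be worth checking that both classes are reasonably represented after downsampling. The sketch below uses a hypothetical stand-in list of labels so that it runs on its own. In the notebook you would pass `downsampled_training_dataset["label"]` instead. `class_balance` is an illustrative helper, not part of `datasets` or kiliautoml.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples for each label value."""
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Stand-in labels for illustration only (0 = negative, 1 = positive).
example_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
print(class_balance(example_labels))  # {0: 0.4, 1: 0.6}
```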

In [None]:
training_assets = [
    {
        "externalId": f"review {i}",
        "content":  review,
    }
    for i, review in enumerate(downsampled_training_dataset["text"])
]
unlabeled_assets = [
    {
        "externalId": f"review {i + NUMBER_OF_SAMPLES}",
        "content": review,
    }
    for i, review in enumerate(downsampled_unlabeled_dataset["text"])
]
assets_to_import = training_assets + unlabeled_assets

print("Number of Training assets: ", len(training_assets))
print("Number of unlabeled assets: ", len(unlabeled_assets))
print("Total Number of assets: ", len(assets_to_import))

Now we send the data to our Kili project.

In [None]:
CHUNK_SIZE = 100

def chunks(items, size):
    """Yield successive chunks of at most `size` elements from `items`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

for asset_chunk in tqdm(list(chunks(assets_to_import, CHUNK_SIZE))):
    external_id_array = [a.get("externalId") for a in asset_chunk]
    content_array = [a.get("content") for a in asset_chunk]
    kili.append_many_to_dataset(project_id=project_id, 
                                content_array=content_array,
                                external_id_array=external_id_array)
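
To see how the chunking behaves, here is a small standalone sketch of the same helper applied to toy data. With the 200 assets built above and a chunk size of 100, the loop sends two batches.

```python
def chunks(items, size):
    """Yield successive chunks of at most `size` elements from `items`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# The last chunk is shorter when the length is not a multiple of the size.
batches = list(chunks(list(range(7)), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]

# 200 items in chunks of 100 give two batches.
print(len(list(chunks(list(range(200)), 100))))  # 2
```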

### Add labels to assets

We add labels to half of the data to simulate a project where only part of the dataset has been labeled and we want to predict labels for the rest.

In [None]:
sentiments = ["NEGATIVE", "POSITIVE"]

In [None]:
asset_ids = kili.assets(project_id=project_id, fields=["id", "externalId"])[:NUMBER_OF_SAMPLES]

for label, asset_id in tqdm(zip(downsampled_training_dataset["label"], asset_ids), total=len(asset_ids)):
    kili.append_to_labels(label_asset_id=asset_id["id"],
                          json_response={
                              "CLASSIFICATION_JOB": {
                                  "categories": [{"name": sentiments[label]}]
                               }
                          })
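
The loop above relies on the IMDB integer labels (0 for negative, 1 for positive) indexing into the `sentiments` list. The sketch below isolates that mapping so the resulting payload can be inspected without calling the API. `to_json_response` is a hypothetical helper that only mirrors the dictionary structure used above.

```python
SENTIMENTS = ["NEGATIVE", "POSITIVE"]  # list index = IMDB label (0 or 1)


def to_json_response(label, job_name="CLASSIFICATION_JOB"):
    """Build the classification payload for one integer label."""
    return {job_name: {"categories": [{"name": SENTIMENTS[label]}]}}


print(to_json_response(0))
print(to_json_response(1))
```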

Run the following cell to print a link to the assets in your project:

In [None]:
print(f"{KILI_URL}label/projects/{project_id}/menu/queue?currentPage=1&pageSize=20")

## Training a text classifier with kiliautoml

The following command will automatically download the labeled data in your Kili project. Then, it will choose the right model for your task, train it with this data and save it locally. You can visualize the training evolution on [Weights and Biases](https://wandb.ai/).

In [None]:
!kiliautoml train \
    --api-key {api_key} \
    --project-id {project_id} \
    --epochs 30
    

### Send predictions

Now we can use our locally trained model to predict the classes of our text assets and send the prediction scores to the project on Kili. These pre-annotations can then be validated or corrected by annotators.

In [None]:
!kiliautoml predict \
    --api-key {api_key} \
    --project-id {project_id}

Now you can check that your assets have predictions on [Kili](https://cloud.kili-technology.com/)!

In [None]:
print(f"{KILI_URL}label/projects/{project_id}/menu/queue?currentPage=1&pageSize=20")