<a href="https://colab.research.google.com/github/kili-technology/automl/blob/main/notebooks/named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition Using AutoML

In this notebook, we create a named entity recognition (NER) model with AutoML and use it to pre-annotate a dataset on the [Kili Platform](https://cloud.kili-technology.com/label/).

## Install

We first follow the install procedure explained in the [README.md](https://github.com/kili-technology/automl/blob/main/README.md). 

In [None]:
!git clone https://github.com/kili-technology/automl.git

In [None]:
%cd automl

Install the packages. This should take less than a minute. 

In [None]:
%%capture
!git submodule update --init
!pip install torch
!pip install -e .

## Imports

In [None]:
from itertools import cycle
import os
from getpass import getpass
from tqdm.autonotebook import tqdm

from kili.client import Kili
from datasets import load_dataset

Set up the Python path to use kiliautoml.

In [None]:
KILI_URL="https://cloud.kili-technology.com/"
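# Append the repository to PYTHONPATH; .get() avoids a KeyError when the variable is unset.
os.environ["PYTHONPATH"] = os.pathsep.join(
    filter(None, [os.environ.get("PYTHONPATH"), "/content/automl/"])
)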

After getting your API key from the Kili platform, set the API key and the API endpoint.

In [None]:
api_key = getpass("Add your API Key here: ")
# Build the endpoint as a URL (os.path.join is meant for filesystem paths).
# If you are not using Kili SaaS, change the endpoint to match your configuration.
api_endpoint = f"{KILI_URL}api/label/v2/graphql"
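
The endpoint is simply the base URL followed by the GraphQL path. As a minimal sketch, the same composition can be done with the standard library's `urllib.parse.urljoin`. The helper name `build_endpoint` and the constant `GRAPHQL_PATH` are hypothetical and only serve the illustration; the second base URL is a made-up on-premise address.

```python
from urllib.parse import urljoin

# Base URL and GraphQL path as used in this notebook.
KILI_URL = "https://cloud.kili-technology.com/"
GRAPHQL_PATH = "api/label/v2/graphql"  # hypothetical constant name


def build_endpoint(base_url: str, path: str = GRAPHQL_PATH) -> str:
    """Join a base URL and a relative path into a full endpoint URL."""
    # urljoin replaces the last path segment unless the base ends with "/",
    # so add the trailing slash when it is missing.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, path)


print(build_endpoint(KILI_URL))
print(build_endpoint("https://example.com/kili"))  # made-up self-hosted base URL
```

Normalizing the trailing slash first means the helper behaves the same whether or not the configured base URL ends with `/`.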

## Set up a mock Kili project

Set up the Kili connection.

In [None]:
kili = Kili(api_key=api_key, api_endpoint=api_endpoint)

### Create the project

In [None]:
COLORS = [
    "#1f77b4",
    "#ff7f0e",
    "#2ca02c",
    "#d62728",
]

ENTITY_TYPES = [
    ("PERSON", "Person"),
    ("ORGANIZATION", "Organization"),
    ("LOCATION", "Location"),
    ("MISCELLANEOUS", "Miscellaneous")
]

ENTITY_TYPES_WITH_COLORS = [(n[0], n[1], c)
                            for n, c in zip(ENTITY_TYPES, cycle(COLORS))]

json_interface = {
    "jobs": {
        "NAMED_ENTITIES_RECOGNITION_JOB": {
            "mlTask": "NAMED_ENTITIES_RECOGNITION",
            "content": {
                "categories": {
                    name: {"name": name_pretty,
                           "children": [], "color": color}
                    for name, name_pretty, color in ENTITY_TYPES_WITH_COLORS
                },
                "input": "radio",
            },
            "instruction": "",
            "required": 1,
            "isChild": False,
        }
    },
}
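
To see what the `categories` comprehension produces, here is a small standalone sketch. It deliberately uses fewer colors than entity types to show that `cycle` lets `zip` reuse the palette instead of truncating the list. The name `example_categories` is hypothetical and is not used elsewhere in the notebook.

```python
from itertools import cycle

# Two colors for three entity types: cycle() repeats the palette as needed.
EXAMPLE_COLORS = ["#1f77b4", "#ff7f0e"]
EXAMPLE_ENTITY_TYPES = [
    ("PERSON", "Person"),
    ("ORGANIZATION", "Organization"),
    ("LOCATION", "Location"),
]

# Same shape as the "categories" entry of the JSON interface above.
example_categories = {
    name: {"name": pretty_name, "children": [], "color": color}
    for (name, pretty_name), color in zip(EXAMPLE_ENTITY_TYPES, cycle(EXAMPLE_COLORS))
}

for name, category in example_categories.items():
    print(name, category)
```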

In [None]:
project_id = kili.create_project(
        title="CoNLL Named Entity Recognition",
        description="Find named entities in CoNLL 2003 \n For more details see https://www.clips.uantwerpen.be/conll2003/ner/ and https://www.aclweb.org/anthology/W03-0419",
        input_type="TEXT",
        json_interface=json_interface
)["id"]

### Add assets

In [None]:
def load_conll(split):
    conll_dataset = load_dataset("conll2003", split=split)
    formatted_dataset = []
    for elem in tqdm(conll_dataset):
        formatted_dataset.append(
            {"id": int(elem["id"]), "tokens": elem["tokens"], "tags": elem["ner_tags"]}
        )
    return formatted_dataset

In [None]:
training_dataset = load_dataset("conll2003", split="train")
test_dataset = load_dataset("conll2003", split="test")

Shuffle and downsample the datasets.

In [None]:
NUMBER_OF_SAMPLES = 100

In [None]:
shuffled_training_dataset = training_dataset.shuffle(seed=42)
shuffled_test_dataset = test_dataset.shuffle(seed=42)

downsampled_training_dataset = shuffled_training_dataset[:NUMBER_OF_SAMPLES]
downsampled_test_dataset = shuffled_test_dataset[:NUMBER_OF_SAMPLES]

In [None]:
training_assets = [
    {
        "externalId": f"text {i}",
        "content":  " ".join(tokens),
    }
    for i, tokens in enumerate(downsampled_training_dataset["tokens"])
]
test_assets = [
    {
        "externalId": f"text {i + NUMBER_OF_SAMPLES}",
        "content": " ".join(tokens),
    }
    for i, tokens in enumerate(downsampled_test_dataset["tokens"])
]
assets_to_import = training_assets + test_assets

print("Number of Training assets: ", len(training_assets))
print("Number of unlabeled assets: ", len(test_assets))
print("Total Number of assets: ", len(assets_to_import))
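
Slicing a `datasets.Dataset` returns a dict of columns, which is why the cell above iterates over `["tokens"]`. The sketch below reproduces the asset construction on plain-dict stand-ins (no download needed) and checks that offsetting the test indices keeps the external IDs unique. The helper `build_assets` and the `toy_*` names are hypothetical.

```python
def build_assets(token_lists, start_index=0):
    """Hypothetical helper: turn token lists into Kili-style asset dicts."""
    return [
        {"externalId": f"text {i}", "content": " ".join(tokens)}
        for i, tokens in enumerate(token_lists, start=start_index)
    ]


# Plain-dict stand-ins for the column dicts returned by slicing a Dataset.
toy_train_batch = {"tokens": [["EU", "rejects", "German", "call"], ["Peter", "Blackburn"]]}
toy_test_batch = {"tokens": [["Japan", "wins"], ["Brussels", "1996-08-22"]]}

toy_train_assets = build_assets(toy_train_batch["tokens"])
# Start numbering the test assets where the training assets stop.
toy_test_assets = build_assets(toy_test_batch["tokens"], start_index=len(toy_train_assets))

external_ids = [asset["externalId"] for asset in toy_train_assets + toy_test_assets]
assert len(external_ids) == len(set(external_ids)), "external IDs must be unique"
print(external_ids)
```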

Now we send the data to our Kili project.

In [None]:
external_id_array = [a.get("externalId") for a in assets_to_import]
content_array = [a.get("content") for a in assets_to_import]
kili.append_many_to_dataset(project_id=project_id, 
                            content_array=content_array,
                            external_id_array=external_id_array)

### Add labels to assets

We add labels to half of the data to simulate a project where only part of the data has been labeled and we want to predict labels for the rest.

In [None]:
categories = ["PERSON", "ORGANIZATION", "LOCATION", "MISCELLANEOUS"]

In [None]:
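def get_annotations(tokens, tags):
    """Convert CoNLL-2003 IOB2 tags into Kili NER annotations."""
    offset = 0
    annotations = []
    current_word = ""
    current_mid = None
    category_name = None
    begin_offset = 0

    def flush():
        # Store the entity currently being built, if any.
        nonlocal current_mid
        if current_mid is not None:
            annotations.append({
                "categories": [{"name": category_name}],
                "beginOffset": begin_offset,
                "content": current_word,
                "mid": current_mid
            })
            current_mid = None

    for token, tag in zip(tokens, tags):
        if tag % 2 == 1:
            # Odd tags are B-* tags: close any open entity, then start a new one.
            flush()
            category_name = categories[(tag - 1) // 2]
            current_mid = token.lower()
            current_word = token
            begin_offset = offset
        elif tag != 0:
            # Even non-zero tags are I-* tags: extend the current entity.
            current_word += f" {token}"
        else:
            # Tag 0 is "O": the current entity, if any, ends here.
            flush()
        # Tokens are joined with single spaces in the asset content.
        offset += len(token) + 1
    # Do not lose an entity that ends on the last token.
    flush()
    return annotations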
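
The tag arithmetic in `get_annotations` relies on the label order of the Hugging Face `conll2003` dataset: tag 0 is `O`, odd tags are `B-*` (beginning of an entity), and even non-zero tags are `I-*` (continuation), in the order PER, ORG, LOC, MISC. The sketch below makes this mapping explicit on a made-up sentence and prints the character offset of each token in the space-joined text. The helper `describe_tag` is hypothetical and only illustrates the decoding.

```python
CATEGORIES = ["PERSON", "ORGANIZATION", "LOCATION", "MISCELLANEOUS"]


def describe_tag(tag: int) -> str:
    """Hypothetical helper: map an integer tag to a readable IOB2 label."""
    if tag == 0:
        return "O"
    prefix = "B" if tag % 2 == 1 else "I"
    # Tags 1-2 map to index 0, tags 3-4 to index 1, and so on.
    return f"{prefix}-{CATEGORIES[(tag - 1) // 2]}"


# Made-up example sentence with its integer tags.
tokens = ["Peter", "Blackburn", "visited", "Brussels"]
tags = [1, 2, 0, 5]

offset = 0
for token, tag in zip(tokens, tags):
    # offset is the position of the token in " ".join(tokens).
    print(f"{offset:>2}  {token:<10} {describe_tag(tag)}")
    offset += len(token) + 1
```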

In [None]:
asset_ids = kili.assets(project_id=project_id, fields=["id", "externalId"])[:NUMBER_OF_SAMPLES]


In [None]:
labeled_examples = zip(downsampled_training_dataset["tokens"], downsampled_training_dataset["ner_tags"])
for (tokens, ner_tags), asset_id in tqdm(zip(labeled_examples, asset_ids), total=len(asset_ids)):
    annotations = get_annotations(tokens, ner_tags)
    kili.append_to_labels(label_asset_id=asset_id["id"],
                          json_response={"NAMED_ENTITIES_RECOGNITION_JOB": {'annotations': annotations}})

You can now click on the following link to see the assets in your project:

In [None]:
print(f"{KILI_URL}label/projects/{project_id}/menu/queue?currentPage=1&pageSize=20")

## Training an NER model with kiliautoml

The following command will automatically download the labeled data in your Kili project. Then, it will choose the right model for NER, train it with this data and save it locally. You can visualize the training evolution on [Weights and Biases](https://wandb.ai/).

In [None]:
!kiliautoml train \
    --api-key $api_key \
    --api-endpoint $api_endpoint \
    --project-id $project_id \
    --epochs 30
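
The `$api_key`-style arguments above are expanded by IPython, which is only available inside a notebook. As a rough sketch for running the same command from a plain Python script, the arguments can be assembled as a list. Only the flags shown in the cell above are used; the helper name `build_train_command` and the placeholder values are hypothetical.

```python
import shlex


def build_train_command(api_key, api_endpoint, project_id, epochs=30):
    """Hypothetical helper: assemble the `kiliautoml train` arguments shown above."""
    return [
        "kiliautoml", "train",
        "--api-key", api_key,
        "--api-endpoint", api_endpoint,
        "--project-id", project_id,
        "--epochs", str(epochs),
    ]


command = build_train_command(
    api_key="<API_KEY>",  # placeholder, not a real key
    api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
    project_id="<PROJECT_ID>",  # placeholder
)
# shlex.join quotes each argument for display only.
print(shlex.join(command))
```

To actually run it, the list could be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues with the API key.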


### Send predictions

Now we can use the locally trained model to predict the entities in our text assets and send the predictions to the project on Kili. These pre-annotations can then be validated or corrected by annotators.

In [None]:
!kiliautoml predict \
    --api-key $api_key \
    --api-endpoint $api_endpoint \
    --project-id $project_id

Now you can check that your assets have predictions on [Kili](https://cloud.kili-technology.com/)!

In [None]:
print(f"{KILI_URL}label/projects/{project_id}/menu/queue?currentPage=1&pageSize=20")