# Named Entity Recognition Using AutoML

In this notebook, we will see how to create a Named Entity Recognition (NER) model with AutoML and use it to pre-annotate a dataset on the [Kili Platform](https://cloud.kili-technology.com/label/).

## Set up the API key

We first set up the API key and the AutoML path.

In [2]:
from getpass import getpass

In [3]:
KILI_URL="https://cloud.kili-technology.com/"  # If you are not using Kili SaaS, change the url to your configuration

api_endpoint = f"{KILI_URL}api/label/v2/graphql"

In [4]:
api_endpoint

'https://cloud.kili-technology.com/api/label/v2/graphql'
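The f-string above only yields a valid endpoint when `KILI_URL` ends with a slash. As a minimal sketch, a small helper can tolerate both forms; `build_api_endpoint` is a hypothetical name introduced here for illustration and is not part of the Kili SDK.

```python
def build_api_endpoint(base_url: str) -> str:
    """Join a Kili base URL and the GraphQL path.

    Hypothetical helper: strips any trailing slashes so the result is the
    same whether or not the base URL ends with "/".
    """
    return base_url.rstrip("/") + "/api/label/v2/graphql"


# Both spellings of the SaaS URL give the same endpoint.
print(build_api_endpoint("https://cloud.kili-technology.com/"))
print(build_api_endpoint("https://cloud.kili-technology.com"))
```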

You can get your API key from the [Kili platform](https://cloud.kili-technology.com/label/my-account/api-key) and store it in the `KILI_API_KEY` environment variable. If you are working locally, set your environment variables in a `.env` file. If the notebook runs on Colab, the API key is requested interactively and the AutoML directory is appended to `PYTHONPATH`.

In [5]:
! pip install python-dotenv
%reload_ext dotenv
%dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0
cannot find .env file


In [6]:
from IPython import get_ipython
import os

if "google.colab" in str(get_ipython()):
    os.environ["PYTHONPATH"] += ":/content/automl/"
    api_key = getpass("Add your API Key here: ")
else:
    api_key = os.getenv("KILI_API_KEY")

Add your API Key here: ··········
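The two branches above can also be folded into one lookup that prefers the environment variable and only prompts when it is missing. This is a sketch under that assumption; `resolve_api_key` and its parameters are hypothetical names, and it raises instead of silently continuing with an empty key.

```python
import os
from getpass import getpass


def resolve_api_key(env=None, prompt=getpass):
    """Return the Kili API key from KILI_API_KEY, or ask for it interactively.

    `env` and `prompt` are injectable so the function can be exercised
    without touching the real environment or blocking on input.
    """
    env = os.environ if env is None else env
    key = env.get("KILI_API_KEY")
    if not key:
        key = prompt("Add your API Key here: ")
    if not key:
        raise ValueError("No Kili API key provided")
    return key


# Example with a dummy environment; no prompt is shown because the key is set.
print(resolve_api_key(env={"KILI_API_KEY": "dummy-key"}))
```

In the notebook itself, calling `resolve_api_key()` with no arguments would read the real environment and fall back to `getpass`.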


## Install

We first follow the install procedure explained in the [README.md](https://github.com/kili-technology/automl/blob/main/README.md).

In [7]:
!git clone https://github.com/kili-technology/automl.git

Cloning into 'automl'...
remote: Enumerating objects: 4718, done.
remote: Counting objects: 100% (1665/1665), done.
remote: Compressing objects: 100% (618/618), done.
remote: Total 4718 (delta 1256), reused 1312 (delta 1036), pack-reused 3053
Receiving objects: 100% (4718/4718), 45.67 MiB | 21.90 MiB/s, done.
Resolving deltas: 100% (2755/2755), done.


In [8]:
%cd automl

/content/automl


Install the packages. This should take less than a minute.

In [11]:
%%capture
!git submodule update --init
!pip install torch
!pip install -e .

## Imports

In [12]:
from itertools import cycle
from tqdm.autonotebook import tqdm

from kili.client import Kili
from datasets import load_dataset

## Set up a mock Kili project

Set up the Kili connection.

In [13]:
# Reuse the api_key read above (getpass on Colab, KILI_API_KEY otherwise) instead of hard-coding a key.
kili = Kili(api_key=api_key, api_endpoint=api_endpoint)

### Create the project

In [14]:
COLORS = [
    "#1f77b4",
    "#ff7f0e",
    "#2ca02c",
    "#d62728",
]

ENTITY_TYPES = [
    ("PERSON", "Person"),
    ("ORGANIZATION", "Organization"),
    ("LOCATION", "Location"),
    ("MISCELLANEOUS", "Miscellaneous")
]

ENTITY_TYPES_WITH_COLORS = [(n[0], n[1], c)
                            for n, c in zip(ENTITY_TYPES, cycle(COLORS))]

json_interface = {
    "jobs": {
        "NAMED_ENTITIES_RECOGNITION_JOB": {
            "mlTask": "NAMED_ENTITIES_RECOGNITION",
            "content": {
                "categories": {
                    name: {"name": name_pretty,
                           "children": [], "color": color}
                    for name, name_pretty, color in ENTITY_TYPES_WITH_COLORS
                },
                "input": "radio",
            },
            "instruction": "",
            "required": 1,
            "isChild": False,
        }
    },
}
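To make the `categories` comprehension above easier to follow, here is a reduced, standalone sketch that prints the dictionary it produces. It deliberately uses fewer colors than entity types (an assumption made only for this illustration) to show that `cycle` wraps around rather than truncating the list.

```python
import json
from itertools import cycle

# Reduced inputs for illustration: three entity types, two colors.
COLORS = ["#1f77b4", "#ff7f0e"]
ENTITY_TYPES = [
    ("PERSON", "Person"),
    ("ORGANIZATION", "Organization"),
    ("LOCATION", "Location"),
]

# Same shape as the "categories" entry of the JSON interface above.
categories = {
    name: {"name": pretty, "children": [], "color": color}
    for (name, pretty), color in zip(ENTITY_TYPES, cycle(COLORS))
}

print(json.dumps(categories, indent=2))
```

The third category reuses the first color because `zip` stops at the shorter finite input (`ENTITY_TYPES`) while `cycle` keeps supplying colors.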

In [15]:
project = kili.create_project(
        title="CoNLL Named Entity Recognition",
        description="Find named entities in CoNLL 2003 \n For more details see https://www.clips.uantwerpen.be/conll2003/ner/ and https://www.aclweb.org/anthology/W03-0419",
        input_type="TEXT",
        json_interface=json_interface
)

In [16]:
project_id = project["id"]

### Add assets

In [17]:
def load_conll(split):
    conll_dataset = load_dataset("conll2003", split=split)
    formatted_dataset = []
    for elem in tqdm(conll_dataset):
        formatted_dataset.append(
            {"id": int(elem["id"]), "tokens": elem["tokens"], "tags": elem["ner_tags"]}
        )
    return formatted_dataset

In [18]:
training_dataset = load_dataset("conll2003", split="train")
test_dataset = load_dataset("conll2003", split="test")

Downloading builder script:   0%|          | 0.00/2.58k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

Downloading and preparing dataset conll2003/conll2003 (download: 959.94 KiB, generated: 9.78 MiB, post-processed: Unknown size, total: 10.72 MiB) to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee...


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14042 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3251 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3454 [00:00<?, ? examples/s]

Dataset conll2003 downloaded and prepared to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee. Subsequent calls will reuse this data.




Inspect the training split, then shuffle and downsample both splits.

In [19]:
training_dataset

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14042
})

In [20]:
NUMBER_OF_SAMPLES = 100

In [21]:
shuffled_training_dataset = training_dataset.shuffle(seed=42)
shuffled_test_dataset = test_dataset.shuffle(seed=42)

downsampled_training_dataset = shuffled_training_dataset[:NUMBER_OF_SAMPLES]
downsampled_test_dataset = shuffled_test_dataset[:NUMBER_OF_SAMPLES]

In [22]:
training_assets = [
    {
        "externalId": f"text {i}",
        "content":  " ".join(tokens),
    }
    for i, tokens in enumerate(downsampled_training_dataset["tokens"])
]
test_assets = [
    {
        "externalId": f"text {i + NUMBER_OF_SAMPLES}",
        "content": " ".join(tokens),
    }
    for i, tokens in enumerate(downsampled_test_dataset["tokens"])
]
assets_to_import = training_assets + test_assets

print("Number of Training assets: ", len(training_assets))
print("Number of unlabeled assets: ", len(test_assets))
print("Total Number of assets: ", len(assets_to_import))

Number of Training assets:  100
Number of unlabeled assets:  100
Total Number of assets:  200
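The test assets are numbered from `NUMBER_OF_SAMPLES` upward so that their external IDs do not collide with the training ones. A quick sanity check along these lines can confirm that before uploading; `find_duplicate_ids` is a hypothetical helper, and the sample size is reduced for the illustration.

```python
from collections import Counter


def find_duplicate_ids(external_ids):
    """Return the sorted list of external IDs that appear more than once."""
    counts = Counter(external_ids)
    return sorted(eid for eid, n in counts.items() if n > 1)


NUMBER_OF_SAMPLES = 3  # reduced from 100 for the illustration
training_ids = [f"text {i}" for i in range(NUMBER_OF_SAMPLES)]
test_ids = [f"text {i + NUMBER_OF_SAMPLES}" for i in range(NUMBER_OF_SAMPLES)]

# With the offset, the two ID ranges are disjoint.
print(find_duplicate_ids(training_ids + test_ids))
# Without the offset, every ID would collide.
print(find_duplicate_ids(training_ids + training_ids))
```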


Now we send the data to our Kili project.

In [23]:
external_id_array = [a.get("externalId") for a in assets_to_import]
content_array = [a.get("content") for a in assets_to_import]
kili.append_many_to_dataset(project_id=project_id,
                            content_array=content_array,
                            external_id_array=external_id_array)

100%|██████████| 200/200 [00:44<00:00,  4.46it/s]


{'id': 'cljm8cjnz7c150k272ved1m24'}

### Add labels to assets

We add labels to half of the data to simulate a project where we haven't labeled much data and we want to predict the labels of the unlabeled data.

In [24]:
categories = ["PERSON", "ORGANIZATION", "LOCATION", "MISCELLANEOUS"]
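The `ner_tags` column holds integers rather than tag strings. Assuming the standard label order of the Hugging Face `conll2003` dataset (0 for `O`, then `B-`/`I-` pairs for PER, ORG, LOC and MISC), odd values open an entity and even non-zero values continue one. The sketch below spells out that mapping; `describe_tag` is a hypothetical helper used only for illustration.

```python
CATEGORIES = ["PERSON", "ORGANIZATION", "LOCATION", "MISCELLANEOUS"]


def describe_tag(tag: int):
    """Map an integer NER tag to a (BIO prefix, category) pair."""
    if tag == 0:
        return ("O", None)  # outside any entity
    prefix = "B" if tag % 2 == 1 else "I"
    # Tags 1-2 map to index 0, 3-4 to index 1, and so on.
    return (prefix, CATEGORIES[(tag - 1) // 2])


for tag in range(9):
    print(tag, describe_tag(tag))
```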

In [25]:
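def get_annotations(tokens, tags):
    """Turn CoNLL-2003 tokens and integer BIO tags into Kili NER annotations."""
    offset = 0  # character offset of the current token in " ".join(tokens)
    annotations = []
    current = None  # annotation being built, or None when outside an entity
    for token, tag in zip(tokens, tags):
        if tag % 2 == 1:
            # Odd tag = B-XXX: close any open entity, then start a new one.
            if current is not None:
                annotations.append(current)
            current = {
                "categories": [{"name": categories[(tag - 1) // 2]}],
                "beginOffset": offset,
                "content": token,
                "mid": token.lower(),
            }
        elif tag != 0:
            # Even non-zero tag = I-XXX: extend the open entity, if any.
            if current is not None:
                current["content"] += f" {token}"
        elif current is not None:
            # Tag 0 = O: the open entity ends here.
            annotations.append(current)
            current = None
        offset += len(token) + 1  # +1 for the joining space
    if current is not None:
        # Flush an entity that runs to the end of the sentence.
        annotations.append(current)
    return annotations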
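Because the asset content is `" ".join(tokens)`, each annotation's `beginOffset` should point at the exact position of its `content` in that string. A small check like the one below can catch off-by-one errors before labels are uploaded; `find_offset_mismatches` is a hypothetical helper, and the annotations are written by hand for a toy sentence.

```python
def find_offset_mismatches(text, annotations):
    """Return the annotations whose content is not found at beginOffset."""
    mismatches = []
    for ann in annotations:
        start = ann["beginOffset"]
        end = start + len(ann["content"])
        if text[start:end] != ann["content"]:
            mismatches.append(ann)
    return mismatches


tokens = ["EU", "rejects", "German", "call"]
text = " ".join(tokens)  # same construction as the asset content

# Hand-written annotations in the shape produced by get_annotations.
annotations = [
    {"categories": [{"name": "ORGANIZATION"}], "beginOffset": 0,
     "content": "EU", "mid": "eu"},
    {"categories": [{"name": "MISCELLANEOUS"}], "beginOffset": 11,
     "content": "German", "mid": "german"},
]

print(find_offset_mismatches(text, annotations))  # prints []
```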

In [26]:
asset_ids = kili.assets(project_id=project_id, fields=["id", "externalId"])[:NUMBER_OF_SAMPLES]


100%|██████████| 200/200 [00:00<00:00, 316.93it/s]


In [27]:
for (tokens, ner_tags), asset_id in tqdm(zip(zip(downsampled_training_dataset["tokens"],downsampled_training_dataset["ner_tags"]), asset_ids), total=len(asset_ids)):
    annotations = get_annotations(tokens, ner_tags)
    kili.append_to_labels(label_asset_id=asset_id["id"],
                          json_response={"NAMED_ENTITIES_RECOGNITION_JOB": {"annotations": annotations}})

  0%|          | 0/100 [00:00<?, ?it/s]

  return func(*args, **kwargs)


You can now click on the following link to see the assets in your project:

In [28]:
print(f"{KILI_URL}label/projects/{project_id}/menu/queue?currentPage=1&pageSize=20")

https://cloud.kili-technology.com/label/projects/cljm8cjnz7c150k272ved1m24/menu/queue?currentPage=1&pageSize=20


## Training a NER model with KiliAutoML

The following command will automatically download the labeled data in your Kili project. Then, it will choose the right model for NER, train it with this data and save it locally. You can visualize the training evolution on [Weights and Biases](https://wandb.ai/).

In [30]:


! pip install -U accelerate
! pip install -U transformers


Collecting accelerate
  Downloading accelerate-0.20.3-py3-none-any.whl (227 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.20.3


In [31]:
!kiliautoml train \
    --api-key $api_key \
    --api-endpoint $api_endpoint \
    --project-id $project_id \
    --epochs 1


Loading KiliAutoML...
100% 1/1 [00:00<00:00,  2.75it/s]
KiliAutoML INFO Training on job: NAMED_ENTITIES_RECOGNITION_JOB
KiliAutoML INFO Fetching assets with status in ['LABELED', 'TO_REVIEW', 'REVIEWED'] from Kili project
cache_path /root/.cache/kili/automl/cljm8cjnz7c150k272ved1m24/get_asset_memoized
wandb: Currently logged in as: lokeshdesai7. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in /root/.cache/kili/automl/cljm8cjnz7c150k272ved1m24/NAMED_ENTITIES_RECOGNITION_JOB/wandb/run-20230703_024318-b4f6ujb2
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run still-vortex-2
wandb: ⭐️ View project at https://wandb.ai/lokeshdesai7/CoNLL%20Named%20Entity%20Recognition_NAMED_ENTITIES_RECOGNITION_JOB
wandb: 🚀 View run

### Send predictions

Now we can use our locally trained model to predict the entities in the unlabeled text assets and send these predictions to the Kili project. Annotators can then validate or correct the pre-annotations.

In [33]:
!kiliautoml predict \
    --api-key $api_key \
    --api-endpoint $api_endpoint \
    --project-id $project_id

Loading KiliAutoML...
KiliAutoML INFO Are you sure You want to send the predictions to Kili? Y/N
Y
KiliAutoML INFO OK, We will send the predictions to Kili!
100% 1/1 [00:00<00:00,  2.75it/s]
KiliAutoML INFO Fetching assets with status in ['TODO', 'ONGOING'] from Kili project
cache_path /root/.cache/kili/automl/cljm8cjnz7c150k272ved1m24/get_asset_memoized
________________________________________________________________________________
[Memory] Calling kiliautoml.utils.helpers.get_asset_memoized...
get_asset_memoized(kili=<kili.client.Kili object at 0x7faeedf5d0f0>, project_id='cljm8cjnz7c150k272ved1m24', total=None, skip=0, status_in=['TODO', 'ONGOING'], asset_filter=None, query_content=True)
100% 100/100 [00:00<00:00, 101.90it/s]
_______________________________________________get_asset_memoized - 1.4s, 0.0min
KiliAutoML INFO Predicting annotations for job: NAMED_ENTITIES_RECOGNITION_JOB
  job_predictions = model.predic

Now you can check that your assets have predictions on [Kili](https://cloud.kili-technology.com/)!

In [None]:
print(f"{KILI_URL}label/projects/{project_id}/menu/queue?currentPage=1&pageSize=20")
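Outside a notebook, the `$api_key` interpolation used in the `!kiliautoml` cells is not available. As a rough sketch, the same commands can be assembled as argument lists for `subprocess.run`, using only the flags that appear in this notebook. `build_kiliautoml_command` is a hypothetical helper, the values passed to it below are placeholders, and the sketch only builds the list without executing anything. Note that `predict` asks for a Y/N confirmation, so running it non-interactively would need extra handling that is not covered here.

```python
def build_kiliautoml_command(subcommand, api_key, api_endpoint, project_id, epochs=None):
    """Build the argument list for `kiliautoml train` or `kiliautoml predict`.

    Only the flags shown in this notebook are used. `--epochs` is added for
    `train` only, since that is the only place the notebook uses it.
    """
    if subcommand not in ("train", "predict"):
        raise ValueError(f"Unsupported subcommand: {subcommand!r}")
    if epochs is not None and subcommand != "train":
        raise ValueError("--epochs is only used with 'train' in this notebook")
    command = [
        "kiliautoml", subcommand,
        "--api-key", api_key,
        "--api-endpoint", api_endpoint,
        "--project-id", project_id,
    ]
    if epochs is not None:
        command += ["--epochs", str(epochs)]
    return command


# Placeholder values; pass the resulting list to subprocess.run to execute it.
print(build_kiliautoml_command(
    "train",
    api_key="<your-api-key>",
    api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
    project_id="<your-project-id>",
    epochs=1,
))
```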