# Finding Errors in the CoNLL-2003 NER Dataset

Would it surprise you to know that one of the most widely used benchmark NER datasets has errors in it? Well, it definitely does :)
In this notebook we're going to walk through finding and correcting some of these errors.

## Loading Data
To get the CoNLL-2003 NER data, we'll use Hugging Face Datasets.

In [21]:
%reload_ext autoreload
%autoreload 2

In [None]:
!pip install datasets

In [22]:
from datasets import load_dataset

In [23]:
conll2003 = load_dataset("conll2003")

Reusing dataset conll2003 (/Users/kabirkhan/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee)


A single annotated example looks like:

In [24]:
conll2003["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

## Converting to Recon Examples

We're interested in the "tokens" and "ner_tags" fields. We'll convert the integer labels to the standard string labels using the following label map, then convert each example into a Recon Example.

In [25]:
import spacy
nlp = spacy.blank("en")
conll_labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
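
As a quick illustration of what this label map does, the sketch below decodes the `ner_tags` of the first training example shown earlier. It uses only plain Python; `decode_tags` is a hypothetical helper name, not part of Datasets or Recon.

```python
from typing import List

# Same order as the label map above: the integer tag is the index into this list.
CONLL_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_tags(tag_ids: List[int], labels: List[str]) -> List[str]:
    """Map integer tag ids to their string labels (hypothetical helper)."""
    return [labels[i] for i in tag_ids]


# Tokens and tags copied from conll2003["train"][0] above.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

decoded = decode_tags(ner_tags, CONLL_LABELS)
for token, tag in zip(tokens, decoded):
    print(f"{token}\t{tag}")
```

So "EU" is tagged as the beginning of an ORG entity, while "German" and "British" are each single-token MISC entities.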

In [26]:
from typing import List, Optional
from datasets import load_dataset, Dataset as HFDataset
from spacy.tokens import Doc
from recon.operations.tokenization import add_tokens
from recon.types import Example, Span, Token


def make_recon_examples(
    dataset: HFDataset,
    labels_property: str = "ner_tags",
    labels: Optional[List[str]] = None,
) -> List[Example]:
    """Convert a Hugging Face NER dataset split into a list of Recon Examples."""
    examples = []
    for e in dataset:
        # Map integer tag ids to string labels (e.g. 3 -> "B-ORG") when a label list is given
        if labels:
            tags = [labels[tag_n] for tag_n in e[labels_property]]
        else:
            tags = e[labels_property]
        # Build a spaCy Doc from the pre-tokenized words; the IOB tags populate doc.ents
        doc = Doc(nlp.vocab, words=e["tokens"], spaces=[True] * len(e["tokens"]), ents=tags)
        # Convert entities and tokens to Recon types using character offsets
        spans = [Span(text=ent.text, start=ent.start_char, end=ent.end_char, label=ent.label_) for ent in doc.ents]
        tokens = [Token(text=t.text, start=t.idx, end=t.idx + len(t), id=t.i) for t in doc]
        examples.append(Example(text=doc.text, spans=spans, tokens=tokens))

    return examples
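
To make the conversion concrete without spaCy or Recon installed, here is a minimal sketch of how BIO tags over space-joined tokens turn into character-offset spans. `bio_to_char_spans` is a hypothetical helper, and its handling of a stray `I-` tag (treating it as the start of a new span) is an assumption, not necessarily what spaCy does.

```python
from typing import List, Optional, Tuple

CharSpan = Tuple[str, int, int, str]  # (text, start_char, end_char, label)


def bio_to_char_spans(tokens: List[str], tags: List[str]) -> List[CharSpan]:
    """Merge BIO-tagged tokens into character-offset spans (hypothetical helper)."""
    text = " ".join(tokens)
    spans: List[CharSpan] = []
    current: Optional[Tuple[int, int, str]] = None  # open span: (start, end, label)
    pos = 0
    for token, tag in zip(tokens, tags):
        start, end = pos, pos + len(token)
        pos = end + 1  # skip the single space that separates tokens
        if tag.startswith("I-") and current is not None and current[2] == tag[2:]:
            # Continuation of the open span: extend its end offset.
            current = (current[0], end, current[2])
            continue
        if current is not None:
            # Any other tag closes the open span.
            spans.append((text[current[0]:current[1]], *current))
            current = None
        if tag != "O":
            # B-X (or a stray I-X, by assumption) opens a new span.
            current = (start, end, tag[2:])
    if current is not None:
        spans.append((text[current[0]:current[1]], *current))
    return spans


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
spans = bio_to_char_spans(tokens, tags)
print(spans)
```

The offsets index into the space-joined text, which is the same layout `make_recon_examples` produces by passing `spaces=[True] * len(e["tokens"])` to the `Doc`.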

In [27]:
train = make_recon_examples(conll2003["train"], labels=conll_labels)
dev = make_recon_examples(conll2003["validation"], labels=conll_labels)
test = make_recon_examples(conll2003["test"], labels=conll_labels)

## Load a Recon Corpus

Now that we have a list of Recon Examples for each of the train/dev/test splits, we can run stats and operations on this data and identify problematic examples.

In [28]:
train[0].show()

### Make a Corpus

Load the examples for each split into a single Corpus.

In [29]:
from recon.corpus import Corpus
from recon.dataset import Dataset
from recon import get_ner_stats, get_entity_coverage
from recon.insights import get_label_disparities

In [30]:
conll2003_corpus = Corpus(Dataset("train", train), Dataset("dev", dev), Dataset("test", test))
print(conll2003_corpus)

<bound method Dataset.__str__ of <recon.dataset.Dataset object at 0x29ef449d0>>
<bound method Dataset.__str__ of <recon.dataset.Dataset object at 0x2b6224190>>


In [13]:
ec = get_entity_coverage(conll2003_corpus.all)

ec[:10]

[EntityCoverage(text='u.s.', label='LOC', count=157, examples=[]),
 EntityCoverage(text='germany', label='LOC', count=97, examples=[]),
 EntityCoverage(text='london', label='LOC', count=82, examples=[]),
 EntityCoverage(text='australia', label='LOC', count=82, examples=[]),
 EntityCoverage(text='france', label='LOC', count=80, examples=[]),
 EntityCoverage(text='russia', label='LOC', count=79, examples=[]),
 EntityCoverage(text='world cup', label='MISC', count=77, examples=[]),
 EntityCoverage(text='italy', label='LOC', count=64, examples=[]),
 EntityCoverage(text='england', label='LOC', count=58, examples=[]),
 EntityCoverage(text='china', label='LOC', count=58, examples=[])]
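
The output above suggests that entity coverage counts how often each (lowercased text, label) pair is annotated across the corpus, sorted by frequency. The sketch below reproduces that idea in plain Python on toy data. `entity_coverage` is a hypothetical function, not Recon's implementation, and the lowercasing is inferred from the output rather than confirmed.

```python
from collections import Counter
from typing import List, Tuple

Annotation = Tuple[str, str]  # (entity text, label)


def entity_coverage(annotations: List[Annotation]) -> List[Tuple[Annotation, int]]:
    """Count (lowercased text, label) pairs, most frequent first (hypothetical helper)."""
    counts = Counter((text.lower(), label) for text, label in annotations)
    return counts.most_common()


# Toy annotations for illustration only, not taken from the dataset.
toy_annotations = [
    ("U.S.", "LOC"),
    ("Germany", "LOC"),
    ("u.s.", "LOC"),
    ("Clinton", "PER"),
    ("U.S.", "LOC"),
    ("Germany", "LOC"),
]

coverage = entity_coverage(toy_annotations)
for (text, label), count in coverage:
    print(text, label, count)
```

Frequent entities at the top of such a list are a good place to start looking for inconsistent annotations, since a mistake there is repeated many times.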

In [14]:
per_ec = [e for e in ec if e.label == "PER"]

per_ec[:10]

[EntityCoverage(text='clinton', label='PER', count=23, examples=[]),
 EntityCoverage(text='yeltsin', label='PER', count=19, examples=[]),
 EntityCoverage(text='wang', label='PER', count=19, examples=[]),
 EntityCoverage(text='lebed', label='PER', count=18, examples=[]),
 EntityCoverage(text='arafat', label='PER', count=14, examples=[]),
 EntityCoverage(text='suu kyi', label='PER', count=14, examples=[]),
 EntityCoverage(text='edberg', label='PER', count=13, examples=[]),
 EntityCoverage(text='albright', label='PER', count=13, examples=[]),
 EntityCoverage(text='lara', label='PER', count=12, examples=[]),
 EntityCoverage(text='dole', label='PER', count=11, examples=[])]

In [89]:
len(per_ec)

2271

In [102]:
print(conll2003_corpus.train_ds)

Dataset: dev
Stats: {
    "n_examples":3251,
    "n_examples_no_entities":646,
    "n_annotations":5942,
    "n_annotations_per_type":{
        "PER":1842,
        "LOC":1837,
        "ORG":1341,
        "MISC":922
    },
    "examples_with_type":null
}


In [90]:
for name, stats in conll2003_corpus.apply(get_ner_stats, serialize=True).items():
    print(name)
    print(stats)

train
{
    "n_examples":3251,
    "n_examples_no_entities":646,
    "n_annotations":5942,
    "n_annotations_per_type":{
        "PER":1842,
        "LOC":1837,
        "ORG":1341,
        "MISC":922
    },
    "examples_with_type":null
}
dev
{
    "n_examples":3454,
    "n_examples_no_entities":698,
    "n_annotations":5648,
    "n_annotations_per_type":{
        "LOC":1668,
        "ORG":1661,
        "PER":1617,
        "MISC":702
    },
    "examples_with_type":null
}
test
{
    "n_examples":0,
    "n_examples_no_entities":0,
    "n_annotations":0,
    "n_annotations_per_type":{

    },
    "examples_with_type":null
}
all
{
    "n_examples":6705,
    "n_examples_no_entities":1344,
    "n_annotations":11590,
    "n_annotations_per_type":{
        "LOC":3505,
        "PER":3459,
        "ORG":3002,
        "MISC":1624
    },
    "examples_with_type":null
}
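
For reference, the fields reported above can be computed with a few lines of plain Python. This is a sketch on toy data, assuming each example is reduced to the list of its entity labels; `ner_stats` is a hypothetical helper, not Recon's `get_ner_stats`.

```python
from collections import Counter
from typing import Dict, List


def ner_stats(examples: List[List[str]]) -> Dict[str, object]:
    """Summarize entity annotations; each example is a list of entity labels."""
    per_type = Counter(label for labels in examples for label in labels)
    return {
        "n_examples": len(examples),
        "n_examples_no_entities": sum(1 for labels in examples if not labels),
        "n_annotations": sum(per_type.values()),
        # Most frequent label first, mirroring the ordering in the output above.
        "n_annotations_per_type": dict(per_type.most_common()),
    }


# Toy data for illustration only: four examples, one without entities.
toy_examples = [["ORG", "MISC", "MISC"], [], ["PER"], ["LOC", "PER"]]
stats = ner_stats(toy_examples)
print(stats)
```

The `all` entry in the output above is the sum over the splits, for example 3251 + 3454 = 6705 examples and 5942 + 5648 = 11590 annotations.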


In [94]:
get_label_disparities(conll2003_corpus.dev, "PER", "LOC")

{'china', 'santiago'}
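
A label disparity, as the output above suggests, is an entity text that is annotated as `PER` in some places and as `LOC` in others. The sketch below expresses that idea as a set intersection on toy data. `label_disparities` is a hypothetical helper, not Recon's `get_label_disparities`, and the lowercasing is an assumption based on the output.

```python
from typing import List, Set, Tuple

Annotation = Tuple[str, str]  # (entity text, label)


def label_disparities(annotations: List[Annotation], label1: str, label2: str) -> Set[str]:
    """Return lowercased entity texts that appear under both labels (hypothetical helper)."""
    texts1 = {text.lower() for text, label in annotations if label == label1}
    texts2 = {text.lower() for text, label in annotations if label == label2}
    return texts1 & texts2


# Toy annotations for illustration only, not taken from the dataset.
toy_annotations = [
    ("China", "LOC"),
    ("China", "PER"),
    ("Santiago", "PER"),
    ("Santiago", "LOC"),
    ("London", "LOC"),
    ("Clinton", "PER"),
]

disparities = label_disparities(toy_annotations, "PER", "LOC")
print(disparities)
```

Not every disparity is an error: a text like "santiago" can legitimately be a person in one sentence and a city in another, so each hit still needs to be checked in context.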

In [30]:
conll2003_corpus.test

[]