## Prepare the conll2012_ontonotesv5 dataset for NER evaluation
Dataset: https://huggingface.co/datasets/conll2012_ontonotesv5

``` text
@inproceedings{pradhan-etal-2013-towards,
    title = "Towards Robust Linguistic Analysis using {O}nto{N}otes",
    author = {Pradhan, Sameer  and  
        Moschitti, Alessandro  and  
        Xue, Nianwen  and  
        Ng, Hwee Tou  and  
        Bj{\"o}rkelund, Anders  and  
        Uryupina, Olga  and  
        Zhang, Yuchen  and  
        Zhong, Zhi},
    booktitle = "Proceedings of the Seventeenth Conference on Computational Natural Language Learning",
    month = aug,
    year = "2013",
    address = "Sofia, Bulgaria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W13-3516",
    pages = "143--152",
}
```

In [1]:
%pip install datasets
import datasets
import re
from datasets import load_dataset, concatenate_datasets

Note: you may need to restart the kernel to use updated packages.




In [2]:
NAMED_ENTITIES = [
    "O",               # Outside of any named entity
    "B-PERSON",        # Beginning of a person's name
    "I-PERSON",        # Inside of a person's name (for multi-token names)
    "B-NORP",          # Beginning of a nationality, religious, or political group
    "I-NORP",          # Inside of a NORP (for multi-token entities)
    "B-FAC",           # Beginning of a facility (building, airport, etc.)
    "I-FAC",           # Inside of a facility (for multi-token entities)
    "B-ORG",           # Beginning of an organization's name
    "I-ORG",           # Inside of an organization's name (for multi-token names)
    "B-GPE",           # Beginning of a geopolitical entity (countries, cities, etc.)
    "I-GPE",           # Inside of a GPE (for multi-token entities)
    "B-LOC",           # Beginning of a location (not geopolitical)
    "I-LOC",           # Inside of a location (for multi-token entities)
    "B-PRODUCT",       # Beginning of a product name
    "I-PRODUCT",       # Inside of a product name (for multi-token names)
    "B-DATE",          # Beginning of a date
    "I-DATE",          # Inside of a date (for multi-token entities)
    "B-TIME",          # Beginning of a time expression
    "I-TIME",          # Inside of a time expression (for multi-token entities)
    "B-PERCENT",       # Beginning of a percentage
    "I-PERCENT",       # Inside of a percentage (for multi-token entities)
    "B-MONEY",         # Beginning of a monetary value
    "I-MONEY",         # Inside of a monetary value (for multi-token entities)
    "B-QUANTITY",      # Beginning of a quantity
    "I-QUANTITY",      # Inside of a quantity (for multi-token entities)
    "B-ORDINAL",       # Beginning of an ordinal number
    "I-ORDINAL",       # Inside of an ordinal number (for multi-token entities)
    "B-CARDINAL",      # Beginning of a cardinal number
    "I-CARDINAL",      # Inside of a cardinal number (for multi-token entities)
    "B-EVENT",         # Beginning of an event name
    "I-EVENT",         # Inside of an event name (for multi-token names)
    "B-WORK_OF_ART",   # Beginning of a work of art (books, songs, etc.)
    "I-WORK_OF_ART",   # Inside of a work of art (for multi-token entities)
    "B-LAW",           # Beginning of a law/legal reference
    "I-LAW",           # Inside of a law/legal reference (for multi-token entities)
    "B-LANGUAGE",      # Beginning of a language name
    "I-LANGUAGE",      # Inside of a language name (for multi-token names)
]
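
The label list has a regular layout: index 0 is `O`, and each entity type takes a `B-`/`I-` pair at consecutive indices, so the `I-` id is always the `B-` id plus one. The rest of the notebook relies on this arithmetic. As a minimal sketch, the same list can be regenerated and the indices checked; `ENTITY_TYPES`, `build_bio_labels`, and `label2id` are illustrative names that do not appear in the original notebook.

```python
# Hypothetical helper: regenerate the BIO label list from the entity types.
ENTITY_TYPES = [
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "DATE", "TIME",
    "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE",
]

def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] in the order of entity_types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    return labels

labels = build_bio_labels(ENTITY_TYPES)
label2id = {label: i for i, label in enumerate(labels)}

print(len(labels))
print(label2id["B-EVENT"], label2id["I-EVENT"])
```

The ids printed for `B-EVENT` and `I-EVENT` are the ones used in the worked example later in the notebook.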

In [3]:
dataset_name = "conll2012_ontonotesv5"

dataset = load_dataset(dataset_name, 'english_v12')

In [4]:
train_data = dataset['train']
test_data = concatenate_datasets([dataset['test'], dataset['validation']])
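
The preprocessing code reads each record as a document that holds a list of sentences, each with parallel `words` and `named_entities` lists. The sketch below shows that layout with a hand-written stand-in record, reusing a few tokens from the example sentence in this notebook. It is not loaded from the dataset, and any other fields the real records carry are omitted.

```python
# Hand-written stand-in for one dataset record; only the two fields that the
# preprocessing relies on are shown.
doc = {
    "sentences": [
        {
            "words": ["by", "the", "Eighth", "Route", "Army"],
            "named_entities": [0, 7, 8, 8, 8],  # 7 = B-ORG, 8 = I-ORG
        }
    ]
}

for sent in doc["sentences"]:
    # The two lists are parallel: one label id per token.
    assert len(sent["words"]) == len(sent["named_entities"])
    pairs = list(zip(sent["words"], sent["named_entities"]))
    print(pairs)
```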

In [5]:
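# Build the data used to evaluate the NER task.
# Columns of the returned dataset:
#  - sentence: the sentence to be evaluated
#  - ner_type: natural-language description of the entity type
#  - named_entity: list of the gold entities of that type in the sentence
#  - sentence with annotated entity: the sentence with each such entity marked as '@@entity##'
# The same sentence can appear multiple times, once per entity type it contains.
# Entity types used for evaluation, with their (B-, I-) label ids in NAMED_ENTITIES;
# ordinal (25+26) and cardinal (27+28) numbers are left out: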
#   - person name (1+2), 
#   - nationalities or religious or political groups (3+4), 
#   - facility (building, airport, etc.) (5+6),
#   - organization's name (7+8),
#   - geopolitical entity (countries, cities, etc.) (9+10),
#   - location (not geopolitical) (11+12),
#   - product name (13+14),
#   - date (15+16),
#   - time expression (17+18),
#   - percentage (19+20),
#   - monetary value (21+22),
#   - quantity (23+24),
#   - event name (29+30),
#   - work of art (books, songs, etc.) (31+32),
#   - law/legal reference (33+34),
#   - language name (35+36)
selected_ner = {
    1: 'person',
    3: 'nationalities or religious or political groups',
    5: 'facility',
    7: 'organization',
    9: 'geopolitical entity',
    11: 'location',
    13: 'product',
    15: 'date',
    17: 'time expression',
    19: 'percentage',
    21: 'monetary value',
    23: 'quantity',
    29: 'event',
    31: 'work of art',
    33: 'law/legal reference',
    35: 'language name' 
}

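def format_text(text: str) -> str:
    """Normalize spacing: no space before and one space after . , ; : ! ? %,
    no spaces around hyphens, and runs of whitespace collapsed to one space."""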
    formatted_text = re.sub(r'\s*([.,;:!?%])\s*', r'\1 ', text)
    formatted_text = re.sub(r'\s*([-])\s*', r'\1', formatted_text)
    formatted_text = re.sub(r'\s+', ' ', formatted_text)
    return formatted_text

def extract_sentence(words: list[str]) -> str:
    """Join tokens into a sentence and normalize the spacing around punctuation."""
    text = ' '.join(words)
    return format_text(text)

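def extract_named_entities(words: list[str], ner_index: int, named_entities: list[int]) -> list[str]:
    """Collect every entity that starts with label `ner_index` (B- tag) and continues with `ner_index + 1` (I- tag)."""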
    named_ents = []
    index = 0

    while index < len(words):
        word, ner = words[index], named_entities[index]
        if ner == ner_index:
            named_ent = [word]
            for next_word, next_ner in zip(words[index+1:], named_entities[index+1:]):
                if next_ner == ner_index+1:
                    named_ent.append(next_word)
                else:
                    break

            named_ents.append(format_text(' '.join(named_ent)))
            index += len(named_ent)
        else:
            index += 1
    return named_ents

def extract_annotated_sentence(words: list[str], ner_index: int, named_entities: list[int]) -> str:
    """Rebuild the sentence with each entity of type `ner_index` wrapped as '@@entity##'."""
    text = ''
    index = 0

    while index < len(words):
        word, ner = words[index], named_entities[index]
        if ner == ner_index:
            named_ent = [word]
            for next_word, next_ner in zip(words[index+1:], named_entities[index+1:]):
                if next_ner == ner_index+1:
                    named_ent.append(next_word)
                else:
                    break

            ent = ' @@' + ' '.join(named_ent) + '## '
            text += ent
            index += len(named_ent)
        else:
            text += word + ' '
            index += 1
    return format_text(text)

def preprocess_dataset(data: datasets.Dataset) -> datasets.Dataset:
    """Create one row per (sentence, entity type) pair for the types in `selected_ner`."""
    sentences = []
    ner_type = []
    named_entities = [] # holds lists of named entities
    annotated_sentences = []

    for doc in data:
        for sent in doc['sentences']:
            for ner_index, ner in selected_ner.items():
                if ner_index in sent['named_entities']:
                    text = extract_sentence(sent['words'])
                    named_ents = extract_named_entities(sent['words'], ner_index, sent['named_entities'])
                    annotated_text = extract_annotated_sentence(sent['words'], ner_index, sent['named_entities'])

                    sentences.append(text)
                    ner_type.append(ner)
                    named_entities.append(named_ents)
                    annotated_sentences.append(annotated_text)

    # return Dataset
    return datasets.Dataset.from_dict({
        'sentence': sentences,
        'ner_type': ner_type,
        'named_entity': named_entities,
        'sentence with annotated entity': annotated_sentences
    })


In [6]:
words = [
    'The', 'Hundred', 'Regiments', 'Offensive',
    'was', 'the', 'campaign', 'of', 'the', 'largest', 'scale', 'launched', 'by',
    'the', 'Eighth', 'Route', 'Army',
    'during',
    'the', 'War', 'of', 'Resistance', 'against', 'Japan',
    '.'
]

# One label id per token, aligned row by row with `words` above.
named_ents = [
    29, 30, 30, 30,              # B-EVENT, I-EVENT ...
    0, 0, 0, 0, 0, 0, 0, 0, 0,
    7, 8, 8, 8,                  # B-ORG, I-ORG ...
    0,
    29, 30, 30, 30, 30, 30,      # B-EVENT, I-EVENT ...
    0
]

print(extract_sentence(words))
print(extract_named_entities(words, 29, named_ents))
print(extract_annotated_sentence(words, 29, named_ents))

The Hundred Regiments Offensive was the campaign of the largest scale launched by the Eighth Route Army during the War of Resistance against Japan. 
['The Hundred Regiments Offensive', 'the War of Resistance against Japan']
 @@The Hundred Regiments Offensive## was the campaign of the largest scale launched by the Eighth Route Army during @@the War of Resistance against Japan##. 
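
Because entities are marked as `@@entity##`, they can be read back out of an annotated sentence, for example to parse a model's annotated output before comparing it with the `named_entity` column. The following is a minimal sketch; `parse_annotated` is a hypothetical helper that does not appear in the original notebook, and it assumes entities never contain the `##` marker themselves.

```python
import re

def parse_annotated(text: str) -> list[str]:
    """Return the spans wrapped in @@...## markers, in order of appearance."""
    # Non-greedy match so that each @@ pairs with the nearest following ##.
    return [span.strip() for span in re.findall(r"@@(.+?)##", text)]

annotated = (
    " @@The Hundred Regiments Offensive## was the campaign of the largest "
    "scale launched by the Eighth Route Army during "
    "@@the War of Resistance against Japan##. "
)
print(parse_annotated(annotated))
# ['The Hundred Regiments Offensive', 'the War of Resistance against Japan']
```

The spans are stripped of surrounding whitespace because `format_text` can leave a trailing space after punctuation inside an entity.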


In [7]:
train = preprocess_dataset(train_data)
test = preprocess_dataset(test_data)

In [8]:
print(train.shape[0])
print(test.shape[0])

87265
22132


In [9]:
train.save_to_disk('./data/train')
test.save_to_disk('./data/test')



In [10]:
train = datasets.load_from_disk('./data/train')
test = datasets.load_from_disk('./data/test')

print(train.shape[0])
print(test.shape[0])

87265
22132
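
This notebook does not cover how the saved splits are scored. One possible approach, assuming a model returns a list of entity strings for each row, is to compare them with the `named_entity` column by exact string match. In the sketch below, `score_row` and the sample prediction are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter

def score_row(gold: list[str], predicted: list[str]) -> tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one row.

    Uses multiset matching so that repeated mentions are counted separately.
    """
    gold_counts = Counter(s.strip() for s in gold)
    pred_counts = Counter(s.strip() for s in predicted)
    tp = sum((gold_counts & pred_counts).values())
    fp = sum(pred_counts.values()) - tp
    fn = sum(gold_counts.values()) - tp
    return tp, fp, fn

gold = ["The Hundred Regiments Offensive", "the War of Resistance against Japan"]
predicted = ["The Hundred Regiments Offensive", "Eighth Route Army"]  # made-up model output

tp, fp, fn = score_row(gold, predicted)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(tp, fp, fn, precision, recall, f1)
```

Summing `tp`, `fp`, and `fn` over all rows before computing precision and recall would give micro-averaged scores. Whether exact match is the right criterion depends on the evaluation being run.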
