# NLP Exercise 6: Named Entity Recognition (NER)
---

Named entity recognition (NER): mark each word in a sentence as belonging to a particular type of named entity, or to none. The dataset uses the following tags:

- O: the word does not correspond to any entity.
- B-PER/I-PER: the word is at the beginning of/inside a person entity.
- B-ORG/I-ORG: the word is at the beginning of/inside an organization entity.
- B-LOC/I-LOC: the word is at the beginning of/inside a location entity.
- B-MISC/I-MISC: the word is at the beginning of/inside a miscellaneous entity.
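
To make the tag scheme above concrete, here is a minimal sketch of how B-/I-/O tags can be grouped back into entity spans. The helper name `extract_entities` and the example sentence are hypothetical, made up for illustration and not taken from the dataset.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" closes any open entity; a stray I- tag is ignored.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Made-up example sentence (assumption, not from CoNLL-2003).
tokens = ["Barack", "Obama", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
print(extract_entities(tokens, tags))
# [('PER', 'Barack Obama'), ('LOC', 'New York')]
```

The B- prefix is what lets two adjacent entities of the same type be told apart: a second B-PER starts a new person instead of extending the previous one.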

More information about the dataset used below is available at:
https://huggingface.co/datasets/eriktks/conll2003

## Preprocessing

In [1]:
import datasets
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

In [2]:
# Load Datasets
ner_dataset = datasets.load_dataset('conll2003', trust_remote_code=True)

In [3]:
ner_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [4]:
ner_dataset['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [7]:
train_data = ner_dataset['train']
test_data = ner_dataset['test']

In [5]:
# Define tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
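
A BERT tokenizer may split one word into several subword pieces, so the word-level `ner_tags` usually have to be realigned to the token sequence before training. The sketch below shows one common convention, assuming a `word_ids` list like the one returned by `tokenizer(tokens, is_split_into_words=True).word_ids()` on a fast tokenizer. The helper name `align_labels` and the sample `word_ids` are hypothetical; `-100` is the value that PyTorch's cross-entropy loss ignores by default.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    Only the first subword of each word keeps the label; special
    tokens and continuation subwords get `ignore_index`.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] belong to no word.
            aligned.append(ignore_index)
        elif word_id != previous:
            # First subword of a new word: keep the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword of the same word: mask it out.
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# Hypothetical word_ids: three words, the second split into two pieces.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 7]
print(align_labels(word_ids, word_labels))
# [-100, 3, 0, -100, 7, -100]
```

Labelling only the first subword is one design choice among several; another option is to propagate the label to every piece of the word.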

In [17]:
# Retrieve the NER label names (identical for the train and test splits)
ner_tags = ner_dataset['train'].features['ner_tags'].feature.names
ner_tags

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
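
With the label names in hand, the integer `ner_tags` of the first training example shown earlier can be decoded back into readable tags. This is a small sketch that hard-codes the values printed above so it runs without downloading the dataset; the names `id2label` and `label2id` are chosen here for illustration.

```python
# Label names and first training example, copied from the outputs above.
ner_tags = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
            'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
tokens = ['EU', 'rejects', 'German', 'call', 'to',
          'boycott', 'British', 'lamb', '.']
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Lookup tables in both directions between integer ids and label names.
id2label = dict(enumerate(ner_tags))
label2id = {label: idx for idx, label in id2label.items()}

# Pair every token with its decoded label.
decoded = [(token, id2label[tag_id]) for token, tag_id in zip(tokens, tag_ids)]
for token, label in decoded:
    print(f"{token:10s}{label}")
```

Here `EU` decodes to `B-ORG` and both `German` and `British` decode to `B-MISC`, while every other token is `O`.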