<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/sequence_labeling_mlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence labeling (POS tagging) with MLP

This notebook builds upon the [classification with MLP notebook](https://github.com/TurkuNLP/intro-to-nlp/blob/master/mlp_imdb_hf_dset_and_trainer.ipynb) and shows how to implement a basic sequence labeling method.

---

# Setup

Install the required Python packages using [pip](https://en.wikipedia.org/wiki/Pip):

* [`transformers`](https://huggingface.co/docs/transformers/index) is a popular deep learning package built primarily on top of PyTorch
* [`datasets`](https://huggingface.co/docs/datasets/) provides support for loading, creating, and manipulating datasets
* [`evaluate`](https://huggingface.co/docs/evaluate/index) is a library of performance metrics (such as accuracy)

In [1]:
# !pip install --quiet transformers[torch] datasets evaluate

---

# Get and prepare data

*   Let us work with the venerable, if somewhat dated, [CoNLL'03 shared task](https://aclanthology.org/W03-0419.pdf) English data
*   These are English news articles annotated with POS tags, syntactic chunks, and named entities (in the IOB format)
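
To make the IOB format concrete: a `B-X` tag marks the first token of an entity of type `X`, `I-X` marks a continuation of that entity, and `O` marks a token outside any entity. The following is a minimal sketch of how such tags can be grouped back into entity spans. The helper name `iob_to_spans` is hypothetical and not part of the dataset or any library used in this notebook.

```python
def iob_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def close_span():
        # Store the span collected so far, if any
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag (or an I- tag of a different type) opens a new span
            close_span()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the current span
            current_tokens.append(token)
        else:
            # "O": outside any entity
            close_span()
            current_type, current_tokens = None, []
    close_span()
    return spans


tokens = ["Only", "France", "and", "Britain", "backed", "Fischler", "'s", "proposal", "."]
tags = ["O", "B-LOC", "O", "B-LOC", "O", "B-PER", "O", "O", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

For the example sentence used in this notebook, this yields two `LOC` spans and one `PER` span.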

The data as originally distributed for the 2003 shared task has the following format:

```
Only RB B-NP O
France NNP I-NP B-LOC
and CC I-NP O
Britain NNP I-NP B-LOC
backed VBD B-VP O
Fischler NNP B-NP B-PER
's POS I-NP O
proposal NN I-NP O
. . O O
```

Here, the four space-separated columns are token text, POS tag, chunk tag, and NER tag. The goal of the original task is to predict the NER tags using the other information as features, but the dataset can be used to study predicting the other columns too.
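
As a rough illustration of the column format described above, the sketch below parses a few lines of this kind into parallel lists, similar in shape to what the Hugging Face dataset provides. The helper `parse_conll_sentence` is a hypothetical name introduced only for this example; the notebook itself loads the data from the hub instead.

```python
# A few lines in the four-column format shown above
SAMPLE = """\
Only RB B-NP O
France NNP I-NP B-LOC
and CC I-NP O
Britain NNP I-NP B-LOC
backed VBD B-VP O
"""


def parse_conll_sentence(text):
    """Split each non-empty line into token, POS tag, chunk tag and NER tag."""
    sentence = {"tokens": [], "pos_tags": [], "chunk_tags": [], "ner_tags": []}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue    # skip blank lines
        token, pos, chunk, ner = line.split()
        sentence["tokens"].append(token)
        sentence["pos_tags"].append(pos)
        sentence["chunk_tags"].append(chunk)
        sentence["ner_tags"].append(ner)
    return sentence


parsed = parse_conll_sentence(SAMPLE)
print(parsed["tokens"])
print(parsed["ner_tags"])
```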

The dataset is available in the Hugging Face datasets collection, so we can load it from there:


In [2]:
import torch
import transformers
import datasets

from pprint import pprint    # pretty-print

dataset = datasets.load_dataset("conll2003")

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


In [3]:
pprint(dataset["train"][12])

{'chunk_tags': [11, 12, 12, 12, 21, 11, 11, 12, 0],
 'id': '12',
 'ner_tags': [0, 5, 0, 5, 0, 1, 0, 0, 0],
 'pos_tags': [30, 22, 10, 22, 38, 22, 27, 21, 7],
 'tokens': ['Only',
            'France',
            'and',
            'Britain',
            'backed',
            'Fischler',
            "'s",
            'proposal',
            '.']}


As you can see above, the various labels (POS, NER and chunk tags) are converted into IDs in this dataset. We can access the textual labels of these tags through the dataset `features`:

In [4]:
POS_TAG_NAMES = dataset['train'].features['pos_tags'].feature.names
NER_TAG_NAMES = dataset['train'].features['ner_tags'].feature.names
CHUNK_TAG_NAMES = dataset['train'].features['chunk_tags'].feature.names

We can then create mappings from names to IDs and back as Python dictionaries:

In [5]:
POS2ID = { n: i for i, n in enumerate(POS_TAG_NAMES) }
ID2POS = { i: n for i, n in enumerate(POS_TAG_NAMES) }

NER2ID = { n: i for i, n in enumerate(NER_TAG_NAMES) }
ID2NER = { i: n for i, n in enumerate(NER_TAG_NAMES) }

CHUNK2ID = { n: i for i, n in enumerate(CHUNK_TAG_NAMES) }
ID2CHUNK = { i: n for i, n in enumerate(CHUNK_TAG_NAMES) }

This is what these mappings look like:

In [6]:
print(NER2ID)

{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}


In [7]:
print(ID2NER)

{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}


In [8]:
print(POS2ID)

{'"': 0, "''": 1, '#': 2, '$': 3, '(': 4, ')': 5, ',': 6, '.': 7, ':': 8, '``': 9, 'CC': 10, 'CD': 11, 'DT': 12, 'EX': 13, 'FW': 14, 'IN': 15, 'JJ': 16, 'JJR': 17, 'JJS': 18, 'LS': 19, 'MD': 20, 'NN': 21, 'NNP': 22, 'NNPS': 23, 'NNS': 24, 'NN|SYM': 25, 'PDT': 26, 'POS': 27, 'PRP': 28, 'PRP$': 29, 'RB': 30, 'RBR': 31, 'RBS': 32, 'RP': 33, 'SYM': 34, 'TO': 35, 'UH': 36, 'VB': 37, 'VBD': 38, 'VBG': 39, 'VBN': 40, 'VBP': 41, 'VBZ': 42, 'WDT': 43, 'WP': 44, 'WP$': 45, 'WRB': 46}


In [42]:
print(NER_TAG_NAMES)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


Let's also add in explanations from Penn Treebank for the POS tags:

In [9]:
# From the documentation page and from here https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

POS2DESCRIPTION = {
    "CC": "Coordinating conjunction",
    "CD": "Cardinal number",
    "DT": "Determiner",
    "EX": "Existential there",
    "FW": "Foreign word",
    "IN": "Preposition or subordinating conjunction",
    "JJ": "Adjective",
    "JJR": "Adjective, comparative",
    "JJS": "Adjective, superlative",
    "LS": "List item marker",
    "MD": "Modal",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "NNPS": "Proper noun, plural",
    "PDT": "Predeterminer",
    "POS": "Possessive ending",
    "PRP": "Personal pronoun",
    "PRP$": "Possessive pronoun",
    "RB": "Adverb",
    "RBR": "Adverb, comparative",
    "RBS": "Adverb, superlative",
    "RP": "Particle",
    "SYM": "Symbol",
    "TO": "to",
    "UH": "Interjection",
    "VB": "Verb, base form",
    "VBD": "Verb, past tense",
    "VBG": "Verb, gerund or present participle",
    "VBN": "Verb, past participle",
    "VBP": "Verb, non-3rd person singular present",
    "VBZ": "Verb, 3rd person singular present",
    "WDT": "Wh-determiner",
    "WP": "Wh-pronoun",
    "WP$": "Possessive wh-pronoun",
    "WRB": "Wh-adverb"
}

We can now try to make sense of the tags:

In [10]:
import tabulate

e = dataset["train"][12]    # work on the same example

table = []
for token, pos_id, chunk_id, ner_id in zip(e["tokens"], e["pos_tags"], e["chunk_tags"], e["ner_tags"]):
    ner_tag = ID2NER[ner_id]
    chunk_tag = ID2CHUNK[chunk_id]
    pos_tag = ID2POS[pos_id]
    pos_def = POS2DESCRIPTION.get(pos_tag,pos_tag)
    table.append([token, ner_tag, chunk_tag, pos_tag, pos_def])

print(tabulate.tabulate(table,headers=["Token", "NER", "Chunk", "POS", "POS definition"]))

Token     NER    Chunk    POS    POS definition
--------  -----  -------  -----  ------------------------
Only      O      B-NP     RB     Adverb
France    B-LOC  I-NP     NNP    Proper noun, singular
and       O      I-NP     CC     Coordinating conjunction
Britain   B-LOC  I-NP     NNP    Proper noun, singular
backed    O      B-VP     VBD    Verb, past tense
Fischler  B-PER  B-NP     NNP    Proper noun, singular
's        O      B-NP     POS    Possessive ending
proposal  O      I-NP     NN     Noun, singular or mass
.         O      O        .      .


Note that the data is organized into sentences.

---

# Create features

We'll define a simple function that takes a token sequence, the index of the focus token, and a window size and generates a few basic explicit features relevant to the task.

(Note that the model trained below predicts the NER tags, so the POS tags are available as input features, as they would be in a "traditional" NLP pipeline where POS tags are predicted _before_ named entities. The chunk tags are not used.)

In [11]:
def token_features(tokens, pos_tags, index, window_size):
    # Generate features for token in position `index` in given list of tokens
    features = []

    # Context window start and end
    window_start = max(0, index-window_size)
    window_end = min(index+window_size+1, len(tokens))    # note +1 for range

    for i in range(window_start, window_end):
        offset = i - index    # relative position
        features.append(f"token[{offset}]={tokens[i]}")
        features.append(f"pos_tag_of_token[{offset}]={pos_tags[i]}")

    # Example custom feature: does focus token start with an upper-case letter?
    if tokens[index][0].isupper():
        features.append("first-letter-capitalized")

    return features

We can call this function for all tokens in a sentence like so:

In [12]:
def add_features_to_sentence(sentence):
    # Collect lists of features for all tokens here
    all_features = []

    tokens = sentence["tokens"]
    pos_tags = sentence["pos_tags"]
    for index in range(len(tokens)):
        all_features.append(token_features(tokens, pos_tags, index, window_size=3))

    return { "features": all_features }

In [13]:
for feats in add_features_to_sentence(dataset["train"][12])["features"]:
    print(feats)

['token[0]=Only', 'pos_tag_of_token[0]=30', 'token[1]=France', 'pos_tag_of_token[1]=22', 'token[2]=and', 'pos_tag_of_token[2]=10', 'token[3]=Britain', 'pos_tag_of_token[3]=22', 'first-letter-capitalized']
['token[-1]=Only', 'pos_tag_of_token[-1]=30', 'token[0]=France', 'pos_tag_of_token[0]=22', 'token[1]=and', 'pos_tag_of_token[1]=10', 'token[2]=Britain', 'pos_tag_of_token[2]=22', 'token[3]=backed', 'pos_tag_of_token[3]=38', 'first-letter-capitalized']
['token[-2]=Only', 'pos_tag_of_token[-2]=30', 'token[-1]=France', 'pos_tag_of_token[-1]=22', 'token[0]=and', 'pos_tag_of_token[0]=10', 'token[1]=Britain', 'pos_tag_of_token[1]=22', 'token[2]=backed', 'pos_tag_of_token[2]=38', 'token[3]=Fischler', 'pos_tag_of_token[3]=22']
['token[-3]=Only', 'pos_tag_of_token[-3]=30', 'token[-2]=France', 'pos_tag_of_token[-2]=22', 'token[-1]=and', 'pos_tag_of_token[-1]=10', 'token[0]=Britain', 'pos_tag_of_token[0]=22', 'token[1]=backed', 'pos_tag_of_token[1]=38', 'token[2]=Fischler', 'pos_tag_of_token[2]=

The dataset is organized into sentences, so we can use the above function to add features to the entire dataset as follows.

**Note**: unlike e.g. the Python `map` function, the [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function returns a dataset in which the values returned by the mapped function are _added to_ the existing columns, so the existing values are kept.

In [14]:
dataset = dataset.map(add_features_to_sentence)

Map: 100%|██████████| 14041/14041 [00:03<00:00, 3857.11 examples/s]
Map: 100%|██████████| 3250/3250 [00:00<00:00, 3625.27 examples/s]
Map: 100%|██████████| 3453/3453 [00:00<00:00, 4013.28 examples/s]


Let's check that one more time:

In [15]:
pprint(dataset["train"][12])

{'chunk_tags': [11, 12, 12, 12, 21, 11, 11, 12, 0],
 'features': [['token[0]=Only',
               'pos_tag_of_token[0]=30',
               'token[1]=France',
               'pos_tag_of_token[1]=22',
               'token[2]=and',
               'pos_tag_of_token[2]=10',
               'token[3]=Britain',
               'pos_tag_of_token[3]=22',
               'first-letter-capitalized'],
              ['token[-1]=Only',
               'pos_tag_of_token[-1]=30',
               'token[0]=France',
               'pos_tag_of_token[0]=22',
               'token[1]=and',
               'pos_tag_of_token[1]=10',
               'token[2]=Britain',
               'pos_tag_of_token[2]=22',
               'token[3]=backed',
               'pos_tag_of_token[3]=38',
               'first-letter-capitalized'],
              ['token[-2]=Only',
               'pos_tag_of_token[-2]=30',
               'token[-1]=France',
               'pos_tag_of_token[-1]=22',
               'token[0]=and',
        

---

# Flatten dataset

The MLP code that we introduced previously expects each of the `train`, `validation` and `test` subsets of the data to consist of simple sequences of examples.

Now that we have run the feature generation, we no longer need the sentence structure and can "flatten" the data into such sequences.

In [16]:
def flatten(subset):
    # Keys for values to flatten
    keys = ["tokens", "pos_tags", "chunk_tags", "ner_tags", "features"]

    # Initialize to empty lists of tokens etc.
    flattened = { k: [] for k in keys }

    # Concatenate per-sentence lists of tokens etc.
    for sentence in subset:
        for key in keys:
            flattened[key].extend(sentence[key])

    # Return as Dataset object
    return datasets.Dataset.from_dict(flattened)

Call `flatten` for each of the subsets and make a new `DatasetDict` containing the flattened subsets:

In [17]:
flattened_dict = {
    "train": flatten(dataset["train"]),
    "validation": flatten(dataset["validation"]),
    "test": flatten(dataset["test"]),
}

flat_dataset = datasets.DatasetDict(flattened_dict)

Check that the new dataset looks OK:

In [18]:
flat_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'features'],
        num_rows: 203621
    })
    validation: Dataset({
        features: ['tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'features'],
        num_rows: 51362
    })
    test: Dataset({
        features: ['tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'features'],
        num_rows: 46435
    })
})

In [19]:
for i in range(10):
    token = flat_dataset["train"]["tokens"][i]
    pos_tag = ID2POS[flat_dataset["train"]["pos_tags"][i]]
    description = POS2DESCRIPTION.get(pos_tag, pos_tag)
    features = flat_dataset["train"]["features"][i]
    print(f"{token}\t{pos_tag}\t{description}\t{features}")

EU	NNP	Proper noun, singular	['token[0]=EU', 'pos_tag_of_token[0]=22', 'token[1]=rejects', 'pos_tag_of_token[1]=42', 'token[2]=German', 'pos_tag_of_token[2]=16', 'token[3]=call', 'pos_tag_of_token[3]=21', 'first-letter-capitalized']
rejects	VBZ	Verb, 3rd person singular present	['token[-1]=EU', 'pos_tag_of_token[-1]=22', 'token[0]=rejects', 'pos_tag_of_token[0]=42', 'token[1]=German', 'pos_tag_of_token[1]=16', 'token[2]=call', 'pos_tag_of_token[2]=21', 'token[3]=to', 'pos_tag_of_token[3]=35']
German	JJ	Adjective	['token[-2]=EU', 'pos_tag_of_token[-2]=22', 'token[-1]=rejects', 'pos_tag_of_token[-1]=42', 'token[0]=German', 'pos_tag_of_token[0]=16', 'token[1]=call', 'pos_tag_of_token[1]=21', 'token[2]=to', 'pos_tag_of_token[2]=35', 'token[3]=boycott', 'pos_tag_of_token[3]=37', 'first-letter-capitalized']
call	NN	Noun, singular or mass	['token[-3]=EU', 'pos_tag_of_token[-3]=22', 'token[-2]=rejects', 'pos_tag_of_token[-2]=42', 'token[-1]=German', 'pos_tag_of_token[-1]=16', 'token[0]=call', 

Note that this is now a single long sequence of tokens without sentence boundaries.
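
If sentence boundaries are needed again later (for example, to group predictions back into sentences), one option is to record a sentence index for every token while flattening. The sketch below shows this idea on plain Python lists with toy data; the `sentence_id` column and the helper `flatten_with_sentence_ids` are hypothetical additions and are not used elsewhere in this notebook.

```python
# Toy sentences in the same shape as the dataset rows (only two columns kept)
sentences = [
    {"tokens": ["EU", "rejects", "German", "call"], "ner_tags": [3, 0, 7, 0]},
    {"tokens": ["Peter", "Blackburn"], "ner_tags": [1, 2]},
]


def flatten_with_sentence_ids(sentences, keys):
    """Concatenate per-sentence lists and remember which sentence each row came from."""
    flat = {k: [] for k in keys}
    flat["sentence_id"] = []
    for sentence_id, sentence in enumerate(sentences):
        for key in keys:
            flat[key].extend(sentence[key])
        # One sentence_id entry per token of this sentence
        flat["sentence_id"].extend([sentence_id] * len(sentence["tokens"]))
    return flat


flat = flatten_with_sentence_ids(sentences, ["tokens", "ner_tags"])
print(flat["tokens"])
print(flat["sentence_id"])
```

The number of rows after flattening equals the total number of tokens over all sentences, which is why the flattened subsets above have many more rows than the original ones.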

---

# Vectorize data

We'll next follow the steps that you should already be familiar with from the [text classification notebook](https://github.com/TurkuNLP/intro-to-nlp/blob/master/mlp_imdb_hf_dset_and_trainer.ipynb), with a few changes:

* Since the data is already tokenized, we only need to **vectorize** it, i.e. get the non-zero elements of the feature vector
* Unlike in the text classification notebook, here we are **vectorizing token features**
* We'll again use sklearn's feature extraction package, in particular `CountVectorizer`
* Since our features are now lists of strings, we can skip tokenization and use these as-is

In [20]:
import sklearn.feature_extraction


# Dummy function for tokenization and preprocessing
def do_nothing(features):
    return features

vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    binary=True,
    max_features=30000,
    tokenizer=do_nothing,
    preprocessor=do_nothing,
)

# Get a list of all feature strings from the training data
features = [e["features"] for e in flat_dataset["train"]]

# "Train" the vectorizer, i.e. build its vocabulary
vectorizer.fit(features)



As in the text classification notebook, we then invoke the vectorizer, which returns a sparse matrix, and extract the indices of its non-zero elements:

In [21]:
def vectorize_example(e):
    vectorized = vectorizer.transform([e["features"]])

    # nonzero() gives a pair of (rows,columns), we want the columns
    non_zero_features = vectorized.nonzero()[1]

    # Feature index 0 will have a special meaning, so let us not produce
    # it by adding +1 to everything
    non_zero_features += 1

    return {
        "input_ids": non_zero_features,
        "label": e["ner_tags"]
    }

Check one example:

In [22]:
vectorized = vectorize_example(flat_dataset["train"][10])

print(flat_dataset["train"][10])
print(vectorized)

{'tokens': 'Blackburn', 'pos_tags': 22, 'chunk_tags': 12, 'ner_tags': 2, 'features': ['token[-1]=Peter', 'pos_tag_of_token[-1]=22', 'token[0]=Blackburn', 'pos_tag_of_token[0]=22', 'first-letter-capitalized']}
{'input_ids': array([    1,    16,   145,  1872, 14094]), 'label': 2}


Map `input_ids` back to the original feature names to confirm that everything works:

In [23]:
# Invert the feature dictionary
idx2feat = { i: w for w, i in vectorizer.vocabulary_.items() }

feats = []
for idx in vectorized["input_ids"]:
    feats.append(idx2feat[idx-1])    # It is easy to forget we moved all by +1

# This is now the bag of features representation of the token in context
pprint(", ".join(feats))

('first-letter-capitalized, pos_tag_of_token[-1]=22, pos_tag_of_token[0]=22, '
 'token[-1]=Peter, token[0]=Blackburn')
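
To see what this vectorization amounts to, here is a simplified plain-Python stand-in for the steps above: build a vocabulary from lists of feature strings, then map one example to the sorted indices of its known features, shifted by +1 so that index 0 stays free. This is only a sketch under simplifying assumptions; `build_vocabulary` and `vectorize` are hypothetical helpers, and details such as how `CountVectorizer` breaks ties when applying `max_features` may differ.

```python
from collections import Counter


def build_vocabulary(feature_lists, max_features=None):
    """Map each feature string to an index, keeping only the most frequent ones."""
    # Count in how many examples each feature occurs (binary counting)
    counts = Counter(f for feats in feature_lists for f in set(feats))
    kept = [f for f, _ in counts.most_common(max_features)]
    # Indices follow the sorted order of the kept features
    return {f: i for i, f in enumerate(sorted(kept))}


def vectorize(feats, vocabulary):
    """Return sorted feature indices, shifted by +1 to reserve index 0."""
    return sorted({vocabulary[f] + 1 for f in feats if f in vocabulary})


feature_lists = [
    ["token[0]=EU", "first-letter-capitalized"],
    ["token[0]=rejects", "token[-1]=EU"],
    ["token[0]=German", "first-letter-capitalized"],
]
vocabulary = build_vocabulary(feature_lists)

# Features not seen when building the vocabulary are silently dropped
input_ids = vectorize(
    ["token[0]=EU", "first-letter-capitalized", "token[1]=unseen"], vocabulary
)
print(vocabulary)
print(input_ids)
```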


---

# Vectorizing the whole dataset

We'll again use [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) to process the whole dataset:

In [24]:
vectorized_dataset = flat_dataset.map(vectorize_example)

pprint(vectorized_dataset["train"][0])

Map: 100%|██████████| 203621/203621 [01:03<00:00, 3224.01 examples/s]
Map: 100%|██████████| 51362/51362 [00:15<00:00, 3352.51 examples/s]
Map: 100%|██████████| 46435/46435 [00:13<00:00, 3346.48 examples/s]

{'chunk_tags': 11,
 'features': ['token[0]=EU',
              'pos_tag_of_token[0]=22',
              'token[1]=rejects',
              'pos_tag_of_token[1]=42',
              'token[2]=German',
              'pos_tag_of_token[2]=16',
              'token[3]=call',
              'pos_tag_of_token[3]=21',
              'first-letter-capitalized'],
 'input_ids': [1, 145, 210, 227, 276, 14347, 23219, 27857],
 'label': 3,
 'ner_tags': 3,
 'pos_tags': 22,
 'tokens': 'EU'}





* Our `input_ids` are an array containing the indices of the features that are present for the token
* These indices select rows of the embedding matrix in the model


---

# Batching and padding

As detailed in the [text classification notebook](https://github.com/TurkuNLP/intro-to-nlp/blob/master/mlp_imdb_hf_dset_and_trainer.ipynb), we typically train neural networks on _batches_ of multiple examples rather than a single example at a time, for reasons of both efficiency and regularization.

As examples in a batch need to have identical length, we _pad_ shorter examples to the maximum example length in each batch with the "dummy" feature with index 0.

(This code is basically unchanged from the previous notebook.)

In [25]:
def collator(list_of_examples):
    # Labels are simply converted into a tensor
    batch={
        "labels": torch.tensor([e["label"] for e in list_of_examples])
    }

    # Examples need to be padded
    tensors = []

    # Find length of longest example
    max_len = max(len(e["input_ids"]) for e in list_of_examples)
    max_len = max(1,max_len)

    # Pad everything with zeros to length of longest example
    for example in list_of_examples:
        ids = torch.LongTensor(example["input_ids"])
        # torch.nn.functional.pad(tensor, (pad_left, pad_right)) pads with zeros by default;
        # here we pad on the right by (max_len - current length)
        padded = torch.nn.functional.pad(ids, (0, max_len-ids.shape[0]))
        tensors.append(padded)

    # Now that all examples are of the same length, vstack() can be used
    # to vertically stack these into a tensor
    batch["input_ids"]=torch.vstack(tensors)

    return batch

Test that out with a minimal batch of two examples, one requiring padding:

In [26]:
batch=collator([vectorized_dataset["train"][2], vectorized_dataset["train"][7]])

print("Shape of labels:",batch["labels"].shape)
print("Shape of input_ids:",batch["input_ids"].shape)
print("labels:",batch["labels"])
print("input_ids:",batch["input_ids"])

Shape of labels: torch.Size([2])
Shape of input_ids: torch.Size([2, 11])
labels: tensor([7, 0])
input_ids: tensor([[    1,    37,    60,   139,   188,   246,   292,  5659, 14459, 20267,
         26083],
        [   10,    75,   115,   144,   217,   872,  7164, 13380, 18173,     0,
             0]])


---

# MLP model

With the data now ready, we'll build the MLP model. Note that this is _identical_ to the MLP model we used for text classification: the only difference between the two applications is in the data.

The model class in its simplest form has `__init__()`, which instantiates the layers, and `forward()`, which implements the actual computation. For more information on these, please see the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).

In [27]:
# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self, config):
        super().__init__(config)

        self.vocab_size=config.vocab_size    # embedding matrix row count

        # Build and initialize embedding of vocab size +1 x hidden size
        # (+1 because of the padding index 0!)
        self.embedding = torch.nn.Embedding(
            num_embeddings=self.vocab_size+1,
            embedding_dim=config.hidden_size,
            padding_idx=0
        )

        # Initialize the embeddings with small random values
        torch.nn.init.uniform_(self.embedding.weight.data, -0.001, 0.001)
        # Enforce zero values for padding
        torch.nn.init.zeros_(self.embedding.weight.data[0,:])

        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(
            in_features=config.hidden_size,
            out_features=config.nlabels
        )

    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`:
    # - if given `labels`, returns (loss, output)
    # - if not, only returns (output,)
    def forward(self, input_ids, labels=None):
        # 1) Look up embeddings of features, sum them up
        embedded = self.embedding(input_ids)    # (batch, ids) -> (batch, ids, embedding_dim)
        embedded_summed = torch.sum(embedded, dim=1)    # (batch, ids, embedding_dim) -> (batch, embedding_dim)

        # NOTE: we're explicitly *not* applying a nonlinearity here to keep
        # things linear for later analysis

        # 2) Apply output layer
        # (batch, embedding_dim) -> (batch, num_classes)
        logits = self.output(embedded_summed)

        if labels is not None:
            # We have labels, so we ought to calculate the loss
            loss_fn = torch.nn.CrossEntropyLoss()    # Classification loss function
            loss = loss_fn(logits, labels)
            return (loss, logits)
        else:
            # No labels, so just return the logits
            return (logits,)
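
The first step of `forward()` can be pictured without any tensor library: each feature index selects one row of the embedding matrix, and the selected rows are added up. The sketch below uses a tiny made-up embedding matrix (the values are arbitrary) to show why padding with index 0 is harmless when row 0 is all zeros. The function `embed_and_sum` is a hypothetical illustration, not the code the model runs.

```python
# Tiny made-up embedding matrix: 4 rows (feature indices 0..3), 2 dimensions
embedding = [
    [0.0, 0.0],     # row 0: reserved for padding, kept at zero
    [0.1, -0.2],
    [0.3, 0.4],
    [-0.5, 0.6],
]


def embed_and_sum(input_ids, embedding):
    """Look up one row per feature index and sum the rows element-wise."""
    dim = len(embedding[0])
    summed = [0.0] * dim
    for idx in input_ids:
        for d in range(dim):
            summed[d] += embedding[idx][d]
    return summed


unpadded = embed_and_sum([1, 3], embedding)
padded = embed_and_sum([1, 3, 0, 0], embedding)    # same example, padded to length 4

print(unpadded)
print(padded)
```

Both calls give the same vector, so examples of different lengths can share a batch without the padding affecting the result.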

Configure the model:

In [28]:
num_labels = len(NER2ID)

mlp_config = MLPConfig(
    vocab_size=len(vectorizer.vocabulary_),
    hidden_size=20,
    nlabels=num_labels
)

---

# Train the model

We will use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class for training. It offers:

* Loads of arguments that control the training
* Configurable metrics to evaluate performance
* A data collator that builds the batches
* An early stopping callback that stops training when the evaluation loss no longer improves
* Model load/save
* A good foundation for a later deep learning course
  

First, let's create a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.TrainingArguments) object to specify hyperparameters and various other settings for training.

Printing this simple dataclass object will show not only the values we set, but also the defaults for all other arguments. Don't worry if you don't understand what all of these do! Many are not relevant to us here, and you can find the details in the [`Trainer` documentation](https://huggingface.co/docs/transformers/main_classes/trainer) if you are interested.

In [29]:
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4, #learning rate of the gradient descent
    max_steps=20000,
    load_best_model_at_end=True,
    per_device_train_batch_size=128
)

pprint(trainer_args)

TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=500,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_l
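
A little arithmetic helps in reading the `learning_rate` and `epoch` values that the trainer logs further below. The sketch assumes the `Trainer` default of a linear learning-rate schedule without warmup and training on a single device; the helper functions `linear_lr` and `epochs_at` are hypothetical names used only for this illustration.

```python
LEARNING_RATE = 1e-4        # learning_rate set in TrainingArguments above
MAX_STEPS = 20000           # max_steps set above
BATCH_SIZE = 128            # per_device_train_batch_size set above
TRAIN_EXAMPLES = 203621     # rows in the flattened training set


def linear_lr(step, base_lr=LEARNING_RATE, total_steps=MAX_STEPS):
    """Learning rate after `step` steps, assuming linear decay to zero without warmup."""
    return base_lr * (1 - step / total_steps)


def epochs_at(step, batch_size=BATCH_SIZE, num_examples=TRAIN_EXAMPLES):
    """Fraction of passes over the training data completed after `step` steps."""
    return step * batch_size / num_examples


for step in (500, 1000, 2000):
    print(step, linear_lr(step), round(epochs_at(step), 2))
```

Under these assumptions, 500 steps correspond to roughly 0.31 epochs, and the learning rate has dropped by 2.5% of its initial value.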

Next, let's create a metric for evaluating performance during and after training. We can use the convenience function [`evaluate.load`](https://huggingface.co/docs/evaluate/index) to load one of many pre-made metrics and wrap this for use by the trainer.

We can use the basic `accuracy` metric, defined as the proportion of correctly predicted labels out of all labels. This time, though, the labels are not evenly distributed: most tokens are outside any named entity (`O`), so accuracy can be high even for a model that has learned little about the entities themselves.
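
When labels are unevenly distributed, a useful sanity check is the accuracy of a baseline that always predicts the most frequent label. The sketch below computes this on a small made-up label list (the values are invented for illustration and are not taken from the dataset); `simple_accuracy` is a hypothetical stand-in for the `accuracy` metric.

```python
from collections import Counter


def simple_accuracy(predictions, references):
    """Proportion of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up labels: 8 of the 10 tokens have label 0 (the ID of "O")
references = [0, 0, 1, 0, 0, 0, 5, 0, 0, 0]

# Baseline: always predict the most frequent label
majority_label, _ = Counter(references).most_common(1)[0]
baseline_predictions = [majority_label] * len(references)
baseline_accuracy = simple_accuracy(baseline_predictions, references)

print(majority_label, baseline_accuracy)
```

A trained model should be judged against such a baseline rather than against zero.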

In [30]:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)

We can then create the `Trainer` and train the model by invoking the [`Trainer.train`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.train) function.

In addition to the model, the settings passed in through the `TrainingArguments` object created above (`trainer_args`), the data, and the metric defined above, we create and pass the following to the `Trainer`:

* [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator): groups input into batches
* [`EarlyStoppingCallback`](https://huggingface.co/docs/transformers/main_classes/callback#transformers.EarlyStoppingCallback): stops training when performance stops improving
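
The idea behind early stopping with patience can be sketched in a few lines of plain Python: keep track of the best evaluation loss seen so far and stop once a given number of consecutive evaluations fail to improve on it. This is a simplified illustration, not the library's implementation; the function `early_stopping_index` and the loss values are made up for the example.

```python
def early_stopping_index(eval_losses, patience):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            # New best loss: reset the counter
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            return i
    return None    # patience never ran out


# Made-up evaluation losses: improvement stalls after the third evaluation
losses = [0.99, 0.63, 0.53, 0.55, 0.54, 0.56]

stop_at = early_stopping_index(losses, patience=3)
never_stops = early_stopping_index(losses, patience=5)
print(stop_at, never_stops)
```

A larger patience tolerates longer plateaus in the evaluation loss at the cost of possibly training for longer than necessary.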

In [31]:
# Make a new model
mlp = MLP(mlp_config)


# The argument gives the patience: training is stopped when the
# evaluation loss has failed to improve in this many consecutive evaluations
early_stopping = transformers.EarlyStoppingCallback(early_stopping_patience=5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=vectorized_dataset["train"],
    eval_dataset=vectorized_dataset["validation"],
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

trainer.train()

  2%|▎         | 500/20000 [00:06<04:08, 78.43it/s]

{'loss': 1.6058, 'grad_norm': 0.8244581818580627, 'learning_rate': 9.75e-05, 'epoch': 0.31}


                                                   
  3%|▎         | 510/20000 [00:16<1:24:38,  3.84it/s]

{'eval_loss': 0.9984506964683533, 'eval_accuracy': 0.832444219461859, 'eval_runtime': 9.7664, 'eval_samples_per_second': 5259.058, 'eval_steps_per_second': 657.459, 'epoch': 0.31}


  5%|▌         | 1000/20000 [00:22<03:54, 81.01it/s] 

{'loss': 0.7508, 'grad_norm': 0.29746729135513306, 'learning_rate': 9.5e-05, 'epoch': 0.63}


                                                    
  5%|▌         | 1008/20000 [00:32<1:51:09,  2.85it/s]

{'eval_loss': 0.626826286315918, 'eval_accuracy': 0.8325999766364238, 'eval_runtime': 10.1231, 'eval_samples_per_second': 5073.728, 'eval_steps_per_second': 634.29, 'epoch': 0.63}


  8%|▊         | 1500/20000 [00:39<03:56, 78.29it/s]  

{'loss': 0.5676, 'grad_norm': 0.20013108849525452, 'learning_rate': 9.250000000000001e-05, 'epoch': 0.94}


                                                    
  8%|▊         | 1513/20000 [00:48<1:11:39,  4.30it/s]

{'eval_loss': 0.5344957113265991, 'eval_accuracy': 0.8522837895720572, 'eval_runtime': 8.9011, 'eval_samples_per_second': 5770.32, 'eval_steps_per_second': 721.374, 'epoch': 0.94}


 10%|█         | 2000/20000 [00:54<03:35, 83.64it/s]  

{'loss': 0.4942, 'grad_norm': 0.18634437024593353, 'learning_rate': 9e-05, 'epoch': 1.26}


                                                    
 10%|█         | 2011/20000 [01:03<1:09:09,  4.33it/s]

{'eval_loss': 0.4802820682525635, 'eval_accuracy': 0.8617265682800515, 'eval_runtime': 9.3497, 'eval_samples_per_second': 5493.415, 'eval_steps_per_second': 686.757, 'epoch': 1.26}


 12%|█▎        | 2500/20000 [01:10<03:45, 77.46it/s]  

{'loss': 0.4467, 'grad_norm': 0.17014947533607483, 'learning_rate': 8.75e-05, 'epoch': 1.57}


                                                    
 13%|█▎        | 2506/20000 [01:20<1:43:00,  2.83it/s]

{'eval_loss': 0.43987753987312317, 'eval_accuracy': 0.8683073089054164, 'eval_runtime': 9.5485, 'eval_samples_per_second': 5379.055, 'eval_steps_per_second': 672.46, 'epoch': 1.57}


 15%|█▌        | 3000/20000 [01:26<03:27, 82.08it/s]  

{'loss': 0.4088, 'grad_norm': 0.1574796736240387, 'learning_rate': 8.5e-05, 'epoch': 1.89}


                                                    
 15%|█▌        | 3016/20000 [01:35<1:05:32,  4.32it/s]

{'eval_loss': 0.40738043189048767, 'eval_accuracy': 0.877789026907052, 'eval_runtime': 9.3877, 'eval_samples_per_second': 5471.205, 'eval_steps_per_second': 683.981, 'epoch': 1.89}


 18%|█▊        | 3500/20000 [01:41<03:24, 80.64it/s]  

{'loss': 0.3879, 'grad_norm': 0.1427707076072693, 'learning_rate': 8.25e-05, 'epoch': 2.2}


                                                    
 18%|█▊        | 3514/20000 [01:51<1:02:51,  4.37it/s]

{'eval_loss': 0.3807116150856018, 'eval_accuracy': 0.8879911218410498, 'eval_runtime': 9.2228, 'eval_samples_per_second': 5569.023, 'eval_steps_per_second': 696.209, 'epoch': 2.2}


 20%|██        | 4000/20000 [01:57<03:19, 80.26it/s]  

{'loss': 0.3508, 'grad_norm': 0.13871832191944122, 'learning_rate': 8e-05, 'epoch': 2.51}


                                                    
 20%|██        | 4011/20000 [02:07<1:03:16,  4.21it/s]

{'eval_loss': 0.3581397831439972, 'eval_accuracy': 0.8959931466843192, 'eval_runtime': 9.3897, 'eval_samples_per_second': 5470.047, 'eval_steps_per_second': 683.836, 'epoch': 2.51}


 22%|██▎       | 4500/20000 [02:13<03:27, 74.75it/s]  

{'loss': 0.3337, 'grad_norm': 0.187480628490448, 'learning_rate': 7.75e-05, 'epoch': 2.83}


                                                    
 23%|██▎       | 4512/20000 [02:22<1:06:48,  3.86it/s]

{'eval_loss': 0.33958542346954346, 'eval_accuracy': 0.9029827498929169, 'eval_runtime': 9.3878, 'eval_samples_per_second': 5471.159, 'eval_steps_per_second': 683.975, 'epoch': 2.83}


 25%|██▌       | 5000/20000 [02:29<03:18, 75.56it/s]  

{'loss': 0.3161, 'grad_norm': 0.15909534692764282, 'learning_rate': 7.500000000000001e-05, 'epoch': 3.14}


                                                    
 25%|██▌       | 5008/20000 [02:38<1:28:07,  2.84it/s]

{'eval_loss': 0.3233102560043335, 'eval_accuracy': 0.9083953117090456, 'eval_runtime': 9.2988, 'eval_samples_per_second': 5523.537, 'eval_steps_per_second': 690.523, 'epoch': 3.14}


 28%|██▊       | 5500/20000 [02:44<02:59, 80.84it/s]  

{'loss': 0.297, 'grad_norm': 0.16669060289859772, 'learning_rate': 7.25e-05, 'epoch': 3.46}


                                                    
 28%|██▊       | 5511/20000 [02:54<54:54,  4.40it/s]  

{'eval_loss': 0.3089526295661926, 'eval_accuracy': 0.9127954518905027, 'eval_runtime': 9.2067, 'eval_samples_per_second': 5578.759, 'eval_steps_per_second': 697.426, 'epoch': 3.46}


 30%|███       | 6000/20000 [03:00<02:53, 80.54it/s]

{'loss': 0.2829, 'grad_norm': 0.14174656569957733, 'learning_rate': 7e-05, 'epoch': 3.77}


                                                    
 30%|███       | 6007/20000 [03:09<1:13:33,  3.17it/s]

{'eval_loss': 0.29627758264541626, 'eval_accuracy': 0.9159884739690822, 'eval_runtime': 9.0748, 'eval_samples_per_second': 5659.827, 'eval_steps_per_second': 707.561, 'epoch': 3.77}


 32%|███▎      | 6500/20000 [03:15<02:49, 79.51it/s]  

{'loss': 0.2683, 'grad_norm': 0.18344546854496002, 'learning_rate': 6.750000000000001e-05, 'epoch': 4.09}


                                                    
 33%|███▎      | 6514/20000 [03:25<52:45,  4.26it/s]  

{'eval_loss': 0.28461042046546936, 'eval_accuracy': 0.9195514193372533, 'eval_runtime': 9.2238, 'eval_samples_per_second': 5568.403, 'eval_steps_per_second': 696.132, 'epoch': 4.09}


 35%|███▌      | 7000/20000 [03:31<02:37, 82.44it/s]

{'loss': 0.2605, 'grad_norm': 0.17911693453788757, 'learning_rate': 6.500000000000001e-05, 'epoch': 4.4}


                                                    
 35%|███▌      | 7013/20000 [03:40<49:27,  4.38it/s]  

{'eval_loss': 0.27430418133735657, 'eval_accuracy': 0.9225497449476266, 'eval_runtime': 9.2518, 'eval_samples_per_second': 5551.564, 'eval_steps_per_second': 694.027, 'epoch': 4.4}


 38%|███▊      | 7500/20000 [03:46<02:32, 81.88it/s]

{'loss': 0.2449, 'grad_norm': 0.18445806205272675, 'learning_rate': 6.25e-05, 'epoch': 4.71}


                                                    
 38%|███▊      | 7513/20000 [03:55<48:19,  4.31it/s]  

{'eval_loss': 0.2651596665382385, 'eval_accuracy': 0.925002920447023, 'eval_runtime': 9.0749, 'eval_samples_per_second': 5659.77, 'eval_steps_per_second': 707.554, 'epoch': 4.71}


 40%|████      | 8000/20000 [04:02<02:34, 77.50it/s]

{'loss': 0.2366, 'grad_norm': 0.1476937085390091, 'learning_rate': 6e-05, 'epoch': 5.03}


                                                    
 40%|████      | 8011/20000 [04:11<47:00,  4.25it/s]  

{'eval_loss': 0.2566370964050293, 'eval_accuracy': 0.9271056423036486, 'eval_runtime': 9.2468, 'eval_samples_per_second': 5554.574, 'eval_steps_per_second': 694.403, 'epoch': 5.03}


 42%|████▎     | 8500/20000 [04:17<02:25, 78.83it/s]

{'loss': 0.2237, 'grad_norm': 0.160065159201622, 'learning_rate': 5.7499999999999995e-05, 'epoch': 5.34}


                                                    
 43%|████▎     | 8506/20000 [04:27<1:04:13,  2.98it/s]

{'eval_loss': 0.2493765503168106, 'eval_accuracy': 0.9282348818192439, 'eval_runtime': 9.2653, 'eval_samples_per_second': 5543.471, 'eval_steps_per_second': 693.015, 'epoch': 5.34}


 45%|████▌     | 9000/20000 [04:33<02:18, 79.43it/s]  

{'loss': 0.2233, 'grad_norm': 0.17599479854106903, 'learning_rate': 5.500000000000001e-05, 'epoch': 5.66}


                                                    
 45%|████▌     | 9007/20000 [04:42<1:03:07,  2.90it/s]

{'eval_loss': 0.243071049451828, 'eval_accuracy': 0.929461469568942, 'eval_runtime': 9.3057, 'eval_samples_per_second': 5519.384, 'eval_steps_per_second': 690.004, 'epoch': 5.66}


 48%|████▊     | 9500/20000 [04:48<02:12, 79.47it/s]  

{'loss': 0.2162, 'grad_norm': 0.11292988806962967, 'learning_rate': 5.25e-05, 'epoch': 5.97}


                                                    
 48%|████▊     | 9510/20000 [04:58<42:14,  4.14it/s]

{'eval_loss': 0.23687264323234558, 'eval_accuracy': 0.9312526770764379, 'eval_runtime': 9.4406, 'eval_samples_per_second': 5440.519, 'eval_steps_per_second': 680.144, 'epoch': 5.97}


 50%|█████     | 10000/20000 [05:04<01:58, 84.05it/s]

{'loss': 0.2066, 'grad_norm': 0.16909322142601013, 'learning_rate': 5e-05, 'epoch': 6.29}


                                                     
 50%|█████     | 10012/20000 [05:14<39:23,  4.23it/s]

{'eval_loss': 0.23196715116500854, 'eval_accuracy': 0.9322456290642888, 'eval_runtime': 9.3687, 'eval_samples_per_second': 5482.298, 'eval_steps_per_second': 685.367, 'epoch': 6.29}


 52%|█████▎    | 10500/20000 [05:20<01:56, 81.40it/s]

{'loss': 0.2014, 'grad_norm': 0.20081928372383118, 'learning_rate': 4.75e-05, 'epoch': 6.6}


                                                     
 53%|█████▎    | 10509/20000 [05:29<50:41,  3.12it/s]

{'eval_loss': 0.22695891559123993, 'eval_accuracy': 0.9331996417584985, 'eval_runtime': 9.3267, 'eval_samples_per_second': 5506.967, 'eval_steps_per_second': 688.451, 'epoch': 6.6}


 55%|█████▌    | 11000/20000 [05:35<01:52, 80.27it/s]

{'loss': 0.2003, 'grad_norm': 0.16003520786762238, 'learning_rate': 4.5e-05, 'epoch': 6.91}


                                                     
 55%|█████▌    | 11011/20000 [05:45<35:29,  4.22it/s]

{'eval_loss': 0.22229568660259247, 'eval_accuracy': 0.9343288812740936, 'eval_runtime': 9.1657, 'eval_samples_per_second': 5603.705, 'eval_steps_per_second': 700.545, 'epoch': 6.91}


 57%|█████▊    | 11500/20000 [05:51<01:41, 83.97it/s]

{'loss': 0.1894, 'grad_norm': 0.19198378920555115, 'learning_rate': 4.25e-05, 'epoch': 7.23}


                                                     
 58%|█████▊    | 11506/20000 [06:00<45:36,  3.10it/s]

{'eval_loss': 0.2185603678226471, 'eval_accuracy': 0.9354775904365095, 'eval_runtime': 9.2758, 'eval_samples_per_second': 5537.221, 'eval_steps_per_second': 692.233, 'epoch': 7.23}


 60%|██████    | 12000/20000 [06:06<01:35, 83.71it/s]

{'loss': 0.1929, 'grad_norm': 0.15697139501571655, 'learning_rate': 4e-05, 'epoch': 7.54}


                                                     
 60%|██████    | 12006/20000 [06:16<41:54,  3.18it/s]

{'eval_loss': 0.2148643583059311, 'eval_accuracy': 0.9366652388925665, 'eval_runtime': 9.2508, 'eval_samples_per_second': 5552.173, 'eval_steps_per_second': 694.103, 'epoch': 7.54}


 62%|██████▎   | 12500/20000 [06:22<01:36, 77.97it/s]

{'loss': 0.1868, 'grad_norm': 0.1652078479528427, 'learning_rate': 3.7500000000000003e-05, 'epoch': 7.86}


                                                     
 63%|██████▎   | 12512/20000 [06:31<30:55,  4.04it/s]

{'eval_loss': 0.21180714666843414, 'eval_accuracy': 0.9378334177018028, 'eval_runtime': 9.1729, 'eval_samples_per_second': 5599.35, 'eval_steps_per_second': 700.001, 'epoch': 7.86}


 65%|██████▌   | 13000/20000 [06:37<01:26, 80.70it/s]

{'loss': 0.1794, 'grad_norm': 0.1509096920490265, 'learning_rate': 3.5e-05, 'epoch': 8.17}


                                                     
 65%|██████▌   | 13009/20000 [06:47<27:57,  4.17it/s]

{'eval_loss': 0.20885561406612396, 'eval_accuracy': 0.9384564464000623, 'eval_runtime': 9.2848, 'eval_samples_per_second': 5531.858, 'eval_steps_per_second': 691.563, 'epoch': 8.17}


 68%|██████▊   | 13500/20000 [06:53<01:21, 80.15it/s]

{'loss': 0.1778, 'grad_norm': 0.2160128802061081, 'learning_rate': 3.2500000000000004e-05, 'epoch': 8.49}


                                                     
 68%|██████▊   | 13509/20000 [07:02<35:00,  3.09it/s]

{'eval_loss': 0.2063482701778412, 'eval_accuracy': 0.9394883376815545, 'eval_runtime': 9.2998, 'eval_samples_per_second': 5522.933, 'eval_steps_per_second': 690.447, 'epoch': 8.49}


 70%|███████   | 14000/20000 [07:09<01:20, 74.36it/s]

{'loss': 0.1782, 'grad_norm': 0.1755862683057785, 'learning_rate': 3e-05, 'epoch': 8.8}


                                                     
 70%|███████   | 14012/20000 [07:18<25:10,  3.96it/s]

{'eval_loss': 0.20409664511680603, 'eval_accuracy': 0.940150305673455, 'eval_runtime': 9.2928, 'eval_samples_per_second': 5527.1, 'eval_steps_per_second': 690.968, 'epoch': 8.8}


 72%|███████▎  | 14500/20000 [07:24<01:09, 79.08it/s]

{'loss': 0.17, 'grad_norm': 0.1540101170539856, 'learning_rate': 2.7500000000000004e-05, 'epoch': 9.11}


                                                     
 73%|███████▎  | 14513/20000 [07:34<22:42,  4.03it/s]

{'eval_loss': 0.20193910598754883, 'eval_accuracy': 0.9408122736653557, 'eval_runtime': 9.2388, 'eval_samples_per_second': 5559.379, 'eval_steps_per_second': 695.004, 'epoch': 9.11}


 75%|███████▌  | 15000/20000 [07:40<01:03, 79.30it/s]

{'loss': 0.1734, 'grad_norm': 0.18108461797237396, 'learning_rate': 2.5e-05, 'epoch': 9.43}


                                                     
 75%|███████▌  | 15014/20000 [07:49<19:30,  4.26it/s]

{'eval_loss': 0.2001609057188034, 'eval_accuracy': 0.9412016666017679, 'eval_runtime': 9.3018, 'eval_samples_per_second': 5521.745, 'eval_steps_per_second': 690.299, 'epoch': 9.43}


 78%|███████▊  | 15500/20000 [07:55<00:56, 79.77it/s]

{'loss': 0.1709, 'grad_norm': 0.15790991485118866, 'learning_rate': 2.25e-05, 'epoch': 9.74}


                                                     
 78%|███████▊  | 15513/20000 [08:05<17:54,  4.18it/s]

{'eval_loss': 0.19833721220493317, 'eval_accuracy': 0.9417273470659242, 'eval_runtime': 9.043, 'eval_samples_per_second': 5679.775, 'eval_steps_per_second': 710.055, 'epoch': 9.74}


 80%|████████  | 16000/20000 [08:11<00:50, 78.67it/s]

{'loss': 0.1655, 'grad_norm': 0.18245558440685272, 'learning_rate': 2e-05, 'epoch': 10.06}


                                                     
 80%|████████  | 16009/20000 [08:20<16:38,  4.00it/s]

{'eval_loss': 0.19692155718803406, 'eval_accuracy': 0.9420972703555157, 'eval_runtime': 9.3198, 'eval_samples_per_second': 5511.082, 'eval_steps_per_second': 688.966, 'epoch': 10.06}


 82%|████████▎ | 16500/20000 [08:27<00:43, 80.97it/s]

{'loss': 0.1655, 'grad_norm': 0.18488885462284088, 'learning_rate': 1.75e-05, 'epoch': 10.37}


                                                     
 83%|████████▎ | 16513/20000 [08:36<13:28,  4.32it/s]

{'eval_loss': 0.1956760287284851, 'eval_accuracy': 0.9425645418792103, 'eval_runtime': 9.3757, 'eval_samples_per_second': 5478.207, 'eval_steps_per_second': 684.856, 'epoch': 10.37}


 85%|████████▌ | 17000/20000 [08:42<00:35, 85.10it/s]

{'loss': 0.1643, 'grad_norm': 0.15298046171665192, 'learning_rate': 1.5e-05, 'epoch': 10.69}


                                                     
 85%|████████▌ | 17008/20000 [08:51<15:18,  3.26it/s]

{'eval_loss': 0.19460168480873108, 'eval_accuracy': 0.94287605622834, 'eval_runtime': 9.029, 'eval_samples_per_second': 5688.581, 'eval_steps_per_second': 711.156, 'epoch': 10.69}


 88%|████████▊ | 17500/20000 [08:58<00:32, 77.62it/s]

{'loss': 0.1647, 'grad_norm': 0.2053229808807373, 'learning_rate': 1.25e-05, 'epoch': 11.0}


                                                     
 88%|████████▊ | 17505/20000 [09:07<14:21,  2.90it/s]

{'eval_loss': 0.1937597244977951, 'eval_accuracy': 0.9431486312838285, 'eval_runtime': 9.2298, 'eval_samples_per_second': 5564.792, 'eval_steps_per_second': 695.68, 'epoch': 11.0}


 90%|█████████ | 18000/20000 [09:13<00:24, 82.48it/s]

{'loss': 0.1625, 'grad_norm': 0.16734929382801056, 'learning_rate': 1e-05, 'epoch': 11.31}


                                                     
 90%|█████████ | 18012/20000 [09:23<07:33,  4.39it/s]

{'eval_loss': 0.19307014346122742, 'eval_accuracy': 0.9433043884583934, 'eval_runtime': 9.2071, 'eval_samples_per_second': 5578.509, 'eval_steps_per_second': 697.395, 'epoch': 11.31}


 92%|█████████▎| 18500/20000 [09:29<00:20, 74.72it/s]

{'loss': 0.1621, 'grad_norm': 0.1451369971036911, 'learning_rate': 7.5e-06, 'epoch': 11.63}


                                                     
 93%|█████████▎| 18507/20000 [09:38<08:46,  2.83it/s]

{'eval_loss': 0.19248662889003754, 'eval_accuracy': 0.9434990849265994, 'eval_runtime': 9.2958, 'eval_samples_per_second': 5525.318, 'eval_steps_per_second': 690.745, 'epoch': 11.63}


 95%|█████████▌| 19000/20000 [09:44<00:11, 83.73it/s]

{'loss': 0.1605, 'grad_norm': 0.12593694031238556, 'learning_rate': 5e-06, 'epoch': 11.94}


                                                     
 95%|█████████▌| 19012/20000 [09:54<03:53,  4.23it/s]

{'eval_loss': 0.19206297397613525, 'eval_accuracy': 0.94351855457342, 'eval_runtime': 9.2828, 'eval_samples_per_second': 5533.047, 'eval_steps_per_second': 691.712, 'epoch': 11.94}


 98%|█████████▊| 19500/20000 [10:00<00:06, 80.56it/s]

{'loss': 0.1606, 'grad_norm': 0.17351014912128448, 'learning_rate': 2.5e-06, 'epoch': 12.26}


                                                     
 98%|█████████▊| 19508/20000 [10:09<02:36,  3.14it/s]

{'eval_loss': 0.1918441504240036, 'eval_accuracy': 0.9435964331607025, 'eval_runtime': 9.0749, 'eval_samples_per_second': 5659.769, 'eval_steps_per_second': 707.554, 'epoch': 12.26}


100%|██████████| 20000/20000 [10:15<00:00, 80.64it/s]

{'loss': 0.16, 'grad_norm': 0.14675122499465942, 'learning_rate': 0.0, 'epoch': 12.57}


                                                     
100%|██████████| 20000/20000 [10:25<00:00, 31.99it/s]

{'eval_loss': 0.19175395369529724, 'eval_accuracy': 0.9435964331607025, 'eval_runtime': 9.2568, 'eval_samples_per_second': 5548.588, 'eval_steps_per_second': 693.654, 'epoch': 12.57}
{'train_runtime': 625.177, 'train_samples_per_second': 4094.84, 'train_steps_per_second': 31.991, 'train_loss': 0.2902182357788086, 'epoch': 12.57}





TrainOutput(global_step=20000, training_loss=0.2902182357788086, metrics={'train_runtime': 625.177, 'train_samples_per_second': 4094.84, 'train_steps_per_second': 31.991, 'train_loss': 0.2902182357788086, 'epoch': 12.57})
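
The logged `learning_rate` values look consistent with a linear decay to zero over the 20000 training steps, with no warmup. The sketch below reproduces the logged values under that assumption. The function name `linear_decay_lr` and the initial rate of 1e-4 are inferred from the log and are not taken from the training arguments.

```python
def linear_decay_lr(step, total_steps=20000, initial_lr=1e-4):
    """Learning rate after `step` steps, assuming linear decay to zero (inferred, not confirmed)."""
    return initial_lr * (1 - step / total_steps)

# Compare against the values printed in the training log
for step in (3000, 4000, 10000, 20000):
    print(step, linear_decay_lr(step))
```

If the schedule really is linear, the small late-stage improvements in `eval_accuracy` are partly explained by the shrinking step size.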

We can then evaluate the trained model on a given dataset (here our test subset) by calling [`Trainer.evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate):

In [32]:
eval_results = trainer.evaluate(vectorized_dataset["test"])

print("Accuracy:", eval_results["eval_accuracy"])

100%|██████████| 5805/5805 [00:08<00:00, 716.69it/s]

Accuracy: 0.9308711101539787





That's fairly poor performance for a task as simple as POS tagging, where state-of-the-art accuracies are generally 97-99%. (The approach demonstrated in this notebook should be considered a teaching tool rather than a serious tagger implementation.)

However, the result is certainly much better than random, so we can conclude that the model is learning something about the task.
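
As a reminder of what the accuracy number means, here is a minimal sketch of token-level accuracy computed from model outputs. The function `token_accuracy` and the toy arrays are illustrative assumptions, not the metric code used by the `Trainer` above.

```python
import numpy as np

def token_accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the gold label."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == labels))

# Toy data: 4 examples, 3 classes (made up for illustration)
toy_logits = np.array([[2.0, 0.1, -1.0],
                       [0.3, 1.5, 0.2],
                       [0.9, 0.8, 1.1],
                       [1.2, 0.4, 0.0]])
toy_labels = np.array([0, 1, 0, 0])

print(token_accuracy(toy_logits, toy_labels))
```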

---

# Save model for later use

* You can save it with `trainer.save_model()`
* You can load it with `MLP.from_pretrained()`


In [33]:
trainer.save_model("mlp-postagger")

---

# What has the model learned?

* The embeddings should have some meaning to them
* Similar features should have similar embeddings

In [34]:
# Grab the embedding matrix out of the trained model
# and drop the first row (padding 0)
# then we can treat the embeddings as vectors

weights=mlp.embedding.weight.detach().cpu().numpy()
weights=weights[1:,:]

In [45]:
# Imports repeated here so that this cell can be run on its own
import numpy as np
import sklearn.metrics

qry_idx=vectorizer.vocabulary_["token[0]=David"]

# Calculate the distance of the query embedding to all other embeddings
distance_to_qry=sklearn.metrics.pairwise.euclidean_distances(weights[qry_idx:qry_idx+1,:],weights)
nearest_neighbors=np.argsort(distance_to_qry) # indices of the features nearest to the query
for nearest in nearest_neighbors[0,:20]:
    print(idx2feat[nearest])

token[0]=David
token[0]=Paul
token[0]=John
token[0]=Michael
token[-1]=c
token[0]=M.
token[0]=A.
token[-1]=champion
token[0]=Wasim
token[-2]=Prime
token[0]=Mark
token[0]=Peter
token[-3]=c
token[-1]=4.
token[1]=Akram
token[0]=Bill
token[-1]=6.
token[0]=G.
token[-1]=7.
token[-1]=midfielder


The closest embeddings to a first name are almost all other first names or initials of first names. Several context features are also nearby: first names are often preceded by a number or by a "c", which presumably stands for captain. Titles also precede first names.

In [46]:
qry_idx=vectorizer.vocabulary_["token[0]=NATO"]

# Calculate the distance of the query embedding to all other embeddings
distance_to_qry=sklearn.metrics.pairwise.euclidean_distances(weights[qry_idx:qry_idx+1,:],weights)
nearest_neighbors=np.argsort(distance_to_qry) # indices of the features nearest to the query
for nearest in nearest_neighbors[0,:20]:
    print(idx2feat[nearest])

token[0]=NATO
token[0]=PSV
token[0]=PUK
token[0]=EU
token[0]=Ajax
token[0]=Oakland
token[0]=Honda
token[0]=KDP
token[0]=Marseille
token[0]=Senate
token[0]=Gencor
token[3]=v
token[0]=Feyenoord
token[0]=Yorkshire
token[0]=Atletico
token[0]=U.N.
token[-3]=v
token[0]=Lens
token[0]=Barrick
token[0]=Milwaukee


The nearest neighbors of an organization name are mostly other organizations. This corpus seems to include some text about football: "v" probably stands for "versus", which is written between the names of two football teams.
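
The lookups above use Euclidean distance. Cosine similarity is a common alternative that ignores vector length. The sketch below shows the idea on made-up two-dimensional embeddings. The helper `nearest_features` and the toy data are assumptions for illustration and are not part of the trained model.

```python
import numpy as np

def nearest_features(query, embeddings, names, k=3):
    """Return the k feature names most similar to `query` by cosine similarity."""
    q = embeddings[names.index(query)]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q)
    similarities = embeddings @ q / norms
    order = np.argsort(-similarities)  # descending similarity
    return [names[i] for i in order[:k]]

# Toy embeddings (made up): two name-like and two organization-like features
toy_names = ["token[0]=David", "token[0]=Paul", "token[0]=NATO", "token[0]=EU"]
toy_embeddings = np.array([[1.0, 0.1],
                           [0.9, 0.2],
                           [0.1, 1.0],
                           [0.2, 0.9]])

print(nearest_features("token[0]=David", toy_embeddings, toy_names, k=2))
```

As with the Euclidean version, the query itself comes out first, since it is maximally similar to itself.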

* The embeddings indeed seem to reflect the task and capture aspects of the meaning of words relevant to the task
* But now we have many classes, so we should take that into account too
* We can take the dot-product of the feature embeddings with the output layer weight of the class we care about
* If you think about how the information propagates through the network, this gives us a single number for each feature with respect to the selected label
* Technically speaking, it is the prediction for an example which only has that one feature, with respect to that one class
* Here is how we can implement it (here we rely on the fact that the model is linear, since we didn't include a nonlinearity earlier in the model's `forward()`)

In [36]:
import numpy

embedding_weights=weights    #shape (features, embedding-dim)
output_weights=mlp.output.weight.detach().cpu().numpy()    #shape (num-labels, embedding-dim)

# We just matrix-multiply these together, since this gives us all the dot-products
weights_by_label=numpy.matmul(embedding_weights, output_weights.T)
weights_by_label.shape

(30000, 9)
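
To see why this matrix product can be read as per-feature scores, here is a small self-contained check on random toy weights. It assumes a model that sums the embeddings of the active features and applies a linear output layer without a bias term. The model trained above may differ in those details (for example, it may have a bias), so treat this as a sketch of the reasoning only. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 4))   # (features, embedding-dim)
toy_output = rng.normal(size=(3, 4))       # (num-labels, embedding-dim)

# All feature-vs-label dot products at once
scores = toy_embeddings @ toy_output.T     # (features, num-labels)

def forward_linear(feature_ids):
    """Assumed linear model: sum the feature embeddings, then apply the output layer (no bias)."""
    hidden = toy_embeddings[feature_ids].sum(axis=0)
    return toy_output @ hidden

# An example with a single active feature gets exactly that feature's row of scores
print(np.allclose(forward_linear([2]), scores[2]))

# With several features, the per-feature scores simply add up
print(np.allclose(forward_linear([1, 4]), scores[1] + scores[4]))
```

This additivity is what makes the per-feature numbers interpretable. With a nonlinearity between the layers it would no longer hold.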

In [47]:
def get_most_important_features_for_and_against(label):
    label_idx = NER2ID[label]
    feature_weights = weights_by_label[:,label_idx] #pick the column that interests us

    #The shape of feature_weights is (feature_vocab_size,) i.e. it is a vector
    features_weight_idx = numpy.argsort(-feature_weights) #sort in descending order, this will be vector of indices
    features_for = [idx2feat[feature_idx] for feature_idx in features_weight_idx[:20]]
    features_against = [idx2feat[feature_idx] for feature_idx in features_weight_idx[-20:][::-1]]
    return features_for, features_against

for label in NER_TAG_NAMES:
    dt_plus,dt_minus=get_most_important_features_for_and_against(label)
    print(f"Most important features *for* label {label}:")
    pprint("   ".join(dt_plus))
    print()
    print(f"Most important features *against* label {label}:")
    pprint("   ".join(dt_minus))
    print("\n------\n")

Most important features *for* label O:
('pos_tag_of_token[0]=28   pos_tag_of_token[0]=11   token[0]="   '
 'pos_tag_of_token[0]=0   token[0]=to   token[0]=a   token[0]=.   '
 'pos_tag_of_token[0]=7   token[0]=AT   token[0]=said   token[0]=Thursday   '
 'token[0]=He   token[0]=0   token[0]=It   token[0]=-   token[0]=Tuesday   '
 'token[0]=Monday   pos_tag_of_token[0]=8   token[0]=Wednesday   '
 'token[0]=Minister')

Most important features *against* label O:
('first-letter-capitalized   pos_tag_of_token[0]=23   token[0]=Russian   '
 'pos_tag_of_token[0]=22   token[0]=&   token[0]=British   token[0]=Iraqi   '
 'token[0]=German   token[0]=de   token[0]=Palestinian   token[0]=European   '
 'token[0]=Kurdish   token[0]=French   token[0]=Israeli   token[0]=U.S.   '
 'token[1]=Oval   token[-1]=South   token[-1]=Lloyd   token[0]=RTRS   '
 'token[-1]=New')

------

Most important features *for* label B-PER:
('token[-1]=President   token[-1]=Minister   token[-1]=b   token[0]=Clinton   '
 'token[

We can see that the model does quite well.

POS tags 22 and 28 appear prominently among the most important features for and against named entities. This makes sense, since they correspond to "NNP" (proper noun, singular) and "PRP" (personal pronoun). Personal pronouns weigh heavily against named entity labels, since a pronoun refers to a named entity and therefore takes its place in the text.

The "I-" labels are preceded by other named entity tokens.