# Classroom 6 - Training a Named Entity Recognition Model with an LSTM

The classroom today is primarily geared towards preparing you for Assignment 4 which you'll be working on after today. The notebook is split into three main parts to get you thinking. You should work through these sections in groups together in class. 

If you have any questions or things you don't understand, make a note of them so you can remember to ask - or, even better, post them to Slack!

If you get through everything here, make a start on the assignment. If you don't, don't worry about it - but I suggest you finish all of the exercises here before starting the assignment.

## 1. A very short intro to NER
Named entity recognition (NER), also known as named entity extraction or entity identification, is the task of locating named entities in unstructured text and classifying them into predefined categories such as names, medical codes, quantities or similar.

The most common variant is the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) format, which uses the categories person (PER), organization (ORG), location (LOC) and miscellaneous (MISC); the last of these covers cases such as nationalities. For example:

*Hello my name is $Ross_{PER}$ I live in $Aarhus_{LOC}$ and work at $AU_{ORG}$.*

Let's see how this works with ```spaCy```. NB: you might need to remember to install a ```spaCy``` model:

```python -m spacy download en_core_web_sm```

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello my name is Ross. I live in Denmark and work at Aarhus University, I am Scottish and today is Friday 27th.")

In [3]:
from spacy import displacy
displacy.render(doc, style="ent")

## Tagging standards
There are different tagging standards for NER. The most widely used is the BIO format, which frames the task as token classification: each token is marked as being at the beginning of, inside, or outside an entity.

Words marked with *O* are not part of a named entity. Tags which start with *B-\** mark the first word of an entity (e.g. *B-ORG* for the *Aarhus* in *Aarhus University*), while *I-\** tags mark the continuation of that entity (e.g. *I-ORG* for *University*).

    B = Beginning
    I = Inside
    O = Outside
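
To make the tagging scheme concrete, here is a minimal sketch that turns entity spans into BIO tags. The function name ```spans_to_bio``` and the ```(start, end, label)``` span format (with an exclusive end index) are assumptions made for this illustration; they don't come from ```spaCy``` or from the course code.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to one BIO tag per token.

    tokens: list of str
    spans:  list of (start, end, label) tuples, end exclusive (hypothetical format)
    """
    tags = ["O"] * len(tokens)           # everything starts as "outside"
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):  # any remaining tokens of the entity
            tags[i] = f"I-{label}"
    return tags


tokens = ["I", "work", "at", "Aarhus", "University", "in", "Denmark", "."]
spans = [(3, 5, "ORG"), (6, 7, "LOC")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)
```

Note that a single-word entity like *Denmark* still gets a *B-* tag; it just has no *I-* tags following it.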

<details>
<summary>Q: What other formats and standards are available? What kinds of entities do they make it possible to tag?</summary>
<br>
You can see more examples in the spaCy documentation for their [different models](https://spacy.io/models/en).
</details>

Answer: https://towardsdatascience.com/named-entity-recognition-ner-using-spacy-nlp-part-4-28da2ece57c6 

In [4]:
for t in doc:
    if t.ent_type:
        print(t, f"{t.ent_iob_}-{t.ent_type_}")
    else:
        print(t, t.ent_iob_)

Hello O
my O
name O
is O
Ross B-PERSON
. O
I O
live O
in O
Denmark B-GPE
and O
work O
at O
Aarhus B-ORG
University I-ORG
, O
I O
am O
Scottish B-NORP
and O
today B-DATE
is O
Friday B-DATE
27th I-DATE
. O


### Some challenges with NER
While NER is currently framed as above, this formulation does have some limitations.

For instance, the entity Aarhus University refers both to the location Aarhus and to the university within Aarhus. Nested NER (N-NER) therefore argues that it would be more correct to tag it in a nested fashion as \[\[$Aarhus_{LOC}$\] $University$\]$_{ORG}$ (Plank, 2020).

Another related task is named entity linking, which is the task of linking an entity to e.g. a Wikipedia entry. Here you have to know both that something is indeed an entity and which entity it is (if it is indeed a defined entity).

In this assignment, we'll be using Bi-LSTMs to train an NER model on a predefined data set which uses IOB tags of the kind we outlined above.

## 2. Training in batches

When you trained your document classifier for the last assignment, you probably noticed that the neural network was quite brittle. Small changes in the hyperparameters could cause massive changes in performance. Likewise, you probably noticed that they tend to substantially overfit the training data and underperform on the validation and test data.

One way we can get around this is by processing the data in smaller chunks known as *batches*. 

<details>
<summary>Q: Why might it be a good idea to train on batches, rather than the whole dataset?</summary>
<br>
These batches are usually small (something like 32 instances at a time) but they have a couple of important effects on training:

- The instances in a batch can be processed in parallel, rather than sequentially. This can result in a substantial speed-up from a computational perspective
- Similarly, smaller batch sizes make it easier to fit training data into memory
- Lastly, gradient updates computed on smaller batches are noisier, which has a regularizing effect and thus leads to less overfitting.

In this assignment, we're going to be using batches of data to train our NER model. To do that, we first have to prepare our batches for training. You can read more about batching in [this blog post](https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/).

</details>



A: To avoid overfitting

In [8]:
# this allows us to look one step up in the directory
# for importing custom modules from src
import sys
sys.path.append("..")
from src.util import batch
from src.LSTM import RNN
from src.embedding import gensim_to_torch_embedding

# numpy and pytorch
import numpy as np
import torch

# loading data and embeddings
from datasets import load_dataset
import gensim.downloader as api

# use pip install torch datasets gensim

We can download the dataset using the ```load_dataset()``` function we've already seen. Here we take only the training data.

When you've downloaded the dataset, you're welcome to save a local copy so that you don't need to download it again every time the code runs.

Q: What do the ```train.features``` values refer to?

In [9]:
train.features
train[0]

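NameError: name 'train' is not defined

(This error appears because ```train``` is only created in the dataset-loading cell further down; run that cell first and then re-run this one.)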

Each example contains the tokens of a sentence together with its tags, stored as numbers. To see what a number in ner_tags means, we look it up in train.features: for example, the number 3 refers to the start of an organisation (B-ORG), meaning that the first word "EU" is an organisation. We also have pos_tags (part-of-speech tags) and chunk_tags (syntactic chunk tags).
So we need to look up each id to find the label it stands for. With a bit of code this lookup can be automated.
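
As a rough sketch of how that lookup could be automated: the snippet below decodes integer tags into label names. The label list is written out by hand here, in the order used by the CoNLL-2003 tag set, and the ```example``` dictionary is a toy stand-in for one row of the dataset. With the real dataset you would read the names from ```train.features["ner_tags"].feature.names``` rather than typing them in.

```python
# Label names in the order used by the CoNLL-2003 tag set (typed in by hand for
# this sketch; with the real dataset, use train.features["ner_tags"].feature.names)
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# A toy example in the same shape as one row of the dataset
example = {
    "tokens": ["EU", "rejects", "German", "call"],
    "ner_tags": [3, 0, 7, 0],
}

# Look up the name for each integer tag
decoded = [label_names[i] for i in example["ner_tags"]]

for token, label in zip(example["tokens"], decoded):
    print(token, label)
```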

In [11]:
# DATASET
dataset = load_dataset("conllpp")
train = dataset["train"]

#train.to_csv("/work/NLP-AU/train.csv")
# inspect the dataset
train["tokens"][:1]
train["ner_tags"][:1]

# get number of classes
num_classes = train.features["ner_tags"].feature.num_classes

Downloading and preparing dataset conllpp/conllpp (download: 4.63 MiB, generated: 9.78 MiB, post-processed: Unknown size, total: 14.41 MiB) to /home/coder/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2...


Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

Dataset conllpp downloaded and prepared to /home/coder/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2. Subsequent calls will reuse this data.



We then use ```gensim``` to get some pretrained word embeddings for the input layer to the model. 

In this example, we're going to use a GloVe model pretrained on Wikipedia and Gigaword text, with 50 dimensions.

I've provided a helper function to take the ```gensim``` embeddings and prepare them for ```pytorch```.

In [12]:
# CONVERTING EMBEDDINGS
model = api.load("glove-wiki-gigaword-50")

# convert gensim word embedding to torch word embedding
embedding_layer, vocab = gensim_to_torch_embedding(model)



### Preparing a batch

The first thing we want to do is to shuffle our dataset before training. 

Why might it be a good idea to shuffle the data?

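To avoid the data being sequential. Shuffling changes the order of the sentences (not of the words inside each sentence), so the model cannot pick up on the order in which the examples happen to appear in the dataset, and each batch contains a more random mix of examples.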
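
Here is a small sketch of what shuffling does, using a plain Python list instead of a ```datasets``` object. The toy sentences are made up for the illustration. The point is that the order of the sentences changes, while the words within each sentence stay exactly where they were.

```python
import random

# Toy "dataset": a list of tokenised sentences
sentences = [
    ["EU", "rejects", "German", "call"],
    ["Peter", "Blackburn"],
    ["BRUSSELS", "1996-08-22"],
    ["The", "European", "Commission", "said"],
]

rng = random.Random(1)    # fixed seed, so the shuffle is reproducible
shuffled = sentences[:]   # shallow copy, so the original order is kept
rng.shuffle(shuffled)     # reorders the sentences, not the words inside them

for sent in shuffled:
    print(sent)
```

Using a fixed seed (like ```seed=1``` in the cell below) means you get the same order every time you run the code, which makes results easier to compare.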

In [13]:
# shuffle dataset
shuffled_train = dataset["train"].shuffle(seed=1)

Next, we want to bundle the shuffled training data into smaller batches of predefined size. I've written a small utility function here to help. 

<details>
<summary>Q: Can you explain how the ```batch()``` function works?</summary>
<br>
 Hint: Check out [this link](https://realpython.com/introduction-to-python-generators/).
</details>



In [14]:
type(shuffled_train)

datasets.arrow_dataset.Dataset

In [15]:
batch_size = 32
batches_tokens = batch(shuffled_train["tokens"], batch_size)
batches_tags = batch(shuffled_train["ner_tags"], batch_size)

In [22]:
help(batch)

Help on function batch in module src.util:

batch(dataset: Iterable, batch_size: int) -> Iterable
    Creates batches from an iterable.
    
    Args:
        dataset (Iterable): Your dataset you want to batch given as an iterable (e.g. a list).
        batch_size (int): Your desired batch size
    
    Returns:
        Iterable: An iterable of tuples of size equal to batch_size.
    
    Example:
        >>> batches = batch([1,2, 3, 4, 5], 2)
        >>> print(list(batches))
        [(1, 2), (3, 4), (5,)]



batch() returns something you can iterate over (repeat over), one batch at a time.
Argument dataset: an iterable, e.g. a list.
Argument batch_size: the size of each batch, given as an integer.
Returns: an iterable of tuples of size equal to batch_size. A tuple is ordered and unchangeable (but you can iterate over it).
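
To see how a function like this could work, here is a minimal generator-based sketch that reproduces the behaviour described in the help text. The name ```batch_sketch``` is hypothetical and this is not necessarily how ```src.util.batch``` is implemented; it is just one way to get the same output.

```python
from itertools import islice
from typing import Iterable, Iterator


def batch_sketch(dataset: Iterable, batch_size: int) -> Iterator[tuple]:
    """Yield consecutive tuples of up to batch_size items (hypothetical re-implementation)."""
    iterator = iter(dataset)
    while True:
        # take the next batch_size items (fewer if the data runs out)
        chunk = tuple(islice(iterator, batch_size))
        if not chunk:
            return          # nothing left, so the generator stops
        yield chunk         # hand back one batch and pause until asked again


batches = batch_sketch([1, 2, 3, 4, 5], 2)
first = next(batches)       # only now is the first batch actually built
remaining = list(batches)   # consumes whatever is left
print(first, remaining)
```

Because a generator is lazy, nothing is computed until you call ```next()``` on it or loop over it, and once a batch has been consumed it is gone.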

Next, we want to use the ```tokens_to_idx()``` function below on our batches.

<details>
<summary>Q: What is this function doing? Why is it doing it?</summary>
<br>
We're making everything lowercase and adding a new, arbitrary token called <UNK> to the vocabulary. This <UNK> means "unknown" and is used to replace out-of-vocabulary tokens in the data - i.e. tokens that don't appear in the vocabulary of the pretrained word embeddings.
</details>


In [17]:
def tokens_to_idx(tokens, vocab=model.key_to_index):
    """
    Map each token in a sentence to its index in the embedding vocabulary.

    Args
        tokens (list): the tokens of a single sentence
        vocab (dict): mapping from word to index, used to look up each token

    Returns
        A list with the vocabulary index of each token (looked up with .get). Tokens are lowercased before the lookup so that upper- and lowercase spellings of a word share one index; tokens missing from the vocabulary get the index of "UNK".

    """
    return [vocab.get(t.lower(), vocab["UNK"]) for t in tokens]

tokens is a single sentence, i.e. a list of words. Each batch contains 32 such sentences.
The model.key_to_index: a big dictionary where each word in the pretrained vocabulary has its own index (the embeddings themselves have 50 dimensions).
.get function = we look up the index of the lowercased word, and fall back to a default value if the word is not in the dictionary.
UNK stands for "unknown": its index is the fallback for words that are not in the vocabulary.
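
A toy version makes the fallback behaviour easy to see. The five-entry ```toy_vocab``` and the function name ```toy_tokens_to_idx``` are made up for this sketch; the real vocabulary comes from the pretrained embeddings and is much larger.

```python
# A tiny, made-up vocabulary standing in for the real embedding vocabulary
toy_vocab = {"the": 0, "eu": 1, "rejects": 2, "UNK": 3, "PAD": 4}


def toy_tokens_to_idx(tokens, vocab=toy_vocab):
    # dict.get(key, default) returns the default when the key is missing,
    # so out-of-vocabulary words are mapped to the index of "UNK"
    return [vocab.get(t.lower(), vocab["UNK"]) for t in tokens]


sentence = ["The", "EU", "rejects", "Brexit"]
indices = toy_tokens_to_idx(sentence)
print(indices)  # "Brexit" is not in toy_vocab, so it gets the UNK index
```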

We'll check below that everything is working as expected by testing it on a single batch.

In [18]:
# sample using only the first batch
# next() pulls the first batch out of the generator. A generator only produces
# its values (here a tuple of sentences) when we ask for them.
batch_tokens = next(batches_tokens)
batch_tags = next(batches_tags)
batch_tok_idx = [tokens_to_idx(sent) for sent in batch_tokens]

In [20]:
print(batch_tokens[2])

batch_tok_idx[2]



['drew', '1-1', '(', 'halftime', '1-0', ')', 'in', 'a', 'friendly', 'soccer', 'international', 'on']


[2417, 5661, 23, 7029, 3835, 24, 6, 7, 2567, 1733, 146, 13]

As with document classification, our model needs to take input sequences of a fixed length. To get around this we take a few steps:

- Find the length of the longest sequence in the batch
- Pad shorter sequences to the max length using an arbitrary token like ```<PAD>```
- Give the ```<PAD>``` token a new label ```-1``` to differentiate it from the other labels
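
The three steps above can be sketched on a toy batch using plain Python lists. The values of ```PAD_IDX``` and the toy indices are made up for the illustration. Note that this sketch appends padding to each sentence, whereas the cells below reach the same result by first creating an array full of padding and then copying the real values in.

```python
PAD_IDX = 4       # hypothetical index of the padding token in the vocabulary
PAD_LABEL = -1    # label used to mark padded positions

# A toy batch of three sentences (as token indices) with their tags
batch_tok = [[0, 1, 2], [1], [2, 0]]
batch_tag = [[0, 3, 0], [3], [0, 0]]

# 1. find the length of the longest sequence in the batch
max_len = max(len(s) for s in batch_tok)

# 2. pad shorter sequences up to max_len with the padding index
padded_input = [s + [PAD_IDX] * (max_len - len(s)) for s in batch_tok]

# 3. give the padded positions the label -1
padded_labels = [t + [PAD_LABEL] * (max_len - len(t)) for t in batch_tag]

print(padded_input)
print(padded_labels)
```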

In [21]:
# compute length of longest sentence in batch
batch_max_len = max([len(s) for s in batch_tok_idx])

In [22]:
batch_max_len

43

Q: Can you figure out the logic of what is happening in the next two cells?

In [23]:
batch_input = vocab["PAD"] * np.ones((batch_size, batch_max_len))
batch_labels = -1 * np.ones((batch_size, batch_max_len))

In [27]:
#batch_input[2]
batch_labels

array([[-1., -1., -1., ..., -1., -1., -1.],
       [-1., -1., -1., ..., -1., -1., -1.],
       [-1., -1., -1., ..., -1., -1., -1.],
       ...,
       [-1., -1., -1., ..., -1., -1., -1.],
       [-1., -1., -1., ..., -1., -1., -1.],
       [-1., -1., -1., ..., -1., -1., -1.]])

A: batch_input: creates an array with one row per sentence in the batch and one column per position up to the max length of the batch, and fills every cell with the index of the padding token. The output is an array of the number 400001. Shorter sentences will therefore end in padding, so all rows have the same length.
batch_labels creates an array of the same shape filled with -1, an arbitrary number that marks a position as padding. Once the real tokens and tags are copied in, only the positions that have been filled out with padding keep the values 400001 and -1.

In [28]:
# copy the data to the numpy array
for i in range(batch_size):
    tok_idx = batch_tok_idx[i]
    tags = batch_tags[i]
    size = len(tok_idx)

    batch_input[i][:size] = tok_idx
    batch_labels[i][:size] = tags

A: in this cell we loop over every sentence in the batch and do the following:
create the value tok_idx from the token indices of sentence i in batch_tok_idx
create the value tags from the tags of sentence i in batch_tags
create the value size from the length of tok_idx
Then the first size positions of row i in batch_input are overwritten with the token indices
And the first size positions of row i in batch_labels are overwritten with the tags, so only the padded positions keep the -1.

The last step is to convert the arrays into ```pytorch``` tensors, ready for the NN model.

In [None]:
# since all data are indices, we convert them to torch LongTensors (integers)
batch_input, batch_labels = torch.LongTensor(batch_input), torch.LongTensor(
    batch_labels
)

With our data now batched and processed, we want to run it through our RNN the same way as when we trained a classifier. Note that this cell is incomplete and won't yet run; that's part of the assignment!

Q: Why is ```output_dim = num_classes + 1```?

In [None]:
# CREATE MODEL
model = RNN(
    embedding_layer=embedding_layer, output_dim=num_classes + 1, hidden_dim_size=256
)

# FORWARD PASS
X = batch_input
y = model(X)

loss = model.loss_fn(outputs=y, labels=batch_labels)

# etc, etc

## 3. Creating an LSTM with ```pytorch```

In the file [LSTM.py](../src/LSTM.py), I've already created an LSTM for you using ```pytorch```. Take some time to read through the code and make sure you understand how it's built up.

Some questions for you to discuss in groups:

- How is an LSTM layer created using ```pytorch```? How does the code compare to the classifier code you wrote last week?
- What's going on with that weird bit that says ```@staticmethod```?
  - [This might help](https://realpython.com/instance-class-and-static-methods-demystified/).
- On the forward pass, we use ```log_softmax()``` to make output predictions. What is this, and how does it relate to the output from the sigmoid function that we used in the document classification?
- How would we make this LSTM model *bidirectional* - i.e. make it a Bi-LSTM? 
  - Hint: Check the documentation for the LSTM layer on the ```pytorch``` website.
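
As a starting point for the ```log_softmax()``` question, here is a small worked example in plain Python. It is a sketch of the underlying formula, not the ```pytorch``` implementation, and the logit values are made up. It shows that exponentiating the log-softmax outputs gives probabilities that sum to 1, and that for two classes with logits ```[z, 0]``` the softmax probability of the first class is the same as ```sigmoid(z)```.

```python
import math


def log_softmax(logits):
    # log_softmax(x)_i = x_i - log(sum_j exp(x_j))
    # subtracting the max first keeps exp() from overflowing
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Multi-class case: one score (logit) per class
logits = [2.0, 0.5, -1.0]
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]
print(sum(probs))  # the probabilities sum to 1 (up to rounding)

# Two-class case: softmax over [z, 0] matches sigmoid(z)
z = 1.3
two_class = math.exp(log_softmax([z, 0.0])[0])
print(two_class, sigmoid(z))
```

So the sigmoid can be thought of as the two-class special case, while softmax spreads probability over as many classes as we have tags.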