*Prepared for the course "TSTS22: Natural Language Processing and Text Mining" at Jönköping University, Teacher: [Marcel Bollmann](marcel.bollmann@ju.se)*

# Assignment 2: Named Entity Recognition

In this assignment, we are using the [**Broad Twitter Corpus**](https://huggingface.co/datasets/strombergnlp/broad_twitter_corpus) provided through HuggingFace Datasets. You will find some code below that loads the dataset, assuming you have the `datasets` library installed.  The [description of the corpus on HF Datasets](https://huggingface.co/datasets/strombergnlp/broad_twitter_corpus) provides more information about the dataset itself.

### Instructions

Your task is to **train and evaluate a neural sequence labeling model** on the task of named entity recognition (NER) on the Broad Twitter Corpus.
To do this, you will need to provide code that performs the following:

1. **Preprocess and postprocess the data**, in particular to deal with
    - converting between tokens/labels and numeric indices, and
    - padding and "un-padding" sequences.
    
   &nbsp;

2. **Define and train the model**.  Your model should consist of at least:

   - an Embedding layer;
   - a **bidirectional recurrent** layer (e.g., a Bi-LSTM); and
   - a linear layer with softmax for predicting NER labels.
  
   &nbsp;
  
3. **Evaluate the model** through a function that takes pre-tokenized tweets as input, and outputs a classification report on the F1 scores by entity type. _(You don't need to calculate these F1 scores yourself; some code for this is provided under "Utility Functions" below.)_

   &nbsp;

**Your final model should be able to achieve an F1-score greater than zero** for most entity types.  However, obtaining the best performance/F1-score is **_not_** required to pass this assignment – any correct implementation that trains and evaluates a model as described above passes.  The focus of this assignment is to learn how to correctly work with sequential input and output data in a neural network architecture.

### Required Additional Python Libraries

- `datasets` for accessing the dataset.
- `seqeval` for a ready-made implementation of span-level F1 score for NER.

Both of these libraries can be installed via `pip`.

### Grading

- This assignment is graded Pass/Fail.

- To _pass_ this assignment, you must provide a working solution for _all parts_ of the assignment. This means that:
    - Your notebook should run from start to finish without errors.
    - Your solutions should fulfill the requirements described above.
    - The provided evaluation code below must not be modified, and run correctly along with the rest of your notebook.

- - - 

## Dataset and Utility Functions

The following cells import utility libraries, load the dataset (it will be downloaded the first time you run this code), and show examples of how the dataset can be accessed:

In [1]:
!pip install datasets
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Imports and loading the dataset
import numpy as np
from datasets import load_dataset
from seqeval.metrics import classification_report, f1_score

dataset = load_dataset("strombergnlp/broad_twitter_corpus", ignore_verifications=True)



  0%|          | 0/3 [00:00<?, ?it/s]

In [3]:
# What does the dataset object look like?
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5342
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2002
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2002
    })
})

In [4]:
# Look at first training sample:
dataset["train"][0]

{'id': '0',
 'tokens': ['I',
  'hate',
  'the',
  'words',
  'chunder',
  ',',
  'vomit',
  'and',
  'puke',
  '.',
  'BUUH',
  '.'],
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [5]:
# Look at a slightly more interesting training sample:
dataset["train"][6]

{'id': '6',
 'tokens': ['middle',
  'aged',
  'man',
  'band',
  'playing',
  'blink',
  '182',
  '.',
  'l0',
  'l',
  '.'],
 'ner_tags': [0, 0, 0, 0, 0, 3, 4, 4, 0, 0, 0]}

In [6]:
# Class labels can be viewed and transformed through this class:
dataset["train"].features["ner_tags"].feature

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)

### F1 scores and classification reports

The `seqeval` library provides functions that compute span-level F1-scores specifically for NER. It can be used as follows:


In [7]:
tags_gold = [
    ["B-LOC", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"],
    ["B-PER", "I-PER", "O", "B-LOC"],
]
tags_pred = [
    ["B-LOC", "O", "O", "B-ORG", "I-ORG", "O", "O"],
    ["B-PER", "I-PER", "O", "B-PER"],
]

In [8]:
f1_score(tags_gold, tags_pred)

0.5

In [9]:
print(classification_report(tags_gold, tags_pred))

              precision    recall  f1-score   support

         LOC       1.00      0.50      0.67         2
         ORG       0.00      0.00      0.00         1
         PER       0.50      1.00      0.67         1

   micro avg       0.50      0.50      0.50         4
   macro avg       0.50      0.50      0.44         4
weighted avg       0.62      0.50      0.50         4



- - -

## Implementation

Add your implementation below. Use as many cells as you want and structure your code any way you like, with only **one strict requirement:**

- Your code needs to provide a function called `predict_and_evaluate` that takes an array of pre-tokenized sentences and their gold-standard NER labels as _input_, calls your trained model to _predict NER labels_ for them, and finally _prints a classification report_ of precision/recall/F1-scores like in the example above.  It does not need to return anything.

  A stub for this function and an example for how it should be called is given below.

<div style="background-color:#008148; padding:4px 8px; border-radius:4px; color:#F8F0E3">
    <strong>Add your implementation below.</strong>
</div>
   

In [10]:
import tensorflow as tf

In [11]:
X_train = dataset["train"]["tokens"]
y_train = dataset["train"]["ner_tags"]
X_test = dataset["test"]["tokens"]
y_test = dataset["test"]["ner_tags"]
X_val = dataset["validation"]["tokens"]
y_val = dataset["validation"]["ner_tags"]

In [12]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [16]:
# cutoff reviews after 110 words
maxlen = 110

# consider the top 36000 words in the dataset
max_words = 36000

# tokenize each sentence in the dataset
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train+X_test)
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_test = tokenizer.texts_to_sequences(X_test)

In [17]:
# Pad x and y to length of 110
X_train_preprocessed = pad_sequences(sequences_train, maxlen=maxlen, padding='post')
X_test_preprocessed = pad_sequences(sequences_test, maxlen=maxlen, padding='post')
y_train_preprocessed = pad_sequences(y_train, maxlen=maxlen, padding='post')
y_test_preprocessed = pad_sequences(y_test, maxlen=maxlen, padding='post')

In [18]:
embedding_dim = 300
maxlen = 110
max_words = 36000
num_tags = 7

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(max_words, embedding_dim, input_length=maxlen),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=100, activation='tanh', return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_tags, activation='softmax'))
])

In [19]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [20]:
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

In [21]:
history = model.fit(X_train_preprocessed, y_train_preprocessed,
                    validation_data=(X_test_preprocessed, y_test_preprocessed),
                    epochs=20, callbacks=[callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20


In [22]:
# Generate dictionary variable from string tags to numbers
tags2id = {}
for i, tag in enumerate(dataset["train"].features["ner_tags"].feature.names):
    tags2id[tag] = i

In [23]:
# Generate dictionary variable from numbers to string tags
id2tag = {}
for key, value in tags2id.items():
    id2tag[value] = key

In [24]:
# Generate predicted string tag list
def make_prediction(model, original_sentence, preprocessed_sentence, id2tag):
    
    #if preprocessed_sentence.shape() != (1, 110):
    preprocessed_sentence = preprocessed_sentence.reshape((1, 110))
    
    len_orginal_sententce = len(original_sentence)
    
    # make prediction
    prediction = model.predict(preprocessed_sentence)
    prediction = np.argmax(prediction[0], axis=1)
    
    # return the prediction to its orginal form
    prediction = list(prediction)[ : len_orginal_sententce] 
    
    pred_tag_list = []
    for tag_id in prediction:
        pred_tag_list.append(id2tag[tag_id])
    
    return pred_tag_list

In [25]:
def predict_and_evaluate(X, y_gold):
    """Predicts NER labels and prints a classification report for them.

    Input:
      X:       a list of input sentences, where each input sentence is
               a list of tokens.
      y_gold:  a list of gold-standard label sequences belonging to
               the input X.
    """
    #Preprocesss input X by tokenize and padding 
    sequences = tokenizer.texts_to_sequences(X)
    X_preprocessed = pad_sequences(sequences, maxlen=maxlen, padding='post')

    #Predict tag variable
    pred_tag =[]
    for i in range(len(X)) :
        pred_tag_list = make_prediction(model=model,
                                      original_sentence=X[i],
                                      preprocessed_sentence=X_preprocessed[i],
                                      id2tag=id2tag)
        pred_tag.append(pred_tag_list)

    #Convert original tag variable from numeric to string
    original_tag = []
    if (type(y_gold[0][0]) == str):
        original_tag = y_gold
    else :
        for sentence in y_gold:
            original_tag_one = []
            for tag in sentence:
                original_tag_one.append(id2tag[tag])
            original_tag.append(original_tag_one)
          
    
    # This prints the classification report:
    print(classification_report(pred_tag, original_tag))

In [26]:
predict_and_evaluate(X_val, y_val)

              precision    recall  f1-score   support

         LOC       0.05      0.11      0.06       132
         ORG       0.04      0.14      0.06       182
         PER       0.03      0.28      0.05       229

   micro avg       0.03      0.19      0.05       543
   macro avg       0.04      0.18      0.06       543
weighted avg       0.03      0.19      0.06       543



<div style="background-color:#EF8A17; padding:4px 8px; border-radius:4px; color:#F8F0E3; margin-bottom:1em;">
  <strong>Do NOT modify the code cell below.</strong>
</div>

The cell below gives an example of how the `predict_and_evaluate` function should be called. I should be able to change the inputs for different ones and still get a correct result.

In [27]:
X_example = [
    ["Landslide", "victory", "for", "Sinn", "Féin", ":", "the", "1918", "general", "election", ":", "http://t.co/96TkhnBUfo"],
    ['Professor', 'Jan', 'Leach', 'met', 'with', 'students', 'from', 'the', 'Journalism', 'School', 'at', 'SRM', 'University', 'in', 'Chennai', ',', 'India', '.', 'http://t.co/TiiZUCUvuv']
]
y_example = [
    ['O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
    ['O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'B-LOC', 'O', 'B-LOC', 'O', 'O']
]

predict_and_evaluate(X_example, y_example)

              precision    recall  f1-score   support

         LOC       0.50      1.00      0.67         1
         ORG       0.00      0.00      0.00         3
         PER       0.00      0.00      0.00         0

   micro avg       0.17      0.25      0.20         4
   macro avg       0.17      0.33      0.22         4
weighted avg       0.12      0.25      0.17         4



  _warn_prf(average, modifier, msg_start, len(result))
