# Setup

In [None]:
!pip install datasets 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill<0.3.5
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)

In [None]:
from datasets import list_datasets, load_dataset
from transformers import DataCollatorForTokenClassification, AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, SpatialDropout1D, Bidirectional
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from livelossplot.tf_keras import PlotLossesCallback
import numpy as np
import math
import random
import copy
import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F

# Data Preparation

The chosen dataset is from [Hugging Face](https://huggingface.co/datasets/nlpaueb/finer-139). <br> It comprises 1.1M sentences annotated with eXtensive Business Reporting Language (XBRL) tags, extracted from annual and quarterly reports of publicly traded companies in the US. After keeping only sentences of at most 64 tokens, the first 15k/3.5k/3.5k samples of each split are selected as train, validation, and test data.

In [None]:
class FinerDataset():

    def __init__(self, split_name):
      
        self.data = {}

        dataset = load_dataset("nlpaueb/finer-139", split=split_name)
        dataset = dataset.filter((lambda x: len(x["tokens"]) <= 64)) # keep only sentences with at most 64 tokens
        

        if split_name == "train":
          dataset = dataset.select(range(15000))
          
        elif (split_name == "validation" or split_name == "test"):
          dataset = dataset.select(range(3500))
        


        for i in range((len(dataset))):
          tokens = dataset[i]['tokens']
          y_ners = dataset[i]['ner_tags']
          idx = len(self.data)
          self.data[idx] = {
                            'text': ' '.join(tokens),
                            'tokens': tokens, 
                            'y_ners': y_ners, 
                            'idx': idx
                        }

    # We return the length of the dataset
    def __len__(self):
        return len(self.data)

    # We return the idx'th sample
    def __getitem__(self, idx):
        return {
            'idx': idx,
            'word_idx': torch.tensor(self.data[idx]['word_idx']).long(),
            'y_ners': torch.tensor(self.data[idx]['y_ners']).long(),
            'chars_idx': torch.tensor(self.data[idx]['chars_idx']).long(),
        }

Next, we load the train, validation, and test splits.

In [None]:
train_data = FinerDataset("train")
val_data = FinerDataset("validation")
test_data = FinerDataset("test")

Downloading builder script:   0%|          | 0.00/19.1k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/15.9k [00:00<?, ?B/s]

Downloading and preparing dataset finer139/finer-139 (download: 98.42 MiB, generated: 824.09 MiB, post-processed: Unknown size, total: 922.51 MiB) to /root/.cache/huggingface/datasets/nlpaueb___finer139/finer-139/1.0.0/5f5a8eb2a38e8b142bb8ca63f3f9600634cc6c8963e4c982926cf2b48e4e55ff...


Downloading data:   0%|          | 0.00/103M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/900384 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/112494 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/108378 [00:00<?, ? examples/s]

Dataset finer139 downloaded and prepared to /root/.cache/huggingface/datasets/nlpaueb___finer139/finer-139/1.0.0/5f5a8eb2a38e8b142bb8ca63f3f9600634cc6c8963e4c982926cf2b48e4e55ff. Subsequent calls will reuse this data.


  0%|          | 0/901 [00:00<?, ?ba/s]

Reusing dataset finer139 (/root/.cache/huggingface/datasets/nlpaueb___finer139/finer-139/1.0.0/5f5a8eb2a38e8b142bb8ca63f3f9600634cc6c8963e4c982926cf2b48e4e55ff)


  0%|          | 0/113 [00:00<?, ?ba/s]

Reusing dataset finer139 (/root/.cache/huggingface/datasets/nlpaueb___finer139/finer-139/1.0.0/5f5a8eb2a38e8b142bb8ca63f3f9600634cc6c8963e4c982926cf2b48e4e55ff)


  0%|          | 0/109 [00:00<?, ?ba/s]

In [None]:
#check the length of the data
len(train_data), len(val_data), len(test_data) 

(15000, 3500, 3500)

In [None]:
# Set seeds for reproducibility
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

In [None]:
vocabulary = set()
for idx, sample in train_data.data.items(): # Pay attention we only use the training set!
  for token in sample['tokens']:
    vocabulary.add(token)
len(vocabulary)

11556

In [None]:
# Create the word-to-index mapping and its inverse
word2idx = {'_PAD_': 0, '_UNK_': 1}
for word in vocabulary:
  word2idx[word] = len(word2idx)
idx2word = {idx:word for word, idx in word2idx.items()}
word2idx

{'_PAD_': 0,
 '_UNK_': 1,
 '': 2,
 'Share': 3,
 'grant': 4,
 'Codification': 5,
 'Subpoenas': 6,
 'computation': 7,
 '2,250,360': 8,
 'Catterton': 9,
 '40.0': 10,
 '0.5632': 11,
 '242,068': 12,
 'Turnaround': 13,
 'Ms.': 14,
 'GENERATORS': 15,
 'Stapleton': 16,
 'Appreciation': 17,
 'Raising': 18,
 'HPE': 19,
 'borrowed': 20,
 'C-15': 21,
 '4,173,014': 22,
 '253': 23,
 'Parcel': 24,
 'wrote': 25,
 'Purpose': 26,
 '40.1': 27,
 '6,279': 28,
 '21,600': 29,
 'reputation': 30,
 '776,000': 31,
 'PLANT': 32,
 'Legacy': 33,
 'Commercial': 34,
 'Firm': 35,
 '5,932': 36,
 'braking': 37,
 'fraudulent': 38,
 'Inflation': 39,
 'incur': 40,
 'outputs': 41,
 'work': 42,
 'Connecticut': 43,
 '500,000,000': 44,
 'fixed': 45,
 'Provision': 46,
 'Warrants': 47,
 '11.8': 48,
 'Sarl': 49,
 '8,580': 50,
 '13.1': 51,
 'group': 52,
 'online': 53,
 'confirm': 54,
 '76.9': 55,
 'ENERGY': 56,
 'prioritizes': 57,
 '26.4': 58,
 '3Consolidated': 59,
 'interchange': 60,
 'entire': 61,
 'equates': 62,
 'ordinary': 63

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip

--2022-06-10 11:01:21--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-06-10 11:01:22--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-06-10 11:04:02 (5.14 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [None]:
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


Word embeddings

In [None]:
# Easier to load with gensim
WORD_DIM = 300
from gensim.models import KeyedVectors
#word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Glove
from gensim.scripts.glove2word2vec import glove2word2vec
_ = glove2word2vec('glove.6B.300d.txt', 'glove.6B.300d_w2v.txt')
word2vec = KeyedVectors.load_word2vec_format('glove.6B.300d_w2v.txt', binary=False)

In [None]:
# Initialize randomly the word embedding matrix
word_embeddings = np.random.rand(len(word2idx), WORD_DIM)

# Set the values to 0 for padding
word_embeddings[word2idx['_PAD_']] = np.zeros(WORD_DIM)

# Copy from word2vec
for word in vocabulary:
  if word in word2vec:
    word_embeddings[word2idx[word], :] = word2vec[word]
word_embeddings.shape

(11558, 300)

We add the attribute `word_idx` to all samples. If a word is not in the vocabulary, we simply replace it with the `_UNK_` token.

In [None]:
# We add the word indices to all data splits
for split_data in [train_data.data, val_data.data, test_data.data]:
  for idx, sample in split_data.items():
    sample['word_idx'] = []
    for token in sample['tokens']:
      # If a word is not in our vocabulary, we put the UNK token instead
      sample['word_idx'].append(word2idx[token] if token in word2idx else word2idx['_UNK_'])

In [None]:
max(len(sample['tokens']) for idx, sample in train_data.data.items()), \
max(len(sample['tokens']) for idx, sample in val_data.data.items()), \
max(len(sample['tokens']) for idx, sample in test_data.data.items())

(64, 64, 64)

The last step is to pad all entries to 64 tokens.
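The padding step can be sketched on a single toy sample. The helper `pad_sample` and the toy indices are hypothetical and only illustrate the idea. The two conventions, index 0 for `_PAD_` and label -100 for padded positions, are the ones used in this notebook.

```python
def pad_sample(word_idx, labels, length, pad_idx=0, ignore_label=-100):
    """Return copies of both lists, right-padded to `length`.

    Padded word positions get `pad_idx`; the matching label positions get
    `ignore_label` so that the loss can skip them later.
    """
    missing = length - len(word_idx)
    if missing < 0:
        raise ValueError("sample is longer than the padding length")
    return word_idx + [pad_idx] * missing, labels + [ignore_label] * missing


# Toy sample with 3 tokens, padded to 6 positions
words, labels = pad_sample([5, 8, 2], [0, 3, 0], length=6)
print(words)
print(labels)
```

Unlike the in-place `while` loop used in the next cell, this sketch returns new lists, which keeps the original sample untouched.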

In [None]:
PAD_LENGTH = 64

for split_data in [train_data.data, val_data.data, test_data.data]:
  for idx, sample in split_data.items():
    while len(sample['word_idx']) < PAD_LENGTH:
      sample['word_idx'].append(word2idx['_PAD_'])

      # Use the special label -100 so that the loss function ignores the "_PAD_" positions.
      sample['y_ners'].append(-100)

    # Sanity check
    assert len(sample['word_idx']) == PAD_LENGTH

Character embeddings

In [None]:
all_characters = set()
for idx, sample in train_data.data.items():
  for token in sample['tokens']:
    for char in token:
      all_characters.add(char)
print(all_characters)

{'Q', '5', 'D', "'", 'y', 'ó', '/', 'u', 'é', 'ñ', '▪', '·', 'p', 'N', 'w', '2', 'k', '-', 'K', 'i', '¼', 's', 'C', 't', 'v', 'O', '+', '"', 'r', 'î', '”', 'Y', '%', 'V', 'o', 'W', 'q', '4', 'Z', 'j', '_', 'L', 'J', '&', 'H', '0', 'b', 'F', '’', 'á', '•', '£', '®', '3', 'g', '1', '8', '$', '*', '§', '“', '€', ')', '#', 'E', 'n', 'U', '(', 'h', '\uf0b7', '!', '.', 'x', 'T', 'S', ',', ';', '9', 'z', 'í', 'M', 'G', 'I', 'X', 'c', '™', ';', 'R', 'P', 'A', 'l', '6', '7', 'e', 'f', 'm', ':', 'a', 'd', 'B'}


In [None]:
char2idx = {'_PAD_': 0, '_UNK_': 1}
for char in all_characters:
  char2idx[char] = len(char2idx)
idx2char = {idx:char for char, idx in char2idx.items()}
char2idx

{'!': 72,
 '"': 29,
 '#': 65,
 '$': 59,
 '%': 34,
 '&': 45,
 "'": 5,
 '(': 69,
 ')': 64,
 '*': 60,
 '+': 28,
 ',': 77,
 '-': 19,
 '.': 73,
 '/': 8,
 '0': 47,
 '1': 57,
 '2': 17,
 '3': 55,
 '4': 39,
 '5': 3,
 '6': 93,
 '7': 94,
 '8': 58,
 '9': 79,
 ':': 98,
 ';': 78,
 'A': 91,
 'B': 101,
 'C': 24,
 'D': 4,
 'E': 66,
 'F': 49,
 'G': 83,
 'H': 46,
 'I': 84,
 'J': 44,
 'K': 20,
 'L': 43,
 'M': 82,
 'N': 15,
 'O': 27,
 'P': 90,
 'Q': 2,
 'R': 89,
 'S': 76,
 'T': 75,
 'U': 68,
 'V': 35,
 'W': 37,
 'X': 85,
 'Y': 33,
 'Z': 40,
 '_': 42,
 '_PAD_': 0,
 '_UNK_': 1,
 'a': 99,
 'b': 48,
 'c': 86,
 'd': 100,
 'e': 95,
 'f': 96,
 'g': 56,
 'h': 70,
 'i': 21,
 'j': 41,
 'k': 18,
 'l': 92,
 'm': 97,
 'n': 67,
 'o': 36,
 'p': 14,
 'q': 38,
 'r': 30,
 's': 23,
 't': 25,
 'u': 9,
 'v': 26,
 'w': 16,
 'x': 74,
 'y': 6,
 'z': 80,
 '£': 53,
 '§': 61,
 '®': 54,
 '·': 13,
 '¼': 22,
 'á': 51,
 'é': 10,
 'í': 81,
 'î': 31,
 'ñ': 11,
 'ó': 7,
 ';': 88,
 '’': 50,
 '“': 62,
 '”': 32,
 '•': 52,
 '€': 63,
 '™': 87,


In [None]:
CHAR_DIM = 32

# Randomly initialize the character embedding matrix
char_embeddings = np.random.rand(len(idx2char), CHAR_DIM)

# Set the values to 0 for padding
char_embeddings[char2idx['_PAD_']] = np.zeros(CHAR_DIM)

In [None]:
longest_words = list(sorted({(len(token), token) for idx, sample in train_data.data.items() for token in sample['tokens']}, reverse=True))[:100]
longest_words

[(29, 'DISPOSITIONSAcquisitionsMarco'),
 (27, 'PresentationBusinessNoodles'),
 (21, 'www.malone-bailey.com'),
 (20, 'ProConnectProConnect'),
 (20, 'OPERATIONSSummaryThe'),
 (20, 'DivestituresHypackOn'),
 (19, 'EnerSysConsolidated'),
 (18, 'telecommunications'),
 (18, 'disproportionately'),
 (18, 'Telecommunications'),
 (18, 'IncidentOverviewOn'),
 (17, 'reclassifications'),
 (17, 'opportunistically'),
 (17, 'contemporaneously'),
 (17, 'commercialization'),
 (17, 'cardiorespiratory'),
 (17, 'Reclassifications'),
 (17, '2:16-cv-00255-TJH'),
 (16, 'unenforceability'),
 (16, 'underutilization'),
 (16, 'undercapitalized'),
 (16, 'uncollateralized'),
 (16, 'responsibilities'),
 (16, 'reclassification'),
 (16, 'recapitalization'),
 (16, 'misappropriation'),
 (16, 'indemnifications'),
 (16, 'extraterritorial'),
 (16, 'administratively'),
 (16, 'Reclassification'),
 (16, 'L.P.Consolidated'),
 (16, 'L.P.CONSOLIDATED'),
 (16, 'Indemnifications'),
 (16, 'Divestitures2017'),
 (15, 'unconditionally'

We will assume that a word has at most 18 characters; longer tokens are truncated. In practice, more preprocessing could be done here, which could reduce noise (several of the longest tokens above are words that were glued together).
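As a small illustration of the truncate-then-pad rule for characters, here is a sketch on a toy character vocabulary. The function `encode_chars` and the toy mapping are assumptions made for this example, not part of the notebook.

```python
def encode_chars(token, char2idx, length):
    """Map a token to exactly `length` character indices.

    Characters beyond `length` are cut off, unknown characters map to
    _UNK_, and short tokens are right-padded with _PAD_.
    """
    unk = char2idx['_UNK_']
    ids = [char2idx.get(ch, unk) for ch in token[:length]]
    return ids + [char2idx['_PAD_']] * (length - len(ids))


# Toy mapping that follows the notebook's convention (0 = padding, 1 = unknown)
toy_char2idx = {'_PAD_': 0, '_UNK_': 1, 'a': 2, 'b': 3, 'c': 4}

short = encode_chars('ab', toy_char2idx, length=4)       # padded
long_ = encode_chars('abcabc', toy_char2idx, length=4)   # truncated
unseen = encode_chars('az', toy_char2idx, length=4)      # 'z' is unknown
print(short, long_, unseen)
```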

In [None]:
PAD_CHAR_LENGTH = 18

for split_data in [train_data.data, val_data.data, test_data.data]:
  for idx, sample in split_data.items():
    sample['chars_idx'] = []
    for token in sample['tokens']:
      # Map chars to indices and truncate tokens longer than PAD_CHAR_LENGTH
      chars = [(char2idx[char] if char in char2idx else char2idx['_UNK_']) for char in token][:PAD_CHAR_LENGTH]
      
      # Store the character indices of this token
      sample['chars_idx'].append(chars)

      # Pad chars with PAD tokens
      while len(sample['chars_idx'][-1]) < PAD_CHAR_LENGTH:
        sample['chars_idx'][-1].append(char2idx['_PAD_'])
    
    # Pad with empty chars to reach the padding length of the sentence
    while len(sample['chars_idx']) < PAD_LENGTH:
      sample['chars_idx'].append([0 for _ in range(PAD_CHAR_LENGTH)])
    
    # Sanity check
    assert len(sample['chars_idx']) == PAD_LENGTH
    for chars in sample['chars_idx']:
      assert len(chars) == PAD_CHAR_LENGTH

# Modeling

## RNN


In [None]:
class RNN_Model(nn.Module):
    def __init__(self, dropout, hidden_dim, classes_num, words_num, word_dim, chars_num, char_dim):
        super(RNN_Model, self).__init__()

        self.word_embedding = nn.Embedding(num_embeddings=words_num, embedding_dim=word_dim)
        
        # Our main component
        self.word_rnn = nn.RNN(input_size=word_dim,
                               hidden_size=hidden_dim,
                               num_layers=1,
                               batch_first=True,
                               dropout=0, # nn.RNN applies this dropout only between stacked layers, so it has no effect with num_layers=1
                               bidirectional=False)
        
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.Tanh()

        # The last layer to compute the probabilities for the output classes
        self.final_layer = nn.Linear(in_features=hidden_dim, out_features=classes_num)

        # Character-level components (defined here, but not used in forward below)
        self.char_embedding = nn.Embedding(num_embeddings=chars_num, embedding_dim=char_dim)
        self.char_rnn = nn.RNN(input_size=char_dim,
                               hidden_size=hidden_dim,
                               num_layers=1,
                               batch_first=True,
                               dropout=0, # Only applied between stacked layers, so it has no effect with num_layers=1
                               bidirectional=False)
        
        
    def forward(self, x):
        x_words = self.word_embedding(x['word_idx']) # Convert into word embeddings

        x_words = self.dropout(x_words)
        
        output, last_hidden_state = self.word_rnn(x_words)
        
        output = self.activation(output)
        output = self.dropout(output)

        logits = self.final_layer(output)
        return logits

Let's see if our model can compute a forward pass.

In [None]:
# Example of a batch of size 1 with 6 word indices
input = {'word_idx': torch.tensor([[0,1,2,3,4,5]])}
input

{'word_idx': tensor([[0, 1, 2, 3, 4, 5]])}

In [None]:
ner = RNN_Model(dropout=0.3, hidden_dim=50, classes_num=3, words_num=6, word_dim=10, chars_num=6, char_dim=5)
logits = ner(input)
logits, logits.size()

(tensor([[[ 0.0764, -0.1668,  0.0908],
          [ 0.1358, -0.1125, -0.0273],
          [-0.2773, -0.3301,  0.0274],
          [-0.1854, -0.0084,  0.3041],
          [ 0.4168, -0.0840,  0.1346],
          [-0.4787,  0.0061,  0.3210]]], grad_fn=<AddBackward0>),
 torch.Size([1, 6, 3]))

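The model runs and returns one vector of logits per token, with shape (batch, sequence length, classes). Applying a softmax over the last dimension turns these logits into tag probabilities.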
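The values printed above are raw scores, not probabilities yet. As a sketch, the conversion for the first token of that output can be written in plain Python; `softmax` here is a hand-written helper for illustration, whereas the notebook itself uses `F.softmax` and `torch.argmax` later on.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First row of the logits printed above (token 0 of the toy input)
first_token_logits = [0.0764, -0.1668, 0.0908]

probs = softmax(first_token_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

Since the softmax preserves the ordering of the scores, the predicted class is the same whether the argmax is taken over the logits or over the probabilities.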

**Loss function**
<br>
Which loss should we use to optimize the model? Because this is a classification task, the most natural choice is the [cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html?highlight=cross%20entropy#torch.nn.CrossEntropyLoss). 

In [None]:
criterion = nn.CrossEntropyLoss(reduction='mean', ignore_index=-100)

**ATTENTION**: What is this `ignore_index`? In this kind of task, where the output length depends on the input length, we use padding. However, we do not want the predictions for the padded positions to influence the loss. One way to achieve this is to give those positions a special class value (here -100) that is ignored when the loss is computed. See the documentation for more information.
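To make the effect of the ignored label concrete, the following sketch re-implements the mean cross entropy in plain Python. It is an illustration under the assumption that the mean is taken over the non-ignored positions only, which is how `reduction='mean'` combined with `ignore_index` behaves. The function name and the toy logits are made up.

```python
import math


def cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood over positions whose target is not ignored.

    Assumes at least one target differs from `ignore_index`.
    """
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses)


# Three positions with two classes; the last position is padding
logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]

with_padding = cross_entropy(logits, [0, 1, -100])
without_padding = cross_entropy(logits[:2], [0, 1])
print(with_padding, without_padding)
```

Both calls give the same value: whatever the model outputs at the padded position, it does not change the loss.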

**Metric**
<br>
Because our task is multi-class classification and the class distribution is heavily imbalanced, we will use the macro F1 score: the unweighted mean of the per-class F1 scores. (The micro F1 score instead pools the true positives, false positives, and false negatives of all classes before computing a single F1, so frequent classes dominate it.)
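The difference between the two averages shows up clearly on an imbalanced toy example. The sketch below computes both by hand; the helper names and the toy labels are assumptions for illustration, and the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def per_class_counts(golds, preds, label):
    """True positives, false positives and false negatives for one label."""
    tp = sum(1 for g, p in zip(golds, preds) if g == label and p == label)
    fp = sum(1 for g, p in zip(golds, preds) if g != label and p == label)
    fn = sum(1 for g, p in zip(golds, preds) if g == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(golds, preds):
    labels = sorted(set(golds) | set(preds))
    counts = [per_class_counts(golds, preds, label) for label in labels]
    # Macro: average the per-class F1 scores, every class weighs the same
    macro = sum(f1_from_counts(*c) for c in counts) / len(labels)
    # Micro: pool the counts of all classes, then compute one F1
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    return macro, micro


# Eight tokens of the majority class 0, one each of classes 1 and 2;
# the "model" always predicts class 0.
golds = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
preds = [0] * 10
macro, micro = macro_micro_f1(golds, preds)
print(round(macro, 4), round(micro, 4))
```

A model that only ever predicts the majority class keeps a high micro F1 but a low macro F1, which is why the macro average is the more informative one for rare tags.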

In [None]:
from sklearn.metrics import f1_score
def compute_f1(preds, golds):
  return f1_score(golds, preds, average='macro')  # scikit-learn expects (y_true, y_pred)
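One caveat with this metric: the padded positions carry the gold label -100, and if they are passed to `f1_score` unchanged, scikit-learn counts -100 as one more class. A possible way to avoid this, shown only as a sketch with a hypothetical helper `drop_ignored`, is to filter those positions out before scoring. The scores reported in the training log below were computed without such a filter.

```python
def drop_ignored(preds, golds, ignore_index=-100):
    """Keep only the (prediction, gold) pairs whose gold label is a real tag."""
    kept = [(p, g) for p, g in zip(preds, golds) if g != ignore_index]
    kept_preds = [p for p, _ in kept]
    kept_golds = [g for _, g in kept]
    return kept_preds, kept_golds


# Flattened toy batch: three real tokens followed by two padded positions
preds = [0, 3, 0, 0, 0]
golds = [0, 3, 0, -100, -100]

kept_preds, kept_golds = drop_ignored(preds, golds)
print(kept_preds, kept_golds)
```

The filtered lists could then be handed to `compute_f1` in place of the raw flattened tensors.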

**Training + Testing** 

In [None]:
BATCH_SIZE = 10

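# class_num is not defined in an earlier cell; one way to obtain it is to read it from the dataset's label schema
class_num = load_dataset("nlpaueb/finer-139", split="validation").features["ner_tags"].feature.num_classes

# We initialize our model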
model = RNN_Model(dropout=0.3, 
                   hidden_dim=128, 
                   classes_num=class_num, 
                   words_num=len(word2idx), 
                   word_dim=WORD_DIM,
                   chars_num=len(char2idx),
                   char_dim=CHAR_DIM)

# Copy the word embedding matrix
model.word_embedding.weight.data = torch.from_numpy(word_embeddings).float()
model.word_embedding.weight.requires_grad = False # We do NOT want to fine-tune the word embedding

model.char_embedding.weight.data = torch.from_numpy(char_embeddings).float()
model.char_embedding.weight.requires_grad = True # We DO want to fine-tune the char embedding as they were randomly initialized.

In [None]:
# We initialize our optimizer to update the weights of the model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-8) # L2 = weight_decay

In [None]:
from torch.utils.data import Dataset, DataLoader
import torch

In [None]:
# We can load our dataset using a dataloader
train_loader = DataLoader(
        train_data,
        batch_size=BATCH_SIZE,
        shuffle=True, # Shuffle the training samples at every epoch
        num_workers=0, # Number of worker subprocesses; 0 loads the data in the main process
        drop_last=False) # Keep the last batch even if it is smaller than batch_size; some applications drop it instead of handling it

val_loader = DataLoader(
        val_data,
        batch_size=BATCH_SIZE,
        shuffle=False, # Pay attention here that the data is not shuffled.
        num_workers=0, 
        drop_last=False)

test_loader = DataLoader(
        test_data,
        batch_size=BATCH_SIZE,
        shuffle=False, # Pay attention here that the data is not shuffled.
        num_workers=0, 
        drop_last=False)
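The number of iterations per epoch follows directly from the split sizes, the batch size, and `drop_last`. A minimal sketch of that arithmetic is shown here; `num_batches` is a hypothetical helper, not a PyTorch function.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields per epoch."""
    if drop_last:
        return num_samples // batch_size      # incomplete last batch is skipped
    return math.ceil(num_samples / batch_size)  # incomplete last batch is kept


train_batches = num_batches(15000, 10)
val_batches = num_batches(3500, 10)

# drop_last only matters when the split size is not a multiple of the batch size
uneven = (num_batches(3505, 10), num_batches(3505, 10, drop_last=True))
print(train_batches, val_batches, uneven)
```

With the sizes used here this gives 1500 training batches and 350 validation or test batches, which matches the `1500it` and `350it` counters in the training log below.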

In [None]:
# Move the model to the device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Initialize the loss function
criterion = nn.CrossEntropyLoss(reduction='mean', ignore_index=-100)

best_epoch = 0
best_val_so_far = 0
test_perf = 0


# TRAINING LOOP
for epoch in range(5):
  print('------------------------------------------------------------------------')
  print('Epoch: {}'.format(epoch))

  # TRAIN
  # We set the model in train mode.
  # Layers such as dropout and batchnorm behave differently at training and inference time.
  model.train()

  train_losses = []
 
  for idx, batch in tqdm.tqdm(enumerate(train_loader), desc='Training'):
    # Move the batch tensors to the device (GPU if available)
    batch['word_idx'] = batch['word_idx'].to(device)
    batch['chars_idx'] = batch['chars_idx'].to(device)
    batch['y_ners'] = batch['y_ners'].to(device) #BxL

    # Compute the model output and the loss
    y_logits = model(batch)
    # We have to "flatten" the predictions because the loss expects logits of shape (B*L)xC and targets of shape (B*L)
    loss = criterion(y_logits.view(-1, class_num), batch['y_ners'].view(-1))

    # Update model parameters
    optimizer.zero_grad() # This is very important! By default, gradients accumulate across backward passes.
    loss.backward() # Now that the gradients have been reset, we compute the new ones from the loss.
    optimizer.step() # We apply the gradient update with our optimizer (i.e., the weights of the model are updated).
  
    train_losses.append(loss.item())
  
  
  # VAL + TEST
  val_test_losses = {'val': [], 'test': []}
  val_test_f1 = {'val': [], 'test': []}
  
  # Unlike before, we set the model in eval mode to compute correctly dropout, batchnorm etc
  model.eval()

  # We do not store information relative to gradients as we do not update the model.
  # That's the reason why inference requires less memory and is faster.
  with torch.no_grad():
    for split_data, data in [('val', val_loader), ('test', test_loader)]:
      # The same loop handles both splits through their data loaders
      for idx, batch in tqdm.tqdm(enumerate(data), desc=split_data.capitalize()):

        # Move the batch tensors to the device (GPU if available)
        batch['word_idx'] = batch['word_idx'].to(device)
        batch['chars_idx'] = batch['chars_idx'].to(device)
        batch['y_ners'] = batch['y_ners'].to(device) #BxL

        # Compute the model output and the loss
        y_logits = model(batch) 
        # Flatten again: logits to (B*L)xC and targets to (B*L)
        loss = criterion(y_logits.view(-1, class_num), batch['y_ners'].view(-1))

        val_test_losses[split_data].append(loss.item())

        # Compute the macro f1 to evaluate our model
        y_probs = F.softmax(y_logits, dim=-1)
        y_pred = torch.argmax(y_logits, dim=-1)

        f1 = compute_f1(y_pred.view(-1).cpu().numpy(), batch['y_ners'].view(-1).cpu().numpy())
        val_test_f1[split_data].append(f1)
  
  # Monitoring
  print('Train loss: {:.4f}'.format(np.mean(train_losses)))
  print('Val   loss: {:.4f}'.format(np.mean(val_test_losses['val'])))
  print('Test  loss: {:.4f}'.format(np.mean(val_test_losses['test'])))
  print()

  val_f1 = np.mean(val_test_f1['val'])
  test_f1 = np.mean(val_test_f1['test'])
  print('Val   Macro F1: {:.4f}'.format(val_f1))
  print('Test  Macro F1: {:.4f}'.format(test_f1))
  print()

  if best_val_so_far < val_f1:
    best_val_so_far = val_f1
    test_perf = test_f1
    best_epoch = epoch
  
  print('Best Epoch: {}, best val macro F1: {:.4f}, test macro F1: {:.4f}'.format(best_epoch, best_val_so_far, test_perf))
  print()
  print()

------------------------------------------------------------------------
Epoch: 0


Training: 1500it [00:32, 46.25it/s]
Val: 350it [00:03, 92.40it/s]
Test: 350it [00:04, 80.27it/s]


Train loss: 0.1183
Val   loss: 0.0658
Test  loss: 0.0644

Val   Macro F1: 0.2371
Test  Macro F1: 0.2374

Best Epoch: 0, best val macro F1: 0.2371, test macro F1: 0.2374


------------------------------------------------------------------------
Epoch: 1


Training: 1500it [00:32, 46.45it/s]
Val: 350it [00:03, 93.45it/s]
Test: 350it [00:03, 90.54it/s]


Train loss: 0.0579
Val   loss: 0.0631
Test  loss: 0.0611

Val   Macro F1: 0.2373
Test  Macro F1: 0.2376

Best Epoch: 1, best val macro F1: 0.2373, test macro F1: 0.2376


------------------------------------------------------------------------
Epoch: 2


Training: 1500it [00:32, 46.32it/s]
Val: 350it [00:03, 93.34it/s]
Test: 350it [00:03, 93.15it/s]


Train loss: 0.0531
Val   loss: 0.0581
Test  loss: 0.0566

Val   Macro F1: 0.2383
Test  Macro F1: 0.2414

Best Epoch: 2, best val macro F1: 0.2383, test macro F1: 0.2414


------------------------------------------------------------------------
Epoch: 3


Training: 1500it [00:33, 44.36it/s]
Val: 350it [00:03, 88.08it/s]
Test: 350it [00:03, 88.88it/s]


Train loss: 0.0498
Val   loss: 0.0591
Test  loss: 0.0572

Val   Macro F1: 0.2384
Test  Macro F1: 0.2408

Best Epoch: 3, best val macro F1: 0.2384, test macro F1: 0.2408


------------------------------------------------------------------------
Epoch: 4


Training: 1500it [00:33, 44.90it/s]
Val: 350it [00:03, 90.95it/s]
Test: 350it [00:03, 92.73it/s]

Train loss: 0.0481
Val   loss: 0.0578
Test  loss: 0.0562

Val   Macro F1: 0.2379
Test  Macro F1: 0.2408

Best Epoch: 3, best val macro F1: 0.2384, test macro F1: 0.2408







## LSTM

In [None]:
vocabulary = list(set(vocab for li in train_data['tokens'] for vocab in li))
vocabulary.append("endpad")
num_vocab = len(vocabulary)

In [None]:
finer = load_dataset("nlpaueb/finer-139")
# The label schema is attached to the individual splits, not to the DatasetDict
ner_tags = finer["train"].features["ner_tags"].feature.names
num_ner_tags = len(ner_tags)

In [None]:
ner_tags

['O',
 'B-AccrualForEnvironmentalLossContingencies',
 'B-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife',
 'I-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife',
 'B-AllocatedShareBasedCompensationExpense',
 'B-AmortizationOfFinancingCosts',
 'B-AmortizationOfIntangibleAssets',
 'I-AmortizationOfIntangibleAssets',
 'B-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount',
 'I-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount',
 'B-AreaOfRealEstateProperty',
 'I-AreaOfRealEstateProperty',
 'B-AssetImpairmentCharges',
 'B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued',
 'B-BusinessAcquisitionPercentageOfVotingInterestsAcquired',
 'I-BusinessAcquisitionPercentageOfVotingInterestsAcquired',
 'B-BusinessCombinationAcquisitionRelatedCosts',
 'B-BusinessCombinationConsiderationTransferred1',
 'B-BusinessCombinationContingentConsiderationLiability',
 'B-BusinessCombinationRecognizedIdentifiableAssetsAcquired

In [None]:
tokens = data['tokens']
tags = data['ner_tags']

In [None]:
word2idx = {w: i for i, w in enumerate(vocabulary)}  # indices 0..num_vocab-1, so "endpad" gets num_vocab-1, the padding value used below

In [None]:
tag2idx = {t: i for i, t in enumerate(ner_tags)}

In [None]:
# Pair each sentence's tokens with the tags of that same sentence
sentences = []

for to, ta in zip(tokens, tags):
  sentences.append(list(zip(to, ta)))

In [None]:
sentences[0]

[('The', 0),
 ('changes', 0),
 ('in', 0),
 ('the', 0),
 ('fair', 0),
 ('value', 0),
 ('of', 0),
 ('the', 0),
 ('derivatives', 0),
 ('and', 0),
 ('the', 0),
 ('related', 0),
 ('underlying', 0),
 ('foreign', 0),
 ('currency', 0),
 ('exposures', 0),
 ('resulted', 0),
 ('in', 0),
 ('net', 0),
 ('gains', 0),
 ('of', 0),
 ('$', 0),
 ('11', 0),
 ('million', 0),
 ('and', 0),
 ('$', 0),
 ('23', 0),
 ('million', 0),
 ('for', 0),
 ('the', 0),
 ('three', 0),
 ('months', 0),
 ('ended', 0),
 ('March', 0),
 ('31', 0),
 (',', 0),
 ('2020', 0),
 ('and', 0),
 ('2019', 0),
 (',', 0),
 ('respectively', 0),
 (',', 0),
 ('that', 0),
 ('are', 0),
 ('recognized', 0),
 ('in', 0),
 ('Other', 0),
 (',', 0),
 ('net', 0),
 ('expenses', 0),
 ('on', 0),
 ('the', 0),
 ('Consolidated', 0),
 ('Statements', 0),
 ('of', 0),
 ('Income', 0),
 ('.', 0),
 ('5', 0),
 ('.', 0)]

In [None]:
max_len = 64

#padding Input Sentences
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_vocab-1)

In [None]:
y = [[w[1] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_vocab, output_dim=64, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)

In [None]:
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_ner_tags, activation="softmax"))(model)
model = Model(input_word, out)



In [None]:
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [None]:
%%time

chkpt = ModelCheckpoint("model_weights.h5", monitor='val_loss',verbose=1, save_best_only=True, save_weights_only=True, mode='min')

early_stopping = EarlyStopping(monitor='val_accuracy', min_delta=0, patience=1, verbose=0, mode='max', baseline=None, restore_best_weights=False)

#callbacks = [PlotLossesCallback(), chkpt, early_stopping]

history = model.fit(
    x=x_train,
    y=y_train,
    validation_data=(x_test,y_test),
    batch_size=32, 
    epochs=3,
    #callbacks=callbacks,
    verbose=1
)

Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 20h 16min 28s, sys: 1h 48min 41s, total: 22h 5min 9s
Wall time: 15h 28min 31s


In [None]:
model.evaluate(x_test, y_test)



[0.021217001602053642, 0.9969900250434875]

## Transformer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
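The alignment rule inside `tokenize_and_align_labels` can be traced on a small hand-made example. The `word_ids` list below is hypothetical (it stands in for what a subword tokenizer might return) and `align_labels` is a stripped-down copy of the inner loop, written only for illustration.

```python
def align_labels(word_ids, word_labels, ignore_label=-100):
    """Give each word's label to its first subword token only.

    Special tokens (word id None) and continuation subwords receive
    `ignore_label` so that the loss skips them.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_label)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical tokenization of a 3-word sentence:
# [CLS], word 0, word 1 split into two pieces, word 2, [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 41, 0]  # one tag id per original word

aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Labelling only the first subword keeps one prediction per original word, which makes the token-level loss comparable to the word-level tags of the dataset.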

In [None]:
data_train_tokenized = data_train.map(tokenize_and_align_labels, batched=True)
data_test_tokenized = data_test.map(tokenize_and_align_labels, batched=True)
data_valid_tokenized = data_valid.map(tokenize_and_align_labels, batched=True)

  0%|          | 0/15 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN t

In [None]:
batch_size = 64


args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01
)

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset = data_train_tokenized.select(range(5000)),
    eval_dataset = data_test_tokenized.select(range(500)),
    data_collator=data_collator,
    tokenizer=tokenizer
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: ner_tags, tokens, id. If ner_tags, tokens, id are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 237


Epoch  Training Loss  Validation Loss
1      No log         0.098765
2      No log         0.095106
3      No log         0.078951


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: ner_tags, tokens, id. If ner_tags, tokens, id are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 500
  Batch size = 64
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: ner_tags, tokens, id. If ner_tags, tokens, id are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 500
  Batch size = 64
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: ner_tags, tokens, id. If ner_tags, tokens, id are not expected by `DistilBertForTokenClassification.forward`,  you can 

TrainOutput(global_step=237, training_loss=0.35482530553632646, metrics={'train_runtime': 4845.7416, 'train_samples_per_second': 3.096, 'train_steps_per_second': 0.049, 'total_flos': 288254788298064.0, 'train_loss': 0.35482530553632646, 'epoch': 3.0})

# Summary

This is my first time training deep learning models, and I realized that they are far more complicated than classical machine learning models and require much more resources. It takes a long time to train a deep learning model, and my Colab notebook crashed multiple times due to insufficient RAM. It took me multiple days to make the code work. I would have loved to train the models better and dive deeper into the details, but unfortunately, given the time constraint, I could only apply the basics of the models and present how they work with the chosen dataset. 
<br> 
<br> **Comparison of Models**
<br> An RNN processes data sequentially and can handle inputs of any length. However, it handles long sequences poorly because information from distant positions fades as it is passed through many time steps. 
<br> LSTM is a widely used replacement for the plain RNN cell. It adds a forget gate, an input gate, and an output gate, which decide which information to keep and which to discard, so it retains long-range context better than a plain RNN. 
<br> DistilBERT (`distilbert-base-uncased`), a smaller distilled version of BERT, is the model used in the Transformer section. BERT consists of Transformer encoder layers with self-attention heads and is pre-trained to produce deep bidirectional representations by jointly conditioning on both left and right context in all layers.
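
For reference, the gates mentioned in the comparison above can be written out as equations. This is one common textbook formulation of an LSTM cell, not something taken from the notebook's code, and the symbols are assumptions introduced here: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and $W$, $U$, $b$ the learned parameters.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

A plain RNN, by comparison, has only the single update $h_t = \tanh(W x_t + U h_{t-1} + b)$. The additive update of $c_t$ is what lets an LSTM carry information across many steps.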