<a href="https://colab.research.google.com/github/mkaramib/NLP/blob/main/NER/NER_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition (NER)
This Jupyter notebook implements a named entity recognition (NER) model based on an LSTM. I use Trax as the development library.


In [1]:
import os
import random as rnd
import numpy as np

In [None]:
# install Trax
!pip install -q -U trax
import trax
from trax import layers as tl  # core building block
from trax import shapes  # data signatures: dimensionality and type
from trax import fastmath  # uses jax, offers numpy on steroids
from trax.supervised import training

# import trax.fastmath.numpy
import trax.fastmath.numpy as jax_np

## Data
This section loads the data: the sentences, their label sequences, the unique words, and the unique tags.

In [4]:
# define corresponding files.
sentences_file = "./data/sentences.txt"
labels_file = "./data/labels.txt"
words_file = "./data/words.txt"
tags_file = "./data/tags.txt"

### Sentences, Labels, Words, Tags
The sentences, their corresponding sequences of NER labels, the unique words, and the unique tags are loaded from the files defined above.

In [5]:
# load the lines of the given file, without the trailing newline
def load_content(file):
  # the with-block closes the file handle once the lines are read
  with open(file, mode="r", encoding="ISO-8859-1") as f:
    return [line.rstrip("\n") for line in f]

# load sentences
sentences = load_content(sentences_file)
labels = load_content(labels_file)
words = load_content(words_file)
tags_raw = load_content(tags_file)

### Vocabulary of Words and Tags
To vectorize the sentences, we first need a vocabulary that maps each word to an integer id. The tags get a similar mapping.

In [6]:
# build the word vocabulary: word -> integer id
vocab = {word: i for i, word in enumerate(words)}

# add the special <PAD> and <UNK> tokens to the vocab
vocab['<PAD>'] = len(vocab)
vocab['<UNK>'] = len(vocab)

# build the tag vocabulary: tag -> integer id
tags = {tag: i for i, tag in enumerate(tags_raw)}
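
To make the id assignment concrete, here is a minimal sketch on a toy word list. The words are hypothetical stand-ins for the contents of `words.txt`. The point is only that `<PAD>` and `<UNK>` receive the two ids immediately after the last real word.

```python
# Toy word list standing in for the contents of words.txt (hypothetical data).
toy_words = ["the", "president", "visited", "Paris"]

# Map each word to its position in the list.
toy_vocab = {word: i for i, word in enumerate(toy_words)}

# Append the two special tokens after the real words.
toy_vocab["<PAD>"] = len(toy_vocab)
toy_vocab["<UNK>"] = len(toy_vocab)

print(toy_vocab)
```

With four real words, `<PAD>` gets id 4 and `<UNK>` gets id 5, so the vocabulary size is the number of words plus two.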

### Vectorize Sentences and Labels
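In this step, we vectorize the sentences and labels: every word and every label is replaced by its integer id from the vocab and tags dictionaries. Words that are not in the vocabulary map to `<UNK>`.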

In [7]:
# vectorize sentences
v_sentences = [[vocab.get(t, vocab['<UNK>']) for t in sentence.split(' ')] for sentence in sentences]

# vectorize labels
v_labels = [[tags[l] for l in label.split(' ')] for label in labels]
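
As a small illustration of this step, the sketch below vectorizes one sentence with a toy vocabulary. The dictionaries, the tag names, and the helper `vectorize` are hypothetical and exist only for this example. The word "Tokyo" is deliberately left out of the toy vocabulary to show the `<UNK>` fallback.

```python
# Hypothetical toy dictionaries; the real ones are built from words.txt and tags.txt.
toy_vocab = {"the": 0, "president": 1, "visited": 2, "Paris": 3, "<PAD>": 4, "<UNK>": 5}
toy_tags = {"O": 0, "B-geo": 1}

def vectorize(sentence, label_line, vocab, tags):
    """Turn a space-separated sentence and its label line into lists of ids."""
    word_ids = [vocab.get(tok, vocab["<UNK>"]) for tok in sentence.split(" ")]
    tag_ids = [tags[lab] for lab in label_line.split(" ")]
    return word_ids, tag_ids

# "Tokyo" is not in the toy vocabulary, so it maps to the <UNK> id.
word_ids, tag_ids = vectorize("the president visited Tokyo", "O O O B-geo", toy_vocab, toy_tags)
print(word_ids, tag_ids)
```

The two lists have the same length because every token has exactly one label.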

### Train, Validation, Test split
In this section, the sentences and their label sequences are divided into training, validation, and test sets. The split is based on ratios given as percentages.

In [9]:
# define the train/val/test ratios (percentages)
train_r, val_r, test_r = 70, 10, 20

# find the end index of the train split
train_end_i = int(len(v_sentences) * train_r/100)

# find the end index of the validation set, which is located after the train set
val_end_i = train_end_i + int(len(v_sentences) * val_r/100)

# generate the train/val/test sentences and label sequences
train_s, train_l = v_sentences[:train_end_i], v_labels[:train_end_i]
val_s, val_l = v_sentences[train_end_i:val_end_i], v_labels[train_end_i:val_end_i]
# slice the labels with val_end_i: so that test_l is a list aligned with test_s
test_s, test_l = v_sentences[val_end_i:], v_labels[val_end_i:]

# assert the split
assert len(v_sentences) == len(train_s) + len(val_s) + len(test_s)
print(f'train size = {len(train_s)}, validation size = {len(val_s)}, test size = {len(test_s)}')

train size = 33570, validation size = 4795, test size = 9593
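
The index arithmetic can be checked in isolation. The sketch below wraps it in a hypothetical helper, `split_indices`, and applies it to a toy list of ten items. The printed sizes above add up to 47958 sentences, so the same helper should give end indices 33570 and 38365 for the real data.

```python
def split_indices(n, train_r, val_r):
    """Return the end indices of the train and validation splits for n items."""
    train_end = int(n * train_r / 100)
    val_end = train_end + int(n * val_r / 100)
    return train_end, val_end

# Toy data: ten "sentences" and ten aligned "labels".
items = list(range(10))
item_labels = [i * 10 for i in items]

train_end, val_end = split_indices(len(items), 70, 10)

# Sentences and labels must be sliced with the same bounds to stay aligned.
test_items = items[val_end:]
test_labels = item_labels[val_end:]
print(train_end, val_end, test_items, test_labels)
```

Because `int` truncates, the test split absorbs any remainder, which is why the test set is slightly larger than exactly 20% of the data.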


### Data Generator
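A data generator is a key part of most deep-learning NLP applications. The generator below yields batches of sentences and label sequences indefinitely, padding each batch to the length of its longest sentence.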

In [11]:
# Data generator
def data_generator(x, y, batch_size, pad, shuffle=False):
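  '''
  Input:
    x: list of inputs, each input is a sentence (sequence of word ids)
    y: list of labels, each label is a sequence of tag ids
    batch_size: number of sentences per batch
    pad: id of <PAD> in the vocab, used to pad both inputs and labels
    shuffle: whether to shuffle the data at the start and after each pass
  Output:
    generator yielding (X, Y) tuples of numpy arrays, each of shape
    (batch_size, max_l), where max_l is the longest sentence in the batch
  '''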
  l = len(x)
  x_indexes = [*range(l)]

  # shuffle the data if required
  if shuffle:
    rnd.shuffle(x_indexes)

  index = 0
  while True:
      
      # max length of sentence
      max_l = 0

      # instantiate the output lists
      x_out, y_out = [0]*batch_size, [0]*batch_size

      # select a list of size of batch.
      for i in range(batch_size):
        # at the end of data, reset the index
        if index >= l:
          index = 0
          if shuffle:
            rnd.shuffle(x_indexes)
        
        # add the current sentence and its labels to the batch
        x_out[i] = x[x_indexes[index]]
        y_out[i] = y[x_indexes[index]]
      
        # check max row
        lenx = len(x_out[i])
        if lenx > max_l:
          max_l = lenx
        
        # increase the index
        index += 1

      # convert to the output 
      X = np.full((batch_size, max_l), pad)
      Y = np.full((batch_size, max_l), pad)
      for i in range(batch_size):
        for j in range(len(x_out[i])):
          X[i, j] = x_out[i][j]
          Y[i, j] = y_out[i][j]
      
      # yield the result
      yield((X,Y))
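
The padding step is the core of the generator. The sketch below isolates it in a hypothetical helper, `pad_batch`, which uses plain lists instead of numpy arrays. It is an illustration of the idea, not a replacement for the generator above.

```python
def pad_batch(seqs, pad):
    """Right-pad every sequence with `pad` up to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad] * (max_len - len(s)) for s in seqs]

# Two toy sentences of different lengths; 99 plays the role of the <PAD> id.
batch = pad_batch([[1, 2, 3], [4]], pad=99)
print(batch)
```

Padding per batch rather than to a global maximum keeps the arrays as small as the longest sentence in each batch allows.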

## Model
This section designs an LSTM-based model for NER.


In [12]:
# NER as the NN model.
def NER(vocab_size=35181, d_model=50, tags=tags):
    '''
      Input: 
        vocab_size - integer containing the size of the vocabulary
        d_model - integer describing the embedding size
        tags - dictionary mapping each tag to its id; its length sets the output size
      Output:
        model - a trax serial model
    '''
    # define the model
    model = tl.Serial(
      tl.Embedding(vocab_size=vocab_size, d_feature=d_model), # Embedding layer
      tl.LSTM(n_units=d_model),     # LSTM layer
      tl.Dense(n_units=len(tags)),  # Dense layer with len(tags) units
      tl.LogSoftmax()               # LogSoftmax layer
      )
     
    return model

### Training/Validation Generators
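In this section, the training and validation generators are created. `add_loss_weights` attaches a weight to every target position and sets it to zero wherever the target equals the `<PAD>` id, so padded positions do not contribute to the loss.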

In [13]:
rnd.seed(33)
batch_size = 64

# Create training data
train_generator = trax.data.inputs.add_loss_weights(
    data_generator(train_s, train_l, batch_size=batch_size, pad=vocab['<PAD>'], shuffle=True),
    id_to_mask=vocab['<PAD>'])

# Create validation data
eval_generator = trax.data.inputs.add_loss_weights(
    data_generator(val_s, val_l, batch_size=batch_size, pad=vocab['<PAD>'], shuffle=True),
    id_to_mask=vocab['<PAD>'])
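
To show what the masking amounts to, here is a plain-Python sketch of the idea. The helper `loss_weights` is hypothetical and is not Trax's implementation. It assumes the weights are 1.0 for real positions and 0.0 wherever the target equals the id to mask.

```python
def loss_weights(targets, id_to_mask):
    """Return 1.0 for real target positions and 0.0 for masked (padded) ones."""
    return [[0.0 if t == id_to_mask else 1.0 for t in row] for row in targets]

# Toy batch of tag ids; 99 plays the role of the <PAD> id used to pad the labels.
toy_targets = [[0, 2, 1], [3, 99, 99]]
weights = loss_weights(toy_targets, id_to_mask=99)
print(weights)
```

This only works if the pad id never collides with a real tag id. In this notebook the labels are padded with the `<PAD>` id from the word vocabulary, which presumably lies far above the range of tag ids.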

### Training
In this section, the training task, the evaluation task, and the training loop are built.

In [14]:
# training loop
def train_model(NER_model, train_generator, eval_generator, train_steps=1, output_dir='model'):
    '''
    Input: 
        NER_model - the model you are building
        train_generator - The data generator for training examples
        eval_generator - The data generator for validation examples,
        train_steps - number of training steps
        output_dir - folder to save your model
    Output:
        training_loop - a trax supervised training Loop
    '''
    # step 1- training task
    train_task = training.TrainTask(
      train_generator,                    # A train data generator
      loss_layer = tl.CrossEntropyLoss(), # A cross-entropy loss function
      optimizer = trax.optimizers.Adam(0.01),  # The adam optimizer
      n_steps_per_checkpoint=200, # Evaluate and save a checkpoint every 200 training steps.
    )

    # step 2- evaluation task
    eval_task = training.EvalTask(
      labeled_data = eval_generator,      # A labeled data generator
      metrics = [tl.CrossEntropyLoss(), tl.Accuracy()], # Evaluate with cross-entropy loss and accuracy
      n_eval_batches = 10         # Number of batches to use on each evaluation
    )

    # step 3- training loop
    training_loop = training.Loop(
        NER_model, # A model to train
        train_task, # A train task
        eval_tasks = [eval_task], # The evaluation task
        output_dir = output_dir) # The output directory

    # run with train_steps
    training_loop.run(n_steps = train_steps)

    # return the training loop
    return training_loop
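
The evaluation task reports accuracy, and since the generators attach loss weights, padded positions should not count toward it. The sketch below shows one way to compute such a weighted accuracy from log-probabilities in plain Python. The helper `masked_accuracy` is hypothetical, and Trax's own metric layer may differ in detail.

```python
def masked_accuracy(log_probs, targets, weights):
    """Weighted share of positions where the highest-scoring tag equals the target."""
    correct = 0.0
    total = 0.0
    for row_lp, row_t, row_w in zip(log_probs, targets, weights):
        for scores, target, weight in zip(row_lp, row_t, row_w):
            # predicted tag = index of the largest log-probability
            pred = max(range(len(scores)), key=scores.__getitem__)
            correct += weight * (pred == target)
            total += weight
    return correct / total

# One toy sentence, three positions, two tags; the last position is padding.
toy_log_probs = [[[-0.1, -2.3], [-1.6, -0.2], [-0.3, -1.4]]]
toy_targets = [[0, 0, 99]]
toy_weights = [[1.0, 1.0, 0.0]]

acc = masked_accuracy(toy_log_probs, toy_targets, toy_weights)
print(acc)
```

Only the two real positions count here: the first is predicted correctly and the second is not, so the padded third position has no effect on the result.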

### Test the training
The following code runs the training loop for a fixed number of steps to check that training works.

In [None]:
train_steps = 1000            # Number of training steps (on Coursera only 100 steps were possible)
!rm -f 'model/model.pkl.gz'  # Remove the old checkpoint (model.pkl.gz) if it exists

# Train the model
training_loop = train_model(NER(), train_generator, eval_generator, train_steps)