<a href="https://colab.research.google.com/github/mkaramib/NLP/blob/main/NER/NER_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition (NER)
This Jupyter notebook implements a named entity recognition (NER) model using an LSTM, with Trax as the development library.


In [1]:
import os


In [2]:
# install Trax
!pip install -q -U trax
import trax
from trax import layers as tl  # core building block
from trax import shapes  # data signatures: dimensionality and type
from trax import fastmath  # uses jax, offers numpy on steroids
from trax.supervised import training

# import trax.fastmath.numpy
import trax.fastmath.numpy as np



## Data
This section loads the data: the sentences, their label sequences, the unique words, and the unique tags.

In [5]:
# define corresponding files.
sentences_file = "./data/sentences.txt"
labels_file = "./data/labels.txt"
words_file = "./data/words.txt"
tags_file = "./data/tags.txt"

### Sentences, Labels, Words, Tags
The sentences, their corresponding sequences of NER labels, the unique words, and the unique tags are loaded here.

In [42]:
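# load the lines of the given file, without their trailing newlines
def load_content(file):
  # the with-block closes the file once all lines are read
  with open(file, mode="r", encoding="ISO-8859-1") as f:
    return [line.replace("\n", "") for line in f]

# load sentences, labels, words, and tags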
sentences = load_content(sentences_file)
labels = load_content(labels_file)
words = load_content(words_file)
tags_raw = load_content(tags_file)

### Vocabulary of Words and Tags
To vectorize the sentences, we need a vocabulary that maps each word to an index, and a similar mapping for the tags.

In [55]:
# build the vocabulary
vocab = {words[i]:i for i in range(len(words))}

# add <PAD> and <UNK> to vocab
vocab['<PAD>'] = len(vocab)
vocab['<UNK>'] = len(vocab)

# build the tags vocab
tags = {tags_raw[i]:i for i in range(len(tags_raw))}

### Vectorize Sentences and Labels
In this step, the sentences and labels are converted to sequences of indices using the vocab and tags dictionaries. Words that are missing from the vocabulary are mapped to <UNK>.

In [62]:
# vectorize sentences
v_sentences = [ [vocab[t] if t in vocab else vocab['<UNK>'] for t in sentence.split(' ')] for sentence in sentences]

# vectorize labels
v_labels = [[tags[l] for l in label.split(' ')] for label in labels]
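
To make the mapping concrete, here is a minimal sketch on toy data. The names toy_words, toy_tags, toy_vocab, toy_tag_map, and vectorize are hypothetical and are not part of the notebook; the tag strings are only illustrative and may differ from the ones in tags.txt.

```python
# Toy data (hypothetical, not taken from the notebook's data files)
toy_words = ["Peter", "lives", "in", "Paris"]
toy_tags = ["O", "B-per", "B-geo"]

# Build the word vocabulary the same way as above, then add the special tokens
toy_vocab = {word: i for i, word in enumerate(toy_words)}
toy_vocab["<PAD>"] = len(toy_vocab)
toy_vocab["<UNK>"] = len(toy_vocab)

# Build the tag mapping
toy_tag_map = {tag: i for i, tag in enumerate(toy_tags)}

def vectorize(sentence, vocab):
    """Map each space-separated token to its index, falling back to <UNK>."""
    unk = vocab["<UNK>"]
    return [vocab.get(token, unk) for token in sentence.split(" ")]

# "Berlin" is not in the toy vocabulary, so it maps to the <UNK> index
toy_ids = vectorize("Peter lives in Berlin", toy_vocab)
toy_label_ids = [toy_tag_map[t] for t in "B-per O O B-geo".split(" ")]

print(toy_ids)        # [0, 1, 2, 5]
print(toy_label_ids)  # [1, 0, 0, 2]
```

Each sentence and its label sequence end up with the same length, which is what lets them be paired token by token later.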

### Train, Validation, Test split
In this section, the sentences and their corresponding label sequences are divided into train, validation, and test sets. The split is based on ratios.

In [None]:
# define the train/val/test ratios (percentages)
train_r, val_r, test_r = 70, 10, 20

# find the end index for train split
train_end_i = int(len(v_sentences) * train_r/100)

# find the end index for the validation set, which is located after the train set
val_end_i = train_end_i + int(len(v_sentences) * val_r/100)

# generate the train/val/test sentences and label sequences
train_s, train_l = v_sentences[:train_end_i], v_labels[:train_end_i]
val_s, val_l = v_sentences[train_end_i:val_end_i], v_labels[train_end_i:val_end_i]
# slice the labels to the end (val_end_i:) so test_l is a list of label sequences
test_s, test_l = v_sentences[val_end_i:], v_labels[val_end_i:]

# assert the split
assert len(v_sentences) == len(train_s) + len(val_s) + len(test_s)
print(f'train size = {len(train_s)}, validation size = {len(val_s)}, test size = {len(test_s)}')
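
The index arithmetic can be checked on a small list. The sketch below uses a hypothetical helper, split_by_ratio, which is not part of the notebook; it splits sequentially, with no shuffling, in the same way as the cell above.

```python
def split_by_ratio(items, train_r=70, val_r=10):
    """Split items sequentially; whatever remains after train and val is the test set."""
    train_end = int(len(items) * train_r / 100)
    val_end = train_end + int(len(items) * val_r / 100)
    return items[:train_end], items[train_end:val_end], items[val_end:]

toy_items = list(range(10))
toy_train, toy_val, toy_test = split_by_ratio(toy_items)

print(len(toy_train), len(toy_val), len(toy_test))  # 7 1 2
```

Because the test set takes the remainder, the three parts always add up to the full list even when the ratios do not divide the length evenly.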

### Data Generator
A data generator is a key part of most NLP applications that use deep learning.
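
As a rough illustration of the idea, the sketch below yields padded batches in an endless loop. It is an assumption about how such a generator might look, not the implementation used in this notebook: the name batch_generator_sketch, the pad_id and seed parameters, and the choice to pad the labels with the same id as the inputs are all hypothetical. It returns plain Python lists, which would still need to be converted to arrays before being fed to a model.

```python
import random

def batch_generator_sketch(x, y, batch_size, pad_id, shuffle=False, seed=0):
    """Yield (inputs, targets) batches forever, padded to the longest sequence in each batch."""
    if not x:
        raise ValueError("x must contain at least one sequence")
    indices = list(range(len(x)))
    rng = random.Random(seed)
    while True:
        # reshuffle the example order at the start of every pass, if requested
        if shuffle:
            rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch_idx = indices[start:start + batch_size]  # the last batch may be smaller
            max_len = max(len(x[i]) for i in batch_idx)
            # right-pad inputs and targets so every row in the batch has max_len items
            inputs = [x[i] + [pad_id] * (max_len - len(x[i])) for i in batch_idx]
            targets = [y[i] + [pad_id] * (max_len - len(y[i])) for i in batch_idx]
            yield inputs, targets

# Toy usage (hypothetical data)
toy_x = [[0, 1, 2], [3, 4], [5]]
toy_y = [[1, 0, 0], [0, 2], [0]]
gen = batch_generator_sketch(toy_x, toy_y, batch_size=2, pad_id=9)

first_inputs, first_targets = next(gen)
second_inputs, second_targets = next(gen)
print(first_inputs)   # [[0, 1, 2], [3, 4, 9]]
print(second_inputs)  # [[5]]
```

Padding per batch rather than to a global maximum keeps short batches small, at the cost of batches with different widths.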

In [None]:
# Data generator
def data_generator(x, y, batch_size, shuffle=False):
  '''
  Input:
    x: list of inputs, each input is a sentence (a sequence of word indices)
    y: list of labels, each label is a sequence of tag indices
    batch_size: number of examples per batch
    shuffle: whether to shuffle the data
  Output:
    
  '''