<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/deeplearning.ai/tf/trax_ner_reformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NER Reformer

## Named Entity Recognition

Named-entity recognition (NER), also known as entity identification, entity chunking, and entity extraction, is a subtask of information extraction that locates named entities mentioned in unstructured text and classifies them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.
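
To make the task concrete, the sketch below groups token-level tags into entity spans. It assumes a B-/I-/O tagging scheme of the kind used by GMB-style corpora; the sentence, its tags, and the helper name `extract_entities` are illustrative and not taken from the dataset.

```python
def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush the one in progress, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the entity in progress.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Illustrative sentence and tags (hypothetical, not copied from the dataset).
tokens = ["Ada", "Lovelace", "visited", "London", "in", "1842", "."]
tags = ["B-per", "I-per", "O", "B-geo", "O", "B-tim", "O"]
entities = extract_entities(tokens, tags)
print(entities)
```

A model for this task only has to predict one tag per token; span grouping like this is a post-processing step.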



In [2]:
#@title ## Install Dependencies
#@markdown - trax
#@markdown - kaggle client: downloads dataset

%%capture --no-stdout --no-stderr
!pip install -Uqq trax 
!pip install -Uqq kaggle

# %%python
print("Dependencies successfully installed.")

Dependencies successfully installed.


In [3]:
#@title ## Download Kaggle Dataset
#@markdown Dataset: Annotated Corpus for Named Entity Recognition <br>
#@markdown [https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
#@markdown <br><br>
#@markdown This is an extract from the GMB corpus, tagged and annotated specifically for training a classifier to predict named entities such as names and locations.
from google.colab import drive
drive.mount('/content/drive')

!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d abhinavwalia95/entity-annotated-corpus
!unzip -o /content/entity-annotated-corpus


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Downloading entity-annotated-corpus.zip to /content
Archive:  /content/entity-annotated-corpus.zip
  inflating: ner.csv                 
  inflating: ner_dataset.csv         


In [4]:
#@title ## Import packages
#@markdown DL framework: trax<br>
#@markdown Data Manipulation: pandas<br>
import random as rnd

import numpy as np
import pandas as pd
import trax
from trax import layers as tl

#print('trax:', trax.__version__)
print('numpy:', np.__version__)
print('pandas:', pd.__version__)

numpy: 1.19.4
pandas: 1.1.5


## Preprocessing

Padding tokens

In [5]:
PAD_TOKEN = "PAD"
PAD_INDEX = 0
PAD_TAG = "O"
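
Sentences differ in length, so each batch is padded to the length of its longest sentence before being stacked into one array. The following is a minimal NumPy sketch of that step, assuming index 0 is reserved for padding as `PAD_INDEX` suggests; `pad_batch` is a hypothetical helper and not part of trax.

```python
import numpy as np

PAD_INDEX = 0  # same value as in the cell above


def pad_batch(sequences, pad=PAD_INDEX):
    """Right-pad variable-length id sequences into one 2-D integer array."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch


batch = pad_batch([[5, 2, 9], [7], [3, 4]])
print(batch)
```

Padding only to the longest sentence in the batch, rather than to a global maximum, keeps batches of short sentences small.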

Loading dataset

In [6]:
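# Note: pandas >= 1.3 replaces error_bad_lines=False with on_bad_lines="skip"
data = pd.read_csv("ner_dataset.csv", encoding="ISO-8859-1", error_bad_lines=False)
#data = data.rename(columns={"Sentence #": "sentence_id", "Word": "word", "Tag": "tag"})
#data = data[["sentence_id", "word", "tag"]]
# Forward-fill missing values so every row carries its sentence label
data = data.ffill()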
data.head(3)

|   | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | Sentence: 1 | of | IN | O |
| 2 | Sentence: 1 | demonstrators | NNS | O |
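
The forward fill in the loading cell suggests that the `Sentence #` column is populated only on some rows (presumably the first token of each sentence) and is empty elsewhere. As a plain-Python sketch of what forward filling does, with a made-up column and a hypothetical `forward_fill` helper standing in for the pandas call:

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical raw column: the sentence label is present only on the first token.
sentence_col = ["Sentence: 1", None, None, "Sentence: 2", None]
filled_col = forward_fill(sentence_col)
print(filled_col)
```

After this step every token row can be grouped by its sentence label.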


Tag Entities

In [7]:
#data["Tag"].value_counts()

Build Vocab

In [8]:
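# Extract the 'Word' column from the dataframe
words = data.loc[:, "Word"]

# NOTE: words.txt is read here before it is written below, so on a fresh
# run the file is empty, `vocab` holds only '<PAD>', and the pad id is 0.
# The generators in the training cell rely on that id.
!touch words.txt
vocab = {}
with open('words.txt') as f:
  for i, l in enumerate(f.read().splitlines()):
    vocab[l] = i
  print("Number of words:", len(vocab))
  vocab['<PAD>'] = len(vocab)

# Write every token (one per line) to a text file with np.savetxt
np.savetxt('words.txt', words.values, fmt="%s")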

class Get_sentence(object):
    """Group the token rows into sentences of (word, POS, tag) tuples."""

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

getter = Get_sentence(data)
sentence = getter.sentences

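# Unique words and tags (set order is arbitrary, so ids can differ between runs)
words = list(set(data["Word"].values))
words_tag = list(set(data["Tag"].values))

# Word ids start at 1, leaving 0 free for padding; tag ids start at 0
word_idx = {w: i + 1 for i, w in enumerate(words)}
tag_idx = {t: i for i, t in enumerate(words_tag)}

# Encode every sentence as a list of word ids (X) and a list of tag ids (y)
X = [[word_idx[w[0]] for w in s] for s in sentence]
y = [[tag_idx[w[2]] for w in s] for s in sentence]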

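def data_generator(batch_size, x, y, pad, shuffle=False, verbose=False):
    """Yield (X, Y) batches forever, right-padding every sequence with `pad`
    to the length of the longest sentence in the batch."""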
    num_lines = len(x)
    lines_index = [*range(num_lines)]
    if shuffle:
        rnd.shuffle(lines_index)
    
    index = 0 
    while True:
        buffer_x = [0] * batch_size 
        buffer_y = [0] * batch_size 

        max_len = 0
        for i in range(batch_size):
            if index >= num_lines:
                index = 0
                if shuffle:
                    rnd.shuffle(lines_index)
            
            buffer_x[i] = x[lines_index[index]]
            buffer_y[i] = y[lines_index[index]]
            
            lenx = len(x[lines_index[index]])    
            if lenx > max_len:
                max_len = lenx                  
            
            index += 1

        X = np.full((batch_size, max_len), pad)
        Y = np.full((batch_size, max_len), pad)


        for i in range(batch_size):
            x_i = buffer_x[i]
            y_i = buffer_y[i]

            for j in range(len(x_i)):

                X[i, j] = x_i[j]
                Y[i, j] = y_i[j]

        if verbose: print("index=", index)
        yield((X,Y))

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size = 0.1,random_state=1)

Number of words: 0
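
The grouping and encoding done in the cell above can be sketched without pandas on a handful of rows. This is an illustration under assumptions: the first three rows match the dataframe preview, the last two are made up, and the unique words and tags are sorted so that ids are reproducible (the cell above enumerates an unordered set instead).

```python
from collections import defaultdict

# Rows in the shape (sentence_id, word, tag); the "Sentence: 2" rows are hypothetical.
rows = [
    ("Sentence: 1", "Thousands", "O"),
    ("Sentence: 1", "of", "O"),
    ("Sentence: 1", "demonstrators", "O"),
    ("Sentence: 2", "London", "B-geo"),
    ("Sentence: 2", "protests", "O"),
]

# Group (word, tag) pairs by sentence, preserving token order.
grouped = defaultdict(list)
for sentence_id, word, tag in rows:
    grouped[sentence_id].append((word, tag))
sentences = list(grouped.values())

# Sorting makes the id assignment reproducible across runs.
# Word ids start at 1 so that 0 stays free for padding.
word_idx = {w: i + 1 for i, w in enumerate(sorted({w for _, w, _ in rows}))}
tag_idx = {t: i for i, t in enumerate(sorted({t for _, _, t in rows}))}

# Encode each sentence as parallel lists of word ids and tag ids.
X = [[word_idx[w] for w, _ in s] for s in sentences]
y = [[tag_idx[t] for _, t in s] for s in sentences]
print(X)
print(y)
```

Each entry of `X` lines up position by position with the same entry of `y`, which is what the batch generator relies on when it pads both with the same length.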


## Model

In [9]:
!pip install --upgrade jax # install jax(base)

Requirement already up-to-date: jax in /usr/local/lib/python3.6/dist-packages (0.2.7)


In [10]:
def NERmodel(tags, vocab_size=35178, d_model=50):
    model = tl.Serial(
        trax.models.reformer.Reformer(vocab_size, d_model, ff_activation=tl.LogSoftmax),
        tl.Dense(tags),
        tl.LogSoftmax()
    )
    return model

In [11]:
model = NERmodel(tags=17)
#print(model)

In [12]:
from trax.supervised import training

rnd.seed(33)

batch_size = 64

train_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, x_train, y_train, vocab['<PAD>'], shuffle=True),
    id_to_mask=vocab['<PAD>'])

eval_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, x_test, y_test, vocab['<PAD>'], shuffle=True),
    id_to_mask=vocab['<PAD>'])
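
`add_loss_weights` attaches a per-position weight to each batch so that positions equal to `id_to_mask` do not contribute to the loss. The NumPy sketch below illustrates that idea only; `loss_weights` is a hypothetical helper with made-up inputs, not trax's implementation.

```python
import numpy as np


def loss_weights(targets, id_to_mask):
    """Return 1.0 where a target counts toward the loss, 0.0 where it equals id_to_mask."""
    return 1.0 - np.equal(targets, id_to_mask).astype(np.float32)


# Hypothetical padded tag batch; 0 plays the role of the pad id.
targets = np.array([[3, 1, 2], [4, 0, 0]])
weights = loss_weights(targets, id_to_mask=0)
print(weights)
```

In this sketch the comparison is made on the target ids, so a real tag that happened to share the pad id would be masked as well; keeping the pad id distinct from every tag id avoids that.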

In [13]:
def train_model(model, train_generator, eval_generator, train_steps=1, output_dir='model'):
    """Train `model` for `train_steps` steps and return the trax training loop."""
    train_task = training.TrainTask(
        train_generator,
        loss_layer=tl.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam(0.01),
        n_steps_per_checkpoint=10  # evaluate and checkpoint every 10 steps
    )

    eval_task = training.EvalTask(
        labeled_data=eval_generator,
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
        n_eval_batches=10
    )

    training_loop = training.Loop(
        model,
        train_task,
        eval_tasks=eval_task,
        output_dir=output_dir)

    training_loop.run(n_steps=train_steps)
    return training_loop

In [14]:
train_steps = 100
training_loop = train_model(model, train_generator, eval_generator, train_steps)


Step      1: Total number of trainable weights: 64264085
Step      1: Ran 1 train steps in 85.09 secs
Step      1: train CrossEntropyLoss |  3.15675759
Step      1: eval  CrossEntropyLoss |  3.98971963
Step      1: eval          Accuracy |  0.85465195

Step     10: Ran 9 train steps in 401.32 secs
Step     10: train CrossEntropyLoss |  4.18205214
Step     10: eval  CrossEntropyLoss |  4.08520691
Step     10: eval          Accuracy |  0.85012255

Step     20: Ran 10 train steps in 344.50 secs
Step     20: train CrossEntropyLoss |  5.38061810
Step     20: eval  CrossEntropyLoss |  5.23265591
Step     20: eval          Accuracy |  0.85214944

Step     30: Ran 10 train steps in 85.92 secs
Step     30: train CrossEntropyLoss |  4.72383165
Step     30: eval  CrossEntropyLoss |  3.02188914
Step     30: eval          Accuracy |  0.84257923

Step     40: Ran 10 train steps in 242.10 secs
Step     40: train CrossEntropyLoss |  2.26916385
Step     40: eval  CrossEntropyLoss |  1.50706003
Step   

In [16]:
train_steps

100
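
The `Accuracy` metric in the training log is reported per token. As a rough sketch of how per-token log-probabilities can be turned into tag predictions and scored while ignoring padded positions, here is a NumPy illustration with made-up numbers; `masked_accuracy` is a hypothetical helper and not the code behind `tl.Accuracy`.

```python
import numpy as np


def masked_accuracy(log_probs, targets, pad_id):
    """Fraction of non-pad positions where the arg-max tag equals the target."""
    predictions = np.argmax(log_probs, axis=-1)
    mask = targets != pad_id
    return float(np.sum((predictions == targets) & mask) / np.sum(mask))


# Hypothetical model output: a batch of 1 sentence, 3 positions, 3 tags.
log_probs = np.log(np.array([[[0.1, 0.7, 0.2],
                              [0.2, 0.2, 0.6],
                              [0.8, 0.1, 0.1]]]))
targets = np.array([[1, 1, 0]])  # last position is padding (id 0)
accuracy = masked_accuracy(log_probs, targets, pad_id=0)
print(accuracy)
```

Excluding padded positions matters because counting them would inflate or deflate the score depending on what the model predicts for padding.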