# Training an entities recognition model

Importing the required code files

In [1]:
from os import getcwd, path
import sys

BASE_PATH = path.dirname(getcwd())
sys.path.append(BASE_PATH)

from config import START_TAG, STOP_TAG

Default language for this instance: en


In [2]:
print(BASE_PATH)

/Users/2359media/Documents/botbot-nlp


The training data must be an array that:
- Contains tuples of (sentence, tags)
- Sentence will be splitted using nltk.wordpunct_tokenize
- Tags will be splitted using .split() - hence spaces by default

Each entity must be separated into 3 kinds of tag: B- (Begin), I- (Inside) and O- (Outside)

_This is to help with separation in the case of consecutive entities_

A `dictionary` to translate from these tags into consecutive indices must be defined
This dictionary will contain:
- The empty token
- `START_TAG` and `END_TAG` tokens (imported from global configs - used internally to indicate start and end of sentence)
- Entities B-, I-, O- tokens

**Sample training data for email recognition:**

In [3]:
# training_data = [('hi thanh', '- - B-name'), ('hello duc, how are you?', '- - B-name - - - - - - - -')]

# tag_to_ix = {'-': 0, '<START>': 1, '<STOP>': 2, 'B-name': 3, 'I-name': 4}

training_data = [(
    'My email address is at luungoc2005@gmail.com.',
    '- - - - - - - - - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL -'
), (
    'Contact me at contact@2359media.net.',
    '- - - - - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL -'
), (
    'test.email@microsoft.com is a testing email address',
    'B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL - - - - - - - - - -'
), (
    'Any inquiries email thesloth_197@gmail.com for assistance',
    '- - - - - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL - - - -'
), (
    'Email addresses include test.noreply@gmail.com hello.vietnam@hallo.org contact@rocket.net',
    '- - - - - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL'
), (
    'Contact: tester@github.com at any hours',
    '- - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL - - - - - -'
)]

tag_to_ix = {
    '-': 1, # O tag but using '-' for readability
    'B-EMAIL': 2,
    'I-EMAIL': 3,
}

In [4]:
from entities_recognition.transformer.model import TransformerSequenceTaggerWrapper
from entities_recognition.transformer.train import TransformerSequenceTaggerLearner
from entities_recognition.transformer.data import TransformerEntitiesRecognitionDataset
from common.callbacks import PrintLoggerCallback, EarlyStoppingCallback, ReduceLROnPlateau
from common.modules import BertAdam

n_epochs = 500
batch_size = 4
model = TransformerSequenceTaggerWrapper({'tag_to_ix': tag_to_ix})
learner = TransformerSequenceTaggerLearner(model, 
    optimizer_fn=BertAdam,
    optimizer_kwargs={
        'lr': 1e-4,
        'warmup': .1, 
        't_total': n_epochs * (len(training_data) // batch_size)
    }
)
training_data = TransformerEntitiesRecognitionDataset(training_data, tag_to_ix)

In [5]:
learner.fit(
    training_data=training_data,
    epochs=n_epochs,
    batch_size=4,
    callbacks=[
        PrintLoggerCallback(log_every=5),
#         ReduceLROnPlateau(reduce_factor=4, patience=10)
        EarlyStoppingCallback(patience=50)
    ]
)

Word vectors data exists for the following languages: en, en_elmo, vi
0m 3s (- 6m 1s) (5 1%) - loss: 21.8958 - accuracy: 0.3287
0m 6s (- 4m 57s) (10 2%) - loss: 16.6184 - accuracy: 0.3418
0m 8s (- 4m 25s) (15 3%) - loss: 13.5118 - accuracy: 0.4052
0m 10s (- 4m 5s) (20 4%) - loss: 8.3136 - accuracy: 0.6378
0m 12s (- 3m 56s) (25 5%) - loss: 7.0325 - accuracy: 0.6471
0m 16s (- 4m 13s) (30 6%) - loss: 6.5257 - accuracy: 0.6187
0m 19s (- 4m 18s) (35 7%) - loss: 5.9569 - accuracy: 0.5406
0m 21s (- 4m 12s) (40 8%) - loss: 4.5557 - accuracy: 0.5305
0m 24s (- 4m 6s) (45 9%) - loss: 4.1863 - accuracy: 0.6242
0m 26s (- 3m 59s) (50 10%) - loss: 3.7140 - accuracy: 0.6481
0m 28s (- 3m 50s) (55 11%) - loss: 0.9362 - accuracy: 0.7004
0m 30s (- 3m 43s) (60 12%) - loss: 3.5254 - accuracy: 0.5112
0m 32s (- 3m 35s) (65 13%) - loss: 0.6173 - accuracy: 0.7004
0m 33s (- 3m 28s) (70 14%) - loss: 1.4790 - accuracy: 0.6566
0m 35s (- 3m 22s) (75 15%) - loss: 1.5057 - accuracy: 0.5948
0m 37s (- 3m 15s) (80 16%) -

In [6]:
from common.utils import wordpunct_space_tokenize
# model([wordpunct_space_tokenize('test.email@microsoft.com is a testing email address')])
# model([wordpunct_space_tokenize('Any inquiries email thesloth_197@gmail.com for assistance')])
model([wordpunct_space_tokenize('My first email address is actually luungoc2005@yahoo.com')])

[[{'name': 'EMAIL', 'values': ['2005', 'yahoo', 'com']}]]

Evaluate model accuracy by using