# Training an entity recognition model

Import the required modules:

In [1]:
from os import getcwd, path
import sys
import matplotlib.pyplot as plt

BASE_PATH = path.dirname(getcwd())
sys.path.append(BASE_PATH)

from entities_recognition.bilstm.train import trainIters, evaluate
from config import START_TAG, STOP_TAG

In [2]:
print(BASE_PATH)

/Users/2359media/Documents/botbot-nlp


The training data must be a list of `(sentence, tags)` tuples:
- The sentence is split into tokens using `nltk.wordpunct_tokenize`
- The tags string is split using `.split()`, i.e. on whitespace by default
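
The two splitting steps above can be inspected side by side. This is a minimal sketch: `split_example` is a hypothetical helper, not part of the repository, and the regular expression is assumed to reproduce what `nltk.wordpunct_tokenize` does (runs of word characters, or runs of non-space punctuation) so that the example runs without NLTK installed.

```python
import re

# Assumed equivalent of nltk.wordpunct_tokenize: runs of word characters,
# or runs of punctuation (non-word, non-space) characters.
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def split_example(sentence, tags):
    """Return (tokens, tag_list) for one (sentence, tags) training tuple."""
    tokens = _WORDPUNCT.findall(sentence)
    tag_list = tags.split()  # splits on any whitespace by default
    return tokens, tag_list


tokens, tag_list = split_example(
    "tester@github.com",
    "B-EMAIL I-EMAIL I-EMAIL I-EMAIL O-EMAIL",
)
print(tokens)    # ['tester', '@', 'github', '.', 'com']
print(tag_list)
print(len(tokens), len(tag_list))
```

If tags are meant to line up one-to-one with tokens (the source does not say so explicitly), comparing the two lengths is a quick sanity check before training.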

Each entity must be labeled with three kinds of tags: `B-` (Begin), `I-` (Inside) and `O-` (Outside)

_This helps separate consecutive entities_

A dictionary that maps these tags to consecutive indices must also be defined.
This dictionary contains:
- The empty tag
- The `START_TAG` and `STOP_TAG` tokens (imported from the global config and used internally to mark the start and end of a sentence)
- The `B-`, `I-` and `O-` tags of each entity
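
For more than one entity type, the dictionary described above could be generated instead of written by hand. The following is only a sketch: `build_tag_to_ix` is a hypothetical helper, the `"<START>"` and `"<STOP>"` strings are placeholders for the real values imported from `config`, and the ordering (empty tag first, start and stop tags last) is an assumption about what the training code accepts.

```python
def build_tag_to_ix(entity_names, start_tag, stop_tag, empty_tag="-"):
    """Map the empty tag, each entity's B-/I-/O- tags, then the start and
    stop tags to consecutive indices starting at 0."""
    tags = [empty_tag]
    for name in entity_names:
        tags.extend(f"{prefix}-{name}" for prefix in ("B", "I", "O"))
    tags.extend([start_tag, stop_tag])
    return {tag: index for index, tag in enumerate(tags)}


# Placeholder values: the real START_TAG and STOP_TAG come from `config`.
START_TAG, STOP_TAG = "<START>", "<STOP>"

tag_to_ix = build_tag_to_ix(["EMAIL"], START_TAG, STOP_TAG)
print(tag_to_ix)
```

Passing several names, for example `["EMAIL", "PHONE"]` (a made-up second entity), yields three tags per entity plus the empty, start and stop tags.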

**Sample training data for email recognition:**

In [3]:
training_data = [(
    'My email address is at luungoc2005@gmail.com',
    '- - - - - - - - - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL O-EMAIL'
), (
    'Contact me at contact@2359media.net',
    '- - - - - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL O-EMAIL'
), (
    'test.email@microsoft.com is a testing email address',
    'B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL O-EMAIL - - - - - - - - - -'
), (
    'Any inquiries email thesloth_197@gmail.com for assistance',
    '- - - - - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL O-EMAIL - - - -'
), (
    'Email addresses include test.noreply@gmail.com hello.vietnam@hallo.org contact@rocket.net',
    '- - - - - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL O-EMAIL - B-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL I-EMAIL O-EMAIL - B-EMAIL I-EMAIL I-EMAIL I-EMAIL O-EMAIL'
), (
    'Contact: tester@github.com at any hours',
    '- - - B-EMAIL I-EMAIL I-EMAIL I-EMAIL O-EMAIL - - - - - -'
)]

tag_to_ix = {
    '-': 0,
    'B-EMAIL': 1,
    'I-EMAIL': 2,
    'O-EMAIL': 3,
    START_TAG: 4,
    STOP_TAG: 5
}

Begin training the network.
Logs are saved to `entities_recognition/bilstm/logs` by default.

Run `tensorboard --logdir=entities_recognition/bilstm/logs` from the root directory to view the training logs.

Verbosity:
- `verbose = 0` prints almost nothing to the console
- `verbose = 1` logs only every `log_every` epochs (10 by default)
- `verbose = 2` (default) shows tqdm progress bars for both loops

In this case, about 50 epochs should be sufficient (found by trial and error).

In [4]:
model = trainIters(training_data, tag_to_ix, n_iters=50, log_every=5, verbose=1)

Importing /Users/2359media/Documents/botbot-nlp/data/glove/glove.6B.300d.txt...
1m 8s (- 10m 17s) (5 10%) 65.4277
1m 11s (- 4m 45s) (10 20%) 2.0216
1m 14s (- 2m 53s) (15 30%) 0.1957
1m 17s (- 1m 55s) (20 40%) 0.1180
1m 19s (- 1m 19s) (25 50%) 0.0816
1m 23s (- 0m 55s) (30 60%) 0.0584
1m 26s (- 0m 36s) (35 70%) 0.0501
1m 28s (- 0m 22s) (40 80%) 0.0462
1m 31s (- 0m 10s) (45 90%) 0.0345
1m 35s (- 0m 0s) (50 100%) 0.0349


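Evaluate the model's accuracy with `evaluate`. Note that this call scores the model on the same data it was trained on, so the result measures fit to the training set, not generalization: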

In [5]:
evaluate(model, training_data, tag_to_ix)

1.0
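
Since `evaluate` is called here on `training_data` itself, a held-out set would give a more informative score once more examples are available. A minimal sketch, assuming a simple random split is acceptable; `split_examples` is a hypothetical helper and not part of the repository.

```python
import random


def split_examples(examples, held_out_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and return (train, held_out)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    # Keep at least one example aside, even for very small datasets.
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Stand-in data; in the notebook this would be `training_data`.
examples = [(f"sentence {i}", "- -") for i in range(10)]
train_set, held_out_set = split_examples(examples)
print(len(train_set), len(held_out_set))

# In the notebook, the two sets would then be used as:
# model = trainIters(train_set, tag_to_ix, n_iters=50, log_every=5, verbose=1)
# evaluate(model, held_out_set, tag_to_ix)
```

With only the six sample sentences above, a split like this leaves very little to train on, so it is mainly useful for larger datasets.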