# Training an entities recognition model

Importing the required code files

In [1]:
from os import getcwd, path
import sys

BASE_PATH = path.dirname(getcwd())
sys.path.append(BASE_PATH)

from config import START_TAG, STOP_TAG

In [2]:
print(BASE_PATH)

/Users/2359media/Documents/botbot-nlp


The training data must be an array that:
- Contains tuples of (sentence, tags)
- Sentence will be splitted using nltk.wordpunct_tokenize
- Tags will be splitted using .split() - hence spaces by default

Each entity must be separated into 3 kinds of tag: B- (Begin), I- (Inside) and O- (Outside)

_This is to help with separation in the case of consecutive entities_

A `dictionary` to translate from these tags into consecutive indices must be defined
This dictionary will contain:
- The empty token
- `START_TAG` and `END_TAG` tokens (imported from global configs - used internally to indicate start and end of sentence)
- Entities B-, I-, O- tokens

**Sample training data for email recognition:**

In [3]:
training_data = [('hi thanh', '- - B-name'), ('hello duc', '- - B-name')]

tag_to_ix = {'-': 0, '<START>': 1, '<STOP>': 2, 'B-name': 3, 'I-name': 4}

In [4]:
from entities_recognition.bilstm.model import SequenceTaggerWrapper
from entities_recognition.bilstm.train import SequenceTaggerLearner
from common.callbacks import PrintLoggerCallback, EarlyStoppingCallback

model = SequenceTaggerWrapper({'tag_to_ix': tag_to_ix})
learner = SequenceTaggerLearner(model)

In [5]:
learner.fit(
    training_data=training_data,
    epochs=50,
    callbacks=[
        PrintLoggerCallback(log_every=5),
        EarlyStoppingCallback()
    ]
)

0m 1s (- 0m 12s) (5 10%) - loss: 3.3695 - accuracy: 0.3333
0m 1s (- 0m 6s) (10 20%) - loss: 2.6775 - accuracy: 0.5000
0m 1s (- 0m 3s) (15 30%) - loss: 2.1680 - accuracy: 0.6667
0m 1s (- 0m 2s) (20 40%) - loss: 1.5052 - accuracy: 0.6667
0m 1s (- 0m 1s) (25 50%) - loss: 0.7993 - accuracy: 1.0000
0m 2s (- 0m 1s) (30 60%) - loss: 0.3977 - accuracy: 1.0000
0m 2s (- 0m 0s) (35 70%) - loss: 0.1968 - accuracy: 1.0000
0m 2s (- 0m 0s) (40 80%) - loss: 0.1446 - accuracy: 1.0000
0m 2s (- 0m 0s) (45 90%) - loss: 0.0768 - accuracy: 1.0000
0m 2s (- 0m 0s) (50 100%) - loss: 0.0637 - accuracy: 1.0000


In [8]:
model(['it\'s just getting the last word'])

[{'name': ['last', 'word']}]

Evaluate model accuracy by using