In [1]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='ae1a755d-e162-4f07-9f5a-130d2280e78e', project_access_token='p-aa90b9b21de435c3f4c94494a24b5c5e69d030f8')
pc = project.project_context

This notebook demonstrates how to train entity extraction models using Watson NLP.

The dataset has been downloaded and saved in the [Box folder - training.json](https://ibm.box.com/s/llw7q2gzwbqhgt1h7ek5s0d1mulb0uxx) and [Box folder - dev.json](https://ibm.box.com/s/euz5fpn7jmx7um2giopczcct4lk9uyv3) for you. The text data are labeled with types `GeographicFeature`, `Location`, or `Duration`.


## What you'll learn in this notebook

Watson NLP implements state-of-the-art entity extraction algorithms from three different families:
- Classic machine learning using CRF (Conditional Random Field)
- Deep learning using BiLSTM (Bidirectional Long Short Term Memory)
- A transformer-based algorithm using the Google BERT multilingual model 

In this notebook, you'll learn how to:

- **Prepare your data** so that it can be used as training data for the Watson NLP entity extraction algorithms.
- **Train a custom CRF model** using `watson_nlp.blocks.entity_mentions.SIRE`.
- **Train a BiLSTM model** using `watson_nlp.blocks.entity_mentions.BiLSTM`.
- **Train a BERT model** using `watson_nlp.blocks.entity_mentions.BERT`.
- **Store and load models** as an asset of a Watson Studio project.

## Table of Contents


1.  [Before You Start](#beforeYouStart)
1.  [Preparing Training Data](#prepareTraining)
1.  [Model Building](#buildModel)
    1. [SIRE Training](#sire)
    1. [BiLSTM Training](#bilstm)
    1. [BERT Training](#bert)
1.  [Summary](#summary)

<a id="beforeYouStart"></a>
## Before You Start

<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _Default Python 3.8 + Watson NLP XS_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they are running in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook, and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see the cell above, add the token to the notebook by clicking **More > Insert project token** from the notebook action bar.  By running the inserted hidden code cell, a project object is created that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

You can step through the notebook cell by cell by pressing Shift-Enter, or you can execute the entire notebook by selecting **Cell -> Run All** from the menu.

In [2]:
import json
import pandas as pd
import watson_nlp
from watson_nlp import data_model as dm
from watson_nlp.toolkit.entity_mentions_utils import prepare_train_from_json, create_iob_labels

In [3]:
# Silence Tensorflow warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

In [4]:
# Load a syntax model to split the text into sentences and tokens
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))

<a id="prepareTraining"></a>
## Preparing Training Data

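The dataset must be a JSON array of records. Each record holds an `id`, the `text`, and a list of `mentions`, where every mention gives its `text`, its `type`, and its character offsets in the record text (`begin` is inclusive, `end` is exclusive):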
```
[
  {
    "id": 1,
    "text": "This waterfall is actually hours away from Portland, basically in California.",
    "mentions": 
    [
      {
        "text": "waterfall", "type": "GeographicFeature", 
        "location": 
          {
            "begin": 5, 
            "end": 14
          }
      },
      {
        "text": "Portland", 
        "type": "Location", 
        "location": 
          {
            "begin": 43, 
            "end": 51
          }
      },
      {
        "text": "California", 
        "type": "Location", 
        "location": 
          {
            "begin": 66, 
            "end": 76
          }
      }
    ]
  },
  ...
]
```
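
Before training, it can help to confirm that every mention's offsets actually point at the mention text. The following is a minimal sketch in plain Python, not part of the Watson NLP API; the helper name `check_mentions` is made up for illustration, and the sample record is the one shown above.

```python
def check_mentions(records):
    """Return (record id, expected text, text found at offsets) for every mismatch."""
    problems = []
    for rec in records:
        text = rec["text"]
        for mention in rec.get("mentions", []):
            begin = mention["location"]["begin"]
            end = mention["location"]["end"]
            # Offsets are character positions; end is exclusive.
            if text[begin:end] != mention["text"]:
                problems.append((rec["id"], mention["text"], text[begin:end]))
    return problems


sample = [{
    "id": 1,
    "text": "This waterfall is actually hours away from Portland, basically in California.",
    "mentions": [
        {"text": "waterfall", "type": "GeographicFeature", "location": {"begin": 5, "end": 14}},
        {"text": "Portland", "type": "Location", "location": {"begin": 43, "end": 51}},
        {"text": "California", "type": "Location", "location": {"begin": 66, "end": 76}},
    ],
}]

# An empty list means every offset pair matches its mention text.
print(check_mentions(sample))
```

Running a check like this over the training, dev, and test files catches off-by-one offsets early, before they turn into mislabeled tokens.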

The data files are already in this format, so the only preparation needed is to read the JSON files from the Watson Studio project assets and save them to the runtime working directory, where they are used as input for training the models.

In [5]:
buffer = project.get_file("entity_train.json")
pd.read_json(buffer).to_json('train.json', orient='records')
buffer = project.get_file("entity_dev.json")
pd.read_json(buffer).to_json('dev.json', orient='records')
buffer = project.get_file("entity_test.json")
pd.read_json(buffer).to_json('test.json', orient='records')

The JSON files are loaded as data streams. The syntax model splits each text into sentences and tokens so that the labeled mentions can be turned into IOB-tagged training examples.

In [6]:
train_data = dm.DataStream.from_json_array("train.json")
train_iob_stream = prepare_train_from_json(train_data, syntax_model)
dev_data = dm.DataStream.from_json_array("dev.json")
dev_iob_stream = prepare_train_from_json(dev_data, syntax_model)
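
To make the IOB idea concrete, here is a rough sketch of how character-offset mentions map to per-token labels. It is an illustration only: the regular-expression tokenizer stands in for the syntax model, `to_iob` is a made-up helper rather than what `prepare_train_from_json` does internally, and the example sentence and its mentions are hypothetical.

```python
import re


def to_iob(text, mentions):
    """Label each token O, B-<type>, or I-<type> from character-offset mentions."""
    # Simple stand-in tokenizer: runs of word characters or single punctuation marks.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
    labeled = []
    for token, start, end in tokens:
        label = "O"
        for mention in mentions:
            begin = mention["location"]["begin"]
            stop = mention["location"]["end"]
            if start >= begin and end <= stop:
                # The first token of a mention gets B-, the following ones get I-.
                label = ("B-" if start == begin else "I-") + mention["type"]
                break
        labeled.append((token, label))
    return labeled


# Hypothetical record, shaped like the training data.
text = "Mount Hood is two hours from Portland."
mentions = [
    {"type": "GeographicFeature", "location": {"begin": 0, "end": 10}},
    {"type": "Duration", "location": {"begin": 14, "end": 23}},
    {"type": "Location", "location": {"begin": 29, "end": 37}},
]

for token, label in to_iob(text, mentions):
    print(f"{token:10s} {label}")
```

Each token ends up with exactly one label, which is the form a sequence labeling model trains on.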

<a id="buildModel"></a>
## Model Building

<a id="sire"></a>
### SIRE Training

You can train SIRE models using either a CRF or a Maximum Entropy template as the base model. Of the two, the CRF-based template takes longer to train but gives better results.

These algorithms accept a set of features in the form of dictionaries and regular expressions. A set of predefined feature extractors is provided for multiple languages, and you can also define your own features.
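
As a rough intuition for dictionary and regular-expression features, the sketch below computes a few per-token signals of the kind a CRF can learn weights for. It is not the SIRE feature extractor: the dictionary, the pattern, and the `token_features` helper are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical resources for illustration; not the shipped SIRE feature extractors.
LOCATION_DICTIONARY = {"portland", "california"}
DURATION_PATTERN = re.compile(
    r"^(seconds?|minutes?|hours?|days?|weeks?|months?|years?)$", re.IGNORECASE
)


def token_features(token):
    """Return a small feature dictionary for one token."""
    return {
        "in_location_dictionary": token.lower() in LOCATION_DICTIONARY,
        "matches_duration_regex": bool(DURATION_PATTERN.match(token)),
        "is_capitalized": token[:1].isupper(),
    }


for token in ["hours", "Portland", "basically"]:
    print(token, token_features(token))
```

A dictionary feature fires on known surface forms, while a regular-expression feature generalizes to any token that fits a pattern; the model combines both with context from neighboring tokens.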

In [7]:
#help(watson_nlp.blocks.entity_mentions.SIRE.train)

In [7]:
# Download the algorithm template
mentions_train_template = watson_nlp.load(watson_nlp.download('file_path_entity-mentions_sire_multi_template-crf'))
# Download the feature extractor
default_feature_extractor = watson_nlp.load(watson_nlp.download('feature-extractor_rbr_entity-mentions_sire_en_stock'))

In [8]:
# Train the model
sire_custom = watson_nlp.blocks.entity_mentions.SIRE.train(train_iob_stream, 
                                                           'en', 
                                                           mentions_train_template,
                                                           feature_extractors=[default_feature_extractor])

Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClassifier initialized.
[MEVitClassifier2::initModel] model initialized.
Get Feature str 534
Done get feature str 534
done. [25g192m1012k,1g756m904k]
gramSize = 2
number of processes: 5
Initial processing:  (# of words: 980, # of sentences: 98)
senIndex[1] = 19, wordIndex = 200
senIndex[2] = 38, wordIndex = 393
senIndex[3] = 58, wordIndex = 593
senIndex[4] = 78, wordIndex = 793
senIndex[5] = 97, wordIndex = 980
[ME_CRF::scaleModel] Updater -- l1=0.1, l2=0.005, history size=5, progress windows size 20
 Iteration           Obj             WErr                         Timing       %Eff        Per thread timing
                 1338.85     15.00/100.00             E:0.00 s, M:0.00 s.       1.00 [m:0.00, M:0.00, av:0.00]
         0      379.75      0.00/  0.00             E:0.00 s, M:0.00 s.       1.00 [m:0.00,

The following code will save the custom model to Watson Studio by using the project library.

In [9]:
# Save the model
project.save_data('sire_custom', data=sire_custom.as_file_like_object(), overwrite=True)

Saved 166 features.


{'file_name': 'sire_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '8b197850-caca-4071-ad73-b60d639c9f3c'}

Let's run the model on one example input.

In [60]:
text = pd.read_json('dev.json')['text'][1]
text

'I work at California and Portland.'

In [64]:
# Run the model
syntax_result = syntax_model.run(text)
sire_result = sire_custom.run(syntax_result)
sire_result

{
  "mentions": [
    {
      "span": {
        "begin": 10,
        "end": 20,
        "text": "California"
      },
      "type": "Duration",
      "producer_id": null,
      "confidence": 0.9749692485802756,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 25,
        "end": 33,
        "text": "Portland"
      },
      "type": "Location",
      "producer_id": null,
      "confidence": 0.9673596800654085,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    }
  ],
  "producer_id": {
    "name": "SIRE Entity Mentions",
    "version": "0.0.1"
  }
}
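
The result is easier to scan as one row per mention. The sketch below works on a plain dictionary copied from the output above (trimmed to the fields it uses); `mention_rows` and its `min_confidence` parameter are made-up names, and how you obtain a dictionary from the result object is left as an assumption.

```python
def mention_rows(result, min_confidence=0.0):
    """Flatten a mentions result into (text, type, begin, end, confidence) rows."""
    rows = []
    for mention in result.get("mentions", []):
        if mention["confidence"] >= min_confidence:
            span = mention["span"]
            rows.append((span["text"], mention["type"], span["begin"], span["end"],
                         round(mention["confidence"], 3)))
    return rows


# Copied from the SIRE output above, keeping only the fields used here.
sire_output = {
    "mentions": [
        {"span": {"begin": 10, "end": 20, "text": "California"},
         "type": "Duration", "confidence": 0.9749692485802756},
        {"span": {"begin": 25, "end": 33, "text": "Portland"},
         "type": "Location", "confidence": 0.9673596800654085},
    ]
}

for row in mention_rows(sire_output):
    print(row)
```

Note that "California" is labeled `Duration` with high confidence, so a confidence threshold alone would not filter out this error.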

Now you are able to run the trained models on new data. You will run the models on the test data so that the results can also be used for model evaluation.

In [40]:
# Execute the model and generate the quality report
preprocess_func = lambda raw_doc: syntax_model.run(raw_doc)
quality_report = sire_custom.evaluate_quality('test.json', preprocess_func)

# Print the quality report
print(json.dumps(quality_report, indent=4))



{
    "per_class_confusion_matrix": {
        "GeographicFeature": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 10,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Number": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 2,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Person": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 1,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Time": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 2,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Location": {
            "true_positive": 1,
            "false_positive": 25,
            "false_negative": 12

<a id="bilstm"></a>
### BiLSTM Training

The deep-learning algorithm used in this block performs sequence labelling based on the BiLSTM architecture followed by a CRF layer. It uses GloVe embeddings as features.

In [None]:
#help(watson_nlp.blocks.entity_mentions.BiLSTM.train)

In [30]:
# Download the GloVe model to be used as embeddings in the BiLSTM
glove_model = watson_nlp.load(watson_nlp.download('embedding_glove_en_stock'))

In [34]:
# Train the model (named bilstm_custom so that the save, run, and evaluate cells below can use it)
bilstm_custom = watson_nlp.blocks.entity_mentions.BiLSTM.train(
    train_iob_stream,
    dev_iob_stream,
    glove_model.embedding,
    num_train_epochs=3)



The following code will save the custom model to Watson Studio by using the project library.

In [27]:
# Save the model
project.save_data('bilstm_custom', data=bilstm_custom.as_file_like_object(), overwrite=True)

{'file_name': 'bilstm_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': 'a1420859-9a8e-4905-8448-e1a33bd6673a'}

Let's run the model on one example input.

In [62]:
# Run the model
syntax_result = syntax_model.run(text)
bilstm_result = bilstm_custom.run(syntax_result)
bilstm_result

{
  "mentions": [
    {
      "span": {
        "begin": 10,
        "end": 20,
        "text": "California"
      },
      "type": "Duration",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 0.13714201748371124,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 25,
        "end": 33,
        "text": "Portland"
      },
      "type": "Location",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 0.3342318534851074,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    }
  ],
  "producer_id": {
    "name": "BiLSTM Entity Mentions",
    "version": "1.0.0"
  }
}

Now you are able to run the trained models on new data. You will run the models on the test data so that the results can also be used for model evaluation.

In [39]:
# Execute the model and generate the quality report
preprocess_func = lambda raw_doc: syntax_model.run(raw_doc)
quality_report = bilstm_custom.evaluate_quality('test.json', preprocess_func)

# Print the quality report
print(json.dumps(quality_report, indent=4))



{
    "per_class_confusion_matrix": {
        "GeographicFeature": {
            "true_positive": 0,
            "false_positive": 14,
            "false_negative": 10,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Location": {
            "true_positive": 3,
            "false_positive": 9,
            "false_negative": 10,
            "precision": 0.25,
            "recall": 0.23076923076923078,
            "f1": 0.24000000000000002
        },
        "Number": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 2,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Person": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 1,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Time": {
            "true_positive": 0,
            "false_positive": 0
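
The precision, recall, and F1 values in the report follow from the per-class counts. As a worked check, this sketch recomputes them for the `Location` class of the BiLSTM report above (3 true positives, 9 false positives, 10 false negatives). The function name is made up, and treating a zero denominator as 0.0 is an assumption chosen to match the 0.0 values the report shows for classes with no predictions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Location counts from the BiLSTM quality report above.
precision, recall, f1 = precision_recall_f1(tp=3, fp=9, fn=10)
print(precision, recall, f1)
```

The results agree with the report up to floating-point rounding: precision 3/12 = 0.25, recall 3/13 ≈ 0.2308, and F1 ≈ 0.24.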

<a id="bert"></a>
### BERT Training

The algorithm used is a Transformer-based sequence labeling algorithm using the BERT architecture.

In [57]:
# Download and load the pretrained model resource
pretrained_model_resource = watson_nlp.load(watson_nlp.download('pretrained-model_bert_multi_bert_multi_cased'))

# Labels you are interested in training the model for
labels = ['Duration', 'Location', 'GeographicFeature']

# Generate IOB labels for every entity type, e.g. B-Duration, I-Duration, B-Location, I-Location, ...
iob_labels = create_iob_labels(labels)

# Train the model
bert_custom = watson_nlp.blocks.entity_mentions.BERT.train(train_iob_stream,
                                                        dev_iob_stream,
                                                        iob_labels,
                                                        pretrained_model_resource,
                                                        do_lower_case=True,
                                                        num_train_epochs=10,
                                                        train_batch_size=1,
                                                        dev_batch_size=1,
                                                        keep_model_artifacts=False)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
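
For intuition about the label set the model is trained on, the sketch below builds B- and I- labels for each entity type and maps them to integer class ids. It is a stand-in written for illustration, not the implementation of `create_iob_labels`; whether the library helper also adds an `O` label, and in what order it returns labels, is not confirmed here.

```python
def make_iob_labels(entity_types):
    """Build a B-/I- label pair for every entity type (illustrative stand-in)."""
    labels = []
    for entity_type in entity_types:
        labels.extend([f"B-{entity_type}", f"I-{entity_type}"])
    return labels


labels = ["Duration", "Location", "GeographicFeature"]
sketch_iob_labels = make_iob_labels(labels)

# A token classifier predicts one class per token, so each label needs an integer id.
# Adding "O" for tokens outside any mention is an assumption of this sketch.
label_to_id = {label: i for i, label in enumerate(["O"] + sketch_iob_labels)}

print(sketch_iob_labels)
print(label_to_id)
```

Three entity types give six B-/I- labels, so together with `O` the classifier in this sketch chooses among seven classes for every token.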


The following code will save the custom model to Watson Studio by using the project library.

In [58]:
# Save the model
project.save_data('bert_custom', data=bert_custom.as_file_like_object(), overwrite=True)



{'file_name': 'bert_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '445cdcc2-9390-4f41-85c4-f5c11db2d86c'}

Let's run the model on one example input.

In [61]:
# Run the model
syntax_result = syntax_model.run(text)
bert_result = bert_custom.run(syntax_result)
bert_result

{
  "mentions": [
    {
      "span": {
        "begin": 10,
        "end": 20,
        "text": "California"
      },
      "type": "Duration",
      "producer_id": {
        "name": "BERT Entity Mentions",
        "version": "0.0.1"
      },
      "confidence": 0.9996285438537598,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 25,
        "end": 33,
        "text": "Portland"
      },
      "type": "Location",
      "producer_id": {
        "name": "BERT Entity Mentions",
        "version": "0.0.1"
      },
      "confidence": 0.9997851252555847,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    }
  ],
  "producer_id": {
    "name": "BERT Entity Mentions",
    "version": "0.0.1"
  }
}

Now you are able to run the trained models on new data. You will run the models on the test data so that the results can also be used for model evaluation.

In [63]:
# Execute the model and generate the quality report
preprocess_func = lambda raw_doc: syntax_model.run(raw_doc)
quality_report = bert_custom.evaluate_quality('test.json', preprocess_func)

# Print the quality report
print(json.dumps(quality_report, indent=4))



{
    "per_class_confusion_matrix": {
        "GeographicFeature": {
            "true_positive": 0,
            "false_positive": 1,
            "false_negative": 10,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Number": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 2,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Person": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 1,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Time": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 2,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Date": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 14,
   

<a id="summary"></a>
## Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library and how quickly and easily you can train and run different entity extraction models using Watson NLP.</span>

Please note that this content is made available to foster Embedded AI technology adoption. The content may include systems and methods pending patent with the USPTO and protected under US patent laws. IBM will use its release process for any redistribution of this content. For any questions, please log an issue in the [GitHub repository](https://github.com/ibm-build-labs/Watson-NLP).

Developed by IBM Build Lab 

Copyright - 2022 IBM Corporation 