# Train Custom Models for Entity Extraction Using Watson NLP

## Use Case


This notebook demonstrates how to train entity extraction models using Watson NLP. The goal of entity extraction is to automatically identify and classify specific entities or concepts within a text, such as people, organizations, locations, dates, times, and more.


## What you'll learn in this notebook

Watson NLP implements state-of-the-art entity extraction algorithms from three different families:
- Classic machine learning using CRF (Conditional Random Field)
- Deep learning using BiLSTM (Bidirectional Long Short Term Memory)
- A transformer-based algorithm using the Google BERT multilingual model 

In this notebook, you'll learn how to:

- **Prepare your data** so that it can be used as training data for the Watson NLP entity extraction algorithms.
- **Train a custom CRF model** using `watson_nlp.workflows.entity_mentions.SIRE`.
- **Train a BiLSTM** using `watson_nlp.blocks.entity_mentions.BiLSTM`.
- **Train a BERT** using `watson_nlp.workflows.entity_mentions.BERT`.
- **Store and load models** as an asset of a Watson Studio project.

## Table of Contents

1. [Before You Start](#beforeYouStart)
1. [Preparing Training Data](#prepareTraining)
1. [Model Building](#buildModel)
    1. [SIRE Training](#sire)
    1. [BiLSTM Training](#bilstm)
    1. [BERT Training](#bert)
1. [Preparing Training Data](#Pre-Data)
1. [Summary](#summary)

<a id="beforeYouStart"></a>
## 1. Before You Start

<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _Default Python 3.x_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they run in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see the cell above, add the token to the notebook by clicking **More > Insert project token** from the notebook action bar.  By running the inserted hidden code cell, a project object is created that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

Note that you can step through the notebook cell by cell by pressing Shift+Enter, or run the entire notebook by selecting **Cell > Run All** from the menu.

In [3]:
!pip install faker

Collecting faker
  Downloading Faker-17.0.0-py3-none-any.whl (1.7 MB)
Installing collected packages: faker
Successfully installed faker-17.0.0


In [25]:
import json
import pandas as pd
import watson_nlp
from faker import Faker
import random 
from watson_nlp import data_model as dm
from watson_nlp.toolkit.entity_mentions_utils import prepare_train_from_json, create_iob_labels

In [3]:
# Silence Tensorflow warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

In [37]:
# Load a syntax model to split the text into sentences and tokens
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))

<a id="prepareTraining"></a>
## 2. Preparing Training Data

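The training data must be a JSON array of records, each structured as follows. The `begin` and `end` values are character offsets into `text`, with `end` exclusive: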
```
[
  {
    "id": 1,
    "text": "This waterfall is actually hours away from Portland, basically in California.",
    "mentions": 
    [
      {
        "text": "waterfall", "type": "GeographicFeature", 
        "location": 
          {
            "begin": 5, 
            "end": 14
          }
      },
      {
        "text": "Portland", 
        "type": "Location", 
        "location": 
          {
            "begin": 43, 
            "end": 51
          }
      },
      {
        "text": "California", 
        "type": "Location", 
        "location": 
          {
            "begin": 66, 
            "end": 76
          }
       }
    ]
  },
  ...
]
```
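Because the offsets must match the text exactly, it can help to check a dataset before training. The following is a minimal sketch, not part of Watson NLP: the helper name `find_bad_mentions` is hypothetical, and it assumes records shaped like the example above.

```python
def find_bad_mentions(records):
    """Return (record id, mention text) pairs whose offsets do not match the text."""
    bad = []
    for record in records:
        text = record["text"]
        for mention in record["mentions"]:
            begin = mention["location"]["begin"]
            end = mention["location"]["end"]
            # The end offset is exclusive, so the slice should equal the mention text
            if text[begin:end] != mention["text"]:
                bad.append((record.get("id"), mention["text"]))
    return bad


sample = [
    {
        "id": 1,
        "text": "This waterfall is actually hours away from Portland, basically in California.",
        "mentions": [
            {"text": "waterfall", "type": "GeographicFeature", "location": {"begin": 5, "end": 14}},
            {"text": "Portland", "type": "Location", "location": {"begin": 43, "end": 51}},
            {"text": "California", "type": "Location", "location": {"begin": 66, "end": 76}},
        ],
    }
]

print(find_bad_mentions(sample))  # prints []
```

An empty list means every mention's `begin` and `end` slice out exactly the mention text.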

The data is already in this format, so the next cell only reads the JSON files from the Watson Studio project assets and saves them to the runtime working directory, where they serve as input for training the models.

In [5]:
buffer = project.get_file("entity_train.json")
pd.read_json(buffer).to_json('train.json', orient='records')
buffer = project.get_file("entity_dev.json")
pd.read_json(buffer).to_json('dev.json', orient='records')
buffer = project.get_file("entity_test.json")
pd.read_json(buffer).to_json('test.json', orient='records')

Each JSON file is then loaded into a `DataStream`, and `prepare_train_from_json` uses the syntax model to split the text into tokens and produce the IOB-labeled streams used for training.

In [6]:
train_data = dm.DataStream.from_json_array("train.json")
train_iob_stream = prepare_train_from_json(train_data, syntax_model)
dev_data = dm.DataStream.from_json_array("dev.json")
dev_iob_stream = prepare_train_from_json(dev_data, syntax_model)
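To illustrate what IOB (inside, outside, beginning) labeling means, here is a simplified sketch in plain Python. It is not how `prepare_train_from_json` is implemented: the function `to_iob` is hypothetical, the regex tokenizer stands in for the syntax model, and the example sentence is made up for illustration.

```python
import re


def to_iob(text, mentions):
    """Tag each token with B-<type>, I-<type>, or O based on character-offset mentions."""
    # Simplified tokenizer (assumption): runs of word characters, or single punctuation marks
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
    tagged = []
    for token, begin, end in tokens:
        tag = "O"  # outside any mention by default
        for mention in mentions:
            loc = mention["location"]
            if begin >= loc["begin"] and end <= loc["end"]:
                # The first token of a mention gets B-, later tokens get I-
                prefix = "B" if begin == loc["begin"] else "I"
                tag = f"{prefix}-{mention['type']}"
                break
        tagged.append((token, tag))
    return tagged


text = "She moved to New York in May."
mentions = [{"text": "New York", "type": "Location", "location": {"begin": 13, "end": 21}}]

for token, tag in to_iob(text, mentions):
    print(token, tag)
```

Here `New` is tagged `B-Location`, `York` is tagged `I-Location`, and every other token is tagged `O`.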

<a id="buildModel"></a>
## 3. Model Building

Entity extraction uses the entity-mentions block, which encapsulates algorithms for extracting mentions of entities (persons, organizations, dates, locations, and so on) from the input text. The blocks and workflows offer entity extraction algorithms from each of four families: rule-based, classic ML, deep learning, and transformers. This notebook covers the last three.

<a id="sire"></a>
### 3.1 SIRE Training

You can train SIRE models using either a CRF or a Maximum Entropy template as the base model. Of the two, the CRF-based template takes longer to train but gives better results.

These algorithms accept a set of features in the form of dictionaries and regular expressions. A set of predefined feature extractors is provided for multiple languages, and you can also define your own features.

In [3]:
#help(watson_nlp.workflows.entity_mentions.SIRE)

In [8]:
# Download the algorithm template
mentions_train_template = watson_nlp.load(watson_nlp.download('file_path_entity-mentions_sire_multi_template-crf'))
# Download the feature extractor
default_feature_extractor = watson_nlp.load(watson_nlp.download('feature-extractor_rbr_entity-mentions_sire_en_stock'))

In [9]:
# Train the model
sire_custom = watson_nlp.workflows.entity_mentions.SIRE.train(syntax_model=syntax_model,
                                                              labeled_entity_mentions='/home/wsuser/work/', 
                                                              #labeled_entity_mentions=train_data,
                                                              model_language='en', 
                                                              template_resource=mentions_train_template, 
                                                              feature_extractors=[default_feature_extractor], 
                                                              l1=0.1, 
                                                              l2=0.005, 
                                                              num_epochs=50, 
                                                              num_workers=5)

{'log_code': '<NLP89404519W>', 'message': "Dropping mention: Mention '3:25 p.m' (102, 110) overlaps with token 'p.m.' (107, 111) and has been discarded. Ensure the entity span begins at the beginning of a token and ends at the end of a token.", 'args': None}
{'log_code': '<NLP89404519W>', 'message': "Dropping mention: Mention 'St' (63, 65) overlaps with token 'St.' (63, 66) and has been discarded. Ensure the entity span begins at the beginning of a token and ends at the end of a token.", 'args': None}
{'log_code': '<NLP35814863W>', 'message': 'Dropped 2 mentions in total from this text due to invalid mention spans', 'args': None}
{'log_code': '<NLP35814863W>', 'message': 'Dropped 2 mentions in total from this text due to invalid mention spans', 'args': None}
Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClas
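
The "Dropping mention" warnings in the training log mean that a labeled span did not line up with token boundaries, so it was discarded. The sketch below shows the kind of check involved, using the spans reported in the second warning. The helper `is_token_aligned` is hypothetical and the token spans are hard-coded; in the notebook they would presumably come from the syntax model's output.

```python
def is_token_aligned(mention_span, token_spans):
    """Return True if the mention starts at a token start and ends at a token end."""
    begin, end = mention_span
    token_begins = {token_begin for token_begin, _ in token_spans}
    token_ends = {token_end for _, token_end in token_spans}
    return begin in token_begins and end in token_ends


# Token span taken from the warning: token 'St.' covers (63, 66)
token_spans = [(63, 66)]

print(is_token_aligned((63, 65), token_spans))  # mention 'St' stops mid-token -> False
print(is_token_aligned((63, 66), token_spans))  # mention 'St.' matches the token -> True
```

Extending the labeled span to include the trailing period (or otherwise matching the tokenizer's boundaries) would keep such mentions in the training data.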

The following code will save the custom model to Watson Studio by using the project library.

In [10]:
# Save the model
project.save_data('sire_custom', data=sire_custom.as_file_like_object(), overwrite=True)

Saved 4241 features.


{'file_name': 'sire_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '8b197850-caca-4071-ad73-b60d639c9f3c'}

Let's run the model on one example input from the dev dataset.

In [11]:
text = pd.read_json('dev.json')['text'][1]
text

'I work at California and Portland.'

In [12]:
# Run the model
sire_result = sire_custom.run(text)
sire_result

{
  "mentions": [
    {
      "span": {
        "begin": 10,
        "end": 20,
        "text": "California"
      },
      "type": "Duration",
      "producer_id": null,
      "confidence": 0.9894799488701607,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 25,
        "end": 33,
        "text": "Portland"
      },
      "type": "Location",
      "producer_id": null,
      "confidence": 0.9990983833955226,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    }
  ],
  "producer_id": {
    "name": "Entity-Mentions SIRE Workflow",
    "version": "0.0.1"
  }
}
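
For a quicker read of results like the one above, the mentions can be reduced to compact tuples. This is a sketch that works on a plain dictionary with the same shape as the printed output; the helper `summarize_mentions` is hypothetical, and obtaining such a dictionary in the notebook via `sire_result.to_dict()` is an assumption.

```python
def summarize_mentions(result):
    """Reduce a mentions result dictionary to (text, type, confidence) tuples."""
    return [
        (mention["span"]["text"], mention["type"], round(mention["confidence"], 3))
        for mention in result["mentions"]
    ]


# Values copied from the SIRE output above (fields not used here are omitted)
sire_output = {
    "mentions": [
        {"span": {"begin": 10, "end": 20, "text": "California"},
         "type": "Duration", "confidence": 0.9894799488701607},
        {"span": {"begin": 25, "end": 33, "text": "Portland"},
         "type": "Location", "confidence": 0.9990983833955226},
    ]
}

for row in summarize_mentions(sire_output):
    print(row)
```

Viewed this way, it is easy to spot that "California" was typed as `Duration` in this run, which is presumably a misprediction by the custom model.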

<a id="bilstm"></a>
### 3.2 BiLSTM Training

The deep-learning algorithm used in this block performs sequence labelling based on the BiLSTM architecture followed by a CRF layer. It uses GloVe embeddings as features.

In [4]:
#help(watson_nlp.blocks.entity_mentions.BiLSTM)

In [41]:
# Download the GloVe model to be used as embeddings in the BiLSTM
glove_model = watson_nlp.load(watson_nlp.download('embedding_glove_en_stock'))

In [15]:
# Train the model
bilstm_custom = watson_nlp.blocks.entity_mentions.BiLSTM.train(train_iob_stream,
                                                              dev_iob_stream,
                                                              glove_model.embedding,
                                                              num_train_epochs=3)



The following code will save the custom model to Watson Studio by using the project library.

In [16]:
# Save the model
project.save_data('bilstm_custom', data=bilstm_custom.as_file_like_object(), overwrite=True)

{'file_name': 'bilstm_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': 'a1420859-9a8e-4905-8448-e1a33bd6673a'}

Let's run the model on one example input.

In [17]:
# Run the model
syntax_result = syntax_model.run(text)
bilstm_result = bilstm_custom.run(syntax_result)
bilstm_result

{
  "mentions": [
    {
      "span": {
        "begin": 25,
        "end": 33,
        "text": "Portland"
      },
      "type": "Location",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 0.7481237649917603,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    }
  ],
  "producer_id": {
    "name": "BiLSTM Entity Mentions",
    "version": "1.0.0"
  }
}

Now you are able to run the trained models on new data. You will run the models on the test data so that the results can also be used for model evaluation.

Watson NLP includes methods for testing the quality of supported models. Given a model and test data, a quality report can be generated. The following example includes the steps required to generate a quality report for a BiLSTM entity mention extractor model. The same example can be applied to any entity mention extractor model.
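
The quality report lists, for each entity type, true-positive, false-positive, and false-negative counts together with precision, recall, and F1. The sketch below shows how those three scores follow from the counts using the standard definitions. The helper name `precision_recall_f1` is illustrative and not part of Watson NLP, and returning 0.0 for a zero denominator is an assumption chosen to match the 0.0 values in the report.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Compute precision, recall, and F1 from confusion counts (0.0 when undefined)."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts for the "Location" type as they appear in the report below
precision, recall, f1 = precision_recall_f1(1, 5, 12)
print(precision, recall, f1)
```

With the `Location` counts (1 true positive, 5 false positives, 12 false negatives), this gives a precision of 1/6, a recall of 1/13, and an F1 of 2/19, which agree with the reported values.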

In [18]:
# Execute the model and generate the quality report
preprocess_func = lambda raw_doc: syntax_model.run(raw_doc)
quality_report = bilstm_custom.evaluate_quality('test.json', 
                                               preprocess_func)

# Print the quality report
print(json.dumps(quality_report, indent=4))



{
    "per_class_confusion_matrix": {
        "Location": {
            "true_positive": 1,
            "false_positive": 5,
            "false_negative": 12,
            "precision": 0.16666666666666666,
            "recall": 0.07692307692307693,
            "f1": 0.10526315789473684
        },
        "GeographicFeature": {
            "true_positive": 0,
            "false_positive": 2,
            "false_negative": 10,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Time": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 2,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Number": {
            "true_positive": 0,
            "false_positive": 0,
            "false_negative": 2,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0
        },
        "Person": {
            "true_positive": 0,
            "fals

<a id="bert"></a>
### 3.3 BERT Training

The algorithm used is a Transformer-based sequence labeling algorithm using the BERT architecture.

In [5]:
#help(watson_nlp.workflows.entity_mentions.BERT)

In [20]:
# Download and load the pretrained model resource
pretrained_model_resource = watson_nlp.load(watson_nlp.download('pretrained-model_bert_multi_bert_multi_cased'))

# Labels you are interested in training the model for
labels = ['Duration', 'Location', 'GeographicFeature']

# Generate IOB labels: B-Duration, I-Duration, B-Location, I-Location
iob_labels = create_iob_labels(labels)

# Train the model
bert_custom = watson_nlp.workflows.entity_mentions.BERT.train(syntax_model_train_data_map={syntax_model:'train.json'}, 
                                                              syntax_model_dev_data_map={syntax_model:'dev.json'},
                                                              label_list=labels,
                                                              pretrained_model_resource=pretrained_model_resource,
                                                              learning_rate=0.0005, 
                                                              num_train_epochs=10, 
                                                              do_lower_case=False, 
                                                              train_max_seq_length=128, 
                                                              train_stride=64, 
                                                              train_batch_size=32, 
                                                              dev_batch_size=32, 
                                                              predict_batch_size=512, 
                                                              predict_max_seq_length=48, 
                                                              predict_stride=40, 
                                                              keep_model_artifacts=False)

{'log_code': '<NLP35814863W>', 'message': 'Dropped 2 mentions in total from this text due to invalid mention spans', 'args': None}
{'log_code': '<NLP96245348W>', 'message': 'Dropped 2 mentions in total from this text due to invalid mention spans', 'args': None}
{'log_code': '<NLP35814863W>', 'message': 'Dropped 2 mentions in total from this text due to invalid mention spans', 'args': None}
{'log_code': '<NLP96245348W>', 'message': 'Dropped 2 mentions in total from this text due to invalid mention spans', 'args': None}


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
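
The parameter names `train_max_seq_length`, `train_stride`, `predict_max_seq_length`, and `predict_stride` suggest that long inputs are processed in overlapping windows. The sketch below illustrates that general sliding-window idea only. It is an assumption about what the parameters mean, not a description of the Watson NLP implementation, and the function `sliding_windows` is hypothetical. Here the stride is treated as the step between window starts.

```python
def sliding_windows(tokens, max_seq_length, stride):
    """Split tokens into windows of at most max_seq_length, advancing by stride each time."""
    if max_seq_length <= 0 or stride <= 0:
        raise ValueError("max_seq_length and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_seq_length])
        # Stop once the current window reaches the end of the sequence
        if start + max_seq_length >= len(tokens):
            break
        start += stride
    return windows


# 100 placeholder tokens, windowed with the prediction settings used above (48 and 40)
tokens = list(range(100))
windows = sliding_windows(tokens, max_seq_length=48, stride=40)
print([len(window) for window in windows])  # prints [48, 48, 20]
```

Under this reading, consecutive windows overlap by `max_seq_length - stride` tokens, so a mention near a window edge is still seen with some surrounding context in the neighboring window.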


The following code will save the custom model to Watson Studio by using the project library.

In [21]:
# Save the model
project.save_data('bert_custom', data=bert_custom.as_file_like_object(), overwrite=True)



{'file_name': 'bert_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '445cdcc2-9390-4f41-85c4-f5c11db2d86c'}

Let's run the model on one example input.

In [22]:
# Run the model
bert_result = bert_custom.run(text, 'en')
bert_result

{
  "mentions": [
    {
      "span": {
        "begin": 10,
        "end": 20,
        "text": "California"
      },
      "type": "Duration",
      "producer_id": {
        "name": "BERT Entity Mentions",
        "version": "0.0.1"
      },
      "confidence": 0.9898326992988586,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 25,
        "end": 33,
        "text": "Portland"
      },
      "type": "Location",
      "producer_id": {
        "name": "BERT Entity Mentions",
        "version": "0.0.1"
      },
      "confidence": 0.9967482089996338,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    }
  ],
  "producer_id": {
    "name": "BERT Entity Mentions Workflow",
    "version": "0.0.1"
  }
}

<a id="Pre-Data"></a>
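## 4. Preparing Training Data

This section generates a synthetic dataset of education-related sentences with Faker, labels the `degree_level` and `field_of_study` mentions in each sentence, and trains a BiLSTM model on the result.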

In [31]:
# Generate the dataset using Faker
fake = Faker(locale='en_US')

def format_data():
    """Build one synthetic training record with character offsets for both mentions."""
    # Generate a random degree level
    degree_level = fake.random_element(elements=("Bachelor's", "Master's", 'Doctorate'))

    # Generate a random field of study
    field_of_study = fake.random_element(
        elements=('Computer Science', 'Engineering', 'Business', 'Psychology', 'Medical'))

    # Pick one of two sentence templates at random
    text_1 = "I studied %s in %s" % (degree_level, field_of_study)
    text_2 = " Hello, I done my %s in %s" % (degree_level, field_of_study)
    text = random.choice([text_1, text_2])

    # Locate each mention in the text; the end offset is exclusive
    field_of_study_begin = text.find(field_of_study)
    field_of_study_end = field_of_study_begin + len(field_of_study)

    degree_level_begin = text.find(degree_level)
    degree_level_end = degree_level_begin + len(degree_level)

    data = {
        "text": text,
        "mentions": [
            {
                "location": {
                    "begin": field_of_study_begin,
                    "end": field_of_study_end
                },
                "text": field_of_study,
                "type": "field_of_study"
            },
            {
                "location": {
                    "begin": degree_level_begin,
                    "end": degree_level_end
                },
                "text": degree_level,
                "type": "degree_level"
            }
        ]
    }

    return data

In [32]:
#Sample dataset
format_data()

{'text': 'I studied Doctorate in Engineering',
 'mentions': [{'location': {'begin': 23, 'end': 34},
   'text': 'Engineering',
   'type': 'field_of_study'},
  {'location': {'begin': 10, 'end': 19},
   'text': 'Doctorate',
   'type': 'degree_level'}]}

In [61]:
# Prepare and store the training dataset of educational text
train_list_faker = []
for i in range(0, 10000):
    train_list_faker.append(format_data())

with open('faker_Educational_text_train.json', 'w') as f:
    json.dump(train_list_faker, f)
project.save_data('faker_Educational_text_train.json', data=json.dumps(train_list_faker), overwrite=True)

{'file_name': 'faker_Educational_text_train.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '263d99a1-954d-485f-ad9c-d81c95b5f526'}

In [62]:
train_data = dm.DataStream.from_json_array("faker_Educational_text_train.json")
train_iob_stream = prepare_train_from_json(train_data, syntax_model)
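# Note: the dev set reuses the training file here, so dev metrics will not reflect held-out performance
dev_data = dm.DataStream.from_json_array("faker_Educational_text_train.json")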
dev_iob_stream = prepare_train_from_json(dev_data, syntax_model)
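
The cell above reads the same file for both the training and the dev stream. If a separate dev set is wanted, the generated records could be split before saving. The following is a minimal sketch with a hypothetical helper `split_records`; the 10% default, the fixed seed, and the placeholder records are arbitrary choices for illustration.

```python
import random


def split_records(records, dev_fraction=0.1, seed=42):
    """Shuffle a copy of the records and split it into (train, dev) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder records standing in for the output of format_data()
records = [{"text": f"example {i}", "mentions": []} for i in range(10)]
train_records, dev_records = split_records(records, dev_fraction=0.2)
print(len(train_records), len(dev_records))  # prints 8 2
```

Each list could then be written to its own JSON file and loaded with `dm.DataStream.from_json_array`, as in the cell above. Note that `format_data` draws from a small set of templates and values, so even held-out records will closely resemble the training data.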

In [63]:
# Train a BiLSTM model for the educational-details entities
bilstm_custom = watson_nlp.blocks.entity_mentions.BiLSTM.train(train_iob_stream,
                                                              dev_iob_stream,
                                                              glove_model.embedding,
                                                              num_train_epochs=5)



In [69]:
text = "Hello, I done my master of Engineering"

In [71]:
# Run the BiLSTM model
syntax_result = syntax_model.run(text)
bilstm_result = bilstm_custom.run(syntax_result)

bilstm_result

{
  "mentions": [
    {
      "span": {
        "begin": 17,
        "end": 23,
        "text": "master"
      },
      "type": "degree_level",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 0.9999746084213257,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 27,
        "end": 38,
        "text": "Engineering"
      },
      "type": "field_of_study",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 0.9999901056289673,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    }
  ],
  "producer_id": {
    "name": "BiLSTM Entity Mentions",
    "version": "1.0.0"
  }
}

<a id="summary"></a>
## 5. Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library and how quickly and easily you can train and run different entity extraction models using Watson NLP.</span>

Please note that this content is made available to foster Embedded AI technology adoption. The content may include systems & methods pending patent with USPTO and protected under US Patent Laws. For redistribution of this content, IBM will use its release process. For any questions, please log an issue in the [GitHub](https://github.com/ibm-build-labs/Watson-NLP) repository.

Developed by IBM Build Lab 

Copyright - 2022 IBM Corporation 