# Entities Extraction using Watson NLP

## Use Case


This notebook shows how to extract custom entities by fine-tuning entity extraction models with Watson NLP. The goal of entity extraction is to automatically identify and classify specific entities in text, such as people, dates, and locations.

## What you'll learn in this notebook

Watson NLP implements state-of-the-art entity extraction algorithms from three different families:
- Classic machine learning using CRF (Conditional Random Field)
- Deep learning using BiLSTM (Bidirectional Long Short Term Memory)
- SIRE (Statistical Information and Relation Extraction) Watson NLP Model

In this notebook, you'll learn how to:

- **Prepare your data** so that it can be used as training data for the Watson NLP entity extraction algorithms.
- **Train a custom SIRE model** with the CRF template using `watson_nlp.workflows.entity_mentions.SIRE`.
- **Train a BiLSTM model** using `watson_nlp.blocks.entity_mentions.BiLSTM`.
- **Store and load models** as an asset of a Watson Studio project.

## Table of Contents

1. [Before You Start](#beforeYouStart)
1. [Preparing Sample Data Set](#prepareTraining)
1. [Fine-Tune Models](#FTbuildModel)
    1. [SIRE Training](#sire)
    1. [BiLSTM Training](#bilstm)
1. [Summary](#summary)

<a id="beforeYouStart"></a>
## 1. Before You Start

<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _Default Python 3.x_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they run in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted in a code cell at the top of the notebook. If you do not see that cell, add the token by clicking **More > Insert project token** in the notebook action bar. Running the inserted hidden code cell creates a `project` object that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

You can step through the notebook cell by cell by pressing Shift-Enter, or execute the entire notebook by selecting **Cell -> Run All** from the menu.

In [3]:
!pip install faker

Collecting faker
  Downloading Faker-17.5.0-py3-none-any.whl (1.7 MB)
Installing collected packages: faker
Successfully installed faker-17.5.0


In [15]:
import json
import pandas as pd
import watson_nlp
from faker import Faker
import random
import string
from watson_nlp import data_model as dm
from watson_nlp.toolkit.entity_mentions_utils import prepare_train_from_json, create_iob_labels

In [16]:
# Silence Tensorflow warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

In [17]:
# Load a syntax model to split the text into sentences and tokens
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))

<a id="prepareTraining"></a>
## 2. Preparing Sample Data Set

#### Preparing the training data to fine-tune the model with custom entities

The generated sentences are annotated with five custom entity types:

* `language`
* `nationality`
* `periodical_set`
* `festival`
* `color`


In [11]:
#Generate the dataset using faker
fake = Faker(locale='en_US')

def format_data():

    language = fake.language_name()
    nationality = fake.random_element(elements=('American', 'British', 'Canadian', 'Chinese', 'French', 'German', 'Italian', 'Japanese', 'Korean', 'Mexican', 'Russian', 'Spanish'))
    periodical_set = fake.random_element(['daily', 'biannually', 'hebdomadally', 'fortnightly', 'monthly', 'Weekly','quarterly', 'semiannually', 'yearly','every week', 'each afternoon', 'on Fridays', 'at night', 'on Wednesdays', 'on weekends'])
    festival = random.choice(["New Year","Super Bowl Sunday","Valentine day","Presidents day","St. Patrick","Easter","Memorial Day","Independence Day","Labor Day","Columbus Day","Halloween","Veterans Day","Thanksgiving","Christmas"])
    color = fake.color_name()
    
    text1= "Their %s friend recently started learning %s, they can prepare it %s, tomorrow they have holiday due to %s, we can go for drive in my %s car."%(nationality,language,periodical_set,festival,color)
    text2="my %s neighbour can speak %s, they can practice %s. We can meet them on %s with %s book."%(nationality,language,periodical_set,festival, color)

    text = random.choice([text1, text2])
    
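    # Locate each mention in the order it appears in both templates
    # (nationality, language, periodical_set, festival, color), starting each
    # search where the previous mention ended. A plain text.find(value) returns
    # the first occurrence, which points at the wrong span when two values
    # collide, for example when the language and the nationality are the same word.
    nationality_begin = text.find(nationality)
    nationality_end = nationality_begin + len(nationality)

    language_begin = text.find(language, nationality_end)
    language_end = language_begin + len(language)

    periodical_set_begin = text.find(periodical_set, language_end)
    periodical_set_end = periodical_set_begin + len(periodical_set)

    festival_begin = text.find(festival, periodical_set_end)
    festival_end = festival_begin + len(festival)

    color_begin = text.find(color, festival_end)
    color_end = color_begin + len(color)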

    data = {
                "text": text,
                "mentions": [
                    {
                        "location": {
                            "begin": color_begin,
                            "end": color_end
                        },
                        "text": color,
                        "type": "color"
                    },                    
                    {
                        "location": {
                            "begin": nationality_begin,
                            "end": nationality_end
                        },
                        "text": nationality,
                        "type": "nationality"
                    },
                    {
                        "location": {
                            "begin": language_begin,
                            "end": language_end
                        },
                        "text": language,
                        "type": "language"
                    },
                    {
                        "location": {
                            "begin": festival_begin,
                            "end": festival_end
                        },
                        "text": festival,
                        "type": "festival"
                    },
                    {
                        "location": {
                            "begin": periodical_set_begin,
                            "end": periodical_set_end
                        },
                        "text": periodical_set,
                        "type": "periodical_set"
                    }
                ]   
            }
    
    return data

In [12]:
#Sample dataset
format_data()

{'text': 'Their Korean friend recently started learning Marshallese, they can prepare it every week, tomorrow they have holiday due to Valentine day, we can go for drive in my DarkGoldenRod car.',
 'mentions': [{'location': {'begin': 166, 'end': 179},
   'text': 'DarkGoldenRod',
   'type': 'color'},
  {'location': {'begin': 6, 'end': 12},
   'text': 'Korean',
   'type': 'nationality'},
  {'location': {'begin': 46, 'end': 57},
   'text': 'Marshallese',
   'type': 'language'},
  {'location': {'begin': 125, 'end': 138},
   'text': 'Valentine day',
   'type': 'festival'},
  {'location': {'begin': 79, 'end': 89},
   'text': 'every week',
   'type': 'periodical_set'}]}
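
Each mention in the record above is stored as a pair of character offsets into `text`. As a minimal sanity check, assuming records in exactly this layout, the sketch below verifies that every `begin`/`end` pair reproduces the mention text. The helper name `check_mention_offsets` is hypothetical and not part of Watson NLP.

```python
def check_mention_offsets(record):
    """Return the mentions whose (begin, end) span does not reproduce their text."""
    text = record["text"]
    bad = []
    for mention in record["mentions"]:
        loc = mention["location"]
        if text[loc["begin"]:loc["end"]] != mention["text"]:
            bad.append(mention)
    return bad


# Hand-written record in the same layout as the sample output above
example = {
    "text": "my Chinese neighbour can speak Arabic.",
    "mentions": [
        {"location": {"begin": 3, "end": 10}, "text": "Chinese", "type": "nationality"},
        {"location": {"begin": 31, "end": 37}, "text": "Arabic", "type": "language"},
    ],
}

print(check_mention_offsets(example))  # [] means every offset is consistent
```

Running a check like this over the generated list before training would flag any record whose offsets do not match its mention text.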

In [18]:
# Prepare and store the training data set for the custom entities
train_list_faker = []
for i in range(0, 30000):
    train_list_faker.append(format_data())

with open('custom_entity_train_data.json', 'w') as f:
    json.dump(train_list_faker, f)
project.save_data('custom_entity_train_data.json', data=json.dumps(train_list_faker), overwrite=True)

{'file_name': 'custom_entity_train_data.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '8722380f-01db-41a1-9fec-8e4a1befe2d9'}

In [19]:
# Prepare and store the test data set for the custom entities
test_list_faker = []
for i in range(0, 1000):
    test_list_faker.append(format_data())

with open('custom_entity_test_data.json', 'w') as f:
    json.dump(test_list_faker, f)
project.save_data('custom_entity_test_data.json', data=json.dumps(test_list_faker), overwrite=True)

{'file_name': 'custom_entity_test_data.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': 'e1e7eb51-6182-411b-92a0-cdb9951ee6eb'}

The data is already in the format that the training utilities expect. The JSON files were saved both as Watson Studio project assets and in the runtime working directory, and the copies in the working directory are used as input for training the models.

Each text input is converted into a data stream in which the text is broken down into sentences and tokens by the syntax model.

In [20]:
train_data = dm.DataStream.from_json_array("custom_entity_train_data.json")
train_iob_stream = prepare_train_from_json(train_data, syntax_model)
dev_data = dm.DataStream.from_json_array("custom_entity_test_data.json")
dev_iob_stream = prepare_train_from_json(dev_data, syntax_model)
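
The variable names above suggest that the prepared streams carry IOB (inside, outside, beginning) labels. The notebook does not show what `prepare_train_from_json` returns, so the following is only a sketch of the general IOB scheme: it uses a simple regular-expression tokenizer instead of the syntax model, and the function name `iob_labels` is hypothetical.

```python
import re


def iob_labels(text, mentions):
    """Tokenize with a simple regex and assign IOB tags from character spans."""
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    labels = []
    for token, begin, end in tokens:
        tag = "O"  # default: token is outside every mention
        for mention in mentions:
            loc = mention["location"]
            if begin >= loc["begin"] and end <= loc["end"]:
                # First token of a mention gets B-, later tokens get I-
                prefix = "B" if begin == loc["begin"] else "I"
                tag = f"{prefix}-{mention['type']}"
                break
        labels.append((token, tag))
    return labels


text = "We can meet them on Columbus Day with Silver book."
mentions = [
    {"location": {"begin": 20, "end": 32}, "text": "Columbus Day", "type": "festival"},
    {"location": {"begin": 38, "end": 44}, "text": "Silver", "type": "color"},
]

for token, tag in iob_labels(text, mentions):
    print(f"{token:10s} {tag}")
```

Multi-token mentions such as "Columbus Day" become a `B-festival` tag followed by `I-festival`, which is what lets a sequence model learn where a mention starts and ends.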


<a id="FTbuildModel"></a>
## 3. Fine-Tune Models

Entity extraction uses the entity-mentions block, which encapsulates algorithms for extracting mentions of entities (persons, organizations, dates, locations, and so on) from the input text. The blocks and workflows offer implementations of strong entity extraction algorithms from each of four families: rule-based, classic ML, deep learning, and transformers.

<a id="sire"></a>
### 3.1 SIRE Training

You can train SIRE models using either the CRF or the Maximum Entropy template as the base model. Of the two, the CRF-based template takes longer to train but gives better results.

These algorithms accept a set of features in the form of dictionaries and regular expressions. A set of predefined feature extractors is provided for multiple languages, and you can also define your own features.
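
To make the idea of dictionary and regular-expression features concrete, here is a small standalone sketch. It does not use or reproduce the Watson NLP feature extractors. The dictionary contents, the patterns, and the function name `token_features` are all assumptions chosen for illustration.

```python
import re

# Hypothetical gazetteer and patterns, chosen only for illustration
DICTIONARIES = {
    "in_color_dict": {"silver", "red", "blue"},
}
PATTERNS = {
    "is_capitalized": re.compile(r"^[A-Z][a-z]+$"),
    "ends_with_ly": re.compile(r"ly$"),
}


def token_features(token):
    """Map one token to boolean dictionary-lookup and regex-match features."""
    feats = {name: token.lower() in words for name, words in DICTIONARIES.items()}
    feats.update({name: bool(p.search(token)) for name, p in PATTERNS.items()})
    return feats


print(token_features("Silver"))
# {'in_color_dict': True, 'is_capitalized': True, 'ends_with_ly': False}
print(token_features("fortnightly"))
# {'in_color_dict': False, 'is_capitalized': False, 'ends_with_ly': True}
```

A classic sequence model such as a CRF then learns a weight for each feature and label combination, so a dictionary hit on "Silver" can push the token toward a `color` label.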

`labeled_entity_mentions`: Path to a collection of labeled data (.json) or a loaded DataStream of JSONs, as prepared above in [Preparing Sample Data Set](#prepareTraining). `/home/wsuser/work/` is the home directory.

In [3]:
#help(watson_nlp.workflows.entity_mentions.SIRE)

In [21]:
# Download the algorithm template
mentions_train_template = watson_nlp.load(watson_nlp.download('file_path_entity-mentions_sire_multi_template-crf'))
# Download the feature extractor
default_feature_extractor = watson_nlp.load(watson_nlp.download('feature-extractor_rbr_entity-mentions_sire_en_stock'))

In [22]:
# Train the model
sire_custom = watson_nlp.workflows.entity_mentions.SIRE.train(syntax_model=syntax_model,
                                                              labeled_entity_mentions='/home/wsuser/work/', 
                                                              model_language='en', 
                                                              template_resource=mentions_train_template, 
                                                              feature_extractors=[default_feature_extractor], 
                                                              l1=0.1, 
                                                              l2=0.005, 
                                                              num_epochs=50, 
                                                              num_workers=5)

Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClassifier initialized.
[MEVitClassifier2::initModel] model initialized.
Get Feature str 27516
Done get feature str 27516
done. [53g1001m776k,12g916m864k]
gramSize = 2
number of processes: 5
Initial processing:  (# of words: 876792, # of sentences: 46439)
senIndex[1] = 9259, wordIndex = 175364
senIndex[2] = 18510, wordIndex = 350723
senIndex[3] = 27744, wordIndex = 526086
senIndex[4] = 37113, wordIndex = 701443
senIndex[5] = 46438, wordIndex = 876792
[ME_CRF::scaleModel] Updater -- l1=0.1, l2=0.005, history size=5, progress windows size 20
 Iteration           Obj             WErr                         Timing       %Eff        Per thread timing
              1894685.49      5.67/ 75.40             E:3.19 s, M:0.02 s.       1.00 [m:3.17, M:3.18, av:3.18]
         0   839323.40     22.13/100.00            

The following code will save the custom model to Watson Studio by using the project library.

In [23]:
# Save the model
project.save_data('entity_sire_custom', data=sire_custom.as_file_like_object(), overwrite=True)

 0.02/  0.23             E:3.13 s, M:0.02 s.       1.00 [m:3.10, M:3.13, av:3.13]
        48     1070.96      0.02/  0.23             E:3.14 s, M:0.02 s.       1.00 [m:3.12, M:3.14, av:3.14]
        49     1070.53      0.02/  0.23             E:3.14 s, M:0.02 s.       1.00 [m:3.12, M:3.14, av:3.14]
    Thread     Total      Wait Effective      %Eff         #Sents/sec
         0    160.13      0.00    160.13      1.00             2956.83
         1    160.04      0.00    160.04      1.00             2960.90
         2    160.19      0.00    160.19      1.00             2951.05
         3    160.11      0.00    160.11      1.00             2971.18
         4    160.18      0.00    160.18      1.00             2950.29
Parent: the end!
Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClassifier initialized.
[MEVitClassifier2::initModel] model initialized.
Saved 2353 features.


{'file_name': 'entity_sire_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': 'fe0ac856-5fb9-4a00-8e1c-87448d3c9ce1'}

Let's run the model on one example input from the dev dataset.

In [35]:
text = pd.read_json('custom_entity_test_data.json')['text'][3]
text

'my Chinese neighbour can speak Arabic, they can practice fortnightly. We can meet them on Columbus Day with Silver book.'

In [36]:
# Run the model
sire_result = sire_custom.run(text)
sire_result

{
  "mentions": [
    {
      "span": {
        "begin": 3,
        "end": 10,
        "text": "Chinese"
      },
      "type": "nationality",
      "producer_id": null,
      "confidence": 0.9972540536035641,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 31,
        "end": 37,
        "text": "Arabic"
      },
      "type": "language",
      "producer_id": null,
      "confidence": 0.9999835617867643,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 57,
        "end": 68,
        "text": "fortnightly"
      },
      "type": "periodical_set",
      "producer_id": null,
      "confidence": 0.9999952747452828,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 90,
        "end": 102,
        "text": "Columbus Day"
      },
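
The printed result is a list of mentions, each with a span, a type, and a confidence. As a sketch of how such output could be post-processed, the snippet below filters mentions by confidence. It assumes the result is available as a plain Python dictionary with the same keys as the output above. How the result object is converted to a dictionary is not shown in this notebook, and `confident_mentions` is a hypothetical helper.

```python
def confident_mentions(result, threshold=0.5):
    """Return (text, type) pairs for mentions at or above the confidence threshold."""
    return [
        (m["span"]["text"], m["type"])
        for m in result["mentions"]
        if m["confidence"] >= threshold
    ]


# Two mentions copied from the output above, reduced to the fields used here
result = {
    "mentions": [
        {"span": {"begin": 3, "end": 10, "text": "Chinese"},
         "type": "nationality", "confidence": 0.9972540536035641},
        {"span": {"begin": 31, "end": 37, "text": "Arabic"},
         "type": "language", "confidence": 0.9999835617867643},
    ]
}

print(confident_mentions(result))                   # both mentions
print(confident_mentions(result, threshold=0.999))  # [('Arabic', 'language')]
```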
 

<a id="bilstm"></a>
### 3.2 BiLSTM Training

The deep-learning algorithm used in this block performs sequence labelling based on the BiLSTM architecture followed by a CRF layer. It uses GloVe embeddings as features.
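
A CRF output layer typically picks the best label sequence as a whole rather than the best label per token, usually with Viterbi decoding. The sketch below shows that decoding step in plain Python. It is not the Watson NLP implementation, and the emission and transition scores are made-up numbers for illustration.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for arriving at tag j
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + step[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag
    path = [max(range(n_tags), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-color", "I-color"]
# Made-up scores: the transition O -> I-color is heavily penalized
transitions = [
    [0, 0, -10],  # from O
    [0, 0, 1],    # from B-color
    [0, 0, 1],    # from I-color
]
emissions = [
    [2, 0, 0],    # token 1 looks like O
    [0, 1, 1.5],  # token 2: I-color scores highest on its own
    [0, 0, 2],    # token 3 looks like I-color
]

print([tags[i] for i in viterbi(emissions, transitions)])
# ['O', 'B-color', 'I-color']
```

Picking the best tag per token would yield `O, I-color, I-color`, which is not a valid IOB sequence. The transition scores steer the decoder to `B-color` for the second token instead.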

In [26]:
#help(watson_nlp.blocks.entity_mentions.BiLSTM)

In [27]:
# Download the GloVe model to be used as embeddings in the BiLSTM
glove_model = watson_nlp.load(watson_nlp.download('embedding_glove_en_stock'))

In [28]:
# Train the BiLSTM model on the custom entities
bilstm_custom = watson_nlp.blocks.entity_mentions.BiLSTM.train(train_iob_stream,
                                                              dev_iob_stream,
                                                              glove_model.embedding,
                                                              num_train_epochs=5)



To run the trained block model on raw text later, wrap it together with the syntax model in a workflow, as in the following code snippet.

In [29]:
# Wrap the trained block model and the syntax model in a workflow model
from watson_nlp.workflows.entity_mentions.bilstm import BiLSTM 

mentions_workflow = BiLSTM(syntax_model, bilstm_custom)


The following code will save the custom model to Watson Studio by using the project library.

In [30]:
# Save the model
project.save_data('Entity_workflow_bilstm_custom', data=mentions_workflow.as_file_like_object(), overwrite=True)

{'file_name': 'Entity_workflow_bilstm_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': 'fc7ff781-8e34-4ecc-b30b-bf1afcd9a5ae'}

Let's run the model on one example input.

In [37]:
# Run the BILSTM workflow model
#syntax_result = syntax_model.run(text)
bilstm_result = mentions_workflow.run(text)

bilstm_result

{
  "mentions": [
    {
      "span": {
        "begin": 3,
        "end": 10,
        "text": "Chinese"
      },
      "type": "nationality",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 0.9999990463256836,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 31,
        "end": 37,
        "text": "Arabic"
      },
      "type": "language",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 0.999930739402771,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 57,
        "end": 68,
        "text": "fortnightly"
      },
      "type": "periodical_set",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 1.

Now you are able to run the trained models on new data. You will run the models on the test data so that the results can also be used for model evaluation.

Watson NLP includes methods for quality testing supported models. Given a model and test data, a quality report can be generated. The following example includes the steps required to generate a quality report for a BiLSTM entity mention extractor model. The same example can be applied to any entity mention extractor model.
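
Independently of the Watson NLP quality report, entity extractors are commonly scored with precision, recall, and F1 over exactly matching spans. The sketch below computes these metrics in plain Python. It is not the Watson NLP report, the function name `span_prf` is hypothetical, and the predicted spans are invented for illustration.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (begin, end, type) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold spans follow the example sentence used earlier in this notebook;
# the last predicted span is deliberately cut short to show a miss.
gold = [(3, 10, "nationality"), (31, 37, "language"),
        (57, 68, "periodical_set"), (90, 102, "festival")]
predicted = [(3, 10, "nationality"), (31, 37, "language"),
             (57, 68, "periodical_set"), (90, 98, "festival")]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75 f1=0.75
```

Exact matching is strict: the truncated `festival` span counts as both a false positive and a false negative.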

<a id="summary"></a>
## 4. Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library and how quickly and easily you can train and run different entity extraction models using Watson NLP. </span>

Please note that this content is made available to foster Embedded AI technology adoption. The content may include systems & methods pending patent with USPTO and protected under US Patent Laws. For redistribution of this content, IBM will use release process. For any questions please log an issue in the [GitHub](https://github.com/ibm-build-labs/Watson-NLP). 

Developed by IBM Build Lab 

Copyright - 2023 IBM Corporation 