# Train Custom Models for PII Extraction Using Watson NLP

## Use Case


This notebook demonstrates how to train PII extraction models using Watson NLP. The goal of PII extraction is to automatically identify and classify specific PII entities, such as educational details, employee ID, and salary.


## What you'll learn in this notebook

Watson NLP implements state-of-the-art classification algorithms from several families. This notebook covers two of them:
- Classic machine learning using CRF (Conditional Random Field)
- Deep learning using BiLSTM (Bidirectional Long Short-Term Memory)

In this notebook, you'll learn how to:

- **Prepare your data** so that it can be used as training data for the Watson NLP classification algorithms.
- **Train a custom CRF model** using `watson_nlp.workflows.entity_mentions.SIRE`.
- **Train a BiLSTM model** using `watson_nlp.blocks.entity_mentions.BiLSTM`.
- **Store and load models** as an asset of a Watson Studio project.

## Table of Contents

1. [Before You Start](#beforeYouStart)
1. [Preparing Training Data](#prepareTraining)
1. [Model Building](#buildModel)
    1. [SIRE Training](#sire)
    1. [BiLSTM Training](#bilstm)
1. [Summary](#summary)

<a id="beforeYouStart"></a>
## 1. Before You Start

<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _Default Python 3.x_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they run in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see the cell above, add the token to the notebook by clicking **More > Insert project token** from the notebook action bar.  By running the inserted hidden code cell, a project object is created that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

You can step through the notebook cell by cell by pressing Shift-Enter, or you can execute the entire notebook by selecting **Cell -> Run All** from the menu.

In [11]:
!pip install faker



In [18]:
import json
import pandas as pd
import watson_nlp
from faker import Faker
import random
import string
from watson_nlp import data_model as dm
from watson_nlp.toolkit.entity_mentions_utils import prepare_train_from_json, create_iob_labels

In [13]:
# Silence Tensorflow warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

In [14]:
# Load a syntax model to split the text into sentences and tokens
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))

<a id="prepareTraining"></a>
## 2. Preparing Training Data

In [27]:
#Generate the dataset using faker
fake = Faker(locale='en_US')

def format_data():
    # Generate a random degree level
    degree_level = fake.random_element(elements=('Bachelor\'s', 'Master\'s', 'Doctorate'))

    # Generate a random field of study
    field_of_study = fake.random_element(elements=('Computer Science', 'Engineering', 'Business', 'Psychology','Medical'))


    # Generate a random prefix of 1-2 uppercase letters
    prefix = ''.join(random.choices(string.ascii_uppercase, k=random.randint(1, 2)))
    # Generate a random employee ID with the prefix and a random integer
    employee_id = f"{prefix}{fake.random_int(min=10000, max=99999):05d}"

    # Generate salary using faker
    salary = str(fake.pyfloat(left_digits=5, right_digits=2, positive=True, min_value=1000, max_value=5000))
    
    
    
    text_1 = "I studied %s in %s, My employee id is %s and salary is %s" %(degree_level,field_of_study,employee_id,salary)
    text_2 = " Hello, My employee id is %s and I done my %s in %s, I am earning %s per month" %(employee_id,degree_level, field_of_study,salary)
    text_3 = "My monthly Earning is %s and employee code is %s, I studied %s in %s" %(salary,employee_id,degree_level,field_of_study)
    text = random.choice([text_1, text_2,text_3])
    
    
    field_of_study_begin = text.find(field_of_study)
    field_of_study_end = field_of_study_begin + len(field_of_study)

    degree_level_begin = text.find(degree_level)
    degree_level_end = degree_level_begin + len(degree_level)
  
    employee_id_begin = text.find(employee_id)
    employee_id_end = employee_id_begin + len(employee_id)

    salary_begin = text.find(salary)
    salary_end = salary_begin + len(salary)
    
    
    data = {
                "text": text,
                "mentions": [
                    {
                        "location": {
                            "begin": field_of_study_begin,
                            "end": field_of_study_end
                        },
                        "text": field_of_study,
                        "type": "field_of_study"
                    },
                    {
                        "location": {
                            "begin": degree_level_begin,
                            "end": degree_level_end
                        },
                        "text": degree_level,
                        "type": "degree_level"
                    },
                    {
                        "location": {
                            "begin": employee_id_begin,
                            "end": employee_id_end
                        },
                        "text": employee_id,
                        "type": "employee_id"
                    },
                    {
                        "location": {
                            "begin": salary_begin,
                            "end": salary_end
                        },
                        "text": salary,
                        "type": "salary"
                    }
                ]   
            }
    
    return data

In [28]:
#Sample dataset
format_data()

{'text': "My monthly Earning is 4463.7 and employee code is T43358, I studied Master's in Business",
 'mentions': [{'location': {'begin': 80, 'end': 88},
   'text': 'Business',
   'type': 'field_of_study'},
  {'location': {'begin': 68, 'end': 76},
   'text': "Master's",
   'type': 'degree_level'},
  {'location': {'begin': 50, 'end': 56},
   'text': 'T43358',
   'type': 'employee_id'},
  {'location': {'begin': 22, 'end': 28}, 'text': '4463.7', 'type': 'salary'}]}
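Because the offsets are computed with `str.find`, it is worth confirming that every `begin`/`end` pair actually reproduces the mention text. The following is a minimal sketch of such a check; `check_offsets` is a hypothetical helper written for this illustration (it is not part of Watson NLP), and the sample record is the one shown in the output above.

```python
def check_offsets(record):
    """Return the mentions whose offsets do not reproduce their own text."""
    text = record["text"]
    mismatches = []
    for mention in record["mentions"]:
        begin = mention["location"]["begin"]
        end = mention["location"]["end"]
        # The slice text[begin:end] must equal the stored mention text.
        if text[begin:end] != mention["text"]:
            mismatches.append(mention)
    return mismatches


# Sample record copied from the output above.
sample = {
    "text": "My monthly Earning is 4463.7 and employee code is T43358, "
            "I studied Master's in Business",
    "mentions": [
        {"location": {"begin": 80, "end": 88}, "text": "Business", "type": "field_of_study"},
        {"location": {"begin": 68, "end": 76}, "text": "Master's", "type": "degree_level"},
        {"location": {"begin": 50, "end": 56}, "text": "T43358", "type": "employee_id"},
        {"location": {"begin": 22, "end": 28}, "text": "4463.7", "type": "salary"},
    ],
}

print(check_offsets(sample))  # [] means every offset is consistent
```

Running the same check over every generated record before training would catch any case where `str.find` matched an unintended position.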

In [69]:
# Prepare and store the training dataset
train_list_faker = []
for i in range(0, 30000):
    train_list_faker.append(format_data())

with open('faker_PII_text_train.json', 'w') as f:
    json.dump(train_list_faker, f)
project.save_data('faker_PII_text_train.json', data=json.dumps(train_list_faker), overwrite=True)

{'file_name': 'faker_PII_text_train.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': 'e952dfbd-f642-4712-b7a5-deae8425af2a'}

In [44]:
# Prepare and store the test dataset
test_list_faker = []
for i in range(0, 1000):
    test_list_faker.append(format_data())

with open('faker_PII_text_test.json', 'w') as f:
    json.dump(test_list_faker, f)
project.save_data('faker_PII_text_test.json', data=json.dumps(test_list_faker), overwrite=True)

{'file_name': 'faker_PII_text_test.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '0059c8e9-2566-4288-a1c2-092dc29d418e'}

The generated JSON files are already in the format that the training utilities expect. The following cell reads them from the runtime working directory and converts them into the streams that are used as input for training the models.

In [70]:
train_data = dm.DataStream.from_json_array("faker_PII_text_train.json")
train_iob_stream = prepare_train_from_json(train_data, syntax_model)
dev_data = dm.DataStream.from_json_array("faker_PII_text_test.json")
dev_iob_stream = prepare_train_from_json(dev_data, syntax_model)

The text inputs are converted into data streams in which each text has been split into sentences and tokens by the syntax model.
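The stream names above refer to IOB labels, in which each token is tagged as the Beginning of a mention, Inside a mention, or Outside any mention. The sketch below illustrates the idea in plain Python. It assumes a simple whitespace tokenizer and a hypothetical helper named `iob_labels`; it is not the implementation used by `prepare_train_from_json`, and the real syntax model tokenizes differently (for example, around punctuation).

```python
import re


def iob_labels(text, mentions):
    """Tag whitespace-separated tokens with B-/I-/O labels from character spans."""
    labels = []
    for match in re.finditer(r"\S+", text):
        start, end = match.start(), match.end()
        label = "O"
        for mention in mentions:
            begin = mention["location"]["begin"]
            stop = mention["location"]["end"]
            # A token belongs to a mention when the two character ranges overlap.
            if start < stop and end > begin:
                prefix = "B" if start <= begin else "I"
                label = f"{prefix}-{mention['type']}"
                break
        labels.append((match.group(), label))
    return labels


example_text = "I studied Master's in Computer Science"
example_mentions = [
    {"location": {"begin": 10, "end": 18}, "type": "degree_level"},
    {"location": {"begin": 22, "end": 38}, "type": "field_of_study"},
]

for token, label in iob_labels(example_text, example_mentions):
    print(f"{token:10s} {label}")
```

Here `Computer` receives `B-field_of_study` and `Science` receives `I-field_of_study`, which is how a multi-token mention is represented as a label sequence.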

<a id="buildModel"></a>
## 3. Model Building

Entity extraction uses the entity-mentions block, which encapsulates algorithms for extracting mentions of entities (persons, organizations, dates, locations, ...) from the input text. The blocks and workflows offer implementations of strong entity extraction algorithms from each of four families: rule-based, classic ML, deep learning, and transformers.

<a id="sire"></a>
### 3.1 SIRE Training

You can train SIRE models using either a CRF or a Maximum Entropy template as the base model. Of the two, the CRF-based template takes longer to train but gives better results.

These algorithms accept a set of features in the form of dictionaries and regular expressions. A set of predefined feature extractors is provided for multiple languages, and you can also define your own features.
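To make the idea of dictionary and regular-expression features concrete, here is a small plain-Python sketch. The dictionary, the patterns, and the `token_features` function are all hypothetical and chosen to match the synthetic data in this notebook; they do not use or reproduce the SIRE feature extractor API.

```python
import re

# Hypothetical dictionary feature: known degree levels from the synthetic data.
DEGREE_DICTIONARY = {"bachelor's", "master's", "doctorate"}

# Hypothetical regular-expression features matching the generated formats.
EMPLOYEE_ID_PATTERN = re.compile(r"[A-Z]{1,2}\d{5}")
SALARY_PATTERN = re.compile(r"\d+\.\d{1,2}")


def token_features(token):
    """Return boolean features for a single token."""
    return {
        "in_degree_dictionary": token.lower() in DEGREE_DICTIONARY,
        "looks_like_employee_id": EMPLOYEE_ID_PATTERN.fullmatch(token) is not None,
        "looks_like_salary": SALARY_PATTERN.fullmatch(token) is not None,
    }


for token in ["MN34275", "Master's", "3362.18", "Hello"]:
    print(token, token_features(token))
```

Features of this kind give a sequence model strong hints about a token's surface form, which helps with entities such as IDs that never appear in a fixed vocabulary.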

In [3]:
#help(watson_nlp.workflows.entity_mentions.SIRE)

In [31]:
# Download the algorithm template
mentions_train_template = watson_nlp.load(watson_nlp.download('file_path_entity-mentions_sire_multi_template-crf'))
# Download the feature extractor
default_feature_extractor = watson_nlp.load(watson_nlp.download('feature-extractor_rbr_entity-mentions_sire_en_stock'))

In [32]:
# Train the model
sire_custom = watson_nlp.workflows.entity_mentions.SIRE.train(syntax_model=syntax_model,
                                                              labeled_entity_mentions='/home/wsuser/work/', 
                                                              #labeled_entity_mentions=train_data,
                                                              model_language='en', 
                                                              template_resource=mentions_train_template, 
                                                              feature_extractors=[default_feature_extractor], 
                                                              l1=0.1, 
                                                              l2=0.005, 
                                                              num_epochs=50, 
                                                              num_workers=5)

Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClassifier initialized.
[MEVitClassifier2::initModel] model initialized.
Get Feature str 818099
Done get feature str 818099
done. [51g573m340k,8g985m520k]
gramSize = 2
number of processes: 5
Initial processing:  (# of words: 265660, # of sentences: 20000)
senIndex[1] = 7222, wordIndex = 53136
senIndex[2] = 11699, wordIndex = 106285
senIndex[3] = 14474, wordIndex = 159406
senIndex[4] = 17249, wordIndex = 212535
senIndex[5] = 19999, wordIndex = 265660
[ME_CRF::scaleModel] Updater -- l1=0.1, l2=0.005, history size=5, progress windows size 20
 Iteration           Obj             WErr                         Timing       %Eff        Per thread timing
               543176.67      6.63/ 63.18             E:1.08 s, M:0.08 s.       1.00 [m:1.04, M:1.07, av:1.06]
         0   240271.27     18.58/ 73.16             

The following code will save the custom model to Watson Studio by using the project library.

In [33]:
# Save the model
project.save_data('PII_sire_custom', data=sire_custom.as_file_like_object(), overwrite=True)

Saved 9722 features.


{'file_name': 'PII_sire_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '0941d329-e971-45c2-b766-082fe06434a4'}

Let's run the model on one example input from the dev dataset.

In [46]:
text = pd.read_json('faker_PII_text_test.json')['text'][1]
text

" Hello, My employee id is MN34275 and I done my Master's in Medical, I am earning 3362.18 per month"

In [47]:
# Run the model
sire_result = sire_custom.run(text)
sire_result

{
  "mentions": [
    {
      "span": {
        "begin": 26,
        "end": 33,
        "text": "MN34275"
      },
      "type": "employee_id",
      "producer_id": null,
      "confidence": 0.999635088942728,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 48,
        "end": 56,
        "text": "Master's"
      },
      "type": "degree_level",
      "producer_id": null,
      "confidence": 0.9996819393489853,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 60,
        "end": 67,
        "text": "Medical"
      },
      "type": "field_of_study",
      "producer_id": null,
      "confidence": 0.9999710541077216,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 82,
        "end": 89,
        "text": "3362.18"
      },
     

<a id="bilstm"></a>
### 3.2 BiLSTM Training

The deep-learning algorithm used in this block performs sequence labelling based on the BiLSTM architecture followed by a CRF layer. It uses GloVe embeddings as features.
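A CRF layer scores whole label sequences rather than individual tokens, so the best sequence is typically found with Viterbi decoding over per-token scores and label-transition scores. The sketch below is a generic, plain-Python Viterbi decoder with made-up scores; it is an illustration of the technique under those assumptions, not the code that Watson NLP runs internally.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions: list of {label: score} dicts, one per token.
    transitions: {(previous_label, label): score}; missing pairs score 0.0.
    """
    labels = list(emissions[0])
    best = dict(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: best[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Made-up scores for the tokens "in", "Computer", "Science".
emissions = [
    {"O": 2.0, "B-field_of_study": 0.0, "I-field_of_study": 0.0},
    {"O": 0.0, "B-field_of_study": 1.0, "I-field_of_study": 1.5},
    {"O": 0.0, "B-field_of_study": 0.0, "I-field_of_study": 2.0},
]
# Penalize an I- label that directly follows O, which is not a valid IOB sequence.
transitions = {("O", "I-field_of_study"): -5.0}

print(viterbi(emissions, transitions))
# ['O', 'B-field_of_study', 'I-field_of_study']
```

Picking the top-scoring label per token independently would tag `Computer` as `I-field_of_study`; the transition penalty is what steers the decoder to a well-formed `B-`/`I-` sequence.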

In [4]:
#help(watson_nlp.blocks.entity_mentions.BiLSTM)

In [38]:
# Download the GloVe model to be used as embeddings in the BiLSTM
glove_model = watson_nlp.load(watson_nlp.download('embedding_glove_en_stock'))

In [39]:
# Train the BiLSTM model on the PII training data
bilstm_custom = watson_nlp.blocks.entity_mentions.BiLSTM.train(train_iob_stream,
                                                              dev_iob_stream,
                                                              glove_model.embedding,
                                                              num_train_epochs=5)



The following code will save the custom model to Watson Studio by using the project library.

In [40]:
# Save the model
project.save_data('PII_bilstm_custom', data=bilstm_custom.as_file_like_object(), overwrite=True)

{'file_name': 'PII_bilstm_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': 'f82985ac-5595-40bd-be49-2f6d3d87b1a1'}

Let's run the model on one example input.

In [48]:
# Run the BiLSTM model
syntax_result = syntax_model.run(text)
bilstm_result = bilstm_custom.run(syntax_result)

bilstm_result

{
  "mentions": [
    {
      "span": {
        "begin": 26,
        "end": 33,
        "text": "MN34275"
      },
      "type": "employee_id",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 0.9999963045120239,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 48,
        "end": 56,
        "text": "Master's"
      },
      "type": "degree_level",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 0.9999946355819702,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 60,
        "end": 67,
        "text": "Medical"
      },
      "type": "field_of_study",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence"

Now you are able to run the trained models on new data. You will run the models on the test data so that the results can also be used for model evaluation.
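One common use of extracted mentions on new text is masking the detected PII. The following is a minimal sketch, assuming the model result has already been converted to a plain dictionary shaped like the outputs above; `redact` is a hypothetical helper, and the example uses only the three mentions that are fully shown in the output.

```python
def redact(text, mentions):
    """Replace each mention span with a [type] placeholder."""
    # Work from the end of the text so that earlier offsets stay valid.
    ordered = sorted(mentions, key=lambda m: m["span"]["begin"], reverse=True)
    for mention in ordered:
        begin = mention["span"]["begin"]
        end = mention["span"]["end"]
        text = text[:begin] + f"[{mention['type']}]" + text[end:]
    return text


example_text = (" Hello, My employee id is MN34275 and I done my Master's in Medical, "
                "I am earning 3362.18 per month")
example_result = {
    "mentions": [
        {"span": {"begin": 26, "end": 33, "text": "MN34275"}, "type": "employee_id"},
        {"span": {"begin": 48, "end": 56, "text": "Master's"}, "type": "degree_level"},
        {"span": {"begin": 60, "end": 67, "text": "Medical"}, "type": "field_of_study"},
    ]
}

print(redact(example_text, example_result["mentions"]))
```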

Watson NLP includes methods for testing the quality of supported models. Given a model and test data, a quality report can be generated. The following example includes the steps required to generate a quality report for a BiLSTM entity mention extractor model. The same approach can be applied to any entity mention extractor model.
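The per-class precision, recall, and F1 values in such a report follow from the true-positive, false-positive, and false-negative counts. The sketch below shows the standard formulas in plain Python; `precision_recall_f1` is a hypothetical helper for illustration and is not the function that `evaluate_quality` calls.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Compute precision, recall and F1 from confusion counts."""
    predicted = true_positive + false_positive  # everything the model extracted
    actual = true_positive + false_negative     # everything in the gold data
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# 1000 correct mentions with no misses or spurious extractions.
print(precision_recall_f1(1000, 0, 0))
# {'precision': 1.0, 'recall': 1.0, 'f1': 1.0}

# A less perfect, made-up case: 90 correct, 10 spurious, 30 missed.
print(precision_recall_f1(90, 10, 30))
```

Perfect scores on data generated from the same few templates as the training set mainly show that the model has learned those templates; performance on real-world text should be measured separately.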

In [49]:
# Execute the model and generate the quality report
preprocess_func = lambda raw_doc: syntax_model.run(raw_doc)
quality_report = bilstm_custom.evaluate_quality('faker_PII_text_test.json', 
                                               preprocess_func)

# Print the quality report
print(json.dumps(quality_report, indent=4))



{
    "per_class_confusion_matrix": {
        "field_of_study": {
            "true_positive": 1000,
            "false_positive": 0,
            "false_negative": 0,
            "precision": 1.0,
            "recall": 1.0,
            "f1": 1.0
        },
        "employee_id": {
            "true_positive": 1000,
            "false_positive": 0,
            "false_negative": 0,
            "precision": 1.0,
            "recall": 1.0,
            "f1": 1.0
        },
        "salary": {
            "true_positive": 1000,
            "false_positive": 0,
            "false_negative": 0,
            "precision": 1.0,
            "recall": 1.0,
            "f1": 1.0
        },
        "degree_level": {
            "true_positive": 1000,
            "false_positive": 0,
            "false_negative": 0,
            "precision": 1.0,
            "recall": 1.0,
            "f1": 1.0
        }
    },
    "macro_true_positive": null,
    "macro_false_positive": null,
    "macro_false_negative"

<a id="summary"></a>
## 4. Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library and how quickly and easily you can train and run different PII extraction models using Watson NLP.</span>

Please note that this content is made available to foster Embedded AI technology adoption. The content may include systems & methods pending patent with USPTO and protected under US Patent Laws. For redistribution of this content, IBM will use its release process. For any questions, please log an issue in the [GitHub](https://github.com/ibm-build-labs/Watson-NLP) repository.

Developed by IBM Build Lab 

Copyright - 2022 IBM Corporation 