# Extract Personally Identifiable Information (PII) Using Watson NLP

<h2>Use Case</h2>

This notebook demonstrates how to extract PII entities using custom-trained or fine-tuned Watson NLP models. PII extraction is the process of identifying and extracting personal information from a document or dataset. This information can include names, addresses, phone numbers, email addresses, Social Security numbers, credit card numbers, and other details that can be used to identify an individual.

<h2>What you'll learn in this notebook</h2>

Watson NLP offers fine-tuning functionality for custom training. This notebook covers the following model types:

* <b>BiLSTM</b>: A bidirectional long short-term memory (BiLSTM) network takes the preprocessed text as input and learns patterns and relationships between words that indicate PII. For each word in the text, it outputs a probability score indicating how likely the word is to be part of a PII entity. The network can also be trained to recognize specific entity types such as names, addresses, phone numbers, and email addresses.

* <b>SIRE</b>: Statistical Information and Relation Extraction (SIRE) is a natural language processing (NLP) technique for extracting specific information and relationships from text. It uses machine learning algorithms to identify and extract structured data such as entities, attributes, and relations from unstructured text. SIRE is used in a variety of applications, including information extraction, knowledge graph construction, and question answering. It typically follows a supervised learning approach, in which a model is trained on annotated examples of text and the corresponding structured data. The model can then extract the same kinds of information from new, unseen text.

* <b>BERT</b>: Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained language model that can be fine-tuned on a specific language task with a relatively small amount of task-specific data.

## Table of Contents


1. [Before you start](#beforeYouStart)
1. [Load Entity PII Models](#LoadModel)
1. [Load PII XLSX Dataset from Data Assets](#Loaddata)
1. [Preparing Training Data](#TrainingData)
1. [Watson NLP Models](#NLPModels)    
   1.  [BiLSTM Fine-tuned](#BILSTMFINE)
   1.  [SIRE Fine-tuned](#SIRETune)
   1.  [Transformer Fine-tuned](#TransTUne)
1. [Fine-Tune Model For Driving License Number](#DLNFine)  
   1.  [SIRE Fine-Tune Model For Driving License Number Extraction](#SireDLNFine)
   1.  [Transformer Fine-Tune Model For Driving License Number Extraction](#RBRDLNFine)
1. [Additional PII Entities extraction](#addPII)
   1.   [Preparing PII Training Data](#PIIData)
1. [Model Building on Custom PII Entities](#buildModel)
   1.   [SIRE Training](#sire)
   1.   [BiLSTM Training](#bilstm)
1. [Summary](#summary)

<a id="beforeYouStart"></a>
### 1. Before you start


<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _Default Python 3.x_ environment, **stop their kernels** before running this notebook. All of these notebooks share the same runtime environment, and running them in parallel can cause memory issues. To stop the kernel of another notebook, open that notebook and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see that cell at the top of the notebook, add the token by clicking **More > Insert project token** from the notebook action bar. Running the inserted hidden code cell creates a project object that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

Note that you can step through the notebook cell by cell by pressing Shift+Enter, or you can run the entire notebook by selecting **Cell -> Run All** from the menu.

In [18]:
import json
import pandas as pd
import watson_nlp
from watson_nlp import data_model as dm
from watson_nlp.toolkit.entity_mentions_utils import prepare_train_from_json

In [19]:
# Silence Tensorflow warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

<a id="LoadModel"></a>
### 2. Load Entity PII Models

In [4]:
# Load a syntax model to split the text into sentences and tokens
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
# Load bilstm model in WatsonNLP
bilstm_model = watson_nlp.load(watson_nlp.download('entity-mentions_bilstm_en_pii'))
# Download the GloVe model to be used as embeddings in the BiLSTM
glove_model = watson_nlp.load(watson_nlp.download('embedding_glove_en_stock'))
# Download the algorithm template
mentions_train_template = watson_nlp.load(watson_nlp.download('file_path_entity-mentions_sire_multi_template-crf'))
# Download the feature extractor
default_feature_extractor = watson_nlp.load(watson_nlp.download('feature-extractor_rbr_entity-mentions_sire_en_stock'))
# Load rbr model in WatsonNLP
rbr_model = watson_nlp.load(watson_nlp.download('entity-mentions_rbr_multi_pii'))
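The rules inside the RBR (rule-based) model loaded above are not shown in this notebook. As a rough illustration of the general idea of rule-based extraction only, the following sketch uses two hand-written regular expressions. The pattern names and the helper `find_pii_with_rules` are hypothetical and cover only the dashed formats that appear in this notebook's sample data; they are not Watson NLP's actual rules.

```python
import re

# Hypothetical patterns for illustration; they match the dashed formats used
# in this notebook's sample data and are not Watson NLP's actual rules.
PII_PATTERNS = {
    "SocialSecurityNumber": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CreditCardNumber": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}

def find_pii_with_rules(text):
    """Return (begin, end, matched_text, type) tuples sorted by position."""
    found = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), match.group(), pii_type))
    return sorted(found)

sample = "My SSN is 489-36-8350 and my card is 4929-3813-3266-4295."
for begin, end, value, pii_type in find_pii_with_rules(sample):
    print(begin, end, value, pii_type)
```

Fixed patterns like these miss values written in other formats and cannot recognize names at all, which is one motivation for the trained models used in the rest of the notebook.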

<a id="Loaddata"></a>
### 3. Load PII XLSX Dataset from Data Assets

In [10]:
import os, types
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
cos_client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_CLOUD_API_KEY>',  # credential redacted; supply your own API key
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.private.us.cloud-object-storage.appdomain.cloud')

bucket = 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1'
object_key = '10-MB-Test.xlsx'

body = cos_client.get_object(Bucket=bucket,Key=object_key)['Body']

df = pd.read_excel(body.read())
df = df.dropna()
df.head()

Unnamed: 0,First and Last Name,SSN,Credit Card Number,First and Last Name.1,SSN.1,Credit Card Number.1,First and Last Name.2,SSN.2,Credit Card Number.2,First and Last Name.3,...,Credit Card Number.3,First and Last Name.4,SSN.4,Credit Card Number.4,First and Last Name.5,SSN.5,Credit Card Number.5,First and Last Name.6,SSN.6,Credit Card Number.6
1,Robert Aragon,489-36-8350,4929-3813-3266-4295,Robert Aragon,489-36-8351,4929-3813-3266-4296,Robert Aragon,489-36-8352,4929-3813-3266-4297,Robert Aragon,...,4929-3813-3266-4298,Robert Aragon,489-36-8354,4929-3813-3266-4299,Robert Aragon,489-36-8355,4929-3813-3266-4300,Robert Aragon,489-36-8355,4929-3813-3266-4300
2,Ashley Borden,514-14-8905,5370-4638-8881-3020,Ashley Borden,514-14-8906,5370-4638-8881-3021,Ashley Borden,514-14-8907,5370-4638-8881-3022,Ashley Borden,...,5370-4638-8881-3023,Ashley Borden,514-14-8909,5370-4638-8881-3024,Ashley Borden,514-14-8910,5370-4638-8881-3025,Ashley Borden,514-14-8910,5370-4638-8881-3025
3,Thomas Conley,690-05-5315,4916-4811-5814-8111,Thomas Conley,690-05-5316,4916-4811-5814-8112,Thomas Conley,690-05-5317,4916-4811-5814-8113,Thomas Conley,...,4916-4811-5814-8114,Thomas Conley,690-05-5319,4916-4811-5814-8115,Thomas Conley,690-05-5320,4916-4811-5814-8116,Thomas Conley,690-05-5320,4916-4811-5814-8116
4,Susan Davis,421-37-1396,4916-4034-9269-8783,Susan Davis,421-37-1397,4916-4034-9269-8784,Susan Davis,421-37-1398,4916-4034-9269-8785,Susan Davis,...,4916-4034-9269-8786,Susan Davis,421-37-1400,4916-4034-9269-8787,Susan Davis,421-37-1401,4916-4034-9269-8788,Susan Davis,421-37-1401,4916-4034-9269-8788
5,Christopher Diaz,458-02-6124,5299-1561-5689-1938,Christopher Diaz,458-02-6125,5299-1561-5689-1939,Christopher Diaz,458-02-6126,5299-1561-5689-1940,Christopher Diaz,...,5299-1561-5689-1941,Christopher Diaz,458-02-6128,5299-1561-5689-1942,Christopher Diaz,458-02-6129,5299-1561-5689-1943,Christopher Diaz,458-02-6129,5299-1561-5689-1943


<a id="TrainingData"></a>
### 4. Preparing Training Data

Let's generate sentences using the columns of PII information. Ideally, the sentences would include name, SSN, and credit card number in context.

In [11]:
import random

def format_data(df, name_col, ssn_col, ccn_col):
    """Build annotated sentences from the name, SSN, and credit card columns."""
    train_list = []
    # Iterate over rows directly: after dropna() the index labels can have gaps,
    # so label-based lookups such as df[col][i] may raise a KeyError.
    for _, row in df.iterrows():
        name = str(row[name_col])
        ssn = str(row[ssn_col])
        ccn = str(row[ccn_col])

        text1 = "My name is %s, and my social security number is %s. Here's the number to my Visa credit card, %s" % (name, ssn, ccn)
        text2 = "%s is my social security number. The name on my American Express card %s is %s." % (ssn, ccn, name)
        text = random.choice([text1, text2])

        # Character offsets of each value within the chosen sentence
        name_begin = text.find(name)
        name_end = name_begin + len(name)
        ssn_begin = text.find(ssn)
        ssn_end = ssn_begin + len(ssn)
        ccn_begin = text.find(ccn)
        ccn_end = ccn_begin + len(ccn)

        data = {
                    "text": text,
                    "mentions": [
                        {
                            "location": {
                                "begin": name_begin,
                                "end": name_end
                            },
                            "text": name,
                            "type": "Name"
                        },
                        {
                            "location": {
                                "begin": ssn_begin,
                                "end": ssn_end
                            },
                            "text": ssn,
                            "type": "SocialSecurityNumber"
                        },
                        {
                            "location": {
                                "begin": ccn_begin,
                                "end": ccn_end
                            },
                            "text": ccn,
                            "type": "CreditCardNumber"
                        }
                    ]   
                }

        train_list.append(data)
    return train_list

In [12]:
train_list = format_data(df=df, name_col='First and Last Name', ssn_col='SSN', ccn_col='Credit Card Number')
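Because the offsets in each record are computed with `str.find`, it can be worth confirming that every `begin`/`end` pair really slices out the mention text before training. The helper below is a minimal sketch for that check; `validate_record` is a hypothetical name, and the example record simply mirrors the structure produced by `format_data`.

```python
def validate_record(record):
    """Return a list of problems found in one annotated record (empty if none)."""
    problems = []
    text = record["text"]
    for mention in record["mentions"]:
        begin = mention["location"]["begin"]
        end = mention["location"]["end"]
        if begin < 0:
            # str.find returns -1 when the value is not present in the text
            problems.append(f"{mention['type']}: value not found in text")
        elif text[begin:end] != mention["text"]:
            problems.append(f"{mention['type']}: offsets do not match the mention text")
    return problems

example = {
    "text": "My name is Robert Aragon, and my social security number is 489-36-8350.",
    "mentions": [
        {"location": {"begin": 11, "end": 24}, "text": "Robert Aragon", "type": "Name"},
        {"location": {"begin": 59, "end": 70}, "text": "489-36-8350", "type": "SocialSecurityNumber"},
    ],
}
print(validate_record(example))
```

Applied to the generated data, an expression such as `[r for r in train_list if validate_record(r)]` would list any records that need attention.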

Save the sentences to a JSON training file and a JSON dev file. Each file is written to the local runtime storage as well as to the project data assets.

In [13]:
with open('PII_text_train.json', 'w') as f:
    json.dump(train_list, f)
project.save_data('PII_text_train.json', data=json.dumps(train_list), overwrite=True)

{'file_name': 'PII_text_train.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '216b85be-aabe-4ff6-b264-acd101222fbc'}

In [14]:
dev_list = format_data(df=df, name_col='First and Last Name.1', ssn_col='SSN.1', ccn_col='Credit Card Number.1')

In [15]:
with open('PII_text_dev.json', 'w') as f:
    json.dump(dev_list, f)
project.save_data('PII_text_dev.json', data=json.dumps(dev_list), overwrite=True)

{'file_name': 'PII_text_dev.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '76834e31-ab93-4aca-b86b-ce6e71476478'}

In [16]:
text = "My name is %s, and my social security number is %s. Here's the number to my Visa credit card, %s" % (df['First and Last Name'][1], df['SSN'][1], df['Credit Card Number'][1])

In [17]:
train_data = dm.DataStream.from_json_array("PII_text_train.json")
train_iob_stream = prepare_train_from_json(train_data, syntax_model)
dev_data = dm.DataStream.from_json_array("PII_text_dev.json")
dev_iob_stream = prepare_train_from_json(dev_data, syntax_model)

<a id="NLPModels"></a>
### 5. Watson NLP Models

<a id="BILSTMFINE"></a>

### 5.1 BiLSTM Fine-tuned

In [18]:
bilstm_custom = bilstm_model.train(train_iob_stream, 
                                   dev_iob_stream, 
                                   embedding=glove_model.embedding,
                                   #vocab_tags=None, 
                                   #char_embed_dim=32, 
                                   #dropout=0.2, 
                                   #num_oov_buckets=1, 
                                   num_train_epochs=5,
                                   num_conf_epochs=5, 
                                   checkpoint_interval=5, 
                                   learning_rate=0.005, 
                                   #shuffle_buffer=2000, 
                                   #char_lstm_size=64, 
                                   #char_bidir=False, 
                                   lstm_size=16, 
                                   #train_batch_size=32, 
                                   #lower_case=False, 
                                   #embedding_lowercase=True, 
                                   #keep_model_artifacts=False)
                                  )



In [19]:
project.save_data('bilstm_pii_custom', data=bilstm_custom.as_file_like_object(), overwrite=True)

{'file_name': 'bilstm_pii_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '7cb485ec-42c3-4a4d-aa10-37502be1266f'}

In [20]:
syntax_result = syntax_model.run(text)
bilstm_result = bilstm_custom.run(syntax_result)

for i in bilstm_result.mentions:
    print("Text: ", i.span.text.ljust(15, " "), "Type: ", i.type)

Text:  Robert Aragon   Type:  Name
Text:  489-36-8350     Type:  SocialSecurityNumber
Text:  4929-3813-3266-4295 Type:  CreditCardNumber
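Each mention carries a character span in addition to its type, so one way to use the results is to mask the detected values. The following is a minimal sketch under stated assumptions: the mentions have already been reduced to plain dictionaries with a `span` (`begin`, `end`) and a `type`, the offsets are the ones that hold for the sample sentence, and `redact` is a hypothetical helper rather than part of Watson NLP.

```python
def redact(text, mentions):
    """Replace each mention span in text with a [Type] placeholder."""
    # Work from the end of the string so that earlier offsets stay valid
    for m in sorted(mentions, key=lambda m: m["span"]["begin"], reverse=True):
        begin, end = m["span"]["begin"], m["span"]["end"]
        text = text[:begin] + f"[{m['type']}]" + text[end:]
    return text

sample_text = ("My name is Robert Aragon, and my social security number is 489-36-8350. "
               "Here's the number to my Visa credit card, 4929-3813-3266-4295")
sample_mentions = [
    {"span": {"begin": 11, "end": 24}, "type": "Name"},
    {"span": {"begin": 59, "end": 70}, "type": "SocialSecurityNumber"},
    {"span": {"begin": 114, "end": 133}, "type": "CreditCardNumber"},
]
print(redact(sample_text, sample_mentions))
```

Replacing from the last span backwards avoids having to recompute offsets after each substitution changes the length of the string.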


<a id="SIRETune"></a>

### 5.2 SIRE Fine-tuned


In [21]:
#help(watson_nlp.blocks.entity_mentions.SIRE)

In [22]:
sire_custom = watson_nlp.blocks.entity_mentions.SIRE.train(train_iob_stream, 
                                                           'en', 
                                                           mentions_train_template,
                                                           feature_extractors=[default_feature_extractor])

Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClassifier initialized.
[MEVitClassifier2::initModel] model initialized.
Get Feature str 81791
Done get feature str 81791
done. [21g638m752k,11g89m684k]
gramSize = 2
number of processes: 5
Initial processing:  (# of words: 1080738, # of sentences: 68412)
senIndex[1] = 13683, wordIndex = 216154
senIndex[2] = 27367, wordIndex = 432296
senIndex[3] = 41053, wordIndex = 648455
senIndex[4] = 54742, wordIndex = 864600
senIndex[5] = 68411, wordIndex = 1080738
[ME_CRF::scaleModel] Updater -- l1=0.1, l2=0.005, history size=5, progress windows size 20
 Iteration           Obj             WErr                         Timing       %Eff        Per thread timing
              2079758.43     15.06/ 81.86             E:5.66 s, M:0.03 s.       1.00 [m:5.63, M:5.66, av:5.66]
         0  1183402.25     25.34/100.00           

In [23]:
project.save_data('sire_pii_custom', data=sire_custom.as_file_like_object(), overwrite=True)

Saved 17897 features.


{'file_name': 'sire_pii_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '3b572121-d3e2-439d-8622-813f6536f335'}

In [24]:
syntax_result = syntax_model.run(text)
sire_result = sire_custom.run(syntax_result)
sire_result

{
  "mentions": [
    {
      "span": {
        "begin": 11,
        "end": 24,
        "text": "Robert Aragon"
      },
      "type": "Name",
      "producer_id": null,
      "confidence": 0.9993409548558251,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 59,
        "end": 70,
        "text": "489-36-8350"
      },
      "type": "SocialSecurityNumber",
      "producer_id": null,
      "confidence": 0.9972113139661557,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 114,
        "end": 133,
        "text": "4929-3813-3266-4295"
      },
      "type": "CreditCardNumber",
      "producer_id": null,
      "confidence": 0.998660825805895,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    }
  ],
  "producer_id": {
    "name": "SIRE Entity Mentions",
    "version": "0

<a id="DLNFine"></a>
### 6. Fine-Tune Model For Driving License Number 

In [14]:
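# Load the driving license number (DLN) dataset
train_data = dm.DataStream.from_json_array("PII_faker_LicenseNumber_text_train.json")
train_iob_stream = prepare_train_from_json(train_data, syntax_model)

# Note: the dev set is read from the same file as the training set, so it is not
# a held-out set; point this at a separate file if one is available.
dev_data = dm.DataStream.from_json_array("PII_faker_LicenseNumber_text_train.json")
dev_iob_stream = prepare_train_from_json(dev_data, syntax_model)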

<a id="SireDLNFine"></a>
### 6.1 SIRE Fine-Tune Model For Driving License Number Extraction

In [55]:
# Fine-tune the SIRE model on the DLN dataset
sire_custom = watson_nlp.blocks.entity_mentions.SIRE.train(train_iob_stream, 
                                                           'en', 
                                                           mentions_train_template,
                                                           feature_extractors=[default_feature_extractor])

Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClassifier initialized.
[MEVitClassifier2::initModel] model initialized.
Get Feature str 949962
Done get feature str 949962
done. [22g393m312k,11g1017m440k]
gramSize = 2
number of processes: 5
Initial processing:  (# of words: 604530, # of sentences: 50156)
senIndex[1] = 10036, wordIndex = 120907
senIndex[2] = 20142, wordIndex = 241827
senIndex[3] = 30110, wordIndex = 362724
senIndex[4] = 40157, wordIndex = 483640
senIndex[5] = 50155, wordIndex = 604530
[ME_CRF::scaleModel] Updater -- l1=0.1, l2=0.005, history size=5, progress windows size 20
 Iteration           Obj             WErr                         Timing       %Eff        Per thread timing
              1164292.05     18.91/ 80.55             E:4.93 s, M:0.08 s.       1.00 [m:4.89, M:4.92, av:4.92]
         0   629438.33     27.21/100.00         

In [56]:
# Save the fine-tuned SIRE model for DLN
project.save_data('sire_pii_dl_custom', data=sire_custom.as_file_like_object(), overwrite=True)

9      154.25      0.00/  0.00             E:4.96 s, M:0.18 s.       1.00 [m:4.92, M:4.95, av:4.95]
Not enough progress in the last 5 iters.. converged.
    Thread     Total      Wait Effective      %Eff         #Sents/sec
         0    103.27      0.02    103.24      1.00             2040.98
         1    103.25      0.04    103.21      1.00             2062.27
         2    103.17      0.03    103.14      1.00             2041.69
         3    103.33      0.03    103.29      1.00             2040.51
         4    103.28      0.05    103.23      1.00             2015.00
Parent: the end!
Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClassifier initialized.
[MEVitClassifier2::initModel] model initialized.
Saved 24610 features.


{'file_name': 'sire_pii_dl_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '560e8d34-f492-4156-a2b4-cec4b1e6cf58'}

In [31]:
text1 = "My name is William Thomas. I belong to the Hawaii, My Driving License number is H12414887."
text2 = "My name is Michael Garcia. I belong to the North Carolina, My Driving License number is 656915532402."
text3 = "My name is Scott Thompson. I belong to the New York, My Driving License number is 225 961 856."
text4 = "My name is Michelle Perez. I belong to the Hawaii, My Driving License number is H30716114."
text5 = "My name is Timothy Noble. I belong to the Texas, My Driving License number is 10418683."
text6 = "My name is Jason Parks. I belong to the Colarado, My Driving License number is 31-331-5620."
text7 = "My name is Janice Hernandez. I belong to the Colarado, My Driving License number is 97-054-8209."
text8 = "My name is Zachary Flynn. I belong to the North Carolina, My Driving License number is 221653380787."
text9 = "My name is Brittney Davis. I belong to the New York, My credit card number is 4929-3813-3266-4295."
text10 = "My name is Jill Diaz. I belong to the California, My age is 26."

In [32]:
text = [text1, text2, text3, text4, text5, text6, text7, text8, text9, text10]

In [82]:
t=1
for test in text:
    syntax_result = syntax_model.run(test)
    sire_result = sire_custom.run(syntax_result)
    
    for i in sire_result.mentions:
        print("Text"+str(t), i.span.text.ljust(15, " "), "Type: ", i.type)
    t+=1

Text1 William Thomas  Type:  Name
Text1 Hawaii          Type:  state
Text1 H12414887       Type:  driving_license_number
Text2 Michael Garcia  Type:  Name
Text2 North Carolina  Type:  state
Text2 656915532402    Type:  driving_license_number
Text3 Scott Thompson  Type:  Name
Text3 New York        Type:  state
Text3 225 961 856     Type:  driving_license_number
Text4 Michelle Perez  Type:  Name
Text4 Hawaii          Type:  state
Text4 H30716114       Type:  driving_license_number
Text5 Timothy Noble   Type:  Name
Text5 Texas           Type:  state
Text5 10418683        Type:  driving_license_number
Text6 Jason Parks     Type:  Name
Text6 Colarado        Type:  state
Text6 31-331-5620     Type:  driving_license_number
Text7 Janice Hernandez Type:  Name
Text7 Colarado        Type:  state
Text7 97-054-8209     Type:  driving_license_number
Text8 Zachary Flynn   Type:  Name
Text8 North Carolina  Type:  state
Text8 221653380787    Type:  driving_license_number
Text9 Brittney Davis  Type:  Na

<a id="RBRDLNFine"></a>
### 6.2 Transformer Fine-Tune Model For Driving License Number Extraction

In [21]:
# Download and load the pretrained model resource
pretrained_model_resource = watson_nlp.load(watson_nlp.download('pretrained-model_watbert_multi_transformer_multi_uncased'))

In [None]:
transformer_custom = watson_nlp.blocks.entity_mentions.Transformer.train(train_iob_stream,
                                                                         dev_iob_stream,
                                                                         pretrained_model_resource,
                                                                         num_train_epochs=1,
                                                                         learning_rate=3e-5,
                                                                         per_device_train_batch_size=1,
                                                                         per_device_eval_batch_size=32)

In [23]:
project.save_data('transformer_pii_dl_custom', data=transformer_custom.as_file_like_object(), overwrite=True)

[INFO|tokenization_utils_base.py:2094] 2023-02-07 14:14:19,804 >> tokenizer config file saved in /tmp/wsuser/tmp_poajs55/.model/artifacts/tokenizer_config.json
[INFO|tokenization_utils_base.py:2100] 2023-02-07 14:14:19,806 >> Special tokens file saved in /tmp/wsuser/tmp_poajs55/.model/artifacts/special_tokens_map.json
[INFO|trainer.py:2139] 2023-02-07 14:14:20,324 >> Saving model checkpoint to /tmp/wsuser/tmp_poajs55/.model/artifacts
[INFO|configuration_utils.py:439] 2023-02-07 14:14:20,327 >> Configuration saved in /tmp/wsuser/tmp_poajs55/.model/artifacts/config.json
[INFO|modeling_utils.py:1084] 2023-02-07 14:14:22,274 >> Model weights saved in /tmp/wsuser/tmp_poajs55/.model/artifacts/pytorch_model.bin
[INFO|tokenization_utils_base.py:2094] 2023-02-07 14:14:22,277 >> tokenizer config file saved in /tmp/wsuser/tmp_poajs55/.model/artifacts/tokenizer_config.json
[INFO|tokenization_utils_base.py:2100] 2023-02-07 14:14:22,278 >> Special tokens file saved in /tmp/wsuser/tmp_poajs55/.model/

{'file_name': 'transformer_pii_dl_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '672b5a24-226f-4916-a9bd-a7659a2f1300'}

In [34]:
# Test the custom-trained Transformer model

t=1
for test in text:
    syntax_result = syntax_model.run(test)
    transformer_result = transformer_custom.run(syntax_result)
    
    for i in transformer_result.mentions:
        print("Text"+str(t), i.span.text.ljust(15, " "), "Type: ", i.type)
    t+=1

[INFO|trainer.py:2389] 2023-02-07 14:18:46,759 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:46,762 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:46,763 >>   Batch size = 64
[INFO|trainer.py:2389] 2023-02-07 14:18:46,882 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:46,882 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:46,883 >>   Batch size = 64


Text1 William Thomas  Type:  Name
Text1 Hawaii          Type:  state
Text1 H12414887       Type:  driving_license_number


[INFO|trainer.py:2389] 2023-02-07 14:18:46,994 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:46,995 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:46,995 >>   Batch size = 64


Text2 Michael Garcia  Type:  Name
Text2 North Carolina  Type:  state
Text2 656915532402    Type:  driving_license_number


[INFO|trainer.py:2389] 2023-02-07 14:18:47,107 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:47,108 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:47,109 >>   Batch size = 64


Text3 Scott Thompson  Type:  Name
Text3 New York        Type:  state
Text3 225 961 856     Type:  driving_license_number


[INFO|trainer.py:2389] 2023-02-07 14:18:47,219 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:47,219 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:47,220 >>   Batch size = 64


Text4 Michelle Perez  Type:  Name
Text4 Hawaii          Type:  state
Text4 H30716114       Type:  driving_license_number


[INFO|trainer.py:2389] 2023-02-07 14:18:47,329 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:47,330 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:47,331 >>   Batch size = 64


Text5 Timothy Noble   Type:  Name
Text5 Texas           Type:  state
Text5 10418683        Type:  driving_license_number


[INFO|trainer.py:2389] 2023-02-07 14:18:47,441 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:47,442 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:47,442 >>   Batch size = 64


Text6 Jason Parks     Type:  Name
Text6 Colarado        Type:  state
Text6 31-331-5620     Type:  driving_license_number


[INFO|trainer.py:2389] 2023-02-07 14:18:47,552 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:47,553 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:47,554 >>   Batch size = 64


Text7 Janice Hernandez Type:  Name
Text7 Colarado        Type:  state
Text7 97-054-8209     Type:  driving_license_number


[INFO|trainer.py:2389] 2023-02-07 14:18:47,663 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:47,664 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:47,665 >>   Batch size = 64


Text8 Zachary Flynn   Type:  Name
Text8 North Carolina  Type:  state
Text8 221653380787    Type:  driving_license_number


[INFO|trainer.py:2389] 2023-02-07 14:18:47,775 >> ***** Running Prediction *****
[INFO|trainer.py:2391] 2023-02-07 14:18:47,776 >>   Num examples = 2
[INFO|trainer.py:2394] 2023-02-07 14:18:47,777 >>   Batch size = 64


Text9 Brittney Davis  Type:  Name
Text9 New York        Type:  state
Text9 4929-3813-3266-4295 Type:  driving_license_number
Text10 Jill Diaz       Type:  Name
Text10 California      Type:  state
Text10 26              Type:  driving_license_number
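In the output above, the credit card number in Text9 and the age in Text10 are both labeled `driving_license_number`, even though neither sentence contains a license number. One way to quantify such errors is exact-match precision and recall over the predicted mentions. The sketch below assumes a hand-written gold list for these two sentences (an assumption made for illustration, not data from the notebook); `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, gold):
    """Compute exact-match precision and recall over (text, type) mention pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Predictions copied from the Text9 and Text10 output above
predicted = [
    ("Brittney Davis", "Name"), ("New York", "state"),
    ("4929-3813-3266-4295", "driving_license_number"),
    ("Jill Diaz", "Name"), ("California", "state"),
    ("26", "driving_license_number"),
]
# Hand-written expectation (an assumption): neither sentence has a license number
gold = [
    ("Brittney Davis", "Name"), ("New York", "state"),
    ("Jill Diaz", "Name"), ("California", "state"),
]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A likely contributing factor is that every training sentence for this model contains a license number, so including sentences without one in the training data may help.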


<a id="addPII"></a>

### 7. Additional PII Entities extraction

In [2]:
!pip install faker

Collecting faker
  Downloading Faker-17.0.0-py3-none-any.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 56.1 MB/s eta 0:00:00
Installing collected packages: faker
Successfully installed faker-17.0.0


<a id="PIIData"></a>
### 7.1 Preparing PII Training Data


In [7]:
# Generate the dataset using Faker
import random
import string

from faker import Faker

fake = Faker(locale='en_US')

def format_data():
    # Generate a random degree level
    degree_level = fake.random_element(elements=('Bachelor\'s', 'Master\'s', 'Doctorate'))

    # Generate a random field of study
    field_of_study = fake.random_element(elements=('Computer Science', 'Engineering', 'Business', 'Psychology','Medical'))


    # Generate a random prefix with 1-2 alphabets
    prefix = ''.join(random.choices(string.ascii_uppercase, k=random.randint(1, 2)))
    # Generate a random employee ID with the prefix and a random integer
    employee_id = f"{prefix}{fake.random_int(min=10000, max=99999):05d}"

    # Generate salary using faker
    salary = str(fake.pyfloat(left_digits=5, right_digits=2, positive=True, min_value=1000, max_value=5000))
    
    
    
    text_1 = "I studied %s in %s, My employee id is %s and salary is %s" %(degree_level,field_of_study,employee_id,salary)
    text_2 = " Hello, My employee id is %s and I done my %s in %s, I am earning %s per month" %(employee_id,degree_level, field_of_study,salary)
    text_3 = "My monthly Earning is %s and employee code is %s, I studied %s in %s" %(salary,employee_id,degree_level,field_of_study)
    text = random.choice([text_1, text_2,text_3])
    
    
    field_of_study_begin = text.find(field_of_study)
    field_of_study_end = field_of_study_begin + len(field_of_study)

    degree_level_begin = text.find(degree_level)
    degree_level_end = degree_level_begin + len(degree_level)
  
    employee_id_begin = text.find(employee_id)
    employee_id_end = employee_id_begin + len(employee_id)

    salary_begin = text.find(salary)
    salary_end = salary_begin + len(salary)
    
    
    data = {
                "text": text,
                "mentions": [
                    {
                        "location": {
                            "begin": field_of_study_begin,
                            "end": field_of_study_end
                        },
                        "text": field_of_study,
                        "type": "field_of_study"
                    },
                    {
                        "location": {
                            "begin": degree_level_begin,
                            "end": degree_level_end
                        },
                        "text": degree_level,
                        "type": "degree_level"
                    },
                    {
                        "location": {
                            "begin": employee_id_begin,
                            "end": employee_id_end
                        },
                        "text": employee_id,
                        "type": "employee_id"
                    },
                    {
                        "location": {
                            "begin": salary_begin,
                            "end": salary_end
                        },
                        "text": salary,
                        "type": "salary"
                    }
                ]   
            }
    
    return data

In [8]:
#Sample dataset
format_data()

{'text': ' Hello, My employee id is I45456 and I done my Doctorate in Business, I am earning 2950.66 per month',
 'mentions': [{'location': {'begin': 60, 'end': 68},
   'text': 'Business',
   'type': 'field_of_study'},
  {'location': {'begin': 47, 'end': 56},
   'text': 'Doctorate',
   'type': 'degree_level'},
  {'location': {'begin': 26, 'end': 32},
   'text': 'I45456',
   'type': 'employee_id'},
  {'location': {'begin': 83, 'end': 90}, 'text': '2950.66', 'type': 'salary'}]}
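
A quick sanity check on a record like the one above is to confirm that each mention's offsets actually slice its own text out of the sentence. The sketch below is illustrative only and is not part of the Watson NLP API; the helper name `find_misaligned_mentions` is hypothetical, and the record is copied from the sample output.

```python
# Illustrative check (hypothetical helper, not a Watson NLP function):
# every mention's [begin, end) offsets should reproduce its own text.
sample = {
    "text": " Hello, My employee id is I45456 and I done my Doctorate in Business, I am earning 2950.66 per month",
    "mentions": [
        {"location": {"begin": 60, "end": 68}, "text": "Business", "type": "field_of_study"},
        {"location": {"begin": 47, "end": 56}, "text": "Doctorate", "type": "degree_level"},
        {"location": {"begin": 26, "end": 32}, "text": "I45456", "type": "employee_id"},
        {"location": {"begin": 83, "end": 90}, "text": "2950.66", "type": "salary"},
    ],
}


def find_misaligned_mentions(record):
    """Return the mentions whose offsets do not reproduce their text."""
    misaligned = []
    for mention in record["mentions"]:
        begin = mention["location"]["begin"]
        end = mention["location"]["end"]
        # str.find returns -1 for a missing value, so a negative begin is an error too.
        if begin < 0 or record["text"][begin:end] != mention["text"]:
            misaligned.append(mention)
    return misaligned


print(find_misaligned_mentions(sample))
```

Running the same check over every generated record before training would catch sentences where a value was not found or was located at the wrong position.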

In [9]:
# Prepare and store the training dataset
train_list_faker = []
for i in range(0, 30000):
    train_list_faker.append(format_data())

with open('faker_PII_text_train.json', 'w') as f:
    json.dump(train_list_faker, f)
project.save_data('faker_PII_text_train.json', data=json.dumps(train_list_faker), overwrite=True)

{'file_name': 'faker_PII_text_train.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': 'e952dfbd-f642-4712-b7a5-deae8425af2a'}

In [10]:
# Prepare and store the test dataset
test_list_faker = []
for i in range(0, 1000):
    test_list_faker.append(format_data())

with open('faker_PII_text_test.json', 'w') as f:
    json.dump(test_list_faker, f)
project.save_data('faker_PII_text_test.json', data=json.dumps(test_list_faker), overwrite=True)

{'file_name': 'faker_PII_text_test.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '0059c8e9-2566-4288-a1c2-092dc29d418e'}

The data is already in the required format. The JSON files saved to the Watson Studio project assets also need to be present in the runtime working directory, where they are read as input for training the models.

In [11]:
train_data = dm.DataStream.from_json_array("faker_PII_text_train.json")
train_iob_stream = prepare_train_from_json(train_data, syntax_model)
dev_data = dm.DataStream.from_json_array("faker_PII_text_test.json")
dev_iob_stream = prepare_train_from_json(dev_data, syntax_model)

The text inputs will be converted into a streaming array where the text is broken down by the syntax model.
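
To make the idea of an IOB stream concrete, here is a minimal sketch of how character-span mentions can be turned into per-token IOB labels. It is an assumption-laden illustration: it uses a plain whitespace tokenizer rather than the syntax model, the helpers `make_mention` and `to_iob` are hypothetical, and it is not the implementation behind `prepare_train_from_json`.

```python
import re


def make_mention(text, value, entity_type):
    """Build a mention record in the same shape as the training JSON."""
    begin = text.find(value)
    return {
        "location": {"begin": begin, "end": begin + len(value)},
        "text": value,
        "type": entity_type,
    }


def to_iob(text, mentions):
    """Label whitespace-delimited tokens with IOB tags from character spans."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        label = "O"
        for mention in mentions:
            begin = mention["location"]["begin"]
            end = mention["location"]["end"]
            # A token belongs to a mention when the two spans overlap.
            if match.start() < end and match.end() > begin:
                prefix = "B" if match.start() <= begin else "I"
                label = f"{prefix}-{mention['type']}"
                break
        tagged.append((match.group(), label))
    return tagged


# "Computer Science" is an invented two-word value used to show the I- tag.
example_text = "I studied Doctorate in Computer Science"
example_mentions = [
    make_mention(example_text, "Doctorate", "degree_level"),
    make_mention(example_text, "Computer Science", "field_of_study"),
]
for token, label in to_iob(example_text, example_mentions):
    print(token, label)
```

A real tokenizer also separates punctuation from words, which this whitespace split does not, so token boundaries in the actual stream may differ.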

<a id="buildModel"></a>
## 8. Model Building on Custom PII Entities

Entity extraction uses the entity-mentions block, which encapsulates algorithms for extracting mentions of entities (persons, organizations, dates, locations, ...) from the input text. The blocks and workflows offer implementations of strong entity extraction algorithms from each of four families: rule-based, classic ML, deep learning, and transformers.

<a id="sire"></a>
### 8.1 SIRE Training

You can train SIRE models using either the CRF or the Maximum Entropy template as the base model. Of the two, the CRF-based template takes longer to train but gives better results.

These algorithms accept a set of features in the form of dictionaries and regular expressions. A set of predefined feature extractors is provided for multiple languages, and you can also define your own features.
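
As a rough illustration of what dictionary and regular-expression features look like, the sketch below computes a few boolean features for a single token. Everything in it is hypothetical: the dictionary entries, the patterns, and the `token_features` helper are invented for this example and do not reflect the format of the SIRE feature extractor resources.

```python
import re

# Hypothetical resources, for illustration only.
DEGREE_DICTIONARY = {"doctorate", "master's", "bachelor's"}
PATTERNS = {
    # One or two capital letters followed by five digits, e.g. Q50443.
    "looks_like_employee_id": re.compile(r"[A-Z]{1,2}\d{5}"),
    # Digits with exactly two decimals, e.g. 3608.13.
    "looks_like_amount": re.compile(r"\d+\.\d{2}"),
}


def token_features(token):
    """Return dictionary-lookup and regex-match features for one token."""
    features = {"in_degree_dictionary": token.lower() in DEGREE_DICTIONARY}
    for name, pattern in PATTERNS.items():
        features[name] = pattern.fullmatch(token) is not None
    return features


for token in ["Q50443", "3608.13", "Doctorate", "studied"]:
    print(token, token_features(token))
```

A statistical model such as a CRF then learns how much weight each of these signals deserves for each entity type, instead of treating any single match as a final decision.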

In [3]:
#help(watson_nlp.workflows.entity_mentions.SIRE)

In [31]:
# Download the algorithm template
mentions_train_template = watson_nlp.load(watson_nlp.download('file_path_entity-mentions_sire_multi_template-crf'))
# Download the feature extractor
default_feature_extractor = watson_nlp.load(watson_nlp.download('feature-extractor_rbr_entity-mentions_sire_en_stock'))

In [32]:
# Train the model
sire_custom = watson_nlp.workflows.entity_mentions.SIRE.train(syntax_model=syntax_model,
                                                              labeled_entity_mentions='/home/wsuser/work/', 
                                                              #labeled_entity_mentions=train_data,
                                                              model_language='en', 
                                                              template_resource=mentions_train_template, 
                                                              feature_extractors=[default_feature_extractor], 
                                                              l1=0.1, 
                                                              l2=0.005, 
                                                              num_epochs=50, 
                                                              num_workers=5)

Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClassifier initialized.
[MEVitClassifier2::initModel] model initialized.
Get Feature str 818099
Done get feature str 818099
done. [51g573m340k,8g985m520k]
gramSize = 2
number of processes: 5
Initial processing:  (# of words: 265660, # of sentences: 20000)
senIndex[1] = 7222, wordIndex = 53136
senIndex[2] = 11699, wordIndex = 106285
senIndex[3] = 14474, wordIndex = 159406
senIndex[4] = 17249, wordIndex = 212535
senIndex[5] = 19999, wordIndex = 265660
[ME_CRF::scaleModel] Updater -- l1=0.1, l2=0.005, history size=5, progress windows size 20
 Iteration           Obj             WErr                         Timing       %Eff        Per thread timing
               543176.67      6.63/ 63.18             E:1.08 s, M:0.08 s.       1.00 [m:1.04, M:1.07, av:1.06]
         0   240271.27     18.58/ 73.16             

The following code will save the custom model to Watson Studio by using the project library.

In [33]:
# Save the model
project.save_data('PII_sire_custom', data=sire_custom.as_file_like_object(), overwrite=True)

Saved 9722 features.


{'file_name': 'PII_sire_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '0941d329-e971-45c2-b766-082fe06434a4'}

Let's run the model on one example input from the dev dataset.

In [17]:
text = pd.read_json('faker_PII_text_test.json')['text'][1]
text

'My monthly Earning is 3608.13 and employee code is Q50443, I studied Doctorate in Business'

In [47]:
# Run the model
sire_result = sire_custom.run(text)
sire_result

{
  "mentions": [
    {
      "span": {
        "begin": 26,
        "end": 33,
        "text": "MN34275"
      },
      "type": "employee_id",
      "producer_id": null,
      "confidence": 0.999635088942728,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 48,
        "end": 56,
        "text": "Master's"
      },
      "type": "degree_level",
      "producer_id": null,
      "confidence": 0.9996819393489853,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 60,
        "end": 67,
        "text": "Medical"
      },
      "type": "field_of_study",
      "producer_id": null,
      "confidence": 0.9999710541077216,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 82,
        "end": 89,
        "text": "3362.18"
      },
     

<a id="bilstm"></a>
### 8.2 BiLSTM Training

The deep-learning algorithm used in this block performs sequence labelling based on the BiLSTM architecture followed by a CRF layer. It uses GloVe embeddings as features.
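
The role of the CRF layer can be sketched with a small Viterbi decoder: instead of picking the best tag for each token independently, it picks the best tag sequence given per-token scores and tag-to-tag transition scores. This is a minimal pure-Python illustration with made-up scores, not the code used inside the Watson NLP block; `viterbi_decode` and all numbers are assumptions.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][k]    : score of tag k at token t
    transitions[j][k]  : score of moving from tag j to tag k
    """
    n_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for cur in range(n_tags):
            # Best previous tag for reaching `cur` at this token.
            best_prev = max(range(n_tags), key=lambda prev: scores[prev] + transitions[prev][cur])
            new_scores.append(scores[best_prev] + transitions[best_prev][cur] + emission[cur])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Walk back from the best final tag.
    path = [max(range(n_tags), key=lambda tag: scores[tag])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B", "I"]
# Made-up scores for three tokens; the second token slightly prefers "I".
emissions = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]]
# The only constraint encoded here: "I" may not directly follow "O".
transitions = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

print([TAGS[i] for i in viterbi_decode(emissions, transitions)])
```

Choosing the top-scoring tag per token here would give `O, I, I`, which is not a valid IOB sequence; the transition penalty steers the decoder to `O, B, I` instead.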

In [4]:
#help(watson_nlp.blocks.entity_mentions.BiLSTM)

In [12]:
# Download the GloVe model to be used as embeddings in the BiLSTM
glove_model = watson_nlp.load(watson_nlp.download('embedding_glove_en_stock'))

In [13]:
# Train BILSTM Model for Educational details entity
bilstm_custom = watson_nlp.blocks.entity_mentions.BiLSTM.train(train_iob_stream,
                                                              dev_iob_stream,
                                                              glove_model.embedding,
                                                              num_train_epochs=5)



To save the trained block model as a workflow that can later be run directly on raw text, wrap it together with the syntax model:

In [14]:
# Wrap the trained block and the syntax model in a workflow that accepts raw text
from watson_nlp.workflows.entity_mentions.bilstm import BiLSTM 

mentions_workflow = BiLSTM(syntax_model, bilstm_custom)


The following code will save the custom model to Watson Studio by using the project library.

In [15]:
# Save the model
project.save_data('PII_workflow_bilstm_custom', data=mentions_workflow.as_file_like_object(), overwrite=True)

{'file_name': 'PII_workflow_bilstm_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '615a3f7d-75a6-46ce-8e59-e78e7c4e26c2'}

Let's run the model on one example input.

In [19]:
# Run the BILSTM workflow model
#syntax_result = syntax_model.run(text)
bilstm_result = mentions_workflow.run(text)

bilstm_result

{
  "mentions": [
    {
      "span": {
        "begin": 22,
        "end": 29,
        "text": "3608.13"
      },
      "type": "salary",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 1.0,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 51,
        "end": 57,
        "text": "Q50443"
      },
      "type": "employee_id",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 1.0,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 69,
        "end": 78,
        "text": "Doctorate"
      },
      "type": "degree_level",
      "producer_id": {
        "name": "BiLSTM Entity Mentions",
        "version": "1.0.0"
      },
      "confidence": 1.0,
      "mention_type": "MENTT_UN

Now you are able to run the trained models on new data. You will run the models on the test data so that the results can also be used for model evaluation.
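
One possible downstream use of the extracted mentions is masking the PII in the original text. The sketch below assumes the result has been converted to a plain dictionary shaped like the printed output above; the `mask_mentions` helper is hypothetical, and only the three mentions visible in that output are included.

```python
def mask_mentions(text, mentions):
    """Replace each mention span with a <type> placeholder."""
    # Work from the rightmost span so earlier offsets stay valid.
    for mention in sorted(mentions, key=lambda m: m["span"]["begin"], reverse=True):
        span = mention["span"]
        text = text[:span["begin"]] + f"<{mention['type']}>" + text[span["end"]:]
    return text


text = "My monthly Earning is 3608.13 and employee code is Q50443, I studied Doctorate in Business"
mentions = [
    {"span": {"begin": 22, "end": 29, "text": "3608.13"}, "type": "salary"},
    {"span": {"begin": 51, "end": 57, "text": "Q50443"}, "type": "employee_id"},
    {"span": {"begin": 69, "end": 78, "text": "Doctorate"}, "type": "degree_level"},
]

print(mask_mentions(text, mentions))
```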

Watson NLP includes methods for quality testing supported models. Given a model and test data, a quality report can be generated. The following example includes the steps required to generate a quality report for a BiLSTM entity mention extractor model. The same example can be applied to any entity mention extractor model.

In [49]:
# Execute the model and generate the quality report
preprocess_func = lambda raw_doc: syntax_model.run(raw_doc)
quality_report = bilstm_custom.evaluate_quality('faker_PII_text_test.json', 
                                               preprocess_func)

# Print the quality report
print(json.dumps(quality_report, indent=4))



{
    "per_class_confusion_matrix": {
        "field_of_study": {
            "true_positive": 1000,
            "false_positive": 0,
            "false_negative": 0,
            "precision": 1.0,
            "recall": 1.0,
            "f1": 1.0
        },
        "employee_id": {
            "true_positive": 1000,
            "false_positive": 0,
            "false_negative": 0,
            "precision": 1.0,
            "recall": 1.0,
            "f1": 1.0
        },
        "salary": {
            "true_positive": 1000,
            "false_positive": 0,
            "false_negative": 0,
            "precision": 1.0,
            "recall": 1.0,
            "f1": 1.0
        },
        "degree_level": {
            "true_positive": 1000,
            "false_positive": 0,
            "false_negative": 0,
            "precision": 1.0,
            "recall": 1.0,
            "f1": 1.0
        }
    },
    "macro_true_positive": null,
    "macro_false_positive": null,
    "macro_false_negative"
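
The per-class precision, recall, and F1 values in a report like this follow directly from the confusion counts. A minimal sketch of the standard formulas, using a hypothetical helper `precision_recall_f1` rather than any Watson NLP function:

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Compute precision, recall, and F1 from confusion counts."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts reported for each class in the quality report above.
print(precision_recall_f1(1000, 0, 0))
# A made-up imperfect case for comparison.
print(precision_recall_f1(8, 2, 2))
```

Note that the training and test sets here are both produced by the same `format_data()` generator, so perfect scores on this test set say little about how the model behaves on differently worded text.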

<a id="summary"></a>
## 9. Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library and how quickly and easily you can train and run different PII extraction models using Watson NLP.</span>

Please note that this content is made available to foster Embedded AI technology adoption. The content may include systems and methods pending patent with the USPTO and protected under US patent laws. For redistribution of this content, IBM will use its release process. For any questions, please log an issue on [GitHub](https://github.com/ibm-build-labs/Watson-NLP).

Developed by IBM Build Lab 

Copyright - 2022 IBM Corporation 