In [1]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='ae1a755d-e162-4f07-9f5a-130d2280e78e', project_access_token='p-aa90b9b21de435c3f4c94494a24b5c5e69d030f8')
pc = project.project_context


# Extract Personally Identifiable Information (PII) using Watson NLP

<h2>Use Case</h2>

This notebook demonstrates how to extract PII entities using custom-trained or fine-tuned Watson NLP models. PII extraction is the process of identifying and extracting personal information from a document or dataset. This information can include names, addresses, phone numbers, email addresses, Social Security numbers, credit card numbers, and other details that can be used to identify an individual.


<h2>What you'll learn in this notebook</h2>

Watson NLP offers fine-tuning functionality for custom training. This notebook covers two model types:

* <b>BiLSTM</b>: The BiLSTM network takes the preprocessed text as input and learns patterns and relationships between words that are indicative of PII. It then outputs a probability score for each word, indicating the likelihood that the word is part of a PII entity. The network can also be trained to recognize specific entity types such as names, addresses, phone numbers, and email addresses.


* <b>SIRE</b>: Statistical Information and Relation Extraction (SIRE) is a technique used in natural language processing (NLP) to extract specific information and relationships from text. It uses machine learning algorithms to identify and extract structured data such as entities, attributes, and relations from unstructured text. SIRE is used in a variety of applications, including information extraction, knowledge graph construction, and question answering. It typically follows a supervised learning approach: a model is trained on annotated examples of text and the corresponding structured data, and can then extract the same kinds of information from new, unseen text.
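
Both model types ultimately assign a label to each token, and entity mentions are recovered by grouping those labels. The following is a minimal sketch of that grouping step, assuming the common IOB (inside-outside-beginning) tagging scheme. The function name `iob_to_spans` and the sample tags are illustrative and are not part of the Watson NLP API.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity still in progress
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity currently being built
            current_tokens.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close the open entity, if any
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical token-level output for a short sentence
tokens = ["My", "name", "is", "Robert", "Aragon", "."]
tags = ["O", "O", "O", "B-Name", "I-Name", "O"]
print(iob_to_spans(tokens, tags))
```

In the sample, the `B-Name` and `I-Name` tokens are merged into a single `Name` mention, which is the shape of result the models in this notebook return.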

## Table of Contents


- [1. Before you start](#beforeYouStart)
- [2. Load Entity PII Models](#LoadModel)
- [3. Load PII XLSX Dataset from Data Assets](#Loaddata)
- [4. Preparing Training Data](#TrainingData)
- [5. Watson NLP Models](#NLPModels)    
    * [5.1 BiLSTM Fine-tuned](#BILSTMFINE)
    * [5.2 SIRE Fine-tuned](#SIRETune)
    * [5.3 Transformer Fine-tuned](#TransTUne)
  
- [6. Summary](#summary)

<a id="beforeYouStart"></a>
### 1. Before you start


<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _Default Python 3.x_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they are running in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook, and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see the cell above, add the token to the notebook by clicking **More > Insert project token** from the notebook action bar.  By running the inserted hidden code cell, a project object is created that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

Note that you can step through the notebook cell by cell by pressing Shift-Enter, or execute the entire notebook by selecting **Cell -> Run All** from the menu.

In [2]:
import json
import pandas as pd
import watson_nlp
from watson_nlp import data_model as dm
from watson_nlp.toolkit.entity_mentions_utils import prepare_train_from_json

In [3]:
# Silence Tensorflow warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

<a id="LoadModel"></a>
### 2. Load Entity PII Models

In [4]:
# Load a syntax model to split the text into sentences and tokens
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
# Load the pre-trained BiLSTM PII entity-mentions model that will be fine-tuned
bilstm_model = watson_nlp.load(watson_nlp.download('entity-mentions_bilstm_en_pii'))
# Load the GloVe model to be used as embeddings in the BiLSTM
glove_model = watson_nlp.load(watson_nlp.download('embedding_glove_en_stock'))
# Load the SIRE CRF training template
mentions_train_template = watson_nlp.load(watson_nlp.download('file_path_entity-mentions_sire_multi_template-crf'))
# Load the feature extractor used for SIRE training
default_feature_extractor = watson_nlp.load(watson_nlp.download('feature-extractor_rbr_entity-mentions_sire_en_stock'))

<a id="Loaddata"></a>
### 3. Load PII XLSX Dataset from Data Assets

In [5]:
from botocore.client import Config
import ibm_boto3

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
cos_client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='o0avUc3SDky2d6pNzjuewCSTPPX7tQNz6BKKvL37nBL3',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.private.us.cloud-object-storage.appdomain.cloud')

bucket = 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1'
object_key = '10-MB-Test.xlsx'

body = cos_client.get_object(Bucket=bucket,Key=object_key)['Body']

df = pd.read_excel(body.read())
df = df.dropna()
df.head()

Unnamed: 0,First and Last Name,SSN,Credit Card Number,First and Last Name.1,SSN.1,Credit Card Number.1,First and Last Name.2,SSN.2,Credit Card Number.2,First and Last Name.3,...,Credit Card Number.3,First and Last Name.4,SSN.4,Credit Card Number.4,First and Last Name.5,SSN.5,Credit Card Number.5,First and Last Name.6,SSN.6,Credit Card Number.6
1,Robert Aragon,489-36-8350,4929-3813-3266-4295,Robert Aragon,489-36-8351,4929-3813-3266-4296,Robert Aragon,489-36-8352,4929-3813-3266-4297,Robert Aragon,...,4929-3813-3266-4298,Robert Aragon,489-36-8354,4929-3813-3266-4299,Robert Aragon,489-36-8355,4929-3813-3266-4300,Robert Aragon,489-36-8355,4929-3813-3266-4300
2,Ashley Borden,514-14-8905,5370-4638-8881-3020,Ashley Borden,514-14-8906,5370-4638-8881-3021,Ashley Borden,514-14-8907,5370-4638-8881-3022,Ashley Borden,...,5370-4638-8881-3023,Ashley Borden,514-14-8909,5370-4638-8881-3024,Ashley Borden,514-14-8910,5370-4638-8881-3025,Ashley Borden,514-14-8910,5370-4638-8881-3025
3,Thomas Conley,690-05-5315,4916-4811-5814-8111,Thomas Conley,690-05-5316,4916-4811-5814-8112,Thomas Conley,690-05-5317,4916-4811-5814-8113,Thomas Conley,...,4916-4811-5814-8114,Thomas Conley,690-05-5319,4916-4811-5814-8115,Thomas Conley,690-05-5320,4916-4811-5814-8116,Thomas Conley,690-05-5320,4916-4811-5814-8116
4,Susan Davis,421-37-1396,4916-4034-9269-8783,Susan Davis,421-37-1397,4916-4034-9269-8784,Susan Davis,421-37-1398,4916-4034-9269-8785,Susan Davis,...,4916-4034-9269-8786,Susan Davis,421-37-1400,4916-4034-9269-8787,Susan Davis,421-37-1401,4916-4034-9269-8788,Susan Davis,421-37-1401,4916-4034-9269-8788
5,Christopher Diaz,458-02-6124,5299-1561-5689-1938,Christopher Diaz,458-02-6125,5299-1561-5689-1939,Christopher Diaz,458-02-6126,5299-1561-5689-1940,Christopher Diaz,...,5299-1561-5689-1941,Christopher Diaz,458-02-6128,5299-1561-5689-1942,Christopher Diaz,458-02-6129,5299-1561-5689-1943,Christopher Diaz,458-02-6129,5299-1561-5689-1943


<a id="TrainingData"></a>
### 4. Preparing Training Data

Next, generate training sentences from the PII columns so that each name, SSN, and credit card number appears in a realistic sentence context.

In [6]:
import random

def format_data(df, name_col, ssn_col, ccn_col):
    """Build annotated training sentences from the PII columns of df."""
    train_list = []
    # Iterate over the column values directly instead of looking rows up by
    # label, so index gaps left by dropna() cannot raise a KeyError and the
    # first and last rows are not skipped.
    for name, ssn, ccn in zip(df[name_col], df[ssn_col], df[ccn_col]):
        name, ssn, ccn = str(name), str(ssn), str(ccn)

        text1 = "My name is %s, and my social security number is %s. Here's the number to my Visa credit card, %s" % (name, ssn, ccn)
        text2 = "%s is my social security number. The name on my American Express card %s is %s." % (ssn, ccn, name)
        # Pick one of the two templates at random for variety
        text = random.choice([text1, text2])

        mentions = []
        for value, mention_type in ((name, "Name"),
                                    (ssn, "SocialSecurityNumber"),
                                    (ccn, "CreditCardNumber")):
            # Character offsets of the value inside the generated sentence
            begin = text.find(value)
            mentions.append({
                "location": {"begin": begin, "end": begin + len(value)},
                "text": value,
                "type": mention_type
            })

        train_list.append({"text": text, "mentions": mentions})
    return train_list

In [7]:
train_list = format_data(df=df, name_col='First and Last Name', ssn_col='SSN', ccn_col='Credit Card Number')
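
Because the `begin` and `end` offsets are computed with `str.find`, it can be worth confirming that every mention's offsets slice back to its text before training. The following is a small sketch of such a check. The helper name `check_offsets` and the sample record are illustrative, and the record layout simply mirrors the dictionaries built by `format_data`.

```python
def check_offsets(records):
    """Return (record_index, mention_type) pairs whose offsets do not match their text."""
    problems = []
    for idx, record in enumerate(records):
        for mention in record["mentions"]:
            begin = mention["location"]["begin"]
            end = mention["location"]["end"]
            # str.find returns -1 when the value is missing from the sentence
            if begin < 0 or record["text"][begin:end] != mention["text"]:
                problems.append((idx, mention["type"]))
    return problems


# Illustrative record in the same layout as the generated training data
sample = [{
    "text": "My name is Robert Aragon, and my social security number is 489-36-8350.",
    "mentions": [
        {"location": {"begin": 11, "end": 24},
         "text": "Robert Aragon", "type": "Name"},
        {"location": {"begin": 59, "end": 70},
         "text": "489-36-8350", "type": "SocialSecurityNumber"},
    ],
}]

# An empty list means every offset is consistent with its mention text
print(check_offsets(sample))
```

Running `check_offsets(train_list)` on the generated data would flag any record where a value could not be located in its sentence.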

Save the sentences to a JSON training file and a JSON dev file. Each file is written to the local runtime file system and also saved to the project data assets.

In [8]:
with open('PII_text_train.json', 'w') as f:
    json.dump(train_list, f)
project.save_data('PII_text_train.json', data=json.dumps(train_list), overwrite=True)

{'file_name': 'PII_text_train.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '216b85be-aabe-4ff6-b264-acd101222fbc'}

In [9]:
dev_list = format_data(df=df, name_col='First and Last Name.1', ssn_col='SSN.1', ccn_col='Credit Card Number.1')

In [10]:
with open('PII_text_dev.json', 'w') as f:
    json.dump(dev_list, f)
project.save_data('PII_text_dev.json', data=json.dumps(dev_list), overwrite=True)

{'file_name': 'PII_text_dev.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '76834e31-ab93-4aca-b86b-ce6e71476478'}

In [11]:
# Build a sample sentence from the first data row to test the fine-tuned models
text = "My name is %s, and my social security number is %s. Here's the number to my Visa credit card, %s" % (df['First and Last Name'][1], df['SSN'][1], df['Credit Card Number'][1])

In [12]:
train_data = dm.DataStream.from_json_array("PII_text_train.json")
train_iob_stream = prepare_train_from_json(train_data, syntax_model)
dev_data = dm.DataStream.from_json_array("PII_text_dev.json")
dev_iob_stream = prepare_train_from_json(dev_data, syntax_model)
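
The variable names above suggest that `prepare_train_from_json` combines each annotated record with the syntax model's tokens to produce token-level IOB labels. As a rough conceptual illustration only, and not the library's actual implementation, a conversion from character offsets to IOB tags could look like the following. Whitespace tokenization stands in for the syntax model here, and the function name `spans_to_iob` is hypothetical.

```python
import re


def spans_to_iob(text, mentions):
    """Label whitespace-delimited tokens with IOB tags from character-offset mentions."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for mention in mentions:
            begin = mention["location"]["begin"]
            end = mention["location"]["end"]
            # A token belongs to a mention when their character ranges overlap
            if match.start() < end and match.end() > begin:
                # The token covering the mention start gets "B-", later ones "I-"
                prefix = "B-" if match.start() <= begin else "I-"
                tag = prefix + mention["type"]
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


sample_text = "My name is Robert Aragon"
sample_mentions = [
    {"location": {"begin": 11, "end": 24}, "text": "Robert Aragon", "type": "Name"}
]
sample_tokens, sample_tags = spans_to_iob(sample_text, sample_mentions)
print(list(zip(sample_tokens, sample_tags)))
```

A real tokenizer also separates punctuation from words, which is one reason the notebook passes `syntax_model` into the preparation step instead of splitting on whitespace.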

<a id="NLPModels"></a>
### 5. Watson NLP Models

<a id="BILSTMFINE"></a>

### 5.1 BiLSTM Fine-tuned

In [28]:
bilstm_custom = bilstm_model.train(train_iob_stream, 
                                   dev_iob_stream, 
                                   embedding=glove_model.embedding,
                                   #vocab_tags=None, 
                                   #char_embed_dim=32, 
                                   #dropout=0.2, 
                                   #num_oov_buckets=1, 
                                   num_train_epochs=5,
                                   num_conf_epochs=5, 
                                   checkpoint_interval=5, 
                                   learning_rate=0.005, 
                                   #shuffle_buffer=2000, 
                                   #char_lstm_size=64, 
                                   #char_bidir=False, 
                                   lstm_size=16, 
                                   #train_batch_size=32, 
                                   #lower_case=False, 
                                   #embedding_lowercase=True, 
                                   #keep_model_artifacts=False)
                                  )



In [29]:
project.save_data('bilstm_pii_custom', data=bilstm_custom.as_file_like_object(), overwrite=True)

{'file_name': 'bilstm_pii_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '7cb485ec-42c3-4a4d-aa10-37502be1266f'}

In [35]:
syntax_result = syntax_model.run(text)
bilstm_result = bilstm_custom.run(syntax_result)

for i in bilstm_result.mentions:
    print("Text: ", i.span.text.ljust(15, " "), "Type: ", i.type)

Text:  Robert Aragon   Type:  Name
Text:  489-36-8350     Type:  SocialSecurityNumber
Text:  4929-3813-3266-4295 Type:  CreditCardNumber
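
Printing one prediction is a useful smoke test, but a quantitative check on the dev set is more informative. One common approach, sketched here under the assumption of exact-match scoring on `(begin, end, type)` triples, computes precision, recall, and F1. The function name `mention_prf` and the sample triples are illustrative; this is not a Watson NLP evaluation API.

```python
def mention_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (begin, end, type) triples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(11, 24, "Name"),
        (59, 70, "SocialSecurityNumber"),
        (114, 133, "CreditCardNumber")]
# Made-up prediction that cuts the credit card span short, to show a miss
predicted = [(11, 24, "Name"),
             (59, 70, "SocialSecurityNumber"),
             (114, 128, "CreditCardNumber")]

precision, recall, f1 = mention_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact-match scoring is strict: a span that is off by even one character counts as both a false positive and a false negative, which suits PII use cases where partial detection still leaks data.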


<a id="SIRETune"></a>

### 5.2 SIRE Fine-tuned


In [None]:
#help(watson_nlp.blocks.entity_mentions.SIRE)

In [36]:
sire_custom = watson_nlp.blocks.entity_mentions.SIRE.train(train_iob_stream, 
                                                           'en', 
                                                           mentions_train_template,
                                                           feature_extractors=[default_feature_extractor])

Initializing viterbi classifier
[MEVitClassifier::initModel] MEVitClassifier initialized.
[MEVitClassifier2::initModel] model initialized.
Get Feature str 81754
Done get feature str 81754
done. [21g909m748k,11g713m408k]
gramSize = 2
number of processes: 5
Initial processing:  (# of words: 1080636, # of sentences: 68412)
senIndex[1] = 13691, wordIndex = 216132
senIndex[2] = 27367, wordIndex = 432254
senIndex[3] = 41041, wordIndex = 648383
senIndex[4] = 54729, wordIndex = 864517
senIndex[5] = 68411, wordIndex = 1080636
[ME_CRF::scaleModel] Updater -- l1=0.1, l2=0.005, history size=5, progress windows size 20
 Iteration           Obj             WErr                         Timing       %Eff        Per thread timing
              2079562.11     14.72/ 78.73             E:3.86 s, M:0.07 s.       1.00 [m:3.84, M:3.86, av:3.86]
         0  1183329.48     25.34/100.00          

In [37]:
project.save_data('sire_pii_custom', data=sire_custom.as_file_like_object(), overwrite=True)

Saved 17925 features.


{'file_name': 'sire_pii_custom',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '3b572121-d3e2-439d-8622-813f6536f335'}

In [38]:
syntax_result = syntax_model.run(text)
sire_result = sire_custom.run(syntax_result)
sire_result

{
  "mentions": [
    {
      "span": {
        "begin": 11,
        "end": 24,
        "text": "Robert Aragon"
      },
      "type": "Name",
      "producer_id": null,
      "confidence": 0.999336911506031,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 59,
        "end": 70,
        "text": "489-36-8350"
      },
      "type": "SocialSecurityNumber",
      "producer_id": null,
      "confidence": 0.9971765834281675,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    },
    {
      "span": {
        "begin": 114,
        "end": 133,
        "text": "4929-3813-3266-4295"
      },
      "type": "CreditCardNumber",
      "producer_id": null,
      "confidence": 0.9986597152704909,
      "mention_type": "MENTT_UNSET",
      "mention_class": "MENTC_UNSET",
      "role": ""
    }
  ],
  "producer_id": {
    "name": "SIRE Entity Mentions",
    "version": "0

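
Once mentions carry `begin` and `end` offsets like those in the SIRE output, a typical downstream step is masking the detected PII. The following is a minimal sketch, assuming the mentions have been converted to plain dictionaries with the same `span` and `type` fields. The function name `redact` and the `[Type]` placeholder format are illustrative choices.

```python
def redact(text, mentions):
    """Replace each mention span in text with a [Type] placeholder."""
    # Process spans from the end of the text so earlier offsets stay valid
    ordered = sorted(mentions, key=lambda m: m["span"]["begin"], reverse=True)
    for mention in ordered:
        begin, end = mention["span"]["begin"], mention["span"]["end"]
        text = text[:begin] + "[" + mention["type"] + "]" + text[end:]
    return text


sample_sentence = "My name is Robert Aragon, and my social security number is 489-36-8350."
sample_spans = [
    {"span": {"begin": 11, "end": 24}, "type": "Name"},
    {"span": {"begin": 59, "end": 70}, "type": "SocialSecurityNumber"},
]
redacted = redact(sample_sentence, sample_spans)
print(redacted)
```

Replacing from right to left avoids recomputing offsets after each substitution, since a placeholder rarely has the same length as the text it replaces.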
<a id="summary"></a>
### 6. Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library to:
1. Extract PII using custom-trained or fine-tuned models </span>

Please note that this content is made available by IBM Build Lab to foster Embedded AI technology adoption. The content may include systems and methods that are patent pending with the USPTO and protected under US patent laws. For redistribution of this content, IBM will use its release process. For any questions, please log an issue on GitHub.

Developed by IBM Build Lab

Copyright - 2022 IBM Corporation