# Extract Personally Identifiable Information (PII) Using Watson NLP

<h2>Use Case</h2>

This notebook demonstrates how to extract PII entities using Watson NLP pre-trained models and how to prepare custom-trained models. PII extraction is the process of identifying and extracting personal information from a document or dataset. This information can include names, addresses, phone numbers, email addresses, Social Security numbers, credit card numbers, and other data that can be used to identify an individual.


<h2>What you'll learn in this notebook</h2>

Watson NLP offers custom RBR models for various NLP tasks, which let you define your own extraction rules. This notebook shows:

* <b>RBR</b>: A Rule-Based Reasoner (RBR) in NLP uses a set of predefined rules to process natural language input. The rules identify specific patterns or structures in the input text and determine the meaning of the text based on those patterns.
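
To make the idea of rule-based extraction concrete, here is a minimal sketch in plain Python using the standard `re` module. It is not how Watson NLP implements RBR; the rule names and patterns (`EmailAddress`, `USSocialSecurityNumber`) are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rule set: label -> compiled pattern (illustrative only,
# not the rules shipped with Watson NLP).
RULES = {
    "EmailAddress": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "USSocialSecurityNumber": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def apply_rules(text):
    """Run every rule over the text and return (begin, end, label, match) tuples."""
    hits = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group()))
    # Sort by position so results read in document order.
    return sorted(hits)


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
for begin, end, label, covered in apply_rules(sample):
    print(label, (begin, end), covered)
```

Each rule is just a pattern plus a label, so adding a new PII type means adding one entry to the rule table.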

## Table of Contents


1. [Before you start](#beforeYouStart)
1. [Custom Training for RBR Model](#NLPModels)    
1. [Summary](#summary)

<a id="beforeYouStart"></a>
### 1. Before you start


<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _Default Python 3.x_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they are running in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook, and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see the cell above, add the token to the notebook by clicking **More > Insert project token** from the notebook action bar.  By running the inserted hidden code cell, a project object is created that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

You can step through the notebook cell by cell by pressing Shift+Enter, or execute the entire notebook by selecting **Cell -> Run All** from the menu.

In [2]:
import json
import pandas as pd
import watson_nlp
from watson_nlp import data_model as dm
from watson_nlp.toolkit.entity_mentions_utils import prepare_train_from_json

In [3]:
# Silence Tensorflow warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

<a id="NLPModels"></a>
### 2.  Custom Training for RBR Model

In [266]:
# Load the pre-trained RBR PII model in Watson NLP
rbr_model = watson_nlp.load(watson_nlp.download('entity-mentions_rbr_multi_pii'))

Download the custom rule ZIP file for driving licence numbers.

In [42]:
# NLP_Canvas_Export.zip contains the custom rules for driving licence numbers, prepared in the Elyra NLP Editor
import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
cos_client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_CLOUD_API_KEY>',  # credential redacted; supply your own key
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.private.us.cloud-object-storage.appdomain.cloud')

bucket = 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1'
object_key = 'NLP_Canvas_Export.zip'

cos_client.download_file(Bucket=bucket, Key=object_key,Filename=object_key)
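
Because the cell above needs an API key, one way to avoid pasting the key into the notebook is to read it from an environment variable. The sketch below uses only the standard library; the variable name `IBM_API_KEY` and the helper `get_api_key` are hypothetical and not part of `ibm_boto3`.

```python
import os


def get_api_key(env_var="IBM_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch; use whatever
    name your environment or secret manager provides.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export it before running this notebook."
        )
    return key


# The returned value could then be passed to the client call shown above,
# for example: ibm_api_key_id=get_api_key()
```

With this in place the notebook can be shared without editing out the credential first.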

In [43]:
ls

NLP_Canvas_Export.zip  NLP_RBR_Module_2/


Load the custom rule model from the downloaded ZIP file.

In [44]:
model = watson_nlp.toolkit.rule_utils.RBRExecutor.load("NLP_Canvas_Export.zip")

In [45]:
text1="Hello, My self Tracy Arias, I am living in Alaska and my driving License number is 9839434"
text2="Hello, My self Shane Escobar, I am living in New York and my driving License number is 052 289 084"
text3="Hello, My self Laura Parrish, I am living in Colarado and my driving License number is 25-157-3852"
text4="My name is Curtis Mccullough I belong to the Alabama , My Driving License number is 1470583?"
text5="I am Randall Barton. H45768237 this is my driving license number. I am from Hawaii state."
text6="Hello, My self Michael Peterson, I am living in Colarado and my driving License number is 87-361-4145"
text7="Hello, My self Ms. Jennifer Hart, I am living in North Carolina and my driving License number is 844144533108"
text8="Hello, My self Derek Martin, I am living in California and my driving License number is A06798902"
text9="I am Lauren Martinez. 493 671 140 this is my driving license number. I am from New York state."
text10="My name is Mark Thomas I belong to the California , My age is 68 years"

all_test=[text1,text2,text3,text4,text5,text6,text7,text8,text9,text10]

In [46]:
for test in all_test:
    rbr_result_dl = model.run(test, language='en')
    print(rbr_result_dl.views[0])


{
  "name": "Driving_Licence_Number",
  "properties": [
    {
      "aql_property": {
        "Driving_Licence_Number": {
          "begin": 83,
          "end": 90,
          "text": "9839434"
        }
      }
    }
  ]
}
{
  "name": "Driving_Licence_Number",
  "properties": [
    {
      "aql_property": {
        "Driving_Licence_Number": {
          "begin": 87,
          "end": 98,
          "text": "052 289 084"
        }
      }
    }
  ]
}
{
  "name": "Driving_Licence_Number",
  "properties": [
    {
      "aql_property": {
        "Driving_Licence_Number": {
          "begin": 87,
          "end": 98,
          "text": "25-157-3852"
        }
      }
    }
  ]
}
{
  "name": "Driving_Licence_Number",
  "properties": [
    {
      "aql_property": {
        "Driving_Licence_Number": {
          "begin": 84,
          "end": 91,
          "text": "1470583"
        }
      }
    }
  ]
}
{
  "name": "Driving_Licence_Number",
  "properties": [
    {
      "aql_property": {
        "D

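The printed views above all share the same nested shape, so the spans can be pulled out with a small helper. This is a sketch that assumes the view has already been converted to a plain Python dict (for instance by parsing its printed JSON form); whether the view object offers a direct conversion method is not shown in this notebook. The helper name `spans_from_view` is hypothetical.

```python
def spans_from_view(view):
    """Flatten a view dict into (field, begin, end, text) tuples."""
    spans = []
    for prop in view.get("properties", []):
        for field, span in prop.get("aql_property", {}).items():
            spans.append((field, span["begin"], span["end"], span["text"]))
    return spans


text1 = ("Hello, My self Tracy Arias, I am living in Alaska "
         "and my driving License number is 9839434")

# First view from the output above, written as a plain dict.
view = {
    "name": "Driving_Licence_Number",
    "properties": [
        {
            "aql_property": {
                "Driving_Licence_Number": {"begin": 83, "end": 90, "text": "9839434"}
            }
        }
    ],
}

for field, begin, end, covered in spans_from_view(view):
    # The offsets index directly into the input text.
    assert text1[begin:end] == covered
    print(field, (begin, end), covered)
```

Checking `text1[begin:end]` against the reported text is a quick sanity test that the offsets refer to the string you think they do.
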
### RegexConfig RBR model test

In [5]:
import os
module_folder = "NLP_RBR_Module_2" 
os.makedirs(module_folder, exist_ok=True)

In [47]:
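# Train the custom RBR rule model from a regular expression.
# The pattern is written as a raw string so that \b reaches the regex engine
# as a word boundary instead of being read by Python as a backspace character.
regexes = watson_nlp.toolkit.rule_utils.RegexConfig.load_all([
    {
        'name': 'Driving_Lincense_Number',
        'regexes': [r'\b[a-zA-Z]{1}[0-9]{8}(\b)|($)[0-9]{9}($|\b)|[0-9]{7}($|\b)|[0-9]{12}($|\b)|[0-9]{2}(-| |)[0-9]{3}(-| |)[0-9]{4}|[0-9]{3}(-| |)[0-9]{3}(-| |)[0-9]{3}'],
        'groups': ['Driving _License_Number']
    }])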

custom_regex_block = watson_nlp.resources.feature_extractor.RBR.train(module_path=module_folder, language='en', regexes=regexes)


In [48]:
rbr_result = custom_regex_block.run(text1)
rbr_result

{(83, 90): ['regex::Driving_Lincense_Number']}
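
Before training a regex-based block, it can help to try the pattern against a few sample sentences with Python's own `re` module. The pattern below is a simplified stand-in for the one used in this notebook, not the exact rule, and Watson NLP's regex engine may differ from `re` in some details; `DL_PATTERN` and `find_licence_spans` are hypothetical names.

```python
import re

# Simplified stand-in for the notebook's pattern (an assumption, not the exact
# rule). Alternatives are tried left to right inside one pair of word boundaries.
DL_PATTERN = re.compile(
    r"\b(?:"
    r"[A-Za-z][0-9]{8}"                      # one letter followed by 8 digits
    r"|[0-9]{12}"                            # 12 digits
    r"|[0-9]{3}[- ]?[0-9]{3}[- ]?[0-9]{3}"   # 3-3-3 digits, optional separators
    r"|[0-9]{2}[- ]?[0-9]{3}[- ]?[0-9]{4}"   # 2-3-4 digits, optional separators
    r"|[0-9]{7}"                             # 7 digits
    r")\b"
)


def find_licence_spans(text):
    """Return {(begin, end): matched_text} for every match in the text."""
    return {m.span(): m.group() for m in DL_PATTERN.finditer(text)}


samples = [
    "Hello, My self Tracy Arias, I am living in Alaska and my driving License number is 9839434",
    "I am Randall Barton. H45768237 this is my driving license number. I am from Hawaii state.",
    "My name is Mark Thomas I belong to the California , My age is 68 years",
]
for sample in samples:
    print(find_licence_spans(sample))
```

The last sample contains only an age, so it is a useful negative case: a pattern that matches it is too loose.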

<a id="summary"></a>
### 3. Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library to:
1. Prepare custom-trained RBR models.
</span>

Please note that this content is made available by IBM Build Lab to foster Embedded AI technology adoption. The content may include systems and methods pending patent with the USPTO and protected under US patent laws. For redistribution of this content, IBM will use its release process. For any questions, please log an issue in the GitHub repository.

Developed by IBM Build Lab

Copyright - 2023 IBM Corporation