# Extract Personally Identifiable Information (PII) using Watson NLP

<h2>Use Case</h2>

<b>PII extraction is the process of identifying and extracting personal information from a document or dataset</b>. This information can include **names, addresses, phone numbers, email addresses, Social Security numbers, credit card numbers, and other types of information** that can be used to identify an individual. This notebook demonstrates how to extract PII entities from Dutch text using Watson NLP pretrained models.


<h2>What you'll learn in this notebook</h2>

Watson NLP offers pretrained models for various NLP tasks and also supports fine-tuning for custom training. This notebook shows how to extract PII with three kinds of pretrained models:

* <b>RBR</b>: A Rule-Based Reasoner (RBR) uses a set of predefined rules to process natural language input. The rules identify specific patterns or structures in the text, such as the fixed format of an email address or a credit card number, and assign an entity type based on those patterns.

* <b>BiLSTM</b>: A bidirectional long short-term memory (BiLSTM) network takes the preprocessed text as input and learns patterns and relationships between words that indicate PII. For each word, the network outputs a probability score indicating how likely the word is to be part of a PII entity. The network can also be trained to recognize specific entity types such as names, addresses, phone numbers, and email addresses.

* <b>BERT</b>: Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based neural network architecture that processes text bidirectionally, so it takes into account the context both before and after a given word in a sentence. The BERT model used in this notebook builds on the Google Multilingual BERT model, meaning that a single model can analyze input texts in multiple languages. BERT can be used for entity extraction, that is, identifying and extracting important pieces of information (such as named entities) from unstructured text. For this task, BERT is fine-tuned on a dataset of text labeled with entity types (such as person, organization, location, or date).
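
To make the rule-based idea concrete, here is a minimal sketch that uses plain regular expressions. The `RULES` table and the `extract_with_rules` function are hypothetical names invented for this illustration, and the patterns are deliberately simplified. They are not the rules shipped with the Watson NLP RBR models.

```python
import re

# Hypothetical, simplified rules for illustration only; these are NOT the
# rules used by the Watson NLP RBR models.
RULES = {
    "EmailAddress": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://[^\s,]+"),
}

def extract_with_rules(text):
    """Return (matched_text, type, begin, end) for every rule match, ordered by position."""
    mentions = []
    for entity_type, pattern in RULES.items():
        for match in pattern.finditer(text):
            mentions.append((match.group(), entity_type, match.start(), match.end()))
    return sorted(mentions, key=lambda m: m[2])

sample = "Mail me at jan.jansen@example.com or visit http://www.example.com/contact"
for mention in extract_with_rules(sample):
    print(mention)
```

A learned model such as BiLSTM or BERT replaces these hand-written patterns with patterns inferred from labeled examples, which is why it can also find entities without a fixed format, such as person names.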

## Table of Contents


1. [Before you start](#beforeYouStart)
1. [Load Entity PII Models](#LoadModel)
1. [Data generation for Testing](#Loaddata)
1. [Watson NLP Models](#NLPModels)    
   1. [BiLSTM Pretrained](#BILSTMPre)
   1. [RBR Pretrained](#RBRPre)
   1. [BERT Pretrained](#Bert)
1. [Testing Use Case](#Testing)  
   1.  [RBR Model Testing for Credit card Extraction, URL and Email](#5.1)
       1. [RBR Stock - URL Extraction](#5.1.1)
       1. [Combine RBR PII and RBR Stock (URL)](#5.1.2)
   1. [EmailAddress Extraction using RBR](#5.2)
1. [Summary](#summary)

<a id="beforeYouStart"></a>
### 1. Before you start


<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _Default Python 3.x_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they are running in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook, and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see the cell above, add the token to the notebook by clicking **More > Insert project token** from the notebook action bar.  By running the inserted hidden code cell, a project object is created that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

Note that you can step through the notebook cell by cell by pressing Shift-Enter. Alternatively, you can execute the entire notebook by selecting **Cell -> Run All** from the menu.

In [None]:
!pip install faker

In [None]:
import json
import pandas as pd
import watson_nlp
import random
from faker import Faker
from watson_nlp import data_model as dm
from watson_nlp.toolkit.entity_mentions_utils import prepare_train_from_json
# import warnings
# warnings.filterwarnings('ignore')

In [None]:
# Silence Tensorflow warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

<a id="LoadModel"></a>
### 2. Load Entity PII Models

The function `watson_nlp.download()` downloads a pretrained Watson NLP model, and `watson_nlp.load()` loads it so that it can be used to process and analyze text data. Using pretrained models lets you integrate NLP capabilities into your application quickly, without having to train a model from scratch, which saves time and resources.
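
When several models are needed, the download-then-load pattern can be wrapped in a small helper. The sketch below is only an illustration: `load_models`, `fake_download` and `fake_load` are hypothetical names, and the two stand-in functions exist only so that the snippet runs without Watson NLP installed. In this notebook you would pass `watson_nlp.download` and `watson_nlp.load` instead.

```python
def load_models(model_names, download, load):
    """Download and load each named model once, returning {name: model}.

    `download` and `load` are passed in so the helper can be tried out with
    stand-ins; in the notebook they would be watson_nlp.download and
    watson_nlp.load.
    """
    models = {}
    for name in model_names:
        if name not in models:  # skip duplicate names
            models[name] = load(download(name))
    return models

# Stand-in functions (hypothetical) that mimic the download-then-load flow
def fake_download(name):
    return f"/tmp/models/{name}"

def fake_load(path):
    return f"<model loaded from {path}>"

models = load_models(
    ["syntax_izumo_nl_stock", "entity-mentions_rbr_multi_pii"],
    fake_download,
    fake_load,
)
print(models["entity-mentions_rbr_multi_pii"])
```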

In [None]:
# check the models catalog here - https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/watson-nlp-block-catalog.html?context=cpdaas&audience=wdp

# Load a syntax model to split the text into sentences and tokens
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_nl_stock'))
# Load bilstm model in WatsonNLP
# Note that we are loading the English PII BiLSTM model as there is no Dutch model available yet - the English model works, but will result in lower confidence scores as you will see
bilstm_model = watson_nlp.load(watson_nlp.download('entity-mentions_bilstm_en_pii'))
# Load rbr model in WatsonNLP
rbr_model = watson_nlp.load(watson_nlp.download('entity-mentions_rbr_multi_pii'))
# Load the multilingual BERT entity model in WatsonNLP
bert_entity_model = watson_nlp.load(watson_nlp.download('entity-mentions_bert_multi_stock'))

<a id="Loaddata"></a>
### 3. Data generation for Testing

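Use Faker to generate a __name__, a __BSN__ (the Dutch citizen service number, generated here with `fake.ssn()`) and a __credit card number__, and embed them in Dutch test sentences.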
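
Both identifiers carry a checksum, which gives a cheap sanity check for generated or extracted values. The sketch below implements the Luhn checksum for card numbers and the Dutch "elfproef" (11-test) for a BSN. The function names `luhn_valid` and `bsn_valid` are hypothetical, and it is an assumption, not something this notebook verifies, that Faker's generated values pass these checks.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def bsn_valid(bsn):
    """Return True if a 9-digit BSN string passes the elfproef (11-test)."""
    if len(bsn) != 9 or not bsn.isdigit():
        return False
    # Weights 9..2 for the first eight digits, -1 for the last digit
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(weight * int(digit) for weight, digit in zip(weights, bsn))
    return total % 11 == 0

print(luhn_valid("5555555555554444"))
print(bsn_valid("574106984"))
```

Checks like these are one way to filter false positives when a pattern matches a digit sequence that is not actually a card number or BSN.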

In [None]:
# Faker is used to generate fake data for testing purposes
fake = Faker(locale='nl_NL')

def format_data():
    # Generate a random name
    name = fake.name()

    # Generate a random SSN (a BSN for the nl_NL locale)
    ssn = fake.ssn()

    # Generate a random credit card number (CCN)
    ccn = fake.credit_card_number()

    text_1 = """Mijn naam is %s, en mijn BSN is %s. Hier is het nummer van mijn credit card %s""" % (name, ssn, ccn)

    text_2 = """%s is mijn burgerservicenummer. De naam op mijn creditcard %s is %s.""" % (ssn, ccn, name)

    text_3 = """Mijn creditcardnummer is %s en burgerservicenummer is %s, ik ben %s""" % (ccn, ssn, name)

    print(text_1)
    print(text_2)
    print(text_3)

    # Return one of the three sentence templates at random
    text = random.choice([text_1, text_2, text_3])

    return text

In [None]:
text = format_data()
text

<a id="NLPModels"></a>
### 4.  Watson NLP Models

<a id="BILSTMPre"></a>
### 4.1 BiLSTM Pretrained

A pretrained BiLSTM model has already been trained on a large corpus of text data and can be __fine-tuned__ or used __as-is__ for specific NLP tasks, such as sentiment analysis or named entity recognition. By using a pretrained BiLSTM model, you can __leverage the knowledge learned from the training data to quickly build NLP applications with improved accuracy__. The model takes the output of the syntax model (sentences and tokens) as its input.

In [None]:
# help(bilstm_model)

In [None]:
# Test the pretrained BiLSTM model: run the syntax model first, then pass its output to the BiLSTM model
syntax_result = syntax_model.run(text)
bilstm_result = bilstm_model.run(syntax_result)
print(bilstm_result)
for i in bilstm_result.mentions:
    print("Text: ", i.span.text.ljust(15, " "), "Type: ", i.type)

<a id="RBRPre"></a>

### 4.2 RBR Pretrained


A pretrained rule-based model ships with a set of predefined rules for processing text data, so you can use it without writing rules or training a model yourself. In this notebook it is run directly on the raw text, with the language passed as `language_code`.

In [None]:
text1= format_data()
text2= format_data()
text3= format_data()

In [None]:
all_test=[text1,text2,text3]
all_test

In [None]:
# help(rbr_model)

In [None]:
# Test the pretrained RBR PII model on each generated text
for t, test in enumerate(all_test, start=1):
    rbr_result = rbr_model.run(test, language_code='nl')
    print(rbr_result)

    for i in rbr_result.mentions:
        print("Text", t, i.span.text.ljust(15, " "), "Type: ", i.type)

<a id="Bert"></a>

### 4.3 Entity Mentions BERT multilang


In [None]:
#test dataset for URL extraction 
test1 = "Mijn naam is Robert van Linschoten. Ik ben geboren te Haarlem en mijn burgerservicenummer is 574106984. Hier is het nummer van mijn creditcard is 213117117387576, http://www.example.com/page_id=5555555555554444"

In [None]:
# Test the pretrained BERT entity model in WatsonNLP
syntax_prediction = syntax_model.run(test1)
bert_result = bert_entity_model.run(syntax_prediction)
print(bert_result)
for i in bert_result.mentions:
    print("Text: ", i.span.text.ljust(15, " "), "Type: ", i.type)



<a id="Testing"></a>

## 5. Testing Use Case

<a id="5.1"></a>
### 5.1 RBR Model Testing for Credit card Extraction, URL and Email

<a id="5.1.1"></a>
#### 5.1.1 RBR Stock (URL Extraction)

In [None]:
#Test Pretrained rbr stock model in WatsonNLP
rbr_ent_model = watson_nlp.load(watson_nlp.download('entity-mentions_rbr_nl_stock'))
rbr_ent_result = rbr_ent_model.run(test1)

for i in rbr_ent_result.mentions:
    print("Text: ", i.span.text.ljust(15, " "), "Type: ", i.type)

In [None]:
#Test Pretrained rbr PII model in WatsonNLP

rbr_result_pii = rbr_model.run(test1, language_code='nl')
rbr_result_pii

for i in rbr_result_pii.mentions:
    print("Text: ", i.span.text.ljust(15, " "), "Type: ", i.type)

<a id="5.1.2"></a>
#### 5.1.2 Combine RBR PII and RBR Stock (URL)
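
When the output of two models is combined, the same span may be reported by both. The sketch below shows one possible way to merge mention lists while dropping duplicates. It relies only on the `.span.text` and `.type` attributes used elsewhere in this notebook. `make_mention` and `merge_mentions` are hypothetical helpers, and the stand-in mentions and their type labels are made up for illustration rather than taken from real model output.

```python
from types import SimpleNamespace

def make_mention(text, entity_type):
    """Build a stand-in mention exposing .span.text and .type (hypothetical helper)."""
    return SimpleNamespace(span=SimpleNamespace(text=text), type=entity_type)

def merge_mentions(*mention_lists):
    """Concatenate mention lists, keeping the first occurrence of each (text, type) pair."""
    seen = set()
    merged = []
    for mentions in mention_lists:
        for mention in mentions:
            key = (mention.span.text, mention.type)
            if key not in seen:
                seen.add(key)
                merged.append(mention)
    return merged

# Illustrative stand-ins; the type labels here are invented, not Watson NLP output
stock = [make_mention("http://www.example.com", "URL"), make_mention("Haarlem", "Location")]
pii = [make_mention("http://www.example.com", "URL"), make_mention("574106984", "BSN")]

for m in merge_mentions(stock, pii):
    print("Text: ", m.span.text.ljust(15, " "), "Type: ", m.type)
```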

In [None]:
# Combine both results into one object
combined_mentions = rbr_ent_result + rbr_result_pii
combined_mentions

for i in combined_mentions.mentions:
    print("Text: ", i.span.text.ljust(15, " "), "Type: ", i.type)

<a id="5.2"></a>
### 5.2 EmailAddress Extraction using RBR 

In [None]:
# Test dataset for email extraction
text = format_data() +" my mail id is sample.Email@gmail.com"
text

In [None]:
#Test Pretrained rbr PII model in WatsonNLP

rbr_result_pii = rbr_model.run(text, language_code='nl')
rbr_result_pii

for i in rbr_result_pii.mentions:
    print("Text: ", i.span.text.ljust(15, " "), "Type: ", i.type)

<a id="summary"></a>
## 6. Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library to:
1. Extract PII using Pre-trained Models
</span>

Please note that this content is made available by IBM Build Lab to foster Embedded AI technology adoption. The content may include systems & methods pending patent with USPTO and protected under US Patent Laws. For redistribution of this content, IBM will use its release process. For any questions, please log an issue on GitHub.

Developed by IBM Build Lab

Copyright - 2023 IBM Corporation