**Presidio Analyzer**

After some data exploration, we test how well Microsoft Presidio Analyzer recognizes named entities in the text. Presidio Analyzer is a named entity recognizer that combines rule-based pattern recognition with machine learning techniques to detect sensitive information. It is pre-trained and "provides fast identification and anonymization modules for private entities in text and images such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more." (https://microsoft.github.io/presidio/)

In [1]:
from IPython.display import Image
Image(url = 'https://microsoft.github.io/presidio/assets/detection_flow.gif')

The named entities labelled in the Learning Agency PII dataset differ from the entities Presidio supports out of the box, so we adapt Presidio's base recognition capabilities and enhance them with custom, rule-based entities. We also created a signal phrase recognizer to help the model distinguish the name of the student writer from the names of other people the student might cite in the work.

These adaptations to Presidio Analyzer yield an F5 score of 0.77, which is driven by high recall. The model flags many tokens as PII and most of those flags are wrong (5411 false positives against 2256 true positives), but flagging liberally means that it misses comparatively few PII labels (a low number of false negatives). Out of the 2739 PII labels, the enhanced Presidio Analyzer model correctly labels 2256, a recall of about 82%, which is notable given the large class imbalance (only 2739 out of nearly 5 million word tokens are labelled as PII).

The results for this model will set up a baseline to which we will compare deep learning models in the next notebook.

In [3]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import nltk
import re
from tqdm.auto import tqdm

import spacy
spacy.load('en_core_web_lg')
from presidio_analyzer import Pattern, PatternRecognizer, RecognizerResult, AnalyzerEngine
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer
from presidio_analyzer.nlp_engine import NlpArtifacts, NlpEngineProvider
from presidio_analyzer.predefined_recognizers import EmailRecognizer, UrlRecognizer, PhoneRecognizer
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from sklearn.model_selection import train_test_split

from collections import Counter
from itertools import repeat
from bisect import bisect_left
from nltk.corpus import stopwords
import os
import shutil

import tensorflow as tf
import keras
from keras import ops
from keras.utils import pad_sequences
from conlleval import evaluate
import sklearn
import keras_nlp

tf.get_logger().setLevel('ERROR')

train = pd.read_json('data/train.json')
test = pd.read_json('data/test.json')

**Functions to Process Data for Presidio Analyzer**

These functions convert word and sentence tokens into start and end indices in the full text. This is helpful because Presidio Analyzer takes the full text as input and returns each detected entity together with its start and end indices. Having the sentence indices also lets us add sentence-based rules to boost the performance of Presidio Analyzer.

In [4]:
#Function to convert word tokens to start and end indices in the full text
def convert_word_tokens_to_index(row):
    tokens = row['tokens']
    start_ind = []
    end_ind = []
    prev_ind = 0
    for tok in tokens:
        start = prev_ind + row['full_text'][prev_ind:].index(tok)
        end = start + len(tok)
        start_ind.append(start)
        end_ind.append(end)
        prev_ind = end
    return start_ind, end_ind

indexed_text = pd.DataFrame(train.apply(func = convert_word_tokens_to_index, axis = 1))

#Create columns of start and end indices to add to the training dataframe
train['word_start_ind'] = ''
train['word_end_ind'] = ''
for row_num in tqdm(range(len(indexed_text))):
    train.at[row_num,'word_start_ind'] = indexed_text.iloc[row_num,0][0]
    train.at[row_num,'word_end_ind'] = indexed_text.iloc[row_num,0][1]


#Checking that the length of the index list equals the length of the tokens and labels lists in the first row
len(train.word_start_ind[0]) == len(train.labels[0])

  0%|          | 0/6807 [00:00<?, ?it/s]

True
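
To see what these offsets look like, here is a minimal, standalone sketch of the same left-to-right search on an invented sentence. The helper name `token_offsets` and the sample text are illustrative assumptions and are not part of the notebook; the sketch uses `str.index` with a start position rather than slicing, which should give the same result.

```python
def token_offsets(full_text, tokens):
    """Return (starts, ends) character offsets for tokens, searching left to right."""
    starts, ends = [], []
    cursor = 0  # never search before the end of the previous token
    for tok in tokens:
        start = full_text.index(tok, cursor)
        end = start + len(tok)
        starts.append(start)
        ends.append(end)
        cursor = end
    return starts, ends

# Invented sample: the repeated "Ana" and "." must map to different offsets
text = "My name is Ana. Ana lives here."
tokens = ["My", "name", "is", "Ana", ".", "Ana", "lives", "here", "."]
starts, ends = token_offsets(text, tokens)
print(list(zip(tokens, starts, ends)))
```

Moving the cursor forward after each token is what keeps repeated tokens from all resolving to the first occurrence.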

In [5]:
#Create sentence tokens based on the text
train['sentence_tokens'] = ''
for row_num in tqdm(range(len(train))):
    split_text = [x.split('\n\n') for x in nltk.sent_tokenize(train.full_text[row_num])]
    split_text = [x
                for s in split_text
                for x in s]
    train.at[row_num,'sentence_tokens'] = split_text

  0%|          | 0/6807 [00:00<?, ?it/s]

In [6]:
def convert_sent_tokens_to_index(row):
    tokens = row['sentence_tokens']
    start_ind = []
    end_ind = []
    prev_ind = 0
    for tok in tokens:
        start = prev_ind + row['full_text'][prev_ind:].index(tok)
        end = start + len(tok)
        start_ind.append(start)
        end_ind.append(end)
        prev_ind = end
    return start_ind, end_ind

indexed_text = pd.DataFrame(train.apply(func = convert_sent_tokens_to_index, axis = 1))
train['sent_start_ind'] = ''
train['sent_end_ind'] = ''
for row_num in tqdm(range(len(indexed_text))):
    train.at[row_num, 'sent_start_ind'] = indexed_text.iloc[row_num,0][0]
    train.at[row_num,'sent_end_ind'] = indexed_text.iloc[row_num,0][1]

  0%|          | 0/6807 [00:00<?, ?it/s]

**Setting Up Custom Recognizers for Presidio Analyzer**

We created custom recognizers in Presidio Analyzer using regex. In order to further enhance the model, we added a signal phrases recognizer. The signal phrases in the text file were extracted from the site https://www.yourdictionary.com/articles/examples-signal-phrases. This recognizer will be used in the sentence context to help distinguish between the names of students and the names of authors, because the default for Microsoft Presidio is to classify everyone as a person.

In [7]:
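#The with block closes the file automatically
with open('signal_phrases.txt','r') as f:
    contents = f.readlines()
#Strip newlines and split entries that list several phrases separated by "/"
contents = [str(entry).replace("\n","").split("/") for entry in contents]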
contents = [x
            for i in contents
            for x in i]
signal_phrases_recognizer = PatternRecognizer(supported_entity = 'SIGNAL_PHRASE', deny_list = contents)

id_regex = r'([A-Za-z]{2}[.?]:)?\d{12,12}'
id_pattern = Pattern(name="id", regex=id_regex, score = 0.5)
id_recognizer = PatternRecognizer(supported_entity="ID_CUSTOM", patterns = [id_pattern])

address_regex = r'\b\d+\s+\w+(\s+\w+)*\s+((st(\.)?)|(ave(\.)?)|(cir(\.)?)|(rd(\.)?)|(blvd(\.)?)|(ln(\.)?)|(ct(\.)?)|(dr(\.)?))\b'
address_pattern = Pattern(name="address", regex=address_regex, score=0.5)
address_recognizer = PatternRecognizer(supported_entity="ADDRESS_CUSTOM", patterns = [address_pattern], context=["st", "Apt"])

#Inside a character class "|" is a literal pipe, so the top-level domain class is just [A-Za-z]
email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
email_pattern = Pattern(name="email address", regex=email_regex, score=0.5)
email_recognizer = PatternRecognizer(supported_entity="EMAIL_CUSTOM", patterns = [email_pattern])

#Match http, https, or ftp schemes, or bare www. addresses
url_regex = r'(https?|ftp)://\S+|www\.\S+'
url_pattern = Pattern(name="url", regex=url_regex, score=0.5)
url_recognizer = PatternRecognizer(supported_entity="URL_CUSTOM", patterns = [url_pattern])

phone_regex = r'^[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4,6}$'
phone_pattern = Pattern(name='phone', regex=phone_regex, score=0.5)
phone_recognizer = PatternRecognizer(supported_entity='PHONE_CUSTOM', patterns=[phone_pattern])
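
A quick way to sanity-check patterns like these is to run them through plain `re` before registering them. The sketch below copies the ID and address patterns from the cell above and tries them on invented strings. Passing `re.IGNORECASE` is an assumption made here so that "St" matches the lowercase suffix; whether Presidio applies the same flag depends on how the recognizer is configured.

```python
import re

# Patterns copied from the notebook cell above
id_regex = r'([A-Za-z]{2}[.?]:)?\d{12,12}'
address_regex = r'\b\d+\s+\w+(\s+\w+)*\s+((st(\.)?)|(ave(\.)?)|(cir(\.)?)|(rd(\.)?)|(blvd(\.)?)|(ln(\.)?)|(ct(\.)?)|(dr(\.)?))\b'

# Invented test strings
id_hit = re.search(id_regex, "Student ID 123456789012 on file")
id_short = re.search(id_regex, "Ref 12345678901")    # 11 digits: too short
id_long = re.search(id_regex, "Ref 1234567890123")   # 13 digits: no word boundary in the pattern
addr_hit = re.search(address_regex, "I live at 221 Baker St in town", flags=re.IGNORECASE)

print(id_hit.group())    # the 12-digit run
print(id_short)          # None
print(id_long.group())   # first 12 digits of the longer run
print(addr_hit.group())
```

The 13-digit case shows that the ID pattern also fires inside longer digit runs, which is one possible source of false positives.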


The `allow_list` parameter in Presidio Analyzer lets us specify terms that should never be classified as entities. During exploratory Presidio Analyzer runs, we found that the model tended to incorrectly classify names of companies and stores as student names, so we add well-known brands and S&P 500 company names to the list. Including stopwords, along with any token that appears more than 55 times in the data, also keeps Presidio Analyzer from classifying commonly used language as entities.

In [10]:
nltk.download('stopwords')
ALLOW_LIST = []
all_stopwords = list(stopwords.words())
words = Counter()
for doc in train.tokens:
    words.update(doc)
for doc in test.tokens:
    words.update(doc)
all_stopwords  += [str(w).lower() for w, i in words.items() if i > 55]
all_stopwords = list(sorted(set(all_stopwords)))
del words

ALLOW_LIST.extend(all_stopwords)
ALLOW_LIST = [word for word in ALLOW_LIST if word not in contents]

#S&P 500 companies dataset taken from https://github.com/datasets/s-and-p-500-companies/tree/main/data
sp500 = pd.read_csv('data/names/companies-sp.csv')['Security'].str.upper().tolist()
famous_brands = ['FACEBOOK','GOOGLE','MICROSOFT','TIKTOK','INSTAGRAM','NETFLIX','DISNEY+','APPLE','TABLEAU','POWER BI','SNOWFLAKE',
                 'AZURE','ORACLE','IBM','TED TALK']
ALLOW_LIST.extend(famous_brands)
ALLOW_LIST.extend(sp500)
ALLOW_LIST.extend(['DESIGN','CHALLENGE','COMPETITION'])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/natalieyeo/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
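
The frequency-based part of the allow list can be illustrated on a toy corpus. This is only a sketch: the documents and the threshold of 2 are invented (the notebook uses 55), and it mirrors the cell above in counting tokens case-sensitively before lowercasing them.

```python
from collections import Counter

# Invented tokenized documents
docs = [
    ["Ana", "likes", "design", "thinking"],
    ["Design", "thinking", "helps", "Ana"],
    ["design", "thinking", "is", "iterative"],
]
min_count = 2  # hypothetical threshold for the toy corpus

counts = Counter()
for doc in docs:
    counts.update(doc)

# Keep tokens seen more than min_count times, lowercased and de-duplicated
frequent = sorted({w.lower() for w, c in counts.items() if c > min_count})
print(frequent)
```

Because "design" and "Design" are counted separately, neither passes the threshold here even though together they appear three times.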


In [11]:
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(address_recognizer)
registry.add_recognizer(email_recognizer)
registry.add_recognizer(url_recognizer)
registry.add_recognizer(phone_recognizer)
registry.add_recognizer(id_recognizer)
registry.add_recognizer(signal_phrases_recognizer)

#Played around with the context aware enhancer settings but didn't add many context phrases
#The problem is that the Presidio Model tends to overclassify with a very high amount of false positives
#Couldn't conceptualize a way to use the Lemma Context Aware Enhancer to stop the model from classifying something as PII
context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.7, min_score_with_context_similarity=0.6
)
analyzer = AnalyzerEngine(registry = registry,context_aware_enhancer = context_aware_enhancer)

In [12]:
#Functions which will be used in conjunction with the Presidio Analyzer function to predict BIO labels
from bisect import bisect_right

def find_le(a, x):
    'Find rightmost value less than or equal to x'
    i = bisect_right(a, x)
    if i:
        return a[i-1]
    raise ValueError
    
def count_trailing_whitespaces(word):
    return len(word) - len(word.rstrip())

from dateutil import parser

def is_valid_date(text):
    'Return True if dateutil can parse the text as a date.'
    try:
        parser.parse(text)
        return True
    except (ValueError, OverflowError):
        return False
    
def index_to_word_token_num(df, row_num, start, end):
    'Convert the index of the full text to the word token list index for a particular row.'
    text = df['full_text'][row_num][start:end]
    token_start_ind = df.word_start_ind[row_num].index(start)
    token_end_ind = df.word_end_ind[row_num].index(end - count_trailing_whitespaces(text))
    
    return text.rstrip(), token_start_ind, token_end_ind
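
As a rough illustration of how `find_le` supports sentence-level rules, the sketch below maps character offsets to the start of the sentence that contains them and checks whether a person entity shares a sentence with a signal phrase. All offsets are invented for the example.

```python
from bisect import bisect_right

def find_le(a, x):
    'Find rightmost value less than or equal to x'
    i = bisect_right(a, x)
    if i:
        return a[i - 1]
    raise ValueError

# Hypothetical character offsets
sent_starts = [0, 16, 40]      # sentence start positions, sorted
signal_starts = [30]           # a signal phrase found in the second sentence
person_in_cited_sentence = 22  # person entity in the second sentence
person_elsewhere = 45          # person entity in the third sentence

signal_sentences = {find_le(sent_starts, s) for s in signal_starts}

# A person entity is suppressed when its sentence also holds a signal phrase
suppressed_a = find_le(sent_starts, person_in_cited_sentence) in signal_sentences
suppressed_b = find_le(sent_starts, person_elsewhere) in signal_sentences
print(suppressed_a, suppressed_b)
```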

**Prediction Function**

In the function below, we use Presidio Analyzer to recognize a list of supported and custom entities. We apply some data cleaning so that whitespace is not classified as an entity. We also use the signal phrases detected by Presidio Analyzer to discard person entities that appear in the same sentence as a signal phrase, which reduces false positives in the results.

In [13]:
def predict_label(df, row_num):
    'Function that takes the dataframe and row number, analyzes entities, and returns a dataframe with document, token number, and label.'
    
    #Analyze a corpus using Microsoft Presidio
    result = analyzer.analyze(text = df['full_text'][row_num],
                              entities = ["PHONE_NUMBER", "URL_CUSTOM", "EMAIL_ADDRESS","EMAIL_CUSTOM", "ADDRESS_CUSTOM", "US_SSN",
                                          "US_ITIN","US_PASSPORT", "US_BANK_NUMBER", "USERNAME", "ID_CUSTOM",'SIGNAL_PHRASE','PERSON'],
                              allow_list=ALLOW_LIST,
                              language='en', 
                              score_threshold=0.5) #Increased score threshold from 0.005 to 0.5
    preds = []
    label = ""
    
    #Loading compilers for regex patterns
    parenthesis = re.compile("^\\(.*$|^.\\)$")
    
    starts = []
    whitespace = ['\n\n', '\xa0b', ' ']
    
    for x in result:
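        #Stop at the first entity whose boundaries do not line up with word tokens.
        #Note that break also drops any remaining entities in this document (continue would skip only this one).
        if x.start not in df.word_start_ind[row_num]:
            break
        if x.end not in df.word_end_ind[row_num]:
            break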
        if x.start not in starts: #This allows function to skip over entries that it already labelled
            starts.append(x.start)
            text, word_token_start, word_token_end = index_to_word_token_num(df, row_num, x.start, x.end)

            if x.entity_type == 'PERSON' or x.entity_type == 'PERSON_CUSTOM':
                sentence_start = find_le(df.sent_start_ind[row_num], x.start)
                sentence_idx = df.sent_start_ind[row_num].index(sentence_start)
                sentence_text = df.sentence_tokens[row_num][sentence_idx]
                sentences_with_signals = [find_le(df.sent_start_ind[row_num], s.start) 
                                          for s in result if s.entity_type == 'SIGNAL_PHRASE']
                if sentence_start in sentences_with_signals:
                    label = "O"
                elif parenthesis.search(sentence_text):
                    label = "O"
                else:
                    label = "NAME_STUDENT"
            if x.entity_type == 'PHONE_NUMBER':
                label = "PHONE_NUM"
            if x.entity_type == 'EMAIL_ADDRESS' or x.entity_type == 'EMAIL_CUSTOM':
                label = "EMAIL"
            if x.entity_type == 'URL_CUSTOM':
                label = 'URL_PERSONAL'
            if x.entity_type == 'ADDRESS_CUSTOM':
                label = 'STREET_ADDRESS'
            if x.entity_type in ['US_SSN', 'US_ITIN', 'US_PASSPORT', 'US_BANK_NUMBER', 'ID_CUSTOM']:
                label = 'ID_NUM'
            if x.entity_type == 'USERNAME':
                label =  'USERNAME'

            if (label != "O" and x.entity_type != 'SIGNAL_PHRASE'):
                labels = ['B-' + label]
                if (word_token_end - word_token_start > 0):
                    inner_labels = ['I-' + label for i in range((word_token_start + 1), word_token_end + 1)]
                    labels.extend(inner_labels)
                    if '\n\n' in text:
                        end_idx = word_token_end + 1
                        tab_idx = df.loc[row_num, 'tokens'][word_token_start:end_idx].index('\n\n')
                        labels[tab_idx] = "O"
                        labels[tab_idx + 1] = 'B-' + label
                token_range = range(word_token_start, word_token_end + 1)
                df_result = pd.DataFrame({'document': [df.document[row_num] for i in token_range],
                                          'token': list(token_range),
                                          'token_text': [df.tokens[row_num][i] for i in token_range],
                                          #'sentence_text': [sentence_text for i in token_range],
                                          'preds': labels})
                #Blank out whitespace cells, then drop the rows whose token text was whitespace
                df_result = df_result[~df_result.isin(whitespace)]
                df_result = df_result[df_result.token_text.notna()]
                preds.append(df_result)
    if len(preds) > 1:
        result = pd.concat(preds, axis = 0, ignore_index=True)
        if result[['document','token']].duplicated().sum() > 0:
            duplicated_tokens = list(result.loc[result[['document','token']].duplicated(),'token'])
            for i in duplicated_tokens:
                duplicated_labels = result.loc[result.token == i, 'preds']
                for j in duplicated_labels.index:
                    if 'I-' in result.loc[j, 'preds']:
                        result = result.drop(index = j)
        return result
    elif len(preds) == 1:
        return preds[0]
    else: 
        pass
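
The BIO labelling step at the heart of `predict_label` can be shown in isolation. In this minimal sketch the helper `bio_labels`, the tokens, and the span are all invented for illustration: the first token of an entity gets a `B-` prefix and every following token gets `I-`.

```python
def bio_labels(label, n_tokens):
    """Return BIO tags for an entity that covers n_tokens consecutive tokens."""
    if n_tokens < 1:
        raise ValueError("an entity must cover at least one token")
    return ['B-' + label] + ['I-' + label] * (n_tokens - 1)

# Invented example: a three-token name at token positions 1..3 (inclusive)
tokens = ["Contact", "Ana", "Maria", "Silva", "today"]
preds = ["O"] * len(tokens)
start, end = 1, 3
preds[start:end + 1] = bio_labels("NAME_STUDENT", end - start + 1)
print(list(zip(tokens, preds)))
```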

In [14]:
#Create a dataframe of the correct labels
correct_labels = []

for i in tqdm(range(len(train))):
    tmp_df = pd.DataFrame({'document' : train.document[i],
                           'token': range(len(train.tokens[i])),
                           'token_text': train.tokens[i],
                           'correct_label': train.labels[i]})
    correct_labels.append(tmp_df)

correct_df = pd.concat(correct_labels,axis = 0,ignore_index = True)

  0%|          | 0/6807 [00:00<?, ?it/s]

Below is a function to count the number of true positives, false positives, true negatives, and false negatives in the predictions. We use these counts to calculate precision, recall, the F1 score, and the F-beta score with beta = 5 (F5) that the Kaggle competition uses to evaluate results. The F5 score is like the F1 score but treats recall as 5 times as important as precision, because we don't want the model to miss PII.

In [23]:
def fbeta_score(pred_df,correct_df,beta=5):
    """
    Parameters:
    - pred_df (DataFrame): DataFrame containing predicted PII labels.
    - correct_df (DataFrame): DataFrame containing ground truth PII labels.
    - beta (float): The beta parameter for the F-beta score, controlling the trade-off between precision and recall.

    Returns:
    - DataFrame: prediction DataFrame merged with the ground truth DataFrame
    - float: Micro F-beta score.
    """   
    df = correct_df.merge(pred_df.drop('token_text',axis = 1), on = ['document','token'],how = 'left')
    #Tokens without a prediction are treated as non-PII ("O")
    df['preds'] = df['preds'].fillna('O')
    df.loc[(df.correct_label == df.preds) & (df.correct_label != "O"), 'cm'] = 'TP'
    df.loc[(df.correct_label == "O") & (df.preds != "O"), 'cm'] = 'FP'
    df.loc[(df.correct_label != "O") & (df.preds == "O"), 'cm'] = 'FN'
    df.loc[(df.correct_label == "O") & (df.preds =="O"),'cm'] = 'TN'

    TP = (df['cm'] == 'TP').sum()
    FP = (df['cm'] == 'FP').sum()
    FN = (df['cm'] == 'FN').sum()
    TN = (df['cm'] == 'TN').sum()
    total = TP + FP + FN + TN

    precision = TP / (TP + FP)
    recall = (TP / (TP + FN))
    f1 = 2* (precision * recall / (precision + recall))
    
    print(f'True Positives: {TP}, False Positives: {FP}')
    print(f'True Negatives: {TN}, False Negatives: {FN}')
    print(f'{len(df) - total} correctly identified as PII but mislabelled.')
    print("Precision: " + str(precision))
    print("Recall: " + str(recall))
    print("F1-Score " + str(f1))
    
    s_micro = (1+(beta**2))*TP/(((1+(beta**2))*TP) + ((beta**2)*FN) + FP)
    print(s_micro)
    
    return df, s_micro
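
To make the effect of beta concrete, the sketch below computes the F-beta score two ways, from raw counts (the form used in the function above) and from precision and recall, and compares F5 with F1. The function names and the counts are invented for illustration.

```python
import math

def fbeta_from_counts(tp, fp, fn, beta=5):
    """Micro F-beta from true positives, false positives, and false negatives."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def fbeta_from_pr(precision, recall, beta=5):
    """The same score expressed through precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented counts: low precision (0.4) but high recall (0.8)
tp, fp, fn = 8, 12, 2
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f5 = fbeta_from_counts(tp, fp, fn, beta=5)
f1 = fbeta_from_counts(tp, fp, fn, beta=1)
print(f"F1 = {f1:.3f}, F5 = {f5:.3f}")
assert math.isclose(f5, fbeta_from_pr(precision, recall, beta=5))
```

With these counts F5 sits close to the recall while F1 is pulled down by the low precision, which is the same pattern the Presidio results show.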

**Results**

Overall, this rules-based model produces many more false positives than true positives, but comparatively few false negatives. This results in high recall (about 0.83 in the final run) and low precision (about 0.29). Because the F5 score weights recall heavily, it remains fairly high at about 0.77. The model catches most of the PII, but at the expense of low precision.

In [None]:
predictions = []

for i in tqdm(range(len(train))):
    predictions.append(predict_label(train, i))

round2 = pd.concat(predictions,axis = 0,ignore_index=True)
del predictions
total_preds_df2, score = fbeta_score(round2, correct_df)

After Round 2, we noticed that some tokens were being classified twice, so we fixed the predict_label() function to allow only one classification per token. We also prevented whitespace characters from being identified as PII and increased the Presidio Analyzer score threshold from 0.005 to 0.5. This led to slightly better results.

In [24]:
predictions = []

for i in tqdm(range(len(train))):
    predictions.append(predict_label(train, i))

round3 = pd.concat(predictions,axis = 0,ignore_index=True)
del predictions
total_preds_df3, score = fbeta_score(round3, correct_df)



True Positives: 2256, False Positives: 5411
True Negatives: 4984383, False Negatives: 463
20 correctly identified as PII but mislabelled.
Precision: 0.2942480761706013
Recall: 0.8297168076498713
F1-Score 0.4344309647602542
0.7754422146426588


In [2]:
TP = 2256
FN = 483 #463 false negatives plus the 20 PII tokens that were flagged with the wrong label
FP = 5411
beta = 5
s_micro = (1+(beta**2))*TP/(((1+(beta**2))*TP) + ((beta**2)*FN) + FP)
print('Micro F5 Score:', s_micro)

Micro F5 Score: 0.7703501352735678


In [31]:
#Function to add sentence context to each token in the dataset
from itertools import repeat
from bisect import bisect_left
def find_lt_idx(a, x):
    'Find rightmost value less than x'
    i = bisect_left(a, x)
    if i:
        return i-1

def find_sent_token(row):
    sentences = row['sentence_tokens']
    word_starts = row['word_start_ind']
    words = row['tokens']
    labels = row['labels']
    sentence_starts = row['sent_start_ind']
    
    sentence_token = [] #Should end up the length of word starts, aka the num of words in the essay
    word_tokens_in_sentence = []
    labels_in_sentence = []
    start_ind = 0
    for next_sent_token_num, next_sentence_start in enumerate(sentence_starts[1:],1):
        end_ind = find_lt_idx(word_starts,next_sentence_start) + 1
        x = end_ind - start_ind
        sentence_token[start_ind:end_ind] = repeat(sentences[next_sent_token_num - 1],x)
        word_tokens_in_sentence[start_ind:end_ind] = repeat(words[start_ind:end_ind],x)
        labels_in_sentence[start_ind:end_ind] = repeat(labels[start_ind:end_ind],x)
        start_ind = end_ind
    end_ind = len(word_starts)
    x = end_ind - start_ind
    sentence_token[start_ind:end_ind] = repeat(sentences[len(sentences)-1], x)
    word_tokens_in_sentence[start_ind:end_ind] = repeat(words[start_ind:end_ind],x)
    labels_in_sentence[start_ind:end_ind] = repeat(labels[start_ind:end_ind],x)
    return sentence_token, word_tokens_in_sentence, labels_in_sentence

sent_word_conversion = pd.DataFrame(train.apply(func = find_sent_token, axis = 1))

train['sent_for_word'] = ''
train['words_in_sentence'] = ''
train['labels_in_sentence'] = ''
for row_num in tqdm(range(len(sent_word_conversion))):
    train.at[row_num, 'sent_for_word'] = sent_word_conversion.iloc[row_num,0][0]
    train.at[row_num, 'words_in_sentence'] = sent_word_conversion.iloc[row_num,0][1]
    train.at[row_num, 'labels_in_sentence'] = sent_word_conversion.iloc[row_num,0][2]


total_preds_df3['sentence_text'] = train.sent_for_word.explode().reset_index(drop = True)
total_preds_df3['tokenized_sentence'] = train.words_in_sentence.explode().reset_index(drop = True)
total_preds_df3['labels_in_sentence'] = train.labels_in_sentence.explode().reset_index(drop = True)

  0%|          | 0/6807 [00:00<?, ?it/s]
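
The idea of attaching a sentence to every token can also be sketched more compactly. The version below is a simpler alternative that uses `bisect_right` directly, not the notebook's `find_lt_idx` approach, and it assumes the first sentence starts at or before the first token. The sample offsets are invented.

```python
from bisect import bisect_right

def sentence_index_per_token(word_starts, sent_starts):
    """For each token start offset, return the index of the sentence containing it."""
    return [bisect_right(sent_starts, w) - 1 for w in word_starts]

# Invented sample: "My name is Ana. Ana lives here."
sentences = ["My name is Ana.", "Ana lives here."]
sent_starts = [0, 16]
word_starts = [0, 3, 8, 11, 14, 16, 20, 26, 30]

sent_idx = sentence_index_per_token(word_starts, sent_starts)
sent_for_word = [sentences[i] for i in sent_idx]
print(sent_idx)
```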

In [26]:
total_preds_df3

Unnamed: 0,document,token,token_text,correct_label,preds,cm,sentence_text,tokenized_sentence
0,7,0,Design,O,O,TN,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,..."
1,7,1,Thinking,O,O,TN,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,..."
2,7,2,for,O,O,TN,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,..."
3,7,3,innovation,O,O,TN,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,..."
4,7,4,reflexion,O,O,TN,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,..."
...,...,...,...,...,...,...,...,...
4992528,22687,815,process,O,O,TN,"Despite this, another valid tool that could fi...","[Despite, this, ,, another, valid, tool, that,..."
4992529,22687,816,explained,O,O,TN,"Despite this, another valid tool that could fi...","[Despite, this, ,, another, valid, tool, that,..."
4992530,22687,817,above,O,O,TN,"Despite this, another valid tool that could fi...","[Despite, this, ,, another, valid, tool, that,..."
4992531,22687,818,.,O,O,TN,"Despite this, another valid tool that could fi...","[Despite, this, ,, another, valid, tool, that,..."


In [32]:
#Save the data to a json file to build on this in the next notebook
total_preds_df3.to_json('initial_predictions.json')

In [28]:
#Reload the saved predictions
df = pd.read_json('initial_predictions.json')