## Named Entity Recognition

This notebook explores Named Entity Recognition (NER) and entity masking techniques for text data. We leverage a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model fine-tuned for NER tasks. The code achieves the following:

- Data Preparation: Loads a sample DataFrame containing comments (text) and defines a dictionary that maps NER tags (e.g., "B-PER", "I-LOC") to their entity types (e.g., "Person", "Location").
- NER Pipeline Creation: Creates an NER pipeline using the transformers library and a pre-trained BERT model specifically designed for NER.
- NER Module: Implements an NERModule class to handle NER tasks:
    - get_ner retrieves entity predictions for a given text.
    - join_entities combines entities potentially broken down during model predictions.
    - replace_entities replaces identified entities with placeholders based on their types (optional masking).
    - mask_entities applies NER, creates a new column with masked text, and optionally removes intermediate columns.
- Entity Masking: Applies NER to the comments in the DataFrame, replaces named entities with generic placeholders based on their types (configurable masking options), and creates a new column containing the masked text.
- Sample Output: Demonstrates the code's functionality by printing an original comment and its masked version.

This approach identifies and masks named entities (people, locations, organizations, etc.) within text data, which can help protect privacy or let downstream analysis focus on the sentiment of the text rather than on who or what is mentioned.
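
BERT's WordPiece tokenizer splits uncommon words into pieces and marks each continuation piece with a leading `##`, so one word can come back as several predictions whose character spans touch. The sketch below illustrates the joining idea on its own. `merge_subwords` is a hypothetical helper written for this illustration, and the sample predictions are hand-written rather than real model output.

```python
def merge_subwords(preds):
    """Merge predictions whose character spans touch (end == next start)."""
    merged = []
    for pred in preds:
        if merged and merged[-1]["end"] == pred["start"]:
            # Continuation piece: extend the previous span and drop the "##" marker
            merged[-1]["word"] += pred["word"].replace("##", "")
            merged[-1]["end"] = pred["end"]
        else:
            # Copy the dict so the caller's list is left untouched
            merged.append(dict(pred))
    return merged


# Hand-written predictions for the text "Kawandalama" (illustrative only)
sample_preds = [
    {"entity": "I-LOC", "word": "Ka", "start": 0, "end": 2},
    {"entity": "I-LOC", "word": "##wan", "start": 2, "end": 5},
    {"entity": "I-LOC", "word": "##dalama", "start": 5, "end": 11},
]

merged_preds = merge_subwords(sample_preds)
print(merged_preds)
# [{'entity': 'I-LOC', 'word': 'Kawandalama', 'start': 0, 'end': 11}]
```

Words separated by a space do not have touching spans, so a two-word name still yields two entities under this rule.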

In [16]:
import pandas as pd
import numpy as np

The model outputs each token together with an entity tag. The tags in use are:
- B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity.
- B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity.
- B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity.
- B-MISC/I-MISC means the word corresponds to the beginning of/is inside a miscellaneous entity.
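
Each tag is therefore a position prefix (`B` or `I`) joined by a hyphen to a type code (`PER`, `ORG`, `LOC`, `MISC`). As a minimal sketch of that structure, the helpers below split a tag into its two parts. `split_tag`, `describe_tag`, and `ENTITY_NAMES` are names made up for this illustration and are not used elsewhere in the notebook.

```python
# Type codes from the list above, mapped to readable names (illustrative)
ENTITY_NAMES = {
    "PER": "Person",
    "ORG": "Organisation",
    "LOC": "Location",
    "MISC": "Miscellaneous",
}


def split_tag(tag):
    """Split a tag such as 'B-PER' into (prefix, type_code)."""
    prefix, _, code = tag.partition("-")
    return prefix, code


def describe_tag(tag):
    """Return (position, entity_name) for a tag such as 'I-LOC'."""
    prefix, code = split_tag(tag)
    position = "beginning" if prefix == "B" else "inside"
    return position, ENTITY_NAMES[code]


print(split_tag("I-MISC"))    # ('I', 'MISC')
print(describe_tag("B-ORG"))  # ('beginning', 'Organisation')
```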

In [17]:
# Create a dictionary to manage the entity types that could come up
dict_entities = {"I-PER": "Person", 
                 "B-PER": "Person", 
                 "I-ORG": "Organisation",
                 "B-ORG": "Organisation",
                 "I-LOC": "Location",
                 "B-LOC": "Location",
                 "I-MISC": "Miscellaneous",
                 "B-MISC": "Miscellaneous"
                }

In [18]:
# Create a test dataframe to pass to the model
data = [
    'Shanghai is also really exciting (precisely -- skyscrapers galore). Good tweeps in China:',
    'Recession hit Veronique Branquinho, she has to quit her company, such a shame!',
    'It is 2023 now. Sarah, Sharon, along with Margot Robbie, work at TFINA LTD and work from Ultimo in Kawandalama',
    'Angola',
    'that`s great!! weee!! visitors!',
    'Antigua and Barbuda',
    'I think everyone hates this dude lol. they are more like that dictator who seems to be ranting about almost everything now. He is from Wakanda'
]

# Create the pandas DataFrame, providing the column name explicitly
df = pd.DataFrame(data, columns=['Comments'])

In [19]:
class classifier_pipeline:
 
    def __init__(self, task, model):
        self.task = task
        self.model = model

    def create_classifier(self):
        # Import the pipeline
        from transformers import pipeline

        return pipeline(task=self.task, model=self.model)


class NERModule:
    """
    NER Module Class: Encapsulates NER-related functionality.

    - get_ner: extracts entity predictions for a given text.
    - join_entities: handles entity merging based on model-specific tokenization.
    - replace_entities: replaces entities with generic NER placeholders.
    - mask_entities: applies NER, creates a masked text column.
    """

    def __init__(self, classifier, df, colname):
        self.classifier = classifier
        self.df = df
        self.col = colname   


    def get_ner(self, text):
        # Predict the NER tags
        preds = self.classifier(text)
        # Format the data as per the requirement
        preds = [
                {
                    "entity": pred["entity"],
                    "score": round(pred["score"], 4),
                    "index": pred["index"],
                    "word": pred["word"],
                    "start": pred["start"],
                    "end": pred["end"],
                }
                for pred in preds
        ]
        # Join the words that were together in the original text (separated by hash in entity recognition)
        preds = self.join_entities(preds)
        return preds


    # Function to join the words that were broken down with # as the key in front of them during the entity recognition by the model
    def join_entities(self, preds):
        # Join the separated words together as per the input text
        res = [] # Result list
        i=0 # initial number to start the loop

        while i < len(preds):
            # Get the current row
            currentrow = preds[i]

            # If i is at the last line, add to result and break
            if i == len(preds)-1:
                res.append(currentrow)        
                break

            # Run a loop for all subsequent rows
            for j in range(i+1, len(preds)):
                # Get the content of the next row
                nextrow = preds[j]

                if currentrow['end'] == nextrow['start']:
                    # Update current row end index to the next row end index
                    currentrow['end'] = nextrow['end'] 

                    # Update current row word to concat with the next row word. Remember to remove any ## if they exist
                    nextrow['word'] = nextrow['word'].replace("##","")
                    currentrow['word'] = currentrow['word'] + nextrow['word'] 

                    # If J has reached the end, add the current row to the result and break
                    if j == len(preds)-1:
                        # Append to result
                        res.append(currentrow)
                        # Set i such that it breaks the outer loop
                        i = len(preds)
                        break
                else:
                    # Append the row to the result list
                    res.append(currentrow)
                    # Change the order to i to match j for the next loop
                    i = j
                    # Break the nested for loop
                    break
        return res



    # Clean the string by removing entities
    def replace_entities(self, text, entities, mask_name, mask_place, mask_org, mask_misc):
        # If there is nothing in entities, then return the original text
        if len(entities)==0:
            cleaned = text
        else:
            # Sort the entities by start position in descending order, so that replacing
            # a later span does not shift the offsets of the spans before it
            entities.sort(key=lambda row: (row['start']), reverse=True)

            # Convert the text to list
            text_list = list(text)

            # Loop through the entities
            for entity in entities:
                # Get start index
                start = entity['start']
                end = entity['end']
                entity_type = entity['entity']

                # Set term to replace; it stays empty when this entity type should not be masked
                replacement = ""
                if "PER" in entity_type:
                    if mask_name:
                        replacement = "<name>"
                elif "LOC" in entity_type:
                    if mask_place:
                        replacement = "<place>"
                elif "ORG" in entity_type:
                    if mask_org:
                        replacement = "<organisation>"
                else:
                    if mask_misc:
                        replacement = "<misc>"

                # Skip entities whose type is not being masked
                if len(replacement)==0:
                    continue

                # Replace the start position by generic term
                text_list[start] = replacement

                # Remove the remaining characters to the end of the original word
                del text_list[start+1:end]

            # Join back the string once all entities have been processed
            cleaned = ''.join(text_list)
        
        return cleaned


    def mask_entities(self, delete_ner_tags=True, mask_name=True, mask_place=True, mask_org=True, mask_misc=True):

        # Create a lambda function to get the NER tags from the given col of the dataframe
        self.df['entities'] = self.df[self.col].apply(lambda x: self.get_ner(x))

        # Replace entities with placeholders, honouring the mask_* options passed by the caller
        self.df['masked'] = self.df.apply(lambda x: self.replace_entities(x[self.col], x['entities'], mask_name=mask_name, mask_place=mask_place, mask_org=mask_org, mask_misc=mask_misc), axis=1)

        # If the delete_ner_tags argument is True, then drop the entities column
        if delete_ner_tags:
            self.df.drop('entities', axis=1, inplace=True)

        return self.df
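
`replace_entities` sorts the entities by start position in descending order before replacing them. The reason is that a placeholder usually differs in length from the text it replaces, so every replacement shifts the character offsets of whatever comes after it. Working from the end of the string keeps the offsets of the remaining entities valid. The sketch below shows the effect in isolation. `mask_spans` is a hypothetical helper and the spans are hand-written, not model output.

```python
def mask_spans(text, spans, reverse=True):
    """Replace each (start, end, placeholder) span in `text`.

    With reverse=True the spans are processed from the end of the text,
    so earlier offsets are not affected by later replacements.
    """
    for start, end, placeholder in sorted(spans, key=lambda s: s[0], reverse=reverse):
        text = text[:start] + placeholder + text[end:]
    return text


text = "Sarah works in Ultimo"
spans = [(0, 5, "<name>"), (15, 21, "<place>")]

masked = mask_spans(text, spans)
print(masked)   # <name> works in <place>

# Processing left to right reuses stale offsets and corrupts the result
broken = mask_spans(text, spans, reverse=False)
print(broken)   # <name> works in<place>o
```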


In [20]:
# Instantiate an NER pipeline using a pre-trained BERT model specifically fine-tuned for NER.
classifier = classifier_pipeline('ner', 'dbmdz/bert-large-cased-finetuned-conll03-english').create_classifier()

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [21]:
# Create a NERModule object
ner = NERModule(classifier, df, 'Comments')

# Call the mask_entities method to perform NER, mask entities with placeholders, and create a new column with the masked text
df = ner.mask_entities()

In [22]:
# Sample text
index = 2 # Starts at 0

print(f"Original statement: {df['Comments'][index]} \n  Masked statement: {df['masked'][index]}")

Original statement: It is 2023 now. Sarah, Sharon, along with Margot Robbie, work at TFINA LTD and work from Ultimo in Kawandalama 
  Masked statement: It is 2023 now. <name>, <name>, along with <name> <name>, work at <organisation> <organisation> and work from <organisation> in <place>
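
In this output, "Margot Robbie" becomes `<name> <name>` and "TFINA LTD" becomes `<organisation> <organisation>`, because `join_entities` only merges predictions whose character spans touch, and separate words are divided by a space. One possible refinement, which is not part of the notebook, is to also merge neighbouring entities of the same type when only whitespace lies between them. The sketch below assumes the prediction format returned by `get_ner`. `entity_type` and `merge_adjacent_entities` are hypothetical helpers, and the entities are hand-written for illustration.

```python
def entity_type(tag):
    """Drop the 'B-'/'I-' prefix: 'I-PER' -> 'PER'."""
    return tag.split("-", 1)[-1]


def merge_adjacent_entities(text, entities):
    """Merge same-type entities that are separated only by whitespace."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            # A "B-" tag marks the start of a new entity, so never merge into it
            and not ent["entity"].startswith("B-")
            and entity_type(prev["entity"]) == entity_type(ent["entity"])
            and text[prev["end"]:ent["start"]].strip() == ""
        ):
            prev["end"] = ent["end"]
            prev["word"] = text[prev["start"]:prev["end"]]
        else:
            merged.append(dict(ent))  # copy so the input is not mutated
    return merged


text = "Margot Robbie works at TFINA LTD"
entities = [
    {"entity": "I-PER", "word": "Margot", "start": 0, "end": 6},
    {"entity": "I-PER", "word": "Robbie", "start": 7, "end": 13},
    {"entity": "I-ORG", "word": "TFINA", "start": 23, "end": 28},
    {"entity": "I-ORG", "word": "LTD", "start": 29, "end": 32},
]

merged_entities = merge_adjacent_entities(text, entities)
print([e["word"] for e in merged_entities])  # ['Margot Robbie', 'TFINA LTD']
```

Names separated by punctuation, such as "Sarah, Sharon", stay separate under this rule, but two distinct same-type entities divided only by a space would be merged, so the rule is a heuristic. The `transformers` pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens into entities, at the cost of a different output schema (`entity_group` instead of `entity`).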
