This notebook explores Named Entity Recognition (NER) in Python. It compares a from-scratch Hidden Markov Model with Viterbi decoding against four popular libraries - NLTK, spaCy, Huggingface/Transformers, and stanza - for extracting named entities from text. Each approach is evaluated on 10 test sentences (75 tokens) using micro-averaged accuracy, precision, and recall. Huggingface/Transformers' ner pipeline achieves the highest accuracy (0.83) and precision (0.76), while spaCy reaches slightly higher recall (0.76 versus 0.74).

##### Read the data from the file

In [1]:
# Read the content of the text file
with open('A3-3-Data.txt', 'r') as file:
    content = file.read()

# Split the content into training_data and test_data sections
training_start = content.find("training_data = [")
test_start = content.find("test_data = [", training_start)
training_end = test_start

# Extract training_data
training_data_content = content[training_start:training_end]
exec(training_data_content)  # Execute the extracted content to load the training_data

# Extract test_data
test_data_content = content[test_start:]
exec(test_data_content)  # Execute the extracted content to load the test_data

# Print the loaded training_data and test_data
print("Training data = ", training_data)
print("Testing data =", test_data)


Training data =  [[('John', 'PERSON'), ('works', 'O'), ('at', 'O'), ('Apple', 'ORGANIZATION')], [('Alice', 'PERSON'), ('studies', 'O'), ('at', 'O'), ('Stanford', 'ORGANIZATION'), ('in', 'O'), ('California', 'LOCATION')], [('The', 'O'), ('CEO', 'TITLE'), ('of', 'O'), ('Microsoft', 'ORGANIZATION'), ('is', 'O'), ('speaking', 'O'), ('today', 'O')], [('Google', 'ORGANIZATION'), ('announced', 'O'), ('a', 'O'), ('new', 'O'), ('smartphone', 'O')], [('Facebook', 'ORGANIZATION'), ('acquired', 'O'), ('Instagram', 'ORGANIZATION')], [('The', 'O'), ('movie', 'O'), ('premieres', 'O'), ('in', 'O'), ('New', 'LOCATION'), ('York', 'LOCATION'), ('City', 'LOCATION')], [('Leonardo', 'PERSON'), ('Da', 'PERSON'), ('Vinci', 'PERSON'), ('painted', 'O'), ('the', 'O'), ('Mona', 'O'), ('Lisa', 'O')], [('The', 'O'), ('Nobel', 'O'), ('Prize', 'O'), ('was', 'O'), ('awarded', 'O'), ('to', 'O'), ('three', 'O'), ('scientists', 'O')], [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('was', 'O'), ('the', 'O'), ('president', '
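The cell above loads both lists by running `exec` on slices of the file. A minimal alternative sketch, assuming the whole file is valid Python made of top-level `name = <literal>` assignments (which may not hold for `A3-3-Data.txt`), parses the text with `ast` and evaluates only literals, so no arbitrary code is executed. The helper name `load_assignments` and the sample string are hypothetical.

```python
import ast


def load_assignments(source):
    """Return {name: value} for every top-level `name = <literal>` statement.

    Assumption: `source` is syntactically valid Python. Only literal values
    (lists, tuples, strings, numbers, ...) are accepted by ast.literal_eval.
    """
    values = {}
    for node in ast.parse(source).body:
        is_simple_assignment = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if is_simple_assignment:
            values[node.targets[0].id] = ast.literal_eval(node.value)
    return values


# Hypothetical sample with the same shape as the data shown in the output above
sample = """
training_data = [[('John', 'PERSON'), ('works', 'O')]]
test_data = [["Bill", "Gates", "founded", "Microsoft"]]
"""

data = load_assignments(sample)
print(data["training_data"])
print(data["test_data"])
```

With the real file, `load_assignments(content)` would replace both `exec` calls, provided the assumption about the file layout holds.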

##### Import libraries and modules

In [2]:
from nltk.tag import pos_tag
from nltk import ne_chunk
import spacy
import stanza
from transformers import pipeline, BertConfig   
from sklearn.metrics import classification_report
from collections import defaultdict
from nltk.tree import Tree
import warnings
warnings.filterwarnings("ignore")

In [3]:
# Initialize the stanza pipeline for English
stanza.download('en')  # download English model
nlp_stanza = stanza.Pipeline('en', processors='tokenize,ner')

# Load the English language model
nlp_spacy = spacy.load("en_core_web_sm")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-06-03 07:40:43 INFO: Downloaded file to C:\Users\prapt\stanza_resources\resources.json
2024-06-03 07:40:43 INFO: Downloading default packages for language: en (English) ...
2024-06-03 07:40:46 INFO: File exists: C:\Users\prapt\stanza_resources\en\default.zip
2024-06-03 07:40:52 INFO: Finished downloading models and saved to C:\Users\prapt\stanza_resources
2024-06-03 07:40:52 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-06-03 07:40:52 INFO: Downloaded file to C:\Users\prapt\stanza_resources\resources.json
2024-06-03 07:40:53 INFO: Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

2024-06-03 07:40:53 INFO: Using device: cpu
2024-06-03 07:40:53 INFO: Loading: tokenize
2024-06-03 07:40:53 INFO: Loading: mwt
2024-06-03 07:40:53 INFO: Loading: ner
2024-06-03 07:40:54 INFO: Done loading processors!


##### **STEP 1:**
##### Code from scratch to implement the Viterbi algorithm and a Hidden Markov Model

In [4]:
import numpy as np
import warnings
warnings.filterwarnings("ignore")
class MY_HMM:

    def __init__(self):
        self.transition_probs = {}
        self.emission_probs = {}
        self.tag_list = set()

    def HMM(self):
        transition_probs = {}  # nested dict of transition counts: transition_probs[current_tag][next_tag] = number of times next_tag follows current_tag
        emission_probs = {}  # nested dict of emission counts: emission_probs[tag][word] = number of times word occurs with tag
        tag_list = set()  # set to collect the unique tags from the given data
        
        # Count the transitions between tags (current_tag to next_tag) in training_data and store them in transition_probs
        for sentence in training_data:
            for i in range(len(sentence) - 1):
                current_tag, next_tag = sentence[i][1], sentence[i + 1][1]  # extract the current and next tags from consecutive pairs in a sentence
                if current_tag not in transition_probs:
                    transition_probs[current_tag] = {}  # first time this tag is seen as a current tag: create its inner dict
                if next_tag not in transition_probs[current_tag]:  # no transition from current tag to next tag recorded yet
                    transition_probs[current_tag][next_tag] = 0  # start its count at zero
                transition_probs[current_tag][next_tag] += 1  # increment the count of transitions from current tag to next tag
        # For the first sentence the counts added are: 'PERSON' to 'O' (transition_probs['PERSON']['O'] = 1), 'O' to 'O' (transition_probs['O']['O'] = 1) and 'O' to 'ORGANIZATION' (transition_probs['O']['ORGANIZATION'] = 1)

        # Count the emissions, that is the number of occurrences of each word with a particular tag, and store them in emission_probs
        # Iterate over each sentence in the training data and over each (word, tag) pair
        for sentence in training_data:
            for word, tag in sentence:
                # If the tag is seen for the first time, add it with an empty inner dictionary
                if tag not in emission_probs:
                    emission_probs[tag] = {}
                # If the word has not been seen with this tag, initialize its count to 0; then increment by 1 for every occurrence
                if word not in emission_probs[tag]:
                    emission_probs[tag][word] = 0
                emission_probs[tag][word] += 1

        # get unique tags 
        for sent in sentence_tags:
            tags = sent.split()
            tag_list.update(tags)

        return list(tag_list), transition_probs, emission_probs
    

# Compute the probability of transitioning from the current tag to the next tag, given the transition counts and the total number of unique tags
def calculate_transition_probability(current_tag, next_tag, transition_probs, total_tags):
    # Handle both existing and non-existing (unseen) transitions
    # Check if the current tag has any transitions in the transition_probs dictionary
    if current_tag in transition_probs:
        # Check if the transition to the next tag is in the transition_probs dictionary
        if next_tag in transition_probs[current_tag]:
            # If the transition exists, retrieve its count
            count_transition = transition_probs[current_tag][next_tag]
            # Calculate the transition probability with Laplace (add-one) smoothing, so that unseen transitions do not get probability 0
            probability = (count_transition + 1) / (sum(transition_probs[current_tag].values()) + total_tags)
        # If the transition to the next tag is not in the transition_probs dictionary, use the smoothed probability for a zero count
        else:
            # Assign a default probability for unseen transitions
            probability = 1 / (sum(transition_probs[current_tag].values()) + total_tags)
    else:
        # Assign a uniform default probability if the current tag has no transitions in the transition_probs dictionary
        probability = 1 / total_tags
    return probability

# Method to calculate and print the transition probability
def check_transition_probability(current_tag, next_tag, transition_probs,tag_list ):
    if current_tag in transition_probs and next_tag in transition_probs[current_tag]:
        count_transition = transition_probs[current_tag][next_tag]
        total_count_current_tag = sum(transition_probs[current_tag].values())
        probability = (count_transition + 1) / (total_count_current_tag + len(tag_list))
        print(f"Probability of transition from '{current_tag}' to '{next_tag}': {probability:.4f}")
    else:
        print(f"Transition from '{current_tag}' to '{next_tag}' not found.")

# Implement the Viterbi algorithm for sequence tagging using an HMM
def viterbi():
    tag_list, transition_probs, emission_probs = mm.HMM()
    # Loop over each key (current tag) in transition_probs
    print("Transitions: ")
    for curr_tag, next_tags in transition_probs.items():
        print(f"Transition probabilities from '{curr_tag}':")

        # Loop over each next_tag and its count in the current tag's dictionary
        for next_tag, count in next_tags.items():
            print(f"  '{next_tag}': {count}")

        print()  
    
    # # Function to print the emission probabilities dictionary in a formatted way
    # for tag, words_dict in emission_probs.items():
    #     print(f"Tag: {tag}")
    #     for word, count in words_dict.items():
    #         print(f"  '{word}': {count}")
    #     print()  # Add a blank line for readability

    number_of_tag = len(emission_probs) # total number of tags in the training data
    predicted_tags = [] # list to store the predicted tag sequences for each of the input sentences from the test data
 
    # iterate over each sentence in the test data
    for sentence in test_data:
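        # Create a list of the words in the sentence, converted to lowercase
        # (emission_probs keeps the original casing of the training words, so capitalized training words are not matched after lowercasing)
        words = [word.lower() for word in sentence]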

        # Initialize the matrix of all zeroes of the dimensions of total tags and the words in the sentence
        viterbi_matrix = np.zeros((number_of_tag, len(words)))

        # Initialize backpointer matrix to all zeroes with the dimensions same as matrix to store the backpointers for the best previous state
        backpointers = np.zeros((number_of_tag, len(words)), dtype=int)

        # iterate over the words from the word list made from the current input sentence
        for w in range(len(words)):
            # iterate through each tag from the unique tags list
            for i in range(len(tag_list)):
                # Emission probability of the current word given the current tag; set to zero here and computed below
                test_emission_prob = 0
                # Array of zeros, one entry per tag, to hold the transition probabilities from the current tag to every tag in the list
                test_transition_probs = np.zeros(len(tag_list))

                # Retrieve the word at index w from the word list of the current input sentence
                current_token = words[w]
                # Retrieve the tag at index i from the tag list; this is the tag being considered for word w
                current_tag = tag_list[i]

                # Calculate the emission probability of current_token for current_tag
                # Check whether the current token was seen with this tag in emission_probs
                if current_token in emission_probs[current_tag]:
                    # Seen word: calculate the emission probability with Laplace smoothing
                    test_emission_prob = (emission_probs[current_tag][current_token] + 1) / (sum(emission_probs[current_tag].values()) + number_of_tag)
                else:
                    # Unseen word: assign the smoothed default probability
                    test_emission_prob = 1 / (sum(emission_probs[current_tag].values()) + number_of_tag)

                # Run the Viterbi recursion
                if w == 0:  # For the first word there is no previous column to extend
                    # Array of zeros to hold the transition probabilities from the current tag to each possible next tag
                    test_transition_probs = np.zeros(len(tag_list))
                    for j in range(len(tag_list)):
                        next_tag = tag_list[j]
                        test_transition_probs[j] = calculate_transition_probability(current_tag, next_tag, transition_probs, number_of_tag)

                    # Initial value for the first word: the current tag's self-transition probability (used here as the starting probability) times the emission probability
                    viterbi_matrix[i][w] = test_transition_probs[i] * test_emission_prob

                # For the remaining words of the input sentence, calculate the transition probabilities
                else:
                    for j in range(len(tag_list)):
                        next_tag = tag_list[j]
                        test_transition_probs[j] = calculate_transition_probability(current_tag, next_tag, transition_probs, number_of_tag)

                    # Multiply the previous column of probabilities in the Viterbi matrix by the transition probabilities
                    probabilities = viterbi_matrix[:, w - 1] * test_transition_probs
                    # Store the index of the highest probability as the backpointer, used later to backtrack along the best path
                    backpointers[i][w] = np.argmax(probabilities)
                    # Set the Viterbi matrix entry for the current tag and word to the maximum probability found times the emission probability
                    viterbi_matrix[i][w] = np.max(probabilities) * test_emission_prob

        # Backtrack to retrieve the best tag sequence; best_path stores it for the current input sentence
        best_path = []
        # Maximum probability in the last column of the Viterbi matrix, that is the highest probability of any tag sequence ending at the last word of the sentence
        best_path_prob = np.max(viterbi_matrix[:, -1])
        # Find the best final state: the tag index with the maximum Viterbi score in the last column
        best_final_state = np.argmax(viterbi_matrix[:, -1])
        # Append the tag of best_final_state, the most probable tag for the last word of the sentence
        best_path.append(tag_list[best_final_state])

        # Iterate backward through the words of the sentence, from the last word to the second, to reconstruct the best tag sequence using the backpointers matrix
        for w in range(len(words) - 1, 0, -1):
            # Move best_final_state to the previous state that led to the current best_final_state at position w
            best_final_state = backpointers[best_final_state][w]
            # Insert the tag of that previous state at the beginning of best_path
            best_path.insert(0, tag_list[best_final_state])
        # Store the reconstructed tag sequence, which maximizes the probability of the entire tag sequence for the current test sentence
        predicted_tags.append(best_path)

    return predicted_tags


sentence_tags = []

for sentence in training_data:
    sentence_tag = " ".join([tag for _, tag in sentence])
    sentence_tags.append(sentence_tag)

mm = MY_HMM()
# Run the viterbi function and print the predicted tags
predicted_tags = viterbi()
print(predicted_tags)
# test_data = [
#     ["Bill", "Gates", "founded", "Microsoft"]
# ]
for i, sentence in enumerate(test_data):
    print("Sentence {}:".format(i + 1))
    print("Words:", sentence)
    print("Predicted Tags:", predicted_tags[i])
    print()


predicted_labels_step1 = predicted_tags

Transitions: 
Transition probabilities from 'PERSON':
  'O': 8
  'PERSON': 7

Transition probabilities from 'O':
  'O': 56
  'ORGANIZATION': 5
  'LOCATION': 12
  'TITLE': 3
  'PERSON': 1

Transition probabilities from 'ORGANIZATION':
  'O': 7
  'LOCATION': 1
  'ORGANIZATION': 1

Transition probabilities from 'TITLE':
  'O': 3

Transition probabilities from 'LOCATION':
  'LOCATION': 4
  'O': 4

[['PERSON', 'PERSON', 'PERSON', 'PERSON'], ['O', 'LOCATION', 'LOCATION', 'O', 'O', 'TITLE'], ['LOCATION', 'LOCATION', 'LOCATION', 'LOCATION', 'LOCATION', 'LOCATION', 'O', 'TITLE'], ['O', 'LOCATION', 'LOCATION', 'LOCATION', 'LOCATION', 'O', 'TITLE'], ['LOCATION', 'LOCATION', 'O', 'O', 'TITLE', 'TITLE', 'TITLE', 'TITLE'], ['O', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON', 'PERSON'], ['O', 'TITLE', 'TITLE', 'TITLE', 'TITLE', 'TITLE'], ['O', 'O', 'O', 'O', 'O', 'O', 'TITLE'], ['O', 'O', 'O', 'TITLE', 'TITLE', 'TITLE'], ['O', 'LOCATION', 'LOCATION', 'LOCATION', 'LOCATION', 'O'
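For comparison, here is a compact textbook-style Viterbi decoder. It is a sketch under stated assumptions, not the cell above rewritten: it counts sentence-initial tags for the starting distribution, scores each step with the transition from the previous tag to the current tag, works in log space to avoid underflow, and smooths emissions over the vocabulary size plus one slot for unknown words. The names `train_counts`, `smoothed_log` and `viterbi_decode`, and the two-sentence toy training set, are illustrative.

```python
import math


def train_counts(tagged_sentences):
    """Count sentence-initial tags, tag-to-tag transitions and tag-to-word emissions."""
    initial, transitions, emissions = {}, {}, {}
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                initial[tag] = initial.get(tag, 0) + 1
            else:
                row = transitions.setdefault(prev, {})
                row[tag] = row.get(tag, 0) + 1
            row = emissions.setdefault(tag, {})
            row[word] = row.get(word, 0) + 1
            prev = tag
    return initial, transitions, emissions


def smoothed_log(count, total, n_outcomes):
    """Log of the add-one (Laplace) smoothed probability (count + 1) / (total + n_outcomes)."""
    return math.log((count + 1) / (total + n_outcomes))


def viterbi_decode(words, tags, initial, transitions, emissions, vocab_size):
    """Return the most probable tag sequence for a non-empty list of words."""
    n_tags = len(tags)
    init_total = sum(initial.values())

    def emit(tag, word):
        row = emissions.get(tag, {})
        # Assumption: one extra outcome is reserved for unknown words
        return smoothed_log(row.get(word, 0), sum(row.values()), vocab_size + 1)

    def trans(prev, cur):
        row = transitions.get(prev, {})
        return smoothed_log(row.get(cur, 0), sum(row.values()), n_tags)

    # scores[t][tag] = best log-probability of any path ending in `tag` at position t
    scores = [{t: smoothed_log(initial.get(t, 0), init_total, n_tags) + emit(t, words[0])
               for t in tags}]
    back = [{}]
    for word in words[1:]:
        column, pointers = {}, {}
        for cur in tags:
            best_prev = max(tags, key=lambda p: scores[-1][p] + trans(p, cur))
            column[cur] = scores[-1][best_prev] + trans(best_prev, cur) + emit(cur, word)
            pointers[cur] = best_prev
        scores.append(column)
        back.append(pointers)

    # Backtrack from the best final tag to the first position
    path = [max(tags, key=lambda t: scores[-1][t])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy training set in the same (word, tag) format as training_data
training = [
    [("John", "PERSON"), ("works", "O"), ("at", "O"), ("Apple", "ORGANIZATION")],
    [("Alice", "PERSON"), ("studies", "O"), ("at", "O"), ("Stanford", "ORGANIZATION")],
]
initial, transitions, emissions = train_counts(training)
tags = sorted(emissions)
vocab = {word for sentence in training for word, _ in sentence}
predicted = viterbi_decode("John works at Apple".split(), tags, initial, transitions, emissions, len(vocab))
print(predicted)
```

The same smoothing formula reproduces the notebook's transition estimate: with the printed counts for 'PERSON' (8 to 'O', 7 to 'PERSON') and 5 tags, `math.exp(smoothed_log(8, 15, 5))` gives (8 + 1) / (15 + 5) = 0.45.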

##### **STEP 2:**
##### Use the NLTK ne_chunk() function for NER

In [5]:
def perform_nltk_ner():
    ner_tags = []
    for sentence in test_data:
        # Part-of-speech tagging using pos-tag
        pos_tags = pos_tag(sentence) 
        # Perform named entity recognition (NER) using ne_chunk()
        tags = ne_chunk(pos_tags)
        ner_tags.append(tags)
    return ner_tags

print("Output using NLTK ne_chunk() function for NER: ")
ner_tags = perform_nltk_ner()
for tag in ner_tags:
    print(tag)
print("\n")
print(ner_tags)

Output using NLTK ne_chunk() function for NER: 
(S
  (PERSON Bill/NNP)
  (PERSON Gates/NNP)
  founded/VBD
  (PERSON Microsoft/NNP))
(S
  The/DT
  (ORGANIZATION Louvre/NNP Museum/NNP)
  is/VBZ
  in/IN
  (GPE Paris/NNP))
(S
  (PERSON Mount/NNP)
  (ORGANIZATION Fuji/NNP)
  is/VBZ
  a/DT
  famous/JJ
  landmark/NN
  in/IN
  (GPE Japan/NNP))
(S
  The/DT
  (ORGANIZATION United/NNP Nations/NNP)
  was/VBD
  formed/VBN
  in/IN
  1945/CD)
(S
  (PERSON Shakira/NNP)
  performed/VBD
  at/IN
  the/DT
  (ORGANIZATION Super/NNP Bowl/NNP)
  halftime/NN
  show/NN)
(S
  The/DT
  (ORGANIZATION Nobel/NNP Peace/NNP Prize/NNP)
  was/VBD
  awarded/VBN
  to/TO
  (PERSON Malala/NNP Yousafzai/NNP))
(S
  The/DT
  (ORGANIZATION Amazon/NNP River/NNP)
  flows/VBZ
  through/IN
  (PERSON Brazil/NNP))
(S
  The/DT
  (ORGANIZATION Pyramids/NNP)
  of/IN
  (GPE Giza/NNP)
  are/VBP
  in/IN
  (GPE Egypt/NNP))
(S (GPE Rome/NNP) is/VBZ the/DT capital/NN of/IN (GPE Italy/NNP))
(S
  The/DT
  (ORGANIZATION Great/NNP Wall/NNP)
  of

Convert the tags in each tree to a flat list of tags, and change the notation so that it matches the true labels used for evaluation.

In [6]:
# convert tags so that they are similar in format with true labels
def ner_tags_converted(entity):
    location_entities = {'LOCATION', 'GPE'}
    if entity == 'PERSON':
        return 'PERSON'
    elif entity in location_entities:
        return 'LOCATION'
    elif entity == 'ORGANIZATION':
        return 'ORGANIZATION'
    else:
        return 'O'

# Define a function to convert a tree to predicted labels
def tree_to_predicted_labels(tree):
    if isinstance(tree, Tree):
        # Extract the label of the current tree node
        label = tree.label()
        
        if label in ['PERSON', 'LOCATION', 'ORGANIZATION', 'GPE']:
            # If the label is an entity type, map it and return
            return [ner_tags_converted(label)] * len(tree.leaves())
        else:
            # If the label is not an entity type, recursively process its children
            predicted_labels = []
            for subtree in tree:
                predicted_labels.extend(tree_to_predicted_labels(subtree))
            return predicted_labels
    else:
        # If the input is not a tree, return a list of 'O' labels
        return ['O']

# Convert each tree to predicted labels
predicted_labels_step2 = [tree_to_predicted_labels(tree) for tree in ner_tags]
print(predicted_labels_step2)

[['PERSON', 'PERSON', 'O', 'PERSON'], ['O', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O', 'LOCATION'], ['PERSON', 'ORGANIZATION', 'O', 'O', 'O', 'O', 'O', 'LOCATION'], ['O', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O', 'O', 'O'], ['PERSON', 'O', 'O', 'O', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O'], ['O', 'ORGANIZATION', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O', 'O', 'PERSON', 'PERSON'], ['O', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O', 'PERSON'], ['O', 'ORGANIZATION', 'O', 'LOCATION', 'O', 'O', 'LOCATION'], ['LOCATION', 'O', 'O', 'O', 'O', 'LOCATION'], ['O', 'ORGANIZATION', 'ORGANIZATION', 'O', 'LOCATION', 'O', 'O', 'O', 'O', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O', 'O']]


##### **STEP 3:**
##### Use the spaCy nlp() function for NER

In [7]:
def process_sentence():
    for sentence in test_data:
        text = " ".join(sentence)
        # Process the text with spaCy
        doc_spacy = nlp_spacy(text)
        # Extract named entities with their labels
        entities_spacy = [(ent.text, ent.label_) for ent in doc_spacy.ents]
        print(entities_spacy)

print("Output using spacy nlp() function for NER: ")
process_sentence()

Output using spacy nlp() function for NER: 
[('Bill Gates', 'PERSON'), ('Microsoft', 'ORG')]
[('The Louvre Museum', 'ORG'), ('Paris', 'GPE')]
[('Mount Fuji', 'LOC'), ('Japan', 'GPE')]
[('The United Nations', 'ORG'), ('1945', 'DATE')]
[('Shakira', 'PERSON'), ('the Super Bowl', 'EVENT')]
[('The Nobel Peace Prize', 'WORK_OF_ART'), ('Malala Yousafzai', 'PERSON')]
[('Amazon River', 'LOC'), ('Brazil', 'GPE')]
[('Giza', 'PERSON'), ('Egypt', 'GPE')]
[('Rome', 'GPE'), ('Italy', 'GPE')]
[('The Great Wall of China', 'FAC'), ('one', 'CARDINAL'), ('Seven', 'CARDINAL')]


In [8]:
# Build the per-token label lists by hand, because each module and function uses its own tag notation
# Converted tags: Step 3 (spaCy nlp())
# PERSON, DATE and EVENT do not need to be converted
# Convert LOC, FAC, GPE => LOCATION; convert ORG => ORGANIZATION; convert WORK_OF_ART, CARDINAL => TITLE

predicted_labels_step3 = [
    ['PERSON', 'PERSON', 'O', 'ORGANIZATION'],
    ['ORGANIZATION', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O', 'LOCATION'],
    ['LOCATION', 'LOCATION', 'O', 'O', 'O', 'O', 'O', 'LOCATION'],
    ['ORGANIZATION','ORGANIZATION','ORGANIZATION','O','O','O','DATE'],
    ['PERSON', 'O', 'O','EVENT','EVENT','EVENT','O','O'],
    ['TITLE','TITLE','TITLE','TITLE','O','O','O','PERSON','PERSON'],
    ['O','LOCATION','LOCATION','O','O','LOCATION'],
    ['O','O','O','PERSON','O','O','LOCATION'],
    ['LOCATION', 'O', 'O', 'O', 'O', 'LOCATION'],
    ['LOCATION', 'LOCATION', 'LOCATION', 'LOCATION', 'LOCATION', 'O', 'TITLE', 'O', 'O', 'TITLE', 'TITLE', 'O', 'O', 'TITLE']
]
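The label lists above are typed by hand from the printed entities. A minimal sketch of doing the same conversion programmatically is shown below, assuming each entity text splits on whitespace into exactly the tokens of the test sentence (true when the text was built with `" ".join(sentence)` and contains no attached punctuation). The names `LABEL_MAP` and `entities_to_token_labels` are hypothetical; the mapping follows the comment in the cell above.

```python
# Mapping taken from the conversion comment above; unlisted labels pass through unchanged
LABEL_MAP = {
    "LOC": "LOCATION", "FAC": "LOCATION", "GPE": "LOCATION",
    "ORG": "ORGANIZATION",
    "WORK_OF_ART": "TITLE", "CARDINAL": "TITLE",
}


def entities_to_token_labels(tokens, entities, label_map=LABEL_MAP):
    """Turn (entity_text, label) pairs into one label per token.

    Entities are matched left to right, so they must be listed in sentence order.
    Tokens not covered by any entity get the label 'O'.
    """
    labels = ["O"] * len(tokens)
    search_from = 0
    for text, label in entities:
        span = text.split()
        for i in range(search_from, len(tokens) - len(span) + 1):
            if tokens[i:i + len(span)] == span:
                labels[i:i + len(span)] = [label_map.get(label, label)] * len(span)
                search_from = i + len(span)
                break
    return labels


# Entity pairs copied from the spaCy output printed above
sentence_1 = ["Bill", "Gates", "founded", "Microsoft"]
labels_1 = entities_to_token_labels(sentence_1, [("Bill Gates", "PERSON"), ("Microsoft", "ORG")])

sentence_4 = "The United Nations was formed in 1945".split()
labels_4 = entities_to_token_labels(sentence_4, [("The United Nations", "ORG"), ("1945", "DATE")])

print(labels_1)
print(labels_4)
```

Because stanza prints entities in the same `(text, type)` form, the same helper could be applied to its output as well.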

##### **STEP 4:**
##### Use the Huggingface/Transformers’ ner pipeline function for NER

In [9]:
sentences = [' '.join(words) for words in test_data]
def ner_pipeline():
    # Load NER pipeline with the specified model
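    # aggregation_strategy="simple" groups sub-word tokens into whole entities (it replaces the deprecated grouped_entities=True)
    ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")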
    print("Output using Huggingface/Transformers’ ner pipeline function for NER: ")
    for i, sentence in enumerate(sentences):
        entities_dict = ner(sentence)
        print(f"Test sentence: {i+1}")
        print(sentence)
        print("Entities: ", entities_dict)
        for entity in entities_dict:
            print(f"Word: {entity['word']}, Entity: {entity['entity_group']}, Score: {entity['score']:.2f}")
        print("\n")

ner_pipeline()

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Output using Huggingface/Transformers’ ner pipeline function for NER: 
Test sentence: 1
Bill Gates founded Microsoft
Entities:  [{'entity_group': 'PER', 'score': 0.9970532, 'word': 'Bill Gates', 'start': 0, 'end': 10}, {'entity_group': 'ORG', 'score': 0.99925035, 'word': 'Microsoft', 'start': 19, 'end': 28}]
Word: Bill Gates, Entity: PER, Score: 1.00
Word: Microsoft, Entity: ORG, Score: 1.00


Test sentence: 2
The Louvre Museum is in Paris
Entities:  [{'entity_group': 'ORG', 'score': 0.86077875, 'word': 'Louvre', 'start': 4, 'end': 10}, {'entity_group': 'LOC', 'score': 0.47728878, 'word': 'Museum', 'start': 11, 'end': 17}, {'entity_group': 'LOC', 'score': 0.99909127, 'word': 'Paris', 'start': 24, 'end': 29}]
Word: Louvre, Entity: ORG, Score: 0.86
Word: Museum, Entity: LOC, Score: 0.48
Word: Paris, Entity: LOC, Score: 1.00


Test sentence: 3
Mount Fuji is a famous landmark in Japan
Entities:  [{'entity_group': 'LOC', 'score': 0.5602856, 'word': 'Mount', 'start': 0, 'end': 5}, {'entity_g

In [10]:
# Making lists of labels for each of the steps as each modules and functions used have different notations for tags
# Converted tags : Step 4 (HuggingFace/Transformers) 
# Convert PER => PERSON, Convert LOC => LOCATION, Convert ORG => ORGANIZATION, Convert MISC => TITLE

predicted_labels_step4 = [['PERSON', 'PERSON', 'O', 'ORGANIZATION'],
    ['O', 'ORGANIZATION', 'LOCATION', 'O', 'O', 'LOCATION'],
    ['LOCATION', 'PERSON', 'O', 'O', 'O', 'O', 'O', 'LOCATION'],
    ['O', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O', 'O', 'O'],
    ['PERSON', 'O', 'O', 'O', 'TITLE', 'TITLE', 'O', 'O'],
    ['O', 'TITLE', 'TITLE', 'TITLE', 'O', 'O', 'O', 'PERSON', 'PERSON'],
    ['O', 'LOCATION', 'LOCATION', 'O', 'O', 'LOCATION'],
    ['O', 'LOCATION', 'LOCATION', 'LOCATION', 'O', 'O', 'LOCATION'],
    ['LOCATION', 'O', 'O', 'O', 'O', 'LOCATION'],
    ['O', 'LOCATION', 'LOCATION', 'LOCATION', 'LOCATION', 'O', 'O', 'O', 'O', 'TITLE', 'TITLE', 'TITLE', 'TITLE', 'TITLE']
]
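Since the pipeline output includes character offsets (`start`, `end`), the per-token labels can also be derived from them instead of being typed by hand. The sketch below assumes each sentence was built with `' '.join(words)`, as in the cell that defines `sentences`, so token offsets can be recomputed from token lengths. The names `HF_LABEL_MAP` and `offsets_to_token_labels` are hypothetical.

```python
# Mapping taken from the conversion comment above
HF_LABEL_MAP = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION", "MISC": "TITLE"}


def offsets_to_token_labels(tokens, entities, label_map=HF_LABEL_MAP):
    """Label each token with the first entity whose character span overlaps it.

    Assumption: the text given to the pipeline was ' '.join(tokens), so every
    token is separated from the next by exactly one space.
    """
    labels = []
    position = 0
    for token in tokens:
        token_start, token_end = position, position + len(token)
        label = "O"
        for entity in entities:
            if entity["start"] < token_end and entity["end"] > token_start:
                group = entity["entity_group"]
                label = label_map.get(group, group)
                break
        labels.append(label)
        position = token_end + 1  # skip the single joining space
    return labels


# Entities copied from the printed output for test sentence 2 (scores omitted)
tokens_2 = "The Louvre Museum is in Paris".split()
entities_2 = [
    {"entity_group": "ORG", "word": "Louvre", "start": 4, "end": 10},
    {"entity_group": "LOC", "word": "Museum", "start": 11, "end": 17},
    {"entity_group": "LOC", "word": "Paris", "start": 24, "end": 29},
]
labels_2 = offsets_to_token_labels(tokens_2, entities_2)
print(labels_2)
```

For this sentence the result matches the second row of `predicted_labels_step4`.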

##### **STEP 5:** 
##### Use the stanza’s ner pipeline for NER

In [11]:
def extract_entities(sentences):
    extracted_entities = []
    for sentence in sentences:
        # Join words to form a sentence
        text = ' '.join(sentence)
        # Process the text with the NER pipeline
        doc = nlp_stanza(text)
        entities = []
        for ent in doc.ents:
            # Append entity text and its type
            entities.append((ent.text, ent.type))
        extracted_entities.append(entities)
    for entities in extracted_entities:
        print(entities)

extract_entities(test_data)


[('Bill Gates', 'PERSON'), ('Microsoft', 'ORG')]
[('The Louvre Museum', 'FAC'), ('Paris', 'GPE')]
[('Mount Fuji', 'LOC'), ('Japan', 'GPE')]
[('The United Nations', 'ORG'), ('1945', 'DATE')]
[('Shakira', 'PERSON'), ('the Super Bowl', 'EVENT')]
[('The Nobel Peace Prize', 'WORK_OF_ART'), ('Malala Yousafzai', 'PERSON')]
[('The Amazon River', 'LOC'), ('Brazil', 'GPE')]
[('The Pyramids of Giza', 'PERSON'), ('Egypt', 'GPE')]
[('Rome', 'GPE'), ('Italy', 'GPE')]
[('China', 'GPE'), ('Seven', 'CARDINAL')]


In [12]:
# Build the per-token label lists by hand, because each module and function uses its own tag notation
# Converted tags: Step 5 (stanza)
# PERSON, DATE and EVENT do not need to be converted
# Convert LOC, FAC, GPE => LOCATION; convert ORG => ORGANIZATION; convert WORK_OF_ART, CARDINAL => TITLE

predicted_labels_step5 = [
    ['PERSON', 'PERSON', 'O', 'ORGANIZATION'],
    ['ORGANIZATION', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O', 'LOCATION'],
    ['LOCATION', 'LOCATION', 'O', 'O', 'O', 'O', 'O', 'LOCATION'],
    ['ORGANIZATION','ORGANIZATION','ORGANIZATION','O','O','O','DATE'],
    ['PERSON', 'O', 'O','EVENT','EVENT','EVENT','O','O'],
    ['TITLE','TITLE','TITLE','TITLE','O','O','O','PERSON','PERSON'],
    ['LOCATION','LOCATION','LOCATION','O','O','LOCATION'],
    ['PERSON','PERSON','PERSON','PERSON','O','O','LOCATION'],
    ['LOCATION', 'O', 'O', 'O', 'O', 'LOCATION'],
    ['O', 'O', 'O', 'O', 'LOCATION', 'O', 'O', 'O', 'O', 'TITLE', 'O', 'O', 'O', 'O']
]

##### **STEP 6:**
##### Evaluate the performance of all the above steps in terms of accuracy, precision and recall, either as a whole (micro-averaging), or category by category (macro-averaging).

In [13]:
# True labels used to evaluate Steps 1 to 5 above
true_labels = [
    ['PERSON', 'PERSON', 'O', 'ORGANIZATION'],
    ['O', 'LOCATION', 'LOCATION', 'O', 'O', 'LOCATION'],
    ['LOCATION', 'LOCATION', 'O', 'O', 'O', 'O', 'O', 'LOCATION'],
    ['LOCATION','LOCATION','LOCATION','O','O','O','DATE'],
    ['PERSON', 'O', 'O','O','EVENT','EVENT','O','O'],
    ['O','TITLE','TITLE','TITLE','O','O','O','PERSON','PERSON'],
    ['O','LOCATION','LOCATION','O','O','LOCATION'],
    ['O','LOCATION','LOCATION','LOCATION','O','O','LOCATION'],
    ['LOCATION', 'O', 'O', 'TITLE', 'O', 'LOCATION'],
    ['O', 'LOCATION', 'LOCATION', 'O', 'LOCATION', 'O', 'TITLE', 'O', 'O', 'TITLE', 'TITLE', 'O', 'O', 'TITLE']
]

In [14]:
predictions_combined = []

# List of predicted labels from different steps
prediction_lists_all_steps = [predicted_labels_step1, predicted_labels_step2, predicted_labels_step3, predicted_labels_step4, predicted_labels_step5]

# Extend predictions_combined with all the lists in prediction_lists_all_steps
predictions_combined.extend(prediction_lists_all_steps)

# evaluating performance using micro averaging
def micro_averaging(true_labels, pred_labels):
    # calculating total number of labels for all the sentences in the test data
    total_labels = sum(len(true) for true in true_labels)
    # Count the number of correct predictions (true_preds) where the true label matches predicted label
    true_preds = sum(sum(1 for t, p in zip(true, pred) if t == p) for true, pred in zip(true_labels, pred_labels))
    # Count the number of true positive predictions (true_positive_preds) excluding 'O' labels
    true_positive_preds = sum(sum(1 for t, p in zip(true, pred) if t == p != 'O') for true, pred in zip(true_labels, pred_labels))
    # Count the number of predicted positive labels (pred_positives) excluding 'O' labels
    pred_positives = sum(sum(1 for p in pred if p != 'O') for pred in pred_labels)
    # Count the number of actual positive labels (actual_positives) excluding 'O' labels
    actual_positives = sum(1 for true in true_labels for t in true if t != 'O')

    # calculating the accuracy - ratio of correct predictions(true_preds) to the total number of true labels(total_labels)
    accuracy = true_preds / total_labels
    # calculating precision - ratio of true positive predictions(true_positive_preds) to predicted positive labels(pred_positives),
    precision = true_positive_preds / pred_positives if pred_positives > 0 else 0 # handle cases where there are no predicted positive labels
    # calculating recall - ratio of true positive predictions(true_positive_preds) to actual positive labels(actual_positives),
    recall = true_positive_preds / actual_positives if actual_positives > 0 else 0 # handle cases where there are no actual positive labels
    
    return accuracy, precision, recall


#### **Printing the micro-averaged accuracy, precision and recall for the following steps:**
###### STEP 1: Code from scratch to implement the Viterbi algorithm and a Hidden Markov Model
###### STEP 2: Use the NLTK ne_chunk() function for NER
###### STEP 3: Use the spaCy nlp() function for NER
###### STEP 4: Use the Huggingface/Transformers’ ner pipeline function for NER
###### STEP 5: Use the stanza’s ner pipeline for NER

In [15]:
for step_num, prediction_label in enumerate(predictions_combined, start=1):
    print(f"\nPrediction for Step {step_num} : ")
    micro_avg_accuracy, micro_avg_precision, micro_avg_recall = micro_averaging(true_labels, prediction_label)
    print("Micro-average Accuracy:", micro_avg_accuracy)
    print("Micro-average Precision:", micro_avg_precision)
    print("Micro-average Recall:", micro_avg_recall)


Prediction for Step 1 : 
Micro-average Accuracy: 0.4666666666666667
Micro-average Precision: 0.30612244897959184
Micro-average Recall: 0.39473684210526316

Prediction for Step 2 : 
Micro-average Accuracy: 0.6533333333333333
Micro-average Precision: 0.375
Micro-average Recall: 0.3157894736842105

Prediction for Step 3 : 
Micro-average Accuracy: 0.8133333333333334
Micro-average Precision: 0.725
Micro-average Recall: 0.7631578947368421

Prediction for Step 4 : 
Micro-average Accuracy: 0.8266666666666667
Micro-average Precision: 0.7567567567567568
Micro-average Recall: 0.7368421052631579

Prediction for Step 5 : 
Micro-average Accuracy: 0.7466666666666667
Micro-average Precision: 0.6486486486486487
Micro-average Recall: 0.631578947368421
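The Step 6 heading also mentions category-by-category (macro) averaging, which the cells above do not compute. A minimal sketch follows, assuming the same nested-list label format, the same exclusion of 'O' from precision and recall, and the convention (an assumption) that a class with no predictions or no true instances scores 0. The function name `macro_averaging` is hypothetical.

```python
from collections import Counter


def macro_averaging(true_labels, pred_labels, ignore="O"):
    """Per-class precision/recall and their unweighted (macro) averages."""
    true_positive, predicted, actual = Counter(), Counter(), Counter()
    for true, pred in zip(true_labels, pred_labels):
        for t, p in zip(true, pred):
            if p != ignore:
                predicted[p] += 1
            if t != ignore:
                actual[t] += 1
            if t == p and t != ignore:
                true_positive[t] += 1

    # Every class that appears in either the true or the predicted labels
    classes = sorted(set(predicted) | set(actual))
    per_class = {}
    for label in classes:
        precision = true_positive[label] / predicted[label] if predicted[label] else 0.0
        recall = true_positive[label] / actual[label] if actual[label] else 0.0
        per_class[label] = (precision, recall)

    n_classes = len(per_class)
    macro_precision = sum(p for p, _ in per_class.values()) / n_classes if n_classes else 0.0
    macro_recall = sum(r for _, r in per_class.values()) / n_classes if n_classes else 0.0
    return per_class, macro_precision, macro_recall


# First two sentences of true_labels and predicted_labels_step4 from the cells above
true_demo = [
    ["PERSON", "PERSON", "O", "ORGANIZATION"],
    ["O", "LOCATION", "LOCATION", "O", "O", "LOCATION"],
]
pred_demo = [
    ["PERSON", "PERSON", "O", "ORGANIZATION"],
    ["O", "ORGANIZATION", "LOCATION", "O", "O", "LOCATION"],
]
per_class, macro_p, macro_r = macro_averaging(true_demo, pred_demo)
print(per_class)
print("Macro-average Precision:", macro_p)
print("Macro-average Recall:", macro_r)
```

Unlike micro-averaging, each class contributes equally here, so a rare class such as TITLE or EVENT can move the macro scores as much as the frequent LOCATION class.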
