# Evaluate NER on Book 4 

To assess the quality of NER annotation in Book 4, Precision, Recall, and F1 Score measurements were used against a Gold Standard. Among the NER systems we tested, Flair ner-large emerged as the top performer, achieving a high-quality annotation with an F1 Score of 0.945, along with strong Precision (0.952) and Recall (0.938) values.

However, Flair ner had a somewhat lower Recall (0.637), indicating instances where entities were missed. On a positive note, the spaCy-trf model also delivered promising results (Precision: 0.938, Recall: 0.938, F1 Score: 0.938).

During the calculation of True Positives, False Negatives, and False Positives, we took into account the possibility of partial matches between multi-word entities due to variations in entity boundaries. To handle this, we chose to evaluate annotations on a named-entity level rather than a token-level basis. For this compute see also: https://github.com/chakki-works/seqeval/tree/master.

It's worth pointing out that Flair ner-large failed to identify 119 place entities. While it did manage to detect 115 of these entities, they were not correctly labeled as places.

In [1]:
import pandas as pd

In [2]:
## open the Gold Standard of Book 4 (18,664 rows)
GoldStandard_Book4 = pd.read_excel("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Outputs/1.4.GoldStandard_Book4.xlsx")

In [3]:
len(GoldStandard_Book4)

18664

In [4]:
## open the output of Flair NER (18,664 rows)
NERs_Book4 = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Outputs/2.2.BIO_NER_Flair_spaCy_Book4.csv")

In [5]:
len(NERs_Book4)

18664

In [6]:
## append the Gold Standard to the dataset of the NER outputs
NERs_Book4['Manual_Annotation'] = GoldStandard_Book4['Manual_Annotation']

# Flair ner-large

In [7]:
## create a copy of the Flair ner-large column
NERs_Book4['BIO_Flair-large_copy'] = NERs_Book4['BIO_Flair-large']

## the function trasnform all the annotations not including LOC into O
def update_values(value):
    if 'LOC' not in value:
        return 'O'
    return value

NERs_Book4['BIO_Flair-large_copy'] = NERs_Book4['BIO_Flair-large_copy'].apply(update_values)

In [8]:
NERs_Book4['BIO_Flair-large_copy'].unique() ## the new column contains only O, B-LOC and I-LOC

array(['O', 'B-LOC', 'I-LOC'], dtype=object)

**Compute True Positive and False Negatives including partial matches**

In [9]:
True_Positives = [] ## create a list of true positives
False_Negatives = [] ## create a list of false negatives

In [10]:
for index, manual_annotation in enumerate(NERs_Book4['Manual_Annotation']): ## for each token in the Gold Standard
        
    if manual_annotation == 'B-LOC': ## for each B-LOC entity in the Gold Standard
        
        ## create a tuple containing the reference and start position
        reference_startpos = (NERs_Book4['Reference'][index], NERs_Book4['Start_pos'][index])
        
        if len(NERs_Book4['BIO_Flair-large_copy'][index]) > 1: ## if the NER system predicted a LOC entity for the token
            True_Positives.append(reference_startpos) ## it is a true positive
            
        else: ## if the NER system did not predict a LOC entity for the token
            
            if NERs_Book4['Manual_Annotation'][index+1] != 'I-LOC': ## if B-LOC is not followed by I-LOC
                False_Negatives.append(reference_startpos) ## it is a false negative
            
            else: ## if B-LOC is followed by I-LOC
                
                flag = False
                
                for n in range(1,100):
                    
                    if NERs_Book4['Manual_Annotation'][index+n] == 'I-LOC': ## inside the multi-word LOC entity
                        
                        if len(NERs_Book4['BIO_Flair-large_copy'][index+n]) > 1: ## the NER system predicted a LOC entity
                            True_Positives.append((NERs_Book4['Reference'][index+n], NERs_Book4['Start_pos'][index+n])) ## it is a true positive
                            flag = True
                            break
                            
                    else: break
                        
                if flag == False: ## no entity was predicted in the span
                    False_Negatives.append(reference_startpos) ## it is a false negative

Flair ner-large contains 1,811 True Positives.

In [11]:
len(True_Positives)

1811

Flair ner-large contains 119 False Negatives.

In [12]:
len(False_Negatives)

119

**Compute False Positives**

In [13]:
False_Positives = [] ## create a list of false positives

In [14]:
for index, Flair_nerlarge_annotation in enumerate(NERs_Book4['BIO_Flair-large_copy']):
        
    if Flair_nerlarge_annotation == 'B-LOC': ## for each B-LOC Flair annotation
        
        ## create a tuple containing the reference and start position
        reference_startpos = (NERs_Book4['Reference'][index], NERs_Book4['Start_pos'][index])
        
        if len(NERs_Book4['Manual_Annotation'][index]) == 1: ## if the Gold Standard does not contain an entity
            
            if NERs_Book4['BIO_Flair-large_copy'][index+1] != 'I-LOC': ## if B-LOC is not followed by I-LOC
                False_Positives.append(reference_startpos) ## it is a false positive
        
        else: ## if B-LOC is followed by I-LOC
            
            flag = False
            
            for n in range(1,100):
                
                if NERs_Book4['BIO_Flair-large_copy'][index+1] == 'I-LOC': ## inside the multi-word LOC annotation
                    
                    if len(NERs_Book4['Manual_Annotation'][index+n]) > 1: ## the Gold Standard contains an entity
                        flag = True
                        break
                        
                else: break
                        
                if flag == False:
                    False_Positives.append(reference_startpos) ## it is a false positive

Flair ner-large contains 91 False Positives.

In [15]:
len(False_Positives)

91

Flair ner-large had a Precision of 0.952.

In [16]:
## calculate Precision

Precision = len(True_Positives) / (len(True_Positives) + len(False_Positives))
Precision

0.9521556256572029

Flair ner-large had a Recall of 0.938.

In [17]:
## calculate Recall

Recall = len(True_Positives) / (len(True_Positives) + len(False_Negatives))
Recall

0.9383419689119171

Flair ner-large had a F1 Score of 0.945.

In [18]:
## calculate F1 Score

F1 = 2 * (Precision * Recall) / (Precision + Recall)
F1

0.9451983298538622

**Compute the cases in which NER predicts a named entity but assigns the label wrong**

In [19]:
Incorrect_Labels = []

for index, manual_annotation in enumerate(NERs_Book4['Manual_Annotation']): ## for each token in the Gold Standard
        
    if manual_annotation == 'B-LOC': ## for each B-LOC entity in the Gold Standard
        
        ## create a tuple containing the reference and start position
        reference_startpos = (NERs_Book4['Reference'][index], NERs_Book4['Start_pos'][index])
        
        if len(NERs_Book4['BIO_Flair-large'][index]) > 1: ## if an entity is predicted for the token            
            if 'LOC' not in str(NERs_Book4['BIO_Flair-large'][index]): ## if the entity label is not LOC 
                Incorrect_Labels.append(reference_startpos) ## it is an incorrect entity label
            
        if len(NERs_Book4['BIO_Flair-large'][index]) == 1: ## if no entity is predicted for the token 
            
            if NERs_Book4['Manual_Annotation'][index+1] == 'I-LOC':
                
                flag = False
                
                for n in range(1,100): ## for any natural number
                    
                    if NERs_Book4['Manual_Annotation'][index+n] == 'I-LOC':
                        
                        if len(NERs_Book4['BIO_Flair-large'][index+n]) > 1: ## if an entity is predicted for the token
                            
                            if 'LOC' not in str(NERs_Book4['BIO_Flair-large'][index+n]): ## if the entity is not annotated as LOC
                                if str(NERs_Book4['BIO_Flair-large'][index+n]).startswith('B-'):
                                    Incorrect_Labels.append((NERs_Book4['Reference'][index+n], NERs_Book4['Start_pos'][index+n])) ## it is an incorrect entity labelflag = True
                                
                    else: break

In [20]:
len(Incorrect_Labels)

115

# spaCy trf

In [21]:
NERs_Book4['BIO_spaCy-trf_copy'] = NERs_Book4['BIO_spaCy-trf']

## the function trasnform all the GPE annotations into LOC annotations
def update_labels(label):
    if 'B-GPE' in label:
        return 'B-LOC'
    if 'I-GPE' in label:
        return 'I-LOC'
    return label

NERs_Book4['BIO_spaCy-trf_copy'] = NERs_Book4['BIO_spaCy-trf_copy'].apply(update_labels)
NERs_Book4['BIO_spaCy-trf_copy'] = NERs_Book4['BIO_spaCy-trf_copy'].apply(update_values)

In [22]:
NERs_Book4['BIO_spaCy-trf_copy'].unique()

array(['O', 'B-LOC', 'I-LOC'], dtype=object)