Before starting this part, I noticed that the training data contained more than three entity types, so I removed every instance that had an entity other than ACQUIRER, ACQUIRED, or PRICE. 58 instances were removed, which left us with 4000 labeled instances to train our model.

In [None]:
import json

with open('ner_training_data.json', 'r') as file:
    data = json.load(file)

filtered_data = []
deleted_count = 0

for instance in data:
    text, entities = instance
    # Check if all entities are of the specified types
    if all(entity[2] in ['ACQUIRER', 'ACQUIRED', 'PRICE'] for entity in entities['entities']):
        filtered_data.append(instance)
    else:
        deleted_count += 1

with open('filtered_ner_training_data.json', 'w') as file:
    json.dump(filtered_data, file, indent=2)

print(f"Number of deleted instances: {deleted_count}")
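A quick way to see which labels triggered the removals is to count every label before filtering. The snippet below is a minimal sketch on a tiny made-up sample in the same `[text, {"entities": [...]}]` layout; the sample texts and the `DATE` label are hypothetical, since the notebook does not say which extra labels were present.

```python
from collections import Counter

ALLOWED_LABELS = {"ACQUIRER", "ACQUIRED", "PRICE"}

def count_entity_labels(data):
    """Count how often each entity label occurs across all instances."""
    counts = Counter()
    for _text, annotations in data:
        for _start, _end, label in annotations["entities"]:
            counts[label] += 1
    return counts

# Hypothetical sample, not taken from ner_training_data.json
sample = [
    ["Acme buys Beta for $5M",
     {"entities": [[0, 4, "ACQUIRER"], [10, 14, "ACQUIRED"], [19, 22, "PRICE"]]}],
    ["Gamma acquired Delta on Monday",
     {"entities": [[0, 5, "ACQUIRER"], [15, 20, "ACQUIRED"], [24, 30, "DATE"]]}],
]

label_counts = count_entity_labels(sample)
unexpected = set(label_counts) - ALLOWED_LABELS
print(label_counts)
print("Labels outside the allowed set:", unexpected)
```

Running the same function on the loaded `data` list would show how many entities each unexpected label contributes.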


The spaCy model did not reach good accuracy because:
we were training the model from scratch.
the training data did not contain many PRICE entities.
we did not have enough data instances or compute time for a model to learn from scratch.

Instead, we can take a pre-trained BERT model and fine-tune it for the NER task. This way, we reuse the knowledge it acquired during pre-training and need less data to reach higher accuracy.
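Fine-tuning BERT for NER is usually framed as token classification, which needs an integer id per IOB tag and a way to spread word-level tags over subword pieces. The sketch below is one common arrangement, not necessarily the one used later in this notebook. It assumes a tokenizer that reports, for each subword, the index of the word it came from (Hugging Face fast tokenizers expose this as `word_ids()`); the `word_ids` list and the subword split in the example are hypothetical.

```python
ENTITY_TYPES = ["ACQUIRER", "ACQUIRED", "PRICE"]

# "O" plus a B-/I- pair for each entity type
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

# -100 is the default ignore_index of PyTorch's cross-entropy loss
IGNORE_INDEX = -100

def align_labels(word_ids, word_tags):
    """Map word-level IOB tags onto subword positions.

    word_ids[i] is the index of the word that subword i belongs to,
    or None for special tokens. Only the first subword of each word
    keeps its label; the remaining positions are ignored by the loss.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(label2id[word_tags[wid]])
        previous = wid
    return aligned

# Hypothetical split of "Orthofix acquires SeaSpine" into 6 subwords plus 2 special tokens
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_tags = ["B-ACQUIRER", "O", "B-ACQUIRED"]
aligned = align_labels(word_ids, word_tags)
print(aligned)
```

Labeling only the first subword keeps one prediction per word, which makes evaluation against the word-level IOB tags straightforward.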

In [3]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

Since our data is in the spaCy format, we first need to convert it to IOB format for the BERT model.

In [14]:
import json
import spacy

# Load SpaCy's English tokenizer
nlp = spacy.blank("en")

# Load the input JSON data
with open('filtered_ner_training_data.json', 'r') as file:
    data = json.load(file)

# Function to validate entities
def validate_entities(text, entities):
    valid_entities = []
    for start, end, label in entities:
        if start >= 0 and start < len(text) and end > start and end <= len(text):
            valid_entities.append((start, end, label))
        else:
            print(f"Ignoring invalid entity: {text[start:end]} with start={start} and end={end}")
    return valid_entities

# Function to convert data to IOB format
def convert_to_iob(data):
    converted_data = []
    for item in data:
        text = item[0]
        entities = item[1]['entities']
        
        # Validate entities
        valid_entities = validate_entities(text, entities)
        
        # Create a SpaCy doc object
        doc = nlp(text)
        
        # Initialize BILUO tags with 'O'
        biluo_tags = ['O'] * len(doc)
        
        for start, end, label in valid_entities:
            char_span = doc.char_span(start, end)
            if char_span is not None:
                # Determine the BILUO tag for each token in the span
                for i, token in enumerate(char_span):
                    if i == 0:
                        if len(char_span) == 1:
                            biluo_tags[token.i] = f'U-{label}'
                        else:
                            biluo_tags[token.i] = f'B-{label}'
                    elif i == len(char_span) - 1:
                        biluo_tags[token.i] = f'L-{label}'
                    else:
                        biluo_tags[token.i] = f'I-{label}'
            else:
                print(f"Invalid char span for entity: {text[start:end]} with start={start} and end={end}")
        
        # Convert BILUO tags to IOB tags
        iob_tags = [tag.replace("U-", "B-").replace("L-", "I-") for tag in biluo_tags]
        
        # Create a list of tokens
        tokens = [token.text for token in doc]
        
        # Append the tokens and IOB tags to the converted data
        converted_data.append({"tokens": tokens, "iob_tags": iob_tags})
    
    return converted_data

# Convert the data
iob_data = convert_to_iob(data)

# Save the converted data to a new JSON file
with open('dataset.json', 'w') as outfile:
    json.dump(iob_data, outfile, indent=4)

print("Data successfully converted and saved to dataset.json")


Invalid char span for entity: Sophos Solutions S.A.S with start=706 and end=728
Invalid char span for entity: Caseys General Stores Inc. with start=99 and end=125
Invalid char span for entity: LightEdge with start=568 and end=577
Invalid char span for entity: LightEdge with start=2119 and end=2128
Invalid char span for entity: LightEdge with start=2604 and end=2613
Invalid char span for entity: LightEdge with start=3012 and end=3021
Invalid char span for entity: LightEdge with start=3247 and end=3256
Invalid char span for entity: Connectria with start=956 and end=966
Invalid char span for entity: Connectria with start=3134 and end=3144
Invalid char span for entity: Amazon with start=808 and end=814
Invalid char span for entity: Amazon with start=1059 and end=1065
Invalid char span for entity: State Street with start=605 and end=617
Invalid char span for entity: State Street with start=3738 and end=3750
Invalid char span for entity: Digital World Acquisition Corp. with start=196 and end
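`doc.char_span` returns `None` when the character offsets do not line up with token boundaries, so one way to investigate messages like the ones above is to check each entity's start and end against the token offsets. The sketch below uses whitespace-separated tokens as a stand-in for spaCy's tokenizer (an assumption made to keep the example dependency-free; spaCy splits punctuation differently), and the sample text is hypothetical. The notebook does not confirm that this is the actual cause of every message.

```python
import re

def token_boundaries(text):
    """Collect start and end offsets of whitespace-separated tokens.

    This is a simplified stand-in for spaCy's tokenizer.
    """
    starts, ends = set(), set()
    for match in re.finditer(r"\S+", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends

def misaligned_entities(text, entities):
    """Return the entities whose start or end falls inside a token."""
    starts, ends = token_boundaries(text)
    return [(s, e, label) for s, e, label in entities
            if s not in starts or e not in ends]

# Hypothetical example: the first span stops before the final period of "S.A.S."
text = "Acquired by Sophos Solutions S.A.S. in 2023"
entities = [(12, 34, "ACQUIRER"), (12, 35, "ACQUIRER")]

bad = misaligned_entities(text, entities)
for start, end, label in bad:
    print(f"Misaligned {label}: {text[start:end]!r} ({start}, {end})")
```

With spaCy itself, `doc.char_span` also accepts an `alignment_mode` argument (`"strict"` by default, or `"contract"` / `"expand"`) that snaps such spans to token boundaries instead of returning `None`.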

We should detect overlapping entities and allow them, so that the conversion no longer produces these errors.
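IOB assigns a single tag to each token, so overlapping spans cannot both be written into the tag sequence. If they ever need to be resolved rather than kept, one possible policy is to keep the longest span and drop anything that overlaps it. The function below is a minimal sketch of that policy on plain `(start, end, label)` triples; the sample offsets are hypothetical, and the notebook itself does not state which policy it applies.

```python
def resolve_overlaps(entities):
    """Keep the longest spans and drop any span overlapping a kept one."""
    kept = []
    # Longest spans first; ties go to the span that starts earlier
    by_length = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    for start, end, label in by_length:
        # A span is compatible if it ends before or starts after every kept span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Hypothetical offsets: the first three spans overlap each other
entities = [
    (0, 8, "ACQUIRER"),
    (0, 18, "ACQUIRED"),
    (10, 18, "ACQUIRED"),
    (25, 31, "PRICE"),
]
resolved = resolve_overlaps(entities)
print(resolved)
```

For spaCy `Span` objects, `spacy.util.filter_spans` applies a similar longest-span preference.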

In [5]:
import json
import spacy
from spacy import displacy

# Load SpaCy's English tokenizer
nlp = spacy.blank("en")

# Load the input JSON data
with open('filtered_ner_training_data.json', 'r') as file:
    data = json.load(file)

# Function to identify overlapping entities
def find_overlapping_entities(data):
    for item in data:
        text = item[0]
        entities = item[1]['entities']
        entities = sorted(entities, key=lambda x: x[0])
        for i, (start, end, label) in enumerate(entities):
            if i > 0:
                prev_start, prev_end, prev_label = entities[i - 1]
                if start < prev_end:
                    return text, entities
    return None, None

# Find an example with overlapping entities
text, overlapping_entities = find_overlapping_entities(data)

# Keep only entities whose offsets are non-empty and lie inside the text
def validate_entities(text, entities):
    valid_entities = []
    for start, end, label in entities:
        if 0 <= start < end <= len(text):
            valid_entities.append((start, end, label))
    return valid_entities

if text:
    print("Found overlapping entities in the following text:")
    print(text)
    print("Entities:", overlapping_entities)
    
    # Validate entities
    valid_entities = validate_entities(text, overlapping_entities)
    
    # Create a SpaCy doc object
    doc = nlp(text)
    
    # Create SpaCy spans for the entities
    spans = [doc.char_span(start, end, label=label) for start, end, label in valid_entities]
    
    # Filter out None spans
    spans = [span for span in spans if span is not None]
    
    # Set the spans in the doc
    doc.spans["sc"] = spans
    
    # Visualize the entities with displacy
    options = {"colors": {"ACQUIRER": "lightblue", "ACQUIRED": "lightgreen", "PRICE": "lightcoral"}}
    displacy.render(doc, style="span", jupyter=True, options=options)
else:
    print("No overlapping entities found in the dataset.")


Found overlapping entities in the following text:
Orthofix, SeaSpine complete merger

Orthofix and SeaSpine today completed its previously announced merger of equals to create a global spine and orthopedics company.Under the terms of the agreement, Orthofix will merge with SeaSpine and SeaSpine will continue as the surviving company and a wholly-owned subsidiary of Orthofix. SeaSpine shares also ceased trading on the Nasdaq global market this morning.Holders of SeaSpine common stock will receive 0.4163 shares of Orthofix common stock for each share of SeaSpine common stock owned. The combined companies will continue to trade on the Nasdaq under the symbol OFIX.The two companies plan to rename the merged company at a later date, but it will be Orthofix Medical until then. The combined company will market spine and orthopedics with a complementary portfolio of biologics, spinal hardware solutions, market-leading bone growth therapies, specialized orthopedic solutions and a surgical navig