### MACCROBAT_DATA

Annotation files are in the brat standoff format.

General annotation structure
All annotations follow the same basic structure: Each line contains one annotation, and each annotation is given an ID that appears first on the line, separated from the rest of the annotation by a single TAB character. The rest of the structure varies by annotation type.

Examples of annotation for an entity (T1), an event trigger (T2), an event (E1) and a relation (R1) are shown in the following.

Annotation ID conventions
All annotation IDs consist of a single upper-case character identifying the annotation type, followed by a number. The initial ID characters relate to annotation types as follows:

T: text-bound annotation
R: relation
E: event
A: attribute
M: modification (alias for attribute, for backward compatibility)
N: normalization [new in v1.3]
#: note
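
The prefix convention above can be sketched as a small lookup. This is a minimal illustration only: the sample lines are hypothetical (written in brat standoff layout, not read from the dataset), and `ID_PREFIX_TO_KIND` and `annotation_kind` are names introduced here for the example.

```python
# Mapping from the leading ID character to the annotation kind,
# following the list above.
ID_PREFIX_TO_KIND = {
    "T": "text-bound annotation",
    "R": "relation",
    "E": "event",
    "A": "attribute",
    "M": "modification",
    "N": "normalization",
    "#": "note",
}


def annotation_kind(line: str) -> str:
    """Return the annotation kind for one line of a .ann file."""
    # The ID is everything before the first TAB character.
    ann_id = line.split("\t", 1)[0]
    return ID_PREFIX_TO_KIND.get(ann_id[:1], "unknown")


# Hypothetical example lines, for illustration only.
sample_lines = [
    "T1\tAge 4 15\t24-year-old",
    "E1\tClinical_event:T4",
    "R1\tMODIFY Arg1:T3 Arg2:T1",
    "A1\tNegated E1",
]
kinds = [annotation_kind(line) for line in sample_lines]
print(kinds)
```

Any prefix that is not in the list falls through to `"unknown"`, which makes unexpected IDs easy to spot.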

In [33]:
MACCROBAT_data_dir = "./data/MACCROBAT"

In [41]:
import os
import pandas as pd

In [3]:
file_ids = []
for file in os.listdir(MACCROBAT_data_dir):
    file_id = file.split(".")[0]
    if file_id not in file_ids:
        file_ids.append(file_id)
file_ids

['19860925',
 '26361640',
 '26228535',
 '27773410',
 '23678274',
 '25853982',
 '28103924',
 '27064109',
 '28154700',
 '20146086',
 '26656340',
 '28353558',
 '22515939',
 '28353588',
 '26309459',
 '28272235',
 '23242090',
 '23312850',
 '23124805',
 '26106249',
 '26313770',
 '26285706',
 '18416479',
 '28353613',
 '28151916',
 '26175648',
 '23468586',
 '28216610',
 '27059701',
 '28121940',
 '23077697',
 '27741115',
 '21067996',
 '28100235',
 '28151860',
 '25884600',
 '27904130',
 '19214295',
 '18787726',
 '22719160',
 '28422883',
 '26675562',
 '21477357',
 '25139918',
 '28353561',
 '22791498',
 '28538413',
 '26457578',
 '27842605',
 '20671919',
 '25155594',
 '26469535',
 '28353604',
 '28403092',
 '28239141',
 '28202869',
 '25024632',
 '28403086',
 '18666334',
 '25572898',
 '28296775',
 '22514576',
 '26584481',
 '28296749',
 '16778410',
 '19860007',
 '28190872',
 '25743872',
 '26523273',
 '28193213',
 '28120581',
 '26670309',
 '26336183',
 '25410883',
 '26530965',
 '28057913',
 '20977862',

In [32]:
tags = []
for ann_file in [os.path.join(MACCROBAT_data_dir, file_id+".ann") for file_id in file_ids]:
    with open(ann_file) as f:
        for line in f.readlines():
            tags.append(line.split("\t")[0][0])

set(tags)

{'#', '*', 'A', 'E', 'R', 'T'}

For named entity recognition, only the T (text-bound) annotations are needed to generate the BIO format. The other ID prefixes found in the .ann files carry different kinds of information:

\# marks notes (annotator comments) attached to other annotations.
\* marks equivalence relations (brat's `Equiv`), which group text-bound annotations that refer to the same thing, such as coreferent mentions.
A marks attribute annotations.
E marks event annotations.
R marks relation annotations.

We can therefore ignore every line that does not start with T when generating the BIO format for named entity recognition.
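
As a rough sketch of that filtering step, the prefixes can be tallied in a single pass with `collections.Counter`, and the T lines kept with a list comprehension. The sample lines are hypothetical stand-ins for the contents of one .ann file, and both function names are introduced here for illustration.

```python
from collections import Counter
from typing import Iterable, List


def count_id_prefixes(lines: Iterable[str]) -> Counter:
    """Count annotation lines by the first character of their ID."""
    return Counter(line[0] for line in lines if line.strip())


def keep_text_bound(lines: Iterable[str]) -> List[str]:
    """Keep only the T (text-bound) lines, which are all that NER needs."""
    return [line for line in lines if line.startswith("T")]


# Hypothetical lines in brat standoff layout, for illustration only.
sample_lines = [
    "T1\tAge 4 15\t24-year-old",
    "T2\tSex 28 32\tmale",
    "E1\tClinical_event:T4",
    "R1\tMODIFY Arg1:T3 Arg2:T1",
    "*\tEquiv T1 T2",
    "#1\tAnnotatorNotes T1\texample note",
]

counts = count_id_prefixes(sample_lines)
text_bound = keep_text_bound(sample_lines)
print(counts)
print(text_bound)
```

Unlike collecting every prefix in a list and calling `set` on it, the counter also shows how many annotations of each kind a file contains.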

### Find Entity types in ann files

In [177]:
# create an empty list to store the DataFrames
df_list = []

for ann_file in [os.path.join(MACCROBAT_data_dir, file_id+".ann") for file_id in file_ids]:
    # read the TSV file with consecutive tabs treated as a single delimiter
    df = pd.read_csv(ann_file, sep="\t+", header=None, names=['id', 'entity_with_range', 'word'], engine='python')

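    # selecting only T tag rows (copy so the column assignment below
    # does not write into a view of the original frame)
    df = df[df.iloc[:,0].str.startswith('T')].copy()

    # record the path of the matching .txt file
    df['file_name'] = f"{os.path.splitext(ann_file)[0]}.txt"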

    df_list.append(df)

ann_df = pd.concat(df_list)
ann_df = ann_df.reset_index(drop=True)

In [178]:
ann_df.head()

Unnamed: 0,id,entity_with_range,word,file_name
0,T1,Age 4 15,24-year-old,./data/MACCROBAT/19860925.txt
1,T2,Sex 28 32,male,./data/MACCROBAT/19860925.txt
2,T3,History 16 27,non-smoking,./data/MACCROBAT/19860925.txt
3,T4,Clinical_event 41 50,presented,./data/MACCROBAT/19860925.txt
4,T5,Sign_symptom 65 75,hemoptysis,./data/MACCROBAT/19860925.txt


In [179]:
ann_df.shape

(25041, 4)

In [180]:
ann_df[ann_df['entity_with_range'].str.split().str.len() > 3]

Unnamed: 0,id,entity_with_range,word,file_name
39,T40,Disease_disorder 701 714;730 735,granular cell tumor,./data/MACCROBAT/19860925.txt
1190,T31,Dosage 4085 4093;4103 4108,low dose daily,./data/MACCROBAT/20146086.txt
1272,T160,Dosage 3117 3124;3135 3141,1500 mg weekly,./data/MACCROBAT/20146086.txt
1469,T101,Detailed_description 2552 2559;2585 2654,neither had developed any signs or symptoms su...,./data/MACCROBAT/28353558.txt
2376,T42,Administration 704 715;721 726,intravenous bolus,./data/MACCROBAT/23124805.txt
2380,T47,Dosage 828 835;861 879,5 mg/kg every 4 to 6 weeks,./data/MACCROBAT/23124805.txt
2381,T48,Dosage 849 857;861 879,10 mg/kg every 4 to 6 weeks,./data/MACCROBAT/23124805.txt
2853,T18,Diagnostic_procedure 341 350;357 358,Hepatitis C,./data/MACCROBAT/18416479.txt
3335,T16,Disease_disorder 315 317;319 330,LV dysfunction,./data/MACCROBAT/23468586.txt
3336,T15,Disease_disorder 297 313;319 330,left ventricular dysfunction,./data/MACCROBAT/23468586.txt


The entries above are annotations whose tagged text spans more than one disjoint range; the ranges are separated by a semicolon.

Note that the "B" prefix marks the first token of an entity, while the "I" prefix marks a token inside an entity after the first. When an entity has disjoint ranges, we can start a new entity with a "B" prefix for each range.

Including every range of a disjoint annotation in the BIO output helps the model learn to recognize the entire entity, even when it is fragmented across multiple parts of the text.
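
A minimal sketch of how the `entity_with_range` field could be split into a label and a list of `(start, end)` ranges, including the disjoint case. The helper name `parse_span_field` is introduced here for illustration; the input string is the `T40` row from the table above.

```python
from typing import List, Tuple


def parse_span_field(entity_with_range: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Split e.g. 'Label 701 714;730 735' into ('Label', [(701, 714), (730, 735)])."""
    # The label is the first whitespace-separated item; the rest holds the ranges.
    label, _, span_part = entity_with_range.partition(" ")
    spans = []
    # Disjoint ranges are separated by ';', and each range is 'start end'.
    for fragment in span_part.split(";"):
        start, end = fragment.split()
        spans.append((int(start), int(end)))
    return label, spans


label, spans = parse_span_field("Disease_disorder 701 714;730 735")
print(label, spans)  # Disease_disorder [(701, 714), (730, 735)]
```

Each tuple in `spans` can then be tagged as its own fragment, with a "B" tag on its first token.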

In [181]:
ann_df['entity'] = ann_df['entity_with_range'].str.split().str[0]
ann_df['range'] = ann_df['entity_with_range'].str.split().str[1:]
ann_df = ann_df[['id', 'entity', 'range', 'word', 'file_name']]
ann_df.head()

Unnamed: 0,id,entity,range,word,file_name
0,T1,Age,"[4, 15]",24-year-old,./data/MACCROBAT/19860925.txt
1,T2,Sex,"[28, 32]",male,./data/MACCROBAT/19860925.txt
2,T3,History,"[16, 27]",non-smoking,./data/MACCROBAT/19860925.txt
3,T4,Clinical_event,"[41, 50]",presented,./data/MACCROBAT/19860925.txt
4,T5,Sign_symptom,"[65, 75]",hemoptysis,./data/MACCROBAT/19860925.txt


In [192]:
print("The set of entities available in the dataset is as follows: ")
set(ann_df['entity'])

The set of entities available in the dataset is as follows: 


{'Activity',
 'Administration',
 'Age',
 'Area',
 'Biological_attribute',
 'Biological_structure',
 'Clinical_event',
 'Color',
 'Coreference',
 'Date',
 'Detailed_description',
 'Diagnostic_procedure',
 'Disease_disorder',
 'Distance',
 'Dosage',
 'Duration',
 'Family_history',
 'Frequency',
 'Height',
 'History',
 'Lab_value',
 'Mass',
 'Medication',
 'Nonbiological_location',
 'Occupation',
 'Other_entity',
 'Other_event',
 'Outcome',
 'Personal_background',
 'Qualitative_concept',
 'Quantitative_concept',
 'Severity',
 'Sex',
 'Shape',
 'Sign_symptom',
 'Subject',
 'Texture',
 'Therapeutic_procedure',
 'Time',
 'Volume',
 'Weight'}

### Exploring each entity type

In [314]:
ann_df.groupby(['entity']).size().sort_values(ascending=False)

entity
Diagnostic_procedure      4567
Sign_symptom              3359
Biological_structure      2931
Detailed_description      2901
Lab_value                 2858
Disease_disorder          1362
Medication                1076
Therapeutic_procedure     1005
Date                       731
Clinical_event             626
History                    392
Severity                   369
Dosage                     362
Nonbiological_location     354
Coreference                313
Duration                   280
Age                        206
Sex                        191
Administration             175
Distance                   122
Activity                   108
Family_history              81
Frequency                   76
Shape                       65
Time                        57
Personal_background         57
Subject                     54
Color                       52
Texture                     46
Area                        43
Outcome                     42
Qualitative_concept         41
V

In [222]:
# Define a function to select random rows from each group
def select_random_rows(group):
    return group.sample(n=min(2, len(group)), replace=False)

# Apply the function to the DataFrame grouped by 'entity'
random_rows = ann_df.groupby('entity').apply(select_random_rows)

# Print the selected random rows
random_rows['word']

entity               
Activity        16278               flexed
                7071     physical exercise
Administration  16777                 oral
                11207              topical
Age             19252          23-year-old
                               ...        
Time            6758        first 24 hours
Volume          18706        11 × 8 × 2 cm
                24524                 5 mL
Weight          24555               3132 g
                5538                 83 kg
Name: word, Length: 82, dtype: object

In [223]:
selected_entities = ['Age', 'Biological_attribute', 'Biological_structure', 'Clinical_event', 'Diagnostic_procedure', 'Disease_disorder', 'Dosage', 'Family_history', 'Height', 'History', 'Lab_value', 'Mass', 'Medication', 'Sex', 'Sign_symptom', 'Therapeutic_procedure', 'Weight']

In [227]:
filtered_ann_df = ann_df[ann_df['entity'].isin(selected_entities)]
filtered_ann_df = filtered_ann_df.reset_index(drop=True)
filtered_ann_df.head()

Unnamed: 0,id,entity,range,word,file_name
0,T1,Age,"[4, 15]",24-year-old,./data/MACCROBAT/19860925.txt
1,T2,Sex,"[28, 32]",male,./data/MACCROBAT/19860925.txt
2,T3,History,"[16, 27]",non-smoking,./data/MACCROBAT/19860925.txt
3,T4,Clinical_event,"[41, 50]",presented,./data/MACCROBAT/19860925.txt
4,T5,Sign_symptom,"[65, 75]",hemoptysis,./data/MACCROBAT/19860925.txt


In [228]:
filtered_ann_df.shape

(19036, 5)

## Generating BIO format files

In [None]:
import re

#### Function to read txt files

In [238]:
import numpy as np
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [232]:
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    return content

In [281]:
import re

# Read the txt file
with open('./data/MACCROBAT/19860925.txt', 'r') as f:
    text = f.read()

# Read the ann file
with open('./data/MACCROBAT/19860925.ann', 'r') as f:
    ann = f.readlines()

# Define a function to convert the ann file into a list of (label, start, end) tuples
def parse_ann(ann):
    entities = []
    for line in ann:
        fields = line.strip().split('\t')
        if fields[0].startswith('T'):
            entity_with_range, word = fields[1], fields[2]
            label = entity_with_range.split()[0]
            if label in selected_entities:
                ranges = [
                    (
                        int(start_end.split()[0]),
                        int(start_end.split()[1])
                    )
                    for start_end in ' '.join(entity_with_range.split()[1:]).split(';')
                ]
                for start, end in ranges:
                    entities.append((label, start, end))
    return entities

# Convert the ann file into a list of (label, start, end) tuples
entities = parse_ann(ann)

# Define a function to convert the entities into a list of (token, tag) tuples in BIO format
def entities_to_bio(text, entities):
    tokens = re.findall(r'\b\w+\b', text)
    tags = ['O'] * len(tokens)
    for label, start, end in entities:
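        for i in range(len(tokens)):
            # NOTE: str.find returns the offset of the FIRST occurrence of the
            # token string anywhere in the text, so repeated words and short
            # tokens (e.g. "a", "in", "4") are located at the wrong position.
            token_start = text.find(tokens[i])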
            if start <= token_start < end:
                if token_start == start:
                    tags[i] = 'B-' + label
                else:
                    tags[i] = 'I-' + label
    return list(zip(tokens, tags))



# Convert the entities into a list of (token, tag) tuples in BIO format
bio = entities_to_bio(text, entities)


In [284]:
# Print the result
for word, tag in bio:
    print(word, tag)

Our O
24 B-Age
year I-Age
old I-Age
non B-History
smoking I-History
male B-Sex
patient O
presented B-Clinical_event
with O
repeated O
hemoptysis B-Sign_symptom
in I-History
May O
2008 O
with O
4 I-Age
days O
of O
concomitant O
right B-Biological_structure
thoracic I-Biological_structure
pain B-Sign_symptom
which O
intensified O
while O
breathing O
During O
holidays B-History
in I-History
his I-History
home I-History
country I-History
this O
Cuban O
patient O
suffered O
from O
a I-Age
cold B-Sign_symptom
with O
fever B-Sign_symptom
and O
a I-Age
strong O
cough B-Sign_symptom
The O
strong O
dry O
cough B-Sign_symptom
persisted O
after O
recovery O
from O
the O
cold B-Sign_symptom
The O
patient O
did O
not O
report O
any O
loss B-Sign_symptom
of O
weight I-Sign_symptom
The O
initial O
CT B-Diagnostic_procedure
scan I-Diagnostic_procedure
of O
the O
thorax B-Biological_structure
showed O
a I-Age
12 O
4 I-Age
cm O
solid O
mass B-Sign_symptom
paravertebral B-Biological_structure
right B-Biol
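
In the output above, tokens such as `in`, `a` and `4` pick up `I-History` or `I-Age` tags. This happens because `text.find(token)` returns the first place the string occurs, which for `a` is inside `24-year-old` and for `in` is inside `non-smoking`. One way to avoid this, sketched below, is to take each token's offsets directly from `re.finditer`. The function name `bio_tags_by_offset` is introduced here for illustration, and `sample_text` is reconstructed from the tokens printed above (its punctuation is an assumption); the entity offsets come from the annotation table shown earlier.

```python
import re
from typing import List, Tuple


def bio_tags_by_offset(text: str, entities: List[Tuple[str, int, int]]) -> List[Tuple[str, str]]:
    """Tag word tokens using their true character offsets in `text`."""
    # finditer keeps the position of every match, so repeated words are not confused.
    matches = list(re.finditer(r"\b\w+\b", text))
    tags = ["O"] * len(matches)
    for label, start, end in entities:
        # Tokens that lie completely inside this annotated range.
        inside = [i for i, m in enumerate(matches)
                  if start <= m.start() and m.end() <= end]
        for k, i in enumerate(inside):
            tags[i] = ("B-" if k == 0 else "I-") + label
    return [(m.group(), tag) for m, tag in zip(matches, tags)]


# Reconstructed opening of the document (assumed punctuation-free).
sample_text = ("Our 24-year-old non-smoking male patient presented with "
               "repeated hemoptysis in May 2008 with 4 days")
sample_entities = [
    ("Age", 4, 15),
    ("History", 16, 27),
    ("Sex", 28, 32),
    ("Clinical_event", 41, 50),
    ("Sign_symptom", 65, 75),
]

tagged = bio_tags_by_offset(sample_text, sample_entities)
for token, tag in tagged:
    print(token, tag)
```

With true offsets, `in` and `4` fall outside every annotated range and stay `O`.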

In [285]:
MACCROBAT_data_dir

'./data/MACCROBAT'

In [311]:
# def convert_ann_to_bio(input_dir, output_dir):
#     for file in os.listdir(input_dir):
#         if file.endswith('.txt'):
#             # Read the corresponding txt file
#             with open(os.path.join(input_dir, file), 'r', encoding='utf-8') as f:
#                 text = f.read()
#             # Find the corresponding ann file
#             ann_file = os.path.join(input_dir, file.replace('.txt', '.ann'))
#
#             # tokenization
#             tokens = re.findall(r'\b\w+\b', text)
#             # Initialize a list to hold the BIO-formatted tags
#             bio_tags = ['O'] * len(tokens)
#
#             # Read the annotation file
#             with open(ann_file, 'r', encoding='utf-8') as f:
#                 for line in f:
#                     fields = line.strip().split('\t')
#                     # Get the tag and its starting and ending positions
#                     if fields[0].startswith('T'):
#                         entity_with_range, word = fields[1], fields[2]
#                         label = entity_with_range.split()[0]
#                         if label in selected_entities:
#                             ranges = [
#                                 (
#                                     int(start_end.split()[0]),
#                                     int(start_end.split()[1])
#                                 )
#                                 for start_end in ' '.join(entity_with_range.split()[1:]).split(';')
#                             ]
#                             # Keep track of the current position in the text
#                             current_pos = 0
#                             for i in range(len(tokens)):
#                                 # Calculate the starting position of the token
#                                 token_start = text.find(tokens[i], current_pos)
#                                 token_end = token_start + len(tokens[i])
#                                 # Update the current position in the text
#                                 current_pos = token_end
#                                 for start, end in ranges:
#                                     if start <= token_start and end >= token_end:
#                                         if token_start == start:
#                                             bio_tags[i] = 'B-' + label
#                                         else:
#                                             bio_tags[i] = 'I-' + label
#             # for token, tag in zip(tokens, bio_tags):
#             #     print(token, tag)
#
#
#             if not os.path.exists(output_dir):
#                 os.makedirs(output_dir)
#              # # Write the BIO tags to a new file
#             with open(os.path.join(output_dir, file.replace('.txt', '.bio')), 'w', encoding='utf-8') as f:
#                 for i in range(len(tokens)):
#                     f.write(tokens[i] + '\t' + bio_tags[i] + '\n')
#                 f.write('\n')

In [338]:
def convert_ann_to_bio(input_dir, output_dir):
    for file in os.listdir(input_dir):
        if file.endswith('.txt'):
            # Read the corresponding txt file
            with open(os.path.join(input_dir, file), 'r', encoding='utf-8') as f:
                text = f.read()
            # Find the corresponding ann file
            ann_file = os.path.join(input_dir, file.replace('.txt', '.ann'))

            # tokenization
            sentences = text.split('\n')
            tokens = [word for sentence in sentences for word in sentence.split()]

            # Initialize a list to hold the BIO-formatted tags
            bio_tags = ['O'] * len(tokens)

            # Read the annotation file
            with open(ann_file, 'r', encoding='utf-8') as f:
                for line in f:
                    fields = line.strip().split('\t')
                    # Get the tag and its starting and ending positions
                    if fields[0].startswith('T'):
                        entity_with_range, word = fields[1], fields[2]
                        label = entity_with_range.split()[0]
                        if label in selected_entities:
                            ranges = [
                                (
                                    int(start_end.split()[0]),
                                    int(start_end.split()[1])
                                )
                                for start_end in ' '.join(entity_with_range.split()[1:]).split(';')
                            ]
                            # Keep track of the current position in the text
                            current_pos = 0
                            for i in range(len(tokens)):
                                # Calculate the starting position of the token
                                token_start = text.find(tokens[i], current_pos)
                                token_end = token_start + len(tokens[i])
                                # Update the current position in the text
                                current_pos = token_end
                                for start, end in ranges:
                                    if start <= token_start and end >= token_end:
                                        if token_start == start:
                                            bio_tags[i] = 'B-' + label
                                        else:
                                            bio_tags[i] = 'I-' + label

            if not os.path.exists(output_dir):
                os.makedirs(output_dir)
            # Write the BIO tags to a new file
            with open(os.path.join(output_dir, file.replace('.txt', '.bio')), 'w', encoding='utf-8') as f:
                sentence_start_index = 0
                for sentence in sentences:
                    sentence_tokens = sentence.split()
                    sentence_length = len(sentence_tokens)
                    sentence_end_index = sentence_start_index + sentence_length
                    for i in range(sentence_start_index, sentence_end_index):
                        f.write(sentence_tokens[i - sentence_start_index] + '\t' + bio_tags[i] + '\n')
                    f.write('\n')
                    sentence_start_index = sentence_end_index


In [325]:
import os
from typing import List, Tuple

def read_text_file(file_path: str) -> str:
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

def tokenize_text(text: str) -> List[str]:
    return [word for sentence in text.split('\n') for word in sentence.split()]

def read_annotation_file(file_path: str, selected_entities: List[str]) -> List[Tuple[str, List[Tuple[int,int]]]]:
    entity_ranges = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            fields = line.strip().split('\t')
            # Get the tag and its starting and ending positions
            if fields[0].startswith('T'):
                entity_with_range, word = fields[1], fields[2]
                label = entity_with_range.split()[0]
                if label in selected_entities:
                    ranges = [
                        (
                            int(start_end.split()[0]),
                            int(start_end.split()[1])
                        )
                        for start_end in ' '.join(entity_with_range.split()[1:]).split(';')
                    ]
                    entity_ranges.append((label, ranges))
    return entity_ranges

def convert_ann_to_bio(input_dir: str, output_dir: str, selected_entities: List[str]):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    for file_name in os.listdir(input_dir):
        if file_name.endswith('.txt'):
            # Read the corresponding txt file
            text = read_text_file(os.path.join(input_dir, file_name))

            # Find the corresponding ann file
            ann_file = os.path.join(input_dir, file_name.replace('.txt', '.ann'))

            # Split the text into lines; each line is written as one sentence below
            sentences = text.split('\n')

            # Tokenize the text
            tokens = tokenize_text(text)

            # Initialize a list to hold the BIO-formatted tags
            bio_tags = ['O'] * len(tokens)

            # Read the annotation file
            entity_ranges = read_annotation_file(ann_file, selected_entities)

            # Update the BIO tags
            for label, ranges in entity_ranges:
                # Keep track of the current position in the text
                current_pos = 0
                for i in range(len(tokens)):
                    # Calculate the starting position of the token
                    token_start = text.find(tokens[i], current_pos)
                    token_end = token_start + len(tokens[i])
                    # Update the current position in the text
                    current_pos = token_end
                    for start, end in ranges:
                        if start <= token_start and end >= token_end:
                            if token_start == start:
                                bio_tags[i] = 'B-' + label
                            else:
                                bio_tags[i] = 'I-' + label



            # Write the BIO tags to a new file (output_dir was already created above)
            with open(os.path.join(output_dir, file_name.replace('.txt', '.bio')), 'w', encoding='utf-8') as f:
                sentence_start_index = 0
                for sentence in sentences:
                    sentence_tokens = sentence.split()
                    sentence_length = len(sentence_tokens)
                    sentence_end_index = sentence_start_index + sentence_length
                    for i in range(sentence_start_index, sentence_end_index):
                        f.write(sentence_tokens[i - sentence_start_index] + '\t' + bio_tags[i] + '\n')
                    f.write('\n')
                    sentence_start_index = sentence_end_index



In [447]:
import os
import string
from typing import List, Tuple, Dict

def read_text_file(file_path: str) -> str:
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

def tokenize_text(text: str) -> List[str]:
    # Tokenize the text into a list of words
    tokens = []
    for sentence in text.split('\n'):
        for word in sentence.split():
            # Remove trailing punctuation marks from the word
            while word and word[-1] in string.punctuation:
                word = word[:-1]
            tokens.append(word)
    return tokens
    # tokens = [word for sentence in text.split('\n') for word in sentence.split()]
    # return tokens

def get_start_end_range_to_token_index(tokens: List[str], entity_ranges: List[Tuple[str, int, int]], text: str = None) -> Dict[Tuple[int, int], List[int]]:
    # `text` is the document the tokens were taken from. When it is omitted, fall back to the
    # notebook-level `text` variable so that existing two-argument calls keep working.
    if text is None:
        text = globals()['text']
    # Initialize a dictionary to map each (start, end) range to the corresponding token indices
    start_end_range_to_token_index = {}
    # Keep track of the current position in the text
    current_pos = 0
    # Iterate over each token in the tokens list
    for i in range(len(tokens)):
        # Calculate the starting position of the token
        token_start = text.find(tokens[i], current_pos)
        token_end = token_start + len(tokens[i])
        # Update the current position in the text
        current_pos = token_end
        # Check if the current token is inside any of the entity ranges
        for label, start, end in entity_ranges:
            if start <= token_start and end >= token_end:
                # If the (start, end) range is not already in the dictionary, add it with an empty list
                if (start, end) not in start_end_range_to_token_index:
                    start_end_range_to_token_index[(start, end)] = []
                # Add the index of the token to the list corresponding to the (start, end) range in the dictionary
                start_end_range_to_token_index[(start, end)].append(i)
    return start_end_range_to_token_index


def read_annotation_file(file_path: str, selected_entities: List[str]) -> List[Tuple[str, List[Tuple[int,int]]]]:
    entity_ranges = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            fields = line.strip().split('\t')
            # Get the tag and its starting and ending positions
            if fields[0].startswith('T'):
                entity_with_range, word = fields[1], fields[2]
                label = entity_with_range.split()[0]
                if label in selected_entities:
                    ranges = [
                        (
                            int(start_end.split()[0]),
                            int(start_end.split()[1])
                        )
                        for start_end in ' '.join(entity_with_range.split()[1:]).split(';')
                    ]
                    entity_ranges.append((label, ranges))
    # Sort the entity ranges based on start and end
    entity_ranges = sorted(entity_ranges, key=lambda x: (x[1][0][0], x[1][0][1]))
    return entity_ranges

def convert_ann_to_bio(input_dir: str, output_dir: str, selected_entities: List[str]):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    for file_name in os.listdir(input_dir):
        if file_name.endswith('.txt'):
            # Read the corresponding txt file
            text = read_text_file(os.path.join(input_dir, file_name))

            # Find the corresponding ann file
            ann_file = os.path.join(input_dir, file_name.replace('.txt', '.ann'))

            # Tokenize the text
            tokens = tokenize_text(text)

            # Initialize a list to hold the BIO-formatted tags
            bio_tags = ['O'] * len(tokens)

            # Read the annotation file
            entity_ranges = read_annotation_file(ann_file, selected_entities)
            entity_ranges = [(name, *tup) for name, tup_list in entity_ranges for tup in tup_list]

            # Pass this file's text explicitly so offsets are computed against the right document
            start_end_2_idx = get_start_end_range_to_token_index(tokens, entity_ranges, text)

            # Update the BIO tags
            for label, start, end in entity_ranges:
                # Get the list of token indices corresponding to the (start, end) range
                token_indices = start_end_2_idx.get((start, end), [])
                # Assign the BIO tags to each token index in the range
                for i in token_indices:
                    if i == token_indices[0]:
                        bio_tags[i] = 'B-' + label
                    else:
                        bio_tags[i] = 'I-' + label

            # Write the BIO tags to a new file
            with open(os.path.join(output_dir, file_name.replace('.txt', '.bio')), 'w', encoding='utf-8') as f:
                sentence_start_index = 0
                for sentence in text.split('\n'):
                    sentence_tokens = sentence.split()
                    sentence_length = len(sentence_tokens)
                    sentence_end_index = sentence_start_index + sentence_length
                    for i in range(sentence_start_index, sentence_end_index):
                        f.write(sentence_tokens[i - sentence_start_index] + '\t' + bio_tags[i] + '\n')
                    f.write('\n')
                    sentence_start_index = sentence_end_index

    print("Conversion completed successfully.")


In [448]:
convert_ann_to_bio('./data/MACCROBAT', './data/BIO_FILES', selected_entities)

Conversion completed successfully.
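
To spot-check the result, a generated file can be read back into sentences. The sketch below assumes the layout written by the conversion code: one `token<TAB>tag` pair per line and an empty line after each sentence. The function name `read_bio` and the sample lines are hypothetical and only illustrate the format.

```python
from typing import Iterable, List, Tuple


def read_bio(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Group 'token<TAB>tag' lines into sentences separated by blank lines."""
    sentences = []
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence (if it has any tokens).
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last TAB so a token containing a TAB cannot shift the tag.
        token, tag = line.rsplit("\t", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical file contents, for illustration only.
sample_lines = [
    "Our\tO\n",
    "24-year-old\tB-Age\n",
    "male\tB-Sex\n",
    "\n",
    "She\tO\n",
    "\n",
]
sentences = read_bio(sample_lines)
print(len(sentences), sentences[0])
```

Because an open file object is itself an iterable of lines, it can be passed to `read_bio` directly.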


In [346]:
with open('./data/MACCROBAT/28383413.txt', 'r') as f:
    text = f.read()

In [351]:
text[4990:5123]

'leg.\nAt the follow-up in late January 2016, the patient could not walk and live by herself and was depressed.\nAt the latest follow-up'
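
The same kind of slice can be used to check that an annotation's offsets really reproduce its covered text before any tags are generated. The sketch below is an illustration under assumptions: the helper names are introduced here, `sample_text` is reconstructed from the first document's tokens, and the covered text of a disjoint annotation is assumed to be its fragments joined by a single space (which is consistent with rows such as `granular cell tumor` shown earlier).

```python
from typing import List, Tuple

Span = Tuple[int, int]


def covered_text(text: str, spans: List[Span]) -> str:
    """Text covered by an annotation; disjoint fragments are joined by a space."""
    return " ".join(text[start:end] for start, end in spans)


def find_offset_mismatches(text: str, annotations: List[Tuple[str, List[Span], str]]) -> List[str]:
    """Return the IDs of annotations whose offsets do not reproduce their text."""
    return [ann_id for ann_id, spans, expected in annotations
            if covered_text(text, spans) != expected]


# Reconstructed opening of document 19860925 (assumed punctuation-free).
sample_text = ("Our 24-year-old non-smoking male patient presented with "
               "repeated hemoptysis in May 2008")
sample_annotations = [
    ("T1", [(4, 15)], "24-year-old"),
    ("T2", [(28, 32)], "male"),
    ("T3", [(16, 27)], "non-smoking"),
    ("T5", [(65, 75)], "hemoptysis"),
    ("T99", [(0, 2)], "Our"),  # hypothetical bad offsets, to show a mismatch
]

mismatches = find_offset_mismatches(sample_text, sample_annotations)
print(mismatches)  # ['T99']
```

An empty result means every annotation lines up with the text, so any later tagging errors come from tokenization rather than from the offsets.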

In [440]:
# Read the corresponding txt file
text = read_text_file('./data/MACCROBAT/28383413.txt')

# Find the corresponding ann file
ann_file = './data/MACCROBAT/28383413.ann'

# Tokenize the text
tokens = tokenize_text(text)

# Initialize a list to hold the BIO-formatted tags
bio_tags = ['O'] * len(tokens)

# Read the annotation file
entity_ranges = read_annotation_file(ann_file, selected_entities)
entity_ranges = [(name, *tup) for name, tup_list in entity_ranges for tup in tup_list]

In [441]:
tokens

['n',
 'March',
 '2015',
 'a',
 '62-year-old',
 'woman',
 'was',
 'admitted',
 'to',
 'our',
 'hospital',
 'She',
 'complained',
 'of',
 'progressive',
 'visual',
 'disturbance',
 'which',
 'began',
 'about',
 '4',
 'years',
 'ago',
 'and',
 'was',
 'treated',
 'as',
 'cataract',
 'in',
 'local',
 'hospital',
 'but',
 'no',
 'relief',
 'was',
 'seen',
 'On',
 'the',
 'contrary',
 'the',
 'symptoms',
 'aggravated',
 'half',
 'a',
 'year',
 'ago',
 'together',
 'with',
 'headache',
 'left',
 'eye',
 'pain',
 'tearing',
 'and',
 'increased',
 'secretions',
 'and',
 'the',
 'computed',
 'tomography',
 '(CT',
 'scan',
 'of',
 'the',
 'brain',
 'in',
 'local',
 'hospital',
 'showed',
 'a',
 'sellar',
 'region',
 'lesion',
 'Besides',
 '2',
 'years',
 'earlier',
 'the',
 'patient',
 'underwent',
 'resection',
 'of',
 'melanoma',
 'in',
 'the',
 'left',
 'heel',
 '(T2N0M0',
 'ki67',
 '3–5',
 'Stage',
 'II',
 'followed',
 'by',
 'resection',
 'of',
 'the',
 'recurred',
 'melanoma',
 'nearby',
 

In [432]:
start_end_range_to_token_index = {}
# Keep track of the current position in the text
current_pos = 0
# Iterate over each token in the tokens list
for i in range(len(tokens)):
    # Calculate the starting position of the token
    token_start = text.find(tokens[i], current_pos)
    token_end = token_start + len(tokens[i])

    # Update the current position in the text
    current_pos = token_end
    # Check if the current token is inside any of the entity ranges
    for label, start, end in entity_ranges:
        if tokens[i] == 'follow-up':
            print(token_start, token_end, start, end)
        if start <= token_start and end >= token_end:
            # If the (start, end) range is not already in the dictionary, add it with an empty list
            if (start, end) not in start_end_range_to_token_index:
                start_end_range_to_token_index[(start, end)] = []
            # Add the index of the token to the list corresponding to the (start, end) range in the dictionary
            start_end_range_to_token_index[(start, end)].append(i)

5002 5011 16 27
5002 5011 28 33
5002 5011 38 46
5002 5011 94 112
5002 5011 163 171
5002 5011 198 204
5002 5011 236 244
5002 5011 287 295
5002 5011 297 305
5002 5011 306 310
5002 5011 312 319
5002 5011 334 344
5002 5011 354 373
5002 5011 375 377
5002 5011 391 396
5002 5011 424 437
5002 5011 438 444
5002 5011 494 503
5002 5011 507 515
5002 5011 523 532
5002 5011 534 540
5002 5011 542 551
5002 5011 553 561
5002 5011 576 585
5002 5011 602 610
5002 5011 611 634
5002 5011 652 658
5002 5011 660 669
5002 5011 680 695
5002 5011 705 734
5002 5011 739 759
5002 5011 796 807
5002 5011 813 834
5002 5011 839 842
5002 5011 852 872
5002 5011 877 895
5002 5011 897 905
5002 5011 906 917
5002 5011 939 950
5002 5011 955 981
5002 5011 999 1011
5002 5011 1015 1018
5002 5011 1023 1034
5002 5011 1038 1042
5002 5011 1058 1078
5002 5011 1079 1087
5002 5011 1105 1114
5002 5011 1125 1134
5002 5011 1138 1157
5002 5011 1166 1174
5002 5011 1178 1197
5002 5011 1215 1226
5002 5011 1230 1239
5002 5011 1244 1257
5002 501

In [442]:
start_end_2_idx = get_start_end_range_to_token_index(tokens, entity_ranges)
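
The definition of `get_start_end_range_to_token_index` does not appear in this part of the notebook. The following is a minimal sketch of what such a helper might look like, assuming it simply wraps the offset-alignment loop from the earlier cell. The function name `align_tokens_to_spans`, the extra `text` parameter, and the toy sentence are hypothetical and used only for illustration.

```python
def align_tokens_to_spans(text, tokens, entity_ranges):
    """Map each (start, end) entity span to the indices of the tokens it fully contains.

    Hypothetical helper: assumes `tokens` appear in `text` in order and that
    `entity_ranges` is a list of (label, start, end) character offsets.
    """
    span_to_token_idx = {}
    current_pos = 0
    for i, token in enumerate(tokens):
        # Search from the end of the previous token so repeated words resolve in order
        token_start = text.find(token, current_pos)
        if token_start == -1:
            # Token could not be located; skip it instead of using a bogus offset
            continue
        token_end = token_start + len(token)
        current_pos = token_end
        for _label, start, end in entity_ranges:
            # A token belongs to a span only if it lies fully inside it
            if start <= token_start and token_end <= end:
                span_to_token_idx.setdefault((start, end), []).append(i)
    return span_to_token_idx


# Toy example (not taken from the MACCROBAT files)
text = "A 62-year-old woman had left eye pain."
tokens = ["A", "62-year-old", "woman", "had", "left", "eye", "pain"]
entity_ranges = [
    ("Age", 2, 13),
    ("Sex", 14, 19),
    ("Biological_structure", 24, 32),
    ("Sign_symptom", 33, 37),
]
span_to_token_idx = align_tokens_to_spans(text, tokens, entity_ranges)
print(span_to_token_idx)
```

A multi-word entity such as the span (24, 32) collects one index per token it covers, which is what later allows the first token to be tagged `B-` and the rest `I-`.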

In [443]:
bio_tags

['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O'

In [444]:
start_end_range_to_token_index

{(16, 27): [4],
 (28, 33): [5],
 (38, 46): [7],
 (94, 112): [15, 16],
 (163, 171): [27],
 (198, 204): [33],
 (236, 244): [40],
 (287, 295): [48],
 (297, 305): [49, 50],
 (306, 310): [51],
 (312, 319): [52],
 (334, 344): [55],
 (354, 373): [58, 59],
 (391, 396): [64],
 (424, 437): [70, 71],
 (438, 444): [72],
 (494, 503): [80],
 (507, 515): [82],
 (523, 532): [85, 86],
 (542, 551): [88, 89],
 (553, 561): [90, 91],
 (576, 585): [94],
 (602, 610): [98],
 (611, 634): [99, 100, 101, 102],
 (660, 669): [107, 108],
 (680, 695): [110],
 (705, 734): [113, 114, 115, 116, 117],
 (739, 759): [119, 120],
 (796, 807): [126],
 (813, 834): [128, 129, 130],
 (839, 842): [132, 133],
 (852, 872): [136, 137, 138],
 (877, 895): [140, 141, 142, 143, 144],
 (897, 905): [145],
 (906, 917): [146, 147],
 (939, 950): [152, 153],
 (955, 981): [155, 156],
 (999, 1011): [160, 161],
 (1015, 1018): [163],
 (1023, 1034): [165, 166],
 (1038, 1042): [168],
 (1058, 1078): [172, 173],
 (1079, 1087): [174, 175, 176],
 (110

In [445]:
for label, start, end in entity_ranges:
    # Get the list of token indices corresponding to the (start, end) range
    token_indices = start_end_2_idx.get((start, end), [])
    # The first token of the entity gets a B- tag, every following token an I- tag
    for position, i in enumerate(token_indices):
        prefix = 'B-' if position == 0 else 'I-'
        bio_tags[i] = prefix + label

[799] 5123
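
The tagging step above can be tried in isolation on a small input. This is a sketch under stated assumptions: the function name `spans_to_bio` and the toy tokens, spans, and span-to-token mapping are hypothetical, and the tag list is assumed to start out as all `'O'`, as the `bio_tags` output earlier in the notebook suggests.

```python
def spans_to_bio(num_tokens, entity_ranges, span_to_token_idx):
    """Build a BIO tag list from (label, start, end) spans and a span -> token-index map."""
    # Every token starts outside any entity
    bio_tags = ["O"] * num_tokens
    for label, start, end in entity_ranges:
        token_indices = span_to_token_idx.get((start, end), [])
        for position, i in enumerate(token_indices):
            # First token of the entity opens it (B-), the rest continue it (I-)
            prefix = "B-" if position == 0 else "I-"
            bio_tags[i] = prefix + label
    return bio_tags


# Toy example for the text "left eye pain and tearing" (offsets counted by hand)
tokens = ["left", "eye", "pain", "and", "tearing"]
entity_ranges = [
    ("Biological_structure", 0, 8),
    ("Sign_symptom", 9, 13),
    ("Sign_symptom", 18, 25),
]
span_to_token_idx = {(0, 8): [0, 1], (9, 13): [2], (18, 25): [4]}

bio_tags = spans_to_bio(len(tokens), entity_ranges, span_to_token_idx)
for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```

If two annotated spans overlap, the span processed later overwrites the tags of the earlier one; whether that is acceptable depends on how nested annotations should be handled.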


In [380]:
for label, start, end in entity_ranges:
    # Restart the search position at the beginning of the text for each entity
    current_pos = 0
    for i, token in enumerate(tokens):
        # Calculate the starting position of the token
        token_start = text.find(token, current_pos)
        # Debug output: search position and offset of the 'follow-up' token
        if token == 'follow-up':
            print(current_pos, token_start, start)
        token_end = token_start + len(token)
        # Update the current position in the text
        current_pos = token_end

        # A token belongs to the entity when it lies fully inside [start, end];
        # use >= so a token that ends exactly at the entity end is not dropped
        if start <= token_start and end >= token_end:
            if token_start == start:
                bio_tags[i] = 'B-' + label
            else:
                bio_tags[i] = 'I-' + label


5001 5002 16
5001 5002 28
5001 5002 38
5001 5002 94
5001 5002 163
5001 5002 198
5001 5002 236
5001 5002 287
5001 5002 297
5001 5002 306
5001 5002 312
5001 5002 334
5001 5002 354
5001 5002 375
5001 5002 391
5001 5002 424
5001 5002 438
5001 5002 494
5001 5002 507
5001 5002 523
5001 5002 534
5001 5002 542
5001 5002 553
5001 5002 576
5001 5002 602
5001 5002 611
5001 5002 652
5001 5002 660
5001 5002 680
5001 5002 705
5001 5002 739
5001 5002 796
5001 5002 813
5001 5002 839
5001 5002 852
5001 5002 877
5001 5002 897
5001 5002 906
5001 5002 939
5001 5002 955
5001 5002 999
5001 5002 1015
5001 5002 1023
5001 5002 1038
5001 5002 1058
5001 5002 1079
5001 5002 1105
5001 5002 1125
5001 5002 1138
5001 5002 1166
5001 5002 1178
5001 5002 1215
5001 5002 1230
5001 5002 1244
5001 5002 1265
5001 5002 1275
5001 5002 1292
5001 5002 1308
5001 5002 1323
5001 5002 1331
5001 5002 1347
5001 5002 1365
5001 5002 1378
5001 5002 1389
5001 5002 1403
5001 5002 1415
5001 5002 1445
5001 5002 1460
5001 5002 1479
5001 5002 

In [446]:
for token, tag in zip(tokens, bio_tags):
    print(token, tag)

n O
March O
2015 O
a O
62-year-old B-Age
woman B-Sex
was O
admitted B-Clinical_event
to O
our O
hospital O
She O
complained O
of O
progressive O
visual B-Sign_symptom
disturbance I-Sign_symptom
which O
began O
about O
4 O
years O
ago O
and O
was O
treated O
as O
cataract B-Disease_disorder
in O
local O
hospital O
but O
no O
relief B-Sign_symptom
was O
seen O
On O
the O
contrary O
the O
symptoms B-Sign_symptom
aggravated O
half O
a O
year O
ago O
together O
with O
headache B-Sign_symptom
left B-Biological_structure
eye I-Biological_structure
pain B-Sign_symptom
tearing B-Sign_symptom
and O
increased O
secretions B-Sign_symptom
and O
the O
computed B-Diagnostic_procedure
tomography I-Diagnostic_procedure
(CT O
scan O
of O
the O
brain B-Biological_structure
in O
local O
hospital O
showed O
a O
sellar B-Biological_structure
region I-Biological_structure
lesion B-Sign_symptom
Besides O
2 O
years O
earlier O
the O
patient O
underwent O
resection B-Therapeutic_procedure
of O
melanoma B-Diseas

'follow-up'