### MACCROBAT_DATA

Annotation files are in brat standoff format.

General annotation structure
All annotations follow the same basic structure: Each line contains one annotation, and each annotation is given an ID that appears first on the line, separated from the rest of the annotation by a single TAB character. The rest of the structure varies by annotation type.

Examples of annotation for an entity (T1), an event trigger (T2), an event (E1) and a relation (R1) are shown in the following.

Annotation ID conventions
All annotation IDs consist of a single upper-case character identifying the annotation type, followed by a number. The initial ID characters relate to annotation types as follows:

T: text-bound annotation
R: relation
E: event
A: attribute
M: modification (alias for attribute, for backward compatibility)
N: normalization [new in v1.3]
#: note
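
As a minimal sketch of the structure described above, a text-bound (`T`) line can be split on TAB characters into its ID, its type with character offsets, and the covered text. The helper name `parse_text_bound` is hypothetical; the sample line has the same shape as the T annotations in this dataset, including a discontinuous span written as `start end;start end`.

```python
def parse_text_bound(line):
    """Parse one brat text-bound annotation line (hypothetical helper)."""
    ann_id, type_and_offsets, text = line.rstrip("\n").split("\t")
    # The middle field is "<Type> <start> <end>[;<start> <end>...]"
    ann_type, offsets = type_and_offsets.split(" ", 1)
    spans = [tuple(int(n) for n in part.split()) for part in offsets.split(";")]
    return {"id": ann_id, "type": ann_type, "spans": spans, "text": text}


# Sample T line with two disjoint fragments
example = "T40\tDisease_disorder 701 714;730 735\tgranular cell tumor"
parsed = parse_text_bound(example)
print(parsed["type"], parsed["spans"])
```

Returning the spans as a list of `(start, end)` tuples keeps single-range and multi-range annotations in the same shape, which avoids special cases later.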

### Goal:
* Convert .ann (brat standoff format) & .txt to BIO format files
* A sample of the BIO format is shown below:

| Word           | BIO Tag    |
|----------------|-----------|
| The            | O         |
| patient        | O         |
| has            | O         |
| a              | O         |
| history        | O         |
| of             | O         |
| hypertension   | B-DISEASE |
| and            | O         |
| type           | B-DISEASE |
| 2              | I-DISEASE |
| diabetes       | I-DISEASE |
| .              | O         |
| The            | O         |
| patient        | O         |
| denies         | O         |
| chest          | B-SYMPTOM |
| pain           | I-SYMPTOM |
| and            | O         |
| shortness      | B-SYMPTOM |
| of             | I-SYMPTOM |
| breath         | I-SYMPTOM |
| .              | O         |
| The            | O         |
| physical       | O         |
| exam           | O         |
| was            | O         |
| unremarkable   | O         |
| except         | O         |
| for            | O         |
| an             | O         |
| elevated       | B-SYMPTOM |
| blood          | I-SYMPTOM |
| pressure       | I-SYMPTOM |
| of             | I-SYMPTOM |
| 140/90         | B-SYMPTOM |
| mmHg           | I-SYMPTOM |
| .              | O         |
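
To make the tag semantics concrete, the sketch below groups a BIO-tagged token sequence back into entities: a `B-` tag opens an entity, matching `I-` tags extend it, and anything else closes it. `bio_to_entities` is a hypothetical helper written for illustration, and the tokens are taken from the sample table above.

```python
def bio_to_entities(tokens, tags):
    """Group (token, tag) pairs into (label, phrase) entities."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not continue the open entity) closes it
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


tokens = ["The", "patient", "denies", "chest", "pain", "and", "shortness", "of", "breath", "."]
tags = ["O", "O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-SYMPTOM", "I-SYMPTOM", "I-SYMPTOM", "O"]
print(bio_to_entities(tokens, tags))
# [('SYMPTOM', 'chest pain'), ('SYMPTOM', 'shortness of breath')]
```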


In [11]:
MACCROBAT_data_dir = "../data/MACCROBAT"

In [8]:
import os
import pandas as pd


In [12]:
file_ids = []
for file in os.listdir(MACCROBAT_data_dir):
    file_id = file.split(".")[0]
    if file_id not in file_ids:
        file_ids.append(file_id)
file_ids

['19860925',
 '26361640',
 '26228535',
 '27773410',
 '23678274',
 '25853982',
 '28103924',
 '27064109',
 '28154700',
 '20146086',
 '26656340',
 '28353558',
 '22515939',
 '28353588',
 '26309459',
 '28272235',
 '23242090',
 '23312850',
 '23124805',
 '26106249',
 '26313770',
 '26285706',
 '18416479',
 '28353613',
 '28151916',
 '26175648',
 '23468586',
 '28216610',
 '27059701',
 '28121940',
 '23077697',
 '27741115',
 '21067996',
 '28100235',
 '28151860',
 '25884600',
 '27904130',
 '19214295',
 '18787726',
 '22719160',
 '28422883',
 '26675562',
 '21477357',
 '25139918',
 '28353561',
 '22791498',
 '28538413',
 '26457578',
 '27842605',
 '20671919',
 '25155594',
 '26469535',
 '28353604',
 '28403092',
 '28239141',
 '28202869',
 '25024632',
 '28403086',
 '18666334',
 '25572898',
 '28296775',
 '22514576',
 '26584481',
 '28296749',
 '16778410',
 '19860007',
 '28190872',
 '25743872',
 '26523273',
 '28193213',
 '28120581',
 '26670309',
 '26336183',
 '25410883',
 '26530965',
 '28057913',
 '20977862',

### Checking all the tag types in all .ann files

In [13]:
tags = []
for ann_file in [os.path.join(MACCROBAT_data_dir, file_id+".ann") for file_id in file_ids]:
    with open(ann_file) as f:
        for line in f.readlines():
            tags.append(line.split("\t")[0][0])

set(tags)

{'#', '*', 'A', 'E', 'R', 'T'}

For named entity recognition, only the T (text-bound) annotations are needed to generate the BIO format. The other ID prefixes found above (#, *, A, E, and R) carry other kinds of information in the .ann files and are not needed for this task.

\# marks annotator notes (comments).
\* marks equivalence relations (brat's `Equiv` lines), which link annotations that refer to the same thing, such as coreferent mentions.
A marks attribute annotations.
E marks event annotations.
R marks relation annotations.

We therefore keep only the T lines when generating the BIO format.

### Find Entity types in ann files

In [23]:
# create an empty list to store the DataFrames
df_list = []

for ann_file in [os.path.join(MACCROBAT_data_dir, file_id+".ann") for file_id in file_ids]:
    # read the TSV file with consecutive tabs treated as a single delimiter
    df = pd.read_csv(ann_file, sep="\t+", header=None, names=['id', 'entity_with_range', 'word'], engine='python')

    # selecting only T tag rows; copy so that adding a column does not write to a slice of the original frame
    df = df[df.iloc[:,0].str.startswith('T')].copy()

    df['file_name'] = f"{os.path.splitext(ann_file)[0]}.txt"

    df_list.append(df)

ann_df = pd.concat(df_list)
ann_df = ann_df.reset_index(drop=True)

In [24]:
ann_df.head()

|   | id | entity_with_range | word | file_name |
|---|----|-------------------|------|-----------|
| 0 | T1 | Age 4 15 | 24-year-old | ../data/MACCROBAT/19860925.txt |
| 1 | T2 | Sex 28 32 | male | ../data/MACCROBAT/19860925.txt |
| 2 | T3 | History 16 27 | non-smoking | ../data/MACCROBAT/19860925.txt |
| 3 | T4 | Clinical_event 41 50 | presented | ../data/MACCROBAT/19860925.txt |
| 4 | T5 | Sign_symptom 65 75 | hemoptysis | ../data/MACCROBAT/19860925.txt |


In [25]:
ann_df.shape

(25041, 4)

A few annotations tag text that spans more than one disjoint range.
Below are the entities with disjoint ranges.

In [26]:
ann_df[ann_df['entity_with_range'].str.split().str.len() > 3]

|   | id | entity_with_range | word | file_name |
|---|----|-------------------|------|-----------|
| 39 | T40 | Disease_disorder 701 714;730 735 | granular cell tumor | ../data/MACCROBAT/19860925.txt |
| 1190 | T31 | Dosage 4085 4093;4103 4108 | low dose daily | ../data/MACCROBAT/20146086.txt |
| 1272 | T160 | Dosage 3117 3124;3135 3141 | 1500 mg weekly | ../data/MACCROBAT/20146086.txt |
| 1469 | T101 | Detailed_description 2552 2559;2585 2654 | neither had developed any signs or symptoms su... | ../data/MACCROBAT/28353558.txt |
| 2376 | T42 | Administration 704 715;721 726 | intravenous bolus | ../data/MACCROBAT/23124805.txt |
| 2380 | T47 | Dosage 828 835;861 879 | 5 mg/kg every 4 to 6 weeks | ../data/MACCROBAT/23124805.txt |
| 2381 | T48 | Dosage 849 857;861 879 | 10 mg/kg every 4 to 6 weeks | ../data/MACCROBAT/23124805.txt |
| 2853 | T18 | Diagnostic_procedure 341 350;357 358 | Hepatitis C | ../data/MACCROBAT/18416479.txt |
| 3335 | T16 | Disease_disorder 315 317;319 330 | LV dysfunction | ../data/MACCROBAT/23468586.txt |
| 3336 | T15 | Disease_disorder 297 313;319 330 | left ventricular dysfunction | ../data/MACCROBAT/23468586.txt |


In BIO tagging, the "B" prefix marks the first token of an entity and the "I" prefix marks a token inside an entity after the first. When an annotation covers disjoint ranges, each range can be treated as its own fragment that starts with a "B" tag.

Tagging every fragment of a disjoint annotation, rather than dropping the annotation, keeps all of the annotated text in the training data, so the model can still learn the parts of an entity that are separated by other words.
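
The fragment-per-range idea can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: the text and offsets are made up, `tag_fragments` is a hypothetical helper, and tokens are split on whitespace only.

```python
import re


def tag_fragments(text, fragments, label):
    """Tag whitespace tokens; every fragment starts its own B- tag."""
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end in fragments:
            # brat offsets are end-exclusive: a span covers text[start:end]
            if start <= match.start() and match.end() <= end:
                tag = f"B-{label}" if match.start() == start else f"I-{label}"
        rows.append((match.group(), tag))
    return rows


# Made-up sentence with one annotation split into two fragments
text = "intravenous 5 mg bolus"
fragments = [(0, 11), (17, 22)]
for token, tag in tag_fragments(text, fragments, "Administration"):
    print(f"{token}\t{tag}")
```

Here both "intravenous" and "bolus" receive a `B-Administration` tag, while the tokens between them stay `O`.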

### Splitting the entity_with_range column

In [27]:
ann_df['entity'] = ann_df['entity_with_range'].str.split().str[0]
ann_df['ranges'] = ann_df['entity_with_range'].str.split().str[1:]
ann_df = ann_df[['id', 'entity', 'ranges', 'word', 'file_name']]
ann_df.head()

|   | id | entity | ranges | word | file_name |
|---|----|--------|--------|------|-----------|
| 0 | T1 | Age | [4, 15] | 24-year-old | ../data/MACCROBAT/19860925.txt |
| 1 | T2 | Sex | [28, 32] | male | ../data/MACCROBAT/19860925.txt |
| 2 | T3 | History | [16, 27] | non-smoking | ../data/MACCROBAT/19860925.txt |
| 3 | T4 | Clinical_event | [41, 50] | presented | ../data/MACCROBAT/19860925.txt |
| 4 | T5 | Sign_symptom | [65, 75] | hemoptysis | ../data/MACCROBAT/19860925.txt |


### All the entities in the dataset

In [28]:
print("The set of entities available in the dataset is as follows: ")
set(ann_df['entity'])

The set of entities available in the dataset is as follows: 


{'Activity',
 'Administration',
 'Age',
 'Area',
 'Biological_attribute',
 'Biological_structure',
 'Clinical_event',
 'Color',
 'Coreference',
 'Date',
 'Detailed_description',
 'Diagnostic_procedure',
 'Disease_disorder',
 'Distance',
 'Dosage',
 'Duration',
 'Family_history',
 'Frequency',
 'Height',
 'History',
 'Lab_value',
 'Mass',
 'Medication',
 'Nonbiological_location',
 'Occupation',
 'Other_entity',
 'Other_event',
 'Outcome',
 'Personal_background',
 'Qualitative_concept',
 'Quantitative_concept',
 'Severity',
 'Sex',
 'Shape',
 'Sign_symptom',
 'Subject',
 'Texture',
 'Therapeutic_procedure',
 'Time',
 'Volume',
 'Weight'}

### Sample words for each entity

In [55]:
# Pick one random annotation from a group (groups produced by groupby are never empty)
def select_random_rows(group):
    return group.sample(n=1)

# Apply the function to the DataFrame grouped by 'entity'
random_rows = ann_df.groupby('entity').apply(select_random_rows)

# Print the selected random rows
for index, row in random_rows.iterrows():
    print(f"Entity: {row['entity']:>20} \n word: {row['word']:>20}")


Entity:             Activity 
 word:                 walk
Entity:       Administration 
 word:          intravenous
Entity:                  Age 
 word:          20-year-old
Entity:                 Area 
 word: 9 cm × 6 cm in diameter
Entity: Biological_attribute 
 word:           eczematous
Entity: Biological_structure 
 word:     descending colon
Entity:       Clinical_event 
 word:              visited
Entity:                Color 
 word:            yellowish
Entity:          Coreference 
 word:     heavy chain gene
Entity:                 Date 
 word:                 2012
Entity: Detailed_description 
 word: plate calcifications
Entity: Diagnostic_procedure 
 word: T2-weighted sequence
Entity:     Disease_disorder 
 word:  sebaceous carcinoma
Entity:             Distance 
 word:                 2-cm
Entity:               Dosage 
 word:     high-dose weekly
Entity:             Duration 
 word:         three months
Entity:       Family_history 
 word:    family of potters
Entity:    

In [58]:
ann_df.groupby(['entity']).size().sort_values(ascending=False)

entity
Diagnostic_procedure      4567
Sign_symptom              3359
Biological_structure      2931
Detailed_description      2901
Lab_value                 2858
Disease_disorder          1362
Medication                1076
Therapeutic_procedure     1005
Date                       731
Clinical_event             626
History                    392
Severity                   369
Dosage                     362
Nonbiological_location     354
Coreference                313
Duration                   280
Age                        206
Sex                        191
Administration             175
Distance                   122
Activity                   108
Family_history              81
Frequency                   76
Shape                       65
Time                        57
Personal_background         57
Subject                     54
Color                       52
Texture                     46
Area                        43
Outcome                     42
Qualitative_concept         41
V

In [59]:
# Define the selected entities
selected_entities = [
    'Age', 'Biological_attribute', 'Biological_structure', 'Clinical_event', 'Diagnostic_procedure',
    'Disease_disorder', 'Dosage', 'Family_history', 'Height', 'History', 'Lab_value', 'Mass',
    'Medication', 'Sex', 'Sign_symptom', 'Therapeutic_procedure', 'Weight'
]

In [60]:
filtered_ann_df = ann_df[ann_df['entity'].isin(selected_entities)]
filtered_ann_df = filtered_ann_df.reset_index(drop=True)
filtered_ann_df.head()

|   | id | entity | ranges | word | file_name |
|---|----|--------|--------|------|-----------|
| 0 | T1 | Age | [4, 15] | 24-year-old | ../data/MACCROBAT/19860925.txt |
| 1 | T2 | Sex | [28, 32] | male | ../data/MACCROBAT/19860925.txt |
| 2 | T3 | History | [16, 27] | non-smoking | ../data/MACCROBAT/19860925.txt |
| 3 | T4 | Clinical_event | [41, 50] | presented | ../data/MACCROBAT/19860925.txt |
| 4 | T5 | Sign_symptom | [65, 75] | hemoptysis | ../data/MACCROBAT/19860925.txt |


In [61]:
filtered_ann_df.shape

(19036, 5)

### Checking if there are overlapping annotation ranges in each file

In [156]:
test_df = filtered_ann_df[filtered_ann_df['file_name'] == '../data/MACCROBAT/19860925.txt']
test_df.head()

|   | id | entity | ranges | word | file_name |
|---|----|--------|--------|------|-----------|
| 0 | T1 | Age | [4, 15] | 24-year-old | ../data/MACCROBAT/19860925.txt |
| 1 | T2 | Sex | [28, 32] | male | ../data/MACCROBAT/19860925.txt |
| 2 | T3 | History | [16, 27] | non-smoking | ../data/MACCROBAT/19860925.txt |
| 3 | T4 | Clinical_event | [41, 50] | presented | ../data/MACCROBAT/19860925.txt |
| 4 | T5 | Sign_symptom | [65, 75] | hemoptysis | ../data/MACCROBAT/19860925.txt |


In [157]:


# test_fun is defined in a later cell (In [246]); that cell must be run before this one
test_df.loc[:, 'ranges'] = test_df['ranges'].apply(lambda range_list: test_fun(range_list))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df.loc[:, 'ranges'] = test_df['ranges'].apply(lambda range_list: test_fun(range_list))


In [207]:
ann_df.file_name.values

array(['../data/MACCROBAT/19860925.txt', '../data/MACCROBAT/19860925.txt',
       '../data/MACCROBAT/19860925.txt', ...,
       '../data/MACCROBAT/26714786.txt', '../data/MACCROBAT/26714786.txt',
       '../data/MACCROBAT/26714786.txt'], dtype=object)

In [246]:
def test_fun(x):
    if len(x) > 2:
        new_x = []
        start = x[0]
        for num in x[1:]:
            end = num.split(';')[0]
            new_x.append([start, end])
            if len(num.split(';')) > 1:
                start = num.split(';')[1]
        return new_x
    else:
        return [x]

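overlapping_entities = {}
# Visit each file once. Looping over file_name.values instead repeats a file once per
# annotation row and multiplies that file's counts by its number of rows; the counts
# recorded further down come from that version and are inflated accordingly.
for file_name in filtered_ann_df.file_name.unique():
    test_df = filtered_ann_df[filtered_ann_df['file_name'] == file_name]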

    range_to_label = {}

    for index, row in test_df.iterrows():
        for start, end in test_fun(row['ranges']):
            range_to_label[(int(start), int(end))] = row['entity']
        # if isinstance(row['ranges'][0], list):
        #     for start, end in row['ranges']:
        #         range_to_label[(int(start), int(end))] = row['entity']
        # else:
        #     start = int(row['ranges'][0])
        #     end = int(row['ranges'][1])
        #     range_to_label[(start, end)] = row['entity']


    prev_start = -1
    prev_end = -1
    for start, end in dict(sorted(range_to_label.items(), key=lambda x: x[0][0])).keys():
        if start <= prev_end:
            if range_to_label[(prev_start, prev_end)] != range_to_label[(start, end)]:

                if range_to_label[(prev_start, prev_end)] > range_to_label[(start, end)]:
                    entity_1 = range_to_label[(prev_start, prev_end)]
                    entity_1_len = prev_end - prev_start
                    entity_2 =  range_to_label[(start, end)]
                    entity_2_len = end - start
                else:
                    entity_1 =  range_to_label[(start, end)]
                    entity_1_len = end - start
                    entity_2 = range_to_label[(prev_start, prev_end)]
                    entity_2_len = prev_end - prev_start

                if (entity_1, entity_2) not in overlapping_entities:
                    overlapping_entities[(entity_1, entity_2)] = [0, 0, 0]
                overlapping_entities[(entity_1, entity_2)][0] += 1
                if entity_1_len > entity_2_len:
                    overlapping_entities[(entity_1, entity_2)][1] += 1
                elif entity_1_len < entity_2_len:
                    overlapping_entities[(entity_1, entity_2)][2] += 1

        prev_start = start
        prev_end = end

In [247]:
print("Overlapping entity pairs, in the format")
print("(entity1, entity2): [number of overlapping occurrences, number of occurrences where entity 1's span is longer, number of occurrences where entity 2's span is longer]")
overlapping_entities

Overlapping entity pairs, in the format
(entity1, entity2): [number of overlapping occurrences, number of occurrences where entity 1's span is longer, number of occurrences where entity 2's span is longer]


{('History', 'Biological_structure'): [2028, 2028, 0],
 ('History', 'Disease_disorder'): [3026, 3026, 0],
 ('Lab_value', 'Diagnostic_procedure'): [240, 137, 103],
 ('Sign_symptom', 'History'): [1904, 0, 1904],
 ('Sign_symptom', 'Family_history'): [201, 0, 201],
 ('Medication', 'History'): [726, 0, 726],
 ('History', 'Diagnostic_procedure'): [304, 304, 0],
 ('Lab_value', 'Family_history'): [180, 0, 180],
 ('Family_history', 'Diagnostic_procedure'): [180, 180, 0],
 ('Sign_symptom', 'Biological_structure'): [417, 417, 0],
 ('Lab_value', 'History'): [899, 0, 899],
 ('History', 'Clinical_event'): [255, 255, 0],
 ('Therapeutic_procedure', 'History'): [1305, 0, 1305],
 ('Disease_disorder', 'Diagnostic_procedure'): [133, 0, 133],
 ('Sign_symptom', 'Disease_disorder'): [260, 132, 128],
 ('Family_history', 'Disease_disorder'): [270, 270, 0]}
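
The counts above compare only span lengths. A stricter check can look at the offsets themselves to tell full containment from partial overlap. The sketch below assumes brat's end-exclusive offsets (a span covers `text[start:end]`), so two spans that merely touch are treated as disjoint; `classify_overlap` is a hypothetical helper and the offsets are made up.

```python
def classify_overlap(a, b):
    """Return how two end-exclusive (start, end) spans relate."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end <= b_start or b_end <= a_start:
        return "disjoint"
    if a_start <= b_start and b_end <= a_end:
        # Also covers identical spans
        return "a_contains_b"
    if b_start <= a_start and a_end <= b_end:
        return "b_contains_a"
    return "partial"


print(classify_overlap((10, 30), (20, 30)))  # a_contains_b
print(classify_overlap((10, 20), (15, 25)))  # partial
print(classify_overlap((10, 20), (20, 30)))  # disjoint
```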

In [235]:
overlapping_entity_list = []
for e1, e2 in list(overlapping_entities.keys()):
    overlapping_entity_list.extend([e1, e2])
set(overlapping_entity_list)

{'Biological_structure',
 'Clinical_event',
 'Diagnostic_procedure',
 'Disease_disorder',
 'Family_history',
 'History',
 'Lab_value',
 'Medication',
 'Sign_symptom',
 'Therapeutic_procedure'}

In [225]:
for entity in selected_entities:
    print(f"{entity:>20}: {filtered_ann_df[filtered_ann_df['entity']==entity].shape[0]}")

                 Age: 206
Biological_attribute: 10
Biological_structure: 2931
      Clinical_event: 626
Diagnostic_procedure: 4567
    Disease_disorder: 1362
              Dosage: 362
      Family_history: 81
              Height: 4
             History: 392
           Lab_value: 2858
                Mass: 2
          Medication: 1076
                 Sex: 191
        Sign_symptom: 3359
Therapeutic_procedure: 1005
              Weight: 4


In [None]:
# Define the selected entities
# change if necessary
selected_entities = [
    'Age', 'Biological_attribute', 'Biological_structure', 'Clinical_event', 'Diagnostic_procedure',
    'Disease_disorder', 'Dosage', 'Family_history', 'Height', 'History', 'Lab_value', 'Mass',
    'Medication', 'Sex', 'Sign_symptom', 'Therapeutic_procedure', 'Weight'
]

In [306]:
def iterate_ranges(x):
    if len(x) > 2:
        new_x = []
        start = x[0]
        for num in x[1:]:
            end = num.split(';')[0]
            new_x.append([start, end])
            if len(num.split(';')) > 1:
                start = num.split(';')[1]
        return new_x
    else:
        return [x]

overlapping_entities = {}
overall_range_to_label = {}
# Visit each file once; file_name.values would repeat a file for every annotation row
for file_name in filtered_ann_df.file_name.unique():
    test_df = filtered_ann_df[filtered_ann_df['file_name'] == file_name]

    range_to_label = {}
    for index, row in test_df.iterrows():
        for start, end in iterate_ranges(row['ranges']):
            range_to_label[(int(start), int(end))] = row['entity']
            overall_range_to_label[(file_name, int(start), int(end))] = row['entity']

    prev_start = -1
    prev_end = -1
    for start, end in dict(sorted(range_to_label.items(), key=lambda x: x[0][0])).keys():

        if start <= prev_end:
            if overall_range_to_label[(file_name, prev_start, prev_end)] != overall_range_to_label[(file_name, start, end)]:
                entity_1 = overall_range_to_label[(file_name, prev_start, prev_end)]
                entity_1_len = prev_end - prev_start
                entity_2 =  overall_range_to_label[(file_name, start, end)]
                entity_2_len = end - start

                if prev_end - prev_start > end - start:
                    del overall_range_to_label[(file_name, start, end)]
                    continue

                elif prev_end - prev_start < end - start:
                    del overall_range_to_label[(file_name, prev_start, prev_end)]

            else:
                # Same label on both spans: merge them into a single span
                entity = overall_range_to_label[(file_name, prev_start, prev_end)]
                del overall_range_to_label[(file_name, prev_start, prev_end)]
                del overall_range_to_label[(file_name, start, end)]
                merged_end = max(end, prev_end)
                overall_range_to_label[(file_name, prev_start, merged_end)] = entity
                # Track the merged span so that later lookups use a key that still exists
                prev_end = merged_end
                continue

        prev_start = start
        prev_end = end

In [307]:
prev_start = -1
prev_end = -1
prev_file = None
for file_name, start, end in dict(sorted(overall_range_to_label.items(), key=lambda x: (x[0], x[1]))).keys():
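    if prev_file != file_name:
        # New file: reset the previous span, then fall through so that this first
        # span is recorded as the previous one for the next comparison
        prev_file = file_name
        prev_start = -1
        prev_end = -1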

    if start <= prev_end:
        print('Alert - Check')
        print(file_name, prev_start, prev_end, start, end)
        print(overall_range_to_label[(file_name, prev_start, prev_end)], overall_range_to_label[(file_name, start, end)])

    prev_start = start
    prev_end = end


**Overlapping ranges are now resolved, so each token maps to a single entity.**

In [312]:
print(f"Total number of (start, end) ranges with corresponding tag extracted is {len(overall_range_to_label)}")

Total number of (start, end) ranges with corresponding tag extracted is 18760


### Generating BIO format files

In [314]:
import string

In [315]:
def tokenize_text(text):
    # Tokenize the text into a list of words
    tokens = []
    for sentence in text.split('\n'):
        for word in sentence.split():
            # Remove trailing punctuation marks from the word
            while word and word[-1] in string.punctuation:
                word = word[:-1]
            tokens.append(word)
    return tokens
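
Because the tags are later matched by character offset, it can help to record each token's offsets at tokenization time instead of searching for the token again in the text. The following is a minimal sketch under the same assumptions as the tokenizer above (whitespace splitting, trailing punctuation trimmed); `tokenize_with_offsets` is a hypothetical name and the sample sentence is made up.

```python
import re
import string


def tokenize_with_offsets(text):
    """Return (token, start, end) triples with trailing punctuation trimmed."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        word = match.group().rstrip(string.punctuation)
        # The start offset is unchanged; the end shrinks by the punctuation removed
        tokens.append((word, match.start(), match.start() + len(word)))
    return tokens


sample = "A 24-year-old male presented."
for token, start, end in tokenize_with_offsets(sample):
    print(token, start, end)
```

With the offsets attached to each token, the tagging step can compare them directly against the annotation ranges.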

In [341]:
output_dir = '../data/BIO_FILES'
os.makedirs(output_dir, exist_ok=True)

for file_id in file_ids:

    txt_file = os.path.join(MACCROBAT_data_dir, file_id+".txt")
    with open(txt_file, 'r', encoding='utf-8') as f:
        text = f.read()

    tokens = tokenize_text(text)

    # Initialize a list to hold the BIO-formatted tags
    bio_tags = ['O'] * len(tokens)

    # Collect this file's (start, end, tag) spans once, instead of re-sorting
    # every span of every file for each token
    file_spans = sorted(
        (start, end, tag)
        for (file_name, start, end), tag in overall_range_to_label.items()
        if file_name == txt_file
    )

    curr_pos = 0
    for i in range(len(tokens)):
        # Locate the token in the raw text to get its character offsets
        token_start = text.find(tokens[i], curr_pos)
        token_end = token_start + len(tokens[i])
        curr_pos = token_end

        for start, end, tag in file_spans:
            # A token is tagged only if it lies entirely inside the span
            if start <= token_start and end >= token_end:
                if token_start == start:
                    bio_tags[i] = f'B-{tag}'
                else:
                    bio_tags[i] = f'I-{tag}'

    # Write the BIO tags to a new file
    output_file = os.path.join(output_dir, file_id+".bio")
    with open(output_file, 'w', encoding='utf-8') as f:
        sentence_start_index = 0
        for sentence in text.split('\n'):
            sentence_tokens = sentence.split()
            sentence_length = len(sentence_tokens)
            sentence_end_index = sentence_start_index + sentence_length
            for i in range(sentence_start_index, sentence_end_index):
                f.write(sentence_tokens[i - sentence_start_index] + '\t' + bio_tags[i] + '\n')
            f.write('\n')
            sentence_start_index = sentence_end_index

print("Conversion completed successfully.")



Conversion completed successfully.
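
The generated files can be read back for training or inspection. A minimal sketch, assuming the layout written above (one `token<TAB>tag` pair per line, with a blank line between sentences); `read_bio` is a hypothetical helper name, and the sample lines stand in for an open file.

```python
def read_bio(lines):
    """Parse tab-separated token/tag lines; blank lines separate sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


sample_lines = ["hemoptysis\tB-Sign_symptom\n", "\n", "male\tB-Sex\n"]
print(read_bio(sample_lines))
```

Any iterable of lines works, so an open file handle can be passed directly, for example `read_bio(open(path, encoding='utf-8'))`.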
