### MACCROBAT_DATA

Annotation files are in brat standoff format

General annotation structure
All annotations follow the same basic structure: Each line contains one annotation, and each annotation is given an ID that appears first on the line, separated from the rest of the annotation by a single TAB character. The rest of the structure varies by annotation type.

Examples of annotation for an entity (T1), an event trigger (T2), an event (E1) and a relation (R1) are shown in the following.

Annotation ID conventions
All annotations IDs consist of a single upper-case character identifying the annotation type and a number. The initial ID characters relate to annotation types as follows:

T: text-bound annotation
R: relation
E: event
A: attribute
M: modification (alias for attribute, for backward compatibility)
N: normalization [new in v1.3]
#: note

### Goal:
* Convert .ann (brat standoff format) & .txt to BIO format files
* Sample BIO format is as follows

* BIO format:
*
| Word           | BIO Tag    |
|----------------|-----------|
| The            | O         |
| patient        | O         |
| has            | O         |
| a              | O         |
| history        | O         |
| of             | O         |
| hypertension   | B-DISEASE |
| and            | I-DISEASE |
| type           | O         |
| 2              | O         |
| diabetes       | B-DISEASE |
| .              | O         |
| The            | O         |
| patient        | O         |
| denies         | O         |
| chest          | B-SYMPTOM |
| pain           | I-SYMPTOM |
| and            | O         |
| shortness      | B-SYMPTOM |
| of             | I-SYMPTOM |
| breath         | I-SYMPTOM |
| .              | O         |
| The            | O         |
| physical       | O         |
| exam           | O         |
| was            | O         |
| unremarkable   | O         |
| except         | O         |
| for            | O         |
| an             | O         |
| elevated       | B-SYMPTOM |
| blood          | I-SYMPTOM |
| pressure       | I-SYMPTOM |
| of             | I-SYMPTOM |
| 140/90         | B-SYMPTOM |
| mmHg           | I-SYMPTOM |
| .              | O         |


In [1]:
MACCROBAT_data_dir = "../data/MACCROBAT"

In [8]:
import os
import pandas as pd


In [12]:
file_ids = []
for file in os.listdir(MACCROBAT_data_dir):
    file_id = file.split(".")[0]
    if file_id not in file_ids:
        file_ids.append(file_id)
file_ids

['19860925',
 '26361640',
 '26228535',
 '27773410',
 '23678274',
 '25853982',
 '28103924',
 '27064109',
 '28154700',
 '20146086',
 '26656340',
 '28353558',
 '22515939',
 '28353588',
 '26309459',
 '28272235',
 '23242090',
 '23312850',
 '23124805',
 '26106249',
 '26313770',
 '26285706',
 '18416479',
 '28353613',
 '28151916',
 '26175648',
 '23468586',
 '28216610',
 '27059701',
 '28121940',
 '23077697',
 '27741115',
 '21067996',
 '28100235',
 '28151860',
 '25884600',
 '27904130',
 '19214295',
 '18787726',
 '22719160',
 '28422883',
 '26675562',
 '21477357',
 '25139918',
 '28353561',
 '22791498',
 '28538413',
 '26457578',
 '27842605',
 '20671919',
 '25155594',
 '26469535',
 '28353604',
 '28403092',
 '28239141',
 '28202869',
 '25024632',
 '28403086',
 '18666334',
 '25572898',
 '28296775',
 '22514576',
 '26584481',
 '28296749',
 '16778410',
 '19860007',
 '28190872',
 '25743872',
 '26523273',
 '28193213',
 '28120581',
 '26670309',
 '26336183',
 '25410883',
 '26530965',
 '28057913',
 '20977862',

### Checking all the tag types in all .ann files

In [13]:
tags = []
for ann_file in [os.path.join(MACCROBAT_data_dir, file_id+".ann") for file_id in file_ids]:
    with open(ann_file) as f:
        for line in f.readlines():
            tags.append(line.split("\t")[0][0])

set(tags)

{'#', '*', 'A', 'E', 'R', 'T'}

For named entity recognition, only the T tags are needed to generate the BIO format. The other tags such as #, *, A, E, and R are used for representing different types of information in the .ann files and are not directly related to named entity recognition.

\# is used to denote comments in the annotation file.
\* is used to represent coreference annotations.
A is used to represent attribute annotations.
E is used to represent event annotations.
R is used to represent relation annotations.

Therefore, we can ignore all other tags except for the T tags when generating the BIO format for named entity recognition.

### Find Entity types in ann files

In [23]:
# create an empty list to store the DataFrames
df_list = []

for ann_file in [os.path.join(MACCROBAT_data_dir, file_id+".ann") for file_id in file_ids]:
    # read the TSV file with consecutive tabs treated as a single delimiter
    df = pd.read_csv(ann_file, sep="\t+", header=None, names=['id', 'entity_with_range', 'word'], engine='python')

    # selecting only T tag rows
    df = df[df.iloc[:,0].str.startswith('T')]

    df['file_name'] = f"{os.path.splitext(ann_file)[0]}.txt"

    df_list.append(df)

ann_df = pd.concat(df_list)
ann_df = ann_df.reset_index(drop=True)

In [24]:
ann_df.head()

Unnamed: 0,id,entity_with_range,word,file_name
0,T1,Age 4 15,24-year-old,../data/MACCROBAT/19860925.txt
1,T2,Sex 28 32,male,../data/MACCROBAT/19860925.txt
2,T3,History 16 27,non-smoking,../data/MACCROBAT/19860925.txt
3,T4,Clinical_event 41 50,presented,../data/MACCROBAT/19860925.txt
4,T5,Sign_symptom 65 75,hemoptysis,../data/MACCROBAT/19860925.txt


In [25]:
ann_df.shape

(25041, 4)

Few entries in data represents the tagged text spans more than one disjoint ranges
Below are the entites with disjoint ranges

In [26]:
ann_df[ann_df['entity_with_range'].str.split().str.len() > 3]

Unnamed: 0,id,entity_with_range,word,file_name
39,T40,Disease_disorder 701 714;730 735,granular cell tumor,../data/MACCROBAT/19860925.txt
1190,T31,Dosage 4085 4093;4103 4108,low dose daily,../data/MACCROBAT/20146086.txt
1272,T160,Dosage 3117 3124;3135 3141,1500 mg weekly,../data/MACCROBAT/20146086.txt
1469,T101,Detailed_description 2552 2559;2585 2654,neither had developed any signs or symptoms su...,../data/MACCROBAT/28353558.txt
2376,T42,Administration 704 715;721 726,intravenous bolus,../data/MACCROBAT/23124805.txt
2380,T47,Dosage 828 835;861 879,5 mg/kg every 4 to 6 weeks,../data/MACCROBAT/23124805.txt
2381,T48,Dosage 849 857;861 879,10 mg/kg every 4 to 6 weeks,../data/MACCROBAT/23124805.txt
2853,T18,Diagnostic_procedure 341 350;357 358,Hepatitis C,../data/MACCROBAT/18416479.txt
3335,T16,Disease_disorder 315 317;319 330,LV dysfunction,../data/MACCROBAT/23468586.txt
3336,T15,Disease_disorder 297 313;319 330,left ventricular dysfunction,../data/MACCROBAT/23468586.txt


Note that the "B" prefix is used to indicate the beginning of an entity, while the "I" prefix is used to indicate an intermediate token within an entity. When there are disjoint ranges for an entity, we can start a new entity with a "B" prefix for each range.

Including the full text that spans disjoint ranges in the BIO format will help the model learn to recognize the entire entity, even if it is fragmented across multiple parts of the text.

### Splitting entity_with_ranges column

In [27]:
ann_df['entity'] = ann_df['entity_with_range'].str.split().str[0]
ann_df['ranges'] = ann_df['entity_with_range'].str.split().str[1:]
ann_df = ann_df[['id', 'entity', 'ranges', 'word', 'file_name']]
ann_df.head()

Unnamed: 0,id,entity,ranges,word,file_name
0,T1,Age,"[4, 15]",24-year-old,../data/MACCROBAT/19860925.txt
1,T2,Sex,"[28, 32]",male,../data/MACCROBAT/19860925.txt
2,T3,History,"[16, 27]",non-smoking,../data/MACCROBAT/19860925.txt
3,T4,Clinical_event,"[41, 50]",presented,../data/MACCROBAT/19860925.txt
4,T5,Sign_symptom,"[65, 75]",hemoptysis,../data/MACCROBAT/19860925.txt


### All the entities in the dataset

In [28]:
print("The set of entities available in the dataset is as follows: ")
set(ann_df['entity'])

The set of entities available in the dataset is as follows: 


{'Activity',
 'Administration',
 'Age',
 'Area',
 'Biological_attribute',
 'Biological_structure',
 'Clinical_event',
 'Color',
 'Coreference',
 'Date',
 'Detailed_description',
 'Diagnostic_procedure',
 'Disease_disorder',
 'Distance',
 'Dosage',
 'Duration',
 'Family_history',
 'Frequency',
 'Height',
 'History',
 'Lab_value',
 'Mass',
 'Medication',
 'Nonbiological_location',
 'Occupation',
 'Other_entity',
 'Other_event',
 'Outcome',
 'Personal_background',
 'Qualitative_concept',
 'Quantitative_concept',
 'Severity',
 'Sex',
 'Shape',
 'Sign_symptom',
 'Subject',
 'Texture',
 'Therapeutic_procedure',
 'Time',
 'Volume',
 'Weight'}

### Seeing sample words in each entities

In [55]:
# Define a function to select random rows from each group
def select_random_rows(group):
    return group.sample(n=min(1, len(group)), replace=False)

# Apply the function to the DataFrame grouped by 'group'
random_rows = ann_df.groupby('entity').apply(select_random_rows)

# Print the selected random rows
for index, row in random_rows.iterrows():
    print(f"Entity: {row['entity']:>20} \n word: {row['word']:>20}")


Entity:             Activity 
 word:                 walk
Entity:       Administration 
 word:          intravenous
Entity:                  Age 
 word:          20-year-old
Entity:                 Area 
 word: 9 cm × 6 cm in diameter
Entity: Biological_attribute 
 word:           eczematous
Entity: Biological_structure 
 word:     descending colon
Entity:       Clinical_event 
 word:              visited
Entity:                Color 
 word:            yellowish
Entity:          Coreference 
 word:     heavy chain gene
Entity:                 Date 
 word:                 2012
Entity: Detailed_description 
 word: plate calcifications
Entity: Diagnostic_procedure 
 word: T2-weighted sequence
Entity:     Disease_disorder 
 word:  sebaceous carcinoma
Entity:             Distance 
 word:                 2-cm
Entity:               Dosage 
 word:     high-dose weekly
Entity:             Duration 
 word:         three months
Entity:       Family_history 
 word:    family of potters
Entity:    

In [58]:
ann_df.groupby(['entity']).size().sort_values(ascending=False)

entity
Diagnostic_procedure      4567
Sign_symptom              3359
Biological_structure      2931
Detailed_description      2901
Lab_value                 2858
Disease_disorder          1362
Medication                1076
Therapeutic_procedure     1005
Date                       731
Clinical_event             626
History                    392
Severity                   369
Dosage                     362
Nonbiological_location     354
Coreference                313
Duration                   280
Age                        206
Sex                        191
Administration             175
Distance                   122
Activity                   108
Family_history              81
Frequency                   76
Shape                       65
Time                        57
Personal_background         57
Subject                     54
Color                       52
Texture                     46
Area                        43
Outcome                     42
Qualitative_concept         41
V

In [59]:
# Define the selected entities
selected_entities = [
    'Age', 'Biological_attribute', 'Biological_structure', 'Clinical_event', 'Diagnostic_procedure',
    'Disease_disorder', 'Dosage', 'Family_history', 'Height', 'History', 'Lab_value', 'Mass',
    'Medication', 'Sex', 'Sign_symptom', 'Therapeutic_procedure', 'Weight'
]

In [60]:
filtered_ann_df = ann_df[ann_df['entity'].isin(selected_entities)]
filtered_ann_df = filtered_ann_df.reset_index(drop=True)
filtered_ann_df.head()

Unnamed: 0,id,entity,ranges,word,file_name
0,T1,Age,"[4, 15]",24-year-old,../data/MACCROBAT/19860925.txt
1,T2,Sex,"[28, 32]",male,../data/MACCROBAT/19860925.txt
2,T3,History,"[16, 27]",non-smoking,../data/MACCROBAT/19860925.txt
3,T4,Clinical_event,"[41, 50]",presented,../data/MACCROBAT/19860925.txt
4,T5,Sign_symptom,"[65, 75]",hemoptysis,../data/MACCROBAT/19860925.txt


In [61]:
filtered_ann_df.shape

(19036, 5)

### Checking if there are overlapping annotation ranges in each file

In [156]:
test_df = filtered_ann_df[filtered_ann_df['file_name'] == '../data/MACCROBAT/19860925.txt']
test_df.head()

Unnamed: 0,id,entity,ranges,word,file_name
0,T1,Age,"[4, 15]",24-year-old,../data/MACCROBAT/19860925.txt
1,T2,Sex,"[28, 32]",male,../data/MACCROBAT/19860925.txt
2,T3,History,"[16, 27]",non-smoking,../data/MACCROBAT/19860925.txt
3,T4,Clinical_event,"[41, 50]",presented,../data/MACCROBAT/19860925.txt
4,T5,Sign_symptom,"[65, 75]",hemoptysis,../data/MACCROBAT/19860925.txt


In [342]:
def iterate_ranges(x):
    if len(x) > 2:
        new_x = []
        start = x[0]
        for num in x[1:]:
            end = num.split(';')[0]
            new_x.append([start, end])
            if len(num.split(';')) > 1:
                start = num.split(';')[1]
        return new_x
    else:
        return [x]

overlapping_entities = {}
for file_name in filtered_ann_df.file_name.values:
    test_df = filtered_ann_df[filtered_ann_df['file_name'] == file_name]

    range_to_label = {}

    for index, row in test_df.iterrows():
        for start, end in iterate_ranges(row['ranges']):
            range_to_label[(int(start), int(end))] = row['entity']
        # if isinstance(row['ranges'][0], list):
        #     for start, end in row['ranges']:
        #         range_to_label[(int(start), int(end))] = row['entity']
        # else:
        #     start = int(row['ranges'][0])
        #     end = int(row['ranges'][1])
        #     range_to_label[(start, end)] = row['entity']


    prev_start = -1
    prev_end = -1
    for start, end in dict(sorted(range_to_label.items(), key=lambda x: x[0][0])).keys():
        if start <= prev_end:
            if range_to_label[(prev_start, prev_end)] != range_to_label[(start, end)]:

                if range_to_label[(prev_start, prev_end)] > range_to_label[(start, end)]:
                    entity_1 = range_to_label[(prev_start, prev_end)]
                    entity_1_len = prev_end - prev_start
                    entity_2 =  range_to_label[(start, end)]
                    entity_2_len = end - start
                else:
                    entity_1 =  range_to_label[(start, end)]
                    entity_1_len = end - start
                    entity_2 = range_to_label[(prev_start, prev_end)]
                    entity_2_len = prev_end - prev_start

                if (entity_1, entity_2) not in overlapping_entities:
                    overlapping_entities[(entity_1, entity_2)] = [0, 0, 0]
                overlapping_entities[(entity_1, entity_2)][0] += 1
                if entity_1_len > entity_2_len:
                    overlapping_entities[(entity_1, entity_2)][1] += 1
                elif entity_1_len < entity_2_len:
                    overlapping_entities[(entity_1, entity_2)][2] += 1

        prev_start = start
        prev_end = end

In [343]:
print("Found overlapping entities and displaying overlapping entities in the format")
print("(entity1, entity2): [number of overlapping occurences, number of occurences when entity 1 entirely covers entity 2, number of occurences when entity 2 entirely covers entity 1")
overlapping_entities

Found overlapping entities and displaying overlapping entities in the format
(entity1, entity2): [number of overlapping occurences, number of occurences when entity 1 entirely covers entity 2, number of occurences when entity 2 entirely covers entity 1


{('History', 'Biological_structure'): [2028, 2028, 0],
 ('History', 'Disease_disorder'): [3026, 3026, 0],
 ('Lab_value', 'Diagnostic_procedure'): [240, 137, 103],
 ('Sign_symptom', 'History'): [1904, 0, 1904],
 ('Sign_symptom', 'Family_history'): [201, 0, 201],
 ('Medication', 'History'): [726, 0, 726],
 ('History', 'Diagnostic_procedure'): [304, 304, 0],
 ('Lab_value', 'Family_history'): [180, 0, 180],
 ('Family_history', 'Diagnostic_procedure'): [180, 180, 0],
 ('Sign_symptom', 'Biological_structure'): [417, 417, 0],
 ('Lab_value', 'History'): [899, 0, 899],
 ('History', 'Clinical_event'): [255, 255, 0],
 ('Therapeutic_procedure', 'History'): [1305, 0, 1305],
 ('Disease_disorder', 'Diagnostic_procedure'): [133, 0, 133],
 ('Sign_symptom', 'Disease_disorder'): [260, 132, 128],
 ('Family_history', 'Disease_disorder'): [270, 270, 0]}

In [344]:
overlapping_entity_list = []
for e1, e2 in list(overlapping_entities.keys()):
    overlapping_entity_list.extend([e1, e2])
set(overlapping_entity_list)

{'Biological_structure',
 'Clinical_event',
 'Diagnostic_procedure',
 'Disease_disorder',
 'Family_history',
 'History',
 'Lab_value',
 'Medication',
 'Sign_symptom',
 'Therapeutic_procedure'}

In [345]:
for entity in selected_entities:
    print(f"{entity:>20}: {filtered_ann_df[filtered_ann_df['entity']==entity].shape[0]}")

                 Age: 206
Biological_attribute: 10
Biological_structure: 2931
      Clinical_event: 626
Diagnostic_procedure: 4567
    Disease_disorder: 1362
              Dosage: 362
      Family_history: 81
              Height: 4
             History: 392
           Lab_value: 2858
                Mass: 2
          Medication: 1076
                 Sex: 191
        Sign_symptom: 3359
Therapeutic_procedure: 1005
              Weight: 4


In [None]:
# Define the selected entities
# change if necessary
# selected_entities = [
#     'Age', 'Biological_attribute', 'Biological_structure', 'Clinical_event', 'Diagnostic_procedure',
#     'Disease_disorder', 'Dosage', 'Family_history', 'Height', 'History', 'Lab_value', 'Mass',
#     'Medication', 'Sex', 'Sign_symptom', 'Therapeutic_procedure', 'Weight'
# ]

In [346]:
# selected_entities = [
#     'Age',
#     'Biological_attribute',
#     'Biological_structure',
#     'Clinical_event',
#     'Detailed_description',
#     'Diagnostic_procedure',
#     'Disease_disorder',
#     'Family_history',
#     'Lab_value',
#     'Mass',
#     'Medication',
#     'Nonbiological_location',
#     'Occupation',
#     'Other_entity',
#     'Qualitative_concept',
#     'Quantitative_concept',
#     'Sign_symptom',
#     'Therapeutic_procedure'
# ]


In [541]:
selected_entities = [
    'Age',
    'Biological_attribute',
    'Biological_structure',
    'Clinical_event',
    'Diagnostic_procedure',
    'Disease_disorder',
    'Dosage',
    'Family_history',
    'Height',
    'History',
    'Lab_value',
    'Mass',
    'Medication',
    'Sex',
    'Sign_symptom',
    'Therapeutic_procedure',
    'Weight'
]


In [542]:
filtered_ann_df = ann_df[ann_df['entity'].isin(selected_entities)]
filtered_ann_df = filtered_ann_df.reset_index(drop=True)
filtered_ann_df.head()

Unnamed: 0,id,entity,ranges,word,file_name
0,T1,Age,"[4, 15]",24-year-old,../data/MACCROBAT/19860925.txt
1,T2,Sex,"[28, 32]",male,../data/MACCROBAT/19860925.txt
2,T3,History,"[16, 27]",non-smoking,../data/MACCROBAT/19860925.txt
3,T4,Clinical_event,"[41, 50]",presented,../data/MACCROBAT/19860925.txt
4,T5,Sign_symptom,"[65, 75]",hemoptysis,../data/MACCROBAT/19860925.txt


In [543]:
filtered_ann_df.shape

(19036, 5)

In [544]:
def iterate_ranges(x):
    if len(x) > 2:
        new_x = []
        start = x[0]
        for num in x[1:]:
            end = num.split(';')[0]
            new_x.append([start, end])
            if len(num.split(';')) > 1:
                start = num.split(';')[1]
        return new_x
    else:
        return [x]

overlapping_entities = {}
overall_range_to_label = {}
for file_name in filtered_ann_df.file_name.values:
    test_df = filtered_ann_df[filtered_ann_df['file_name'] == file_name]

    range_to_label = {}
    for index, row in test_df.iterrows():
        for start, end in iterate_ranges(row['ranges']):
            range_to_label[(int(start), int(end))] = row['entity']
            overall_range_to_label[(file_name, int(start), int(end))] = row['entity']

    prev_start = -1
    prev_end = -1
    for start, end in dict(sorted(range_to_label.items(), key=lambda x: x[0][0])).keys():

        if start <= prev_end:
            if overall_range_to_label[(file_name, prev_start, prev_end)] != overall_range_to_label[(file_name, start, end)]:
                entity_1 = overall_range_to_label[(file_name, prev_start, prev_end)]
                entity_1_len = prev_end - prev_start
                entity_2 =  overall_range_to_label[(file_name, start, end)]
                entity_2_len = end - start

                if prev_end - prev_start > end - start:
                    del overall_range_to_label[(file_name, start, end)]
                    continue

                elif prev_end - prev_start < end - start:
                    del overall_range_to_label[(file_name, prev_start, prev_end)]

            elif overall_range_to_label[(file_name, prev_start, prev_end)] == overall_range_to_label[(file_name, start, end)]:

                entity = overall_range_to_label[(file_name, prev_start, prev_end)]
                del overall_range_to_label[(file_name, prev_start, prev_end)]
                del overall_range_to_label[(file_name, start, end)]
                overall_range_to_label[(file_name, prev_start, max(end, prev_end))] =  entity
                continue

        prev_start = start
        prev_end = end

In [545]:
prev_start = -1
prev_end = -1
prev_file = None
for file_name, start, end in dict(sorted(overall_range_to_label.items(), key=lambda x: (x[0], x[1]))).keys():
    if prev_file != file_name:
        prev_file = file_name
        prev_start = -1
        prev_end = -1
        continue

    if start <= prev_end:
        print('Alert - Check')
        print(file_name, prev_start, prev_end, start, end)
        print(overall_range_to_label[(file_name, prev_start, prev_end)], overall_range_to_label[(file_name, start, end)])

    prev_start = start
    prev_end = end


**Fixed Overlapping ranges. Now single entity for each tokens**

In [546]:
print(f"Total number of (start, end) ranges with corresponsing tag extracted is {len(overall_range_to_label)}")

Total number of (start, end) ranges with corresponsing tag extracted is 18760


### Generating BIO format files

In [547]:
import string

In [562]:
def tokenize_text(text):
    # Tokenize the text into a list of words
    tokens = []
    for sentence in re.split(r'\n', text):
        for word in sentence.split():

            # Remove the format [%d]
            word = re.sub(r'\[?\d+\]', '', word)

            word = word.strip()

            # Remove trailing punctuation marks from the word
            while word and word[-1] in string.punctuation:
                word = word[:-1]

            # Remove leading punctutation marks from the word
            while word and word[0] in string.punctuation:
                word = word[1:]
            if word:
                tokens.append(word)

        if tokens[-1] != "<NEWL>":
            tokens.append("<NEWL>")

    return tokens

In [563]:
output_dir = '../data/NEW_BIO_FILES'
for file_id in file_ids:

    txt_file = os.path.join(MACCROBAT_data_dir, file_id+".txt")
    with open(txt_file, 'r') as f:
        text = f.read()


    tokens = tokenize_text(text)
    print(tokens)

['Our', '24-year-old', 'non-smoking', 'male', 'patient', 'presented', 'with', 'repeated', 'hemoptysis', 'in', 'May', '2008', 'with', '4', 'days', 'of', 'concomitant', 'right', 'thoracic', 'pain', 'which', 'intensified', 'while', 'breathing', '<NEWL>', 'During', 'holidays', 'in', 'his', 'home', 'country', 'this', 'Cuban', 'patient', 'suffered', 'from', 'a', 'cold', 'with', 'fever', 'and', 'a', 'strong', 'cough', '<NEWL>', 'The', 'strong', 'dry', 'cough', 'persisted', 'after', 'recovery', 'from', 'the', 'cold', '<NEWL>', 'The', 'patient', 'did', 'not', 'report', 'any', 'loss', 'of', 'weight', '<NEWL>', 'The', 'initial', 'CT', 'scan', 'of', 'the', 'thorax', 'showed', 'a', '12', '×', '4', 'cm', 'solid', 'mass', 'paravertebral', 'right', 'in', 'the', 'lower', 'thorax', 'without', 'any', 'signs', 'of', 'metastases', 'Figure', '1', '<NEWL>', 'The', 'bronchoscopy', 'Figure', '\u200b2', 'with', 'non-bleeding', 'biopsy', 'revealed', 'a', 'mass', 'of', 'the', 'lower', 'right', 'bronchus', 'which'

In [564]:
output_dir = '../data/NEW_BIO_FILES'
for file_id in file_ids:

    txt_file = os.path.join(MACCROBAT_data_dir, file_id+".txt")
    with open(txt_file, 'r') as f:
        text = f.read()


    tokens = tokenize_text(text)

    # Initialize a list to hold the BIO-formatted tags
    bio_tags = ['O'] * len(tokens)

    curr_pos = 0
    for i in range(len(tokens)):

        token_start = text.find(tokens[i], curr_pos)
        token_end = token_start + len(tokens[i])
        curr_pos = token_end

        for file_name, start, end in dict(sorted(overall_range_to_label.items(), key=lambda x: (x[0], x[1]))).keys():
            if file_name == txt_file:
                tag = overall_range_to_label[(file_name, start, end)]
                if start <= token_start and end >= token_end:

                    if token_start == start:
                        bio_tags[i] = f'B-{tag}'
                    else:
                        bio_tags[i] = f'I-{tag}'

    if not os.path.exists(output_dir):
            os.makedirs(output_dir)

    # Write the BIO tags to a new file
    output_file = os.path.join(output_dir, file_id+".bio")
    # Write the BIO tags to a new file
    with open(output_file, 'w', encoding='utf-8') as f:
        for token, tag in zip(tokens, bio_tags):
            if token == "<NEWL>":
                f.write('\n')
            else:
                f.write(token + '\t' + tag + '\n')

    # with open(output_file, 'w', encoding='utf-8') as f:
    #     sentence_start_index = 0
    #     for sentence in text.split('\n'):
    #         sentence_tokens = sentence.split()
    #         sentence_length = len(sentence_tokens)
    #         sentence_end_index = sentence_start_index + sentence_length
    #         for i in range(sentence_start_index, sentence_end_index):
    #             f.write(sentence_tokens[i - sentence_start_index] + '\t' + bio_tags[i] + '\n')
    #         f.write('\n')
    #         sentence_start_index = sentence_end_index

    print(f"{output_file} completed")
print("Conversion completed successfully.")

../data/NEW_BIO_FILES/19860925.bio completed
../data/NEW_BIO_FILES/26361640.bio completed
../data/NEW_BIO_FILES/26228535.bio completed
../data/NEW_BIO_FILES/27773410.bio completed
../data/NEW_BIO_FILES/23678274.bio completed
../data/NEW_BIO_FILES/25853982.bio completed
../data/NEW_BIO_FILES/28103924.bio completed
../data/NEW_BIO_FILES/27064109.bio completed
../data/NEW_BIO_FILES/28154700.bio completed
../data/NEW_BIO_FILES/20146086.bio completed
../data/NEW_BIO_FILES/26656340.bio completed
../data/NEW_BIO_FILES/28353558.bio completed
../data/NEW_BIO_FILES/22515939.bio completed
../data/NEW_BIO_FILES/28353588.bio completed
../data/NEW_BIO_FILES/26309459.bio completed
../data/NEW_BIO_FILES/28272235.bio completed
../data/NEW_BIO_FILES/23242090.bio completed
../data/NEW_BIO_FILES/23312850.bio completed
../data/NEW_BIO_FILES/23124805.bio completed
../data/NEW_BIO_FILES/26106249.bio completed
../data/NEW_BIO_FILES/26313770.bio completed
../data/NEW_BIO_FILES/26285706.bio completed
../data/NE

### Checking how many words with "O" tags

In [565]:
output_dir = '../data/NEW_BIO_FILES'
!grep -r 'O$' '{output_dir}' | wc -l

   51164


### Checking how many words have tags other than "O"

In [566]:
output_dir = '../data/NEW_BIO_FILES'
!grep -rv 'O$' '{output_dir}' | wc -l

   36664


### Checking count for each entity

In [567]:
import subprocess
def print_num_of_words_with_tag(directory, tag):
    output = subprocess.check_output(['sh', '-c', f"grep -r '{tag}$' {directory} | wc -l"])
    count = int(output.decode().strip())
    print(f"Number of lines with {tag} tag is {count}")

output_dir = '../data/NEW_BIO_FILES'
for entity in selected_entities:
    print_num_of_words_with_tag(output_dir, entity)

Number of lines with Age tag is 257
Number of lines with Biological_attribute tag is 13
Number of lines with Biological_structure tag is 5355
Number of lines with Clinical_event tag is 683
Number of lines with Diagnostic_procedure tag is 7916
Number of lines with Disease_disorder tag is 1997
Number of lines with Dosage tag is 963
Number of lines with Family_history tag is 498
Number of lines with Height tag is 8
Number of lines with History tag is 1804
Number of lines with Lab_value tag is 4875
Number of lines with Mass tag is 4
Number of lines with Medication tag is 1342
Number of lines with Sex tag is 191
Number of lines with Sign_symptom tag is 4670
Number of lines with Therapeutic_procedure tag is 1539
Number of lines with Weight tag is 8


In [552]:
output_dir = '../data/BIO_FILES'
for file_id in file_ids:

    txt_file = os.path.join(MACCROBAT_data_dir, file_id+".txt")
    with open(txt_file, 'r') as f:
        text = f.read()


    tokens = tokenize_text(text)

    # Initialize a list to hold the BIO-formatted tags
    bio_tags = ['O'] * len(tokens)

    curr_pos = 0
    for i in range(len(tokens)):

        token_start = text.find(tokens[i], curr_pos)
        token_end = token_start + len(tokens[i])
        curr_pos = token_end

        for file_name, start, end in dict(sorted(overall_range_to_label.items(), key=lambda x: (x[0], x[1]))).keys():
            if file_name == txt_file:
                tag = overall_range_to_label[(file_name, start, end)]
                if start <= token_start and end >= token_end:

                    if token_start == start:
                        bio_tags[i] = f'B-{tag}'
                    else:
                        bio_tags[i] = f'I-{tag}'

    if not os.path.exists(output_dir):
            os.makedirs(output_dir)

    # Write the BIO tags to a new file
    output_file = os.path.join(output_dir, file_id+".bio")
    with open(output_file, 'w', encoding='utf-8') as f:
        sentence_start_index = 0
        for sentence in text.split('\n'):
            sentence_tokens = sentence.split()
            sentence_length = len(sentence_tokens)
            sentence_end_index = sentence_start_index + sentence_length
            for i in range(sentence_start_index, sentence_end_index):
                f.write(sentence_tokens[i - sentence_start_index] + '\t' + bio_tags[i] + '\n')
            f.write('\n')
            sentence_start_index = sentence_end_index

print("Conversion completed successfully.")



../data/BIO_FILES/19860925.bio completed
../data/BIO_FILES/26361640.bio completed


KeyboardInterrupt: 

## Kaggle Data

In [553]:
json_file = '../data/Corona2.json'

In [382]:
import json

In [384]:
with open(json_file, "r") as f:
    data = json.load(f)
data = data['examples']

In [390]:
print(f'Number of documents in dataset is {len(data)}')

Number of documents in dataset is 31


In [392]:
tags = []
for record in data:
    for annotation in record['annotations']:
        curr_tag = annotation['tag_name']
        if curr_tag not in tags:
            tags.append(curr_tag)
tags

['Medicine', 'MedicalCondition', 'Pathogen']

In [386]:
selected_entities

['Age',
 'Biological_attribute',
 'Biological_structure',
 'Clinical_event',
 'Diagnostic_procedure',
 'Disease_disorder',
 'Dosage',
 'Family_history',
 'Height',
 'History',
 'Lab_value',
 'Mass',
 'Medication',
 'Sex',
 'Sign_symptom',
 'Therapeutic_procedure',
 'Weight']

Medicine can be combined with Medication
MedicalCondition can be combined with Disease_disorder
Pathogen can be added as new entity tag

In [394]:
tags_dict = {
    "Medicine": "Medication",
    "MedicalCondition": "Disease_disorder",
    "Pathogen": "Pathogen"
}

In [418]:
for record in data:
    text = record['content']
    for annotation in record['annotations']:
        start = annotation['start']
        end = annotation['end']
        tag = annotation['tag_name']
        word = text[start:end].strip()
        if word:
            print(word, tag)

Diosmectite Medicine
aluminomagnesium silicate Medicine
diarrhea MedicalCondition
kaopectate Medicine
bismuth compounds Medicine
Pepto-Bismol Medicine
diarrhea MedicalCondition
chemotherapy Medicine
constipation MedicalCondition
loperamide Medicine
diarrhea MedicalCondition
flatulence MedicalCondition
loperamide Medicine
diarrhea MedicalCondition
diarrhea MedicalCondition
Racecadotril Medicine
diarrhea MedicalCondition
loss of skin color MedicalCondition
Diarrhea MedicalCondition
watery bowel movements MedicalCondition
dehydration MedicalCondition
dehydration MedicalCondition
diarrhoea MedicalCondition
decrease in responsiveness MedicalCondition
fast heart rate MedicalCondition
Antiretroviral therapy Medicine
ART Medicine
ART Medicine
HIV Pathogen
HIV Pathogen
ART Medicine
HIV Pathogen
[8][5][93] A combined approach with methotrexate and biologics improves ACR50, HAQ scores and RA remission rates.[94] Triple therapy consisting of methotrexate, sulfasalazine and hydroxychloroquine may a

In [407]:
import re

text = "8][5][93] A combined approach with methotrexate and biologics improves ACR50, HAQ scores and RA remission rates.[94] Triple therapy consisting of methotrexate, sulfasalazine and hydroxychloroquine may also effectively control disease activity.[95] Adverse effects should be monitored regularly with toxicity including gastrointestinal, hematologic, pulmonary, and hepatic.[93]"

# Remove the format [%d]
text = re.sub(r'\[?\d+\]', '', text)

print(text)

 A combined approach with methotrexate and biologics improves ACR50, HAQ scores and RA remission rates. Triple therapy consisting of methotrexate, sulfasalazine and hydroxychloroquine may also effectively control disease activity. Adverse effects should be monitored regularly with toxicity including gastrointestinal, hematologic, pulmonary, and hepatic.


### Checking Overlapping

In [571]:
for record in data:
    text = record['content']
    range_to_label = {}
    for annotation in record['annotations']:
        start = annotation['start']
        end = annotation['end']
        tag = annotation['tag_name']
        word = text[start:end].strip()
        if word:
            range_to_label[(int(start), int(end))] = tag
    prev_start = -1
    prev_end = -1
    for start, end in dict(sorted(range_to_label.items(), key=lambda x: x[0][0])).keys():
        if start <= prev_end:
            print(f"current - {text[start:end]} {range_to_label[(start, end)]}")
            print(f"prev - {text[prev_start:prev_end]} {range_to_label[(prev_start, prev_end)]}")
            print('Alert')
            print(start, end, prev_start, prev_end)
        prev_start = start
        prev_end = end

current - diarrhea MedicalCondition
prev - diarrhea  MedicalCondition
Alert
461 469 461 470
current - ulfasalazine, leflunomide MedicalCondition
prev - sulfasalazine, leflunomide Medicine
Alert
81 106 80 106
current - DMARDs Medicine
prev - DMARDs. MedicalCondition
Alert
262 268 262 269
current - 8][5][93] A combined approach with methotrexate and biologics improves ACR50, HAQ scores and RA remission rates.[94] Triple therapy consisting of methotrexate, sulfasalazine and hydroxychloroquine may also effectively control disease activity.[95] Adverse effects should be monitored regularly with toxicity including gastrointestinal, hematologic, pulmonary, and hepatic.[93] MedicalCondition
prev - [8][5][93] A combined approach with methotrexate and biologics improves ACR50, HAQ scores and RA remission rates.[94] Triple therapy consisting of methotrexate, sulfasalazine and hydroxychloroquine may also effectively control disease activity.[95] Adverse effects should be monitored regularly with t

## Fixing Overlapping

In [572]:
overall_range_to_label = {}

for idx, record in enumerate(data):

    text = record['content']

    range_to_label = {}

    for annotation in record['annotations']:

        start = annotation['start']
        end = annotation['end']
        tag = annotation['tag_name']
        word = text[start:end].strip()

        if word:
            range_to_label[(int(start), int(end))] = tag
            overall_range_to_label[(idx, int(start), int(end))] = tag


    prev_start = -1
    prev_end = -1

    for start, end in dict(sorted(range_to_label.items(), key=lambda x: x[0][0])).keys():
        # if start <= prev_end:
        #     print(f"current - {text[start:end]} {overall_range_to_label[(idx, start, end)]}")
        #     print(f"prev - {text[prev_start:prev_end]} {overall_range_to_label[(idx, prev_start, prev_end)]}")
        #     print('Alert')
        #     print(start, end, prev_start, prev_end)
        #

        if start <= prev_end:
            if overall_range_to_label[(idx, prev_start, prev_end)] != overall_range_to_label[(idx, start, end)]:
                entity_1 = overall_range_to_label[(idx, prev_start, prev_end)]
                entity_1_len = prev_end - prev_start
                entity_2 =  overall_range_to_label[(idx, start, end)]
                entity_2_len = end - start

                if prev_end - prev_start > end - start:
                    del overall_range_to_label[(idx, start, end)]
                    continue

                elif prev_end - prev_start < end - start:
                    del overall_range_to_label[(idx, prev_start, prev_end)]

            elif overall_range_to_label[(idx, prev_start, prev_end)] == overall_range_to_label[(idx, start, end)]:

                entity = overall_range_to_label[(idx, prev_start, prev_end)]
                del overall_range_to_label[(idx, prev_start, prev_end)]
                del overall_range_to_label[(idx, start, end)]
                overall_range_to_label[(idx, prev_start, max(end, prev_end))] =  entity
                prev_end = max(end, prev_end)
                continue

        prev_start = start
        prev_end = end

## Verifying there are no overlaps after fix

In [573]:
prev_start = -1
prev_end = -1
prev_file = None
for file_name, start, end in dict(sorted(overall_range_to_label.items(), key=lambda x: (x[0], x[1]))).keys():
    if prev_file != file_name:
        prev_file = file_name
        prev_start = -1
        prev_end = -1
        continue

    if start <= prev_end:
        print('Alert - Check')
        print(file_name, prev_start, prev_end, start, end)
        print(overall_range_to_label[(file_name, prev_start, prev_end)], overall_range_to_label[(file_name, start, end)])

    prev_start = start
    prev_end = end


No overlaps after fix

In [574]:
len(overall_range_to_label)

260

In [575]:
import re
import string

In [576]:
def tokenize_text(text):
    # Tokenize the text into a list of words
    tokens = []
    for sentence in re.split(r'\[\d+\]', text):
        for word in sentence.split():

            # Remove the format [%d]
            word = re.sub(r'\[?\d+\]', '', word)

            word = word.strip()

            # Remove trailing punctuation marks from the word
            while word and word[-1] in string.punctuation:
                word = word[:-1]

            # Remove leading punctutation marks from the word
            while word and word[0] in string.punctuation:
                word = word[1:]

            tokens.append(word)

        if tokens[-1] != "<NEWL>":
            tokens.append("<NEWL>")

    return tokens

In [577]:
for record in data:
    text = record['content']
    for annotation in record['annotations']:
        start = annotation['start']
        end = annotation['end']
        word = text[start:end].strip()
        if word == 'stomach flu':
            print(word, start, end)
    print(tokenize_text(text))

['While', 'bismuth', 'compounds', 'Pepto-Bismol', 'decreased', 'the', 'number', 'of', 'bowel', 'movements', 'in', 'those', 'with', 'travelers', 'diarrhea', 'they', 'do', 'not', 'decrease', 'the', 'length', 'of', 'illness', '<NEWL>', 'Anti-motility', 'agents', 'like', 'loperamide', 'are', 'also', 'effective', 'at', 'reducing', 'the', 'number', 'of', 'stools', 'but', 'not', 'the', 'duration', 'of', 'disease', '<NEWL>', 'These', 'agents', 'should', 'be', 'used', 'only', 'if', 'bloody', 'diarrhea', 'is', 'not', 'present', '<NEWL>', 'Diosmectite', 'a', 'natural', 'aluminomagnesium', 'silicate', 'clay', 'is', 'effective', 'in', 'alleviating', 'symptoms', 'of', 'acute', 'diarrhea', 'in', 'children', '<NEWL>', 'and', 'also', 'has', 'some', 'effects', 'in', 'chronic', 'functional', 'diarrhea', 'radiation-induced', 'diarrhea', 'and', 'chemotherapy-induced', 'diarrhea', '<NEWL>', 'Another', 'absorbent', 'agent', 'used', 'for', 'the', 'treatment', 'of', 'mild', 'diarrhea', 'is', 'kaopectate', 'Rac

In [578]:
output_dir = '../data/NEW_BIO_FILES'
for doc_id, doc in enumerate(data):
    output_file = os.path.join(output_dir, f"kaggle_doc_{doc_id}.bio")

    text = doc['content']

    tokens = tokenize_text(text)

    # Initialize a list to hold the BIO-formatted tags
    bio_tags = ['O'] * len(tokens)

    curr_pos = 0
    for i in range(len(tokens)):

        if tokens[i] == "<NEWL>":
            continue

        token_start = text.find(tokens[i], curr_pos)
        token_end = token_start + len(tokens[i])
        curr_pos = token_end

        for idx, start, end in dict(sorted(overall_range_to_label.items(), key=lambda x: (x[0], x[1]))).keys():
            if doc_id == idx:
                tag = tags_dict[overall_range_to_label[(idx, start, end)]]
                if start <= token_start and end >= token_end:
                    if token_start == start and bio_tags[i-1] != f'B-{tag}':
                        bio_tags[i] = f'B-{tag}'
                    elif bio_tags[i-1] == f'B-{tag}':
                        bio_tags[i] = f'I-{tag}'
                    else:
                        bio_tags[i] = f'B-{tag}'

    if not os.path.exists(output_dir):
            os.makedirs(output_dir)

    # Write the BIO tags to a new file
    with open(output_file, 'w', encoding='utf-8') as f:
        for token, tag in zip(tokens, bio_tags):
            if token == "<NEWL>":
                f.write('\n')
            else:
                f.write(token + '\t' + tag + '\n')
    print(f"{output_file} completed")
print("Conversion completed successfully.")

../data/NEW_BIO_FILES/kaggle_doc_0.bio completed
../data/NEW_BIO_FILES/kaggle_doc_1.bio completed
../data/NEW_BIO_FILES/kaggle_doc_2.bio completed
../data/NEW_BIO_FILES/kaggle_doc_3.bio completed
../data/NEW_BIO_FILES/kaggle_doc_4.bio completed
../data/NEW_BIO_FILES/kaggle_doc_5.bio completed
../data/NEW_BIO_FILES/kaggle_doc_6.bio completed
../data/NEW_BIO_FILES/kaggle_doc_7.bio completed
../data/NEW_BIO_FILES/kaggle_doc_8.bio completed
../data/NEW_BIO_FILES/kaggle_doc_9.bio completed
../data/NEW_BIO_FILES/kaggle_doc_10.bio completed
../data/NEW_BIO_FILES/kaggle_doc_11.bio completed
../data/NEW_BIO_FILES/kaggle_doc_12.bio completed
../data/NEW_BIO_FILES/kaggle_doc_13.bio completed
../data/NEW_BIO_FILES/kaggle_doc_14.bio completed
../data/NEW_BIO_FILES/kaggle_doc_15.bio completed
../data/NEW_BIO_FILES/kaggle_doc_16.bio completed
../data/NEW_BIO_FILES/kaggle_doc_17.bio completed
../data/NEW_BIO_FILES/kaggle_doc_18.bio completed
../data/NEW_BIO_FILES/kaggle_doc_19.bio completed
../data/NE

In [500]:
overall_range_to_label

{(0, 360, 371): 'Medicine',
 (0, 383, 408): 'Medicine',
 (0, 104, 112): 'MedicalCondition',
 (0, 679, 689): 'Medicine',
 (0, 6, 23): 'Medicine',
 (0, 25, 37): 'Medicine',
 (0, 577, 589): 'Medicine',
 (0, 853, 865): 'MedicalCondition',
 (0, 188, 198): 'Medicine',
 (0, 754, 762): 'MedicalCondition',
 (0, 870, 880): 'MedicalCondition',
 (0, 823, 833): 'Medicine',
 (0, 535, 543): 'MedicalCondition',
 (0, 692, 704): 'Medicine',
 (0, 563, 571): 'MedicalCondition',
 (0, 461, 470): 'MedicalCondition',
 (1, 364, 382): 'MedicalCondition',
 (1, 0, 8): 'MedicalCondition',
 (1, 94, 116): 'MedicalCondition',
 (1, 178, 189): 'MedicalCondition',
 (1, 221, 232): 'MedicalCondition',
 (1, 23, 32): 'MedicalCondition',
 (1, 409, 435): 'MedicalCondition',
 (1, 386, 401): 'MedicalCondition',
 (2, 0, 22): 'Medicine',
 (2, 24, 27): 'Medicine',
 (2, 120, 123): 'Medicine',
 (2, 211, 214): 'Pathogen',
 (2, 52, 55): 'Pathogen',
 (2, 234, 237): 'Medicine',
 (2, 148, 151): 'Pathogen',
 (3, 38, 44): 'Medicine',
 (3, 

In [507]:
text.split('[%d]')
text

'Influenza, commonly known as "the flu", is an infectious disease caused by an influenza virus.[1] Symptoms can be mild to severe.[5] The most common symptoms include: high fever, runny nose, sore throat, muscle and joint pain, headache, coughing, and feeling tired.[1] These symptoms typically begin two days after exposure to the virus and most last less than a week.[1] The cough, however, may last for more than two weeks.[1] In children, there may be diarrhea and vomiting, but these are not common in adults.[6] Diarrhea and vomiting occur more commonly in gastroenteritis, which is an unrelated disease and sometimes inaccurately referred to as "stomach flu" or the "24-hour flu".[6] Complications of influenza may include viral pneumonia, secondary bacterial pneumonia, sinus infections, and worsening of previous health problems such as asthma or heart failure.[2][5]'

In [532]:
output_dir = '../data/NEW_BIO_FILES'
for file_id in file_ids:

    txt_file = os.path.join(MACCROBAT_data_dir, file_id+".txt")
    with open(txt_file, 'r') as f:
        text = f.read()


    tokens = tokenize_text(text)

    # Initialize a list to hold the BIO-formatted tags
    bio_tags = ['O'] * len(tokens)

    curr_pos = 0
    for i in range(len(tokens)):

        token_start = text.find(tokens[i], curr_pos)
        token_end = token_start + len(tokens[i])
        curr_pos = token_end

        for file_name, start, end in dict(sorted(overall_range_to_label.items(), key=lambda x: (x[0], x[1]))).keys():
            if file_name == txt_file:
                tag = overall_range_to_label[(file_name, start, end)]
                if start <= token_start and end >= token_end:

                    if token_start == start:
                        bio_tags[i] = f'B-{tag}'
                    else:
                        bio_tags[i] = f'I-{tag}'

    if not os.path.exists(output_dir):
            os.makedirs(output_dir)

    # Write the BIO tags to a new file
    output_file = os.path.join(output_dir, file_id+".bio")
    with open(output_file, 'w', encoding='utf-8') as f:
        sentence_start_index = 0
        for sentence in text.split('\n'):
            sentence_tokens = sentence.split()
            sentence_length = len(sentence_tokens)
            sentence_end_index = sentence_start_index + sentence_length
            for i in range(sentence_start_index, sentence_end_index):
                f.write(sentence_tokens[i - sentence_start_index] + '\t' + bio_tags[i] + '\n')
            f.write('\n')
            sentence_start_index = sentence_end_index
    print(f"{output_file} completed")
print("Conversion completed successfully.")

../data/NEW_BIO_FILES/19860925.bio completed
../data/NEW_BIO_FILES/26361640.bio completed
../data/NEW_BIO_FILES/26228535.bio completed
../data/NEW_BIO_FILES/27773410.bio completed
../data/NEW_BIO_FILES/23678274.bio completed
../data/NEW_BIO_FILES/25853982.bio completed
../data/NEW_BIO_FILES/28103924.bio completed
../data/NEW_BIO_FILES/27064109.bio completed
../data/NEW_BIO_FILES/28154700.bio completed
../data/NEW_BIO_FILES/20146086.bio completed
../data/NEW_BIO_FILES/26656340.bio completed
../data/NEW_BIO_FILES/28353558.bio completed
../data/NEW_BIO_FILES/22515939.bio completed
../data/NEW_BIO_FILES/28353588.bio completed
../data/NEW_BIO_FILES/26309459.bio completed
../data/NEW_BIO_FILES/28272235.bio completed
../data/NEW_BIO_FILES/23242090.bio completed
../data/NEW_BIO_FILES/23312850.bio completed
../data/NEW_BIO_FILES/23124805.bio completed
../data/NEW_BIO_FILES/26106249.bio completed
../data/NEW_BIO_FILES/26313770.bio completed
../data/NEW_BIO_FILES/26285706.bio completed
../data/NE