# Data Loading and Filtering


All datasets follow the BIO labelling scheme explained below:

The BIO scheme is a method used in natural language processing (NLP) for encoding entity annotations in text, particularly for tasks like named entity recognition (NER). In this scheme, each token (word or punctuation mark) in a sentence is tagged with one of three types of labels:

B- (Beginning): This label marks the beginning of an entity. The "B-" is followed by the type of entity. For instance, "B-PER" would be used for the beginning of a person's name.

I- (Inside): This label is used for tokens that are inside an entity but not at the beginning. Like "B-", "I-" is followed by the entity type. For example, in a multi-word entity like a company name, the second and subsequent words would be tagged with "I-ORG" if it's an organization.

O (Outside): This label is used for tokens that are not part of any entity.

Here's an example to illustrate the BIO scheme:

Sentence: "John Smith works at Google."
Tagging:
John: B-PER (Beginning of a person's name)
Smith: I-PER (Inside a person's name)
works: O (Not an entity)
at: O (Not an entity)
Google: B-ORG (Beginning of an organization's name)
.: O (Not an entity)
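
The tagging above can be turned back into entity spans by grouping each "B-" token with the "I-" tokens that follow it. The sketch below is only an illustration of the scheme as described here: `extract_entities` is a hypothetical helper, not part of this notebook, and it assumes typed tags such as "B-PER" and "I-PER".

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the entity in progress.
            current_tokens.append(token)
        else:
            # "O" (or a stray "I-" tag) ends the entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["John", "Smith", "works", "at", "Google", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]
print(extract_entities(tokens, tags))
# [('John Smith', 'PER'), ('Google', 'ORG')]
```
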


In [None]:
import csv
import json
import pandas as pd
import numpy as np
import re
import string
from collections import Counter

In [None]:
ls

bin@                        datalab/  lib@     media/                    proc/        sbin@  tools/
boot/                       dev/      lib32@   mnt/                      python-apt/  srv/   usr/
content/                    etc/      lib64@   NGC-DL-CONTAINER-LICENSE  root/        sys/   var/
cuda-keyring_1.0-1_all.deb  home/     libx32@  opt/                      run/         tmp/


In [None]:
# load the datasets into dataframes

def load_tsv_dataset(file_path):
  """
  Loads a headerless TSV dataset into a dataframe with columns 'token' and 'label'.
  Passing header=None keeps the first row as data; with the default header
  handling, the first row would be consumed as column names and lost on renaming.
  """
  df = pd.read_csv(file_path, delimiter='\t', header=None, names=['token', 'label'])
  print(df.head())
  return df

bc5dr_chem_devel = 'llm_annotations/datasets/BC5CDR-chem/devel.tsv'
bc5dr_chem_devel_df = load_tsv_dataset(bc5dr_chem_devel)
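
Whether the first row of a headerless TSV survives loading depends on the `header` argument of `pd.read_csv`. The sketch below compares the two behaviours on an in-memory sample; the three sample rows are made up for illustration and are not taken from the dataset file.

```python
import io
import pandas as pd

# Made-up, headerless two-column sample (token<TAB>label).
sample = "Calcitriol\tB\nand\tO\nvitamin\tB\n"

# Default header handling: the first row becomes the column names,
# so renaming the columns afterwards discards it.
df_default = pd.read_csv(io.StringIO(sample), delimiter="\t")
df_default.columns = ["token", "label"]

# header=None with explicit names: every row is kept as data.
df_all = pd.read_csv(io.StringIO(sample), delimiter="\t",
                     header=None, names=["token", "label"])

print(len(df_default), len(df_all))
# 2 3
```
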

In [None]:
def get_filtered_entities(df, target_label):
  """
  df (pandas dataframe): has two columns, 'token' and 'label'
  target_label: 'B', 'I', or 'O' (see the BIO description above)

  Filtering removes blank tokens and tokens that consist only of
  punctuation, only of digits, or of a single letter.

  Returns a Counter mapping each remaining token with label 'target_label'
  to its frequency.
  """
  filtered_df = df[df['label'] == target_label]
  target_entities = filtered_df['token'].tolist() # list of all tokens with the target label (duplicates kept)

  # regex for filtering out nonsense strings
  punctuation = re.escape(string.punctuation)
  pattern = re.compile(rf'^(?![a-zA-Z]?$)(?!\d+$)(?!^[{punctuation}]+$).+')
  target_entities = [ent for ent in target_entities if pattern.match(ent)]
  return Counter(target_entities)
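
To see what the filter pattern keeps and rejects, it can be applied to a few hand-picked strings. The sample strings below are chosen for illustration only; the pattern is the one used in `get_filtered_entities`.

```python
import re
import string

punctuation = re.escape(string.punctuation)
# Three negative lookaheads reject: empty or single-letter strings,
# all-digit strings, and all-punctuation strings.
pattern = re.compile(rf'^(?![a-zA-Z]?$)(?!\d+$)(?!^[{punctuation}]+$).+')

samples = ["calcium", "NO", "E2", "a", "42", "--", "", "5-HT"]
kept = [s for s in samples if pattern.match(s)]
print(kept)
# ['calcium', 'NO', 'E2', '5-HT']
```

Mixed strings such as "E2" or "5-HT" pass because none of the three lookaheads matches the whole string.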



In [None]:
all_b_entities = get_filtered_entities(bc5dr_chem_devel_df, 'B')
print(f'Processed B Entities size({len(all_b_entities)})\n')
for b_ent, freq in all_b_entities.items():
  print(f'{b_ent} ({freq})')
print('-----------------------')

all_i_entities = get_filtered_entities(bc5dr_chem_devel_df, 'I')
print(f'Processed I Entities size({len(all_i_entities)})\n')
for i_ent, freq in all_i_entities.items():
  print(f'{i_ent} ({freq})')

Processed B Entities size(1019)

Calcitriol (1)
vitamin (15)
OCT (10)
phosphate (3)
calcium (45)
methylprednisolone (14)
IVMP (1)
levodopa (30)
apomorphine (18)
Puromycin (2)
PAN (12)
lignocaine (8)
sirolimus (8)
rapamycin (8)
Sirolimus (2)
dexamethasone (34)
lithium (51)
magnesium (8)
Magnesium (1)
cefotetan (6)
cephalosporins (3)
ketoprofen (6)
acetaminophen (15)
adenosine (17)
Ketoprofen (1)
Nitric (1)
lead (10)
oxygen (19)
NO (18)
nitrotyrosine (1)
malondialdehyde (7)
MDA (4)
Vitamin (7)
Glyceryl (2)
nitric (10)
glyceryl (8)
GTN (9)
diltiazem (12)
angiotensin (10)
enalapril (10)
creatinine (27)
diuretic (2)
Diuretic (2)
Enalapril (1)
olanzapine (4)
Acetazolamide (1)
acetazolamide (2)
Vasopressin (1)
milrinone (4)
vasopressin (2)
tacrolimus (20)
cyclosporine (10)
corticosteroids (8)
chloroquine (7)
hydroxychloroquine (1)
Cyclophosphamide (4)
CP (12)
acrolein (4)
morphine (56)
naloxone (14)
Morphine (4)
Prednisolone (1)
acetylcholine (10)
prednisolone (4)
Apomorphine (2)
dopamine (34

In [None]:
# tokens that occur with both the B and the I label; Counter & keeps the smaller of the two counts
print('Processed B and I Entities\n')
for bi_ent, freq in (all_b_entities & all_i_entities).items():
  print(f'{bi_ent} ({freq})')


Processed B and I Entities

phosphate (2)
calcium (1)
adenosine (3)
cocaine (1)
amphetamine (2)
sodium (5)
VPA (2)
ketamine (2)
potassium (1)
epinephrine (2)
interferon (3)
heparin (5)
warfarin (1)
2R (1)
oral (1)
valproate (2)
citrate (1)
serotonin (1)
glutathione (1)
PG (1)
aspartate (4)
alpha (10)
tyrosine (1)
ADP (1)
fluorouracil (3)
dextran (3)
urea (2)
para (1)
amino (1)
trans (1)
cyclic (1)
chloride (3)
RA (9)
phenylethylbarbiturate (1)
penicillamine (4)
aminonucleoside (2)
disodium (1)
pyrrolidinone (1)
estradiol (4)
cis (2)
benserazide (1)
Penicillamine (1)
carbon (1)
E2 (4)
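
The frequencies printed for the B-and-I overlap come from `Counter` intersection, which keeps only keys present in both counters and assigns each the minimum of its two counts. A small sketch with toy counts (the numbers below are illustrative, not read from the dataset):

```python
from collections import Counter

# Toy counters standing in for all_b_entities and all_i_entities.
b_counts = Counter({"calcium": 45, "phosphate": 3, "lithium": 51})
i_counts = Counter({"calcium": 1, "phosphate": 2, "chloride": 3})

# Intersection: shared keys only, each with min(count in B, count in I).
both = b_counts & i_counts
print(sorted(both.items()))
# [('calcium', 1), ('phosphate', 2)]
```

A low count in the overlap therefore means the token is rare under at least one of the two labels, not that it is rare overall.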
