##### **Classifying ADR terms using named entity recognition for tyrosine kinase inhibitors**

*Draft mode*

Notebook plan:

- Start with a very small dataset on tyrosine kinase inhibitors and use named entity recognition (token classification) to map ADRs to the body systems they affect, e.g. liver, kidney, cardiovascular, respiratory and so on (this may later be useful for recognising on-/off-target ADRs or for building a knowledge graph).

- The aim is to classify each ADR term into a specific body system (the NLP model concept may make more sense once I have tried this), with the goal of showing the most commonly affected body systems for tyrosine kinase inhibitors (chosen because they are also structurally similar).

- This may involve the full loop: defining and fine-tuning the model, then setting up the metrics and training loop, so that the model can be used on a test dataset at the end.

- Test data will be other tyrosine kinase inhibitors that are not in this dataset, e.g. dasatinib and afatinib, taken from e.g. https://nzf.org.nz/nzf_4653?searchterm=tyrosine%20protein%20and%20tyrosine%20kinase%20inhibitors
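
The end goal in the plan (showing the most commonly affected body systems) could be reduced to a simple tally once every ADR term carries a body-system label. A minimal sketch, assuming the labelled ADRs are available as `(term, body_system)` pairs; the function name `count_body_systems` and the sample pairs are hypothetical and only illustrate the idea:

```python
from collections import Counter

def count_body_systems(labelled_adrs):
    """Tally how many ADR terms fall under each body system."""
    return Counter(body_system for _term, body_system in labelled_adrs)

# hypothetical sample of (ADR term, body system) pairs
sample = [
    ("rash", "dermatology"),
    ("diarrhea", "gastrointestinal"),
    ("abdominal_pain", "gastrointestinal"),
    ("hemorrhage", "haematologic"),
]

counts = count_body_systems(sample)
print(counts.most_common())  # body systems sorted from most to least frequent
```

The same tally could be run per drug or across the whole drug class, depending on how the labelled pairs are grouped.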

In [1]:
import pandas as pd
df = pd.read_csv("Data/cyp_substrates_adrs.csv", delimiter=",")
df

Unnamed: 0,generic_drug_name,notes,cyp_type_and_cyp_strength_of_evidence,drug_class,adverse_drug_reactions,first_ref,second_ref,date_checked
0,bosentan,,s_3A4,agents_for_pulmonary_hypertension,"abnormal_LFT^^, headache^^, RTI^^, decreased_h...",drugs.com,nzf,270225
1,carbamazepine,,s_3A4,antiepileptics,"constipation^^, leukopenia^^, dizziness^^, dro...",drugs.com,nzf,211024
2,eliglustat,,s_3A4_s_2D6,metabolic_agents,"diarrhea^^, oropharyngeal_pain^^, arthralgia^^...",drugs.com,emc,151124
3,flibanserin,,s_3A4,CNS_agents,"dizziness^^, drowsiness^^, fatigue^, vertigo^,...",drugs.com,Drugs@FDA,161124
4,imatinib,,s_3A4,tyrosine_kinase_inhibitor,"rash^^, diarrhea^^, abdominal_pain^^, constipa...",drugs.com,nzf,181124
...,...,...,...,...,...,...,...,...
79,enzalutamide,*,m_2C8,anti_androgens,"headache^^, dizziness^^, flushing^^, periphera...",drugs.com,nzf,10425
80,repaglinide,,s_2C8,meglinitides,"hypoglycemia^^, headache^^, upper_RTI^^, chest...",drugs.com,Drugs@FDA,10425
81,rosiglitazone,*,m_2C8,thiazolidinediones,"edema^, hypertension^, anemia^, hepatitis(pm),...",drugs.com,Drugs@FDA,20425
82,selexipag,,m_2C8,agents_for_pulmonary_hypertension,"diarrhea^^, flushing^^, sinus_tachycardia^^, r...",drugs.com,nzf,30425


In [2]:
df_tk = df.query("drug_class == 'tyrosine_kinase_inhibitor'")
df_tk

Unnamed: 0,generic_drug_name,notes,cyp_type_and_cyp_strength_of_evidence,drug_class,adverse_drug_reactions,first_ref,second_ref,date_checked
4,imatinib,,s_3A4,tyrosine_kinase_inhibitor,"rash^^, diarrhea^^, abdominal_pain^^, constipa...",drugs.com,nzf,181124
5,ibrutinib,,s_3A4,tyrosine_kinase_inhibitor,"hypertension^^, AF^^, sinus_tachycardia^^, ra...",drugs.com,nzf,191124
6,neratinib,,s_3A4,tyrosine_kinase_inhibitor,"diarrhea^^, abdominal_pain^^, stomatitis^^, dy...",drugs.com,nzf,201124
58,ruxolitinib,,m_2C9,tyrosine_kinase_inhibitor,"UTI^^, anemia^^, thrombocytopenia^^, neutropen...",drugs.com,nzf,120325
83,tucatinib,*,m_2C8,tyrosine_kinase_inhibitor,"palmar_plantar_erythrodysesthesia_syndrome^^, ...",drugs.com,Drugs@FDA,30425


In [3]:
# Consider using BERT-like models e.g. PubMedBERT, BioBERT for NER in ADRs 
# (core idea is to pre-train and fine-tune transformer models)

In [4]:
tk_adrs = list(df_tk["adverse_drug_reactions"])
tk_adrs

['rash^^, diarrhea^^, abdominal_pain^^, constipation^^, dyspepsia^^, hemorrhage^^, neutropenia^^, thrombocytopenia^^, anemia^^, flu_syndrome^^, weight_gain^^, muscle_spasm/cramps^^, musculoskeletal_pain^^, joint_pain^^, myalgia^^, bone_pain^^, headache^^, dizziness^^, periorbital_edema^^, edema^^, fatigue^^, fever^^, insomnia^^, depression^^, nasopharyngitis^^, cough^^, upper_respiratory_tract_infection^^, pharyngolaryngeal_pain^^, sinusitis^^, flushing^, pruritus^, face_edema^, dry skin^, erythema^, alopecia^, night_sweats^, photosensitivity_reaction^, abdominal_distension^, GERD^, dry_mouth^, gastritis^, pancytopenia^, febrile_neutropenia^, increased_LFT^, anorexia^, weight_loss^, joint_swelling^, paresthesia^, hypoesthesia^, eyelid_edema^, lacrimation_increased^, conjunctival_hemorrhage^, conjunctivitis^, dry_eye^, blurred_vision^, taste_disturbance^, weakness^, generalized_edema^, rigors^, dyspnea^, epistaxis^',
 'hypertension^^, AF^^, sinus_tachycardia^^,  rash^^, skin_infection^^

In [5]:
## annotate text data for NER
# the function below is adapted from a class that builds an NER dataset
# from https://gist.github.com/jangedoo/7ac6fdc7deadc87fd1a1124c9d4ccce9 

# create tuples of (entity value, entity name) to prepare the data
import re

def assign_tokens_with_entities(texts: str):

    # split the texts by spaces only if the space does not occur between square brackets
    # not splitting "multi-word" entity value yet
    raw_tokens = re.split(r"\s(?![^\[]*\])", texts)
    
    # a regex for matching the annotation according to our notation [entity_value](entity_name)
    entity_value_pattern = r"\[(?P<value>.+?)\]\((?P<entity>.+?)\)"
    entity_value_pattern_compiled = re.compile(entity_value_pattern, flags = re.I | re.M)
    tokens_with_entities = []
    
    for raw_token in raw_tokens:
        match = entity_value_pattern_compiled.match(raw_token)
        if match:
            raw_entity_name, raw_entity_value = match.group("entity"), match.group("value")

            # prefix the name of entity 
            # B- indicates beginning of an entity
            # I- indicates the token is not a new entity itself but rather a part of existing one
            # splits strings by commas (also works for semi-colons or spaces)
            for i, raw_entity_token in enumerate(re.split(r"[,\s;]+", raw_entity_value)):
                entity_prefix = "B" if i == 0 else "I"
                entity_name = f"{entity_prefix}-{raw_entity_name}"
                tokens_with_entities.append((raw_entity_token, entity_name))
        else:
            tokens_with_entities.append((raw_token, "O"))
        
    return tokens_with_entities

In [6]:
## ADR texts for the rest of the tk inhibitors, in this order: ibrutinib, neratinib, ruxolitinib and tucatinib
# note: imatinib's ADRs are used as training texts for now; the size of the training texts may need to increase later

# 'hypertension^^, AF^^, sinus_tachycardia^^,  rash^^, skin_infection^^, pruritus^^, diarrhea^^, stomatitis^^, abdominal_pain^^, constipation^^, dyspepsia^^, GERD^^, UTI^^, decreased_platelets^^, neutropenia^^, decreased_neutrophils^^, decreased_hemoglobin^^, thrombocytopenia^^, hemorrhage^^, anemia^^, lymphocytosis^^, increased_bilirubin^^, increased_AST^^, increased_ALT^^, hepatic_failure(pm), hepatic_cirrhosis(pm), hepatitis_B_reactivation(pm), hepatotoxicity(pm), hypogammaglobulinemia^^, hyperuricemia^^, decreased_appetite^^, hypokalemia^^, musculoskeletal_pain^^, arthralgia^^, muscle_spasms^^, osteonecrosis^^, headache^^, dizziness^^, peripheral_neuropathy(pm)^^, dry_eye^^, blurred_vision^^, increased_lacrimation^^, reduced_visual_acuity^^, conjunctivitis^^, eye_hemorrhage(pm), second_malignancies^^, fatigue^^, fever^^, peripheral_edema^^, pain^^, infections^^, fall^^, sepsis^^, weight_loss^^, insomnia^^, increased_blood_creatinine^^, upper_RTI^^, cough^^, pneumonia^^, dyspnea^^, sinusitis^^, oropharyngeal_pain^^, bronchitis^^, nasopharyngitis^^, flu_syndrome^^, viral_upper_RTI^^, atrial_flutter^, cardiac_failure(pm)^, ventricular_tachyarrhythmias(pm)^, deaths_due_to_cardiac_causes^, erythema(pm)^, urticaria(pm)^, angioedema(pm), Stevens_Johnson_syndrome(pm), onychoclasis(pm), panniculitis(pm), neutrophilic_dermatoses(pm), major_hemorrhage^, febrile_neutropenia^, leukocytosis^,  anaphylactic_shock(pm), tumor_lysis_syndrome(pm)^, hyponatremia^, ischemic_cerebrovascular_events(pm)^, non_melanoma_skin_cancer^, non_skin_carcinomas^, basal_and_squamous_cell_carcinomas^, epistaxis^, interstitial_lung_disease(pm)', /
# 'diarrhea^^, abdominal_pain^^, stomatitis^^, dyspepsia^^, fatigue^^, rash^^, decreased_appetite^^, muscle_spasms^^, abdominal_distention^, dry_mouth^, nail_disorder^, skin_fissures^, weight_loss^, dehydration^, UTI^, increased_ALT^, increased_AST^, epistaxis^', /
# 'UTI^^, anemia^^, thrombocytopenia^^, neutropenia^^, hypercholesterolemia^^, weight_gain^^, dizziness^^, headache^^, bruising^^, bleeding^^, increased_ALT^^, increased_AST^^, GI_bleeding^, tuberculosis^, intracranial_bleeding^, herpes_zoster^', /
# 'palmar_plantar_erythrodysesthesia_syndrome^^, rash^^, diarrhea^^, stomatitis^^, decreased_hemoglobin^^, decreased_phosphate^^, anemia^^, increased_bilirubin^^, increased_ALT^^, increased_AST^^, hepatotoxicity^^, anorexia^^, weight_loss^^, decreased_magnesium^^, decreased_potassium^^, increased_ALP^^, arthralgia^^, increased_creatinine^^, epistaxis^^'

In [7]:
# trial run: assigning tokens with entities using only a few ADR terms
assign_tokens_with_entities('[rash](dermatology)^^, [diarrhea](gastrointestinal)^^, [abdominal_pain](gastrointestinal)^^, '
'[constipation](gastrointestinal)^^, [dyspepsia](gastrointestinal)^^, [hemorrhage](haematologic)^^')

# Note: the frequency-of-occurrence markers are not taken into account yet (so they are dropped from the output for now)

[('rash', 'B-dermatology'),
 ('diarrhea', 'B-gastrointestinal'),
 ('abdominal_pain', 'B-gastrointestinal'),
 ('constipation', 'B-gastrointestinal'),
 ('dyspepsia', 'B-gastrointestinal'),
 ('hemorrhage', 'B-haematologic')]
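
The ADR strings carry trailing markers such as `^^`, `^` and `(pm)` that are dropped in the trial above. If they are needed later, one option is to split each term from its marker before annotating. A minimal sketch, assuming terms are comma-separated and that a marker is an optional `(pm)` followed by zero or more `^` characters; the helper name `split_adr_terms` is hypothetical, and the markers are kept as raw strings because their meaning is not defined in this notebook:

```python
import re

# term, then an optional "(pm)" and any run of "^" at the end of the item
TERM_WITH_MARKER = re.compile(r"^(?P<term>.*?)(?P<marker>(?:\(pm\))?\^*)$")

def split_adr_terms(adr_text: str) -> list:
    """Split a comma-separated ADR string into (term, marker) pairs."""
    pairs = []
    for raw in adr_text.split(","):
        raw = raw.strip()
        if not raw:
            continue  # skip empty pieces from stray commas
        match = TERM_WITH_MARKER.match(raw)
        pairs.append((match.group("term"), match.group("marker")))
    return pairs

pairs = split_adr_terms("rash^^, flushing^, hepatic_failure(pm), cardiac_failure(pm)^")
print(pairs)
```

Keeping the marker next to each term means the frequency information can be attached to the predicted body system later instead of being lost at annotation time.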

In [8]:
# an alternative is the tokenize-and-align-labels function used with HF transformers
# e.g. https://huggingface.co/docs/transformers/tasks/token_classification#preprocess
# i.e. using the AutoTokenizer as shown in notebook 5-2
# the downside is that this may run into sub-word tokenization:
# the tokenizers are pre-trained, so their vocabularies may not match the specific type of text used here
# and some words may be split into sub-words, which is not wanted
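
The align-labels idea mentioned here can be shown without loading a tokenizer. A minimal sketch, assuming a list of word indices like the one a fast Hugging Face tokenizer returns from `word_ids()` (`None` for special tokens, a repeated index for sub-word pieces of the same word); the function name and the toy inputs are hypothetical:

```python
def align_labels_with_word_ids(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-word token only.

    Special tokens and the remaining sub-word pieces get `ignore_index`,
    so they can be skipped when the loss is computed.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # special token, e.g. [CLS] / [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a new word
        else:
            aligned.append(ignore_index)          # later piece of the same word
        previous = word_id
    return aligned

# toy example: word 0 is split into three pieces, word 1 stays whole
toy_word_ids = [None, 0, 0, 0, 1, None]
toy_word_labels = [3, 1]
aligned = align_labels_with_word_ids(toy_word_ids, toy_word_labels)
print(aligned)  # [-100, 3, -100, -100, 1, -100]
```

With this scheme a word that gets split into several pieces still contributes one label, which is one way to live with sub-word tokenization rather than avoid it.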

In [9]:
## Commonly affected body systems in ADR - labels/tags
# nervous system
# psychiatric
 
# ocular
# cardiovascular
# respiratory
# gastrointestinal
# musculoskeletal
# haematologic
# hepatic
# renal
# genitourinary

# metabolic
# endocrine

# dermatologic
# immunologic
# hypersensitivity

# oncologic (carcinogenicity)

# local
# other
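
The body-system list above can be turned into BIO tag mappings for a token classification model. A minimal sketch, assuming an `O` tag at index 0 followed by all `B-` tags and then all `I-` tags; this ordering and the helper name `build_bio_label_maps` are assumptions, and the ids may differ from the ones `NERDataMaker` generates:

```python
BODY_SYSTEMS = [
    "nervous system", "psychiatric", "ocular", "cardiovascular", "respiratory",
    "gastrointestinal", "musculoskeletal", "haematologic", "hepatic", "renal",
    "genitourinary", "metabolic", "endocrine", "dermatologic", "immunologic",
    "hypersensitivity", "oncologic", "local", "other",
]

def build_bio_label_maps(entity_names):
    """Return (label2id, id2label) for O plus B-/I- tags of every entity name."""
    labels = ["O"]
    labels += [f"B-{name}" for name in entity_names]
    labels += [f"I-{name}" for name in entity_names]
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label

label2id, id2label = build_bio_label_maps(BODY_SYSTEMS)
print(len(label2id), label2id["B-hepatic"], id2label[0])
```

Fixing the label set up front, rather than deriving it from whichever labels appear in the training text, keeps the ids stable when more drugs are annotated later.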

In [10]:
# Preparing imatinib's ADR terms as raw training texts 
# (labelled manually for now; a quicker way to add labels will likely be needed for longer/larger texts)

raw_texts = "[rash^^,](dermatology) [diarrhea^^,](gastrointestinal) [abdominal_pain^^,](gastrointestinal) \
[constipation^^,](gastrointestinal) [dyspepsia](gastrointestinal)^^, [hemorrhage](haematologic)^^, \
[neutropenia^^,](haematologic) [thrombocytopenia^^,](haematologic) [anemia^^,](haematologic) [flu_syndrome^^,](respiratory) \
[weight_gain^^,](metabolic) [muscle_spasm/cramps^^,](musculoskeletal) [musculoskeletal_pain^^,](musculoskeletal)"

In [None]:
## Rest of imatinib's ADR -> **need to change format to be like above**

# [joint_pain](musculoskeletal)^^, [myalgia](musculoskeletal)^^, [bone_pain](musculoskeletal)^^, [headache](nervous system)^^, 
# [dizziness](nervous system)^^, [periorbital_edema](nervous system)^^, [edema](other)^^, [fatigue](other)^^, 
# [fever](other)^^, [insomnia](psychiatric)^^, [depression](psychiatric)^^, [nasopharyngitis](respiratory)^^, [cough](respiratory)^^, 
# [upper_respiratory_tract_infection](respiratory)^^, [pharyngolaryngeal_pain](respiratory)^^, [sinusitis](respiratory)^^, 
# [flushing](cardiovascular)^, [pruritus](dermatologic)^, [face_edema](hypersensitivity)^, [dry_skin](dermatologic)^, 
# [erythema](dermatologic)^, [alopecia](dermatologic)^, [night_sweats](dermatologic)^, [photosensitivity_reaction](dermatologic)^, 
# [abdominal_distension](gastrointestinal)^, [GERD](gastrointestinal)^, [dry_mouth](gastrointestinal)^, 
# [gastritis](gastrointestinal)^, [pancytopenia](haematologic)^, [febrile_neutropenia](haematologic)^, [increased_LFT](hepatic)^, 
# [anorexia](metabolic)^, [weight_loss](metabolic)^, [joint_swelling](musculoskeletal)^, [paresthesia](nervous system)^, 
# [hypoesthesia](nervous system)^, [eyelid_edema](ocular)^, [lacrimation_increased](ocular)^, [conjunctival_hemorrhage](ocular)^, 
# [conjunctivitis](ocular)^, [dry_eye](ocular)^, [blurred_vision](ocular)^, [taste_disturbance](other)^, [weakness](other)^, 
# [generalized_edema](other)^, [rigors](other)^, [dyspnea](respiratory)^, [epistaxis](respiratory)^
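
The format change needed for these remaining terms could be done with a regex instead of by hand. A minimal sketch, assuming the old style is `[term](label)` followed by `^` markers and an optional comma, and the target style moves those trailing characters inside the square brackets; the function name `move_markers_inside` is hypothetical:

```python
import re

# [value](entity) followed by any "^" markers and an optional comma
ANNOTATION = re.compile(
    r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)(?P<marker>\^*,?)"
)

def move_markers_inside(text: str) -> str:
    """Rewrite '[term](label)^^,' as '[term^^,](label)'."""
    return ANNOTATION.sub(r"[\g<value>\g<marker>](\g<entity>)", text)

old_style = "[myalgia](musculoskeletal)^^, [flushing](cardiovascular)^, [epistaxis](respiratory)^"
new_style = move_markers_inside(old_style)
print(new_style)
# [myalgia^^,](musculoskeletal) [flushing^,](cardiovascular) [epistaxis^](respiratory)
```

Labels that contain a space, such as `nervous system`, would still be split by the whitespace regex in `assign_tokens_with_entities`, so they may need an underscore before use.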

In [None]:
# ## Trying out class code from https://gist.github.com/jangedoo/7ac6fdc7deadc87fd1a1124c9d4ccce9 
# class code exported as a separate python script 
# *needed debugging* -> error msg - ArrowTypeError: Expected bytes, got a 'list' object
# there turned out to be nothing wrong with the class code; the issue arose from the raw texts input format, 
# so the format was changed as shown above

In [13]:
raw_texts

'[rash^^,](dermatology) [diarrhea^^,](gastrointestinal) [abdominal_pain^^,](gastrointestinal) [constipation^^,](gastrointestinal) [dyspepsia](gastrointestinal)^^, [hemorrhage](haematologic)^^, [neutropenia^^,](haematologic) [thrombocytopenia^^,](haematologic) [anemia^^,](haematologic) [flu_syndrome^^,](respiratory) [weight_gain^^,](metabolic) [muscle_spasm/cramps^^,](musculoskeletal) [musculoskeletal_pain^^,](musculoskeletal)'

In [14]:
from NERDataMaker import NERDataMaker
dm = NERDataMaker(raw_texts.split("\n"))
print(f"total examples = {len(dm)}")
print(dm[0:3])

total examples = 1
[{'id': 0, 'ner_tags': [0, 6, 1, 7, 1, 7, 1, 7, 1, 2, 2, 8, 2, 8, 2, 8, 5, 11, 3, 9, 4, 10, 4, 10], 'tokens': ['rash^^', '', 'diarrhea^^', '', 'abdominal_pain^^', '', 'constipation^^', '', 'dyspepsia', 'hemorrhage', 'neutropenia^^', '', 'thrombocytopenia^^', '', 'anemia^^', '', 'flu_syndrome^^', '', 'weight_gain^^', '', 'muscle_spasm/cramps^^', '', 'musculoskeletal_pain^^', '']}]
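
The output above contains empty-string tokens, which appear after each term whose trailing comma sits inside the square brackets. One possible clean-up is to drop them together with their tags before tokenization. A minimal sketch, assuming each example is a dict with parallel `tokens` and `ner_tags` lists as printed above; the helper name `drop_empty_tokens` is hypothetical and the sample is the first four tokens of that output:

```python
def drop_empty_tokens(example: dict) -> dict:
    """Remove empty tokens and their tags, keeping the two lists aligned."""
    kept = [
        (token, tag)
        for token, tag in zip(example["tokens"], example["ner_tags"])
        if token.strip()
    ]
    cleaned = dict(example)
    cleaned["tokens"] = [token for token, _ in kept]
    cleaned["ner_tags"] = [tag for _, tag in kept]
    return cleaned

sample_example = {
    "id": 0,
    "tokens": ["rash^^", "", "diarrhea^^", ""],
    "ner_tags": [0, 6, 1, 7],
}
cleaned_example = drop_empty_tokens(sample_example)
print(cleaned_example)
```

Whether the empty tokens actually hurt training has not been tested here; removing them simply keeps the `I-` tags from being attached to nothing.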


In [15]:
type(dm)

NERDataMaker.NERDataMaker

In [16]:
#dm[0:5]

In [17]:
from transformers import AutoTokenizer, DataCollatorForTokenClassification, AutoModelForTokenClassification, TrainingArguments, Trainer

In [18]:
# Example usage:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# dm = NERDataMaker(["I come from [Kathmanduu valley,](location) [Nepal](location)"])  
# dm.as_hf_dataset(tokenizer=tokenizer)

In [19]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dm.as_hf_dataset(tokenizer=tokenizer)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'ner_tags', 'tokens', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 1
})

In [20]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", 
    num_labels = len(dm.unique_entities), 
    id2label = dm.id2label, 
    label2id = dm.label2id
    )

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=40,
    weight_decay=0.01,
)

train_ds = dm.as_hf_dataset(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=train_ds, # TODO: *need to change to test_ds (trialling using train dataset in the meantime only)*
    processing_class=tokenizer,
    #tokenizer=tokenizer, # deprecated
    data_collator = data_collator,
)

trainer.train()

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,No log,2.141899
2,No log,1.869016
3,No log,1.598116
4,No log,1.325723
5,No log,1.100859
6,No log,0.941418
7,No log,0.809085
8,No log,0.655387
9,No log,0.533349
10,No log,0.444007


TrainOutput(global_step=40, training_loss=0.5012574195861816, metrics={'train_runtime': 100.8872, 'train_samples_per_second': 0.396, 'train_steps_per_second': 0.396, 'total_flos': 1633252377600.0, 'train_loss': 0.5012574195861816, 'epoch': 40.0})
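
The run above reports loss only, while the plan also mentions setting up metrics. A minimal sketch of a token-level accuracy, assuming per-token score lists and integer labels where `-100` marks positions to ignore (the padding label used for token classification in Hugging Face); the function name `token_accuracy` and the toy numbers are hypothetical:

```python
def token_accuracy(logits, labels, ignore_index=-100):
    """Share of non-ignored tokens whose highest-scoring class equals the label."""
    correct = 0
    total = 0
    for seq_logits, seq_labels in zip(logits, labels):
        for scores, label in zip(seq_logits, seq_labels):
            if label == ignore_index:
                continue  # padding, special tokens or masked sub-word pieces
            predicted = max(range(len(scores)), key=lambda i: scores[i])
            correct += int(predicted == label)
            total += 1
    return {"accuracy": correct / total if total else 0.0}

# toy batch: one sequence, three tokens, two classes
toy_logits = [[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]]
toy_labels = [[1, -100, 0]]
result = token_accuracy(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.5}
```

To use something like this with `Trainer`, it would need to be wrapped in a `compute_metrics` function that unpacks the predictions and label ids from the evaluation output. Entity-level precision and recall would likely be more informative than plain accuracy once a separate test set exists.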

In [21]:
# test texts from dasatinib's ADRs from nzf *note: in NZ/UK spelling*
#nausea, vomiting, diarrhoea, anorexia, weight changes, abdominal pain, taste disturbance, constipation, dyspepsia, colitis, gastritis, arrhythmias, pericardial effusion, congestive heart failure, hypertension, chest pain, flushing, haemorrhage (including gastro-intestinal and CNS haemorrhage), palpitation, dyspnoea, pulmonary hypertension, cough, oedema (more common in patients over 65 years old), pleural effusion, depression, dizziness, headache, insomnia, somnolence, fatigue, neuropathy, influenza-like symptoms, myelosuppression (including anaemia, neutropenia, thrombocytopenia), febrile neutropenia, musculoskeletal pain, muscle spasm, muscle weakness, arthralgia, visual disturbances, dry eye, tinnitus, acne, dry skin, dermatitis, urticaria, alopecia, hyperhidrosis; less commonly pancreatitis (rarely acute pancreatitis), hepatitis, cholestasis, cholecystitis, oesophagitis, hypotension, thrombophlebitis, syncope, QT-interval prolongation, myocardial infarction, pericarditis, angina pectoris, cardiomegaly, bronchospasm, amnesia, anxiety, confusion, tremor, gynaecomastia, irregular menstruation, thyroid dysfunction, renal failure, urinary frequency, proteinuria, rhabdomyolysis, tendonitis, osteonecrosis, diabetes mellitus, hypersensitivity reactions (including erythema nodosum, pruritus, and Stevens-Johnson syndrome), Drug-induced photosensitivity reactions, pigmentation disorders, nail disorders, skin ulcer, bullous conditions, photophobia, vertigo, hearing loss; rarely cor pulmonale, deep vein thrombosis, embolism, myocarditis, coronary artery disease, cardiac arrest, pleuropericarditis, pure red cell aplasia, cerebrovascular accident, transient ischemic attack, acute respiratory distress syndrome, pulmonary embolism, convulsion, optic neuritis, VIIth nerve paralysis, dementia, ataxia, leukocytoclastic vasculitis, skin fibrosis; also reported atrial fibrillation, atrial flutter, thrombotic microangiopathy, interstitial lung disease, chylothorax, pulmonary 
# arterial hypertension, nephrotic syndrome, hepatitis B reactivation

In [None]:
from transformers import pipeline
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu
# sample ADRs of dasatinib from the FDA below (US spelling, which may match the training data better as it also uses US spelling)
pipe("myelosuppression, fluid retention events, diarrhea, headache, skin rash, hemorrhage, dyspnea, fatigue, nausea, and musculoskeletal pain.")

Device set to use cpu


[{'entity_group': 'musculoskeletal',
  'score': 0.1426468,
  'word': 'my',
  'start': 0,
  'end': 2},
 {'entity_group': 'gastrointestinal',
  'score': 0.28940064,
  'word': '##elo',
  'start': 2,
  'end': 5},
 {'entity_group': 'gastrointestinal',
  'score': 0.24409828,
  'word': '##su',
  'start': 5,
  'end': 7},
 {'entity_group': 'gastrointestinal',
  'score': 0.24849327,
  'word': '##pp',
  'start': 7,
  'end': 9},
 {'entity_group': 'gastrointestinal',
  'score': 0.22623841,
  'word': '##ression',
  'start': 9,
  'end': 16},
 {'entity_group': 'gastrointestinal',
  'score': 0.2468893,
  'word': ',',
  'start': 16,
  'end': 17},
 {'entity_group': 'gastrointestinal',
  'score': 0.19548449,
  'word': 'fluid',
  'start': 18,
  'end': 23},
 {'entity_group': 'gastrointestinal',
  'score': 0.23864552,
  'word': 'retention',
  'start': 24,
  'end': 33},
 {'entity_group': 'gastrointestinal',
  'score': 0.25369453,
  'word': 'events',
  'start': 34,
  'end': 40},
 {'entity_group': 'gastrointest
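
The output above labels sub-word pieces such as `my`, `##elo` and `##su` separately. One way to read it at word level is to glue the `##` pieces back together and keep the label with the highest summed score. A minimal sketch, assuming prediction dicts shaped like the pipeline output above; the function name `merge_subword_predictions` is hypothetical, and the pipeline's other `aggregation_strategy` values (`"first"`, `"average"`, `"max"`) may achieve something similar without custom code:

```python
from collections import defaultdict

def merge_subword_predictions(predictions):
    """Join '##' pieces into whole words and pick the label with the highest summed score."""
    merged = []
    for pred in predictions:
        word = pred["word"]
        if word.startswith("##") and merged:
            current = merged[-1]
            current["word"] += word[2:]          # append the piece without '##'
            current["end"] = pred["end"]
            current["votes"][pred["entity_group"]] += float(pred["score"])
        else:
            votes = defaultdict(float)
            votes[pred["entity_group"]] += float(pred["score"])
            merged.append(
                {"word": word, "start": pred["start"], "end": pred["end"], "votes": votes}
            )
    return [
        {
            "word": m["word"],
            "start": m["start"],
            "end": m["end"],
            "entity_group": max(m["votes"], key=m["votes"].get),
        }
        for m in merged
    ]

# first five entries copied from the output above
sample_predictions = [
    {"entity_group": "musculoskeletal", "score": 0.1426468, "word": "my", "start": 0, "end": 2},
    {"entity_group": "gastrointestinal", "score": 0.28940064, "word": "##elo", "start": 2, "end": 5},
    {"entity_group": "gastrointestinal", "score": 0.24409828, "word": "##su", "start": 5, "end": 7},
    {"entity_group": "gastrointestinal", "score": 0.24849327, "word": "##pp", "start": 7, "end": 9},
    {"entity_group": "gastrointestinal", "score": 0.22623841, "word": "##ression", "start": 9, "end": 16},
]
merged_words = merge_subword_predictions(sample_predictions)
print(merged_words)
```

On this sample the five pieces collapse into one `myelosuppression` entry. The label it ends up with is still wrong for that term, so merging only tidies the output and does not fix the underlying model quality.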

In [None]:
# very early thoughts after a brief test of the above:
# NER accuracy is not great at the moment (all scores in the part of the output shown are below 0.3), 
# may need to try other pre-trained models, e.g. more domain-specific ones, 
# a separate validation dataset also needs to be prepared,
# and the training dataset is probably far too small as well
# but this gives an overall idea of how a pre-trained (base/foundation) transformer model, 
# fine-tuned on my own smaller (hopefully better-quality) dataset, labels words/ADRs in new, unseen text to do NER