*This notebook uses a virtual environment (venv) created with uv.*
- https://docs.astral.sh/uv/guides/integration/jupyter/#using-jupyter-from-vs-code

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("CYP3A4_strong_substrates")
data

Unnamed: 0,generic_drug_name,cyp_strength_of_evidence,drug_class,common_adverse_effects^^,less_common_adverse_effects^,first_ref,second_ref,date_checked
0,carbamazepine,strong,antiepileptics,"constipation^^, leucopenia^^, dizziness^^, som...","eosinophilia^, thrombocytopenia^, neutropenia^...",drugs.com,nzf,211024
1,eliglustat,strong,metabolic_agents,"diarrhea^^, oropharyngeal_pain^^, arthralgia^^...","rash^, flatulence^, dyspepsia^, gastroesophage...",drugs.com,emc,151124
2,flibanserin,strong,CNS_agents,"dizziness^^, somnolence^^","sedation^, fatigue^, vertigo^, accidental_inju...",drugs.com,Drugs@FDA,161124
3,imatinib,strong,tyrosine_kinase_inhibitor,"rash^^, diarrhea^^, abdominal_pain^^, constipa...","flushing^, pruritus^, face_edema^, dry skin^, ...",drugs.com,nzf,181124
4,ibrutinib,strong,tyrosine_kinase_inhibitor,"hypertension^^, atrial_fibrillation^^, sinus_t...","atrial_flutter^, cardiac_failure(pm)^, ventric...",drugs.com,nzf,191124
5,neratinib,strong,tyrosine_kinase_inhibitor,"diarrhea^^, abdominal_pain^^, stomatitis^^, dy...","abdominal_distention^, dry_mouth^, nail_disord...",drugs.com,nzf,201124
6,esomeprazole,strong,proton_pump_inhibitors,"headache^^, flatulence^^","dizziness^, somnolence^, taste_disturbance/per...",drugs.com,emc,161124
7,omeprazole,strong,proton_pump_inhibitors,"fever^^, otitis_media^^, respiratory_system_re...","accidental_injury^, asthenia^, pain(pm), fatig...",drugs.com,nzf,181124
8,ivacaftor,strong,CFTR_potentiator,"rash^^, oropharyngeal_pain^^, abdominal_pain^^...","acne^, increased_hepatic_enzymes^, increased_b...",drugs.com,nzf,201124
9,naloxegol,strong,peripheral_opioid_receptor_antagonists,abdominal pain^^,"possible_opioid_withdrawal_syndrome^, diarrhea...",drugs.com,emc,211124


In [3]:
# drop the columns not needed for the common-ADR trial below
df = data.drop([
    "cyp_strength_of_evidence", 
    "drug_class", 
    "less_common_adverse_effects^", 
    "first_ref", 
    "second_ref", 
    "date_checked"
    ], axis=1)
df

Unnamed: 0,generic_drug_name,common_adverse_effects^^
0,carbamazepine,"constipation^^, leucopenia^^, dizziness^^, som..."
1,eliglustat,"diarrhea^^, oropharyngeal_pain^^, arthralgia^^..."
2,flibanserin,"dizziness^^, somnolence^^"
3,imatinib,"rash^^, diarrhea^^, abdominal_pain^^, constipa..."
4,ibrutinib,"hypertension^^, atrial_fibrillation^^, sinus_t..."
5,neratinib,"diarrhea^^, abdominal_pain^^, stomatitis^^, dy..."
6,esomeprazole,"headache^^, flatulence^^"
7,omeprazole,"fever^^, otitis_media^^, respiratory_system_re..."
8,ivacaftor,"rash^^, oropharyngeal_pain^^, abdominal_pain^^..."
9,naloxegol,abdominal pain^^


In [None]:
## Example: trial run generating tensors from the ADRs of ONE drug, e.g. terfenadine

import torch
import torch.nn as nn
from collections import Counter

torch.manual_seed(1)

sentence = "dizziness^^, syncopal_episodes^^, palpitations^, ventricular_arrhythmias^^, cardiac_arrest^^, cardiac_death^^, headaches^"

words = sentence.split(', ')
words

['dizziness^^',
 'syncopal_episodes^^',
 'palpitations^',
 'ventricular_arrhythmias^^',
 'cardiac_arrest^^',
 'cardiac_death^^',
 'headaches^']
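
The trailing carets appear to encode frequency, since the column headers use `^^` for common and `^` for less common adverse effects. A minimal sketch, assuming that convention holds throughout the data, separates each term from its marker. `split_marker` is a hypothetical helper name and is not part of this notebook.

```python
def split_marker(token):
    """Split an ADR token such as 'dizziness^^' into (term, caret_count)."""
    term = token.rstrip("^")           # drop trailing carets only
    carets = len(token) - len(term)    # assumed: 2 = common, 1 = less common
    return term.strip(), carets


sentence = "dizziness^^, syncopal_episodes^^, palpitations^"
pairs = [split_marker(tok) for tok in sentence.split(", ")]
print(pairs)
```

Keeping the caret count as a separate value would let the same term (e.g. `dizziness`) share one vocabulary entry regardless of how often it is reported.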

In [7]:
# count each ADR term (Counter is a dict subclass mapping term -> count)
vocab = Counter(words) 
vocab

Counter({'dizziness^^': 1,
         'syncopal_episodes^^': 1,
         'palpitations^': 1,
         'ventricular_arrhythmias^^': 1,
         'cardiac_arrest^^': 1,
         'cardiac_death^^': 1,
         'headaches^': 1})

In [8]:
# sorted() on a Counter returns a sorted list of its keys, i.e. the unique ADR terms
vocab = sorted(vocab)
vocab

['cardiac_arrest^^',
 'cardiac_death^^',
 'dizziness^^',
 'headaches^',
 'palpitations^',
 'syncopal_episodes^^',
 'ventricular_arrhythmias^^']

In [9]:
vocab_size = len(vocab)
vocab_size

7

In [10]:
# create a word to index dictionary from the vocab
word2idx = {word: ind for ind, word in enumerate(vocab)} 
word2idx

{'cardiac_arrest^^': 0,
 'cardiac_death^^': 1,
 'dizziness^^': 2,
 'headaches^': 3,
 'palpitations^': 4,
 'syncopal_episodes^^': 5,
 'ventricular_arrhythmias^^': 6}

In [11]:
# look up the index of each word and print the pair
for word in words:
    print(word, word2idx[word])

dizziness^^ 2
syncopal_episodes^^ 5
palpitations^ 4
ventricular_arrhythmias^^ 6
cardiac_arrest^^ 0
cardiac_death^^ 1
headaches^ 3


In [12]:
# encode the words as a list of indices using the word2idx dictionary
encoded_sentences = [word2idx[word] for word in words]
encoded_sentences

[2, 5, 4, 6, 0, 1, 3]

In [13]:
## docs: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding
# embedding_dim - the size of each embedding vector (usually embedding_dim << no. of words)
embedding_dim = 5

## initialise an embedding layer from Torch
# num_embeddings - size of the dictionary of embeddings (here vocab_size)
# padding_idx (optional, not used here) - the vector at this index is initialised to zeros
# and does not contribute to the gradient, so it is not updated during training
emb = nn.Embedding(vocab_size, embedding_dim)
word_vectors = torch.LongTensor(encoded_sentences)
emb(word_vectors)

tensor([[-0.7773, -0.2515, -0.2223,  1.6871,  0.2284],
        [-0.7765,  2.0242, -0.0288,  2.3571, -1.0373],
        [ 0.1991,  0.0457,  0.1530, -0.4757, -1.8821],
        [ 1.5748, -0.6298,  2.4070,  0.2786,  0.2468],
        [-1.5256, -0.7502, -0.6540, -1.6095, -0.1002],
        [-0.6092, -0.9798, -1.6091, -0.7121,  0.3037],
        [ 0.4676, -0.6970, -1.1608,  0.6995, -0.7735]],
       grad_fn=<EmbeddingBackward0>)
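
The cell above does not pass `padding_idx`. As a small illustration of what that argument does, the sketch below reserves index 0 for padding and embeds two index sequences of different lengths. The values `num_embeddings=8` (seven ADR terms plus one padding slot), the shifted indices, and the names `PAD_IDX` and `emb_padded` are assumptions for this example only.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# Index 0 is reserved for padding, so real tokens start at 1 (an assumption
# for this sketch; the notebook's word2idx starts at 0).
PAD_IDX = 0
emb_padded = nn.Embedding(num_embeddings=8, embedding_dim=5, padding_idx=PAD_IDX)

# Two ADR index sequences of different lengths, padded to the same length.
batch = torch.tensor([[3, 6, 5, 7],
                      [1, 2, PAD_IDX, PAD_IDX]], dtype=torch.long)

out = emb_padded(batch)
print(out.shape)   # one 5-dimensional vector per index in the batch
print(out[1, 2])   # a padding position maps to the zero vector

# The padding row receives no gradient, so training would leave it at zero.
out.sum().backward()
print(emb_padded.weight.grad[PAD_IDX])
```

This only matters once drugs with different numbers of ADRs are batched together; for a single drug, as in the cell above, no padding is needed.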

In [None]:
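# collect the common ADRs column as a list of strings (one per drug), then join them into one string
# note: joining with "," leaves no space at the row boundaries, unlike the ", " separators within each row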
adr_str = df["common_adverse_effects^^"].tolist()
adr_string = ",".join(adr_str)
adr_string

'constipation^^, leucopenia^^, dizziness^^, somnolence^^, ataxia^^, elevated GGT^^, allergic_skin_reactions^^,diarrhea^^, oropharyngeal_pain^^, arthralgia^^, back_pain^^, pain_in_extremity^^, upper_abdominal_pain^^, headache^^, migraine^^, fatigue^^,dizziness^^, somnolence^^,rash^^, diarrhea^^, abdominal_pain^^, constipation^^, dyspepsia^^, hemorrhage^^, neutropenia^^, thrombocytopenia^^, anemia^^, influenza^^, weight_gain^^, muscle_spasm/cramps^^, musculoskeletal_pain^^, joint_pain^^, myalgia^^, bone_pain^^, headache^^, dizziness^^, periorbital_edema^^, edema^^, fatigue^^, pyrexia^^, insomnia^^, depression^^, nasopharyngitis^^, cough^^, upper_respiratory_tract infection^^, pharyngolaryngeal_pain^^, sinusitis^^,hypertension^^, atrial_fibrillation^^, sinus_tachycardia^^,  rash^^, skin_infection^^, pruritus^^, diarrhea^^, stomatitis^^, abdominal_pain^^, constipation^^, dyspepsia^^, gastroesophageal_reflux_disease^^, UTI^^, decreased_platelets^^, neutropenia^^, decreased_neutrophils^^, de
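
The joined string mixes separators: `", "` within a row, `","` at the row boundaries, and at least one `",  "` with a double space. Splitting on `', '` alone would therefore fuse or mis-space some terms. A minimal sketch of a more tolerant split follows, assuming commas never occur inside an ADR term. `tokenize_adrs` is a hypothetical helper, not the code in `adr_tensors.py`.

```python
import re


def tokenize_adrs(adr_string):
    """Split a comma-separated ADR string into clean tokens."""
    # Split on commas with any surrounding whitespace, then drop empty tokens.
    tokens = re.split(r"\s*,\s*", adr_string.strip())
    return [tok for tok in tokens if tok]


sample = "allergic_skin_reactions^^,diarrhea^^, sinus_tachycardia^^,  rash^^"
tokens = tokenize_adrs(sample)
vocab = sorted(set(tokens))
word2idx = {word: idx for idx, word in enumerate(vocab)}
encoded = [word2idx[tok] for tok in tokens]
print(tokens)
print(encoded)
```

Terms that contain internal spaces, such as `elevated GGT^^`, are kept whole because only the whitespace around commas is removed.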

In [None]:
# convert all common ADRs into Torch tensors
# (using a work-in-progress function saved in adr_tensors.py)
from adr_tensors import adr_tensors
adr = adr_tensors(adr_string)
adr

tensor([[-4.3704e-01,  7.6260e-01,  4.4151e-01,  1.1651e+00,  2.0154e+00],
        [ 2.1518e-01, -5.2419e-01, -1.8034e+00, -6.4464e-01,  1.5392e+00],
        [-8.6959e-01, -3.3312e+00, -7.4787e-01,  1.1173e+00,  2.9814e-01],
        [ 1.0989e-01, -1.0055e+00, -2.1061e-01, -7.5475e-03,  1.6734e+00],
        [ 1.0343e-02,  9.8371e-01,  8.7929e-01, -1.4504e+00, -8.3126e-01],
        [-4.6102e-01, -5.6008e-01,  3.9558e-01, -9.8228e-01,  1.3264e+00],
        [ 8.5472e-01, -2.8052e-01,  7.3169e-01, -1.4344e+00, -5.0081e-01],
        [ 1.7163e-01, -1.5999e-01, -5.0467e-01, -1.4746e+00, -3.4158e-01],
        [ 7.3227e-01, -1.0483e+00, -4.7088e-01,  2.9114e-01,  1.9907e+00],
        [-9.2470e-01, -9.3010e-01,  1.4301e+00, -9.1352e-01,  1.3851e+00],
        [-8.1385e-01, -9.2758e-01,  1.1120e+00,  6.1554e-01,  1.9382e-01],
        [-2.5832e+00, -1.5122e-01, -2.1021e+00, -6.2002e-01, -1.4782e+00],
        [-1.1334e+00, -1.0103e-01,  3.4335e-01, -1.0703e+00,  8.1682e-01],
        [ 2.0530e-01,  3.
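
The contents of `adr_tensors.py` are not shown in this notebook. As a rough, hypothetical sketch only, a function of this kind could chain the earlier single-drug steps (split, sorted vocabulary, index lookup, embedding). The name `adr_tensors_sketch`, the defaults `embedding_dim=5` and `seed=1`, and the separator handling are all assumptions, and the actual work-in-progress function may differ.

```python
import re

import torch
import torch.nn as nn


def adr_tensors_sketch(adr_string, embedding_dim=5, seed=1):
    """Hypothetical stand-in for adr_tensors: string of ADRs -> embedding tensor."""
    torch.manual_seed(seed)  # reproducible random initial embeddings
    # Tokenise on commas, tolerating missing or extra spaces.
    words = [w for w in re.split(r"\s*,\s*", adr_string.strip()) if w]
    vocab = sorted(set(words))
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    indices = torch.tensor([word2idx[w] for w in words], dtype=torch.long)
    emb = nn.Embedding(len(vocab), embedding_dim)
    return emb(indices)  # shape: (number of tokens, embedding_dim)


vectors = adr_tensors_sketch("dizziness^^, somnolence^^,rash^^, dizziness^^")
print(vectors.shape)
```

Repeated terms share one row of the embedding table, so both `dizziness^^` entries map to the same vector. The vectors are untrained random initial values and carry no meaning until the layer is trained as part of a model.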

In [None]:
# May need to set up a class with a few different functions (possibly kept in separate .py scripts and imported into the notebook first)

* Structure-adverse drug reaction relationships:
**ADRs <-> (dense vectors of real numbers) <-> 2D drug structures**

* Structure-activity relationships:
**drug activities <-> 2D drug structures**

1. trial generating tensors of ADRs for one drug
- https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

2. build a NN model (?RNN) either to classify drugs in an ADRs dataset (?identify drugs in different therapeutic classes) or to predict the ADRs of drugs (regression); still to decide between classification and regression
- aim: infer possible drug vs. ADR relationships

3. 2D drug structures part (much further down the line)
- graph neural networks (GNN; other variations also available): molecules as undirected graphs with atoms as nodes and bonds as edges, where the order in which the nodes are listed doesn't matter (i.e. they don't need to be in a particular order or sequence)
OR
- RNN on SMILES strings (NLP technique): tokenize the SMILES strings -> build a dictionary mapping tokens to indices in the vocabulary -> convert the tokens into one-hot encodings
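
The SMILES route in item 3 (tokenize -> token-to-index dictionary -> one-hot encodings) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the regular expression is a simplified token pattern and does not cover the full SMILES grammar, and the names `SMILES_TOKEN`, `tokenize_smiles` and `one_hot` are hypothetical.

```python
import re

# Simplified SMILES token pattern (an assumption for this sketch): bracket atoms,
# two-letter halogens, single-letter atoms, ring-closure digits, bonds and branches.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|%\d{2}|\d|[=#\-+\\/()@.:~*$]")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, keeping 'Cl' and 'Br' whole."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against characters the simplified pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable SMILES: {smiles!r}")
    return tokens


def one_hot(tokens, token2idx):
    """Encode each token as a list with a single 1 at its vocabulary index."""
    size = len(token2idx)
    return [[1 if i == token2idx[tok] else 0 for i in range(size)] for tok in tokens]


smiles = "CC(=O)Cl"  # acetyl chloride, used only as a short example
tokens = tokenize_smiles(smiles)
token2idx = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
encoded = one_hot(tokens, token2idx)
print(tokens)
print(token2idx)
```

In practice the vocabulary would be built over all SMILES strings in the dataset rather than a single molecule, and the one-hot rows could be replaced by an `nn.Embedding` lookup as in the ADR trial above.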