*this notebook uses a venv created by using uv*
- https://docs.astral.sh/uv/guides/integration/jupyter/#using-jupyter-from-vs-code

In [1]:
import pandas as pd
print(f"Pandas version used is: {pd.__version__}")
import torch
print(f"PyTorch version used is: {torch.__version__}")
import torch.nn as nn

Pandas version used is: 2.2.3
PyTorch version used is: 2.2.2


In [2]:
data = pd.read_csv("All_CYP3A4_substrates")
data.head()

Unnamed: 0,generic_drug_name,notes,cyp_strength_of_evidence,drug_class,adverse_drug_reactions,first_ref,second_ref,date_checked
0,carbamazepine,,strong,antiepileptics,"constipation^^, leucopenia^^, dizziness^^, som...",drugs.com,nzf,211024
1,eliglustat,,strong,metabolic_agents,"diarrhea^^, oropharyngeal_pain^^, arthralgia^^...",drugs.com,emc,151124
2,flibanserin,,strong,CNS_agents,"dizziness^^, somnolence^^, sedation^, fatigue^...",drugs.com,Drugs@FDA,161124
3,imatinib,,strong,tyrosine_kinase_inhibitor,"rash^^, diarrhea^^, abdominal_pain^^, constipa...",drugs.com,nzf,181124
4,ibrutinib,,strong,tyrosine_kinase_inhibitor,"hypertension^^, atrial_fibrillation^^, sinus_t...",drugs.com,nzf,191124


For drugs marked with asterisks in the "notes" column, see the data notes under the "Exceptions for ADRs" section in 1_ADR_data.qmd.

In [3]:
# drop some columns
df = data.drop([
    "notes",
    "first_ref", 
    "second_ref", 
    "date_checked"
    ], axis=1)
df

Unnamed: 0,generic_drug_name,cyp_strength_of_evidence,drug_class,adverse_drug_reactions
0,carbamazepine,strong,antiepileptics,"constipation^^, leucopenia^^, dizziness^^, som..."
1,eliglustat,strong,metabolic_agents,"diarrhea^^, oropharyngeal_pain^^, arthralgia^^..."
2,flibanserin,strong,CNS_agents,"dizziness^^, somnolence^^, sedation^, fatigue^..."
3,imatinib,strong,tyrosine_kinase_inhibitor,"rash^^, diarrhea^^, abdominal_pain^^, constipa..."
4,ibrutinib,strong,tyrosine_kinase_inhibitor,"hypertension^^, atrial_fibrillation^^, sinus_t..."
5,neratinib,strong,tyrosine_kinase_inhibitor,"diarrhea^^, abdominal_pain^^, stomatitis^^, dy..."
6,esomeprazole,strong,proton_pump_inhibitors,"headache^^, flatulence^^, dizziness^, somnolen..."
7,omeprazole,strong,proton_pump_inhibitors,"fever^^, otitis_media^^, respiratory_system_re..."
8,ivacaftor,strong,CFTR_potentiator,"rash^^, oropharyngeal_pain^^, abdominal_pain^^..."
9,naloxegol,strong,peripheral_opioid_receptor_antagonists,"abdominal pain^^, possible_opioid_withdrawal_s..."


In [4]:
# Collect the drug names from the dataframe as a list
drug_names = df["generic_drug_name"].tolist()
# Upper-case each name, wrap it in single quotes and join the names with commas
# (this avoids reusing the same quote type inside an f-string, which needs Python 3.12+)
drugs = ",".join(f"'{name.upper()}'" for name in drug_names)
print(drugs)

'CARBAMAZEPINE','ELIGLUSTAT','FLIBANSERIN','IMATINIB','IBRUTINIB','NERATINIB','ESOMEPRAZOLE','OMEPRAZOLE','IVACAFTOR','NALOXEGOL','OXYCODONE','SIROLIMUS','TERFENADINE','DIAZEPAM','HYDROCORTISONE','LANSOPRAZOLE','PANTOPRAZOLE','LERCANIDIPINE','NALDEMEDINE','NELFINAVIR','TELAPREVIR','ONDANSETRON','QUININE','RIBOCICLIB','SUVOREXANT','TELITHROMYCIN','TEMSIROLIMUS'


In [5]:
# Get SMILES for each drug (via copying-and-pasting the previous cell output - attempted various ways to feed the string
# directly into cyp_drugs.py, current way seems to be the most straightforward one...)
from cyp_drugs import chembl_drugs
# Using ChEMBL version 34
df_s = chembl_drugs(
    'CARBAMAZEPINE','ELIGLUSTAT','FLIBANSERIN','IMATINIB','IBRUTINIB','NERATINIB','ESOMEPRAZOLE','OMEPRAZOLE','IVACAFTOR','NALOXEGOL','OXYCODONE','SIROLIMUS','TERFENADINE','DIAZEPAM','HYDROCORTISONE','LANSOPRAZOLE','PANTOPRAZOLE','LERCANIDIPINE','NALDEMEDINE','NELFINAVIR','TELAPREVIR','ONDANSETRON','QUININE','RIBOCICLIB','SUVOREXANT','TELITHROMYCIN','TEMSIROLIMUS', 
    #file_name="All_cyp3a4_smiles"
    )
print(df_s.shape)
df_s.head()

## Note: latest ChEMBL version 35 (as from 1st Dec 2024) seems to be taking a long time to load (no output after ~7min), 
## both versions 33 & 34 are ok with outputs loading within a few secs

(27, 4)


Unnamed: 0,chembl_id,pref_name,max_phase,canonical_smiles
0,CHEMBL108,CARBAMAZEPINE,4,NC(=O)N1c2ccccc2C=Cc2ccccc21
1,CHEMBL12,DIAZEPAM,4,CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21
2,CHEMBL2110588,ELIGLUSTAT,4,CCCCCCCC(=O)N[C@H](CN1CCCC1)[C@H](O)c1ccc2c(c1...
3,CHEMBL1201320,ESOMEPRAZOLE,4,COc1ccc2[nH]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1
4,CHEMBL231068,FLIBANSERIN,4,O=c1[nH]c2ccccc2n1CCN1CCN(c2cccc(C(F)(F)F)c2)CC1
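
One possible way to avoid the copy-and-paste step, sketched under an assumption: if `chembl_drugs` takes the drug names as separate positional arguments (as the call above suggests), a list of names can be unpacked into the call with `*`. The function `lookup_names` below is a hypothetical stand-in for illustration only, not the real `cyp_drugs.chembl_drugs`.

```python
# Hypothetical stand-in for chembl_drugs: accepts any number of names
# as positional arguments plus an optional keyword argument.
def lookup_names(*names, file_name=None):
    return list(names)

# Upper-case the names once and keep them as a list (no string building needed)
drug_names = ["carbamazepine", "eliglustat", "flibanserin"]
upper_names = [name.upper() for name in drug_names]

# The * operator spreads the list into separate positional arguments,
# equivalent to lookup_names('CARBAMAZEPINE', 'ELIGLUSTAT', 'FLIBANSERIN')
result = lookup_names(*upper_names)
print(result)
```

With the real function this might read `chembl_drugs(*upper_names)`, provided its signature matches the call in the cell above.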


I'm running the canonical SMILES through my old script to convert these small molecules into RDKit molecules and standardised SMILES, which also checks that the SMILES are valid and parsable.

In [6]:
# Using my previous code to preprocess small mols
import datamol as dm

# disable rdkit messages
dm.disable_rdkit_log()

# The following function was adapted from datamol.io
def preprocess(row):

    """
    Preprocess, fix, standardise and sanitise a compound,
    then add molecular representations of it to the row.
    Can be utilised as df.apply(preprocess, axis=1).

    :param row: a dataframe row holding a SMILES string in a column
    named "canonical_smiles" (derived from the ChEMBL database or any other source)
    :return: the row with the preprocessed RDKit molecule ("rdkit_mol") and
    standardised SMILES ("standard_smiles") added as columns
    (SELFIES, InChI and InChI keys are currently commented out)
    """

    # smiles_column = strings object
    smiles_column = "canonical_smiles"
    # Convert each compound into a RDKit molecule in the smiles column
    mol = dm.to_mol(row[smiles_column], ordered=True)
    # Fix common errors in the molecules
    mol = dm.fix_mol(mol)
    # Sanitise the molecules 
    mol = dm.sanitize_mol(mol, sanifix=True, charge_neutral=False)
    # Standardise the molecules
    mol = dm.standardize_mol(
        mol,
        # Switch on to disconnect metal ions
        disconnect_metals=True,
        normalize=True,
        reionize=True,
        # Switch on "uncharge" to neutralise charges
        uncharge=True,
        # Taking care of stereochemistries of compounds
        # Note: this uses the older approach of "AssignStereochemistry()" from RDKit
        # https://github.com/datamol-io/datamol/blob/main/datamol/mol.py#L488
        stereo=True,
    )

    # Add the different molecular representations to the row as new columns
    row["rdkit_mol"] = dm.to_mol(mol)
    row["standard_smiles"] = dm.standardize_smiles(dm.to_smiles(mol))
    #row["selfies"] = dm.to_selfies(mol)
    #row["inchi"] = dm.to_inchi(mol)
    #row["inchikey"] = dm.to_inchikey(mol)
    return row

df_s3a4 = df_s.apply(preprocess, axis = 1)
df_s3a4.head()

Unnamed: 0,chembl_id,pref_name,max_phase,canonical_smiles,rdkit_mol,standard_smiles
0,CHEMBL108,CARBAMAZEPINE,4,NC(=O)N1c2ccccc2C=Cc2ccccc21,<rdkit.Chem.rdchem.Mol object at 0x12ad2bc30>,NC(=O)N1c2ccccc2C=Cc2ccccc21
1,CHEMBL12,DIAZEPAM,4,CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21,<rdkit.Chem.rdchem.Mol object at 0x12ad2bd10>,CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21
2,CHEMBL2110588,ELIGLUSTAT,4,CCCCCCCC(=O)N[C@H](CN1CCCC1)[C@H](O)c1ccc2c(c1...,<rdkit.Chem.rdchem.Mol object at 0x12ad2bd80>,CCCCCCCC(=O)N[C@H](CN1CCCC1)[C@H](O)c1ccc2c(c1...
3,CHEMBL1201320,ESOMEPRAZOLE,4,COc1ccc2[nH]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1,<rdkit.Chem.rdchem.Mol object at 0x12ad2be60>,COc1ccc2[nH]c([S@@+]([O-])Cc3ncc(C)c(OC)c3C)nc2c1
4,CHEMBL231068,FLIBANSERIN,4,O=c1[nH]c2ccccc2n1CCN1CCN(c2cccc(C(F)(F)F)c2)CC1,<rdkit.Chem.rdchem.Mol object at 0x12ad2bed0>,O=c1[nH]c2ccccc2n1CCN1CCN(c2cccc(C(F)(F)F)c2)CC1


In [7]:
## Splitting data 
# Random splits usually lead to overly optimistic models, as test molecules are too similar to training molecules
# Some blog references re. data splitting wrt small molecules: 
# https://greglandrum.github.io/rdkit-blog/posts/2024-05-31-scaffold-splits-and-murcko-scaffolds1.html
# https://practicalcheminformatics.blogspot.com/2024/11/some-thoughts-on-splitting-chemical.html

## Try using Pat Walters' useful_rdkit_utils' GroupKFoldShuffle 
# (code originated from: https://github.com/scikit-learn/scikit-learn/issues/20520)

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
import useful_rdkit_utils as uru
import numpy as np

# Generate numpy arrays containing the fingerprints 
df_s3a4['fp'] = df_s3a4.rdkit_mol.apply(rdFingerprintGenerator.GetMorganGenerator().GetCountFingerprintAsNumPy)

# Get Butina cluster labels
df_s3a4["butina_cluster"] = uru.get_butina_clusters(df_s3a4.standard_smiles)

# Set up a GroupKFoldShuffle object
group_kfold_shuffle = uru.GroupKFoldShuffle(n_splits=5, shuffle=True)

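# Using cross-validation/doing data split
## X = np.stack(df_s3a4.fp), y = df.adverse_drug_reactions, group labels = df_s3a4.butina_cluster
## NOTE: df_s3a4 follows the ChEMBL output order (CARBAMAZEPINE, DIAZEPAM, ELIGLUSTAT, ...) while df follows
## the CSV order (carbamazepine, eliglustat, flibanserin, ...), so the same position does not refer to the
## same drug in both dataframes; they need aligning by drug name before these indices are reused on df
## Only the indices from the last of the 5 splits remain in train and test after the loop
for train, test in group_kfold_shuffle.split(np.stack(df_s3a4.fp), df.adverse_drug_reactions, df_s3a4.butina_cluster):
    print(len(train),len(test))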

21 6
20 7
23 4
23 4
21 6
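
To make explicit what a group-aware split guarantees, here is a minimal pure-Python sketch. It is not the `useful_rdkit_utils` implementation, only an illustration of the idea: every cluster label is assigned to exactly one fold, so no cluster contributes to both the training and the test indices of a split. The function name and the cluster labels are made up for this example.

```python
import random

def group_kfold_indices(groups, n_splits, seed=0):
    """Yield (train, test) index lists; rows sharing a group label stay together."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Deal the shuffled group labels round-robin into n_splits folds
    folds = [unique_groups[i::n_splits] for i in range(n_splits)]
    for fold in folds:
        test_groups = set(fold)
        test = [i for i, g in enumerate(groups) if g in test_groups]
        train = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train, test

# Made-up cluster labels for 10 molecules
clusters = [0, 0, 1, 2, 2, 2, 3, 4, 4, 5]
splits = list(group_kfold_indices(clusters, n_splits=3))
for train, test in splits:
    print(len(train), len(test))
```

Because whole clusters move together, the fold sizes are usually uneven, which matches the varying train/test sizes printed above.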


In [8]:
## Figuring out locating train and test sets:

## create a dictionary as {index: butina label} first? --> not this way I think...
## butina cluster labels vs. index
#df_s3a4["butina_cluster"]

## or maybe can directly convert from numpy to tensor --> not this way! 
## will need to locate drugs via indices first to specify training and testing sets
# torch_train = torch.from_numpy(train)
# torch_train
# torch_test = torch.from_numpy(test)
# torch_test

In [9]:
# Locate train & test sets in the original df using df.iloc 
# Training set indices
train

array([ 0,  3,  4,  5,  6,  7,  8,  9, 10, 13, 14, 15, 16, 18, 19, 21, 22,
       23, 24, 25, 26])

In [10]:
# Convert indices into list
train_set = train.tolist()
# Locate drugs and drug info via df.iloc
df_train = df.iloc[train_set]
print(df_train.shape)
df_train.head()

(21, 4)


Unnamed: 0,generic_drug_name,cyp_strength_of_evidence,drug_class,adverse_drug_reactions
0,carbamazepine,strong,antiepileptics,"constipation^^, leucopenia^^, dizziness^^, som..."
3,imatinib,strong,tyrosine_kinase_inhibitor,"rash^^, diarrhea^^, abdominal_pain^^, constipa..."
4,ibrutinib,strong,tyrosine_kinase_inhibitor,"hypertension^^, atrial_fibrillation^^, sinus_t..."
5,neratinib,strong,tyrosine_kinase_inhibitor,"diarrhea^^, abdominal_pain^^, stomatitis^^, dy..."
6,esomeprazole,strong,proton_pump_inhibitors,"headache^^, flatulence^^, dizziness^, somnolen..."


In [11]:
# Testing set indices
test

array([ 1,  2, 11, 12, 17, 20])

In [12]:
test_set = test.tolist()
df_test = df.iloc[test_set]
print(df_test.shape)
df_test

(6, 4)


Unnamed: 0,generic_drug_name,cyp_strength_of_evidence,drug_class,adverse_drug_reactions
1,eliglustat,strong,metabolic_agents,"diarrhea^^, oropharyngeal_pain^^, arthralgia^^..."
2,flibanserin,strong,CNS_agents,"dizziness^^, somnolence^^, sedation^, fatigue^..."
11,sirolimus,strong,immunosuppressant,"hypertriglyceridemia^^, hypercholesterolemia^^..."
12,terfenadine,strong,antihistamines,"dizziness^^, syncopal_episodes^^, palpitations..."
17,lercanidipine,mod,calcium_channel_blockers,"hypotension(pm), gingival_hypertrophy(pm), uri..."
20,telaprevir,mod,antivirals,"rash^^, pruritus^^, anemia^^, decreased_mean_p..."
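
The split indices were computed on the ChEMBL-derived table, whose first rows are CARBAMAZEPINE, DIAZEPAM, ELIGLUSTAT, whereas `df` starts with carbamazepine, eliglustat, flibanserin. A position in one table therefore does not necessarily point at the same drug in the other. The sketch below shows one way to translate positions through drug names, using only the first five rows visible in the outputs above; the split positions are made up for illustration.

```python
# First rows of each table, in the order they appear in the outputs above
adr_names = ["carbamazepine", "eliglustat", "flibanserin", "imatinib", "ibrutinib"]
chembl_names = ["CARBAMAZEPINE", "DIAZEPAM", "ELIGLUSTAT", "ESOMEPRAZOLE", "FLIBANSERIN"]

# Suppose a split returned these positions in the ChEMBL-ordered table (made up)
test_positions = [2, 4]

# Translate positions -> names -> positions in the ADR table
adr_position = {name.upper(): i for i, name in enumerate(adr_names)}
test_names = [chembl_names[i] for i in test_positions]
aligned = [adr_position[name] for name in test_names if name in adr_position]

print(test_names)  # ['ELIGLUSTAT', 'FLIBANSERIN']
print(aligned)     # [1, 2]
```

In pandas the same idea could be expressed as a merge on an upper-cased name column, so that fingerprints, cluster labels and ADRs share one row per drug before splitting.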


In [13]:
# Taking another look at original df
df.head()

Unnamed: 0,generic_drug_name,cyp_strength_of_evidence,drug_class,adverse_drug_reactions
0,carbamazepine,strong,antiepileptics,"constipation^^, leucopenia^^, dizziness^^, som..."
1,eliglustat,strong,metabolic_agents,"diarrhea^^, oropharyngeal_pain^^, arthralgia^^..."
2,flibanserin,strong,CNS_agents,"dizziness^^, somnolence^^, sedation^, fatigue^..."
3,imatinib,strong,tyrosine_kinase_inhibitor,"rash^^, diarrhea^^, abdominal_pain^^, constipa..."
4,ibrutinib,strong,tyrosine_kinase_inhibitor,"hypertension^^, atrial_fibrillation^^, sinus_t..."


In [14]:
## Likely using regression (?)

## Using Butina clustering/splits to split data - to do this, it requires SMILES in order to generate fingerprints 
## I may only use these SMILES to this extent for the current post, but for future posts these SMILES might be utilised more...

In [15]:
df_train.head()

Unnamed: 0,generic_drug_name,cyp_strength_of_evidence,drug_class,adverse_drug_reactions
0,carbamazepine,strong,antiepileptics,"constipation^^, leucopenia^^, dizziness^^, som..."
3,imatinib,strong,tyrosine_kinase_inhibitor,"rash^^, diarrhea^^, abdominal_pain^^, constipa..."
4,ibrutinib,strong,tyrosine_kinase_inhibitor,"hypertension^^, atrial_fibrillation^^, sinus_t..."
5,neratinib,strong,tyrosine_kinase_inhibitor,"diarrhea^^, abdominal_pain^^, stomatitis^^, dy..."
6,esomeprazole,strong,proton_pump_inhibitors,"headache^^, flatulence^^, dizziness^, somnolen..."


In [16]:
## Separate df_train into X_train & y_train, then separate df_test in X_test & y_test

# Use scikit_learn's train_test_split() on df_train - to get X_train, y_train --> no need for this I think...

## NOTE: this step may be integrated with one-hot encoding and vector embeddings!

#y_train = df_train[["adverse_drug_reactions"]]
#y_train

#y_test = df_test[["adverse_drug_reactions"]]
#y_test

Convert the X & y variables into one-hot encodings or vector embeddings, and set up X_train, y_train, X_test, y_test

In [17]:
## X_train
## 1. convert "cyp_strength_of_evidence" column into one-hot encoding
from torch.nn.functional import one_hot

# Assigning to df_train["column_name"] triggers a SettingWithCopyWarning, as df_train is a slice of df:
# "A value is trying to be set on a copy of a slice from a DataFrame.
# Try using .loc[row_indexer,col_indexer] = value instead"
# ref: https://pandas.pydata.org/pandas-docs/stable/user_guide/copy_on_write.html#copy-on-write-cow
# Enable copy-on-write globally to remove the warning
pd.options.mode.copy_on_write = True

# Replace CYP strength labels with numbers
# a useful thread to solve downcasting issue in pd.replace() - https://github.com/pandas-dev/pandas/issues/57734
with pd.option_context('future.no_silent_downcasting', True):
    df_train["cyp_strength_of_evidence"] = df_train["cyp_strength_of_evidence"].replace({"strong": 1, "mod": 2}).infer_objects()
    df_test["cyp_strength_of_evidence"] = df_test["cyp_strength_of_evidence"].replace({"strong": 1, "mod": 2}).infer_objects()

In [18]:
# Get total number of CYP strengths in df
total_cyp_str_train = len(set(df_train["cyp_strength_of_evidence"]))

# note: if using df_train["cyp_strength_of_evidence"].values then this leads to non-writable tensors
# adding copy() is one option e.g.
# cyp_str_encoded = one_hot(torch.from_numpy(df_train["cyp_strength_of_evidence"].values.copy()) % total_cyp_str)

# UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. 
# This means writing to this tensor will result in undefined behavior. 
# You may want to copy the array to protect its data or make it writable before converting it to a tensor. 
# This type of warning will be suppressed for the rest of this program. 
# (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:212.)

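array_train = df_train["cyp_strength_of_evidence"].to_numpy()
# make the numpy array writeable, otherwise it'll trigger a user warning (as shown above)
array_train.flags.writeable = True

# The modulo maps the labels onto valid class indices: strong (1) -> 1, mod (2) -> 0,
# so strong is encoded as [0, 1] and mod as [1, 0]
cyp_str_train_t = one_hot(torch.from_numpy(array_train) % total_cyp_str_train)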
cyp_str_train_t

tensor([[0, 1],
        [0, 1],
        [0, 1],
        [0, 1],
        [0, 1],
        [0, 1],
        [0, 1],
        [0, 1],
        [0, 1],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0]])
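
The modulo approach ties the width of the encoding to how many distinct values happen to be present in a split. As a hedged alternative, the pure-Python sketch below fixes the category order up front so that the train and test encodings always share the same columns. The names `CATEGORIES`, `INDEX` and `one_hot_rows` are hypothetical and exist only for this illustration.

```python
# Fixed category order shared by train and test (an assumption for this sketch)
CATEGORIES = ["strong", "mod"]
INDEX = {label: i for i, label in enumerate(CATEGORIES)}

def one_hot_rows(labels):
    """Return one row per label with a 1 in the column of its category."""
    rows = []
    for label in labels:
        row = [0] * len(CATEGORIES)
        row[INDEX[label]] = 1
        rows.append(row)
    return rows

# A split that happens to contain only one category still gets two columns
encoded = one_hot_rows(["strong", "strong"])
print(encoded)  # [[1, 0], [1, 0]]
```

With PyTorch, a comparable effect can be had by passing `num_classes` to `torch.nn.functional.one_hot`, which fixes the width regardless of which values occur in the input.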

In [19]:
## see separate scripts used previously e.g. words_tensors.py 
## or Tensors_for_adrs_interactive.py to show step-by-step conversions from words to PyTorch tensors

# 2. Convert "adverse_drug_reactions" column into embeddings
# Save every row of the ADRs column as a list of strings
adr_str_train = df_train["adverse_drug_reactions"].tolist()
# Join the separate rows of strings into one complete string
adr_string_train = ",".join(adr_str_train)
# Convert all ADRs into Torch tensors using words_tensors.py
from words_tensors import words_tensors
adr_train_t = words_tensors(adr_string_train)
adr_train_t

tensor([[-1.5256e+00, -7.5023e-01],
        [-6.5398e-01, -1.6095e+00],
        [-1.0017e-01, -6.0919e-01],
        ...,
        [ 1.7133e+00, -1.7943e+00],
        [-1.3579e+00, -9.6113e-01],
        [ 1.1118e-03, -8.3201e-01]], grad_fn=<EmbeddingBackward0>)
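
For readers without the helper script, the general route from ADR strings to embedding inputs can be sketched as below: split the text into tokens, then map each distinct token to an integer index. This is an assumption about the kind of steps involved, not a description of what `words_tensors` actually does. The two rows are the visible starts of ADR strings from the tables above (truncated in the output), and the `^` markers are kept as part of each token.

```python
# Visible start of two ADR strings from the tables above (truncated in the output)
rows = [
    "dizziness^^, somnolence^^, sedation^",
    "rash^^, diarrhea^^, abdominal_pain^^",
]

# Split each row on commas and strip whitespace to get one token per ADR
tokens = [adr.strip() for row in rows for adr in row.split(",")]

# Map each distinct token to an integer index (sorted for reproducibility)
vocab = {token: i for i, token in enumerate(sorted(set(tokens)))}
indices = [vocab[token] for token in tokens]

print(tokens)
print(indices)
```

Integer indices of this kind are what an embedding layer such as `torch.nn.Embedding` takes as input, returning one dense vector per index.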

In [20]:
# 3. Convert "drug_class" column into one-hot encoding or embeddings instead (?more efficient)
## if choosing to represent drug_class as one-hot encoding, will need to convert each of the 20 drug classes into digits first
## opting for embedding first

# total number of drug classes in df = 20
# len_d_class = len(set(df["drug_class"]))
# len_d_class

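dc_str_train = df_train["drug_class"].tolist()
dc_string_train = ",".join(dc_str_train)
# NOTE: the result below has a single row (shape [1, 2]) rather than one row per drug class,
# which suggests the ","-joined string may be treated as one token; possibly the separator
# needs to match the one used within the ADR strings (", ")
dc_train_t = words_tensors(dc_string_train)
dc_train_t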

tensor([[-0.0098,  0.2955]], grad_fn=<EmbeddingBackward0>)

In [21]:
adr_train_t.shape

torch.Size([762, 2])

In [22]:
dc_train_t.shape

torch.Size([1, 2])

In [23]:
cyp_str_train_t.shape

torch.Size([21, 2])

In [24]:
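# Concatenate adr tensors, drug class tensors and cyp strength tensors as X_train
# NOTE: concatenating along dimension 0 stacks the rows (762 + 1 + 21 = 784 rows of 2 columns),
# so X_train does not yet have one row per drug
X_train = torch.cat([adr_train_t, dc_train_t, cyp_str_train_t], 0).float()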
X_train

tensor([[-1.5256, -0.7502],
        [-0.6540, -1.6095],
        [-0.1002, -0.6092],
        ...,
        [ 1.0000,  0.0000],
        [ 1.0000,  0.0000],
        [ 1.0000,  0.0000]], grad_fn=<CatBackward0>)
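
The three tensors have 762, 1 and 21 rows, so stacking them row-wise gives 784 rows that no longer correspond to individual drugs. One option, offered as an assumption rather than the approach settled on here, is to pool each drug's ADR embeddings into a single vector and then join the features column-wise, giving one row per drug. The sketch uses made-up numbers and hypothetical names.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Made-up 2-dimensional ADR embeddings for two drugs with different ADR counts
adr_embeddings = {
    "drug_a": [[1.0, 3.0], [3.0, 5.0]],
    "drug_b": [[0.0, 2.0], [2.0, 2.0], [4.0, 2.0]],
}
# Made-up one-hot CYP strength per drug
cyp_one_hot = {"drug_a": [1, 0], "drug_b": [0, 1]}

# One feature row per drug: pooled ADR embedding followed by the one-hot columns
features = {
    drug: mean_pool(vectors) + cyp_one_hot[drug]
    for drug, vectors in adr_embeddings.items()
}
print(features)
```

In PyTorch this would correspond to pooling first and then concatenating along dimension 1, which yields a tensor with one row per drug and one column per feature.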

In [25]:
## X_test
## 1. Convert cyp strength into one-hot encodings
total_cyp_str_test = len(set(df_test["cyp_strength_of_evidence"]))
array_test = df_test["cyp_strength_of_evidence"].to_numpy()
array_test.flags.writeable = True
cyp_str_test_t = one_hot(torch.from_numpy(array_test) % total_cyp_str_test)

## 2. Convert "adverse_drug_reactions" column into embeddings
adr_str_test = df_test["adverse_drug_reactions"].tolist()
adr_string_test = ",".join(adr_str_test)
adr_test_t = words_tensors(adr_string_test)

## 3. Convert "drug_class" column into embeddings
dc_str_test = df_test["drug_class"].tolist()
dc_string_test = ",".join(dc_str_test)
dc_test_t = words_tensors(dc_string_test)

# Concatenate adr tensors, drug class tensors and cyp strength tensors as X_test
# NOTE: the order here (cyp strength first) differs from the order used for X_train (adr first)
X_test = torch.cat([cyp_str_test_t, adr_test_t, dc_test_t], 0).float()
X_test

tensor([[ 0.0000,  1.0000],
        [ 0.0000,  1.0000],
        [ 0.0000,  1.0000],
        [ 0.0000,  1.0000],
        [ 1.0000,  0.0000],
        [ 1.0000,  0.0000],
        [-0.3661, -0.1123],
        [ 0.1401, -1.0772],
        [ 1.1557, -1.0234],
        [ 0.9199,  1.3019],
        [ 0.4592,  0.0879],
        [ 0.9443, -0.7599],
        [ 1.6396, -1.9154],
        [-1.8657, -0.6708],
        [ 0.1195,  1.6996],
        [ 0.6387, -1.0975],
        [-2.1762, -1.2711],
        [ 0.3710, -0.2658],
        [ 0.0103, -1.3901],
        [-0.0589,  0.1576],
        [-1.2375,  1.1541],
        [ 0.8352,  1.7608],
        [ 1.0910, -0.1847],
        [ 0.0786, -0.8521],
        [ 0.5188,  0.0342],
        [ 1.0575,  0.1753],
        [-0.2596,  0.3957],
        [ 0.4421,  0.9448],
        [ 0.3388, -0.7186],
        [-0.5009,  0.7134],
        [ 0.1131, -1.4705],
        [-0.5454,  0.0985],
        [-0.0589,  0.1576],
        [-0.1218,  0.3274],
        [-1.0497,  0.2502],
        [-0.8610,  2

In [None]:
## y_train

In [26]:
#from torch.utils.data import TensorDataset, DataLoader

## Create a PyTorch dataset (reference code below)
# training_data = TensorDataset(X_train, y_train)
# torch.manual_seed(1)
# batch_size = 2

## Create a dataset loader - DataLoader (reference code below)
# train_dataloader = DataLoader(training_data, batch_size, shuffle = True)

In [27]:
## Set up a DNN regression model 

In [28]:
# May need to set up a class with a few different functions (possibly in separate .py scripts then run in notebook first)

* Structure-adverse drug reaction relationships: 
**ADRs <-> (dense vectors of real numbers) <-> 2D drug structures**

* Structure-activity relationships: 
**drug activities <-> 2D drug structures**

1. Building a NN model (?RNN or DNN initially), either to classify drugs in an ADRs dataset (?identify drugs in different therapeutic classes) or to predict ADRs of drugs (regression); whether to use classification or regression is still to be determined (likely current post)
- to infer possible drugs vs. ADRs relationships

2. 2D drug structures part (much further down the line as a separate post)
- graph neural networks (GNN - other variations also available): molecules as undirected graphs with atoms as nodes and bonds as edges, where the order in which the nodes are listed doesn't matter (i.e. they don't need to be in a particular order or sequence) 
OR 
- RNN that uses SMILES (NLP technique) -> tokenize SMILES strings -> build a vocabulary mapping tokens to indices -> convert the tokens into one-hot encodings
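
The SMILES route in the last bullet can be sketched in a few lines. The token pattern below is a deliberate simplification (bracket atoms, the two-letter halogens Br and Cl, then any single character) and is an assumption for illustration, not a complete SMILES grammar. The example string is the diazepam SMILES from the ChEMBL table earlier in the notebook.

```python
import re

# Simplified token pattern (an assumption, not a full SMILES grammar):
# bracket atoms, two-letter halogens, then any single character
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)

smiles = "CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21"  # diazepam
tokens = tokenize(smiles)

# Vocabulary: each distinct token mapped to an integer index
vocab = {token: i for i, token in enumerate(sorted(set(tokens)))}
indices = [vocab[token] for token in tokens]

# One-hot encoding: one row per token, one column per vocabulary entry
one_hot = [[1 if i == idx else 0 for i in range(len(vocab))] for idx in indices]

print(tokens)
print(len(vocab), len(one_hot))
```

Treating `Cl` as a single token matters here: splitting it into `C` and `l` would turn the chlorine into a carbon plus an invalid symbol.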