## Import Dependencies

### General
* Pandas
* NumPy
* Seaborn

### Datasets
* TDC Tox

### RDKit Modules
* AllChem
* rdMolDescriptors
* IPythonConsole
* Draw
* DataStructs
* Butina

In [7]:
import pandas as pd
import numpy as np
import seaborn as sn
#---------------------- Therapeutics Data Commons (TDC) data from https://tdcommons.ai/single_pred_tasks/tox/#dili-drug-induced-liver-injury
from tdc.single_pred import Tox
#---------------------- RDKit packages
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit import DataStructs
from rdkit.ML.Cluster import Butina

## Data cleaning

### Reading, converting to pandas

Read the TDC Tox DILI dataset and convert it to a pandas DataFrame.\
Rename the columns to be more human-readable.

In [8]:
# TODO: turn this into a function that loads multiple datasets and concatenates them

tox_data = Tox(name = 'DILI')

tox_df = tox_data.get_data()

tox_df.columns = ["X", "SMILES", "DILI? (bool)"]

Found local copy...
Loading...
Done!
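
The single-dataset cell above could be generalised along the lines of its leading comment. The following is a minimal sketch, not the notebook's own code: `load_tox_datasets`, its `loader` parameter, and the generic `Label` and `Dataset` column names are hypothetical, and it assumes every requested dataset comes back with the same three columns (ID, SMILES, label) as DILI.

```python
import pandas as pd


def _tdc_loader(name):
    # Imported lazily so the helper itself only depends on pandas.
    from tdc.single_pred import Tox
    return Tox(name=name).get_data()


def load_tox_datasets(names, loader=_tdc_loader):
    """Load several datasets and stack them into one dataframe.

    `loader` is a hypothetical hook: any callable that maps a dataset
    name to a three-column dataframe (ID, SMILES, label).
    """
    frames = []
    for name in names:
        df = loader(name).copy()
        # Assumed generic names; the notebook uses "DILI? (bool)" for the label.
        df.columns = ["X", "SMILES", "Label"]
        # Remember which dataset each row came from.
        df["Dataset"] = name
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

With TDC installed, `load_tox_datasets(["DILI", "AMES"])` would be a typical call. Some Tox datasets are regression rather than classification tasks, so the `Label` column is only comparable across datasets of the same kind.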


### Append skeleton structures

Generate an RDKit molecule from each SMILES string and append it to the dataset as a column, so that the skeletal structure can be displayed.

In [None]:

# Get RDKit molecular structure
from rdkit.Chem import PandasTools
PandasTools.AddMoleculeColumnToFrame(tox_df, 'SMILES', 'Structure')

# Display the RDKit molecule skeletons - render via IPython's HTML, because the default pandas display doesn't show the skeletons in the HTML table
from IPython.display import HTML
HTML(tox_df.to_html())

## Get Fingerprints

Define the function `computeMorganFP`, which:
* initialises an empty NumPy array of length `nBits`
* generates the Morgan fingerprint of a given molecule as a bit vector and copies it into the array
* returns the array

Run `computeMorganFP` on each molecule in the dataframe and append the results as a new column.

Use `shape` to confirm success: the first number should equal the dataframe length.
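
The shape check described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's own code: `fingerprints_to_matrix` is a hypothetical helper, and the toy 8-bit arrays stand in for real Morgan fingerprints so the example runs without RDKit.

```python
import numpy as np
import pandas as pd


def fingerprints_to_matrix(fp_column):
    """Stack a Series of equal-length 1-D arrays into an (n_rows, n_bits) matrix."""
    return np.stack(fp_column.to_list())


# Toy stand-in for a fingerprint column: three "molecules", 8 bits each.
toy = pd.DataFrame(
    {"m3fp": [np.zeros(8, int), np.ones(8, int), np.arange(8) % 2]}
)

fp_matrix = fingerprints_to_matrix(toy["m3fp"])

# The first number of the shape should equal the dataframe length.
assert fp_matrix.shape[0] == len(toy)
print(fp_matrix.shape)  # (3, 8)
```

The second number of the shape is the fingerprint length, so for the 2048-bit fingerprints used here it should read 2048.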


In [10]:
from rdkit import DataStructs

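def computeMorganFP(mol, depth = 2, nBits = 2048):
    # Return the Morgan fingerprint of `mol` (radius `depth`, `nBits` bits) as a 0/1 NumPy array
    a = np.zeros(nBits, int)
    # ConvertToNumpyArray copies the bit vector into `a` in place
    DataStructs.ConvertToNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol,depth,nBits),a)
    return a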


tox_df["m3fp"] = tox_df["Structure"].map(computeMorganFP)

In [16]:
morgan_df = tox_df["m3fp"].apply(pd.Series)

morgan_df.insert(2048, "DILI?", tox_df["DILI? (bool)"].astype(int))


print(morgan_df)

     0  1  2  3  4  5  6  7  8  9  ...  2039  2040  2041  2042  2043  2044  \
0    0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
1    0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
2    0  1  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
3    0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
4    0  1  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
..  .. .. .. .. .. .. .. .. .. ..  ...   ...   ...   ...   ...   ...   ...   
470  0  1  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
471  0  1  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
472  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
473  0  1  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
474  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   

     2045  2046  2047  DILI? 

## Thanks To

https://www.youtube.com/watch?v=-oHqQBUyrQ0