## Import Dependencies

### General
* Pandas
* Numpy
* Seaborn

### Datasets
* TDC Tox

### RDKit Modules
* AllChem
* rdMolDescriptors
* IPythonConsole
* Draw
* DataStructs
* Butina

In [2]:
import pandas as pd
import numpy as np
import seaborn as sn
#---------------------- Therapeutics Data Commons (TDC) data from https://tdcommons.ai/single_pred_tasks/tox/#dili-drug-induced-liver-injury
from tdc.single_pred import Tox
#---------------------- RDKit packages
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit import DataStructs
from rdkit.ML.Cluster import Butina

## Data cleaning

### Reading, converting to pandas

Read the TDC Tox DILI dataset and convert it to a pandas dataframe.\
Rename the columns to be more human-readable.

In [None]:
# TODO: turn this into a function that loads several datasets and concatenates them

tox_data = Tox(name = 'DILI')

tox_df = tox_data.get_data()

tox_df.columns = ["X", "SMILES", "DILI? (bool)"]
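
The comment in the cell above suggests wrapping the loading step in a function that handles several datasets. A minimal sketch of one way to do that is shown here. The function name `load_tox_datasets`, the `loader` argument, and the `Label` and `Dataset` column names are hypothetical, and the sketch assumes every dataset comes back with three columns in the order ID, SMILES, label, as DILI does.

```python
import pandas as pd


def load_tox_datasets(names, loader):
    """Load several datasets and stack them into one dataframe.

    `loader` is any callable that maps a dataset name to a three-column
    dataframe (ID, SMILES, label). This is an assumption about the data,
    not something guaranteed for every TDC dataset.
    """
    frames = []
    for name in names:
        df = loader(name).copy()
        # Generic label name, since "DILI? (bool)" only fits one dataset
        df.columns = ["X", "SMILES", "Label"]
        # Record which dataset each row came from
        df["Dataset"] = name
        frames.append(df)
    # Stack the frames and renumber the rows from 0
    return pd.concat(frames, ignore_index=True)
```

With TDC it could be called as `load_tox_datasets(["DILI", "AMES"], lambda name: Tox(name=name).get_data())`. Passing the loader in as an argument keeps the function independent of the download step.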

### Append skeletal structures

Generate an RDKit molecule for each SMILES string and append it to the dataset as a column, where it is displayed as a skeletal diagram.

In [None]:

# Get RDKit molecular structure
from rdkit.Chem import PandasTools
PandasTools.AddMoleculeColumnToFrame(tox_df, 'SMILES', 'Structure')

# Display the skeletal structures: render the table through IPython's HTML, because the default pandas output does not show the diagrams
from IPython.display import HTML
HTML(tox_df.to_html())

## Get Fingerprints

Define the function `generate_fingerprints`:
* Initialise an empty list of Morgan fingerprints.
* For each molecule in the given column, generate its Morgan fingerprint (radius 2, 2048 bits) and append it to the list.
* Return the list as a NumPy array so that it can be checked with `shape`.

Run `generate_fingerprints` on the `Structure` column of the dataframe.

Use `shape` to confirm success: the first number should equal the dataframe length.


In [45]:
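def generate_fingerprints(data):
    """Return the Morgan fingerprints of the molecules in `data` as a NumPy array."""
    morgan_fingerprint_list = []

    for mol in data:
        # Morgan fingerprint with radius 2, folded into a 2048-bit vector
        morgan_fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        morgan_fingerprint_list.append(morgan_fingerprint)

    # One row per molecule, one column per bit
    return np.array(morgan_fingerprint_list)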


morgan_fingerprint_list = generate_fingerprints(tox_df.Structure[:])

morgan_fingerprint_list.shape

(475, 2048)
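
The function above would raise an error on any molecule that RDKit could not parse, since those are stored as `None`. The sketch below is one possible variant, not the approach used in this notebook. It assumes a recent RDKit release that ships `rdFingerprintGenerator`, and the name `fingerprints_to_array` is hypothetical. It skips `None` entries and asks the generator for NumPy arrays directly.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def fingerprints_to_array(mols, radius=2, n_bits=2048):
    """Return an (n_valid_molecules, n_bits) array of Morgan fingerprint bits."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    # Skip molecules that failed to parse (RDKit represents them as None)
    rows = [generator.GetFingerprintAsNumPy(mol) for mol in mols if mol is not None]
    return np.array(rows)


# Two small example molecules: ethanol and benzene
example_mols = [Chem.MolFromSmiles(smiles) for smiles in ("CCO", "c1ccccc1")]
fp_array = fingerprints_to_array(example_mols)
print(fp_array.shape)
```

Because unparseable molecules are dropped, the first number of the shape equals the count of valid molecules, which can be smaller than the dataframe length.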

## Thanks To

https://www.youtube.com/watch?v=-oHqQBUyrQ0