# Using software (MOE or Dragon)

If you use software such as MOE or Dragon to generate descriptors, load the compounds (stored in a .csv file) into that software.

# Using rdkit package

To generate descriptors with RDKit, first load the compounds as RDKit molecule objects. The following example uses the file "example.csv":

In [19]:
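from rdkit import Chem
import pandas as pd
import os

# Build the path to Datasets/example.csv relative to the working directory
currentDirectory = os.getcwd()
d = os.path.join(currentDirectory, "Datasets", "example.csv")
print(d)
# Use the first CSV column (index_id) as the row index
dataset = pd.read_csv(d, index_col = 0)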

C:\Users\Lin\Desktop\QSAR_models\Datasets\example.csv


In [3]:
dataset.head()

| index_id | Toxicity to environmental bacteria (EPA Microtox test), -log10 of Conc.(mg/kg) | SMILES |
|---|---|---|
| 1 | 6.343381 | ClC(Cl)(Cl)c1ccc(cc1)C#N |
| 2 | 6.173821 | S=C=NCc1ccccc1 |
| 3 | 5.969274 | CCCCCCCCCCCCO |
| 4 | 5.949961 | c1ccc2c(c1)ccc3ccccc32 |
| 5 | 5.766061 | ClCc1ccc(CCl)cc1 |


In [6]:
dataset.shape

(899, 2)

In [4]:
# Convert each SMILES string to an RDKit molecule (None if the string cannot be parsed)
molecules = [Chem.MolFromSmiles(smiles) for smiles in dataset.SMILES]
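
`Chem.MolFromSmiles` returns `None` when a SMILES string cannot be parsed, so it can help to filter out such entries before computing descriptors. The following is a minimal sketch rather than the workflow used above: the three-item `smiles_list` (which includes a deliberately invalid string) and the choice of `MolWt`, `MolLogP`, and `TPSA` from `rdkit.Chem.Descriptors` are illustrative assumptions.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Illustrative input: two SMILES from the table above plus one invalid string
smiles_list = ["CCCCCCCCCCCCO", "S=C=NCc1ccccc1", "not_a_smiles"]

valid, invalid = [], []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)  # None when parsing fails (RDKit also logs an error)
    if mol is None:
        invalid.append(smi)
    else:
        valid.append((smi, mol))

# Compute a few example descriptors for the molecules that parsed
descriptor_table = pd.DataFrame(
    [
        {
            "SMILES": smi,
            "MolWt": Descriptors.MolWt(mol),
            "MolLogP": Descriptors.MolLogP(mol),
            "TPSA": Descriptors.TPSA(mol),
        }
        for smi, mol in valid
    ]
)

print(descriptor_table)
print("Could not parse:", invalid)
```

Applied to the real data, the same check could be run over `dataset.SMILES` so that unparsable rows are reported before any descriptor calculation.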

# Dataset splitting

a. Randomly split the dataset into a training set (80%) and a test set (20%).

b. For a binary dataset, make sure the whole dataset has balanced endpoint values (number of inactives = number of actives) before splitting.
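
One possible way to carry out step b is to downsample the majority class and then split with stratification so that both subsets stay balanced. This is only a sketch under stated assumptions: the toy DataFrame, the column name `active`, and the choice of downsampling (rather than, say, oversampling) are hypothetical and not part of the example dataset above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical binary dataset: 10 actives (1) and 20 inactives (0)
toy = pd.DataFrame(
    {
        "SMILES": ["C" * (i + 1) + "O" for i in range(30)],
        "active": [1] * 10 + [0] * 20,
    }
)

# Downsample every class to the size of the smallest class
n_minority = toy["active"].value_counts().min()
balanced = toy.groupby("active").sample(n=n_minority, random_state=42)

# Stratify on the endpoint so train and test keep the 1:1 class ratio
train_bal, test_bal = train_test_split(
    balanced, test_size=0.2, random_state=42, stratify=balanced["active"]
)

print(balanced["active"].value_counts())
print(train_bal.shape, test_bal.shape)
```

Downsampling discards data from the larger class, so whether it is appropriate depends on how much data is available.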

#### Python code for splitting the dataset

In [7]:
import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(dataset, 0.2)

In [9]:
print(train_set.shape, test_set.shape)

(720, 2) (179, 2)
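
Because `np.random.permutation` draws from NumPy's global random state, `split_train_test` produces a different split on every run unless a seed is set. A minimal sketch of a reproducible variant follows; the function name `split_train_test_seeded`, the `seed` parameter, and the toy DataFrame are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def split_train_test_seeded(data, test_ratio, seed=42):
    """Shuffle row positions with a seeded generator, then slice off the test set."""
    rng = np.random.default_rng(seed)
    shuffled_indices = rng.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# Toy data standing in for the real dataset
toy = pd.DataFrame({"value": range(100)})

train_a, test_a = split_train_test_seeded(toy, 0.2)
train_b, test_b = split_train_test_seeded(toy, 0.2)

# The same seed gives the same test rows on both calls
print(test_a.index.equals(test_b.index))
```

Using a local `default_rng` generator instead of a global seed keeps the split reproducible without affecting other random draws in the notebook.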


In [8]:
# Alternatively, use scikit-learn's train_test_split
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(dataset, test_size=0.2, random_state=42)

In [10]:
print(train_set.shape, test_set.shape)

(719, 2) (180, 2)
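
The two approaches can differ by one row when the dataset size times the test ratio is not a whole number: `int()` truncates, whereas `train_test_split`, as far as I understand its behavior, rounds a fractional test size up. The sketch below checks this on a toy DataFrame with the same number of rows (899) as the example dataset; the DataFrame itself is a placeholder.

```python
import math
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 899
toy = pd.DataFrame({"value": range(n_rows)})

# Manual approach: int() truncates 179.8 down to 179
manual_test_size = int(n_rows * 0.2)
manual_train_size = n_rows - manual_test_size

# scikit-learn: a fractional test size is rounded up
train_sk, test_sk = train_test_split(toy, test_size=0.2, random_state=42)

print(manual_train_size, manual_test_size)   # 720 179
print(len(train_sk), len(test_sk))           # 719 180
print(math.ceil(n_rows * 0.2))               # 180
```

The one-row difference rarely matters in practice, but it explains why the two methods can report slightly different shapes for the same data.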
