## Function testing

### **Load dataset**

##### Available datasets:

| No. | Dataset name | Task type |
|-----|--------------|-----------|
| 1 | BACE potency prediction | classification |
| 2 | BACE pIC50 prediction | regression |
| 3 | HIV potency prediction | classification |
| 4 | Molecule solubility prediction | regression |
| 5 | Molecule lipophilicity prediction | regression |

*More datasets are coming...*


In [3]:
from helper.load_dataset import load_bace_classification, load_bace_regression, load_esol, load_hiv, load_lipo, load_hdac2, load_fgfr1

In [2]:
# Example of loading dataset

print('BACE potency classification task')
bace_class = load_bace_classification()
print(bace_class.head())


BACE potency classification task
                                              SMILES  Class
0  O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2c...      1
1  Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(...      1
2  S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...      1
3  S1(=O)(=O)C[C@@H](Cc2cc(O[C@H](COCC)C(F)(F)F)c...      1
4  S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...      1


### **Split dataset**

* Default split sizes: 10% test for a train-test split, or 10% validation and 10% test for a train-validation-test split

Available split types:

- **Random splitting**
- **Scaffold splitting** (assigns molecules with less frequent scaffolds to the test set, to mimic real-world application)
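
The idea behind scaffold splitting can be sketched as below. This is only an illustration under stated assumptions, not the implementation in `helper.preprocess`: the function names `scaffold_split` and `murcko_scaffold` are hypothetical, and Bemis-Murcko scaffolds from RDKit are assumed as the scaffold definition. Molecules are grouped by scaffold, and the smallest groups are sent to the test set until the requested fraction is reached.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple


def murcko_scaffold(smiles: str) -> str:
    """Bemis-Murcko scaffold of a molecule as a SMILES string (needs RDKit)."""
    # Lazy import so the splitting logic itself has no RDKit dependency.
    from rdkit.Chem.Scaffolds import MurckoScaffold
    return MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles)


def scaffold_split(
    smiles: Sequence[str],
    scaffold_fn: Callable[[str], str],
    test_frac: float = 0.1,
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with no scaffold shared between them."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles):
        groups[scaffold_fn(smi)].append(idx)

    # Visit the rarest scaffolds first so they are the ones sent to the test set.
    ordered = sorted(groups.values(), key=lambda g: (len(g), g[0]))
    n_test = round(test_frac * len(smiles))

    train_idx, test_idx = [], []
    for group in ordered:
        # A scaffold group is never divided: it goes wholly to one side.
        if len(test_idx) + len(group) <= n_test:
            test_idx.extend(group)
        else:
            train_idx.extend(group)
    return sorted(train_idx), sorted(test_idx)


# Toy demonstration with a stand-in scaffold function (the first character),
# so the example runs without RDKit. Pass `murcko_scaffold` for real molecules.
toy = ["A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2", "C1", "A7"]
train_idx, test_idx = scaffold_split(toy, scaffold_fn=lambda s: s[0], test_frac=0.2)
print(train_idx, test_idx)
```

Because whole scaffold groups are kept together in this sketch, the test set can end up smaller than the requested fraction. The helper in this notebook may handle that case differently.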

In [3]:
from helper.preprocess import split_train_test, split_train_valid_test

In [4]:
# Train - test split
print('Random Splitting')
print('Train - Test')
bace_train, bace_test = split_train_test(bace_class)
print(f'Train size: {len(bace_train)} | Test size: {len(bace_test)}')

print('Train - Valid - Test')
bace_train, bace_valid, bace_test = split_train_valid_test(bace_class)
print(f'Train size: {len(bace_train)} | Valid size: {len(bace_valid)} |Test size: {len(bace_test)}')

print('________________\n')

print('Scaffold Splitting')
print('Train - Test')
bace_train, bace_test = split_train_test(bace_class, type='scaffold')
print(f'Train size: {len(bace_train)} | Test size: {len(bace_test)}')
print('Train - Valid - Test')
bace_train, bace_valid, bace_test = split_train_valid_test(bace_class, type='scaffold')
print(f'Train size: {len(bace_train)} | Valid size: {len(bace_valid)} |Test size: {len(bace_test)}')


Random Splitting
Train - Test
Train size: 1361 | Test size: 152
Train - Valid - Test
Train size: 1190 | Valid size: 171 |Test size: 152
________________

Scaffold Splitting
Train - Test
Train size: 1361 | Test size: 152
Train - Valid - Test
Train size: 1210 | Valid size: 151 |Test size: 152


### **Generate fingerprint/descriptor**

Available fingerprints/descriptors:

- ECFP (Extended-connectivity fingerprint)
- MACCS (Substructure fingerprint)
- RDKit Descriptors (0D molecular descriptors)
- eRG (extended reduced graph)
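
Each featurizer maps one SMILES string to one fixed-length vector, so a whole dataset can be turned into a feature matrix by applying it row by row. The sketch below shows one way to do that. The function `featurize_all` and the toy featurizer are hypothetical and only stand in for the real helpers, and the sketch assumes that a featurizer signals failure by raising an exception or returning `None`.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def featurize_all(
    smiles_list: Sequence[str],
    featurizer: Callable[[str], Optional[Sequence[float]]],
) -> Tuple[List[List[float]], List[int]]:
    """Apply `featurizer` to each SMILES; return (feature rows, indices kept)."""
    rows: List[List[float]] = []
    kept: List[int] = []
    expected_len = None
    for idx, smi in enumerate(smiles_list):
        try:
            vec = featurizer(smi)
        except Exception:
            vec = None  # treat a molecule that fails to featurize as missing
        if vec is None:
            continue
        row = [float(v) for v in vec]
        if expected_len is None:
            expected_len = len(row)
        elif len(row) != expected_len:
            raise ValueError(f"row {idx} has length {len(row)}, expected {expected_len}")
        rows.append(row)
        kept.append(idx)
    return rows, kept


def toy_featurizer(smi: str) -> List[int]:
    """Stand-in for a real fingerprint: counts of three characters."""
    if not smi:
        raise ValueError("empty SMILES")
    return [smi.count("c"), smi.count("C"), smi.count("O")]


rows, kept = featurize_all(["c1ccccc1", "", "CCO"], toy_featurizer)
print(rows)  # [[6.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
print(kept)  # [0, 2]
```

Returning the kept indices alongside the rows makes it possible to drop the matching labels for any molecule that was skipped.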

In [5]:
from helper.features import smi_ecfp, smi_erg, smi_maccs, smi_rdkitDesc

In [13]:
test_smi = 'c1ccccc1'
print('SMILES to ECFP, default ECFP4, 1024 bits, can change it with radius and n_bits arguments')

test_ecfp = smi_ecfp(test_smi, radius=2, n_bits=1024)  # defaults: radius=2 (ECFP4), 1024 bits
print(f'ECFP arrays: {test_ecfp} \nLength: {len(test_ecfp)}')

print('_____________\n')

print('SMILES to MACCS')
test_maccs = smi_maccs(test_smi)
print(f'MACCS arrays: {test_maccs} \nLength: {len(test_maccs)}')


print('_____________\n')

print('SMILES to RDKit Descriptors')
test_rdkitdesc = smi_rdkitDesc(test_smi)
print(f'RDKit Descriptors arrays: {test_rdkitdesc} \nLength: {len(test_rdkitdesc)}')


print('_____________\n')

print('SMILES to eRG')
test_erg = smi_erg(test_smi)
print(f'ERG arrays: {test_erg} \nLength: {len(test_erg)}')

SMILES to ECFP, default ECFP4, 1024 bits, can change it with radius and n_bits arguments
ECFP arrays: [0. 0. 0. ... 0. 0. 0.] 
Length: 1024
_____________

SMILES to MACCS
MACCS arrays: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0] 
Length: 167
_____________

SMILES to RDKit Descriptors
RDKit Descriptors arrays: [2.0, 2.0, 2.0, 2.0, 0.4426283718993647, 8.0, 78.11399999999999, 72.06599999999999, 78.046950192, 30, 0, -0.062268570782092456, -0.062268570782092456, 0.062268570782092456, 0.062268570782092456, 0.3333333333333333, 0.5, 0.666666666

### **Generate graph features**

The `StructureEncoder` class encodes a molecule as a graph with node and edge features. More features and alternative feature sets may be added later.
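
To make the node/edge layout concrete, here is a minimal sketch of a graph encoding written in plain Python. It is not the `StructureEncoder` implementation: the feature sets `ELEMENTS` and `BOND_TYPES` and the function `encode_graph` are hypothetical and far smaller than the real encoder's features. The sketch assumes the common convention, used by PyTorch Geometric, of storing each bond as two directed edges.

```python
from typing import List, Sequence, Tuple


def one_hot(value, choices: Sequence) -> List[float]:
    """One-hot encode `value`; the last slot catches anything not in `choices`."""
    vec = [0.0] * (len(choices) + 1)
    idx = choices.index(value) if value in choices else len(choices)
    vec[idx] = 1.0
    return vec


# Hypothetical feature sets, chosen only for illustration.
ELEMENTS = ["C", "N", "O"]
BOND_TYPES = ["single", "double", "aromatic"]


def encode_graph(
    atoms: Sequence[Tuple[str, bool]],
    bonds: Sequence[Tuple[int, int, str]],
):
    """atoms: (element, is_aromatic) pairs; bonds: (i, j, bond_type) triples."""
    # Node features: one-hot element plus an aromaticity flag.
    x = [one_hot(el, ELEMENTS) + [1.0 if arom else 0.0] for el, arom in atoms]
    edge_index: List[List[int]] = [[], []]
    edge_attr: List[List[float]] = []
    for i, j, btype in bonds:
        feat = one_hot(btype, BOND_TYPES)
        # Store every bond in both directions (i -> j and j -> i).
        for src, dst in ((i, j), (j, i)):
            edge_index[0].append(src)
            edge_index[1].append(dst)
            edge_attr.append(feat)
    return x, edge_index, edge_attr


# Benzene written out by hand: six aromatic carbons in a ring.
benzene_atoms = [("C", True)] * 6
benzene_bonds = [(i, (i + 1) % 6, "aromatic") for i in range(6)]
x, edge_index, edge_attr = encode_graph(benzene_atoms, benzene_bonds)
print(len(x), len(x[0]))                  # 6 5
print(len(edge_attr), len(edge_attr[0]))  # 12 4
```

All six benzene atoms receive identical feature rows here, which matches the repeated rows in the node feature tensor printed by the notebook.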

In [14]:
from helper.graphfeat import StructureEncoder

In [19]:
graph_generation = StructureEncoder()
graph = graph_generation.encoding_structure('c1ccccc1', 1)
print('Node features:')
print(f'Number of node features: {graph.num_node_features}')
print(graph.x)

print('Edge features:')
print(f'Number of edge features: {graph.num_edge_features}')
print(graph.edge_attr)

Node features:
Number of node features: 47
tensor([[ 0.0000,  1.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  1.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          1.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  1.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  1.0000,  0.0000,  0.0000,  0.5000,
          1.0000,  1.0000,  1.0000,  0.0000,  0.0000,  0.0000, -0.0623,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  1.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  1.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          1.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  1.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  1.0000,  0.0000,  0.0000,  0.5000,
          1.0000,  1.0000,  1.0000,  0.0000,  0.0000,  0.0000, -0.0623,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
 