# Image Classification with DNN

## DATASETS:
(a) Carbonic Anhydrase II (CHEMBL205), a protein lyase,  
(b) Cyclin-dependent kinase 2 (CHEMBL301), a protein kinase,  
(c) ether-a-go-go-related gene potassium channel 1 (HERG) (CHEMBL240), a voltage-gated ion channel,  
(d) Dopamine D4 receptor (CHEMBL219), a monoamine GPCR,  
(e) Coagulation factor X (CHEMBL244), a serine protease,  
(f) Cannabinoid CB1 receptor (CHEMBL218), a lipid-like GPCR and  
(g) Cytochrome P450 19A1 (CHEMBL1978), a cytochrome P450.  
The activity classes were selected based on data availability and as representatives of therapeutically important target classes or as anti-targets.

In [1]:
# Import
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
path = Path('../dataset/13321_2017_226_MOESM1_ESM/')

In [3]:
list(path.iterdir())

[PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL205'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/.ipynb_checkpoints'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL301'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL218'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL219'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL244'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/mol_images'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL1978'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL240')]
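
The listing mixes the seven target folders with other entries (`.ipynb_checkpoints`, `mol_images`). As a minimal sketch, the target folders could be picked out by name instead of being typed by hand, assuming every target folder name starts with `CHEMBL`. The helper `find_target_dirs` and the temporary directory tree are illustrative only and are not part of the original notebook.

```python
import tempfile
from pathlib import Path


def find_target_dirs(root):
    """Return the sorted names of sub-directories whose name starts with 'CHEMBL'."""
    return sorted(p.name for p in Path(root).iterdir()
                  if p.is_dir() and p.name.startswith('CHEMBL'))


# Build a throwaway tree that mimics part of the listing above (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    for name in ['CHEMBL205', '.ipynb_checkpoints', 'CHEMBL301', 'mol_images', 'CHEMBL240']:
        (Path(tmp) / name).mkdir()
    found = find_target_dirs(tmp)

print(found)
```

On the real data this would be called as `find_target_dirs(path)`; the result is sorted alphabetically, so the order may differ from a hand-written list.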

# Create train/validation splits

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
datasets = ['CHEMBL205','CHEMBL1978', 'CHEMBL301', 'CHEMBL218', 
            'CHEMBL240', 'CHEMBL219', 
            'CHEMBL244']

In [6]:
DATA = path
DATA.mkdir(exist_ok=True)
PATH = DATA/datasets[2]
len(list(PATH.iterdir()))

4

In [9]:
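for dataset in datasets:
    DATASET = DATA/dataset
    df = pd.read_csv(DATASET/f'{dataset}_cl.csv')
    # Stratified 80/20 split of the row index, so both subsets keep the same Activity ratio
    x_train, x_valid = train_test_split(df.index, test_size=0.2, random_state=666, stratify=df['Activity'])
    # Flag each row as training (False) or validation (True)
    df.loc[x_train, 'is_valid'] = False
    df.loc[x_valid, 'is_valid'] = True
    df = df.reset_index(drop=True)
    # Save the labelled table next to the original file
    df.to_csv(DATASET/f'{dataset}_train_valid.csv', index=False)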
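
To see what the `stratify` argument does, the same call pattern can be run on a small made-up table. This is a sketch under stated assumptions: the `toy` frame (100 rows, 20% active) is hypothetical and only stands in for a `*_cl.csv` file, and the `is_valid` column is initialised to `False` first, which is a slight variation on the loop.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for a `*_cl.csv` file: 100 rows, 20 active and 80 inactive (hypothetical data).
toy = pd.DataFrame({
    'CID': [f'mol{i}' for i in range(100)],
    'Activity': [1] * 20 + [0] * 80,
})

# Same call pattern as the loop: split the row index, stratified on Activity.
train_idx, valid_idx = train_test_split(
    toy.index, test_size=0.2, random_state=666, stratify=toy['Activity'])

# Mark validation rows; everything else stays a training row.
toy['is_valid'] = False
toy.loc[valid_idx, 'is_valid'] = True

# Fraction of actives in each subset.
valid_rate = toy.loc[toy['is_valid'], 'Activity'].mean()
train_rate = toy.loc[~toy['is_valid'], 'Activity'].mean()
print(f'valid rows: {toy["is_valid"].sum()}, active rate valid: {valid_rate}, train: {train_rate}')
```

With these round numbers both subsets keep the overall 20% active rate. On real data the two rates will usually agree only approximately, because class counts rarely divide evenly.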

In [10]:
dataset = datasets[2]

In [11]:
df = pd.read_csv(DATA/dataset/f'{dataset}_train_valid.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7755 entries, 0 to 7754
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   CID       7755 non-null   object
 1   SMILES    7755 non-null   object
 2   Activity  7755 non-null   int64 
 3   is_valid  7755 non-null   bool  
dtypes: bool(1), int64(1), object(2)
memory usage: 189.5+ KB


In [12]:
df = pd.read_csv(DATA/dataset/f'{dataset}_cl.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7755 entries, 0 to 7754
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   CID       7755 non-null   object
 1   SMILES    7755 non-null   object
 2   Activity  7755 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 181.9+ KB
