# Image Classification with DNN

## DATASETS:
(a) Carbonic Anhydrase II (CHEMBL205), a protein lyase,  
(b) Cyclin-dependent kinase 2 (CHEMBL301), a protein kinase,  
(c) ether-a-go-go-related gene potassium channel 1 (HERG) (CHEMBL240), a voltage-gated ion channel,  
(d) Dopamine D4 receptor (CHEMBL219), a monoamine GPCR,  
(e) Coagulation factor X (CHEMBL244), a serine protease,  
(f) Cannabinoid CB1 receptor (CHEMBL218), a lipid-like GPCR and  
(g) Cytochrome P450 19A1 (CHEMBL1978), a cytochrome P450.  
The activity classes were selected based on data availability and as representatives of therapeutically important target classes or as anti-targets.

In [26]:
!nvidia-smi

Thu Oct 28 18:33:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   44C    P8    17W / 240W |    569MiB /  8116MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [2]:
#%%capture
#!wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
#!chmod +x Miniconda3-latest-Linux-x86_64.sh
#!time bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
#!time conda install -q -y -c conda-forge rdkit

In [3]:
# Import
import pandas as pd
import numpy as np
from pathlib import Path

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import sys
import os
sys.path.append('/usr/local/lib/python3.7/site-packages/')
from rdkit import Chem
from rdkit.Chem import AllChem



In [5]:
dataset = 'oldsmiledata_id_processed_transformed_shuffled'

In [6]:
path = Path('../dataset/oldsmiledata')

In [7]:
list(path.iterdir())

[PosixPath('../dataset/oldsmiledata/test_oldsmiledata_id_processed_transformed_shuffled.csv'),
 PosixPath('../dataset/oldsmiledata/oldsmiledata_id_processed_transformed_rescaled_shuffled.csv'),
 PosixPath('../dataset/oldsmiledata/.ipynb_checkpoints'),
 PosixPath('../dataset/oldsmiledata/mol_images'),
 PosixPath('../dataset/oldsmiledata/train_oldsmiledata_id_processed_transformed_shuffled.csv')]

In [8]:
IMAGES = path/'mol_images'/'con'
train = pd.read_csv(path/f'train_{dataset}.csv')
valid = pd.read_csv(path/f'test_{dataset}.csv')

In [9]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23924 entries, 0 to 23923
Data columns (total 41 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   template                        23924 non-null  object 
 1   docked                          23924 non-null  object 
 2   rmsd                            23924 non-null  float64
 3   uniprot_id                      23924 non-null  object 
 4   smiles_template                 23924 non-null  object 
 5   smiles_docked                   23924 non-null  object 
 6   mcs_smartsString                23924 non-null  object 
 7   template_HeavyAtomCount         23924 non-null  int64  
 8   template_NHOHCount              23924 non-null  int64  
 9   template_NOCount                23924 non-null  int64  
 10  template_RingCount              23924 non-null  int64  
 11  template_NumHAcceptors          23924 non-null  int64  
 12  template_NumHDonors             

In [10]:
valid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024 entries, 0 to 1023
Data columns (total 41 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   template                        1024 non-null   object 
 1   docked                          1024 non-null   object 
 2   rmsd                            1024 non-null   float64
 3   uniprot_id                      1024 non-null   object 
 4   smiles_template                 1024 non-null   object 
 5   smiles_docked                   1024 non-null   object 
 6   mcs_smartsString                1024 non-null   object 
 7   template_HeavyAtomCount         1024 non-null   int64  
 8   template_NHOHCount              1024 non-null   int64  
 9   template_NOCount                1024 non-null   int64  
 10  template_RingCount              1024 non-null   int64  
 11  template_NumHAcceptors          1024 non-null   int64  
 12  template_NumHDonors             10

# Create dataloader

In [11]:
from fastai.vision.all import *

  return torch._C._cuda_getDeviceCount() > 0


In [12]:
train['img_temp'] = train['template'] + '.png'
train['img_docked'] = train['docked'] + '.png'
train['image'] = train['template'] + train['docked'] + '.png'
train['is_valid'] = False
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23924 entries, 0 to 23923
Data columns (total 45 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   template                        23924 non-null  object 
 1   docked                          23924 non-null  object 
 2   rmsd                            23924 non-null  float64
 3   uniprot_id                      23924 non-null  object 
 4   smiles_template                 23924 non-null  object 
 5   smiles_docked                   23924 non-null  object 
 6   mcs_smartsString                23924 non-null  object 
 7   template_HeavyAtomCount         23924 non-null  int64  
 8   template_NHOHCount              23924 non-null  int64  
 9   template_NOCount                23924 non-null  int64  
 10  template_RingCount              23924 non-null  int64  
 11  template_NumHAcceptors          23924 non-null  int64  
 12  template_NumHDonors             

In [None]:
valid['img_temp'] = valid['template'] + '.png'
valid['img_docked'] = valid['docked'] + '.png'
valid['image'] = valid['template'] + valid['docked'] + '.png'
valid['is_valid'] = True
valid.head()
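
The `image` column is built by string concatenation, so a mistyped ID only shows up later as a file-not-found error inside the dataloader. A minimal sketch of an up-front check, assuming the image files sit directly under the `IMAGES` directory; `missing_images` is a hypothetical helper, not part of fastai or the original notebook:

```python
from pathlib import Path
from typing import Iterable, List


def missing_images(names: Iterable[str], image_dir: Path) -> List[str]:
    """Return the file names that do not exist as files under image_dir.

    Hypothetical helper for illustration only.
    """
    image_dir = Path(image_dir)
    return [name for name in names if not (image_dir / name).is_file()]
```

In this notebook it could be called as `missing_images(train['image'], IMAGES)` and `missing_images(valid['image'], IMAGES)`; an empty list means every referenced file was found.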

In [None]:
getters = [ColReader('img_temp', pref=IMAGES), ColReader('img_docked', pref=IMAGES), ColReader('rmsd')]

In [None]:
db = DataBlock(
    blocks = (ImageBlock(), RegressionBlock()), 
    getters = [ColReader('image', pref=IMAGES), ColReader('rmsd')],
    splitter=ColSplitter('is_valid'),
    item_tfms=None,
    )

In [None]:
df = pd.concat([train, valid], ignore_index=True)

In [None]:
df.is_valid.value_counts()

In [None]:
df.smiles_template.nunique()

In [None]:
df.info()

In [None]:
dls = db.dataloaders(df, bs=64, shuffle_train=True)

In [None]:
dls.show_batch(max_n=3)

In [None]:
dls.show_batch(max_n=5, unique=True)

# Train CNN model

In [24]:
learn = cnn_learner(dls, arch=resnet18, pretrained=True, 
                            loss_func=None,
                            wd=None, metrics = rmse)

In [25]:
learn.fine_tune(16, 3e-3)

epoch,train_loss,valid_loss,_rmse,time
0,12.039248,9.306625,3.050676,14:57


epoch,train_loss,valid_loss,_rmse,time
0,8.041008,6.952068,2.636678,20:38


KeyboardInterrupt: 
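
A quick arithmetic check on the two logged rows above: the `_rmse` column matches the square root of `valid_loss`, which is what one would expect if the loss is a mean squared error. That the loss is MSE is an inference from these numbers (with `loss_func=None`, fastai picks a default for the regression target), not something the notebook states. The sketch only uses values copied from the log:

```python
import math

# (valid_loss, _rmse) pairs copied from the two logged epochs above.
logged = [(9.306625, 3.050676), (6.952068, 2.636678)]


def rmse_from_mse(mse: float) -> float:
    """RMSE is the square root of the mean squared error."""
    return math.sqrt(mse)


for valid_loss, rmse in logged:
    print(f"sqrt({valid_loss}) = {rmse_from_mse(valid_loss):.6f}, logged _rmse = {rmse}")
```

Reading the metric this way means `_rmse` is in the same units as the `rmsd` target, while `valid_loss` is in squared units.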

In [None]:
stop

In [126]:
def train_model(dls, arch=resnet18, loss_func=None, epochs=16, wd=None, lr=None):
    """Fine-tune a pretrained CNN on dls and return the Learner."""
    print(f'Architechture: {arch}')
    #print(f'Untrained epochs: freeze_epochs={freeze_epochs}')
    print(f'Trained epochs: epochs={epochs}')
    print(f'Weight decay: wd={wd}')
    # Pass loss_func through (it was previously accepted but ignored);
    # None lets fastai pick the default loss for the data.
    learn = cnn_learner(dls, arch=arch, pretrained=True,
                            loss_func=loss_func,
                            wd=wd,
                            metrics=[rmse])

    if lr is None:
        # No learning rate given: use the suggestion from lr_find.
        print('Finding learning rate...')
        lr_min, lr_steep = learn.lr_find(suggestions=True, show_plot=False)
        print(f'Training model with learning rate: {lr_min}')
        lr = lr_min
    else:
        print(f'Training model with learning rate: {lr}')
    learn.fine_tune(epochs, lr)

    return learn

In [127]:
learn = train_model(dls, epochs=2)

Architechture: <function resnet18 at 0x7f57e0340d08>
Trained epochs: epochs=2
Weight decay: wd=None
Finding learning rate...


TypeError: forward() takes 2 positional arguments but 3 were given

# Test different regularizations

## Results:
### wd = 0.002 is good for around 15-20 epochs before overfitting
### lr = 3e-3 works well for most CNNs, including this case
### dropout = 0.5 is usually a sustainable choice
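
To test several of these settings together, the candidate values can be expanded into one dict per combination. The sketch below uses `itertools.product` as a dependency-free stand-in for a parameter grid; `expand_grid` is a hypothetical helper, and apart from `wd=0.002` and `lr=3e-3` the candidate values are illustrative, not values from the experiments:

```python
from itertools import product
from typing import Dict, Iterator, List


def expand_grid(grid: Dict[str, List]) -> Iterator[Dict]:
    """Yield one dict per combination of the listed values."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


# Only wd=0.002 and lr=3e-3 come from the notes above; the rest is illustrative.
grid = {"wd": [0.002, 0.02], "lr": [3e-3, 1e-3]}
combos = list(expand_grid(grid))
print(len(combos), "combinations")
```

Each dict in `combos` can then be unpacked into a training call, for example `train_model(dls, epochs=15, **params)`.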

In [18]:
from sklearn.model_selection import ParameterGrid

In [19]:
# wd = 0.002 works for around 15-20 epochs
# lr = 3e-3 works well for most CNNs, including this case
# dropout = 0.5 is usually a sustainable choice
# batch_size = 64
param_grid={
    "bs" : [128, 252, 512],
}
param_grid = ParameterGrid(param_grid)

for p in param_grid:
    dls = get_dls(dataset, bs=p['bs'])
    learn = train_model(dls, loss_func=loss_func, epochs=15, wd=0.002, lr=3e-3)

Architechture: <function resnet18 at 0x7f3c3036af28>
Trained epochs: epochs=15
Weight decay: wd=0.002
Training model with learning rate: 0.003


epoch,train_loss,valid_loss,accuracy,f1_score,precision_score,recall_score,roc_auc_score,matthews_corrcoef,time
0,0.591833,0.614759,0.772639,0.42616,0.27646,0.929448,0.928012,0.428293,00:33


epoch,train_loss,valid_loss,accuracy,f1_score,precision_score,recall_score,roc_auc_score,matthews_corrcoef,time
0,0.302307,0.246268,0.923656,0.689342,0.546763,0.932515,0.977525,0.67929,00:41
1,0.231287,0.195587,0.94316,0.745636,0.628151,0.917178,0.97694,0.731142,00:40
2,0.224081,0.232139,0.931457,0.719818,0.572464,0.969325,0.982272,0.714525,00:40
3,0.202265,0.195893,0.933408,0.724971,0.58011,0.966258,0.980671,0.718863,00:40
4,0.172223,0.209649,0.936194,0.730905,0.592381,0.953988,0.977702,0.722443,00:40
5,0.167371,0.199423,0.942324,0.753278,0.615984,0.969325,0.982639,0.746287,00:40
6,0.132522,0.140064,0.962106,0.819149,0.723005,0.944785,0.983326,0.807321,00:40
7,0.110819,0.106491,0.968515,0.839716,0.781003,0.907975,0.982942,0.825238,00:42
8,0.097044,0.13895,0.965171,0.830393,0.744526,0.93865,0.987551,0.818038,00:41
9,0.085997,0.155352,0.961828,0.822309,0.71236,0.972393,0.984965,0.813682,00:40


Architechture: <function resnet18 at 0x7f3c3036af28>
Trained epochs: epochs=15
Weight decay: wd=0.002
Training model with learning rate: 0.003


epoch,train_loss,valid_loss,accuracy,f1_score,precision_score,recall_score,roc_auc_score,matthews_corrcoef,time
0,0.704093,0.487004,0.805795,0.469155,0.312057,0.944785,0.939678,0.474125,00:33


epoch,train_loss,valid_loss,accuracy,f1_score,precision_score,recall_score,roc_auc_score,matthews_corrcoef,time
0,0.383489,0.272373,0.915297,0.668122,0.518644,0.93865,0.973423,0.660307,00:40
1,0.289926,0.269745,0.920034,0.684268,0.533448,0.953988,0.977969,0.6783,00:39
2,0.24866,0.252783,0.930343,0.715262,0.568841,0.96319,0.980212,0.70915,00:39
3,0.228303,0.157967,0.949847,0.766234,0.664414,0.904908,0.975188,0.74995,00:39
4,0.192442,0.17678,0.958763,0.803191,0.70892,0.92638,0.979659,0.789334,00:39
5,0.160356,0.174733,0.951519,0.781407,0.661702,0.953988,0.980627,0.771141,00:39
6,0.137067,0.135016,0.959877,0.809524,0.711628,0.93865,0.982934,0.79701,00:40
7,0.118722,0.130855,0.966286,0.836707,0.746988,0.95092,0.982823,0.825625,00:39
8,0.103941,0.127559,0.966007,0.835135,0.746377,0.947853,0.98549,0.823732,00:40
9,0.087584,0.137622,0.96545,0.831063,0.747549,0.935583,0.983199,0.818431,00:41


Architechture: <function resnet18 at 0x7f3c3036af28>
Trained epochs: epochs=15
Weight decay: wd=0.002
Training model with learning rate: 0.003


epoch,train_loss,valid_loss,accuracy,f1_score,precision_score,recall_score,roc_auc_score,matthews_corrcoef,time


RuntimeError: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 7.93 GiB total capacity; 5.73 GiB already allocated; 90.56 MiB free; 6.70 GiB reserved in total by PyTorch)
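
The third run (bs=512) ended in a CUDA out-of-memory error that stopped the whole loop. A minimal sketch of a sweep that records such failures and carries on, assuming the error surfaces as a `RuntimeError` whose message contains "out of memory", as in the traceback above; `sweep_batch_sizes` and the `run` callable are hypothetical names:

```python
from typing import Callable, Dict, Iterable, Optional


def sweep_batch_sizes(
    batch_sizes: Iterable[int],
    run: Callable[[int], float],
) -> Dict[int, Optional[float]]:
    """Call run(bs) for each batch size; store None when the GPU runs out of memory."""
    results: Dict[int, Optional[float]] = {}
    for bs in batch_sizes:
        try:
            results[bs] = run(bs)
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated errors should still surface
            results[bs] = None
    return results
```

Here `run` would wrap the `get_dls` and `train_model` calls and return a validation metric. After a failed run it may also help to release cached GPU memory (for example with `torch.cuda.empty_cache()`) before trying the next size.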

# Compare architectures

**Results:** There is not much difference between **Resnet18, Resnet34 and Resnet50** on (224, 224) images. \
**Alexnet** got worse results than the **Resnet** models. \
A possible reason is that the extra layers are of little use in this case (i.e. the images do not contain much detail). \
**Resnet18** takes less time to train because it has fewer layers, and should therefore be used in this case. 

In [None]:
archs = [resnet18, resnet50, alexnet]

In [None]:
for arch in archs:
    train_model(dls, arch=arch, epochs=15, lr=3e-3)
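
The loop above discards the returned learners, so the comparison relies on reading the printed tables. The selection rule described in the results (prefer the fastest architecture when the errors are close) can be sketched as follows, assuming each run is summarised as a validation error and a training time; `pick_architecture` and its `tolerance` parameter are hypothetical:

```python
from typing import Dict, Tuple


def pick_architecture(results: Dict[str, Tuple[float, float]], tolerance: float) -> str:
    """Return the fastest architecture whose error is within tolerance of the best.

    results maps a name to (validation error, training time); lower is better for both.
    """
    best_error = min(error for error, _ in results.values())
    candidates = {
        name: time
        for name, (error, time) in results.items()
        if error - best_error <= tolerance
    }
    return min(candidates, key=candidates.get)
```

The choice of `tolerance` decides how much accuracy may be traded for training time, so it should be set relative to the run-to-run variation of the metric.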