
Predict Lipophilicity Property Using GCN
========================================

So far, we have predicted lipophilicity from linear molecular representations (fingerprints and SMILES strings). This time, we use a graph convolutional network (GCN) that operates on a graph representation of each molecule.
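
To make the graph representation concrete, here is a minimal NumPy sketch of one graph-convolution step on a hand-written three-atom chain. The toy adjacency matrix, feature values and identity weights are illustrative assumptions, not data from the Lipophilicity set; the update follows the same "transform, then multiply by the adjacency matrix" pattern that the ``GConv`` layer in this tutorial uses.

```python
import numpy as np

# Toy molecule: a chain of three atoms 0-1-2 (hypothetical example).
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=np.float32)

# Add self-loops so each atom keeps its own features when aggregating.
a_hat = adjacency + np.eye(3, dtype=np.float32)

# One 2-dimensional feature vector per atom (made-up values).
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]], dtype=np.float32)

# Layer weights; the identity keeps the arithmetic easy to follow.
weights = np.eye(2, dtype=np.float32)

# Graph convolution: transform each atom, sum over neighbours, apply ReLU.
hidden = np.maximum(a_hat @ (features @ weights), 0.0)
print(hidden)
```

Each output row is the sum of an atom's own features and those of its bonded neighbours, which is how information travels along bonds from layer to layer.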

(1) Prepare Dataset and Data Loader
-----------------------------------

Download the Lipophilicity dataset from the [MoleculeNet](http://moleculenet.ai/datasets-1) benchmark collection.  
You can download it from the web page, or fetch the source CSV file directly from its URL as shown below.

In [38]:
!wget -q "http://deepchem.io.s3-website-us-west-1.amazonaws.com/datasets/Lipophilicity.csv" -O Lipophilicity.csv

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from rdkit import Chem



def get_splitted_lipo_dataset(ratios=(0.8, 0.1, 0.1), seed=123):
    """Split the Lipophilicity CSV into train/validation/test DataFrames."""
    raw_data = pd.read_csv('Lipophilicity.csv')  # Open original dataset

    # Hold out the test share first, then carve the validation share out of the rest
    train_val, test = train_test_split(raw_data, test_size=ratios[2], random_state=seed)
    train, val = train_test_split(train_val, test_size=ratios[1]/(ratios[0]+ratios[1]), random_state=seed)

    return train, val, test
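
The second split uses ``ratios[1]/(ratios[0]+ratios[1])`` instead of ``ratios[1]`` because it is applied to what remains after the test set is removed. The short check below is only a sketch of that arithmetic in plain Python; it does not call scikit-learn, and actual row counts can differ by one because of rounding.

```python
ratios = (0.8, 0.1, 0.1)  # train, validation, test shares of the full dataset

# Stage 1: hold out the test share from the full dataset.
test_fraction = ratios[2]
remaining = 1.0 - test_fraction

# Stage 2: the validation share is expressed relative to what is left.
val_fraction_of_remaining = ratios[1] / (ratios[0] + ratios[1])

val_share = remaining * val_fraction_of_remaining
train_share = remaining * (1.0 - val_fraction_of_remaining)
print(round(train_share, 3), round(val_share, 3), round(test_fraction, 3))
```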

In [39]:
datasets = get_splitted_lipo_dataset()
train_df = datasets[0]  # training split (a DataFrame, not only the SMILES column)
print(train_df)

      CMPD_CHEMBLID   exp                                             smiles
1369   CHEMBL199237  2.70                 O=C(NCc1ccccn1)c2ccc(Oc3ccccc3)cc2
3084   CHEMBL277863  2.05  COC(=O)N1CCN([C@H](CN2CCCC2)C1)C(=O)Cc3ccc(Cl)...
2141  CHEMBL1824036 -0.51  CS(=O)(=O)c1ccc2OCC(=O)N(CCN3CCC(CC3)NCc4ccc5O...
3741    CHEMBL87266  1.35  Cc1cc(N)c2cc(NC(=O)CC(=O)Nc3ccc4nc(C)cc(N)c4c3...
2192   CHEMBL513370  2.36  COc1ccc(N(C(C(=O)NC[C@@H](C)O)c2ccccc2F)C(=O)c...
...             ...   ...                                                ...
3180   CHEMBL578061  4.06                     N(c1ccccc1)c2ccnc(Nc3ccccc3)n2
266        CHEMBL97  2.16          COc1ccc2nccc([C@H](O)C3CC4CCN3CC4C=C)c2c1
2398   CHEMBL138649  3.60                     Oc1ccc2OC(=CC(=O)c2c1)c3ccccc3
2073   CHEMBL272705  2.21            C[C@H](CO)Nc1nc(SCc2occc2)nc3NC(=O)Sc13
2480   CHEMBL177611  0.60                 Clc1ccc(cc1)C(=O)N[C@H]2CN3CCC2CC3

[3359 rows x 3 columns]


In [67]:
LIST_SYMBOLS = ['C', 'N', 'O', 'S', 'F', 'H', 'Si', 'P', 'Cl', 'Br',
            'Li', 'Na', 'K', 'Mg', 'Ca', 'Fe', 'As', 'Al', 'I', 'B',
            'V', 'Tl', 'Sb', 'Sn', 'Ag', 'Pd', 'Co', 'Se', 'Ti', 'Zn',
            'Ge', 'Cu', 'Au', 'Ni', 'Cd', 'Mn', 'Cr', 'Pt', 'Hg', 'Pb']


def char_to_ix(x, allowable_set):
    """Map a value to a 1-based index; 0 is reserved for unknown values."""
    if x not in allowable_set:
        return [0]  # Unknown token
    return [allowable_set.index(x) + 1]


def atom_feature(atom):
    """Encode an RDKit atom as five integer indices."""
    # Number of allowed values per feature: (40, 6, 5, 6, 2)
    return np.array(char_to_ix(atom.GetSymbol(), LIST_SYMBOLS) +
                    char_to_ix(atom.GetDegree(), [0, 1, 2, 3, 4, 5]) +
                    char_to_ix(atom.GetTotalNumHs(), [0, 1, 2, 3, 4]) +
                    char_to_ix(atom.GetImplicitValence(), [0, 1, 2, 3, 4, 5]) +
                    char_to_ix(int(atom.GetIsAromatic()), [0, 1]))


def mol2graph(smi, MAX_LEN):
    """Convert a SMILES string into padded feature (X) and adjacency (A) matrices."""
    mol = Chem.MolFromSmiles(smi)

    X = np.zeros((MAX_LEN, 5), dtype=np.uint8)  # five feature indices per atom
    A = np.zeros((MAX_LEN, MAX_LEN), dtype=np.uint8)

    # Molecules with more than MAX_LEN atoms are truncated
    temp_A = Chem.rdmolops.GetAdjacencyMatrix(mol).astype(np.uint8, copy=False)[:MAX_LEN, :MAX_LEN]
    num_atom = temp_A.shape[0]
    A[:num_atom, :num_atom] = temp_A + np.eye(num_atom, dtype=np.uint8)  # add self-loops

    for i, atom in enumerate(mol.GetAtoms()):
        if i >= num_atom:
            break
        X[i, :] = atom_feature(atom)

    return X, A

smiles = "O=C(NCc1ccccn1)c2ccc(Oc3ccccc3)cc2"
X, A = mol2graph(smiles, 70)
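
The shape of the result is easier to see on a tiny hand-built case. The sketch below reproduces the same index encoding and zero padding without RDKit; the three-atom chain, the shortened symbol list and the helper name ``to_index`` are hypothetical and chosen only for illustration.

```python
import numpy as np

def to_index(value, allowable_set):
    """Return 0 for unknown values, otherwise the 1-based position."""
    if value not in allowable_set:
        return 0
    return allowable_set.index(value) + 1

symbols = ['C', 'N', 'O']   # shortened symbol list, for illustration only
atoms = ['C', 'C', 'O']     # heavy atoms of a toy C-C-O chain
bonds = [(0, 1), (1, 2)]    # bonded atom index pairs
max_len = 5                 # pad every graph to this many atoms

X = np.zeros((max_len, 1), dtype=np.uint8)
A = np.zeros((max_len, max_len), dtype=np.uint8)

for i, symbol in enumerate(atoms):
    X[i, 0] = to_index(symbol, symbols)
    A[i, i] = 1             # self-loop for every real atom
for i, j in bonds:
    A[i, j] = A[j, i] = 1   # undirected bond

print(X.ravel())
print(A)
```

Padded rows keep the value 0, which is also the index used for unknown tokens, so padding atoms and unknown atoms share the same embedding row.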

In [69]:
from torch.utils.data import Dataset, DataLoader

class gcnDataset(Dataset):
    def __init__(self, df, max_len=120):
        self.smiles = df["smiles"]
        self.exp = df["exp"].values
                
        list_X = list()
        list_A = list()
        for i, smiles in enumerate(self.smiles):
            X, A = mol2graph(smiles, max_len)
            list_X.append(X)
            list_A.append(A)
            
        self.X = np.array(list_X, dtype=np.uint8)
        self.A = np.array(list_A, dtype=np.uint8)
        
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, index):
        return self.X[index], self.A[index], self.exp[index]
    
sample_dataset = gcnDataset(datasets[0])

(2) Model Architecture
----------------------------

The model is built from the ``Attention``, ``GConv``, ``BN1d``, ``ResBlock``, ``Readout`` and ``Net`` classes.  
Their code is listed in the Experiment section, together with the training loop.

(4) Visualization
----------------------------

The helper functions below summarize the experiment settings and plot the performance, the predictions and the loss curves.  
They rely on ``matplotlib`` and ``seaborn``, which are imported in the final cell.

In [108]:
from decimal import Decimal

def generate_setting(args, var1, var2):
    dict_args = vars(args)
    output = '{:92}'.format('[Exp Settings]') + '\n'
    output += '-'*91 + '\n'

    num_var = 3
    cnt_var = 0
    for keyword, value in dict_args.items():
        if keyword != var1 and keyword != var2 and type(value) != list and 'best' not in keyword and keyword != 'elapsed':
            str_value = str(value)
            if isinstance(value, float):
                # Floats (e.g. the learning rate) are shown in scientific notation
                temp = '| {}={:.2E}'.format(keyword, Decimal(value))
            else:
                temp = '| {}={}'.format(keyword, str_value[:15])
            output += '{:<30}'.format(temp[:30])
            cnt_var += 1
            if cnt_var % num_var == 0:
                cnt_var = 0
                output += '|\n'
                output += '-'*91 + '\n'
    return output

In [109]:
def plot_performance(results, variable1, variable2, title='', filename=''):
    fig, ax = plt.subplots(1, 2)

    fig.set_size_inches(15, 6)
    sns.set_style("darkgrid", {"axes.facecolor": ".9"})
    sns.barplot(x=variable1, y='best_mae', hue=variable2, data=results, ax=ax[0])
    sns.barplot(x=variable1, y='best_std', hue=variable2, data=results, ax=ax[1])

    font = FontProperties()
    font.set_family('monospace')
    font.set_size('large')
    alignment = {'horizontalalignment': 'center', 'verticalalignment': 'baseline'}
    fig.text(0.5, -0.6, generate_setting(args, variable1, variable2), fontproperties=font, **alignment)
    
    fig.suptitle(title)
    filename = filename if len(filename) > 0 else title
    plt.savefig('./images/{}.png'.format(filename))

In [110]:
def plot_distribution(results, variable1, variable2, x='true_y', y='pred_y', title='', filename='', **kwargs):
    list_v1 = results[variable1].unique()
    list_v2 = results[variable2].unique()
    list_data = list()
    for value1 in list_v1:
        for value2 in list_v2:
            row = results.loc[results[variable1]==value1]
            row = row.loc[results[variable2]==value2]

            best_true_y = list(row.best_true_y)[0]
            best_pred_y = list(row.best_pred_y)[0]
            for i in range(len(best_true_y)):
                list_data.append({x:best_true_y[i], y:best_pred_y[i], variable1:value1, variable2:value2})
    df = pd.DataFrame(list_data)

    g = sns.FacetGrid(df, row=variable2, col=variable1, margin_titles=True)
    g.map(plt.scatter, x, y, alpha=0.3)

    def identity(**kwargs):
        plt.plot(np.linspace(-1.5,4,50), np.linspace(-1.5,4,50),'k',linestyle='dashed')
    g.map(identity)
    g.set_axis_labels(x, y)
    g.fig.suptitle(title) # can also get the figure from plt.gcf()
    plt.subplots_adjust(top=kwargs.get('top',0.93))
    filename = filename if len(filename) > 0 else title
    plt.savefig('./images/{}.png'.format(filename))

In [111]:
def plot_loss(results, variable1, variable2, x='true_y', y='pred_y', title='', filename='', **kwargs):
    list_v1 = results[variable1].unique()
    list_v2 = results[variable2].unique()
    list_data = list()
    for value1 in list_v1:
        for value2 in list_v2:
            row = results.loc[results[variable1]==value1]
            row = row.loc[results[variable2]==value2]

            train_losses = list(row.train_losses)[0]
            val_losses = list(row.val_losses)[0]
            maes = list(row.maes)[0]
            
            for item in train_losses:
                item.update({'type':'train', 'loss':item['train_loss'], variable1:value1, variable2:value2})
                
            for item in val_losses:
                item.update({'type':'val', 'loss':item['val_loss'], variable1:value1, variable2:value2})
            
            for item in maes:
                item.update({'type':'mae', variable1:value1, variable2:value2})
            list_data += train_losses + val_losses + maes

    df = pd.DataFrame(list_data)
    temp_mae = df.loc[df['mae'] < df['mae'].quantile(0.98)]
    ymax = temp_mae['mae'].max()
    ymin = temp_mae['mae'].min()
    
    temp_loss = df.loc[df['loss'] < df['loss'].quantile(0.98)]
    lossmax = temp_loss['loss'].max()
    lossmin = temp_loss['loss'].min()
    
    g = sns.FacetGrid(df, row=variable2, col=variable1, hue='type', margin_titles=False)
    axes = g.axes
    for i in range(len(axes)):
        for j in range(len(axes[0])):
            if i==0:
                g.axes[i][j].yaxis.set_label_coords(1.1,0.9)
                
    def mae_line(x, y, **kwargs):
        # Draw the MAE curve on a secondary y-axis (lower limit first, upper limit second)
        ax2 = plt.gca().twinx()
        ax2.plot(x, y,'g--')
        ax2.set_ylim(kwargs['ymin']*0.95, kwargs['ymax']*1.05)
        ax2.grid(False)

    g.map(plt.plot, x, y)
    g.map(mae_line, 'epoch', 'mae', ymin=ymin, ymax=ymax)
    g.set_axis_labels(x, y)
    g.fig.suptitle(title) # can also get the figure from plt.gcf()
    g.add_legend()
    
    for ax in g.axes.flatten():
        ax.set_ylim(lossmin, lossmax)
        
    plt.subplots_adjust(top=kwargs.get('top', 0.93))
    filename = filename if len(filename) > 0 else title
    plt.savefig('./images/{}.png'.format(filename))

(5) Experiment
----------------------------

This section defines the model, the training, validation and test loops, and the hyperparameter sweep.

In [129]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
import math


#===== Activation =====#
def gelu(x):

    """ Ref: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py
        Implementation of the gelu activation function.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu}
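
The docstring of ``gelu`` mentions a tanh-based variant that gives slightly different results. The following sketch, written with the standard ``math`` module instead of ``torch``, compares both formulas at a few hand-picked points; the function names ``gelu_erf`` and ``gelu_tanh`` are introduced here for illustration only.

```python
import math

def gelu_erf(x):
    """Exact GELU, written with the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation, as quoted in the gelu docstring."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

points = [-3.0, -1.0, 0.0, 1.0, 3.0]
max_gap = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in points)
print(max_gap)
```

At these points the two versions stay close, so either is a reasonable choice of activation.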

class Attention(nn.Module):
    def __init__(self, input_dim, output_dim, num_attn_head, dropout=0.1):
        super(Attention, self).__init__()   

        self.num_attn_heads = num_attn_head
        self.attn_dim = output_dim // num_attn_head
        self.projection = nn.ModuleList([nn.Linear(input_dim, self.attn_dim) for i in range(self.num_attn_heads)])
        self.coef_matrix = nn.ParameterList([nn.Parameter(torch.FloatTensor(self.attn_dim, self.attn_dim)) for i in range(self.num_attn_heads)])
        self.tanh = nn.Tanh()
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=dropout)
        self.param_initializer()

    def forward(self, X, A):
        list_X_head = list()
        for i in range(self.num_attn_heads):
            X_projected = self.projection[i](X)
            attn_matrix = self.attn_coeff(X_projected, A, self.coef_matrix[i])
            X_head = torch.matmul(attn_matrix, X_projected)
            list_X_head.append(X_head)
            
        X = torch.cat(list_X_head, dim=2)
        X = self.relu(X)
        return X
            
    def attn_coeff(self, X_projected, A, C):
        X = torch.einsum('akj,ij->aki', (X_projected, C))
        attn_matrix = torch.matmul(X, torch.transpose(X_projected, 1, 2)) 
        attn_matrix = torch.mul(A, attn_matrix)
        attn_matrix = self.dropout(self.tanh(attn_matrix))
        return attn_matrix
    
    def param_initializer(self):
        for i in range(self.num_attn_heads):    
            nn.init.xavier_normal_(self.projection[i].weight.data)
            nn.init.xavier_normal_(self.coef_matrix[i].data)
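
The ``einsum`` string in ``attn_coeff`` can be hard to read. As a sketch under assumed toy shapes (random inputs and an identity adjacency matrix, neither taken from the tutorial data), the NumPy snippet below checks that ``'akj,ij->aki'`` is a batched multiplication by the transposed coefficient matrix, and then builds the masked, tanh-squashed score matrix in the same order as the method does, leaving out dropout.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_atoms, dim = 2, 4, 3
X = rng.normal(size=(batch, n_atoms, dim))    # projected atom features
C = rng.normal(size=(dim, dim))               # learnable coefficient matrix
A = np.tile(np.eye(n_atoms), (batch, 1, 1))   # toy adjacency: self-loops only

# 'akj,ij->aki' contracts the feature axis of X with the second axis of C,
# which is the same as multiplying every atom vector by C transposed.
XC = np.einsum('akj,ij->aki', X, C)
assert np.allclose(XC, X @ C.T)

# Pairwise scores between atoms, masked by the adjacency matrix, then squashed.
scores = XC @ np.transpose(X, (0, 2, 1))
attn = np.tanh(A * scores)
print(attn.shape)
```

Because the mask is applied by multiplication, pairs of atoms that are not connected get a score of exactly zero.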


#####################################################
# ===== Gconv, Readout, BN1D, ResBlock, Encoder =====#
#####################################################

class GConv(nn.Module):
    def __init__(self, input_dim, output_dim, attn):
        super(GConv, self).__init__()
        self.attn = attn
        if self.attn is None:
            self.fc = nn.Linear(input_dim, output_dim)
        
    def forward(self, X, A):
        if self.attn is None:
            x = self.fc(X)
            x = torch.matmul(A, x)
        else:
            x = self.attn(X, A)            
        return x, A
    
class BN1d(nn.Module):
    def __init__(self, out_dim, use_bn):
        super(BN1d, self).__init__()
        self.use_bn = use_bn
        self.bn = nn.BatchNorm1d(out_dim)
             
    def forward(self, x):
        if not self.use_bn:
            return  x
        origin_shape = x.shape
        x = x.view(-1, origin_shape[-1])
        x = self.bn(x)
        x = x.view(origin_shape)
        return x
        
class ResBlock(nn.Module):
    def __init__(self, in_dim, out_dim, use_bn, use_attn, dp_rate, sc_type, n_attn_head=None):
        super(ResBlock, self).__init__()   
        self.use_bn = use_bn
        self.sc_type = sc_type
        
        attn = Attention(in_dim, out_dim, n_attn_head) if use_attn else None
        self.gconv = GConv(in_dim, out_dim, attn)
        
        self.bn1 = BN1d(out_dim, use_bn)
        self.dropout = nn.Dropout2d(p=dp_rate)
        self.relu = nn.ReLU()
        
        if self.sc_type not in ['no', 'gsc', 'sc']:
            raise ValueError("sc_type must be 'no', 'sc' or 'gsc', got {}".format(self.sc_type))

        if self.sc_type != 'no':
            self.bn2 = BN1d(out_dim, use_bn)
            self.shortcut = nn.Sequential()
            if in_dim != out_dim:
                self.shortcut.add_module('shortcut', nn.Linear(in_dim, out_dim, bias=False))
                
        if self.sc_type == 'gsc':
            self.g_fc1 = nn.Linear(out_dim, out_dim, bias=True)
            self.g_fc2 = nn.Linear(out_dim, out_dim, bias=True)
            self.sigmoid = nn.Sigmoid()

    def forward(self, _x, A):
        x, A = self.gconv(_x, A)

        if self.sc_type == 'no': #no skip-connection
            x = self.relu(self.bn1(x))
            return self.dropout(x), A
        
        elif self.sc_type == 'sc': # basic skip-connection
            x = self.relu(self.bn1(x))
            x = x + self.shortcut(_x)          
            return self.dropout(self.relu(self.bn2(x))), A
        
        elif self.sc_type == 'gsc': # gated skip-connection
            x = self.relu(self.bn1(x)) 
            x1 = self.g_fc1(self.shortcut(_x))
            x2 = self.g_fc2(x)
            gate_coef = self.sigmoid(x1+x2)
            x = torch.mul(x1, gate_coef) + torch.mul(x2, 1.0-gate_coef)
            return self.dropout(self.relu(self.bn2(x))), A
             
class Readout(nn.Module):
    def __init__(self, out_dim, molvec_dim):
        super(Readout, self).__init__()
        self.readout_fc = nn.Linear(out_dim, molvec_dim)
        nn.init.xavier_normal_(self.readout_fc.weight.data)

    def forward(self, output_H):
        molvec = self.readout_fc(output_H)
        molvec = torch.mean(molvec, dim=1)
        return molvec


class Net(nn.Module):
    def __init__(self, args):
        super(Net, self).__init__()   
        
        # Create Atom Element embedding layer
        self.embedding = self.create_emb_layer([args.vocab_size, args.degree_size,
                                                args.numH_size, args.valence_size,
                                                args.isarom_size],  args.emb_train)        
        # Create Residual Convolution layer
        self.gconvs = nn.ModuleList()
        for i in range(args.n_layer):
            if i==0:
                self.gconvs.append(ResBlock(args.in_dim, args.out_dim, args.use_bn, False, args.dp_rate, args.sc_type))
            else:
                self.gconvs.append(ResBlock(args.out_dim, args.out_dim, args.use_bn, args.use_attn, args.dp_rate, args.sc_type, args.n_attn_head))

        self.readout = Readout(args.out_dim, args.molvec_dim)

        # Create MLP layers for regression
        self.fc1 = nn.Linear(args.molvec_dim, args.molvec_dim//2)
        self.fc2 = nn.Linear(args.molvec_dim//2, 1)
        self.bn1 = BN1d(args.molvec_dim//2, args.use_bn)
        self.act = ACT2FN[args.act]
        self.dropout = nn.Dropout(p=args.dp_rate)
        self.sigmoid = nn.Sigmoid()
    
    def create_emb_layer(self, list_vocab_size, emb_train=False):
        # Each feature gets a one-hot (identity) embedding; the extra row is the unknown token
        list_emb_layer = nn.ModuleList()
        for vocab_size in list_vocab_size:
            vocab_size += 1
            emb_layer = nn.Embedding(vocab_size, vocab_size)
            emb_layer.load_state_dict({'weight': torch.eye(vocab_size)})
            emb_layer.weight.requires_grad = emb_train
            list_emb_layer.append(emb_layer)
        return list_emb_layer

          
    def _embed(self, x):
        list_embed = list()
        for i in range(5):
            list_embed.append(self.embedding[i](x[:, :, i]))
        x = torch.cat(list_embed, 2)
        return x
    
    def _encode(self, x, A):
        for i, module in enumerate(self.gconvs):
            x, A = module(x, A)
        molvec = self.readout(x)
        return molvec
    
    def forward(self, x, A):
        A = A.float()
        x = self._embed(x)          # embedding layer
        molvec = self._encode(x, A)   # encoding through gcn layer
        molvec = self.dropout(self.bn1(self.act(self.fc1(molvec))))
        molvec = self.fc2(molvec)
        
        return molvec.squeeze(-1)   # keep the batch dimension even for a batch of one
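
The gated skip-connection (``sc_type == 'gsc'``) in ``ResBlock`` mixes the shortcut path and the convolution path with a learned gate. The NumPy sketch below isolates that mixing step; the two input vectors are made-up stand-ins for the outputs of ``g_fc1`` and ``g_fc2``, and batch normalization, ReLU and dropout are left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1 = np.array([2.0, -1.0, 0.5])   # stands in for g_fc1(shortcut(input))
x2 = np.array([0.0,  3.0, 0.5])   # stands in for g_fc2(conv output)

gate = sigmoid(x1 + x2)                 # one coefficient per feature, in (0, 1)
mixed = gate * x1 + (1.0 - gate) * x2   # convex combination of the two paths
print(mixed)
```

Since the gate lies strictly between 0 and 1, every mixed feature falls between the corresponding values of the two paths.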
    


In [130]:
from tqdm import tqdm_notebook, tqdm

def train(model, dataloader, optimizer, criterion, args, **kwargs):
    
    epoch_train_loss = 0
    list_train_loss = list()
    cnt_iter = 0
    for batch_idx, batch in enumerate(dataloader):
        X, A, y = batch[0].long(), batch[1].long(), batch[2].float()
        X, A, y = X.to(args.device), A.to(args.device), y.to(args.device)
    
        model.train()
        optimizer.zero_grad()

        pred_y = model(X, A)
        
        train_loss = criterion(pred_y, y)
        epoch_train_loss += train_loss.item()
        list_train_loss.append({'epoch':batch_idx/len(dataloader)+kwargs['epoch'], 'train_loss':train_loss.item()})
        train_loss.backward()
        optimizer.step()
        
        cnt_iter += 1
    return model, list_train_loss


def validate(model, dataloader, criterion, args):

    epoch_val_loss = 0
    cnt_iter = 0
    model.eval()
    with torch.no_grad():  # no gradients are needed for evaluation
        for batch_idx, batch in enumerate(dataloader):
            # The dataset yields (X, A, y) and the model expects both X and A
            X, A, y = batch[0].long(), batch[1].long(), batch[2].float()
            X, A, y = X.to(args.device), A.to(args.device), y.to(args.device)

            pred_y = model(X, A)
            val_loss = criterion(pred_y, y)
            epoch_val_loss += val_loss.item()
            cnt_iter += 1

    return epoch_val_loss/cnt_iter

def test(model, dataloader, args, **kwargs):

    list_y, list_pred_y = list(), list()
    model.eval()
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            X, A, y = batch[0].long(), batch[1].long(), batch[2].float()
            X, A, y = X.to(args.device), A.to(args.device), y.to(args.device)

            pred_y = model(X, A)
            list_y += y.cpu().numpy().tolist()
            list_pred_y += pred_y.cpu().numpy().tolist()

    mae = mean_absolute_error(list_y, list_pred_y)
    std = np.std(np.array(list_y)-np.array(list_pred_y))  # spread of the residuals
    return mae, std, list_y, list_pred_y


def experiment(partition, args):
    ts = time.time()
    args.input_shape = (args.max_len, args.vocab_size)
    
    model = Net(args)
    
    model.to(args.device)
    criterion = nn.MSELoss()
    
    # Initialize Optimizer
    trainable_parameters = filter(lambda p: p.requires_grad, model.parameters())
    if args.optim == 'ADAM':
        optimizer = optim.Adam(trainable_parameters, lr=args.lr, weight_decay=args.l2_coef)
    elif args.optim == 'RMSProp':
        optimizer = optim.RMSprop(trainable_parameters, lr=args.lr, weight_decay=args.l2_coef)
    elif args.optim == 'SGD':
        optimizer = optim.SGD(trainable_parameters, lr=args.lr, weight_decay=args.l2_coef)
    else:
        raise ValueError("Undefined Optimizer Type: {}".format(args.optim))
        
    # Train, Validate, Evaluate
    list_train_loss = list()
    list_val_loss = list()
    list_mae = list()
    list_std = list()
    
    args.best_mae = 10000
    for epoch in range(args.epoch):
        model, train_losses = train(model, partition['train'], optimizer, criterion, args, **{'epoch':epoch})
        val_loss = validate(model, partition['val'], criterion, args)
        mae, std, true_y, pred_y = test(model, partition['test'], args, **{'epoch':epoch})
        
        list_train_loss += train_losses
        list_val_loss.append({'epoch':epoch, 'val_loss':val_loss})
        list_mae.append({'epoch':epoch, 'mae':mae})
        list_std.append({'epoch':epoch, 'std':std})
        
        if args.best_mae > mae or epoch==0:
            args.best_epoch = epoch
            args.best_mae = mae
            args.best_std = std
            args.best_true_y = true_y
            args.best_pred_y = pred_y
            

    # End of experiments
    te = time.time()
    args.elapsed = te-ts
    args.train_losses = list_train_loss
    args.val_losses = list_val_loss
    args.maes = list_mae
    args.stds = list_std

    return model, args 

In [131]:
import argparse
import time 
from sklearn.metrics import mean_absolute_error
from utils import *


seed = 123
np.random.seed(seed)
torch.manual_seed(seed)

parser = argparse.ArgumentParser()
args = parser.parse_args("")

# ==== Embedding Config ==== #
args.max_len = 70
args.vocab_size = 41
args.degree_size = 6
args.numH_size = 5
args.valence_size = 6
args.isarom_size = 2
args.emb_train = False


# ==== Model Architecture Config ==== #
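# Width of the concatenated one-hot embeddings: every size above gains one unknown-token slot
args.in_dim = sum(s + 1 for s in (args.vocab_size, args.degree_size, args.numH_size, args.valence_size, args.isarom_size))  # 65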
args.out_dim = 256
args.molvec_dim = 512
args.n_layer = 4
args.n_attn_head = 8
args.sc_type = 'sc'
args.use_attn = True
args.use_bn = True
args.act = 'relu'
args.dp_rate = 0.3


# ==== Optimizer Config ==== #
args.lr = 0.00005
args.l2_coef = 0.0001
args.optim = 'ADAM'


# ==== Training Config ==== #
args.epoch = 100
args.batch_size = 256
args.device = 'cuda' if torch.cuda.is_available() else 'cpu'
args.exp_name = 'exp1_lr_stage'


writer = Writer(prior_keyword=['n_layer', 'use_bn', 'lr', 'dp_rate', 'emb_train', 'epoch', 'batch_size'])
writer.clear()

# Define Hyperparameter Search Space
#list_n_layer = [1]
list_lr = [0.005]
list_n_layer = [3]




train_dataloader = DataLoader(gcnDataset(datasets[0], args.max_len), batch_size=args.batch_size, shuffle=True)
val_dataloader = DataLoader(gcnDataset(datasets[1], args.max_len), batch_size=args.batch_size, shuffle=False)
test_dataloader = DataLoader(gcnDataset(datasets[2], args.max_len), batch_size=args.batch_size, shuffle=False)
partition = {'train': train_dataloader, 'val': val_dataloader, 'test': test_dataloader}

cnt_exp = 0
for lr in list_lr:
    for n_layer in list_n_layer:
        args.lr = lr
        args.n_layer = n_layer

        model, result = experiment(partition, args)
        writer.write(result)
        
        cnt_exp += 1
        print('[Exp {:2}] got mae: {:2.3f}, std: {:2.3f} at epoch {:2}'.format(cnt_exp, result.best_mae, result.best_std, result.best_epoch))




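With ``args.in_dim = 59`` the cell above fails with the error below: the concatenated atom embeddings are 65 features wide, so the first linear layer must take 65 inputs.

RuntimeError: size mismatch, m1: [17920 x 65], m2: [59 x 256] at /opt/conda/conda-bld/pytorch_1573049306803/work/aten/src/THC/generic/THCTensorMathBlas.cu:290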
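
The two numbers in the first matrix of the size-mismatch message can be reproduced by hand. The sketch below uses the embedding sizes, batch size and ``max_len`` from the configuration cell and the ``vocab_size += 1`` step of ``create_emb_layer``; it is plain arithmetic and does not run the model.

```python
# Index-vocabulary sizes as configured in ``args`` for the five atom features.
feature_sizes = {
    'vocab_size': 41,
    'degree_size': 6,
    'numH_size': 5,
    'valence_size': 6,
    'isarom_size': 2,
}

# create_emb_layer adds one slot per feature (vocab_size += 1) and embeds each
# index as a one-hot vector of that width, so the widths add up after torch.cat.
embedded_width = sum(size + 1 for size in feature_sizes.values())

batch_size, max_len = 256, 70
flattened_rows = batch_size * max_len   # atoms per batch after flattening

print(flattened_rows, embedded_width)
```

The value 59 is the sum of the raw sizes of a 40-symbol list (40 + 6 + 5 + 6 + 2), which omits the extra slot added to each feature.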

In [None]:
import matplotlib.pyplot as plt
from matplotlib import gridspec
from matplotlib.font_manager import FontProperties
import seaborn as sns

results = writer.read(exp_name='exp1_lr_stage')
#results = results.loc[results['epoch']==50]
variable1 = 'n_layer'  # the sweep above varies n_layer and lr
variable2 = 'lr'


plot_performance(results, variable1, variable2,
                'Performance depends on {} vs {}'.format(variable1, variable2),
                'exp1_Performance {} vs {}'.format(variable1, variable2))

plot_distribution(results, variable1, variable2, 'true_y', 'pred_y', 
                  'Prediction results depends on {} vs {}'.format(variable1, variable2),
                  'exp1_Prediction {} vs {}'.format(variable1, variable2))

plot_loss(results, variable1, variable2, 'epoch', 'loss', 
                  'Loss depends on {} vs {}'.format(variable1, variable2),
                  'exp1_Loss {} vs {}'.format(variable1, variable2))

plt.show()