# Training ML models for computed band gap of HOIPs
### Huan Tran, MSE, Georgia Institute of Technology

This notebook is a part of [V. N. Tuoc, Nga. T. T. Nguyen, V. Sharma, and T. D. Huan, *Probabilistic deep learning approach for targeted hybrid organic-inorganic perovskites*, 2021]. 

The original dataset, which contains the computed band gaps of 1,346 atomic structures predicted for 192 chemical compositions of hybrid organic-inorganic perovskites (HOIPs), is available at [C. Kim, T. D. Huan, S. Krishnan, and R. Ramprasad, Scientific Data 4, 170057 (2017), url: https://www.nature.com/articles/sdata201757]. Here, three fingerprinted versions of this dataset are fetched and used to train five ML models based on Gaussian Process Regression, a fully connected Neural Net, and a Probabilistic Neural Net. The last one provides a way to handle aleatoric (data) uncertainty. Details on this topic can be found in the original reference mentioned above.
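
As a rough illustration of how a probabilistic model can account for aleatoric uncertainty, the sketch below scores predictions with a Gaussian negative log-likelihood, where each prediction is a mean and a standard deviation rather than a single number. This is a generic NumPy example under that assumption; the function name `gaussian_nll` and the toy numbers are hypothetical, and nothing here is taken from the matsML implementation.

```python
import numpy as np

def gaussian_nll(y_true, mean, std):
    """Average negative log-likelihood of y_true under N(mean, std**2)."""
    var = std ** 2
    return np.mean(0.5 * np.log(2.0 * np.pi * var)
                   + 0.5 * (y_true - mean) ** 2 / var)

# Toy targets and predicted means (made-up numbers, not from the HOIP dataset)
y_true = np.array([1.0, 2.0, 3.0])
mean = np.array([1.1, 1.9, 3.5])

# Same means, two different claimed uncertainties
nll_confident = gaussian_nll(y_true, mean, std=np.full(3, 0.1))
nll_cautious = gaussian_nll(y_true, mean, std=np.full(3, 0.5))

print(nll_confident, nll_cautious)
```

With these toy numbers the overconfident prediction (small `std`) is penalized more heavily than the cautious one, because one of the errors is much larger than the claimed spread. A loss of this kind is what lets a model learn a data-dependent noise level alongside the mean.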

In [1]:
import pandas as pd
from matsml.models import ProbNeuralNet

  matsML, a ML toolkit for some problems in materials science
  Huan Tran, huantd@gmail.com
  *****


In [2]:
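# Data parameters
data_file = 'hoips_4tfp.csv'     # Fingerprinted dataset to learn
id_col = ['ID']                  # Column holding the structure IDs
y_cols = ['Egap']                # Target column(s): the computed band gap
comment_cols = []                # Comment columns (none here)
n_trains = 0.85                  # Fraction of the data used for training
sampling = 'random'              # How the train/test sets are selected
x_scaling = 'minmax'             # Scaling applied to the fingerprints (x)
y_scaling = 'minmax'             # Scaling applied to the targets (y)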
data_params = {'data_file':data_file, 'id_col':id_col,'y_cols':y_cols,
    'comment_cols':comment_cols,'y_scaling':y_scaling,'x_scaling':x_scaling,
    'sampling':sampling, 'n_trains':n_trains}
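
The `n_trains`, `sampling`, and `x_scaling` settings above can be pictured with the following minimal NumPy sketch of a random 85/15 split followed by min-max scaling. It uses synthetic data with an arbitrary number of features, and it fits the scaling bounds on the training rows only, which is an assumption made for illustration; how matsML performs these steps internally is not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the fingerprint matrix: 1,346 rows as in the dataset,
# 4 columns chosen arbitrarily for this illustration
n_samples, n_features = 1346, 4
X = rng.normal(size=(n_samples, n_features))

# Random split: 85 % of the rows for training, the rest for testing
n_train = int(0.85 * n_samples)
perm = rng.permutation(n_samples)
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Min-max scaling with bounds taken from the training rows (assumption)
x_min = X[train_idx].min(axis=0)
x_max = X[train_idx].max(axis=0)
X_train = (X[train_idx] - x_min) / (x_max - x_min)
X_test = (X[test_idx] - x_min) / (x_max - x_min)

print(len(train_idx), len(test_idx))
```

The training features then lie in [0, 1] column by column, while test features may fall slightly outside that range.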

In [7]:
# Model parameters
layers = [5,5]                   # Number of nodes in each hidden layer
epochs = 100                     # Number of training epochs
nfold_cv = 5                     # Number of folds for cross validation
use_bias = True                  # Use bias term or not
file_model = 'model.pkl'         # Name of the model file to be created
loss = 'mse'                     # Loss function
metric = 'mse'                   # Evaluation metric
verbosity = 0                    # Verbosity level
batch_size = 32                  # Batch size
activ_funct = 'elu'              # Options: "tanh","relu","sigmoid","softmax", 
                                 # "softplus","softsign","selu","elu",
                                 # "exponential"
optimizer = 'nadam'              # Options: "SGD","RMSprop","Adam","Adadelta", 
                                 # "Adagrad","Adamax","Nadam","Ftrl"

model_params={'layers':layers,'activ_funct':activ_funct,'epochs':epochs,
    'nfold_cv':nfold_cv,'optimizer':optimizer,'use_bias':use_bias,
    'file_model':file_model,'loss':loss,'metric':metric,
    'batch_size':batch_size,'verbosity':verbosity,'rmse_cv':False}
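
To make the `activ_funct`, `layers`, and `use_bias` settings more concrete, here is a small NumPy sketch of the ELU activation and of the number of trainable parameters in the hidden layers. It assumes standard fully connected layers and the 221-dimensional fingerprint used in this notebook; the helper names `elu` and `hidden_param_count` are hypothetical, and the output layer is left out because its form is not specified here.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp() from overflowing on the branch that is discarded
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def hidden_param_count(n_inputs, layers, use_bias=True):
    """Weights (and biases) of the dense hidden layers, output layer excluded."""
    total, fan_in = 0, n_inputs
    for n_nodes in layers:
        total += fan_in * n_nodes + (n_nodes if use_bias else 0)
        fan_in = n_nodes
    return total

print(elu([-1.0, 0.0, 2.0]))
print(hidden_param_count(221, [5, 5], use_bias=True))
```

Under these assumptions, two hidden layers of five nodes each hold 221*5 + 5 = 1110 and 5*5 + 5 = 30 parameters, so the network is small compared with the 1,144 training points.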

In [8]:
# Set up the model with the data and model parameters
model=ProbNeuralNet(data_params=data_params,model_params=model_params)

 
  Learning fingerprinted/featured data
    algorithm                    Probabilistic NeuralNet w/ TensorFlow-Probability
    layers                       [5, 5]
    activ_funct                  elu
    epochs                       100
    optimizer                    nadam
    nfold_cv                     5


In [9]:
# Train the model
model.train()

  Reading data ... 
    data file                    hoips_4tfp.csv
    data size                    1346
    training size                1144 (85.0 %)
    test size                    202 (15.0 %)
    x dimensionality             221
    y dimensionality             1
    y label(s)                   ['Egap']
  Preprocessing data ...
    scaling x                    minmax
    scaling y                    minmax
    prepare train/test sets      random
  Building model: ProbNeuralNet
  Training ProbNeuralNet w/ cross validation
    cv,rmse_train,rmse_test,rmse_opt: 0 0.130835 0.129978 0.129978
    cv,rmse_train,rmse_test,rmse_opt: 1 0.133496 0.132256 0.129978
    cv,rmse_train,rmse_test,rmse_opt: 2 0.129701 0.121604 0.121604
    cv,rmse_train,rmse_test,rmse_opt: 3 0.138079 0.119489 0.119489
    cv,rmse_train,rmse_test,rmse_opt: 4 0.134343 0.132062 0.119489
    Optimal ncv:  3
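
The cross-validation log above appears to track, in its last column, the lowest `rmse_test` seen so far, and to report the fold where that minimum occurs. The sketch below reproduces this reading from the logged numbers; the interpretation of the columns is inferred from the log itself rather than from the matsML source.

```python
import numpy as np

# rmse_test values copied from the training log, one per fold (cv = 0..4)
rmse_test = np.array([0.129978, 0.132256, 0.121604, 0.119489, 0.132062])

# Running minimum over the folds, matching the rmse_opt column
rmse_opt = np.minimum.accumulate(rmse_test)

# Fold with the lowest rmse_test, matching "Optimal ncv"
best_fold = int(np.argmin(rmse_test))

print(rmse_opt)
print(best_fold)
```

The running minimum matches the logged `rmse_opt` column and the selected fold index is 3, in agreement with the reported `Optimal ncv`.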
