# Example 1: Learning molecular energies

#### Huan Tran

The main objective of this example is to demonstrate a generic materials informatics workflow: (1) obtain a small dataset of molecules and their energies, (2) fingerprint them, (3) develop some ML models, and (4) use these models to make predictions.

***
### 1. Download data
The dataset contains 400 non-equilibrium structures of CF4, whose energies were computed using the BigDFT package. It is available at www.matsml.org.

In [1]:
import os
import pandas as pd

# get data
data_url='https://www.matsml.org/data/molecs.tgz'
os.system('wget -O molecs.tgz --no-check-certificate '+data_url)
os.system('tar -xf molecs.tgz')

# check that the summary file was extracted
sum_file='molecs/sum_list.csv'
print(os.path.isfile(sum_file))
if os.path.isfile(sum_file):
    print(pd.read_csv(sum_file))

True
         file_name     target
0    CF4-00001.xyz -25.466963
1    CF4-00002.xyz -25.357728
2    CF4-00003.xyz -25.463676
3    CF4-00004.xyz -25.312495
4    CF4-00005.xyz -25.364009
..             ...        ...
395  CF4-00396.xyz -25.405787
396  CF4-00397.xyz -25.487477
397  CF4-00398.xyz -25.461510
398  CF4-00399.xyz -25.223929
399  CF4-00400.xyz -25.497948

[400 rows x 2 columns]
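
Each entry in `sum_list.csv` points to an atomic structure in xyz format: the first line gives the number of atoms, the second is a free-form comment, and each remaining line holds an element symbol followed by Cartesian coordinates. A minimal sketch of a reader for this format is shown below. The function name `parse_xyz` and the example geometry are illustrative assumptions; they are not part of matsML and not taken from the dataset.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]


def parse_xyz(text: str) -> List[Atom]:
    """Parse the contents of an xyz file into (element, x, y, z) tuples."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])      # line 1: number of atoms
    atoms: List[Atom] = []
    # line 2 is a comment; atom records start on line 3
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("atom count does not match the header line")
    return atoms


# Illustrative five-atom geometry (hypothetical coordinates)
EXAMPLE_XYZ = """5
hypothetical tetrahedral molecule
C   0.00   0.00   0.00
F   0.76   0.76   0.76
F   0.76  -0.76  -0.76
F  -0.76   0.76  -0.76
F  -0.76  -0.76   0.76
"""

atoms = parse_xyz(EXAMPLE_XYZ)
print(atoms)
```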


***
### 2. Fingerprint the obtained data
The Coulomb matrix (CM) [M. Rupp, A. Tkatchenko, K.-R. Müller, and O. Anatole von Lilienfeld, <em>Fast and accurate modeling of molecular atomization energies with machine learning</em>, Phys. Rev. Lett. 108, 058301 (2012)] is perhaps one of the earliest fingerprints used in materials informatics. It is defined as an $N\times N$ matrix for a molecule of $N$ atoms. The key advantage of the CM is that it is invariant under rotations and translations, as required to represent a materials structure as a whole. However, its size depends on the size of the molecule, so it cannot be used directly for machine learning. Normally, the eigenvalues of these matrices are computed and sorted, and zero padding is then used to make fixed-size vectors. Here, we instead project the Coulomb matrices onto a set of Gaussian functions covering the entire range of the Coulomb matrix element values. The result is again a set of fixed-size fingerprints, which are ready for learning.
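
To make the idea concrete, the sketch below builds a Coulomb matrix following the definition in the paper cited above (diagonal elements $0.5\,Z_i^{2.4}$, off-diagonal elements $Z_iZ_j/|\mathbf{R}_i-\mathbf{R}_j|$) and then projects its elements onto a grid of Gaussians. The projection scheme, the function names, the grid range, the width `sigma`, and the geometry are all assumptions made for illustration; the actual `pcm_molecs` implementation in matsML may differ.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def coulomb_matrix(charges: Sequence[float],
                   coords: Sequence[Coord]) -> List[List[float]]:
    """Coulomb matrix: 0.5*Z_i^2.4 on the diagonal, Z_i*Z_j/R_ij elsewhere."""
    n = len(charges)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                r_ij = math.dist(coords[i], coords[j])
                cm[i][j] = charges[i] * charges[j] / r_ij
    return cm


def project_on_gaussians(cm: List[List[float]], lo: float, hi: float,
                         n_grid: int, sigma: float) -> List[float]:
    """Sum a Gaussian of every matrix element at n_grid (>= 2) centers in [lo, hi].

    This is an assumed projection scheme, used here only as an illustration.
    """
    centers = [lo + k * (hi - lo) / (n_grid - 1) for k in range(n_grid)]
    elements = [value for row in cm for value in row]
    return [sum(math.exp(-0.5 * ((v - c) / sigma) ** 2) for v in elements)
            for c in centers]


# Hypothetical five-atom geometry with atomic numbers C = 6 and F = 9
charges = [6, 9, 9, 9, 9]
a = 0.76
coords = [(0.0, 0.0, 0.0), (a, a, a), (a, -a, -a), (-a, a, -a), (-a, -a, a)]

cm = coulomb_matrix(charges, coords)
fingerprint = project_on_gaussians(cm, lo=0.0, hi=100.0, n_grid=20, sigma=5.0)
print(len(fingerprint))
```

Because the matrix depends only on interatomic distances and the projection sums over all elements, the resulting vector has the same length for any molecule size and does not change when the molecule is translated or rotated.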

In [2]:
import os
import pandas as pd
from matsml.fingerprint import Fingerprint

sum_list=os.path.join(os.getcwd(),'molecs/sum_list.csv')
data_loc=os.path.join(os.getcwd(),'molecs/')
n_atoms_max=6                          # max number of atoms in all of the structures to be fingerprinted
fp_type='pcm_molecs'                   # projected Coulomb matrix for molecules
struct_format='xyz'                    # atomic structure format 
fp_file='fp.csv'                       # fingerprinted data file name
fp_dim=100                             # intended fingerprint dimensionality; the final number can be smaller
verbosity=0                            # verbosity, 0 or 1

data_params={'sum_list':sum_list,'data_loc':data_loc,'n_atoms_max':n_atoms_max,
    'fp_file':fp_file,'struct_format':struct_format,'fp_type':fp_type,
    'fp_dim':fp_dim,'verbosity':verbosity}

fp=Fingerprint(data_params)

# Compute fingerprint
fp.get_fingerprint()

# What does the fingerprinted data look like?
print(pd.read_csv('fp.csv').columns)

  matsML, version 1.0
  *****
  Atomic structure fingerprinting
    sum_list                     /home/huan/work/matsml/examples/ex1_pcm-molecs/molecs/sum_list.csv
    data_loc                     /home/huan/work/matsml/examples/ex1_pcm-molecs/molecs/
    n_atoms_max                  6
    struct_format                xyz
    fp_type                      pcm_molecs
    fp_dim                       100
    fp_file                      fp.csv
    verbosity                    0
  Read input
    num_structs                  400
  Computing Coulomb matrix
  Projecting Coulomb matrix to create fingerprints
  Done fingerprinting, results saved in fp.csv
Index(['id', 'target', 'pcm_0018', 'pcm_0019', 'pcm_0020', 'pcm_0021',
       'pcm_0022', 'pcm_0023', 'pcm_0024', 'pcm_0025', 'pcm_0026', 'pcm_0027',
       'pcm_0028', 'pcm_0029', 'pcm_0030', 'pcm_0031', 'pcm_0032', 'pcm_0033',
       'pcm_0034', 'pcm_0035', 'pcm_0036', 'pcm_0037', 'pcm_0038', 'pcm_0039',
       'pcm_0040', 'pcm_0041', 'pcm_0

***
### 3. Train some ML models
The fingerprinted data in "fp.csv", whose columns are shown above, can now be used for learning. First, some parameters describing the data are specified.

In [3]:
# data parameters for learning; note that these may differ from the data_params used in the
# fingerprinting step above. The same data in "fp.csv" is used for all three algorithms below
data_file='fp.csv'        # fingerprinted data file
id_col=['id']             # column for data ID 
y_cols=['target']         # columns for (one or more) target properties
comment_cols=[]           # comment columns, anything not counted into ID, fingerprints, and target
n_trains=0.85             # 85% for training, 15% for testing
sampling='random'         # method for train/test splitting
x_scaling='minmax'        # method for x scaling
y_scaling='minmax'        # method for y scaling

# Dict of data parameters
data_params={'data_file':data_file, 'id_col':id_col,'y_cols':y_cols,
    'comment_cols':comment_cols,'y_scaling':y_scaling,'x_scaling':x_scaling,
    'sampling':sampling, 'n_trains':n_trains}
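
The two preprocessing choices above, `minmax` scaling and a `random` 85/15 split, can be sketched in a few lines of plain Python. The helper names `minmax_scale` and `random_split` are hypothetical and only illustrate the idea; how matsML implements these steps internally is not shown here. The energies are the first five target values from `sum_list.csv`.

```python
import random
from typing import List, Sequence, Tuple


def minmax_scale(values: Sequence[float]) -> List[float]:
    """Linearly map values onto [0, 1]; constant input maps to zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def random_split(n_samples: int, train_fraction: float,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and split them into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * n_samples))
    return indices[:n_train], indices[n_train:]


energies = [-25.466963, -25.357728, -25.463676, -25.312495, -25.364009]
scaled = minmax_scale(energies)

train_idx, test_idx = random_split(n_samples=400, train_fraction=0.85)
print(len(train_idx), len(test_idx))
```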

Then, three generic learning algorithms will be used. Depending on the algorithm, some method-specific parameters are needed so the model can be properly built and trained on the fingerprinted data.
***
#### 3a. Fully-connected NeuralNet

In [4]:
from matsml.models import FCNeuralNet

# Model parameters
layers=[8,8]                 # list of nodes in hidden layers
epochs=300                   # Epochs
nfold_cv=5                   # Number of folds for cross validation
use_bias=True                # Use bias term or not
model_file='model_nn.pkl'    # Name of the model file to be created
verbosity=0                  # Verbosity, 0 or 1
batch_size=32                # Default = 32
loss='mse'
metric='mse'
activ_funct='tanh'           # Options: "tanh","relu","sigmoid","softmax","softplus","softsign","selu","elu","exponential"
optimizer='nadam'            # options: "SGD","RMSprop","Adam","Adadelta","Adagrad","Adamax","Nadam","Ftrl"

# Dict of model parameters
model_params={'layers':layers,'activ_funct':activ_funct,'epochs':epochs,
    'nfold_cv':nfold_cv,'optimizer':optimizer,'use_bias':use_bias,
    'model_file':model_file,'loss':loss,'metric':metric,
    'batch_size':batch_size,'verbosity':verbosity,'rmse_cv':False}

# Compile a model
model=FCNeuralNet(data_params=data_params,model_params=model_params)

# Train the model
model.train()

# Plot results
model.plot(pdf=False)

 
  Learning fingerprinted/featured data
    algorithm                    fully connected NeuralNet w/ TensorFlow
    layers                       [8, 8]
    activ_funct                  tanh
    epochs                       300
    optimizer                    nadam
    nfold_cv                     5
  Reading data ... 
    data file                    fp.csv
    data size                    400
    training size                340 (85.0 %)
    test size                    60 (15.0 %)
    x dimensionality             50
    y dimensionality             1
    y label(s)                   ['target']
  Scaling x                      minmax
  Scaling y                      minmax
  Prepare train/test sets        random
  Building model                 FCNeuralNet
  Training model w/ cross validation
    cv,rmse_train,rmse_test,rmse_opt: 0 0.059998 0.056726 0.056726
    cv,rmse_train,rmse_test,rmse_opt: 1 0.060184 0.117547 0.056726
    cv,rmse_train,rmse_test,rmse_opt: 2 0.056156 0.073283 

NameError: name 'plot_train_test_preds' is not defined

In [7]:
from matsml.io import plot_train_test_preds
plot_train_test_preds()


  Plot results in "training.csv" & "test.csv"


ValueError: Image size of 91459x8015 pixels is too large. It must be less than 2^16 in each direction.

<Figure size 600x600 with 1 Axes>
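
The `cv,rmse_train,rmse_test,rmse_opt` lines in the log report one root-mean-square error per cross-validation fold. As a rough illustration of what an `nfold_cv=5` procedure involves, the sketch below partitions the training indices into five folds and defines the RMSE. The function names are hypothetical, and the exact fold construction used by matsML is an assumption.

```python
import math
import random
from typing import List, Sequence, Tuple


def kfold_indices(n_samples: int, n_folds: int,
                  seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return (train, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, valid))
    return splits


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# 340 training samples, as reported in the log above
splits = kfold_indices(n_samples=340, n_folds=5)
for fold, (train, valid) in enumerate(splits):
    print(fold, len(train), len(valid))
```

In each fold, a model would be trained on the `train` indices and scored with `rmse` on the `valid` indices, so every training sample is used for validation exactly once.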

***
#### 3b. Kernel Ridge Regression

In [None]:
from matsml.models import KRR

# Model parameters
kernel = 'rbf'                 # Kernel
nfold_cv=5                     # Number of folds for cross validation
model_file='model_krr.pkl'     # Name of the model file to be created
metric = 'mse'                 # Metric
alpha = [-2,5]                 # hyper parameter range
gamma = [-2,5]                 # hyper parameter range
n_grids = 10

# Dict of model parameters
model_params={'kernel':kernel,'metric':metric,'nfold_cv':nfold_cv,
    'model_file':model_file,'alpha':alpha,'gamma':gamma,'n_grids':n_grids}

# Compile a model
model = KRR(data_params=data_params,model_params=model_params)

# Train the model
model.train()

# Plot results
model.plot(pdf=False)
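
For readers unfamiliar with the method, kernel ridge regression with an RBF kernel solves the linear system $(K+\alpha I)\,c=y$, where $K_{ij}=\exp(-\gamma\,|x_i-x_j|^2)$, and predicts with $\sum_i c_i\,k(x_i,x)$. The plain-Python sketch below illustrates this for single values of `alpha` and `gamma`. It is not the matsML implementation; the function names are hypothetical, and how matsML interprets the `alpha` and `gamma` ranges given above is not covered.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def rbf_kernel(a: Vector, b: Vector, gamma: float) -> float:
    """RBF kernel exp(-gamma * |a - b|^2)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def krr_fit(x_train: Sequence[Vector], y_train: Sequence[float],
            alpha: float, gamma: float) -> List[float]:
    """Return the coefficients c that solve (K + alpha*I) c = y."""
    n = len(x_train)
    k = [[rbf_kernel(x_train[i], x_train[j], gamma) + (alpha if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear(k, list(y_train))


def krr_predict(x_train: Sequence[Vector], coeffs: Sequence[float],
                x_new: Sequence[Vector], gamma: float) -> List[float]:
    """Predict with sum_i c_i * k(x_i, x) for each new point x."""
    return [sum(c * rbf_kernel(xt, xn, gamma) for c, xt in zip(coeffs, x_train))
            for xn in x_new]


# Tiny illustrative dataset (not from fp.csv)
x_train = [[0.0], [0.5], [1.0]]
y_train = [0.0, 0.25, 1.0]
coeffs = krr_fit(x_train, y_train, alpha=1e-8, gamma=1.0)
fitted = krr_predict(x_train, coeffs, x_train, gamma=1.0)
print(fitted)
```

A larger `alpha` gives a smoother model that fits the training points less closely, while `gamma` sets how quickly the kernel decays with distance in fingerprint space.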

***
#### 3c. Gaussian Process Regression

In [None]:
from matsml.models import GPR

# Model parameters
nfold_cv=5                         # Number of folds for cross validation
model_file='model_gpr.pkl'         # Name of the model file to be created
verbosity=0                        # Verbosity, 0 or 1
metric='mse'                       # Metric
rmse_cv=True                       # Compute CV RMSE or not
n_restarts_optimizer=100           # Number of optimizer restarts

# Dict of model parameters
model_params={'metric':metric,'nfold_cv':nfold_cv,
    'n_restarts_optimizer':n_restarts_optimizer,'model_file':model_file,
    'verbosity':verbosity,'rmse_cv':rmse_cv}

# Compile a model
model=GPR(data_params=data_params,model_params=model_params)

# Train the model
model.train()

# Plot results
model.plot(pdf=False)