# Training the refractive index

This notebook goes through the main functions and objects implemented in this library, using a dataset of ~4,000 entries of the form (mp_id, structure, refractive index) taken from the Materials Project (MP). The workflow can be divided into two parts. First, a MODData object is created, which stores the information concerning this particular dataset: the materials, the targets and the optimal features. Second, a MODNetModel is trained, which can later be used for predicting on unseen data.

In [1]:
import sys
from modnet.models import MODNetModel
from modnet.preprocessing import MODData

## 1. Loading the dataset

In this example the dataset is a DataFrame saved as a pickle, but it can be in any format as long as you can retrieve the structures and targets (and, optionally, the MP IDs for fast featurization).

In [2]:
import pandas as pd
df = pd.read_pickle('data/df_ref.pkl')
print('{} datapoints'.format(len(df)))
df.head()

3735 datapoints


Unnamed: 0,structure,ref_index
mp-755998,[[2.06202807e-06 2.06349574e+00 2.69529553e+00...,2.43911
mp-13602,"[[0.97627791 4.96510018 6.65949814] O, [1.0206...",1.945601
mp-22467,"[[3.24119011 0.84967668 3.36565113] O, [3.2411...",2.283458
mp-23364,"[[1.23672715 1.23672715 0.81639113] Li, [3.710...",1.611454
mp-540621,"[[2.11356961 8.25584606 4.0428252 ] Sr, [6.474...",1.747648


In [3]:
#df_db = pd.read_pickle("/Users/ppdebreuck/Research/Software/modnet/modnet/data/feature_database_v2")

In [4]:
#df_red = df.loc[set(df_db.index).intersection(set(df.index))]
#df_red.to_pickle("data/df_ref.pkl")

## 2. Creating a MODData instance

### (a) structure, mpid, target creation

In [5]:
md = MODData(materials = df['structure'],
             targets = df['ref_index'].values,
             structure_ids = df.index,
             target_names = ['refractive_index']
            )

2023-01-23 15:24:59,211 - modnet - INFO - Loaded Matminer2023Featurizer featurizer.


### (b) Featurizing the data
The MODData has an integrated database containing the features of many materials from the MP. When fast featurization is enabled, the features are retrieved directly from this database rather than computed from the structure.

In [6]:
md.featurize(fast=True)

2023-01-23 15:24:59,251 - modnet - INFO - Computing features, this can take time...
2023-01-23 15:24:59,252 - modnet - INFO - Fast featurization on, retrieving from database...
2023-01-23 15:25:01,193 - modnet - INFO - Retrieved features for 3735 out of 3735 materials
2023-01-23 15:25:02,753 - modnet - INFO - Data has successfully been featurized!
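The idea behind fast featurization can be pictured as a keyed lookup with a fallback. The sketch below is only a conceptual illustration in plain Python: `fetch_features`, `toy_db`, `toy_compute` and the id `mp-0000` are hypothetical names invented here, and how MODNet actually handles ids that are missing from its database is not shown in this notebook.

```python
from typing import Callable, Dict, List, Tuple

FeatureRow = Dict[str, float]


def fetch_features(
    ids: List[str],
    database: Dict[str, FeatureRow],
    compute: Callable[[str], FeatureRow],
) -> Tuple[Dict[str, FeatureRow], int]:
    """Return one feature row per id, preferring the precomputed database.

    Hypothetical helper for illustration only; not part of MODNet.
    """
    features: Dict[str, FeatureRow] = {}
    n_retrieved = 0
    for material_id in ids:
        if material_id in database:
            # Cheap path: copy the stored row instead of recomputing it.
            features[material_id] = dict(database[material_id])
            n_retrieved += 1
        else:
            # Slow path (assumed): compute the features from scratch.
            features[material_id] = compute(material_id)
    return features, n_retrieved


def toy_compute(material_id: str) -> FeatureRow:
    """Stand-in for an expensive featurizer; returns a placeholder value."""
    return {"AtomicOrbitals|gap_AO": 1.0}


# Toy database holding a single material with a toy feature value.
toy_db = {"mp-13602": {"AtomicOrbitals|gap_AO": 0.0}}

feats, n_hit = fetch_features(["mp-13602", "mp-0000"], toy_db, toy_compute)
print(f"Retrieved features for {n_hit} out of {len(feats)} materials")
```

In the run above all 3735 materials were found in the database, so under this picture no structure-based computation would have been needed.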


In [7]:
md.get_featurized_df().head()

Unnamed: 0_level_0,AtomicOrbitals|HOMO_character,AtomicOrbitals|HOMO_element,AtomicOrbitals|HOMO_energy,AtomicOrbitals|LUMO_character,AtomicOrbitals|LUMO_element,AtomicOrbitals|LUMO_energy,AtomicOrbitals|gap_AO,AtomicPackingEfficiency|mean simul. packing efficiency,AtomicPackingEfficiency|mean abs simul. packing efficiency,AtomicPackingEfficiency|dist from 1 clusters |APE| < 0.010,...,BondFractions|Co - O bond frac.,BondFractions|Bi - O bond frac.,BondFractions|Sc - Sc bond frac.,BondFractions|Co - Co bond frac.,BondFractions|O - Y bond frac.,BondFractions|Nb - O bond frac.,CoulombMatrix|coulomb matrix eig 123,BondFractions|C - O bond frac.,BondFractions|Li - P bond frac.,SineCoulombMatrix|sine coulomb matrix eig 122
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
mp-755998,2.0,7.0,-0.266297,1.0,40.0,-0.162391,0.103906,0.012571,0.012571,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mp-13602,2.0,8.0,-0.338381,2.0,8.0,-0.338381,0.0,-0.020275,0.026709,0.01599,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mp-22467,2.0,8.0,-0.338381,2.0,8.0,-0.338381,0.0,-0.025676,0.025813,0.024296,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mp-23364,2.0,17.0,-0.32038,1.0,3.0,-0.10554,0.21484,0.0,0.0,0.612372,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mp-540621,2.0,8.0,-0.338381,2.0,8.0,-0.338381,0.0,-0.00131,0.050161,0.024298,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### (c) Computing the optimal features

This runs the feature selection algorithm. First the mutual information is computed, followed by the iterative selection based on relevance and redundancy.

This step takes time, but is normally run only once before being saved.

In [8]:
md.feature_selection(n=200,
                    n_samples=500,
                    use_precomputed_cross_nmi=True,
                    ) # Here we use precomputed cross_nmi to save time

2023-01-23 15:25:02,875 - modnet - INFO - Loading cross NMI from 'Features_cross' file.
2023-01-23 15:25:02,922 - modnet - INFO - Starting target 1/1: refractive_index ...
2023-01-23 15:25:02,923 - modnet - INFO - Computing mutual information between features and target...
2023-01-23 15:25:24,350 - modnet - INFO - Computing optimal features...
2023-01-23 15:25:50,593 - modnet - INFO - Selected 50/200 features...
2023-01-23 15:26:14,771 - modnet - INFO - Selected 100/200 features...
2023-01-23 15:26:34,174 - modnet - INFO - Selected 150/200 features...
2023-01-23 15:26:51,347 - modnet - INFO - Done with target 1/1: refractive_index.
2023-01-23 15:26:51,347 - modnet - INFO - Merging all features...
2023-01-23 15:26:51,348 - modnet - INFO - Done.
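To make "iterative selection based on relevance and redundancy" concrete, here is a minimal sketch of a greedy ranking in plain Python. It is a simplified stand-in and not MODNet's implementation: the scoring rule (relevance divided by the largest overlap with the already selected features, plus a fixed constant `c`), the function name `greedy_select` and the toy NMI values are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List


def greedy_select(
    relevance: Dict[str, float],
    redundancy: Dict[FrozenSet[str], float],
    n: int,
    c: float = 1e-2,
) -> List[str]:
    """Greedy relevance-redundancy ranking (simplified illustration).

    relevance[f]        : NMI between feature f and the target
    redundancy[{f, g}]  : NMI between features f and g
    c                   : small constant that avoids division by zero
    """
    remaining = set(relevance)
    # Start from the single most relevant feature.
    first = max(remaining, key=lambda f: relevance[f])
    selected = [first]
    remaining.remove(first)

    while remaining and len(selected) < n:

        def score(f: str) -> float:
            # Penalise f by its strongest overlap with anything already chosen.
            worst = max(redundancy.get(frozenset((f, s)), 0.0) for s in selected)
            return relevance[f] / (worst + c)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy values (made up): f2 is relevant but nearly duplicates f1.
toy_relevance = {"f1": 0.9, "f2": 0.8, "f3": 0.5}
toy_redundancy = {
    frozenset(("f1", "f2")): 0.9,
    frozenset(("f1", "f3")): 0.1,
    frozenset(("f2", "f3")): 0.1,
}

ranking = greedy_select(toy_relevance, toy_redundancy, n=3)
print(ranking)  # ['f1', 'f3', 'f2']
```

In this toy case `f3` is ranked ahead of `f2` even though it is less relevant, because `f2` carries almost the same information as the already selected `f1`. This is the same effect that makes the optimal descriptor list differ from a plain sort by target NMI.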


In [9]:
md.target_nmi.nlargest(n=10)

ValenceOrbital|frac p valence electrons              0.097339
DensityFeatures|density                              0.092590
ElementProperty|MagpieData maximum GSbandgap         0.083672
ElementProperty|MagpieData mean NdValence            0.082769
ElementProperty|MagpieData minimum GSvolume_pa       0.082210
ValenceOrbital|avg d valence electrons               0.081176
ElementProperty|MagpieData range GSbandgap           0.080725
ElementProperty|MagpieData mean Row                  0.078446
ElementProperty|MagpieData mode Electronegativity    0.077785
ValenceOrbital|frac d valence electrons              0.077722
Name: refractive_index, dtype: float64

In [10]:
md.get_optimal_descriptors()[:10]

['ValenceOrbital|frac p valence electrons',
 'AGNIFingerPrint|mean AGNI dir=y eta=2.89e+00',
 'RadialDistributionFunction|radial distribution function|d_1.40',
 'ElementFraction|F',
 'CoulombMatrix|coulomb matrix eig 0',
 'AGNIFingerPrint|std_dev AGNI dir=y eta=1.88e+00',
 'AGNIFingerPrint|mean AGNI eta=1.23e+00',
 'DensityFeatures|density',
 'AverageBondLength|std_dev Average bond length',
 'ElementProperty|MagpieData minimum NValence']

### (d) Saving the MODData

In [11]:
md.save('out/ref_index.mdt')

2023-01-23 15:26:53,833 - modnet - INFO - Data successfully saved as out/ref_index.mdt!


## 3. MODNet model

In [12]:
md = MODData.load('out/ref_index.mdt')

2023-01-23 15:26:56,231 - modnet - INFO - Loaded <modnet.preprocessing.MODData object at 0x7f9604001c40> object, created with modnet version 0.2.0~develop


### (a) Creating the MODNet

In [13]:
model = MODNetModel([[['refractive_index']]],{'refractive_index':1},
                    n_feat=200,
                    num_neurons=[[128],[64],[32],[]],
                    act='elu'
                   )
model.model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 200)]             0         
                                                                 
 dense (Dense)               (None, 128)               25728     
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 32)                2080      
                                                                 
 refractive_index (Dense)    (None, 1)                 33        
                                                                 
Total params: 36,097
Trainable params: 36,097
Non-trainable params: 0
_________________________________________________________________


2023-01-23 15:26:56.310265: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
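The parameter counts in the summary can be checked by hand: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The short sketch below reproduces the numbers for this architecture; the helper name `dense_param_counts` is introduced here only for illustration.

```python
from typing import List


def dense_param_counts(layer_sizes: List[int]) -> List[int]:
    """Parameters of each fully connected layer: weights plus biases."""
    return [
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


# 200 input features -> 128 -> 64 -> 32 -> 1 output (refractive_index)
counts = dense_param_counts([200, 128, 64, 32, 1])
print(counts)       # [25728, 8256, 2080, 33]
print(sum(counts))  # 36097
```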


### (b) Training the model

#### option 1: using the fit_preset function

In [14]:
#model.fit_preset(md,nested=0) # no inner CV is used (only a simple train-val split here)

#### option 2: using the fit function
In this case, the user provides hand-chosen hyperparameters.

In [15]:
model = MODNetModel([[['refractive_index']]],
                    {'refractive_index':1},
                    n_feat=200,
                    num_neurons=[[128],[64],[32],[]],
                    act='elu'
                   )

In [16]:
model.fit(md,val_fraction=0.1,
          val_key='refractive_index',
          loss='mae', lr=0.001, epochs = 300,
          batch_size = 64, xscale='minmax',
          yscale=None,
          verbose=1
         )

  super(Adam, self).__init__(name, **kwargs)


epoch 0: loss: 0.321, val_loss:0.152 val_mae:0.152
epoch 1: loss: 0.142, val_loss:0.144 val_mae:0.144
epoch 2: loss: 0.119, val_loss:0.114 val_mae:0.114
epoch 3: loss: 0.113, val_loss:0.104 val_mae:0.104
epoch 4: loss: 0.107, val_loss:0.112 val_mae:0.112
epoch 5: loss: 0.105, val_loss:0.097 val_mae:0.097
epoch 6: loss: 0.104, val_loss:0.107 val_mae:0.107
epoch 7: loss: 0.100, val_loss:0.091 val_mae:0.091
epoch 8: loss: 0.092, val_loss:0.093 val_mae:0.093
epoch 9: loss: 0.093, val_loss:0.091 val_mae:0.091
epoch 10: loss: 0.085, val_loss:0.094 val_mae:0.094
epoch 11: loss: 0.092, val_loss:0.087 val_mae:0.087
epoch 12: loss: 0.079, val_loss:0.082 val_mae:0.082
epoch 13: loss: 0.082, val_loss:0.089 val_mae:0.089
epoch 14: loss: 0.079, val_loss:0.082 val_mae:0.082
epoch 15: loss: 0.076, val_loss:0.094 val_mae:0.094
epoch 16: loss: 0.081, val_loss:0.101 val_mae:0.101
epoch 17: loss: 0.079, val_loss:0.093 val_mae:0.093
epoch 18: loss: 0.073, val_loss:0.077 val_mae:0.077
epoch 19: loss: 0.075,
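Two of the options used in this call, `xscale='minmax'` and `loss='mae'`, correspond to standard definitions that are easy to write out. The sketch below shows those textbook formulas in plain Python; it does not reproduce how MODNet implements them internally, and the function names and toy numbers are assumptions for illustration.

```python
from typing import List, Sequence


def minmax_scale(column: Sequence[float]) -> List[float]:
    """Rescale one feature column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


scaled = minmax_scale([2.0, 4.0, 6.0])
print(scaled)  # [0.0, 0.5, 1.0]

mae = mean_absolute_error([2.0, 1.5], [2.5, 1.5])
print(mae)  # 0.25
```

Since `yscale=None` leaves the targets unscaled, the reported `val_mae` values can presumably be read directly in units of refractive index.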

In [17]:
model.fit(md,
          val_fraction=0.1,
          val_key='refractive_index',
          lr=0.0005,
          epochs = 100,
          batch_size = 128,
          xscale='minmax',
          yscale=None,
          verbose=1
         )

  super(Adam, self).__init__(name, **kwargs)


epoch 0: loss: 0.002, val_loss:0.027 val_mae:0.054
epoch 1: loss: 0.002, val_loss:0.026 val_mae:0.053
epoch 2: loss: 0.002, val_loss:0.026 val_mae:0.054
epoch 3: loss: 0.002, val_loss:0.026 val_mae:0.054
epoch 4: loss: 0.002, val_loss:0.026 val_mae:0.054
epoch 5: loss: 0.001, val_loss:0.025 val_mae:0.054
epoch 6: loss: 0.002, val_loss:0.026 val_mae:0.057
epoch 7: loss: 0.002, val_loss:0.025 val_mae:0.052
epoch 8: loss: 0.001, val_loss:0.027 val_mae:0.056
epoch 9: loss: 0.001, val_loss:0.027 val_mae:0.055
epoch 10: loss: 0.001, val_loss:0.026 val_mae:0.053
epoch 11: loss: 0.001, val_loss:0.026 val_mae:0.055
epoch 12: loss: 0.001, val_loss:0.025 val_mae:0.054
epoch 13: loss: 0.001, val_loss:0.027 val_mae:0.054
epoch 14: loss: 0.001, val_loss:0.026 val_mae:0.055
epoch 15: loss: 0.001, val_loss:0.026 val_mae:0.055
epoch 16: loss: 0.001, val_loss:0.026 val_mae:0.054
epoch 17: loss: 0.001, val_loss:0.026 val_mae:0.054
epoch 18: loss: 0.001, val_loss:0.026 val_mae:0.056
epoch 19: loss: 0.001,

## 4. Saving the model

In [18]:
model.save('out/MODNet_refractive_index')

2023-01-23 15:27:44,268 - modnet - INFO - Model successfully saved as out/MODNet_refractive_index!


## 5. Predicting on unseen data

See the "predicting_ref_index" notebook.