# DeeperMD Package 

This package builds on DeepMD, a machine learning interatomic potential (MLIP) development package, and streamlines data preparation, model training, and model validation. By interfacing with dpdata (data preparation package), LAMMPS (molecular dynamics software), and deepmd (MLIP software), DeeperMD simplifies the model development process. It also adds new functionality, such as hyperparameter optimization, for tuning model parameters.

<div>
<img src="images/deepermd_summary.png" width="1000"/>
</div>

## Data Preparation Sub-Package

Separated into two modules: `process_data` and `gen_cval_data`.

This sub-package reads DFT data (currently only VASP OUTCAR files are supported) and processes it for use in ML-based potentials.

### `process_data`

Converts OUTCARs to `.npy` files via the `dpdata` package and separates the data into training and validation directories according to a user-defined training split proportion.

In [1]:
# Import the data preparation functions
from deepermd.data_prep.process_data import OUTCAR_to_ms, OUTCAR_to_npy, train_test_split

In [2]:
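# Define the parameters used by the functions below
par = 'raw_data/B4C'  # parent directory holding the raw DFT data
dest = 'example'      # destination directory for the processed data
sub = ['temperature_hold', 'small_strains']  # sub-directories passed as sub_dirs
run = ['temperature_hold', 'shear_strain', 'volumetric_strain', 'uniaxial_strain']  # run types passed as run_types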

<div>
<img src="images/raw_data_tree.png" width="1000"/>
</div>

#### `OUTCAR_to_ms`

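Converts the OUTCARs under the given `parent_dir` into dpdata `MultiSystems` objects, which store the systems and manage their data. In the example below, frames without virials are collected in `ms` and frames with virials in `ms_virial`.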

In [3]:
# Convert OUTCARs to dpdata MultiSystems objects
ms,ms_virial,count_novirial,count_virial = OUTCAR_to_ms(
    parent_dir=par,
    sub_dirs=sub,
    run_types=run,
    )
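
As a rough illustration of the kind of directory search such a conversion needs, the sketch below collects OUTCAR paths per run type using only the standard library. The helper `find_outcars` and the directory layout (`parent_dir/<sub_dir>/.../<run_type>/.../OUTCAR`) are assumptions made for this example; they are not taken from the DeeperMD source, and the real `OUTCAR_to_ms` may organize its search differently.

```python
import os
import tempfile


def find_outcars(parent_dir, sub_dirs, run_types):
    """Return {run_type: [OUTCAR paths]} found under parent_dir/<sub_dir>.

    Hypothetical helper: a directory is assigned to the first run type
    whose name appears as a component of its path relative to parent_dir.
    """
    found = {run_type: [] for run_type in run_types}
    for sub in sub_dirs:
        root = os.path.join(parent_dir, sub)
        for dirpath, _dirnames, filenames in os.walk(root):
            if "OUTCAR" not in filenames:
                continue
            parts = os.path.relpath(dirpath, parent_dir).split(os.sep)
            for run_type in run_types:
                if run_type in parts:
                    found[run_type].append(os.path.join(dirpath, "OUTCAR"))
                    break
    return found


# Build a tiny throwaway tree (assumed layout) and count OUTCARs per run type.
with tempfile.TemporaryDirectory() as parent:
    for rel in ("temperature_hold/B104C16/300K", "small_strains/shear_strain/B96C24"):
        run_dir = os.path.join(parent, *rel.split("/"))
        os.makedirs(run_dir)
        open(os.path.join(run_dir, "OUTCAR"), "w").close()
    outcars = find_outcars(
        parent,
        sub_dirs=["temperature_hold", "small_strains"],
        run_types=["temperature_hold", "shear_strain"],
    )
    counts = {run_type: len(paths) for run_type, paths in outcars.items()}

print(counts)
```

The per-run-type dictionary mirrors the shape of the `count_virial` output shown further down, although there the values are frame counts rather than file counts.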

In [4]:
ms.systems

{'B104C16': Data Summary
 Labeled System
 -------------------
 Frame Numbers      : 100
 Atom Numbers       : 120
 Including Virials  : No
 Element List       :
 -------------------
 B  C
 104  16,
 'B96C24': Data Summary
 Labeled System
 -------------------
 Frame Numbers      : 100
 Atom Numbers       : 120
 Including Virials  : No
 Element List       :
 -------------------
 B  C
 96  24}

In [5]:
ms_virial.systems

{'B104C16': Data Summary
 Labeled System
 -------------------
 Frame Numbers      : 600
 Atom Numbers       : 120
 Including Virials  : Yes
 Element List       :
 -------------------
 B  C
 104  16,
 'B96C24': Data Summary
 Labeled System
 -------------------
 Frame Numbers      : 600
 Atom Numbers       : 120
 Including Virials  : Yes
 Element List       :
 -------------------
 B  C
 96  24}

`count_virial` is a dictionary giving the total number of frames from each run type.

In [6]:
count_virial

{'temperature_hold': 1200,
 'shear_strain': 0,
 'volumetric_strain': 0,
 'uniaxial_strain': 0}

Once the data has been transferred to a `MultiSystems` object, it can be converted to `.npy` files and stored in the `dest` directory.

In [7]:
import os  # needed for os.path.join below

# Convert the systems to .npy files with dpdata's to_deepmd_npy
ms.to_deepmd_npy(os.path.join(dest, 'no_virials'), set_size=1000000)
ms_virial.to_deepmd_npy(os.path.join(dest, 'virials'), set_size=1000000)

MultiSystems (2 systems containing 1200 frames)

#### `train_test_split`

Scans the destination directory for `.npy` files and splits them into training and validation sets. After running the cell below, the `dest` directory will contain a `training` and a `validation` directory, each holding the chosen proportion of the data (default 90-10 split).

In [8]:
train_test_split(
    destination_dir=dest,
    train_split=0.9,
    ms_virial=ms_virial,
    ms=ms)
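
To make the split proportion concrete, here is a minimal sketch of dividing frame indices into training and validation subsets. The function `split_frame_indices`, the shuffling, and the fixed seed are assumptions for illustration; whether `train_test_split` shuffles frames or splits them in order is not stated here.

```python
import random


def split_frame_indices(n_frames, train_split=0.9, seed=0):
    """Return (train, validation) lists of frame indices.

    Hypothetical helper: shuffles the indices reproducibly, then cuts
    the list at the requested training proportion.
    """
    indices = list(range(n_frames))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_split * n_frames))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# 600 frames, as in each system of ms_virial above, with a 90-10 split.
train_idx, val_idx = split_frame_indices(600, train_split=0.9)
print(len(train_idx), len(val_idx))
```

Every frame ends up in exactly one of the two subsets, so the validation set is never seen during training.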

#### `OUTCAR_to_npy`

Combines the above functions into one end-to-end call to simplify the data preparation stage.

In [9]:
import shutil
# Remove the output of the previous cells so the end-to-end call starts from a clean directory
shutil.rmtree(dest)

In [10]:
ms_nov,ms_v,count_nov,count_v = OUTCAR_to_npy(
    parent_dir=par,
    destination_dir=dest,
    run_types=run)

### `gen_cval_data`

This sub-module splits the training and validation data into k sets for use in k-fold cross-validation. It is mostly a back-end module used to prepare data for hyperparameter optimization.
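
For readers unfamiliar with k-fold cross-validation, the sketch below shows the basic idea on plain frame indices: the data is divided into k folds, and each fold serves once as the validation set while the remaining folds form the training set. The generator `k_fold_indices` and its round-robin fold assignment are assumptions for illustration, not the module's actual implementation.

```python
def k_fold_indices(n_frames, k):
    """Yield (training, validation) index lists for each of the k folds.

    Hypothetical helper: frames are dealt into folds round-robin, so fold
    sizes differ by at most one frame.
    """
    indices = list(range(n_frames))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield training, validation


splits = list(k_fold_indices(n_frames=10, k=4))
for fold_number, (training, validation) in enumerate(splits):
    print(fold_number, validation)
```

With `k = 4`, as in the `gen_data_dir` call below, four training/validation pairs are produced.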


#### `gen_data_dir`

In [11]:
from deepermd.data_prep import gen_cval_data

In [12]:
import os
train_path = 'example/virials/training_data'
destination = os.getcwd()

In [13]:
gen_cval_data.gen_data_dir(
    training_path=train_path,
    destination_path=destination,
    k = 4
)

cross-validation data directory tree generated at:
/blue/subhash/kimia.gh/python_course/Final_project/cval


This function generates a directory tree that will be accessed by the training module if cross-validation is chosen to be included during training.

<div>
<img src="images/cval_data_tree.png" width="1000"/>
</div>

**NOTE:** this function must be run before training with cross-validation.

## Model Training Sub-Package

Separated into four sub-modules: `hyperparam_optimization`, `hyperparam_train_test`, `post_training_handling`, and `train`.

In [14]:
from deepermd.training import hyperparam_optimization, hyperparam_train_test, post_training_handling, train

Define the parameters to optimize and the values to try for each. Each key of the `params` dictionary is the space-separated sequence of keys that leads to the desired parameter in the input JSON given to deepmd.

In [15]:
params={
    "model descriptor axis_neuron":[4,8],
    "model fitting_net neuron":[[10,10],[20,20]]
    }
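
The key convention can be illustrated with a short sketch that walks a nested dictionary along a space-separated key path and overrides the final entry. The helper `set_by_key_path` and the trimmed `base_config` are hypothetical stand-ins written for this example; DeeperMD's own handling of the keys may differ.

```python
import copy
import json


def set_by_key_path(config, key_path, value):
    """Return a copy of config with the entry named by key_path set to value.

    Hypothetical helper: key_path is a space-separated sequence of keys,
    e.g. "model descriptor axis_neuron".
    """
    updated = copy.deepcopy(config)
    *parents, leaf = key_path.split()
    node = updated
    for key in parents:
        node = node[key]  # raises KeyError if the path is missing from the JSON
    node[leaf] = value
    return updated


# Hypothetical, heavily trimmed stand-in for a deepmd input JSON.
base_config = {
    "model": {
        "descriptor": {"axis_neuron": 16},
        "fitting_net": {"neuron": [240, 240, 240]},
    }
}

modified = set_by_key_path(base_config, "model descriptor axis_neuron", 4)
print(json.dumps(modified["model"], indent=2))
```

Working on a deep copy keeps the base configuration untouched, so the same base can be reused for every parameter value.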

### Submodule `hyperparam_train_test`

#### `hyperparam_optimize()`
This function wraps the following functions together: 
- `hyperparam_train_test.json_dir_gen_1d`
- `hyperparam_train_test.hyperparam_train`
- `hyperparam_train_test.hyperparam_test`
- `post_training_handling.lammps_lat_const_modifier`
- `hyperparam_train_test.hyperparam_lammps`
    
The goal of this function is to generate JSON and model directories corresponding to a user-specified parameter dictionary. Models are trained from each JSON file, frozen, compressed (if desired), and tested, and their lattice constants and cohesive energies are evaluated via LAMMPS simulations (if desired).
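
The names `json_dir_gen_1d` and `1d_gridsearch` suggest a one-dimensional grid search, in which a single parameter is varied at a time while the others keep their base values. Under that assumption, the sketch below enumerates the models implied by a parameter dictionary. The helpers `one_d_grid` and `run_name`, and the directory naming scheme, are hypothetical and only meant to show how the number of models grows with the dictionary.

```python
def one_d_grid(param_dict):
    """List (key_path, value) pairs, one per model to train.

    Assumes a 1D search: each model changes exactly one parameter
    relative to the base JSON.
    """
    return [
        (key_path, value)
        for key_path, values in param_dict.items()
        for value in values
    ]


def run_name(key_path, value):
    """Build a hypothetical directory name from the last key and the value."""
    leaf = key_path.split()[-1]
    if isinstance(value, (list, tuple)):
        value_label = "-".join(str(v) for v in value)
    else:
        value_label = str(value)
    return f"{leaf}_{value_label}"


params = {
    "model descriptor axis_neuron": [4, 8],
    "model fitting_net neuron": [[10, 10], [20, 20]],
}

runs = one_d_grid(params)
names = [run_name(key_path, value) for key_path, value in runs]
print(names)
```

In this reading, the number of models is the sum of the list lengths in the dictionary (four here) rather than their product, as it would be for a full grid.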

In [16]:
import os
base = os.path.join(os.getcwd(),'base.json')

This function can take from several minutes to several hours to run, depending on the number of parameters chosen for optimization and on how many values are tried for each.

In [21]:
hyperparam_train_test.hyperparam_optimize(
    directory=os.getcwd(),
    base_json=base,
    param_dict=params,
    n=30,
    test_model='graph-compress.pb',
    d1_dir='1d_gridsearch',
    frozen_model='graph.pb',
    compression=True,
    compressed_model='graph-compress.pb',
    path_to_cval=os.path.join(os.getcwd(),'cval'),
    gen_cval_data=True,
    crossval=True,
    training_path=None,
    validation_path=None,
    test_path=None,
    multisystem=True,
    lammps=True,
    lammps_model='graph-compress.pb',
    lammps_data=os.path.join(os.getcwd(),'lammps_files','data.b4c_cell'),
    lammps_script=os.path.join(os.getcwd(),'lammps_files','in.lattice_constants'),
    ref_len=5.65,
    ref_coh=-7.2183)

