
Design Review 1: KLIFF Trainer framework #172

Closed
wants to merge 10 commits into from

Conversation


@ipcamit ipcamit commented Mar 4, 2024

Idea: a coherent, fully reproducible framework to train all supported OpenKIM models (including ML ones), with the ability to archive the exact training methodology so it can be reused and made available from OpenKIM.

The easiest approach is to create a yaml file that can be hashed to ensure integrity; this hash can then be included in kimspec.edn etc. for provenance. This PR contains the first draft of the framework, along with its implementation for KIM physics-based models. I think it is better to have a look at it now than to review huge changes all at once. An example yaml file is included at the bottom.

Core changes and contributions:

  1. Dataset: the dataset object can now load weights from a file (I think this was also on one of the todo lists). The weights file has to be a 4-column, whitespace-separated ascii file that numpy can load directly. The four columns represent the configuration, energy, forces, and stress weights. The file must have either 1 row (all weights are the same) or n rows (where n is the number of configurations the weights are to be set for). In both ML- and KIM-based training, these per-configuration weights will be used.
  2. _exceptions: This is just a proposal, and not a very strong one at that, but I think it might be useful to collect all of the file-based exceptions under a single file. This gives the flexibility to use them anywhere in KLIFF more freely. Currently we might run into circular imports if two related modules want to use the same exception (for example, the Trainer base class and KIMTrainer should both raise TrainerError). Let me know your thoughts.
  3. kim_residuals: a simple collection of the most-used loss functions. Another idea is to save a pickled version of the loss function for reproducibility.
  4. kliff_trainer: This is the base class for the trainer framework; all trainers must inherit from it. It includes some basic functionality:
    a. Initialize the class members from a yaml file / dict
    b. Seed all RNG modules: numpy, scipy, torch, etc.
    c. Set up the workspace and model folders, where each independent run is automatically timestamped and saved
    d. Set up the dataset: load the data, apply any property transformations (e.g. energy normalization), compute fingerprint objects, hash the dataset based on path, transforms, and weights, and save the processed dataset as a dill pickle object. The next time any trainer needs the same dataset configuration, it can be reloaded directly.
    e. Set up the test/train split, based either on index files or on ranges, and split the dataset into test and train data
    f. Call the trainer-specific functions setup_model, setup_parameter_transforms, and setup_optimizer
    g. Hash and archive the training configuration
  5. kim_trainer: First example of subclassing the trainer class, for KIM physics-based models. It extends setup_model, setup_parameter_transforms, setup_optimizer, and save_kim_model.
  6. enumerations: enumerated list of supported options.
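
The weights-file convention in item 1 could be sketched as follows. This is a minimal, hypothetical helper mirroring the described behavior, not the actual KLIFF API; numpy's `loadtxt` handles the whitespace-separated parsing:

```python
import io

import numpy as np


def load_weights(path_or_file, n_configs):
    """Load per-configuration weights from a 4-column, whitespace-separated
    ASCII file (columns: configuration, energy, forces, stress weights).

    A single row is broadcast to all configurations; otherwise the file
    must contain exactly n_configs rows.
    """
    weights = np.atleast_2d(np.loadtxt(path_or_file))
    if weights.shape[1] != 4:
        raise ValueError(f"expected 4 columns, got {weights.shape[1]}")
    if weights.shape[0] == 1:
        # one row: the same weights apply to every configuration
        weights = np.repeat(weights, n_configs, axis=0)
    elif weights.shape[0] != n_configs:
        raise ValueError(
            f"weights file must have 1 or {n_configs} rows, got {weights.shape[0]}"
        )
    return weights
```

For example, a one-line file `1.0 1.0 0.5 0.0` loaded for three configurations yields a (3, 4) array.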

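Steps a-g above can be sketched as a base class like the following. Class and method names are illustrative only, not the actual KLIFF code:

```python
import random


class BaseTrainer:
    """Sketch of the base-trainer workflow in steps a-g above."""

    def __init__(self, manifest: dict):
        self.manifest = manifest  # (a) members from yaml/dict
        self.seed_all(manifest.get("workspace", {}).get("seed", 12345))  # (b)
        self.setup_workspace()    # (c) timestamped run/model folders
        self.setup_dataset()      # (d) load, transform, hash, cache as pickle
        self.setup_splits()       # (e) test/train split
        self.setup_model()        # (f) trainer-specific hooks
        self.setup_optimizer()    # (f) continued
        self.archive_manifest()   # (g) hash and archive the configuration

    def seed_all(self, seed):
        # a real implementation would also seed numpy, scipy, torch, ...
        random.seed(seed)

    # hooks to be overridden by concrete trainers (e.g. a KIM trainer)
    def setup_workspace(self): pass
    def setup_dataset(self): pass
    def setup_splits(self): pass
    def setup_model(self): pass
    def setup_optimizer(self): pass
    def archive_manifest(self): pass
```
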
Tar model: a KIM model tarred into a single file, for future-proofing against LLNL/KIMKit.

Question: Is there an easy way for kimpy to tell me which model driver a model is using?

Future direction: I would eventually like to have a Trainer for each model that can be submitted to OpenKIM. For v1 we will have Torch and KIM; the Torch trainer will also have a companion Lightning-based trainer in v1.
For future versions, I think it would make sense to have Trainers that wrap around libraries like QUIP, to provide a complete training surface for newer QUIP-driver-based models, and so on. I would also like a Trainer built around the bootstrapping and MCMC sampling that Yonathan contributed, but that will take a little more time, hence perhaps v1.2. For v1 the UC part will remain unchanged.

Let me know your thoughts and questions. Sorry for the huge wall of text, but had I delayed it further, it would only have grown!

Example yaml file:

workspace:
    name: test_run # Name of the base workspace folder, where all the runs will be stored
    seed: 12345    # Seed for all random number generators (numpy, torch, etc.)
    resume: False  # Resume training from a previous run
    walltime: 2:00:00 # Walltime for the run

dataset:
    type: ase           # ase or path or colabfit
    path: Si.xyz        # Path to the dataset
    save: False         # Save processed dataset to a file
    shuffle: False      # Shuffle the dataset
    keys:
        energy: Energy  # Key for energy, if ase dataset is used
        forces: forces  # Key for forces, if ase dataset is used
    training_dataset:
        train_size: 3   # Number of training samples
        train_indices:  # files with indices [optional]
    val_dataset:
        val_size: 1     # Number of validation samples
        val_indices:    # files with indices [optional]
    test_dataset:
        test_size:
        test_indices:
    colabfit_dataset:
        dataset_name:
        database_name:
        database_url:

model:
    model_type: kim     # kim or torch or tar
    model_path: ./
    model_name: SW_StillingerWeber_1985_Si__MO_405512056662_006

transforms:
    property:     # optional: path to property transform file
        name: NormalizedPropertyTransform
        instance:   # optional: instance of custom property transform
        property_key: energy  # optional: key for the property to transform

    parameter:
        instance:
        parameter_list: # optional for KIM models, list of parameters to optimize
          - A:          # dict means the parameter is transformed
            name: LogParameterTransform
            value: 0.1
            bounds: [[-5, 5]]
          - B          # these are the parameters that are not transformed
          - sigma

    configuration: # optional: generate fingerprints from the configuration
        name: Descriptor
        kwargs: 
            cutoff: 3.7
            species: ["Si"]
            descriptor: "SymmetryFunctions"
            hyperparameters: set51
        instance:

training:
    loss:
        function: MSE
        weights: # optional: path to weights file
            energy: 1.0
            forces: 1.0
            config: 1.0
            stress: 0.0
        normalize_per_atom: true
    optimizer:
        name: L-BFGS-B
        provider: scipy
        learning_rate: 
        kwargs:
          tol: 1.e-6 # the decimal point is necessary; yaml parses plain 1e-6 as a string

    batch_size: 1
    epochs: 1
    device: cpu
    num_workers: 2
    chkpt_interval: 1
    stop_condition:
    verbose: True

export: # optional: export the trained model
    model_type: tar # kim or tar
    model_path: ./
    model_name: SW_StillingerWeber_trained_1985_Si__MO_405512056662_006

@codecov-commenter

codecov-commenter commented Mar 4, 2024

Codecov Report

Attention: Patch coverage is 2.31548%, with 675 lines in your changes missing coverage. Please review.

Project coverage is 55.43%. Comparing base (9ee3ce8) to head (2f57f56).

❗ Current head 2f57f56 differs from pull request most recent head d3050a6. Consider uploading reports for the commit d3050a6 to get more accurate results

Files Patch % Lines
kliff/trainer/kliff_trainer.py 0.00% 346 Missing ⚠️
kliff/trainer/kim_trainer.py 0.00% 182 Missing ⚠️
kliff/trainer/option_enumerations.py 0.00% 87 Missing ⚠️
kliff/dataset/dataset.py 23.63% 42 Missing ⚠️
kliff/utils.py 27.27% 8 Missing ⚠️
kliff/trainer/kim_residuals.py 0.00% 5 Missing ⚠️
kliff/_exceptions.py 0.00% 3 Missing ⚠️
kliff/trainer/__init__.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               v1     #172      +/-   ##
==========================================
- Coverage   63.14%   55.43%   -7.71%     
==========================================
  Files          50       56       +6     
  Lines        4718     5401     +683     
==========================================
+ Hits         2979     2994      +15     
- Misses       1739     2407     +668     


kliff/_exceptions.py (outdated review thread; resolved)
@@ -630,10 +638,37 @@ def _read_from_colabfit(
logger.error(f"{colabfit_dataset} is either empty or does not exist")
raise DatasetError(f"{colabfit_dataset} is either empty or does not exist")

if isinstance(weight, Path):
Collaborator:

Let's use `if not isinstance(weight, Weight):` to avoid the case of a string being passed in as a Path.

Alternatively, we can add a utility function to utils.py to convert to Path like this one:

def to_path(path):
    return Path(path).expanduser().resolve()

I prefer the latter.

Author:

The weight setting method is modified, so this comment might not be valid anymore

kliff/dataset/dataset.py (two outdated review threads; resolved)
@@ -730,9 +779,37 @@ def _read_from_path(
parent = path.parent
all_files = [path]

if isinstance(weight, Path):
Collaborator:

The weights reading part is the same as what was used above. So, let's create a function for it.

kliff/trainer/kliff_trainer.py (outdated review thread; resolved)
from kliff.transforms.parameter_transforms import ParameterTransform
from kliff.transforms.property_transforms import PropertyTransform

from ..dataset.weight import Weight
Collaborator:

Probably OK to use a relative import from the current directory, but let's change to absolute imports for all other cases like this one, `..dataset.weight`.

kliff/trainer/kliff_trainer.py (two outdated review threads; resolved)
f"Optimizable parameters must be string or value dict. Got {input_params} instead."
)

def setup_optimizer(self):
Collaborator:

Should this be implemented as get_optimizer() below?

Author:

setup_optimizer was chosen as it might be more involved, like setting up EMA, LR decay, etc. I have no preference either way.

@mjwen
Collaborator

mjwen commented Mar 27, 2024

Sorry for the late reply!

Question: Is there an easy way for kimpy to tell me which model driver a model is using?

I cannot remember. Better check with Ryan.

I only have major comments for the trainer.

  1. I don't think the trainer needs to deal with the dataset (like dataset initialization and splitting). Instead, the train/val/test datasets should be set up outside the trainer, and their instances passed to the fit / test functions of the trainer. This follows the lightning approach, and I think it is a good one.
  2. Related to 1, it might be better to pass only the training-related part of the config dict to the trainer; the other components (model, dataset, transforms) would all be passed in as instances. Basically, I am thinking of a trainer like:
class Trainer:
    def __init__(
        self,
        model: Union[KIMModel, TorchModel, TarModel],
        transform: Transform,
        training_config: dict,
    ):
        ...

    def fit(self, train_loader, val_loader):
        ...

    def test(self, test_loader):
        ...

    # other necessary functions, such as save_kim_model

As you can see, this is again modeled after lightning. Moving the instantiation of the model, dataset/loader, and transforms out of the Trainer makes it much simpler, which should make it easier to maintain.

  3. There exist good packages for dealing with configs that we can take advantage of. We don't need to do all the checking in the trainer for every hyperparameter, which would be very difficult to maintain in the long run. Specifically, I am considering omegaconf and hydra (the latter uses the former). The nice feature of hydra is that you can define a base config that is shared by all and customize it for various specific cases. It is also composable, making it much easier to separate the configs for the various parts. schnetpack uses it: https://github.com/atomistic-machine-learning/schnetpack/blob/master/src/schnetpack/configs/train.yaml and I've used it in one of my projects as well, https://github.com/mjwen/rxnrep/tree/main/configs, which may give you a feeling for what it looks like. hydra also has utility functions, e.g. for instantiating classes (https://hydra.cc/docs/advanced/instantiate_objects/overview/), which we can take advantage of instead of using importlib.
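
For reference, the class-instantiation pattern that hydra's `instantiate` automates (and that would otherwise be hand-rolled with importlib) is roughly the following. This is a simplified stdlib-only sketch of the idea, not hydra itself:

```python
import importlib


def instantiate(config: dict):
    """Build an object from a config node whose '_target_' key holds a
    dotted class path, passing the remaining keys as keyword arguments.
    """
    config = dict(config)  # copy so the caller's dict is untouched
    module_path, _, name = config.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), name)
    return cls(**config)


# e.g. a config node {"_target_": "some.module.Optimizer", "lr": 0.01}
# would import some.module and call Optimizer(lr=0.01)
```
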

@ipcamit
Author

ipcamit commented Mar 30, 2024

I only have major comments for the trainer.

1. I don't think the trainer needs to deal with the dataset (like dataset initialization and splitting). Instead, the train/val/test datasets should be set up outside the trainer, and their instances passed to the `fit` / `test` functions of the trainer. This follows the lightning approach, and I think it is a good one.

I can look into the hydra library, but for datasets, models, etc., do you think it would be better to have separate utility functions to set up datasets and models? The whole idea is to make sure that, using this single yaml file, you get the same results in terms of dataset splits etc.

@mjwen
Collaborator

mjwen commented Mar 30, 2024

Yes, I'm still imagining a single yaml file to configure all the components, but that single yaml file can be composed from a couple of separate files. If one does not like to use multiple files, all the parts can still be put in a single file.

do you think it would be better to have separate utility functions to set up datasets and models?

I guess we can provide these low-level parts for more granularity; they would be extremely useful if one wants to build a new trainer for a new model and such. At the same time, we can use the low-level functionality to achieve one-yaml-file-style setups for reproducibility. This can be done, e.g., via an example training script that first instantiates all the components (dataset, model, etc.) and then passes them to the trainer to train the model and log metrics. I guess the major difference is instantiating the classes inside the Trainer or outside; I think the latter is slightly better.

@mjwen
Collaborator

mjwen commented Mar 30, 2024

I have mixed feelings about hydra: it can simplify config management, but tracking and debugging can be a bit more involved. I am not sure whether we want to use it here; perhaps you can explore it a bit and then we can decide. But even if we don't use hydra, I think we still need to use OmegaConf to read, merge, and write config files, instead of the vanilla yaml module.

@ipcamit
Author

ipcamit commented Mar 30, 2024

Yes, I'm still imagining a single yaml file to configure all the components, but that single yaml file can be composed from a couple of separate files. If one does not like to use multiple files, all the parts can still be put in a single file.

This condition is already satisfied to some extent. The configuration is split into six independent blocks:

  1. workspace: local folder/cache/other information directory.
  2. dataset: instance of KLIFF dataset
  3. model: instance of KIMModel or TorchScript model
  4. transforms: instance of property, parameter, configuration transforms
  5. training: loss function, optimizer, verbose
  6. export: export the trained model, KIM-API model or tarball

I guess I can try splitting it into three modules:

  1. Dataset: set up the dataset, along with the test/train split
  2. Model: set up the KIM, Torch, Tar models, etc. It will also consume the export module
  3. Trainer module: workspace management (base class), training loop (child class), optimizer (child class)

The optimizer could be a separate module, but it would be difficult to organize as one owing to scipy's different API. We can make a private optimizer module for torch.

Thoughts, suggestions?

Reason for above division:

  1. workspace: can't be independent, as this is specifically set up and used by the trainer
  2. dataset: can be independent; currently set up using setup_dataset
  3. model: can be independent; currently set up using setup_model
  4. transforms: parameter transforms cannot be independent, and configuration transforms are only used by torch, so a general module might not be that coherent. So it is included in the Trainer.
  5. training: trainer module
  6. export: can be merged with model

@mjwen
Collaborator

mjwen commented Mar 30, 2024

Yes, the previous yaml file does serve the purpose of separating them into different parts. We just need to organize the internal code in that direction a bit.

I agree with the three-module approach. In particular, we want to integrate the optimizer with the trainer, for the exact reason you've mentioned.

Also agreed that workspace settings need to be separate, and not associated with a trainer.

@ipcamit
Author

ipcamit commented Apr 17, 2024

  1. Simplified the trainer.
  2. Initialization methods for the model and datasets.

Note on glossary: everywhere, the configuration dictionary is called a "manifest", as "configuration" was becoming too confusing; for example, "dataset_from_configuration" creates confusion about whether it uses the kliff.Configuration object. Hence I suggest we use "manifest" as the word of choice for yaml configurations. So the KLIFF trainer takes in a training manifest, splits it into model, dataset, and transform manifests, and passes each to the appropriate static functions in Dataset and KIMModel, which return the required objects.
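
The manifest-splitting step described above might look roughly like this (hypothetical helper; the block names follow the example yaml earlier in the thread):

```python
def split_manifest(manifest: dict) -> dict:
    """Split a full training manifest into per-component manifests.

    Missing blocks default to empty dicts, so downstream consumers can
    treat every component manifest uniformly.
    """
    blocks = ("workspace", "dataset", "model", "transforms", "training", "export")
    return {key: manifest.get(key, {}) for key in blocks}
```
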

Regarding OmegaConf, I did integrate it but had to remove it afterwards, as it adds a lot of needless complexity with no clear advantage. The biggest issue was that it only supports "primitive" datatypes in Python; something as basic as a numpy array needs custom solutions. I think dict is flexible enough to be the better solution.

There might be some rough edges, which will clear up as we add more functionality, like the ML trainer.

@ipcamit
Author

ipcamit commented Apr 20, 2024

Disregard for the time being, I need to work a bit more on this before opening the PR again.

@ipcamit ipcamit closed this Apr 20, 2024