
Design Review 1: KLIFF Trainer framework #172

Closed
wants to merge 10 commits into from

Conversation


@ipcamit ipcamit commented Mar 4, 2024

Idea: a coherent, fully reproducible framework to train all supported OpenKIM models (including ML ones), with the ability to archive the exact training methodology so it can be reused and made available from OpenKIM.

The easiest approach is to create a yaml file that can be hashed to ensure integrity; this hash can then be included in kimspec.edn etc. for provenance. This PR contains the first draft of the framework, along with its implementation for KIM physics-based models. I think it is better to have a look at it now than to review huge changes all at once. An example yaml file is included at the bottom.

Core changes and contributions:

  1. Dataset: the dataset object can now load weights from a file (I think this was also on one of the todo lists). The weights file has to be a 4-column, whitespace-separated ascii file that numpy can load directly. The four columns represent the configuration, energy, forces, and stress weights. The file must have either 1 row (all weights are the same) or n rows (where n is the number of configurations the weights are to be set for). In both ML- and KIM-based training, these per-configuration weights will be used.
  2. _exceptions: This is just a proposal, and not a very strong one at that, but I think it might be useful to collect all of the file-based exceptions under a single file. This gives the flexibility to use them anywhere in KLIFF more freely. Currently we might run into circular imports if two related modules want to use the same exception (for example, the Trainer base class and KIMTrainer should both raise TrainerError). Let me know your thoughts.
  3. kim_residuals: a simple collection of the most-used loss functions. Another idea is to save a pickled version of the loss function for reproducibility.
  4. kliff_trainer: This is the base class for the trainer framework; all trainers must inherit from it. It includes some basic functionality:
    a. Initialize the class members from a yaml file / dict
    b. Seed all RNG modules: numpy, scipy, torch, etc.
    c. Set up the workspace and model folders, where each independent run is automatically timestamped and saved
    d. Set up the dataset: load the data, apply any property transformations (e.g. energy normalization), compute fingerprint objects, hash the dataset based on path, transforms, and weights, and save the processed dataset as a dill pickle object. The next time any trainer needs the same dataset configuration, it can be reloaded directly.
    e. Set up the test/train split, based either on index files or on ranges, and split the dataset into test and train data
    f. Call the trainer-specific functions setup_model, setup_parameter_transforms, and setup_optimizer
    g. Hash and archive the training configuration
  5. kim_trainer: First example of subclassing the trainer class, for KIM physics-based models. It extends setup_model, setup_parameter_transforms, setup_optimizer, and save_kim_model.
  6. enumerations: enumerated list of supported options.
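
The weights-file convention in item 1 could be sketched as follows. This is a minimal, hypothetical helper mirroring the described behavior, not the actual KLIFF API; numpy's `loadtxt` handles the whitespace-separated parsing:

```python
import io

import numpy as np


def load_weights(path_or_file, n_configs):
    """Load per-configuration weights from a 4-column, whitespace-separated
    ASCII file (columns: configuration, energy, forces, stress weights).

    A single row is broadcast to all configurations; otherwise the file
    must contain exactly n_configs rows.
    """
    weights = np.atleast_2d(np.loadtxt(path_or_file))
    if weights.shape[1] != 4:
        raise ValueError(f"expected 4 columns, got {weights.shape[1]}")
    if weights.shape[0] == 1:
        # one row: the same weights apply to every configuration
        weights = np.repeat(weights, n_configs, axis=0)
    elif weights.shape[0] != n_configs:
        raise ValueError(
            f"weights file must have 1 or {n_configs} rows, got {weights.shape[0]}"
        )
    return weights
```

For example, a one-line file `1.0 1.0 0.5 0.0` loaded for three configurations yields a (3, 4) array.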

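Steps a-g above can be sketched as a base class like the following. Class and method names are illustrative only, not the actual KLIFF code:

```python
import random


class BaseTrainer:
    """Sketch of the base-trainer workflow in steps a-g above."""

    def __init__(self, manifest: dict):
        self.manifest = manifest  # (a) members from yaml/dict
        self.seed_all(manifest.get("workspace", {}).get("seed", 12345))  # (b)
        self.setup_workspace()    # (c) timestamped run/model folders
        self.setup_dataset()      # (d) load, transform, hash, cache as pickle
        self.setup_splits()       # (e) test/train split
        self.setup_model()        # (f) trainer-specific hooks
        self.setup_optimizer()    # (f) continued
        self.archive_manifest()   # (g) hash and archive the configuration

    def seed_all(self, seed):
        # a real implementation would also seed numpy, scipy, torch, ...
        random.seed(seed)

    # hooks to be overridden by concrete trainers (e.g. a KIM trainer)
    def setup_workspace(self): pass
    def setup_dataset(self): pass
    def setup_splits(self): pass
    def setup_model(self): pass
    def setup_optimizer(self): pass
    def archive_manifest(self): pass
```
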
Tar model: a KIM model tarred into a single file, for future-proofing against LLNL/KIMKit.

Question: Is there an easy way for kimpy to tell me which model driver a model is using?

Future direction: I would eventually like to have a Trainer for each model that can be submitted to OpenKIM. For v1 we will have Torch and KIM; the Torch trainer will also have a companion Lightning-based trainer in v1.
For future versions, I think it would make sense to have Trainers that wrap around libraries like QUIP, to provide a complete training surface for newer QUIP-driver-based models, and so on. I would also like a Trainer built around the bootstrapping and MCMC sampling that Yonathan contributed, but that will take a little more time, hence perhaps v1.2. For v1 the UC part will remain unchanged.

Let me know your thoughts and questions. Sorry for the huge wall of text, but had I delayed it further, it would only have grown!

Example yaml file:

workspace:
    name: test_run # Name of the base workspace folder, where all the runs will be stored
    seed: 12345    # Seed for all random number generators (numpy, torch, etc.)
    resume: False  # Resume training from a previous run
    walltime: 2:00:00 # Walltime for the run

dataset:
    type: ase           # ase or path or colabfit
    path: Si.xyz        # Path to the dataset
    save: False         # Save processed dataset to a file
    shuffle: False      # Shuffle the dataset
    keys:
        energy: Energy  # Key for energy, if ase dataset is used
        forces: forces  # Key for forces, if ase dataset is used
    training_dataset:
        train_size: 3   # Number of training samples
        train_indices:  # files with indices [optional]
    val_dataset:
        val_size: 1     # Number of validation samples
        val_indices:    # files with indices [optional]
    test_dataset:
        test_size:
        test_indices:
    colabfit_dataset:
        dataset_name:
        database_name:
        database_url:

model:
    model_type: kim     # kim or torch or tar
    model_path: ./
    model_name: SW_StillingerWeber_1985_Si__MO_405512056662_006

transforms:
    property:     # optional: path to property transform file
        name: NormalizedPropertyTransform
        instance:   # optional: instance of custom property transform
        property_key: energy  # optional: key for the property to transform

    parameter:
        instance:
        parameter_list: # optional for KIM models, list of parameters to optimize
          - A:          # dict means the parameter is transformed
            name: LogParameterTransform
            value: 0.1
            bounds: [[-5, 5]]
          - B          # these are the parameters that are not transformed
          - sigma

    configuration: # optional: generate fingerprints from the configuration
        name: Descriptor
        kwargs: 
            cutoff: 3.7
            species: ["Si"]
            descriptor: "SymmetryFunctions"
            hyperparameters: set51
        instance:

training:
    loss:
        function: MSE
        weights: # optional: path to weights file
            energy: 1.0
            forces: 1.0
            config: 1.0
            stress: 0.0
        normalize_per_atom: true
    optimizer:
        name: L-BFGS-B
        provider: scipy
        learning_rate: 
        kwargs:
          tol: 1.e-6 # the decimal point is necessary; yaml parses plain 1e-6 as a string

    batch_size: 1
    epochs: 1
    device: cpu
    num_workers: 2
    chkpt_interval: 1
    stop_condition:
    verbose: True

export: # optional: export the trained model
    model_type: tar # kim or tar
    model_path: ./
    model_name: SW_StillingerWeber_trained_1985_Si__MO_405512056662_006

@codecov-commenter

codecov-commenter commented Mar 4, 2024

Codecov Report

Attention: Patch coverage is 2.31548%, with 675 lines in your changes missing coverage. Please review.

Project coverage is 55.43%. Comparing base (9ee3ce8) to head (2f57f56).

❗ Current head 2f57f56 differs from pull request most recent head d3050a6. Consider uploading reports for the commit d3050a6 to get more accurate results

Files Patch % Lines
kliff/trainer/kliff_trainer.py 0.00% 346 Missing ⚠️
kliff/trainer/kim_trainer.py 0.00% 182 Missing ⚠️
kliff/trainer/option_enumerations.py 0.00% 87 Missing ⚠️
kliff/dataset/dataset.py 23.63% 42 Missing ⚠️
kliff/utils.py 27.27% 8 Missing ⚠️
kliff/trainer/kim_residuals.py 0.00% 5 Missing ⚠️
kliff/_exceptions.py 0.00% 3 Missing ⚠️
kliff/trainer/__init__.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               v1     #172      +/-   ##
==========================================
- Coverage   63.14%   55.43%   -7.71%     
==========================================
  Files          50       56       +6     
  Lines        4718     5401     +683     
==========================================
+ Hits         2979     2994      +15     
- Misses       1739     2407     +668     


kliff/_exceptions.py (outdated review thread; resolved)
@@ -630,10 +638,37 @@ def _read_from_colabfit(
logger.error(f"{colabfit_dataset} is either empty or does not exist")
raise DatasetError(f"{colabfit_dataset} is either empty or does not exist")

if isinstance(weight, Path):
Collaborator:

Let's use `if not isinstance(weight, Weight):` to avoid the case of a string being passed in as a Path.

Alternatively, we can add a utility function to utils.py to convert to Path like this one:

def to_path(path):
    return Path(path).expanduser().resolve()

I prefer the latter.

Author:

The weight setting method is modified, so this comment might not be valid anymore

kliff/dataset/dataset.py (two outdated review threads; resolved)
@@ -730,9 +779,37 @@ def _read_from_path(
parent = path.parent
all_files = [path]

if isinstance(weight, Path):
Collaborator:

The weights reading part is the same as what was used above. So, let's create a function for it.

kliff/trainer/kliff_trainer.py (outdated review thread; resolved)
from kliff.transforms.parameter_transforms import ParameterTransform
from kliff.transforms.property_transforms import PropertyTransform

from ..dataset.weight import Weight
Collaborator:

Probably OK to use a relative import from the current directory, but let's change to absolute imports for all other cases like this one, `..dataset.weight`.

kliff/trainer/kliff_trainer.py (two outdated review threads; resolved)
f"Optimizable parameters must be string or value dict. Got {input_params} instead."
)

def setup_optimizer(self):
Collaborator:

Should this be implemented as get_optimizer() below?

Author:

setup_optimizer was chosen as it might be more involved, like setting up EMA, LR decay, etc. I have no preference either way.

@mjwen
Collaborator

mjwen commented Mar 27, 2024

Sorry for the late reply!

Question: Is there an easy way for kimpy to tell me which model driver a model is using?

I cannot remember. Better check with Ryan.

I only have major comments for the trainer.

  1. I don't think the trainer needs to deal with the dataset (like dataset initialization and splitting). Instead, the train/val/test datasets should be set up outside the trainer, and their instances passed to the fit / test functions of the trainer. This follows the lightning approach, and I think it is a good one.
  2. Related to 1, it might be better to pass only the training-related part of the config dict to the trainer; the other components (model, dataset, transforms) would all be passed in as instances. Basically, I am thinking of a trainer like:
class Trainer:
    def __init__(
        self,
        model: Union[KIMModel, TorchModel, TarModel],
        transform: Transform,
        training_config: dict,
    ):
        ...

    def fit(self, train_loader, val_loader):
        ...

    def test(self, test_loader):
        ...

    # other necessary functions, such as save_kim_model

As you can see, this is again modeled after lightning. Moving the instantiation of the model, dataset/loader, and transforms out of the Trainer makes it much simpler, which should make it easier to maintain.

  3. There exist good packages for dealing with configs that we can take advantage of. We don't need to do all the checking in the trainer for every hyperparameter, which would be very difficult to maintain in the long run. Specifically, I am considering omegaconf and hydra (the latter uses the former). The nice feature of hydra is that you can define a base config that is shared by all and customize it for various specific cases. It is also composable, making it much easier to separate the configs for the various parts. schnetpack uses it: https://github.com/atomistic-machine-learning/schnetpack/blob/master/src/schnetpack/configs/train.yaml and I've used it in one of my projects as well, https://github.com/mjwen/rxnrep/tree/main/configs, which may give you a feeling for what it looks like. hydra also has utility functions, e.g. for instantiating classes (https://hydra.cc/docs/advanced/instantiate_objects/overview/), which we can take advantage of instead of using importlib.
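
For reference, the class-instantiation pattern that hydra's `instantiate` automates (and that would otherwise be hand-rolled with importlib) is roughly the following. This is a simplified stdlib-only sketch of the idea, not hydra itself:

```python
import importlib


def instantiate(config: dict):
    """Build an object from a config node whose '_target_' key holds a
    dotted class path, passing the remaining keys as keyword arguments.
    """
    config = dict(config)  # copy so the caller's dict is untouched
    module_path, _, name = config.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), name)
    return cls(**config)


# e.g. a config node {"_target_": "some.module.Optimizer", "lr": 0.01}
# would import some.module and call Optimizer(lr=0.01)
```
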

@ipcamit
Author

ipcamit commented Mar 30, 2024

I only have major comments for the trainer.

1. I don't think the trainer needs to deal with the dataset (like dataset initialization and splitting). Instead, the train/val/test datasets should be set up outside the trainer, and their instances passed to the `fit` / `test` functions of the trainer. This follows the lightning approach, and I think it is a good one.

I can look into the hydra library, but for datasets, models, etc., do you think it would be better to have separate utility functions to set up datasets and models? The whole idea is to make sure that, using this single yaml file, you get the same results in terms of dataset splits etc.

@mjwen
Collaborator

mjwen commented Mar 30, 2024

Yes, I'm still imagining a single yaml file to configure all the components, but that single yaml file can be composed from a couple of separate files. If one does not like to use multiple files, all the parts can still be put in a single file.

do you think it would be better to have separate utility functions to set up datasets and models?

I guess we can provide these low-level parts for more granularity; they would be extremely useful if one wants to build a new trainer for a new model and such. At the same time, we can use the low-level functionality to achieve one-yaml-file-style setups for reproducibility. This can be done, e.g., via an example training script that first instantiates all the components (dataset, model, etc.) and then passes them to the trainer to train the model and log metrics. I guess the major difference is instantiating the classes inside the Trainer or outside; I think the latter is slightly better.

@mjwen
Collaborator

mjwen commented Mar 30, 2024

I have mixed feelings about hydra: it can simplify config management, but tracking and debugging can be a bit more involved. I am not sure whether we want to use it here; perhaps you can explore it a bit and then we can decide. But even if we don't use hydra, I think we still need to use OmegaConf to read, merge, and write config files, instead of the vanilla yaml module.

@ipcamit
Author

ipcamit commented Mar 30, 2024

Yes, I'm still imagining a single yaml file to configure all the components, but that single yaml file can be composed from a couple of separate files. If one does not like to use multiple files, all the parts can still be put in a single file.

This condition is already satisfied to some extent. The configuration is split into six independent blocks:

  1. workspace: local folder/cache/other information directory.
  2. dataset: instance of KLIFF dataset
  3. model: instance of KIMModel or TorchScript model
  4. transforms: instance of property, parameter, configuration transforms
  5. training: loss function, optimizer, verbose
  6. export: export the trained model, KIM-API model or tarball

I guess I can try splitting it into three modules:

  1. Dataset: set up the dataset, along with the test/train split
  2. Model: set up the KIM, Torch, Tar models, etc. It will also consume the export module
  3. Trainer module: workspace management (base class), training loop (child class), optimizer (child class)

The optimizer could be a separate module, but it would be difficult to organize as one owing to scipy's different API. We can make a private optimizer module for torch.

Thoughts, suggestions?

Reason for above division:

  1. workspace: can't be independent, as this is specifically set up and used by the trainer
  2. dataset: can be independent; currently set up using setup_dataset
  3. model: can be independent; currently set up using setup_model
  4. transforms: parameter transforms cannot be independent, and configuration transforms are only used by torch, so a general module might not be that coherent. So it is included in the Trainer.
  5. training: trainer module
  6. export: can be merged with model

@mjwen
Collaborator

mjwen commented Mar 30, 2024

Yes, the previous yaml file does serve the purpose of separating them into different parts. We just need to organize the internal code in that direction a bit.

I agree with the three-module approach. In particular, we want to integrate the optimizer with the trainer, for the exact reason you've mentioned.

Also agreed that workspace settings need to be separate, and not associated with a trainer.

@ipcamit
Author

ipcamit commented Apr 17, 2024

  1. Simplified the trainer.
  2. Initialization methods for the model and datasets.

Note on glossary: everywhere, the configuration dictionary is called a "manifest", as "configuration" was becoming too confusing; for example, "dataset_from_configuration" creates confusion about whether it uses the kliff.Configuration object. Hence I suggest we use "manifest" as the word of choice for yaml configurations. So the KLIFF trainer takes in a training manifest, splits it into model, dataset, and transform manifests, and passes each to the appropriate static functions in Dataset and KIMModel, which return the required objects.
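
The manifest-splitting step described above might look roughly like this (hypothetical helper; the block names follow the example yaml earlier in the thread):

```python
def split_manifest(manifest: dict) -> dict:
    """Split a full training manifest into per-component manifests.

    Missing blocks default to empty dicts, so downstream consumers can
    treat every component manifest uniformly.
    """
    blocks = ("workspace", "dataset", "model", "transforms", "training", "export")
    return {key: manifest.get(key, {}) for key in blocks}
```
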

Regarding OmegaConf, I did integrate it but had to remove it afterwards, as it adds a lot of needless complexity with no clear advantage. The biggest issue was that it only supports "primitive" datatypes in Python; something as basic as a numpy array needs custom solutions. I think dict is flexible enough to be the better solution.

There might be some rough edges, which will clear up as we add more functionality, like the ML trainer.

@ipcamit
Author

ipcamit commented Apr 20, 2024

Disregard for the time being, I need to work a bit more on this before opening the PR again.

@ipcamit ipcamit closed this Apr 20, 2024