Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify for unknown architecture fields #241

Merged
merged 5 commits into from
Jun 27, 2024
Merged

Verify for unknown architecture fields #241

merged 5 commits into from
Jun 27, 2024

Conversation

PicoCentauri
Copy link
Contributor

@PicoCentauri PicoCentauri commented Jun 10, 2024

Partly closes #168 since for now I only added a verification for the architecture hypers using a function check_architecture_options. For the checking the dataset it is a bit more complicated because the yaml layout is very flexible. I think it is doable but also to keep the PR a bit smaller I will do this in another run.

While touching the code I also moved the check_options_list inside the dataset expansion because these two functions always will be called together. Also, there was no test if the written option_restart.yaml is actually a valid input to start another training run.

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

馃摎 Documentation preview 馃摎: https://metatrain--241.org.readthedocs.build/en/241/

@PicoCentauri PicoCentauri changed the title Verify architecture keys against reference Verify for unknown architecture fields Jun 10, 2024
@frostedoyster
Copy link
Collaborator

Thanks a lot for the work.
I tried this branch but I'm still finding some issues in practice:

architecture:
  name: experimental.soap_bpnn
  training:
    batch_size: 2
    num_epoch: 1

training_set:
  systems:
    read_from: qm9_reduced_100.xyz
    length_unit: angstrom
  targets:
    energy:
      key: U0
      unit: eV

test_set: 0.5
validation_set: 0.1

With this config, we would like to see an error (the key is supposed to be num_epochs, not num_epoch).
This config results in training for 100 epochs (the default for SOAP-BPNN). No warnings or errors

@PicoCentauri
Copy link
Contributor Author

Interesting let me check this again.

@PicoCentauri
Copy link
Contributor Author

I fixed checking the options you posted @frostedoyster and added them as a test.

@frostedoyster
Copy link
Collaborator

We seem to have some issues:
In the SOAP-BPNN, try adding per_structure_targets: ["energy"] to the training options. This fails and it's not supposed to...
Much more concerning: this also fails (also training section of SOAP-BPNN): loss_weights: {"energy": 1.0}, and, with the current design, it's expected to fail. I would propose to ignore dicts that are empty in the reference

@Luthaf
Copy link
Contributor

Luthaf commented Jun 12, 2024

I guess that the issue is that we are trying to use the default hypers as a schema for all possible hypers.

If we want to do proper validation, we should have an actual schema. JSON schema seems to be the main industry standard: https://json-schema.org/. Trying to do validation without a schema, we will always have either things we can't validate, or things we are too strict validating.

However, this would introduce more work for architecture contributors, so it might be worth to discuss it more in depth later. One thing we could have to reduce this work would be some tool to auto-generate the schema from an example YAML. I'm sure something like this already exists.

@PicoCentauri
Copy link
Contributor Author

Yes, I thought we can avoid using a JSON schema but it seems like we can't. I will explore if there is an easy way that is not too much work for developers.

Comment on lines 131 to 136
"stress": {
"$ref": "#/$defs/gradient_section"
},
"virial": {
"$ref": "#/$defs/gradient_section"
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we already that stress and virial are exclusive in the schema?

Also, the gradients are hardcoded. I think this is fine for now...

Comment on lines +124 to +127
with open(PACKAGE_ROOT / "share/schema-base.json", "r") as f:
schema_base = json.load(f)

jsonschema.validate(instance=OmegaConf.to_container(options), schema=schema_base)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can wrap these two lines in a function check_base_options, but might be an overkill...

Copy link
Contributor

@Luthaf Luthaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code and checks looks pretty good, I think we need more documentation about this in the contributors documentation.

Is there a tool that can create a (non-ideal) schema from an example YAML file? We could recommend it as a starting point.

"model": {
"type": "object",
"properties": {
"soap": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be interesting to check if we can automatically import rascaline's JSON schema file here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be very cool. But then we have to generate this schemas dynamically based on the location of rascaline or what do have in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rascaline currently does not distribute schema (they are only used for documentation generation). I was thinking we could copy the file from rascaline next to this one, and update it whenever we update the pin on rascaline version

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds like a good idea. If there is a way to use a distributed schema we can try it once we changed the schema in rascaline.

tests/cli/test_train_model.py Show resolved Hide resolved
@PicoCentauri
Copy link
Contributor Author

The code and checks looks pretty good, I think we need more documentation about this in the contributors documentation.

Yes sure I can add additional information in the section about adding a new architecture.

Is there a tool that can create a (non-ideal) schema from an example YAML file? We could recommend it as a starting point.

I checked a couple tools in the beginning, but all of them are not great. In the end, I used CHATGPT for generating, which was much better compared the other tools. I think it is okay if we can recommend this.

@Luthaf
Copy link
Contributor

Luthaf commented Jun 21, 2024

In the end, I used CHATGPT for generating, which was much better compared the other tools. I think it is okay if we can recommend this.

We can suggest some tools with their limitations, and say that we also had good success using ChatGPT/LLM for this!

@PicoCentauri
Copy link
Contributor Author

I updated the page for documentation including tools for generating the schemas.

@Luthaf and regarding linking the rascaline schemas: jsonschemas fortunately supports referencing external schemas from local files and URLs. But, this requires some not super easy custom to a Validator class. I played around with this, but I think I need some more time to get this to work in a robust way. If you agree, I will explore this further and add this in an upcoming PR.

Copy link
Contributor

@Luthaf Luthaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me, but I'd like to see an approval from the different architecture maintainers on the new json-schema files.

If you can also open issues for the things left for future PR, that would be great!

@Luthaf
Copy link
Contributor

Luthaf commented Jun 24, 2024

@frostedoyster @abmazitov @DavideTisi @spozdn Can you look at the new json-schema file in your architecture and let us know if (a) you understand the file format and what it is doing; and (b) you agree with the types/constrains of everything?

@frostedoyster
Copy link
Collaborator

It looks good to me and the schemas are quite intuitive

Copy link
Contributor

@DavideTisi DavideTisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool, just a bit pedantic but i do not have a better idea

"properties": {
"name": {
"type": "string",
"enum": ["experimental.gap"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why enum and not string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if you want to explicitly match a string to a specific value with jsonschema you have to use an enum

@frostedoyster frostedoyster merged commit 8ae0a94 into main Jun 27, 2024
13 checks passed
@frostedoyster frostedoyster deleted the unknown-keys branch June 27, 2024 13:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prevent users from overwriting non-existent fields
4 participants