Verify for unknown architecture fields #241

PicoCentauri · 2024-06-10T14:26:30Z

Partly closes #168: for now I have only added verification of the architecture hypers, using a function check_architecture_options. Checking the dataset section is a bit more complicated because the yaml layout is very flexible. I think it is doable, but to keep this PR smaller I will do it in a follow-up.

While touching the code I also moved check_options_list inside the dataset expansion, because these two functions are always called together. In addition, there was no test checking that the written option_restart.yaml is actually a valid input for starting another training run.

Contributor (creator of pull-request) checklist

Tests updated (for new features and bugfixes)?
Documentation updated (for new features)?
Issue referenced (for PRs that solve an issue)?

📚 Documentation preview 📚: https://metatrain--241.org.readthedocs.build/en/241/

src/metatrain/utils/omegaconf.py

frostedoyster · 2024-06-11T10:33:31Z

Thanks a lot for the work.
I tried this branch but I'm still finding some issues in practice:

architecture:
  name: experimental.soap_bpnn
  training:
    batch_size: 2
    num_epoch: 1

training_set:
  systems:
    read_from: qm9_reduced_100.xyz
    length_unit: angstrom
  targets:
    energy:
      key: U0
      unit: eV

test_set: 0.5
validation_set: 0.1

With this config, we would like to see an error (the key is supposed to be num_epochs, not num_epoch).
Instead, this config results in training for 100 epochs (the default for SOAP-BPNN), with no warnings or errors.

PicoCentauri · 2024-06-11T12:10:49Z

Interesting, let me check this again.

src/metatrain/utils/omegaconf.py

PicoCentauri · 2024-06-12T06:34:00Z

I fixed the check for the options you posted, @frostedoyster, and added them as a test.

frostedoyster · 2024-06-12T10:31:40Z

We seem to have some issues:
In SOAP-BPNN, try adding per_structure_targets: ["energy"] to the training options. This fails, and it's not supposed to.
Much more concerning: loss_weights: {"energy": 1.0} (also in the training section of SOAP-BPNN) fails as well and, with the current design, it is expected to fail. I would propose ignoring dicts that are empty in the reference.
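To make the loss_weights failure concrete, here is a minimal sketch of a check that uses the default hypers as the reference. It is not the code from this PR: the function name find_unknown_keys and the DEFAULTS values are assumptions for illustration (only the 100-epoch default comes from the discussion above).

```python
from typing import Any, Dict, List


def find_unknown_keys(
    options: Dict[str, Any], reference: Dict[str, Any], prefix: str = ""
) -> List[str]:
    """Return dotted paths of keys in ``options`` that are absent from ``reference``."""
    unknown: List[str] = []
    for key, value in options.items():
        path = f"{prefix}{key}"
        if key not in reference:
            unknown.append(path)
        elif isinstance(value, dict) and isinstance(reference[key], dict):
            # Recurse into nested sections, using the defaults as the "schema".
            unknown.extend(find_unknown_keys(value, reference[key], prefix=f"{path}."))
    return unknown


# Hypothetical defaults, loosely modelled on the options discussed above.
DEFAULTS = {
    "training": {
        "batch_size": 8,
        "num_epochs": 100,
        "loss_weights": {},  # empty by default, so every user-supplied key looks unknown
    }
}

typo = {"training": {"batch_size": 2, "num_epoch": 1}}
valid = {"training": {"loss_weights": {"energy": 1.0}}}

print(find_unknown_keys(typo, DEFAULTS))   # the misspelled key is reported
print(find_unknown_keys(valid, DEFAULTS))  # a legitimate key is reported as well
```

The same comparison that catches the num_epoch typo also rejects a perfectly valid loss_weights entry, because an empty default carries no information about which keys are allowed.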

Luthaf · 2024-06-12T11:42:04Z

I guess that the issue is that we are trying to use the default hypers as a schema for all possible hypers.

If we want to do proper validation, we should have an actual schema. JSON schema seems to be the main industry standard: https://json-schema.org/. If we try to validate without a schema, there will always be either things we can't validate or things we validate too strictly.

However, this would introduce more work for architecture contributors, so it might be worth discussing in more depth later. One thing that could reduce this work would be a tool to auto-generate the schema from an example YAML. I'm sure something like this already exists.
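As a rough illustration of what such a generator could do, the sketch below infers a draft JSON schema from an example options dict (for instance, one loaded from a YAML file). The function infer_schema and the example values are hypothetical and not part of metatrain; the result is only a starting point that still needs manual editing.

```python
import json
from typing import Any, Dict


def infer_schema(example: Any) -> Dict[str, Any]:
    """Build a draft JSON-schema fragment describing ``example``."""
    if isinstance(example, dict):
        schema: Dict[str, Any] = {"type": "object"}
        if example:
            schema["properties"] = {
                key: infer_schema(value) for key, value in example.items()
            }
            # Reject keys that are not in the example, such as typos.
            schema["additionalProperties"] = False
        # Empty dicts stay open: the example says nothing about their keys.
        return schema
    if isinstance(example, list):
        # Only the first element is inspected; mixed lists need manual editing.
        array_schema: Dict[str, Any] = {"type": "array"}
        if example:
            array_schema["items"] = infer_schema(example[0])
        return array_schema
    if isinstance(example, bool):  # bool must be tested before int
        return {"type": "boolean"}
    if isinstance(example, int):
        return {"type": "integer"}
    if isinstance(example, float):
        return {"type": "number"}
    if isinstance(example, str):
        return {"type": "string"}
    if example is None:
        return {"type": "null"}
    raise TypeError(f"unsupported value in example: {example!r}")


# Hypothetical example hypers; real defaults differ per architecture.
example_hypers = {"training": {"batch_size": 8, "num_epochs": 100, "loss_weights": {}}}
draft = infer_schema(example_hypers)
print(json.dumps(draft, indent=2))
```

A draft produced this way cannot know about optional keys, value ranges, or alternative types, which is why a hand-reviewed schema is still needed afterwards.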

PicoCentauri · 2024-06-12T13:59:43Z

Yes, I thought we could avoid using a JSON schema, but it seems we can't. I will explore whether there is an easy way that does not mean too much work for developers.

PicoCentauri · 2024-06-19T09:43:43Z

src/metatrain/share/schema-dataset.json

+                                "stress": {
+                                    "$ref": "#/$defs/gradient_section"
+                                },
+                                "virial": {
+                                    "$ref": "#/$defs/gradient_section"
+                                }


Should we already specify in the schema that stress and virial are mutually exclusive?

Also, the gradients are hardcoded. I think this is fine for now...
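One possible way to express this exclusivity, written here as a Python dict so it can be dumped to JSON, is the standard JSON Schema combination of "not" and "required". This is only a sketch: the surrounding layout of the fragment is an assumption and may not match schema-dataset.json.

```python
import json

# Hypothetical fragment for the section that holds the "stress" and "virial" entries.
gradients_fragment = {
    "type": "object",
    "properties": {
        "stress": {"$ref": "#/$defs/gradient_section"},
        "virial": {"$ref": "#/$defs/gradient_section"},
    },
    # "required" matches only when both keys are present; "not" inverts that,
    # so an instance containing both keys at once is rejected.
    "not": {"required": ["stress", "virial"]},
}

print(json.dumps(gradients_fragment, indent=2))
```

Instances with only stress, only virial, or neither still pass, since "required" with both names fails for them and "not" therefore succeeds.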

PicoCentauri · 2024-06-19T09:45:48Z

src/metatrain/cli/train.py

+    with open(PACKAGE_ROOT / "share/schema-base.json", "r") as f:
+        schema_base = json.load(f)
+
+    jsonschema.validate(instance=OmegaConf.to_container(options), schema=schema_base)


We could wrap these two lines in a function check_base_options, but that might be overkill...

Luthaf

The code and checks look pretty good. I think we need more documentation about this in the contributors documentation.

Is there a tool that can create a (non-ideal) schema from an example YAML file? We could recommend it as a starting point.

Luthaf · 2024-06-20T10:04:32Z

src/metatrain/experimental/gap/schema-hypers.json

+    "model": {
+      "type": "object",
+      "properties": {
+        "soap": {


It would be interesting to check if we can automatically import rascaline's JSON schema file here.

Might be very cool. But then we would have to generate these schemas dynamically based on the location of rascaline, or what do you have in mind?

rascaline currently does not distribute its schemas (they are only used for documentation generation). I was thinking we could copy the file from rascaline next to this one, and update it whenever we update the pinned rascaline version.

Yes, sounds like a good idea. If there is a way to use a distributed schema, we can try it once we have changed the schema in rascaline.

tests/cli/test_train_model.py

PicoCentauri · 2024-06-21T09:01:13Z

> The code and checks look pretty good, I think we need more documentation about this in the contributors documentation.

Yes, sure, I can add additional information to the section about adding a new architecture.

> Is there a tool that can create a (non-ideal) schema from an example YAML file? We could recommend it as a starting point.

I checked a couple of tools in the beginning, but none of them were great. In the end, I used ChatGPT to generate the schemas, which worked much better than the other tools. I think it is okay if we recommend this.

Luthaf · 2024-06-21T11:29:14Z

> In the end, I used ChatGPT to generate the schemas, which worked much better than the other tools. I think it is okay if we recommend this.

We can suggest some tools along with their limitations, and say that we also had good success using ChatGPT/LLMs for this!

PicoCentauri · 2024-06-24T08:53:37Z

I updated the documentation page, including tools for generating the schemas.

@Luthaf, regarding linking the rascaline schemas: jsonschema fortunately supports referencing external schemas from local files and URLs. However, this requires some non-trivial customization of a Validator class. I played around with this, but I think I need some more time to get it working in a robust way. If you agree, I will explore this further and add it in an upcoming PR.

Luthaf

That works for me, but I'd like to see an approval from the different architecture maintainers on the new json-schema files.

If you can also open issues for the things left for future PRs, that would be great!

Luthaf · 2024-06-24T13:27:18Z

@frostedoyster @abmazitov @DavideTisi @spozdn Can you look at the new json-schema file in your architecture and let us know if (a) you understand the file format and what it is doing; and (b) you agree with the types/constraints of everything?

frostedoyster · 2024-06-26T05:35:20Z

It looks good to me and the schemas are quite intuitive

DavideTisi

very cool, just a bit pedantic but i do not have a better idea

DavideTisi · 2024-06-27T07:42:43Z

src/metatrain/experimental/gap/schema-hypers.json

+  "properties": {
+    "name": {
+      "type": "string",
+      "enum": ["experimental.gap"]


why enum and not string?

I think that if you want to explicitly match a string to a specific value with jsonschema you have to use an enum

PicoCentauri force-pushed the unknown-keys branch from 59dec14 to bb3846c Compare June 10, 2024 14:27

PicoCentauri changed the title ~~Verify architecture keys against reference~~ Verify for unknown architecture fields Jun 10, 2024

PicoCentauri force-pushed the unknown-keys branch from bb3846c to 1a582d3 Compare June 10, 2024 16:02

frostedoyster reviewed Jun 10, 2024

src/metatrain/utils/omegaconf.py (outdated)

PicoCentauri force-pushed the unknown-keys branch from 1a582d3 to 71eec8b Compare June 12, 2024 06:30

PicoCentauri commented Jun 12, 2024

src/metatrain/utils/omegaconf.py (outdated)

PicoCentauri requested a review from frostedoyster June 12, 2024 06:33

PicoCentauri force-pushed the unknown-keys branch from 71eec8b to d1f8629 Compare June 19, 2024 09:31

PicoCentauri requested review from DavideTisi, spozdn and abmazitov as code owners June 19, 2024 09:31

PicoCentauri commented Jun 19, 2024

PicoCentauri force-pushed the unknown-keys branch from d1f8629 to 1ad0ed0 Compare June 19, 2024 09:44

PicoCentauri commented Jun 19, 2024

PicoCentauri force-pushed the unknown-keys branch from 2812c20 to 7aaa70a Compare June 19, 2024 10:00

Luthaf reviewed Jun 20, 2024

PicoCentauri force-pushed the unknown-keys branch from 64103ba to 41497f4 Compare June 24, 2024 08:43

PicoCentauri added 4 commits June 24, 2024 10:44

Verify options

d99e9b2

format json and yaml files

e2b0ce9

add debug info for check_architecture_options

d0eaa6a

Update new architecture docs

1e027dd

PicoCentauri force-pushed the unknown-keys branch from 41497f4 to 1e027dd Compare June 24, 2024 08:46

Merge branch 'main' into unknown-keys

ed94569

Luthaf approved these changes Jun 24, 2024

PicoCentauri mentioned this pull request Jun 25, 2024

Suggestions for closely matching allowed names #271

Open

3 tasks

frostedoyster approved these changes Jun 26, 2024

DavideTisi approved these changes Jun 27, 2024

frostedoyster merged commit 8ae0a94 into main Jun 27, 2024
13 checks passed

frostedoyster deleted the unknown-keys branch June 27, 2024 13:13
