Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add env var LUDWIG_SCHEMA_VALIDATION_POLICY to change marshmallow validation strictness #3226

Merged
merged 5 commits into from
Mar 16, 2023

Conversation

tgaddair
Copy link
Collaborator

@tgaddair tgaddair commented Mar 8, 2023

Closes #3218.

@tgaddair tgaddair added the release-0.7 Needs cherry-pick into 0.7 release branch label Mar 8, 2023
@tgaddair tgaddair requested a review from ksbrar March 8, 2023 18:05
Copy link
Collaborator

@ksbrar ksbrar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only question is how extensively have you tested this locally? (Last thing I was fiddling around with pytest env variables and was about to try Ludwig CLI).

In particular, I want to check if there are any breakages when RAISE is set for schemas that explicitly allow additional params via JSON. That means (based on a ctrl-f):

  • input features (top-level)
  • output features (top-level)
  • preprocessing
  • augmentation

Really, just one of these cases suffices as a sanity check for the rest.

(Also, I believe we're covered but we have to be careful that any internal parameters are in fact declared on their respective schemas (e.g. proc_column for input features) - otherwise RAISE will reject them.)

@tgaddair
Copy link
Collaborator Author

tgaddair commented Mar 8, 2023

It's a good point @ksbrar, I haven't tested this extensively! If you have some cycles to test it out, please take a look and let me know what you find.

@github-actions
Copy link

github-actions bot commented Mar 8, 2023

Unit Test Results

         6 files           6 suites   6h 45m 55s ⏱️
  4 087 tests   4 044 ✔️   43 💤 0
12 282 runs  12 147 ✔️ 135 💤 0

Results for commit 8530a26.

♻️ This comment has been updated with latest results.

@ksbrar
Copy link
Collaborator

ksbrar commented Mar 14, 2023

I've tested out this PR. It doesn't work as is because the deprecation warning PR I merged a little while ago is actually performing the wrong task and is too aggressive in filtering parameters.

This code (#3118) will filter out and log any and every unknown parameter as deprecated - which is the incorrect thing to do, as there is a clear distinction between a purely "unknown" or invalid parameter and a "deprecated" one - before marshmallow even has a chance to raise an error about the unknown parameters.

I just pushed a simple fix that gets us to what we want (a user sets this environment flag to raise, and they will see marshmallow validation errors for unknown parameters), but I want to sync with @justinxzhao (or @tgaddair ) about how deprecated parameters should be marked and parsed.

@justinxzhao
Copy link
Collaborator

Synced offline -- upgrading the config, removing known deprecated parameters, and logging warnings for them is handled by upgrade_config_dict_to_latest_version.

We should be good to proceed with this PR (just blocked on writing a test).

Follow-up PRs to continue progress towards making STRICT the default policy for Ludwig configs:

  1. Make STRICT the default
  2. Remove explicit call to JSON validation and rely purely on Marshmallow deserialziation
    • Remove call to check_schema() in model_types::base.py (assuming parity)
    • Keep check_schema() around for unit tests to verify pure-JSON-validation behavior.

@ksbrar
Copy link
Collaborator

ksbrar commented Mar 14, 2023

@justinxzhao I spent some time and just couldn't get the environment variable switching to work properly for the test. If someone else wants to give it a go, here is some rough code showing a couple different attempts. I can clean it up/give some clarification if need be:

def rreload(module):
    from importlib import reload

    """Recursively reload modules."""
    reload(module)
    for attribute_name in dir(module):
        attribute = getattr(module, attribute_name)
        if type(attribute) is ModuleType:
            rreload(attribute)


# @mock.patch.dict(os.environ, {LUDWIG_SCHEMA_UNKNOWN_POLICY: "raise"})
def test_user_config_validation_policy():
    from importlib import reload, import_module
    import sys
    from ludwig.constants import LUDWIG_SCHEMA_VALIDATION_POLICY

    # from ludwig.schema.model_types import base

    config = {
        "input_features": [{"name": "foo", "type": "category"}],
        "output_features": [category_feature(encoder={"vocab_size": 2}, reduce_input="sum")],
        "combiner": {"type": "concat", "output_size": 14},
        "trainer": {"unknown": "parameter"},
    }

    # Test default policy (unknown parameters are ignored):
    # ModelConfig.from_dict(config)

    # reload(lsb)

    print(os.environ.get(LUDWIG_SCHEMA_VALIDATION_POLICY))
    os.environ[LUDWIG_SCHEMA_VALIDATION_POLICY] = RAISE
    print(os.environ.get(LUDWIG_SCHEMA_VALIDATION_POLICY))
    # reload(sys.modules["ludwig.schema.model_types.base"])
    # reload(sys.modules["ludwig.schema.model_types.ecd"])
    # reload(sys.modules["ludwig.schema.utils"])
    # reload(sys.modules["ludwig.schema"])

    # mod = import_module("ludwig.schema.model_types.base")
    import ludwig.schema.model_types.base as base

    rreload(base)

    # for name, module in sys.modules.items():
    #     if "ludwig.schema" in name:
    #         reload(module)

    # Test RAISE policy:
    # monkeypatch.setenv(LUDWIG_SCHEMA_UNKNOWN_POLICY, "raise")
    with pytest.raises(ConfigValidationError):
        base.ModelConfig.from_dict(config)

But changing the validation policy is easy enough to verify locally in a terminal, so I think we can proceed with just landing this PR.

@ksbrar ksbrar requested a review from justinxzhao March 14, 2023 23:32
Copy link
Collaborator

@ksbrar ksbrar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go from my standpoint.

@ksbrar ksbrar changed the title Added env var LUDWIG_SCHEMA_UNKNOWN_POLICY to handle unknown fields in config feat: Add env var LUDWIG_SCHEMA_VALIDATION_POLICY to change marshmallow validation strictness Mar 14, 2023
@ksbrar ksbrar merged commit d5d49b8 into master Mar 16, 2023
@ksbrar ksbrar deleted the unknown-env branch March 16, 2023 20:12
tgaddair added a commit that referenced this pull request Mar 16, 2023
…llow validation strictness (#3226)

Co-authored-by: ksbrar <kabir@brar.xyz>
tgaddair added a commit that referenced this pull request Mar 16, 2023
…llow validation strictness (#3226)

Co-authored-by: ksbrar <kabir@brar.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-0.7 Needs cherry-pick into 0.7 release branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FR] Allow users to set the unmarshalling mode as RAISE for BaseConfig
3 participants