Adds JSON Schema Validation #5458
Conversation
@mrkaye97 Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/5433954587. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work for more details.
@dbczumar Moved this over to here. I was having some issues with huge diffs when trying to rebase on top of the base branch. I left some comments about design decisions I made and a couple of open questions. There are also two failing tests that I was seeing locally, caused by trying to log `NaN`. Let me know what the next steps for this should be, or if you've got any feedback! Also, who should I be requesting for review?

Oh, and another question: do I need to add …?
mlflow/server/handlers.py (outdated):

```python
    return schema


def _validate_param_against_schema(schema, param, value):
```
I opted to thread both the parameter (name) and the value through, so that we can supply a more helpful error to the user by telling them which parameter caused the failure.
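For illustration, a minimal sketch of that idea; the validator, schema shape, and error type below are assumptions for this sketch, not the PR's actual code:

```python
def _assert_string(x):
    assert isinstance(x, str)


def _validate_param_against_schema(schema, param, value):
    # `param` and `value` travel together so a failure can report exactly
    # which parameter was bad and what value was supplied
    for validator in schema.get(param, []):
        try:
            validator(value)
        except AssertionError:
            raise ValueError(f"Invalid value {value!r} for parameter {param!r}")


schema = {"run_id": [_assert_string]}
_validate_param_against_schema(schema, "run_id", "abc123")  # passes
# _validate_param_against_schema(schema, "run_id", 31)  # raises, naming run_id
```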
Hi @mrkaye97! Thanks for the awesome PR. Is it possible to implement this validation without introducing additional library dependencies? We try to avoid adding new library dependencies to core MLflow modules unless absolutely necessary.
@dbczumar Yeah, I definitely can. But just to push back a little, how strongly do you feel about this? On one hand, I agree with you that I'd prefer not to introduce a dependency. But I guess it depends on what "absolutely necessary" means. In this case, it'd be pretty easy to roll our own validation. But that also means we'd increase the maintenance costs of this code, in addition to increasing the complexity of adding additional validation for future endpoints, schemas, etc. Again, I'm happy to go either way, but just wanted to throw this out there before leaning into rolling our own.
@mrkaye97 Thanks for digging further here. Apologies for the vague "absolutely necessary" designation. The general guidance is that we shouldn't add a new dependency if:

1. there's a relatively straightforward way to correctly implement comparable logic, or
2. we can easily extract the relevant modules from the third-party library into the project directly without significant ramifications (e.g. breach of licensing / TOS) or maintenance challenges (e.g. perhaps the library is updated frequently with critical fixes / improvements, and it would be cumbersome to continually update the inlined modules).

In this case, (1) seems to apply, given the limited set of validation functionality that we require for this use case. Let me know if you have any questions, and thanks for your flexibility :D
Sure, that makes sense! Let me report back in a few days on this. Thanks for the feedback!
alright @dbczumar, what do you think of this approach? IMO, it's a lot less elegant, but at least it rolls our own instead of relying on a library. Still have the same two failing tests on local (`NaN` supplied, number required) but everything else seems to be working.

Also a TODO: I'm not sure how to best extend this for `log_batch`, since it takes a list of (e.g.) metrics, which seems like a pain. It expects a list of metrics, where each metric is a dictionary, and we want to validate each k-v pair of that dictionary. Haven't figured out the best way to do that yet.

LMK what you think whenever you get a chance 👍
@mrkaye97 This is fantastic! IMO, your implementation is quite elegant. Let me know if you'd like help / input regarding NaNs (as far as I can tell, NaNs are interpreted as floats when parsing JSON).
Regarding LogBatch, I think we can add comparable per-field validation logic to the body of the LogBatch handler. It may not look super nice, but I think it's the simplest approach.
Can't wait to merge this once NaNs and LogBatch have been addressed. Thanks again!
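As a rough sketch of what per-field validation for a LogBatch-style payload could look like (the schema and helper names below are assumptions for illustration, not the PR's implementation):

```python
def _assert_string(x):
    assert isinstance(x, str)


def _assert_float(x):
    assert isinstance(x, float)


def _assert_int(x):
    assert isinstance(x, int)


# Hypothetical per-field schema for a single metric in a LogBatch request
_METRIC_SCHEMA = {
    "key": [_assert_string],
    "value": [_assert_float],
    "timestamp": [_assert_int],
}


def _validate_metrics(metrics):
    # Validate every key-value pair of every metric dict in the batch
    for i, metric in enumerate(metrics):
        for field, validators in _METRIC_SCHEMA.items():
            for validator in validators:
                try:
                    validator(metric.get(field))
                except AssertionError:
                    raise ValueError(f"Invalid value for field {field!r} in metric {i}")


_validate_metrics([{"key": "rmse", "value": 0.25, "timestamp": 1648400000}])  # passes
```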
mlflow/server/handlers.py (outdated):

```python
def _assert_max_1k(x):
    assert x <= 1000


def _assert_max_50k(x):
    assert x <= 50000
```
Nit: Can we consolidate these as:

```python
def _assert_less_than_or_equal(x, max_value):
    assert x <= max_value
```

Then, we can pass `lambda x: _assert_less_than_or_equal(x, 50000)`, etc. as part of the `schema` parameter of `_get_request_message`.
hehe, nice idea. will add this. I was trying to do this to no avail 🤦 :

```python
lambda x: assert x < 50000
```
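(The attempt fails because `assert` is a statement, not an expression, so it can't appear in a lambda body; a named helper, as suggested above, sidesteps this.)

```python
# lambda x: assert x < 50000   # SyntaxError: a lambda body must be an expression
def _assert_less_than_or_equal(x, max_value):
    assert x <= max_value


validator = lambda x: _assert_less_than_or_equal(x, 50000)
validator(42)       # passes silently
# validator(60000)  # would raise AssertionError
```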
Works for me! I might need a few days -- going to be a busy week for me. Hopefully should have this over the finish line this weekend though. Thanks for the feedback. On the …
@dbczumar Update on this:

On the `NaN` question, here's what I'm seeing:

```python
import json

x = json.loads('{"Key": NaN}')
print(x)
print(type(x["Key"]))

## Checking the same assertion that the validator is performing
print(isinstance(x["Key"], float))

## To make sure that Python does, in fact, consider NaN to be a float when we tell it
print(float("nan"))
print(isinstance(float("nan"), float))

## To make sure it's actually the float validator failing
print(float("nan") is not None)
```

So yeah, I'm not actually sure what's going wrong here. Any ideas?
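(One observation consistent with the repro above: Python's `json` module accepts the non-standard `NaN` token by default and returns an ordinary `float`, so a bare `isinstance` check passes; detecting or rejecting NaN takes an explicit step, as sketched here.)

```python
import json
import math

x = json.loads('{"Key": NaN}')      # accepted by default, despite not being valid JSON
print(isinstance(x["Key"], float))  # True: NaN *is* a float
print(math.isnan(x["Key"]))         # True: but it can be detected explicitly


# Strict parsing can reject NaN/Infinity outright at the JSON layer:
def _reject_constants(name):
    raise ValueError(f"Invalid JSON constant: {name}")


try:
    json.loads('{"Key": NaN}', parse_constant=_reject_constants)
except ValueError as e:
    print(e)  # Invalid JSON constant: NaN
```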
Another thing: the R tests are failing because of a bug in the R package itself, not in this change. The issue is here: mlflow/mlflow/R/mlflow/R/tracking-rest.R, line 68 in 82d4021.

Basically, your R package converts all of the elements of the body to character vectors, which is now failing because it tries to make an API call with (e.g.) the start time as a string. See this reprex:

```r
data <- list(
  foo = "bar",
  baz = 314159
)

rapply(data, as.character, how = "replace")
#> $foo
#> [1] "bar"
#>
#> $baz
#> [1] "314159"
```

Created on 2022-03-24 by the reprex package (v2.0.1)
mlflow/R/mlflow/R/tracking-rest.R (outdated):

```r
GET(
  api_url,
  query = query,
  mlflow_rest_timeout(),
  config = rest_config$config,
  req_headers
)
```
Proposed drive-by fixing these. Honestly, I'm not sure why these were implemented the way they were in the first place.
@mrkaye97 Thank you for looking into this! My interpretation is that, if the MLflow server currently accepts string request parameters with numeric contents (e.g. a start time sent as a string), we should preserve that behavior rather than break existing clients.

Rather than drive-by fixing the R client (I think we can revert that for the scope of this PR), can we instead adjust how we're handling type validation to rely on the existing `parse_dict()` logic (mlflow/mlflow/utils/proto_json_utils.py, lines 150 to 153 in 2e175e7)? If `parse_dict()` succeeds, we should assume that the data types are good and proceed to verify that other request data requirements are satisfied (e.g. expected fields are present, values are less than the configured maximum thresholds, ...). On the other hand, if `parse_dict()` fails, then we can use the expected schema / type information for the request to return an informative error to the user.

Apologies for the additional wrinkle here. Thank you for all your hard work! Let me know if you have any questions.
thanks for the feedback @dbczumar -- I'll revert the R changes in favor of your suggestion to preserve existing behavior. Just to make sure I'm understanding how you're thinking about this though: your assumption is that if `parse_dict()` succeeds, the types are safe to trust?

Do you have any idea how prevalent this issue is? Is it only for numeric types? Or might we break other things like …? It seems like maybe a better solution (if we could think of all of the cases) would be to change …
Thanks @mrkaye97! Correct: if `parse_dict()` succeeds, we assume the data types are good; if it fails, we fall back to the expected schema to tell the user what the request should look like.

This allows us to provide an informative message for most cases while keeping the implementation cheap / simple. If the user can verify that their request looks good and it still isn't working, they can always file an issue. For more specific cases where we know the cause of the failure - e.g. the field is missing but required or fails to satisfy some numeric constraint - we can use the more specific error messages that have already been implemented in your PR. Let me know what your thoughts are here.
Gotcha, that makes sense to me 👍 Agreed that trying to cover all the cases is risky. I imagine that I'm going to need to rewire some things a little bit, since this conditional logic (only check types on parse failure) is a little more involved than the original logic was. I'll work on this over the next few days and report back!
@dbczumar Circling back: I think this approach actually raises some bigger issues buried deeper in the code. For instance: if I make a call with a numeric value for a param or metric name, it gets past parsing. And then the problem occurs: downstream, I end up with this:

```
>       if name is None or not _VALID_PARAM_AND_METRIC_NAMES.match(name):
E       TypeError: expected string or bytes-like object

name = 31
```

Which I think results in a 500. I actually can't repro this from an R client making the same calls, which I'm a little confused by. In that case, I get back a 400 with the correct validation error. Got any ideas on this?

Separately, it's weird to me from a design perspective that this validation would happen twice for some calls (but not others?). That seems unnecessary, and it'll result in odd UX for users who will get different error messages for the "same" error, depending on the endpoint they're hitting.

IMO, the bottom line is that it's not clear to me that it's safe to assume that if `parse_dict()` succeeds, the request is valid.
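(The traceback above is the generic failure mode of handing a non-string to `re.match`; a stand-in repro follows, with an illustrative pattern rather than MLflow's actual one.)

```python
import re

# Stand-in pattern; MLflow's real _VALID_PARAM_AND_METRIC_NAMES may differ
_VALID_PARAM_AND_METRIC_NAMES = re.compile(r"^[/\w.\- ]*$")

name = 31  # an integer that slipped past request parsing
try:
    _VALID_PARAM_AND_METRIC_NAMES.match(name)
except TypeError as e:
    # Unhandled, this bubbles up as an HTTP 500 rather than a clean 400
    print(e)  # expected string or bytes-like object
```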
mlflow/server/handlers.py (outdated):

```python
if f in [
    _assert_int,
    _assert_string,
    _assert_bool,
    _assert_float,
    _assert_array,
    _assert_item_type_string,
] and parsing_succeeded:
    continue
```
This was the best way I could think of to do this that didn't require a huge refactor
Are these methods invoked at any point? If not, can we remove them from the PR entirely?
Nevermind, misread the logic. This makes sense :)
LGTM! Thanks for all your hard work, @mrkaye97! This is awesome! I'll go ahead and clean up some lint errors and get this merged.
@mrkaye97 I resolved the lint errors and updated the tests to make use of raw REST requests to bypass client-side validations when testing server-side validations. I also updated the server-side validation logic to check all fields associated with an `_assert_required` function, even if they aren't present in the request body (previously, we only validated a field if it was present in the request, but we should throw if a required field is absent), and I improved the error message for the case where `_assert_required` fails. Finally, I added validations for the CreateExperiment `tags` and `artifact_location` fields. I plan to merge tomorrow. Thanks again!
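A small sketch of the described change (`_assert_required` is the name from the thread, but the surrounding logic here is an assumption):

```python
def _assert_required(value):
    assert value is not None


def _validate_required_fields(payload, required_fields):
    # Check every required field, even if it's absent from the request body
    # (previously, a field was only validated when present)
    for field in required_fields:
        try:
            _assert_required(payload.get(field))
        except AssertionError:
            raise ValueError(f"Missing value for required parameter {field!r}")


_validate_required_fields({"name": "my-experiment"}, ["name"])  # passes
# _validate_required_fields({}, ["name"])  # raises: missing 'name'
```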
awesome, thanks for all the help and doing all of that cleanup to get this over the line @dbczumar! excited to see this go live -- I think it'll improve the UX a lot 🥳
Of course! Thanks so much for all your hard work! Merging! 🎆
I did a deep dive into JSON Schema back in 2012 and, although it's a great idea, there were so many issues that I ditched it. Glad to see it's working now! Schema-less JSON makes me shudder. It's like fingers-crossed programming ;)
What changes are proposed in this pull request?
Closes #5208
This PR is a first pass at adding JSON validation against a predetermined schema, in an effort to make error handling more friendly for users. The end goal is to return HTTP 400 for bad parameters instead of 500, which happens currently (at least for most endpoints).
I'm sure there are many error handling cases I've missed, but I think this should cover the most common issues (misspecified parameter types). And of course, we can always expand on this in the future.
How is this patch tested?

Tested with `pytest` as per @dbczumar's suggestion. I added a few tests to check that errors were coming back as expected, but we can always add more. I also spun up a local MLflow instance and ran some test API calls from a local R session.
Before this change:

Lots of `500`s with nothing helpful:

![Screen Shot 2022-03-06 at 2 37 37 PM](https://user-images.githubusercontent.com/44219491/156939267-bf20bf20-4f47-43ec-bdde-d0ab2e798800.png)

In order to debug, you would need to dig into your instance's logs, where you'd see something like this (which is helpful, but hard to parse, doesn't indicate exactly what went wrong, and isn't always easily available to MLflow users).
After this change:
Helpful error messages!
Does this PR change the documentation?

- Check the `ci/circleci: build_doc` check. If it's successful, proceed to the next step; otherwise, fix it.
- Click `Details` on the right to open the CircleCI job page.
- Open the `Artifacts` tab.
- Open `docs/build/html/index.html` to view the built docs.

Release Notes
Is this a user-facing change?
This change adds validation of API request bodies against predetermined JSON schemas, which will result in friendlier error messages for MLflow users.
See #5208
What component(s), interfaces, languages, and integrations does this PR affect?
Components

- `area/artifacts`: Artifact stores and artifact logging
- `area/build`: Build and test infrastructure for MLflow
- `area/docs`: MLflow documentation pages
- `area/examples`: Example code
- `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- `area/models`: MLmodel format, model serialization/deserialization, flavors
- `area/projects`: MLproject format, project running backends
- `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- `area/server-infra`: MLflow Tracking server backend
- `area/tracking`: Tracking Service, tracking client APIs, autologging

Interface

- `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- `area/docker`: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- `area/sqlalchemy`: Use of SQLAlchemy in the Tracking Service or Model Registry
- `area/windows`: Windows support

Language

- `language/r`: R APIs and clients
- `language/java`: Java APIs and clients
- `language/new`: Proposals for new client languages

Integrations

- `integrations/azure`: Azure and Azure ML integrations
- `integrations/sagemaker`: SageMaker integrations
- `integrations/databricks`: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- `rn/breaking-change` - The PR will be mentioned in the "Breaking Changes" section
- `rn/none` - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- `rn/feature` - A new user-facing feature worth mentioning in the release notes
- `rn/bug-fix` - A user-facing bug fix worth mentioning in the release notes
- `rn/documentation` - A user-facing documentation change worth mentioning in the release notes