
Adds JSON Schema Validation #5458

Merged · 40 commits · Apr 19, 2022

Conversation

@mrkaye97 (Contributor) commented Mar 5, 2022

What changes are proposed in this pull request?

Closes #5208

This PR is a first pass at adding JSON validation against a predetermined schema, in an effort to make error handling more friendly for users. The end goal is to return HTTP 400 for bad parameters instead of 500, which happens currently (at least for most endpoints).

I'm sure there are many error handling cases I've missed, but I think this should cover the most common issues (misspecified parameter types). And of course, we can always expand on this in the future.
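
To make the idea concrete, here is a minimal sketch of the style of validation this PR introduces (the schema and helper below are illustrative, not the exact code in the PR; MlflowException with the INVALID_PARAMETER_VALUE error code is MLflow's existing mechanism for returning HTTP 400):

from mlflow.exceptions import MlflowException
from mlflow.protos.databricks_pb2 import INVALID_PARAMETER_VALUE

# Hypothetical schema: field name -> (type check, required?)
_CREATE_EXPERIMENT_SCHEMA = {
    "name": (lambda v: isinstance(v, str), True),
    "artifact_location": (lambda v: isinstance(v, str), False),
}

def _validate_request_json(body, schema):
    for field, (check, required) in schema.items():
        if field not in body:
            if required:
                # Surfaced to the client as HTTP 400 rather than a 500
                raise MlflowException(
                    f"Missing required parameter '{field}'",
                    error_code=INVALID_PARAMETER_VALUE,
                )
            continue
        if not check(body[field]):
            raise MlflowException(
                f"Invalid value {body[field]!r} for parameter '{field}'",
                error_code=INVALID_PARAMETER_VALUE,
            )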

How is this patch tested?

Tested with pytest as per @dbczumar's suggestion. I added a few tests to check if errors were coming back as expected, but we can always add more.

I also spun up a local MLflow instance and ran some test API calls from a local R session.

Before this change:

Lots of 500s with nothing helpful.
[Screenshots: example requests returning HTTP 500 with generic, unhelpful error messages]

In order to debug, you would need to dig into your instance's logs, where you'd see something like the traceback below (which is helpful, but hard to parse and doesn't indicate exactly what went wrong -- and logs aren't always easily available to MLflow users):

2022-03-06T17:44:56.390215+00:00 app[web.1]: Traceback (most recent call last):
2022-03-06T17:44:56.390216+00:00 app[web.1]:   File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 2073, in wsgi_app

...

2022-03-06T17:44:56.390219+00:00 app[web.1]:   File "/opt/conda/lib/python3.9/site-packages/mlflow/server/handlers.py", line 284, in _create_experiment

...

2022-03-06T17:44:56.390221+00:00 app[web.1]:   File "/opt/conda/lib/python3.9/site-packages/google/protobuf/json_format.py", line 445, in ParseDict
2022-03-06T17:44:56.390221+00:00 app[web.1]:     parser.ConvertMessage(js_dict, message)
2022-03-06T17:44:56.390221+00:00 app[web.1]:   File "/opt/conda/lib/python3.9/site-packages/google/protobuf/json_format.py", line 476, in ConvertMessage
2022-03-06T17:44:56.390222+00:00 app[web.1]:     self._ConvertFieldValuePair(value, message)
2022-03-06T17:44:56.390222+00:00 app[web.1]:   File "/opt/conda/lib/python3.9/site-packages/google/protobuf/json_format.py", line 594, in _ConvertFieldValuePair
2022-03-06T17:44:56.390222+00:00 app[web.1]:     raise ParseError('Failed to parse {0} field: {1}.'.format(name, e))
2022-03-06T17:44:56.390223+00:00 app[web.1]: google.protobuf.json_format.ParseError: Failed to parse name field: expected string or bytes-like object.

After this change:

Helpful error messages!

[Screenshots: the same requests now returning HTTP 400 with descriptive validation error messages]

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the next step; otherwise, fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

This change adds validation of API request bodies against predetermined JSON schemas, which will result in friendlier error messages for MLflow users.

See #5208

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

github-actions bot commented Mar 5, 2022

@mrkaye97 Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/5433954587. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work for more details.

@github-actions github-actions bot added area/tracking Tracking service, tracking client APIs, autologging rn/bug-fix Mention under Bug Fixes in Changelogs. rn/feature Mention under Features in Changelogs. labels Mar 5, 2022
@mrkaye97 mentioned this pull request Mar 5, 2022
mlflow/server/handlers.py: 5 outdated review threads (resolved)
@mrkaye97 (Contributor, Author) commented Mar 5, 2022

@dbczumar Moved this over to here. I was having some issues with huge diffs when trying to rebase on top of master, which was causing GH to hang ¯\_(ツ)_/¯

I left some comments about design decisions I made and a couple of open questions. There are also two failing tests that I was seeing on local, caused by trying to log NaN as a number.

Let me know what the next steps for this should be, or if you've got any feedback! Also, who should I be requesting for review?

Separately, I feel like I should probably add some tests that check that you actually do end up getting a 400 back for (e.g.) a missing required param. I can work on that in a bit. (Update: did some of this; we can always add more.)

@mrkaye97 (Contributor, Author) commented Mar 6, 2022

Oh and another question: Do I need to add jsonschema to the requirements.txt files? Or which, if any?


return schema

def _validate_param_against_schema(schema, param, value):
@mrkaye97 (Contributor, Author) commented:

I opted to thread both the parameter (name) and the value through, so that we could supply a more helpful error to the user by telling them which parameter was the one causing the failure
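
For illustration, a rough sketch of what that threading enables (the signature matches the diff above, but the body and error wording here are hypothetical, not the PR's actual implementation):

from mlflow.exceptions import MlflowException
from mlflow.protos.databricks_pb2 import INVALID_PARAMETER_VALUE

def _validate_param_against_schema(schema, param, value):
    # 'param' carries the field name and 'value' the supplied value, so the
    # error message can point at the exact offending parameter.
    try:
        schema(value)  # each schema entry is assumed to be an assertion callable
    except AssertionError:
        raise MlflowException(
            f"Invalid value {value!r} supplied for parameter '{param}'",
            error_code=INVALID_PARAMETER_VALUE,
        )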

@dbczumar (Collaborator) commented Mar 7, 2022

> Oh and another question: Do I need to add jsonschema to the requirements.txt files? Or which, if any?

Hi @mrkaye97 ! Thanks for the awesome PR. Is it possible to implement this validation without introducing additional library dependencies? We try to avoid new library dependencies in core MLflow modules (such as mlflow server) unless it's absolutely necessary because MLflow is used in a wide variety of system / software environments.

@mrkaye97 (Contributor, Author) commented Mar 7, 2022

> Oh and another question: Do I need to add jsonschema to the requirements.txt files? Or which, if any?
>
> Hi @mrkaye97 ! Thanks for the awesome PR. Is it possible to implement this validation without introducing additional library dependencies? We try to avoid new library dependencies in core MLflow modules (such as mlflow server) unless it's absolutely necessary because MLflow is used in a wide variety of system / software environments.

@dbczumar Yeah, I definitely can. But just to push back a little, how strongly do you feel about this? On one hand, I agree with you that I'd prefer to not introduce a dependency. But I guess it depends on what "absolutely necessary" means. In this case, it'd be pretty easy to roll our own validation. But that also means we'd increase the maintenance costs of this code, in addition to increasing the complexity of adding additional validation for future endpoints, schemas, etc.

Again, I'm happy to go either way, but just wanted to throw this out there before leaning into rolling our own

@dbczumar (Collaborator) commented Mar 7, 2022

> (quoting the preceding exchange about adding a jsonschema dependency)

@mrkaye97 Thanks for digging further here. Apologies for the vague "absolutely necessary" designation. The general guidance is that we shouldn't add a new dependency if: 1. there's a relatively straightforward way to correctly implement comparable logic, or 2. we can easily extract the relevant modules from the third-party library into the project directly without significant ramifications (e.g. breach of licensing / TOS) or maintenance challenges (e.g. perhaps the library is updated frequently with critical fixes / improvements, and it would be cumbersome to continually update the inlined modules).

In this case, (1) seems to apply, given the limited set of validation functionality that we require for this use case. Let me know if you have any questions, and thanks for your flexibility :D

@mrkaye97 (Contributor, Author) commented Mar 7, 2022

> (quoting the preceding exchange)

Sure, that makes sense! Let me report back in a few days on this. Thanks for the feedback!

@mrkaye97 (Contributor, Author) commented:

alright @dbczumar, what do you think of this approach? IMO, it's a lot less elegant, but at least it rolls our own instead of relying on a library. Still have the same two failing tests locally (NaN supplied, number required), but everything else seems to be working.

Also a TODO: I'm not sure how to best extend this for log_batch, since it takes a list of (e.g.) metrics, which seems like a pain. It expects a list of metrics, where each metric is a dictionary, and we want to validate each k-v pair of that dictionary. Haven't figured out the best way to do that yet.

LMK what you think whenever you get a chance 👍

@dbczumar (Collaborator) left a comment:

> (quoting the previous comment)

@mrkaye97 This is fantastic! IMO, your implementation is quite elegant. Let me know if you'd like help / input regarding NaNs (as far as I can tell, NaNs are interpreted as floats when parsing JSON).

Regarding LogBatch, I think we can add comparable per-field validation logic to the body of the LogBatch handler. It may not look super nice, but I think it's the simplest approach.

Can't wait to merge this once NaNs and LogBatch have been addressed. Thanks again!
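
For concreteness, per-field validation in the LogBatch handler could look roughly like this (a sketch only; the entry shapes follow the LogBatch request body, but the checks and error type are simplified stand-ins for the PR's assertion helpers):

def _validate_log_batch_body(body):
    # Validate each entry field-by-field so the error names the offending field.
    for i, metric in enumerate(body.get("metrics", [])):
        if not isinstance(metric.get("key"), str):
            raise ValueError(f"metrics[{i}].key must be a string")
        if not isinstance(metric.get("value"), (int, float)):
            raise ValueError(f"metrics[{i}].value must be a number")
        if not isinstance(metric.get("timestamp"), int):
            raise ValueError(f"metrics[{i}].timestamp must be an integer")
    for i, param in enumerate(body.get("params", [])):
        if not isinstance(param.get("key"), str) or not isinstance(param.get("value"), str):
            raise ValueError(f"params[{i}] must have string 'key' and 'value' fields")
    for i, tag in enumerate(body.get("tags", [])):
        if not isinstance(tag.get("key"), str) or not isinstance(tag.get("value"), str):
            raise ValueError(f"tags[{i}] must have string 'key' and 'value' fields")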

Comment on lines 282 to 286
def assert_max_1k(x):
    assert x <= 1000

def _assert_max_50k(x):
    assert x <= 50000
@dbczumar (Collaborator) commented:

Nit: Can we consolidate these as:

def _assert_less_than_or_equal(x, max_value):
    assert x <= max_value

Then, we can pass lambda x: _assert_less_than_or_equal(x, 50000), etc. as part of the schema parameter of _get_request_message.

@mrkaye97 (Contributor, Author) commented:

hehe, nice idea. will add this. I was trying to do this to no avail 🤦 :

lambda x: assert x < 50000
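
(That lambda is a syntax error because assert is a statement and a lambda body can only be a single expression; wrapping the consolidated helper in a lambda, as suggested above, works because a function call is an expression:)

def _assert_less_than_or_equal(x, max_value):
    assert x <= max_value

# A lambda can wrap the call, since the call itself is an expression
_assert_max_50k = lambda x: _assert_less_than_or_equal(x, 50000)
_assert_max_50k(1000)    # passes silently
# _assert_max_50k(60000) # would raise AssertionError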

mlflow/server/handlers.py: outdated review thread (resolved)
@mrkaye97 (Contributor, Author) commented:

> (quoting the previous two comments)

Works for me! I might need a few days -- going to be a busy week for me. Hopefully should have this over the finish line this weekend though. Thanks for the feedback.

On the NaN thing: I guess the question is whether we want to allow them or not. IMO, if MLflow expects a number for a field, then NaN should throw an error -- it's explicitly not a number, after all. If we decide that not allowing NaN is the right way to go, I can probably just update those two tests. If we want to allow NaN as a float, I can allow them.

@mrkaye97 (Contributor, Author) commented:

@dbczumar Update on this:

  • I've added validation for log_batch in the best way I could think of (🤮 )
  • I've used some lambdas to clean up repeated code a little, per your suggestion
  • I think I've got status being validated right (can I just check if it's a string? I think it's a special class)

On the NaNs: I'm not actually sure why that's failing tests -- I verified that Python parses "NaN" and "nan" as floats, and in a repl, running isinstance(float("nan"), float) works fine. I also tried parsing JSON in the repl directly, but that also seemed fine. Here's what I tried:

import json

x = json.loads('{"Key": NaN}')

print(x)
print(type(x["Key"]))

## Checking the same assertion that the validator is performing
print(isinstance(x["Key"], float))

## To make sure that Python does, in fact, consider NaN to be a float when we tell it
print(float("nan"))
print(isinstance(float("nan"), float))

## To make sure it's actually the float validator failing
print(float("nan") is not None)

So yeah, I'm not actually sure what's going wrong here. Any ideas?

@mrkaye97 (Contributor, Author) commented:

Another thing: The R tests are failing because of a bug in the R package itself, not in this change.

The issue is here:

body = if (is.null(data)) NULL else rapply(data, as.character, how = "replace"),

Basically, your R package converts all of the elements of the body to character vectors, so it now makes API calls with (e.g.) the start time as "100" instead of 100, which the new validation rejects. Honestly, I'm not sure why the type conversion was happening in the first place. My suggestion would be to refactor mlflow_rest along the lines of what we've done in our lightMLFlow package: https://github.com/collegevine/lightMLFlow/blob/main/R/rest.R#L113-L118. I will put in that change.

See this reprex:

data <- list(
    foo = "bar",
    baz = 314159
)

rapply(data, as.character, how = "replace")
#> $foo
#> [1] "bar"
#> 
#> $baz
#> [1] "314159"

Created on 2022-03-24 by the reprex package (v2.0.1)

Comment on lines 63 to 69
GET(
  api_url,
  query = query,
  mlflow_rest_timeout(),
  config = rest_config$config,
  req_headers
)
@mrkaye97 (Contributor, Author) commented:

Proposed drive-by fixing these. Honestly, I'm not sure why these were implemented the way they were in the first place.

@mrkaye97 mrkaye97 requested a review from dbczumar March 24, 2022 22:18
@dbczumar (Collaborator) commented Mar 27, 2022

> (quoting the previous comment about the R client converting request-body fields to strings)

@mrkaye97 Thank you for looking into this! My interpretation is that, if the MLflow server currently accepts string request parameters with numeric contents (e.g. "100") and successfully converts them to the corresponding int / float values for downstream operations, then we should preserve that behavior in this PR. The R client may not be the only source of such requests; for example, third-party tools / user workflows may be making similar requests with "numeric" strings, and we don't want to break them.

Rather than drive-by fixing the R client (I think we can revert that change for the scope of this PR), can we instead adjust how we're handling type validation to rely on the existing parse_dict() implementation as the source of truth?

def parse_dict(js_dict, message):
    """Parses a JSON dictionary into a message proto, ignoring unknown fields in the JSON."""
    _stringify_all_experiment_ids(js_dict)
    ParseDict(js_dict=js_dict, message=message, ignore_unknown_fields=True)

When handling the request, if parse_dict() succeeds, we should assume that the data types are good and proceed to verify that other request data requirements are satisfied (e.g. expected fields are present, values are less than the configured maximum thresholds, ...). On the other hand, if parse_dict() fails, then we can use the expected schema / type information for the request to return an informative error to the user.

Apologies for the additional wrinkle here. Thank you for all your hard work! Let me know if you have any questions.
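
A minimal sketch of the suggested flow (illustrative only: the wrapper name, message formatting, and the assumption that parse_dict lives in mlflow.utils.proto_json_utils are mine, not part of this PR):

from google.protobuf.json_format import ParseError

from mlflow.exceptions import MlflowException
from mlflow.protos.databricks_pb2 import INVALID_PARAMETER_VALUE
from mlflow.utils.proto_json_utils import parse_dict

def _parse_request_or_400(js_dict, message, schema_description):
    try:
        # parse_dict is treated as the source of truth for types
        parse_dict(js_dict, message)
    except ParseError:
        raise MlflowException(
            "Failed to parse request body. Please verify that your request includes "
            f"a valid JSON body with fields satisfying: {schema_description}",
            error_code=INVALID_PARAMETER_VALUE,
        )
    # If parsing succeeded, types are assumed OK; remaining checks (required
    # fields, maximum lengths, numeric thresholds, ...) still run afterwards.
    return message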

@mrkaye97 (Contributor, Author) commented:

thanks for the feedback @dbczumar -- I'll revert the R changes in favor of your suggestion to preserve existing behavior. Just to make sure I'm understanding how you're thinking about this, though: your assumption is that if parse_dict succeeds (as in, doesn't throw an error), then the types are good to go?

Do you have any idea how prevalent this issue is? Is it only for numeric types? or might we break other things like "true" --> True? I can reliably get the API to throw a ParseError if I (e.g.) specify 123 for a string parameter, but I'm concerned that catching all parse errors and treating them as type errors is dangerous.

It seems like maybe a better solution (if we could think of all of the cases) would be to change _assert_integer (for example) to _assert_integerish and try to first parse the thing that we're asserting on to an int. If the parsing fails, we throw a type error and a 400, and if parsing succeeds, then the integer type assertion will trivially succeed as well. What do you think of that approach? I'd think that we'd need _assert_floatish, _assert_integerish, and _assert_booleanish, and that might be it? I think it'd keep the logic more concise that way ¯\_(ツ)_/¯

@dbczumar (Collaborator) commented:

> (quoting the previous comment)

Thanks @mrkaye97 ! Correct, if parse_dict succeeds, we can assume the types are good to go. Because I'm not 100% sure of the scope / prevalence here, I'd be wary of trying to implement our own _assert_integerish, _assert_booleanish, etc logic here. I understand your concern about treating all parse_dict failures as type errors, and I think we can resolve it by carefully wording our error message when parse_dict fails. For example, we could say something to the effect of

Failed to parse request body. Please verify that your LogMetric request includes a valid JSON body with fields satisfying the following specifications:

run_id: String, required
key: String, required
value: Double, required
timestamp: Int, required
step: Int

This allows us to provide an informative message for most cases while keeping the implementation cheap / simple. If the user can verify that their request looks good and it still isn't working, they can always file an issue.

For more specific cases where we know the cause of the failure - e.g. the field is missing but required or fails to satisfy some numeric constraint - we can use the more specific error messages that have already been implemented in your PR.

Let me know what your thoughts are here.

@mrkaye97 (Contributor, Author) commented:

> (quoting the previous comment)

Gotcha, that makes sense to me 👍 Agreed that trying to cover all the cases is risky.

I imagine that I'm going to need to rewire some things a little bit, since this conditional logic (only check types on parse failure) is a little more involved than the original logic was. I'll work on this over the next few days and report back!

@mrkaye97 (Contributor, Author) commented Apr 2, 2022

@dbczumar Circling back: I think this approach actually raises some bigger issues buried deeper in the code. For instance: If I make a call to log_metric with key = 31, parse_dict succeeds. As such, the validation for the type of key is skipped (even though it should fail, but that's okay in theory).

And then the problem occurs: Downstream, I end up with this:

>       if name is None or not _VALID_PARAM_AND_METRIC_NAMES.match(name):
E       TypeError: expected string or bytes-like object

name       = 31

Which I think results in a 500. I actually can't repro this from an R client making the same calls, which I'm a little confused by. In that case, I get back a 400 with the correct validation error. Got any ideas on this?

Separately, it's weird to me from a design perspective that this validation would happen twice for some calls (but not others?). That seems unnecessary, and it'll result in odd UX for users who will get different error messages for the "same" error, depending on the endpoint they're hitting.

IMO, the bottom line is that it's not clear to me that it's safe to assume that if parse_dict succeeds, all is well with the types

Comment on lines 296 to 335
if f in [
    _assert_int,
    _assert_string,
    _assert_bool,
    _assert_float,
    _assert_array,
    _assert_item_type_string,
] and parsing_succeeded:
    continue
@mrkaye97 (Contributor, Author) commented:

This was the best way I could think of to do this that didn't require a huge refactor

@dbczumar (Collaborator) commented Apr 6, 2022:

Are these methods invoked at any point? If not, can we remove them from the PR entirely?

@dbczumar (Collaborator) commented:

Nevermind, misread the logic. This makes sense :)

@dbczumar (Collaborator) left a comment:

LGTM! Thanks for all your hard work, @mrkaye97! This is awesome! I'll go ahead and clean up some lint errors and get this merged.

@dbczumar (Collaborator) left a comment:

@mrkaye97 I resolved the lint errors and updated the tests to make use of raw REST requests to bypass client-side validations when testing server-side validations. I also updated the server-side validation logic to check all fields associated with an _assert_required function, even if they aren't present in the request body (previously, we only validated a field if it was present in the request, but we should throw if a required field is absent), and I improved the error message for the case where _assert_required fails. Finally, I added validations for the CreateExperiment tags and artifact_location fields. I plan to merge tomorrow. Thanks again!
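
As a small illustration of the required-field change described above (hypothetical helper; the real logic lives in handlers.py and may differ):

def _validate_required_fields(body, required_fields):
    # Previously, a field was only validated when it was present in the request
    # body; now a missing required field is itself a validation error.
    missing = [f for f in required_fields if body.get(f) is None]
    if missing:
        raise ValueError(f"Missing required field(s): {', '.join(missing)}")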

@mrkaye97 (Contributor, Author) commented:

awesome, thanks for all the help and doing all of that cleanup to get this over the line @dbczumar! excited to see this go live -- I think it'll improve the UX a lot 🥳

@dbczumar (Collaborator) commented:

> awesome, thanks for all the help and doing all of that cleanup to get this over the line @dbczumar! excited to see this go live -- I think it'll improve the UX a lot 🥳

Of course! Thanks so much for all your hard work! Merging! 🎆

@amesar commented Jun 3, 2022

I did a deep dive into JSON Schema back in 2012, and although it's a great idea, there were so many issues that I ditched it. Glad to see it's working now! Schema-less JSON makes me shudder. It's like fingers-crossed programming ;)

@mrkaye97 mentioned this pull request Apr 28, 2023
Labels
area/tracking Tracking service, tracking client APIs, autologging rn/bug-fix Mention under Bug Fixes in Changelogs. rn/feature Mention under Features in Changelogs.
Development

Successfully merging this pull request may close these issues.

[FR] Validate request JSON with JSON schema or similar
4 participants