New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUGFIX] Fix run_diagnostics for contrib expectations #3096
[BUGFIX] Fix run_diagnostics for contrib expectations #3096
Conversation
✔️ Deploy Preview for knoxpod ready! 🔨 Explore the source changes: 4891cf8 🔍 Inspect the deploy log: https://app.netlify.com/sites/knoxpod/deploys/60f99a57fb302a000733cf45 😎 Browse the preview: https://deploy-preview-3096--knoxpod.netlify.app |
HOWDY! This is your friendly 🤖 CHANGELOG bot 🤖Please don't forget to add a clear and succinct description of your change under the Develop header in ✨ Thank you! ✨ |
… github.com:great-expectations/great_expectations into bugfix/fix_run_diagnostics_for_contrib_expectations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few minor comments but none of them are blocking so LGTM 🚀
This is more of a general software engineering question but if we find bugs that haven't been caught by our test suite, should we add new tests to cover the missing functionality upon fixing the bug? So in the case of None
being returned by sort()
, should we add a test that ensures we properly handles cases that failed before the change to sorted()
?
value = sorted(value, key=lambda x: str(x)) | ||
result["result"]["unexpected_list"] = sorted( | ||
result["result"]["unexpected_list"], key=lambda x: str(x) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sort()
vs sorted()
is always annoying. Great catch!
What are the types of value
and unexpected_list
here? Do we need the custom comparator for sorting? I think its perfectly fine but just asking for my knowledge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The types for both of these should be list
. We do a check on L1905 to see if value
is a list, and we only get here if that resolves to True. We also assume that unexpected_list
is True, and I think that's just a fundamental part of expectations (I'm not sure whether we test than anywhere). I can add some typing here as well - I'm not going to add it everywhere in this module, so it may look a bit funny, but if adding it here makes it clearer, I can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope I think that explanation is more than sufficient for me :) thanks for clarifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, and the custom comparator is due to an issue that was arising if the values in the list were of different types. So if one of the values was a list, but the rest were strings, this would raise an error due to trying to sort items of different types.
"out": { | ||
"unexpected_list": ["Jan 01 2001 12:00:01"], | ||
"unexpected_index_list": [0, 9], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify what this removal is doing exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially, this json test is running a test for this expectation for both pandas and sqlalchemy. I believe that parse_strings_as_datetimes
functions differently for each of these. So in the case of pandas, the unexpected list would look like what was there (though it was previously missing the 1970 date from L9). However in the case of sqlalchemy, this date actually gets parsed into "2001-01-01 12:00:01" or thereabouts. (I also might be mixing up pandas and sqlalchemy here). Though the expectation for both backgrounds is correct, the unexpected list for each would be slightly different. I could split out the section completely for the two backends, but that adds a fair amount of complexity, and we have most of the functionality due to unexpected_index_list - this tells us the index of the unexpected dates. So I decided to go with that approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes a lot of sense as well. Do you think we should add a ticket about bifurcating these backends to our backlog? I think that split might be helpful but I'm not sure if the value it'll provide is worth the effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut tells me that the bifurcation would not be worthwhile. I think it would really only help in very severe edge cases, like this one, and it would introduce a bunch of complexity and duplicated code.
It might be worth a re-visit of this testing infrastructure entirely to make sure that it's effective for V3. When we initially set it up, it was basically a port of V2 functionality to V3. I'm not sure how much additional thought and work has been done to it since then, but it's possible that it could stand some updating.
Thanks for the review, @cdkini! To your point about adding test coverage - the way I'm looking at this, we thought we had test coverage here, and we didn't, due to the sorted/sort issue. That is fixed now, and so now we should catch any errors not previously caught (which is the case in the two other files amended here. Adding additional test coverage feels like we would be testing our testing capability. Am I thinking about this right? Happy to chat. |
Hmm I think that makes sense. My question was more general and less specific to this PR so it may have been better to have discussed this externally. Sorry if it's off-topic. The change you made is a very internal part of the method and so, I think isolating that specific behavior in a test would be difficult. Plus, more tests are not always the answer. I more so asked the question to get a better sense of when and when not to add tests but I think the points you've made are valid and I'm happy to have this merged as is! |
…https://github.com/great-expectations/great_expectations into docs/GDOC-199/snowflake-connections-closed-correctly * 'docs/GDOC-199/snowflake-connections-closed-correctly' of https://github.com/great-expectations/great_expectations: Update util.py [MAINTENANCE] rename map_metric.py to map_metric_provider.py (with DeprecationWarning) for a better code readability/interpretability (#3103) [MAINTENANCE] rename ColumnMetricProvider to ColumnAggregateMetricProvider (with DeprecationWarning) (#3100) use correct variable name (#3069) [RELEASE] release candidate for 2021-07-22 (#3101) [FEATURE] SqlAlchemy engine support for column.most_common_value metric (#3020) [BUGFIX] Fix run_diagnostics for contrib expectations (#3096) [BUGFIX] Fix typos discovered by codespell (#3064) [DOCS] Migrating pages under guides/miscellaneous (#3094) [DOCS] Port over "How to run a Checkpoint in Airflow" from RTD to Docusaurus (#3074) disable snowflake tests temporarily (#3093) [BUGFIX] fix incorrect pandas top rows usage (#3091)
Changes proposed in this pull request:
check_json_test_result
function whereby the result oflist.sort()
, which is None, was being assigned to a variable, when it should have beensorted(list)
..sort()
tosorted()
, this also adds a key tosorted()
to make the sorting function more robust when it comes to comparing items of different data types.expect_column_values_to_be_between
, I removed theunexpected_list
due to conflicts between timestamp parsing betweenpandas
andsqlalchemy
.Thanks to @manyshapes for surfacing!
Definition of Done
Please delete options that are not relevant.
Thank you for submitting!