Fix #2873: DataQuality test specifications and APIs #2906

Merged: 5 commits into main on Feb 24, 2022

Conversation

harshach (Collaborator)

Describe your changes:

see #2873

Type of change:

  • New feature

Frontend Preview (Screenshots):

For frontend-related changes, please link screenshots previewing your changes. Optional for backend-related changes.

Checklist:

  • I have read the CONTRIBUTING document.
  • I have performed a self-review of my own code.
  • I have tagged my reviewers below.
  • I have commented on my code, particularly in hard-to-understand areas.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • All new and existing tests passed.

Reviewers

@github-actions

Schema Change Detected. Needs ingestion-core version bump

Please run make core_bump_version_dev in the project's root and commit the changes to _version.py in this PR. Please ignore if this has been handled already.

harshach marked this pull request as draft on February 22, 2022 at 01:47

sonarcloud bot commented Feb 22, 2022

[catalog] Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 0 Code Smells (rating A)
  • No coverage information
  • No duplication information

"name": "ColumnTest"
}
]
},
Contributor

If we add tests for say reports, dashboards and other data assets, would they be defined here?

Collaborator Author (harshach)

Yes, that's the idea. One thing with the JSON Schema: say we define a type like the above, but I want to designate a type in the test definition, i.e. call out a value in the enum to be the type. I am not able to find a way to configure that.

{
  "$id": "https://open-metadata.org/schema/tests/column/columnValuesToMatchRegex.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "columnValuesToBeUnique",
  "description": "This schema defines the test ColumnValuesToMatchRegex. Test the values in a column to match a given regular expression. ",
  "type": "object",
  "javaType": "org.openmetadata.catalog.entity.tests.column.ColumnValuesToMatchRegex",
  "properties": {
    "column": {
      "description": "Name of the column in a table.",
      "type": "string"
    },
    "regex": {
      "description": "The regular expression the column entries should match.",
      "type": "string"
    },
    "type": {
      "$ref": "./basic.json#definitions/testType#ColumnTest"
    }
  },
  "required": ["column", "regex"]
}

@pmbrull (Collaborator) left a comment

Hi @harshach, thanks for laying this out.

This approach LGTM; I think we can move forward with a proper definition of the tests and remove the expression language from the profiler altogether. In the end, having a specific language for describing test cases would just end up being an unnecessary layer of abstraction on top of the supported definitions, so it feels more natural to go directly to the source.

On the Python side, we will just need a bridge between the test definition, the required metric to extract the data from, and the actual Metric implementation.
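A minimal sketch of that bridge, assuming hypothetical names (TestCase, TEST_METRIC_REGISTRY, and the metric/test identifiers are illustrative, not the actual ingestion classes): each test type maps to the metrics it needs, and a small dispatcher evaluates the test once those metric results are available.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TestCase:
    # Hypothetical stand-in for a test definition coming from the JSON schemas.
    name: str
    test_type: str                       # e.g. "tableRowCountToEqual"
    config: Dict[str, float] = field(default_factory=dict)

# Bridge: which profiler metrics each test type needs before it can run.
TEST_METRIC_REGISTRY: Dict[str, List[str]] = {
    "tableRowCountToEqual": ["row_count"],
    "columnValuesToBeUnique": ["count", "unique_count"],
}

def evaluate(test: TestCase, metrics: Dict[str, float]) -> bool:
    """Check that all required metrics were computed, then run the comparison."""
    missing = [m for m in TEST_METRIC_REGISTRY[test.test_type] if m not in metrics]
    if missing:
        raise ValueError(f"Missing metrics for {test.name}: {missing}")
    if test.test_type == "tableRowCountToEqual":
        return metrics["row_count"] == test.config["value"]
    if test.test_type == "columnValuesToBeUnique":
        return metrics["count"] == metrics["unique_count"]
    raise NotImplementedError(test.test_type)

# Example: the profiler computed row_count = 120 and the test expects 120.
print(evaluate(TestCase("orders_row_count", "tableRowCountToEqual", {"value": 120}),
               {"row_count": 120}))     # True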

What we will need to have super clearly documented, covered in tutorials, workshops, etc. is the addition of new tests. Having a tightly coupled implementation makes it straightforward to use what we have, but there's always that "other" test. At some point, we should come up with the different "families" of tests we want to support and a first implementation of each that could serve as an example. I would then try to type the tests' definitions a bit further (for example, the tableCount test is a StaticTest). I could come up with the following groups of possible implementations (almost a 1:1 relation to the Metric definitions):

Static Tests

Independent and individual computations, e.g., MIN, MAX or COUNT

  • Note that in the metrics' core.py I also defined the Composed type, but that is just an internal implementation to not repeat computations, so we can skip that here.
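A static test is then just a one-off comparison against a single computed value; a tiny illustrative sketch (the function and parameter names are made up):

def column_values_to_be_between(column_min: float, column_max: float,
                                min_value: float, max_value: float) -> bool:
    # Static test: the column's MIN and MAX must fall inside the configured bounds.
    return column_min >= min_value and column_max <= max_value

# Example: the profiler reported MIN=3 and MAX=87; the test expects values in [0, 100].
assert column_values_to_be_between(3, 87, min_value=0, max_value=100)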

Time/Window Tests

We are interested in evolution and trends, not one-off results. E.g., the COUNT cannot increase more than a specific % in the last N days.

  • Here the actual implementation might vary. I'd be inclined to have these tests pick up past profile data rather than compute everything every time. While this means that the test will be "unavailable" during the first N days, afterwards the performance will be much higher. Another option would be having smarter logic and computing all missing data, but then the issue becomes how and where to store this info.
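A rough sketch of the "pick up past profile data" idea (the history list stands in for stored profile results; all names are illustrative): the test returns None, i.e. "unavailable", until enough history has accumulated.

from statistics import mean
from typing import List, Optional

def row_count_increase_within(history: List[int], today: int,
                              max_increase_pct: float, window: int = 7) -> Optional[bool]:
    # Window test: today's COUNT must not exceed the average of the last
    # `window` stored profile results by more than max_increase_pct.
    recent = history[-window:]
    if len(recent) < window:
        return None  # not enough past profile data yet: the test is "unavailable"
    baseline = mean(recent)
    return today <= baseline * (1 + max_increase_pct / 100)

# Example: 7 days of history averaging 100 rows, 130 rows today, allowed increase +20%.
print(row_count_increase_within([100] * 7, 130, max_increase_pct=20))  # False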

Rule Tests

To find relations/conditions with the same Table or other Tables. E.g.,

  • Same Table condition: if delivered_status is NULL then driver is NULL
  • Table A to Table B condition: SUM(orders.amount) == SUM(invoices.amount)
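The Table A to Table B condition above could be evaluated as two scalar queries plus a comparison; a sketch using sqlite3 purely for illustration (the table and column names are the ones from the example):

import sqlite3

def sums_match(conn: sqlite3.Connection) -> bool:
    # Rule test across two tables: SUM(orders.amount) == SUM(invoices.amount).
    orders_total = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
    invoices_total = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM invoices").fetchone()[0]
    return orders_total == invoices_total

# Example with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (amount REAL);
    CREATE TABLE invoices (amount REAL);
    INSERT INTO orders VALUES (10), (20);
    INSERT INTO invoices VALUES (30);
""")
print(sums_match(conn))  # True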

Stats Tests

For the future, if we feel brave, check data distribution. This feels a bit more niche, but can be a huge help on ML work.

Custom Tests

Where the user provides an SQL query and the condition to check. Here we would still need some ground rules, as it can be hard to support absolutely anything. I could imagine something like: if the query returns 0, the test passes; if it returns 1, the test fails. Users could easily prepare statements with subqueries or CTEs to check for this.
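A sketch of that 0/1 convention (the helper name is hypothetical): the user supplies the full query, and the runner only interprets the single returned value.

import sqlite3

def run_custom_test(conn: sqlite3.Connection, query: str) -> bool:
    # Custom test convention: the user query must return a single 0 (pass) or 1 (fail).
    result = conn.execute(query).fetchone()[0]
    if result not in (0, 1):
        raise ValueError(f"Custom test query must return 0 or 1, got {result!r}")
    return result == 0

# Example: fail if any order has a negative amount (a subquery or CTE works the same way).
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE orders (amount REAL); INSERT INTO orders VALUES (10), (-5);")
query = "SELECT CASE WHEN EXISTS (SELECT 1 FROM orders WHERE amount < 0) THEN 1 ELSE 0 END"
print(run_custom_test(conn, query))  # False: the test fails because a negative amount exists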

Other considerations

Tests on query result

While it makes sense for us to have test definitions linked to a table, that might not always cover all the scenarios. We might need to allow any of the supported tests to run on top of a query, not on top of a table.

Users could still decide under which Table the test would live, so our link between Test <-> Table can be kept.
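One way to keep the Test <-> Table link while allowing a query target; this is only a hypothetical shape for the test's target, not a proposal for the final schema:

from dataclasses import dataclass
from typing import Optional

@dataclass
class TestTarget:
    # The Table the test "lives" under, keeping the Test <-> Table link intact.
    table_fqn: str
    # Optional query: when set, the test runs on the query result instead of the raw table.
    query: Optional[str] = None

# A table-level test and a query-level test attached to the same table.
t1 = TestTarget("service.db.schema.orders")
t2 = TestTarget("service.db.schema.orders",
                query="SELECT * FROM orders WHERE market = 'food'")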

Confidence Intervals

(IDK how to name it better.) Depending on the number of values failing the test, or how far the failure is from the accepted value, we "could live with it". Setting COUNT > 100 and getting 99 is different from getting 0. We could allow users to define what % of precision they expect and then introduce a new test case status, Warning. Users should then be able to validate the result and manually mark it either as OK or KO.
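A sketch of the Warning idea (the status names and the tolerance parameter are hypothetical): the result is OK if the threshold is met, Warning if the miss is within a user-defined tolerance, and KO otherwise.

from enum import Enum

class TestStatus(Enum):
    OK = "OK"
    WARNING = "Warning"     # needs manual validation by the user
    KO = "KO"

def evaluate_with_tolerance(actual: float, expected_min: float,
                            tolerance_pct: float) -> TestStatus:
    # e.g. COUNT > 100: getting 99 with a 5% tolerance is a Warning, getting 0 is a KO.
    if actual >= expected_min:
        return TestStatus.OK
    if actual >= expected_min * (1 - tolerance_pct / 100):
        return TestStatus.WARNING
    return TestStatus.KO

print(evaluate_with_tolerance(99, 100, tolerance_pct=5))  # TestStatus.WARNING
print(evaluate_with_tolerance(0, 100, tolerance_pct=5))   # TestStatus.KO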

Test Suites

As discussed, some homework for further releases could be to add the Entity TestSuite, so that when we talk about executions we not only take into account individual tests but specific groupings. This can be interesting to find out linked test failures or more broad effects. However, let's keep that in the fridge for now. We can revisit after we have figured out this first version of the UI.

Running on Data Sample

Let users specify the % of the data they want the tests to run on top of. E.g., run on top of 20% of the table.
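A sketch of wrapping the profiled table in a sampling clause (TABLESAMPLE syntax varies by database; BERNOULLI is shown only as an example and the helper name is made up):

def sampled_source(table: str, sample_pct: float) -> str:
    # Run profiler/test queries against a sample of the table instead of the full data.
    return f"(SELECT * FROM {table} TABLESAMPLE BERNOULLI ({sample_pct})) AS sampled"

# e.g. run on top of 20% of the table
print(f"SELECT COUNT(*) FROM {sampled_source('orders', 20)}")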

Filters

In the test definition, add some conditions to filter data. COUNT where market == 'food'. This could be a simple SQL-like predicate expression. Setting it as a string and parsing it should be simple enough and it helps us not overcomplicate the test JSON definition.
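Keeping the filter as a plain SQL-like predicate string then only requires simple string handling on our side; a minimal sketch with hypothetical names:

from typing import Optional

def filtered_count_query(table: str, predicate: Optional[str] = None) -> str:
    # The filter stays a plain predicate string in the test JSON; we only append it.
    query = f"SELECT COUNT(*) FROM {table}"
    if predicate:
        query += f" WHERE {predicate}"
    return query

print(filtered_count_query("orders", "market = 'food'"))
# SELECT COUNT(*) FROM orders WHERE market = 'food'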

Skip Columns

This one might be more on the Profiler side, but we should allow skipping columns where profile information just has no value, such as any ID.
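This could be as simple as a name-based exclusion; the pattern below is only an illustration of the idea:

import re

# Columns whose names look like identifiers rarely carry useful profile information.
ID_LIKE = re.compile(r"(^id$|_id$|^uuid$)", re.IGNORECASE)

def columns_to_profile(columns):
    return [c for c in columns if not ID_LIKE.search(c)]

print(columns_to_profile(["id", "order_id", "amount", "market"]))  # ['amount', 'market']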

Tests & Profiler

Just laying down some thoughts here. The approach I have followed in the workflow so far is:

  • In the workflow we need to, at a minimum, define a Profiler with something like
    "name": "my_profiler",
    "table_metrics": ["row_number"],
    "metrics": ["min", "COUNT", "null_count"],
    Note that here we are passing the Metrics we want to execute at table and column level.
  • We run the profiler and store the metric values.
  • We then jump to the testing phase and perform simple validations: does my metric result match the expected one?

My question then is, how do we want to link each test to the required metrics? With the current expression language I just use the metric name, but we might need to take this into consideration when defining the tests. Just off the top of my head, I would:

  1. If we are defining the Tests, we should also define all the supported metrics in the JSON Schema
  2. The tests should then have an EntityRef to the metric(s) they require

This way, we can validate that we are going to run a test that has all the necessary ingredients. Moreover, when working on the UI implementation of the testing workflows, we could then make sure that we are defining the Profiler metrics based on what is necessary for the tests.
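For that validation step, a small helper could cross-check the Profiler's configured metrics against what the defined tests require (the metric and test names below are illustrative):

from typing import Dict, List, Set

def missing_metrics(profiler_metrics: List[str],
                    test_requirements: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    # For each test, report the required metrics the Profiler is not configured to compute,
    # so the workflow (or the UI) can flag the gap before running anything.
    configured = {m.lower() for m in profiler_metrics}
    return {
        test: {m for m in required if m.lower() not in configured}
        for test, required in test_requirements.items()
        if any(m.lower() not in configured for m in required)
    }

# Using the "my_profiler" configuration above:
gaps = missing_metrics(
    ["row_number", "min", "COUNT", "null_count"],
    {"columnValuesToBeUnique": ["count", "unique_count"]},
)
print(gaps)  # {'columnValuesToBeUnique': {'unique_count'}}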

I am sure that for 90% of the cases, just having a "default" Profiler with a set of the most common metrics would be enough, but I am thinking about what would happen for more involved tests with super-specific metrics.


Sorry for the long write-up; I brain-dumped my thoughts from working on the Python implementation 🤔 Happy to jump on a call to fine-tune the process & defs!

Thanks 🙏

[Attachment: test-schema.drawio diagram]

harshach marked this pull request as ready for review on February 24, 2022 at 06:34
@pmbrull (Collaborator) left a comment

LGTM 🚀 Excited to get this tested! We can iterate if needed, thanks!!


sonarcloud bot commented Feb 24, 2022

[open-metadata-ingestion] Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 3 Code Smells (rating A)
  • 0.0% Coverage
  • 0.0% Duplication

@harshach (Collaborator, Author)

Merging this in, as the failing test is not related to the code in this PR.

harshach merged commit 724989e into main on Feb 24, 2022
harshach deleted the issue-2873 branch on February 24, 2022 at 07:13