Fix #2873: DataQuality test specifications and APIs #2906

Merged: 5 commits into main on Feb 24, 2022

Conversation

harshach (Collaborator)

Describe your changes:

see #2873

Type of change:

  • New feature

Frontend Preview (Screenshots):

For frontend-related changes, please link screenshots previewing your changes. Optional for backend-related changes.

Checklist:

  • I have read the CONTRIBUTING document.
  • I have performed a self-review of my own code.
  • I have tagged my reviewers below.
  • I have commented on my code, particularly in hard-to-understand areas.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • All new and existing tests passed.

Reviewers

@github-actions

Schema Change Detected. Needs ingestion-core version bump

Please run make core_bump_version_dev in the project's root and commit the changes to _version.py in this PR. Please ignore if this has been handled already.

harshach marked this pull request as draft on February 22, 2022 at 01:47

sonarcloud bot commented Feb 22, 2022

[catalog] Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 0 Code Smells (rating A)
  • No coverage information
  • No duplication information

"name": "ColumnTest"
}
]
},
Contributor

If we add tests for say reports, dashboards and other data assets, would they be defined here?

Collaborator Author (harshach)

Yes, that's the idea. One thing with the JSON Schema: say we define a type like the above, but I want to designate a type in the test definition, i.e. call out a value in the enum to be the type. I am not able to find a way to configure that.

{
  "$id": "https://open-metadata.org/schema/tests/column/columnValuesToMatchRegex.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "columnValuesToBeUnique",
  "description": "This schema defines the test ColumnValuesToMatchRegex. Test the values in a column to match a given regular expression. ",
  "type": "object",
  "javaType": "org.openmetadata.catalog.entity.tests.column.ColumnValuesToMatchRegex",
  "properties": {
    "column": {
      "description": "Name of the column in a table.",
      "type": "string"
    },
    "regex": {
      "description": "The regular expression the column entries should match.",
      "type": "string"
    },
    "type": {
      "$ref": "./basic.json#definitions/testType#ColumnTest"
    }
  },
  "required": ["column", "regex"]
}

@pmbrull (Collaborator) left a comment

Hi @harshach, thanks for laying this out.

This approach LGTM; I think we can move forward with a proper definition of the tests and remove the expression language from the profiler altogether. In the end, having a specific language for describing test cases would just end up being an unnecessary layer of abstraction on top of the supported definitions, so it feels more natural to go directly to the source.

On the Python side, we will just need a bridge between the test definition, the required metric to extract the data from, and the actual Metric implementation.
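A minimal sketch of that bridge, assuming hypothetical names (TestCase, TEST_METRIC_REGISTRY, and the metric/test identifiers are illustrative, not the actual ingestion classes): each test type maps to the metrics it needs, and a small dispatcher evaluates the test once those metric results are available.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TestCase:
    # Hypothetical stand-in for a test definition coming from the JSON schemas.
    name: str
    test_type: str                       # e.g. "tableRowCountToEqual"
    config: Dict[str, float] = field(default_factory=dict)

# Bridge: which profiler metrics each test type needs before it can run.
TEST_METRIC_REGISTRY: Dict[str, List[str]] = {
    "tableRowCountToEqual": ["row_count"],
    "columnValuesToBeUnique": ["count", "unique_count"],
}

def evaluate(test: TestCase, metrics: Dict[str, float]) -> bool:
    """Check that all required metrics were computed, then run the comparison."""
    missing = [m for m in TEST_METRIC_REGISTRY[test.test_type] if m not in metrics]
    if missing:
        raise ValueError(f"Missing metrics for {test.name}: {missing}")
    if test.test_type == "tableRowCountToEqual":
        return metrics["row_count"] == test.config["value"]
    if test.test_type == "columnValuesToBeUnique":
        return metrics["count"] == metrics["unique_count"]
    raise NotImplementedError(test.test_type)

# Example: the profiler computed row_count = 120 and the test expects 120.
print(evaluate(TestCase("orders_row_count", "tableRowCountToEqual", {"value": 120}),
               {"row_count": 120}))     # True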

What we will need to have super clearly documented, covered in tutorials, workshops, etc. is the addition of new tests. Having a tightly coupled implementation makes it straightforward to use what we have, but there's always that "other" test. At some point, we should come up with the different "families" of tests we want to support and a first implementation of each that could serve as an example. I would then try to type the tests' definitions a bit further (for example, the tableCount test is a StaticTest). I could come up with the following groups of possible implementations (almost a 1:1 relation to the Metric definitions):

Static Tests

Independent and individual computations, e.g., MIN, MAX or COUNT

  • Note that in the metrics' core.py I also defined the Composed type, but that is just an internal implementation to not repeat computations, so we can skip that here.
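A static test is then just a one-off comparison against a single computed value; a tiny illustrative sketch (the function and parameter names are made up):

def column_values_to_be_between(column_min: float, column_max: float,
                                min_value: float, max_value: float) -> bool:
    # Static test: the column's MIN and MAX must fall inside the configured bounds.
    return column_min >= min_value and column_max <= max_value

# Example: the profiler reported MIN=3 and MAX=87; the test expects values in [0, 100].
assert column_values_to_be_between(3, 87, min_value=0, max_value=100)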

Time/Window Tests

We are interested in evolution and trends, not one-off results. E.g., the COUNT cannot increase more than a specific % in the last N days.

  • Here the actual implementation might vary. I'd be inclined to have these tests pick up past profile data rather than compute everything every time. While this means that the test will be "unavailable" during the first N days, afterwards the performance will be much higher. Another option would be having smarter logic and computing all missing data, but then the issue becomes how and where to store this info.
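A rough sketch of the "pick up past profile data" idea (the history list stands in for stored profile results; all names are illustrative): the test returns None, i.e. "unavailable", until enough history has accumulated.

from statistics import mean
from typing import List, Optional

def row_count_increase_within(history: List[int], today: int,
                              max_increase_pct: float, window: int = 7) -> Optional[bool]:
    # Window test: today's COUNT must not exceed the average of the last
    # `window` stored profile results by more than max_increase_pct.
    recent = history[-window:]
    if len(recent) < window:
        return None  # not enough past profile data yet: the test is "unavailable"
    baseline = mean(recent)
    return today <= baseline * (1 + max_increase_pct / 100)

# Example: 7 days of history averaging 100 rows, 130 rows today, allowed increase +20%.
print(row_count_increase_within([100] * 7, 130, max_increase_pct=20))  # False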

Rule Tests

To find relations/conditions with the same Table or other Tables. E.g.,

  • Same Table condition: if delivered_status is NULL then driver is NULL
  • Table A to Table B condition: SUM(orders.amount) == SUM(invoices.amount)
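The Table A to Table B condition above could be evaluated as two scalar queries plus a comparison; a sketch using sqlite3 purely for illustration (the table and column names are the ones from the example):

import sqlite3

def sums_match(conn: sqlite3.Connection) -> bool:
    # Rule test across two tables: SUM(orders.amount) == SUM(invoices.amount).
    orders_total = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
    invoices_total = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM invoices").fetchone()[0]
    return orders_total == invoices_total

# Example with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (amount REAL);
    CREATE TABLE invoices (amount REAL);
    INSERT INTO orders VALUES (10), (20);
    INSERT INTO invoices VALUES (30);
""")
print(sums_match(conn))  # True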

Stats Tests

For the future, if we feel brave, check data distribution. This feels a bit more niche, but can be a huge help on ML work.

Custom Tests

Where the user provides an SQL query and the condition to check. Here we would still need some ground rules, as it can be hard to support absolutely anything. I could imagine something like: if the query returns 0, the test passes; if it returns 1, the test fails. Users could easily prepare statements with subqueries or CTEs to check for this.
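A sketch of that 0/1 convention (the helper name is hypothetical): the user supplies the full query, and the runner only interprets the single returned value.

import sqlite3

def run_custom_test(conn: sqlite3.Connection, query: str) -> bool:
    # Custom test convention: the user query must return a single 0 (pass) or 1 (fail).
    result = conn.execute(query).fetchone()[0]
    if result not in (0, 1):
        raise ValueError(f"Custom test query must return 0 or 1, got {result!r}")
    return result == 0

# Example: fail if any order has a negative amount (a subquery or CTE works the same way).
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE orders (amount REAL); INSERT INTO orders VALUES (10), (-5);")
query = "SELECT CASE WHEN EXISTS (SELECT 1 FROM orders WHERE amount < 0) THEN 1 ELSE 0 END"
print(run_custom_test(conn, query))  # False: the test fails because a negative amount exists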

Other considerations

Tests on query result

While it makes sense for us to have test definitions linked to a table, that might not always cover all the scenarios. We might need to allow any of the supported tests to run on top of a query, not on top of a table.

Users could still decide under which Table the test would live, so our link between Test <-> Table can be kept.
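One way to keep the Test <-> Table link while allowing a query target; this is only a hypothetical shape for the test's target, not a proposal for the final schema:

from dataclasses import dataclass
from typing import Optional

@dataclass
class TestTarget:
    # The Table the test "lives" under, keeping the Test <-> Table link intact.
    table_fqn: str
    # Optional query: when set, the test runs on the query result instead of the raw table.
    query: Optional[str] = None

# A table-level test and a query-level test attached to the same table.
t1 = TestTarget("service.db.schema.orders")
t2 = TestTarget("service.db.schema.orders",
                query="SELECT * FROM orders WHERE market = 'food'")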

Confidence Intervals

(IDK how to name it better.) Depending on the number of values failing the test, or how far the failure is from the accepted value, we "could live with it". Setting COUNT > 100 and getting 99 is different from getting 0. We could allow users to define what % of precision they expect and then introduce a new test case status, Warning. Users should then be able to validate the result and manually mark it either as OK or KO.
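A sketch of the Warning idea (the status names and the tolerance parameter are hypothetical): the result is OK if the threshold is met, Warning if the miss is within a user-defined tolerance, and KO otherwise.

from enum import Enum

class TestStatus(Enum):
    OK = "OK"
    WARNING = "Warning"     # needs manual validation by the user
    KO = "KO"

def evaluate_with_tolerance(actual: float, expected_min: float,
                            tolerance_pct: float) -> TestStatus:
    # e.g. COUNT > 100: getting 99 with a 5% tolerance is a Warning, getting 0 is a KO.
    if actual >= expected_min:
        return TestStatus.OK
    if actual >= expected_min * (1 - tolerance_pct / 100):
        return TestStatus.WARNING
    return TestStatus.KO

print(evaluate_with_tolerance(99, 100, tolerance_pct=5))  # TestStatus.WARNING
print(evaluate_with_tolerance(0, 100, tolerance_pct=5))   # TestStatus.KO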

Test Suites

As discussed, some homework for further releases could be to add the Entity TestSuite, so that when we talk about executions we not only take into account individual tests but specific groupings. This can be interesting to find out linked test failures or more broad effects. However, let's keep that in the fridge for now. We can revisit after we have figured out this first version of the UI.

Running on Data Sample

Let users specify the % of the data they want the tests to run on top of. E.g., run on top of 20% of the table.
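A sketch of wrapping the profiled table in a sampling clause (TABLESAMPLE syntax varies by database; BERNOULLI is shown only as an example and the helper name is made up):

def sampled_source(table: str, sample_pct: float) -> str:
    # Run profiler/test queries against a sample of the table instead of the full data.
    return f"(SELECT * FROM {table} TABLESAMPLE BERNOULLI ({sample_pct})) AS sampled"

# e.g. run on top of 20% of the table
print(f"SELECT COUNT(*) FROM {sampled_source('orders', 20)}")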

Filters

In the test definition, add some conditions to filter data. COUNT where market == 'food'. This could be a simple SQL-like predicate expression. Setting it as a string and parsing it should be simple enough and it helps us not overcomplicate the test JSON definition.
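Keeping the filter as a plain SQL-like predicate string then only requires simple string handling on our side; a minimal sketch with hypothetical names:

from typing import Optional

def filtered_count_query(table: str, predicate: Optional[str] = None) -> str:
    # The filter stays a plain predicate string in the test JSON; we only append it.
    query = f"SELECT COUNT(*) FROM {table}"
    if predicate:
        query += f" WHERE {predicate}"
    return query

print(filtered_count_query("orders", "market = 'food'"))
# SELECT COUNT(*) FROM orders WHERE market = 'food'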

Skip Columns

This one might be more on the Profiler side, but we should allow skipping columns where profile information just has no value, such as any ID.
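This could be as simple as a name-based exclusion; the pattern below is only an illustration of the idea:

import re

# Columns whose names look like identifiers rarely carry useful profile information.
ID_LIKE = re.compile(r"(^id$|_id$|^uuid$)", re.IGNORECASE)

def columns_to_profile(columns):
    return [c for c in columns if not ID_LIKE.search(c)]

print(columns_to_profile(["id", "order_id", "amount", "market"]))  # ['amount', 'market']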

Tests & Profiler

Just laying down some thoughts here. The approach I have followed in the workflow so far is:

  • In the workflow we need to, at a minimum, define a Profiler with something like
    "name": "my_profiler",
    "table_metrics": ["row_number"],
    "metrics": ["min", "COUNT", "null_count"],
    Note that here we are passing the Metrics we want to execute at table and column level.
  • We run the profiler and store the metric values.
  • We then jump to the testing phase and perform simple validations: does my metric result match the expected one?

My question then is, how do we want to link each test to the required metrics? With the current expression language I just use the metric name, but we might need to take this into consideration when defining the tests. Just off the top of my head, I would:

  1. If we are defining the Tests, we should also define all the supported metrics in the JSON Schema
  2. The tests should then have an EntityRef to the metric(s) they require

This way, we can validate that we are going to run a test that has all the necessary ingredients. Moreover, when working on the UI implementation of the testing workflows, we could then make sure that we are defining the Profiler metrics based on what is necessary for the tests.
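For that validation step, a small helper could cross-check the Profiler's configured metrics against what the defined tests require (the metric and test names below are illustrative):

from typing import Dict, List, Set

def missing_metrics(profiler_metrics: List[str],
                    test_requirements: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    # For each test, report the required metrics the Profiler is not configured to compute,
    # so the workflow (or the UI) can flag the gap before running anything.
    configured = {m.lower() for m in profiler_metrics}
    return {
        test: {m for m in required if m.lower() not in configured}
        for test, required in test_requirements.items()
        if any(m.lower() not in configured for m in required)
    }

# Using the "my_profiler" configuration above:
gaps = missing_metrics(
    ["row_number", "min", "COUNT", "null_count"],
    {"columnValuesToBeUnique": ["count", "unique_count"]},
)
print(gaps)  # {'columnValuesToBeUnique': {'unique_count'}}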

I am sure that for 90% of the cases, just having a "default" Profiler with a set of the most common metrics would be enough, but I am thinking about what would happen for more involved tests with super-specific metrics.


Sorry for the long write-up; I brain-dumped my thoughts from working on the Python implementation 🤔 Happy to jump on a call to fine-tune the process & defs!

Thanks 🙏

[Attachment: test-schema.drawio diagram]

harshach marked this pull request as ready for review on February 24, 2022 at 06:34
@pmbrull (Collaborator) left a comment

LGTM 🚀 Excited to get this tested! We can iterate if needed, thanks!!


sonarcloud bot commented Feb 24, 2022

[open-metadata-ingestion] Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 3 Code Smells (rating A)
  • 0.0% Coverage
  • 0.0% Duplication

@harshach (Collaborator, Author)

Merging this in, as the failing test is not related to the code in this PR.

harshach merged commit 724989e into main on Feb 24, 2022
harshach deleted the issue-2873 branch on February 24, 2022 at 07:13