Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #2981 - Update Profile to match TableProfile #2982

Merged
merged 27 commits into from Feb 25, 2022

Conversation

pmbrull
Copy link
Collaborator

@pmbrull pmbrull commented Feb 24, 2022

Describe your changes :

This PR fixes #2981 and closes #2980

Reviewing the API definitions, we noticed that our current approach to computing profiling data was not properly synced with the TableProfile definition.

Instead of adding yet-another-step and mapping the results artificially, we have done a refactor on how the profiling data gets computed. We are now relying on a single Profiler object, which is able to compute both table and column metrics and itself handles all the columns in the table.

Its results can then be safely parsed into TableProfile without any issues.

In terms of the JSON Schema, we modified a bit the types of the metrics to better represent the results and added a couple of metrics that were not present and we are already computing. We also changed a few names of the metrics. ⚠️ This might have an impact on the UI as well @Sachin-chaurasiya, @darth-coder00 ⚠️ However, we still need to work on the UI anyway for this EPIC so we could maybe live with it for a while.

We also added a method to manually name the metrics, which will make it easier to map with TableProfile specs.

An interesting addition is the function add_props in metrics/core. As we now perform the metrics instantiation inside the Profiler, we are passing the class definition as an input parameter. However, there are metrics that require specific attributes such as bins or expression. To have the Profiler as generic as possible and not keep checking specific metrics for individualities, add_props dynamically prepares a new class definition overriding the __init__ method to add any required attribute we pass as **kwargs.

This means that we can run new_hist = add_props(bins=5)(Metrics.HISTOGRAM.value) and new_hist will be a completely independent and new class definition that replicates Metrics.HISTOGRAM.value but will assign bins=5 to all its instantiations.

Happy to discuss!

Thanks

Type of change :

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Frontend Preview (Screenshots) :

For frontend related change, please link screenshots of your changes preview! Optional for backend related changes.

Checklist:

  • I have read the CONTRIBUTING document.
  • I have performed a self-review of my own.
  • I have tagged my reviewers below.
  • I have commented on my code, particularly in hard-to-understand areas.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • All new and existing tests passed.

Reviewers

@github-actions
Copy link

Schema Change Detected. Needs ingestion-core version bump

Please run make core_bump_version_dev in the project's root and commit the changes to _version.py in this PR. Please ignore if this has been handled already.

"description": "No.of null value proportion in columns.",
"type": "number"
},
"missingPercentage": {
"missingRatio": {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these changes are an issue now, I can revert this and change the naming on the python side. We can tackle it later if needed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets not change it. UI can fix it.

@pmbrull
Copy link
Collaborator Author

pmbrull commented Feb 24, 2022

Ill revert the datatype changes and handle them separatedly as it is a handful

ingestion/setup.py Outdated Show resolved Hide resolved
@harshach
Copy link
Collaborator

@pmbrull overall LGTM. Left couple of comments. If we don't need numpy lets not include it

@harshach
Copy link
Collaborator

also we need to update the sample data

@sonarcloud
Copy link

sonarcloud bot commented Feb 25, 2022

[open-metadata-ingestion] Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

0.0% 0.0% Coverage
0.0% 0.0% Duplication

},
"max": {
"description": "Maximum value in a column.",
"type": ["number", "integer", "string"]
},
"minLength": {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @harshach, reverted all other changes and removed numpy dependency. The only change I'd like to stick here is: adding minLength and maxLength and updating the type for min and max. Looks like the tests are passing and this way I can compute it for numbers and strings

@sonarcloud
Copy link

sonarcloud bot commented Feb 25, 2022

[catalog] Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@harshach harshach merged commit 9906085 into open-metadata:main Feb 25, 2022
harshach added a commit that referenced this pull request Feb 26, 2022
* Fix  #2981 - Update Profile to match TableProfile (#2982)

* Fix #2991: Add ES support for glossary (#2993)

* Atlas Tag Propagation

* Fix #3004:  Add Atlas Tag Propagator

* Fix #3004:  Add Atlas Tag Propagator

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
akash-jain-10 pushed a commit that referenced this pull request Mar 1, 2022
* Fix  #2981 - Update Profile to match TableProfile (#2982)

* Fix #2991: Add ES support for glossary (#2993)

* Atlas Tag Propagation

* Fix #3004:  Add Atlas Tag Propagator

* Fix #3004:  Add Atlas Tag Propagator

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
akash-jain-10 pushed a commit that referenced this pull request Mar 1, 2022
* Fix  #2981 - Update Profile to match TableProfile (#2982)

* Fix #2991: Add ES support for glossary (#2993)

* Atlas Tag Propagation

* Fix #3004:  Add Atlas Tag Propagator

* Fix #3004:  Add Atlas Tag Propagator

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update Profile to match TableProfile ImportError - Numpy - Ingestion
2 participants