Fix #2894 - Profiler Processor & Metrics #2900

Merged
merged 24 commits on Feb 22, 2022

@pmbrull pmbrull commented Feb 21, 2022

Describe your changes :

This PR fixes #2894 and partially addresses #2878.

There are two main changes in this PR:

  • Instead of keeping all the profiling and validation logic in the Workflow, we have added a Processor to handle it, following a stricter workflow logic. We have also updated the Status to properly log the results.
  • We have implemented new metrics:
    • AVG - which also handles AVG_LENGTH for strings and text
    • DISTINCT_COUNT
    • DUPLICATE_COUNT
    • HISTOGRAM -> This is interesting because it helped us notice that we were missing a more complex type of metric, QueryMetric. We have also developed intermediate functions (CONCAT) that can be used when computing other metrics.
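To make the list above concrete, here is a minimal sketch of what each metric computes, written as plain SQL against an in-memory SQLite database. It is not the code in this PR: the `users` table, its data, the helper names, and the definition of duplicates as "non-null values minus distinct values" are all assumptions made for illustration. It does show why a histogram does not fit a single aggregate expression: it needs the column range first, then a grouped query.

```python
import sqlite3

# Hypothetical sample table; names and values are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("ann", 30), ("bob", 31), ("ann", 30), ("carla", 45), ("dee", 52)],
)


def scalar(sql: str):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# Simple metrics: one aggregate expression over one column.
avg_age = scalar("SELECT AVG(age) FROM users")
avg_name_length = scalar("SELECT AVG(LENGTH(name)) FROM users")
distinct_names = scalar("SELECT COUNT(DISTINCT name) FROM users")
# Assumption: duplicates are counted as non-null values minus distinct values.
duplicate_names = scalar("SELECT COUNT(name) - COUNT(DISTINCT name) FROM users")


def histogram(bins: int) -> list:
    """Equal-width histogram: fetch the column range, then group rows by bin."""
    low, high = conn.execute("SELECT MIN(age), MAX(age) FROM users").fetchone()
    width = (high - low) / bins or 1  # fall back to 1 for constant columns
    return conn.execute(
        """
        SELECT MIN(CAST((age - :low) / :width AS INTEGER), :bins - 1) AS bin,
               COUNT(*) AS frequency
        FROM users
        GROUP BY bin
        ORDER BY bin
        """,
        {"low": low, "width": width, "bins": bins},
    ).fetchall()


age_histogram = histogram(bins=5)
```

Each row of `age_histogram` is a `(bin index, frequency)` pair; empty bins are simply absent from the result.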

Type of change :

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Frontend Preview (Screenshots) :

For frontend related change, please link screenshots of your changes preview! Optional for backend related changes.

Checklist:

  • I have read the CONTRIBUTING document.
  • I have performed a self-review of my own code.
  • I have tagged my reviewers below.
  • I have commented on my code, particularly in hard-to-understand areas.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • All new and existing tests passed.

Reviewers

config_dict: dict,
metadata_config_dict: dict,
ctx: WorkflowContext,
**kwargs

@pmbrull pmbrull Feb 21, 2022

This allows us to pass more parameters to the processor, such as the session in the ORM Processor. It should be transparent for the existing Processors.
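A minimal sketch of the idea, assuming a `create` classmethod with the signature shown in the diff. The class names, the `ValueError`, and the use of a plain object as the session are hypothetical and only illustrate how `**kwargs` stays invisible to processors that do not need it:

```python
from typing import Any, Dict


class Processor:
    """Stand-in base class for this sketch; not the project's actual Processor."""

    @classmethod
    def create(
        cls,
        config_dict: Dict[str, Any],
        metadata_config_dict: Dict[str, Any],
        ctx: Any,
        **kwargs: Any,
    ) -> "Processor":
        raise NotImplementedError


class PlainProcessor(Processor):
    """A processor written before the change: it needs no extra parameters."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    @classmethod
    def create(cls, config_dict, metadata_config_dict, ctx, **kwargs):
        # Extra keyword arguments are accepted and ignored.
        return cls(config_dict)


class SessionProcessor(Processor):
    """A processor that needs a database session passed in by the workflow."""

    def __init__(self, config: Dict[str, Any], session: Any) -> None:
        self.config = config
        self.session = session

    @classmethod
    def create(cls, config_dict, metadata_config_dict, ctx, **kwargs):
        session = kwargs.get("session")
        if session is None:
            raise ValueError("SessionProcessor requires a 'session' keyword argument")
        return cls(config_dict, session)


# The caller can use the same call shape for both processors.
fake_session = object()  # placeholder for a real ORM session
plain = PlainProcessor.create({"name": "plain"}, {}, None, session=fake_session)
with_session = SessionProcessor.create({"name": "orm"}, {}, None, session=fake_session)
```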

self.session.add_all(data)
self.session.commit()

hist = Metrics.HISTOGRAM(TestHist.num, bins=5)

@pmbrull

We can now pass kwargs when creating metrics, which makes internal functionality more flexible. This still needs to be updated in the Processor.
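One way such a call could be wired up, as a rough sketch only. `Metric`, `MetricFactory`, the default of 10 bins, and passing the column as a string are all hypothetical; the actual `Metrics` object in this PR may be built quite differently:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Metric:
    """Hypothetical metric description: a name, a column, and free-form options."""

    name: str
    column: str
    options: Dict[str, Any] = field(default_factory=dict)


class MetricFactory:
    """Callable that builds a Metric, merging defaults with caller kwargs."""

    def __init__(self, name: str, defaults: Optional[Dict[str, Any]] = None) -> None:
        self.name = name
        self.defaults = defaults or {}

    def __call__(self, column: str, **kwargs: Any) -> Metric:
        # Caller-supplied kwargs override the factory defaults.
        return Metric(self.name, column, {**self.defaults, **kwargs})


class Metrics:
    """Namespace of available metrics; the default bin count is an assumption."""

    AVG = MetricFactory("avg")
    HISTOGRAM = MetricFactory("histogram", defaults={"bins": 10})


hist = Metrics.HISTOGRAM("num", bins=5)
avg = Metrics.AVG("num")
```

Metrics that take no options keep the same call shape, so existing callers would not need to change.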

return issubclass(_type.__class__, Numeric)


def is_quantifiable(_type) -> bool:

@pmbrull

Helps handle which types we support for each metric.
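A small sketch of how such helpers can gate metrics by column type. The type classes below are stand-ins rather than real ORM types, and both the meaning of "quantifiable" (numeric or integer) and the metric-to-check mapping are assumptions for illustration:

```python
# Stand-in column types, hypothetical; the real helpers inspect ORM column types.
class ColumnType:
    pass


class Numeric(ColumnType):
    pass


class Integer(ColumnType):
    pass


class String(ColumnType):
    pass


def is_numeric(_type) -> bool:
    return issubclass(_type.__class__, Numeric)


def is_integer(_type) -> bool:
    return issubclass(_type.__class__, Integer)


def is_quantifiable(_type) -> bool:
    # Assumption: a type is quantifiable when it is numeric or integer.
    return is_numeric(_type) or is_integer(_type)


def any_type(_type) -> bool:
    return True


# Hypothetical mapping from metric name to the check a column type must pass.
SUPPORTED_TYPES = {
    "histogram": is_quantifiable,
    "distinct_count": any_type,
}


def supports(metric_name: str, _type) -> bool:
    """Return True when the metric can be computed for the given column type."""
    return SUPPORTED_TYPES[metric_name](_type)
```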

@pmbrull pmbrull commented Feb 21, 2022

[open-metadata-ingestion] Kudos, SonarCloud Quality Gate passed!

Bug C 1 Bug
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

0.0% Coverage
0.0% Duplication

Sonar is complaining about the join condition and_(1 == 1). I could not figure out any other way of replicating this.

EDIT: ok fixed this, just needed a fresher mind 😵‍💫

@sonarcloud sonarcloud bot commented Feb 21, 2022

[open-metadata-ingestion] Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

0.0% Coverage
0.0% Duplication

:param _from: From where do we load the sink class. Ingestion by default.
"""
processor_class = get_class(
"metadata.{}.processor.{}.{}Processor".format(

Collaborator

We need to make this more flexible so that the class can be loaded without depending on the packaging convention.
Not important right now, but it can be annoying for users who write their own sources and processors.

@pmbrull

That's a good point. Let's tackle this together with #2848. I'll add a comment there
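For reference, one possible direction for the point raised in this thread is to accept a fully qualified dotted path and resolve it with `importlib`, instead of formatting the module path from a naming convention. This is only a sketch: `load_class` is a hypothetical helper, not the project's `get_class`, and whether it fits #2848 is left open.

```python
import importlib


def load_class(dotted_path: str) -> type:
    """Load a class from a path such as 'package.module.ClassName'."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")

    module = importlib.import_module(module_name)
    try:
        obj = getattr(module, class_name)
    except AttributeError as err:
        raise ImportError(
            f"Module {module_name!r} has no attribute {class_name!r}"
        ) from err

    if not isinstance(obj, type):
        raise TypeError(f"{dotted_path!r} does not refer to a class")
    return obj


# Any importable class works, wherever it lives; a stdlib class is used here.
ordered_dict_cls = load_class("collections.OrderedDict")
```

With a loader like this, a user-defined processor would only need to be importable, whatever its package layout.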

@pmbrull pmbrull merged commit 1224d20 into open-metadata:main Feb 22, 2022
Successfully merging this pull request may close these issues.

Profiler processor and status
2 participants