-
Notifications
You must be signed in to change notification settings - Fork 881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #10013: Implement first stage of NoSQL profiler #15189
Conversation
1. Implemented the NoSQLProfilerInterface as an entrypoint for the nosql profiler. 2. Added the NoSQLMetric as an abstract class. 3. Implemented the interface for the MongoDB database source. 4. Implemented an e2e test using testcontainers.
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
@@ -311,6 +311,7 @@ | |||
VERSIONS["snowflake"], | |||
VERSIONS["elasticsearch8"], | |||
VERSIONS["giturlparse"], | |||
"testcontainers==3.7.1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great idea. Did not know we had that for Python too. Would be worth a small discussion with the team to see where else we can use it 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absolutely. moving forward we can also use it to setup the OM instance in the tests instead of relying on whatever's running on our local machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you set it up globally, (i.e. you only create it once at the beginning of the test session?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its possible with some hacking. Best practice is to abstract the setup to a function and start it with with every test suite (class) using setUpClass so that state is not corrupted between tests. The image can be cached locally so it should be low overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, It might add a bit of overhead if we need to start the server for every tests. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in backend, we start the containers for each Test Class
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
anybody know why ingestion/src/metadata/great_expectations/action.py:110:4: W0237: Parameter 'expectation_suite_identifier' has been renamed to 'payload' in overriding 'OpenMetadataValidationAction._run' method (arguments-renamed)
ingestion/src/metadata/great_expectations/action.py:110:4: W0237: Parameter 'checkpoint_identifier' has been renamed to 'expectation_suite_identifier' in overriding 'OpenMetadataValidationAction._run' method (arguments-renamed) |
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
You should check the parent method signature (it looks like some parameters do not use the same name in the method of the children class. Weird though, as you have not touched that. |
f"{traceback.format_exc()}\n" | ||
f"Error trying to compute metric {metric} for {self.table.fullyQualifiedName}: {exc}" | ||
) | ||
raise RuntimeError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not raise in the loop. The idea is that if we fail to compute 1 metric then we should log the error but continue to compute the rest of the metrics so we don't stop the pipeline mid way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the PandasProfilerInterface
, it appears that a single error also breaks the loop:
OpenMetadata/ingestion/src/metadata/profiler/interface/pandas/profiler_interface.py
Lines 144 to 153 in 9a4a9df
try: | |
row_dict = {} | |
df_list = [df.where(pd.notnull(df), None) for df in runner] | |
for metric in metrics: | |
row_dict[metric.name()] = metric().df_fn(df_list) | |
return row_dict | |
except Exception as exc: | |
logger.debug(traceback.format_exc()) | |
logger.warning(f"Error trying to compute profile for {exc}") | |
raise RuntimeError(exc) |
Should the errors be registered anywhere (other than log)? Should an exception be thrown at the end of the loop?
ingestion/src/metadata/profiler/interface/nosql/profiler_interface.py
Outdated
Show resolved
Hide resolved
ingestion/src/metadata/profiler/interface/nosql/profiler_interface.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sushi30 thanks for the PR. Letf a few comments
The files which are being reported were not changed in this delivery (neither |
It appears |
- removed unused inheritance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I need to add any migration after changing this schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you will need to add that key to the connection. You can either handle it here or do it in a separate PR -- up to you.
Awesome. Thank you |
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
Return the function to be used for NoSQL clients to calculate the metric. | ||
By default, returns a "do nothing" function that returns None. | ||
""" | ||
return lambda table: None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to return a function and not just None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method is returning a Callable
:
If it returns a bare None
it will need to handle a branch to to avoid TypeError: 'NoneType' object is not callable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sushi30 LGTM. I just left a comment, you can run make py_format_check
to check where the style is failing Once this passes we can merge it in 😊
@TeddyCr thanks. Should I take any action on this discussion or are we good? |
Quality Gate passed for 'open-metadata-ui'Issues Measures |
Quality Gate passed for 'open-metadata-ingestion'Issues Measures |
…filer (open-metadata#15189)" This reverts commit 18c22c4.
* Revert "add migration for MongoDB supportsProfiler = true (#15254)" This reverts commit ec3eb29. * Revert "MINOR: Mongodb column profile (#15252)" This reverts commit 50b2709. * Revert "MINOR: modified nosql factory to not use pymongo (#15316)" This reverts commit bdf2745. * Revert "MINOR: add MongoDB sample data (#15237)" This reverts commit ff2ecc5. * Revert "MINOR: add test for sqla compiler (#15275)" This reverts commit 4967e09. * Revert "Fixes #10013: Implement first stage of NoSQL profiler (#15189)" This reverts commit 18c22c4. * chore: added tests back after revert
Describe your changes:
Fixes #10013
Type of change:
Checklist:
Fixes <issue-number>: <short explanation>