Fixes #10013: Implement first stage of NoSQL profiler #15189

Merged on Feb 22, 2024 (11 commits)

Contributor

@sushi30 sushi30 commented Feb 15, 2024

Describe your changes:

Fixes #10013

  1. Implemented the NoSQLProfilerInterface as an entrypoint for the NoSQL profiler.
  2. Added the NoSQLMetric as an abstract class.
  3. Implemented the interface for the MongoDB database source.
  4. Implemented an e2e test using testcontainers.

Type of change:

  • Improvement

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • I have added tests around the new logic.
  • For connector/ingestion changes: I updated the documentation.

github-actions bot added the Ingestion, backend, and safe to test labels on Feb 15, 2024

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@@ -311,6 +311,7 @@
VERSIONS["snowflake"],
VERSIONS["elasticsearch8"],
VERSIONS["giturlparse"],
"testcontainers==3.7.1",
Collaborator

I think this is a great idea. Did not know we had that for Python too. Would be worth a small discussion with the team to see where else we can use it 🚀

Contributor Author

Absolutely. Moving forward, we can also use it to set up the OM instance in the tests instead of relying on whatever's running on our local machine.

Contributor

Can you set it up globally (i.e. you only create it once at the beginning of the test session)?

Contributor Author

It's possible with some hacking. Best practice is to abstract the setup into a function and start it with every test suite (class) using setUpClass, so that state is not corrupted between tests. The image can be cached locally, so it should be low overhead.

Contributor

I see. It might add a bit of overhead if we need to start the server for every test. 🤔

Collaborator

In the backend, we start the containers for each test class.
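The per-class setup discussed in this thread can be sketched with plain unittest. This is only an illustration: FakeContainer is a hypothetical stand-in for a real containerized service (in the PR that role would be played by a testcontainers container, which needs Docker), and ProfilerSuite and run_suite are made-up names.

```python
import unittest


class FakeContainer:
    """Hypothetical stand-in for a containerized service used by the tests."""

    starts = 0  # how many times any container has been started

    def start(self):
        FakeContainer.starts += 1
        self.running = True
        return self

    def stop(self):
        self.running = False


class ProfilerSuite(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class, not once per test method.
        cls.container = FakeContainer().start()

    @classmethod
    def tearDownClass(cls):
        # Runs once after the last test in the class has finished.
        cls.container.stop()

    def test_first(self):
        self.assertTrue(self.container.running)

    def test_second(self):
        self.assertTrue(self.container.running)


def run_suite():
    """Load and run the suite programmatically, returning the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ProfilerSuite)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Both tests share one started container, so the startup cost is paid per class rather than per test, while a different test class would still get a fresh instance.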

Jest test coverage

UI tests summary

Lines: 61%
Statements: 61.6% (31443/51047)
Branches: 40.59% (12542/30898)
Functions: 40.91% (3862/9441)

Contributor Author

sushi30 commented Feb 15, 2024

Does anybody know why py_format_check is failing for files I have not changed?

ingestion/src/metadata/great_expectations/action.py:110:4: W0237: Parameter 'expectation_suite_identifier' has been renamed to 'payload' in overriding 'OpenMetadataValidationAction._run' method (arguments-renamed)
ingestion/src/metadata/great_expectations/action.py:110:4: W0237: Parameter 'checkpoint_identifier' has been renamed to 'expectation_suite_identifier' in overriding 'OpenMetadataValidationAction._run' method (arguments-renamed)

Contributor

TeddyCr commented Feb 15, 2024

OpenMetadataValidationAction

You should check the parent method signature (it looks like some parameters do not use the same name in the method of the child class). Weird though, as you have not touched that.

f"{traceback.format_exc()}\n"
f"Error trying to compute metric {metric} for {self.table.fullyQualifiedName}: {exc}"
)
raise RuntimeError(
Contributor

We should not raise in the loop. The idea is that if we fail to compute one metric, we should log the error but continue computing the rest of the metrics, so we don't stop the pipeline midway.

Contributor Author

sushi30 commented Feb 15, 2024

For the PandasProfilerInterface, it appears that a single error also breaks the loop:

    try:
        row_dict = {}
        df_list = [df.where(pd.notnull(df), None) for df in runner]
        for metric in metrics:
            row_dict[metric.name()] = metric().df_fn(df_list)
        return row_dict
    except Exception as exc:
        logger.debug(traceback.format_exc())
        logger.warning(f"Error trying to compute profile for {exc}")
        raise RuntimeError(exc)

Should the errors be registered anywhere (other than log)? Should an exception be thrown at the end of the loop?
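One possible shape for the log-and-continue behaviour requested in this thread is sketched below. It is not the code that was merged: compute_metrics and its arguments are hypothetical, and returning the list of failed metric names is just one assumed answer to the question of where errors could be registered besides the log.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)


def compute_metrics(
    metrics: Dict[str, Callable[[Any], Any]], table: Any
) -> Tuple[Dict[str, Any], List[str]]:
    """Compute every metric, logging failures instead of raising mid-loop."""
    results: Dict[str, Any] = {}
    failed: List[str] = []
    for name, fn in metrics.items():
        try:
            results[name] = fn(table)
        except Exception as exc:  # broad on purpose: one bad metric must not stop the rest
            logger.debug("Traceback for metric %s", name, exc_info=True)
            logger.warning("Error trying to compute metric %s: %s", name, exc)
            failed.append(name)
    return results, failed
```

The caller can then decide after the loop whether a non-empty failure list should be reported, ignored, or turned into a single exception.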

Contributor

TeddyCr left a comment

@sushi30 thanks for the PR. Left a few comments.

Contributor Author

sushi30 commented Feb 15, 2024

OpenMetadataValidationAction

You should check the parent method signature (it looks like some parameters do not use the same name in the method of the child class).

@TeddyCr

The files being reported were not changed in this delivery (nor was the OpenMetadataValidationAction class). I don't mind fixing the linting issue, but it is a bit strange that it is getting reported here. I also don't see any breaking change in great-expectations.

Contributor Author

sushi30 commented Feb 15, 2024

re: #15189 (comment)

It appears gx have the same issue with their linter. I will add an ignore comment like they did...

https://github.com/great-expectations/great_expectations/blame/0.18.8/great_expectations/checkpoint/actions.py#L374C16-L374C83
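For readers unfamiliar with pylint's W0237 (arguments-renamed), the sketch below reproduces the kind of override that triggers it and the inline ignore comment mentioned above. BaseAction and RenamedAction are hypothetical classes; their parameter names echo the lint output but are not the real great-expectations or OpenMetadata signatures.

```python
class BaseAction:
    """Hypothetical parent class defining the original parameter names."""

    def _run(self, checkpoint_identifier, expectation_suite_identifier):
        return (checkpoint_identifier, expectation_suite_identifier)


class RenamedAction(BaseAction):
    """Hypothetical child class that renames the positional parameters."""

    # Without the trailing comment, pylint reports W0237 (arguments-renamed)
    # because the names differ from the parent's at the same positions.
    def _run(self, expectation_suite_identifier, payload):  # pylint: disable=arguments-renamed
        return (expectation_suite_identifier, payload)
```

Positional calls keep working after the rename, but a caller that passes the parent's names as keywords gets a TypeError, which is the risk the warning points at.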

- removed unused inheritance
Contributor Author

Do I need to add any migration after changing this schema?

Contributor

Yes, you will need to add that key to the connection. You can either handle it here or do it in a separate PR -- up to you.

Contributor

TeddyCr commented Feb 15, 2024

re: #15189 (comment)

It appears gx have the same issue with their linter. I will add an ignore comment like they did...

https://github.com/great-expectations/great_expectations/blame/0.18.8/great_expectations/checkpoint/actions.py#L374C16-L374C83

Awesome. Thank you

sushi30 self-assigned this on Feb 15, 2024

Return the function to be used for NoSQL clients to calculate the metric.
By default, returns a "do nothing" function that returns None.
"""
return lambda table: None
Contributor

Why do we need to return a function and not just None?

Contributor Author

The method is returning a Callable:

https://github.com/open-metadata/OpenMetadata/pull/15189/files/74c2abd3de021f11d8b1630399605288260a007d..1733b9af92e92f317bd518855337da94f6699b76#diff-e439fc0df5e00e522353c3ee002acdf38e1320efcd148877a5ed01accee72279R162

If it returns a bare None, the caller will need an extra branch to avoid TypeError: 'NoneType' object is not callable.
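The trade-off in this thread can be shown with a small sketch. All names here (Metric, RowCount, metric_fn, run_all) are hypothetical and only mirror the idea of a base class whose default is a "do nothing" callable; the actual method in the PR may be named and typed differently.

```python
from typing import Any, Callable, Dict, List, Optional


class Metric:
    """Hypothetical base metric whose default callable does nothing."""

    def metric_fn(self) -> Callable[[List[dict]], Optional[Any]]:
        # Default: a callable that ignores the table and returns None.
        return lambda table: None


class RowCount(Metric):
    """Hypothetical metric that overrides the default with real logic."""

    def metric_fn(self) -> Callable[[List[dict]], Optional[Any]]:
        return lambda table: len(table)


def run_all(metrics: List[Metric], table: List[dict]) -> Dict[str, Any]:
    # Every metric is invoked the same way; no "is None" branch is needed.
    return {type(m).__name__: m.metric_fn()(table) for m in metrics}
```

Because the default is itself callable, the caller treats implemented and unimplemented metrics uniformly, and an unimplemented metric simply yields None.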

Contributor

TeddyCr left a comment

Thanks @sushi30, LGTM. I just left a comment. You can run make py_format_check to see where the style check is failing. Once this passes, we can merge it in 😊

Contributor Author

sushi30 commented Feb 19, 2024

@TeddyCr thanks. Should I take any action on this discussion or are we good?

sonarcloud bot commented Feb 19, 2024

Quality Gate passed for 'open-metadata-ui'

Issues
0 New issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

sonarcloud bot commented Feb 19, 2024

Quality Gate passed for 'open-metadata-ingestion'

Issues
2 New issues

Measures
0 Security Hotspots
84.9% Coverage on New Code
2.7% Duplication on New Code

See analysis details on SonarCloud

sushi30 merged commit 18c22c4 into main on Feb 22, 2024
21 checks passed
sushi30 deleted the nosql-row-count branch on February 22, 2024 10:46
TeddyCr added a commit to TeddyCr/OpenMetadata that referenced this pull request Feb 28, 2024
TeddyCr added a commit that referenced this pull request Feb 28, 2024
* Revert "add migration for MongoDB supportsProfiler = true (#15254)"

This reverts commit ec3eb29.

* Revert "MINOR: Mongodb column profile (#15252)"

This reverts commit 50b2709.

* Revert "MINOR: modified nosql factory to not use pymongo (#15316)"

This reverts commit bdf2745.

* Revert "MINOR: add MongoDB sample data (#15237)"

This reverts commit ff2ecc5.

* Revert "MINOR: add test for sqla compiler (#15275)"

This reverts commit 4967e09.

* Revert "Fixes #10013: Implement first stage of NoSQL profiler (#15189)"

This reverts commit 18c22c4.

* chore: added tests back after revert
Labels: backend, Ingestion, safe to test
Projects: None yet
Development: Successfully merging this pull request may close the issue "Add Profiler Support for NoSQL DB (DynamoDB, Mongo, etc.)".
6 participants