Fixes #10013: Implement first stage of NoSQL profiler #15189

sushi30 · 2024-02-15T10:42:28Z

Describe your changes:

Fixes #10013

Implemented the NoSQLProfilerInterface as an entrypoint for the nosql profiler.
Added the NoSQLMetric as an abstract class.
Implemented the interface for the MongoDB database source.
Implemented an e2e test using testcontainers.

Type of change:

Improvement

Checklist:

I have read the CONTRIBUTING document.
My PR title is Fixes <issue-number>: <short explanation>
I have commented on my code, particularly in hard-to-understand areas.
For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

I have added tests around the new logic.
For connector/ingestion changes: I updated the documentation.

github-actions · 2024-02-15T10:48:54Z

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

ingestion/tests/integration/profiler/test_nosql_profiler.py

pmbrull · 2024-02-15T10:55:24Z

ingestion/setup.py

@@ -311,6 +311,7 @@
    VERSIONS["snowflake"],
    VERSIONS["elasticsearch8"],
    VERSIONS["giturlparse"],
+    "testcontainers==3.7.1",


I think this is a great idea. Did not know we had that for Python too. Would be worth a small discussion with the team to see where else we can use it 🚀

Absolutely. Moving forward, we can also use it to set up the OM instance in the tests instead of relying on whatever is running on our local machine.

Can you set it up globally (i.e. create it only once at the beginning of the test session)?

It's possible with some hacking. Best practice is to abstract the setup into a function and start it with every test suite (class) using setUpClass, so that state is not corrupted between tests. The image can be cached locally, so the overhead should be low.

I see. It might add a bit of overhead if we need to start the server for every test. 🤔

In the backend, we start the containers for each test class.
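
The per-class setup discussed above can be sketched roughly as follows. This is only an illustration of the setUpClass pattern: FakeContainer is a hypothetical stand-in for a real testcontainers container so that the snippet runs without Docker, and the class and function names are assumptions, not code from this PR.

```python
import unittest


class FakeContainer:
    """Hypothetical stand-in for a testcontainers container (no Docker needed)."""

    starts = 0  # counts how many times any instance was started

    def start(self):
        FakeContainer.starts += 1
        self.running = True
        return self

    def stop(self):
        self.running = False


class ContainerTestCase(unittest.TestCase):
    """Starts one container per test class, not per test method."""

    @classmethod
    def setUpClass(cls):
        cls.container = FakeContainer().start()

    @classmethod
    def tearDownClass(cls):
        cls.container.stop()


class RowCountTest(ContainerTestCase):
    def test_first(self):
        self.assertTrue(self.container.running)

    def test_second(self):
        self.assertTrue(self.container.running)


def run_suite():
    """Run RowCountTest and return (tests_run, container_starts)."""
    FakeContainer.starts = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RowCountTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.testsRun, FakeContainer.starts
```

With this layout, two test methods share a single container start, which is the trade-off between isolation and overhead raised in the thread.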

github-actions · 2024-02-15T11:23:54Z

Jest test Coverage

UI tests summary

Lines	Statements	Branches	Functions
	61.6% (31443/51047)	40.59% (12542/30898)	40.91% (3862/9441)

sushi30 · 2024-02-15T13:03:30Z

Does anybody know why py_format_check is failing for files I have not changed?

ingestion/src/metadata/great_expectations/action.py:110:4: W0237: Parameter 'expectation_suite_identifier' has been renamed to 'payload' in overriding 'OpenMetadataValidationAction._run' method (arguments-renamed)
ingestion/src/metadata/great_expectations/action.py:110:4: W0237: Parameter 'checkpoint_identifier' has been renamed to 'expectation_suite_identifier' in overriding 'OpenMetadataValidationAction._run' method (arguments-renamed)

TeddyCr · 2024-02-15T13:43:23Z

OpenMetadataValidationAction

You should check the parent method signature (it looks like some parameters do not use the same name in the child class's method). Weird though, as you have not touched that.
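
For reference, here is a minimal sketch of the situation that pylint's W0237 (arguments-renamed) describes. The Parent and Child classes and their simplified signatures are hypothetical; they only reuse the parameter names from the lint output above and are not the real great_expectations or OpenMetadata signatures.

```python
import inspect


class Parent:
    def _run(self, validation_result, expectation_suite_identifier,
             checkpoint_identifier=None):
        return expectation_suite_identifier


class Child(Parent):
    # Same positions, different names: pylint reports W0237 for this override.
    def _run(self, validation_result, payload,
             expectation_suite_identifier=None):
        return payload


def renamed_parameters(parent_method, child_method):
    """Return (parent_name, child_name) pairs that differ at the same position."""
    parent_names = list(inspect.signature(parent_method).parameters)
    child_names = list(inspect.signature(child_method).parameters)
    return [(p, c) for p, c in zip(parent_names, child_names) if p != c]
```

Calling renamed_parameters(Parent._run, Child._run) lists the two positional mismatches, which is why keyword calls written against the parent signature can bind to the wrong parameter in the child.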

TeddyCr · 2024-02-15T13:49:32Z

ingestion/src/metadata/profiler/interface/nosql/profiler_interface.py

+                    f"{traceback.format_exc()}\n"
+                    f"Error trying to compute metric {metric} for {self.table.fullyQualifiedName}: {exc}"
+                )
+                raise RuntimeError(


We should not raise inside the loop. If we fail to compute one metric, we should log the error and continue computing the rest of the metrics, so that we don't stop the pipeline midway.

For the PandasProfilerInterface, it appears that a single error also breaks the loop:

OpenMetadata/ingestion/src/metadata/profiler/interface/pandas/profiler_interface.py

Lines 144 to 153 in 9a4a9df

try:
    row_dict = {}
    df_list = [df.where(pd.notnull(df), None) for df in runner]
    for metric in metrics:
        row_dict[metric.name()] = metric().df_fn(df_list)
    return row_dict
except Exception as exc:
    logger.debug(traceback.format_exc())
    logger.warning(f"Error trying to compute profile for {exc}")
    raise RuntimeError(exc)

Should the errors be registered anywhere (other than the log)? Should an exception be thrown at the end of the loop?
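
One possible shape for the "log and continue" behaviour requested above, as a minimal sketch. The compute_metrics function and its dict-of-callables input are assumptions made for illustration; this is not the interface implemented in the PR. Failed metric names are returned alongside the results so the caller can decide whether to report or raise after the loop.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)


def compute_metrics(
    metrics: Dict[str, Callable[[], Any]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Compute every metric; log and record failures instead of raising."""
    results: Dict[str, Any] = {}
    failed: List[str] = []
    for name, fn in metrics.items():
        try:
            results[name] = fn()
        except Exception as exc:  # one bad metric must not stop the rest
            logger.warning("Error trying to compute metric %s: %s", name, exc)
            failed.append(name)
    return results, failed
```

Returning the failures rather than raising leaves the policy question (ignore, surface in a status report, or raise at the end) to the caller.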

ingestion/src/metadata/profiler/interface/nosql/profiler_interface.py

ingestion/src/metadata/profiler/metrics/nosql_metric.py

ingestion/src/metadata/profiler/metrics/static/row_count.py

ingestion/tests/integration/profiler/test_nosql_profiler.py

ingestion/src/metadata/profiler/interface/nosql/profiler_interface.py

TeddyCr

@sushi30 thanks for the PR. Left a few comments.

sushi30 · 2024-02-15T14:04:43Z

OpenMetadataValidationAction
You should check the parent method signature (it looks like some parameters do not use the same name in the child class's method).

@TeddyCr

The files being reported were not changed in this delivery (nor was the OpenMetadataValidationAction class). I don't mind fixing the linting issue, but it is a bit strange that it is getting reported here. I also don't see any breaking change in great-expectations.

sushi30 · 2024-02-15T14:29:10Z

re: #15189 (comment)

It appears gx have the same issue with their linter. I will add an ignore comment like they did...

https://github.com/great-expectations/great_expectations/blame/0.18.8/great_expectations/checkpoint/actions.py#L374C16-L374C83

- removed unused inheritance

sushi30 · 2024-02-15T14:36:58Z

...c/src/main/resources/json/schema/entity/services/connections/database/mongoDBConnection.json

Do I need to add any migration after changing this schema?

Yes, you will need to add that key to the connection. You can either handle it here or do it in a separate PR -- up to you.

TeddyCr · 2024-02-15T14:37:40Z

re: #15189 (comment)

It appears gx have the same issue with their linter. I will add an ignore comment like they did...

https://github.com/great-expectations/great_expectations/blame/0.18.8/great_expectations/checkpoint/actions.py#L374C16-L374C83

Awesome. Thank you

TeddyCr · 2024-02-16T16:04:04Z

ingestion/src/metadata/profiler/metrics/core.py

+        Return the function to be used for NoSQL clients to calculate the metric.
+        By default, returns a "do nothing" function that returns None.
+        """
+        return lambda table: None


Why do we need to return a function and not just None?

The method is returning a Callable:

https://github.com/open-metadata/OpenMetadata/pull/15189/files/74c2abd3de021f11d8b1630399605288260a007d..1733b9af92e92f317bd518855337da94f6699b76#diff-e439fc0df5e00e522353c3ee002acdf38e1320efcd148877a5ed01accee72279R162

If it returned a bare None, the caller would need an extra branch to avoid TypeError: 'NoneType' object is not callable.
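
A small sketch of why a callable default avoids that branch. FakeClient, the nosql_fn method name and the profile helper are hypothetical and used only for illustration; only the "return lambda table: None" default comes from the diff above.

```python
from typing import Callable, Dict, List, Optional


class FakeClient:
    """Hypothetical in-memory client; maps collection name to documents."""

    def __init__(self, collections: Dict[str, List[dict]]):
        self.collections = collections


class Metric:
    def nosql_fn(self, client: FakeClient) -> Callable[[str], Optional[int]]:
        # Default "do nothing" function: always callable, always returns None.
        return lambda table: None


class RowCount(Metric):
    def nosql_fn(self, client: FakeClient) -> Callable[[str], Optional[int]]:
        return lambda table: len(client.collections[table])


def profile(
    metrics: List[Metric], client: FakeClient, table: str
) -> List[Optional[int]]:
    # No "is None" check needed: every metric hands back a callable.
    return [metric.nosql_fn(client)(table) for metric in metrics]
```

A metric without NoSQL support simply yields None as its value, and the calling loop stays uniform.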

TeddyCr

Thanks @sushi30, LGTM. I just left a comment. You can run make py_format_check to check where the style is failing. Once this passes we can merge it in 😊

sushi30 · 2024-02-19T07:46:55Z

@TeddyCr thanks. Should I take any action on this discussion or are we good?

sonarcloud · 2024-02-19T08:04:16Z

Quality Gate passed for 'open-metadata-ui'

Issues
0 New issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

sonarcloud · 2024-02-19T08:07:40Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
2 New issues

Measures
0 Security Hotspots
84.9% Coverage on New Code
2.7% Duplication on New Code

See analysis details on SonarCloud

…filer (open-metadata#15189)" This reverts commit 18c22c4.

* Revert "add migration for MongoDB supportsProfiler = true (#15254)" This reverts commit ec3eb29.
* Revert "MINOR: Mongodb column profile (#15252)" This reverts commit 50b2709.
* Revert "MINOR: modified nosql factory to not use pymongo (#15316)" This reverts commit bdf2745.
* Revert "MINOR: add MongoDB sample data (#15237)" This reverts commit ff2ecc5.
* Revert "MINOR: add test for sqla compiler (#15275)" This reverts commit 4967e09.
* Revert "Fixes #10013: Implement first stage of NoSQL profiler (#15189)" This reverts commit 18c22c4.
* chore: added tests back after revert

sushi30 added 2 commits February 15, 2024 11:37

feat(nosql-profiler): row count

ee63fc9

1. Implemented the NoSQLProfilerInterface as an entrypoint for the nosql profiler. 2. Added the NoSQLMetric as an abstract class. 3. Implemented the interface for the MongoDB database source. 4. Implemented an e2e test using testcontainers.

added profiler support for mongodb connection

5982786

sushi30 requested review from akash-jain-10, harshach and a team as code owners February 15, 2024 10:42

github-actions bot added the Ingestion, backend, and safe to test labels Feb 15, 2024

pmbrull reviewed Feb 15, 2024

doc

9357355

sushi30 requested review from karanh37, chirag-madlani and Sachin-chaurasiya as code owners February 15, 2024 11:04

Merge remote-tracking branch 'origin/main' into nosql-row-count

91e103e

use int_admin_ometa in test setup

74c2abd