feat(profiler): enable complex data type profiling (#15627)#27529
feat(profiler): enable complex data type profiling (#15627)#27529david-mamani wants to merge 3 commits intoopen-metadata:mainfrom
Conversation
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
Split NOT_COMPUTE into NOT_COMPUTE (truly unprofileable: NullType, UndeterminedType) and COMPLEX_TYPES (JSON, arrays, geo, structs, etc.) that receive a restricted safe subset of metrics (nullCount, valuesCount). Changes: - registry.py: new COMPLEX_TYPES set, COMPLEX_TYPE_METRICS, is_complex_type() - core.py: _prepare_column_metrics() now routes complex columns to limited metrics - unique_count.py: guard extended to exclude COMPLEX_TYPES - Added 48 unit tests validating the registry refactoring
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
047143e to
fdd2841
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
Enables profiler support for complex column data types by allowing a limited, “safe” subset of metrics (nullCount, valuesCount) to be computed for types that were previously fully excluded.
Changes:
- Split the previous “do not compute” type bucket into truly-unprofileable types vs. complex types eligible for limited metrics.
- Updated profiler processor column-metric preparation to route complex columns to
COMPLEX_TYPE_METRICS. - Updated
UniqueCountto skip complex types.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| ingestion/src/metadata/profiler/orm/registry.py | Introduces COMPLEX_TYPES, COMPLEX_TYPE_METRICS, and is_complex_type(); narrows NOT_COMPUTE. |
| ingestion/src/metadata/profiler/processor/core.py | Routes complex columns to limited static metrics and skips composed/hybrid only for fully excluded types. |
| ingestion/src/metadata/profiler/metrics/static/unique_count.py | Prevents uniqueCount computation on complex types. |
| ingestion/tests/unit/observability/profiler/test_complex_type_profiling.py | Adds unit coverage for the registry split and safe metrics set. |
| ingestion/tests/unit/observability/profiler/run_complex_type_tests.py | Adds a standalone runner for manual/local checks. |
| class DataType(str, Enum): | ||
| INT="INT"; BIGINT="BIGINT"; SMALLINT="SMALLINT"; TINYINT="TINYINT" | ||
| NUMBER="NUMBER"; NUMERIC="NUMERIC"; DECIMAL="DECIMAL" | ||
| DOUBLE="DOUBLE"; FLOAT="FLOAT"; JSON="JSON"; ARRAY="ARRAY" | ||
| MAP="MAP"; STRUCT="STRUCT"; UNION="UNION"; SET="SET" | ||
| GEOGRAPHY="GEOGRAPHY"; GEOMETRY="GEOMETRY"; ENUM="ENUM" | ||
| STRING="STRING"; TEXT="TEXT"; CHAR="CHAR"; VARCHAR="VARCHAR" | ||
| BOOLEAN="BOOLEAN"; DATE="DATE"; DATETIME="DATETIME" | ||
| TIMESTAMP="TIMESTAMP"; TIME="TIME"; BINARY="BINARY" | ||
| VARBINARY="VARBINARY"; BLOB="BLOB"; BYTEA="BYTEA" | ||
| MEDIUMTEXT="MEDIUMTEXT"; NULL="NULL"; SUPER="SUPER" | ||
| INTERVAL="INTERVAL"; XML="XML"; FIXED="FIXED" | ||
| LONG="LONG"; BYTES="BYTES" | ||
|
|
||
| class MetricType(str, Enum): | ||
| valuesCount="valuesCount"; nullCount="nullCount" | ||
| nullProportion="nullProportion"; uniqueCount="uniqueCount" | ||
| distinctCount="distinctCount"; distinctProportion="distinctProportion" | ||
| min="min"; max="max"; mean="mean"; sum="sum"; stddev="stddev" | ||
| median="median"; firstQuartile="firstQuartile" | ||
| thirdQuartile="thirdQuartile" | ||
| interQuartileRange="interQuartileRange" | ||
| nonParametricSkew="nonParametricSkew" | ||
| columnCount="columnCount"; columnNames="columnNames" | ||
| rowCount="rowCount"; histogram="histogram" | ||
| uniqueProportion="uniqueProportion" | ||
| duplicateCount="duplicateCount" | ||
| nullMissingCount="nullMissingCount"; system="system" | ||
|
|
There was a problem hiding this comment.
This file is not formatted to the repo’s Python formatting standard (Black/pycln/isort run on ingestion/ via make py_format_check). Examples include multiple statements per line and compact Enum definitions; black --check will fail on this file as-is. Please run the formatter and commit the formatted output, or drop this script from the repo if it’s only for local ad-hoc runs.
| Standalone test runner for complex type profiling changes. | ||
|
|
||
| Uses a sys.meta_path finder to intercept ALL metadata.generated.* | ||
| imports, returning permissive stubs. The 'metadata' package itself is | ||
| replaced with a bare module so its __init__.py never runs. | ||
|
|
||
| Usage: python run_complex_type_tests.py | ||
| See: https://github.com/open-metadata/OpenMetadata/issues/15627 | ||
| """ | ||
|
|
||
| import sys | ||
| import os | ||
| import logging | ||
| import importlib | ||
| from types import ModuleType | ||
| from enum import Enum | ||
|
|
||
| # ── Prevent script dir from shadowing real packages ────────────────── | ||
| _script_dir = os.path.dirname(os.path.abspath(__file__)) | ||
| sys.path = [p for p in sys.path if os.path.abspath(p) != _script_dir] | ||
|
|
||
| # ── Ensure ingestion/src is on path ────────────────────────────────── | ||
| _src_dir = os.path.normpath(os.path.join(_script_dir, "..", "..", "..", "..", "src")) | ||
| if _src_dir not in sys.path: | ||
| sys.path.insert(0, _src_dir) | ||
|
|
||
|
|
||
| # ════════════════════════════════════════════════════════════════════════ | ||
| # 1) Install 'metadata' as a bare package (skip its __init__.py) | ||
| # ════════════════════════════════════════════════════════════════════════ | ||
| _meta_pkg = ModuleType("metadata") | ||
| _meta_pkg.__path__ = [os.path.join(_src_dir, "metadata")] | ||
| _meta_pkg.__package__ = "metadata" | ||
| sys.modules["metadata"] = _meta_pkg | ||
|
|
||
|
|
||
| # ════════════════════════════════════════════════════════════════════════ | ||
| # 2) Meta-path finder: auto-stub ALL metadata.generated.* imports | ||
| # ════════════════════════════════════════════════════════════════════════ | ||
| _null_logger = logging.getLogger("stub") | ||
|
|
||
|
|
||
| class _StubModule(ModuleType): | ||
| """A stub module whose attributes are either explicitly set or | ||
| fall back to a dummy class that has .__name__, is iterable, etc.""" | ||
|
|
||
| class _Dummy: | ||
| __name__ = "_Dummy" | ||
| def __init_subclass__(cls, **kw): pass | ||
| def __init__(self, *a, **kw): pass | ||
| def __call__(self, *a, **kw): return self | ||
| def __iter__(self): return iter([]) | ||
| def __bool__(self): return False | ||
| def __str__(self): return "_Dummy" | ||
| def items(self): return [] | ||
| def values(self): return [] | ||
| def keys(self): return [] | ||
|
|
||
| def __getattr__(self, name): | ||
| # Return the class (not an instance) so it can be used as a | ||
| # type annotation, base class, or called to construct instances. | ||
| return _StubModule._Dummy | ||
|
|
||
|
|
||
| class _GeneratedFinder: | ||
| """Intercepts `import metadata.generated.*` and returns stubs.""" | ||
| PREFIX = "metadata.generated" | ||
|
|
||
| def find_module(self, fullname, path=None): | ||
| if fullname == self.PREFIX or fullname.startswith(self.PREFIX + "."): | ||
| return self | ||
| return None | ||
|
|
||
| def load_module(self, fullname): | ||
| if fullname in sys.modules: | ||
| return sys.modules[fullname] | ||
| mod = _StubModule(fullname) | ||
| mod.__path__ = [] | ||
| mod.__package__ = fullname | ||
| mod.__loader__ = self | ||
| sys.modules[fullname] = mod | ||
| return mod | ||
|
|
||
|
|
||
| sys.meta_path.insert(0, _GeneratedFinder()) | ||
|
|
||
|
|
||
| # ════════════════════════════════════════════════════════════════════════ |
There was a problem hiding this comment.
This script lives under ingestion/tests/unit/ but won’t be picked up by pytest (it doesn’t match the test_*.py pattern) and it duplicates what the proper unit test file already validates. To avoid dead/duplicated test logic, consider removing it or converting its assertions into a real pytest/unittest test module.
| Standalone test runner for complex type profiling changes. | |
| Uses a sys.meta_path finder to intercept ALL metadata.generated.* | |
| imports, returning permissive stubs. The 'metadata' package itself is | |
| replaced with a bare module so its __init__.py never runs. | |
| Usage: python run_complex_type_tests.py | |
| See: https://github.com/open-metadata/OpenMetadata/issues/15627 | |
| """ | |
| import sys | |
| import os | |
| import logging | |
| import importlib | |
| from types import ModuleType | |
| from enum import Enum | |
| # ── Prevent script dir from shadowing real packages ────────────────── | |
| _script_dir = os.path.dirname(os.path.abspath(__file__)) | |
| sys.path = [p for p in sys.path if os.path.abspath(p) != _script_dir] | |
| # ── Ensure ingestion/src is on path ────────────────────────────────── | |
| _src_dir = os.path.normpath(os.path.join(_script_dir, "..", "..", "..", "..", "src")) | |
| if _src_dir not in sys.path: | |
| sys.path.insert(0, _src_dir) | |
| # ════════════════════════════════════════════════════════════════════════ | |
| # 1) Install 'metadata' as a bare package (skip its __init__.py) | |
| # ════════════════════════════════════════════════════════════════════════ | |
| _meta_pkg = ModuleType("metadata") | |
| _meta_pkg.__path__ = [os.path.join(_src_dir, "metadata")] | |
| _meta_pkg.__package__ = "metadata" | |
| sys.modules["metadata"] = _meta_pkg | |
| # ════════════════════════════════════════════════════════════════════════ | |
| # 2) Meta-path finder: auto-stub ALL metadata.generated.* imports | |
| # ════════════════════════════════════════════════════════════════════════ | |
| _null_logger = logging.getLogger("stub") | |
| class _StubModule(ModuleType): | |
| """A stub module whose attributes are either explicitly set or | |
| fall back to a dummy class that has .__name__, is iterable, etc.""" | |
| class _Dummy: | |
| __name__ = "_Dummy" | |
| def __init_subclass__(cls, **kw): pass | |
| def __init__(self, *a, **kw): pass | |
| def __call__(self, *a, **kw): return self | |
| def __iter__(self): return iter([]) | |
| def __bool__(self): return False | |
| def __str__(self): return "_Dummy" | |
| def items(self): return [] | |
| def values(self): return [] | |
| def keys(self): return [] | |
| def __getattr__(self, name): | |
| # Return the class (not an instance) so it can be used as a | |
| # type annotation, base class, or called to construct instances. | |
| return _StubModule._Dummy | |
| class _GeneratedFinder: | |
| """Intercepts `import metadata.generated.*` and returns stubs.""" | |
| PREFIX = "metadata.generated" | |
| def find_module(self, fullname, path=None): | |
| if fullname == self.PREFIX or fullname.startswith(self.PREFIX + "."): | |
| return self | |
| return None | |
| def load_module(self, fullname): | |
| if fullname in sys.modules: | |
| return sys.modules[fullname] | |
| mod = _StubModule(fullname) | |
| mod.__path__ = [] | |
| mod.__package__ = fullname | |
| mod.__loader__ = self | |
| sys.modules[fullname] = mod | |
| return mod | |
| sys.meta_path.insert(0, _GeneratedFinder()) | |
| # ════════════════════════════════════════════════════════════════════════ | |
| Deprecated standalone test runner. | |
| This file intentionally no longer contains executable test logic. | |
| The canonical assertions for complex type profiling belong in the | |
| proper pytest test module that is already collected by the unit test | |
| suite, which avoids keeping duplicated or non-discoverable tests | |
| under ``ingestion/tests/unit``. | |
| """ | |
| # ════════════════════════════════════════════════════════════════════════ |
| @@ -443,6 +458,28 @@ def _prepare_column_metrics(self) -> List: | |||
| ) | |||
| ) | |||
|
|
|||
| # Add safe metrics for complex type columns | |||
| for column in complex_columns: | |||
| safe_metrics = [ | |||
| metric | |||
| for metric in self.metric_filter.get_column_metrics( | |||
| StaticMetric, | |||
| column, | |||
| self.profiler_interface.table_entity.serviceType, | |||
| ) | |||
| if not metric.is_window_metric() | |||
| and metric.name() in COMPLEX_TYPE_METRICS | |||
| ] | |||
| if safe_metrics: | |||
| column_metrics_for_thread_pool.append( | |||
| ThreadPoolMetrics( | |||
| metrics=safe_metrics, | |||
| metric_type=MetricTypes.Static, | |||
| column=column, | |||
| table=self.table, | |||
| ) | |||
| ) | |||
|
|
|||
There was a problem hiding this comment.
Core behavior change (routing complex columns to COMPLEX_TYPE_METRICS) isn’t exercised by tests. Add/extend an existing profiler unit test to assert that ARRAY/JSON/GEOGRAPHY columns only schedule nullCount & valuesCount and do not schedule query/window metrics (and that regular columns remain unaffected).
| @@ -135,8 +143,16 @@ class Dialects(metaclass=EnumAdapter): | |||
| CustomTypes.ARRAY.value.__name__, | |||
| CustomTypes.SQADATETIMERANGE.value.__name__, | |||
| DataType.XML.value, | |||
| CustomTypes.UNDETERMINED.value.__name__, | |||
| } | |||
There was a problem hiding this comment.
COMPLEX_TYPES is missing the Geo type name "GEOGRAPHY" (e.g., created via create_sqlalchemy_type("GEOGRAPHY") in multiple dialects), so GEOGRAPHY columns will be treated as regular types and may still attempt unsafe metrics (e.g., DISTINCT/GROUP BY). Add DataType.GEOGRAPHY.value (or an equivalent "GEOGRAPHY" entry) to COMPLEX_TYPES to ensure they only get COMPLEX_TYPE_METRICS.
| See: https://github.com/open-metadata/OpenMetadata/issues/15627 | ||
| """ | ||
|
|
||
| import importlib |
There was a problem hiding this comment.
This test file has an unused import (importlib) which will be flagged by pycln during make py_format_check (it runs on the whole ingestion/ directory). Remove the unused import to keep formatting checks passing.
| import importlib |
| def test_sqa_geography_in_complex_types(self): | ||
| """SQASGeography should be in COMPLEX_TYPES.""" | ||
| self.assertIn(SQASGeography.__name__, COMPLEX_TYPES) | ||
|
|
||
| def test_geometry_in_complex_types(self): | ||
| """GEOMETRY should be in COMPLEX_TYPES.""" | ||
| self.assertIn("GEOMETRY", COMPLEX_TYPES) | ||
|
|
||
| def test_xml_in_complex_types(self): | ||
| """XML should be in COMPLEX_TYPES.""" | ||
| self.assertIn("XML", COMPLEX_TYPES) | ||
|
|
||
| def test_datetimerange_in_complex_types(self): | ||
| """CustomDateTimeRange should be in COMPLEX_TYPES.""" | ||
| self.assertIn(CustomDateTimeRange.__name__, COMPLEX_TYPES) | ||
|
|
There was a problem hiding this comment.
The new registry behavior for complex geo types isn’t covered for the common SQLAlchemy type class name "GEOGRAPHY" (created via create_sqlalchemy_type in Snowflake/Redshift/BigQuery/etc.). Add assertions here that "GEOGRAPHY" is included in COMPLEX_TYPES (and not in NOT_COMPUTE) so the test suite catches regressions for geo columns.
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
🔴 Playwright Results — 1 failure(s), 17 flaky✅ 3669 passed · ❌ 1 failed · 🟡 17 flaky · ⏭️ 89 skipped
Genuine Failures (failed on all attempts)❌
|
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
Code Review ✅ ApprovedEnables complex data type profiling to improve instrumentation depth. No issues found. OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
| class _GeneratedFinder: | ||
| """Intercepts `import metadata.generated.*` and returns stubs.""" | ||
|
|
||
| PREFIX = "metadata.generated" | ||
|
|
||
| def find_module(self, fullname, path=None): | ||
| if fullname == self.PREFIX or fullname.startswith(self.PREFIX + "."): | ||
| return self | ||
| return None | ||
|
|
||
| def load_module(self, fullname): | ||
| if fullname in sys.modules: | ||
| return sys.modules[fullname] | ||
| mod = _StubModule(fullname) | ||
| mod.__path__ = [] | ||
| mod.__package__ = fullname | ||
| mod.__loader__ = self | ||
| sys.modules[fullname] = mod | ||
| return mod | ||
|
|
There was a problem hiding this comment.
_GeneratedFinder implements the deprecated find_module/load_module import hook API, which is discouraged and can break with newer Python importlib behavior. If this runner is kept, consider using importlib.abc.MetaPathFinder + importlib.abc.Loader (find_spec/exec_module) instead.
| See: https://github.com/open-metadata/OpenMetadata/issues/15627 | ||
| """ | ||
|
|
||
| import importlib |
There was a problem hiding this comment.
importlib is imported but never used in this test module. Please remove the unused import to keep the test clean.
| import importlib |
| def _bootstrap_generated_stub(src_dir): | ||
| """Creates minimal stubs for metadata.generated so that the | ||
| orm.registry module can be imported in environments where | ||
| the full code-generation pipeline has not been run. | ||
| """ |
There was a problem hiding this comment.
_bootstrap_generated_stub takes a src_dir parameter but doesn’t use it. Either remove the parameter or use it (e.g., for locating/creating the stub package paths) to avoid dead arguments.
| # Wire up stub modules in sys.modules | ||
| for mod_name in [ | ||
| "metadata.generated", | ||
| "metadata.generated.schema", | ||
| "metadata.generated.schema.entity", | ||
| "metadata.generated.schema.entity.data", | ||
| "metadata.generated.schema.entity.data.table", | ||
| "metadata.generated.schema.configuration", | ||
| "metadata.generated.schema.configuration.profilerConfiguration", | ||
| "metadata.generated.schema.api", | ||
| "metadata.generated.schema.api.data", | ||
| "metadata.generated.schema.api.data.createTableProfile", | ||
| "metadata.generated.schema.entity.services", | ||
| "metadata.generated.schema.entity.services.databaseService", | ||
| "metadata.generated.schema.entity.services.connections", | ||
| "metadata.generated.schema.entity.services.connections.database", | ||
| "metadata.generated.schema.entity.services.connections.database.sqliteConnection", | ||
| "metadata.generated.schema.entity.services.connections.metadata", | ||
| "metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection", | ||
| "metadata.generated.schema.settings", | ||
| "metadata.generated.schema.settings.settings", | ||
| "metadata.generated.schema.tests", | ||
| "metadata.generated.schema.tests.customMetric", | ||
| "metadata.generated.schema.type", | ||
| "metadata.generated.schema.type.basic", | ||
| ]: | ||
| if mod_name not in sys.modules: | ||
| stub = ModuleType(mod_name) | ||
| sys.modules[mod_name] = stub | ||
|
|
There was a problem hiding this comment.
When the generated-schema import is missing, this test mutates sys.modules by installing large metadata.generated.* stubs and doesn’t restore them afterward. That can leak into other tests in the same run and change their behavior. Consider scoping the stubs to the test (e.g., using a context manager/fixture that cleans up sys.modules entries after import) or skipping the test when metadata.generated is unavailable.
| # Add safe metrics for complex type columns | ||
| for column in complex_columns: | ||
| safe_metrics = [ | ||
| metric | ||
| for metric in self.metric_filter.get_column_metrics( | ||
| StaticMetric, | ||
| column, | ||
| self.profiler_interface.table_entity.serviceType, | ||
| ) | ||
| if not metric.is_window_metric() | ||
| and metric.name() in COMPLEX_TYPE_METRICS | ||
| ] | ||
| if safe_metrics: | ||
| column_metrics_for_thread_pool.append( | ||
| ThreadPoolMetrics( | ||
| metrics=safe_metrics, | ||
| metric_type=MetricTypes.Static, | ||
| column=column, | ||
| table=self.table, | ||
| ) | ||
| ) | ||
|
|
There was a problem hiding this comment.
The new complex-type path in _prepare_column_metrics() (routing complex columns to COMPLEX_TYPE_METRICS) isn’t covered by existing profiler unit tests. Please add a unit test that builds a table with a complex column (e.g., JSON/ARRAY) and asserts that only nullCount/valuesCount are scheduled/computed for it (and that unsafe metrics like uniqueCount/distinctCount are not).
| for column in self.columns: | ||
| # Skip composed/hybrid metrics for columns that are fully excluded | ||
| if column.type.__class__.__name__ in NOT_COMPUTE: | ||
| continue | ||
| self.run_composed_metrics(column) | ||
| self.run_hybrid_metrics(column) |
There was a problem hiding this comment.
The new logic still runs composed metrics for complex columns (since only NOT_COMPUTE is skipped), which will compute at least nullProportion (and potentially add other composed/hybrid keys as None). This doesn’t match the PR description of restricting complex types to only nullCount/valuesCount. Consider either skipping composed/hybrid metrics for complex types as well, or updating the PR description/implementation to explicitly allow additional derived metrics for complex types.
| """ | ||
| Standalone test runner for complex type profiling changes. | ||
|
|
||
| Uses a sys.meta_path finder to intercept ALL metadata.generated.* | ||
| imports, returning permissive stubs. The 'metadata' package itself is | ||
| replaced with a bare module so its __init__.py never runs. | ||
|
|
||
| Usage: python run_complex_type_tests.py | ||
| See: https://github.com/open-metadata/OpenMetadata/issues/15627 | ||
| """ |
There was a problem hiding this comment.
This file lacks the standard repository license header present in other ingestion Python files (e.g., ingestion/src/metadata/profiler/metrics/static/unique_count.py:1). Please add the appropriate header or remove the script from the repo if it’s only for local/manual runs.
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
|



Summary
Resolves #15627
Enable the profiler to compute a safe subset of metrics (nullCount, valuesCount) for complex data types (JSON, arrays, geo, structs) that were previously fully excluded from profiling.
Problem
OpenMetadata's profiler maintained a NOT_COMPUTE set that completely excluded complex data types from profiling. This meant no metrics at all were collected for columns with types like JSON, ARRAY, GEOMETRY, STRUCT, MAP, etc. - even universal metrics like null counts that work on any column type.
Solution
Architecture: Two-tier type classification
Changes
Safe metrics for complex types
Test coverage
Risk assessment