feat(profiler): enable complex data type profiling (#15627) #27529

Open
david-mamani wants to merge 3 commits into open-metadata:main from david-mamani:feat/15627-profiler-complex-data-types

Conversation

@david-mamani

Summary

Resolves #15627
Enable the profiler to compute a safe subset of metrics (nullCount, valuesCount) for complex data types (JSON, arrays, geo, structs) that were previously fully excluded from profiling.

Problem

OpenMetadata's profiler maintained a NOT_COMPUTE set that completely excluded complex data types from profiling. This meant no metrics at all were collected for columns with types like JSON, ARRAY, GEOMETRY, STRUCT, MAP, etc. - even universal metrics like null counts that work on any column type.

Solution

Architecture: Two-tier type classification

  1. NOT_COMPUTE (reduced) - Contains only truly unprofileable types: NullType, UndeterminedType
  2. COMPLEX_TYPES (new) - Contains complex types that receive a restricted, safe subset of metrics
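
A minimal sketch of the two-tier classification, using plain strings in place of the SQLAlchemy type class names held in orm/registry.py. The set members are only examples named in this PR, the `is_complex_type()` signature is an assumption (the PR names the helper but does not show it), and `allowed_metrics()` is a hypothetical helper added for illustration.

```python
from typing import Set

# Illustrative stand-ins; the real sets live in orm/registry.py.
NOT_COMPUTE = {"NullType", "UndeterminedType"}
COMPLEX_TYPES = {"JSON", "ARRAY", "STRUCT", "MAP", "GEOMETRY", "XML"}
COMPLEX_TYPE_METRICS = {"nullCount", "valuesCount"}


def is_complex_type(type_name: str) -> bool:
    """Assumed signature: True when the type gets only the restricted metrics."""
    return type_name in COMPLEX_TYPES


def allowed_metrics(type_name: str, requested: Set[str]) -> Set[str]:
    """Hypothetical helper: filter requested metrics by type tier."""
    if type_name in NOT_COMPUTE:
        return set()  # tier 1: nothing is computed
    if is_complex_type(type_name):
        return requested & COMPLEX_TYPE_METRICS  # tier 2: safe subset only
    return set(requested)  # regular types are unaffected


print(allowed_metrics("JSON", {"nullCount", "mean", "uniqueCount"}))
```

Keeping the two sets disjoint means a type is either skipped entirely or restricted, never both.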

Changes

| File | Change |
| --- | --- |
| orm/registry.py | Split NOT_COMPUTE into NOT_COMPUTE + COMPLEX_TYPES; added COMPLEX_TYPE_METRICS set and is_complex_type() helper |
| processor/core.py | Updated _prepare_column_metrics() to route complex columns to limited metrics; updated compute_metrics() to skip composed/hybrid metrics only for fully excluded types |
| metrics/static/unique_count.py | Extended guard to exclude COMPLEX_TYPES (GROUP BY/subqueries unsafe on complex data) |

Safe metrics for complex types

  • nullCount - Works universally via COUNT(*) - COUNT(col)
  • valuesCount - Works universally via COUNT(col)
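
As a small illustration of why these two counts need no type-specific operations, the sketch below runs both expressions against an in-memory SQLite table. The table and column names are made up for the example, and the JSON values are stored as plain text; this is not the profiler's own query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, '{"a": 1}'), (2, None), (3, "[1, 2, 3]"), (4, None), (5, "{}")],
)

# COUNT(col) ignores NULLs, so COUNT(*) - COUNT(col) is the number of NULLs.
values_count, null_count = conn.execute(
    "SELECT COUNT(payload), COUNT(*) - COUNT(payload) FROM events"
).fetchone()
conn.close()

print(values_count, null_count)  # 3 2
```

Neither expression compares, orders, or groups the column values, which is what makes them safe for types that do not support those operations.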

Test coverage

  • 48 test assertions covering:
    • NOT_COMPUTE set content (7 tests)
    • COMPLEX_TYPES set content (11 tests)
    • is_complex_type() helper function (9 tests)
    • COMPLEX_TYPE_METRICS content (10 tests)
    • Set isolation / no overlap (11 tests)

Risk assessment

  • Zero risk to existing types: Integer, String, Float, Boolean, Date columns are completely unaffected
  • Minimal scope: Only nullCount and valuesCount are enabled for complex types - no numeric, orderable, or aggregation metrics
  • Backward compatible: Types previously in NOT_COMPUTE that move to COMPLEX_TYPES will now get MORE metrics, not fewer

Copilot AI review requested due to automatic review settings April 19, 2026 20:44
@david-mamani david-mamani requested a review from a team as a code owner April 19, 2026 20:44
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Split NOT_COMPUTE into NOT_COMPUTE (truly unprofileable: NullType,
UndeterminedType) and COMPLEX_TYPES (JSON, arrays, geo, structs, etc.)
that receive a restricted safe subset of metrics (nullCount, valuesCount).

Changes:
- registry.py: new COMPLEX_TYPES set, COMPLEX_TYPE_METRICS, is_complex_type()
- core.py: _prepare_column_metrics() now routes complex columns to limited metrics
- unique_count.py: guard extended to exclude COMPLEX_TYPES
- Added 48 unit tests validating the registry refactoring

@david-mamani david-mamani force-pushed the feat/15627-profiler-complex-data-types branch from 047143e to fdd2841 Compare April 19, 2026 20:50

Contributor

Copilot AI left a comment

Pull request overview

Enables profiler support for complex column data types by allowing a limited, “safe” subset of metrics (nullCount, valuesCount) to be computed for types that were previously fully excluded.

Changes:

  • Split the previous “do not compute” type bucket into truly-unprofileable types vs. complex types eligible for limited metrics.
  • Updated profiler processor column-metric preparation to route complex columns to COMPLEX_TYPE_METRICS.
  • Updated UniqueCount to skip complex types.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Show a summary per file

| File | Description |
| --- | --- |
| ingestion/src/metadata/profiler/orm/registry.py | Introduces COMPLEX_TYPES, COMPLEX_TYPE_METRICS, and is_complex_type(); narrows NOT_COMPUTE. |
| ingestion/src/metadata/profiler/processor/core.py | Routes complex columns to limited static metrics and skips composed/hybrid only for fully excluded types. |
| ingestion/src/metadata/profiler/metrics/static/unique_count.py | Prevents uniqueCount computation on complex types. |
| ingestion/tests/unit/observability/profiler/test_complex_type_profiling.py | Adds unit coverage for the registry split and safe metrics set. |
| ingestion/tests/unit/observability/profiler/run_complex_type_tests.py | Adds a standalone runner for manual/local checks. |

Comment on lines +93 to +121
class DataType(str, Enum):
    INT="INT"; BIGINT="BIGINT"; SMALLINT="SMALLINT"; TINYINT="TINYINT"
    NUMBER="NUMBER"; NUMERIC="NUMERIC"; DECIMAL="DECIMAL"
    DOUBLE="DOUBLE"; FLOAT="FLOAT"; JSON="JSON"; ARRAY="ARRAY"
    MAP="MAP"; STRUCT="STRUCT"; UNION="UNION"; SET="SET"
    GEOGRAPHY="GEOGRAPHY"; GEOMETRY="GEOMETRY"; ENUM="ENUM"
    STRING="STRING"; TEXT="TEXT"; CHAR="CHAR"; VARCHAR="VARCHAR"
    BOOLEAN="BOOLEAN"; DATE="DATE"; DATETIME="DATETIME"
    TIMESTAMP="TIMESTAMP"; TIME="TIME"; BINARY="BINARY"
    VARBINARY="VARBINARY"; BLOB="BLOB"; BYTEA="BYTEA"
    MEDIUMTEXT="MEDIUMTEXT"; NULL="NULL"; SUPER="SUPER"
    INTERVAL="INTERVAL"; XML="XML"; FIXED="FIXED"
    LONG="LONG"; BYTES="BYTES"

class MetricType(str, Enum):
    valuesCount="valuesCount"; nullCount="nullCount"
    nullProportion="nullProportion"; uniqueCount="uniqueCount"
    distinctCount="distinctCount"; distinctProportion="distinctProportion"
    min="min"; max="max"; mean="mean"; sum="sum"; stddev="stddev"
    median="median"; firstQuartile="firstQuartile"
    thirdQuartile="thirdQuartile"
    interQuartileRange="interQuartileRange"
    nonParametricSkew="nonParametricSkew"
    columnCount="columnCount"; columnNames="columnNames"
    rowCount="rowCount"; histogram="histogram"
    uniqueProportion="uniqueProportion"
    duplicateCount="duplicateCount"
    nullMissingCount="nullMissingCount"; system="system"

Copilot AI Apr 19, 2026

This file is not formatted to the repo’s Python formatting standard (Black/pycln/isort run on ingestion/ via make py_format_check). Examples include multiple statements per line and compact Enum definitions; black --check will fail on this file as-is. Please run the formatter and commit the formatted output, or drop this script from the repo if it’s only for local ad-hoc runs.

Copilot uses AI. Check for mistakes.
Comment on lines +2 to +89
Standalone test runner for complex type profiling changes.

Uses a sys.meta_path finder to intercept ALL metadata.generated.*
imports, returning permissive stubs. The 'metadata' package itself is
replaced with a bare module so its __init__.py never runs.

Usage: python run_complex_type_tests.py
See: https://github.com/open-metadata/OpenMetadata/issues/15627
"""

import sys
import os
import logging
import importlib
from types import ModuleType
from enum import Enum

# ── Prevent script dir from shadowing real packages ──────────────────
_script_dir = os.path.dirname(os.path.abspath(__file__))
sys.path = [p for p in sys.path if os.path.abspath(p) != _script_dir]

# ── Ensure ingestion/src is on path ──────────────────────────────────
_src_dir = os.path.normpath(os.path.join(_script_dir, "..", "..", "..", "..", "src"))
if _src_dir not in sys.path:
    sys.path.insert(0, _src_dir)


# ════════════════════════════════════════════════════════════════════════
# 1) Install 'metadata' as a bare package (skip its __init__.py)
# ════════════════════════════════════════════════════════════════════════
_meta_pkg = ModuleType("metadata")
_meta_pkg.__path__ = [os.path.join(_src_dir, "metadata")]
_meta_pkg.__package__ = "metadata"
sys.modules["metadata"] = _meta_pkg


# ════════════════════════════════════════════════════════════════════════
# 2) Meta-path finder: auto-stub ALL metadata.generated.* imports
# ════════════════════════════════════════════════════════════════════════
_null_logger = logging.getLogger("stub")


class _StubModule(ModuleType):
    """A stub module whose attributes are either explicitly set or
    fall back to a dummy class that has .__name__, is iterable, etc."""

    class _Dummy:
        __name__ = "_Dummy"
        def __init_subclass__(cls, **kw): pass
        def __init__(self, *a, **kw): pass
        def __call__(self, *a, **kw): return self
        def __iter__(self): return iter([])
        def __bool__(self): return False
        def __str__(self): return "_Dummy"
        def items(self): return []
        def values(self): return []
        def keys(self): return []

    def __getattr__(self, name):
        # Return the class (not an instance) so it can be used as a
        # type annotation, base class, or called to construct instances.
        return _StubModule._Dummy


class _GeneratedFinder:
    """Intercepts `import metadata.generated.*` and returns stubs."""
    PREFIX = "metadata.generated"

    def find_module(self, fullname, path=None):
        if fullname == self.PREFIX or fullname.startswith(self.PREFIX + "."):
            return self
        return None

    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]
        mod = _StubModule(fullname)
        mod.__path__ = []
        mod.__package__ = fullname
        mod.__loader__ = self
        sys.modules[fullname] = mod
        return mod


sys.meta_path.insert(0, _GeneratedFinder())


# ════════════════════════════════════════════════════════════════════════

Copilot AI Apr 19, 2026

This script lives under ingestion/tests/unit/ but won’t be picked up by pytest (it doesn’t match the test_*.py pattern) and it duplicates what the proper unit test file already validates. To avoid dead/duplicated test logic, consider removing it or converting its assertions into a real pytest/unittest test module.

Suggested change (replaces the quoted lines above):
Deprecated standalone test runner.
This file intentionally no longer contains executable test logic.
The canonical assertions for complex type profiling belong in the
proper pytest test module that is already collected by the unit test
suite, which avoids keeping duplicated or non-discoverable tests
under ``ingestion/tests/unit``.
"""
# ════════════════════════════════════════════════════════════════════════

Comment on lines 381 to +482
@@ -443,6 +458,28 @@ def _prepare_column_metrics(self) -> List:
)
)

        # Add safe metrics for complex type columns
        for column in complex_columns:
            safe_metrics = [
                metric
                for metric in self.metric_filter.get_column_metrics(
                    StaticMetric,
                    column,
                    self.profiler_interface.table_entity.serviceType,
                )
                if not metric.is_window_metric()
                and metric.name() in COMPLEX_TYPE_METRICS
            ]
            if safe_metrics:
                column_metrics_for_thread_pool.append(
                    ThreadPoolMetrics(
                        metrics=safe_metrics,
                        metric_type=MetricTypes.Static,
                        column=column,
                        table=self.table,
                    )
                )

Copilot AI Apr 19, 2026

Core behavior change (routing complex columns to COMPLEX_TYPE_METRICS) isn’t exercised by tests. Add/extend an existing profiler unit test to assert that ARRAY/JSON/GEOGRAPHY columns only schedule nullCount & valuesCount and do not schedule query/window metrics (and that regular columns remain unaffected).
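
One possible shape for such a test is sketched below. `schedule_static_metrics()` is a hypothetical stand-in for `_prepare_column_metrics()`, and the type and metric names are illustrative strings; a real test would build the profiler with actual column objects instead.

```python
import unittest

# Illustrative stand-ins for the registry sets and the available static metrics.
COMPLEX_TYPES = {"ARRAY", "JSON", "GEOGRAPHY"}
COMPLEX_TYPE_METRICS = {"nullCount", "valuesCount"}
ALL_STATIC_METRICS = {
    "nullCount", "valuesCount", "min", "max", "distinctCount", "uniqueCount",
}


def schedule_static_metrics(columns):
    """Hypothetical router: map each column name to the metric names scheduled."""
    plan = {}
    for name, type_name in columns.items():
        if type_name in COMPLEX_TYPES:
            plan[name] = ALL_STATIC_METRICS & COMPLEX_TYPE_METRICS
        else:
            plan[name] = set(ALL_STATIC_METRICS)
    return plan


class ComplexColumnRoutingTest(unittest.TestCase):
    COLUMNS = {"id": "INT", "tags": "ARRAY", "doc": "JSON", "loc": "GEOGRAPHY"}

    def test_complex_columns_get_only_safe_metrics(self):
        plan = schedule_static_metrics(self.COLUMNS)
        for name in ("tags", "doc", "loc"):
            self.assertEqual(plan[name], {"nullCount", "valuesCount"})

    def test_regular_columns_unaffected(self):
        plan = schedule_static_metrics(self.COLUMNS)
        self.assertEqual(plan["id"], ALL_STATIC_METRICS)
```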

Comment on lines 132 to 146
@@ -135,8 +143,16 @@ class Dialects(metaclass=EnumAdapter):
    CustomTypes.ARRAY.value.__name__,
    CustomTypes.SQADATETIMERANGE.value.__name__,
    DataType.XML.value,
    CustomTypes.UNDETERMINED.value.__name__,
}

Copilot AI Apr 19, 2026

COMPLEX_TYPES is missing the Geo type name "GEOGRAPHY" (e.g., created via create_sqlalchemy_type("GEOGRAPHY") in multiple dialects), so GEOGRAPHY columns will be treated as regular types and may still attempt unsafe metrics (e.g., DISTINCT/GROUP BY). Add DataType.GEOGRAPHY.value (or an equivalent "GEOGRAPHY" entry) to COMPLEX_TYPES to ensure they only get COMPLEX_TYPE_METRICS.
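
The gap can be reproduced with a small sketch. `make_type()` is a hypothetical stand-in for the `create_sqlalchemy_type("GEOGRAPHY")` call mentioned above, assumed here to yield a class named after its argument; the set is an excerpt of names that appear in this PR's tests.

```python
COMPLEX_TYPES = {"SQASGeography", "GEOMETRY", "XML"}  # excerpt, not the full set


def make_type(name):
    """Stand-in factory: build a type class whose __name__ is `name`."""
    return type(name, (object,), {})


def is_complex(column_type) -> bool:
    # Membership is decided purely by the type's class name.
    return column_type.__class__.__name__ in COMPLEX_TYPES


geometry = make_type("GEOMETRY")()
geography = make_type("GEOGRAPHY")()

before = is_complex(geography)   # name not registered yet
COMPLEX_TYPES.add("GEOGRAPHY")
after = is_complex(geography)    # now routed to the restricted metrics

print(is_complex(geometry), before, after)  # True False True
```

Because the lookup is by class name, every dynamically created type name needs its own entry in the set.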

See: https://github.com/open-metadata/OpenMetadata/issues/15627
"""

import importlib

Copilot AI Apr 19, 2026

This test file has an unused import (importlib) which will be flagged by pycln during make py_format_check (it runs on the whole ingestion/ directory). Remove the unused import to keep formatting checks passing.

Suggested change (delete this line):
import importlib

Comment on lines +319 to +334
    def test_sqa_geography_in_complex_types(self):
        """SQASGeography should be in COMPLEX_TYPES."""
        self.assertIn(SQASGeography.__name__, COMPLEX_TYPES)

    def test_geometry_in_complex_types(self):
        """GEOMETRY should be in COMPLEX_TYPES."""
        self.assertIn("GEOMETRY", COMPLEX_TYPES)

    def test_xml_in_complex_types(self):
        """XML should be in COMPLEX_TYPES."""
        self.assertIn("XML", COMPLEX_TYPES)

    def test_datetimerange_in_complex_types(self):
        """CustomDateTimeRange should be in COMPLEX_TYPES."""
        self.assertIn(CustomDateTimeRange.__name__, COMPLEX_TYPES)

Copilot AI Apr 19, 2026

The new registry behavior for complex geo types isn’t covered for the common SQLAlchemy type class name "GEOGRAPHY" (created via create_sqlalchemy_type in Snowflake/Redshift/BigQuery/etc.). Add assertions here that "GEOGRAPHY" is included in COMPLEX_TYPES (and not in NOT_COMPUTE) so the test suite catches regressions for geo columns.

@harshach harshach added the safe to test Add this label to run secure Github workflows on PRs label Apr 20, 2026
@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Contributor

github-actions bot commented Apr 20, 2026

🔴 Playwright Results — 1 failure(s), 17 flaky

✅ 3669 passed · ❌ 1 failed · 🟡 17 flaky · ⏭️ 89 skipped

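| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🔴 Shard 1 | 478 | 1 | 2 | 4 |
| 🟡 Shard 2 | 652 | 0 | 1 | 7 |
| 🟡 Shard 3 | 654 | 0 | 5 | 1 |
| 🟡 Shard 4 | 631 | 0 | 3 | 27 |
| 🟡 Shard 5 | 610 | 0 | 1 | 42 |
| 🟡 Shard 6 | 644 | 0 | 5 | 8 |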

Genuine Failures (failed on all attempts)

Pages/SearchIndexApplication.spec.ts › Search Index Application (shard 1)
Error: expect(received).toEqual(expected) // deep equality

Expected: StringMatching /success|activeError/g
Received: "failed"

🟡 17 flaky test(s) (passed on retry)
  • Pages/Customproperties-part1.spec.ts › no duplicate card after update (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/CustomizeWidgets.spec.ts › Domains Widget (shard 3, 1 retry)
  • Flow/SchemaTable.spec.ts › schema table test (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Domain Rbac (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should allow Data Consumer to view all tabs for topic (shard 5, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Container (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings April 20, 2026 06:38
@gitar-bot

gitar-bot bot commented Apr 20, 2026

Code Review ✅ Approved

Enables complex data type profiling to improve instrumentation depth. No issues found.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Comment on lines +84 to +103
class _GeneratedFinder:
    """Intercepts `import metadata.generated.*` and returns stubs."""

    PREFIX = "metadata.generated"

    def find_module(self, fullname, path=None):
        if fullname == self.PREFIX or fullname.startswith(self.PREFIX + "."):
            return self
        return None

    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]
        mod = _StubModule(fullname)
        mod.__path__ = []
        mod.__package__ = fullname
        mod.__loader__ = self
        sys.modules[fullname] = mod
        return mod

Copilot AI Apr 20, 2026

_GeneratedFinder implements the deprecated find_module/load_module import hook API, which is discouraged and can break with newer Python importlib behavior. If this runner is kept, consider using importlib.abc.MetaPathFinder + importlib.abc.Loader (find_spec/exec_module) instead.
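
A rough sketch of what the `find_spec`/`exec_module` form could look like follows. The class names, the `demo_generated` prefix, and the `IS_STUB` marker are hypothetical and chosen so the demo does not touch the real `metadata` package; the runner's `_Dummy` attribute fallback is left out for brevity.

```python
import importlib
import importlib.abc
import importlib.machinery
import sys


class StubLoader(importlib.abc.Loader):
    """Loader that produces an empty placeholder module."""

    def create_module(self, spec):
        return None  # fall back to the default module creation

    def exec_module(self, module):
        module.IS_STUB = True  # hypothetical marker attribute


class StubFinder(importlib.abc.MetaPathFinder):
    """Claims every import under `prefix` and hands it to StubLoader."""

    def __init__(self, prefix):
        self.prefix = prefix

    def find_spec(self, fullname, path=None, target=None):
        if fullname == self.prefix or fullname.startswith(self.prefix + "."):
            # is_package=True gives the stub an empty __path__, so
            # submodule imports are routed back through this finder.
            return importlib.machinery.ModuleSpec(
                fullname, StubLoader(), is_package=True
            )
        return None


finder = StubFinder("demo_generated")  # hypothetical prefix for the demo
sys.meta_path.insert(0, finder)
try:
    mod = importlib.import_module("demo_generated.schema.entity")
finally:
    sys.meta_path.remove(finder)

print(mod.__name__, mod.IS_STUB)
```

With this form the import system registers the module in `sys.modules` itself, so the loader does not need the manual bookkeeping that `load_module` required.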

See: https://github.com/open-metadata/OpenMetadata/issues/15627
"""

import importlib

Copilot AI Apr 20, 2026

importlib is imported but never used in this test module. Please remove the unused import to keep the test clean.

Suggested change (delete this line):
import importlib

Comment on lines +99 to +103
def _bootstrap_generated_stub(src_dir):
    """Creates minimal stubs for metadata.generated so that the
    orm.registry module can be imported in environments where
    the full code-generation pipeline has not been run.
    """

Copilot AI Apr 20, 2026

_bootstrap_generated_stub takes a src_dir parameter but doesn’t use it. Either remove the parameter or use it (e.g., for locating/creating the stub package paths) to avoid dead arguments.

Comment on lines +147 to +176
    # Wire up stub modules in sys.modules
    for mod_name in [
        "metadata.generated",
        "metadata.generated.schema",
        "metadata.generated.schema.entity",
        "metadata.generated.schema.entity.data",
        "metadata.generated.schema.entity.data.table",
        "metadata.generated.schema.configuration",
        "metadata.generated.schema.configuration.profilerConfiguration",
        "metadata.generated.schema.api",
        "metadata.generated.schema.api.data",
        "metadata.generated.schema.api.data.createTableProfile",
        "metadata.generated.schema.entity.services",
        "metadata.generated.schema.entity.services.databaseService",
        "metadata.generated.schema.entity.services.connections",
        "metadata.generated.schema.entity.services.connections.database",
        "metadata.generated.schema.entity.services.connections.database.sqliteConnection",
        "metadata.generated.schema.entity.services.connections.metadata",
        "metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection",
        "metadata.generated.schema.settings",
        "metadata.generated.schema.settings.settings",
        "metadata.generated.schema.tests",
        "metadata.generated.schema.tests.customMetric",
        "metadata.generated.schema.type",
        "metadata.generated.schema.type.basic",
    ]:
        if mod_name not in sys.modules:
            stub = ModuleType(mod_name)
            sys.modules[mod_name] = stub

Copilot AI Apr 20, 2026

When the generated-schema import is missing, this test mutates sys.modules by installing large metadata.generated.* stubs and doesn’t restore them afterward. That can leak into other tests in the same run and change their behavior. Consider scoping the stubs to the test (e.g., using a context manager/fixture that cleans up sys.modules entries after import) or skipping the test when metadata.generated is unavailable.
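
One way to scope the stubs is `unittest.mock.patch.dict`, which restores the patched dictionary on exit. The sketch below uses made-up module names so it does not collide with the real `metadata` package; `stubbed_modules()` is a hypothetical helper.

```python
import sys
from types import ModuleType
from unittest import mock

STUB_NAMES = ["demo_stub_pkg", "demo_stub_pkg.schema"]  # hypothetical names


def stubbed_modules(names):
    """Return a patcher that installs empty stub modules in sys.modules
    and puts sys.modules back to its previous contents on exit."""
    return mock.patch.dict(sys.modules, {name: ModuleType(name) for name in names})


with stubbed_modules(STUB_NAMES):
    inside = all(name in sys.modules for name in STUB_NAMES)

outside = any(name in sys.modules for name in STUB_NAMES)
print(inside, outside)  # True False
```

The same patcher can be started in `setUp` and stopped in `tearDown` (or wrapped in a pytest fixture) so the stubs never outlive the test that needs them.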

Comment on lines +462 to +483
        # Add safe metrics for complex type columns
        for column in complex_columns:
            safe_metrics = [
                metric
                for metric in self.metric_filter.get_column_metrics(
                    StaticMetric,
                    column,
                    self.profiler_interface.table_entity.serviceType,
                )
                if not metric.is_window_metric()
                and metric.name() in COMPLEX_TYPE_METRICS
            ]
            if safe_metrics:
                column_metrics_for_thread_pool.append(
                    ThreadPoolMetrics(
                        metrics=safe_metrics,
                        metric_type=MetricTypes.Static,
                        column=column,
                        table=self.table,
                    )
                )

Copilot AI Apr 20, 2026

The new complex-type path in _prepare_column_metrics() (routing complex columns to COMPLEX_TYPE_METRICS) isn’t covered by existing profiler unit tests. Please add a unit test that builds a table with a complex column (e.g., JSON/ARRAY) and asserts that only nullCount/valuesCount are scheduled/computed for it (and that unsafe metrics like uniqueCount/distinctCount are not).

Comment on lines 507 to 512
        for column in self.columns:
            # Skip composed/hybrid metrics for columns that are fully excluded
            if column.type.__class__.__name__ in NOT_COMPUTE:
                continue
            self.run_composed_metrics(column)
            self.run_hybrid_metrics(column)

Copilot AI Apr 20, 2026

The new logic still runs composed metrics for complex columns (since only NOT_COMPUTE is skipped), which will compute at least nullProportion (and potentially add other composed/hybrid keys as None). This doesn’t match the PR description of restricting complex types to only nullCount/valuesCount. Consider either skipping composed/hybrid metrics for complex types as well, or updating the PR description/implementation to explicitly allow additional derived metrics for complex types.
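
The two options can be sketched as a single filter with a flag. `columns_for_composed_metrics()` and the `(name, type_name)` tuples are hypothetical simplifications of the loop quoted above, and the sets are illustrative excerpts.

```python
NOT_COMPUTE = {"NullType", "UndeterminedType"}
COMPLEX_TYPES = {"JSON", "ARRAY"}  # excerpt for illustration


def columns_for_composed_metrics(columns, include_complex=False):
    """Hypothetical filter: names of columns whose composed/hybrid metrics run.

    include_complex=False matches the PR description (complex columns get
    only nullCount/valuesCount); include_complex=True matches the quoted
    code, where only NOT_COMPUTE columns are skipped.
    """
    skipped = set(NOT_COMPUTE) if include_complex else NOT_COMPUTE | COMPLEX_TYPES
    return [name for name, type_name in columns if type_name not in skipped]


columns = [("id", "Integer"), ("doc", "JSON"), ("unknown", "UndeterminedType")]
strict = columns_for_composed_metrics(columns)
permissive = columns_for_composed_metrics(columns, include_complex=True)
print(strict, permissive)  # ['id'] ['id', 'doc']
```

Whichever behavior is chosen, the PR description and the guard should agree on it.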

Comment on lines +1 to +10
"""
Standalone test runner for complex type profiling changes.

Uses a sys.meta_path finder to intercept ALL metadata.generated.*
imports, returning permissive stubs. The 'metadata' package itself is
replaced with a bare module so its __init__.py never runs.

Usage: python run_complex_type_tests.py
See: https://github.com/open-metadata/OpenMetadata/issues/15627
"""

Copilot AI Apr 20, 2026

This file lacks the standard repository license header present in other ingestion Python files (e.g., ingestion/src/metadata/profiler/metrics/static/unique_count.py:1). Please add the appropriate header or remove the script from the repo if it’s only for local/manual runs.

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Profiler support for Complex Data Types (json, arrays, geo...)

3 participants