Skip to content

ISSUE #3027 - Better Default#26158

Open
TeddyCr wants to merge 9 commits intoopen-metadata:mainfrom
TeddyCr:ISSUE-3027
Open

ISSUE #3027 - Better Default#26158
TeddyCr wants to merge 9 commits intoopen-metadata:mainfrom
TeddyCr:ISSUE-3027

Conversation

@TeddyCr
Copy link
Collaborator

@TeddyCr TeddyCr commented Feb 27, 2026

Describe your changes:

This PR implements better default metrics for the profiler, removing system, cardinality and histogram from the default metrics. We are also removing the computeMetrics config as this is no longer relevant for the profiler and are adding in the advance setting the opportunity for users to configure which metrics to compute at the workflow level.

Finally, we renamed the metrics from the Enum registry to aligned with the name of the MetricTypes

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Improvement

  • I have added tests around the new logic.
  • For connector/ingestion changes: I updated the documentation.

@TeddyCr TeddyCr requested review from a team as code owners February 27, 2026 21:27
@TeddyCr TeddyCr requested a review from Copilot February 27, 2026 21:27
@github-actions github-actions bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Feb 27, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the profiler's default metric configuration by removing computeMetrics (deprecated toggle) and introducing a metrics array field at the workflow level that lets users control which metrics are computed. Histogram and cardinality distribution are removed from defaults to reduce cost. It also renames all Metrics enum keys from UPPER_SNAKE_CASE to camelCase to align with the MetricType schema.

Changes:

  • Removes computeMetrics from the profiler pipeline schema (JSON, generated TypeScript) and unconditionally invokes compute_metrics() in the profiler core
  • Adds a metrics array field to databaseServiceProfilerPipeline.json with a curated default set, wiring it through Java (ProfilerWorkflowConfig) and Python (profiler_source.py)
  • Renames all Metrics enum entries from UPPER_SNAKE_CASE to camelCase and adds schema_metric_type to every metric class; adds valueRank and nullMissingCount to the MetricType enum

Reviewed changes

Copilot reviewed 111 out of 173 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
openmetadata-spec/.../databaseServiceProfilerPipeline.json Removes computeMetrics, adds metrics array with default set, updates useStatistics title
openmetadata-spec/.../profilerConfiguration.json Adds valueRank and nullMissingCount to the metricType enum
openmetadata-ui/.../workflow.ts, databaseServiceProfilerPipeline.ts, ingestionPipeline.ts, createIngestionPipeline.ts Generated TypeScript: removes computeMetrics field
openmetadata-ui/.../Services.constant.ts Adds 'metrics' to ADVANCED_PROPERTIES
openmetadata-ui/.../profiler.md Removes Compute Metrics section, adds Metrics section
openmetadata-service/.../ProfilerWorkflowConfig.java Builds processor with metrics list from pipeline config
openmetadata-service/.../IngestionPipelineResourceTest.java Adds test for profiler pipeline with metrics
openmetadata-service/.../WorkflowConfigBuilderTest.java Removes computeMetrics: true from expected YAML
ingestion/.../profiler_source.py Applies workflow-level metrics override when computing profiler runner
ingestion/.../default.py Removes SYSTEM, HISTOGRAM, CARDINALITY_DISTRIBUTION from default metrics; renames to camelCase
ingestion/.../core.py Removes computeMetrics guard, always computes metrics
ingestion/.../registry.py (metrics) Renames all Metrics enum entries from UPPER_SNAKE_CASE to camelCase
All metric classes (static, composed, window, hybrid, system) Adds schema_metric_type attribute to each metric class
All data quality validations Updates all references from Metrics.X_Y to Metrics.xY naming
bootstrap/sql/migrations/.../postDataMigrationSQLScript.sql Removes computeMetrics from existing profiler pipeline configs
New Python test files Tests for metric registry, profiler models, entity fetcher, converter, sampling across DBs

```
"""

schema_metric_type = MetricType.nullCount
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NullMissingCount metric has schema_metric_type = MetricType.nullCount, which is the same value assigned to NullCount. These two distinct metric classes should map to different MetricType values. Looking at profilerConfiguration.json, nullMissingCount was added as a separate enum value, so NullMissingCount.schema_metric_type should be MetricType.nullMissingCount, not MetricType.nullCount.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (34)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (9)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.12)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpam-modules CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-modules-bin CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-runtime CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam0g CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (34)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (21)

Package Vulnerability ID Severity Installed Version Fixed Version
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
aiohttp CVE-2025-69223 🚨 HIGH 3.13.2 3.13.3
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
azure-core CVE-2026-21226 🚨 HIGH 1.37.0 1.38.0
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
google-cloud-aiplatform CVE-2026-2472 🚨 HIGH 1.130.0 1.131.0
google-cloud-aiplatform CVE-2026-2473 🚨 HIGH 1.130.0 1.133.0
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
protobuf CVE-2026-0994 🚨 HIGH 4.25.8 6.33.5, 5.29.6
pyasn1 CVE-2026-23490 🚨 HIGH 0.6.1 0.6.2
python-multipart CVE-2026-24486 🚨 HIGH 0.0.20 0.0.22
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: usr/bin/docker

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
stdlib CVE-2025-68121 🔥 CRITICAL v1.25.5 1.24.13, 1.25.7, 1.26.0-rc.3
stdlib CVE-2025-61726 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2025-61728 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2025-61730 🚨 HIGH v1.25.5 1.24.12, 1.25.6

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

"title": "Timeout (in sec.)"
},
"metrics": {
"description": "List of metrics to compute. If empty, then all metrics will be computed",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: Schema description contradicts new default behavior

The metrics field description says "List of metrics to compute. If empty, then all metrics will be computed", but the field now has a non-empty default list of 20 metrics. This means new pipelines will never have an empty metrics list (they'll get the defaults). Users expecting "empty = all metrics" may be confused since the field description implies they can clear the list to compute all metrics, but that behavior would need verification in the profiler source code. Consider updating the description to reflect the new behavior, e.g., "List of metrics to compute. Defaults to a core set of metrics excluding expensive ones like histogram and cardinalityDistribution."

Was this helpful? React with 👍 / 👎

@github-actions
Copy link
Contributor

Jest test Coverage

UI tests summary

Lines Statements Branches Functions
Coverage: 65%
65.8% (56730/86217) 45.3% (29748/65673) 47.98% (8946/18645)

harshach
harshach previously approved these changes Feb 27, 2026
@gitar-bot
Copy link

gitar-bot bot commented Feb 28, 2026

🔍 CI failure analysis for bdcb8f1: Multiple CI failures: (1) generate-types job failing due to infrastructure issues (3 failures), (2) maven-collate-ci jobs failing due to fork PR limitations (2 failures), (3) Python unit tests failing across 4 jobs due to outdated metric names in test fixtures - THIS MUST BE FIXED.

Issue 1: generate-types CI Job - Persistent Infrastructure Failure

The generate-types CI job has failed three times during the repository checkout phase before any code could be analyzed.

Root Cause: Persistent infrastructure failure. The GitHub Actions runner repeatedly fails to fetch branch ISSUE-3027 with exit code 1.

Failure occurrences:

  • Commit a2ddea2e (job 65199599741): Failed at 21:27:56 UTC (Feb 27)
  • Commit dfc61cfb (job 65204617941): Failed at 22:18:18 UTC (Feb 27)
  • Commit bdcb8f12 (job 65217769863): Failed at 01:20:19 UTC (Feb 28)

Issue 2: maven-collate-ci Jobs - Fork PR Limitation

The maven-collate-ci jobs have triggered collate workflows that consistently fail when trying to checkout PR commits in the OpenMetadata submodule.

Root Cause: Git submodules cannot checkout commits from forks - they can only reference commits that exist in the configured submodule repository.

Failure occurrences:

This is a known limitation for fork PRs.


Issue 3: Unit Tests - Metric Name Validation Failures ⚠️ REQUIRES FIX

The Python unit tests are failing across 4 jobs due to incompatible metric names in test fixtures.

Affected jobs:

  • Unit Tests & Static Checks (3.10) - job 65217769805
  • Unit Tests & Static Checks (3.11) - job 65217769819
  • py-run-tests (3.10) - job 65217769945
  • py-run-tests (3.11) - job 65217769935 (ran for 69 minutes, likely same failures)

Root Cause: The PR renamed Metrics enum values from uppercase/snake_case to camelCase:

  • ROW_COUNTrowCount
  • NULL_COUNTnullCount
  • COUNTvaluesCount
  • NULL_MISSING_COUNTnullMissingCount
  • COUNT_IN_SETcountInSet

However, test fixtures still reference the old metric names, causing Pydantic validation errors:

pydantic_core._pydantic_core.ValidationError: 3 validation errors for ProfilerProcessorConfig
  Value error, Metric name row_count is not a proper metric name from the Registry
  Value error, Metric name COUNT is not a proper metric name from the Registry
  Value error, Metric name null_count is not a proper metric name from the Registry

Test results:

  • 3 failed tests across all affected jobs
  • 5 error tests (all in ProfilerPartitionUnitTest, failing at setup)
  • 4069 passed, 12 skipped, 1 xfailed

Failed/Error tests:

  • tests/unit/observability/profiler/test_workflow.py::test_profile_def
  • tests/unit/observability/profiler/test_workflow.py::test_default_profile_def
  • tests/unit/observability/profiler/test_profiler_models.py::test_valid_metrics
  • tests/unit/observability/profiler/test_profiler_partitions.py::ProfilerPartitionUnitTest (5 tests, all failing at setup)

Files requiring updates:

  • Test configuration fixtures in tests/unit/observability/profiler/ that hardcode metric names
  • Any test data using old metric names like row_count, null_count, COUNT, etc.

Summary

Infrastructure failures (unrelated to PR changes):

  1. generate-types job - infrastructure issues (can be ignored or retried)
  2. maven-collate-ci jobs - fork PR limitation (expected for fork PRs)

Test failures (RELATED to PR changes - MUST BE FIXED):
3. ❌ Python unit tests (4 jobs) - test fixtures must be updated to use new camelCase metric names to match the renamed Metrics enum

Code Review ⚠️ Changes requested 1 resolved / 4 findings

Large-scale metric naming refactor (SCREAMING_SNAKE → camelCase) is mechanically correct across 180+ files. However, the new workflow-level metrics configuration has an inverted precedence bug where profiler-specific config overrides source config, contradicting the intended behavior. Previous findings about the schema description mismatch and silent system metric removal remain unresolved.

⚠️ Bug: System metric silently removed from default profiler metrics

📄 ingestion/src/metadata/profiler/processor/default.py:35

The system metric (which collects DML operation counts, affected rows, and freshness metrics for Snowflake, BigQuery, Redshift, and Databricks) was removed from get_default_metrics() along with the camelCase rename. The PR description and JSON schema defaults only mention removing histogram and cardinalityDistribution as expensive metrics, but the system metric removal appears to be an accidental omission during the refactor rather than an intentional exclusion.

The system metric is still registered in the Metrics enum, still in the MetricType schema, and all its infrastructure (_prepare_system_metrics, SystemProfile, database-specific implementations) remains in place. Without it in defaults, system profiles will no longer be collected for any database unless explicitly configured.

If this removal is intentional, it should be documented. If not, it should be restored.

Suggested fix
    # Table Metrics
    metrics_registry.rowCount.value,
    add_props(table=table)(metrics_registry.columnCount.value),
    add_props(table=table)(metrics_registry.columnNames.value),
    # System Metrics
    add_props(table=table, ometa_client=ometa_client, db_service=db_service)(
        metrics_registry.system.value
    ),
    # Column Metrics
    metrics_registry.median.value,
⚠️ Bug: Metric precedence inverted: profiler config overrides source config

📄 ingestion/src/metadata/profiler/source/database/base/profiler_source.py:238

The reference_metrics assignment at lines 238-242 gives profiler_config.profiler.metrics precedence over source_metrics (workflow-level config):

reference_metrics = (
    profiler_config.profiler.metrics
    if profiler_config.profiler
    else source_metrics
)

This contradicts the PR's stated intent: "When metrics are specified in the source configuration, they take precedence over profiler-specific config."

Additionally, when profiler_config.profiler exists but has metrics=None (e.g., a ProfilerDef with only a name set), reference_metrics becomes None — discarding the user's explicit source_metrics and falling through to get_default_metrics(). This means a user who configures metrics at the workflow level would have them silently ignored if a profiler definition (without metrics) also exists.

Suggested fix
        reference_metrics = (
            source_metrics
            if source_metrics
            else profiler_config.profiler.metrics
            if profiler_config.profiler
            else None
        )
💡 Edge Case: Schema description contradicts new default behavior

📄 openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json:121

The metrics field description says "List of metrics to compute. If empty, then all metrics will be computed", but the field now has a non-empty default list of 20 metrics. This means new pipelines will never have an empty metrics list (they'll get the defaults). Users expecting "empty = all metrics" may be confused since the field description implies they can clear the list to compute all metrics, but that behavior would need verification in the profiler source code. Consider updating the description to reflect the new behavior, e.g., "List of metrics to compute. Defaults to a core set of metrics excluding expensive ones like histogram and cardinalityDistribution."

✅ 1 resolved
Bug: Missing __init__.py in observability/ breaks test discovery

📄 ingestion/tests/unit/observability/data_quality/test_metric_registry.py:1
Tests were moved from ingestion/tests/unit/data_quality/ and ingestion/tests/unit/profiler/ into the new ingestion/tests/unit/observability/ directory. The child directories have __init__.py files (carried over from the original locations), but the new parent directory ingestion/tests/unit/observability/ is missing its __init__.py.

With pytest's default prepend import mode (no custom importmode is configured in pyproject.toml), Python's package import system requires __init__.py files for proper package discovery. Without it at the observability/ level, test modules that require imports across this package boundary (e.g., conftest fixtures, shared utilities) may fail to resolve correctly. This could cause all moved profiler and data quality unit tests to silently not run or to fail with import errors.

Fix: Add an empty ingestion/tests/unit/observability/__init__.py file.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

[metrics_registry.get(name) for name in profiler_config.profiler.metrics]
if profiler_config.profiler.metrics
else get_default_metrics(
reference_metrics = (
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: Metric precedence inverted: profiler config overrides source config

The reference_metrics assignment at lines 238-242 gives profiler_config.profiler.metrics precedence over source_metrics (workflow-level config):

reference_metrics = (
    profiler_config.profiler.metrics
    if profiler_config.profiler
    else source_metrics
)

This contradicts the PR's stated intent: "When metrics are specified in the source configuration, they take precedence over profiler-specific config."

Additionally, when profiler_config.profiler exists but has metrics=None (e.g., a ProfilerDef with only a name set), reference_metrics becomes None — discarding the user's explicit source_metrics and falling through to get_default_metrics(). This means a user who configures metrics at the workflow level would have them silently ignored if a profiler definition (without metrics) also exists.

Suggested fix:

reference_metrics = (
    source_metrics
    if source_metrics
    else profiler_config.profiler.metrics
    if profiler_config.profiler
    else None
)

Was this helpful? React with 👍 / 👎

@sonarqubecloud
Copy link

@sonarqubecloud
Copy link

Copy link
Contributor

@edg956 edg956 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM though I'd check Gitar's suggestions in case they make sense, specially the one about the changing priorities

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backward-Incompatible-Change Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants