
Fix bigquery e2e region #26754

Merged

SumanMaharana merged 9 commits into main from fix-bigquery-region on Mar 30, 2026

Conversation

@SumanMaharana
Contributor

Describe your changes:

This pull request enhances the BigQuery connector to support region-aware INFORMATION_SCHEMA queries, ensuring that metadata retrieval (for stored procedures and table DDLs) works correctly for datasets located outside the default region. It introduces new region-aware query templates, updates the main logic to use them when appropriate, and adds thorough unit tests to verify correct behavior and error handling.

Region-aware query support:

  • Added new query templates BIGQUERY_GET_STORED_PROCEDURES_BY_REGION and BIGQUERY_GET_TABLE_DDLS_BY_REGION in queries.py to support querying INFORMATION_SCHEMA in a region-specific manner.
  • Updated imports in metadata.py to include the new region-aware queries.
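
The new templates follow BigQuery's region-qualified INFORMATION_SCHEMA syntax. A minimal sketch of the pattern (the template names match the PR, but the SQL bodies and the exact region formatting are assumptions; the real definitions live in queries.py):

```python
# Illustrative sketch only: the real templates are defined in
# ingestion/src/metadata/ingestion/source/database/bigquery/queries.py.
# SQL bodies and region formatting here are assumptions based on
# BigQuery's `region-<location>` INFORMATION_SCHEMA qualifier.
BIGQUERY_GET_STORED_PROCEDURES_BY_REGION = """
SELECT routine_name, routine_definition
FROM `{database_name}`.`region-{region}`.INFORMATION_SCHEMA.ROUTINES
WHERE routine_schema = '{schema_name}'
"""

BIGQUERY_GET_TABLE_DDLS_BY_REGION = """
SELECT table_name, ddl
FROM `{database_name}`.`region-{region}`.INFORMATION_SCHEMA.TABLES
WHERE table_schema = '{schema_name}'
"""

# Formatting fills in the project, dataset, and region placeholders:
query = BIGQUERY_GET_STORED_PROCEDURES_BY_REGION.format(
    database_name="my-project", schema_name="my_dataset", region="us-central1"
)
```

The key point is that the region qualifier scopes the INFORMATION_SCHEMA view to the dataset's location instead of the engine's default region.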

BigQuery metadata retrieval logic:

  • Updated _prefetch_table_ddls and get_stored_procedures in metadata.py to detect the dataset's region and use the appropriate region-aware query when a location is set, falling back to the original query otherwise. Improved error handling and logging for cases where the dataset lookup or the query fails.
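
The routing described above can be sketched as a standalone function (the helper name and signature are illustrative; the PR inlines this logic in metadata.py):

```python
from typing import Optional

# Hypothetical standalone sketch of the routing the PR adds: prefer the
# region-scoped template when the dataset reports a location, otherwise
# fall back to the dataset-scoped query.
def choose_query(
    by_region: str,
    by_dataset: str,
    database: str,
    schema: str,
    location: Optional[str],
) -> str:
    if location:
        return by_region.format(
            database_name=database, schema_name=schema, region=location
        )
    return by_dataset.format(database_name=database, schema_name=schema)
```

Passing location=None exercises the fallback path, which keeps behavior unchanged for datasets in the default region.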

Testing:

  • Added a comprehensive test suite (TestBigqueryRegionAwareQueries) to verify that region-aware queries are used when appropriate, that the fallback logic works, and that failures are handled gracefully without propagating exceptions or recording producer failures.
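
The graceful-failure behavior those tests cover can be demonstrated in isolation. A hedged sketch using mocks (get_dataset_obj and the fall-back-to-None rule mirror the PR's description; everything else is illustrative):

```python
from typing import Callable, Optional
from unittest.mock import MagicMock

# Sketch of the fallback rule the tests exercise: any failure while looking
# up the dataset (or a dataset with no location attribute) must degrade to
# None, routing callers to the dataset-scoped query instead of raising.
def get_location(
    get_dataset_obj: Callable[[str], object], schema: str
) -> Optional[str]:
    try:
        return getattr(get_dataset_obj(schema), "location", None)
    except Exception:  # swallow and fall back, mirroring the connector
        return None

# A dataset that reports a region is used as-is.
eu_dataset = MagicMock(location="EU")
assert get_location(lambda schema: eu_dataset, "my_dataset") == "EU"

# A failing lookup falls back to None rather than propagating.
def broken_lookup(schema: str):
    raise RuntimeError("403 Access Denied")

assert get_location(broken_lookup, "my_dataset") is None
```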

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Copilot AI review requested due to automatic review settings March 25, 2026 02:35
@github-actions github-actions Bot added the "Ingestion" and "safe to test" labels Mar 25, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the BigQuery ingestion connector to support region-aware INFORMATION_SCHEMA queries so that stored procedure metadata and table DDL retrieval work for datasets outside the default/engine location.

Changes:

  • Added new region-scoped query templates for stored procedures and table DDLs.
  • Updated BigQuery metadata ingestion logic to choose region-scoped vs dataset-scoped queries based on dataset location.
  • Added unit tests validating region routing, fallback behavior, and graceful error handling.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Files reviewed:

  • ingestion/src/metadata/ingestion/source/database/bigquery/queries.py: Adds region-aware SQL templates for ROUTINES and TABLES (DDL).
  • ingestion/src/metadata/ingestion/source/database/bigquery/metadata.py: Uses dataset location to route stored procedure and DDL queries to region-scoped INFORMATION_SCHEMA.
  • ingestion/tests/unit/topology/database/test_bigquery.py: Adds unit tests covering region-aware query selection and failure paths.

Comment thread ingestion/src/metadata/ingestion/source/database/bigquery/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/bigquery/metadata.py Outdated
@github-actions
Contributor

github-actions Bot commented Mar 25, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (37)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http CVE-2026-33870 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.10.Final
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 CVE-2026-33871 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.11.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.spark:spark-core_2.12 CVE-2025-54920 🚨 HIGH 3.5.6 3.5.7
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (15)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
apache-airflow CVE-2026-26929 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-28779 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-30911 🚨 HIGH 3.1.5 3.1.8
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
pyOpenSSL CVE-2026-27459 🚨 HIGH 24.1.0 26.0.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

@github-actions
Contributor

github-actions Bot commented Mar 25, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.12)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpam-modules CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-modules-bin CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-runtime CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam0g CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (37)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http CVE-2026-33870 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.10.Final
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 CVE-2026-33871 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.11.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.spark:spark-core_2.12 CVE-2025-54920 🚨 HIGH 3.5.6 3.5.7
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
Authlib CVE-2026-27962 🔥 CRITICAL 1.6.6 1.6.9
Authlib CVE-2026-28490 🚨 HIGH 1.6.6 1.6.9
Authlib CVE-2026-28498 🚨 HIGH 1.6.6 1.6.9
Authlib CVE-2026-28802 🚨 HIGH 1.6.6 1.6.7
PyJWT CVE-2026-32597 🚨 HIGH 2.10.1 2.12.0
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
aiohttp CVE-2025-69223 🚨 HIGH 3.13.2 3.13.3
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
apache-airflow CVE-2026-26929 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-28779 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-30911 🚨 HIGH 3.1.5 3.1.8
apache-airflow-providers-http CVE-2025-69219 🚨 HIGH 5.6.0 6.0.0
azure-core CVE-2026-21226 🚨 HIGH 1.37.0 1.38.0
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
google-cloud-aiplatform CVE-2026-2472 🚨 HIGH 1.130.0 1.131.0
google-cloud-aiplatform CVE-2026-2473 🚨 HIGH 1.130.0 1.133.0
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
protobuf CVE-2026-0994 🚨 HIGH 4.25.8 6.33.5, 5.29.6
pyOpenSSL CVE-2026-27459 🚨 HIGH 24.1.0 26.0.0
pyasn1 CVE-2026-23490 🚨 HIGH 0.6.1 0.6.2
pyasn1 CVE-2026-30922 🚨 HIGH 0.6.1 0.6.3
python-multipart CVE-2026-24486 🚨 HIGH 0.0.20 0.0.22
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
tornado CVE-2026-31958 🚨 HIGH 6.5.3 6.5.5
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: usr/bin/docker

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
stdlib CVE-2025-68121 🔥 CRITICAL v1.25.5 1.24.13, 1.25.7, 1.26.0-rc.3
stdlib CVE-2025-61726 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2025-61728 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2026-25679 🚨 HIGH v1.25.5 1.25.8, 1.26.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

mohittilala previously approved these changes Mar 25, 2026
@github-actions
Contributor

github-actions Bot commented Mar 25, 2026

🟡 Playwright Results — all passed (20 flaky)

✅ 3419 passed · ❌ 0 failed · 🟡 20 flaky · ⏭️ 223 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 451 0 4 2
🟡 Shard 2 608 0 1 32
🟡 Shard 3 617 0 3 27
🟡 Shard 4 603 0 7 47
🟡 Shard 5 585 0 2 67
🟡 Shard 6 555 0 3 48
🟡 20 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Stored Procedure - customization should work (shard 1, 1 retry)
  • Features/CustomizeDetailPage.spec.ts › Data Product - customization should work (shard 1, 1 retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Flow/ExploreDiscovery.spec.ts › Should display deleted assets when showDeleted is checked and deleted is not present in queryFilter (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Topic (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Database Schema (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Owner Rule Is_Set (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with deeply nested subdomains (3+ levels) verifies FQN propagation (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Domain Rbac (shard 4, 1 retry)
  • Pages/DomainUIInteractions.spec.ts › Delete data product via UI (shard 4, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should verify deleted user not visible in owner selection for pipeline (shard 5, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Directory (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings March 30, 2026 13:23
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +551 to +568
            dataset_obj = self.get_dataset_obj(schema_name)
            location = getattr(dataset_obj, "location", None)
        except Exception as exc:
            logger.debug(traceback.format_exc())
            logger.debug(
                "Could not retrieve dataset location for '%s.%s', "
                "falling back to dataset-scoped query: %s",
                database,
                schema_name,
                exc,
            )
            location = None

        query = (
            BIGQUERY_GET_TABLE_DDLS_BY_REGION.format(
                database_name=database,
                schema_name=schema_name,
                region=location,

Copilot AI Mar 30, 2026


The dataset-location lookup + region-vs-dataset scoped query selection logic is duplicated here and again in get_stored_procedures(). Consider extracting a small helper (e.g., get_dataset_location(schema_name) + build_information_schema_query(template_by_region, template_by_dataset, ...)) so the fallback behavior and logging stay consistent as this evolves.

Comment on lines +1191 to +1217
        database = self.context.get().database
        schema = self.context.get().database_schema
        try:
            dataset_obj = self.get_dataset_obj(schema)
            location = getattr(dataset_obj, "location", None)
        except Exception as exc:
            logger.debug(traceback.format_exc())
            logger.debug(
                "Could not retrieve dataset location for '%s.%s', "
                "falling back to dataset-scoped query: %s",
                database,
                schema,
                exc,
            )
            location = None
        query = (
            BIGQUERY_GET_STORED_PROCEDURES_BY_REGION.format(
                database_name=database,
                schema_name=schema,
                region=location,
            )
            if location
            else BIGQUERY_GET_STORED_PROCEDURES.format(
                database_name=database,
                schema_name=schema,
            )
        )

Copilot AI Mar 30, 2026


This repeats the same dataset-location lookup + region-vs-dataset scoped query selection pattern as _prefetch_table_ddls(). To avoid the two code paths drifting (e.g., different fallback rules or normalization), consider refactoring this into a shared helper used by both methods.

@gitar-bot

gitar-bot Bot commented Mar 30, 2026

Code Review ⚠️ Changes requested 0 resolved / 6 findings

Fixes BigQuery region detection in e2e tests but introduces multiple issues: duplicate region-detection logic across two methods, lost nested column metadata during array field merges, missing SLF4J placeholder in migration logging, dropped ML_MODEL_SERVICE in entity icon mapping, and unclean polling loop in login handler that may not stop on component unmount.

⚠️ Bug: Nested column metadata lost during array field merge

The new _sort_array_entity_fields function merges source and destination attributes for top-level array fields (columns, tasks, fields), but does not recursively handle nested children. Since children is NOT in RESTRICT_UPDATE_LIST, when a destination column has children in its model_fields_set, those children are included in update_dict and completely overwrite the source's children via model_copy(update=update_dict) at line 643.

This means user-added metadata (descriptions, tags, displayName) on nested/child columns will be lost during ingestion re-runs, because the destination (from the source database) won't carry that metadata.

For example: if a user adds a description to a nested column table.col1.child1, re-running ingestion will overwrite children with the raw schema from the database, losing that description.

Suggested fix
Either:
1. Add recursive handling in _sort_array_entity_fields for `children` fields, or
2. Add `children` to RESTRICT_UPDATE_LIST so nested children are treated as restricted and only updated on ADD (not REPLACE), or
3. Recursively call _sort_array_entity_fields for nested children arrays after the top-level merge.
⚠️ Bug: entityIconMapping drops SearchIndex.ML_MODEL_SERVICE (different string)

The refactoring that de-duplicated the entityIconMapping removed SearchIndex.ML_MODEL_SERVICE (string value 'mlModelService') and only kept EntityType.MLMODEL_SERVICE (string value 'mlmodelService'). These are different strings due to casing ('mlModelService' vs 'mlmodelService'). Any caller passing SearchIndex.ML_MODEL_SERVICE or the raw string 'mlModelService' to getEntityIcon() will no longer find a matching icon, silently falling through to the default undefined icon. Callers like ServiceInsightsTab and search result components may pass SearchIndex values.

Suggested fix
Add the missing entry back to the mapping:

  [SearchIndex.ML_MODEL_SERVICE]: MlModelIcon,

along with the existing EntityType.MLMODEL_SERVICE entry.
💡 Quality: Duplicate region-detection logic in two methods

📄 ingestion/src/metadata/ingestion/source/database/bigquery/metadata.py:550-562 📄 ingestion/src/metadata/ingestion/source/database/bigquery/metadata.py:1193-1205

Both _prefetch_table_ddls (lines 550-562) and get_stored_procedures (lines 1193-1205) contain an identical block: call get_dataset_obj, extract location via getattr, handle the exception with the same fallback-to-None pattern, and then use a conditional ternary to pick the region-aware vs. dataset-scoped query. Extracting this into a small helper (e.g., _get_dataset_location(schema_name) -> Optional[str]) would remove the duplication and make it easier to maintain if the fallback logic ever changes.

Suggested fix
def _get_dataset_location(self, schema_name: str) -> Optional[str]:
    """Return the GCP region for *schema_name*, or None on failure."""
    try:
        dataset_obj = self.get_dataset_obj(schema_name)
        return getattr(dataset_obj, "location", None)
    except Exception as exc:
        logger.debug(traceback.format_exc())
        logger.debug(
            "Could not retrieve dataset location for '%s.%s', "
            "falling back to dataset-scoped query: %s",
            self.context.get().database, schema_name, exc,
        )
        return None
💡 Bug: Missing SLF4J placeholder in MigrationUtil v150 log message

The log message was refactored from String.format("Error updating automator [%s] due to [%s]", row, ex) to LOG.warn("Error updating automator [{}] due to ", row, ex), but the second {} placeholder was dropped. SLF4J will treat ex as the throwable argument (printing its stack trace) rather than formatting it inline as the original code intended. The message will read "Error updating automator [<row>] due to " without showing what the error was in the message text itself.

Suggested fix
LOG.warn("Error updating automator [{}] due to [{}]", row, ex.getMessage(), ex);
💡 Edge Case: onLoginHandler polling loop not cleaned up on unmount

The new polling mechanism in onLoginHandler uses recursive setTimeout(invokeLogin, 50) (up to 100 attempts) but has no cleanup if the component unmounts during polling. The existing cleanup function in the useEffect only clears the token expiry timer, not these timeouts. While bounded to 5 seconds max and React 18 suppresses the 'state update on unmounted component' warning, this could still call setApplicationLoading(false) after unmount. Consider storing the timeout ID and clearing it on unmount, or using an isMounted ref.

Suggested fix
Use a ref to track mount state or store timeout IDs:

const timeoutRef = useRef<number>();
// In onLoginHandler:
timeoutRef.current = window.setTimeout(invokeLogin, 50);
// In cleanup effect:
return () => { clearTimeout(timeoutRef.current); };
💡 Quality: matillionDPC.json schema has no required fields for auth config

The new matillionDPC.json schema defines clientId and clientSecret for OAuth2 authentication plus a personalAccessToken alternative, but "required": [] means an empty object {} is a valid config. At minimum, one authentication method should be required to prevent misconfiguration at runtime. Consider using a oneOf or anyOf to require either clientId+clientSecret or personalAccessToken.
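
A hedged sketch of that constraint (an illustrative fragment only, not the actual schema; the property names are taken from the finding above):

```json
{
  "anyOf": [
    { "required": ["clientId", "clientSecret"] },
    { "required": ["personalAccessToken"] }
  ]
}
```

With this in place, an empty object {} fails validation, while either complete authentication method passes.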




@SumanMaharana SumanMaharana merged commit 6d6d7e3 into main Mar 30, 2026
44 checks passed
@SumanMaharana SumanMaharana deleted the fix-bigquery-region branch March 30, 2026 21:51
SumanMaharana added a commit that referenced this pull request Apr 2, 2026
* Fix Bigquery region

* Add tests

* address comments

---------

Co-authored-by: Mayur Singal <39544459+ulixius9@users.noreply.github.com>