Skip to content

Fix: Optimise redshift memory usage#25904

Merged
ulixius9 merged 2 commits intomainfrom
redshift_mem
Mar 4, 2026
Merged

Fix: Optimise redshift memory usage#25904
ulixius9 merged 2 commits intomainfrom
redshift_mem

Conversation

@ulixius9
Copy link
Copy Markdown
Member

@ulixius9 ulixius9 commented Feb 16, 2026

Describe your changes:

Fixes #20649

This PR optimizes memory usage in the Redshift connector during metadata ingestion by preventing unbounded memory growth when processing multiple schemas. The changes replace global SQLAlchemy reflection caches with per-schema caching and move partition detail loading from database-level to schema-level scope.


Summary by Gitar

  • Memory optimization:
    • Removed @reflection.cache decorators from get_columns, _get_schema_column_info, and _get_all_relation_info to prevent cross-schema cache accumulation
    • Added _clear_reflection_cache() method to clear SQLAlchemy inspector cache before processing each schema
  • Per-schema caching:
    • Implemented custom single-schema cache in _get_schema_column_info and _get_all_relation_info that auto-invalidates when schema changes
  • Scoped partition loading:
    • Modified get_partition_details() to accept optional schema_name parameter for targeted queries
    • Moved partition loading from global database initialization to per-schema processing in query_table_names_and_types()
  • Resource cleanup:
    • Added explicit result.close() calls to release database cursors

This will update automatically on new commits.


Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

).fetchall()
query = REDSHIFT_PARTITION_DETAILS
if schema_name:
query += f" AND \"schema\" = '{schema_name}'"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Security: SQL injection via string formatting in get_partition_details

The new schema_name parameter is concatenated directly into the SQL query via f-string interpolation:

query += f" AND \"schema\" = '{schema_name}'"

While the schema_name value typically comes from Redshift's own metadata catalog (and is thus unlikely to contain malicious content in normal usage), this is still a SQL injection vector. A schema name containing a single quote (e.g., foo'; DROP TABLE ...--) would break the query or allow injection.

Note that the existing codebase has similar patterns in _get_schema_column_info and _get_all_relation_info, so this is consistent with the current style. However, for new code, it's better to use parameterized queries.

The method query_table_names_and_types in the same file already demonstrates the correct pattern using sql.text() with named parameters (line 236-246). The same approach should be used here.

Suggested fix: Use a parameterized query approach. Since the base query is a raw string constant, wrap the full query with sql.text() and use a bind parameter:

query = REDSHIFT_PARTITION_DETAILS
if schema_name:
    query += " AND \"schema\" = :schema_name"
results = self.connection.execute(
    statement=sql.text(query),
    {"schema_name": schema_name}
).fetchall()

Suggested fix:

            query = REDSHIFT_PARTITION_DETAILS
            if schema_name:
                query += " AND \"schema\" = :schema_name"
            results = self.connection.execute(
                statement=sql.text(query),
                {"schema_name": schema_name} if schema_name else {},
            ).fetchall()

Was this helpful? React with 👍 / 👎

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Feb 16, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.12)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpam-modules CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-modules-bin CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-runtime CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam0g CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (38)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.12.7 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.13.4 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.15.2 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.16.1 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (21)

Package Vulnerability ID Severity Installed Version Fixed Version
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
aiohttp CVE-2025-69223 🚨 HIGH 3.13.2 3.13.3
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
azure-core CVE-2026-21226 🚨 HIGH 1.37.0 1.38.0
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
google-cloud-aiplatform CVE-2026-2472 🚨 HIGH 1.130.0 1.131.0
google-cloud-aiplatform CVE-2026-2473 🚨 HIGH 1.130.0 1.133.0
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
protobuf CVE-2026-0994 🚨 HIGH 4.25.8 6.33.5, 5.29.6
pyasn1 CVE-2026-23490 🚨 HIGH 0.6.1 0.6.2
python-multipart CVE-2026-24486 🚨 HIGH 0.0.20 0.0.22
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: usr/bin/docker

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
stdlib CVE-2025-68121 🔥 CRITICAL v1.25.5 1.24.13, 1.25.7, 1.26.0-rc.3
stdlib CVE-2025-61726 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2025-61728 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2025-61730 🚨 HIGH v1.25.5 1.24.12, 1.25.6

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Feb 16, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (37)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.12.7 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.13.4 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.15.2 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (9)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

Comment on lines +393 to +396
# Single-schema cache: invalidate when schema changes
cached = getattr(self, "_relation_info_cache", None)
if cached is not None and cached[0] == schema:
return cached[1]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: Cache key ignores table_name, may return stale results

The single-schema cache in _get_all_relation_info keys only on schema, but the function also filters by table_name (lines 400-403). If the function is first called with (schema="public", table_name="users"), the cached result contains only the "users" relation. A subsequent call with (schema="public", table_name="orders") would hit the cache and return the stale result (only "users"), missing "orders".

The previous @reflection.cache decorator keyed on all arguments (including table_name), so this is a behavioral regression.

The fix should include table_name in the cache key, or (better) only cache when table_name is None (the common bulk-fetch path), since caching single-table lookups provides little benefit.

Suggested fix:

    # Single-schema cache: invalidate when schema or table_name changes
    table_name = kw.get("table_name", None)
    cached = getattr(self, "_relation_info_cache", None)
    if cached is not None and cached[0] == schema and cached[1] == table_name:
        return cached[2]

Was this helpful? React with 👍 / 👎

@d-miranda
Copy link
Copy Markdown

Hey guys!
We were being affected with this issue. The changes this MR introduces, based on the 1.11.9 release solved the issue.
Can you guys include it in the next release?

Comment on lines +209 to +216
def _clear_reflection_cache(self) -> None:
"""Clear the SQLAlchemy inspector's info_cache to release
cached column / relation data from prior schemas.

This prevents unbounded memory growth when ingesting many
schemas, since _get_schema_column_info, get_columns, and
_get_all_relation_info all use @reflection.cache.
"""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Stale docstring references removed @reflection.cache

The docstring for _clear_reflection_cache() states:

"since _get_schema_column_info, get_columns, and _get_all_relation_info all use @reflection.cache"

However, this PR specifically removes @reflection.cache from all three of those functions. The docstring should be updated to reflect that the info_cache is being cleared for the remaining @reflection.cache-decorated functions (e.g., get_table_comment, get_view_definition, _get_pg_column_info) and to clean up any residual cached data.

Suggested fix:

def _clear_reflection_cache(self) -> None:
    """Clear the SQLAlchemy inspector's info_cache to release
    cached column / relation data from prior schemas.

    This prevents unbounded memory growth when ingesting many
    schemas by clearing residual cached data from remaining
    @reflection.cache-decorated functions (e.g., get_table_comment).
    """

Was this helpful? React with 👍 / 👎

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Mar 2, 2026

🔍 CI failure analysis for 51b740d: SonarCloud Quality Gate failed due to insufficient code coverage (6.1%) on new code, below the required ≥20% threshold. Unit test and Playwright E2E failures from previous runs are unrelated to this PR's Redshift changes.

Issue

The [open-metadata-ingestion] SonarCloud Code Analysis check failed due to insufficient code coverage on new code.

Root Cause

Code coverage on new code is 6.1%, below the required threshold of ≥20%.

The PR introduces new methods and logic in the Redshift connector without corresponding test coverage:

  • New method: _clear_reflection_cache() in metadata.py (lines 209–221)
  • Modified method: get_partition_details() with new schema parameter logic (lines 193–207)
  • New per-schema caching logic in _get_schema_column_info() (lines 137–153)
  • New per-schema caching logic in _get_all_relation_info() (lines 393–417)

Details

To pass the Quality Gate, unit tests should be added covering:

  • Cache clearing functionality in _clear_reflection_cache()
  • Schema-specific partition loading in get_partition_details(schema_name)
  • Per-schema cache invalidation logic in both _get_schema_column_info() and _get_all_relation_info()
  • Edge cases: schema changes, cache hit/miss scenarios, error handling

View SonarCloud dashboard


Previously observed unrelated failures (may or may not still be present):

  • Unit Tests & Static Checks (3.11): 3 REST API connector tests failing with 'str' object has no attribute 'get' — unrelated to Redshift changes
  • playwright-ci-postgresql (3/6): Frontend E2E test timeouts/flakiness — unrelated to Redshift changes
Code Review ⚠️ Changes requested 0 resolved / 3 findings

Memory optimization approach is sound, but the SQL injection risk in get_partition_details and the table_name-ignoring cache key in _get_all_relation_info remain unaddressed from prior review.

⚠️ Bug: Cache key ignores table_name, may return stale results

📄 ingestion/src/metadata/ingestion/source/database/redshift/utils.py:393-396

The single-schema cache in _get_all_relation_info keys only on schema, but the function also filters by table_name (lines 400-403). If the function is first called with (schema="public", table_name="users"), the cached result contains only the "users" relation. A subsequent call with (schema="public", table_name="orders") would hit the cache and return the stale result (only "users"), missing "orders".

The previous @reflection.cache decorator keyed on all arguments (including table_name), so this is a behavioral regression.

The fix should include table_name in the cache key, or (better) only cache when table_name is None (the common bulk-fetch path), since caching single-table lookups provides little benefit.

Suggested fix
    # Single-schema cache: invalidate when schema or table_name changes
    table_name = kw.get("table_name", None)
    cached = getattr(self, "_relation_info_cache", None)
    if cached is not None and cached[0] == schema and cached[1] == table_name:
        return cached[2]
⚠️ Security: SQL injection via string formatting in get_partition_details

📄 ingestion/src/metadata/ingestion/source/database/redshift/metadata.py:201

The new schema_name parameter is concatenated directly into the SQL query via f-string interpolation:

query += f" AND \"schema\" = '{schema_name}'"

While the schema_name value typically comes from Redshift's own metadata catalog (and is thus unlikely to contain malicious content in normal usage), this is still a SQL injection vector. A schema name containing a single quote (e.g., foo'; DROP TABLE ...--) would break the query or allow injection.

Note that the existing codebase has similar patterns in _get_schema_column_info and _get_all_relation_info, so this is consistent with the current style. However, for new code, it's better to use parameterized queries.

The method query_table_names_and_types in the same file already demonstrates the correct pattern using sql.text() with named parameters (line 236-246). The same approach should be used here.

Suggested fix: Use a parameterized query approach. Since the base query is a raw string constant, wrap the full query with sql.text() and use a bind parameter:

query = REDSHIFT_PARTITION_DETAILS
if schema_name:
    query += " AND \"schema\" = :schema_name"
results = self.connection.execute(
    statement=sql.text(query),
    {"schema_name": schema_name}
).fetchall()
Suggested fix
            query = REDSHIFT_PARTITION_DETAILS
            if schema_name:
                query += " AND \"schema\" = :schema_name"
            results = self.connection.execute(
                statement=sql.text(query),
                {"schema_name": schema_name} if schema_name else {},
            ).fetchall()
💡 Quality: Stale docstring references removed @reflection.cache

📄 ingestion/src/metadata/ingestion/source/database/redshift/metadata.py:209-216

The docstring for _clear_reflection_cache() states:

"since _get_schema_column_info, get_columns, and _get_all_relation_info all use @reflection.cache"

However, this PR specifically removes @reflection.cache from all three of those functions. The docstring should be updated to reflect that the info_cache is being cleared for the remaining @reflection.cache-decorated functions (e.g., get_table_comment, get_view_definition, _get_pg_column_info) and to clean up any residual cached data.

Suggested fix
    def _clear_reflection_cache(self) -> None:
        """Clear the SQLAlchemy inspector's info_cache to release
        cached column / relation data from prior schemas.

        This prevents unbounded memory growth when ingesting many
        schemas by clearing residual cached data from remaining
        @reflection.cache-decorated functions (e.g., get_table_comment).
        """
🤖 Prompt for agents
Code Review: Memory optimization approach is sound, but the SQL injection risk in `get_partition_details` and the `table_name`-ignoring cache key in `_get_all_relation_info` remain unaddressed from prior review.

1. ⚠️ Bug: Cache key ignores table_name, may return stale results
   Files: ingestion/src/metadata/ingestion/source/database/redshift/utils.py:393-396

   The single-schema cache in `_get_all_relation_info` keys only on `schema`, but the function also filters by `table_name` (lines 400-403). If the function is first called with `(schema="public", table_name="users")`, the cached result contains only the "users" relation. A subsequent call with `(schema="public", table_name="orders")` would hit the cache and return the stale result (only "users"), missing "orders".
   
   The previous `@reflection.cache` decorator keyed on all arguments (including `table_name`), so this is a behavioral regression.
   
   The fix should include `table_name` in the cache key, or (better) only cache when `table_name` is None (the common bulk-fetch path), since caching single-table lookups provides little benefit.

   Suggested fix:
   # Single-schema cache: invalidate when schema or table_name changes
   table_name = kw.get("table_name", None)
   cached = getattr(self, "_relation_info_cache", None)
   if cached is not None and cached[0] == schema and cached[1] == table_name:
       return cached[2]

2. ⚠️ Security: SQL injection via string formatting in get_partition_details
   Files: ingestion/src/metadata/ingestion/source/database/redshift/metadata.py:201

   The new `schema_name` parameter is concatenated directly into the SQL query via f-string interpolation:
   
   ```python
   query += f" AND \"schema\" = '{schema_name}'"
   ```
   
   While the `schema_name` value typically comes from Redshift's own metadata catalog (and is thus unlikely to contain malicious content in normal usage), this is still a SQL injection vector. A schema name containing a single quote (e.g., `foo'; DROP TABLE ...--`) would break the query or allow injection.
   
   Note that the existing codebase has similar patterns in `_get_schema_column_info` and `_get_all_relation_info`, so this is consistent with the current style. However, for new code, it's better to use parameterized queries.
   
   The method `query_table_names_and_types` in the same file already demonstrates the correct pattern using `sql.text()` with named parameters (line 236-246). The same approach should be used here.
   
   **Suggested fix**: Use a parameterized query approach. Since the base query is a raw string constant, wrap the full query with `sql.text()` and use a bind parameter:
   
   ```python
   query = REDSHIFT_PARTITION_DETAILS
   if schema_name:
       query += " AND \"schema\" = :schema_name"
   results = self.connection.execute(
       statement=sql.text(query),
       {"schema_name": schema_name}
   ).fetchall()
   ```

   Suggested fix:
   query = REDSHIFT_PARTITION_DETAILS
   if schema_name:
       query += " AND \"schema\" = :schema_name"
   results = self.connection.execute(
       statement=sql.text(query),
       {"schema_name": schema_name} if schema_name else {},
   ).fetchall()

3. 💡 Quality: Stale docstring references removed @reflection.cache
   Files: ingestion/src/metadata/ingestion/source/database/redshift/metadata.py:209-216

   The docstring for `_clear_reflection_cache()` states:
   
   > "since _get_schema_column_info, get_columns, and _get_all_relation_info all use @reflection.cache"
   
   However, this PR specifically removes `@reflection.cache` from all three of those functions. The docstring should be updated to reflect that the info_cache is being cleared for the remaining `@reflection.cache`-decorated functions (e.g., `get_table_comment`, `get_view_definition`, `_get_pg_column_info`) and to clean up any residual cached data.

   Suggested fix:
   def _clear_reflection_cache(self) -> None:
       """Clear the SQLAlchemy inspector's info_cache to release
       cached column / relation data from prior schemas.
   
       This prevents unbounded memory growth when ingesting many
       schemas by clearing residual cached data from remaining
       @reflection.cache-decorated functions (e.g., get_table_comment).
       """

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Mar 4, 2026

Quality Gate Failed Quality Gate failed for 'open-metadata-ingestion'

Failed conditions
6.1% Coverage on New Code (required ≥ 20%)

See analysis details on SonarQube Cloud

@ulixius9 ulixius9 merged commit f16bcd3 into main Mar 4, 2026
43 of 46 checks passed
@ulixius9 ulixius9 deleted the redshift_mem branch March 4, 2026 06:23
ulixius9 added a commit that referenced this pull request Mar 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leak in Redshift ingestion

3 participants