
fix(ingestion): release Engine resources on database switch (#27625)#27627

Merged
harshach merged 2 commits into main from
fix/snowflake-engine-release-on-db-switch
Apr 22, 2026

Conversation

@ulixius9
Member

Summary

  • CommonDbSourceService.set_inspector / close now close every checked-out connection in _connection_map and clear _inspector_map before disposing the engine. engine.dispose() alone does not release ConnectionFairy objects or free Inspector.info_cache; the prior code dropped the map references without closing them, leaving the old engine pinned via _ConnectionRecord → pool → engine. Across many databases this leaks tens of MB per switch (the info_cache of reflection results is large), and at interpreter shutdown the orphaned fairies trigger a RecursionError in _finalize_fairy while returning to a disposed pool's Condition lock.
  • MultiDBSource._execute_database_query and SnowflakeSource.get_database_names_raw now eagerly call .fetchall() instead of streaming. Lazy iteration kept a cursor alive across set_inspector calls; since set_inspector disposes the engine the cursor is bound to, that cursor would be invalidated once we start closing connections. Fetching the database-name list up front (a small, finite set) finishes the cursor before any close can happen.
  • scoped_session is rebound to the new engine on each set_inspector call; otherwise sessionmaker.bind retains the disposed engine.
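The release ordering in the first bullet can be sketched with stand-in classes. This is not the actual OpenMetadata implementation; FakeConnection, FakeEngine, and Source are illustrative, and only the attribute names mirror the PR description:

```python
class FakeConnection:
    """Stand-in for a checked-out SQLAlchemy ConnectionFairy."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class FakeEngine:
    """Stand-in for a SQLAlchemy Engine."""
    def __init__(self):
        self.disposed = False
    def dispose(self):
        self.disposed = True

class Source:
    def __init__(self):
        self.engine = FakeEngine()
        self._connection_map = {"thread-1": FakeConnection()}
        self._inspector_map = {"thread-1": object()}

    def _release_engine(self):
        if self.engine is None:
            return
        # Close fairies *before* dispose; dispose() alone would leave
        # them pinned to the old pool.
        for conn in self._connection_map.values():
            try:
                conn.close()
            except Exception:
                pass
        self._connection_map = {}
        self._inspector_map = {}  # drops Inspector.info_cache references
        self.engine.dispose()
        self.engine = None        # no dangling disposed-engine reference
```

The method is safe to call twice: the `self.engine is None` guard makes the second call a no-op.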

Test plan

  • New unit tests in ingestion/tests/unit/topology/database/test_common_db_source.py and test_snowflake.py — 12 tests, all pass (39 total in the two files, 0 regressions).
  • The acceptance test test_old_engine_becomes_gc_eligible_after_release creates a real SQLAlchemy engine, stashes a fairy in _connection_map, takes a weakref, calls _release_engine, drops the strong refs, runs gc.collect(), and asserts the weakref is dead. This fails against the prior kill_active_connections-only code path and passes with the fix, making it the direct regression guard.
  • _release_engine covered against a real in-memory SQLite engine: closes every map entry (including entries keyed by arbitrary worker thread ids), clears the inspector map, removes the session, disposes the pool, idempotent when engine is None, tolerates already-closed connections.
  • .fetchall() behavior validated: test_generator_survives_engine_dispose_mid_iteration advances the generator, disposes the engine, confirms remaining rows still yield — i.e. the cursor is no longer live at the point set_inspector disposes the engine.
  • Snowflake-side fetchall verified by mocking self.connection and asserting .fetchall() called exactly once with results yielded in order.
  • Recommended before merge: run an ingestion against a multi-DB Snowflake account and confirm (a) container RSS no longer grows linearly per DB switch, and (b) the Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda>> stderr warning at shutdown no longer appears.
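The GC-eligibility check in the acceptance test above follows a standard weakref pattern. Here is a sketch with a dummy engine rather than a real SQLAlchemy one (Engine, release, and the holder dict are illustrative, not the test's actual code):

```python
import gc
import weakref

class Engine:
    """Stand-in for a SQLAlchemy Engine; plain classes support weakrefs."""
    def dispose(self):
        pass

def release(holder):
    # Drop every strong reference the holder keeps to the old engine.
    holder["connections"].clear()
    holder["engine"].dispose()
    holder["engine"] = None

engine = Engine()
holder = {"engine": engine, "connections": {"thread-1": engine}}
ref = weakref.ref(engine)

release(holder)
del engine            # drop the last strong reference
gc.collect()
assert ref() is None  # the old engine is now GC-eligible
```

If release() only disposed the engine without clearing the connections map, the weakref would stay alive, which is exactly the failure mode the regression test guards against.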

Related symptom

A production ingestion against a 39-database Snowflake account exhibited container memory growing from ~1 GiB to ~4 GiB (pod memory limit, OOMKill territory) across a single run, followed by RecursionError: maximum recursion depth exceeded at interpreter shutdown in sqlalchemy/pool/base.py:_finalize_fairy and snowflake/connector/vendored/urllib3/connectionpool.py:_close_pool_connections. Both tracebacks bottom out in threading.Condition / RLock acquisition — the signature of weakref finalizers firing on pools that had been disposed while their fairies were still in flight. This PR eliminates the SQLAlchemy-side path by closing fairies explicitly before dispose.

🤖 Generated with Claude Code

* fix(ingestion): release engine resources on database switch

Close fairies in _connection_map and clear _inspector_map before
engine.dispose() in CommonDbSourceService.set_inspector/close. Dispose
alone does not free Inspector.info_cache or release checked-out
ConnectionFairies, leaving the old engine GC-pinned across DB switches
and triggering _finalize_fairy RecursionError at interpreter shutdown.

Eagerly fetch multi-DB name queries (MultiDBSource._execute_database_query
and SnowflakeSource.get_database_names_raw) so the cursor closes before
the caller invokes set_inspector, which disposes the engine the cursor
was bound to.
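The eager-fetch behavior can be illustrated with stdlib sqlite3, where closing the connection mid-iteration stands in for set_inspector disposing the engine. get_database_names and the dbs table are hypothetical analogues, not the PR's code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dbs (name TEXT)")
conn.executemany("INSERT INTO dbs VALUES (?)", [("db1",), ("db2",), ("db3",)])

def get_database_names(connection):
    # Eager: fetchall() drains and finishes the cursor immediately,
    # so later iteration never touches the connection again.
    rows = connection.execute("SELECT name FROM dbs ORDER BY name").fetchall()
    for (name,) in rows:
        yield name

gen = get_database_names(conn)
first = next(gen)   # fetchall() has already run by this point
conn.close()        # stands in for set_inspector disposing the engine
rest = list(gen)    # remaining names still yield from the buffered rows
assert [first] + rest == ["db1", "db2", "db3"]
```

With lazy iteration the same close() call would invalidate the live cursor and the remaining rows would be lost.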

Also rebind scoped_session to the new engine so it doesn't keep the
disposed one alive via sessionmaker.bind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* py format

* fix(ingestion): address PR review feedback from gitar-bot and Copilot

- Set self.engine = None after dispose in _release_engine (gitar-bot):
  prevents close() from leaving a dangling disposed-engine reference
  that would produce a confusing pool error on accidental later access.

- _FakeSource now has close() and is wrapped in a fixture that cleans
  up its checked-out connection (Copilot #1): avoids resource warnings
  and an interfering fairy across test teardown.

- Rewrite test_generator_survives_engine_dispose_mid_iteration as
  test_generator_survives_connection_close_mid_iteration (Copilot #2):
  Engine.dispose() does not close checked-out connections, so the old
  test did not reproduce what _release_engine actually does. The real
  regression is the explicit conn.close() on the fairy in
  _connection_map before dispose. The new test closes the connection
  mid-iteration, which is what fetchall() needs to survive.

- Switch the query in _FakeSource.get_database_names_raw and the
  seeded INSERT assertions to the TEXT name column (Copilot #3):
  _execute_database_query is typed Iterable[str]; testing on integer
  ids obscured the actual contract.

- Update test_disposes_pool to assert surrogate.engine is None after
  release (follows from the new self.engine = None behavior) and
  verify the original pool's checkedout() is 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 22, 2026 10:39
@ulixius9 ulixius9 requested a review from a team as a code owner April 22, 2026 10:39
@github-actions github-actions Bot added the Ingestion and safe to test (run secure GitHub workflows on PRs) labels Apr 22, 2026
Comment on lines +185 to +186
except Exception as exc: # pylint: disable=broad-except
logger.warning(f"Failed to dispose engine: {exc}")

💡 Quality: _release_engine silently swallows dispose failure details

In _release_engine, the engine.dispose() failure is logged at warning level with only the exception message (f"Failed to dispose engine: {exc}"), while all other failures in the method use logger.debug(…, exc_info=True) to capture the full traceback. For consistency and debuggability, consider adding exc_info=True to the dispose warning as well, since a dispose failure is the most important one to diagnose.

Suggested fix:

except Exception as exc:  # pylint: disable=broad-except
    logger.warning(f"Failed to dispose engine: {exc}", exc_info=True)


Contributor

Copilot AI left a comment


Pull request overview

This PR addresses SQLAlchemy resource leaks during multi-database ingestion by explicitly releasing checked-out connections/inspectors and rebinding sessions when switching databases, and by avoiding streaming DB-name cursors across engine disposal.

Changes:

  • Replace kill_active_connections() usage with a new CommonDbSourceService._release_engine() that closes all tracked connections, clears inspector/session state, and disposes the engine on database switch/close.
  • Make DB-name queries eager by calling .fetchall() in MultiDBSource._execute_database_query() and SnowflakeSource.get_database_names_raw() to avoid invalidated cursors during engine disposal.
  • Add unit/acceptance tests covering engine release behavior, GC eligibility, and eager-fetch semantics.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:
  • ingestion/src/metadata/ingestion/source/database/common_db_source.py: Adds _release_engine() and uses it from set_inspector()/close(); rebinds the session on engine switch.
  • ingestion/src/metadata/ingestion/source/database/multi_db_source.py: Eagerly buffers DB-name query results via .fetchall().
  • ingestion/src/metadata/ingestion/source/database/snowflake/metadata.py: Uses .fetchall() for Snowflake DB-name enumeration to prevent cursor invalidation.
  • ingestion/tests/unit/topology/database/test_common_db_source.py: Adds tests for _release_engine() cleanup, idempotency, GC eligibility, and eager-fetch generator safety.
  • ingestion/tests/unit/topology/database/test_snowflake.py: Adds tests asserting Snowflake DB-name enumeration calls .fetchall() once and yields in order.

Comment on lines +154 to +187
        self._release_engine()
        logger.info(f"Ingesting from database: {database_name}")

        new_service_connection = deepcopy(self.service_connection)
        new_service_connection.database = database_name
        self.engine = get_connection(new_service_connection)
        self.session = create_and_bind_thread_safe_session(self.engine)

        self._connection_map = {}  # Lazy init as well

    def _release_engine(self) -> None:
        # Close fairies first so _ConnectionRecord drops its pool reference;
        # dispose alone leaves them orphaned and causes _finalize_fairy
        # RecursionErrors at GC time. Clearing _inspector_map is what
        # actually frees Inspector.info_cache — dispose() does not.
        if getattr(self, "engine", None) is None:
            return
        for conn in self._connection_map.values():
            try:
                conn.close()
            except Exception:  # pylint: disable=broad-except
                logger.debug("Connection already closed", exc_info=True)
        self._connection_map = {}
        self._inspector_map = {}
        session = getattr(self, "session", None)
        if session is not None:
            try:
                session.remove()
            except Exception:  # pylint: disable=broad-except
                logger.debug("Session cleanup failed", exc_info=True)
        self.session = None
        try:
            self.engine.dispose()
        except Exception as exc:  # pylint: disable=broad-except
            logger.warning(f"Failed to dispose engine: {exc}")
        self.engine = None

Copilot AI Apr 22, 2026


_release_engine() disposes and nulls self.engine, but CommonDbSourceService.__init__ stores the initial engine in self.connection_obj and it is never updated/cleared on database switches. That strong reference can keep the old Engine (and its pool/Inspector caches) alive even after _release_engine(), undermining the intended leak fix. Consider clearing self.connection_obj in _release_engine() (or before/after dispose) and rebinding it to the new engine in set_inspector() to keep it consistent with self.engine.

pmbrull previously approved these changes Apr 22, 2026
@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

🟡 Playwright Results — all passed (19 flaky)

✅ 3692 passed · ❌ 0 failed · 🟡 19 flaky · ⏭️ 89 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 480 0 1 4
🟡 Shard 2 653 0 3 7
🟡 Shard 3 663 0 3 1
🟡 Shard 4 645 0 3 27
🟡 Shard 5 610 0 1 42
🟡 Shard 6 641 0 8 8
🟡 19 flaky test(s) (passed on retry)
  • Features/DataAssetRulesEnabled.spec.ts › should enforce single domain selection for glossary term when entity rules are enabled (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataQuality/TableLevelTests.spec.ts › Table Row Count To Be Between (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/IncidentManager.spec.ts › Complete Incident lifecycle with table owner (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/TestSuitePipelineRedeploy.spec.ts › Re-deploy all test-suite ingestion pipelines (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Set & Update table-cp, hyperlink-cp, string, integer, markdown, number, duration, email, enum, sqlQuery, timestamp, entityReference, entityReferenceList, timeInterval, time-cp, date-cp, dateTime-cp Custom Property (shard 4, 1 retry)
  • Pages/Entity.spec.ts › User as Owner with unsorted list (shard 4, 1 retry)
  • Pages/Glossary.spec.ts › Add and Remove Assets (shard 5, 1 retry)
  • Features/AutoPilot.spec.ts › Create Service and check the AutoPilot status (shard 6, 1 retry)
  • Pages/Glossary.spec.ts › Delete Glossary and Glossary Term using Delete Modal (shard 6, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Container (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Worksheet (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

…tches

self.connection_obj is set once in __init__ to the initial engine and
never updated. After set_inspector rebuilds self.engine, connection_obj
still points at the disposed original engine — pinning its dialect and
compiled_cache alive for the source's lifetime.

Rebind connection_obj when creating the new engine in set_inspector,
and clear it in _release_engine so close() leaves nothing dangling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gitar-bot

gitar-bot Bot commented Apr 22, 2026

Code Review 👍 Approved with suggestions 0 resolved / 1 findings

Releases Engine resources during database switches to prevent memory leaks. Consider logging the underlying exception in _release_engine instead of swallowing dispose failures.

💡 Quality: _release_engine silently swallows dispose failure details

📄 ingestion/src/metadata/ingestion/source/database/common_db_source.py:185-186

In _release_engine, the engine.dispose() failure is logged at warning level with only the exception message (f"Failed to dispose engine: {exc}"), while all other failures in the method use logger.debug(…, exc_info=True) to capture the full traceback. For consistency and debuggability, consider adding exc_info=True to the dispose warning as well, since a dispose failure is the most important one to diagnose.

Suggested fix
except Exception as exc:  # pylint: disable=broad-except
    logger.warning(f"Failed to dispose engine: {exc}", exc_info=True)

@harshach harshach merged commit 8571a70 into main Apr 22, 2026
48 checks passed
@harshach harshach deleted the fix/snowflake-engine-release-on-db-switch branch April 22, 2026 18:41
ulixius9 added a commit that referenced this pull request Apr 23, 2026
ulixius9 added a commit that referenced this pull request Apr 23, 2026