
fix(ingestion): release Engine resources on database switch#27625

Merged
ulixius9 merged 3 commits into open-metadata:fix/snowflake-engine-release-on-db-switch from ulixius9:fix/snowflake-engine-release-on-db-switch
Apr 22, 2026
Conversation

@ulixius9
Member

Summary

  • CommonDbSourceService.set_inspector / close now close every checked-out connection in _connection_map and clear _inspector_map before disposing the engine. engine.dispose() alone does not release ConnectionFairy objects or free Inspector.info_cache; the prior code dropped the map references without closing them, leaving the old engine pinned via _ConnectionRecord → pool → engine. Across many databases this leaks tens of MB per switch (the info_cache of reflection results is large), and at interpreter shutdown the orphaned fairies trigger a RecursionError in _finalize_fairy when returning to a disposed pool's Condition lock.
  • MultiDBSource._execute_database_query and SnowflakeSource.get_database_names_raw now eagerly .fetchall() instead of streaming. Lazy iteration kept a cursor alive across set_inspector calls; since set_inspector disposes the engine the cursor is bound to, that cursor would have been invalidated once we started closing connections. Fetching the DB-name list up front (a small, finite set) closes the cursor before any engine disposal can happen.
  • scoped_session is rebound to the new engine on each set_inspector — otherwise sessionmaker.bind retains the disposed engine.
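A minimal sketch of this release pattern follows. It is not the PR's actual code: the class below is a hypothetical stand-in whose map names mirror the identifiers above, and only SQLAlchemy's public API (`Engine.connect`, `Engine.dispose`, `inspect`, `scoped_session`) is assumed.

```python
import gc
import weakref

from sqlalchemy import create_engine, inspect
from sqlalchemy.orm import scoped_session, sessionmaker


class SourceSketch:
    """Hypothetical stand-in for CommonDbSourceService's per-engine state."""

    def __init__(self):
        self.engine = create_engine("sqlite://")
        self._connection_map = {}  # thread id -> checked-out Connection
        self._inspector_map = {}   # thread id -> Inspector (holds info_cache)
        self.session = scoped_session(sessionmaker(bind=self.engine))

    def _release_engine(self):
        """Close fairies and clear caches BEFORE dispose: dispose alone
        does not close checked-out connections or free info_cache."""
        if self.engine is None:
            return
        for conn in self._connection_map.values():
            try:
                conn.close()
            except Exception:
                pass  # tolerate already-closed connections
        self._connection_map.clear()
        self._inspector_map.clear()
        if self.session is not None:
            self.session.remove()  # drop the thread-local Session
            self.session = None    # sessionmaker.bind would pin the engine
        self.engine.dispose()
        self.engine = None         # no dangling disposed-engine reference


src = SourceSketch()
src._connection_map[0] = src.engine.connect()  # simulate a worker's fairy
src._inspector_map[0] = inspect(src.engine)
engine_ref = weakref.ref(src.engine)

src._release_engine()
gc.collect()
print("engine collected:", engine_ref() is None)  # expect True: GC-eligible
```

This mirrors the acceptance test's shape: stash a checked-out connection, take a weakref to the engine, release, collect, and check the weakref is dead.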

Test plan

  • New unit tests in ingestion/tests/unit/topology/database/test_common_db_source.py and test_snowflake.py — 12 tests, all pass (39 total in the two files, 0 regressions).
  • Acceptance test test_old_engine_becomes_gc_eligible_after_release creates a real SQLAlchemy engine, stashes a fairy in _connection_map, takes a weakref, calls _release_engine, drops strong refs, runs gc.collect(), and asserts the weakref is dead. This fails against the prior kill_active_connections-only code path and passes with the fix — the direct regression guard.
  • _release_engine covered against a real in-memory SQLite engine: closes every map entry (including entries keyed by arbitrary worker thread ids), clears the inspector map, removes the session, disposes the pool, idempotent when engine is None, tolerates already-closed connections.
  • .fetchall() behavior validated: test_generator_survives_engine_dispose_mid_iteration advances the generator, disposes the engine, confirms remaining rows still yield — i.e. the cursor is no longer live at the point set_inspector disposes the engine.
  • Snowflake-side fetchall verified by mocking self.connection and asserting .fetchall() called exactly once with results yielded in order.
  • Recommended before merge: run an ingestion against a multi-DB Snowflake account and confirm (a) container RSS no longer grows linearly per DB switch, and (b) the Exception ignored in: <function _ConnectionRecord.checkout.<locals>.<lambda>> stderr warning at shutdown no longer appears.
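The eager-fetch behavior those tests pin down can be sketched as follows. This is a simplified stand-in for `_execute_database_query`, assuming plain SQLAlchemy; the table, names, and query are illustrative only.

```python
from sqlalchemy import create_engine, text


def execute_database_query(engine, query: str):
    """Yield one value per row, fetching eagerly so no cursor is
    still open when the caller later disposes/switches the engine."""
    with engine.connect() as conn:
        rows = conn.execute(text(query)).fetchall()  # cursor drained here
    # The connection is already back in the pool; iteration is safe.
    for row in rows:
        yield row[0]


engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE dbs (name TEXT)"))
    conn.execute(text("INSERT INTO dbs VALUES ('analytics'), ('raw')"))

gen = execute_database_query(engine, "SELECT name FROM dbs ORDER BY name")
first = next(gen)      # 'analytics'
engine.dispose()       # mid-iteration disposal, as on a DB switch
remaining = list(gen)  # still yields: nothing was streaming
print(first, remaining)
```

Because the `with` block drains the cursor before the first `yield`, disposing the engine between `next()` calls cannot invalidate anything — exactly the property the mid-iteration tests assert.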

Related symptom

A production ingestion against a 39-database Snowflake account exhibited container memory growing from ~1 GiB to ~4 GiB (pod memory limit, OOMKill territory) across a single run, followed by RecursionError: maximum recursion depth exceeded at interpreter shutdown in sqlalchemy/pool/base.py:_finalize_fairy and snowflake/connector/vendored/urllib3/connectionpool.py:_close_pool_connections. Both tracebacks bottom out in threading.Condition / RLock acquisition — the signature of weakref finalizers firing on pools that had been disposed while their fairies were still in flight. This PR eliminates the SQLAlchemy-side path by closing fairies explicitly before dispose.

🤖 Generated with Claude Code

Close fairies in _connection_map and clear _inspector_map before
engine.dispose() in CommonDbSourceService.set_inspector/close. Dispose
alone does not free Inspector.info_cache or release checked-out
ConnectionFairies, leaving the old engine GC-pinned across DB switches
and triggering _finalize_fairy RecursionError at interpreter shutdown.

Eagerly fetch multi-DB name queries (MultiDBSource._execute_database_query
and SnowflakeSource.get_database_names_raw) so the cursor closes before
the caller invokes set_inspector, which disposes the engine the cursor
was bound to.

Also rebind scoped_session to the new engine so it doesn't keep the
disposed one alive via sessionmaker.bind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ulixius9 ulixius9 requested a review from a team as a code owner April 22, 2026 10:15
Copilot AI review requested due to automatic review settings April 22, 2026 10:15
@github-actions github-actions bot added the Ingestion and safe to test labels Apr 22, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes SQLAlchemy engine/resource leaks when multi-database sources switch databases during ingestion by explicitly closing checked-out connections and clearing inspector/session state before disposing the old engine, and by ensuring database-name queries don’t leave cursors streaming across engine switches.

Changes:

  • Add CommonDbSourceService._release_engine() and use it from set_inspector() and close() to close all checked-out connections, clear inspector caches, remove scoped sessions, and dispose the engine.
  • Update MultiDBSource._execute_database_query() and SnowflakeSource.get_database_names_raw() to eagerly .fetchall() to avoid a live cursor across engine disposal.
  • Add unit tests covering engine release/GC eligibility and eager-fetch behavior for MultiDB/Snowflake paths.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • ingestion/src/metadata/ingestion/source/database/common_db_source.py: Introduces _release_engine() and uses it for DB switching/cleanup to prevent engine/inspector/session leaks.
  • ingestion/src/metadata/ingestion/source/database/multi_db_source.py: Eagerly fetches DB-name query results to avoid cursor lifetime issues across engine disposal.
  • ingestion/src/metadata/ingestion/source/database/snowflake/metadata.py: Applies eager fetch to Snowflake database-name enumeration.
  • ingestion/tests/unit/topology/database/test_common_db_source.py: Adds unit + acceptance coverage for engine release semantics and eager-fetch behavior.
  • ingestion/tests/unit/topology/database/test_snowflake.py: Adds Snowflake-specific tests to assert .fetchall() usage and yielded ordering.

Comment thread ingestion/tests/unit/topology/database/test_common_db_source.py
Comment thread ingestion/tests/unit/topology/database/test_common_db_source.py Outdated
Comment thread ingestion/tests/unit/topology/database/test_common_db_source.py
@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

- Set self.engine = None after dispose in _release_engine (gitar-bot):
  prevents close() from leaving a dangling disposed-engine reference
  that would produce a confusing pool error on accidental later access.

- _FakeSource now has close() and is wrapped in a fixture that cleans
  up its checked-out connection (Copilot #1): avoids resource warnings
  and an interfering fairy across test teardown.

- Rewrite test_generator_survives_engine_dispose_mid_iteration as
  test_generator_survives_connection_close_mid_iteration (Copilot #2):
  Engine.dispose() does not close checked-out connections, so the old
  test did not reproduce what _release_engine actually does. The real
  regression is the explicit conn.close() on the fairy in
  _connection_map before dispose. The new test closes the connection
  mid-iteration, which is what fetchall() needs to survive.

- Switch the query in _FakeSource.get_database_names_raw and the
  seeded INSERT assertions to the TEXT name column (Copilot #3):
  _execute_database_query is typed Iterable[str]; testing on integer
  ids obscured the actual contract.

- Update test_disposes_pool to assert surrogate.engine is None after
  release (follows from the new self.engine = None behavior) and
  verify the original pool's checkedout() is 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gitar-bot

gitar-bot Bot commented Apr 22, 2026

Code Review ✅ Approved · 1 resolved / 1 finding

Explicitly releasing engine resources during a database switch resolves potential reference leakage. No open issues found.

✅ 1 resolved
Edge Case: _release_engine leaves disposed engine ref; close() may confuse callers

📄 ingestion/src/metadata/ingestion/source/database/common_db_source.py:162-176
After _release_engine(), self.engine remains set to the disposed Engine object (not None). This is fine in set_inspector() where it's immediately reassigned, but in close() it leaves a dangling disposed-engine reference. If any code accidentally accesses self.connection or self.engine after close(), it will get a confusing error from the disposed pool rather than a clean AttributeError or None check failure. The method already sets self.session = None — doing the same for self.engine after dispose() would be consistent and defensive.


@ulixius9 ulixius9 changed the base branch from main to fix/snowflake-engine-release-on-db-switch April 22, 2026 10:36
@ulixius9 ulixius9 merged commit f9df440 into open-metadata:fix/snowflake-engine-release-on-db-switch Apr 22, 2026
31 checks passed
@sonarqubecloud

@github-actions
Contributor

🔴 Playwright Results — 1 failure, 10 flaky

✅ 3700 passed · ❌ 1 failed · 🟡 10 flaky · ⏭️ 89 skipped

Shard | Passed | Failed | Flaky | Skipped
✅ Shard 1 | 481 | 0 | 0 | 4
🟡 Shard 2 | 654 | 0 | 2 | 7
🟡 Shard 3 | 665 | 0 | 1 | 1
🟡 Shard 4 | 645 | 0 | 3 | 27
🔴 Shard 5 | 609 | 1 | 1 | 42
🟡 Shard 6 | 646 | 0 | 3 | 8

Genuine Failures (failed on all attempts)

Pages/Glossary.spec.ts › Add and Remove Assets (shard 5)
Test timeout of 180000ms exceeded.
🟡 10 flaky tests (passed on retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Database Schema (shard 2, 1 retry)
  • Features/Table.spec.ts › Tags term should be consistent for search (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › Enum (shard 4, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Table (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should follow Data Consumer role policies for ownerless searchIndex (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Tag.spec.ts › Verify Owner Add Delete (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

harshach pushed a commit that referenced this pull request Apr 22, 2026
…27627)

* fix(ingestion): release Engine resources on database switch (#27625)

* fix(ingestion): release engine resources on database switch

Close fairies in _connection_map and clear _inspector_map before
engine.dispose() in CommonDbSourceService.set_inspector/close. Dispose
alone does not free Inspector.info_cache or release checked-out
ConnectionFairies, leaving the old engine GC-pinned across DB switches
and triggering _finalize_fairy RecursionError at interpreter shutdown.

Eagerly fetch multi-DB name queries (MultiDBSource._execute_database_query
and SnowflakeSource.get_database_names_raw) so the cursor closes before
the caller invokes set_inspector, which disposes the engine the cursor
was bound to.

Also rebind scoped_session to the new engine so it doesn't keep the
disposed one alive via sessionmaker.bind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* py format

* fix(ingestion): address PR review feedback from gitar-bot and Copilot

- Set self.engine = None after dispose in _release_engine (gitar-bot):
  prevents close() from leaving a dangling disposed-engine reference
  that would produce a confusing pool error on accidental later access.

- _FakeSource now has close() and is wrapped in a fixture that cleans
  up its checked-out connection (Copilot #1): avoids resource warnings
  and an interfering fairy across test teardown.

- Rewrite test_generator_survives_engine_dispose_mid_iteration as
  test_generator_survives_connection_close_mid_iteration (Copilot #2):
  Engine.dispose() does not close checked-out connections, so the old
  test did not reproduce what _release_engine actually does. The real
  regression is the explicit conn.close() on the fairy in
  _connection_map before dispose. The new test closes the connection
  mid-iteration, which is what fetchall() needs to survive.

- Switch the query in _FakeSource.get_database_names_raw and the
  seeded INSERT assertions to the TEXT name column (Copilot #3):
  _execute_database_query is typed Iterable[str]; testing on integer
  ids obscured the actual contract.

- Update test_disposes_pool to assert surrogate.engine is None after
  release (follows from the new self.engine = None behavior) and
  verify the original pool's checkedout() is 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ingestion): keep connection_obj in sync with engine across DB switches

self.connection_obj is set once in __init__ to the initial engine and
never updated. After set_inspector rebuilds self.engine, connection_obj
still points at the disposed original engine — pinning its dialect and
compiled_cache alive for the source's lifetime.

Rebind connection_obj when creating the new engine in set_inspector,
and clear it in _release_engine so close() leaves nothing dangling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ulixius9 added a commit that referenced this pull request Apr 23, 2026
ulixius9 added a commit that referenced this pull request Apr 23, 2026
