Fixes #27148: Eliminate N+1 is_paused queries in AirflowSource.get_pipelines_list#27149
Conversation
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
Eliminates an N+1 query pattern in AirflowSource.get_pipelines_list() by selecting DagModel.is_paused as part of the main paginated query instead of issuing a per-DAG lookup.
Changes:
- Airflow 2.x path: adds a
LEFT OUTER JOINtoDagModeland selectsDagModel.is_pausedin the main query. - Airflow 3.x path: adds
DagModel.is_pausedto the selected columns (no additional join). - Removes the per-row
DagModel.is_pausedscalar query and drops the unusedsqlalchemy.joinimport.
94ab23d to
dea826d
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
…Source.get_pipelines_list
dea826d to
8a5e277
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| # Airflow 2.x: fileloc is on SerializedDagModel. | ||
| # Always LEFT OUTER JOIN DagModel so we can select is_paused in the | ||
| # main query and avoid an extra DB round-trip per DAG (N+1). | ||
| session_query = ( | ||
| self.session.query( | ||
| SerializedDagModel.dag_id, | ||
| json_data_column, | ||
| SerializedDagModel.fileloc, | ||
| DagModel.is_paused, | ||
| ) | ||
| .join( | ||
| latest_dag_subquery, | ||
| and_( | ||
| SerializedDagModel.dag_id == latest_dag_subquery.c.dag_id, | ||
| timestamp_column == latest_dag_subquery.c.max_timestamp, | ||
| ), | ||
| ) | ||
| .outerjoin( | ||
| DagModel, | ||
| SerializedDagModel.dag_id == DagModel.dag_id, | ||
| ) |
There was a problem hiding this comment.
The Airflow 2.x query now always uses a LEFT OUTER JOIN to DagModel, even when includeUnDeployedPipelines is false. Since you then filter on DagModel.is_paused, the outer join becomes effectively an inner join anyway; using an explicit inner join in the includeUnDeployedPipelines=false path would be simpler and may yield a better query plan.
| # Build a mock that chains through any SQLAlchemy query method and returns | ||
| # our fake rows on the first .all() call, then [] to stop pagination. | ||
| mock_q = MagicMock() | ||
| for method in ("join", "outerjoin", "filter", "order_by", "limit", "offset", "group_by"): | ||
| getattr(mock_q, method).return_value = mock_q | ||
| mock_q.subquery.return_value = MagicMock() | ||
| mock_q.all.side_effect = [ | ||
| [active_row, inactive_row, null_row], | ||
| [], | ||
| ] | ||
|
|
||
| mock_session = MagicMock() | ||
| mock_session.query.return_value = mock_q | ||
|
|
There was a problem hiding this comment.
This test intends to ensure there is no per-DAG is_paused lookup, but it doesn't explicitly assert that no additional Session.query()/scalar() calls are made during row iteration. Consider adding assertions like mock_session.query.call_count == 2 (subquery + main query) and/or mock_q.scalar.assert_not_called() to make the regression protection explicit and more robust.
Fixes the Python Checkstyle CI failure by applying black's line-length formatting to the mock method tuple. Also addresses Copilot review feedback by adding explicit regression guards: - mock_q.scalar.assert_not_called() - proves no per-row scalar lookup - mock_session.query.call_count == 2 - proves only subquery + main query
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
Code Review ✅ Approved 2 resolved / 2 findingsEliminates N+1 is_paused queries in AirflowSource.get_pipelines_list by ensuring test compatibility with read-only properties and correctly handling undeployed DAGs during INNER JOIN operations. ✅ 2 resolved✅ Bug: Test sets read-only property; will raise AttributeError
✅ Edge Case: Airflow 3.x uses INNER JOIN; undeployed DAGs silently dropped
OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
Describe your changes:
Fixes #27148
AirflowSource.get_pipelines_listwas issuing one extraSELECT is_paused FROM dag WHERE dag_id = ?query for every DAG returned by the paginated main query. With N DAGs, the ingestion run performed N+1 DB round-trips where 1 would suffice.Root cause: The original main
SELECTdid not includeDagModel.is_paused. A separate per-row sub-query was introduced as a workaround.Fix:
includeUnDeployedPipelines=TrueSerializedDagModelqueried alone +DagModelper rowLEFT OUTER JOIN dagonDagModel+is_pausedin SELECTincludeUnDeployedPipelines=Falseselect_from(join(SDM, DagModel))for the filter +DagModelper rowouterjoinnow in base query;select_from(join(...))removed (redundant)DagModelalready joined forfileloc;is_pausedfetched per rowDagModel.is_pausedadded to SELECT — no extra join neededThe inner
try/exceptblock that firedself.session.query(DagModel.is_paused).filter(...).scalar()per row is deleted.is_pausedis now read fromserialized_dag[3](the new 4th column in all query paths).The unused
joinimport fromsqlalchemyis removed.How I tested:
ast.parseDagModelis in the FROM clause via the appropriate join type beforeDagModel.is_pausedis referenced in the filterType of change:
Checklist:
Fixes <issue-number>: Eliminate N+1 is_paused queries in AirflowSource.get_pipelines_list