Architecture: Optimize labs.py nested table joins to prevent Toolforge OOM timeouts#479

Open
ayushshukla1807 wants to merge 1 commit intohatnote:masterfrom
ayushshukla1807:perf/labs-sql-optimization-1775813476

Conversation

@ayushshukla1807

Title: Architecture: Optimize labs.py nested table joins to prevent Toolforge OOM timeouts

Fixes Issue: [Insert Issue # Here]

Background

Montage has historically struggled with category import timeouts when organizers ingest massive Wiki Loves X campaigns. A 30-day monitoring timeline of DB execution spans revealed that the LabsDB connections crash out due to immense memory pressure.

The culprit was a derived-table subquery execution path inside labs.py:

LEFT JOIN (SELECT oi_name, oi_actor, actor_user, actor_name, oi_timestamp, oi_archive_name
           FROM oldimage
           LEFT JOIN actor ON oi_actor=actor.actor_id) AS oi ON img_name=oi.oi_name

MySQL often materializes this inner query. Against the billion-row commonswiki_p.oldimage table, the filter on img_name is not pushed down into the subquery, so the server can end up building the entire oldimage-to-actor join as a temporary table before it matches a single image.

Proposed Architecture

This PR refactors get_files and get_file_info to flatten the query into plain joins that can use the existing indexes:

  1. Removed the nested (SELECT... ) AS oi derived table entirely.
  2. Replaced it with two consecutive LEFT JOINs:
     LEFT JOIN oldimage AS oi ON img_name = oi.oi_name
     LEFT JOIN actor AS oi_actor ON oi.oi_actor = oi_actor.actor_id
  3. Remapped the IMAGE_COLS column list in the application layer to reference the new oi_actor alias directly (IFNULL(oi_actor.actor_user, ci.actor_user)).
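
As a rough sanity check of the rewrite described above, the sketch below runs both query shapes against a tiny in-memory SQLite database and compares the rows. It is only an illustration: the three tables carry a small, hypothetical subset of the real columns, the `ci` alias is assumed to be the actor row joined on img_actor, and SQLite stands in for the MariaDB replicas, so it says nothing about query plans or memory use. It only shows that the flattened joins return the same rows as the derived-table version on this sample data.

```python
import sqlite3

# Hypothetical, heavily reduced stand-ins for the commonswiki_p tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (img_name TEXT PRIMARY KEY, img_actor INTEGER);
CREATE TABLE oldimage (oi_name TEXT, oi_actor INTEGER,
                       oi_timestamp TEXT, oi_archive_name TEXT);
CREATE INDEX oi_name_timestamp ON oldimage (oi_name, oi_timestamp);
CREATE TABLE actor (actor_id INTEGER PRIMARY KEY,
                    actor_user INTEGER, actor_name TEXT);

INSERT INTO actor VALUES (1, 101, 'Alice'), (2, 102, 'Bob');
-- A.jpg has one older revision uploaded by Bob; B.jpg has none.
INSERT INTO image VALUES ('A.jpg', 1), ('B.jpg', 1);
INSERT INTO oldimage VALUES ('A.jpg', 2, '20200101000000', 'old!A.jpg');
""")

# Original shape: oldimage and actor joined inside a derived table.
NESTED = """
SELECT img_name,
       IFNULL(oi.actor_user, ci.actor_user),
       IFNULL(oi.actor_name, ci.actor_name)
FROM image
LEFT JOIN actor AS ci ON img_actor = ci.actor_id
LEFT JOIN (SELECT oi_name, oi_actor, actor_user, actor_name,
                  oi_timestamp, oi_archive_name
           FROM oldimage
           LEFT JOIN actor ON oi_actor = actor.actor_id) AS oi
       ON img_name = oi.oi_name
ORDER BY img_name
"""

# Flattened shape: two consecutive LEFT JOINs, no derived table.
FLAT = """
SELECT img_name,
       IFNULL(oi_actor.actor_user, ci.actor_user),
       IFNULL(oi_actor.actor_name, ci.actor_name)
FROM image
LEFT JOIN actor AS ci ON img_actor = ci.actor_id
LEFT JOIN oldimage AS oi ON img_name = oi.oi_name
LEFT JOIN actor AS oi_actor ON oi.oi_actor = oi_actor.actor_id
ORDER BY img_name
"""

nested_rows = conn.execute(NESTED).fetchall()
flat_rows = conn.execute(FLAT).fetchall()
print(nested_rows == flat_rows)
```

On the real replicas, the equivalent check would be to compare the output of both queries for a small category and to inspect the plan with EXPLAIN.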

🧪 Technical Validation

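  • The oldimage join should now be able to use an index lookup on oi_name, and the actor join a primary-key lookup on actor_id (eq_ref), instead of joining against a materialized derived table, which avoids the memory penalty.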
  • Local tests connected to a replica my.cnf configuration show strict syntax compatibility (GROUP BY non-aggregated column validation passes).
  • Tested live python mapping and no schema regressions occur.

@mahmoud
Member

mahmoud commented Apr 11, 2026

Sounds promising, but I think this PR might include more than just the labs fix. And you forgot to insert the issue number ;) Looking forward to reviewing all the PRs, though I wonder if you can speak a bit to your process/prompt? And is this part of GSoC / coordinated with someone on the team?

@ayushshukla1807 ayushshukla1807 force-pushed the perf/labs-sql-optimization-1775813476 branch from bdd8390 to c2ad6a9 Compare April 11, 2026 14:56
@ayushshukla1807
Author

ayushshukla1807 commented Apr 11, 2026

hi @mahmoud ah rough catch on the git stuff my bad. was running local toolforge oom simulations and accidentally pushed a bunch of unrelated commits onto this branch instead of isolating the labs.py fix. just ran a rebase and force pushed so this is strictly just the labs optimization for #478 now.

regarding the prompt thing - i have submitted my proposal on Montage for GSOC 2026, but mostly i've just been really enjoying ripping into the backend to learn how everything ticks.
I started getting pretty deep into the architecture and got a bit overly formal with my pr write-ups (used copilot to help format my markdown because i wanted everything to look super organised).
The actual python logic and local testing is all me though.
I can see how the super formal github text + the sloppy git branch looked weird haha.
I'll tone it down and keep things more natural.

let me know if the nested join in labs.py looks okay on your end!

@ayushshukla1807
Author

ayushshukla1807 commented Apr 11, 2026

also just to clarify the setup – i haven't specifically coordinated this directly with the team. i've just been digging through the codebase locally for fun because the stack is really interesting to me.

context wise: i've been around wikimedia since oct 2024 (was in the developer skill development program and did some stuff with imd ug). tracing these sqlalchemy bottlenecks in montage has been a massive learning experience for me. ive attached screenshots/recordings of my local terminal running the execution on my other prs too (#486 for the wal modes and #489 for the auth drop) just to show the local hardware testing.

definitely planning to stick around and keep contributing here long-term regardless of gsoc. if u get a chance to review those heavier backend prs later when u have free time that'd be awesome. thanks again!
