Performance: Replace slow ORDER BY RAND() with COUNT+OFFSET #1065
audiodude merged 2 commits into openzim:main
I replaced ORDER BY RAND() with a COUNT + OFFSET approach. ORDER BY RAND() is inefficient because it forces the database to sort every matching row before picking one. The new method counts the matching rows, picks a random offset, and fetches the single row at that offset, so it skips the sort entirely and scales better on large datasets.
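To make the COUNT + OFFSET idea concrete, here is a minimal sketch. It is not the code from this PR: it uses Python's built-in sqlite3 module as a stand-in for the project's database, and the function name pick_random_row, the r_article column, and the demo rows are all hypothetical. Only the ratings table name and the r_project column come from the queries quoted in this thread.

```python
import random
import sqlite3


def pick_random_row(conn, project):
    """Return one random row for `project`, or None if it has no rows.

    Hypothetical helper: illustrates COUNT + OFFSET, not the wp1 implementation.
    """
    # Step 1: count the matching rows. No sort is involved.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM ratings WHERE r_project = ?", (project,)
    ).fetchone()
    if count == 0:
        return None
    # Step 2: pick the offset in application code, then fetch exactly one row.
    offset = random.randrange(count)
    return conn.execute(
        "SELECT * FROM ratings WHERE r_project = ? LIMIT 1 OFFSET ?",
        (project, offset),
    ).fetchone()


# Demo data (assumed schema; only `ratings` and `r_project` appear in the PR).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (r_project TEXT, r_article TEXT, "
    "PRIMARY KEY (r_project, r_article))"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("en.wikipedia.org", f"Article_{i}") for i in range(5)]
    + [("fr.wikipedia.org", "Article_fr")],
)
row = pick_random_row(conn, "en.wikipedia.org")
print(row)
```

The empty-result check matters in this sketch: random.randrange(0) raises ValueError, so a project with no ratings has to be handled before an offset is chosen.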
Hi @audiodude, @rao107, please review my PR.
Response to Index Concern
I've verified that the COUNT query does NOT do a table scan. Here's the proof:
Current Index Structure:
The EXPLAIN Analysis:
Query 1: COUNT with project filter
EXPLAIN SELECT COUNT(*) FROM ratings WHERE r_project = 'en.wikipedia.org';
Result:
Query 2: SELECT with OFFSET
EXPLAIN SELECT * FROM ratings WHERE r_project = 'en.wikipedia.org' LIMIT 1 OFFSET 100;
Result:
The PRIMARY KEY's first column (
Performance Comparison:
No additional indices are needed.
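The kind of check described above can be sketched in a self-contained way. This sketch makes two assumptions: it uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in (its output has a different shape from the EXPLAIN output referred to in this comment), and it uses a hypothetical two-column composite primary key whose first column is r_project. The helper name plan_details and the r_article column are illustrative only.

```python
import sqlite3

# Assumed schema: a composite PRIMARY KEY whose first column is r_project.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (r_project TEXT, r_article TEXT, "
    "PRIMARY KEY (r_project, r_article))"
)


def plan_details(conn, sql, params=()):
    """Return the human-readable detail strings of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of each plan row.
    return [row[-1] for row in rows]


count_plan = plan_details(
    conn,
    "SELECT COUNT(*) FROM ratings WHERE r_project = ?",
    ("en.wikipedia.org",),
)
offset_plan = plan_details(
    conn,
    "SELECT * FROM ratings WHERE r_project = ? LIMIT 1 OFFSET ?",
    ("en.wikipedia.org", 100),
)

# In SQLite, "SEARCH" means an index lookup; "SCAN" means a full pass.
for detail in count_plan + offset_plan:
    print(detail)
```

In this stand-in, both plans report an index SEARCH constrained on r_project, because the filter column is the leading column of the primary key. Against the real database the same question has to be answered with its own EXPLAIN, as the comment does.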
Update: No Migration Needed ✅
I've verified using
Evidence:
The COUNT query EXPLAIN output:
SELECT with OFFSET EXPLAIN output:
Key observations:
The PRIMARY KEY's first column (
Performance comparison:
No migration or new indices are required. The existing PRIMARY KEY provides optimal performance.
We can merge this, but I encourage you to review our contributing guide (https://github.com/openzim/wp1/blob/main/CONTRIBUTING.md), especially the sections on the use of LLMs/AI.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1065      +/-   ##
==========================================
- Coverage   92.78%   92.78%   -0.01%
==========================================
  Files          74       74
  Lines        4297     4308      +11
==========================================
+ Hits         3987     3997      +10
- Misses        310      311       +1
@audiodude |
@ARCoder181105 thanks for being transparent about your LLM use. However, it seems your responses/explanations to my questions were directly copy/pasted from an LLM. Please try to use your own reasoning and voice in the future.
Yeah, Understood👍👍 |
e8575de to 12a5b82
This PR addresses the performance concerns regarding ORDER BY RAND() discussed in #1057.
Fixes #1058
Context:
As noted in the issue, ORDER BY RAND() forces a sort of all matching rows, which is inefficient. While this hasn't been critical due to low traffic, it is a known SQL anti-pattern.
Solution:
I have replaced the query with a COUNT() + OFFSET approach. This utilizes the existing indexes to count and fetch a single row without sorting the entire result set, effectively resolving the performance bottleneck.
Verification:
Ran wp1/logic/rating_test.py locally and all 28 tests passed.