Fix task enqueue race during job preparation #198
Conversation
Summary of Changes

Hello @NumericalAdvantage, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical race condition within the job processing system where tasks could be dispatched to workers before their associated parent job had completed its initial preparation phase. By strategically delaying the enqueuing of tasks until the job's status is confirmed as PENDING, the changes prevent premature task execution and eliminate assertion errors that previously occurred due to inconsistent job states. This ensures robust and reliable job processing, particularly for long-running preparation steps.

Highlights
Changelog
Activity
Pull Request Review: Fix task enqueue race during job preparation

Summary

This PR addresses a critical race condition (#197) in the Job/Task processing model where tasks could be enqueued and picked up by workers while the parent job was still in PREPARING status, causing assertion failures in the task processor.

✅ Positive Findings

Code Quality
Bug Fix Approach
Test Coverage
🔍 Areas for Improvement

1. Duplicate Database Queries

```python
task = ExtractionTask.objects.get(id=task_id)  # Re-fetching after processing
task.queued_job_id = None
task.save()
```

Issue: The task is re-fetched from the database after processing. This is inefficient and could cause issues if the task was modified during processing.

Recommendation: Consider accessing the task from the processor:

```python
processor = ExtractionTaskProcessor(task)
processor.start()
task.queued_job_id = None  # task is already in scope
task.save()
```

However, if the processor itself modifies and persists the task, the re-fetch may be intentional and should be documented.

2. Redundant Job Status Check in Extractions

```python
if job.queued_job_id is not None:
    job.queued_job_id = None
    job.save()
```

Issue: This check appears redundant.
Recommendation: Remove lines 105-107 unless there's a specific edge case this handles. If it's a defensive check, add a comment explaining when this would trigger.

3. Missing queued_job_id Cleanup in Task Processor

Question: Should the task processor also clear `queued_job_id` when a task finishes?

4. Potential N+1 Query Issue

```python
for task in tasks_to_enqueue:
    if not task.is_queued:
        task.delay()
```

Issue: Each `task.delay()` call is made individually, so N pending tasks mean N separate enqueue operations.

Recommendation: Consider if there's a way to batch the delay operations or use bulk updates if applicable. This may not be possible with Procrastinate's API, but worth investigating for performance optimization.

5. Test Assertion Could Be More Specific

In both test files, the final assertion:

```python
assert enqueue_job_statuses
assert all(status == ExtractionJob.Status.PENDING for status in enqueue_job_statuses)
```

Suggestion: Consider adding an assertion on the count to ensure the expected number of tasks were enqueued:

```python
assert len(enqueue_job_statuses) == 2  # We created 2 reports
assert all(status == ExtractionJob.Status.PENDING for status in enqueue_job_statuses)
```

🔒 Security Considerations
⚡ Performance Considerations
📊 Test Coverage
📝 Code Style & Conventions
✨ Summary

This is a solid fix for a critical race condition. The approach is sound, well-documented, and includes proper test coverage. The main suggestions are around minor efficiency improvements and code clarity, not correctness issues.

Recommendation: ✅ Approve with minor suggestions

The race condition fix is correct and the tests verify the invariant properly. The suggestions above are optimizations and clarifications that can be addressed in follow-up work or in this PR based on your preference. Great work identifying and fixing this subtle concurrency issue! 🎉
Code Review
This pull request addresses a race condition during job preparation by ensuring that tasks are only enqueued after the job preparation is complete and the job status is PENDING. The changes primarily involve refactoring the logic in process_extraction_job and process_subscription_job to delay task enqueuing. New regression tests have been added to verify this behavior.
My review focuses on the implementation details. I've found a potential improvement in radis/extractions/tasks.py to make the code more efficient and robust against race conditions when clearing the queued_job_id. The rest of the changes look solid and effectively fix the described issue.
```python
task = ExtractionTask.objects.get(id=task_id)
task.queued_job_id = None
task.save()
```
Re-fetching the task object and then calling save() without update_fields can be inefficient and may lead to race conditions. A concurrent update to another field on this task could be overwritten. It's better to update only the necessary field. The task object is already in memory from the start of the function.
```diff
-task = ExtractionTask.objects.get(id=task_id)
-task.queued_job_id = None
-task.save()
+task.queued_job_id = None
+task.save(update_fields=["queued_job_id"])
```
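As an illustration of the lost-update hazard this suggestion guards against, here is a minimal, framework-free sketch (plain dicts stand in for a database row — this is not Django or the radis code): a full-row save writes back every field from a stale snapshot and clobbers a concurrent change, while a field-scoped write touches only the changed field.

```python
# Simulated "database" row; a plain dict stands in for an ExtractionTask record.
db_row = {"queued_job_id": 7, "attempts": 0}

def full_save(snapshot):
    """Write back every field from an in-memory snapshot (like save())."""
    db_row.update(snapshot)

def save_update_fields(snapshot, fields):
    """Write back only the named fields (like save(update_fields=[...]))."""
    for f in fields:
        db_row[f] = snapshot[f]

# Process A loads the row, then a concurrent process bumps `attempts`.
stale = dict(db_row)
db_row["attempts"] = 1  # concurrent update by another worker

stale["queued_job_id"] = None
full_save(stale)
print(db_row["attempts"])  # 0 -- the concurrent increment was silently lost

# Reset and repeat with a field-scoped write:
db_row.update({"queued_job_id": 7, "attempts": 0})
stale = dict(db_row)
db_row["attempts"] = 1  # concurrent update by another worker

stale["queued_job_id"] = None
save_update_fields(stale, ["queued_job_id"])
print(db_row["attempts"])  # 1 -- the concurrent change survives
```

The same reasoning applies to the real model: `update_fields` narrows the `UPDATE` statement to the columns you actually changed.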
📝 Walkthrough

The pull request refactors task enqueuing behavior in both extraction and subscription job processing. Tasks are now enqueued only after the job transitions to PENDING state, rather than during task creation. The job's `queued_job_id` is cleared once preparation completes.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Job
    participant TaskQueue as Task Queue
    User->>Job: process_extraction_job / process_subscription_job
    activate Job
    alt existing tasks (resume/retry)
        Job->>Job: collect pending tasks
    else no existing tasks
        Job->>Job: set PREPARING state
        Job->>Job: create new tasks
    end
    Job->>Job: complete preparation
    Job->>Job: transition to PENDING
    Job->>Job: clear queued_job_id
    Job->>Job: collect pending tasks
    loop for each pending task not queued
        Job->>TaskQueue: task.delay()
    end
    deactivate Job
    TaskQueue->>User: tasks ready to process
```
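The sequenced flow can be condensed into a runnable sketch. All names here (`Job`, `Task`, `Status`, `process_job`) are simplified stand-ins for the real radis models, not the actual implementation; the point is the ordering: create tasks during PREPARING, flip to PENDING, and only then enqueue.

```python
from enum import Enum

class Status(Enum):
    PREPARING = "preparing"
    PENDING = "pending"

class Task:
    def __init__(self, job):
        self.job = job
        self.is_queued = False

    def delay(self):
        # The worker-side processor asserts the parent job is PENDING.
        assert self.job.status is Status.PENDING, "task enqueued during preparation"
        self.is_queued = True

class Job:
    def __init__(self):
        self.status = Status.PREPARING
        self.queued_job_id = 123  # pretend queue-system job id
        self.tasks = []

def process_job(job, num_tasks=3):
    # 1. Create all tasks while the job is PREPARING -- but do NOT enqueue yet.
    for _ in range(num_tasks):
        job.tasks.append(Task(job))
    # 2. Preparation is done: transition to PENDING and clear the queue handle.
    job.status = Status.PENDING
    job.queued_job_id = None
    # 3. Only now enqueue every pending, not-yet-queued task.
    for task in job.tasks:
        if not task.is_queued:
            task.delay()

job = Job()
process_job(job)
print(all(t.is_queued for t in job.tasks))  # True
```

If step 3 ran inside step 1 (the pre-fix behavior), the assertion in `delay()` would fire, which is exactly the race this PR removes.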
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
No actionable comments were generated in the recent review. 🎉
Code Review

Summary

This PR fixes a race condition where tasks could be enqueued while their parent job is still in the PREPARING state, causing assertion failures in AnalysisTaskProcessor.start() (radis/core/processors.py:46). The fix ensures tasks are only enqueued after the job transitions to PENDING.

✅ Strengths
🔍 Code Quality Observations

Positive
Minor Suggestions
🐛 Potential Issues

None identified - the logic appears sound and the fix addresses the root cause correctly.

🔒 Security Considerations

No security concerns. The changes are internal state management improvements.

⚡ Performance Considerations

Minor improvement: The fix batches task creation (during PREPARING) separately from task enqueuing (after PENDING), which is more efficient than interleaving them.

Database queries: The current implementation makes multiple queries in the enqueuing loop. Consider using bulk_update() if task modifications are needed, though the current approach is fine for typical workloads.

🧪 Test Coverage

Excellent: Both modules have comprehensive regression tests that:
Suggestion: Consider adding an integration test that verifies the end-to-end behavior with actual Procrastinate workers, though the current unit tests are sufficient for regression prevention.

📋 Additional Notes
Verdict

LGTM ✅ This is a well-designed fix with excellent test coverage. The minor suggestions above are optional improvements, not blockers. The PR successfully resolves issue #197 and prevents the assertion error by enforcing the proper state machine invariant.

Review generated by Claude Code
When a Job takes a long time to PREPARE, many of its tasks may already have been created and be ready to be picked up by the worker. In that case an assertion fails: it checks that the Job is in PENDING (before switching it to IN_PROGRESS) and not PREPARING. If the Job is still being prepared, an assertion error is raised because a task is being queued for the worker before the job it belongs to has been fully prepared. The fix is therefore to finish preparing all the tasks in the DB first, and only then enqueue them in the Procrastinate worker.
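A toy reproduction of the race described above (simplified stand-ins, not the real radis classes): letting the worker start a task while the job is still PREPARING trips the PENDING assertion, while deferring enqueues until preparation is finished does not.

```python
class Job:
    def __init__(self):
        self.status = "PREPARING"

def start_task(job):
    # Worker-side check, analogous to the processor's assertion:
    # the parent job must already be fully prepared.
    assert job.status == "PENDING", "job not fully prepared"
    job.status = "IN_PROGRESS"

# Buggy ordering: the worker picks a task up while the job is still PREPARING.
job = Job()
try:
    start_task(job)  # fires before preparation finished
except AssertionError as exc:
    print(f"AssertionError: {exc}")

# Fixed ordering: finish preparation first, then let the worker start tasks.
job = Job()
job.status = "PENDING"  # preparation complete
start_task(job)
print(job.status)  # IN_PROGRESS
```

The PR enforces the fixed ordering by moving all `task.delay()` calls after the PENDING transition.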
Summary by CodeRabbit
Bug Fixes
Tests