docs: add ADR for course authoring automatic migration (#251)
mariajgrimaldi merged 6 commits into openedx:main
Conversation
Thanks for the pull request, @BryanttV! This repository is currently maintained by … Once you've gone through the following steps, feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval: If you haven't already, check this list to see if your contribution needs to go through the product review process.

🔘 Provide context: To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can.

🔘 Get a green build: If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information? If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources.

When can I expect my changes to be merged? Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

💡 As a result, it may take up to several weeks or months to complete a review and merge your PR.
rodmgwgu
left a comment
Looking good, just some comments, thanks!
| the feature flag ``authz.enable_course_authoring`` changes state, but they deferred the definition of |
| the specific mechanism. This ADR addresses that gap. |
| The current manual approach presents the following risks: |
I want to make sure I understand the need for a different mechanism than a 1-time migration or a more controlled migration that on-demand job operators could use. I'm thinking of a mechanism like in forums V2, when the storage backend is changed:
- If the flag is on during initialization (tutor), then the migration is executed
- If not then not much happens
- If there's a failure during the migration, then automatically rollback
Why is this not an acceptable solution, given that it directly impacts operators and that they can manage this kind of controlled migration better than in a live environment?
For Verawood this will have to be a course-by-course or org-by-org decision, I think?
- For a lot of instances there are many people with django admin access who can make those changes, but few with the ability to run management commands
- Having the changes happen in sync with the course/org flag overrides prevents outages where permissions don't exist in the system being switched to until the management command is run, which might need to be done by a different team on a different schedule
- If the migration fails when the flag is being set, the rollback of both the flag and the migration would be automatic, so there wouldn't be a time where permissions simply don't work for a course or org
- If we want to have a separate path for instance-wide migration at init time that might make sense, but there is still a reasonable chance that some bugs will make it necessary for courses or orgs to override the instance default in which case we would still need this work
I think for Willow this flag would go away and we would rely solely on an instance-wide init.
The key difference is that the authz.enable_course_authoring flag is intended to be a runtime source of truth, not just an initialization setting.
Unlike one-time migrations (like Forums V2), this flag can change dynamically (course/org level) and is expected to immediately reflect the system’s state. Therefore, a manual or operator-driven migration is not sufficient, as it introduces inconsistencies, depends on manual coordination, and does not guarantee that the system aligns with the current flag state.
This is my current thinking, however, we could also consider not implementing automatic migration and relying only on management commands. What do you think?
These are strong arguments for me to understand the need for the automated sync. Can we add these to the ADR context? Thanks!
| migration_type = models.CharField(max_length=20)  # forward / rollback |
| scope_type = models.CharField(max_length=20)  # course / org |
| scope_key = models.CharField(max_length=255) |
| status = models.CharField(max_length=20)  # pending, running, completed, skipped |
If it failed, how would I get the exact log / error?
A failed state isn't included here because some roles might be added successfully while others fail. Instead, the count of successful and failed roles will be stored in the metadata field of each tracking record.
How big can the field be though?
Actually, I was thinking that this field would only have the success/failure count, something like this:

```json
{
  "successes": 10,
  "errors": 5
}
```

But maybe it would be useful to include the reason why X role failed? What do you think?

Regarding the size, I don't think there will be a large amount of data, considering that the migration is at the course or organization level.
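If failure reasons were included, the metadata might take a shape along these lines (purely illustrative; none of these keys beyond the counts are defined in the ADR yet):

```json
{
  "successes": 10,
  "errors": 2,
  "failures": [
    {"role": "staff", "reason": "IntegrityError: duplicate role assignment"},
    {"role": "instructor", "reason": "User not found"}
  ]
}
```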
This feature isn't for operators only, so users won't have access to the server logs to review the failures; it makes sense to show them here for debugging or reporting.
- Rename Django setting
- Rename tracking model
| ``authz.enable_course_authoring`` feature flag. The solution consists of: |
| #. Django signal handler to detect flag state changes. |
| #. Celery tasks to execute migrations asynchronously. |
This seems like it might be overkill. Even if an org has 5000 courses, with 5 people with roles in each we should be able to batch those into a few manageable queries synchronously and preserve the atomicity of the flag change. This would give us both automatic rollback on failure and transaction-level locking for concurrency protection
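As a rough illustration of that batching idea (plain-Python sketch; the helper name and batch size are made up, and in Django the loop would run inside `transaction.atomic()` to get the automatic rollback described above):

```python
def chunked(items, size):
    """Yield successive fixed-size batches from a list of role rows."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# 5000 courses with ~5 role rows each is only ~25000 rows, so a few
# bulk queries per batch can keep the whole change in one transaction.
rows = [f"role-{n}" for n in range(25000)]
batches = list(chunked(rows, 5000))
```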
I think this risk is related to this comment as well: https://github.com/openedx/openedx-authz/pull/251/changes#r3072539673
bmtcril
left a comment
I think overall this adds more complexity than necessary and introduces a bunch of edge cases due to that, which we can avoid by just doing everything transactionally. Even if a transaction fails due to something like a huge org with many permissions nothing is left broken and the management command path still works.
I think that calling it a data migration might be overselling things since it seems like the majority case for Verawood would be single courses changing at a time and moving generally less than 5 rows from one table to another.
If there is significant concern about the size of migrations having a performance impact we could consider only using this for courses and make org migrations require the manual path.
My bad, re-reading this I misread the feature flag in question and thought this was for my proposal (which was to do this on the course/org override flags, not the instance-wide flag). If this is only coming out of that original proposal, it's not what I was thinking of, and I would favor the management commands for that case. All of the other comments make more sense now, sorry about that.
| **Utility Function Updates** above). |
| All database operations within the migration itself execute inside an atomic transaction. |
| If the migration fails, no data is deleted from either system, preserving consistency. |
What's considered failing in this case? Failures during the migration, or at least the errors caught, are skipped, and the migration continues. What kind of failures are we expecting here? Also, I don't think this migration happens in an atomic transaction, because some records can be migrated while others can fail, so we might need more clarity on this.
You’re right that the current behavior is not strictly atomic. The migration functions operate in a best-effort manner, where individual record failures are caught and skipped, allowing the rest of the migration to continue. This means partial migrations are possible.
To make this clearer, I think we should explicitly define migration outcomes as:
- ``completed``: all records migrated successfully.
- ``partial_success``: some records failed, but the process completed.
- ``failed``: the migration could not complete due to a critical error.
This would better reflect the actual behavior and improve observability. What do you think?
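Those outcome semantics could be reduced to a small classification rule (hypothetical sketch; nothing like this exists in openedx-authz yet, and the names are illustrative):

```python
def classify_outcome(successes: int, errors: int, critical_error: bool = False) -> str:
    """Map per-record migration results onto the proposed statuses."""
    if critical_error:
        return "failed"  # the migration could not complete at all
    if errors == 0:
        return "completed"  # every record migrated successfully
    return "partial_success"  # best-effort run with some skipped records
```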
| recover. |
| - **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible |
| for very large organizations), the lock will expire and a new migration could start |
| concurrently for the same scope. |
What if the cache backend fails and two flag changes run at the same time? The migration itself is not atomic so we could end up in an invalid state
Mmm, great point. I was thinking, and to address this, I’m considering enforcing concurrency at the database level using a UniqueConstraint on (scope_type, scope_key) for active migrations (e.g., pending and running states).
This would ensure that only one migration per scope can be active at any time, regardless of cache availability, and would eliminate the risk of concurrent executions even in failure scenarios. What do you think?
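A conditional ``UniqueConstraint`` along these lines could enforce that rule at the database level (Django sketch; the fields follow the tracking model quoted earlier in the thread, and the constraint name is illustrative):

```python
from django.db import models


class AuthzCourseAuthoringMigrationRun(models.Model):
    migration_type = models.CharField(max_length=20)  # forward / rollback
    scope_type = models.CharField(max_length=20)  # course / org
    scope_key = models.CharField(max_length=255)
    status = models.CharField(max_length=20)  # pending, running, completed, skipped

    class Meta:
        constraints = [
            # At most one active (pending/running) migration per scope,
            # even if the cache backend is down and two flag changes race.
            models.UniqueConstraint(
                fields=["scope_type", "scope_key"],
                condition=models.Q(status__in=["pending", "running"]),
                name="unique_active_migration_per_scope",
            ),
        ]
```

One caveat worth checking: conditional unique constraints rely on partial-index support in the database backend (PostgreSQL and SQLite support them; MySQL does not).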
Thank you very much for your valuable feedback! To make sure everything is clear and that we're all on the same page: the automatic migration is intended to be performed only at the course and organization levels through the feature flag, as suggested in ADR 10 and 11. Automatic migration at the instance level will not be carried out due to performance concerns. That said, I believe the migration at the course and organization levels can be made much simpler than currently proposed, as Ty suggested, in addition to applying the changes discussed regarding the tracking model.

I'd love to hear your thoughts! @mariajgrimaldi @rodmgwgu @bmtcril
Force-pushed from d7ecaaf to c55800c
Hi there, I've applied the latest changes to the ADR based on your suggestions! @mariajgrimaldi @bmtcril @rodmgwgu
mariajgrimaldi
left a comment
I don't have any additional comments. Thank you so much for addressing all of my concerns! :)
LGTM!
(The only thing pending is a follow-up on the deprecation of the flag and the automated migration after Verawood, or when we think we're ready. Thanks.)
bmtcril
left a comment
This looks great, thank you! Just a comment and question.
| before the flag change is written. This approach violates ACID principles: at the moment |
| ``pre_save`` fires, the new flag value has not yet been committed to the database. If the |
| subsequent ``save()`` were to fail (e.g., a validation error, a database constraint |
| violation, or a network issue), the migration would have already run against a state that |
I think this is not true if they are operating within the same transaction
I updated that rejected alternative: 72c52c7
| created_at = models.DateTimeField(auto_now_add=True) |
| updated_at = models.DateTimeField(auto_now=True) |
| completed_at = models.DateTimeField(null=True, blank=True) |
| metadata = models.JSONField(default=dict) |
Would this include things like the traceback of any errors that occurred? I think that would probably be the most valuable thing to capture.
Yes, that's the idea. The Migration Outcome Semantics section mentions that the metadata field will contain those details.
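Capturing the traceback could be as simple as appending ``traceback.format_exc()`` to the JSON field per failed record (illustrative sketch; ``record_result`` and the key names are not from the ADR):

```python
import traceback


def record_result(metadata: dict, role: str, migrate) -> dict:
    """Run one role migration and record the outcome in the metadata dict,
    keeping the traceback of any error so non-operators can debug it from
    the admin rather than from server logs."""
    try:
        migrate(role)
        metadata["successes"] = metadata.get("successes", 0) + 1
    except Exception:
        metadata["errors"] = metadata.get("errors", 0) + 1
        metadata.setdefault("failures", []).append(
            {"role": role, "traceback": traceback.format_exc()}
        )
    return metadata
```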
bmtcril
left a comment
Thanks for all of the work on this!
| .. code:: python |
| class AuthzCourseAuthoringMigrationRun(models.Model): |
My understanding is that this model is scoped per course or organization, correct?
If the trigger comes from WaffleFlagOrgOverrideModel, how would the data be stored? Would it include all course IDs for the organization, or is storing just the organization info sufficient?
The idea is that the metadata field stores information about each migration that has been performed, including the scope (course_id).
Related issue: #223
Description
This PR adds ADR 0013 - Course Authoring Automatic Migration, proposing an automatic and asynchronous migration mechanism triggered by changes in the
``authz.enable_course_authoring`` feature flag.

Related PR
Merge checklist
Check off if complete or not applicable: