docs: add ADR for course authoring automatic migration (#251)
mariajgrimaldi merged 6 commits into openedx:main
Conversation
Thanks for the pull request, @BryanttV! This repository is currently maintained by … Once you've gone through the following steps, feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval: If you haven't already, check this list to see if your contribution needs to go through the product review process.

🔘 Provide context: To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can.

🔘 Get a green build: If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information? If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources.

When can I expect my changes to be merged? Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

💡 As a result, it may take up to several weeks or months to complete a review and merge your PR.
rodmgwgu
left a comment
Looking good, just some comments, thanks!
| the feature flag ``authz.enable_course_authoring`` changes state, but they deferred the definition of |
| the specific mechanism. This ADR addresses that gap. |
| The current manual approach presents the following risks: |
I want to make sure I understand the need for a different mechanism than a 1-time migration or a more controlled migration that on-demand job operators could use. I'm thinking of a mechanism like in forums V2, when the storage backend is changed:
- If the flag is on during initialization (tutor), then the migration is executed
- If not then not much happens
- If there's a failure during the migration, then automatically rollback
Why is this not an acceptable solution, given that it directly impacts operators and that they can manage this kind of controlled migration better than in a live environment?
For Verawood this will have to be a course-by-course or org-by-org decision, I think?
- For a lot of instances there are many people with django admin access who can make those changes, but few with the ability to run management commands
- Having the changes happen in sync with the course/org flag overrides prevents outages where permissions don't exist in the system being switched to until the management command is run, which might need to be done by a different team on a different schedule
- If the migration fails when the flag is being set, the rollback of both the flag and the migration would be automatic, so there wouldn't be a time where permissions simply don't work for a course or org
- If we want to have a separate path for instance-wide migration at init time that might make sense, but there is still a reasonable chance that some bugs will make it necessary for courses or orgs to override the instance default in which case we would still need this work
I think for Willow this flag would go away and we would rely solely on an instance-wide init.
The key difference is that the authz.enable_course_authoring flag is intended to be a runtime source of truth, not just an initialization setting.
Unlike one-time migrations (like Forums V2), this flag can change dynamically (course/org level) and is expected to immediately reflect the system’s state. Therefore, a manual or operator-driven migration is not sufficient, as it introduces inconsistencies, depends on manual coordination, and does not guarantee that the system aligns with the current flag state.
This is my current thinking, however, we could also consider not implementing automatic migration and relying only on management commands. What do you think?
These are strong arguments for me to understand the need for the automated sync. Can we add these to the ADR context? Thanks!
| migration_type = models.CharField(max_length=20)  # forward / rollback |
| scope_type = models.CharField(max_length=20)  # course / org |
| scope_key = models.CharField(max_length=255) |
| status = models.CharField(max_length=20)  # pending, running, completed, skipped |
If it failed, how would I get the exact log / error?
A failed state isn't included here because some roles might be added successfully while others fail. Instead, the count of successful and failed roles will be stored in the metadata field of each tracking record.
How big can the field be though?
Actually, I was thinking that this field would only have the success/failure count, something like this:

```json
{
  "successes": 10,
  "errors": 5
}
```

But maybe it would be useful to include the reason why X role failed? What do you think?

Regarding the size, I don't think there will be a large amount of data, considering that the migration is at the course or organization level.
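If failure reasons were included, the metadata might take a shape along these lines (purely illustrative; none of these keys beyond the counts are defined in the ADR yet):

```json
{
  "successes": 10,
  "errors": 2,
  "failures": [
    {"role": "staff", "reason": "IntegrityError: duplicate role assignment"},
    {"role": "instructor", "reason": "User not found"}
  ]
}
```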
This feature isn't for operators only, so users won't have access to the server logs to review the failures; it makes sense to show them here for debugging or reporting.
- Rename Django setting
- Rename tracking model
| ``authz.enable_course_authoring`` feature flag. The solution consists of: |
| #. Django signal handler to detect flag state changes. |
| #. Celery tasks to execute migrations asynchronously. |
This seems like it might be overkill. Even if an org has 5000 courses, with 5 people with roles in each we should be able to batch those into a few manageable queries synchronously and preserve the atomicity of the flag change. This would give us both automatic rollback on failure and transaction-level locking for concurrency protection
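As a rough illustration of that batching idea (plain-Python sketch; the helper name and batch size are made up, and in Django the loop would run inside `transaction.atomic()` to get the automatic rollback described above):

```python
def chunked(items, size):
    """Yield successive fixed-size batches from a list of role rows."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# 5000 courses with ~5 role rows each is only ~25000 rows, so a few
# bulk queries per batch can keep the whole change in one transaction.
rows = [f"role-{n}" for n in range(25000)]
batches = list(chunked(rows, 5000))
```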
I think this risk is related to this comment as well: https://github.com/openedx/openedx-authz/pull/251/changes#r3072539673
bmtcril
left a comment
I think overall this adds more complexity than necessary and introduces a bunch of edge cases due to that, which we can avoid by just doing everything transactionally. Even if a transaction fails due to something like a huge org with many permissions nothing is left broken and the management command path still works.
I think that calling it a data migration might be overselling things since it seems like the majority case for Verawood would be single courses changing at a time and moving generally less than 5 rows from one table to another.
If there is significant concern about the size of migrations having a performance impact we could consider only using this for courses and make org migrations require the manual path.
My bad, re-reading this I misread the feature flag in question and thought this was for my proposal (which was to do this on the course/org override flags, not the instance-wide flag). If this is only coming out of that original proposal, it's not what I was thinking of, and I would favor the management commands for that case. All of the other comments make more sense now, sorry about that.
| **Utility Function Updates** above). |
| All database operations within the migration itself execute inside an atomic transaction. |
| If the migration fails, no data is deleted from either system, preserving consistency. |
What's considered failing in this case? Failures during the migration, or at least the errors caught, are skipped, and the migration continues. What kind of failures are we expecting here? Also, I don't think this migration happens in an atomic transaction, because some records can be migrated while others can fail, so we might need more clarity on this.
You’re right that the current behavior is not strictly atomic. The migration functions operate in a best-effort manner, where individual record failures are caught and skipped, allowing the rest of the migration to continue. This means partial migrations are possible.
To make this clearer, I think we should explicitly define migration outcomes as:
- ``completed``: all records migrated successfully.
- ``partial_success``: some records failed, but the process completed.
- ``failed``: the migration could not complete due to a critical error.
This would better reflect the actual behavior and improve observability. What do you think?
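Those outcome semantics could be reduced to a small classification rule (hypothetical sketch; nothing like this exists in openedx-authz yet, and the names are illustrative):

```python
def classify_outcome(successes: int, errors: int, critical_error: bool = False) -> str:
    """Map per-record migration results onto the proposed statuses."""
    if critical_error:
        return "failed"  # the migration could not complete at all
    if errors == 0:
        return "completed"  # every record migrated successfully
    return "partial_success"  # best-effort run with some skipped records
```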
| recover. |
| - **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible |
| for very large organizations), the lock will expire and a new migration could start |
| concurrently for the same scope. |
What if the cache backend fails and two flag changes run at the same time? The migration itself is not atomic so we could end up in an invalid state
Mmm, great point. I was thinking, and to address this, I’m considering enforcing concurrency at the database level using a UniqueConstraint on (scope_type, scope_key) for active migrations (e.g., pending and running states).
This would ensure that only one migration per scope can be active at any time, regardless of cache availability, and would eliminate the risk of concurrent executions even in failure scenarios. What do you think?
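A conditional ``UniqueConstraint`` along these lines could enforce that rule at the database level (Django sketch; the fields follow the tracking model quoted earlier in the thread, and the constraint name is illustrative):

```python
from django.db import models


class AuthzCourseAuthoringMigrationRun(models.Model):
    migration_type = models.CharField(max_length=20)  # forward / rollback
    scope_type = models.CharField(max_length=20)  # course / org
    scope_key = models.CharField(max_length=255)
    status = models.CharField(max_length=20)  # pending, running, completed, skipped

    class Meta:
        constraints = [
            # At most one active (pending/running) migration per scope,
            # even if the cache backend is down and two flag changes race.
            models.UniqueConstraint(
                fields=["scope_type", "scope_key"],
                condition=models.Q(status__in=["pending", "running"]),
                name="unique_active_migration_per_scope",
            ),
        ]
```

One caveat worth checking: conditional unique constraints rely on partial-index support in the database backend (PostgreSQL and SQLite support them; MySQL does not).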
Thank you very much for your valuable feedback! To make sure everything is clear and that we're all on the same page: the automatic migration is intended to be performed only at the course and organization levels through the feature flag, as suggested in ADR 10 and 11. Automatic migration at the instance level will not be carried out due to performance concerns. That said, I believe the migration at the course and organization levels can be made much simpler than currently proposed, as Ty suggested, in addition to applying the changes discussed regarding the tracking model.

I'd love to hear your thoughts! @mariajgrimaldi @rodmgwgu @bmtcril
Force-pushed from d7ecaaf to c55800c
Hi there, I've applied the latest changes to the ADR based on your suggestions! @mariajgrimaldi @bmtcril @rodmgwgu
mariajgrimaldi
left a comment
I don't have any additional comments. Thank you so much for addressing all of my concerns! :)
LGTM!
(The only thing pending is a follow-up on the deprecation of the flag and the automated migration after Verawood, or when we think we're ready. Thanks.)
bmtcril
left a comment
This looks great, thank you! Just a comment and question.
| before the flag change is written. This approach violates ACID principles: at the moment |
| ``pre_save`` fires, the new flag value has not yet been committed to the database. If the |
| subsequent ``save()`` were to fail (e.g., a validation error, a database constraint |
| violation, or a network issue), the migration would have already run against a state that |
I think this is not true if they are operating within the same transaction
I updated that rejected alternative: 72c52c7
| created_at = models.DateTimeField(auto_now_add=True) |
| updated_at = models.DateTimeField(auto_now=True) |
| completed_at = models.DateTimeField(null=True, blank=True) |
| metadata = models.JSONField(default=dict) |
Would this include things like the traceback of any errors that occurred? I think that would probably be the most valuable thing to capture.
Yes, that's the idea. The Migration Outcome Semantics section mentions that the metadata field will contain those details.
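Capturing the traceback could be as simple as appending ``traceback.format_exc()`` to the JSON field per failed record (illustrative sketch; ``record_result`` and the key names are not from the ADR):

```python
import traceback


def record_result(metadata: dict, role: str, migrate) -> dict:
    """Run one role migration and record the outcome in the metadata dict,
    keeping the traceback of any error so non-operators can debug it from
    the admin rather than from server logs."""
    try:
        migrate(role)
        metadata["successes"] = metadata.get("successes", 0) + 1
    except Exception:
        metadata["errors"] = metadata.get("errors", 0) + 1
        metadata.setdefault("failures", []).append(
            {"role": role, "traceback": traceback.format_exc()}
        )
    return metadata
```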
bmtcril
left a comment
Thanks for all of the work on this!
| .. code:: python |
| class AuthzCourseAuthoringMigrationRun(models.Model): |
My understanding is that this model is scoped per course or organization, correct?
If the trigger comes from WaffleFlagOrgOverrideModel, how would the data be stored? Would it include all course IDs for the organization, or is storing just the organization info sufficient?
The idea is that the metadata field stores information about each migration that has been performed, including the scope (course_id).
Related issue: #223
Description
This PR adds ADR 0013 - Course Authoring Automatic Migration, proposing an automatic and asynchronous migration mechanism triggered by changes in the
``authz.enable_course_authoring`` feature flag.

Related PR
Merge checklist
Check off if complete or not applicable: