docs: add ADR for course authoring automatic migration #251

Merged
mariajgrimaldi merged 6 commits into openedx:main from eduNEXT:bav/adr-0013-automatic-migration
Apr 16, 2026

@BryanttV
Contributor

@BryanttV BryanttV commented Apr 9, 2026

Related issue: #223

Description

This PR adds ADR 0013 - Course Authoring Automatic Migration, proposing an automatic and asynchronous migration mechanism triggered by changes in the authz.enable_course_authoring feature flag.

Related PR

Merge checklist

Check off if complete or not applicable:

  • Version bumped
  • Changelog record added
  • Documentation updated (not only docstrings)
  • Fixup commits are squashed away
  • Unit tests added/updated
  • Manual testing instructions provided
  • Noted any: Concerns, dependencies, migration issues, deadlines, tickets

@openedx-webhooks openedx-webhooks added the labels open-source-contribution (PR author is not from Axim or 2U) and core contributor (PR author is a Core Contributor, who may or may not have write access to this repo) Apr 9, 2026
@openedx-webhooks

openedx-webhooks commented Apr 9, 2026

Thanks for the pull request, @BryanttV!

This repository is currently maintained by @openedx/committers-openedx-authz.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Details
Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

@github-project-automation github-project-automation bot moved this to Needs Triage in Contributions Apr 9, 2026
@BryanttV BryanttV changed the title from "docs: add course authoring automatic migration adr" to "docs: add ADR for course authoring automatic migration" Apr 9, 2026
@BryanttV BryanttV marked this pull request as ready for review April 9, 2026 18:39
@mphilbrick211 mphilbrick211 moved this from Needs Triage to Ready for Review in Contributions Apr 9, 2026
Contributor

@rodmgwgu rodmgwgu left a comment

Looking good, just some comments, thanks!

Comment thread docs/decisions/0013-course-authoring-automatic-migration.rst Outdated
the feature flag ``authz.enable_course_authoring`` changes state, but they deferred the definition of
the specific mechanism. This ADR addresses that gap.

The current manual approach presents the following risks:
Member

I want to make sure I understand the need for a mechanism other than a one-time migration or a more controlled, on-demand migration job that operators could run. I'm thinking of a mechanism like the one in forums V2, when the storage backend is changed:

  1. If the flag is on during initialization (tutor), then the migration is executed
  2. If not then not much happens
  3. If there's a failure during the migration, then automatically rollback

Why is this not an acceptable solution, given that it directly impacts operators and that they can manage this kind of controlled migration better than in a live environment?

Contributor

For Verawood this will have to be a course-by-course or org-by-org decision, I think?

  • For a lot of instances there are many people with django admin access who can make those changes, but few with the ability to run management commands
  • Having the changes happen in sync with the course/org flag overrides prevents outages where permissions don't exist in the system being switched to until the management command is run, which might need to be done by a different team on a different schedule
  • If the migration fails when the flag is being set the rollback of both the flag and migration would be automatic so there wouldn't be a time where permissions simply don't work for a course or org
  • If we want to have a separate path for instance-wide migration at init time that might make sense, but there is still a reasonable chance that some bugs will make it necessary for courses or orgs to override the instance default in which case we would still need this work

I think for Willow this flag would go away and we would rely solely on an instance-wide init.
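
As a rough illustration of the third point above (the flag change and the role migration succeeding or rolling back together), here is a minimal sketch using an in-memory SQLite database. It is not the ADR's implementation: the table names (`flag_override`, `legacy_roles`, `authz_roles`) and the function `enable_with_migration` are hypothetical and exist only for this example.

```python
import sqlite3


def make_db():
    """Create a throwaway database with hypothetical tables for the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE flag_override (scope_key TEXT PRIMARY KEY, enabled INTEGER NOT NULL);
        CREATE TABLE legacy_roles (scope_key TEXT, username TEXT, role TEXT);
        CREATE TABLE authz_roles (scope_key TEXT, username TEXT, role TEXT NOT NULL);
        """
    )
    return conn


def enable_with_migration(conn, scope_key):
    """Flip the flag and copy roles in one transaction.

    Returns True if both steps committed, False if both were rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT OR REPLACE INTO flag_override (scope_key, enabled) VALUES (?, 1)",
                (scope_key,),
            )
            rows = conn.execute(
                "SELECT scope_key, username, role FROM legacy_roles WHERE scope_key = ?",
                (scope_key,),
            ).fetchall()
            # A bad row (here: a NULL role) aborts the whole transaction,
            # including the flag change above.
            conn.executemany("INSERT INTO authz_roles VALUES (?, ?, ?)", rows)
        return True
    except sqlite3.Error:
        return False
```

With this shape there is no window in which the flag is on but the permissions are missing, which is the outage scenario described in the second point.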

Contributor Author

The key difference is that the authz.enable_course_authoring flag is intended to be a runtime source of truth, not just an initialization setting.

Unlike one-time migrations (like Forums V2), this flag can change dynamically (course/org level) and is expected to immediately reflect the system’s state. Therefore, a manual or operator-driven migration is not sufficient, as it introduces inconsistencies, depends on manual coordination, and does not guarantee that the system aligns with the current flag state.

This is my current thinking, however, we could also consider not implementing automatic migration and relying only on management commands. What do you think?

Member

These are strong arguments for me to understand the need for the automated sync. Can we add these to the ADR context? Thanks!

Comment thread docs/decisions/0013-course-authoring-automatic-migration.rst Outdated
migration_type = models.CharField(max_length=20) # forward / rollback
scope_type = models.CharField(max_length=20) # course / org
scope_key = models.CharField(max_length=255)
status = models.CharField(max_length=20) # pending, running, completed, skipped
Member

If it failed how would I get the exact log / error?

Contributor Author

A failed state isn't included here because some roles might be added successfully while others fail. Instead, the count of successful and failed roles will be stored in the metadata field of each tracking record.

Member

How big can the field be though?

Contributor Author

Actually, I was thinking that this field would only have the success/failure count, something like this:

{
  "successes": 10,
  "errors": 5
}

But maybe it would be useful to include the reason why a given role failed? What do you think?

Regarding the size, I don't think there will be a large amount of data, considering that the migration is at the course or organization level.

Member

This feature isn't for operators only, so users won't have access to the server logs to review the failures, so it makes sense to show them here for debugging or reporting.
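
To make the outcome of this thread concrete, here is one possible shape for the metadata payload: the `successes` and `errors` counts from the example above, plus a bounded list of per-role failure reasons. This is only a sketch. The `error_details` key, the `build_metadata` helper, and both size limits are assumptions made for illustration, not part of the ADR.

```python
import json

# Hypothetical limits that keep the JSON payload small regardless of scope size.
MAX_ERROR_DETAILS = 50
MAX_REASON_LENGTH = 500


def build_metadata(results):
    """Summarize per-role results into a JSON-serializable dict.

    `results` is an iterable of (role_identifier, error) pairs, where
    `error` is None for a successfully migrated role.
    """
    successes = 0
    failures = []
    for role_identifier, error in results:
        if error is None:
            successes += 1
        else:
            failures.append(
                {"role": role_identifier, "reason": str(error)[:MAX_REASON_LENGTH]}
            )
    return {
        "successes": successes,
        "errors": len(failures),  # total failures, even if details are capped
        "error_details": failures[:MAX_ERROR_DETAILS],
    }


if __name__ == "__main__":
    sample = [("staff:ana", None), ("instructor:bob", ValueError("unknown role"))]
    print(json.dumps(build_metadata(sample), indent=2))
```

Capping both the number of entries and the length of each reason is one way to answer the size question raised earlier while still surfacing failures to users who cannot read server logs.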

Comment thread docs/decisions/0013-course-authoring-automatic-migration.rst Outdated
- Rename Django setting
- Rename tracking model
``authz.enable_course_authoring`` feature flag. The solution consists of:

#. Django signal handler to detect flag state changes.
#. Celery tasks to execute migrations asynchronously.
Contributor

@bmtcril bmtcril Apr 10, 2026

This seems like it might be overkill. Even if an org has 5000 courses, with 5 people with roles in each we should be able to batch those into a few manageable queries synchronously and preserve the atomicity of the flag change. This would give us both automatic rollback on failure and transaction-level locking for concurrency protection

Member

I think this risk is also related to this comment as well: https://github.com/openedx/openedx-authz/pull/251/changes#r3072539673

Contributor

@bmtcril bmtcril left a comment

I think overall this adds more complexity than necessary and introduces a bunch of edge cases due to that, which we can avoid by just doing everything transactionally. Even if a transaction fails due to something like a huge org with many permissions nothing is left broken and the management command path still works.

I think that calling it a data migration might be overselling things since it seems like the majority case for Verawood would be single courses changing at a time and moving generally less than 5 rows from one table to another.

If there is significant concern about the size of migrations having a performance impact we could consider only using this for courses and make org migrations require the manual path.

@bmtcril
Contributor

bmtcril commented Apr 10, 2026

My bad, re-reading this I misread the feature flag in question and thought this was for my proposal (which was to do this on the course/org override flags, not the instance-wide flag). If this is only coming out of that original proposal, it's not what I was thinking of and I would favor the management commands for that case.

All of the other comments make more sense now, sorry about that.

**Utility Function Updates** above).

All database operations within the migration itself execute inside an atomic transaction.
If the migration fails, no data is deleted from either system, preserving consistency.
Member

What's considered failing in this case? Failures during the migration, or at least the errors caught, are skipped, and the migration continues. What kind of failures are we expecting here? Also, I don't think this migration happens in an atomic transaction, because some records can be migrated while others fail, so we might need more clarity on this.

Contributor Author

You’re right that the current behavior is not strictly atomic. The migration functions operate in a best-effort manner, where individual record failures are caught and skipped, allowing the rest of the migration to continue. This means partial migrations are possible.

To make this clearer, I think we should explicitly define migration outcomes as:

  • completed: all records migrated successfully.
  • partial_success: some records failed, but the process completed.
  • failed: the migration could not complete due to a critical error.

This would better reflect the actual behavior and improve observability. What do you think?
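
The three outcomes listed above could be derived from the per-record counts roughly as follows. This is a sketch under assumptions: the function name and signature are hypothetical, and it treats "every record failed but the process finished" as `partial_success`, which the thread does not explicitly settle.

```python
def classify_outcome(successes, errors, critical_error=None):
    """Map migration results to one of the proposed status values.

    `successes` is informational only; the status depends on whether a
    critical error stopped the run and whether any record failed.
    """
    if critical_error is not None:
        # The run could not complete at all.
        return "failed"
    if errors > 0:
        # The run finished, but at least one record was skipped.
        return "partial_success"
    return "completed"
```

For example, `classify_outcome(10, 5)` yields `partial_success`, matching the counts used in the earlier metadata example.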

Member

That works!

recover.
- **Lock TTL edge cases**: if a migration takes longer than 1 hour (unlikely but possible
for very large organizations), the lock will expire and a new migration for the same scope
could start concurrently.
Member

What if the cache backend fails and two flag changes run at the same time? The migration itself is not atomic so we could end up in an invalid state

Contributor Author

Mmm, great point. To address this, I'm considering enforcing concurrency at the database level using a UniqueConstraint on (scope_type, scope_key) for active migrations (e.g., pending and running states).

This would ensure that only one migration per scope can be active at any time, regardless of cache availability, and would eliminate the risk of concurrent executions even in failure scenarios. What do you think?
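
The idea of "one active migration per scope, enforced by the database" can be sketched with a partial unique index. The example below uses SQLite from the standard library so that it runs on its own. The table name `migration_run`, the index name, and the helpers `try_start` and `finish` are hypothetical. In Django the equivalent would presumably be a `models.UniqueConstraint` with a `condition`, but that is worth verifying against the target database, since Django documents that conditional indexes are not supported on MySQL and MariaDB.

```python
import sqlite3


def make_db():
    """Create a hypothetical tracking table with a partial unique index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE migration_run (
            id INTEGER PRIMARY KEY,
            scope_type TEXT NOT NULL,
            scope_key TEXT NOT NULL,
            status TEXT NOT NULL
        );
        -- Uniqueness applies only to rows that are still active.
        CREATE UNIQUE INDEX one_active_run_per_scope
            ON migration_run (scope_type, scope_key)
            WHERE status IN ('pending', 'running');
        """
    )
    return conn


def try_start(conn, scope_type, scope_key):
    """Register a pending run; return its id, or None if one is already active."""
    try:
        with conn:
            cursor = conn.execute(
                "INSERT INTO migration_run (scope_type, scope_key, status) "
                "VALUES (?, ?, 'pending')",
                (scope_type, scope_key),
            )
        return cursor.lastrowid
    except sqlite3.IntegrityError:
        return None


def finish(conn, run_id, status="completed"):
    """Move a run to a terminal status, releasing the scope for future runs."""
    with conn:
        conn.execute(
            "UPDATE migration_run SET status = ? WHERE id = ?", (status, run_id)
        )
```

Because the guarantee lives in the database rather than in a cache lock, it does not depend on cache availability or on a lock TTL.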

@mariajgrimaldi mariajgrimaldi self-requested a review April 13, 2026 11:07
Member

@mariajgrimaldi mariajgrimaldi left a comment

I don't have any more comments besides the ones I already submitted. My main concern was understanding the need for this automated sync, but now that I do, I don't have any more blocking comments!

Thank you so much for all the clarifications!

@BryanttV
Contributor Author

Thank you very much for your valuable feedback! To make sure everything is clear and that we’re all on the same page:

The automatic migration is intended to be performed only at the course and organization levels through the feature flag, as suggested in ADR 10 and 11. Automatic migration at the instance level will not be carried out due to performance concerns.

That said, I believe the migration at the course and organization levels can be made much simpler than currently proposed, as Ty suggested:

  • Synchronous execution, considering that the data volume will not be large.
  • Removal of cache locking, using instead a Unique Constraint and atomic database transactions.

This is in addition to applying the changes discussed for the tracking model.

I’d love to hear your thoughts! @mariajgrimaldi @rodmgwgu @bmtcril

@BryanttV BryanttV force-pushed the bav/adr-0013-automatic-migration branch from d7ecaaf to c55800c Compare April 13, 2026 21:08
@BryanttV
Contributor Author

Hi there, I've applied the latest changes to the ADR based on your suggestions! @mariajgrimaldi @bmtcril @rodmgwgu

Member

@mariajgrimaldi mariajgrimaldi left a comment

I don't have any additional comments. Thank you so much for addressing all of my concerns! :)

LGTM!

(The only thing pending is a follow-up on the deprecation of the flag and the automated migration after Verawood, or when we think we're ready. Thanks.)

Contributor

@bmtcril bmtcril left a comment

This looks great, thank you! Just a comment and question.

before the flag change is written. This approach violates ACID principles: at the moment
``pre_save`` fires, the new flag value has not yet been committed to the database. If the
subsequent ``save()`` were to fail (e.g., a validation error, a database constraint
violation, or a network issue), the migration would have already run against a state that
Contributor

I think this is not true if they are operating within the same transaction

Contributor Author

I updated that rejected alternative: 72c52c7

created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
completed_at = models.DateTimeField(null=True, blank=True)
metadata = models.JSONField(default=dict)
Contributor

Would this include things like the traceback of any errors that occurred? I think that would probably be the most valuable thing to capture.

Contributor Author

Yes, that's the idea. The Migration Outcome Semantics section mentions that the metadata field will contain those details.

@BryanttV BryanttV requested a review from bmtcril April 14, 2026 18:50
Contributor

@bmtcril bmtcril left a comment

Thanks for all of the work on this!

Contributor

@rodmgwgu rodmgwgu left a comment

Thanks for this!


.. code:: python

class AuthzCourseAuthoringMigrationRun(models.Model):
Contributor

My understanding is that this model is scoped per course or organization, correct?
If the trigger comes from WaffleFlagOrgOverrideModel, how would the data be stored? Would it include all course IDs for the organization, or is storing just the organization info sufficient?

Contributor Author

The idea is that the metadata field stores information about each migration that has been performed, including the scope (course_id).

Contributor

Sounds good! Thanks!

Contributor

@dwong2708 dwong2708 left a comment

LGTM! Thanks


@mariajgrimaldi mariajgrimaldi merged commit 14070da into openedx:main Apr 16, 2026
8 checks passed
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in Contributions Apr 16, 2026