
docs: ADR for serving static assets #110

Merged: 3 commits, Dec 4, 2023

Conversation

@ormsbee (Contributor) commented Oct 31, 2023

A first stab at an asset serving ADR that could serve Studio and the LMS. I'm especially looking for feedback on how we could do the cookie auth handoff (or for someone to tell me why it's a terrible idea).


OPEN QUESTIONS:

* What's the best way to do handoff between LMS/Studio and get that cookie information over to the asset domains?
Contributor Author:

One idea would be to have a one-time key that we store in cache and redirect the user's browser with it. So it could be something like:

  1. Browser makes a request to a new endpoint in Studio/LMS.
  2. Studio/LMS generates a random token, and stores user information associated with it in our backend cache (redis or memcached).
  3. Studio/LMS then redirects the request to an auth URL on the static assets server with that random token in the querystring.
  4. Asset server finds that entry in the cache, logs that user in, and removes the cache entry.
  5. Asset server is now using session auth for that user.

It's cheating in a way because it relies on the asset server and the Studio or LMS instance being on the same service backend. But we're making that the case anyway because of permissions checking integration. And it's simpler than having signed requests (though it could ultimately be extended to go in that direction if we ever want to separate it into an entirely different service).
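The steps above could be sketched roughly like this, in plain Python, with an in-memory dict standing in for the shared redis/memcached backend. All names here are illustrative, not real edx-platform code:

```python
import secrets
import time

# In-memory stand-in for the shared backend cache (redis/memcached) that
# both the Studio/LMS process and the asset server can reach.
_cache = {}

TOKEN_TTL = 30  # seconds; the one-time token should be very short-lived


def issue_token(user_id):
    """Studio/LMS side (steps 1-3): mint a one-time token tied to the user.

    The caller would then redirect the browser to the asset server's auth
    URL with this token in the querystring.
    """
    token = secrets.token_urlsafe(32)
    _cache[f"asset-auth:{token}"] = (user_id, time.time() + TOKEN_TTL)
    return token


def redeem_token(token):
    """Asset server side (steps 4-5): redeem the token exactly once.

    Returns the user_id to log in, or None if the token is unknown,
    already used, or expired.
    """
    entry = _cache.pop(f"asset-auth:{token}", None)  # pop => one-time use
    if entry is None:
        return None
    user_id, expires_at = entry
    if time.time() > expires_at:
        return None
    return user_id
```

After a successful redemption, the asset server would establish a normal session cookie for the user, so subsequent requests use plain session auth.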

Two resolved review threads on docs/decisions/0015-serving-static-assets.rst (outdated).

OPEN QUESTIONS:

* What's the best way to do handoff between LMS/Studio and get that cookie information over to the asset domains?
Contributor:

Maybe:

  1. Browser requests a file without any auth headers
  2. Edge function requests the metadata from the LMS API (if not already cached) to determine if auth is needed.
  3. If no auth is needed, serve the file directly.
  4. If auth is needed, return a 302 redirect to a special LMS endpoint. That endpoint redirects back to the CDN but includes an auth header in the query string.
  5. Edge function serves the file along with a Set-Cookie header, moving the auth token from query string to cookie.

For future requests, the browser will send the auth cookie and the edge function can verify it with the LMS if needed, or ignore it if the asset doesn't require auth.
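The decision flow above could be sketched in plain Python standing in for an edge worker. All names, cookie keys, and URLs here are illustrative:

```python
def verify_with_lms(token):
    """Placeholder: a real edge function would call (and cache) an LMS
    endpoint to verify the auth token."""
    return token == "valid-token"


def handle_asset_request(path, cookies, metadata):
    """Return a simplified (status, headers) pair for an asset request.

    `metadata` stands in for the (cached) LMS metadata API response and is
    assumed to carry a `requires_auth` flag.
    """
    if not metadata["requires_auth"]:
        return 200, {}  # step 3: public file, serve directly
    token = cookies.get("asset_auth")
    if token is None:
        # step 4: bounce through a special LMS endpoint, which redirects
        # back to the CDN with an auth token in the query string
        return 302, {"Location": f"https://lms.example.com/asset-auth?next={path}"}
    if not verify_with_lms(token):
        return 403, {}
    # Step 5's Set-Cookie (moving the token from query string to cookie)
    # would happen on the redirect-back request; here the cookie is
    # already present, so we just serve the file.
    return 200, {}
```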


The high scale version of this approach will require having a CDN with programmable workers and a scalable object store that supports signed URLs (such as S3). The main objective would be to shift the file streaming burden out of Django and onto the CDN and object store.

In Learning Core, we would implement an APIView (or possibly extend the asset-serving one) to return a JSON response of file metadata. This would include things like size, MIME type, last modified date, cache expiration policy, etc. The response would also contain a signed URL pointing to an object store resource, like S3. The CDN worker then does the fetch on that resource. We would create an example worker for at least CloudFlare.
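A metadata payload along those lines might look like the sketch below. The field names are guesses rather than a finalized Learning Core API, and the URL signing is a toy HMAC scheme in the spirit of S3 signed URLs, not the real AWS SigV4 algorithm:

```python
import hashlib
import hmac
import time

SECRET = b"illustrative-signing-key"  # stand-in for the object store's key


def sign_url(storage_key, ttl=300):
    """Build a toy pre-signed, short-lived URL for an object store resource."""
    expires = int(time.time()) + ttl
    msg = f"{storage_key}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"https://objects.example.com/{storage_key}?expires={expires}&sig={sig}"


def asset_metadata(asset):
    """Illustrative shape of the metadata JSON the CDN worker would fetch."""
    return {
        "size": asset["size"],
        "mime_type": asset["mime_type"],
        "last_modified": asset["last_modified"],
        "cache_max_age": 3600,  # cache expiration policy, in seconds
        # The CDN worker streams the file from this signed URL:
        "download_url": sign_url(asset["storage_key"]),
    }
```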
Contributor:

As mentioned above, I think we should just go with this approach right away, though for now we can substitute a minimal in-process Python Django app in lieu of an edge function.

@connorhaugh left a comment:

Intriguing. I have some questions above and then a document level question.

I think an implicit choice made is that there is a 1-1 mapping of assets to xblocks. We, however, know that reuse of assets is very common across blocks. How does that square with permissions and versioning?

Context
--------

Both Studio and the LMS need to serve course team authored static assets as part of the authoring and learning experiences. These will most often be images, but may also include things like subtitles, audio files, and even JavaScript. It does NOT typically include video files, which are treated separately because of their size.


Suggested change
Both Studio and the LMS need to serve course team authored static assets as part of the authoring and learning experiences. These will most often be images, but may also include things like subtitles, audio files, and even JavaScript. It does NOT typically include video files, which are treated separately because of their size.
Both Studio and the LMS need to serve course team authored static assets as part of the authoring and learning experiences. "Static assets" in the edx-platform context presently refers to: image files, audio files, text document files like pdfs, video transcription files, and even javascript and python files. As video files have a different implementation due to their large file size, they are not commonly referred to as static assets.

I think this is an important distinction to make, so I made it a little clearer. I also think that you should note in the decision that we are choosing to maintain this distinction between static assets and videos.

Assets may reference each other in relative links, e.g. a JavaScript file that references images or other JavaScript files. That means that our solution cannot require querystring-based authorization tokens in the style of S3 signed URLs, since asset files would have no way to encode those into their relative links.

**Multiple versions of the asset should be available at the same time.**
Our system should be able to serve at minimum the current draft and published versions of an asset. Ideally, it should be able to serve any version of an asset. This is a departure from the way Studio and the LMS currently handle files and uploads, since there is currently no versioning at all–assets exist in a flat namespace at the course level and are immediately published.


I see why this is useful, but I want to dig into this a little bit, because this is inherently a costly choice, right? I am curious: what is the process by which an author augments a file to create a new version? Is it re-upload with the same name?

Contributor Author:

Yeah, basically. If you have a Problem and /static/figure1.webp is referenced, and you upload a new /static/figure1.webp, then you're creating a new version of that XBlock with that file updated.

It is somewhat costly, since we're holding onto both the old and new images in this case. But at the same time, it gives us the ability to actually batch changes in XBlock content and files together, e.g. for publishing them both at the same time.

Security Requirements
~~~~~~~~~~~~~~~~~~~~~

**Assets must enforce user+file read permissions at the Learning Context level.**


+1 to this as being a major req.

@ormsbee (Contributor Author) commented Nov 1, 2023

@connorhaugh:

I think an implicit choice made is that there is a 1-1 mapping of assets to xblocks. We, however, know that reuse of assets is very common across blocks. How does that square with permissions and versioning?

The mapping isn't really 1:1 though. The raw data for the asset lives in a file referenced by a RawContent model. But that same asset can exist in many different XBlock components. It can even exist as multiple different file names within the same XBlock if we really want it to, though I can't think of a plausible use case for that.

In terms of re-use, I think there are two scenarios that can play out:

Incidental re-use, with component-local assets.

We upload a cute cat image in one problem, and then upload it (and a few others) in a second problem in the same library. We're effectively "re-using" the image data, but not in any way that requires centralized management. The system saves the raw data once, but there are two different rows of ComponentVersionRawContent that reference it. The two images will be available at two different URLs and can have separate permissions checks applied, since they are addressed as part of two different XBlock components, and the system may give a 403 for one you're not yet allowed to see.

Intentional re-use, with centralized management.

I think this one's trickier, because there are a couple flavors.

The first kind is linking to the assets of other Components directly. Under the covers, this is how existing Files and Uploads in Studio would be handled (with a Component that represents all the current files and uploads stored for a course). Things could be linked directly, and anyone enrolled in the course would have access to it.

There's also linking to a shared asset where the dependency is more explicitly tracked. Say we have a library of custom grader code, that includes both Python code that can be executed, as well as JavaScript code for certain problem types. Our component XBlock (a ProblemBlock) uses these assets. There are a few broad approaches I can think of:

  1. Optimize for the browser by downloading things from the original component location, even if that's in another Learning Package. This is the best for browser caching, especially if we're using it across many different ProblemBlocks in our course, but it can make permissions checking difficult.
  2. Make the download URL of the shared content appear to be nested inside the borrowing component–i.e. make it look like the asset belongs to the ProblemBlock and not wherever it originally came from. This simplifies permissions checking because you're only ever checking that you have permissions to the ProblemBlock, not to anything it might be using.
  3. Something in between, possibly at the LearningPackage level.

@ormsbee (Contributor Author) commented Nov 5, 2023

Made some major revisions. I'll probably ping the forums for more eyes this coming week.

* Accepted some suggested edits on how we define assets.
* Revised proposal to take advantage of X-Accel-Redirect.
* Mandated object storage server.
* Made a first pass at how auth could work.
@ormsbee (Contributor Author) commented Dec 2, 2023

@bradenmacdonald: Finally getting back to this. Addressed your last comment about how an object store should no longer be a requirement. Honestly relieved to not have to try to push that change through just yet. 😛

@ormsbee (Contributor Author) commented Dec 2, 2023

@connorhaugh: I've included your edit suggestion. Please let me know if you have any other concerns.

@bradenmacdonald (Contributor) left a comment:

We'll learn more as we go but this seems like a good direction to me.

@ormsbee ormsbee merged commit 66f4fa2 into openedx:main Dec 4, 2023
7 checks passed
@ormsbee ormsbee deleted the assets-adr branch December 4, 2023 22:38

@regisb left a comment:

I learned about this ADR via https://github.com/openedx/blockstore/issues/314#issuecomment-1924458461 on the Blockstore DEPR issue. I realize I'm very late to the game but I think very strongly that we must reconsider the new requirement of a second top-level domain.


The further implication of this requirement is that *permissions checking must be extensible*. The openedx-learning repo will implement the details of how to serve an asset, but it will not have the necessary models and logic to determine whether it is allowed to.

**Assets must be served from an entirely different domain than the LMS and Studio instances.**
regisb:

Does this requirement apply to all Open edX platforms, and not just edX.org? We can't possibly expect Open edX users to register a second domain name to host their platforms. At the very least, this item should be discussed with the BTR. Personally, I would oppose such a requirement.

Contributor Author:

@regisb: It would apply to all Open edX platforms.

For now, we'll use a different sub-domain for development purposes (so it would come across like a new service). There are still some long term security risks with that, but it's better than the current state of things. I'll put a writeup together on security tradeoffs before bringing this to the BTR and likely the security working group. We won't add a hard requirement of a new domain for assets before doing that feedback cycle.

That being said, right now I'm really trying to unblock some folks on libraries work, and then I'm taking time off to go back to my hometown for the first time since the pandemic. It's likely that I won't have the time to go through this process until next month.

Thank you for your feedback.

Member:

I'll put a writeup together on security tradeoffs before bringing this to the BTR and likely the security working group.

Thanks Dave, I'll keep an eye out for that.

We can't possibly expect Open edX users to register a second domain name to host their platforms.

This is a good point, and it makes me wonder if a sibling subdomain would be sufficient for security. For example, if a site operator today had:

  • example.org # lms
  • studio.example.org # cms
  • apps.example.org # mfes

could we move them to:

  • assets.example.org # assets
  • lms.example.org # lms
  • studio.example.org # cms
  • apps.example.org # mfes
  • example.org/* # caddy wildcard redirect to lms.example.org/*, just to preserve existing LMS URLs.

Contributor:

A sibling subdomain only works if you are extremely careful to never set cookies onto the root domain. If you look at tutor today, for example, many cookies like lms_sessionid get set onto the root .local.edly.io domain. While it may be possible to configure Open edX to work without root domain cookies, there is not really any secure way to enforce this "don't set cookies on the root domain" as a policy, so it remains a potential security issue. Plus it works both ways: if you use a subdomain for untrusted content, that untrusted content can set cookies onto your root domain. (Unless it's a public suffix like opencraft.hosting.)
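For illustration, the cookie scoping in question comes down to Django settings like these (the domain names are examples, not a recommended configuration):

```python
# settings.py (illustrative): pin the session and CSRF cookies to the LMS
# host only. If SESSION_COOKIE_DOMAIN were instead ".example.org" (note the
# leading dot, i.e. the root domain), the browser would also send the
# session cookie to sibling subdomains like assets.example.org, which is
# exactly the leakage described above.
SESSION_COOKIE_DOMAIN = "lms.example.org"
CSRF_COOKIE_DOMAIN = "lms.example.org"
```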

It's much safer to use a completely unrelated domain. This is exactly why GitHub puts Pages on github.io rather than GitHub.com, btw (and many other sites do similar things).

We can't possibly expect Open edX users to register a second domain name to host their platforms.

To be clear, registering a second domain name is usually not going to be necessary except for aesthetic reasons.

Almost all production deployments happen on the cloud and virtually all cloud providers provide free domain names on public suffixes, e.g. d111111abcdef8.cloudfront.net or x.s3.us-west-2.amazonaws.com (AWS), x.r2.dev or x.workers.dev (CloudFlare), etc. Any of these free domains will suffice. (Edit: of course these aren't actually "free" unless you're already paying for a CDN, block storage, edge workers, or a load balancer. But you probably want at least one of those anyways.)

The further implication of this requirement is that *permissions checking must be extensible*. The openedx-learning repo will implement the details of how to serve an asset, but it will not have the necessary models and logic to determine whether it is allowed to.

**Assets must be served from an entirely different domain than the LMS and Studio instances.**
To reduce our chance of maliciously uploaded JavaScript compromising LMS and Studio users, user-uploaded assets must live on an entirely different domain from LMS and Studio (i.e. not just another subdomain). So if our LMS is located at ``sandbox.openedx.org``, the files should be accessed at a URL like ``assets.sandbox.openedx.io``.
regisb:

I fail to see how this security issue would be mitigated by hosting the javascript files on a different domain. It seems to me that the cookie access capability of custom javascript depends on the runtime context, not the hosting domain.

For instance, javascript loaded from abc.com, but executed on def.com, would be able to access the cookies from abc.com. What am I missing?

Contributor:

Good point. Actually this is only required assuming that content is being run in an iframe, and the main source code file (HTML) for the iframe is being hosted among the static assets. It's true that as long as the iframe has a different origin than the LMS, it doesn't matter where the actual asset files are stored. I think we need to think this through a bit more. CC @ormsbee

@ormsbee (Contributor Author) commented Feb 29, 2024

FWIW, people can (and occasionally do) upload HTML files as static assets.

regisb:

Right. I probably don't have the whole picture, but it seems to me that this is a very niche use case. It would be a shame if this specific scenario by itself forced us to set up another domain name.

Is this a use case that we want to preserve, or is it a new one? In its current form, isn't it already a security liability, as it allows course staff to run arbitrary scripts?

Contributor Author:

Is this a use case that we want to preserve, or is it a new one? In its current form, isn't it already a security liability, as it allows course staff to run arbitrary scripts?

It is a liability. One we're hoping to close here.

Whether these use cases should be preserved is a product call. I've seen HTML uploads used by folks who have some fancy JS simulation already built out from some other project, and they want to run it in their course. I've also seen it used by people who have some kind of syllabus from a Word export, where all the images are base64-encoded data urls embedded into the HTML itself, leading to a 20 MB file that would choke the HTMLBlock editor.

I'd be delighted to kick those cases to the curb, but I don't know what other vectors there are. PDFs run JavaScript for form validation these days. I think that's locked down fairly tight and wouldn't have access to cookies, but I'm not sure. A more serious vulnerability is that JS can be embedded into an SVG file. Browsers won't run it when you include SVGs using the <img> tag, but if you send someone a link to look at this cool image and they click it, their session could be compromised.

Now there are ways to sanitize it, set up proper content security policies, etc. But I guess my point is that I don't know what else is out there, or will be out there five years from now. Using a separate domain helps to remove this class of vulnerability, which is probably why places like MDN recommend storing it on a different domain:

Sandbox uploaded files. Store them on a different server and allow access to the file only through a different subdomain or even better through a completely different domain.

regisb:

Sorry for the slow answer...

Now I understand the general recommendation about hosting uploaded assets on a different subdomain or domain. But that should be a recommendation, not a requirement. It's a whole different thing to say "we recommend you host your uploaded assets on a different domain, but if you understand the risks you can use a subdomain of your LMS, or even the same domain" -- and to actually support those use cases.


**The asset server must be capable of handling high levels of traffic.**
Django views are a poor choice for streaming files at scale, especially when deployed with WSGI (as Open edX does), since a streaming response ties down a worker process for its entire duration. While a Django-based streaming response may be sufficient for small-to-medium traffic sites, we should allow for a more scalable solution that fully takes advantage of modern CDN capabilities.
regisb:

As a side note, this is not the case for uwsgi, which can serve static assets without pausing workers. This is the primary reason why tutor uses uwsgi. But on the other hand uwsgi is no longer maintained, and thus we find ourselves shopping for a replacement, hoping that we can find one with the same feature set.

Contributor Author:

Sure. But even if uwsgi can send files efficiently when its file serving subsystem is invoked, a Django view is still going to tie down the worker process because it's Django doing the streaming in that case. I don't know the specifics of how you'd offload the file serving from the Django view to uwsgi (I'm sure it's possible), but at that point, it's functionally equivalent to using the X-Sendfile header and letting caddy do it, isn't it? In either case, the Django view is saying, "I can't send this efficiently, so I'm just going to send the headers for where to find it, and delegate the sending to something with better concurrency/performance than I have."
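The delegation pattern described here can be sketched as the headers a Django view would emit, shown below as a plain dict rather than a Django HttpResponse so the example stands alone. X-Accel-Redirect is nginx's convention; X-Sendfile is the analogous header understood by some other front-end servers. The paths and content type are illustrative:

```python
def asset_response_headers(asset_path, user_can_read):
    """Build the headers for delegating file streaming to the front server.

    Django does the (potentially expensive) permissions check, then hands
    off the actual byte-streaming: the response body stays empty, and the
    front server replaces it with the file found at the internal location
    named by X-Accel-Redirect, freeing the WSGI worker immediately.
    """
    if not user_can_read:
        return {"Status": 403}
    return {
        "Status": 200,
        "Content-Type": "image/webp",  # illustrative; real code looks this up
        # "/protected/..." is an internal-only location the front server
        # maps to the actual file storage; browsers can't request it directly.
        "X-Accel-Redirect": f"/protected/{asset_path}",
    }
```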
