Conversation
I think the big challenge here from a practical perspective is that currently both the LMS and Studio make blocking calls to Blockstore. When learners learn from a blockstore-based content library (and in future, courses etc.), the LMS needs to ask blockstore for the OLX of any XBlocks that aren't currently cached in its memcached. This could be a good forcing function to make us move ahead with implementing a process for "compiling" Blockstore OLX to a more optimized field data format that the LMS can load faster. (Reading from S3 is really really slow). |
Yeah, that lack of compilation is the thing that concerns me the most right now. Does it make sense to compile it into MySQL text fields where 1 row = 1 unit, and just leave it in OLX form there? |
That would be a simple and effective solution, yes. But a publish workflow which parses the XML and writes the resulting key-value field data pairs into MySQL would be more efficient: publishes would be slower (potentially a lot slower if there are many changed blocks in a bundle, since XML parsing requires loading the XBlock runtime for each block), but loading a block in the LMS would be faster. I think I would prefer the latter approach for its runtime simplicity and efficiency. For example, in the LMS, for registered users all field data (authored and student-specific) could then be loaded from MySQL. |
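To make the "key-value field data pairs in MySQL" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the table name, schema, and key formats are invented for illustration, and sqlite stands in for MySQL so the snippet is self-contained.

```python
import json
import sqlite3

# Hypothetical schema: one row per (block, field) pair of *published* authored
# field data, written at publish time so the LMS never parses OLX at read time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE compiled_field_data (
        learning_context TEXT NOT NULL,   -- e.g. a library or course key
        block_key        TEXT NOT NULL,   -- usage key of the XBlock
        field_name       TEXT NOT NULL,
        value_json       TEXT NOT NULL,   -- field value serialized as JSON
        PRIMARY KEY (learning_context, block_key, field_name)
    )
""")

def publish_block(context, block_key, fields):
    """Write the parsed field data for one block (called from the publish flow)."""
    conn.executemany(
        "REPLACE INTO compiled_field_data VALUES (?, ?, ?, ?)",
        [(context, block_key, name, json.dumps(value))
         for name, value in fields.items()],
    )

def load_block_fields(context, block_key):
    """LMS-side read path: no XML parsing, just JSON -> Python."""
    rows = conn.execute(
        "SELECT field_name, value_json FROM compiled_field_data "
        "WHERE learning_context = ? AND block_key = ?",
        (context, block_key),
    )
    return {name: json.loads(value) for name, value in rows}

publish_block("lib:DemoX:problems", "block@problem+intro", {
    "display_name": "Intro problem",
    "max_attempts": 3,
})
print(load_block_fields("lib:DemoX:problems", "block@problem+intro"))
```

The `REPLACE` keyed on (context, block, field) means a re-publish simply overwrites the affected rows.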
It is overwhelmingly likely that I'm missing something, so please feel free to enlighten me. If the following are, or eventually become, true:
Then why do we need to store OLX in S3? |
Thinking on this some more, I would really like the LMS optimized form to actually have an honest-to-goodness hierarchical data model, with Units and Modules being modeled in the database. We would have version awareness and could lift some fields into their own columns. In terms of storing the optimized form or not... I lean towards storing the OLX because:
But it's not a hill I'd die on. |
The main reasons are:
S3 is a compromise between something purely version control oriented (like git/hg), and something we feel confident in scaling out both in terms of performance and cost, particularly for very large asset files. But for the LMS XBlock runtime piece, where we're only serving OLX and we only need the latest published version, having it in MySQL is fast, simple, and flexible. |
Thanks for that, @ormsbee. :) We can discuss this elsewhere when there's a bit of time for it. In any case, I do understand wanting to keep publish history, though that would probably only be truly valuable with an accompanying Studio feature to revert or compare it. (Contentstore does it but reversion was never added to Studio, with the result that it now requires constant cleanup.) As for storing static files, yes, that much was never in question. I was referring to OLX specifically. |
That makes a lot of sense, yes.
It's true that the downside of "parse on publish" is that when you publish a huge bundle with lots of changed OLX files, mass parsing of the changes could be slow enough to require being done via celery. But publishing doesn't need to be instant. I think it's clearly more important for learning to be fast than for publishing to be fast.
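A rough sketch of what "parse on publish, offloaded to a worker" could look like: only the changed OLX files get parsed, in batches, so a large publish can run asynchronously (e.g. wrapped in a celery task) without blocking the author. The parsing function here is a trivial stand-in, not the real runtime parser, and all names are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_olx_to_fields(olx_str):
    """Stand-in for the runtime's OLX -> field data parsing."""
    root = ET.fromstring(olx_str)
    return dict(root.attrib)

def publish_changed_blocks(changed_olx, batch_size=100):
    """Yield batches of (block_id, parsed fields), ready to write to MySQL.

    In a real deployment each batch (or the whole generator) could be
    consumed inside a celery task, since publishing need not be instant.
    """
    items = list(changed_olx.items())
    for i in range(0, len(items), batch_size):
        yield [(block_id, parse_olx_to_fields(olx))
               for block_id, olx in items[i:i + batch_size]]

# Simulate a large publish: 250 changed blocks -> 3 batches.
changed = {f"block{i}": f'<problem display_name="P{i}"/>' for i in range(250)}
batches = list(publish_changed_blocks(changed))
print(len(batches))
```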
Yes, although it's not reliably minimal. It happens synchronously in XBlock-controlled code, and the XBlocks can in theory do whatever slow things they want during parsing.
My concern with this is that once you include the OLX layer in the LMS, even if you add another optimized layer later, you may never be able to remove the OLX layer due to dependencies. So you end up with two layers in the LMS instead of one, and the system is more complicated.
Yeah, I don't think that's a concern, since Blockstore/Studio will still have the OLX. For example, the new Content Libraries v2 APIs let you get the OLX of any block, even if its XBlock plugin is uninstalled. So I consider this part solved already. Another way of looking at this is just from the LMS perspective: what is the value of having the OLX in the LMS, compared to pre-parsed fields? I would argue that there is no value at all; it adds complexity, redundant data, and [may] require a parsing+caching layer. If Studio hands the pre-parsed field data to the LMS, then you have the data already in the right format*, and there's no need for a parsing+caching process. (* almost; you still need the eventual JSON->Python conversion, but it's better than XML->Python->JSON/pickle->Python, etc.) |
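A toy illustration of the two read paths being compared above (all names and values are illustrative): the OLX-in-LMS layer implies XML -> Python on a cache miss plus a serialize/deserialize round-trip on cache hits, while pre-parsed field data only ever needs JSON -> Python at read time.

```python
import json
import pickle
import xml.etree.ElementTree as ET

olx = '<problem display_name="Intro" max_attempts="3"/>'

# Path 1: XML -> Python at parse time, then typically Python -> pickle/JSON
# for caching, then back to Python on every cache hit.
root = ET.fromstring(olx)
fields = {"display_name": root.get("display_name"),
          "max_attempts": int(root.get("max_attempts"))}
cached = pickle.dumps(fields)
assert pickle.loads(cached) == fields

# Path 2: Studio hands over pre-parsed field data; the LMS stores JSON and
# only ever does JSON -> Python at read time.
stored = json.dumps(fields)
assert json.loads(stored) == fields
```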
Usefulness of Publish History
Its value comes up even in content libraries because (according to various course team interviews) course teams don't want content that their courses use to magically change underneath them without their notice. So a course run could be using a version of a problem bank that is actually four versions behind the latest, because they haven't had time to test/review the latest version yet and they don't want to risk breaking things for their class.
Right. I only meant to highlight that the distinction between OLX and other content is going to get a lot blurrier with Blockstore, in the sense that we'll have images, transcripts, etc. associated much more tightly with the XBlock content that they're used in.
Publishing
I've been trying to get the publish boundary nailed down in my head, in a world with stricter boundaries between Studio and the LMS. I feel like I've always had a high-level sense of what that boundary should be, but the pieces weren't totally fitting together. This has caused me to revisit some of my core assumptions. Now, this is definitely still half-baked in my head, but here goes... The data that needs to be updated when a course publish happens is (roughly):
Ideally, we want:
The pattern that I initially thought we'd use is to read that OLX via the XBlock runtime in the LMS and have it compile/store/extract what it needs. But aside from being unpredictably slow, that runtime is problematic because it's allowed to do any kind of smart substitutions (e.g. edx-when). We could definitely run it in both modes, but that complicates the plumbing, and it never gets us to items 3 or 4 on the above list. As you pointed out, we can never trust custom parsing of XBlock code to be fast or reliable. That really rankles.

It gets even worse if the XBlock runtime is a floating, stateless thing out in space somewhere that needs to get remotely invoked each time we need to do this sort of translation during the publish process. Then I was thinking that in this sort of arrangement, the XBlock runtime-as-a-service would have to have some sort of celery task that knew how to suck up information from Studio, munch it, and publish it back into the LMS. But that also sounded slow, with more moving pieces.

Which led me to my current thought, which is: What if we don't invoke the XBlock runtime at all during publish? We sometimes treat OLX like a serialization convenience, but it is an interchange format. Individual XBlocks are unpredictable, but all the fields that we absolutely need for publishing structural and high-level metadata (structure, names, identifiers, dates, content groups, etc.) are well defined. The LMS can safely and quickly pull that information out, parse it from the XML itself, and stuff it into the appropriate system. Unit-level XBlock content OLX can just be straight up copied into the appropriate spaces.
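The "no XBlock runtime at publish" idea above can be sketched with a plain XML parser: walk the well-defined structural tags and lift out identifiers and names, while copying unit-level XBlock OLX through verbatim. Tag and attribute names here mirror typical OLX but are illustrative, not a spec.

```python
import xml.etree.ElementTree as ET

# Structural tags whose hierarchy we extract ourselves; anything else is
# treated as leaf XBlock content and copied through untouched.
STRUCTURAL_TAGS = {"course", "chapter", "sequential", "vertical"}

def extract_structure(node):
    """Recursively pull structure/names/identifiers out of OLX, runtime-free."""
    entry = {
        "tag": node.tag,
        "url_name": node.get("url_name"),
        "display_name": node.get("display_name"),
        "children": [],
    }
    for child in node:
        if child.tag in STRUCTURAL_TAGS:
            entry["children"].append(extract_structure(child))
        else:
            # Leaf XBlock content: keep the raw OLX as-is.
            entry["children"].append(
                {"tag": child.tag, "olx": ET.tostring(child, encoding="unicode")}
            )
    return entry

olx = """
<course url_name="run1" display_name="Demo">
  <chapter url_name="ch1" display_name="Week 1">
    <sequential url_name="seq1">
      <vertical url_name="unit1">
        <problem url_name="p1" display_name="Q1">...</problem>
      </vertical>
    </sequential>
  </chapter>
</course>
"""
tree = extract_structure(ET.fromstring(olx))
print(tree["children"][0]["display_name"])  # Week 1
```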
That's true, but they can also do anything in their parse_xml implementations.
I don't think it's appropriate for Studio to pre-parse it into field data to push over as part of a public interface. The OLX is a supported, documented interchange format. The optimized view can be anything the LMS needs to evolve it to be. So I think the value proposition is:
I'm also not nearly as concerned about the performance implications of XML vs. parsed forms for individual units, and I think that might be the crux of it. Almost every XBlock performance problem I've beaten my head against over the years has involved pulling and examining orders of magnitude more data than was actually necessary for the problem -- whether that's inheritance, permissions, or inefficient encoding of basic information. It's why we can examine a megabyte of data and do potentially hundreds of thousands of opaque key comparisons to process a save-position handler for a video. Based on that, my chief concern is that XBlocks process as little data as possible, rather than having the most efficient encoding of that data. |
Side note, but implicit in the above is that there's some differentiation between a core set of things that are necessary to atomically publish a course and the slew of things that might trigger off of that which require XBlock logic and may take much longer (e.g. search indexing). |
Yes, exactly, and that works today in both v1 and v2, requiring in both cases a complete version history.
It's worth seriously considering eliminating that option from the new runtime. The way that parsing works currently is a big mess, but we could do much better by standardizing it and moving control of the XML->fielddata process out of the XBlock code and strictly into the runtime. The fact is that the vast majority of XBlocks use the default implementation of parse_xml anyway.
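A minimal sketch of what runtime-owned parsing could look like: a single, deterministic mapping from OLX attributes to the block's declared, typed fields, with no XBlock code running at all. The field declarations are illustrative stand-ins for real XBlock field definitions.

```python
import xml.etree.ElementTree as ET

# Hypothetical declared fields for a "problem" block: name -> type coercion.
FIELD_TYPES = {
    "display_name": str,
    "max_attempts": int,
    "weight": float,
}

def runtime_parse(olx_str, field_types):
    """Runtime-controlled XML -> field data: standardized, no XBlock hooks."""
    root = ET.fromstring(olx_str)
    fields = {}
    for name, coerce in field_types.items():
        raw = root.get(name)
        if raw is not None:          # only attributes present in the OLX
            fields[name] = coerce(raw)
    return fields

print(runtime_parse('<problem display_name="Q1" max_attempts="2"/>', FIELD_TYPES))
# {'display_name': 'Q1', 'max_attempts': 2}
```

Because the mapping lives entirely in the runtime, its cost is predictable, which is exactly what the "XBlocks can do anything during parsing" concern above is about.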
OK, that's a good argument. |
What do you think about the format for ORA2 in this scheme? Also, this conversation has drifted pretty deeply into the things that would be necessary down the road to make this happen, but is it okay to merge this ADR as intended direction for now? |
Hmm, I see - it has a non-standard XML representation of its fields. It could still be made to work in this scheme with little effort, and fully backwards compatible, by converting its code to use what I'm calling "raw mode", where the root OLX XBlock element contains some non-OLX content. Essentially, the entire XML portion inside the root element would get parsed by the runtime as a single raw text field.
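A tiny sketch of that "raw mode" idea: the runtime reads the root element's attributes as normal fields, but serializes everything inside the root back to a string and stores it as one text field. The field name `data` and the openassessment-style markup are illustrative assumptions, not ORA2's actual schema.

```python
import xml.etree.ElementTree as ET

def parse_raw_mode(olx_str, raw_field="data"):
    """Root attributes become fields; the inner XML becomes one text field."""
    root = ET.fromstring(olx_str)
    fields = dict(root.attrib)
    # Re-serialize all content inside the root element, untouched.
    inner = (root.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in root
    )
    fields[raw_field] = inner
    return fields

fields = parse_raw_mode(
    '<openassessment display_name="Peer Review">'
    '<title>Essay</title><rubric/></openassessment>'
)
print(fields["display_name"])  # Peer Review
```

The XBlock can keep its existing custom deserialization, but applies it to the raw field's string instead of hooking into the runtime's parsing step.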
Good with me! I really hope we can change the XML parsing as suggested above; the way XML parsing to field data works in XBlocks currently was easily the biggest pain point while implementing the new blockstore-based runtime. |
Blockstore will remain a separate application and repo, but will be added as an installed app to Studio and run in-process.