This repository has been archived by the owner on Feb 15, 2024. It is now read-only.

ADR for Blockstore as App #71

Merged
merged 1 commit into master Jun 17, 2020

Conversation

ormsbee
Collaborator

@ormsbee ormsbee commented Jun 8, 2020

Blockstore will remain a separate application and repo, but be added as an installed app to Studio and will run in-process.

@bradenmacdonald
Contributor

I think the big challenge here from a practical perspective is that currently both the LMS and Studio make blocking calls to Blockstore. When learners learn from a blockstore-based content library (and in future, courses etc.), the LMS needs to ask blockstore for the OLX of any XBlocks that aren't currently cached in its memcached.

This could be a good forcing function to make us move ahead with implementing a process for "compiling" Blockstore OLX to a more optimized field data format that the LMS can load faster. (Reading from S3 is really really slow).

@ormsbee
Collaborator Author

ormsbee commented Jun 8, 2020

Yeah, that lack of compilation is the thing that concerns me the most right now. Does it make sense to compile it into MySQL text fields where 1 row = 1 unit, and just leave it in OLX form there?

@bradenmacdonald
Contributor

> Does it make sense to compile it into MySQL text fields where 1 row = 1 unit, and just leave it in OLX form there?

That would be a simple and effective solution, yes. But a publish workflow which parses the XML and writes the resulting key-value fielddata pairs into MySQL would be more efficient (publishes would be slower or a lot slower if there are many changed blocks in a bundle, since XML parsing requires loading the XBlock runtime for each block; loading a block in the LMS would be faster). I think I would prefer the latter approach for its simplicity and efficiency. For example, in the LMS, for registered users all field data (authored and student-specific) could then be loaded from DjangoKeyValueStore instead of a mix of BlockstoreFieldData+DjangoKeyValueStore.
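For illustration, the two storage shapes under discussion might look like this. This is a minimal sketch with hypothetical table/field names, using the stdlib XML parser in place of the XBlock runtime, just to contrast "store OLX text per unit" against "pre-parse into key-value field data at publish time":

```python
# Hypothetical sketch of the two storage options discussed above.
# Names (unit_id, olx, field, value) are illustrative, not a real schema.
import xml.etree.ElementTree as ET

OLX = '<problem display_name="Checkpoint 1" max_attempts="3">...</problem>'

# Option 1: store the OLX as-is, one row per unit (parse at load time in LMS).
olx_row = {"unit_id": "unit-1", "olx": OLX}

# Option 2: parse on publish into key-value field data rows, so the LMS
# never needs to touch XML at load time.
def parse_to_fields(unit_id, olx_text):
    root = ET.fromstring(olx_text)
    return [
        {"unit_id": unit_id, "field": name, "value": value}
        for name, value in sorted(root.attrib.items())
    ]

rows = parse_to_fields("unit-1", OLX)
```

Option 2 front-loads the parsing cost into the publish step, which is the trade-off debated in the rest of this thread.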

@arbrandes

It is overwhelmingly likely that I'm missing something, so please feel free to enlighten me. If the following are, or eventually become, true:

  • The LMS will get OLX directly from MySQL
  • The OLX in S3 will only ever be touched by Studio Blockstore when saving it
  • Blockstore will never serve more than one Studio instance
  • Blockstore will always update a MySQL record when the OLX is updated

Then why do we need to store OLX in S3?

@ormsbee
Collaborator Author

ormsbee commented Jun 9, 2020

> That would be a simple and effective solution, yes. But a publish workflow which parses the XML and writes the resulting key-value fielddata pairs into MySQL would be more efficient (publishes would be slower or a lot slower if there are many changed blocks in a bundle, since XML parsing requires loading the XBlock runtime for each block; loading a block in the LMS would be faster). I think I would prefer the latter approach for its simplicity and efficiency. For example, in the LMS, for registered users all field data (authored and student-specific) could then be loaded from DjangoKeyValueStore instead of a mix of BlockstoreFieldData+DjangoKeyValueStore.

Thinking on this some more, I would really like the LMS optimized form to actually have an honest-to-goodness hierarchical data model with Units and Modules being modeled in the database. We would have version awareness and lift some fields like display_name into actual columns. I'm honestly not sure how that squares with FieldData vs. behind DjangoKeyValueStore–it's certainly compatible with those interfaces, but I don't know how much they add to it. We do some weird contortions to get around the fact that it's really hard to efficiently query limited sub-trees of the hierarchy, but none of that should be necessary with a decent data model.
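As a rough illustration of what that hierarchical model could enable, here is a sketch using an in-memory SQLite schema. The table and column names are made up for this example (not edx-platform's actual schema); the point is that with parent pointers and `display_name` lifted into a real column, limited sub-trees can be queried directly instead of loading the whole structure:

```python
# Hypothetical hierarchical data model: structure rows with a parent pointer,
# display_name lifted into a real column, and a version column for caching.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE block (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES block(id),
    block_type TEXT NOT NULL,
    display_name TEXT NOT NULL,   -- lifted out of OLX into a real column
    version INTEGER NOT NULL      -- version awareness for caching
);
INSERT INTO block VALUES
    (1, NULL, 'course',     'Demo Course', 1),
    (2, 1,    'chapter',    'Week 1',      1),
    (3, 2,    'sequential', 'Intro',       1),
    (4, 3,    'vertical',   'Unit 1',      1);
""")

# Fetch a limited sub-tree (everything under "Week 1") with a recursive CTE,
# instead of loading and walking the entire course structure in Python.
subtree = conn.execute("""
    WITH RECURSIVE sub(id) AS (
        SELECT id FROM block WHERE id = 2
        UNION ALL
        SELECT b.id FROM block b JOIN sub s ON b.parent_id = s.id
    )
    SELECT display_name FROM block WHERE id IN (SELECT id FROM sub)
    ORDER BY id
""").fetchall()
```

With a model like this, "query a limited sub-tree of the hierarchy" is a single indexed query rather than the contortions described above.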

In terms of storing the optimized form or not... I lean towards storing the OLX because:

  1. As long as the most important metadata (e.g. display_name, content group) is lifted out into columns in the data model, we shouldn't need to do mass parsing of large parts of the OLX at once.
  2. The overhead for OLX parsing of a single Unit is minimal.
  3. As long as we keep versioned data, we can add very effective caching.
  4. We can add the optimized layer later if these assumptions turn out to be wrong.
  5. XBlocks get uninstalled, and not having the OLX around has been problematic for us in the past. This is less of a concern since it's the LMS and not Studio.

But it's not a hill I'd die on.

@ormsbee
Collaborator Author

ormsbee commented Jun 9, 2020

> It is overwhelmingly likely that I'm missing something, so please feel free to enlighten me. If the following are, or eventually become, true:
>
>   • The LMS will get OLX directly from MySQL
>   • The OLX in S3 will only ever be touched by Studio Blockstore when saving it
>   • Blockstore will never serve more than one Studio instance
>   • Blockstore will always update a MySQL record when the OLX is updated
>
> Then why do we need to store OLX in S3?

The main reasons are:

  1. Not all Blockstore/Studio saves make their way to LMS.
  2. Blockstore is basically keeping infinite publish history, which gets expensive in space and cost. S3 is cheaper than RDS here.
  3. Blockstore is intended to store many things outside of OLX, like transcripts, video files, PDFs, and probably non-XBlock course content formats like QTI at some point.

S3 is a compromise between something purely version control oriented (like git/hg), and something we feel confident in scaling out both in terms of performance and cost, particularly for very large asset files.

But for the LMS XBlock runtime piece, where we're only serving OLX and we only need the latest published version, having it in MySQL is fast, simple, and flexible.

@arbrandes

Thanks for that, @ormsbee. :) We can discuss this elsewhere when there's a bit of time for it. In any case, I do understand wanting to keep publish history, though that would probably only be truly valuable with an accompanying Studio feature to revert or compare it. (Contentstore does it but reversion was never added to Studio, with the result that it now requires constant cleanup.)

As for storing static files, yes, that much was never in question. I was referring to OLX specifically.

@bradenmacdonald
Contributor

bradenmacdonald commented Jun 10, 2020

> I would really like the LMS optimized form to actually have an honest-to-goodness hierarchical data model

That makes a lot of sense, yes.

> In terms of storing the optimized form or not... I lean towards storing the OLX because:

> As long as the most important metadata (e.g. display_name, content group) is lifted out into columns in the data model, we shouldn't need to do mass parsing of large parts of the OLX at once.

It's true that the downside of "parse on publish" is that when you publish a huge bundle with lots of changed OLX files, mass parsing of the changes could be slow enough to require being done via celery. But publishing doesn't need to be instant. I think it's clearly more important for learning to be fast than for publishing to be fast.

> The overhead for OLX parsing of a single Unit is minimal.

Yes, although it's not reliably minimal. It happens synchronously in XBlock-controlled code, and the XBlocks can in theory do whatever slow things they want during parsing.

> As long as we keep versioned data, we can add very effective caching.
> We can add the optimized layer later if these assumptions turn out to be wrong.

My concern with this is that once you include the OLX layer in the LMS, even if you add another optimized layer later, you may never be able to remove the OLX layer due to dependencies. So you end up with two layers in the LMS instead of one, and the system is more complicated.

> XBlocks get uninstalled, and not having the OLX around has been problematic for us in the past. This is less of a concern since it's the LMS and not Studio.

Yeah I don't think that's a concern since Blockstore/Studio will still have the OLX. For example, the new Content Libraries v2 APIs let you get the OLX of any block, even if its XBlock plugin is uninstalled. So I consider this part solved already.


Another way of looking at this is just from the LMS perspective: what is the value of having the OLX in the LMS, compared to pre-parsed fields? I would argue that there is no value at all; it adds complexity, redundant data, and may require a parsing+caching layer. If Studio hands the pre-parsed field data to the LMS, then you have the data already in the right format*, and there's no need for a parsing+caching process.

(* almost; you still need the eventual JSON->Python conversion, but it's better than XML->Python->JSON/pickle->python etc.)

@ormsbee
Collaborator Author

ormsbee commented Jun 11, 2020

@arbrandes:

Usefulness of Publish History

> Thanks for that, @ormsbee. :) We can discuss this elsewhere when there's a bit of time for it. In any case, I do understand wanting to keep publish history, though that would probably only be truly valuable with an accompanying Studio feature to revert or compare it. (Contentstore does it but reversion was never added to Studio, with the result that it now requires constant cleanup.)

Its value comes up even in content libraries because (according to various course team interviews), course teams don't want content that their courses use to magically change underneath them without their notice. So a course run could be using a version of a problem bank that is actually four versions behind the latest, because they haven't had time to test/review the latest version yet and they don't want to introduce the risk of breaking it to their class.

> As for storing static files, yes, that much was never in question. I was referring to OLX specifically.

Right. I only meant to highlight that the distinction between OLX and other content is going to get a lot blurrier with Blockstore, in the sense that we'll have images, transcripts, etc. associated much more tightly with the XBlock content that they're used in.


@bradenmacdonald:

Publishing

>> As long as the most important metadata (e.g. display_name, content group) is lifted out into columns in the data model, we shouldn't need to do mass parsing of large parts of the OLX at once.

> It's true that the downside of "parse on publish" is that when you publish a huge bundle with lots of changed OLX files, mass parsing of the changes could be slow enough to require being done via celery. But publishing doesn't need to be instant. I think it's clearly more important for learning to be fast than for publishing to be fast.

I've been trying to get the publish boundary nailed down in my head, in a world with stricter boundaries between Studio and LMS. I feel like I've always had a high-level sense of what that boundary should be, but the pieces weren't totally fitting together. This has caused me to revisit some of my core assumptions. Now, this is definitely still half-baked in my head, but here goes...

The data that needs to be updated when course publish happens is (roughly):

  • XBlock unit/leaf content of various kinds.
  • Structural data, so sections + subsections (+units?)
  • Course-level configuration that's been jammed into the Course Descriptor
  • A couple of disconnected pieces of standalone HTML content like static_tab or course_info (which we'll probably want to move out).

Ideally, we want:

  1. Fast LMS performance for students.
  2. Atomic publishes (i.e. dates, outline, content don't fall out of sync)
  3. Fast and simple publishes.
  4. (Long term) An XBlock runtime that has been extracted and does not run in the same process as edx-platform, to better isolate it from user data and platform stability.

The pattern that I initially thought we'd use is to read that OLX via the XBlock runtime in the LMS and have it compile/store/extract what it needs. But aside from being unpredictably slow, that runtime is problematic because it's allowed to do any kind of smart substitutions (e.g. edx-when). We could definitely run it in both modes, but that complicates the plumbing and it never gets us to items 3 or 4 on the above list.

As you pointed out, we can never trust custom parsing of XBlock code to be fast or reliable. That really rankles. It gets even worse if the XBlock runtime is a floating, stateless thing out in space somewhere that needs to get remotely invoked each time we need to do this sort of translation during the publish process. Then I was thinking that in this sort of arrangement, the XBlock runtime-as-a-service would have to have some sort of celery task that knew how to suck up information from Studio, munch it, and publish it back into LMS. But that also sounded slow, with more moving pieces.

Which led me to my current thought, which is: What if we don't invoke the XBlock runtime at all during publish? We sometimes treat OLX like a serialization convenience, but it is an interchange format. Individual XBlocks are unpredictable, but all the fields that we absolutely need for publishing structural and high level metadata (structure, names, identifiers, dates, content groups, etc.) are well defined. The LMS can safely and quickly pull that information out, parse it from the XML itself, and stuff it into the appropriate system. Unit-level XBlock content OLX can just be straight up copied into the appropriate spaces.
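The "no XBlock runtime at publish" idea sketched above could look something like this: a plain XML walk that pulls out the well-defined structural attributes and never imports any XBlock code. The attribute names shown are common OLX conventions used for illustration, not an exhaustive or authoritative schema:

```python
# Hypothetical sketch: extract structural/publish metadata straight from OLX
# with a plain XML parser. No XBlock class is loaded or invoked here.
import xml.etree.ElementTree as ET

COURSE_OLX = """
<course display_name="Demo Course" url_name="demo">
  <chapter display_name="Week 1" url_name="wk1">
    <sequential display_name="Intro" url_name="intro" due="2020-06-30"/>
  </chapter>
</course>
"""

def extract_outline(olx_text):
    """Walk the structural tags only; XBlock-specific parsing is skipped."""
    root = ET.fromstring(olx_text)
    outline = []

    def walk(node, depth=0):
        outline.append({
            "type": node.tag,
            "url_name": node.get("url_name"),
            "display_name": node.get("display_name"),
            "due": node.get("due"),
            "depth": depth,
        })
        for child in node:
            walk(child, depth + 1)

    walk(root)
    return outline

outline = extract_outline(COURSE_OLX)
```

Unit-level OLX below this structural layer would be copied through verbatim, as described above, rather than interpreted at publish time.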

>> The overhead for OLX parsing of a single Unit is minimal.

> Yes, although it's not reliably minimal. It happens synchronously in XBlock-controlled code, and the XBlocks can in theory do whatever slow things they want during parsing.

That's true, but they can also do anything in their student_view. I think that's insidious when we're doing operations across a swath of blocks across the course. For a single Unit, the difference between parsing XML vs. JSON is lost in the noise.

> Another way of looking at this is just from the LMS perspective: what is the value of having the OLX in the LMS, compared to pre-parsed fields? I would argue that there is no value at all; it adds complexity, redundant data, and may require a parsing+caching layer. If Studio hands the pre-parsed field data to the LMS, then you have the data already in the right format*, and there's no need for a parsing+caching process.

I don't think it's appropriate for Studio to pre-parse it into field data to push over as part of a public interface. The OLX is a supported, documented interchange format. The optimized view can be anything the LMS needs to evolve it to be.

So I think the value proposition is:

  • Faster, simpler publishes.
  • Reduces the involvement of the XBlock runtime to prepare for eventual separation.

I'm also not nearly as concerned about the performance implications of XML vs. parsed forms for individual units, and I think that might be the crux of it. Almost every XBlock performance problem that I've beaten my head on over the years has involved pulling and examining orders of magnitude more data than were actually necessary for the problem -- whether that's inheritance, permissions, or inefficient encoding of basic information. It's why we can examine a megabyte of data and do potentially hundreds of thousands of opaque key comparisons to process a save position handler for a video. Based on that, my chief concern is that XBlocks process as little data as possible, rather than having the most efficient encoding of that data.

@ormsbee
Collaborator Author

ormsbee commented Jun 11, 2020

Side note, but implicit in the above is that there's some differentiation between a core set of things that are necessary to atomically publish a course and the slew of things that might trigger off of that which require XBlock logic and may take much longer (e.g. search indexing).

@bradenmacdonald
Contributor

bradenmacdonald commented Jun 11, 2020

@ormsbee

> Its value comes up even in content libraries because (according to various course team interviews), course teams don't want content that their courses use to magically change underneath them without their notice. So a course run could be using a version of a problem bank that is actually four versions behind the latest, because they haven't had time to test/review the latest version yet and they don't want to introduce the risk of breaking it to their class.

Yes, exactly, and that works today in both v1 and v2, requiring in both cases a complete version history.

> As you pointed out, we can never trust custom parsing of XBlock code to be fast or reliable. That really rankles.

It's worth seriously considering eliminating that option from the new runtime. The way that parsing works currently is a big mess, but we could do much better by standardizing it and moving control of the XML->fielddata process out of the XBlock code and strictly into the runtime. The fact is that the vast majority of XBlocks use the default implementation of parse_xml. I think that in practice there are really only two important modes of parsing: (1) default mode, where the children of the root olx XBlock element are themselves child XBlock definitions, and (2) raw mode, where the root olx XBlock element contains some non-OLX content, such as HTML or Capa XML. A single, standard implementation could support both modes, and XBlocks that do other weird stuff during parsing (e.g. video can make calls to VAL during parse) can move that functionality into the render/handler/authoring code.
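The two parsing modes described above could be sketched as a single runtime-owned function. The mode flag and returned field names here are hypothetical, not the real XBlock API; the point is that the XML-to-fielddata translation lives entirely in the runtime:

```python
# Hypothetical runtime-controlled parser supporting the two modes described:
# default mode (children are child XBlock definitions) and raw mode (inner
# content is kept as one opaque text field).
import xml.etree.ElementTree as ET

def runtime_parse(olx_text, raw_mode=False):
    root = ET.fromstring(olx_text)
    fields = dict(root.attrib)  # attributes become XBlock field values
    if raw_mode:
        # Raw mode: root contains non-OLX content (HTML, Capa XML, ...);
        # preserve the inner markup verbatim as a single "data" field.
        inner = (root.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in root
        )
        return {"fields": fields, "data": inner, "children": []}
    # Default mode: each child element is itself a child XBlock definition.
    children = [child.tag for child in root]
    return {"fields": fields, "data": None, "children": children}

default = runtime_parse('<vertical display_name="U1"><html/><problem/></vertical>')
raw = runtime_parse('<html display_name="H"><p>Hi</p></html>', raw_mode=True)
```

Anything a block currently does beyond these two shapes (like video's VAL calls during parse) would move into render/handler/authoring code, as suggested above.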

> I don't think it's appropriate for Studio to pre-parse it into field data to push over as part of a public interface. The OLX is a supported, documented interchange format.

OK, that's a good argument.

@ormsbee
Collaborator Author

ormsbee commented Jun 17, 2020

> It's worth seriously considering eliminating that option from the new runtime. The way that parsing works currently is a big mess, but we could do much better by standardizing it and moving control of the XML->fielddata process out of the XBlock code and strictly into the runtime. The fact is that the vast majority of XBlocks use the default implementation of parse_xml. I think that in practice there are really only two important modes of parsing: (1) default mode, where the children of the root olx XBlock element are themselves child XBlock definitions, and (2) raw mode, where the root olx XBlock element contains some non-OLX content, such as HTML or Capa XML. A single, standard implementation could support both modes, and XBlocks that do other weird stuff during parsing (e.g. video can make calls to VAL during parse) can move that functionality into the render/handler/authoring code.

What do you think about the format for ORA2 in this scheme?

Also, this conversation has drifted pretty deeply into the things that would be necessary down the road to make this happen, but is it okay to merge this ADR as intended direction for now?

@bradenmacdonald
Contributor

> What do you think about the format for ORA2 in this scheme?

Hmm, I see - it has a non-standard XML representation of its fields. It could still be made to work in this scheme with little effort, and fully backwards compatible, by converting its code to use what I'm calling "raw mode, where the root olx XBlock element contains some non-OLX content". Essentially the entire XML portion inside the root element would get parsed by the runtime as a single text field called data, containing XML. Later, when the block goes to render itself, its student_view methods (etc.) can parse that XML (using the existing parse_from_xml method) and do whatever they need to. The only difference is that those parsed fields don't get saved into mongo as official "XBlock fields" anymore. So it would work like capa does.
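The raw-mode handling described for ORA2 could be sketched like this. Class and method names are illustrative, not ORA2's real code; the idea is that the runtime stores the whole inner XML as one `data` text field, and the block parses it lazily at render time, the way capa works:

```python
# Hypothetical raw-mode block: the runtime saves the inner XML verbatim as a
# single text field, and the block parses it on first use during rendering.
import xml.etree.ElementTree as ET

class RawModeBlock:
    def __init__(self, data):
        self.data = data          # single opaque field saved by the runtime
        self._parsed = None

    def _settings(self):
        # Parse lazily at render time instead of at publish/import time,
        # so no XBlock-specific parsing runs during publish.
        if self._parsed is None:
            self._parsed = ET.fromstring(self.data)
        return self._parsed

    def student_view(self):
        rubric = self._settings().find("rubric")
        return f"Rubric: {rubric.get('title')}"

block = RawModeBlock('<openassessment><rubric title="Essay"/></openassessment>')
```

The parsed values simply stop being persisted as official XBlock fields; they become an implementation detail of the block's own rendering code.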

> Also, this conversation has drifted pretty deeply into the things that would be necessary down the road to make this happen, but is it okay to merge this ADR as intended direction for now?

Good with me! I really hope we can change the XML parsing as suggested above; the way XML parsing to field data works in XBlocks currently was easily the biggest pain point while implementing the new blockstore-based runtime.

@ormsbee ormsbee merged commit 518b418 into master Jun 17, 2020
@symbolist symbolist mentioned this pull request Nov 16, 2020
@kdmccormick kdmccormick deleted the ormsbee/adr-service-vs-app branch June 21, 2021 17:44