This repository has been archived by the owner on May 14, 2024. It is now read-only.

Why Files? #29

Closed
ormsbee opened this issue Jan 8, 2019 · 4 comments
Labels
arch Architecture / API Design question Further information is requested

Comments

@ormsbee
Collaborator

ormsbee commented Jan 8, 2019

I'm re-posting a question from @regisb that he asked on the Confluence comments page. I wanted to have the discussion here because it's more likely to be found and to be useful. My apologies for not getting around to replying until after the holidays. :(

Quoting @regisb from the wiki design doc:

I'm a bit surprised by the choice of files for storing content. My 2 cents:

The data structure should be chosen by taking into consideration what kind of read/write access will be required. For instance, file systems are not good at answering the question "what is the most recent file in this folder" (they don't have an index on dates). And, it seems to me that we are frequently going to have to make such queries, for instance to get the latest version of a content element.

I think we all agree that accounting for read/write data access patterns is critical, and was much of the focus of issues #16 and #26. Accounting for the dependencies (and transitive dependencies of dependencies) is part of the motivation for having all that information stored in a single snapshot summary file, so that we can get that sort of data for any given BundleVersion with a single file request.

Also, frankly, with the amount of content that edX has, using an object store is just way cheaper than paying for the equivalent amount of database storage.

Filesystems are bad at searching: will we have to rely on a grep-like tool (i.e., slow) when searching for content?

Search capabilities will be provided by Elasticsearch, which we would still have used even if we went with a more SQL-based solution. We'll eventually want to index a bunch of things (PDF files, subtitles, etc.) and MySQL's full text search capabilities are pretty limited.

Separating data in two different storage systems (filesystem and SQL db) requires some synchronization, which is a hard problem.

Yeah. We somewhat mitigate this by having it so that the database entries are more or less pointers to immutable file system snapshots (with some deduplication on the snapshot side so we don't make useless copies of files that don't change between versions). So the synchronization is one way in that sense -- we first build a Snapshot at the file system level, and then point to it with a BundleVersion at the database level. If a failure happens halfway through the process, we might have created a Snapshot that never gets pointed to. But it's orphaned data that will be ignored, as opposed to having something two-way where conflicts can arise.
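The one-way ordering described here can be sketched in a few lines (hypothetical names throughout; Blockstore's real models and storage API are organized differently):

```python
import json
import os


def publish_new_version(store_root, bundle_uuid, snapshot_digest, db):
    """Illustrate the one-way sync: write the immutable snapshot file
    first, then record a pointer to it in the database.

    `db` here is a stand-in for any transactional store (e.g. a Django
    model). A crash between the two steps leaves only an orphaned
    snapshot file on disk -- never a database row pointing at missing
    data.
    """
    # Step 1: write the snapshot summary into the file/object store.
    snapshot_path = os.path.join(store_root, bundle_uuid, f"{snapshot_digest}.json")
    os.makedirs(os.path.dirname(snapshot_path), exist_ok=True)
    with open(snapshot_path, "w") as f:
        json.dump({"digest": snapshot_digest}, f)

    # Step 2: only after the file exists, point a BundleVersion at it.
    db.append({"bundle_uuid": bundle_uuid, "snapshot": snapshot_digest})
```

Because the database write happens last, readers that go through the database can never see a half-written snapshot; at worst there is junk data that nothing references.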

Is it going to be a requirement for large Open edX platforms to have an object storage platform like S3 to store assets? Filesystems are bad at storing many files per folder; does it mean that object storage is the only way to go?

I think that object storage will be the preferred deployment method for most folks, but it should work with filesystems. The current design already groups data files by Bundle, and so would require that the file system support ~10K files in a given folder to yield decent performance with large courses. My understanding is that this is okay for modern filesystems. If it's not, we can further sub-divide the files by prefix, so that instead of raw data files being a "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx..." file, they can be "xxxxx/xxxxx/xxxxx/xxxxx..." etc.
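The prefix subdivision mentioned above is easy to derive from the hash-based file name; a sketch (the split width and depth are arbitrary choices for illustration, not Blockstore's actual layout):

```python
import hashlib


def sharded_path(data, width=5, depth=3):
    """Split a content hash into fixed-width directory prefixes so that
    no single folder accumulates an unbounded number of files.

    e.g. a digest 'abcde12345...' becomes 'abcde/12345/.../abcde12345...'
    """
    digest = hashlib.sha256(data).hexdigest()
    prefixes = [digest[i * width:(i + 1) * width] for i in range(depth)]
    # The full digest remains the leaf file name, so lookups by hash
    # only need the digest itself to reconstruct the path.
    return "/".join(prefixes + [digest])
```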

I'm really not familiar with how Django implements its file-based storage API or large directory performance with more recent file systems (I have definitely been bit by that back in the ext2 days). What file systems are you thinking of, and what are the practical limits we should take into account?

Files don't have schema: one of the current major issues with xblocks is that they are extremely difficult to migrate, whenever their definition changes. Backward compatibility becomes very hard to maintain. Files have the same problem.

I completely agree, but addressing that is not a goal we have for Blockstore. I think that explicit versioning is something that's going to have to be added to the serialization format, but that's a whole separate topic. Blockstore is intended to be a dumb storage layer, with XBlock understanding happening at the caller level (so we don't get another giant Modulestore mess of coupled storage/XBlock logic).

(I am very new to the blockstore discussion, so it's quite possible my comments are completely irrelevant (smile))

You are very familiar with Open edX, have valuable perspectives on deployment issues, and are the author of what is still probably my all time favorite presentation on Open edX internals. So please continue to ask away. We did a lot of soul searching with this question in issue #26 a month ago, so I think using files as a basic approach is going to stay, but we can certainly look at what we can do to make sure it performs well outside of S3-like object stores.

@ormsbee ormsbee added question Further information is requested arch Architecture / API Design labels Jan 8, 2019
@regisb

regisb commented Jan 11, 2019

@ormsbee Thanks for taking the time to thoroughly answer my questions, I appreciate it.

I can understand the cost/simplicity argument for choosing file-based storage. For large amounts of data, such as those required by edX, cost and simplicity are important factors to take into account.

That being said, file-based data storage is not all flowers and lollipops 🌹 🍭. Basically, what we are building here is a database. And databases usually have nice properties that we are so used to that we tend to forget they even exist. For SQL databases, these properties are summarised as ACID. With file-based databases, it's difficult to guarantee Atomicity and Isolation, and very difficult to guarantee Consistency.

For instance, what happens in case of a power failure? In MySQL, a transaction either succeeds entirely or fails entirely. In Blockstore, a power failure or crash in the middle of writing a bundle would leave us with corrupt data:

Here is the function responsible for writing a snapshot to disk:

def create_snapshot(self, bundle_uuid, paths_to_files):
    # Each data file is saved one at a time; nothing spans the loop
    # as a single transaction.
    files = {}
    for path, data in paths_to_files.items():
        files[str(path)] = self._save_file(bundle_uuid, path, data)
    # The summary JSON (the Snapshot) is only written at the end.
    return self._create_snapshot(bundle_uuid, files)

There are ways to make this method ACID, but they are not trivial. If you try to improve this function, basically you are going to re-create a file-based database, and there are other, very sophisticated tools that already do that.

I would suggest using plain old PostgreSQL. Just store binary blobs and JSON files there. Costs can be reduced by limiting read/write calls and not using RDS. For serving large binary assets, such as videos, we can use a file-based caching layer, such as HAProxy, or even S3. That way, we shift the responsibility of serving assets away from the blockstore, which then becomes a simpler component.

In addition, choosing PostgreSQL is a first step for moving Open edX away from MySQL, which would be an improvement.

@ormsbee
Collaborator Author

ormsbee commented Jan 11, 2019

@regisb

With file-based databases, it's difficult to guarantee Atomicity and Isolation, and very difficult to guarantee Consistency.

For instance, what happens in case of a power failure? In MySQL, a transaction either succeeds entirely or fails entirely. In Blockstore, a power failure or crash in the middle of writing a bundle would leave us with corrupt data:

I agree with you in the general case, but I believe that the way we're using files guards against this when creating new versions. Putting files one by one into a file system isn't going to be ACID compliant without a lot of extra complexity (which I have no desire to add to Blockstore). But the result of an abrupt failure in this scenario should be a little wasted space, and not data corruption.

The overall flow of creating a new BundleVersion is:

  1. Create data files, with naming derived from a hash of the data.
  2. Create a summary JSON file that points to the files that were created. This is the Snapshot.
  3. Create a BundleVersion in the database that points to the Snapshot.
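Those three steps can be sketched as follows (hypothetical helpers; the real Blockstore code differs, and the real Snapshot also embeds a creation time, which this sketch omits for determinism):

```python
import hashlib
import json


def create_bundle_version(paths_to_data, db_versions):
    """Sketch of the flow: (1) content-hash-named data files,
    (2) a summary JSON listing them (the Snapshot),
    (3) a database row pointing at the Snapshot (the BundleVersion)."""
    # 1. Name each data file after the hash of its content, so a retry
    #    after a crash regenerates identical names and reuses the files.
    files = {
        path: hashlib.sha256(data).hexdigest()
        for path, data in paths_to_data.items()
    }
    # 2. The Snapshot summary is itself content-addressed JSON.
    snapshot_json = json.dumps({"files": files}, sort_keys=True)
    snapshot_digest = hashlib.sha256(snapshot_json.encode()).hexdigest()
    # 3. Only the final database write makes the new version visible.
    db_versions.append(snapshot_digest)
    return snapshot_digest
```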

File writes are atomic on object stores, and they can be made atomic on most file systems by writing to a temp file and then doing a move (again, not sure if Django storage does the right thing here). So now let's say we're creating a new version in this way and we have a sudden failure:
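A minimal sketch of the temp-file-and-move pattern, assuming a POSIX filesystem where `os.replace` is an atomic rename (this illustrates the technique, not Django's storage implementation):

```python
import os
import tempfile


def atomic_write(path, data):
    """Write `data` to `path` so readers never observe a partial file:
    write to a temp file in the same directory, then atomically rename
    it over the destination."""
    dir_name = os.path.dirname(path) or "."
    # The temp file must live on the same filesystem as the target,
    # otherwise the rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX; replaces any old file
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise
```

A crash before `os.replace` leaves only an unreferenced temp file; a crash after it leaves the complete new file in place.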

If failure happens during data file creation, then those files are basically orphaned. No summary JSON file is ever created, so no Snapshot exists. We just have junk data lying around. But re-trying will want to create the same data files over again, and since the files are named after their content hash, we will end up using those files for the next Snapshot.

If failure happens after Snapshot creation but before the BundleVersion is created, then we have a Snapshot file that's never referenced. Re-trying will create a new Snapshot (the create time is part of the snapshot content), and the BundleVersion will point to that new Snapshot. It's not atomic or consistent in totality, but it is for the parts we actually care about.

If failure happens during BundleVersion creation, we are now purely in the database, and the net result of a rollback is again a Snapshot that has no reference to it.

The abrupt failure scenario is more of a potential issue with Drafts, but one that I hope we've mitigated in other ways. Typical writes to Draft files in the Studio scenario are one at a time (as changes are made to a given XBlock). Every import would get its own Draft, so imports wouldn't step on each other; a failure mid-import means you have some junk left in the system, but no corruption -- your next attempt to import would create a new Draft.

We're also making the dependency between Drafts and Bundles one-way, to make it easier to throw out the Drafts implementation we have today if those assumptions turn out to be wrong.

FWIW, the reason BundleVersion and Snapshots are separate things is mostly to have a stronger separation between where we're going to hang the data itself (Snapshot) vs. the metadata that references it (tags, search, etc.). This makes it easier for us to pull down data that violates licensing agreements without disrupting a bunch of plugins which may need to update asynchronously. It also lets us walk back from the storage decisions we're making today and migrate to something else (whether that's a database, a version control system, or something else), with less disruption to those same plugins.

I would suggest using plain old PostgreSQL. Just store binary blobs and JSON files there. Costs can be reduced by limiting read/write calls and not using RDS. For serving large binary assets, such as videos, we can use a file-based caching layer, such as HAProxy, or even S3. That way, we shift the responsibility of serving assets away from the blockstore, which then becomes a simpler component.

Managing a 40+ TB replicated Postgres database + caching layer is more operational complexity than we want to take on. Putting metadata in the database via hosted RDS and the raw data files in an object store strikes a balance between cost and operational complexity that we're more confident about.

In addition, choosing PostgreSQL is a first step for moving Open edX away from MySQL, which would be an improvement.

I am not a fan of MySQL. Most of the bundles models.py docstring is me explaining the horribleness of MySQL. But for the sake of overall stack simplification, I don't want to introduce a transitional period where for years we're going to have part of the Open edX stack on MySQL and part of it on PostgreSQL. As much as MySQL annoys me, there are more important things to work on, and the feature set difference is not so compelling that it makes up for the long term pain of understanding both systems operating at scale.


I realize that going with file/object storage is not all "flowers and lollipops". But I think it's currently the most pragmatic tradeoff we can make between simplicity, scale, and cost. I definitely agree that it's an area where we could fall into a trap of trying to graft on features that our primitives don't support well. It's definitely something we need to keep an eye on going forward.

@bradenmacdonald
Contributor

bradenmacdonald commented Jan 11, 2019

Great discussion so far, guys - thanks!

A couple other points I want to mention:

A lot of the data that will eventually be stored in blockstore (everything on the "Files & Uploads" page in Studio) is currently stored as blobs in MongoDB (GridFS), and that approach is awful. Regardless of the decision about where XBlock data & metadata gets stored, moving the image/PDF/video/etc. files that they use to a proper object store (which can more easily be fronted by a CDN) is a huge win, and helps us get rid of MongoDB.

Remember that most reading + searching queries done in the system will happen in the LMS, and the LMS is unlikely to talk to Blockstore directly, ever. XBlock data will be read from some intermediate caching system like the course blocks API (with hot data stored in redis/memcached) and course content searching will be done from Elasticsearch.

Even in Studio, which will talk to Blockstore directly, most read operations other than "fetch the data of one specific XBlock for rendering in the XBlock runtime" should be handled by Blockstore's SQL DB and/or Elasticsearch, and not require reads to S3.

I would love to move to PostgreSQL too, but the fact is that we haven't been able to do the (relatively straightforward) MySQL 5.6->5.7 upgrade yet, nor even the utf8->utf8mb4 upgrade, both of which I think offer major improvements (efficient JSON fields, emoji support). That should give us pause when considering anything that's orders of magnitude harder, like a MySQL->PostgreSQL change.

@ormsbee
Collaborator Author

ormsbee commented Apr 3, 2019

Closing this for now, but happy to reopen if folks want to discuss further.
