Hash-based object store #53

Closed
graft opened this issue Oct 12, 2019 · 2 comments

Comments

@graft

graft commented Oct 12, 2019

Currently each file on Metis is stored on disk in an actual directory structure. This is cumbersome to manage, and moving a file to a different path requires several filesystem operations. Ostensibly the reason for this was some sort of inspectability of the file store on disk. In practice this isn't really the case (the hex-encoded file paths are hard to read), and the encoding causes some serious issues (the Linux file system limit on file name length leaks into Metis, since it takes two hex characters to encode each file name character).

A better object store could use MD5s to organize files. Each file's content is stored in a directory structure according to its hash; the object store maintains a table mapping a file key (i.e., a full path including :project_name/:bucket_name) to an MD5. Newly uploaded files are stored at a temporary location until the object store can hash them, after which they are moved to their MD5 location.
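A minimal sketch of this proposal, for illustration only. The class name `HashObjectStore`, its methods, the two-level directory layout, and the in-memory hash standing in for the database table are all assumptions, not Metis code.

```ruby
require 'digest/md5'
require 'fileutils'

# Hypothetical sketch of the proposed store; not part of Metis.
class HashObjectStore
  def initialize(root)
    @root = root
    @index = {} # file key => md5; stands in for the database table
  end

  # On-disk location derived from the hash (the layout is an assumption).
  def md5_path(md5)
    File.join(@root, md5[0, 2], md5[2, 2], md5)
  end

  # Hash a file sitting at a temporary location, then move it to its
  # md5 location and record the key => md5 mapping.
  def store(key, temp_path)
    md5 = Digest::MD5.file(temp_path).hexdigest
    dest = md5_path(md5)
    if File.exist?(dest)
      FileUtils.rm(temp_path) # identical content is already stored
    else
      FileUtils.mkdir_p(File.dirname(dest))
      FileUtils.mv(temp_path, dest)
    end
    @index[key] = md5
  end

  # Resolve a file key to its on-disk location, or nil if unknown.
  def location(key)
    md5 = @index[key]
    md5 && md5_path(md5)
  end

  # Moving a file only changes the table entry; no data is touched.
  def move(old_key, new_key)
    @index[new_key] = @index.delete(old_key)
  end
end
```

In this sketch, two uploads with identical content resolve to the same location, and `move` never touches the disk.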

This has several advantages: duplicate files don't take up extra space, and moving files from one path to another merely involves changing a database entry in the object store. This also abstracts the "object store" away from Metis' folder/file structure, paving the way for future use of other object stores (e.g., a cloud-based store or a Ceph store).

@graft

graft commented Feb 19, 2020

The full path/object store notion is flawed for the same reasons seen in #35, when we first opted for a directory structure on disk, viz: if each item stores its full path, the path becomes invalid whenever a folder name changes. This might lead to the difficult situation where renaming or re-rooting a folder requires substantially rewriting large portions of a file hierarchy, i.e. renaming hundreds or thousands of files. Let's say it violates "least surprise" that renaming a single node would be so expensive.
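The cost difference can be illustrated with a toy comparison. Everything here (`rename_with_full_paths`, `Folder`, `FileNode`) is a hypothetical stand-in, not the Metis schema: the first model stores a full path per item, the second stores only a name plus a reference to the parent folder.

```ruby
# Hypothetical illustration; none of these names come from Metis.

# Model A: every item stores its full path, so renaming a folder means
# rewriting every path underneath it.
def rename_with_full_paths(paths, old_prefix, new_prefix)
  rewritten = 0
  updated = paths.map do |path|
    if path.start_with?("#{old_prefix}/")
      rewritten += 1
      new_prefix + path[old_prefix.length..-1]
    else
      path
    end
  end
  [updated, rewritten]
end

# Model B: each node stores its own name and a reference to its parent,
# so a rename is a single update and paths are derived on demand.
Folder = Struct.new(:name, :parent) do
  def path
    parent ? "#{parent.path}/#{name}" : name
  end
end

FileNode = Struct.new(:name, :folder) do
  def path
    "#{folder.path}/#{name}"
  end
end
```

In Model A the number of rewritten entries grows with the size of the renamed subtree; in Model B setting `folder.name` once changes the derived path of every file beneath it.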

And for what? The file already knows its path - it merely needs an association with the record responsible for the data block (i.e., a foreign key to the other table). This association, in fact, already exists; it's just called "backup" right now, and it points to a location on Amazon instead of a location on disk. We should, instead, rename this table to "data_blocks", and simply make it point to a location on disk as well. The method Metis::File#location should (instead of constructing a file system path) defer to Metis::DataBlock#location. Currently the Backup object is formed via the "archive" command. The DataBlock should, instead, be formed by the completed Upload, which hands the actual data off to DataBlock and attaches it to the appropriate Metis::File.
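A rough sketch of the delegation described above, under stated assumptions: the `MetisSketch` module, the `FileRecord` class (standing in for Metis::File), the `complete!` method, and the md5-based location scheme are all hypothetical, and nothing is written to disk here.

```ruby
require 'digest/md5'

# Hypothetical stand-ins; the real Metis classes and schema may differ.
module MetisSketch
  # Record responsible for the stored bytes.
  DataBlock = Struct.new(:md5, :location)

  # Stand-in for Metis::File: knows its own name, defers location.
  class FileRecord
    attr_accessor :file_name, :data_block

    def initialize(file_name)
      @file_name = file_name
    end

    # Defer to the data block instead of constructing a file system path.
    def location
      data_block && data_block.location
    end
  end

  # A completed upload forms the data block and attaches it to the file.
  class Upload
    def initialize(file_record, root)
      @file_record = file_record
      @root = root
    end

    def complete!(data)
      md5 = Digest::MD5.hexdigest(data)
      # Assumed layout: one entry per hash directly under the root.
      @file_record.data_block = DataBlock.new(md5, ::File.join(@root, md5))
    end
  end
end
```

Because the file record only holds a reference to its data block, renaming the file (or any folder above it) leaves `location` unchanged in this sketch.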

This was referenced Feb 29, 2020
@graft

graft commented Mar 12, 2020

Closed by #56

This issue was closed.