Organize files on disk #35

Closed
graft opened this issue Feb 22, 2018 · 8 comments

@graft
Contributor

graft commented Feb 22, 2018

Files should be sorted by project. Currently Metis defines a 'project_path' for each project in config.yml.

File data is stored in this path - how is it organized?

Each Metis::File expects to have data at a certain location. Currently this is stored by file_name, which creates some issues:

  • The file_name cannot be an arbitrary string. In general it seems dangerous to create files based on arbitrary user-strings. Restricting the file-name format to something safe (no spaces, apostrophes, etc.) will probably irritate users.

  • A file_name in the Metis::File record and a file_name on disk may get out of sync. If the client is also identifying files based on their file name, it might also get out of sync.

Alternatives:

  1. Use an md5sum to store the file. This is bad because if the data changes, the location changes. Also the server does not know the location until the upload is complete.

  2. Use a unique id for each file. ???

@graft
Contributor Author

graft commented Feb 22, 2018

A unique id per file might mean orphaned (unidentifiable) data if the corresponding record is deleted. On the other hand, it is easy to rename files this way.

@graft
Contributor Author

graft commented Mar 7, 2018

Filenames can have a fairly large range of values; supposedly a valid path string in Unix is ^[^\0]+$. We should be more restrictive. Windows forbids these characters: <>:/|?* as well as ASCII 0-31, so we could use this pattern: /\A[^<>:;,?"*\|\/\x00-\x1f]+\z/
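
As a minimal sketch, the pattern quoted above could be applied like this. The constant name and the helper `safe_file_name?` are hypothetical and not part of Metis. Note that the pattern as written also rejects `;`, `,` and `"`, and does not reject a backslash.

```ruby
# Pattern copied from the comment above: rejects < > : ; , ? " * | /
# and the ASCII control characters 0-31. It also rejects the empty string.
SAFE_FILE_NAME = /\A[^<>:;,?"*\|\/\x00-\x1f]+\z/

# Hypothetical helper (not a Metis method): true if the name matches the pattern.
def safe_file_name?(name)
  SAFE_FILE_NAME.match?(name)
end

puts safe_file_name?("clinical data.xls") # spaces are allowed
puts safe_file_name?("it's.txt")          # apostrophes are allowed
puts safe_file_name?("what?.txt")         # question mark is rejected
puts safe_file_name?("a/b.txt")           # slash is rejected
```

Under this pattern, spaces and apostrophes stay legal, which addresses the earlier worry about irritating users with an overly strict format.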

However, I still don't want to write these names to disk; it feels untrustworthy somehow. Here are some alternatives:

  1. Use a random uid for the file. This means only the database knows what the filename is. Renaming the file means changing the filename in the database entry.

  2. Base64-encode the filename and use the encoded_filename for the file on disk. The database contains only the filename; the file-on-disk is found by encoding the filename. The association, if broken, can perhaps be rediscovered from the filename. Renaming the file must be done in both the database and the file-on-disk.

I think I prefer 2. This allows the filename to be safe, allows us to maintain the association, and allows a wide range of possible filenames.

@graft
Contributor Author

graft commented Mar 7, 2018

What is the directory structure of a metis project?

Options:

  1. There is no directory structure. All files are stored flat in the same directory.

  2. There are top-level 'buckets', but no sub-structure.

  3. Any arbitrary folder structure is valid. Folders may have any valid filenames.

Whatever the structure, for simplicity's sake it should be encoded in the filename. I.e., we do not say that there is a file "clinical.xls" in the folder "patient/lmnop-4321/", we say there is a file "patient/lmnop-4321/clinical.xls".

In this case I think I prefer (3), with dirnames base64-encoded as above on disk.

So, given a valid filename, we can get its location-on-disk as filename.split(%r{/}).map{|name| Base64.urlsafe_encode64(name)}.join('/')
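
A small sketch of that mapping, assuming the URL-safe variant from Ruby's `base64` library (`Base64.urlsafe_encode64`), which avoids `/` in the output. The helper names `location_on_disk` and `file_name_from_disk` are hypothetical. The reverse direction shows how the association could be rediscovered from the name on disk.

```ruby
require 'base64'

# Hypothetical helper: encode each path segment separately so the
# directory structure is preserved but every name on disk is safe.
def location_on_disk(file_name)
  file_name.split('/').map { |name| Base64.urlsafe_encode64(name) }.join('/')
end

# Hypothetical helper: recover the original file name from its location.
# Decoded bytes are tagged as UTF-8 (an assumption about the stored names).
def file_name_from_disk(location)
  location.split('/').map do |part|
    Base64.urlsafe_decode64(part).force_encoding(Encoding::UTF_8)
  end.join('/')
end

location = location_on_disk('patient/lmnop-4321/clinical.xls')
puts location
puts file_name_from_disk(location)
```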

However, I think that separately there should be a 'bucket' attribute on files, so that, e.g., magma can put files in a controlled location. If unspecified, the bucket should just be a default 'files' bucket.

@graft
Contributor Author

graft commented May 15, 2018

The back end for this is in place, along with a folder creation endpoint; the get-files request must accept a folder argument.

On the front end I would like to add a basic breadcrumb and folder/file listing, and connect a 'create folder' button.

@graft
Contributor Author

graft commented May 23, 2018

Okay, the front-end works nicely, and so does the back end.

However, my experience leads me to believe the current back-end structure (i.e., how the files are stored) is inadequate/bad.

The most significant issue is that currently a filename within a folder ('pictures/picture.jpg') encodes both the user-visible name of the current resource (e.g. 'picture.jpg') as well as the containing folder (e.g. 'pictures/') - this means if we want to change the name of the containing folder ('pictures/') the filename of any files contained within must also change (e.g. 'pictures/' => 'photos/' means 'pictures/picture.jpg' => 'photos/picture.jpg'). This is cumbersome, especially if the folder contains hundreds of files - this means renaming hundreds of files on disk, surely a risky operation.

I am also mourning the loss of filesystem-level browsing, which means that mounting the storage remotely becomes more difficult or impossible in the future. While I still like the encoded file names (I went with hex-encoding rather than Base64, which has some odd characters, is not very standard, and might include slashes), they are also inconvenient for filesystem-level browsing.
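
For illustration, one way to hex-encode a name in Ruby is with `String#unpack1` and `Array#pack`. This is only a sketch; the helper names are hypothetical, and the thread does not show which encoding routine Metis actually uses.

```ruby
# Hypothetical helper: hex-encode a name, two lowercase hex digits per byte.
# The output only ever contains the characters 0-9 and a-f.
def hex_encode(name)
  name.unpack1('H*')
end

# Hypothetical helper: reverse the encoding. The bytes are tagged as UTF-8,
# which is an assumption about how the names are stored.
def hex_decode(hex)
  [hex].pack('H*').force_encoding(Encoding::UTF_8)
end

encoded = hex_encode('picture.jpg')
puts encoded             # => 706963747572652e6a7067
puts hex_decode(encoded) # => picture.jpg
```

The encoded name is safe on any filesystem but twice as long as the original and unreadable when browsing the disk directly, which is the inconvenience noted above.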

For this reason I think I will propose:

  1. Uniqueness constraints should apply across [ file_name, folder_id, bucket_id, project_name ] - i.e., my file_name need not be unique if any of the other three differs.

  2. The file_name does not contain the parent folder name (e.g. the file at 'pictures/picture.jpg' would have file_name 'picture.jpg' and folder_id => the entry for 'pictures/')

  3. When we create a folder, we use mkdir to make a corresponding folder on disk. Files are stored in the folder hierarchy as usual.

  4. Read-only attributes are set on disk as well as in the database. This means if we set a directory to be guo-w, we cannot create a file in it or rename it.
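
Points 3 and 4 might look roughly like the following sketch, which uses a temporary directory in place of a real project_path. The helpers `create_folder`, `protect` and `protected?` are hypothetical names, not Metis methods.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: create the on-disk directory for a folder and return its path.
def create_folder(root, *names)
  path = File.join(root, *names)
  FileUtils.mkdir_p(path)
  path
end

# Hypothetical helper: remove write permission for user, group and other
# (the "guo-w" from the list above), keeping the remaining permission bits.
def protect(path)
  File.chmod(File.stat(path).mode & ~0o222 & 0o7777, path)
end

# True if no write bit is set on the path.
def protected?(path)
  (File.stat(path).mode & 0o222).zero?
end

Dir.mktmpdir do |root|
  dir = create_folder(root, 'pictures')
  protect(dir)
  puts protected?(dir)
  File.chmod(0o755, dir) # restore write access so the temporary tree can be removed
end
```

With the write bits cleared, the operating system itself refuses to create files in the directory for ordinary users, so the database flag and the disk state cannot silently disagree.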

@graft
Contributor Author

graft commented Jun 19, 2018

https://dirtsimple.org/2010/11/simplest-way-to-do-tree-based-queries.html

Here's an article on the current big problem: how to store the folder hierarchy in the DB. The old approach, to write the full path into the folder_name, does not work because it makes renaming expensive. The new approach, to only store the folder's name and have a link to its parent, does not work because querying for a path becomes expensive (requires multiple SQL queries).

It's possible there is a solution with common table expressions ("with" clauses), but the approach above (a closure table) might also work.
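
To make the trade-off concrete, here is a small in-memory sketch of the parent-link approach. The `Folder` struct, the `FOLDERS` array and the `resolve` helper are hypothetical stand-ins for a database table and its queries; the column names follow the folder_name and folder_id wording used in this thread.

```ruby
# Hypothetical in-memory stand-in for a folders table: each row stores only
# its own name and the id of its parent (nil for a top-level folder).
Folder = Struct.new(:id, :folder_name, :folder_id)

FOLDERS = [
  Folder.new(1, 'patient', nil),
  Folder.new(2, 'lmnop-4321', 1),
  Folder.new(3, 'scans', 2)
]

# Resolve a full path to a folder. Each path segment needs its own lookup;
# against a real database that would be one query per segment unless a
# recursive query or a closure table is used instead.
# Returns [folder_or_nil, number_of_lookups].
def resolve(path, folders = FOLDERS)
  lookups = 0
  folder = nil
  path.split('/').each do |name|
    lookups += 1
    parent_id = folder&.id
    folder = folders.find { |f| f.folder_name == name && f.folder_id == parent_id }
    return [nil, lookups] if folder.nil?
  end
  [folder, lookups]
end

folder, lookups = resolve('patient/lmnop-4321/scans')
puts "found folder #{folder.id} after #{lookups} lookups"
```

Renaming a folder in this model touches a single row (for example, changing the folder_name of row 1), and every descendant is still reachable under the new path. The cost moves to path resolution, which is the problem described above.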

graft self-assigned this Jun 29, 2018

@graft
Contributor Author

graft commented Jun 29, 2018

Okay, I solved my tree issues with a recursive query. Although the query is unwieldy, it can be confined to a single point: reconstructing the folder hierarchy given the full path. I have also made it so that directories are created on disk corresponding to each folder (with a safe name). With this in place I have repaired the backend to use a new Folder model. There is some repair to be done to the client; following this, this basic problem should be solved.

@graft
Contributor Author

graft commented Aug 7, 2019

Implemented in #47
