Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use closure tables to implement file cache without full path column #4209

Closed
butonic opened this issue Jul 26, 2013 · 16 comments
Closed

Use closure tables to implement file cache without full path column #4209

butonic opened this issue Jul 26, 2013 · 16 comments

Comments

@butonic
Copy link
Member

butonic commented Jul 26, 2013

I need to track this somewhere, so I might just as well do it publicly. Past and current filecache implementations rely on a path column containing the full path of file. Currently it is limited to 512 chars: https://github.com/owncloud/core/blob/master/db_structure.xml#L224

There has been a lot of research on how to store tree like structures in relational databases. A good starting point is this stackoverflow entry. Another comparison can be found in One more Nested Intervals vs. Adjacency List comparison.

The most promising approach which I consider superior to eg. Nested Sets are Closure Tables, due to the clear meaning of column values. I know few people who actually understand nested sets ...

I already had a rudimentary version of them working in the OC4.5 file cache ... dunno if the code is still in some branch on one of my local machines ...

cc @icewind1991 @bartv2 opinions?

@icewind1991
Copy link
Contributor

A problem is that the most used query on that table is selecting a row by path (or rather path_hash), closure tables can speedup getting all children for copy/move operations but we will still need to keep the full path somehow if we want to keep fast reads

@jancborchardt
Copy link
Member

@butonic is this still a problem? What’s the call on this?

@butonic
Copy link
Member Author

butonic commented Jul 16, 2014

It is a completely different approach to storing the file tree in the database. If someone feels inclined to do some performance testing I'd still like to try this. I have a branch somewhere but it needs to add a join for every subdirectory. Depending on the DB that might be costly ...

@butonic butonic added this to the Maybe someday milestone Jul 16, 2014
@butonic
Copy link
Member Author

butonic commented Jul 16, 2014

moving to someday maybe milestone

@jancborchardt jancborchardt modified the milestones: Maybe someday, backlog Mar 23, 2015
@butonic
Copy link
Member Author

butonic commented May 13, 2015

Regarding the most used query being the lookup of the fileid by path: Those are only SELECTS. They can be cached and scaled easily. UPDATES cant. We could also use self joins to look up more than one directory depth in a single query.

The only theoretical problem I see is additional space requirement because the closure table grows exponentially: https://b2berry.files.wordpress.com/2011/12/b2berryscalingrtctables1.pdf

I prefer closure tables becaus we have to update mtime, etag and size up to all tree roots (multiple because of sharing). Closure tables allow us to select or update all parents in one query.

@MTRichards
Copy link
Contributor

@DeepDiver1975 @cmonteroluque @karlitschek bringing this to the top of the backlog to help open the discussion.

@DeepDiver1975
Copy link
Member

to be honest - with a different database structure we can gain speed up - maybe - this has to be analysed.

But - we have a more pressing issue with the database: concurrent access.
We implemented a workaround with 8.0.3 which is also part of 8.1.

But the whole way we operate on the database is still fishy and requires a detailed analysis to ensure data integrity. There have been thoughs from @dragotin on this topic - thanks a lot once more.

I prefer to first work on this area instead of restructuring the schema.

@butonic
Copy link
Member Author

butonic commented May 13, 2015

Just to be clear: I totally agree with @DeepDiver1975 here.

Concurrent access is a problem because our updates / etag and mtime propagation are not atomic and we don't handle transaction rollbacks correctly.

AFAICT Closure tables solve the data integrity problems. BUT the only way to know for sure if they solve all our problems without creating new ones is by trying them: do the necessary implementation and test with huge data sets.

@PVince81
Copy link
Contributor

@butonic I suppose it would be possible to make such experiments as a separate app.
That app would provide its own Storage/Scanner/Cache implementations using the DB models you proposed.

The only change needed in core is a config switch to make it possible to use another Storage class as home storage, so we could switch it to that one.

@PVince81
Copy link
Contributor

Well or it could simply be an external storage backend (provided by the app) that actually uses another local folder to store the files. Then mount it somewhere for testing.

@PVince81
Copy link
Contributor

PVince81 commented Jun 8, 2015

@butonic where's the POC branch you mentioned ?

@PVince81
Copy link
Contributor

PVince81 commented Jun 8, 2015

Also found this Doctrine extension that brings nested sets and closure tables: https://github.com/Atlantic18/DoctrineExtensions/blob/master/doc/tree.md

@PVince81
Copy link
Contributor

PVince81 commented Jun 8, 2015

I prefer closure tables becaus we have to update mtime, etag and size up to all tree roots (multiple because of sharing). Closure tables allow us to select or update all parents in one query.

Hmm that sounds cool. But that might not work for sizes, as the size will be different for each folder. Unless there is an aggregate query that can automatically calculate the sizes on the go while updating.

@PVince81
Copy link
Contributor

@butonic relevant for your rawstorage experiments as well ?

@butonic
Copy link
Member Author

butonic commented Oct 14, 2016

no but we could try implementing a local storage with a closure table based cache ... could ... if we had the time ...

@PVince81
Copy link
Contributor

PVince81 commented Mar 5, 2018

other cases where closure tables would come in handy:

  • with a given folder, check whether it contains any children with a given tag: the closure table gives you direct access to all child file ids which you can join with the tags tables
  • with a given folder, find every parents with one query to check if one is shared

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants