Use closure tables to implement file cache without full path column #4209

butonic · 2013-07-26T13:53:03Z

I need to track this somewhere, so I might just as well do it publicly. Past and current filecache implementations rely on a path column containing the full path of file. Currently it is limited to 512 chars: https://github.com/owncloud/core/blob/master/db_structure.xml#L224

There has been a lot of research on how to store tree like structures in relational databases. A good starting point is this stackoverflow entry. Another comparison can be found in One more Nested Intervals vs. Adjacency List comparison.

The most promising approach which I consider superior to eg. Nested Sets are Closure Tables, due to the clear meaning of column values. I know few people who actually understand nested sets ...

I already had a rudimentary version of them working in the OC4.5 file cache ... dunno if the code is still in some branch on one of my local machines ...

cc @icewind1991 @bartv2 opinions?

icewind1991 · 2013-07-27T14:37:28Z

A problem is that the most used query on that table is selecting a row by path (or rather path_hash), closure tables can speedup getting all children for copy/move operations but we will still need to keep the full path somehow if we want to keep fast reads

jancborchardt · 2014-07-16T15:51:23Z

@butonic is this still a problem? What’s the call on this?

butonic · 2014-07-16T20:14:41Z

It is a completely different approach to storing the file tree in the database. If someone feels inclined to do some performance testing I'd still like to try this. I have a branch somewhere but it needs to add a join for every subdirectory. Depending on the DB that might be costly ...

butonic · 2014-07-16T20:15:05Z

moving to someday maybe milestone

butonic · 2015-05-13T08:55:25Z

Regarding the most used query being the lookup of the fileid by path: Those are only SELECTS. They can be cached and scaled easily. UPDATES cant. We could also use self joins to look up more than one directory depth in a single query.

The only theoretical problem I see is additional space requirement because the closure table grows exponentially: https://b2berry.files.wordpress.com/2011/12/b2berryscalingrtctables1.pdf

I prefer closure tables becaus we have to update mtime, etag and size up to all tree roots (multiple because of sharing). Closure tables allow us to select or update all parents in one query.

MTRichards · 2015-05-13T11:43:56Z

@DeepDiver1975 @cmonteroluque @karlitschek bringing this to the top of the backlog to help open the discussion.

DeepDiver1975 · 2015-05-13T11:56:52Z

to be honest - with a different database structure we can gain speed up - maybe - this has to be analysed.

But - we have a more pressing issue with the database: concurrent access.
We implemented a workaround with 8.0.3 which is also part of 8.1.

But the whole way we operate on the database is still fishy and requires a detailed analysis to ensure data integrity. There have been thoughs from @dragotin on this topic - thanks a lot once more.

I prefer to first work on this area instead of restructuring the schema.

butonic · 2015-05-13T13:06:06Z

Just to be clear: I totally agree with @DeepDiver1975 here.

Concurrent access is a problem because our updates / etag and mtime propagation are not atomic and we don't handle transaction rollbacks correctly.

AFAICT Closure tables solve the data integrity problems. BUT the only way to know for sure if they solve all our problems without creating new ones is by trying them: do the necessary implementation and test with huge data sets.

PVince81 · 2015-05-28T09:48:45Z

@butonic I suppose it would be possible to make such experiments as a separate app.
That app would provide its own Storage/Scanner/Cache implementations using the DB models you proposed.

The only change needed in core is a config switch to make it possible to use another Storage class as home storage, so we could switch it to that one.

PVince81 · 2015-05-28T09:49:54Z

Well or it could simply be an external storage backend (provided by the app) that actually uses another local folder to store the files. Then mount it somewhere for testing.

PVince81 · 2015-06-08T19:12:03Z

@butonic where's the POC branch you mentioned ?

PVince81 · 2015-06-08T19:18:45Z

Also found this Doctrine extension that brings nested sets and closure tables: https://github.com/Atlantic18/DoctrineExtensions/blob/master/doc/tree.md

PVince81 · 2015-06-08T19:20:57Z

I prefer closure tables becaus we have to update mtime, etag and size up to all tree roots (multiple because of sharing). Closure tables allow us to select or update all parents in one query.

Hmm that sounds cool. But that might not work for sizes, as the size will be different for each folder. Unless there is an aggregate query that can automatically calculate the sizes on the go while updating.

PVince81 · 2016-10-14T14:10:59Z

@butonic relevant for your rawstorage experiments as well ?

butonic · 2016-10-14T14:37:42Z

no but we could try implementing a local storage with a closure table based cache ... could ... if we had the time ...

PVince81 · 2018-03-05T13:00:36Z

other cases where closure tables would come in handy:

with a given folder, check whether it contains any children with a given tag: the closure table gives you direct access to all child file ids which you can join with the tags tables
with a given folder, find every parents with one query to check if one is shared

butonic added this to the Maybe someday milestone Jul 16, 2014

jancborchardt modified the milestones: Maybe someday, backlog Mar 23, 2015

butonic added the research label May 13, 2015

craigpg added the sev2-high label May 13, 2015

butonic mentioned this issue Sep 18, 2015

File Handling Scalability Improvements #18722

Closed

DeepDiver1975 mentioned this issue Apr 18, 2016

PUT requests very SLOW #11588

Closed

PVince81 mentioned this issue Apr 3, 2017

Async file operations #24509

Closed

ownclouders added the status/STALE label Jan 5, 2018

ownclouders removed the status/STALE label Mar 5, 2018

ownclouders added the status/STALE label Apr 5, 2018

felixboehm removed the sev2-high label Apr 9, 2018

ownclouders removed the status/STALE label Apr 9, 2018

ownclouders added the status/STALE label May 10, 2018

butonic closed this as completed Mar 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use closure tables to implement file cache without full path column #4209

Use closure tables to implement file cache without full path column #4209

butonic commented Jul 26, 2013 •

edited by DeepDiver1975

Loading

icewind1991 commented Jul 27, 2013

jancborchardt commented Jul 16, 2014

butonic commented Jul 16, 2014

butonic commented Jul 16, 2014

butonic commented May 13, 2015

MTRichards commented May 13, 2015

DeepDiver1975 commented May 13, 2015

butonic commented May 13, 2015

PVince81 commented May 28, 2015

PVince81 commented May 28, 2015

PVince81 commented Jun 8, 2015

PVince81 commented Jun 8, 2015

PVince81 commented Jun 8, 2015

PVince81 commented Oct 14, 2016

butonic commented Oct 14, 2016

PVince81 commented Mar 5, 2018

Use closure tables to implement file cache without full path column #4209

Use closure tables to implement file cache without full path column #4209

Comments

butonic commented Jul 26, 2013 • edited by DeepDiver1975 Loading

icewind1991 commented Jul 27, 2013

jancborchardt commented Jul 16, 2014

butonic commented Jul 16, 2014

butonic commented Jul 16, 2014

butonic commented May 13, 2015

MTRichards commented May 13, 2015

DeepDiver1975 commented May 13, 2015

butonic commented May 13, 2015

PVince81 commented May 28, 2015

PVince81 commented May 28, 2015

PVince81 commented Jun 8, 2015

PVince81 commented Jun 8, 2015

PVince81 commented Jun 8, 2015

PVince81 commented Oct 14, 2016

butonic commented Oct 14, 2016

PVince81 commented Mar 5, 2018

butonic commented Jul 26, 2013 •

edited by DeepDiver1975

Loading