Chunk file disappears from cache during sync #24653

Closed
Marginal opened this Issue May 16, 2016 · 38 comments


@Marginal
Marginal commented May 16, 2016 edited

Syncing large files with the 2.1.1 (build 3107) client on OSX against ownCloud 9.0.2 on a slowish 384KB/s link, I'm getting an "expected filesize 5242880 got 0" exception on about 5% of the chunk uploads. I've added some additional logging to OC\Files\View::file_put_contents() (see owncloud.log below), and it appears that Apache is receiving the file chunk contents OK and ownCloud is writing the contents to the filesystem cache OK, but the chunk files are occasionally disappearing from the filesystem before OC_FileChunking::store() returns.

Any pointers on how to debug this gratefully received. On the assumption that the underlying filesystem isn't randomly losing files, my working hypothesis is that the cache is occasionally clearing up chunk files too aggressively while they're still being written.

Also, I'm aware of #9832 but that issue seems a mess of various problems, and of #11778 but this seems to be a different problem. Could be related to owncloud/client#4421.

Steps to reproduce

  1. Upload 6 100MB files with the ownCloud sync client over a slowish link.

Expected behaviour

Files sync OK.

Actual behaviour

Client reports "Error download https://xxx/xxx-chunking-xxx-y-z - server replied: Bad Request (expected filesize 5242880 got 0)"

Server configuration

Operating system: Linux

Web server: Apache 2.4.16

Database: MariaDB 10.1.14

PHP version: 5.5.32

ownCloud version: 9.0.2.2

Updated from an older ownCloud or fresh install: Updated incrementally from 5.0.7

Where did you install ownCloud from: https://download.owncloud.org/community/owncloud-9.0.2.zip

Signing status (ownCloud 9.0 and above):

No errors have been found.

List of activated apps:

Files
Gallery
Contacts
Calendar

The content of config/config.php:

<?php
$CONFIG = array (
  'debug' => false,
  'instanceid' => '51b4613a9b05a',
  'passwordsalt' => 'Redacted',
  'datadirectory' => '/home/marginal/owncloud-data',
  'dbtype' => 'mysql',
  'dbdriveroptions' => 
  array (
    1002 => 'SET wait_timeout = 3600',
  ),
  'version' => '9.0.2.2',
  'dbname' => 'marginal_owncloud',
  'dbhost' => 'localhost',
  'dbtableprefix' => 'oc_',
  'dbuser' => 'Redacted',
  'dbpassword' => 'Redacted',
  'installed' => true,
  'forcessl' => true,
  'logfile' => '/home/marginal/owncloud.log',
  'loglevel' => 0,
  'log_rotate_size' => 67108864,
  'maintenance' => false,
  'theme' => '',
  'trusted_domains' => 
  array (
    0 => 'Redacted',
  ),
  'secret' => 'Redacted',
  'share_folder' => '/Shared',
  'forceSSLforSubdomains' => true,
  'trashbin_retention_obligation' => 'auto',
);

Are you using external storage, if yes which one: Local

Are you using encryption: No

Are you using an external user-backend, if yes which one: None

Client configuration

Browser: Firefox 46.0.1

Operating system: OSX 10.10.5

Logs

Web server access log

x.x.x.x - - [17/May/2016:00:30:19 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-2 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:08 +0200] "GET /ocs/v1.php/cloud/activity?page=0&pagesize=100&format=json HTTP/1.1" 200 242 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:30:19 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-0 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:30:19 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-1 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:20 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 374 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:29 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 416 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:08 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-3 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:52 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 374 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:09 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-4 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:02 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 416 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:39 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-6 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:27 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 374 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:04 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-7 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:34 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 416 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:09 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-5 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:06 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-8 HTTP/1.1" 400 225 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"

ownCloud log (data/owncloud.log)

{"reqId":"VzpJ@23JklEAADcNxJsAAAAG","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-2.YQIY9G9r9RWnP7YS.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:31:06+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-2","user":"xxx"}
{"reqId":"VzpJ@23JklEAADVGWzcAAAAM","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-0.5MiXMy0jYPhWsfQG.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:31:07+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-0","user":"xxx"}
{"reqId":"VzpJ@23JklEAADRc0bsAAAAD","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-1.JxhlUUjsdiqlImLV.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:31:07+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-1","user":"xxx"}
{"reqId":"VzpKLG3JklEAADpXFl4AAAAI","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-3.9caCvTHfJ9mRqU9Z.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:31:37+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-3","user":"xxx"}
{"reqId":"VzpKLW3JklEAADbZQCoAAAAC","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-4.g8dA06k5JAF1iUzp.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:32:00+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-4","user":"xxx"}
{"reqId":"VzpKS23JklEAADhbAwwAAAAJ","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-6.3YH71eRiLUiVjh9C.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:32:04+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-6","user":"xxx"}
{"reqId":"VzpKZG3JklEAADnhLgIAAAAB","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-7.bmRfooGP900enSbt.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:32:28+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-7","user":"xxx"}
{"reqId":"VzpKLW3JklEAADcNxJ4AAAAG","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-5.2a6iZC56JUVnROmJ.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:32:37+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-5","user":"xxx"}
{"reqId":"VzpKZm3JklEAACfcAz8AAAAF","remoteAddr":"x.x.x.x","app":"PHP","message":"filesize(): stat failed for \/home\/marginal\/owncloud-data\/xxx\/cache\/100M-2-chunking-1949736465-8.adjGuN5SCj567wO6.part at \/home\/marginal\/public_html\/lib\/private\/files\/storage\/local.php#142","level":3,"time":"2016-05-16T22:32:53+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-8","user":"xxx"}
{"reqId":"VzpKZm3JklEAACfcAz8AAAAF","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-8.adjGuN5SCj567wO6.part: exists: 0, readable: 0, size: 0","level":1,"time":"2016-05-16T22:32:53+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-8","user":"xxx"}
{"reqId":"VzpKZm3JklEAACfcAz8AAAAF","remoteAddr":"x.x.x.x","app":"webdav","message":"Exception: {\"Message\":\"HTTP\\\/1.1 400 expected filesize 5242880 got 0\",\"Exception\":\"Sabre\\\\DAV\\\\Exception\\\\BadRequest\",\"Code\":0,\"Trace\":\"#0 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/file.php(103): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\File->createFileChunked(Resource id #320)\\n#1 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/directory.php(134): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\File->put(Resource id #320)\\n#2 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(1036): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\Directory->createFile('100M-2-chunking...', Resource id #320)\\n#3 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/CorePlugin.php(523): Sabre\\\\DAV\\\\Server->createFile('Test\\\/100M-2-chu...', Resource id #320, NULL)\\n#4 [internal function]: Sabre\\\\DAV\\\\CorePlugin->httpPut(Object(Sabre\\\\HTTP\\\\Request), Object(Sabre\\\\HTTP\\\\Response))\\n#5 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/event\\\/lib\\\/EventEmitterTrait.php(105): call_user_func_array(Array, Array)\\n#6 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(459): Sabre\\\\Event\\\\EventEmitter->emit('method:PUT', Array)\\n#7 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(248): Sabre\\\\DAV\\\\Server->invokeMethod(Object(Sabre\\\\HTTP\\\\Request), Object(Sabre\\\\HTTP\\\\Response))\\n#8 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/appinfo\\\/v1\\\/webdav.php(55): Sabre\\\\DAV\\\\Server->exec()\\n#9 \\\/home\\\/marginal\\\/public_html\\\/remote.php(138): require_once('\\\/home\\\/marginal\\\/...')\\n#10 
{main}\",\"File\":\"\\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/file.php\",\"Line\":406,\"User\":\"xxx\"}","level":4,"time":"2016-05-16T22:32:54+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-8","user":"xxx"}

Browser log

n/a

@PVince81
Collaborator

clearing up chunk files too aggressively while they're still being written

I'd also suspect that.
Are you able to do a ls -l of the chunk file before it disappears ?
The way cleanup works is that it sets the mtime of the chunk 24 hours in the future.

The cleanup job is then triggered at login time (login hook) and discards all chunks whose mtime is < now. Maybe the timestamp/timezone of the server is causing the mtime to be wrong?

@PVince81 PVince81 added the needs info label May 17, 2016
@Marginal
Marginal commented May 17, 2016 edited

Are you able to do a ls -l of the chunk file before it disappears ?

Yes - the offending chunk files are definitely created on the filesystem, and are visible to a sftp client (I'm running on shared hosting).

Then cleanup job is triggered at login time (login hook) and would discard all chunks which mtime is < now.

I'm looking into this now. While the chunk file is being written to, its mtime ≈ now. The mtime is only set 24 hrs into the future after the file handle has been closed, in OC\Cache\File::set.

@PVince81
Collaborator

The mtime is only set 24hrs into the future after the file handle has been closed, in OC\Cache\File::set.

Hmmmm, indeed. But then there shouldn't be a huge time window between when the file is written and when the touch operation is done.
Unless file_put_contents is slow and already creates the file entry while it's writing it?
Chunks are usually 20 MB big, and if the transfer is slow...

Also the code uses a .part file while it's being written. But the gc code doesn't care about the extension. This is correct because we might want to clear stray part chunks too.

Were you able to observe the part file while it's being written? Does it exist at that point?

You might be onto something, keep posting your findings 😄
Will do some local testing with a debugger, this looks fishy.

@PVince81 PVince81 added the bug label May 17, 2016
@PVince81
Collaborator

I did some debugging and found out that this file_put_contents from chunking is actually calling streamCopy in a loop, because the chunk content is a file handle, not data.
When checking the timestamp of the file on disk, I noticed that for every fwrite the mtime of the file was changed. So it's a game of cat and mouse until the GC comes to eat the file, if its mtime happens to be older than now.

CC @RealRancor @guruz

@PVince81
Collaborator

So basically there's a loop that copies the bytes 8k by 8k, and for every fwrite the timestamp gets adjusted. If one of them takes longer and at the same time a login hook is triggered, it will kill the chunk too early.

@PVince81 PVince81 added this to the 9.0.3-current-maintenance milestone May 17, 2016
@PVince81
Collaborator

@Marginal great debugging, thank you!

@PVince81
Collaborator

A possible fix would be to not use touch but do the actual time calculation in the gc method.
Basically, instead of setting the chunk's mtime to the future, we leave it be and gc the files where mtime < now - 24 hours.

@Marginal
Marginal commented May 17, 2016 edited

Yep - I added some logging to OC\Cache\File::gc and I can see this happening:

{"reqId":"VzsdGG3JklEAAA0Z70YAAAAB","remoteAddr":"x.x.x.x","app":"OC\\Files\\View::file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-49666361-2.flUvxizYuEcTCGAx.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-17T13:31:42+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-2","user":"xxx"}
{"reqId":"VzsdWm3JklEAAA5505IAAAAN","remoteAddr":"x.x.x.x","app":"OC\\Cache\\File::gc","message":"Discarding: 100M-2-chunking-49666361-4.ZzxwB0pcWur6mTEJ.part","level":1,"time":"2016-05-17T13:32:11+00:00","method":"PROPPATCH","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}
{"reqId":"VzsdWm3JklEAAA5505IAAAAN","remoteAddr":"x.x.x.x","app":"OC\\Cache\\File::gc","message":"Discarding: 100M-3-chunking-1403625106-1.G5YnsFJG5LIfaI49.part","level":1,"time":"2016-05-17T13:32:11+00:00","method":"PROPPATCH","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}
{"reqId":"VzsdWm3JklEAAA5505IAAAAN","remoteAddr":"x.x.x.x","app":"OC\\Cache\\File::gc","message":"Discarding: 100M-2-chunking-49666361-3.nZoppMlNyBs6FbOa.part","level":1,"time":"2016-05-17T13:32:11+00:00","method":"PROPPATCH","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}
{"reqId":"VzsdNG3JklEAABPYdxEAAAAD","remoteAddr":"x.x.x.x","app":"PHP","message":"filesize(): stat failed for \/home\/marginal\/owncloud-data\/xxx\/cache\/100M-2-chunking-49666361-3.nZoppMlNyBs6FbOa.part at \/home\/marginal\/public_html\/lib\/private\/files\/storage\/local.php#142","level":3,"time":"2016-05-17T13:32:13+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-3","user":"xxx"}
{"reqId":"VzsdNG3JklEAABPYdxEAAAAD","remoteAddr":"x.x.x.x","app":"OC\\Files\\View::file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-49666361-3.nZoppMlNyBs6FbOa.part: exists: 0, readable: 0, size: 0","level":1,"time":"2016-05-17T13:32:13+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-3","user":"xxx"}
{"reqId":"VzsdNG3JklEAABPYdxEAAAAD","remoteAddr":"x.x.x.x","app":"OC\\Files\\Cache\\Cache::remove","message":"Removing: cache\/100M-2-chunking-49666361-3, id: 66491","level":1,"time":"2016-05-17T13:32:14+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-3","user":"xxx"}
{"reqId":"VzsdNG3JklEAABPYdxEAAAAD","remoteAddr":"x.x.x.x","app":"webdav","message":"Exception: {\"Message\":\"HTTP\\\/1.1 400 expected filesize 5242880 got 0\",\"Exception\":\"Sabre\\\\DAV\\\\Exception\\\\BadRequest\",\"Code\":0,\"Trace\":\"#0 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/file.php(103): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\File->createFileChunked(Resource id #320)\\n#1 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/directory.php(134): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\File->put(Resource id #320)\\n#2 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(1036): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\Directory->createFile('100M-2-chunking...', Resource id #320)\\n#3 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/CorePlugin.php(523): Sabre\\\\DAV\\\\Server->createFile('Test\\\/100M-2-chu...', Resource id #320, NULL)\\n#4 [internal function]: Sabre\\\\DAV\\\\CorePlugin->httpPut(Object(Sabre\\\\HTTP\\\\Request), Object(Sabre\\\\HTTP\\\\Response))\\n#5 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/event\\\/lib\\\/EventEmitterTrait.php(105): call_user_func_array(Array, Array)\\n#6 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(459): Sabre\\\\Event\\\\EventEmitter->emit('method:PUT', Array)\\n#7 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(248): Sabre\\\\DAV\\\\Server->invokeMethod(Object(Sabre\\\\HTTP\\\\Request), Object(Sabre\\\\HTTP\\\\Response))\\n#8 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/appinfo\\\/v1\\\/webdav.php(55): Sabre\\\\DAV\\\\Server->exec()\\n#9 \\\/home\\\/marginal\\\/public_html\\\/remote.php(138): require_once('\\\/home\\\/marginal\\\/...')\\n#10 
{main}\",\"File\":\"\\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/file.php\",\"Line\":406,\"User\":\"xxx\"}","level":4,"time":"2016-05-17T13:32:14+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-3","user":"xxx"}

I guess the chunk file should be locked while being written?

@PVince81
Collaborator

I just realized that the set function is expecting a ttl here: https://github.com/owncloud/core/blob/v9.0.2/lib/private/cache/file.php#L100
This must be why it was implemented this way. So the approach of moving the ttl away won't work.
However I'm not aware of any code path that uses that $ttl argument.

@Marginal let's examine your idea about locking.
It looks like the code that calls file_put_contents already acquires locks.
But the gc one doesn't, so let's see if we can add lock acquiring there as well.

@PVince81
Collaborator

Hmmm, I went through it with a debugger and found out that both file_put_contents from OC_FileChunking and unlink from gc acquire locks using the locking provider.

However, from what I remember, the locking provider was limited to work only on the "files/" entries, not the "cache/" entries. If we make it work on the latter as well, it should prevent such situations.

@Marginal

I can see that OC\Files\View::file_put_contents is creating locks, but I'm not seeing any "Could not cleanup locked chunk" debug messages from OC\Cache\File::gc, so they're not being applied.

@PVince81
Collaborator

@Marginal this is why: #24672
You might want to try with this patch, but I'm not 100% convinced because locking will cause some performance overhead, especially with the database locking provider.

@PVince81
Collaborator
PVince81 commented May 17, 2016 edited

I'm not so happy about using locking here because in some environments the DB locking provider is likely to be much slower when uploading many chunks. For every chunk it needs to allocate locks and change them during the course of the file operation.

Also the locking approach isn't backportable to OC < 8.1 where locking wasn't available yet.

Let's list the possible approaches:

  1. Get rid of the $ttl argument in OC\Cache\File::set() and have the ttl value in the gc() method directly.
    • the good: no locking overhead, single ttl value for every cached file
    • the bad: this is private API, so breaking it might be fine, but still... backporting an API change isn't nice
  2. Enable locking for the "cache" folder (POC here #24672)
    • the good: locking makes sure that there are no concurrency issues
    • the bad: more performance overhead, especially with DB locking; not backportable to OC < 8.1
  3. Add an extra 24 hours to the ttl in the gc function's checks
    • the good: solves the issue without an API change, and it's backportable to OC < 8.1
    • the bad: the ttl contract isn't respected, since a file might stay 24 hours longer than expected

@icewind1991 @schiesbn @nickvergessen @DeepDiver1975 any insights ?

@Marginal

Thanks, I'll try #24672.

IMHO your solution 1 seems cleanest - it's weird having mtimes in the future, and even weirder relying on this.

@PVince81
Collaborator

@Marginal yep, I agree on the weirdness. This is very, very old code so there might have been a reason in the past. I'll make a POC for solution 1) as well so you can try it on your magic env 😄

@PVince81
Collaborator
PVince81 commented May 17, 2016 edited

Oh noes 😦:

It looks like this is part of the public API, which means there could be apps using this and expecting the $ttl value to work.

Another idea I had was to encode the expected mtime value into the file name, but that makes it difficult to find the file name again when only the $key is known.

@PVince81
Collaborator

Okay, yet another idea: if we assume that the mtime of a finished chunk is always correct but the one of a part chunk might be wrong, then add an extra 24 hours of tolerance for part chunks in the gc() method. This is especially relevant considering that part chunks cannot be found concurrently by `get()` as they aren't finished and have a unique id.

@Marginal

It looks like this is part of the public API, which means there could be apps using this and expecting the $ttl value to work.

I guess that explains the "mtimes in the future" implementation.

if we assume that the mtime of a finished chunk is always correct but the one of a part chunk might be wrong, then add an extra 24 hours tolerance for part chunks in the gc() method.

Kinda ugly to special-case .part files in gc, but if you don't want the overhead of locking the chunk files, this might be the only practical solution.

@PVince81
Collaborator

Steps to reproduce the original issue

  1. Edit https://github.com/owncloud/core/blob/v9.0.2/lib/private/cache/file.php#L116 and add sleep(5); before the touch() operation
  2. Open cadaver with the current user on the console (every file operation triggers a login due to basic auth)
  3. Set up the desktop client
  4. Upload a big file (~200 MB) with the desktop client
  5. While the upload is running, run ls several times in cadaver to trigger the login hook, which will clean up the part file
  6. 💣 desktop client says the upload failed, log says "Expected filesize X got 0"
@PVince81
Collaborator

I'd like @icewind1991's wisdom regarding locking overhead to make a decision between solution 2 (enable locking for "cache" folder) and solution 4 (extra expire time in GC for part files). For the backport to 8.0 which doesn't have the locking code, I'd go for 4 anyway.

@Marginal
Marginal commented May 17, 2016 edited

Testing your solution 2 #24672:

{"reqId":"Vzsp-m3JklEAAAbfqkUAAAAB","remoteAddr":"x.x.x.x","app":"OC\\Cache\\File::gc","message":"Discarding: 100M-2-chunking-453979753-18.DsYu8Yn3IYL09Jmp.part","level":1,"time":"2016-05-17T14:26:07+00:00","method":"PROPFIND","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}
{"reqId":"Vzsp-m3JklEAAAbfqkUAAAAB","remoteAddr":"x.x.x.x","app":"core","message":"Exception when running cache gc: $absolutePath must be relative to \"files\"","level":2,"time":"2016-05-17T14:26:07+00:00","method":"PROPFIND","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}

The sync is still proceeding happily though!

@PVince81
Collaborator

Exception when running cache gc: $absolutePath must be relative to "files"

Okay, looks like I might have missed something.
Thanks for testing!

@Marginal

Looks like there are assumptions that locked files only reside under /files, in OC\Files\View::getPathRelativeToFiles and ::getUidAndFilename. Perhaps also elsewhere?

@PVince81
Collaborator

@icewind1991 popped up on IRC and suggested we just add - 1 min in the gc code instead.

But this would assume that an fwrite of 8k never takes more than a minute, which is still a risky assumption.

@Marginal you're right, it might need special handling for the cache folder 😦

@PVince81
Collaborator

On the other hand, if 8k takes more than an hour to write, then something is likely wrong with the storage and the upload will likely be aborted anyway.

I think I'll add - 1 hour to the gc code. That quick fix should suit all use cases for now.
Diving deeper into the locking code is likely to cause other side effects and potential regressions.

@PVince81 PVince81 removed the needs info label May 17, 2016
@Marginal
Marginal commented May 17, 2016 edited

@icewind1991 popped up on IRC and suggested we just add - 1min in the gc code instead.

Over a busy connection with 6 streams running simultaneously, I can see .part files take several minutes to upload. Should it perhaps use PHP's max_input_time and/or max_execution_time value instead of an arbitrary 1 min / 1 hr / 1 day value?

@PVince81
Collaborator

The delay only matters for a single fwrite call of 8192 bytes. After each fwrite, the on-disk mtime is adjusted to the current time. So as long as the fwrite loop continues to run, even slowly, the mtime should stay ahead of the GC's eager chasing.

@PVince81
Collaborator

Simplified fix with extra time tolerance is here: #24676

@PVince81
Collaborator

@Marginal if you say several minutes per chunk: one chunk is usually 20 MB, I believe (or was it reduced to 10?). Several minutes, ok, let's say 8 minutes, which means 2.5 MB a minute, ~41 KB a second. So within one second the fwrite method will be called more than 5 times, which means the mtime of the file gets updated 5 times within a second. Should be enough 😄
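The back-of-the-envelope numbers above, as a quick check (decimal megabytes assumed, matching the ~41 KB/s figure in the comment):

```python
chunk_bytes = 20 * 10**6   # one ~20 MB chunk (decimal megabytes)
duration_s = 8 * 60        # pessimistic: 8 minutes to upload the chunk

rate = chunk_bytes / duration_s   # ~41,667 bytes/s, i.e. ~41 KB/s
fwrites_per_s = rate / 8192       # one mtime touch per 8 KiB fwrite

# rate ≈ 41.7 KB/s, so the mtime is refreshed ~5 times every second,
# keeping the file ahead of any gc() using a "now - 1 hour" cutoff.
```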

@Marginal
Marginal commented May 17, 2016 edited

The default chunk size is now 5MB for the 2.1 client.

You're correct that the mtime is updated every 8 KB, so 1 hr of leeway should be fine.

@PVince81
Collaborator

๐Ÿธ

Let me know if the new PR works for you and we can move it forward to the fixing pipeline / backports, etc

@Marginal

#24676 works for me - I uploaded 6 x 100MB files with no "expected filesize 5242880 got 0" errors.

@PVince81
Collaborator

@Marginal thanks for testing!

master PR is here: #24691 (for 9.1)
I'll list the backport PRs in that one.

@guruz
Contributor
guruz commented May 18, 2016

Awesome to see this fixed!

@PVince81
Collaborator

Yep. The reason we didn't find it earlier is that it's hidden behind a $storage->file_put_contents() call which, at first glance, appears to be atomic. However, due to the many layers of OC, that call is transformed into a loop that copies blocks, which isn't atomic.

@PVince81
Collaborator

I'm going to link all other related tickets to this one for the solution.

For those of you arriving here from other tickets, please try this PR #24691 or one of the related backports #24691 (comment). Alternatively you can just wait for the next maintenance releases 😄

@PVince81
Collaborator

@RealRancor if you see people on the forums with the error "Expected filesize X got 0", you might want to direct them to this solution. Thanks 😄

@enoch85
Member
enoch85 commented May 18, 2016

Finally! Thanks @Marginal and @PVince81 🎆 👍

@PVince81 PVince81 closed this in #24691 May 18, 2016