Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ext storage performance with stat cache [$5] #7910

Closed
5 of 8 tasks
PVince81 opened this issue Mar 27, 2014 · 22 comments
Closed
5 of 8 tasks

Improve ext storage performance with stat cache [$5] #7910

PVince81 opened this issue Mar 27, 2014 · 22 comments

Comments

@PVince81
Copy link
Contributor

PVince81 commented Mar 27, 2014

Many ext storage implementations are calling remote APIs for every call of stat, is_dir, filetype, filemtime, file_exists etc.
Note that for many file operations it will intensively use these functions to make sure for example that a file doesn't exist before overwriting.

Some ext storages already use PHP's function that have their own internal stat cache (FTP, etc).

To make file operations faster (and browsing as well), we should add a local cache for all ext storage implementation that need them. See the Dropbox one as an example.

CC @karlitschek @icewind1991 (this is something I just noticed while fixing Openstack bugs)

@PVince81
Copy link
Contributor Author

PVince81 commented Jul 8, 2014

@karlitschek @craigpg can we schedule this ?

Some external storage backends are making needlessly lots of calls to the remote just for simple things like checking if the file exists. Some calls are repeated within a single PHP run, not good for performance.

@deklar
Copy link

deklar commented Jul 8, 2014

Why is FTP listed as check off/internal stat cache? My testing 20 minutes ago showed several hundred connections in just a few minutes to create one file and upload three more.

Tested SFTP/FTP/Webdav - each of these have the issue.

@PVince81
Copy link
Contributor Author

PVince81 commented Jul 8, 2014

@deklar from what I saw last time I checked (which was months ago) the FTP should be using PHP's stat cache already since all FTP calls are going through PHP's FS APIs which, AFAIK, are using a stat cache. Maybe your version of PHP or combination of plugins or config are disabling it somehow ?

@PVince81
Copy link
Contributor Author

PVince81 commented Jul 8, 2014

Here in the documentation of stat() it says that results are cached: http://www.php.net/manual/en/function.stat.php
(search for "statcache").

It also says stat() can be used with protocol wrappers, which leads to http://www.php.net/manual/en/wrappers.php

Last time I tried was with early OC 7 when testing FTP against a server that only allowed 5 connections per IP.

I haven't tried again, maybe something got broken since then.

@karlitschek karlitschek added this to the ownCloud 8 milestone Jul 8, 2014
@karlitschek
Copy link
Contributor

Definitely one of the key areas to work on for 8

@deklar
Copy link

deklar commented Jul 8, 2014

I don't see how my stat() would be any other than the built in PHP version that it came with.

At this point I think the best way to go forward with this one is to have someone verify that with OC 7 that it does in fact work for them (they will need access to the FTP server logs in order to verify this - I can supply you with such if needed).

If this cannot be easily duplicated by anyone on OC 7 (in other words, if it works properly), I'd be interested in knowing the PHP version, the compile time options, and having a copy of the php.ini file.

Also keep in mind that the number of connections per IP is not the same thing as the number of connections per minute. I did not see more than one or two connections at one time - however several hundred back to back (one terminates, another one starts). This may have been why you were successful in your testing.

@deklar
Copy link

deklar commented Jul 9, 2014

I'm not even sure if this is related or not, but I switched from the PHP built in opcache to xcache. Am I completely off base thinking that there are simply too many calls going to the FTP side (and others) that really don't need to happen? If SQL holds each file location, the last time it has been touched, the size, etc - then there would be no need to read a directory (remote or otherwise) unless you are going to modify that file, pull that file for reading, or delete that file (and of course upload a new file). For example, when I upload a file to a SFTP map, there should be one connection to drop that file in place, and perhaps a second one to verify the file size - but the rest should be done internally, no?

Here is an example of just FTP on what I have been tracking:
Just entering the mapped folder - will issue two commands that are invalid by FTP specs, twice per connection, and it will make three connections (this is just for ENTERING the path). Those commands are SIZE and MDTM - OC is trying to run these against the root path, which it should never do - because that is the root path. Furthermore, it shouldn't be running this against directories anyways, only filenames within that path.

Uploading a single 120KB file:

  1. Login three times, issue the SIZE and MDTM on the root each time.
  2. Login EIGHT times, attempt to CWD to the filename (why would it be doing this? Does not OC have a database of files that exist? - moreso why would it be trying to cwd into a FILENAME)? It then (in that same session) attempts to run a SIZE on that filename that it should know does not exist. Even if it has to do this for some reason I do not understand, EIGHT times?
  3. Actually UPLOADS the file with a single session.

(this is where it gets really really bad, as if it wasn't bad enough before)

  1. Logs back in after uploading, and DELETES the filename it just uploaded! Then re-uploads it, again.
  2. Repeat steps 1-2 again.
  3. Some other steps (the 3 second log file is 757 lines long, I don't want to bore you all).

@PVince81
Copy link
Contributor Author

PVince81 commented Jul 9, 2014

@deklar like I said, ownCloud is using the PHP functions so if those are doing more than one command to the FTP server than it it partially PHP's fault. But I agree that some buffering should be done on the ownCloud side as well.

If you're willing to dig deeper I suggest you to setup XDebug and a debugger, then step into where the calls are done in apps/files_external/lib/ftp.php. The class is based off the streamwrapper here apps/files_external/lib/streamwrapper.php that maps to the regular PHP functions.

Normally if filesize is called multiple times in the same PHP call it should only reach the server once, because the data would then be cached in PHP's statcache (at least until clearstatcache is called).

I do agree that uploading shouldn't require that many calls, not sure why it is happening.

The only place where I'd expect a flood of calls is when scanning for changes.
Normally the procedure is to check the mtime of the root folder. If that one changed, rescan that directory.

@deklar
Copy link

deklar commented Jul 9, 2014

I dug into this a bit last night, it was clear that PHP is handling the FTP calls - but I'm not going to be too quick to blame PHP here just yet as something on the OC side is calling for those calls to be made (for example, the deletion of the file after it was uploaded, only to re-upload it again, I suspect that is not the PHP side). That alone is enough reason not to run the FTP end, and I'm not even sure if this is what is taking place on local file systems for that matter.

I do like OC, it is pretty unique and offers a lot for an individual or a group of people -- but my goal was to offer this to customers at some point .. which is looking less likely.

@butonic
Copy link
Member

butonic commented Oct 30, 2014

@PVince81 @icewind1991 I thought a lot about #7897, #11712 and ideas from #11797. External filesystem backends will wildly vary in capabilities and we should

  1. make use of the filecache table as a stat cache (as in [WIP] new s3 scanner #11712)
  2. use a request/session in-memory cache as a stat cache (as in Added object cache for Swift ext storage #7897)
  3. if a directory is listed via web ui or webdav, the listing is fetched from the external source and used to update the filecache & in memory cache on the fly. This ensures users will always see an up to date filelisting. The trip to the external source can be omitted if ownCloud has exclusive access.
  4. hasUpdated() should by default return false (the majority of backends has no way of reliably and quickly answering the question)
    1. webdav to an oc instance can make a stat call.
    2. all other backends rely on a backround job to update the filecache
  5. A background job periodically triggers a full filescan of the remote source to update the filecache (it traverses the complete directory tree, regardless of mtimes / etags as not all backends can provide them)
    1. webdav can use etags to check if a directory needs to be updated
    2. local can use mtime (if not a FAT partition)
    3. FTP / SFTP
    4. amazon s3 / swift should to iterate over all objects instead of traversing the fake directory tree
  6. If the remote source provides a notification mechanism for file changes the periodic scan can be executed less freqently

@butonic
Copy link
Member

butonic commented Oct 30, 2014

Another Idea: We could check the mtime / etag behavior of the storage when configuring the mount point:

  1. create a dir and remember mtime / etag
  2. create file in dir and remember mtime / etag
  3. check if dir mtime / etag changed
  4. try three folders deep and check root
  5. depending on mtime / etag select scanning strategy
    1. if etag of folder updates as well, rely on etag changes of root folder
    2. if mtime of folder updates as well, rely on mtime changes of root folder
    3. fall back to full scan
  6. use special scanners
  7. amazon s3 -> iterate over all objects
  8. swift -> iterate over all objects
  9. google drive -> use updates API and remember last update
  10. dropbox -> delta?

the necessary scanning strategy might depend on the remote source:

  1. for the local backend FAT partitions don't update the mtime of parent folders
  2. not all (S)FTP servers return an mtime (and it relies on the underlying filesystem, see FAT above)
  3. not all webdav implementations return correct etags / mtimes

Future backends might should be easier to implement if we can automatically determine a scanning strategy.

The scanners could measure how long it takes to do a scan and adjust the rescan time accordingly, as in Minutes, Hours, Days or Weeks.

@PVince81
Copy link
Contributor Author

I noticed today there's also a problem with etag propagation: when you have a deep folder structure on S3 and create a file inside the last folder, the etag of every parent folder is updated. For some reason there's a long delay for every etag update. I guess it is maybe internally trying to check whether the folder exists or something.

Using the oc_filecache table as stat cache will also help a lot with this (#11712)

@DeepDiver1975
Copy link
Member

@PVince81 8.0 8.1 ?? THX

@PVince81
Copy link
Contributor Author

PVince81 commented Jan 9, 2015

This is more of an ongoing topic. I'd say backlog for now, but we need to keep it in sight when improving external storage.

@DeepDiver1975 DeepDiver1975 modified the milestones: 8.1-next, ownCloud 8 Jan 9, 2015
@DeepDiver1975
Copy link
Member

moving to 8.1-next - just no to forget 🙊

@PVince81
Copy link
Contributor Author

PVince81 commented Feb 4, 2015

DAV doesn't have a stat cache and it looks bad with server to server sharing, see #13882

@PVince81
Copy link
Contributor Author

PVince81 commented Feb 4, 2015

@icewind1991 would it make sense to implement a generic stat cache as a storage wrapper ?
Then ext storage instances could have an option when registering the backend to wrap or not to wrap, because some might not need it (like "FTP" because it already uses PHP's protocol wrapper stat cache)

@PVince81 PVince81 mentioned this issue Feb 4, 2015
22 tasks
@PVince81
Copy link
Contributor Author

PVince81 commented Feb 9, 2015

Stat cache storage wrapper ticket here: #13971

@PVince81
Copy link
Contributor Author

This is an ongoing effort, and we're past feature freeze. Moving to 9.0

@PVince81 PVince81 modified the milestones: 9.0-next, 8.2-current Sep 21, 2015
@PVince81
Copy link
Contributor Author

While grepping the code I just noticed that the GDrive library and also the AWS library both have the option to use Memcache. Not sure if it uses it for caching results or like a stat cache. Something to look into, maybe.

@PVince81
Copy link
Contributor Author

  • Important: use a "cap" in stat caches to make sure that occ files:scan on big storages doesn't run out of memory

See how it's done for SMB here #21648

@PVince81
Copy link
Contributor Author

Feature freeze => 9.1

@PVince81 PVince81 modified the milestones: 9.1-next, 9.0-current Feb 12, 2016
@PVince81 PVince81 modified the milestones: 9.1-current, 9.2-next Jun 14, 2016
@DeepDiver1975 DeepDiver1975 changed the title Improve ext storage performance with stat cache Improve ext storage performance with stat cache [$5] Feb 7, 2017
@PVince81 PVince81 modified the milestones: 10.0.1, 10.0 Apr 11, 2017
@PVince81 PVince81 modified the milestones: triage, 10.0.1 May 17, 2017
@PVince81 PVince81 modified the milestones: triage, maybe some day May 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants