Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving ext storage with streaming #5949

Closed
5 of 10 tasks
PVince81 opened this issue Nov 19, 2013 · 39 comments
Closed
5 of 10 tasks

Improving ext storage with streaming #5949

PVince81 opened this issue Nov 19, 2013 · 39 comments

Comments

@PVince81
Copy link
Contributor

PVince81 commented Nov 19, 2013

Currently the performance of external storage is bad because the file must first be downloaded into a temporary file, then fopen() is called on that file and the handle is returned. This means that downloading a 4 GB file from an ext storage will first create a 4 GB temporary file which is then passed to the client as download.

In some cases like downloading, mimetype scanning (if we keep it), antivirus app, etc we are only interested in getting either the first bytes of the stream (fread() then fclose()) or stream the whole file sequentially. No fseek() needed.

In such cases, it might be more efficient to do a fopen() on the stream directly, if possible. It seems that PHP allows fopen() on HTTP URLs. We could just stream the body of the response as it comes into the hooks and back to the client.

I had a quick look and it looks like most external storages could be modified to use streaming, as many of them use HTTP calls anyway.

The alternative to this would be to use a library like CURL that uses threads to pre-download the file into the temporary file. The control could be given back to the caller before the file is finished downloading, so that they can already start working on the start of the temporary file.

Please let me know what you think of this idea.
@icewind1991 @karlitschek @schiesbn @DeepDiver1975 @bantu

List of storages

@PVince81
Copy link
Contributor Author

Here is an example how to use streams with fopen():
http://php.net/manual/en/wrappers.http.php

@PVince81
Copy link
Contributor Author

@schiesbn the only fseek() call I know so far is the one to detect whether a file is encrypted. As we discussed before, this can be replaced with an approach that simply detects whether keys exist for that file.

@karlitschek
Copy link
Contributor

And additionally we could experiment with parallel http request to reduce roundtrip time. .
https://code.google.com/p/rolling-curl/
https://github.com/owncloud/administration/blob/master/performance-tests/oc-stress.php

@PVince81
Copy link
Contributor Author

Just saw that the webdav.php storage already uses curl, but it still does a full-download of the files.
It also uses "php://temp" as temporary file instead of \OC_Helper::tmpFile($ext).
From the doc it seems that "php://temp" is using memory for up to 2 MB.

And the FTP storage does a fopen() directly on the ftp:// URL.
Seems to work with http as well (haven't tried https yet)

@PVince81
Copy link
Contributor Author

  • Check whether thumbnail creation is using streaming (direct fopen) or temporary files (@georgehrke)
  • Find a way to stream with fopen() into avconv/ffmpeg instead of using a temp file (or raise an issue for that)

@PVince81 PVince81 modified the milestones: Maybe someday, ownCloud 6, ownCloud 7 Mar 6, 2014
@PVince81
Copy link
Contributor Author

PVince81 commented Mar 6, 2014

For OC7 we will try and introduce streaming to WebDAV mounts and SMB mounts.
If time is left, then try and fix the other ones as well.

@bantu
Copy link

bantu commented Mar 6, 2014

@PVince81 Doing WebDAV streaming should be not too hard to implement. How do you want to approach SMB/CIFS streaming?

@PVince81
Copy link
Contributor Author

PVince81 commented Mar 6, 2014

Piping with smbclient. I've already experimented with it and it works, see #4770 (comment)

But first we need to get @icewind1991 's new SMB impl in: #4770

@georgehrke
Copy link
Contributor

Find a way to stream with fopen() into avconv/ffmpeg instead of using a temp file (or raise an issue for that)

We don't create a temp file of the whole file, just for the first 5MB.
https://github.com/owncloud/core/blob/master/lib/private/preview/movies.php#L47

@PVince81
Copy link
Contributor Author

PVince81 commented Mar 6, 2014

@georgehrke yes I know, that's why streaming ext storage will improve that as well.
Currently most ext storage will FIRST download the full file to the temp storage, then call fopen on the temp file and give the stream to you (the preview code)

@bantu
Copy link

bantu commented Mar 6, 2014

@PVince81 Ah okay, I was expecting a PHP implementation of SMB/CIFS, hehe. :-D

@PVince81 PVince81 self-assigned this Mar 6, 2014
@PVince81
Copy link
Contributor Author

Just had a look at WebDAV streaming.
It is possible to stream download using fopen('https://...') to the WebDAV server.
BUT the stream wrapper doesn't support writing, which means we'll still need the temporary files.

I also looked at curl and it seems that it's not possible to pipe files to curl, at least not using the "curl_exec" functions.

@bantu
Copy link

bantu commented Mar 26, 2014

@PVince81 How much work would it be to implement a stream wrapper that can be written to?

@PVince81
Copy link
Contributor Author

Not sure, but it could be an idea. There are callback functions for read and write, so these could be used to buffer the next block in the wrapper.

I've also found this: http://guzzle.readthedocs.org/en/latest/ (which happens to be used by the Amazon API lib)

@MorrisJobke MorrisJobke removed this from the ownCloud 7 milestone Jun 18, 2014
@lokeller
Copy link

Is this still in the roadmap?

In my case I'm now using a workaround. I substitute the code in lib/private/files/storage/dav.php in function DAV::fopen for the case "r" and "rb" with the following code:

if (!$this->file_exists($path)) {
  return false;
}
return fopen($this->createBaseUri() . $this->encodePath($path), $mode);

and I can now stream files directly from the external remote. This may break stuff elsewhere (for instance any place that calls fseek on the returned stream) and it is not 100% functionally equivalent to the code I substituted (for instance maybe SSL validation) but now I can finally use external storage in my usecase (starting to consume > 1GB files as quickly as possible while reading from a very slow external storage).

I studied a bit a possible real solution to the problem starting from the pointers that PVince81 added above. One way to properly fix this seems to be to create a new stream wrapper that saves the file to a local storage (to allow seeking) and returns data as soon as it is available. Ingredients of this solutions seem to be the function stream_wrapper_register and the option CURLOPT_WRITEFUNCTION. I'm not good enough in multi-threaded PHP (if this even exists) to be able to write a patch implementing this but I hope that someone with the necessary skills gets interested in this problem.

@PVince81
Copy link
Contributor Author

I don't really understand how and why your code change would allow streaming.

From what I heard there is no such thing as multi-threaded PHP, unless maybe if the called function is implemented in C/C++ in the back 😦

There are already a few stream wrappers used in the external storage backends that work almost like you said, but the temporary file is first fully downloaded before access to it is given.
It might be possible to use curl (or multicurl if that makes sense ?) to downloaded the data in the background and make it available packet-wise.
Then if the caller tries to seek forward, it will be blocked until the relevant packet is available.

Another idea, for seeking, is that the caller can inform whether fseek is required or not.
If fseek is required, the old behavior (full download to temp) is used.
Else streaming is used (when available).

Needs further research.

@icewind1991
Copy link
Contributor

See #10620

@lokeller
Copy link

Thanks for the feedback. Turns out that as you stated there is not multithreading in general in PHP.

I studied a bit more the problem a mix of what you suggest (multicurl) and what I was proposing can be used to implement something that works. Below I put a small sample prototype of the approach I propose:

<?php

class SeekableHttpStream {

    // keeps track of the number of bytes written to the temporary file
    var $read_bytes = 0;

    // context used to download the file
    var $curl;
    var $multi_curl;

    // a file resource pointing to a temporary file where we store the download 
    var $temp_file;

    // different than null as long as we are downloading from the remote server
    var $active;

    // position at which our client (the person that opened the stream with our)
    // protocol wants the next read to happen 
    var $position = 0;

    function write_to_temp($res, $data) {

    // here we write the newly received data at the end of the temporary file
    fseek($this->temp_file, $this->read_bytes);
    $written = fwrite($this->temp_file, $data);
    $this->read_bytes += $written;
    return $written;

    }

    function stream_open($path, $mode, $options, &$opened_path) {

    // open the temporary file that will store the partial download
    $this->temp_file = fopen("php://temp", "rwb");

    // setup the download
    $this->curl = curl_init();

    $real_url = "http".substr($path, strlen("http+curl")) ; 

    curl_setopt($this->curl, CURLOPT_URL, $real_url);
    curl_setopt($this->curl, CURLOPT_HEADER, 0);

    // set a callback that will be called when new data comes in
    curl_setopt($this->curl, CURLOPT_WRITEFUNCTION, array($this, "write_to_temp"));

    $this->multi_curl = curl_multi_init();

    curl_multi_add_handle($this->multi_curl,$this->curl);

    $this->active = null;

    // start the download
    do {
        $mrc = curl_multi_exec($this->multi_curl, $this->active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);

    if ( $mrc != CURLM_OK ) {
        $this->stream_close();
    }

    error_log($this->active);

        return true;
    }

    function stream_close() {
    // clean up when closing the download
    $active = null;
    curl_multi_remove_handle($this->multi_curl, $this->curl);
    curl_multi_close($this->multi_curl);
    }


    function stream_read($count) {

    if ( $this->stream_eof()) {
        return FALSE;
    }

    $available_count = min( $this->read_bytes - $this->position, $count); 

    fseek($this->temp_file, $this->position);

    $data = fread($this->temp_file, $available_count);

    $this->position += strlen($data);

    return $data;

    }

    function stream_tell() {
        return $this->position;
    }

    function wait_data() {
    // wait for incoming data
    if (curl_multi_select($this->multi_curl) != -1) {

        // read the data
        do {
            $mrc = curl_multi_exec($this->multi_curl, $this->active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);


        if ( $mrc != CURLM_OK) {
            $this->stream_close();
        }

    } else {
        $this->stream_close();
    }
    }

    function wait_for_end() {
    // wait until the download is over
    while ( $this->active) {
        $this->wait_data();
    }
    }

    function wait_for_position($position) {
    // wait until we read past the position or the download is over
    while ( $position >= $this->read_bytes && $this->active) {
        $this->wait_data();
    }
    }

    function stream_eof() {

    // here we could rely on the Content-Lenght field of the header
    $this->wait_for_position($this->position);

    // return FALSE if the request is past the end of the file
    return $this->position >= $this->read_bytes ;
    }

    function stream_seek($offset, $whence) {

        switch ($whence) {
            case SEEK_SET:

        $this->wait_for_position($offset);

        if ($this->read_bytes > $offset) {
            $this->position = $offset;
            return TRUE;
        } else {
            return FALSE;
        }

                break;

            case SEEK_CUR:

        return $this->stream_seek($this->position + $offset, SEEK_SET);

            case SEEK_END:

        // here we could rely on the Content-Lenght field of the header
        $this->wait_for_end();
        return $this->stream_seek($this->position + $offset, SEEK_SET);

            default:
                return false;
        }
    }

}

stream_wrapper_register("http+curl", "SeekableHttpStream")
    or die("Failed to register protocol");

$fp = fopen("http+curl://127.0.0.1/infinite-file.php", "r");

print "First read: ";
print fread($fp, 4);
print "<br>";
print "Second read: ";
print fread($fp, 4);
fseek($fp, 0);
print "<br>";
print "Third read: ";
print fread($fp, 4);

?>

The approach could be optimized significantly (for instance by initiating range requests when the file is seeked on a region that is not yet downloaded and by using the Content-Lenght headers to predict the file size.

@lokeller
Copy link

lokeller commented Sep 8, 2014

I prepared a prototype of the changes needed to fix this bug, it is only partially tested (i.e. I downloaded a file and it worked ) : https://github.com/lokeller/core/commit/fcb1edb3b590ba8bba7f6801ea7fdc79a2400008

I put the commit in public domain, if you feel it may be useful in OwnCloud feel free to include it ( I don't want to sign any contribution agreement so I won't create a proper pull request). If there is interest I can polish and test it a bit more.

@bantu
Copy link

bantu commented Sep 8, 2014

@lokeller Hey. Signing a contributors agreement is not a requirement for submittung pull requests to ownCloud. After submitting a pull request, you will get the chance to alternatively state that your contribution is MIT licensed. :-)

@lokeller
Copy link

lokeller commented Sep 9, 2014

That's great! I'll test a bit further my patch and then I will create the pull request so it will be easier to review it.

@lokeller
Copy link

See #11000

@PVince81
Copy link
Contributor Author

Looks like Guzzle could be used with a stream factory: https://github.com/owncloud/core/pull/19002/files#diff-2ab592dfd95e06115a358d1b5b20cdc5R318

@PVince81
Copy link
Contributor Author

Okay, looks like it might be even easier, as long as the remote supports HTTP: https://github.com/owncloud/core/pull/18653/files#diff-782207b41e0a420a99054e41c7e7946dR348

@PVince81
Copy link
Contributor Author

  • TODO: double check every storage and distinguish stream download from stream upload.
    Many already have stream download but not necessarily upload.

@PVince81
Copy link
Contributor Author

  • Beware: we discovered that when using Guzzle for stream download it will do a fopen directly on the target URL. This means that calling fread might not return the number of requested bytes. This situation can happen when using encryption. => the solution for fread/fwrite blocks, use Icewind/Streams' "RetryWrapper" on the relevant external storages that use streaming

We might need to fix the encryption code first to not assume that fread always returns the number of expected bytes. The alternative is to make sure that all external storages using fopen streams also have some caching / buffering in place (for example Guzzle's CachingStream).

@PVince81
Copy link
Contributor Author

GDrive stream download: #23517
Dropbox stream download: #23516

@PVince81
Copy link
Contributor Author

Streaming upload would help reduce the occurrences of PHP timeouts like #24454 (comment)

@PVince81
Copy link
Contributor Author

@mmattel FYI I added this to the planning discussion: #24684 (comment)

@stale
Copy link

stale bot commented Sep 20, 2021

This issue has been automatically closed.

@stale stale bot closed this as completed Sep 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests