This repository has been archived by the owner on Jul 22, 2021. It is now read-only.

large file download fails with OverflowError #30

Open
rupertlevene opened this issue Feb 11, 2015 · 8 comments

@rupertlevene

On my 32-bit linux machine, files over 2GB fail to download. Memory usage while my test script runs gets very high, suggesting the entire download is being cached in memory; I think the download should be streamed to disk instead.
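The streaming behaviour being requested can be sketched in isolation. This is a minimal, hypothetical helper (the name `stream_to_file` is illustrative, not PyDrive API): copying a response in fixed-size chunks bounds peak memory at the chunk size, instead of buffering the whole body into one string, which is what overflows on a 32-bit build.

```python
import io

def stream_to_file(source, dest, chunk_size=1024 * 1024):
    """Copy a file-like source to dest in chunks, never holding more
    than chunk_size bytes in memory at once. Returns total bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        dest.write(chunk)
        total += len(chunk)
    return total
```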

To use the script, upload a large file called bigvid.avi to google drive and put client_secrets.json in the working directory.

$ ./test.py 
bigvid.avi
Traceback (most recent call last):
  File "./test.py", line 17, in <module>
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 167, in GetContentFile
    self.FetchContent(mimetype)
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 36, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 198, in FetchContent
    self.content = io.BytesIO(self._DownloadFromUrl(download_url))
  File "/usr/local/lib/python2.7/dist-packages/pydrive/auth.py", line 54, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 313, in _DownloadFromUrl
    resp, content = self.auth.service._http.request(url)
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 135, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 547, in new_request
    redirections, connection_type)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1593, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1335, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1318, in _conn_request
    content = response.read()
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 624, in _read_chunked
    return ''.join(value)
OverflowError: join() result is too long for a Python string
$ cat test.py
#!/usr/bin/env python

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth=GoogleAuth()
if not gauth.LoadCredentialsFile("auth.txt") :
    gauth.CommandLineAuth()
    gauth.SaveCredentialsFile("auth.txt")

drive=GoogleDrive(gauth)

filelist=drive.ListFile({'q': "title='bigvid.avi'"}).GetList()
for f in filelist:
    print f['title'];
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')

$ ls -l /tmp/big*
ls: cannot access /tmp/big*: No such file or directory
@aliafshar
Contributor

This uses the google-api-python-client under the hood, and that is where the bug is. However, I am really sorry about this: dumping the entire download into memory without streaming shows appalling forethought.

@Fjodor42
Contributor

Fjodor42 commented Feb 3, 2016

You might want to have a look at #27

@Fjodor42
Contributor

Fjodor42 commented Feb 17, 2016

I don't have a 32-bit system handy for testing, but could you report whether replacing

filelist=drive.ListFile({'q': "title='bigvid.avi'"}).GetList()
for f in filelist:
    print f['title'];
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')

with

local_file = io.FileIO('/tmp/bigvid-from-pydrive.avi', mode='wb')
for f in filelist:
    print f['title']
    file_id = f.metadata.get('id')
    request = drive.auth.service.files().get_media(fileId=file_id)
    downloader = MediaIoBaseDownload(local_file, request, chunksize=2048*1024)

    done = False
    while done is False:
        status, done = downloader.next_chunk()
local_file.close()

works (you'll probably need an `import io` and a `from apiclient.http import MediaIoBaseDownload` somewhere)?

Inasmuch as it seems to download a 4 GB file of random data, without any serious memory use, on my machine, I posit the dreaded "works on my machine", but mine is a 64-bit one.

If it does work, I think I can cook up a way for PyDrive to decide to do this for files over a certain size; I would then want to open a feature request to solicit input on what that limit should be, and on whether the limit should equal the chunk size.
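The size-threshold idea in the last paragraph could be sketched as follows. Everything here is a hypothetical illustration, not PyDrive API: the names `choose_download_strategy` and `STREAM_THRESHOLD`, and the 100 MB cutoff, are assumptions standing in for whatever the feature request would settle on.

```python
STREAM_THRESHOLD = 100 * 1024 * 1024   # 100 MB cutoff; the open question above
DEFAULT_CHUNK_SIZE = 20 * 1024 * 1024  # 20 MB chunks (10x the 2 MB in the snippet)

def choose_download_strategy(file_size):
    """Pick chunked streaming for large files, plain in-memory
    download otherwise. Returns ('stream', chunk_size) or ('memory', None)."""
    if file_size >= STREAM_THRESHOLD:
        return ('stream', DEFAULT_CHUNK_SIZE)
    return ('memory', None)
```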

@rupertlevene
Author

Thanks, this works!

(I upped the chunk size by a factor of 10 to save time; otherwise it was rather slow.)


@RNabel
Collaborator

RNabel commented Jun 8, 2016

@rupertlevene This should be resolved now. Post here if you are still encountering this issue.

@RNabel RNabel closed this as completed Jun 8, 2016
@RNabel RNabel reopened this Jun 9, 2016
@RNabel
Collaborator

RNabel commented Jun 9, 2016

Reopening, as @Fjodor42 points out, and #62 references, there is no verification of this being resolved.

@RNabel RNabel added the bug label Jun 15, 2016
@RNabel RNabel added this to the Future Improvements milestone Oct 23, 2016
@smichaud

smichaud commented Jun 9, 2020


Thank you for the solution.
Side note:

import io
from googleapiclient.http import MediaIoBaseDownload

@shcheklein
Collaborator

@smichaud btw, GetContentFile has been rewritten (among other fixes and improvements) in iterative/PyDrive2, a maintained fork. It uses MediaIoBaseDownload internally and should work out of the box. Here is an example of how it is used in DVC:

https://github.com/iterative/dvc/blob/b57077af11ae287941b4d2939071fda2ad01f483/dvc/remote/gdrive.py#L376
