
Resumable upload errors if (optional) name is omitted #147

Closed

JacobHayes opened this issue Mar 14, 2022 · 6 comments


@JacobHayes
Contributor

Thanks for the great package!

Describe the bug
When initiating a resumable upload, the name query parameter is optional:

This is not required if you included a name in the object metadata file in Step 2.

However, gcp-storage-emulator expects to get some data on the first request (request.data.get("name")), causing:

  File ".../gcp_storage_emulator/handlers/objects.py", line 188, in _create_resumable_upload
    object_id = request.data.get("name")
AttributeError: 'NoneType' object has no attribute 'get'
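
For what it's worth, a guard like this would avoid the AttributeError itself (just a sketch of the idea, not a proposed fix; request.data comes from the traceback above, the empty-dict fallback is my assumption):

data = request.data or {}  # hypothetical guard: treat a missing body as empty metadata
object_id = data.get("name")  # now None instead of raising when the body is absent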

To Reproduce
A minimal, complete example that reproduces the behavior:

import os

import gcp_storage_emulator.server
import gcsfs

server = gcp_storage_emulator.server.create_server("localhost", 0, in_memory=True)
server.start()
host, port = server._api._httpd.socket.getsockname()
os.environ["STORAGE_EMULATOR_HOST"] = f"http://{host}:{port}"

try:
    gcs_fs = gcsfs.GCSFileSystem()
    gcs_fs.mkdir("test_bucket")
    gcs_fs.touch("test_bucket/file")  # Triggers a resumable upload w/o a name on the first request and errors
finally:
    server.stop()

Or, with curl against a running instance with an existing bucket:

curl -i -X 'POST' -H 'content-type: application/json' 'http://127.0.0.1:9023/upload/storage/v1/b/test_bucket/o?uploadType=resumable'
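
For comparison, the same request goes through when the optional name query parameter is supplied (assuming the same running instance and bucket):

curl -i -X 'POST' -H 'content-type: application/json' 'http://127.0.0.1:9023/upload/storage/v1/b/test_bucket/o?uploadType=resumable&name=file'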

Expected behavior
No error; instead, a redirect to a unique, generated upload ID.

System (please complete the following information):

  • OS version: macOS Monterey
  • Python version: 3.9
  • gcp-storage-emulator version: 2022.2.17


@oittaa
Owner

oittaa commented Mar 14, 2022

Thanks for the great test case! Unfortunately files.pythonhosted.org is currently broken (at least for me) and I can't test properly with gcsfs.

$ pip install gcsfs
...
ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded with url: /packages/6f/ad/46e12b207068000bfa76ab5e07154d975b0e1977cc0786299344e6c90925/gcsfs-2022.2.0-py2.py3-none-any.whl (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1129)')))
...

I made a quick fix in #148. Could you test with it and see whether gcsfs works as expected with the change?

If you don't want to, or can't, build from source, the latest Docker image follows the main branch.

docker pull oittaa/gcp-storage-emulator
docker run -d --rm \
  -e PORT=9023 \
  -p 9023:9023 \
  --name gcp-storage-emulator \
  oittaa/gcp-storage-emulator

@oittaa
Owner

oittaa commented Mar 14, 2022

It must be a CDN problem or something: GitHub doesn't have any issues fetching the packages, but I get a self-signed certificate error every time. pypa/pip#10963

@JacobHayes
Contributor Author

JacobHayes commented Mar 14, 2022

Thanks for the quick look (and very odd on PyPI 🙃)! Now I get an UnboundLocalError: local variable 'object_id' referenced before assignment (and apparently a non-JSON request.data) with:

import json

import gcsfs

fs = gcsfs.GCSFileSystem()
fs.mkdir("test")

gcsfs.GCSFileSystem().touch("test/a")  # I think this one just doesn't have the name in either the body or the query param
# UnboundLocalError: local variable 'object_id' referenced before assignment

with fs.open("test/b", "w") as file:
    json.dump(5, file)  # this just sends the request data as b"5"
# AttributeError: 'bytes' object has no attribute 'get'

For the optional file name, I'm wondering whether GCS just generates a UUID (or similar) upload_id and gets the name from a subsequent request? It looks like we use the object name from the first request as the upload_id (which seems very reasonable), but maybe it should generate a UUID or something when there's no name... and, I guess, store a mapping of {uuid: upload session}?
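
Something like this sketch is what I mean (all names here are hypothetical, just to illustrate the shape, not the emulator's actual internals):

import uuid

# Hypothetical session store: maps generated upload IDs to pending
# resumable-upload state until the object name is known.
resumable_sessions = {}

def create_upload_id(object_name=None):
    # Use the client-supplied name when present; otherwise generate a
    # random ID and resolve the name from a later request.
    upload_id = object_name or uuid.uuid4().hex
    resumable_sessions[upload_id] = {"name": object_name, "data": b""}
    return upload_id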

@oittaa
Owner

oittaa commented Mar 14, 2022

Hmm... From the API documentation I understood that you have to pass either the query parameter or the metadata, but you shouldn't omit both. I need to take a closer look at how gcsfs works when the CDN starts working again.

  1. Optionally, create a JSON file that contains the metadata you want to set on the object that you're uploading.

...

OBJECT_NAME is the name you want to give your object. For example, pets/dog.png. This is not required if you included a name in the object metadata file in Step 2.

@oittaa
Owner

oittaa commented Mar 14, 2022

OK, I've tried to figure out what's going on. I'm about 99% sure that gcsfs uses the wrong method: it should use PUT to write the data (https://cloud.google.com/storage/docs/performing-resumable-uploads#single-chunk-upload), but it's using POST. For some reason the official Google API accepts it anyway if the request includes the upload_id parameter, so I added a workaround for the undocumented Google API behavior in #149.
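
Roughly, the workaround does something like this (a sketch of the idea only, not the actual code in #149; method and query_params are made-up names):

# Hypothetical dispatch tweak: a POST to the upload endpoint that
# carries an upload_id parameter is handled like the documented PUT
# that writes data to an existing resumable session.
if method == "POST" and "upload_id" in query_params:
    method = "PUT"  # the official API accepts this undocumented behavior too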

This should work now.

import os

import gcp_storage_emulator.server
import gcsfs
from google.cloud import storage

server = gcp_storage_emulator.server.create_server("localhost", 9023, in_memory=True)
server.start()
host, port = server._api._httpd.socket.getsockname()
os.environ["STORAGE_EMULATOR_HOST"] = f"http://{host}:{port}"

try:
    gcs_fs = gcsfs.GCSFileSystem()
    gcs_fs.mkdir("test_bucket")
    gcs_fs.touch("test_bucket/file")
    with gcs_fs.open("test_bucket/file", "w") as file:
        file.write("testing")

    client = storage.Client()
    bucket = client.bucket("test_bucket")
    for blob in bucket.list_blobs():
        content = blob.download_as_bytes()
        print(f"Blob [{blob.name}]: {content}")
finally:
    server.stop()

@JacobHayes
Contributor Author

Wow, things look good on my end too, so I'll close out the issue! Thank you for the super quick fixes! 🙌
