storage: expose resumable upload upload_id=xx #1224

Open
snktagarwal opened this issue Nov 17, 2018 · 16 comments
Comments

@snktagarwal

We are using Google Cloud Storage to upload audio data in a streaming fashion. Since the streaming can last quite a long time (up to an hour?), it's important that we have mechanisms to protect ourselves from any crashes that may happen within our services.

Essentially we want to make sure that any data we have uploaded up to a point, on a file we have not yet Close()ed, can still be finalized at some point. I see that the JSON API provides a way to obtain the upload_id=xx to resume uploads in the future. Is there a way to expose this information from the Go API? If there isn't, is there any intermediate solution that might be useful?

Our other approach would be to roll our own simple client, but then we may have to reinvent the wheel w.r.t. retry logic and the other niceties of the API.

Guidance appreciated

@suki-fredrik

@JustinBeckwith JustinBeckwith added the triage me I really want to be triaged. label Nov 18, 2018
@jeanbza
Member

jeanbza commented Nov 19, 2018

Hello! Thanks for filing an issue. Please note that the client library already retries resumable upload failures, such as when we get 5xx errors and so on. So, just want to clarify: is this feature request asking for an API addition that allows users to get an upload ID and then later specify the resumption of an upload with an upload ID plus some data? Or, if not, could you describe this request in more detail?

@jeanbza jeanbza added needs more info This issue needs more information from the customer to proceed. and removed triage me I really want to be triaged. labels Nov 19, 2018
@jeanbza jeanbza self-assigned this Nov 19, 2018
@JustinBeckwith JustinBeckwith added the triage me I really want to be triaged. label Nov 19, 2018
@snktagarwal
Author

A little bit more about the request. Our setup is something like this:

Frontend -> Backend -> GCS

We want to ensure that as long as data gets to the backend service, we are able to save it in GCS. Consider the scenario where a backend service goes down (crash, pre-emption, and so on...): with the current implementation of the Go GCS client we essentially lose the data, since we don't have the upload_id to finalize the file after the service comes back up.

The implications are bad, because we could have been streaming for an hour when the service suddenly goes down. Another solution (although I'm not sure how possible it is in Cloud Storage) would be to flush the file every few seconds; my understanding is that GCS does not provide that option either. I think the Go library should simply expose the upload_id so the client can save it in persistent storage, as a guarantee against losing the handle if the service goes down.
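For context, the raw JSON API flow we'd otherwise have to hand-roll looks roughly like this: initiate a resumable session, then persist the session URI (which carries the upload_id) somewhere durable. This is only a minimal sketch based on the public resumable-upload docs, not library code; the bucket, object name, and content type are placeholders, and retries are elided.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"golang.org/x/oauth2/google"
)

func main() {
	ctx := context.Background()

	// Application Default Credentials with the read-write storage scope.
	ts, err := google.DefaultTokenSource(ctx, "https://www.googleapis.com/auth/devstorage.read_write")
	if err != nil {
		log.Fatal(err)
	}
	tok, err := ts.Token()
	if err != nil {
		log.Fatal(err)
	}

	// Initiate a resumable upload session via the JSON API.
	// "my-bucket" and "call-123.raw" are placeholders.
	url := "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=resumable&name=call-123.raw"
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+tok.AccessToken)
	req.Header.Set("X-Upload-Content-Type", "audio/wav")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The Location header is the resumable session URI; it contains upload_id=...
	// Persisting this URI durably is what would let a restarted process resume
	// or finalize the upload later.
	fmt.Println("session URI:", resp.Header.Get("Location"))
}
```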

@jeanbza jeanbza added type: question Request for information or clarification. Not an issue. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triage me I really want to be triaged. type: question Request for information or clarification. Not an issue. labels Nov 20, 2018
@jeanbza
Member

jeanbza commented Nov 20, 2018

How would the backend service know which byte to resume uploading at? Do you imagine the backend service keeping a cursor locally, and the requested API taking the byte location and the upload_id?

@frankyn
Member

frankyn commented Nov 27, 2018

@jadekler Resumable uploads allow you to check the status of the upload session, as documented at:
https://cloud.google.com/storage/docs/xml-api/resumable-upload#step_4query_for_the_upload_status

Quoting the document:

PUT https://[BUCKET_NAME].storage.googleapis.com/[OBJECT_NAME]?upload_id=[UPLOAD_ID] HTTP/1.1
Date: Fri, 01 Oct 2010 22:25:53 GMT
Content-Range: bytes */[TOTAL_SIZE]
Content-Length: 0

Response:

If you received a 308 Resume Incomplete response, process the response's Range header, which specifies which bytes Cloud Storage has received so far.
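In Go, that status check is just a plain HTTP request against the session URI. A minimal sketch following the quoted docs (the session URI and total size come from the caller; error handling is kept short):

```go
package gcsresume

import (
	"context"
	"fmt"
	"net/http"
)

// queryUploadStatus checks how much of a resumable upload Cloud Storage has
// persisted. sessionURI is the URI (containing upload_id) returned when the
// session was created; totalSize is the final object size in bytes.
func queryUploadStatus(ctx context.Context, sessionURI string, totalSize int64) (committed int64, done bool, err error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, sessionURI, nil)
	if err != nil {
		return 0, false, err
	}
	req.Header.Set("Content-Range", fmt.Sprintf("bytes */%d", totalSize))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, false, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK, http.StatusCreated:
		// The upload already completed.
		return totalSize, true, nil
	case 308: // Resume Incomplete
		// The Range header looks like "bytes=0-12345"; the upper bound is the
		// last byte Cloud Storage has persisted so far.
		r := resp.Header.Get("Range")
		if r == "" {
			return 0, false, nil // nothing persisted yet
		}
		var first, last int64
		if _, err := fmt.Sscanf(r, "bytes=%d-%d", &first, &last); err != nil {
			return 0, false, err
		}
		return last + 1, false, nil
	default:
		return 0, false, fmt.Errorf("unexpected status: %s", resp.Status)
	}
}
```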

@jeanbza
Member

jeanbza commented Nov 28, 2018

Ah, right on, thanks Frank!

@jeanbza jeanbza added help wanted We'd love to have community involvement on this issue. and removed needs more info This issue needs more information from the customer to proceed. labels Nov 28, 2018
@noseglid

noseglid commented Dec 5, 2018

I have another use case which relates to this.

We're also streaming audio into GCS. Once all the audio is done, we need to go back to the beginning of the file to write the audio file's header (duration and so on, which isn't known until the whole file has been processed).

I'm not sure whether resumable uploads allow me to overwrite a portion of a file, though?

@frankyn
Member

frankyn commented Dec 5, 2018

@noseglid Not at the moment. Resumable uploads are sequential writes only and can't adjust the write cursor or metadata at the end.

@noseglid

noseglid commented Dec 5, 2018

@frankyn Thanks for the response!

So there's no support in GCS for streaming media? This is a very common thing to do for video as well.

I realize this question is quite off-topic for this thread, and even this repo.

@frankyn
Member

frankyn commented Dec 5, 2018

Apologies, maybe I misunderstood. When you say go back and write the header, does that mean you want to modify the byte data or does that mean you want to modify GCS object headers?

@noseglid

noseglid commented Dec 5, 2018

The byte data.

So maybe I need to modify byte offsets 100-250 of the file once I've written all of it (which is normally 100+ MB).

@frankyn
Member

frankyn commented Dec 5, 2018

Gotcha, what you could do instead is:

  1. Write bytes 251-EOF into object1.
  2. Write the header into object2.
  3. Compose object2 and object1 in the expected order (compose operations concatenate the sources in the order they're listed; see the sketch below).
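With this library, step 3 could look roughly like the sketch below; the bucket and object names are placeholders, and Compose concatenates the sources in the order they're passed.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	bkt := client.Bucket("my-bucket") // placeholder bucket

	// Compose concatenates the sources in the order they're listed:
	// object2 (the header piece) first, then object1 (the remaining bytes).
	composer := bkt.Object("final-object").ComposerFrom(
		bkt.Object("object2"),
		bkt.Object("object1"),
	)
	if _, err := composer.Run(ctx); err != nil {
		log.Fatal(err)
	}
}
```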

@noseglid

noseglid commented Dec 5, 2018

I don't really have that level of control. As per the example above, bytes 0-99 are written in the first round, and all zeros are written for bytes 100-250. Those bytes need to be overwritten after a seek.

I'm using libav (and I know it's similar with GStreamer and probably other media libs too), and I can essentially provide it with a Write function and a Seek function. It will then call those functions when it has data to write or wants to move the cursor.

Typically it will Seek to somewhere near the beginning of the file when all audio is processed, and then write a few more bytes there before it's all done.

What I'd really want (I'm using the Go libs) is a WriteSeeker; as of now, I can only get a Writer.

@frankyn
Member

frankyn commented Dec 5, 2018

Thanks for the additional information @noseglid! I think this is a separate discussion that deserves its own issue. Could you restate the information and feature request in a new issue?

I don't have any background in this area, but would be interested to learn more.
FWIW, there could be two options here.

  1. If Seek is only used at the end of the write to fill in the header (bytes 100-250), and the writes are otherwise sequential without updating prior data, a Write and Seek interface could provide what libav needs by creating 3 separate objects on GCS: Object 1 (bytes 0-99), Object 2 (bytes 100-250, which comes at the end), and Object 3 (bytes 251-EOF). When a seek lands in the 100-250 range, swap that piece out with a new object (only if the data isn't all zeros). Then perform a compose operation on "close" to join the pieces.
  2. Write the stream to local disk and then upload the file after it's complete. No additional support needs to be written in this case (see the sketch below).
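Option 2 needs nothing beyond the existing API. A minimal sketch, with the local path, bucket, and object names as placeholders:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// libav writes (and seeks within) this local file until the stream is done.
	f, err := os.Open("/tmp/recording.mp4") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Then upload the finished file to GCS in one pass.
	w := client.Bucket("my-bucket").Object("recording.mp4").NewWriter(ctx)
	if _, err := io.Copy(w, f); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```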

@noseglid

noseglid commented Dec 6, 2018

Thanks! I'll create a new issue for this!

@jeanbza jeanbza removed their assignment Dec 27, 2018
@odeke-em odeke-em changed the title Expose resumable upload upload_id=xx storage: expose resumable upload upload_id=xx Jul 14, 2019
@mweibel

mweibel commented Oct 7, 2019

Does anyone know what the state of this issue is?

My setup:
browser JS client -> API gateway -> storage microservice -> GCS

The storage microservice is using this library and abstracts away the GCS implementation, as well as providing access to only a certain part of GCS.

The JS client should be able to upload large files to GCS.
For that to work well, I think the storage service would need to expose the resumable upload API. That way, the browser can chunk the file and upload those chunks (with retries).

Besides enabling better progress reporting, this also reduces memory usage in the layers in between (only memory for the chunk size needs to be allocated, not for the full file).

Does that make sense, or am I missing something that would make this work with the API this library already exposes?

@jeanbza jeanbza added the api: storage Issues related to the Cloud Storage API. label Oct 8, 2019
@jeanbza jeanbza removed the help wanted We'd love to have community involvement on this issue. label Oct 8, 2019
@jeanbza
Member

jeanbza commented Oct 8, 2019

cc @frankyn @jkwlui
