storage: expose resumable upload upload_id=xx #1224

Open
snktagarwal opened this issue Nov 17, 2018 · 16 comments
Comments

@snktagarwal

We are using Google Cloud Storage to upload audio data in a streaming fashion. Since the streaming can last quite a long time (up to an hour?), it's important that we have mechanisms to protect ourselves from any crashes that may happen within our services.

Essentially we want to make sure that any data we have uploaded up to a point, on a file we have not yet Close()ed, can still be finalized at some point. I see that the JSON API provides a way to obtain the upload_id=xx to resume uploads in the future. Is there a way to expose this information from the Go API? If there isn't, is there any intermediate solution that might be useful?

Our other approach would be to roll our own simple client, but then we may have to reinvent the wheel w.r.t. retry logic and the other niceties of the API.

Guidance appreciated

@suki-fredrik

@JustinBeckwith JustinBeckwith added the triage me I really want to be triaged. label Nov 18, 2018
@jeanbza
Member

jeanbza commented Nov 19, 2018

Hello! Thanks for filing an issue. Please note that the client library already retries resumable upload failures, such as when we get 5xx errors and so on. So, just want to clarify: is this feature request asking for an API addition that allows users to get an upload ID and then later specify the resumption of an upload with an upload ID plus some data? Or, if not, could you describe this request in more detail?

@jeanbza jeanbza added needs more info This issue needs more information from the customer to proceed. and removed triage me I really want to be triaged. labels Nov 19, 2018
@jeanbza jeanbza self-assigned this Nov 19, 2018
@JustinBeckwith JustinBeckwith added the triage me I really want to be triaged. label Nov 19, 2018
@snktagarwal
Author

A little bit more about the request. Our setup is something like this:

Frontend -> Backend -> GCS

We want to ensure that as long as data gets to the backend service, we are able to save it in GCS. Consider the scenario where a backend service goes down (crash, pre-emption, and so on...): with the current implementation of the Go GCS client we essentially lose the data, since we don't have the upload_id to finalize the file after the service comes back up.

The implications are bad, because we could have been streaming for an hour when the service suddenly goes down. Another solution (although I'm not sure how possible it is in Cloud Storage) would be to flush the file every few seconds; my understanding is that GCS does not provide that option either. I think the Go library should simply expose the upload_id so the client can save it in persistent storage, as a guarantee against losing the handle if the service goes down.
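For context, the raw JSON API flow we'd otherwise have to hand-roll looks roughly like this: initiate a resumable session, then persist the session URI (which carries the upload_id) somewhere durable. This is only a minimal sketch based on the public resumable-upload docs, not library code; the bucket, object name, and content type are placeholders, and retries are elided.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"golang.org/x/oauth2/google"
)

func main() {
	ctx := context.Background()

	// Application Default Credentials with the read-write storage scope.
	ts, err := google.DefaultTokenSource(ctx, "https://www.googleapis.com/auth/devstorage.read_write")
	if err != nil {
		log.Fatal(err)
	}
	tok, err := ts.Token()
	if err != nil {
		log.Fatal(err)
	}

	// Initiate a resumable upload session via the JSON API.
	// "my-bucket" and "call-123.raw" are placeholders.
	url := "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=resumable&name=call-123.raw"
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+tok.AccessToken)
	req.Header.Set("X-Upload-Content-Type", "audio/wav")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The Location header is the resumable session URI; it contains upload_id=...
	// Persisting this URI durably is what would let a restarted process resume
	// or finalize the upload later.
	fmt.Println("session URI:", resp.Header.Get("Location"))
}
```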

@jeanbza jeanbza added type: question Request for information or clarification. Not an issue. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triage me I really want to be triaged. type: question Request for information or clarification. Not an issue. labels Nov 20, 2018
@jeanbza
Member

jeanbza commented Nov 20, 2018

How would the backend service know which byte to resume uploading at? Do you imagine the backend service keeping a cursor locally, and the requested API taking the byte location and the upload_id?

@frankyn
Member

frankyn commented Nov 27, 2018

@jadekler Resumable uploads allow you to check the status of the upload session, as documented at:
https://cloud.google.com/storage/docs/xml-api/resumable-upload#step_4query_for_the_upload_status

Quoting the document:

PUT https://[BUCKET_NAME].storage.googleapis.com/[OBJECT_NAME]?upload_id=[UPLOAD_ID] HTTP/1.1
Date: Fri, 01 Oct 2010 22:25:53 GMT
Content-Range: bytes */[TOTAL_SIZE]
Content-Length: 0

Response:

If you received a 308 Resume Incomplete response, process the response's Range header, which specifies which bytes Cloud Storage has received so far.
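In Go, that status check is just a plain HTTP request against the session URI. A minimal sketch following the quoted docs (the session URI and total size come from the caller; error handling is kept short):

```go
package gcsresume

import (
	"context"
	"fmt"
	"net/http"
)

// queryUploadStatus checks how much of a resumable upload Cloud Storage has
// persisted. sessionURI is the URI (containing upload_id) returned when the
// session was created; totalSize is the final object size in bytes.
func queryUploadStatus(ctx context.Context, sessionURI string, totalSize int64) (committed int64, done bool, err error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, sessionURI, nil)
	if err != nil {
		return 0, false, err
	}
	req.Header.Set("Content-Range", fmt.Sprintf("bytes */%d", totalSize))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, false, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK, http.StatusCreated:
		// The upload already completed.
		return totalSize, true, nil
	case 308: // Resume Incomplete
		// The Range header looks like "bytes=0-12345"; the upper bound is the
		// last byte Cloud Storage has persisted so far.
		r := resp.Header.Get("Range")
		if r == "" {
			return 0, false, nil // nothing persisted yet
		}
		var first, last int64
		if _, err := fmt.Sscanf(r, "bytes=%d-%d", &first, &last); err != nil {
			return 0, false, err
		}
		return last + 1, false, nil
	default:
		return 0, false, fmt.Errorf("unexpected status: %s", resp.Status)
	}
}
```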

@jeanbza
Member

jeanbza commented Nov 28, 2018

Ah, right on, thanks Frank!

@jeanbza jeanbza added help wanted We'd love to have community involvement on this issue. and removed needs more info This issue needs more information from the customer to proceed. labels Nov 28, 2018
@noseglid

noseglid commented Dec 5, 2018

I have another use case which relates to this.

We're also streaming audio into GCS. Once all the audio is done, we need to go back to the beginning of the file to write the audio file's header (duration and so on, which isn't known until the whole file has been processed).

I'm not sure whether resumable uploads allow me to overwrite a portion of a file, though?

@frankyn
Member

frankyn commented Dec 5, 2018

@noseglid Not at the moment. Resumable uploads are sequential writes only and can't adjust the write cursor or metadata at the end.

@noseglid

noseglid commented Dec 5, 2018

@frankyn Thanks for the response!

So there's no support in GCS for streaming media? This is a very common thing to do for video as well.

I realize this question is quite off-topic for this thread, and even this repo.

@frankyn
Member

frankyn commented Dec 5, 2018

Apologies, maybe I misunderstood. When you say go back and write the header, does that mean you want to modify the byte data or does that mean you want to modify GCS object headers?

@noseglid

noseglid commented Dec 5, 2018

The byte data.

So maybe I need to modify byte offsets 100-250 of the file once I've written all of it (which is normally 100+ MB).

@frankyn
Member

frankyn commented Dec 5, 2018

Gotcha, what you could do instead is:

  1. Write bytes 251-EOF into object1.
  2. Write the header into object2.
  3. Compose object2 and object1 in the expected order (compose operations concatenate the sources in the order they're listed; see the sketch below).
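With this library, step 3 could look roughly like the sketch below; the bucket and object names are placeholders, and Compose concatenates the sources in the order they're passed.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	bkt := client.Bucket("my-bucket") // placeholder bucket

	// Compose concatenates the sources in the order they're listed:
	// object2 (the header piece) first, then object1 (the remaining bytes).
	composer := bkt.Object("final-object").ComposerFrom(
		bkt.Object("object2"),
		bkt.Object("object1"),
	)
	if _, err := composer.Run(ctx); err != nil {
		log.Fatal(err)
	}
}
```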

@noseglid

noseglid commented Dec 5, 2018

I don't really have that level of control. As per the example above, bytes 0-99 are written in the first round, and all zeros are written for bytes 100-250. Those bytes need to be overwritten after a seek.

I'm using libav (and I know it's similar with GStreamer and probably other media libs too), and I can essentially provide it with a Write function and a Seek function. It will then call those functions when it has data to write or wants to move the cursor.

Typically it will Seek to somewhere near the beginning of the file when all audio is processed, and then write a few more bytes there before it's all done.

What I'd really want (I'm using the Go libs) is a WriteSeeker; as of now, I can only get a Writer.

@frankyn
Member

frankyn commented Dec 5, 2018

Thanks for the additional information @noseglid! I think this is a separate discussion that deserves its own issue. Could you restate the information and feature request in a new issue?

I don't have any background in this area, but would be interested to learn more.
FWIW, there could be two options here.

  1. If Seek is only used at the end of the write to fill in the header (bytes 100-250), and the writes are otherwise sequential without updating prior data, a Write and Seek interface could provide what libav needs by creating 3 separate objects on GCS: Object 1 (bytes 0-99), Object 2 (bytes 100-250, which comes at the end), and Object 3 (bytes 251-EOF). When a seek lands in the 100-250 range, swap that piece out with a new object (only if the data isn't all zeros). Then perform a compose operation on "close" to join the pieces.
  2. Write the stream to local disk and then upload the file after it's complete. No additional support needs to be written in this case (see the sketch below).
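Option 2 needs nothing beyond the existing API. A minimal sketch, with the local path, bucket, and object names as placeholders:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// libav writes (and seeks within) this local file until the stream is done.
	f, err := os.Open("/tmp/recording.mp4") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Then upload the finished file to GCS in one pass.
	w := client.Bucket("my-bucket").Object("recording.mp4").NewWriter(ctx)
	if _, err := io.Copy(w, f); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```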

@noseglid

noseglid commented Dec 6, 2018

Thanks! I'll create a new issue for this!

@jeanbza jeanbza removed their assignment Dec 27, 2018
@odeke-em odeke-em changed the title Expose resumable upload upload_id=xx storage: expose resumable upload upload_id=xx Jul 14, 2019
@mweibel

mweibel commented Oct 7, 2019

Does anyone know what the state of this issue is?

My setup:
browser JS client -> API gateway -> storage microservice -> GCS

The storage microservice is using this library and abstracts away the GCS implementation, as well as providing access to only a certain part of GCS.

The JS client should be able to upload large files to GCS.
For that to work well, I think the storage service would need to expose the resumable upload API. That way, the browser can chunk the file and upload those chunks (with retries).

Besides enabling better progress reporting, this also reduces memory usage in the layers in between (only memory for the chunk size needs to be allocated, not for the full file).

Does that make sense, or am I missing something that would make this work with the API this library already exposes?

@jeanbza jeanbza added the api: storage Issues related to the Cloud Storage API. label Oct 8, 2019
@jeanbza jeanbza removed the help wanted We'd love to have community involvement on this issue. label Oct 8, 2019
@jeanbza
Member

jeanbza commented Oct 8, 2019

cc @frankyn @jkwlui
