
bigquery: detecting when the streaming buffer is empty #507

Closed
mikebell-org opened this issue Feb 10, 2017 · 11 comments
Labels
api: bigquery Issues related to the BigQuery API. type: question Request for information or clarification. Not an issue.

Comments

@mikebell-org

mikebell-org commented Feb 10, 2017

With BigQuery, how can I poll to detect when the streaming buffer is empty and it's safe to copy a table, without dropping down to the lower-level library?
I presume it should return a Job you can Wait() on, similar to copies?

@s-mang s-mang changed the title Detecting when the streaming buffer is empty bigquery: detecting when the streaming buffer is empty Mar 1, 2017
@jba jba added api: bigquery Issues related to the BigQuery API. type: question Request for information or clarification. Not an issue. labels Mar 15, 2017
@jba
Contributor

jba commented Jun 7, 2017

We blew it on this one. We didn't mean to ignore you. Do you still have this question? If so, can you explain what you mean by "the streaming buffer"? If you're calling Uploader.Put, it should be synchronous.

@derekperkins
Contributor

BigQuery doesn't stream directly into long-term storage; it first puts rows into a write-optimized store and periodically flushes that into the main storage. Queries can use the streaming buffer immediately, but since other BigQuery operations (like table copies) ignore the buffer, he wants to wait for the buffer to clear.

@derekperkins
Contributor

For anyone interested in the details, this is a great article.
https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert

@jba jba self-assigned this Jun 9, 2017
@jba
Contributor

jba commented Jun 9, 2017

@mikebell-org We didn't write a poll method, but you can now find streaming buffer info in TableMetadata, so you can write your own polling. (Feel free to lift our own Retry function. Sorry it's not public.)
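The polling loop itself is straightforward to sketch. A minimal Go example, where pollUntilBufferEmpty and the check callback are illustrative names (not part of the client); with the real client, check would wrap a table.Metadata(ctx) call and return whether meta.StreamingBuffer is nil:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntilBufferEmpty calls check until it reports that the streaming
// buffer has drained, sleeping between attempts. In real use, check would
// fetch TableMetadata and report whether the StreamingBuffer field is nil.
func pollUntilBufferEmpty(check func() (bool, error), interval time.Duration, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		empty, err := check()
		if err != nil {
			return err
		}
		if empty {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("streaming buffer still not empty after max attempts")
}

func main() {
	// Fake check standing in for the metadata lookup: the buffer
	// "drains" after three calls.
	calls := 0
	check := func() (bool, error) {
		calls++
		return calls >= 3, nil
	}
	err := pollUntilBufferEmpty(check, time.Millisecond, 10)
	fmt.Println(err, calls) // prints "<nil> 3"
}
```

In practice the interval would be minutes rather than milliseconds, since the buffer can take on the order of 90 minutes to drain.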

@belablotski

Ok, thank you. Is it possible to flush it if it isn't empty? What if a process needs a bulk UPDATE after streaming?

@jba
Contributor

jba commented Mar 2, 2018

There's no way to flush it.

What if a process needs a bulk UPDATE after streaming?

Can you explain more what you mean by this?

@belablotski

Hi Jonathan,
Let's say you've streamed data and then want to clean it up by deleting some duplicates or updating some records, i.e. you need to execute a DML query (UPDATE and/or DELETE) while the streaming buffer isn't empty. In that case BigQuery throws "Error 400: UPDATE or DELETE statements are not supported on table XYZ with streaming buffer", and you have to wait ~90 minutes before you can run it. So it would be nice to have an API to flush the streaming buffer and make the table available for DELETE/UPDATE DML.

@jba
Contributor

jba commented Mar 2, 2018

OK, I see what you mean.

This is really a question for the BigQuery service itself, not the Go client. But I'll attempt an answer.

The reason there is unlikely to be a flush is that the streaming buffer is a performance optimization designed to keep queries fast by batching changes to the underlying storage. If there were a flush, everyone would use it, and the performance optimization would be gone.

Have you looked at using a load job as an alternative to a streaming insert?
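A load job writes directly into managed storage, so the loaded table has no streaming buffer to wait on. A minimal sketch with the Go client; the project ID, dataset, table, and GCS URI below are placeholders, and it needs GCP credentials, so it is not runnable as-is:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Load newline-delimited JSON from GCS instead of streaming rows.
	gcsRef := bigquery.NewGCSReference("gs://my-bucket/data.json") // placeholder URI
	gcsRef.SourceFormat = bigquery.JSON

	loader := client.Dataset("mydataset").Table("mytable").LoaderFrom(gcsRef)
	job, err := loader.Run(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Unlike a streaming insert, the load is a Job you can Wait on; once
	// it completes, DML on the table is not blocked by a streaming buffer.
	status, err := job.Wait(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if status.Err() != nil {
		log.Fatal(status.Err())
	}
	fmt.Println("load complete")
}
```

The tradeoff is latency: load jobs are batch operations, so rows are not queryable the instant they are sent, as they are with streaming inserts.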

@belablotski

Thank you, Jonathan. I see the reasoning: of course it would be an expensive operation, though "unlikely" here effectively means impossible. To be honest, my plan B is to move more functionality to the client side to minimize the need for DML.

@davehowell

@jba I am doing some testing, using Dataflow to stream from Pub/Sub into BigQuery, and I'd like a process that publishes some messages, checks what appears in BigQuery, and then resets it (i.e. deletes/truncates the BigQuery table).
The performance value of the streaming buffer makes total sense, and the tradeoffs are worth it, but I'm wondering what to do in this minor use case. I've been using in-line SQL, e.g. DELETE FROM {table_name} WHERE 1=1; the vacuous WHERE clause is there because BigQuery requires a WHERE clause on DELETE statements. The query works once the streaming buffer is empty, about 90 minutes after I stop publishing messages to Pub/Sub. If it's not empty, it throws the error above.

Is there a way to delete records that are already materialized, ignoring the streaming buffer records? That's probably not very useful, just wondering.

Would deleting and recreating the table be an alternative? Will BigQuery let me drop and recreate it within that 90-minute window? I will have a try.

@evmin02

evmin02 commented May 3, 2019

I would also be interested in an option to do iterative development on streaming inserts, i.e. being able to flush the buffer.


6 participants