
Setting a timeout for BigQuery async_query doesn't work #4135

Closed

jasonqng opened this issue Oct 7, 2017 · 7 comments

Labels: api: bigquery (Issues related to the BigQuery API.), type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

jasonqng commented Oct 7, 2017

The documentation and code appear to allow setting a maximum duration before an async_query times out, by passing timeout (an int for the number of milliseconds) to the result() function called on a query job (https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/job.py#L476 and https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/job.py#L1311). Similarly, there also appear to be references to the fetch_data() function of a QueryResults object accepting a timeout parameter (https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/query.py#L390).

However, passing a timeout value of 1 or 0 in either fashion with an async_query does nothing to interrupt the query. Queries complete regardless and output results:

from google.cloud import bigquery
import uuid

client = bigquery.Client(project=project)
query_job = client.run_async_query(str(uuid.uuid4()),
                                   "select 'hello' as a, 32423432 as b")
query_job.begin()
# Both of these timeouts are expected to interrupt the query, but do not.
query_results = query_job.result(timeout=1)
data = query_results.fetch_data(timeout_ms=1)
rows = list(data)
print(rows)

Output:

[(u'hello', 32423432)]

By contrast, setting a timeout for a sync_query via the timeout_ms property works as expected and sets the appropriate jobComplete = False flag:

from google.cloud import bigquery

client = bigquery.Client(project=project)
query_job = client.run_sync_query("select 'hello' as a, 32423432 as b")
query_job.timeout_ms = 1
query_job.run()
# jobComplete is False because the 1 ms wait elapsed before the query finished.
print(query_job._properties.get("jobComplete"))
rows = list(query_job.rows)
print(rows)

Output:

False
[]

It appears that a timeout for async_queries is not yet implemented, and it is unclear how to access a jobComplete property via an async_query, forcing users to keep using sync_queries if they want the ability to time out their queries.

cc: @tswast who requested I raise the issue which I identified in this pandas-gbq PR: googleapis/python-bigquery-pandas#25 (comment)

OSX
Python 2.7.13
google-cloud-python 0.27.0 and google-cloud-bigquery 0.26.0

tswast (Contributor) commented Oct 9, 2017

/cc @jonparrott

@tswast tswast added the api: bigquery Issues related to the BigQuery API. label Oct 9, 2017
tswast (Contributor) commented Oct 9, 2017

Just to clarify, the timeout parameter on your last run_sync_query example does not set a timeout for the query. It sets a timeout on waiting for the results. If the request times out, the query is still running in the background. You can verify this by looking at the list of jobs created.

You have to run QueryJob.cancel() to actually stop the query, and even that is only best-effort because the query could theoretically complete before the cancel request propagates to the query workers.

One reason for the behavior you are experiencing is that the result timeout is in seconds and the provided query will almost certainly complete within 1 second. See: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Future.result
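
A minimal sketch of that pattern, assuming the 0.26-era API used above and that result() follows concurrent.futures semantics (a timeout in seconds that raises concurrent.futures.TimeoutError when exceeded); the 30-second value and variable names are only illustrative:

import concurrent.futures
import uuid

from google.cloud import bigquery

client = bigquery.Client(project=project)  # `project` as in the examples above
query_job = client.run_async_query(str(uuid.uuid4()),
                                   "select 'hello' as a, 32423432 as b")
query_job.begin()

try:
    # Wait at most 30 seconds for the job to finish.
    query_results = query_job.result(timeout=30)
except concurrent.futures.TimeoutError:
    # The query is still running server-side; cancellation is best-effort.
    query_job.cancel()
    raise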

@tswast tswast added the type: question Request for information or clarification. Not an issue. label Oct 9, 2017
jasonqng (Author) commented Oct 9, 2017

I'm trying to provide backward compatibility for this feature: googleapis/python-bigquery-pandas#76. The goal is to prevent stalled jobs from hanging an entire script and to have an elegant way to move on without having to write our own timer function; if the actual query can be killed in the background, all the better, but at a bare minimum the call should stop waiting for results after a set amount of time and move on. Unfortunately I can't reproduce that behavior with an async_query, only with a sync_query.

As for seconds, I tried several more complex queries that definitely take more than 1 second to complete, and results were still returned for all of them. I'm OK with switching to seconds in my project since I can't imagine many use cases for sub-second timeouts, but if it is indeed supposed to work this way, the documentation should be updated to reflect it, since there is no mention of the timeout parameter being in seconds (and legacy users and those perusing the docs might assume milliseconds because of the timeoutMs and timeout_ms references).
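
A hypothetical back-compatibility shim along those lines, assuming result()'s timeout really is in seconds; the function name is made up for illustration:

def result_with_ms_timeout(query_job, timeout_ms):
    # Accept a legacy millisecond value but hand result() seconds,
    # matching concurrent.futures.Future.result() semantics.
    return query_job.result(timeout=timeout_ms / 1000.0)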

tswast (Contributor) commented Oct 16, 2017

I think I know what's going on. The call to result() repeatedly calls the done() method, but since done() makes an API call that can last up to 10 seconds, any timeout shorter than 10 seconds won't work.

@jonparrot Do you think we could add an optional timeout parameter to done()? Or should we pass the timeout from result() to done() in some other way?
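
An illustrative sketch of the polling behaviour being described (the names and structure are simplified stand-ins, not the library's internal code): each done() call issues a request that can itself block for up to ~10 seconds, so a deadline shorter than a single poll is only noticed after that poll returns.

import concurrent.futures
import time

def result(job, timeout=None):
    deadline = None if timeout is None else time.monotonic() + timeout
    # One done() call may block for up to ~10 seconds on the API request,
    # so short deadlines can be overshot before they are ever checked.
    while not job.done():
        if deadline is not None and time.monotonic() > deadline:
            raise concurrent.futures.TimeoutError('deadline exceeded')
        time.sleep(1)  # polling interval
    return job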

theacodes (Contributor) commented

@tswast I don't think we should use connection/response timeouts to indicate that a job is incomplete. If we are going to poll indefinitely for the response, then blocking on the request is fine, but if the poll has a deadline, then the request to check whether a job is done should return as quickly as possible.

@tswast tswast added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Oct 16, 2017
@tswast tswast self-assigned this Oct 16, 2017
tswast (Contributor) commented Oct 16, 2017

Since the deadline is an integer in seconds, maybe it would make sense if done() always tries to complete within 1 second (as compared to the current default of 10 seconds)?
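
A hypothetical illustration of that proposal: cap the server-side wait (the timeoutMs field on jobs.getQueryResults) at roughly 1 second per poll so the client-side deadline in result() can be honoured at about one-second granularity. The helper name and api_request plumbing are assumptions for illustration, not the library's actual internals:

MAX_POLL_WAIT_MS = 1000  # instead of the current ~10,000 ms per-request wait

def poll_job_once(api_request, project, job_id):
    # One jobs.getQueryResults call that returns within ~1 second,
    # reporting jobComplete=False if the query is still running.
    return api_request(
        method='GET',
        path='/projects/{}/queries/{}'.format(project, job_id),
        query_params={'timeoutMs': MAX_POLL_WAIT_MS},
    )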

@tswast tswast removed the type: question Request for information or clarification. Not an issue. label Oct 17, 2017
tswast (Contributor) commented Oct 18, 2017

This will be fixed in the next release after the bigquery-b2 feature branch is merged to master.
