
BigQuery: QueryJob().done() method gets stuck #7831

Closed
igor47 opened this issue Apr 30, 2019 · 11 comments · Fixed by #9875
Labels: api: bigquery (Issues related to the BigQuery API.), type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design.)

Comments


igor47 commented Apr 30, 2019

My environment:

 $ python --version
Python 3.7.1
 $ cat requirements.txt | grep bigquery
google-cloud-bigquery==1.8.0

Here's a code sample:

    while self.backgrounded_jobs_remain:
      # find one ready job
      for q in self.query_jobs:
        self.logger.debug(f"checking to see if job {q['job']} is done...")
        if q['job'].done():
          self.logger.info(f"job {q['job']} is complete! returning it.")
          return (q['job'], q['memo'])

      # if we're still here, we need to wait some more
      self.logger.info(f'waiting for one of {len(self.query_jobs)} pending jobs to complete...')

In the usual case, the output looks something like this:

DEBUG datapipes.bqreader : checking to see if job <google.cloud.bigquery.job.QueryJob object at 0x7f186aac9898> is done...
DEBUG urllib3.connectionpool : https://www.googleapis.com:443 "GET /bigquery/v2/projects/myproject/queries/c14e18c5-3ef2-472e-9eab-34ea85fbc2ee?maxResults=0&timeoutMs=0&location=US HTTP/1.1" 200 None
INFO datapipes.bqreader : waiting for one of 10 pending jobs to complete...
DEBUG datapipes.bqreader : checking to see if job <google.cloud.bigquery.job.QueryJob object at 0x7f186b18feb8> is done...
DEBUG urllib3.connectionpool : https://www.googleapis.com:443 "GET /bigquery/v2/projects/myproject/queries/58d2e451-c372-41ac-acb6-395f48eff6e1?maxResults=0&timeoutMs=0&location=US HTTP/1.1" 200 None

However, sometimes it looks like this:

DEBUG datapipes.bqreader : checking to see if job <google.cloud.bigquery.job.QueryJob object at 0x7f186aac9898> is done...
DEBUG urllib3.connectionpool : https://www.googleapis.com:443 "GET /bigquery/v2/projects/myproject/queries/c14e18c5-3ef2-472e-9eab-34ea85fbc2ee?maxResults=0&timeoutMs=0&location=US HTTP/1.1" 200 None
INFO datapipes.bqreader : waiting for one of 10 pending jobs to complete...
DEBUG datapipes.bqreader : checking to see if job <google.cloud.bigquery.job.QueryJob object at 0x7f186b18feb8> is done...

Then the output proceeds no further.
As best as I can tell, the .done() function is getting stuck somewhere, without ever even issuing the urllib3 query (since there is no urllib3 log line).
When this happens, my task just hangs indefinitely.

I'm running this in containers, so I don't have access to good debugging tools, but I'd be willing to install them if necessary.
I'm going to try to read the code for .done() and see if I can spot what's happening, but I wanted to open the issue to get additional eyes on it.

This happens pretty sporadically (maybe one out of 50 jobs gets stuck), which makes debugging more difficult.


igor47 commented Apr 30, 2019

Oh, I forgot to mention that I previously set:

 job._done_timeout = 0

to allow API calls to return quickly instead of using the built-in timeout.


igor47 commented May 1, 2019

I realized that the DEBUG line from urllib3 is not printed until the API call returns (because that debug line contains the result/status code of the call). So if the API call is started but then takes forever, my code hangs forever. We could be failing to look up DNS, waiting on the connection pool for an available connection (though I wouldn't expect that in my single-threaded code), hanging while establishing the TCP session for the API request, or stuck for any number of other reasons.

Internally, BigQuery uses the google-auth library for making API calls, which in turn uses requests, and requests does support read and connect timeouts: https://2.python-requests.org/en/master/user/advanced/#timeouts

Is there a way for me to set such parameters via the BigQuery API?
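For reference, the requests interface linked above accepts separate connect and read timeouts; a minimal sketch with illustrative values:

    import requests

    # (connect, read) timeouts in seconds; 3.05/27 are the illustrative
    # values from the requests documentation linked above
    response = requests.get("https://www.googleapis.com", timeout=(3.05, 27))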

@yoshi-automation added the "triage me" label on May 1, 2019
@tseaver added the "api: bigquery" and "type: question" labels and removed the "triage me" label on May 1, 2019

tseaver commented May 1, 2019

First, setting job._done_timeout = 0 is odd: could you try making it a larger value (maybe 0.5 or 1)? The requests docs for timeouts actually suggest a value slightly larger than a multiple of 3 for the connect timeout, so maybe use 3.05 or so.

QueryJob.done makes up to three API calls:

  • refresh credentials, if needed.
  • load the query results (which is the projects/myproject/queries/<query-uri> one you see in your log); this is the one which uses a timeout computed from job._done_timeout.
  • reload the query job resource if the results show completed: in this case, the path would be projects/myproject/jobs/<job-id>.

You can pass a custom _http object to the client's constructor: an instance of a class derived from google.auth.transport.requests.AuthorizedSession, if you want to try forcing a timeout on the other requests.
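A minimal sketch of that approach, assuming the private _http constructor argument mentioned above (TimeoutSession is a made-up name, and the timeout values are illustrative):

    import google.auth
    from google.auth.transport.requests import AuthorizedSession
    from google.cloud import bigquery

    class TimeoutSession(AuthorizedSession):
        """Session that forces a client-side (connect, read) timeout on every request."""

        def request(self, method, url, **kwargs):
            kwargs.setdefault("timeout", (3.05, 30))  # illustrative values
            return super().request(method, url, **kwargs)

    credentials, project = google.auth.default()
    client = bigquery.Client(
        project=project,
        credentials=credentials,
        _http=TimeoutSession(credentials),
    )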


igor47 commented May 1, 2019

The timeout derived from job._done_timeout is not set on any of the relevant connection objects. It is passed as an argument in the request, and is used on the backend to determine how long the backend waits for the job before returning an HTTP response.

My workaround for now is to run job.done() in a thread and move on if that request doesn't complete soon enough.
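Roughly like this (a sketch of that workaround; the five-second budget is arbitrary):

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    executor = ThreadPoolExecutor(max_workers=4)

    def done_with_deadline(job, deadline=5.0):
        """Run job.done() in a worker thread; treat a hung request as 'not done'."""
        future = executor.submit(job.done)
        try:
            return future.result(timeout=deadline)
        except TimeoutError:
            return False  # the request is stuck; move on and poll again later

Note that a hung call still occupies its worker thread, so the pool needs enough workers to absorb stuck requests.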


tseaver commented May 1, 2019

Rather than looping on job.done, you could use QueryJob.result(timeout=<whatever>), which will do the blocking poll for you, and bail out if the timeout is exceeded.
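For example (the timeout value is illustrative):

    from concurrent.futures import TimeoutError

    try:
        rows = query_job.result(timeout=30)  # blocking poll for up to ~30 seconds
    except TimeoutError:
        pass  # the job isn't finished yet; retry or give up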


igor47 commented May 2, 2019

Thanks @tseaver, that seems like a pretty good solution. The main drawback is that I don't want a RowIterator at that moment. However, it looks like _blocking_poll does exactly what I want: it calls .done() and raises a concurrent.futures.TimeoutError if the job doesn't finish quickly enough.

Possibly a feature request: could a similar timeout parameter be added to other functions, like done or reload? That would save me from using a "private" method.


igor47 commented May 2, 2019

Never mind: the mechanism in _blocking_poll won't solve the problem. After reading the code and navigating the many layers of abstraction, I see that the retry mechanism simply invokes the target function, expecting it to complete right away. If it does complete right away, it will be retried for the timeout duration, but if it gets stuck forever, so does the rest of the program. Since the target function (done_or_raise) simply invokes done(), using _blocking_poll is just a more complicated way of calling done() directly.

Since result() also just invokes _blocking_poll, calling result with a timeout won't help.
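In other words, the timeout bounds the polling loop, not any single call. A paraphrase of the polling logic (not a verbatim copy of google.api_core):

    import time
    from concurrent.futures import TimeoutError

    def blocking_poll_paraphrase(job, timeout, poll_interval=1.0):
        """The deadline is only checked *between* calls to job.done();
        a single call that never returns is never interrupted."""
        deadline = time.monotonic() + timeout
        while True:
            if job.done():  # if this call hangs, nothing unblocks it
                return
            if time.monotonic() > deadline:
                raise TimeoutError("job did not finish within the timeout")
            time.sleep(poll_interval)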

There's definitely something broken here (maybe just on my end, since nobody else seems to be having jobs that get stuck forever?). I haven't figured out what it is yet, but I'm going to stick with calling done() in a thread for now; maybe that'll help me learn the conditions under which the failure occurs.


tswast commented Nov 12, 2019

possibly a feature request: can a similar timeout parameter be added to other functions, like done or reload ? that would save me from using a "private" method

This is a good request. My suspicion is that the GET job resource request is hanging.

We should also investigate the timeout argument on Client._get_query_results to ensure that it uses the timeout options from the requests module in addition to setting a timeout in the request body.
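To illustrate the two distinct timeouts involved (a sketch; the URL and values are illustrative, and JOB_ID is a placeholder):

    import requests

    response = requests.get(
        "https://www.googleapis.com/bigquery/v2/projects/myproject/queries/JOB_ID",
        params={"timeoutMs": 0},  # server-side: how long BigQuery waits for the job
        timeout=(3.05, 30),       # client-side: connect/read limits on the HTTP call itself
    )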

@tswast added the "type: feature request" label and removed the "type: question" label on Nov 12, 2019

igor47 commented Nov 14, 2019

investigate the timeout argument on Client._get_query_results to ensure that it uses the timeout options from the requests module in addition to setting a timeout in the request body.

Yes, client-side timeout support would make this library much more resilient.


plamut commented Nov 18, 2019

FWIW, the reported behavior can be reproduced with a small trick: call time.sleep(10) just before the if q['job'].done() check, and then, during that brief sleep, disable the internet connection on the machine. This simulates a hanging GET request from the .done() method.

If the connection is re-enabled, a subsequent retry eventually kicks in and the .done() check completes.
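A sketch of that reproduction against the loop from the original report (query_jobs is the same list of {'job': ..., 'memo': ...} dicts):

    import time

    for q in query_jobs:
        time.sleep(10)        # disable the machine's network during this window
        if q['job'].done():   # with the network down, this GET request hangs
            print(f"job {q['job']} finished")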

@tommydangerous

👋 @igor47! Nice seeing you here.
