Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add progress bar for to_arrow method #352

Merged
merged 10 commits into from Nov 16, 2020

Conversation

@HemangChothani
Copy link
Contributor

@HemangChothani HemangChothani commented Oct 29, 2020

Fixes #343

  • PR open for the feedback and suggestions.

  • Currently implemented for 'QueryJob.to_arrow' method

  • When the cacheHit for result is true at time query_plan is blank so can't implement progress bar.

@HemangChothani HemangChothani requested a review from tswast Oct 29, 2020
@google-cla google-cla bot added the cla: yes label Oct 29, 2020
@@ -20,6 +20,9 @@
import copy
import re
import threading
import tqdm
Copy link
Contributor

@tswast tswast Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be an optional import. See:

try:
import tqdm
except ImportError: # pragma: NO COVER
tqdm = None

Loading

@@ -3275,6 +3283,27 @@ def result(
rows._preserve_order = _contains_order_by(self.query)
return rows

def _get_progress_bar(self, progress_bar_type, description, total, unit):
Copy link
Contributor

@tswast tswast Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is very close to the logic in table.py

def _get_progress_bar(self, progress_bar_type):

I'd suggest creating a _tqdm_helpers.py module similar to our pandas helpers to hold this logic for both Table and Job logic.

Loading

@HemangChothani HemangChothani marked this pull request as ready for review Nov 2, 2020
@HemangChothani HemangChothani requested a review from as a code owner Nov 2, 2020
@@ -3337,7 +3339,49 @@ def to_arrow(
..versionadded:: 1.17.0
"""
return self.result().to_arrow(
if self.query_plan and progress_bar_type:
Copy link
Contributor

@tswast tswast Nov 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this logic necessary if we are calling to_arrow? Can't we rely on to_arrow's progress bar support?

Loading

Copy link
Contributor Author

@HemangChothani HemangChothani Nov 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't get this point, are you talking about table.to_arrow() progress bar support?

Loading

Copy link
Contributor

@tswast tswast Nov 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I was getting this confused with RowIterator, where to_dataframe relies on to_arrow's progress bar support. I do wonder if some of this "waiting for the query to finish" logic could be refactored into a _tqdm_helpers.py function, since it should be identical between to_dataframe and to_arrow

Loading

@@ -3406,7 +3450,46 @@ def to_dataframe(
Raises:
ValueError: If the `pandas` library cannot be imported.
"""
return self.result().to_dataframe(
query_plan = self.query_plan
if query_plan and progress_bar_type:
Copy link
Contributor

@tswast tswast Nov 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_plan can get updated as the job progresses. I'd prefer if this always created a progress bar, but had a generic message when the job is still pending and there isn't a query plan.

Loading

)

try:
query_result = self.result(timeout=0.5)
Copy link
Contributor

@tswast tswast Nov 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make constants for this (PROGRESS_BAR_UPDATE_INTERVAL, for example)

Loading

Copy link
Contributor

@tswast tswast left a comment

Looking good. A few nits regarding naming and indentation.

Loading

return None


def _query_job_result_helper(query_job, progress_bar_type=None):
Copy link
Contributor

@tswast tswast Nov 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use active verbs in this helper name.

Suggested change
def _query_job_result_helper(query_job, progress_bar_type=None):
def wait_for_query(query_job, progress_bar_type=None):

Loading


def _query_job_result_helper(query_job, progress_bar_type=None):
"""Return query result and display a progress bar while the query running, if tqdm is installed."""
if progress_bar_type:
Copy link
Contributor

@tswast tswast Nov 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can reduce the level of indentation if we exit early. Also, I think we'll want to pass through some keyword arguments to result(), correct?

Suggested change
if progress_bar_type:
if progress_bar_type is None:
return query_job.result()

Aside (not relevant for this PR): we'll eventually want to pass additional arguments to result() whenever we implement #296

Loading

_PROGRESS_BAR_UPDATE_INTERVAL = 0.5


def _get_progress_bar(progress_bar_type, description, total, unit):
Copy link
Contributor

@tswast tswast Nov 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to use these helpers from other modules, let's remove the (redundant because of private module) leading _.

Suggested change
def _get_progress_bar(progress_bar_type, description, total, unit):
def get_progress_bar(progress_bar_type, description, total, unit):

Loading

progress_bar = _get_progress_bar(
progress_bar_type, "Query is running", 1, "query"
)
if query_job.query_plan:
Copy link
Contributor

@tswast tswast Nov 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be one level deeper. I'd like to see the while True loop, even when query_plan is not initially populated.

Loading

Copy link
Contributor

@tswast tswast left a comment

I'd expect the while loop to contain the following steps:

  1. Call result() with a short-ish timeout.
  2. Call reload()
  3. Update the progress bar (different cases for query_plan present or not)

Loading

"""Return query result and display a progress bar while the query running, if tqdm is installed."""
if progress_bar_type is None:
query_result = query_job.result()
else:
Copy link
Contributor

@tswast tswast Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that you could return query_job.result() here. The else statement is then unnecessary, saving us 1 level of indentation.

Loading

i += 1
continue
else:
query_result = query_job.result()
Copy link
Contributor

@tswast tswast Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a timeout here. It's not clear to me why the above try block is even in the if statement.

Loading

Copy link
Contributor

@tswast tswast Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the only difference is the presence of total, which could be set to a default value in the case where query_plan is not present.

Loading

while True:
if query_job.query_plan:
total = len(query_job.query_plan)
query_job.reload() # Refreshes the state via a GET request.
Copy link
Contributor

@tswast tswast Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we only want to call reload after result times out?

Also, why would we only call reload inside this if statement? The job might not have a query_plan until after reload is called in many cases.

Loading

@HemangChothani
Copy link
Contributor Author

@HemangChothani HemangChothani commented Nov 13, 2020

system test failed not related to changes.

Loading

tswast
tswast approved these changes Nov 16, 2020
Copy link
Contributor

@tswast tswast left a comment

Love it!

P.S. I have opened #388 to fix those failing tests.

Loading

@tswast tswast merged commit dc78edd into googleapis:master Nov 16, 2020
9 of 10 checks passed
Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

2 participants