Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: add a bunch of bigquery ops #1222

Closed
wants to merge 23 commits into from

Conversation

tsdlovell
Copy link

... and fixup some things in BigQueryClient.{_execute,raw_sql}

@tsdlovell
Copy link
Author

still need to add some tests

@tsdlovell tsdlovell force-pushed the bigquery-ops branch 6 times, most recently from 3eac40d to eabddd9 Compare November 14, 2017 22:04
@cpcloud cpcloud self-assigned this Nov 20, 2017
@cpcloud cpcloud self-requested a review November 20, 2017 04:09
@cpcloud cpcloud added this to the 0.13 milestone Nov 20, 2017
return BigQueryCursor(query)
if results:
return BigQueryCursor(query)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this, since it's a no-op.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the else and default to python's implicit return None or remove the if?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the else: and write return None or nothing, up to you.



def approx_count_distinct(arg):
return bq_ops.ApproxCountDistinct(arg).to_expr()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ibis already has HLLCardinality for this. Not a great name since it contains the algorithm used to implement the functionality. We should rename it to ApproxCountDistinct. You don't have to do that in this PR though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API is column.approx_nunique(), just FYI.

Copy link
Author

@tsdlovell tsdlovell Nov 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: BigQuery offers an explicitly separate HyperLogLog API



def format_date(fmt, arg):
return bq_ops.FormatDate(arg, fmt).to_expr()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ibis calls this ops.Strftime. The API is value.strftime('some format string').

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strftime should cover all of the cases that bigquery separates like TIME_FORMAT, DATE_FORMAT, and TIMESTAMP_FORMAT.

args = expr.op().args
(arg, percentile_rank) = map(translator.translate, args[:2])
value_type = args[2]
command = dict(cont='percentile_cont', disc='percentile_disc')[value_type]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about 'percentile_{}'.format(value_type)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this out of this PR, and I'm going to refactor Quantile to be an AnalyticOp so that we don't have to have both Quantile and Percentile.


def _approx_quantile(translator, expr):
args = expr.op().args
args = [translator.translate(arg) for arg in args[:2]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the :2? Is there a third argument you're leaving out on purpose?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, ignore_nulls which is the next function down and raises NotImplementedError if it evaluates to True

def _timestamp_diff(translator, expr):
(arg0, arg1, timestamp_part) = expr.op().args
(af0, af1) = [translator.translate(arg) for arg in (arg0, arg1)]
return 'TIMESTAMP_DIFF({}, {}, {})'.format(af0, af1, timestamp_part)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want a generic method like delta that can choose the correct function to call (DATE_DIFF/TIME_DIFF/TIMESTAMP_DIFF) based on the input types. This would also need a promotion mechanism to determine which type to promote different arguments to.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: BigQuery also has a Datetime type and I don't think we have a way of differentiating in between Datetime and Timestamp in ibis.

@@ -1044,6 +1062,15 @@ class NthValue(AnalyticOp):
input_type = [rules.column, rules.integer]
output_type = rules.type_of_arg(0)


class Arbitrary(Reduction):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a test.

def format_date(fmt, arg):
return bq_ops.FormatDate(arg, fmt).to_expr()
klass_name = expr.op().args[0].type().__class__.__name__.upper()
if klass_name in ['TIMESTAMP', 'DATE', 'TIME']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an assertion since you shouldn't be able to reach this code path if the type system is working correctly.

(typ0, typ1) = (arg0.type(), arg1.type())
klass_name = typ0.__class__.__name__.upper()
assert typ0 == typ1, (
"Don't know how to diff different temporal types: {}, {}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If users can hit these they should be TypeErrors. Same for the assert below.

bq_ops.TimestampDiff: _timestamp_diff,
bq_ops.DateDiff: _temporal_diff,
bq_ops.TimestampDiff: _temporal_diff,
bq_ops.TimeDiff: _temporal_diff,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These classes should go in ibis.expr.operations.

return '{}_DIFF({}, {}, {})'.format(klass_name, af0, af1, part)


def temporal_diff(arg0, arg1, *args):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you call these left and right

@cpcloud cpcloud modified the milestones: 0.13, 0.14 Mar 11, 2018
@cpcloud cpcloud modified the milestones: 0.14, 0.15 Jul 16, 2018
@xmnlab
Copy link
Contributor

xmnlab commented Dec 31, 2018

hey @tsdlovell what is the status of this PR?

there are some files with conflict, could you rebase your code?

@cpcloud cpcloud modified the milestones: 1.0.0, Future Mar 17, 2019
@datapythonista
Copy link
Contributor

Closing as stale. @tsdlovell, if you want to continue working on this, please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants