
Rules refactor #1366

Closed · wants to merge 79 commits into from
Conversation

kszucs (Member) commented Feb 24, 2018

No description provided.

@kszucs kszucs changed the title Prainbow [WIP] Rules refactor Feb 24, 2018
@kszucs kszucs force-pushed the prainbow branch 2 times, most recently from 2eb37b7 to 8a19a88 Compare February 26, 2018 16:31
cpcloud (Member) left a comment:

This is looking pretty good so far. We need to avoid unnecessary breakage, so that downstream users who have extended the library outside of upstream aren't broken by the refactor.

def bq_param_integer(param, value):
    return bq.ScalarQueryParameter(param._name, 'INT64', value)


@bigquery_param.register(ir.DoubleScalar, float)
@bigquery_param.register(ir.FloatingScalar, float)
cpcloud (Member):
Make sure there are tests for these

cpcloud (Member):
Eg passing in the different int and float subclasses

kszucs (Member Author):
I've removed those subclasses.

cpcloud (Member):
Why did you remove them?

kszucs (Member Author):
Because they are only required to be able to bind methods on them (otherwise even ScalarExpr(DataType) and ColumnExpr(DataType) would be enough).

The actual work is done by the datatype passed to ValueExpr as an argument. Also see DataType.scalar and DataType.column class properties.
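The registration pattern from the snippet above can be sketched with a stdlib-only two-argument dispatcher. Note that ibis implements bigquery_param with the multipledispatch package; PairDispatcher and IntegerScalar below are illustrative stand-ins, not the real API:

```python
class PairDispatcher:
    """Dispatch on the types of both arguments, in the spirit of the
    bigquery_param.register(...) decorators shown above."""

    def __init__(self):
        self._registry = {}

    def register(self, param_type, value_type):
        def decorator(func):
            self._registry[(param_type, value_type)] = func
            return func
        return decorator

    def __call__(self, param, value):
        # a linear scan is fine for a sketch; real dispatchers cache lookups
        for (pt, vt), func in self._registry.items():
            if isinstance(param, pt) and isinstance(value, vt):
                return func(param, value)
        raise TypeError('no rule registered for ({}, {})'.format(
            type(param).__name__, type(value).__name__))


bigquery_param = PairDispatcher()


class IntegerScalar:
    # hypothetical stand-in for ir.IntegerScalar
    _name = 'x'


@bigquery_param.register(IntegerScalar, int)
def bq_param_integer(param, value):
    return (param._name, 'INT64', value)
```

Testing then amounts to calling the dispatcher with each registered expression/value type pair and asserting that the right rule fires.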



def is_reduction(expr):
    # Aggregations yield typed scalar expressions, since the result of an
cpcloud (Member):
Let's put this in the docstring

@@ -348,12 +360,16 @@ def __str__(self):
    def _equal_part(self, other, cache=None):
        return self.precision == other.precision and self.scale == other.scale

    def largest(self):
        return Decimal(self.precision, 38)
cpcloud (Member):
A scale of 38 implies a precision of 38, because precision = scale + number of digits to the left of the decimal point so unless self.precision == 38 this type is impossible.

There's also the issue of what you mean by largest here. The largest decimal value that can be constructed can only be constructed with a decimal type of decimal(38, 0) and that's the integer with 9 repeated 38 times. Also, we shouldn't assume 38 digits of precision.
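A quick sketch of the arithmetic behind that claim; max_decimal_value is a hypothetical helper, not ibis API:

```python
from decimal import Decimal as D


def max_decimal_value(precision, scale):
    # the largest value a decimal(precision, scale) type can hold:
    # (precision - scale) nines before the point, scale nines after it
    whole = '9' * (precision - scale) or '0'
    frac = '9' * scale or '0'
    return D('{}.{}'.format(whole, frac))


# the overall largest value comes from decimal(38, 0): the integer
# with 9 repeated 38 times, i.e. 10**38 - 1
assert max_decimal_value(38, 0) == D(10 ** 38 - 1)
```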

kszucs (Member Author):
Ported from rules, used at the following places:

Decimal largest is clearly wrong, but what should I use instead?

cpcloud (Member):
How about we define largest for a given scale? In that case the definition of largest becomes Decimal(38, self.scale)

cpcloud (Member):
Let's assume 38 digits of precision for now. If we have users that desire more precision than that we can revisit in the future.
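Sketched as code, the agreed definition might look like this. DecimalType is an illustrative stand-in for ibis's Decimal datatype class:

```python
class DecimalType:
    __slots__ = 'precision', 'scale'

    def __init__(self, precision, scale):
        self.precision = precision
        self.scale = scale

    def largest(self):
        # widest type with the same scale; 38 digits of precision
        # is assumed for now, per the discussion above
        return DecimalType(38, self.scale)
```

So DecimalType(12, 2).largest() yields a decimal(38, 2) rather than the impossible decimal(12, 38).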

@@ -559,6 +584,9 @@ def _equal_part(self, other, cache=None):
any = Any()
null = Null()
boolean = Boolean()
integer = Integer()
floating = Floating()
cpcloud (Member):
What does it mean for a user to specify this as the type of a column (exposing them this way implies that they are part of the user-facing API). If it does have a meaning, are there tests for it?

kszucs (Member Author):
They were required previously, I'll drop them.

@@ -559,6 +584,9 @@ def _equal_part(self, other, cache=None):
any = Any()
null = Null()
boolean = Boolean()
integer = Integer()
floating = Floating()
decimal = Decimal(12, 2)
cpcloud (Member):
Why 12, 2 as the precision and scale, respectively?

class OperationMeta(type):

    @classmethod
    def __prepare__(metacls, name, bases, **kwds):
cpcloud (Member):
This doesn't exist on Python 2.7, so we'll need to find another way of writing this class. We may have a module that we import it from depending on the version of python so that we can continue to use __prepare__ in Python >=3
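For context, a minimal Python 3 sketch of what __prepare__ buys here: the metaclass sees class-body attributes in definition order, which is exactly the hook Python 2.7 lacks. Names below are illustrative:

```python
from collections import OrderedDict


class OperationMeta(type):
    @classmethod
    def __prepare__(metacls, name, bases, **kwds):
        # called before the class body executes; the returned mapping
        # records attribute definition order (Python 3 only)
        return OrderedDict()

    def __new__(metacls, name, bases, namespace, **kwds):
        cls = super().__new__(metacls, name, bases, dict(namespace))
        cls._field_order = tuple(
            key for key in namespace if not key.startswith('_'))
        return cls


class Example(metaclass=OperationMeta):
    arg = 1
    start = 2
    stop = 3
```

On 2.7 the hook is simply never called, so argument order has to be recovered some other way (e.g. the __slots__ or counter approaches discussed later in this thread).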

        return False

    def flat_args(self):
        for arg in self.args:
cpcloud (Member):
I feel like we have a flatten function lying around somewhere.

kszucs (Member Author):
There isn't one, should it go to ibis.util?

cpcloud (Member):
Let's leave this here for now.

cpcloud (Member):
One wrinkle here is that this only flattens one nesting level. So a general flatten function would probably break some things. Let's revisit flatten after this PR is merged.
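The one-level behaviour being described, as a free function (the real flat_args is a method iterating self.args; this sketch is illustrative):

```python
from collections.abc import Iterable


def flat_args(args):
    # flatten exactly one nesting level, leaving strings intact;
    # a fully recursive flatten would behave differently on any
    # element that is itself nested more than one level deep
    for arg in args:
        if isinstance(arg, Iterable) and not isinstance(arg, str):
            for item in arg:
                yield item
        else:
            yield arg
```

list(flat_args([1, [2, 3], [4, [5]]])) keeps the inner [5] nested, which is why swapping in a general recursive flatten could break callers.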

        if (self, other) in cache:
            return cache[(self, other)]

        if id(self) == id(other):
cpcloud (Member):
This can be changed to self is other.

HasSchema.__init__(self, schema, name=name)
name = rlz.instanceof(six.string_types)
schema = rlz.schema
source = rlz.noop
cpcloud (Member):
These noops should be instances of ibis.client.Client, right?

kszucs (Member Author):
Yes, there are a lot of noops currently which we should eventually configure.

TableNode.__init__(self, [schema, name])
HasSchema.__init__(self, schema, name=name)
schema = rlz.schema
name = rlz.optional(rlz.instanceof(six.string_types), default=genname)
cpcloud (Member):
Can you document that callables are allowed for defaults if you haven't already?

kszucs (Member Author):
Done.

self.buckets = buckets
self.closed = _validate_closed(closed)
__slots__ = ('arg', 'buckets', 'closed', 'close_extreme', 'include_under',
'include_over')
kszucs (Member Author):
@cpcloud I've added slots for each operation to support argument order on py2 too. While this is a bit more boilerplate, I wouldn't consider it bad.
I can use plain __init__ methods as well as the counter trick we've discussed; in case of the latter we need to enforce calling the rules.

Which approach do you prefer?
a. __slots__ or argnames or ...?
b. counter trick and rlz.any() or rlz.value(dt.any)
c. plain but idiomatic __init__ functions*?

  • which is not that bad, but op.args would require another order definition

kszucs (Member Author):
Here is another one:

class ArraySlice(ValueOp):
    arg = Argument(rlz.value(dt.Array(dt.any)))
    start = Argument(rlz.integer)
    stop = Argument(rlz.integer, default=None)
    output_type = rlz.typeof('arg')

kszucs (Member Author):
Based on gitter conversation we ended up with the following API:

class StringFind(ValueOp):
    arg = Arg(rlz.string)
    start = Arg(rlz.integer, default=None)
    end = Arg(rlz.integer, default=None)
    return_ = Return(like=arg, dtype=dt.int64)

kszucs (Member Author):
This will also make it possible to support operation definition with Python typing.

MapValue, MapScalar, MapColumn,
StructValue, StructScalar, StructColumn,
CategoryValue, unnamed, as_value_expr, literal,
param, null, sequence)
kszucs (Member Author) commented Feb 28, 2018:
These were not exposed in __all__, so I've removed them.

from ibis.compat import PY2, to_time, to_date
from ibis.expr.types import Expr, null, param, literal, sequence, as_value_expr
kszucs (Member Author):
We might factor out null, param, literal, sequence, as_value_expr to here, but we can do that incrementally :)

    def castable(self, target, **kwargs):
        return castable(self, target, **kwargs)

    def cast(self, target, **kwargs):
        return cast(self, target, **kwargs)

    def scalar_type(self):
        import ibis.expr.types as ir
        return getattr(ir, '{}Scalar'.format(self.name))
        return partial(self.scalar, dtype=self)
kszucs (Member Author):
@cpcloud These are actually "factories", not types. Are you OK with these?

cpcloud (Member):
Yes, that's fine with me.
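The factory shape under discussion, sketched with hypothetical stand-in classes (in ibis, DataType.scalar points at the corresponding expression class):

```python
from functools import partial


class Scalar:
    # stand-in for a scalar expression class
    def __init__(self, name, dtype=None):
        self.name = name
        self.dtype = dtype


class DataType:
    scalar = Scalar

    def scalar_type(self):
        # not a class but a factory: the dtype is baked in, so callers
        # supply only the remaining constructor arguments
        return partial(self.scalar, dtype=self)


int64 = DataType()
expr = int64.scalar_type()('x')
```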

case_expr = as_value_expr(case_expr)
result_expr = as_value_expr(result_expr)
case_expr = ir.as_value_expr(case_expr)
result_expr = ir.as_value_expr(result_expr)
kszucs (Member Author):
I think we should get rid of as_value_expr too, all the value coercions should be handled by rules.

cpcloud (Member):
I definitely agree with that.

@@ -915,502 +567,239 @@ def group_by(self, by=None, **additional_grouping_expressions):
# with: an instance of each is well-typed and includes all valid methods
# defined for each type.

# TODO: __slots__?
kszucs (Member Author):
To define or not to define? Use a metaclass?

self.include_under = bool(include_under)

if not len(buckets):
arg = Arg(rlz.noop)
cpcloud (Member):
If I understand correctly, Arg(rlz.noop) is equivalent to (the old) rules.value?

kszucs (Member Author):
No, noop is toolz.identity, just a placeholder where no rule was defined previously.

cpcloud (Member):
I see. LGTM.

import ibis.expr.analysis as _L
import ibis.expr.datatypes as dt
import ibis.expr.analytics as _analytics
import ibis.expr.operations as ops
cpcloud (Member):
Is there any reason to change the _ops and _com imports? The changes in this file are rather noisy because most of them just rename _ops to ops and _com to com. The reason these are prefixed with an underscore is to hide them from a star import.

kszucs (Member Author) commented Mar 11, 2018:
api.py has __all__ defined, which protects against star imports, and I thought ops is cleaner.

@@ -239,11 +230,16 @@ def __repr__(self):


class SignedInteger(Integer):
    pass

    def largest(self):
cpcloud (Member):
Any reason not to make these @propertys?

kszucs (Member Author):
No particular reason. When should I prefer a property over a method?

@@ -1143,6 +1188,12 @@ def can_cast_floats(source, target, upcast=False, **kwargs):
return True


@castable.register(Decimal, Decimal)
def cas_cast_decimals(source, target, **kwargs):
cpcloud (Member):
Should this be can_cast_decimals?

@six.add_metaclass(AnnotableMeta)
class Annotable(object):

    __slots__ = tuple()
cpcloud (Member):
This can just be __slots__ = ().

            self.default == other.default
        )

    def optional(self):
cpcloud (Member):
Should this be a @property?

        if self.default is None:
            return None
        elif util.is_function(self.default):
            value = self.default()
cpcloud (Member):
I wasn't aware that we support default argument values that are functions.

kszucs (Member Author):
That is for table genname:

class UnboundTable(PhysicalTable):
    schema = Arg(sch.Schema)
    name = Arg(six.string_types, default=genname)

I can remove it though.

cpcloud (Member):
No, let's leave it. It's fine for now.
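A minimal sketch of the callable-default behaviour, assuming an Argument class along these lines (the real one lives in signature.py and has more machinery):

```python
import itertools

_counter = itertools.count()


def genname():
    # stand-in for ibis's table-name generator
    return 'unnamed_{:d}'.format(next(_counter))


class Argument:
    def __init__(self, validator, default=None):
        self.validator = validator
        self.default = default

    def validate(self, value=None):
        if value is None and self.default is not None:
            # a callable default is invoked lazily, so each use
            # without an explicit value gets a fresh result
            value = self.default() if callable(self.default) else self.default
        return self.validator(value)


name = Argument(str, default=genname)
```

Each name.validate() call without a value yields a new unnamed_N, which is what UnboundTable relies on for distinct auto-generated table names.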

        elif name in kwargs:
            value = argument.validate(kwargs[name], name=name)
        else:
            value = argument.validate(name=name)
cpcloud (Member):
Do you think it might be possible to bind arguments using inspect.signature here so we don't have to recreate Python's calling convention algorithm?

kszucs (Member Author):
Good question. Isn't inspect.signature py3 only?

cpcloud (Member):
Yep, but there's the funcsigs package for that :)

cpcloud (Member):
I'm going to add funcsigs as a dependency for #1277, so don't worry about adding it here

kszucs (Member Author):
Converted to issue: #1396
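For reference, the inspect.signature approach looks like this on Python 3 (funcsigs provides the same API on 2.7); bind_arguments and string_find are illustrative names:

```python
import inspect


def bind_arguments(func, *args, **kwargs):
    # let the standard library apply Python's calling convention
    # (positional vs. keyword, defaults, errors on missing arguments)
    sig = inspect.signature(func)
    bound = sig.bind(*args, **kwargs)
    bound.apply_defaults()
    return dict(bound.arguments)


def string_find(arg, start=0, end=None):
    return arg, start, end
```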

@@ -182,22 +176,6 @@ def test_primitive(spec, expected):
assert dt.dtype(spec) == expected


def test_precedence_with_no_arguments():
cpcloud (Member):
Did you move these tests to another file?

schema = query.schema()

# clickhouse columns have been defined as non-nullable
# whereas other backends don't support non-nullable columns yet
cpcloud (Member):
I believe Impala supports this, and possibly other backends.

ir.Node.__init__(self, [foreign_table, predicates])
class NotExistsSubquery(ops.Node):
    foreign_table = Arg(rlz.noop)
    predicates = Arg(rlz.noop)
cpcloud (Member):
Nice to see this finally getting rolled into the rules system.

kszucs (Member Author):
We need to revisit all Args with rlz.noop :)

r = find_base_table(arg)
if isinstance(r, TableExpr):
return r
unnamed = UnnamedMarker()
cpcloud (Member):
I remember deleting this. Am I mistaken? If not, why bring it back?

kszucs (Member Author):
Probably; it's present in master.

@@ -1288,7 +1289,7 @@ def g(x):
def test_pickle_table_expr():
schema = [('time', 'timestamp'), ('key', 'string'), ('value', 'double')]
t0 = ibis.table(schema, name='t0')
raw = pickle.dumps(t0)
raw = pickle.dumps(t0, protocol=2)
cpcloud (Member):
Why are you passing a specific protocol version here?

kszucs (Member Author):
On py27, serializing objects with __slots__ requires protocol 2.
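A small demonstration of the constraint: on Python 2, pickling a slotted instance with the default protocol 0 raised a TypeError unless __getstate__ was defined, while protocols >= 2 handle __slots__ through __reduce_ex__:

```python
import pickle


class Slotted(object):
    __slots__ = ('value',)

    def __init__(self, value):
        self.value = value


obj = Slotted(42)
# protocol 2 pickles slot state without any __getstate__/__setstate__
restored = pickle.loads(pickle.dumps(obj, protocol=2))
```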

    arg = Argument(lambda x: x)

class StringOp(Op):
    arg = Argument(str)  # inherit
cpcloud (Member):
Should this be Arg instead of Argument?

kszucs (Member Author):
In signature.py it's defined as Argument and aliased as Arg during import. Do you mean renaming it in signature.py?

cpcloud (Member):
I see. No, leaving it here as Argument is fine.


    def _repr(self, memo=None):
        if memo is None:
            from ibis.expr.format import FormatMemo
cpcloud (Member):
Probably OK to just import in the function; this is a little strange.

kszucs (Member Author):
There are circular imports.

            cache[(self, other)] = True
            return True

    def is_ancestor(self, other):
cpcloud (Member):
Is this being used anywhere? It seems like it's really not that different from just a straight-up call to equals.

else:
return rules.shape_like(arg, 'int64')
"""Absolute value"""
output_type = rlz.typeof('arg')
cpcloud (Member):
Can you not pass arg directly here? It is in scope after all.

kszucs (Member Author) commented Mar 11, 2018:
arg is actually an Argument instance without any knowledge of the underlying data; it also doesn't know its own name until the metaclass finishes signature construction.
A solution would be to set the argument's name in the metaclass, or to use descriptors and _sunder slots.
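On Python 3.6+ the descriptor protocol offers a third option: __set_name__ lets each Argument learn its attribute name without metaclass bookkeeping. It isn't available on 2.7, which this PR still supported, so this is only a sketch:

```python
class Argument:
    def __set_name__(self, owner, name):
        # called automatically when the class body finishes executing,
        # telling the descriptor which attribute it was bound to
        self.name = name


class StringFind:
    arg = Argument()
    start = Argument()
```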

kszucs (Member Author):
Actually, you could give me a couple of ideas on how we should abstract the return_ definition.



class Capitalize(StringUnaryOp):
    pass
    """Return a capitalized version of input string"""
cpcloud (Member):
Thanks for adding the documentation here.



class Union(TableNode, HasSchema):
    left = Arg(rlz.noop)
cpcloud (Member):
Should probably be TableExpr here, but leave it as is if making that change doesn't trivially work.

cpcloud (Member) commented Apr 5, 2018:

Bombs away!

@cpcloud cpcloud closed this in 9394436 Apr 5, 2018
@kszucs kszucs deleted the prainbow branch May 30, 2022 12:40
Labels: expressions (Issues or PRs related to the expression API), feature (Features or general enhancements)

2 participants