Added rules for validating that tables have certain columns and for c… #1298

Closed
wants to merge 16 commits into
base: master
from

3 participants
@DiegoAlbertoTorres
Contributor

DiegoAlbertoTorres commented Jan 24, 2018

…olumn type checking.

@DiegoAlbertoTorres DiegoAlbertoTorres force-pushed the DiegoAlbertoTorres:rules branch 4 times, most recently from 0a398a5 to a609965 Jan 25, 2018

@DiegoAlbertoTorres DiegoAlbertoTorres changed the title from WIP: Added rules for validating that tables have certain columns and for c… to Added rules for validating that tables have certain columns and for c… Jan 29, 2018

Parameters
----------
satisfying : iterable
    An iterable of column rules.

@DiegoAlbertoTorres

DiegoAlbertoTorres Jan 29, 2018

Contributor

Ah crap, have to finish this docstring.

@DiegoAlbertoTorres

DiegoAlbertoTorres Jan 29, 2018

Contributor

Done. 👍

self.doc = doc
self.validator = validator
def _validate(self, args, i):

@DiegoAlbertoTorres

DiegoAlbertoTorres Jan 31, 2018

Contributor

@cpcloud I just realized that with the current implementation a table can pass validation simply by having enough columns to pass the rules. We can of course control the number of columns with lambda rules, but this might be counterintuitive. Maybe we should have an argument like allow_extra that controls whether columns that do not match any rule are allowed.
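
To make the proposed behavior concrete, here is a minimal plain-Python sketch. It does not use the ibis API: the schema is a dict of column name to type name, each rule is a hypothetical (name_or_None, type_name) pair, and the False default for allow_extra is an assumption.

```python
def validate_table(schema, rules, allow_extra=False):
    """Return True if every rule is met by a distinct column.

    schema: dict mapping column name -> type name (stand-in for a table schema).
    rules:  list of (name_or_None, type_name) pairs (stand-in for column rules).
    """
    unmatched = dict(schema)
    # Check named rules first so an unnamed rule cannot consume a named column.
    for name, type_name in sorted(rules, key=lambda rule: rule[0] is None):
        if name is not None:
            if unmatched.get(name) != type_name:
                return False
            del unmatched[name]
        else:
            match = next(
                (col for col, typ in unmatched.items() if typ == type_name),
                None)
            if match is None:
                return False
            del unmatched[match]
    # Columns left over matched no rule; they are only fine with allow_extra.
    return allow_extra or not unmatched
```

With allow_extra=False, a table with a second double column fails a rule set that asks for only one; with allow_extra=True the same table passes.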

@DiegoAlbertoTorres

DiegoAlbertoTorres Feb 1, 2018

Contributor

I implemented this.

An iterable of lambda expressions, which will be used to validate
arguments. Each lambda expression must take a table as its single
argument, validate, and then return ``True`` for tables that
pass validation and ``False`` otherwise.

@jreback

jreback Feb 2, 2018

Contributor

is it possible to have satisfying or schema (or both as you have them now)?

iow what if i don’t care about a specific schema
i could validate in satisfying (or do some general rule inside that)

@DiegoAlbertoTorres

Contributor

DiegoAlbertoTorres commented Feb 6, 2018

I cleaned up the example a bit. Once tests pass I think this will be ready.

validator : ???
???
doc : ???
???

@DiegoAlbertoTorres

DiegoAlbertoTorres Feb 6, 2018

Contributor

@cpcloud I couldn't figure out what to do with these (the validator and doc args). I don't use them, but stuff breaks if I don't set them to None. Maybe I could set them to None manually (without turning them into args), but first I want to make sure that it makes sense. Do you know how these are used?

@cpcloud

cpcloud Feb 9, 2018

Member

doc lets you give a docstring to the argument:

In [1]: class Foo(ibis.expr.operations.ValueOp):
   ...:     input_type = [ibis.expr.rules.double(name='foo', doc='bar bar double')]
   ...: 

In [2]: x = Foo(1)

In [3]: x.a?
Object `x.a` not found.

In [4]: x.foo?
Type:        property
String form: <property object at 0x7fedee9b2278>
Docstring:   bar bar double
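
The session above shows the argument exposed as a property whose docstring is the doc value. A minimal sketch of that mechanism using only the built-in property follows; argument_property and this Foo class are hypothetical stand-ins, and how ibis actually wires this up is an assumption.

```python
def argument_property(index, doc=None):
    """Build a read-only property returning the positional argument at index."""
    def getter(self):
        return self.args[index]
    # property() stores `doc` as the property's __doc__.
    return property(getter, doc=doc)


class Foo(object):
    # Hypothetical stand-in for an operation class with one argument.
    foo = argument_property(0, doc='bar bar double')

    def __init__(self, *args):
        self.args = args


x = Foo(1)
print(x.foo)            # 1
print(Foo.foo.__doc__)  # bar bar double
```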

@cpcloud

cpcloud Feb 9, 2018

Member

Looks like validators let you pass in your own validation function, see line 239 in this file.

@DiegoAlbertoTorres

DiegoAlbertoTorres Feb 9, 2018

Contributor

I see. Should we leave it? Can't think of any use case for it, if users want a custom validator they should just write one, right?

@DiegoAlbertoTorres
    arguments. Each lambda expression must take a table as its single
    argument, validate, and then return ``True`` for tables that
    pass validation and ``False`` otherwise.
schema : iterable

@jreback

jreback Feb 7, 2018

Contributor

mark as optional

@DiegoAlbertoTorres
    The name of the table argument.
optional : bool
    Whether this table argument is optional or not.
satisfying : iterable

@jreback

jreback Feb 7, 2018

Contributor

mark as optional

@DiegoAlbertoTorres
MyOp(123)
def test_table_invalid_satisfying():

@jreback

jreback Feb 7, 2018

Contributor

can you add a test that doesn't pass either satisfying or schema

@DiegoAlbertoTorres

DiegoAlbertoTorres Feb 8, 2018

Contributor

There were already some tests, but I just added more.

@DiegoAlbertoTorres DiegoAlbertoTorres force-pushed the DiegoAlbertoTorres:rules branch from e2e9ff7 to 42f292c Feb 8, 2018

@cpcloud cpcloud added this to the 0.13 milestone Feb 9, 2018

@jreback jreback added the enhancement label Feb 11, 2018

# Check column schema
rules_matched = 0
for column_rule in self.schema:
    if not isinstance(column_rule, Argument):

@jreback

jreback Feb 11, 2018

Contributor

can you add some comments on the various sections on what is going on / what you are checking

... lambda t: t.schema().types.count(dt.Int64()) >= 2],
... )]
... output_type = rules.type_of_arg(0)
"""

@jreback

jreback Feb 11, 2018

Contributor

add a Notes section about satisfying, schema, allow_extra can be mixed / matched to produce a validator and not all are needed

@DiegoAlbertoTorres
Op(table)
@pytest.mark.parametrize(

@jreback

jreback Feb 11, 2018

Contributor

I would separate the tests that raise from the ones that don't and don't use check_op_input at all. Much more readable.

@DiegoAlbertoTorres

DiegoAlbertoTorres Feb 13, 2018

Contributor

I tried that and it was a mess, because the same schemas would be repeated twice, once for tests passing validation and those not passing.

I think it is waaay clearer to say "given this schema, here are things that pass and things that don't pass".

@cpcloud

cpcloud Mar 9, 2018

Member

Why would they be repeated? There's a set of schemas for passing, and one for failing. This should definitely be separated into two tests: test_table_with_schema and test_table_with_invalid_schema. The True/False is hard to follow. If the name of the function indicates what's supposed to happen, then it's very clear what the test is testing.

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Sorry, didn't explain this right. Indeed, the schemas for the tables being validated would not be repeated. But the schema we are comparing against will be repeated.

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

This would end up looking as:

@pytest.mark.parametrize(
    "... many passing test cases here ...")
def test_table_with_schema():
    "<validation code with a schema>"

@pytest.mark.parametrize(
    "... failing test cases here ...")
def test_table_with_schema_failing():
    with pytest.raises(IbisTypeError):
        "<same validation code with the same schema>"

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Eh, did it like that. It's not so bad.

@DiegoAlbertoTorres

Contributor

DiegoAlbertoTorres commented Feb 16, 2018

@jreback bump

@jreback

Contributor

jreback commented Feb 16, 2018

lgtm. needs a rebase. if @cpcloud is ok.

@cpcloud

Member

cpcloud commented Feb 19, 2018

@DiegoAlbertoTorres @jreback

I don't understand why the complexity of this is needed. As a possible strawman, what's wrong with doing the validation of the table in the operation itself, like we currently do with joins?

Column rules are convenient because most operations are column operations and there are a ton of them, but there are very few operations that take tables as input and if they do, they are (and should be) unrestricted. Sure, we may want to write more operations that take tables as input, but there aren't so many that having a generic table rule this complex is warranted.

How about this?

We define the following classmethod on the Table object:

def with_column_subset(...):
    """must have at least these columns"""

and that's it. I think we can get pretty far without things like allow_extra and satisfying. We should start simple and go from there.

I don't think there are any operations (yet) where something like satisfying is needed. If something like that comes up, we'll add the functionality then.

Sound good?

@jreback

Contributor

jreback commented Feb 19, 2018

i agree this is more complex than needed atm
but the core operation we need to write is

a table that for example

  • has a column named foo of type time
  • has a column of type int32
  • has one or more double columns
@cpcloud

Member

cpcloud commented Feb 19, 2018

I think the use case you have in mind for the third case--one or more double columns--can be solved by giving it a different API than one that would require such a rule.

The with_column_subset method would cover cases 1 and 2 (unnamed and named columns with a particular type).

@cpcloud

Member

cpcloud commented Feb 19, 2018

The API with this pared down version would be:

rules.table.with_column_subset([
    rules.column(name='foo', value_type=dt.timestamp),
    rules.column(value_type=dt.int32)
])

If what you meant by an int32 column was "exactly one int32 column", that would be another method, exactly_one_of(rules.column(value_type=dt.int32)). Maybe we define an and_ method that allows you to compose them.
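
One way the composition could look, sketched in plain Python rather than against ibis: TableRule is a hypothetical wrapper, the schema is a dict of column name to type name, and columns are given as (name_or_None, type_name) pairs. For brevity this version does not require unnamed rules to match distinct columns.

```python
class TableRule(object):
    """Hypothetical wrapper around a predicate over a {column: type} dict."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, schema):
        return self.predicate(schema)

    def and_(self, other):
        # Conjunction: both rules must hold for the same table.
        return TableRule(lambda schema: self(schema) and other(schema))


def with_column_subset(columns):
    """The table must have at least these columns."""
    def check(schema):
        for name, type_name in columns:
            if name is None:
                # Unnamed rule: any column of this type will do.
                if type_name not in schema.values():
                    return False
            elif schema.get(name) != type_name:
                return False
        return True
    return TableRule(check)


def exactly_one_of(type_name):
    """The table must have exactly one column of this type."""
    return TableRule(
        lambda schema: list(schema.values()).count(type_name) == 1)


rule = with_column_subset(
    [('foo', 'timestamp'), (None, 'int32')]
).and_(exactly_one_of('int32'))
```

Here rule accepts a table with a timestamp column foo and a single int32 column, and rejects one with two int32 columns.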

@jreback

Contributor

jreback commented Feb 19, 2018

the key is i need a conjunction of all of these in a particular table ; separately they r straightforward

@DiegoAlbertoTorres

Contributor

DiegoAlbertoTorres commented Feb 28, 2018

Having with_column_subset and exactly_one_of and composing them sounds pretty much like having schema and allow_extra. Maybe we just want to keep these two and do away with satisfying?

@DiegoAlbertoTorres DiegoAlbertoTorres force-pushed the DiegoAlbertoTorres:rules branch from 9a21f1a to 5aa137f Mar 1, 2018

@jreback

lgtm. some minor comments

@@ -26,7 +26,7 @@ dependencies:
   - requests
   - six
   - sqlalchemy>=1.0.0,<1.1.15
-  - thrift
+  - thrift>=0.10.0,<0.11.0

@jreback

jreback Mar 7, 2018

Contributor

is this an orthogonal change?

@cpcloud

cpcloud Mar 9, 2018

Member

Actually @DiegoAlbertoTorres something is strange, since @jreback's code is passing with this commit: 60c5806. You shouldn't need to update this since he didn't have to.

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 8, 2018

Contributor

Yeah, tests were refusing to pass (and still are).

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Got rid of the pin.

    ([('group', dt.int64),
      ('value', dt.double),
      ('value2', dt.double)], False)])
def test_table_with_schema(schema, raises):

@jreback

jreback Mar 7, 2018

Contributor

do you have an example of using with_column_subset with a rule that is NOT a column? (should raise)

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 8, 2018

Contributor

Just added one.

@DiegoAlbertoTorres

Contributor

DiegoAlbertoTorres commented Mar 8, 2018

CI passes!!

def _validate(self, args, i):
    arg = args[i]
    if isinstance(arg, ir.TableExpr):

@cpcloud

cpcloud Mar 9, 2018

Member

Let's raise if it's not a table as the first thing, then we can remove a level of indentation.

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

I do this now, I think you commented on an outdated diff?

if not self.allow_extra:
    # Count rules
    rules = [x for x in self.schema if isinstance(x, Argument)]
    if ((len(rules) != 0) and

@cpcloud

cpcloud Mar 9, 2018

Member

You can just say if rules here

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Outdated diff, this is gone!

    column = arg[column_rule.name]
except IbisTypeError:
    if column_rule.optional:
        break

@cpcloud

cpcloud Mar 9, 2018

Member

I'm not sure I follow the logic of using break here. This will exit the loop. Don't we want to check additional column rules that may not be optional?

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

This is gone too

rules = [x for x in self.schema if isinstance(x, Argument)]
if ((len(rules) != 0) and
        (len(arg.columns) > rules_matched)):
    raise IbisTypeError('Extra columns not allowed.')

@cpcloud

cpcloud Mar 9, 2018

Member

How about "Additional columns beyond {} not allowed".format(list of names)?
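
A small sketch of what such a message could look like, in plain Python. check_no_extra_columns is a hypothetical helper and ValueError stands in for IbisTypeError; neither is taken from the PR.

```python
def check_no_extra_columns(table_columns, allowed_names):
    """Raise if the table has columns that are not in allowed_names."""
    allowed = list(allowed_names)
    extra = [name for name in table_columns if name not in allowed]
    if extra:
        # Name both the permitted columns and the offending ones.
        raise ValueError(
            'Additional columns beyond {} not allowed: {}'.format(
                allowed, extra))
```

Listing the offending columns alongside the permitted ones tells the caller what to drop without having to inspect the table.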

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

And this

def __str__(self):
    fields = {'name', 'doc', 'optional'}
    return str({k: v for k, v in self.__dict__.items()
                if (k in fields) and (v is not None)})

@cpcloud

cpcloud Mar 9, 2018

Member

Don't need parens here.

@DiegoAlbertoTorres
@@ -801,3 +807,120 @@ def _validate(self, args, i):
def comparable(left, right):
    return ir.castable(left, right) or ir.castable(right, left)
@six.add_metaclass(abc.ABCMeta)

@cpcloud

cpcloud Mar 9, 2018

Member

Do this as:

class TableColumnValidator(six.with_metaclass(abc.ABCMeta, object)):
   ...

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Fixed this like we said in person.

if column_rule.optional:
    break
else:
    raise IbisTypeError(

@cpcloud

cpcloud Mar 9, 2018

Member

This whole thing can be written much more clearly:

if not column_rule.optional and column_rule.name not in arg:
    raise IbisTypeError(...)

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Still have to skip if optional. Will do:

if column_rule.name not in arg:
    if column_rule.optional:
        continue
    else:
        raise IbisTypeError(
             'No column with name {}.'.format(column_rule.name))
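
A self-contained sketch of why continue is the right keyword here, in plain Python. ColumnRule and MissingColumnError are hypothetical stand-ins for the column rule objects and IbisTypeError. With break, a missing optional column would end the loop and any required rule after it would never be checked.

```python
from collections import namedtuple

# Stand-in for a column rule: just a name and an optional flag.
ColumnRule = namedtuple('ColumnRule', ['name', 'optional'])


class MissingColumnError(Exception):
    """Stand-in for IbisTypeError."""


def present_rules(rules, columns):
    """Return the rules whose columns exist, raising on a missing required one."""
    present = []
    for rule in rules:
        if rule.name not in columns:
            if rule.optional:
                # Skip this rule only; later rules still get checked.
                continue
            raise MissingColumnError(
                'No column with name {}.'.format(rule.name))
        present.append(rule)
    return present
```

For rules [optional 'maybe', required 'value'] and a table with only a 'group' column, this raises for 'value'; a break on 'maybe' would have let the table through.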

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Did that ^

# Check that columns match the schema first
for column_rule in self.rules:
    # Members of a schema are arguments with a name
    if not isinstance(column_rule, Argument):

@cpcloud

cpcloud Mar 9, 2018

Member

There's a thing that makes stuff into Argument instances somewhere so you shouldn't have to do this. It's very hard to get to the meat of the code here because of all the checking.

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Do you mean _to_argument? It seems like most argument subclasses call _to_argument directly, so we wouldn't even get it for free if I made SubsetValidator a child of Argument. Also, we always want to take an Argument here (can't think of anything we may want to upcast), so I think it is a good idea to check it explicitly.

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Although I agree it is hard to get to the meat of the code with all the validation. I will split the rule checking from the table validation.

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Ok, I made a new function to validate rules. Also it is now called in the constructor, not during validation.
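
Roughly, the split could look like the sketch below. This is plain Python under stated assumptions: the Argument class here is a minimal stand-in, and SubsetValidatorSketch and _check_rules are hypothetical names, not the code in the PR.

```python
class Argument(object):
    """Minimal stand-in for the rules Argument base class."""

    def __init__(self, name=None, optional=False):
        self.name = name
        self.optional = optional


class SubsetValidatorSketch(object):
    def __init__(self, rules):
        # Bad rules fail here, when the validator is built,
        # rather than later when a table is validated.
        self.rules = self._check_rules(rules)

    @staticmethod
    def _check_rules(rules):
        rules = list(rules)
        for rule in rules:
            if not isinstance(rule, Argument):
                raise TypeError(
                    'Expected an Argument, got {!r}'.format(rule))
        return rules

    def validate(self, columns):
        # Only table checks remain; the rules are already known to be valid.
        return all(
            rule.optional or rule.name is None or rule.name in columns
            for rule in self.rules)
```

Checking the rules once in the constructor keeps validate down to the table checks themselves.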

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

(I think this is the only outstanding-ish comment).

... output_type = rules.type_of_arg(0)
"""
def __init__(self, name=None, optional=False, schema=None, doc=None,
             validator=None, allow_extra=False, **arg_kwds):

@cpcloud

cpcloud Mar 9, 2018

Member

Do we still need allow_extra? I thought we had decided to get rid of that.

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

:( forgot to remove it here srry

@DiegoAlbertoTorres

DiegoAlbertoTorres Mar 12, 2018

Contributor

Removed it. 👍

DiegoAlbertoTorres added some commits Mar 12, 2018

@cpcloud cpcloud closed this Mar 13, 2018

@cpcloud cpcloud reopened this Mar 13, 2018

@cpcloud cpcloud closed this in a885e70 Mar 13, 2018

@cpcloud

Member

cpcloud commented Mar 13, 2018

@DiegoAlbertoTorres Merged! Thanks!
