Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added rules for validating that tables have certain columns and for c… #1298

Closed

Conversation

DiegoAlbertoTorres
Copy link
Contributor

…olumn type checking.

@DiegoAlbertoTorres DiegoAlbertoTorres changed the title WIP: Added rules for validating that tables have certain columns and for c… Added rules for validating that tables have certain columns and for c… Jan 29, 2018
Parameters
----------
satisfying : iterable
An iterable of column rules.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah crap, have to finish this docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. 👍

self.doc = doc
self.validator = validator

def _validate(self, args, i):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud I just realized that with the current implementation a table can pass validation simply by having enough columns to pass the rules. We can of course control the amount of columns with lambda rules, but this might be counterintuitive. Maybe we should have an argument like allow_extra, which changes the behavior for allowing columns that do not match a rule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this.

An iterable of lambda expressions, which will be used to validate
arguments. Each lambda expression must take a table as its single
argument, validate, and then return ``True`` for tables that
pass validation and ``False`` otherwise.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to have satisfying or schema (or both as you have them now)?

iow what if i don’t care about a specific schema
i could validate in satisfying (or do some general rule inside that)

@DiegoAlbertoTorres
Copy link
Contributor Author

I cleaned up the example a bit. Once tests pass I think this will be ready.

validator : ???
???
doc : ???
???
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud I couldn't figure what to do with these (the validator and doc args). I don't use them but stuff breaks if I don't set them to None. Maybe I could set them to None manually (without turning them into args), but first I want to make sure that it makes sense. Do you know how these are used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc lets you give a docstring to the argument:

In [1]: class Foo(ibis.expr.operations.ValueOp):
   ...:     input_type = [ibis.expr.rules.double(name='foo', doc='bar bar double')]
   ...: 

In [2]: x = Foo(1)

In [3]: x.a?
Object `x.a` not found.

In [4]: x.foo?
Type:        property
String form: <property object at 0x7fedee9b2278>
Docstring:   bar bar double

Copy link
Member

@cpcloud cpcloud Feb 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like validators let you pass in your own validation function, see line 239 in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Should we leave it? Can't think of any use case for it, if users want a custom validator they should just write one, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump @cpcloud

arguments. Each lambda expression must take a table as its single
argument, validate, and then return ``True`` for tables that
pass validation and ``False`` otherwise.
schema : iterable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark as optional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

The name of the table argument.
optional : bool
Whether this table argument is optional or not.
satisfying : iterable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark as optional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

MyOp(123)


def test_table_invalid_satisfying():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test that doesn't pass either satisfying or schema

Copy link
Contributor Author

@DiegoAlbertoTorres DiegoAlbertoTorres Feb 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were already some tests, but I just added more.

@cpcloud cpcloud added this to the 0.13 milestone Feb 9, 2018
@jreback jreback added the feature Features or general enhancements label Feb 11, 2018
# Check column schema
rules_matched = 0
for column_rule in self.schema:
if not isinstance(column_rule, Argument):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments on the various sections on what is going on / what you are checking

... lambda t: t.schema().types.count(dt.Int64()) >= 2],
... )]
... output_type = rules.type_of_arg(0)
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a Notes section about satisfying, schema, allow_extra can be mixed / matched to produce a validator and not all are needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Op(table)


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would separate the tests that raise from the ones that don't and don't use check_op_input at all. Much more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that and it was a mess, because the same schemas would be repeated twice, once for tests passing validation and those not passing.

I think it is waaay clearer to say "given this schema, here are things that pass and things that don't pass".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would they be repeated? There's a set of schemas for passing, and one for failing. This should definitely be separated into two tests: test_table_with_schema and test_table_with_invalid_schema. The True/False is hard to follow. If the name of the function indicates what's supposed to happen, then it's very clear what the test is testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't explain this right. Indeed, the schemas for the tables being validated would not be repeated. Bu the schema we are comparing against will be repeated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would end up looking as:

@pytest.mark.parametrize(
    "... many passing test cases here ...")
def test_table_with_schema():
    "<validation code with a schema>"

@pytest.mark.parametrize(
    "... failing test cases here ...")
def test_table_with_schema_failing():
    with pytest.raises(IbisTypeError):
        "<same validation code with the same schema>"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, did it like that. Its not so bad.

@DiegoAlbertoTorres
Copy link
Contributor Author

@jreback bump

@jreback
Copy link
Contributor

jreback commented Feb 16, 2018

lgtm. needs a rebase. if @cpcloud is ok.

@cpcloud
Copy link
Member

cpcloud commented Feb 19, 2018

@DiegoAlbertoTorres @jreback

I don't understand why the complexity of this is needed. As a possible strawman, what's wrong with doing the validation of the table in the operation itself, like we currently do with joins?

Column rules are convenient because most operations are column operations and there are a ton of them, but there are very few operations that take tables as input and if they do, they are (and should be) unrestricted. Sure, we may want to write more operations that takes tables as input, but there aren't so many that having a generic table rule this complex is warranted.

How about this?

We define the following classmethod on the Table object:

def with_column_subset(...):
    """must have at least these columns"""

and that's it. I think we can get pretty far without things like allow_extra and satisfying. We should start simple and go from there.

I don't think there are any operations (yet) where something like satisfying is needed. If something like that comes up, we'll add the functionality then.

Sound good?

@jreback
Copy link
Contributor

jreback commented Feb 19, 2018

i agree this is more complex than needed atm
but the core operation we need to write is

a table that for example

  • has a column named foo of type time
  • has a column of type int32
  • has at one or more double columns

@cpcloud
Copy link
Member

cpcloud commented Feb 19, 2018

I think the use case you have in mind for the third case--one or more double columns--can be solved by giving it a different API than one that would require such a rule.

The with_column_subset method would cover cases 1 and 2 (unnamed and named columns with a particular type).

@cpcloud
Copy link
Member

cpcloud commented Feb 19, 2018

The API with this pared down version would be:

rules.table.with_column_subset([
    rules.column(name='foo', value_type=dt.timestamp),
    rules.column(value_type=int32)
])

If what you meant by an int32 column was "exactly one int32 column", that would be another method exactly_one_of(rules.column(value_type=dt.int32)), maybe we define an and_ method that allows you to compose them.

@jreback
Copy link
Contributor

jreback commented Feb 19, 2018

the key is i need a conjunction of all of these in a particular table ; separately they r straightforward

@DiegoAlbertoTorres
Copy link
Contributor Author

Having with_column_subset and exactly_one_of and composing them sounds pretty much like having schema and allow_extra. Maybe we just want to keep these two and do away with satisfying?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. some minor comments

@@ -26,7 +26,7 @@ dependencies:
- requests
- six
- sqlalchemy>=1.0.0,<1.1.15
- thrift
- thrift>=0.10.0,<0.11.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this orthogonal change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually @DiegoAlbertoTorres something is strange, since @jreback's code is passing with this commit: 60c5806. You shouldn't need to update this since he didn't have to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, tests were refusing to pass (and still are).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got rid of the pin.

([('group', dt.int64),
('value', dt.double),
('value2', dt.double)], False)])
def test_table_with_schema(schema, raises):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have an example of using with_coumn_subset with a rules that is NOT a column? (should raise)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added one.

@DiegoAlbertoTorres
Copy link
Contributor Author

CI passes!!


def _validate(self, args, i):
arg = args[i]
if isinstance(arg, ir.TableExpr):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let raise if it's not a table as the first thing, then we can remove a level of indentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do this now, I think you commented on an outdated diff?

if not self.allow_extra:
# Count rules
rules = [x for x in self.schema if isinstance(x, Argument)]
if ((len(rules) != 0) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just say if rules here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outdated diff, this is gone!

column = arg[column_rule.name]
except IbisTypeError:
if column_rule.optional:
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow the logic of using break here. This will exit the loop. Don't we want to check additional column rules that may not be optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is gone too

rules = [x for x in self.schema if isinstance(x, Argument)]
if ((len(rules) != 0) and
(len(arg.columns) > rules_matched)):
raise IbisTypeError('Extra columns not allowed.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "Additional columns beyond {} not allowed".format(list of names)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this

@@ -26,7 +26,7 @@ dependencies:
- requests
- six
- sqlalchemy>=1.0.0,<1.1.15
- thrift
- thrift>=0.10.0,<0.11.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually @DiegoAlbertoTorres something is strange, since @jreback's code is passing with this commit: 60c5806. You shouldn't need to update this since he didn't have to.

def __str__(self):
fields = {'name', 'doc', 'optional'}
return str({k: v for k, v in self.__dict__.items()
if (k in fields) and (v is not None)})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need parens here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -801,3 +807,120 @@ def _validate(self, args, i):

def comparable(left, right):
return ir.castable(left, right) or ir.castable(right, left)


@six.add_metaclass(abc.ABCMeta)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this as:

class TableColumnValidator(six.with_metaclass(object, abc.ABCMeta)):
   ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this like we said in person.

Op(table)


@pytest.mark.parametrize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would they be repeated? There's a set of schemas for passing, and one for failing. This should definitely be separated into two tests: test_table_with_schema and test_table_with_invalid_schema. The True/False is hard to follow. If the name of the function indicates what's supposed to happen, then it's very clear what the test is testing.

if column_rule.optional:
break
else:
raise IbisTypeError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole thing can be written much more clearly:

if not column_rule.optional and column_rule.name not in arg:
    raise IbisTypeError(...)

Copy link
Contributor Author

@DiegoAlbertoTorres DiegoAlbertoTorres Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still have to skip if optional. Will do:

if column_rule.name not in arg:
    if column_rule.optional:
        continue
    else:
        raise IbisTypeError(
             'No column with name {}.'.format(column_rule.name))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that ^

# Check that columns match the schema first
for column_rule in self.rules:
# Members of a schema are arguments with a name
if not isinstance(column_rule, Argument):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a thing that makes stuff into Argument instances somewhere so you shouldn't have to do this. It's very hard to get to the meat of the code here because of all the checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean _to_argument? It seems like most argument subclasses call _to_argument directly, so we wouldn't even get it for free if I made SubsetValidator a child of Argument. Also, we always want to take an Argument here (can't think of anything we may want to upcast), so I think it is a good idea to check it explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I agree it is hard to get to the meat of the code with all the validation. I will split the rule checking from the table validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I made a new function to validate rules. Also it is now called in the constructor, not during validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think this is the only outstanding-ish comment).

... output_type = rules.type_of_arg(0)
"""
def __init__(self, name=None, optional=False, schema=None, doc=None,
validator=None, allow_extra=False, **arg_kwds):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need allow_extra? I thought we had decided to get rid of that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:( forgot to remove it here srry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it. 👍

@cpcloud cpcloud closed this Mar 13, 2018
@cpcloud cpcloud reopened this Mar 13, 2018
@cpcloud cpcloud closed this in a885e70 Mar 13, 2018
@cpcloud
Copy link
Member

cpcloud commented Mar 13, 2018

@DiegoAlbertoTorres Merged! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants