Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: many Operations can't handle NULL optional arguments #8833

Open
1 task done
Tracked by #8996
NickCrews opened this issue Mar 30, 2024 · 7 comments
Open
1 task done
Tracked by #8996

bug: many Operations can't handle NULL optional arguments #8833

NickCrews opened this issue Mar 30, 2024 · 7 comments
Labels
breaking change Changes that introduce an API break at any level bug Incorrect behavior inside of ibis

Comments

@NickCrews
Copy link
Contributor

What happened?

consider the implementation of

@public
class Substring(Value):
    arg: Value[dt.String]
    start: Value[dt.Integer]
    length: Optional[Value[dt.Integer]] = None

This allows you to leave out the length argument, meaning "until the end of the string". But this is only represented on the python side. If the length is evaluated at runtime to be NULL, then this errors on some backends, like postgres, or gives the wrong result on duckdb.

For example, ibis.literal("abcde").substr(2, ibis.literal(1).nullif(1)) results in NULL in duckdb, but I would expect it to be "cde".

During compilation, we are naive and do the NULL checking only on the python side:

def visit_Substring(self, op, *, arg, start, length):
        start += 1
        arg_length = self.f.length(arg)

        if length is None:
            return self.if_(
                start >= 1,
                self.f.substring(arg, start),
                self.f.substring(arg, start + arg_length),
            )
        return self.if_(
            start >= 1,
            self.f.substring(arg, start, length),
            self.f.substring(arg, start + arg_length, length),
        )

I discovered this when adding more tests to #8832 .

I think what we should do is make Substring more like

@public
class Substring(Value):
    arg: Value[dt.String]
    start: Value[dt.Integer]
    length:Value[dt.Integer] = ibis.null()

and then use sql CASE statements if we can't determine the nullness statically.

but I wasn't able to get this to work. Does this seem like the right direction? Any tips on what to do here?

What version of ibis are you using?

main

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Mar 30, 2024
@cpcloud
Copy link
Member

cpcloud commented Mar 30, 2024

None indicates that the argument wasn't provided, NULL is something else. We should avoid conflating these two uses.

@cpcloud
Copy link
Member

cpcloud commented Mar 30, 2024

The failures notwithstanding, the duckdb behavior looks correct to me in this case: null inputs produce null outputs, NULL doesn't mean "no argument".

The representation of "argument not provided" is unrelated to whether an input is NULL.

@NickCrews
Copy link
Contributor Author

hmmm, I think you are right. We are victims of the SQL standard here. I don't really like this footgun:

  • ibis.literal("abcde").substr(2, None) gives "cde", but
  • ibis.literal("abcde").substr(2, ibis.null()) gives NULL
    The only way to get around this would be if we did a length = length.fillnull(arg.length()) during compilation, but then this would be unexpected for more experienced SQL users, and less performant. Maybe just better documentation? This same thing exists for many other ops (eg StringFind with the optional start and end), but probably these are less used.

@cpcloud
Copy link
Member

cpcloud commented Apr 1, 2024

I think there are other approaches to the problem. I'll state the problem so that we have it written somewhere.

Ibis uses None for two things: 1) the default value for optional arguments and 2) a value that users can use for SQL's NULL.

Users explicitly passing None expect it to behave like they explicitly passed ibis.null() or an equivalent such as ibis.NA.

Since None is being co-opted for use as the "no-argument" sentinel value, Ibis interprets it that way instead of as NULL, leading to a divergence in behavior between the user-facing meaning of None versus how we use it internally.

@cpcloud
Copy link
Member

cpcloud commented Apr 1, 2024

One approach that will break user code, but is probably the right approach IMO is to use some other sentinel value to mean "argument wasn't provided".

Bit of a hack, but fairly common is to create a dummy object like NO_ARGUMENT = object(), set that to the default value wherever None is used, and then use arg is NO_ARGUMENT to check for whether an argument was priovided.

@NickCrews
Copy link
Contributor Author

I find the NO_ARGUMENT solution fairly good, I think it might be worth going through a deprecation cycle to get this right.

@NickCrews
Copy link
Contributor Author

wait, would this then mean that for ibis.literal("abcde").substr(2, None), the None gets cast to ibis.null() and the result is therefore NULL? I think this actually might be an even worse footgun than what we have now.

@ncclementi ncclementi added the breaking change Changes that introduce an API break at any level label May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change Changes that introduce an API break at any level bug Incorrect behavior inside of ibis
Projects
Status: backlog
Development

No branches or pull requests

3 participants