Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-20953: Add support for range selection to gen3 expression parser #183

Merged
merged 4 commits into from Aug 21, 2019

Conversation

andy-slac
Copy link
Contributor

This adds range literal support inside IN() expressions and adds
translation of that syntax to sqlalchemy. With narrow range it just adds
all its elements to IN() list, if list becomes long (longer than 1000
items) then additional BETWEEN ... AND MOD() are generated.

Copy link
Contributor

@natelust natelust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few things that I think need clarified before this gets merged, but I am confident that they can be discussed and hammered out.

Start value of a range.
stop : `int`
End value of a range, inclusive, same or higher than ``start``.
stride : `int` or `None`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree None is a useful value for stride, especially as the constructor already has a default value. From the perspective of someone reading the code and or documentation, I am not sure how to interpret what None would mean in a stride. I'm not opposed to it if you had something specific in mind, but if so can you document what it would mean to have a None stride?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would it mean if stride was zero? Should we test and disallow it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I need to be more consistent about stride=1 vs None. I want to keep None for case when stride was not specified just in case some consumers of the parsed expression want to know exact syntax of the original expression (for whatever reason). Substitution of None to 1 should happen on the consumer side.I'll update code for consistency and also add stride=0 check, that can be done at parsing stage, there is no useful case for stride=0.

def __str__(self):
res = f"{self.start}..{self.stop}"
if self.stride is not None:
res += f":{self.stride}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you keep None as an option maybe write
res = f"{self.start}..{self.stop}{f':{self.stride}' if self.stride else ''}
to avoid boiler plate and or string addition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is less readable, too much f-string nesting. I'll do something like:

res = f"{self.start}..{self.stop}" + f":{self.stride}" if self.stride else ""

@@ -38,6 +38,11 @@
# Local non-exported definitions --
# ----------------------------------

_RE_RANGE = re.compile(r"(-?\d+)\s*\.\.\s*(-?\d+)(\s*:\s*(\d+))?")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really want to support an arbitrary number of spaces between the elements? (In practice it probably will never be a problem, just trying to cover bases)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I want to allow one space as I think some people could find it more readable. From a typical parser perspective one space and more than one space is the same thing, the only difference is between allowing no space or any number of spaces (TABS/newlines).

# range literal in format N..M[:S], spaces allowed
# docstring must match _RE_RANGE regexp
def t_RANGE_LITERAL(self, t):
r"""(-?\d+)\s*\.\.\s*(-?\d+)(\s*:\s*(\d+))?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this must match, would it be a good idea to stick it in with string formatting so that it for sure stays the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really wanted to have this regex in the docstring so I can see it right here. I agree that there should not be a repetition, I'll do something about it.

match = _RE_RANGE.match(t.value)
start = int(match.group(1))
stop = int(match.group(2))
stride = match.group(4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is match group 4, I presume so you can have a group to make optional. Would it make sense to make that a non-capturing group so that these lines could just be 1,2,3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably switch to named groups, counting parens is indeed tedious.

def p_literal_range(self, p):
""" literal : RANGE_LITERAL
"""
p[0] = RangeLiteral(*p[1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this may be outside the scope of this ticket as it relates to past work, but magic unpacking of tuples/lists? by indexing makes me nervous for maintainability. Could you at least add either in the doc string or a code comment explaining what the structure of p is expected to be, so if it changes for some reason in the future, someone has some incite into debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment here, but this is really thing that repeats for every method. p is a list of parser symbols, symbols correspond to the rule being processed (as it appears in a grammar in docstring). p[0] is the result (literal in this case) p[1] is the value of RANGE_LITERAL token. The value of RANGE_LITERAL is what lexer assigns to it, in this case I know that lexer assigns three-tuple of (start, stop, stride).
I know this is hard to read but I do not expect many people reading this code, and those who want to understand what's going on should also read PLY documentation first.

# just return a triple and let parent clauses handle it
if stride is None:
stride = 1
return (start, stop, stride)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you keep None as an options maybe just
return (start, stop, stride or 1)

if not_in:
return lhs.notin_(values)
clauses = [not_(expr) for expr in clauses]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems a shame to need to have a loop over the len(ranges_remain) twice in the case of not_in just to add the not expression. Consider putting the if not_ also in the above loop, something like:
clauses.append(not_(expr) if not_in else expr)
Or of course the converse Boolean logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced it with

    expr = or_(*clauses)
    if not_in:
        expr = not_(expr)

it make slightly different expression structure but it should be logically equivalent indeed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the change you made. The change I was thinking of was having two if not_in blocks, one that is the returns, and the other inside where you are creating the expr for the first time. I think that would produce exactly the same thing you had, but with one less loop over all the strides.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, you do not like two loops there? I guess this is how I first thought of it and it ended up in implementation. I don't see big issue with iterating twice over ranges, should not be any overhead compared to constructing all expressions. Still I can merge them into this if you prefer:

    literals, ranges = [], []
    for item in values:
        if isinstance(item, tuple):
            ranges.append(item)
        else:
            literals.append(item)

    clauses = []
    for start, stop, stride in ranges:
        count = (stop - start + 1) // stride
        if len(literals) + count > max_in_items:
            # X BETWEEN START AND STOP AND MOD(X, STRIDE) = MOD(START, STRIDE)
            expr = lhs.between(start, stop)
            if stride != 1:
                expr = and_(expr, (lhs % stride) == (start % stride))
            clauses.append(expr)
        else:
            # add all values to literal list, stop is inclusive
            literals += [literal(value) for value in range(start, stop+1, stride)]
    
    if literals:
        # add IN() in front of BETWEENs
        clauses.insert(0, lhs.in_(literals))

    expr = or_(*clauses)
    if not_in:
        expr = not_(expr)

    return expr

I prefer to handle not_in in one place, makes it clearer what is going on (at least for myself)

@andy-slac
Copy link
Contributor Author

Committed all fixes and rebased to latest master.

@andy-slac
Copy link
Contributor Author

@natelust, I think I answered all your concern (hopefully), is it OK to merge it?

Copy link
Contributor

@natelust natelust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only remaining concern that I may have just missed you address is a check if someone puts a zero for a stride

@andy-slac
Copy link
Contributor Author

That is done now at parser/lexer level:

_RE_RANGE = r"(?P<start>-?\d+)\s*\.\.\s*(?P<stop>-?\d+)(\s*:\s*(?P<stride>[1-9]\d*))?"

Regex will only match non-zero stride and will give a syntax error for 0 or negative number. I tried to find a better way to produce reasonable message at parser level but it gets complicated. Alternatively I could allow 0 in lexer and move this check to queryBuilder but that is also non-ideal as it does not have enough context to point to exact location of a region in expression (and I prefer not to delay handling of errors until very late). Error reporting from parser may not be perfect now but we can look at it later and try to improve it if people complain about it.

@natelust
Copy link
Contributor

Yep I did miss that then, I was not expecting it handled that way but I really like it. Good to go

I decided to handle ranges at lexer level rather than parser level, that
simplifies handling of the possible ambiguities with '..' mixed with
numbers.
This adds range literal support inside IN() expressions and adds
translation of that syntax to sqlalchemy. With narrow range it just adds
all its elements to IN() list, if list becomes long (longer than 1000
items) then additional BETWEEN ... AND MOD() are generated.
@andy-slac andy-slac merged commit 6c4764b into master Aug 21, 2019
@timj timj deleted the tickets/DM-20953 branch April 22, 2020 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants