DM-20953: Add support for range selection to gen3 expression parser #183

andy-slac · 2019-08-19T21:14:27Z

This adds range literal support inside IN() expressions and adds
translation of that syntax to sqlalchemy. With narrow range it just adds
all its elements to IN() list, if list becomes long (longer than 1000
items) then additional BETWEEN ... AND MOD() are generated.

natelust

There are a few things that I think need clarified before this gets merged, but I am confident that they can be discussed and hammered out.

natelust · 2019-08-20T13:51:11Z

python/lsst/daf/butler/exprParser/exprTree.py

+        Start value of a range.
+    stop : `int`
+        End value of a range, inclusive, same or higher than ``start``.
+    stride : `int` or `None`


I'm not sure I agree None is a useful value for stride, especially as the constructor already has a default value. From the perspective of someone reading the code and or documentation, I am not sure how to interpret what None would mean in a stride. I'm not opposed to it if you had something specific in mind, but if so can you document what it would mean to have a None stride?

What would it mean if stride was zero? Should we test and disallow it?

Indeed I need to be more consistent about stride=1 vs None. I want to keep None for case when stride was not specified just in case some consumers of the parsed expression want to know exact syntax of the original expression (for whatever reason). Substitution of None to 1 should happen on the consumer side.I'll update code for consistency and also add stride=0 check, that can be done at parsing stage, there is no useful case for stride=0.

natelust · 2019-08-20T13:57:53Z

python/lsst/daf/butler/exprParser/exprTree.py

+    def __str__(self):
+        res = f"{self.start}..{self.stop}"
+        if self.stride is not None:
+            res += f":{self.stride}"


If you keep None as an option maybe write
res = f"{self.start}..{self.stop}{f':{self.stride}' if self.stride else ''}
to avoid boiler plate and or string addition

I think this is less readable, too much f-string nesting. I'll do something like:

res = f"{self.start}..{self.stop}" + f":{self.stride}" if self.stride else ""

natelust · 2019-08-20T14:03:29Z

python/lsst/daf/butler/exprParser/parserLex.py

@@ -38,6 +38,11 @@
 #  Local non-exported definitions --
 # ----------------------------------

+_RE_RANGE = re.compile(r"(-?\d+)\s*\.\.\s*(-?\d+)(\s*:\s*(\d+))?")


do we really want to support an arbitrary number of spaces between the elements? (In practice it probably will never be a problem, just trying to cover bases)

Well, I want to allow one space as I think some people could find it more readable. From a typical parser perspective one space and more than one space is the same thing, the only difference is between allowing no space or any number of spaces (TABS/newlines).

natelust · 2019-08-20T14:04:49Z

python/lsst/daf/butler/exprParser/parserLex.py

+    # range literal in format N..M[:S], spaces allowed
+    # docstring must match _RE_RANGE regexp
+    def t_RANGE_LITERAL(self, t):
+        r"""(-?\d+)\s*\.\.\s*(-?\d+)(\s*:\s*(\d+))?


If this must match, would it be a good idea to stick it in with string formatting so that it for sure stays the same?

I really wanted to have this regex in the docstring so I can see it right here. I agree that there should not be a repetition, I'll do something about it.

natelust · 2019-08-20T14:06:47Z

python/lsst/daf/butler/exprParser/parserLex.py

+        match = _RE_RANGE.match(t.value)
+        start = int(match.group(1))
+        stop = int(match.group(2))
+        stride = match.group(4)


I see that this is match group 4, I presume so you can have a group to make optional. Would it make sense to make that a non-capturing group so that these lines could just be 1,2,3?

I should probably switch to named groups, counting parens is indeed tedious.

natelust · 2019-08-20T14:10:14Z

python/lsst/daf/butler/exprParser/parserYacc.py

+    def p_literal_range(self, p):
+        """ literal : RANGE_LITERAL
+        """
+        p[0] = RangeLiteral(*p[1])


So this may be outside the scope of this ticket as it relates to past work, but magic unpacking of tuples/lists? by indexing makes me nervous for maintainability. Could you at least add either in the doc string or a code comment explaining what the structure of p is expected to be, so if it changes for some reason in the future, someone has some incite into debugging?

I'll add a comment here, but this is really thing that repeats for every method. p is a list of parser symbols, symbols correspond to the rule being processed (as it appears in a grammar in docstring). p[0] is the result (literal in this case) p[1] is the value of RANGE_LITERAL token. The value of RANGE_LITERAL is what lexer assigns to it, in this case I know that lexer assigns three-tuple of (start, stop, stride).
I know this is hard to read but I do not expect many people reading this code, and those who want to understand what's going on should also read PLY documentation first.

natelust · 2019-08-20T14:17:20Z

python/lsst/daf/butler/sql/queryBuilder.py

+        # just return a triple and let parent clauses handle it
+        if stride is None:
+            stride = 1
+        return (start, stop, stride)


If you keep None as an options maybe just
return (start, stop, stride or 1)

natelust · 2019-08-20T15:04:01Z

python/lsst/daf/butler/sql/queryBuilder.py

        if not_in:
-            return lhs.notin_(values)
+            clauses = [not_(expr) for expr in clauses]


it seems a shame to need to have a loop over the len(ranges_remain) twice in the case of not_in just to add the not expression. Consider putting the if not_ also in the above loop, something like:
clauses.append(not_(expr) if not_in else expr)
Or of course the converse Boolean logic

I replaced it with

expr = or_(*clauses) if not_in: expr = not_(expr)

it make slightly different expression structure but it should be logically equivalent indeed.

I'm not sure I understand the change you made. The change I was thinking of was having two if not_in blocks, one that is the returns, and the other inside where you are creating the expr for the first time. I think that would produce exactly the same thing you had, but with one less loop over all the strides.

Oh, I see, you do not like two loops there? I guess this is how I first thought of it and it ended up in implementation. I don't see big issue with iterating twice over ranges, should not be any overhead compared to constructing all expressions. Still I can merge them into this if you prefer:

literals, ranges = [], [] for item in values: if isinstance(item, tuple): ranges.append(item) else: literals.append(item) clauses = [] for start, stop, stride in ranges: count = (stop - start + 1) // stride if len(literals) + count > max_in_items: # X BETWEEN START AND STOP AND MOD(X, STRIDE) = MOD(START, STRIDE) expr = lhs.between(start, stop) if stride != 1: expr = and_(expr, (lhs % stride) == (start % stride)) clauses.append(expr) else: # add all values to literal list, stop is inclusive literals += [literal(value) for value in range(start, stop+1, stride)] if literals: # add IN() in front of BETWEENs clauses.insert(0, lhs.in_(literals)) expr = or_(*clauses) if not_in: expr = not_(expr) return expr

I prefer to handle not_in in one place, makes it clearer what is going on (at least for myself)

andy-slac · 2019-08-20T21:08:51Z

Committed all fixes and rebased to latest master.

andy-slac · 2019-08-20T23:47:48Z

@natelust, I think I answered all your concern (hopefully), is it OK to merge it?

natelust

My only remaining concern that I may have just missed you address is a check if someone puts a zero for a stride

andy-slac · 2019-08-21T00:44:37Z

That is done now at parser/lexer level:

daf_butler/python/lsst/daf/butler/exprParser/parserLex.py

Line 41 in 8644b45

    
           _RE_RANGE = r"(?P<start>-?\d+)\s*\.\.\s*(?P<stop>-?\d+)(\s*:\s*(?P<stride>[1-9]\d*))?"

Regex will only match non-zero stride and will give a syntax error for 0 or negative number. I tried to find a better way to produce reasonable message at parser level but it gets complicated. Alternatively I could allow 0 in lexer and move this check to queryBuilder but that is also non-ideal as it does not have enough context to point to exact location of a region in expression (and I prefer not to delay handling of errors until very late). Error reporting from parser may not be perfect now but we can look at it later and try to improve it if people complain about it.

natelust · 2019-08-21T01:05:41Z

Yep I did miss that then, I was not expecting it handled that way but I really like it. Good to go

I decided to handle ranges at lexer level rather than parser level, that simplifies handling of the possible ambiguities with '..' mixed with numbers.

This adds range literal support inside IN() expressions and adds translation of that syntax to sqlalchemy. With narrow range it just adds all its elements to IN() list, if list becomes long (longer than 1000 items) then additional BETWEEN ... AND MOD() are generated.

andy-slac force-pushed the tickets/DM-20953 branch from 9d1658c to 081fa2c Compare August 19, 2019 21:16

natelust approved these changes Aug 20, 2019

View reviewed changes

andy-slac force-pushed the tickets/DM-20953 branch from fa03a65 to 8644b45 Compare August 20, 2019 21:07

natelust reviewed Aug 21, 2019

View reviewed changes

andy-slac added 4 commits August 20, 2019 21:52

Add support for range literals to lexer.

25fa2cd

I decided to handle ranges at lexer level rather than parser level, that simplifies handling of the possible ambiguities with '..' mixed with numbers.

Add "from exception" to few raises in SqlRegistry

1c511e3

Address review comments

b35877c

andy-slac force-pushed the tickets/DM-20953 branch from 8644b45 to b35877c Compare August 21, 2019 02:52

andy-slac merged commit 6c4764b into master Aug 21, 2019

timj deleted the tickets/DM-20953 branch April 22, 2020 22:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DM-20953: Add support for range selection to gen3 expression parser #183

DM-20953: Add support for range selection to gen3 expression parser #183

andy-slac commented Aug 19, 2019

natelust left a comment

natelust Aug 20, 2019

natelust Aug 20, 2019

andy-slac Aug 20, 2019

natelust Aug 20, 2019

andy-slac Aug 20, 2019

natelust Aug 20, 2019

andy-slac Aug 20, 2019

natelust Aug 20, 2019

andy-slac Aug 20, 2019

natelust Aug 20, 2019

andy-slac Aug 20, 2019

natelust Aug 20, 2019

andy-slac Aug 20, 2019

natelust Aug 20, 2019

natelust Aug 20, 2019

andy-slac Aug 20, 2019

natelust Aug 20, 2019

andy-slac Aug 20, 2019

andy-slac commented Aug 20, 2019

andy-slac commented Aug 20, 2019

natelust left a comment

andy-slac commented Aug 21, 2019

natelust commented Aug 21, 2019

DM-20953: Add support for range selection to gen3 expression parser #183

DM-20953: Add support for range selection to gen3 expression parser #183

Conversation

andy-slac commented Aug 19, 2019

natelust left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andy-slac commented Aug 20, 2019

andy-slac commented Aug 20, 2019

natelust left a comment

Choose a reason for hiding this comment

andy-slac commented Aug 21, 2019

natelust commented Aug 21, 2019