Skip to content

SIMULATE...GIVEN syntax is unlike SELECT...WHERE and may cause unintended errors #605

@versar

Description

@versar

The problem is that using the word AND to express multiple constraints in a SIMULATE query is incorrect, however that syntax is correct in a SELECT query so a user may use it by accident. When this happens, no error is returned. The query runs normally. However, the wrong results are produced.

AND works in SELECT queries. In specific, the BQL or SQL expression
1: %bql SELECT var1, var2 FROM t WHERE "var1"=val1 AND "var2"=val2
is interpreted as:
2: %bql SELECT var1, var2 FROM t WHERE ("var1"=val1) AND ("var2"=val2)

However AND does not work normally in SIMULATE queries. In specific, the BQL expression
3: %bql SIMULATE var1, var2 FROM p WHERE "var1"=val1 AND "var2"=val2 LIMIT 10
is interpreted as:
4: %bql SIMULATE var1, var2 FROM p WHERE "var1" = (val1 AND "var2" = val2) LIMIT 10

Therefore, if a user invokes query 3 above, they will not get the answer they expect. Currently, the correct way to write multiple constraints in a SIMULATE expression is for them to be separated by commas, not by AND. Therefore the user will only get the right answer by using the syntax below:
5: %bql SIMULATE var1, var2 FROM p WHERE "var1"=val1, "var2"=val2 LIMIT 10

This is a problem because a user (like me!) who believes naturally that SELECT and SIMULATE have parallel syntax is susceptible to accidentally use AND with SIMULATE instead of commas. The bug is especially insidious because the result will silently be wrong, without an error. And, the results will look almost reasonable -- if the constraints are ignored, the variable will usually just be simulated at values near its overall mean. This is a trap for the unwary, especially when the incorrect results propagate through the rest of an analysis.

Possible resolutions:

  1. Return a syntax error when the compiler sees a SIMULATE query with multiple ANDs instead of commas.
  2. Compile an SIMULATE query with multiple ANDs to interpret the query as if it has commas.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions