Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Derive effective predicate for Values #743

Merged
merged 3 commits into from
May 21, 2019

Conversation

martint
Copy link
Member

@martint martint commented May 10, 2019

This allows Join inference to push the derived predicate
to the other side when one side is based on a Values node.

For a query like:

 SELECT 1
 FROM orders JOIN (values 'O', 'F') t(x) ON orders.orderstatus = t.x;

It changes the plan from:

  - Output[_col0]
    - Project[]
      - InnerJoin[("orderstatus" = "field")]
        - TableScan[tpch:orders:sf0.01]
            orderstatus := tpch:orderstatus
                 :: [[F], [O], [P]]
        - Values
             ('O')
             ('F')

to:

  - Output[_col0]
    - Project[]
      - InnerJoin[("orderstatus" = "field")]
        - ScanFilter[table = tpch:orders:sf0.01, filterPredicate = ("orderstatus" IN ('F', 'O'))]
             orderstatus := tpch:orderstatus
                  :: [[F], [O]]
        - Filter[filterPredicate = ("field" IN ('F', 'O'))]
          - Values
               ('O')
               ('F')

@cla-bot cla-bot bot added the cla-signed label May 10, 2019
@martint martint force-pushed the effective-predicate-values branch 9 times, most recently from 54add8e to 53e0e2a Compare May 13, 2019 01:23
@@ -15,12 +15,15 @@

import com.google.common.collect.ImmutableBiMap;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do something about excessive filter:

- Filter[filterPredicate = ("field" IN ('F', 'O'))]

I saw this problem before with table scans and partition columns.
When partition values from one side of join were pushed to the other side, then there is always filter above table scan with col IN partition_values.
I tried to tackle it in starburstdata/facebook-presto@7081f3e
but it didn't work unfortunately because inference cannot easily reduce

inheritedValue = 1 OR inheritedValue = 2
inheritedValue = tableScanValue
tableScanValue = 1 OR tableScanValue = 2

to simply true (after simplification) and TupleDomain doesn't have true contains method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, there's not much we can do about it, and I haven't investigated it further. On the other hand, it's a filter over a ValuesNode, so it's inexpensive.

Copy link
Member

@sopel39 sopel39 May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be meaningful though if filter contains 100s of partition values, but it's out of scope of this PR

@martint martint force-pushed the effective-predicate-values branch from 53e0e2a to 650ff6f Compare May 13, 2019 16:19
return TRUE_LITERAL;
}

ExpressionInterpreter interpreter = ExpressionInterpreter.expressionOptimizer(value, metadata, session, typeAnalyzer.getTypes(session, types, value));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe ValuesNode should only contain Literal as rows (I think it implicitly does)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventuallly, we should probably have a physical vs logical values node or something else to represent that. Currently, a values node inside a correlated subquery could have a reference to the outer scope.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, a values node inside a correlated subquery could have a reference to the outer scope.

That's a problem on it's own. See: #473

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s not a “problem” per se. It’s legitimate for a query to reference variables from the enclosing scope. After all, filters, projections, subqueries, etc, are nothing but capturing lamba expressions (if you squint a little)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After all, filters, projections, subqueries, etc, are nothing but capturing lamba expressions (if you squint a little)

Still, I don't think correlated symbols should appear out of nothing in any node that contains Expression, but should rather be injected into subquery via some kind of special ProjectCorrelations node. This way subquery support can be narrowed down in various places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a “bind” operation, just like lambda(x) { ... } binds x.

The next generation of the IR that I’ve been prototyping for a while will likely unify both concepts. Anyway, topic for another time :)

for (int row = 0; row < node.getRows().size(); row++) {
Expression value = node.getRows().get(row).get(column);

if (!DeterminismEvaluator.isDeterministic(value)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could still derive domains for other symbols

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the predicate needs to be an “upper bound”. If we had a single row, we could do that, but otherwise, there’s no way to formulate such a predicate.

@martint martint force-pushed the effective-predicate-values branch from 650ff6f to 1a11e03 Compare May 20, 2019 15:50
Use tables to store the points/polygons used in the tests. This is
needed for an upcoming change to derivery predicates from inline
tables, which causes the plans used in the tests to constant-fold
all the way.
This allows Join inference to push the derived predicate
to the other side when one side is based on a Values node.

For a query like:

 SELECT 1
 FROM orders JOIN (values 'O', 'F') t(x) ON orders.orderstatus = t.x;

It changes the plan from:

  - Output[_col0]
    - Project[]
      - InnerJoin[("orderstatus" = "field")]
        - TableScan[tpch:orders:sf0.01]
            orderstatus := tpch:orderstatus
                 :: [[F], [O], [P]]
        - Values
             ('O')
             ('F')

to:

  - Output[_col0]
    - Project[]
      - InnerJoin[("orderstatus" = "field")]
        - ScanFilter[table = tpch:orders:sf0.01, filterPredicate = ("orderstatus" IN ('F', 'O'))]
             orderstatus := tpch:orderstatus
                  :: [[F], [O]]
        - Filter[filterPredicate = ("field" IN ('F', 'O'))]
          - Values
               ('O')
               ('F')
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants