Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make APPROX_COUNT_DISTINCT error configurable #29

Closed
asuhan opened this issue Jun 20, 2017 · 17 comments · Fixed by #112
Closed

Make APPROX_COUNT_DISTINCT error configurable #29

asuhan opened this issue Jun 20, 2017 · 17 comments · Fixed by #112

Comments

@asuhan
Copy link
Contributor

asuhan commented Jun 20, 2017

Add a second, optional parameter to APPROX_COUNT_DISTINCT to control the error rate of the underlying HyperLogLog structure.

@shahsyed-pcln
Copy link

Hi @asuhan, does a starter tag imply that anyone just taking a look at the code base could possibly look into fixing this?

@asuhan
Copy link
Contributor Author

asuhan commented Jun 21, 2017

Yes, it does. Ideally, people working on a task should announce their intention to avoid duplication of effort.

Here's what this task involves:

  • Changes to the ApproxCountDistinct operator (in the MapDSqlOperatorTable class) to make it accept an additional, optional parameter for the desired relative error. Check PgILike or RegexpLike for an example of how to achieve that (the escape character for these two is optional).
  • Add the desired relative error as an additional field of Analyzer::AggExpr and use it in GroupByAndAggregate::initCountDistinctDescriptors instead of HLL_MASK_WIDTH.

@Smyatkin-Maxim
Copy link
Contributor

May I take this one?

@asuhan
Copy link
Contributor Author

asuhan commented Oct 9, 2017

@Smyatkin-Maxim Yes, that’d be great. Let me know if you need help.

@billmaimone
Copy link
Contributor

yes, this is good.

@Smyatkin-Maxim
Copy link
Contributor

@asuhan
Tried to understand it today. Either I'm missing some easy way to do it, or it's a bit more work than just updating code in the three places, because there are no other aggregates working with multiple arguments.
The changes in calcite are straightforward. The changes in C++ not that much. If I understand it correctly, I should find a way to parse json generated by calcite in parse_aggregate_expr (RelAlgAbstractInterpreter.cpp) to get the optional parameter into RexAgg and then forward it to AggExpr? I also see there are other code paths creating AggExpr instances, which I also should handle.

Btw, should the optional argument be Integer (e.g., 2%) or Decimal (e.g., 0.02 error rate)?
Will try to implement it tomorrow.

@asuhan
Copy link
Contributor Author

asuhan commented Oct 11, 2017

Yes, you have to start from parse_aggregate_expr and relax the operands requirements a little bit to account for the fact that we have two parameters now. You should only allow for an additional literal parameter for APPROX_COUNT_DISTINCT and store it in RexAgg and AggExpr. RelAlgTranslator::translateAggregateRex is the only place which really instantiates AggExpr, the other occurrences either just copy it (ExpressionRewrite.cpp, CalciteAdapter.cpp) or synthesize a COUNT aggregate for a cardinality query (in RelAlgExecutor.cpp). The same additional information needs to be stored in the TargetInfo structure and stored there by target_info in ResultRows.h.

Regarding the format of the argument, let's make it percent (2%).

@Smyatkin-Maxim
Copy link
Contributor

Ok, thanks

@Smyatkin-Maxim
Copy link
Contributor

As far as I understand, by design I should not (and can not currently) access the literal at parse_aggregate_expr, because it's not part of an aggregate expression - it's part of input arguments. I should resolve it only in RelAlgTranslator::translateAggregateRex from scalar_sources. But for some reason optimizer decides that I don't need this literal and removes it in eliminate_dead_columns. Which means that I won't see it later in scalar_sources.
So, while I could parse it in parse_aggregate_expr into AggRex, it feels like a dirty hack, because it's not the place where operands should be resolved. So I probably have to come up with a fix to eliminate_dead_columns I suppose?

@Smyatkin-Maxim
Copy link
Contributor

Or maybe this little hack is better than updating optimizer for the corner case of aggregate with two arguments. @asuhan , which way would you advice?

@asuhan
Copy link
Contributor Author

asuhan commented Oct 12, 2017

@Smyatkin-Maxim I don't understand your comment about parse_aggregate_expr. It returns a full aggregate expression which carries all the information: the type of the aggregate and the arguments. For APPROX_COUNT_DISTINCT, the first argument is handled just like we do for all aggregates, you can leave it alone. The second argument becomes an additional integer / float (depends on your choice) field of RexAgg; unlike the first argument, it's not a scalar source. You don't need to change eliminate_dead_columns.

@Smyatkin-Maxim
Copy link
Contributor

Ok, perhaps optimizer works as it should - it's not something that can be learned in two days :)
I understand what parse_aggregate_expr does. I mean that Calcite doesn't make this literal part of aggregate expression, so parse_aggregate_expr can't see it.
For example, if we consider this query:

select approx_count_distinct(tree_dbh, 2) from nyc_trees_2015_683k;

it sees only this part of input JSON string:

    {
      "id": "2",
      "relOp": "LogicalAggregate",
      "fields": [
        "EXPR$0"
      ],
      "group": [],
      "aggs": [
        {
          "agg": "APPROX_COUNT_DISTINCT",
          "type": {
            "type": "BIGINT",
            "nullable": false
          },
          "distinct": false,
          "operands": [
            0,
            1
          ]
        }
      ]
    }

While I need to access this part:

    {
      "id": "1",
      "relOp": "LogicalProject",
      "fields": [
        "tree_dbh",
        "$f1"
      ],
      "exprs": [
        {
          "input": 4
        },
        {
          "literal": 2,
          "type": "DECIMAL",
          "target_type": "INTEGER",
          "scale": 0,
          "precision": 1,
          "type_scale": 0,
          "type_precision": 10
        }
      ]
    }

For the first argument this is being resolved later in RelAlgTranslator::translateAggregateRex, but the new argument is getting optimized out by the moment. So I'm asking if it's fine if I let parse_aggregate_expr access this part of query. I'll have to change it's signature to something like this:

std::unique_ptr<const RexAgg> parse_aggregate_expr(const rapidjson::Value& expr,
                                                   const std::vector<std::shared_ptr<const RelAlgNode>>& inputs) {

@asuhan
Copy link
Contributor Author

asuhan commented Oct 12, 2017

Ok, I now see the problem.The dead column elimination doesn't look at the second argument because it cannot -- RexAgg::getOperand() doesn't take a position argument and only returns the first operand. Fortunately, we only call it from two places in the optimizer, so you can break the signature and then fix the call sites in get_live_ins and renumber_rex_aggs. Both are straightforward: the first one is purely additive (just add the second argument to the live set), the second doesn't really matter (no need for renumbering a literal operand).

@Smyatkin-Maxim
Copy link
Contributor

@asuhan Ok, just to make sure that I got you right:

  • I don't change parse_aggregate_expr signature, just add the second operand support.
  • I do change signature of RexAgg::getOperand()
  • I do change optimizer a little bit in places you've just named to manage multiple arguments.

@asuhan
Copy link
Contributor Author

asuhan commented Oct 12, 2017

Yup, I think this is the best course of action -- should work smoothly, if it doesn't we'll have a closer look.

@Smyatkin-Maxim
Copy link
Contributor

I've created a pull request for review. I'd like you to have a closer look into these things:

  • Did I pick the correct formula for bitmap size calculation from relative error? I'm not 100% sure about it.
  • I see there is FunctionRef::analyze which also creates an AggExpr object. But it doesn't consider kAPPROXIMATE_COUNT_DISTINCT aggregates at all. So, I didn't update it for approx_count. As far as I understand, it's from some old parser before Calcite and isn't being used anymore? Or am I wrong here?

@asuhan
Copy link
Contributor Author

asuhan commented Oct 16, 2017

Thanks, I'll have a look tomorrow. Yes, FunctionRef::analyze is effectively dead code we should delete, don't worry about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants