Make `APPROX_COUNT_DISTINCT` error configurable #29

asuhan · 2017-06-20T22:34:14Z

Add a second, optional parameter to APPROX_COUNT_DISTINCT to control the error rate of the underlying HyperLogLog structure.

The text was updated successfully, but these errors were encountered:

shahsyed-pcln · 2017-06-21T10:13:23Z

Hi @asuhan, does a starter tag imply that anyone just taking a look at the code base could possibly look into fixing this?

asuhan · 2017-06-21T18:12:16Z

Yes, it does. Ideally, people working on a task should announce their intention to avoid duplication of effort.

Here's what this task involves:

Changes to the ApproxCountDistinct operator (in the MapDSqlOperatorTable class) to make it accept an additional, optional parameter for the desired relative error. Check PgILike or RegexpLike for an example of how to achieve that (the escape character for these two is optional).
Add the desired relative error as an additional field of Analyzer::AggExpr and use it in GroupByAndAggregate::initCountDistinctDescriptors instead of HLL_MASK_WIDTH.

Smyatkin-Maxim · 2017-10-09T14:50:32Z

May I take this one?

asuhan · 2017-10-09T18:15:06Z

@Smyatkin-Maxim Yes, that’d be great. Let me know if you need help.

billmaimone · 2017-10-09T23:51:23Z

yes, this is good.

Smyatkin-Maxim · 2017-10-11T17:39:56Z

@asuhan
Tried to understand it today. Either I'm missing some easy way to do it, or it's a bit more work than just updating code in the three places, because there are no other aggregates working with multiple arguments.
The changes in calcite are straightforward. The changes in C++ not that much. If I understand it correctly, I should find a way to parse json generated by calcite in parse_aggregate_expr (RelAlgAbstractInterpreter.cpp) to get the optional parameter into RexAgg and then forward it to AggExpr? I also see there are other code paths creating AggExpr instances, which I also should handle.

Btw, should the optional argument be Integer (e.g., 2%) or Decimal (e.g., 0.02 error rate)?
Will try to implement it tomorrow.

asuhan · 2017-10-11T18:28:46Z

Yes, you have to start from parse_aggregate_expr and relax the operands requirements a little bit to account for the fact that we have two parameters now. You should only allow for an additional literal parameter for APPROX_COUNT_DISTINCT and store it in RexAgg and AggExpr. RelAlgTranslator::translateAggregateRex is the only place which really instantiates AggExpr, the other occurrences either just copy it (ExpressionRewrite.cpp, CalciteAdapter.cpp) or synthesize a COUNT aggregate for a cardinality query (in RelAlgExecutor.cpp). The same additional information needs to be stored in the TargetInfo structure and stored there by target_info in ResultRows.h.

Regarding the format of the argument, let's make it percent (2%).

Smyatkin-Maxim · 2017-10-11T20:14:34Z

Ok, thanks

Smyatkin-Maxim · 2017-10-12T13:51:31Z

As far as I understand, by design I should not (and can not currently) access the literal at parse_aggregate_expr, because it's not part of an aggregate expression - it's part of input arguments. I should resolve it only in RelAlgTranslator::translateAggregateRex from scalar_sources. But for some reason optimizer decides that I don't need this literal and removes it in eliminate_dead_columns. Which means that I won't see it later in scalar_sources.
So, while I could parse it in parse_aggregate_expr into AggRex, it feels like a dirty hack, because it's not the place where operands should be resolved. So I probably have to come up with a fix to eliminate_dead_columns I suppose?

Smyatkin-Maxim · 2017-10-12T14:19:26Z

Or maybe this little hack is better than updating optimizer for the corner case of aggregate with two arguments. @asuhan , which way would you advice?

asuhan · 2017-10-12T18:24:36Z

@Smyatkin-Maxim I don't understand your comment about parse_aggregate_expr. It returns a full aggregate expression which carries all the information: the type of the aggregate and the arguments. For APPROX_COUNT_DISTINCT, the first argument is handled just like we do for all aggregates, you can leave it alone. The second argument becomes an additional integer / float (depends on your choice) field of RexAgg; unlike the first argument, it's not a scalar source. You don't need to change eliminate_dead_columns.

Smyatkin-Maxim · 2017-10-12T18:50:37Z

Ok, perhaps optimizer works as it should - it's not something that can be learned in two days :)
I understand what parse_aggregate_expr does. I mean that Calcite doesn't make this literal part of aggregate expression, so parse_aggregate_expr can't see it.
For example, if we consider this query:

select approx_count_distinct(tree_dbh, 2) from nyc_trees_2015_683k;

it sees only this part of input JSON string:

    {
      "id": "2",
      "relOp": "LogicalAggregate",
      "fields": [
        "EXPR$0"
      ],
      "group": [],
      "aggs": [
        {
          "agg": "APPROX_COUNT_DISTINCT",
          "type": {
            "type": "BIGINT",
            "nullable": false
          },
          "distinct": false,
          "operands": [
            0,
            1
          ]
        }
      ]
    }

While I need to access this part:

    {
      "id": "1",
      "relOp": "LogicalProject",
      "fields": [
        "tree_dbh",
        "$f1"
      ],
      "exprs": [
        {
          "input": 4
        },
        {
          "literal": 2,
          "type": "DECIMAL",
          "target_type": "INTEGER",
          "scale": 0,
          "precision": 1,
          "type_scale": 0,
          "type_precision": 10
        }
      ]
    }

For the first argument this is being resolved later in RelAlgTranslator::translateAggregateRex, but the new argument is getting optimized out by the moment. So I'm asking if it's fine if I let parse_aggregate_expr access this part of query. I'll have to change it's signature to something like this:

std::unique_ptr<const RexAgg> parse_aggregate_expr(const rapidjson::Value& expr,
                                                   const std::vector<std::shared_ptr<const RelAlgNode>>& inputs) {

asuhan · 2017-10-12T19:03:35Z

Ok, I now see the problem.The dead column elimination doesn't look at the second argument because it cannot -- RexAgg::getOperand() doesn't take a position argument and only returns the first operand. Fortunately, we only call it from two places in the optimizer, so you can break the signature and then fix the call sites in get_live_ins and renumber_rex_aggs. Both are straightforward: the first one is purely additive (just add the second argument to the live set), the second doesn't really matter (no need for renumbering a literal operand).

Smyatkin-Maxim · 2017-10-12T19:16:43Z

@asuhan Ok, just to make sure that I got you right:

I don't change parse_aggregate_expr signature, just add the second operand support.
I do change signature of RexAgg::getOperand()
I do change optimizer a little bit in places you've just named to manage multiple arguments.

asuhan · 2017-10-12T19:18:02Z

Yup, I think this is the best course of action -- should work smoothly, if it doesn't we'll have a closer look.

Smyatkin-Maxim · 2017-10-15T11:47:07Z

I've created a pull request for review. I'd like you to have a closer look into these things:

Did I pick the correct formula for bitmap size calculation from relative error? I'm not 100% sure about it.
I see there is FunctionRef::analyze which also creates an AggExpr object. But it doesn't consider kAPPROXIMATE_COUNT_DISTINCT aggregates at all. So, I didn't update it for approx_count. As far as I understand, it's from some old parser before Calcite and isn't being used anymore? Or am I wrong here?

asuhan · 2017-10-16T00:33:22Z

Thanks, I'll have a look tomorrow. Yes, FunctionRef::analyze is effectively dead code we should delete, don't worry about it.

asuhan added starter enhancement labels Jun 20, 2017

This was referenced Sep 27, 2017

mapd_server crashed when queries and inserts proceeded concurrently #82

Closed

LOAD queries starve SELECT queries (and mapdql) when number of total concurrent queries is GTE than number of cpu cores #95

Closed

This was referenced Oct 13, 2017

JDBC PreparedStatement won't work with parameter as last element #45

Closed

Implemented optional error percentage argument for APPROX_COUNT_DISTINCT #112

Merged

asuhan closed this as completed in #112 Oct 20, 2017

MarcusGDaniels mentioned this issue Mar 24, 2021

determinism and SELECT DISTINCT #636

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make `APPROX_COUNT_DISTINCT` error configurable #29

Make `APPROX_COUNT_DISTINCT` error configurable #29

asuhan commented Jun 20, 2017

shahsyed-pcln commented Jun 21, 2017

asuhan commented Jun 21, 2017

Smyatkin-Maxim commented Oct 9, 2017

asuhan commented Oct 9, 2017 •

edited

Loading

billmaimone commented Oct 9, 2017

Smyatkin-Maxim commented Oct 11, 2017

asuhan commented Oct 11, 2017

Smyatkin-Maxim commented Oct 11, 2017

Smyatkin-Maxim commented Oct 12, 2017

Smyatkin-Maxim commented Oct 12, 2017

asuhan commented Oct 12, 2017

Smyatkin-Maxim commented Oct 12, 2017

asuhan commented Oct 12, 2017 •

edited

Loading

Smyatkin-Maxim commented Oct 12, 2017

asuhan commented Oct 12, 2017

Smyatkin-Maxim commented Oct 15, 2017

asuhan commented Oct 16, 2017

Make APPROX_COUNT_DISTINCT error configurable #29

Make APPROX_COUNT_DISTINCT error configurable #29

Comments

asuhan commented Jun 20, 2017

shahsyed-pcln commented Jun 21, 2017

asuhan commented Jun 21, 2017

Smyatkin-Maxim commented Oct 9, 2017

asuhan commented Oct 9, 2017 • edited Loading

billmaimone commented Oct 9, 2017

Smyatkin-Maxim commented Oct 11, 2017

asuhan commented Oct 11, 2017

Smyatkin-Maxim commented Oct 11, 2017

Smyatkin-Maxim commented Oct 12, 2017

Smyatkin-Maxim commented Oct 12, 2017

asuhan commented Oct 12, 2017

Smyatkin-Maxim commented Oct 12, 2017

asuhan commented Oct 12, 2017 • edited Loading

Smyatkin-Maxim commented Oct 12, 2017

asuhan commented Oct 12, 2017

Smyatkin-Maxim commented Oct 15, 2017

asuhan commented Oct 16, 2017

Make `APPROX_COUNT_DISTINCT` error configurable #29

Make `APPROX_COUNT_DISTINCT` error configurable #29

asuhan commented Oct 9, 2017 •

edited

Loading

asuhan commented Oct 12, 2017 •

edited

Loading