Make APPROX_COUNT_DISTINCT error configurable #29
Comments
Hi @asuhan, does a starter tag imply that anyone just taking a look at the code base could possibly look into fixing this? |
Yes, it does. Ideally, people working on a task should announce their intention to avoid duplication of effort. Here's what this task involves:
|
May I take this one? |
@Smyatkin-Maxim Yes, that’d be great. Let me know if you need help. |
yes, this is good. |
@asuhan Btw, should the optional argument be Integer (e.g., 2%) or Decimal (e.g., 0.02 error rate)? |
Yes, you have to start from Regarding the format of the argument, let's make it percent (2%). |
Ok, thanks |
As far as I understand, by design I should not (and currently cannot) access the literal at |
Or maybe this little hack is better than updating the optimizer for the corner case of an aggregate with two arguments. @asuhan, which way would you advise? |
@Smyatkin-Maxim I don't understand your comment about |
Ok, perhaps the optimizer works as it should; it's not something that can be learned in two days :) For the query select approx_count_distinct(tree_dbh, 2) from nyc_trees_2015_683k; it sees only this part of the input JSON string:
{
"id": "2",
"relOp": "LogicalAggregate",
"fields": [
"EXPR$0"
],
"group": [],
"aggs": [
{
"agg": "APPROX_COUNT_DISTINCT",
"type": {
"type": "BIGINT",
"nullable": false
},
"distinct": false,
"operands": [
0,
1
]
}
]
}
While I need to access this part:
{
"id": "1",
"relOp": "LogicalProject",
"fields": [
"tree_dbh",
"$f1"
],
"exprs": [
{
"input": 4
},
{
"literal": 2,
"type": "DECIMAL",
"target_type": "INTEGER",
"scale": 0,
"precision": 1,
"type_scale": 0,
"type_precision": 10
}
]
}
For the first argument this is being resolved later in
std::unique_ptr<const RexAgg> parse_aggregate_expr(const rapidjson::Value& expr,
                                                   const std::vector<std::shared_ptr<const RelAlgNode>>& inputs) { |
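To make the relationship between the two JSON fragments above concrete: the aggregate's "operands" entries (0 and 1) are positions in the LogicalProject's "exprs" list, so the literal 2 sits behind operand 1. Here is a minimal sketch of that lookup, assuming simplified stand-in types. `ProjectExpr`, `Project`, `Aggregate` and `second_operand_literal` are hypothetical names for illustration only; they are not the project's `RexAgg` or `RelAlgNode` classes and not necessarily how the fix ended up being implemented.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified stand-ins for the plan nodes shown in the JSON.
struct ProjectExpr {
  bool is_literal;   // true for {"literal": ...}, false for {"input": ...}
  long long value;   // literal value, or the input column index otherwise
};

struct Project {
  std::vector<ProjectExpr> exprs;  // mirrors "exprs" of the LogicalProject
};

struct Aggregate {
  std::vector<std::size_t> operands;  // mirrors "operands": indices into exprs
};

// Returns the literal behind the aggregate's second operand, or std::nullopt
// when the aggregate was called with a single argument.
inline std::optional<long long> second_operand_literal(const Aggregate& agg,
                                                       const Project& input) {
  if (agg.operands.size() < 2) {
    return std::nullopt;
  }
  const std::size_t idx = agg.operands[1];
  if (idx >= input.exprs.size()) {
    throw std::out_of_range("aggregate operand does not match a projected expression");
  }
  const ProjectExpr& expr = input.exprs[idx];
  if (!expr.is_literal) {
    throw std::invalid_argument("second argument must be a literal");
  }
  return expr.value;
}
```

With exprs = [{input 4}, {literal 2}] and operands = [0, 1], as in the plan above, the function yields 2; with a single operand it yields an empty optional, so a caller could fall back to a default error rate.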
Ok, I now see the problem. The dead column elimination doesn't look at the second argument because it cannot -- |
@asuhan Ok, just to make sure that I got you right:
|
Yup, I think this is the best course of action -- should work smoothly, if it doesn't we'll have a closer look. |
I've created a pull request for review. I'd like you to have a closer look into these things:
|
Thanks, I'll have a look tomorrow. Yes, |
Add a second, optional parameter to APPROX_COUNT_DISTINCT to control the error rate of the underlying HyperLogLog structure.
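As a rough illustration of what "controlling the error rate" can mean for HyperLogLog: the usual estimate of its relative standard error is 1.04 / sqrt(m), where m = 2^b is the number of registers. The sketch below maps an error given in percent (the format agreed on in the thread, e.g. 2 for 2%) to the smallest b that meets it. The function names are hypothetical and the mapping is an assumption for illustration; the project's actual choice of sizes may differ.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical helper, not taken from the code base: returns the smallest
// number of HyperLogLog index bits b such that 1.04 / sqrt(2^b) does not
// exceed the requested relative error, given in percent.
inline int hll_bits_for_error_percent(const int error_percent) {
  if (error_percent < 1 || error_percent > 100) {
    throw std::invalid_argument("error rate must be between 1 and 100 percent");
  }
  const double target = error_percent / 100.0;
  // Solve 1.04 / sqrt(m) <= target for the register count m.
  const double registers_needed = std::pow(1.04 / target, 2.0);
  return static_cast<int>(std::ceil(std::log2(registers_needed)));
}

// Expected relative standard error for a sketch with 2^bits registers.
inline double hll_expected_error(const int bits) {
  return 1.04 / std::sqrt(std::ldexp(1.0, bits));
}
```

For example, a 2% target needs m >= 52^2 = 2704 registers, so b = 12 (4096 registers), which gives an expected error of 1.04 / 64 = 1.625%.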