Strange behavior of count distinct #108
Comments
That has to do with how folds are composed, which is admittedly a little wonky. A little background... There are 4 phases to a fold operation: pre-processing, reducing, combining, and post-processing. Pre-processing and post-processing compose in a fairly regular manner, but the reducer and combiner come as a pair, and those pairs don't compose nicely. Let's look at the combinef/reducef for both count and distinct.
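As a rough illustration of those four phases, here is a minimal plain-Clojure model. It is a sketch, not the library's source: `run-fold`, the map keys, `count-fold`, `distinct-fold`, and the sample `partitions` are all hypothetical names made up for this example, and each inner vector stands in for the input of one mapper.

```clojure
(require '[clojure.set :as set])

;; Hypothetical helper: applies the four phases of a fold to a collection
;; of partitions, where each partition models the input of one mapper.
(defn run-fold
  [{:keys [pre combinef reducef post]} partitions]
  (->> partitions
       ;; pre-process, then reduce each partition from the seed (combinef)
       (map (fn [part] (reduce reducef (combinef) (pre part))))
       ;; combine the per-partition results
       (reduce combinef (combinef))
       ;; post-process the combined value
       post))

;; count: the reducer increments per element, the combiner sums the counts.
(def count-fold
  {:pre identity
   :combinef +
   :reducef (fn [acc _] (inc acc))
   :post identity})

;; distinct: the reducer adds to a set, the combiner merges the sets.
(def distinct-fold
  {:pre identity
   :combinef set/union
   :reducef conj
   :post identity})

;; Assumed sample data: two mappers with one overlapping element (:b).
(def partitions [[:a :b :a] [:b :c]])

(run-fold count-fold partitions)    ;=> 5
(run-fold distinct-fold partitions) ;=> #{:a :b :c}
```

In this model the intermediate value for count is a number and the intermediate value for distinct is a set, which is why one pair cannot simply be chained after the other.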
For your example, we would compute the distinct elements in each mapper and add them to a set. In the combiner, we would then merge those intermediate sets. We can't count the items in the set in the mappers, because then we lose track of which items are in the set and can't properly combine the outputs of the mappers.

I still wanted fold operations to sort of compose, so that you could do stuff like this:

(->> (fold/map f) (fold/filter g) (fold/distinct) (fold/first))

Which ends up doing something like this:
But it's not clear which parts apply to which phases. I thought that I would throw an exception if you tried to do what you did, but clearly I do not. What you tried looks something like this because it's using the distributed count:
Where I'm just taking the last reducer/combiner combo that you supply. What you really want is something like this:
Which does the count as a post-processing operation. So now there are two versions of

So, what can you do? The solution today is to define a custom fold-fn that specifies which part is which:
... or if you want to reuse it ...
It's not perfect and it could be better. If you have any ideas for making these compose better, I'm all ears. Take a look at the
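To make the difference concrete, the following plain-Clojure sketch contrasts the three behaviours discussed above: keeping only the last reducer/combiner pair, counting distinct items inside each mapper, and doing the count as post-processing. It models the idea only and does not use the library's fold-fn; `run-fold`, the map keys, and the sample data are hypothetical names chosen for this example.

```clojure
(require '[clojure.set :as set])

;; Hypothetical helper: applies pre, reduce, combine, and post in order.
;; Each inner vector of `partitions` models the input of one mapper.
(defn run-fold
  [{:keys [pre combinef reducef post]} partitions]
  (->> partitions
       (map (fn [part] (reduce reducef (combinef) (pre part))))
       (reduce combinef (combinef))
       post))

;; Assumed sample data: :b appears in both mappers.
(def partitions [[:a :b :a] [:b :c]])

;; 1. Only the last reducer/combiner pair (count's) is kept, so every
;;    element is counted and distinct is effectively ignored.
(def last-pair-wins
  {:pre identity
   :combinef +
   :reducef (fn [acc _] (inc acc))
   :post identity})

;; 2. Counting distinct items inside each mapper loses the items
;;    themselves, so :b is counted once per mapper.
(defn count-in-mappers [parts]
  (reduce + (map (comp count set) parts)))

;; 3. Keep distinct's reducer/combiner pair and count in post-processing.
(def count-distinct
  {:pre identity
   :combinef set/union
   :reducef conj
   :post count})

(run-fold last-pair-wins partitions) ;=> 5
(count-in-mappers partitions)        ;=> 4
(run-fold count-distinct partitions) ;=> 3
```

Only the third variant gives the cardinality of the merged set, because the sets survive until after the combine step.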
Thanks, this is very helpful. I was wondering whether fold-fn would be necessary. Looks like it's time for a little brain-stretching :)
Hi Folks, we're getting some unexpected results here
This code returns [5]. Removing the count gives [#{:a :b :c}], no surprises. I'd expect the count to return the cardinality of the set returned by distinct, but unfortunately something else is happening. Any ideas?