Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wire up Standard Deviation aggregate function. #1573

Merged
merged 4 commits into from
Feb 11, 2015
Merged

Wire up Standard Deviation aggregate function. #1573

merged 4 commits into from
Feb 11, 2015

Conversation

corylanou
Copy link
Contributor

Fixes #1475

@corylanou
Copy link
Contributor Author

@pauldix is it worth doing a partial mean in the Map even though we still need to send all the data to the reduce to get diffs for std dev? It will distribute a small part of the load before reduce. Thoughts?

@pauldix
Copy link
Member

pauldix commented Feb 11, 2015

@corylanou nah, not worth it. The time to pipe the data over the network is what will dominate here

func MapStddev(itr Iterator, e *Emitter, tmax int64) {
var values []float64

for k, v := itr.Next(); k != 0; k, v = itr.Next() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should have something here that emits the values after it gets to a certain size. Like say 1000. This may fall over later if you try to emit a values array that is super massive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I can effectively "batch" emits?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're already putting all the values into a single emit. I would just say that you need to limit the size of that. So you can do multiple emits per map

@otoolep
Copy link
Contributor

otoolep commented Feb 11, 2015

LGTM.

corylanou added a commit that referenced this pull request Feb 11, 2015
Wire up Standard Deviation aggregate function.
@corylanou corylanou merged commit 13c437a into master Feb 11, 2015
@corylanou corylanou deleted the stddev branch February 11, 2015 20:56
@pauldix
Copy link
Member

pauldix commented Feb 11, 2015

Wait, we want population stddev, don't we?

Also, that string "undefined" for undefined stddevs is going to break everyone trying to work with this data. If JSON supported floats I would suggest that it be NaN, but it doesn't. Given that, I'd say we're best off making it 0.

I know that's mathematically incorrect, but it's going to be the most sensible thing for people actually working with this.

@toddboom
Copy link
Contributor

@pauldix Aren't we technically working with samples? Since the original measurements are theoretically continuous functions, I'm not sure we have a population.

@pauldix
Copy link
Member

pauldix commented Feb 11, 2015

I don't see why we'd throw out one of the measurements

@toddboom
Copy link
Contributor

http://en.wikipedia.org/wiki/Bessel%27s_correction

"While there are n independent samples, there are only n − 1 independent residuals, as they sum to 0."

@pauldix
Copy link
Member

pauldix commented Feb 11, 2015

@toddboom ah ok. I'm am enlightened ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wire up stddev aggregate function
4 participants