Add 'cumulative' histogram 'mode' for CDF #1189

etpinard · 2016-11-22T20:23:45Z

resolves #1180

etpinard · 2016-11-22T20:31:39Z

Here's a proof-of-concept PR made to attract more attention from our plotly attribute associates @chriddyp @cldougl but also @alexcjohnson @rreusser @monfera

In this PoC, a mode attribute is added to histogram traces (and to histogram2d and histogram2dcontour eventually) with two possible values: 'density' (which is wrong, maybe per-bin or raw would be better) and 'cumulative' which would allow for cumulative histograms (that's a valid term apparently) and cumulative distribution functions (CDFs).

etpinard · 2016-11-22T20:32:50Z

src/traces/histogram/calc.js

    // average and/or normalize the data, if needed
    if(doavg) total = doAvg(size, counts);
    if(normfunc) normfunc(size, total, inc);
+    if(trace.mode === 'cumulative') cdf(size);


Maybe 'cumulative' should have a different meaning for histnorm !== '' and histfunc !== 'count'?
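For readers following the diff, here is a minimal sketch of what a helper like the `cdf(size)` call might do. It assumes `size` is the array of per-bin values and that accumulation happens in place; the body is an illustration, not the code from this PR.

```javascript
// Hypothetical stand-in for the `cdf(size)` call in the diff above:
// replace each bin value with the running total of itself and all bins before it.
function cdf(size) {
    for(var i = 1; i < size.length; i++) {
        size[i] += size[i - 1];
    }
    return size;
}

var counts = [1, 2, 3, 4];
cdf(counts);
// counts is now [1, 3, 6, 10]
```

Because the call comes after `doAvg` and `normfunc`, whatever those produce is what gets accumulated, which is why the question about histnorm and histfunc matters.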

etpinard · 2016-11-22T20:34:02Z

src/traces/histogram/attributes.js

        ].join(' ')
    },

+    mode: {


or maybe alternatively, a boolean cumulative attribute could do the trick. I don't see any other possible modes for histograms.

This is a cool addition--I think I'd prefer a cumulative boolean attribute. That's pretty explicit and +1 about not seeing any other possible modes.
Also, not sure if this is being too nitpicky across traces, but I'm not super convinced that this use case provides a good parallel to scatter mode.

⏫ the winning argument so far

monfera · 2016-11-22T20:35:46Z

Yes, something like distribution or bins may be better than density.

etpinard · 2016-11-22T20:38:41Z

> Yes, something like distribution

I like the sound of distribution. Thanks!

chriddyp · 2016-11-22T23:41:55Z

Love this! How do you imagine this relating to cumulative distributions graphed as a line chart? That's how I'm used to seeing CDFs and PDFs. Lots of nice examples on the plotly feed: https://plot.ly/feed/?q=CDF

Something similar that comes up pretty frequently is visualizing the aggregations that we provide in histogram.zfunc and histogram.z as lines or points instead of bars. The common case here is visualizing time series histograms - the number of events e.g. per week (like number of payments per week with z equal to the payment amount and zfunc equal to sum). Here's a nice example from Stephen Wolfram's Blog

And in that case, visualizing this data as lines instead of bars allows users to view "rolling aggregations" in a way that bars can't. For example, viewing number of events in the last week, day-over-day. As a date histogram, the width of the bar spans the size of the bin (e.g. a week) and so you can't have a bar on day 14 and a bar on day 15 each spanning last week without them overlapping.
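A rough sketch of the "rolling aggregation" idea, assuming events are given as integer day indices. The function name `rollingCounts` and its parameters are made up for illustration.

```javascript
// Count events in the trailing window ending at each day (inclusive).
// `days` holds integer day indices of events; `windowSize` is in days.
function rollingCounts(days, firstDay, lastDay, windowSize) {
    var out = [];
    for(var d = firstDay; d <= lastDay; d++) {
        var n = 0;
        for(var i = 0; i < days.length; i++) {
            if(days[i] > d - windowSize && days[i] <= d) n++;
        }
        out.push({day: d, count: n});
    }
    return out;
}

var events = [1, 2, 2, 8, 9, 14, 15, 15];
var weekly = rollingCounts(events, 14, 15, 7);
// Day 14 covers days 8..14 (3 events) and day 15 covers days 9..15 (4 events).
```

The two windows share six days, so week-wide bars at day 14 and day 15 would overlap, whereas points or a line can sit at each day.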

These aggregations come up a lot and CDFs and PDFs seem to also fall into this family. Originally I thought that we would provide this type of functionality through an aggregation transform + a scatter chart but maybe it could just be a different rendering option in the traces themselves, like adding mode to histograms as bar | line | scatter.

Would love @alexcjohnson 's feedback on ^^ too

alexcjohnson · 2016-11-23T18:53:33Z

@chriddyp Re: line mode for histograms - yes, I think it's a good idea, but it would be good to get real line/area trace stacking working first. Then it would be easy to plumb this into a rolling average transform if someone wants that. Agreed that it would be misleading to do this on bars.

@etpinard re: cumulative histograms - I haven't looked at the attributes you've proposed yet, but we should be careful about our presentation.

The way you've done it in the gif up top - each bar is at the same location as its PDF analog, with height equal to its own height plus the sum of all bars before it - is very common (it's used in the wikipedia article you linked), but it's also arguably wrong. Visually you've shifted the distribution half a bar to the left. Imagine thinking of the bars as a continuous function (ie constant over the domain of each bar) and integrating that function, which is really what the CDF is supposed to mean. You'd actually get a piecewise linear result where the value at the right edge of each bar is the sum of that bar and all bars before it.
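To make the integration argument concrete, here is a small sketch that treats the histogram as a piecewise-constant function and returns the CDF at bin edges. It assumes equal-width bins; `cdfAtBinEdges` is a hypothetical name.

```javascript
// Integrate a histogram treated as a piecewise-constant function.
// Returns points at bin edges: the CDF is 0 at the left edge of the first bin
// and, at the right edge of each bin, equals that bin plus all bins before it.
function cdfAtBinEdges(start, binSize, counts) {
    var x = [start];
    var y = [0];
    var total = 0;
    for(var i = 0; i < counts.length; i++) {
        total += counts[i];
        x.push(start + (i + 1) * binSize);
        y.push(total);
    }
    return {x: x, y: y};
}

var edges = cdfAtBinEdges(0, 2, [1, 2, 3, 4]);
// edges.x: [0, 2, 4, 6, 8], edges.y: [0, 1, 3, 6, 10]
```

Connecting these points with straight segments gives the piecewise linear result; drawing each total at the bin's own position instead is what shifts the picture half a bar to the left.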

I suppose if you really want to keep bars, you could imagine each cumulative bar being the sum of all the bars before it plus half of the current bar... that would be more "correct" but that seems like it would just confuse people. Or you can show both the previous total and the current bin:

That's both visually correct and (to me anyway) intuitive... but it's a bit complicated.

In short though, I really don't like bars for CDFs, however common they are. Take that with a grain of salt though, I haven't used them much for real data analysis myself, would love to hear the perspective of someone who has.

There's something similar to be said about the plot @chriddyp posted (which has now disappeared? But I think I remember what it looked like) although ironically with a partially opposite solution. It looks like in that plot you're showing the exact CDF by adding a data point for each individual sample? In that case linearly interpolating between points is incorrect, the CDF does not linearly increase from one sample to the next, it jumps up exactly at each sample - because it's really an integral of delta functions, one for each sample. So in plotly.js language, you should use line.shape='hv' if the vertical position of each point is equal to the number (or fraction) of samples to the left of and including that one:

Alternatively (and arguably more correctly in terms of the visual significance of the point markers, but confusing for the same reasons as above) you could set the vertical position of each point to the number of samples to the left of this one plus half a sample for the current one, and connect points using line.shape='vhv'. Nobody seems to do this though.
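As a sketch of the first option, the function below builds a scatter trace object for an empirical CDF using `line.shape: 'hv'`. The helper name `ecdfTrace` is an assumption; only the trace object is built, so the snippet runs without plotly.js.

```javascript
// Build a scatter trace for the empirical CDF of `samples`.
// Each point sits at a sample, with y equal to the fraction of samples
// at or to the left of it; line.shape 'hv' draws the jump at each sample.
function ecdfTrace(samples) {
    var sorted = samples.slice().sort(function(a, b) { return a - b; });
    var n = sorted.length;
    var y = sorted.map(function(_, i) { return (i + 1) / n; });
    return {
        type: 'scatter',
        mode: 'lines',
        x: sorted,
        y: y,
        line: {shape: 'hv'}
    };
}

var trace = ecdfTrace([3, 1, 2, 5]);
// trace.x: [1, 2, 3, 5], trace.y: [0.25, 0.5, 0.75, 1]
```

With 'hv', the line runs horizontally from one sample to the next and then rises at the new sample, so the value at each sample includes that sample.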

Some people also normalize to N+1 points and connect with straight lines, it's called an ogive plot - see http://www.physics.csbsju.edu/stats/display.distribution.html - this looks a bit weird though and I don't know what the theoretical justification for it is. I guess an attempt to project from the sample distribution to the population distribution?

This kind of situation is, incidentally, exactly what's hard about doing real stacked area charts correctly... you'd be trying to stack y values of functions that are defined at uneven x values... so for the second trace, do you make steps at the x values of the first trace, even though the second trace doesn't have a data point there?

chriddyp · 2016-11-23T19:04:13Z

So in plotly.js language, you should use line.shape='hv'

Yeah exactly, that's how I'm used to making and seeing CDFs.

alexcjohnson · 2017-01-14T06:19:47Z

src/traces/histogram/attributes.js

+            'increases from left to right. If *decreasing* we sum later bins',
+            'so the fresult decreases from left to right.'
+        ].join(' ')
+    },


direction='decreasing' is to invert the accumulation - ie if you want "how much of the distribution is past this point" instead of "how much is before this point" (although note that it's not exactly the sum minus the increasing CDF unless you choose currentbin='exclude' for one, or currentbin='half' for both; see below).
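The relationship between the two directions can be checked with a small sketch. `cumulate` is a hypothetical helper written for this illustration (it returns a new array rather than mutating), not the function in the PR.

```javascript
// Hypothetical helper: cumulative sums over bin counts, returned as a new array.
// direction: 'increasing' | 'decreasing'; currentbin: 'include' | 'exclude' | 'half'.
function cumulate(counts, direction, currentbin) {
    var n = counts.length;
    var out = new Array(n);
    var running = 0;
    // fraction of the current bin that is counted
    var own = currentbin === 'include' ? 1 : currentbin === 'half' ? 0.5 : 0;
    for(var k = 0; k < n; k++) {
        var i = direction === 'decreasing' ? n - 1 - k : k;
        out[i] = running + own * counts[i];
        running += counts[i];
    }
    return out;
}

var counts = [1, 2, 3, 4]; // total 10
var inc = cumulate(counts, 'increasing', 'include');     // [1, 3, 6, 10]
var decIncl = cumulate(counts, 'decreasing', 'include'); // [10, 9, 7, 4]
var decExcl = cumulate(counts, 'decreasing', 'exclude'); // [9, 7, 4, 0]
// inc[i] + decIncl[i] is 10 + counts[i]: the current bin is counted twice.
// inc[i] + decExcl[i] is exactly 10 for every bin.
```

Using 'half' for both directions also sums to the total in every bin, since each side takes half of the current bin.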

Thoughts on the name direction? I'm not super excited about it but it seems OK.

I'm ok with direction here. But we should keep in mind that direction is now present in several attribute containers. Currently:

In pie traces, direction: 'clockwise' || 'counterclockwise'

In updatemenus, direction: 'left' || 'right' || 'up' || 'down'

In the animation config, direction: 'forward' || 'reverse'

In base layout for polar plots: direction: 'clockwise' || 'counterclockwise'

I suppose it would be nice to make enumerated attributes of the same name share the same possible values when used in different containers. Or maybe that's too much to ask for?

alexcjohnson · 2017-01-14T06:31:50Z

src/traces/histogram/attributes.js

+            '*include* is the default for compatibility with various other',
+            'tools, however it introduces a half-bin bias to the results.',
+            '*exclude* makes the opposite half-bin bias, and *half* removes',
+            'it.'


OK, maybe nobody will use this option, but I put it in to satisfy my own frustration with the visual flaw in the common practice (#1189 (comment)). Is it clear enough what the options mean?

And not to beat a dead horse, but as well as fixing the position bias I think 'half' also does a better job of representing the width of the distribution. To take the extreme case, let's say you have a histogram with all the samples in a single bin. Although they could be all at exactly the same value, generally they aren't. But the standard way to display this would have the CDF going from zero to max instantaneously, as a step function, which implies no width at all to the distribution. 'half' on the other hand would rise in 2 steps - which visually implies a width of 1 bin.
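The single-bin case can be sketched as follows, for the increasing direction only. Both function names are hypothetical, and the empty bins on either side of the populated one are an assumption made so the steps are visible.

```javascript
// 'include' rule: each bin gets everything before it plus all of its own value.
function cumulateInclude(counts) {
    var total = 0;
    return counts.map(function(c) {
        total += c;
        return total;
    });
}

// 'half' rule: each bin gets everything before it plus half of its own value.
function cumulateHalf(counts) {
    var before = 0;
    return counts.map(function(c) {
        var v = before + c / 2;
        before += c;
        return v;
    });
}

// All 8 samples fall in the middle bin; empty bins sit on either side.
var single = [0, 8, 0];
var included = cumulateInclude(single); // [0, 8, 8]: one jump from zero to max
var halved = cumulateHalf(single);      // [0, 4, 8]: two steps, one bin wide
```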

Thanks for putting that option in!

The name currentbin bothers me a little bit because it doesn't sound associated with cumulative.

This has me thinking: maybe we should group all cumulative attributes into a nested object.

cumulative: { enabled: true || false, direction: 'increasing' || 'decreasing', currentbin: 'include' || 'exclude' || 'half' }

Per @chriddyp's #1189 (comment), where we might extend cumulative: true to other trace types down the road, adding a cumulative nested object to scatter would be less intrusive, I think.

> This has me thinking: maybe we should group all cumulative attributes into a nested object.

My only concern about this is that 90% of users won't use anything but enabled, and cumulative: true is easier than cumulative: {enabled: true}. But it would disambiguate direction too... so maybe it's worthwhile.

What if

{ type: 'histogram', x: [/* */], cumulative: true }

expanded to

{ type: 'histogram', x: [/* */], cumulative: { enabled: true, direction: 'increasing', currentbin: 'include' } }

in _fullData?

I thought about letting cumulative: true expand to cumulative: {enabled: true} internally, but I don't think it's a good idea - it would make it very confusing for folks to switch to the full form if they start with the simple one. I think I'll just change it to the nested structure.
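A sketch of how defaults for the nested container could be filled in. `cumulativeDefaults` is a hypothetical function, not plotly.js's own coerce machinery; the 'increasing' and 'include' defaults come from the attribute descriptions in this thread, while `enabled` defaulting to false and skipping the other keys when disabled are assumptions.

```javascript
// Hypothetical defaults step for the nested `cumulative` container.
// Only the object form is accepted; invalid values fall back to defaults.
var DIRECTIONS = ['increasing', 'decreasing'];
var CURRENT_BINS = ['include', 'exclude', 'half'];

function cumulativeDefaults(traceIn) {
    var raw = traceIn && traceIn.cumulative;
    var cumIn = (raw && typeof raw === 'object') ? raw : {};
    var out = {enabled: cumIn.enabled === true};
    if(out.enabled) {
        out.direction = DIRECTIONS.indexOf(cumIn.direction) !== -1 ?
            cumIn.direction : 'increasing';
        out.currentbin = CURRENT_BINS.indexOf(cumIn.currentbin) !== -1 ?
            cumIn.currentbin : 'include';
    }
    return out;
}

var full = cumulativeDefaults({
    type: 'histogram',
    x: [1, 2, 2, 3],
    cumulative: {enabled: true}
});
// full: {enabled: true, direction: 'increasing', currentbin: 'include'}
```

In this sketch the boolean shorthand `cumulative: true` is deliberately not expanded, in line with the concern above about mixing the simple and full forms.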

nested in 4d02af7

etpinard · 2017-01-16T16:47:51Z

test/jasmine/tests/histogram_test.js

+                {
+                    currentbin: 'exclude', histnorm: 'probability',
+                    p: [2, 3, 4, 5], s: [0.1, 0.3, 0.6, 1]
+                }


It might be nice to test cumulative: true with other histfunc and histnorm settings.

Another good call by @etpinard 🌮 (and see also #1189 (comment))

What should we do with cumulative enabled and histnorm='density' or 'probability density'? As the code stands, CDFs using 'density' would rise to N/binSize (# samples / width of each bin) and 'probability density' would rise to 1/binSize. That seems useless and confusing, so I'd propose to interpret "cumulative" to mean an integral in these cases, ie 'density' would rise to N and 'probability density' would rise to 1, which then means in CDF mode these are equivalent to histnorm='' and 'probability' respectively.
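A small numeric sketch of the two interpretations for histnorm='density', assuming equal-width bins. `cumulativeDensity` and its `asIntegral` flag are hypothetical names used only here.

```javascript
// Accumulate density values (count / binSize) either as a plain running sum
// or as an integral (density times bin width).
function cumulativeDensity(counts, binSize, asIntegral) {
    var total = 0;
    return counts.map(function(c) {
        var density = c / binSize;
        total += asIntegral ? density * binSize : density;
        return total;
    });
}

var counts = [1, 2, 3, 4]; // N = 10 samples, bins 2 units wide
var plainSum = cumulativeDensity(counts, 2, false); // [0.5, 1.5, 3, 5]
var integral = cumulativeDensity(counts, 2, true);  // [1, 3, 6, 10]
```

The plain sum tops out at N / binSize = 5, while the integral tops out at N = 10, matching what histnorm='' would give.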

I don't think there's anything special to do based on histfunc - some of these would also give strange results, but then the user is clearly asking for something strange.

Thoughts on any of this?

> I don't think there's anything special to do based on histfunc - some of these would also give strange results, but then the user is clearly asking for something strange.

I can see this being used in time series CDFs. Think payments over time: bin by date and then cumulatively sum by payment amount

> I'd propose to interpret "cumulative" to mean an integral in these cases, ie 'density' would rise to N and 'probability density' would rise to 1

I agree 100% here.

@chriddyp

> I can see this being used in time series CDFs. Think payments over time: bin by date and then cumulatively sum by payment amount

Absolutely - and that will work just fine without modification (tests to come). I was just saying I don't think there's anything that needs altering based on histfunc, like what I'm planning to do for histnorm.
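The payments case might look roughly like this outside of plotly.js. The data shape (`{day, amount}` with integer day indices) and the function name are assumptions for illustration.

```javascript
// Bin payments by week index, sum the amounts per bin (as histfunc 'sum' would),
// then accumulate the weekly sums from left to right.
function cumulativePaymentsByWeek(payments, nWeeks) {
    var sums = [];
    for(var w = 0; w < nWeeks; w++) sums.push(0);
    payments.forEach(function(p) {
        var week = Math.floor(p.day / 7);
        if(week >= 0 && week < nWeeks) sums[week] += p.amount;
    });
    for(var i = 1; i < sums.length; i++) sums[i] += sums[i - 1];
    return sums;
}

var payments = [
    {day: 0, amount: 20}, {day: 3, amount: 5},
    {day: 9, amount: 10}, {day: 20, amount: 40}
];
var running = cumulativePaymentsByWeek(payments, 3); // [25, 35, 75]
```

In trace terms this would presumably correspond to histfunc: 'sum' together with the cumulative option enabled.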

ha! turns out we didn't have any tests with histfunc and histnorm together, and max/min were broken. Fixed in c12b7cf and tests for all of this (cumulative + histfunc + histnorm all in one go!) in 4d02af7

cldougl · 2017-01-16T16:55:19Z

src/traces/histogram/attributes.js

+            'Only applies if `cumulative=true.',
+            'If *increasing* (default) we sum all prior bins, so the result',
+            'increases from left to right. If *decreasing* we sum later bins',
+            'so the fresult decreases from left to right.'


result typo

fixed in 4d02af7

etpinard · 2017-01-16T16:59:52Z

src/plot_api/plot_api.js

        'line.showscale', 'line.cauto', 'line.autocolorscale', 'line.reversescale',
        'marker.line.showscale', 'marker.line.cauto', 'marker.line.autocolorscale', 'marker.line.reversescale',
-        'xcalendar', 'ycalendar'
+        'xcalendar', 'ycalendar', 'cumulative', 'currentbin'


probably need direction in here too.

... unless it's already part of that list.

yep, look 3 lines up. That is a bit of a concern... that now all the different direction uses are coupled to each other. Though I'm assuming this whole list is not long for this world, and within some reasonable timeframe we'll delegate all of this to the trace modules.

etpinard · 2017-01-16T17:01:11Z

src/traces/histogram/attributes.js

        ].join(' ')
    },

+    cumulative: {


It would be nice to add one image mock. Maybe one that combines a currentbin: 'include' and currentbin: 'exclude' traces like in:

test image in 53b61aa

One thing this showed is that we need a way to harmonize autobins across traces, and that it needs to know about cumulative. To make this example work I needed to manually extend the bin range for the smaller trace, otherwise its CDF ended too soon. Actually, CDFs never end, really... so perhaps the even better thing to do would be to look at the axis range and draw bins out to the edge. Anyway, fixing this will be a bigger project so I'll make an issue for it rather than try to address it here.
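One way to picture "draw bins out to the edge" is to pad a finished cumulative series with flat bins up to the axis range. This is only a sketch of the idea, with a hypothetical `extendCdf` helper and equal-width bins assumed; it is not what the follow-up issue settled on.

```javascript
// Pad an increasing cumulative series with flat bins out to `rangeEnd`.
// `starts` are bin start positions; every bin is `binSize` wide.
function extendCdf(starts, values, binSize, rangeEnd) {
    var x = starts.slice();
    var y = values.slice();
    // nothing to extend, or a non-positive bin size that would never reach the end
    if(!x.length || !(binSize > 0)) return {x: x, y: y};
    var last = y[y.length - 1];
    var next = x[x.length - 1] + binSize;
    while(next < rangeEnd) {
        x.push(next);
        y.push(last);
        next += binSize;
    }
    return {x: x, y: y};
}

var padded = extendCdf([0, 1, 2], [0.2, 0.7, 1], 1, 5);
// padded.x: [0, 1, 2, 3, 4], padded.y: [0.2, 0.7, 1, 1, 1]
```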

> Anyway, fixing this will be a bigger project so I'll make an issue for it rather than try to address it here.

That's fine. Thanks for the info!

etpinard · 2017-01-18T21:10:05Z

test/jasmine/tests/histogram_test.js

+                },
+                {
+                    // behaves the same as without *density*
+                    direction: 'decreasing', currentbin: 'half', histnorm: 'density',


Looking good!

etpinard · 2017-01-18T21:16:14Z

💃 Thanks for taking this one home!

[pof] add 'cumulative' histogram 'mode' for CDF

62aeefb

etpinard added status: discussion needed feature something new labels Nov 22, 2016

etpinard commented Nov 22, 2016

alexcjohnson added 3 commits January 13, 2017 09:23

Merge branch 'master' into cdf

9c0dea0

edge case in autoShiftNumericBins

fd7526b

avoid an empty bin at the start. Tested via histogram_test

flesh out CDFs

17c04c5

alexcjohnson reviewed Jan 14, 2017

alexcjohnson added status: reviewable and removed status: discussion needed labels Jan 14, 2017

further tweak of autoShiftNumericBins

4bc28dd

and fix its tests for the improved behavior

etpinard added this to the v1.22.0 milestone Jan 16, 2017

etpinard commented Jan 16, 2017

cldougl reviewed Jan 16, 2017

etpinard commented Jan 16, 2017

alexcjohnson added 3 commits January 18, 2017 15:13

fix bug in histogram min/max aggregation with normalization

c12b7cf

nest histogram cumulative attributes and test histnorm/histfunc more

4d02af7

CDF test image

53b61aa

etpinard commented Jan 18, 2017

alexcjohnson merged commit 49106aa into master Jan 18, 2017

alexcjohnson deleted the cdf branch January 18, 2017 22:37

alexcjohnson mentioned this pull request Jan 19, 2017

auto bins and cumulative distribution histograms #1318

Closed
