Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 'cumulative' histogram 'mode' for CDF #1189

Merged
merged 8 commits into from
Jan 18, 2017
2 changes: 1 addition & 1 deletion src/plot_api/plot_api.js
Original file line number Diff line number Diff line change
Expand Up @@ -1296,7 +1296,7 @@ function _restyle(gd, aobj, _traces) {
'tilt', 'tiltaxis', 'depth', 'direction', 'rotation', 'pull',
'line.showscale', 'line.cauto', 'line.autocolorscale', 'line.reversescale',
'marker.line.showscale', 'marker.line.cauto', 'marker.line.autocolorscale', 'marker.line.reversescale',
'xcalendar', 'ycalendar'
'xcalendar', 'ycalendar', 'cumulative', 'currentbin'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably need direction in here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... unless it's already part of that list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, look 3 lines up. That is a bit of a concern... that now all the different direction uses are coupled to each other. Though I'm assuming this whole list is not long for this world, and within some reasonable timeframe we'll delegate all of this to the trace modules.

];

for(i = 0; i < traces.length; i++) {
Expand Down
40 changes: 35 additions & 5 deletions src/traces/histogram/attributes.js
Original file line number Diff line number Diff line change
Expand Up @@ -71,14 +71,44 @@ module.exports = {
].join(' ')
},

mode: {
cumulative: {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add one image mock. Maybe one that combines a currentbin: 'include' and currentbin: 'exclude' traces like in:

image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test image in 53b61aa

One thing this showed is that we need a way to harmonize autobins across traces, and that it needs to know about cumulative. To make this example work I needed to manually extend the bin range for the smaller trace, otherwise its CDF ended too soon. Actually, CDFs never end, really... so perhaps the even better thing to do would be to look at the axis range and draw bins out to the edge. Anyway, fixing this will be a bigger project so I'll make an issue for it rather than try to address it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, fixing this will be a bigger project so I'll make an issue for it rather than try to address it here.

That's fine. Thanks for the info!

valType: 'boolean',
dflt: false,
role: 'info',
description: [
'If true, display the cumulative distribution by summing the',
'binned values. Use the `direction` and `centralbin` attributes',
'to tune the accumulation method.'
].join(' ')
},

direction: {
valType: 'enumerated',
values: ['density', 'cumulative'],
dflt: 'density',
values: ['increasing', 'decreasing'],
dflt: 'increasing',
role: 'info',
description: [
''
].join('')
'Only applies if `cumulative=true.',
'If *increasing* (default) we sum all prior bins, so the result',
'increases from left to right. If *decreasing* we sum later bins',
'so the fresult decreases from left to right.'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result typo

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 4d02af7

].join(' ')
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

direction='decreasing' is to invert the accumulation - ie if you want "how much of the distribution is past this point" instead of "how much is before this point" (although note that it's not exactly the sum minus the increasing CDF unless you choose currentbin='exclude' for one, or currentbin='half' for both (see below).

Thoughts on the name direction? I'm not super excited about it but it seems OK.

Copy link
Contributor Author

@etpinard etpinard Jan 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with direction here. But we should keep in mind that direction is now present in several attribute containers. Currently:

  • In pie traces, direction: 'clockwise' || 'counterclockwise'
  • In updatemenus, direction: 'left' || 'right' || 'up' || 'down'
  • In the animtion config, direction: 'forward' || 'reverse'
  • In base layout for polar plots: direction: 'clockwise' || 'counterclockwise'

I suppose it would be nice to make enumerated attributes of the same name share the same posibile values when used in different containers. Or maybe that's too much to ask for?


currentbin: {
valType: 'enumerated',
values: ['include', 'exclude', 'half'],
dflt: 'include',
role: 'info',
description: [
'Only applies if `cumulative=true.',
'Sets whether the current bin is included, excluded, or has half',
'of its value included in the current cumulative value.',
'*include* is the default for compatibility with various other',
'tools, however it introduces a half-bin bias to the results.',
'*exclude* makes the opposite half-bin bias, and *half* removes',
'it.'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, maybe nobody will use this option, but I put it in to satisfy my own frustration with the visual flaw in the common practice (#1189 (comment)). Is it clear enough what the options mean?

And not to beat a dead horse, but as well as fixing the position bias I think 'half' also does a better job of representing the width of the distribution. To take the extreme case, lets say you have a histogram with all the samples in a single bin. Although they could be all at exactly the same value, generally they aren't. But the standard way to display this would have the CDF going from zero to max instantaneously, as a step function, which implies no width at all to the distribution. 'half' on the other hand would rise in 2 steps - which visually implies a width of 1 bin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting that option in!

The name currentbin bothers me a little bit because it doesn't sound associated with cumulative.

This has me thinking: maybe we should group all cumulative attributes into a nested object.

cumulative: {
  enabled: true || false,
  direction: 'increasing' || 'decreasing',
  currentbin: 'include' || 'exclude' || 'half'
}

By @chriddyp's #1189 (comment) where we might extend cumulative: true to other trace types down the road, adding a cumulative nested object to scatter would be less intrusive I think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has me thinking: maybe we should group all cumulative attributes into a nested object.

My only concern about this is that 90% of users won't use anything but enabled, and cumulative: true is easier than cumulative: {enabled: true}. But it would disambiguate direction too... so maybe it's worthwhile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if

{
  type: 'histogram',
  x: [/* */],
  cumulative: true
}

expanded to

{
   type: 'histogram',
   x: [/* */],
   cumulative: {
     enabled: true,
     direction: 'increasing',
     currentbin: 'include'
  }
}

in _fullData?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about letting cumulative: true expand to cumulative: {enabled: true} internally, but I don't think it's a good idea - it would make it very confusing for folks to switch to the full form if they start with the simple one. I think I'll just change it to the nested structure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nested in 4d02af7

].join(' ')
},

autobinx: {
Expand Down
84 changes: 74 additions & 10 deletions src/traces/histogram/calc.js
Original file line number Diff line number Diff line change
Expand Up @@ -41,15 +41,29 @@ module.exports = function calc(gd, trace) {
var pos0 = pa.makeCalcdata(trace, maindata);

// calculate the bins
if((trace['autobin' + maindata] !== false) || !(maindata + 'bins' in trace)) {
trace[maindata + 'bins'] = Axes.autoBin(pos0, pa, trace['nbins' + maindata], false, calendar);
var binAttr = maindata + 'bins',
binspec;
if((trace['autobin' + maindata] !== false) || !(binAttr in trace)) {
binspec = Axes.autoBin(pos0, pa, trace['nbins' + maindata], false, calendar);

// adjust for CDF edge cases
if(trace.cumulative && (trace.currentbin !== 'include')) {
if(trace.direction === 'decreasing') {
binspec.start = pa.c2r(pa.r2c(binspec.start) - binspec.size);
}
else {
binspec.end = pa.c2r(pa.r2c(binspec.end) + binspec.size);
}
}

// copy bin info back to the source data.
trace._input[maindata + 'bins'] = trace[maindata + 'bins'];
// copy bin info back to the source and full data.
trace._input[binAttr] = trace[binAttr] = binspec;
}
else {
binspec = trace[binAttr];
}

var binspec = trace[maindata + 'bins'],
nonuniformBins = typeof binspec.size === 'string',
var nonuniformBins = typeof binspec.size === 'string',
bins = nonuniformBins ? [] : binspec,
// make the empty bin array
i2,
Expand Down Expand Up @@ -115,7 +129,9 @@ module.exports = function calc(gd, trace) {
// average and/or normalize the data, if needed
if(doavg) total = doAvg(size, counts);
if(normfunc) normfunc(size, total, inc);
if(trace.mode === 'cumulative') cdf(size);

// after all normalization etc, now we can accumulate if desired
if(trace.cumulative) cdf(size, trace.direction, trace.currentbin);


var serieslen = Math.min(pos.length, size.length),
Expand Down Expand Up @@ -146,8 +162,56 @@ module.exports = function calc(gd, trace) {
return cd;
};

function cdf(size) {
for(var i = 1; i < size.length; i++) {
size[i] += size[i - 1];
function cdf(size, direction, currentbin) {
var i,
vi,
prevSum;

function firstHalfPoint(i) {
prevSum = size[i];
size[i] /= 2;
}

function nextHalfPoint(i) {
vi = size[i];
size[i] = prevSum + vi / 2;
prevSum += vi;
}

if(currentbin === 'half') {

if(direction === 'increasing') {
firstHalfPoint(0);
for(i = 1; i < size.length; i++) {
nextHalfPoint(i);
}
}
else {
firstHalfPoint(size.length - 1);
for(i = size.length - 2; i >= 0; i--) {
nextHalfPoint(i);
}
}
}
else if(direction === 'increasing') {
for(i = 1; i < size.length; i++) {
size[i] += size[i - 1];
}

// 'exclude' is identical to 'include' just shifted one bin over
if(currentbin === 'exclude') {
size.unshift(0);
size.pop();
}
}
else {
for(i = size.length - 2; i >= 0; i--) {
size[i] += size[i + 1];
}

if(currentbin === 'exclude') {
size.push(0);
size.shift();
}
}
}
7 changes: 6 additions & 1 deletion src/traces/histogram/defaults.js
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,12 @@ module.exports = function supplyDefaults(traceIn, traceOut, defaultColor, layout
var x = coerce('x'),
y = coerce('y');

coerce('mode');
var cumulative = coerce('cumulative');
if(cumulative) {
coerce('direction');
coerce('currentbin');
}

coerce('text');

var orientation = coerce('orientation', (y && !x) ? 'h' : 'v'),
Expand Down
71 changes: 71 additions & 0 deletions test/jasmine/tests/histogram_test.js
Original file line number Diff line number Diff line change
Expand Up @@ -246,5 +246,76 @@ describe('Test histogram', function() {
]);
});

describe('cumulative distribution functions', function() {
var base = {x: [1, 2, 3, 4, 2, 3, 4, 3, 4, 4]};

it('makes the right base histogram', function() {
var baseOut = _calc(base);
expect(baseOut).toEqual([
{b: 0, p: 1, s: 1},
{b: 0, p: 2, s: 2},
{b: 0, p: 3, s: 3},
{b: 0, p: 4, s: 4},
]);
});

var CDFs = [
{p: [1, 2, 3, 4], s: [1, 3, 6, 10]},
{
direction: 'decreasing',
p: [1, 2, 3, 4], s: [10, 9, 7, 4]
},
{
currentbin: 'exclude',
p: [2, 3, 4, 5], s: [1, 3, 6, 10]
},
{
direction: 'decreasing', currentbin: 'exclude',
p: [0, 1, 2, 3], s: [10, 9, 7, 4]
},
{
currentbin: 'half',
p: [1, 2, 3, 4, 5], s: [0.5, 2, 4.5, 8, 10]
},
{
direction: 'decreasing', currentbin: 'half',
p: [0, 1, 2, 3, 4], s: [10, 9.5, 8, 5.5, 2]
},
{
direction: 'decreasing', currentbin: 'half', histnorm: 'percent',
p: [0, 1, 2, 3, 4], s: [100, 95, 80, 55, 20]
},
{
currentbin: 'exclude', histnorm: 'probability',
p: [2, 3, 4, 5], s: [0.1, 0.3, 0.6, 1]
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to test cumulative: true with other histfunc and histnorm settings.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another good call by @etpinard 🌮 (and see also #1189 (comment))

What should we do with cumulative enabled and histnorm='density' or 'probability density'? As the code stands, CDFs using 'density' would rise to N/binSize (# samples / width of each bin) and 'probability density' would rise to 1/binSize. That seems useless and confusing, so I'd propose to interpret "cumulative" to mean an integral in these cases, ie 'density' would rise to N and 'probability density' would rise to 1, which then means in CDF mode these are equivalent to histnorm='' and 'probability' respectively.

I don't think there's anything special to do based on histfunc - some of these would also give strange results, but then the user is clearly asking for something strange.

Thoughts on any of this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's anything special to do based on histfunc - some of these would also give strange results, but then the user is clearly asking for something strange.

I can see this being used in time series CDFs. Think payments over time: bin by date and then cumulatively sum by payment amount

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd propose to interpret "cumulative" to mean an integral in these cases, ie 'density' would rise to N and 'probability density' would rise to 1

I agree 100% here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chriddyp

I can see this being used in time series CDFs. Think payments over time: bin by date and then cumulatively sum by payment amount

Absolutely - and that will work just fine without modification (tests to come). I was just saying I don't think there's anything that needs altering based on histfunc, like what I'm planning to do for histnorm.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha! turns out we didn't have any tests with histfunc and histnorm together, and max/min were broken. Fixed in c12b7cf and tests for all of this (cumulative + histfunc + histnorm all in one go!) in 4d02af7

];

CDFs.forEach(function(CDF) {
var direction = CDF.direction,
currentbin = CDF.currentbin,
histnorm = CDF.histnorm,
p = CDF.p,
s = CDF.s;

it('handles direction=' + direction + ', currentbin=' + currentbin + ', histnorm=' + histnorm, function() {
var traceIn = Lib.extendFlat({}, base, {
cumulative: true,
direction: direction,
currentbin: currentbin,
histnorm: histnorm
});
var out = _calc(traceIn);

expect(out.length).toBe(p.length);
out.forEach(function(outi, i) {
expect(outi.p).toBe(p[i]);
expect(outi.s).toBeCloseTo(s[i], 6);
expect(outi.b).toBe(0);
});
});
});
});

});
});