RFC: handling of nil/NA columns in $rollups. #181

michaelochurch · 2013-08-27T17:56:25Z

Right now, if you have nils in a dataset and call a $rollup function on it, you get a NullPointerException, because the rollup functions pass the nils to the arithmetic as ordinary values rather than treating them as missing data (which is what I'd prefer).

Generally, when you do an aggregation (sum, min, max, avg) the missing values are excluded. So, for example, with this dataset and $rollup call:

user=> ds

| :color | :weight |
|--------+---------|
|   :red |      15 |
|   :red |      22 |
| :green |      17 |
| :green |         |  ;; that's a missing value / nil

user=> ($rollup :mean :weight :color ds)

NullPointerException   clojure.lang.Numbers.ops (Numbers.java:942)

you'd typically expect a mean weight for :color :green of 17 to be reported; the nil is just dropped.
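For illustration, here is a minimal sketch of the nil-dropping behaviour described above, written in plain Clojure over row maps rather than against Incanter's actual $rollup. The names `mean-ignoring-nil` and `rollup-mean` are hypothetical and exist only for this example.

```clojure
;; Sketch only: plain Clojure, not Incanter's $rollup implementation.
(defn mean-ignoring-nil
  "Mean of the non-nil values in xs, or nil when every value is missing."
  [xs]
  (let [present (remove nil? xs)]
    (when (seq present)
      (/ (reduce + present) (count present)))))

;; The dataset from the example above, as a vector of row maps.
(def rows
  [{:color :red   :weight 15}
   {:color :red   :weight 22}
   {:color :green :weight 17}
   {:color :green :weight nil}])

(defn rollup-mean
  "Groups rows by group-key and takes the nil-skipping mean of value-key."
  [value-key group-key rows]
  (into {}
        (for [[group group-rows] (group-by group-key rows)]
          [group (mean-ignoring-nil (map value-key group-rows))])))

(rollup-mean :weight :color rows)
;; => {:red 37/2, :green 17}
```

Returning nil for a group whose values are all missing is one possible choice; the thread does not settle what that case should produce.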

Is this issue (missing data handling) a priority for 1.5.x or a 2.0.0 issue? There are a lot of different architectural directions we could take on this, ranging from the less intrusive (e.g. just change $rollup) to much larger rewrites that make missing data a first-class concept throughout the library.

I do think that, as ugly as it can be, missing data/NA is something we're going to have to deal with a lot in Dataset.

tutysara · 2013-08-28T04:04:14Z

Yes, it would be convenient to have missing data dropped from the calculation.
I ran into this issue too. For all my use cases dropping nil values was fine, so I just filtered out the rows that had nil in that column.
I guess patching $rollup is not that hard. What are the architectural rewrites that would require a big change? Would we have to change many related functions?
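The filtering workaround mentioned above might look roughly like this. It is a sketch over plain row maps, not Incanter's dataset API, and `drop-missing` is a hypothetical helper name.

```clojure
;; Sketch only: rows as plain maps stand in for a dataset.
(def rows
  [{:color :red   :weight 15}
   {:color :green :weight 17}
   {:color :green :weight nil}])

(defn drop-missing
  "Removes every row whose value under col is nil."
  [col rows]
  (remove #(nil? (get % col)) rows))

(drop-missing :weight rows)
;; => ({:color :red, :weight 15} {:color :green, :weight 17})
```

Note that this drops the whole row, so any values the row holds in other columns are lost to later aggregations as well.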

michaelochurch · 2013-08-28T13:52:05Z

Adapting $rollup to handle nils isn't that hard.

The architectural question is how we want to handle NA/missing data in general. Do we treat it as nil? (That would be most idiomatic.) Or something else like the keyword :NA, or a dedicated object (def NA (Object.))? I would vote for using nil-- it's most idiomatic, and it will have edn support (which a dedicated object won't).
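To make the edn point concrete, the sketch below checks whether each candidate representation survives being printed and read back with `clojure.edn`. The helper `edn-round-trips?` is hypothetical. Note that the keyword :NA also round-trips, so this check only separates nil and :NA from the dedicated object.

```clojure
(require '[clojure.edn :as edn])

(defn edn-round-trips?
  "True when x survives printing and re-reading as edn unchanged."
  [x]
  (try
    (= x (edn/read-string (pr-str x)))
    ;; An unreadable printed form makes the edn reader throw.
    (catch Exception _ false)))

;; The dedicated-object candidate from the comment above.
(def NA (Object.))

(edn-round-trips? [15 nil 17])  ;; => true
(edn-round-trips? [15 :NA 17])  ;; => true
(edn-round-trips? [15 NA 17])   ;; => false
```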

Moreover, when we see an empty string in a CSV file, do we automatically interpret that as an NA/nil instead of the empty string? (I would vote yes, with a :fill-empty option on loading that allows the user to interpret empty fields differently-- as empty strings or as zeros, as needed.)
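As a rough sketch of how that option could behave: `:fill-empty` here is only the option proposed in this comment, and `parse-field` and `parse-row` are hypothetical helpers, not part of Incanter's CSV loader.

```clojure
(defn parse-field
  "Hypothetical helper: maps an empty CSV field to nil unless the caller
  passes a :fill-empty replacement value."
  [field & {:keys [fill-empty] :or {fill-empty nil}}]
  (if (= "" field) fill-empty field))

(defn parse-row
  "Applies parse-field to every field in a row, forwarding any options."
  [fields & opts]
  (mapv #(apply parse-field % opts) fields))

(parse-row ["green" ""])                 ;; => ["green" nil]
(parse-row ["green" ""] :fill-empty "")  ;; => ["green" ""]
(parse-row ["green" ""] :fill-empty 0)   ;; => ["green" 0]
```

Defaulting to nil keeps the loader consistent with treating nil as the missing-data marker, while the option leaves the other interpretations available.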
