RFC: handling of nil/NA columns in $rollups. #181

Open
michaelochurch opened this issue Aug 27, 2013 · 2 comments

Comments

@michaelochurch
Contributor

Right now, if a dataset contains nils and you call $rollup on it, you get a NullPointerException, because the rollup functions pass the nils straight through to the arithmetic rather than treating them as missing data (which is what I'd prefer).

Generally, when you do an aggregation (sum, min, max, avg) the missing values are excluded. So, for example, with this dataset and $rollup call:

user=> ds

| :color | :weight |
|--------+---------|
|   :red |      15 |
|   :red |      22 |
| :green |      17 |
| :green |         |  ;; that's a missing value / nil

user=> ($rollup :mean :weight :color ds)

NullPointerException   clojure.lang.Numbers.ops (Numbers.java:942)

you'd typically expect a mean weight for :color :green of 17 to be reported; the nil is just dropped.
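To make the expected behaviour concrete, here is a minimal plain-Clojure sketch of a nil-dropping grouped mean. It does not use Incanter at all: `rows`, `mean-ignoring-nils` and `rollup-mean` are hypothetical names for illustration, and the vector of maps only stands in for the dataset above.

```clojure
;; Plain-Clojure stand-in for the dataset shown above (not an Incanter dataset).
(def rows
  [{:color :red   :weight 15}
   {:color :red   :weight 22}
   {:color :green :weight 17}
   {:color :green :weight nil}])  ; the missing value

(defn mean-ignoring-nils
  "Mean of the non-nil values in xs, or nil when no values remain."
  [xs]
  (let [present (remove nil? xs)]
    (when (seq present)
      (/ (reduce + present) (count present)))))

(defn rollup-mean
  "Hypothetical helper: mean of value-key for each distinct group-key."
  [value-key group-key rows]
  (into {}
        (for [[group members] (group-by group-key rows)]
          [group (mean-ignoring-nils (map value-key members))])))

(rollup-mean :weight :color rows)
;; => {:red 37/2, :green 17}
```

A group whose values are all missing comes back as nil here rather than throwing; whether that is the right result for an empty group is one of the open questions.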

Is missing-data handling a priority for 1.5.x, or is it a 2.0.0 issue? There are several architectural directions we could take on this, ranging from the least intrusive (e.g. just changing $rollup) to a much larger rewrite that makes missing data a first-class concept throughout.

I do think that, as ugly as it can be, missing data/NA is something we're going to have to deal with a lot in Dataset.

@tutysara

Yes, it would be convenient to have the missing data dropped in the calculation.
I faced this issue too. For all my use cases dropping nil values was OK, so I just filtered out the rows that had nil in that column.
I guess patching $rollup is not that hard. What are the architectural rewrites that would require a big change? Would we have to change many related functions?
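The workaround described above (filter out the rows with nil in the column before aggregating) might look roughly like this in plain Clojure. `drop-rows-missing` is a hypothetical helper, and the rows are ordinary maps rather than an Incanter dataset.

```clojure
(def rows
  [{:color :red   :weight 15}
   {:color :green :weight 17}
   {:color :green :weight nil}])

(defn drop-rows-missing
  "Hypothetical helper: keep only the rows whose value for col is non-nil."
  [col rows]
  (remove #(nil? (get % col)) rows))

(map :weight (drop-rows-missing :weight rows))
;; => (15 17)
```

Note that this drops the whole row, so any values the row holds in other columns are excluded from later aggregations too.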

@michaelochurch
Contributor Author

Adapting $rollup to handle nils isn't that hard.

The architectural question is how we want to handle NA/missing data in general. Do we treat it as nil? (That would be most idiomatic.) Or something else like the keyword :NA, or a dedicated object (def NA (Object.))? I would vote for using nil-- it's most idiomatic, and it will have edn support (which a dedicated object won't).
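The edn point can be checked with a small round-trip test. This is only a sketch: `round-trips?` is a hypothetical helper, and it uses nothing beyond `pr-str` and `clojure.edn/read-string`.

```clojure
(require '[clojure.edn :as edn])

(def NA (Object.))  ; the dedicated-object option

(defn round-trips?
  "True when x prints and reads back through edn as an equal value.
  Any reader failure counts as false."
  [x]
  (try
    (= x (edn/read-string (pr-str x)))
    (catch Exception _ false)))

(round-trips? nil)  ;; => true
(round-trips? :NA)  ;; => true
(round-trips? NA)   ;; => false, the printed object is not readable edn
```

So nil and the keyword both survive serialization, while the dedicated object does not.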

Moreover, when we see an empty string in a CSV file, do we automatically interpret that as an NA/nil instead of the empty string? (I would vote yes, with a :fill-empty option on loading that allows the user to interpret empty fields differently-- as empty strings or as zeros, as needed.)
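A rough sketch of how such a :fill-empty option could behave is below. Everything here is an assumption for illustration: `parse-field` and `parse-line` are hypothetical, they are not part of Incanter's CSV reader, and the line splitting ignores quoting and escaped commas.

```clojure
(require '[clojure.string :as str])

(defn parse-field
  "Hypothetical field reader: an empty field becomes fill-empty,
  anything else is returned as the raw string."
  [field fill-empty]
  (if (= "" field) fill-empty field))

(defn parse-line
  "Split one simple CSV line. With no :fill-empty option, empty
  fields become nil."
  [line & {:keys [fill-empty]}]
  ;; The -1 limit keeps trailing empty fields instead of dropping them.
  (mapv #(parse-field % fill-empty) (str/split line #"," -1)))

(parse-line "green,")                 ;; => ["green" nil]
(parse-line "green," :fill-empty "")  ;; => ["green" ""]
(parse-line "green," :fill-empty 0)   ;; => ["green" 0]
```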

2 participants