RFC: Dense arrays with missing data #54

Closed
nfoti opened this issue Aug 25, 2012 · 14 comments

@nfoti
Contributor

nfoti commented Aug 25, 2012

One aspect of missing data that JuliaData does not support is dense arrays with missing data. Extending DataVec and PooledDataVec to a DataArray type that can handle missing data for an array of arbitrary dimension seems like a useful addition to the package. The semantics of a DataArray should be the same as normal Arrays with the addition that functions that operate on them should have the option of excluding missing data. Additionally, slicing a DataArray would return a DataArray with the proper number of dimensions. In principle a 1d DataArray would be a DataVec, however, there may be compelling reasons to keep the DataVec implementation separate. The proposed implementation of DataArray will be such that a 1d DataArray will behave exactly as a current DataVec. Having a special type DataMatrix for 2d data also would be useful. The nafilter/naFilter and nareplace/naReplace would return flattened versions of the objects. I'm sure there are behaviors I have not specified or have not been clear about, so any thoughts on the design and implementation of DataArray are appreciated.

One other special case that deserves attention is float arrays with missing data. In this case I think it is worth implementing something similar to the approach in issue #22 for arbitrary arrays of floats. That is using NaN to indicate missing data in arrays of floats. In this special case NaN has the correct semantics for missing data and does not require a separate mask. It is also straightforward to implement this behavior. Again, any thoughts are appreciated. This type could then be sub-typed to allow named rows and columns via the Index type already in JuliaData. However, just the added functionality for Float arrays would be very useful for machine learning and statistics algorithms that operate on float arrays.

I am planning on implementing these ideas as time permits, but help is welcome if anyone wants to run with the ideas.

@tshort
Contributor

tshort commented Aug 26, 2012

I like this idea. Making DataVec more general will help make it more useful. I strongly suggest making DataArray inherit from AbstractArray and DataVec inherit from AbstractVector (issue #23). This will go a long way towards making DataVecs and DataArrays useable. I would not keep DataVec separate but make it a DataArray{T,1} as you suggest. Another option to consider is making the mask be a bitarray (issue #3).

Also, if you use bitstypes capable of supporting NA's (issue #45), you have arrays with NA support now.

As to float arrays with missing data, I like the idea. Harlan and Stefan are resistant as you can see from the commentary on issue #45. For floating point values, NaN's don't completely have correct semantics for missing data. The one area of difference is comparisons. Ordered and equality comparisons involving NaN return false (only != returns true) rather than an NA, because booleans don't have a concept of NA or NaN. NaN's are useable as NA's given this. Pandas (python) uses NaN's as NA's despite this "feature". You just have to be aware and check for NA conditions in comparisons (which you normally have to do anyway). Having a bitstype for floats that supports NA's (as an NaN) gets around this by building the check for NA conditions into comparisons.
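
To make the comparison caveat concrete, here is a minimal sketch in current Julia syntax (not code from the thread). It only uses `isnan` and the built-in comparison operators; the variable names are illustrative.

```julia
x = NaN

# Ordered and equality comparisons against NaN are all false...
lt = x < 1.0     # false
gt = x > 1.0     # false
eq = x == x      # false
# ...and != is the one comparison that comes back true.
ne = x != x      # true

# So a comparison that should treat NaN as "missing" needs an explicit check.
vals = [1.0, NaN, 3.0]
below_two = [!isnan(v) && v < 2.0 for v in vals]   # [true, false, false]
```

Note that a plain `v < 2.0` silently treats the missing entry as "not below two", which is exactly the case a reader has to guard against.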

@HarlanH
Contributor

HarlanH commented Aug 26, 2012

In general, I'm definitely in support of a data type for floating-point matrices (and/or higher-dimensional Float arrays) with NA implemented by NaN payload, presumably by a bits type with appropriate conversions, as Tom suggests. I don't see any reason why that type can't inherit from AbstractArray/Vector. Thanks for your efforts here!

But DataFrames are semantically different from a DataMatrix/DataArray, and I strongly feel that there should be a single globally-useful implementation of NAs for the DataFrame type, and trying to push the round NaN peg into that square hole is not going to end up being easy for users (or package developers) to work with.

For now, let's keep the code in this issue separate from the existing DataVec/DataFrame types. Definitely re-use Indexes and other ideas as you can, but let's treat this as a separate "JuliaData" type for working with separate types of data.

I sure wish I had more time to work on JuliaData! Maybe in a week or two...

@nfoti
Contributor Author

nfoti commented Aug 26, 2012

Thanks for the comments.

Making DataArray inherit from AbstractArray does seem a natural thing to do. I plan on keeping DataVec separate from DataArray right now; I was just suggesting that if a 1d DataArray is equivalent (both in syntax and semantics) to a current DataVec when everything is implemented, some duplicate code could be removed. I agree that a bitarray is the most compact type for a mask. I am still working out the best way to return a subscripted version (I'm assuming bitarrays cannot be multidimensional).

As for float arrays I don't think the NaN comparison issue is really an issue as long as it's documented that you should only compare non-NaN elements.

I have implemented a few functions for float arrays that allow skipping NaNs via a Bool argument. I have run into problems implementing var because there are so many versions with Bool flags already. Without keyword arguments to functions it is very difficult to add a new flag. I am now seeing why Matlab named their functions nanvar, etc. What are your opinions on the naming? I'm tempted to go with nanmean, nanvar, nansum, etc. rather than adding flags for consistency of the interface. If there are ever named function arguments then we could consider a "skipna" argument.

Thanks.

@tshort
Contributor

tshort commented Aug 27, 2012

In a quick glance, it looks like you can have multidimensional BitArrays.

As far as NA's or NaN's and functions that work on them, I don't really like nanmean, nanvar, etc. I'd rather see something like:

mean(naFilter(x)) or mean(naReplace(x, -1))

In this case, naFilter doesn't actually filter; it just sets up a type indicating that mean should skip NA's. Then, you just need to define the method to work with that type. For DataVec's, that's done by setting the filter flag. Then, mean(dv::DataVec) will skip over NA's. For arrays with NA's as NaN's, you can set up a type that basically just holds the data (for naReplace, it'd also need to hold the replace value). For examples of this, see issue #40 and:

https://github.com/tshort/JuliaData/blob/floatNA/src/alternate_NA.jl

Some of that is commented out, but at least one of the functions worked at one time.
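
As a rough illustration of the wrapper idea described above, here is a sketch in current Julia syntax. It is not the code from the linked file: the names `NaNFilter`, `NaNReplace`, `nanfilter`, `nanreplace` and the field names are hypothetical, chosen only for this example.

```julia
# Hypothetical wrapper types: each just holds a reference to the data (no copy).
struct NaNFilter{T<:AbstractFloat, N}
    x::Array{T, N}
end

struct NaNReplace{T<:AbstractFloat, N}
    x::Array{T, N}
    replaceval::T      # value substituted for each NaN
end

# Hypothetical constructors mirroring the naFilter/naReplace spelling.
nanfilter(x::Array{<:AbstractFloat}) = NaNFilter(x)
nanreplace(x::Array{T}, v) where {T<:AbstractFloat} = NaNReplace(x, convert(T, v))

# sum over the wrapped data, skipping NaN entries.
function Base.sum(f::NaNFilter{T}) where {T<:AbstractFloat}
    total = zero(T)
    for v in f.x
        if !isnan(v)
            total += v
        end
    end
    return total
end

# sum over the wrapped data, substituting the stored value for NaN entries.
function Base.sum(r::NaNReplace{T}) where {T<:AbstractFloat}
    total = zero(T)
    for v in r.x
        total += isnan(v) ? r.replaceval : v
    end
    return total
end

x = [1.0 NaN; 3.0 4.0]
s_filter  = sum(nanfilter(x))        # 1 + 3 + 4 = 8.0
s_replace = sum(nanreplace(x, -1))   # 1 - 1 + 3 + 4 = 7.0
```

Because the methods are attached to the wrapper types, the behaviour of `sum` on a plain `Array` is left untouched.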

@nfoti
Contributor Author

nfoti commented Aug 27, 2012

Multidimensional BitArrays should make DataArray straight-forward to implement (famous last words).

I'm not a fan of the nanfun family of functions and the mean(naFilter(x)) syntax is not my favorite either. I'm partial to something like mean(x, dim, skipna), however, we run into problems with var(x, dim, skipna) as there is already a version of var that takes an AbstractArray, an Int and a Bool. However, I do think that the "functional" syntax mean(naFilter(x)) is useful and should be implemented if possible. One problem I see with it is computing a function (say the mean) of the rows skipping NaNs. I think the naFilter etc. should return a flattened iterator which makes computing functions along a dimension difficult. I guess computing the mean along the columns could be implemented as

means = zeros(size(x, 1))   # one entry per row
for i = 1:size(x, 1)
  means[i] = mean(naFilter(x[i, :]))   # arrays are indexed with brackets
end
but this is ugly. Also, it seems like implementing naFilter/naReplace for Float arrays requires a new type as those functions set up a new DataVec that references the same data with different flags and then the DataVec iterator functionality is used. I would like to avoid introducing a new type and messing with the functionality of Arrays.

We could use the options module to pass in a skipna option to allow syntax like mean(x, 2, skipna) (and switch to a keyword argument if they are ever implemented) as opposed to nanfun type functions. I will try this out at some point soon. If there are any ideas for making naFilter(x) work for a Float array or the mean(naFilter(x), 2) syntax work I'd love to hear them.

I will push some code on my fork of JuliaData later today so you can see the current state of things.

@johnmyleswhite
Contributor

I don't see why your concern about var() is a problem: doesn't Julia's multiple dispatch system always select a method based on the maximally specific form when varying levels of generality exist? If DataArray <: AbstractArray, then var(a::DataArray, i::Int, b::Bool) will take precedence over var(a::AbstractArray, i::Int, b::Bool). Am I missing something?
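
The dispatch rule in question can be checked with a small sketch in current Julia syntax. `MaskedArray` is a hypothetical stand-in for the proposed DataArray (data plus a multidimensional BitArray mask), and `describe` is an illustrative function name, not part of any package.

```julia
# Hypothetical stand-in for DataArray: dense data plus a same-shaped NA mask.
struct MaskedArray{T, N} <: AbstractArray{T, N}
    data::Array{T, N}
    na::BitArray{N}
end

# Minimal AbstractArray interface so the type behaves like an array.
Base.size(a::MaskedArray) = size(a.data)
Base.getindex(a::MaskedArray, i::Int...) = a.data[i...]

# Same signature apart from the first argument's type.
describe(a::AbstractArray, dim::Int, flag::Bool) = "generic AbstractArray method"
describe(a::MaskedArray, dim::Int, flag::Bool) = "MaskedArray-specific method"

plain  = [1.0 2.0; 3.0 4.0]
masked = MaskedArray(plain, falses(2, 2))

r_plain  = describe(plain, 1, true)    # falls back to the generic method
r_masked = describe(masked, 1, true)   # the more specific method is selected
```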

@tshort
Contributor

tshort commented Aug 27, 2012

Here is the code for sum that uses regular arrays from the link I provided above. mean would be similar. The function works directly on the array (that's what the first line does; A.x is the array). The NAFilter type is really just an indicator and doesn't add overhead (no data copies involved). I would not have naFilter return a flattened iterator; you might want to do rowSums or something on a matrix, and that wouldn't work if it was flattened.

function sum(A::NAFilter)
    A = A.x
    v = 0.0
    for x in A
        if !isna(x)
            v += x
        end
    end
    v
end

For DataVecs, you could have a DataVec-specific method that could handle both the replace and the filter flags.
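
To show why not flattening matters, here is a sketch (current Julia syntax) of a NaN-skipping sum along one dimension of a matrix. The function name `nansum_dim` is hypothetical, and the sketch only handles 2d input.

```julia
# Sum along dimension `dim` of a matrix, skipping NaN entries.
# Returns a 1×n matrix for dim == 1 and an m×1 matrix for dim == 2.
function nansum_dim(x::AbstractMatrix{T}, dim::Int) where {T<:AbstractFloat}
    dim in (1, 2) || throw(ArgumentError("dim must be 1 or 2"))
    out = dim == 1 ? zeros(T, 1, size(x, 2)) : zeros(T, size(x, 1), 1)
    for j in axes(x, 2), i in axes(x, 1)
        v = x[i, j]
        if !isnan(v)
            if dim == 1
                out[1, j] += v   # accumulate down each column
            else
                out[i, 1] += v   # accumulate across each row
            end
        end
    end
    return out
end

x = [1.0 NaN; 3.0 4.0]
colsums = nansum_dim(x, 1)   # 1×2 matrix: 4.0 4.0
rowsums = nansum_dim(x, 2)   # 2×1 matrix: 1.0, 7.0
```

A flattened iterator would lose the row and column positions that this loop relies on.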

@nfoti
Contributor Author

nfoti commented Aug 27, 2012

Regarding @johnmyleswhite's comment, you're right, this is not a problem for DataArray. It is a problem for the special case of handling NaNs in Array{Float} types (orthogonal to DataArrays). I think the solution right now is that there won't be short versions of var for skipping NaNs. The syntax will be something like `var(X, corrected, dim, skipna)`.

@tshort, I'm going to play around with your idea of sum(A::NAFilter).

Thanks.

@HarlanH
Contributor

HarlanH commented Aug 27, 2012

I agree with Tom here. Although the naFilter/naReplace operations need work, especially with DataFrames, they're very light-weight performance-wise. And I do quite specifically like the functional syntax. Also, if you haven't seen the Options module in extras/, take a look -- it would be a reasonable way to deal with named arguments until such time as the core language supports them.


@nfoti
Contributor Author

nfoti commented Aug 28, 2012

I have pushed my current code implementing some functions to handle NaNs as missing data for Float arrays to the float-nan branch of my fork of JuliaData. Specifically, sum, prod, max, min and mean are implemented. It's preliminary code, and I'm sure there are obvious improvements. A few different interfaces are implemented, including a boolean flag skipna, functions named nansum, etc. and the interface discussed above, e.g. nansum(naFilter(x), 2). The naFilter approach was actually quite easy to implement and I think is pretty lightweight. I'm also less opposed to it now that I see how general it is.

Feedback is appreciated.

Thanks.

@tshort
Contributor

tshort commented Aug 28, 2012

Good stuff, nfoti,

I don't have time for much of a review, and I'll be out for the next week, but here are some quick comments:

  • Of all of the interfaces you tried, I think I still like mean(naFilter(x)) the best. If we get keyword arguments, then mean(x, skipna = true) is more attractive. nanmean(x), mean(x, true), and mean(x, @options skipna = true) are not as attractive to me.
  • In nanarray.jl, you define many methods based on StridedArrays. I think those can all be AbstractArrays. That's useful if you want a sparse array or some other array flavor.
  • In nanstats.jl, you define some of the NAFilter functions in terms of isnan. That will be slower than using a loop like your versions in nanarray.jl.

@nfoti
Contributor Author

nfoti commented Aug 28, 2012

Thanks for taking a look, there's no need for a thorough review yet.

I agree that of the options that are available now the mean(naFilter(x)) syntax is the nicest. I'll clean the other interfaces out of the code and implement as many of the statistical functions as make sense for missing data. If keyword arguments are ever implemented someone should come back to this and implement the skipna version.
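
Julia did later gain keyword arguments, so for reference here is a sketch of what a `skipna` version could look like in current syntax. The function name `mean_kw` is hypothetical; only the `skipna` flag name comes from the discussion.

```julia
# Mean of a float array with an optional keyword flag to skip NaN entries.
function mean_kw(x::AbstractArray{T}; skipna::Bool=false) where {T<:AbstractFloat}
    total = zero(T)
    count = 0
    for v in x
        if skipna && isnan(v)
            continue          # treat NaN as missing and leave it out
        end
        total += v
        count += 1
    end
    return total / count      # NaN for empty input or when nothing is left
end

x = [1.0, NaN, 3.0]
m_default = mean_kw(x)                 # NaN propagates by default
m_skip    = mean_kw(x, skipna=true)    # (1.0 + 3.0) / 2 = 2.0
```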

You're right, the functions in nanarray.jl can probably be implemented with AbstractArray rather than StridedArray. I think I just followed what array.jl does.

Good point with isnan, I totally wrote off the fact that isnan(A) has to make a new array. I've been doing a lot of Matlab lately and went on isnan autopilot. The implementations in there right now are just proof-of-concept, now that we've picked an interface I can go through and implement them all as loops.

Thanks again.

Nick

@nfoti
Contributor Author

nfoti commented Aug 29, 2012

I've pushed some new code (float-nan branch) that only implements the naFilter interface. Operations like sum(naFilter(x)) are only slightly slower than sum(x) (with no missing data) and about 4x faster than sum(x[!isnan(x)]). However, the versions that work on a dimension (or a Dimspec), e.g. mean(naFilter(x), 2), are about 15x slower than sum(x, 2). I'm assuming this is because the nanplus, etc. functions that I use for the reductions implementing those operations have a lot of overhead.
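
The two approaches being compared can be sketched as follows in current Julia syntax, where the thread's `x[!isnan(x)]` is written with broadcasting dots as `x[.!isnan.(x)]`. The function names are hypothetical, and no timings are claimed here; the sketch only shows that the loop avoids building the intermediate mask and filtered copy.

```julia
# Single pass over the data; no temporary arrays are created.
function nansum_loop(x::AbstractArray{T}) where {T<:AbstractFloat}
    total = zero(T)
    for v in x
        isnan(v) || (total += v)
    end
    return total
end

# Builds a boolean mask and a filtered copy before summing.
nansum_mask(x) = sum(x[.!isnan.(x)])

x = [1.0, NaN, 2.5, NaN, 4.0]
s_loop = nansum_loop(x)   # 7.5
s_mask = nansum_mask(x)   # 7.5
```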

@johnmyleswhite
Contributor

Closed by b95ee3f

nalimilan pushed a commit that referenced this issue Jul 8, 2017
Stack should use similar_nullable, not NullableArray
quinnj pushed a commit that referenced this issue Sep 2, 2017
Stack should use similar_nullable, not NullableArray
nalimilan pushed a commit that referenced this issue May 26, 2022
Support RData/RDS format version 3