NA support for Vector{Float} #22

Closed
tshort opened this issue Jul 16, 2012 · 2 comments
tshort commented Jul 16, 2012

This request has been shot down before when we were working on Harlan's branch. I'm repeating it here "for the record".

I think we should add NA support to floating-point vectors using NaNs. It takes only the ten lines of code below. Those ten lines let us use regular vectors, and most floating-point columns will immediately "just work" (arithmetic, mean, quantile, ...). That will save us a lot of work trying to support, for DataVecs, every floating-point function that folks write. I don't think it's a problem to have multiple NA types as long as it looks pretty consistent from the user's point of view.

NA_Float64 = NaN                      # NA for a Float64 column is just NaN
NA_Float32 = NaN32
na(::Type{Float64}) = NA_Float64
na(::Type{Float32}) = NA_Float32
convert{T <: Float}(::Type{T}, x::NAtype) = na(T)         # NA converts to NaN
promote_rule{T <: Float}(::Type{T}, ::Type{NAtype}) = T   # Float wins in promotion
isna{T <: Float}(x::T) = isnan(x)
isna{T <: Float}(x::AbstractVector{T}) = x .!= x          # true exactly where x is NaN
nafilter{T <: Float}(v::AbstractVector{T}) = v[!isna(v)]  # drop the NAs
nareplace{T <: Float}(v::AbstractVector{T}, r::T) = [isna(v)[i] ? r : v[i] for i = 1:length(v)]  # substitute r for NAs

What I'm proposing is to add NA support for Vector{Float}, using NaNs as NAs, for use as DataFrame columns. This is along the lines of what R and pandas do, and it's one of the options Numpy is moving toward (the other option is a masking approach like DataVec).

Arrays are Julia's fundamental container type, so if we can support the use of Array{Float, 1} directly, our DataFrame columns will be more robust.

Advantages

Support

The #1 reason (by far) for using Vectors with NaNs as NAs is the reduced support and development burden on the DataFrames team. With just these ten lines of code, Vectors get pretty good NA support while still working with every Julia function that operates on Arrays. Here are some examples:

julia> srand(1)

julia> v = randn(10)
10-element Float64 Array:
  0.00422471
  0.0636925 
  1.41376   
 -1.09879   
  0.503439  
  1.75336   
 -0.202676  
 -0.458741  
  0.526426  
  1.60172   

julia> dv = DataVec(v)
[0.004224711539662927,0.0636925153577793,1.413764097398493,-1.0987858271983026,0.5034390675674981,1.753360709001194,-0.20267565554863,-0.4587414680524865,0.5264259022023546,1.601723433618217]

julia> quartile(v)
3-element Float64 Array:
 -0.150951
  0.283566
  1.19193 

julia> quartile(dv)                # This is what I worry will be too common.
no method sort(DataVec{Float64},)
 in method_missing at base.jl:60
 in quantile at statistics.jl:356
 in quartile at statistics.jl:369

julia> mean(v)
0.41064274858857797

julia> mean(dv)
no method mean(DataVec{Float64},)
 in method_missing at base.jl:60

julia> 

julia> v[[1,5]] = NA
NA

julia> dv[[1,5]] = NA
NA

julia> mean(v)
NaN

julia> mean(dv)
no method mean(DataVec{Float64},)
 in method_missing at base.jl:60

julia> mean(nafilter(v))
0.44984546334732733

julia> mean(nafilter(dv))
0.44984546334732733

julia> v + dv
no method +(Array{Float64,1},DataVec{Float64})
 in method_missing at base.jl:60

julia> v * 2
10-element Float64 Array:
 NaN       
   0.127385
   2.82753 
  -2.19757 
 NaN       
   3.50672 
  -0.405351
  -0.917483
   1.05285 
   3.20345 

julia> dv * 2
no method *(DataVec{Float64},Int64)
 in method_missing at base.jl:60

With DataVecs, the most common FAQ will likely be: why doesn't function xyz() work on DataFrame columns? The answer will be: wrap the column in nafilter() or write a method for xyz() to support DataVecs.

Especially since DataVecs cannot inherit from AbstractArray, there is a lot of work to be done to support even the functions in Base. Once packages proliferate, it'll be even worse. Most of these will be fairly simple wrappers, but they're still work to write initially and to maintain going forward.
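Here's roughly what that wrapper work looks like, method by method (just a sketch -- I'm assuming DataVec keeps its values in a data field and its NA mask in an na field; those names are illustrative, not the actual layout, and in real code these would extend the existing generic functions via import):

# Every function users expect to work on a column needs a method like this.
function mean(dv::DataVec)
    any(dv.na) && return NA           # propagate NA, R-style
    mean(dv.data)
end

median(dv::DataVec) = any(dv.na) ? NA : median(dv.data)
# ...and so on for sort, quantile, cumsum, and everything else.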

I also worry that forcing DataVecs will deepen the division between users coming from R and Matlab backgrounds. If DataFrames support plain Vectors, it'll be easier for the Matlab folks to use them, and easier for R folks to use Array-based functions written by the Matlab folks.

Performance

In the groupby testing I did, the bare array columns were faster.

Automatic support of new Julia data types

By supporting AbstractVectors, new Julia data types that are AbstractVectors can automatically be used in DataFrames.

Here is an example of using memory-mapped arrays (mmap_array) as columns of a DataFrame:

julia> s = open("bigdataframe.bin","w+")
IOStream(<file bigdataframe.bin>)

julia> N = 1000000000
1000000000

julia> v1 = mmap_array(Float64,(N,),s)

julia> v2 = mmap_array(Float64,(N,),s, numel(v1)*sizeof(eltype(v1)))

julia> d = DataFrame({v1, v2})
DataFrame  (1000000000,2)
                  x1  x2
[1,]             0.0 0.0
[2,]             0.0 0.0
[3,]             0.0 0.0
[4,]             0.0 0.0
[5,]             0.0 0.0
[6,]             0.0 0.0
[7,]             0.0 0.0
[8,]             0.0 0.0
[9,]             0.0 0.0
[10,]            0.0 0.0
[11,]            0.0 0.0
[12,]            0.0 0.0
[13,]            0.0 0.0
[14,]            0.0 0.0
[15,]            0.0 0.0
[16,]            0.0 0.0
[17,]            0.0 0.0
[18,]            0.0 0.0
[19,]            0.0 0.0
[20,]            0.0 0.0
  :
[999999981,]     0.0 0.0
[999999982,]     0.0 0.0
[999999983,]     0.0 0.0
[999999984,]     0.0 0.0
[999999985,]     0.0 0.0
[999999986,]     0.0 0.0
[999999987,]     0.0 0.0
[999999988,]     0.0 0.0
[999999989,]     0.0 0.0
[999999990,]     0.0 0.0
[999999991,]     0.0 0.0
[999999992,]     0.0 0.0
[999999993,]     0.0 0.0
[999999994,]     0.0 0.0
[999999995,]     0.0 0.0
[999999996,]     0.0 0.0
[999999997,]     0.0 0.0
[999999998,]     0.0 0.0
[999999999,]     0.0 0.0
[1000000000,]    0.0 0.0

julia> d["x1"][[2,5,N]] = NA
NA

julia> d["x2"][[1,3,6,9]] = pi
3.141592653589793

julia> head(d)
DataFrame  (6,2)
         x1      x2
[1,]    0.0 3.14159
[2,]    NaN     0.0
[3,]    0.0 3.14159
[4,]    0.0     0.0
[5,]    NaN     0.0
[6,]    0.0 3.14159


julia> sum(d[11:15, "x1"])
0.0

julia> sum(d[1:5, "x1"])
NaN

Disadvantages

No built-in filter/replace

The biggest downside to using Vectors is that they won't have built-in filtering/replacing. The filter and replace Bools attached to DataVecs are a cool idea -- much better than R's options()$na.action, which I never use because it makes code less portable. By attaching the na.action flag to the data itself, code stays portable while letting the user avoid wrapping things in nafilter.

I think the advantages of using Vectors outweigh losing built-in filter/replace for most applications of floating-point columns.
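In other words, with only the ten lines above, the filtering/replacing has to be spelled out at each call site (reusing v from the session above; nafilter and nareplace are from the proposed code):

mean(nafilter(v))          # drop the NAs before computing
mean(nareplace(v, 0.0))    # or substitute a value for each NA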

Confusing output

As it stands, NAs become NaNs and are displayed that way. The show() functions could probably be modified to overcome this, and NAs could probably be distinguished from ordinary NaNs by giving NA a dedicated bit pattern. With support from Julia core, the output could be fixed for all arrays.
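A sketch of the dedicated-bit-pattern idea: make NA a NaN with a distinguished payload so show() can tell it apart from an ordinary NaN. The constant and the isna_strict name below are illustrative only (R does something similar for NA_real_); none of this is implemented.

const NA_BITS = 0x7ff00000000007a2                # a NaN with a reserved payload
const NA_F64  = reinterpret(Float64, NA_BITS)

isna_strict(x::Float64) = reinterpret(UInt64, x) == NA_BITS

isnan(NA_F64)            # true  -- still behaves as a NaN in arithmetic
isna_strict(NA_F64)      # true
isna_strict(0.0 / 0.0)   # false -- an ordinary NaN has a different payload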

No universal type for columns

New features like indexing cannot be added automatically to all column types. For many features, this could be handled through APIs rather than type inheritance. For features where that's not possible, DataVecs can be used.

No Integer/Bool/String support

This approach cannot be used for these types because they don't have a NaN. R's approach -- picking a bit pattern for each type as its NA -- would not work unless it were integrated into Julia's core (not likely, and probably not a good idea). In any case, DataVecs or PooledDataVecs are more appropriate here, and the universe of functions that needs to work on these types is smaller than the set of functions operating on Vector{Float}.
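For reference, the R trick being dismissed here is a reserved sentinel value per type, e.g. for integers (a sketch; isna_int is a made-up name):

const NA_INT32 = typemin(Int32)      # R reserves INT_MIN as NA_integer_
isna_int(x::Int32) = x == NA_INT32
# Every integer operation in the core would have to check for and propagate
# this sentinel, which is why it only works if it's baked into the language.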

Next steps

I'm not proposing that we ditch DataVecs. I'm proposing that we have an alternative. One may work better for the user in some situations than the other.

More work is needed to sort all of this out. This includes deciding on the defaults for column assignment and for reading from CSVs. Here's what I would choose:

  • Bool -> PooledDataVec
  • String -> PooledDataVec
  • Integer -> DataVec (but Pooled may be used a lot here)
  • Float -> Vector

We probably need an AsIs type for overriding defaults.
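Something like this, perhaps (a sketch only -- the name AsIs and the idea that the DataFrame constructor unwraps it without conversion are hypothetical):

struct AsIs{T}
    data::T
end
# The DataFrame constructor would store an AsIs column as-is, skipping the
# default conversion, e.g. keeping a Float64 column as a bare Vector:
#   DataFrame({AsIs(randn(10)), dv})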

Another area is promotion. What type should x::DataVec{Float64} + v::Vector{Float64} produce?
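One possible answer, sketched: mixed arithmetic falls back to a bare Vector, with NAs carried along as NaNs (consistent with the ten lines above). The navalues accessor is made up -- it stands for whatever returns the underlying Float64 data with NAs encoded as NaN.

import Base: +     # extend + for the mixed case

+(v::Vector{Float64}, dv::DataVec{Float64}) = v + navalues(dv)
+(dv::DataVec{Float64}, v::Vector{Float64}) = navalues(dv) + v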

As for DataVec{Float}, I think we should continue to support it, but the list of functions we support at the start could be a lot smaller. If it gains wide use, folks will add methods to support it.

tshort commented Aug 4, 2012

I've played with this some more. One area of concern has been display and user confusion. For the most part, that can be minimized: it's a three-line change to base to allow NaNs and NAs to be displayed differently. So, even though there are different internal NA representations, the user can mostly just use NA, and everything is handled transparently.
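Not the actual patch, but the shape of it: when printing a Float64, check for a reserved NA bit pattern first (NA_BITS as in the sketch in my earlier comment; the function name here is made up).

function show_na_aware(io::IO, x::Float64)
    if reinterpret(UInt64, x) == NA_BITS     # the reserved NA payload
        print(io, "NA")
    else
        show(io, x)
    end
end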

Another issue I found is with floating-point comparisons. A floating-point comparison involving an NA (i.e. a NaN) always evaluates to false. Here's an example:

julia> 1.0 .< [NA, 3., NA]
3-element Bool Array:
 false
  true
 false

julia> 1.0 .>= [NA, 3., NA]
3-element Bool Array:
 false
 false
 false

In practice, I don't think it's a big deal, but it is confusing. Pandas has this same issue, and I don't see where they even document it.

A separate floating-point bitstype could be created to solve this, but its comparison operations would have to return a type that can itself contain NAs. R users are used to handling the NA case in comparisons, but here, if one forgets, the answers could silently be wrong.
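Concretely, such a type's comparison operators would need to return something like a DataVec{Bool} rather than a plain Bool array, so the NA-ness survives (a sketch; the DataVec(values, namask) constructor signature is assumed, not the real one):

naless(x::Float64, v::Vector{Float64}) =
    DataVec([x < vi for vi in v],         # elementwise result
            [isnan(vi) for vi in v])      # NA mask: comparing against NA gives NA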

garborg commented Jan 13, 2015

Closing in favor of recent discussion about NullableArray design.

@garborg garborg closed this as completed Jan 13, 2015