
Support CategoricalArray parsing #43

Closed
quinnj opened this issue Oct 12, 2016 · 6 comments

quinnj (Member) commented Oct 12, 2016

No description provided.

quinnj (Member, Author) commented Mar 19, 2017

@nalimilan, do you think it'd be wise to parse string columns as CategoricalArrays by default? Or should that be opt-in? Or should we try and be smart about it during type detection and do some kind of threshold for when we switch from CategoricalArray => String column?

nalimilan (Member) commented Mar 19, 2017

That's a tough question. It's widely considered (see e.g. this post) that the stringsAsFactors=TRUE default in R was a big mistake, so I'd rather not make the same choice again.

OTOH, R factors are worse than our categorical arrays, since they have some surprising behaviors: they sometimes act like integers (and convert silently to them), and values not in the levels are converted to missing on assignment (!). Categorical arrays should behave less surprisingly, though the fact that they return CategoricalValue objects could be annoying when you really want a string (even though we could make CategoricalValue{String} <: AbstractString so that most string operations work).
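For concreteness, here is a small sketch (using CategoricalArrays.jl) of the assignment behavior described above; unlike an R factor, a categorical array extends its pool on assignment rather than silently producing a missing:

```julia
using CategoricalArrays

# Assigning a value not among the existing levels adds it to the pool,
# instead of the R-factor behavior of converting it to NA/missing.
x = categorical(["a", "b", "a"])
x[1] = "c"          # "c" is added to the levels automatically
levels(x)           # ["a", "b", "c"]
```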

Also, in R, creating factors by default no longer has any memory-use advantage, since all strings are automatically pooled. As long as that's not the case in Julia, using categorical arrays will be much more efficient for most tables. But we could also use a custom pooled string type instead, which would be completely transparent to users (as opposed to CategoricalValue).

So overall I'm not sure. I'd say we should return standard strings by default, with an option to easily use categorical arrays instead. That's the safest option for now.

nalimilan (Member) commented

Most relevant excerpt from the link in my previous post:

Why should we expect a factor to behave like a character vector? Why not expect it to behave like an integer vector? The reason is: we supplied a character vector and R’s default behavior in data.frame() was to convert it to a factor. R’s behavior only makes sense under the assumption there is some commonality of behavior between factors and character vectors. Otherwise R has made a surprising substitution and violated the principle of least astonishment. To press the point further: from an object oriented view (which is a common way to talk about the separation of concerns of interface and implementation) a valid substitution should at the very least follow some form of the Liskov substitution principle of a factor being a valid sub-type of character vector. But this is not possible between mutable versions of factor and character vector, so the substitution should not have been offered.

http://www.win-vector.com/blog/2014/09/factors-are-not-first-class-citizens-in-r/

quinnj (Member, Author) commented Sep 12, 2017

Plan:

  • pick a threshold percentage during column type detection; if a column falls under the threshold, parse it as a CategoricalArray, otherwise as a String array
  • the ratio compared against the threshold will be (# of unique string values) / rows_for_type_detect
  • if categorical, the column's element type will be CategoricalValue
  • we might need a Data.categorical(::Type{Sink}) = true addition to the DataStreams interface to signal that a Sink supports streaming categorical columns (i.e. SQLite, CSV, & ODBC don't, since they just store strings anyway, but DataFrames & Feather do natively)
  • Sinks will need to allocate specially so they store a CategoricalArray and not a Vector of CategoricalValues
  • we'll need a custom parsefield(io, ::Type{CategoricalValue}) method to parse a CSV string cell as a CategoricalValue, potentially adding to the CategoricalPool if needed; we might also need special Data.streamto! methods for CategoricalArray in case it receives a CategoricalValue with a larger pool than it was created with
  • some kind of keyword argument to CSV.read to allow columns to be categorical; not sure if it's just categorical=true and we detect, or if we let users set the threshold, or if we simply allow people to pass CategoricalValue as the column type and use that
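The threshold step in the plan above could be sketched roughly like this (the function name and the 10% cutoff are hypothetical placeholders, not decided values):

```julia
using CategoricalArrays

# Hypothetical cutoff: if at most 10% of the sampled cells are unique,
# treat the column as categorical.
const CATEGORICAL_THRESHOLD = 0.1

# Decide a string column's type from the rows sampled for type detection.
function detect_string_column(cells::Vector{String})
    ratio = length(unique(cells)) / length(cells)
    return ratio <= CATEGORICAL_THRESHOLD ? CategoricalArray{String} : Vector{String}
end

detect_string_column(repeat(["yes", "no"], 50))   # few unique values => categorical
detect_string_column([string(i) for i in 1:100])  # mostly unique => plain String column
```

As noted below, in practice the two cases are usually far apart (a handful of levels vs. mostly unique values), so the exact cutoff matters less than it might seem.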

@nalimilan
Copy link
Member

Makes sense. I'm torn about whether choosing the type automatically based on the % of unique values is a good idea or not. In principle it's a bit annoying to have the types depend on the contents of the file, but I guess it would work fine as long as the chosen threshold is high enough (so that you don't accidentally get a non-categorical array just because a given file contains more diverse data, e.g. because it's not sorted on that particular column). In practice the difference between categorical and non-categorical variables is likely quite clear: either only a handful of unique levels or mostly unique values, with very few cases in between.

Regarding the implementation, I only have a few comments:

Will need a custom parsefield(io, ::Type{CategoricalValue}) method to parse a csv string cell as a CategoricalValue, potentially adding to the CategoricalPool if needed; we might also need special Data.streamto! methods for CategoricalArray if it receives a CategoricalValue w/ a larger pool than it was created with

Note that setindex!(::CategoricalArray, ...) will automatically add new levels to the pool when needed, and throw an OverflowError if there are too many of them. You could just catch it and resize the pool in that case. It would be interesting to start with a smaller reftype than the default UInt32, depending on how many unique values were detected in the first rows, since typemax(UInt32) is an unusually large number of levels for a categorical variable. In most cases I expect UInt8 to be enough, though UInt16 would be less risky.
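A sketch of the reftype choice suggested above (the helper name is hypothetical; the exception thrown on pool overflow has varied across CategoricalArrays versions, so this only shows the narrowing, not the recovery path):

```julia
using CategoricalArrays

# Pick the narrowest reference type that can hold the number of unique
# values seen during type detection.
function choose_reftype(nunique::Integer)
    nunique <= typemax(UInt8)  && return UInt8
    nunique <= typemax(UInt16) && return UInt16
    return UInt32
end

# CategoricalArray accepts an explicit reference type parameter:
A = CategoricalArray{String,1,choose_reftype(3)}(["low", "mid", "high"])
```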

some kind of keyword argument to CSV.read to allow columns to be categorical; not sure if it's just a categorical=true and we'll detect, or if we let them set the % threshold, or just simply allow people to pass in CategoricalValue as the column type and we'll just use that.

See the suggestions I made at JuliaData/DataFrames.jl#895 (comment). I don't think passing a custom threshold makes a lot of sense: if the default doesn't work for you, better specify what type you want explicitly, either for all string columns, or for a subset of columns.

nalimilan (Member) commented

JuliaData/CategoricalArrays.jl#77 makes CategoricalValue <: AbstractString to make it mostly transparent to the user that a CategoricalArray{String} is used instead of an Array{String}.

quinnj added a commit that referenced this issue Sep 14, 2017
* Fix #43 by supporting CategoricalArray parsing.