Support CategoricalArray parsing #43

quinnj · 2016-10-12T05:53:29Z

No description provided.

quinnj · 2017-03-19T04:09:33Z

@nalimilan, do you think it'd be wise to parse string columns as CategoricalArrays by default? Or should that be opt-in? Or should we try and be smart about it during type detection and do some kind of threshold for when we switch from CategoricalArray => String column?

nalimilan · 2017-03-19T15:28:10Z

That's a tough question. It's universally considered (e.g. this post) that the stringsAsFactors=TRUE default in R is a big mistake, so I'd rather not make the same choice again.

OTOH, R factors are worse than our categorical arrays since they have sometimes surprising behaviors: they sometimes behave like integers (and convert silently to them), and values not in levels are converted to missing on assignment (!). Categorical arrays should have a less surprising behavior, though the fact that they return CategoricalValue objects could be annoying when you really want a string (even though we could have CategoricalValue{String} <: AbstractString so that most string operations work).

Also, in R, creating factors by default doesn't have any advantage in terms of memory use now since all strings are automatically pooled. As long as this doesn't happen in Julia, using categorical arrays will be much more efficient for most tables. But we could also use a custom pooled string type instead, which would be completely transparent for users (as opposed to CategoricalValue).

So overall I'm not sure. I'd say we should return standard strings by default, with an option to easily use categorical arrays instead. That's the safest option for now.

nalimilan · 2017-03-19T18:31:32Z

Most relevant excerpt from the link in my previous post:

Why should we expect a factor to behave like a character vector? Why not expect it to behave like an integer vector? The reason is: we supplied a character vector and R’s default behavior in data.frame() was to convert it to a factor. R’s behavior only makes sense under the assumption there is some commonality of behavior between factors and character vectors. Otherwise R has made a surprising substitution and violated the principle of least astonishment. To press the point further: from an object oriented view (which is a common way to talk about the separation of concerns of interface and implementation) a valid substitution should at the very least follow some form of the Liskov substitution principle of a factor being a valid sub-type of character vector. But this is not possible between mutable versions of factor and character vector, so the substitution should not have been offered.

http://www.win-vector.com/blog/2014/09/factors-are-not-first-class-citizens-in-r/

quinnj · 2017-09-12T21:46:57Z

Plan:

pick a threshold % when doing column type detection; if under %, parse as CategoricalArray, otherwise, it'll be String array
% threshold will be (# of unique string values) / rows_for_type_detect
if Categorical, column will have type of CategoricalValue
we might need a Data.categorical(::Type{Sink}) = true addition to the DataStreams interface to signal that a Sink supports streaming categorical columns (i.e. SQLite, CSV, & ODBC don't since they just store as strings anyway, but DataFrames & Feather do natively)
Sink's will need to allocate specially so they store as CategoricalArray and not a Vector of CategoricalValues
Will need a custom parsefield(io, ::Type{CategoricalValue}) method to parse a csv string cell as a CategoricalValue, potentially adding to the CategoricalPool if needed; we might also need special Data.streamto! methods for CategoricalArray if it receives a CategoricalValue w/ a larger pool than it was created with
some kind of keyword argument to CSV.read to allow columns to be categorical; not sure if it's just a categorical=true and we'll detect, or if we let them set the % threshold, or just simply allow people to pass in CategoricalValue as the column type and we'll just use that.

nalimilan · 2017-09-13T10:37:19Z

Makes sense. I'm torn about whether choosing the type automatically based on the % of unique values is a good idea or not. In principle it's a bit annoying to have the types depend on the contents of the file, but I guess it would work fine as long as the chosen threshold is high enough (so that you don't accidentally get a non-categorical array just because a given file contains more diverse data, e.g. because it's not sorted on that particular column). In practice the difference between categorical and non-categorical variables is likely quite clear: either only a handful of unique levels or mostly unique values, with very few cases in between.

Regarding the implementation, I only have a few comments:

Will need a custom parsefield(io, ::Type{CategoricalValue}) method to parse a csv string cell as a CategoricalValue, potentially adding to the CategoricalPool if needed; we might also need special Data.streamto! methods for CategoricalArray if it receives a CategoricalValue w/ a larger pool than it was created with

Note that setindex!(::CategoricalArray, ...) will automatically add new levels to the pool when needed, and throw an OverflowError if there are too many of them. You could just catch it and resize the pool in that case. It would be interesting to start with a smaller reftype than the default UInt32, depending on how many unique values were detected in the first rows, since typemax(UInt32) is an unusually large number of levels for a categorical variable. In most cases I expect UInt8 to be enough, though UInt16 would be less risky.

some kind of keyword argument to CSV.read to allow columns to be categorical; not sure if it's just a categorical=true and we'll detect, or if we let them set the % threshold, or just simply allow people to pass in CategoricalValue as the column type and we'll just use that.

See the suggestions I made at JuliaData/DataFrames.jl#895 (comment). I don't think passing a custom threshold makes a lot of sense: if the default doesn't work for you, better specify what type you want explicitly, either for all string columns, or for a subset of columns.

nalimilan · 2017-09-14T20:11:31Z

JuliaData/CategoricalArrays.jl#77 makes CategoricalValue <: AbstractString to make it mostly transparent to the user that a CategoricalArray{String} is used instead of an Array{String}.

* Fix #43 by supporting CategoricalArray parsing.

This was referenced Sep 12, 2017

Support parsing columns into CategoricalArrays #98

Closed

add support to CSV.read for readtable's truestrings, falsestrings, makefactors keyword arguments #80

Closed

quinnj added a commit that referenced this issue Sep 14, 2017

Fix #43 by supporting CategoricalArray parsing.

4f0481f

quinnj mentioned this issue Sep 14, 2017

Fix #43 by supporting CategoricalArray parsing. #102

Merged

quinnj closed this as completed in #102 Sep 14, 2017

quinnj added a commit that referenced this issue Sep 14, 2017

Fix #43 by supporting CategoricalArray parsing. (#102)

0d6b383

* Fix #43 by supporting CategoricalArray parsing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support CategoricalArray parsing #43

Support CategoricalArray parsing #43

quinnj commented Oct 12, 2016

quinnj commented Mar 19, 2017

nalimilan commented Mar 19, 2017 •

edited

Loading

nalimilan commented Mar 19, 2017

quinnj commented Sep 12, 2017

nalimilan commented Sep 13, 2017

nalimilan commented Sep 14, 2017

Support CategoricalArray parsing #43

Support CategoricalArray parsing #43

Comments

quinnj commented Oct 12, 2016

quinnj commented Mar 19, 2017

nalimilan commented Mar 19, 2017 • edited Loading

nalimilan commented Mar 19, 2017

quinnj commented Sep 12, 2017

nalimilan commented Sep 13, 2017

nalimilan commented Sep 14, 2017

nalimilan commented Mar 19, 2017 •

edited

Loading