Support CategoricalArray parsing #43
@nalimilan, do you think it'd be wise to parse string columns as CategoricalArrays by default? Or should that be opt-in? Or should we try and be smart about it during type detection and do some kind of threshold for when we switch from CategoricalArray => String column?
That's a tough question. It's universally considered (e.g. this post) that the [...]

OTOH, R factors are worse than our categorical arrays since they have sometimes surprising behaviors: they sometimes behave like integers (and convert silently to them), and values not in levels are converted to missing on assignment (!). Categorical arrays should have a less surprising behavior, though the fact that they return [...]

Also, in R, creating factors by default doesn't have any advantage in terms of memory use now since all strings are automatically pooled. As long as this doesn't happen in Julia, using categorical arrays will be much more efficient for most tables. But we could also use a custom pooled string type instead, which would be completely transparent for users (as opposed to [...]).

So overall I'm not sure. I'd say we should return standard strings by default, with an option to easily use categorical arrays instead. That's the safest option for now.
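To make the memory argument concrete, here is a minimal, language-agnostic sketch of string pooling, written in Python rather than Julia. It is not the CategoricalArrays.jl implementation; the names `pool` and `unpool` are hypothetical. The idea is only that each distinct string is stored once and the column itself holds small integer codes.

```python
def pool(values):
    """Encode a list of strings as (levels, codes).

    levels: each distinct string, in order of first appearance.
    codes:  one integer per input value, indexing into levels.
    """
    index = {}   # string -> code, for O(1) lookup
    levels = []
    codes = []
    for v in values:
        code = index.get(v)
        if code is None:
            # First time we see this string: assign the next code.
            code = len(levels)
            index[v] = code
            levels.append(v)
        codes.append(code)
    return levels, codes


def unpool(levels, codes):
    """Rebuild the original list of strings from its pooled form."""
    return [levels[c] for c in codes]


column = ["low", "high", "low", "medium", "high", "low"]
levels, codes = pool(column)
```

With only three distinct values, the six-element column is represented by three strings plus six integers, and the saving grows with the number of rows.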
Most relevant excerpt from the link in my previous post:
http://www.win-vector.com/blog/2014/09/factors-are-not-first-class-citizens-in-r/
Plan:
Makes sense. I'm torn about whether choosing the type automatically based on the % of unique values is a good idea or not. In principle it's a bit annoying to have the types depend on the contents of the file, but I guess it would work fine as long as the chosen threshold is high enough (so that you don't accidentally get a non-categorical array just because a given file contains more diverse data, e.g. because it's not sorted on that particular column). In practice the difference between categorical and non-categorical variables is likely quite clear: either only a handful of unique levels or mostly unique values, with very few cases in between.

Regarding the implementation, I only have a few comments:
Note that
See the suggestions I made at JuliaData/DataFrames.jl#895 (comment). I don't think passing a custom threshold makes a lot of sense: if the default doesn't work for you, it's better to specify the type you want explicitly, either for all string columns or for a subset of columns.
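The approach discussed above (a fixed built-in threshold on the share of unique values, plus an explicit per-column override instead of a tunable threshold) could be sketched roughly as follows. This is an illustration in Python, not the parser's actual code: the function name `choose_column_kind`, the `override` parameter, and the value `0.67` are all assumptions made for the example.

```python
# Hypothetical fixed threshold: columns whose share of unique values is
# below this are treated as categorical. The real value is not given in
# this thread; 0.67 is only a placeholder.
DEFAULT_THRESHOLD = 0.67


def choose_column_kind(values, override=None):
    """Return "categorical" or "string" for a column of parsed strings.

    override: None to auto-detect, or "string"/"categorical" to force
    the type explicitly, as suggested instead of a custom threshold.
    """
    if override is not None:
        if override not in ("string", "categorical"):
            raise ValueError(f"unknown column kind: {override!r}")
        return override

    # Ignore missing entries (represented here as None).
    present = [v for v in values if v is not None]
    if not present:
        return "string"  # nothing to pool, keep the safe default

    unique_fraction = len(set(present)) / len(present)
    return "categorical" if unique_fraction < DEFAULT_THRESHOLD else "string"


sex = ["m", "f", "m", "f", "m", "f", "m", "f"]   # 2 unique out of 8
ids = ["a1", "b2", "c3", "d4"]                   # all unique

kind_sex = choose_column_kind(sex)
kind_ids = choose_column_kind(ids)
kind_forced = choose_column_kind(ids, override="categorical")
```

Keeping the threshold internal means the result can still depend on the file's contents, which is the concern raised above; the override gives callers a deterministic way out.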
JuliaData/CategoricalArrays.jl#77 makes