Add an option to "unnullify" columns that do not contain nulls. #35
Comments
For now, you can specify the type of a specific column, referencing the column by number or name, e.g. CSV.read(file; nullable=true, types=Dict("non_null_column"=>Int64)).

One thing that I come back to with "auto-unwrapping" NullableVectors is the analogy to type instability in Julia: a function's return type shouldn't depend on specific input values. In a similar vein, I think it might be bad if the types of the columns depended on the actual values parsed, as opposed to what they're set up to be. A lot of my own uses of CSV.jl involve reading batches and batches of CSV files with various ranges of data, and it would be annoying to not know ahead of time which column types I would get back.

Definitely something to consider, though.
Isn't this unavoidable when parsing CSV? A column of numbers won't give the same array type as a column of strings. I'd say it's reasonable to have some type instability when reading the file (unless you specified column types, of course), as long as operations after that are type-stable.
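The "type instability at the boundary, type stability afterwards" pattern @nalimilan describes is usually handled in Julia with a function barrier: the unstable read happens once, and the real work runs in a function compiled for whatever concrete column type came back. A minimal sketch (the function names and data are made up for illustration):

```julia
# Function-barrier idiom: `read_column` is type-unstable (its return
# type depends on the data), but `total` is compiled separately for
# whatever concrete vector it receives, so the work inside it is
# type-stable.
function read_column(has_nulls::Bool)
    # stands in for CSV parsing whose result type depends on the file
    has_nulls ? Union{Int,Missing}[1, 2, missing] : Int[1, 2, 3]
end

total(col::AbstractVector) = sum(skipmissing(col))

total(read_column(false))  # 6
total(read_column(true))   # 3
```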
As @nalimilan says, type instability is inevitable unless the column types are pre-specified. Also, I suggest making this an optional argument, with a default that preserves the current behavior.
I guess the real issue is the current architecture of the entire parsing process. My point is that right now the architecture requires us to decide upfront, during Source construction, whether a column will allow missing values or not; i.e. there's no place in the current parsing process where you could check after the fact whether there were any missing values and unwrap the column if not.
Maybe you could apply the same strategy as is currently used to choose the column type, i.e. check whether the first rows contain any nulls?
@nalimilan It seems to me that you really need the whole column to be able to check whether you have any nulls: if the first 100 elements of a column are non-null, a null can still show up later in the file. I guess there are still aspects of this package, and how it ties into the streaming framework, that I don't fully understand.
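One middle ground between "check a prefix" and "needing the whole column" is to start with the narrow element type and widen the column the moment the first null appears, copying what was parsed so far. This is not how CSV.jl was structured at the time (which is @quinnj's point above); the sketch below just illustrates the idea in modern Julia, using missing in place of the Nullable machinery of that era, with a made-up function name:

```julia
# Parse values into the narrowest vector possible, widening to
# Union{T,Missing} only when a missing value actually appears.
function collect_widening(::Type{T}, values::AbstractVector) where {T}
    col = T[]
    for (i, v) in pairs(values)
        if v === missing
            wide = Vector{Union{T,Missing}}(col)  # widen once, copying so far
            append!(wide, view(values, i:lastindex(values)))
            return wide
        end
        push!(col, v)
    end
    return col
end

collect_widening(Int, [1, 2, 3])        # stays a Vector{Int}
collect_widening(Int, [1, missing, 3])  # widened to Vector{Union{Int,Missing}}
```

The cost is one copy per column that actually contains nulls; columns without nulls never pay for the wide representation.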
I think the fact that CSV is streaming the data as it reads it is one of the best aspects of the package, and it integrates really nicely with e.g. Query.jl. For example, a query like this one:

```julia
q = @from i in CSV.Source("data.csv") begin
    @where i.Children > 2
    @select i
    @collect DataFrame
end
```

will never actually materialize the whole, unfiltered data from the CSV file in memory as a vector or anything like that. Instead, the filter is added into the data stream, and only the values that actually pass the filter end up in the arrays that constitute the resulting DataFrame.

But as @quinnj pointed out, that essentially means you need to make a call about the column type before you have read all the rows; otherwise this kind of streaming design can't work.
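The same fused-filter behavior can be seen with Base iterators alone, without Query.jl: the predicate is applied as rows stream past, and only matches are ever stored. The row data below is made up for illustration:

```julia
# Lazily filter rows as they stream by; rejected rows are never
# collected into any intermediate array.
rows = ((Name = n, Children = c) for (n, c) in
        [("Ann", 1), ("Ben", 3), ("Cam", 5)])

kept = collect(Iterators.filter(r -> r.Children > 2, rows))
# kept holds the two matching rows; ("Ann", 1) was never materialized
```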
@dmbates, to follow up on your last comment: we can avoid providing entire columns at once and still keep the streaming design.
At least on Linux, the shell command … In another discussion, a person said that the Julia …
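The specific commands in this comment weren't preserved in this copy. If the comparison was between a line-counting shell tool like wc -l and its Julia counterpart (a common cheap pre-pass before parsing, since no fields need to be materialized), a sketch would be:

```julia
# Count lines without parsing any fields, in the spirit of `wc -l`.
# A temporary file stands in for a real CSV.
path, io = mktemp()
write(io, "a,b\n1,2\n3,4\n")
close(io)

countlines(path)  # 3: the header plus two data rows
```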
^ This is a good point, and a big part of the reason that I added that functionality in the first place.
Closing as "nothing can currently be done without major re-architecting". I understand the use case, but it's also easy enough to request non-null columns upfront, or to unwrap them afterwards if the user really wants.
Right now the optional nullable argument to CSV.read allows for the columns to all be NullableVector types or for none to be NullableVectors. Could an optional argument be added so that, when nullable = true, any columns that do not contain nulls are "unnullified" after being read? The code could be quite simple, although I haven't looked inside the package to see exactly where this could be done.
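The snippet from the original issue body wasn't preserved in this copy. In spirit, translated to modern Julia (where missing and Union element types have since replaced NullableVector), such an "unnullify" pass could look something like the following; the function name is hypothetical:

```julia
# Narrow a column to its non-missing element type when it contains no
# missing values; otherwise return it unchanged.
function unnullify(col::AbstractVector)
    any(ismissing, col) && return col
    convert(Vector{nonmissingtype(eltype(col))}, col)
end

unnullify(Union{Int,Missing}[1, 2, 3])        # narrows to Vector{Int}
unnullify(Union{Int,Missing}[1, missing, 3])  # returned unchanged
```

Applied to each column after reading, this gives exactly the "unnullified when possible" behavior requested, at the cost of the value-dependent column types discussed above.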