Add an option to "unnullify" columns that do not contain nulls. #35

Closed
dmbates opened this issue Sep 14, 2016 · 11 comments

Comments

@dmbates

dmbates commented Sep 14, 2016

Right now, the optional nullable argument to CSV.read allows for the columns to all be NullableVector types or for none of them to be NullableVectors. Could an optional argument be added so that, when nullable = true, any columns that do not contain nulls are "unnullified" after being read? The code could be as simple as

unnullify(x::NullableArray) = anynull(x) ? x : Array(x)

although I haven't looked inside the package to see exactly where this could be done.
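
For illustration, a minimal sketch of how that post-processing step could be applied to a whole table, assuming the 2016-era NullableArrays API (anynull, dropnull) and a DataFrame result; unnullify_columns! is a made-up helper, not something CSV.jl provides:

using NullableArrays, DataFrames

# Unwrap a NullableArray only when it holds no nulls; dropnull returns a plain Vector.
unnullify(x::NullableArray) = anynull(x) ? x : dropnull(x)
unnullify(x::AbstractVector) = x    # non-nullable columns pass through unchanged

# Hypothetical post-processing step run on the DataFrame returned by CSV.read.
function unnullify_columns!(df::DataFrame)
    for name in names(df)
        df[name] = unnullify(df[name])
    end
    return df
end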

@quinnj
Member

quinnj commented Sep 14, 2016

For now, you can specify the type of a specific column, referencing the column by number or name, so that T will result in a Vector{T} and Nullable{T} will result in NullableVector{T}. So a call would look like:

CSV.read(file; nullable=true, types=Dict("non_null_column"=>Int64))

Here all columns would be returned as NullableVector{T}, except non_null_column, which would be a regular Vector{T}.
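
The same override works when referencing the column by position instead of name, e.g. (assuming, purely for illustration, that the null-free column is the third one):

CSV.read(file; nullable=true, types=Dict(3=>Int64))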

One thing that I come back to with "auto unwrapping" NullableVectors is the analogy to dealing with type instability in Julia: a function's return type shouldn't depend on specific input values. In a similar vein, I think it might be bad if we were to have the types of columns depend on the actual values parsed, as opposed to what they're set up to be. A lot of my own personal uses for CSV.jl involve reading batches and batches of CSV files, covering various ranges of data. It would be annoying not to know whether I would get a Vector{T} or a NullableVector{T}, depending on whether there happened to be nulls in that range of the data or not.

Definitely something to consider, but I think even just having the nullable=true keyword and being able to specify individual column type nullability gets us most of the way there.

@nalimilan
Member

> One thing that I come back to with "auto unwrapping" NullableVectors is the analogy to dealing with type instability in Julia: a function's return type shouldn't depend on specific input values. In a similar vein, I think it might be bad if we were to have the types of columns depend on the actual values parsed, as opposed to what they're set up to be.

Isn't this unavoidable when parsing CSV? A column of numbers won't give the same array type as a column of strings. I'd say it's reasonable to have some type instability when reading the file (unless you specified column types of course), as long as operations after that are type-stable.

@dmbates
Author

dmbates commented Sep 15, 2016

As @nalimilan says, type instability is inevitable unless the column types are pre-specified. Also, I am suggesting this as an optional argument, whose default could be false.

@quinnj
Member

quinnj commented Sep 15, 2016

I guess the real issue is the current architecture of the entire parsing process: currently, it's based on the fact that all CSV.jl needs to do is provide Data.getfield(source::CSV.Source, ::Type{T}, row, col) and Data.getfield(source, ::Type{Nullable{T}}, row, col). In the CSV.Source construction process, we take a peek at the first n rows to determine the types, but in terms of nullability we assume, by default, that there may be missing values in any column. This is why the nullable keyword argument was introduced: setting nullable=false asserts that the columns have no missing values.

I guess my point is that right now the architecture requires us to decide upfront, during Source construction, whether a column will allow missing values or not; i.e. there's no place in the current parsing process where you could check after the fact whether there were missing values and unwrap the column if not.
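
To make that constraint concrete, here is a rough schematic of the cell-by-cell streaming contract described above; this is not the actual DataStreams code, and column_types, nrows, ncols, sink, and store! are stand-ins:

for row in 1:nrows, col in 1:ncols
    T = column_types[col]                     # nullable or not, fixed at Source construction
    val = Data.getfield(source, T, row, col)  # parses one cell as a T or a Nullable{T}
    store!(sink, val, row, col)               # hypothetical sink operation
end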

@nalimilan
Member

Maybe you could apply the same strategy as the one currently used to choose the column type, i.e. check whether the first n rows contain a missing value? I don't see why it would be OK to do that to choose the type, but not to decide whether null values should be allowed.
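
A sketch of that idea, mirroring the existing first-n-rows type inference; the sample object and the ismissingcell helper are invented for illustration and are not CSV.jl internals:

# During Source construction, after sampling the first n rows:
column_allows_nulls = [any(ismissingcell(sample, row, col) for row in 1:n) for col in 1:ncols]
# A column whose sample contained no missing value would then be parsed as a
# Vector{T} instead of a NullableVector{T}, with the same risk of guessing
# wrong as the existing type inference.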

@dmbates
Author

dmbates commented Sep 15, 2016

@nalimilan It seems to me that you really need the whole column to be able to check whether you have any nulls. If the first 100 elements of a column are Ints, I would be willing to guess that they all are and pay the price if I was wrong. But seeing that the first 100 elements of a column don't contain any nulls would not make me confident that the whole column was free of nulls.

I guess there are still aspects of this package and how it ties into DataStreams that I don't understand. To me, if I were producing a DataFrame I would need whole columns before returning the result, although I may be thinking too much like an R programmer here. Operations like push! and append! would be far too expensive to use in R, but not in Julia. If it is the case that entire columns are not available until CSV.read returns, then I can just do the unnullify operation afterwards.

@davidanthoff

I think the fact that CSV is streaming the data as it reads it is one of the best aspects of the package, and it integrates really nicely with e.g. Query.jl. For example, a query like this one:

q = @from i in CSV.Source("data.csv") begin
    @where i.Children > 2
    @select i
    @collect DataFrame
end

will actually never materialize the whole, unfiltered data from the CSV file in memory in the form of a vector or anything like that. Instead, the filter is inserted into the data stream, and only the values that actually pass the filter will eventually end up in the arrays that constitute the DataFrame. At no point will this allocate an array that holds all the values from the file. At least in theory, that should, for example, enable one to read a file that would not fit into memory unfiltered, but whose subset remaining after the @where filter is small enough to fit into memory in a DataFrame. I say in theory, because I haven't tried it ;)

But as @quinnj pointed out, that essentially means that you need to make a call about the column type before you have read all the rows, otherwise this kind of streaming design can't work.

@quinnj
Member

quinnj commented Oct 5, 2016

@dmbates, to follow up on your last comment: we can avoid providing entire columns at once and still avoid using push! or append!. Currently, CSV.jl will by default try to determine the total # of rows in a file and pass that upfront to a DataFrame so that it can pre-allocate the correct # of rows. But counting the # of rows in a file is much, much cheaper than doing any kind of type inference or null detection over the entire file.
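
Roughly why the cheap row count pays off: the sink can size each column once instead of growing it row by row. A sketch with made-up names, using the Julia 0.5-era uninitialized-vector constructor:

nrows = count_rows(file)      # cheap scan: no type inference or null detection
col = Vector{Int64}(nrows)    # pre-allocated once with the right length
# During parsing: col[row] = value, instead of push!(col, value) for every row.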

@dmbates
Author

dmbates commented Oct 5, 2016

At least on Linux, the shell command wc -l <filename> is an extremely fast way to get the number of rows, assuming that records correspond to lines. It may give an upper bound on the number of records, but that is still okay for use with sizehint!.

In another discussion a person said that the Julia countlines function should be as fast as wc in the shell, but that hasn't been my experience.

@quinnj
Member

quinnj commented Oct 5, 2016

^ This is a good point, and a big part of the reason I added the functionality to do CSV.read(file; rows=num_rows), where the user can provide num_rows. The reason I have to use my own CSV.countlines is the possibility of embedded newlines in a CSV field, which neither wc -l nor Base.countlines accounts for; i.e. I might have a cell in my CSV file like "lot's of string text \n with a newline in the middle". But for cases where the # of rows is known, or can be safely and quickly pre-computed beforehand, it's certainly an optimization to pass it in as the rows=num_rows argument.
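
For files known to have no embedded newlines, a usage sketch of that keyword, pre-computing the count with the shell trick mentioned above (an upper bound is fine):

nlines = parse(Int, first(split(readstring(`wc -l data.csv`))))
df = CSV.read("data.csv"; rows=nlines)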

@quinnj
Member

quinnj commented Oct 10, 2016

Closing as "nothing can currently be done without major re-architecting". I understand the use case, but it's also easy enough to request non-null columns upfront or unwrap them afterwards if the user really wants to.

@quinnj quinnj closed this as completed Oct 10, 2016