
Add an option to exclude default null values when parsing CSV files #5984

Closed
dclong opened this issue Jan 2, 2023 · 9 comments · Fixed by #6203
Labels
enhancement New feature or an improvement of an existing feature

Comments

dclong commented Jan 2, 2023

Problem description

pandas.read_csv has an option keep_default_na which allows users to exclude the default null values (by passing keep_default_na=False). I think this is an important option that should be added to Polars. A related issue is #2769. In practice, people sometimes want to parse empty fields as empty strings instead of null values; however, there's currently no way to exclude null values in Polars.

@dclong dclong added the enhancement New feature or an improvement of an existing feature label Jan 2, 2023
ritchie46 (Member) commented Jan 2, 2023

However, there's no way to exclude null values in Polars currently.

There is. I think we should not conflate data processing with data reading too much. The intent of the following query is much clearer to me than a keyword argument is.

pl.read_csv(..).with_columns(pl.col(pl.Utf8).fill_null(""))

dclong (Author) commented Jan 2, 2023

Sorry, by "no way to exclude null values" I mean at the time of parsing CSV files.

What if there are other characters (e.g., NA) that are parsed as null when reading a CSV file? Then the above code will replace both empty fields and NA fields with empty strings, right? This might not be what was expected, because users might intend to parse empty fields as empty strings but keep NA as null.

ritchie46 (Member)

If we don't set NA as a null value, we read all the values and then, in post-processing, set the proper values to null.

df = pl.DataFrame({
    "foo": ["", "foo", "NA", "Bar"],
    "bar": ["", "foo", "NA", "NA"]
})

df.with_columns(
    pl.when(pl.col(pl.Utf8).is_in(["NA", "NAN", "Null"]))
      .then(None)
      .otherwise(pl.col(pl.Utf8)).keep_name()
)
shape: (4, 2)
┌──────┬──────┐
│ foo  ┆ bar  │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│      ┆      │
│ foo  ┆ foo  │
│ null ┆ null │
│ Bar  ┆ null │
└──────┴──────┘

ritchie46 (Member)

I don't know how others feel about this, but looking at the keyword arguments in pandas, where we have default null values and then another argument that excludes those defaults, it seems like putting a query in keyword arguments.

I feel that this is what Polars does for you: you are able to define which values you want as null, which you want to exclude, etc. I think a keyword argument is worth it if we can save (a lot of) computation/memory by doing it at the scan, but when arguments become dependent on each other I become a bit hesitant.

@stinodego @alexander-beedie @ghuls @zundertj any thoughts?

alexander-beedie (Collaborator) commented Jan 2, 2023

I think I see why pandas does it; it's because they conflate NaN/NULL and have a non-trivial set of values that they consider to be NaN (‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’).

If you want to supply some additional NaN/NULL values to pandas then it's quite painful to have to pass that list around and add your own each time, so the additional parameter makes ergonomic sense for them.

However, given that we can (and do) distinguish between NaN and NULL, a dedicated keep_na_values param is less compelling, as we're not conflating all of the above. You can either pass the explicit values you want treated as NULL into the CSV-reading function, or you can do a simple post-process on the resulting frame, as described, for the edge case where empty fields are not NULL but other values are.
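As a minimal sketch of that post-processing path, here is the idea modelled with Python's stdlib csv module rather than Polars itself; the sample data and the "NA" marker are made up for illustration:

```python
import csv
import io

# Toy model of the post-processing approach: only the explicit null
# markers become None (null); empty fields survive as empty strings.
data = "foo,bar\n,\nfoo,foo\nNA,NA\nBar,NA\n"
null_markers = {"NA"}  # analogous to null_values=["NA"]

rows = []
for row in csv.DictReader(io.StringIO(data)):
    rows.append({k: (None if v in null_markers else v) for k, v in row.items()})

print(rows[0])  # {'foo': '', 'bar': ''}  -- empty fields kept as ""
print(rows[2])  # {'foo': None, 'bar': None}  -- "NA" fields mapped to null
```

The same distinction is what the Polars post-processing snippet above achieves with when/then/otherwise.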

alexander-beedie (Collaborator) commented Jan 2, 2023

Addendum: I can't think of many (any?) other use-cases where keep_na_values would be useful, except for being able to declare how empty fields are treated independently of the list of NULL values. In this case an extra parameter would not become intermingled with null_values, and it could be pushed down to the scanner.

empty_field_is_null: bool = True  # default

Maybe this is a reasonable approach? It would look something like this:

pl.scan_csv( ..., null_values=['NA'], empty_field_is_null=False )

Clear what's happening, can be made efficient, parameters are independent, and covers the given use-case in a way that would otherwise require a post-processing step.
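To make the proposed semantics concrete, here is a toy model of the flag using only the stdlib csv module; empty_field_is_null is the hypothetical parameter from the proposal above, not an actual Polars argument:

```python
import csv
import io

def read_cells(text, null_values=(), empty_field_is_null=True):
    """Toy CSV reader modelling the proposed flag: empty fields are null
    by default, independently of the explicit null_values list."""
    nulls = set(null_values)
    if empty_field_is_null:
        nulls.add("")
    return [
        [None if cell in nulls else cell for cell in row]
        for row in csv.reader(io.StringIO(text))
    ]

data = ",NA,Bar\n"
print(read_cells(data, null_values=["NA"]))
# [[None, None, 'Bar']]  -- default: empty field and "NA" both null
print(read_cells(data, null_values=["NA"], empty_field_is_null=False))
# [['', None, 'Bar']]    -- empty field kept, "NA" still null
```

Because the two parameters are independent, neither setting changes the meaning of the other, which is the property the proposal is after.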

ritchie46 (Member)

I think we can add this. It only makes sense for Utf8 columns, though. An empty integer field will map to null, and no, we will not use NaN for this. ;)

alexander-beedie (Collaborator) commented Jan 2, 2023

and no we will not use NaN for this. ;)

Lol... I was confident that we all have better taste than that :)
Maybe make it super-explicit / unambiguous by naming it empty_utf8_is_null?

zundertj (Collaborator) commented Jan 2, 2023

For my understanding, currently when we do:

pl.read_csv(...)

it is equivalent to

pl.read_csv(..., null_values=[""])

?

Should we make it explicit in the API that we treat the empty string as a null value, i.e. set a single empty string as the default value of the null_values parameter? Then, if you override it with an empty list, an empty string is parsed as an empty string rather than null. The long list that pandas has is not nice to work with, but if all we do right now is the empty string, asking users to override it is still ergonomic?
