
Add an option to exclude default null values when parsing CSV files #5984

Closed
dclong opened this issue Jan 2, 2023 · 9 comments · Fixed by #6203
Labels
enhancement New feature or an improvement of an existing feature

Comments

dclong commented Jan 2, 2023

Problem description

pandas.read_csv has an option keep_default_na which allows users to exclude the default null values (by passing keep_default_na=False). I think this is an important option that should be added to Polars. A related issue is #2769. In practice, people sometimes want to parse empty fields as empty strings instead of null values; however, there's currently no way to exclude null values in Polars.

@dclong dclong added the enhancement New feature or an improvement of an existing feature label Jan 2, 2023
ritchie46 (Member) commented Jan 2, 2023

However, there's no way to exclude null values in Polars currently.

There is. I think we should not conflate data processing with data reading too much. The intent of the following query is much clearer to me than a keyword argument is.

pl.read_csv(..).with_columns(pl.col(pl.Utf8).fill_null(""))

dclong (Author) commented Jan 2, 2023

Sorry, by "no way to exclude null values" I mean at the time of parsing CSV files.

What if there are other characters (e.g., NA) that are parsed as null when reading a CSV file? Then the above code will replace both empty fields and NA fields with empty strings, right? This might not be what was expected, because users might intend to parse empty fields as empty strings but keep NA as null.

ritchie46 (Member)

If we don't set NA as a null value, we read all the values and then, in post-processing, set the proper values to null.

df = pl.DataFrame({
    "foo": ["", "foo", "NA", "Bar"],
    "bar": ["", "foo", "NA", "NA"]
})

df.with_columns(
    pl.when(pl.col(pl.Utf8).is_in(["NA", "NAN", "Null"]))
      .then(None)
      .otherwise(pl.col(pl.Utf8)).keep_name()
)
shape: (4, 2)
┌──────┬──────┐
│ foo  ┆ bar  │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│      ┆      │
│ foo  ┆ foo  │
│ null ┆ null │
│ Bar  ┆ null │
└──────┴──────┘

ritchie46 (Member)

I don't know how others feel about this, but looking at the keyword arguments in pandas, where we have default null values and then another argument that excludes those defaults, it seems like putting a query in keyword arguments.

I feel that this is what Polars does for you: you are able to define which values you want as null, which you want to exclude, etc. I think a keyword argument is worth it if we can save (a lot of) computation/memory by doing it at the scan, but when arguments become dependent on each other I become a bit hesitant.

@stinodego @alexander-beedie @ghuls @zundertj any thoughts?

alexander-beedie (Collaborator) commented Jan 2, 2023

I think I see why pandas does it; it's because they conflate NaN/NULL and have a non-trivial set of values that they consider to be NaN (‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’).

If you want to supply some additional NaN/NULL values to pandas then it's quite painful to have to pass that list around and add your own each time, so the additional parameter makes ergonomic sense for them.

However, given that we can (and do) distinguish between NaN and NULL, a dedicated keep_na_values param is less compelling, as we're not conflating all of the above. You can either pass the explicit values you want treated as NULL into the CSV-reading function, or you can do a simple post-process on the resulting frame, as described, for the edge case where empty fields are not NULL but other values are.
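As a minimal sketch of that post-processing path, here is the idea modelled with Python's stdlib csv module rather than Polars itself; the sample data and the "NA" marker are made up for illustration:

```python
import csv
import io

# Toy model of the post-processing approach: only the explicit null
# markers become None (null); empty fields survive as empty strings.
data = "foo,bar\n,\nfoo,foo\nNA,NA\nBar,NA\n"
null_markers = {"NA"}  # analogous to null_values=["NA"]

rows = []
for row in csv.DictReader(io.StringIO(data)):
    rows.append({k: (None if v in null_markers else v) for k, v in row.items()})

print(rows[0])  # {'foo': '', 'bar': ''}  -- empty fields kept as ""
print(rows[2])  # {'foo': None, 'bar': None}  -- "NA" fields mapped to null
```

The same distinction is what the Polars post-processing snippet above achieves with when/then/otherwise.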

alexander-beedie (Collaborator) commented Jan 2, 2023

Addendum: I can't think of many (any?) other use-cases where keep_na_values would be useful, except for being able to declare how empty fields are treated independently of the list of NULL values. In this case an extra parameter would not become intermingled with null_values, and it could be pushed down to the scanner.

empty_field_is_null: bool = True  # default

Maybe this is a reasonable approach? It would look something like this:

pl.scan_csv( ..., null_values=['NA'], empty_field_is_null=False )

Clear what's happening, can be made efficient, parameters are independent, and covers the given use-case in a way that would otherwise require a post-processing step.
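To make the proposed semantics concrete, here is a toy model of the flag using only the stdlib csv module; empty_field_is_null is the hypothetical parameter from the proposal above, not an actual Polars argument:

```python
import csv
import io

def read_cells(text, null_values=(), empty_field_is_null=True):
    """Toy CSV reader modelling the proposed flag: empty fields are null
    by default, independently of the explicit null_values list."""
    nulls = set(null_values)
    if empty_field_is_null:
        nulls.add("")
    return [
        [None if cell in nulls else cell for cell in row]
        for row in csv.reader(io.StringIO(text))
    ]

data = ",NA,Bar\n"
print(read_cells(data, null_values=["NA"]))
# [[None, None, 'Bar']]  -- default: empty field and "NA" both null
print(read_cells(data, null_values=["NA"], empty_field_is_null=False))
# [['', None, 'Bar']]    -- empty field kept, "NA" still null
```

Because the two parameters are independent, neither setting changes the meaning of the other, which is the property the proposal is after.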

ritchie46 (Member)

I think we can add this. It only makes sense for Utf8 columns, though. An empty integer field will map to null, and no, we will not use NaN for this. ;)

alexander-beedie (Collaborator) commented Jan 2, 2023

and no we will not use NaN for this. ;)

Lol... I was confident that we all have better taste than that :)
Maybe make it super-explicit / unambiguous by naming it empty_utf8_is_null?

zundertj (Collaborator) commented Jan 2, 2023

For my understanding, currently when we do:

pl.read_csv(...)

it is equivalent to

pl.read_csv(..., null_values=[""])

?

Should we make it explicit in the API that we treat the empty string as a null value, i.e. set a single empty string as the default value of the null_values parameter? Then, if you override it with an empty list, an empty string is parsed as an empty string rather than null. The long list that pandas has is not nice to work with, but if all we do right now is the empty string, asking users to override it is still ergonomic?
