feat(rust,python): optionally treat missing UTF8 values as the empty string at CSV parse-time #6203

alexander-beedie · 2023-01-13T10:07:35Z

Closes #5984.

Explanation

When reading CSV data there is currently no easy or performant way to distinguish nulls that match an explicit list of null values from nulls that come from missing values; once loaded, they are all just null. It therefore makes sense to be able to tell them apart at parse-time (as long as there is no associated overhead).

Pandas supports this through a fairly convoluted interaction between na_values and keep_default_na, which is definitely not something we'd want to copy.

Implementation

Aside from utf8 values, there don't really seem to be any other types that would have a reasonable non-null missing value, so I've exposed it on the CSV-reading interface as missing_utf8_is_empty_string, which makes clear what it does and what it applies to (lower down this param name becomes the slightly more generic missing_is_null, which is always True unless the new param flips it for utf8).

Essentially zero overhead: the boolean is passed through the code until it can be used to flip the validity bitmap from False to True for the given missing value; no buffer-related shenanigans, etc. The only extra operation is a single field.is_empty() check, which is triggered if and only if the new flag is enabled and we've already identified a potential null.
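As a rough illustration of the decision described above, here is a minimal pure-Python sketch. It is not the Rust code in this PR: `utf8_is_valid` and `parse_row` are hypothetical names, the comma split ignores quoting, and treating an empty field as valid even when "" appears in the null list is an assumption of the sketch. Only the `missing_is_null` flag name comes from the description.

```python
from typing import List, Optional, Sequence


def utf8_is_valid(
    field: str,
    null_values: Sequence[str],
    missing_is_null: bool = True,
) -> bool:
    """Return the validity bit for one utf8 field (hypothetical helper)."""
    # Step 1: the check that exists regardless of the new flag.
    potential_null = field == "" or field in null_values
    if not potential_null:
        return True
    # Step 2: reached only for potential nulls. The extra emptiness check
    # runs only when the flag has been flipped for utf8 columns.
    if not missing_is_null and field == "":
        return True  # keep the missing value as a valid empty string
    return False


def parse_row(
    line: str,
    null_values: Sequence[str],
    missing_is_null: bool = True,
) -> List[Optional[str]]:
    """Naive comma split (no quoting support), for illustration only."""
    return [
        f if utf8_is_valid(f, null_values, missing_is_null) else None
        for f in line.split(",")
    ]
```

With null_values = ["na", r"\N"], calling parse_row(r"na,,,,\N,,", null_values, missing_is_null=False) gives None for the two explicit markers and "" for the five missing fields; with the default flag, all seven fields come back as None.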

Added several new unit tests that try to stress the potential edge-cases...

Example

from textwrap import dedent
from io import StringIO
import polars as pl

csv = StringIO(dedent(
    r"""
    a,b,c,d,e,f,g
    na,,,,\N,,
    a,\N,c,,,,g
    ,,,,,,
    ,,,na,,,
    """
))

Usual behaviour: explicit null values and missing values become null:

pl.read_csv(
    csv,
    null_values = ["na",r"\N"],
)
# ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┐
# │ a    ┆ b    ┆ c    ┆ d    ┆ e    ┆ f    ┆ g    │
# ╞══════╪══════╪══════╪══════╪══════╪══════╪══════╡
# │ null ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null │
# │ a    ┆ null ┆ c    ┆ null ┆ null ┆ null ┆ g    │
# │ null ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null │
# │ null ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null │
# └──────┴──────┴──────┴──────┴──────┴──────┴──────┘

Optional new behaviour: missing values can be read as the empty string, with only the explicitly-indicated values becoming null:

pl.read_csv(
    csv,
    null_values = ["na",r"\N"],
    missing_utf8_is_empty_string = True,
)
# ┌──────┬──────┬─────┬──────┬──────┬──────┬─────┐
# │ a    ┆ b    ┆ c   ┆ d    ┆ e    ┆ f    ┆ g   │
# ╞══════╪══════╪═════╪══════╪══════╪══════╪═════╡
# │ null ┆      ┆     ┆      ┆ null ┆      ┆     │
# │ a    ┆ null ┆ c   ┆      ┆      ┆      ┆ g   │
# │      ┆      ┆     ┆      ┆      ┆      ┆     │
# │      ┆      ┆     ┆ null ┆      ┆      ┆     │
# └──────┴──────┴─────┴──────┴──────┴──────┴─────┘


ritchie46

Looks good @alexander-beedie. I indeed expect this to not have a performance penalty. Well done! 👍

polars/polars-io/src/csv/buffer.rs

github-actions bot added the enhancement, python, and rust labels Jan 13, 2023

feat(rust,python): optionally treat missing utf8 values as the empty string at csv parse-time

c1b2841

alexander-beedie force-pushed the empty-utf8-csv-values branch from 909cd44 to c1b2841 January 13, 2023 10:20

ritchie46 reviewed Jan 13, 2023


ritchie46 merged commit a324649 into pola-rs:master Jan 13, 2023

alexander-beedie deleted the empty-utf8-csv-values branch January 13, 2023 14:05

alexander-beedie mentioned this pull request Mar 27, 2023

[feature request] Handle several null values in null_values CSV parser argument #971

Closed
