feat(rust,python): optionally treat missing UTF8 values as the empty string at CSV parse-time #6203
Closes #5984.
Explanation
When reading CSV data there is currently no easy or performant way to distinguish between `null` values that match an explicit list and `null` values that come from missing values; once loaded, they are all just `null`. Consequently, it makes sense to be able to distinguish between them at parse-time (as long as there is no associated overhead).
Pandas supports this through a fairly convoluted interaction between `na_values` and `keep_default_na`, which is definitely not something we'd want to copy.
Implementation
Aside from utf8 values, there don't really seem to be any other types that would have a reasonable non-null missing value, so I've exposed it on the CSV-reading interface as `missing_utf8_is_empty_string`, which makes clear what it does and what it applies to (lower down this param name becomes the slightly more generic `missing_is_null`, which is always True unless the new param flips it for utf8).
Essentially zero overhead; the boolean is passed through the code until it can be used to flip the validity bitmap from False to True for the given missing value; no buffer-related shenanigans, etc. The only extra operation is a single `field.is_empty()` check that is triggered iff the new flag is enabled and we've already identified a potential null.
Added several new unit tests that try to stress the potential edge-cases...
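To make the control flow described above concrete, here is a minimal pure-Python sketch of the per-field decision. It is not the Rust implementation: the function name `parse_utf8_field` and its signature are hypothetical, and `None` stands in for an unset validity bit.

```python
from typing import Optional, Sequence


def parse_utf8_field(
    field: str,
    null_values: Sequence[str],
    missing_is_null: bool = True,
) -> Optional[str]:
    """Hypothetical model of the parse-time decision for one utf8 field.

    Returns None where the validity bit would be False (a null), and the
    string itself where the validity bit would be True.
    """
    # A field is a potential null if it is missing (empty) or matches one
    # of the explicitly listed null markers.
    potential_null = field == "" or field in null_values
    if not potential_null:
        return field

    # The single extra check: it is only reached for potential nulls, and
    # only matters when the flag has been flipped for utf8 columns.
    if not missing_is_null and field == "":
        return ""  # validity flips from False to True

    return None


fields = ["x", "", "NA"]
default = [parse_utf8_field(f, ["NA"]) for f in fields]
flipped = [parse_utf8_field(f, ["NA"], missing_is_null=False) for f in fields]
print(default)
print(flipped)
```

In this model, ordinary values never reach the extra emptiness check, which mirrors the claim that the flag adds essentially no overhead on the common path.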
Example
Usual behaviour: explicit null values and missing values both become `null`.
Optional new behaviour: missing values can be read as the empty string, with only the explicitly-indicated values becoming `null`.