Add an option to exclude default null values when parsing CSV files #5984
Comments
There is. I think we should not conflate data processing with data reading too much. The intent of the following query is much clearer to me than a keyword argument is:

```python
pl.read_csv(..).with_columns(pl.col(pl.Utf8).fill_null(""))
```
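That post-processing idea can be sketched in plain Python without Polars (the function name and sample rows below are hypothetical, for illustration only):

```python
# Plain-Python sketch of the fill_null("") post-processing step
# (in Polars this is pl.col(pl.Utf8).fill_null("")).
def fill_null_strings(rows):
    """Replace None with "" in every field of a list of row dicts."""
    return [
        {key: ("" if value is None else value) for key, value in row.items()}
        for row in rows
    ]

rows = [{"foo": None, "bar": "x"}, {"foo": "y", "bar": None}]
print(fill_null_strings(rows))
# → [{'foo': '', 'bar': 'x'}, {'foo': 'y', 'bar': ''}]
```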
Sorry, by "no way to exclude null values" I mean at the time of parsing CSV files. What if there are other characters (e.g., NA) that are parsed as null values?
If we don't set it, we can do:

```python
df = pl.DataFrame({
    "foo": ["", "foo", "NA", "Bar"],
    "bar": ["", "foo", "NA", "NA"]
})
df.with_columns(
    pl.when(pl.col(pl.Utf8).is_in(["NA", "NAN", "Null"]))
    .then(None)
    .otherwise(pl.col(pl.Utf8)).keep_name()
)
```
I don't know how others feel about this, but looking at the keyword arguments in pandas, where we have default null values and then another argument that excludes those default values, seems like putting a query in keyword arguments. I feel that this is what Polars does for you: you are able to define which values you want as null, which you want to exclude, etc. I think a keyword argument is worth it if we can save (a lot of) computation/memory by doing it at the scan, but when arguments become dependent on each other I become a bit hesitant. @stinodego @alexander-beedie @ghuls @zundertj any thoughts?
I think I see why pandas does it; it's because they conflate NaN/NULL and have a non-trivial set of values that they consider to be NaN. If you want to supply some additional NaN/NULL values to pandas then it's quite painful to have to pass that list around and add your own each time, so the additional parameter makes ergonomic sense for them. However, given that we can (and do) distinguish between NaN/NULL, a dedicated parameter is less obviously necessary for us.
Addendum: I can't think of many (any?) other use-cases where this would come up.

```python
empty_field_is_null: bool = True  # default
```

Maybe this is a reasonable approach?

```python
pl.scan_csv(
    ...,
    null_values=['NA'],
    empty_field_is_null=False
)
```

Clear what's happening, can be made efficient, parameters are independent, and covers the given use-case in a way that would otherwise require a post-processing step.
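A stdlib-only sketch of those proposed semantics (the function name `read_csv_nulls` is hypothetical and this is not the Polars implementation, just an illustration of how the two parameters stay independent):

```python
import csv
import io

def read_csv_nulls(text, null_values=(), empty_field_is_null=True):
    """Parse CSV text, mapping configured null markers (and, optionally,
    empty fields) to None. Sketch of the proposed semantics only."""
    nulls = set(null_values)
    if empty_field_is_null:
        nulls.add("")
    reader = csv.reader(io.StringIO(text))
    return [
        [None if field in nulls else field for field in row]
        for row in reader
    ]

data = "foo,bar\n,NA\nx,y\n"
# "NA" becomes None, but the empty field stays an empty string:
print(read_csv_nulls(data, null_values=["NA"], empty_field_is_null=False))
# → [['foo', 'bar'], ['', None], ['x', 'y']]
```

Each parameter does one thing: `null_values` opts extra markers in, `empty_field_is_null` opts the empty-string default out, and neither changes the meaning of the other.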
I think we can add this. This only makes sense for string (Utf8) columns, I believe.
Lol... I had confidence that we all have better taste than that :)
For my understanding: currently, when we do `pl.read_csv(...)`, is it equivalent to `pl.read_csv(..., null_values=[""])`? Should we make it explicit in the API that we have the empty string as a null value, i.e. set a single empty string as the default value for the `null_values` argument?
Problem description
`pandas.read_csv` has an option `keep_default_na` which allows users to exclude default null values (by `keep_default_na=False`). I do think this is an important option that should be added to Polars. A related issue is #2769. Actually, sometimes people do want to parse empty fields as empty strings instead of null values. However, there's currently no way to exclude null values in Polars.
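For reference, the pandas behavior being discussed can be approximated in plain Python (a sketch; `DEFAULT_NA` below is only a subset of pandas' actual default NA marker set, and `parse_field` is a hypothetical helper):

```python
# A subset of pandas' default NA markers, for illustration only.
DEFAULT_NA = {"", "NA", "N/A", "NaN", "null", "NULL"}

def parse_field(field, keep_default_na=True, na_values=()):
    """Return None if the field should be treated as missing,
    mimicking the keep_default_na / na_values interaction."""
    nulls = set(na_values)
    if keep_default_na:
        nulls |= DEFAULT_NA
    return None if field in nulls else field

print(parse_field("NA"))                         # → None
print(parse_field("NA", keep_default_na=False))  # → NA
```

Note how the two arguments interact: `na_values` is merged with the defaults unless `keep_default_na=False` disables them, which is exactly the kind of inter-dependent keyword behavior the discussion above is wary of.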