map_dict - mapping a string to None converts the entire column to f64 of null #7140

d-laub · 2023-02-24T02:02:36Z

Polars version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Issue description

Hi there, love polars so I wanted to report a potential bug I encountered. I tried converting a specific string to null using map_dict and this converted the column's dtype from string to f64 and made every value null. My intent was to map a set of strings to null and in retrospect this is readily doable with pl.when(...).then(...). Nevertheless, it was unexpected to see this behavior.

Reproducible example

import polars as pl

df = pl.DataFrame({
    'name': list('abcd')
})
df
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ b    │
│ c    │
│ d    │
└──────┘

df.with_columns(pl.col('name').map_dict({'b': None}))
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ f64  │
╞══════╡
│ null │
│ null │
│ null │
│ null │
└──────┘

Expected behavior

This works using pl.when(...).then(...)

df.with_columns(
    pl.when(pl.col('name') == 'b')
    .then(None)
    .otherwise(pl.col('name'))
    .alias('name')
)
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ null │
│ c    │
│ d    │
└──────┘

Installed versions

---Version info---
Polars: 0.16.8
Index type: UInt32
Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.28
Python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.0
numpy: 1.23.3
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: 3.6.1

The text was updated successfully, but these errors were encountered:

d-laub · 2023-02-24T02:06:38Z

Realized that map_dict takes the default parameter which is why everything became null. If anything though, maybe the change in dtype is still unexpected?

cmdlineluser · 2023-02-24T13:33:17Z

I would think it is unexpected. You should probably re-open this so that it gets seen by the team.

coinflip112 · 2023-02-24T14:23:11Z

Maybe the change in dtype is still unexpected?

I'd say the dtype change is perhaps a bit confusing but not unexpected. Essentially the keys of the mapping dictionary govern the dtype. Your dictionary contains keys of only null which gets inferred as f64. Some details here. Not sure what the best solution is 🤷.

ghuls · 2023-02-24T14:36:50Z

It can be easily solved by assigning a typed series as a value in the mapping dict:

In [39]: df.with_columns(pl.col('name').map_dict({'b': pl.Series("", None, dtype=pl.Utf8)}))
Out[39]:
shape: (4, 1)
┌───────────┐
│ name      │
│ ---       │
│ list[str] │
╞═══════════╡
│ null      │
│ null      │
│ null      │
│ null      │
└───────────┘

coinflip112 · 2023-02-24T14:47:43Z

It can be easily solved by assigning a typed series as a value in the mapping dict:

I wouldn't say that. I'd say it actively hurts it. Now the dtype is list[str]. If you care about the dtype of the mapped expression and for some reason don't trust the inferred type based on the mapping dictionary keys, you should do an explicit cast as follow:

In [4]: df.with_columns(pl.col('name').map_dict({'b': None}).cast(pl.Utf8))
Out[4]: 
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ null │
│ null │
│ null │
│ null │
└──────┘

ghuls · 2023-02-24T14:58:52Z

I wouldn't say that. I'd say it actively hurts it. Now the dtype is list[str]. If you care about the dtype of the mapped expression and for some reason don't trust the inferred type based on the mapping dictionary keys, you should do an explicit cast as follow:

Oops, missed the extra list part.

ghuls · 2023-02-24T15:05:25Z

Reread the example properly this time.
As indicated by the examples in the docstring, you can use the original values for non-remapped values.

>>> df.with_columns(
...     pl.col("name").map_dict(
...         {"b": None},
...         default=pl.col("name")
...     )
... )
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ null │
│ c    │
│ d    │
└──────┘

d-laub · 2023-02-24T16:02:02Z

I want to note that at this point ~~it's clear that this isn't a bug~~ (perhaps arguable for the potential dtype conversion, bit of a sharp corner that should at least be documented IMO), but rather an enhancement/API adjustment. Thus I'm hesitant to re-open the issue pending a change in labels.

Following up with some more context/reflections on why this was unexpected behavior for me. First off, like many others I am a pandas -> polars convert, but I have been using polars actively and almost exclusively for several months now. Nevertheless, when I first wanted to "replace X strings with null" the solution wasn't immediately obvious to me, so I googled "polars replace" and my first hit was this stackoverflow post. The updated solution pointed to the new map_dict function which was awesome! So I updated my polars version and gave it a try, admittedly without spending time to fully read the solution or RTFD and note the important distinctions between pandas.DataFrame.replace and map_dict. To me, this distinction being that map_dict needs an exhaustive mapping, whereas pandas leaves values with unknown mappings as they are by default. As someone coming from pandas who didn't read that closely when trying map_dict for the first time, I was expecting the behavior to be the same between the two functions.

On another note, I want to explain why the pl.when solution didn't come reflexively to me despite using polars exclusively for a few months now. Again, this isn't a bug per se, but I'm sharing my experience to inform API design or "polars design patterns" if you will. I've learned to reach for top level functions like pl.when, pl.concat_<x>, and the like when I am faced with a operation/transformation that I know will require combining data from multiple columns. I think of it like: multiple columns in = don't use pl.col to start the query (there are of course exceptions). In this case, I knew I wanted to operate on/transform a single column exclusively, so I started with pl.col and soon remembered pl.col(..).replace doesn't exist, leading me to the stackoverflow post as described above. All of this to say, I think it's great that map_dict is appearing as the de facto counterpart to pandas.DataFrame/Series.replace, in which case I think having the behavior more closely match pandas would be nice as well. This amounts to changing the default for the default parameter to be default=pl.col(<input column>). Speaking for myself, I more often want to replace some subset of values in a column with a non-exhaustive mapping. In the case where several columns are being mapped e.g. pl.col(<dtype, list[str], regex>), this would also be much less verbose since I think the current implementation makes it necessary to process each column individually i.e. pass a list of pl.col(...).map_dict(..., default=pl.col(...)). After changing the default argument for the default parameter, it would still be very succinct to reproduce the old behavior even for a multi-column operation by simply passing None to default.

ghuls · 2023-02-24T17:45:03Z

default=pl.first() is a nice way to keep the default value if not remapped.

ghuls · 2023-03-07T14:25:57Z

The docstring for map_dict is inproved and mentions now pl.first() more prominently: #7393

Replace values in column according to remapping dictionary.

Needs a global string cache for lazily evaluated queries on columns of
type ``pl.Categorical``.

Parameters
----------
remapping
    Dictionary containing the before/after values to map.
default
    Value to use when the remapping dict does not contain the lookup value.
    Use ``pl.first()``, to keep the original value.

``

d-laub added bug Something isn't working python Related to Python Polars labels Feb 24, 2023

d-laub closed this as completed Feb 24, 2023

cmdlineluser mentioned this issue Feb 24, 2023

fix(python): allow map_dict on categorical dtype #7097

Merged

ghuls mentioned this issue Mar 6, 2023

Review experimental functionality #7349

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

map_dict - mapping a string to None converts the entire column to f64 of null #7140

map_dict - mapping a string to None converts the entire column to f64 of null #7140

d-laub commented Feb 24, 2023

d-laub commented Feb 24, 2023

cmdlineluser commented Feb 24, 2023

coinflip112 commented Feb 24, 2023

ghuls commented Feb 24, 2023

coinflip112 commented Feb 24, 2023

ghuls commented Feb 24, 2023

ghuls commented Feb 24, 2023 •

edited

d-laub commented Feb 24, 2023 •

edited

ghuls commented Feb 24, 2023

ghuls commented Mar 7, 2023

map_dict - mapping a string to None converts the entire column to f64 of null #7140

map_dict - mapping a string to None converts the entire column to f64 of null #7140

Comments

d-laub commented Feb 24, 2023

Polars version checks

Issue description

Reproducible example

Expected behavior

Installed versions

d-laub commented Feb 24, 2023

cmdlineluser commented Feb 24, 2023

coinflip112 commented Feb 24, 2023

ghuls commented Feb 24, 2023

coinflip112 commented Feb 24, 2023

ghuls commented Feb 24, 2023

ghuls commented Feb 24, 2023 • edited

d-laub commented Feb 24, 2023 • edited

ghuls commented Feb 24, 2023

ghuls commented Mar 7, 2023

ghuls commented Feb 24, 2023 •

edited

d-laub commented Feb 24, 2023 •

edited