Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map_dict - mapping a string to None converts the entire column to f64 of null #7140

Closed
2 tasks done
d-laub opened this issue Feb 24, 2023 · 10 comments
Closed
2 tasks done
Labels
bug Something isn't working python Related to Python Polars

Comments

@d-laub
Copy link

d-laub commented Feb 24, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Hi there, love polars so I wanted to report a potential bug I encountered. I tried converting a specific string to null using map_dict and this converted the column's dtype from string to f64 and made every value null. My intent was to map a set of strings to null and in retrospect this is readily doable with pl.when(...).then(...). Nevertheless, it was unexpected to see this behavior.

Reproducible example

import polars as pl

df = pl.DataFrame({
    'name': list('abcd')
})
df
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ b    │
│ c    │
│ d    │
└──────┘

df.with_columns(pl.col('name').map_dict({'b': None}))
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ f64  │
╞══════╡
│ null │
│ null │
│ null │
│ null │
└──────┘

Expected behavior

This works using pl.when(...).then(...)

df.with_columns(
    pl.when(pl.col('name') == 'b')
    .then(None)
    .otherwise(pl.col('name'))
    .alias('name')
)
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ null │
│ c    │
│ d    │
└──────┘

Installed versions

---Version info---
Polars: 0.16.8
Index type: UInt32
Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.28
Python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.0
numpy: 1.23.3
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: 3.6.1
@d-laub d-laub added bug Something isn't working python Related to Python Polars labels Feb 24, 2023
@d-laub d-laub closed this as completed Feb 24, 2023
@d-laub
Copy link
Author

d-laub commented Feb 24, 2023

Realized that map_dict takes the default parameter which is why everything became null. If anything though, maybe the change in dtype is still unexpected?

@cmdlineluser
Copy link
Contributor

I would think it is unexpected. You should probably re-open this so that it gets seen by the team.

@coinflip112
Copy link
Contributor

Maybe the change in dtype is still unexpected?

I'd say the dtype change is perhaps a bit confusing but not unexpected. Essentially the keys of the mapping dictionary govern the dtype. Your dictionary contains keys of only null which gets inferred as f64. Some details here. Not sure what the best solution is 🤷.

@ghuls
Copy link
Collaborator

ghuls commented Feb 24, 2023

It can be easily solved by assigning a typed series as a value in the mapping dict:

In [39]: df.with_columns(pl.col('name').map_dict({'b': pl.Series("", None, dtype=pl.Utf8)}))
Out[39]:
shape: (4, 1)
┌───────────┐
│ name      │
│ ---       │
│ list[str] │
╞═══════════╡
│ null      │
│ null      │
│ null      │
│ null      │
└───────────┘

@coinflip112
Copy link
Contributor

It can be easily solved by assigning a typed series as a value in the mapping dict:

I wouldn't say that. I'd say it actively hurts it. Now the dtype is list[str]. If you care about the dtype of the mapped expression and for some reason don't trust the inferred type based on the mapping dictionary keys, you should do an explicit cast as follow:

In [4]: df.with_columns(pl.col('name').map_dict({'b': None}).cast(pl.Utf8))
Out[4]: 
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ null │
│ null │
│ null │
│ null │
└──────┘

@ghuls
Copy link
Collaborator

ghuls commented Feb 24, 2023

I wouldn't say that. I'd say it actively hurts it. Now the dtype is list[str]. If you care about the dtype of the mapped expression and for some reason don't trust the inferred type based on the mapping dictionary keys, you should do an explicit cast as follow:

Oops, missed the extra list part.

@ghuls
Copy link
Collaborator

ghuls commented Feb 24, 2023

Reread the example properly this time.
As indicated by the examples in the docstring, you can use the original values for non-remapped values.

>>> df.with_columns(
...     pl.col("name").map_dict(
...         {"b": None},
...         default=pl.col("name")
...     )
... )
shape: (4, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ null │
│ c    │
│ d    │
└──────┘

@d-laub
Copy link
Author

d-laub commented Feb 24, 2023

I want to note that at this point it's clear that this isn't a bug (perhaps arguable for the potential dtype conversion, bit of a sharp corner that should at least be documented IMO), but rather an enhancement/API adjustment. Thus I'm hesitant to re-open the issue pending a change in labels.

Following up with some more context/reflections on why this was unexpected behavior for me. First off, like many others I am a pandas -> polars convert, but I have been using polars actively and almost exclusively for several months now. Nevertheless, when I first wanted to "replace X strings with null" the solution wasn't immediately obvious to me, so I googled "polars replace" and my first hit was this stackoverflow post. The updated solution pointed to the new map_dict function which was awesome! So I updated my polars version and gave it a try, admittedly without spending time to fully read the solution or RTFD and note the important distinctions between pandas.DataFrame.replace and map_dict. To me, this distinction being that map_dict needs an exhaustive mapping, whereas pandas leaves values with unknown mappings as they are by default. As someone coming from pandas who didn't read that closely when trying map_dict for the first time, I was expecting the behavior to be the same between the two functions.

On another note, I want to explain why the pl.when solution didn't come reflexively to me despite using polars exclusively for a few months now. Again, this isn't a bug per se, but I'm sharing my experience to inform API design or "polars design patterns" if you will. I've learned to reach for top level functions like pl.when, pl.concat_<x>, and the like when I am faced with a operation/transformation that I know will require combining data from multiple columns. I think of it like: multiple columns in = don't use pl.col to start the query (there are of course exceptions). In this case, I knew I wanted to operate on/transform a single column exclusively, so I started with pl.col and soon remembered pl.col(..).replace doesn't exist, leading me to the stackoverflow post as described above. All of this to say, I think it's great that map_dict is appearing as the de facto counterpart to pandas.DataFrame/Series.replace, in which case I think having the behavior more closely match pandas would be nice as well. This amounts to changing the default for the default parameter to be default=pl.col(<input column>). Speaking for myself, I more often want to replace some subset of values in a column with a non-exhaustive mapping. In the case where several columns are being mapped e.g. pl.col(<dtype, list[str], regex>), this would also be much less verbose since I think the current implementation makes it necessary to process each column individually i.e. pass a list of pl.col(...).map_dict(..., default=pl.col(...)). After changing the default argument for the default parameter, it would still be very succinct to reproduce the old behavior even for a multi-column operation by simply passing None to default.

@ghuls
Copy link
Collaborator

ghuls commented Feb 24, 2023

default=pl.first() is a nice way to keep the default value if not remapped.

@ghuls
Copy link
Collaborator

ghuls commented Mar 7, 2023

The docstring for map_dict is inproved and mentions now pl.first() more prominently: #7393

Replace values in column according to remapping dictionary.

Needs a global string cache for lazily evaluated queries on columns of
type ``pl.Categorical``.

Parameters
----------
remapping
    Dictionary containing the before/after values to map.
default
    Value to use when the remapping dict does not contain the lookup value.
    Use ``pl.first()``, to keep the original value.

``

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

4 participants