ENH: allow lists/sets in fillna #21329

h-vetinari · 2018-06-05T19:32:42Z

The docs for the value-parameter of fillna say (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna)

value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to
use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list. [my bold]

Frankly, I do not understand this limitation, especially because I see no way to interpret it ambiguously. There are several use cases (that I keep encountering) for filling in lists - especially empty lists.

One of the main ones is using .str.split(..., expand=False) or str.extract and then wanting to keep processing the lists - e.g. to turn them into sets:

s = pd.Series(['a,b,b,c', 'b,c,d,d,d', 1, None]).str.split(',')
s
# 0       [a, b, b, c]
# 1    [b, c, d, d, d]
# 2                NaN
# 3                NaN
# dtype: object

s.map(set)  # this errors on NaNs
# TypeError: 'float' object is not iterable

### would like to use:
s.fillna([]).map(set)
# TypeError: "value" parameter must be a scalar or dict, but you passed a "list"

### same for
s.fillna(set()).map(set)
# TypeError: "value" parameter must be a scalar or dict, but you passed a "set"

It's tedious to always do this by hand (esp. for DataFrame), like

serieswithalongname.loc[serieswithalongname.isnull()] = [] # can't be chained either

### even this doesn't work, because pd.notnull retains dimensionality -> ambiguous boolean
s.map(lambda x: set(x) if pd.notnull(x) else set())
# ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

### meaning the next step gets even more unwieldy:
s.map(lambda x: set() if isinstance(x, (float, type(None))) and pd.isnull(x) else set(x))
# 0    {c, a, b}
# 1    {c, d, b}
# 2           {}
# 3           {}
# dtype: object

### work-around even more painful for DataFrames,
### because single-value broadcast doesn't work as before
df = s.to_frame('A')
df.loc[df.A.isnull(), 'A'] = []
# ValueError: cannot copy sequence with size 0 to array axis with dimension 1

### need to know to use this:
df.loc[df.A.isnull(), 'A'] = [[]]
df.applymap(set)
#            A
# 0  {c, a, b}
# 1  {c, d, b}
# 2         {}
# 3         {}

This is also somewhat related to #19266.


jreback · 2018-06-05T21:11:40Z

@h-vetinari first class list/set support would require an extension type from the community. as written these are non-idiomatic and non-performant, as well as a headache for indexing. you are much better off NOT using things like this.

what you are asking for is pandas to infer even more than it already does. this is not likely to happen. it's already way too magical.

h-vetinari · 2018-06-05T21:17:24Z

@jreback How is .fillna([]) asking to infer anything? It just says: take the argument to this function - in this case [] -, and insert it into all the places that are pd.isnull(). Fair enough that dicts are exempt from this because they get interpreted, but why does this have to affect lists?

TomAugspurger · 2018-06-05T21:26:28Z

FWIW, when I refactored that code to share it w/ categorical, I couldn't really figure out why we were limiting things to dicts.

My best guess is it's because of a potential ambiguity between whether the sequence should be "elementwise" (use value[i] to fill self[i]) or whether the sequence should be treated as a scalar (fill each NA of self with value).
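To make the two readings concrete, here is a minimal sketch. The helper names `fill_as_scalar` and `fill_elementwise` are hypothetical and not part of pandas; they only illustrate what each interpretation would produce for the same input.

```python
import pandas as pd


def fill_as_scalar(s, value):
    """Hypothetical helper: put the same `value` into every missing slot."""
    mask = s.isna()
    filled = [value if missing else v for v, missing in zip(s, mask)]
    return pd.Series(filled, index=s.index, dtype=object)


def fill_elementwise(s, values):
    """Hypothetical helper: use values[i] to fill position i when s is missing there."""
    if len(values) != len(s):
        raise ValueError("values must have the same length as the Series")
    mask = s.isna()
    filled = [values[i] if missing else v
              for i, (v, missing) in enumerate(zip(s, mask))]
    return pd.Series(filled, index=s.index, dtype=object)


s = pd.Series([1.0, None, None])
scalar_result = fill_as_scalar(s, [10, 20, 30])
elementwise_result = fill_elementwise(s, [10, 20, 30])

print(scalar_result.tolist())       # [1.0, [10, 20, 30], [10, 20, 30]]
print(elementwise_result.tolist())  # [1.0, 20, 30]
```

The same list argument gives two different results, which is the ambiguity described above.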

h-vetinari · 2018-06-07T06:16:51Z

@TomAugspurger

Thanks for the input. I think this ambiguity is not so serious. First off self[i] for i in range(len(value)) does not work for a DF, but more importantly, one can immediately see the effect that the whole list gets filled into every NaN. And with a quick look at the docs, it's clear that a dictionary is needed to distinguish fill-values by column.

Furthermore, since lists throw errors currently, allowing this would not break any existing code. I think it would be very useful...
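As a sketch of the "treat the list as a scalar" behaviour requested here, the helper below fills every missing slot and can be chained with `.pipe`. The name `fillna_with_factory` is hypothetical, not a pandas API. It takes a factory (such as `list` or `set`) rather than a value, on the assumption that each slot should get its own object; inserting one shared list everywhere would mean that mutating it in one row changes all filled rows.

```python
import pandas as pd


def fillna_with_factory(s, factory):
    """Hypothetical helper: replace each missing entry with a fresh factory() result."""
    mask = s.isna()  # element-wise; list entries are not treated as missing
    filled = [factory() if missing else v for v, missing in zip(s, mask)]
    return pd.Series(filled, index=s.index, name=s.name, dtype=object)


s = pd.Series(["a,b,b,c", "b,c,d,d,d", 1, None]).str.split(",")

# Chainable equivalent of the desired s.fillna([]).map(set)
sets = s.pipe(fillna_with_factory, list).map(set)

filled = fillna_with_factory(s, list)
# Each missing slot received its own list object
independent = filled.iloc[2] is not filled.iloc[3]
```

This runs a Python-level loop, so it is a convenience sketch rather than a performance claim.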

ODemidenko · 2018-11-01T08:20:08Z

Another issue for us is that it is unclear how to implement this efficiently with a custom function. So supporting filling with collections (lists, sets, dicts) would also resolve a performance issue here.

Otherwise we use a function like this one, which is likely to be slower than a potential built-in implementation:

def replace_nan(df, col, what):
    nans = df[col].isnull()
    df.loc[nans, col] = [what for isnan in nans.values if isnan]
    return df

We have so many cases in our codebase where we needed to do df.fillna({'col_name': []}) that doing it more efficiently would be really noticeable for us.
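For illustration, a dict-driven variant in the spirit of `df.fillna({'col_name': []})` might look like the sketch below. `fillna_collections` is a hypothetical name, and the mapping goes from column name to a factory (an assumption made here so that each cell gets its own object). It returns a new DataFrame instead of modifying the input, and it still loops in Python, so it does not address the performance concern.

```python
import pandas as pd


def fillna_collections(df, factories):
    """Hypothetical helper: fill missing cells per column with fresh factory() results.

    `factories` maps column name -> zero-argument callable, e.g. {"tags": list}.
    Columns not listed are left untouched.
    """
    out = df.copy()
    for col, factory in factories.items():
        mask = out[col].isna()
        out[col] = pd.Series(
            [factory() if missing else v for v, missing in zip(out[col], mask)],
            index=out.index,
            dtype=object,
        )
    return out


df = pd.DataFrame({"tags": [["a", "b"], None], "n": [1.0, None]})
filled = fillna_collections(df, {"tags": list})
print(filled["tags"].tolist())  # [['a', 'b'], []]
```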

lordgrenville · 2019-01-16T08:08:50Z

Slightly different from OP's use-case, but I wanted to fillna() with a list of values by iterating through the list and using them one by one. Got some ideas on how to do it in this SO thread.
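A minimal sketch of that sequential reading, where the i-th missing entry takes the i-th replacement value. `fillna_from_sequence` is a hypothetical helper written for this illustration; it is not taken from the linked thread and is not a pandas function.

```python
import pandas as pd


def fillna_from_sequence(s, values):
    """Hypothetical helper: fill the i-th missing entry of s with values[i]."""
    mask = s.isna()
    n_missing = int(mask.sum())
    if len(values) < n_missing:
        raise ValueError(f"need at least {n_missing} values, got {len(values)}")
    replacements = iter(values)
    # next() is only called for missing positions, so values are consumed in order
    filled = [next(replacements) if missing else v for v, missing in zip(s, mask)]
    return pd.Series(filled, index=s.index, name=s.name)


s = pd.Series([1.0, None, 3.0, None])
out = fillna_from_sequence(s, [20.0, 40.0])
print(out.tolist())  # [1.0, 20.0, 3.0, 40.0]
```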

goerlitz · 2020-06-11T21:20:30Z

I often have to deal with data where a table column contains a string with comma-separated values or NAs. So having a fix for this issue would be highly appreciated!

I painfully realized that

pd.Series(['a,b,b,c', None]).str.split(',').fillna([])

throws an error and

pd.Series(['a,b,b,c', None]).fillna('').str.split(',')

returns a list with an empty string but not an empty list. :(

So I ended up with

pd.Series(['a,b,b,c', None]).str.split(',').map(lambda x: x if isinstance(x, list) else [])

which is not pretty but does the job.

I don't know if this has decent performance - and I don't really care, because all I need is to declare several data transformations in a chained fashion, where each transformation can be easily expressed on a single line (without defining several other helper functions or the like).

nick-trustlab · 2023-11-29T11:06:29Z

Here is my ugly but performant hack to .fillna([]):

import numpy as np
col_name = 'follower'
df.loc[df[col_name].isna(), col_name] = np.empty((df[col_name].isna().sum(), 0)).tolist()

Or as a function:

def fill_nan_with_empty_list(df, col):
    df.loc[df[col].isna(),col] = np.empty((df[col].isna().sum(), 0)).tolist()

fill_nan_with_empty_list(users, 'follower')
fill_nan_with_empty_list(users, 'following')

though it doesn't always work...
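For readers wondering what the right-hand side of that assignment is: `np.empty((n, 0)).tolist()` builds a list of n separate empty lists, one per missing row. The snippet below only demonstrates that building block; it does not try to reproduce or diagnose the cases where the assignment fails.

```python
import numpy as np

n_missing = 3
fill = np.empty((n_missing, 0)).tolist()
print(fill)  # [[], [], []]

# The inner lists are separate objects, so appending to one leaves the others empty
fill[0].append("x")
print(fill)  # [['x'], [], []]

# With no missing rows the right-hand side is just an empty list
empty_case = np.empty((0, 0)).tolist()
print(empty_case)  # []
```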

gfyoung added the Missing-data and Enhancement labels on Jun 6, 2018

jbrockmendel added the Nested Data label on Sep 22, 2020
