ENH: allow lists/sets in fillna #21329

Open
h-vetinari opened this issue Jun 5, 2018 · 8 comments
Labels
Enhancement; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.)

Comments

@h-vetinari
Contributor

The docs for the value-parameter of fillna say (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna)

value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to
use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list. [my bold]

Frankly, I do not understand this limitation, especially because I see no way to interpret it ambiguously. There are several use cases (that I keep encountering) for filling in lists - especially empty lists.

One of the main ones is using .str.split(..., expand=False) or str.extract and wanting to keep processing the lists - e.g. turn them into sets:

s = pd.Series(['a,b,b,c', 'b,c,d,d,d', 1, None]).str.split(',')
s
# 0       [a, b, b, c]
# 1    [b, c, d, d, d]
# 2                NaN
# 3                NaN
# dtype: object

s.map(set)  # this errors on NaNs
# TypeError: 'float' object is not iterable

### would like to use:
s.fillna([]).map(set)
# TypeError: "value" parameter must be a scalar or dict, but you passed a "list"

### same for
s.fillna(set()).map(set)
# TypeError: "value" parameter must be a scalar or dict, but you passed a "set"

It's tedious to always do this by hand (esp. for DataFrame), like

serieswithalongname.loc[serieswithalongname.isnull()] = [] # can't be chained either

### even this doesn't work, because pd.notnull retains dimensionality -> ambiguous boolean
s.map(lambda x: set(x) if pd.notnull(x) else set())
# ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

### meaning the next step gets even more unwieldy:
s.map(lambda x: set() if isinstance(x, (float, type(None))) and pd.isnull(x) else set(x))
# 0    {c, a, b}
# 1    {c, d, b}
# 2           {}
# 3           {}
# dtype: object

### work-around even more painful for DataFrames,
### because single-value broadcast doesn't work as before
df = s.to_frame('A')
df.loc[df.A.isnull(), 'A'] = []
# ValueError: cannot copy sequence with size 0 to array axis with dimension 1

### need to know to use this:
df.loc[df.A.isnull(), 'A'] = [[]]
df.applymap(set)
#            A
# 0  {c, a, b}
# 1  {c, d, b}
# 2         {}
# 3         {}

This is also somewhat related to #19266.
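
Until something like this is supported, the chained usage wished for above can be approximated with a small helper passed to .pipe. This is only a sketch: `fill_missing_with` is a hypothetical name, not part of pandas, and it assumes that calling a factory (so every row gets its own fresh list or set) is the desired behavior.

```python
import pandas as pd
from pandas.api.types import is_scalar


def fill_missing_with(series, factory=list):
    """Return a copy of `series` where each missing entry is replaced by factory().

    Hypothetical helper, not a pandas API. Calling the factory once per
    missing entry avoids sharing one mutable object across rows.
    """
    values = [
        # is_scalar() guards pd.isna(), which would return an array for a list
        factory() if is_scalar(v) and pd.isna(v) else v
        for v in series
    ]
    return pd.Series(values, index=series.index, name=series.name, dtype=object)


s = pd.Series(["a,b,b,c", "b,c,d,d,d", None]).str.split(",")
as_sets = s.pipe(fill_missing_with).map(set)
```

Passing `factory=set` fills with empty sets instead. It loops in Python, so it addresses the chaining problem rather than performance.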

@jreback
Contributor

jreback commented Jun 5, 2018

@h-vetinari first class list/set support would require an extension type from the community. as written these are non-idiomatic and non-performant, as well as a headache for indexing. you are much better off NOT using things like this.

what you are asking for is pandas to infer even more than it already does. this is not likely to happen. it's already way too magical.

@h-vetinari
Contributor Author

@jreback How is .fillna([]) asking to infer anything? It just says: take the argument to this function - in this case [] -, and insert it into all the places that are pd.isnull(). Fair enough that dicts are exempt from this because they get interpreted, but why does this have to affect lists?

@TomAugspurger
Contributor

FWIW, when I refactored that code to share it w/ categorical, I couldn't really figure out why we were limiting things to dicts.

My best guess is it's because of a potential ambiguity between whether the sequence should be "elementwise" (use value[i] to fill self[i]) or whether the sequence should be treated as a scalar (fill each NA of self with value).
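
To make the two readings concrete, here is a small sketch on made-up data. The elementwise reading is written with a Series argument, which fillna already accepts and aligns by index; the scalar reading is emulated by hand, since fillna rejects a list.

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None])
value = [10.0, 20.0, 30.0, 40.0]

# Reading 1, "elementwise": value[i] fills self[i] wherever self[i] is missing.
# Wrapping the list in a Series with the same index expresses this today.
elementwise = s.fillna(pd.Series(value, index=s.index))

# Reading 2, "scalar": the whole list goes into every missing slot.
# Emulated manually; each slot gets its own copy of the list.
as_scalar = pd.Series(
    [list(value) if is_na else x for x, is_na in zip(s, s.isna())],
    index=s.index,
    dtype=object,
)
```

Here `elementwise` holds 1.0, 20.0, 3.0, 40.0, while `as_scalar` holds the full four-element list at positions 1 and 3.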

gfyoung added the Missing-data and Enhancement labels on Jun 6, 2018
@h-vetinari
Contributor Author

@TomAugspurger

Thanks for the input. I think this ambiguity is not so serious. First off self[i] for i in range(len(value)) does not work for a DF, but more importantly, one can immediately see the effect that the whole list gets filled into every NaN. And with a quick look at the docs, it's clear that a dictionary is needed to distinguish fill-values by column.

Furthermore, since lists throw errors currently, allowing this would not break any existing code. I think it would be very useful...

@ODemidenko

Another issue for us is that it is unclear how to implement it efficiently with a custom function. So, supporting filling with collections (lists, sets, dicts) would also resolve a performance issue here.

Otherwise we are using a function like this, which is likely to be slower than a potential standard implementation:

def replace_nan(df, col, what):
    nans = df[col].isnull()
    df.loc[nans, col] = [what for isnan in nans.values if isnan]
    return df

We have so many cases in our codebase where we needed to do df.fillna({'col_name': []}) that doing it more efficiently would be really noticeable for us.
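
For illustration, the dict-per-column form could be emulated roughly as below. This is a sketch only: `fillna_collections` is a hypothetical helper, it takes a factory per column rather than a literal `[]`, and it still loops in Python, so it does not deliver the speed-up a built-in implementation might.

```python
import pandas as pd
from pandas.api.types import is_scalar


def fillna_collections(df, factories):
    """Return a copy of `df` with missing values filled per column.

    `factories` maps column name -> zero-argument callable (e.g. list, set).
    Hypothetical helper, not a pandas API; columns not listed are untouched.
    """
    out = df.copy()
    for col, factory in factories.items():
        filled = [
            factory() if is_scalar(v) and pd.isna(v) else v
            for v in out[col]
        ]
        # Build an object Series explicitly so nested lists stay as elements
        out[col] = pd.Series(filled, index=out.index, dtype=object)
    return out


df = pd.DataFrame({"A": [["x"], None], "B": [None, {"y"}], "C": [1.0, None]})
result = fillna_collections(df, {"A": list, "B": set})
```

Unlike replace_nan above, this returns a new frame and leaves the input unchanged, which keeps it usable in a chain via `df.pipe(fillna_collections, {...})`.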

@lordgrenville
Contributor

Slightly different from OP's use-case, but I wanted to fillna() with a list of values by iterating through the list and using them one by one. Got some ideas on how to do it in this SO thread.
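
One way that sequential fill might look, as a sketch and not necessarily what the linked thread proposes: `fill_sequentially` is a hypothetical helper that assumes scalar fill values and exactly one value per missing entry.

```python
import pandas as pd


def fill_sequentially(series, values):
    """Fill the k-th missing entry of `series` with values[k].

    Hypothetical helper, not a pandas API. Intended for scalar fill values;
    raises if the number of values does not match the number of gaps.
    """
    mask = series.isna()
    n_missing = int(mask.sum())
    if n_missing != len(values):
        raise ValueError(
            f"expected {n_missing} fill values, got {len(values)}"
        )
    out = series.copy()
    # Boolean-mask assignment consumes the values in order of appearance
    out.loc[mask] = list(values)
    return out


s = pd.Series([1.0, None, 3.0, None])
filled = fill_sequentially(s, [20.0, 40.0])
```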

@goerlitz

I often have to deal with data where a table column contains a string with comma-separated values or NAs. So having a fix for this issue would be highly appreciated!

I painfully realized that

pd.Series(['a,b,b,c', None]).str.split(',').fillna([])

throws an error and

pd.Series(['a,b,b,c', None]).fillna('').str.split(',')

returns a list with an empty string but not an empty list. :(

So I ended up with

pd.Series(['a,b,b,c', None]).str.split(',').map(lambda x: x if isinstance(x, list) else [])

which is not pretty but does the job.

I don't know if this has decent performance - and I don't really care, because all I need is to declare several data transformations in a chained fashion where each transformation can be easily expressed on a single line (without defining several other helper functions or the like).

jbrockmendel added the Nested Data label on Sep 22, 2020
@nick-trustlab

nick-trustlab commented Nov 29, 2023

Here is my ugly but performant hack to .fillna([]):

col_name = 'follower'
df.loc[df[col_name].isna(), col_name] = np.empty((df[col_name].isna().sum(), 0)).tolist()

Or as a function:

def fill_nan_with_empty_list(df, col):
    df.loc[df[col].isna(),col] = np.empty((df[col].isna().sum(), 0)).tolist()

fill_nan_with_empty_list(users, 'follower')
fill_nan_with_empty_list(users, 'following')

though it doesn't always work...
