Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.24.1] New nullable integer fillna with non-int doesn't coerce to object #25288

Open
jelther opened this issue Feb 12, 2019 · 22 comments
Open
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@jelther
Copy link

jelther commented Feb 12, 2019

Code Sample

import pandas as pd

sample_data = []

sample_data.append({"integer_column":None})
sample_data.append({"integer_column":1})
sample_data.append({"integer_column":2})

df = pd.DataFrame(sample_data)

# Previous type is object
# df.dtypes

df.loc[:,'integer_column'] = df.loc[:,'integer_column'].astype('Int64')

# Check new type is Int64, nullable
# df.dtypes

df.fillna('null_string')

Problem description

Using the new nullable type Int64, it is not possible to fill "NaN" values with other value.

Error raised

TypeError: <U11 cannot be converted to an IntegerDtype

Expected Output

The new dataframe should have replaced it's NaN values with the desired input of .fillna() method.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.24.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml.etree: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.1.18
pymysql: None
psycopg2: None
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd
Copy link
Member

WillAyd commented Feb 13, 2019

This works fine if you use an actual integer value to fill, so there's not really much of a point to using Int64 in this case since you're still asking for an object.

In any case I suppose it should still coerce to object for you like using float here would. Investigation and PRs are always welcome

@WillAyd WillAyd added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels Feb 13, 2019
@WillAyd WillAyd changed the title [0.24.1] New nullable integer can't fill NaN [0.24.1] New nullable integer fillna with non-int doesn't coerce to object Feb 13, 2019
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 13, 2019 via email

@jelther
Copy link
Author

jelther commented Feb 14, 2019

In fact, since I'm using pandas for an ETL tool, this doesn't look nice to me.
Having to change the type to "object" inevitable adds ".0" after the integer number and breaks my code.

The alternative I used is to remove the ".0" part after "astype(object)" and fill it with NaNs values.

@WillAyd
Copy link
Member

WillAyd commented Feb 14, 2019

@jelther getting slightly off topic but if you are getting zeros appended to your integers its because they are getting cast to float at some point. Explicitly constructing your data frame with dtype=object as an argument would allow you to mix / match the None value with integers without implicit cast to float

@jelther
Copy link
Author

jelther commented Feb 14, 2019

@jelther getting slightly off topic but if you are getting zeros appended to your integers its because they are getting cast to float at some point. Explicitly constructing your data frame with dtype=object as an argument would allow you to mix / match the None value with integers without implicit cast to float

@WillAyd , I think this is the best alternative but I'm not able to specify since I'm extracting the data from a SQL Server database with "pandas.read_sql". I don't see on the documentation how I would be able to specify the "dtypes" when selecting the data.

@jorisvandenbossche
Copy link
Member

but the type stability ensured by ExtensionArray[T].fillna -> ExtensionArray[T] is nice.

+1. For eg DatetimeArray, the fill value also needs to be a datetime-like.

@gosuto-inzasheru
Copy link

gosuto-inzasheru commented Feb 27, 2020

I would assume that .fillna() would coerce the series into being of type object when I am trying to fill it with an object. Just like for example when adding a float to it coerces it into being of type float:

>>> pd.Series([1, 2, None], dtype='Int64') + 0.5
0    1.5
1    2.5
2    NaN
dtype: float64

However;

>>> pd.Series([1, 2, None], dtype='Int64').fillna('')
TypeError: <U1 cannot be converted to an IntegerDtype

@gosuto-inzasheru
Copy link

This behaviour is also displayed when using .fillna() on a series of floats using a string:

>>> pd.Series([1, 2, None], dtype='float64').fillna('')
0    1
1    2
2     
dtype: object

@TomAugspurger
Copy link
Contributor

would coerce the series into being of type object when I am trying to fill it with an object

Why do you prefer coercing the series to the dtype of the fill value, rather than the other way around? It's not clear to me that one is preferable to the other.

@gosuto-inzasheru
Copy link

Because I need to fill the <NA> values with something else. Just like when adding a float to an integer it becomes a float. And like I said, this is already .fillna()'s current behaviour on floats.

But I also understand there is something to say for not doing so. Maybe a boolean argument such as coerce=True would be a solution?

@jbrockmendel
Copy link
Member

Definitely on the IntegerArray the fillna should raise instead of cast, but in ExtensionBlock.fillna we would expect it to fall back by casting to object

@mroeschke mroeschke added the Bug label Jun 26, 2021
alexreg added a commit to alexreg/pandas that referenced this issue Oct 9, 2021
alexreg added a commit to alexreg/pandas that referenced this issue Oct 9, 2021
alexreg added a commit to alexreg/pandas that referenced this issue Oct 10, 2021
alexreg added a commit to alexreg/pandas that referenced this issue Oct 10, 2021
alexreg added a commit to alexreg/pandas that referenced this issue Oct 11, 2021
alexreg added a commit to alexreg/pandas that referenced this issue Oct 11, 2021
@alexreg
Copy link
Contributor

alexreg commented Oct 11, 2021

@jbrockmendel The docs for ExtensionArray (I presume you meant ExtensionArray rather than the internal ExtensionBlock) now explicitly state: "pandas will recognize instances of this class as proper arrays with a custom type and will not attempt to coerce them to objects". Does that negate your above comment? And if so, should this issue therefore be closed?

@jbrockmendel
Copy link
Member

@alexreg the methods on ExtensionArray subclasses are in general stricter about what the allow than the Series/DataFrame methods. This less-strict behavior is implemented on the Block subclasses.

@alexreg
Copy link
Contributor

alexreg commented Oct 11, 2021

@jbrockmendel Oh, I see. That makes more sense now. This seems like a slightly tricky matter to get exactly right, and I'd rather not take it on in my existing PR (or another even). Do you have any inclination to have a go at it yourself?

@jbrockmendel
Copy link
Member

I'll get to it eventually if no one else does, but it isn't a priority for me ATM.

@alexreg
Copy link
Contributor

alexreg commented Oct 11, 2021

@jbrockmendel If you don't quickly pointing me to what you think are the right functions/methods to look at, I may have a go (fairly) soon.

@jorisvandenbossche
Copy link
Member

BTW, I don't think we should change this for the nullable dtypes. I personally think the no-casting behaviour is better (although inconsistent with other dtypes), and unless we have a discussion about which long-term behaviour we want and decide we want casting, IMO we should keep the strict behaviour of fillna for nullable dtypes for now (also on the block level).
This discussion is probably #39584

@gosuto-inzasheru
Copy link

are you then also saying there is no room for a flag (eg coerce=True) to make it more consistent with other dtypes if needed?

@jbrockmendel
Copy link
Member

@alexreg notwithstanding Joris's (reasonable) objection, the place where you would change the behavior is in ExtensionBlock.fillna. either a try/except around values = self.values.fillna or a better implementation of _can_hold_element

@jorisvandenbossche
Copy link
Member

are you then also saying there is no room for a flag (eg coerce=True) to make it more consistent with other dtypes if needed?

Not necessarily, as that has not yet been brought up in the discussion up to now, as far as I know.
Personally, I still don't really see the advantage of that compared to explicitly doing a dtype conversion if you want it (eg if you want to fill the missing integers with a float fill value, you can first cast to float and then fill, eg s.astype("Float64").fillna(some_float))

@gosuto-inzasheru
Copy link

i brought it up here: #25288 (comment)

i still think it makes more sense to be able to just use .fillna() without explicitly casting first. as i stated earlier in this thread, this is default behaviour for .fillna() on other dtypes, and it also is a logical step when for example adding a float to an Int64.

what you are suggesting from a user perspective, is that now sometimes i can .fillna() directly, and sometimes i will have to cast + fill. as a user i would feel more for consistent behaviour of .fillna().

@alexreg
Copy link
Contributor

alexreg commented Oct 13, 2021

@jbrockmendel Thanks

For what it's worth, while I certainly see @jorisvandenbossche's point, and feel that sort of behaviour would be appropriate under other circumstances, I agree with @jorijnsmit here. At the end of the day, Pandas has set a precedent for virtually ubiquitous implicit coercion, and to not do so here would indeed seem inconsistent. If we don't want to have implicit coercion here, then we probably shouldn't in lots of other places too, but that design decision has already been made, long ago — and it was probably the right one, given Panda's emphasis on ergonomics over things like strong typing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

No branches or pull requests

8 participants