BUG: assignment with enlargement gives object dtype with ExtensionArrays #32346

Open
jorisvandenbossche opened this issue Feb 29, 2020 · 14 comments
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves setitem-with-expansion

Comments

@jorisvandenbossche
Member

From #32271: when setting with enlargement, the dtype gets converted into object dtype (while for the normal integer or boolean dtypes this is not the case):

In [10]: s = pd.Series([1, 2, 3], dtype="Int64")  

In [11]: s[3] = 4 

In [12]: s   
Out[12]: 
0    1
1    2
2    3
3    4
dtype: object

In [13]: s = pd.Series([1, 2, 3], dtype="int64") 

In [14]: s[3] = 4   

In [15]: s  
Out[15]: 
0    1
1    2
2    3
3    4
dtype: int64

It also happens with e.g. DecimalArray from our tests, so I suppose it is a general issue with ExtensionArrays.

@jorisvandenbossche jorisvandenbossche added Bug ExtensionArray Extending pandas with custom dtypes or arrays. labels Feb 29, 2020
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Feb 29, 2020
@jbrockmendel jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Feb 29, 2020
@phofl
Member

phofl commented Nov 23, 2020

This works now. Returns:

0    1
1    2
2    3
3    4
dtype: Int64

@phofl phofl added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Nov 23, 2020
@kasim95
Contributor

kasim95 commented Dec 27, 2020

take

@kasim95
Contributor

kasim95 commented Dec 27, 2020

@phofl I found the same issue still occurring for other dtypes apart from Int64.
I used the following code:

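In [1]: from pandas import Series
   ...:
   ...: dtype = "Int32"
   ...: result = Series([1, 2, 3], dtype=dtype)
   ...: previous_dtype = result.dtype
   ...: result.loc[3] = 4  # label 3 does not exist yet, so this enlarges the Series
   ...: print(f"{previous_dtype} -> {result.dtype}")
Int32 -> Int64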

My observations for some of the inconsistent dtype conversions are as follows:

Before Assignment    After Assignment
Int32                Int64
Int16                Int64
Int8                 Int64
UInt64               Float64
UInt32               Int64
UInt16               Int64
UInt8                Int64
Float64              object
Float32              object
Float16 (float16)    float64
string               object

Also, these inconsistent conversions only occur for assignments to index labels that do not exist in the original Series.
For example, in the above code, label 3 does not exist in the result Series before the assignment.
Is this behavior deliberate or is it a bug?
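
A minimal sketch for reproducing the survey above in one go, assuming a pandas version that has the nullable dtypes being checked. The helper names `dtype_after_enlargement` and `survey` are made up for illustration, and the resulting dtypes depend on the installed pandas version, so no output is shown here.

```python
import pandas as pd


def dtype_after_enlargement(dtype):
    """Return the dtype name of a small Series after setting a new label."""
    s = pd.Series([1, 2, 3], dtype=dtype)
    s.loc[3] = 4  # label 3 does not exist yet, so the Series is enlarged
    return str(s.dtype)


def survey(dtypes):
    """Map each starting dtype to the dtype observed after enlargement."""
    return {dtype: dtype_after_enlargement(dtype) for dtype in dtypes}


if __name__ == "__main__":
    # "int64" is the non-nullable baseline from the top post.
    candidates = ["int64", "Int8", "Int16", "Int32", "Int64",
                  "UInt8", "UInt16", "UInt32", "UInt64"]
    for before, after in survey(candidates).items():
        print(f"{before} -> {after}")
```

Only numeric dtypes are listed so that the same integer values can be reused for every entry; other dtypes would need matching sample values.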

@phofl phofl added Bug and removed Needs Tests Unit test(s) needed to prevent regressions good first issue labels Dec 29, 2020
@phofl
Member

phofl commented Dec 29, 2020

This seems odd, but I'm not sure whether it is deliberate. This happens in

new_values = Series([value])._values

Maybe we should try to preserve the dtype? @jbrockmendel thoughts?
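
To illustrate the idea (this is only a sketch, not the actual pandas internals): wrapping the scalar in a fresh Series lets pandas infer a dtype that is unrelated to the original one, whereas building the one-element array with the original dtype keeps it through the concatenation. The helper name `enlarge_preserving_dtype` is hypothetical.

```python
import pandas as pd


def enlarge_preserving_dtype(s, label, value):
    """Hypothetical helper: append one value under a new label, keeping s.dtype."""
    # Construct the new element with the dtype of the existing Series instead
    # of letting pandas infer a dtype from the bare scalar.
    new_values = pd.array([value], dtype=s.dtype)
    addition = pd.Series(new_values, index=[label])
    # Concatenating two Series that share a dtype keeps that dtype.
    return pd.concat([s, addition])


if __name__ == "__main__":
    s = pd.Series([1, 2, 3], dtype="Int32")
    print(pd.Series([4]).dtype)                       # dtype inferred from the scalar alone
    print(enlarge_preserving_dtype(s, 3, 4).dtype)    # dtype carried over from s
```

This assumes the value is valid for the existing dtype; a value that does not fit (say, a string into an integer Series) would raise instead of upcasting, so a real fix would still need a fallback path.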

@jbrockmendel
Member

Maybe we should try to preserve the dtype?

Agreed.

@kasim95 kasim95 removed their assignment Jan 4, 2021
@phofl
Member

phofl commented Jan 24, 2021

I looked into this. This seems deliberate for now.

# for now only handle other floating types
if not all(isinstance(t, FloatingDtype) for t in dtypes):
    return None

This results in object for the Float dtypes. The int case behaves the same as for the non-nullable dtypes: int32 is also changed to int64.

@jbrockmendel
Member

I'd try to avoid object wherever possible. Maybe instead of requiring all-FloatingDtype, we could require all losslessly-castable, which would include numpy floating dtypes and some numpy integer dtypes.
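
A rough sketch of what such a check could look like, using NumPy's "safe" casting rule as a stand-in for "losslessly castable". The function name `all_losslessly_castable` is made up for illustration, and NumPy's notion of a safe cast may not coincide exactly with lossless for large 64-bit integers, so treat this as an approximation rather than the rule pandas would adopt.

```python
import numpy as np


def all_losslessly_castable(dtypes, target):
    """Return True if every dtype can be cast to `target` under NumPy's 'safe' rule."""
    target = np.dtype(target)
    return all(np.can_cast(np.dtype(d), target, casting="safe") for d in dtypes)


if __name__ == "__main__":
    # Widening casts are accepted, narrowing casts are rejected.
    print(all_losslessly_castable(["float32", "int16"], "float64"))
    print(all_losslessly_castable(["float64"], "float32"))
```

With a check like this, a mix of a nullable float dtype and compatible numpy dtypes could resolve to the nullable float dtype instead of falling back to object.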

@simonjayhawkins
Member

This works now. Returns:

0    1
1    2
2    3
3    4
dtype: Int64

but not for the pd.NA case

# gets upcast to object
df = pd.DataFrame({"a": [1, 2, 3]}, dtype="Int64")
df.loc[4] = pd.NA

xref #47214 (comment)

@jorisvandenbossche
Member Author

That last one is actually a regression (I ran into this a few days ago and was planning to open a new issue).

This was working in 1.3 (and already in 1.1 as well):

In [1]: df = pd.DataFrame({"a": [1, 2, 3]}, dtype="Int64")
   ...: df.loc[4] = pd.NA

In [2]: df
Out[2]: 
      a
0     1
1     2
2     3
4  <NA>

In [3]: df.dtypes
Out[3]: 
a    Int64
dtype: object

In [4]: pd.__version__
Out[4]: '1.3.5'

while on 1.4 / main, this results in object dtype
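
A sketch of how this case could be pinned down in a regression check; the function names are illustrative only and are not taken from the pandas test suite. Comparing the two frames with `pandas.testing.assert_frame_equal` would be expected to pass on versions with the 1.3 behavior and fail where the column is upcast to object.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def enlarge_with_na():
    """Enlarge an Int64 frame by assigning pd.NA to a row label that does not exist."""
    df = pd.DataFrame({"a": [1, 2, 3]}, dtype="Int64")
    df.loc[4] = pd.NA
    return df


def expected_frame():
    """The result described for 1.3: the new row is <NA> and the column stays Int64."""
    return pd.DataFrame({"a": [1, 2, 3, pd.NA]}, index=[0, 1, 2, 4], dtype="Int64")


if __name__ == "__main__":
    # Raises AssertionError on versions where the column becomes object.
    assert_frame_equal(enlarge_with_na(), expected_frame())
```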

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Jun 8, 2022
@jorisvandenbossche jorisvandenbossche modified the milestones: Contributions Welcome, 1.4.3 Jun 8, 2022
@simonjayhawkins
Member

Thanks @jorisvandenbossche will open a dedicated issue for the regression.

@simonjayhawkins
Member

will open a dedicated issue for the regression.

opened #47284

@simonjayhawkins simonjayhawkins removed the Regression Functionality that used to work in a prior pandas version label Jun 8, 2022
@simonjayhawkins simonjayhawkins modified the milestones: 1.4.3, Contributions Welcome Jun 8, 2022
@jreback jreback modified the milestones: Contributions Welcome, 1.5 Jul 1, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 15, 2022
@phofl phofl added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug good first issue labels Apr 18, 2023
@phofl
Member

phofl commented Apr 18, 2023

Works now

@jorisvandenbossche
Member Author

There was a specific PR for this that added tests (#47342), so just closing.

@jorisvandenbossche
Member Author

Actually, we still have the issue for EAs in general (for external EAs: how can they preserve the dtype / recognize scalars?). As I mentioned in the top post, this also happens for our test DecimalArray, and this still doesn't work:

In [10]: from pandas.tests.extension.decimal.array import DecimalArray, make_data

In [11]: s = pd.Series(DecimalArray(make_data()))[:3]

In [12]: s
Out[12]: 
0    Decimal: 0.47855406981399450927483485429547727...
1    Decimal: 0.35287592203064421791935956207453273...
2    Decimal: 0.09288530130514716098844019143143668...
dtype: decimal

In [13]: s[3] = s[0]

In [14]: s
Out[14]: 
0    0.47855406981399450927483485429547727108001708...
1    0.35287592203064421791935956207453273236751556...
2    0.09288530130514716098844019143143668770790100...
3    0.47855406981399450927483485429547727108001708...
dtype: object

So we can keep this open for ExtensionArrays in general.
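
Until that is resolved, one possible user-side workaround is to cast back to the original dtype after the enlargement. This is only a sketch with a hypothetical helper name, `set_with_expansion`; it assumes the enlarged values can be converted back to the original dtype, which is not guaranteed for every ExtensionArray or every value.

```python
import pandas as pd


def set_with_expansion(s, label, value):
    """Hypothetical helper: set a (possibly new) label and restore the original dtype."""
    original_dtype = s.dtype
    out = s.copy()            # leave the caller's Series untouched
    out.loc[label] = value    # may upcast when the label is new
    if out.dtype != original_dtype:
        # Cast back; this raises if the new value does not fit the dtype.
        out = out.astype(original_dtype)
    return out


if __name__ == "__main__":
    s = pd.Series([1, 2, 3], dtype="Int32")
    print(set_with_expansion(s, 3, 4).dtype)
```

The round trip through a wider or object dtype costs an extra copy, so this is a stopgap rather than a substitute for preserving the dtype during the enlargement itself.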

@jorisvandenbossche jorisvandenbossche added Bug and removed Needs Tests Unit test(s) needed to prevent regressions labels Apr 21, 2023

7 participants