BUG: Appending or concatenating to empty ExtensionArray removes type information #48510

ssche · 2022-09-12T03:48:07Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

arr = []
df = pd.DataFrame({'a': pd.array(arr, dtype=pd.Int64Dtype())})
other = pd.DataFrame({'a': [1, 2]})

df2 = df.append(other)
# same issue for pd.concat(...)
# df2 = pd.concat([df, other])
assert df2['a'].dtype == df['a'].dtype

>       assert df2['a'].dtype == df['a'].dtype
E       AssertionError: assert dtype('O') == Int64Dtype()
E        +  where dtype('O') = 0    1\n1    2\nName: a, dtype: object.dtype
E        +  and   Int64Dtype() = Series([], Name: a, dtype: Int64).dtype

Issue Description

When appending a dataframe (other) to another dataframe (df) that has an empty column with an extension dtype (Int64Dtype in this case, but the specific EA type doesn't matter), the resulting dataframe's column (df2['a']) loses its dtype information and becomes object dtype.

You can run the example with arr = [1] instead of the empty list (arr = []) and observe that - as expected - the type is not changed and remains Int64Dtype.
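
The empty and non-empty cases can be compared side by side with a small sketch like the one below. It uses pd.concat because DataFrame.append is deprecated. The helper name concat_dtype is hypothetical and exists only for this illustration, and the dtype printed for the empty case depends on the pandas version in use.

```python
import pandas as pd


def concat_dtype(arr):
    """Return the dtype of column 'a' after concatenating onto an Int64 column.

    Hypothetical helper for illustration only; not part of pandas.
    """
    df = pd.DataFrame({"a": pd.array(arr, dtype=pd.Int64Dtype())})
    other = pd.DataFrame({"a": [1, 2]})
    return pd.concat([df, other])["a"].dtype


# Non-empty left-hand side: the extension dtype is kept.
print("arr = [1]:", concat_dtype([1]))
# Empty left-hand side: object on affected versions, per this report.
print("arr = []: ", concat_dtype([]))
```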

I traced the issue to _concatenate_join_units and _get_empty_dtype. The latter ignores the dtype of any join unit whose column is empty (the "if not unit.is_na" filter), so the common dtype is computed without the EA dtype. As a result, _concatenate_join_units fails to enter the "elif any(is_1d_only_ea_obj(t) for t in to_concat)" EA handling branch.

def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
    ...
    dtypes = [unit.dtype for unit in join_units if not unit.is_na]
    if not len(dtypes):
        dtypes = [unit.dtype for unit in join_units if unit.block.dtype.kind != "V"]

    dtype = find_common_type(dtypes)
    ...

def _concatenate_join_units(
    join_units: list[JoinUnit], concat_axis: int, copy: bool
) -> ArrayLike:
    """
    Concatenate values from several join units along selected axis.
    """
    if concat_axis == 0 and len(join_units) > 1:
        # Concatenating join units along ax0 is handled in _merge_blocks.
        raise AssertionError("Concatenating join units along axis0")

    empty_dtype = _get_empty_dtype(join_units)

    has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)
    upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks)

    to_concat = [
        ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
        for ju in join_units
    ]

    if len(to_concat) == 1:
        # Only one block, nothing to concatenate.
        concat_values = to_concat[0]
        if copy:
            if isinstance(concat_values, np.ndarray):
                # non-reindexed (=not yet copied) arrays are made into a view
                # in JoinUnit.get_reindexed_values
                if concat_values.base is not None:
                    concat_values = concat_values.copy()
            else:
                concat_values = concat_values.copy()

    elif any(is_1d_only_ea_obj(t) for t in to_concat):  # <-- this branch isn't entered
        # TODO(EA2D): special case not needed if all EAs used HybridBlocks
        # NB: we are still assuming here that Hybrid blocks have shape (1, N)
        # concatting with at least one EA means we are concatting a single column
        # the non-EA values are 2D arrays with shape (1, n)

        # error: No overload variant of "__getitem__" of "ExtensionArray" matches
        # argument type "Tuple[int, slice]"
        to_concat = [
            t if is_1d_only_ea_obj(t) else t[0, :]  # type: ignore[call-overload]
            for t in to_concat
        ]
        concat_values = concat_compat(to_concat, axis=0, ea_compat_axis=True)
        concat_values = ensure_block_shape(concat_values, 2)

    else:
        concat_values = concat_compat(to_concat, axis=concat_axis)

    return concat_values
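
The dtype-selection step can be illustrated in isolation. The sketch below assumes that the internal helper pandas.core.dtypes.cast.find_common_type (not public API, so its location may change) is the function called in _get_empty_dtype, and it only shows how dropping the empty unit's dtype removes the extension dtype from the computation. It does not reproduce the full code path that ends in object dtype.

```python
import numpy as np
import pandas as pd
from pandas.core.dtypes.cast import find_common_type  # internal helper, not public API

ea_dtype = pd.Int64Dtype()     # dtype of the empty column in df
np_dtype = np.dtype("int64")   # dtype of the column in other

# Both dtypes considered: the masked extension dtype is the common type.
with_empty = find_common_type([ea_dtype, np_dtype])

# Empty unit filtered out first: only the NumPy dtype remains.
without_empty = find_common_type([np_dtype])

print(with_empty, without_empty)
```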

Expected Behavior

Type information remains as both types are compatible (the fact that one Series is empty shouldn't matter).

Installed Versions

INSTALLED VERSIONS

commit : ca60aab
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.8-200.fc36.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Thu Sep 8 19:02:21 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.4.4
numpy : 1.23.2
pytz : 2020.4
dateutil : 2.8.1
setuptools : 59.6.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 1.1.1
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 1.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
snappy : None
sqlalchemy : 1.3.23
tables : 3.7.0
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None


mroeschke · 2022-09-12T17:12:57Z

DataFrame.append is deprecated, so this will likely not be relevant as of the next major release. Can this be hit through another method?

ssche · 2022-09-13T01:45:39Z

df2 = pd.concat([df, other]) has the same issue (I edited my initial report accordingly).

mroeschke · 2022-09-13T16:44:31Z

I can confirm this on the 1.5.0rc, but cannot reproduce it on main.

In [2]:     arr = []
   ...:     df = pd.DataFrame({'a': pd.array(arr, dtype=pd.Int64Dtype())})
   ...:     other = pd.DataFrame({'a': [1, 2]})

In [3]: df2 = pd.concat([df, other])

In [4]: assert df2['a'].dtype == df['a'].dtype

In [5]: pd.__version__
Out[5]: '1.6.0.dev0+113.g94044c8532'

May have been fixed recently but definitely could use a unit test.
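
A regression test along these lines might look like the sketch below. The test name is hypothetical and this is not necessarily the test that was eventually added. It assumes a pandas build in which the behaviour is already fixed, as on main in the session above.

```python
import pandas as pd


def test_concat_empty_ea_column_keeps_dtype():
    # GH 48510: an empty Int64 column should not lose its dtype on concat.
    df = pd.DataFrame({"a": pd.array([], dtype=pd.Int64Dtype())})
    other = pd.DataFrame({"a": [1, 2]})

    result = pd.concat([df, other])

    expected = pd.DataFrame({"a": pd.array([1, 2], dtype=pd.Int64Dtype())})
    pd.testing.assert_frame_equal(result, expected)
```

On an affected version the comparison would be expected to fail on the dtype of column a.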

ssche added the Bug, Needs Triage, ExtensionArray, and Dtype Conversions labels Sep 12, 2022

mroeschke added the Needs Info label and removed the Needs Triage label Sep 12, 2022

ssche changed the title ~~BUG: Appending to empty ExtensionArray removes type information~~ BUG: Appending or concatenating to empty ExtensionArray removes type information Sep 13, 2022

mroeschke added the Reshaping, good first issue, and Needs Tests labels and removed the Needs Info and Bug labels Sep 13, 2022

ssche mentioned this issue Sep 14, 2022

Added test case to lock in behaviour #48541

Merged

5 tasks

mroeschke closed this as completed in #48541 Sep 15, 2022

ssche added this to the 1.6 milestone Sep 16, 2022

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
