BUG: Appending or concatenating to empty ExtensionArray removes type information #48510

Closed
2 of 3 tasks
ssche opened this issue Sep 12, 2022 · 3 comments · Fixed by #48541
Labels
Dtype Conversions, ExtensionArray, good first issue, Needs Tests, Reshaping

@ssche
Contributor

ssche commented Sep 12, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

arr = []
df = pd.DataFrame({'a': pd.array(arr, dtype=pd.Int64Dtype())})
other = pd.DataFrame({'a': [1, 2]})

df2 = df.append(other)
# same issue for pd.concat(...)
# df2 = pd.concat([df, other])
assert df2['a'].dtype == df['a'].dtype

>       assert df2['a'].dtype == df['a'].dtype
E       AssertionError: assert dtype('O') == Int64Dtype()
E        +  where dtype('O') = 0    1\n1    2\nName: a, dtype: object.dtype
E        +  and   Int64Dtype() = Series([], Name: a, dtype: Int64).dtype

Issue Description

When appending a dataframe (other) to another dataframe (df) that has an empty column with an ExtensionDtype (in this case Int64Dtype, but the specific EA type doesn't matter), the resulting dataframe's column (df2['a']) loses its dtype information and becomes object dtype.

You can run the example with arr = [1] instead of the empty list (arr = []) and observe that - as expected - the type is not changed and remains Int64Dtype.
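The comparison above can be sketched as a small helper that reports the resulting dtype for both cases. The name result_dtype is hypothetical (it is not part of pandas), and pd.concat is used instead of DataFrame.append since the report notes both behave the same. The outcome for the empty case depends on the pandas version, so no expected output is shown for it.

```python
import pandas as pd


def result_dtype(arr):
    """Return the dtype of column 'a' after concatenating a plain int column
    onto an Int64 column built from `arr` (hypothetical helper)."""
    df = pd.DataFrame({"a": pd.array(arr, dtype=pd.Int64Dtype())})
    other = pd.DataFrame({"a": [1, 2]})
    return pd.concat([df, other])["a"].dtype


# Non-empty left-hand column: the extension dtype is kept.
print("non-empty:", result_dtype([1]))
# Empty left-hand column: this is the case the report is about.
print("empty:    ", result_dtype([]))
```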

I traced the issue to _concatenate_join_units and _get_empty_dtype. The latter ignores the dtype of a column when that column is empty (the if not unit.is_na filter). As a result, _concatenate_join_units fails to enter the elif any(is_1d_only_ea_obj(t) for t in to_concat) EA handling branch.

def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
    ...
    dtypes = [unit.dtype for unit in join_units if not unit.is_na]
    if not len(dtypes):
        dtypes = [unit.dtype for unit in join_units if unit.block.dtype.kind != "V"]

    dtype = find_common_type(dtypes)
    ...

def _concatenate_join_units(
    join_units: list[JoinUnit], concat_axis: int, copy: bool
) -> ArrayLike:
    """
    Concatenate values from several join units along selected axis.
    """
    if concat_axis == 0 and len(join_units) > 1:
        # Concatenating join units along ax0 is handled in _merge_blocks.
        raise AssertionError("Concatenating join units along axis0")

    empty_dtype = _get_empty_dtype(join_units)

    has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)
    upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks)

    to_concat = [
        ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
        for ju in join_units
    ]

    if len(to_concat) == 1:
        # Only one block, nothing to concatenate.
        concat_values = to_concat[0]
        if copy:
            if isinstance(concat_values, np.ndarray):
                # non-reindexed (=not yet copied) arrays are made into a view
                # in JoinUnit.get_reindexed_values
                if concat_values.base is not None:
                    concat_values = concat_values.copy()
            else:
                concat_values = concat_values.copy()

    elif any(is_1d_only_ea_obj(t) for t in to_concat):  # <-- this branch isn't entered
        # TODO(EA2D): special case not needed if all EAs used HybridBlocks
        # NB: we are still assuming here that Hybrid blocks have shape (1, N)
        # concatting with at least one EA means we are concatting a single column
        # the non-EA values are 2D arrays with shape (1, n)

        # error: No overload variant of "__getitem__" of "ExtensionArray" matches
        # argument type "Tuple[int, slice]"
        to_concat = [
            t if is_1d_only_ea_obj(t) else t[0, :]  # type: ignore[call-overload]
            for t in to_concat
        ]
        concat_values = concat_compat(to_concat, axis=0, ea_compat_axis=True)
        concat_values = ensure_block_shape(concat_values, 2)

    else:
        concat_values = concat_compat(to_concat, axis=concat_axis)

    return concat_values
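The filtering step in _get_empty_dtype can be illustrated with a toy model that does not use pandas internals. ToyUnit and surviving_dtypes are hypothetical names; is_na=True for the empty column is an assumption taken from the description above, and the fallback branch is simplified (the real code also filters on dtype kind). This is a sketch of the reported mechanism, not the pandas implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyUnit:
    dtype: str   # dtype name only, e.g. "Int64" (toy stand-in for a real dtype)
    is_na: bool  # assumed True for an empty column, per the description above


def surviving_dtypes(units):
    """Keep only the dtypes of units that are not flagged as NA."""
    dtypes = [u.dtype for u in units if not u.is_na]
    if not dtypes:
        # Simplified fallback: if every unit was filtered out, use all dtypes.
        dtypes = [u.dtype for u in units]
    return dtypes


units = [ToyUnit("Int64", is_na=True), ToyUnit("int64", is_na=False)]
# The empty Int64 column is dropped before the common dtype is computed.
print(surviving_dtypes(units))  # ['int64']
```

Because the extension dtype never reaches the common-type computation in this model, nothing downstream knows that an ExtensionArray was involved.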

Expected Behavior

The dtype should be preserved, since both dtypes are compatible (the fact that one Series is empty shouldn't matter).

Installed Versions

INSTALLED VERSIONS

commit : ca60aab
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.8-200.fc36.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Thu Sep 8 19:02:21 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.4.4
numpy : 1.23.2
pytz : 2020.4
dateutil : 2.8.1
setuptools : 59.6.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 1.1.1
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 1.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
snappy : None
sqlalchemy : 1.3.23
tables : 3.7.0
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None

@ssche added the Bug, Needs Triage, ExtensionArray, and Dtype Conversions labels Sep 12, 2022
@mroeschke
Member

DataFrame.append is deprecated, so this will likely not be relevant as of the next major release. Can this be hit through another method?

@mroeschke added the Needs Info label and removed the Needs Triage label Sep 12, 2022
@ssche
Contributor Author

ssche commented Sep 13, 2022

df2 = pd.concat([df, other]) has the same issue (I edited my initial report accordingly).

@ssche changed the title from "BUG: Appending to empty ExtensionArray removes type information" to "BUG: Appending or concatenating to empty ExtensionArray removes type information" Sep 13, 2022
@mroeschke
Member

I can confirm this on 1.5.0rc, but cannot reproduce it on main.

In [2]:     arr = []
   ...:     df = pd.DataFrame({'a': pd.array(arr, dtype=pd.Int64Dtype())})
   ...:     other = pd.DataFrame({'a': [1, 2]})

In [3]: df2 = pd.concat([df, other])

In [4]: assert df2['a'].dtype == df['a'].dtype

In [5]: pd.__version__
Out[5]: '1.6.0.dev0+113.g94044c8532'

May have been fixed recently but definitely could use a unit test.
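A regression test along these lines might look like the sketch below. The function names make_frames and test_concat_empty_ea_keeps_dtype are hypothetical, and this is not necessarily the test that was eventually added. It would be expected to pass only on versions where the behaviour is fixed.

```python
import pandas as pd


def make_frames():
    """Build the two frames from the report: an empty Int64 column and a plain int column."""
    df = pd.DataFrame({"a": pd.array([], dtype=pd.Int64Dtype())})
    other = pd.DataFrame({"a": [1, 2]})
    return df, other


def test_concat_empty_ea_keeps_dtype():
    # Hypothetical pytest-style test: concatenating onto an empty
    # extension-typed column should keep the extension dtype.
    df, other = make_frames()
    result = pd.concat([df, other])
    assert result["a"].dtype == df["a"].dtype
```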

@mroeschke added the Reshaping, good first issue, and Needs Tests labels and removed the Needs Info and Bug labels Sep 13, 2022
@ssche ssche added this to the 1.6 milestone Sep 16, 2022
@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022