
API: concatting of Series/DataFrame - handling (not skipping) of empty objects #39122

Closed
3 tasks
jorisvandenbossche opened this issue Jan 12, 2021 · 6 comments · Fixed by #52532
Labels
API Design · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@jorisvandenbossche
Member

Follow-up on #38843 and #39035.

Currently, we generally (some exceptions can be considered bugs, I think) do not drop empty objects when concatting DataFrames, but we do explicitly drop empties when concatting Series (in dtypes/concat.py::concat_compat, for axis==0).

We should make this consistent throughout pandas, and generally I would argue for not skipping empties: when empty objects are not skipped, the resulting dtype of a concat operation depends only on the input dtypes, not on the exact content (the specific values or the shape). In general we want to get rid of value-dependent behaviour. In the past we discussed this in the context of certain values (eg presence of NaNs or not), but I think the shape should not matter either (eg when slicing dataframes before concatting, you can end up with empties or not depending on the values).

If people agree on going the way of not skipping empties in concat (and append, and friends), some different areas of work:

  • DataFrames: in general we already do not skip empty objects when determining the resulting dtype (except for DataFrames with EA columns, see below). However:
  • DataFrame/Series with new (nullable) extension dtypes: since those are still experimental, I think we can still just make a breaking change here?
    • Update concat_compat to not skip empty nullable EAs
  • Series (and DataFrame with EA columns): those currently skip empties. Can we "just" change this? It's long-standing behaviour, and changing it can certainly break code (eg you suddenly get an object dtype column instead of a datetime column). Can we deprecate it? (doesn't sound easy) Or leave it as a breaking change for pandas 2.0? (which means keeping the inconsistency for a while longer ..)

So IMO it's mainly the last bullet point (Series/DataFrame with longer-existing EAs) that requires some more discussion on how we want to change it.
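The nullable-EA bullet above can be sketched briefly. This is a hedged illustration, not from the original post: the exact result dtype depends on the pandas version, since this behaviour is precisely what the issue proposes to change.

```python
import pandas as pd

# Series with a nullable extension dtype, plus an empty object-dtype Series.
s_int = pd.Series([1, 2, 3], dtype="Int64")
s_obj_empty = pd.Series([], dtype=object)

result = pd.concat([s_int, s_obj_empty])
# If the empty operand is skipped, result.dtype stays Int64;
# if it is not skipped, the common dtype of Int64 and object is object.
```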


Some illustrative examples:

# dataframe with int64 (default consolidated dtype) and period (EA dtype) column
>>> df1 = pd.DataFrame({'a': [1, 2, 3], 'b': pd.period_range("2012", freq="D", periods=3)})
# dataframe with object dtype columns
>>> df2 = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['a', 'b', 'c']})

For Series with basic dtype (int64), int64 + object dtype results in object dtype, but not when the object dtype Series is empty:

>>> pd.concat([df1['a'], df2['a']]).dtype
dtype('O')
>>> pd.concat([df1['a'], df2['a'][:0]]).dtype
dtype('int64')

For DataFrame, int64 + object always gives object (even when one operand is empty), but for the period dtype, an empty object-dtype column gets ignored:

>>> pd.concat([df1, df2]).dtypes
a    object
b    object
dtype: object
>>> pd.concat([df1, df2[:0]]).dtypes
a       object
b    period[D]
dtype: object
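The same asymmetry can be reproduced at the Series level with the period column. A hedged sketch: on pandas versions predating the eventual fix (#52532), the empty object-dtype operand is dropped and period[D] survives, while a non-empty object operand forces object dtype.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3],
                    'b': pd.period_range("2012", freq="D", periods=3)})
df2 = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['a', 'b', 'c']})

# Non-empty object operand: the common dtype is object.
res_nonempty = pd.concat([df1['b'], df2['b']]).dtype
# Empty object operand: historically dropped, so period[D] is kept;
# after the behaviour change the result would be object here too.
res_empty = pd.concat([df1['b'], df2['b'][:0]]).dtype
```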

cc @pandas-dev/pandas-core

@jbrockmendel
Member

@jorisvandenbossche IIUC the change you're suggesting would change this test to expect object dtype; can you confirm?

    def test_append_empty_tz_frame_with_datetime64ns(self):
        # https://github.com/pandas-dev/pandas/issues/35460
        df = DataFrame(columns=["a"]).astype("datetime64[ns, UTC]")

        # pd.NaT gets inferred as tz-naive, so append result is tz-naive
        result = df.append({"a": pd.NaT}, ignore_index=True)
        expected = DataFrame({"a": [pd.NaT]}).astype("datetime64[ns]")
        tm.assert_frame_equal(result, expected)

@jorisvandenbossche
Member Author

Yes, that was also discussed on the PR in question: #36115 (comment)

@jbrockmendel
Copy link
Member

In general we want to get rid of value-dependent behaviour

do we have a running list of where we currently have these behaviors?

@jorisvandenbossche
Member Author

For value-depending behaviour in the concat itself in general, there are several "encoded" in the _cast_to_common_type helper function:

def _cast_to_common_type(arr: ArrayLike, dtype: DtypeObj) -> ArrayLike:
    """
    Helper function for `arr.astype(common_dtype)` but handling all special
    cases.
    """
    if (
        is_categorical_dtype(arr.dtype)
        and isinstance(dtype, np.dtype)
        and np.issubdtype(dtype, np.integer)
    ):
        # problem case: categorical of int -> gives int as result dtype,
        # but categorical can contain NAs -> fall back to object dtype
        try:
            return arr.astype(dtype, copy=False)
        except ValueError:
            return arr.astype(object, copy=False)

    if is_sparse(arr) and not is_sparse(dtype):
        # problem case: SparseArray.astype(dtype) doesn't follow the specified
        # dtype exactly, but converts this to Sparse[dtype] -> first manually
        # convert to dense array
        arr = cast(SparseArray, arr)
        return arr.to_dense().astype(dtype, copy=False)

    if (
        isinstance(arr, np.ndarray)
        and arr.dtype.kind in ["m", "M"]
        and dtype is np.dtype("object")
    ):
        # wrap datetime-likes in EA to ensure astype(object) gives Timestamp/Timedelta
        # this can happen when concat_compat is called directly on arrays (when arrays
        # are not coming from Index/Series._values), eg in BlockManager.quantile
        arr = ensure_wrapped_if_datetimelike(arr)

    if is_extension_array_dtype(dtype) and isinstance(arr, np.ndarray):
        # numpy's astype cannot handle ExtensionDtypes
        return array(arr, dtype=dtype, copy=False)

    return arr.astype(dtype, copy=False)
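The first special case above (categorical of int containing NA) can be exercised directly. This is a hedged illustration added for clarity: the fallback to object happens because the missing value (code -1) cannot be represented in an integer dtype.

```python
import pandas as pd

# A categorical whose categories are integers but which contains a missing
# value: casting to int would fail, so the helper falls back to object.
cat = pd.Series([1, 2, None], dtype="category")
ints = pd.Series([3, 4], dtype="int64")

# Concatenating triggers the common-dtype path described in the snippet.
result = pd.concat([cat, ints], ignore_index=True)
```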

But specifically for empty / non-empty inconsistencies, I don't think we have a running list of those, apart from the issue linked in the first bullet point about DataFrames in the top post here.

@jbrockmendel
Member

Some notes on what it would take to change this. AFAICT it boils down to disabling a check in concat_compat https://github.com/pandas-dev/pandas/blob/main/pandas/core/dtypes/concat.py#L64-L67

    non_empties = [x for x in to_concat if is_nonempty(x)]
    if non_empties and axis == 0 and not ea_compat_axis:
        # ea_compat_axis see GH#39574
        to_concat = non_empties

and flipping a condition in JoinUnit.is_na https://github.com/pandas-dev/pandas/blob/main/pandas/core/internals/concat.py#L424

        if values.size == 0:
            return True

Doing this breaks 10 tests for me locally:

FAILED pandas/tests/dtypes/test_concat.py::test_concat_mismatched_categoricals_with_empty - AssertionError: Categorical Expected type <class 'pandas.core.arrays.categorical.Categorical'>, found <class 'numpy.ndarray'> instead
FAILED pandas/tests/reshape/concat/test_append_common.py::TestConcatAppendCommon::test_concat_categorical_empty - AssertionError: Attributes of Series are different
FAILED pandas/tests/reshape/concat/test_concat.py::test_concat_ignore_emtpy_object_float[None-float64] - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="foo") are different
FAILED pandas/tests/reshape/concat/test_concat.py::test_concat_ignore_emtpy_object_float[None-datetime64[ns]] - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="foo") are different
FAILED pandas/tests/reshape/concat/test_concat.py::test_concat_ignore_emtpy_object_float[float64-datetime64[ns]] - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="foo") are different
FAILED pandas/tests/reshape/concat/test_concat.py::test_concat_ignore_emtpy_object_float[object-float64] - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="foo") are different
FAILED pandas/tests/reshape/concat/test_concat.py::test_concat_ignore_emtpy_object_float[object-datetime64[ns]] - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="foo") are different
FAILED pandas/tests/reshape/concat/test_empty.py::TestEmptyConcat::test_concat_empty_series - AssertionError: Attributes of Series are different
FAILED pandas/tests/reshape/concat/test_series.py::TestSeriesConcat::test_concat_empty_and_non_empty_series_regression - AssertionError: Attributes of Series are different
FAILED pandas/tests/reshape/merge/test_merge.py::TestMerge::test_join_append_timedeltas - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="d") are different

Of these, 5 were introduced by #47372, which was specifically about restoring the current behavior. That is, changing this would be a breaking change, but it doesn't look like it would be particularly tough to implement.

@iftg

iftg commented Oct 25, 2023

Concat of a non-empty dataframe with an empty one resulting in the non-empty one is natural behaviour. Hope you won't change it. Example:

    df1 = pd.DataFrame()
    df2 = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['a', 'b', 'c']})
    df3 = pd.concat([df1, df2])  # should result in df2

Otherwise this will have to be replaced by an ugly

    if df1.empty:
        df3 = df2
    else:
        df3 = pd.concat([df1, df2])

Am I missing something?
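Worth noting for this pattern: a completely empty `pd.DataFrame()` has no columns and therefore contributes no dtypes, so the dtype-inference question discussed in this issue concerns typed but length-zero operands, not this case. A minimal sketch (hedged; checked against recent pandas behaviour):

```python
import pandas as pd

# A column-less empty DataFrame contributes neither rows nor dtypes,
# so concatenating it with a non-empty frame reproduces the latter.
df1 = pd.DataFrame()
df2 = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['a', 'b', 'c']})
df3 = pd.concat([df1, df2])
```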
