ENH: DataFrame.append preserves columns dtype if possible #19021

topper-123 · 2017-12-31T23:14:24Z

closes df.append should retain columns type if same type #18359
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This PR makes DataFrame.append preserve columns dtype, e.g.:

>>> idx = pd.CategoricalIndex('a b'.split())
>>> df = pd.DataFrame([[1, 2]], columns=idx)
>>> ser = pd.Series([3, 4], index=idx, name=1)
>>> df.append(ser).columns
CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

Previously, the above returned Index(['a', 'b'], dtype='object'), i.e. the index type information was lost when using append.

topper-123 · 2018-01-01T10:21:57Z

Can't get Travis-ci to complete because it uses too much time, somehow.

I assume this has nothing to do with this PR, but is about travis settings?

jreback · 2018-01-01T14:19:17Z

rebase and push again, fixed some hanging by Travis CI

codecov · 2018-01-01T16:03:44Z

Codecov Report

Merging #19021 into master will decrease coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #19021      +/-   ##
==========================================
- Coverage   91.84%   91.82%   -0.03%     
==========================================
  Files         153      153              
  Lines       49295    49299       +4     
==========================================
- Hits        45276    45268       -8     
- Misses       4019     4031      +12

Flag	Coverage Δ
#multiple	`90.21% <ø> (-0.03%)`	⬇️
#single	`41.89% <ø> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.16% <ø> (ø)`	⬆️
pandas/plotting/_converter.py	`65.07% <0%> (-1.74%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb095a6...13537ae. Read the comment docs.

topper-123 · 2018-01-01T16:21:34Z

Ok, rebased.

jreback · 2018-01-01T16:21:33Z

pandas/core/frame.py

+            idx_diff = other.index.difference(self.columns)
+            try:
+                combined_columns = self.columns.append(idx_diff)
+            except TypeError:


so this should not be done here, rather if a TypeError happens, then Index.append handles.

I don't understand you here, could you expand.

In [1]: pd.Index([1, 2, 3]).append(pd.Index(list('abc'))) Out[1]: Index([1, 2, 3, 'a', 'b', 'c'], dtype='object') In [2]: pd.Index([1, 2, 3]).append(pd.CategoricalIndex(list('abc'))) Out[2]: Index([1, 2, 3, 'a', 'b', 'c'], dtype='object') In [3]: pd.Index([1, 2, 3]).append(pd.MultiIndex.from_product([[1], ['a']])) Out[3]: Index([1, 2, 3, (1, 'a')], dtype='object') In [4]: pd.Index([1, 2, 3]).append(pd.IntervalIndex.from_breaks([1, 2, 3])) Out[4]: Index([1, 2, 3, (1, 2], (2, 3]], dtype='object')

.append`` should never raise

so you are catching something that does not happen.

The below fails, though:

>>> pd.CategoricalIndex('A B C'.split()).append(pd.Index([1]) TypeError: cannot append a non-category item to a CategoricalIndex

This is what is tested in the if not isinstance(df_columns, (pd.IntervalIndex, pd.MultiIndex)): part of the tests. In addition, the reason why pd.IntervalIndex and pd.MultiIndex are excluded for testing there is because of reindexing issues:

>>> ii = pd.IntervalIndex.from_breaks([0,1,2]) >>> df = pd.DataFrame([[1, 2], [4, 5]], columns=ii) >>> ser = pd.Series([7], index=['a']) >>> combined = pd.Index(ii.tolist() + ser.index.tolist()) >>> df.reindex(columns=combined) TypeError: unorderable types: Interval() > int()

hmm, the MI case we did on purpose. However you cannot convert to lists and expect things to work. for the TypeError, do .astype(object) on each left and right then .append again. we never should coerce to lists.

jreback · 2018-01-01T16:22:31Z

pandas/tests/reshape/test_concat.py

+    def test_append_same_columns_type(self, df_columns):
+        # GH18359
+
+        # ser.index is a normal pd.Index, result from df.append(ser) should be


make this as a separate test (the if part), though I suspect you are actually hiding a bug (see above).

jreback · 2018-01-01T16:22:46Z

pandas/tests/reshape/test_concat.py

    def test_append_dtype_coerce(self):

        # GH 4993
        # appending with datetime will incorrectly convert datetime64
-        import datetime as dt
        from pandas import NaT


can move other imports to top as well

topper-123 · 2018-01-02T14:32:25Z

Changed and green (except for travis errors that seem unrelated).

A different subject, but shouldn't .reindex be able to align Interval and int objects?

jreback · 2018-01-04T00:34:43Z

pandas/tests/reshape/test_concat.py

+
+        # MultiIndex and IntervalIndex will raise if appended or appended to
+        # a different type
+        if (isinstance(df_columns, (pd.IntervalIndex, pd.MultiIndex)) or


don't do this in a test. I want a separate test for MI/II (or 2 is ok).

pep8speaks · 2018-01-05T23:08:17Z

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on April 17, 2018 at 17:28 Hours UTC

topper-123 · 2018-01-12T23:38:39Z

All is green again, now that #19112 has been pulled in.

jreback · 2018-01-14T22:03:59Z

@topper-123 added some ids to the parametrization. also I added a test case in the combo test or an ordered and unordered category which is failing. (its commented out). can you have a look?

jreback · 2018-01-16T01:13:10Z

also pls rebase

topper-123 · 2018-01-18T00:16:40Z

@jreback , wrt. the ordered vs unordered combo in test_append_different_columns_types: pd.CategoricalIndex('A B C'.split()) and pd.CategoricalIndex('A B C'.split(), ordered=True) combine into pd.CategoricalIndex('A B C'.split()). To pass this test, the combined columns have to have length 6, but this has length 3.

I think the current implementation is reasonable behaviour, as append tries to find the lowest common denominator, which in this case is pd.CategoricalIndex('A B C'.split()). Possibly you could give append a strict parameter where you'd raise if the two dtypes aren't the same, but that's another issue than this one, IMO.

jreback · 2018-01-19T11:39:47Z

pandas/tests/reshape/test_concat.py

+
+        df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=mi)
+        ser = pd.Series([7, 8, 9], index=other_type, name=2)
+        if isinstance(other_type, pd.IntervalIndex):


so this is wrong, append should raise a TypeError (like all the others). can you fix this in II?

It looks like the ValueError is coming from here:

pandas/pandas/core/indexes/interval.py

Lines 1155 to 1161 in 1245f06

def _as_like_interval_index(self, other, error_msg):

self._assert_can_do_setop(other)

other = _ensure_index(other)

if (not isinstance(other, IntervalIndex) or

self.closed != other.closed):

raise ValueError(error_msg)

return other

Seems reasonable to split the if condition, and raise a different exception in each case, along the lines of:

if not isinstance(other, IntervalIndex): raise TypeError(error_msg) elif self.closed != other.closed: raise ValueError(error_msg)

I think this change will only break the test below, at least within the II tests, but all that needs to be changed is ValueError --> TypeError:

pandas/pandas/tests/indexes/interval/test_interval.py

Lines 936 to 940 in 1245f06

# non-IntervalIndex

msg = ('can only do set operations between two IntervalIndex objects '

'that are closed on the same side')

with tm.assert_raises_regex(ValueError, msg):

set_op(Index([1, 2, 3]))

jreback · 2018-01-19T11:40:44Z

pandas/tests/reshape/test_concat.py

+                          dt.datetime(2013, 1, 3, 7, 12)]),
+        pd.Index([4, 5, 6]),
+    ], ids=lambda x: str(x.dtype))
+    def test_append_interval_index_raises(self, other_type):


see above.

cc @jschendel

jreback · 2018-01-19T11:41:36Z

cc @jschendel

jschendel · 2018-01-20T05:48:55Z

pandas/tests/reshape/test_concat.py

+
+        df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=other_type)
+        ser = pd.Series([7, 8, 9], index=ii, name=2)
+        with pytest.raises(ValueError):


looks to be the same as above

jreback · 2018-01-21T21:10:12Z

doc/source/whatsnew/v0.23.0.txt

 - ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`)
+- :meth:`DataFrame.append` now preserves the type of the calling dataframe's columns, when possible (:issue:`18359`)


duplicated the whatsnew, don't remove the whitespace

Fixed, thanks.

jreback · 2018-01-21T21:11:53Z

pandas/tests/reshape/test_concat.py

+        # different index type
+
+        ii = pd.IntervalIndex.from_breaks([0, 1, 2, 3])
+


this tests now looks superfluous, yes? and you just need to add it to the above

Appending IntervalIndex with other index types still fails, only with a TypeError rather than a ValueError, so not sure I understand.

Oh, you mean combine test_append_multi_index_raises and test_append_interval_index_raises? I think you're right. I'll get on it.

yes. None of these should fail, you are catching the TypeError already (and then coercing to object).

Well, combining IntervalIndex and other index types will stilll fail, because Interval and other dtypes are not orderable:

>>> ii = pd.IntervalIndex.from_breaks([0,1,2,3]) >>> i = pd.RangeIndex(3) >>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=ii) >>> ser = pd.Series([7, 8, 9], index=i, name=2) >>> df.append(ser) TypeError: unorderable types: Interval() > int()

test_append_multi_index_raises and test_append_interval_index_raises are very similar and I've made a changes to combine them into a single test, see test_append_different_columns_types_raises.

EDIT: I'm wondering if trying to order here is a bug, as you can append to the index itself:

>>> ii.append(i) Index([(0, 1], (1, 2], (2, 3], 0, 1, 2], dtype='object')

If the above is possible, surely you can use DataFrame.append as well. I'll take a look tonight.

topper-123 · 2018-02-18T23:03:01Z

I've addressed the comments, except one where I've commented specifically.

If #19353 (exact matches for CategoricalIndex) or a version thereof could be accepted, it would be possible to append frames/series with CategoricalIndex as well, i.e. move IntervalIndex checks from test_append_different_columns_types_raises to test_append_different_columns_types.

jreback · 2018-02-19T16:05:51Z

pandas/tests/reshape/test_concat.py

+                                columns=combined_columns)
+        assert_frame_equal(result, expected)
+
+    @pytest.mark.parametrize("this_type", [


indexes = [pd.RangeIndex(3), pd.Index([4, 5, 6]), pd.Index(list("abc")), pd.CategoricalIndex('A B C'.split()), pd.CategoricalIndex('A B C'.split(), ordered=True), pd.IntervalIndex.from_breaks([0, 1, 2, 3]), pd.MultiIndex.from_arrays(['A B C'.split(), 'D E F'.split()]), pd.DatetimeIndex([dt.datetime(2013, 1, 3, 0, 0), dt.datetime(2013, 1, 3, 6, 10), dt.datetime(2013, 1, 3, 7, 12)])] @pytest.fixture(params=indexes) def this_type(request): return request.param @pytest.fixture(params=indexes) def that_type(params=indexes): return request.param ..... def append_different_coumns_types_raises(self, this_type, that_type): ....

topper-123 · 2018-02-21T19:31:45Z

I've uploaded a version with fixture, but honestly I consider this less readable than the previous version...

jreback · 2018-02-22T02:13:30Z

pandas/tests/reshape/test_concat.py

+        pd.MultiIndex.from_arrays(['A B C'.split(), 'D E F'.split()]),
+    ]
+
+    @pytest.fixture(params=indexes_II)


ok, rather than use the fixtures, let's leave the lists where they are (though you don't need to have indexes_I at all, and pls name them something else more descriptive. and you can use parametrize directrly.

jreback · 2018-02-22T02:14:08Z

pandas/tests/reshape/test_concat.py

+                                columns=ser_index)
+        assert_frame_equal(result, expected)
+
+    @pytest.mark.parametrize("df_columns, series_index",


you can just use 2 parameterize here, one for df_columns and one for series_index; you don't need combinations, pytest already does this.

I don't get this approach to work, AFAIKS, parametrize doesn't do this for you, and I've googled it without results. Can you give an example?

jreback · 2018-03-16T22:15:18Z

can you update

jreback · 2018-04-01T14:25:30Z

can you rebase

jreback · 2018-04-14T13:47:32Z

@topper-123 can you rebase

topper-123 · 2018-04-14T21:55:57Z

Rebased.

I've also simplified the tests a bit. I dropped the fixture-based approach, as I don't think it makes things clearer in this case.

topper-123 · 2018-04-17T08:40:56Z

do you disagree wit the approach, @jreback?

jreback

lgtm. can you just make the whatsnew a bit more

jreback · 2018-04-17T10:24:50Z

doc/source/whatsnew/v0.23.0.txt

@@ -380,6 +380,7 @@ Other Enhancements
 - :class:`IntervalIndex` now supports time zone aware ``Interval`` objects (:issue:`18537`, :issue:`18538`)
 - :func:`Series` / :func:`DataFrame` tab completion also returns identifiers in the first level of a :func:`MultiIndex`. (:issue:`16326`)
 - :func:`read_excel()` has gained the ``nrows`` parameter (:issue:`16645`)
+- :meth:`DataFrame.append` now preserves the type of the calling dataframe's columns, when possible (:issue:`18359`)


can you add the case this is actually addressing (e.g. categorical)

topper-123 · 2018-04-19T20:04:44Z

@jreback, could you pull this in?

jreback · 2018-04-20T00:35:19Z

thanks @topper-123

jonblunt · 2018-04-26T16:58:42Z

A related question.

I was expecting making a column an index , or level in a MultiIndex, to not change the dtype.

df  = pd.DataFrame(dict(x=np.array([1,2],dtype = 'int32'), 
                        y =['a', 'b']))

df.set_index('x').reset_index().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
x    2 non-null int64
y    2 non-null object
dtypes: int64(1), object(1)
memory usage: 104.0+ bytes

Is this exepected ,and will this enhancement change this? Thanks

topper-123 · 2018-04-26T17:12:48Z

@jonblunt

The index will in this case be a Int64Index:

>>> df.set_index('x').index
Int64Index([1, 2], dtype='int64', name='x')

Pandas doesn't have a Int32Index, so this behaviour is as expected and you shouldn't expect an int32 when resetting (or you'll have to reset manually).

jbrockmendel · 2021-05-22T23:20:06Z

pandas/tests/reshape/test_concat.py

+            self, index_can_append, index_cannot_append_with_other):
+        # GH18359
+        # Dataframe.append will raise if IntervalIndex/MultiIndex appends
+        # or is appended to a different index type


@topper-123 can you help me understand why raising is the correct thing to do here? i have a branch that fixes a Series construction bug:

def test_constructor_dict_multiindex_reindex_flat(self): # construction involves reindexing with a MultiIndex corner case data = {('i', 'i'): 0, ('i', 'j'): 1, ('j', 'i'): 2, "j": np.nan} expected = Series(data) result = Series(expected[:-1].to_dict(), index=expected.index) tm.assert_series_equal(result, expected)

but the fix for that (in MultiIndex.reindex) causes this test to fail

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 72214ca to 6bb6860 Compare January 1, 2018 03:22

jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 1, 2018

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 6bb6860 to 9cfa3d3 Compare January 1, 2018 15:26

jreback requested changes Jan 1, 2018

View reviewed changes

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 9cfa3d3 to 0d55da6 Compare January 2, 2018 13:13

jreback requested changes Jan 4, 2018

View reviewed changes

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 8f8b140 to bd400df Compare January 5, 2018 23:08

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch 2 times, most recently from 730ac3a to 26e7419 Compare January 5, 2018 23:32

topper-123 mentioned this pull request Jan 6, 2018

BUG: pd.IntervalIndex.difference raises when empty difference #19101

Closed

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch 2 times, most recently from 464fdc2 to 48f406b Compare January 12, 2018 17:06

jreback added this to the 0.23.0 milestone Jan 14, 2018

jreback requested changes Jan 19, 2018

View reviewed changes

jschendel reviewed Jan 20, 2018

View reviewed changes

topper-123 mentioned this pull request Jan 21, 2018

Change IntervalIndex set-ops error code type #19329

Merged

4 tasks

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from a5deefb to 860a5d2 Compare January 21, 2018 20:32

jreback requested changes Jan 21, 2018

View reviewed changes

jreback requested changes Feb 19, 2018

View reviewed changes

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 944cf96 to f2fbdf4 Compare February 21, 2018 19:30

jreback requested changes Feb 22, 2018

View reviewed changes

jreback removed this from the 0.23.0 milestone Apr 14, 2018

tp and others added 5 commits April 14, 2018 21:09

DataFrame.append preserves columns dtype if possible

b3e8b93

DataFrame.append: add parametrized tests, different dtypes

093cd2c

add ids to parametrization

4cb8a5b

add ordered categories

ed895a4

DataFrame.append preserves columns dtype if possible

fd1e7a5

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from f2fbdf4 to 7e57ff6 Compare April 14, 2018 21:54

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 7e57ff6 to fe4bc5a Compare April 14, 2018 22:39

jreback requested changes Apr 17, 2018

View reviewed changes

DataFrame.append simplify tests

13537ae

topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from fe4bc5a to 13537ae Compare April 17, 2018 17:28

jreback added this to the 0.23.0 milestone Apr 20, 2018

jreback approved these changes Apr 20, 2018

View reviewed changes

jreback merged commit 3e691a4 into pandas-dev:master Apr 20, 2018

topper-123 deleted the DataFrame.append_preserve_columns_dtype branch May 18, 2018 18:09

jbrockmendel reviewed May 22, 2021

View reviewed changes

	def _as_like_interval_index(self, other, error_msg):
	self._assert_can_do_setop(other)
	other = _ensure_index(other)
	if (not isinstance(other, IntervalIndex) or
	self.closed != other.closed):
	raise ValueError(error_msg)
	return other

	# non-IntervalIndex
	msg = ('can only do set operations between two IntervalIndex objects '
	'that are closed on the same side')
	with tm.assert_raises_regex(ValueError, msg):
	set_op(Index([1, 2, 3]))

		- ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`)
		- :meth:`DataFrame.append` now preserves the type of the calling dataframe's columns, when possible (:issue:`18359`)

		# different index type

		ii = pd.IntervalIndex.from_breaks([0, 1, 2, 3])

ENH: DataFrame.append preserves columns dtype if possible #19021

ENH: DataFrame.append preserves columns dtype if possible #19021

Conversation

topper-123 commented Dec 31, 2017 • edited

topper-123 commented Jan 1, 2018

jreback commented Jan 1, 2018

codecov bot commented Jan 1, 2018 • edited

Codecov Report

topper-123 commented Jan 1, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 Jan 2, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 commented Jan 2, 2018 • edited

Choose a reason for hiding this comment

pep8speaks commented Jan 5, 2018 • edited

Comment last updated on April 17, 2018 at 17:28 Hours UTC

topper-123 commented Jan 12, 2018

jreback commented Jan 14, 2018

jreback commented Jan 16, 2018

topper-123 commented Jan 18, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jan 19, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 Jan 21, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 Jan 21, 2018 • edited

Choose a reason for hiding this comment

topper-123 commented Feb 18, 2018 • edited

Choose a reason for hiding this comment

topper-123 commented Feb 21, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Mar 16, 2018

jreback commented Apr 1, 2018

jreback commented Apr 14, 2018

topper-123 commented Apr 14, 2018 • edited

topper-123 commented Apr 17, 2018

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 commented Apr 19, 2018

jreback commented Apr 20, 2018

jonblunt commented Apr 26, 2018

topper-123 commented Apr 26, 2018 • edited

Choose a reason for hiding this comment

topper-123 commented Dec 31, 2017 •

edited

codecov bot commented Jan 1, 2018 •

edited

topper-123 Jan 2, 2018 •

edited

topper-123 commented Jan 2, 2018 •

edited

pep8speaks commented Jan 5, 2018 •

edited

topper-123 commented Jan 18, 2018 •

edited

topper-123 Jan 21, 2018 •

edited

topper-123 Jan 21, 2018 •

edited

topper-123 commented Feb 18, 2018 •

edited

topper-123 commented Apr 14, 2018 •

edited

topper-123 commented Apr 26, 2018 •

edited