Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: DataFrame.append preserves columns dtype if possible #19021

Merged

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Dec 31, 2017

This PR makes DataFrame.append preserve columns dtype, e.g.:

>>> idx = pd.CategoricalIndex('a b'.split())
>>> df = pd.DataFrame([[1, 2]], columns=idx)
>>> ser = pd.Series([3, 4], index=idx, name=1)
>>> df.append(ser).columns
CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

Previously, the above returned Index(['a', 'b'], dtype='object'), i.e. the index type information was lost when using append.

@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 72214ca to 6bb6860 Compare January 1, 2018 03:22
@topper-123
Copy link
Contributor Author

Can't get Travis-ci to complete because it uses too much time, somehow.

I assume this has nothing to do with this PR, but is about travis settings?

@jreback
Copy link
Contributor

jreback commented Jan 1, 2018

rebase and push again, fixed some hanging by Travis CI

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 1, 2018
@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 6bb6860 to 9cfa3d3 Compare January 1, 2018 15:26
@codecov
Copy link

codecov bot commented Jan 1, 2018

Codecov Report

Merging #19021 into master will decrease coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19021      +/-   ##
==========================================
- Coverage   91.84%   91.82%   -0.03%     
==========================================
  Files         153      153              
  Lines       49295    49299       +4     
==========================================
- Hits        45276    45268       -8     
- Misses       4019     4031      +12
Flag Coverage Δ
#multiple 90.21% <ø> (-0.03%) ⬇️
#single 41.89% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.16% <ø> (ø) ⬆️
pandas/plotting/_converter.py 65.07% <0%> (-1.74%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb095a6...13537ae. Read the comment docs.

@topper-123
Copy link
Contributor Author

Ok, rebased.

idx_diff = other.index.difference(self.columns)
try:
combined_columns = self.columns.append(idx_diff)
except TypeError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this should not be done here, rather if a TypeError happens, then Index.append handles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand you here, could you expand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [1]: pd.Index([1, 2, 3]).append(pd.Index(list('abc')))
Out[1]: Index([1, 2, 3, 'a', 'b', 'c'], dtype='object')

In [2]: pd.Index([1, 2, 3]).append(pd.CategoricalIndex(list('abc')))
Out[2]: Index([1, 2, 3, 'a', 'b', 'c'], dtype='object')

In [3]: pd.Index([1, 2, 3]).append(pd.MultiIndex.from_product([[1], ['a']]))
Out[3]: Index([1, 2, 3, (1, 'a')], dtype='object')

In [4]: pd.Index([1, 2, 3]).append(pd.IntervalIndex.from_breaks([1, 2, 3]))
Out[4]: Index([1, 2, 3, (1, 2], (2, 3]], dtype='object')

.append`` should never raise

so you are catching something that does not happen.

Copy link
Contributor Author

@topper-123 topper-123 Jan 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The below fails, though:

>>> pd.CategoricalIndex('A B C'.split()).append(pd.Index([1])
TypeError: cannot append a non-category item to a CategoricalIndex

This is what is tested in the if not isinstance(df_columns, (pd.IntervalIndex, pd.MultiIndex)): part of the tests. In addition, the reason why pd.IntervalIndex and pd.MultiIndex are excluded for testing there is because of reindexing issues:

>>> ii = pd.IntervalIndex.from_breaks([0,1,2])
>>> df = pd.DataFrame([[1, 2], [4, 5]], columns=ii)
>>> ser = pd.Series([7], index=['a'])
>>> combined = pd.Index(ii.tolist() + ser.index.tolist())
>>> df.reindex(columns=combined)
TypeError: unorderable types: Interval() > int()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, the MI case we did on purpose. However you cannot convert to lists and expect things to work. for the TypeError, do .astype(object) on each left and right then .append again. we never should coerce to lists.

def test_append_same_columns_type(self, df_columns):
# GH18359

# ser.index is a normal pd.Index, result from df.append(ser) should be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this as a separate test (the if part), though I suspect you are actually hiding a bug (see above).

def test_append_dtype_coerce(self):

# GH 4993
# appending with datetime will incorrectly convert datetime64
import datetime as dt
from pandas import NaT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can move other imports to top as well

@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 9cfa3d3 to 0d55da6 Compare January 2, 2018 13:13
@topper-123
Copy link
Contributor Author

topper-123 commented Jan 2, 2018

Changed and green (except for travis errors that seem unrelated).

A different subject, but shouldn't .reindex be able to align Interval and int objects?


# MultiIndex and IntervalIndex will raise if appended or appended to
# a different type
if (isinstance(df_columns, (pd.IntervalIndex, pd.MultiIndex)) or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't do this in a test. I want a separate test for MI/II (or 2 is ok).

@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 8f8b140 to bd400df Compare January 5, 2018 23:08
@pep8speaks
Copy link

pep8speaks commented Jan 5, 2018

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on April 17, 2018 at 17:28 Hours UTC

@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch 2 times, most recently from 730ac3a to 26e7419 Compare January 5, 2018 23:32
@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch 2 times, most recently from 464fdc2 to 48f406b Compare January 12, 2018 17:06
@topper-123
Copy link
Contributor Author

All is green again, now that #19112 has been pulled in.

@jreback jreback added this to the 0.23.0 milestone Jan 14, 2018
@jreback
Copy link
Contributor

jreback commented Jan 14, 2018

@topper-123 added some ids to the parametrization. also I added a test case in the combo test or an ordered and unordered category which is failing. (its commented out). can you have a look?

@jreback
Copy link
Contributor

jreback commented Jan 16, 2018

also pls rebase

@topper-123
Copy link
Contributor Author

topper-123 commented Jan 18, 2018

@jreback , wrt. the ordered vs unordered combo in test_append_different_columns_types: pd.CategoricalIndex('A B C'.split()) and pd.CategoricalIndex('A B C'.split(), ordered=True) combine into pd.CategoricalIndex('A B C'.split()). To pass this test, the combined columns have to have length 6, but this has length 3.

I think the current implementation is reasonable behaviour, as append tries to find the lowest common denominator, which in this case is pd.CategoricalIndex('A B C'.split()). Possibly you could give append a strict parameter where you'd raise if the two dtypes aren't the same, but that's another issue than this one, IMO.


df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=mi)
ser = pd.Series([7, 8, 9], index=other_type, name=2)
if isinstance(other_type, pd.IntervalIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is wrong, append should raise a TypeError (like all the others). can you fix this in II?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the ValueError is coming from here:

def _as_like_interval_index(self, other, error_msg):
self._assert_can_do_setop(other)
other = _ensure_index(other)
if (not isinstance(other, IntervalIndex) or
self.closed != other.closed):
raise ValueError(error_msg)
return other

Seems reasonable to split the if condition, and raise a different exception in each case, along the lines of:

if not isinstance(other, IntervalIndex):
    raise TypeError(error_msg)
elif self.closed != other.closed:
    raise ValueError(error_msg)

I think this change will only break the test below, at least within the II tests, but all that needs to be changed is ValueError --> TypeError:

# non-IntervalIndex
msg = ('can only do set operations between two IntervalIndex objects '
'that are closed on the same side')
with tm.assert_raises_regex(ValueError, msg):
set_op(Index([1, 2, 3]))

dt.datetime(2013, 1, 3, 7, 12)]),
pd.Index([4, 5, 6]),
], ids=lambda x: str(x.dtype))
def test_append_interval_index_raises(self, other_type):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above.

cc @jschendel

@jreback
Copy link
Contributor

jreback commented Jan 19, 2018

cc @jschendel


df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=other_type)
ser = pd.Series([7, 8, 9], index=ii, name=2)
with pytest.raises(ValueError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks to be the same as above

@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from a5deefb to 860a5d2 Compare January 21, 2018 20:32
- ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`)
- :meth:`DataFrame.append` now preserves the type of the calling dataframe's columns, when possible (:issue:`18359`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated the whatsnew, don't remove the whitespace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

# different index type

ii = pd.IntervalIndex.from_breaks([0, 1, 2, 3])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tests now looks superfluous, yes? and you just need to add it to the above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appending IntervalIndex with other index types still fails, only with a TypeError rather than a ValueError, so not sure I understand.

Oh, you mean combine test_append_multi_index_raises and test_append_interval_index_raises? I think you're right. I'll get on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. None of these should fail, you are catching the TypeError already (and then coercing to object).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, combining IntervalIndex and other index types will stilll fail, because Interval and other dtypes are not orderable:

>>> ii = pd.IntervalIndex.from_breaks([0,1,2,3])
>>> i = pd.RangeIndex(3)
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=ii)
>>> ser = pd.Series([7, 8, 9], index=i, name=2)
>>> df.append(ser)
TypeError: unorderable types: Interval() > int()

test_append_multi_index_raises and test_append_interval_index_raises are very similar and I've made a changes to combine them into a single test, see test_append_different_columns_types_raises.

EDIT: I'm wondering if trying to order here is a bug, as you can append to the index itself:

>>> ii.append(i)
Index([(0, 1], (1, 2], (2, 3], 0, 1, 2], dtype='object')

If the above is possible, surely you can use DataFrame.append as well. I'll take a look tonight.

@topper-123
Copy link
Contributor Author

topper-123 commented Feb 18, 2018

I've addressed the comments, except one where I've commented specifically.

If #19353 (exact matches for CategoricalIndex) or a version thereof could be accepted, it would be possible to append frames/series with CategoricalIndex as well, i.e. move IntervalIndex checks from test_append_different_columns_types_raises to test_append_different_columns_types.

columns=combined_columns)
assert_frame_equal(result, expected)

@pytest.mark.parametrize("this_type", [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indexes = [pd.RangeIndex(3),
        pd.Index([4, 5, 6]),
        pd.Index(list("abc")),
        pd.CategoricalIndex('A B C'.split()),
        pd.CategoricalIndex('A B C'.split(), ordered=True),
        pd.IntervalIndex.from_breaks([0, 1, 2, 3]),
        pd.MultiIndex.from_arrays(['A B C'.split(), 'D E F'.split()]),
        pd.DatetimeIndex([dt.datetime(2013, 1, 3, 0, 0),
                          dt.datetime(2013, 1, 3, 6, 10),
                          dt.datetime(2013, 1, 3, 7, 12)])]
@pytest.fixture(params=indexes)
def this_type(request):
    return request.param

@pytest.fixture(params=indexes)
def that_type(params=indexes):
    return request.param
.....

def append_different_coumns_types_raises(self, this_type, that_type):
   ....

@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 944cf96 to f2fbdf4 Compare February 21, 2018 19:30
@topper-123
Copy link
Contributor Author

I've uploaded a version with fixture, but honestly I consider this less readable than the previous version...

pd.MultiIndex.from_arrays(['A B C'.split(), 'D E F'.split()]),
]

@pytest.fixture(params=indexes_II)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, rather than use the fixtures, let's leave the lists where they are (though you don't need to have indexes_I at all, and pls name them something else more descriptive. and you can use parametrize directrly.

columns=ser_index)
assert_frame_equal(result, expected)

@pytest.mark.parametrize("df_columns, series_index",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just use 2 parameterize here, one for df_columns and one for series_index; you don't need combinations, pytest already does this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this approach to work, AFAIKS, parametrize doesn't do this for you, and I've googled it without results. Can you give an example?

@jreback
Copy link
Contributor

jreback commented Mar 16, 2018

can you update

@jreback
Copy link
Contributor

jreback commented Apr 1, 2018

can you rebase

@jreback
Copy link
Contributor

jreback commented Apr 14, 2018

@topper-123 can you rebase

@jreback jreback removed this from the 0.23.0 milestone Apr 14, 2018
@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from f2fbdf4 to 7e57ff6 Compare April 14, 2018 21:54
@topper-123
Copy link
Contributor Author

topper-123 commented Apr 14, 2018

Rebased.

I've also simplified the tests a bit. I dropped the fixture-based approach, as I don't think it makes things clearer in this case.

@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from 7e57ff6 to fe4bc5a Compare April 14, 2018 22:39
@topper-123
Copy link
Contributor Author

do you disagree wit the approach, @jreback?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. can you just make the whatsnew a bit more

@@ -380,6 +380,7 @@ Other Enhancements
- :class:`IntervalIndex` now supports time zone aware ``Interval`` objects (:issue:`18537`, :issue:`18538`)
- :func:`Series` / :func:`DataFrame` tab completion also returns identifiers in the first level of a :func:`MultiIndex`. (:issue:`16326`)
- :func:`read_excel()` has gained the ``nrows`` parameter (:issue:`16645`)
- :meth:`DataFrame.append` now preserves the type of the calling dataframe's columns, when possible (:issue:`18359`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the case this is actually addressing (e.g. categorical)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@topper-123 topper-123 force-pushed the DataFrame.append_preserve_columns_dtype branch from fe4bc5a to 13537ae Compare April 17, 2018 17:28
@topper-123
Copy link
Contributor Author

@jreback, could you pull this in?

@jreback jreback added this to the 0.23.0 milestone Apr 20, 2018
@jreback jreback merged commit 3e691a4 into pandas-dev:master Apr 20, 2018
@jreback
Copy link
Contributor

jreback commented Apr 20, 2018

thanks @topper-123

@jonblunt
Copy link

A related question.

I was expecting making a column an index , or level in a MultiIndex, to not change the dtype.

df  = pd.DataFrame(dict(x=np.array([1,2],dtype = 'int32'), 
                        y =['a', 'b']))

df.set_index('x').reset_index().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
x    2 non-null int64
y    2 non-null object
dtypes: int64(1), object(1)
memory usage: 104.0+ bytes

Is this exepected ,and will this enhancement change this? Thanks

@topper-123
Copy link
Contributor Author

topper-123 commented Apr 26, 2018

@jonblunt

The index will in this case be a Int64Index:

>>> df.set_index('x').index
Int64Index([1, 2], dtype='int64', name='x')

Pandas doesn't have a Int32Index, so this behaviour is as expected and you shouldn't expect an int32 when resetting (or you'll have to reset manually).

@topper-123 topper-123 deleted the DataFrame.append_preserve_columns_dtype branch May 18, 2018 18:09
self, index_can_append, index_cannot_append_with_other):
# GH18359
# Dataframe.append will raise if IntervalIndex/MultiIndex appends
# or is appended to a different index type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@topper-123 can you help me understand why raising is the correct thing to do here? i have a branch that fixes a Series construction bug:

    def test_constructor_dict_multiindex_reindex_flat(self):
        # construction involves reindexing with a MultiIndex corner case
        data = {('i', 'i'): 0, ('i', 'j'): 1, ('j', 'i'): 2, "j": np.nan}
        expected = Series(data)

        result = Series(expected[:-1].to_dict(), index=expected.index)
        tm.assert_series_equal(result, expected)

but the fix for that (in MultiIndex.reindex) causes this test to fail

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

df.append should retain columns type if same type
6 participants