df.append should retain columns type if same type #18359

Closed
topper-123 opened this Issue Nov 18, 2017 · 4 comments

topper-123 commented Nov 18, 2017

Currently, df.append loses the type of the columns index if the columns are a CategoricalIndex:

>>> idx = pd.CategoricalIndex('a b'.split())
>>> df = pd.DataFrame([[1, 2]], columns=idx)
>>> ser = pd.Series([3, 4], index=idx, name=1)
>>> df.append(ser).columns
Index(['a', 'b'], dtype='object')

df.append(ser).columns should return a CategoricalIndex equal to idx.

pandas 0.21 has the new CategoricalDtype, so it's now easy to compare CategoricalIndex instances for strict type equality. Hence this issue should be much easier to solve than previously.

Solution proposal

In frame.py::DataFrame.append there is this line:

combined_columns = self.columns.tolist() + self.columns.union(
                    other.index).difference(self.columns).tolist()

This line converts CategoricalIndex columns to a plain Index, because .tolist() discards the index type. By adding some checks for types and dtypes, it should be easy to return the correct index. The above could be something like this instead:

same_types = type(self.columns) == type(other.index)
same_dtypes = self.columns.dtype == other.index.dtype
if same_types and same_dtypes:
    combined_columns = self.columns.union(other.index)
else:
    combined_columns = self.columns.tolist() + self.columns.union(
        other.index).difference(self.columns).tolist()

With that change I think this issue can be solved (I haven't checked all the details yet, so some adjustments may be needed). I'd appreciate comments on whether this approach is ok.
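A minimal sketch of the check proposed above, pulled out into a standalone function so it can be tried outside of frame.py. The name combine_columns is hypothetical and not part of pandas; wrapping the fallback result in pd.Index is also an assumption, made only so that both branches return an index.

```python
import pandas as pd


def combine_columns(columns, other_index):
    """Hypothetical helper mirroring the proposed check (not pandas API)."""
    same_types = type(columns) is type(other_index)
    same_dtypes = columns.dtype == other_index.dtype
    if same_types and same_dtypes:
        # Same index class and dtype: union keeps the index type.
        return columns.union(other_index)
    # Otherwise fall back to the existing list-based logic,
    # which yields a plain Index.
    extra = columns.union(other_index).difference(columns).tolist()
    return pd.Index(columns.tolist() + extra)


idx = pd.CategoricalIndex("a b".split())
combined = combine_columns(idx, idx)
print(type(combined).__name__, list(combined))
```

With identical CategoricalIndex inputs, as in the example from the report, the result stays a CategoricalIndex. The ordering that union produces for differing inputs may vary between pandas versions, so this sketch makes no claim about it.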

@topper-123 topper-123 changed the title from df.append should retain columns type to df.append should retain columns type if same type Nov 18, 2017

jreback commented Nov 19, 2017

yeah this is kind of messy. this should all use Index.append then none of this is an issue. we shouldn't be using .tolist() at all.

@jreback jreback added this to the Next Major Release milestone Nov 19, 2017

topper-123 commented Nov 28, 2017

Hi, I've started to look into this.

ATM it seems like .union is actually very robust, and I'm leaning towards simply using combined_columns = self.columns.union(other.index). But I wonder why you pointed to Index.append. Did you mean Index.union?

jreback commented Nov 28, 2017

no i meant append; you need to append the union of differences (i think this is the symmetric_difference)

@jreback jreback closed this Nov 28, 2017

@jreback jreback reopened this Nov 28, 2017

topper-123 commented Nov 29, 2017

symmetric_difference doesn't work:

>>> d = pd.api.types.CategoricalDtype('A B C'.split())
>>> c1 = pd.CategoricalIndex('A B'.split(), dtype=d)
>>> c2 = pd.CategoricalIndex('B C'.split(), dtype=d)
>>> c1.symmetric_difference(c2)
Index(['A', 'C'], dtype='object')  # note the index type is lost, and 'C' is already in c2 so these values are not suitable for appending

Just difference is good:

>>> c2.append(c1.difference(c2))
CategoricalIndex(['B', 'C', 'A'], categories=['A', 'B', 'C'], ordered=False, dtype='category')

Which gives the same result (in this case, maybe generally) as union:

>>> c2.union(c1)
CategoricalIndex(['B', 'C', 'A'], categories=['A', 'B', 'C'], ordered=False, dtype='category')
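The append-plus-difference approach above can be sketched as a small function. The name append_new_labels is hypothetical and not part of pandas; the sketch assumes both indexes share the same CategoricalDtype, as in the example.

```python
import pandas as pd


def append_new_labels(columns, other_index):
    """Hypothetical helper: keep `columns` in order, then add unseen labels."""
    # Labels in other_index that are not yet in columns.
    new_labels = other_index.difference(columns)
    # Index.append keeps the CategoricalIndex type when the dtypes match.
    return columns.append(new_labels)


d = pd.api.types.CategoricalDtype("A B C".split())
c1 = pd.CategoricalIndex("A B".split(), dtype=d)
c2 = pd.CategoricalIndex("B C".split(), dtype=d)

result = append_new_labels(c2, c1)
print(type(result).__name__, list(result))
```

Unlike union, this form fixes the position of the existing columns explicitly: they always come first, in their original order, followed by the new labels.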

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 14, 2018

@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018
