New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: to_records() fails for MultiIndex DF (#21064) #21082
Conversation
@@ -328,3 +328,26 @@ def test_to_dict_index_dtypes(self, into, expected): | |||
result = DataFrame.from_dict(result, orient='index')[cols] | |||
expected = DataFrame.from_dict(expected, orient='index')[cols] | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
def test_to_records_with_multiindex(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I could remove this test test_to_records_with_multiindex
since test_to_records_index_name
is already there on line 130 and deals with multi index?
Codecov Report
@@ Coverage Diff @@
## master #21082 +/- ##
==========================================
+ Coverage 91.84% 91.84% +<.01%
==========================================
Files 153 153
Lines 49505 49508 +3
==========================================
+ Hits 45466 45469 +3
Misses 4039 4039
Continue to review full report at Codecov.
|
Seems reasonable to me, @toobaz, want to have a look? |
df = DataFrame(np.random.randn(size, size), index=index) | ||
|
||
records = df.to_records(index=True) | ||
assert len(records) == size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
construct a recarray and compare using assert_numpy_array_equal
multi = MultiIndex([['a'], ['b']], labels=[[], []]) | ||
df = DataFrame(columns=['A'], index=multi) | ||
|
||
records = df.to_records(index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@@ -1392,7 +1392,13 @@ def to_records(self, index=True, convert_datetime64=None): | |||
else: | |||
if isinstance(self.index, MultiIndex): | |||
# array of tuples to numpy cols. copy copy copy | |||
ix_vals = lmap(np.array, zip(*self.index.values)) | |||
tuples = self.index.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so rather than doing this here, call .tolist()
on this and fix the underlying code, like this:
In [30]: pd.MultiIndex.from_product([list('abc'), range(2)]).tolist()
Out[30]: [('a', 0), ('a', 1), ('b', 0), ('b', 1), ('c', 0), ('c', 1)]
In [31]: pd.MultiIndex.from_product([list('abc'), range(2)])[0:0].tolist()
Out[31]: []
so [31] should be a nested list arrays of the correct length/dtype
this might not work, but let's try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I gave this a go, but it seems a bit of an uphill fight:
When you do tolist()
on the index, it goes to
Line 893 in cc8d33e
def tolist(self): |
which seems to be used for many other classes. Changing this for the MultiIndex breaks the others, I guess you can check the type or define it in the MultiIndex
class, but might break other usages of MultiIndex. Were you thinking of adding a new tolist()
to the MultiIndex
class?
Also, just so you know, I modified the tests as requested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you can override tolist()
(its defined in pandas/core/base) in MultiIndex
@@ -1392,7 +1392,13 @@ def to_records(self, index=True, convert_datetime64=None): | |||
else: | |||
if isinstance(self.index, MultiIndex): | |||
# array of tuples to numpy cols. copy copy copy | |||
ix_vals = lmap(np.array, zip(*self.index.values)) | |||
tuples = self.index.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you can override tolist()
(its defined in pandas/core/base) in MultiIndex
can you rebase |
1 similar comment
can you rebase |
closing as stale. if you'd like to continue, pls ping. |
git diff upstream/master -u -- "*.py" | flake8 --diff
This fixes the bug that prevents using
to_records
on an empty dataframe that has a MultiIndexself.index.values
returns an empty array (no tuples) for an empty MultiIndex DF and that would be used forix_values
. Instead, `ix_values should have an array of empty arrays, each one with the correct dtype according to the MultiIndex level/column.