Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix MultiIndex DataFrame to_csv() segfault #26355

Merged
merged 1 commit into from May 16, 2019
Merged

Conversation

mahepe
Copy link
Contributor

@mahepe mahepe commented May 12, 2019

This fix for #26303 would avoid an indexing issue caused by using MultiIndexes which results to a segfault in the example provided in #26303.

@@ -874,6 +874,13 @@ def test_to_csv_index_no_leading_comma(self):
expected = tm.convert_rows_list_to_csv_str(expected_rows)
assert buf.getvalue() == expected

def test_to_csv_multi_index(self):
# see gh-26303
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don’t actually write a file as that leaves an artifact
either use a context manager (see other routines) or write to StringIO

in either case compare the result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this comment @jreback, resolved in c5d36b9

@codecov
Copy link

codecov bot commented May 12, 2019

Codecov Report

Merging #26355 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26355      +/-   ##
==========================================
- Coverage   92.04%   92.03%   -0.01%     
==========================================
  Files         175      175              
  Lines       52292    52292              
==========================================
- Hits        48132    48128       -4     
- Misses       4160     4164       +4
Flag Coverage Δ
#multiple 90.59% <100%> (ø) ⬆️
#single 40.71% <0%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.62% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 681a22c...17d04a5. Read the comment docs.

@codecov
Copy link

codecov bot commented May 12, 2019

Codecov Report

Merging #26355 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26355      +/-   ##
==========================================
- Coverage   91.68%   91.67%   -0.01%     
==========================================
  Files         174      174              
  Lines       50703    50703              
==========================================
- Hits        46488    46484       -4     
- Misses       4215     4219       +4
Flag Coverage Δ
#multiple 90.18% <100%> (ø) ⬆️
#single 41.17% <0%> (-0.18%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.62% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6e43a4...d1d0565. Read the comment docs.

@@ -254,6 +254,7 @@ Performance Improvements
- Improved performance of :meth:`Series.map` for dictionary mappers on categorical series by mapping the categories instead of mapping all values (:issue:`23785`)

.. _whatsnew_0250.bug_fixes:
- Fixed MultiIndex DataFrame to_csv segfault (:issue:`26303`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to Indexing, use :class:`MultiIdex` and :meth:DataFrame.to_csv ``; specify that this is a single-level multi-index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d47089e

@@ -946,7 +946,7 @@ def _format_native_types(self, na_rep='nan', **kwargs):
new_codes.append(level_codes)

if len(new_levels) == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment here that this is a single-level MI

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do new_levels and new_codes look like here (from your test)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of new_levels is [array(['1', '2', '3'], dtype='<U21')].
The value of new_codes is [FrozenNDArray([0, 2], dtype='int8')]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment in d47089e

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok use levels[0].take(codes[0])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 159a828

@@ -874,6 +874,18 @@ def test_to_csv_index_no_leading_comma(self):
expected = tm.convert_rows_list_to_csv_str(expected_rows)
assert buf.getvalue() == expected

def test_to_csv_multi_index(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have a multiindex test, can you move this near; change the test to be test_to_csv_single_level_multiindex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d47089e

index = pd.Index([(1,), (2,), (3,)])
df = pd.DataFrame([[1, 2, 3]], columns=index)
with ensure_clean() as path:
df = df.reindex(columns=[(1,), (3,)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this reindex needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reindex is what seems to cause the segfault in DataFrame.to_csv as demonstrated in #26303. It seems that it has something to do with a multi-index having a larger number of levels than there are columns in a data frame.

@jreback jreback added IO CSV read_csv, to_csv MultiIndex labels May 12, 2019
@mahepe mahepe force-pushed the dev branch 3 times, most recently from d47089e to d403442 Compare May 12, 2019 15:31
@pep8speaks
Copy link

pep8speaks commented May 12, 2019

Hello @mahepe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-13 18:15:51 UTC

@mahepe mahepe force-pushed the dev branch 2 times, most recently from 159a828 to a3f85e8 Compare May 12, 2019 16:11
@@ -946,7 +946,9 @@ def _format_native_types(self, na_rep='nan', **kwargs):
new_codes.append(level_codes)

if len(new_levels) == 1:
return Index(new_levels[0])._format_native_types()
# a single-level multi-index
return Index(new_levels[0].take(new_codes[0])).\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't need the explicit line break here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reformatted

df.to_csv(path, line_terminator='\n')
expected = b",1,3\n0,1,3\n"

with open(path, mode='rb') as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to physically write this to a file to reproduce the segfault or can you just inspect the result returned as a string?

Copy link
Contributor Author

@mahepe mahepe May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out that no: Tried the example provided in the issue. The segfault occured even if the csv was never written to a file.

Updated the test so that it doesn't write to a file.

@jreback jreback merged commit 9c5165e into pandas-dev:master May 16, 2019
@jreback
Copy link
Contributor

jreback commented May 16, 2019

thanks @mahepe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MultiIndex DataFrame to_csv() terminates python process
4 participants