BUG: Fix MultiIndex DataFrame to_csv() segfault #26355

mahepe · 2019-05-12T13:15:24Z

- closes #26303 (MultiIndex DataFrame to_csv() terminates python process)
- tests added / passed
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry

This fix for #26303 avoids an indexing issue caused by MultiIndexes, which results in a segfault in the example provided in #26303.

jreback · 2019-05-12T13:46:19Z

pandas/tests/frame/test_to_csv.py

@@ -874,6 +874,13 @@ def test_to_csv_index_no_leading_comma(self):
        expected = tm.convert_rows_list_to_csv_str(expected_rows)
        assert buf.getvalue() == expected

+    def test_to_csv_multi_index(self):
+        # see gh-26303


don’t actually write a file as that leaves an artifact
either use a context manager (see other routines) or write to StringIO

in either case compare the result

Thanks for this comment @jreback, resolved in c5d36b9
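
For reference, the in-memory approach suggested above might look like the following minimal sketch. The frame and its values are illustrative assumptions, not taken from the PR; only `io.StringIO` and `DataFrame.to_csv` are used.

```python
import io

import pandas as pd

# Illustrative frame; the values are assumptions, not from the PR.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write into an in-memory buffer so no file is left behind.
buf = io.StringIO()
df.to_csv(buf)

# Compare the buffer contents line by line against the expected CSV.
result_lines = buf.getvalue().splitlines()
expected_lines = [",a,b", "0,1,3", "1,2,4"]
assert result_lines == expected_lines
```

Splitting on lines keeps the comparison independent of the platform's line terminator.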

codecov · 2019-05-12T13:54:10Z

Codecov Report

Merging #26355 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26355      +/-   ##
==========================================
- Coverage   92.04%   92.03%   -0.01%     
==========================================
  Files         175      175              
  Lines       52292    52292              
==========================================
- Hits        48132    48128       -4     
- Misses       4160     4164       +4

Flag	Coverage Δ
#multiple	`90.59% <100%> (ø)`	⬆️
#single	`40.71% <0%> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/multi.py	`95.62% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97.01% <0%> (-0.12%)`	⬇️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 681a22c...17d04a5. Read the comment docs.

codecov · 2019-05-12T13:54:10Z

Codecov Report

Merging #26355 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26355      +/-   ##
==========================================
- Coverage   91.68%   91.67%   -0.01%     
==========================================
  Files         174      174              
  Lines       50703    50703              
==========================================
- Hits        46488    46484       -4     
- Misses       4215     4219       +4

Flag	Coverage Δ
#multiple	`90.18% <100%> (ø)`	⬆️
#single	`41.17% <0%> (-0.18%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/multi.py	`95.62% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97.01% <0%> (-0.12%)`	⬇️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6e43a4...d1d0565. Read the comment docs.

jreback · 2019-05-12T14:41:36Z

doc/source/whatsnew/v0.25.0.rst

@@ -254,6 +254,7 @@ Performance Improvements
 - Improved performance of :meth:`Series.map` for dictionary mappers on categorical series by mapping the categories instead of mapping all values (:issue:`23785`)

 .. _whatsnew_0250.bug_fixes:
+- Fixed MultiIndex DataFrame to_csv segfault (:issue:`26303`)


move to Indexing, use :class:`MultiIndex` and :meth:`DataFrame.to_csv`; specify that this is a single-level multi-index

Done in d47089e

jreback · 2019-05-12T14:42:44Z

pandas/core/indexes/multi.py

@@ -946,7 +946,7 @@ def _format_native_types(self, na_rep='nan', **kwargs):
            new_codes.append(level_codes)

        if len(new_levels) == 1:


add a comment here that this is a single-level MI

what do new_levels and new_codes look like here (from your test)

The value of new_levels is [array(['1', '2', '3'], dtype='<U21')].
The value of new_codes is [FrozenNDArray([0, 2], dtype='int8')]

Added a comment in d47089e

ok use levels[0].take(codes[0])

Done in 159a828
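
To illustrate why `take` was suggested, here is a small sketch using plain NumPy arrays that mirror the values reported above. It is not the pandas implementation; it assumes NumPy is available and that the reported arrays behave like ordinary `ndarray` objects.

```python
import numpy as np

# Values reported in the thread for the failing case.
new_levels = [np.array(["1", "2", "3"])]
new_codes = [np.array([0, 2], dtype="int8")]

# Passing the whole level on would yield three labels for two columns.
assert len(new_levels[0]) == 3

# take() maps each code back to its label, so only referenced labels remain.
labels = new_levels[0].take(new_codes[0])
print(labels.tolist())  # ['1', '3']
```

The result has one label per code, which matches the number of columns actually present.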

jreback · 2019-05-12T14:44:36Z

pandas/tests/frame/test_to_csv.py

@@ -874,6 +874,18 @@ def test_to_csv_index_no_leading_comma(self):
        expected = tm.convert_rows_list_to_csv_str(expected_rows)
        assert buf.getvalue() == expected

+    def test_to_csv_multi_index(self):


we already have a multiindex test, can you move this near it; rename the test to test_to_csv_single_level_multiindex

Done in d47089e

jreback · 2019-05-12T14:45:02Z

pandas/tests/frame/test_to_csv.py

+        index = pd.Index([(1,), (2,), (3,)])
+        df = pd.DataFrame([[1, 2, 3]], columns=index)
+        with ensure_clean() as path:
+            df = df.reindex(columns=[(1,), (3,)])


why is this reindex needed?

The reindex is what seems to cause the segfault in DataFrame.to_csv, as demonstrated in #26303. It seems to be related to a multi-index whose level holds more values than the data frame has columns.
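
The mismatch described here can be sketched with a standalone `MultiIndex`. This example uses positional selection rather than `reindex` as a stand-in (an assumption that both leave unused level values behind, which matches the values reported earlier in the thread), and relies on the public `levels`, `codes`, and `remove_unused_levels` members.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([(1,), (2,), (3,)])

# Positional selection keeps the full level; only the codes shrink.
sub = mi[[0, 2]]
level_values = sub.levels[0].tolist()
codes = sub.codes[0].tolist()

# Dropping unused level values brings the level back in line with the entries.
trimmed = sub.remove_unused_levels().levels[0].tolist()

print(len(sub), level_values, codes, trimmed)
```

Here the index has two entries while its level still holds three values, which is the kind of length mismatch the fix has to account for.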

pep8speaks · 2019-05-12T15:31:32Z

Hello @mahepe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-13 18:15:51 UTC

WillAyd · 2019-05-13T14:41:11Z

pandas/core/indexes/multi.py

@@ -946,7 +946,9 @@ def _format_native_types(self, na_rep='nan', **kwargs):
            new_codes.append(level_codes)

        if len(new_levels) == 1:
-            return Index(new_levels[0])._format_native_types()
+            # a single-level multi-index
+            return Index(new_levels[0].take(new_codes[0])).\


Shouldn't need the explicit line break here

Reformatted

WillAyd · 2019-05-13T14:41:54Z

pandas/tests/frame/test_to_csv.py

+            df.to_csv(path, line_terminator='\n')
+            expected = b",1,3\n0,1,3\n"
+
+            with open(path, mode='rb') as f:


Do you need to physically write this to a file to reproduce the segfault or can you just inspect the result returned as a string?

Turns out it isn't needed: I tried the example provided in the issue, and the segfault occurred even when the CSV was never written to a file.

Updated the test so that it doesn't write to a file.
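
A file-free check along these lines might look like the sketch below. It assumes a pandas version that already includes this fix (earlier versions may crash on the same input) and reuses the frame from the test under review; the variable names are illustrative.

```python
import pandas as pd

# A list of 1-tuples produces a single-level MultiIndex.
index = pd.Index([(1,), (2,), (3,)])
df = pd.DataFrame([[1, 2, 3]], columns=index)
df = df.reindex(columns=[(1,), (3,)])

# With no path argument, to_csv returns the CSV text instead of writing a file.
result = df.to_csv()
lines = result.splitlines()
assert lines == [",1,3", "0,1,3"]
```

Comparing split lines avoids having to pass a line terminator argument, whose keyword name has changed across pandas versions.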

jreback · 2019-05-16T01:47:44Z

thanks @mahepe

jreback requested changes May 12, 2019

mahepe force-pushed the dev branch from 17d04a5 to c5d36b9 Compare May 12, 2019 14:37

jreback requested changes May 12, 2019

jreback added IO CSV read_csv, to_csv MultiIndex labels May 12, 2019

mahepe force-pushed the dev branch 3 times, most recently from d47089e to d403442 Compare May 12, 2019 15:31

mahepe force-pushed the dev branch 2 times, most recently from 159a828 to a3f85e8 Compare May 12, 2019 16:11

WillAyd reviewed May 13, 2019

BUG: Fix MultiIndex DataFrame to_csv() segfault (pandas-dev#26303)

d1d0565

mahepe force-pushed the dev branch from a3f85e8 to d1d0565 Compare May 13, 2019 18:15

jreback added this to the 0.25.0 milestone May 16, 2019

jreback approved these changes May 16, 2019

jreback merged commit 9c5165e into pandas-dev:master May 16, 2019
