Backport PR #54954 on branch 2.1.x (REGR: read_csv splitting on comma with delim_whitespace) (#54967)

Backport PR #54954: REGR: read_csv splitting on comma with delim_whitespace

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
meeseeksmachine and phofl committed Sep 2, 2023
1 parent ed1f044 commit ca89e78
Showing 3 changed files with 29 additions and 1 deletion.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.1.1.rst
@@ -15,6 +15,7 @@ Fixed regressions
 ~~~~~~~~~~~~~~~~~
 - Fixed regression in :func:`merge` when merging over a PyArrow string index (:issue:`54894`)
 - Fixed regression in :func:`read_csv` when ``usecols`` is given and ``dtypes`` is a dict for ``engine="python"`` (:issue:`54868`)
+- Fixed regression in :func:`read_csv` when ``delim_whitespace`` is True (:issue:`54918`, :issue:`54931`)
 - Fixed regression in :meth:`.GroupBy.get_group` raising for ``axis=1`` (:issue:`54858`)
 - Fixed regression in :meth:`DataFrame.__setitem__` raising ``AssertionError`` when setting a :class:`Series` with a partial :class:`MultiIndex` (:issue:`54875`)
 - Fixed regression in :meth:`Series.drop_duplicates` for PyArrow strings (:issue:`54904`)
3 changes: 2 additions & 1 deletion pandas/_libs/src/parser/tokenizer.c
@@ -664,7 +664,8 @@ static int parser_buffer_bytes(parser_t *self, size_t nbytes,
   ((!self->delim_whitespace && c == ' ' && self->skipinitialspace))
 
 // applied when in a field
-#define IS_DELIMITER(c) ((c == delimiter) || (delim_whitespace && isblank(c)))
+#define IS_DELIMITER(c) \
+  ((!delim_whitespace && c == delimiter) || (delim_whitespace && isblank(c)))
 
 #define _TOKEN_CLEANUP() \
   self->stream_len = slen; \
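
The effect of the `IS_DELIMITER` change can be modelled outside the tokenizer. The following is a minimal Python sketch, not the parser's actual code: `is_delimiter_before`, `is_delimiter_after`, and `split_line` are hypothetical names introduced here for illustration, `isblank` is approximated as "space or tab", and the delimiter is assumed to still be `","` while `delim_whitespace` is set.

```python
def is_delimiter_before(c, delimiter=",", delim_whitespace=False):
    # Model of the removed macro: the explicit delimiter always matches,
    # even when whitespace splitting was requested.
    return c == delimiter or (delim_whitespace and c in " \t")


def is_delimiter_after(c, delimiter=",", delim_whitespace=False):
    # Model of the added macro: the explicit delimiter only matches
    # when whitespace splitting is off.
    return (not delim_whitespace and c == delimiter) or (
        delim_whitespace and c in " \t"
    )


def split_line(line, is_delimiter, delimiter=",", delim_whitespace=False):
    # Toy splitter (illustration only): start a new field at every
    # character the predicate accepts as a delimiter.
    fields, current = [], []
    for c in line:
        if is_delimiter(c, delimiter, delim_whitespace):
            fields.append("".join(current))
            current = []
        else:
            current.append(c)
    fields.append("".join(current))
    return fields


line = "1,2 3,4"
old_fields = split_line(line, is_delimiter_before, delim_whitespace=True)
new_fields = split_line(line, is_delimiter_after, delim_whitespace=True)
print(old_fields)
print(new_fields)
```

Under these assumptions the old predicate breaks the line on both the comma and the space, while the new one breaks only on the space, which is the behaviour the added tests check for.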
26 changes: 26 additions & 0 deletions pandas/tests/io/parser/test_header.py
@@ -658,3 +658,29 @@ def test_header_missing_rows(all_parsers):
     msg = r"Passed header=\[0,1,2\], len of 3, but only 2 lines in file"
     with pytest.raises(ValueError, match=msg):
         parser.read_csv(StringIO(data), header=[0, 1, 2])
+
+
+@skip_pyarrow
+def test_header_multiple_whitespaces(all_parsers):
+    # GH#54931
+    parser = all_parsers
+    data = """aa bb(1,1) cc(1,1)
+0 2 3.5"""
+
+    result = parser.read_csv(StringIO(data), sep=r"\s+")
+    expected = DataFrame({"aa": [0], "bb(1,1)": 2, "cc(1,1)": 3.5})
+    tm.assert_frame_equal(result, expected)
+
+
+@skip_pyarrow
+def test_header_delim_whitespace(all_parsers):
+    # GH#54918
+    parser = all_parsers
+    data = """a,b
+1,2
+3,4
+"""
+
+    result = parser.read_csv(StringIO(data), delim_whitespace=True)
+    expected = DataFrame({"a,b": ["1,2", "3,4"]})
+    tm.assert_frame_equal(result, expected)
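
For a quick check outside the test suite, the scenario from GH#54918 can be reproduced with the public API. This is a sketch rather than part of the commit: it assumes a pandas build that contains this fix, and it uses `sep=r"\s+"` (the spelling used in the GH#54931 test above) instead of `delim_whitespace=True`, on the assumption that both select the same whitespace-delimited path in the default C engine.

```python
from io import StringIO

import pandas as pd

# Same data as test_header_delim_whitespace: commas, but no whitespace
# inside any line.
data = "a,b\n1,2\n3,4\n"

# With whitespace as the only separator, each line should stay a single field.
df = pd.read_csv(StringIO(data), sep=r"\s+")

print(df.columns.tolist())
print(df["a,b"].tolist())
```

With the fix in place the frame should have one column named `a,b`. A build with the regression would be expected to split on the commas instead.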
