
ENH: add arrow engine to read_csv #31817

Closed · wants to merge 51 commits

Conversation

@lithomas1 (Member) commented Feb 9, 2020

ASVs: 100,000 rows, 5 columns, reading from BytesIO and StringIO buffers (run on a 2-core machine).

[ 75.00%] ··· io.csv.ReadCSVEngine.time_read_bytescsv                                                                                                                                                                               ok
[ 75.00%] ··· ========= ============
                engine
              --------- ------------
                  c       51.7±6ms
                python    598±40ms
               pyarrow   27.8±0.8ms
              ========= ============

[100.00%] ··· io.csv.ReadCSVEngine.time_read_stringcsv                                                                                                                                                                              ok
[100.00%] ··· ========= ===========
                engine
              --------- -----------
                  c      53.8±20ms
                python    584±30ms
               pyarrow    29.9±1ms
              ========= ===========
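
For reference, a rough standalone timing sketch in the same spirit (this is not the actual ASV benchmark code; it assumes the engine keyword this PR adds):

from io import StringIO
import timeit

import pandas as pd

# ~100,000 rows x 5 columns of integers, mirroring the benchmark shape above
data = "\n".join(",".join(str(j) for j in range(5)) for _ in range(100_000))

for engine in ("c", "python", "pyarrow"):
    t = timeit.timeit(lambda: pd.read_csv(StringIO(data), engine=engine), number=10)
    print(f"{engine}: {t / 10 * 1000:.1f} ms per read")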

@lithomas1 (Member, Author) commented:

It appears that the arrow engine does better on larger datasets. For example, on scikit-learn's diabetes dataset (24 kB vs the 3 kB iris), performance was:

4.59 ms for arrow
6.83 ms for c
8.72 ms for python

lithomas1 marked this pull request as ready for review February 9, 2020 04:54
gfyoung added the Enhancement, IO CSV, and Performance labels Feb 9, 2020
@jreback (Contributor) left a comment:

You need to hook into the test framework via the all_parsers fixture, which tests things exhaustively. This parser likely has only a very narrow set of cases where it actually passes.

Secondly, on the performance comparisons, please show the ASVs for this. Again, this might be slightly more performant than the C parser, but it's important to narrow down when and in which cases.

@jorisvandenbossche (Member) left a comment:

Thanks for working on this!

There are a few places where pyarrow has different defaults, for example which values to recognize as NA (though this is configurable). Do we want to pass our default values as options to pyarrow (so the different engines in pandas are consistent on that), or are we fine with following pyarrow's defaults?
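
A minimal sketch of the first option, passing pandas' defaults through to pyarrow (STR_NA_VALUES is a pandas-internal constant; the actual wiring in this PR may differ):

from pyarrow import csv
from pandas._libs.parsers import STR_NA_VALUES

# forward pandas' default NA strings so both engines agree on missing values
convert_options = csv.ConvertOptions(null_values=list(STR_NA_VALUES))
table = csv.read_csv("data.csv", convert_options=convert_options)
df = table.to_pandas()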

@jreback (Contributor) commented Jul 17, 2020

Put up what you want and I will have a look. It's very hard to grok the way you are using the fixtures.

@simonjayhawkins (Member) commented:

@lithomas1 can you move the release note to 1.2?

dsaxton added the Stale label Sep 17, 2020
@dsaxton (Member) commented Oct 8, 2020

@lithomas1 Can you merge master to fix the conflict?

@jorisvandenbossche (Member) commented:

I would like to see this move forward, and can also put some effort into updating it (in case @lithomas1 has no time in the short term).

@jreback can you clarify what exactly you don't like about the current pyarrow_xfail approach?

@jreback (Contributor) commented Oct 21, 2020

this needs a merge in master and a green build

will have a look

@jorisvandenbossche (Member) commented:

Yes, it needs to be updated with the latest master, but in the meantime it would still be useful to understand your concern about the fixture (as I already asked in #31817 (comment)). So can you clarify your "it's very hard to grok the way you are using the fixtures"? (See also @lithomas1's answers to your comments at #31817 (comment).)

@jreback (Contributor) left a comment:

The testing method, I think, is easily fixed; see my comments.

This needs to be integrated with the very different option validation structure that now exists in master.

Also, the actual processing in the arrow engine is pretty awkward.

engine : {``'c'``, ``'pyarrow'``, ``'python'``}
Parser engine to use. In terms of performance, the pyarrow engine,
which requires ``pyarrow`` >= 0.15.0, is faster than the C engine, which
is faster than the python engine. However, the pyarrow and C engines
Contributor: add a versionchanged 1.2 tag here

possible pandas uses the C parser (specified as ``engine='c'``), but may fall
back to Python if C-unsupported options are specified. Currently, C-unsupported
options include:
Currently, pandas supports using three engines, the C engine, the python engine,
Contributor: add a versionchanged 1.2 tag here

the pyarrow engine is much less robust than the C engine, which in turn lacks a
couple of features present in the Python parser.

Where possible pandas uses the C parser (specified as ``engine='c'``), but may fall
Contributor: we might want to refactor this entire section to provide a more table-like comparison of all of the parsers; could you create an issue for this?

@@ -271,6 +272,14 @@ change, as ``fsspec`` will still bring in the same packages as before.

.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/


read_csv() now accepts pyarrow as an engine
Contributor: move to 1.2

@@ -93,10 +100,16 @@ def import_optional_dependency(
raise ImportError(msg) from None
else:
return None

minimum_version = VERSIONS.get(name)
# Handle submodules: if we have submodule, grab parent module from sys.modules
Contributor: why is all this needed?

Member:

This has been answered before: #31817 (comment) (and the above comment has been added based on your comment)

It's to 1) import a submodule (pyarrow.csv in this case) and 2) support passing a version different from the one in our global minimum versions dictionary.

Now I suppose that the submodule importing is not necessarily needed. Right now this PR does:

csv = import_optional_dependency("pyarrow.csv", min_version="0.15")

but I suppose this could also be:

import_optional_dependency("pyarrow", min_version="0.15")
from pyarrow import csv

And then this additional code to directly import a submodule with import_optional_dependency is not needed (although where it is used, I think it is a bit cleaner to be able to directly import the submodule)

@lithomas1 (Member, Author) Oct 28, 2020:

@jorisvandenbossche importing as a submodule is required; you can't access the csv module via pyarrow.csv without importing it, as far as I remember, and if you do import pyarrow.csv directly, it won't validate the version and won't error for pyarrow<0.15.
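
For illustration, the submodule behavior in question (plain Python; nothing here is specific to this PR):

import pyarrow

# The submodule is not loaded by importing the parent package alone:
# pyarrow.csv  ->  AttributeError: module 'pyarrow' has no attribute 'csv'

import pyarrow.csv       # explicit submodule import makes pyarrow.csv usable
from pyarrow import csv  # equivalent for binding it to a local name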

return content.encode(self.encoding)


class ArrowParserWrapper(ParserBase):
Contributor: you will need to refactor this, as the current code is very different from this.

Also, I really don't like doing all of this validation in a single function.

Member:

> you will need to refactor this as the current code is very different from this.

Can you clarify a bit more what you mean? Or point to recent changes related to this?

For example, on master the C parser uses a very similar mechanism with the CParserWrapper class. Or do you only mean that it needs to split some validation into separate methods (as you indicate in a comment below as well; fully agreed with that)?

read_options["skip_rows"] = skiprows
read_options["autogenerate_column_names"] = True
read_options = pyarrow.ReadOptions(**read_options)
table = pyarrow.read_csv(
Contributor: please add line breaks between sections and comments.


parse_options = {k: v for k, v in kwdscopy.items() if k in parseoptions}
convert_options = {k: v for k, v in kwdscopy.items() if k in convertoptions}
headerexists = True if self.header is not None else False
Contributor: the whole structure of this class is really odd. The class should do construction, option validation, reading, and post-processing by calling methods.
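
A rough skeleton of the structure being asked for (hypothetical method names, not this PR's actual code):

class ArrowParserWrapper(ParserBase):
    """Sketch: split construction, validation, reading, and post-processing."""

    def __init__(self, src, **kwds):
        super().__init__(kwds)
        self.src = src
        self._validate_options(kwds)

    def _validate_options(self, kwds):
        # translate or reject keywords that pyarrow.csv does not understand
        ...

    def read(self):
        table = self._read_table()    # the pyarrow.csv.read_csv call lives here
        return self._finalize(table)  # Table -> DataFrame, dtypes, index, names

    def _read_table(self):
        ...

    def _finalize(self, table):
        ...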

@@ -3400,7 +3570,7 @@ def _isindex(colspec):
colspec = orig_names[colspec]
if _isindex(colspec):
continue
data_dict[colspec] = converter(data_dict[colspec])
data_dict[colspec] = converter(np.array(data_dict[colspec]))
Contributor: use np.asarray

Contributor: why do you need this?

@lithomas1 (Member, Author) Oct 28, 2020:

converter doesn't work on a pandas Series, only on np arrays, if I remember correctly. The alternative would be to convert the DataFrame to a dict of np arrays or something like that, which hurts perf a lot. This has much less impact and also doesn't hurt perf in the non-pyarrow case.
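
For illustration (hypothetical data; np.asarray is the suggested spelling because it avoids the copy when the Series already wraps an ndarray):

import numpy as np
import pandas as pd

s = pd.Series(["2020-01-01", "2020-01-02"])  # a column as handed back by pyarrow
arr = np.asarray(s)           # no copy if the Series is already NumPy-backed
parsed = pd.to_datetime(arr)  # converters expect ndarrays, not Series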

parser._engine = MyCParserWrapper(StringIO(data), **parser.options)

result = parser.read()
tm.assert_frame_equal(result, expected)


def test_empty_decimal_marker(all_parsers):
def test_empty_decimal_marker(all_parsers, pyarrow_xfail):
Contributor: instead of this, why don't you just define a new fixture (all_parsers_xpyarrow or something) and then just change the inputs to all of these functions? Otherwise this is awkward and unmaintainable.

Member:

And this all_parsers_xpyarrow fixture would be "all parsers but without pyarrow"?

If so, how is that necessarily more maintainable? You still need to update each test function that doesn't support pyarrow, changing all_parsers to all_parsers_xpyarrow, while with the current approach you change it to all_parsers, pyarrow_xfail. That doesn't seem like a big difference?

Member:

I might have a way to make this clearer.

At the moment the fixture is xfailing tests explicitly rather than using an xfail marker.

I have changed this, but there are some tests that seem to lock up, so we also need a skip fixture.

Am working through the failing/timing-out tests; will do a WIP commit for discussion (just revert if not happy).
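
A sketch of the marker-based fixtures under discussion (fixture names follow the thread; the actual conftest code may differ):

import pytest

@pytest.fixture
def pyarrow_xfail(request):
    # expected failure: the test still runs, but a pyarrow failure is tolerated
    parser = request.getfixturevalue("all_parsers")
    if parser.engine == "pyarrow":
        mark = pytest.mark.xfail(reason="pyarrow engine doesn't support this yet")
        request.applymarker(mark)

@pytest.fixture
def pyarrow_skip(request):
    # hard skip: for tests that hang the worker instead of failing cleanly
    parser = request.getfixturevalue("all_parsers")
    if parser.engine == "pyarrow":
        pytest.skip("pyarrow engine locks up on this test")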

Member:

Hmm, frequent lock-ups on a full test run of tests/io/parser/, so several xfails were changed to skips.

Hopefully this won't time out, and the decorators make it easier to review which functionality should be working.

Member:

macOS py37_macos didn't time out this time around, but it looks like Windows and Linux still have a problem.

@lithomas1 (Member, Author) Oct 28, 2020:

@simonjayhawkins yeah, I had problems with it getting stuck on my computer too. I think the tests are actually failing but pytest just gets stuck or something. I think some of the CI machines don't have pyarrow on them and skip the pyarrow tests, which is probably why they don't fail. Does it work with the xfail when you run that single test?

@jorisvandenbossche (Member) commented:

@jreback thanks a lot for the review! I added some comments/questions for further clarifications

currently more feature-complete.
engine : {{'c', 'python', 'pyarrow'}}, optional
Parser engine to use. The C and pyarrow engines are faster, while the python engine
is currently more feature-complete. The pyarrow engine requires ``pyarrow`` >= 0.15
Member: since our current project-wide minimum is 0.15, this can probably be removed.

@lithomas1 (Member, Author) commented Oct 28, 2020

Hi all,
Sorry that I abandoned this PR for a while, and thanks to @simonjayhawkins and @jorisvandenbossche for picking this up. There just seem to be far more failing test cases than I can handle; last time I checked it was a couple hundred. Right now there are a couple of logic errors in the code, with usecols I think, but the code should be good otherwise.

I don't think I have the time/energy to finish this one, but I can commit the WIP code that I have and provide some guidance on the code if necessary.

arw2019 mentioned this pull request Dec 8, 2020
@arw2019 (Member) commented Dec 8, 2020

Continued in #38370

Successfully merging this pull request may close these issues.

ENH: allow engine='pyarrow' in read_csv