
ENH: add arrow engine to read_csv #31817

Closed · wants to merge 51 commits
Changes from 42 commits
f22ff46
add arrow engine to read_csv
lithomas1 Feb 9, 2020
8ae43e4
fix failing test
lithomas1 Feb 9, 2020
09074df
formatting and revert unnecessary change
lithomas1 Feb 9, 2020
6be276d
remove bloat and more formatting changes
lithomas1 Feb 9, 2020
df4fa7e
Whatsnew
lithomas1 Feb 9, 2020
9cd9a6f
Merge remote-tracking branch 'upstream/master' into add-arrow-engine
lithomas1 Feb 9, 2020
ecaf3fd
Get tests up and running
lithomas1 Feb 10, 2020
b3c3287
Some fixes
lithomas1 Feb 10, 2020
474baf4
Add asvs and xfail some tests
lithomas1 Feb 11, 2020
2cd9937
address comments
lithomas1 Feb 20, 2020
48ff255
Merge branch 'master' into add-arrow-engine
lithomas1 Feb 20, 2020
3d15a56
fix typo
lithomas1 Feb 20, 2020
c969373
Merge branch 'add-arrow-engine' of github-other.com:lithomas1/pandas …
lithomas1 Feb 20, 2020
98aa134
some fixes
lithomas1 Feb 29, 2020
b9c6d2c
Fix bug
lithomas1 Apr 5, 2020
67c5db6
Fix merge conflicts
lithomas1 Apr 5, 2020
7f891a6
New benchmark and fix more tests
lithomas1 Apr 10, 2020
11fc737
Merge branch 'master' into add-arrow-engine
lithomas1 Apr 10, 2020
23425f7
More cleanups
lithomas1 Apr 10, 2020
d9b7a1f
Merge master
lithomas1 Apr 10, 2020
b8adf3c
Merge branch 'add-arrow-engine' of github-other.com:lithomas1/pandas …
lithomas1 Apr 11, 2020
01c0394
Formatting fixes and typo correction
lithomas1 Apr 11, 2020
ba5620f
skip pyarrow tests if not installed
lithomas1 Apr 12, 2020
2570c82
Address comments
lithomas1 Apr 12, 2020
b3a1f66
Get some more tests to pass
lithomas1 Apr 14, 2020
d46ceed
Fix some bugs and cleanups
lithomas1 Apr 17, 2020
d67925c
Merge branch 'master' into add-arrow-engine
lithomas1 Apr 17, 2020
6378459
Perform version checks for submodule imports too
lithomas1 May 20, 2020
9d64882
Refresh with newer pyarrow
lithomas1 May 20, 2020
852ecf9
Merge branch 'master' into add-arrow-engine
lithomas1 May 20, 2020
93382b4
Start xfailing tests
lithomas1 May 21, 2020
f1bb4e2
Get all tests to run & some fixes
lithomas1 May 27, 2020
14c13ab
Merge branch 'master' into add-arrow-engine
lithomas1 May 27, 2020
7876b4e
Lint and CI
lithomas1 May 29, 2020
4426642
Merge branch 'master' into add-arrow-engine
lithomas1 May 29, 2020
008acab
parse_dates support and fixups of some tests
lithomas1 Jun 3, 2020
2dddae7
Date parsing fixes and address comments
lithomas1 Jun 13, 2020
261ef6a
Merge branch 'master' into add-arrow-engine
lithomas1 Jun 13, 2020
88e200a
Clean/Address comments/Update docs
lithomas1 Jun 29, 2020
bf063ab
Merge branch 'master' into add-arrow-engine
lithomas1 Jun 29, 2020
ede2799
Fix typo
lithomas1 Jun 29, 2020
e8eff08
Fix doc failures
lithomas1 Jul 8, 2020
87cfcf5
Merge remote-tracking branch 'upstream/master' into add-arrow-engine
simonjayhawkins Oct 22, 2020
55139ee
wip
simonjayhawkins Oct 22, 2020
c1aeecf
more xfails and skips
simonjayhawkins Oct 22, 2020
62fc9d6
Merge branch 'master' into add-arrow-engine
lithomas1 Oct 28, 2020
b53a620
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 28, 2020
f13113d
Fix typos
lithomas1 Oct 28, 2020
f9ce2e4
Doc fixes and more typo fixes
lithomas1 Oct 28, 2020
4158d6a
Green?
lithomas1 Nov 2, 2020
d34e75f
Merge branch 'master' into add-arrow-engine
lithomas1 Nov 17, 2020
64 changes: 46 additions & 18 deletions asv_bench/benchmarks/io/csv.py
@@ -1,4 +1,4 @@
from io import StringIO
from io import BytesIO, StringIO
import random
import string

@@ -146,10 +146,10 @@ def time_read_csv(self, bad_date_value):
class ReadCSVSkipRows(BaseIO):

fname = "__test__.csv"
params = [None, 10000]
param_names = ["skiprows"]
params = ([None, 10000], ["c", "pyarrow"])
param_names = ["skiprows", "engine"]

def setup(self, skiprows):
def setup(self, skiprows, engine):
N = 20000
index = tm.makeStringIndex(N)
df = DataFrame(
@@ -164,8 +164,8 @@ def setup(self, skiprows):
)
df.to_csv(self.fname)

def time_skipprows(self, skiprows):
read_csv(self.fname, skiprows=skiprows)
def time_skipprows(self, skiprows, engine):
read_csv(self.fname, skiprows=skiprows, engine=engine)


class ReadUint64Integers(StringIORewind):
@@ -254,9 +254,30 @@ def time_read_csv_python_engine(self, sep, decimal, float_precision):
names=list("abc"),
)

def time_read_csv_arrow(self, sep, decimal, float_precision):
read_csv(
self.data(self.StringIO_input), sep=sep, header=None, names=list("abc"),
)

class ReadCSVCategorical(BaseIO):

class ReadCSVEngine(StringIORewind):
params = ["c", "python", "pyarrow"]
param_names = ["engine"]

def setup(self, engine):
data = ["A,B,C,D,E"] + (["1,2,3,4,5"] * 100000)
self.StringIO_input = StringIO("\n".join(data))
# simulate reading from file
self.BytesIO_input = BytesIO(self.StringIO_input.read().encode("utf-8"))

def time_read_stringcsv(self, engine):
read_csv(self.data(self.StringIO_input), engine=engine)

def time_read_bytescsv(self, engine):
read_csv(self.data(self.BytesIO_input), engine=engine)


class ReadCSVCategorical(BaseIO):
fname = "__test__.csv"

def setup(self):
@@ -273,7 +294,10 @@ def time_convert_direct(self):


class ReadCSVParseDates(StringIORewind):
def setup(self):
params = ["c", "python"]
param_names = ["engine"]

def setup(self, engine):
data = """{},19:00:00,18:56:00,0.8100,2.8100,7.2000,0.0000,280.0000\n
{},20:00:00,19:56:00,0.0100,2.2100,7.2000,0.0000,260.0000\n
{},21:00:00,20:56:00,-0.5900,2.2100,5.7000,0.0000,280.0000\n
@@ -284,18 +308,20 @@ def setup(self):
data = data.format(*two_cols)
self.StringIO_input = StringIO(data)

def time_multiple_date(self):
def time_multiple_date(self, engine):
read_csv(
self.data(self.StringIO_input),
engine=engine,
sep=",",
header=None,
names=list(string.digits[:9]),
parse_dates=[[1, 2], [1, 3]],
)

def time_baseline(self):
def time_baseline(self, engine):
read_csv(
self.data(self.StringIO_input),
engine=engine,
sep=",",
header=None,
parse_dates=[1],
@@ -304,17 +330,18 @@ def time_baseline(self):


class ReadCSVCachedParseDates(StringIORewind):
params = ([True, False],)
param_names = ["do_cache"]
params = ([True, False], ["c", "pyarrow", "python"])
param_names = ["do_cache", "engine"]

def setup(self, do_cache):
def setup(self, do_cache, engine):
data = ("\n".join(f"10/{year}" for year in range(2000, 2100)) + "\n") * 10
self.StringIO_input = StringIO(data)

def time_read_csv_cached(self, do_cache):
def time_read_csv_cached(self, do_cache, engine):
try:
read_csv(
self.data(self.StringIO_input),
engine=engine,
header=None,
parse_dates=[0],
cache_dates=do_cache,
@@ -344,22 +371,23 @@ def mem_parser_chunks(self):


class ReadCSVParseSpecialDate(StringIORewind):
params = (["mY", "mdY", "hm"],)
param_names = ["value"]
params = (["mY", "mdY", "hm"], ["c", "pyarrow", "python"])
param_names = ["value", "engine"]
objects = {
"mY": "01-2019\n10-2019\n02/2000\n",
"mdY": "12/02/2010\n",
"hm": "21:34\n",
}

def setup(self, value):
def setup(self, value, engine):
count_elem = 10000
data = self.objects[value] * count_elem
self.StringIO_input = StringIO(data)

def time_read_special_date(self, value):
def time_read_special_date(self, value, engine):
read_csv(
self.data(self.StringIO_input),
engine=engine,
sep=",",
header=None,
names=["Date"],
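The benchmark changes above all follow the same asv convention: `params` holds one list of values per name in `param_names`, and asv calls `setup` and every `time_*` method once per combination. A minimal runnable sketch of that convention (the class, its data, and the driver loop are illustrative, not part of the PR):

```python
from io import StringIO
from itertools import product


class ReadCSVEngineSketch:
    # Each list in `params` supplies values for the matching entry in
    # `param_names`; asv runs setup/time_* with every combination.
    params = (["c", "python"], [None, 10])
    param_names = ["engine", "skiprows"]

    def setup(self, engine, skiprows):
        self.buf = StringIO("a,b\n" + "1,2\n" * 100)

    def time_read(self, engine, skiprows):
        # Stand-in for read_csv(self.buf, engine=engine, skiprows=skiprows)
        self.buf.seek(0)
        self.buf.read()


# Emulate asv's driver: exercise every parameter combination.
bench = ReadCSVEngineSketch()
combos = list(product(*bench.params))
for engine, skiprows in combos:
    bench.setup(engine, skiprows)
    bench.time_read(engine, skiprows)
print(len(combos))  # 4 combinations
```

This is why each method in the diff gains an `engine` argument even when it does not use it directly: the signature must match `param_names`.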
25 changes: 17 additions & 8 deletions doc/source/user_guide/io.rst
@@ -160,9 +160,11 @@ dtype : Type name or dict of column -> type, default ``None``
(unsupported with ``engine='python'``). Use `str` or `object` together
with suitable ``na_values`` settings to preserve and
not interpret dtype.
engine : {``'c'``, ``'python'``}
Parser engine to use. The C engine is faster while the Python engine is
currently more feature-complete.
engine : {``'c'``, ``'pyarrow'``, ``'python'``}
Parser engine to use. In terms of performance, the pyarrow engine,
which requires ``pyarrow`` >= 0.15.0, is faster than the C engine, which
is faster than the python engine. However, the pyarrow and C engines
Review comment (Contributor): add a versionchanged tag here 1.2
are currently less feature complete than their Python counterpart.
converters : dict, default ``None``
Dict of functions for converting values in certain columns. Keys can either be
integers or column labels.
@@ -1619,11 +1621,18 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object:
Specifying the parser engine
''''''''''''''''''''''''''''

Under the hood pandas uses a fast and efficient parser implemented in C as well
as a Python implementation which is currently more feature-complete. Where
possible pandas uses the C parser (specified as ``engine='c'``), but may fall
back to Python if C-unsupported options are specified. Currently, C-unsupported
options include:
Currently, pandas supports three engines: the C engine, the python engine,
Review comment (Contributor): add a versionchanged 1.2 tag here
and an optional pyarrow engine (requires ``pyarrow`` >= 0.15). In terms of performance
the pyarrow engine is fastest, followed by the C and Python engines. However,
the pyarrow engine is much less robust than the C engine, which in turn lacks a
couple of features present in the Python parser.

Where possible pandas uses the C parser (specified as ``engine='c'``), but may fall
Review comment (Contributor): we might want to refactor this entire section to provide a more table-like comparison of all of the parsers, if you'd create an issue for this
back to Python if C-unsupported options are specified. If pyarrow unsupported options are
specified while using ``engine='pyarrow'``, the parser will error out
(a full list of unsupported options is available at ``pandas.io.parsers._pyarrow_unsupported``).

Currently, C-unsupported options include:

* ``sep`` other than a single character (e.g. regex separators)
* ``skipfooter``
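The selection behavior described above can be sketched in a few lines. This is a hedged example, not from the PR: the pyarrow branch only succeeds on a pandas build where the engine has landed and pyarrow is installed; otherwise the `except` clause falls back to the C parser.

```python
import io

import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6\n"

# The C engine is the default where possible.
df_c = pd.read_csv(io.StringIO(data), engine="c")

# The pyarrow engine must be requested explicitly; per the docs above it
# errors out (rather than silently falling back) when pyarrow is missing
# or an unsupported option is passed.
try:
    df_arrow = pd.read_csv(io.StringIO(data), engine="pyarrow")
except (ImportError, ValueError):
    df_arrow = pd.read_csv(io.StringIO(data), engine="c")

print(df_arrow.shape)  # (2, 3)
```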
9 changes: 9 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -254,6 +254,7 @@ If needed you can adjust the bins with the argument ``offset`` (a Timedelta) tha

For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.


fsspec now used for filesystem handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -271,6 +272,14 @@ change, as ``fsspec`` will still bring in the same packages as before.

.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/


read_csv() now accepts pyarrow as an engine
Review comment (Contributor): move to 1.2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:func:`pandas.read_csv` now accepts engine="pyarrow" as an argument, allowing for faster csv parsing on multicore machines
with pyarrow>=0.15 installed. See the :doc:`I/O docs </user_guide/io>` for more info. (:issue:`23697`)


.. _whatsnew_110.enhancements.other:

Other enhancements
25 changes: 19 additions & 6 deletions pandas/compat/_optional.py
@@ -1,6 +1,8 @@
import distutils.version
import importlib
import sys
import types
from typing import Optional
import warnings

# Update install.rst when updating versions!
@@ -46,7 +48,11 @@ def _get_version(module: types.ModuleType) -> str:


def import_optional_dependency(
name: str, extra: str = "", raise_on_missing: bool = True, on_version: str = "raise"
name: str,
extra: str = "",
raise_on_missing: bool = True,
on_version: str = "raise",
min_version: Optional[str] = None,
):
"""
Import an optional dependency.
@@ -58,8 +64,7 @@ def import_optional_dependency(
Parameters
----------
name : str
The module name. This should be top-level only, so that the
version may be checked.
The module name.
extra : str
Additional text to include in the ImportError message.
raise_on_missing : bool, default True
@@ -73,6 +78,8 @@
* ignore: Return the module, even if the version is too old.
It's expected that users validate the version locally when
using ``on_version="ignore"`` (see. ``io/html.py``)
min_version : str, optional
Specify a minimum version that differs from the global pandas minimum version for this module.

Returns
-------
@@ -93,10 +100,16 @@
raise ImportError(msg) from None
else:
return None

minimum_version = VERSIONS.get(name)
# Handle submodules: if we have submodule, grab parent module from sys.modules
Review comment (Contributor): why is all this needed?

Reply (Member): This has been answered before: #31817 (comment) (and the above code comment was added based on your comment). It's to 1) import a submodule (pyarrow.csv in this case) and 2) support passing a different version than the one in our global minimum versions dictionary.

Now, I suppose that the submodule importing is not necessarily needed. Right now this PR does:

    csv = import_optional_dependency("pyarrow.csv", min_version="0.15")

but I suppose this could also be:

    import_optional_dependency("pyarrow", min_version="0.15")
    from pyarrow import csv

And then this additional code to directly import a submodule with import_optional_dependency is not needed (although where it is used, I think it is a bit cleaner to be able to directly import the submodule).

Reply (Member Author, @lithomas1, Oct 28, 2020): @jorisvandenbossche importing as a submodule is required; you can't access the csv module by doing pyarrow.csv as far as I remember, and if you do import pyarrow.csv, then it won't validate the version and will not error for pyarrow<0.15.
parent = name.split(".")[0]
if parent != name:
name = parent
module_to_get = sys.modules[name]
else:
module_to_get = module
minimum_version = min_version if min_version is not None else VERSIONS.get(name)
if minimum_version:
version = _get_version(module)
version = _get_version(module_to_get)
if distutils.version.LooseVersion(version) < minimum_version:
assert on_version in {"warn", "raise", "ignore"}
msg = (