ENH: add arrow engine to read_csv #31817

Closed
wants to merge 51 commits into from
Changes from 6 commits
Commits
51 commits
f22ff46
add arrow engine to read_csv
lithomas1 Feb 9, 2020
8ae43e4
fix failing test
lithomas1 Feb 9, 2020
09074df
formatting and revert unnecessary change
lithomas1 Feb 9, 2020
6be276d
remove bloat and more formatting changes
lithomas1 Feb 9, 2020
df4fa7e
Whatsnew
lithomas1 Feb 9, 2020
9cd9a6f
Merge remote-tracking branch 'upstream/master' into add-arrow-engine
lithomas1 Feb 9, 2020
ecaf3fd
Get tests up and running
lithomas1 Feb 10, 2020
b3c3287
Some fixes
lithomas1 Feb 10, 2020
474baf4
Add asvs and xfail some tests
lithomas1 Feb 11, 2020
2cd9937
address comments
lithomas1 Feb 20, 2020
48ff255
Merge branch 'master' into add-arrow-engine
lithomas1 Feb 20, 2020
3d15a56
fix typo
lithomas1 Feb 20, 2020
c969373
Merge branch 'add-arrow-engine' of github-other.com:lithomas1/pandas …
lithomas1 Feb 20, 2020
98aa134
some fixes
lithomas1 Feb 29, 2020
b9c6d2c
Fix bug
lithomas1 Apr 5, 2020
67c5db6
Fix merge conflicts
lithomas1 Apr 5, 2020
7f891a6
New benchmark and fix more tests
lithomas1 Apr 10, 2020
11fc737
Merge branch 'master' into add-arrow-engine
lithomas1 Apr 10, 2020
23425f7
More cleanups
lithomas1 Apr 10, 2020
d9b7a1f
Merge master
lithomas1 Apr 10, 2020
b8adf3c
Merge branch 'add-arrow-engine' of github-other.com:lithomas1/pandas …
lithomas1 Apr 11, 2020
01c0394
Formatting fixes and typo correction
lithomas1 Apr 11, 2020
ba5620f
skip pyarrow tests if not installed
lithomas1 Apr 12, 2020
2570c82
Address comments
lithomas1 Apr 12, 2020
b3a1f66
Get some more tests to pass
lithomas1 Apr 14, 2020
d46ceed
Fix some bugs and cleanups
lithomas1 Apr 17, 2020
d67925c
Merge branch 'master' into add-arrow-engine
lithomas1 Apr 17, 2020
6378459
Perform version checks for submodule imports too
lithomas1 May 20, 2020
9d64882
Refresh with newer pyarrow
lithomas1 May 20, 2020
852ecf9
Merge branch 'master' into add-arrow-engine
lithomas1 May 20, 2020
93382b4
Start xfailing tests
lithomas1 May 21, 2020
f1bb4e2
Get all tests to run & some fixes
lithomas1 May 27, 2020
14c13ab
Merge branch 'master' into add-arrow-engine
lithomas1 May 27, 2020
7876b4e
Lint and CI
lithomas1 May 29, 2020
4426642
Merge branch 'master' into add-arrow-engine
lithomas1 May 29, 2020
008acab
parse_dates support and fixups of some tests
lithomas1 Jun 3, 2020
2dddae7
Date parsing fixes and address comments
lithomas1 Jun 13, 2020
261ef6a
Merge branch 'master' into add-arrow-engine
lithomas1 Jun 13, 2020
88e200a
Clean/Address comments/Update docs
lithomas1 Jun 29, 2020
bf063ab
Merge branch 'master' into add-arrow-engine
lithomas1 Jun 29, 2020
ede2799
Fix typo
lithomas1 Jun 29, 2020
e8eff08
Fix doc failures
lithomas1 Jul 8, 2020
87cfcf5
Merge remote-tracking branch 'upstream/master' into add-arrow-engine
simonjayhawkins Oct 22, 2020
55139ee
wip
simonjayhawkins Oct 22, 2020
c1aeecf
more xfails and skips
simonjayhawkins Oct 22, 2020
62fc9d6
Merge branch 'master' into add-arrow-engine
lithomas1 Oct 28, 2020
b53a620
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 28, 2020
f13113d
Fix typos
lithomas1 Oct 28, 2020
f9ce2e4
Doc fixes and more typo fixes
lithomas1 Oct 28, 2020
4158d6a
Green?
lithomas1 Nov 2, 2020
d34e75f
Merge branch 'master' into add-arrow-engine
lithomas1 Nov 17, 2020
4 changes: 3 additions & 1 deletion doc/source/whatsnew/v1.1.0.rst
@@ -42,7 +42,9 @@ Other enhancements
^^^^^^^^^^^^^^^^^^

- :class:`Styler` may now render CSS more efficiently where multiple cells have the same styling (:issue:`30876`)
-
- :func:`pandas.read_csv` now accepts engine="arrow" as an argument, allowing for faster csv parsing
if pyarrow>0.11 is installed. However, the pyarrow engine is less feature-complete than its "c" or
"python" counterparts.
-

.. ---------------------------------------------------------------------------
123 changes: 92 additions & 31 deletions pandas/io/parsers.py
@@ -20,6 +20,7 @@
from pandas._libs.parsers import STR_NA_VALUES
from pandas._libs.tslibs import parsing
from pandas._typing import FilePathOrBuffer
from pandas.compat._optional import import_optional_dependency
from pandas.errors import (
AbstractMethodError,
EmptyDataError,
@@ -165,9 +166,10 @@
to preserve and not interpret dtype.
If converters are specified, they will be applied INSTEAD
of dtype conversion.
engine : {{'c', 'python'}}, optional
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
engine : {{'c', 'python', 'arrow'}}, optional
Parser engine to use. The C and arrow engines are faster, while the python engine is
currently more feature-complete. The arrow engine requires ``pyarrow``
as a dependency however.
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels.
@@ -520,6 +522,7 @@ def _read(filepath_or_buffer: FilePathOrBuffer, kwds):
_fwf_defaults = {"colspecs": "infer", "infer_nrows": 100, "widths": None}

_c_unsupported = {"skipfooter"}
_arrow_unsupported = {"skipfooter", "low_memory", "float_precision"}
_python_unsupported = {"low_memory", "float_precision"}

_deprecated_defaults: Dict[str, Any] = {}
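
The per-engine sets in this hunk list options that a given engine cannot honor. As a rough illustration of how such sets might be consulted, here is a minimal sketch. The helper `find_unsupported` and the `defaults` mapping are hypothetical names introduced for this example only; they are not part of the diff, and the real check in `pandas/io/parsers.py` may differ.

```python
# Option sets copied from the hunk above.
_c_unsupported = {"skipfooter"}
_arrow_unsupported = {"skipfooter", "low_memory", "float_precision"}
_python_unsupported = {"low_memory", "float_precision"}

# Hypothetical lookup table, introduced for this sketch only.
_UNSUPPORTED_BY_ENGINE = {
    "c": _c_unsupported,
    "arrow": _arrow_unsupported,
    "python": _python_unsupported,
}


def find_unsupported(engine, options, defaults):
    """Hypothetical helper: return the sorted names of options that were set
    to a non-default value but are not supported by ``engine``."""
    unsupported = _UNSUPPORTED_BY_ENGINE[engine]
    return sorted(
        name
        for name in unsupported
        if name in options and options[name] != defaults.get(name)
    )


# Illustrative defaults, assumed for this sketch only.
defaults = {"skipfooter": 0, "low_memory": True, "float_precision": None}

print(find_unsupported("arrow", {"skipfooter": 2, "float_precision": "high"}, defaults))
print(find_unsupported("c", {"low_memory": False}, defaults))
```

A caller could then warn, raise, or fall back to another engine depending on what the helper reports.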
@@ -945,16 +948,16 @@ def _clean_options(self, options, engine):
delim_whitespace = options["delim_whitespace"]

# C engine not supported yet
if engine == "c":
if engine == "c" or engine == "arrow":
if options["skipfooter"] > 0:
fallback_reason = "the 'c' engine does not support skipfooter"
fallback_reason = f"the {engine} engine does not support skipfooter"
engine = "python"

encoding = sys.getfilesystemencoding() or "utf-8"
if sep is None and not delim_whitespace:
if engine == "c":
if engine == "c" or engine == "arrow":
fallback_reason = (
"the 'c' engine does not support "
f"the {engine} engine does not support "
"sep=None with delim_whitespace=False"
)
engine = "python"
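
The fallback behaviour in this hunk can be read in isolation. The following is a minimal sketch, assuming a hypothetical free function `resolve_engine` that stands in for this portion of `_clean_options`; the function name, signature, and return shape are illustrative and not taken from the diff.

```python
def resolve_engine(engine, skipfooter=0, sep=",", delim_whitespace=False):
    """Hypothetical stand-in for the fallback portion of ``_clean_options``.

    Returns ``(engine, fallback_reason)``; ``fallback_reason`` is ``None``
    when the requested engine can be used as-is.
    """
    fallback_reason = None

    # Neither the C engine nor the arrow engine handles skipfooter.
    if engine in ("c", "arrow"):
        if skipfooter > 0:
            fallback_reason = f"the {engine} engine does not support skipfooter"
            engine = "python"

    # Delimiter sniffing (sep=None) is only available in the python engine.
    if sep is None and not delim_whitespace:
        if engine in ("c", "arrow"):
            fallback_reason = (
                f"the {engine} engine does not support "
                "sep=None with delim_whitespace=False"
            )
            engine = "python"

    return engine, fallback_reason


print(resolve_engine("arrow", skipfooter=1))
print(resolve_engine("c"))
```

Because the engine is already `"python"` after the first fallback, the second check does not overwrite the first reason.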
@@ -1081,14 +1084,20 @@ def _clean_options(self, options, engine):
na_values, na_fvalues = _clean_na_values(na_values, keep_default_na)

# handle skiprows; this is internally handled by the
# c-engine, so only need for python parsers
# c-engine, so only need for python parser
if engine != "c":
if is_integer(skiprows):
skiprows = list(range(skiprows))
if skiprows is None:
skiprows = set()
elif not callable(skiprows):
skiprows = set(skiprows)
if engine == "arrow":
if not is_integer(skiprows) and skiprows is not None:
raise ValueError(
"skiprows argument must be an integer when using engine='arrow'"
)
else:
if is_integer(skiprows):
skiprows = list(range(skiprows))
if skiprows is None:
skiprows = set()
elif not callable(skiprows):
skiprows = set(skiprows)

# put stuff back
result["names"] = names
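
The `skiprows` handling in this hunk treats each engine differently. A minimal sketch of that branching follows, assuming a hypothetical function `normalize_skiprows` and a plain `isinstance` check in place of pandas' `is_integer`; both are assumptions made for illustration and are not part of the diff.

```python
def normalize_skiprows(skiprows, engine):
    """Hypothetical stand-in for the skiprows branch of ``_clean_options``."""
    if engine == "c":
        # The C engine handles skiprows internally; pass it through untouched.
        return skiprows

    # Assumption: a plain ``int`` check replaces pandas' ``is_integer``.
    is_int = isinstance(skiprows, int) and not isinstance(skiprows, bool)

    if engine == "arrow":
        # The arrow engine only accepts an integer row count (or None).
        if not is_int and skiprows is not None:
            raise ValueError(
                "skiprows argument must be an integer when using engine='arrow'"
            )
        return skiprows

    # Python engine: normalise to a set of row indices, or keep a callable.
    if is_int:
        skiprows = list(range(skiprows))
    if skiprows is None:
        return set()
    if not callable(skiprows):
        return set(skiprows)
    return skiprows


print(normalize_skiprows(3, "python"))
print(normalize_skiprows(2, "arrow"))
```

Passing a list such as `[0, 1]` with `engine="arrow"` raises the `ValueError` shown in the diff.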
@@ -1109,6 +1118,8 @@ def __next__(self):
def _make_engine(self, engine="c"):
if engine == "c":
self._engine = CParserWrapper(self.f, **self.options)
elif engine == "arrow":
self._engine = ArrowParserWrapper(self.f, **self.options)
else:
if engine == "python":
klass = PythonParser
@@ -1125,29 +1136,32 @@ def _failover_to_python(self):
raise AbstractMethodError(self)

def read(self, nrows=None):
nrows = _validate_integer("nrows", nrows)
ret = self._engine.read(nrows)
if isinstance(self._engine, ArrowParserWrapper):
return self._engine.read(nrows)
else:
nrows = _validate_integer("nrows", nrows)
Review comment (Contributor):

You could do this at the top of the function.

Review comment (Member):

The problem here is that nrows is not supported by pyarrow, so in principle we should only need to validate it for other engines.
Now, I suppose that before you end up here, we would already have raised an error if nrows was specified with the pyarrow engine, so it probably doesn't hurt to needlessly validate the default nrows argument in that case.

Review comment (Member Author):

Yeah, I guess I could do that, but I think my way is cleaner, since all the pyarrow code would be in the if, and the other parser code would be in the else.

ret = self._engine.read(nrows)

# May alter columns / col_dict
index, columns, col_dict = self._create_index(ret)
# May alter columns / col_dict
index, columns, col_dict = self._create_index(ret)

if index is None:
if col_dict:
# Any column is actually fine:
new_rows = len(next(iter(col_dict.values())))
index = RangeIndex(self._currow, self._currow + new_rows)
if index is None:
if col_dict:
# Any column is actually fine:
new_rows = len(next(iter(col_dict.values())))
index = RangeIndex(self._currow, self._currow + new_rows)
else:
new_rows = 0
else:
new_rows = 0
else:
new_rows = len(index)
new_rows = len(index)

df = DataFrame(col_dict, columns=columns, index=index)
df = DataFrame(col_dict, columns=columns, index=index)

self._currow += new_rows
self._currow += new_rows

if self.squeeze and len(df.columns) == 1:
return df[df.columns[0]].copy()
return df
if self.squeeze and len(df.columns) == 1:
return df[df.columns[0]].copy()
return df

def _create_index(self, ret):
index, columns, col_dict = ret
@@ -2139,6 +2153,53 @@ def _maybe_parse_dates(self, values, index, try_parse_dates=True):
return values


class ArrowParserWrapper(ParserBase):
Review comment (Contributor):

You will need to refactor this, as the current code is very different from this.

Also I really don't like doing all of this validation in a single function.

Review comment (Member):

> you will need to refactor this as the current code is very different from this.

Can you clarify a bit more what you mean? Or point to recent changes related to this?

For example also on master, the C parser is using a very similar mechanism with the CParserWrapper class.
Or do you only mean that it needs to split some validation into separate methods (as you indicate in a comment below as well, fully agreed with that).

"""

"""

def __init__(self, src, **kwds):
self.kwds = kwds
self.src = src
kwds = kwds.copy()

ParserBase.__init__(self, kwds)

# #2442
kwds["allow_leading_cols"] = self.index_col is not False

# GH20529, validate usecol arg before TextReader
self.usecols, self.usecols_dtype = _validate_usecols_arg(kwds["usecols"])

def read(self, nrows=None):
pyarrow = import_optional_dependency(
"pyarrow.csv", extra="pyarrow is required to use arrow engine"
)
nrows = _validate_integer("nrows", nrows)
table = pyarrow.read_csv(
Review comment (Contributor):

Please add line breaks between sections and comments.

self.src,
read_options=pyarrow.ReadOptions(
skip_rows=self.kwds.get("skiprows"), column_names=self.names
),
parse_options=pyarrow.ParseOptions(
delimiter=self.kwds.get("delimiter"),
quote_char=self.kwds.get("quotechar"),
),
convert_options=pyarrow.ConvertOptions(
include_columns=self.usecols, column_types=self.kwds.get("dtype")
),
)
if nrows:
table = table[:nrows]
table_width = len(table.column_names)
if self.names is None:
if self.prefix:
self.names = [f"{self.prefix}{i}" for i in range(table_width)]
if self.names:
table = table.rename_columns(self.names)
return table.to_pandas()
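
The last steps of `ArrowParserWrapper.read` (row truncation and column naming) can be rehearsed without pyarrow. The sketch below uses a list of row tuples in place of a pyarrow `Table`; the function `finalize` and its signature are hypothetical and exist only for this illustration. One detail worth noting: `if nrows:` in the diff is falsy for `nrows=0`, so that value would leave the table untruncated. The sketch uses `is not None` as one possible alternative, which is an assumption about the intended behaviour and not what the PR does.

```python
def finalize(rows, column_names, names=None, prefix=None, nrows=None):
    """Hypothetical sketch of the post-read steps, using a list of row tuples
    in place of a pyarrow Table."""
    # Truncate only when a row limit was actually given; unlike ``if nrows:``,
    # this also honours ``nrows=0``.
    if nrows is not None:
        rows = rows[:nrows]

    # Generate prefixed names only when no explicit names were supplied.
    if names is None and prefix:
        names = [f"{prefix}{i}" for i in range(len(column_names))]

    # Explicit or generated names replace the parsed header.
    if names:
        column_names = list(names)

    return column_names, rows


rows = [(1, 2), (3, 4)]
print(finalize(rows, ["a", "b"], prefix="X", nrows=1))
print(finalize(rows, ["a", "b"], nrows=0))
```

With neither `names` nor `prefix` given, the parsed header is returned unchanged.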


def TextParser(*args, **kwds):
"""
Converts lists of lists/tuples into DataFrames with proper type inference