New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read excel nrows #16672

Closed
wants to merge 5 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@alysivji
Contributor

alysivji commented Jun 11, 2017

  • closes #16645
  • tests added / passed
  • passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
  • whatsnew entry
@@ -38,6 +38,7 @@ Other Enhancements
- :func:`read_feather` has gained the ``nthreads`` parameter for multi-threaded operations (:issue:`16359`)
- :func:`DataFrame.clip()` and :func: `Series.cip()` have gained an inplace argument. (:issue: `15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when margins=True. (:issue:`15972`)
- ``pd.read_excel()`` has a ``nrows`` parameter (:issue:`16645`)

This comment has been minimized.

@gfyoung

gfyoung Jun 11, 2017

Member

How about:

pd.read_excel() has gained the nrows parameter (...)

This comment has been minimized.

@jreback

jreback Jun 14, 2017

Contributor

you can make another entry (with this PR number), in api_breaking section that the kwargs are re-aranged to match pd.read_csv

use :func:`read_excel`

true_values=None, false_values=None, engine=None,
squeeze=False, **kwds):
def read_excel(io, sheet_name=0, header=0, skiprows=None, nrows=None,
skip_footer=0, index_col=None, names=None, parse_cols=None,

This comment has been minimized.

@jreback

jreback Jun 11, 2017

Contributor

we normally don't like to shuffle parameters around in kwargs. Please align the ordering of these params as much as possible with how read_csv does it (obviously only include the current parameters).

This comment has been minimized.

@alysivji

alysivji Jun 11, 2017

Contributor

Sure, not a problem. I was using read_csv as a guide and saw that nrows was directly after skiprows, hence the change.

I will go ahead and line up the read_excel kwargs as close as possible to the read_csv kwargs. I see 3or4 out of order with just a quick glance.

This comment has been minimized.

@jreback

jreback Jun 11, 2017

Contributor

ok great. prob need a slightly expanded whatsnew note to tell about this

This comment has been minimized.

@alysivji

alysivji Jun 13, 2017

Contributor

I rearranged the kwargs and also updated docstrings to match parameter order (one of the things that always bugs me). Should I also update all internal function kwargs in excel.py to match the external API?

I'm adding the following note to the Other Enhancements section (just wanted to make sure it was the right spot...):
- Rearranged the order of keyword arguments in :func:`read_excel()` to align with :func:`read_csv()`

Thanks!

This comment has been minimized.

@jreback

jreback Jun 14, 2017

Contributor

yes I would have the internal API match the external

expected = expected[:num_rows_to_pull]
tm.assert_frame_equal(actual, expected)
with pytest.raises(ValueError):

This comment has been minimized.

@jreback

jreback Jun 11, 2017

Contributor

@gfyoung don't we have an issue about nrows validation in the parser?

This comment has been minimized.

@gfyoung

gfyoung Jun 11, 2017

Member

We do, but I think you can circumvent the check via non read_csv and read_table calls (the check occurs in _read, which does not get hit on read_excel), which is annoying.

This comment has been minimized.

@gfyoung

gfyoung Jun 11, 2017

Member

@alysivji : Put this as a separate test, and use tm.assert_raises_regex to check the specific error message raised in the ValueError (we want it to be that validation failed).

expected = pd.read_excel(os.path.join(self.dirpath,
'test1' + self.ext))
expected = expected[:num_rows_to_pull]
tm.assert_frame_equal(actual, expected)

This comment has been minimized.

@jreback

jreback Jun 11, 2017

Contributor

try to pull more rows than exist in the file as well.

@@ -999,6 +999,8 @@ def _failover_to_python(self):
def read(self, nrows=None):
if nrows is not None:
nrows = _validate_integer('nrows', nrows)

This comment has been minimized.

@gfyoung

gfyoung Jun 11, 2017

Member

Good catch (see my comment here. However, instead of littering our code with duplicate checks, here's what I think is best:

In your modified parsers.py, there should now be two places where nrows = _validate_integer(...) is called. Delete both of those.

Locate the read method for the TextFileReader in that same file, and add this validation check right before the if nrows is not None. That should work.

This comment has been minimized.

@alysivji

alysivji Jun 11, 2017

Contributor

I originally didn't have this line there, but then I was getting the following error:
TypeError: '>=' not supported between instances of 'int' and 'str' vs
ValueError: 'nrows' must be an integer >=0

This comment has been minimized.

@jreback

jreback Jun 14, 2017

Contributor

yes this change should be done in the parser itself. See if you can come up with an example that ONLY used pd.read_csv directly.

This comment has been minimized.

@alysivji

alysivji Jun 14, 2017

Contributor

Where should I put this test? There are a bunch in `tests/io/parser', but nothing for read_csv directly.

@@ -82,6 +82,8 @@
Rows to skip at the beginning (0-indexed)
skip_footer : int, default 0
Rows at the end to skip (0-indexed)
nrows : int, default None
Number of rows to parse

This comment has been minimized.

@jreback

jreback Jun 14, 2017

Contributor

versionadded 0.21.0 tag

true_values=None, false_values=None, engine=None,
squeeze=False, **kwds):
def read_excel(io, sheet_name=0, header=0, skiprows=None, nrows=None,
skip_footer=0, index_col=None, names=None, parse_cols=None,

This comment has been minimized.

@jreback

jreback Jun 14, 2017

Contributor

yes I would have the internal API match the external

@@ -38,6 +38,7 @@ Other Enhancements
- :func:`read_feather` has gained the ``nthreads`` parameter for multi-threaded operations (:issue:`16359`)
- :func:`DataFrame.clip()` and :func: `Series.cip()` have gained an inplace argument. (:issue: `15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when margins=True. (:issue:`15972`)
- ``pd.read_excel()`` has a ``nrows`` parameter (:issue:`16645`)

This comment has been minimized.

@jreback

jreback Jun 14, 2017

Contributor

you can make another entry (with this PR number), in api_breaking section that the kwargs are re-aranged to match pd.read_csv

use :func:`read_excel`

@@ -999,6 +999,8 @@ def _failover_to_python(self):
def read(self, nrows=None):
if nrows is not None:
nrows = _validate_integer('nrows', nrows)

This comment has been minimized.

@jreback

jreback Jun 14, 2017

Contributor

yes this change should be done in the parser itself. See if you can come up with an example that ONLY used pd.read_csv directly.

@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Jun 30, 2017

@jreback

This comment has been minimized.

Contributor

jreback commented Jul 19, 2017

can you rebase/update

@pep8speaks

This comment has been minimized.

pep8speaks commented Jul 19, 2017

Hello @alysivji! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 19, 2017 at 12:56 Hours UTC
@codecov

This comment has been minimized.

codecov bot commented Jul 19, 2017

Codecov Report

Merging #16672 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16672      +/-   ##
==========================================
+ Coverage   90.93%   90.93%   +<.01%     
==========================================
  Files         161      161              
  Lines       49269    49270       +1     
==========================================
+ Hits        44802    44803       +1     
  Misses       4467     4467
Flag Coverage Δ
#multiple 88.69% <100%> (ø) ⬆️
#single 40.22% <60%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/excel.py 80.55% <100%> (ø) ⬆️
pandas/io/parsers.py 95.43% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 18a428d...f1a6740. Read the comment docs.

@codecov

This comment has been minimized.

codecov bot commented Jul 19, 2017

Codecov Report

Merging #16672 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16672      +/-   ##
==========================================
+ Coverage   90.93%   90.93%   +<.01%     
==========================================
  Files         161      161              
  Lines       49269    49270       +1     
==========================================
+ Hits        44802    44803       +1     
  Misses       4467     4467
Flag Coverage Δ
#multiple 88.69% <100%> (ø) ⬆️
#single 40.22% <60%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.43% <100%> (ø) ⬆️
pandas/io/excel.py 80.55% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 18a428d...ef52114. Read the comment docs.

@jreback

This comment has been minimized.

Contributor

jreback commented Sep 10, 2017

can you rebase / update

@alysivji

This comment has been minimized.

Contributor

alysivji commented Sep 11, 2017

Sure, I got some time this week. Will take a look at it.

@jreback jreback removed this from the 0.21.0 milestone Sep 23, 2017

@jreback

This comment has been minimized.

Contributor

jreback commented Sep 23, 2017

pls rebase

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 10, 2017

closing as stale. if you'd like to continue, pls ping.

@jreback jreback closed this Nov 10, 2017

@alysivji alysivji referenced this pull request Nov 26, 2017

Merged

Add nrows parameter to pandas.read_excel() #18507

4 of 4 tasks complete
@alysivji

This comment has been minimized.

Contributor

alysivji commented Nov 26, 2017

I got some time. Picked this back up. Sent another PR (18507)

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment