Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: inconsistent naming convention for read_excel column selection (#4988) #16488

Closed
wants to merge 67 commits into from
Closed

Conversation

abarber4gh
Copy link
Contributor

@abarber4gh abarber4gh commented May 24, 2017

  • update API to use ‘usecols’ instead of ‘parse_cols’. still functionally the same as ‘parse_col’,

  • added test cases for ‘usecols’, added assert_produces_warning(FutureWarning) to other test
    cases that use ‘parse_cols’.

  • refactor column use column parsing to only occur once per sheet.

  • updated whats new with deprecated parse_col argument.

  • closes ENH: inconsistent naming convention for read_csv and read_excel column selection #4988

  • tests added / passed

  • passes git diff upstream/master --name-only -- '*.py' | flake8 --diff

  • whatsnew entry

  • packers.* BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@chris-b1
Copy link
Contributor

FYI, just discussing a bug related to usecols #15316, might be easiest to work that change into your refactoring, or can be separate, cc @stanleyguan

@@ -35,6 +35,7 @@ Other Enhancements
- ``RangeIndex.append`` now returns a ``RangeIndex`` object when possible (:issue:`16212`)
- :func:`to_pickle` has gained a protocol parameter (:issue:`16252`). By default, this parameter is set to `HIGHEST_PROTOCOL <https://docs.python.org/3/library/pickle.html#data-stream-format>`__
- :func:`api.types.infer_dtype` now infers decimals. (:issue: `15690`)
- :func:`read_excel` now allows a column character list (E.G. ['A', 'C', 'D']) with the ``usecols`` parameter (:issue:`4988`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are listed this in deprecations so this is unecessary

@@ -60,6 +61,7 @@ Other API Changes
Deprecations
~~~~~~~~~~~~
- :func:`read_excel()` has deprecated ``sheetname`` in favor of ``sheet_name`` for consistency with to_excel() (:issue:`10559`).
- :func:`read_excel()` has deprecated ``parse_cols`` in favor of ``usecols`` for consistency with other read_ functions (:issue:`4988`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.read_* functions

* If string then indicates comma separated list of column names and
column ranges (e.g. "A:E" or "A,C,E:F")
parse_cols : int or list, default None
.. deprecated:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, didn't know sphinx had this deprecated tag. @TomAugspurger @jorisvandenbossche should maybe change other deprecations to use this if it looks nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appears as:

parse_cols : int or list, default None
Deprecated since version 0.21.0: Use usecols instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger if we like this (and lgtm), let's make an issue to updated all DEPRECATED in codebase with this directive

dfref = self.get_csv_refdf('test1')
dfref = dfref.reindex(columns=['A', 'B', 'C'])
df1 = self.get_exceldf('test1', 'Sheet1', index_col=0, usecols=3)
df2 = self.get_exceldf('test1', 'Sheet2', skiprows=[1], index_col=0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you need to change ALL tests to use the new one (usecols), except for a single test to actually hit the deprecation.


tm.assert_frame_equal(df1, dfref, check_names=False)
tm.assert_frame_equal(df2, dfref, check_names=False) # backward compat

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these new tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, new tests for the new functionality.

@jreback jreback added Deprecate Functionality to remove in pandas IO Excel read_excel, to_excel labels May 24, 2017
@codecov
Copy link

codecov bot commented May 24, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@b0a51df). Click here to learn what that means.
The diff coverage is 38.88%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #16488   +/-   ##
=========================================
  Coverage          ?    90.4%           
=========================================
  Files             ?      161           
  Lines             ?    51038           
  Branches          ?        0           
=========================================
  Hits              ?    46139           
  Misses            ?     4899           
  Partials          ?        0
Flag Coverage Δ
#multiple 88.24% <38.88%> (?)
#single 40.16% <11.11%> (?)
Impacted Files Coverage Δ
pandas/io/excel.py 62.29% <38.88%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0a51df...5b3693d. Read the comment docs.

@codecov
Copy link

codecov bot commented May 24, 2017

Codecov Report

Merging #16488 into master will decrease coverage by 0.36%.
The diff coverage is 45.45%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16488      +/-   ##
==========================================
- Coverage   90.79%   90.43%   -0.37%     
==========================================
  Files         161      161              
  Lines       51063    51046      -17     
==========================================
- Hits        46363    46162     -201     
- Misses       4700     4884     +184
Flag Coverage Δ
#multiple 88.27% <45.45%> (-0.37%) ⬇️
#single 40.16% <18.18%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/excel.py 62.37% <45.45%> (-18.27%) ⬇️
pandas/io/formats/excel.py 74.24% <0%> (-22.41%) ⬇️
pandas/conftest.py 95.83% <0%> (-0.6%) ⬇️
pandas/io/parsers.py 95.33% <0%> (-0.33%) ⬇️
pandas/util/testing.py 80.79% <0%> (-0.2%) ⬇️
pandas/core/series.py 94.71% <0%> (-0.19%) ⬇️
pandas/core/generic.py 92.16% <0%> (-0.1%) ⬇️
pandas/core/resample.py 96.08% <0%> (-0.02%) ⬇️
pandas/core/reshape/pivot.py 95.08% <0%> (ø) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0d9ee0...03593a7. Read the comment docs.

* If string then indicates comma separated list of column names and
column ranges (e.g. "A:E" or "A,C,E:F")
parse_cols : int or list, default None
.. deprecated:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger if we like this (and lgtm), let's make an issue to updated all DEPRECATED in codebase with this directive

@@ -312,11 +327,6 @@ def _range2cols(areas):
>>> _range2cols('A,C,Z:AB')
[0, 2, 25, 26, 27]
"""
def _excel2num(x):
"Convert Excel column name like 'AB' to 0-based column index"
return reduce(lambda s, a: s * 26 + ord(a) - ord('A') + 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are changing this code? does this involve the deprecation somehow? if you are cleaning/fixing, pls do in another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code changed to allow passing list of strings, similar to other read_* functions, and keep the read_excel "superpower" to pass a string that is parsed as mentioned in the original issue (#4988, point 3).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and that's fine, but needs to be in a separate PR from the deprecation change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, opened #16510 for this functionality. will re-submit with only the argument change (parse_cols -> usecols).


def test_parse_cols_str(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave the original tests structure (sure you can change the name to conform), but don't change the tests (in THIS PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are back to original but with changed function & kwarg names.

@TomAugspurger
Copy link
Contributor

@abarber4gh can you rebase, now that the excel tests are running?

- removed usecols mention in Other Enhancments section,
remains in Deprecations.
- removed test_parse_* test methods in favor of test_usecols_* methods.
- changed parse_cols to usecols in test_read_one_empty_col_* instead
of catching warning.
…n selection (#4988)

update API to use ‘usecols’ instead of ‘parse_cols’.  still functionally the same as ‘parse_col’,
added test cases for ‘usecols’, added assert_produces_warning(FutureWarning) to other test
cases that use ‘parse_cols’.

refactor column use column parsing to only occur once per sheet.

updated whats new with deprecated parse_col argument and other enhancements
to usecols functionality.
update documentation to show new usecols functionality in read_excel().
add `check_stacklevel=False` to `test_excel_oldindex_format()`
@codecov
Copy link

codecov bot commented May 26, 2017

Codecov Report

Merging #16488 into master will increase coverage by 0.14%.
The diff coverage is 98.4%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16488      +/-   ##
==========================================
+ Coverage   90.79%   90.93%   +0.14%     
==========================================
  Files         161      161              
  Lines       51063    49267    -1796     
==========================================
- Hits        46363    44802    -1561     
+ Misses       4700     4465     -235
Flag Coverage Δ
#multiple 88.69% <96%> (+0.06%) ⬆️
#single 40.22% <27.2%> (+0.07%) ⬆️
Impacted Files Coverage Δ
pandas/core/reshape/pivot.py 95.08% <ø> (ø) ⬆️
pandas/core/reshape/reshape.py 99.28% <ø> (ø) ⬆️
pandas/core/reshape/concat.py 97.62% <ø> (ø) ⬆️
pandas/core/internals.py 93.43% <ø> (-0.01%) ⬇️
pandas/core/indexes/period.py 92.74% <ø> (ø) ⬆️
pandas/tseries/offsets.py 97.12% <ø> (-0.01%) ⬇️
pandas/io/common.py 69.91% <ø> (ø) ⬆️
pandas/util/testing.py 100% <ø> (+19.01%) ⬆️
pandas/core/dtypes/cast.py 86.89% <0%> (ø) ⬆️
pandas/compat/pickle_compat.py 69.51% <100%> (ø) ⬆️
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0d9ee0...a525222. Read the comment docs.

TomAugspurger and others added 26 commits June 4, 2017 05:39
* PERF: vectorize _interp_limit

* CLN: remove old implementation

* fixup! CLN: remove old implementation
* BUG: Handle numpy strings in index names in HDF5 #13492

* REF: refactor to _ensure_str
…ches (#16460)

* gh-14671 Check if usecols with type string contains a subset of names, if not throws an error

* tests added for gh-14671, expected behavior of simultaneous use of usecols and names unclear so these tests are commented out

* Review comments
Splits extra information about the license and copyright holders to
AUTHORS.md.
* COMPAT: numpy 1.13 test compat

* CI: fix doc build to 1.12
- removed usecols mention in Other Enhancments section,
remains in Deprecations.
- removed test_parse_* test methods in favor of test_usecols_* methods.
- changed parse_cols to usecols in test_read_one_empty_col_* instead
of catching warning.
…n selection (#4988)

update API to use ‘usecols’ instead of ‘parse_cols’.  still functionally the same as ‘parse_col’,
added test cases for ‘usecols’, added assert_produces_warning(FutureWarning) to other test
cases that use ‘parse_cols’.

refactor column use column parsing to only occur once per sheet.

updated whats new with deprecated parse_col argument and other enhancements
to usecols functionality.
update documentation to show new usecols functionality in read_excel().
add `check_stacklevel=False` to `test_excel_oldindex_format()`
…o issue#4988

* 'issue#4988' of https://github.com/abarber4gh/pandas:
  add `deprecate_kwarg` from `_decorators` add `check_stacklevel=False` to `test_excel_oldindex_format()`
  removed excess blank line.
  change parse_cols to usecols
  change tests keyword from parse_cols to usecol.
  no message
  ENH: inconsistent naming convention for read_csv and read_excel column selection (#4988)
  implement changes request in PR#16488 - removed usecols mention in Other Enhancments section, remains in Deprecations. -      removed test_parse_* test methods in favor of test_usecols_* methods. - changed parse_cols to usecols in test_read_one_empty_col_* instead of catching warning.
rebased
@jreback
Copy link
Contributor

jreback commented Jul 19, 2017

can you rebase / update according to comments

@jreback
Copy link
Contributor

jreback commented Sep 10, 2017

closing as stale. pls ping to reopen if you want to continue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: inconsistent naming convention for read_csv and read_excel column selection