Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Delegate more of Excel parsing to CSV #23544

Merged
merged 1 commit into from
Nov 11, 2018

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Nov 7, 2018

The idea is that we read the Excel file, get the data, and let the TextParser handle reading and parsing, or at least most of it. We shouldn't be doing a lot of work that is already defined in parsers.py

In doing so, identified several bugs:

  • index_col=None was not being respected

  • usecols behavior was inconsistent with that of read_csv for list of strings and callable inputs

  • usecols was not being validated as proper Excel column names when passed as a string.

Closes #18273.
Closes #20480.

With regards to the latter issue, I believe this PR is a cleaner implementation that integrates the new usecols functionality without having to introduce a new parameter.

@pep8speaks
Copy link

Hello @gfyoung! Thanks for submitting the PR.

@jreback
Copy link
Contributor

jreback commented Nov 7, 2018

wow nice cleanup!

linting

check import format using isort
ERROR: /home/travis/build/pandas-dev/pandas/pandas/io/excel.py Imports are incorrectly sorted.

@codecov
Copy link

codecov bot commented Nov 7, 2018

Codecov Report

Merging #23544 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23544      +/-   ##
==========================================
- Coverage   92.24%   92.23%   -0.01%     
==========================================
  Files         161      161              
  Lines       51278    51278              
==========================================
- Hits        47299    47298       -1     
- Misses       3979     3980       +1
Flag Coverage Δ
#multiple 90.62% <ø> (-0.01%) ⬇️
#single 42.28% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.55% <0%> (-0.07%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2cea659...a36aa76. Read the comment docs.

@gfyoung
Copy link
Member Author

gfyoung commented Nov 7, 2018

@jreback : Fixed the import error, so ready for review.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

pandas/io/excel.py Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor comments.

doc/source/io.rst Show resolved Hide resolved
pandas/io/excel.py Show resolved Hide resolved
pandas/io/excel.py Outdated Show resolved Hide resolved
pandas/io/excel.py Outdated Show resolved Hide resolved
pandas/tests/io/test_excel.py Show resolved Hide resolved
pandas/tests/io/test_excel.py Show resolved Hide resolved
@jreback jreback added this to the 0.24.0 milestone Nov 8, 2018
@gfyoung gfyoung force-pushed the excel-csv-standardize branch 2 times, most recently from ddbc258 to 487a336 Compare November 8, 2018 18:44
@gfyoung
Copy link
Member Author

gfyoung commented Nov 8, 2018

@jreback : Addressed comments, and all is still green. PTAL.

doc/source/io.rst Show resolved Hide resolved
pandas/io/excel.py Show resolved Hide resolved
@gfyoung gfyoung force-pushed the excel-csv-standardize branch 2 times, most recently from 113bc37 to 0e12260 Compare November 10, 2018 23:40
@gfyoung
Copy link
Member Author

gfyoung commented Nov 10, 2018

@jreback : Need to rebase this onto master because there are several places where tm.assert_raises_regex is used instead of pyest.raises.

@gfyoung
Copy link
Member Author

gfyoung commented Nov 11, 2018

@jreback : Comments addressed, PR rebased, and all is green. PTAL.

The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
@jreback jreback merged commit da23030 into pandas-dev:master Nov 11, 2018
thoo added a commit to thoo/pandas that referenced this pull request Nov 11, 2018
* upstream/master:
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)
  DOC: Fixes to docstring to add validation to CI (pandas-dev#23560)
  DOC: Remove incorrect periods at the end of parameter types (pandas-dev#23600)
  MAINT: tm.assert_raises_regex --> pytest.raises (pandas-dev#23592)
  DOC: Updating Series.resample and DataFrame.resample docstrings (pandas-dev#23197)
@gfyoung gfyoung deleted the excel-csv-standardize branch November 11, 2018 22:36
thoo added a commit to thoo/pandas that referenced this pull request Nov 12, 2018
…fixed

* upstream/master:
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 12, 2018
gfyoung added a commit to forking-repos/pandas that referenced this pull request Nov 12, 2018
jreback pushed a commit that referenced this pull request Nov 12, 2018
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
The idea is that we read the Excel file,
get the data, and then let the TextParser
handle the reading and parsing.

We shouldn't be doing a lot of work that
is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected

* usecols behavior was inconsistent with
that of read_csv for list of strings and
callable inputs

* usecols was not being validated as proper
Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants