Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: XLSB support #29836

Merged
merged 31 commits into from
Jan 20, 2020
Merged

ENH: XLSB support #29836

merged 31 commits into from
Jan 20, 2020

Conversation

Rik-de-Kort
Copy link
Contributor

@Rik-de-Kort Rik-de-Kort commented Nov 25, 2019

Hey all, a moderately commonly requested feature is xlsb support. I thought I'd go ahead and make a PR for it, based on Pyxlsb. The library isn't very full-featured: datetimes are loaded in as floats without any indication they're datetimes. Would that be grounds for rejection?

Alternative would be to implement xlsb support in Openpyxl which looks like it will take a long time for someone not familiar with the file formats (as I am).

@jbrockmendel
Copy link
Member

to get the CI to pass, can you run isort pandas/io/excel/_pyxlsb.py

@Rik-de-Kort
Copy link
Contributor Author

@jbrockmendel thanks for the tip! Did it alphabetically but apparently there's some kind of convention.

The documentation not compiling looks like a fun bug. Will try and see if I can reproduce and squash it tomorrow.

@jbrockmendel
Copy link
Member

Did it alphabetically but apparently there's some kind of convention.

Yah, isort has its own conventions and then we have some configuration in setup.cfg, not sure what the problem was here, but looks resolved now, thanks.

The documentation not compiling looks like a fun bug. Will try and see if I can reproduce and squash it tomorrow.

That'd be great, its affecting a lot of PRs right now

@Rik-de-Kort
Copy link
Contributor Author

Haven't been able to reproduce the bug on my computer, but I did do the necessary modifications to the test suite. Currently it's failing a lot of tests because ExcelFile appears to get called with no engine and that raises an XLRDError. Any help here? I'm not sure ExcelFile should get called at all with no engine and an .xlsb-file, since I added the appropriate checks near the beginning of the fixture.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR. I think the limitation you mentioned is OK to start - it is something that would need to be fixed upstream first right?

Can you add a whatsnew note for 1.0.0 and update the user guide?

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#excel-files

pandas/io/excel/_pyxlsb.py Outdated Show resolved Hide resolved
@WillAyd WillAyd added the IO Excel read_excel, to_excel label Nov 26, 2019
@Rik-de-Kort Rik-de-Kort changed the title XLSB support ENH: XLSB support Nov 28, 2019
@Rik-de-Kort
Copy link
Contributor Author

A bunch of tests are not passing because (seems like) pd.ExcelFile doesn't get patched to use the pyxlsb engine. Can anyone see what's going on with that, as I really don't know a whole lot about the way pytest works. I would anticipate the cd_and_set_engine-fixture to be used for every test in the class...

pandas/tests/io/excel/test_readers.py Outdated Show resolved Hide resolved
@Rik-de-Kort
Copy link
Contributor Author

Rik-de-Kort commented Nov 30, 2019

Found it, there was an addiional fixture I had missed read_ext == ".xlsb" and engine != "pyxlsb" line on. Thanks for the help as well, @WillAyd!

All of the failing tests look normal, having to do with not recognizing datetimes. There's two which have to do with a column header not being converted to "Unnamed: 0", but rather to "None". Probably something with types

@Rik-de-Kort Rik-de-Kort marked this pull request as ready for review December 2, 2019 23:27
@Rik-de-Kort
Copy link
Contributor Author

Alright, seems like most things are in order now. Commits are a bit messy but I'm not great at git yet.

The one build that's failing is failing because jinja2 is doing weird stuff, not sure about that.

@jbrockmendel
Copy link
Member

Commits are a bit messy but I'm not great at git yet.

Dont worry about it, we squash everything on merge so it wont matter before long. git is one of those learning curves that never seems to end.

The one build that's failing is failing because jinja2 is doing weird stuff, not sure about that.

Unrelated

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. Can you update the read_excel docs as well? Otherwise lgtm

"pyxlsb",
marks=[
td.skip_if_no("pyxlsb"),
pytest.mark.filterwarnings("ignore:.*(tree\\.iter|html argument)"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is filterwarnings actually required here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was seeing some defusedxml warnings but they're present either way, so I've removed filterwarnings.

@TomAugspurger TomAugspurger added this to the 1.1 milestone Jan 9, 2020
@TomAugspurger
Copy link
Contributor

Pushing to 1.1

@Rik-de-Kort
Copy link
Contributor Author

Hey, sorry for going AWOL, holidays and stuff. Will rebase tonight.

@Rik-de-Kort
Copy link
Contributor Author

@jreback pinging on green (two weeks later lol)

@WillAyd
Copy link
Member

WillAyd commented Jan 15, 2020

@Rik-de-Kort need to pytest.xfail the tests with date times that aren't supported

@Rik-de-Kort
Copy link
Contributor Author

Done. There's still a failing test (read from url), due to pulling the test file from the master branch of the repo, where the files won't be present till this pull request is complete. :)

@WillAyd
Copy link
Member

WillAyd commented Jan 17, 2020

Yea I've seen that network one before. Can you xfail here and then fix in a follow up PR after this gets merged?

@Rik-de-Kort
Copy link
Contributor Author

Yea I've seen that network one before. Can you xfail here and then fix in a follow up PR after this gets merged?

Yep, done in the latest commit. I added two xfails since the test is going to fail regardless even if test1.xlsb is present in the master branch.

@Rik-de-Kort
Copy link
Contributor Author

@wwwiiilll just realized that because of the PEP 396 addition I might need the right version of Pyxlsb in the environment files. Currently it's 1.0.5, which doesn't have it. Will putting 1.0.6 suffice?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation lgtm just need to fix up a few of the version things at this point

@@ -264,6 +264,7 @@ pyarrow 0.12.0 Parquet, ORC (requires 0.13.0), and
pymysql 0.7.11 MySQL engine for sqlalchemy
pyreadstat SPSS files (.sav) reading
pytables 3.4.2 HDF5 reading / writing
pyxlsb 1.0.5 Reading for xlsb files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pyxlsb 1.0.5 Reading for xlsb files
pyxlsb 1.0.6 Reading for xlsb files

@@ -19,6 +19,7 @@
"pyarrow": "0.13.0",
"pytables": "3.4.2",
"pytest": "5.0.1",
"pyxlsb": "1.0.5",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"pyxlsb": "1.0.5",
"pyxlsb": "1.0.6",

pandas/tests/io/excel/test_readers.py Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Jan 20, 2020

this is prob ok for 1.0.0, if you are ok with it @WillAyd

@WillAyd
Copy link
Member

WillAyd commented Jan 20, 2020

Oh yea sure

@WillAyd WillAyd modified the milestones: 1.1, 1.0.0 Jan 20, 2020
@WillAyd WillAyd merged commit cdffa43 into pandas-dev:master Jan 20, 2020
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 20, 2020
@WillAyd
Copy link
Member

WillAyd commented Jan 20, 2020

@Rik-de-Kort very nice PR. If you can fix up comments on version and unskip the test that fails because it requires the file on master in a follow up would be much appreciated

simonjayhawkins pushed a commit that referenced this pull request Jan 21, 2020
Co-authored-by: Rik-de-Kort <32839123+Rik-de-Kort@users.noreply.github.com>
@Rik-de-Kort
Copy link
Contributor Author

Am I correct in assuming no further work is needed?

@WillAyd
Copy link
Member

WillAyd commented Jan 21, 2020

If you can clarify the min version required in a follow up (I think should be 1.0.6 not 1.0.5) that should be it

@WillAyd WillAyd mentioned this pull request Feb 1, 2020
5 tasks
@TomAugspurger
Copy link
Contributor

There's an issue with backported PRs. The author is set to the bot, rather than the original git author: scientific-python/MeeseeksDev#35

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhancement: XLSB support in read_excel()
6 participants