Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow toxlsx() to add or replace a worksheet. #502

Closed
wants to merge 3 commits into from
Closed

Allow toxlsx() to add or replace a worksheet. #502

wants to merge 3 commits into from

Conversation

chrullrich
Copy link

@chrullrich chrullrich commented Jul 14, 2020

This PR has the objective of enabling toxlsx() to be used to produce xlsx files with multiple worksheets.

The single unit test I added is successful, but I was not able to get the entire test suite to run.

Changes

  1. Added the overwrite argument to io.xlsx.toxlsx() and added a minimal docstring.

Checklist

Checklist for for pull requests including new code and/or changes to existing code...

  • Code
    • Includes unit tests
    • New functions have docstrings with examples that can be run with doctest
    • New functions are included in API docs
    • Docstrings include notes for any changes to API or behaviour
    • All changes documented in docs/changes.rst
  • Testing
    • (Optional) Tested local against remote servers
    • Travis CI passes (unit tests run under Linux)
    • AppVeyor CI passes (unit tests run under Windows)
    • Unit test coverage has not decreased (see Coveralls)
  • Changes
    • (Optional) Just a proof of concept
    • (Optional) Work in progress
    • Ready to review
    • Ready to merge

@coveralls
Copy link

coveralls commented Jul 14, 2020

Coverage Status

Coverage increased (+0.01%) to 91.925% when pulling 80e38e1 on chrullrich:toxlsx-add-sheet into 9ae3fc4 on petl-developers:master.

Copy link
Collaborator

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chrullrich, this looks good to me. Just had one question.

petl/io/xlsx.py Outdated
@@ -86,15 +87,32 @@ def __iter__(self):
pass


def toxlsx(tbl, filename, sheet=None, write_header=True):
def toxlsx(tbl, filename, sheet=None, write_header=True, overwrite=True):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the default should be overwrite=False? This would mean that it's less likely a user will accidentally overwrite an entire workbook when all they wanted to do was replace a single sheet?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compatibility with existing code that does not know about the new argument.

Setting the default to False would mean that a new sheet would always be added unless there was a sheet name argument that matched a sheet in the workbook. Worst case scenario, but not implausible: Two cron jobs exchange data through Excel files (for some reason that someone, somewhere, will have come up with). One uses toxlsx() without an explicit sheet name, the next one fromxlsx(), also without one. Then petl is upgraded. The file will keep growing and the second job will forevermore read only the first sheet containing the last data written before the upgrade.

I'm all for validating and tracking dependency versions, but that doesn't mean I actually do it ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement! 👍

Some options here:

  1. Maybe change overwrite from True/False to:
    1. overwrite the file
    2. replace the sheet content in the file. If the sheet argument is None recreate the file.
    3. append to the sheet content the file
  2. Perhaps the default should be overwrite='replace' for reducing compatibility problems.
    1. Calls with argument sheet with None will behave as currently.
    2. Calls with argument sheet filled with constant names by multiple jobs will have some impact by only a fixed number of sheets.
    3. Calls with argument sheet filled with random names by multiple jobs will be impacted and will keep growing. However this usage may not be common.

Copy link
Author

@chrullrich chrullrich Jul 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. It feels a bit weird to replace a single sheet if one is named, and the entire file if not, but it does make sense. However, replace with no sheet name also replaces overwrite because they are entirely identical, so the latter isn't needed. (Nonsense. overwrite is needed to specify a sheet name.)

Calling the operation that adds sheets to existing files append invites confusion with appendxlsx() that adds rows to existing sheets, so I'd prefer to call it add.

Updated PR coming up, either today or tomorrow.

The "add" mode is tricky because openpyxl will silently uniqify a colliding
sheet name when attempting to create it, i.e. the resulting sheet name may
not be the one requested. For predictability, the exact name is enforced.
"replace" with a sheet name expects the file to either not be there,
or to be a valid .xlsx file, not the empty file it used to get here.
wb = openpyxl.Workbook(write_only=True)
ws = wb.create_sheet(title=sheet)

if not os.path.exists(filename) or mode == "overwrite" \
Copy link
Member

@juarezr juarezr Jul 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not os.path.exists(filename)

Since v1.6.0 petl allows reading and writing files to remote sources like remote servers or cloud providers when also using fsspec package.

However, this code will work only with local files.

Until now that is not a problem because the codes of fromxls() and toxls() are not working with remote sources.

But the PR #506 adds support for remote sources and also conflict with this PR.

Any ideas of what is the best way to conciliate these two changes? 😃

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I had sent the PR four weeks earlier, it would be your problem now? Thanks, I think. :-)

I have no clue at all how the remote sources code works, and cannot make any sense of it right now. I'll try to find the time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chrullrich,

It's absolutely not a problem at all. 😃

When using remote sources, petl only call open/read/write on python file objects.

So os.path.exists() does not work when reading or writing to HTTP, S3, SMB remote sources for example.

Do you think there is any trick for not using os.path.exists() in this functionality?

My current plan is something like:

  1. Waith for @alimanfoo reviewing Allow toxlsx() to add or replace a worksheet. #502
  2. Find a workaround for os.path.exists()
  3. Merge Allow reading and writing Excel files in remote sources #506
  4. Adapt and merge this PR (Allow toxlsx() to add or replace a worksheet. #502)

@juarezr
Copy link
Member

juarezr commented Aug 6, 2020

Hi @chrullrich ,

What you think about this possible resolution ?

@juarezr
Copy link
Member

juarezr commented Aug 13, 2020

Closing this PR because it's changes were merged in #509.

@juarezr juarezr closed this Aug 13, 2020
@chrullrich chrullrich deleted the toxlsx-add-sheet branch March 26, 2021 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants