Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add helpers for xz, lz4 and zstandard codec and for s3 and smb remote servers #491

Merged
merged 11 commits into from
Jun 23, 2020

Conversation

juarezr
Copy link
Member

@juarezr juarezr commented Jun 23, 2020

add helpers for xz, lz4 and zstandard codec and for s3 and smb remote servers

Changes

  • Refactored url handling in to...() and from...() functions for allowing dynamic adding sources.
  • Added functions petl.io.sources.register_reader and petl.io.sources.register_writer for registering custom source helpers for handling I/O from remote protocols.
  • Added function petl.io.sources.register_codec for registering custom helpers for compressing and decompressing files with other algorithms.
  • Added classes petl.io.codec.xz.XZCodec, petl.io.codec.xz.LZ4Codec and petl.io.codec.zstd.ZstandardCodec for compressing files with XZ and the "state of art" LZ4 and Zstandard algorithms.
  • Added classes :class:petl.io.source.s3.S3Source and petl.io.source.smb.SMBSource reading and writing files to remote servers using int url the protocols s3:// and smb://.
  • Tested with test suite, AWS S3 and Windows Network Share.

Ready for comments.

Checklist

  • Includes unit tests
  • New functions have docstrings with examples that can be run with doctest
  • New functions are included in API docs
  • Docstrings include notes for any changes to API or behaviour
  • Travis CI passes (unit tests run under Linux)
  • AppVeyor CI passes (unit tests run under Windows)
  • Unit test coverage has not decreased (see Coveralls)
  • All changes documented in docs/changes.rst

@coveralls
Copy link

coveralls commented Jun 23, 2020

Coverage Status

Coverage increased (+0.6%) to 91.671% when pulling ecaa8a7 on juarezr:petl_sources into b1ef013 on petl-developers:master.

@alimanfoo
Copy link
Collaborator

Hi @juarezr, this all looks like a great improvement to me, no specific comments.

One thing to note, there is a new project called fsspec which provides common access to a variety of different file-system-like sources, with dispatching based on a URL-like string syntax. E.g., calling fsspec.open("s3://my-bucket/foo.csv") will delegate to s3fs. There is also support for Google Cloud via gcsfs and various other sources. So there may be an opportunity to leverage fsspec to do some of the work here. However, you may prefer to get this PR in then consider using fsspec at a later point. Happy to go with whatever you think works best.

@juarezr
Copy link
Member Author

juarezr commented Jun 23, 2020

Hi, @alimanfoo ,

  1. I think that fsspec will be a natural evolution of this functionality.
  2. For accessing AWS S3 I'm using package s3fs that is used/wraped fsspec
  3. I prefer get this PR first and follow with improvements after. Of course if you don't have any objections.
  4. Suggestions welcome.

Thanks

@alimanfoo
Copy link
Collaborator

  1. I prefer get this PR first and follow with improvements after.

Very happy with that approach.

@juarezr
Copy link
Member Author

juarezr commented Jun 23, 2020

I looked at fsspec more closely and noticed:

  1. I was doing sort of the same thing here
  2. There isn't a smb:// protocol yet (Windows shares)
  3. I have some things to figure out:
  4. If fsspec should handle local filesystems or petl existing sources
  5. If it detects compression by file extension or protocol prefix
  6. What happens to existing sources
  7. This could be handled further.

Maybe you can share your opinions about these point. 😄

Next time that I'll have time available, I will explore further.

But for now, do you think we can proceed with the pull request?

@alimanfoo
Copy link
Collaborator

But for now, do you think we can proceed with the pull request?

I think so, let's merge this PR, then you can investigate fsspec further if and when you think useful.

Is this PR ready to merge?

@alimanfoo
Copy link
Collaborator

  • There isn't a smb:// protocol yet (Windows shares)

Right, support for 'smb://' would need to be special cased within petl, or contributed upstream to fsspec.

  • If fsspec should handle local filesystems or petl existing sources

I believe fsspec could handle local filesystems.

  • If it detects compression by file extension or protocol prefix

I don't know, sorry.

  • What happens to existing sources

I imagine some of the existing sources could be removed and use fsspec instead.

@juarezr
Copy link
Member Author

juarezr commented Jun 23, 2020

Yes. This PR is ready to merge.

@juarezr
Copy link
Member Author

juarezr commented Jun 23, 2020

  • There isn't a smb:// protocol yet (Windows shares)

Right, support for 'smb://' would need to be special cased within petl, or contributed upstream to fsspec.

  • If fsspec should handle local filesystems or petl existing sources

I believe fsspec could handle local filesystems.

  • If it detects compression by file extension or protocol prefix

I don't know, sorry.

I will check this later, when I could get time available.

  • What happens to existing sources

I imagine some of the existing sources could be removed and use fsspec instead.

This makes easier for going further.

Do youn know if some breakage will happen if the existing sources are removed?

@alimanfoo alimanfoo merged commit ee769a7 into petl-developers:master Jun 23, 2020
@alimanfoo
Copy link
Collaborator

v1.5.0 released to pypi and conda-forge

@alimanfoo
Copy link
Collaborator

Do youn know if some breakage will happen if the existing sources are removed?

Yes, I imagine some people may have code that would break. Ideally we would deprecated those features (raise a DeprecationWarning) but leave them there for the next version, and then we could remove them in the version after that.

@juarezr
Copy link
Member Author

juarezr commented Jun 24, 2020

Do youn know if some breakage will happen if the existing sources are removed?

Yes, I imagine some people may have code that would break. Ideally we would deprecated those features (raise a DeprecationWarning) but leave them there for the next version, and then we could remove them in the version after that.

Looks like a good way to go further.

Thanks for the release.

@juarezr juarezr deleted the petl_sources branch September 17, 2020 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants