add helpers for xz, lz4 and zstandard codec and for s3 and smb remote servers (#491)

* allow to register sources and codecs

* forward mode in append file functions

* allow using remote helpers + compression helpers

* add handlers for lz4, xz and zstandard compression

* add handlers for s3 and smb protocols

* document remote and compression handlers

* update .gitignore patterns

* updated changes.rst to reflect new helpers

* fix bug in test case for passing CI

* fix broken doc tests for s3 and smb

* improve coverage for helpers test cases

Co-authored-by: Juarez Rudsatz <juarez.rudsatz@ceabs.net>
juarezr and Juarez Rudsatz committed Jun 23, 2020
1 parent b1ef013 commit ee769a7
Showing 19 changed files with 771 additions and 36 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -6,9 +6,11 @@ MANIFEST
dist
.ipynb_checkpoints/
.idea
.vscode
.tox/
example*.*
.coverage
.eggs
tmp*
env/
spike*
23 changes: 23 additions & 0 deletions docs/changes.rst
@@ -1,6 +1,29 @@
Changes
=======

Version 1.5.0
-------------

* Added functions :func:`petl.io.sources.register_reader` and
  :func:`petl.io.sources.register_writer` for registering custom source helpers for
  handling I/O from remote protocols.
  By :user:`juarezr`, :issue:`491`.

* Added function :func:`petl.io.sources.register_codec` for registering custom
  helpers for compressing and decompressing files with other algorithms.
  By :user:`juarezr`, :issue:`491`.

* Added classes :class:`petl.io.codec.xz.XZCodec`, :class:`petl.io.codec.lz4.LZ4Codec`
  and :class:`petl.io.codec.zstd.ZstandardCodec` for compressing files with `XZ` and
  the state-of-the-art `LZ4` and `Zstandard` algorithms.
  By :user:`juarezr`, :issue:`491`.

* Added classes :class:`petl.io.source.s3.S3Source` and
  :class:`petl.io.source.smb.SMBSource` for reading and writing files on remote
  servers by using the `s3://` and `smb://` protocols in the URL.
  By :user:`juarezr`, :issue:`491`.


Version 1.4.0
-------------

62 changes: 58 additions & 4 deletions docs/io.rst
@@ -20,6 +20,8 @@ string it is interpreted as follows:
* string ending with `.bz2` - read from file via bz2 decompression
* any other string - read directly from file

.. _io_extract_codec:

Some helper classes are also available for reading from other types of
file-like sources, e.g., reading data from a Zip file, a string or a
subprocess, see the section on :ref:`io_helpers` below for more
@@ -47,6 +49,8 @@ follows:
* string ending with `.bz2` - write to file via bz2 compression
* any other string - write directly to file

.. _io_load_codec:

Some helper classes are also available for writing to other types of
file-like sources, e.g., writing to a Zip file or string buffer, see
the section on :ref:`io_helpers` below for more information.
@@ -285,8 +289,8 @@ Text indexes (Whoosh)
.. autofunction:: petl.io.whoosh.totextindex
.. autofunction:: petl.io.whoosh.appendtextindex

- .. module:: petl.io.sources
- .. _io_helpers:
+ .. module:: petl.io.avro
+ .. _io_avro:

Avro files (fastavro)
----------------------------
@@ -331,8 +335,8 @@ Avro files (fastavro)
:start-after: begin_complex_schema
:end-before: end_complex_schema

- .. module:: petl.io.avro
- .. _io_avro:
+ .. module:: petl.io.sources
+ .. _io_helpers:

I/O helper classes
------------------
@@ -364,3 +368,53 @@ for full details.
.. autoclass:: petl.io.sources.URLSource
.. autoclass:: petl.io.sources.MemorySource
.. autoclass:: petl.io.sources.PopenSource

.. _io_remotes:

Remote I/O helper classes
-------------------------

The following classes are helpers that let the reading (``from...()``) and
writing (``to...()``) functions treat a remote file transparently as a
file-like source.

There is no need to instantiate them. They are used by the mechanism described
in :ref:`Extract <io_extract>` and :ref:`Load <io_load>`.

It's possible to read and write just by prefixing the protocol (e.g., `s3://`)
to the source path of the file.

.. autoclass:: petl.io.source.s3.S3Source
.. autoclass:: petl.io.source.smb.SMBSource
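
The prefix-based selection described above can be pictured with a small,
self-contained sketch. It is not petl's implementation: every name in it
(``toy_register_reader``, ``toy_read_source_from_arg``, ``LocalFileSource``,
``EchoSource``) is hypothetical and only illustrates how a protocol prefix
might be mapped to a source class that exposes an ``open()`` context manager.

```python
import io
from contextlib import contextmanager

# Hypothetical registry for illustration; petl keeps its own in petl.io.sources.
_TOY_READERS = {}


def toy_register_reader(protocol, handler_class):
    """Associate a URL scheme such as 's3' with a source class."""
    _TOY_READERS[protocol] = handler_class


def toy_read_source_from_arg(source):
    """Pick a handler from the text before '://', else use a local file."""
    if '://' in source:
        protocol = source.split('://', 1)[0]
        if protocol in _TOY_READERS:
            return _TOY_READERS[protocol](source)
    return LocalFileSource(source)


class LocalFileSource(object):
    """Fallback source that opens a path on the local filesystem."""

    def __init__(self, path):
        self.path = path

    @contextmanager
    def open(self, mode='rb'):
        with io.open(self.path, mode) as f:
            yield f


class EchoSource(object):
    """Stand-in for a remote source: serves the URL itself as the payload."""

    def __init__(self, url):
        self.url = url

    @contextmanager
    def open(self, mode='rb'):
        yield io.BytesIO(self.url.encode('utf-8'))


toy_register_reader('echo', EchoSource)
```

With this in place, ``toy_read_source_from_arg('echo://bucket/key.csv')``
returns an ``EchoSource``, while a plain path falls back to
``LocalFileSource``.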

.. _io_codecs:

Compression I/O helper classes
------------------------------

The following classes are helpers that let the reading (``from...()``)
functions decompress, and the writing (``to...()``) functions compress, a file
transparently as a file-like source.

There is no need to instantiate them. They are used by the mechanism described
in :ref:`Extract <io_extract_codec>` and :ref:`Load <io_load_codec>`.

It's possible to compress and decompress just by specifying the file extension
(e.g., `.csv.xz`) at the end of the source filename.

.. autoclass:: petl.io.codec.xz.XZCodec
.. autoclass:: petl.io.codec.zstd.ZstandardCodec
.. autoclass:: petl.io.codec.lz4.LZ4Codec
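
As a rough illustration of extension-based selection, the sketch below picks
an opener from the filename suffix using only standard-library codecs. The
names ``_TOY_CODECS`` and ``toy_open_by_extension`` are hypothetical and are
not part of petl; the real lookup lives in :mod:`petl.io.sources`.

```python
import bz2
import gzip
import lzma

# Hypothetical suffix table; petl's registered codecs are separate from this.
_TOY_CODECS = {
    '.gz': gzip.open,
    '.bz2': bz2.open,
    '.xz': lzma.open,
}


def toy_open_by_extension(filename, mode='rb'):
    """Open filename through the codec matching its suffix, if any."""
    for extension, opener in _TOY_CODECS.items():
        if filename.endswith(extension):
            return opener(filename, mode)
    # No known suffix: read or write the file as-is.
    return open(filename, mode)
```

A name such as ``data.csv.xz`` is therefore routed through ``lzma.open``,
while ``data.csv`` is opened directly.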

.. _io_custom_helpers:

Custom I/O helper classes
------------------------------

To create custom helpers for :ref:`remote I/O <io_remotes>` or
:ref:`compression <io_codecs>`, use the following functions:

.. autofunction:: petl.io.sources.register_reader
.. autofunction:: petl.io.sources.register_writer
.. autofunction:: petl.io.sources.register_codec

See the source code of the classes in the :mod:`petl.io.sources` module for
more details.
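
A minimal sketch of a custom codec, modelled on the shape of ``XZCodec``
(a constructor taking ``filename`` and keyword arguments, plus an ``open()``
context manager). The class ``LegacyLZMACodec`` and the ``.lzma`` extension
are assumptions made for illustration; the registration call copies the
``register_codec(extension, codec_class)`` form used by the built-in codecs
and is skipped when petl is not importable.

```python
import lzma
from contextlib import contextmanager


class LegacyLZMACodec(object):
    """Hypothetical codec for legacy ``.lzma`` (LZMA "alone") files."""

    def __init__(self, filename, **kwargs):
        self.filename = filename
        self.kwargs = kwargs

    @contextmanager
    def open(self, mode='r'):
        # Force binary mode, as the built-in codecs do.
        source = lzma.open(self.filename, mode=mode[:1] + 'b',
                           format=lzma.FORMAT_ALONE, **self.kwargs)
        try:
            yield source
        finally:
            source.close()


try:
    from petl.io.sources import register_codec
    register_codec('.lzma', LegacyLZMACodec)
except ImportError:
    pass  # petl is missing or older than 1.5.0; the class still works alone
```

Once registered, a filename ending in ``.lzma`` would be expected to go
through this class in the same way ``.xz`` goes through ``XZCodec``.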
4 changes: 4 additions & 0 deletions optional_requirements.txt
@@ -14,3 +14,7 @@ Whoosh==2.7.4
xlrd==1.2.0
xlwt==1.3.0
fastavro>=0.23.4
lz4
zstandard
smbprotocol>=1.0.1
s3fs>=0.2.2
2 changes: 2 additions & 0 deletions petl/io/__init__.py
@@ -37,3 +37,5 @@
from petl.io.bcolz import frombcolz, tobcolz, appendbcolz

from petl.io.avro import fromavro, toavro, appendavro

from petl.io.sources import register_codec, register_reader, register_writer
2 changes: 1 addition & 1 deletion petl/io/avro.py
@@ -228,7 +228,7 @@ def appendavro(table, target, schema=None, sample=9, **avro_args):
.. versionadded:: 1.4.0
"""
- target2 = write_source_from_arg(target)
+ target2 = write_source_from_arg(target, mode='ab')
_write_toavro(table,
target=target2,
mode='a+b',
7 changes: 7 additions & 0 deletions petl/io/codec/__init__.py
@@ -0,0 +1,7 @@
from __future__ import absolute_import, print_function, division

from petl.io.codec.zstd import ZstandardCodec

from petl.io.codec.lz4 import LZ4Codec

from petl.io.codec.xz import XZCodec
50 changes: 50 additions & 0 deletions petl/io/codec/lz4.py
@@ -0,0 +1,50 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, print_function, division

from contextlib import contextmanager

from petl.io.sources import register_codec


class LZ4Codec(object):
    '''
    Allows compressing and decompressing .lz4 files

    `LZ4`_ is a lossless compression algorithm, providing compression
    speed greater than 500 MB/s per core (>0.15 Bytes/cycle). It features an
    extremely fast decoder, with speed in multiple GB/s per core (~1 Byte/cycle).

    .. note::

        This codec requires `python-lz4`_ to be installed, e.g.::

            $ pip install lz4

    .. versionadded:: 1.5.0

    .. _python-lz4: https://github.com/python-lz4/python-lz4
    .. _LZ4: http://www.lz4.org

    '''

    def __init__(self, filename, **kwargs):
        self.filename = filename
        self.kwargs = kwargs

    def open_file(self, mode='rb'):
        import lz4.frame
        source = lz4.frame.open(self.filename, mode=mode, **self.kwargs)
        return source

    @contextmanager
    def open(self, mode='r'):
        mode2 = mode[:1] + r'b'  # python2
        source = self.open_file(mode=mode2)
        try:
            yield source
        finally:
            source.close()


register_codec('.lz4', LZ4Codec)

# end #
38 changes: 38 additions & 0 deletions petl/io/codec/xz.py
@@ -0,0 +1,38 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, print_function, division

from contextlib import contextmanager

from petl.io.sources import register_codec

class XZCodec(object):
    '''
    Allows compressing and decompressing .xz files compressed with `lzma`_.

    .. versionadded:: 1.5.0

    .. _lzma: https://docs.python.org/3/library/lzma.html

    '''

    def __init__(self, filename, **kwargs):
        self.filename = filename
        self.kwargs = kwargs

    def open_file(self, mode='rb'):
        import lzma
        source = lzma.open(self.filename, mode=mode, **self.kwargs)
        return source

    @contextmanager
    def open(self, mode='r'):
        mode2 = mode[:1] + r'b'  # python2
        source = self.open_file(mode=mode2)
        try:
            yield source
        finally:
            source.close()


register_codec('.xz', XZCodec)

# end #
58 changes: 58 additions & 0 deletions petl/io/codec/zstd.py
@@ -0,0 +1,58 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import, print_function, division

import io
from contextlib import contextmanager

from petl.io.sources import register_codec


class ZstandardCodec(object):
    '''
    Allows compressing and decompressing .zst files

    `Zstandard`_ is a real-time compression algorithm, providing
    high compression ratios. It offers a very wide range of compression / speed
    trade-offs, while being backed by a very fast decoder.

    .. note::

        This codec requires `zstd`_ to be installed, e.g.::

            $ pip install zstandard

    .. versionadded:: 1.5.0

    .. _zstd: https://github.com/indygreg/python-zstandard
    .. _Zstandard: http://www.zstd.net

    '''

    def __init__(self, filename, **kwargs):
        self.filename = filename
        self.kwargs = kwargs

    def open_file(self, mode='rb'):
        import zstandard as zstd
        if mode.startswith('r'):
            # reading: wrap the compressed file in a decompressing stream
            cctx = zstd.ZstdDecompressor(**self.kwargs)
            compressed = io.open(self.filename, mode)
            source = cctx.stream_reader(compressed)
        else:
            # writing: the file opened here receives the compressed output
            cctx = zstd.ZstdCompressor(**self.kwargs)
            uncompressed = io.open(self.filename, mode)
            source = cctx.stream_writer(uncompressed)
        return source

    @contextmanager
    def open(self, mode='r'):
        mode2 = mode[:1] + r'b'  # python2
        source = self.open_file(mode=mode2)
        try:
            yield source
        finally:
            source.close()


register_codec('.zst', ZstandardCodec)

# end #
2 changes: 1 addition & 1 deletion petl/io/csv.py
@@ -125,7 +125,7 @@ def appendcsv(table, source=None, encoding=None, errors='strict',
"""

- source = write_source_from_arg(source)
+ source = write_source_from_arg(source, mode='ab')
csvargs.setdefault('dialect', 'excel')
appendcsv_impl(table, source=source, encoding=encoding, errors=errors,
write_header=write_header, **csvargs)
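
The change above forwards the append mode to the source helper. One plausible
reason, sketched here with the standard ``lzma`` module only (no petl calls,
and the file name is made up), is that a codec-backed target has to be opened
in ``'ab'`` mode for earlier rows to survive; opening it in ``'wb'`` would
start the file over.

```python
import lzma
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'rows.csv.xz')

# Initial write creates the compressed file.
with lzma.open(path, mode='wb') as f:
    f.write(b'foo,bar\na,1\n')

# Appending adds a second compressed stream after the first one.
with lzma.open(path, mode='ab') as f:
    f.write(b'b,2\n')

# Reading decodes both streams back to back.
with lzma.open(path, mode='rb') as f:
    combined = f.read()
```

Here ``combined`` holds the header, the original row and the appended row in
order.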
2 changes: 1 addition & 1 deletion petl/io/pickle.py
@@ -116,7 +116,7 @@ def appendpickle(table, source=None, protocol=-1, write_header=False):


def _writepickle(table, source, mode, protocol, write_header):
- source = write_source_from_arg(source)
+ source = write_source_from_arg(source, mode)
with source.open(mode) as f:
it = iter(table)
hdr = next(it)
5 changes: 5 additions & 0 deletions petl/io/source/__init__.py
@@ -0,0 +1,5 @@
from __future__ import absolute_import, print_function, division

from petl.io.source.s3 import S3Source

from petl.io.source.smb import SMBSource
