A Python library to manipulate and transform sequences

SeqTools

SeqTools facilitates the manipulation of datasets and the evaluation of transformation pipelines. Some of the provided functionalities include: element-wise mapping, reordering, reindexing, concatenation, joining, slicing, minibatching, etc.

To improve ease of use, SeqTools assumes that datasets are objects that implement a list-like sequence interface: a container object with a length, whose elements are accessible via indexing or slicing. All SeqTools functions take and return objects compatible with this simple and convenient interface.
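
For instance, a plain list can serve as input to seqtools.smap (the element-wise mapping used throughout this README), and the result exposes the same length, indexing and slicing; a minimal sketch:

>>> import seqtools
>>> data = list(range(10))                     # any list-like container works as input
>>> squares = seqtools.smap(lambda x: x ** 2, data)
>>> len(squares)                               # outputs expose the same interface
10
>>> squares[3]
9
>>> list(squares[:3])
[0, 1, 4]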

Sometimes manipulating a whole dataset with transformations or combinations can be slow and resource intensive; a transformed dataset might not even fit into memory! To circumvent this issue, SeqTools implements on-demand execution under the hood, so that computations only run when needed, and only for the elements that are actually required, ignoring the rest of the dataset. This keeps memory usage to a bare minimum and reduces the time it takes to access any arbitrary result. This on-demand strategy makes it quick to define dataset-wide transformations and probe a few results for debugging or prototyping purposes, yet it remains transparent to users, who still benefit from a simple and convenient list-like interface.

>>> import seqtools
>>> def do(x):
...     print("-> computing now")
...     return x + 2
...
>>> a = [1, 2, 3, 4]
>>> m = seqtools.smap(do, a)
>>> # nothing printed because evaluation is delayed
>>> m[0]
-> computing now
3
>>> for v in m[:-2]:
...     print(v)
-> computing now
3
-> computing now
4

When the time comes to move from prototyping to execution, the list-like container interface facilitates serial evaluation. In addition, SeqTools provides simple helpers to dispatch work between multiple background workers (threads or processes) and thereby maximize execution speed and resource usage.
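
As a rough sketch of worker-based evaluation, the prefetch helper listed further below can dispatch computations to background threads; the parameter names used here (nworkers, max_buffered, method) are assumptions, so check the documentation for the exact signature:

>>> import time
>>> import seqtools
>>> def slow_op(x):
...     time.sleep(.01)
...     return x ** 2
...
>>> m = seqtools.smap(slow_op, list(range(100)))
>>> # parameter names below are assumed; see the documentation for the exact signature
>>> p = seqtools.prefetch(m, nworkers=4, max_buffered=10, method='thread')
>>> total = sum(p)  # items are evaluated ahead of time by background threads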

SeqTools originally targets data science, and more precisely the preprocessing stages of a dataset. Because this usage is often experimental, on-demand execution is made as transparent as possible to users through fault-tolerant functions and insightful error reporting. Moreover, the internal code is kept concise and clearly commented to facilitate error tracing through a failing transformation pipeline.

Nevertheless, this project purposely keeps a generic interface and only requires minimal dependencies in order to facilitate reusability beyond this scope of application.

Example

>>> import time
>>> def f1(x):
...     return x + 1
...
>>> def f2(x):  # slow and memory heavy transformation
...     time.sleep(.01)
...     return [x for _ in range(500)]
...
>>> def f3(x):
...     return sum(x) / len(x)
...
>>> data = list(range(1000))

Without delayed evaluation, defining the pipeline and reading values looks like this:

>>> tmp1 = [f1(x) for x in data]
>>> tmp2 = [f2(x) for x in tmp1]  # takes 10 seconds and a lot of memory
>>> res = [f3(x) for x in tmp2]
>>> print(res[2])
3.0
>>> print(max(tmp2[2]))  # requires storing 499,500 useless values along the way
3

With seqtools:

>>> tmp1 = seqtools.smap(f1, data)
>>> tmp2 = seqtools.smap(f2, tmp1)
>>> res = seqtools.smap(f3, tmp2)  # no computations so far
>>> print(res[2])  # takes 0.01 seconds
3.0
>>> print(max(tmp2[2]))  # easy access to intermediate results
3

Batteries included!

The library comes with a set of functions to manipulate sequences:

- concatenate
- batch
- gather
- prefetch
- interleaving
and others (suggestions are also welcome).
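
A rough sketch of how a few of these helpers compose; the argument order and whether concatenate takes a list of sequences are assumptions here, so refer to the documentation for exact signatures:

>>> import seqtools
>>> a = seqtools.smap(lambda x: x + 1, list(range(5)))
>>> b = seqtools.smap(lambda x: -x, list(range(5)))
>>> both = seqtools.concatenate([a, b])        # chain several sequences
>>> len(both)
10
>>> subset = seqtools.gather(both, [0, 2, 4])  # reindex/reorder with explicit indices
>>> list(subset)
[1, 3, 5]
>>> minibatches = seqtools.batch(both, 4)      # group items into minibatches of size 4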

Installation

pip install seqtools

Documentation

The documentation is hosted at https://seqtools-doc.readthedocs.io.

Contributing and Support

Use the issue tracker to request features, propose improvements or report issues. For questions regarding usage, please send an email.

Related libraries

Joblib proposes low-level functions with many settings to optimize pipelined transformations. That library notably provides advanced caching mechanisms, which are not the primary concern of SeqTools. SeqTools uses a simpler container-oriented interface with multiple utility functions in order to assist fast prototyping. On-demand evaluation is its default behaviour and applies at all layers of a transformation pipeline. In particular, parallel evaluation can be inserted in the middle of a transformation pipeline and won't block execution to wait for the computation of all elements of the dataset.

SeqTools is designed to connect nicely to the data loading pipelines of machine learning libraries such as PyTorch's torch.utils.data and torchvision.transforms, or Tensorflow's tf.data. The interfaces of these libraries focus on iterators to access transformed elements, contrary to SeqTools, which also provides arbitrary reads via indexing.
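
For example, since SeqTools outputs implement __len__ and __getitem__, they can in principle be consumed as a map-style dataset by PyTorch's DataLoader; the following is a hedged sketch (the toy transformation and variable names are illustrative):

>>> import seqtools
>>> import torch
>>> from torch.utils.data import DataLoader
>>> raw = list(range(100))                                   # stand-in for raw samples
>>> dataset = seqtools.smap(lambda x: torch.tensor([x]), raw)
>>> # map-style datasets only require __getitem__ and __len__, which SeqTools provides
>>> loader = DataLoader(dataset, batch_size=8, shuffle=True)
>>> first_batch = next(iter(loader))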