# `pipestat` Python API

`pipestat` is a [Python package](https://pypi.org/project/pipestat/) for standardized reporting of pipeline statistics.

It formalizes a way for pipeline developers and downstream tool developers to communicate: results produced by a pipeline can easily and reliably become input for downstream analyses.

## Usage

Here's how a pipeline developer can use `pipestat` to report results:

In [1]:
import pipestat

After importing the package, create a `PipeStatManager` object. The object constructor requires two pieces of information:

1. a database back-end
2. a namespace to write to, for example the name of the pipeline

## Back-end types

Two types of back-ends are currently supported:

1. a **file** (pass a file path to the constructor)  
The changes reported using the `report` method of `PipeStatManager` will be securely written to the file. Currently only the [YAML](https://yaml.org/) format is supported.

2. a **`Mapping`** (pass a dict-like object to the constructor)  
This option gives the user the possibility to use a fully fledged database to back `PipeStatManager`. The `Mapping` is then an interface between the object and the database. Alternatively, for testing purposes, one can use a standard Python `dict` object.
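To illustrate what "an interface between the object and the database" could look like, here is a minimal sketch of a dict-like class backed by SQLite. The class name `SQLiteMapping`, the table layout, and the JSON serialization are all assumptions made for this example; they are not part of `pipestat`, and the sketch has not been tested against `PipeStatManager` itself.

```python
import json
import sqlite3
from collections.abc import MutableMapping


class SQLiteMapping(MutableMapping):
    """Hypothetical dict-like view over a single SQLite table."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value TEXT)"
        )

    def __getitem__(self, key):
        row = self._conn.execute(
            "SELECT value FROM results WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

    def __setitem__(self, key, value):
        # Values are serialized to JSON so nested dicts survive the round trip
        self._conn.execute(
            "INSERT OR REPLACE INTO results (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self._conn.execute("DELETE FROM results WHERE key = ?", (key,))
        self._conn.commit()

    def __iter__(self):
        rows = self._conn.execute("SELECT key FROM results").fetchall()
        return iter(row[0] for row in rows)

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]


backend = SQLiteMapping()
backend["test"] = {"reads_count": {"value": 1000, "type": "integer"}}
print(dict(backend))
```

One caveat of this design: `__getitem__` returns a deserialized copy, so in-place changes to a nested value (for example `backend["test"]["x"] = 1`) are not written back. A mapping intended for real use would need to handle that case.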

To use a file as the back-end, just pass a file path string to the constructor. Let's create a temporary file first:

In [2]:
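import os
from tempfile import mkstemp
# mkstemp returns an open file descriptor and the path; only the path is needed
fd, temp_file = mkstemp(suffix=".yaml")
os.close(fd)
print(temp_file)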

/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/tmp7s0fvd6g.yaml


Now we can create a `PipeStatManager` object that uses this file as the back-end:

In [3]:
psm = pipestat.PipeStatManager(database=temp_file, name="test")

The results will be reported to the "test" namespace.

In [4]:
psm.name

'test'

Since we've used a newly created file, nothing has been reported yet:

In [5]:
psm.database

{}

To report a result, use the `report` method. It requires three pieces of information:
1. `id` -- name of the reported object
2. `type` -- type of the reported object
3. `value` -- value of the reported object

Here is a list of currently supported types and Python classes that represent them:

In [6]:
pipestat.CLASSES_BY_TYPE

{'integer': int,
 'float': float,
 'string': str,
 'boolean': bool,
 'object': collections.abc.Mapping,
 'null': NoneType,
 'array': list,
 'file': str,
 'image': str}

Consequently, to report an "integer" use a Python `int` object:

In [7]:
psm.report(id="reads_count", type="integer", value=1000)

Cached new 'test' record: reads_count=1000(integer)
Wrote 1 cached records: {'test': {'reads_count': {'value': 1000, 'type': 'integer'}}}


True

Inspect the object's database to verify that the result has been successfully reported:

In [8]:
psm.database

{'test': {'reads_count': {'value': 1000, 'type': 'integer'}}}

Result IDs must be unique within a namespace. Reporting an ID that already exists is rejected and `report` returns `False`:

In [9]:
psm.report(id="reads_count", type="integer", value=1001)

'reads_count' already in database for 'test' namespace


False
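The reporting behavior shown in the cells above (a type lookup followed by a unique-ID check) can be approximated with a plain `dict`. This is only a sketch of the observable semantics, not `pipestat`'s implementation: the standalone `report` function, its `result_id` and `result_type` parameter names, and the choice to raise `TypeError` on a type mismatch are assumptions made for illustration.

```python
from collections.abc import Mapping

# Subset of the type table printed by pipestat.CLASSES_BY_TYPE above
CLASSES_BY_TYPE = {
    "integer": int,
    "float": float,
    "string": str,
    "boolean": bool,
    "object": Mapping,
    "array": list,
}


def report(database, namespace, result_id, result_type, value):
    """Store a typed result unless the ID already exists in the namespace."""
    if result_type not in CLASSES_BY_TYPE:
        raise ValueError(f"unsupported type: {result_type!r}")
    expected = CLASSES_BY_TYPE[result_type]
    if not isinstance(value, expected):
        # Assumption: a mismatch is an error; pipestat may behave differently
        raise TypeError(f"{result_id!r} must be of class {expected.__name__}")
    records = database.setdefault(namespace, {})
    if result_id in records:
        return False  # duplicate IDs are rejected; the stored value is kept
    records[result_id] = {"value": value, "type": result_type}
    return True


db = {}
first = report(db, "test", "reads_count", "integer", 1000)
second = report(db, "test", "reads_count", "integer", 1001)
print(first, second, db)
```

As in the notebook cells, the second call leaves the original value of `reads_count` untouched.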

Most importantly, because the object is backed by a file, the reported results persist. Another `PipeStatManager` object created with the same file reads the results on creation:

In [10]:
psm1 = pipestat.PipeStatManager(database=temp_file, name="test")

In [11]:
psm1.database

{'test': {'reads_count': {'value': 1000, 'type': 'integer'}}}

That's because the contents are stored in the file we specified when creating the object:

In [12]:
!cat $temp_file

test:
  reads_count:
    value: 1000
    type: integer


Note that two processes can securely report to a single file and a single namespace, since `pipestat` supports locks and race-free writes to control multi-user conflicts and prevent data loss.
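For readers unfamiliar with lock-based, race-free writes, the general idea can be sketched with the standard library alone. The text above does not describe how `pipestat` implements its locking, so the example below is a generic lock-file pattern, not `pipestat`'s mechanism. The helper names `file_lock` and `locked_update` are hypothetical, and JSON stands in for YAML so that the example needs no third-party packages.

```python
import json
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def file_lock(path, timeout=5.0, poll=0.05):
    """Hold an exclusive lock represented by a sibling '<path>.lock' file."""
    lock_path = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def locked_update(path, key, value):
    """Read, modify, and write back the file while holding the lock."""
    with file_lock(path):
        data = {}
        if os.path.exists(path) and os.path.getsize(path) > 0:
            with open(path) as handle:
                data = json.load(handle)
        data[key] = value
        with open(path, "w") as handle:
            json.dump(data, handle)


results_path = os.path.join(tempfile.mkdtemp(), "results.json")
locked_update(results_path, "reads_count", 1000)
locked_update(results_path, "aligned_reads", 900)
with open(results_path) as handle:
    final = json.load(handle)
print(final)
```

Because the whole read-modify-write cycle happens while the lock is held, a second writer waits instead of overwriting the first writer's result.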