# Pipestat Python API

Pipestat is a [Python package](https://pypi.org/project/pipestat/) for standardized reporting of pipeline statistics.

It formalizes how pipeline developers and developers of downstream tools communicate -- results produced by a pipeline can easily and reliably become input for downstream analyses.

## Usage

Here's how a pipeline developer can use `pipestat` to report results:

In [1]:
import pipestat
from jsonschema import ValidationError

After importing the package, create a `PipestatManager` object. The object constructor requires a few pieces of information:

1. a namespace to write into, for example the name of the pipeline
2. a path to the schema file that describes results that can be reported
3. backend info: either a path to a YAML-formatted results file or a path to a pipestat config with PostgreSQL database login credentials

## Back-end types

Two types of back-ends are currently supported:

1. a **file** (pass a file path to the constructor)  
The changes reported using the `report` method of `PipestatManager` will be securely written to the file. Currently only the [YAML](https://yaml.org/) format is supported. 

2. a **PostgreSQL database** (pass a path to the pipestat config to the constructor)
This option gives the user the possibility to use a fully fledged database to back `PipestatManager`. 

To use a file as the back-end, just pass a file path string to the constructor. Let's create a temporary file first:

In [2]:
from tempfile import mkstemp
_, temp_file = mkstemp(suffix=".yaml")
print(temp_file)

/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/tmpphtscb41.yaml


Now we can create a `PipestatManager` object that uses this file as the back-end:

In [3]:
psm = pipestat.PipestatManager(
  name="test",
  record_identifier="sample1",
  results_file=temp_file,
  schema_path="../tests/data/sample_output_schema.yaml"
)

Reading data from '/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/tmpphtscb41.yaml'


The results will be reported to a "test" namespace.

In [4]:
psm.name

'test'

By default, a `PipestatManager` instance is bound to the record it was initialized with. However, you can report or remove results for a different record by passing the `record_identifier` argument to the respective methods.

In [5]:
psm.record_identifier

'sample1'

Since we've used a newly created file, nothing has been reported yet:

In [6]:
psm.data

YacAttMap: {}

## Reporting results

To report a result, use the `report` method. It requires two pieces of information:

1. record identifier -- the record to report the result for, for example a unique name of the sample (optional if provided at `PipestatManager` initialization stage)
2. values -- a Python `dict` of resultID-value pairs to report. The top-level keys need to correspond to the result identifiers defined in the schema

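The key requirement in item 2 can be illustrated without `pipestat` itself. The following is a minimal sketch, not the library's implementation: `unknown_result_ids` is a hypothetical helper, and the `schema` dict is a hand-written excerpt assumed for illustration.

```python
# Hypothetical helper, not part of pipestat: find reported keys the schema lacks.
def unknown_result_ids(values, schema):
    """Return the top-level keys of `values` that the schema does not define."""
    return sorted(key for key in values if key not in schema)

# Hand-written excerpt of a results schema (assumed for illustration).
schema = {
    "number_of_things": {"type": "integer"},
    "name_of_something": {"type": "string"},
}

print(unknown_result_ids({"number_of_things": 10}, schema))  # []
print(unknown_result_ids({"not_in_schema": 1}, schema))      # ['not_in_schema']
```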
### Available results defined in schemas

To learn about the results that the current `PipestatManager` instance supports, check out the `schema` property:

In [7]:
psm.schema

{'number_of_things': {'type': 'integer', 'description': 'Number of things'},
 'percentage_of_things': {'type': 'number',
  'description': 'Percentage of things'},
 'name_of_something': {'type': 'string', 'description': 'Name of something'},
 'swtich_value': {'type': 'boolean', 'description': 'Is the switch on of off'},
 'collection_of_things': {'type': 'array',
  'description': 'This store collection of values'},
 'output_object': {'type': 'object', 'description': 'Object output'},
 'output_file': {'type': 'file',
  'description': 'This a path to the output file'},
 'output_image': {'type': 'image',
  'description': 'This a path to the output image'}}

To learn about the required attributes of results with complex types, like `file` or `image` (see the `output_file` and `output_image` results), select the result identifier from the `result_schemas` property:

In [8]:
psm.result_schemas["output_file"]

{'type': 'object',
 'description': 'This a path to the output file',
 'properties': {'path': {'type': 'string'}, 'title': {'type': 'string'}},
 'required': ['path', 'title']}

### Results composition enforcement

As you can see, to report an `output_file` result, you need to provide an object with `path` and `title` string attributes. If you fail to do so, `PipestatManager` will issue an informative validation error:

In [9]:
try: 
    psm.report(values={"output_file": {"path": "/home/user/path.csv"}})
except ValidationError as e:
    print(e)

'title' is a required property

Failed validating 'required' in schema:
    {'description': 'This a path to the output file',
     'properties': {'path': {'type': 'string'},
                    'title': {'type': 'string'}},
     'required': ['path', 'title'],
     'type': 'object'}

On instance:
    {'path': '/home/user/path.csv'}

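The `required` check behind this error can be sketched in a few lines. This is a simplified stand-in for JSON Schema validation, not what `pipestat` runs internally; `missing_required` is a hypothetical helper, and the schema is copied from the `output_file` entry shown earlier.

```python
# Simplified stand-in for JSON Schema validation (not pipestat's code).
def missing_required(instance, result_schema):
    """Return required property names that are absent from `instance`."""
    return [name for name in result_schema.get("required", []) if name not in instance]

# Schema copied from the `output_file` entry of `result_schemas`.
output_file_schema = {
    "type": "object",
    "description": "This a path to the output file",
    "properties": {"path": {"type": "string"}, "title": {"type": "string"}},
    "required": ["path", "title"],
}

print(missing_required({"path": "/home/user/path.csv"}, output_file_schema))
# ['title']
```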

Let's report a correct object this time:

In [10]:
psm.report(
  values={"output_file": {"path": "/home/user/path.csv", "title": "CSV file with some data"}}
)

Reported records for 'sample1' in 'test' namespace:
 - output_file: {'path': '/home/user/path.csv', 'title': 'CSV file with some data'}


True

Inspect the object's data to verify that the result has been reported successfully:

In [11]:
psm.data

test:
  sample1:
    output_file:
      path: /home/user/path.csv
      title: CSV file with some data

Duplicate results are not allowed unless you force an overwrite:

In [12]:
psm.report(
  values={"output_file": {"path": "/home/user/path_new.csv", "title": "new CSV file with some data"}}
)

These results exist for 'sample1': ['output_file']


False

In [13]:
psm.report(
  values={"output_file": {"path": "/home/user/path_new.csv", "title": "new CSV file with some data"}},
  force_overwrite=True
)
psm.data

These results exist for 'sample1': ['output_file']
Overwriting existing results: ['output_file']
Reported records for 'sample1' in 'test' namespace:
 - output_file: {'path': '/home/user/path_new.csv', 'title': 'new CSV file with some data'}


test:
  sample1:
    output_file:
      path: /home/user/path_new.csv
      title: new CSV file with some data

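The duplicate check and the `force_overwrite` switch can be modeled with a plain dictionary. This is an illustrative sketch under assumptions, not `pipestat`'s code: the `report` function and the `records` dict below are hypothetical, and the sketch assumes nothing is written when a duplicate is rejected.

```python
# Illustrative only: a plain-dict model of one namespace, not pipestat's code.
def report(records, record_identifier, values, force_overwrite=False):
    """Store `values` for a record; refuse to replace existing results unless forced."""
    record = records.setdefault(record_identifier, {})
    existing = [result_id for result_id in values if result_id in record]
    if existing and not force_overwrite:
        return False  # in this sketch, a rejected report writes nothing
    record.update(values)
    return True

records = {}
old = {"output_file": {"path": "/home/user/path.csv"}}
new = {"output_file": {"path": "/home/user/path_new.csv"}}

first = report(records, "sample1", old)
second = report(records, "sample1", new)
forced = report(records, "sample1", new, force_overwrite=True)
print(first, second, forced)  # True False True
```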
Most importantly, because the object is backed by a file, the reported results persist -- a new `PipestatManager` object reads them when it is created:

In [14]:
psm1 = pipestat.PipestatManager(
  name="test",
  record_identifier="sample1",
  results_file=temp_file,
  schema_path="../tests/data/sample_output_schema.yaml"
)

Reading data from '/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/tmpphtscb41.yaml'


In [15]:
psm1.data

test:
  sample1:
    output_file:
      path: /home/user/path_new.csv
      title: new CSV file with some data

That's because the contents are stored in the file we've specified at object creation stage:

In [16]:
!echo $temp_file
!cat $temp_file

/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/tmpphtscb41.yaml
test:
  sample1:
    output_file:
      path: /home/user/path_new.csv
      title: new CSV file with some data


Note that two processes can safely report to a single file and a single namespace, since `pipestat` supports locks and race-free writes to control multi-user conflicts and prevent data loss.

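The text above does not say how `pipestat` implements its locks. As a generic illustration of the idea, here is a minimal lock-file sketch using only the standard library; `file_lock`, its parameters, and the `.lock` suffix are assumptions made for this example, not `pipestat` API.

```python
import os
import tempfile
import time
from contextlib import contextmanager

@contextmanager
def file_lock(target_path, timeout=5.0, poll_interval=0.05):
    """Hold an exclusive lock file next to `target_path` (hypothetical helper)."""
    lock_path = target_path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another process already holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_interval)
    try:
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)  # release the lock for the next writer

# Write to a throwaway results file while holding the lock.
results_path = os.path.join(tempfile.mkdtemp(), "results.yaml")
with file_lock(results_path):
    with open(results_path, "a") as handle:
        handle.write("test: {}\n")
```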
### Results type enforcement

By default, `PipestatManager` raises an exception if a value of an incompatible type is reported.

This behavior can be changed by setting `strict_type` to `False` in the `PipestatManager.report` method. In this case `PipestatManager` tries to cast the reported values to the Python classes required by the schema. For example, if a result defined as `integer` is reported with a `str` value, the stored value will be an `int`:

In [17]:
psm.result_schemas["number_of_things"]

{'type': 'integer', 'description': 'Number of things'}

In [18]:
psm.report(values={"number_of_things": "10"}, strict_type=False)

Reported records for 'sample1' in 'test' namespace:
 - number_of_things: 10


True

The method will attempt to cast the value to a proper Python class and store the converted object. In case of a failure, an error will be raised:

In [19]:
try:
    psm.report(
      record_identifier="sample2",
      values={"number_of_things": []},
      strict_type=False
    )
except TypeError as e:
    print(e)

int() argument must be a string, a bytes-like object or a number, not 'list'


Note that in this case we tried to report a result for a different record (`sample2`), which had to be specified explicitly with the `record_identifier` argument.

In [20]:
psm.data

test:
  sample1:
    output_file:
      path: /home/user/path_new.csv
      title: new CSV file with some data
    number_of_things: 10

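The casting behavior described in this section can be sketched as follows. This is a minimal illustration, not `pipestat`'s implementation: `cast_result` is a hypothetical helper, and the `PYTHON_CLASSES` mapping from schema type names to Python classes is an assumed subset.

```python
# Assumed mapping from schema type names to Python classes (illustrative subset).
PYTHON_CLASSES = {"integer": int, "number": float, "string": str}

def cast_result(value, schema_type):
    """Return `value` converted to the class the schema type calls for."""
    python_class = PYTHON_CLASSES[schema_type]
    if isinstance(value, python_class):
        return value
    return python_class(value)  # may raise TypeError or ValueError

print(cast_result("10", "integer"))  # 10

try:
    cast_result([], "integer")
except TypeError as error:
    print("cast failed:", type(error).__name__)  # cast failed: TypeError
```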
## Removing results

The `PipestatManager` object also supports removing results. To do so, call the `remove` method and provide the `record_identifier` and `result_identifier` arguments (`record_identifier` can be omitted to use the record the object is bound to):

In [21]:
psm.remove(result_identifier="number_of_things")

Removed result 'number_of_things' for record 'sample1' from 'test' namespace


True

To remove the entire record, skip the `result_identifier` argument:

In [22]:
psm.remove()

Removing 'sample1' record


True

Verify that the corresponding entry was deleted from the results:

In [23]:
psm.data

test: OrderedDict()