A flexible data validation package for Pandas data structures
Latest commit 5b29d9a Nov 11, 2019


Pandera

A flexible and expressive pandas validation library.


Project Status: Active. The project has reached a stable, usable state and is being actively developed.

pandas data structures hide a lot of information, so explicitly validating them at runtime is a good idea in production-critical or reproducible research settings. pandera enables users to:

  1. Check the types and properties of columns in a DataFrame or values in a Series.
  2. Perform more complex statistical validation like hypothesis testing.
  3. Seamlessly integrate with existing data analysis/processing pipelines via function decorators.

pandera provides a flexible and expressive API for performing data validation on tidy (long-form) and wide data to make data processing pipelines more readable and robust.

Documentation

The official documentation is hosted on ReadTheDocs: https://pandera.readthedocs.io

Install

Using pip:

pip install pandera

Using conda:

conda install -c cosmicbboy pandera

Example Usage

DataFrameSchema

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Check


# validate columns
schema = DataFrameSchema({
    # the check function expects a series argument and should output a boolean
    # or a boolean Series.
    "column1": Column(pa.Int, Check(lambda s: s <= 10)),
    "column2": Column(pa.Float, Check(lambda s: s < -1.2)),
    # you can provide a list of validators
    "column3": Column(pa.String, [
        Check(lambda s: s.str.startswith("value_")),
        Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

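# alternatively, you can pass strings representing the legal pandas datatypes:
# http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes
# (bound to a separate name so the full schema above is not overwritten)
schema_with_string_dtypes = DataFrameSchema({
    "column1": Column("int64", Check(lambda s: s <= 10)),
    # ... define the remaining columns in the same way
})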

df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})

validated_df = schema.validate(df)
print(validated_df)

#     column1  column2  column3
#  0        1     -1.3  value_1
#  1        4     -1.4  value_2
#  2        0     -2.9  value_3
#  3       10    -10.1  value_2
#  4        9    -20.4  value_1

Development Installation

git clone https://github.com/pandera-dev/pandera.git
cd pandera
pip install -r requirements.txt
pip install -e .

Tests

pip install pytest
pytest tests

Contributing to pandera

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

A detailed overview of how to contribute can be found in the contributing guide on GitHub.

Issues

Submit feature requests or bug reports on the GitHub issue tracker: https://github.com/pandera-dev/pandera/issues

Other Data Validation Libraries

Here are a few alternatives for validating Python data structures.

Generic Python object data validation

pandas-specific data validation

Why pandera?

  • pandas-centric data types, column nullability, and uniqueness are first-class concepts.
  • check_input and check_output decorators enable seamless integration with existing code.
  • Checks are flexible and performant because, by design, they operate directly on pandas objects through the pandas API.
  • Hypothesis class provides a tidy-first interface for statistical hypothesis testing.
  • Checks and Hypothesis objects support both tidy and wide data validation.
  • Comprehensive documentation on key functionality.

Citation Information

@misc{niels_bantilan_2019_3385266,
  author       = {Niels Bantilan and
                  Nigel Markey and
                  Riccardo Albertazzi and
                  chr1st1ank},
  title        = {pandera-dev/pandera: 0.2.0 pre-release 1},
  month        = sep,
  year         = 2019,
  doi          = {10.5281/zenodo.3385266},
  url          = {https://doi.org/10.5281/zenodo.3385266}
}