# Explain Pandera

Take a look at the [documentation](https://pandera.readthedocs.io/en/stable/index.html).

Pandera is a Union.ai open-source project that provides a flexible and expressive API for performing data validation on dataframe-like objects. The goal of Pandera is to make data processing pipelines more readable and robust with statistically typed dataframes.

Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. With pandera, you can:
- Define a schema once and use it to validate different dataframe types including pandas, polars, dask, modin, ibis, and pyspark.
- Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.
- Perform more complex statistical validation like hypothesis testing.
- Parse data to standardize the preprocessing steps needed to produce valid data.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
- Define dataframe models using the class-based API with pydantic-style syntax, and validate dataframes using the typing syntax.
- Synthesize data from schema objects for property-based testing with pandas data structures.
- Lazily validate dataframes so that all validation rules are executed before raising an error.
- Integrate with a rich ecosystem of Python tools like pydantic, FastAPI and mypy.

Pandera supports multiple dataframe libraries, including pandas, polars, pyspark, and ibis.

**Table of contents**:
1. [Explain Pandera](#1-explain-pandera)
   1. [Validate dataframe schema](#11-validate-dataframe-schema)
   2. [Column validation](#12-column-validation)
      1. [First validations](#121-first-validations)
      2. [Null values](#122-null-values)
      3. [Coercing types on columns](#123-coercing-types-on-columns)
      4. [Required columns](#124-required-columns)
      5. [Column regex pattern](#125-column-regex-pattern)
      6. [Validating the joint uniqueness of columns](#126-validating-the-joint-uniqueness-of-columns)
2. [Handling errors](#2-handling-errors)
   1. [Missing and required columns](#21-missing-and-required-columns)
      1. [Handling dataframe columns not in the schema](#211-handling-dataframe-columns-not-in-the-schema)
      2. [Adding missing columns](#212-adding-missing-columns)
   2. [Error reports](#22-error-reports)
3. [Synthesize data](#3-synthesize-data)
   1. [Basic usage](#31-basic-usage)
   2. [Usage in unit tests](#32-usage-in-unit-tests)
4. [Supported features by DataFrame backend](#4-supported-features-by-dataframe-backend)

# 1. Explain Pandera

## 1.1 Validate dataframe schema

Pandera is a good way to validate the quality of your data. We highly recommend using it before you train your model.

Let's take a first example where you want to validate a dataframe. There are 3 columns with different types we want to validate: int, float and string. To do so, you first define a `pa.DataFrameSchema()` in which you will define the columns using `pa.Column(<type>)`.

In [1]:
import pandas as pd
import pandera.pandas as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})

schema = pa.DataFrameSchema({
    "column1": pa.Column(int),
    "column2": pa.Column(float),
    "column3": pa.Column(str),
})

validated_df = schema.validate(df)
print(validated_df)

   column1  column2 column3
0        1      1.1       a
1        2      1.2       b
2        3      1.3       c


If you provide a dataframe with the wrong type:

In [2]:
df_bug = pd.DataFrame({
    "column1": [1, 2, 0.4],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})

try:
    schema.validate(df_bug)
except pa.errors.SchemaError as e:
    print(e)

expected series 'column1' to have type int64, got float64


## 1.2 Column validation

A `Column` specifies the properties of a column in a dataframe object. It can optionally be checked for its data type, [null values](#122-null-values), or duplicate values. The column can be coerced into the specified type, and the [required](#124-required-columns) parameter controls whether the column is allowed to be missing.

Similarly to pandas, the data type can be specified as:
* a string alias, as long as it is recognized by pandas.
* a Python type: `int`, `float`, `bool`, `str`
* a numpy data type
* a pandas extension type: either an instance (e.g. `pd.CategoricalDtype(["a", "b"])`) or a class (e.g. `pd.CategoricalDtype`), provided the class can be initialized with default values.
* a pandera `DataType`: also either an instance or a class.

### 1.2.1 First validations

In [3]:
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
    "column2": pa.Column(float, pa.Check.lt(10)),
    "column3": pa.Column(
        str,
        [
            pa.Check.isin([*"abc"]),
            pa.Check(lambda series: series.str.len() == 1),
        ]
    ),
})

In [4]:
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})

validated_df = schema.validate(df)
print(validated_df)

   column1  column2 column3
0        1      1.1       a
1        2      1.2       b
2        3      1.3       c


In [5]:
df_bug = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 12.3],
    "column3": ["a", "b", "c"],
})

try:
    schema.validate(df_bug)
except pa.errors.SchemaError as e:
    print(e)

Column 'column2' failed element-wise validator number 0: less_than(10) failure cases: 12.3


Column checks allow for the DataFrame’s values to be checked against a user-provided function. `Check` objects also support grouping by a different column so that the user can make assertions about subsets of the column of interest.

Column Hypotheses enable you to perform statistical hypothesis tests on a DataFrame in either wide or tidy format. See the Hypothesis Testing section of the pandera documentation for more details.

In [6]:
simple_schema = pa.DataFrameSchema({
    "column1": pa.Column(
        int,
        pa.Check(
            lambda x: 0 <= x <= 10,
            element_wise=True,
            error="range checker [0, 10]"
        )
    )
})

df_bug = pd.DataFrame({
    "column1": [-20, 5, 10, 30],
})

try:
    simple_schema.validate(df_bug)
except pa.errors.SchemaError as e:
    print(e)

Column 'column1' failed element-wise validator number 0: <Check <lambda>: range checker [0, 10]> failure cases: -20, 30


### 1.2.2 Null values

By default, SeriesSchema/Column objects assume that values are not nullable. In order to accept null values, you need to explicitly specify `nullable=True`, or else you’ll get an error.

In [7]:
import numpy as np
import pandas as pd
import pandera.pandas as pa


df = pd.DataFrame({"column1": [5, 1, np.nan]})

non_null_schema = pa.DataFrameSchema({
    "column1": pa.Column(float, pa.Check(lambda x: x > 0))
})

try:
    non_null_schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)

non-nullable series 'column1' contains null values:
2   NaN
Name: column1, dtype: float64


Setting `nullable=True` allows for null values in the corresponding column.

In [8]:
null_schema = pa.DataFrameSchema({
    "column1": pa.Column(float, pa.Check(lambda x: x > 0), nullable=True)
})

null_schema.validate(df)

   column1
0      5.0
1      1.0
2      NaN


### 1.2.3 Coercing types on columns

If you specify `Column(dtype, ..., coerce=True)` as part of the DataFrameSchema definition, calling `schema.validate` will first coerce the column into the specified `dtype` before applying validation checks.

In [9]:
import pandas as pd
import pandera.pandas as pa

df = pd.DataFrame({"column1": [1, 2, 3]})
schema = pa.DataFrameSchema({"column1": pa.Column(str, coerce=True)})

validated_df = schema.validate(df)
assert isinstance(validated_df.column1.iloc[0], str)

In [10]:
df = pd.DataFrame({"column1": [1., 2., 3, np.nan]})
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, coerce=True, nullable=True)
})

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)

Error while coercing 'column1' to type int64: Could not coerce <class 'pandas.core.series.Series'> data_container into type int64:
   index  failure_case
0      3           NaN


The simplest way to handle this case is to specify the column as `float` or `object`.

In [11]:
schema_object = pa.DataFrameSchema({
    "column1": pa.Column(object, coerce=True, nullable=True)
})
schema_float = pa.DataFrameSchema({
    "column1": pa.Column(float, coerce=True, nullable=True)
})

print(schema_object.validate(df).dtypes)
print(schema_float.validate(df).dtypes)

column1    object
dtype: object
column1    float64
dtype: object


### 1.2.4 Required columns

By default all columns specified in the schema are required, meaning that if a column is missing in the input DataFrame an exception will be thrown. If you want to make a column optional, specify `required=False` in the column constructor:

In [12]:
import pandas as pd
import pandera.pandas as pa


df = pd.DataFrame({"column2": ["hello", "pandera"]})
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, required=False),
    "column2": pa.Column(str)
})

schema.validate(df)

   column2
0    hello
1  pandera


Since `required=True` by default, missing columns would raise an error:

In [13]:
schema = pa.DataFrameSchema({
    "column1": pa.Column(int),
    "column2": pa.Column(str),
})

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)

column 'column1' not in dataframe. Columns in dataframe: ['column2']


### 1.2.5 Column regex pattern

In the case that your dataframe has multiple columns that share common statistical properties, you might want to specify a regex pattern that matches a set of meaningfully grouped columns that have `str` names.

In [14]:
import numpy as np
import pandas as pd
import pandera.pandas as pa

categories = ["A", "B", "C"]

np.random.seed(100)

dataframe = pd.DataFrame({
    "cat_var_1": np.random.choice(categories, size=100),
    "cat_var_2": np.random.choice(categories, size=100),
    "num_var_1": np.random.uniform(0, 10, size=100),
    "num_var_2": np.random.uniform(20, 30, size=100),
})

schema = pa.DataFrameSchema({
    "num_var_.+": pa.Column(
        float,
        checks=pa.Check.greater_than_or_equal_to(0),
        regex=True,
    ),
    "cat_var_.+": pa.Column(
        pa.Category,
        checks=pa.Check.isin(categories),
        coerce=True,
        regex=True,
    ),
})

schema.validate(dataframe).head()

   cat_var_1 cat_var_2  num_var_1  num_var_2
0          A         A   6.804147  24.743304
1          A         C   3.684308  22.774633
2          A         C   5.911288  28.416588
3          C         A   4.790627  21.951250
4          C         B   4.504166  28.563142


You can also regex pattern match on `pd.MultiIndex` columns:

In [15]:
np.random.seed(100)

dataframe = pd.DataFrame({
    ("cat_var_1", "y1"): np.random.choice(categories, size=100),
    ("cat_var_2", "y2"): np.random.choice(categories, size=100),
    ("num_var_1", "x1"): np.random.uniform(0, 10, size=100),
    ("num_var_2", "x2"): np.random.uniform(0, 10, size=100),
})

schema = pa.DataFrameSchema({
    ("num_var_.+", "x.+"): pa.Column(
        float,
        checks=pa.Check.greater_than_or_equal_to(0),
        regex=True,
    ),
    ("cat_var_.+", "y.+"): pa.Column(
        pa.Category,
        checks=pa.Check.isin(categories),
        coerce=True,
        regex=True,
    ),
})

schema.validate(dataframe).head()

   cat_var_1 cat_var_2  num_var_1  num_var_2
          y1        y2         x1         x2
0          A         A   6.804147   4.743304
1          A         C   3.684308   2.774633
2          A         C   5.911288   8.416588
3          C         A   4.790627   1.951250
4          C         B   4.504166   8.563142


### 1.2.6 Validating the joint uniqueness of columns

In some cases you might want to ensure that the combination of values across a group of columns is unique:

In [16]:
import pandas as pd
import pandera.pandas as pa

schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
)
df = pd.DataFrame.from_records([
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
])
try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)

columns '('a', 'c')' not unique:
   a  c
0  1  3
1  1  3


To control how unique errors are reported, the `report_duplicates` argument accepts:
* `exclude_first`: report all duplicates except the first occurrence
* `exclude_last`: report all duplicates except the last occurrence
* `all`: (default) report all duplicates, as in the output above where both rows are listed

In [17]:
import pandas as pd
import pandera.pandas as pa

schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
    report_duplicates="exclude_first",
)
df = pd.DataFrame.from_records([
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
])

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)

columns '('a', 'c')' not unique:
   a  c
1  1  3
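
The three reporting options can be pictured with pandas' own `DataFrame.duplicated`, whose `keep` argument has the same first/last/all distinction. This is an analogy in plain pandas under that assumption, not a statement about how pandera implements the option; the mapping `keep_for_option` is a hypothetical helper.

```python
import pandas as pd

df = pd.DataFrame.from_records([
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 5, "c": 3},
])

# Hypothetical mapping from the report_duplicates options to pandas' `keep`.
keep_for_option = {"exclude_first": "first", "exclude_last": "last", "all": False}

# For each option, collect the index labels that would be reported as
# duplicates on the joint key ("a", "c").
reported = {
    option: df.index[df.duplicated(subset=["a", "c"], keep=keep)].tolist()
    for option, keep in keep_for_option.items()
}
print(reported)
# {'exclude_first': [1, 2], 'exclude_last': [0, 1], 'all': [0, 1, 2]}
```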


# 2. Handling errors

## 2.1 Missing and required columns

### 2.1.1 Handling dataframe columns not in the schema

By default, columns that aren’t specified in the schema aren’t checked. If you want to check that the DataFrame only contains columns in the schema, specify `strict=True`:

In [18]:
import pandas as pd
import pandera.pandas as pa


schema = pa.DataFrameSchema(
    {"column1": pa.Column(int)},
    strict=True,
)

df = pd.DataFrame({"column2": [1, 2, 3]})

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)

column 'column2' not in DataFrameSchema {'column1': <Schema Column(name=column1, type=DataType(int64))>}


Alternatively, if your DataFrame contains columns that are not in the schema, and you would like these to be dropped on validation, you can specify `strict='filter'`.

In [19]:
import pandas as pd
import pandera.pandas as pa


df = pd.DataFrame({"column1": ["drop", "me"], "column2": ["keep", "me"]})
schema = pa.DataFrameSchema({"column2": pa.Column(str)}, strict='filter')

schema.validate(df)

   column2
0     keep
1       me


### 2.1.2 Adding missing columns

When loading raw data into a form that’s ready for data processing, it’s often useful to have guarantees that the columns specified in the schema are present, even if they’re missing from the raw data. This is where it’s useful to specify `add_missing_columns=True` in your schema definition.

When you call `schema.validate(data)`, the schema adds any missing columns to the dataframe, filling them with the column-level `default` value if one is supplied, or with `NaN` if the column is nullable.

In [20]:
import pandas as pd
import pandera.pandas as pa

schema = pa.DataFrameSchema(
    columns={
        "a": pa.Column(int),
        "b": pa.Column(int, default=1),
        "c": pa.Column(float, nullable=True),
    },
    add_missing_columns=True,
    coerce=True,
)
df = pd.DataFrame({"a": [1, 2, 3]})
schema.validate(df)

   a  b   c
0  1  1 NaN
1  2  1 NaN
2  3  1 NaN


## 2.2 Error reports

If the dataframe is validated lazily with `lazy=True`, errors will be aggregated into an error report. The error report groups `DATA` and `SCHEMA` errors to give an overview of error sources within a dataframe. Take the following schema and dataframe:

In [21]:
schema = pa.DataFrameSchema(
    {"id": pa.Column(int, pa.Check.lt(10))},
    name="MySchema",
    strict=True,
)

df = pd.DataFrame({"id": [1, None, 30], "extra_column": [1, 2, 3]})

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)

{
    "SCHEMA": {
        "COLUMN_NOT_IN_SCHEMA": [
            {
                "schema": "MySchema",
                "column": "MySchema",
                "check": "column_in_schema",
                "error": "column 'extra_column' not in DataFrameSchema {'id': <Schema Column(name=id, type=DataType(int64))>}"
            }
        ],
        "SERIES_CONTAINS_NULLS": [
            {
                "schema": "MySchema",
                "column": "id",
                "check": "not_nullable",
                "error": "non-nullable series 'id' contains null values:1   NaNName: id, dtype: float64"
            }
        ],
        "WRONG_DATATYPE": [
            {
                "schema": "MySchema",
                "column": "id",
                "check": "dtype('int64')",
                "error": "expected series 'id' to have type int64, got float64"
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "MySchema",
          

Validating the above dataframe produces data-level errors, namely the `id` column containing a value that fails a check, as well as schema-level errors, such as the extra column and the `None` value.

This error report can be useful for debugging, with each item in the various lists corresponding to a `SchemaError`.

# 3. Synthesize data

`pandera` provides a utility for generating synthetic data purely from pandera schema or schema component objects. Under the hood, the schema metadata is collected to create a data-generating strategy using [hypothesis](https://hypothesis.readthedocs.io/en/latest/), which is a property-based testing library.

## 3.1 Basic usage

Once you’ve defined a schema, it’s easy to generate examples:

In [22]:
import pandera.pandas as pa

schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int, pa.Check.eq(10)),
        "column2": pa.Column(float, pa.Check.eq(0.25)),
        "column3": pa.Column(str, pa.Check.eq("foo")),
        "column4": pa.Column(str, pa.Check.str_matches(r"[a-z]+\.[a-z]+@axa\-direct\.com"))
    }
)
schema.example(size=3)

   column1  column2 column3                    column4
0       10     0.25     foo      k.qfkx@axa-direct.com
1       10     0.25     foo   yprw.wsgw@axa-direct.com
2       10     0.25     foo rlcfk.iavdc@axa-direct.com


## 3.2 Usage in unit tests

The `example` method is available for all schemas and schema components, and is primarily meant to be used interactively. It could be used in a script to generate test cases, but `hypothesis` recommends against this; instead, use the `strategy` method to create a `hypothesis` strategy that can be used in `pytest` unit tests.

In [23]:
import hypothesis
import pandera.pandas as pa

schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int, pa.Check.eq(10)),
        "column2": pa.Column(float, pa.Check.eq(0.25)),
        "column3": pa.Column(str, pa.Check.eq("foo")),
    }
)


def processing_fn(df):
    return df.assign(column4=df.column1 * df.column2)

@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
    result = processing_fn(dataframe)
    assert "column4" in result

The above example is trivial, but you get the idea! Schema objects can create a `strategy` that can then be collected by a [pytest](https://docs.pytest.org/en/latest/) runner. We could also run the tests explicitly ourselves, or run it as a `unittest.TestCase`. For more information on testing with hypothesis, see the [hypothesis quick start guide](https://hypothesis.readthedocs.io/en/latest/quickstart.html#running-tests).

A more practical example involves using [schema transformations](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#dataframe-schema-transformations). We can modify the function above to make sure that `processing_fn` actually outputs the correct result:

In [24]:
out_schema = schema.add_columns({"column4": pa.Column(float)})

@pa.check_output(out_schema)
def processing_fn(df):
    return df.assign(column4=df.column1 * df.column2)

@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
    processing_fn(dataframe)

# 4. Supported features by DataFrame backend

Currently, pandera provides four validation backends: `pandas`, `pyspark`, `polars`, and `ibis`. The table below shows which of pandera’s features are available for the [supported dataframe libraries](https://pandera.readthedocs.io/en/stable/supported_libraries.html#dataframe-libraries):

![Supported features](../docs/pandera-libraries-compatibility.png)