# Explain Pandera

Pandera is an open-source project from Union.ai that provides a flexible and expressive API for validating dataframe-like objects. Its goal is to make data processing pipelines more readable and robust with statistically typed dataframes.

Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. With pandera, you can:
- Define a schema once and use it to validate different dataframe types including pandas, polars, dask, modin, ibis, and pyspark.
- Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.
- Perform more complex statistical validation like hypothesis testing.
- Parse data to standardize the preprocessing steps needed to produce valid data.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
- Define dataframe models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
- Synthesize data from schema objects for property-based testing with pandas data structures.
- Lazily validate dataframes so that all validation rules are executed before raising an error.
- Integrate with a rich ecosystem of Python tools like pydantic, FastAPI, and mypy.

Pandera supports multiple dataframe libraries, including pandas, polars, pyspark, and ibis.

# 1. Imports

## 1.1 Packages

## 1.2 Options

# 2. Explain Pandera

Pandera is a good way to validate the quality of your data. We highly recommend using it before you train your model.

Let's start with a first example: validating a dataframe with three columns, each of a different type (int, float, and string). To do so, you create a `pa.DataFrameSchema()` and declare each column in it with `pa.Column(<type>)`.

In [7]:
import pandas as pd
import pandera.pandas as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})

schema = pa.DataFrameSchema({
    "column1": pa.Column(int),
    "column2": pa.Column(float),
    "column3": pa.Column(str),
})

validated_df = schema.validate(df)
print(validated_df)

   column1  column2 column3
0        1      1.1       a
1        2      1.2       b
2        3      1.3       c


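If a column has the wrong type, validation raises a `SchemaError`. Here the `0.4` in `column1` makes pandas infer `float64` for that column, so the `int` check fails: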

In [8]:
df_bug = pd.DataFrame({
    "column1": [1, 2, 0.4],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})

try:
    schema.validate(df_bug)
except pa.errors.SchemaError as e:
    print(e)

expected series 'column1' to have type int64, got float64


You can attach further rules to each column by passing a single `pa.Check`, or a list of checks, to `pa.Column()`:

In [9]:
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
    "column2": pa.Column(float, pa.Check.lt(10)),
    "column3": pa.Column(
        str,
        [
            pa.Check.isin([*"abc"]),
            pa.Check(lambda series: series.str.len() == 1),
        ]
    ),
})

In [10]:
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})

validated_df = schema.validate(df)
print(validated_df)

   column1  column2 column3
0        1      1.1       a
1        2      1.2       b
2        3      1.3       c


In [11]:
df_bug = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 12.3],
    "column3": ["a", "b", "c"],
})

try:
    schema.validate(df_bug)
except pa.errors.SchemaError as e:
    print(e)

Column 'column2' failed element-wise validator number 0: less_than(10) failure cases: 12.3


If the dataframe does not pass validation, pandera reports which check failed and the values that caused the failure. You can also pass an `error` argument to `pa.Check` to supply a custom error message.

For example, when a custom range check is violated:

In [13]:
simple_schema = pa.DataFrameSchema({
    "column1": pa.Column(
        int,
        pa.Check(
            lambda x: 0 <= x <= 10,
            element_wise=True,
            error="range checker [0, 10]"
        )
    )
})

df_bug = pd.DataFrame({
    "column1": [-20, 5, 10, 30],
})

try:
    simple_schema.validate(df_bug)
except pa.errors.SchemaError as e:
    print(e)

Column 'column1' failed element-wise validator number 0: <Check <lambda>: range checker [0, 10]> failure cases: -20, 30
