# What's Pandera?

Pandera is an open source framework for precision data testing, built for data scientists and ML engineers.

In this notebook, you'll learn how to:

> 1. Define Pandera schemas for your dataframe-like objects 📦
> 2. Integrate them seamlessly into your data pipelines 🔀
> 3. Ensure your data and data transformation functions are correct ✅

▶️ Follow the tutorial and run the code cells below to get a sense of how Pandera works and how its error reporting system can provide direct insight into what specific data values caused the error.

First, install pandera:

In [1]:
!pip install pandera



## What are Schemas?

Dataframes and dataframe-like objects are structures with expected properties or rules for the data contained inside. Most of these rules are known by the designers or analysts of the data but are not defined by the dataframe object itself, which means that some data may not follow the expected rules.

In `pandera` we can explicitly define these rules in schemas, which specify types for dataframe-like objects, and then use these schemas to assert properties about data at runtime and try parsing it into a desired state.

Let's use a practical example. Suppose you're working with a transactions dataset of grocery `item`s and their associated `price`s. With these two categories we can make assumptions about the data and datatypes we expect in these fields. There may be a list of specific `item`s that are valid, or we can assume that any `price` should be greater than 0. We can state our assumptions about the data in `pandera` by writing a `Schema`, which can be defined in a `class`, as shown below.

In [2]:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class Schema(pa.DataFrameModel):
    item: Series[str] = pa.Field(isin=["apple", "orange"], coerce=True)
    price: Series[float] = pa.Field(gt=0, coerce=True)

You can see that the `Schema` class inherits from [`pandera.DataFrameModel`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.model.DataFrameModel.html#pandera.api.pandas.model.DataFrameModel),
and defines two fields: `item` and `price`. For each of these fields, `pandera` provides a flexible and concise way to specify the expected datatype: `Series[str]` for `item` and `Series[float]` for `price`.

Other properties can also be set. For this example, we assumed that there might be a specific list of `item`s that are valid, and that a `price` should be greater than 0. These properties are defined in the `Schema`. In the code above, we use set membership for the `item` field with `isin=...` to specify valid options from a list, and we use a value range for the `price` field with `gt=...` to specify a valid numeric range. These are only a couple of examples of the [property methods](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html#pandera.api.checks.Check) that can be asserted.

Setting `coerce=True` will cause pandera to parse the columns into the expected datatypes, giving you the ability to ensure that data flowing through your pipeline is of the expected type.

## Runtime DataFrame Value Checks

We can now use the `Schema` class to validate data passing through a function. In the example below, consider the function `add_sales_tax`, which will take the hypothetical grocery data and calculate the sales tax from the `price`, returning a new dataframe with the additional information in a new column.

You can see why data validation would be important here. If the value in the `price` field is not the right datatype or is not greater than 0, as specified by the `Schema`, it will cause errors or further corrupt the data in any downstream processes.

In [7]:
@pa.check_types(lazy=True)
def add_sales_tax(data: DataFrame[Schema]):
    # creates a new column in the data frame that calculates prices after sales tax
    data['after_tax'] = data['price'] + (data['price'] * .06)
    return data

As you will see when you run the code below, using the `@pa.check_types` [function decorator](https://pandera.readthedocs.io/en/stable/reference/decorators.html#decorators) and specifying the `data: DataFrame[Schema]` annotation in the function parameter will ensure that dataframe inputs are validated at runtime before being passed into the `add_sales_tax` function body.

By providing the `lazy=True` option in the `check_types` decorator, we're telling `pandera` to validate all field properties before raising a `SchemaErrors` exception.

With valid data, calling `add_sales_tax` shouldn't be a problem:

In [8]:
valid_data = pd.DataFrame.from_records([
    {"item": "apple", "price": 0.5},
    {"item": "orange", "price": 0.75}
])

add_sales_tax(valid_data)

|  | item | price | after_tax |
|---|---|---|---|
| 0 | apple | 0.5 | 0.53 |
| 1 | orange | 0.75 | 0.795 |


With invalid data, however, `pandera` will raise a `SchemaErrors` exception:

In [9]:
invalid_data = pd.DataFrame.from_records([
    {"item": "applee", "price": 0.5},
    {"item": "orange", "price": -1000}
])

try:
    add_sales_tax(invalid_data)
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

|  | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 0 | Column | item | isin(['apple', 'orange']) | 0 | applee | 0 |
| 1 | Column | price | greater_than(0) | 0 | -1000.0 | 1 |


The `exc.failure_cases` attribute in our `except` clause points to a dataframe that contains metadata about the failure cases that occurred when validating the data.

We can see that row index `0` had a failure case in the misspelling `applee` in the `item` column, which failed the `isin(['apple', 'orange'])` check for that field.

We can also see the row index `1` had a failure case of `-1000.0` in the `price` column, which failed the `gt=0` check for that field.

## In-line Validation

You can also use `Schema` classes to validate data in-line by calling the `validate` method directly, rather than through a decorated function.

In [10]:
Schema.validate(valid_data)

|  | item | price |
|---|---|---|
| 0 | apple | 0.5 |
| 1 | orange | 0.75 |


This gives you ultimate flexibility on where you want to validate data in your code.

## Schemas as Data Quality Checkpoints

With `pandera`, you can use inheritance to indicate changes in the contents of a dataframe that some function has to implement. 

In the grocery example, let's assume we want to set an expiry date for each `item` in our list, and we want to validate the data both before and after adding this new field. This means our schema needs to be different at different points in the program. To accomplish this, we first build a second class that inherits from the original `Schema` class, as shown below.

In [11]:
class Schema(pa.DataFrameModel):
    item: Series[str] = pa.Field(isin=["apple", "orange"], coerce=True)
    price: Series[float] = pa.Field(gt=0, coerce=True)

class TransformedSchema(Schema):
    expiry: Series[pd.Timestamp] = pa.Field(coerce=True)

`TransformedSchema` will inherit the class attributes defined in `Schema`, with an additional `expiry` datetime field. In this case, we are asserting only a datatype of `Timestamp` on the `expiry` field.

Now we can implement a function that performs the transformation needed to connect these two schemas. 

The `transform_data` function below takes a dataframe object and a list of `datetime`s and returns the input dataframe with a new column for `expiry` populated with the values of the `datetime` list argument.

In [12]:
from datetime import datetime
from typing import List


@pa.check_types(lazy=True)
def transform_data(
    data: DataFrame[Schema],
    expiry: List[datetime],
) -> DataFrame[TransformedSchema]:
    return data.assign(expiry=expiry)


transform_data(valid_data, [datetime.now()] * valid_data.shape[0])

|  | item | price | expiry |
|---|---|---|---|
| 0 | apple | 0.5 | 2024-05-31 19:00:40.086090 |
| 1 | orange | 0.75 | 2024-05-31 19:00:40.086090 |


Now every time we call the `transform_data` function, not only is the `data` input argument validated with the `Schema`, but the output dataframe is validated against `TransformedSchema`.

In addition to catching value errors, this also allows you to catch bugs in your data transformation code more easily. Observe the buggy code below:

In [13]:
@pa.check_types(lazy=True)
def transform_data(
    data: DataFrame[Schema],
    expiry: List[datetime],
) -> DataFrame[TransformedSchema]:
    return data.assign(expiryy=expiry)  # typo bug: 🐛


try:
    transform_data(valid_data, [datetime.now()] * valid_data.shape[0])
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

|  | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 0 | DataFrameSchema | TransformedSchema | column_in_dataframe |  | expiry |  |


The `failure_cases` dataframe is telling us in the `check` column that the core `column_in_dataframe` check is failing because the `expiry` column is not present in the output dataframe.

Observe how the `schema_context` and `column` values in this `failure_cases` dataframe compare with those of the invalid data in the above examples. This shows the versatility of error catching with `pandera` schemas.

## Bonus: The Object-based API

In the examples above, we've talked about dataframe schemas using the `DataFrameModel` or class-based API. However, `pandera` also provides an object-based API for defining dataframe schemas.

While the [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) class-based API is closer in spirit to `dataclasses` and `pydantic`, which use Python classes to express complex data types, the
object-based [`DataFrameSchema`](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html) API enables you to transform your schema definition on the fly.

Consider the difference between the class-based API and the equivalent object-based API syntax below:

In [None]:
# class-based API
class Schema(pa.DataFrameModel):
    item: Series[str] = pa.Field(isin=["apple", "orange"], coerce=True)
    price: Series[float] = pa.Field(gt=0, coerce=True)

# object-based API
schema = pa.DataFrameSchema({
    "item": pa.Column(str, pa.Check.isin(["apple", "orange"]), coerce=True),
    "price": pa.Column(float, pa.Check.gt(0), coerce=True),
})

In the object-based API, you can add, remove, and update columns as you want, just as you would to a standard dataframe object:

In [None]:
# each method returns a new schema and leaves `schema` itself unchanged
transformed_schema = schema.add_columns({"expiry": pa.Column(pd.Timestamp)})
schema.remove_columns(["item"])  # returns a copy without the "item" column
schema.update_column("price", dtype=int)  # returns a copy with an integer "price" column

You can use `DataFrameSchema`s to validate data just like `DataFrameModel` subclasses:

In [None]:
schema.validate(valid_data)

And, similar to the `check_types` decorator, you can use the `check_io` decorator to validate inputs and outputs of your functions.

In [None]:
@pa.check_io(data=schema, out=transformed_schema)
def fn(data, expiry):
    return data.assign(expiry=expiry)


fn(valid_data, [datetime.now()] * valid_data.shape[0])

### When to Use `DataFrameSchema` vs. `DataFrameModel`

Practically speaking, the two ways of writing pandera schemas are completely equivalent, and using one over the other boils down to a few factors:

1. Preference: some developers are more comfortable with one syntax over the other.
2. The class-based API unlocks static type-checking of data via [mypy](https://pandera.readthedocs.io/en/stable/mypy_integration.html)
   and integrates well with Python's type hinting system.
3. The object-based API is good if you want to dynamically update your schema definition at runtime.

At the end of the day, you can use them interchangeably in your applications.

### What's Next?

This notebook gave you a brief intro to Pandera, but this framework has a lot more to offer to help you test your data:

- [Create in-line custom checks](https://pandera.readthedocs.io/en/stable/checks.html)
- [Register custom checks](https://pandera.readthedocs.io/en/stable/extensions.html)
- [Define statistical hypothesis tests](https://pandera.readthedocs.io/en/stable/hypothesis.html)
- [Bootstrap schemas with data profiling](https://pandera.readthedocs.io/en/stable/schema_inference.html)
- [Synthesize fake data for unit testing](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html)
- [Scale Validation with Distributed DataFrames](https://pandera.readthedocs.io/en/stable/supported_libraries.html#)
- [Integrate with the Python Ecosystem](https://pandera.readthedocs.io/en/stable/integrations.html)