I suggest allowing schemas to be passed as YAML files. That way it wouldn't be necessary to hard-code all the checks when using pandera; instead they would be defined in the YAML schema.
The YAML format would need to be designed carefully in this case to offer optimal flexibility. I could imagine something like this:
YAML schema definition:
```yaml
# General section for dataframe-wide checks
dataframe:
  - min_length: 1000

# Checks per column
columns:
  column1:
    # List of checks, each one a dictionary;
    # this allows parametrization
    - type: int
    - max: 10
    - allow_null: False
  column2:
    - type: float
    - max: -1.2
  column3:
    - type: str
    - match: "^value_"
    # Allow custom functions (here with arguments)
    - custom_function: split_shape
      split_char: "_"
      expected_splits: 2
```
```python
def split_shape(s, split_char, expected_splits):
    """Custom check: each value splits into the expected number of parts."""
    return s.str.split(split_char, expand=True).shape[1] == expected_splits


schema = DataFrameSchema.from_yaml(
    path="path_to_yaml",
    custom_functions=[split_shape],
)
validated_df = schema.validate(df)
```
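To make the proposal concrete, here is a minimal sketch of how such a loader could dispatch the parsed check list to callables. This is purely hypothetical: `build_checks`, `BUILTIN_CHECKS`, and the check keywords (`type`, `max`, `allow_null`, `match`, `custom_function`) are assumptions from the YAML example above, not pandera's actual API.

```python
import pandas as pd

# Hypothetical built-in checks; each takes a Series and the YAML value.
BUILTIN_CHECKS = {
    "type": lambda s, v: str(s.dtype).startswith(v),
    "max": lambda s, v: (s <= v).all(),
    "allow_null": lambda s, v: bool(v) or s.notnull().all(),
    "match": lambda s, v: s.str.match(v).all(),
}


def build_checks(column_spec, custom_functions=None):
    """Turn a parsed list of check dicts into callables over a Series."""
    # Only explicitly passed functions are resolvable from the YAML side.
    registry = {fn.__name__: fn for fn in (custom_functions or [])}
    checks = []
    for entry in column_spec:
        entry = dict(entry)
        if "custom_function" in entry:
            fn = registry[entry.pop("custom_function")]
            # Remaining keys become keyword arguments for the custom check.
            checks.append(lambda s, fn=fn, kw=entry: fn(s, **kw))
        else:
            (key, value), = entry.items()
            check = BUILTIN_CHECKS[key]
            checks.append(lambda s, check=check, v=value: check(s, v))
    return checks
```

A `DataFrameSchema.from_yaml` implementation could then run each column's callables against the corresponding Series and collect failures.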
As we probably don't want arbitrary Python code to be executed from the YAML file, custom functions would have to be passed in explicitly (as with the `custom_functions` argument above) rather than imported by name from the YAML.
Yes, sounds like a good workaround. And you're totally right that the API should be kept small and neat and not be cluttered with too many capabilities.