Schema definitions in yaml #91

Open
chr1st1ank opened this issue Aug 19, 2019 · 2 comments
chr1st1ank (Contributor) commented Aug 19, 2019

I suggest allowing schemas to be passed as YAML files. That way it wouldn't be necessary to hardcode all the checks in Python when using pandera; instead they would be defined in the YAML schema.
There are two use cases I see:

  • Validating dataframes in a CI/CD pipeline. A validation step could then be re-configured without changing the actual Python code, and the same Python code could check multiple different dataframes against their expected schemas just by swapping YAML files.
  • Pandera could offer a (simple) command-line tool which reads data in common formats, such as pickle or JSON, and checks it directly against a schema specified in a file (see the example invocation below).
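
For the second use case, an invocation might look something like this (the command name and flags are purely hypothetical, nothing like this exists yet):

pandera-validate --schema my_schema.yaml --input data.json --format json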

The YAML format needs to be designed carefully to offer optimal flexibility. I could imagine something like this:

YAML schema definition:

# General section for dataframe wide checks
dataframe:
  - min_length: 1000
# Checks per column
columns:
  column1:
    # List of checks, each one is a dictionary
    # this allows parametrization
    - type: int
    - max: 10
    - allow_null: False
  column2:
    - type: float
    - max: -1.2
  column3:
    - type: str
    - match: "^value_"
    # Allow custom functions (here with arguments)
    - custom_function: split_shape
      split_char: "_"
      expected_splits: 2      

Python code:

def split_shape(s, split_char, expected_splits):
    """Custom check: each value splits into the expected number of parts."""
    return s.str.split(split_char, expand=True).shape[1] == expected_splits

schema = DataFrameSchema.from_yaml(
    path="path_to_yaml",
    custom_functions=[split_shape],
)

validated_df = schema.validate(df)

As we probably don't want arbitrary Python code to be executable from the YAML file via the !!python syntax, I suggest we instead go with a mix of built-in checks and the option to register user-defined functions, as in the example above.
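
To make the built-in/custom split concrete, here is a rough sketch of how such a loader could dispatch checks. Everything in it (the BUILTIN_CHECKS mapping, the key names, build_column_checks) is hypothetical and only illustrates the idea, not pandera's actual API:

# Hypothetical mapping from YAML keys to built-in check factories.
BUILTIN_CHECKS = {
    "max": lambda value: (lambda s: (s <= value).all()),
    "match": lambda pattern: (lambda s: s.str.match(pattern).all()),
}

def build_column_checks(check_list, custom_functions):
    """Turn a list of YAML check dicts into check callables (illustrative only)."""
    registry = {f.__name__: f for f in custom_functions}
    checks = []
    for entry in check_list:
        entry = dict(entry)  # copy so we can pop keys
        if "custom_function" in entry:
            func = registry[entry.pop("custom_function")]
            # The remaining keys become keyword arguments of the custom function.
            checks.append(lambda s, f=func, kw=entry: f(s, **kw))
        else:
            for key, value in entry.items():
                if key in BUILTIN_CHECKS:
                    checks.append(BUILTIN_CHECKS[key](value))
                # Keys like "type" or "allow_null" would be handled by the
                # column definition itself rather than by a check.
    return checks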

cosmicBboy added the proposal label Aug 19, 2019

mastersplinter (Collaborator) commented Aug 19, 2019

@chr1st1ank great idea.

As an interim measure, one option I've used is putting the schemas in a .py file and importing that file. Not ideal, but keeps modules a little tidier and enables reuse.
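
For illustration, such a module could look roughly like this (assuming pandera's DataFrameSchema/Column/Check API; the schema name and the checks are made up):

# schemas.py -- schema definitions live here, separate from application code
import pandera as pa
from pandera import Check, Column, DataFrameSchema

product_schema = DataFrameSchema({
    "column1": Column(pa.Int, Check(lambda s: s <= 10)),
    "column2": Column(pa.Float, Check(lambda s: s <= -1.2)),
})

# application code
from schemas import product_schema

validated_df = product_schema.validate(df)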

chr1st1ank (Contributor, Author) commented Aug 20, 2019

Yes, that sounds like a good workaround. And you're totally right that the API should be kept small and neat and not be cluttered with too many capabilities.
I don't yet have a full overview of what pandera offers, but the YAML format should ideally be designed so that everything pandera offers for validation can be expressed with it.
