Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add built-in Checks for common operations #74

Open
cosmicBboy opened this issue Jun 12, 2019 · 8 comments
Open

Add built-in Checks for common operations #74

cosmicBboy opened this issue Jun 12, 2019 · 8 comments
Labels

Comments

@cosmicBboy
Copy link
Collaborator

@cosmicBboy cosmicBboy commented Jun 12, 2019

pandera offers a lot of flexibility in terms of the kinds of assertions you can make about a dataframe. However, it would be useful to have built-in checks for common operations. This issue is a proposal to define what those common operations are.

These should all use vectorized pandas operations.

For dtypes that support pandas comparison operators >, >=, <, <=, ==, !=

  • greater than scalar x
  • greater than or equal to scalar x
  • equal to scalar x
  • not equal to scalar x
  • less than scalar x
  • is in range x - y (inclusive or exclusive)
  • is sorted, ascending or descending

For dtypes that support checks for set membership

  • is in list-like/set s
  • is not in list-like/set s

For datetime dtypes

  • check for date format "<some_format>"
  • (comparison operators and range checks also apply here)

For string dtype

  • column matches "regex" pattern
  • no trailing whitespace
  • no leading whitespace
  • column contains "string"
  • column starts with "string"
  • column ends with "string"
@chr1st1ank

This comment has been minimized.

Copy link
Contributor

@chr1st1ank chr1st1ank commented Aug 21, 2019

This is definitely a good idea, especially if we want to import schemas from an external format like yaml.

I have drafted some already to combine them with the yaml reader which I would also put to discussion in the yaml topic (#91 ).
Looks like this:

class ValueRange(pandera.Check):
    """Check whether values are within a certain range."""
    def __init__(self, min_value=None, max_value=None):
        """Create a new ValueRange check object.

        :param min_value: Allowed minimum value. Should be a type comparable to
            the type of the pandas series to be validated (e.g. a numerical type
            for float or int and a datetime for datetime) .
        :param max_value: Allowed maximum value. Should be a type comparable to
            the type of the pandas series to be validated (e.g. a numerical type
            for float or int and a datetime for datetime).
        """
        super().__init__(fn=self.check)
        self.min_value = min_value
        self.max_value = max_value

    def check(self, series: pd.Series) -> pd.Series:
        """Compare the values of the series to the predefined limits.

        :returns pd.Series with the comparison result as True or False
        """
        if self.min_value is not None:
            bool_series = series >= self.min_value
        else:
            bool_series = pd.Series(data=True, index=series.index)

        if self.max_value is not None:
            return bool_series & (series <= self.max_value)

        return bool_series


class StringMatch(pandera.Check):
    """Check if strings in a pandas.Series match a given regular expression."""
    def __init__(self, regex: str):
        """Create a new StringMatch object based on the given regex.

        :param regex: Regular expression which must be matched
        """
        super().__init__(fn=self.match)
        self.regex = regex

    def match(self, series: pd.Series) -> pd.Series:
        """Check if all strings in the series match the regular expression.

        :returns pd.Series with the comparison result as True or False
        """
        return series.str.match(self.regex)


class StringLength(pandera.Check):
    """Check if the length of strings is within a specified range"""

    def __init__(self, min_len: int = None, max_len: int = None):
        """Create a new StringLength object with a given range

        :param min_len: Minimum length of strings (default: no minimum)
        :param max_len: Maximu length of strings (default: no maximum)
        """
        super().__init__(fn=self.check_string_length)
        self.min_len = min_len
        self.max_len = max_len

    def check_string_length(self, series: pd.Series) -> pd.Series:
        """Check if all strings does have an acceptable length

        :returns pd.Series with the validation result as True or False
        """
        if self.min_len is not None:
            bool_series = series.str.len() >= self.min_len
        else:
            bool_series = pd.Series(data=True, index=series.index)

        if self.max_len is not None:
            return bool_series & (series.str.len() <= self.max_len)
        return bool_series

Here is how they are used then:

    schema = pandera.SeriesSchema(
        pandas_dtype=series.dtype.name, nullable=True,
        checks=[checks.ValueRange(min_value=min_val, max_value=max_val)]
    )

    schema = pandera.SeriesSchema(
        pandas_dtype=pandera.String,
        nullable=False,
        checks=[checks.StringMatch(regex=pattern)]
    )

I think we shouldn't overdo it. For example the regex check covers all other string checks you listed above - is the user is willing to express them as regex.

What's your feeling about these checks?

Edit: To make more transparent what I am up to I created a draft pull request which is referenced below.

@cosmicBboy

This comment has been minimized.

Copy link
Collaborator Author

@cosmicBboy cosmicBboy commented Aug 21, 2019

One API design decision I'd like to discuss for this issue is how to implement built-in Check and Hypothesis functions. I'll propose two options, knowing there are probably more:

  1. For each built-in check/hypothesis, add a new class method to the corresponding class definition, much like the two sample ttest and one sample ttest.
  2. subclass the Check/Hypothesis classes to implement built-in checks/hypotheses.

Note that this issue is about how pandera implements built-in checks, not how users can extend the Check and Hypothesis for customized checks.

I'm leaning toward option (1), for the following reasons:

  • prevents proliferation of classes, which tends to yield a lot of boilerplate class definition code
  • encourages simplicity in check function implementation (complex check functions should be generally not be built-in methods) and extension (simply supply a function to Check)
  • Check/Hypothesis as an entrypoint to the built-in methods makes the API easier to use, it minimizes the number of high-level concepts that users have to learn (schemas, checks, hypotheses)

An example of (1), besides the two Hypothesis methods linked above, would be:

class Check(object):
    ...

    @classmethod
    def range(cls, min, max):
        return cls(
            fn=lambda s: (min <= s) & (s <= max),
            error="failed range check between %s and %s" % (min, max)
        )

For option (2), the subclassed equivalent would be:

class Range(Check):

    def __init__(self, min, max):
        self.min = min
        self.max = max
        super(Range, self).__init__(
            fn=self.range,
            error="failed range check between %s and %s" % (min, max)
        )

    def range(self, series):
        return (self.min <= series) & (series <= self.max)

Of course, the differences in the two examples might seem trivial and I think this boils down to taste... indeed, I originally designed the Check class for flexibility and extensibility via functions, not via a method that subclasses are supposed to implement.

My rationale behind this is that in pandera, classes should implement the key high-level concepts that map onto pandas data structure elements, whereas the specific checks that validate those data are either defined by users, or common checks should be extremely simple and have a clear entrypoint.

Thoughts @mastersplinter @chr1st1ank?

@chr1st1ank

This comment has been minimized.

Copy link
Contributor

@chr1st1ank chr1st1ank commented Aug 21, 2019

Ok, I get it. So there would only be the Check class with factory methods injecting the fn argument. Not that bad...
I have to think a moment what this leads to. We get fewer new classes but instead one rather huge class (pandas style, so to say...).
I am very used to class hierarchies, that's why I came up with the other approach at first.
For the user not that different after all.

I think the one or the other way we should separate this from the core code of the Check class. Maybe have something like a standard_checks module with either the factory fumctions or the subclasses. Both only use the public interface of Check.

@mastersplinter

This comment has been minimized.

Copy link
Collaborator

@mastersplinter mastersplinter commented Aug 21, 2019

I like the class method approach for brevity.

I like @chr1st1ank's suggestion of having the factory methods in a module to keep the Check itself tidy (it's already a bit big)

@cosmicBboy

This comment has been minimized.

Copy link
Collaborator Author

@cosmicBboy cosmicBboy commented Aug 31, 2019

@mastersplinter @chr1st1ank now that #98 is merged, I think there are a few entangled issues that make sense to raise here:

  1. The Check class is responsible for too much, pieces of logic need to be delegated elsewhere:
  1. Define a CheckBase class with core check logic.
  2. HypothesisBase should subclass CheckBase
  3. Check subclasses CheckBase and serves as entrypoint class for built-in checks via classmethod approach
  4. Hypothesis should subclass HypothesisBase and servers as entrypoint class for built-in hypotheses via classmethod approach.
@mastersplinter

This comment has been minimized.

Copy link
Collaborator

@mastersplinter mastersplinter commented Sep 4, 2019

@cosmicBboy So we'd have this hierarchy?

  • CheckBase
    • Check
    • HypothesisBase
      • Hypothesis

What're you thinking needs to be in HypothesisBase rather than just having something like this?:

  • CheckBase
    • Check
    • Hypothesis
@cosmicBboy

This comment has been minimized.

Copy link
Collaborator Author

@cosmicBboy cosmicBboy commented Sep 4, 2019

I was thinking that HypothesisBase would define core logic and methods that distinguish a Check from a Hypothesis, e.g. is_one_sample_test, hypothesis_check, and relationships, and the Hypothesis class would be the built-in hypothesis factory. This would make the relationship between CheckBase -> Check isomorphic to HypothesisBase -> Hypothesis.

However, I'm not wedded to this structure, as @chr1st1ank's proposal of built-in checks being subclasses is also a reasonable path. I think once the code is cleaned up a little bit, (#96, #99) it might be easier to make a decision

@chr1st1ank

This comment has been minimized.

Copy link
Contributor

@chr1st1ank chr1st1ank commented Sep 5, 2019

Ok, let's list what the classes currently seem to do:

  • Check:
    • Prepare a dataframe or series for a test (e.g. groupby)
    • Apply the test function
    • Find violations
    • Report violations
  • Hypothesis:
    • A little additional dataframe/series preparation
    • Create Hypothesis objects with special built-in test functions

What we are trying now is to add built-in test functions also for the general Check class. There I also like the functional approach suggested by @cosmicBboy. Note that this is also done already for the Hypothesis checks. The inheritance hierarchy we discuss here seem to point in the opposite direction, however.

My suggestion is to stick to the current rather flat class structure because more inheritance levels seem to complicate things further:
Let's try to design all built-in checks in a manner that they can only use the public API of the Check class to create concrete checks. The Hypothesis subclass seems necessary to change the behaviour of some methods of the Check class, because this can't be done with the init function. In other cases we only add a classmethod generating a specialised variant of Check:

  • Check
    • classmethod smaller_than()
    • classmethod string_matches()
    • ...
    • Hypothesis
      - classmethod one_sample_ttest()
      - classmethod two_sample_ttest()
      - ...

As we are already planning to move functionality out of the Check function to other places, there might not be too much code left to fill additional inheritance levels anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.