Add built-in Checks for common operations #74
These should all use vectorized pandas operations.
For dtypes that support pandas comparison operators
This is definitely a good idea, especially if we want to import schemas from an external format like yaml.
I have already drafted some checks to combine with the yaml reader, which I would also put up for discussion in the yaml topic (#91).
    import pandas as pd
    import pandera


    class ValueRange(pandera.Check):
        """Check whether values are within a certain range."""

        def __init__(self, min_value=None, max_value=None):
            """Create a new ValueRange check object.

            :param min_value: Allowed minimum value. Should be a type comparable
                to the type of the pandas series to be validated (e.g. a
                numerical type for float or int and a datetime for datetime).
            :param max_value: Allowed maximum value. Should be a type comparable
                to the type of the pandas series to be validated (e.g. a
                numerical type for float or int and a datetime for datetime).
            """
            super().__init__(fn=self.check)
            self.min_value = min_value
            self.max_value = max_value

        def check(self, series: pd.Series) -> pd.Series:
            """Compare the values of the series to the predefined limits.

            :returns pd.Series with the comparison result as True or False
            """
            if self.min_value is not None:
                bool_series = series >= self.min_value
            else:
                # No lower bound: every value passes this half of the check.
                bool_series = pd.Series(data=True, index=series.index)

            if self.max_value is not None:
                return bool_series & (series <= self.max_value)

            return bool_series


    class StringMatch(pandera.Check):
        """Check if strings in a pandas.Series match a given regular expression."""

        def __init__(self, regex: str):
            """Create a new StringMatch object based on the given regex.

            :param regex: Regular expression which must be matched
            """
            super().__init__(fn=self.match)
            self.regex = regex

        def match(self, series: pd.Series) -> pd.Series:
            """Check if all strings in the series match the regular expression.

            :returns pd.Series with the comparison result as True or False
            """
            return series.str.match(self.regex)


    class StringLength(pandera.Check):
        """Check if the length of strings is within a specified range."""

        def __init__(self, min_len: int = None, max_len: int = None):
            """Create a new StringLength object with a given range.

            :param min_len: Minimum length of strings (default: no minimum)
            :param max_len: Maximum length of strings (default: no maximum)
            """
            super().__init__(fn=self.check_string_length)
            self.min_len = min_len
            self.max_len = max_len

        def check_string_length(self, series: pd.Series) -> pd.Series:
            """Check if all strings have an acceptable length.

            :returns pd.Series with the validation result as True or False
            """
            if self.min_len is not None:
                bool_series = series.str.len() >= self.min_len
            else:
                # No minimum length: every string passes this half of the check.
                bool_series = pd.Series(data=True, index=series.index)

            if self.max_len is not None:
                return bool_series & (series.str.len() <= self.max_len)

            return bool_series
Here is how they are used then:
    schema = pandera.SeriesSchema(
        pandas_dtype=series.dtype.name,
        nullable=True,
        checks=[checks.ValueRange(min_value=min_val, max_value=max_val)]
    )

    schema = pandera.SeriesSchema(
        pandas_dtype=pandera.String,
        nullable=False,
        checks=[checks.StringMatch(regex=pattern)]
    )
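To try the range logic without a pandera install, here is a minimal sketch that mirrors what `ValueRange.check` does on a bare pandas Series. The helper name `value_range_mask` and the sample data are made up for illustration; they are not part of pandera or of the draft pull request.

```python
import pandas as pd


def value_range_mask(series, min_value=None, max_value=None):
    """Return a boolean Series that is True where the value is within the limits.

    Hypothetical helper that mirrors the logic of ValueRange.check above.
    """
    # Start with "everything passes" and narrow it down per configured limit.
    mask = pd.Series(True, index=series.index)
    if min_value is not None:
        mask &= series >= min_value
    if max_value is not None:
        mask &= series <= max_value
    return mask


# Numerical limits on an integer series.
numbers = pd.Series([1, 5, 10])
number_mask = value_range_mask(numbers, min_value=2, max_value=10)

# Datetime limit on a datetime series, lower bound only.
dates = pd.Series(pd.to_datetime(["2019-01-01", "2019-06-01"]))
date_mask = value_range_mask(dates, min_value=pd.Timestamp("2019-03-01"))

print(number_mask.tolist())
print(date_mask.tolist())
```

Both limits are inclusive, and the comparisons run on the whole Series at once rather than element by element.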
I think we shouldn't overdo it. For example, the regex check covers all the other string checks you listed above, if the user is willing to express them as regexes.
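As a rough illustration of that point, and assuming the string checks in question are things like prefix and length checks, the sketch below compares dedicated pandas string operations with equivalent patterns passed to `Series.str.match`. The sample data is invented.

```python
import pandas as pd

strings = pd.Series(["abc", "abcdef", "xyz", "a"])

# Prefix check: dedicated string method vs. regex.
# str.match anchors the pattern at the start of each string.
prefix_by_method = strings.str.startswith("ab")
prefix_by_regex = strings.str.match(r"ab")

# Length check (2 to 4 characters): len() comparison vs. regex.
# The trailing "$" is needed because str.match does not anchor at the end.
length_by_method = strings.str.len().between(2, 4)
length_by_regex = strings.str.match(r"^.{2,4}$")

print(prefix_by_method.tolist(), prefix_by_regex.tolist())
print(length_by_method.tolist(), length_by_regex.tolist())
```

The regex versions are equivalent here, but they are arguably harder to read, which is the trade-off for keeping the set of built-in checks small.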
What's your feeling about these checks?
Edit: To make more transparent what I am up to I created a draft pull request which is referenced below.
One API design decision I'd like to discuss for this issue is how to implement built-in
Note that this issue is about how
I'm leaning toward option (1), for the following reasons:
An example of (1), besides the two Hypothesis methods linked above, would be:
    class Check(object):

        ...

        @classmethod
        def range(cls, min, max):
            return cls(
                fn=lambda s: (min <= s) & (s <= max),
                error="failed range check between %s and %s" % (min, max)
            )
For option (2), the subclassed equivalent would be:
    class Range(Check):

        def __init__(self, min, max):
            self.min = min
            self.max = max
            super(Range, self).__init__(
                fn=self.range,
                error="failed range check between %s and %s" % (min, max)
            )

        def range(self, series):
            return (self.min <= series) & (series <= self.max)
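To compare the two options side by side, here is a runnable sketch. The `Check` class below is a deliberately tiny stand-in that only stores `fn` and `error`; it is an assumption for illustration and not pandera's actual `Check`. The parameters are renamed to `min_value` and `max_value` to avoid shadowing the Python builtins.

```python
import pandas as pd


class Check:
    """Hypothetical stand-in for pandera's Check: stores fn and error only."""

    def __init__(self, fn, error=None):
        self.fn = fn
        self.error = error

    def __call__(self, series):
        # Apply the vectorized check function to the whole series.
        return self.fn(series)

    @classmethod
    def range(cls, min_value, max_value):
        # Option (1): a factory method that injects the fn argument.
        return cls(
            fn=lambda s: (min_value <= s) & (s <= max_value),
            error="failed range check between %s and %s" % (min_value, max_value),
        )


class Range(Check):
    # Option (2): a subclass that keeps the limits as instance state.

    def __init__(self, min_value, max_value):
        self.min_value = min_value
        self.max_value = max_value
        super().__init__(
            fn=self._in_range,
            error="failed range check between %s and %s" % (min_value, max_value),
        )

    def _in_range(self, series):
        return (self.min_value <= series) & (series <= self.max_value)


series = pd.Series([0, 5, 11])
factory_result = Check.range(1, 10)(series)
subclass_result = Range(1, 10)(series)

print(factory_result.tolist())
print(subclass_result.tolist())
```

From the caller's side both produce the same boolean Series and the same error message; the difference is whether the limits live in a closure or on the instance.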
Of course, the differences in the two examples might seem trivial and I think this boils down to taste... indeed, I originally designed the
My rationale behind this is that in
Ok, I get it. So there would only be the Check class with factory methods injecting the fn argument. Not that bad...
Either way, I think we should separate this from the core code of the Check class. Maybe have something like a standard_checks module with either the factory functions or the subclasses. Both only use the public interface of Check.
I was thinking that
However, I'm not wedded to this structure, as @chr1st1ank's proposal of built-in checks being subclasses is also a reasonable path. I think once the code is cleaned up a little bit (#96, #99), it might be easier to make a decision.
Ok, let's list what the classes currently seem to do:
What we are trying to do now is add built-in test functions for the general Check class as well. There I also like the functional approach suggested by @cosmicBboy. Note that this is already done for the Hypothesis checks. The inheritance hierarchy we are discussing here seems to point in the opposite direction, however.
My suggestion is to stick to the current rather flat class structure because more inheritance levels seem to complicate things further:
As we are already planning to move functionality out of the Check function to other places, there might not be too much code left to fill additional inheritance levels anyway.