[WIP] Schemas for arbitrary subsets (DataFrames and Series and groups of Series) #29

Closed
wants to merge 13 commits into from
Empty file modified .gitignore
100644 → 100755
Empty file.
Empty file modified .travis.yml
100644 → 100755
Empty file.
Empty file modified LICENSE
100644 → 100755
Empty file.
Empty file modified README.rst
100644 → 100755
Empty file.
10 changes: 10 additions & 0 deletions TODO.md
@@ -0,0 +1,10 @@
* [ ] Add validations that apply to every column in the DF equally
* [x] Fix CombinedValidations
* [x] Add replacement for allow_empty Columns
* [ ] New column() tests
* [ ] New CombinedValidation tests
* [x] Fix Negate
* [ ] Add facility for allow_empty
* [x] Fix messages
* [x] Re-implement the or/and using operators
* [ ] Allow and/or operators between Series-level and row-level validations
47 changes: 47 additions & 0 deletions UPDATE.md
@@ -0,0 +1,47 @@
# ValidationWarnings
## Options for the ValidationWarning data
* We keep it as is, with one single ValidationWarning class that stores a `message` and a reference to the validation
that spawned it
* PREFERRED: As above, but we add a dictionary of miscellaneous kwargs to the ValidationWarning for storing stuff like the row index that failed
* We have a dataclass for each Validation type that stores things in a more structured way
* Why bother doing this if the Validation stores its own structure for the column index etc?
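
A minimal sketch of the preferred option above, assuming the extra data lives in a plain dictionary. The field name `props` and the example values are hypothetical and do not come from this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ValidationWarning:
    """One failure reported by a validation (hypothetical layout)."""
    validation: Any  # the validation object that spawned this warning
    message: str     # human-readable description of the failure
    # Free-form extras, e.g. the row index that failed
    props: Dict[str, Any] = field(default_factory=dict)


warning = ValidationWarning(
    validation="InRangeValidation(0, 10)",  # stand-in for a real validation object
    message="value 42 is not in range [0, 10)",
    props={"row": 3, "column": "age"},
)
print(warning.props["row"])
```

Keeping the extras in one dictionary means a new validation type can attach whatever context it needs without a new warning class.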

## Options for the ValidationWarning message
* It's generated from the Validation as a fixed string, as it is now
* It's generated dynamically by the VW
* This means that custom messages require overriding the VW class
* PREFERRED: It's generated dynamically in the VW by calling the parent Validation with a reference to itself, e.g.
```python
class ValidationWarning:
    def __str__(self):
        # Delegate message generation to the validation that spawned this warning
        return self.validation.generate_message(self)

class Validation:
    def generate_message(self, warning: ValidationWarning) -> str:
        pass
```
* This lets the message function use all the validation properties, and the dictionary of kwargs that it specified
* `generate_message()` will either call `default_message(**kwargs)`, the dynamic class method, or use `self.custom_message`, the
non-dynamic string specified by the user
* Each category of Validation will define a `create_prefix()` method that creates the `{row: 1, column: 2}` prefix
that goes before each message. `generate_message()` will then concatenate that prefix with the actual message
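
The message flow described above could be sketched as follows. This is only an illustration under stated assumptions: `InRangeValidation`, the `props` attribute and the message wording are hypothetical, and `default_message()` is written as an ordinary instance method for simplicity.

```python
from typing import Any, Dict, Optional


class ValidationWarning:
    def __init__(self, validation: "Validation", **props: Any):
        self.validation = validation
        self.props: Dict[str, Any] = props  # e.g. row, column, offending value

    def __str__(self) -> str:
        # Delegate to the validation that spawned this warning
        return self.validation.generate_message(self)


class Validation:
    def __init__(self, custom_message: Optional[str] = None):
        self.custom_message = custom_message

    def create_prefix(self, warning: ValidationWarning) -> str:
        # Build the location prefix from whichever of row/column is present
        parts = [f"{key}: {warning.props[key]}"
                 for key in ("row", "column") if key in warning.props]
        return "{" + ", ".join(parts) + "}"

    def default_message(self, **kwargs: Any) -> str:
        return "failed validation"

    def generate_message(self, warning: ValidationWarning) -> str:
        # A user-supplied string wins; otherwise build the message dynamically
        message = self.custom_message or self.default_message(**warning.props)
        return f"{self.create_prefix(warning)} {message}"


class InRangeValidation(Validation):
    def __init__(self, low: float, high: float, **kwargs: Any):
        super().__init__(**kwargs)
        self.low, self.high = low, high

    def default_message(self, value: Any = None, **kwargs: Any) -> str:
        return f"value {value} is not in range [{self.low}, {self.high})"


dynamic = ValidationWarning(InRangeValidation(0, 10), row=1, column=2, value=42)
custom = ValidationWarning(
    InRangeValidation(0, 10, custom_message="out of range"), row=1, column=2, value=42
)
print(dynamic)
print(custom)
```

Because the warning only stores data and defers to its validation, a custom message needs no subclass of the warning itself.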

## Options for placing CombinedValidation in the inheritance hierarchy
* For CombinedValidation and BooleanSeriesValidation to share a class, so that they can be chained together,
we either had to make a mixin that creates a "side path" that doesn't call `validate` (in this case, `validate_with_series`),
or we

# Rework of Validation Indexing
## All Indexed
* All Validations now have an index and an axis
* However, this index can be None, column-only, row-only, or both
* When combined with each other, the resulting boolean series will be broadcast using numpy broadcasting rules
* e.g.
* A per-series validation might have index 0 (column 0) and return a scalar (the whole series is okay)
* A per-cell validation might have index 0 (column 0) and return a series (True, True, False) indicating that cells 0 and 1 of column 0 are okay
* A per-frame validation would have index None, and might return True if the whole frame meets the validation, or a series indicating which columns or rows match the validation
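
The broadcasting behaviour described above can be illustrated with plain numpy arrays. This is a sketch only: the variable names are made up, and the real validations would presumably return pandas objects rather than bare arrays.

```python
import numpy as np

# Per-series validation on column 0: one scalar for the whole series
series_ok = np.bool_(True)

# Per-cell validation on column 0: one result per cell
cells_ok = np.array([True, True, False])

# Combining with "and" broadcasts the scalar across every cell
combined = np.logical_and(series_ok, cells_ok)
print(combined)

# Per-frame validations might instead return a per-row or per-column mask
rows_ok = np.array([True, False, True])  # 3 rows
cols_ok = np.array([True, True])         # 2 columns

# A (3, 1) column vector against a (2,) vector broadcasts to a (3, 2) grid
grid = np.logical_and(rows_ok[:, np.newaxis], cols_ok)
print(grid.shape)
```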

# Rework of CombinedValidations
## Bitwise
* Could assign each validation a bit in a large bitwise enum, and `or` that bit into a per-index number each time that index fails a validation. This lets us track the origin of each warning, allowing us to slice them out by bit and generate an appropriate list of warnings
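
One way to picture the bitwise idea is with the standard library's `enum.IntFlag`. This is a rough sketch rather than anything implemented in this PR; the flag names and the `failed_validations()` helper are hypothetical.

```python
from enum import IntFlag, auto
from typing import List


class Failed(IntFlag):
    # One bit per validation; the names are hypothetical
    IS_DTYPE = auto()
    IN_RANGE = auto()
    MATCHES_PATTERN = auto()


def failed_validations(mask: int) -> List[Failed]:
    """Slice a combined mask back into the validations that failed."""
    return [flag for flag in Failed if mask & flag]


# One integer mask per row, starting with no failures
masks = [0, 0, 0]
masks[1] |= Failed.IN_RANGE
masks[2] |= Failed.IN_RANGE
masks[2] |= Failed.MATCHES_PATTERN

for row, mask in enumerate(masks):
    for flag in failed_validations(mask):
        print(f"row {row} failed {flag.name}")
```

Each mask stays a single integer per index, so combining validations is cheap and the origin of every failure can still be recovered afterwards.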
Empty file modified doc/common/introduction.rst
100644 → 100755
Empty file.
Empty file modified doc/readme/README.rst
100644 → 100755
Empty file.
Empty file modified doc/readme/conf.py
100644 → 100755
Empty file.
Empty file modified doc/site/Makefile
100644 → 100755
Empty file.
Empty file modified doc/site/conf.py
100644 → 100755
Empty file.
Empty file modified doc/site/index.rst
100644 → 100755
Empty file.
Empty file modified example/boolean.py
100644 → 100755
Empty file.
Empty file modified example/boolean.txt
100644 → 100755
Empty file.
Empty file modified example/example.py
100644 → 100755
Empty file.
Empty file modified example/example.txt
100644 → 100755
Empty file.
2 changes: 0 additions & 2 deletions pandas_schema/__init__.py
100644 → 100755
@@ -1,4 +1,2 @@
from .column import Column
from .validation_warning import ValidationWarning
from .schema import Schema
from .version import __version__
84 changes: 63 additions & 21 deletions pandas_schema/column.py
100644 → 100755
@@ -1,27 +1,69 @@
import typing
import pandas as pd

from . import validation
from .validation_warning import ValidationWarning
import pandas_schema.core
from pandas_schema.index import PandasIndexer

class Column:
    def __init__(self, name: str, validations: typing.Iterable['validation._BaseValidation'] = [], allow_empty=False):
        """
        Creates a new Column object

        :param name: The column header that defines this column. This must be identical to the header used in the CSV/Data Frame you are validating.
        :param validations: An iterable of objects implementing _BaseValidation that will generate ValidationErrors
        :param allow_empty: True if an empty column is considered valid. False if we leave that logic up to the Validation
        """
        self.name = name
        self.validations = list(validations)
        self.allow_empty = allow_empty
def column(
        validations: typing.Iterable['pandas_schema.core.IndexSeriesValidation'],
        index: PandasIndexer = None,
        override: bool = False,
        allow_empty=False
):
    """
    A utility method for setting the index data on a set of Validations
    :param validations: A list of validations to modify
    :param index: The index of the series that these validations will now consider
    :param override: If true, override existing index values. Otherwise keep the existing ones
    :param allow_empty: Allow empty rows (NaN) to pass the validation
    See :py:class:`pandas_schema.validation.IndexSeriesValidation`
    """
    for valid in validations:
        if override or valid.index is None:
            valid.index = index

    def validate(self, series: pd.Series) -> typing.List[ValidationWarning]:
        """
        Creates a list of validation errors using the Validation objects contained in the Column

        :param series: A pandas Series to validate
        :return: An iterable of ValidationError instances generated by the validation
        """
        return [error for validation in self.validations for error in validation.get_errors(series, self)]
def column_sequence(
        validations: typing.Iterable['pandas_schema.core.IndexSeriesValidation'],
        override: bool = False
):
    """
    A utility method for setting the index data on a set of Validations. Applies a sequential position based index, so
    that the first validation gets index 0, the second gets index 1 etc. Note: unless override is true, this will not
    modify any validation that already has an index
    :param validations: A list of validations to modify
    :param override: If true, override existing index values. Otherwise keep the existing ones
    """
    # enumerate() supplies the position that is unpacked into i
    for i, valid in enumerate(validations):
        if override or valid.index is None:
            valid.index = PandasIndexer(i, typ='positional')
#
# def label_column(
# validations: typing.Iterable['pandas_schema.core.IndexSeriesValidation'],
# index: typing.Union[int, str],
# ):
# """
# A utility method for setting the label-based column for each validation
# :param validations: A list of validations to modify
# :param index: The label of the series that these validations will now consider
# """
# return _column(
# validations,
# index,
# position=False
# )
#
# def positional_column(
# validations: typing.Iterable['pandas_schema.core.IndexSeriesValidation'],
# index: int,
# ):
# """
# A utility method for setting the position-based column for each validation
# :param validations: A list of validations to modify
# :param index: The index of the series that these validations will now consider
# """
# return _column(
# validations,
# index,
# position=True
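
A self-contained sketch of how `column_sequence()` might behave, using stand-ins so that it runs without the library. The `PandasIndexer` dataclass and `DummyValidation` here are hypothetical simplifications, not the classes from this PR, and the loop uses `enumerate()` so that it receives (position, validation) pairs.

```python
import typing
from dataclasses import dataclass


@dataclass
class PandasIndexer:
    # Stand-in for pandas_schema.index.PandasIndexer (assumed shape)
    index: typing.Union[int, str]
    typ: str = 'label'


class DummyValidation:
    # Stand-in for an IndexSeriesValidation: only the `index` attribute matters here
    def __init__(self, index: typing.Optional[PandasIndexer] = None):
        self.index = index


def column_sequence(validations, override: bool = False):
    # Assign positional indexes 0, 1, 2... to validations that lack an index
    for i, valid in enumerate(validations):
        if override or valid.index is None:
            valid.index = PandasIndexer(i, typ='positional')


validations = [
    DummyValidation(),
    DummyValidation(PandasIndexer('age')),  # already indexed, left alone
    DummyValidation(),
]
column_sequence(validations)
print([v.index for v in validations])
```

With `override=True` the second validation would also be re-indexed to position 1.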
