Contingent validation & optional columns #31

kotarf · 2020-02-23T01:06:48Z

There is a use case for validation where we want to contingently validate a column. We might not require such a column, but will want to validate it if it happens to be provided. This would come up when PandasSchema is used in an 'ETL' pipeline where data is submitted by users or 3rd parties.

If the column is optional and present and at least one non-null value exists, validate as we do normally.
If the column is optional and present but is empty, do not validate.
If the column is optional and not present, do not raise w/ the missing column exception.

I don't think allow_empty has this behavior so I introduced a new parameter. It's close but it has subtly different behavior.

This might go against the design of the library in that all provided columns are considered required in the Schema. In which case, I guess this feature doesn't make sense.

Otherwise, I don't think this PR is ready to go (eg. more tests and the logic might be off) but I'm happy to continue it so let me know what you think. Thx

multimeric · 2020-02-23T08:28:07Z

Hi @kotarf, thanks for the PR. I can appreciate your use-case and would like to support it. However, considering that I'm in the middle of a pseudo-rewrite that will better support this need, I'm hesitant to work through a PR too much. That said, if you need this functionality soon, it doesn't look like a huge PR and I'm happy to review it.

kotarf · 2020-02-23T19:07:33Z

Hi @kotarf, thanks for the PR. I can appreciate your use-case and would like to support it. However, considering that I'm in the middle of a pseudo-rewrite that will better support this need, I'm hesitant to work through a PR too much. That said, if you need this functionality soon, it doesn't look like a huge PR and I'm happy to review it.

Nope, not really urgent. I also had to add a fix; the logic should be correct now.

I'd like to work against this re-write. Is it being worked on in #29?

multimeric · 2020-02-25T14:02:19Z

It is being worked on there, but it's all in so much flux I wouldn't recommend merging that branch in right now. Broadly speaking, the PR decouples Validations from Series, so your problem of optional Series this is naturally something that I'll have to solve anyway.

If it's not urgent, I'll check in here after I've finished that branch.

kotarf · 2020-02-25T14:57:33Z

It is being worked on there, but it's all in so much flux I wouldn't recommend merging that branch in right now. Broadly speaking, the PR decouples Validations from Series, so your problem of optional Series this is naturally something that I'll have to solve anyway.

If it's not urgent, I'll check in here after I've finished that branch.

That's fine, I actually have another potential feature that's slightly more involved. I'll try to get up to date with the rework and code against that. Let me know if this PR becomes irrelevant and I'll close it out.

whege · 2020-11-06T20:13:35Z

Hi, wanted to check on the status of the PR since I see it has been a while since any discussion and I currently have a use case for this feature in a project. If it is still being worked on, I am happy to help contribute if needed to help complete the PR.

multimeric · 2020-11-07T04:40:16Z

Unfortunately it's in the same place as before, because I've been quite busy. If you or anyone else wants to expedite this, you could try to implement this by writing a PR against my 1.0 branch: https://github.com/TMiguelT/PandasSchema/tree/bitwise.

Unfortunately I haven't written a comprehensive changes document, that will come later. However the docstrings are generally up-to-date, so that should help you in writing this. Basically the main changes in that branch that impacts this feature are:

I have abolished the notion of Columns, because for situations like this one, it doesn't necessarily make sense
Each Validation object has its own internal "index" which is an instance of AxisIndexer, a thin wrapper around .loc and .iloc

So to implement the behaviour mentioned in the OP, you would have your ordinary validation set, and-ed together, then you would or it with a validator that passes when the column doesn't exist, or-ed with a validator that passes when all of the column is empty. e.g. (CanConvertValidation(int) & InRangeValidation(min=1, max=8)) | IsAbsentValidation() | AllEmptyValidation().

Then you have to implement the IsAbsentValidation() such that it returns True if the index doesn't exist in the DataFrame (yes really a scalar True, not a Series, we use numpy broadcasting logic to handle this), and also the AllEmptyValidation() which returns True if the entire Series is empty. Note, this differs from the existing IsEmptyValidation, which returns a boolean Series of True and False depending on whether each cell in that Series is empty.

Let me know if you have any issues or need any clarifications.

kotarf added 2 commits February 22, 2020 19:52

Adds contingent validation with support for optional columns

19cec12

Allow optional columns to be missing from data

06ae5a1

Fix

ca38653

fix expression again

df32cb2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Contingent validation & optional columns #31

Contingent validation & optional columns #31

kotarf commented Feb 23, 2020 •

edited

Loading

multimeric commented Feb 23, 2020

kotarf commented Feb 23, 2020

multimeric commented Feb 25, 2020

kotarf commented Feb 25, 2020

whege commented Nov 6, 2020

multimeric commented Nov 7, 2020

Contingent validation & optional columns #31

Are you sure you want to change the base?

Contingent validation & optional columns #31

Conversation

kotarf commented Feb 23, 2020 • edited Loading

multimeric commented Feb 23, 2020

kotarf commented Feb 23, 2020

multimeric commented Feb 25, 2020

kotarf commented Feb 25, 2020

whege commented Nov 6, 2020

multimeric commented Nov 7, 2020

kotarf commented Feb 23, 2020 •

edited

Loading