Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added validation method IsTypeValidation #56

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

chrispijo
Copy link

Proposed new method to solve the issue with the validation method IsDTypeValidation. This method also indicates which rows are not valid.

@chrispijo
Copy link
Author

I do not yet know how to link this pull request to the existing issue. But it concerns issue #39.

Removed typing at line 229 because this caused an issue in Python 3.5.
Changed line 234 because format `f"text {variable}"` is not allowed in Python 3.5.
Typing in line 240 was not allowed.
@multimeric multimeric linked an issue Mar 5, 2021 that may be closed by this pull request
@multimeric
Copy link
Owner

Hmm, I'm not sure I like this approach. Pandas series are inherently all the same type (unless dtype=object), so it's wasteful to check each element when we can instead just look at the dtype of the series. Even if you were looking at each element I would advise against using a loop, and opt for a vectorised operation.

@chrispijo
Copy link
Author

I was indeed looking at every element such that it returns validation messages for the specific cells that fail. If a series is object, it is clear that there are one or more inconsistencies, but it would help if we'd know the specific cells. Why wouldn't you want to include that?

If adding this functionality to the package is not preferred, that is okay. We can define the class separately and add it to the Column()-object.

I know see that using bool_series = series.apply(type) == int is faster than a for-loop (4.6x in my test). Not sure how to compare series.apply(type) to a list (e.g. [int, float]) however.

@multimeric
Copy link
Owner

The justification for the feature is fine, but you would need to make it clear in the docstring that this validation only makes sense for an object series, you would need to check that the Series is an object series, and also as I said I would much prefer vectorised operations. Look at pandas.Series.isin for your use-case.

@multimeric multimeric self-requested a review March 5, 2021 22:39
@chrispijo
Copy link
Author

chrispijo commented Mar 6, 2021

I looked at your vectorization remark but I do not know how to streamline it more. It is now encorporated as series.apply(type).isin(self.allowed_types). It is unclear how to replace the apply-loop with vectorization. The isin is probably already vectorization? The additional isin improved speed to 5.63x faster.

Besides that, if the series is of non-object dtype, it now 'redirects' to IsDTypeValidation. But I am not satisfied with the code. See also DISLIKE-remarks in-code.

@@ -214,6 +215,80 @@ def validate(self, series: pd.Series) -> pd.Series:
return (series >= self.min) & (series < self.max)


def convert_type_to_dtype(type_to_convert: type) -> np.dtype:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly sure that np.dtype(int) returns np.int64, making this function redundant.

Copy link
Author

@chrispijo chrispijo Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked some different Python versions. On Linux Ubuntu with Python 3.8.5, it returns indeed np.int64. In my IDE on Windows it returns np.int32. The latter is for both Python 3.8 and 3.9. These stackoverflow answers explain that this results from C in Windows, where long int is 32bit despite the system being 64bit.

So pd.Series([1,2,3]).dtype results in np.int64 and pd.Series(np.array([1,2,3])).dtype results in np.int32.

It makes it tricky to anticipate which is to happen when..

EDIT:
Converting the series instead might be a solution. The below code is pretty consistent, although I only did data types int, float and bool. Leaving out (at least?) datetime. np.zeros feels a bit hacky though. And there remains a conversion.

np.dtype(int)  # int32
series = pd.Series([1,2,3])  # int64
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # int
series_converted_type = series.astype(python_type)  # int32

np.dtype(float)  # float64
series = pd.Series([1.0,2,3])  # float64
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # float
series_converted_type = series.astype(python_type)  # float64

np.dtype(bool)  # bool (dtype)
series = pd.Series([True,False,True])  # bool (dtype)
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # bool (normal Python class)
series_converted_type = series.astype(python_type)  # bool (dtype)


# Numpy dtypes other than 'object' can be validated with IsDtypeValidation instead, but only if the
# allowed_types is singular. Otherwise continue.
# DISLIKE 01: IsDtypeValidation only allows a single dtype. So this if-statement redirects only if one type is
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather that you implement multiple dtype support in the IsDtypeValidation rather than here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Looking forward on your answer about converting types.

new_validation_method = IsDtypeValidation(dtype=np.dtype(allowed_type))
return new_validation_method.get_errors(series=series)

# Else, validate each element along the allowed types.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this in the default method implementation? If so just call super.get_errors()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I misunderstood you. But the code below line 274 can then be rewritten to

return super().get_errors(series=series, column=column)

where the default value None for column-variable was removed.
I will commit this together with your other feedback later on.

Btw. Why did you use column as a variable name (in get_errors())? It shadows from the outer scope.

pandas_schema/version.py Outdated Show resolved Hide resolved
@chrispijo
Copy link
Author

A new version is pushed. I made IsTypeValidation how I think it is best (to my knowledge). IsDtypeValidation is changed to allow for multiple dtypes.

The test-file test_validation.py returns errors. Test-files are new for me. I've got to look into that after the weekend.

@qotho
Copy link

qotho commented Mar 17, 2021

I tried using this fix and I replaced my use of IsDtypeValidation with IsTypeValidation. I left the argument as a string with the type name, as in IsTypeValidation('int64'). I got "TypeError: data type 'n' not understood" because IsTypeValidation is expecting a list of types, but I passed a string. The code iterated the string characters and errored when it hit 'n' in 'int64'. Maybe IsTypeValidation should check the argument type and if it is a string, wrap it in an array?

@qotho
Copy link

qotho commented Mar 17, 2021

Also, this still seems to give me the same error I got before with the following example:

series = pd.Series([1,2,3,4]).astype('Int64')
v = IsDtypeValidation('Int64')
v.get_errors(series)

TypeError: Cannot interpret 'Int64Dtype' as a data type

Isn't this the original error this change was supposed to address (except for StringDtype), or did I misunderstand?

@chrispijo
Copy link
Author

chrispijo commented May 3, 2021

I tried using this fix and I replaced my use of IsDtypeValidation with IsTypeValidation. I left the argument as a string with the type name, as in IsTypeValidation('int64'). I got "TypeError: data type 'n' not understood" because IsTypeValidation is expecting a list of types, but I passed a string. The code iterated the string characters and errored when it hit 'n' in 'int64'. Maybe IsTypeValidation should check the argument type and if it is a string, wrap it in an array?

The validation method is meant to allow the normal Python built-in types (like str, float, int, bool), and thus be used as IsTypeValidation([int, float]) for instance.

I am not sure how to correctly check if provided list items in the argument is of the correct types. I will include the following:

if type(allowed_types) != list:
    raise PanSchArgumentError('The argument "allowed_types" passed to IsTypeValidation is not of type list. Provide a '
                              'list containing one or more of the Python built-in types "str", "int", "float" or '
                              '"bool".')


for allowed_type in allowed_types:
    if allowed_type not in [str, int, float, bool]:
        raise PanSchArgumentError('The item "{}" provided in the argument "allowed_types" as passed to '
                                  'IsTypeValidation is not of the correct type. Provide one of Python built-in types '
                                  '"str", "int", "float" or "bool".'.format(allowed_type))

The downside however is that these four are probably not all possible types in a dataframe. The latter could be replaced with if type(allowed_type) != type, but then list (as in IsTypeValidation([list])) would be a valid argument, as list is also an build-in Python type.

@chrispijo
Copy link
Author

chrispijo commented May 3, 2021

Also, this still seems to give me the same error I got before with the following example:

series = pd.Series([1,2,3,4]).astype('Int64')
v = IsDtypeValidation('Int64')
v.get_errors(series)

TypeError: Cannot interpret 'Int64Dtype' as a data type

Isn't this the original error this change was supposed to address (except for StringDtype), or did I misunderstand?

Yes you are right. It derailed somewhat to an alternative validation method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IsDtypeValidation-issue for pandas StringDtype
3 participants