Added validation method IsTypeValidation #56
Conversation
I do not yet know how to link this pull request to the existing issue, but it concerns issue #39.
Removed the type annotation at line 229 because it caused an issue in Python 3.5.
Changed line 234 because the `f"text {variable}"` format is not supported in Python 3.5.
The type annotation on line 240 was also not allowed.
Hmm, I'm not sure I like this approach. Pandas series are inherently all the same type (unless dtype=object), so it's wasteful to check each element when we can instead just look at the dtype of the series. Even if you were looking at each element, I would advise against using a loop and opt for a vectorised operation.
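A minimal sketch of the two approaches being contrasted here, assuming a plain series; the variable names are illustrative and not part of pandas_schema:

```python
import pandas as pd

series = pd.Series([1, 2, 'three'])  # mixed values force dtype=object

# Cheap check: a homogeneous (non-object) series never needs element-wise inspection.
needs_element_check = series.dtype == object

# If element-wise checking is unavoidable, vectorise it instead of looping:
# series.map(type) computes the type of every cell in one pass.
element_types = series.map(type)
bad_cells = series[element_types != int]  # cells whose value is not an int
```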
I was indeed looking at every element so that it returns validation messages for the specific cells that fail. If a series is object, it is clear that there are one or more inconsistencies, but it would help if we knew the specific cells. Why wouldn't you want to include that? If adding this functionality to the package is not preferred, that is okay; we can define the class separately and add it to the Column() object. I now see that using …
The justification for the feature is fine, but you would need to make it clear in the docstring that this validation only makes sense for an object series, you would need to check that the series is an object series, and, as I said, I would much prefer vectorised operations. Look at …
I looked at your vectorisation remark but I do not know how to streamline it further. It is now incorporated as … Besides that, if the series is of non-object dtype, it now 'redirects' to IsDtypeValidation. But I am not satisfied with the code. See also the DISLIKE remarks in the code.
pandas_schema/validation.py
Outdated
@@ -214,6 +215,80 @@ def validate(self, series: pd.Series) -> pd.Series:
    return (series >= self.min) & (series < self.max)
def convert_type_to_dtype(type_to_convert: type) -> np.dtype:
I'm fairly sure that np.dtype(int) returns np.int64, making this function redundant.
I checked some different Python versions. On Ubuntu Linux with Python 3.8.5 it indeed returns np.int64. In my IDE on Windows it returns np.int32, for both Python 3.8 and 3.9. These Stack Overflow answers explain that this results from C on Windows, where a long int is 32-bit even though the system is 64-bit.
So pd.Series([1,2,3]).dtype results in np.int64, while pd.Series(np.array([1,2,3])).dtype results in np.int32.
That makes it tricky to anticipate which will happen when.
EDIT:
Converting the series instead might be a solution. The code below is pretty consistent, although I only tried the data types int, float and bool, leaving out (at least?) datetime. np.zeros feels a bit hacky though, and there remains a conversion.
import numpy as np
import pandas as pd

np.dtype(int)  # int32
series = pd.Series([1, 2, 3])  # int64
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # int
series_converted_type = series.astype(python_type)  # int32

np.dtype(float)  # float64
series = pd.Series([1.0, 2, 3])  # float64
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # float
series_converted_type = series.astype(python_type)  # float64

np.dtype(bool)  # bool (dtype)
series = pd.Series([True, False, True])  # bool (dtype)
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # bool (normal Python class)
series_converted_type = series.astype(python_type)  # bool (dtype)
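As an aside, a possible alternative that avoids the np.zeros trick: NumPy scalar types expose .item(), which returns the matching built-in Python scalar. This is an illustrative suggestion, not code from the PR, and it assumes a numeric or boolean dtype:

```python
import pandas as pd

series = pd.Series([1, 2, 3])
# series.dtype.type is the NumPy scalar class (e.g. np.int64); instantiating
# it and calling .item() yields the corresponding built-in Python type.
python_type = type(series.dtype.type(0).item())  # int
```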
pandas_schema/validation.py
Outdated
# Numpy dtypes other than 'object' can be validated with IsDtypeValidation instead, but only if the
# allowed_types is singular. Otherwise continue.
# DISLIKE 01: IsDtypeValidation only allows a single dtype. So this if-statement redirects only if one type is
I would rather you implement multiple-dtype support in IsDtypeValidation itself rather than here.
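A rough sketch of what multiple-dtype support could look like; the real pandas_schema base-class API (warning objects, get_errors, etc.) is simplified away here, so treat this as an assumption rather than the actual implementation:

```python
import numpy as np
import pandas as pd

class IsDtypeValidation:
    """Simplified stand-in: validate that a series' dtype matches
    one of several allowed dtypes."""

    def __init__(self, dtype):
        # Accept either a single dtype or a list of dtypes.
        dtypes = dtype if isinstance(dtype, list) else [dtype]
        self.dtypes = [np.dtype(d) for d in dtypes]

    def valid(self, series: pd.Series) -> bool:
        # Pass if the series dtype is a subtype of any allowed dtype.
        return any(np.issubdtype(series.dtype, d) for d in self.dtypes)

# Usage:
IsDtypeValidation([np.int64, np.float64]).valid(pd.Series([1, 2, 3]))  # True
```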
Agreed. Looking forward to your answer about converting types.
pandas_schema/validation.py
Outdated
new_validation_method = IsDtypeValidation(dtype=np.dtype(allowed_type))
return new_validation_method.get_errors(series=series)
# Else, validate each element against the allowed types.
Isn't this in the default method implementation? If so, just call super().get_errors().
Correct me if I misunderstood you, but the code below line 274 can then be rewritten to return super().get_errors(series=series, column=column), where the default value None for the column variable was removed.
I will commit this together with your other feedback later on.
By the way, why did you use column as a variable name (in get_errors())? It shadows a name from the outer scope.
A new version is pushed. I made IsTypeValidation how I think it is best (to my knowledge). IsDtypeValidation is changed to allow for multiple dtypes. The test file test_validation.py returns errors; test files are new for me, so I've got to look into that after the weekend.
I tried using this fix and replaced my use of IsDtypeValidation with IsTypeValidation. I left the argument as a string with the type name, as in IsTypeValidation('int64'). I got "TypeError: data type 'n' not understood" because IsTypeValidation expects a list of types, but I passed a string: the code iterated over the string's characters and errored when it hit 'n' in 'int64'. Maybe IsTypeValidation should check the argument type and, if it is a string, wrap it in a list?
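A minimal reproduction of that failure mode, plus one possible guard. This assumes the constructor iterates over its argument and feeds each element to np.dtype, which matches the error described:

```python
import numpy as np

# Iterating a bare string walks its characters, not a list of type names:
for allowed_type in 'int64':
    np.dtype(allowed_type)  # np.dtype('i') succeeds; np.dtype('n') raises
# TypeError: data type 'n' not understood

# Possible guard: normalise a bare string into a single-element list.
def normalise_allowed_types(allowed_types):
    if isinstance(allowed_types, str):
        return [allowed_types]
    return list(allowed_types)
```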
Also, this still seems to give me the same error I got before with the following example:
Isn't this the original error this change was supposed to address (except for StringDtype), or did I misunderstand?
The validation method is meant to allow the normal Python built-in types (like …). I am not sure how to correctly check that the items provided in the argument are of the correct types. I will include the following:

if type(allowed_types) != list:
    raise PanSchArgumentError('The argument "allowed_types" passed to IsTypeValidation is not of type list. Provide a '
                              'list containing one or more of the Python built-in types "str", "int", "float" or '
                              '"bool".')
for allowed_type in allowed_types:
    if allowed_type not in [str, int, float, bool]:
        raise PanSchArgumentError('The item "{}" provided in the argument "allowed_types" as passed to '
                                  'IsTypeValidation is not of the correct type. Provide one of the Python built-in '
                                  'types "str", "int", "float" or "bool".'.format(allowed_type))

The downside, however, is that these four are probably not all possible types in a dataframe. The latter could be replaced with …
Yes, you are right. It derailed somewhat into an alternative validation method.
Proposed a new validation method to solve the issue with IsDtypeValidation. The new method also indicates which rows are not valid.