Added validation method IsTypeValidation #56

chrispijo · 2021-03-04T22:07:09Z

Proposed new method to solve the issue with the validation method IsDTypeValidation. This method also indicates which rows are not valid.

chrispijo · 2021-03-04T22:10:04Z

I do not yet know how to link this pull request to the existing issue. But it concerns issue #39.

Removed typing at line 229 because this caused an issue in Python 3.5.

Changed line 234 because format `f"text {variable}"` is not allowed in Python 3.5.

Typing in line 240 was not allowed.

multimeric · 2021-03-05T11:30:18Z

Hmm, I'm not sure I like this approach. Pandas series are inherently all the same type (unless dtype=object), so it's wasteful to check each element when we can instead just look at the dtype of the series. Even if you were looking at each element I would advise against using a loop, and opt for a vectorised operation.
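A minimal sketch of the dtype-level check described above, assuming only pandas and its public `pandas.api.types.is_object_dtype` helper. The function name `needs_elementwise_check` is hypothetical and not part of pandas_schema.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def needs_elementwise_check(series: pd.Series) -> bool:
    """Hypothetical helper: only an object series can mix Python types."""
    return is_object_dtype(series)


ints = pd.Series([1, 2, 3])        # one numeric dtype for the whole series
mixed = pd.Series([1, "a", 2.5])   # falls back to dtype=object

print(needs_elementwise_check(ints))   # False
print(needs_elementwise_check(mixed))  # True
```

For the non-object case, a single comparison against `series.dtype` answers the question for every row at once.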

chrispijo · 2021-03-05T12:17:19Z

I was indeed looking at every element, so that the method returns validation messages for the specific cells that fail. If a series has dtype object, it is clear that there are one or more inconsistencies, but it would help to know the specific cells. Why wouldn't you want to include that?

If adding this functionality to the package is not preferred, that is okay. We can define the class separately and add it to the Column()-object.

I now see that using bool_series = series.apply(type) == int is faster than a for-loop (4.6x in my test). I am not sure, however, how to compare series.apply(type) against a list of types (e.g. [int, float]).

multimeric · 2021-03-05T22:39:31Z

The justification for the feature is fine, but you would need to make it clear in the docstring that this validation only makes sense for an object series, you would need to check that the Series is an object series, and also as I said I would much prefer vectorised operations. Look at pandas.Series.isin for your use-case.

chrispijo · 2021-03-06T16:52:06Z

I looked at your vectorization remark but I do not know how to streamline it further. It is now incorporated as series.apply(type).isin(self.allowed_types). It is unclear to me how to replace the apply-loop with a vectorized operation; the isin part is probably already vectorized? Adding isin raised the speed-up to 5.63x.
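A small sketch of the element-wise check discussed here, built on the `series.apply(type).isin(...)` expression from the comment. The function name `invalid_type_mask` and the sample data are illustrative assumptions, not code from the pull request.

```python
import pandas as pd


def invalid_type_mask(series: pd.Series, allowed_types: list) -> pd.Series:
    """Hypothetical helper: True where the element's Python type is not allowed."""
    # apply(type) still visits every element; isin then compares in bulk.
    return ~series.apply(type).isin(allowed_types)


series = pd.Series([1, "two", 3.0, None], dtype=object)
mask = invalid_type_mask(series, [int, float])

# The index of the failing cells is what lets a validator report specific rows.
failing = series[mask]
print(list(failing.index))  # [1, 3]
```

The mask keeps the index of the original series, so the failing cells can be reported individually.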

Besides that, if the series is of non-object dtype, it now 'redirects' to IsDTypeValidation. But I am not satisfied with the code. See also DISLIKE-remarks in-code.

multimeric · 2021-03-07T07:21:48Z

pandas_schema/validation.py

@@ -214,6 +215,80 @@ def validate(self, series: pd.Series) -> pd.Series:
        return (series >= self.min) & (series < self.max)


+def convert_type_to_dtype(type_to_convert: type) -> np.dtype:


I'm fairly sure that np.dtype(int) returns np.int64, making this function redundant.

I checked some different Python versions. On Ubuntu Linux with Python 3.8.5 it does indeed return np.int64. In my IDE on Windows it returns np.int32, for both Python 3.8 and 3.9. These stackoverflow answers explain that this comes from C on Windows, where long int is 32 bits even though the system is 64-bit.

So pd.Series([1,2,3]).dtype results in np.int64 and pd.Series(np.array([1,2,3])).dtype results in np.int32.

This makes it tricky to anticipate which dtype you will get in which situation.

EDIT:
Converting the series instead might be a solution. The code below behaves consistently, although I only tried the data types int, float and bool, leaving out (at least) datetime. Using np.zeros feels a bit hacky though, and a conversion still remains.

np.dtype(int)  # int32
series = pd.Series([1,2,3])  # int64
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # int
series_converted_type = series.astype(python_type)  # int32

np.dtype(float)  # float64
series = pd.Series([1.0,2,3])  # float64
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # float
series_converted_type = series.astype(python_type)  # float64

np.dtype(bool)  # bool (dtype)
series = pd.Series([True,False,True])  # bool (dtype)
python_type = type(np.zeros(1, series.dtype).tolist()[0])  # bool (normal Python class)
series_converted_type = series.astype(python_type)  # bool (dtype)
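One way to sidestep the platform-dependent integer width, sketched under the assumption that only the dtype family (bool, int, float) matters and not the exact bit width. It uses the public `pandas.api.types` predicates; the function name `dtype_family` is hypothetical, and this is an alternative to the np.zeros conversion, not the approach taken in the pull request.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype, is_float_dtype, is_integer_dtype


def dtype_family(series: pd.Series) -> str:
    """Hypothetical helper: classify a series by dtype family, ignoring width."""
    if is_bool_dtype(series):
        return "bool"
    if is_integer_dtype(series):
        return "int"
    if is_float_dtype(series):
        return "float"
    return "other"


from_list = pd.Series([1, 2, 3])
from_array = pd.Series(np.array([1, 2, 3]))

# The exact widths may differ between platforms, but the family does not.
print(dtype_family(from_list), dtype_family(from_array))
```

Because the comparison never names int32 or int64, the same check gives the same answer whichever width the platform picks.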

multimeric · 2021-03-07T07:22:40Z

pandas_schema/validation.py

+
+        # Numpy dtypes other than 'object' can be validated with IsDtypeValidation instead, but only if the
+        # allowed_types is singular. Otherwise continue.
+        # DISLIKE 01: IsDtypeValidation only allows a single dtype. So this if-statement redirects only if one type is


I would rather that you implement multiple dtype support in the IsDtypeValidation rather than here.

Agreed. Looking forward to your answer about converting types.

multimeric · 2021-03-07T07:23:48Z

pandas_schema/validation.py

+            new_validation_method = IsDtypeValidation(dtype=np.dtype(allowed_type))
+            return new_validation_method.get_errors(series=series)
+
+        # Else, validate each element along the allowed types.


Isn't this in the default method implementation? If so, just call super().get_errors()

Correct me if I misunderstood you. But the code below line 274 can then be rewritten to

return super().get_errors(series=series, column=column)

where the default value None for column-variable was removed.
I will commit this together with your other feedback later on.

By the way, why did you use column as a variable name (in get_errors())? It shadows a name from the outer scope.

chrispijo · 2021-03-10T20:15:09Z

A new version is pushed. I made IsTypeValidation how I think it is best (to my knowledge). IsDtypeValidation is changed to allow for multiple dtypes.

The test-file test_validation.py returns errors. Test-files are new for me. I've got to look into that after the weekend.

qotho · 2021-03-17T12:16:10Z

I tried using this fix and I replaced my use of IsDtypeValidation with IsTypeValidation. I left the argument as a string with the type name, as in IsTypeValidation('int64'). I got "TypeError: data type 'n' not understood" because IsTypeValidation is expecting a list of types, but I passed a string. The code iterated the string characters and errored when it hit 'n' in 'int64'. Maybe IsTypeValidation should check the argument type and if it is a string, wrap it in an array?

qotho · 2021-03-17T12:42:12Z

Also, this still seems to give me the same error I got before with the following example:

series = pd.Series([1,2,3,4]).astype('Int64')
v = IsDtypeValidation('Int64')
v.get_errors(series)

TypeError: Cannot interpret 'Int64Dtype' as a data type

Isn't this the original error this change was supposed to address (except for StringDtype), or did I misunderstand?
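A hedged sketch of how a dtype check could accept several dtypes and also handle pandas extension dtypes such as 'Int64'. It relies on `pandas.api.types.pandas_dtype`, which resolves both NumPy and extension dtype names, and on `pandas.api.types.is_dtype_equal`. The function name `dtype_matches` is hypothetical and this is not the code in IsDtypeValidation.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal, pandas_dtype


def dtype_matches(series: pd.Series, allowed_dtypes: list) -> bool:
    """Hypothetical helper: True if the series dtype equals any allowed dtype."""
    # pandas_dtype understands 'int64' as well as extension names like 'Int64'.
    resolved = [pandas_dtype(d) for d in allowed_dtypes]
    return any(is_dtype_equal(series.dtype, d) for d in resolved)


series = pd.Series([1, 2, 3, 4]).astype("Int64")
print(dtype_matches(series, ["Int64"]))             # True
print(dtype_matches(series, ["float64", "Int64"]))  # True
print(dtype_matches(series, ["float64"]))           # False
```

Resolving names through pandas rather than `np.dtype` is the design choice that matters here, since NumPy does not know the pandas extension dtypes.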

chrispijo · 2021-05-03T17:45:54Z

> I tried using this fix and I replaced my use of IsDtypeValidation with IsTypeValidation. I left the argument as a string with the type name, as in IsTypeValidation('int64'). I got "TypeError: data type 'n' not understood" because IsTypeValidation is expecting a list of types, but I passed a string. The code iterated the string characters and errored when it hit 'n' in 'int64'. Maybe IsTypeValidation should check the argument type and if it is a string, wrap it in an array?

The validation method is meant to allow the normal Python built-in types (like str, float, int, bool), and thus be used as IsTypeValidation([int, float]) for instance.

I am not sure how to correctly check whether the items in the provided list are of the correct types. I will include the following:

if type(allowed_types) != list:
    raise PanSchArgumentError('The argument "allowed_types" passed to IsTypeValidation is not of type list. Provide a '
                              'list containing one or more of the Python built-in types "str", "int", "float" or '
                              '"bool".')


for allowed_type in allowed_types:
    if allowed_type not in [str, int, float, bool]:
        raise PanSchArgumentError('The item "{}" provided in the argument "allowed_types" as passed to '
                                  'IsTypeValidation is not of the correct type. Provide one of Python built-in types '
                                  '"str", "int", "float" or "bool".'.format(allowed_type))

The downside, however, is that these four are probably not all the types that can occur in a dataframe. The latter check could be replaced with if type(allowed_type) != type, but then list (as in IsTypeValidation([list])) would be a valid argument, as list is also a built-in Python type.
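A possible shape for the argument check, sketched with `isinstance` instead of `type(...) !=` comparisons. The function name `normalise_allowed_types` is hypothetical, and the built-in `TypeError` stands in for PanSchArgumentError so the example runs without pandas_schema. As noted above, accepting any class means `list` would also pass.

```python
def normalise_allowed_types(allowed_types):
    """Hypothetical helper: return a list of classes or raise TypeError."""
    # Accept a single class such as int by wrapping it in a list.
    if isinstance(allowed_types, type):
        return [allowed_types]
    # A string such as 'int64' would otherwise be iterated character by character.
    if isinstance(allowed_types, str) or not isinstance(allowed_types, (list, tuple)):
        raise TypeError(
            'The argument "allowed_types" must be a list of Python types, '
            "got {!r}.".format(allowed_types)
        )
    for allowed_type in allowed_types:
        if not isinstance(allowed_type, type):
            raise TypeError(
                'The item {!r} in "allowed_types" is not a Python type.'.format(allowed_type)
            )
    return list(allowed_types)


print(normalise_allowed_types([int, float]))
print(normalise_allowed_types(int))
```

Rejecting strings up front turns the confusing "data type 'n' not understood" failure into an error that names the actual problem.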

chrispijo · 2021-05-03T17:46:57Z

> Also, this still seems to give me the same error I got before with the following example:
>
> series = pd.Series([1,2,3,4]).astype('Int64')
> v = IsDtypeValidation('Int64')
> v.get_errors(series)
>
> TypeError: Cannot interpret 'Int64Dtype' as a data type
>
> Isn't this the original error this change was supposed to address (except for StringDtype), or did I misunderstand?

Yes, you are right. The pull request drifted somewhat towards an alternative validation method.

Add IsTypeValidation

8b8c865

chrispijo added 3 commits March 5, 2021 11:37

Update validation.py

b56483a

Removed typing at line 229 because this caused an issue in Python 3.5.

Update validation.py

e382ce5

Changed line 234 because format `f"text {variable}"` is not allowed in Python 3.5.

Update validation.py

b1835a3

Typing in line 240 was not allowed.

multimeric linked an issue Mar 5, 2021 that may be closed by this pull request

IsDtypeValidation-issue for pandas StringDtype #39

Open

multimeric self-requested a review March 5, 2021 22:39

chrispijo added 2 commits March 6, 2021 17:38

Corrections after feedback

ca8e1e5

Corrections Python 3.5

2253125

multimeric requested changes Mar 7, 2021

chrispijo added 3 commits March 10, 2021 20:42

Feedback IsTypeValidation added. IsDtypeValidation changed

3598ce9

Update test-file

59713cb

Removed test-changes

f8e593e

Added validation of input argument

d05f365
