IsDtypeValidation-issue for pandas StringDtype #39

Open
chrispijo opened this issue Sep 23, 2020 · 2 comments · May be fixed by #56
@chrispijo

Hi. I am trying to confirm that all values in a pandas column are of type string. Doing this with IsDtypeValidation raises the error TypeError: Cannot interpret 'StringDtype' as a data type. I posted a question on StackOverflow, and based on the comments I suspect that this might actually be an error in the IsDtypeValidation class.

Is this an error? Or am I misusing the class/package?

import numpy as np
import pandas as pd
from pandas_schema.validation import IsDtypeValidation

series = pd.Series(["a", "b", "c"])

# Works as expected:
#   Returns a validation warning as the series is of dtype 'object' and not 'string'.
print(f"dtype = {series.dtypes}")  # Returns: dtype = object
idv = IsDtypeValidation(dtype=np.dtype(str))
validation_warnings = idv.get_errors(series=series)
print(validation_warnings[0])  # Returns: The column  has a dtype of object which is not a subclass of the required type <U0

# But we know that the series only contains string-values. Thus convert_dtypes() below.
# Does not work as expected:
#   Returns an error and traceback with 'TypeError: Cannot interpret 'StringDtype' as a data type'.
#   Expected output should be no error or validation warning.
series = series.convert_dtypes()
print(f"dtype = {series.dtypes}")  # Returns: dtype = string
idv = IsDtypeValidation(dtype=np.dtype(str))
validation_warnings = idv.get_errors(series=series)  # Error occurs in this line: 'TypeError: Cannot interpret 'StringDtype' as a data type'
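
As a possible stopgap for the check described above, the dtype can be inspected with pandas alone. This is only a minimal sketch that bypasses IsDtypeValidation entirely; the variable name has_string_dtype is made up for illustration and is not part of pandas_schema.

```python
import pandas as pd

# Convert the object-dtype series to the pandas string extension dtype.
series = pd.Series(["a", "b", "c"]).convert_dtypes()

# pd.StringDtype is an extension dtype, so check it with isinstance
# instead of passing it to NumPy's dtype machinery.
has_string_dtype = isinstance(series.dtype, pd.StringDtype)
print(f"dtype = {series.dtype}, is StringDtype: {has_string_dtype}")
```

This only tells you that the column uses the string extension dtype; it does not produce pandas_schema validation warnings.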

Besides that, awesome work! Really handy package.

@multimeric
Owner

Hmm, so this comes down to the fact that:

>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('<U')
>>> pd.StringDtype()
StringDtype
>>> np.issubdtype(np.dtype(str), pd.StringDtype())
TypeError: Cannot interpret 'StringDtype' as a data type

However, I'm not actually sure why this is the case. I would have thought an official pandas extension dtype would be compatible with the NumPy API. I will look into it, but I'm happy to hear your input on how this should be implemented.
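
One possible direction, sketched here purely as an assumption and not as what pandas_schema does or will do: treat pandas extension dtypes separately and only hand plain NumPy dtypes to np.issubdtype. The helper name is_dtype_compatible is hypothetical, and comparing extension dtypes by equality is just one policy choice among several.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def is_dtype_compatible(actual, required) -> bool:
    """Hypothetical helper: compare dtypes without raising on extension dtypes."""
    # Extension dtypes (such as pd.StringDtype) are not NumPy dtypes, so
    # np.issubdtype cannot interpret them. Compare them by equality instead,
    # keeping the extension dtype on the left-hand side of the comparison.
    if isinstance(required, ExtensionDtype):
        return bool(required == actual)
    if isinstance(actual, ExtensionDtype):
        return bool(actual == required)
    # Both arguments are NumPy dtypes or scalar types: defer to NumPy.
    return bool(np.issubdtype(actual, required))


string_series = pd.Series(["a", "b", "c"], dtype="string")
print(is_dtype_compatible(string_series.dtype, pd.StringDtype()))
print(is_dtype_compatible(string_series.dtype, np.dtype(str)))
```

With this policy a StringDtype column never matches np.dtype(str), so callers would have to request pd.StringDtype() explicitly. Whether that is the right behaviour for IsDtypeValidation is an open design question.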

@multimeric multimeric added the bug label Sep 23, 2020
@chrispijo
Author

Ok. Good to know. Thanks for the quick response.

@chrispijo chrispijo linked a pull request Mar 4, 2021 that will close this issue
@multimeric multimeric linked a pull request Mar 5, 2021 that will close this issue