SchemaError on index name for single-index dataframe when using SchemaModel #323
Comments
Thanks for submitting this bug @kristomi. I think the rationale for this behavior was that we didn't want users to have to name their indexes to specify valid dataframes (@jeffzi, am I getting that correct?), although I think this behavior needs to be amended to support your use case.
Potential Solutions
class Schema(pa.SchemaModel):
    ...
    class Config:
        named_index = True  # default False

class Schema(pa.SchemaModel):
    __index__: pa.typing.Index[int]  # un-named index
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]
I'm sort of leaning toward 2 or 3... the nice thing about 2 is that it opens up support for un-named multiindex. |
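To make the Config switch floated above concrete, here is a minimal pure-Python sketch of how such a flag could decide which index names a schema expects. It is not pandera's implementation: `resolve_index_names` and its `named_index` parameter are hypothetical names used only for illustration, and the single-index branch simply mirrors the behavior described in the bug report (a lone index is registered with the name None).

```python
from typing import List, Optional


def resolve_index_names(
    index_fields: List[str], named_index: bool = False
) -> List[Optional[str]]:
    """Return the index names a schema would expect (hypothetical helper).

    index_fields: names of the Index-annotated fields declared on the model.
    named_index:  hypothetical switch; False reproduces the reported behavior.
    """
    # A lone index loses its name unless the switch is turned on.
    if len(index_fields) == 1 and not named_index:
        return [None]
    # Multi-index levels (or an opted-in single index) keep the field names.
    return list(index_fields)


print(resolve_index_names(["year"]))                    # [None]
print(resolve_index_names(["year"], named_index=True))  # ['year']
print(resolve_index_names(["year", "month"]))           # ['year', 'month']
```

Under this reading, flipping the default to True would make a declared index name part of the contract, which is the direction the later comments in the thread converge on.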
Thanks for the quick response. The way I constructed the invalid dataframe is a very typical workflow for me, where dataframes are passed around and indices are set and reset all the time. Against that background, I would prefer solution nr 3, because indices would then be validated as named by default, unless you construct the schema with your special notation. |
Yes, that's right. I also like the fact that 2. addresses unnamed multiindex and aligns the model API with the standard API. My issue with 3. is that it increases the complexity for newcomers. It would also be confusing in inherited models:
class Schema(pa.SchemaModel):
    __index__: pa.typing.Index[int]  # un-named index
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

class SubSchema(Schema):
    # With current implementation, we would create a MultiIndex !
    # What if we want to name the unnamed __index__?
    year: pa.typing.Index[int]
If 3. is selected, I agree it would be reasonable, and perhaps more natural, to validate index name by default. After all, you do name the index. Edit: I meant "If 2. is selected". |
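The inheritance concern can be reproduced without pandera. The sketch below uses hypothetical stand-ins for `pa.typing.Index` and `pa.typing.Series` and a hypothetical `index_fields` helper; it only assumes that a model collects its index fields from class annotations across the inheritance chain, which is how the comment above describes the current implementation.

```python
from typing import Generic, List, TypeVar, get_origin, get_type_hints

T = TypeVar("T")


class Index(Generic[T]):
    """Hypothetical stand-in for pa.typing.Index."""


class Series(Generic[T]):
    """Hypothetical stand-in for pa.typing.Series."""


def index_fields(model: type) -> List[str]:
    """List every Index-annotated field, inherited ones included."""
    # get_type_hints merges annotations from base classes first.
    hints = get_type_hints(model)
    return [name for name, hint in hints.items() if get_origin(hint) is Index]


class Schema:
    __index__: Index[int]  # un-named index under proposal 3
    value_1: Series[int]
    value_2: Series[int]


class SubSchema(Schema):
    year: Index[int]  # adds a second index instead of naming the first


print(index_fields(Schema))     # ['__index__']
print(index_fields(SubSchema))  # ['__index__', 'year']
```

Because the subclass annotation is merged with the inherited one rather than replacing it, the child model ends up with two index fields, which is the accidental MultiIndex the comment warns about.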
You guys have way more insight than I do, and I clearly see the problem with inheritance now that you mention it. Perhaps nr 2 is a better choice, then. Anyway, I'm not in a position to see all the implications of solving this one way or the other. |
@kristomi np! It's already a great help to report bugs and feedback 🙂 @cosmicBboy Once a decision has been reached, I'd be happy to take care of the changes. I saw you already have a lot on your plate. |
Thanks for the feedback! After thinking about it for a little bit, I'd like to go for solution #2, with a slight addition: Introduce a
|
Thanks @jeffzi! Let me know what you think about the above proposal and if you have any questions/concerns about it |
I would suggest Name validation is mandatory for Series because we need the name to get the column to validate from the DataFrame: That would not be an issue if pandera supported columns order. It is something that I actually wanted to suggest for the machine learning use case. Many ML libraries ignore the column names and rely on the order (possibly casting to a numpy array).
Your example says it should be True by default ❓ I agree with your other points. |
sounds good 👍
woops, yes I meant True by default :) |
Thanks for the awesome SchemaModel interface! That really improves readability and usability.
Describe the bug
I am not able to validate a pandas DataFrame with a single index column when that index is named. The code on line 304 in pandera/model.py seems to explicitly force the schema to have a None-named index when the index only has one column. How can I get around this?
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Expected behavior
I expect this to read the index name for the column, and accept the index. Alternatively, that I can specify in the Config class that I want the schema to accept named indices.
Desktop (please complete the following information):