Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SchemaError on index name for single-index dataframe when using SchemaModel #323

Closed
2 tasks done
kristomi opened this issue Nov 12, 2020 · 9 comments · Fixed by #326
Closed
2 tasks done

SchemaError on index name for single-index dataframe when using SchemaModel #323

kristomi opened this issue Nov 12, 2020 · 9 comments · Fixed by #326
Labels
bug Something isn't working

Comments

@kristomi
Copy link

kristomi commented Nov 12, 2020

Thanks for the awesome SchemaModel interface! That really improves readability and usability.

Describe the bug
I am not able to validate a pandas DataFrame with a single index column when that index is named. The code on line 304 in pandera/model.py seems to explicitly force the schema to have a None-named index when the index only has one column. How can I get around this?

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • [] (optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandera as pa
import pandas as pd

df_invalid = pd.DataFrame(
    {"year": [2019, 2020], "value_1": [1, 2], "value_2": [2, 3]}
).set_index("year")
df_valid = pd.DataFrame({"value_1": [1, 2], "value_2": [2, 3]}, index=[2019, 2020])


class Schema(pa.SchemaModel):
    year: pa.typing.Index[int]
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]


Schema.validate(df_valid)
Schema.validate(df_invalid)

Expected behavior

I expect this to read the index name for the column, and accept the index. Alternatively that I can specify in the Config class that I want the schema to accept named indices.

Desktop (please complete the following information):

  • OS: MacOS
  • Browser Chrome
  • Version 86
@kristomi kristomi added the bug Something isn't working label Nov 12, 2020
@cosmicBboy
Copy link
Collaborator

thanks for submitting this bug @kristomi, I think the rationale for this behavior is was that we didn't want users to have to name their indexes to specify valid dataframes (@jeffzi am I getting that correct?), although I think this behavior needs to be amended to support your use case.

Potential Solutions

  1. Skip the name check for single-indexed dataframes during validation altogether.
  2. Add Config option to accept named indices
class Schema(pa.SchemaModel):
    ...

    class Config:
        named_index: True  # default False
  1. Don't override index name with None in https://github.com/pandera-dev/pandera/blob/master/pandera/model.py#L295 and provide a special name for un-named single indexes, for example:
class Schema(pa.SchemaModel):
    __index__: pa.typing.Index[int]  # un-named index
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

I'm sort of leaning toward 2 or 3... the nice thing about 2 is that it opens up support for un-named multiindex.

Thoughts @kristomi and @jeffzi?

@kristomi
Copy link
Author

Thanks for the quick response.

The way I constructed the invalid dataframe is a very typical workflow for me, where dataframes are passed around and indices are set and reset all the time. On that background, I would prefer solution nr 3, because indices would then be validated as named by default, unless you construct the schema with your special notation.

@jeffzi
Copy link
Collaborator

jeffzi commented Nov 12, 2020

I think the rationale for this behavior is was that we didn't want users to have to name their indexes to specify valid dataframes (@jeffzi am I getting that correct?)

Yes, that's right.

I also like the fact that 2. addresses unnamed multiindex and aligns the model api with the standard api. My issue with 3. is that it increases the complexity for new-comers. It would also be confusing in inherited models:

class Schema(pa.SchemaModel):
    __index__: pa.typing.Index[int]  # un-named index
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]


class SubSchema(Schema):
    # With current implementation, we would create a MultiIndex !
    #  What if we want to name the unnamed __index__?
    year: pa.typing.Index[int]  

On that background, I would prefer solution nr 3, because indices would then be validated as named by default, unless you construct the schema with your special notation.

If 3. is selected, I agree it would be reasonable, and perhaps more natural, to validate index name by default. After all, you do name the index.

Edit: I meant "If 2. is selected".

@kristomi
Copy link
Author

You guys have way more insight than I do, and I clearly see the problem with inheritance now that you mention it. Perhaps nr 2 is a better choice, then. Anyway, I'm not in a position to see all the implications of solving this one way or the other.

@jeffzi
Copy link
Collaborator

jeffzi commented Nov 12, 2020

@kristomi np! It's already a great help to report bugs and feedback 🙂

@cosmicBboy Once a decision has been reached, I'd be happy to take care of the changes. I saw you already have a lot on your plate.

@cosmicBboy
Copy link
Collaborator

thanks for the feedback! After thinking about it for a little bit, I'd like to go for solution # 2, with a slight addition:

Introduce a verify_name kwarg to the SeriesSchemaBase class constructor and verify_names kwarg in MultiIndex constructor

  • verify_name(s) should be an instance property.
  • If True, verify the name of the Series/Index/MultiIndex, otherwise don't.
  • This would be True by default in SeriesSchemaBase and SeriesSchema, False by default in the Index, and False (I think?) by default in MultiIndex
  • The Field field should also add this new kwarg.
  • The SchemaModel Config should have multi_index__verify_names kwarg
  • For MultiIndex, if verify_names==False, modify the checked dataframe here so that the column names are 0-indexed integers. (Note: MultiIndex inherits from DataFrameSchema and uses its underlying validation logic by casting the pd.MultiIndex into a DataFrame... a questionable design decision perhaps, but a vestigial part of the early days of this library when I was lazy and didn't want to re-implement validation for multiindexes 🙃)

My reasoning for this is:

  • I agree with @jeffzi that I'd prefer not to introduce a special concept for un-named indexes, as it degrades usability
  • The MultiIndex object already handles un-named MultiIndexes provided by the user in the validated object by falling back to 0-indexed integer column names. verify_names just adds granular control on whether the user wants to explicitly check these names on the MultiIindex.validate call.
  • This keeps the API tight in the sense that users can control the name-checking behavior of both single indexes and multiindexes, while providing reasonable defaults... typically, I don't really care whether a single index has the correct name, but if I do I just specify pa.Index(verify_name=True) or Field(verify_name=True) in the class-based API. This setup should also support the df_invalid in your example code @kristomi and gracefully handle the inheritance case:
class Schema(pa.SchemaModel):
    year: pa.typing.Index[int]  # by default, pa.Index(verify_name=False) so pandera won't check the "year" name here
    value_1: pa.typing.Series[int]
    value_2: pa.typing.Series[int]

class SubSchema(Schema):
    month: pa.typing.Index[int]  # by default, pa.MultiIndex(verify_names=True) so pandera will construct multiindex here and check index names

@cosmicBboy
Copy link
Collaborator

Once a decision has been reached, I'd be happy to take care of the changes.

Thanks @jeffzi! Let me know what you think about the above proposal and if you have any questions/concerns about it

@jeffzi
Copy link
Collaborator

jeffzi commented Nov 17, 2020

I would suggest check_name instead of verify_name to keep the same vocabulary used elsewhere in pandera.

Name validation is mandatory for Series because we need the name to get the column to validate from the DataFrame:
https://github.com/pandera-dev/pandera/blob/31fe07b30b267527ecea7f6b1435f49b6b963abf/pandera/schemas.py#L937-L941

That would not be an issue if pandera supported columns order. It is something that I actually wanted to suggest for the machine learning use case. Many ML libraries ignore the column names and rely on the order (possibly casting to a numpy array).

False by default in the Index, and False (I think?) by default in MultiIndex

Your example says it should be True by default ❓
Personally I only give a name to a single index when I set it explicitly, which is unnecessary most of the time. I always give a name to a MultiIndex though. True by default for MultiIndex is also closer to pandera's standard API.

I agree with your other points.

@cosmicBboy
Copy link
Collaborator

I would suggest check_name instead of verify_name

sounds good 👍

Your example says it should be True by default

woops, yes I meant True by default :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants