
Bugfix for #158 (#159)

Merged — 13 commits merged into koaning:master on Jul 25, 2019

Conversation

pim-hoeven
Contributor

Fixes the bug that occurred when using the grouped_estimator with a string column as the grouping variable.

Solution: try the sklearn checks without adjustments; if they fail, remove the grouping column from the array or DataFrame and check again.

@koaning
Owner

koaning commented Jul 3, 2019

  1. first of all; thanks for picking this up!
  2. your work is much appreciated. ask matthijs for a reward (this project now has pretty stickers)
  3. i wonder ... when would we want to have the column in the subset? it should be a constant value. could it not be removed in all cases?

@pim-hoeven
Contributor Author

pim-hoeven commented Jul 3, 2019

  1. My pleasure!
  2. @MBrouns can I get a sticker?
  3. Always removing the grouping column was my first approach, but then some other tests failed, for example check_estimators_nan_inf and check_fit2d_predict1d (you cannot drop a column from a 1-d array in that case). That is why I adopted this EAFP approach. Do you agree?

@koaning
Owner

koaning commented Jul 3, 2019

ah yes. those are tricky.

i see two paths going forward and maybe a check with @MBrouns might be good.

  1. We can make an estimator that contains a GroupEstimator in a sklearn.pipeline.Pipeline. The idea is to have a preprocessing step that adds the column that we want to remove. This might cause the tests to pass ... this option is OK, but a bit meh.
  2. Alternatively we might also ignore those two tests in particular. Especially for meta estimators we cannot have the same rules as for a base estimator in sklearn. I would be fine with dropping those two tests in favour of, what I feel, is better behaviour for the model that is being implemented. This option feels best for me at the moment.

The sklearn tests are guidelines that are easy to automate, but they aren't divine gospel. @pim-hoeven feel free to question my thoughts here if you have a better alternative.
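For reference, option 1 (a preprocessing step that re-adds a constant grouping column) could be sketched roughly as follows. LinearRegression stands in for the GroupedEstimator here just to keep the sketch self-contained; in the real setup the inner step would be GroupedEstimator(..., groups=[0]):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression

def add_constant_group(X):
    # Prepend a constant grouping column so the inner estimator
    # always has a column it can consume and drop.
    X = np.asarray(X, dtype=float)
    return np.hstack([np.zeros((X.shape[0], 1)), X])

pipe = Pipeline([
    ("add_group", FunctionTransformer(add_constant_group, validate=False)),
    # Stand-in for GroupedEstimator(...); see lead-in above.
    ("model", LinearRegression()),
])
```

Because the added column is constant, it carries no information and the wrapped model's behaviour is unchanged, which is also why the option feels "a bit meh".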

@pim-hoeven
Contributor Author

I think the second approach makes more sense for two reasons:

  1. It's easier to implement and keeps the GroupEstimator cleaner and less complicated
  2. check_fit2d_predict1d in particular makes no sense in this context: if we don't have a grouping column, why use a GroupEstimator at all?

Regarding check_estimators_nan_inf, we might want to build in checks on the grouping columns as well, specifically that they do not contain missing values.

So, (@MBrouns) my proposed solution would be:

  • Include a __check_group_columns method in the estimator that checks whether the grouping column(s) exist and are valid
  • Do the check_X_y on the non-grouping columns
  • Remove the check_fit2d_predict1d

Do you agree?

@koaning
Owner

koaning commented Jul 3, 2019

Sounds great to me.

sklego/meta.py (resolved)
sklego/meta.py (outdated):
def __validate(self, X, y=None):
    try:
        X_data = X.drop(columns=self.groups, inplace=False)
    except AttributeError:  # np.array
Owner:

it feels cleaner to test this using isinstance(X, pd.DataFrame), would you not agree?

sklego/meta.py (resolved)
tests/test_meta/test_grouped_model.py (resolved)
sklego/meta.py (outdated):

@@ -54,6 +54,47 @@ def __init__(self, estimator, groups, use_fallback=True):
        self.groups = groups
        self.use_fallback = use_fallback

    def __check_group_cols_exist(self, X):
        try:
Owner:

checking the type of X is probably better done via isinstance instead of waiting for an AttributeError.

Collaborator:

I agree that in this case the intent is clearer with an isinstance check.

@koaning
Owner

koaning commented Jul 11, 2019

  1. Great work! This is an improvement.
  2. I did have some comments on this, mainly around checking whether something is a pd.DataFrame or not. My view is that asserting this via isinstance is a better habit. Feel free to challenge me on this.
  3. There are some places where adding comments makes it easier to understand why a certain decision is made.

@pim-hoeven
Contributor Author

  1. Thanks, and thank you for the comments!
  2. I do not have a strong view on this personally. I used to do these checks using isinstance, but read somewhere that the try-except approach is preferred (because, for example, we might in the future want to pass something that behaves like a DataFrame but does not pass the isinstance check). See for example the answers to this SO post. Still, I'm not strongly opinionated on this point (and agree that isinstance is more readable), so if this argument does not convince you then I'll gladly change the try-excepts to isinstance checks.
  3. Agreed, I will add some comments


sklego/meta.py (outdated):

    except AttributeError:
        try:
            ncols = X.shape[1]
        except IndexError:
Collaborator:

is X ever only 1d?

Owner:

I guess it could be. groupby-mean as a model?

sklego/meta.py (outdated; resolved)
@pim-hoeven
Contributor Author

Apologies for the delay; I made the requested changes.

@koaning
Owner

koaning commented Jul 23, 2019

No worries about the delay, it is all volunteer work.

@koaning
Owner

koaning commented Jul 23, 2019

There is something interesting happening in the CI pipeline though. Does it run fine locally?

________ ERROR collecting tests/test_meta/test_estimatortransformer.py _________
ImportError while importing test module '/home/travis/build/koaning/scikit-lego/tests/test_meta/test_estimatortransformer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_meta/test_estimatortransformer.py:2: in <module>
    from pandas.tests.extension.numpy_.test_numpy_nested import np
E   ModuleNotFoundError: No module named 'pandas.tests.extension.numpy_'
___________ ERROR collecting tests/test_meta/test_outlier_remover.py ___________
ImportError while importing test module '/home/travis/build/koaning/scikit-lego/tests/test_meta/test_outlier_remover.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_meta/test_outlier_remover.py:2: in <module>
    from pandas.tests.extension.numpy_.test_numpy_nested import np
E   ModuleNotFoundError: No module named 'pandas.tests.extension.numpy_'

@koaning
Owner

koaning commented Jul 23, 2019

urgh. it seems unrelated to your work, i think pandas has a new version or something.

[screenshot]

@koaning
Owner

koaning commented Jul 23, 2019

yep. it's pandas.

[screenshot]

@pim-hoeven
Contributor Author

No, I didn't see these failures locally, but I only ran the relevant tests. Is the pandas issue something I can or should change?

@koaning
Owner

koaning commented Jul 23, 2019

i haven't looked at your code yet but i'll prolly make a fix tomorrow/day after. once that is merged to master we can try again. :)

@koaning
Owner

koaning commented Jul 23, 2019

@pim-hoeven do check the conversations. it feels like a comment here and there might still be missing.

@koaning
Owner

koaning commented Jul 25, 2019

I am fixing the failing tests here: #167

@koaning
Owner

koaning commented Jul 25, 2019

Also, just to check @pim-hoeven, you've got a sticker right?

@pim-hoeven
Contributor Author

pim-hoeven commented Jul 25, 2019

  • Added some more comments to code
  • Added sticker to laptop
  • Added GroupedEstimator to my project at the client

@koaning koaning merged commit aaaf485 into koaning:master Jul 25, 2019