Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix start of train split in TimeGapSplit and added n_split parameter #324

Merged
merged 12 commits into from May 3, 2020

Conversation

rpauli
Copy link
Contributor

@rpauli rpauli commented Apr 20, 2020

Addresses changes in #192 and #232
I am currently working on a Time Series problem with vibration data where I needed a functionality like the one suggested in #232 so I decided to add it here.

I tried to explain the changes in functionality in the docstring:

Each validation fold doesn't overlap. The entire 'window' moves by 1 valid_duration until there is not enough data.
If this would lead to more splits then specified with n_splits, the 'window' moves by
the validation_duration times the fraction of possible splits and requested splits
-- n_possible_splits = (total_length-train_duration-gap_duration)//valid_duration
-- time_shift = valid_duratiopn n_possible_splits/n_slits
so the CV spans the whole dataset.
If train_duration is not passed but n_split,
the training duration is increased to
-- train_duration = total_length-(self.gap_duration + self.valid_duration * self.n_splits)
such that the shifting the entire window by one validation duration spans the whole training set

The changes are also added to the docs notebook for visualization.

Copy link
Collaborator

@MBrouns MBrouns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rpauli, thanks for the PR!

I like the change you're proposing, I think it's a nice addition to the current functionality. I do have a few questions though:

  • Would it make sense to rename n_splits to max_splits? In the current implementation it seems possible to end up with fewer than n_splits splits, but never more.
  • I've added some small questions on specific parts of the code
  • Could you add some of the checks you do in the notebook as automated pytest tests? Let me know if you need help with that.

sklego/model_selection.py Show resolved Hide resolved
sklego/model_selection.py Outdated Show resolved Hide resolved
sklego/model_selection.py Outdated Show resolved Hide resolved
@rpauli
Copy link
Contributor Author

rpauli commented Apr 20, 2020

Would it make sense to rename n_splits to max_splits? In the current implementation it seems possible to end up with fewer than n_splits splits, but never more.

Fewer shouldnt be possible, I added this check for it:
if n_split_max < self.n_splits

Could you add some of the checks you do in the notebook as automated pytest tests? Let me know if you need help with that.

I'll come up with a suggestion later today, thanks!

To make this explicit:
The prior behavior was to take train_duration and subtract gap_duration to get the train split.
I found this behavior unintuitive and confusing. Although I think I understand why it was done this way I now add the gap between train and validation without subtracting it from train_duration.
if this doesn't work for you I can change it back

doc/timegapsplit.ipynb Outdated Show resolved Hide resolved
sklego/model_selection.py Outdated Show resolved Hide resolved
@koaning
Copy link
Owner

koaning commented Apr 22, 2020

@rpauli I like these changes. Two quick things though.

  1. Could you have a look at the tests? It feels like there's asserts missing
  2. The notebook situation with the docs is not ideal, and part of me is embarassed to say this, but I fear you need to render it for the docs to render.

@koaning
Copy link
Owner

koaning commented May 1, 2020

There is no rush, but just to check; @rpauli are you waiting for feedback from us?

@rpauli
Copy link
Contributor Author

rpauli commented May 2, 2020

I added some more explicit assert statements and aded the strict tag to the tests that are supposed to fail. I noticed you handled that differently in test_timegapsplit_too_big_gap should I also change it to catch the except?

I ran the ipynb before pushing, is this what you mean with render it?

@koaning
Copy link
Owner

koaning commented May 2, 2020

I ran the ipynb before pushing, is this what you mean with render it?

Yes 👍

@koaning
Copy link
Owner

koaning commented May 2, 2020

I've only got one comment about the tests, but it is starting to look green to me. @MBrouns?

@rpauli
Copy link
Contributor Author

rpauli commented May 2, 2020

I reread some pytest docs and it seems I misunderstood how xfail is supposed to be used, changed it to use
with pytest.raise

@MBrouns
Copy link
Collaborator

MBrouns commented May 3, 2020

LGTM! We can merge when @koaning approves as well.

Thanks a lot for the pull request @rpauli! If you reach out to me on Twitter or LinkedIn with your address I'll make sure you'll get some stickers

@koaning koaning merged commit 154f867 into koaning:master May 3, 2020
@koaning
Copy link
Owner

koaning commented May 3, 2020

It is merged, I will now also make a release with this feature in it so that you can use it right away.

@rpauli just to check, have we met in real life at a PyData by any chance? I'm curious to hear how you discovered this package.

@koaning
Copy link
Owner

koaning commented May 3, 2020

Also, @rpauli if you have a twitter handle I can mention you when I announce the update.

@koaning
Copy link
Owner

koaning commented May 3, 2020

And it's live with a version bump: https://pypi.org/project/scikit-lego/0.4.2/#history

@rpauli
Copy link
Contributor Author

rpauli commented May 3, 2020

@rpauli just to check, have we met in real life at a PyData by any chance? I'm curious to hear how you discovered this package.

Saw one of your pyData talks on gaussian processes and outlier detection and found this package with things I also implemented at work (although less structured) so I decided to contribute a bit

@rpauli
Copy link
Contributor Author

rpauli commented May 3, 2020

And it's live with a version bump: https://pypi.org/project/scikit-lego/0.4.2/#history

Great to hear, first open source contribution I wasn't paid for!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants