Fix start of train split in TimeGapSplit and added n_split parameter #324

rpauli · 2020-04-20T08:59:24Z

Addresses changes in #192 and #232
I am currently working on a time series problem with vibration data and needed functionality like the one suggested in #232, so I decided to add it here.

I tried to explain the changes in functionality in the docstring:

Each validation fold doesn't overlap. The entire 'window' moves by 1 valid_duration until there is not enough data.
If this would lead to more splits than specified with n_splits, the 'window' instead moves by
valid_duration times the ratio of possible splits to requested splits
-- n_possible_splits = (total_length - train_duration - gap_duration) // valid_duration
-- time_shift = valid_duration * n_possible_splits / n_splits
so the CV spans the whole dataset.
If n_splits is passed but train_duration is not,
the training duration is increased to
-- train_duration = total_length - (self.gap_duration + self.valid_duration * self.n_splits)
such that shifting the entire window by one validation duration spans the whole training set
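To make the two formulas above concrete, here is a minimal sketch that evaluates them with `datetime.timedelta` values. The function names `plan_window_shift` and `default_train_duration` are made up for illustration and are not part of scikit-lego; the sketch only mirrors the arithmetic in the docstring and does not handle the case where fewer splits are possible than requested.

```python
from datetime import timedelta


def plan_window_shift(total_length, train_duration, valid_duration,
                      gap_duration, n_splits=None):
    """Return (n_possible_splits, time_shift) using the docstring formulas."""
    # Number of non-overlapping validation folds that fit after the first
    # training window and gap (timedelta // timedelta yields an int).
    n_possible_splits = (total_length - train_duration - gap_duration) // valid_duration
    if n_splits is None:
        # No limit requested: the window moves by one validation duration.
        return n_possible_splits, valid_duration
    # Stretch the shift so that n_splits windows span the whole dataset.
    time_shift = valid_duration * n_possible_splits / n_splits
    return n_possible_splits, time_shift


def default_train_duration(total_length, valid_duration, gap_duration, n_splits):
    """Training duration used when only n_splits is given (hypothetical helper)."""
    return total_length - (gap_duration + valid_duration * n_splits)


total = timedelta(days=100)
n_possible, shift = plan_window_shift(
    total,
    train_duration=timedelta(days=30),
    valid_duration=timedelta(days=5),
    gap_duration=timedelta(days=5),
    n_splits=4,
)
print(n_possible, shift)  # 13 possible splits, shift of 16 days 6 hours
print(default_train_duration(total, timedelta(days=5), timedelta(days=5), 4))  # 75 days
```

With 100 days of data, a 30-day training window, and 5-day gap and validation windows, 13 validation folds would fit; asking for 4 splits stretches the shift to 5 × 13 / 4 = 16.25 days.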

The changes are also added to the docs notebook for visualization.

MBrouns

Hi @rpauli, thanks for the PR!

I like the change you're proposing, I think it's a nice addition to the current functionality. I do have a few questions though:

Would it make sense to rename n_splits to max_splits? In the current implementation it seems possible to end up with fewer than n_splits splits, but never more.
I've added some small questions on specific parts of the code
Could you add some of the checks you do in the notebook as automated pytest tests? Let me know if you need help with that.

sklego/model_selection.py

rpauli · 2020-04-20T11:36:45Z

Would it make sense to rename n_splits to max_splits? In the current implementation it seems possible to end up with fewer than n_splits splits, but never more.

Fewer shouldn't be possible; I added this check for it:
if n_split_max < self.n_splits
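The thread shows only the condition, not what happens when it is true. As a rough sketch, assuming the check rejects the configuration by raising an error, it could look like this. The function name, the `ValueError` type, and the message are all assumptions for illustration.

```python
def check_requested_splits(n_split_max, n_splits):
    """Reject a request for more splits than the data allows (illustrative only)."""
    if n_split_max < n_splits:
        # Assumption: the real implementation may use a different exception or message.
        raise ValueError(
            f"Requested {n_splits} splits, but the data only allows {n_split_max}."
        )


check_requested_splits(n_split_max=13, n_splits=4)  # passes silently

try:
    check_requested_splits(n_split_max=3, n_splits=4)
except ValueError as err:
    print(err)
```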

Could you add some of the checks you do in the notebook as automated pytest tests? Let me know if you need help with that.

I'll come up with a suggestion later today, thanks!

To make this explicit:
The prior behavior was to take train_duration and subtract gap_duration to get the train split.
I found this behavior unintuitive and confusing. Although I think I understand why it was done this way, I now add the gap between train and validation without subtracting it from train_duration.
If this doesn't work for you, I can change it back.
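The difference between the two behaviors can be sketched by computing the boundaries of a single window both ways. This is an illustration of the description above, not the library code: the function names are hypothetical, and the assumption that the old validation fold started exactly `train_duration` after the window start is inferred from the comment rather than stated in it.

```python
from datetime import datetime, timedelta


def window_old(start, train_duration, valid_duration, gap_duration):
    """Prior behavior: the gap is carved out of train_duration."""
    train_end = start + train_duration - gap_duration
    valid_start = start + train_duration  # assumed start of the validation fold
    return (start, train_end), (valid_start, valid_start + valid_duration)


def window_new(start, train_duration, valid_duration, gap_duration):
    """New behavior: the full train_duration is kept and the gap is added after it."""
    train_end = start + train_duration
    valid_start = train_end + gap_duration
    return (start, train_end), (valid_start, valid_start + valid_duration)


start = datetime(2020, 1, 1)
args = (start, timedelta(days=10), timedelta(days=2), timedelta(days=1))
print(window_old(*args))  # train covers 9 days, validation starts on Jan 11
print(window_new(*args))  # train covers 10 days, validation starts on Jan 12
```

In both cases the gap separates training data from validation data; the change only affects whether the training split is as long as the `train_duration` the user passed.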

doc/timegapsplit.ipynb

sklego/model_selection.py

koaning · 2020-04-22T20:25:12Z

@rpauli I like these changes. Two quick things though.

Could you have a look at the tests? It feels like there are asserts missing.
The notebook situation with the docs is not ideal, and part of me is embarrassed to say this, but I fear you need to render it for the docs to render.

koaning · 2020-05-01T10:06:43Z

There is no rush, but just to check; @rpauli are you waiting for feedback from us?

rpauli · 2020-05-02T10:30:55Z

I added some more explicit assert statements and added the strict tag to the tests that are supposed to fail. I noticed you handled that differently in test_timegapsplit_too_big_gap. Should I also change it to catch the exception?

I ran the ipynb before pushing, is this what you mean with render it?

koaning · 2020-05-02T13:01:17Z

I ran the ipynb before pushing, is this what you mean with render it?

Yes 👍

tests/test_model_selection/test_timegapsplit.py

koaning · 2020-05-02T13:11:05Z

I've only got one comment about the tests, but it is starting to look green to me. @MBrouns?

rpauli · 2020-05-02T14:17:26Z

I reread some pytest docs and it seems I misunderstood how xfail is supposed to be used, so I changed it to use
with pytest.raises
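For readers unfamiliar with the distinction: `pytest.mark.xfail` marks a whole test as expected to fail, while `pytest.raises` asserts that one specific block raises a given exception. The sketch below shows the second pattern. `count_folds` is a hypothetical stand-in for a splitter, not the scikit-lego API, and the `ValueError` type is an assumption.

```python
import pytest


def count_folds(total_length, train_duration, valid_duration, gap_duration):
    """Hypothetical helper: number of validation folds that fit in the data."""
    if train_duration + gap_duration + valid_duration > total_length:
        raise ValueError("train, gap and validation windows do not fit in the data")
    return (total_length - train_duration - gap_duration) // valid_duration


def test_too_big_gap_raises():
    # The test passes only if the block raises ValueError;
    # any other outcome (no error, different error) fails the test.
    with pytest.raises(ValueError):
        count_folds(total_length=10, train_duration=5, valid_duration=3, gap_duration=5)


def test_valid_configuration():
    assert count_folds(total_length=20, train_duration=5, valid_duration=3, gap_duration=2) == 4
```

Compared with a strict xfail marker, this keeps the expected failure scoped to the exact call that should raise, so an unrelated error elsewhere in the test is not silently accepted.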

MBrouns · 2020-05-03T08:48:56Z

LGTM! We can merge when @koaning approves as well.

Thanks a lot for the pull request @rpauli! If you reach out to me on Twitter or LinkedIn with your address, I'll make sure you get some stickers.

koaning · 2020-05-03T09:07:45Z

It is merged, I will now also make a release with this feature in it so that you can use it right away.

@rpauli just to check, have we met in real life at a PyData by any chance? I'm curious to hear how you discovered this package.

koaning · 2020-05-03T09:08:35Z

Also, @rpauli if you have a twitter handle I can mention you when I announce the update.

koaning · 2020-05-03T10:24:11Z

And it's live with a version bump: https://pypi.org/project/scikit-lego/0.4.2/#history

rpauli · 2020-05-03T14:18:38Z

@rpauli just to check, have we met in real life at a PyData by any chance? I'm curious to hear how you discovered this package.

I saw one of your PyData talks on Gaussian processes and outlier detection and found this package, which has things I had also implemented at work (although less structured), so I decided to contribute a bit.

rpauli · 2020-05-03T14:26:29Z

And it's live with a version bump: https://pypi.org/project/scikit-lego/0.4.2/#history

Great to hear, first open source contribution I wasn't paid for!

rpauli added 2 commits April 20, 2020 10:53

added Examples for n_splits and max_training_set functionality

3515de4

added n_split param and option to use all previous data for train split

ee8dc4d

MBrouns reviewed Apr 20, 2020

sklego/model_selection.py (resolved)

sklego/model_selection.py (outdated, resolved)

sklego/model_selection.py (outdated, resolved)

rpauli added 3 commits April 20, 2020 18:54

Merge branch 'master' into master

4a12fb4

fix typo

4ff85a2

add tests for new functionality

5275abb

koaning requested changes Apr 21, 2020

doc/timegapsplit.ipynb (outdated, resolved)

koaning reviewed Apr 21, 2020

sklego/model_selection.py (outdated, resolved)

rpauli added 2 commits April 22, 2020 08:29

Added a sentence for the examples

46643c1

renamed parameter to window, add underscore

619d43e

rpauli added 3 commits May 2, 2020 12:25

add asserts to tests

80b84b9

ran .ipynb so can be rendered

d089afc

Merge branch 'master' into master

20d0651

koaning reviewed May 2, 2020

tests/test_model_selection/test_timegapsplit.py (outdated, resolved)

change xfail to with pytest.raise for failing test

6afcf0a

Merge branch 'master' of github.com:rpauli/scikit-lego

66834f5

MBrouns approved these changes May 3, 2020

koaning approved these changes May 3, 2020

koaning merged commit 154f867 into koaning:master May 3, 2020
