Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] TimeGapSplit mutates the data #191

Closed
rubenvdg opened this issue Sep 19, 2019 · 11 comments
Closed

[BUG] TimeGapSplit mutates the data #191

rubenvdg opened this issue Sep 19, 2019 · 11 comments
Labels
bug Something isn't working

Comments

@rubenvdg
Copy link

rubenvdg commented Sep 19, 2019

TimeGapSplit mutates the pandas.DataFrame it operates on if the date column is not a datetime type. For example:

from pandas.api.types import (
    is_object_dtype,
    is_datetime64_any_dtype
)

date_range = pd.date_range(start='1/1/2018', end='1/30/2018')
dates = [date.strftime('%m-%d-%Y') for date in date_range]

df = (
    pd.DataFrame(
        data=np.random.randint(0, 30, size=(30, 4)),
        columns=list('ABCy')
    )
    .assign(
        date=dates
    )
)

assert is_object_dtype(df['date'])

cv = TimeGapSplit(
    df=df,
    date_col='date',
    train_duration=timedelta(days=3),
    valid_duration=timedelta(days=1),
)

assert is_datetime64_any_dtype(df['date'])

Is this desirable behavior?

Possible remedies are:

  • Only accept the datetime type
  • Accept str type, but make a copy and leave the pandas.DataFrame as is.

Happy to hear your thoughts @kayhoogland @stephanecollot @koaning

@rubenvdg rubenvdg added the bug Something isn't working label Sep 19, 2019
@rubenvdg rubenvdg changed the title [BUG] TimeGapSplit mutates the data [BUG] TimeGapSplit mutates the data Sep 19, 2019
@stephanecollot
Copy link
Contributor

stephanecollot commented Sep 19, 2019 via email

@koaning
Copy link
Owner

koaning commented Sep 19, 2019

copy feels like a safer option for now. it allows for flexibility in the future.

@rubenvdg
Copy link
Author

I tried to fix this issue by adding

self.df = df.copy().reset_index(drop=True)

in the __init__. However, this results in tests not working.

This has to do with the fact that in the initalization a df is passed and then in the .split() method a different dataframe is passed, c.f., https://github.com/koaning/scikit-lego/blob/master/tests/test_model_selection/test_timegapsplit.py#L23.

Is there a particular reason why you would like to __init__ with a different pd.DataFrame than to which you like to apply the .split() to?

@stephanecollot
Copy link
Contributor

Why do you do .reset_index(drop=True)?
The index is very important, that how it can work with 2 different df -> the X_train usually doesn't contain the date column, and df contains the date (and y for example)

If you just remove reset_index it should work

@stephanecollot
Copy link
Contributor

And in the test train = df.head(25) this is actually testing the fact that you don't mess with the index ^^.

@rubenvdg
Copy link
Author

To (try to) resolve #193.

@rubenvdg
Copy link
Author

rubenvdg commented Sep 20, 2019

But maybe I don't understand the use case that TimeGapSplit tries to solve. The convention in sklearn is to __init__ with some settings (but without data) and then apply the splitter to data in .split(), c.f., https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html.

@stephanecollot
Copy link
Contributor

Yep it seem so.
You can read:
https://scikit-lego.readthedocs.io/en/latest/timegapsplit.html
https://scikit-lego.readthedocs.io/en/latest/api/sklego.html?highlight=TimeGapSplit#sklego.model_selection.TimeGapSplit

This TimeGapSplit needs data in the init, this is why it is scikit-lego.

A remark that can useful for your understand I think: The date column cannot be the index because you could have multiple row per date.

Solve 1 ticket per MR I would suggest.

stephanecollot added a commit to stephanecollot/scikit-lego that referenced this issue Sep 20, 2019
@stephanecollot
Copy link
Contributor

Ok I'm fixing this, with a big performance optimisation.

stephanecollot added a commit to stephanecollot/scikit-lego that referenced this issue Sep 20, 2019
@stephanecollot
Copy link
Contributor

This was fixed and merged in #227

@stephanecollot
Copy link
Contributor

@koaning can you close?

@MBrouns MBrouns closed this as completed Oct 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants