[BUG] TimeGapSplit mutates the data #191
Hi,
Nice catch.
Yep, that is because I do the conversion here:
https://github.com/koaning/scikit-lego/blob/master/sklego/model_selection.py#L39
Actually we don't need the entire dataframe, we just need a Series of the
dates with the same index as the dataframe.
Therefore we could replace the parameters `df, date_col` with `date_serie` =
`df['date']` and remove the conversion.
As for enforcing the type to be `datetime`: why not, but if I'm not mistaken
there are plenty of different datetime types, so it would be nice to keep the
flexibility, as long as it works with pd.DateOffset or timedelta.
Cheers
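To make the Series-based idea above concrete, here is a rough sketch. The helper name `as_datetime_series` is hypothetical and not part of scikit-lego; it only assumes that `pd.to_datetime` returns a new Series and that any datetime64 flavour should be accepted as-is.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def as_datetime_series(date_serie: pd.Series) -> pd.Series:
    """Return a datetime-typed Series that shares the input's index.

    Hypothetical helper for illustration only, not scikit-lego code.
    """
    if is_datetime64_any_dtype(date_serie):
        # Already some datetime type: nothing to convert.
        return date_serie
    # pd.to_datetime returns a new Series; the input is left untouched.
    return pd.to_datetime(date_serie)


# Several rows may share a date, so the dates stay a column, not the index.
df = pd.DataFrame(
    {"date": ["2019-09-01", "2019-09-01", "2019-09-02"], "y": [1, 2, 3]}
)
dtype_before = df["date"].dtype
dates = as_datetime_series(df["date"])
```

With this shape the splitter would only ever see `dates`, so the caller's `df['date']` keeps whatever dtype it had.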
…On Thu, 19 Sep 2019 at 09:35, Ruben van de Geer ***@***.***> wrote:
TimeGapSplit mutates the pandas.DataFrame it operates on if the date
column is not a datetime type. For example:
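from datetime import timedelta

import numpy as np
import pandas as pd
from pandas.api.types import (
    is_object_dtype,
    is_datetime64_any_dtype
)
from sklego.model_selection import TimeGapSplit

# Assumption: the report does not show how `dates` was built.
# Date strings are used here so that the first assertion holds.
dates = [d.strftime('%Y-%m-%d') for d in pd.date_range('2019-01-01', periods=30)]

df = (
    pd.DataFrame(
        data=np.random.randint(0, 30, size=(30, 4)),
        columns=list('ABCy')
    )
    .assign(
        date=dates
    )
)
# Before: the date column holds strings (object dtype)
assert is_object_dtype(df['date'])
cv = TimeGapSplit(
    df=df,
    date_col='date',
    train_duration=timedelta(days=3),
    valid_duration=timedelta(days=1),
)
# After: constructing the splitter has converted the caller's column
assert is_datetime64_any_dtype(df['date'])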
Is this desirable behavior?
Possible remedies are:
- Only accept the datetime type
- Accept str type, but make a copy and leave the pandas.DataFrame as
is.
Happy to hear your thoughts @kayhoogland <https://github.com/kayhoogland>
@stephanecollot <https://github.com/stephanecollot> @koaning
<https://github.com/koaning>
|
copy feels like a safer option for now. it allows for flexibility in the future. |
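For reference, a minimal sketch of what the copy option could look like. `CopyingSplit` is a hypothetical stand-in for a splitter that takes its data in `__init__`; it is not the actual TimeGapSplit code, and it relies only on `DataFrame.assign` returning a new frame.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


class CopyingSplit:
    """Hypothetical stand-in for a splitter that needs data in __init__."""

    def __init__(self, df: pd.DataFrame, date_col: str):
        # assign() builds a new DataFrame, so the caller's frame
        # keeps its original date column and dtype.
        self.df = df.assign(**{date_col: pd.to_datetime(df[date_col])})
        self.date_col = date_col


raw = pd.DataFrame({"date": ["2019-09-01", "2019-09-02"], "y": [0, 1]})
dtype_before = raw["date"].dtype
cv = CopyingSplit(df=raw, date_col="date")
```

The cost is one extra copy of the frame per splitter, which seems an acceptable trade for not surprising the caller.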
I tried to fix this issue by adding
in the This has to do with the fact that in the initialization a Is there a particular reason why you would like to |
Why do you do If you just remove reset_index it should work |
And in the test |
To (try to) resolve #193. |
But maybe I don't understand the use case that |
Yep, it seems so. This TimeGapSplit needs data in the init, which is why it is in scikit-lego. A remark that may be useful for your understanding: the date column cannot be the index, because you could have multiple rows per date. I would suggest solving one ticket per MR. |
Ok I'm fixing this, with a big performance optimisation. |
This was fixed and merged in #227 |
@koaning can you close? |