DataFrame.apply adds a frequency to a freq=None DatetimeIndex as a side-effect #22150

Closed
dycw opened this issue Aug 1, 2018 · 8 comments

@dycw

commented Aug 1, 2018

Code Sample, a copy-pastable example if possible

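import numpy as np
import pandas as pd

def sudden_frequency(num_columns):
    # Built from plain strings, so the index starts out with freq=None.
    index = pd.DatetimeIndex(["1950-06-30", "1952-10-24", "1953-05-29"])
    columns = list(range(num_columns))
    df = pd.DataFrame(np.random.random((len(index), num_columns)), index, columns)
    # An identity apply should leave the index untouched.
    df.apply(lambda sr: sr)
    return index

# Report the index (including its freq) for 0 to 4 columns.
for num_columns in range(5):
    print(num_columns, "--", sudden_frequency(num_columns))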

Output:

0 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq=None)
1 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq=None)
2 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq='WOM-4FRI')
3 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq='WOM-4FRI')
4 -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq='WOM-4FRI')

Problem description

This particular index (found by hypothesis) suddenly gains a frequency when it is used in a DataFrame with >= 2 columns on which ".apply" is then called.

Expected Output

n -- DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq=None)

for all n.
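
The expected behaviour can also be written as a plain check rather than read off the printed output. This is a minimal sketch, not part of the original report; the helper name `freq_survives_apply` is hypothetical. Going by the output above, pandas 0.23.3 should report False for n >= 2, while a release containing a fix should report True for every n.

```python
import numpy as np
import pandas as pd


def freq_survives_apply(index, num_columns):
    """Return True if index.freq is the same before and after df.apply.

    Hypothetical helper, written only to illustrate the expected output.
    """
    before = index.freq
    df = pd.DataFrame(
        np.random.random((len(index), num_columns)),
        index=index,
        columns=range(num_columns),
    )
    df.apply(lambda sr: sr)  # identity apply, as in the report
    return index.freq == before


index = pd.DatetimeIndex(["1950-06-30", "1952-10-24", "1953-05-29"])
results = {n: freq_survives_apply(index, n) for n in range(1, 5)}
print(results)
```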

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.16.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C
LOCALE: None.None

pandas: 0.23.3
pytest: 3.6.4
pip: 18.0
setuptools: 39.2.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd

Member

commented Aug 1, 2018

Can you provide a more minimal example to reproduce the issue?

@WillAyd added the Needs Info label Aug 1, 2018

@dycw

Author

commented Aug 3, 2018

Yes. I have reduced the conditions to the following:

  1. The DatetimeIndex is of length >= 3.
  2. The DatetimeIndex has an inferrable frequency.
  3. The DataFrame has >= 2 columns.
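
Condition 2 can be checked separately from the stored `freq` attribute with `pandas.infer_freq`, which looks only at the values. The sketch below is an illustration added for clarity; it uses the daily example index from this comment and builds it from a plain list so that no frequency is stored on it. `infer_freq` itself needs at least three dates, which is consistent with condition 1.

```python
import pandas as pd

# Build the index from a list of timestamps so that no freq is stored on it.
dr = pd.date_range("2000-01-03", periods=3, freq="D")
index = pd.DatetimeIndex(list(dr))

stored = index.freq              # what the index object carries
inferred = pd.infer_freq(index)  # what the values alone imply

print("stored:", stored, "inferred:", inferred)
```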

from hypothesis import given
from hypothesis.strategies import composite, dates, integers, sampled_from
from pandas import DataFrame, DatetimeIndex, Timestamp, date_range


@composite
def indices(draw, max_length=5):
    date = draw(
        dates(
            min_value=Timestamp.min.ceil("D").to_pydatetime().date(),
            max_value=Timestamp.max.floor("D").to_pydatetime().date(),
        ).map(Timestamp)
    )
    periods = draw(integers(0, max_length))
    freq = draw(sampled_from(list("BDHTS")))
    dr = date_range(date, periods=periods, freq=freq)
    return DatetimeIndex(list(dr))


@given(index=indices(5), num_columns=integers(0, 5))
def test_main(index, num_columns):
    original = index.copy()
    df = DataFrame(True, index=index, columns=range(num_columns))
    df.apply(lambda x: x)
    assert index.freq == original.freq

One example is

index = DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq='D'), num_columns = 2

    @given(index=indices(5), num_columns=integers(0, 5))
    def test_main(index, num_columns):
        original = index.copy()
        df = DataFrame(True, index=index, columns=range(num_columns))
        df.apply(lambda x: x)
>       assert index.freq == original.freq
E       AssertionError: assert <Day> == None
E        +  where <Day> = DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq='D').freq
E        +  and   None = DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq=None).freq

@mroeschke

Member

commented Aug 3, 2018

Thanks @dycw. I can reproduce with a similar example:

In [1]: index = pd.DatetimeIndex(["1950-06-30", "1952-10-24", "1953-05-29"])

In [2]: df = pd.DataFrame(1, index=index, columns=range(2))

In [3]: index
Out[3]: DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq=None)

In [4]: df.apply(lambda x: x)
Out[4]:
            0  1
1950-06-30  1  1
1952-10-24  1  1
1953-05-29  1  1

# Gains a frequency
In [5]: index
Out[5]: DatetimeIndex(['1950-06-30', '1952-10-24', '1953-05-29'], dtype='datetime64[ns]', freq='WOM-4FRI')

In [6]: pd.__version__
Out[6]: '0.23.3'

#14927 may be playing a role here somewhere. Investigation and PRs are always welcome!

@HannahFerch

Contributor

commented Aug 10, 2018

I would like to look into this, but I am quite new to open source. Can I ask questions here if I get stuck, or should I use another place, e.g. Gitter?

@WillAyd

Member

commented Aug 10, 2018

@HannahFerch for high-level questions, you can ask here or on Gitter. For detailed code review, it is easiest to push a PR and get feedback directly on that.

@HannahFerch

Contributor

commented Aug 10, 2018

@WillAyd Makes sense. Thanks!

@HannahFerch

Contributor

commented Aug 24, 2018

I have been looking at @mroeschke's example. The frequency is set in pandas.core.apply.FrameRowApply, when the results are wrapped by wrap_results_for_axis(). This calls self.obj._constructor, which returns a result whose index has freq='WOM-4FRI' instead of the original freq=None.
Should the frequency be prevented from being set at this point, or should it be reset to the original None later, before the df is returned?
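
One way to narrow this down is to take `apply` out of the picture and rebuild the frame by hand. The sketch below rests on an assumption: that the wrapping step amounts to passing a dict of per-column Series to the DataFrame constructor. `pd.DataFrame(results)` is only a stand-in for `self.obj._constructor`, not the actual internal call. If `after` differs from `before` on an affected version, the constructor is where the index is mutated.

```python
import pandas as pd

index = pd.DatetimeIndex(["1950-06-30", "1952-10-24", "1953-05-29"])
df = pd.DataFrame(1, index=index, columns=range(2))

# Assumption: collect one Series per column, keyed by position, to imitate
# the results that apply gathers before wrapping them.
results = {i: df[col] for i, col in enumerate(df.columns)}

before = index.freq
rebuilt = pd.DataFrame(results)  # stand-in for self.obj._constructor
after = index.freq

print("before:", before, "after:", after)
```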

@mroeschke

Member

commented Aug 24, 2018

It would be better to prevent self.obj._constructor from setting a new freq.
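
Until the constructor is fixed, callers on an affected version can protect their own index by handing the DataFrame a copy. This is a workaround sketch rather than the fix discussed here; it relies on the failing test earlier in the thread, where `index.copy()` kept freq=None while the original index changed.

```python
import pandas as pd

index = pd.DatetimeIndex(["1950-06-30", "1952-10-24", "1953-05-29"])

# Give the DataFrame its own copy, so that any freq set as a side effect
# lands on the copy rather than on the caller's index object.
df = pd.DataFrame(1, index=index.copy(), columns=range(2))
df.apply(lambda x: x)

print(index.freq)
```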

@jreback added this to the 0.24.0 milestone Sep 4, 2018

HannahFerch added a commit to HannahFerch/pandas that referenced this issue Sep 9, 2018

HannahFerch added a commit to HannahFerch/pandas that referenced this issue Sep 16, 2018

Merge remote-tracking branch 'upstream/master' into pandas-dev#22150
# Conflicts:
#	doc/source/whatsnew/v0.24.0.txt

HannahFerch added a commit to HannahFerch/pandas that referenced this issue Sep 16, 2018

jreback added a commit that referenced this issue Sep 18, 2018
