
BUG: behavior change in dataframe.shift(, freq=infer) when upgrading from pandas 1.1.5 to 1.2.x #40799

Open
laurent-mutricy opened this issue Apr 6, 2021 · 14 comments
Labels
Bug Frequency DateOffsets Regression Functionality that used to work in a prior pandas version

Comments

@laurent-mutricy

laurent-mutricy commented Apr 6, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
import io

# replicate the typical data I am reading with pd.read_csv
dates = np.arange(0, 600.1, 0.1)
dateString = "time, count\n"
for count, time in enumerate(dates):
    dateString = dateString + "{:0.3f}, {}\n".format(time, count)
# create the dataframe, convert 'time' to datetime and set it as the index
df = pd.read_csv(io.BytesIO(dateString.encode()), sep=',')
df['date'] = pd.to_datetime(df.time, unit='s')
df.set_index('date', inplace=True, drop=False)
df.shift(1, freq='infer')  # <-- this works with pandas 1.1.5 but no longer with 1.2.0 and above
Traceback (most recent call last):
  File "debugPandas.py", line 23, in <module>
    df.shift(1, freq='infer')
  File "\lib\site-packages\pandas\core\frame.py", line 4600, in shift
    return super().shift(
  File "\lib\site-packages\pandas\core\generic.py", line 9453, in shift
    raise ValueError(msg)
ValueError: Freq was not set in the index hence cannot be inferred

Problem description

I am reading time-series results from another software package with pd.read_csv, and I put the time vector, converted to datetime, in the index. At some stage in the data processing I need to use dataframe.shift with freq='infer'. This worked fine with pandas 1.1.5 but stopped working after upgrading to pandas 1.2.0, raising a ValueError. I could not find any reference to this in the change log, so I guess it might be considered a regression.
I worked around it by forcing the proper frequency in the shift call instead of using infer.
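For reference, a minimal sketch of that workaround (the pd.Timedelta spelling is my reconstruction, not the exact code; the 100 ms value matches the data above):

```python
import numpy as np
import pandas as pd

# Sketch of the workaround: pass the known period explicitly instead of
# relying on freq="infer".
idx = pd.to_datetime(np.arange(0, 0.5, 0.1).round(3), unit="s")
s = pd.Series(range(len(idx)), index=idx)
shifted = s.shift(1, freq=pd.Timedelta("100ms"))  # explicit frequency, no inference
```

With an explicit freq, shift moves the index rather than the data, so no values are dropped and no inference is attempted.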

Investigations

Going into more depth on the issue, using the above dataframe df:

print(df.index.inferred_freq)
print(pd._libs.algos.unique_deltas(df.index.asi8))

returns for pandas 1.1.5

100L
[100000000]

and for pandas 1.2.x

None
[ 99999999 100000000 100000001]

There is some kind of rounding issue when converting the float series to a datetime series. This does not happen if the index is created directly from the dates ndarray, so there might be something in the read_csv part.
This works fine whatever the version of pandas:

dates = np.arange(0, 600.1, 0.1)
df = pd.DataFrame(data=dates, columns=['time'])
df['date'] = pd.to_datetime(df.time, unit='s')
df.set_index('date', inplace=True, drop=False)
df.shift(1, freq='infer')  # <-- works with pandas 1.2.x
print(df.index.inferred_freq)  # --> 100L
print(pd._libs.algos.unique_deltas(df.index.asi8))  # --> [100000000]
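The asymmetry can be illustrated without pandas at all: a value parsed from a 3-decimal CSV string and the "same" value produced by binary float arithmetic differ in their last bits, even though both print identically at CSV precision (illustrative values, not the exact ones from the report):

```python
# Two representations of "0.3": one parsed from a decimal string (as
# read_csv would produce), one accumulated by binary arithmetic (as
# np.arange does). They differ in the last bits...
parsed = float("0.300")   # nearest double, slightly below 0.3
computed = 0.1 * 3        # nearest double, slightly above 0.3
assert parsed != computed
# ...yet both round-trip identically through the CSV's 3-decimal format:
assert f"{parsed:0.3f}" == f"{computed:0.3f}" == "0.300"
```

Depending on which side of a nanosecond boundary such a value lands, an integer-nanosecond conversion can come out one tick high or low, which is enough to break frequency inference.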

Output of pd.show_versions()

for pandas 1.2.3

INSTALLED VERSIONS

commit : f2c8480
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.2.3
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

for pandas 1.1.5

INSTALLED VERSIONS

commit : b5958ee
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.1.5
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@laurent-mutricy laurent-mutricy added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 6, 2021
@phofl
Member

phofl commented Apr 6, 2021

I think the freq inference changed with 1.2. Do you know if this is expected now @jbrockmendel ?

Labelling as a regression for now

@phofl phofl added Frequency DateOffsets Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 6, 2021
@phofl phofl added this to the 1.2.4 milestone Apr 6, 2021
@jbrockmendel
Member

DataFrame.shift still accepts freq="infer". I think the change is in pd.to_datetime(dates).inferred_freq now being None, likely due to floating point error treatment changing somewhere.

@laurent-mutricy
Author

pd.to_datetime(dates, unit='s').inferred_freq gives '100L' which is the expected answer.
pd.to_datetime(dates).inferred_freq was already giving None with pandas 1.1.5
You need to use pd.read_csv to reproduce the issue. The floating point error may be located within the dataframe constructor.
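A user-side workaround that sidesteps the float path entirely (my own sketch, not something pandas documents for this case) is to convert the float seconds to integer milliseconds before calling to_datetime:

```python
import numpy as np
import pandas as pd

# Convert float seconds to exact integer milliseconds first; the rounding
# then happens once, under our control, and to_datetime only sees integers.
time = np.arange(0, 600.1, 0.1)
ms = (time * 1000).round().astype("int64")
idx = pd.to_datetime(ms, unit="ms")
# The resulting index has a single uniform 100 ms step, so freq
# inference works again.
```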

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Apr 12, 2021
@simonjayhawkins
Member

first bad commit: [c1484b1] PERF: pd.to_datetime, unit='s' much slower for float64 than for int64 (#35027)

@simonjayhawkins
Member

moving to 1.2.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.2.4, 1.2.5 Apr 12, 2021
@simonjayhawkins
Member

more minimal example

>>> idx = pd.to_datetime(
...     np.array([f"{x:0.3f}" for x in np.arange(0, 600.1, 0.1)], dtype="float64"), unit="s"
... )
>>> 
>>> pd.Series(range(6001),index=idx).shift(1, freq="infer")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simon/pandas/pandas/core/series.py", line 4849, in shift
    return super().shift(
  File "/home/simon/pandas/pandas/core/generic.py", line 9207, in shift
    raise ValueError(msg)
ValueError: Freq was not set in the index hence cannot be inferred
>>> 

pd.to_datetime can handle the low-precision strings, but not once they have been converted to a numpy float array. The PR that caused the regression changed the conversion from floats.

>>> 
>>> idx = pd.to_datetime([f"{x:0.3f}" for x in np.arange(0, 600.1, 0.1)], unit="s")
>>> 
>>> pd.Series(range(6001),index=idx).shift(1, freq="infer")
1970-01-01 00:00:00.100       0
1970-01-01 00:00:00.200       1
1970-01-01 00:00:00.300       2
1970-01-01 00:00:00.400       3
1970-01-01 00:00:00.500       4
                           ... 
1970-01-01 00:09:59.700    5996
1970-01-01 00:09:59.800    5997
1970-01-01 00:09:59.900    5998
1970-01-01 00:10:00.000    5999
1970-01-01 00:10:00.100    6000
Length: 6001, dtype: int64
>>> 

@simonjayhawkins
Member

@jbrockmendel thoughts on this one?

@jbrockmendel
Member

The only way around this that I see is reverting #35027. Maybe parts of it can be salvaged?

@simonjayhawkins
Member

#35027 did give a very respectable perf gain, but the changes in the tests that were included with that PR indicated that there were some numerical differences.

We have not had other reports, suggesting that these numerical differences are not generally an issue. This bug report shows that where numerical precision is limited, they become relevant.

So on the one hand, I don't think we should revert #35027, but on the other, I am not sure how to fix this either.

@simonjayhawkins
Member

we should maybe change the title of this issue, as the current title is misleading.

@laurent-mutricy
Author

Tracing the calls to to_datetime from @simonjayhawkins's minimal example, there are not many differences. The most interesting one, in my opinion, is when datetimes.py is called:

 --- modulename: datetimes, funcname: _convert_listlike_datetimes
datetimes.py(309):     if isinstance(arg, (list, tuple)):
datetimes.py(310):         arg = np.array(arg, dtype="O")

The arg = np.array(arg, dtype="O") line is not executed for idx = pd.to_datetime(np.array([f"{x:0.3f}" for x in np.arange(0, 600.1, 0.1)], dtype="float64"), unit="s")

With the following dirty hack to core/tools/datetimes.py around line 310:

    import numpy
    if isinstance(arg, (list, tuple, numpy.ndarray)):
        arg = np.array(arg, dtype="O")

the bug is gone.
I hope this helps

@simonjayhawkins
Member

not checked, but guessing this works since it doesn't use the faster code for floats.

interestingly, I guess if we are converting a list or tuple to an object array, we may be missing the perf gain for floats in some cases.

It may be that we need to patch the code at the read_csv stage so that an object array is always passed to pd.to_datetime, to account for the lower precision of the CSV's string-based storage (although that might not always be the case). I assume a float array is currently passed, otherwise we would not be getting the error?
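Untested beyond a quick sketch, but the object-dtype detour from the hack above can also be applied on the user side, before the data ever reaches the fast float path (a hypothetical workaround, not a proposed fix):

```python
import numpy as np
import pandas as pd

# Route the float values through an object array so pd.to_datetime takes
# the element-wise (slower, but precision-friendlier) path.
values = np.array([0.0, 0.1, 0.2], dtype="float64")
idx = pd.to_datetime(values.astype(object), unit="s")
```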

@laurent-mutricy
Author

f"{x:0.14f}" still fails, but f"{x:0.15f}" does not reproduce the bug.
I am not sure I understand what the speed-up of #35027 relies on. Is it the .astype("M8[ns]", copy=False) instead of .astype(object)? If so, shouldn't we pass the bug over to the numpy people?

@simonjayhawkins
Member

@pandas-dev/pandas-core not sure what to do here. I'm removing this from the 1.2.5 milestone so that we can proceed with the 1.2.5 release.

@simonjayhawkins simonjayhawkins modified the milestones: 1.2.5, Contributions Welcome Jun 16, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022