
BUG: behavior change in dataframe.shift(, freq=infer) when upgrading from pandas 1.1.5 to 1.2.x #40799

Open
laurent-mutricy opened this issue Apr 6, 2021 · 14 comments
Labels
Bug Frequency DateOffsets Regression Functionality that used to work in a prior pandas version

Comments

@laurent-mutricy

laurent-mutricy commented Apr 6, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
import io

# replicate the typical data I am reading with pd.read_csv
dates = np.arange(0, 600.1, 0.1)
dateString = "time, count\n"
for count, time in enumerate(dates):
    dateString = dateString + "{:0.3f}, {}\n".format(time, count)
# create the dataframe, convert 'time' to datetime and set it as the index
df = pd.read_csv(io.BytesIO(dateString.encode()), sep=',')
df['date'] = pd.to_datetime(df.time, unit='s')
df.set_index('date', inplace=True, drop=False)
df.shift(1, freq='infer')  # <-- this works with pandas 1.1.5 but no longer with 1.2.0 and above
Traceback (most recent call last):
  File "debugPandas.py", line 23, in <module>
    df.shift(1, freq='infer')
  File "\lib\site-packages\pandas\core\frame.py", line 4600, in shift
    return super().shift(
  File "\lib\site-packages\pandas\core\generic.py", line 9453, in shift
    raise ValueError(msg)
ValueError: Freq was not set in the index hence cannot be inferred

Problem description

I am reading time-series results from another software package with pd.read_csv, and I put the time vector, converted to datetime, in the index. At some stage in the data processing I need to use dataframe.shift with freq='infer'. This worked fine with pandas 1.1.5 but stopped working after upgrading to pandas 1.2.0, raising a ValueError. I could not find any reference to this in the change log, so I guess it might be considered a regression.
I worked around it by forcing the proper frequency in the shift call instead of using infer.
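For reference, a minimal sketch of that workaround (the pd.Timedelta spelling is my reconstruction, not the exact code; the 100 ms value matches the data above):

```python
import numpy as np
import pandas as pd

# Sketch of the workaround: pass the known period explicitly instead of
# relying on freq="infer".
idx = pd.to_datetime(np.arange(0, 0.5, 0.1).round(3), unit="s")
s = pd.Series(range(len(idx)), index=idx)
shifted = s.shift(1, freq=pd.Timedelta("100ms"))  # explicit frequency, no inference
```

With an explicit freq, shift moves the index rather than the data, so no values are dropped and no inference is attempted.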

Investigations

Going into more depth on the issue, using the above dataframe df:

print(df.index.inferred_freq)
print(pd._libs.algos.unique_deltas(df.index.asi8))

returns for pandas 1.1.5

100L
[100000000]

and for pandas 1.2.x

None
[ 99999999 100000000 100000001]

There is some kind of rounding issue when converting the float series to a datetime series. This does not happen if the index is created directly from the dates ndarray, so there might be something in the read_csv part.
This works fine whatever the version of pandas:

dates = np.arange(0, 600.1, 0.1)
df = pd.DataFrame(data=dates, columns=['time'])
df['date'] = pd.to_datetime(df.time, unit='s')
df.set_index('date', inplace=True, drop=False)
df.shift(1, freq='infer')  # <-- works with pandas 1.2.x
print(df.index.inferred_freq)  # --> 100L
print(pd._libs.algos.unique_deltas(df.index.asi8))  # --> [100000000]
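The asymmetry can be illustrated without pandas at all: a value parsed from a 3-decimal CSV string and the "same" value produced by binary float arithmetic differ in their last bits, even though both print identically at CSV precision (illustrative values, not the exact ones from the report):

```python
# Two representations of "0.3": one parsed from a decimal string (as
# read_csv would produce), one accumulated by binary arithmetic (as
# np.arange does). They differ in the last bits...
parsed = float("0.300")   # nearest double, slightly below 0.3
computed = 0.1 * 3        # nearest double, slightly above 0.3
assert parsed != computed
# ...yet both round-trip identically through the CSV's 3-decimal format:
assert f"{parsed:0.3f}" == f"{computed:0.3f}" == "0.300"
```

Depending on which side of a nanosecond boundary such a value lands, an integer-nanosecond conversion can come out one tick high or low, which is enough to break frequency inference.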

Output of pd.show_versions()

for pandas 1.2.3

INSTALLED VERSIONS

commit : f2c8480
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.2.3
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

for pandas 1.1.5

INSTALLED VERSIONS

commit : b5958ee
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.1.5
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@laurent-mutricy laurent-mutricy added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 6, 2021
@phofl
Member

phofl commented Apr 6, 2021

I think the freq inference changed with 1.2. Do you know if this is expected now @jbrockmendel ?

Labelling as a regression for now

@phofl phofl added Frequency DateOffsets Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 6, 2021
@phofl phofl added this to the 1.2.4 milestone Apr 6, 2021
@jbrockmendel
Member

DataFrame.shift still accepts freq="infer". I think the change is in pd.to_datetime(dates).inferred_freq now being None, likely due to floating point error treatment changing somewhere.

@laurent-mutricy
Author

pd.to_datetime(dates, unit='s').inferred_freq gives '100L' which is the expected answer.
pd.to_datetime(dates).inferred_freq was already giving None with pandas 1.1.5
You need to use pd.read_csv to reproduce the issue. The floating point error may be located within the dataframe constructor.
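A user-side workaround that sidesteps the float path entirely (my own sketch, not something pandas documents for this case) is to convert the float seconds to integer milliseconds before calling to_datetime:

```python
import numpy as np
import pandas as pd

# Convert float seconds to exact integer milliseconds first; the rounding
# then happens once, under our control, and to_datetime only sees integers.
time = np.arange(0, 600.1, 0.1)
ms = (time * 1000).round().astype("int64")
idx = pd.to_datetime(ms, unit="ms")
# The resulting index has a single uniform 100 ms step, so freq
# inference works again.
```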

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Apr 12, 2021
@simonjayhawkins
Member

first bad commit: [c1484b1] PERF: pd.to_datetime, unit='s' much slower for float64 than for int64 (#35027)

@simonjayhawkins
Member

moving to 1.2.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.2.4, 1.2.5 Apr 12, 2021
@simonjayhawkins
Member

more minimal example

>>> idx = pd.to_datetime(
...     np.array([f"{x:0.3f}" for x in np.arange(0, 600.1, 0.1)], dtype="float64"), unit="s"
... )
>>> 
>>> pd.Series(range(6001),index=idx).shift(1, freq="infer")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simon/pandas/pandas/core/series.py", line 4849, in shift
    return super().shift(
  File "/home/simon/pandas/pandas/core/generic.py", line 9207, in shift
    raise ValueError(msg)
ValueError: Freq was not set in the index hence cannot be inferred
>>> 

pd.to_datetime can handle the low-precision strings, but not once they have been converted to a numpy float array. The PR that caused the regression changed the conversion from floats.

>>> 
>>> idx = pd.to_datetime([f"{x:0.3f}" for x in np.arange(0, 600.1, 0.1)], unit="s")
>>> 
>>> pd.Series(range(6001),index=idx).shift(1, freq="infer")
1970-01-01 00:00:00.100       0
1970-01-01 00:00:00.200       1
1970-01-01 00:00:00.300       2
1970-01-01 00:00:00.400       3
1970-01-01 00:00:00.500       4
                           ... 
1970-01-01 00:09:59.700    5996
1970-01-01 00:09:59.800    5997
1970-01-01 00:09:59.900    5998
1970-01-01 00:10:00.000    5999
1970-01-01 00:10:00.100    6000
Length: 6001, dtype: int64
>>> 

@simonjayhawkins
Member

@jbrockmendel thoughts on this one?

@jbrockmendel
Member

The only way around this that I see is reverting #35027. Maybe parts of it can be salvaged?

@simonjayhawkins
Member

#35027 did give a very respectable perf gain, but the changes in the tests that were included with that PR indicated that there were some numerical differences.

We have not had other reports, suggesting that these numerical differences are not generally an issue. This bug report shows that where numerical precision is limited, they become relevant.

So on the one hand, I don't think we should revert #35027, but on the other, I am not sure how to fix this either.

@simonjayhawkins
Member

we should maybe change the title of this issue, as the current title is misleading.

@laurent-mutricy
Author

Tracing the calls to to_datetime from @simonjayhawkins's minimal example, there are not many differences. The most interesting one, in my opinion, is when datetimes.py is called:

 --- modulename: datetimes, funcname: _convert_listlike_datetimes
datetimes.py(309):     if isinstance(arg, (list, tuple)):
datetimes.py(310):         arg = np.array(arg, dtype="O")

The arg = np.array(arg, dtype="O") line is not executed for idx = pd.to_datetime(np.array([f"{x:0.3f}" for x in np.arange(0, 600.1, 0.1)], dtype="float64"), unit="s")

With the following dirty hack to core/tools/datetimes.py around line 310:

    import numpy
    if isinstance(arg, (list, tuple, numpy.ndarray)):
        arg = np.array(arg, dtype="O")

the bug is gone.
I hope this helps

@simonjayhawkins
Member

not checked, but guessing this works since it doesn't use the faster code for floats.

interestingly, I guess if we are converting a list or tuple to an object array, we may be missing the perf gain for floats in some cases.

It may be that we need to patch the code at the read_csv stage so that an object array is always passed to pd.to_datetime, to account for the lower precision of the CSV's string-based storage (although that might not always be the case). I assume a float array is currently passed, otherwise we would not be getting the error?
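Untested beyond a quick sketch, but the object-dtype detour from the hack above can also be applied on the user side, before the data ever reaches the fast float path (a hypothetical workaround, not a proposed fix):

```python
import numpy as np
import pandas as pd

# Route the float values through an object array so pd.to_datetime takes
# the element-wise (slower, but precision-friendlier) path.
values = np.array([0.0, 0.1, 0.2], dtype="float64")
idx = pd.to_datetime(values.astype(object), unit="s")
```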

@laurent-mutricy
Author

f"{x:0.14f}" still fails, but f"{x:0.15f}" does not reproduce the bug.
I am not sure I understand what the speed-up of #35027 relies on. Is it the .astype("M8[ns]", copy=False) instead of .astype(object)? If so, shouldn't we pass the bug over to the numpy people?

@simonjayhawkins
Member

@pandas-dev/pandas-core not sure what to do here. I'm removing this from the 1.2.5 milestone so that we can proceed with the 1.2.5 release.

@simonjayhawkins simonjayhawkins modified the milestones: 1.2.5, Contributions Welcome Jun 16, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022