ValueError: Values falls after last bin when Resampling using pd.tseries.offsets.Nano as period #12037

Closed
nothinkelse opened this Issue Jan 14, 2016 · 5 comments


I have a time series in a DataFrame named dfi, with non-equispaced times as its index:

print dfi.value
print
print "len "+ str(len(dfi))

Output:

datetime
2015-10-01 13:58:10.427   -10.072100
2015-10-01 13:58:11.419   -10.072100
2015-10-01 13:58:12.417   -10.072100
2015-10-01 13:58:13.420   -10.072100
2015-10-01 13:58:14.426   -10.072100
2015-10-01 13:58:15.427   -10.072100
2015-10-01 13:58:16.418   -10.072100
2015-10-01 13:58:17.418    -9.753230
2015-10-01 13:58:18.416    -9.753230
2015-10-01 13:58:19.428    -9.753230
2015-10-01 13:58:20.427    -9.753230
2015-10-01 13:58:21.419    -9.753230
2015-10-01 13:58:22.416    -9.753230
2015-10-01 13:58:23.429    -9.753230
2015-10-01 13:58:24.416    -9.753230
2015-10-01 13:58:25.428    -9.753230
2015-10-01 13:58:26.418    -9.753230
2015-10-01 13:58:27.416    -9.396140
2015-10-01 13:58:28.416    -9.396140
2015-10-01 13:58:29.429    -9.396140
2015-10-01 13:58:32.416    -9.396140
2015-10-01 13:58:33.427    -9.396140
2015-10-01 13:58:34.428    -9.396140
2015-10-01 13:58:35.462    -9.396140
2015-10-01 13:58:36.416    -9.396140
2015-10-01 13:58:37.427    -9.010000
2015-10-01 13:58:38.428    -9.010000
2015-10-01 13:58:39.435    -9.010000
2015-10-01 13:58:40.437    -9.010000
2015-10-01 13:58:41.416    -9.010000
                             ...    
2015-10-03 23:59:28.052    -0.759718
2015-10-03 23:59:29.040    -0.759718
2015-10-03 23:59:30.048    -0.759718
2015-10-03 23:59:31.048    -0.759718
2015-10-03 23:59:32.060    -0.759718
2015-10-03 23:59:33.049    -0.759718
2015-10-03 23:59:34.051    -0.759718
2015-10-03 23:59:35.041    -0.759718
2015-10-03 23:59:36.061    -0.759718
2015-10-03 23:59:37.059    -1.010490
2015-10-03 23:59:38.040    -1.010490
2015-10-03 23:59:39.051    -1.010490
2015-10-03 23:59:40.040    -1.010490
2015-10-03 23:59:41.072    -1.010490
2015-10-03 23:59:42.049    -1.010490
2015-10-03 23:59:43.038    -1.010490
2015-10-03 23:59:44.040    -1.010490
2015-10-03 23:59:45.040    -1.010490
2015-10-03 23:59:48.049    -1.133730
2015-10-03 23:59:49.049    -1.133730
2015-10-03 23:59:50.048    -1.133730
2015-10-03 23:59:52.050    -1.133730
2015-10-03 23:59:53.050    -1.133730
2015-10-03 23:59:54.059    -1.133730
2015-10-03 23:59:55.049    -1.133730
2015-10-03 23:59:56.041    -1.133730
2015-10-03 23:59:59.039    -1.296430
2015-10-04 00:00:00.050    -1.296430
2015-10-04 00:00:01.060    -1.296430
2015-10-04 00:00:02.040    -1.296430
Name: value, dtype: float64

I get an error when running this code:

print period_seconds
period_nanos=int(period_seconds*(10**9))
print period_nanos
res= dfi.value.resample(pd.tseries.offsets.Nano(period_nanos), how=[np.min, np.max,'mean'])

Output + error:

4.035752
4035751999

ValueError                                Traceback (most recent call last)
<ipython-input-14-92e377227823> in <module>()
      5     period_nanos=int(period_seconds*(10**9))
      6     print period_nanos
----> 7     res= dfi.value.resample(pd.tseries.offsets.Nano(period_nanos), how=[np.min, np.max,'mean'])
      8 
      9     nullrows=pd.isnull(res).any(1).nonzero()[0]

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\core\generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
   3641                               fill_method=fill_method, convention=convention,
   3642                               limit=limit, base=base)
-> 3643         return sampler.resample(self).__finalize__(self)
   3644 
   3645     def first(self, offset):

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in resample(self, obj)
     80 
     81         if isinstance(ax, DatetimeIndex):
---> 82             rs = self._resample_timestamps()
     83         elif isinstance(ax, PeriodIndex):
     84             offset = to_offset(self.freq)

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _resample_timestamps(self, kind)
    274         axlabels = self.ax
    275 
--> 276         self._get_binner_for_resample(kind=kind)
    277         grouper = self.grouper
    278         binner = self.binner

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _get_binner_for_resample(self, kind)
    118             kind = self.kind
    119         if kind is None or kind == 'timestamp':
--> 120             self.binner, bins, binlabels = self._get_time_bins(ax)
    121         elif kind == 'timedelta':
    122             self.binner, bins, binlabels = self._get_time_delta_bins(ax)

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _get_time_bins(self, ax)
    179 
    180         # general version, knowing nothing about relative frequencies
--> 181         bins = lib.generate_bins_dt64(ax_values, bin_edges, self.closed, hasnans=ax.hasnans)
    182 
    183         if self.closed == 'right':

pandas\lib.pyx in pandas.lib.generate_bins_dt64 (pandas\lib.c:20875)()

ValueError: Values falls after last bin

Package versions:

import pip
installed_packages = pip.get_installed_distributions()
installed_packages_list = sorted(["%s==%s" % (i.key, i.version)
     for i in installed_packages])
for i in installed_packages_list:
    print i

Output:

alabaster==0.7.6
anaconda-client==1.2.1
argcomplete==1.0.0
astropy==1.1.1
babel==2.1.1
backports-abc==0.4
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.4.1
bitarray==0.8.1
blaze==0.9.0
bokeh==0.11.0
boto==2.38.0
bottleneck==1.0.0
cdecimal==2.3
cffi==1.2.1
clyent==1.2.0
colorama==0.3.3
comtypes==1.1.2
conda-build==1.18.2
conda-env==2.4.5
conda==3.19.0
configobj==5.0.6
cryptography==0.9.1
cycler==0.9.0
cython==0.23.4
cytoolz==0.7.4
datashape==0.5.0
decorator==4.0.6
docutils==0.12
enum34==1.1.2
et-xmlfile==1.0.1
fastcache==1.0.2
flask==0.10.1
funcsigs==0.4
futures==3.0.3
gevent-websocket==0.9.3
gevent==1.0.1
greenlet==0.4.9
grin==1.2.1
h5py==2.5.0
idna==2.0
ipaddress==1.0.14
ipykernel==4.1.1
ipython-genutils==0.1.0
ipython==4.0.1
ipywidgets==4.1.0
itsdangerous==0.24
jdcal==1.2
jedi==0.9.0
jinja2==2.8
jsonschema==2.4.0
jupyter-client==4.1.1
jupyter-console==4.0.3
jupyter-core==4.0.6
jupyter==1.0.0
llvmlite==0.8.0
lxml==3.5.0
markupsafe==0.23
matplotlib==1.5.1
menuinst==1.3.2
mistune==0.7.1
multipledispatch==0.4.8
nbconvert==4.1.0
nbformat==4.0.1
networkx==1.10
nltk==3.1
nose==1.3.7
notebook==4.1.0
numba==0.22.1
numexpr==2.4.6
numpy==1.10.1
odo==0.4.0
openpyxl==2.3.2
pandas==0.17.1
path.py==0.0.0
patsy==0.4.0
pep8==1.6.2
pickleshare==0.5
pillow==3.0.0
pip==7.1.2
ply==3.8
psutil==3.2.2
py==1.4.30
pyasn1==0.1.9
pycosat==0.6.1
pycparser==2.14
pycrypto==2.6.1
pycurl==7.19.5.3
pyflakes==1.0.0
pygments==2.0.2
pyopenssl==0.15.1
pyparsing==2.0.3
pyreadline==2.1
pytest==2.8.1
python-dateutil==2.4.2
pytz==2015.7
pywin32==219
pyyaml==3.11
pyzmq==15.2.0
qtconsole==4.1.1
requests==2.9.0
rope==0.9.4
scikit-image==0.11.3
scikit-learn==0.17
scipy==0.16.0
setuptools==19.1.1
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.10.0
snowballstemmer==1.2.0
sockjs-tornado==1.0.1
sphinx-rtd-theme==0.1.7
sphinx==1.3.1
spyder==2.3.8
sqlalchemy==1.0.11
statsmodels==0.6.1
sympy==0.7.6.1
tables==3.2.2
toolz==0.7.4
tornado==4.3
traitlets==4.0.0
ujson==1.33
unicodecsv==0.14.1
werkzeug==0.11.3
wheel==0.26.0
xlrd==0.9.4
xlsxwriter==0.7.7
xlwings==0.6.1
xlwt==1.0.0

jreback commented Jan 14, 2016

xref #9119

This does look buggy. Can you post a simpler, easily reproducible example that can be copy-pasted?

jreback added this to the Next Major Release milestone Jan 14, 2016

Here is a simpler, reproducible example.

It was run in an IPython notebook with Python 2:

import pandas as pd
import numpy as np

start=1443707890427
end=1443916802040
dif=end-start
length=1000
np.random.seed(seed=16516)
timestamps=np.random.random_integers(0,dif,length);
timestamps =timestamps+start
timestamps = np.sort(timestamps)

datetimes=pd.to_datetime(timestamps,unit="ms")
values = np.random.rand(length)

dt_test=pd.DataFrame(values,columns=["value"],index=datetimes)
print "dt_test.head()"
print dt_test.head()
print

period_seconds=4.035752
print "period_seconds"
print period_seconds
print
period_nanos=int(period_seconds*(10**9))
print "period_nanos"
print period_nanos

res= dt_test.value.resample(pd.tseries.offsets.Nano(period_nanos), how=[np.min, np.max,'mean'])

Output:

dt_test.head()
                            value
2015-10-01 13:59:11.020  0.795006
2015-10-01 13:59:30.583  0.725395
2015-10-01 14:01:31.597  0.731184
2015-10-01 14:06:40.423  0.982237
2015-10-01 14:08:28.432  0.014274

period_seconds
4.035752

period_nanos
4035751999
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-f6df1fcb5427> in <module>()
     27 print period_nanos
     28 
---> 29 res= dt_test.value.resample(pd.tseries.offsets.Nano(period_nanos), how=[np.min, np.max,'mean'])

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\core\generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
   3641                               fill_method=fill_method, convention=convention,
   3642                               limit=limit, base=base)
-> 3643         return sampler.resample(self).__finalize__(self)
   3644 
   3645     def first(self, offset):

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in resample(self, obj)
     80 
     81         if isinstance(ax, DatetimeIndex):
---> 82             rs = self._resample_timestamps()
     83         elif isinstance(ax, PeriodIndex):
     84             offset = to_offset(self.freq)

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _resample_timestamps(self, kind)
    274         axlabels = self.ax
    275 
--> 276         self._get_binner_for_resample(kind=kind)
    277         grouper = self.grouper
    278         binner = self.binner

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _get_binner_for_resample(self, kind)
    118             kind = self.kind
    119         if kind is None or kind == 'timestamp':
--> 120             self.binner, bins, binlabels = self._get_time_bins(ax)
    121         elif kind == 'timedelta':
    122             self.binner, bins, binlabels = self._get_time_delta_bins(ax)

C:\Users\USER1\Anaconda2\lib\site-packages\pandas\tseries\resample.pyc in _get_time_bins(self, ax)
    179 
    180         # general version, knowing nothing about relative frequencies
--> 181         bins = lib.generate_bins_dt64(ax_values, bin_edges, self.closed, hasnans=ax.hasnans)
    182 
    183         if self.closed == 'right':

pandas\lib.pyx in pandas.lib.generate_bins_dt64 (pandas\lib.c:20875)()

ValueError: Values falls after last bin
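
A side note on the output above, separate from the ValueError itself: `period_nanos` prints as 4035751999 rather than 4035752000 because `int()` truncates the floating-point product instead of rounding it. The following is only a sketch of that arithmetic; using `round()` or `decimal.Decimal` is a suggestion, not something the original report does.

```python
from decimal import Decimal

period_seconds = 4.035752

# int() truncates toward zero, so a product that lands just below the
# intended integer loses a whole nanosecond.
truncated = int(period_seconds * 10**9)

# round() picks the nearest integer instead.
rounded = round(period_seconds * 10**9)

# Decimal arithmetic on the literal string avoids binary floating point.
exact = int(Decimal("4.035752") * 10**9)

print(truncated)
print(rounded)   # 4035752000
print(exact)     # 4035752000
```

Either period still triggers the same code path in `resample`, so this does not work around the error.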

BranYang commented Jan 20, 2016

The issue is caused by lines 164 and 165 in pandas/tseries/resample.py:

binner = labels = DatetimeIndex(freq=self.freq,
                                start=first.replace(tzinfo=None),
                                # replace() drops the nanoseconds (truncates to microseconds)
                                end=last.replace(tzinfo=None),
                                tz=tz,
                                name=ax.name)

Consider this example

In [1]: import pandas as pd

In [2]: from pandas.tseries.index import DatetimeIndex

In [3]: s_ns = 1443707950041939524

In [4]: itvl = 10**9

In [5]: e_ns = s_ns + itvl

In [6]: s = pd.Timestamp(s_ns).tz_localize(None)

In [7]: e = pd.Timestamp(e_ns).tz_localize(None)

In [8]: e
Out[8]: Timestamp('2015-10-01 13:59:11.041939524')

In [9]: indx = DatetimeIndex(freq=pd.tseries.offsets.Nano(itvl/20),start=s, end=e,tz=None)

In [10]: indx[-1]
Out[10]: Timestamp('2015-10-01 13:59:11.041939524', offset='50000000N')

In [11]: replaced = DatetimeIndex(freq=pd.tseries.offsets.Nano(itvl/20),start=s.replace(tzinfo=None), end=e.replace(tzinfo=None),tz=None)

In [12]: replaced[-1]
Out[12]: Timestamp('2015-10-01 13:59:11.041939', offset='50000000N')

When replace is used, the last bin edge is truncated to microseconds, so the actual last value (which ends in ...939524) clearly falls outside the bins.
Should we consider not using replace, given its current behavior (i.e., it throws away the nanosecond information)?
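
The same effect can be reproduced without pandas. The following is a minimal sketch using plain integer nanoseconds. `truncate_to_us` and `bin_edges` are hypothetical helpers that only imitate the loss of nanoseconds and the generation of regular bin edges; they are not pandas functions.

```python
# Integer-nanosecond sketch of the bin-edge problem (no pandas required).
s_ns = 1443707950041939524   # first timestamp, in ns since the epoch
itvl = 10**9
e_ns = s_ns + itvl           # last timestamp, one second later
step = itvl // 20            # 50 ms bins, as in Nano(itvl/20)


def truncate_to_us(ns):
    """Drop the sub-microsecond digits (hypothetical stand-in for the lossy replace)."""
    return ns // 1000 * 1000


def bin_edges(start, end, step):
    """Regular edges start, start + step, ... that do not exceed end."""
    return list(range(start, end + 1, step))


exact = bin_edges(s_ns, e_ns, step)
lossy = bin_edges(truncate_to_us(s_ns), truncate_to_us(e_ns), step)

print(exact[-1] >= e_ns)   # True: the last value is covered
print(lossy[-1] >= e_ns)   # False: the last value falls after the last edge
print(e_ns - lossy[-1])    # 524 (the nanoseconds that were thrown away)
```

With the truncated endpoints, the last edge sits 524 ns before the last value, which matches the "Values falls after last bin" condition.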


jreback commented Jan 20, 2016

@BranYang hmm, that does look likely.

Timestamp.replace is pretty naive in that it doesn't understand nanoseconds at all, so it is indeed dropping the nanos.

What you need to do is fix that as I believe this is a symptom of an invalid replace.

Want to take a crack at it? Look at tslib.pyx/Timestamp.
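
To illustrate why a datetime-style replace cannot keep nanoseconds, here is a small standard-library sketch. It assumes the replace in this pandas version behaves like `datetime.datetime.replace`, whose finest field is `microsecond`. The helpers `ns_to_datetime` and `datetime_to_ns` are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
ONE_US = timedelta(microseconds=1)


def ns_to_datetime(ns):
    """Convert epoch nanoseconds to a datetime; sub-microsecond digits cannot be stored."""
    return EPOCH + timedelta(microseconds=ns // 1000)


def datetime_to_ns(dt):
    """Convert a naive datetime back to epoch nanoseconds."""
    return ((dt - EPOCH) // ONE_US) * 1000


e_ns = 1443707951041939524   # the value of `e` in the example above

dt = ns_to_datetime(e_ns)
# replace() only knows year..microsecond and tzinfo; there is no nanosecond field.
roundtrip = datetime_to_ns(dt.replace(tzinfo=None))

print(dt)                 # 2015-10-01 13:59:11.041939
print(e_ns - roundtrip)   # 524
```

Any fix therefore has to carry the nanosecond part separately from the fields that datetime understands.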

jreback added the Timeseries label Jan 20, 2016


jreback commented Jan 20, 2016

A couple of other issues might be showing similar symptoms, e.g. #6085 (and those linked from there). If this proves to fix them, we will want to add tests for those as well.

jreback closed this in ab29f93 Feb 10, 2016

@cldy cldy added a commit to cldy/pandas that referenced this issue Feb 11, 2016

BranYang + cldy: Fix #12037 Error when Resampling using pd.tseries.offsets.Nano as period
Closes #12037

Author: Bran Yang <snowolfy@163.com>

Closes #12270 from BranYang/nanosec and squashes the following commits:

bff0c85 [Bran Yang] Add to whatsnew and some comments
fd0b307 [Bran Yang] Fix #12037 Error when Resampling using pd.tseries.offsets.Nano as period
74c8344