Upsampling with resample('B').ffill() fails with "ValueError: Length mismatch" #16624

Closed
AdamGleave opened this issue Jun 7, 2017 · 4 comments
Labels
Bug Resample resample method

@AdamGleave
Contributor

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

# length-10 index
idx = pd.date_range('2017-01-02', '2017-01-13', freq='B')

def create_series(exclude):
  subset = pd.Index([idx[i] for i in range(len(idx)) if i not in exclude])
  return pd.Series(np.arange(len(subset)), index=subset)

TESTS = [
  [3, 4], # delete Thursday and Friday of first week - PASS
  [4, 5], # delete Friday first week and Monday second week - PASS
  [5, 6], # delete Monday and Tuesday of second week - PASS
  [6], # delete Tuesday of second week - PASS
  [7], # delete Wednesday of second week - PASS
  [6, 7], # delete Tuesday and Wednesday of second week - FAIL
]

for exclude in TESTS:
  print('TEST: {}'.format(exclude))
  s = create_series(exclude)
  print('IN:\n{}'.format(s))
  t = s.resample('D').ffill()  # always works
  u = s.resample('B').last()  # always works
  v = s.resample('B').ffill()  # fails on [6, 7]
  print('OUT:\n{}'.format(v))

Problem description

s.resample('B').ffill() raises "ValueError: Length mismatch: Expected axis has 8 elements, new values have 10 elements". This happens when upsampling from an index that contains a multi-day, mid-week gap, but not when the gap falls at most other positions in the week; I am not sure exactly what predicts whether the problem occurs. Other resampling methods, e.g. last(), work as expected, and forward filling at a daily ('D') frequency also works. I cannot see any reason why this particular case should not be supported, so I believe it is a bug.

Expected Output

I expect the last statement not to raise an exception, and instead to produce the same output as s.resample('D').ffill().resample('B').first(), i.e. I expect:

> s = create_series([6, 7])
> s
2017-01-02    0
2017-01-03    1
2017-01-04    2
2017-01-05    3
2017-01-06    4
2017-01-09    5
2017-01-12    6
2017-01-13    7
> s.resample('B').ffill()
2017-01-02    0
2017-01-03    1
2017-01-04    2
2017-01-05    3
2017-01-06    4
2017-01-09    5
2017-01-10    5
2017-01-11    5
2017-01-12    6
2017-01-13    7
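
Until the bug is fixed, the equivalence stated above can serve as a stopgap. The following is a minimal sketch that rebuilds the failing series with the `create_series` helper from the code sample and goes through daily frequency first; the variable name `workaround` is only illustrative, and this is not the eventual fix.

```python
import numpy as np
import pandas as pd

# Same construction as in the code sample: ten business days.
idx = pd.date_range('2017-01-02', '2017-01-13', freq='B')

def create_series(exclude):
    subset = pd.Index([idx[i] for i in range(len(idx)) if i not in exclude])
    return pd.Series(np.arange(len(subset)), index=subset)

# The failing case: Tuesday and Wednesday of the second week removed.
s = create_series([6, 7])

# Forward fill at daily frequency (which works), then keep the first
# value that falls into each business-day bin.
workaround = s.resample('D').ffill().resample('B').first()
print(workaround)
```

The detour through 'D' creates weekend rows that the second resample then discards, so it does a little extra work but avoids the failing code path.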

Output of pd.show_versions()

I have tried this on both the latest stable version 0.20.2 and the master branch, Git revision 10c17d4, and both exhibit the same problem.

Development branch

INSTALLED VERSIONS

commit: 10c17d4
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.10-100.fc24.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.utf8
LOCALE: en_GB.UTF-8

pandas: 0.21.0.dev+136.g10c17d4
pytest: 3.0.1
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.19.0
xarray: None
IPython: 5.1.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.2.0-b1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Stable version

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.10-100.fc24.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.utf8
LOCALE: en_GB.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None

@dsm054
Contributor

dsm054 commented Jun 7, 2017

IIUC, the failing case is exactly the one for which infer_freq guesses that your index already has a BusinessDay freq:

In [95]: {tuple(e): pd.tseries.frequencies.infer_freq(create_series(e).index) for e in TESTS}
Out[95]: {(3, 4): None, (4, 5): None, (5, 6): None, (6,): None, (6, 7): 'B', (7,): None}

Because in that particular case, the gaps between days are always 1 or 3:

In [97]: create_series((6,7))
Out[97]: 
2017-01-02    0
2017-01-03    1
2017-01-04    2
2017-01-05    3
2017-01-06    4
2017-01-09    5
2017-01-12    6
2017-01-13    7
dtype: int64

which triggers

        # Business daily. Maybe
        if self.day_deltas == [1, 3]:
            return 'B'

which means that it takes the first branch here in resample when it needs to take the second:

        if limit is None and to_offset(ax.inferred_freq) == self.freq:
            result = obj.copy()
            result.index = res_index
        else:
            result = obj.reindex(res_index, method=method,
                                 limit=limit, fill_value=fill_value)
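
The reasoning above can be replayed by hand without touching pandas internals. This is only a sketch: `day_gaps` is a hypothetical helper, not a pandas function, and the two branches are imitated with public API calls rather than the actual resample code. It lists the distinct day gaps for each test case, then shows that relabelling a copy with the longer index raises the reported error while `reindex` fills the gap.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2017-01-02', '2017-01-13', freq='B')

def create_series(exclude):
    subset = pd.Index([idx[i] for i in range(len(idx)) if i not in exclude])
    return pd.Series(np.arange(len(subset)), index=subset)

TESTS = [[3, 4], [4, 5], [5, 6], [6], [7], [6, 7]]

def day_gaps(index):
    """Hypothetical helper: sorted distinct gaps, in whole days,
    between consecutive index entries."""
    gaps = index.to_series().diff().dropna().dt.days
    return sorted(set(gaps.tolist()))

# Only the failing case has gaps of exactly 1 and 3 days.
gaps_by_case = {tuple(e): day_gaps(create_series(e).index) for e in TESTS}
for case, gaps in gaps_by_case.items():
    print(case, gaps)

s = create_series([6, 7])
res_index = pd.date_range(s.index[0], s.index[-1], freq='B')

# Imitation of the first branch: copy and relabel. The copy has 8 rows,
# the target index has 10, so the assignment fails.
result = s.copy()
try:
    result.index = res_index
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Imitation of the second branch: reindex with forward fill.
filled = s.reindex(res_index, method='ffill')
print(filled)
```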

@jreback
Contributor

jreback commented Jun 9, 2017

yeah just needs a len check and this should be good (so it would take the next branch)
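
One way to read the suggested length check, as a rough sketch only: `upsample_ffill_sketch` is a hypothetical standalone function, not the pandas internals and not the patch that was eventually merged. It takes the copy-and-relabel shortcut only when the inferred frequency matches and the lengths agree, and otherwise falls back to `reindex`.

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

def upsample_ffill_sketch(obj, freq):
    """Hypothetical stand-in for the upsampling step, with a length guard."""
    res_index = pd.date_range(obj.index[0], obj.index[-1], freq=freq)
    same_freq = to_offset(obj.index.inferred_freq) == to_offset(freq)
    if same_freq and len(obj) == len(res_index):
        # Nothing to fill: the data already sits on the target index.
        result = obj.copy()
        result.index = res_index
    else:
        # Lengths differ (or the frequency does not match), so fill the gaps.
        result = obj.reindex(res_index, method='ffill')
    return result

idx = pd.date_range('2017-01-02', '2017-01-13', freq='B')

# Failing case from the report: positions 6 and 7 removed.
subset = pd.Index([idx[i] for i in range(len(idx)) if i not in (6, 7)])
s = pd.Series(np.arange(len(subset)), index=subset)
out = upsample_ffill_sketch(s, 'B')
print(out)

# A complete business-day series passes through unchanged.
full = pd.Series(np.arange(len(idx)), index=idx)
out_full = upsample_ffill_sketch(full, 'B')
```

With the guard in place the result no longer depends on whether the frequency inference guesses 'B' for the gapped index, since the 8-versus-10 length difference alone routes it to `reindex`.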

@jreback jreback added this to the Next Major Release milestone Jun 9, 2017
@jreback
Contributor

jreback commented Jun 9, 2017

@AdamGleave a PR would be appreciated!

AdamGleave pushed a commit to AdamGleave/pandas that referenced this issue Jun 12, 2017
@AdamGleave
Contributor Author

@jreback see #16683

@jreback jreback modified the milestones: 0.20.3, Next Major Release Jun 12, 2017
@jreback jreback modified the milestones: 0.21.0, 0.20.3 Jul 4, 2017