ValueError when trying to compute Quantile #14357

Closed
Rubyj opened this Issue Oct 5, 2016 · 9 comments


Rubyj commented Oct 5, 2016 edited

In [7]: df = pd.DataFrame(np.random.randn(10, 2))

In [8]: df.iloc[1, 1] = np.nan

In [9]: df.quantile(.5)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-66d518aa86c6> in <module>()
----> 1 df.quantile(.5)

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1+21.ge596cbf-py3.5-macosx-10.11-x86_64.egg/pandas/core/frame.py in quantile(self, q, axis, numeric_only, interpolation)
   5152                                      axis=1,
   5153                                      interpolation=interpolation,
-> 5154                                      transposed=is_transposed)
   5155
   5156         if result.ndim == 2:

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1+21.ge596cbf-py3.5-macosx-10.11-x86_64.egg/pandas/core/internals.py in quantile(self, **kwargs)
   3142
   3143     def quantile(self, **kwargs):
-> 3144         return self.reduction('quantile', **kwargs)
   3145
   3146     def setitem(self, **kwargs):

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1+21.ge596cbf-py3.5-macosx-10.11-x86_64.egg/pandas/core/internals.py in reduction(self, f, axis, consolidate, transposed, **kwargs)
   3071         for b in self.blocks:
   3072             kwargs['mgr'] = self
-> 3073             axe, block = getattr(b, f)(axis=axis, **kwargs)
   3074
   3075             axes.append(axe)

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1+21.ge596cbf-py3.5-macosx-10.11-x86_64.egg/pandas/core/internals.py in quantile(self, qs, interpolation, axis, mgr)
   1325             values = _block_shape(values[~mask], ndim=self.ndim)
   1326             if self.ndim > 1:
-> 1327                 values = values.reshape(result_shape)
   1328
   1329         from pandas import Float64Index

ValueError: total size of new array must be unchanged

original post follows


I have a simple dataframe that I created as follows:

df[df['Week of'] == week]

where week is a week name I'm filtering by

I have been taking the quartile values of this dataframe as follows:

df[df['Week of'] == week].quantile(.25)

However, since updating to pandas 0.19 I get the following error (this code worked fine before):

values = values.reshape(result_shape)
ValueError: total size of new array must be unchanged

Rubyj changed the title from ValueError when trying to compute Quartile to ValueError when trying to compute Quantile Oct 5, 2016

Contributor

chris-b1 commented Oct 5, 2016

Can you please make this a fully reproducible example with dummy data?

Rubyj commented Oct 5, 2016

I have tracked this error down to NaN values in some, but not all, of the columns for a row (2 out of 10 columns in this case). Computing the quartile of that DataFrame is what raises the ValueError. My workaround is to fill the NaN values with 0.
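One caveat on the fill-with-0 workaround: it avoids the error but can change the answer, because the zeros are counted as real observations. A minimal sketch on a made-up Series (the values are illustrative only, not from the reporter's data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: three observations and one missing value
s = pd.Series([1.0, 2.0, 3.0, np.nan])

# Series.quantile skips NaN, so this is the median of [1, 2, 3]
skipped = s.quantile(0.5)

# After fillna(0) the zero is treated as data: median of [0, 1, 2, 3]
filled = s.fillna(0).quantile(0.5)

print(skipped, filled)  # 2.0 1.5
```

Whether that shift matters depends on the data, but it is worth checking before relying on the workaround.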

Contributor

TomAugspurger commented Oct 5, 2016

Edited in a reproducible example. Hard to say for sure, but maybe related to 4de83d2

It's definitely related to a (float) block having some cols with missing values:

In [11]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)))
In [13]: df.iloc[1, 1] = np.nan

In [14]: df.quantile(.5)
Out[14]:
0    4.5
1    7.0
Name: 0.5, dtype: float64

and

In [15]: df = pd.DataFrame(np.random.randn(10, 2))

In [17]: df.iloc[0, :] = np.nan

In [18]: df.quantile(.5)
Out[18]:
0    0.347815
1    0.072105
Name: 0.5, dtype: float64

both work

TomAugspurger added this to the 0.19.1 milestone Oct 5, 2016

Contributor

jreback commented Oct 6, 2016

In [10]: pd.__version__
Out[10]: '0.19.0'

In [11]: np.random.seed(1234)

In [12]: df = pd.DataFrame(np.random.randn(10, 2))
    ...:
    ...:
    ...: df.iloc[0, :] = np.nan
    ...:

In [13]: df
Out[13]:
          0         1
0       NaN       NaN
1  1.432707 -0.312652
2 -0.720589  0.887163
3  0.859588 -0.636524
4  0.015696 -2.242685
5  1.150036  0.991946
6  0.953324 -2.021255
7 -0.334077  0.002118
8  0.405453  0.289092
9  1.321158 -1.546906

In [14]: df.median()
Out[14]:
0    0.859588
1   -0.312652
dtype: float64

In [15]: df.quantile(0.5)
Out[15]:
0    0.859588
1   -0.312652
Name: 0.5, dtype: float64

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)))
    ...: df.iloc[1, 1] = np.nan
    ...:
    ...:

In [17]: df
Out[17]:
   0    1
0  0  3.0
1  2  NaN
2  1  3.0
3  1  3.0
4  7  1.0
5  7  4.0
6  0  5.0
7  1  5.0
8  9  9.0
9  4  0.0

In [18]: df.median()
Out[18]:
0    1.5
1    3.0
dtype: float64

In [19]: df.quantile(0.5)
Out[19]:
0    1.5
1    3.0
Name: 0.5, dtype: float64

Contributor

jreback commented Oct 6, 2016

@Rubyj you'll have to show a complete end-to-end reproducible example. This was a bug in 0.18.1 but is correct in 0.19.0.

jreback removed this from the 0.19.1 milestone Oct 6, 2016

Contributor

TomAugspurger commented Oct 6, 2016 edited

@jreback the problem seems to be a DataFrame with a FloatBlock that has at least 1 col with no missing values and at least 1 col with some missing values (see my edit at the top of the OP)
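For anyone wanting to check this condition on their own install, here is a rough script that runs the three shapes discussed in this thread side by side. `try_quantile` is a hypothetical helper written for this sketch, and the third frame is built from a dict on the assumption that an int column and a float column end up in separate blocks. Per the traceback at the top, only the first case is expected to raise on 0.19.0; other versions may return a result for all three.

```python
import numpy as np
import pandas as pd


def try_quantile(df, q=0.5):
    """Return df.quantile(q), or the error message if it raises ValueError."""
    try:
        return df.quantile(q)
    except ValueError as exc:
        return "ValueError: {}".format(exc)


rng = np.random.RandomState(0)

# Case 1: all-float frame, one complete column and one column with a NaN
mixed = pd.DataFrame(rng.randn(10, 2))
mixed.iloc[1, 1] = np.nan

# Case 2: all-float frame, NaN across an entire row (same count per column)
whole_row = pd.DataFrame(rng.randn(10, 2))
whole_row.iloc[0, :] = np.nan

# Case 3: one int column and one float column containing a NaN
floats = rng.randn(10)
floats[1] = np.nan
split_blocks = pd.DataFrame({0: rng.randint(0, 10, size=10), 1: floats})

for name, frame in [("mixed", mixed),
                    ("whole_row", whole_row),
                    ("split_blocks", split_blocks)]:
    print(name)
    print(try_quantile(frame))
```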

Rubyj commented Oct 6, 2016 edited

@jreback

@TomAugspurger provided a reproducible example for me in my original post and added the labels that you removed. Not sure if you saw that. Thank you Tom 👍

jorisvandenbossche added this to the 0.19.1 milestone Oct 6, 2016

Contributor

jreback commented Oct 7, 2016

@TomAugspurger your example reproduces it. I see that you changed the top of the post, thanks.

Contributor

jreback commented Oct 7, 2016

So in this case the individual dimensions (corresponding to the columns) need to be iterated over, with each column's quantile computed separately and the results then combined, rather than doing it all at once. numpy doesn't handle the NaNs when computing quantiles.
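The per-column idea can be sketched roughly as below. This is not the actual pandas fix, just an illustration of the approach: `columnwise_quantile` is a hypothetical helper, and it assumes linear interpolation through `np.percentile` after dropping NaNs from each column on its own, so columns with different numbers of valid values never have to be reshaped into one array.

```python
import numpy as np
import pandas as pd


def columnwise_quantile(df, q):
    """Compute the q-th quantile of each column separately, skipping NaNs.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    results = {}
    for col in df.columns:
        values = df[col].to_numpy(dtype=float)
        valid = values[~np.isnan(values)]  # mask NaNs for this column only
        # np.percentile takes a percentage in [0, 100]
        results[col] = np.percentile(valid, q * 100) if valid.size else np.nan
    return pd.Series(results, name=q)


# One complete column and one column with a single missing value
df = pd.DataFrame({0: np.arange(10.0), 1: np.arange(10.0)})
df.iloc[1, 1] = np.nan

result = columnwise_quantile(df, 0.5)
print(result)
```

Column 0 keeps all ten values while column 1 keeps nine, which is exactly the situation where a single reshape of the masked values cannot work.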
