ValueError when trying to compute Quantile #14357

Closed
Rubyj opened this Issue Oct 5, 2016 · 9 comments

Rubyj commented Oct 5, 2016

In [7]: df = pd.DataFrame(np.random.randn(10, 2))

In [8]: df.iloc[1, 1] = np.nan

In [9]: df.quantile(.5)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-66d518aa86c6> in <module>()
----> 1 df.quantile(.5)

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1+21.ge596cbf-py3.5-macosx-10.11-x86_64.egg/pandas/core/frame.py in quantile(self, q, axis, numeric_only, interpolation)
   5152                                      axis=1,
   5153                                      interpolation=interpolation,
-> 5154                                      transposed=is_transposed)
   5155
   5156         if result.ndim == 2:

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1+21.ge596cbf-py3.5-macosx-10.11-x86_64.egg/pandas/core/internals.py in quantile(self, **kwargs)
   3142
   3143     def quantile(self, **kwargs):
-> 3144         return self.reduction('quantile', **kwargs)
   3145
   3146     def setitem(self, **kwargs):

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1+21.ge596cbf-py3.5-macosx-10.11-x86_64.egg/pandas/core/internals.py in reduction(self, f, axis, consolidate, transposed, **kwargs)
   3071         for b in self.blocks:
   3072             kwargs['mgr'] = self
-> 3073             axe, block = getattr(b, f)(axis=axis, **kwargs)
   3074
   3075             axes.append(axe)

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas-0.19.0rc1+21.ge596cbf-py3.5-macosx-10.11-x86_64.egg/pandas/core/internals.py in quantile(self, qs, interpolation, axis, mgr)
   1325             values = _block_shape(values[~mask], ndim=self.ndim)
   1326             if self.ndim > 1:
-> 1327                 values = values.reshape(result_shape)
   1328
   1329         from pandas import Float64Index

ValueError: total size of new array must be unchanged

original post follows


I have a simple dataframe that I created as follows:

df[df['Week of'] == week]

where week is a week name I'm filtering by

I have been taking the quartile values of this dataframe as follows:

df[df['Week of'] == week].quantile(.25)

However, since updating to pandas 0.19 I get the following error (this code worked fine before):

values = values.reshape(result_shape)
ValueError: total size of new array must be unchanged

@Rubyj Rubyj changed the title from ValueError when trying to compute Quartile to ValueError when trying to compute Quantile Oct 5, 2016

Contributor

chris-b1 commented Oct 5, 2016

Can you please make this a fully reproducible example with dummy data?

Rubyj commented Oct 5, 2016

I have tracked this error down to NaN values in some, but not all, of the columns for a row (2 out of 10 in this case). When I then tried to compute the quartile of that DataFrame, pandas raised the error above. My workaround is to fill the NaN values with 0.
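One caveat on the fill-with-0 workaround: it changes the answer, because the zeros take part in the quantile. The sketch below uses made-up data (the column names `a` and `b` are hypothetical) and compares it with dropping the NaNs one column at a time. It is an assumption, not something verified against 0.19.0, that the per-column route avoids the failing code path.

```python
import numpy as np
import pandas as pd

# Made-up data: column "b" has one missing value, column "a" has none.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# Workaround from the comment above: replace NaN with 0 first.
# The 0 is then treated as a real observation.
filled = df.fillna(0).quantile(0.25)

# Alternative: drop the NaNs per column, so each column is
# reduced on its own with only its observed values.
per_column = df.apply(lambda col: col.dropna().quantile(0.25))

print(filled)
print(per_column)
```

The two results agree for `a` but differ for `b`, so the fill value matters whenever a column actually contains NaN.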

Contributor

TomAugspurger commented Oct 5, 2016

Edited in a reproducible example. Hard to say for sure, but maybe related to 4de83d2

It's definitely related to a (float) block having some cols with missing values:

In [11]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)))
In [13]: df.iloc[1, 1] = np.nan

In [14]: df.quantile(.5)
Out[14]:
0    4.5
1    7.0
Name: 0.5, dtype: float64

and

In [15]: df = pd.DataFrame(np.random.randn(10, 2))

In [17]: df.iloc[0, :] = np.nan

In [18]: df.quantile(.5)
Out[18]:
0    0.347815
1    0.072105
Name: 0.5, dtype: float64

Both of these work.

Contributor

jreback commented Oct 6, 2016

In [10]: pd.__version__
Out[10]: '0.19.0'

In [11]: np.random.seed(1234)

In [12]: df = pd.DataFrame(np.random.randn(10, 2))
    ...:
    ...:
    ...: df.iloc[0, :] = np.nan
    ...:

In [13]: df
Out[13]:
          0         1
0       NaN       NaN
1  1.432707 -0.312652
2 -0.720589  0.887163
3  0.859588 -0.636524
4  0.015696 -2.242685
5  1.150036  0.991946
6  0.953324 -2.021255
7 -0.334077  0.002118
8  0.405453  0.289092
9  1.321158 -1.546906

In [14]: df.median()
Out[14]:
0    0.859588
1   -0.312652
dtype: float64

In [15]: df.quantile(0.5)
Out[15]:
0    0.859588
1   -0.312652
Name: 0.5, dtype: float64

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)))
    ...: df.iloc[1, 1] = np.nan
    ...:
    ...:

In [17]: df
Out[17]:
   0    1
0  0  3.0
1  2  NaN
2  1  3.0
3  1  3.0
4  7  1.0
5  7  4.0
6  0  5.0
7  1  5.0
8  9  9.0
9  4  0.0

In [18]: df.median()
Out[18]:
0    1.5
1    3.0
dtype: float64

In [19]: df.quantile(0.5)
Out[19]:
0    1.5
1    3.0
Name: 0.5, dtype: float64

Contributor

jreback commented Oct 6, 2016

@Rubyj you'll have to show a complete end-to-end reproducible example. This was a bug in 0.18.1 but is correct in 0.19.0.

Contributor

TomAugspurger commented Oct 6, 2016

@jreback the problem seems to be a DataFrame with a FloatBlock that has at least 1 col with no missing values and at least 1 col with some missing values (see my edit at the top of the OP)
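A plain NumPy sketch of why that combination could trip the `values[~mask]` / `reshape` lines in the traceback. This is a guess at the mechanism from the traceback alone, not the actual pandas internals; the 2 x 10 layout and the target shapes are assumptions for illustration.

```python
import numpy as np

# Toy stand-in for a float block: 2 columns stored as rows, 10 values each.
values = np.arange(20, dtype=float).reshape(2, 10)
values[1, 1] = np.nan  # one column has a NaN, the other has none

mask = np.isnan(values)
kept = values[~mask]  # boolean indexing flattens: 1-D, 19 elements

try:
    kept.reshape(values.shape)  # 19 elements cannot fill a (2, 10) array
    reshape_failed = False
except ValueError:
    reshape_failed = True

# If every column loses the same number of values (a whole row is NaN),
# the remaining elements still form a rectangle.
values2 = np.arange(20, dtype=float).reshape(2, 10)
values2[:, 0] = np.nan
kept2 = values2[~np.isnan(values2)]
rect = kept2.reshape(2, 9)

print(kept.shape, reshape_failed, rect.shape)
```

That would be consistent with the all-NaN-row example working, and with the int example working if the int and float columns end up in separate single-column blocks.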

Rubyj commented Oct 6, 2016

@jreback

@TomAugspurger provided a reproducible example for me in my original post and added the labels that you removed. Not sure if you saw that. Thank you Tom 👍

@jorisvandenbossche jorisvandenbossche added this to the 0.19.1 milestone Oct 6, 2016

Contributor

jreback commented Oct 7, 2016

@TomAugspurger your example works, I see that you changed the top of post. thanks.

Contributor

jreback commented Oct 7, 2016

So in this case the individual dims (corresponding to the columns) need to be iterated over, with the per-column quantile results then combined, rather than doing it all at once. numpy doesn't handle the NaNs in the quantiling.
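A minimal sketch of that per-column idea in plain NumPy, not the actual pandas fix. `columnwise_quantile` is a hypothetical helper name, and the data is the integer example from earlier in the thread.

```python
import numpy as np


def columnwise_quantile(values, q):
    """Quantile of each column, dropping NaNs one column at a time."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape[1], np.nan)  # all-NaN columns stay NaN
    for j in range(values.shape[1]):
        col = values[:, j]
        col = col[~np.isnan(col)]  # each column keeps its own length
        if col.size:
            out[j] = np.percentile(col, q * 100)
    return out


# Data from the integer example above (row 1, column 1 is NaN).
data = np.array([
    [0, 3], [2, np.nan], [1, 3], [1, 3], [7, 1],
    [7, 4], [0, 5], [1, 5], [9, 9], [4, 0],
], dtype=float)

result = columnwise_quantile(data, 0.5)
print(result)
```

Because each column is masked separately, columns with different NaN counts never have to fit back into one rectangular array.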
