
Bug in pd.Series.mean() #6915

Closed
zoof opened this issue Apr 20, 2014 · 45 comments · Fixed by #6954
Labels: Bug, Dtype Conversions (Unexpected or buggy dtype conversions), Numeric Operations (Arithmetic, Comparison, and Logical operations)

Comments

@zoof

zoof commented Apr 20, 2014

In some cases, the mean is computed incorrectly; NumPy, however, does the calculation correctly. There is no problem with the standard deviation calculation. The following is an example.

In [11]: np.array(stateemp.area.tolist()).mean()
Out[11]: 23785.447812211703

In [12]: stateemp.area.mean()
Out[12]: 58.927762478114879

In [13]: np.array(stateemp.area.tolist()).std()
Out[13]: 22883.862745218048

In [14]: stateemp.area.std()
Out[14]: 22883.864924811925

In [15]: pd.__version__
Out[15]: '0.13.1'

In [16]: np.__version__
Out[16]: '1.8.1'
@jorisvandenbossche
Member

Are you able to narrow it down to an example dataframe that shows the issue, which you can post here?

@zoof
Author

zoof commented Apr 20, 2014

I have the dataframe that I found the problem with -- it is quite large, but I can delete all but the two variables where the calculation goes awry and post the dataset somewhere.

@jorisvandenbossche
Member

If you can, please do -- that would be interesting.

Do you have NaNs? And what is the result of stateemp.area.values.mean()?

@zoof
Author

zoof commented Apr 20, 2014

After some playing around, one hypothesis is that the bug has something to do with int32 vs. int64 dtypes. Initially, I exported the data to CSV, tried it on another computer, and got the right answer. I then saved it as an HDF5 file and got the wrong answer. Looking at the dtypes:

In [30]: stateemp.area.dtype
Out[30]: dtype('int32')

In [31]: stateemp.area.mean()
Out[31]: 58.927762478114879

and

In [34]: stateemp.area.dtype
Out[34]: dtype('int64')

In [35]: stateemp.area.mean()
Out[35]: 23785.447812211703

The two files are at:
https://copy.com/cOl2YgRicfco
https://copy.com/fddCxmq1Wpmp

In answer to your questions:

In [21]: stateemp.area.values.mean()
Out[21]: 23785.447812211703

In [22]: isnan(stateemp.area).sum()
Out[22]: 0

@jtratner
Contributor

Could you post the output of describe(), both pre- and post-load?

@zoof
Author

zoof commented Apr 23, 2014

I'm not sure what you mean by pre-load.

@jreback
Contributor

jreback commented Apr 23, 2014

what @jtratner means is: show EXACTLY what you are doing with those files, every command in an IPython session, so it can simply be copy-pasted and reproduced

@zoof
Author

zoof commented Apr 23, 2014

You mean something like:

Python 2.7.6 (default, Feb 26 2014, 12:01:28) 
Type "copyright", "credits" or "license" for more information.

IPython 2.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

In [1]: import pandas as pd

In [2]: df=pd.read_hdf('pandas_mean_error.h5','stateemp')

In [3]: df.area.describe()
Out[3]: 
count    5249571.000000
mean          58.927762
std        22883.864925
min            0.000000
25%            0.000000
50%        21500.000000
75%        38900.000000
max        97961.000000
Name: area, dtype: float64

In [4]: df=pd.read_csv('pandas_mean_error.csv.gz',compression='gzip')

In [5]: df.area.describe()
Out[5]: 
count    5249571.000000
mean       23785.447812
std        22883.864925
min            0.000000
25%            0.000000
50%        21500.000000
75%        38900.000000
max        97961.000000
Name: area, dtype: float64

In [6]: 

@zoof
Author

zoof commented Apr 23, 2014

That said, in addition to int32 vs. int64, it seems to be architecture-specific -- I've also run the same thing on a 64-bit machine, where the mean gets computed correctly. I constructed the datasets on a 32-bit machine. I'm running one of the flavors of Manjaro Linux on all my machines.

@jreback
Contributor

jreback commented Apr 23, 2014

ok, that shows the problem (but the reason we ask for ALL the info is that you didn't mention it was in an HDF file).

can you show what you are writing? (e.g. you are reading the file here, but show the code that generates the frame you then write.)

it should be completely self-contained and copy-pastable.

@zoof
Author

zoof commented Apr 23, 2014

I did, in my comment from 4 days ago. As I also mentioned in that comment, it appeared to be a difference between int32 and int64, so it's not obvious to me why HDF5 matters -- I observed the problem before saving anything out to HDF5.

@jreback
Contributor

jreback commented Apr 23, 2014

@zoof

ok, then show how you created them

you are showing a symptom, but not how you created the data, so it is still impossible to even guess where the problem lies.

@zoof
Author

zoof commented Apr 23, 2014

import pandas as pd
from numpy import nan,isnan

# set directories
rawdat = '/home/tct/rothstein/uiflows/rawdata'
scratch = '/home/tct/rothstein/uiflows/scratch'

stateemp = pd.read_csv(rawdat+'/miscbls/sm.data.1.AllData.gz',header=0,names=['seriesid','year','period','value','footnote'],converters={'footnote':str},sep='\t',compression='gzip')
stateemp['sa']=stateemp.seriesid.str[2]=='S'
stateemp['st_fips']=stateemp.seriesid.str[3:5].astype(int)
stateemp['area']=stateemp.seriesid.str[5:10].astype(int)
stateemp['supersec']=stateemp.seriesid.str[10:12].astype(int)
stateemp['ind']=stateemp.seriesid.str[10:18].astype(int)
stateemp['type']=stateemp.seriesid.str[18:20].astype(int)
stateemp['month']=stateemp.period.str.replace('M','').astype(float)
stateemp.loc[stateemp.month<=12,'periodicity']='M'
stateemp.loc[stateemp.month==13,'periodicity']='A'
stateemp.loc[stateemp.periodicity=='A','month']=nan
del(stateemp['period'])
stateemp['footnote_txt']=''
stateemp.loc[stateemp.footnote=='1','footnote_txt']='series break'
stateemp.loc[stateemp.footnote=='P','footnote_txt']='prelim'
stateemp['footnote']=~ (stateemp.footnote_txt=='')

areas=pd.read_csv(rawdat+'/miscbls/sm.area',header=0,names=['area','area_name'],usecols=[0,1],sep='\t')
industry=pd.read_csv(rawdat+'/miscbls/sm.industry',header=0,names=['ind','ind_name'],usecols=[0,1],sep='\t')
state=pd.read_csv(rawdat+'/miscbls/sm.state',header=0,names=['st_fips','st_name'],usecols=[0,1],sep='\t')

stateemp=stateemp.merge(areas,on='area')

As you can see, I haven't done anything with area but read in another datafile and merge it into the larger dataset.

@jreback
Contributor

jreback commented Apr 23, 2014

w/o the original files it's still impossible to look.

that said, don't do astype(int); do astype('int64'). pandas keeps everything as int64 and float64 (you CAN use the other dtypes); however, int is platform-dependent (it will be 32-bit on 32-bit platforms and 64-bit on 64-bit platforms), so try to avoid it in general. you could use int32 if you really want.

furthermore, it's better to use convert_objects(convert_numeric=True) on the entire dataframe (or on the series). This will infer dtypes properly (and set invalid entries to NaN).
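The platform dependence of astype(int) can be sketched as follows; this is a minimal illustration, with made-up zero-padded string codes mimicking the seriesid slices above:

```python
import numpy as np
import pandas as pd

# Plain `int` maps to the platform's native integer: typically 32-bit
# on 32-bit platforms and 64-bit on 64-bit Linux, so the resulting
# dtype (and downstream reductions) can differ by machine.
s = pd.Series(['00010', '97961'])  # hypothetical zero-padded codes

ambiguous = s.astype(int)       # dtype depends on the platform
portable = s.astype('int64')    # always 64-bit, everywhere

print(portable.dtype)  # int64
```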

@zoof
Author

zoof commented Apr 23, 2014

Ok, thanks for the workaround, but that doesn't change the fact that series means are computed incorrectly for certain dtypes on certain architectures.

In looking at the code again, the only dataset needed is, https://copy.com/722BDiVaQ7BL, and the code I posted can be truncated following:

stateemp['area']=stateemp.seriesid.str[5:10].astype(int)

@jreback
Contributor

jreback commented Apr 23, 2014

can you show:

  • areas.head()
  • areas.dtypes

@jreback
Contributor

jreback commented Apr 23, 2014

it's possible that there is a bug in the merging, but I can't reproduce it -- that's why I am asking :)

@zoof
Author

zoof commented Apr 23, 2014

As I pointed out in the prior comment, the merge command is not necessary (i.e., it is my astype(int) that makes it int32), but nevertheless:

In [11]: areas.head()
Out[11]: 
    area                            area_name
0      0                            Statewide
1  10180                          Abilene, TX
2  10380  Aguadilla-Isabela-San Sebastian, PR
3  10420                            Akron, OH
4  10500                           Albany, GA

[5 rows x 2 columns]

In [12]: areas.dtypes
Out[12]: 
area          int64
area_name    object
dtype: object

@jreback
Contributor

jreback commented Apr 23, 2014

ok, how about a sample of stateemp (with the code as you have it written)?

@zoof
Author

zoof commented Apr 23, 2014

All of stateemp is here: https://copy.com/722BDiVaQ7BL, in 'stateemp', i.e. read_hdf('pandas_mean_error.h5','stateemp'). The code was posted before but, again, could be further truncated to reproduce the problem.

@jreback
Contributor

jreback commented Apr 23, 2014

how did you write the HDF file? you have an odd filter in there

@zoof
Author

zoof commented Apr 23, 2014

stateemp.to_hdf('pandas_mean_error.h5','stateemp',complevel=9,complib='bzip2')

Although I can't be absolutely sure which complib I used, since I was doing it interactively to produce datasets for you folks to play with.

@jreback
Contributor

jreback commented Apr 23, 2014

ok...can you save w/o bzip2? i don't have it installed (and it's generally not that efficient anyhow; use 'blosc')

@zoof
Author

zoof commented Apr 23, 2014

Here is one that has been bloscified.
https://copy.com/7lM72h3Br2Uy

@jreback
Contributor

jreback commented Apr 23, 2014

ok, still doesn't show anything -- this looks fine.

In [5]: df.dtypes
Out[5]: 
seriesid         object
year              int64
value           float64
footnote           bool
sa                 bool
st_fips           int32
area              int32
supersec          int32
ind               int32
type              int32
month           float64
periodicity      object
footnote_txt     object
dtype: object

In [6]: df.area.mean()
Out[6]: 23785.447812211703

In [7]: df.area.describe()
Out[7]: 
count    5249571.000000
mean       23785.447812
std        22883.864925
min            0.000000
25%            0.000000
50%        21500.000000
75%        38900.000000
max        97961.000000
Name: area, dtype: float64

@zoof
Author

zoof commented Apr 23, 2014

Maybe it is very architecture-specific -- the work was initially done on one of the early Atom CPUs (embarrassed grin), and then I also tested it on a P4. The Atom's CPU flags are:

flags       : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm movbe lahf_lm dtherm

I can get the P4 flags if that would be useful.

@jreback
Contributor

jreback commented Apr 23, 2014

weird....

ok...use the methods I described above for conversions, and generally keep to 64-bit dtypes

closing...reopen if you need to

@jreback jreback closed this as completed Apr 23, 2014
@zoof
Author

zoof commented Apr 23, 2014

So why is int32 problematic? Presumably it takes less space. If I'm not wrong, there are cases where I'd even prefer to use int8, especially on RAM-constrained machines.

@jreback
Contributor

jreback commented Apr 23, 2014

int32 is not problematic at all; however, astype(int) IS (as it is then platform-dependent).

you CAN use int8 if you want. but to be honest, unless you are saving millions and millions of rows it's not going to make any difference (and could actually be slower, as manipulating int64 is pretty optimized -- a single instruction; i think int8 is not nearly so)
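A rough sketch of that size trade-off, using made-up data: int8 stores 1 byte per value vs. 8 for int64, so the saving only becomes meaningful at millions of rows.

```python
import numpy as np
import pandas as pd

# One million zeros: 8 bytes each as int64, 1 byte each as int8.
s64 = pd.Series(np.zeros(1000000, dtype='int64'))
s8 = s64.astype('int8')

print(s64.values.nbytes)  # 8000000
print(s8.values.nbytes)   # 1000000
```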

@zoof
Author

zoof commented Apr 24, 2014

This bug is really bothering me, so I installed a 32-bit Linux (also Manjaro) on a spare partition of my 64-bit machine -- same problem: df.area.describe() gives me a mean of 58.9... So on three separate machines with 32-bit OSes I'm observing this issue; I don't think it's an isolated case due to special hardware. I'm happy to try some alternate 32-bit flavor of Linux (Ubuntu, Debian, what have you). Alternatively, if int32 is really not an important dtype, perhaps it should simply be dropped.

@jreback
Contributor

jreback commented Apr 24, 2014

@zoof it certainly could be a bug on 32-bit, but we need a simple reproducible test.
Try to narrow it down to: a) construct a particular frame, b) perform an operation, c) it works on 64-bit but fails on 32-bit.

@zoof
Author

zoof commented Apr 24, 2014

You have the dataset. I read it in on 32-bit OSes, ask for df.area.describe(), and I get 58.927762 for the mean. When I do the same on a 64-bit Linux, I get a mean of 23785.447812. Do I need to do exhaustive tests on every conceivable piece of hardware and version of Linux? BTW, I have now confirmed the problem on a live session of 32-bit Ubuntu 14.04, from which I am now typing.

@jreback
Contributor

jreback commented Apr 24, 2014

you are missing the point

without a simple test i can't even begin to figure out where the problem is

in order for this to move forward you need to make a simple test

you can even read in a data set, but it has to be short, preferably from a string

it could be a numpy, python, or pandas bug;
it needs to be narrowed down

@zoof
Author

zoof commented Apr 24, 2014

It does not seem to be a numpy bug:

In [8]: np.array(df.area).dtype
Out[8]: dtype('int32')

In [9]: np.array(df.area).mean()
Out[9]: 23785.447812211703

If it is a Python bug, it works in a way that does not affect numpy.

It is not clear to me why it has to be short.

Is it a standard method of operation that if the precise bug cannot be pinned down, the issue is closed, despite the fact that it is clearly a bug? I'll say it again: you have the dataset. You could reproduce the erroneous result if you tried it with a 32-bit Linux distro.

@jreback
Contributor

jreback commented Apr 24, 2014

it probably is a bug

but pandas has a test suite of almost 5000 tests

in order to patch anything, there has to be a test

how can we include this massive data set in the test suite? it's simply not possible

you found an issue, great, but you have to narrow it down by producing a much smaller example that can be put directly in the code

someone has to run the test and see that it fails on a particular platform

and then test a fix that works and does not break anything else

what you have provided is an indication of a bug,
but since it is not readily reproducible it is impossible to move forward

if you do narrow it down, please reopen the issue

@zoof
Author

zoof commented Apr 24, 2014

The best I can do is cut the series down to about 90,000 observations. Taking every 58th observation gives:

In [30]: df.loc[df.index/58.==df.index/58,'area'].describe()
Out[30]: 
count    90510.000000
mean    -23663.382599
std      22887.987276
min          0.000000
25%          0.000000
50%      21500.000000
75%      38900.000000
max      97961.000000
Name: area, dtype: float64

In [31]: len(df.loc[df.index/58.==df.index/58,'area'])
Out[31]: 90510

Using every 59th fails to produce an error.
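For what it's worth, the negative mean above is consistent with signed 32-bit wraparound: 90,510 values averaging roughly 23,785 sum to just past 2**31 - 1. A minimal sketch of that wraparound, with made-up values, forcing the 32-bit accumulator a 32-bit platform would default to:

```python
import numpy as np

arr = np.array([2**30, 2**30, 2**30], dtype=np.int32)

# With a 64-bit accumulator the sum is exact.
print(arr.sum(dtype=np.int64))  # 3221225472

# With a 32-bit accumulator the sum exceeds 2**31 - 1, wraps
# modulo 2**32, and comes out negative.
print(arr.sum(dtype=np.int32))  # -1073741824
```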

@jreback
Contributor

jreback commented Apr 24, 2014

please show pd.print_versions()
I am looking to see if bottleneck is installed

@zoof
Author

zoof commented Apr 24, 2014

bottleneck is not installed:

In [4]: pd.print_versions()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-be909eec5dd8> in <module>()
----> 1 pd.print_versions()

AttributeError: 'module' object has no attribute 'print_versions'

Should I install it and try the calculation again?

@jreback
Contributor

jreback commented Apr 24, 2014

sorry, I meant pd.show_versions()

in any event, this is a numpy bug; see numpy/numpy#4638

@jreback
Contributor

jreback commented Apr 24, 2014

installing bottleneck will fix this on 32-bit; or use int64 dtypes, or work on 64-bit until numpy fixes it
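A minimal sketch of the int64 workaround, with made-up values: upcast before reducing, so the sum is accumulated in 64 bits regardless of platform.

```python
import numpy as np
import pandas as pd

# int32 series whose true sum (3 * 2**30) would overflow a
# 32-bit accumulator on a 32-bit platform.
s = pd.Series(np.array([2**30, 2**30, 2**30], dtype='int32'))

# Promote to int64 first; the mean is then computed safely.
print(s.astype('int64').mean())  # 1073741824.0
```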

@zoof
Author

zoof commented Apr 24, 2014

Many thanks! Will do so.

@jreback
Contributor

jreback commented Apr 24, 2014

thanks for reporting. sometimes bugs are hard to find!

@jreback
Contributor

jreback commented Apr 24, 2014

@zoof ok...going to fix this on the pandas side, as this is broken both in bottleneck (for float32 -- surprisingly, int32 works) and in numpy. they 'won't' fix, as it's the user's responsibility to do the upcasting (which is odd, because on 64-bit it works -- the return variable is already 64-bit, whilst on 32-bit it is not). weird.

@zoof
Author

zoof commented Apr 24, 2014

That is weird. Seems to me that if upcasting is the user's responsibility, it should at least throw a warning; otherwise, how is the user to know there is a problem? I guess use only int64 and float64.

@jreback
Contributor

jreback commented Apr 24, 2014

there is a way to intercept it, np.seterr(), but pandas turns this off for a variety of reasons.
it was no big deal to fix once I saw the problem.

you see the importance of narrowing down the problem; you have to be able to debug it.
