
Bug in pd.Series.mean() #6915

Closed
zoof opened this issue Apr 20, 2014 · 45 comments · Fixed by #6954
Labels: Bug, Dtype Conversions (Unexpected or buggy dtype conversions), Numeric Operations (Arithmetic, Comparison, and Logical operations)

Comments

@zoof

zoof commented Apr 20, 2014

In some cases, the mean is computed incorrectly; NumPy, however, does the calculation correctly. There is no problem with the standard deviation calculation. The following is an example.

In [11]: np.array(stateemp.area.tolist()).mean()
Out[11]: 23785.447812211703

In [12]: stateemp.area.mean()
Out[12]: 58.927762478114879

In [13]: np.array(stateemp.area.tolist()).std()
Out[13]: 22883.862745218048

In [14]: stateemp.area.std()
Out[14]: 22883.864924811925

In [15]: pd.__version__
Out[15]: '0.13.1'

In [16]: np.__version__
Out[16]: '1.8.1'
@jorisvandenbossche
Member

Are you able to narrow it down to an example dataframe that shows the issue, which you can post here?

@zoof
Author

zoof commented Apr 20, 2014

I have the dataframe that I found the problem with -- it is quite large, but I can delete all but the two variables where the calculation goes awry and post the dataset somewhere.

@jorisvandenbossche
Member

If you can, please do -- that would be interesting.

Do you have NaNs? And what is the result of stateemp.area.values.mean()?

@zoof
Author

zoof commented Apr 20, 2014

After some playing around, one hypothesis is that the bug has something to do with int32 vs. int64 dtypes. Initially, I exported the data to CSV, tried it on another computer, and got the right answer. I then saved it as an HDF5 file and got the wrong answer. Looking at the dtypes:

In [30]: stateemp.area.dtype
Out[30]: dtype('int32')

In [31]: stateemp.area.mean()
Out[31]: 58.927762478114879

and

In [34]: stateemp.area.dtype
Out[34]: dtype('int64')

In [35]: stateemp.area.mean()
Out[35]: 23785.447812211703

The two files are at:
https://copy.com/cOl2YgRicfco
https://copy.com/fddCxmq1Wpmp

In answer to your questions:

In [21]: stateemp.area.values.mean()
Out[21]: 23785.447812211703

In [22]: isnan(stateemp.area).sum()
Out[22]: 0

@jtratner
Contributor

Could you post the output of describe(), both pre- and post-load?

@zoof
Author

zoof commented Apr 23, 2014

I'm not sure what you mean by pre-load.

@jreback
Contributor

jreback commented Apr 23, 2014

what @jtratner means is: show EXACTLY what you are doing with those files, every command in an IPython session, so it can simply be copy-pasted and reproduced

@zoof
Author

zoof commented Apr 23, 2014

You mean something like:

Python 2.7.6 (default, Feb 26 2014, 12:01:28) 
Type "copyright", "credits" or "license" for more information.

IPython 2.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

In [1]: import pandas as pd

In [2]: df=pd.read_hdf('pandas_mean_error.h5','stateemp')

In [3]: df.area.describe()
Out[3]: 
count    5249571.000000
mean          58.927762
std        22883.864925
min            0.000000
25%            0.000000
50%        21500.000000
75%        38900.000000
max        97961.000000
Name: area, dtype: float64

In [4]: df=pd.read_csv('pandas_mean_error.csv.gz',compression='gzip')

In [5]: df.area.describe()
Out[5]: 
count    5249571.000000
mean       23785.447812
std        22883.864925
min            0.000000
25%            0.000000
50%        21500.000000
75%        38900.000000
max        97961.000000
Name: area, dtype: float64

In [6]: 

@zoof
Author

zoof commented Apr 23, 2014

That said, in addition to int32 vs. int64, it seems to be architecture-specific -- I've also run the same thing on a 64-bit machine, where the mean gets computed correctly. I constructed the datasets on a 32-bit machine. I'm running one of the flavors of Manjaro Linux on all my machines.

@jreback
Contributor

jreback commented Apr 23, 2014

ok, that shows the problem (but the reason we ask for ALL the info is that you didn't mention it was in an HDF file).

can you show what you are writing? (e.g. you are reading the file here, but show the code that generates the frame you then write.)

it should be completely self-contained and copy-pastable.

@zoof
Author

zoof commented Apr 23, 2014

I did, in my comment from 4 days ago. As I also mentioned in that comment, it appeared to be a difference between int32 and int64, so it's not obvious to me why HDF5 matters -- I observed the problem before saving anything out to HDF5.

@jreback
Contributor

jreback commented Apr 23, 2014

@zoof

ok, then show how you created them

you are showing a symptom, but not how you created the data, so it is still impossible to even guess where the problem lies.

@zoof
Author

zoof commented Apr 23, 2014

import pandas as pd
from numpy import nan,isnan

# set directories
rawdat = '/home/tct/rothstein/uiflows/rawdata'
scratch = '/home/tct/rothstein/uiflows/scratch'

stateemp = pd.read_csv(rawdat+'/miscbls/sm.data.1.AllData.gz',header=0,names=['seriesid','year','period','value','footnote'],converters={'footnote':str},sep='\t',compression='gzip')
stateemp['sa']=stateemp.seriesid.str[2]=='S'
stateemp['st_fips']=stateemp.seriesid.str[3:5].astype(int)
stateemp['area']=stateemp.seriesid.str[5:10].astype(int)
stateemp['supersec']=stateemp.seriesid.str[10:12].astype(int)
stateemp['ind']=stateemp.seriesid.str[10:18].astype(int)
stateemp['type']=stateemp.seriesid.str[18:20].astype(int)
stateemp['month']=stateemp.period.str.replace('M','').astype(float)
stateemp.loc[stateemp.month<=12,'periodicity']='M'
stateemp.loc[stateemp.month==13,'periodicity']='A'
stateemp.loc[stateemp.periodicity=='A','month']=nan
del(stateemp['period'])
stateemp['footnote_txt']=''
stateemp.loc[stateemp.footnote=='1','footnote_txt']='series break'
stateemp.loc[stateemp.footnote=='P','footnote_txt']='prelim'
stateemp['footnote']=~ (stateemp.footnote_txt=='')

areas=pd.read_csv(rawdat+'/miscbls/sm.area',header=0,names=['area','area_name'],usecols=[0,1],sep='\t')
industry=pd.read_csv(rawdat+'/miscbls/sm.industry',header=0,names=['ind','ind_name'],usecols=[0,1],sep='\t')
state=pd.read_csv(rawdat+'/miscbls/sm.state',header=0,names=['st_fips','st_name'],usecols=[0,1],sep='\t')

stateemp=stateemp.merge(areas,on='area')

As you can see, I haven't done anything with area but read in another datafile and merge it into the larger dataset.

@jreback
Contributor

jreback commented Apr 23, 2014

w/o the original files it's still impossible to look.

that said, don't do astype(int); do astype('int64'). pandas keeps everything as int64 and float64 (you CAN use the other dtypes); however, int is platform-dependent (it will be 32-bit on 32-bit platforms and 64-bit on 64-bit platforms), so try to avoid it in general. you could use int32 if you really want.

furthermore, it's better to use convert_objects(convert_numeric=True) on the entire dataframe (or on the series). This will infer dtypes properly (and set invalid entries to NaN).
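The platform dependence of astype(int) can be sketched as follows; this is a minimal illustration, with made-up zero-padded string codes mimicking the seriesid slices above:

```python
import numpy as np
import pandas as pd

# Plain `int` maps to the platform's native integer: typically 32-bit
# on 32-bit platforms and 64-bit on 64-bit Linux, so the resulting
# dtype (and downstream reductions) can differ by machine.
s = pd.Series(['00010', '97961'])  # hypothetical zero-padded codes

ambiguous = s.astype(int)       # dtype depends on the platform
portable = s.astype('int64')    # always 64-bit, everywhere

print(portable.dtype)  # int64
```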

@zoof
Author

zoof commented Apr 23, 2014

Ok, thanks for the workaround, but that doesn't change the fact that series means are computed incorrectly for certain dtypes on certain architectures.

In looking at the code again, the only dataset needed is, https://copy.com/722BDiVaQ7BL, and the code I posted can be truncated following:

stateemp['area']=stateemp.seriesid.str[5:10].astype(int)

@jreback
Contributor

jreback commented Apr 23, 2014

can you show:

  • areas.head()
  • areas.dtypes

@jreback
Contributor

jreback commented Apr 23, 2014

it's possible that there is a bug in the merging, but I can't reproduce it -- that's why I am asking :)

@zoof
Author

zoof commented Apr 23, 2014

As I pointed out in the prior comment, the merge command is not necessary (i.e., it is my astype(int) that makes it int32), but nevertheless:

In [11]: areas.head()
Out[11]: 
    area                            area_name
0      0                            Statewide
1  10180                          Abilene, TX
2  10380  Aguadilla-Isabela-San Sebastian, PR
3  10420                            Akron, OH
4  10500                           Albany, GA

[5 rows x 2 columns]

In [12]: areas.dtypes
Out[12]: 
area          int64
area_name    object
dtype: object

@jreback
Contributor

jreback commented Apr 23, 2014

ok, how about a sample of stateemp (with the code as you have it written)?

@zoof
Author

zoof commented Apr 23, 2014

All of stateemp is here: https://copy.com/722BDiVaQ7BL, in 'stateemp', i.e. read_hdf('pandas_mean_error.h5','stateemp'). The code was posted before but, again, could be further truncated to reproduce the problem.

@jreback
Contributor

jreback commented Apr 23, 2014

how did you write the HDF file? you have an odd filter in there

@zoof
Author

zoof commented Apr 23, 2014

stateemp.to_hdf('pandas_mean_error.h5','stateemp',complevel=9,complib='bzip2')

Although I can't be absolutely sure which complib I used, since I was doing it interactively to produce datasets for you folks to play with.

@jreback
Contributor

jreback commented Apr 23, 2014

ok...can you save w/o bzip2? i don't have it installed (and it's generally not that efficient anyhow; use 'blosc')

@zoof
Author

zoof commented Apr 23, 2014

Here is one that has been bloscified.
https://copy.com/7lM72h3Br2Uy

@jreback
Contributor

jreback commented Apr 23, 2014

ok, still doesn't show anything -- this looks fine.

In [5]: df.dtypes
Out[5]: 
seriesid         object
year              int64
value           float64
footnote           bool
sa                 bool
st_fips           int32
area              int32
supersec          int32
ind               int32
type              int32
month           float64
periodicity      object
footnote_txt     object
dtype: object

In [6]: df.area.mean()
Out[6]: 23785.447812211703

In [7]: df.area.describe()
Out[7]: 
count    5249571.000000
mean       23785.447812
std        22883.864925
min            0.000000
25%            0.000000
50%        21500.000000
75%        38900.000000
max        97961.000000
Name: area, dtype: float64

@zoof
Author

zoof commented Apr 23, 2014

Maybe it is very architecture-specific -- the work was initially done on one of the early Atom CPUs (embarrassed grin), and then I also tested it on a P4. The Atom's CPU flags are:

flags       : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm movbe lahf_lm dtherm

I can get the P4 flags if that would be useful.

@jreback
Contributor

jreback commented Apr 23, 2014

weird....

ok...use the methods I described above for conversions, and generally keep to 64-bit dtypes

closing...reopen if you need to

@jreback jreback closed this as completed Apr 23, 2014
@zoof
Author

zoof commented Apr 23, 2014

So why is int32 problematic? Presumably it takes less space. If I'm not wrong, there are cases where I'd even prefer to use int8, especially on RAM-constrained machines.

@jreback
Contributor

jreback commented Apr 23, 2014

int32 is not problematic at all; however, astype(int) IS (as it is then platform-dependent).

you CAN use int8 if you want. but to be honest, unless you are saving millions and millions of rows it's not going to make any difference (and could actually be slower, as manipulating int64 is pretty optimized -- a single instruction; i think int8 is not nearly so)
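A rough sketch of that size trade-off, using made-up data: int8 stores 1 byte per value vs. 8 for int64, so the saving only becomes meaningful at millions of rows.

```python
import numpy as np
import pandas as pd

# One million zeros: 8 bytes each as int64, 1 byte each as int8.
s64 = pd.Series(np.zeros(1000000, dtype='int64'))
s8 = s64.astype('int8')

print(s64.values.nbytes)  # 8000000
print(s8.values.nbytes)   # 1000000
```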

@zoof
Author

zoof commented Apr 24, 2014

This bug is really bothering me, so I installed a 32-bit Linux (also Manjaro) on a spare partition of my 64-bit machine -- same problem: df.area.describe() gives me a mean of 58.9... So on three separate machines with 32-bit OSes I'm observing this issue; I don't think it's an isolated case due to special hardware. I'm happy to try some alternate 32-bit flavor of Linux (Ubuntu, Debian, what have you). Alternatively, if int32 is really not an important dtype, perhaps it should simply be dropped.

@jreback
Contributor

jreback commented Apr 24, 2014

@zoof it certainly could be a bug on 32-bit, but we need a simple reproducible test.
Try to narrow it down to: a) construct a particular frame, b) perform an operation, c) it works on 64-bit but fails on 32-bit.

@zoof
Author

zoof commented Apr 24, 2014

You have the dataset. I read it in on 32-bit OSes, ask for df.area.describe(), and I get 58.927762 for the mean. When I do the same on a 64-bit Linux, I get a mean of 23785.447812. Do I need to do exhaustive tests on every conceivable piece of hardware and version of Linux? BTW, I have now confirmed the problem on a live session of 32-bit Ubuntu 14.04, from which I am now typing.

@jreback
Contributor

jreback commented Apr 24, 2014

you are missing the point

without a simple test i can't even begin to figure out where the problem is

in order for this to move forward you need to make a simple test

you can even read in a data set, but it has to be short, preferably from a string

it could be a numpy, python, or pandas bug;
it needs to be narrowed down

@zoof
Author

zoof commented Apr 24, 2014

It does not seem to be a numpy bug:

In [8]: np.array(df.area).dtype
Out[8]: dtype('int32')

In [9]: np.array(df.area).mean()
Out[9]: 23785.447812211703

If it is a Python bug, it works in a way that does not affect numpy.

It is not clear to me why it has to be short.

Is it a standard method of operation that if the precise bug cannot be pinned down, the issue is closed, despite the fact that it is clearly a bug? I'll say it again: you have the dataset. You could reproduce the erroneous result if you tried it with a 32-bit Linux distro.

@jreback
Contributor

jreback commented Apr 24, 2014

it probably is a bug

but pandas has a test suite of almost 5000 tests

in order to patch anything, there has to be a test

how can we include this massive data set in the test suite? it's simply not possible

you found an issue, great, but you have to narrow it down by producing a much smaller example that can be put directly in the code

someone has to run the test and see that it fails on a particular platform

and then test a fix that works and does not break anything else

what you have provided is an indication of a bug,
but since it is not readily reproducible it is impossible to move forward

if you do narrow it down, please reopen the issue

@zoof
Author

zoof commented Apr 24, 2014

The best I can do is cut the series down to about 90,000 observations. Taking every 58th observation gives:

In [30]: df.loc[df.index/58.==df.index/58,'area'].describe()
Out[30]: 
count    90510.000000
mean    -23663.382599
std      22887.987276
min          0.000000
25%          0.000000
50%      21500.000000
75%      38900.000000
max      97961.000000
Name: area, dtype: float64

In [31]: len(df.loc[df.index/58.==df.index/58,'area'])
Out[31]: 90510

Using every 59th fails to produce an error.
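For what it's worth, the negative mean above is consistent with signed 32-bit wraparound: 90,510 values averaging roughly 23,785 sum to just past 2**31 - 1. A minimal sketch of that wraparound, with made-up values, forcing the 32-bit accumulator a 32-bit platform would default to:

```python
import numpy as np

arr = np.array([2**30, 2**30, 2**30], dtype=np.int32)

# With a 64-bit accumulator the sum is exact.
print(arr.sum(dtype=np.int64))  # 3221225472

# With a 32-bit accumulator the sum exceeds 2**31 - 1, wraps
# modulo 2**32, and comes out negative.
print(arr.sum(dtype=np.int32))  # -1073741824
```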

@jreback
Contributor

jreback commented Apr 24, 2014

please show pd.print_versions()
I am looking to see if bottleneck is installed

@zoof
Author

zoof commented Apr 24, 2014

bottleneck is not installed:

In [4]: pd.print_versions()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-be909eec5dd8> in <module>()
----> 1 pd.print_versions()

AttributeError: 'module' object has no attribute 'print_versions'

Should I install it and try the calculation again?

@jreback
Contributor

jreback commented Apr 24, 2014

sorry, I meant pd.show_versions()

in any event, this is a numpy bug; see numpy/numpy#4638

@jreback
Contributor

jreback commented Apr 24, 2014

installing bottleneck will fix this on 32-bit; or use int64 dtypes, or work on 64-bit until numpy fixes it
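A minimal sketch of the int64 workaround, with made-up values: upcast before reducing, so the sum is accumulated in 64 bits regardless of platform.

```python
import numpy as np
import pandas as pd

# int32 series whose true sum (3 * 2**30) would overflow a
# 32-bit accumulator on a 32-bit platform.
s = pd.Series(np.array([2**30, 2**30, 2**30], dtype='int32'))

# Promote to int64 first; the mean is then computed safely.
print(s.astype('int64').mean())  # 1073741824.0
```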

@zoof
Author

zoof commented Apr 24, 2014

Many thanks! Will do so.

@jreback
Contributor

jreback commented Apr 24, 2014

thanks for reporting. sometimes bugs are hard to find!

@jreback
Contributor

jreback commented Apr 24, 2014

@zoof ok...going to fix this on the pandas side, as this is broken both in bottleneck (for float32 -- surprisingly, int32 works) and in numpy. they 'won't' fix, as it's the user's responsibility to do the upcasting (which is odd, because on 64-bit it works -- the return variable is already 64-bit, whilst on 32-bit it is not). weird.

@zoof
Author

zoof commented Apr 24, 2014

That is weird. Seems to me that if upcasting is the user's responsibility, it should at least throw a warning; otherwise, how is the user to know there is a problem? I guess use only int64 and float64.

@jreback
Contributor

jreback commented Apr 24, 2014

there is a way to intercept it, np.seterr(), but pandas turns this off for a variety of reasons.
it was no big deal to fix once I saw the problem.

you see the importance of narrowing down the problem; you have to be able to debug it.
