computing mean on non-numeric column turns dataframe into a complex number? #22506

Closed
hhuuggoo opened this Issue Aug 25, 2018 · 10 comments

hhuuggoo commented Aug 25, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

# Mixed frame: a string column, a float column, and an int64 copy of the float column.
df = pd.DataFrame({
    "user": ["A", "A", "A", "A", "A"],
    "connections": [18446744.0, 4970.0, 4749.0, 4719.0, 4704.0]
})
df['connections2'] = df.connections.astype('int64')
print("BAD")
print(df.mean())

# Same frame, but the second numeric column stays float64.
df = pd.DataFrame({
    "user": ["A", "A", "A", "A", "A"],
    "connections": [18446744.0, 4970.0, 4749.0, 4719.0, 4704.0]
})
df['connections2'] = df.connections
print("GOOD")
print(df.mean())

## the output is 
BAD                                                                                                                                                                              
user            (9.363467632937e-311+9.3634676604193e-311j)                                                                                                                      
connections                                              0j                                                                                                                      
connections2                                 (3693177.2+0j)                                                                                                                      
dtype: complex128                                                                                                                                                                
GOOD                                                                                                                                                                             
connections     3693177.2                                                                                                                                                        
connections2    3693177.2                                                                                                                                                        
dtype: float64

Problem description

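When the DataFrame contains string, float, and int columns together, df.mean() returns complex numbers (dtype complex128), including a garbage value for the string column, instead of skipping the non-numeric column and returning float64 means.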

Expected Output

connections 3693177.2
connections2 3693177.2
dtype: float64
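
A possible workaround, not taken from this thread, is to restrict the reduction to numeric columns so the string column never reaches the cast. The sketch below assumes a pandas version that provides `DataFrame.select_dtypes` and reuses the data from the report:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["A", "A", "A", "A", "A"],
    "connections": [18446744.0, 4970.0, 4749.0, 4719.0, 4704.0],
})
df["connections2"] = df["connections"].astype("int64")

# Keep only numeric columns so the string column is never reduced.
numeric_means = df.select_dtypes(include="number").mean()
print(numeric_means)
```

This sidesteps the symptom only; it does not address the underlying cast behavior.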

Output of pd.show_versions()

INSTALLED VERSIONS

commit: fa47b8d
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1065-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+515.gfa47b8d
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

hhuuggoo commented Aug 25, 2018

I'm going to close this issue for now. I think it is related to the integer overflow (which is what I was looking at before I ran into this). I'll reopen it if it's not fixed once I fix the integer overflow.

@hhuuggoo hhuuggoo closed this Aug 25, 2018

hhuuggoo commented Aug 25, 2018

I take it back. If I reduce the integer values so that there is no overflow, the complex number problem still occurs, so this is definitely an independent issue.

@hhuuggoo hhuuggoo reopened this Aug 25, 2018

hhuuggoo commented Aug 25, 2018

So I think the issue is with the Anaconda builds of numpy 1.14 and 1.15. numpy 1.13 from Anaconda appears to be fine, as do the PyPI wheels for numpy 1.15.

The problematic builds of numpy behave inconsistently when mixed-type object arrays are cast to complex types (the pandas code relies on the astype operation raising an error).

I'm going to test it a bit more on a different machine, and then I'll close this issue and open one with Anaconda.
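
To illustrate why a cast that silently succeeds is a problem, here is a minimal sketch of a reducer that depends on `astype` raising for non-numeric data. It is not the actual pandas code; `mean_of_castable` is a hypothetical helper, and it assumes a numpy build where the cast fails consistently:

```python
import numpy as np

def mean_of_castable(columns):
    """Hypothetical reducer: average columns that cast to complex128, skip the rest."""
    means = {}
    for name, values in columns.items():
        arr = np.array(values, dtype=object)
        try:
            as_complex = arr.astype(np.complex128)
        except (TypeError, ValueError):
            # Non-numeric column: skipping it depends on the cast raising.
            continue
        means[name] = as_complex.mean()
    return means

columns = {
    "user": ["A", "A", "A", "A", "A"],
    "connections": [18446744.0, 4970.0, 4749.0, 4719.0, 4704.0],
}
means = mean_of_castable(columns)
print(means)
```

If the cast returns garbage instead of raising, the "user" column is averaged like any other, which matches the complex128 output in the original report.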

hhuuggoo commented Sep 8, 2018

Just tested it on my desktop: it's definitely an issue with the Anaconda builds of numpy 1.14 and 1.15. I don't think this is a pandas issue, so I'll close it.

The simpler reproduction of the issue (which doesn't involve pandas at all) is

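import numpy as np
import logging  # unused; kept so the traceback line numbers below still match
arr = np.array(['AAAAA', 18465886.0, 18465886.0], dtype=object)
print(arr.astype(np.complex64))  # first cast: returns a garbage value for 'AAAAA'
print(arr.astype(np.complex64))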

outputs:

[4.8938719e-22+4.569e-41j 1.8465886e+07+0.000e+00j                                                                                                                                                                                                                              
 1.8465886e+07+0.000e+00j]                                                                                                                                                                                                                                                      
Traceback (most recent call last):                                                                                                                                                                                                                                              
  File "repro2.py", line 5, in <module>                                                                                                                                                                                                                                         
    print(arr.astype(np.complex64))                                                                                                                                                                                                                                             
TypeError: must be real number, not str   

This is obviously wrong: the same cast returns a garbage value for the string the first time and raises a TypeError the second time.

I ran this with the following environment:

# platform: linux-64                                                                                                                                                                                                                                                            
blas=1.0=mkl                                                                                                                                                                                                                                                                    
ca-certificates=2018.03.07=0                                                                                                                                                                                                                                                    
certifi=2018.8.24=py37_1                                                                                                                                                                                                                                                        
intel-openmp=2018.0.3=0                                                                                                                                                                                                                                                         
libedit=3.1.20170329=h6b74fdf_2                                                                                                                                                                                                                                                 
libffi=3.2.1=hd88cf55_4                                                                                                                                                                                                                                                         
libgcc-ng=8.2.0=hdf63c60_1                                                                                                                                                                                                                                                      
libgfortran-ng=7.3.0=hdf63c60_0                                                                                                                                                                                                                                                 
libstdcxx-ng=8.2.0=hdf63c60_1                                                                                                                                                                                                                                                   
mkl=2018.0.3=1                                                                                                                                                                                                                                                                  
mkl_fft=1.0.4=py37h4414c95_1                                                                                                                                                                                                                                                    
mkl_random=1.0.1=py37h4414c95_1                                                                                                                                                                                                                                                 
ncurses=6.1=hf484d3e_0                                                                                                                                                                                                                                                          
numpy=1.15.1=py37h1d66e8a_0                                                                                                                                                                                                                                                     
numpy-base=1.15.1=py37h81de0dd_0                                                                                                                                                                                                                                                
openssl=1.0.2p=h14c3975_0                                                                                                                                                                                                                                                       
pandas=0.23.4=py37h04863e7_0                                                                                                                                                                                                                                                    
pip=10.0.1=py37_0                                                                                                                                                                                                                                                               
python=3.7.0=hc3d631a_0                                                                                                                                                                                                                                                         
python-dateutil=2.7.3=py37_0                                                                                                                                                                                                                                                    
pytz=2018.5=py37_0                                                                                                                                                                                                                                                              
readline=7.0=h7b6447c_5                                                                                                                                                                                                                                                         
setuptools=40.2.0=py37_0                                                                                                                                                                                                                                                        
six=1.11.0=py37_1                                                                                                                                                                                                                                                               
sqlite=3.24.0=h84994c4_0                                                                                                                                                                                                                                                        
tk=8.6.8=hbc83047_0                                                                                                                                                                                                                                                             
wheel=0.31.1=py37_0                                                                                                                                                                                                                                                             
xz=5.2.4=h14c3975_4                                                                                                                                                                                                                                                             
zlib=1.2.11=ha838bed_2                   
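
The inconsistency in the reproduction above can also be checked programmatically. The following is a minimal sketch, assuming only numpy; `cast_outcome` is a hypothetical helper that records whether a cast succeeded or which exception it raised, so two identical casts can be compared:

```python
import numpy as np

def cast_outcome(arr, dtype):
    """Hypothetical helper: return 'ok' if the cast succeeds, else the exception name."""
    try:
        arr.astype(dtype)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return "ok"

arr = np.array(['AAAAA', 18465886.0, 18465886.0], dtype=object)
first = cast_outcome(arr, np.complex64)
second = cast_outcome(arr, np.complex64)

# On a correct build both casts fail the same way; on an affected build they differ.
print(first, second, "consistent" if first == second else "INCONSISTENT")
```

On an affected build the first call would report "ok" and the second an exception name, which is the mismatch described above.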

@hhuuggoo hhuuggoo closed this Sep 8, 2018

bear24rw commented Sep 18, 2018

@hhuuggoo did you file this with numpy?

hhuuggoo commented Sep 18, 2018

No, because it's not a numpy issue (the non-Anaconda builds are fine). I'm not sure where to file issues about Anaconda packages anymore.

hhuuggoo commented Sep 18, 2018

bear24rw commented Sep 18, 2018

I have the same issue with a non-Anaconda build (a plain pip install on macOS with Homebrew python3).

bear24rw commented Sep 19, 2018

bear24rw commented Oct 1, 2018

Should be fixed now via numpy/numpy#11993
