ENH: convert masked arrays for Series #20427

Open
alorenzo175 opened this issue Mar 20, 2018 · 4 comments
Labels
Enhancement, Needs Discussion, Resample

Comments

@alorenzo175
Contributor

Problem description

When a Series is constructed from a masked float32 numpy array, calling mean() on a resample produces NaNs. This doesn't occur with masked float64 arrays or non-masked float32 arrays. Some operations, like first(), work, while median() raises a ValueError.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd


arr32 = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False], dtype='float32')
arr64 = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False], dtype='float64')
index = pd.date_range(start='2018-03-01 12:00:00Z', end='2018-03-01 12:10:00Z',
                      freq='5min')

ser32 = pd.Series(arr32, index=index)
ser64 = pd.Series(arr64, index=index)

print('float32 masked array')
print(ser32.resample('5min').mean())
print(ser32.resample('10min').mean())

print('float64 masked array')
print(ser64.resample('5min').mean())
print(ser64.resample('10min').mean())

print('non-masked float32')
print(pd.Series(arr32.data, index=index).resample('5min').mean())

ser32.resample('5min').median()

which outputs

float32 masked array
2018-03-01 12:00:00+00:00   NaN
2018-03-01 12:05:00+00:00   NaN
2018-03-01 12:10:00+00:00   NaN
Freq: 5T, dtype: float32
2018-03-01 12:00:00+00:00   NaN
2018-03-01 12:10:00+00:00   NaN
Freq: 10T, dtype: float32
float64 masked array
2018-03-01 12:00:00+00:00    1.0
2018-03-01 12:05:00+00:00    2.0
2018-03-01 12:10:00+00:00    3.0
Freq: 5T, dtype: float64
2018-03-01 12:00:00+00:00    1.5
2018-03-01 12:10:00+00:00    3.0
Freq: 10T, dtype: float64
non-masked float32
2018-03-01 12:00:00+00:00    1.0
2018-03-01 12:05:00+00:00    2.0
2018-03-01 12:10:00+00:00    3.0
Freq: 5T, dtype: float32

Traceback (most recent call last):
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1145, in median
    return self._cython_agg_general('median', **kwargs)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 921, in _cython_agg_general
    min_count=min_count)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2314, in aggregate
    min_count=min_count)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2242, in _cython_operation
    values = _ensure_float64(values)
  File "pandas/_libs/algos_common_helper.pxi", line 3182, in pandas._libs.algos.ensure_float64
  File "pandas/_libs/algos_common_helper.pxi", line 3187, in pandas._libs.algos.ensure_float64
TypeError: astype() got an unexpected keyword argument 'copy'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 128, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 386, in nanmedian
    values = values.ravel()
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/numpy/ma/core.py", line 4532, in ravel
    r._mask = ndarray.ravel(self._mask, order=order).reshape(r.shape)
ValueError: cannot reshape array of size 0 into shape (1,)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bad_pandas.py", line 26, in <module>
    ser32.resample('5min').median()
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/resample.py", line 621, in f
    return self._downsample(_method)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/resample.py", line 773, in _downsample
    self.grouper, axis=self.axis).aggregate(how, **kwargs)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 3121, in aggregate
    return getattr(self, func_or_funcs)(*args, **kwargs)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1156, in median
    return self._python_agg_general(f)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 939, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2591, in agg_series
    return grouper.get_result()
  File "pandas/_libs/src/reduce.pyx", line 279, in pandas._libs.lib.SeriesBinGrouper.get_result
  File "pandas/_libs/src/reduce.pyx", line 265, in pandas._libs.lib.SeriesBinGrouper.get_result
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 933, in <lambda>
    f = lambda x: func(x, *args, **kwargs)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1155, in f
    return x.median(axis=self.axis, **kwargs)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/generic.py", line 7315, in stat_func
    numeric_only=numeric_only)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/series.py", line 2577, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 77, in _f
    return f(*args, **kwargs)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 131, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 386, in nanmedian
    values = values.ravel()
  File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/numpy/ma/core.py", line 4532, in ravel
    r._mask = ndarray.ravel(self._mask, order=order).reshape(r.shape)
ValueError: cannot reshape array of size 0 into shape (1,)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.9-300.fc27.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: 1.5.1
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

@jreback
Contributor

jreback commented Mar 26, 2018

So masked arrays are converted automatically for DataFrames, but I guess not for Series. We should just do this. A foreign ndarray like this doesn't have enough support to be a first-class object in pandas (not to mention it's too complex and, to be honest, not worth it; does anyone use masked arrays?).

So would take a PR to convert masked arrays for Series.
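
Until such a conversion happens on the Series side, one possible user-side workaround is to turn the masked array into a plain ndarray before constructing the Series. The following is only a minimal sketch under that assumption: `masked_to_nan` is a hypothetical helper name, not a pandas or numpy function, and it does not claim to reproduce what pandas does internally.

```python
import numpy as np


def masked_to_nan(arr):
    """Return a plain ndarray with masked entries replaced by NaN.

    Hypothetical helper for illustration only; not part of pandas or numpy.
    """
    if not isinstance(arr, np.ma.MaskedArray):
        # Nothing to unmask; just make sure we hand back an ndarray.
        return np.asarray(arr)
    # NaN needs a floating dtype: keep float32/float64 as they are,
    # and upcast anything else (e.g. integers) to float64.
    dtype = arr.dtype if np.issubdtype(arr.dtype, np.floating) else np.float64
    # filled() returns a regular ndarray, so no mask reaches pandas.
    return arr.astype(dtype).filled(np.nan)


arr32 = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False], dtype="float32")
plain = masked_to_nan(arr32)
print(type(plain), plain.dtype, plain)
```

With the arrays from the report, this would be used as pd.Series(masked_to_nan(arr32), index=index), so that the resample only ever sees an ordinary float32 ndarray.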

@jreback added the Reshaping, Dtype Conversions, Difficulty Intermediate, and ExtensionArray labels on Mar 26, 2018
@jreback added this to the Next Major Release milestone on Mar 26, 2018
@jreback changed the title from "Resampling operations on float32 masked array Series" to "ENH: convert masked arrays for Series" on Mar 26, 2018
@dsm054
Contributor

dsm054 commented Nov 12, 2018

This seems to work for me even in pandas 0.22.0 with numpy >= 1.15.1. Maybe something changed which (unintentionally) handled this case?

@arw2019
Member

arw2019 commented Nov 21, 2020

This works on 1.2 master (due to #24581 and follow-ons).

There are tests in pandas/tests/frame/test_constructors. The tests don't use a datetime index, but AFAICT this isn't the core issue here.
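
For anyone who wants to check the datetime-index case directly, here is a rough sketch of a regression-style check built from the original report. It assumes a pandas version where the conversion works (1.2 or later, per the comment above); `resample_mean_of_masked_float32` is a hypothetical name, not an existing pandas test.

```python
import numpy as np
import pandas as pd


def resample_mean_of_masked_float32():
    """Rebuild the float32 masked-array Series from the report and resample it.

    Hypothetical helper for illustration; not an existing pandas test.
    """
    arr32 = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False],
                        dtype="float32")
    index = pd.date_range(start="2018-03-01 12:00:00Z",
                          end="2018-03-01 12:10:00Z", freq="5min")
    ser32 = pd.Series(arr32, index=index)
    # On affected versions (pandas 0.22.0 in the report) this was all NaN.
    return ser32.resample("5min").mean()


result = resample_mean_of_masked_float32()
print(result)
```

If the result still contains NaNs, the installed pandas/numpy combination is presumably one of the affected ones.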

@arw2019 added the Closing Candidate label on Nov 21, 2020
@jbrockmendel
Member

Another difference between the Series and DataFrame behavior with numpy masked arrays is what we do with the fill value:

import numpy as np
import pandas as pd
from numpy.ma import mrecords

mask = [(True, False), (False, True), (False, False), (False, True), (False, False)]
data = np.ma.array(np.ma.zeros(5, dtype=[("date", "<f8"), ("price", "<f8")]), mask=mask, fill_value=9999)

recs = data.view(mrecords.mrecarray)

df = pd.DataFrame(recs)

sers = {name: pd.Series(recs[name]) for name in recs.dtype.names}
expected = pd.DataFrame(sers)

>>> df
     date   price
0  9999.0     0.0
1     0.0  9999.0
2     0.0     0.0
3     0.0  9999.0
4     0.0     0.0

>>> expected
   date  price
0   NaN    0.0
1   0.0    NaN
2   0.0    0.0
3   0.0    NaN
4   0.0    0.0

That is, with the mrecords we fill with the array's fill_value, whereas for Series we ignore it and use NaN. This happens because for Series we go through sanitize_masked_array, while for MaskedRecords we go through fill_masked_arrays.

It's easy to make these match; we just need to decide which is "right".
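
The two candidate behaviors can be illustrated with numpy alone. This is only a sketch of the choice being discussed, using a made-up three-element `price` array; it does not reproduce the pandas internals named above.

```python
import numpy as np

# Illustrative data: one masked entry, with an explicit fill_value.
price = np.ma.array([0.0, 0.0, 0.0], mask=[False, True, False],
                    fill_value=9999)

# Option A: honor the array's own fill_value (what the mrecords path shows).
with_fill_value = price.filled()

# Option B: ignore fill_value and mark masked entries as missing with NaN
# (what the Series path shows).
with_nan = price.filled(np.nan)

print(with_fill_value)
print(with_nan)
```

Option B keeps masked entries recognizable as missing data in pandas, whereas option A turns them into ordinary values that downstream code cannot distinguish from real observations.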

@MarcoGorelli added the Needs Discussion label and removed the Closing Candidate label on Mar 21, 2021
@mroeschke added the Enhancement and Resample labels and removed the Dtype Conversions, ExtensionArray, and Reshaping labels on Jun 19, 2021
@mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022

7 participants