ENH: convert masked arrays for Series #20427

alorenzo175 · 2018-03-20T20:48:26Z

Problem description

When a Series is constructed from a masked float32 numpy array, calling mean() on a resample produces NaNs. This doesn't occur with masked float64 arrays or with non-masked float32 arrays. Some operations, like first(), work, while median() raises a ValueError.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

arr32 = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False], dtype='float32')
arr64 = np.ma.array([1.0, 2.0, 3.0], mask=[False, False, False], dtype='float64')
index = pd.date_range(start='2018-03-01 12:00:00Z', end='2018-03-01 12:10:00Z',
                      freq='5min')

ser32 = pd.Series(arr32, index=index)
ser64 = pd.Series(arr64, index=index)

print('float32 masked array')
print(ser32.resample('5min').mean())
print(ser32.resample('10min').mean())

print('float64 masked array')
print(ser64.resample('5min').mean())
print(ser64.resample('10min').mean())

print('non-masked float32')
print(pd.Series(arr32.data, index=index).resample('5min').mean())

ser32.resample('5min').median()

which outputs

float32 masked array                                                           
2018-03-01 12:00:00+00:00   NaN                                                
2018-03-01 12:05:00+00:00   NaN                                                
2018-03-01 12:10:00+00:00   NaN                                                
Freq: 5T, dtype: float32                                                       
2018-03-01 12:00:00+00:00   NaN                                                
2018-03-01 12:10:00+00:00   NaN                                                
Freq: 10T, dtype: float32                                                      
float64 masked array                                                           
2018-03-01 12:00:00+00:00    1.0                                               
2018-03-01 12:05:00+00:00    2.0                                               
2018-03-01 12:10:00+00:00    3.0                                               
Freq: 5T, dtype: float64                                                       
2018-03-01 12:00:00+00:00    1.5                                               
2018-03-01 12:10:00+00:00    3.0                                               
Freq: 10T, dtype: float64                                                      
non-masked float32                                                             
2018-03-01 12:00:00+00:00    1.0                                               
2018-03-01 12:05:00+00:00    2.0                                               
2018-03-01 12:10:00+00:00    3.0                                               
Freq: 5T, dtype: float32 

Traceback (most recent call last):
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1145, in median
return self._cython_agg_general('median', **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 921, in _cython_agg_general
min_count=min_count)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2314, in aggregate
min_count=min_count)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2242, in _cython_operation
values = _ensure_float64(values)
File "pandas/_libs/algos_common_helper.pxi", line 3182, in pandas._libs.algos.ensure_float64
File "pandas/_libs/algos_common_helper.pxi", line 3187, in pandas._libs.algos.ensure_float64
TypeError: astype() got an unexpected keyword argument 'copy'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 128, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 386, in nanmedian
values = values.ravel()
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/numpy/ma/core.py", line 4532, in ravel
r._mask = ndarray.ravel(self._mask, order=order).reshape(r.shape)
ValueError: cannot reshape array of size 0 into shape (1,)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "bad_pandas.py", line 26, in <module>
ser32.resample('5min').median()
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/resample.py", line 621, in f
return self._downsample(_method)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/resample.py", line 773, in _downsample
self.grouper, axis=self.axis).aggregate(how, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 3121, in aggregate
return getattr(self, func_or_funcs)(*args, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1156, in median
return self._python_agg_general(f)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 939, in _python_agg_general
result, counts = self.grouper.agg_series(obj, f)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 2591, in agg_series
return grouper.get_result()
File "pandas/_libs/src/reduce.pyx", line 279, in pandas._libs.lib.SeriesBinGrouper.get_result
File "pandas/_libs/src/reduce.pyx", line 265, in pandas._libs.lib.SeriesBinGrouper.get_result
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 933, in <lambda>
f = lambda x: func(x, *args, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/groupby.py", line 1155, in f
return x.median(axis=self.axis, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/generic.py", line 7315, in stat_func
numeric_only=numeric_only)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/series.py", line 2577, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 77, in _f
return f(*args, **kwargs)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 131, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/pandas/core/nanops.py", line 386, in nanmedian
values = values.ravel()
File "/home/user/anaconda/envs/testpd/lib/python3.6/site-packages/numpy/ma/core.py", line 4532, in ravel
r._mask = ndarray.ravel(self._mask, order=order).reshape(r.shape)
ValueError: cannot reshape array of size 0 into shape (1,)

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.9-300.fc27.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: 1.5.1
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None


jreback · 2018-03-26T10:38:32Z

So masked arrays are converted automatically for DataFrames, but I guess not for Series. We should just do this. A foreign ndarray like this doesn't have enough support to be a first-class object in pandas (not to mention it's too complex and, to be honest, not worth it; does anyone use masked arrays?).

So would take a PR to convert masked arrays for Series.
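Until such a conversion exists in the Series constructor, one way to sidestep the problem on the user side is to fill the masked slots with NaN before building the Series. The following is only a minimal sketch, not the pandas implementation: `masked_to_series` is a hypothetical helper name, and it assumes a float dtype (NaN cannot be stored in an integer array).

```python
import numpy as np
import pandas as pd


def masked_to_series(arr, index=None):
    """Hypothetical helper: build a Series from a float masked array.

    Assumes a float dtype, so masked entries can be represented as NaN.
    """
    # np.ma.filled returns a plain ndarray with masked slots replaced by NaN.
    values = np.ma.filled(arr, np.nan)
    return pd.Series(values, index=index)


arr32 = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False], dtype="float32")
index = pd.date_range("2018-03-01 12:00", periods=3, freq="5min", tz="UTC")

ser32 = masked_to_series(arr32, index=index)
resampled = ser32.resample("10min").mean()
print(resampled)
```

Because the Series is built from a regular ndarray, the resample path never sees a `MaskedArray`; the masked value simply becomes a missing value that `mean()` skips.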

dsm054 · 2018-11-12T23:31:31Z

This seems to work for me even in pandas 0.22.0 with numpy >= 1.15.1. Maybe something changed which (unintentionally) handled this case?

arw2019 · 2020-11-21T17:45:49Z

This works on 1.2 master (due to #24581 and follow-ons).

There are tests in pandas/tests/frame/test_constructors. The tests don't use a datetime index, but AFAICT this isn't the core issue here.

jbrockmendel · 2021-03-03T00:13:15Z

Another difference between the Series/DataFrame behavior with numpy masked arrays is what we do with the fill value:

import numpy as np
import pandas as pd
from numpy.ma import mrecords

# Structured masked array with an explicit fill_value
mask = [(True, False), (False, True), (False, False), (False, True), (False, False)]
data = np.ma.array(np.ma.zeros(5, dtype=[("date", "<f8"), ("price", "<f8")]), mask=mask, fill_value=9999)

recs = data.view(mrecords.mrecarray)

# DataFrame built directly from the masked records
df = pd.DataFrame(recs)

# DataFrame built column by column through the Series constructor
sers = {name: pd.Series(recs[name]) for name in recs.dtype.names}
expected = pd.DataFrame(sers)

>>> df
     date   price
0  9999.0     0.0
1     0.0  9999.0
2     0.0     0.0
3     0.0  9999.0
4     0.0     0.0

>>> expected
   date  price
0   NaN    0.0
1   0.0    NaN
2   0.0    0.0
3   0.0    NaN
4   0.0    0.0

i.e. with the mrecords we fill with the array's fill_value, whereas for Series we ignore it. This happens because for Series we go through sanitize_masked_array, while for MaskedRecords we go through fill_masked_arrays.

It's easy to make these match; we just need to decide which is "right".
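The two conventions can be seen on a plain one-dimensional masked array using numpy alone. This is a small sketch of the choice being discussed, not of the pandas code paths themselves; the variable names are illustrative.

```python
import numpy as np

arr = np.ma.array([0.0, 0.0, 0.0], mask=[True, False, False], fill_value=9999)

# Convention 1: honor the array's own fill_value (what the mrecords path does).
with_fill_value = arr.filled()

# Convention 2: ignore fill_value and mark masked slots as missing with NaN
# (what the Series path does).
with_nan = arr.filled(np.nan)

print(with_fill_value)
print(with_nan)
```

The first form keeps whatever sentinel the array's author chose, while the second maps masked entries onto the missing-value marker that pandas uses for float data.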

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate ExtensionArray Extending pandas with custom dtypes or arrays. labels Mar 26, 2018

jreback added this to the Next Major Release milestone Mar 26, 2018

jreback changed the title ~~Resampling operations on float32 masked array Series~~ ENH: convert masked arrays for Series Mar 26, 2018

jbrockmendel removed Effort Medium labels Oct 21, 2019

arw2019 added the Closing Candidate May be closeable, needs more eyeballs label Nov 21, 2020

MarcoGorelli added Needs Discussion Requires discussion from core team before further action and removed Closing Candidate May be closeable, needs more eyeballs labels Mar 21, 2021

mroeschke added Enhancement Resample resample method and removed Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 19, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

mvashishtha mentioned this issue Aug 23, 2023

BUG: creating dataframe from masked numpy array turns missing string values into NaN #54706

Closed

3 tasks
