value_counts not working correctly on (some?) ExtensionArrays #33172

buhrmann · 2020-03-31T09:29:46Z

Code Sample

pd.Series(list("abcde"), dtype="string").value_counts(normalize=True)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-40-718821f804b4> in <module>
      1 # lang.value_counts(normalize=True)
----> 2 pd.Series(list("abcde"), dtype="string").value_counts(normalize=True)

~/anaconda/envs/grapy/lib/python3.7/site-packages/pandas/core/base.py in value_counts(self, normalize, sort, ascending, bins, dropna)
   1233             normalize=normalize,
   1234             bins=bins,
-> 1235             dropna=dropna,
   1236         )
   1237         return result

~/anaconda/envs/grapy/lib/python3.7/site-packages/pandas/core/algorithms.py in value_counts(values, sort, ascending, normalize, bins, dropna)
    729 
    730     if normalize:
--> 731         result = result / float(counts.sum())
    732 
    733     return result

AttributeError: 'IntegerArray' object has no attribute 'sum'

Problem description

The problem seems to be that value_counts() on a string extension dtype returns counts with Int64 dtype, and sum is not implemented for IntegerArray, although it is for a Series holding an ExtensionArray, so normalizing manually works:

Expected Output

vc = pd.Series(list("abcde"), dtype="string").value_counts(normalize=False)
print(vc)
print(vc / vc.sum())

d    1
e    1
b    1
c    1
a    1
dtype: Int64
d    0.2
e    0.2
b    0.2
c    0.2
a    0.2
dtype: float64

May be related to #22843?
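
Until the fix is available, the manual normalization shown above can be wrapped in a small helper. This is only a sketch: `normalized_value_counts` is a hypothetical name, not part of pandas, and it assumes missing values should be dropped (the `value_counts` default).

```python
import pandas as pd


def normalized_value_counts(series: pd.Series) -> pd.Series:
    """Hypothetical helper: relative frequencies without normalize=True."""
    # Plain counts; this call does not go through the failing code path.
    counts = series.value_counts()
    # Series.sum works even when the counts are backed by an IntegerArray,
    # so the division is done at the Series level.
    return counts / counts.sum()


freqs = normalized_value_counts(pd.Series(list("abcde"), dtype="string"))
print(freqs)
```

The resulting dtype may differ between pandas versions (for example float64 versus a nullable float), so downstream code should probably not rely on it.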

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.0
pip : 20.0.2
setuptools : 46.1.1.post20200322
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0


simonjayhawkins · 2020-04-06T17:26:19Z

For the Int64 case

>>> import pandas as pd
>>>
>>> pd.__version__
'0.26.0.dev0+1729.g8bdd7b13c'
>>>
>>> pd.Series([1, 2, 3, 4, 4], dtype="Int64").value_counts(normalize=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\base.py", line 1265, in value_counts
    dropna=dropna,
  File "C:\Users\simon\pandas\pandas\core\algorithms.py", line 724, in value_counts
    result = result / float(counts.sum())
AttributeError: 'IntegerArray' object has no attribute 'sum'
>>>

on 0.25.3

>>> import pandas as pd
>>>
>>> pd.__version__
'0.25.3'
>>>
>>> pd.Series([1, 2, 3, 4, 4], dtype="Int64").value_counts(normalize=True)
4    0.4
1    0.2
2    0.2
3    0.2
dtype: float64
>>>

8bdd7b1 is the first bad commit
commit 8bdd7b1
Author: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Date: Thu Jan 9 13:19:34 2020 -0600

BUG: BooleanArray.value_counts dropna (#30824)

#30824 made changes to masked extension arrays, so I assume the behaviour of other extension arrays is not impacted.

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1122.g01f73100d'
>>>
>>> pd.Series(pd.date_range("2000", periods=3, freq="A", tz="Europe/London")).value_counts(
...     normalize=True
... )
2001-12-31 00:00:00+00:00    0.333333
2000-12-31 00:00:00+00:00    0.333333
2002-12-31 00:00:00+00:00    0.333333
dtype: float64
>>>
>>> pd.Series(list("abbcc")).astype("category").value_counts(normalize=True)
c    0.4
b    0.4
a    0.2
dtype: float64
>>>

simonjayhawkins · 2020-04-25T09:06:19Z

This is fixed on master (xref #33538); it just needs tests.

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1361.g77a0f19c5'
>>>
>>> pd.Series(list("abcde"), dtype="string").value_counts(normalize=True)
e    0.2
b    0.2
a    0.2
c    0.2
d    0.2
dtype: float64
>>>
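
A regression test for this could look roughly like the sketch below. It is not the test that was eventually merged: the helper name `check_normalized_counts` and the test names are made up for illustration, and the comparison deliberately ignores the result dtype and the order of tied values, since both may vary between pandas versions.

```python
import pandas as pd


def check_normalized_counts(values, dtype, expected):
    """Hypothetical helper: compare normalized counts as plain floats."""
    result = pd.Series(values, dtype=dtype).value_counts(normalize=True)
    got = {key: float(val) for key, val in result.items()}
    assert got.keys() == expected.keys()
    for key, frac in expected.items():
        assert abs(got[key] - frac) < 1e-12


def test_value_counts_normalize_string():
    # The case from the original report.
    check_normalized_counts(list("abcde"), "string", {c: 0.2 for c in "abcde"})


def test_value_counts_normalize_int64():
    # The Int64 case from the bisection above.
    check_normalized_counts(
        [1, 2, 3, 4, 4], "Int64", {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.4}
    )


# Allow running the file directly as well as through a test runner.
test_value_counts_normalize_string()
test_value_counts_normalize_int64()
```

On the affected versions (for example 1.0.3) both functions would be expected to fail with the AttributeError shown earlier.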

yao-eastside · 2020-04-25T19:20:45Z

@simonjayhawkins Which version should the tests target? I can work on this one. I will try 0.26.x, 0.25.x, and 1.1.x and report back here. I guess we also need unit tests, right?

yao-eastside · 2020-04-25T20:10:48Z

take

yao-eastside · 2020-04-25T22:27:28Z

Looks like @kotamatsuoka is working on the unit tests. I will wait and see.

jbrockmendel added ExtensionArray Extending pandas with custom dtypes or arrays. Bug labels Apr 3, 2020

simonjayhawkins added NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed ExtensionArray Extending pandas with custom dtypes or arrays. labels Apr 6, 2020

kotamatsuoka mentioned this issue Apr 20, 2020

BUG: value_counts not working correctly on ExtensionArrays #33674

Merged

5 tasks

jreback added this to the 1.1 milestone Apr 20, 2020

simonjayhawkins mentioned this issue Apr 23, 2020

ENH: Implement IntegerArray.sum #33538

Merged

5 tasks

simonjayhawkins added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Apr 25, 2020

github-actions bot assigned yao-eastside Apr 25, 2020

yao-eastside removed their assignment Apr 25, 2020

jreback closed this as completed in #33674 May 1, 2020
