value_counts not working correctly on (some?) ExtensionArrays #33172

Closed
buhrmann opened this issue Mar 31, 2020 · 5 comments · Fixed by #33674
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@buhrmann

Code Sample

pd.Series(list("abcde"), dtype="string").value_counts(normalize=True)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-40-718821f804b4> in <module>
      1 # lang.value_counts(normalize=True)
----> 2 pd.Series(list("abcde"), dtype="string").value_counts(normalize=True)

~/anaconda/envs/grapy/lib/python3.7/site-packages/pandas/core/base.py in value_counts(self, normalize, sort, ascending, bins, dropna)
   1233             normalize=normalize,
   1234             bins=bins,
-> 1235             dropna=dropna,
   1236         )
   1237         return result

~/anaconda/envs/grapy/lib/python3.7/site-packages/pandas/core/algorithms.py in value_counts(values, sort, ascending, normalize, bins, dropna)
    729 
    730     if normalize:
--> 731         result = result / float(counts.sum())
    732 
    733     return result

AttributeError: 'IntegerArray' object has no attribute 'sum'

Problem description

The problem seems to be that value_counts() on a string extension dtype returns a result with Int64 dtype, and sum is not implemented for IntegerArray, although it is implemented for a Series backed by an ExtensionArray:

Expected Output

vc = pd.Series(list("abcde"), dtype="string").value_counts(normalize=False)
print(vc)
print(vc / vc.sum())
d    1
e    1
b    1
c    1
a    1
dtype: Int64
d    0.2
e    0.2
b    0.2
c    0.2
a    0.2
dtype: float64

May be related to #22843?
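
A possible workaround on affected versions, following the Expected Output above: take the raw counts first and normalize at the Series level, where sum is available. This is only a minimal sketch; `normalized_value_counts` is a hypothetical helper name, not part of pandas, and the final cast to float64 is an assumption made to match the dtype shown in the expected output.

```python
import pandas as pd


def normalized_value_counts(series: pd.Series, dropna: bool = True) -> pd.Series:
    """Relative frequencies computed without value_counts(normalize=True).

    Hypothetical helper (not a pandas API): it avoids the array-level
    counts.sum() call from the traceback by summing the resulting Series.
    """
    counts = series.value_counts(normalize=False, dropna=dropna)
    total = counts.sum()  # Series.sum works for an Int64-backed Series
    # Counts never contain missing values, so the cast to float64 is safe.
    return (counts / total).astype("float64")


proportions = normalized_value_counts(pd.Series(list("abcde"), dtype="string"))
print(proportions)
```

The helper should behave the same on versions where the bug is already fixed, so it can be dropped later without changing results.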

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.0
pip : 20.0.2
setuptools : 46.1.1.post20200322
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0

@jbrockmendel jbrockmendel added ExtensionArray Extending pandas with custom dtypes or arrays. Bug labels Apr 3, 2020
@simonjayhawkins
Member

For the Int64 case

>>> import pandas as pd
>>>
>>> pd.__version__
'0.26.0.dev0+1729.g8bdd7b13c'
>>>
>>> pd.Series([1, 2, 3, 4, 4], dtype="Int64").value_counts(normalize=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\base.py", line 1265, in value_counts
    dropna=dropna,
  File "C:\Users\simon\pandas\pandas\core\algorithms.py", line 724, in value_counts
    result = result / float(counts.sum())
AttributeError: 'IntegerArray' object has no attribute 'sum'
>>>

on 0.25.3

>>> import pandas as pd
>>>
>>> pd.__version__
'0.25.3'
>>>
>>> pd.Series([1, 2, 3, 4, 4], dtype="Int64").value_counts(normalize=True)
4    0.4
1    0.2
2    0.2
3    0.2
dtype: float64
>>>

8bdd7b1 is the first bad commit
commit 8bdd7b1
Author: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Date: Thu Jan 9 13:19:34 2020 -0600

BUG: BooleanArray.value_counts dropna (#30824)

#30824 made changes to masked extension arrays, so I assume the behaviour of other extension arrays is not impacted:

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1122.g01f73100d'
>>>
>>> pd.Series(pd.date_range("2000", periods=3, freq="A", tz="Europe/London")).value_counts(
...     normalize=True
... )
2001-12-31 00:00:00+00:00    0.333333
2000-12-31 00:00:00+00:00    0.333333
2002-12-31 00:00:00+00:00    0.333333
dtype: float64
>>>
>>> pd.Series(list("abbcc")).astype("category").value_counts(normalize=True)
c    0.4
b    0.4
a    0.2
dtype: float64
>>>

@simonjayhawkins simonjayhawkins added NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed ExtensionArray Extending pandas with custom dtypes or arrays. labels Apr 6, 2020
@jreback jreback added this to the 1.1 milestone Apr 20, 2020
@simonjayhawkins
Member

This is fixed on master (xref #33538); it just needs tests.

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1361.g77a0f19c5'
>>>
>>> pd.Series(list("abcde"), dtype="string").value_counts(normalize=True)
e    0.2
b    0.2
a    0.2
c    0.2
d    0.2
dtype: float64
>>>
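
A rough idea of what such a regression check could look like, using the inputs already posted in this thread. This is a sketch, not the test that was eventually merged: `check_value_counts_normalize` and `CASES` are hypothetical names, and results are compared label by label so the check does not depend on how ties are ordered or on the exact float dtype returned.

```python
import pandas as pd


def check_value_counts_normalize(values, dtype, expected):
    """Hypothetical check: normalize=True must not raise and must match expected."""
    result = pd.Series(values, dtype=dtype).value_counts(normalize=True)
    # Build a label -> fraction mapping so ordering of ties is irrelevant.
    got = {label: float(frac) for label, frac in result.items()}
    assert got.keys() == expected.keys()
    for label, frac in expected.items():
        assert abs(got[label] - frac) < 1e-12
    return got


# Inputs taken from the examples in this issue.
CASES = [
    (list("abcde"), "string", {c: 0.2 for c in "abcde"}),
    ([1, 2, 3, 4, 4], "Int64", {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.4}),
    (list("abbcc"), "category", {"a": 0.2, "b": 0.4, "c": 0.4}),
]

for values, dtype, expected in CASES:
    check_value_counts_normalize(values, dtype, expected)
```

On a version with the bug, the "string" and "Int64" cases would be expected to fail with the AttributeError reported above; in the pandas test suite the same cases would more naturally be written as a parametrized pytest test.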

@simonjayhawkins simonjayhawkins added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Apr 25, 2020
@yao-eastside

yao-eastside commented Apr 25, 2020

@simonjayhawkins On which version should we have the tests? I can work on this one. I will try 0.26.x, 0.25.x, and 1.1.x and report back here. I guess we also need unit tests, right?

@yao-eastside

take

@yao-eastside

yao-eastside commented Apr 25, 2020

Looks like @kotamatsuoka is working on the unit tests. I will wait and see.

@yao-eastside yao-eastside removed their assignment Apr 25, 2020