Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: SparseSeries.value_counts ignores fill_value #12835

Closed
wants to merge 2 commits into from

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Apr 9, 2016

Also fixed an issue which categorical value_counts resets name.

@sinhrks sinhrks added Sparse Sparse Data Type Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Apr 9, 2016
@sinhrks sinhrks added this to the 0.18.1 milestone Apr 9, 2016
@sinhrks sinhrks added the Bug label Apr 9, 2016
@@ -268,69 +267,81 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

if com.is_categorical_dtype(values.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually with only a small rewriting I think you could do:

if com.is_extension_dtype(values) or com.is_datetimelike(values):
    result = values.values_counts(dropna)
    result.name = name
else:
   .....

if we define DateTimeIndex.value_counts that properly handles localization before/after (instead of in here)
will be much cleaner.

Further it makes sense to define .value_counts() in tseries/base.py (and move the inference from core/base.py/value_counts.py there. To handle the dispatching of period/DatetimeIndex.

a bit of refactoring but moves functionaility where it should be.

@sinhrks sinhrks force-pushed the sparse_valuecounts branch 2 times, most recently from 81fac4e to 91c4c2f Compare April 9, 2016 18:00
@sinhrks
Copy link
Member Author

sinhrks commented Apr 9, 2016

Found current master doesn't work for datetimetz. Let me check.

values = [pd.Timestamp('2011-01-01 09:00', tz='US/Eastern'),
                  pd.Timestamp('2011-01-01 10:00', tz='US/Eastern'),
                  pd.Timestamp('2011-01-01 11:00', tz='US/Eastern'),
                  pd.Timestamp('2011-01-01 09:00', tz='US/Eastern'),
                  pd.Timestamp('2011-01-01 09:00', tz='US/Eastern'),
                  pd.Timestamp('2011-01-01 11:00', tz='US/Eastern')]
s = pd.Series(values, name='xxx')

# NG, index must have tz
s.value_counts()
# 2011-01-01 14:00:00    3
# 2011-01-01 16:00:00    2
# 2011-01-01 15:00:00    1
# Name: xxx, dtype: int64

@sinhrks sinhrks force-pushed the sparse_valuecounts branch 2 times, most recently from 89ab6ea to add609f Compare April 24, 2016 01:56
@sinhrks sinhrks added Timezones Timezone data dtype Categorical Categorical Data Type labels Apr 24, 2016
@sinhrks
Copy link
Member Author

sinhrks commented Apr 24, 2016

Additionally fixing the other Categorical issue.

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).value_counts(normalize=True)
# UnboundLocalError: local variable 'counts' referenced before assignment

@sinhrks sinhrks force-pushed the sparse_valuecounts branch 2 times, most recently from c14cbf5 to f1d62ee Compare April 24, 2016 03:32
@jreback
Copy link
Contributor

jreback commented Apr 25, 2016

@sinhrks pls rebase when you can

@codecov-io
Copy link

codecov-io commented Apr 28, 2016

Current coverage is 84.04%

Merging #12835 into master will increase coverage by +0.07%

@@             master     #12835   diff @@
==========================================
  Files           136        136          
  Lines         49908      49931    +23   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          41876      41962    +86   
+ Misses         8032       7969    -63   
  Partials          0          0          
  1. File ...ndas/core/groupby.py (not in diff) was modified. more
    • Misses -63
    • Partials 0
    • Hits +63

Powered by Codecov. Last updated by 38a7531

keys, counts = htable.value_count_scalar64(values, dropna)
dtype = values.dtype
is_period = com.is_period_arraylike(values)
is_datetimetz = com.is_datetimetz(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't happen here as is_extension_type will do this

@@ -1875,7 +1873,6 @@ def check_nunique(df, keys):
check_nunique(frame, ['jim'])
check_nunique(frame, ['jim', 'joe'])

@slow
Copy link
Member Author

@sinhrks sinhrks Apr 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed @slow as groupby.value_coutns has separate logic depending on numpy version.

if sort:
result = result.sort_values(ascending=ascending)
values = values.view(np.int64)
keys, counts = htable.value_count_scalar64(values, dropna)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, though I still think we should handle in tseries/base, e.g. define DatetimeIndexOpsMixIN and just defer to it like we are doing for sparse/categorical.

Copy link
Member Author

@sinhrks sinhrks Apr 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sparse/categorical value_counts only supports dropna kwds. It may better to be an internal method.

If we do the same thing using DatetimeIndexOpsMixin, the impl should be:

  • DatetimeIndexOpsMixin 's public value_counts calls algorithms.py (as it needs to handle kwds like normalize)
  • algorithms.py convert datetime-likes to datetime-like Index to handle Series
  • Calls DatetimeIndexOpsMixin 's internal _value_counts

Suppose to be little complex, or do you have anything in your mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah ok, it can ge tricky. ok, we'll keep in mind for later. ideally to put specific logic. merging.

@jreback jreback closed this in 8439d28 Apr 29, 2016
@sinhrks sinhrks deleted the sparse_valuecounts branch April 29, 2016 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Categorical Categorical Data Type Sparse Sparse Data Type Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SparseSeries.value_counts doesn't include fill_value counts
3 participants