mzeitlin11
Member

@mzeitlin11 mzeitlin11 commented Jun 10, 2021

Built on #41916

Three benchmarks slow down slightly because the sort now uses lexsort instead of argsort: it sorts on both the data and the mask so that na_option is handled properly. I am not sure this can be avoided. As a plus, the ranking now runs in a nogil block, so users running code in parallel may see a perf improvement.
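
To illustrate the argsort vs. lexsort difference described above, here is a minimal NumPy sketch. It is not the code from this PR: the helper name `sort_order_with_mask`, its `na_first` flag, and the placeholder fill value are all assumptions made for the example. It only shows how adding the mask as a second sort key lets missing values be placed first or last.

```python
import numpy as np


def sort_order_with_mask(values, mask, na_first):
    """Return indices that sort `values`, keeping masked entries together.

    Hypothetical helper for illustration only; not a pandas function.
    """
    # Fill masked slots with a placeholder so they compare consistently.
    filled = np.where(mask, 0.0, values)
    # np.lexsort treats the LAST key as the primary sort key.
    # Booleans sort False < True, so `~mask` as the primary key puts
    # masked entries first, while `mask` puts them last.
    primary = ~mask if na_first else mask
    return np.lexsort((filled, primary))


values = np.array([3.0, np.nan, 1.0, 2.0, np.nan])
mask = np.isnan(values)

# argsort sees only the data, so NaN always lands at the end.
order_argsort = np.argsort(values, kind="stable")

order_top = sort_order_with_mask(values, mask, na_first=True)
order_bottom = sort_order_with_mask(values, mask, na_first=False)

print(order_argsort.tolist())
print(order_top.tolist())
print(order_bottom.tolist())
```

With `na_first=True` the masked positions (1 and 4) come first, followed by the remaining values in ascending order. Sorting on two keys is extra work compared with a single argsort, which is consistent with the small slowdown reported here.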

Benchmarks:

  before       after        ratio
  [499ef8c0]   [d678bbf0]
  <master>     <ref/rank_2d_dedup>
  7.14±0.1ms   7.04±0.4ms    0.99  categoricals.Rank.time_rank_int
  7.85±0.3ms   7.34±0.2ms    0.93  categoricals.Rank.time_rank_int_cat
  7.30±0.2ms   6.95±0.6ms    0.95  categoricals.Rank.time_rank_int_cat_ordered
  121±3ms      122±10ms      1.01  categoricals.Rank.time_rank_string
  8.66±1ms     8.13±1ms      0.94  categoricals.Rank.time_rank_string_cat
  6.37±1ms     6.11±0.4ms    0.96  categoricals.Rank.time_rank_string_cat_ordered
  9.34±2ms     10.1±1ms      1.08  frame_methods.Rank.time_rank('float')
  3.81±1ms     4.16±0.4ms    1.09  frame_methods.Rank.time_rank('int')
  59.0±20ms    48.0±0.9ms   ~0.81  frame_methods.Rank.time_rank('object')
  2.75±0.4ms   3.53±0.7ms   ~1.28  frame_methods.Rank.time_rank('uint')
  1.06±0.04ms  1.01±0.04ms   0.95  groupby.RankWithTies.time_rank_ties('datetime64', 'average')
  1.13±0.06ms  1.02±0.04ms  ~0.90  groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
  1.19±0.1ms   1.04±0.03ms  ~0.87  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
  1.12±0.04ms  1.04±0.06ms   0.93  groupby.RankWithTies.time_rank_ties('datetime64', 'max')
  1.03±0.05ms  1.03±0.04ms   1.00  groupby.RankWithTies.time_rank_ties('datetime64', 'min')
  1.22±0.1ms   1.10±0.07ms  ~0.90  groupby.RankWithTies.time_rank_ties('float32', 'average')
  1.13±0.09ms  1.12±0.04ms   0.99  groupby.RankWithTies.time_rank_ties('float32', 'dense')
  1.15±0.1ms   1.05±0.05ms   0.91  groupby.RankWithTies.time_rank_ties('float32', 'first')
- 1.19±0.05ms  1.00±0.03ms   0.84  groupby.RankWithTies.time_rank_ties('float32', 'max')
  1.18±0.04ms  1.13±0.08ms   0.96  groupby.RankWithTies.time_rank_ties('float32', 'min')
  1.25±0.1ms   1.17±0.03ms   0.94  groupby.RankWithTies.time_rank_ties('float64', 'average')
  1.45±0.5ms   990±10μs     ~0.68  groupby.RankWithTies.time_rank_ties('float64', 'dense')
  1.21±0.2ms   1.11±0.08ms   0.92  groupby.RankWithTies.time_rank_ties('float64', 'first')
  1.18±0.05ms  1.15±0.07ms   0.97  groupby.RankWithTies.time_rank_ties('float64', 'max')
  1.32±0.3ms   963±40μs     ~0.73  groupby.RankWithTies.time_rank_ties('float64', 'min')
  1.03±0.1ms   971±80μs      0.94  groupby.RankWithTies.time_rank_ties('int64', 'average')
  1.02±0.06ms  1.02±0.06ms   1.01  groupby.RankWithTies.time_rank_ties('int64', 'dense')
  1.13±0.08ms  1.01±0.04ms  ~0.89  groupby.RankWithTies.time_rank_ties('int64', 'first')
  1.24±0.1ms   959±20μs     ~0.77  groupby.RankWithTies.time_rank_ties('int64', 'max')
  1.14±0.06ms  979±50μs     ~0.86  groupby.RankWithTies.time_rank_ties('int64', 'min')
  10.9±0.6ms   9.80±0.4ms   ~0.90  series_methods.Rank.time_rank('float')
  7.18±0.3ms   7.15±0.6ms    1.00  series_methods.Rank.time_rank('int')
  53.7±1ms     49.6±4ms      0.92  series_methods.Rank.time_rank('object')
  7.64±0.2ms   7.78±0.6ms    1.02  series_methods.Rank.time_rank('uint')
  9.96±0.2ms   11.7±0.3ms   ~1.17  stat_ops.Rank.time_average_old('DataFrame', False)
+ 9.89±0.2ms   11.9±0.6ms    1.20  stat_ops.Rank.time_average_old('DataFrame', True)
  13.5±0.9ms   12.6±0.8ms    0.93  stat_ops.Rank.time_average_old('Series', False)
  12.8±1ms     11.9±0.6ms    0.93  stat_ops.Rank.time_average_old('Series', True)
+ 9.84±0.3ms   11.7±0.4ms    1.19  stat_ops.Rank.time_rank('DataFrame', False)
+ 8.53±0.3ms   11.7±0.8ms    1.38  stat_ops.Rank.time_rank('DataFrame', True)
  14.0±3ms     12.2±1ms     ~0.87  stat_ops.Rank.time_rank('Series', False)
  13.0±0.9ms   12.1±1ms      0.93  stat_ops.Rank.time_rank('Series', True)

@mzeitlin11 added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, and Refactor (Internal refactoring of code) labels Jun 10, 2021
@jreback added this to the 1.4 milestone Jun 17, 2021
@jreback
Contributor

jreback commented Jun 17, 2021

lgtm. @jbrockmendel

@jbrockmendel
Member

LGTM; caveat: I'm not that familiar with the cases in #19560

@jreback jreback merged commit 7a38d63 into pandas-dev:master Jun 25, 2021
@jreback
Contributor

jreback commented Jun 25, 2021

thanks @mzeitlin11

@mzeitlin11 mzeitlin11 deleted the ref/rank_2d_dedup branch June 25, 2021 17:41
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Development

Successfully merging this pull request may close these issues.

Raise ValueError When Attempting to Rank Object Dtypes
