# PERF: groupby rank is slow when tie count is big #21237

opened this issue May 29, 2018 · 1 comment

Contributor

### peterpanmj commented May 29, 2018

#### Code Sample, a copy-pastable example if possible

```
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3] * 10000, "B": [1] * 30000})

In [31]: %%timeit
...: t = df.groupby("B").rank()

608 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [32]: %%timeit
...: t = df.A.rank()
1.27 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [33]: %%timeit
...: t = df.groupby("B").apply(pd.Series.rank)
...:
6.51 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

#### Problem description

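`groupby(...).rank()` is much slower than the ungrouped `rank()` when there are many ties. In the example above, column `A` holds only three distinct values across 30,000 rows, so each value is tied 10,000 times. The grouped rank takes 608 ms, against 1.27 ms for `df.A.rank()` and 6.51 ms for `df.groupby("B").apply(pd.Series.rank)`.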
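For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch using the standard-library `timeit` module. The helper names `grouped_rank` and `plain_rank` are made up for this example, and it selects column `A` explicitly rather than ranking the whole frame. Absolute timings will depend on the pandas version and the machine.

```python
import timeit

import pandas as pd

# Same shape as the report: 3 distinct values, each tied 10,000 times, one group.
df = pd.DataFrame({"A": [1, 2, 3] * 10000, "B": [1] * 30000})


def grouped_rank():
    # Hypothetical helper: rank column A within each group of B.
    return df.groupby("B")["A"].rank()


def plain_rank():
    # Hypothetical helper: rank column A without grouping.
    return df["A"].rank()


# With a single group, both paths should yield the same average ranks.
assert (grouped_rank().to_numpy() == plain_rank().to_numpy()).all()

for fn in (grouped_rank, plain_rank):
    # Report the best of 5 single runs, in milliseconds.
    best = min(timeit.repeat(fn, number=1, repeat=5))
    print(f"{fn.__name__}: {best * 1e3:.2f} ms")
```

The equality check guards against benchmarking two computations that do not actually agree.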

#### Expected Output

```
In [42]: df1 = pd.DataFrame({"A": np.random.rand(30000), "B": [1] * 30000})

In [44]: %%timeit
...: t = df1.groupby("B").apply(pd.Series.rank)
...:
10.1 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [46]: %%timeit
...: t = df1.groupby("B").rank()
...:
4.77 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Output of `pd.show_versions()`

Member

### WillAyd commented May 29, 2018

Not surprised by this as it is even called out in the comments of that function:

Line 524 in b2eec25:

`# this implementation is inefficient because it will`

Investigation and a PR for a more efficient implementation would certainly be welcome!
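As a starting point for such an investigation, here is a rough pure-Python sketch of one way to make tie handling cheap. It is not the pandas implementation, and the function name `grouped_average_rank` is hypothetical. It sorts positions by (group, value) once, then walks each run of equal values a single time and gives the whole run its mean rank, so after the sort the work is linear in the number of rows however large the tie blocks are. It assumes keys and values are sortable and contain no missing values.

```python
from itertools import groupby


def grouped_average_rank(keys, values):
    """Average rank of each value within its group, returned in input order."""
    # Sort row positions by (group key, value) so tied values become adjacent.
    order = sorted(range(len(values)), key=lambda i: (keys[i], values[i]))
    ranks = [0.0] * len(values)

    for _, group_iter in groupby(order, key=lambda i: keys[i]):
        positions = list(group_iter)
        n = len(positions)
        start = 0
        while start < n:
            # Extend `end` to the last position of the current tie block.
            end = start
            while end + 1 < n and values[positions[end + 1]] == values[positions[start]]:
                end += 1
            # The 1-based ranks start+1 .. end+1 are replaced by their mean.
            avg = (start + 1 + end + 1) / 2
            for p in positions[start:end + 1]:
                ranks[p] = avg
            start = end + 1
    return ranks


# Small example: two groups, with ties in each.
print(grouped_average_rank([1, 1, 1, 1, 2, 2], [10, 20, 10, 30, 5, 5]))
# prints [1.5, 3.0, 1.5, 4.0, 1.5, 1.5]
```

On the data from the report (`[1, 2, 3] * 10000` in a single group), this gives 5000.5, 15000.5 and 25000.5 for the three distinct values, which is what average ranking should produce.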

