Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

min_rank() and row_number() returning very different results inside a filter() #313

Closed
justmarkham opened this issue Mar 11, 2014 · 6 comments
Assignees
Labels
bug an unexpected problem or unintended behavior
Milestone

Comments

@justmarkham
Copy link
Contributor

Here's an example from the window-functions.Rmd in the vignettes:

# getting set up:
library(dplyr)
library(Lahman)
batting <- select(tbl_df(Batting), playerID, yearID, teamID, G, AB:H) 
batting <- arrange(batting, playerID, yearID, teamID)
players <- group_by(batting, playerID)

The example presented is: "For each player, find the two years with most hits":

filter(players, min_rank(desc(H)) <= 2 & H > 0)

Below are the results. Notice that playerID "aaronto01" has 6 records, not 2:

Source: local data frame [32,724 x 7]
Groups: playerID

    playerID yearID teamID   G  AB   R   H
1  aaronha01   1966    ATL 158 603 117 168
2  aaronha01   1970    ATL 150 516 103 154
3  aaronto01   1962    ML1 141 334  54  77
4  aaronto01   1963    ML1  72 135   6  27
5  aaronto01   1965    ML1   8  16   1   3
6  aaronto01   1968    ATL  98 283  21  69
7  aaronto01   1969    ATL  49  60  13  15
8  aaronto01   1970    ATL  44  63   3  13
9   abadan01   2003    BOS   9  17   1   2
10  abadfe01   2012    HOU  37   7   0   1

Let's just look at him:

filter(players, playerID=="aaronto01") %.% arrange(desc(H))

Below are the results. Looks like the years with most hits are 1962 and 1968.

   playerID yearID teamID   G  AB  R  H
1 aaronto01   1962    ML1 141 334 54 77
2 aaronto01   1968    ATL  98 283 21 69
3 aaronto01   1963    ML1  72 135  6 27
4 aaronto01   1969    ATL  49  60 13 15
5 aaronto01   1970    ATL  44  63  3 13
6 aaronto01   1971    ATL  25  53  4 12
7 aaronto01   1965    ML1   8  16  1  3

Let's try row_number() instead of min_rank():

filter(players, row_number(desc(H)) <= 2 & H > 0)

These results look correct. Notice that the data frame has 23,854 rows instead of 32,724 rows:

Source: local data frame [23,854 x 7]
Groups: playerID

    playerID yearID teamID   G  AB   R   H
1  aaronha01   1959    ML1 154 629 116 223
2  aaronha01   1963    ML1 161 631 121 201
3  aaronto01   1962    ML1 141 334  54  77
4  aaronto01   1968    ATL  98 283  21  69
5   abadan01   2003    BOS   9  17   1   2
6   abadfe01   2012    HOU  37   7   0   1
7  abadijo01   1875    BR2   1   4   1   1
8  abadijo01   1875    PH3  11  45   3  10
9  abbated01   1904    BSN 154 579  76 148
10 abbated01   1905    BSN 153 610  70 170

Perhaps this is a bug with min_rank() or filter()? My apologies if the issue is simply that I'm misunderstanding the functions or the data.

Thanks!

@hadley
Copy link
Member

hadley commented Mar 11, 2014

I'd recommend using mutate() to add row_number() and min_rank() explicitly so you can see what's going - I suspect it's just a problem with ties.

@justmarkham
Copy link
Contributor Author

Good suggestion. However, I'm still getting the same strange result. Here's an example with aaronha01, using min_rank:

mutate(players, new=min_rank(desc(H))) %.% filter(playerID=="aaronha01") %.% arrange(new)

    playerID yearID teamID   G  AB   R   H new
1  aaronha01   1966    ATL 158 603 117 168   1
2  aaronha01   1970    ATL 150 516 103 154   2
3  aaronha01   1963    ML1 161 631 121 201   3
4  aaronha01   1964    ML1 145 570 103 187   4
5  aaronha01   1968    ATL 160 606  84 174   5
6  aaronha01   1965    ML1 150 570 109 181   6
7  aaronha01   1969    ATL 147 547 100 164   7
8  aaronha01   1962    ML1 156 592 127 191   8
9  aaronha01   1971    ATL 139 495  95 162   9
10 aaronha01   1974    ATL 112 340  47  91  10
11 aaronha01   1972    ATL 129 449  75 119  11
12 aaronha01   1975    ML4 137 465  45 109  12
13 aaronha01   1967    ATL 155 600 113 184  13
14 aaronha01   1973    ATL 120 392  84 118  14
15 aaronha01   1976    ML4  85 271  22  62  15
16 aaronha01   1961    ML1 155 603 115 197  16
17 aaronha01   1954    ML1 122 468  58 131  17
18 aaronha01   1955    ML1 153 602 105 189  17
19 aaronha01   1956    ML1 153 609 106 200  17
20 aaronha01   1957    ML1 151 615 118 198  17
21 aaronha01   1958    ML1 153 601 109 196  17
22 aaronha01   1959    ML1 154 629 116 223  17
23 aaronha01   1960    ML1 153 590 102 172  23

I can't make heads or tails of what is being ranked.

On the other hand, when I use row_number(), I get the expected result:

mutate(players, new=row_number(desc(H))) %.% filter(playerID=="aaronha01") %.% arrange(new)

    playerID yearID teamID   G  AB   R   H new
1  aaronha01   1959    ML1 154 629 116 223   1
2  aaronha01   1963    ML1 161 631 121 201   2
3  aaronha01   1956    ML1 153 609 106 200   3
4  aaronha01   1957    ML1 151 615 118 198   4
5  aaronha01   1961    ML1 155 603 115 197   5
6  aaronha01   1958    ML1 153 601 109 196   6
7  aaronha01   1962    ML1 156 592 127 191   7
8  aaronha01   1955    ML1 153 602 105 189   8
9  aaronha01   1964    ML1 145 570 103 187   9
10 aaronha01   1967    ATL 155 600 113 184  10
11 aaronha01   1965    ML1 150 570 109 181  11
12 aaronha01   1968    ATL 160 606  84 174  12
13 aaronha01   1960    ML1 153 590 102 172  13
14 aaronha01   1966    ATL 158 603 117 168  14
15 aaronha01   1969    ATL 147 547 100 164  15
16 aaronha01   1971    ATL 139 495  95 162  16
17 aaronha01   1970    ATL 150 516 103 154  17
18 aaronha01   1954    ML1 122 468  58 131  18
19 aaronha01   1972    ATL 129 449  75 119  19
20 aaronha01   1973    ATL 120 392  84 118  20
21 aaronha01   1975    ML4 137 465  45 109  21
22 aaronha01   1974    ATL 112 340  47  91  22
23 aaronha01   1976    ML4  85 271  22  62  23

For what it's worth, I also tried dense_rank(), cume_dist(), percent_rank(), and ntile(). dense_rank() produced the same result as min_rank, whereas all the others produced the same result as row_number().

@justmarkham
Copy link
Contributor Author

Update: I put together a simpler (but still large) fake dataset, and am still seeing the same strange behavior with min_rank():

https://gist.github.com/justmarkham/9500369

@hadley
Copy link
Member

hadley commented Mar 14, 2014

Thanks. Can you reproduce the problem with a smaller dataset?

@justmarkham
Copy link
Contributor Author

No problem. I made a dataset with 50 rows:
https://gist.github.com/justmarkham/9593453

I upgraded to 0.1.3 before running this, just in case.

@hadley hadley added the bug label Mar 17, 2014
@hadley hadley added this to the v0.2 milestone Mar 17, 2014
@romainfrancois romainfrancois self-assigned this Mar 26, 2014
romainfrancois added a commit that referenced this issue Mar 26, 2014
@romainfrancois
Copy link
Member

I think there was a relic of how grouped data used to be organised (when we used to group groups together). Should be fixed now. @justmarkham can you test the devel version on your to see if this behaves correctly. Even better if you can come up with some tests to add to the test suite.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants