Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

join and match_df don't match string to factor #128

Closed
wants to merge 2 commits into
from

Conversation

Projects
None yet
3 participants
Contributor

crowding commented Mar 9, 2013

dfF <- data.frame(character=c("Aeryn", "Jothee", "Jothee", "Chiana", "Scorpius", "Scorpius"),
                 species=c("Sebacian", "Luxan", "Sebacian", "Nibari", "Sebacian", "Scarran"))
dfS <- colwise(as.character)(dfF)

matchF <- data.frame(species="Sebacian", stringsAsFactors=TRUE)
matchS <- colwise(as.character)(matchF)

#"merge" matches strings to factors both directions
merge(dfF, matchS) #matches
merge(dfS, matchF) #matches

#as does '=='
dfF$species == matchS$species #matches
dfS$species == matchF$species #matches

#`match_df` doesn't match a string to a factor
match_df(dfF, matchF) #matches (despite having different level sets)
match_df(dfF, matchS) #matches string to factor
match_df(dfS, matchS) #matches string to string
match_df(dfS, matchF) #NO MATCHES for factor to string

#nor does `join`, (so inner joins are not commutative)
join(dfF, matchS, type="inner") #matches
join(dfS, matchF, type="inner") #NO MATCHES

I think this is a bug since match_df is supposed to match like == and we expect inner joins to be commutative up to an ordering.

crowding added a commit to crowding/plyr that referenced this pull request Mar 9, 2013

crowding added a commit to crowding/plyr that referenced this pull request Mar 9, 2013

crowding added a commit to crowding/plyr that referenced this pull request Mar 9, 2013

krlmlr added a commit to krlmlr/plyr that referenced this pull request Jul 6, 2013

Could you please explain the rationale of moving this code outside the if (!is.matrix...? Does it make sense for matrices at all? I thought that matrices are atomic so that all cells are from the same domain (i.e., factor or not factor). However, all tests still pass, so I'm puzzled.

Contributor

crowding commented Jul 6, 2013

A "matrix" is just a vector (any type; numeric, character or list) with a "dims" attribute of length 2. A factor is a vector with a class of "factor" (which should have a "levels" attribute). You can have objects that have both levels and dims.

Here's a factor-matrix and a character-matrix:

> fm <- rep(factor(c("foo","bar", "baz")), 2)
> dim(fm) <- c(2,3)
> fm
     [,1] [,2] [,3]
[1,] foo  baz  bar 
[2,] bar  foo  baz 
Levels: bar baz foo
> cm <- matrix(c("foo", "bar", "baz"), nrow=2, ncol=3, byrow=TRUE)
> cm
     [,1]  [,2]  [,3] 
[1,] "foo" "bar" "baz"
[2,] "foo" "bar" "baz"
> is.matrix(fm)
[1] TRUE
> is.matrix(cm)
[1] TRUE

Factor-ness and matrix-ness are logically independent, so the test for factors should not depend on the test for matrices.

This patch doesn't actually implement reasonable behavior on all combinations of inputs as it's outside the scope of the original bug) but cases that failed before, like rbind.fill combining a factor-matrix to a char-matrix, might fail more obviously.

Contributor

krlmlr commented Jul 6, 2013

Should we deal with the matrix case at all? If yes, there should be tests to support the behavior. Otherwise I suggest treating only the "data frame" case and handling everything inside if(!is.matrix.

I'm trying to implement a faster version of rbind.fill. Currently, your change is conflicting with mine. While I can cherry-pick the test you supplied and just make sure that my code passes it, it would be more elegant if our changes were orthogonal.

@hadley hadley closed this Jul 10, 2013

krlmlr added a commit to krlmlr/plyr that referenced this pull request Jul 16, 2013

@crowding crowding deleted the crowding:match-factor branch Aug 24, 2013

Contributor

crowding commented Aug 24, 2013

I think this got closed because the branch the previous PR was made against was deleted. Hadley, can you reopen?

Also, a question: when a$x is a factor and b$x is a character, is it preferable to:

  • make rbind.fill(a,b)$x a factor and rbind.fill(b,a)$x a character, or
  • make both characters?

crowding added a commit to crowding/plyr that referenced this pull request Aug 25, 2013

Owner

hadley commented Aug 26, 2013

Weird - I don't seem to be able to re-open either.

And I'd argue that if you ever get a mix of factor and character, then you should return a character.

crowding added a commit to crowding/plyr that referenced this pull request Jan 2, 2014

wibeasley added a commit to wibeasley/plyr that referenced this pull request Jan 3, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment