Skip to content

join and match_df don't match string to factor #128

Closed
wants to merge 2 commits into from

3 participants

@crowding
crowding commented Jan 5, 2013
dfF <- data.frame(character=c("Aeryn", "Jothee", "Jothee", "Chiana", "Scorpius", "Scorpius"),
                 species=c("Sebacian", "Luxan", "Sebacian", "Nibari", "Sebacian", "Scarran"))
dfS <- colwise(as.character)(dfF)

matchF <- data.frame(species="Sebacian", stringsAsFactors=TRUE)
matchS <- colwise(as.character)(matchF)

#"merge" matches strings to factors both directions
merge(dfF, matchS) #matches
merge(dfS, matchF) #matches

#as does '=='
dfF$species == matchS$species #matches
dfS$species == matchF$species #matches

#`match_df` doesn't match a string to a factor
match_df(dfF, matchF) #matches (despite having different level sets)
match_df(dfF, matchS) #matches string to factor
match_df(dfS, matchS) #matches string to string
match_df(dfS, matchF) #NO MATCHES for factor to string

#nor does `join`, (so inner joins are not commutative)
join(dfF, matchS, type="inner") #matches
join(dfS, matchF, type="inner") #NO MATCHES

I think this is a bug since match_df is supposed to match like == and we expect inner joins to be commutative up to an ordering.

@crowding crowding added a commit to crowding/plyr that referenced this pull request Mar 9, 2013
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 bf03866
@crowding crowding added a commit to crowding/plyr that referenced this pull request Mar 9, 2013
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 e734088
@crowding crowding added a commit to crowding/plyr that referenced this pull request Mar 9, 2013
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 da94c9c
@krlmlr krlmlr added a commit to krlmlr/plyr that referenced this pull request Jul 6, 2013
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 5dfb674
@krlmlr

Could you please explain the rationale of moving this code outside the if (!is.matrix...? Does it make sense for matrices at all? I thought that matrices are atomic so that all cells are from the same domain (i.e., factor or not factor). However, all tests still pass, so I'm puzzled.

@crowding
crowding commented Jul 6, 2013

A "matrix" is just a vector (any type; numeric, character or list) with a "dims" attribute of length 2. A factor is a vector with a class of "factor" (which should have a "levels" attribute). You can have objects that have both levels and dims.

Here's a factor-matrix and a character-matrix:

> fm <- rep(factor(c("foo","bar", "baz")), 2)
> dim(fm) <- c(2,3)
> fm
     [,1] [,2] [,3]
[1,] foo  baz  bar 
[2,] bar  foo  baz 
Levels: bar baz foo
> cm <- matrix(c("foo", "bar", "baz"), nrow=2, ncol=3, byrow=TRUE)
> cm
     [,1]  [,2]  [,3] 
[1,] "foo" "bar" "baz"
[2,] "foo" "bar" "baz"
> is.matrix(fm)
[1] TRUE
> is.matrix(cm)
[1] TRUE

Factor-ness and matrix-ness are logically independent, so the test for factors should not depend on the test for matrices.

This patch doesn't actually implement reasonable behavior on all combinations of inputs as it's outside the scope of the original bug) but cases that failed before, like rbind.fill combining a factor-matrix to a char-matrix, might fail more obviously.

@krlmlr
krlmlr commented Jul 6, 2013

Should we deal with the matrix case at all? If yes, there should be tests to support the behavior. Otherwise I suggest treating only the "data frame" case and handling everything inside if(!is.matrix.

I'm trying to implement a faster version of rbind.fill. Currently, your change is conflicting with mine. While I can cherry-pick the test you supplied and just make sure that my code passes it, it would be more elegant if our changes were orthogonal.

@hadley hadley closed this Jul 10, 2013
@krlmlr krlmlr added a commit to krlmlr/plyr that referenced this pull request Jul 16, 2013
@crowding crowding test for rbind.fill checks for factor-on-character (#128, currently f…
…ailing)

Conflicts:
	R/rbind-fill.r
8f28a6b
@crowding crowding deleted the crowding:match-factor branch Aug 24, 2013
@crowding

I think this got closed because the branch the previous PR was made against was deleted. Hadley, can you reopen?

Also, a question: when a$x is a factor and b$x is a character, is it preferable to:

  • make rbind.fill(a,b)$x a factor and rbind.fill(b,a)$x a character, or
  • make both characters?
@crowding crowding added a commit to crowding/plyr that referenced this pull request Aug 25, 2013
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 e95cced
@hadley
Owner
hadley commented Aug 26, 2013

Weird - I don't seem to be able to re-open either.

And I'd argue that if you ever get a mix of factor and character, then you should return a character.

@crowding crowding added a commit to crowding/plyr that referenced this pull request Jan 2, 2014
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 41af81a
@wibeasley wibeasley added a commit to wibeasley/plyr that referenced this pull request Jan 3, 2014
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 b28ee12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.