join and match_df don't match string to factor #128

Closed
wants to merge 2 commits into from

3 participants

@crowding
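library(plyr)  # provides colwise(), match_df() and join()

# stringsAsFactors is set explicitly: since R 4.0.0 data.frame() no longer
# converts strings to factors by default, and this report depends on factors.
dfF <- data.frame(character=c("Aeryn", "Jothee", "Jothee", "Chiana", "Scorpius", "Scorpius"),
                 species=c("Sebacian", "Luxan", "Sebacian", "Nibari", "Sebacian", "Scarran"),
                 stringsAsFactors=TRUE)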
dfS <- colwise(as.character)(dfF)

matchF <- data.frame(species="Sebacian", stringsAsFactors=TRUE)
matchS <- colwise(as.character)(matchF)

#"merge" matches strings to factors both directions
merge(dfF, matchS) #matches
merge(dfS, matchF) #matches

#as does '=='
dfF$species == matchS$species #matches
dfS$species == matchF$species #matches

#`match_df` doesn't match string data against a factor key
match_df(dfF, matchF) #matches factor to factor (despite having different level sets)
match_df(dfF, matchS) #matches factor data against a string key
match_df(dfS, matchS) #matches string to string
match_df(dfS, matchF) #NO MATCHES for string data against a factor key

#nor does `join`, (so inner joins are not commutative)
join(dfF, matchS, type="inner") #matches
join(dfS, matchF, type="inner") #NO MATCHES

I think this is a bug: match_df is supposed to match the way == does, and we expect inner joins to be commutative up to ordering.
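
For comparison, base R's own matching treats a factor as its character labels, so the result does not depend on which side holds the factor. The following is a minimal sketch using base R only (no plyr); the toy data frames `left` and `right` are made up for illustration and are not part of the report.

```r
# Base R only: match() and merge() compare factors by their labels.
left  <- data.frame(species = c("Sebacian", "Luxan", "Sebacian"),
                    stringsAsFactors = TRUE)   # factor column
right <- data.frame(species = "Sebacian",
                    stringsAsFactors = FALSE)  # character column

# match() converts factors to character before comparing.
f_in_s <- match(left$species, right$species)   # factor looked up in strings
s_in_f <- match(right$species, left$species)   # string looked up in factor

# merge() returns the same number of rows whichever side holds the factor.
n_fs <- nrow(merge(left, right))
n_sf <- nrow(merge(right, left))
```

This symmetric behaviour is what the report expects from `match_df` and `join` as well.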

@crowding crowding referenced this pull request from a commit in crowding/plyr
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 bf03866
@crowding crowding referenced this pull request from a commit in crowding/plyr
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 e734088
@crowding crowding referenced this pull request from a commit in crowding/plyr
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 da94c9c
@krlmlr krlmlr referenced this pull request from a commit in krlmlr/plyr
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 5dfb674
@krlmlr

Could you please explain the rationale of moving this code outside the if (!is.matrix...? Does it make sense for matrices at all? I thought that matrices are atomic so that all cells are from the same domain (i.e., factor or not factor). However, all tests still pass, so I'm puzzled.

@crowding

A "matrix" is just a vector (of any type: numeric, character, or list) with a "dim" attribute of length 2. A factor is a vector with a class of "factor" (which should also have a "levels" attribute). You can have objects that have both levels and dim.

Here's a factor-matrix and a character-matrix:

> fm <- rep(factor(c("foo","bar", "baz")), 2)
> dim(fm) <- c(2,3)
> fm
     [,1] [,2] [,3]
[1,] foo  baz  bar 
[2,] bar  foo  baz 
Levels: bar baz foo
> cm <- matrix(c("foo", "bar", "baz"), nrow=2, ncol=3, byrow=TRUE)
> cm
     [,1]  [,2]  [,3] 
[1,] "foo" "bar" "baz"
[2,] "foo" "bar" "baz"
> is.matrix(fm)
[1] TRUE
> is.matrix(cm)
[1] TRUE

Factor-ness and matrix-ness are logically independent, so the test for factors should not depend on the test for matrices.
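
The independence of the two properties can be checked directly. This is a small sketch in base R; the variable names `is_f`, `is_m` and `attr_names` are only for illustration.

```r
# A factor given a dim attribute is both a factor and a matrix.
fm <- factor(c("foo", "bar", "baz", "foo", "bar", "baz"))
dim(fm) <- c(2, 3)

is_f <- is.factor(fm)                # TRUE because the class is "factor"
is_m <- is.matrix(fm)                # TRUE because dim has length 2
attr_names <- names(attributes(fm))  # includes both "levels" and "dim"

# A plain character matrix is a matrix but not a factor.
cm <- matrix(c("foo", "bar", "baz"), nrow = 2, ncol = 3, byrow = TRUE)
cm_is_f <- is.factor(cm)
cm_is_m <- is.matrix(cm)
```

Since `is.factor()` looks at the class and `is.matrix()` looks at `dim`, neither test implies anything about the other.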

This patch doesn't actually implement reasonable behavior on all combinations of inputs (that's outside the scope of the original bug), but cases that failed before, like rbind.fill combining a factor-matrix with a char-matrix, might fail more obviously.

@krlmlr

Should we deal with the matrix case at all? If yes, there should be tests to support the behavior. Otherwise I suggest treating only the "data frame" case and handling everything inside if(!is.matrix.

I'm trying to implement a faster version of rbind.fill. Currently, your change is conflicting with mine. While I can cherry-pick the test you supplied and just make sure that my code passes it, it would be more elegant if our changes were orthogonal.

@hadley hadley closed this
@krlmlr krlmlr referenced this pull request from a commit in krlmlr/plyr
@crowding crowding test for rbind.fill checks for factor-on-character (#128, currently failing)

Conflicts:
	R/rbind-fill.r
8f28a6b
@crowding crowding deleted the branch
@crowding

I think this got closed because the branch the previous PR was made against was deleted. Hadley, can you reopen?

Also, a question: when a$x is a factor and b$x is a character, is it preferable to:

  • make rbind.fill(a,b)$x a factor and rbind.fill(b,a)$x a character, or
  • make both characters?
@crowding crowding referenced this pull request from a commit in crowding/plyr
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 e95cced
@hadley
Owner

Weird - I don't seem to be able to re-open either.

And I'd argue that if you ever get a mix of factor and character, then you should return a character.
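
The "always return a character" rule can be sketched as a pair of symmetric checks, similar in spirit to the two `if` statements in the patch. The helper name `unify_factor_character` is hypothetical and is not part of plyr; this is an illustration of the rule, not the package's implementation.

```r
# Hypothetical helper (not part of plyr): when one column is a factor and
# the other is character, coerce the factor so that both are character.
unify_factor_character <- function(x, y) {
  if (is.factor(x) && is.character(y)) x <- as.character(x)
  if (is.factor(y) && is.character(x)) y <- as.character(y)
  list(x = x, y = y)
}

a_x <- factor(c("Sebacian", "Luxan"))  # factor column, as in a$x
b_x <- c("Nibari", "Scarran")          # character column, as in b$x

ab <- unify_factor_character(a_x, b_x)
ba <- unify_factor_character(b_x, a_x)

# Both orders now combine to a character vector.
combined_ab <- c(ab$x, ab$y)
combined_ba <- c(ba$x, ba$y)
```

With this rule the type of the combined column no longer depends on argument order; only the row order differs.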

@crowding crowding referenced this pull request from a commit in crowding/plyr
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 41af81a
@wibeasley wibeasley referenced this pull request from a commit in wibeasley/plyr
@crowding crowding rbind.fill checks for factor-on-character. Fixes #128 b28ee12
Commits on Mar 9, 2013
  1. @crowding
  2. @crowding

    update NEWS

    crowding authored
Showing with 31 additions and 4 deletions.
  1. +6 −1 NEWS
  2. +6 −3 R/rbind-fill.r
  3. +19 −0 inst/tests/test-join.r
7 NEWS
@@ -1,3 +1,8 @@
+* `join(x,y)` works when the key column in X is character and Y is
+ factor. Additionally `rbind.fill(x,y)` converts factor columns of Y
+ to character when columns of X are character. (Thanks to Peter
+ Meilstrup; #128)
+
* Fix faulty array allocation which caused problems when using `split_indices`
with large (> 2^24) vectors. (Fixes #131)
@@ -420,4 +425,4 @@ Version 0.1.1 (2008-10-08)
* argument names now start with . (instead of ending with it) - this should prevent name clashes with arguments of the called function
* return informative error if .fun is not a function
- * use full names in all internal calls to avoid argument name clashes
+ * use full names in all internal calls to avoid argument name clashes
9 R/rbind-fill.r
@@ -55,10 +55,13 @@ rbind.fill <- function(...) {
df <- dfs[[i]]
for(var in names(df)) {
+ if (is.factor(output[[var]]) && is.character(df[[var]])) {
+ output[[var]] <- as.character(output[[var]])
+ }
+ if (is.factor(df[[var]]) && is.character(output[[var]])) {
+ df[[var]] <- as.character(df[[var]])
+ }
if (!is.matrix(output[[var]])) {
- if (is.factor(output[[var]]) && is.character(df[[var]])) {
- output[[var]] <- as.character(output[[var]])
- }
output[[var]][rng] <- df[[var]]
} else {
output[[var]][rng, ] <- df[[var]]
19 inst/tests/test-join.r
@@ -164,3 +164,22 @@ test_that("column orders are common, x only, y only", {
expect_equal(names(right2), c("a", "b", "c"))
})
+
+test_that("strings match to factors", {
+
+ dfF <- data.frame(character = c("Aeryn", "Jothee", "Jothee",
+ "Chiana", "Scorpius", "Scorpius"),
+ species = c("Sebacian", "Luxan", "Sebacian",
+ "Nibari", "Sebacian", "Scarran"),
+ stringsAsFactors = TRUE)
+ dfS <- colwise(as.character)(dfF)
+ matchF <- data.frame(species = "Sebacian", stringsAsFactors = TRUE)
+ matchS <- colwise(as.character)(matchF)
+
+ #nor does `join`, (so inner joins are not commutative)
+ expect_equal(3, nrow(join(dfF, matchF, type = "inner", by="species")))
+ expect_equal(3, nrow(join(dfS, matchS, type = "inner", by="species")))
+ expect_equal(3, nrow(join(dfS, matchF, type = "inner", by="species")))
+ expect_equal(3, nrow(join(dfF, matchS, type = "inner", by="species")))
+
+})