inner_join character and factor #455

Closed
kismsu opened this issue Jun 9, 2014 · 10 comments
Labels: bug (an unexpected problem or unintended behavior)


@kismsu

kismsu commented Jun 9, 2014

I've noticed that if you join on a column that is character in one table and factor in the other, you get unstable results: some records match, some don't. Should the function return an error, or at least a warning, that the columns have different types?
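
The kind of check asked about here can be sketched in base R. This is only an illustration: `join_type_mismatches()` is a hypothetical helper, not a dplyr function, and it assumes `by` is an unnamed character vector of column names present in both tables.

```r
# Hypothetical helper (not part of dplyr): return the names of the join
# columns whose classes differ between the two tables.
join_type_mismatches <- function(x, y, by) {
  differs <- vapply(by, function(col) {
    !identical(class(x[[col]]), class(y[[col]]))
  }, logical(1))
  by[differs]
}

# One table keys on a factor, the other on a character vector.
foo <- data.frame(id = factor(c("a", "b")), var1 = "foo")
bar <- data.frame(id = c("a", "b"), var2 = "bar", stringsAsFactors = FALSE)

mismatched <- join_type_mismatches(foo, bar, by = "id")
if (length(mismatched) > 0) {
  message("Join columns with different types: ",
          paste(mismatched, collapse = ", "))
}
```

Returning the column names rather than raising a warning directly leaves it to the caller to decide whether a mismatch should be an error, a warning, or just a message.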

@romainfrancois
Member

Can you share a reproducible example, please?

@rickyars

Possibly related to issue #450? Here's an example:

library(dplyr)

foo <- data.frame(id = letters, var1 = "foo", stringsAsFactors=FALSE)
bar <- data.frame(id = rep(letters, 2), var2 = "bar")

This doesn't work:

tmp1 <- inner_join(foo, bar, by="id")
tmp2 <- inner_join(bar, foo, by="id")

However, using merge works just fine:

tmp3 <- merge(foo, bar, by="id")
tmp4 <- merge(bar, foo, by="id")

What's even weirder is what happens when you switch which table has the factor variable:

foo <- data.frame(id = letters, var1 = "foo")
bar <- data.frame(id = rep(letters, 2), var2 = "bar", stringsAsFactors=FALSE)

tmp1 <- inner_join(foo, bar, by="id")
tmp2 <- inner_join(bar, foo, by="id")

@rickyars

Here's an even smaller example:

foo <- data.frame(id = c("a", "b"), var1 = "foo")
bar <- data.frame(id = c("a", "b"), var2 = "bar", stringsAsFactors=FALSE)

tmp1 <- inner_join(foo, bar, by="id")
tmp2 <- inner_join(bar, foo, by="id")

foo <- data.frame(id = c("a", "b"), var1 = "foo", stringsAsFactors=FALSE)
bar <- data.frame(id = c("a", "b"), var2 = "bar")

tmp1 <- inner_join(foo, bar, by="id")
tmp2 <- inner_join(bar, foo, by="id")
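
A possible workaround, sketched here as an assumption rather than as the fix adopted in dplyr, is to convert the factor key to character before joining so that both key columns share a type. The example builds the factor with `factor()` explicitly, because `data.frame()` has defaulted to `stringsAsFactors = FALSE` since R 4.0.0 and the snippets above would no longer create a factor on a current R.

```r
library(dplyr)

# Build the factor key explicitly so the example does not depend on the
# data.frame() default for stringsAsFactors.
foo <- data.frame(id = factor(c("a", "b")), var1 = "foo",
                  stringsAsFactors = FALSE)
bar <- data.frame(id = c("a", "b"), var2 = "bar",
                  stringsAsFactors = FALSE)

# Workaround: coerce the factor key to character before joining.
foo$id <- as.character(foo$id)

tmp1 <- inner_join(foo, bar, by = "id")
tmp2 <- inner_join(bar, foo, by = "id")

# Both join orders should now return both rows.
nrow(tmp1)
nrow(tmp2)
```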

romainfrancois self-assigned this Jun 11, 2014
@kismsu
Author

kismsu commented Jun 11, 2014

Yep, the same

@hadley
Member

hadley commented Sep 12, 2014

And here's a test

test_that("inner_join is symmetric (even when joining on character & factor)", {
  foo <- data_frame(id = c("a", "b"), var1 = factor("foo"))
  bar <- data_frame(id = c("a", "b"), var2 = "bar")

  tmp1 <- inner_join(foo, bar, by="id")
  tmp2 <- inner_join(bar, foo, by="id")

  expect_is(tmp1$var1, "character")
  expect_is(tmp2$var1, "character")
  expect_equal(names(tmp1), c("id", "var1", "var2"))
  expect_equal(names(tmp2), c("id", "var2", "var1"))

  expect_equal(tmp1, tmp2)
})

@romainfrancois
Member

I don't get it. I think it is perfectly normal that:

> str(tmp1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   2 obs. of  3 variables:
 $ id  : chr  "a" "b"
 $ var1: Factor w/ 1 level "foo": 1 1
 $ var2: chr  "bar" "bar"
> str(tmp2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   2 obs. of  3 variables:
 $ id  : chr  "a" "b"
 $ var2: chr  "bar" "bar"
 $ var1: Factor w/ 1 level "foo": 1 1

Perhaps you meant:

foo <- data_frame(id = factor(c("a", "b")), var1 = "foo")
bar <- data_frame(id = c("a", "b"), var2 = "bar")

which indeed gives something wrong:

> str(tmp1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   2 obs. of  3 variables:
 $ id  : Factor w/ 2 levels "a","b": 1 2
 $ var1: chr  "foo" "foo"
 $ var2: chr  "bar" "bar"
> str(tmp2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  3 variables:
 $ id  : chr "a"
 $ var2: chr "bar"
 $ var1: chr "foo"

@hadley
Member

hadley commented Sep 16, 2014

@romainfrancois oh oops, yeah, I think I put factor around the wrong variable
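
With `factor()` moved onto the join key, the test might look like the sketch below. It is not necessarily the test that ended up in the dplyr test suite. It assumes a current dplyr and testthat, uses `tibble()` in place of the older `data_frame()`, and compares row counts and key values rather than full equality, since the two results have different column orders.

```r
library(dplyr)
library(testthat)

test_that("inner_join is symmetric when joining on character & factor", {
  # The factor is on the join key, which is the case that misbehaved.
  foo <- tibble(id = factor(c("a", "b")), var1 = "foo")
  bar <- tibble(id = c("a", "b"), var2 = "bar")

  tmp1 <- inner_join(foo, bar, by = "id")
  tmp2 <- inner_join(bar, foo, by = "id")

  # Both join orders should match every key.
  expect_equal(nrow(tmp1), 2)
  expect_equal(nrow(tmp2), 2)
  expect_equal(as.character(tmp1$id), as.character(tmp2$id))
})
```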

@romainfrancois
Member

I think it's OK now, at least according to the test I put in place here:
https://github.com/hadley/dplyr/blob/master/tests/testthat/test-joins.r#L195

@spymark

spymark commented Sep 22, 2014

Hi Romain, I think you have left a couple of data_frame() calls where you meant data.frame(). They're in the test you added (lines 196-197).

@romainfrancois
Member

That is intended. data_frame is much nicer to use.

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018