Passing `rename` identical column names #127

Closed
wibeasley opened this Issue Dec 20, 2012 · 6 comments

3 participants

@wibeasley

I really like how the updated rename function warns if you try to replace a column name that doesn't already exist (#46). Here's a related issue.

ds <- data.frame(C1=1:6, C2=runif(6), C3=rnorm(6))
ds <- plyr::rename(x=ds, replace=c(
  "C1"="A"
  ,"C2"="A"
  ,"C3"="A"
))
colnames(ds) #Produces: [1] "A" "A" "A"

Is this the desired behavior? If not, here are three alternatives you may want to consider:

  1. The function throws an error and doesn't proceed.
  2. The function automatically renames them A.1, A.2, and A.3.
  3. The function automatically renames them A, A.1, and A.2. This would be similar to what this code does: data.frame(A=1:6, A=runif(6), A=rnorm(6))

FWIW, This first option is my preference, because I think it would catch many unintentional errors. If someone is using the rename function, they're probably already be in a good position to make the A.1, A.2, A.3 rename themselves.

If the current version doesn't produce the intended behavior, I'm happy to help with this issue further (coding or otherwise).

@hadley
Owner

@wch what do you think?

@wch
Collaborator
wch commented Jan 2, 2014

How about giving a warning in cases like this, with an option like warn_dup = TRUE? I don't think the user should be totally prevented from giving two items the same name.

There are two distinct cases you might want to consider:

  1. The replace has two items with the same output name, like result = c(A="x", B="x")
  2. The data frame already has a column with the name "x", and you have result = c(A="x")

Note that the underlying code which does the renaming is in revalue. That function works on the contents of a vector, and in that case it definitely makes sense to allow both of operations above without any warnings.

@wibeasley

Cool. I'm happy to try this and issue a pull request. I like that the function won't try to guess what to rename it.

What do you think about

  • a parameter named error_duplicate with a default value of FALSE. If the clients says TRUE, they function won't proceed if 2+ columns have the same name..
  • a second parameter named warn_duplicate with a default value of TRUE.

Otherwise, this could be replaced by a single parameter that accepts values like silent, message, warning, and error. The specified event would be triggered if a naming conflict exists. Execution would stop only with error. The default value would be warning.

I'm leaning towards this single parameter interface. Do you guys have a preference?

@wibeasley

I just submitted a pull request (#186 )

@wch, I essentially deferred/ignored detecting scenario 2 until after the columns were renamed. I couldn't think of a reliable way to detect that if the user decided to swap names legally, which your mapvalues() function seems to handle without problems.

If you want me to do something different with the catching or warning behavior, just tell me.

test_that("Single Swapping (shouldn't cause problems)", {
  #Notice how a becomes c, while c becomes f.
  x <- list(a = 1, b = 2, c = 3)
  replace_list <- c("c" = "f", "b" = "e", "a" = "c")
  expected_value <- list(c = 1, e = 2, f = 3)
  actual_value <- rename(x=x, replace=replace_list)
  expect_identical(actual_value, expected=expected_value)
})
test_that("Double Swapping (shouldn't cause problems)", {
  #Notice how a becomes c, while c becomes a.
  x <- list(a = 1, b = 2, c = 3)
  replace_list <- c("c" = "a", "b" = "z", "a" = "c")
  expected_value <- list(c = 1, z = 2, a = 3)
  actual_value <- rename(x=x, replace=replace_list)
  expect_identical(actual_value, expected=expected_value)
})

https://github.com/wibeasley/plyr/blob/ba04eecc3b7cb5b5bf84f096841628c376545abd/inst/tests/test-rename.r#L105-L120

@wibeasley

@wch, you stated in #194 (comment)

I think that it would be preferable to just have an argument like warn_dup=TRUE -- to me it seems that the simplicity of doing it that way outweighs the utility of specifying silent/message/warning/error, and it parallels the warn_missing argument.

I think there are scenarios where it's preferable to allow the user to use more defensive programming than just a warning. For instance I'm on a team of three statisticians with a live clinical-ish dataset with 2,000+ variables, and many needs to be renamed. It's difficult keeping track of the names in our three collective heads. And as you can imagine, colliding names are difficult to catch later in the analysis & reporting phase. When collisions occur, I think a drastic, attention-grabbing error is the desired behavior. I don't even want to present the chance of it being silently overlooked.

On another project, the code that calls plyr::rename() is at least three layers below the Rnw/Rmd files. Because knitr is producing a public document, the warnings are hidden, so the trivial warnings aren't distracting (ie, Warning: Removed 11 rows containing missing values (geom_path).) The naming collision seems serious enough to warrant an error, especially since the warning won't even bubble up to the surface of the report.

@wibeasley wibeasley added a commit to wibeasley/plyr that referenced this issue Mar 31, 2015
@wibeasley wibeasley warns if duplicates in rename
Related to #127 and #194
b71090f
@wibeasley wibeasley added a commit to wibeasley/plyr that referenced this issue Apr 4, 2015
@wibeasley wibeasley warns if duplicates in rename
Related to #127 and #194
5088acb
@wibeasley

Closed by #251. Thanks guys.

@wibeasley wibeasley closed this Apr 14, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment