Adding and removing labels should maybe change the class? #111

Closed
DanChaltiel opened this issue Nov 24, 2021 · 6 comments

Comments

@DanChaltiel

Hi Joseph,

Thanks for this great package, the R community really needed a standard to work with labels.

BTW, I'm the creator of the package crosstable which has a lot of label-related functions, and I wanted to export those functions to a specific package when I discovered yours. I hope one day I can depend on your package instead, to avoid code redundancy.

I'm facing a problem when using labelled in coordination with haven. In data read from SAS (I'm mostly using haven::read_xpt() and haven::read_sas()), columns have a labelled class. I think this makes a lot of sense because you can then design methods for variables instead of checking the attribute.

However, as haven will not implement vctrs methods for labelled variables for some reason (see tidyverse/haven#565), you can get some pesky errors when working with dplyr::mutate(), tidyr::pivot_longer(), and their friends. Removing labels with labelled::remove_labels() does not avoid the error, but removing the class (which crosstable::remove_labels() does) makes it go away.

Here is an example:

library(tidyverse)
Hmisc::label(iris$Sepal.Width) = "Width of sepal" #mimics the label attribute of haven::read_xpt()
class(iris$Sepal.Width)
#> [1] "labelled" "numeric"
iris %>% labelled::remove_labels() %>% 
  mutate(Sepal.Width=if_else(Sepal.Width>2, 2, Sepal.Width)) %>% head(1)
#> Error: Problem with `mutate()` column `Sepal.Width`.
#> i `Sepal.Width = if_else(Sepal.Width > 2, 2, Sepal.Width)`.
#> x `false` must have class `numeric`, not class `labelled/numeric`.
iris %>% crosstable::remove_labels() %>%
  mutate(Sepal.Width=if_else(Sepal.Width>2, 2, Sepal.Width)) %>% head(1)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1           2          1.4         0.2  setosa

Created on 2021-11-24 by the reprex package (v2.0.1)
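
The class removal described above can be sketched in base R. This is only a minimal illustration, not the actual implementation of crosstable or labelled: `strip_labelled_class()` is a hypothetical helper, and the `"label"` attribute and `"labelled"` class are set by hand to mimic the class vector shown in the reprex.

```r
# Hypothetical helper (not part of labelled, crosstable or Hmisc):
# drops the "label" attribute and the "labelled" class, if present.
strip_labelled_class <- function(x) {
  attr(x, "label") <- NULL
  if (inherits(x, "labelled")) {
    class(x) <- setdiff(class(x), "labelled")
  }
  x
}

# Build a vector that mimics the Hmisc-style result shown in the reprex.
x <- c(3.5, 3.0, 3.2)
attr(x, "label") <- "Width of sepal"
class(x) <- c("labelled", "numeric")

y <- strip_labelled_class(x)
class(y)          # the "labelled" entry is gone
attr(y, "label")  # the label attribute is gone as well
```

With the extra class removed, the vector is a plain numeric again, which is the state in which the `if_else()` call above stops complaining about mismatched classes.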

For more insight into what adding a class can bring, here are the print.labelled() functions from Hmisc and expss:

Code
library(tidyverse)

labelled::var_label(iris$Sepal.Length) = "Length of sepal"
iris$Sepal.Length %>% head()
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4
class(iris$Sepal.Length)
#> [1] "numeric"

Hmisc::label(iris$Sepal.Width) = "Width of sepal"
iris$Sepal.Width %>% head()
#> Width of sepal 
#> [1] 3.5 3.0 3.2 3.1 3.6 3.9
class(iris$Sepal.Width)
#> [1] "labelled" "numeric"

expss::var_lab(iris$Species) = "The specie"
#> Registered S3 methods overwritten by 'expss':
#>   method                 from 
#>   [.labelled             Hmisc
#>   as.data.frame.labelled base 
#>   print.labelled         Hmisc
iris$Species %>% head()
#> LABEL: The specie 
#> VALUES:
#> setosa, setosa, setosa, setosa, setosa, setosa
class(iris$Species)
#> [1] "labelled" "factor"

Created on 2021-11-24 by the reprex package (v2.0.1)

If that is OK with you, I think removing the class in remove_labels() is safe.

Adding a class in var_label() might require implementing vctrs methods (link, example) if you want to be thorough, but I think it would add significant value to your package.

@larmarange
Owner

Hi. Thanks for your feedback.

First of all, I have the feeling that there has been some confusion between different concepts. Hmisc and haven handle labelled data differently.

The class labelled is a class used by Hmisc and added to a vector when a variable label is added.

haven (and therefore labelled as well) attaches variable labels to a vector by adding a label attribute but does not change the class of the vector. So a numeric vector with a variable label does not have a labelled class.

haven has introduced a haven_labelled class (the name is different so as not to interfere with Hmisc) to handle value labels, i.e. numeric vectors with value labels attached to some values.

library(tidyverse)
library(labelled)
library(haven)

var_label(iris$Sepal.Width) <- "Width of sepal" 

# Adding a variable label does not change the class
class(iris$Sepal.Width)
#> [1] "numeric"

iris %>% 
  mutate(Sepal.Width=if_else(Sepal.Width>2, 2, Sepal.Width)) %>% head(1)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1           2          1.4         0.2  setosa

# When importing a SAS file with haven, no labelled class is added to the vectors
path <- system.file("examples", "iris.sas7bdat", package = "haven")
df <- read_sas(path)
df %>% look_for()
#>  pos variable     label col_type values
#>  1   Sepal_Length —     dbl            
#>  2   Sepal_Width  —     dbl            
#>  3   Petal_Length —     dbl            
#>  4   Petal_Width  —     dbl            
#>  5   Species      —     chr

Created on 2021-11-24 by the reprex package (v2.0.1)

@larmarange
Owner

The class labelled does not exist in the haven/labelled universe. It is a class used by Hmisc, and Hmisc has a different philosophy. The two are not designed to be used together.

@DanChaltiel
Author

DanChaltiel commented Nov 25, 2021

Oh, indeed, I messed up my code and became convinced that read_sas also changed the class, when in fact I got lost and did that myself using crosstable. Sorry for the confusion!

However, the question is still open to me: if vctrs methods were implemented, this would mean more features and it would still be compatible with haven.

Do you know why haven's philosophy is not to add the class? Maybe I should even remove it from crosstable...

@larmarange
Owner

As the labelled class is a feature implemented by Hmisc, the question of implementing vctrs methods for this particular class is a question for Hmisc.

Regarding haven, from my understanding (I cannot speak for haven's team), the main purpose of the package was to allow importing data from SAS, SPSS and Stata without losing information and in a consistent way (foreign is not consistent: the output format when importing a file from SAS or SPSS is not the same). This requires tools to handle value labels (introduced through the haven_labelled class, intended as an intermediary format before conversion into factors, numeric vectors or character vectors), tagged NAs (as in SAS and Stata) and user-defined missing values as in SPSS (through the haven_labelled_spss class). The names of the classes are prefixed with haven_ to avoid any conflict with Hmisc.

The labelled package was developed to provide tools for manipulating the features introduced by haven and therefore follows haven's data format.

Regarding variable labels (i.e. labels attached to a variable, not to specific values), when imported by haven, they are simply attached as an attribute of the vector and there is no specific class. Technically, while tagged NAs can be used only with double vectors and value labels only with numeric or character vectors, variable labels can be attached to any type of vector (including factor, date, date-time, logical, etc.). This information (the variable label) is just metadata attached to the vector: it does not change the nature of the vector, and therefore there is no need to change its class (the vector is still the same). Adding a class carries a high risk of creating incompatibilities with other packages / functions / methods, in particular for exotic types of vectors introduced by other packages. A new class could interfere with all statistical functions. When printing a vector, it is important to use the print method developed specifically for that type of vector.
In addition, in the tidyverse, users essentially manipulate tibbles / data.frames, not individual vectors.
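
The point that a variable label is only metadata can be illustrated with a short base-R sketch. It assumes, as stated earlier in this thread, that the label is stored in an attribute named `label`; the example vectors and the label text are made up for illustration.

```r
# A few vectors of different types (illustrative data only).
vectors <- list(
  number = c(1.5, 2.5),
  flag   = c(TRUE, FALSE),
  day    = as.Date(c("2021-11-24", "2021-11-25")),
  group  = factor(c("a", "b"))
)

classes_before <- lapply(vectors, class)

# Attach a "label" attribute to each vector, without touching its class.
labelled_vectors <- lapply(vectors, function(v) {
  attr(v, "label") <- "An illustrative variable label"
  v
})

classes_after <- lapply(labelled_vectors, class)

# The classes are unchanged, so existing methods keep dispatching as before.
identical(classes_before, classes_after)
mean(labelled_vectors$number)
levels(labelled_vectors$group)
```

Because no class is added, functions such as `mean()` or `levels()` see exactly the same kind of object as before the label was attached.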

So far, a request for a specific class is only for changing the printing of a vector in the console. For that purpose, the gain is too small compared to the added complexity. There is no real need to systematically print the variable label in the console, because there are other tools more appropriate, tools that can be easily used by users when they want to display the variable labels.

Variable labels are displayed in the RStudio viewer. With the labelled package, you can access the variable label with var_label(). If you want to display information about a variable or a group of variables, you can easily use look_for() (see https://larmarange.github.io/labelled/articles/look_for.html ). Alternatively, you can use questionr::describe() (see https://juba.github.io/questionr/reference/describe.html ).

To summarize, assigning a dedicated class to vectors having a variable label adds a lot of complexity and risks of incompatibility with other packages / functions. The interest seems minimal (just changing printing in the console) and we already have other tools to display variable labels (often more nicely) to users.
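
A short usage sketch of the two labelled functions mentioned here, assuming the labelled package is installed. The data (a copy of `iris`), the label text and the search keyword are illustrative choices, not taken from a real project.

```r
library(labelled)

df <- iris  # work on a copy so the built-in dataset is left untouched
var_label(df$Sepal.Length) <- "Length of sepal"

# Read the variable label back; the class of the vector is unchanged.
var_label(df$Sepal.Length)
class(df$Sepal.Length)

# Search variable names and labels for a keyword.
look_for(df, "sepal")
```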

@larmarange
Owner

For creating nicely formatted summary tables, it is relevant to use variable labels when available. However, I do not have the feeling that you need the Hmisc labelled class for that. The gtsummary package ( https://www.danieldsjoberg.com/gtsummary/ ), which also produces summary tables, uses variable labels without requiring Hmisc.

@DanChaltiel
Author

Thank you for this clear and thorough explanation.

Although I never use this package anymore, I learned about labels through Hmisc, hence my biased belief that a class was a useful feature.

You totally convinced me though: there is indeed very little interest in adding the labelled class. I'll remove this from crosstable.

DanChaltiel added a commit to DanChaltiel/crosstable that referenced this issue Nov 27, 2021