factor values in pivot_longer.() are converted to character #202
Comments
Good catch on this one - honestly it's odd that
You would think that |
Smaller reprex:
pacman::p_load(tidytable, tidyverse, data.table)
test_df <- tidytable(a = factor("a"), b = factor("b"))
# tidyr
test_df %>%
pivot_longer(everything())
#> # A tibble: 2 x 2
#> name value
#> <chr> <fct>
#> 1 a a
#> 2 b b
# tidytable
test_df %>%
pivot_longer.()
#> # tidytable [2 × 2]
#> name value
#> <chr> <chr>
#> 1 a a
#> 2 b b
For comparison:
pacman::p_load(tidytable, tidyverse, data.table)
test_df <- tidytable(a = factor("a"), b = factor("b"))
# data.table
test_df %>%
melt(measure.vars = names(test_df), variable.factor = FALSE) %>%
as_tidytable()
#> # tidytable [2 × 2]
#> variable value
#> <chr> <chr>
#> 1 a a
#> 2 b b
I think, this issue exists, independent of the BTW, in this complex setting of reshaping from many columns to many columns (550 wide_df columns to reshape to about 140 long_df columns with varying number of rows 10,000-500,000), I observe that |
Yep, this issue only has to do with the In the case that values are factors you want
I think this option only seems to help with simpler pivoting where there is only one "value" column:
pacman::p_load(tidytable)
test_df <- map_dfc.(letters, ~ tidytable(!!.x := 1:50000))
bench::mark(
normal = pivot_longer.(test_df),
fast_pivot = pivot_longer.(test_df, fast_pivot = TRUE),
check = FALSE
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 normal 10.43ms 12.13ms 78.5 15.55MB 89.7
#> 2 fast_pivot 2.53ms 3.28ms 246. 9.95MB 121.
New reprex that pinpoints the issue:
pacman::p_load(tidytable)
fct_df <- tidytable(x = factor("a"), y = factor("b"))
chr_df <- tidytable(x = "a", y = "b")
dbl_df <- tidytable(x = 1, y = 2)
df_list <- list(fct_df, chr_df, dbl_df)
df_list %>%
map.(pivot_longer.)
#> [[1]]
#> # A tidytable: 2 × 2
#> name value
#> <chr> <chr>
#> 1 x a
#> 2 y b
#>
#> [[2]]
#> # A tidytable: 2 × 2
#> name value
#> <chr> <chr>
#> 1 x a
#> 2 y b
#>
#> [[3]]
#> # A tidytable: 2 × 2
#> name value
#> <chr> <dbl>
#> 1 x 1
#> 2 y 2
Looks like this only affects the cases where all of the
Solution - check if the columns are factor and adjust
Spot check of the time cost:
pacman::p_load(tidytable)
df_names <- c(letters, LETTERS, paste0(letters, LETTERS))
test_df <- df_names %>%
map_dfc.(~ tidytable(!!.x := sample(as.factor(letters), 1000000, TRUE)))
factor_check <- function(data, cols) {
all_names <- names(data)
# use the function argument rather than the global test_df
fct_flag <- map_lgl.(data, is.factor)
names(fct_flag) <- all_names
values_factor <- all(fct_flag[cols])
values_factor
}
bench::mark(
type_check = factor_check(test_df, df_names[1:50]),
iterations = 50
)
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 type_check 107µs 116µs 4273. 12.6KB 0
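The "check if the columns are factor and adjust" idea can be sketched directly against data.table. This is only an illustration, not tidytable's actual implementation: `melt_keep_factor()` is a hypothetical helper name, and the sketch assumes the simple case with a single "value" column, where `value.factor` in `data.table::melt()` can be switched on when every measured column is a factor.

```r
library(data.table)

# Hypothetical helper (not part of tidytable or data.table):
# melt all measure columns and keep the value column as a factor
# only when every measured column is a factor.
melt_keep_factor <- function(dt, measure_vars = names(dt)) {
  values_factor <- all(
    vapply(dt[, measure_vars, with = FALSE], is.factor, logical(1))
  )
  melt(
    dt,
    measure.vars = measure_vars,
    variable.name = "name",
    value.name = "value",
    variable.factor = FALSE,      # keep "name" as character
    value.factor = values_factor  # factor only if all inputs are factors
  )
}

fct_dt <- data.table(a = factor("a"), b = factor("b"))
chr_dt <- data.table(a = "a", b = "b")

fct_long <- melt_keep_factor(fct_dt)
chr_long <- melt_keep_factor(chr_dt)

is.factor(fct_long$value)
is.character(chr_long$value)
```

The type check only inspects column classes, so its cost should not depend on the number of rows, which is consistent with the microsecond timings in the spot check above.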
All set:
pacman::p_load(tidytable)
test_df <- tidytable(a = factor("a"), b = factor("b"))
test_df %>%
pivot_longer.()
#> # A tidytable: 2 × 2
#> name value
#> <chr> <fct>
#> 1 a a
#> 2 b b
Hi @markfairbanks, I have just installed your latest dev version from GitHub and the reprex on top still results in
#reprex tidytable pivot_longer.()
library(tidytable, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
#> Warning: package 'tidyr' was built under R version 4.0.4
#> Warning: package 'dplyr' was built under R version 4.0.4
set.seed(2048)
rows <- 100
ids <- 50 # simple data set with many different IDs (100 rows and 4 columns in this reprex)
df <- tibble(id = as.character(sample(1:ids, size = rows, replace = TRUE)), #using character variable as ID
bike = sample(c("mountain", "allround", "road", "bmx", NA_character_), size = rows, replace = TRUE),
year = sample(1980:2020, size = rows, replace = TRUE),
color = factor(sample(c("silver", "green", "blue", NA_character_), size = rows, replace = TRUE))) %>%
#calculate a chronological counter for bike per id
tidytable::arrange.(id, year) %>%
#calculate new renumbered variable group by case_id_var
tidytable::mutate.(nr_of_bike = as.integer(tidytable::row_number.()), .by = id)
# transpose to wide format by creating one line per id and repeating all vars nr_of_bike times. New vars have .nr as suffix
### get names from df to provide variables that need to be transposed to pivot function
trans_vars <- names(df)[!names(df) %in% c("id", "nr_of_bike")]
### perform pivot_wider
wide_df <- df %>%
tidytable::pivot_wider.(
names_from = "nr_of_bike",
values_from = tidyselect::all_of(trans_vars),
names_sep = "."
)
### show wide_df as result
wide_df
#> # A tidytable: 45 x 16
#> id bike.1 bike.2 bike.3 bike.4 bike.5 year.1 year.2 year.3 year.4 year.5
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
#> 1 1 road allrou~ mount~ <NA> <NA> 1984 1990 1993 NA NA
#> 2 10 road allrou~ <NA> <NA> <NA> 1982 1987 2009 NA NA
#> 3 11 road road bmx mount~ <NA> 1983 2000 2003 2006 NA
#> 4 12 allrou~ mounta~ bmx <NA> <NA> 1984 1988 2018 NA NA
#> 5 13 allrou~ <NA> <NA> <NA> <NA> 1990 NA NA NA NA
#> 6 14 allrou~ <NA> <NA> <NA> <NA> 1983 NA NA NA NA
#> 7 15 mounta~ <NA> <NA> <NA> <NA> 2004 NA NA NA NA
#> 8 16 bmx <NA> <NA> <NA> <NA> 1999 NA NA NA NA
#> 9 17 mounta~ road allro~ <NA> <NA> 2004 2006 2012 NA NA
#> 10 18 bmx <NA> <NA> <NA> <NA> 1983 NA NA NA NA
#> # ... with 35 more rows, and 5 more variables: color.1 <fct>, color.2 <fct>,
#> # color.3 <fct>, color.4 <fct>, color.5 <fct>
#pivot_wider.() preserves the original column type of color as a vector
### get variable names for pivot_longer (all variables that have a number suffix after the dot)
varying_vars <- colnames(wide_df) %>% stringr::str_subset(.,
paste0("\\.", "(?=[:digit:]$|(?=[:digit:](?=[:digit:]$))|(?=N(?=A$)))"))
constant_vars <- colnames(wide_df)[!colnames(wide_df) %in% c(varying_vars)]
### perform tidyr::pivot_longer()
wide_df %>% tidyr::pivot_longer(
-c(tidyselect::all_of(constant_vars)),
names_to = c(".value", "nr_of_bike"),
names_pattern = "(.*)\\.(.*)",
values_drop_na = TRUE
)
#> # A tibble: 100 x 5
#> id nr_of_bike bike year color
#> <chr> <chr> <chr> <int> <fct>
#> 1 1 1 road 1984 green
#> 2 1 2 allround 1990 blue
#> 3 1 3 mountain 1993 silver
#> 4 10 1 road 1982 green
#> 5 10 2 allround 1987 silver
#> 6 10 3 <NA> 2009 blue
#> 7 11 1 road 1983 <NA>
#> 8 11 2 road 2000 green
#> 9 11 3 bmx 2003 silver
#> 10 11 4 mountain 2006 blue
#> # ... with 90 more rows
#tidyr preserves column type of "color" when pivoting longer
### perform tidytable::pivot_longer.()
wide_df %>% tidytable::pivot_longer.(
-c(tidyselect::all_of(constant_vars)),
names_to = c(".value", "nr_of_bike"),
names_pattern = "(.*)\\.(.*)",
values_drop_na = TRUE,
fast_pivot = FALSE
) %>%
#sort by id and nr_of_bike
arrange.(id, nr_of_bike)
#> # A tidytable: 100 x 5
#> id nr_of_bike bike color year
#> <chr> <chr> <chr> <chr> <int>
#> 1 1 1 road green 1984
#> 2 1 2 allround blue 1990
#> 3 1 3 mountain silver 1993
#> 4 10 1 road green 1982
#> 5 10 2 allround silver 1987
#> 6 10 3 <NA> blue 2009
#> 7 11 1 road <NA> 1983
#> 8 11 2 road green 2000
#> 9 11 3 bmx silver 2003
#> 10 11 4 mountain blue 2006
#> # ... with 90 more rows
#tidytable changes column type of "color" when pivoting longer
Created on 2021-04-07 by the reprex package (v2.0.0)
Sorry about this, I should have tested back on the original dataset you sent. I see now what I missed. My initial thought is that this more complex case isn't solvable with data.table. data.table has a pretty simple option setting of So in the case above either all of "bike", "color", and "year" will return as factors, or it will return like it did above (where factor columns are converted to character).
Thanks for catching this, I'll take a look.
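Until the multi-column ".value" case is handled, one possible workaround is to restore the factor type after melting, using the levels of the original wide columns. The sketch below uses plain data.table on a small made-up table; the data, the column patterns, and the approach itself are assumptions for illustration, not the fix that was adopted in tidytable.

```r
library(data.table)

# Small made-up wide table: bike.* are character, color.* are factors
wide <- data.table(
  id      = c("1", "2"),
  bike.1  = c("road", "bmx"),
  bike.2  = c("bmx", NA),
  color.1 = factor(c("green", "blue")),
  color.2 = factor(c("blue", NA))
)

# Melt several value columns at once; value.factor = FALSE applies to
# all of them, so "bike" is not turned into a factor
long <- melt(
  wide,
  id.vars = "id",
  measure.vars = patterns("^bike\\.", "^color\\."),
  value.name = c("bike", "color"),
  variable.name = "nr_of_bike",
  value.factor = FALSE
)

# Collect the levels of the original factor columns and restore the type
color_cols <- grep("^color\\.", names(wide), value = TRUE)
color_levels <- unique(unlist(lapply(wide[, color_cols, with = FALSE], levels)))
long[, color := factor(as.character(color), levels = color_levels)]

str(long)
```

This keeps "bike" as character and "color" as a factor, at the cost of one extra pass over each restored column.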
First, great work on the implementation of .value in pivot_longer.(). Even with fast_pivot = FALSE, performance is great.
One last tiny improvement I would suggest is that the function should preserve the original column types. In the example below, the df$color column should be a factor. This is maintained by pivot_wider.() [so wide_df$color.1, wide_df$color.2, etc. are factors using the default setting]. But using pivot_longer.(), the df$color column is converted to character.
Also, I am not sure whether the fast_pivot option is working as intended, or maybe just the description is not correct. In the case below, where names_to = c(".value", "nr_of_bike"), fast_pivot = TRUE only makes "nr_of_bike" a factor, but not the ".value" columns.
For reference, I have also added the tidyr::pivot_longer() code to show that the tidyverse preserves column types with default settings.
Created on 2021-03-02 by the reprex package (v1.0.0)