New split function - a multidimensional generalization of current split family functions #203
Hi @w-wojtek, thanks for reaching out and for digging into how splitting works in rtables. I must admit to being a bit confused. From what I can see this is a straightforward application of the The function (copied from #109 ) is:
With that function, however, you'd simply do
I have added Please let me know if you have use cases beyond those covered by this function |
Hello @gmbecker, thank you very much for your prompt answer. My case is that the split must remain at the Please let me know if you would like to get more details about it. |
@gmbecker just want to add some examples. Using two split variables and

d <- data.frame(
  V1 = factor(c("V1L1", "V1L1", "V1L2", "V1L2", "V1L2"), levels = c("V1L1", "V1L2")),
  V2 = factor(c("V2L1", "V2L2", "V2L3", "V2L4", "V2L5"), levels = c("V2L1", "V2L2", "V2L3", "V2L4", "V2L5"))
)
cfun <- function(x, labelstr) {
  denom <- length(x)
  num <- length(unique(x))
  in_rows(
    c(num, denom),
    .formats = "xx / xx",
    .labels = sprintf("%s", labelstr)
  )
}
basic_table() %>%
  split_rows_by("V1") %>%
  summarize_row_groups(format = "xx") %>%
  split_rows_by("V2", split_fun = trim_levels_by_map(innervar = "V2", outervar = "V1", map = d)) %>%
  summarize_row_groups("V2", cfun = cfun) %>%
  build_table(d)

We want the denominators to be based on outer variable (V1 in this case) but due to the second row split it's not possible.

Using the function proposed by @w-wojtek this works:

afun <- function(x) {
  denom <- length(x)
  num <- table(unique(x))
  result <- lapply(num, `c`, denom)
  do.call(in_rows, lapply(result, rcell, format = "xx / xx"))
}
basic_table() %>%
  split_rows_by("V1", split_fun = keep_split_levels_of_vars(map = d)) %>%
  summarize_row_groups(format = "xx") %>%
  analyze("V2", afun = afun) %>%
  build_table(d)
|
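To make the denominator point in the comment above concrete, here is a minimal base R sketch (no rtables involved) of the numbers being asked for: counts per V2 level, with the denominator taken from the enclosing V1 group rather than from the V2 subgroup. The object name `outer_denoms` is made up for illustration, and the data frame is the `d` from the comment above.

```r
# Same data as in the comment above
d <- data.frame(
  V1 = factor(c("V1L1", "V1L1", "V1L2", "V1L2", "V1L2"), levels = c("V1L1", "V1L2")),
  V2 = factor(c("V2L1", "V2L2", "V2L3", "V2L4", "V2L5"),
              levels = c("V2L1", "V2L2", "V2L3", "V2L4", "V2L5"))
)

# For each V1 group: numerator = rows per observed V2 level,
# denominator = all rows in that V1 group (the "outer" denominator)
outer_denoms <- lapply(split(d$V2, d$V1), function(x) {
  denom <- length(x)
  num <- table(droplevels(x))
  setNames(sprintf("%d / %d", as.integer(num), denom), names(num))
})

outer_denoms$V1L1  # V2L1 = "1 / 2", V2L2 = "1 / 2"
outer_denoms$V1L2  # V2L3, V2L4, V2L5 = "1 / 3" each
```

If the data were split by V2 first, each piece would contain a single row and the denominator would collapse to 1, which is the problem the comment describes.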
Hi @anajens @w-wojtek, I do see now what you're getting at here. And upon further inspection it is conceptually a pretty straightforward extension of the existing

I'm considering ways of specifying the above special case within a map, in which case we wouldn't need two separate functions (factories) - possibly using NAs - but I'm not set on that yet. If that doesn't happen, I do think I'll want to change the name of this function, but naming things is (NP-)hard so we'll see if I can come up with anything more succinct.

That said, I am still a bit surprised that this isn't a data-preprocessing task, i.e. if the restriction on groupings/chains of levels of different variables is known at layouting time (a requirement for this approach), then I would have expected the data would simply be restricted to those data that meet those requirements before passing it to

Also, @w-wojtek your function is good work, but there's a trick (that I would not have expected you to know, mind you) that makes it much simpler, and will likely be useful to you again someday as a power-user of the framework, which is that you can call another splitting function directly within your custom splitting function and then massage the results from there. The beginning of your custom function is recreating the work done by the existing
which lets us do:
and get
Do please note, however, that with col-counts turned on the column count is |
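The code snippets from the comment above did not survive in this copy of the thread. As a rough illustration of the trick it describes (calling an existing split function inside a custom one and then adjusting what it returns), a sketch might look like the following. It is not the code originally posted: the factory name `keep_levels_then_filter` and its arguments `innervar` and `inner_keep` are hypothetical, and it assumes that the function returned by `keep_split_levels()` can be called with `(df, spl, vals, labels, trim)` and returns a list with a `datasplit` element, as the function code later in this thread also does.

```r
library(rtables)

# Hypothetical factory, not part of rtables
keep_levels_then_filter <- function(only, innervar, inner_keep) {
  function(df, spl, vals = NULL, labels = NULL, trim = FALSE) {
    # Let the existing split function do the work on the split variable
    base_fun <- keep_split_levels(only = only, reorder = TRUE)
    ret <- base_fun(df, spl, vals, labels, trim = trim)
    # Then massage each data subset: restrict and relevel the inner variable
    ret$datasplit <- lapply(ret$datasplit, function(part) {
      part <- part[as.character(part[[innervar]]) %in% inner_keep, , drop = FALSE]
      part[[innervar]] <- factor(as.character(part[[innervar]]), levels = inner_keep)
      part
    })
    ret
  }
}

d <- data.frame(
  V1 = factor(c("V1L1", "V1L1", "V1L2", "V1L2", "V1L2"), levels = c("V1L1", "V1L2")),
  V2 = factor(c("V2L1", "V2L2", "V2L3", "V2L4", "V2L5"),
              levels = c("V2L1", "V2L2", "V2L3", "V2L4", "V2L5"))
)

lyt <- basic_table() %>%
  split_rows_by("V1", split_fun = keep_levels_then_filter(
    only = "V1L1", innervar = "V2", inner_keep = c("V2L1", "V2L2")
  )) %>%
  analyze("V2")

tbl <- build_table(lyt, d)
tbl
```

The point of the pattern is that the custom function only has to express what is different from the built-in behavior, here the trimming of the inner variable.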
Thank you @gmbecker (@anajens ) for the comprehensive answer. I had looked at

About pre-processing, I may be missing something, but I do not see how this can be achieved at the pre-processing stage without using some esoteric and sophisticated tricks (probably even more complex than the option with the custom split function discussed here). You cannot do much before a final data split, as the levels of the inner variable to be kept in a split data portion depend on the levels of the split variable. Also, we do not want to add one more split, by the current inner variable (V2 in our example used here), as we would lose the grouping needed by the counting function at the next step.

Thank you very much indeed for sharing with me the details and further tricks about the implementation. I was thinking about using

Please let us know if you decide on the way you want to specify such an extension. |
I don't think keep_split_levels is the correct thing to conceptually abstract from here. It doesn't matter too much in practice but for the sake of clarity I'd say that this function generalizes
While it is somewhat non-trivial to write a conceptually similar |
FYI: I am switching to my old GitHub account and have deleted the previous one used in this discussion. |
Hi @gmbecker , @wwojciech has proposed a very nice new extension to the function above which in fact makes it even more general so that it covers the use case for which

Here's Wojciech's new design for this function:

The main idea is that in the metadata The SV is conditional on all the OVs.

Here's a practical example of this function in use.

map <- data.frame(
  LBCAT = c("CHEMISTRY", "CHEMISTRY", "CHEMISTRY", "IMMUNOLOGY"),
  PARAMCD = c("ALT", "CRP", "CRP", "IGA"),
  ANRIND = c("LOW", "LOW", "HIGH", "HIGH"),
  stringsAsFactors = FALSE
)

Our map has one outer variable LBCAT, the split variable is PARAMCD and we have only one inner variable ANRIND.

basic_table() %>%
  split_rows_by("LBCAT") %>%
  split_rows_by("PARAMCD", split_fun = trim_levels_by_map_new(map = map)) %>%
  analyze("ANRIND") %>%
  build_table(ex_adlb)

Result
Let's chat offline if you have any further questions. We can also provide any additional unit tests.

Function code:

trim_levels_in_groups_new <- function(map = NULL) {
  if (is.null(map) || any(sapply(map, class) != "character"))
    stop("No map dataframe was provided or not all of the columns are of type character.")
  myfun <- function(df, spl, vals = NULL, labels = NULL, trim = FALSE) {
    allvars <- colnames(map)
    splvar <- rtables:::spl_payload(spl)
    outvars <- allvars[-(which(allvars == splvar):length(allvars))]
    ## handle conditioning on outer variables (if any),
    ## subset the map depending on where we are in the split tree
    if (outvars != "") {
      outvars_levs <- unique(df[, outvars])[ , , drop = TRUE]
      if (!is.null(dim(outvars_levs)))
        stop("Outer variables were not specified correctly in the map.")
      outvars_filter <- paste(paste(outvars, paste0("'", outvars_levs, "'"), sep = " == "),
                              collapse = " & ")
      map <- subset(map, subset = eval(parse(text = outvars_filter)), select = setdiff(allvars, outvars))
      map <- unique(map)
    }
    ## handle split variable
    nondup <- !duplicated(map[[splvar]])
    ksl_fun <- keep_split_levels(only = map[[splvar]][nondup], reorder = TRUE)
    ret <- ksl_fun(df, spl, vals, labels, trim = trim)
    ## keep non-split (inner) variables levels
    ret$datasplit <- lapply(ret$values, function(splvar_lev) {
      df3 <- ret$datasplit[[splvar_lev]]
      # loop through inner variables
      for (iv in setdiff(colnames(map), splvar)) {
        iv_lev <- df3[[iv]]
        levkeep <- as.character(na.omit(unique(map[map[, splvar] == splvar_lev, iv])))
        if (is.factor(iv_lev) && !all(levkeep %in% levels(iv_lev)))
          stop("Attempted to keep invalid factor level(s) in split ", setdiff(levkeep, levels(iv_lev)))
        df3 <- df3[iv_lev %in% levkeep, , drop = FALSE]
        if (is.factor(iv_lev))
          df3[[iv]] <- factor(as.character(df3[[iv]]), levels = levkeep)
      }
      df3
    })
    names(ret$datasplit) <- ret$values
    ret
  }
  myfun
}
|
Thank you @anajens . |
Hi @gmbecker , Here are the requirements for the new split function along with some examples. As we discussed, the function

Requirements
Example 1: Sample dataset with 1 record per subject (single analysis var)

df_1 <- data.frame(
  arm = c("a", "b", "c", "a", "b", "c"),
  dcstud = factor(c("disc", "comp", "comp", "disc", "disc", "comp"),
                  levels = c("disc", "comp")),
  discont = factor(c("dropout", "none", "none", "adverse", "moving", "none"),
                   levels = c("none", "dropout", "moving", "notime", "adverse", "death", "ill", "reaction")),
  discont_group = factor(c("other", "none", "none", "safety", "other", "none"))
)
# Map column names must match the data, so the first column is dcstud
map_1 <- data.frame(
  dcstud = rep("disc", 4),
  discont_group = c("safety", "safety", "other", "other"),
  discont = c("adverse", "death", "dropout", "moving")
)
# Full layout
basic_table() %>%
  split_rows_by("dcstud") %>%
  split_rows_by("discont_group") %>%
  analyze("discont") %>%
  build_table(df_1)

Desired output:
Example 2: Long dataset with repeated sections (single analysis var)

df_2 <- data.frame(
  visit = factor(c(rep("W1", 4), rep("W2", 4)), levels = c("W1", "W2")),
  paramcd = factor(rep(c("ALT", "ALT", "IGA", "IGA"), 2), levels = c("ALT", "IGA")),
  avalc = factor(rep(c("A1", "A2", "I1", "I2"), 2), levels = c("A1", "A2", "A3", "I1", "I2"))
)
# Visit is not in map because we want all map sections repeated for all visits in data
map_2 <- data.frame(
  paramcd = c("ALT", "ALT", "ALT", "IGA", "IGA"),
  avalc = c("A1", "A2", "A3", "I1", "I2")
)
basic_table() %>%
  split_rows_by("visit") %>%
  split_rows_by("paramcd") %>%
  analyze("avalc") %>%
  build_table(df_2)

Desired output:
Example 3: Long dataset (single analysis var)

This is the example from @wwojciech's design proposal:

df_3 <- data.frame(
  LBCAT = factor(c("CHEMISTRY", "CHEMISTRY", "CHEMISTRY", "IMMUNOLOGY"), levels = c("CHEMISTRY", "IMMUNOLOGY")),
  PARAMCD = factor(c("ALT", "CRP", "CRP", "IGA"), levels = c("ALT", "CRP", "IGA")),
  ANRIND = factor(c("LOW", "LOW", "HIGH", "HIGH"), levels = c("LOW", "HIGH"))
)
map_3 <- data.frame(
  LBCAT = c("CHEMISTRY", "CHEMISTRY", "CHEMISTRY", "IMMUNOLOGY"),
  PARAMCD = c("ALT", "CRP", "CRP", "IGA"),
  ANRIND = c("LOW", "LOW", "HIGH", "HIGH")
)
basic_table() %>%
  split_rows_by("LBCAT") %>%
  split_rows_by("PARAMCD") %>%
  analyze("ANRIND") %>%
  build_table(df_3)

Desired table:
Example 4: Long dataset (multiple analysis vars)

df_4 <- data.frame(
  LBCAT = factor(c("CHEMISTRY", "CHEMISTRY", "CHEMISTRY", "IMMUNOLOGY"), levels = c("CHEMISTRY", "IMMUNOLOGY")),
  PARAMCD = factor(c("ALT", "CRP", "CRP", "IGA"), levels = c("ALT", "CRP", "IGA")),
  ANRIND = factor(c("LOW", "LOW", "HIGH", "HIGH"), levels = c("LOW", "HIGH")),
  BNRIND = factor(c("LOW", "LOW", "HIGH", "HIGH"), levels = c("LOW", "HIGH"))
)
map_4 <- data.frame(
  LBCAT = c("CHEMISTRY", "CHEMISTRY", "CHEMISTRY", "IMMUNOLOGY"),
  PARAMCD = c("ALT", "CRP", "CRP", "IGA"),
  ANRIND = c("LOW", "LOW", "HIGH", "HIGH"),
  BNRIND = c("LOW", "LOW", "HIGH", "HIGH")
)
basic_table() %>%
  split_rows_by("LBCAT") %>%
  split_rows_by("PARAMCD") %>%
  analyze(c("ANRIND", "BNRIND")) %>%
  build_table(df_4)

Desired output:
|
Hi @anajens I think, we could also add Thank you! |
I've just added (in gabe_tabletree_work only for now) experimental support for both analysis/content functions and split functions to accept a new This will allow us to write a much safer and less complex version(s) of the multi-level split function in plenty of time for the next release. Ugly "proof of life":
|
Hi @gmbecker Thank you for sharing this information. I would like to make sure I correctly understand your last post. I conclude from it that with this new update, we can use the new parameter Thank you |
Hi @wwojciech the plan is for us to refactor your function to use . |
@wwojciech @anajens I have added One thing to note is that tabulation will fail (with an informative message) if you get into a situation where there are no valid children according to the map. In practice this means that either: a) all upstream combinations of levels need to have at least one value in the map at the level you're splitting on, or We may someday be able to declare a "global splitting map" but that is not the case now, you'll need to call It is supported in both row-splitting and column splitting contexts. Please give it a try in |
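The commit log for this issue names the resulting function `trim_levels_to_map`. A minimal usage sketch, assuming the released `rtables` API (`trim_levels_to_map(map = ...)` with an all-character map) and reusing the Example 3 data from the requirements comment, could look like this. The object names `adlb_small` and `map_small` are made up for illustration.

```r
library(rtables)

# Example 3 data from the requirements comment (small stand-in for real lab data)
adlb_small <- data.frame(
  LBCAT = factor(c("CHEMISTRY", "CHEMISTRY", "CHEMISTRY", "IMMUNOLOGY"),
                 levels = c("CHEMISTRY", "IMMUNOLOGY")),
  PARAMCD = factor(c("ALT", "CRP", "CRP", "IGA"), levels = c("ALT", "CRP", "IGA")),
  ANRIND = factor(c("LOW", "LOW", "HIGH", "HIGH"), levels = c("LOW", "HIGH"))
)

# The map lists the allowed combinations; all columns must be character
map_small <- data.frame(
  LBCAT = c("CHEMISTRY", "CHEMISTRY", "CHEMISTRY", "IMMUNOLOGY"),
  PARAMCD = c("ALT", "CRP", "CRP", "IGA"),
  ANRIND = c("LOW", "LOW", "HIGH", "HIGH"),
  stringsAsFactors = FALSE
)

lyt <- basic_table() %>%
  split_rows_by("LBCAT") %>%
  # the split function is supplied at the split where trimming should happen
  split_rows_by("PARAMCD", split_fun = trim_levels_to_map(map = map_small)) %>%
  analyze("ANRIND")

tbl <- build_table(lyt, adlb_small)
tbl
```

Every upstream LBCAT level has at least one PARAMCD entry in the map here, which is the condition the comment above says must hold for tabulation to succeed.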
Thank you very much @gmbecker! Looking forward to trying it out and will have some feedback for you next week. |
* Exp allow a/cfuns + splfuns to accept .prev_splvals arg. #203 dev vbump
* Fix bug where names weren't showing up for .prev_splvals. #203 dev vbump
* Fix off-by-one error in pagination, sep in txt export. Fixes #213
* add experimental fnotes_at_path function. Needs tests. #219. vbump
* Run GH actions for all branches
* Working fntes_at_path with tests. Closes #219. dev vbump
* col ref footnote support. related to #219. Closes #187. dev vbump
* Support and tests for trim_levels_to_map. closes #203. Devel vbump.
* cell_values and value_at methods for Row objects. closes #210. dev vbump
* Trim outer levels to trim_levels_in_groups by deflt. #236 dev vbump
* Cleanup, additional tests, and fix bugs uncovered by new tests.
* Add NEWS entries, prepare for merge into main

Co-authored-by: dinakar29 <26552821+dinakar29@users.noreply.github.com>
Hi @gmbecker ( @danielinteractive - FYI )

The issue I am raising here is quite similar to the recent #109 (trim_levels_by_map), but it is more generic.

Sometimes, when splitting rows with split_rows_by based on a certain variable, we may want to keep/drop/remove the levels of not only the split variable, but also the levels of other variables in the data split. As an example of the 'keep' scenario, consider a simple data set defined below

Let's say I split rows of d with split_rows_by(V1). Now, after the split I would like to keep the following levels only:
* V1L1 for V1 and V2L1, V2L2 for V2 for the first data split (rows 1-2)
* V1L2 for V1 and V2L3, V2L4 for V2 for the second data split (rows 3-4).

Similar examples can be constructed for drop/remove scenarios. So this is a kind of multidimensional generalization of the current split functions to variables other than just the split variable.

I wrote this below function for the 'keep' scenario, as an example. It is far from perfect but works well for at least some basic scenarios I was able to create.

I would be thankful if you could share your thoughts on this idea, especially in case we could achieve the same results with some existing rtables functionality, or if not, how feasible it would be to add this functionality into rtables.

Thank you.
W.
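The data set and function from this post are missing in this copy of the thread. Independently of rtables, the 'keep' scenario it describes can be sketched in base R as follows. This is not the function from the post: `trim_inner_levels` is a hypothetical helper, `d` is borrowed from a later comment in the thread, and `map` is an assumed encoding of the level pairs listed in the bullets above.

```r
# Hypothetical helper (not part of rtables): split by the outer variable, then
# keep only the inner-variable levels that the map pairs with each outer level
trim_inner_levels <- function(df, map, outervar, innervar) {
  pieces <- split(df, df[[outervar]])
  out <- lapply(names(pieces), function(lev) {
    piece <- pieces[[lev]]
    keep <- as.character(unique(map[[innervar]][map[[outervar]] == lev]))
    piece <- piece[as.character(piece[[innervar]]) %in% keep, , drop = FALSE]
    piece[[innervar]] <- factor(as.character(piece[[innervar]]), levels = keep)
    piece
  })
  names(out) <- names(pieces)
  out
}

d <- data.frame(
  V1 = factor(c("V1L1", "V1L1", "V1L2", "V1L2", "V1L2"), levels = c("V1L1", "V1L2")),
  V2 = factor(c("V2L1", "V2L2", "V2L3", "V2L4", "V2L5"),
              levels = c("V2L1", "V2L2", "V2L3", "V2L4", "V2L5"))
)
# Allowed (V1, V2) combinations, as listed in the post
map <- data.frame(
  V1 = c("V1L1", "V1L1", "V1L2", "V1L2"),
  V2 = c("V2L1", "V2L2", "V2L3", "V2L4"),
  stringsAsFactors = FALSE
)

res <- trim_inner_levels(d, map, outervar = "V1", innervar = "V2")
levels(res$V1L1$V2)  # "V2L1" "V2L2"
nrow(res$V1L2)       # 2 (the V2L5 row is dropped)
```

A split function in rtables has to do the same two things per data subset: drop rows outside the allowed combinations and relevel the inner factor so that unused levels do not appear as empty rows.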