You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Really like this package, but, I ran into a case I could not handle. Sometimes, there are many rows at the head of a data frame that contain information on column names. They can be ragged (e.g., some cells have entries, and some do not). I used this package as inspiration to write a function to deal with this in some data processing I'm doing, but realized, you might want it to be in the package! Rather than fork and pull, here's what I wrote - modify as you like, and I think it could find a happy home here. Or maybe I missed work you already did (hope not!)
#' Make many initial row headings into names
#'
#' @param adf A `data.frame` or `tibble` object whose top few rows are the
#' new column names.
#' @param nrows How many rows at the top are to be used to create new column
#' names.
#' @param sliding_headers Defaults to TRUE. Many data sets only enter
#' part of a column name once in a single row, and then have several subrows
#' beneath them. E.g., Row 1 has "Fertilized" in column 1 and "Unfertilized"
#' in Row 1 Column 4. Row 2 then has the same two entries repeated twice,
#' such as "Caged" and "Uncaged". We want "Fertilized_Caged",
#' "Fertilized_Uncaged", "Unfertilized_Caged", "Unfertilized_Uncaged".
#' This is not always the case, however, so, a TRUE/FALSE switch.
#' @param sep How should different cells be combined? Defaults to "_".
#'
#' @return The original data frame, but with new names and the
#' top `nrows` rows removed
#' @export
#'
#' @examples
#'
#' test_df <- tibble::tribble(
#' ~x1, ~x2, ~x3, ~x4, ~x5,
#' "", "Yes", "", "No", "",
#' "", "We", "Have", "Bananas", "",
#' "Why", "", "", "", "Today",
#' "1", "2", "3", "4", "5"
#' )
#'
#' print(test_df)
#' print(many_name_rows(test_df, 3))
many_name_rows <- function(adf,
nrows,
sliding_headers=TRUE,
sep = "_"){
#get the top rows of the data frame that contain the column names
new_names <- adf[seq_len(nrows),] %>%
dplyr::mutate(ROW_ID = 1:nrows)
#pivot the data longer for easy manipulation
#and give a column number, as column names are often
#not properly sortable later. Also, make "" into NA
#for easier string substitution later
new_names <- new_names %>%
tidyr::pivot_longer(cols = -ROW_ID,
names_to = "cols",
values_to = "vals") %>%
dplyr::mutate(vals = gsub("^$", NA, vals),
col_num = rep(seq_len(dplyr::n_distinct(cols)),
dplyr::n()/dplyr::n_distinct(cols)))
#if some header entries are only put in once, but should be copied
#across many columns, slide them down using tidyr::fill
if(sliding_headers){
new_names <- new_names %>%
dplyr::group_by(cols) %>%
tidyr::fill(vals, .direction = "down") %>%
dplyr::ungroup()
}
#collapse the column names together into single elements
new_names <- new_names %>%
dplyr::group_by(col_num) %>%
dplyr::summarize(vals = paste0(vals, collapse = sep),
.groups = "keep") %>%
dplyr::mutate(vals = gsub("NA_", "", vals),
vals = gsub("_NA", "", vals) #end of row
)
#pivot wide using the correct column order
new_names_wide <- new_names %>%
tidyr::pivot_wider(names_from = col_num,
values_from = vals)
#fix the names of the old data frame
names(adf) <- new_names_wide[1,]
#strip the first few rows of the data frame
#and return
adf[-seq_len(nrows),]
}
The text was updated successfully, but these errors were encountered:
I realize I might have re-invented unbreak_rows a wee bit... but not entirely, as particularly the fill issue is different, and this method doesn't rely on character matching. Hope it helps! I couldn't get unbreak_rows to do what I needed, and this seemed similar.
Thanks for the input and for letting me know that you use the package. I do realize that I tend to over-rely on regex matching so this looks very promising. Let me play around with your function and hopefully once I understand the workflow I can add it (with proper attribution). I've actually had to deal with these ragged headers as well.
cheers
Really like this package, but, I ran into a case I could not handle. Sometimes, there are many rows at the head of a data frame that contain information on column names. They can be ragged (e.g., some cells have entries, and some do not). I used this package as inspiration to write a function to deal with this in some data processing I'm doing, but realized, you might want it to be in the package! Rather than fork and pull, here's what I wrote - modify as you like, and I think it could find a happy home here. Or maybe I missed work you already did (hope not!)
The text was updated successfully, but these errors were encountered: