Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make many header rows into column names #4

Closed
jebyrnes opened this issue Jun 11, 2020 · 3 comments
Closed

Make many header rows into column names #4

jebyrnes opened this issue Jun 11, 2020 · 3 comments

Comments

@jebyrnes
Copy link

Really like this package, but, I ran into a case I could not handle. Sometimes, there are many rows at the head of a data frame that contain information on column names. They can be ragged (e.g., some cells have entries, and some do not). I used this package as inspiration to write a function to deal with this in some data processing I'm doing, but realized, you might want it to be in the package! Rather than fork and pull, here's what I wrote - modify as you like, and I think it could find a happy home here. Or maybe I missed work you already did (hope not!)

#' Make many initial row headings into names
#'
#' @param adf A `data.frame` or `tibble` object whose top few rows are the 
#' new column names.
#' @param nrows How many rows at the top are to be used to create new column 
#' names.
#' @param sliding_headers Defaults to TRUE. Many data sets only enter 
#' part of a column name once in a single row, and then have several subrows 
#' beneath them. E.g., Row 1 has "Fertilized" in column 1 and "Unfertilized" 
#' in Row 1 Column 4. Row 2 then has the same two entries repeated twice, 
#' such as "Caged" and "Uncaged". We want "Fertilized_Caged", 
#' "Fertilized_Uncaged", "Unfertilized_Caged", "Unfertilized_Uncaged". 
#' This is not always the case, however, so, a TRUE/FALSE switch.
#' @param sep How should different cells be combined? Defaults to "_".
#'
#' @return The original data frame, but with new names and the 
#' top `nrows` rows removed
#' @export
#'
#' @examples
#' 
#' test_df <- tibble::tribble(
#' ~x1,    ~x2,    ~x3,    ~x4,       ~x5,
#' "",    "Yes",  "",     "No",       "",
#' "",    "We",   "Have", "Bananas",  "",
#' "Why",  "",    "",     "",         "Today",
#' "1",    "2",   "3",    "4",        "5"
#' )
#' 
#' print(test_df)
#' print(many_name_rows(test_df, 3))

many_name_rows <- function(adf, 
                           nrows, 
                           sliding_headers=TRUE,
                           sep = "_"){
  
  #get the top rows of the data frame that contain the column names
  new_names <- adf[seq_len(nrows),] %>%
    dplyr::mutate(ROW_ID = 1:nrows)
  
  #pivot the data longer for easy manipulation
  #and give a column number, as column names are often
  #not properly sortable later. Also, make "" into NA
  #for easier string substitution later
  new_names <- new_names %>%
    tidyr::pivot_longer(cols = -ROW_ID,
                        names_to = "cols",
                        values_to = "vals") %>%
    dplyr::mutate(vals = gsub("^$", NA, vals),
                  col_num = rep(seq_len(dplyr::n_distinct(cols)),
                                dplyr::n()/dplyr::n_distinct(cols)))
  
  #if some header entries are only put in once, but should be copied
  #across many columns, slide them down using tidyr::fill
  if(sliding_headers){
    new_names <- new_names %>%
      dplyr::group_by(cols) %>%
      tidyr::fill(vals, .direction = "down") %>%
      dplyr::ungroup()
  }
  
  #collapse the column names together into single elements
  new_names <- new_names %>%
    dplyr::group_by(col_num) %>%
    dplyr::summarize(vals = paste0(vals, collapse = sep),
                     .groups = "keep") %>%
    dplyr::mutate(vals = gsub("NA_", "", vals),
                  vals = gsub("_NA", "", vals) #end of row
    )
  
  #pivot wide using the correct column order
  new_names_wide <- new_names %>%
    tidyr::pivot_wider(names_from = col_num,
                       values_from = vals)
  
  #fix the names of the old data frame
  names(adf) <- new_names_wide[1,]
  
  #strip the first few rows of the data frame
  #and return
  adf[-seq_len(nrows),]
  
}

@jebyrnes
Copy link
Author

I realize I might have re-invented unbreak_rows a wee bit... but not entirely, as particularly the fill issue is different, and this method doesn't rely on character matching. Hope it helps! I couldn't get unbreak_rows to do what I needed, and this seemed similar.

@luisDVA
Copy link
Owner

luisDVA commented Jun 12, 2020

Thanks for the input and for letting me know that you use the package. I do realize that I tend to over-rely on regex matching so this looks very promising. Let me play around with your function and hopefully once I understand the workflow I can add it (with proper attribution). I've actually had to deal with these ragged headers as well.
cheers

@jebyrnes
Copy link
Author

I see it's in now! Fantastic!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants