Sample check failure with names including '-' symbol #69

clstacy · 2023-02-14T23:09:53Z

Description of the bug

Pipeline failing with input and matrix sample names including "-" character on CUSTOM_MATRIXFILTER step, error gives all names of samples with "-" ID. When I check the work folder for the process, the names in each of the files have had "-" changed to ".", but I think this might be being compared to the original names and failing?

If I change names with sed in input and matrix file to remove "-", the pipeline continues without error.

Command used and terminal output

nextflow run nf-core/differentialabundance \
	-profile singularity \
	-r dev \
	--input diffabundance/inputs/samplesheet.csv \
	--contrasts diffabundance/inputs/serCan_contrasts.csv \
	--matrix diffabundance/inputs/data/rsem.merged.transcript_counts.tsv \
	--gtf reference/serCan2020/serCan2020_no_genes.gtf \
        -w diffabundance/work \
	--features_id_col 'transcript_id' \
	--outdir diffabundance/results \
	--max_memory '160.GB' \
	--deseq2_cores 8 \
	-resume

# output:
executor >  local (2)
[32/9bd716] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1, cached: 1 ✔
[52/70c121] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1 ✔
[ed/3225f0] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1, failed: 1 ✘
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/differentialabundance] Pipeline completed with errors-
Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER ([id:study])'

Caused by:
  Process `NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER ([id:study])` terminated with an error exit status (1)

Command executed [/.nextflow/assets/nf-core/differentialabundance/./workflows/../modules/nf-core/custom/matrixfilter/templates/matrixfilter.R]:

  #!/usr/bin/env Rscript
  
  # Filter rows based on the number of columns passing the abundance threshold. By
  # default this will be any row with a value of 1 or more, which would be a
  # permissive threshold for RNA-seq data.
  #
  # In RNA-seq studies it's often not enough to just remove genes not expressed in
  # any sample. We also want to remove anything likely to be part of noise, or
  # which has sufficiently low expression that differential analysis would not be
  # useful. For that reason we might require a higher threshold than 1, and
  # require that more than one sample passes.
  #
  # Often we want to know that a gene is expressed in a substantial enough number
  # of sample that differential analysis worthwhile, so we may pick a threshold
  # sample number related to group size. Note that we do not filter with an
  # awareness of the groups themselves, since this adds bias towards discovery
  # between those groups.
  
  ################################################
  ################################################
  ## Functions                                  ##
  ################################################
  ################################################
  
  #' Parse out options from a string without recourse to optparse
  #'
  #' @param x Long-form argument list like --opt1 val1 --opt2 val2
  #' @param opt_defaults A lis with default argument values
  #'
  #' @return named list of options and values similar to optparse
  
  parse_args <- function(x, opt_defaults){
      args_list <- unlist(strsplit(x, ' ?--')[[1]])[-1]
      args_vals <- unlist(lapply(args_list, function(y) strsplit(y, ' +')), recursive = FALSE)
  
      # Ensure the option vectors are length 2 (key/ value) to catch empty ones
      args_vals <- lapply(args_vals, function(z){ length(z) <- 2; z})
  
      parsed_args <- structure(lapply(args_vals, function(x) x[2]), names = lapply(args_vals, function(x) x[1]))
      parsed_args[! is.na(parsed_args)]
  
      # Now apply CLI options to override defaults
  
      opt_types <- lapply(opt_defaults, class)
  
      for ( ao in names(parsed_args)){
          if (! ao %in% names(opt_defaults)){
              stop(paste("Invalid option:", ao))
          }else{
  
              # Preserve classes from defaults where possible
              if (! is.null(opt_defaults[[ao]])){
                  parsed_args[[ao]] <- as(parsed_args[[ao]], opt_types[[ao]])
              }
              opt_defaults[[ao]] <- parsed_args[[ao]]
          }
      }
      opt_defaults
  }
  
  #' Flexibly read CSV or TSV files
  #'
  #' @param file Input file
  #' @param header Passed to read.delim()
  #' @param row.names Passed to read.delim()
  #' @param nrows Passed to read.delim()
  #'
  #' @return output Data frame
  
  read_delim_flexible <- function(file, header = TRUE, row.names = NULL, nrows = -1 ){
  
      ext <- tolower(tail(strsplit(basename(file), split = "\\.")[[1]], 1))
  
      if (ext == "tsv" || ext == "txt") {
          separator <- "\t"
      } else if (ext == "csv") {
          separator <- ","
      } else {
          stop(paste("Unknown separator for", ext))
      }
  
      read.delim(
          file,
          sep = separator,
          header = header,
          row.names = row.names
      )
  }
  
  # Set up default options
  
  opt <- list(
      abundance_matrix_file = 'rsem.merged.transcript_counts.assay.tsv',
      sample_file = 'samplesheet.sample_metadata.tsv',
      minimum_abundance = 1,
      minimum_samples = 1,
      minimum_proportion = 0,
      grouping_variable = NULL
  )
  
  opt <- parse_args('--minimum_samples 1 --minimum_abundance 1', opt)
  
  abundance_matrix <- read_delim_flexible(opt$abundance_matrix_file, row.names = 1)
  feature_id_name <- colnames( read_delim_flexible(opt$abundance_matrix_file, nrows = 1)[1])
  
  # If a sample sheet was specified, validate the matrix against it
  
  if (opt$sample_file != ''){
  
      # Read the sample sheet and check against matrix
  
      samplesheet <- read_delim_flexible(opt$sample_file, row.names = 1)
      missing_samples <- setdiff(rownames(samplesheet), colnames(abundance_matrix))
  
      if (length(missing_samples) > 0){
          stop(
              paste(
                  paste(missing_samples, collapse = ', '),
                  'not represented in supplied abundance matrix'
              )
          )
      }else{
          abundance_matrix <- abundance_matrix[,rownames(samplesheet)]
      }
  }else{
  
      # If we're not using a sample sheet to select columns, then at least make
      # sure the ones we have are numeric (some upstream things like the RNA-seq
      # workflow have annotation colummns as well)
  
      numeric_columns <- unlist(lapply(1:ncol(abundance_matrix), function(x) is.numeric(abundance_matrix[,x])))
      abundance_matrix <- abundance_matrix[,numeric_columns]
  }
  
  # If we want to define n based on the levels of a grouping variable...
  
  if ((opt$sample_file != '') && ( ! is.null(opt$grouping_variable))){
  
      # Pick a minimum number of samples to pass threshold based on group size
  
      if (! opt$grouping_variable %in% colnames(samplesheet)){
          stop(paste(opt$grouping_variable, "not in supplied sample sheet"))
      }else{
          opt$minimum_samples <- min(table(samplesheet[[opt$grouping_variable]]))
          if ( opt$minimum_proportion > 0){
              opt$minimum_samples <- opt$minimum_samples * opt$minimum_proportion
          }
      }
  }else if (opt$minimum_proportion > 0){
  
      # Or if we want to define it based on a static proportion of the sample number
  
      opt$minimum_samples <- ncol(abundance_matrix) * opt$minimum_proportion
  }
  
  # Generate a boolean vector specifying the features to retain
  
  keep <- apply(abundance_matrix, 1, function(x){
      sum(x > opt$minimum_abundance) >= opt$minimum_samples
  })
  
  # Write out the matrix retaining the specified rows and re-prepending the
  # column with the feature identifiers
  
  prefix = ifelse('study' == 'null', '', 'study')
  
  write.table(
      data.frame(rownames(abundance_matrix)[keep], abundance_matrix[keep,,drop = FALSE]),
      file = paste0(
          prefix,
          '.filtered.tsv'
      ),
      col.names = c(feature_id_name, colnames(abundance_matrix)),
      row.names = FALSE,
      sep = '	',
      quote = FALSE
  )
  
  ################################################
  ################################################
  ## R SESSION INFO                             ##
  ################################################
  ################################################
  
  sink("R_sessionInfo.log")
  print(sessionInfo())
  sink()
  
  ################################################
  ################################################
  ## VERSIONS FILE                              ##
  ################################################
  ################################################
  
  r.version <- strsplit(version[['version.string']], ' ')[[1]][3]
  
  writeLines(
      c(
          '"NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER":',
          paste('    r-base:', r.version)
      ),
  'versions.yml')
  
  ################################################
  ################################################
  ################################################
  ################################################

Command exit status:
  1

Command output:
  (empty)

Command error:
  Error: ES1_48_Control-lipid_0, ES2_48_Control-lipid_14, ES3_48_Control-lipid_21, ES4_164_Control-protein_0, ES5_164_Control-protein_14, ES6_164_Control-protein_21, ES7_SEM19-1354_Control-protein_14, ES8_SEM19-1354_Control-protein_21, ES9_172_Control-protein_0, ES10_172_Control-protein_14, ES11_172_Control-protein_21, ES12_209_MG-lipid_0, ES13_209_MG-lipid_14, ES14_209_MG-lipid_21, ES15_601_Control-protein_14, ES16_601_Control-protein_21, ES17_615_Control-protein_0, ES18_615_Control-protein_14, ES19_615_Control-protein_21, ES20_618_Control-lipid_0, ES21_618_Control-lipid_14, ES22_618_Control-lipid_21, ES23_Blue-010_Control-lipid_0, ES24_Blue-010_Control-lipid_21, ES25_619_MG-lipid_0, ES26_619_MG-lipid_14, ES27_619_MG-lipid_21, ES103_625_Control-lipid_0, ES104_625_Control-lipid_14, ES105_625_Control-lipid_21, ES31_Blue-055_MG-protein_0, ES32_Blue-055_MG-protein_14, ES33_101-LS_MG-lipid_0, ES34_101-LS_MG-lipid_14, ES35_101-LS_MG-lipid_21, ES36_AR17-21_MG-lipid_0, ES37_AR17-21_MG-lipid_14,
  Execution halted

Work dir:
  diffabundance/work/ed/3225f0361d750970c0c0439ac3407c

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

Relevant files

command_files.tgz

System information

Nextflow version 22.10.1
Hardware HPC
Executor slurm
Container engine: Singularity
OS Linux
Version of nf-core/differentialabundance: dev

The text was updated successfully, but these errors were encountered:

pinin4fjords · 2023-02-15T08:51:53Z

Thanks for reporting, will probably just need check.names = FALSE somewhere, I'll fix soon.

pinin4fjords · 2023-02-17T16:56:56Z

I believe #60 will fix this for the specific module. It may also come up in later modules though, so more testing will be required.

pinin4fjords · 2023-02-21T09:30:36Z

I believe this should be addressed by recent PRs. Please reopen if you encounter similar issues with the current dev version.

clstacy added the bug Something isn't working label Feb 14, 2023

pinin4fjords mentioned this issue Feb 20, 2023

Add palette options, bring full check.names fixes from upstream #71

Merged

9 tasks

pinin4fjords linked a pull request Feb 20, 2023 that will close this issue

Add palette options, bring full check.names fixes from upstream #71

Merged

9 tasks

pinin4fjords closed this as completed Feb 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sample check failure with names including '-' symbol #69

Sample check failure with names including '-' symbol #69

clstacy commented Feb 14, 2023 •

edited

Loading

pinin4fjords commented Feb 15, 2023

pinin4fjords commented Feb 17, 2023

pinin4fjords commented Feb 21, 2023

Sample check failure with names including '-' symbol #69

Sample check failure with names including '-' symbol #69

Comments

clstacy commented Feb 14, 2023 • edited Loading

Description of the bug

Command used and terminal output

Relevant files

System information

pinin4fjords commented Feb 15, 2023

pinin4fjords commented Feb 17, 2023

pinin4fjords commented Feb 21, 2023

clstacy commented Feb 14, 2023 •

edited

Loading