Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample check failure with names including '-' symbol #69

Closed
clstacy opened this issue Feb 14, 2023 · 3 comments · Fixed by #71
Closed

Sample check failure with names including '-' symbol #69

clstacy opened this issue Feb 14, 2023 · 3 comments · Fixed by #71
Labels
bug Something isn't working

Comments

@clstacy
Copy link

clstacy commented Feb 14, 2023

Description of the bug

Pipeline failing with input and matrix sample names including "-" character on CUSTOM_MATRIXFILTER step, error gives all names of samples with "-" ID. When I check the work folder for the process, the names in each of the files have had "-" changed to ".", but I think this might be being compared to the original names and failing?

If I change names with sed in input and matrix file to remove "-", the pipeline continues without error.

Command used and terminal output

nextflow run nf-core/differentialabundance \
	-profile singularity \
	-r dev \
	--input diffabundance/inputs/samplesheet.csv \
	--contrasts diffabundance/inputs/serCan_contrasts.csv \
	--matrix diffabundance/inputs/data/rsem.merged.transcript_counts.tsv \
	--gtf reference/serCan2020/serCan2020_no_genes.gtf \
        -w diffabundance/work \
	--features_id_col 'transcript_id' \
	--outdir diffabundance/results \
	--max_memory '160.GB' \
	--deseq2_cores 8 \
	-resume

# output:
executor >  local (2)
[32/9bd716] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1, cached: 1 ✔
[52/70c121] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1 ✔
[ed/3225f0] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1, failed: 1 ✘
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/differentialabundance] Pipeline completed with errors-
Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER ([id:study])'

Caused by:
  Process `NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER ([id:study])` terminated with an error exit status (1)

Command executed [/.nextflow/assets/nf-core/differentialabundance/./workflows/../modules/nf-core/custom/matrixfilter/templates/matrixfilter.R]:

  #!/usr/bin/env Rscript
  
  # Filter rows based on the number of columns passing the abundance threshold. By
  # default this will be any row with a value of 1 or more, which would be a
  # permissive threshold for RNA-seq data.
  #
  # In RNA-seq studies it's often not enough to just remove genes not expressed in
  # any sample. We also want to remove anything likely to be part of noise, or
  # which has sufficiently low expression that differential analysis would not be
  # useful. For that reason we might require a higher threshold than 1, and
  # require that more than one sample passes.
  #
  # Often we want to know that a gene is expressed in a substantial enough number
  # of sample that differential analysis worthwhile, so we may pick a threshold
  # sample number related to group size. Note that we do not filter with an
  # awareness of the groups themselves, since this adds bias towards discovery
  # between those groups.
  
  ################################################
  ################################################
  ## Functions                                  ##
  ################################################
  ################################################
  
  #' Parse out options from a string without recourse to optparse
  #'
  #' @param x Long-form argument list like --opt1 val1 --opt2 val2
  #' @param opt_defaults A lis with default argument values
  #'
  #' @return named list of options and values similar to optparse
  
  parse_args <- function(x, opt_defaults){
      args_list <- unlist(strsplit(x, ' ?--')[[1]])[-1]
      args_vals <- unlist(lapply(args_list, function(y) strsplit(y, ' +')), recursive = FALSE)
  
      # Ensure the option vectors are length 2 (key/ value) to catch empty ones
      args_vals <- lapply(args_vals, function(z){ length(z) <- 2; z})
  
      parsed_args <- structure(lapply(args_vals, function(x) x[2]), names = lapply(args_vals, function(x) x[1]))
      parsed_args[! is.na(parsed_args)]
  
      # Now apply CLI options to override defaults
  
      opt_types <- lapply(opt_defaults, class)
  
      for ( ao in names(parsed_args)){
          if (! ao %in% names(opt_defaults)){
              stop(paste("Invalid option:", ao))
          }else{
  
              # Preserve classes from defaults where possible
              if (! is.null(opt_defaults[[ao]])){
                  parsed_args[[ao]] <- as(parsed_args[[ao]], opt_types[[ao]])
              }
              opt_defaults[[ao]] <- parsed_args[[ao]]
          }
      }
      opt_defaults
  }
  
  #' Flexibly read CSV or TSV files
  #'
  #' @param file Input file
  #' @param header Passed to read.delim()
  #' @param row.names Passed to read.delim()
  #' @param nrows Passed to read.delim()
  #'
  #' @return output Data frame
  
  read_delim_flexible <- function(file, header = TRUE, row.names = NULL, nrows = -1 ){
  
      ext <- tolower(tail(strsplit(basename(file), split = "\\.")[[1]], 1))
  
      if (ext == "tsv" || ext == "txt") {
          separator <- "\t"
      } else if (ext == "csv") {
          separator <- ","
      } else {
          stop(paste("Unknown separator for", ext))
      }
  
      read.delim(
          file,
          sep = separator,
          header = header,
          row.names = row.names
      )
  }
  
  # Set up default options
  
  opt <- list(
      abundance_matrix_file = 'rsem.merged.transcript_counts.assay.tsv',
      sample_file = 'samplesheet.sample_metadata.tsv',
      minimum_abundance = 1,
      minimum_samples = 1,
      minimum_proportion = 0,
      grouping_variable = NULL
  )
  
  opt <- parse_args('--minimum_samples 1 --minimum_abundance 1', opt)
  
  abundance_matrix <- read_delim_flexible(opt$abundance_matrix_file, row.names = 1)
  feature_id_name <- colnames( read_delim_flexible(opt$abundance_matrix_file, nrows = 1)[1])
  
  # If a sample sheet was specified, validate the matrix against it
  
  if (opt$sample_file != ''){
  
      # Read the sample sheet and check against matrix
  
      samplesheet <- read_delim_flexible(opt$sample_file, row.names = 1)
      missing_samples <- setdiff(rownames(samplesheet), colnames(abundance_matrix))
  
      if (length(missing_samples) > 0){
          stop(
              paste(
                  paste(missing_samples, collapse = ', '),
                  'not represented in supplied abundance matrix'
              )
          )
      }else{
          abundance_matrix <- abundance_matrix[,rownames(samplesheet)]
      }
  }else{
  
      # If we're not using a sample sheet to select columns, then at least make
      # sure the ones we have are numeric (some upstream things like the RNA-seq
      # workflow have annotation colummns as well)
  
      numeric_columns <- unlist(lapply(1:ncol(abundance_matrix), function(x) is.numeric(abundance_matrix[,x])))
      abundance_matrix <- abundance_matrix[,numeric_columns]
  }
  
  # If we want to define n based on the levels of a grouping variable...
  
  if ((opt$sample_file != '') && ( ! is.null(opt$grouping_variable))){
  
      # Pick a minimum number of samples to pass threshold based on group size
  
      if (! opt$grouping_variable %in% colnames(samplesheet)){
          stop(paste(opt$grouping_variable, "not in supplied sample sheet"))
      }else{
          opt$minimum_samples <- min(table(samplesheet[[opt$grouping_variable]]))
          if ( opt$minimum_proportion > 0){
              opt$minimum_samples <- opt$minimum_samples * opt$minimum_proportion
          }
      }
  }else if (opt$minimum_proportion > 0){
  
      # Or if we want to define it based on a static proportion of the sample number
  
      opt$minimum_samples <- ncol(abundance_matrix) * opt$minimum_proportion
  }
  
  # Generate a boolean vector specifying the features to retain
  
  keep <- apply(abundance_matrix, 1, function(x){
      sum(x > opt$minimum_abundance) >= opt$minimum_samples
  })
  
  # Write out the matrix retaining the specified rows and re-prepending the
  # column with the feature identifiers
  
  prefix = ifelse('study' == 'null', '', 'study')
  
  write.table(
      data.frame(rownames(abundance_matrix)[keep], abundance_matrix[keep,,drop = FALSE]),
      file = paste0(
          prefix,
          '.filtered.tsv'
      ),
      col.names = c(feature_id_name, colnames(abundance_matrix)),
      row.names = FALSE,
      sep = '	',
      quote = FALSE
  )
  
  ################################################
  ################################################
  ## R SESSION INFO                             ##
  ################################################
  ################################################
  
  sink("R_sessionInfo.log")
  print(sessionInfo())
  sink()
  
  ################################################
  ################################################
  ## VERSIONS FILE                              ##
  ################################################
  ################################################
  
  r.version <- strsplit(version[['version.string']], ' ')[[1]][3]
  
  writeLines(
      c(
          '"NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER":',
          paste('    r-base:', r.version)
      ),
  'versions.yml')
  
  ################################################
  ################################################
  ################################################
  ################################################

Command exit status:
  1

Command output:
  (empty)

Command error:
  Error: ES1_48_Control-lipid_0, ES2_48_Control-lipid_14, ES3_48_Control-lipid_21, ES4_164_Control-protein_0, ES5_164_Control-protein_14, ES6_164_Control-protein_21, ES7_SEM19-1354_Control-protein_14, ES8_SEM19-1354_Control-protein_21, ES9_172_Control-protein_0, ES10_172_Control-protein_14, ES11_172_Control-protein_21, ES12_209_MG-lipid_0, ES13_209_MG-lipid_14, ES14_209_MG-lipid_21, ES15_601_Control-protein_14, ES16_601_Control-protein_21, ES17_615_Control-protein_0, ES18_615_Control-protein_14, ES19_615_Control-protein_21, ES20_618_Control-lipid_0, ES21_618_Control-lipid_14, ES22_618_Control-lipid_21, ES23_Blue-010_Control-lipid_0, ES24_Blue-010_Control-lipid_21, ES25_619_MG-lipid_0, ES26_619_MG-lipid_14, ES27_619_MG-lipid_21, ES103_625_Control-lipid_0, ES104_625_Control-lipid_14, ES105_625_Control-lipid_21, ES31_Blue-055_MG-protein_0, ES32_Blue-055_MG-protein_14, ES33_101-LS_MG-lipid_0, ES34_101-LS_MG-lipid_14, ES35_101-LS_MG-lipid_21, ES36_AR17-21_MG-lipid_0, ES37_AR17-21_MG-lipid_14,
  Execution halted

Work dir:
  diffabundance/work/ed/3225f0361d750970c0c0439ac3407c

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

Relevant files

command_files.tgz

System information

Nextflow version 22.10.1
Hardware HPC
Executor slurm
Container engine: Singularity
OS Linux
Version of nf-core/differentialabundance: dev
@clstacy clstacy added the bug Something isn't working label Feb 14, 2023
@pinin4fjords
Copy link
Member

Thanks for reporting, will probably just need check.names = FALSE somewhere, I'll fix soon.

@pinin4fjords
Copy link
Member

I believe #60 will fix this for the specific module. It may also come up in later modules though, so more testing will be required.

@pinin4fjords
Copy link
Member

I believe this should be addressed by recent PRs. Please reopen if you encounter similar issues with the current dev version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants