Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR for Release 2.0.0 #105

Merged
merged 204 commits into from Jun 17, 2022
Merged

PR for Release 2.0.0 #105

merged 204 commits into from Jun 17, 2022

Conversation

fmalmeida
Copy link
Contributor

@fmalmeida fmalmeida commented Jun 9, 2022

Here as discussed with Alex, I open a PR from dev to master so we can have our last checkings and run full_test to make sure it works before first release.

We will probably still have to fix some tests that run on PRs to master. It seems to be running some tests that are not available or relevant anymore (e.g. -profile test_kallisto).

Use filtered gtf
@fmalmeida
Copy link
Contributor Author

fmalmeida commented Jun 13, 2022

The problem saying that genomic coordinates are not available in the DNA fasta is yet persisting using new files.

I was able to overcome this problem when I used the complete ("Genome sequence (GRCh38.p13)") fasta instead of the primary assembly. Both available at: https://www.gencodegenes.org/human/release_32.html

I've been able to test already:

  • Alevin
  • Star
  • Kallisto
  • Cellranger

@fmalmeida
Copy link
Contributor Author

Also, when I try it with singularity I get the following:

-[nf-core/scrnaseq] Pipeline completed with errors-
Error executing process > 'NFCORE_SCRNASEQ:SCRNASEQ:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet_2.0_full.csv)'

Caused by:
  Process `NFCORE_SCRNASEQ:SCRNASEQ:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet_2.0_full.csv)` terminated with an error exit status (255)

Command executed:

  check_samplesheet.py \
      samplesheet_2.0_full.csv \
      samplesheet.valid.csv
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SCRNASEQ:SCRNASEQ:INPUT_CHECK:SAMPLESHEET_CHECK":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  255

Command output:
  (empty)

Command error:
  INFO:    Converting SIF file to temporary sandbox...
  INFO:    Cleaning up image...
  FATAL:   container creation failed: failed to add  as session directory: path . is not an absolute path

Work dir:
  /opt/projects/1357_BIP/2022-09_gODMSinglecell/scratch.NOBACKUP/work/56/cb96a32ae75de25046f368e1fe5f10

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Any idea on what would be raising it?

@grst
Copy link
Member

grst commented Jun 14, 2022

I tested the workflow on one of my datasets (8 samples of unpublished 10x v3 data from mouse).

  • cellranger with precomputed reference ✔️
  • cellranger with gtf/fasta ✔️
  • kallisto with gtf/fasta ✔️
  • alevin with gtf/fasta ✔️
  • starsolo ❌

Starsolo fails with

Command output:
  Jun 14 11:04:54 ..... started STAR run
  Jun 14 11:04:56 ..... loading genome
  Jun 14 11:05:09 ..... processing annotations GTF
  Jun 14 11:05:26 ..... inserting junctions into the genome indices
  Jun 14 11:06:55 ..... started mapping

Command error:
  
  ReadAlignChunk_processChunks.cpp:202:processChunks EXITING because of FATAL ERROR in input reads: unknown file format: the read ID should start with @ or > 
  
  Jun 14 11:06:57 ...... FATAL ERROR, exiting

I suspect that it might not handle the gzipped fastq files (--readFilesCommand gzip -cdf is missing)? However how is it possible that the CI tests with compressed fastq files passes?

@grst
Copy link
Member

grst commented Jun 14, 2022

no idea about the singularity issue, but it seems like a problem with your system rather than the pipeline.
Are you able to run singularity shell docker://<path to image>?

@grst
Copy link
Member

grst commented Jun 14, 2022

Also (re-)discovered this issue:
#60

@grst
Copy link
Member

grst commented Jun 14, 2022

I suspect that it might not handle the gzipped fastq files (--readFilesCommand gzip -cdf is missing)? However how is it possible that the CI tests with compressed fastq files passes?

It seems that my custom configuration (though not for STAR_ALIGN) was overriding the default settings from modules.config. Possibly related to nextflow-io/nextflow#2422, but I don't think it's an issue with this pipeline.

@fmalmeida
Copy link
Contributor Author

no idea about the singularity issue, but it seems like a problem with your system rather than the pipeline. Are you able to run singularity shell docker://<path to image>?

Yeah, I am thinking that is the issue as well. I am not being able to start the image manually as well.

@fmalmeida
Copy link
Contributor Author

Hi @grst and @apeltzer,

Regarding the error you faced using your data with STARSOLO, I didn't face it using the test_full profile (see screenshot). Maybe something with your data?

Screenshot 2022-06-15 at 08 26 11

And now that this has passed as well, I was able to execute test_full with all aligners as shown in this comment but, to properly work we have to change the available genome in S3 Bucket as I was only able to do it when I change the genome fasta as discussed in the same comment.

Cheers.

@ggabernet
Copy link
Member

Hi, I can add this reference on the s3 bucket. Just following the docs by cellranger on the references though (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr) and they recommend using Ensembl references. Did you check if the Ensembl reference as they recommend works? If not, does using this gencode reference also produce comparable results to the provided cellranger indices?
Just wondering if this is the way to go...

@fmalmeida
Copy link
Contributor Author

fmalmeida commented Jun 15, 2022

Hi @ggabernet,
I haven't tried with Ensembl files but I can. I will try it will aligners again so we actually decide which reference files to keep.

But at least with these gencode I was able to test the pipeline in a full_size dataset and it actually succeed to run all 4 aligners, which is great news 🥳

As soon as I check with Ensembl I get back to you:

  • Alevin
  • Star
  • Kallisto
  • Cellranger

@grst
Copy link
Member

grst commented Jun 15, 2022

The problem with STARsolo and the compressed files was a problem with my configuration (and, ultimately, a bug in nextflow, that should be resolved soon).

However #60 is still an issue, I think.

removing repeated definitions of parameters
@apeltzer
Copy link
Member

Hi! Back from quick leave - can we align on which reference files to use for running the full tests? I was unable to use the ones that 10X said we should use in general, always getting weird failures no matter if I run them from what they provide nor if I create these with the script they provide (same type of error). (2020-A from here https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest) - @grst which ones did you use?

@fmalmeida
Copy link
Contributor Author

Hi Alex, I am using for now the Ensembl ones as Gisela commented to, and so far it is working.

@ggabernet
Copy link
Member

I'm also doing some tests at the moment, I am also uploading already those 2 to the s3 bucket (Ensembl fasta and GTF) and we can directly do the tests on s3. I think the reason why it was not working for you @apeltzer, might be that when using the mkgtf command as we are doing in the pipeline, then the modifications in the gtf file provided here might not be needed any more. Let's see.

@ggabernet
Copy link
Member

ggabernet commented Jun 16, 2022

Updates in this PR: #112

@apeltzer
Copy link
Member

Guess just #60 needs a bugfix, will try to find out what is wrong there and fix it

@grst
Copy link
Member

grst commented Jun 16, 2022

@grst which ones did you use?

FWIW, I tried with both

  • refdata-gex-mm10-2020-A (as downloaded from 10x website) and
  • GrCH38 primary assembly (fasta) with GENCODE v38 (gtf)

But it seems we have a working solution now anyway?

@fmalmeida
Copy link
Contributor Author

Hi @ggabernet,

As promised previously I am just updating that using the Ensembl files the pipeline ran for all tools.

:)

@ggabernet
Copy link
Member

yes, the tests passed on all tools as well on AWS with the Ensembl references 🎉

Copy link
Member

@apeltzer apeltzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good to me 🥳

Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit that I haven't checked all 110 files again, but I'd say we are good to go!
This version is definitely already better than the 1.1.0 release.

@apeltzer apeltzer merged commit 0bf83a8 into master Jun 17, 2022
@apeltzer
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet