Add Aspera CLI download support to pipeline #259

Merged: 10 commits, Jan 30, 2024
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -30,6 +30,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
- [PR #253](https://github.com/nf-core/fetchngs/pull/253) - Add implicit tags in nf-test files for simpler testing strategy
- [PR #257](https://github.com/nf-core/fetchngs/pull/257) - Template update for nf-core/tools v2.12
- [PR #258](https://github.com/nf-core/fetchngs/pull/258) - Fixes for [PR #253](https://github.com/nf-core/fetchngs/pull/253)
- [PR #259](https://github.com/nf-core/fetchngs/pull/259) - Add Aspera CLI download support to pipeline ([#68](https://github.com/nf-core/fetchngs/issues/68))

### Software dependencies

@@ -43,6 +44,16 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
>
> **NB:** Dependency has been **removed** if new version information isn't present.

### Parameters

| Old parameter | New parameter |
| ------------- | ---------------------- |
| | `--force_ftp_download` |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
> **NB:** Parameter has been **added** if just the new parameter information is present.
> **NB:** Parameter has been **removed** if new parameter information isn't present.

## [[1.11.0](https://github.com/nf-core/fetchngs/releases/tag/1.11.0)] - 2023-10-18

### Credits
2 changes: 2 additions & 0 deletions CITATIONS.md
@@ -10,6 +10,8 @@

## Pipeline tools

- [Aspera CLI](https://github.com/IBM/aspera-cli)

- [Python](http://www.python.org)

- [Requests](https://docs.python-requests.org/)
6 changes: 4 additions & 2 deletions README.md
@@ -66,8 +66,10 @@ Via a single file of ids, provided one-per-line (see [example input file](https:
1. Resolve database ids back to appropriate experiment-level ids and to be compatible with the [ENA API](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html)
2. Fetch extensive id metadata via ENA API
3. Download FastQ files:
-   - If direct download links are available from the ENA API, fetch in parallel via `curl` and perform `md5sum` check
-   - Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ
+   - If direct download links are available from the ENA API:
+     - Fetch in parallel via `aspera-cli` and perform `md5sum` check (default)
+     - Fetch in parallel via `wget` and perform `md5sum` check. Use `--force_ftp_download` to force this behaviour.
+   - Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ. Use `--force_sratools_download` to force this behaviour.
4. Collate id metadata and paths to FastQ files in a single samplesheet

### Synapse ids
4 changes: 2 additions & 2 deletions conf/test_full.config
@@ -14,6 +14,6 @@ params {
config_profile_name = 'Full test profile'
config_profile_description = 'Full test dataset to check pipeline function'

-    // Input data for full size test
-    input = 'https://raw.githubusercontent.com/nf-core/test-datasets/bb634bcfef520552e8314dfa3f8a764e1d62f7dc/testdata/v1.12.0/sra_ids_test_full.csv'
+    // File containing SRA ids from nf-core/rnaseq -profile test_full for full-sized test
+    input = 'https://raw.githubusercontent.com/nf-core/test-datasets/100736c99d87667fb7c247c267bc8acfac647bed/testdata/v1.12.0/sra_ids_rnaseq_test_full.csv'
}
27 changes: 27 additions & 0 deletions docs/usage.md
@@ -66,6 +66,33 @@ You can use the `--nf_core_pipeline` parameter to customise this behaviour e.g.

From v1.9 of this pipeline the default `strandedness` in the output samplesheet will be set to `auto` when using `--nf_core_pipeline rnaseq`. This will only work with v3.10 onwards of nf-core/rnaseq which permits the auto-detection of strandedness during the pipeline execution. You can change this behaviour with the `--nf_core_rnaseq_strandedness` parameter which is set to `auto` by default.

### Accessions with more than 2 FastQ files

Using `SRR9320616` as an example, if we run the pipeline with default options to download via Aspera/FTP, the ENA API indicates that this sample is associated with a single FastQ file:

```
run_accession experiment_accession sample_accession secondary_sample_accession study_accession secondary_study_accession submission_accession run_alias experiment_alias sample_alias study_alias library_layout library_selection library_source library_strategy library_name instrument_model instrument_platform base_count read_count tax_id scientific_name sample_title experiment_title study_title sample_description fastq_md5 fastq_bytes fastq_ftp fastq_galaxy fastq_aspera
SRR9320616 SRX6088086 SAMN12086751 SRS4989433 PRJNA549480 SRP201778 SRA900583 GSM3895942_r1 GSM3895942 GSM3895942 GSE132901 PAIRED cDNA TRANSCRIPTOMIC RNA-Seq Illumina HiSeq 2500 ILLUMINA 11857688850 120996825 10090 Mus musculus Old 3 Kidney Illumina HiSeq 2500 sequencing: GSM3895942: Old 3 Kidney Mus musculus RNA-Seq A murine aging cell atlas reveals cell identity and tissue-specific trajectories of aging Old 3 Kidney 98c939bbae1a1fcf9624905516485b67 7763114613 ftp.sra.ebi.ac.uk/vol1/fastq/SRR932/006/SRR9320616/SRR9320616.fastq.gz ftp.sra.ebi.ac.uk/vol1/fastq/SRR932/006/SRR9320616/SRR9320616.fastq.gz fasp.sra.ebi.ac.uk:/vol1/fastq/SRR932/006/SRR9320616/SRR9320616.fastq.gz
```

However, this sample actually has 2 additional FastQ files that are flagged as technical and can only be obtained by running sra-tools. This is particularly important for certain preps like 10x and others using UMI barcodes.

```
$ fasterq-dump --threads 6 --split-files --include-technical SRR9320616 --outfile SRR9320616.fastq --progress

SRR9320616_1.fastq
SRR9320616_2.fastq
SRR9320616_3.fastq
```

This highlights a discrepancy between the read data reported by the ENA API and what can actually be fetched with sra-tools, where the latter seems to be the source of truth. If you anticipate more than 2 FastQ files per sample, it is recommended to run this pipeline with the `--force_sratools_download` parameter.

See [issue #260](https://github.com/nf-core/fetchngs/issues/260) for more details.

### Bypass Aspera data download

If the appropriate download links are available, the pipeline uses the Aspera CLI by default to download FastQ files. If you are having issues and prefer to use FTP or sra-tools instead, you can use the [`--force_ftp_download`](https://nf-co.re/fetchngs/parameters#force_ftp_download) and [`--force_sratools_download`](https://nf-co.re/fetchngs/parameters#force_sratools_download) parameters, respectively.
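
The precedence between the default and the two override flags can be sketched as a small shell function. This is only an illustration of the behaviour described in the README, not the pipeline's actual Nextflow logic: the function name `pick_download_method` and its three positional arguments are hypothetical, and it assumes that `--force_sratools_download` wins when both flags are set.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide how FastQ files would be fetched.
#   $1 - "true" if the ENA API returned direct download links
#   $2 - value of --force_ftp_download ("true"/"false")
#   $3 - value of --force_sratools_download ("true"/"false")
pick_download_method() {
    local has_links="$1" force_ftp="$2" force_sratools="$3"

    if [ "$force_sratools" = "true" ] || [ "$has_links" != "true" ]; then
        echo "sratools"   # no direct links, or sra-tools explicitly forced (assumed priority)
    elif [ "$force_ftp" = "true" ]; then
        echo "ftp"        # direct links exist, but FTP was requested
    else
        echo "aspera"     # default when direct links exist
    fi
}

pick_download_method true  false false   # prints "aspera"
pick_download_method true  true  false   # prints "ftp"
pick_download_method false false false   # prints "sratools"
```

In a real run the choice is made by passing the flags to the pipeline itself; the function above only makes the decision order explicit.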

### Bypass `FTP` data download

If FTP connections are blocked on your network, use the [`--force_sratools_download`](https://nf-co.re/fetchngs/parameters#force_sratools_download) parameter to force the pipeline to download data using sra-tools instead of the ENA FTP.
7 changes: 7 additions & 0 deletions modules/local/aspera_cli/environment.yml
@@ -0,0 +1,7 @@
name: aspera_cli
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- bioconda::aspera-cli=4.14.0
63 changes: 63 additions & 0 deletions modules/local/aspera_cli/main.nf
@@ -0,0 +1,63 @@
process ASPERA_CLI {
tag "$meta.id"
label 'process_medium'

conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/aspera-cli:4.14.0--hdfd78af_1' :
'biocontainers/aspera-cli:4.14.0--hdfd78af_1' }"

input:
tuple val(meta), val(fastq)
val user

output:
tuple val(meta), path("*fastq.gz"), emit: fastq
tuple val(meta), path("*md5") , emit: md5
path "versions.yml" , emit: versions

script:
def args = task.ext.args ?: ''
if (meta.single_end) {
"""
ascp \\
$args \\
-i \$CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem \\
${user}@${fastq[0]} \\
${meta.id}.fastq.gz

echo "${meta.md5_1} ${meta.id}.fastq.gz" > ${meta.id}.fastq.gz.md5
md5sum -c ${meta.id}.fastq.gz.md5

cat <<-END_VERSIONS > versions.yml
"${task.process}":
aspera_cli: \$(ascli --version)
END_VERSIONS
"""
} else {
"""
ascp \\
$args \\
-i \$CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem \\
${user}@${fastq[0]} \\
${meta.id}_1.fastq.gz

echo "${meta.md5_1} ${meta.id}_1.fastq.gz" > ${meta.id}_1.fastq.gz.md5
md5sum -c ${meta.id}_1.fastq.gz.md5

ascp \\
$args \\
-i \$CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem \\
${user}@${fastq[1]} \\
${meta.id}_2.fastq.gz

echo "${meta.md5_2} ${meta.id}_2.fastq.gz" > ${meta.id}_2.fastq.gz.md5
md5sum -c ${meta.id}_2.fastq.gz.md5

cat <<-END_VERSIONS > versions.yml
"${task.process}":
aspera_cli: \$(ascli --version)
END_VERSIONS
"""
}
}
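
The checksum step that follows each `ascp` call can be tried in isolation. The sketch below mirrors the pattern of writing a checksum line to a `.md5` file and checking it with `md5sum -c`. It assumes GNU coreutils `md5sum`; the file name `example.fastq.gz` and its contents are made up for illustration, and the expected checksum is computed locally here, whereas the process takes it from `meta.md5_1` / `meta.md5_2`.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory so the example does not touch the current directory.
workdir="$(mktemp -d)"

# Stand-in for a downloaded file (hypothetical name and content).
printf 'not really a fastq\n' > "$workdir/example.fastq.gz"

# In the pipeline the expected checksum comes from ENA metadata;
# here it is computed locally so the example is self-contained.
expected_md5="$(md5sum "$workdir/example.fastq.gz" | cut -d ' ' -f 1)"

(
    cd "$workdir"
    # Same layout md5sum itself prints: checksum, two spaces, file name.
    echo "${expected_md5}  example.fastq.gz" > example.fastq.gz.md5
    # Exits non-zero if the file does not match the recorded checksum.
    md5sum -c example.fastq.gz.md5
)
```

Because `md5sum -c` returns a non-zero exit status on a mismatch, a corrupted download makes the task fail instead of silently publishing a bad file.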
17 changes: 17 additions & 0 deletions modules/local/aspera_cli/nextflow.config
@@ -0,0 +1,17 @@
process {
withName: 'ASPERA_CLI' {
ext.args = '-QT -l 300m -P33001'
publishDir = [
[
path: { "${params.outdir}/fastq" },
mode: params.publish_dir_mode,
pattern: "*.fastq.gz"
],
[
path: { "${params.outdir}/fastq/md5" },
mode: params.publish_dir_mode,
pattern: "*.md5"
]
]
}
}
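
For orientation, these `ext.args` are placed in front of the key, source and destination arguments of the `ascp` call in `main.nf`. The sketch below only assembles and prints that command line rather than starting a transfer. The function name `build_ascp_cmd` is hypothetical, and the flag descriptions in the comments reflect common `ascp` usage, so they should be checked against the Aspera documentation for the installed version.

```shell
#!/usr/bin/env bash
# Hypothetical helper that prints the ascp command the process would run.
#   $1 - Aspera user (e.g. era-fasp)
#   $2 - remote path as reported in the ENA fastq_aspera field
#   $3 - local output file name
build_ascp_cmd() {
    local user="$1" remote="$2" out="$3"
    # -Q  fair transfer policy      -T  disable encryption
    # -l  maximum transfer rate     -P  port used for session initiation
    local args='-QT -l 300m -P33001'
    # \$CONDA_PREFIX is kept literal, as in the process script.
    echo "ascp ${args} -i \$CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem ${user}@${remote} ${out}"
}

build_ascp_cmd era-fasp \
    'fasp.sra.ebi.ac.uk:/vol1/fastq/SRR131/002/SRR13191702/SRR13191702_1.fastq.gz' \
    SRX9626017_SRR13191702_1.fastq.gz
```

Keeping the transfer flags in `ext.args` means they can be overridden from a custom config without editing the module.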
37 changes: 37 additions & 0 deletions modules/local/aspera_cli/tests/main.nf.test
@@ -0,0 +1,37 @@
nextflow_process {

name "Test process: ASPERA_CLI"
script "../main.nf"
process "ASPERA_CLI"

tag "ASPERA_CLI"

test("Should run without failures") {

when {
params {
outdir = "$outputDir"
}

process {
"""
input[0] = [
[ id:'SRX9626017_SRR13191702', single_end:false, md5_1: '89c5be920021a035084d8aeb74f32df7', md5_2: '56271be38a80db78ef3bdfc5d9909b98' ], // meta map
[
'fasp.sra.ebi.ac.uk:/vol1/fastq/SRR131/002/SRR13191702/SRR13191702_1.fastq.gz',
'fasp.sra.ebi.ac.uk:/vol1/fastq/SRR131/002/SRR13191702/SRR13191702_2.fastq.gz'
]
]
input[1] = 'era-fasp'
"""
}
}

then {
assertAll(
{ assert process.success },
{ assert snapshot(process.out).match() }
)
}
}
}
71 changes: 71 additions & 0 deletions modules/local/aspera_cli/tests/main.nf.test.snap
@@ -0,0 +1,71 @@
{
"Should run without failures": {
"content": [
{
"0": [
[
{
"id": "SRX9626017_SRR13191702",
"single_end": false,
"md5_1": "89c5be920021a035084d8aeb74f32df7",
"md5_2": "56271be38a80db78ef3bdfc5d9909b98"
},
[
"SRX9626017_SRR13191702_1.fastq.gz:md5,baaaea61cba4294ec696fdfea1610848",
"SRX9626017_SRR13191702_2.fastq.gz:md5,8e43ad99049fabb6526a4b846da01c32"
]
]
],
"1": [
[
{
"id": "SRX9626017_SRR13191702",
"single_end": false,
"md5_1": "89c5be920021a035084d8aeb74f32df7",
"md5_2": "56271be38a80db78ef3bdfc5d9909b98"
},
[
"SRX9626017_SRR13191702_1.fastq.gz.md5:md5,055a6916ec9ee478e453d50651f87997",
"SRX9626017_SRR13191702_2.fastq.gz.md5:md5,c30ac785f8d80ec563fabf604d8bf945"
]
]
],
"2": [
"versions.yml:md5,a51a1dfc6308d71058ddc12c46101dd3"
],
"fastq": [
[
{
"id": "SRX9626017_SRR13191702",
"single_end": false,
"md5_1": "89c5be920021a035084d8aeb74f32df7",
"md5_2": "56271be38a80db78ef3bdfc5d9909b98"
},
[
"SRX9626017_SRR13191702_1.fastq.gz:md5,baaaea61cba4294ec696fdfea1610848",
"SRX9626017_SRR13191702_2.fastq.gz:md5,8e43ad99049fabb6526a4b846da01c32"
]
]
],
"md5": [
[
{
"id": "SRX9626017_SRR13191702",
"single_end": false,
"md5_1": "89c5be920021a035084d8aeb74f32df7",
"md5_2": "56271be38a80db78ef3bdfc5d9909b98"
},
[
"SRX9626017_SRR13191702_1.fastq.gz.md5:md5,055a6916ec9ee478e453d50651f87997",
"SRX9626017_SRR13191702_2.fastq.gz.md5:md5,c30ac785f8d80ec563fabf604d8bf945"
]
]
],
"versions": [
"versions.yml:md5,a51a1dfc6308d71058ddc12c46101dd3"
]
}
],
"timestamp": "2024-01-29T13:00:29.847293"
}
}
1 change: 1 addition & 0 deletions nextflow.config
@@ -17,6 +17,7 @@ params {
ena_metadata_fields = null
sample_mapping_fields = 'experiment_accession,run_accession,sample_accession,experiment_alias,run_alias,sample_alias,experiment_title,sample_title,sample_description'
synapse_config = null
force_ftp_download = false
force_sratools_download = false
skip_fastq_download = false
dbgap_key = null
6 changes: 6 additions & 0 deletions nextflow_schema.json
@@ -54,6 +54,12 @@
"help_text": "The default is 'auto' which can be used with nf-core/rnaseq v3.10 onwards to auto-detect strandedness during the pipeline execution.",
"default": "auto"
},
"force_ftp_download": {
"type": "boolean",
"fa_icon": "fas fa-tools",
"description": "Force download FASTQ files via FTP instead of via the Aspera CLI.",
"help_text": "If the Aspera CLI is not working on your infrastructure use this flag to force the pipeline to download data via FTP."
},
"force_sratools_download": {
"type": "boolean",
"fa_icon": "fas fa-tools",