Add Aspera CLI download support to pipeline #259

Merged on Jan 30, 2024: 10 commits (showing changes from 9 commits)

11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -30,6 +30,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
 - [PR #253](https://github.com/nf-core/fetchngs/pull/253) - Add implicit tags in nf-test files for simpler testing strategy
 - [PR #257](https://github.com/nf-core/fetchngs/pull/257) - Template update for nf-core/tools v2.12
 - [PR #258](https://github.com/nf-core/fetchngs/pull/258) - Fixes for [PR #253](https://github.com/nf-core/fetchngs/pull/253)
+- [PR #259](https://github.com/nf-core/fetchngs/pull/259) - Add Aspera CLI download support to pipeline ([#68](https://github.com/nf-core/fetchngs/issues/68))

 ### Software dependencies

@@ -43,6 +44,16 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
 >
 > **NB:** Dependency has been **removed** if new version information isn't present.

+### Parameters
+
+| Old parameter | New parameter          |
+| ------------- | ---------------------- |
+|               | `--force_ftp_download` |
+
+> **NB:** Parameter has been **updated** if both old and new parameter information is present.
+> **NB:** Parameter has been **added** if just the new parameter information is present.
+> **NB:** Parameter has been **removed** if new parameter information isn't present.
+
 ## [[1.11.0](https://github.com/nf-core/fetchngs/releases/tag/1.11.0)] - 2023-10-18

 ### Credits
2 changes: 2 additions & 0 deletions CITATIONS.md
@@ -10,6 +10,8 @@

 ## Pipeline tools

+- [Aspera CLI](https://github.com/IBM/aspera-cli)
+
 - [Python](http://www.python.org)

 - [Requests](https://docs.python-requests.org/)
6 changes: 4 additions & 2 deletions README.md
@@ -66,8 +66,10 @@ Via a single file of ids, provided one-per-line (see [example input file](https:
 1. Resolve database ids back to appropriate experiment-level ids and to be compatible with the [ENA API](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html)
 2. Fetch extensive id metadata via ENA API
 3. Download FastQ files:
-   - If direct download links are available from the ENA API, fetch in parallel via `curl` and perform `md5sum` check
-   - Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ
+   - If direct download links are available from the ENA API:
+     - Fetch in parallel via `aspera-cli` and perform `md5sum` check (default)
+     - Fetch in parallel via `wget` and perform `md5sum` check. Use `--force_ftp_download` to force this behaviour.
+   - Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ. Use `--force_sratools_download` to force this behaviour.
 4. Collate id metadata and paths to FastQ files in a single samplesheet

 ### Synapse ids
4 changes: 2 additions & 2 deletions conf/test_full.config
@@ -14,6 +14,6 @@ params {
     config_profile_name = 'Full test profile'
     config_profile_description = 'Full test dataset to check pipeline function'

-    // Input data for full size test
-    input = 'https://raw.githubusercontent.com/nf-core/test-datasets/bb634bcfef520552e8314dfa3f8a764e1d62f7dc/testdata/v1.12.0/sra_ids_test_full.csv'
+    // File containing SRA ids from nf-core/rnaseq -profile test_full for full-sized test
+    input = 'https://raw.githubusercontent.com/nf-core/test-datasets/100736c99d87667fb7c247c267bc8acfac647bed/testdata/v1.12.0/sra_ids_rnaseq_test_full.csv'
 }
4 changes: 4 additions & 0 deletions docs/usage.md
@@ -66,6 +66,10 @@ You can use the `--nf_core_pipeline` parameter to customise this behaviour e.g.

 From v1.9 of this pipeline the default `strandedness` in the output samplesheet will be set to `auto` when using `--nf_core_pipeline rnaseq`. This will only work with v3.10 onwards of nf-core/rnaseq which permits the auto-detection of strandedness during the pipeline execution. You can change this behaviour with the `--nf_core_rnaseq_strandedness` parameter which is set to `auto` by default.

+### Bypass Aspera data download
+
+If the appropriate download links are available, the pipeline uses the Aspera CLI by default to download FastQ files. If you are having issues and prefer to use FTP or sra-tools instead, you can use the [`--force_ftp_download`](https://nf-co.re/fetchngs/parameters#force_ftp_download) and [`--force_sratools_download`](https://nf-co.re/fetchngs/parameters#force_sratools_download) parameters, respectively.
+
 ### Bypass `FTP` data download

 If FTP connections are blocked on your network use the [`--force_sratools_download`](https://nf-co.re/fetchngs/parameters#force_sratools_download) parameter to force the pipeline to download data using sra-tools instead of the ENA FTP.
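
As a rough illustration of the fallback order these two usage sections describe, the Python sketch below mirrors the per-run decision between Aspera, FTP and sra-tools. It is not pipeline code: the function name `choose_download_method` and its argument names are hypothetical, chosen to echo the `fastq_aspera` and `fastq_1` metadata fields and the two `--force_*` parameters.

```python
from typing import Optional


def choose_download_method(
    fastq_aspera: Optional[str],
    fastq_1: Optional[str],
    force_ftp_download: bool = False,
    force_sratools_download: bool = False,
) -> str:
    """Return 'aspera', 'ftp' or 'sratools' for one run (hypothetical helper)."""
    method = "aspera"
    # Fall back to FTP when there is no Aspera link or FTP is forced,
    # but only if an FTP link actually exists for the run.
    if (not fastq_aspera or force_ftp_download) and fastq_1:
        method = "ftp"
    # sra-tools is used when no direct link exists at all, or when it is forced.
    if (not fastq_aspera and not fastq_1) or force_sratools_download:
        method = "sratools"
    return method
```

Under these assumptions `--force_sratools_download` takes precedence over `--force_ftp_download` when both are set, because the sra-tools check runs last.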
7 changes: 7 additions & 0 deletions modules/local/aspera_cli/environment.yml
@@ -0,0 +1,7 @@
name: aspera_cli
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- bioconda::aspera-cli=4.14.0
63 changes: 63 additions & 0 deletions modules/local/aspera_cli/main.nf
@@ -0,0 +1,63 @@
process ASPERA_CLI {
    tag "$meta.id"
    label 'process_medium'

    conda "${moduleDir}/environment.yml"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/aspera-cli:4.14.0--hdfd78af_1' :
        'biocontainers/aspera-cli:4.14.0--hdfd78af_1' }"

    input:
    tuple val(meta), val(fastq)
    val user

    output:
    tuple val(meta), path("*fastq.gz"), emit: fastq
    tuple val(meta), path("*md5")     , emit: md5
    path "versions.yml"               , emit: versions

    script:
    def args = task.ext.args ?: ''
    if (meta.single_end) {
        """
        ascp \\
            $args \\
            -i \$CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem \\
            ${user}@${fastq[0]} \\
            ${meta.id}.fastq.gz

        echo "${meta.md5_1}  ${meta.id}.fastq.gz" > ${meta.id}.fastq.gz.md5
        md5sum -c ${meta.id}.fastq.gz.md5

        cat <<-END_VERSIONS > versions.yml
        "${task.process}":
            aspera_cli: \$(ascli --version)
        END_VERSIONS
        """
    } else {
        """
        ascp \\
            $args \\
            -i \$CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem \\
            ${user}@${fastq[0]} \\
            ${meta.id}_1.fastq.gz

        echo "${meta.md5_1}  ${meta.id}_1.fastq.gz" > ${meta.id}_1.fastq.gz.md5
        md5sum -c ${meta.id}_1.fastq.gz.md5

        ascp \\
            $args \\
            -i \$CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem \\
            ${user}@${fastq[1]} \\
            ${meta.id}_2.fastq.gz

        echo "${meta.md5_2}  ${meta.id}_2.fastq.gz" > ${meta.id}_2.fastq.gz.md5
        md5sum -c ${meta.id}_2.fastq.gz.md5

        cat <<-END_VERSIONS > versions.yml
        "${task.process}":
            aspera_cli: \$(ascli --version)
        END_VERSIONS
        """
    }
}
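
To make the shape of each `ascp` call in the process above easier to see, here is a small Python sketch that assembles the same argument list for one transfer and derives the output file names for single-end and paired-end runs. The helper names `build_ascp_command` and `output_names` are hypothetical, and the key path is passed in as a plain string rather than resolved from `$CONDA_PREFIX`; only the flags and argument order come from the process itself.

```python
import shlex
from typing import List


def build_ascp_command(user: str, link: str, dest: str, args: str, key_path: str) -> List[str]:
    """Assemble the argv list for one ascp transfer (hypothetical helper)."""
    # Order mirrors the process script: extra args, key file, source, destination.
    return ["ascp", *shlex.split(args), "-i", key_path, f"{user}@{link}", dest]


def output_names(sample_id: str, single_end: bool) -> List[str]:
    """File names the process writes for a single-end or paired-end run."""
    if single_end:
        return [f"{sample_id}.fastq.gz"]
    return [f"{sample_id}_1.fastq.gz", f"{sample_id}_2.fastq.gz"]
```

Seen this way, the paired-end branch is the single-end transfer applied once per link, with `_1` and `_2` suffixes on the destination.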
17 changes: 17 additions & 0 deletions modules/local/aspera_cli/nextflow.config
@@ -0,0 +1,17 @@
process {
    withName: 'ASPERA_CLI' {
        ext.args = '-QT -l 300m -P33001'
        publishDir = [
            [
                path: { "${params.outdir}/fastq" },
                mode: params.publish_dir_mode,
                pattern: "*.fastq.gz"
            ],
            [
                path: { "${params.outdir}/fastq/md5" },
                mode: params.publish_dir_mode,
                pattern: "*.md5"
            ]
        ]
    }
}
37 changes: 37 additions & 0 deletions modules/local/aspera_cli/tests/main.nf.test
@@ -0,0 +1,37 @@
nextflow_process {

    name "Test process: ASPERA_CLI"
    script "../main.nf"
    process "ASPERA_CLI"

    tag "ASPERA_CLI"

    test("Should run without failures") {

        when {
            params {
                outdir = "$outputDir"
            }

            process {
                """
                input[0] = [
                    [ id:'SRX9626017_SRR13191702', single_end:false, md5_1: '89c5be920021a035084d8aeb74f32df7', md5_2: '56271be38a80db78ef3bdfc5d9909b98' ], // meta map
                    [
                        'fasp.sra.ebi.ac.uk:/vol1/fastq/SRR131/002/SRR13191702/SRR13191702_1.fastq.gz',
                        'fasp.sra.ebi.ac.uk:/vol1/fastq/SRR131/002/SRR13191702/SRR13191702_2.fastq.gz'
                    ]
                ]
                input[1] = 'era-fasp'
                """
            }
        }

        then {
            assertAll(
                { assert process.success },
                { assert snapshot(process.out).match() }
            )
        }
    }
}
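
The test above supplies `md5_1` and `md5_2` in the meta map because the process writes each expected digest to an `.md5` file and runs `md5sum -c` against the download. A minimal Python sketch of the same check follows; the names `md5_of_chunks`, `file_md5` and `checksum_line` are hypothetical and not part of the pipeline.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def md5_of_chunks(chunks: Iterable[bytes]) -> str:
    """Hex MD5 digest of a stream of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file, read in blocks so large FastQ files fit in memory."""
    with path.open("rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def checksum_line(expected_md5: str, filename: str) -> str:
    """One line in md5sum's text-mode format: digest, two spaces, file name."""
    return f"{expected_md5}  {filename}"
```

A download would count as verified when `file_md5(path)` equals the expected digest from the run metadata.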
71 changes: 71 additions & 0 deletions modules/local/aspera_cli/tests/main.nf.test.snap
@@ -0,0 +1,71 @@
{
"Should run without failures": {
"content": [
{
"0": [
[
{
"id": "SRX9626017_SRR13191702",
"single_end": false,
"md5_1": "89c5be920021a035084d8aeb74f32df7",
"md5_2": "56271be38a80db78ef3bdfc5d9909b98"
},
[
"SRX9626017_SRR13191702_1.fastq.gz:md5,baaaea61cba4294ec696fdfea1610848",
"SRX9626017_SRR13191702_2.fastq.gz:md5,8e43ad99049fabb6526a4b846da01c32"
]
]
],
"1": [
[
{
"id": "SRX9626017_SRR13191702",
"single_end": false,
"md5_1": "89c5be920021a035084d8aeb74f32df7",
"md5_2": "56271be38a80db78ef3bdfc5d9909b98"
},
[
"SRX9626017_SRR13191702_1.fastq.gz.md5:md5,055a6916ec9ee478e453d50651f87997",
"SRX9626017_SRR13191702_2.fastq.gz.md5:md5,c30ac785f8d80ec563fabf604d8bf945"
]
]
],
"2": [
"versions.yml:md5,a51a1dfc6308d71058ddc12c46101dd3"
],
"fastq": [
[
{
"id": "SRX9626017_SRR13191702",
"single_end": false,
"md5_1": "89c5be920021a035084d8aeb74f32df7",
"md5_2": "56271be38a80db78ef3bdfc5d9909b98"
},
[
"SRX9626017_SRR13191702_1.fastq.gz:md5,baaaea61cba4294ec696fdfea1610848",
"SRX9626017_SRR13191702_2.fastq.gz:md5,8e43ad99049fabb6526a4b846da01c32"
]
]
],
"md5": [
[
{
"id": "SRX9626017_SRR13191702",
"single_end": false,
"md5_1": "89c5be920021a035084d8aeb74f32df7",
"md5_2": "56271be38a80db78ef3bdfc5d9909b98"
},
[
"SRX9626017_SRR13191702_1.fastq.gz.md5:md5,055a6916ec9ee478e453d50651f87997",
"SRX9626017_SRR13191702_2.fastq.gz.md5:md5,c30ac785f8d80ec563fabf604d8bf945"
]
]
],
"versions": [
"versions.yml:md5,a51a1dfc6308d71058ddc12c46101dd3"
]
}
],
"timestamp": "2024-01-29T13:00:29.847293"
}
}
1 change: 1 addition & 0 deletions nextflow.config
@@ -17,6 +17,7 @@ params {
     ena_metadata_fields = null
     sample_mapping_fields = 'experiment_accession,run_accession,sample_accession,experiment_alias,run_alias,sample_alias,experiment_title,sample_title,sample_description'
     synapse_config = null
+    force_ftp_download = false
     force_sratools_download = false
     skip_fastq_download = false
     dbgap_key = null
6 changes: 6 additions & 0 deletions nextflow_schema.json
@@ -54,6 +54,12 @@
                     "help_text": "The default is 'auto' which can be used with nf-core/rnaseq v3.10 onwards to auto-detect strandedness during the pipeline execution.",
                     "default": "auto"
                 },
+                "force_ftp_download": {
+                    "type": "boolean",
+                    "fa_icon": "fas fa-tools",
+                    "description": "Force download FASTQ files via FTP instead of via the Aspera CLI.",
+                    "help_text": "If the Aspera CLI is not working on your infrastructure use this flag to force the pipeline to download data via FTP."
+                },
                 "force_sratools_download": {
                     "type": "boolean",
                     "fa_icon": "fas fa-tools",
51 changes: 38 additions & 13 deletions workflows/sra/main.nf
@@ -8,6 +8,7 @@ include { MULTIQC_MAPPINGS_CONFIG } from '../../modules/local/multiqc_mappings_c
 include { SRA_FASTQ_FTP } from '../../modules/local/sra_fastq_ftp'
 include { SRA_IDS_TO_RUNINFO } from '../../modules/local/sra_ids_to_runinfo'
 include { SRA_RUNINFO_TO_FTP } from '../../modules/local/sra_runinfo_to_ftp'
+include { ASPERA_CLI } from '../../modules/local/aspera_cli'
 include { SRA_TO_SAMPLESHEET } from '../../modules/local/sra_to_samplesheet'
 include { softwareVersionsToYAML } from '../../subworkflows/nf-core/utils_nfcore_pipeline'

@@ -54,27 +55,51 @@ workflow SRA {
         .out
         .tsv
         .splitCsv(header:true, sep:'\t')
-        .map{ meta ->
-            def meta_clone = meta.clone()
-            meta_clone.single_end = meta_clone.single_end.toBoolean()
-            return meta_clone
+        .map {
+            meta ->
+                def meta_clone = meta.clone()
+                meta_clone.single_end = meta_clone.single_end.toBoolean()
+                return meta_clone
         }
         .unique()
         .set { ch_sra_metadata }

     if (!params.skip_fastq_download) {

         ch_sra_metadata
-            .map {
-                meta ->
-                    [ meta, [ meta.fastq_1, meta.fastq_2 ] ]
-            }
             .branch {
-                ftp: it[0].fastq_1 && !params.force_sratools_download
-                sra: !it[0].fastq_1 || params.force_sratools_download
+                meta ->
+                    def download_method = 'aspera'
+                    // meta.fastq_aspera is a metadata string with ENA fasp links supported by Aspera
+                    // For single-end: 'fasp.sra.ebi.ac.uk:/vol1/fastq/ERR116/006/ERR1160846/ERR1160846.fastq.gz'
+                    // For paired-end: 'fasp.sra.ebi.ac.uk:/vol1/fastq/SRR130/020/SRR13055520/SRR13055520_1.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR130/020/SRR13055520/SRR13055520_2.fastq.gz'
+                    if (!meta.fastq_aspera || params.force_ftp_download) {
+                        if (meta.fastq_1) {
+                            download_method = 'ftp'
+                        }
+                    }
+                    if ((!meta.fastq_aspera && !meta.fastq_1) || params.force_sratools_download) {
+                        download_method = 'sratools'
+                    }
+
+                    aspera: download_method == 'aspera'
+                        return [ meta, meta.fastq_aspera.tokenize(';').take(2) ]
+                    ftp: download_method == 'ftp'
+                        return [ meta, [ meta.fastq_1, meta.fastq_2 ] ]
+                    sratools: download_method == 'sratools'
+                        return [ meta, meta.run_accession ]
             }
             .set { ch_sra_reads }

+        //
+        // MODULE: If Aspera link is provided in run information then download FastQ directly via Aspera CLI and validate with md5sums
+        //
+        ASPERA_CLI (
+            ch_sra_reads.aspera,
+            'era-fasp'
+        )
+        ch_versions = ch_versions.mix(ASPERA_CLI.out.versions.first())
+
         //
         // MODULE: If FTP link is provided in run information then download FastQ directly via FTP and validate with md5sums
         //
@@ -87,15 +112,16 @@ workflow SRA {
         // SUBWORKFLOW: Download sequencing reads without FTP links using sra-tools.
         //
         FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS (
-            ch_sra_reads.sra.map { meta, reads -> [ meta, meta.run_accession ] },
+            ch_sra_reads.sratools,
             params.dbgap_key ? file(params.dbgap_key, checkIfExists: true) : []
         )
         ch_versions = ch_versions.mix(FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS.out.versions.first())

         // Isolate FASTQ channel which will be added to emit block
-        SRA_FASTQ_FTP
+        ASPERA_CLI
             .out
             .fastq
+            .mix(SRA_FASTQ_FTP.out.fastq)
             .mix(FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS.out.reads)
             .map {
                 meta, fastq ->
@@ -157,7 +183,6 @@ workflow SRA {
     softwareVersionsToYAML(ch_versions)
         .collectFile(storeDir: "${params.outdir}/pipeline_info", name: 'nf_core_fetchngs_software_mqc_versions.yml', sort: true, newLine: true)

-
     emit:
     samplesheet = ch_samplesheet
     mappings = ch_mappings
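
The `aspera` branch in this workflow turns the `fastq_aspera` metadata string into at most two links via `tokenize(';').take(2)`. As a hedged illustration, the Python sketch below does the same split; `split_aspera_links` is a hypothetical name, and the empty-token filter is there because Groovy's `tokenize` drops empty tokens whereas Python's `str.split` keeps them.

```python
from typing import List


def split_aspera_links(fastq_aspera: str, limit: int = 2) -> List[str]:
    """Split a ';'-separated fasp link string and keep at most `limit` links."""
    # Filter out empty pieces so the result matches Groovy's tokenize(';').
    return [part for part in fastq_aspera.split(";") if part][:limit]
```

A single-end string yields one link and a paired-end string yields two, which lines up with `fastq[0]` and `fastq[1]` in the `ASPERA_CLI` process.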
1 change: 1 addition & 0 deletions workflows/sra/nextflow.config
@@ -1,4 +1,5 @@
 includeConfig "../../modules/local/multiqc_mappings_config/nextflow.config"
+includeConfig "../../modules/local/aspera_cli/nextflow.config"
 includeConfig "../../modules/local/sra_fastq_ftp/nextflow.config"
 includeConfig "../../modules/local/sra_ids_to_runinfo/nextflow.config"
 includeConfig "../../modules/local/sra_runinfo_to_ftp/nextflow.config"