Cellranger vdj #289

Merged 38 commits on Mar 7, 2024

Commits (38)

385364c  split bulk/sc raw (mapo9, Dec 5, 2023)
5c3f677  added cellranger reference, cellranger vdj (mapo9, Dec 5, 2023)
e53237d  added cellranger vdj (mapo9, Dec 7, 2023)
f35e3d7  added cellranger vdj (mapo9, Dec 7, 2023)
cf9090c  Merge branch 'dev' into cellranger_vdj (ggabernet, Feb 8, 2024)
f8b6a0f  Update nextflow_schema.json (mapo9, Feb 8, 2024)
747d4ba  Update nextflow_schema.json (mapo9, Feb 8, 2024)
d052124  Update subworkflows/local/repertoire_analysis_reporting.nf (mapo9, Feb 8, 2024)
1adb1a7  Update subworkflows/local/sc_raw_input.nf (mapo9, Feb 8, 2024)
d8cee70  Update workflows/airrflow.nf (mapo9, Feb 8, 2024)
40a53d6  single cell based on parameter (mapo9, Feb 9, 2024)
75d05f7  fixed samplesheet_val (mapo9, Feb 9, 2024)
90a2ef4  linting (mapo9, Feb 9, 2024)
f048861  lint (mapo9, Feb 9, 2024)
9d8598d  lint (mapo9, Feb 9, 2024)
42ccb1e  pre-commit (mapo9, Feb 9, 2024)
12bdef9  pre-commit (mapo9, Feb 9, 2024)
62f4643  test ready (mapo9, Feb 12, 2024)
2917dc2  more stuff for test (mapo9, Feb 12, 2024)
1e44e33  paths to test data (mapo9, Feb 12, 2024)
293ba60  prettier (mapo9, Feb 12, 2024)
3e52ef3  broken link (mapo9, Feb 12, 2024)
ebbd620  Merge branch 'dev' into cellranger_vdj (ggabernet, Feb 16, 2024)
bc1e316  avoid creating extra param (ggabernet, Feb 16, 2024)
2bb6952  merge fastqs with multiple lanes (ggabernet, Feb 19, 2024)
f1caa7b  fix text (ggabernet, Feb 19, 2024)
7b8b8c9  fix lint (ggabernet, Feb 19, 2024)
68a7329  fix metadata merge (ggabernet, Feb 19, 2024)
afa4c1d  add collect (ggabernet, Feb 20, 2024)
a0ad699  Merge pull request #305 from ggabernet/cellranger_vdj (ggabernet, Feb 27, 2024)
ed44d9a  delete local mkvdjref (mapo9, Feb 28, 2024)
637dd94  readme update 01 (mapo9, Feb 28, 2024)
541904d  docs update (mapo9, Feb 29, 2024)
cdef84f  Update README.md (mapo9, Mar 6, 2024)
ec048d9  Update CHANGELOG.md (mapo9, Mar 6, 2024)
6f0844b  Update README.md (mapo9, Mar 6, 2024)
575b41d  Update README.md (mapo9, Mar 6, 2024)
184d559  prettier (mapo9, Mar 6, 2024)

Files changed

1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -56,6 +56,7 @@ jobs:
"test_fetchimgt",
"test_assembled_hs",
"test_assembled_mm",
"test_10x_sc",
"test_clontech_umi",
"test_nebnext_umi",
]
2 changes: 2 additions & 0 deletions .gitignore
@@ -11,3 +11,5 @@ package-lock.json
.idea/
nf-params.json
.vscode/
tests/
test_flow/
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

- [#294](https://github.com/nf-core/airrflow/pull/294) Merge template updates nf-core/tools v2.11.1
- [#299](https://github.com/nf-core/airrflow/pull/299) Add profile for common NEB and TAKARA protocols
- [#289](https://github.com/nf-core/airrflow/pull/289) Add possibility to merge multi-lane samples when starting from fastq files
- [#289](https://github.com/nf-core/airrflow/pull/289) Add possibility to run cellranger for scVDJseq data

### `Fixed`

45 changes: 32 additions & 13 deletions README.md
@@ -20,7 +20,7 @@

## Introduction

**nf-core/airrflow** is a bioinformatics best-practice pipeline to analyze B-cell or T-cell repertoire sequencing data. It makes use of the [Immcantation](https://immcantation.readthedocs.io) toolset. The input data can be targeted amplicon bulk sequencing data of the V, D, J and C regions of the B/T-cell receptor with multiplex PCR or 5' RACE protocol, or assembled reads (bulk or single cell).
**nf-core/airrflow** is a bioinformatics best-practice pipeline to analyze B-cell or T-cell repertoire sequencing data. It makes use of the [Immcantation](https://immcantation.readthedocs.io) toolset. The input data can be targeted amplicon bulk sequencing data of the V, D, J and C regions of the B/T-cell receptor with multiplex PCR or 5' RACE protocol, single-cell VDJ sequencing using the 10xGenomics libraries, or assembled reads (bulk or single-cell).

![nf-core/airrflow overview](docs/images/airrflow_workflow_overview.png)

@@ -34,18 +34,25 @@ nf-core/airrflow allows the end-to-end processing of BCR and TCR bulk and single

![nf-core/airrflow overview](docs/images/metro-map-airrflow.png)

1. QC and sequence assembly (bulk only)

   - Raw read quality control, adapter trimming and clipping (`Fastp`).
   - Filter sequences by base quality (`pRESTO FilterSeq`).
   - Mask amplicon primers (`pRESTO MaskPrimers`).
   - Pair read mates (`pRESTO PairSeq`).
   - For UMI-based sequencing:
     - Cluster sequences according to similarity (optional for insufficient UMI diversity) (`pRESTO ClusterSets`).
     - Build consensus of sequences with the same UMI barcode (`pRESTO BuildConsensus`).
   - Assemble R1 and R2 read mates (`pRESTO AssemblePairs`).
   - Remove and annotate read duplicates (`pRESTO CollapseSeq`).
   - Filter out sequences that do not have at least 2 duplicates (`pRESTO SplitSeq`).
1. QC and sequence assembly

   - Bulk
     - Raw read quality control, adapter trimming and clipping (`Fastp`).
     - Filter sequences by base quality (`pRESTO FilterSeq`).
     - Mask amplicon primers (`pRESTO MaskPrimers`).
     - Pair read mates (`pRESTO PairSeq`).
     - For UMI-based sequencing:
       - Cluster sequences according to similarity (optional for insufficient UMI diversity) (`pRESTO ClusterSets`).
       - Build consensus of sequences with the same UMI barcode (`pRESTO BuildConsensus`).
     - Assemble R1 and R2 read mates (`pRESTO AssemblePairs`).
     - Remove and annotate read duplicates (`pRESTO CollapseSeq`).
     - Filter out sequences that do not have at least 2 duplicates (`pRESTO SplitSeq`).
   - Single cell
     - `cellranger vdj`
       - Assemble contigs
       - Annotate contigs
       - Call cells
       - Generate clonotypes

2. V(D)J annotation and filtering (bulk and single-cell)

@@ -115,6 +122,18 @@ nextflow run nf-core/airrflow \
--outdir ./results
```

A typical command to run the pipeline from **single cell raw fastq files** (10xGenomics) is:

```bash
nextflow run nf-core/airrflow -r dev \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
--mode fastq \
--input input_samplesheet.tsv \
--library_generation_method sc_10x_genomics \
--reference_10x reference/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz \
--outdir ./results
```

A typical command to run the pipeline from **single-cell AIRR rearrangement tables or assembled bulk sequencing fasta** data is:

14 changes: 6 additions & 8 deletions bin/check_samplesheet.py
@@ -124,11 +124,6 @@ def check_samplesheet(file_in, assembled):
)
)
else:
if any(tab["single_cell"].tolist()):
print_error(
"Some single cell column values are TRUE. The raw mode only accepts bulk samples. If processing single cell samples, please set the `--mode assembled` flag, and provide an AIRR rearrangement as input."
)

for col in required_columns_raw:
if col not in header:
print("ERROR: Please check samplesheet header: {} ".format(",".join(header)))
@@ -165,9 +160,12 @@ def check_samplesheet(file_in, assembled):

## Check that sample ids are unique
if len(tab["sample_id"]) != len(set(tab["sample_id"])):
print_error(
"Sample IDs are not unique! The sample IDs in the input samplesheet should be unique for each sample."
)
if assembled:
print_error(
"Sample IDs are not unique! The sample IDs in the input samplesheet should be unique for each sample."
)
else:
print("WARNING: Sample IDs are not unique! FastQs with the same sample ID will be merged.")

## Check that pcr_target_locus is IG or TR
for val in tab["pcr_target_locus"]:
11 changes: 6 additions & 5 deletions bin/reveal_add_metadata.R
@@ -61,8 +61,12 @@ if (!("INPUTID" %in% names(opt))) {
# Read metadata file
metadata <- read.csv(opt$METADATA, sep = "\t", header = TRUE, stringsAsFactors = F)

# Merging samples over multiple lanes introduces multi-rows per sample
# We expect only one row per sample
metadata <- metadata %>%
filter(sample_id == opt$INPUTID)
dplyr::filter(sample_id == opt$INPUTID) %>%
dplyr::select(!starts_with("filename_")) %>%
dplyr::distinct()

if (nrow(metadata) != 1) {
stop("Expecting nrow(metadata) == 1; nrow(metadata) == ", nrow(metadata), " found")
@@ -81,10 +85,7 @@ internal_fields <-
"id",
"filetype",
"valid_single_cell",
"valid_pcr_target_locus",
"filename_R1",
"filename_R2",
"filename_I1"
"valid_pcr_target_locus"
)
metadata <- metadata[, !colnames(metadata) %in% internal_fields]

28 changes: 28 additions & 0 deletions conf/test_10x_sc.config
@@ -0,0 +1,28 @@
/*
* -------------------------------------------------
* Nextflow config file for running tests
* -------------------------------------------------
* Defines bundled input files and everything required
* to run a fast and simple test. Use as follows:
* nextflow run nf-core/airrflow -profile test_10x_sc,<docker/singularity>
*/

params {
    config_profile_name = 'Test 10xGenomics single cell data'
    config_profile_description = 'Minimal test dataset to check pipeline function with raw single cell data from 10xGenomics'

    // Limit resources so that this can run on GitHub Actions
    max_cpus = 2
    max_memory = 6.GB
    max_time = 48.h

    // params
    mode = 'fastq'
    library_generation_method = 'sc_10x_genomics'
    clonal_threshold = 0

    // Input data
    input = 'https://raw.githubusercontent.com/nf-core/test-datasets/airrflow/testdata-sc/10x_sc_raw.tsv'
    reference_10x = 'https://raw.githubusercontent.com/nf-core/test-datasets/airrflow/testdata-sc/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz'
}
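
As the header comment notes, this test profile can be launched directly; a minimal sketch of such a run (using Docker here, any supported container profile works):

```bash
# Run the bundled 10xGenomics single-cell test profile end to end.
nextflow run nf-core/airrflow -r dev \
    -profile test_10x_sc,docker \
    --outdir ./results
```
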
70 changes: 66 additions & 4 deletions docs/usage.md
@@ -39,6 +39,18 @@ nextflow run nf-core/airrflow \
--outdir results
```

A typical command to run the pipeline from **single cell raw fastq files** is:

```bash
nextflow run nf-core/airrflow -r dev \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
--mode fastq \
--input input_samplesheet.tsv \
--library_generation_method sc_10x_genomics \
--reference_10x reference/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz \
--outdir ./results
```

A typical command for running the pipeline departing from **single-cell AIRR rearrangement tables or assembled bulk sequencing fasta** data is:

```bash
@@ -49,7 +61,7 @@ nextflow run nf-core/airrflow \
--outdir results
```

Check the section [Input samplesheet](#input-samplesheet) below for instructions on how to create the samplesheet, and the [Supported library generation protocols](#supported-bulk-library-generation-methods-protocols) section below for examples on how to run the pipeline for different bulk sequencing protocols.
Check the section [Input samplesheet](#input-samplesheet) below for instructions on how to create the samplesheet, and the [Supported library generation protocols](#supported-bulk-library-generation-methods-protocols) section below for examples of how to run the pipeline for the different bulk protocols and the 10xGenomics single-cell sequencing protocol.
For more information about the parameters, please refer to the [parameters documentation](https://nf-co.re/airrflow/parameters).
The command above will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.

@@ -111,7 +123,7 @@ If you wish to share such profile (such as upload as supplementary material for

## Input samplesheet

### Fastq input samplesheet (bulk sequencing only)
### Fastq input samplesheet (bulk sequencing)

The required input file for processing raw BCR or TCR bulk targeted sequencing data is a sample sheet in TSV format (tab separated). The columns `sample_id`, `filename_R1`, `filename_R2`, `subject_id`, `species`, `tissue`, `pcr_target_locus`, `single_cell`, `sex`, `age` and `biomaterial_provider` are required. An example samplesheet is:

@@ -131,7 +143,7 @@ The required input file for processing raw BCR or TCR bulk targeted sequencing d
- `biomaterial_provider`: Institution / research group that provided the samples.
- `sex`: Subject biological sex (`female`, `male`, etc.).
- `age`: Subject biological age.
- `single_cell`: TRUE or FALSE. Fastq input samplesheet only supports a FALSE value.
- `single_cell`: TRUE or FALSE.

Other optional columns can be added. These columns will be available when building the contrasts for the repertoire comparison report. It is recommended that these columns also follow the AIRR nomenclature. Examples are:

@@ -143,6 +155,25 @@ Other optional columns can be added. These columns will be available when buildi

The metadata specified in the input file will then be automatically annotated in a column with the same header in the tables generated by the pipeline.

### Fastq input samplesheet (single cell sequencing)

The required input file for processing raw BCR or TCR single cell targeted sequencing data is a sample sheet in TSV format (tab separated). The columns `sample_id`, `filename_R1`, `filename_R2`, `subject_id`, `species`, `tissue`, `pcr_target_locus`, `single_cell`, `sex`, `age` and `biomaterial_provider` are required. You can refer to the bulk fastq input section for documentation on the individual columns.
An example samplesheet is:

| sample_id | filename_R1 | filename_R2 | subject_id | species | pcr_target_locus | tissue | sex | age | biomaterial_provider | single_cell | intervention | collection_time_point_relative | cell_subset |
| --------- | ------------------------------- | ------------------------------- | ---------- | ------- | ---------------- | ------ | ------ | --- | -------------------- | ----------- | -------------- | ------------------------------ | ------------ |
| sample01 | sample1_S1_L001_R1_001.fastq.gz | sample1_S1_L001_R2_001.fastq.gz | Subject02 | human | IG | blood | NA | 53 | sequencing_facility | FALSE | Drug_treatment | Baseline | plasmablasts |
| sample02 | sample2_S1_L001_R1_001.fastq.gz | sample2_S1_L001_R2_001.fastq.gz | Subject02 | human | TR | blood | female | 78 | sequencing_facility | FALSE | Drug_treatment | Baseline | plasmablasts |

> FASTQ files must conform to the 10xGenomics cellranger naming convention:<br>
> **`[SAMPLE-NAME]`\_S1_L00`[LANE-NUMBER]`\_`[READ-TYPE]`\_001.fastq.gz**
>
> Read type is one of:
>
> - `I1`: Sample index read (optional)
> - `I2`: Sample index read (optional)
> - `R1`: Read 1
> - `R2`: Read 2
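
As a quick illustration, a hypothetical FASTQ set for one sample sequenced on two lanes could look as follows (the `fastqs/` directory and the sample name `sample1` are made up for this example):

```bash
# Create and list a hypothetical FASTQ directory whose files follow the
# cellranger naming convention [SAMPLE-NAME]_S1_L00[LANE-NUMBER]_[READ-TYPE]_001.fastq.gz
mkdir -p fastqs
touch fastqs/sample1_S1_L00{1,2}_R{1,2}_001.fastq.gz
ls fastqs/
# sample1_S1_L001_R1_001.fastq.gz  sample1_S1_L001_R2_001.fastq.gz
# sample1_S1_L002_R1_001.fastq.gz  sample1_S1_L002_R2_001.fastq.gz
```

Multiple samplesheet rows may share the same `sample_id` (for example, one row per lane); as the updated samplesheet check warns, FastQs with the same sample ID will be merged.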

### Assembled input samplesheet (bulk or single-cell sequencing)

The required input file for processing raw BCR or TCR bulk targeted sequencing data is a sample sheet in TSV format (tab separated). The columns `sample_id`, `filename`, `subject_id`, `species`, `tissue`, `single_cell`, `sex`, `age` and `biomaterial_provider` are required. All fields are explained in the previous section, with the only difference being that there is only one `filename` column for the assembled input samplesheet. The provided file will be different from assembled single-cell or bulk data:
@@ -380,7 +411,7 @@ This sequencing type requires setting `--library_generation_method race_5p_umi`

#### Takara Bio SMARTer Human BCR

The read configuration when sequenicng with the TAKARA Bio SMARTer Human BCR protocol is the following:
The read configuration when sequencing with the TAKARA Bio SMARTer Human BCR protocol is the following:

![nf-core/airrflow](images/TAKARA_RACE_BCR.png)

@@ -449,6 +480,37 @@ The UMI barcodes are typically read from an index file but sometimes can be prov

- No UMIs in R1 or R2 reads: if no UMIs are present in the samples, specify `--umi_length 0` to use the sans-UMI subworkflow.

## Supported single cell library generation methods (protocols)

When processing single-cell sequencing data starting from raw `fastq` reads, the only `--library_generation_method` currently available is for 10xGenomics data.

| Library generation methods | Description | Name in pipeline | Commercial protocols |
| -------------------------- | ----------------------------------------------------------------------------------------------------------- | ---------------- | -------------------- |
| RT(RHP)+PCR | sequencing data produced from Chromium single cell 5'V(D)J libraries containing cellular barcodes and UMIs. | sc_10x_genomics | 10xGenomics |

### 10xGenomics

This sequencing type requires setting `--library_generation_method sc_10x_genomics`.
The `cellranger vdj` command automatically uses the Chromium cellular barcodes and UMIs to perform sequence assembly, call paired clonotypes and assemble V(D)J transcripts per cell.
An example command for running airrflow on 10xGenomics raw FASTQ data is provided below.

```bash
nextflow run nf-core/airrflow -r dev \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
--mode fastq \
--input input_samplesheet.tsv \
--library_generation_method sc_10x_genomics \
--reference_10x reference/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz \
--outdir ./results
```

#### 10xGenomics reference

`cellranger vdj` requires a V(D)J reference. This can be provided using the `--reference_10x` parameter.

- The 10xGenomics reference can be downloaded from the [download page](https://www.10xgenomics.com/support/software/cell-ranger/downloads).
- To generate a V(D)J segment fasta file as reference from IMGT one can follow the [cellranger docs](https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/advanced/references#imgt); a command sketch is shown below.
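
The sketch below illustrates such a `cellranger mkvdjref` call. It assumes a local cellranger installation; `my_vdj_segments.fasta` and the output folder name are placeholders, and the exact options should be checked against the cellranger documentation for your version.

```bash
# Build a custom V(D)J reference folder from a segment fasta and package it
# as a tar.gz so it can be passed to the pipeline with --reference_10x.
cellranger mkvdjref \
    --genome=my_vdj_reference \
    --seqs=my_vdj_segments.fasta

tar -czvf my_vdj_reference.tar.gz my_vdj_reference
```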

## Core Nextflow arguments

:::note
15 changes: 15 additions & 0 deletions modules.json
@@ -5,6 +5,21 @@
"https://github.com/nf-core/modules.git": {
"modules": {
"nf-core": {
"cat/fastq": {
"branch": "master",
"git_sha": "02fd5bd7275abad27aad32d5c852e0a9b1b98882",
"installed_by": ["modules"]
},
"cellranger/mkvdjref": {
"branch": "master",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"installed_by": ["modules"]
},
"cellranger/vdj": {
"branch": "master",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"installed_by": ["modules"]
},
"custom/dumpsoftwareversions": {
"branch": "master",
"git_sha": "8ec825f465b9c17f9d83000022995b4f7de6fe93",
29 changes: 29 additions & 0 deletions modules/local/unzip_cellrangerdb.nf
@@ -0,0 +1,29 @@
process UNZIP_CELLRANGERDB {
    tag "unzip_cellrangerdb"
    label 'process_single'

    conda "${moduleDir}/environment.yml"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/ubuntu:20.04' :
        'nf-core/ubuntu:20.04' }"

    input:
    path(archive)

    output:
    path("$unzipped")  , emit: unzipped
    path "versions.yml", emit: versions

    script:
    // Derive the output directory name from the archive name (strips the .tar.gz suffix)
    unzipped = archive.toString() - '.tar.gz'
    """
    echo "${unzipped}"

    tar -xzvf ${archive}

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        unzip_cellrangerdb: \$(echo \$(tar --version 2>&1 | sed 's/^.*(GNU tar) //; s/ Copyright.*\$//'))
    END_VERSIONS
    """
}
7 changes: 7 additions & 0 deletions modules/nf-core/cat/fastq/environment.yml
