Salcher, S., Sturm, G., Horvath, L., Untergasser, G., Kuempers, C., Fotakis, G., ... & Trajanoski, Z. (2022). High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell. doi:10.1016/j.ccell.2022.10.008
The single cell lung cancer atlas is a resource integrating more than 1.2 million cells from 309 patients across 29 datasets.
The atlas is publicly available for interactive exploration through a cell-x-gene instance. We also provide h5ad objects and a scArches model that can be used to project custom datasets into the atlas. For more information, check out the
- project website and
- our publication.
This repository contains the source code to reproduce the single-cell data analysis for the paper. The analyses are wrapped into nextflow pipelines, all dependencies are provided as singularity containers, and input data are available from zenodo.
For clarity, the project is split up into two separate workflows:
- build_atlas: Takes one AnnData object with UMI counts per dataset and integrates them into an atlas.
- downstream_analyses: Runs analysis tools on the annotated, integrated atlas and produces plots for the publication.
The build_atlas step requires specific hardware (CPU + GPU) for exact reproducibility (see notes on reproducibility) and is relatively computationally expensive. Therefore, the downstream_analyses step can also operate on pre-computed results of the build_atlas step, which are available from zenodo.
- Nextflow, version 21.10.6 or higher
- Singularity/Apptainer, version 3.7 or higher (tested with 3.7.0-1.el7)
- A high performance cluster (HPC) or cloud setup. The whole analysis will consume several thousand CPU hours.
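A quick way to compare the installed tools against these requirements is sketched below. It is a minimal sketch and not part of the repository: the helper name version_ge is hypothetical, and it assumes a sort implementation that supports -V (version sort, as in GNU coreutils).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository):
# succeeds if version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed nextflow against the minimum version listed above.
if command -v nextflow >/dev/null 2>&1; then
    installed="$(nextflow -v | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)"
    if version_ge "$installed" "21.10.6"; then
        echo "nextflow $installed is recent enough"
    else
        echo "nextflow $installed is older than 21.10.6"
    fi
else
    echo "nextflow not found on PATH"
fi

# Singularity/Apptainer report their version with --version. It is only printed
# here for manual inspection, because Apptainer uses its own version numbering.
if command -v singularity >/dev/null 2>&1; then
    singularity --version
else
    echo "singularity not found on PATH"
fi
```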
Before launching the workflow, you need to obtain input data and singularity containers from zenodo. First of all, clone this repository:
git clone https://github.com/icbi-lab/luca.git
cd luca
Then, within the repository, download the data archives and extract them to the corresponding directories:
# singularity containers
curl "https://zenodo.org/record/7227571/files/containers.tar.xz?download=1" | tar xvJ
# input data
curl "https://zenodo.org/record/7227571/files/input_data.tar.xz?download=1" | tar xvJ
# OPTIONAL: obtain intermediate results if you just want to run the `downstream_analyses` workflow
curl "https://zenodo.org/record/7227571/files/build_atlas_results.tar.xz?download=1" | tar xvJ
Note that some steps of the downstream analysis depend on an additional cohort of checkpoint-inhibitor-treated patients, which is only available under a protected access agreement. Because of these access restrictions, the data are not included in our data archive. You'll need to obtain the dataset yourself and place it in the data/14_ici_treatment/Genentech folder.
The corresponding analysis steps are skipped by default. You can enable them by adding the --with_genentech flag to the nextflow run command.
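One way to avoid enabling these steps when the data are missing is to add the flag only if the folder exists and is not empty. The following is a minimal sketch under that assumption; the function name genentech_args is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the --with_genentech flag only if the given
# directory exists and contains at least one entry; print nothing otherwise.
genentech_args() {
    local dir="$1"
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "--with_genentech"
    fi
}

# Show what would be appended to the nextflow run command.
flag="$(genentech_args "data/14_ici_treatment/Genentech")"
echo "Optional flag: ${flag:-<none>}"
```

The printed flag can then be appended to the nextflow run command for the downstream workflow, for example as $(genentech_args data/14_ici_treatment/Genentech).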
Depending on your HPC/cloud setup, you will need to adjust the nextflow profile in nextflow.config to tell nextflow how to submit the jobs. Using a withName:... directive, special resources may be assigned to GPU jobs. You can get an idea by checking out the icbi_lung profile, which we used to run the workflow on our on-premise cluster. Only the build_atlas workflow makes use of GPU processes.
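As a rough illustration of what such a profile could look like, the sketch below writes a small extra configuration file instead of editing nextflow.config. Everything specific in it is an assumption: the profile name my_cluster, the SLURM executor, the --gres=gpu:1 option and the process name HYPOTHETICAL_GPU_PROCESS are placeholders. The real GPU process names have to be taken from the workflow itself, for example from the icbi_lung profile.

```shell
#!/usr/bin/env bash
# Write a hypothetical nextflow profile to the path given as first argument.
# All names and values inside the profile are placeholders (see the text above).
write_profile() {
    cat > "$1" <<'EOF'
// Hypothetical profile: adjust executor, options and process names to your system.
profiles {
    my_cluster {
        process {
            // Assumption: jobs are submitted through SLURM.
            executor = 'slurm'
            // Placeholder process name; request one GPU for matching jobs.
            withName: 'HYPOTHETICAL_GPU_PROCESS' {
                clusterOptions = '--gres=gpu:1'
            }
        }
    }
}
EOF
}

write_profile "my_cluster.config"
echo "Wrote my_cluster.config"
```

Such a file can be passed to nextflow with the -c option and selected with -profile my_cluster.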
# newer versions of nextflow are incompatible with the workflow. By setting this variable
# the correct version will be used automatically.
export NXF_VER=22.04.5
# Run `build_atlas` workflow
nextflow run main.nf --workflow build_atlas -resume -profile <YOUR_PROFILE> \
--outdir "./data/20_build_atlas"
# Run `downstream_analyses` workflow
nextflow run main.nf --workflow downstream_analyses -resume -profile <YOUR_PROFILE> \
--build_atlas_dir "./data/20_build_atlas" \
--outdir "./data/30_downstream_analyses"
As you can see, the downstream_analyses workflow requires the output of the build_atlas workflow as input. The intermediate results from zenodo contain the output of the build_atlas workflow.
- analyses: place for e.g. jupyter/rmarkdown notebooks, grouped by their respective (sub-)workflows.
- bin: executable scripts called by the workflow.
- conf: nextflow configuration files for all processes.
- containers: place for singularity image files. Not part of the git repo; gets created by the download command.
- data: place for input data and results in different subfolders. Gets populated by the download commands and by running the workflows.
- lib: custom libraries and helper functions.
- modules: nextflow DSL2 modules.
- preprocessing: scripts used to preprocess data upstream of the nextflow workflows. The processed data are part of the archives on zenodo.
- subworkflows: nextflow subworkflows.
- tables: contains static content that should be under version control (e.g. manually created tables).
- workflows: the main nextflow workflows.
The build_atlas
workflow comprises the following steps:
- QC of the individual datasets based on detected genes, read counts and mitochondrial fractions
- Merging of all datasets into a single AnnData object. Harmonization of gene symbols.
- Annotation of two "seed" datasets as input for scANVI.
- Integration of datasets with scANVI
- Doublet removal with SOLO
- Annotation of cell types based on marker genes and unsupervised Leiden clustering.
- Integration of additional datasets with transfer learning using scArches.
The downstream_analyses workflow comprises the following steps:
- Patient stratification into immune phenotypes
- Subclustering and analysis of the neutrophil cluster
- Differential gene expression analysis using pseudobulk + DESeq2
- Differential analysis of transcription factors, cancer pathways and cytokine signalling using DoRothEA, PROGENy, and CytoSig.
- Copy number variation analysis using SCEVAN
- Cell-type composition analysis using scCODA
- Association of single cells with phenotypes from bulk RNA-seq datasets with Scissor
- Cell2cell communication based on differential gene expression and the CellphoneDB database.
For reproducibility issues or any other requests regarding single-cell data analysis, please use the issue tracker. For anything else, you can reach out to the corresponding author(s) as indicated in the manuscript.
We aimed to make this workflow reproducible by providing all input data, containerizing all software dependencies and integrating all analysis steps into a nextflow workflow. In theory, this allows the workflow to be executed on any system that can run nextflow and singularity. Unfortunately, some single-cell analysis algorithms (in particular scVI/scANVI and UMAP) yield slightly different results on different hardware, trading off computational reproducibility for a significantly faster runtime. In particular, results will differ when changing the number of cores, or when running on a CPU/GPU of a different architecture. See also scverse/scanpy#2014 for a discussion.
Since the cell-type annotation depends on clustering, the clustering depends on the neighborhood graph, and the neighborhood graph in turn depends on the scANVI embedding, running the build_atlas workflow on a different machine will likely break the cell-type labels.
Below is the hardware we used to execute the build_atlas workflow. Theoretically, any CPU/GPU of the same generation should produce identical results, but we have not had the chance to test this yet.
- Compute node CPU: Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz (2x)
- GPU node CPU: EPYC 7352 24-Core (2x)
- GPU node GPU: Nvidia Quadro RTX 8000 GPU
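To compare your own machine with the hardware listed above, the CPU and GPU models can be printed with standard tools. The following is a minimal sketch, assuming lscpu and nproc are present and, on GPU nodes, nvidia-smi; the function name describe_hardware is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print CPU model, GPU model and core count of this machine.
# Falls back to "unknown" if lscpu or nvidia-smi are not installed.
describe_hardware() {
    local cpu="" gpu="" cores=""
    if command -v lscpu >/dev/null 2>&1; then
        # Extract the "Model name" field from the lscpu output.
        cpu="$(lscpu | sed -n 's/^Model name:[[:space:]]*//p' | head -n 1)"
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Query only the GPU name, without a CSV header.
        gpu="$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -n 1)"
    fi
    if command -v nproc >/dev/null 2>&1; then
        cores="$(nproc)"
    fi
    echo "CPU model: ${cpu:-unknown}"
    echo "GPU model: ${gpu:-unknown}"
    echo "CPU cores available: ${cores:-unknown}"
}

describe_hardware
```

Since results also depend on the number of cores, the core count is reported alongside the model names.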