# Universal data analysis pipeline
Nextflow-based pipeline to run and deploy reproducible analyses. Alongside the pipeline I developed the toolbox reportsrender to execute notebooks; it can also be used independently of the pipeline.
- Render Jupyter or Rmarkdown notebooks (papermill/knitr)
- Ensure reproducible analyses
- Deploy reports to GitHub pages
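As a minimal illustration of rendering a notebook with reportsrender on the command line (the file names here are hypothetical examples; check the reportsrender documentation for the exact flags supported by your version):

```bash
# Render a notebook to a standalone HTML report,
# passing a parameter into the notebook via papermill.
reportsrender analyses/01_analysis.ipynb 01_analysis.html --params="input_file=data/samples.csv"
```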
- `analyses`: The actual analysis steps (e.g. Jupyter notebooks, Rmarkdown documents, bash scripts) go here.
- `bin`: Scripts that can be called from nextflow directly (nextflow adds them to the `PATH` for commands run from a `process`).
- `data`: Input data for the notebooks. I often replace this with a symlink to some data storage.
- `deploy`: Final reports. Filled by the `deploy` process, which copies all HTML reports to this directory and creates an index file. A great way to share the final reports is to push this directory to GitHub pages.
- `envs`: Conda environment files go here. Create one file per notebook, or re-use environments for multiple notebooks -- it's up to you.
- `lib`: Put custom libraries (e.g. Python modules) here.
- `results`: Final results generated by the pipeline go here. Concept: one can always delete the `results` directory and re-generate it from `data` using the pipeline.
- `tables`: Manually created input data that should be under version control, e.g. the list of samples and the associated patient data that you had to compile manually from three Excel sheets because the biologists encoded data as background color.
- `main.nf`: The nextflow workflow that ties everything together.
- `nextflow.config`: Contains configuration options for the pipeline (e.g. the output directory). You can also set options here to run the pipeline on an HPC cluster (e.g. with SGE or SLURM).
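As a sketch of the kind of settings that go into `nextflow.config` for running on a cluster (the queue name below is a placeholder; see the Nextflow executor documentation for the options your scheduler needs):

```groovy
// nextflow.config -- sketch of an HPC setup, not the file shipped in this repo
process {
    // submit all processes through SGE; use 'slurm' for a SLURM cluster
    executor = 'sge'
    queue = 'all.q'
}
```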
## How to run
- Install nextflow. In this case, we use conda; check the nextflow website for other options.
```bash
conda create -n nextflow -c conda-forge -c bioconda nextflow
conda activate nextflow
```
- Clone this repository
```bash
git clone git@github.com:grst/universal_analysis_pipeline.git
cd universal_analysis_pipeline
```
- Run the pipeline
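A standard invocation of the `main.nf` entry point described above would look like this (the `-resume` flag is optional; it re-uses cached results from previous runs):

```bash
nextflow run main.nf -resume
```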
- Share the results. You can zip and email the `deploy` folder. Even better, share the results using GitHub pages:
- To set up GitHub pages, init a repository in the `deploy` folder and push to the `gh-pages` branch:
```bash
cd deploy
git init
git remote add origin <YOUR_REMOTE>
git checkout --orphan gh-pages
git add -A .
git commit -m "Initial deploy on gh-pages"
git push -u origin gh-pages
```
It can take a few minutes, but eventually your reports will be available at `https://<USERNAME>.github.io/<REPOSITORY>`.
You might want to "password protect" your pages. This is not natively supported by GitHub pages, but a workaround is to put all files in a cryptic subfolder, e.g. `rBymGubVBBrdHtGo6Of35E3uI`. As GitHub pages doesn't list directories, one needs to know the precise URL to access the folder. You can adjust the deploy directory in `nextflow.config`.
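One way to wire this into the pipeline configuration is via a parameter; the parameter name below is a hypothetical illustration, not necessarily the one used in this repository:

```groovy
// nextflow.config -- hypothetical parameter name for the deploy target
params.deployDir = "deploy/rBymGubVBBrdHtGo6Of35E3uI"
```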
## How to use
This repository is meant as a template. You can fork/clone this repository and expand from there. At a minimum, you have to change two things:
- Add your notebooks to the `analyses` folder.
- Adapt `main.nf` to wire your notebooks together the right way. You can use reportsrender to execute the notebooks.
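A minimal sketch of such wiring in DSL2 Nextflow follows; the process name, channel glob, and environment file name are illustrative assumptions, not the ones used in this repository:

```nextflow
nextflow.enable.dsl = 2

// Hypothetical process: renders one notebook to HTML with reportsrender.
// The conda directive picks up an environment file from envs/.
process RENDER_NOTEBOOK {
    conda "envs/analysis.yml"
    publishDir "results", mode: 'copy'

    input:
    path notebook

    output:
    path "${notebook.baseName}.html"

    script:
    """
    reportsrender ${notebook} ${notebook.baseName}.html
    """
}

workflow {
    RENDER_NOTEBOOK(Channel.fromPath("analyses/*.ipynb"))
}
```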
## Ideas for the future
- convert conda envs to singularity containers to ensure reproducibility.