---
marp: true
theme: default
---
Nipype and friends
Questions: https://neurostars.org/
<style> section { padding-top: 150px; } h1 { position: absolute; left: 80px; top: 30px; right: 80px; height: 70px; line-height: 70px; padding-left: 10px; background: rgb(205,105,0,0.2); } </style>
- People: 190+ contributors to code, countless bug reports and questions.
- Open source projects that we depend on.
- The Originals: Gorgolewski, Krzysztof J.; Burns, Christopher; Ziegler, Erik; Madison, Cindee; Waskom, Michael Philipp; Clark, Dav; Halchenko, Yaroslav O.; Johnson, Hans (the Nipy workshop at UC Berkeley, 2009).
- Engineering core: Esteban, Oscar; Markiewicz, Christopher J.; Goncalves, Mathias; Jarecka, Dorota; Notter, Michael.
- Many derived packages: CPAC, Mindboggle, Clinica, ...
- Supporting Labs: Damian Fair, Cameron Craddock, Michael Milham, Arno Klein, Russ Poldrack, Daniel Margulies, Hans Johnson, John Gabrieli, Michael Hanke, ...
- Funding: NIH (R24MH117295 - DANDI, R01EB020740 - Nipype, P41EB019936 - ReproNim, R03EB008673 - Nipype, R01MH081909 - Nipy, EB005149, S10RR023392, R01NS040068, UL1TR000442, ...), INCF
- Why are workflow systems essential for many scientific analyses?
- Discuss some common concepts across workflow technologies.
- The Nipype ecosystem.
- Some things you can do with the Nipype ecosystem.
- How you can help improve scientific workflows.
- Designing robust, reproducible, and usable workflows.
- Pointers to related tools and resources
- Fundamentals of MRI analysis or how to analyze data in your specific neuroscientific domain
- What software tools (e.g., terminal-based, script-based, Web-based, packages, libraries, workflows) exist to solve your problem
- The full potential of any of the tools we highlight here
- How to use your own laptop, HPC cluster, or the cloud to do analyses
- How to manage your data (e.g., Datalad) and computational environment (e.g., Neurodocker)
<style> .myc { padding-top: 430px; } </style>
Workflow, n. A set of tasks needed to achieve one or more goals.
Examples of generic workflows:
- Purchase a car
- A wedding
- Cook a meal
- Construct a house
- Fly to New Zealand
In many of these *Workflows*, a sequence of tasks has to be executed. Hence, the word *Pipeline* is often used synonymously with *Workflow*. The word *Pipeline* originates from industrial automation.
Dataflow, n. A set of tasks that consume, transform, and/or generate data towards achieving one or more goals. Specifically, tasks can get started whenever all the necessary input data is available for the task.
Examples of dataflows:
- Analyze tweets
- Build a machine learning model
- Do data wrangling and quality control
- Run a neuroimaging analysis
In general, dataflows can be represented as computational graphs, in which data flows from one node to another.
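For instance, a toy dataflow (not tied to any particular framework; the task names are made up) can be held as a small dependency graph and run whenever a task's inputs become available:

```python
# Hypothetical sketch: a dataflow as a dependency graph of tasks.
# A task runs as soon as the results of all of its dependencies exist.
tasks = {
    "load":      {"deps": [],        "run": lambda: "raw data"},
    "clean":     {"deps": ["load"],  "run": lambda raw: raw + " (cleaned)"},
    "summarize": {"deps": ["clean"], "run": lambda cleaned: f"summary of {cleaned}"},
}

results = {}
while len(results) < len(tasks):
    for name, task in tasks.items():
        if name not in results and all(d in results for d in task["deps"]):
            results[name] = task["run"](*(results[d] for d in task["deps"]))

print(results["summarize"])  # -> summary of raw data (cleaned)
```

Real workflow systems add scheduling, caching, and parallelism on top of this basic idea.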
- Separation: data, scripts, and execution are decoupled.
- Dataflows are not intricately tied to a particular data set.
- Reuse: Algorithms or dataflows written using such abstractions can be reused on different datasets.
- Automation: Dataflows do not require human intervention, allowing automated execution.
- Standardization: The same dataflow can be applied to similar data, which itself encourages standardization.
- Data management: Most dataflow frameworks rely on language abstractions to support the flow of data, often without requiring the user to name files at different stages of a dataflow.
- Most neuroscience analyses comprise multiple steps that depend on prior steps, i.e., they form a graph.
- e.g., Neuroimaging analyses may involve preprocessing, quality control, normalization, and statistical inference.
- Many software packages implement many of these algorithms.
- Performance: These algorithms vary in their execution time and output quality as a function of:
- sample characteristics (e.g., age, species, in vs ex vivo)
- data quality
- computational environment.
- Matchmaking: For any given application, each software package brings a set of strengths and weaknesses.
- Abstraction
- Encapsulate different functional tasks
- Simplify the assumptions any individual task needs to consider
- Efficiency
- Parallelization of processes
- Reduced overhead of data management
- Replicability
- Embed knowledge
- Best practices
- Heuristics
- A structured plan for analysis (good for preregistrations)
- What are your goals/use cases?
- What are your computational dependencies?
- How are you managing the data?
- How do you parameterize the script?
- How specific is the code to one situation?
- Will you share and support your code?
- What computational resources do you have access to?
Workflow systems provide computational flexibility, but they come at a steep cost.
- They can increase the complexity and brittleness of your computational environments.
- Additional learning is necessary to combine software packages.
- You cannot just point and click; you need to script and program your analyses.
- Debugging is not always easy.
But there are benefits as well.
- You can reuse existing Workflows.
- You can combine the most appropriate algorithms for the goals of the task (e.g., fast, accurate, precise, robust) rather than being restricted to what is available in a single package.
- Once you know how to construct a Dataflow, you can create others.
- Workflow systems: Nipype, Pydra, Snakemake, Nextflow, ...
- Features to consider
- Workflow specification language
- Nested workflow support
- Workflow/Task library, reusability
- Caching
- Execution support: Parallelization, Managers, Containers
- Provenance tracking
- Workflow languages: Common Workflow Language, Workflow Description Language, Nextflow DSL, ...
- Bring the world of neuroimaging tools together
- What is out there?
- How to use?
- Which ones to use?*
- Run analyses
- Combine computational resources
- Compare tools
- Combine the "best" tools
- Does the combination help?*
* Nipype can help answer this, but doesn't do so directly.
- Pythonic Interfaces to over 700 neuroimaging tools
- Including support for MATLAB-based tools like SPM
- A generic workflow engine with special semantics.
- Extensive support for local and HPC workflows.
- Local resource management across parallel tasks.
- Remote parallel HPC distribution with monitoring.
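As a hedged sketch of what these interfaces look like (the input filename is made up, and FSL must be installed), running FSL's BET through Nipype:

```python
# Minimal sketch: running FSL's BET through its Nipype interface.
# Assumes FSL is installed and on $PATH; the input file is hypothetical.
from nipype.interfaces.fsl import BET

bet = BET()
bet.inputs.in_file = "sub-01_T1w.nii.gz"   # hypothetical input image
bet.inputs.frac = 0.5                       # fractional intensity threshold
print(bet.cmdline)                          # the shell command Nipype will run
result = bet.run()
print(result.outputs.out_file)              # path to the skull-stripped image
```

The same pattern applies to interfaces wrapping SPM, ANTs, FreeSurfer, AFNI, and other packages.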
ASLPrep | Clinica | C-PAC | FitLins | fMRIPrep | Giraffe.tools | Halfpipe | Lyman | PyNets | Macapype | Mindboggle | MRIQC | Neuropycon/Ephypype/Graphpype | Nipreps | QSIPrep
- Nipype does not create workflows for you.
- Nipype does not optimize workflows for you.
- It can optimize some of the execution.
- Nipype allows you to create scalable, complex workflows.
- Nipype allows you to mix and match software with the same Pythonic interface.
- To use Nipype workflows you need to know minimal Python and shell.
- To create Nipype workflows you need to know: Python, Nipype semantics, and at least 1 neuroimaging package.
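As a minimal sketch of those semantics (the filenames are illustrative, and both ANTs and FSL must be installed), a workflow mixing interfaces from two packages might look like this:

```python
# Sketch: mixing ANTs and FSL interfaces in one Nipype workflow.
# File names are illustrative; ANTs and FSL must both be installed.
from nipype import Node, Workflow
from nipype.interfaces.ants import N4BiasFieldCorrection
from nipype.interfaces.fsl import BET

biascorr = Node(N4BiasFieldCorrection(), name="biascorr")
biascorr.inputs.input_image = "sub-01_T1w.nii.gz"

skullstrip = Node(BET(frac=0.5), name="skullstrip")

wf = Workflow(name="anat_prep", base_dir="work")   # cached working directory
wf.connect(biascorr, "output_image", skullstrip, "in_file")
wf.run(plugin="MultiProc", plugin_args={"n_procs": 2})  # local parallel execution
```

Connections pass outputs to inputs by name, so intermediate files live in the workflow's working directory rather than being named by hand.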
- Nipype 1.x is the current stable platform.
- Nipype 2.0 is a new ecosystem of tools.
- pydra: A general purpose workflow engine
- pydra-ml: A demo application
- pydra-tasks: Packages that provide Pydra tasks
- neurodocker: A neuroscience container builder
- testkraken: A parametric/vibration testing framework
- nipreps: Preprocessing workflows
- niflows: A general purpose Dataflow repository
- nobrainer: Deep learning models
- Composable dataflows.
- Flexible semantics for looping over input sets.
- A content-addressable global cache.
- Support for Python functions and external (shell) commands.
- Native container execution support.
- Auditing and provenance tracking.
- Pydra paper
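As a hedged sketch of these features, using the Pydra 0.x API (task and input names are illustrative):

```python
# Sketch of a Pydra dataflow (0.x API): a Python task mapped over inputs.
import pydra

@pydra.mark.task
def double(x: int) -> int:
    return 2 * x

wf = pydra.Workflow(name="demo", input_spec=["x"])
wf.split("x", x=[1, 2, 3])                    # run the dataflow once per value of x
wf.add(double(name="double", x=wf.lzin.x))    # lazy connection to the workflow input
wf.set_output([("doubled", wf.double.lzout.out)])

with pydra.Submitter(plugin="cf") as sub:     # concurrent.futures-based execution
    sub(wf)

print(wf.result())                            # one result per split value
```

The `split` call maps the dataflow over each value of `x`, and results land in a hash-addressed cache directory so repeated runs can be reused.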
For attendees during Neurohackademy:
- The Nipype tutorial: https://miykael.github.io/nipype_tutorial
- The Pydra tutorial: https://github.com/nipype/pydra-tutorial
- How to parallelize?
- Atomic
- Per participant, per subworkflow
- Database + resource driven
- Cost driven
- Which packages to use?
- Availability (re-executability by others)
- Licensing
- Complexity of maintenance
- Optimization goals
- How replicable do you want it to be?
- Designing good Dataflows
- Validating Dataflows