diff --git a/docs/assets/img/seqera-container-python-00.png b/docs/assets/img/seqera-container-python-00.png new file mode 100644 index 000000000..edc329f43 Binary files /dev/null and b/docs/assets/img/seqera-container-python-00.png differ diff --git a/docs/index.md b/docs/index.md index 21cc4ba42..d10668065 100644 --- a/docs/index.md +++ b/docs/index.md @@ -64,6 +64,18 @@ These are foundational, domain-agnostic courses intended for those who are compl [Start the Nextflow Run training :material-arrow-right:](nextflow_run/index.md){ .md-button .md-button--primary } +!!! exercise "Small Nextflow" + + !!! tip inline end "" + + :material-cat:{.nextflow-primary} Build a complete workflow from scratch. + + This is a hands-on workshop where you build a real-world image classification workflow from the ground up. You'll learn Nextflow fundamentals by creating channels, defining processes, working with operators, and making your workflow reproducible and portable. Perfect for learners who prefer building something concrete while learning core concepts. + + The course is calibrated to take a half day to cover in group trainings. + + [Start the Small Nextflow workshop :material-arrow-right:](small_nextflow/index.md){ .md-button .md-button--primary } + !!! exercise "Hello nf-core" !!! tip inline end "" diff --git a/docs/small_nextflow/00_orientation.md b/docs/small_nextflow/00_orientation.md new file mode 100644 index 000000000..a7dbc3c39 --- /dev/null +++ b/docs/small_nextflow/00_orientation.md @@ -0,0 +1,75 @@ +# Orientation + +This orientation assumes you have already opened the training environment by clicking on the "Open in GitHub Codespaces" button. +If not, please do so now, ideally in a second browser window or tab so you can refer back to these instructions. 
+
+[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/nextflow-io/training/tree/smol-nextflow?quickstart=1&ref=master)
+
+## GitHub Codespaces
+
+The GitHub Codespaces environment contains all the software, code and data necessary to work through this training course, so you don't need to install anything yourself.
+However, you do need a (free) GitHub account to log in, and you should take a few minutes to familiarize yourself with the interface.
+
+If you have not yet done so, please go through the [Environment Setup](../../envsetup/) mini-course before going any further.
+
+## Working directory
+
+Throughout this training course, we'll be working in the `small_nextflow/` directory.
+
+Change directory now by running this command in the terminal:
+
+```bash
+cd small_nextflow/
+```
+
+!!! tip
+
+    If for whatever reason you move out of this directory, you can always use the full path to return to it, assuming you're running this within the GitHub Codespaces training environment:
+
+    ```bash
+    cd /workspaces/training/small_nextflow
+    ```
+
+Now let's have a look at the contents of this directory.
+
+## Materials provided
+
+You can explore the contents of this directory by using the file explorer on the left-hand side of the training workspace.
+Alternatively, you can use the `tree` command.
+
+Here we list all the contents of the directory, including hidden files:
+
+```bash
+tree -a .
+```
+
+If you run this inside `small_nextflow`, you should see a minimal directory structure:
+
+```console title="Directory contents"
+.
+└── .stuff
+    ├── cat_me.sh
+    ├── classify.py
+    └── pyproject.toml
+```
+
+**Here's a summary of what you should know to get started:**
+
+- **The `.stuff/` directory** contains helper scripts and configuration files we'll use throughout the workshop.
+  You can think of this as a toolbox we'll pull from as we build our workflow.
+
+- **The `cat_me.sh` script** fetches random cat images from an API for our workflow to process.
+ +- **The `classify.py` script** is a Python program that uses machine learning to classify images. + +- **The `pyproject.toml` file** describes the Python dependencies needed for the classification script. + +Throughout this workshop, we'll start with this minimal setup and progressively build a complete image classification workflow. + +Let's get started by creating a fresh, empty `main.nf`: + +```bash +code main.nf +``` + +**Now, to begin the course, click on the arrow in the bottom right corner of this page.** diff --git a/docs/small_nextflow/01_fundamentals.md b/docs/small_nextflow/01_fundamentals.md new file mode 100644 index 000000000..82a0af584 --- /dev/null +++ b/docs/small_nextflow/01_fundamentals.md @@ -0,0 +1,527 @@ +# Part 1: Fundamentals + +In this first part, we'll learn the building blocks of Nextflow by creating channels, defining our first process, and working with parameters and metadata. +We're going to build a workflow that produces a gallery of classified cat images, starting from the very beginning. + +--- + +## Introduction + +Nextflow offers you a way to iterate over a collection of files, so let's grab some files to iterate over. +We're going to write a workflow which produces a gallery of good and bad cats. +First things we're going to need are some cats. + +We're starting with a (nearly) empty directory. +There is a `.stuff` directory that contains some bits and pieces to help us along during the workshop, but you can imagine that we're essentially starting from scratch. + +### Fetch some cat images + +The first thing we're going to need is some data. +I've created a small script that pulls from the Cat-As-A-Service API to give us some random cats. + +Let's see what the script can do: + +```bash +.stuff/cat_me.sh --help +``` + +To start, let's grab 4 cats. +By default, the script will save the images to `./data/pics`: + +```bash +.stuff/cat_me.sh --count 4 --prefix data/pics +``` + +This will generate some example data. 
+It will look something like this: + +```console title="Directory structure" +data +└── pics + ├── 5n4MTAC6ld0bVeCe.jpg + ├── 5n4MTAC6ld0bVeCe.txt + ├── IOSNx33kkgkiPfaP.jpg + ├── IOSNx33kkgkiPfaP.txt + ├── IRuDgdPJZFA39dyf.jpg + ├── IRuDgdPJZFA39dyf.txt + ├── uq5KqqiF0qpgTQVA.jpg + └── uq5KqqiF0qpgTQVA.txt +``` + +### Create a channel of images + +Now let's iterate over those images in Nextflow. +To start, we'll just create a channel of those images. +We're not going to do anything with them, but to make sure that everything is working, we connect the channel to the `view` operator which takes the things in the channel (files in our case) and prints a String representation of those things to the command line. + +Open `main.nf` and add the following: + +```groovy title="main.nf" linenums="1" +#!/usr/bin/env nextflow + +workflow { + channel.fromPath("data/pics/*.{png,gif,jpg}") + | view +} +``` + +Now run the workflow: + +```bash +nextflow run main.nf +``` + +You should see output showing the paths to your cat images. + +!!! tip "Glob patterns" + + The `{png,gif,jpg}` syntax is called brace expansion. + It's a glob pattern that matches any file ending in `.png`, `.gif`, or `.jpg`. + This is equivalent to writing three separate patterns: `*.png`, `*.gif`, and `*.jpg`. + Using brace expansion keeps our code concise when we need to match multiple file extensions. + +### Takeaway + +You now know how to create a channel from files using glob patterns and view its contents. + +### What's next? + +Let's actually do something with those images by creating our first process. + +--- + +## Channels and processes + +Let's try actually doing something with those images. +We'll start with something simple - resizing images. +We'll need to download some software to do so. 
+Eventually, we'll talk about containers and reproducibility, but just to start, let's download the software to our machine: + +```bash +sudo apt update +sudo apt-get install -y imagemagick +``` + +### Understanding process structure + +We'll create a new "process" for our resize operation. +You can think of these processes as templates, or classes. +We'll connect the process to a channel and fire off a new "task" for each thing in the channel. + +In our Resize process definition (indeed in most process definitions), there are three blocks - `input`, `output`, and `script`. +We connect these processes by channels and these blocks describe what we expect to get, what we expect to emit, and the work we want to do in-between. + +### Create the Resize process + +Update your `main.nf` to add a Resize process: + +```groovy title="main.nf" linenums="1" +#!/usr/bin/env nextflow + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + + images + | Resize + | view +} + +process Resize { + input: path(img) + output: path("resized-*") + script: "convert ${img} -resize 400x resized-${img.baseName}.png" +} +``` + +### Understanding the input block + +The input block describes what we expect to take from the input channel. +The "things" in the channel can have a type. +The most common "types" are: + +- `val` (values like Strings, integers, Maps, complex objects), +- `path` (paths like directories or files), or +- `tuple` (a collection of values and paths). + +We'll see more of those later, but for the moment, the channel we created in the `workflow` block is a channel of files, so in our process definition, we'll say "I'm going to supply a channel of paths to this process, and as our process takes things from that channel to spawn a new task, we'll call the thing in the channel `img`." + +### Understanding the output block + +The output block describes what we want to emit into the process' output channel. 
+Again, we can describe the "type" of thing emitted - `val`, `path`, `tuple` and others.
+For now, we'll promise to produce a file (or directory) that matches the glob pattern `resized-*`.
+
+### Understanding the script block
+
+The script block describes what work we want to do on each of the things in the channel - how we're going to transform each of the things we pull from the input channel into the files or values we promised to emit into the output channel.
+
+By default, the script block will be rendered into a bash script, but you can use any interpreted language that makes sense to you - Python, Ruby, R, Zsh, Clojure, whatever.
+In this introductory workshop, we'll stick with the default bash.
+
+We run the `convert` command from ImageMagick, which performs many types of image manipulation.
+In our case, we'll use the `-resize` argument to resize the image to a width of 400 pixels.
+We also supply an output filename.
+You'll notice that we use the `${img}` variable twice in our script block.
+This `${img}` is the variable we defined in the input block.
+For each iteration of our process (each task), the variable will be the path to our individual image.
+
+For example, if the "thing" in the channel is the image `kitten.jpg`, then when Nextflow creates a new Resize task for this file, it will "render" our script block into bash, replacing the `${img}` variables with the path to produce this valid bash:
+
+```bash
+convert kitten.jpg -resize 400x resized-kitten.png
+```
+
+### Run the workflow
+
+Now let's run our workflow!
+We'll iterate over all of the images in `data/pics` (relative to our current location) and produce a channel of resized pictures that we then pipe into the `view` operator to print the channel contents to stdout.
+
+```bash
+nextflow run main.nf
+```
+
+### Takeaway
+
+You now understand the three-part structure of a Nextflow process: input, output, and script blocks work together to transform data flowing through channels.
+
+### What's next?
+ +Let's explore how Nextflow executes each task in isolation. + +--- + +## Investigate task execution + +Every task in Nextflow is executed in its own unique work directory. +This directory isolation is a fundamental feature that ensures tasks cannot interfere with each other, even when running in parallel. + +### Understanding work directories + +The work directory path is calculated by constructing a hash of all task inputs. +This means that if you run the same task with the same inputs, Nextflow will recognize it and can reuse the cached results (we'll explore this with `-resume` later). + +Let's explore where Nextflow actually ran our tasks. +Look at the output from your last workflow run - you'll see something like: + +```console +executor > local (4) +[a0/e7b2d4] Resize (1) | 4 of 4 ✔ +``` + +That `a0/e7b2d4` is the hash prefix for the task directory. +Let's explore what's inside: + +```bash +tree work +``` + +You'll see a directory structure like: + +```console +work +└── a0 + └── e7b2d4a1f3c8e9b0a7f6d5c4b3a2e1f0 + ├── resized-5n4MTAC6ld0bVeCe.png + └── ... +``` + +### Exploring task files + +Each work directory contains several hidden files that Nextflow uses to track task execution. 
+Let's see them all:
+
+```bash
+tree -a work
+```
+
+Now you'll see additional files:
+
+```console
+work
+└── a0
+    └── e7b2d4a1f3c8e9b0a7f6d5c4b3a2e1f0
+        ├── .command.begin
+        ├── .command.err
+        ├── .command.log
+        ├── .command.out
+        ├── .command.run
+        ├── .command.sh
+        ├── .exitcode
+        └── resized-5n4MTAC6ld0bVeCe.png
+```
+
+These files serve different purposes:
+
+- `.command.sh` - The actual script that was executed (with all variables resolved)
+- `.command.run` - The wrapper script that Nextflow uses to execute the task
+- `.command.out` - Standard output from the task
+- `.command.err` - Standard error from the task
+- `.command.log` - Combined stdout and stderr
+- `.exitcode` - The exit code from the task (0 = success)
+- `.command.begin` - Setup instructions run before the task
+
+The most useful file for debugging is `.command.sh`.
+Let's look at one:
+
+```bash
+cat work/a0/*/.command.sh
+```
+
+You'll see the actual bash script that was executed with all Nextflow variables resolved:
+
+```bash
+convert 5n4MTAC6ld0bVeCe.jpg -resize 400x resized-5n4MTAC6ld0bVeCe.png
+```
+
+### Task isolation and idempotence
+
+This isolation serves two critical purposes:
+
+1. **Independence**: Tasks running in parallel cannot accidentally overwrite each other's files or interfere with each other's execution
+2. **Idempotence**: Running the same task with the same inputs will produce the same outputs in the same location
+
+**Idempotence** means that executing a task multiple times with identical inputs produces identical results.
+This is crucial for reproducibility and for Nextflow's caching system.
+Because the work directory is determined by hashing the inputs, identical inputs always map to the same directory, allowing Nextflow to detect when work can be reused.
+
+### Takeaway
+
+Each task runs in its own isolated directory to prevent interference between parallel executions.
+
+### What's next?
+
+Let's learn some convenient methods for working with file paths.
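Before we move on, the input-hashing idea behind these work directories can be sketched in plain shell. This is only an illustrative analogy with made-up inputs - Nextflow's real task hash also covers the script, the container, and more - but it shows why identical inputs always land in the same directory:

```shell
# Hypothetical sketch: derive a work directory name by hashing task inputs,
# the way Nextflow maps identical inputs to the same cached directory.
inputs="kitten.jpg 400"

hash=$(printf '%s' "$inputs" | sha256sum | cut -c1-32)
prefix=$(printf '%s' "$hash" | cut -c1-2)
rest=$(printf '%s' "$hash" | cut -c3-32)
echo "work/$prefix/$rest"

# Hashing the same inputs a second time yields the same directory name,
# which is what makes caching and -resume possible.
hash2=$(printf '%s' "$inputs" | sha256sum | cut -c1-32)
[ "$hash" = "$hash2" ] && echo "identical inputs -> identical work dir"
```

Change anything about the inputs and the hash (and therefore the directory) changes, so stale results can never be mistaken for fresh ones.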
+
+---
+
+## Harmonization
+
+One of the nice features of the `convert` utility is that it will also do file format conversion for us.
+It will infer the format from the extension of the final argument.
+For example, if we execute:
+
+```bash
+convert kitten.jpg -resize 400x resized-kitten.png
+```
+
+The `convert` utility will both resize the image and convert it from jpg to png format.
+Let's say that downstream in our workflow, we want all images to be in png format.
+Our Resize script already takes advantage of this by writing its output files with a `.png` extension.
+But how do we pull out the file basename so that we can replace the original extension with `.png`?
+
+### The bash way (harder)
+
+If you're a bash wizard, you might know that if you have a variable `$file` with the path to our file, you can replace the extension with this arcane incantation:
+
+```bash
+file=kitten.jpg
+convert "$file" -resize 400x "${file%.*}.png"
+```
+
+Or perhaps you use the `basename` utility:
+
+```bash
+convert "$file" -resize 400x "$(basename "$file" .${file##*.}).png"
+```
+
+I love bash, but it's easy to forget this syntax or mistype it.
+
+### The Nextflow way (easier)
+
+Fortunately for us, inside the script block the `img` variable is not a bash variable - it's a Nextflow variable, and Nextflow provides some convenience methods for operating on those path objects.
+The full list is available in the [Nextflow stdlib documentation](https://www.nextflow.io/docs/latest/reference/stdlib-types.html#stdlib-types-path), but one handy method is `baseName`.
+
+We can simply call `${img.baseName}` to return the file base name.
+This is exactly what our `script` block already does:
+
+```groovy title="main.nf" hl_lines="14" linenums="1"
+#!/usr/bin/env nextflow
+
+workflow {
+    images = channel.fromPath("data/pics/*.{png,gif,jpg}")
+
+    images
+    | Resize
+    | view
+}
+
+process Resize {
+    input: path(img)
+    output: path("resized-*")
+    script: "convert ${img} -resize 400x resized-${img.baseName}.png"
+}
+```
+
+Run the workflow again:
+
+```bash
+nextflow run main.nf
+```
+
+### Takeaway
+
+Nextflow path objects provide convenient methods like `baseName` that make file manipulation easier than bash string operations.
+
+### What's next?
+
+Let's make our workflow more flexible by adding parameters.
+
+---
+
+## Parameters
+
+What if we want to make our workflow a little more flexible?
+Let's pull out the width and expose it as a parameter to the user.
+
+### Using parameters directly in the script
+
+We could reference a parameter directly in the script block:
+
+```groovy title="main.nf" hl_lines="14" linenums="1"
+#!/usr/bin/env nextflow
+
+workflow {
+    images = channel.fromPath("data/pics/*.{png,gif,jpg}")
+
+    images
+    | Resize
+    | view
+}
+
+process Resize {
+    input: path(img)
+    output: path("resized-*")
+    script: "convert $img -resize ${params.width}x resized-${img.baseName}.png"
+}
+```
+
+Now we can run with:
+
+```bash
+nextflow run main.nf --width 300
+```
+
+### Making inputs explicit (best practice)
+
+This works, but it's considered best practice (and we'll see why in a bit) to make the inputs to a process explicit.
+We can do this by adding a second input channel:
+
+```groovy title="main.nf" hl_lines="6 11-13 15" linenums="1"
+#!/usr/bin/env nextflow
+
+workflow {
+    images = channel.fromPath("data/pics/*.{png,gif,jpg}")
+
+    Resize(images, params.width)
+    | view
+}
+
+process Resize {
+    input:
+    path(img)
+    val(width)
+    output: path("resized-*")
+    script: "convert $img -resize ${width}x resized-${img.baseName}.png"
+}
+```
+
+The params object still works in the same way:
+
+```bash
+nextflow run main.nf --width 500
+```
+
+### Takeaway
+
+Exposing parameters as explicit process inputs makes workflows more flexible and clearer about their dependencies.
+
+### What's next?
+
+Let's learn how to attach metadata to our files using tuples and maps.
+
+---
+
+## Extracting an ID
+
+Great, but I'd like a way of retaining the original IDs.
+
+### Understanding the map operator
+
+The `map` operator is one of the most powerful tools in Nextflow.
+It takes a collection of items in a channel and transforms them into a new collection of items.
+The transformation is defined by a closure - a small piece of code that is evaluated "later" - during workflow execution.
+Each item in the new channel is the result of applying the closure to the corresponding item in the original channel.
+
+A closure is written as `{ input -> }`: to the left of the "stabby operator" `->`, you define the variable used to refer to the closure input, and to the right you write an expression or series of expressions.
+The last expression will be the return value of the closure.
+For `map`, the items in the resulting output channel are the collection of values returned by each invocation of the closure.
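Before writing it in Nextflow, it can help to see the shape of the transformation we're about to perform in familiar shell terms. This is a hypothetical sketch with stand-in file names, not part of the workflow:

```shell
# Sketch of what map { img -> [img.baseName, img] } will do:
# for each file, emit a pair of (id derived from the basename, path).
mkdir -p /tmp/demo_pics
touch /tmp/demo_pics/abc.jpg /tmp/demo_pics/xyz.jpg

for img in /tmp/demo_pics/*.jpg; do
    id=$(basename "$img" .jpg)
    echo "[$id, $img]"
done
```

Each printed line pairs an ID with its file, which is exactly the two-element tuple structure `map` will produce in the channel.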
+ +Let's use `map` to extract the ID from each filename: + +```groovy title="main.nf" hl_lines="5-6" linenums="1" +#!/usr/bin/env nextflow + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [img.baseName, img] } + | view +} +``` + +Run this to see the structure: + +```bash +nextflow run main.nf +``` + +### Using a metadata map + +Better - we have the id extracted as a String. +What if we want to add other metadata later? +Let's turn it into a Map: + +```groovy title="main.nf" hl_lines="5" linenums="1" +#!/usr/bin/env nextflow + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + | view +} +``` + +### Update the process to handle tuples + +Now we've changed the "shape" of the items in the channel, so we'll update the downstream process: + +```groovy title="main.nf" hl_lines="8 13" linenums="1" +#!/usr/bin/env nextflow + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + Resize(images, params.width) + | view +} + +process Resize { + input: + tuple val(meta), path(img) + val(width) + output: + tuple val(meta), path("resized-*") + script: "convert $img -resize ${width}x resized-${img.baseName}.png" +} +``` + +Run the workflow and view the output: + +```bash +nextflow run main.nf +``` + +### Takeaway + +Using tuples with metadata maps allows you to carry important information alongside your files as they flow through the workflow. + +### What's next? + +Now that we understand the fundamentals, let's build a more complex workflow with classification and grouping operations. 
diff --git a/docs/small_nextflow/02_data_transformation.md b/docs/small_nextflow/02_data_transformation.md
new file mode 100644
index 000000000..201946a4f
--- /dev/null
+++ b/docs/small_nextflow/02_data_transformation.md
@@ -0,0 +1,515 @@
+# Part 2: Data Transformation & Analysis
+
+In this part, we'll build a multi-step workflow that classifies images using machine learning, manages computational resources, and groups results intelligently.
+
+---
+
+## Classification
+
+Let's get to the fun part - the cat sorting!
+We have a little classification script - `classify.py` - that I've provided in the `.stuff` directory.
+In your research, you'll sometimes have small accessory scripts that are useful for your pipelines.
+We're using a Python script here in this workshop example, but this pattern will hold for scripts written in Perl, Ruby, R, ClojureScript, or any of the other interpreted languages.
+
+### Set up the classification script
+
+Let's pull the file out into a new `bin` directory:
+
+```bash
+mkdir -p bin
+cp .stuff/classify.py bin/
+```
+
+The script requires some dependencies.
+Again, we'll do this the slow/painful way one time before we demonstrate how to use containers to encapsulate the software dependencies.
+
+We'll grab one more file from our `.stuff` directory - a `pyproject.toml` file, which is a way of describing software dependencies for Python projects.
+This is unrelated to Nextflow, but it's an example of one of the (many) ways in which different languages and frameworks might install software.
+
+You can install the dependencies and activate the environment with:
+
+```bash
+cp .stuff/pyproject.toml .
+uv sync
+source .venv/bin/activate
+```
+
+Now you can run the script:
+
+```bash
+bin/classify.py --help
+```
+
+```console title="Output"
+usage: classify.py [-h] [--model-path MODEL_PATH] [--labels LABELS [LABELS ...]] [--json] image
+
+Classify a single image using MetaCLIP
+
+positional arguments:
+  image                 Path to the image file to classify
+
+options:
+  -h, --help            show this help message and exit
+  --model-path MODEL_PATH
+                        Path to MetaCLIP model weights (default: data/models/b32_400m.pt)
+  --labels LABELS [LABELS ...]
+                        Labels for classification (default: ["good cat", "bad cat"])
+  --json                Output result as JSON to stdout
+  --architecture {ViT-B-32-quickgelu,ViT-B-16-quickgelu,ViT-L-14-quickgelu,ViT-H-14-quickgelu}
+                        Model architecture (auto-detected from filename if not specified)
+```
+
+### Download the classification model
+
+The script takes images, a model, and a set of labels, and classifies each of the images according to the labels.
+To run the script outside of Nextflow, we'll need to download one of the models.
+Do so with:
+
+```bash
+mkdir -p data/models
+(cd data/models && wget https://dl.fbaipublicfiles.com/MMPT/metaclip/b32_400m.pt)
+```
+
+### Create the Classify process
+
+Now let's create a `Classify` process that will take two channels - one channel of images and one channel that supplies the model:
+
+```groovy title="Process definition" linenums="1"
+process Classify {
+    input:
+    tuple val(meta), path(img)
+    path(model)
+    output: tuple val(meta), stdout
+    script: "classify.py --model-path $model ${img}"
+}
+```
+
+Note here that we're calling the `classify.py` script directly, even though we can't do that from the command line (we had to provide the relative or absolute path).
+This is because Nextflow automatically adds the `bin` directory (relative to the `main.nf`) to the `$PATH` for all Nextflow tasks.
+This is a very convenient way to bundle accessory scripts and snippets with your workflow.
+
+### Understanding queue vs. 
value channels + +Processes can have multiple channels as input or as output. +A process will continue to emit tasks as long as it can pull an item from each of the input channels. +We could create a new channel for the model, and define a sensible default: + +```groovy title="Workflow with model channel" linenums="1" +params.model = "${projectDir}/data/models/b32_400m.pt" + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + model_channel = channel.fromPath(params.model) + Classify(images, model_channel) +} +``` + +What happens when you run the workflow? +Given what we know about channels, what might be happening? + +**Answer:** The Classify process only spawns a single task. +This is because after pulling the model path from the second input channel on the first iteration, the channel is empty, so no more Classify tasks can be submitted for execution. + +There are two types of channel in Nextflow - **queue channels** and **value channels**. +Queue channels are exhaustible - they have a set number of items in the channel and each process can only take each item in the channel once. +The second type of channel is a value channel, which is a channel of only a single item. +This item is emitted without exhaustion. + +### Using value channels + +There are some operators which will always return a value channel. +Examples are `first`, `collect`, `count`, etc. 
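As a loose analogy (illustrative shell, not Nextflow), a queue channel behaves like reading from a stream - each item can be taken only once - while a value channel behaves like a variable you can read any number of times:

```shell
# Queue-channel analogy: items read from a stream are gone once consumed.
printf 'cat-1.jpg\ncat-2.jpg\n' | {
    read -r first
    read -r second
    echo "consumed: $first and $second"
    # A third read finds the stream exhausted:
    read -r third || echo "channel exhausted"
}

# Value-channel analogy: a single value, readable as many times as needed.
model="b32_400m.pt"
echo "task 1 uses $model"
echo "task 2 uses $model"
```

Once the stream runs dry, no further reads succeed - which is exactly why our Classify process stopped after a single task.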
+ +We could also create a value channel using the `channel.value` factory: + +```groovy title="Using channel.value" linenums="1" +params.model = "${projectDir}/data/models/b32_400m.pt" + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + model_channel = channel.value(file(params.model)) + Classify(images, model_channel) +} +``` + +Note here that we're wrapping the params.model value (a String) in the `file()` function, which turns an ordinary String into an object that Nextflow can use as a path. +We've not needed to use this until now because the `channel.fromPath` factory necessarily returns paths, so it automatically does this conversion for us. + +### Implicit value channels + +An even simpler solution is to provide the path object directly when calling the process. +Any non-channel object will automatically be converted into a value channel for you: + +```groovy title="main.nf" hl_lines="8" linenums="1" +#!/usr/bin/env nextflow +params.width = 400 +params.model = "${projectDir}/data/models/b32_400m.pt" + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + Resize(images, params.width) + + Classify(images, file(params.model)) + | view +} +``` + +Add the Classify process definition to your workflow and run it: + +```bash +nextflow run main.nf +``` + +You might find that the process errors out with a 137 exit code. +This generally means that we've run out of RAM because we're running too many of these classification jobs at the same time. +Let's talk about how we tell Nextflow that a particular process requires more resources. + +### Takeaway + +Understanding queue channels vs. value channels is crucial for controlling how data flows through multi-input processes. + +### What's next? + +Let's learn how to manage computational resources for our processes. 
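Incidentally, that 137 exit code is not arbitrary: when a process is killed by a signal, its exit code is 128 plus the signal number, and signal 9 is SIGKILL - the signal the kernel's out-of-memory killer sends. You can decode it yourself:

```shell
# Exit codes above 128 encode a fatal signal: code = 128 + signal number.
code=137
sig=$((code - 128))
echo "killed by signal $sig"

# kill -l translates a signal number to its name; signal 9 is KILL.
kill -l "$sig"
```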
+
+---
+
+## Resources
+
+Our processes are currently composed of the `input:`, `output:`, and `script:` blocks.
+In addition to these blocks, processes can use "process directives", which are optional annotations that modify the behaviour of processes.
+There are many directives ([documentation](https://www.nextflow.io/docs/latest/reference/process.html#directives)), but we can introduce the concept with two important process directives - `memory` and `cpus`.
+
+### Understanding executors
+
+So far, we've been using the local executor to run Nextflow - running on the local machine.
+There are many other executors targeting different backends, from HPC executors like SLURM and PBS to cloud executors like AWS Batch, Google Batch, and Azure Batch.
+There are more than a dozen supported executors ([documentation](https://www.nextflow.io/docs/latest/executor.html)).
+
+Each of these has a concept of the resources a particular task will require - resources such as cpus, memory, gpus, disk, etc.
+
+### Resource defaults and management
+
+If not otherwise specified, the defaults are to request 1 cpu, 1 GB of RAM, and 0 GPUs for each task.
+
+When using the local executor, Nextflow scans the machine it is running on and determines how many cpus and how much RAM the system has.
+It will ensure that (given the resources specified or defaults applied) the running tasks never exceed the available limits.
+If the system has 16 GB of RAM, for example, and a particular process requires 6 GB of RAM, Nextflow will ensure that _at most_ 2 of those tasks are running at any one time.
+As a task finishes, Nextflow begins the next task in line.
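The scheduling arithmetic in that example is simple integer division, which you can check for yourself:

```shell
# How many 6 GB tasks fit concurrently on a 16 GB machine?
total_gb=16
per_task_gb=6
max_parallel=$((total_gb / per_task_gb))
echo "at most $max_parallel tasks at a time"   # at most 2 tasks at a time
```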
+ +### Add resource directives + +Update your Classify process to request more memory: + +```groovy title="Process with memory directive" hl_lines="2" linenums="1" +process Classify { + memory '13 GB' + + input: + tuple val(meta), path(img) + path(model) + output: tuple val(meta), stdout + script: "classify.py --model-path $model ${img}" +} +``` + +Now run the workflow again: + +```bash +nextflow run main.nf +``` + +### Takeaway + +Process directives like `memory` and `cpus` communicate resource requirements to Nextflow executors, enabling proper scheduling and preventing resource exhaustion. + +### What's next? + +Let's learn how to combine related data using the join and groupTuple operators. + +--- + +## Grouping + +Now we want to combine our classification results with our resized images. +We can use the `join` operator, which finds pairs of items (one from each channel) that share a key. +By default, the `join` operator will use the first element of each item in the channel as the key. +In our case, that first item was the image metadata, which occupies the first position in both the Classify process output and the Resize process output. + +### Join classification results with images + +Update your workflow to join the channels: + +```groovy title="Workflow with join" hl_lines="12-14" linenums="1" +#!/usr/bin/env nextflow +params.width = 400 +params.model = "${projectDir}/data/models/b32_400m.pt" + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + Resize(images, params.width) + + Classify(images, file(params.model)) + | join(Resize.out) + | view +} +``` + +This produces a channel like: + +``` +[metadata, label, img] +[metadata, label, img] +[metadata, label, img] +[metadata, label, img] +``` + +### Group items by label + +In order to make a picture of just the good cats and a second picture of just the bad cats, we'll need to group the items in the channel based on the label. 
+We can do this with the `groupTuple` operator. +Normally the groupTuple expects that the grouping key will be the first element in each item in the channel. +In our case, it is the second item, i.e. index "1" if the first item is index "0". +To ask Nextflow to group on the item with index 1, we add a `by: 1` argument to the operator: + +```groovy title="Workflow with grouping" hl_lines="13-15" linenums="1" +#!/usr/bin/env nextflow +params.width = 400 +params.model = "${projectDir}/data/models/b32_400m.pt" + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + Resize(images, params.width) + + Classify(images, file(params.model)) + | join(Resize.out) + | groupTuple(by: 1) + | view +} +``` + +This produces a channel of the form: + +``` +[metadatas, label, images] +[metadatas, label, images] +``` + +### Takeaway + +The `join` and `groupTuple` operators allow you to match related items and collect them by common attributes. + +### What's next? + +Let's create visual collages for each group of classified images. + +--- + +## Collage + +Let's create a `Collage` process that takes this channel and produces a collage of all of the images for each label. +The script block here is a little involved, but it uses ImageMagick's montage command to arrange images into a grid. 
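Before we build it, here's the reshaping that `join` and `groupTuple(by: 1)` just performed, sketched with plain shell tools. The IDs and labels are hypothetical, and this is an analogy rather than anything Nextflow runs:

```shell
# Records of "id label" (the joined channel), grouped into one line per
# label - the same reshaping groupTuple(by: 1) performs.
printf 'cat1 good\ncat2 bad\ncat3 good\n' |
    awk '{ ids[$2] = ids[$2] ? ids[$2] ", " $1 : $1 }
         END { for (label in ids) print "[[" ids[label] "], " label "]" }'
```

Each output line collects every ID that shares a label, mirroring the `[metadatas, label, images]` shape our channel now has.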
+
+### Create the Collage process
+
+```groovy title="Collage process" linenums="1"
+process Collage {
+    input: tuple val(metadatas), val(label), path("inputs/*.png")
+    output: tuple val(label), path("collage.png")
+    script:
+        """
+        montage inputs/* \\
+            -geometry +10+10 \\
+            -background black \\
+            +polaroid \\
+            -background '#ffbe76' \\
+            collage_nolabel.png
+        montage \\
+            -pointsize 48 \\
+            -label '$label' \\
+            -geometry +0+0 \\
+            -background "#f0932b" \\
+            collage_nolabel.png collage.png
+        """
+}
+```
+
+### Connect to the workflow
+
+We can then hook this into our channel chain:
+
+```groovy title="Workflow with Collage" hl_lines="13-15" linenums="1"
+#!/usr/bin/env nextflow
+params.width = 400
+params.model = "${projectDir}/data/models/b32_400m.pt"
+
+workflow {
+    images = channel.fromPath("data/pics/*.{png,gif,jpg}")
+        | map { img -> [[id: img.baseName], img] }
+
+    Resize(images, params.width)
+
+    Classify(images, file(params.model))
+        | join(Resize.out)
+        | groupTuple(by: 1)
+        | Collage
+        | view
+}
+```
+
+### A note on image sizes
+
+You might worry that the collage tasks would be slow if they had to montage the original full-sized images.
+Happily, because we joined the classification results with `Resize.out` rather than with the original `images` channel, the images flowing into `Collage` are already the resized ones.
+Swaps like this are possible whenever two channels have the same shape: both `images` and the `Resize` output emit `[meta, image]` pairs, so either one can be joined against the classifications.
+
+### Combine all collages
+
+For our final process, let's combine these two collages together into a single final image.
+We'll create a process that takes a collection of images (we don't care what they are called) and produces a final `collage_all.png` image: + +```groovy title="CombineImages process" linenums="1" +process CombineImages { + input: path "in.*.png" + output: path "collage_all.png" + script: + """ + montage \\ + -geometry +10+10 \\ + -quality 05 \\ + -background '#ffbe76' \\ + -border 5 \\ + -bordercolor '#f0932b' \\ + in.*.png \\ + collage_all.png + """ +} +``` + +### Transform the channel + +The channel coming from the Collage process looks like: + +``` +[label, collageImage] +[label, collageImage] +``` + +but we need it to look like: + +``` +[collageImage, collageImage] +``` + +So we'll drop the labels and collect all images: + +```groovy title="Final workflow" hl_lines="14-18" linenums="1" +#!/usr/bin/env nextflow +params.width = 400 +params.model = "${projectDir}/data/models/b32_400m.pt" + +workflow { + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + Resize(images, params.width) + + Classify(images, file(params.model)) + | join(Resize.out) + | groupTuple(by: 1) + | Collage + | map { _label, img -> img } + | collect + | CombineImages + | view +} +``` + +The `collect` operator takes all the items in a channel and then emits them as a single "wide" collection. + +Run the complete workflow: + +```bash +nextflow run main.nf +``` + +### Scaling up without code changes + +One of Nextflow's key strengths is automatic scalability. +Let's see this in action by adding more data to our analysis! + +While your workflow is still running (or right after it completes), open a new terminal and add more cat images: + +```bash +# Add 20 more cats to our dataset +.stuff/cat_me.sh --count 20 --prefix data/pics +``` + +This brings our total from 4 cats to 24 cats. 
+Now run the workflow again with `-resume`: + +```bash +nextflow run main.nf -resume +``` + +Notice what happens in the output: + +- Tasks for the original 4 images show as **[cached]** in gray +- Only the 20 new images are processed through Resize and Classify +- The groupTuple, Collage, and CombineImages steps run again (because their inputs changed) +- The final collage now includes all 24 cats + +**You didn't change a single line of code** - the workflow automatically: + +- Detected the new input files via the glob pattern `data/pics/*.{png,gif,jpg}` +- Processed only the new images that hadn't been seen before +- Reused cached results for the original 4 images +- Scaled the grouping and collage operations to handle more data + +This is the power of Nextflow's declarative approach: you describe **what** you want to do, and Nextflow figures out **how** to do it efficiently, whether you have 4 files or 4,000 files. + +!!! tip "Scalability in practice" + + This same pattern works at any scale: + + - **Local development**: Test with 4 samples + - **Pilot study**: Scale to 24 samples with no code changes + - **Production**: Process thousands of samples with the same workflow + - **HPC/Cloud**: Nextflow automatically distributes tasks across available resources + +### Takeaway + +You can chain together multiple processes and operators to build sophisticated multi-step workflows that transform and aggregate data. +Nextflow automatically scales your workflow as your data grows, without requiring any code changes. + +### What's next? + +Now that we have a working workflow, let's learn how to publish the results in an organized way. 
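+
+!!! tip "Try the operators in isolation"
+
+    To build more intuition for how `join` and `groupTuple` reshape a channel, you can sketch them with toy values, no processes required. This snippet is purely illustrative (the metadata, labels, and filenames are made up):
+
+    ```groovy title="ops.nf"
+    workflow {
+        classifications = channel.of([[id: 'a'], 'good cat'], [[id: 'b'], 'bad cat'], [[id: 'c'], 'good cat'])
+        images = channel.of([[id: 'a'], 'a.png'], [[id: 'b'], 'b.png'], [[id: 'c'], 'c.png'])
+
+        classifications
+            | join(images)      // pairs items sharing the leading meta map: [meta, label, img]
+            | groupTuple(by: 1) // groups on the label at index 1: [[metas], label, [imgs]]
+            | view
+    }
+    ```
+
+    Save it as `ops.nf` and run `nextflow run ops.nf` to watch the channel shapes change.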
diff --git a/docs/small_nextflow/03_publishing_portability.md b/docs/small_nextflow/03_publishing_portability.md new file mode 100644 index 000000000..f51318f56 --- /dev/null +++ b/docs/small_nextflow/03_publishing_portability.md @@ -0,0 +1,381 @@ +# Part 3: Publishing & Portability + +In this part, we'll make our workflow production-ready by publishing organized outputs, supporting multiple filesystems, and containerizing dependencies for reproducibility. + +--- + +## Workflow outputs + +Great! We have a workflow that (arguably cruelly) collects our cats into "good" and "bad" groupings! +Unfortunately, the final output file is still deep in the work directory in a hostile-looking hash-addressed directory. +We'd like to define some final workflow outputs that should be published somewhere safe, outside of the work directory. + +### Understanding workflow output blocks + +To define the workflow outputs, we'll need to define a `publish:` block in the workflow. +We'll also need to put the existing workflow in a `main:` block as shown below: + +```groovy title="Workflow with output blocks" linenums="1" +workflow { + main: + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + Resize(images, params.width) + + Classify(images, file(params.model)) + + Classify.out + | join(Resize.out) + | groupTuple(by: 1) + | Collage + | map { _label, img -> img } + | collect + | CombineImages + + publish: + // Something to go here +} +``` + +### Publishing the collage + +In the `publish:` block, we define channels that we'd like to publish: + +```groovy title="Publishing output" hl_lines="19-20 23-27" linenums="1" +workflow { + main: + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + Resize(images, params.width) + + Classify(images, file(params.model)) + + Classify.out + | join(Resize.out) + | groupTuple(by: 1) + | Collage + | map { _label, img -> img } + | collect + | CombineImages 
+ + publish: + collage = CombineImages.out +} + +output { + collage { + mode 'copy' + } +} +``` + +Now when we run, the final collage will be copied into `results/collage_all.png`. + +We can control the publication mechanism by adding arguments. +The `mode 'copy'` directive tells Nextflow to copy the output file rather than create a symlink (the default). + +### Publishing the classifications + +The more interesting outputs might be those with more metadata associated with them. +For example, we might want to record the classification for each image ID. +To publish metadata-rich outputs, we'll first create a channel that is composed of Maps: + +```groovy title="Publishing with metadata" hl_lines="11-14 20 25-28" linenums="1" +workflow { + main: + images = channel.fromPath("data/pics/*.{png,gif,jpg}") + | map { img -> [[id: img.baseName], img] } + + Resize(images, params.width) + + Classify(images, file(params.model)) + + Classify.out + | join(Resize.out) + | map { meta, label, image -> meta + [label:label, image:image] } + | set { classifiedMaps } + + Classify.out + | join(Resize.out) + | groupTuple(by: 1) + | Collage + | map { _label, img -> img } + | collect + | CombineImages + + publish: + collage = CombineImages.out + classified = classifiedMaps +} + +output { + collage { + mode 'copy' + } + classified { + mode 'copy' + } +} +``` + +This will cause the resized images to also be published in the `results` directory, but it's looking a bit cluttered now: + +```console title="Results directory" +results +├── collage_all.png +├── resized-4skdDxHm4yDsSJIr.png +├── resized-4y6Hyu0uzVZcEx89.png +├── resized-6Nb0ipGrHDHqCEmZ.png +└── resized-wfMCf1lHc9YPw455.png +``` + +### Organizing outputs with path directives + +Let's bring a little bit of order by organizing images into subdirectories by their label: + +```groovy title="Organized output" hl_lines="7-9" linenums="1" +output { + collage { + mode 'copy' + } + classified { + mode 'copy' + path { sample -> 
"images/${sample.label.replaceAll(/\s+/, '_')}" }
+    }
+}
+```
+
+Now our results are organized:
+
+```console title="Organized results"
+results
+├── collage_all.png
+└── images
+    ├── good_cat
+    │   ├── resized-4skdDxHm4yDsSJIr.png
+    │   └── resized-4y6Hyu0uzVZcEx89.png
+    └── bad_cat
+        ├── resized-6Nb0ipGrHDHqCEmZ.png
+        └── resized-wfMCf1lHc9YPw455.png
+```
+
+### Creating index files
+
+We've sanitized the label names so that they map to more sensibly named directories (no spaces, etc.), but this risks losing the original metadata.
+Let's ask Nextflow to publish a more digestible samplesheet or "index" of the published outputs that includes the real, unsanitized labels:
+
+```groovy title="Output with index" hl_lines="8-11" linenums="1"
+output {
+    collage {
+        mode 'copy'
+    }
+    classified {
+        mode 'copy'
+        path { sample -> "images/${sample.label.replaceAll(/\s+/, '_')}" }
+        index {
+            header true
+            path 'images/cats.csv'
+        }
+    }
+}
+```
+
+This produces a CSV at `results/images/cats.csv`.
+
+For more structured data, you can also choose YAML or JSON (the format is inferred from the file extension, and the `header` option only applies to CSV/TSV):
+
+```groovy title="JSON index" hl_lines="9" linenums="1"
+output {
+    collage {
+        mode 'copy'
+    }
+    classified {
+        mode 'copy'
+        path { sample -> "images/${sample.label.replaceAll(/\s+/, '_')}" }
+        index {
+            path 'images/cats.json'
+        }
+    }
+}
+```
+
+Run the workflow to see the organized outputs:
+
+```bash
+nextflow run main.nf
+```
+
+### Takeaway
+
+Nextflow's output publishing system allows you to organize results with custom paths and generate index files that preserve important metadata.
+
+### What's next?
+
+Let's make our workflow portable across different storage systems.
+
+---
+
+## Filesystem independence
+
+Nextflow speaks many different communication protocols, allowing you to move seamlessly from data on a local or shared filesystem, to `http://`/`https://`, to object storage protocols like `s3://`, `az://`, and `gs://`, or even the older `ftp://` protocol.
+You can provide support for new protocols yourself via Nextflow's plugin system.
+
+### Using remote files
+
+For example, our current workflow uses a local model file.
+But we can easily switch to using a remote model from the web:
+
+```bash
+nextflow run main.nf --model https://dl.fbaipublicfiles.com/MMPT/metaclip/b32_400m.pt
+```
+
+Nextflow will automatically download the file and make it available to the process.
+This works for input files too - you could provide image URLs instead of local paths!
+
+### Cloud storage support
+
+Similarly, if you're working in the cloud, you can use cloud storage URLs:
+
+```bash
+# AWS S3
+nextflow run main.nf --model s3://my-bucket/models/b32_400m.pt
+
+# Google Cloud Storage
+nextflow run main.nf --model gs://my-bucket/models/b32_400m.pt
+
+# Azure Blob Storage
+nextflow run main.nf --model az://my-container/models/b32_400m.pt
+```
+
+!!! tip
+
+    To use cloud storage protocols, you'll need to configure appropriate credentials for your cloud provider.
+    See the [Nextflow documentation](https://www.nextflow.io/docs/latest/amazons3.html) for details.
+
+### Takeaway
+
+Nextflow's protocol flexibility makes workflows portable across local, web, and cloud storage systems without code changes.
+
+### What's next?
+
+Let's containerize our workflow to ensure it runs reliably anywhere.
+
+---
+
+## Containerization
+
+All of our Nextflow tasks are currently using the software installed on the host operating system.
+This practice can quickly become a problem for a number of reasons:
+
+- As the workflow grows, the number of software dependencies will also likely grow, and it becomes increasingly likely that installing one piece of software accidentally updates a dependency of another. Similarly, you may end up with incompatible software dependency stacks.
+- The analysis becomes tied to a very specific machine or infrastructure, difficult to reproduce in exactly the same way elsewhere (by yourself or by a colleague). +- Managing software is a thankless and boring task. + +### Understanding containerization + +Containers are lightweight, standalone packages that include everything needed to run a piece of software: code, runtime, system tools, and libraries. +Docker is the most popular container technology, and it works by packaging your software and dependencies into an "image" that can run consistently anywhere Docker is installed. + +When you run a containerized task, Docker creates an isolated environment with exactly the software versions specified in the container image, completely independent of what's installed on the host system. +This ensures that your workflow produces identical results whether you run it on your laptop, an HPC cluster, or in the cloud. + +### Container technologies in Nextflow + +Nextflow provides the opportunity to run each task in an isolated software environment, and can do so via a variety of technologies, including: + +- conda +- containers (docker, apptainer/singularity, charliecloud, sarus, shifter, and podman) +- spack + +Let's improve the reproducibility and portability of our workflow. + +### Containerizing processes + +You'll remember that we manually installed software two different ways: + +- imagemagick (via `apt-get install`), and +- python packages (via `uv sync`) + +We could use a single container for all of the steps in the workflow, but this might limit the reusability of the containers, and upgrading one piece of software for one task would mean changing the container for all of the tasks. +Most researchers prefer (and Nextflow supports) defining a container per-process. + +To replace the imagemagick we installed via apt-get, we'll use the public container `minidocks/imagemagick:7`. 
+ +We've already talked about the `memory` and `cpus` process directives, but another useful directive is the `container` directive. +We'll use this to add the container to our `Resize`, `Collage`, and `CombineImages` processes: + +```groovy title="Processes with containers" hl_lines="2 9 16" linenums="1" +process Resize { + container 'minidocks/imagemagick:7' + + input: + tuple val(meta), path(img) + val(width) + output: tuple val(meta), path("resized-*") + script: "convert ${img} -resize ${width}x resized-${img.baseName}.png" +} + +process Collage { + container 'minidocks/imagemagick:7' + // ... rest of process +} + +process CombineImages { + container 'minidocks/imagemagick:7' + // ... rest of process +} +``` + +### Building custom containers + +Our `classify.py` process includes three specific python packages (torch, pillow, and openclip-torch) at specific versions. +It's unlikely that there is an existing container that provides these specific packages. +We could opt to build our own. + +There are a number of ways of building containers, but we'll use the [Seqera Containers](https://seqera.io/containers/) web interface. +You can add multiple packages and it will build a container for you. + +![Creating a new container using Seqera Containers](../assets/img/seqera-container-python-00.png) + +Once you have your container image, add it to the Classify process: + +```groovy title="Classify with custom container" hl_lines="2" linenums="1" +process Classify { + container 'community.wave.seqera.io/library/pip_open-clip-torch_pillow_torch:83edd876e95b9a8e' + memory '13 GB' + + input: + tuple val(meta), path(img) + path(model) + output: tuple val(meta), stdout + script: "classify.py --model-path $model ${img}" +} +``` + +### Enable container execution + +To actually use containers, you need to enable Docker (or another container engine) in your Nextflow configuration. 
+Create or update `nextflow.config`: + +```groovy title="nextflow.config" linenums="1" +docker { + enabled = true +} +``` + +Now run your fully containerized workflow: + +```bash +nextflow run main.nf +``` + +### Takeaway + +Containerization ensures your workflow runs identically across different computing environments by packaging all software dependencies. + +### What's next? + +Let's explore advanced topics like version control integration and cloud execution. diff --git a/docs/small_nextflow/04_advanced.md b/docs/small_nextflow/04_advanced.md new file mode 100644 index 000000000..d0113ef2c --- /dev/null +++ b/docs/small_nextflow/04_advanced.md @@ -0,0 +1,343 @@ +# Part 4: Advanced Topics + +In this final part, we'll explore version control integration, cloud execution, and extension exercises to deepen your Nextflow skills. + +--- + +## Version control + +One of Nextflow's most powerful features is its deep integration with version control systems. +This allows you to share workflows, track changes, and ensure reproducibility by pinning to specific versions. + +### Create a GitHub repository + +First, let's create a new repository on GitHub to store your workflow. + +1. Go to [github.com](https://github.com) and log in +2. Click the "+" icon in the top right and select "New repository" +3. Name it something like `cat-classifier` (or any name you prefer) +4. Make it **public** (so Nextflow can access it easily) +5. **Don't** initialize with a README, .gitignore, or license +6. Click "Create repository" + +GitHub will show you some commands to push an existing repository. +Keep this page open - we'll use those commands in a moment. + +### Initialize and push your workflow + +Now let's version control your workflow. 
+From your workshop directory: + +```bash +# Initialize a git repository +git init + +# Add your workflow files +git add main.nf +git add bin/ + +# If you have a nextflow.config, add that too +git add nextflow.config + +# Create your first commit +git commit -m "Initial commit of cat classifier workflow" + +# Connect to your GitHub repository (replace with your username and repo name) +git remote add origin https://github.com/YOUR-USERNAME/cat-classifier.git + +# Push to GitHub +git branch -M main +git push -u origin main +``` + +Your workflow is now on GitHub! +Visit your repository URL to see your code online. + +### Running remote workflows + +Here's where it gets interesting: **you don't need a local copy of a workflow to run it**. + +Nextflow can pull workflows directly from GitHub and run them. +For example, to run the nf-core RNA-seq pipeline: + +```bash +nextflow run nf-core/rnaseq --help +``` + +This command pulls the workflow from `github.com/nf-core/rnaseq` (the `nf-core` organization, `rnaseq` repository), downloads it to `$HOME/.nextflow/assets/`, and runs it. + +Let's try a simpler example: + +```bash +nextflow run hello +``` + +This pulls and runs `github.com/nextflow-io/hello`. +Notice we didn't specify the full path - Nextflow uses sensible defaults: + +- If no provider is specified, it defaults to `github.com` +- If no organization is specified, it defaults to `nextflow-io` + +!!! tip "Other Git providers" + + Nextflow also supports: + + - **GitLab**: `nextflow run gitlab.com/user/repo` + - **Bitbucket**: `nextflow run bitbucket.org/user/repo` + - **Gitea**: With custom configuration + - **Azure Repos**: With custom configuration + - **AWS CodeCommit**: With custom configuration + +### Running specific versions with revisions + +Now you can run your own workflow from anywhere: + +```bash +# Run from GitHub (replace with your username and repo name) +nextflow run YOUR-USERNAME/cat-classifier +``` + +But what about version control? 
+What if you want to continue developing while also maintaining a stable version? + +Nextflow allows you to specify a **revision** - a specific branch, tag, or commit: + +```bash +# Run a specific branch +nextflow run YOUR-USERNAME/cat-classifier -revision dev-branch + +# Or use the short form +nextflow run YOUR-USERNAME/cat-classifier -r dev-branch +``` + +### Using Git tags for stable versions + +**Git tags** are named references to specific commits, typically used to mark release versions. +They're like bookmarks in your repository's history - they don't change, making them perfect for reproducible pipelines. + +Let's create a `1.0` tag for your workflow: + +```bash +# Create an annotated tag +git tag -a 1.0 -m "First stable release of cat classifier" + +# Push the tag to GitHub +git push origin 1.0 +``` + +Now you can run this exact version forever: + +```bash +nextflow run YOUR-USERNAME/cat-classifier -r 1.0 +``` + +This will always run the code as it existed when you created the tag, even if you continue developing on the `main` branch. + +### Testing with different revisions + +Let's see this in action. 
+Create a new branch and make a change: + +```bash +# Create and switch to a new branch +git checkout -b experimental + +# Make a small change (e.g., modify a parameter default in main.nf) +# Then commit it +git add main.nf +git commit -m "Experimental feature" +git push origin experimental +``` + +Now you can run different versions: + +```bash +# Run the stable 1.0 release +nextflow run YOUR-USERNAME/cat-classifier -r 1.0 + +# Run the main branch +nextflow run YOUR-USERNAME/cat-classifier -r main + +# Run the experimental branch +nextflow run YOUR-USERNAME/cat-classifier -r experimental + +# Run a specific commit (use any commit hash from git log) +nextflow run YOUR-USERNAME/cat-classifier -r abc123def +``` + +This is incredibly powerful: you can have a stable, reproducible pipeline (using a tag) while actively developing new features (on branches), all from the same repository. + +### Takeaway + +Nextflow's git integration allows you to version your workflows, share them easily, and run specific commits, branches, or tags from anywhere - no local copy required. + +### What's next? + +Let's explore running workflows on cloud infrastructure. + +--- + +## Cloud executors + +Nextflow supports running workflows on cloud infrastructure through various executors including AWS Batch, Google Cloud Batch, and Azure Batch. + +### Benefits of cloud execution + +Running workflows in the cloud offers several advantages: + +- **Scalability**: Automatically scale to hundreds or thousands of parallel tasks +- **Cost efficiency**: Pay only for the compute resources you use +- **No local infrastructure**: No need to maintain local HPC clusters +- **Global accessibility**: Run workflows from anywhere with internet access + +### Cloud executor configuration + +Each cloud provider has its own executor configuration. 
+Here's a basic example for AWS Batch: + +```groovy title="nextflow.config for AWS Batch" linenums="1" +process { + executor = 'awsbatch' + queue = 'my-batch-queue' + container = 'my-default-container' +} + +aws { + region = 'us-east-1' + batch { + cliPath = '/home/ec2-user/miniconda/bin/aws' + } +} +``` + +### Data staging with cloud storage + +When running in the cloud, you'll typically stage your data in cloud storage: + +```bash +# Upload model to S3 +aws s3 cp data/models/b32_400m.pt s3://my-bucket/models/ + +# Run workflow with cloud-based data +nextflow run main.nf --model s3://my-bucket/models/b32_400m.pt +``` + +!!! tip + + For detailed cloud setup instructions, see the Nextflow documentation for [AWS](https://www.nextflow.io/docs/latest/aws.html), [Google Cloud](https://www.nextflow.io/docs/latest/google.html), or [Azure](https://www.nextflow.io/docs/latest/azure.html). + +### Takeaway + +Nextflow's cloud executors enable scalable, cost-effective workflow execution without managing local infrastructure. + +### What's next? + +Try the extension exercises to practice your new skills! + +--- + +## Extension Exercise 1: Find extreme scores + +Our team is interested in which cat is the cutest cat and which cat is the ugliest cat. +Can you extend the workflow to identify (for each label) which picture scores the highest? + +### Hints + +**Hint 1:** You can use the `--json` flag on the `classify.py` script to output structured data instead of plain text. 
+ +**Hint 2:** You can parse a JSON file in a closure by using the JsonSlurper class, part of the standard library: + +```groovy +| map { meta, jsonFile -> new groovy.json.JsonSlurper().parseText(jsonFile.text) } +``` + +**Hint 3:** You can use the `min` and `max` operators to return a channel containing the minimum or maximum item, and you can pass a closure to those operators to describe how the elements in the channel should be compared ([docs](https://www.nextflow.io/docs/latest/reference/operator.html#min)). + +### Exercise goals + +- Modify the Classify process to output JSON +- Parse the JSON to extract scores +- Use operators to find the highest and lowest scoring images +- Publish these special images separately + +### Takeaway + +This exercise demonstrates how to work with structured data and use comparison operators to filter channel contents. + +### What's next? + +Try making the classification labels configurable! + +--- + +## Extension Exercise 2: Configurable labels + +We've decided that "bad" and "good" are too cruel a classification system for the cats. +Can you modify the workflow to add a `--labels` parameter? +The parameter should take a comma-separated list of labels and use those labels in preference to the default "good cat" and "bad cat". + +### Example usage + +```bash +nextflow run main.nf --labels 'red cat,orange cat,black cat' +``` + +### Hints + +**Hint 1:** You'll need to modify how the labels are passed to the `classify.py` script. 
+ +**Hint 2:** The classify.py script accepts multiple labels via the `--labels` argument: + +```bash +classify.py --labels "label1" "label2" "label3" image.jpg +``` + +**Hint 3:** You can split a comma-separated string in Groovy: + +```groovy +params.labels = 'good cat,bad cat' +labels = params.labels.split(',') +``` + +### Exercise goals + +- Add a `--labels` parameter to the workflow +- Parse the comma-separated labels +- Pass them to the classification script +- Ensure the grouping and collage steps still work with custom labels + +### Takeaway + +This exercise shows how to make workflows more flexible by parameterizing key values. + +### What's next? + +Congratulations! You've completed the Small Nextflow workshop. + +--- + +## Final notes + +### What you've learned + +Congratulations on completing the Small Nextflow workshop! +You've built a complete image classification workflow from scratch and learned: + +- **Fundamentals**: Channels, processes, parameters, and metadata +- **Data transformation**: Operators, resource management, and grouping +- **Publishing**: Organized outputs with indexes and custom paths +- **Portability**: Filesystem independence and containerization +- **Advanced patterns**: Version control and cloud execution + +### Where to go from here + +Now that you understand the basics, you can: + +- Explore the [Hello Nextflow](../../hello_nextflow/) course for more detailed explanations +- Learn about [nf-core](../../hello_nf-core/) for production-ready pipeline templates +- Try the [Side Quests](../../side_quests/) for advanced topics +- Build your own scientific workflows! + +### Takeaway + +You now have the foundational knowledge to build, run, and share reproducible scientific workflows with Nextflow. 
diff --git a/docs/small_nextflow/index.md b/docs/small_nextflow/index.md new file mode 100644 index 000000000..287107a81 --- /dev/null +++ b/docs/small_nextflow/index.md @@ -0,0 +1,57 @@ +--- +title: Small Nextflow +hide: + - toc +--- + +# Small Nextflow + +Hello! Welcome to Small Nextflow, a hands-on workshop where you'll build a real-world image classification workflow from scratch. + +In this workshop, you'll create a workflow that fetches cat images, classifies them using machine learning, and produces visual collages. +Along the way, you'll learn the core concepts that make Nextflow powerful for scientific computing: channels, processes, operators, and reproducible execution. + +By starting with an empty directory and building up piece by piece, you'll gain an intuitive understanding of how Nextflow workflows are structured and how data flows through them. + +Let's get started! + +[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/nextflow-io/training/tree/smol-nextflow?quickstart=1&ref=master) + +## Learning objectives + +In this workshop, you will learn foundational Nextflow concepts by building a complete workflow from scratch. + +By the end of this workshop you will be able to: + +- Create channels from files and understand channel types (queue vs. value channels) +- Define processes with input, output, and script blocks +- Use operators to transform and combine channel data (`map`, `join`, `groupTuple`, `collect`) +- Work with metadata using tuples and maps +- Configure workflows with parameters +- Manage resources and direct process execution +- Publish workflow outputs in organized structures +- Containerize workflows for reproducibility +- Make workflows portable across filesystems + +## Audience & prerequisites + +This workshop is designed for those who want to learn Nextflow by building a complete workflow from the ground up. 
+Some basic familiarity with the command line and programming concepts is helpful but not required. + +**Prerequisites** + +- A GitHub account +- Basic familiarity with command line +- Curiosity about machine learning (no ML expertise needed!) + +## Workshop structure + +The workshop is organized into four chapters: + +**Chapter 1: Fundamentals** - Learn the building blocks of Nextflow by creating channels, defining processes, and working with parameters and metadata. + +**Chapter 2: Data Transformation & Analysis** - Build a multi-step workflow that classifies images, manages resources, and groups results intelligently. + +**Chapter 3: Publishing & Portability** - Make your workflow production-ready by publishing organized outputs, supporting multiple filesystems, and containerizing dependencies. + +**Chapter 4: Advanced Topics** - Explore version control integration, cloud execution, and extension exercises to deepen your skills. diff --git a/docs/small_nextflow/next_steps.md b/docs/small_nextflow/next_steps.md new file mode 100644 index 000000000..e4c8d896e --- /dev/null +++ b/docs/small_nextflow/next_steps.md @@ -0,0 +1,74 @@ +# Next Steps + +Congrats again on completing the Small Nextflow workshop and thank you for completing our survey! + +--- + +## 1. Top 3 ways to level up your Nextflow skills + +Here are our top three recommendations for what to do next based on the workshop you just completed. + +### 1.1. Master the fundamentals with Hello Nextflow + +**[Hello Nextflow](../../hello_nextflow/index.md)** is a comprehensive hands-on course that provides deeper explanations of Nextflow concepts. + +If the Small Nextflow workshop gave you a taste of what's possible and you want to understand the details more thoroughly, Hello Nextflow is the perfect next step. +It covers channels, processes, modules, containers, and configuration in greater depth with more exercises and examples. + +### 1.2. 
Get started with nf-core + +**[nf-core](https://nf-co.re/)** is a worldwide collaborative effort to develop standardized open-source pipelines for a wide range of scientific research applications. +The project includes [over 100 pipelines](https://nf-co.re/pipelines/) that are available for use out of the box and [well over 1400 process modules](https://nf-co.re/modules/) that can be integrated into your own projects, as well as a rich set of developer tools. + +The **[Hello nf-core](../../hello_nf-core/index.md)** training course is a hands-on introduction to the nf-core project and its rich set of community-curated pipelines. +Part 1 covers how to find and use existing nf-core pipelines, which can save you a lot of time and effort. + +The rest of the training covers how to adopt nf-core standards and conventions in your own work, work with nf-core modules, and contribute back to the project. + +### 1.3. Explore advanced topics with Side Quests + +If you're ready to dive deeper into specific Nextflow topics, check out our growing **collection of [Side Quests](../../side_quests/index.md)**. + +These are short standalone courses that go deep into specific topics like: + +- Metadata handling +- Working with files +- Splitting and grouping data +- Debugging workflows +- Testing with nf-test +- And more! + +Each Side Quest focuses on a particular skill or pattern you can apply to your own workflows. + +--- + +## 2. Apply Nextflow to your domain + +**Check out the [Nextflow for Science](../../nf4_science/index.md) page** for a list of short standalone courses that demonstrate how to apply Nextflow concepts to common scientific analysis use cases: + +- **Genomics**: Variant calling workflows +- **RNAseq**: Gene expression analysis +- **Imaging**: Image processing pipelines + +These courses show you how to apply what you learned in Small Nextflow to real scientific problems in your field. 
+ +If you don't see your domain represented, let us know in the [Community forum](https://community.seqera.io/) or by [creating an issue in the training repository](https://github.com/nextflow-io/training/issues/new/choose) so we can add it to our to-do list. + +--- + +## 3. Check out Seqera Platform + +**[Seqera Platform](https://seqera.io/) is the best way to run Nextflow in practice.** + +It is a cloud-based platform developed by the creators of Nextflow that you can connect to your own compute infrastructure (whether local, HPC or cloud) to make it much easier to launch and manage your workflows, as well as manage your data and run analyses interactively in a cloud environment. + +The Free Tier is available for free use by everyone (with usage quotas). +Qualifying academics can get free Pro-level access (no usage limitations) through the [Academic Program](https://seqera.io/academic/program/). + +Have a look at the [Seqera Platform tutorials](https://docs.seqera.io/platform/latest/getting-started/quickstart-demo/comm-showcase) to see if this might be useful to you. + +--- + +### That's it for now! + +**Good luck in your Nextflow journey and don't hesitate to let us know in the [Community forum](https://community.seqera.io/) what else we could do to help.** diff --git a/docs/small_nextflow/survey.md b/docs/small_nextflow/survey.md new file mode 100644 index 000000000..5d09521d2 --- /dev/null +++ b/docs/small_nextflow/survey.md @@ -0,0 +1,7 @@ +# Feedback survey + +Before you move on, please complete this short 5-question survey to rate the training, share any feedback you may have about your experience, and let us know what else we could do to help you in your Nextflow journey. + +This should take you less than a minute to complete. Thank you for helping us improve our training materials for everyone! + +
diff --git a/mkdocs.yml b/mkdocs.yml index 3f27c7c0c..b44ccb018 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -27,6 +27,15 @@ nav: - hello_nextflow/06_hello_config.md - hello_nextflow/survey.md - hello_nextflow/next_steps.md + - Small Nextflow: + - small_nextflow/index.md + - small_nextflow/00_orientation.md + - small_nextflow/01_fundamentals.md + - small_nextflow/02_data_transformation.md + - small_nextflow/03_publishing_portability.md + - small_nextflow/04_advanced.md + - small_nextflow/survey.md + - small_nextflow/next_steps.md - Hello nf-core: - hello_nf-core/index.md - hello_nf-core/00_orientation.md diff --git a/small_nextflow/.stuff/cat_me.sh b/small_nextflow/.stuff/cat_me.sh new file mode 100755 index 000000000..fd2ef1247 --- /dev/null +++ b/small_nextflow/.stuff/cat_me.sh @@ -0,0 +1,114 @@ +#!/usr/bin/env bash + +set -euo pipefail + +# Show help message +show_help() { + cat << EOF +Usage: $0 [-c|--count NUM] [-p|--prefix PATH] [-h|--help] + +Download random cat images from cataas.com along with their tags. + +Options: + -c, --count NUM Number of cats to download (default: 1) + -p, --prefix PATH Directory to download files to (default: current directory) + -h, --help Show this help message and exit + +Examples: + $0 # Download 1 cat to current directory + $0 -c 5 # Download 5 cats to current directory + $0 -p cats -c 10 # Download 10 cats to ./cats directory + $0 --prefix data/cats # Download 1 cat to ./data/cats directory + +Output: + For each cat, creates two files: + - <cat_id>.jpg The cat image + - <cat_id>.txt The tags (one per line) + +EOF +} + +# Default values +num_downloads=1 +prefix="." + +# Parse arguments +while [[ $# -gt 0 ]]; do + case $1 in + -h|--help) + show_help + exit 0 + ;; + -c|--count) + num_downloads="$2" + shift 2 + ;; + -p|--prefix) + prefix="$2" + shift 2 + ;; + *) + echo "Unknown option: $1" + echo "Use --help for usage information" + exit 1 + ;; + esac +done + +# Validate number +if ! 
[[ "$num_downloads" =~ ^[0-9]+$ ]] || [ "$num_downloads" -lt 1 ]; then + echo "Error: Number must be a positive integer" + exit 1 +fi + +# Create output directory if it doesn't exist +mkdir -p "$prefix" +echo "Downloading $num_downloads cat(s) to $prefix/" +echo "" + +# Download loop +for i in $(seq 1 "$num_downloads"); do + echo "[$i/$num_downloads]" + + # Get the JSON metadata + json=$(curl -s 'https://cataas.com/cat?json=true') + + # Extract fields using jq + cat_id=$(echo "$json" | jq -r '.id') + mimetype=$(echo "$json" | jq -r '.mimetype') + url=$(echo "$json" | jq -r '.url') + tags=$(echo "$json" | jq -r '.tags[]') # Extract tags, one per line + + # Map mimetype to extension - only accept jpg, png, gif + case "$mimetype" in + image/jpeg|image/jpg) + ext="jpg" + ;; + image/png) + ext="png" + ;; + image/gif) + ext="gif" + ;; + *) + echo "✗ Skipping unsupported type: $mimetype" + echo "" + continue + ;; + esac + + # Build filenames with prefix + filename="${prefix}/${cat_id}.${ext}" + tagfile="${prefix}/${cat_id}.txt" + + # Download the image + curl -s "$url" -o "$filename" + echo "✓ Saved as $filename" + + # Save tags to text file + echo "$tags" > "$tagfile" + echo "✓ Saved tags to $tagfile" + echo "" +done + +echo "Download complete! 
Downloaded $num_downloads cat(s) to $prefix/" diff --git a/small_nextflow/.stuff/classify.py b/small_nextflow/.stuff/classify.py new file mode 100755 index 000000000..d24a6493c --- /dev/null +++ b/small_nextflow/.stuff/classify.py @@ -0,0 +1,137 @@ +#!/usr/bin/env python +"""Classify a single image using MetaCLIP with local weights.""" + +from pathlib import Path +from PIL import Image +import argparse +import json +import open_clip +import sys +import torch + +def classify_image(image_path: Path, labels: list[str], model_path: Path, + json_output: bool = False, architecture: str = None): + """Classify a single image using MetaCLIP.""" + # Auto-detect architecture from model filename if not provided + if architecture is None: + filename = model_path.name.lower() + if 'b32' in filename or '32' in filename: + architecture = 'ViT-B-32-quickgelu' + elif 'b16' in filename or '16' in filename: + architecture = 'ViT-B-16-quickgelu' + elif 'l14' in filename: + architecture = 'ViT-L-14-quickgelu' + elif 'h14' in filename: + architecture = 'ViT-H-14-quickgelu' + else: + raise ValueError(f"Cannot infer architecture from {model_path.name}. 
" + f"Please specify --architecture") + + print(f"Using architecture: {architecture}", file=sys.stderr) + + # Load model and preprocessing + model, _, preprocess = open_clip.create_model_and_transforms( + architecture, + pretrained=str(model_path), + weights_only=False # Trust MetaCLIP checkpoint from Facebook Research + ) + tokenizer = open_clip.get_tokenizer(architecture) + + # Prepare text labels + text = tokenizer(labels) + + # Process the image + try: + image = preprocess(Image.open(image_path)).unsqueeze(0) + + with torch.no_grad(): + image_features = model.encode_image(image) + text_features = model.encode_text(text) + + # Normalize and compute similarity + image_features /= image_features.norm(dim=-1, keepdim=True) + text_features /= text_features.norm(dim=-1, keepdim=True) + similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1) + + # Get all confidences for all labels + confidences = {label: float(conf) for label, conf in zip(labels, similarity[0])} + + # Get top prediction + values, indices = similarity[0].topk(1) + prediction = labels[indices[0]] + confidence = values[0].item() + + result = { + 'file': image_path.name, + 'path': str(image_path), + 'prediction': prediction, + 'confidence': confidence, + 'all_confidences': confidences + } + + if json_output: + print(json.dumps(result)) + else: + # Print human-readable format to stderr + print(f"\nImage: {image_path.name}", file=sys.stderr) + print(f"Prediction: {prediction} ({confidence:.2%})", file=sys.stderr) + print("\nAll confidences:", file=sys.stderr) + for label, conf in confidences.items(): + print(f" {label}: {conf:.2%}", file=sys.stderr) + # Print only the top label to stdout (no newline) + print(prediction, end='') + + return result + + except Exception as e: + error_result = { + 'file': image_path.name, + 'path': str(image_path), + 'error': str(e) + } + if json_output: + print(json.dumps(error_result)) + else: + print(f"Error processing {image_path.name}: {e}", 
file=sys.stderr) + sys.exit(1) + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Classify a single image using MetaCLIP") + parser.add_argument( + 'image', + type=Path, + help='Path to the image file to classify' + ) + parser.add_argument( + '--model-path', + type=Path, + default=Path("data/models/b32_400m.pt"), + help='Path to MetaCLIP model weights (default: data/models/b32_400m.pt)' + ) + parser.add_argument( + '--labels', + nargs='+', + default=["good cat", "bad cat"], + help='Labels for classification (default: ["good cat", "bad cat"])' + ) + parser.add_argument( + '--json', + action='store_true', + help='Output result as JSON to stdout' + ) + parser.add_argument( + '--architecture', + type=str, + choices=['ViT-B-32-quickgelu', 'ViT-B-16-quickgelu', 'ViT-L-14-quickgelu', 'ViT-H-14-quickgelu'], + help='Model architecture (auto-detected from filename if not specified)' + ) + + args = parser.parse_args() + + result = classify_image( + args.image, + args.labels, + args.model_path, + args.json, + args.architecture + ) diff --git a/small_nextflow/.stuff/pyproject.toml b/small_nextflow/.stuff/pyproject.toml new file mode 100644 index 000000000..6bd10d516 --- /dev/null +++ b/small_nextflow/.stuff/pyproject.toml @@ -0,0 +1,9 @@ +[project] +name = "my-classifier" +version = "0.1.0" +requires-python = ">=3.10" +dependencies = [ + "torch>=2.0.0", + "pillow>=10.0.0", + "open-clip-torch>=2.20.0", +]
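The core scoring step in `classify.py` — normalize the image and text embeddings, scale their dot products, and softmax the result into per-label confidences — can be sketched in plain Python without torch. This is an illustration only: `softmax_scores` is a hypothetical helper, and the toy vectors stand in for real CLIP features (actual embeddings have hundreds of dimensions).

```python
import math

def softmax_scores(image_vec, label_vecs, labels, scale=100.0):
    """Sketch of classify.py's scoring: L2-normalize the vectors,
    take scaled dot products, and softmax into per-label confidences."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    img = normalize(image_vec)
    # One logit per label: scaled cosine similarity (dot of unit vectors)
    logits = [scale * sum(a * b for a, b in zip(img, normalize(t)))
              for t in label_vecs]
    # Numerically stable softmax over the logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Toy embeddings (not real CLIP features): the image vector points
# mostly along the first label's direction, so "good cat" wins.
scores = softmax_scores([1.0, 0.2], [[1.0, 0.0], [0.0, 1.0]],
                        ["good cat", "bad cat"])
print(max(scores, key=scores.get))  # → good cat
```

The `scale=100.0` mirrors the `100.0 *` factor in the script; with unit vectors the dot product is the cosine similarity, and the softmax turns the scaled similarities into the confidence dictionary that `classify.py` reports per label.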