# Nextflow

Exercise on running an RNA-Seq pipeline with Nextflow.

* **Contact:** mate.balajti@unibas.ch

## Exercise 4.1: Use a workflow to run your analysis (10 points)

Running your analysis from a bash file or an ad hoc script does not scale
well. It becomes difficult to parallelize your code, to scatter/gather multiple
jobs of the same task, and to keep sufficient logging and provenance
information, and it is hard to share your analysis in a way that makes it easily
reproducible and reusable. Workflow languages and their corresponding management
systems/engines take care of all of these things and more.

If you are doing (or planning on doing) bioinformatics analyses more regularly,
we strongly recommend that you pick up one of these languages.
[Nextflow](https://www.nextflow.io/) (Groovy-based) and
[Snakemake](https://snakemake.readthedocs.io/en/stable/) (Python-based) are two
popular choices that we frequently use in our lab.

If you are interested, you can follow a tutorial on either of these
domain-specific languages and learn how to package your code as a proper,
shareable workflow.

> Note: Please take your code from Exercises 2 and 3 (alignment, featureCounts), organize it into Nextflow processes, and paste the code below.

```groovy
#!/usr/bin/env nextflow

// process.nf: process definitions included by the workflow script

// Build a STAR genome index from a FASTA file and a GTF annotation.
process STAR_INDEX {

    // Directives go at the top of the process body, before input/output.
    publishDir "results/genome_index", mode: "copy"

    input:
    path fasta_file
    path gtf_file

    output:
    path 'Files/genome_index'

    script:
    """
    mkdir -p Files/genome_index
    STAR --runThreadN 4 --runMode genomeGenerate --genomeDir Files/genome_index --genomeFastaFiles ${fasta_file} --sjdbGTFfile ${gtf_file}
    """
}

// Align one paired-end sample against the index, producing a sorted BAM file.
process STAR_ALIGN {

    publishDir "results/alignment", mode: "copy"

    input:
    path genome_index
    path control_file_1
    path control_file_2

    output:
    path 'Files/alignment_*.bam'

    script:
    """
    mkdir -p Files
    STAR --runThreadN 4 --genomeDir ${genome_index} --readFilesIn ${control_file_1} ${control_file_2} --outFileNamePrefix Files/alignment_ --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat
    """
}

// TODO: FEATURE_COUNTS process (not implemented yet)
//process FEATURE_COUNTS {
//
//    input:
//    path file(Files/alignment_/*log.out)
//
//}
```



```groovy
#!/usr/bin/env nextflow

include { STAR_INDEX; STAR_ALIGN } from './process.nf'

// Input files
fasta_file = file('Files/Mus_musculus.GRCm38.dna_rm.chromosome.19.fa')
gtf_file = file('Files/Mus_musculus.GRCm38.88_chr19.gtf')
control_file_1 = file('Files/control.mate_1.fq.gz')
control_file_2 = file('Files/control.mate_2.fq.gz')
// The treated sample is defined here but not aligned yet
treated_file_1 = file('Files/treated.mate_1.fq.gz')
treated_file_2 = file('Files/treated.mate_2.fq.gz')

// Workflow: build the index, then align the control sample against it
workflow {
    STAR_INDEX(fasta_file, gtf_file)
    STAR_ALIGN(STAR_INDEX.out, control_file_1, control_file_2)
}
```

## Workflow programming languages

What's the problem? Why do we need workflow programming languages?

As biological research became an increasingly large, multi-disciplinary, data-driven effort, a whole set of issues came up that have little to do with the research questions themselves but take up more and more time. These issues stem from the need to run many different tools, developed by various researchers on different platforms, on infrastructures with specific software installations and memory and CPU constraints. Furthermore, the analyses need to be repeated many times and, of course, in a reproducible manner. This led to the need for an infrastructure that allows a high-level description of analysis steps, supports the use of packaged code, and takes care of all the "bookkeeping" necessary to set up these tools on new machines, catch and report errors, restart failed jobs, etc. This is what workflow programming languages are designed to do. A number of them have emerged in recent years, e.g., **Snakemake** (inspired by the Unix `make` utility), the **Common Workflow Language** (CWL), and **Nextflow**, the one we will be using here.

`Nextflow` is a domain-specific language built on [`Groovy`](http://groovy-lang.org/documentation.html), which in turn runs on the `Java` Virtual Machine and has a Java-like syntax. It is therefore fairly intuitive for many programmers, and it supports general programming in addition to specific constructs for specifying and running workflows. A good in-depth introductory course is available at https://carpentries-incubator.github.io/workflows-nextflow/aio/index.html. It is quite extensive and takes time to go through, but if you need to work on workflows, it is probably worthwhile. There is also a tutorial from the developers of Nextflow (Seqera Labs), but it is rather recipe-driven and does not explain much of what is going on. Nevertheless, we will use some examples from that tutorial here to illustrate the main features, commenting on them more extensively.

### Tutorial pulled from GitHub

https://github.com/seqeralabs/nextflow-tutorial

### Prerequisites 

Follow the instructions in the tutorial above to install Nextflow. Moreover, Docker is required for running the more advanced examples in the tutorial. If you don't have Docker installed and running, please set that up before going ahead. 

### Nextflow scripting

Information on Nextflow scripting can be found in the documentation: https://www.nextflow.io/docs/latest/script.html.

### The simplest possible workflow

The work in `Nextflow` is organized in processes, which in turn are built around shell commands. So let's write a really minimal `Nextflow` script. The main logic of a `Nextflow` workflow typically lives in a file called `main.nf`, though any file name works. If `Nextflow` is installed (which it should be once we have set up the tutorial), we can run the workflow from the command line like this:

```bash
nextflow run main.nf
```

where the name of the script could be something else as well. What does a minimal `Nextflow` script look like?

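```groovy
#!/usr/bin/env nextflow

process simplest_ever {

    script:
    """
    echo "This is a test"
    """
}

// A process only runs when a workflow block calls it
workflow {
    simplest_ever()
}
```

The first line specifies which program to use to execute the script. Then comes the definition of the process called `simplest_ever`, in which all we do is specify the `script`, i.e., the shell script that we want to execute. Current versions of Nextflow use the DSL2 syntax, in which a process is only executed when it is called from a `workflow` block, so the script ends with one. All the shell script does is print a string to `STDOUT` (standard output), which for a command typed in a terminal is the user's screen. Nextflow, however, captures the output of each task in a file called `.command.out` in the task's work directory, so the string does not show up on the terminal unless the process sets the `debug true` directive.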

If you run this script, you will notice several new files in the directory where you did the run. For example, there is a log file (`.nextflow.log`), which gives you an idea of what has happened behind the scenes: many parameters describing the run have been saved, an environment has been created for the process to run in, etc. There is also a `work/` directory, where the files of individual process executions (tasks) are saved, each within a directory named by a unique identifier. Why all this? Imagine having to run the same program many times, with different input data sets, some of the runs perhaps failing. If you had to keep track yourself of which files can or cannot be overwritten, and which are or are not corrupted, it would quickly become unmanageable. `Nextflow` solves this problem by keeping all files associated with a specific task in their own, uniquely named directory.
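To make the idea of uniquely named task directories concrete, here is a minimal Python sketch. It is an analogy, not Nextflow's actual implementation: the function name `task_dir`, the use of SHA-256, and the choice of hashed fields are assumptions made for illustration. The point is only that a directory name derived from a hash of the command and its inputs is distinct for different tasks and stable when the same task is rerun.

```python
import hashlib
from pathlib import PurePosixPath


def task_dir(process_name, script, inputs, work_root="work"):
    """Return a work directory path derived from a hash of the task definition.

    Hypothetical helper: Nextflow computes its own task hash internally; this
    only mimics the `work/<2 hex chars>/<remaining hex chars>` layout.
    """
    digest = hashlib.sha256()
    for part in (process_name, script, *map(str, inputs)):
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    hex_id = digest.hexdigest()[:32]
    return PurePosixPath(work_root) / hex_id[:2] / hex_id[2:]


if __name__ == "__main__":
    script = 'echo "This is a test"'
    # Same process and script, different inputs: two different directories
    print(task_dir("simplest_ever", script, []))
    print(task_dir("simplest_ever", script, ["sample_A.fq"]))
```

Because the path depends only on the task definition, a rerun with unchanged inputs maps to the same directory, which is the kind of bookkeeping that lets a workflow engine recognize work it has already done.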

### Variables

What if we want to save some outputs to a file rather than `STDOUT`? Here we can take advantage of the fact that `Nextflow` supports general programming, so it has constructs like _variables_ that allow us to specify the name of the output file. Here's how we can do it:

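```groovy
#!/usr/bin/env nextflow

// Defined in the global scope of the script, visible inside the process
outfile = "out.txt"

process simplest_ever {

    script:
    """
    echo "This is a test" > $outfile
    """
}

// Run the process
workflow {
    simplest_ever()
}
```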

Now, if we execute

```bash
nextflow run variables.nf
```

with `variables.nf` being the name I gave the script, the directory associated with the run (a subdirectory of `work/`) will contain an `out.txt` file with the text `This is a test`. Note how the variable is defined in the _global scope_ of the script and is _interpolated_ (substituted) into the string inside the process block.

### A more general definition of a `Nextflow` process

Of course, we would use `Nextflow` when we have more complex tasks to orchestrate, letting the data flow from one process to the next in the appropriate order. The main constructs of the language that allow us to do the wiring are **channels**, which carry data of different types between the processes of a workflow. Values are _emitted_ into a channel by one process and _consumed_ from it by others. The types of channels are:

- **Value** channel - holds a single value (a number, a string, a list, etc.), specified either directly or as the result of an operation; it can be read an unlimited number of times and is not _consumed_ by processes
- **Queue** channel - a queue of values that are _consumed_ by processes in order
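The practical difference between the two channel types shows up when a process has more than one input. As the Nextflow documentation describes it, items from queue channels are paired one by one and the process stops when one queue runs out, whereas a value channel is reused for every task. The plain-Python sketch below mimics this. It is an analogy, not Nextflow code, and the function name `pair_inputs` and the example values are hypothetical.

```python
from itertools import repeat


def pair_inputs(first_queue, second, second_is_value=False):
    """Return the (first, second) input pairs, one pair per task."""
    if second_is_value:
        second = repeat(second)  # a value channel can be read again and again
    return list(zip(first_queue, second))  # zip stops when a queue runs out


if __name__ == "__main__":
    samples = [1, 2, 3]

    # Second input is a queue with a single item: only one task is formed
    print(pair_inputs(samples, ["ref_A"]))

    # Second input is a value: it is reused, so three tasks are formed
    print(pair_inputs(samples, "ref_A", second_is_value=True))
```

This is why a reference file that every task needs is usually passed as a value, while the per-sample data goes through a queue.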

Let's look again at an example:

```groovy
#!/usr/bin/env nextflow

process simple_script {
    tag "simple_script"

    input:
    val x

    output:
    path "*"

    script:
    """
    add.py $x
    """
}

workflow {
    // Queue channel that emits the values 1, 2 and 3
    input_ch = Channel.of(1, 2, 3)

    // One task per value; the output files arrive in a new queue channel
    ch_output = simple_script(input_ch)

    ch_output.view { "Received $it" }
}
```

The script consists of a process definition and a `workflow` block. In the `workflow` block we create a queue channel object, `input_ch`, that contains 3 values, the numbers 1, 2 and 3, pass it to the process, and keep the channel with the outputs of the process in `ch_output`.

The process is given a _tag_ `"simple_script"`, which will be used in log messages to make it easier for us to identify the process.

Then we have the more general definition of a process, which contains an _input_ block, where the inputs to the process are declared, an _output_ block, which declares what the process will produce, and the script itself, which we have seen before. In this case the shell command will try to run a Python program called `add.py`. Where would Nextflow look for this program? In the `bin/` subdirectory of the directory that contains the Nextflow script, where the program has to be executable (`chmod +x bin/add.py`). The Python program looks like this:

```python
#!/usr/bin/env python3
# Add 3 to the integer given as first argument; write the result to <argument>.txt

import sys

y = sys.argv[1]

with open(y + ".txt", "w") as output:
    v = 3 + int(y)
    output.write(str(v))
```

So it adds **3** to whatever number is given as argument and writes the result to an output file whose name is also constructed from the argument. The Nextflow script ends with a `view` operation that writes the elements of the queue channel `ch_output` to stdout as they arrive (tasks run in parallel, so the order may differ from the input order). Inside the closure, each element is available as the implicit variable **it**, which is interpolated into the string.

The **input** block tells us that the process will be run once for each value in `input_ch`, with the value available in the variable **x**.

The **output** block tells us that all files generated by our process will be streamed into a queue channel, the one we called `ch_output`.

The **script** block specifies what will be done with each value from the input channel, namely the addition of **3**, with the result saved in a file named `<value>.txt` in the directory created by Nextflow for the respective task.
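What the three tasks produce can be checked without Nextflow. The following Python sketch is a stand-in, not part of the tutorial: `add_three` repeats the logic of `add.py`, and `run_tasks` (a hypothetical name) calls it once per input value in a separate temporary directory, much as Nextflow runs each task in its own work directory.

```python
import tempfile
from pathlib import Path


def add_three(value, task_dir):
    """Same logic as add.py: write value + 3 to <value>.txt inside task_dir."""
    out_path = Path(task_dir) / f"{value}.txt"
    out_path.write_text(str(3 + int(value)))
    return out_path


def run_tasks(values, root):
    """Run one 'task' per value, each in its own subdirectory of root."""
    outputs = []
    for value in values:
        task_dir = Path(tempfile.mkdtemp(dir=root))  # unique directory per task
        outputs.append(add_three(value, task_dir))
    return outputs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for path in run_tasks([1, 2, 3], root):
            print(f"Received {path.name} containing {path.read_text()}")
```

Unlike this sequential loop, Nextflow may run the three tasks in parallel, so the order in which the files are received is not guaranteed.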

### A realistic example

Let's now turn to an example from the Nextflow tutorial, specifically `script2.nf`, which implements the _indexing_ of transcript sequences. Indexing speeds up the mapping of short sequenced reads to the transcripts from which these reads are thought to originate.

```groovy
/*
 * pipeline input parameters
 */
params.reads = "$baseDir/data/ggal/gut_{1,2}.fq"
params.transcript = "$baseDir/data/ggal/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
params.outdir = "results"

log.info """\
         R N A S E Q - N F   P I P E L I N E
         ===================================
         transcriptome: ${params.transcript}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()


/*
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
process index {

    input:
    path transcriptome

    output:
    path 'index'

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """

}

workflow {
    // The string-valued parameter is passed to the process as a single value
    index_ch = index(params.transcript)
    index_ch.view { "Index found it $it" }
}
```

The script starts by setting some parameters. They are prefixed by _params._ to indicate the object that they are part of, which can also be populated from a _config_ file or overridden on the command line (e.g., `--reads <pattern>`). Note the variable interpolation and the fact that a glob pattern (here the `{1,2}` alternatives) can be used to denote a set of files. After this, some logging information is defined.
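To see which file names a pattern such as `gut_{1,2}.fq` stands for, the brace alternatives can be expanded by hand. The Python sketch below is a simplified illustration, not the matching code that Nextflow uses: `expand_braces` is a hypothetical helper that handles only comma-separated alternatives in non-nested braces.

```python
import re


def expand_braces(pattern):
    """Expand the first {a,b,...} group, then recurse on the result.

    Simplified: supports non-nested, comma-separated alternatives only.
    """
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]  # nothing left to expand
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


if __name__ == "__main__":
    for name in expand_braces("data/ggal/gut_{1,2}.fq"):
        print(name)
```

For the pattern in `params.reads`, this yields the two mate files of the paired-end sample, `gut_1.fq` and `gut_2.fq`.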

Finally, we get to the process definition. 

The **input** block shows that a file/path variable (`transcriptome`) is used as input. This variable comes from a value channel, the value being specified by a string-valued parameter (`params.transcript`).

The **output** block shows that a file/path variable with value `index` will be fed into a channel, which the script calls `index_ch`. We could also use the `publishDir` _directive_ to specify a path where outputs should be saved (a more meaningful location in the project directory than the arbitrarily named process-specific directory).

The **script** block contains the command, which is `salmon index`. This will create an index, stored in a directory called `index`, for the input file whose name is contained in the `transcriptome` variable. It also uses an interpolated implicit variable, `task.cpus`, which specifies the number of CPUs for the task. This can be changed with the `cpus` _directive_, which needs to be specified at the beginning of the process definition.

Nextflow has many more features, which are best learned on demand.