# Useful Nextflow channel patterns

## Pre-requisites

### 1. Install the required conda packages

Before starting, let's install `nextflow`, `awscli`, and `bash_kernel`. The last of these allows us to execute bash code within Jupyter notebook cells.

```
conda create -n jupyflow -c bioconda nextflow=20.01.0 bash_kernel awscli -y
```
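
Once the environment has been created and activated, a quick sanity check can confirm that the tools are on the `PATH`. The sketch below is only illustrative: `missing_tools` is a hypothetical helper name introduced here, not something shipped with any of the installed packages.

```shell
# Hypothetical helper: print every command name that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Should print nothing when the jupyflow environment is active.
missing_tools nextflow aws
```

Any name printed by the helper points at a package that is missing from the active environment.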


### 2. Choose the new Jupyter kernel for the current notebook

To use the dependencies installed above in an interactive session, select the matching kernel from the top right corner of your notebook. We named our conda environment `jupyflow`, so select that kernel before executing the following cells.

The following GIF shows where to find the `jupyflow` kernel option and how to select it:

![](http://g.recordit.co/LLDwDx6YtS.gif)

```
# prevents warning messages related to the undefined TERM variable
export TERM=xterm
```

## Pattern Examples

### 1. Example Input: A CSV file with a genotypic data set of 2 files (main file & its index)

Example file: `s3://lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs.csv`

We can download the file with `awscli` and inspect its first lines:

```
aws s3 cp s3://lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs.csv . --no-sign-request

head -4 vcfs.csv
```

```
download: s3://lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs.csv to ./vcfs.csv
chr,vcf,index
1,s3://lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr1.vcf.gz,s3://lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr1.vcf.gz.csi
10,s3://lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr10.vcf.gz,s3://lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr10.vcf.gz.csi
11,s3://lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr11.vcf.gz,s3://lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr11.vcf.gz.csi
```


#### What we would like to create:

```
[sampleA_chr1, 1, sampleA_chr1.vcf.gz, sampleA_chr1.vcf.gz.csi]
[sampleA_chr10, 10, sampleA_chr10.vcf.gz, sampleA_chr10.vcf.gz.csi]
```

#### How to create it:


```
cat << 'EOF' > pattern_1.nf
#!/usr/bin/env nextflow

params.number_of_files_to_process = 2
params.genotype_files_list = "s3://lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs.csv"

/*
  1. read the CSV and skip its header row; each remaining row is a [chr, vcf, index] list
  2. emit [sample name, chromosome, VCF file, index file]
  3. keep only the first params.number_of_files_to_process rows
*/
Channel
  .fromPath(params.genotype_files_list)
  .ifEmpty { exit 1, "Cannot find CSV VCFs file : ${params.genotype_files_list}" }
  .splitCsv(skip: 1)
  .map { chr, vcf, index -> [file(vcf).simpleName, chr, file(vcf), file(index)] }
  .take( params.number_of_files_to_process )
  .set { ch_user_input_vcf }

ch_user_input_vcf.view()
EOF
nextflow run pattern_1.nf
```

```
N E X T F L O W  ~  version 20.01.0
Launching `pattern_1.nf` [elated_northcutt] - revision: efb9b93a3c
[sampleA_chr1, 1, /lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr1.vcf.gz, /lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr1.vcf.gz.csi]
[sampleA_chr10, 10, /lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr10.vcf.gz, /lifebit-featured-datasets/projects/gel/gwas/testdata/vcfs/sampleA_chr10.vcf.gz.csi]
```
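
To make the row-to-tuple mapping easier to follow outside Nextflow, the sketch below emulates it in plain shell on local text. It is an approximation under stated assumptions: `csv_to_tuples` is a hypothetical helper, the `s3://bucket/...` paths are placeholders, the sample name is taken as the file name up to its first dot (which is what `simpleName` yields in the output above), and files are printed by name only rather than staged.

```shell
# Hypothetical helper: emulate splitCsv(skip:1) + map + take(N) on CSV text from stdin.
# $1 = number of data rows to keep.
csv_to_tuples() {
  awk -F',' -v n="$1" '
    NR == 1 { next }                      # skip the header row
    NR - 1 > n { exit }                   # keep only the first n data rows
    {
      vcf = $2;   sub(/.*\//, "", vcf)    # file name of the VCF
      idx = $3;   sub(/.*\//, "", idx)    # file name of the index
      name = vcf; sub(/\..*/, "", name)   # text before the first dot
      printf "[%s, %s, %s, %s]\n", name, $1, vcf, idx
    }'
}

# Placeholder rows shaped like vcfs.csv
printf '%s\n' \
  'chr,vcf,index' \
  '1,s3://bucket/vcfs/sampleA_chr1.vcf.gz,s3://bucket/vcfs/sampleA_chr1.vcf.gz.csi' \
  '10,s3://bucket/vcfs/sampleA_chr10.vcf.gz,s3://bucket/vcfs/sampleA_chr10.vcf.gz.csi' \
  '11,s3://bucket/vcfs/sampleA_chr11.vcf.gz,s3://bucket/vcfs/sampleA_chr11.vcf.gz.csi' |
  csv_to_tuples 2
```

With `2` as the argument, only the `chr1` and `chr10` rows are printed, because rows are taken in file order rather than in numeric chromosome order.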


### 2. Example Input: A folder with a genotypic data set of 3 files (PLINK set)

Example folder: `s3://omics-example-datasets/pipelines/gwas/tools/king/ancestry-reference-files/`

We can inspect the contents of the folder using `awscli`:

```
aws s3 ls s3://omics-example-datasets/pipelines/gwas/tools/king/ancestry-reference-files/ --no-sign-request
```

```
2022-02-20 20:44:07  488917156 KGref.bed.xz
2022-02-20 20:57:40   36528948 KGref.bim.xz
2022-02-20 21:42:39       3272 KGref.fam.xz
```


#### What we would like to create:

```
[KGref, KGref.bed.xz, KGref.bim.xz, KGref.fam.xz]
```

#### How to create it:


```
cat << 'EOF' > pattern_2.nf
#!/usr/bin/env nextflow

params.king_reference_data = "s3://omics-example-datasets/pipelines/gwas/tools/king/ancestry-reference-files/KGref.{bed,bim,fam}.xz"

// size: 3 expects three files per group
// flat: true emits [key, file1, file2, file3] rather than [key, [file1, file2, file3]]
Channel
  .fromFilePairs(params.king_reference_data, size: 3, flat: true)
  .ifEmpty { exit 1, "KING reference data PLINK files not found: ${params.king_reference_data}.\nPlease specify a valid --king_reference_data value. e.g. refdata/king_ref*.{bed,bim,fam}" }
  .set { ch_king_reference_data }

ch_king_reference_data.view()
EOF
nextflow run pattern_2.nf
```

```
N E X T F L O W  ~  version 20.01.0
Launching `pattern_2.nf` [high_raman] - revision: 5ab8293681
[KGref, /omics-example-datasets/pipelines/gwas/tools/king/ancestry-reference-files/KGref.bed.xz, /omics-example-datasets/pipelines/gwas/tools/king/ancestry-reference-files/KGref.bim.xz, /omics-example-datasets/pipelines/gwas/tools/king/ancestry-reference-files/KGref.fam.xz]
```
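
In the `fromFilePairs` call above, `size: 3` sets the number of files expected per group, and `flat: true` emits them as separate tuple elements after the grouping key instead of as a nested list. The shell sketch below imitates that output shape on plain file names. It is a simplification: `group_flat` is a hypothetical helper, and it takes the text before the first dot as the grouping key, which gives `KGref` here but is not claimed to be Nextflow's exact rule.

```shell
# Hypothetical helper: group file names from stdin by the text before the first dot
# and print one flat tuple per group, shaped like [key, file1, file2, file3].
group_flat() {
  sort | awk '
    {
      key = $0; sub(/\..*/, "", key)
      if (!(key in files)) order[++n] = key   # remember first-seen order of keys
      files[key] = files[key] ", " $0
    }
    END { for (i = 1; i <= n; i++) printf "[%s%s]\n", order[i], files[order[i]] }'
}

# prints [KGref, KGref.bed.xz, KGref.bim.xz, KGref.fam.xz]
printf '%s\n' KGref.fam.xz KGref.bed.xz KGref.bim.xz | group_flat
```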


### 3. Example Input: A folder with genotypic data pairs (main file & its index)

Example folder: `s3://lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs/`

We can inspect the contents of the folder using `awscli`:

```
aws s3 ls s3://lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs/ --no-sign-request | tail -5
```

```
2021-09-02 20:38:30       1887 sampleA_chr7.vcf.gz.csi
2021-09-02 20:38:30     121759 sampleA_chr8.vcf.gz
2021-09-02 20:38:30       2002 sampleA_chr8.vcf.gz.csi
2021-09-02 20:38:30     110844 sampleA_chr9.vcf.gz
2021-09-02 20:38:30       1829 sampleA_chr9.vcf.gz.csi
```


#### What we would like to create:

```
[sampleA_chr1, 1, sampleA_chr1.vcf.gz, sampleA_chr1.vcf.gz.csi]
```

#### How to create it:

```
cat << 'EOF' > pattern_3.nf
#!/usr/bin/env nextflow

def get_chromosome( name ) {
    // use a regex to extract the chromosome from a file name, e.g. sampleA_chr10 -> 10
    def chr_regex = /chr[a-zA-Z0-9]+/
    (name =~ chr_regex)[0].replaceAll('chr','')
}

params.number_of_files_to_process = 3
params.input_folder_location = "s3://lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs/"
params.file_pattern = "sampleA"
params.file_suffix = "vcf.gz"
params.index_suffix = "vcf.gz.csi"

/*--------------------------------------------------
  Channel setup
  1. key every file by its chromosome; simpleName already drops all extensions,
     and "s3:/" is prepended because the path prints as /bucket/key
  2. group the VCF and its index by chromosome
     (assumes the VCF is emitted before its index within each group)
  3. emit [sample name, chromosome, VCF file, index file]
---------------------------------------------------*/
if (params.input_folder_location) {
  Channel.fromPath("${params.input_folder_location}/**${params.file_pattern}*.{${params.file_suffix},${params.index_suffix}}")
       .map { it -> [ get_chromosome(file(it).simpleName), "s3:/"+it ] }
       .groupTuple(by:0)
       .map { chr, files_pair -> [ chr, files_pair[0], files_pair[1] ] }
       .map { chr, vcf, index -> [ file(vcf).simpleName, chr, file(vcf), file(index) ] }
       .take( params.number_of_files_to_process )
       .set { ch_user_input_vcf }

  ch_user_input_vcf.view()
}
EOF
nextflow run pattern_3.nf
```

```
N E X T F L O W  ~  version 20.01.0
Launching `pattern_3.nf` [focused_knuth] - revision: e5c967cdda
[sampleA_chr1, 1, /lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs/sampleA_chr1.vcf.gz, /lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs/sampleA_chr1.vcf.gz.csi]
[sampleA_chr10, 10, /lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs/sampleA_chr10.vcf.gz, /lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs/sampleA_chr10.vcf.gz.csi]
[sampleA_chr11, 11, /lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs/sampleA_chr11.vcf.gz, /lifebit-featured-datasets/projects/gel/gel-gwas/testdata/vcfs/sampleA_chr11.vcf.gz.csi]
```
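
The `get_chromosome` function above takes the first `chr...` match in the file name and strips the `chr` prefix. For readers who want to try the regex outside Nextflow, here is a rough shell counterpart. The shell function is our own illustration and is not used by the pipeline; it assumes a `grep` that supports `-o`, as GNU and BSD grep do.

```shell
# Illustrative shell counterpart of get_chromosome: take the first chr<alphanumerics>
# match in the given name and drop the "chr" prefix.
get_chromosome() {
  printf '%s\n' "$1" | grep -oE 'chr[a-zA-Z0-9]+' | head -n 1 | sed 's/chr//g'
}

get_chromosome sampleA_chr10.vcf.gz.csi   # prints 10
get_chromosome sampleA_chrX.vcf.gz        # prints X
```

Because the character class excludes `.` and `_`, the match stops at the first extension, so a VCF and its index yield the same chromosome key.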
