# Hello World! example for looper using a PEPhub project

This tutorial demonstrates how to install `looper` and use it to run a pipeline on a PEP project. 

## 1. Install the latest version of looper:

```console
pip install --user --upgrade looper
```

## 2. Download and unzip the hello_looper repository

The [hello_looper repository (`pephub_config` branch)](https://github.com/pepkit/hello_looper/tree/pephub_config) contains a basic functional example config (in `/looper_config`) and a looper-compatible pipeline (in `/pipeline`) that can run on that project. Let's download and unzip it:


```console
wget https://github.com/pepkit/hello_looper/archive/pephub_config.zip

--2023-05-01 13:25:29--  https://github.com/pepkit/hello_looper/archive/pephub_config.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/pepkit/hello_looper/zip/refs/heads/pephub_config [following]
--2023-05-01 13:25:29--  https://codeload.github.com/pepkit/hello_looper/zip/refs/heads/pephub_config
Resolving codeload.github.com (codeload.github.com)... 140.82.112.10
Connecting to codeload.github.com (codeload.github.com)|140.82.112.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘pephub_config.zip’

pephub_config.zip       [ <=>                ]   6.51K  --.-KB/s    in 0.02s

2023-05-01 13:25:29 (285 KB/s) - ‘pephub_config.zip’ saved [6666]
```



```console
unzip pephub_config.zip

Archive:  pephub_config.zip
d612e3d4245d04e7f23419fb77ded80773b40f0d
   creating: hello_looper-pephub_config/
  inflating: hello_looper-pephub_config/README.md
   creating: hello_looper-pephub_config/data/
  inflating: hello_looper-pephub_config/data/frog1_data.txt
  inflating: hello_looper-pephub_config/data/frog2_data.txt
  inflating: hello_looper-pephub_config/data/frog3_data.txt
  inflating: hello_looper-pephub_config/data/frog4_data.txt
  inflating: hello_looper-pephub_config/data/frog5_data.txt
   creating: hello_looper-pephub_config/looper_config/
  inflating: hello_looper-pephub_config/looper_config/.looper.yaml
  inflating: hello_looper-pephub_config/looper_pipelines.md
  inflating: hello_looper-pephub_config/output.txt
   creating: hello_looper-pephub_config/pipeline/
  inflating: hello_looper-pephub_config/pipeline/count_lines.sh
  inflating: hello_looper-pephub_config/pipeline/output_schema.yaml
  inflating: hello_looper-pephub_config/pipeline/pipeline
```

```console
cd hello_looper-pephub_config/
```

Let's check what is inside. We have the data, the pipeline (with its pipeline interfaces), and the looper config file:

```console
ls

data  looper_config  looper_pipelines.md  output.txt  pipeline  README.md
```


Now create the environment variables that are used in the project and looper config:

```console
export LOOPERDATA=`pwd`/data
export LOOPERPIPE=`pwd`/pipeline
```

Check what's inside `.looper.yaml`. It defines `pep_config`, `output_dir`, and `pipeline_interfaces`.

```console
cat ./looper_config/.looper.yaml

pep_config: "databio/looper:default" # pephub registry path or local path
output_dir: "$HOME/hello_looper_results"
pipeline_interfaces:
  sample:  $LOOPERPIPE/pipeline_interface.yaml
```
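
Because the config refers to `$HOME` and `$LOOPERPIPE`, a quick pre-flight check can catch an unset variable or a missing file before `looper` is invoked. The following is a minimal sketch, not part of `looper`: the helper name `require_path` is hypothetical, and it assumes the two `export` commands above were run in the same shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of looper): report whether a path exists.
require_path() {
  local label="$1" path="$2"
  if [ -e "$path" ]; then
    echo "ok: $label -> $path"
  else
    echo "missing: $label -> $path" >&2
    return 1
  fi
}

# Paths referenced by the exports and by .looper.yaml in this tutorial.
# ${VAR:-} expands to an empty string if the variable was never exported,
# so an unset variable is reported as "missing" instead of raising an error.
require_path "LOOPERDATA" "${LOOPERDATA:-}" || true
require_path "pipeline interface" "${LOOPERPIPE:-}/pipeline_interface.yaml" || true
```

If either line reports `missing`, re-run the exports from inside `hello_looper-pephub_config/` before continuing.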


## 3. Run it

Run it by changing to the `looper_config` directory and invoking `looper run`, which reads the `.looper.yaml` file found there.

```console
cd ./looper_config; looper run

No project config defined, using: {'config_file': 'databio/looper:default', 'output_dir': '$HOME/hello_looper_results', 'sample_pipeline_interfaces': '$LOOPERPIPE/pipeline_interface.yaml', 'project_pipeline_interfaces': None}. Read from dotfile (/home/bnt4me/virginia/repos/looper/docs_jupyter/hello_looper-pephub_config/looper_config/.looper.yaml).
Looper version: 1.4.0
Command: run
Using default config. No config found in env var: ['DIVCFG']
No config key in Project, or reading project from dict
Processing project from dictionary...
Pipestat compatible: False
## [1 of 5] sample: frog_1; pipeline: count_lines
Writing script to /home/bnt4me/hello_looper_results/submission/count_lines_frog_1.sub
Job script (n=1; 0.00Gb): /home/bnt4me/hello_looper_results/submission/count_lines_frog_1.sub
Compute node: bnt4me-Precision-5560
Start time: 2023-05-01 13:25:48
Number of lines: 4
## [2 of 5] sample: frog_2; pipeline: count_lines
Writing script to /home/bnt4me/hello_looper_resul
```

Voila! You've run your very first pipeline across multiple samples using `looper` and a project from `PEPhub`!

# Exploring the results

Now, let's inspect the `hello_looper` repository you downloaded. It has 3 components, each in a subfolder:

```console
cd ../..
```

```console
tree hello_looper-pephub_config/

hello_looper-pephub_config/
├── data
│   ├── frog1_data.txt
│   ├── frog2_data.txt
│   ├── frog3_data.txt
│   ├── frog4_data.txt
│   └── frog5_data.txt
├── looper_config
├── looper_pipelines.md
├── output.txt
├── pipeline
│   ├── count_lines.sh
│   ├── output_schema.yaml
│   ├── pipeline_interface2.yaml
│   └── pipeline_interface.yaml
└── README.md

3 directories, 12 files
```


These are:

 * `/data` -- contains 5 data files for 5 samples. These input files were each passed to the pipeline.
 * `/pipeline` -- contains the script we want to run on each sample in our project. Our pipeline is a very simple shell script named `count_lines.sh`, which (duh!) counts the number of lines in an input file.
 * `/looper_config` -- contains 1 file, the looper configuration (`.looper.yaml`), which points to the PEPhub project, the pipeline interfaces, and the output directory. This particular config file points to the project at https://pephub.databio.org/databio/looper?tag=default.
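
To make the pipeline's job concrete, here is a minimal sketch of a line-counting script. It is an illustrative stand-in, not the repository's actual `count_lines.sh`; the function name `count_lines` and the loop over `$LOOPERDATA` are assumptions, and the output format simply mirrors the `Number of lines: 4` line in the run log above.

```shell
#!/usr/bin/env bash
# Illustrative stand-in for a line-counting pipeline; NOT the repository's count_lines.sh.
count_lines() {
  local file="$1" n
  n=$(wc -l < "$file") || return 1
  # Arithmetic expansion strips the padding some wc implementations add.
  echo "Number of lines: $((n))"
}

# Apply it to every data file when LOOPERDATA is set (as exported earlier).
if [ -n "${LOOPERDATA:-}" ]; then
  for f in "$LOOPERDATA"/*.txt; do
    if [ -e "$f" ]; then
      count_lines "$f"
    fi
  done
fi
```

The loop at the end does by hand what `looper` automates: it applies the same command to each sample's input file.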




When we invoked `looper run` from the `looper_config` directory, `looper` read the `.looper.yaml` file there, which points to a few things:

 * the `pep_config`, here the PEPhub registry path `databio/looper:default`, which defines the project and its samples
 * the `output_dir`, which is where looper results are saved. Results will be saved in `$HOME/hello_looper_results`.
 * the `pipeline_interface.yaml` file, ([pipeline/pipeline_interface.yaml](https://github.com/pepkit/hello_looper/blob/master/pipeline/pipeline_interface.yaml)), which tells looper how to connect to the pipeline ([pipeline/count_lines.sh](https://github.com/pepkit/hello_looper/blob/master/pipeline/)).

The 3 folders (`data`, `looper_config`, and `pipeline`) are modular; there is no need for these to live in any predetermined folder structure. For this example, the data and pipeline are included locally, but in practice, they are usually in a separate folder; you can point to anything (so data, pipelines, and projects may reside in distinct spaces on disk). You may also include more than one pipeline interface in your configuration, so in a looper project, many-to-many relationships are possible.



## Pipeline outputs

Outputs of pipeline runs will be under the directory specified in the `output_dir` variable under the `metadata` section in the project config file (see [defining a project](defining-a-project.md)). Let's inspect that `project_config.yaml` file to see what it says under `output_dir`:


```console
cat hello_looper-master/project/project_config.yaml

metadata:
  sample_annotation: sample_annotation.csv
  output_dir: $HOME/hello_looper_results
  pipeline_interfaces: ../pipeline/pipeline_interface.yaml
```


Alright, next let's explore what this pipeline stuck into our `output_dir`:


```console
tree $HOME/hello_looper_results

/home/nsheff/hello_looper_results
├── results_pipeline
└── submission
    ├── count_lines.sh_frog_1.log
    ├── count_lines.sh_frog_1.sub
    ├── count_lines.sh_frog_2.log
    ├── count_lines.sh_frog_2.sub
    ├── frog_1.yaml
    └── frog_2.yaml

2 directories, 6 files
```



Inside an `output_dir` there will be two directories:

- `results_pipeline` - a directory with the output of the pipeline(s), for each sample/pipeline combination (often one per sample)
- `submission` - which holds a YAML representation of each sample, plus a submission script (`.sub`) and a log file (`.log`) for each submitted job
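
To check at a glance which jobs have produced a log, the submission folder can be scanned with a few lines of shell. This is a sketch under stated assumptions: the helper name `list_submissions` is hypothetical, and it assumes each `.sub` script has its log beside it under the same base name, as in the listing above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list each submission script and whether a log sits beside it.
list_submissions() {
  local dir="$1" sub log
  for sub in "$dir"/*.sub; do
    [ -e "$sub" ] || continue          # the glob matched nothing
    log="${sub%.sub}.log"              # same base name, .log extension
    if [ -e "$log" ]; then
      echo "$(basename "$sub"): log found"
    else
      echo "$(basename "$sub"): no log yet"
    fi
  done
}

# Output directory used in this tutorial's config.
list_submissions "${HOME}/hello_looper_results/submission"
```

If the directory does not exist yet, the function prints nothing.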

From here, running hundreds of samples of various sample types takes virtually the same effort!



## A few more basic looper options

Looper also provides a few other simple arguments that let you adjust what it does. You can find a [complete reference of usage](usage.md) in the docs. Here are a few of the more common options:

For `looper run`:

- `-d`: Dry run mode (creates submission scripts, but does not execute them) 
- `--limit`: Only run a few samples 
- `--lumpn`: Run several commands together as a single job. This is useful when you have a quick pipeline to run on many samples and want to group them.

There are also other commands:

- `looper check`: checks on the status (running, failed, completed) of your jobs
- `looper summarize`: produces an output file that summarizes your project results
- `looper destroy`: completely erases all results so you can restart
- `looper rerun`: reruns only the jobs that have failed
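
These options combine naturally into a cautious workflow: dry-run a single sample first, and launch the full run only if that succeeds. The wrapper below is a sketch, not a `looper` feature. The function name `preview_then_run` and the `LOOPER_BIN` override are hypothetical, and it assumes `-d` and `--limit` can be passed together as described above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: dry-run one sample, then start the real run if that succeeded.
# LOOPER_BIN is an assumed override (not a looper feature) for substituting a stub.
preview_then_run() {
  local looper_bin="${LOOPER_BIN:-looper}"
  "$looper_bin" run -d --limit 1 && "$looper_bin" run
}
```

Call `preview_then_run` from the directory that holds `.looper.yaml`, so that `looper` picks up the same configuration used earlier in this tutorial.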


## On your own

To use `looper` on your own, you will need to prepare 2 things: a **project** (metadata that define *what* you want to process), and **pipelines** (*how* to process data). To link your project to `looper`, you will need to [define a project](defining-a-project.md). You will want to either use pre-made `looper`-compatible pipelines or link your own custom-built pipelines. These docs will also show you how to connect your pipeline to your project.
