# Analyzing Many Data Files: Mapping Genomic Reads with Minimap2

This tutorial will walk you through how to analyze many data files by submitting a workload of many jobs that can run in parallel. 

We will use a realistic genomics use case: mapping long-read sequencing data with [Minimap2](https://github.com/lh3/minimap2). In the tutorial, you'll work with realistic data and see how high-throughput computing (HTC) can accelerate your genomics workflows, or any workflow that involves analyzing many individual files or pieces of data.

### Learning Goals

* Break down a large computational problem into many independent smaller tasks
* Submit hundreds to thousands of jobs with a few simple commands
* Use the Open Science Data Federation (OSDF) to manage file transfer during job submission



## Getting Ready

Before we begin, let's make sure we are in our `tutorial-minimap2` directory by changing into it and printing our working directory:

In [None]:
cd ~/tutorial-minimap2

In [None]:
pwd

We should see `/home/<username>/tutorial-minimap2`.

## Workload Components

Before thinking about how to run a list of jobs, let's bring the components of our workload (data and software) onto this computer. 

### Data

For the data, we will be using simulated Oxford Nanopore reads from the _Megaptera novaeangliae_ (humpback whale) genome. The reference genome was generated by [Carminati et al. (2024)](https://www.nature.com/articles/s41597-024-03922-9) and simulated reads were generated for this tutorial using [pbsim3](https://github.com/yukiteruono/pbsim3).
<center><img src="notebook_images/whale_acs.webp" width="500px"/></center>

We have a script called `download_data.sh` that will download our bioinformatic data. Let's go ahead and run this script to download our data.

In [None]:
./download_data.sh

Our sequencing data files, all ending in `.fastq`, can now be found in a folder called `inputs`.

Our data has been organized into folders for ease of use. The `inputs` folder contains our sequencing reads. Most of our individual files used to actually run the workflow will be found in the tutorial folder, organized like so: 

```
    ├── tutorial-minimap2
    │   ├── inputs
    │   │   ├── humpback_whale_reads.fastq
    │   ├── outputs
    │   ├── logs
    │   │   ├── log
    │   │   ├── error
    │   │   ├── output
    │   ├── software
    │   │   ├── minimap2.def
```

A few files that will be used many times are placed in a separate location: 
   
```
    ├── /ospool/guest-ap/data/jovyan/
    │   ├── tutorial-minimap2
    │   │   ├── inputs
    │   │   │   ├── humpback_whale_ref_genome.fasta.mmi
    │   │   ├── software
    │   │   │   ├── minimap2_08OCT2025_v1.sif

```

<div class="alert alert-block alert-info">
<b>Note:</b> While this is the directory structure we will be using for this tutorial, you can
    organize your files in whatever way makes sense for you. Just be sure to update the paths in the
    job submission file accordingly. It's important to have a clear organizational structure
    when working with many files and jobs.
</div>

<div class="alert alert-block alert-warning">
<b>OSPool Directory for Tutorial:</b> While you do have an OSPool <code>/ospool/guest-ap/data/jovyan/</code> directory, for this tutorial we will be using
    the shared directory <code>/ospool/uc-shared/public/osg-training/</code> to host our data and software.
    This is because the OSPool directory is not available to OSPool users in the guest-ap account. If you have your own OSPool account, you will have access to your own <code>ospool/ap##/<username>/data/</code> directory to use for your own data and software.
</div>


### Software

The first step of any analysis is to get the software you need. On the OSPool, we recommend using [Apptainer](https://apptainer.org/) containers to package software. Apptainer is a popular containerization technology in the scientific computing community. It allows users to create and run containers that encapsulate software and its dependencies, ensuring consistency across different computing environments.

In this tutorial, we will use a container that uses Anaconda's miniconda3 to install Minimap2. The container was built using the definition file `minimap2.def` located in the `software` folder. This definition file contains all the instructions needed to build the container, and it includes [SAMTools](http://www.htslib.org/) and [BEDTools](https://bedtools.readthedocs.io/en/latest/) as well.

If you already ran the `download_data.sh` script, your software environment has already been set up!

In [None]:
ls software/
ls /ospool/guest-ap/data/jovyan/tutorial-minimap2/software/

In the `software` directory you will find the container definition file `minimap2.def`. To save time, our `download_data.sh` script downloaded a pre-built version of this container and placed it in `/ospool/guest-ap/data/jovyan/tutorial-minimap2/software/minimap2_08OCT2025_v1.sif`. This `.sif` file is the actual container image that we will use to run our jobs.

<div class="alert alert-block alert-info">
<b>Note:</b> If you wanted to replicate the container build, you could do so by using
    the definition file and steps below:
</div>

In [None]:
cat software/minimap2.def

And then running this command: 

```
$ apptainer build minimap2_08OCT2025_v1.sif software/minimap2.def
```

## HTCondor and its List of Jobs/Tasks to Run

HTCondor is a **workload manager** that enables researchers to distribute computing tasks across many machines—an approach known as **High Throughput Computing (HTC)**. Unlike High Performance Computing (HPC), which focuses on tightly coupled parallel jobs, HTC excels at handling **many independent tasks**, each of which can run separately on different nodes. This makes it a perfect fit for bioinformatics workloads like read mapping, assembly, and sequence analysis, where datasets can be partitioned.

When mapping reads, our FASTQ files often contain **tens of millions of reads**. Mapping all of these reads in a single job can take hours or days and often risks failure if a single node crashes. Instead, we can **split the FASTQ file** into smaller chunks—say, 10,000 reads each—and map them *independently*. Each subset becomes its own **Condor job**.

<div class="alert alert-block alert-info">
<b>Note:</b> Sometimes you don't need to split data - it comes as many pieces already! Whatever the case, you want to think about the list of items that need to be processed, in order to analyze the items as many jobs. 
</div>

HTCondor then:
1. Queues these jobs in a **job list**.
2. Sends them to available resources across the OSPool or your local cluster.
3. Monitors for completion or failure.
4. Collects all output files for recombination later.

HTCondor uses this information to automatically execute hundreds or thousands of such tasks efficiently and fault-tolerantly, distributing your tasks across the entire OSPool.

### Generating and Submitting a List of Jobs


#### Example Conceptual Visualization

```text
HTCondor Queue:
 ├── job_001 → maps subset_001.fastq
 ├── job_002 → maps subset_002.fastq
 ├── job_003 → maps subset_003.fastq
 ├── ...
 └── job_100 → maps subset_100.fastq
```

Each of these jobs performs the same mapping operation, just on a different chunk of reads across dozens of machines at once. Once all are complete, you’ll have 100 smaller SAM/BAM files ready to merge into a single final alignment.




![HTCondor List of Jobs](notebook_images/listOfJobs.png)

## Building Our List of Jobs

To run the minimap2 mapping on one sample, the command is:

```
minimap2 -ax map-ont <ref_genome> <reads_fastq> > <output_sam_file>
```

#### Splitting Our FASTQ Reads File

We want to break down our single `humpback_whale_reads.fastq` file into many smaller files, each with 1,000 reads. We can use the `split` command to do this:

In [None]:
split -l 4000 inputs/humpback_whale_reads.fastq --additional-suffix=_humpback_whale_reads.fastq inputs/subset_

This command splits the `humpback_whale_reads.fastq` file into smaller files, each containing 4000 lines (which corresponds to 1000 reads, since each read in a FASTQ file is represented by 4 lines). It adds the prefix `subset_` to each split output file, which will help us in the next step when listing our input files for HTCondor. The output files will be named `subset_aa_humpback_whale_reads.fastq`, `subset_ab_humpback_whale_reads.fastq`, `subset_ac_humpback_whale_reads.fastq`, etc.
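The line arithmetic above can be checked on a toy example before splitting real data. The sketch below is only an illustration: the five-read FASTQ, the temporary directory, and the `_toy.fastq` suffix are made up for this example, and it assumes GNU `split` (for `--additional-suffix`), as the tutorial command does.

```shell
# Build a tiny 5-read FASTQ in a temporary directory (hypothetical data)
workdir=$(mktemp -d)
for i in 1 2 3 4 5; do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$workdir/toy.fastq"

total_lines=$(wc -l < "$workdir/toy.fastq")
reads_per_chunk=2
lines_per_chunk=$((reads_per_chunk * 4))   # 4 lines per FASTQ record

# A FASTQ with 4 lines per record has a line count divisible by 4
if [ $((total_lines % 4)) -ne 0 ]; then
    echo "warning: line count is not a multiple of 4" >&2
fi

# Same split pattern as the tutorial, with a smaller chunk size
split -l "$lines_per_chunk" --additional-suffix=_toy.fastq \
    "$workdir/toy.fastq" "$workdir/subset_"

# Compare the number of chunks produced with the expected count (rounded up)
n_chunks=$(ls "$workdir"/subset_* | wc -l)
expected_chunks=$(( (total_lines + lines_per_chunk - 1) / lines_per_chunk ))
echo "chunks: $n_chunks (expected $expected_chunks)"
```

The last chunk may hold fewer reads than the others when the read count is not an exact multiple of the chunk size; that is fine for mapping, since each read is aligned independently.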

Run the command below to see the files that were created:

In [None]:
ls -l inputs | head -n 5

#### Creating a List of Jobs from Our Subset Files

Now that we have our smaller FASTQ files, we can create a list of jobs to map each subset file to the reference genome using HTCondor. Each job will run the `minimap2` command on one of these subset files.

The easiest way to do this is to use the `ls` command to list all the subset files and save the output to a text file. We will save this list to a file called `listOfReads.txt` in our project directory:

In [None]:
ls inputs/subset_* | xargs -n 1 basename > listOfReads.txt

In [None]:
head listOfReads.txt
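Before using a list like this to drive job submission, it can be worth confirming that every entry points at a real file. The following is a minimal sketch that runs in a temporary directory with made-up file names; in the tutorial you would run the same loop against `inputs/` and `listOfReads.txt`.

```shell
# Set up a toy inputs folder (hypothetical file names)
workdir=$(mktemp -d)
mkdir "$workdir/inputs"
touch "$workdir/inputs/subset_aa_toy.fastq" "$workdir/inputs/subset_ab_toy.fastq"

# Same listing pattern as the tutorial: one bare file name per line
ls "$workdir"/inputs/subset_* | xargs -n 1 basename > "$workdir/listOfReads.txt"

# Count entries whose file cannot be found in inputs/
missing=0
while IFS= read -r name; do
    if [ ! -f "$workdir/inputs/$name" ]; then
        echo "missing: $name" >&2
        missing=$((missing + 1))
    fi
done < "$workdir/listOfReads.txt"

n_jobs=$(wc -l < "$workdir/listOfReads.txt")
echo "$n_jobs entries, $missing missing"
```

The number of lines in the list is also the number of jobs that will be queued later, so it is a useful figure to know before submitting.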

## Submitting Our Jobs with HTCondor

So we want to run the `minimap2` command for each of these samples. Our list of jobs will be based on the list of reads subset files -- we want to submit one job per sample. To do this, we need to make two things:

- a list of our samples (we already have this in `listOfReads.txt`) ✅, and
- a "template" for the jobs we want to run.

Each of our jobs will run the same command, but with a different sample file. The job template will be the HTCondor submit file, a file called `run_minimap2.sub`. It will reference our executable script `run_minimap2.sh`. Now is a good time to open up both of these files and take a look at them.

#### Examining the Job Executable File

Let's start with the executable script:


In [None]:
cat run_minimap2.sh

You may notice that the script `run_minimap2.sh` is a simple bash script that runs the `minimap2` command with three parameters: the reference genome, the input FASTQ file, and the output SAM file. The script uses positional parameters `$1`, `$2`, and `$3` to accept these inputs when the script is executed. This means that when you run the script, you need to provide these three arguments in the correct order. HTCondor will handle passing these parameters to the script when it submits each job using the `arguments` option in the submit file.

The script also includes a `samtools sort` command to convert the SAM output from minimap2 into a sorted BAM file, which is a more efficient format for storing aligned reads. The sorted BAM file is saved with the same base name as the output SAM file but with a `.bam` extension. Sorting your BAM files is a common practice in bioinformatics workflows, as many downstream analysis tools require sorted BAM files for optimal performance. This is especially important when merging multiple BAM files together (for example, the many BAM files you will generate mapping each of your read subsets).
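To make the role of positional parameters concrete, here is a small sketch. It is not the tutorial's `run_minimap2.sh`: `plan_mapping` is a hypothetical helper that only prints the two commands a wrapper like the one described above might run, so it works without minimap2 or samtools installed. The file names passed to it are made up.

```shell
# Hypothetical helper (not the tutorial's run_minimap2.sh): prints the
# commands a wrapper script of this kind might run for one set of inputs.
plan_mapping() {
    ref_genome="$1"     # first argument: reference genome (or its .mmi index)
    reads_fastq="$2"    # second argument: FASTQ file of reads to map
    output_sam="$3"     # third argument: name for the SAM output
    # Same base name as the SAM file, but with a .bam extension
    sorted_bam="${output_sam%.sam}.bam"
    echo "minimap2 -ax map-ont $ref_genome $reads_fastq > $output_sam"
    echo "samtools sort -o $sorted_bam $output_sam"
}

# Arguments are matched to $1, $2, $3 strictly by position
plan=$(plan_mapping ref.mmi subset_aa_toy.fastq mapped_subset_aa.sam)
echo "$plan"
```

Because the arguments are positional, swapping two of them would not raise an error in the wrapper itself; the mistake would only surface when the mapping command runs, which is one reason to test with a single job first.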

#### Examining the Job Submit File

The typical structure of an HTCondor submit file includes several key sections:

##### Example of a Submit File

```plaintext
container_image = <path>                        # Path to the Apptainer container image (.sif file)

executable = <path>                             # Path to the script or program the job will run
arguments = <string>                            # Command-line arguments passed to the executable

transfer_input_files = <list of paths>          # List of input files to be transferred to the execution site

transfer_output_files = <list of paths>         # List of output files to be transferred back to the submit machine
transfer_output_remaps = <key=value pair>       # Remap output file paths

output = <path>                                 # Path to the standard output file
error = <path>                                  # Path to the standard error file
log = <path>                                    # Path to the log file

request_cpus = <int>                            # Number of CPU cores to request
request_disk = <int>                            # Amount of disk space to request (in KB)
request_memory = <int>                          # Amount of memory to request (in MB)

queue
```

To start, we will use a submit file that queues a single job for testing. Let's examine the pre-written submit file for one of our minimap2 jobs:

In [None]:
cat run_minimap2.sub

#### Understanding the HTCondor Submit File

In HTCondor, we describe *what* we want to run and *what files and resources* the job needs in a **submit file**.
This file acts as a set of instructions that HTCondor reads to queue and execute our job on the OSPool.

Let’s break down each part of the submit file below and explain what it does.

##### 🧩 **Container and Executable Setup**

```text
container_image        = osdf:///ospool/uc-shared/public/osg-training/tutorial-minimap2/software/minimap2_08OCT2025_v1.sif
executable             = run_minimap2.sh
arguments              = humpback_whale_ref_genome.fasta.mmi subset_ab_humpback_whale_reads.fastq
```

- **`container_image`**
  Specifies the Apptainer/Singularity image that contains the required software.
  Using containers ensures reproducibility and that all dependencies (like `minimap2`) are available, regardless of which machine runs the job.

- **`executable`**
  The script or command that will be executed inside the container — in this case, our shell script `run_minimap2.sh`.

- **`arguments`**
  These are the command-line arguments passed to the executable. Here, we’re giving it:
  - the **reference genome index** (`humpback_whale_ref_genome.fasta.mmi`), and
  - a **subset of reads** (`subset_ab_humpback_whale_reads.fastq`).

##### 📦 **Input and Output Files**

```text
transfer_input_files   = osdf:///ospool/uc-shared/public/osg-training/tutorial-minimap2/inputs/humpback_whale_ref_genome.fasta.mmi, inputs/subset_ab_humpback_whale_reads.fastq

transfer_output_files  = mapped_subset_ab_humpback_whale_reads.fastq_reads_to_genome_sam_sorted.bam
transfer_output_remaps = "mapped_subset_ab_humpback_whale_reads.fastq_reads_to_genome_sam_sorted.bam = outputs/mapped_subset_ab_humpback_whale_reads.fastq_reads_to_genome_sam_sorted.bam"
```

- **`transfer_input_files`**
  Lists the files HTCondor should send along with the job to the remote machine.
  These are copied automatically before the job starts.

- **`transfer_output_files`**
  Lists which files should be sent back when the job finishes.
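  If this option is set, only the listed files are returned. If it is omitted, HTCondor by default returns any new or changed files from the top level of the job's working directory.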

- **`transfer_output_remaps`**
  Renames or moves the returned output into a specific location in your working directory.
  Here, the output BAM file is placed neatly inside an `outputs/` folder. If the folder doesn’t exist, HTCondor will create it.
  The syntax is `"original_filename = new_path/filename"`.
  _This is especially useful for keeping your workspace organized when running many jobs._

##### 🧾 **Logs and Diagnostics**

```text
output = ./logs/$(Cluster)_$(Process)_mapping_subset_ab_humpback_whale_reads.fastq_step2.out
error  = ./logs/$(Cluster)_$(Process)_mapping_subset_ab_humpback_whale_reads.fastq_step2.err
log    = ./logs/$(Cluster)_mapping_step2.log
```

- **`output`**
  Captures anything the job prints to standard output (`stdout`), such as progress messages.

- **`error`**
  Captures any error messages (`stderr`) if something goes wrong.

- **`log`**
  A job-level event log that records job lifecycle information (when it started, ended, or failed).
  The `$(Cluster)` and `$(Process)` macros are automatically filled in by HTCondor to make each log file unique.

##### ⚙️ **Resource Requests**

```text
request_cpus   = 2
request_disk   = 5 GB
request_memory = 10 GB
```

These tell HTCondor **how much compute and storage** your job needs.
HTCondor will only send your job to a machine that can meet these requirements.

- `request_cpus`: number of CPU cores to allocate
- `request_disk`: how much temporary disk space to provide
- `request_memory`: how much RAM (system memory) to allocate

If you're not sure where to start, a good rule of thumb is to start with 1-2 CPUs and 4-8 GB memory for typical bioinformatics tasks, then adjust based on your job's performance and needs.

##### 🚀 **Queue Command**

```text
queue 1
```

Finally, `queue` tells HTCondor to **submit the job**.
You can think of it like pressing the "start" button. Here we queue **1 job**, but in later examples you might queue hundreds at once!

##### **Submitting the Job**
When you submit it using:

```bash
condor_submit run_minimap2.sub
```

HTCondor will handle **everything else** — transferring files, finding a machine to run the job, monitoring progress, and bringing your results back.


In [None]:
condor_submit run_minimap2.sub

Check on your job status using:

In [None]:
condor_q

### Job Template — Preparing a Scalable Submit File

So far, we’ve seen how to create a submit file that runs **one job** — mapping a single FASTQ subset to a reference genome.
But what if we have **dozens or hundreds of subsets** that we want to map *in parallel*?
Instead of writing and submitting a separate file for each subset, HTCondor allows us to turn our single-job submit file into a **job template** that can automatically scale to many jobs.

#### The Concept: From One Job → Many Jobs

HTCondor’s strength lies in managing many independent tasks.
To do this efficiently, we can make our submit file **scalable** — meaning it can accept **variables** that change for each queued job.
This allows us to define one flexible template and use different input values for each job (for example, different FASTQ subsets).

Let’s start with our original single-job setup:

```shell
arguments = humpback_whale_ref_genome.fasta.mmi subset_ab_humpback_whale_reads.fastq
```

This line runs `run_minimap2.sh` on one subset, which is equivalent to `run_minimap2.sh humpback_whale_ref_genome.fasta.mmi subset_ab_humpback_whale_reads.fastq`.
To scale up, we can replace the hard-coded subset name with a **variable**, such as `$(reads_subset)`.

But what about the reference genome (`humpback_whale_ref_genome.fasta.mmi`)?
In this case, we want to keep it constant across all jobs, so we leave it as is. Reference genomes are typically large files that don’t change between jobs, so it doesn't make sense to make them into a variable.

#### Using Variables

HTCondor allows you to define **custom variables** that can be reused throughout your submit file.
You can then assign different values to these variables for each job when queuing multiple tasks.

For example:

```shell
arguments = humpback_whale_ref_genome.fasta.mmi $(reads_subset)
transfer_input_files = osdf:///ospool/uc-shared/public/osg-training/tutorial-minimap2/inputs/humpback_whale_ref_genome.fasta.mmi, inputs/$(reads_subset)
transfer_output_files = mapped_$(reads_subset)_reads_to_genome_sam_sorted.bam
transfer_output_remaps = "mapped_$(reads_subset)_reads_to_genome_sam_sorted.bam = outputs/mapped_$(reads_subset)_reads_to_genome_sam_sorted.bam"
output = ./logs/$(Cluster)_$(Process)_$(reads_subset)_mapping.out
error  = ./logs/$(Cluster)_$(Process)_$(reads_subset)_mapping.err
```

Now, each time a job is queued, HTCondor will substitute the variable `$(reads_subset)` with the value you provide — just like a placeholder in a template.

You may notice that we did not include the `$(reads_subset)` variable in the log file name. The log file is meant to be shared across all jobs in a submission, and including a job-specific variable would instead create a separate log file for every job, making the workload harder to track. The log file name therefore uses only `$(Cluster)`, which is the same for every job in a submission.

**Our recommendation is to keep a single log file per submit file submission**, which will contain entries for all jobs submitted from that file. The output and error files, on the other hand, should be unique per job, which is why we include `$(Process)` and the `$(reads_subset)` variable in their filenames.

#### Queuing Multiple Jobs with `queue` and Variable Lists

You can now tell HTCondor to queue one job for each subset of data, assigning a different variable value each time.

```text
queue reads_subset from listOfReads.txt
```

The file `listOfReads.txt` might look like this:

```
subset_aa_humpback_whale_reads.fastq
subset_ab_humpback_whale_reads.fastq
subset_ac_humpback_whale_reads.fastq
```

When you run:

```bash
condor_submit run_multi_minimap2.sub
```

HTCondor will automatically create **many jobs**, substituting each `$(reads_subset)` in the template with one line from the file. This queuing strategy works similarly to a `while` or `for` loop in programming.
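The loop analogy can be made explicit with a short simulation. This sketch does not call HTCondor at all; it only mimics how each line of the list would be substituted into the `arguments` line of the template, using a temporary two-line list created for the example. The `process` counter stands in for `$(Process)`, which starts at 0.

```shell
# A toy list of read subsets (stand-in for listOfReads.txt)
list=$(mktemp)
printf '%s\n' \
    subset_aa_humpback_whale_reads.fastq \
    subset_ab_humpback_whale_reads.fastq > "$list"

# Mimic "queue reads_subset from <list>": one iteration per line of the list
process=0
expanded=$(
    while IFS= read -r reads_subset; do
        echo "job $process: arguments = humpback_whale_ref_genome.fasta.mmi $reads_subset"
        process=$((process + 1))
    done < "$list"
)
echo "$expanded"
```

The difference from a real shell loop is that HTCondor does not run the jobs one after another: every expanded job is placed in the queue at once and can run as soon as a machine is available.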

### ⛔ Testing Before Fully Scaling Up ⛔️

Before submitting hundreds of mapping jobs to the OSPool, it’s best practice to **test your workflow with just a few jobs first**. This helps ensure your inputs, paths, and scripts are correct and that your jobs complete successfully within expected time and resource limits.

The easiest way to test your submit file before scaling to the full dataset is by generating a smaller subset of your `listOfReads.txt` file using the `head` command.


In [None]:
head -n 5 listOfReads.txt > testset_of_listOfReads.txt

In [None]:
cat testset_of_listOfReads.txt

Use the `testset_of_listOfReads.txt` file in your submit file to test with just a few jobs. Edit your submit file to reference `testset_of_listOfReads.txt` instead of the full list of reads for the `queue` statement. Here is an example of what your test submit file might look like:

```shell
... Previous submit file content ...
queue reads_subset from testset_of_listOfReads.txt
```

Once you have your test submit file ready, you can submit it to HTCondor using the `condor_submit` command.

In [None]:
condor_submit run_multi_minimap2.sub

Check the job queue using `condor_q`, which prints a snapshot of your jobs' current status. In a terminal, `condor_watch_q` offers a continuously updating view instead.

In [None]:
condor_q

Notice that using a **single submit file**, we now have **multiple jobs in the queue**.

It's always good practice to look at our standard error, standard out, and HTCondor log files to catch unexpected output:

In [None]:
ls -lh logs/
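With many jobs, opening every file by hand is impractical. One possible shortcut is to list only the standard error files that are not empty. The sketch below uses a temporary directory and made-up file names so that it runs on its own; in the tutorial you would point `find` at `logs/` instead.

```shell
# Toy log directory: one empty .err file and one with content (hypothetical names)
logdir=$(mktemp -d)
: > "$logdir/100_0_subset_aa_mapping.err"
echo "example message" > "$logdir/100_1_subset_ab_mapping.err"

# List .err files larger than zero bytes
find "$logdir" -name '*.err' -type f -size +0

# Count them, e.g. to compare against the number of jobs submitted
nonempty=$(find "$logdir" -name '*.err' -type f -size +0 | wc -l)
echo "non-empty error files: $nonempty"
```

A non-empty error file does not necessarily mean the job failed: many tools, minimap2 included, write progress messages to standard error, so the contents still need a quick look.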

### Scaling Up to a List of Jobs

We can now combine our template and our list of samples to generate a list of jobs! See what our new submit file looks like:

In [None]:
cat run_full_minimap2.sub

Two changes have turned our previous submit file into something that can submit many jobs at once:
* We have incorporated our list, `listOfReads.txt` in the `queue` option at the end of the file. There are
different ways to "queue" items from a list; we've chosen `queue <variable> from <list.txt>` as a good all-purpose option.
* Wherever our job template has a value that will be unique for every job (the sample id), we have replaced
the value from our first submit file with a variable, `$(reads_subset)`, which was defined in the queue statement.

One way to think about this file is as an inverted for-loop for submitting jobs, where the for statement `queue reads_subset from listOfReads.txt` is at the bottom of the file and the rest of the file above the for statement is the body of the loop.

We're now ready to submit our list of jobs!

In [None]:
condor_submit run_full_minimap2.sub

We can check on the status of our multiple jobs in HTCondor's queue by using:

In [None]:
condor_q

When ready, we can check our results in our `outputs/` directory:

In [None]:
ls -lh outputs/
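When hundreds of jobs are involved, a listing alone makes it hard to spot a missing result. The sketch below compares a list of read subsets against the expected output names, assuming the `mapped_<reads_subset>_reads_to_genome_sam_sorted.bam` pattern from the submit file and an `outputs/` folder. It runs in a temporary directory with placeholder files; `missing.txt` is a hypothetical file name chosen for this example.

```shell
# Toy setup: three subsets listed, but only two outputs returned (hypothetical)
workdir=$(mktemp -d)
mkdir "$workdir/outputs"
printf '%s\n' subset_aa_toy.fastq subset_ab_toy.fastq subset_ac_toy.fastq \
    > "$workdir/listOfReads.txt"
echo "placeholder" > "$workdir/outputs/mapped_subset_aa_toy.fastq_reads_to_genome_sam_sorted.bam"
echo "placeholder" > "$workdir/outputs/mapped_subset_ac_toy.fastq_reads_to_genome_sam_sorted.bam"

# Record every subset whose BAM file is absent or empty
: > "$workdir/missing.txt"
while IFS= read -r reads_subset; do
    bam="$workdir/outputs/mapped_${reads_subset}_reads_to_genome_sam_sorted.bam"
    if [ ! -s "$bam" ]; then
        echo "$reads_subset" >> "$workdir/missing.txt"
    fi
done < "$workdir/listOfReads.txt"

echo "subsets without output:"
cat "$workdir/missing.txt"
```

Because the file of missing subsets has the same one-name-per-line format as `listOfReads.txt`, it could be used in a `queue reads_subset from ...` statement to rerun only the jobs that did not produce output.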

## Next Steps

Now that you've completed the long-read Minimap2 tutorial on the OSPool, you're ready to adapt these workflows for your own data and research questions. Here are some suggestions for what you can do next:

🧬 Apply the Workflow to Your Own Data
* Replace the tutorial datasets with your own FASTQ files and reference genome.
* Modify the mapping submit files to fit your data size, read type, and resource needs.

🧰 Customize or Extend the Workflow
* Incorporate quality control steps (e.g., filtering or read statistics) using FastQC.
* Use other mappers or variant callers, such as ngmlr, pbsv, or cuteSV.
* Add downstream tools for annotation, comparison, or visualization (e.g., IGV, bedtools, SURVIVOR).

📦 Create Your Own Containers
* Extend the Apptainer containers used here with additional tools, reference data, or dependencies.
* For help with this, see our [Containers Guide](https://portal.osg-htc.org/documentation/htc_workloads/using_software/containers/).

🚀 Run Larger Analyses
* Submit thousands of mappings or alignment jobs across the OSPool.
* Explore data staging best practices using the OSDF for large-scale genomics workflows.
* Consider using workflow managers (e.g., [DAGMan](https://portal.osg-htc.org/documentation/htc_workloads/automated_workflows/dagman-workflows/) or [Pegasus](https://portal.osg-htc.org/documentation/htc_workloads/automated_workflows/tutorial-pegasus/)) with HTCondor.

🧑‍💻 Get Help or Collaborate
* Reach out to [support@osg-htc.org](mailto:support@osg-htc.org) for one-on-one help with scaling your research.
* Attend office hours or training sessions—see the [OSPool Help Page](https://portal.osg-htc.org/documentation/support_and_training/support/getting-help-from-RCFs/) for details.

### Software

In this tutorial, we created a *starter* Apptainer container for Minimap2. This container can serve as a *jumping-off point* if you need to install additional software for your workflows.

Our recommendation for most users is to use "Apptainer" containers for deploying their software.
For instructions on how to build an Apptainer container, see our guide [Using Apptainer/Singularity Containers](https://portal.osg-htc.org/documentation/htc_workloads/using_software/containers-singularity/).
If you are familiar with Docker, or want to learn how to use Docker, see our guide [Using Docker Containers](https://portal.osg-htc.org/documentation/htc_workloads/using_software/containers-docker/).

This information can also be found in our guide [Using Software on the Open Science Pool](https://portal.osg-htc.org/documentation/htc_workloads/using_software/software-overview/).

### Data

The ecosystem for moving data to, from, and within the HTC system can be complex, especially if trying to work with large data (> gigabytes).
For guides on how data movement works on the HTC system, see our [Data Staging and Transfer to Jobs](https://portal.osg-htc.org/documentation/htc_workloads/managing_data/overview/) guides.

### GPUs

The OSPool has GPU nodes available for common use. If you would like to learn more about our GPU capacity, please visit our [GPU Guide on the OSPool Documentation Portal](https://portal.osg-htc.org/documentation/htc_workloads/specific_resource/gpu-jobs/).

## Getting Help

The OSPool Research Computing Facilitators are here to help researchers using the OSPool for their research. We provide a broad swath of research facilitation services, including:

* **Web guides**: [OSPool Guides](https://portal.osg-htc.org/documentation/) - instructions and how-tos for using the OSPool and OSDF.
* **Email support**: get help within 1-2 business days by emailing [support@osg-htc.org](mailto:support@osg-htc.org).
* **Virtual office hours**: live discussions with facilitators - see the [Email, Office Hours, and 1-1 Meetings](https://portal.osg-htc.org/documentation/support_and_training/support/getting-help-from-RCFs/) page for current schedule.
* **One-on-one meetings**: dedicated meetings to help new users and groups get started on the system; email [support@osg-htc.org](mailto:support@osg-htc.org) to request a meeting.

This information, and more, is provided in our [Get Help](https://portal.osg-htc.org/documentation/support_and_training/support/getting-help-from-RCFs/) page.