## HW05 getting data

### Downloading FASTQ files from the SRA

Now comes the exciting part: we are going to get started on your project. The first step is to download data from the Sequence Read Archive (SRA). This notebook walks you through each step of that process using a tool called the SRA Toolkit.

-----------

Sections:

1. Creating a run script to prefetch your FASTQ files
2. Creating a run script to use fasterq-dump to download FASTQ R1 and R2 (forward and reverse reads)
3. Creating a run script to compress (gzip) your FASTQ files
4. Running a launcher script to submit your run scripts to the cluster to run

-----------


### Getting Started

Before we get started you will need to set several variables that we will use throughout this notebook. 

In [None]:
# set the variables for your netid
netid = "MY_NETID"

In [None]:
# Set the working directory and change into this directory
# All files will be downloaded here. All scripts will be written here.
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/05_getting_data"
%cd $work_dir

In [None]:
# Set the fastq directory
# Notice this is the same as the working directory, because we are downloading the fastq files 
# for the first time. We will use this same directory in other Jupyter notebooks for the project.
fastq_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/05_getting_data"

In [None]:
# Next we need to see which "xfile" you have been assigned.
# List the directory to find the name of the file that starts with "x".
# You will use this file for every homework assignment.
!ls x*

In [None]:
# Set the name of your xfile
# replace "MY_XFILE" with the one above
xfile = "MY_XFILE"

### Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. When a script needs these variables, it uses a command called "source" to read the config file and set them.

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export SRA_TOOLKIT=/contrib/singularity/shared/bhurwitz/sra-tools-3.0.3.sif" >> config.sh
!echo "export WORK_DIR=$work_dir" >> config.sh
!echo "export FASTQ_DIR=$fastq_dir" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh
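
If you would rather check the config file with code than by eye, here is a minimal sketch. It assumes the file only contains lines of the form `export NAME=value`, as written above. The helper names `parse_config` and `find_placeholders` are made up for this example, and `xaa` is a hypothetical xfile name.

```python
# Minimal sketch: parse "export NAME=value" lines and flag unset placeholders.
# parse_config and find_placeholders are hypothetical helper names.

def parse_config(text):
    """Return a dict of NAME -> value from lines of the form 'export NAME=value'."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("export ") or "=" not in line:
            continue  # skip blank lines and anything that is not an export
        name, value = line[len("export "):].split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def find_placeholders(settings, placeholders=("MY_NETID", "MY_XFILE")):
    """Return the names whose values still contain a placeholder string."""
    return sorted(
        name for name, value in settings.items()
        if any(p in value for p in placeholders)
    )


# Example text; in the notebook you could pass open("config.sh").read() instead.
example = """export NETID=MY_NETID
export XFILE=xaa
export WORK_DIR=/xdisk/bhurwitz/bh_class/MY_NETID/assignments/05_getting_data
"""
settings = parse_config(example)
print(find_placeholders(settings))
```

Any name printed here still holds a placeholder, which means one of the cells above was run before you filled in your own value.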

### Step 1: Writing a run script to prefetch the FASTQ files for your project

The very first step in downloading data from the Sequence Read Archive (SRA) at NCBI is to "pre-fetch" the data using the SRA toolkit. 

#### Using containers to run bioinformatics tools

We will be running many bioinformatics tools using containers. A container is a lightweight, isolated environment that bundles everything needed to run a piece of code: the bioinformatics tool, its dependencies, and the operating system libraries it relies on. Unlike a full virtual machine, a container shares the kernel of the computer it runs on. Containers allow programmers to "package" up their code so it can be run anywhere (on a laptop, HPC, or in the cloud) without having to reinstall and set everything up locally. Everything is in the container!

The UA HPC requires us to use the apptainer command to create/run our bioinformatics tools in containers. The command to run a container looks something like this:

apptainer run NAME_OF_TOOL command [options and files]

Here is an example for the SRA Toolkit that we will be using here:

```
apptainer run ${SRA_TOOLKIT} prefetch [options and files]
```

The apptainer command can only be run from one of the compute nodes, not the login node. This means that we need to put this code inside a shell script to run it. I call this a "run script". We then use the sbatch command to "launch" this script on the HPC from a second script, which I call the "launcher script".

OK, let's get started by creating our run scripts. These scripts will run our containers (or bioinformatics code).


In [None]:
# Let's create the run script to pre-fetch FASTQ files by using Python to write it for us.

my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7                       
#SBATCH --output=05A_run_prefetch-%a.out
#SBATCH --error=05A_run_prefetch-%a.err
#SBATCH --cpus-per-task=4                    
#SBATCH --mem-per-cpu=2G                            
 
pwd; hostname; date

source ./config.sh
names=($(cat ${FASTQ_DIR}/${XFILE}))
 
echo ${names[${SLURM_ARRAY_TASK_ID}]}

apptainer run ${SRA_TOOLKIT} prefetch ${names[${SLURM_ARRAY_TASK_ID}]}

'''

with open('05A_run_prefetch.sh', mode='w') as file:
    file.write(my_code)
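
To see how each task in the job array picks its own accession, here is a small Python sketch of what `names=($(cat ...))` and `${names[${SLURM_ARRAY_TASK_ID}]}` do in the run script. The accessions are made-up placeholders, and the function names are hypothetical. One difference to keep in mind: bash returns an empty string for an index that is out of range, while this sketch raises an error instead.

```python
# Sketch of how one array task picks its accession.
# The accessions below are made-up placeholders, not real SRA runs.

def accession_for_task(xfile_text, task_id):
    """Mimic names=($(cat file)) followed by ${names[task_id]}."""
    names = xfile_text.split()  # like bash word splitting: whitespace separates entries
    if not 0 <= task_id < len(names):
        # bash would silently give an empty string here; we fail loudly instead
        raise IndexError(f"task {task_id} has no accession; file lists {len(names)}")
    return names[task_id]


def array_spec(xfile_text):
    """Return the --array range that covers every accession, e.g. '0-7' for 8 entries."""
    count = len(xfile_text.split())
    if count == 0:
        raise ValueError("no accessions found")
    return f"0-{count - 1}"


example_xfile = "\n".join(f"SRR000000{i}" for i in range(8))
print(array_spec(example_xfile))             # range to use for #SBATCH --array
print(accession_for_task(example_xfile, 3))  # accession handled by array task 3
```

With 8 accessions in the xfile, the range is `0-7`, which matches the `#SBATCH --array=0-7` line in the run script.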

### Step 2: Writing a run script to download FASTQ files for your project

The prefetch step downloads each run in the SRA's own archive format (.sra), not as FASTQ. Next, we use the fasterq-dump command to convert those files into the FASTQ R1 and R2 files (forward and reverse reads).

Here's the fasterq-dump [documentation](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump).

In [None]:
# Let's create the run script to get the FASTQ files by using Python to write it for us.

my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7                        
#SBATCH --output=05B_run_fasterq-dump-%a.out
#SBATCH --error=05B_run_fasterq-dump-%a.err
#SBATCH --cpus-per-task=4                    
#SBATCH --mem-per-cpu=2G                            
 
pwd; hostname; date

source ./config.sh
names=($(cat ${FASTQ_DIR}/${XFILE}))
 
echo ${names[${SLURM_ARRAY_TASK_ID}]}

apptainer run ${SRA_TOOLKIT} fasterq-dump --split-files ${names[${SLURM_ARRAY_TASK_ID}]}

'''

with open('05B_run_fasterq-dump.sh', mode='w') as file:
    file.write(my_code)
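
Once this step has run, you can get a rough idea of whether the files arrived with a sketch like the one below. It assumes paired-end runs whose files are named `<accession>_1.fastq` and `<accession>_2.fastq`, which is the pattern the gzip script in the next step also relies on. The function names and accessions are placeholders, and this is not the course's hw05_check.ipynb notebook.

```python
import os
import tempfile


def expected_fastq_files(accession):
    """Paired-end names, assuming the <accession>_1.fastq / _2.fastq pattern."""
    return [f"{accession}_1.fastq", f"{accession}_2.fastq"]


def missing_fastq_files(fastq_dir, accessions):
    """Return expected files that are absent, accepting .fastq or .fastq.gz."""
    missing = []
    for accession in accessions:
        for name in expected_fastq_files(accession):
            path = os.path.join(fastq_dir, name)
            if not (os.path.exists(path) or os.path.exists(path + ".gz")):
                missing.append(name)
    return missing


# Demonstration in a temporary directory with placeholder accessions.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["SRR0000001_1.fastq", "SRR0000001_2.fastq", "SRR0000002_1.fastq.gz"]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_fastq_files(demo_dir, ["SRR0000001", "SRR0000002"])

print(demo_missing)
```

In the notebook you could call `missing_fastq_files(fastq_dir, ...)` with the accessions listed in your xfile. An empty list means every expected file is present.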

### Step 3: Writing a run script to compress the FASTQ files for your project

The FASTQ files for your project are huge. To stay in the good graces of our HPC staff and keep our file sizes down so we don't run out of space, we will compress our FASTQ files.

In [None]:
# Let's create a run script that gzips the FASTQ files.
# These are huge files, so it may take some time to run.
# Each array task uses gzip to compress the *.fastq files for one accession in your fastq_dir.

my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7
#SBATCH --output=05C_run_gzip-%a.out
#SBATCH --cpus-per-task=2   
#SBATCH --mem-per-cpu=6G              
 
pwd; hostname; date
source ./config.sh
names=($(cat ${FASTQ_DIR}/${XFILE}))

gzip ${FASTQ_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_*.fastq

'''

with open('05C_run_gzip.sh', mode='w') as file:
    file.write(my_code)
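
To get a feel for what the gzip command does to a file, here is a small Python sketch using the standard `gzip` module. It writes a compressed copy named `<file>.gz` and then removes the original, which is also how the gzip command behaves by default. The helper name `gzip_in_place` and the tiny FASTQ record are made up for this example, and real reads will not compress as well as this repeated record does.

```python
import gzip
import os
import shutil
import tempfile


def gzip_in_place(path):
    """Compress path to path + '.gz' and remove the original file."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return gz_path


# One made-up FASTQ record (4 lines), repeated to build a small test file.
record = "@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
original_text = record * 1000

with tempfile.TemporaryDirectory() as demo_dir:
    fastq_path = os.path.join(demo_dir, "example_1.fastq")
    with open(fastq_path, "w") as handle:
        handle.write(original_text)
    size_before = os.path.getsize(fastq_path)

    gz_path = gzip_in_place(fastq_path)
    size_after = os.path.getsize(gz_path)
    original_removed = not os.path.exists(fastq_path)

    # Read the compressed file back to confirm nothing was lost.
    with gzip.open(gz_path, "rt") as handle:
        round_trip_text = handle.read()

print(size_before, size_after)
```

Many bioinformatics tools can read `.fastq.gz` files directly, so compressing them usually does not add an extra step later.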

### Step 4: Putting it all together

Once you have created the run scripts, you are ready to put them together in a pipeline that runs them one after another. Each run script will be a "job", and each job will wait for the one before it to finish before starting.

For example, the 05A_run_prefetch job needs to finish before we can run the 05B_run_fasterq-dump job. To do this, we will set up dependencies in our "launcher script". Also, notice that each job is a job array, meaning that it is made up of multiple tasks. In our case, each job array has 8 elements, one for each accession we are running through that step.

In [None]:
# Let's create the launcher script to kick off our pipeline.

my_code = '''#! /bin/bash

# 05A_run_prefetch: first job - no dependencies
job1=$(sbatch 05A_run_prefetch.sh)
jid1=$(echo $job1 | sed 's/^Submitted batch job //')
echo $jid1

# 05B_run_fasterq-dump: jid2 depends on jid1
job2=$(sbatch --dependency=afterok:$jid1 05B_run_fasterq-dump.sh)
jid2=$(echo $job2 | sed 's/^Submitted batch job //')
echo $jid2

# 05C_run_gzip: jid3 depends on jid2
job3=$(sbatch --dependency=afterok:$jid2 05C_run_gzip.sh)
jid3=$(echo $job3 | sed 's/^Submitted batch job //')
echo $jid3

'''

with open('05_launch_pipeline.sh', mode='w') as file:
    file.write(my_code)
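
The `sed` commands in the launcher script strip the text "Submitted batch job " from what sbatch prints, which leaves only the job ID. Here is the same idea as a Python sketch. The function names and the job ID are made up for this example.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from a line like 'Submitted batch job 1234567'."""
    match = re.match(r"^Submitted batch job (\d+)", sbatch_output.strip())
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return match.group(1)


def dependency_flag(job_id):
    """Build the option that makes the next job wait for job_id to succeed."""
    return f"--dependency=afterok:{job_id}"


job_id = parse_job_id("Submitted batch job 1234567")  # made-up job ID
print(job_id)
print(dependency_flag(job_id))
```

Because the dependency type is `afterok`, the next job only starts if the job it depends on finishes successfully. sbatch also has a `--parsable` option that prints the job ID without the surrounding text, which could replace the `sed` step.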

In [None]:
# Make the launcher script and the run scripts executable
!chmod +x *.sh

In [None]:
# now let's run it!
!./05_launch_pipeline.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
# Notice that 05B jobs are dependent on 05A jobs finishing.
!squeue --user=$netid

### What happens next?

Your code will take a little time to get "picked up" by the HPC and move from PD (pending) to R (running). Come back in about a day to double check you got all of the raw sequence files using the hw05_check.ipynb notebook. But, for now, relax and enjoy your day!

### The End! 
Be sure to copy your notebook into the class project directory (see below).

In [None]:
!cp ~/be487-fall-2024/assignments/05_getting_data/hw05_getting_data.ipynb $work_dir

-----