## 01_getting_data

### Downloading FASTQ files from the SRA

Now comes the exciting part: we are going to get started on your project. The first step is to download data from the Sequence Read Archive (SRA). This notebook walks you through each step of that process using the SRA Toolkit.

-----------

Sections:

1. Pre-fetching your FASTQ files
2. Using fasterq-dump to download the FASTQ R1 and R2 (forward and reverse) read files, then compressing them
3. Putting the steps together in a pipeline
4. Checking that your FASTQ files have been downloaded

-----------


### Getting Started

Before we get started you will need to set several variables that we will use throughout this notebook. 

You will need to rerun this section each time you come back to this notebook to reset the variables.

In [None]:
# Set the variables for your netid and xfile (the file that lists your SRA accessions)
netid = "MY_NETID"
xfile = "MY_XFILE"
file_count = 100

In [None]:
# Set the working directory and change into this directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/project/01_getting_data"
%cd $work_dir

In [None]:
# Set the fastq directory
fastq_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/project/01_getting_data"

### Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. When we want to use these variables in a script, we will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export FILE_COUNT=$file_count" >> config.sh
!echo "export SRA_TOOLKIT=/contrib/singularity/shared/bhurwitz/sra-tools-3.0.3.sif" >> config.sh
!echo "export WORK_DIR=$work_dir" >> config.sh
!echo "export FASTQ_DIR=$fastq_dir" >> config.sh
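
If you prefer to stay in Python, the same kind of config file can be assembled as text before it is written to disk. This is only a sketch: `build_config_lines` is a hypothetical helper that is not part of the class materials, and the values shown are the placeholders from the cells above.

```python
def build_config_lines(settings):
    """Return one 'export KEY=value' line per setting, in insertion order."""
    return [f"export {key}={value}" for key, value in settings.items()]


# Placeholder values; in the notebook these come from the variables set earlier.
example_settings = {
    "NETID": "MY_NETID",
    "FILE_COUNT": 100,
}

config_text = "\n".join(build_config_lines(example_settings)) + "\n"
print(config_text, end="")
```

Sourcing a file with these lines in a shell script makes `${NETID}` and `${FILE_COUNT}` available to every command that follows.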

## Step 1: Prefetching the FASTQ files for your project

The very first step in downloading data from the Sequence Read Archive (SRA) at NCBI is to "pre-fetch" the data using the SRA toolkit. 

### Using containers to run bioinformatics tools

We will be running many bioinformatics tools using containers. Containers are virtual environments that contain all of the necessary components to run the code. This includes the operating system, the tool, and any dependencies. Containers allow programmers to "package" up their code, so it can be run anywhere (on a laptop, HPC, or in the cloud) without having to reinstall and set everything up to run the code there locally. Everything is in the container!

The UA HPC requires us to use the apptainer command to create/run our bioinformatics tools in containers. The command to run a container looks something like this:

```
apptainer run ${SRA_TOOLKIT} prefetch [options and files]
```

Because the apptainer command can only be run from one of the compute nodes, not the login node, we have to put this code inside a shell script to run it. We then use the sbatch command to "launch" this script on the HPC.

It looks something like this...

```
sbatch run_script.sh
```
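
When sbatch accepts a job, it prints a line such as `Submitted batch job 12345`. The job ID in that line is what you use later to look the job up or to make another job wait for it. A minimal sketch of pulling the ID out of that message, assuming the standard message format (`parse_job_id` is a hypothetical helper and the job ID is made up):

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return match.group(1)


print(parse_job_id("Submitted batch job 12345\n"))  # 12345
```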

OK, let's get started by creating our "run_scripts".


In [None]:
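# Let's create the run script to pre-fetch the FASTQ files by using Python to write it for us.
# Slurm does not expand shell variables such as ${FILE_COUNT} inside #SBATCH lines,
# so Python fills in the last array index when it writes the script.
# This assumes file_count is the number of accessions, so task IDs run from 0 to file_count - 1.

my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-__LAST_TASK_ID__
#SBATCH --output=01A_run_prefetch-%a.out
#SBATCH --error=01A_run_prefetch-%a.err
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G

pwd; hostname; date

# Load the variables defined in config.sh
source ./config.sh

# Read the accession list into an array; each array task handles one entry
names=($(cat ${FASTQ_DIR}/${XFILE}))

echo ${names[${SLURM_ARRAY_TASK_ID}]}

apptainer run ${SRA_TOOLKIT} prefetch ${names[${SLURM_ARRAY_TASK_ID}]}
'''

with open('01A_run_prefetch.sh', mode='w') as file:
    file.write(my_code.replace('__LAST_TASK_ID__', str(file_count - 1)))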
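
Each array task picks one accession from the list by using its task ID as an index into the `names` array. The lookup that the script performs can be sketched in Python. The accession names below are made up for illustration, and `accession_for_task` is a hypothetical helper.

```python
def accession_for_task(accessions, task_id):
    """Mimic ${names[${SLURM_ARRAY_TASK_ID}]}: task IDs index the list from 0."""
    if not 0 <= task_id < len(accessions):
        raise IndexError(
            f"task {task_id} has no accession (list has {len(accessions)} entries)"
        )
    return accessions[task_id]


# Hypothetical contents of an xfile, split on whitespace like the bash array.
xfile_text = "ERR0000001\nERR0000002\nERR0000003\n"
accessions = xfile_text.split()

for task_id in range(len(accessions)):
    print(task_id, accession_for_task(accessions, task_id))
```

With N accessions, the valid task IDs are 0 through N - 1. A task ID of N would find nothing to download.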

## Step 2: Downloading FASTQ files for your project

Now that you have pre-fetched all of your FASTQ files, we are ready to download them. 

We will use the fasterq-dump command to get the FASTQ R1 and R2 files. 

Here's the fasterq-dump [documentation](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump).

In [None]:
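# Let's create the script to get the FASTQ files by using Python to write it for us.
# Slurm does not expand shell variables such as ${FILE_COUNT} inside #SBATCH lines,
# so Python fills in the last array index when it writes the script.
# This assumes file_count is the number of accessions, so task IDs run from 0 to file_count - 1.

my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-__LAST_TASK_ID__
#SBATCH --output=01B_run_fasterq-dump-%a.out
#SBATCH --error=01B_run_fasterq-dump-%a.err
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G

pwd; hostname; date

# Load the variables defined in config.sh
source ./config.sh

# Read the accession list into an array; each array task handles one entry
names=($(cat ${FASTQ_DIR}/${XFILE}))

echo ${names[${SLURM_ARRAY_TASK_ID}]}

# --split-files writes the forward and reverse reads to separate files
apptainer run ${SRA_TOOLKIT} fasterq-dump --split-files ${names[${SLURM_ARRAY_TASK_ID}]}
'''

with open('01B_run_fasterq-dump.sh', mode='w') as file:
    file.write(my_code.replace('__LAST_TASK_ID__', str(file_count - 1)))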

### Compressing the FASTQ files for your project


In [None]:
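# Let's create a script that gzips all of the FASTQ files.
# These are huge files, so it may take 2 hours to run.
# This script uses gzip to compress each of the *.fastq files in your fastq_dir.
# Slurm does not expand shell variables such as ${FILE_COUNT} inside #SBATCH lines,
# so Python fills in the last array index (file_count - 1) when it writes the script.
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-__LAST_TASK_ID__
#SBATCH --output=Job-gzip-%a.out
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

pwd; hostname; date

# Load the variables defined in config.sh
source ./config.sh

# Compress the read files that belong to this task's accession
names=($(cat ${FASTQ_DIR}/${XFILE}))
gzip ${FASTQ_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_*.fastq
'''

with open('01C_run_gzip.sh', mode='w') as file:
    file.write(my_code.replace('__LAST_TASK_ID__', str(file_count - 1)))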
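
To make clear what the gzip command does to each file, here is a rough Python equivalent: it writes a compressed copy with a `.gz` suffix and then removes the original. This is a sketch for illustration only. `gzip_file` is a hypothetical helper, and the pipeline itself uses the gzip command, not this code.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(path):
    """Compress path to path + '.gz' and remove the original, like the gzip command."""
    source = Path(path)
    target = source.with_name(source.name + ".gz")
    # Stream the data so large FASTQ files are never loaded into memory at once.
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    source.unlink()
    return target
```

After compression, a file such as `ERR0000001_1.fastq` (a made-up name) becomes `ERR0000001_1.fastq.gz`, so any later command has to look for the `.fastq.gz` suffix.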

## Step 3: Putting it all together

Once you have created the run scripts, you are ready to put them together in a pipeline that runs each of the steps one by one.

Note that the 01A_run_prefetch jobs need to finish before we can kick off the 01B_run_fasterq-dump jobs, and those need to finish before the 01C_run_gzip jobs can compress their output. To do this, we will need to set up dependencies in our "launch script".
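
One common way to express such dependencies in Slurm is with two sbatch options: `--parsable`, which prints the job ID in a machine-readable form, and `--dependency=afterok:<jobid>`, which holds a job until the named job has finished successfully. The sketch below only builds the command lines and does not submit anything. `dependent_sbatch_command` is a hypothetical helper, and the job ID is made up.

```python
def dependent_sbatch_command(script, after_job_id=None):
    """Build an sbatch command that optionally waits for another job to succeed."""
    cmd = ["sbatch", "--parsable"]
    if after_job_id is not None:
        cmd.append(f"--dependency=afterok:{after_job_id}")
    cmd.append(script)
    return cmd


# The first step has no dependency; the second waits for the (made-up) job 12345.
print(" ".join(dependent_sbatch_command("01A_run_prefetch.sh")))
print(" ".join(dependent_sbatch_command("01B_run_fasterq-dump.sh", after_job_id="12345")))
```

In a real launch script, the job ID comes from the output of the previous sbatch call and is not typed in by hand.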

In [None]:
# Let's create the launcher script to kick off our pipeline.

my_code = '''                     
 


'''

with open('01_launch_pipeline.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# now let's run it!
!sbatch ./01_launch_pipeline.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
# Notice that 01B jobs are dependent on 01A jobs finishing.
!squeue --user=$netid

## Step 4: Checking your FASTQ files

Your jobs will take a little time to get "picked up" by the HPC and move from PD (pending) to R (running). Be sure to come back and check your directory to confirm that you have R1 and R2 files for each of your accessions. You can do this by returning to this notebook or by using the shell.
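
Checking many accessions by eye is error-prone, so a small script can help. The sketch below assumes that the read files are named `<accession>_1.fastq` and `<accession>_2.fastq`, optionally with a `.gz` suffix. That naming is an assumption about paired-end output from `fasterq-dump --split-files`, so adjust it if your files differ. `missing_read_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_read_files(fastq_dir, accessions, suffixes=(".fastq", ".fastq.gz")):
    """Return read-file stems (e.g. 'ERR0000001_2') that have no non-empty file."""
    fastq_dir = Path(fastq_dir)
    missing = []
    for accession in accessions:
        for read in ("1", "2"):
            stem = f"{accession}_{read}"
            candidates = [fastq_dir / (stem + suffix) for suffix in suffixes]
            # A read file counts only if it exists and has a size greater than 0.
            if not any(p.is_file() and p.stat().st_size > 0 for p in candidates):
                missing.append(stem)
    return missing
```

Calling it with your FASTQ directory and the list of accessions from your xfile returns an empty list when every R1 and R2 file is present and non-empty.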

In [None]:
# Go into the directory where you downloaded your FASTQ data
netid = "MY_NETID"
%cd /xdisk/bhurwitz/bh_class/$netid/project/01_getting_data

In [None]:
# Check to see if you have an R1 and R2 file for each of your accessions (10 total). Do they have a size > 0?
!ls -l

In [None]:
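# Check the files to see if you have FASTQ formatted data.
# The pipeline gzips the FASTQ files, so decompress on the fly and look at the first record.
!zcat ERR*.fastq.gz | head -4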
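
A FASTQ record is four lines long: a header that starts with `@`, the sequence, a separator line that starts with `+`, and a quality string of the same length as the sequence. A minimal sketch of that check follows. `looks_like_fastq_record` is a hypothetical helper and the example read is made up.

```python
def looks_like_fastq_record(lines):
    """Check the four-line FASTQ layout: @header, sequence, +, qualities."""
    if len(lines) != 4:
        return False
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    return (
        header.startswith("@")
        and len(sequence) > 0
        and separator.startswith("+")
        and len(quality) == len(sequence)
    )


example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
print(looks_like_fastq_record(example))  # True
```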

Great Job! Be sure to copy your notebook to your work directory.

In [None]:
cp ~/01_getting_data.ipynb /xdisk/bhurwitz/bh_class/$netid/project/01_getting_data

-----