# SRA Toolkit

### Questions:
- How can I download FASTQ files from the SRA in an automated way?
- What options are available for downloading sequence data via the SRA Toolkit?
### Objectives:
- Download sequence data for our example project.
### Keypoints:
- The SRA Toolkit allows you to download FASTQ data.
- The `fasterq-dump` command has multiple options for downloading sequence data.
- Typically, we use the `--split-3` option (the default) for `fasterq-dump` to split the reads into R1/R2 files for forward and reverse reads.

## The Sequence Read Archive (SRA)
In the last exercise, we learned that one of the biggest repositories for genomic/metagenomic data is the National Center for Biotechnology Information (NCBI). However, accessing these data can be tricky and requires knowledge of how the data are organized. Users need to jump from the `Bioproject` page to the `Biosample` page, and then to the `SRA experiments`. This is not trivial, and downloading all of the data you need from a web page would take a long time. Plus, downloading the data to a laptop and then uploading it to the HPC is cumbersome.

Instead, we can download the data directly to the HPC using a bioinformatics tool called sra-tools, also known as the `SRA Toolkit`. Let's explore this...

In [None]:
netid = "YOUR_NETID"
%cd /xdisk/bhurwitz/bh_class/$netid/exercises/05_getting_data

In [None]:
my_data = '''Accessions
SRR10153499
SRR10153504
SRR10153506
SRR10153508
SRR10153510
SRR10153512
SRR10153514
SRR10153573
SRR10153500
SRR10153501
SRR10153502
SRR10153503
SRR10153505
SRR10153507
SRR10153509
SRR10153511
SRR10153513
SRR10153515
'''
with open('SRA-accessions.txt', mode='w') as file:
    file.write(my_data)

In [None]:
!cat SRA-accessions.txt

### Using fasterq-dump to download the data

We will use `fasterq-dump` from the SRA Toolkit to retrieve data from the SRA. Let's see 
some of the parameters that this tool offers:

In [None]:
!apptainer run /contrib/singularity/shared/bhurwitz/sra-tools-3.0.3.sif fasterq-dump --help

Wow! There are a lot of options. First, let's focus on the split options:

```
-s|--split-spot                  split spots into reads
-S|--split-files                 write reads into different files
-3|--split-3                     writes single reads into special file
   --concatenate-reads           writes whole spots into one file
```

#### --split-spot (-s)

This flag generates a single file that contains all the reads in the library, 
whether they were sequenced in the forward or the reverse direction. Each "spot" is like a piece of DNA.
Each read is written with the usual four lines of the `FASTQ` format.


#### --split-files (-S)

With this option, we end up with separate files for the forward and the reverse reads
(1.fastq and 2.fastq, respectively). However, the unmated reads (those present in
the forward file without a partner in the reverse file, and vice versa) are also written
to their respective files. This can be useful for k-mer based analyses, but we usually
prefer to exclude the unmated reads when assembling genomes from metagenomes. This option
writes each read with the usual four lines of the `FASTQ` format.

#### --concatenate-reads

The reads of each spot (forward and reverse) are concatenated, and each spot is written
as a single record with the four lines characteristic of the `FASTQ` format.

#### --split-3 (-3)

This is the default option for `fasterq-dump`. The source file is split into a file containing the 
forward reads (_i.e._ 1.fastq) and a file containing the reverse reads (_i.e._ 2.fastq). Unmated reads are placed in 
a 3.fastq or SRA-code-name file. Each read is written with the four characteristic lines of the `FASTQ` format.
Most sequencing projects now use paired-end reads. This is also the case for the reads that we
will use, so this is the right option for our datasets.
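
To make the difference between the four split options concrete, here is a toy model in Python. It is only a sketch of the behavior described above, not how `fasterq-dump` is implemented: a spot is modeled as a list of read sequences, and the output names (`single_file`, `file_1`, `file_2`, `unmated`) are placeholders invented for this illustration.

```python
# Toy model only: this is NOT fasterq-dump's implementation.
# A "spot" is a list of reads: two reads for a mated pair, one for an unmated read.
# Output names are hypothetical placeholders, not real file names.

def split_spot(spots):
    """--split-spot: one output; every read becomes its own record."""
    return {"single_file": [read for spot in spots for read in spot]}


def split_files(spots):
    """--split-files: the N-th read of every spot goes to output N."""
    files = {}
    for spot in spots:
        for n, read in enumerate(spot, start=1):
            files.setdefault(f"file_{n}", []).append(read)
    return files


def split_3(spots):
    """--split-3: mated reads go to outputs 1 and 2, unmated reads go elsewhere."""
    files = {"file_1": [], "file_2": [], "unmated": []}
    for spot in spots:
        if len(spot) == 2:
            files["file_1"].append(spot[0])
            files["file_2"].append(spot[1])
        else:
            files["unmated"].extend(spot)
    return files


def concatenate_reads(spots):
    """--concatenate-reads: the reads of a spot are joined into one record."""
    return {"single_file": ["".join(spot) for spot in spots]}


# Two mated pairs and one unmated read (made-up sequences).
spots = [["AAAA", "TTTT"], ["CCCC"], ["GGGG", "ACGT"]]
for mode in (split_spot, split_files, split_3, concatenate_reads):
    print(mode.__name__, mode(spots))
```

Notice that in this model only `split_3` keeps the unmated read `CCCC` out of the forward file, which is why it suits paired-end assembly work.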


Typically, we use `--skip-technical` so we don't bother downloading a third read file with the barcodes and primers.


Let's run one example with the first accession from our list: SRR10153499. We will use the 
`--stdout` option, so that the output is displayed in the terminal. We will also use `head`, one of the commands that we 
reviewed in past lessons, to show only the first eight lines.

In [None]:
!apptainer run /contrib/singularity/shared/bhurwitz/sra-tools-3.0.3.sif fasterq-dump -s --stdout SRR10153499 --skip-technical | head -n 8

You should see something like this:

```
@SRR10153499.1 1 length=250
TACGGAGGATACGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGTGGTCCTGCAAGTCAGTGGTGAAAAGCTGAGGCTCAACCTCAGCCTTGCCGTTGAAACTGCAAGACTTGAGAGTACATGATGTGGGCGGAATGCGTAGTGTAGCGGTGAAATGCATAGATATTACGCAGAACTCCGATTGCGAAGGCAGCTCACAAAGGTATATCTGACACTGAGGCACGAAAGCGTGGGGAGCAAAC
+SRR10153499.1 1 length=250
CCCCCCBCCFFFGGGGGGGGGGHGGGGGHHHHHHHGGGHHHHHGHGGGGGGGHHGGHHHHHHHHHHHHHHHHGHHHHHHGHGHGFHHHHHHHHHHHGHHHHHGGGGHHHHHHHHHHGHHHHHHGHHHHHHHHHHHHHGGGGGGGHHGGGGGHHHHHGGGGGGHHHHFHHHHHFHHHHGGGGGGGFGGGGGAEGGGBGGFFFFFFFFBFAFF0BFFFFFFFFFFFFFFFFFFFFFFFEFFFFFFFFB9:BA
@SRR10153499.1 1 length=203
CCTGTTTGCGCCCCACGCTTTCGTGCCTCAGTGTCAGATATACCTTTGTGAGCTGCCTTCGCAATCGGAGTTCTGCGTAATATCTATGCATTTCACCGCTACACTACGCATTCCGCCCACATCATGTACTCTCAAGTCTTGCAGTTTCAACGGCAAGGCTGAGGTTGAGCCTCAGCTTTTCACCACTGACTTGCAGGACCACC
+SRR10153499.1 1 length=203
>AAA1B3C@1AAGEGGGGGF0B0BBAFG1FGFGGHHHHHDGBEGFHHHDADAFGFGHFGGFF//EFFCBEFHBGHFFEAEGHFHGBFGF2GHHHEGHGGG>EHFHHFGG?EFHFGCA@CGGHHFGFH2FFHGFFHHGHHHHHHFHHHHFH1ACC@-
```
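
The four-line record structure shown above can be read back in Python. The following is a minimal sketch, not a replacement for a dedicated FASTQ library; the function name `parse_fastq` is made up for this example, and the demo records are the two reads above truncated to a few bases (with `length=` edited to match) so they fit on screen.

```python
from itertools import islice


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    # Drop blank lines and trailing newlines before grouping into records.
    cleaned = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        record = list(islice(cleaned, 4))  # one FASTQ record = four lines
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, plus, quality = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


# Truncated copies of the two reads shown above (illustration only).
example = """@SRR10153499.1 1 length=8
TACGGAGG
+SRR10153499.1 1 length=8
CCCCCCBC
@SRR10153499.1 1 length=6
CCTGTT
+SRR10153499.1 1 length=6
>AAA1B
"""

for header, sequence, quality in parse_fastq(example.splitlines()):
    print(header, len(sequence), "bases")
```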

As mentioned before, the files will be in `FASTQ` format. `fasterq-dump` is fast because it can use multiple threads. We can choose how many threads `fasterq-dump` uses for the task; in general, more threads means less time. Let's check the number of threads available on our compute node:

In [None]:
!nproc --all
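
If you would rather pick the thread count from Python, a small helper like the one below can cap a requested value at what the node reports. This is a sketch: `choose_threads` is a hypothetical name, and `os.cpu_count()` reports the CPUs on the node, which (like `nproc --all`) may be more than the scheduler actually allocated to your job.

```python
import os


def choose_threads(requested, available=None):
    """Return a thread count between 1 and the number of available CPUs."""
    if available is None:
        # os.cpu_count() can return None, so fall back to a single thread.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


print("Threads to pass to fasterq-dump -e:", choose_threads(12))
```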

Since `fasterq-dump` does not take multiple accessions at once, we build a command that uses a `while` loop to process all the accessions in `SRA-accessions.txt`.

First, notice that our `SRA-accessions.txt` file has a header, "Accessions":

In [None]:
!cat SRA-accessions.txt

We can remove the header by using the `sed` command to skip the first row:

In [None]:
!cat SRA-accessions.txt | sed -n '1!p'
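
The same header-skipping step can be done in Python. This is a minimal sketch; `read_accessions` is a name invented for this example, and it assumes the first line of the file is always the header, as it is in our `SRA-accessions.txt`.

```python
from pathlib import Path


def read_accessions(path):
    """Return the accessions in a one-column text file, skipping the header line."""
    with open(path) as handle:
        lines = [line.strip() for line in handle]
    # lines[0] is the "Accessions" header; also drop any blank lines.
    return [line for line in lines[1:] if line]


# Only runs if the file from this lesson is in the current directory.
if Path("SRA-accessions.txt").exists():
    accessions = read_accessions("SRA-accessions.txt")
    print(len(accessions), "accessions, starting with", accessions[:3])
```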

Next, we can add a `while` loop to go through each accession in the list and retrieve the FASTQ files with `fasterq-dump` from sra-tools.

Also note that I can change the number of threads `fasterq-dump` uses with the `-e` flag, to make it even faster. I am going to use 12 here.

Note that this command takes a couple of minutes to run.

In [None]:
!cat SRA-accessions.txt | sed -n '1!p'| while read line; do apptainer run /contrib/singularity/shared/bhurwitz/sra-tools-3.0.3.sif fasterq-dump $line -e 12; done

Wasn't that cool! You should see something like this:

```
spots read      : 14,467
reads read      : 28,934
reads written   : 28,934
spots read      : 13,557
reads read      : 27,114
reads written   : 27,114
...
```
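
A quick sanity check on this summary: for paired-end data, each spot yields two reads, so `reads read` should be twice `spots read` (14,467 × 2 = 28,934). The sketch below parses the summary text and checks that relationship; `parse_summary` is a hypothetical helper, and it assumes the three-line layout shown above.

```python
import re

SUMMARY_LINE = re.compile(r"\s*(spots read|reads read|reads written)\s*:\s*([\d,]+)")


def parse_summary(text):
    """Return one dict per run from fasterq-dump-style summary lines."""
    runs, current = [], {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if not match:
            continue
        key = match.group(1)
        value = int(match.group(2).replace(",", ""))
        # A new "spots read" line marks the start of the next run.
        if key == "spots read" and current:
            runs.append(current)
            current = {}
        current[key] = value
    if current:
        runs.append(current)
    return runs


summary = """spots read      : 14,467
reads read      : 28,934
reads written   : 28,934
spots read      : 13,557
reads read      : 27,114
reads written   : 27,114
"""

for run in parse_summary(summary):
    paired = run["reads read"] == 2 * run["spots read"]
    print(run, "-> two reads per spot:", paired)
```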

In [None]:
!ls -lh

You should see this:

```
total 208M
-rw-r--r-- 1 bhurwitz bh_class  227 Sep 14 08:58 SRA-accessions.txt
-rw-r--r-- 1 bhurwitz bh_class 7.8M Sep 14 09:24 SRR10153500_1.fastq
-rw-r--r-- 1 bhurwitz bh_class 6.7M Sep 14 09:24 SRR10153500_2.fastq
-rw-r--r-- 1 bhurwitz bh_class 7.2M Sep 14 09:25 SRR10153501_1.fastq
-rw-r--r-- 1 bhurwitz bh_class 6.3M Sep 14 09:25 SRR10153501_2.fastq
-rw-r--r-- 1 bhurwitz bh_class 5.5M Sep 14 09:25 SRR10153502_1.fastq
-rw-r--r-- 1 bhurwitz bh_class 4.7M Sep 14 09:25 SRR10153502_2.fastq
...
```
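
For readers who prefer to drive the downloads from Python rather than a shell pipeline, the download loop could be sketched as follows. The container path and the `-e` thread flag come from this lesson; the function names `build_command` and `download_all` and the `dry_run` switch are assumptions made for this illustration. With `dry_run=True` (the default here) the commands are only printed, so nothing is downloaded.

```python
import subprocess

# Container path used throughout this lesson.
SIF = "/contrib/singularity/shared/bhurwitz/sra-tools-3.0.3.sif"


def build_command(accession, threads=12, sif=SIF):
    """Build the apptainer + fasterq-dump command for one accession."""
    return ["apptainer", "run", sif, "fasterq-dump", accession, "-e", str(threads)]


def download_all(accessions, threads=12, dry_run=True):
    """Run (or, with dry_run=True, just print) one fasterq-dump call per accession."""
    for accession in accessions:
        command = build_command(accession, threads)
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the loop if a download fails.
            subprocess.run(command, check=True)


download_all(["SRR10153499", "SRR10153504"])
```

Passing the command as a list avoids shell quoting issues, and `check=True` makes a failed download visible instead of silently moving on to the next accession.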

In [None]:
!ls SRR* | wc -l

Now we have 36 files (2 for each of the 18 samples, with forward and reverse reads) that we will use in the next 
lessons. 
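
Counting files tells us the total is right, but not which sample is incomplete if it is wrong. The sketch below checks that every accession has both its `_1.fastq` and `_2.fastq` file, following the naming seen in the `ls` output above; `missing_pairs` is a hypothetical helper name.

```python
from pathlib import Path


def missing_pairs(accessions, directory="."):
    """Map each accession to the list of its expected FASTQ files that are absent."""
    directory = Path(directory)
    missing = {}
    for accession in accessions:
        expected = [f"{accession}_{n}.fastq" for n in (1, 2)]
        absent = [name for name in expected if not (directory / name).is_file()]
        if absent:
            missing[accession] = absent
    return missing


# An empty dict means every listed accession has both read files.
print(missing_pairs(["SRR10153500", "SRR10153501", "SRR10153502"]))
```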

`fasterq-dump` is a useful tool for accessing public data. Since the explosion of 
next-generation sequencing technologies, it has become imperative for published research projects 
to upload their data. This is a useful resource for learners, students, and professors, who can 
use already scrutinized data to practice, run newly developed tools, and teach exercises.


Note: To make `fasterq-dump` even faster, you can use the `prefetch` command first and then run `fasterq-dump`. 
You will see how we do this in the homework for this week. This can be incredibly important when you are downloading large datasets.