# Get SRR Data from NCBI

**This documentation explains how to go from an SRR ID (obtained through the SRA Run Selector on the NCBI site) to paired-end FASTQ files. These FASTQ files are the required input for the Nextflow scRNA-seq pipeline that will be used in the next step.**

## Obtaining scRNA data from NCBI

When we want to obtain scRNA data from NCBI, there are different ways to do it:
1. Use the NCBI delivery service to have the data delivered to your AWS bucket.
2. Use `sra-toolkit` on an Ubuntu server or a similar OS to download the data directly to the HPC.
 
This guide focuses on option 2, because the first approach has several drawbacks. To name a few:
- We need to wait up to 48 hours for NCBI to process the request.
- We have little information about exactly which format we are downloading.
- We need to transfer the data (again) from the AWS S3 bucket to the HPC instance where we plan to work.

## Install sra-toolkit
- To make sure we are using the latest available version, we download `sra-toolkit` from its GitHub page: [SRA-TOOLKIT](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit).
- Once the file has been downloaded, we install it following this guide: [Installation guide SRA-toolkit](https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit).
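
After the installation, it can be useful to confirm that the tools are reachable from the shell. The following is a minimal sketch, assuming the toolkit's `bin` directory has already been added to `PATH`; the function name `check_sra_tools` is made up for this example and is not part of sra-toolkit.

```shell
#!/bin/bash
# Hypothetical helper: report which of the given commands are missing from PATH.
check_sra_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"   # 0 when every tool was found, 1 otherwise
}

# Check the two commands used in this guide.
check_sra_tools prefetch fasterq-dump ||
  echo "Add the sra-toolkit bin directory to PATH before continuing."
```

If both tools are found, `fasterq-dump --version` prints the installed release.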

Once `sra-toolkit` is installed, there are two recommended ways to obtain the data files for a run picked in the SRA Run Selector:
- `prefetch <SRR_ID>` downloads the selected SRR into the current directory (recent releases place it in a subdirectory named after the SRR ID). **Important:** the file is downloaded in the _SRA_ format, so it still has to be converted.
- `fasterq-dump <SRR_ID>` not only downloads the file, but also converts it to the _FASTQ_ format.
- **DO NOT USE:** `fastq-dump` can do the same job, but in several benchmark runs it was so slow that it is not recommended at all.
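
The two recommended commands can also be chained: `prefetch` fetches the _SRA_ file first, and `fasterq-dump` then converts the local copy. The following is a minimal sketch, assuming a recent sra-toolkit release in which `fasterq-dump` picks up a run that `prefetch` saved in the current directory. The function name `fetch_and_convert` and the `DRY_RUN` variable are hypothetical conveniences for this example.

```shell
#!/bin/bash
# Hypothetical helper: download one run with prefetch, then convert it to FASTQ.
# With DRY_RUN=1 the commands are only printed, which is handy for a first check.
fetch_and_convert() {
  local accession="$1"
  local run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi

  # Step 1: download the run in SRA format.
  $run prefetch "$accession" || return 1
  # Step 2: convert the downloaded run to FASTQ, one file per read.
  $run fasterq-dump "$accession" --split-files -p
}

# Print the commands for the run used in this guide without downloading anything.
DRY_RUN=1 fetch_and_convert SRR5739553
```

Calling `fetch_and_convert SRR5739553` without `DRY_RUN=1` executes both steps for real.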

## About sra-toolkit and fasterq-dump
For paired-end reads, we run `fasterq-dump` with `--split-files`, which writes each read of a spot to its own file. The outcome is two files, `{SRR_ID}_1.fastq` and `{SRR_ID}_2.fastq`. For single-end reads, `--split-3` can be used instead; according to the fasterq-dump documentation this is also the default mode, so the flag can be omitted. As expected, single-end reads produce only one file. Additional documentation can be found here: [fasterq-dump Documentation](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump)
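
Once a conversion has finished, the file names show which case applied. The following is a minimal sketch that relies on the `{SRR_ID}_1.fastq` / `{SRR_ID}_2.fastq` naming described above. The function name `detect_layout` is hypothetical, and the single-file name `{SRR_ID}.fastq` is an assumption about how single-end output is named.

```shell
#!/bin/bash
# Hypothetical helper: classify a finished run by the FASTQ files found in a directory.
detect_layout() {
  local accession="$1"
  local dir="${2:-.}"
  if [ -f "$dir/${accession}_1.fastq" ] && [ -f "$dir/${accession}_2.fastq" ]; then
    echo "paired"
  elif [ -f "$dir/${accession}_1.fastq" ] || [ -f "$dir/${accession}.fastq" ]; then
    echo "single"
  else
    echo "missing"
    return 1
  fi
}

# Inspect the current directory for the run used in this guide.
detect_layout SRR5739553 . || echo "No FASTQ files found for SRR5739553 yet."
```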

## Notes
- `-p` shows the progress in the terminal.
- `-e` sets how many threads are used to process the file.
- `--split-files` tells the program to write each read of a spot to a separate file, which is what we need for paired-end sequences.
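
The value passed to `-e` can be derived from the machine instead of being typed by hand. The following is a minimal sketch, assuming a Linux host where `nproc` is available. The function name `pick_threads` and the cap of 10 are illustrative choices, not recommendations from the sra-toolkit documentation.

```shell
#!/bin/bash
# Hypothetical helper: choose a thread count for -e, capped at a maximum.
pick_threads() {
  local cap="${1:-16}"
  local cores
  cores=$(nproc 2>/dev/null || echo 1)   # fall back to 1 if nproc is unavailable
  if [ "$cores" -gt "$cap" ]; then echo "$cap"; else echo "$cores"; fi
}

threads=$(pick_threads 10)
# Only print the resulting command; remove the echo to run it.
echo "fasterq-dump SRR5739553 --split-files -e $threads -p"
```

Capping the value keeps a shared HPC node usable for other jobs while the conversion runs.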

## Example

In [1]:
!fasterq-dump --split-files SRR5739553 -e 10 -p

join   :|  0.0 ... 2.4 [progress output truncated]

## How to get many SRA files at once
Because we are working with scRNA data, it is common to download many files picked in the SRA Run Selector. To do this programmatically, we follow these steps:
- Create a file called `script.sh` that reads a list of SRR IDs from another file called `SRR_acc_list.txt` and downloads them from SRA one by one.
- Create `SRR_acc_list.txt` with the list of SRR IDs we are interested in downloading.

In [1]:
%%bash
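cat << 'EOF' > script.sh
#!/bin/bash
# Download every accession listed in SRR_acc_list.txt as paired FASTQ files.
while read -r accession || [ -n "$accession" ]; do
  [ -z "$accession" ] && continue  # skip blank lines
  fasterq-dump "$accession" -O /mnt/faster/results/ --split-files --threads 16 -p
done < SRR_acc_list.txt
EOF

In [2]:
cat script.sh

#!/bin/bash
# Download every accession listed in SRR_acc_list.txt as paired FASTQ files.
while read -r accession || [ -n "$accession" ]; do
  [ -z "$accession" ] && continue  # skip blank lines
  fasterq-dump "$accession" -O /mnt/faster/results/ --split-files --threads 16 -p
done < SRR_acc_list.txt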


In [7]:
%%bash
cat << 'EOF' > SRR_acc_list.txt
SRR5739552
SRR5739553
SRR5739554
SRR5739555

EOF

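**Note:** the list must end with a newline character, because a plain `while read` loop silently skips a last line that is not newline-terminated, and that SRR would not be downloaded. The extra empty line at the end is a simple safeguard for lists that are edited by hand.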
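
To see why the end of the list matters, the behavior can be reproduced without downloading anything. The following is a minimal sketch that counts how many lines a `while read` loop processes when the last ID has no trailing newline. The function names `count_plain` and `count_robust` are made up for this example.

```shell
#!/bin/bash
# Count how many lines a loop processes from the file given as $1.
count_plain() {   # plain loop: drops a final line without a trailing newline
  local n=0 line
  while read -r line; do n=$((n + 1)); done < "$1"
  echo "$n"
}
count_robust() {  # also handles a final line without a trailing newline
  local n=0 line
  while read -r line || [ -n "$line" ]; do n=$((n + 1)); done < "$1"
  echo "$n"
}

demo=$(mktemp)
printf 'SRR5739552\nSRR5739553' > "$demo"   # note: no newline after the last ID
echo "plain loop:  $(count_plain "$demo") line(s)"    # prints 1
echo "robust loop: $(count_robust "$demo") line(s)"   # prints 2
rm -f "$demo"
```

The `|| [ -n "$line" ]` test keeps the loop going for a final partial line, because `read` still fills the variable even though it reports end of file.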

In [8]:
cat SRR_acc_list.txt

SRR5739552
SRR5739553
SRR5739554
SRR5739555
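
After the batch has finished, it is worth checking that every SRR in the list produced both FASTQ files. The following is a minimal sketch, assuming paired-end runs and the `{SRR_ID}_1.fastq` / `{SRR_ID}_2.fastq` naming described earlier. The function name `list_missing` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print accessions whose paired FASTQ files are not both present.
list_missing() {
  local list="$1" outdir="$2" accession
  while read -r accession || [ -n "$accession" ]; do
    [ -z "$accession" ] && continue   # ignore blank lines
    # -s is true only for files that exist and are not empty.
    if [ ! -s "$outdir/${accession}_1.fastq" ] || [ ! -s "$outdir/${accession}_2.fastq" ]; then
      echo "$accession"
    fi
  done < "$list"
}

# Example call with the list and output directory used in this guide.
if [ -f SRR_acc_list.txt ]; then
  list_missing SRR_acc_list.txt /mnt/faster/results/
fi
```

Any SRR ID printed by the call can be put into a new list file and downloaded again with the same `script.sh`.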
