# Download data

This notebook downloads data from the Sequence Read Archive (SRA) with `prefetch`, then uses `fasterq-dump` from the SRA Toolkit to extract the FASTQ files from the downloaded SRA files.

Note: Delete the `sra` folder between runs. Otherwise, `fasterq-dump` will be called on every file in the `sra` folder, which may include accessions other than the ones you requested.

In [1]:
%load_ext autoreload
%autoreload 2

import os
import shutil
import pandas as pd
import numpy as np
from core_acc_modules import paths

np.random.seed(123)

### Download SRA data

The `sra` folder is deleted first so that `fasterq-dump` only runs on the accessions downloaded in this run.

In [2]:
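# Remove SRA files left over from a previous run (no error if the folder is absent)
shutil.rmtree(paths.SRA_DIR, ignore_errors=True)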
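
An alternative to deleting the folder is to process only the entries whose names match the accession list. The sketch below is a minimal illustration, not part of the original workflow. It assumes the accession file lists one accession per line and that each download appears in the `sra` folder under its accession name (with or without an extension). The helper names `read_accessions` and `select_sra_paths` are hypothetical.

```python
from pathlib import Path


def read_accessions(acc_file):
    """Return accession IDs from a file with one accession per line (assumed format)."""
    with open(acc_file) as fh:
        # Skip blank lines and strip surrounding whitespace
        return [line.strip() for line in fh if line.strip()]


def select_sra_paths(sra_dir, accessions):
    """Return entries in sra_dir whose name, minus any extension, is a listed accession."""
    wanted = set(accessions)
    return sorted(p for p in Path(sra_dir).iterdir() if p.stem in wanted)
```

With this, something like `select_sra_paths(paths.SRA_DIR, read_accessions(paths.SRA_ACC))` would give the subset of downloads to convert, leaving older files in place.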

In [3]:
# Download sra data files
! prefetch --option-file $paths.SRA_ACC


2021-01-15T20:24:08 prefetch.2.10.9: 1) Downloading 'SRR13160334'...
2021-01-15T20:24:08 prefetch.2.10.9:  Downloading via HTTPS...
2021-01-15T20:45:33 prefetch.2.10.9:  HTTPS download succeed
2021-01-15T20:45:34 prefetch.2.10.9:  'SRR13160334' is valid
2021-01-15T20:45:34 prefetch.2.10.9: 1) 'SRR13160334' was downloaded successfully
2021-01-15T20:45:34 prefetch.2.10.9: 'SRR13160334' has 0 unresolved dependencies

2021-01-15T20:45:37 prefetch.2.10.9: 2) Downloading 'ERR3642743'...
2021-01-15T20:45:37 prefetch.2.10.9:  Downloading via HTTPS...
2021-01-15T21:24:42 prefetch.2.10.9:  HTTPS download succeed
2021-01-15T21:24:43 prefetch.2.10.9:  'ERR3642743' is valid
2021-01-15T21:24:43 prefetch.2.10.9: 2) 'ERR3642743' was downloaded successfully
2021-01-15T21:24:43 prefetch.2.10.9: 'ERR3642743' has 0 unresolved dependencies

2021-01-15T21:24:43 prefetch.2.10.9: 3) Downloading 'SRR13234437'...
2021-01-15T21:24:43 prefetch.2.10.9:  Downloading via HTTPS...
2021-01-15T21:50:10 prefetch.2.10.9

### Get FASTQ files associated with SRA downloads

The FASTQ files store the RNA-seq results: the sequence of each read and a quality score for each base call.

This blog post explains how to read FASTQ files: https://thesequencingcenter.com/knowledge-base/fastq-files/
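
To make the record layout concrete, here is a minimal sketch of a FASTQ reader. It assumes each record occupies exactly four lines (header, sequence, `+` separator, quality string) and that quality scores use the Phred+33 encoding. The function name `parse_fastq` and the example record are made up for illustration.

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (read_id, sequence, scores) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # Phred+33 (assumed): score = ASCII code of the character minus 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores  # drop the leading '@'


# Hypothetical single-record example
example = StringIO("@read1\nGATTA\n+\nII5!#\n")
for read_id, seq, scores in parse_fastq(example):
    print(read_id, seq, scores)
```

Each base in the sequence line pairs with one character in the quality line, so the two lines always have the same length.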

Each FASTQ record gives the sequence of one read. Our goal is to map these reads to a reference genome so that we can count the reads aligning to each location and use those counts to estimate the level of expression.

For paired-end data, `fasterq-dump` splits the reads into up to 3 files by default:
* file_1.fastq containing read 1
* file_2.fastq containing read 2
* file.fastq containing unmatched reads (i.e. reads without a mate pair)

Source: https://www.rdocumentation.org/packages/geomedb/versions/2.0.1/topics/fasterqDump
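
As a rough way to see which of these files were produced for a sample, the sketch below looks for the three file names in a sample's output folder. It assumes the files are named after the accession (`<accession>_1.fastq`, `<accession>_2.fastq`, `<accession>.fastq`) and sit in one folder per sample. The helper name `classify_fastq_files` is hypothetical.

```python
from pathlib import Path


def classify_fastq_files(sample_dir, accession):
    """Map each role (read_1, read_2, unpaired) to its FASTQ file, if that file exists."""
    sample_dir = Path(sample_dir)
    # Assumed naming convention, following the list above
    candidates = {
        "read_1": sample_dir / f"{accession}_1.fastq",
        "read_2": sample_dir / f"{accession}_2.fastq",
        "unpaired": sample_dir / f"{accession}.fastq",
    }
    return {role: path for role, path in candidates.items() if path.exists()}
```

A result with only `read_1` and `read_2` would suggest every read had a mate, while a result with only `unpaired` would suggest single-end data.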

In [4]:
os.makedirs(paths.FASTQ_DIR, exist_ok=True)

In [5]:
%%bash -s $paths.SRA_DIR $paths.FASTQ_DIR
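# $1 = directory of SRA downloads, $2 = output directory for FASTQ files
for FILE_PATH in "$1"/*; do
    # get file name
    sample_name=$(basename "${FILE_PATH}")

    # remove extension from file name
    sample_name="${sample_name%.*}"

    echo "Getting fastq for $FILE_PATH"
    # -O sets the output directory; -f overwrites existing output
    fasterq-dump "$FILE_PATH" -O "$2/${sample_name}" -f
done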

spots read      : 24,096,032
reads read      : 48,192,064
reads written   : 48,192,064
spots read      : 8,669,682
reads read      : 17,339,364
reads written   : 17,339,364
spots read      : 15,672,801
reads read      : 31,345,602
reads written   : 31,345,602
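
The summary above can be sanity-checked with a little arithmetic. The sketch below parses the printed counts and computes reads written per spot for each sample. It assumes the three-line block layout shown above, with each block starting at `spots read`. The names `parse_summaries` and `reads_per_spot` are hypothetical.

```python
import re

SUMMARY_LINE = re.compile(r"^(spots read|reads read|reads written)\s*:\s*([\d,]+)$")


def parse_summaries(text):
    """Split the printed summary into one dict of counts per sample."""
    samples = []
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match is None:
            continue
        key, count = match.group(1), int(match.group(2).replace(",", ""))
        if key == "spots read":
            samples.append({})  # each sample's block starts with "spots read"
        if samples:
            samples[-1][key] = count
    return samples


def reads_per_spot(sample):
    """Return the number of reads written per spot for one sample."""
    return sample["reads written"] / sample["spots read"]


output = """\
spots read      : 24,096,032
reads read      : 48,192,064
reads written   : 48,192,064
spots read      : 8,669,682
reads read      : 17,339,364
reads written   : 17,339,364
spots read      : 15,672,801
reads read      : 31,345,602
reads written   : 31,345,602
"""

for sample in parse_summaries(output):
    print(sample["spots read"], reads_per_spot(sample))
```

For all three samples the reads written equal the reads read and come to exactly two per spot, which is consistent with paired-end data where every read was written out.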
