# RNA-seq quantification


The main goal of this notebook is to preprocess the reads and to quantify the mRNA in the samples.
There are many approaches and tools to achieve this. Here I apply two different strategies: a fast k-mer based approach (SALMON) and a classic splice-aware read-mapping tool (STAR). Since we are dealing with human samples, a reference-based approach is more practical because of its higher sensitivity compared to de novo approaches. Note, however, that these approaches are limited by the chosen reference genome, which might be less representative of the samples than one would expect.
Here I used the GRCh38 reference genome and the gene annotation provided by Ensembl. Note that the choice of gene annotation has a substantial impact on the results, which makes integrating RNA-seq data from different sources considerably harder.
The raw reference genome and the corresponding indices are deposited in the cloud. The indices are fitted to this RNA-seq data (i.e. a smaller k-mer size for SALMON).
Indexing and alignment require a significant amount of resources (RAM > 40 GB, more than 10 CPU cores), so they will not run on simple serverless solutions; a dedicated resource is needed. Fortunately, cloud providers offer solutions for this; in this application I used dedicated VMs.
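
The resource figures above can be checked before an expensive indexing run is started. The following is only a sketch: the helper names `total_ram_gb` and `check_resources` are hypothetical, the thresholds are simply the numbers quoted above, and the memory query relies on `os.sysconf`, which is available on Linux but not on every platform (in that case the RAM check is skipped).

```python
import os

def total_ram_gb():
    """Return total physical RAM in GB, or None if it cannot be determined."""
    try:
        pages = os.sysconf('SC_PHYS_PAGES')
        page_size = os.sysconf('SC_PAGE_SIZE')
    except (ValueError, OSError, AttributeError):
        return None
    return pages * page_size / 1024 ** 3

def check_resources(min_ram_gb=40, min_cores=10, ram_gb=None, cores=None):
    """Return a list of warnings; the list is empty if the machine looks sufficient."""
    if ram_gb is None:
        ram_gb = total_ram_gb()  # may still be None on unsupported platforms
    if cores is None:
        cores = os.cpu_count() or 1
    warnings = []
    if ram_gb is not None and ram_gb < min_ram_gb:
        warnings.append('only {:.1f} GB RAM, {} GB recommended'.format(ram_gb, min_ram_gb))
    if cores < min_cores:
        warnings.append('only {} CPU cores, {} recommended'.format(cores, min_cores))
    return warnings

for message in check_resources():
    print('WARNING:', message)
```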


### Input: 
   * preprocessed reads
   * optionally genome data to index

   
### Outputs:
  * RNA pseudocounts for samples provided by SALMON
  * indexed genomes provided by STAR
  * indexed genomes provided by SALMON

### Requirements:
  * salmon, STAR, samtools, htseq-count (the counting cell below uses featureCounts)
  * input data, project directories
  * inputs are assumed to be single-end reads
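
Whether the required tools are installed can be verified up front. A minimal sketch, assuming the executables are called `salmon`, `STAR`, `samtools` and `featureCounts` as in the cells of this notebook; the helper `missing_tools` is hypothetical.

```python
import shutil

def missing_tools(names):
    """Return the executables from names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

required = ['salmon', 'STAR', 'samtools', 'featureCounts']
not_found = missing_tools(required)
if not_found:
    print('Missing executables:', ', '.join(not_found))
else:
    print('All required executables were found.')
```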


### Assumptions and notes
  * the proper paths and project data should be set before the run

### Data:
  * reference genome: https://storage.googleapis.com/turbine-rna/GRCh38.tar.bz2
  * STAR indexed reference genome: https://storage.googleapis.com/turbine-rna/STAR_human_ref_index.tar.bz2
  * SALMON index: https://storage.googleapis.com/turbine-rna/GRCh38_ensemble_transcripts.tar.bz2
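
The archives listed above are bzip2-compressed tarballs. As an alternative to `wget` and `tar`, they could also be fetched from Python. This is a sketch under assumptions: the helper `fetch_and_extract` and the target directory in the usage comment are hypothetical, and the download is skipped when the archive is already present.

```python
import os
import tarfile
import urllib.request
from os.path import basename, exists, join

def fetch_and_extract(url, data_dir):
    """Download a .tar.bz2 archive into data_dir (if missing) and unpack it there."""
    os.makedirs(data_dir, exist_ok=True)
    archive = join(data_dir, basename(url))
    if not exists(archive):
        urllib.request.urlretrieve(url, archive)
    # 'r:bz2' opens the tarball with bzip2 decompression
    with tarfile.open(archive, 'r:bz2') as tar:
        tar.extractall(data_dir)
    return archive

# Hypothetical usage (large download, not executed here):
# fetch_and_extract('https://storage.googleapis.com/turbine-rna/GRCh38.tar.bz2', '../data')
```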



# General paths and data


In [None]:
import os
import multiprocessing
from os.path import join  # used for all path definitions below
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from turbine_lib import *


genome_data_http = 'https://storage.googleapis.com/turbine-rna/GRCh38.tar.bz2'
star_index = 'STAR_human_ref_index.tar.bz2'

%load_ext autoreload
%autoreload 2

# Input
base_path = '/bio-apps/turbine-rna/'
results_path = join(base_path, 'results')
trimmed_path = join(results_path, 'trimmed_files/')


# Output folders and parameters
salmon_path = 'salmon'
salmon_index = join(base_path, 'data', 'salmon_GRCh38_transcripts_index')
salmon_output_folder = join(results_path, 'salmon_GRCh38_outputs')

# STAR mapping tool
star_path = 'STAR'
output_dir = join(results_path, 'trimmed_star_GRCh38')
star_reference_index = join(base_path, 'data', 'star_GRCh38_index')


other_options = "--outSAMtype BAM SortedByCoordinate \
 --outSAMunmapped Within \
 --outSAMattributes Standard"

# SAM tools
samtool_path = 'samtools'


# Other parameters
# Core counts
number_of_threads = multiprocessing.cpu_count()


In [None]:
# Creating output directories (no error if they already exist)
os.makedirs(salmon_output_folder, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)


# Downloading reference data

DO NOT RUN
(note this is a bash script)


In [None]:
%%bash

wget --directory-prefix=../data https://storage.googleapis.com/turbine-rna/GRCh38.tar.bz2
cd ../data
tar -xf GRCh38.tar.bz2


# Indexing for STAR
Note that the index path should be matched to the project paths.
The overhang (`--sjdbOverhang`) is set to 50, since the samples do not contain longer reads.
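
The chosen overhang can be compared against the reads themselves. The STAR manual describes the read length minus 1 as the ideal value for `--sjdbOverhang`. The sketch below scans the beginning of a FASTQ file for the longest read; it assumes plain or gzip-compressed FASTQ with four lines per record, and the helper `max_read_length` and the file name in the usage comment are hypothetical.

```python
import gzip
from itertools import islice

def max_read_length(fastq_path, max_reads=100000):
    """Return the longest read among the first max_reads records of a FASTQ file."""
    opener = gzip.open if fastq_path.endswith('.gz') else open
    longest = 0
    with opener(fastq_path, 'rt') as handle:
        # The sequence is the second line of every four-line record
        for line in islice(handle, 1, 4 * max_reads, 4):
            longest = max(longest, len(line.strip()))
    return longest

# Hypothetical usage:
# suggested_overhang = max_read_length('/path/to/sample.fastq.gz') - 1
```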


In [None]:
%%bash
STAR --runThreadN 32 \
--runMode genomeGenerate \
--genomeDir /bio-apps/turbine-rna/data/star_GRCh38_index \
--genomeFastaFiles /bio-apps/turbine-rna/data/GRCh38/hg38.fa \
--sjdbGTFfile /bio-apps/turbine-rna/data/GRCh38/hg38.ensGene.gtf \
--sjdbOverhang 50


## Calculate STAR alignment

In [None]:
# Map each trimmed FASTQ file (single-end) against the STAR index
prefix_to_pair = get_illumina_pairs(trimmed_path)
output_files = {}
star_cmds = []
for prefix, fastq_file in prefix_to_pair.items():
    star_cmd = '{0} \
    --genomeDir {1} \
    --runThreadN {2} \
    --readFilesIn {3} \
    --outFileNamePrefix {4} \
    {5}'.format(star_path,
                star_reference_index,
                number_of_threads,
                fastq_file[0],
                join(output_dir, prefix),
                other_options)
    print(star_cmd)
    star_cmds.append(star_cmd)
    # STAR appends this suffix to the prefix for coordinate-sorted BAM output
    output_files[prefix] = [join(output_dir, prefix + 'Aligned.sortedByCoord.out.bam')]
    os.system(star_cmd)
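
Note that `os.system` does not stop the loop when STAR fails. One possible alternative, sketched here under assumptions, builds the command as an argument list and runs it with `subprocess.run(..., check=True)`, which raises `CalledProcessError` on a non-zero exit status. The helpers `build_star_cmd` and `run_checked` and the example paths are hypothetical; the flags are the ones used in the cell above.

```python
import shlex
import subprocess
from os.path import join

def build_star_cmd(star_path, index_dir, threads, fastq, out_prefix, other_options=''):
    """Build the STAR call as an argument list, which avoids shell quoting issues."""
    cmd = [star_path,
           '--genomeDir', index_dir,
           '--runThreadN', str(threads),
           '--readFilesIn', fastq,
           '--outFileNamePrefix', out_prefix]
    # Extra options arrive as a single string, so split them shell-style
    cmd += shlex.split(other_options)
    return cmd

def run_checked(cmd):
    """Run the command; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)

cmd = build_star_cmd('STAR', '/path/to/index', 8, 'sample.fastq',
                     join('/path/to/out', 'sample_'),
                     '--outSAMtype BAM SortedByCoordinate')
print(' '.join(cmd))
# run_checked(cmd)  # uncomment to actually start the alignment
```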
    

### Running samtools for indexing the alignments

In [None]:
# Running samtools for indexing
from os.path import exists
import errno
import os
for prefix in output_files:
    print(prefix, output_files[prefix][0])
    # check if file exists:
    if exists(output_files[prefix][0]):
        samtool_cmd = '{0} index {1}'.format(samtool_path, output_files[prefix][0])
        print(samtool_cmd)
        os.system(samtool_cmd)
    else:
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), output_files[prefix][0])
        


In [None]:
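%%bash

# Count reads per gene from the STAR alignments.
# -T 4: use 4 threads; -s 2: treat reads as reversely stranded.
# Note that the paths should be matched to the project paths.
featureCounts -T 4 -s 2 \
  -a /bio-apps/turbine-rna/data/GRCh38/hg38.ensGene.gtf \
  -o /bio-apps/turbine-rna/results/trimmed_star/trimmed_featurecounts.txt \
  /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_star_GRCh38/*.out.bam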
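
The table written by featureCounts is tab-separated: a comment line starting with `#`, then a header with `Geneid`, `Chr`, `Start`, `End`, `Strand` and `Length`, followed by one count column per BAM file. A minimal sketch of reading it with the standard library; the helper `read_featurecounts` is hypothetical and assumes integer counts (the default when fractional counting is not enabled).

```python
import csv

def read_featurecounts(path):
    """Return {gene_id: {bam_column: count}} from a featureCounts table."""
    counts = {}
    with open(path, newline='') as handle:
        # Skip the leading comment line that records the command used
        rows = (line for line in handle if not line.startswith('#'))
        reader = csv.DictReader(rows, delimiter='\t')
        # The first six columns describe the gene; the rest are per-BAM counts
        sample_columns = reader.fieldnames[6:]
        for row in reader:
            counts[row['Geneid']] = {s: int(row[s]) for s in sample_columns}
    return counts

# Hypothetical usage:
# counts = read_featurecounts('/path/to/trimmed_featurecounts.txt')
```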


# Indexing with SALMON
Note that the index path should be matched to the project paths.

In [None]:
%%bash
salmon index -t /bio-apps/turbine-rna/data/GRCh38/GRCh38_rna.fa -i /bio-apps/turbine-rna/data/salmon_GRCh38_transcripts_index -k 19


# Quantification with salmon
The outputs will be written to the `salmon_output_folder` directory, one subdirectory per sample.


In [None]:
# Quantify each trimmed FASTQ file with salmon (single-end reads, hence -r)
prefix_to_pair = get_illumina_pairs(trimmed_path)
for prefix, fastq_file in prefix_to_pair.items():
    salmon_cmd = '{0} quant \
    -i {1} \
    -l A \
    -r {2} \
    -p {3} \
    -o {4} \
    --numBootstraps 100 \
    --validateMappings \
    --useVBOpt \
    --seqBias'.format(salmon_path, salmon_index, fastq_file[0],
                      number_of_threads, join(salmon_output_folder, prefix))
    print(salmon_cmd)
    os.system(salmon_cmd)
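
Each salmon output directory contains a `quant.sf` file, a tab-separated table with the columns `Name`, `Length`, `EffectiveLength`, `TPM` and `NumReads`. The sketch below collects one of these columns across samples. The helper `collect_salmon` is hypothetical and assumes one subdirectory per sample under the output folder, as produced by the loop above.

```python
import csv
import os
from os.path import isfile, join

def collect_salmon(output_folder, column='NumReads'):
    """Return {sample: {transcript: value}} for every <sample>/quant.sf found."""
    table = {}
    for sample in sorted(os.listdir(output_folder)):
        quant_file = join(output_folder, sample, 'quant.sf')
        if not isfile(quant_file):
            continue  # not a finished salmon run
        with open(quant_file, newline='') as handle:
            reader = csv.DictReader(handle, delimiter='\t')
            table[sample] = {row['Name']: float(row[column]) for row in reader}
    return table

# Hypothetical usage:
# pseudocounts = collect_salmon(salmon_output_folder)
# tpm = collect_salmon(salmon_output_folder, column='TPM')
```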