# RNA-Seq quantification


The main goal of this notebook is to quantify the mRNA content of the preprocessed samples.
There are many approaches and tools to achieve this. Here we apply two different strategies: a fast k-mer based approach (SALMON) and a classical splice-aware read-mapping tool (STAR). Since we are dealing with human samples, a reference-based approach is more practical because of its increased sensitivity compared to de novo approaches. Note, however, that one limitation of reference-based approaches lies in the reference genome itself, which might be less representative of the samples in this application than one would expect.
Here we used the GRCh38 reference genome with the gene annotation provided by Ensembl. Note that the choice of gene annotation has a substantial impact on the results, which makes RNA-Seq data integration considerably harder.
The raw reference genomes are deposited in the cloud, as are the corresponding indices. The indices are fitted to this RNA-Seq data (i.e. smaller k-mers for SALMON).
Indexing and alignment require a significant amount of resources (min. RAM > 40 GB; CPU cores > 10), so they cannot run on simple serverless solutions; a dedicated resource is needed (fortunately, the cloud providers have solutions for that; in this application I used dedicated VMs).
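The resource requirements above can be checked before launching a long indexing job. The following is a minimal sketch and not part of the original pipeline: the helper names (`detect_resources`, `meets_requirements`) are hypothetical, the thresholds are simply the rough figures quoted above, and the memory query assumes a Linux host where `os.sysconf` exposes `SC_PHYS_PAGES`.

```python
import os

# Rough minimums quoted in the text (assumption: treated here as strict thresholds)
MIN_RAM_GB = 40
MIN_CORES = 10


def detect_resources():
    """Return (ram_gb, cores) for this machine; ram_gb is None if it cannot be read."""
    cores = os.cpu_count() or 1
    try:
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        ram_gb = ram_bytes / 1024 ** 3
    except (ValueError, OSError, AttributeError):
        # os.sysconf or these keys are not available on every platform
        ram_gb = None
    return ram_gb, cores


def meets_requirements(ram_gb, cores, min_ram_gb=MIN_RAM_GB, min_cores=MIN_CORES):
    """True only if both RAM and core count exceed the given minimums."""
    if ram_gb is None:
        return False
    return ram_gb > min_ram_gb and cores > min_cores


ram_gb, cores = detect_resources()
print("RAM (GB):", ram_gb, "cores:", cores, "sufficient:", meets_requirements(ram_gb, cores))
```

Returning `False` when the memory size is unknown is a deliberately conservative choice; on such hosts the check would have to be done by hand.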


### Input: 
   * preprocessed reads
   * optionally genome data to index

   
### Outputs:
  * RNA pseudocounts for samples provided by SALMON
  * indexed genomes provided by STAR
  * indexed genome provided by SALMON

### Requirements:
  * salmon, STAR, samtools, htseq-count (the counting step below uses featureCounts)
  * input data, project directories
  * inputs here are assumed to be single-end reads
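Whether the required command-line tools are installed can be verified up front. This is a small sketch using only the standard library; the function name `missing_tools` is hypothetical and the executable names in `REQUIRED_TOOLS` are assumptions that may differ between installations.

```python
import shutil

# Executable names are assumptions; adjust them to the local installation
REQUIRED_TOOLS = ["salmon", "STAR", "samtools", "featureCounts"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


print("Missing tools:", missing_tools() or "none")
```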


### Assumptions and notes
  * the proper paths and project data should be set before the run

### Data:
  * https://storage.googleapis.com/turbine-rna/GRCh38.tar.bz2
  * https://storage.googleapis.com/turbine-rna/STAR_human_ref_index.tar.bz2
  * https://storage.googleapis.com/turbine-rna/GRCh38_ensemble_transcripts.tar.bz2
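The archives listed above can also be fetched and unpacked from Python instead of the shell. The sketch below is an illustration only: `local_archive_path` and `fetch_and_extract` are hypothetical helpers, `./data` is an assumed target directory, and nothing is downloaded until the function is called explicitly.

```python
import os
import tarfile
import urllib.request
from urllib.parse import urlparse


def local_archive_path(url, data_dir):
    """Map a download URL to the path of its local copy inside data_dir."""
    return os.path.join(data_dir, os.path.basename(urlparse(url).path))


def fetch_and_extract(url, data_dir="./data", download=True):
    """Download url into data_dir (unless already present) and unpack it there."""
    os.makedirs(data_dir, exist_ok=True)
    archive = local_archive_path(url, data_dir)
    if download and not os.path.exists(archive):
        urllib.request.urlretrieve(url, archive)
    # The listed archives are bzip2-compressed tarballs
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(path=data_dir)
    return archive
```

A call such as `fetch_and_extract('https://storage.googleapis.com/turbine-rna/GRCh38.tar.bz2')` would place the archive and its contents under `./data`; the download is roughly 1 GB, so it is skipped when the archive already exists.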



# General paths and data


In [7]:
import os
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from turbine_lib import *
# Explicit imports for names used below (multiprocessing was missing)
import multiprocessing
from os.path import join


genome_data_http = 'https://storage.googleapis.com/turbine-rna/GRCh38.tar.bz2'
star_index = 'STAR_human_ref_index.tar.bz2'

%load_ext autoreload
%autoreload 2

# Input
base_path = '/home/ligeti/gitrepos/turbine-rnaseq-ligeti'
results_path = join(base_path, 'results')
trimmed_path = join(results_path, 'trimmed_files/')


# Output folders and parameters
salmon_path = 'salmon'
salmon_index = '/home/ligeti/ensemble/salmon_transcripts_index'
salmon_output_folder = join(results_path, 'salmon_GRCh38_outputs')

# STAR mapping tool
star_path = 'STAR'
output_dir = join(results_path, 'trimmed_star_GRCh38')
# NOTE: this path points to a CHM13v2 index although the output folder is named
# after GRCh38; keep the index consistent with the reference actually intended
star_reference_index = '/home/ligeti/gitrepos/CHM13v2/genomeidx'
other_options = "--outSAMtype BAM SortedByCoordinate \
 --outSAMunmapped Within \
 --outSAMattributes Standard"

# SAM tools
samtool_path = 'samtools'


# Other parameters
# Core counts
number_of_threads = multiprocessing.cpu_count()


In [None]:
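# Creating directories (exist_ok avoids an error if they are already present)
os.makedirs(salmon_output_folder, exist_ok=True)
# STAR writes its output under output_dir, so it is created here as well
os.makedirs(output_dir, exist_ok=True)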

# Downloading reference data

DO NOT RUN
(note this is a bash script)


In [4]:
%%bash

wget --directory-prefix=./data https://storage.googleapis.com/turbine-rna/GRCh38.tar.bz2
cd ./data
tar -xf GRCh38.tar.bz2


--2022-06-13 09:07:46--  https://storage.googleapis.com/turbine-rna/GRCh38.tar.bz2
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.161.128, 172.217.219.128, 142.250.159.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.161.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1005024034 (958M) [application/x-tar]
Saving to: ‘./data/GRCh38.tar.bz2’


# Indexing for STAR
Note that the index path should match the project paths.
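One way to keep the index path in line with the project paths is to assemble the indexing command from variables. This is a sketch under assumptions: `build_star_index_cmd` is a hypothetical helper and the example paths are placeholders, while the STAR options are the ones used in the indexing cell of this notebook. The STAR manual suggests read length minus one as the ideal `--sjdbOverhang`, so the default of 50 used here may need adjusting to the actual reads.

```python
import shlex


def build_star_index_cmd(genome_dir, fasta, gtf, threads=32,
                         sjdb_overhang=50, star_path="STAR"):
    """Return the STAR genomeGenerate call as an argument list."""
    return [
        star_path,
        "--runThreadN", str(threads),
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(sjdb_overhang),
    ]


# Placeholder paths for illustration only
cmd = build_star_index_cmd("GRCh38idx", "/path/to/genome.fa", "/path/to/annotation.gtf")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces, and the list can be passed directly to `subprocess.run`.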


In [5]:
%%bash
STAR --runThreadN 32 \
--runMode genomeGenerate \
--genomeDir GRCh38idx \
--genomeFastaFiles /home/ligeti/uscs/hg38.fa \
--sjdbGTFfile /home/ligeti/uscs/hg38.ensGene.gtf \
--sjdbOverhang 50



In [None]:
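# Map each sample prefix to its FASTQ file(s); get_illumina_pairs comes from turbine_lib
prefix_to_pair = get_illumina_pairs(trimmed_path)
prefix_to_pair

output_files = {}
star_cmds = []
for prefix, fastq_file in prefix_to_pair.items():
    # Single-end reads: only the first file of each entry is passed to STAR
    star_cmd = '{0} \
    --genomeDir {1} \
    --runThreadN {2} \
    --readFilesIn {3} \
    --outFileNamePrefix {4} \
    {5} '.format(star_path,
                 star_reference_index,
                 number_of_threads,
                 fastq_file[0],
                 join(output_dir, prefix),
                 other_options)
    print(star_cmd)
    star_cmds.append(star_cmd)
    # STAR appends this suffix to the prefix when writing coordinate-sorted BAM output
    output_files[prefix] = [join(output_dir, prefix + 'Aligned.sortedByCoord.out.bam')]
    # Uncomment to actually run the alignment
    #os.system(star_cmd)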
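The same alignment call can be expressed as an argument list and executed through `subprocess`, which reports failures instead of silently continuing. This is a sketch, not the notebook's method: `build_star_align_cmd` and `run_cmd` are hypothetical helpers, the sample paths are placeholders, and the dry-run default mirrors the commented-out `os.system` call above.

```python
import subprocess


def build_star_align_cmd(star_path, index_dir, fastq, out_prefix, threads,
                         extra_options=()):
    """Return the STAR alignment call for one single-end FASTQ file."""
    return [
        star_path,
        "--genomeDir", index_dir,
        "--runThreadN", str(threads),
        "--readFilesIn", fastq,
        "--outFileNamePrefix", out_prefix,
        *extra_options,
    ]


def run_cmd(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if dry_run:
        return None
    # check=True raises CalledProcessError on a non-zero exit status
    return subprocess.run(cmd, check=True)


# Placeholder values for illustration only
extra = ["--outSAMtype", "BAM", "SortedByCoordinate",
         "--outSAMunmapped", "Within",
         "--outSAMattributes", "Standard"]
cmd = build_star_align_cmd("STAR", "/path/to/genomeidx", "/path/to/sample.fastq",
                           "/path/to/output/sample", 8, extra)
run_cmd(cmd)  # dry run: only prints the command
```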


In [None]:
### Running samtools for indexing the alignments

In [None]:
# Running samtools for indexing
import errno
import os
from os.path import exists

for prefix, bam_files in output_files.items():
    bam_file = bam_files[0]
    print(prefix, bam_file)
    # Index the BAM file only if STAR actually produced it
    if exists(bam_file):
        samtool_cmd = '{0} index {1}'.format(samtool_path, bam_file)
        print(samtool_cmd)
        os.system(samtool_cmd)
    else:
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), bam_file)


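# Counting reads per gene
Gene-level counts are computed from the sorted BAM files with featureCounts (`-T 4`: four threads, `-s 2`: reversely stranded counting). Note that the annotation and output paths should match the project paths.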
featureCounts -T 4 -s 2 \
  -a /home/ligeti/ncbi-genomes-2022-06-10/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf \
  -o /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_star/trimmed_featurecounts.txt \
  /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_star/*.out.bam
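The resulting count table can be loaded for downstream work with a few lines of Python. The sketch below assumes the usual featureCounts layout (a leading comment line starting with `#`, six annotation columns, then one count column per BAM file) and integer counts; `read_featurecounts` is a hypothetical helper name.

```python
import csv

# Annotation columns that precede the per-sample count columns
ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def read_featurecounts(path):
    """Return {gene_id: {sample_column: count}} from a featureCounts table."""
    counts = {}
    with open(path, newline="") as handle:
        # Skip the comment line that records the featureCounts command
        rows = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(rows, delimiter="\t")
        sample_columns = [c for c in reader.fieldnames if c not in ANNOTATION_COLUMNS]
        for row in reader:
            # Assumption: integer counts (no fractional counting was requested)
            counts[row["Geneid"]] = {c: int(row[c]) for c in sample_columns}
    return counts
```

The sample column names are the BAM paths passed on the command line, so they usually need shortening before the table is merged with other data.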

# Indexing with SALMON
Note that the index path should match the project paths.


In [None]:
%%bash
salmon index -t /home/ligeti/ensemble/GRCh38_rna.fa -i /home/ligeti/ensemble/salmon_transcripts_index -k 19
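The reason for choosing a smaller k (`-k 19`) can be illustrated with a toy example: a read contributes no k-mers at all once k exceeds its length, and only a few when k is close to it. The salmon documentation notes that the default k of 31 works well for reads of 75 bp or longer and suggests a smaller k for shorter reads. The function below is a plain illustration of k-mer extraction, not salmon's implementation, and the read is made up.

```python
def kmers(sequence, k):
    """Return all overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "ACGT" * 7  # a made-up 28 bp read
print(len(kmers(read, 19)))  # 10 k-mers available for matching
print(len(kmers(read, 31)))  # 0: the read is shorter than k
```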


# Quantification with salmon
The outputs will be in the salmon_output_folder directory


In [None]:
prefix_to_pair = get_illumina_pairs(trimmed_path)
for prefix, fastq_file in prefix_to_pair.items():
    # -l A lets salmon detect the library type; -r passes single-end reads
    # Each sample gets its own output directory under salmon_output_folder
    salmon_cmd = '{0} quant \
    -i {1} \
    -l A \
    -r {2} \
    -p {3} \
    -o {4} \
    --numBootstraps 100 \
    --validateMappings \
    --useVBOpt \
    --seqBias'.format(salmon_path, salmon_index, fastq_file[0], number_of_threads, join(salmon_output_folder, prefix))
    print(salmon_cmd)
    os.system(salmon_cmd)
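Once the quantification has finished, each sample directory contains a `quant.sf` table with the columns `Name`, `Length`, `EffectiveLength`, `TPM` and `NumReads`. A minimal sketch for collecting these per sample follows; `read_quant_sf` and `collect_quants` are hypothetical helpers, and the layout of one sub-directory per sample is assumed from the loop above.

```python
import csv
import os


def read_quant_sf(path):
    """Return {transcript_name: (tpm, num_reads)} from a salmon quant.sf file."""
    result = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            result[row["Name"]] = (float(row["TPM"]), float(row["NumReads"]))
    return result


def collect_quants(output_folder):
    """Read quant.sf from every sample sub-directory of output_folder."""
    samples = {}
    for sample in sorted(os.listdir(output_folder)):
        quant_file = os.path.join(output_folder, sample, "quant.sf")
        # Skip directories where the quantification did not produce a result
        if os.path.isfile(quant_file):
            samples[sample] = read_quant_sf(quant_file)
    return samples
```

Calling `collect_quants(salmon_output_folder)` would then return one dictionary per sample; `NumReads` is kept as a float because salmon reports estimated, possibly fractional, read counts.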