# Build reference transcriptome

The first step in using Salmon to quantify gene expression is to build an index of the reference transcriptome, which our sample reads will be mapped against.

**Goal:** To create an index of all unique sequences of length k (k-mers) in the target transcriptome, so that reads can be evaluated against it

**Input:**

Target transcriptome, given as a multi-FASTA file in which each entry provides the sequence of one transcript. In prokaryotes, genes and transcripts map roughly one-to-one, so we use gene sequences as our reference transcriptome. As a result, we don't need tximport to map transcript quants to genes.

*What is the relationship between a gene and a transcript?* A gene is a bounded region of a chromosome within which transcription occurs. The gene (DNA sequence) is transcribed into an RNA transcript. In bacteria, these RNA transcripts act as mRNA that can be translated into protein. In eukaryotes, the transcript RNA is pre-mRNA and must undergo additional processing (post-transcriptional modifications) before it can be translated. This processing includes the addition of a protective cap and tail, and splicing to remove introns. As a result, a single gene can give rise to multiple mRNAs (which can encode different proteins) through alternative splicing, where segments of the pre-mRNA are joined in different combinations.
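To make the gene vs transcript distinction concrete, here is a toy sketch of alternative splicing. The exon and intron sequences, the isoform names and the `splice` helper are all made up for illustration; nothing here comes from a real annotation.

```python
# Hypothetical gene with three exons separated by two introns (toy sequences).
exons = {"E1": "ATGGCC", "E2": "GATTAC", "E3": "TTGTAA"}
introns = ["GTAAGT", "GTCCAG"]

# The pre-mRNA still contains the introns.
pre_mrna = exons["E1"] + introns[0] + exons["E2"] + introns[1] + exons["E3"]


def splice(exon_names):
    """Join the chosen exons in order, dropping all introns."""
    return "".join(exons[name] for name in exon_names)


# One gene, two possible mature transcripts (isoforms).
isoforms = {
    "full": splice(["E1", "E2", "E3"]),
    "skip_E2": splice(["E1", "E3"]),
}

for name, seq in isoforms.items():
    print(name, seq, len(seq))
```

In a eukaryote, both isoforms would be separate entries in the transcriptome FASTA and would need to be summed back to the gene. In our prokaryotic reference there is one entry per gene, so that step is unnecessary.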

**Output:**

Quasi-index over the reference transcriptome, a structure that Salmon uses to quasi-map RNA-seq reads during quantification. The index is a signature for each transcript in the reference transcriptome.

The index contains:
* Suffix array (SA) of the reference transcriptome: a sorted array of all the suffixes of the transcript sequences
* A hash table mapping each k-mer occurring in the reference transcriptome to its location in the SA

Source: https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/08_salmon.html
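The two index components can be sketched on a toy reference. This is a simplified illustration, not Salmon's implementation: it assumes the transcripts are concatenated into one text with a `$` separator, builds the suffix array naively, and stores for each k-mer the interval of suffix-array ranks that start with it. The function names and the toy transcripts are made up for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, sorted lexicographically."""
    # Naive construction: fine for a toy example, far too slow for real data.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_kmer_table(text, suffix_array, k, separator="$"):
    """Map each k-mer to the half-open interval [start, end) of suffix-array
    ranks whose suffixes begin with that k-mer."""
    table = {}
    for rank, pos in enumerate(suffix_array):
        kmer = text[pos:pos + k]
        if len(kmer) < k or separator in kmer:
            continue  # suffix too short, or k-mer would span two transcripts
        start, _ = table.get(kmer, (rank, rank))
        table[kmer] = (start, rank + 1)
    return table


def find_kmer(kmer, table, suffix_array):
    """Return the sorted text positions where `kmer` occurs (empty list if absent)."""
    start, end = table.get(kmer, (0, 0))
    return sorted(suffix_array[start:end])


# Toy reference: two hypothetical transcripts joined with "$" separators.
transcripts = {"t1": "ACGTACG", "t2": "TACGA"}
text = "".join(seq + "$" for seq in transcripts.values())

suffix_array = build_suffix_array(text)
table = build_kmer_table(text, suffix_array, k=3)

# Positions of "ACG" in the concatenated text.
print(find_kmer("ACG", table, suffix_array))
```

Because the suffixes are sorted, all suffixes that start with the same k-mer sit next to each other, so a single interval per k-mer is enough to find every occurrence.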

**Command:**

`> ./bin/salmon index -t transcripts.fa -i transcripts_index --decoys decoys.txt -k 31`
* k = minimum acceptable length for a valid match between a read and the reference. A smaller k might therefore improve sensitivity. The Salmon authors found that k=31 works well for reads of 75bp or longer.
* decoys = set of target transcript IDs that will not appear in the quantification. These decoy sequences are meant to mitigate spurious mapping of reads that actually arise from an unannotated genomic locus that is sequence-similar to an annotated transcript.
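To see why a smaller k might improve sensitivity, the sketch below counts how many k-mers of a read occur exactly in a transcript. It is only a toy model of exact k-mer matching, not how Salmon scores mappings. The transcript sequence, the `mutate` helper and the choice of a 30bp read with one mismatch are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return every substring of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_matching(read, transcript, k):
    """Count the read's k-mers that also occur somewhere in the transcript."""
    transcript_kmers = set(kmers(transcript, k))
    return sum(1 for kmer in kmers(read, k) if kmer in transcript_kmers)


def mutate(seq, i):
    """Return seq with the base at position i replaced by a different base."""
    new_base = "A" if seq[i] != "A" else "C"
    return seq[:i] + new_base + seq[i + 1:]


# Hypothetical 46bp transcript and a 30bp read taken from it.
transcript = "ATGGCTAGCTTAGGCATCGATCGGATTACGCCTAGGATCCGATAGC"
read_exact = transcript[5:35]
read = mutate(read_exact, 15)  # one sequencing error in the middle of the read

for k in (11, 21, 31):
    print(k, len(kmers(read, k)), count_matching(read, transcript, k))
```

A read shorter than k contains no k-mers at all, so with k=31 this 30bp read cannot match anything. With k=21, every k-mer of the read overlaps the mismatch, whereas with k=11 the k-mers on either side of the mismatch still match.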


**Note:** Here we are using [Salmon](https://combine-lab.github.io/salmon/) version 0.11.2 to be consistent with the version running on Dartmouth's computing cluster (Discovery), which is where the data will be processed.

In [1]:
%load_ext autoreload
%autoreload 2

from core_acc_modules import paths

In [2]:
# Get PAO1 index
! salmon index -t $paths.PAO1_PHAGE_REF -i $paths.PAO1_PHAGE_INDEX

Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
index ["/home/alexandra/Documents/Data/Core_accessory/pao1_phage_index"] did not previously exist  . . . creating it
[2021-01-22 19:31:31.731] [jLog] [info] building index
RapMap Indexer

[Step 1 of 4] : counting k-mers


Elapsed time: 1.57571s

Replaced 8979 non-ATCG nucleotides
Clipped poly-A tails from 3 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.00331993s
Writing sequence data to file . . . done
Elapsed time: 0.0310718s
[info] Building 32-bit suffix array (length of generalized text is 60718359)
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 0.0972567s
done
Elapsed time: 4.86207s
processed 60000000 positions
khash had 37047310 keys
saving hash to disk . . . done
Elapsed time: 2.55567s
[2021-01-22 19:32:11.952] [jLog] [info] done building index


In [3]:
# Get PA14 index
! salmon index -t $paths.PA14_PHAGE_REF -i $paths.PA14_PHAGE_INDEX

Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
index ["/home/alexandra/Documents/Data/Core_accessory/pa14_phage_index"] did not previously exist  . . . creating it
[2021-01-22 19:32:12.108] [jLog] [info] building index
RapMap Indexer

[Step 1 of 4] : counting k-mers


Elapsed time: 1.5502s

Replaced 8141 non-ATCG nucleotides
Clipped poly-A tails from 3 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.00287256s
Writing sequence data to file . . . done
Elapsed time: 0.0249393s
[info] Building 32-bit suffix array (length of generalized text is 59320458)
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 0.11345s
done
Elapsed time: 4.87383s
processed 59000000 positions
khash had 36964624 keys
saving hash to disk . . . done
Elapsed time: 2.84644s
[2021-01-22 19:32:52.790] [jLog] [info] done building index


**Thoughts based on output:**
* Since the phage entries are genomes rather than genes, Salmon emits a warning: `Entry with header [NC_028999.1] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.` Is Salmon still including these entries in the index?

* When building the index, 510 duplicates are removed for PAO1+phage and 490 for PA14+phage. Previously, when we had separate indexes for PAO1, PA14 and phage, we got 34 duplicates removed for PAO1, 37 for PA14 and 391 for phage. So the duplicates seem to come mainly from the phage sequences. I'm not sure whether these duplicates are expected.
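One way to follow up on both questions is to scan the reference FASTA directly, listing the entries longer than the 200000 nucleotide threshold from the warning and grouping the entries that share an identical sequence. This is a minimal sketch: `read_fasta` and `summarize_reference` are hypothetical helpers, and it assumes that a "duplicate" means an exact, case-insensitive sequence match, which may not be exactly how Salmon defines it.

```python
import os
import tempfile
from collections import defaultdict


def read_fasta(path):
    """Yield (header_id, sequence) pairs from a multi-FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                parts = line[1:].split()
                header, chunks = (parts[0] if parts else ""), []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_reference(path, max_length=200000):
    """Return (long_entries, duplicates) for the FASTA file at `path`."""
    long_entries = []
    by_sequence = defaultdict(list)
    for header, seq in read_fasta(path):
        if len(seq) > max_length:
            long_entries.append((header, len(seq)))
        # Assumption: duplicates are exact sequence matches, ignoring case.
        by_sequence[seq.upper()].append(header)
    duplicates = [names for names in by_sequence.values() if len(names) > 1]
    return long_entries, duplicates


# Toy demonstration on a made-up FASTA file with a small length threshold.
toy = ">geneA some description\nACGTACGT\n>geneB\nacgtacgt\n>phage1\nACGTACGTACGTACGT\nACGT\n"
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(toy)
long_entries, duplicates = summarize_reference(tmp.name, max_length=10)
os.remove(tmp.name)

print(long_entries)
print(duplicates)
```

On the real data this could be called as `summarize_reference(paths.PAO1_PHAGE_REF)` and compared against the duplicate counts that Salmon reports.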