# Build reference transcriptome

Here we use [Salmon](https://combine-lab.github.io/salmon/) to build an index for each reference transcriptome.

**Input:**
* Target transcriptome
* This transcriptome is given to Salmon in the form of a (possibly compressed) multi-FASTA file, with each entry providing the sequence of a transcript
* We downloaded the phage GENOMES from NCBI GenBank
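
As a quick sanity check on the input, the number of entries in a multi-FASTA file can be counted before indexing. The sketch below is a minimal example, not part of the original pipeline: the helper name `summarize_fasta` is hypothetical, and it assumes a plain or gzip-compressed file (detected from a `.gz` suffix). It also counts characters outside A/C/G/T, which is the kind of thing Salmon reports as replaced non-ATCG nucleotides in the logs further down.

```python
import gzip
from pathlib import Path


def summarize_fasta(path):
    """Return (number of records, number of non-ACGT characters) in a FASTA file.

    Hypothetical helper for illustration; assumes '.gz' means gzip-compressed.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    n_records = 0
    n_non_acgt = 0
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_records += 1  # each header line starts one entry
            else:
                # count ambiguous or unexpected characters in the sequence
                n_non_acgt += sum(base not in "ACGT" for base in line.upper())
    return n_records, n_non_acgt
```

Calling `summarize_fasta` on a reference file returns a pair that can be compared by eye against the "cleaned references" and "non-ATCG nucleotides" lines Salmon prints, keeping in mind that Salmon may apply its own cleaning rules.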

**Note:** In prokaryotes, transcripts and genes map nearly one-to-one, so we use gene sequences as our reference transcriptome and do not need tximport to map transcript-level quantifications to genes.

**Output:**
* The index is a data structure that Salmon uses to quasi-map RNA-seq reads during quantification
* [Quasi-mapping](https://academic.oup.com/bioinformatics/article/32/12/i192/2288985) is a way to map sequenced fragments (single or paired-end reads) to a target transcriptome. It produces what the authors refer to as fragment mapping information: for each query (fragment), the reference sequences (transcripts), strand, and position from which the query likely originated. In many cases, this mapping information is sufficient for downstream analyses such as quantification.

*Algorithm:*

A query read r is mapped through repeated application of the following steps:
1. Determine the next k-mer found in the hash table that starts past the current query position
2. Compute the maximum mappable prefix (MMP) of the query beginning with this k-mer
3. Determine the next informative position (NIP) by performing a longest common prefix (LCP) query on two specifically chosen suffixes in the suffix array (SA)
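
The three steps can be illustrated with a toy Python sketch. This is not Salmon's or RapMap's implementation: it builds a naive suffix array by sorting, ignores the reverse strand, and uses a simplified skip rule (advance by the shared prefix length minus k, plus one) that is an assumption made for illustration. The names `build_toy_index`, `quasi_map`, and the example genes are all hypothetical.

```python
import bisect


def common_prefix_len(a, b):
    """Length of the shared prefix of a and b, never crossing a '$' separator."""
    n = 0
    for x, y in zip(a, b):
        if x != y or x == "$":
            break
        n += 1
    return n


def build_toy_index(refs, k):
    """Concatenate references, build a naive suffix array and a k-mer -> SA interval table."""
    text, offsets, names = "", [], []
    for name, seq in refs.items():
        offsets.append(len(text))
        names.append(name)
        text += seq + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    table = {}
    for rank, pos in enumerate(sa):
        kmer = text[pos:pos + k]
        if len(kmer) < k or "$" in kmer:
            continue
        lo, _ = table.get(kmer, (rank, rank))
        table[kmer] = (lo, rank + 1)  # suffixes sharing a k-mer are contiguous in the SA
    return {"text": text, "sa": sa, "table": table, "offsets": offsets, "names": names}


def quasi_map(read, index, k):
    """Return hits as (read_pos, ref_name, ref_pos, match_len) tuples."""
    text, sa = index["text"], index["sa"]
    hits, i = [], 0
    while i + k <= len(read):
        # Step 1: find the next k-mer that is present in the hash table
        interval = index["table"].get(read[i:i + k])
        if interval is None:
            i += 1
            continue
        # Step 2: maximum mappable prefix over all suffixes in the interval
        ranks = range(*interval)
        lengths = [common_prefix_len(read[i:], text[sa[r]:]) for r in ranks]
        mmp = max(lengths)
        best = [r for r, n in zip(ranks, lengths) if n == mmp]
        for r in best:
            ref = bisect.bisect_right(index["offsets"], sa[r]) - 1
            hits.append((i, index["names"][ref], sa[r] - index["offsets"][ref], mmp))
        # Step 3: LCP of the first and last suffix matching the MMP gives the skip
        shared = common_prefix_len(text[sa[best[0]]:], text[sa[best[-1]]:])
        i += max(1, shared - k + 1)  # simplified skip rule (assumption)
    return hits


toy_refs = {"geneA": "ACGTACGTTTGACCA", "geneB": "TTGACCAGGGTACGA"}
toy_index = build_toy_index(toy_refs, k=5)
print(quasi_map("TTGACCAGG", toy_index, k=5))
```

In this toy example the read shares its first seven bases with both genes but only extends further in `geneB`, so only the `geneB` suffix achieves the maximum mappable prefix. A shorter read such as `"TTGACCA"` would instead be reported against both references.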

In [1]:
%load_ext autoreload
%autoreload 2

from core_acc_modules import paths
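
The cells below call `salmon index` through the notebook's `!` shell escape. As a rough sketch, the same call could be made from plain Python with `subprocess`. The helper names and example paths here are hypothetical, and only the `-t` and `-i` flags already used in this notebook are assumed.

```python
import subprocess


def salmon_index_command(ref_fasta, index_dir):
    """Build the argument list for `salmon index` (hypothetical helper)."""
    return ["salmon", "index", "-t", str(ref_fasta), "-i", str(index_dir)]


def run_salmon_index(ref_fasta, index_dir):
    """Run the indexing step; raises CalledProcessError if salmon exits non-zero."""
    subprocess.run(salmon_index_command(ref_fasta, index_dir), check=True)
```

For example, `run_salmon_index(paths.PAO1_REF, paths.PAO1_INDEX)` would mirror the first indexing cell, assuming `salmon` is on the `PATH`.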

In [2]:
# Get PAO1 index
! salmon index -t $paths.PAO1_REF -i $paths.PAO1_INDEX

Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
[2020-12-22 10:50:46.910] [jLog] [info] building index
out : /home/alexandra/Documents/Data/Core_accessory/pao1_index
[2020-12-22 10:50:46.911] [puff::index::jointLog] [info] Running fixFasta
[Step 1 of 4] : counting k-mers

[00m[00m[2020-12-22 10:50:47.079] [puff::index::jointLog] [info] Replaced 0 non-ATCG nucleotides
[00m[00m[2020-12-22 10:50:47.079] [puff::index::jointLog] [info] Clipped poly-A tails from 0 transcripts
[00mwrote 5685 cleaned references
[00m[2020-12-22 10:50:47.097] [puff::index::jointLog] [info] Filter size 

In [3]:
# Get PA14 index
! salmon index -t $paths.PA14_REF -i $paths.PA14_INDEX

[2020-12-22 10:50:50.568] [jLog] [info] building index
out : /home/alexandra/Documents/Data/Core_accessory/pa14_index
[2020-12-22 10:50:50.568] [puff::index::jointLog] [info] Running fixFasta
[Step 1 of 4] : counting k-mers

[00m[00m[2020-12-22 10:50:50.714] [puff::index::jointLog] [info] Replaced 3 non-ATCG nucleotides
[00m[00m[2020-12-22 10:50:50.714] [puff::index::jointLog] [info] Clipped poly-A tails from 0 transcripts
[00mwrote 5959 cleaned references
[00m[2020-12-22 10:50:50.752] [puff::index::jointLog] [info] Filter size 

In [4]:
# Get phage index
! salmon index -t $paths.PHAGE_REF -i $paths.PHAGE_INDEX

[2020-12-22 10:50:54.176] [jLog] [info] building index
out : /home/alexandra/Documents/Data/Core_accessory/phage_index
[2020-12-22 10:50:54.176] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2020-12-22 10:50:55.802] [puff::index::jointLog] [info] Replaced 6,326 non-ATCG nucleotides
[2020-12-22 10:50:55.802] [puff::index::jointLog] [info] Clipped poly-A tails from 0 transcripts


wrote 1128 cleaned references
[2020-12-22 10:50:55.856] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2020-12-22 10:50:56.178] [puff::index::jointLog] [info] ntHll estimated 22009247 distinct k-mers, setting filter size to 2^29
Threads = 2
Vertex length = 31
Hash functions = 5
Filter size = 536870912
Capacity = 2
Files: 
/home/alexandra/Documents/Data/Core_accessory/phage_index/ref_k31_fixed.fa
--------------------------------------------------------------------------------
Round 0, 0:536870912
Pass	Filling	Filtering
1	5	14	
2	1	0
True junctions count = 183497
False junctions count = 225552
Hash table size = 409049
Candidate marks count = 3344391
--------------------------------------------------------------------------------
Reallocating bifurcations time: 0
True marks count: 2238898
Edges construction time: 2
--------------------------------------------------------------------------------
Distinct junction