# Align and quantify samples

This notebook aligns test samples against the phage reference genome.

*Positive test cases:*
* ???

*Negative test cases:*
* E. coli sample
* Pseudomonas sample containing only core genes

In [1]:
%load_ext autoreload
%autoreload 2

import os
import pandas as pd
import numpy as np
from core_acc_modules import paths_phage

np.random.seed(123)

### Setup SRA toolkit -- only needs to be run once

In [2]:
# Download latest version of compiled binaries of NCBI SRA toolkit 
#if not os.path.exists("sratoolkit.current-centos_linux64.tar.gz"):
#    ! wget "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz"

In [3]:
# Extract tar.gz file 
#if os.path.exists("sratoolkit.current-centos_linux64.tar.gz"):
#    ! tar -xzf sratoolkit.current-centos_linux64.tar.gz

# add binaries to path using export path or editing ~/.bashrc file
#! export PATH=$PATH:sratoolkit.2.10.7-centos_linux64/bin

# Now SRA binaries added to path and ready to use
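
One caveat with the commented-out `export` above: a `!` command in a notebook runs in its own temporary subshell, so a `PATH` change made there does not carry over to later cells. A minimal sketch of one alternative, assuming the toolkit was extracted into the current working directory under the directory name shown above, is to extend `PATH` from Python. The helper `add_to_path` is a hypothetical name, not part of the SRA toolkit.

```python
import os
import shutil


def add_to_path(bin_dir, env=None):
    """Prepend bin_dir to PATH in env (defaults to os.environ) unless already present."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in parts:
        parts.insert(0, bin_dir)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


# Directory name taken from the commented-out export above (assumed location)
add_to_path(os.path.abspath("sratoolkit.2.10.7-centos_linux64/bin"))

# shutil.which returns None when a tool cannot be found on PATH
for tool in ("prefetch", "fastq-dump"):
    print(tool, "->", shutil.which(tool))
```

If both tools resolve to a path, later cells that call `prefetch` or `fastq-dump` should find them.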

### Download SRA data

In [4]:
"""# Download sra data files
! prefetch --option-file $paths.SRA_ACC """

'# Download sra data files\n! prefetch --option-file $paths.SRA_ACC '

### Get FASTQ files associated with SRA downloads

The FASTQ files store the RNA-seq results: the sequence of each read and a quality score for each base call.

This blog post explains how to read FASTQ files: https://thesequencingcenter.com/knowledge-base/fastq-files/

Each FASTQ record gives the sequence of one read. Our goal is to map these reads to a reference genome so that we can count the reads that fall at a given location and use those counts to estimate the level of expression.
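
To make the record layout concrete, here is a minimal sketch of a FASTQ reader. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; the function name `read_fastq` and the toy record are illustrative only and are not part of this project.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # Assumed Phred+33 encoding: score = ASCII code of the character - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Toy record (not real data): one read of four bases
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for read_id, sequence, scores in read_fastq(example):
    print(read_id, sequence, scores)  # read1 ACGT [40, 40, 20, 0]
```

A higher score means a more confident base call; here the last base has the lowest possible score.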

In [5]:
#!mkdir $paths.FASTQ_DIR

In [6]:
#!fastq-dump $paths.SRA_DIR/* --split-files --outdir $paths.FASTQ_DIR/

### Obtain a transcriptome and build an index

Here we are using [Salmon](https://combine-lab.github.io/salmon/)

**Input:**
* Target transcriptome, given to Salmon as a (possibly compressed) multi-FASTA file in which each entry provides the sequence of one transcript
* We downloaded the phage genomes from NCBI GenBank

**Note:** For prokaryotes, transcripts and genes map close to 1-to-1, so we use genes as our reference transcriptome and do not need tximport to map transcript quantifications to genes.
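
As an illustration of the multi-FASTA input format, the sketch below reads entries from an uncompressed FASTA stream and reports the length of each sequence. The helper `read_fasta` and the toy entries are assumptions made for this example; Salmon does its own parsing.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) for each entry of a multi-FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier before the first whitespace
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Toy input (not real data): two entries, the first split over two lines
example = io.StringIO(">seq1 toy entry\nACGT\nACGT\n>seq2\nGGCC\n")
lengths = {name: len(seq) for name, seq in read_fasta(example)}
print(lengths)  # {'seq1': 8, 'seq2': 4}
```

Counting the entries this way is a quick check that the reference file contains the number of sequences you expect before building the index.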

**Output:**
* The index is a structure that salmon uses to quasi-map RNA-seq reads during quantification
* [Quasi-map](https://academic.oup.com/bioinformatics/article/32/12/i192/2288985) is a way to map sequenced fragments (single or paired-end reads) to a target transcriptome. Quasi-mapping produces what we refer to as fragment mapping information. In particular, it provides, for each query (fragment), the reference sequences (transcripts), strand and position from which the query may have likely originated. In many cases, this mapping information is sufficient for downstream analysis like quantification.

*Algorithm:*

Quasi-mapping computes the mapping for a query read r through repeated application of three steps:
1. Determine the next hash table k-mer that starts past the current query position
2. Compute the maximum mappable prefix (MMP) of the query beginning with this k-mer
3. Determine the next informative position (NIP) by performing a longest common prefix (LCP) query on two specifically chosen suffixes in the suffix array (SA)
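
The loop structure of these steps can be sketched with a heavily simplified toy. This is not Salmon's implementation: it uses a plain Python dictionary instead of a suffix array, extends matches base by base as a stand-in for the MMP, and simply jumps past the longest match as a stand-in for the NIP. All names (`build_kmer_table`, `toy_map`) and sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_table(transcripts, k):
    """Map every k-mer to the (transcript name, offset) pairs where it occurs."""
    table = defaultdict(list)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].append((name, i))
    return table


def toy_map(read, transcripts, table, k):
    """Return {transcript: [(read_pos, ref_pos, match_len), ...]} for one read."""
    hits = defaultdict(list)
    pos = 0
    while pos + k <= len(read):
        occurrences = table.get(read[pos:pos + k])
        if not occurrences:
            pos += 1  # no k-mer hit here; try the next read position
            continue
        longest = k
        for name, ref_pos in occurrences:
            ref = transcripts[name]
            length = k
            # Extend the exact match base by base (toy stand-in for the MMP)
            while (pos + length < len(read)
                   and ref_pos + length < len(ref)
                   and read[pos + length] == ref[ref_pos + length]):
                length += 1
            hits[name].append((pos, ref_pos, length))
            longest = max(longest, length)
        pos += longest  # jump past the matched stretch (toy stand-in for the NIP)
    return dict(hits)


# Toy sequences (not real data) that share the stretch TTTGACCA
transcripts = {"geneA": "ACGTACGTTTGACCA", "geneB": "GGGTTTGACCAGGG"}
table = build_kmer_table(transcripts, k=5)
print(toy_map("TTTGACCA", transcripts, table, k=5))
```

Because the read matches both toy transcripts equally well, both are reported as candidate origins. Resolving that kind of ambiguity is the job of the quantification step, not the mapping step.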

In [7]:
#! salmon index -t $paths_phage.PAO1_REF -i $paths_phage.PAO1_INDEX

In [8]:
! salmon index -t $paths_phage.PHAGE_REF -i $paths_phage.PHAGE_INDEX

Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
[2020-12-15 17:51:18.938] [jLog] [info] building index
out : /home/alexandra/Documents/Data/Core_accessory/phage_index
[2020-12-15 17:51:18.938] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2020-12-15 17:51:20.572] [puff::index::jointLog] [info] Replaced 6,326 non-ATCG nucleotides
[2020-12-15 17:51:20.572] [puff::index::jointLog] [info] Clipped poly-A tails from 0 transcripts

wrote 1128 cleaned references
[2020-12-15 17:51:20.619] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2020-12-15 17:51:20.951] [puff::index::jointLog] [info] ntHll estimated 22009247 distinct k-mers, setting filter size to 2^29
Threads = 2
Vertex length = 31
Hash functions = 5
Filter size = 536870912
Capacity = 2
Files: 
/home/alexandra/Documents/Data/Core_accessory/phage_index/ref_k31_fixed.fa
--------------------------------------------------------------------------------
Round 0, 0:536870912
Pass	Filling	Filtering
1	6	13	
2	1	0
True junctions count = 183497
False junctions count = 203448
Hash table size = 386945
Candidate marks count = 3294042
--------------------------------------------------------------------------------
Reallocating bifurcations time: 1
True marks count: 2238898
Edges construction time: 1
--------------------------------------------------------------------------------
Distinct junction

### Quantify gene expression
Now that we have built our index and downloaded all of our data, we're ready to quantify our samples.

**Input:**
* Index of reference transcriptome
* FASTQ of experimental samples

**Output:**

After the salmon commands finish running, you should have a directory named quants, which will have a sub-directory for each sample. These sub-directories contain the quantification results of salmon, as well as a lot of other information salmon records about the sample and the run. 

The main output file is called `quant.sf`. For example, the quantification file for sample DRR016125, quants/DRR016125/quant.sf, is a TSV file listing the name (`Name`) of each transcript, its length (`Length`), its effective length (`EffectiveLength`), its abundance in Transcripts Per Million (`TPM`), and the estimated number of reads (`NumReads`) originating from it.

**For each sample we have read counts per gene (where the genes are based on the reference gene file provided above).** 
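
To show how the `TPM`, `NumReads`, and `EffectiveLength` columns relate, here is a small worked example using the standard TPM definition: each gene's reads-per-effective-base rate, divided by the sum of those rates over all genes, times one million. The gene names and numbers are made up for illustration and do not come from this dataset.

```python
# Toy values (not real data): gene -> (effective length, estimated read count)
toy_quant = {
    "geneA": (1000.0, 100.0),
    "geneB": (500.0, 100.0),
    "geneC": (2000.0, 0.0),
}

# Reads per effective base for each gene
rates = {gene: reads / eff_len for gene, (eff_len, reads) in toy_quant.items()}
total_rate = sum(rates.values())

# Scale so that the values sum to one million within the sample
tpm = {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}

for gene, value in tpm.items():
    print(f"{gene}\t{value:.1f}")
```

geneA and geneB have the same read count, but geneB is half as long, so its TPM is twice as high. This length normalization is why TPM values, rather than raw read counts, are compared across genes.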

#### Get quants using phage reference

In [9]:
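%%bash -s $paths_phage.PHAGE_QUANT $paths_phage.FASTQ_DIR $paths_phage.PHAGE_INDEX
# $1 = output directory, $2 = FASTQ directory, $3 = Salmon index
# Note: plain mkdir reports an error if the directory already exists (mkdir -p would not)
mkdir $1

# Note: the glob matches both mate files (_1 and _2), so each sample is visited twice
for FILE_PATH in $2/*;
do
    # get file name
    sample_name=$(basename ${FILE_PATH})

    # remove the mate suffix and extension (e.g. _1.fastq) from the file name
    sample_name="${sample_name%_*}"

    # get the directory that contains the FASTQ files
    base_name=${FILE_PATH%/*}

    echo "Processing sample ${sample_name}"

    # -l A: detect the library type automatically; -p 8: use 8 threads
    salmon quant -i $3 -l A \
                -1 ${base_name}/${sample_name}_1.fastq \
                -2 ${base_name}/${sample_name}_2.fastq \
                -p 8 --validateMappings -o $1/${sample_name}_quant
done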

Processing sample SRR11809598
Processing sample SRR11809598
Processing sample SRR11809599
Processing sample SRR11809599
Processing sample SRR11809600
Processing sample SRR11809600
Processing sample SRR11809601
Processing sample SRR11809601
Processing sample SRR11809602
Processing sample SRR11809602
Processing sample SRR11809603
Processing sample SRR11809603
Processing sample SRR11809604
Processing sample SRR11809604
Processing sample SRR11809605
Processing sample SRR11809605
Processing sample SRR11809606
Processing sample SRR11809606
Processing sample SRR11809607
Processing sample SRR11809607
Processing sample SRR11809626
Processing sample SRR11809626
Processing sample SRR11809627
Processing sample SRR11809627
Processing sample SRR11809628
Processing sample SRR11809628
Processing sample SRR7886554
Processing sample SRR7886554
Processing sample SRR7886555
Processing sample SRR7886555
Processing sample SRR7886556
Processing sample SRR7886556
Processing sample SRR7886557
Processing sample

mkdir: cannot create directory ‘/home/alexandra/ncbi/public/quants_phage’: File exists
Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
### salmon (selective-alignment-based) v1.3.0
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { /home/alexandra/Documents/Data/Core_accessory/phage_index }
### [ libType ] => { A }
### [ mates1 ] => { /home/alexandra/ncbi/public/fastq/SRR11809598_1.fastq }
### [ mates2 ] => { /home/alexandra/ncbi/public/fastq/SRR11809598_2.fastq }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { /home/alexandra/ncbi/public/quan
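
Each sample name appears twice in the log above because the loop iterates over every file in the FASTQ directory, and each paired-end sample has two files (`_1.fastq` and `_2.fastq`). A minimal sketch of one way to list each sample once, assuming that naming convention, is shown below. The helper `paired_samples` and the file names are hypothetical.

```python
def paired_samples(fastq_names):
    """Map sample name -> (mate 1, mate 2) for samples that have both files."""
    names = set(fastq_names)
    pairs = {}
    for name in sorted(names):
        if name.endswith("_1.fastq"):
            sample = name[: -len("_1.fastq")]
            mate2 = f"{sample}_2.fastq"
            # Skip samples whose second mate file is missing
            if mate2 in names:
                pairs[sample] = (name, mate2)
    return pairs


# Hypothetical file names: sampleB has no mate-2 file and is skipped
files = ["sampleA_1.fastq", "sampleA_2.fastq", "sampleB_1.fastq"]
print(paired_samples(files))
# {'sampleA': ('sampleA_1.fastq', 'sampleA_2.fastq')}
```

Iterating over the keys of such a mapping would run the quantification once per sample and would also flag samples with an incomplete pair.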

### Consolidate sample quantification to gene expression dataframe

In [10]:
# Phage
# Read through all sample subdirectories in the quant directory
# Within each sample subdirectory, read quant.sf and keep its TPM column,
# named after the sample (the directory name up to the first underscore)
data_dir = paths_phage.PHAGE_QUANT

expression_phage_df = pd.DataFrame(
    pd.read_csv(file, sep="\t", index_col=0)["TPM"].rename(
        file.parent.name.split("_")[0]
    )
    for file in data_dir.rglob("*/quant.sf")
)

expression_phage_df.head()

Name,NC_028999.1,MT133560.1,MK599315.1,MH725810.1,MF974178.1,NC_016765.1,NC_031063.1,NC_027375.1,NC_011810.1,MT108726.1,...,DI373497.1,DI373496.1,DI373495.1,DI373494.1,DI373493.1,DI373492.1,DI373491.1,DI373490.1,DI373489.1,DI373488.1
SRR7886564,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR7886563,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR7886556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR7886554,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR11809604,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
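
The consolidation step can be tried on a small synthetic directory tree before running it on real output. The sketch below assumes the `<sample>_quant/quant.sf` layout produced by the quantification loop above. The function name `collect_tpm`, the sample names `S1` and `S2`, and all values are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def collect_tpm(quant_dir):
    """Build a samples x genes TPM table from <sample>_quant/quant.sf files."""
    series = [
        pd.read_csv(path, sep="\t", index_col=0)["TPM"].rename(
            path.parent.name.split("_")[0]
        )
        for path in sorted(Path(quant_dir).rglob("*/quant.sf"))
    ]
    return pd.DataFrame(series)


with tempfile.TemporaryDirectory() as tmp:
    # Write two toy quant.sf files (not real data)
    for sample, tpms in {"S1": (10.0, 0.0), "S2": (0.0, 5.0)}.items():
        sample_dir = Path(tmp) / f"{sample}_quant"
        sample_dir.mkdir()
        (sample_dir / "quant.sf").write_text(
            "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
            f"geneA\t100\t90.0\t{tpms[0]}\t1.0\n"
            f"geneB\t200\t190.0\t{tpms[1]}\t2.0\n"
        )
    toy_df = collect_tpm(tmp)

print(toy_df)
```

Each row is one sample and each column is one reference sequence, matching the shape of `expression_phage_df` above. Sorting the paths keeps the row order reproducible between runs.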


In [11]:
# Save gene expression data
expression_phage_df.to_csv(paths_phage.PHAGE_GE, sep='\t')