# Align and quantify samples

This notebook aligns test samples against the phage, PAO1, and PA14 reference genomes. Our goal is to test the phage reference alignment before we port it to Discovery (the Dartmouth computing cluster). We want to check that phage genes are more highly expressed in a phage sample than in a non-Pseudomonas sample.

In [1]:
%load_ext autoreload
%autoreload 2

import os
import shutil
import pandas as pd
import numpy as np
from core_acc_modules import paths

np.random.seed(123)

### Download SRA data

Note: Delete the `sra` folder between runs. Otherwise `fastq-dump` will be called on every file in the `sra` folder, which can include more than your SRA accessions.
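As an alternative to clearing the folder, one could build the list of `.sra` files from the accession list and pass only those to `fastq-dump`. The sketch below is a minimal illustration, not part of the original workflow: the helper name `sra_files_for_accessions` is hypothetical, and it assumes the accession file has one accession per line and that each download is stored as `<accession>.sra` directly inside the SRA folder (as the `fastq-dump` output paths in this notebook suggest).

```python
import tempfile
from pathlib import Path


def sra_files_for_accessions(accession_file, sra_dir):
    """Return the expected .sra path for each accession listed in accession_file.

    Hypothetical helper: assumes one accession per line and a flat
    <sra_dir>/<accession>.sra layout.
    """
    lines = Path(accession_file).read_text().splitlines()
    accessions = [line.strip() for line in lines if line.strip()]
    return [Path(sra_dir) / f"{accession}.sra" for accession in accessions]


# Demonstration with a temporary accession file
with tempfile.TemporaryDirectory() as tmp:
    acc_file = Path(tmp) / "accessions.txt"
    acc_file.write_text("SRR13160334\nERR3642743\n\n")
    selected = sra_files_for_accessions(acc_file, Path(tmp) / "sra")

print([path.name for path in selected])
```

The resulting paths could then be passed to `fastq-dump` one at a time, so stray files in the folder are never processed.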

In [2]:
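# Remove downloads from previous runs; ignore_errors avoids a crash if the folder does not exist yet
shutil.rmtree(paths.SRA_DIR, ignore_errors=True)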

In [3]:
# Download sra data files
! prefetch --option-file $paths.SRA_ACC


2020-12-22T16:26:24 prefetch.2.8.2: 1) Downloading 'SRR13160334'...
2020-12-22T16:26:24 prefetch.2.8.2:  Downloading via https...
2020-12-22T16:27:59 prefetch.2.8.2: 1) 'SRR13160334' was downloaded successfully
2020-12-22T16:27:59 prefetch.2.8.2: 'SRR13160334' has 0 unresolved dependencies

2020-12-22T16:27:59 prefetch.2.8.2: 2) Downloading 'ERR3642743'...
2020-12-22T16:27:59 prefetch.2.8.2:  Downloading via https...
2020-12-22T16:32:37 prefetch.2.8.2: 2) 'ERR3642743' was downloaded successfully
2020-12-22T16:32:37 prefetch.2.8.2: 'ERR3642743' has 0 unresolved dependencies

2020-12-22T16:32:37 prefetch.2.8.2: 3) Downloading 'SRR13234437'...
2020-12-22T16:32:37 prefetch.2.8.2:  Downloading via https...
2020-12-22T16:36:29 prefetch.2.8.2 sys: libs/kns/unix/syssock.c:606:KSocketTimedRead: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
2020-12-22T16:36:29 prefetch.2.8.2 int: libs/kns/un

### Get FASTQ files associated with SRA downloads

The FASTQ files store the RNA-seq results: the sequence of each read and a quality score for each base call.

Here is a nice blog post that explains how to read FASTQ files: https://thesequencingcenter.com/knowledge-base/fastq-files/

Each FASTQ record gives the sequence of one read. Our goal is to map these reads to a reference so that we can count how many reads fall at each location and use those counts to estimate the level of expression.
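To make the record layout concrete, here is a minimal sketch of a FASTQ reader. It assumes the common four-line record layout (header, sequence, `+` separator, quality string) and Phred+33 quality encoding; the function name `parse_fastq` and the toy read are illustrative only and do not come from the samples in this notebook.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality_line = handle.readline().rstrip("\n")
        # Assumed Phred+33 encoding: score = ASCII code of the character minus 33
        qualities = [ord(char) - 33 for char in quality_line]
        yield header[1:], sequence, qualities  # drop the leading '@'


# Toy record for illustration
example = io.StringIO("@read1\nGATTACA\n+\nIIIIFF#\n")
records = list(parse_fastq(example))
print(records)
```

Each base in the sequence has exactly one quality score, so the sequence and quality lists have the same length.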

In [4]:
# Create the FASTQ output directory if it does not already exist
os.makedirs(paths.FASTQ_DIR, exist_ok=True)

In [5]:
!fastq-dump $paths.SRA_DIR/* --split-files --outdir $paths.FASTQ_DIR/

Read 24096032 spots for /home/alexandra/ncbi/public/sra/ERR3642743.sra
Written 24096032 spots for /home/alexandra/ncbi/public/sra/ERR3642743.sra
Read 8669682 spots for /home/alexandra/ncbi/public/sra/SRR13160334.sra
Written 8669682 spots for /home/alexandra/ncbi/public/sra/SRR13160334.sra
Read 32765714 spots total
Written 32765714 spots total


In [6]:
# Copied from https://github.com/hoganlab-dartmouth/sraProcessingPipeline/blob/5974e040c85724a8d385e53153b7707ae7c9c255/DiscoveryScripts/quantifier.py#L83

#!fastq-dump $paths_phage.SRA_DIR/* --skip-technical --readids --split-3 --clip --outdir $paths_phage.FASTQ_DIR/

### Quantify gene expression
Now that we have our index built and all of our data downloaded, we’re ready to quantify our samples.

**Input:**
* Index of reference transcriptome
* FASTQ of experimental samples

**Output:**

After the salmon commands finish running, you should have a directory named quants, which will have a sub-directory for each sample. These sub-directories contain the quantification results of salmon, as well as a lot of other information salmon records about the sample and the run. 

The main output file is called `quant.sf`. For example, the quantification file for a sample DRR016125 would be at quants/DRR016125/quant.sf. It is a TSV file listing the name (`Name`) of each transcript, its length (`Length`), its effective length (`EffectiveLength`), its abundance in Transcripts Per Million (`TPM`), and the estimated number of reads (`NumReads`) originating from this transcript.

**For each sample we have abundance estimates (TPM and estimated read counts) per gene, where the genes are those in the reference used to build the index.**
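As a quick illustration of the `quant.sf` layout described above, the sketch below reads a toy file with pandas. The gene names and numbers are invented for illustration. The last step recomputes TPM from `NumReads` and `EffectiveLength` using the standard TPM definition (reads per effective base, rescaled to sum to one million); this is shown as an assumption about how the columns relate, not as a description of salmon internals.

```python
import io

import pandas as pd

# Toy quant.sf content; names and values are made up for illustration
toy_quant_sf = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA\t1200\t1000.0\t600000.0\t300.0\n"
    "geneB\t700\t500.0\t400000.0\t100.0\n"
)

quant_df = pd.read_csv(toy_quant_sf, sep="\t", index_col="Name")

# Standard TPM definition: reads per effective base, rescaled to sum to 1e6
reads_per_base = quant_df["NumReads"] / quant_df["EffectiveLength"]
recomputed_tpm = reads_per_base / reads_per_base.sum() * 1e6

print(quant_df[["TPM", "NumReads"]])
print(recomputed_tpm)
```

Because TPM is rescaled within each sample, it describes relative abundance inside a sample rather than absolute read depth.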

#### Get quants using PAO1 reference

In [7]:
# Create the PAO1 quantification directory if it does not already exist
os.makedirs(paths.PAO1_QUANT, exist_ok=True)

In [8]:
%%bash -s $paths.PAO1_QUANT $paths.FASTQ_DIR $paths.PAO1_INDEX

for FILE_PATH in $2/*;
do

# get file name
sample_name=`basename ${FILE_PATH}`

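# strip the mate suffix (_1.fastq or _2.fastq) to get the sample accession
# note: the glob matches both mate files, so each paired-end sample is visited twice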
sample_name="${sample_name%_*}"

# get base path
base_name=${FILE_PATH%/*}

echo "Processing sample ${sample_name}"

salmon quant -i $3 -l A \
            -1 ${base_name}/${sample_name}_1.fastq \
            -2 ${base_name}/${sample_name}_2.fastq \
            -p 8 --validateMappings -o $1/${sample_name}_quant
done

Processing sample ERR3642743
Processing sample ERR3642743
Processing sample SRR13160334
Processing sample SRR13160334
Processing sample SRR13234437
Processing sample SRR13234437


Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
### salmon (selective-alignment-based) v1.3.0
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { /home/alexandra/Documents/Data/Core_accessory/pao1_index }
### [ libType ] => { A }
### [ mates1 ] => { /home/alexandra/ncbi/public/fastq_phage/ERR3642743_1.fastq }
### [ mates2 ] => { /home/alexandra/ncbi/public/fastq_phage/ERR3642743_2.fastq }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { /home/alexandra/ncbi/public/quants_pao1/ERR3642743_quant }
Logs will be written to /home/alexandra/ncbi/public

#### Get quants using PA14 reference

In [9]:
# Create the PA14 quantification directory if it does not already exist
os.makedirs(paths.PA14_QUANT, exist_ok=True)

In [10]:
%%bash -s $paths.PA14_QUANT $paths.FASTQ_DIR $paths.PA14_INDEX

for FILE_PATH in $2/*;
do

# get file name
sample_name=`basename ${FILE_PATH}`

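# strip the mate suffix (_1.fastq or _2.fastq) to get the sample accession
# note: the glob matches both mate files, so each paired-end sample is visited twice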
sample_name="${sample_name%_*}"

# get base path
base_name=${FILE_PATH%/*}

echo "Processing sample ${sample_name}"

salmon quant -i $3 -l A \
            -1 ${base_name}/${sample_name}_1.fastq \
            -2 ${base_name}/${sample_name}_2.fastq \
            -p 8 --validateMappings -o $1/${sample_name}_quant
done

Processing sample ERR3642743
Processing sample ERR3642743
Processing sample SRR13160334
Processing sample SRR13160334
Processing sample SRR13234437
Processing sample SRR13234437


### salmon (selective-alignment-based) v1.3.0
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { /home/alexandra/Documents/Data/Core_accessory/pa14_index }
### [ libType ] => { A }
### [ mates1 ] => { /home/alexandra/ncbi/public/fastq_phage/ERR3642743_1.fastq }
### [ mates2 ] => { /home/alexandra/ncbi/public/fastq_phage/ERR3642743_2.fastq }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { /home/alexandra/ncbi/public/quants_pa14/ERR3642743_quant }
Logs will be written to /home/alexandra/ncbi/public

#### Get quants using phage reference

In [11]:
# Create the phage quantification directory if it does not already exist
os.makedirs(paths.PHAGE_QUANT, exist_ok=True)

In [12]:
%%bash -s $paths.PHAGE_QUANT $paths.FASTQ_DIR $paths.PHAGE_INDEX

for FILE_PATH in $2/*;
do

# get file name
sample_name=`basename ${FILE_PATH}`

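# strip the mate suffix (_1.fastq or _2.fastq) to get the sample accession
# note: the glob matches both mate files, so each paired-end sample is visited twice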
sample_name="${sample_name%_*}"

# get base path
base_name=${FILE_PATH%/*}

echo "Processing sample ${sample_name}"

salmon quant -i $3 -l A \
            -1 ${base_name}/${sample_name}_1.fastq \
            -2 ${base_name}/${sample_name}_2.fastq \
            -p 8 --validateMappings -o $1/${sample_name}_quant
done

Processing sample ERR3642743
Processing sample ERR3642743
Processing sample SRR13160334
Processing sample SRR13160334
Processing sample SRR13234437
Processing sample SRR13234437


### salmon (selective-alignment-based) v1.3.0
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { /home/alexandra/Documents/Data/Core_accessory/phage_index }
### [ libType ] => { A }
### [ mates1 ] => { /home/alexandra/ncbi/public/fastq_phage/ERR3642743_1.fastq }
### [ mates2 ] => { /home/alexandra/ncbi/public/fastq_phage/ERR3642743_2.fastq }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { /home/alexandra/ncbi/public/quants_phage/ERR3642743_quant }
Logs will be written to /home/alexandra/ncbi/publ

### Consolidate sample quantification to gene expression dataframe

In [13]:
# Read through all sample subdirectories in quant/
# Within each sample subdirectory, get quant.sf file
data_dir = paths.PAO1_QUANT

expression_pao1_df = pd.DataFrame(
    pd.read_csv(file, sep="\t", index_col=0)["TPM"].
    rename(file.parent.name.split("_")[0]) 
    for file in data_dir.rglob("*/quant.sf"))    

expression_pao1_df.head()

Name,PGD134012,PGD134018,PGD134020,PGD134022,PGD134024,PGD134014,PGD134016,PGD134026,PGD134030,PGD134032,...,PGD133904,PGD133906,PGD133902,PGD133898,PGD133900,PGD133894,PGD133896,PGD133892,PGD133884,PGD133886
SRR7886564,231.716766,190.742049,251.434363,244.575451,232.66742,338.778285,154.619466,135.309196,197.297614,208.656767,...,384.006646,160.423009,217.461069,156.88598,207.875748,401.447266,243.877369,242.992502,246.314856,2505.558825
SRR7886563,221.861101,206.2282,262.510896,236.227489,218.772353,277.169311,172.770017,131.65515,203.906846,195.419829,...,422.431994,186.070549,221.798564,169.894611,206.04732,364.537806,220.715998,222.137392,290.767265,1822.432926
SRR13160334,9.290001,10.396172,23.601431,16.471582,2.44842,8.70512,0.979054,4.639901,4.973794,0.0,...,165.995489,15.83991,6.455705,6.528371,10.306895,0.0,1.126866,7.395994,31.574521,404.80925
SRR13196071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR7886556,239.03554,214.717657,274.756361,234.900934,211.425667,331.008247,182.534175,151.789441,177.38904,214.45934,...,313.613041,203.419543,219.460975,122.112693,216.61862,356.472812,204.935818,236.919636,344.232315,840.672326


In [14]:
# Read through all sample subdirectories in quant/
# Within each sample subdirectory, get quant.sf file
data_dir = paths.PA14_QUANT

expression_pa14_df = pd.DataFrame(
    pd.read_csv(file, sep="\t", index_col=0)["TPM"].
    rename(file.parent.name.split("_")[0]) 
    for file in data_dir.rglob("*/quant.sf"))    

expression_pa14_df.head()

Name,PGD1650835,PGD1650837,PGD1650839,PGD1650841,PGD1650843,PGD1650845,PGD1650847,PGD1650849,PGD1650851,PGD1650853,...,PGD1662756,PGD1662758,PGD1662760,PGD1662762,PGD1662764,PGD1662766,PGD1662768,PGD1662770,PGD1662772,PGD1662774
SRR7886564,192.751566,158.720852,214.625977,202.781724,191.179817,284.389596,128.577961,111.754485,169.494706,181.962984,...,206.279768,318.390036,137.651423,176.686931,128.913082,172.640748,325.020691,207.769043,203.636922,198.2032
SRR7886563,185.983215,176.439379,226.76179,199.513132,181.198447,239.846511,148.182413,112.657855,173.914475,171.601826,...,214.476862,349.078921,160.861156,179.750922,147.525041,173.583246,327.571838,190.270258,187.372917,233.744611
SRR13160334,11.000647,12.457031,28.087817,19.657212,2.933759,10.90476,1.173137,5.367987,5.959752,0.0,...,66.349304,198.877568,18.979817,7.735392,7.77398,12.363816,0.0,1.35025,8.862147,46.316389
SRR13196071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR7886556,206.748915,182.729266,239.196084,204.343926,177.187986,281.314948,156.850235,130.633713,157.585184,183.599883,...,184.616302,266.232608,176.685663,187.728499,108.815371,185.383883,340.062547,179.374152,204.736395,291.297672


In [15]:
# Read through all sample subdirectories in quant/
# Within each sample subdirectory, get quant.sf file
data_dir = paths.PHAGE_QUANT

expression_phage_df = pd.DataFrame(
    pd.read_csv(file, sep="\t", index_col=0)["TPM"].
    rename(file.parent.name.split("_")[0]) 
    for file in data_dir.rglob("*/quant.sf"))    

expression_phage_df.head()

Name,NC_028999.1,MT133560.1,MK599315.1,MH725810.1,MF974178.1,NC_016765.1,NC_031063.1,NC_027375.1,NC_011810.1,MT108726.1,...,DI373497.1,DI373496.1,DI373495.1,DI373494.1,DI373493.1,DI373492.1,DI373491.1,DI373490.1,DI373489.1,DI373488.1
SRR13160334,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.371706,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR13196071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ERR3642743,0.0,0.0,0.27459,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR13234437,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR13196068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# Save gene expression data
expression_pao1_df.to_csv(paths.PAO1_GE, sep='\t')
expression_pa14_df.to_csv(paths.PA14_GE, sep='\t')
expression_phage_df.to_csv(paths.PHAGE_GE, sep='\t')
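One possible way to check the goal stated at the top of this notebook (more phage expression in a phage sample than in a non-Pseudomonas sample) is to summarize the phage expression table per sample. The sketch below uses a toy samples-by-sequences table with invented sample names and values; in practice the same two summaries could be computed on `expression_phage_df`. Since TPM is rescaled within each sample, the number of detected sequences is reported alongside the maximum TPM rather than a TPM total.

```python
import pandas as pd

# Hypothetical samples-by-sequences TPM table; all names and values are invented
toy_phage_df = pd.DataFrame(
    {
        "phage_seq_1": [120.5, 0.0],
        "phage_seq_2": [30.0, 0.0],
        "phage_seq_3": [0.0, 0.3],
    },
    index=["phage_sample", "control_sample"],
)

# Per-sample summaries: how many reference sequences have nonzero TPM,
# and the largest TPM observed for any single sequence
summary = pd.DataFrame(
    {
        "n_detected": (toy_phage_df > 0).sum(axis=1),
        "max_TPM": toy_phage_df.max(axis=1),
    }
)
print(summary)
```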