# Align and quantify samples

This notebook aligns test samples against the phage and PAO1 reference genomes. Our goal is to test the phage reference genome alignment before we port it to Discovery (the Dartmouth computing cluster). We want to check that phage genes show higher expression in a phage sample than in a non-Pseudomonas sample.

In [17]:
%load_ext autoreload
%autoreload 2

import os
import shutil
import pandas as pd
import numpy as np
from core_acc_modules import paths

np.random.seed(123)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Download SRA data

Note: Delete the `sra` folder between runs. Otherwise `fastq-dump` will be called on every file in the `sra` folder, which can include more than your current SRA accessions.

In [4]:
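# Remove downloads from previous runs; ignore_errors avoids a failure when the folder does not exist yet
shutil.rmtree(paths.SRA_DIR, ignore_errors=True)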

In [5]:
# Download sra data files
! prefetch --option-file $paths.SRA_ACC


2020-12-16T20:11:58 prefetch.2.8.2: 1) Downloading 'SRR12922100'...
2020-12-16T20:11:58 prefetch.2.8.2:  Downloading via https...
2020-12-16T20:13:38 prefetch.2.8.2: 1) 'SRR12922100' was downloaded successfully
2020-12-16T20:13:38 prefetch.2.8.2: 'SRR12922100' has 0 unresolved dependencies

2020-12-16T20:13:38 prefetch.2.8.2: 2) Downloading 'SRR13196071'...
2020-12-16T20:13:38 prefetch.2.8.2:  Downloading via https...
2020-12-16T20:13:55 prefetch.2.8.2: 2) 'SRR13196071' was downloaded successfully
2020-12-16T20:13:55 prefetch.2.8.2: 'SRR13196071' has 0 unresolved dependencies

2020-12-16T20:13:56 prefetch.2.8.2: 3) Downloading 'SRR13196068'...
2020-12-16T20:13:56 prefetch.2.8.2:  Downloading via https...
2020-12-16T20:13:57 prefetch.2.8.2: 3) 'SRR13196068' was downloaded successfully
2020-12-16T20:13:57 prefetch.2.8.2: 'SRR13196068' has 0 unresolved dependencies

2020-12-16T20:13:58 prefetch.2.8.2: 4) Downloading 'SRR11809598'...
2020-12-16T20:13:58 prefetch.2.8.2:  Downloading via ht

### Get FASTQ files associated with SRA downloads

The FASTQ files store the RNA-seq results: the sequence of each read and a quality score for each base call.

This blog post explains how to read FASTQ files: https://thesequencingcenter.com/knowledge-base/fastq-files/

The FASTQ files give the sequence of each read but not where it came from. Our goal is to map these reads to a reference genome so that we can count the reads at each location and use those counts to determine the level of expression.
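
To make the record layout concrete, here is a minimal sketch of reading a FASTQ file by hand. It assumes the common four-lines-per-record layout and Phred+33 quality encoding. The two reads in `TOY_FASTQ` are invented for illustration, and `parse_fastq` is a hypothetical helper, not part of this project.

```python
from io import StringIO

# Toy two-read FASTQ; sequences and qualities are invented for illustration.
TOY_FASTQ = """@read1
GATTACA
+
IIIIIII
@read2
ACGT
+
!!5I
"""


def parse_fastq(handle):
    """Yield (read_id, sequence, phred_scores) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file
        sequence = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().strip()
        # Assumption: Phred+33, so score = ASCII code of the character minus 33
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


records = list(parse_fastq(StringIO(TOY_FASTQ)))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

Each record carries one quality score per base, so the score list always has the same length as the sequence.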

In [6]:
if not os.path.exists(paths.FASTQ_DIR):
    os.makedirs(paths.FASTQ_DIR)

mkdir: cannot create directory ‘/home/alexandra/ncbi/public/fastq_phage’: File exists


In [7]:
!fastq-dump $paths.SRA_DIR/* --split-files --outdir $paths.FASTQ_DIR/

Read 2101445 spots for /home/alexandra/ncbi/public/sra/SRR11809598.sra
Written 2101445 spots for /home/alexandra/ncbi/public/sra/SRR11809598.sra
Read 3686954 spots for /home/alexandra/ncbi/public/sra/SRR12922100.sra
Written 3686954 spots for /home/alexandra/ncbi/public/sra/SRR12922100.sra
Read 1089559 spots for /home/alexandra/ncbi/public/sra/SRR13196068.sra
Written 1089559 spots for /home/alexandra/ncbi/public/sra/SRR13196068.sra
Read 629579 spots for /home/alexandra/ncbi/public/sra/SRR13196071.sra
Written 629579 spots for /home/alexandra/ncbi/public/sra/SRR13196071.sra
Read 7507537 spots total
Written 7507537 spots total


In [8]:
# Copied from https://github.com/hoganlab-dartmouth/sraProcessingPipeline/blob/5974e040c85724a8d385e53153b7707ae7c9c255/DiscoveryScripts/quantifier.py#L83

#!fastq-dump $paths_phage.SRA_DIR/* --skip-technical --readids --split-3 --clip --outdir $paths_phage.FASTQ_DIR/
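
The quantification cells further down expect each sample to have two files named `<accession>_1.fastq` and `<accession>_2.fastq`. A quick way to confirm this before running salmon could look like the sketch below. `group_mates` is a hypothetical helper, and the third file name is deliberately left without its mate for illustration; it does not reflect the actual download.

```python
from collections import defaultdict
from pathlib import Path


def group_mates(fastq_names):
    """Map each accession to the set of mate numbers ("1", "2") found for it."""
    mates = defaultdict(set)
    for name in fastq_names:
        stem = Path(name).stem  # e.g. "SRR11809598_1"
        accession, _, mate = stem.rpartition("_")
        if accession and mate in {"1", "2"}:
            mates[accession].add(mate)
    return dict(mates)


# Illustrative file names; the last sample is intentionally missing mate 2.
names = ["SRR11809598_1.fastq", "SRR11809598_2.fastq", "SRR13196071_1.fastq"]
grouped = group_mates(names)
paired = sorted(acc for acc, found in grouped.items() if found == {"1", "2"})
unpaired = sorted(acc for acc, found in grouped.items() if found != {"1", "2"})
print("paired:", paired)
print("missing a mate:", unpaired)
```

In the notebook itself, the names could come from `os.listdir(paths.FASTQ_DIR)`.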

### Quantify gene expression
Now that we have our index built and all of our data downloaded, we’re ready to quantify our samples.

**Input:**
* Index of reference transcriptome
* FASTQ of experimental samples

**Output:**

After the salmon commands finish running, you should have a quantification directory (`paths.PAO1_QUANT` or `paths.PHAGE_QUANT` in this notebook) with a sub-directory for each sample. These sub-directories contain the quantification results of salmon, as well as other information salmon records about the sample and the run.

The main output file is `quant.sf`. For example, the quantification file for sample SRR11809598 is `SRR11809598_quant/quant.sf` inside the quantification directory. It is a TSV file listing the name (`Name`) of each transcript, its length (`Length`), effective length (`EffectiveLength`), and its abundance in terms of Transcripts Per Million (`TPM`) and estimated number of reads (`NumReads`) originating from this transcript.

**For each sample we have read counts per gene (where the genes are based on the reference gene file provided above).** 
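
To show how the `quant.sf` columns relate to each other, the sketch below recomputes TPM from `NumReads` and `EffectiveLength` on a toy table. The gene names and numbers are invented, and this only illustrates the standard TPM definition; salmon's own values come out of its inference procedure.

```python
import pandas as pd

# Toy table with quant.sf-style columns; all values are invented.
toy = pd.DataFrame(
    {
        "EffectiveLength": [800.0, 1800.0, 300.0],
        "NumReads": [400.0, 900.0, 0.0],
    },
    index=pd.Index(["geneA", "geneB", "geneC"], name="Name"),
)

# Reads per base of effective length, then scaled so the sample sums to one million
rate = toy["NumReads"] / toy["EffectiveLength"]
toy["TPM"] = rate / rate.sum() * 1e6

print(toy)
```

Because TPM is rescaled within each sample, it describes relative abundance among transcripts in that sample. `NumReads` is the column to look at when asking how many reads were assigned to a reference at all.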

#### Get quants using PAO1 reference

In [None]:
if not os.path.exists(paths.PAO1_QUANT):
    os.makedirs(paths.PAO1_QUANT)

In [13]:
%%bash -s $paths.PAO1_QUANT $paths.FASTQ_DIR $paths.PAO1_INDEX

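# loop over the first mate only so that each paired-end sample is processed once
for FILE_PATH in $2/*_1.fastq;
do

# get file name
sample_name=`basename ${FILE_PATH}`

# remove the mate suffix and extension (e.g. "_1.fastq") from the file name
sample_name="${sample_name%_*}"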

# get base path
base_name=${FILE_PATH%/*}

echo "Processing sample ${sample_name}"

salmon quant -i $3 -l A \
            -1 ${base_name}/${sample_name}_1.fastq \
            -2 ${base_name}/${sample_name}_2.fastq \
            -p 8 --validateMappings -o $1/${sample_name}_quant
done

Processing sample SRR11809598
Processing sample SRR11809598
Processing sample SRR12922100
Processing sample SRR12922100
Processing sample SRR13196068
Processing sample SRR13196068
Processing sample SRR13196071
Processing sample SRR13196071


Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
### salmon (selective-alignment-based) v1.3.0
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { /home/alexandra/Documents/Data/Core_accessory/pao1_index }
### [ libType ] => { A }
### [ mates1 ] => { /home/alexandra/ncbi/public/fastq_phage/SRR11809598_1.fastq }
### [ mates2 ] => { /home/alexandra/ncbi/public/fastq_phage/SRR11809598_2.fastq }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { /home/alexandra/ncbi/public/quants_pao1/SRR11809598_quant }
Logs will be written to /home/alexandra/ncbi/pub

CalledProcessError: Command 'b'\nfor FILE_PATH in $2/*;\ndo\n\n# get file name\nsample_name=`basename ${FILE_PATH}`\n\n# remove extension from file name\nsample_name="${sample_name%_*}"\n\n# get base path\nbase_name=${FILE_PATH%/*}\n\necho "Processing sample ${sample_name}"\n\nsalmon quant -i $3 -l A \\\n            -1 ${base_name}/${sample_name}_1.fastq \\\n            -2 ${base_name}/${sample_name}_2.fastq \\\n            -p 8 --validateMappings -o $1/${sample_name}_quant\ndone\n'' returned non-zero exit status 1.

#### Get quants using phage reference

In [None]:
if not os.path.exists(paths.PHAGE_QUANT):
    os.makedirs(paths.PHAGE_QUANT)

In [12]:
%%bash -s $paths.PHAGE_QUANT $paths.FASTQ_DIR $paths.PHAGE_INDEX

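# loop over the first mate only so that each paired-end sample is processed once
for FILE_PATH in $2/*_1.fastq;
do

# get file name
sample_name=`basename ${FILE_PATH}`

# remove the mate suffix and extension (e.g. "_1.fastq") from the file name
sample_name="${sample_name%_*}"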

# get base path
base_name=${FILE_PATH%/*}

echo "Processing sample ${sample_name}"

salmon quant -i $3 -l A \
            -1 ${base_name}/${sample_name}_1.fastq \
            -2 ${base_name}/${sample_name}_2.fastq \
            -p 8 --validateMappings -o $1/${sample_name}_quant
done

Processing sample SRR11809598
Processing sample SRR11809598
Processing sample SRR12922100
Processing sample SRR12922100
Processing sample SRR13196068
Processing sample SRR13196068
Processing sample SRR13196071
Processing sample SRR13196071


mkdir: cannot create directory ‘/home/alexandra/ncbi/public/quants_phage’: File exists
Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###
### salmon (selective-alignment-based) v1.3.0
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { /home/alexandra/Documents/Data/Core_accessory/phage_index }
### [ libType ] => { A }
### [ mates1 ] => { /home/alexandra/ncbi/public/fastq_phage/SRR11809598_1.fastq }
### [ mates2 ] => { /home/alexandra/ncbi/public/fastq_phage/SRR11809598_2.fastq }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { /home/alexandra/ncbi

CalledProcessError: Command 'b'mkdir $1\n\nfor FILE_PATH in $2/*;\ndo\n\n# get file name\nsample_name=`basename ${FILE_PATH}`\n\n# remove extension from file name\nsample_name="${sample_name%_*}"\n\n# get base path\nbase_name=${FILE_PATH%/*}\n\necho "Processing sample ${sample_name}"\n\nsalmon quant -i $3 -l A \\\n            -1 ${base_name}/${sample_name}_1.fastq \\\n            -2 ${base_name}/${sample_name}_2.fastq \\\n            -p 8 --validateMappings -o $1/${sample_name}_quant\ndone\n'' returned non-zero exit status 1.

### Consolidate sample quantification to gene expression dataframe

In [14]:
# Read through all sample subdirectories in quant/
# Within each sample subdirectory, get quant.sf file
data_dir = paths.PAO1_QUANT

expression_pao1_df = pd.DataFrame(
    pd.read_csv(file, sep="\t", index_col=0)["TPM"].
    rename(file.parent.name.split("_")[0]) 
    for file in data_dir.rglob("*/quant.sf"))    

expression_pao1_df.head()

Name,PGD134012,PGD134018,PGD134020,PGD134022,PGD134024,PGD134014,PGD134016,PGD134026,PGD134030,PGD134032,...,PGD133904,PGD133906,PGD133902,PGD133898,PGD133900,PGD133894,PGD133896,PGD133892,PGD133884,PGD133886
SRR7886564,231.716766,190.742049,251.434363,244.575451,232.66742,338.778285,154.619466,135.309196,197.297614,208.656767,...,384.006646,160.423009,217.461069,156.88598,207.875748,401.447266,243.877369,242.992502,246.314856,2505.558825
SRR7886563,221.861101,206.2282,262.510896,236.227489,218.772353,277.169311,172.770017,131.65515,203.906846,195.419829,...,422.431994,186.070549,221.798564,169.894611,206.04732,364.537806,220.715998,222.137392,290.767265,1822.432926
SRR13196071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR7886556,239.03554,214.717657,274.756361,234.900934,211.425667,331.008247,182.534175,151.789441,177.38904,214.45934,...,313.613041,203.419543,219.460975,122.112693,216.61862,356.472812,204.935818,236.919636,344.232315,840.672326
SRR7886554,207.540231,225.539932,225.249527,219.545459,187.796107,288.503175,170.652752,152.999445,178.938258,189.161895,...,310.622609,162.000331,226.673993,151.294458,188.400366,344.227676,230.639268,215.712007,231.559415,2818.206826


In [15]:
# Read through all sample subdirectories in quant/
# Within each sample subdirectory, get quant.sf file
data_dir = paths.PHAGE_QUANT

expression_phage_df = pd.DataFrame(
    pd.read_csv(file, sep="\t", index_col=0)["TPM"].
    rename(file.parent.name.split("_")[0]) 
    for file in data_dir.rglob("*/quant.sf"))    

expression_phage_df.head()

Name,NC_028999.1,MT133560.1,MK599315.1,MH725810.1,MF974178.1,NC_016765.1,NC_031063.1,NC_027375.1,NC_011810.1,MT108726.1,...,DI373497.1,DI373496.1,DI373495.1,DI373494.1,DI373493.1,DI373492.1,DI373491.1,DI373490.1,DI373489.1,DI373488.1
SRR13196071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR13196068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR12922100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SRR11809598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
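
The consolidation step can be tried without any salmon output using the self-contained sketch below. It writes two toy `quant.sf` files into a temporary directory laid out as `<sample>_quant/quant.sf`, then builds the same samples-by-transcripts TPM table. `consolidate_tpm`, the sample names, and the values are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def consolidate_tpm(quant_dir):
    """Build a samples x transcripts TPM table from <quant_dir>/<sample>_quant/quant.sf."""
    columns = {}
    for file in sorted(Path(quant_dir).glob("*/quant.sf")):
        sample = file.parent.name.split("_")[0]  # "sampleA_quant" -> "sampleA"
        columns[sample] = pd.read_csv(file, sep="\t", index_col=0)["TPM"]
    # One column per sample, transposed so that samples become rows
    return pd.DataFrame(columns).T


with tempfile.TemporaryDirectory() as tmp:
    # Invented TPM values for two toy samples and two toy genes
    for sample, tpms in {"sampleA": (10.0, 0.0), "sampleB": (2.5, 7.5)}.items():
        out_dir = Path(tmp) / f"{sample}_quant"
        out_dir.mkdir()
        (out_dir / "quant.sf").write_text(
            "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
            f"gene1\t100\t50.0\t{tpms[0]}\t1\n"
            f"gene2\t100\t50.0\t{tpms[1]}\t1\n"
        )
    toy_df = consolidate_tpm(tmp)

print(toy_df)
```

Building the table from a dictionary keyed by sample name keeps the row labels explicit, and sorting the file paths makes the row order reproducible across runs.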


In [19]:
# Save gene expression data
expression_pao1_df.to_csv(paths.PAO1_GE, sep='\t')
expression_phage_df.to_csv(paths.PHAGE_GE, sep='\t')
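
One possible way to check the goal stated at the top of this notebook is to summarize each sample by how many genes have non-zero TPM. The sketch below does this on a toy samples-by-genes table; the sample names and values are invented, and in practice `toy_expression` would be replaced by `expression_phage_df`.

```python
import pandas as pd

# Toy samples x genes TPM table; sample names and values are invented.
toy_expression = pd.DataFrame(
    {"gene1": [120.0, 0.0], "gene2": [35.5, 0.0], "gene3": [0.0, 0.0]},
    index=["phage_sample", "other_sample"],
)

summary = pd.DataFrame(
    {
        # Number of genes with any expression in each sample
        "n_detected": (toy_expression > 0).sum(axis=1),
        # Median TPM across genes in each sample
        "median_tpm": toy_expression.median(axis=1),
    }
)

print(summary)
```

A sample whose row is all zeros has no detected genes, which suggests that no reads were assigned to that reference for that sample.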