# Quality Control of RNA-Seq Data from Olympia Oysters
This notebook follows [this tutorial](https://informatics.fas.harvard.edu/best-practices-for-de-novo-transcriptome-assembly-with-trinity.html) on best practices for de novo transcriptome assembly with Trinity. The data were received on Dec. 4 and consist of 8 samples sequenced on one paired-end (PE) 100 bp Illumina HiSeq 4000 lane at the UChicago Functional Genomics Core. Each sample contains 4-8 pooled individuals from across 4 populations and represents a specific tissue under a specific treatment.

In [1]:
%pwd

u'/home/t.cri.ksilliman/OA_RNA'

Rename files so the two lanes are kept separate

In [2]:
%%bash
for i in /scratch/t.cri.ksilliman/OA_RNA/Oyster/Oyster_Raw_RNASeq/0343/FastQ/*.gz;
    do mv $i ${i/001/0343};
done;

for i in /scratch/t.cri.ksilliman/OA_RNA/Oyster/Oyster_Raw_RNASeq/0348/FastQ/*.gz;
    do mv $i ${i/001/0348};
done;

mv: cannot stat `/scratch/t.cri.ksilliman/OA_RNA/Oyster/Oyster_Raw_RNASeq/0343/FastQ/*.gz': No such file or directory
mv: cannot stat `/scratch/t.cri.ksilliman/OA_RNA/Oyster/Oyster_Raw_RNASeq/0348/FastQ/*.gz': No such file or directory
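
The `cannot stat` messages above are what bash prints when a glob matches nothing: the literal pattern is handed to `mv`. A minimal Python sketch of the same rename that refuses to run on an empty match and only touches the file name, not the directory part of the path. It assumes the raw files end in a `001` chunk counter (for example `..._R1_001.fastq.gz`); that file-name pattern and the function name `plan_renames` are illustrative assumptions, not taken from the run above. Unlike `${i/001/0343}`, which replaces the first `001` anywhere in the path, this replaces the last one in the base name.

```python
import os


def plan_renames(paths, run_id, counter="001"):
    """Return (old, new) path pairs with the last `counter` in each
    file name replaced by `run_id`. Nothing is renamed here."""
    if not paths:
        # An empty list usually means the glob pattern matched nothing.
        raise ValueError("no input files matched; check the glob pattern")
    plan = []
    for old in paths:
        folder, name = os.path.split(old)
        head, sep, tail = name.rpartition(counter)
        if not sep:
            raise ValueError("%r does not contain %r" % (name, counter))
        plan.append((old, os.path.join(folder, head + run_id + tail)))
    return plan
```

The plan could be built from `glob.glob(...)`, inspected, and then applied with `os.rename(old, new)` for each pair.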


In [None]:
%%sh
mv /scratch/t.cri.ksilliman/OA_RNA/Oyster/Oyster_Raw_RNASeq/0343/FastQ/* /scratch/t.cri.ksilliman/OA_RNA/Oyster/Oyster_Raw_RNASeq/
mv /scratch/t.cri.ksilliman/OA_RNA/Oyster/Oyster_Raw_RNASeq/0348/FastQ/* /scratch/t.cri.ksilliman/OA_RNA/Oyster/Oyster_Raw_RNASeq/


### FastQC (using [MultiQC](http://multiqc.info/))

### Discard read pairs with an unfixable read ([Harvard FAS Script](https://github.com/harvardinformatics/TranscriptomeAssemblyTools/blob/master/FilterUncorrectabledPEfastq.py))

In [10]:
%%sh

for s in /scratch/t.cri.ksilliman/OA_RNA/Oyster/QC_Output/*R1*cor*;
    do python FilterUncorrectabledPEfastq.py --left_reads $s --right_reads ${s/R1/R2} --out_prefix /scratch/t.cri.ksilliman/OA_RNA/Oyster/QC_Output/fixed;
done;

100000 reads processed
200000 reads processed
300000 reads processed
400000 reads processed
500000 reads processed
600000 reads processed
700000 reads processed
800000 reads processed
900000 reads processed
1000000 reads processed
1100000 reads processed
1200000 reads processed
1300000 reads processed
1400000 reads processed
1500000 reads processed
1600000 reads processed
1700000 reads processed
1800000 reads processed
1900000 reads processed
2000000 reads processed
2100000 reads processed
2200000 reads processed
2300000 reads processed
2400000 reads processed
2500000 reads processed
2600000 reads processed
2700000 reads processed
2800000 reads processed
2900000 reads processed
3000000 reads processed
3100000 reads processed
3200000 reads processed
3300000 reads processed
3400000 reads processed
3500000 reads processed
3600000 reads processed
3700000 reads processed
3800000 reads processed
3900000 reads processed
4000000 reads processed
4100000 reads processed
4200000 reads processed
4
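
To make the filtering step concrete, here is a rough sketch of the idea, not the linked script itself. It assumes, as in the linked tutorial, that the error-correction step (Rcorrector) marks reads it could not repair by adding `unfixable_error` to the FASTQ header; the function names and the `tag` parameter are hypothetical. A pair is dropped if either mate carries the tag, so the two output files stay in sync.

```python
def read_fastq(lines):
    """Group an iterable of FASTQ lines into (header, seq, plus, qual) tuples."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def filter_unfixable_pairs(left_records, right_records, tag="unfixable"):
    """Keep only pairs in which neither header contains `tag`.

    Returns (kept_pairs, number_of_dropped_pairs)."""
    kept = []
    dropped = 0
    for left, right in zip(left_records, right_records):
        if tag in left[0] or tag in right[0]:
            dropped += 1
        else:
            kept.append((left, right))
    return kept, dropped
```

Comparing the dropped count against the total per sample is a quick way to spot a library with an unusually high error rate.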

### Trim Galore!
Requires Cutadapt

In [18]:
%%sh

for s in /scratch/t.cri.ksilliman/OA_RNA/Oyster/QC_Output/fixed*R1*;
    do /home/t.cri.ksilliman/TrimGalore-0.4.5/trim_galore --paired --retain_unpaired --phred33 --output_dir /scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads --length 36 -q 5 --stringency 1 -e 0.1 $s ${s/R1/R2};
done;

1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15
1.15


Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> /scratch/t.cri.ksilliman/OA_RNA/Oyster/QC_Output/fixed_CP-15_S7_L004_R1_0343.cor.fq <<)

Found perfect matches for the following adapter sequences:
Adapter type	Count	Sequence	Sequences analysed	Percentage
Illumina	540	AGATCGGAAGAGC	1000000	0.05
Nextera	7	CTGTCTCTTATA	1000000	0.00
smallRNA	2	TGGAATTCTCGG	1000000	0.00
Using Illumina adapter for trimming (count: 540). Second best hit was Nextera (count: 7)

Writing report to '/scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/fixed_CP-15_S7_L004_R1_0343.cor.fq_trimming_report.txt'

SUMMARISING RUN PARAMETERS
Input filename: /scratch/t.cri.ksilliman/OA_RNA/Oyster/QC_Output/fixed_CP-15_S7_L004_R1_0343.cor.fq
Trimming mode: paired-end
Trim Galore version: 0.4.4_dev
Cutadapt version: 1.15
Quality Phr
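
For intuition about what `-q 5` and `--length 36` do, the sketch below shows 3' quality trimming with a running-sum rule of the kind described in the Cutadapt documentation, followed by a pair-level length check. It is an illustration under assumptions, not Trim Galore's or Cutadapt's actual code: qualities are taken to be Phred+33 (matching `--phred33`), adapter trimming is ignored, the function names are made up, and treating a read of exactly 36 bp as passing is an assumption about the boundary.

```python
def trim_3prime(seq, qual, cutoff=5, base=33):
    """Trim low-quality bases from the 3' end.

    Walk from the 3' end, summing (cutoff - quality). Cut at the position
    where that running sum is largest; stop once it turns negative."""
    running = 0
    best = 0
    keep = len(qual)
    for i in range(len(qual) - 1, -1, -1):
        running += cutoff - (ord(qual[i]) - base)
        if running < 0:
            break
        if running > best:
            best = running
            keep = i
    return seq[:keep], qual[:keep]


def pair_survives(read1, read2, min_length=36):
    """A pair stays in the paired output only if both mates pass the length cutoff."""
    return len(read1) >= min_length and len(read2) >= min_length
```

With a cutoff as low as 5, only very poor 3' tails are removed, which matches the gentle trimming the command above asks for.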

### Map trimmed reads to a blacklist to remove unwanted (rRNA) reads
Using an rRNA database from [Silva](https://www.arb-silva.de/) and Bowtie2.

In [24]:
%%sh
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/SILVA_132_SSURef_Nr99_tax_silva_trunc.fasta.gz \
    -P /scratch/t.cri.ksilliman/OA_RNA/

gunzip /scratch/t.cri.ksilliman/OA_RNA/SILVA_132_SSURef_Nr99_tax_silva_trunc.fasta.gz

--2018-02-11 11:41:49--  https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/SILVA_132_SSURef_Nr99_tax_silva_trunc.fasta.gz
Resolving www.arb-silva.de... 134.102.40.6
Connecting to www.arb-silva.de|134.102.40.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 238828503 (228M) [application/gzip]
Saving to: `/scratch/t.cri.ksilliman/OA_RNA/SILVA_132_SSURef_Nr99_tax_silva_trunc.fasta.gz'

     0K .......... .......... .......... .......... ..........  0%  152K 25m34s
    50K .......... .......... .......... .......... ..........  0%  219K 21m40s
   100K .......... .......... .......... .......... ..........  0%  436K 17m25s
   150K .......... .......... .......... .......... ..........  0%  437K 15m17s
   200K .......... .......... .......... .......... ..........  0%  435K 14m0s
   250K .......... .......... .......... .......... ..........  0%  443K 13m8s
   300K .......... .......... .......... .......... ..........  0% 24.9M 11m16s
   350K .....

In [25]:
%%sh
module load intel/2017
module load bowtie2/2.3.2

bowtie2-build /scratch/t.cri.ksilliman/OA_RNA/SILVA_132_SSURef_Nr99_tax_silva_trunc.fasta /scratch/t.cri.ksilliman/OA_RNA/rRNA_ref

Settings:
  Output files: "/scratch/t.cri.ksilliman/OA_RNA/rRNA_ref.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /scratch/t.cri.ksilliman/OA_RNA/SILVA_132_SSURef_Nr99_tax_silva_trunc.fasta
Reading reference sizes
  Time reading reference sizes: 00:00:15
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:09
bmax according to bmaxDivN setting: 198441567
Using parameters --bmax 148831176 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing wit

Building a SMALL index


In [29]:
%%sh
# Change file names so they don't end in _1 or _2 and don't start with "fixed_"
for r in /scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/*R1*fq; do
    mv "$r" "$(echo "$r" | sed 's/fixed_CP/CP/g; s/_1\.fq/.fq/g')";
done

for r in /scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/*R2*fq; do
    mv "$r" "$(echo "$r" | sed 's/fixed_CP/CP/g; s/_2\.fq/.fq/g')";
done

In [35]:
%%sh
cd /scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/
# Write the list in one go so that rerunning the cell does not append duplicates
for file in *R1_0343*val.fq;
    do echo "${file/_L004_R1_0343.cor_val.fq/_}";
done > Samples.txt;
cat Samples.txt

CP-15_S7_
CP-16_S8_
CP-17_S9_
CP-18_S10_
CP-1_S1_
CP-2_S2_
CP-3_S3_
CP-4Spl_S11_
CP-4_S4_
CP-5_S5_
CP-6_S6_
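
The same `Samples.txt`, built from the 0343 files, is reused for the 0348 run further down, so it may be worth checking that both runs contain the same samples. A small sketch of that check, assuming the file-name layout shown above (`<prefix>L004_R1_<run>.cor_val.fq`); the function names are illustrative.

```python
def sample_prefixes(filenames, run_id, lane="L004"):
    """Strip the '<lane>_R1_<run_id>.cor_val.fq' suffix from matching R1 files."""
    suffix = "%s_R1_%s.cor_val.fq" % (lane, run_id)
    return sorted({name[:-len(suffix)] for name in filenames
                   if name.endswith(suffix)})


def missing_from_second_run(filenames, first_run, second_run):
    """Prefixes present in `first_run` but absent from `second_run`."""
    first = set(sample_prefixes(filenames, first_run))
    second = set(sample_prefixes(filenames, second_run))
    return sorted(first - second)
```

Calling `missing_from_second_run(os.listdir(trimmed_dir), "0343", "0348")` with an empty result would mean every listed sample has a matching 0348 file.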


In [None]:
%%sh
module load intel/2017
module load bowtie2/2.3.2

# -p avoids an error if the directory already exists
mkdir -p /scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/

A="/scratch/t.cri.ksilliman/OA_RNA/rRNA_ref"
while read -r file; do
    B="/scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/${file}L004_R1_0343.cor_val.fq"
    C="/scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/${file}L004_R2_0343.cor_val.fq"
    D="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0343_rRNASum.txt"
    E="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0343_rRNA.mapped"
    F="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0343_rRNA.unmapped"
    G="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0343_rRNA.SEmapped"
    H="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0343_rRNA.SEunmapped"
    bowtie2 --very-sensitive-local --phred33 -x "$A" -1 "$B" -2 "$C" --threads 10 --met-file "$D" --al-conc-gz "$E" --un-conc-gz "$F" --al-gz "$G" --un-gz "$H";
done < /scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/Samples.txt

In [None]:
%%sh
module load intel/2017
module load bowtie2/2.3.2

A="/scratch/t.cri.ksilliman/OA_RNA/rRNA_ref"
while read -r file; do
    B="/scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/${file}L004_R1_0348.cor_val.fq"
    C="/scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/${file}L004_R2_0348.cor_val.fq"
    D="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0348_rRNASum.txt"
    E="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0348_rRNA.mapped"
    F="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0348_rRNA.unmapped"
    G="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0348_rRNA.SEmapped"
    H="/scratch/t.cri.ksilliman/OA_RNA/Oyster/filtered_RNASeq/${file}0348_rRNA.SEunmapped"
    bowtie2 --very-sensitive-local --phred33 -x "$A" -1 "$B" -2 "$C" --threads 10 --met-file "$D" --al-conc-gz "$E" --un-conc-gz "$F" --al-gz "$G" --un-gz "$H";
done < /scratch/t.cri.ksilliman/OA_RNA/Oyster/trimmed_reads/Samples.txt
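
The two Bowtie2 cells above differ only in the run ID (0343 vs. 0348). As a sketch of how the command could be assembled once and reused, the helper below builds the argument list from a sample prefix and a run ID. Only flags that already appear in the cells above are used; the function name and its parameters are illustrative. Read pairs that do not align concordantly to the rRNA index, written via `--un-conc-gz`, are the ones that would continue to assembly.

```python
def bowtie2_rrna_command(prefix, run_id, index, trimmed_dir, out_dir,
                         threads=10, lane="L004"):
    """Build the bowtie2 argument list for one sample prefix and run ID."""

    def reads(mate):
        # e.g. <trimmed_dir>/CP-1_S1_L004_R1_0343.cor_val.fq
        return "%s/%s%s_R%d_%s.cor_val.fq" % (trimmed_dir, prefix, lane, mate, run_id)

    def out(kind):
        # e.g. <out_dir>/CP-1_S1_0343_rRNA.unmapped
        return "%s/%s%s_%s" % (out_dir, prefix, run_id, kind)

    return ["bowtie2", "--very-sensitive-local", "--phred33",
            "-x", index, "-1", reads(1), "-2", reads(2),
            "--threads", str(threads),
            "--met-file", out("rRNASum.txt"),
            "--al-conc-gz", out("rRNA.mapped"),
            "--un-conc-gz", out("rRNA.unmapped"),
            "--al-gz", out("rRNA.SEmapped"),
            "--un-gz", out("rRNA.SEunmapped")]
```

Each list can be passed to `subprocess.call` inside a loop over the prefixes in `Samples.txt` and the two run IDs.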

In [36]:
%pwd

u'/home/t.cri.ksilliman/OA_RNA'