## 2. Trimming fastqs

fastp is a recent FASTQ preprocessor reported to be "2~5 times faster than other FASTQ preprocessing tools like Trimmomatic or Cutadapt". It automatically detects and trims adapters and filters reads by quality. See the [publication](https://academic.oup.com/bioinformatics/article/34/17/i884/5093234) and the [documentation](https://github.com/OpenGene/fastp) for more info.

In [1]:
##conda install -c bioconda fastp
import pandas as pd
import os
import subprocess

In [2]:
df = pd.read_csv("/moto/eaton/projects/macaques/metadata.csv")
df[["Species", "Group", "SRR", "BioSample", "Sample", "Study", "PRJ"]]

Unnamed: 0,Species,Group,SRR,BioSample,Sample,Study,PRJ
0,Macaca mulatta northern,mulatta,SRR4454026,SAMN05883679,SRS1762015,SRP092140,PRJNA345528
1,Macaca mulatta southern low altitude,mulatta,SRR4454020,SAMN05883709,SRS1762009,SRP092140,PRJNA345529
2,Macaca mulatta southern high altitude,mulatta,SRR4453966,SAMN05883736,SRS1761955,SRP092140,PRJNA345530
3,Macaca mulatta Indian,mulatta,SRR5628058,SAMN07168901,SRS2238957,SRP049547,PRJNA251548
4,Macaca fascicularis northern,fascicularis,fasno,SAMN00116341,SRS117874,SRP045755,PRJNA51411
5,Macaca fascicularis southern,fascicularis,,SAMD00006158,DRS000787,DRP000438,PRJDB2038
6,Macaca fuscata,mulatta,DRR002233,SAMD00011919,DRS001583,DRP000620,PRJDB2459
7,Macaca thibethana,sinica,SRR1024051,SAMN02390221,SRS498543,SRP032525,PRJNA226187
8,Macaca assamensis,sinica,SRR2981114,SAMN04316321,SRS1196892,SRP067118,PRJNA305009
9,Macaca arctoides,fascicularis,SRR2981139,SAMN04316319,SRS1196879,SRP067118,PRJNA305009


We want to determine which Phred quality encoding each of our raw data sources uses (Phred+33 or Phred+64). A Perl script that does this, called fastq_detect.pl, is available at the bottom of this page: https://wiki.bits.vib.be/index.php/Identify_the_Phred_scale_of_quality_scores_used_in_fastQ#cite_note-2
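For readers without Perl at hand, the sampling step can be sketched in Python. This is a minimal sketch and not a port of fastq_detect.pl: the function name `quality_range` is hypothetical, it assumes standard 4-line FASTQ records (no wrapped sequence lines), and it only reports the lowest and highest ASCII codes seen in the first 100 records.

```python
import gzip
from itertools import islice


def quality_range(fastq_path, n_records=100):
    """Return (min, max) ASCII codes over the quality lines of the first n_records reads.

    Assumes 4-line FASTQ records; works on plain or gzipped files.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lo, hi = 255, 0
    with opener(fastq_path, "rt") as handle:
        # Each record is 4 lines; the quality string is the 4th (index 3).
        for line_no, line in enumerate(islice(handle, 4 * n_records)):
            if line_no % 4 == 3:
                codes = [ord(ch) for ch in line.rstrip("\n")]
                if codes:
                    lo = min(lo, min(codes))
                    hi = max(hi, max(codes))
    return lo, hi


# Example call (path taken from this notebook):
# quality_range("/moto/eaton/projects/macaques/fastqdump/SRR4454026.sra_1.fastq.gz")
```

A range whose minimum falls below 59 cannot come from a +64 encoding, which is the check the script's table expresses.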

In [4]:
%%bash
for i in SRR4454026 SRR4454020 SRR4453966 SRR5628058 fasno DRR002233 SRR1024051 SRR2981114 SRR2981139 SRR5947292 SRR5947293 SRR5947294 sylvanus silenus; do
    perl /moto/eaton/projects/macaques/scripts/fastq.pl /moto/eaton/projects/macaques/fastqdump/$i.sra_1.fastq.gz
    done


## Analysing 100 records from /moto/eaton/projects/macaques/fastqdump/SRR4454026.sra_1.fastq.gz ... 
# sampled raw quality values are in the range of [35; 74]
# format(s) marked below with 'x' agree with this range
  Illumina 1.3+ :  .  [Phred+64,  Q[64; 104], (0, 40)] 
  Illumina 1.5+ :  .  [Phred+64,  Q[66; 104], (3, 40), with 0=N/A, 1=N/A, 2=Read Segment Quality Control Indicator] 
  Illumina 1.8+ :  x  [Phred+33,  Q[33; 74],  (0, 41)] 
  Sanger        :  .  [Phred+33,  Q[33; 73],  (0, 40)] 
  Solexa        :  .  [Solexa+64, Q[59; 104], (-5, 40)] 

## Analysing 100 records from /moto/eaton/projects/macaques/fastqdump/SRR4454020.sra_1.fastq.gz ... 
# sampled raw quality values are in the range of [35; 70]
# format(s) marked below with 'x' agree with this range
  Illumina 1.3+ :  .  [Phred+64,  Q[64; 104], (0, 40)] 
  Illumina 1.5+ :  .  [Phred+64,  Q[66; 104], (3, 40), with 0=N/A, 1=N/A, 2=Read Segment Quality Control Indicator] 
  Illumina 1.8+ :  x  [Phred+33,  Q[33; 74],  (0, 41)

In [None]:
-rw-r--r-- 1 nsl2119 motoeaton  24G Feb 20 11:41 SRR7639480.filtered_1.fastq.gz
-rw-r--r-- 1 nsl2119 motoeaton  28G Feb 20 11:41 SRR7639480.filtered_2.fastq.gz


They all appear to be Phred+33, so we don't have to specify the encoding in our trimming or mapping software. A couple of the value ranges are 35:75, which is odd, but these *should* still be Phred+33: a minimum of 35 is far below the lowest value possible under Phred+64 (64) or Solexa+64 (59).
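The reasoning can be checked with a little arithmetic. The sketch below copies the legal ASCII ranges from the script output above and decodes the two observed ranges under both offsets; the helper names `compatible_formats` and `decode_range` are hypothetical.

```python
# Lowest and highest legal ASCII codes per encoding, copied from the script output above.
FORMATS = {
    "Illumina 1.3+ (Phred+64)": (64, 104),
    "Illumina 1.5+ (Phred+64)": (66, 104),
    "Illumina 1.8+ (Phred+33)": (33, 74),
    "Sanger (Phred+33)": (33, 73),
    "Solexa (Solexa+64)": (59, 104),
}


def compatible_formats(lo, hi):
    """List the encodings whose legal range contains the observed [lo, hi]."""
    return [name for name, (fmin, fmax) in FORMATS.items() if fmin <= lo and hi <= fmax]


def decode_range(lo, hi, offset):
    """Convert observed ASCII codes to quality scores for a given offset."""
    return lo - offset, hi - offset


observed = {"SRR4454026": (35, 74), "SRR4454020": (35, 70)}
for sample, (lo, hi) in observed.items():
    print(sample, compatible_formats(lo, hi))
    print("  as Phred+33:", decode_range(lo, hi, 33))
    print("  as Phred+64:", decode_range(lo, hi, 64))
```

Under a +64 offset the minimum of 35 would decode to a score of -29, well below even Solexa's floor of -5, so only the Phred+33 reading is plausible.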

1) Filtering 6 of the species with fastp's default arguments (apart from the thread count, which is set to 12):

In [5]:
!mkdir /moto/eaton/projects/macaques/filteredfastq

In [9]:
!mkdir /moto/eaton/projects/macaques/filteredfastq/stats

In [18]:
test=['SRR5628058', 'fasno', 'SRR2981139', 'SRR1024051', 'silenus', 'sylvanus']

In [None]:
base = '/moto/eaton/projects/macaques'
for i in test:
    # Run fastp with default filters on each paired-end sample, using 12 worker threads
    cmd = [
        'fastp',
        '-i', f'{base}/fastqdump/{i}.sra_1.fastq.gz',
        '-I', f'{base}/fastqdump/{i}.sra_2.fastq.gz',
        '-o', f'{base}/filteredfastq/{i}.filtered_1.fastq.gz',
        '-O', f'{base}/filteredfastq/{i}.filtered_2.fastq.gz',
        '-w', '12',
        '--json', f'{base}/filteredfastq/stats/{i}.json',
        '--html', f'{base}/filteredfastq/stats/{i}.html',
    ]
    # check=True stops the loop if fastp exits with an error
    subprocess.run(cmd, check=True)

In [None]:
!fastp -i /moto/eaton/projects/macaques/fastqdump/SRR8285768.sra_1.fastq.gz \
    -I /moto/eaton/projects/macaques/fastqdump/SRR8285768.sra_2.fastq.gz \
    -o /moto/eaton/projects/macaques/filteredfastq/SRR8285768.filtered_1.fastq.gz \
    -O /moto/eaton/projects/macaques/filteredfastq/SRR8285768.filtered_2.fastq.gz \
    -w 12 \
    --json /moto/eaton/projects/macaques/filteredfastq/stats/SRR8285768.json \
    --html /moto/eaton/projects/macaques/filteredfastq/stats/SRR8285768.html

In [1]:
## Display the fastp filtering/trimming summary report for one sample (fasno)
from IPython.display import HTML
HTML(filename='/moto/eaton/projects/macaques/filteredfastq/stats/fasno.html')

General
fastp version:,0.19.6 (https://github.com/OpenGene/fastp)
sequencing:,paired end (75 cycles + 75 cycles)
mean length before filtering:,"54bp, 54bp"
mean length after filtering:,"53bp, 53bp"
duplication rate:,20.389865%
Insert size peak:,49

Before filtering
total reads:,3.692988 G
total bases:,202.725223 G
Q20 bases:,132.101543 G (65.162855%)
Q30 bases:,47.071542 G (23.219381%)
GC content:,42.541203%

After filtering
total reads:,3.403688 G
total bases:,183.094311 G
Q20 bases:,123.764007 G (67.595768%)
Q30 bases:,44.984579 G (24.569075%)
GC content:,42.088998%

Filtering result
reads passed filters:,3.403688 G (92.166251%)
reads with low quality:,285.958234 M (7.743276%)
reads with too many N:,3.341150 M (0.090473%)
reads too short:,0 (0.000000%)

Adapters (read 1)
Sequence,Occurrences
A,272523
other adapter sequences,19564969

Adapters (read 2)
Sequence,Occurrences
A,277101
other adapter sequences,19560391

[k-mer tables from the fastp HTML report omitted: only the 5-mer labels were exported, not the counts]
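The numbers in the HTML report are also written to the `--json` file, which is easier to compare across samples. Below is a minimal sketch of pulling out the headline figures. The function names `load_report` and `summarise_fastp` are hypothetical, and the key names (`summary`, `before_filtering`, `after_filtering`, `filtering_result`, and so on) are assumed from the layout of fastp's JSON output, so verify them against one of the reports produced here before relying on this.

```python
import json


def load_report(path):
    """Read a fastp JSON report into a dict."""
    with open(path) as handle:
        return json.load(handle)


def summarise_fastp(report):
    """Pull headline numbers out of a parsed fastp JSON report.

    Key names are assumed from fastp's JSON layout; check them against a real report.
    """
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    result = report["filtering_result"]
    total = before["total_reads"]
    return {
        "reads_before": total,
        "reads_after": after["total_reads"],
        "pct_passed": 100 * result["passed_filter_reads"] / total,
        "pct_low_quality": 100 * result["low_quality_reads"] / total,
        "pct_too_many_N": 100 * result["too_many_N_reads"] / total,
        "q30_rate_before": before["q30_rate"],
        "q30_rate_after": after["q30_rate"],
    }


# Example call (path taken from this notebook):
# summarise_fastp(load_report("/moto/eaton/projects/macaques/filteredfastq/stats/fasno.json"))
```

For fasno, the pass rate computed this way should agree with the report above: 3.403688 G of 3.692988 G reads is about 92.17%.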


In [6]:
##Trimming and prepping of single end csfasta files for BWA-mapping
!cutadapt --bwa -q 15 --minimum-length 15 -a CTGCCCCGGGTTCCTCATTCT \
    -a CTGCCCCGGGTTCCTCATTCTCTCAGCAGCATG -g CCACTACGCCTCCGCTTTCCTCTCTATG \
    -g CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT \
    /moto/eaton/projects/macaques/fastqdump/fasso/DRR001227_F3.csfasta \
    /moto/eaton/projects/macaques/fastqdump/fasso/DRR001227_F3_QV.qual \
    >/moto/eaton/projects/macaques/filteredfastq/DRR001227_F.fastq

This is cutadapt 1.18 with Python 3.7.0
Command line parameters: --bwa -q 15 --minimum-length 15 -a CTGCCCCGGGTTCCTCATTCT -a CTGCCCCGGGTTCCTCATTCTCTCAGCAGCATG -g CCACTACGCCTCCGCTTTCCTCTCTATG -g CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT /moto/eaton/projects/macaques/fastqdump/fasso/DRR001227_F3.csfasta /moto/eaton/projects/macaques/fastqdump/fasso/DRR001227_F3_QV.qual -o /moto/eaton/projects/macaques/filteredfastq/test.fastq
Processing reads on 1 core in single-end mode ...
^C
Interrupted


In [None]:
%%bash
for i in DRR001231 DRR001232 DRR001233; do
    echo $i
done

In [8]:
%%bash
## cutadapt doesn't work with paired-end ABI SOLiD data, so the reverse reads are trimmed
## separately, using the reverse complements of the adapter sequences used for the forward reads
for i in DRR001231 DRR001232 DRR001233; do
    cutadapt --bwa -q 15 --minimum-length 15 -a AGAATGAGGAACCCGGGGCAG \
        -a CATGCTGCTGAGAGAATGAGGAACCCGGGGCAG -g CATAGAGAGGAAAGCGGAGGCGTAGTGG \
        -g ATCACCGACTGCCCATAGAGAGGAAAGCGGAGGCGTAGTGG \
        /moto/eaton/projects/macaques/fastqdump/fasso/${i}_R3.csfasta \
        /moto/eaton/projects/macaques/fastqdump/fasso/${i}_R3_QV.qual \
        >/moto/eaton/projects/macaques/filteredfastq/$i.fastq
done
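The adapter sequences passed for the reverse (R3) reads are the reverse complements of the ones used for the forward (F3) reads. A short sketch to derive and check them, where `reverse_complement` is a hypothetical helper rather than anything provided by cutadapt:

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


# Adapters given to cutadapt for the forward (F3) reads in this notebook.
forward_adapters = [
    "CTGCCCCGGGTTCCTCATTCT",
    "CTGCCCCGGGTTCCTCATTCTCTCAGCAGCATG",
    "CCACTACGCCTCCGCTTTCCTCTCTATG",
    "CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT",
]

for adapter in forward_adapters:
    print(adapter, "->", reverse_complement(adapter))
```

The four printed sequences match the `-a` and `-g` arguments used for the R3 reads.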