# Whole Genome Shotgun Metagenomics: Taxonomic Binning Pipeline

We have two paired-end FASTQ datasets to process:
1. *Microbiome1_200k* contains $200,000$ paired-end reads from total DNA extracted from human saliva, so it represents the microbiome.


2. *Vir1_100k* contains $100,000$ paired-end reads from the same saliva sample after purification of viral particles, so it represents the virome.


In [19]:
import warnings
warnings.filterwarnings('ignore')

## Check files
Count the reads in each file with `grep` and inspect the first header line with `head`.

In [81]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<'EOT'
cd Documentos/Tema_4a
head -n 1 *fq
grep -c "^@M02255" *fq
EOT

==> Bact1_R1_200000.fq <==
@M02255:131:000000000-AJC6R:1:2102:25217:13392_1:N:0:ACAGTG

==> Bact1_R2_200000.fq <==
@M02255:131:000000000-AJC6R:1:2102:25217:13392_2:N:0:ACAGTG

==> Vir1_R1_100000.fq <==
@M02255:131:000000000-AJC6R:1:1105:23249:10170_1:N:0:AGTCAA

==> Vir1_R2_100000.fq <==
@M02255:131:000000000-AJC6R:1:1105:23249:10170_2:N:0:AGTCAA
Bact1_R1_200000.fq:200000
Bact1_R2_200000.fq:200000
Vir1_R1_100000.fq:100000
Vir1_R2_100000.fq:100000
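
Counting with `grep` relies on the instrument ID (`@M02255`) appearing only in header lines. As an independent cross-check, the following minimal sketch counts records by structure instead, assuming uncompressed FASTQ with exactly four lines per record. The helper name `count_fastq_reads` and the tiny demo file are hypothetical and only illustrate the idea.

```python
import os
import tempfile


def count_fastq_reads(path):
    """Count records in an uncompressed FASTQ file (assumes 4 lines per record)."""
    n_lines = 0
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines, e.g. at the end of the file
            # Lines 0, 4, 8, ... of the record stream must be headers.
            if n_lines % 4 == 0 and not line.startswith("@"):
                raise ValueError(f"{path}: expected a header starting with '@'")
            n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: truncated record ({n_lines} non-empty lines)")
    return n_lines // 4


# Hypothetical two-record file; the first quality line starts with '@',
# which is legal in FASTQ and would confuse a naive "lines starting with @" count.
demo_text = (
    "@read1\nACGT\n+\n@III\n"
    "@read2\nTTGA\n+\nIIII\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".fq", delete=False) as tmp:
    tmp.write(demo_text)
demo_count = count_fastq_reads(tmp.name)
os.remove(tmp.name)
print(demo_count)
```

Applied to the four files above, this should agree with the `grep` counts if every record is intact; a mismatch would point to truncated or malformed records.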


## Quality control (FastQC)

In [93]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_4a
fastqc Vir1_R1_100000.fq -o fastQC_output/
fastqc Vir1_R2_100000.fq -o fastQC_output/
fastqc Bact1_R1_200000.fq -o fastQC_output/
fastqc Bact1_R2_200000.fq -o fastQC_output/
EOT

Analysis complete for Vir1_R1_100000.fq


Started analysis of Vir1_R1_100000.fq
Approx 5% complete for Vir1_R1_100000.fq
Approx 10% complete for Vir1_R1_100000.fq
Approx 15% complete for Vir1_R1_100000.fq
Approx 20% complete for Vir1_R1_100000.fq
Approx 25% complete for Vir1_R1_100000.fq
Approx 30% complete for Vir1_R1_100000.fq
Approx 35% complete for Vir1_R1_100000.fq
Approx 40% complete for Vir1_R1_100000.fq
Approx 45% complete for Vir1_R1_100000.fq
Approx 50% complete for Vir1_R1_100000.fq
Approx 55% complete for Vir1_R1_100000.fq
Approx 60% complete for Vir1_R1_100000.fq
Approx 65% complete for Vir1_R1_100000.fq
Approx 70% complete for Vir1_R1_100000.fq
Approx 75% complete for Vir1_R1_100000.fq
Approx 80% complete for Vir1_R1_100000.fq
Approx 85% complete for Vir1_R1_100000.fq
Approx 90% complete for Vir1_R1_100000.fq
Approx 95% complete for Vir1_R1_100000.fq
Approx 100% complete for Vir1_R1_100000.fq


## Trimming and decontaminating
Reads are trimmed to remove poor-quality ends and discard short sequences (**Trimmomatic**), and reads aligning to the human and phiX174 genomes are removed (**bowtie2**). The latter is a contaminant because Illumina uses phiX174 as a spike-in control to monitor the quality of the sequencing run. Both steps are run through `kneaddata`.

We filter only the R1 files because forward reads usually have better quality than reverse reads.
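
The `kneaddata` commands in this section pass `SLIDINGWINDOW:4:20 MINLEN:200` to Trimmomatic. To make those two settings concrete, here is a simplified sketch of the idea: scan the read from the 5' end, cut where the mean quality of a 4-base window first drops below 20, then discard the read if fewer than 200 bases remain. This is an illustration only, assuming Phred+33 quality encoding; the function names are hypothetical, and Trimmomatic's own implementation handles the exact cut point and edge cases differently.

```python
def phred33(quality_string):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(char) - 33 for char in quality_string]


def sliding_window_trim(seq, qual, window=4, min_mean_q=20, min_len=200):
    """Simplified sliding-window trim; returns (seq, qual) or None if too short."""
    scores = phred33(qual)
    keep = len(seq)  # keep the whole read unless a bad window is found
    for start in range(len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_q:
            keep = start  # cut at the start of the first failing window
            break
    if keep < min_len:
        return None  # read is dropped, as MINLEN would do
    return seq[:keep], qual[:keep]


# Toy read: 8 high-quality bases ('I' = Q40) followed by 4 poor ones ('#' = Q2).
toy_seq = "A" * 12
toy_qual = "I" * 8 + "#" * 4

trimmed_short_ok = sliding_window_trim(toy_seq, toy_qual, min_len=5)
trimmed_default = sliding_window_trim(toy_seq, toy_qual)  # min_len=200
print(trimmed_short_ok)
print(trimmed_default)
```

With a relaxed `min_len` the toy read is cut back to its high-quality prefix, while the `MINLEN:200` setting used in this section discards it entirely, which is why a strict length threshold can remove a large share of reads.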

### Processing

In [92]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_4a
kneaddata -i Bact1_R1_200000.fq -o kneaddata_out \
-db /home/shared/bowtiedb/GRCh38_PhiX \
--trimmomatic /home/microbioinf/miniconda3/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/ \
-t 2 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:200" \
--bowtie2-options " --sensitive" --remove-intermediate-output
EOT

Initial number of reads ( /home/microbioinf/Documentos/Tema_4a/Bact1_R1_200000.fq ): 200000
Running Trimmomatic ... 
Total reads after trimming ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Bact1_R1_200000_kneaddata.trimmed.fastq ): 127652
Decontaminating ...
Running bowtie2 ... 
Total reads after removing those found in reference database ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Bact1_R1_200000_kneaddata_GRCh38_PhiX_bowtie2_clean.fastq ): 32358
Total reads after merging results from multiple databases ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Bact1_R1_200000_kneaddata.fastq ): 32358

Final output file created: 
/home/microbioinf/Documentos/Tema_4a/kneaddata_out/Bact1_R1_200000_kneaddata.fastq



In [91]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_4a
kneaddata -i Vir1_R1_100000.fq -o kneaddata_out \
-db /home/shared/bowtiedb/GRCh38_PhiX \
--trimmomatic /home/microbioinf/miniconda3/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/ \
-t 2 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:200" \
--bowtie2-options " --sensitive" --remove-intermediate-output
EOT

Initial number of reads ( /home/microbioinf/Documentos/Tema_4a/Vir1_R1_100000.fq ): 100000
Running Trimmomatic ... 
Total reads after trimming ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Vir1_R1_100000_kneaddata.trimmed.fastq ): 63602
Decontaminating ...
Running bowtie2 ... 
Total reads after removing those found in reference database ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Vir1_R1_100000_kneaddata_GRCh38_PhiX_bowtie2_clean.fastq ): 63331
Total reads after merging results from multiple databases ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Vir1_R1_100000_kneaddata.fastq ): 63331

Final output file created: 
/home/microbioinf/Documentos/Tema_4a/kneaddata_out/Vir1_R1_100000_kneaddata.fastq



### Process statistics

In [96]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_4a/kneaddata_out
kneaddata_read_count_table --input ./ --output kneaddata_read_counts.txt
EOT

Reading log: ./Bact1_R1_200000_kneaddata.log
Reading log: ./Vir1_R1_100000_kneaddata.log
Read count table written: kneaddata_read_counts.txt


In [97]:
%%bash
ssh microbioinf@192.168.56.101 /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_4a/kneaddata_out
cat kneaddata_read_counts.txt
EOT

Sample	raw single	trimmed single	decontaminated GRCh38_PhiX single	final single
Bact1_R1_200000_kneaddata	200000	127652	32358	32358
Vir1_R1_100000_kneaddata	100000	63602	63331	63331
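
The raw counts are easier to compare as percentages. The sketch below parses the table printed above (copied into a string here so the example is self-contained) and reports, per sample, the share of reads kept after trimming, the share of trimmed reads removed as human or PhiX, and the share kept overall. The function and key names are hypothetical.

```python
import csv
import io

# Table copied from kneaddata_read_counts.txt above (tab-separated).
table_text = (
    "Sample\traw single\ttrimmed single\t"
    "decontaminated GRCh38_PhiX single\tfinal single\n"
    "Bact1_R1_200000_kneaddata\t200000\t127652\t32358\t32358\n"
    "Vir1_R1_100000_kneaddata\t100000\t63602\t63331\t63331\n"
)


def summarize(text):
    """Return per-sample retention percentages from a kneaddata count table."""
    summary = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        raw = int(row["raw single"])
        trimmed = int(row["trimmed single"])
        final = int(row["final single"])
        summary[row["Sample"]] = {
            "kept_after_trimming_pct": round(100 * trimmed / raw, 2),
            "removed_as_human_or_phix_pct": round(100 * (trimmed - final) / trimmed, 2),
            "kept_overall_pct": round(100 * final / raw, 2),
        }
    return summary


summary = summarize(table_text)
for sample, stats in summary.items():
    print(sample, stats)
```

Both samples lose a similar share of reads to trimming (roughly 36%), but about three quarters of the trimmed microbiome reads align to the human/PhiX database, compared with well under 1% of the virome reads.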
