#  Whole Genome Shotgun metagenomics: Taxonomic Binning Pipeline

We have two FASTQ datasets to process:
1. *Microbiome1_200k*, which contains $200,000$ paired-end reads from total DNA extracted from human saliva, so it represents the microbiome (files `Bact1_R1_200000.fq` and `Bact1_R2_200000.fq`).

2. *Vir1_100k*, which contains $100,000$ paired-end reads from the same saliva sample after purification of viral particles, so it represents the virome (files `Vir1_R1_100000.fq` and `Vir1_R2_100000.fq`).


In [105]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

## Check files
Check the number of reads in each file and look at the first header line, using the `grep` and `head` commands.

In [152]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
cd Documentos/Tema_4a
head -n 1 *fq
grep -c "@M02255" *fq
EOT

==> Bact1_R1_200000.fq <==
@M02255:131:000000000-AJC6R:1:2102:25217:13392_1:N:0:ACAGTG

==> Bact1_R2_200000.fq <==
@M02255:131:000000000-AJC6R:1:2102:25217:13392_2:N:0:ACAGTG

==> Vir1_R1_100000.fq <==
@M02255:131:000000000-AJC6R:1:1105:23249:10170_1:N:0:AGTCAA

==> Vir1_R2_100000.fq <==
@M02255:131:000000000-AJC6R:1:1105:23249:10170_2:N:0:AGTCAA
Bact1_R1_200000.fq:200000
Bact1_R2_200000.fq:200000
Vir1_R1_100000.fq:100000
Vir1_R2_100000.fq:100000
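
Counting header lines with `grep` works here because every read name starts with the instrument ID `@M02255`. As a cross-check, a minimal sketch (assuming uncompressed FASTQ with exactly four lines per record) counts reads as total lines divided by four. This avoids miscounting when a quality line happens to contain the same text. The function name `count_fastq_reads` and the toy records are illustrative, not part of the original notebook.

```python
import io


def count_fastq_reads(handle):
    """Count records in a FASTQ stream that uses four lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Toy data (made up): the quality line of the second record is "@M02255",
# so a plain text search for the instrument ID over-counts.
text = (
    "@M02255:1_1\nACGT\n+\nIIII\n"
    "@M02255:2_1\nTTGACCA\n+\n@M02255\n"
)
n_reads = count_fastq_reads(io.StringIO(text))
grep_like = sum(1 for line in text.splitlines() if "@M02255" in line)
print(n_reads, grep_like)
```

On real files the same function can be called on an open file handle, for example `count_fastq_reads(open("Vir1_R1_100000.fq"))`.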


## Quality (FastQC)

In [153]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_4a
fastqc Vir1_R1_100000.fq -o fastQC_output/
fastqc Vir1_R2_100000.fq -o fastQC_output/
fastqc Bact1_R1_200000.fq -o fastQC_output/
fastqc Bact1_R2_200000.fq -o fastQC_output/
EOT

Analysis complete for Vir1_R1_100000.fq
Analysis complete for Vir1_R2_100000.fq
Analysis complete for Bact1_R1_200000.fq
Analysis complete for Bact1_R2_200000.fq


## Trimming and decontaminating
We trim poor-quality ends and discard short reads with **Trimmomatic**, then remove reads that align to the human and phiX174 genomes with **bowtie2**. Both steps are run through `kneaddata`. PhiX174 DNA is a spike-in control that Illumina uses to monitor the quality of the sequencing run, so its reads count as contaminants here.

We filter only the R1 files because forward reads usually have better quality than reverse reads.
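
To make the trimming options used in this section (`SLIDINGWINDOW:4:20 MINLEN:200`) more concrete, here is a simplified sketch of the idea: scan the read from the 5' end, cut at the first 4-base window whose mean Phred score drops below 20, and discard the read if fewer than 200 bases remain. This is an illustration under those assumptions, not Trimmomatic's actual implementation, and the function names are hypothetical.

```python
def phred33(quality_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def sliding_window_trim(seq, quality_string, window=4, min_avg=20, min_len=200):
    """Cut the read at the first window whose mean quality is below min_avg.

    Returns (seq, quality_string) after trimming, or None when the
    surviving read is shorter than min_len.
    """
    quals = phred33(quality_string)
    keep = len(seq)
    for start in range(len(seq) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            keep = start
            break
    if keep < min_len:
        return None
    return seq[:keep], quality_string[:keep]


# Toy read: ten bases at Q30 ('?') followed by six bases at Q10 ('+').
seq = "ACGTACGTACGTACGT"
qual = "??????????++++++"
trimmed = sliding_window_trim(seq, qual, min_len=5)
dropped = sliding_window_trim(seq, qual)  # default min_len of 200 discards it
print(trimmed, dropped)
```

The strict `MINLEN:200` setting explains why a short toy read is dropped entirely unless `min_len` is lowered.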

### Processing

In [92]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_4a
kneaddata -i Bact1_R1_200000.fq -o kneaddata_out \
-db /home/shared/bowtiedb/GRCh38_PhiX \
--trimmomatic /home/microbioinf/miniconda3/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/ \
-t 2 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:200" \
--bowtie2-options " --sensitive" --remove-intermediate-output
EOT

Initial number of reads ( /home/microbioinf/Documentos/Tema_4a/Bact1_R1_200000.fq ): 200000
Running Trimmomatic ... 
Total reads after trimming ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Bact1_R1_200000_kneaddata.trimmed.fastq ): 127652
Decontaminating ...
Running bowtie2 ... 
Total reads after removing those found in reference database ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Bact1_R1_200000_kneaddata_GRCh38_PhiX_bowtie2_clean.fastq ): 32358
Total reads after merging results from multiple databases ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Bact1_R1_200000_kneaddata.fastq ): 32358

Final output file created: 
/home/microbioinf/Documentos/Tema_4a/kneaddata_out/Bact1_R1_200000_kneaddata.fastq



In [91]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_4a
kneaddata -i Vir1_R1_100000.fq -o kneaddata_out \
-db /home/shared/bowtiedb/GRCh38_PhiX \
--trimmomatic /home/microbioinf/miniconda3/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/ \
-t 2 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:200" \
--bowtie2-options " --sensitive" --remove-intermediate-output
EOT

Initial number of reads ( /home/microbioinf/Documentos/Tema_4a/Vir1_R1_100000.fq ): 100000
Running Trimmomatic ... 
Total reads after trimming ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Vir1_R1_100000_kneaddata.trimmed.fastq ): 63602
Decontaminating ...
Running bowtie2 ... 
Total reads after removing those found in reference database ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Vir1_R1_100000_kneaddata_GRCh38_PhiX_bowtie2_clean.fastq ): 63331
Total reads after merging results from multiple databases ( /home/microbioinf/Documentos/Tema_4a/kneaddata_out/Vir1_R1_100000_kneaddata.fastq ): 63331

Final output file created: 
/home/microbioinf/Documentos/Tema_4a/kneaddata_out/Vir1_R1_100000_kneaddata.fastq



### Process statistics

In [96]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_4a/kneaddata_out
kneaddata_read_count_table --input ./ --output kneaddata_read_counts.txt
EOT

Reading log: ./Bact1_R1_200000_kneaddata.log
Reading log: ./Vir1_R1_100000_kneaddata.log
Read count table written: kneaddata_read_counts.txt


In [154]:
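# Commands sent to the remote host; the trailing EOT closes the here-document opened below
remote_cmd = """
cat Documentos/Tema_4a/kneaddata_out/kneaddata_read_counts.txt
EOT
"""
output = !ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT' {remote_cmd}

# Split each tab-separated line into fields, skipping empty lines
rows = [line.split('\t') for line in output if line.strip()]
# The first row is the header, the remaining rows are the samples
df = pd.DataFrame(rows[1:], columns=rows[0])
# Styler.hide(axis="index") replaces hide_index(), which was removed in pandas 2.0
df.style.hide(axis="index").set_properties(**{'text-align': 'right', 'font-family' : 'courier', 'color' : 'darkgreen', "font-size" : "11pt"}).\
set_properties(**{'text-align': 'right', 'font-family' : 'courier', 'color' : 'darkblue', "font-size" : "12pt"}, subset=['Sample'])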

Sample,raw single,trimmed single,decontaminated GRCh38_PhiX single,final single
Bact1_R1_200000_kneaddata,200000,127652,32358,32358
Vir1_R1_100000_kneaddata,100000,63602,63331,63331
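
The raw counts are easier to compare as percentages. The sketch below recomputes them from the numbers in the table above. The helper name `retention` and the output keys are illustrative choices, not kneaddata output.

```python
rows = [
    # (sample, raw, trimmed, final) copied from the read count table
    ("Bact1_R1_200000", 200000, 127652, 32358),
    ("Vir1_R1_100000", 100000, 63602, 63331),
]


def retention(raw, trimmed, final):
    """Summarise how many reads survive each step, as percentages."""
    return {
        "pct_kept_after_trimming": round(100 * trimmed / raw, 2),
        "pct_trimmed_removed_as_contaminant": round(100 * (trimmed - final) / trimmed, 2),
        "pct_raw_in_final": round(100 * final / raw, 2),
    }


summary = {sample: retention(raw, trimmed, final) for sample, raw, trimmed, final in rows}
for sample, stats in summary.items():
    print(sample, stats)
```

With these counts, roughly three quarters of the trimmed microbiome reads align to the human/PhiX reference, whereas fewer than half a percent of the trimmed virome reads do.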


With `grep` we can count the reads in the final clean (high-quality, non-contaminated) files and in the files of reads flagged as contaminants.

In [155]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
cd Documentos/Tema_4a/kneaddata_out
grep -c "@M02255:" *fastq
EOT

Bact1_R1_200000_kneaddata.fastq:32358
Bact1_R1_200000_kneaddata_GRCh38_PhiX_bowtie2_contam.fastq:95294
Vir1_R1_100000_kneaddata.fastq:63331
Vir1_R1_100000_kneaddata_GRCh38_PhiX_bowtie2_contam.fastq:271


### Check quality

In [156]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
cd Documentos/Tema_4a/kneaddata_out
mkdir -p fastQC_HighQuality
export PATH=$PATH:/home/microbioinf/miniconda3/bin
fastqc Bact1_R1_200000_kneaddata.fastq -o fastQC_HighQuality/
fastqc Vir1_R1_100000_kneaddata.fastq -o fastQC_HighQuality/
EOT

Analysis complete for Bact1_R1_200000_kneaddata.fastq
Analysis complete for Vir1_R1_100000_kneaddata.fastq


## Blastx comparison with a viral protein database

We are going to use the RefSeq database of viral proteins (around 100 MB) from NCBI (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/). It is distributed as two separate files, which can be joined into one with `cat`.
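
For readers who prefer to stay inside the notebook, the `cat` step can be sketched in Python as below. The file names in the demo (`part1.faa`, `part2.faa`, `merged.faa`) are placeholders, not the actual NCBI file names, and the function `concatenate` is a hypothetical helper. Byte-wise concatenation is valid for plain FASTA files, and also for gzip files, since concatenated gzip members form a valid gzip stream.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate(parts, destination):
    """Append each input file to destination, byte for byte, in order."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(destination)


# Demo on two tiny made-up FASTA files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    part1 = tmp / "part1.faa"
    part2 = tmp / "part2.faa"
    part1.write_text(">protein_a\nMKV\n")
    part2.write_text(">protein_b\nMLS\n")
    merged = concatenate([part1, part2], tmp / "merged.faa")
    merged_text = merged.read_text()
    n_sequences = merged_text.count(">")
print(n_sequences)
```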
 

### Set up the reference database for diamond

This will create a binary DIAMOND database file with the name: *viralprotein.dmnd*

In [157]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
cd Documentos/Tema_4a
export PATH=$PATH:/home/microbioinf/miniconda3/bin
diamond makedb --in viral.protein.faa -d viralprotein
EOT

diamond v0.8.36.98 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 3
Database file: viral.protein.faa
Opening the database file...  [0.015813s]
Loading sequence data (0 sequences processed)...  [0.149662s]
Loading sequence data (100000 sequences processed)...  [0.142183s]
Loading sequence data (200000 sequences processed)...  [0.158145s]
Loading sequence data (300000 sequences processed)...  [0.040807s]
Writing trailer...  [0.003161s]
Closing the input file...  [4e-05s]
Closing the database file...  [0.061918s]
Processed 323029 sequences, 82978170 letters.
Total time = 0.572598s


### Blastx alignments

In [1]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
cd Documentos/Tema_4a
export PATH=$PATH:/home/microbioinf/miniconda3/bin
diamond blastx -d viralprotein.dmnd  -q kneaddata_out/Vir1_R1_100000_kneaddata.fastq -o Vir1HQ_Vs_viralprotein.m8
diamond blastx -d viralprotein.dmnd  -q kneaddata_out/Bact1_R1_200000_kneaddata.fastq -o Bact1HQ_Vs_viralprotein.m8
EOT

diamond v0.8.36.98 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 3
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: 
Opening the database...  [1.5e-05s]
Opening the input file...  [2.8e-05s]
Opening the output file...  [4.1e-05s]
Loading query sequences...  [0.235952s]
Running complexity filter...  [3.78898s]
Building query histograms...  [0.318931s]
Allocating buffers...  [6.2e-05s]
Loading reference sequences...  [0.107758s]
Building reference histograms...  [1.18488s]
Allocating buffers...  [6.1e-05s]
Initializing temporary storage...  [0.000619s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.969958s]
Building query index...  [0.217957s]
Building seed filter...  [0.091635s]
Searching alignments...  [0.137262s]
Processing query chunk 0, reference chunk 0, shape 0, index c
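
The `.m8` files written above use the BLAST tabular layout: twelve tab-separated columns per alignment. A minimal sketch for reducing such a file to one best hit per read (highest bit score) might look like the following. The e-value cutoff of `1e-5`, the function name `best_hits`, and the example rows are assumptions for illustration only, not values from this analysis.

```python
import csv
import io

# Standard column names of the 12-column BLAST tabular (m8) format.
M8_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {read id: best alignment}, keeping the highest bit score per read."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue
        hit = dict(zip(M8_COLUMNS, fields))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue  # too weak to keep under the assumed cutoff
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up alignments: read1 has two hits, read2 only a weak one.
example = io.StringIO(
    "read1\tprotA\t60.0\t50\t20\t0\t1\t150\t10\t59\t1e-10\t55.0\n"
    "read1\tprotB\t80.0\t50\t10\t0\t1\t150\t5\t54\t1e-20\t90.0\n"
    "read2\tprotC\t35.0\t40\t26\t0\t3\t122\t7\t46\t0.5\t25.0\n"
)
best = best_hits(example)
print({read: hit["sseqid"] for read, hit in best.items()})
```

On the real output, the same function could be applied to `open("Vir1HQ_Vs_viralprotein.m8")`.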

In [44]:
%%bash --out output
ssh microbioinf@192.168.56.101 2>/dev/null  /bin/bash <<'EOT'
cd Documentos/Tema_4a
head -n 1 *fq
grep -c "@M02255" *fq
EOT

In [50]:
print(output)

==> Bact1_R1_200000.fq <==
@M02255:131:000000000-AJC6R:1:2102:25217:13392_1:N:0:ACAGTG

==> Bact1_R2_200000.fq <==
@M02255:131:000000000-AJC6R:1:2102:25217:13392_2:N:0:ACAGTG

==> Vir1_R1_100000.fq <==
@M02255:131:000000000-AJC6R:1:1105:23249:10170_1:N:0:AGTCAA

==> Vir1_R2_100000.fq <==
@M02255:131:000000000-AJC6R:1:1105:23249:10170_2:N:0:AGTCAA
Bact1_R1_200000.fq:200000
Bact1_R2_200000.fq:200000
Vir1_R1_100000.fq:100000
Vir1_R2_100000.fq:100000



In [49]:
alist = output.split('\n')
alist

['==> Bact1_R1_200000.fq <==',
 '@M02255:131:000000000-AJC6R:1:2102:25217:13392_1:N:0:ACAGTG',
 '',
 '==> Bact1_R2_200000.fq <==',
 '@M02255:131:000000000-AJC6R:1:2102:25217:13392_2:N:0:ACAGTG',
 '',
 '==> Vir1_R1_100000.fq <==',
 '@M02255:131:000000000-AJC6R:1:1105:23249:10170_1:N:0:AGTCAA',
 '',
 '==> Vir1_R2_100000.fq <==',
 '@M02255:131:000000000-AJC6R:1:1105:23249:10170_2:N:0:AGTCAA',
 'Bact1_R1_200000.fq:200000',
 'Bact1_R2_200000.fq:200000',
 'Vir1_R1_100000.fq:100000',
 'Vir1_R2_100000.fq:100000',
 '']
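
Once the output is a list of lines, the `grep -c` results can be pulled into a dictionary. The sketch below assumes that count lines have the form `filename:count` and that header lines start with `@`. The function name `parse_grep_counts` is illustrative.

```python
def parse_grep_counts(lines):
    """Extract {filename: count} from lines shaped like 'name:count'."""
    counts = {}
    for line in lines:
        name, sep, value = line.rpartition(":")
        # Skip read headers, 'head' banners and empty lines.
        if sep and value.isdigit() and not name.startswith("@"):
            counts[name] = int(value)
    return counts


# Same content as the list shown above.
alist = [
    "==> Bact1_R1_200000.fq <==",
    "@M02255:131:000000000-AJC6R:1:2102:25217:13392_1:N:0:ACAGTG",
    "",
    "==> Vir1_R2_100000.fq <==",
    "@M02255:131:000000000-AJC6R:1:1105:23249:10170_2:N:0:AGTCAA",
    "Bact1_R1_200000.fq:200000",
    "Bact1_R2_200000.fq:200000",
    "Vir1_R1_100000.fq:100000",
    "Vir1_R2_100000.fq:100000",
    "",
]
counts = parse_grep_counts(alist)
print(counts)
```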

In [None]:
Bacteria

In [None]:
root: 1288.0
  Viruses: 922.0
    dsDNA viruses, no RNA stage: 898.0
      Ascoviridae: 6.0
      Baculoviridae: 1.0
      Caudovirales: 619.0
        Ackermannviridae: 1.0
        Myoviridae: 343.0
        Podoviridae: 37.0
        Siphoviridae: 225.0
        unclassified Caudovirales: 3.0
      Herpesvirales: 3.0
        Herpesviridae: 3.0
      Iridoviridae: 8.0
      Marseilleviridae: 2.0
      Mimiviridae: 62.0
      Nimaviridae: 1.0
      Nudiviridae: 2.0
      Phycodnaviridae: 123.0
      Poxviridae: 18.0
      unclassified archaeal dsDNA viruses: 4.0
      unclassified dsDNA phages: 9.0
      unclassified dsDNA viruses: 28.0
        Pandoravirus: 15.0
        Pithoviridae: 13.0
    Ortervirales: 4.0
      Retroviridae: 4.0
    ssDNA viruses: 2.0
      Inoviridae: 1.0
      unclassified ssDNA bacterial viruses: 1.0
    unclassified bacterial viruses: 12.0
    unclassified viruses: 4.0
      Kaumoebavirus: 2.0
      Mollivirus sibericum: 2.0
  Not assigned: 332.0

In [None]:
recomputed
root: 1288.0
  Viruses: 589.0
    dsDNA viruses, no RNA stage: 579.0
      Ascoviridae: 4.0
      Caudovirales: 441.0
        Myoviridae: 250.0
        Podoviridae: 20.0
        Siphoviridae: 163.0
        unclassified Caudovirales: 2.0
      Iridoviridae: 6.0
      Mimiviridae: 35.0
      Nudiviridae: 1.0
      Phycodnaviridae: 69.0
      Poxviridae: 1.0
      unclassified archaeal dsDNA viruses: 1.0
      unclassified dsDNA phages: 4.0
      unclassified dsDNA viruses: 8.0
        Pandoravirus: 8.0
    ssDNA viruses: 1.0
      Inoviridae: 1.0
    unclassified bacterial viruses: 5.0
    unclassified viruses: 3.0
      Kaumoebavirus: 2.0
      Mollivirus sibericum: 1.0
  Not assigned: 674.0

**Viruses**


In [None]:
        Myoviridae: 564.0
        Podoviridae: 1055.0
        Siphoviridae: 7232.0
      Mimiviridae: 11.0

In [None]:
New params:
root: 12725.0
  Viruses: 8224.0
    dsDNA viruses, no RNA stage: 7338.0
      Caudovirales: 7190.0
        Myoviridae: 338.0
        Podoviridae: 823.0
        Siphoviridae: 5677.0
        unclassified Caudovirales: 11.0
      Phycodnaviridae: 9.0
      unclassified archaeal dsDNA viruses: 9.0
      unclassified dsDNA phages: 21.0
    unclassified bacterial viruses: 862.0
  Not assigned: 4375.0
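
The indented `name: count` listings are easier to compare across runs when each node is expressed as a share of the root. The sketch below parses such a listing, assuming two spaces of indentation per level, and reports percentages for the last tree shown above. The function names are hypothetical and only a subset of the nodes is reproduced.

```python
def parse_tree(text):
    """Yield (depth, name, count) for every 'name: count' line."""
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        try:
            count = float(value)
        except ValueError:
            continue  # not a 'name: count' line
        if not sep:
            continue
        depth = (len(name) - len(name.lstrip())) // 2
        yield depth, name.strip(), count


def share_of_root(text):
    """Return {name: percentage of the root count}, rounded to 2 decimals."""
    nodes = list(parse_tree(text))
    root_count = nodes[0][2]
    return {name: round(100 * count / root_count, 2) for _, name, count in nodes}


listing = """\
root: 12725.0
  Viruses: 8224.0
    dsDNA viruses, no RNA stage: 7338.0
      Caudovirales: 7190.0
        Myoviridae: 338.0
        Podoviridae: 823.0
        Siphoviridae: 5677.0
  Not assigned: 4375.0
"""
shares = share_of_root(listing)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```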