# Whole Genome Shotgun metagenomics: de novo Assembly

We have two FASTQ files to process:

*Vir1_100k*, which contains $100,000$ paired-end reads from the same saliva sample, sequenced after purification of viral particles. It is therefore a virome.


In [27]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
FILE_ID = "ECTV"
FASTQ_STR = "@HWUSI-EAS1752R"
MIN_LEN = "70"

## Preprocessing and quality check


In [32]:
%%bash -s "$FILE_ID" "$FASTQ_STR"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
echo "#### Check files FILE_ID=${FILE_ID}, FASTQ_STR=$FASTQ_STR"
cd Documentos/Tema_3
head ${FILE_ID}*fastq
grep -c $FASTQ_STR ${FILE_ID}*fastq

echo "#### Compute quality"
mkdir ${FILE_ID}_Quality
fastqc ${FILE_ID}_R1.fastq -o ${FILE_ID}_Quality/
fastqc ${FILE_ID}_R2.fastq -o ${FILE_ID}_Quality/

echo "#### Replace ' ' by '_' in header"
head -n 1 ${FILE_ID}*fastq
sed 's/ /_/g' ${FILE_ID}_R1.fastq > ${FILE_ID}_R1_.fastq
sed 's/ /_/g' ${FILE_ID}_R2.fastq > ${FILE_ID}_R2_.fastq
head -n 1 ${FILE_ID}*fastq
EOT

#### Check files FILE_ID=ECTV, FASTQ_STR=@HWUSI-EAS1752R
==> ECTV_R1_.fastq <==
@HWUSI-EAS1752R:23:FC62KHPAAXX:6:3:3542:1008_1:N:0:GCCAAT
NGGATCTCCGATTTCTTTACGATATGGATCTATACCGGACGAATTAATAAACAAACATCCAAAAAAATATGGAATT
+HWUSI-EAS1752R:23:FC62KHPAAXX:6:3:3542:1008_1:N:0:GCCAAT
#*(.(,+,,+@@@@@@222@@@@@@@@:@@@@@@@:::<<71757<<:<:22@222:@@@8518500000:::8:5
@HWUSI-EAS1752R:23:FC62KHPAAXX:6:3:3893:1011_1:N:0:GCCAAT
NGAGTAATACGGTTCAAAATCATAAATGTGATAGTTTCCAGACTGGTATCCGAGTTTTTCTTGGATGATGGATACT
+HWUSI-EAS1752R:23:FC62KHPAAXX:6:3:3893:1011_1:N:0:GCCAAT
#,*))33322@C@@@@C@@C@C@C@@@@@@@@CC@@@C@C@@@@@C@C@@@@@@@C@@CC222@@@@@@C@@@@@@
@HWUSI-EAS1752R:23:FC62KHPAAXX:6:3:5526:1010_1:N:0:GCCAAT
NTGGAGTCGTAAAAAAGTTTTATCTCTTTCTCTCTTTGATGGTCTCATAAAAAAGTTTTACAAAAATATTTTTATT

==> ECTV_R1.fastq <==
@HWUSI-EAS1752R:23:FC62KHPAAXX:6:3:3542:1008 1:N:0:GCCAAT
NGGATCTCCGATTTCTTTACGATATGGATCTATACCGGACGAATTAATAAACAAACATCCAAAAAAATATGGAATT
+HWUSI-EAS1752R:23:FC62KHPAAXX:6:3:3542:1008 1:N:0:GCCAAT
#*(.(,+,,+@@@@@@222@@@@@@@@:@
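The `sed` call above replaces spaces on every line of the file; that is safe here because only the header lines contain spaces. As a hedged alternative, the following Python sketch restricts the replacement to the header and `+` lines of each record. The function name `underscore_headers` is hypothetical, and the sketch assumes strict four-line FASTQ records.

```python
def underscore_headers(lines):
    """Replace spaces with underscores in FASTQ header lines only.

    Assumes strict 4-line records: '@' header, sequence, '+' line, qualities.
    """
    fixed = []
    for i, line in enumerate(lines):
        # Positions 0 and 2 of each record are the '@' header and the '+' line.
        if i % 4 in (0, 2):
            line = line.replace(" ", "_")
        fixed.append(line)
    return fixed


# One record taken from the head output above (sequence and qualities shortened).
record = [
    "@HWUSI-EAS1752R:23:FC62KHPAAXX:6:3:3542:1008 1:N:0:GCCAAT",
    "NGGATCTCCGATTTC",
    "+HWUSI-EAS1752R:23:FC62KHPAAXX:6:3:3542:1008 1:N:0:GCCAAT",
    "#*(.(,+,,+@@@@@",
]
fixed_record = underscore_headers(record)
print(fixed_record[0])
```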

## Trimming and decontaminating
This step trims poor-quality ends and discards short sequences (**Trimmomatic**), then removes reads that align to the human and phiX174 genomes (**bowtie2**). The latter is a spike-in control that Illumina kits use to monitor the quality of the sequencing process.

We filter only the R1 files, because forward reads usually have better quality than reverse reads.
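To make the trimming settings used in this notebook (`SLIDINGWINDOW:4:20 MINLEN:70`) more concrete, here is a minimal Python sketch of the idea: scan the read from the 5' end and cut it at the first window whose mean quality falls below the threshold, then drop the read if it is too short. This is a simplified illustration, not Trimmomatic's exact implementation. The function names are hypothetical, and Phred+33 quality encoding is an assumption.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores (assumed encoding)."""
    return [ord(c) - 33 for c in qual]


def sliding_window_trim(seq, qual, window=4, min_mean_q=20, min_len=70):
    """Cut the read at the first window whose mean quality is too low.

    Returns (seq, qual) after trimming, or None when the trimmed read is
    shorter than min_len. Simplified relative to Trimmomatic.
    """
    scores = phred33(qual)
    keep = len(scores)
    for start in range(0, len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_mean_q:
            keep = start  # discard this window and everything after it
            break
    if keep < min_len:
        return None
    return seq[:keep], qual[:keep]


# Toy read: eight high-quality bases ('I' = Q40) followed by four poor ones ('#' = Q2).
trimmed = sliding_window_trim("ACGTACGTACGT", "IIIIIIII####", min_len=5)
print(trimmed)
```

With the default `min_len=70` the same toy read would be discarded, which mirrors how `MINLEN` removes reads that end up too short after trimming.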

### Processing

In [33]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
echo "#### Trimming and decontaminating FILE_ID=${FILE_ID} MIN_LEN=${MIN_LEN}"
kneaddata -i ${FILE_ID}_R1_.fastq -i ${FILE_ID}_R2_.fastq \
-o kneaddata_out -db /home/shared/bowtiedb/GRCh38_PhiX \
--trimmomatic /home/microbioinf/miniconda3/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/ \
-t 2 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:${MIN_LEN}" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output
EOT

#### Trimming and decontaminating FILE_ID=ECTV MIN_LEN=70
Initial number of reads ( /home/microbioinf/Documentos/Tema_3/ECTV_R1_.fastq ): 50000
Initial number of reads ( /home/microbioinf/Documentos/Tema_3/ECTV_R2_.fastq ): 50000
Running Trimmomatic ... 
Total reads after trimming ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata.trimmed.1.fastq ): 41996
Total reads after trimming ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata.trimmed.2.fastq ): 41996
Total reads after trimming ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata.trimmed.single.1.fastq ): 4107
Total reads after trimming ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata.trimmed.single.2.fastq ): 1791
Decontaminating ...
Running bowtie2 ... 
Total reads after removing those found in reference database ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata_GRCh38_PhiX_bowtie2_paired_clean_1.fastq ): 41839
Total reads after 

### Process statistics

In [38]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
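# List the output files and print the kneaddata log
ls
cat *log
# Summarise read counts per processing step, then count reads per FASTQ file
kneaddata_read_count_table --input ./ --output kneaddata_read_counts${FILE_ID}.txt
grep -c $FASTQ_STR *fastq
EOT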

04/26/2019 08:44:27 AM - kneaddata.knead_data - INFO: Running kneaddata v0.6.1
04/26/2019 08:44:27 AM - kneaddata.knead_data - INFO: Output files will be written to: /home/microbioinf/Documentos/Tema_3/kneaddata_out
04/26/2019 08:44:27 AM - kneaddata.knead_data - DEBUG: Running with the following arguments: 
verbose = False
bmtagger_path = None
minscore = 50
bowtie2_path = /home/microbioinf/miniconda3/bin/bowtie2
maxperiod = 500
no_discordant = False
serial = False
fastqc_start = False
bmtagger = False
cat_final_output = False
log_level = DEBUG
log = /home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata.log
max_memory = 500m
remove_intermediate_output = True
fastqc_path = None
output_dir = /home/microbioinf/Documentos/Tema_3/kneaddata_out
trf_path = None
remove_temp_output = True
reference_db = /home/shared/bowtiedb/GRCh38_PhiX
input = /home/microbioinf/Documentos/Tema_3/ECTV_R1_.fastq /home/microbioinf/Documentos/Tema_3/ECTV_R2_.fastq
pi = 10
reorder = False
pm = 80
tri

In [69]:
data = """
cat Documentos/Tema_3/kneaddata_out/kneaddata_read_counts%s.txt
EOT
""" % FILE_ID
output = !ssh microbioinf@192.168.56.101 /bin/bash <<'EOT' {data}

data = []
# To list of lists
for row in output:
    data.append(row.split('\t'))
# To dataframe
df_knead = pd.DataFrame(data[1:], columns=data[0])
df_knead.style.set_properties(**{'text-align': 'right', 'font-family' : 'courier', 'color' : 'darkgreen', "font-size" : "11pt"}).\
set_properties(**{'text-align': 'right', 'font-family' : 'courier', 'color' : 'darkblue', "font-size" : "12pt"}, subset=['Sample'])

df_knead.transpose()

| | 0 |
|---|---|
| Sample | ECTV_R1__kneaddata |
| raw pair1 | 50000 |
| raw pair2 | 50000 |
| trimmed pair1 | 41996 |
| trimmed pair2 | 41996 |
| trimmed orphan1 | 4107 |
| trimmed orphan2 | 1791 |
| decontaminated GRCh38_PhiX pair1 | 41839 |
| decontaminated GRCh38_PhiX pair2 | 41839 |
| decontaminated GRCh38_PhiX orphan1 | 12 |
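The counts above are easier to interpret as fractions of the raw input. The following sketch computes the percentage of read pairs retained after each step, using the `pair1` values from the table; the helper name `retention` is hypothetical.

```python
def retention(counts, baseline_key):
    """Return each count as a percentage of the baseline count."""
    baseline = counts[baseline_key]
    return {step: 100 * n / baseline for step, n in counts.items()}


# pair1 values copied from the kneaddata read count table above.
pair1_counts = {
    "raw pair1": 50000,
    "trimmed pair1": 41996,
    "decontaminated GRCh38_PhiX pair1": 41839,
}
percent_kept = retention(pair1_counts, "raw pair1")
for step, pct in percent_kept.items():
    print(f"{step}: {pct:.2f}% of raw pairs")
```

Most of the loss happens at the trimming step; decontamination removes only a small additional number of pairs.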


### Check number of reads

With grep we can count the reads in each output file and so identify the non-contaminated, high-quality files.

In [66]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
grep -c $FASTQ_STR *fastq
EOT

ECTV_R1__kneaddata_GRCh38_PhiX_bowtie2_paired_contam_1.fastq:115
ECTV_R1__kneaddata_GRCh38_PhiX_bowtie2_paired_contam_2.fastq:115
ECTV_R1__kneaddata_GRCh38_PhiX_bowtie2_unmatched_1_contam.fastq:30
ECTV_R1__kneaddata_GRCh38_PhiX_bowtie2_unmatched_2_contam.fastq:89
ECTV_R1__kneaddata_paired_1.fastq:41839
ECTV_R1__kneaddata_paired_2.fastq:41839
ECTV_R1__kneaddata_unmatched_1.fastq:12
ECTV_R1__kneaddata_unmatched_2.fastq:5851
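Counting with `grep -c` relies on every header starting with the instrument prefix stored in `FASTQ_STR`. A prefix-independent check is to count lines and divide by four. The sketch below assumes strict four-line FASTQ records with no wrapped sequence lines; `count_fastq_records` is a hypothetical helper name.

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming exactly 4 lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Two toy records; for a real file, pass open("some_file.fastq") instead.
toy_fastq = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\n@III\n"  # a quality line may itself start with '@'
)
n_records = count_fastq_records(toy_fastq)
print(n_records)
```

The second toy record shows why a bare `grep -c "@"` would overcount: `@` is also a valid quality character.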


### Check quality

In [68]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
echo "#### Compute quality"
mkdir -p ${FILE_ID}_HighQuality
fastqc ${FILE_ID}_R1__kneaddata_paired_1.fastq -o ${FILE_ID}_HighQuality/
fastqc ${FILE_ID}_R1__kneaddata_paired_2.fastq -o ${FILE_ID}_HighQuality/
EOT

#### Compute quality
Analysis complete for ECTV_R1__kneaddata_paired_1.fastq
Analysis complete for ECTV_R1__kneaddata_paired_2.fastq


## Assembly

We are going to use a RefSeq database of viral proteins (around 100 MB) from NCBI (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/). It has to be downloaded as two separate files, which can then be joined into one with cat.
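Joining the two downloaded parts amounts to a byte-wise concatenation, which is what `cat part1 part2 > merged` does. The following Python sketch is an equivalent; the file names in the demo are placeholders, not the actual RefSeq file names.

```python
import os
import shutil
import tempfile


def concatenate(parts, destination):
    """Append each file in `parts` to `destination`, in order (like cat)."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Self-contained demo with two tiny placeholder FASTA files.
with tempfile.TemporaryDirectory() as tmp:
    part1 = os.path.join(tmp, "part1.faa")
    part2 = os.path.join(tmp, "part2.faa")
    merged_path = os.path.join(tmp, "merged.faa")
    with open(part1, "w") as fh:
        fh.write(">protein_a\nMKV\n")
    with open(part2, "w") as fh:
        fh.write(">protein_b\nMVL\n")
    concatenate([part1, part2], merged_path)
    with open(merged_path) as fh:
        merged = fh.read()
print(merged)
```

Byte-wise concatenation also works for gzip-compressed parts, because concatenated gzip members form a valid gzip stream.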
 

### Set up the reference database for diamond

This will create a binary DIAMOND database file with the name: *viralprotein.dmnd*
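The notebook does not show the command that builds the database. A plausible sketch, using DIAMOND's `makedb` subcommand with its `--in` and `-d` options, is given below. The input name `viral_proteins.faa` is a hypothetical placeholder for the joined RefSeq file, and the command only runs when both `diamond` and that file are present.

```python
import os
import shutil
import subprocess


def makedb_command(fasta_path, db_name):
    """Build the argument list for `diamond makedb`."""
    return ["diamond", "makedb", "--in", fasta_path, "-d", db_name]


# "viral_proteins.faa" is a placeholder; "viralprotein" yields viralprotein.dmnd.
cmd = makedb_command("viral_proteins.faa", "viralprotein")
print(" ".join(cmd))

# Run only if the tool and the input file actually exist on this machine.
if shutil.which("diamond") and os.path.exists("viral_proteins.faa"):
    subprocess.run(cmd, check=True)
```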

In [77]:
K_MER = 35

In [81]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN" "$K_MER"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 K_MER=$4 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
echo "#### Compute assembly K_MER=${K_MER}"
spades.py -1 ${FILE_ID}_R1__kneaddata_paired_1.fastq -2 ${FILE_ID}_R1__kneaddata_paired_2.fastq \
--sc -k ${K_MER} -o ${FILE_ID}-Assembly${K_MER}

echo "#### Check output K_MER=${K_MER}"
cd ${FILE_ID}-Assembly${K_MER}
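# Count sequences per FASTA file and show the first headers
grep -c ">" *fasta
grep ">" -m 8 contigs.fasta
grep ">" -m 8 scaffolds.fasta
# Look for runs of N, which mark gaps in scaffolds
grep "NN" *fasta
EOT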

#### Compute assembly K_MER=35
Command line: /home/microbioinf/miniconda3/bin/spades.py	-1	/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata_paired_1.fastq	-2	/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata_paired_2.fastq	--sc	-k	35	-o	/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV-Assembly35	

System information:
  SPAdes version: 3.13.0
  Python version: 2.7.15
  OS: Linux-4.15.0-47-generic-x86_64-with-Ubuntu-18.04-bionic

Output dir: /home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV-Assembly35
Mode: read error correction and assembling
Debug mode is turned OFF

Dataset parameters:
  Single-cell mode
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata_paired_1.fastq']
      right reads: ['/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata_paired_2.fastq']
      interlaced reads: not specified
      s
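Beyond counting headers with grep, a common way to summarise an assembly is the N50: the contig length at which the cumulative length of contigs, sorted from longest to shortest, reaches half of the total. The sketch below computes it from FASTA lines. The helper names are hypothetical and the toy sequences are made up for illustration; for a real run, pass `open("contigs.fasta")`.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in a FASTA stream."""
    lengths, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the sorted cumulative sum reaches 50%."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


toy_fasta = [">contig_a", "ACGTAC", "GT", ">contig_b", "GGC"]
toy_lengths = fasta_lengths(toy_fasta)
print(toy_lengths, n50(toy_lengths))
```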

### Blastx alignments

In [1]:
%%bash
ssh microbioinf@192.168.56.101 2>/dev/null /bin/bash <<'EOT'
cd Documentos/Tema_4a
export PATH=$PATH:/home/microbioinf/miniconda3/bin
diamond blastx -d viralprotein.dmnd  -q kneaddata_out/Vir1_R1_100000_kneaddata.fastq -o Vir1HQ_Vs_viralprotein.m8
diamond blastx -d viralprotein.dmnd  -q kneaddata_out/Bact1_R1_200000_kneaddata.fastq -o Bact1HQ_Vs_viralprotein.m8
EOT

diamond v0.8.36.98 | by Benjamin Buchfink <buchfink@gmail.com>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 3
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 25
Temporary directory: 
Opening the database...  [1.5e-05s]
Opening the input file...  [2.8e-05s]
Opening the output file...  [4.1e-05s]
Loading query sequences...  [0.235952s]
Running complexity filter...  [3.78898s]
Building query histograms...  [0.318931s]
Allocating buffers...  [6.2e-05s]
Loading reference sequences...  [0.107758s]
Building reference histograms...  [1.18488s]
Allocating buffers...  [6.1e-05s]
Initializing temporary storage...  [0.000619s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.969958s]
Building query index...  [0.217957s]
Building seed filter...  [0.091635s]
Searching alignments...  [0.137262s]
Processing query chunk 0, reference chunk 0, shape 0, index c
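The `.m8` files written by `diamond blastx` are, by default, tab-separated with the twelve BLAST tabular columns. The sketch below keeps the highest-scoring hit per read. The column list reflects that default format, the helper name `best_hits` is hypothetical, and the three example rows are invented for illustration, not taken from the real output.

```python
import csv
import io

# Default BLAST tabular (m8) column order.
M8_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query id: hit dict} keeping the hit with the highest bit score."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        hit = dict(zip(M8_COLUMNS, row))
        query = hit["qseqid"]
        if query not in best or float(hit["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = hit
    return best


# Invented example rows; for real data, pass open("Vir1HQ_Vs_viralprotein.m8").
example = io.StringIO(
    "read1\tprotA\t90.0\t30\t3\t0\t1\t90\t5\t34\t1e-10\t60.5\n"
    "read1\tprotB\t80.0\t30\t6\t0\t1\t90\t7\t36\t1e-05\t45.1\n"
    "read2\tprotC\t95.0\t25\t1\t0\t3\t77\t10\t34\t1e-08\t55.0\n"
)
top = best_hits(example)
print({query: hit["sseqid"] for query, hit in top.items()})
```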

In [44]:
%%bash --out output
ssh microbioinf@192.168.56.101 2>/dev/null  /bin/bash <<'EOT'
cd Documentos/Tema_4a
head -n 1 *fq
grep -c "@M02255" *fq
EOT

In [50]:
print(output)

==> Bact1_R1_200000.fq <==
@M02255:131:000000000-AJC6R:1:2102:25217:13392_1:N:0:ACAGTG

==> Bact1_R2_200000.fq <==
@M02255:131:000000000-AJC6R:1:2102:25217:13392_2:N:0:ACAGTG

==> Vir1_R1_100000.fq <==
@M02255:131:000000000-AJC6R:1:1105:23249:10170_1:N:0:AGTCAA

==> Vir1_R2_100000.fq <==
@M02255:131:000000000-AJC6R:1:1105:23249:10170_2:N:0:AGTCAA
Bact1_R1_200000.fq:200000
Bact1_R2_200000.fq:200000
Vir1_R1_100000.fq:100000
Vir1_R2_100000.fq:100000



In [49]:
alist = output.split('\n')
alist

['==> Bact1_R1_200000.fq <==',
 '@M02255:131:000000000-AJC6R:1:2102:25217:13392_1:N:0:ACAGTG',
 '',
 '==> Bact1_R2_200000.fq <==',
 '@M02255:131:000000000-AJC6R:1:2102:25217:13392_2:N:0:ACAGTG',
 '',
 '==> Vir1_R1_100000.fq <==',
 '@M02255:131:000000000-AJC6R:1:1105:23249:10170_1:N:0:AGTCAA',
 '',
 '==> Vir1_R2_100000.fq <==',
 '@M02255:131:000000000-AJC6R:1:1105:23249:10170_2:N:0:AGTCAA',
 'Bact1_R1_200000.fq:200000',
 'Bact1_R2_200000.fq:200000',
 'Vir1_R1_100000.fq:100000',
 'Vir1_R2_100000.fq:100000',
 '']
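Once the captured output is split into lines, the `grep -c` results can be separated from the `head` output by keeping only lines of the form `file:count`. A minimal sketch follows; the function name `parse_grep_counts` is hypothetical.

```python
def parse_grep_counts(lines, suffix=".fq"):
    """Extract {file name: count} from `grep -c` lines such as 'a.fq:100'."""
    counts = {}
    for line in lines:
        name, sep, value = line.rpartition(":")
        # Read headers also contain ':', so require a file suffix and a number.
        if sep and name.endswith(suffix) and value.isdigit():
            counts[name] = int(value)
    return counts


# A subset of the lines shown above.
sample_lines = [
    "==> Bact1_R1_200000.fq <==",
    "@M02255:131:000000000-AJC6R:1:2102:25217:13392_1:N:0:ACAGTG",
    "",
    "Bact1_R1_200000.fq:200000",
    "Vir1_R1_100000.fq:100000",
    "",
]
read_counts = parse_grep_counts(sample_lines)
print(read_counts)
```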

In [None]:
Bacteria

In [None]:
root: 1288.0
  Viruses: 922.0
    dsDNA viruses, no RNA stage: 898.0
      Ascoviridae: 6.0
      Baculoviridae: 1.0
      Caudovirales: 619.0
        Ackermannviridae: 1.0
        Myoviridae: 343.0
        Podoviridae: 37.0
        Siphoviridae: 225.0
        unclassified Caudovirales: 3.0
      Herpesvirales: 3.0
        Herpesviridae: 3.0
      Iridoviridae: 8.0
      Marseilleviridae: 2.0
      Mimiviridae: 62.0
      Nimaviridae: 1.0
      Nudiviridae: 2.0
      Phycodnaviridae: 123.0
      Poxviridae: 18.0
      unclassified archaeal dsDNA viruses: 4.0
      unclassified dsDNA phages: 9.0
      unclassified dsDNA viruses: 28.0
        Pandoravirus: 15.0
        Pithoviridae: 13.0
    Ortervirales: 4.0
      Retroviridae: 4.0
    ssDNA viruses: 2.0
      Inoviridae: 1.0
      unclassified ssDNA bacterial viruses: 1.0
    unclassified bacterial viruses: 12.0
    unclassified viruses: 4.0
      Kaumoebavirus: 2.0
      Mollivirus sibericum: 2.0
  Not assigned: 332.0

In [None]:
recomputed
root: 1288.0
  Viruses: 589.0
    dsDNA viruses, no RNA stage: 579.0
      Ascoviridae: 4.0
      Caudovirales: 441.0
        Myoviridae: 250.0
        Podoviridae: 20.0
        Siphoviridae: 163.0
        unclassified Caudovirales: 2.0
      Iridoviridae: 6.0
      Mimiviridae: 35.0
      Nudiviridae: 1.0
      Phycodnaviridae: 69.0
      Poxviridae: 1.0
      unclassified archaeal dsDNA viruses: 1.0
      unclassified dsDNA phages: 4.0
      unclassified dsDNA viruses: 8.0
        Pandoravirus: 8.0
    ssDNA viruses: 1.0
      Inoviridae: 1.0
    unclassified bacterial viruses: 5.0
    unclassified viruses: 3.0
      Kaumoebavirus: 2.0
      Mollivirus sibericum: 1.0
  Not assigned: 674.0

**Viruses**


In [None]:
        Myoviridae: 564.0
        Podoviridae: 1055.0
        Siphoviridae: 7232.0
      Mimiviridae: 11.0

In [None]:
New params:
root: 12725.0
  Viruses: 8224.0
    dsDNA viruses, no RNA stage: 7338.0
      Caudovirales: 7190.0
        Myoviridae: 338.0
        Podoviridae: 823.0
        Siphoviridae: 5677.0
        unclassified Caudovirales: 11.0
      Phycodnaviridae: 9.0
      unclassified archaeal dsDNA viruses: 9.0
      unclassified dsDNA phages: 21.0
    unclassified bacterial viruses: 862.0
  Not assigned: 4375.0
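To compare summaries like the ones above across parameter settings, the indented text can be parsed into (depth, name, count) rows. The sketch below assumes that two spaces of indentation mark one taxonomic level and that each line ends in `: count`; the helper name `parse_tree_counts` is hypothetical.

```python
def parse_tree_counts(text, indent=2):
    """Parse 'name: count' lines into (depth, name, count) tuples.

    Depth is inferred from leading spaces; lines without a numeric count
    (for example a 'New params:' label) are skipped.
    """
    rows = []
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            count = float(value)
        except ValueError:
            continue
        depth = (len(name) - len(name.lstrip(" "))) // indent
        rows.append((depth, name.strip(), count))
    return rows


# Excerpt of the last summary above.
summary = """New params:
root: 12725.0
  Viruses: 8224.0
    dsDNA viruses, no RNA stage: 7338.0
  Not assigned: 4375.0
"""
rows = parse_tree_counts(summary)
totals = {name: count for _, name, count in rows}
viral_fraction = totals["Viruses"] / totals["root"]
print(f"{viral_fraction:.1%} of reads assigned to Viruses")
```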