#  Whole Genome Shotgun metagenomics: de novo Assembly

We have two fastqc files to process:

*Vir1_100k* that contains $100,000$ paired end reads from the same saliva sample but after purification of viral particles. So this is a virome.

Different pairs of index primers are added to individual samples in a second thermocycling step, after initial amplification of the target region. This allows you to mix many samples together (up to 96) and sequence them at the same time. Following sequencing, for example on an Illumina MiSeq, the software is able to identify these indexes on each sequence read and, because you have already told the machine which pair of index primers was added to each sample, the machine then knows which sample to associate that read to, allowing you to separate the reads for each different sample.

Use kneaddata to run pre-processing tools. First Trimmomatic is run to remove low quality sequences. Then Bowtie2 is run to screen out contaminant sequences. Below we are screening out reads that map to the human or PhiX genomes. Note KneadData is being run below on all unstitched FASTQ pairs with parallel, you can see our quick tutorial on this tool here. For a detailed breakdown of the options in the below command see this page. The forward and reverse reads will be specified by "_1" and "_2" in the output files, ignore the "R1" in each filename. Note that the \ characters at the end of each line are just to split the command over multiple lines to make it easier to read.

parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
 ::: cat_lanes/*_R1.fastq ::: cat_lanes/*_R2.fastq
Clean up the output directory (helps downstream commands) by moving the discarded sequences to a subfolder:

mkdir kneaddata_out/contam_seq

mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq
You can produce a logfile summarizing the kneadData output with this command:

The current version of SPAdes works with Illumina or IonTorrent reads and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. You can also provide additional contigs that will be used as long reads.

Version 3.10.1 of SPAdes supports paired-end reads, mate-pairs and unpaired reads. SPAdes can take as input several paired-end and mate-pair libraries simultaneously. Note, that SPAdes was initially designed for small genomes. It was tested on bacterial (both single-cell MDA and standard isolates), fungal and other small genomes. SPAdes is not intended for larger genomes (e.g. mammalian size genomes). For such purposes you can use it at your own risk.



In [222]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
#FILE_ID = "ECTV"
#FASTQ_STR = "@HWUSI-EAS1752R"
#MIN_LEN = "70"

FILE_ID = "VIR"
FASTQ_STR = "@M02255"
MIN_LEN = "200"

## Preprocessing and quality check


In [223]:
%%bash -s "$FILE_ID" "$FASTQ_STR"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
echo "#### Check files FILE_ID=${FILE_ID}, FASTQ_STR=$FASTQ_STR"
cd Documentos/Tema_3
head ${FILE_ID}*fastq
grep -c $FASTQ_STR ${FILE_ID}*fastq

echo "#### Compute quality"
mkdir ${FILE_ID}_Quality
fastqc ${FILE_ID}_R1.fastq -o ${FILE_ID}_Quality/
fastqc ${FILE_ID}_R2.fastq -o ${FILE_ID}_Quality/

echo "#### Replace ' ' by '_' in header"
head -n 1 ${FILE_ID}*fastq
cat ${FILE_ID}_R1.fastq | sed 's/ /_/g' > ${FILE_ID}_R1_.fastq
cat ${FILE_ID}_R2.fastq | sed 's/ /_/g' > ${FILE_ID}_R2_.fastq
head -n 1 ${FILE_ID}*fastq
EOT

#### Check files FILE_ID=VIR, FASTQ_STR=@M02255
==> VIR_R1.fastq <==
@M02255:131:000000000-AJC6R:1:1105:23249:10170_1:N:0:AGTCAA
AACTGGCGTTACATGAAGGGCTCTGAGTTGATTGATGCTTTGGAGGAGTACCTGTGAAATGGCCGTCTGAGAAGGTTGTTAATGCGACCGTAAAGTATGGTGGTGTCGTGTTGAGACGTGGACCGTACGCATATTTCGATAAGGGGGGCATTCGATTGTGTGCTACAAGGCTTGGTCTCTCTTCATATATTGTGGAGAGTGATGATTGTGGTCCTGAGATTTATAGTGAGGATGGTATGATTGAGTTGGTGACGTCTTTATGATTCCTGTTACCGAGACTATCCTGAAAACTGCTTACCAT
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGFAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEFGGGGGGGGGGGGGFGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGCF@EGGGG9CEGGGFFGGGC?FGECFFGGGG9CFGGFEFCGGEGGCF?FG7EGCFGC:FFGF6<FGFGGFGGGFFFFFEGFF@7@)8@FFFBAFA=F<FDFF<157526))4?<39>B9?><?BAA?2>F
@M02255:128:000000000-AG7E5:1:1112:17059:22756_1:N:0:AGTCAA
TGTTGGTCTATATTGCGCTTTAGCAGCCTTGTTTTTAAAAGCTGATACAAACGGGAATATAAAAATAATTGATAAAGAAAACATAAAAACGGATCAACTTGCCGGGGCAAGTGAAGCCATGAAACACATAGAGAAGTCTTCTATTCCATTAATGACAACTTCAGATCTTGACATTTATTCAGATTCATTATAACACACCTTAGAC

## Trimming and decontaminating
Trimming poor quality ends and short sequences (**Trimmomatic**) and removal of reads aligning to the human and phiX174 genomes (***bowtie2**). The later one is a contaminant used as spike by Illumina kits to control quality of the sequencing process.

We are only filtering only R1 files because forward reads have usually better quality than reverse reads. 

### Process

In [224]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
echo "#### Trimming and decontaminating FILE_ID=${FILE_ID} MIN_LEN=${MIN_LEN}"
kneaddata -i ${FILE_ID}_R1_.fastq -i ${FILE_ID}_R2_.fastq \
-o kneaddata_out -db /home/shared/bowtiedb/GRCh38_PhiX \
--trimmomatic /home/microbioinf/miniconda3/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/ \
-t 2 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:${MIN_LEN}" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output
EOT

#### Trimming and decontaminating FILE_ID=VIR MIN_LEN=200
Initial number of reads ( /home/microbioinf/Documentos/Tema_3/VIR_R1_.fastq ): 100000
Initial number of reads ( /home/microbioinf/Documentos/Tema_3/VIR_R2_.fastq ): 100000
Running Trimmomatic ... 
Total reads after trimming ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata.trimmed.1.fastq ): 29521
Total reads after trimming ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata.trimmed.2.fastq ): 29521
Total reads after trimming ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata.trimmed.single.1.fastq ): 34081
Total reads after trimming ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata.trimmed.single.2.fastq ): 1926
Decontaminating ...
Running bowtie2 ... 
Total reads after removing those found in reference database ( /home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata_GRCh38_PhiX_bowtie2_paired_clean_1.fastq ): 29310
Total reads after remo

### Process statistics

In [231]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
cat ${FILE_ID}_R1*log
kneaddata_read_count_table --input ./ --output kneaddata_read_counts${FILE_ID}.txt 
grep -c $FASTQ_STR ${FILE_ID}*fastq

04/26/2019 01:17:31 PM - kneaddata.knead_data - INFO: Running kneaddata v0.6.1
04/26/2019 01:17:31 PM - kneaddata.knead_data - INFO: Output files will be written to: /home/microbioinf/Documentos/Tema_3/kneaddata_out
04/26/2019 01:17:31 PM - kneaddata.knead_data - DEBUG: Running with the following arguments: 
verbose = False
bmtagger_path = None
minscore = 50
bowtie2_path = /home/microbioinf/miniconda3/bin/bowtie2
maxperiod = 500
no_discordant = False
serial = False
fastqc_start = False
bmtagger = False
cat_final_output = False
log_level = DEBUG
log = /home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata.log
max_memory = 500m
remove_intermediate_output = True
fastqc_path = None
output_dir = /home/microbioinf/Documentos/Tema_3/kneaddata_out
trf_path = None
remove_temp_output = True
reference_db = /home/shared/bowtiedb/GRCh38_PhiX
input = /home/microbioinf/Documentos/Tema_3/VIR_R1_.fastq /home/microbioinf/Documentos/Tema_3/VIR_R2_.fastq
pi = 10
reorder = False
pm = 80
trimmo

In [232]:
data = """
cat Documentos/Tema_3/kneaddata_out/kneaddata_read_counts%s.txt
EOT
""" % FILE_ID
output = !ssh microbioinf@192.168.56.101 /bin/bash <<'EOT' {data}

data = []
# To list of lists
for row in output:
    data.append(row.split('\t'))
# To dataframe
df_knead = pd.DataFrame(data[1:], columns=data[0])
df_knead.style.hide_index().set_properties(**{'text-align': 'right', 'font-family' : 'courier', 'color' : 'darkgreen', "font-size" : "11pt"}).\
set_properties(**{'text-align': 'right', 'font-family' : 'courier', 'color' : 'darkblue', "font-size" : "12pt"}, subset=['Sample'])
df_knead.transpose()

Unnamed: 0,0,1
Sample,ECTV_R1__kneaddata,VIR_R1__kneaddata
raw pair1,50000,100000
raw pair2,50000,100000
trimmed pair1,41996,29521
trimmed pair2,41996,29521
trimmed orphan1,4107,34081
trimmed orphan2,1791,1926
decontaminated GRCh38_PhiX pair1,41839,29310
decontaminated GRCh38_PhiX pair2,41839,29310
decontaminated GRCh38_PhiX orphan1,12,8


### Check number of reads

With grep we can identify the non-contaminated high-quality files

In [234]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
grep -c $FASTQ_STR ${FILE_ID}*fastq

VIR_R1__kneaddata_GRCh38_PhiX_bowtie2_paired_contam_1.fastq:183
VIR_R1__kneaddata_GRCh38_PhiX_bowtie2_paired_contam_2.fastq:183
VIR_R1__kneaddata_GRCh38_PhiX_bowtie2_unmatched_1_contam.fastq:20
VIR_R1__kneaddata_GRCh38_PhiX_bowtie2_unmatched_2_contam.fastq:113
VIR_R1__kneaddata_paired_1.fastq:29310
VIR_R1__kneaddata_paired_2.fastq:29310
VIR_R1__kneaddata_unmatched_1.fastq:8
VIR_R1__kneaddata_unmatched_2.fastq:35922


### Check quality

In [235]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
echo "#### Compute quality"
mkdir ${FILE_ID}_HighQuality
fastqc ${FILE_ID}_R1__kneaddata_paired_1.fastq -o ${FILE_ID}_HighQuality/
fastqc ${FILE_ID}_R1__kneaddata_paired_2.fastq -o ${FILE_ID}_HighQuality/

#### Compute quality
Analysis complete for VIR_R1__kneaddata_paired_1.fastq
Analysis complete for VIR_R1__kneaddata_paired_2.fastq


## Assembly

 We are going to use a Refseq database of viral proteins (around 100Mb) from ncbi (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/), and you have to download it in two separated files that can be joined into one with cat.
 

### Process (*spades*)

In this step we run command **spades** with the paired high-quality and free of known contaminants reads.

#### K_MER = 35

In [236]:
K_MER = 35

In [237]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN" "35"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 K_MER=$4 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
echo "#### Compute assembly K_MER=${K_MER}"
spades.py -1 ${FILE_ID}_R1__kneaddata_paired_1.fastq -2 ${FILE_ID}_R1__kneaddata_paired_2.fastq \
--sc -k ${K_MER} -o ${FILE_ID}-Assembly${K_MER}
EOT

#### Compute assembly K_MER=35
Command line: /home/microbioinf/miniconda3/bin/spades.py	-1	/home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata_paired_1.fastq	-2	/home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata_paired_2.fastq	--sc	-k	35	-o	/home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR-Assembly35	

System information:
  SPAdes version: 3.13.0
  Python version: 2.7.15
  OS: Linux-4.15.0-47-generic-x86_64-with-Ubuntu-18.04-bionic

Output dir: /home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR-Assembly35
Mode: read error correction and assembling
Debug mode is turned OFF

Dataset parameters:
  Single-cell mode
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata_paired_1.fastq']
      right reads: ['/home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata_paired_2.fastq']
      interlaced reads: not specified
      single 

In [238]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN" "$K_MER"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 K_MER=$4 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
echo "#### Check output K_MER=${K_MER}"
cd ${FILE_ID}-Assembly${K_MER}
rep -c ">" *fasta
grep ">" -m 8 contigs.fasta 
grep ">" -m 8 scaffolds.fasta 
grep "NN" *fasta

#### Check output K_MER=35
>NODE_1_length_87312_cov_13.958695
>NODE_2_length_43671_cov_9.036942
>NODE_3_length_24905_cov_6.568195
>NODE_4_length_15602_cov_12.480182
>NODE_5_length_14953_cov_8.458238
>NODE_6_length_13800_cov_8.779295
>NODE_7_length_12853_cov_6.110704
>NODE_8_length_12581_cov_26.042962
>NODE_1_length_87312_cov_13.958695
>NODE_2_length_43671_cov_9.036942
>NODE_3_length_28485_cov_10.565624
>NODE_4_length_24905_cov_6.568195
>NODE_5_length_16556_cov_8.484898
>NODE_6_length_15602_cov_12.480182
>NODE_7_length_12853_cov_6.110704
>NODE_8_length_12581_cov_26.042962
scaffolds.fasta:CCCACAAGGGCCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
scaffolds.fasta:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACTTTA
scaffolds.fasta:TCCTCTTCTTTCGCGCGTTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
scaffolds.fasta:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
scaffolds.fasta:TTTTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
scaffolds.fasta:NNNNNNNNNNNNNNNNNNNNN

#### List of different K_MER

In [206]:
K_MERS_LIST = ["25", "35", "45"]
K_MERS =  ",".join(K_MERS_LIST)
print(K_MERS)

25,35,45


In [207]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN" "$K_MERS"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 K_MERS=$4 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
echo "#### Compute assembly with no specified K_MER"
spades.py -1 ${FILE_ID}_R1__kneaddata_paired_1.fastq -2 ${FILE_ID}_R1__kneaddata_paired_2.fastq \
--sc -o ${FILE_ID}-Assembly${K_MER}
IFS=","
for K_MER in ${K_MERS}
do
echo "#### Compute assembly K_MER=${K_MER}"
spades.py -1 ${FILE_ID}_R1__kneaddata_paired_1.fastq -2 ${FILE_ID}_R1__kneaddata_paired_2.fastq \
--sc -k ${K_MER} -o ${FILE_ID}-Assembly${K_MER}
done
EOT

#### Compute assembly with no specified M_MER
Command line: /home/microbioinf/miniconda3/bin/spades.py	-1	/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata_paired_1.fastq	-2	/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata_paired_2.fastq	--sc	-o	/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV-Assembly	

System information:
  SPAdes version: 3.13.0
  Python version: 2.7.15
  OS: Linux-4.15.0-47-generic-x86_64-with-Ubuntu-18.04-bionic

Output dir: /home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV-Assembly
Mode: read error correction and assembling
Debug mode is turned OFF

Dataset parameters:
  Single-cell mode
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata_paired_1.fastq']
      right reads: ['/home/microbioinf/Documentos/Tema_3/kneaddata_out/ECTV_R1__kneaddata_paired_2.fastq']
      interlaced reads: not specified
  

In [218]:
K_MERS_LIST = ["","25", "35", "45"]
K_MERS =  ",".join(K_MERS_LIST)
print(K_MERS)

,25,35,45


In [219]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN" "$K_MERS"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 K_MERS=$4 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
IFS=","
for K_MER in ${K_MERS}
do
    echo
    echo "#### Check output K_MER=${K_MER}"
    cd ${FILE_ID}-Assembly${K_MER}
    rep -c ">" *fasta
    grep ">" -m 8 contigs.fasta 
    grep ">" -m 8 scaffolds.fasta 
    grep "NN" *fasta
done


#### Check output K_MER=
>NODE_1_length_92446_cov_8.851382
>NODE_2_length_61624_cov_7.032646
>NODE_3_length_29587_cov_8.324665
>NODE_4_length_13405_cov_7.334082
>NODE_5_length_421_cov_1.073770
>NODE_6_length_261_cov_1.072816
>NODE_7_length_228_cov_1.942197
>NODE_8_length_227_cov_0.610465
>NODE_1_length_92877_cov_8.814516
>NODE_2_length_75358_cov_7.054686
>NODE_3_length_29587_cov_8.324665
>NODE_4_length_261_cov_1.072816
>NODE_5_length_227_cov_0.610465
>NODE_6_length_90_cov_11.685714
>NODE_7_length_79_cov_49.625000
scaffolds.fasta:TNNNNNNNNNNTACCGCCATTATGGTGGCTAGTGATGTTTGTAAAAAAAATTTGGATTTA
scaffolds.fasta:ATAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
scaffolds.fasta:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGTGGATCCGACGTTTCAACTATTG
scaffolds.fasta:GTTACTGTTACTACTAAAGACTANNNNNNNNNNCTTTTAGAATGACGTCTTGTAATATCA

#### Check output K_MER=25
>NODE_1_length_92446_cov_8.851382
>NODE_2_length_61624_cov_7.032646
>NODE_3_length_29587_cov_8.324665
>NODE_4_length_13405_cov_7.334082
>NODE_5_

#### Comparison of assemblies (*quast*)

In [220]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN" "$K_MER"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 K_MER=$4 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
echo "#### Compare assemblies FILE_ID=${FILE_ID}"
for assembly in ${FILE_ID}-Assembly*; 
    do echo "Processing $assembly file..."; 
    cp ${assembly}/contigs.fasta contigs-${assembly}.fasta
    cp ${assembly}/scaffolds.fasta scaffolds-${assembly}.fasta
done
quast.py contigs* scaffolds* -R ../ECTV-MoscowGenome.fasta
EOT

#### Compare assemblies FILE_ID=ECTV
Processing ECTV-Assembly file...
Processing ECTV-Assembly25 file...
Processing ECTV-Assembly35 file...
Processing ECTV-Assembly45 file...
/home/microbioinf/miniconda3/lib/python2.7/site-packages/quast-5.0.2-py2.7.egg-info/scripts/quast.py contigs-ECTV-Assembly25.fasta contigs-ECTV-Assembly35.fasta contigs-ECTV-Assembly45.fasta contigs-ECTV-Assembly.fasta scaffolds-ECTV-Assembly25.fasta scaffolds-ECTV-Assembly35.fasta scaffolds-ECTV-Assembly45.fasta scaffolds-ECTV-Assembly.fasta -R ../ECTV-MoscowGenome.fasta

Version: 5.0.2

System information:
  OS: Linux-4.15.0-47-generic-x86_64-with-debian-buster-sid (linux_64)
  Python version: 2.7.11
  CPUs number: 3

Started: 2019-04-26 13:12:24

Logging to /home/microbioinf/Documentos/Tema_3/kneaddata_out/quast_results/results_2019_04_26_13_12_24/quast.log
NOTICE: Maximum number of threads is set to 1 (use --threads option to set it manually)

CWD: /home/microbioinf/Documentos/Tema_3/kneaddata_out
Main paramet

In [239]:
data = """
cat Documentos/Tema_3/kneaddata_out/quast*/latest/report.tsv
EOT
"""
output = !ssh microbioinf@192.168.56.101 /bin/bash <<'EOT' {data}
data = []
# To list of lists
for row in output:
    data.append(row.split('\t'))
# To dataframe
df_quast = pd.DataFrame(data[1:], columns=data[0])
df_quast.style.hide_index().set_properties(**{'text-align': 'rigth', 'font-family' : 'courier', 'color' : 'darkgreen', "font-size" : "10pt"}).\
set_properties(**{'text-align': 'left', 'font-family' : 'courier', 'color' : 'darkblue', "font-size" : "10pt"}, \
               subset=['Assembly'])

Assembly,contigs_ECTV_Assembly25,contigs_ECTV_Assembly35,contigs_ECTV_Assembly45,contigs_ECTV_Assembly,scaffolds_ECTV_Assembly25,scaffolds_ECTV_Assembly35,scaffolds_ECTV_Assembly45,scaffolds_ECTV_Assembly
# contigs (>= 0 bp),38,22,34,10,33,19,27,7
# contigs (>= 1000 bp),11,8,13,4,9,6,10,3
# contigs (>= 5000 bp),8,7,9,4,6,5,8,3
# contigs (>= 10000 bp),6,6,6,4,5,4,5,3
# contigs (>= 25000 bp),2,3,3,3,3,3,3,3
# contigs (>= 50000 bp),1,1,0,2,1,2,1,2
Total length (>= 0 bp),198533,196511,196309,198368,198845,196617,196673,198479
Total length (>= 1000 bp),193592,194143,190456,197062,194196,194520,192259,197822
Total length (>= 5000 bp),186718,193083,182355,197062,187322,193460,187864,197822
Total length (>= 10000 bp),171220,184029,158872,197062,179237,184406,164381,197822


### Process (*metaspades*)

#### List of different K_MER

In [251]:
K_MERS_LIST = ["25", "35", "45"]
K_MERS =  ",".join(K_MERS_LIST)
print(K_MERS)

25,35,45


In [252]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN" "$K_MERS"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 K_MERS=$4 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
echo "#### Compute assembly with no specified K_MER"
metaspades.py -1 ${FILE_ID}_R1__kneaddata_paired_1.fastq -2 ${FILE_ID}_R1__kneaddata_paired_2.fastq \
--meta -o meta-${FILE_ID}-Assembly${K_MER}
IFS=","
for K_MER in ${K_MERS}
do
echo "#### Compute assembly K_MER=${K_MER}"
metaspades.py -1 ${FILE_ID}_R1__kneaddata_paired_1.fastq -2 ${FILE_ID}_R1__kneaddata_paired_2.fastq \
--meta -k ${K_MER} -o meta-${FILE_ID}-Assembly${K_MER}
done
EOT

#### Compute assembly with no specified K_MER
Command line: /home/microbioinf/miniconda3/bin/metaspades.py	-1	/home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata_paired_1.fastq	-2	/home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata_paired_2.fastq	--meta	-o	/home/microbioinf/Documentos/Tema_3/kneaddata_out/meta-VIR-Assembly	

System information:
  SPAdes version: 3.13.0
  Python version: 2.7.15
  OS: Linux-4.15.0-47-generic-x86_64-with-Ubuntu-18.04-bionic

Output dir: /home/microbioinf/Documentos/Tema_3/kneaddata_out/meta-VIR-Assembly
Mode: read error correction and assembling
Debug mode is turned OFF

Dataset parameters:
  Metagenomic mode
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata_paired_1.fastq']
      right reads: ['/home/microbioinf/Documentos/Tema_3/kneaddata_out/VIR_R1__kneaddata_paired_2.fastq']
      interlaced reads: not sp

In [254]:
K_MERS_LIST = ["","25", "35", "45"]
K_MERS =  ",".join(K_MERS_LIST)
print(K_MERS)

,25,35,45


In [255]:
%%bash -s "$FILE_ID" "$FASTQ_STR" "$MIN_LEN" "$K_MERS"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 FASTQ_STR=$2 MIN_LEN=$3 K_MERS=$4 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
IFS=","
for K_MER in ${K_MERS}
do
    echo
    echo "#### Check output K_MER=${K_MER}"
    cd meta-${FILE_ID}-Assembly${K_MER}
    rep -c ">" *fasta
    grep ">" -m 8 contigs.fasta 
    grep ">" -m 8 scaffolds.fasta 
    grep "NN" *fasta
done


#### Check output K_MER=
>NODE_1_length_87312_cov_12.680977
>NODE_2_length_49827_cov_8.457647
>NODE_3_length_41290_cov_6.676852
>NODE_4_length_31161_cov_5.416351
>NODE_5_length_21925_cov_9.228166
>NODE_6_length_19420_cov_25.583527
>NODE_7_length_18141_cov_10.684452
>NODE_8_length_17748_cov_24.048381
>NODE_1_length_87312_cov_12.680977
>NODE_2_length_49827_cov_8.457647
>NODE_3_length_41290_cov_6.676852
>NODE_4_length_40166_cov_9.849144
>NODE_5_length_31161_cov_5.416351
>NODE_6_length_19420_cov_25.583527
>NODE_7_length_17748_cov_24.048381
>NODE_8_length_12677_cov_11.504754
scaffolds.fasta:GAAAAGAGCTGGAAGAAGATGGTTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
scaffolds.fasta:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
scaffolds.fasta:NNNNNGAAGTGATCGGAAATATTTATGAGAATGAACTAGATTTGATAGTGTACGAGGCTT
scaffolds.fasta:GGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
scaffolds.fasta:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCACTCATAGATTTTGC
scaffolds.fasta:GTGTCATNNNNNNNNNNNNNN

#### Comparison of assemblies (*quast*)

In [259]:
%%bash -s "$FILE_ID"
ssh microbioinf@192.168.56.101 env FILE_ID=$1 2>/dev/null /bin/bash <<'EOT'
export PATH=$PATH:/home/microbioinf/miniconda3/bin
cd Documentos/Tema_3
cd kneaddata_out/
echo "#### Compare assemblies FILE_ID=${FILE_ID}"
for assembly in meta-${FILE_ID}-Assembly*; 
    do echo "Processing $assembly file..."; 
    cp ${assembly}/contigs.fasta meta-contigs-${assembly}.fasta
    cp ${assembly}/scaffolds.fasta meta-scaffolds-${assembly}.fasta
done
quast.py meta-contigs* meta-scaffolds*
EOT

#### Compare assemblies FILE_ID=VIR
Processing meta-VIR-Assembly file...
Processing meta-VIR-Assembly25 file...
Processing meta-VIR-Assembly35 file...
Processing meta-VIR-Assembly45 file...
/home/microbioinf/miniconda3/lib/python2.7/site-packages/quast-5.0.2-py2.7.egg-info/scripts/quast.py meta-contigs-meta-VIR-Assembly25.fasta meta-contigs-meta-VIR-Assembly35.fasta meta-contigs-meta-VIR-Assembly45.fasta meta-contigs-meta-VIR-Assembly.fasta meta-scaffolds-meta-VIR-Assembly25.fasta meta-scaffolds-meta-VIR-Assembly35.fasta meta-scaffolds-meta-VIR-Assembly45.fasta meta-scaffolds-meta-VIR-Assembly.fasta

Version: 5.0.2

System information:
  OS: Linux-4.15.0-47-generic-x86_64-with-debian-buster-sid (linux_64)
  Python version: 2.7.11
  CPUs number: 3

Started: 2019-04-26 16:12:55

Logging to /home/microbioinf/Documentos/Tema_3/kneaddata_out/quast_results/results_2019_04_26_16_12_55/quast.log
NOTICE: Maximum number of threads is set to 1 (use --threads option to set it manually)

CWD: /home

In [260]:
data = """
cat Documentos/Tema_3/kneaddata_out/quast*/latest/report.tsv
EOT
"""
output = !ssh microbioinf@192.168.56.101 /bin/bash <<'EOT' {data}
data = []
# To list of lists
for row in output:
    data.append(row.split('\t'))
# To dataframe
df_quast_meta = pd.DataFrame(data[1:], columns=data[0])
df_quast_meta.style.hide_index().set_properties(**{'text-align': 'rigth', 'font-family' : 'courier', 'color' : 'darkgreen', "font-size" : "10pt"}).\
set_properties(**{'text-align': 'left', 'font-family' : 'courier', 'color' : 'darkblue', "font-size" : "10pt"}, \
               subset=['Assembly'])

Assembly,meta_contigs_meta_VIR_Assembly25,meta_contigs_meta_VIR_Assembly35,meta_contigs_meta_VIR_Assembly45,meta_contigs_meta_VIR_Assembly,meta_scaffolds_meta_VIR_Assembly25,meta_scaffolds_meta_VIR_Assembly35,meta_scaffolds_meta_VIR_Assembly45,meta_scaffolds_meta_VIR_Assembly
# contigs (>= 0 bp),5184.0,4744.0,4405.0,4503.0,5125.0,4696.0,4360.0,4466.0
# contigs (>= 1000 bp),419.0,388.0,360.0,395.0,418.0,390.0,357.0,392.0
# contigs (>= 5000 bp),24.0,27.0,27.0,32.0,27.0,28.0,30.0,33.0
# contigs (>= 10000 bp),10.0,9.0,12.0,11.0,11.0,8.0,11.0,10.0
# contigs (>= 25000 bp),3.0,3.0,3.0,4.0,3.0,4.0,4.0,5.0
# contigs (>= 50000 bp),1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0
Total length (>= 0 bp),3231524.0,3080704.0,2919721.0,3153756.0,3233154.0,3081960.0,2920711.0,3154396.0
Total length (>= 1000 bp),1091752.0,1059009.0,1006115.0,1124948.0,1119353.0,1086350.0,1027274.0,1144212.0
Total length (>= 5000 bp),370024.0,402336.0,385353.0,460108.0,400553.0,416267.0,415499.0,482429.0
Total length (>= 10000 bp),282350.0,284826.0,293309.0,324571.0,298340.0,286198.0,293409.0,324671.0
