In [115]:
import glob
import os

import pandas as pd

# Acquiring and Creating Benchmarking Assets

## Downloading Files Profiling Contigs and Abundances

We can use the authors' [FigShare](https://figshare.com/articles/dataset/Benchmark_datasets/11409360) link to download the specific benchmarking assets that they describe. These assets include precomputed tables describing sample taxonomy and contig mapping. These derived tables are what we need to compute metrics such as precision and recall.
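
To make precision and recall concrete, here is a minimal sketch that scores one predicted bin against one reference genome by counting contigs. The contig names and the `precision_recall` helper are hypothetical and are not part of VAMB's benchmarking code, which may weight contigs differently (for example, by length).

```python
def precision_recall(bin_contigs, genome_contigs):
    """Score a predicted bin against a reference genome, counting contigs.

    precision = fraction of the bin's contigs that belong to the genome
    recall    = fraction of the genome's contigs that the bin recovered
    """
    bin_contigs, genome_contigs = set(bin_contigs), set(genome_contigs)
    true_positives = len(bin_contigs & genome_contigs)
    precision = true_positives / len(bin_contigs) if bin_contigs else 0.0
    recall = true_positives / len(genome_contigs) if genome_contigs else 0.0
    return precision, recall


# Hypothetical contig names, for illustration only.
predicted_bin = ["c1", "c2", "c3", "c9"]
reference_genome = ["c1", "c2", "c3", "c4", "c5"]
precision, recall = precision_recall(predicted_bin, reference_genome)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

Three of the four binned contigs belong to the genome (precision 0.75), and three of the genome's five contigs were recovered (recall 0.60).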

Taking the CAMI Airways dataset as an example, we can see that it ships with the abundance and contig files needed for benchmarking.

In [5]:
#!wget https://ndownloader.figshare.com/files/20294736
#!tar xvf 20294736
#!mv data benchmarking_data

In [7]:
!ls benchmarking_data/airways

abundance.npz  contigs.fna.gz


## Creating Files Profiling Contigs and Abundances

### To-Do

## Creating `clusters.tsv` from Samples

As mentioned elsewhere, the standard procedure to prepare data for VAMB is to create a metagenome assembly from each sample's reads and then concatenate all of the resulting contigs. Fortunately, CAMI has already done this for the Airways dataset.

Next, we map each sample's reads back onto this aggregate catalogue, producing one BAM file per sample. We do this part ourselves with the Airways samples.

In [46]:
STUDY = 'example_input_data/cami_challenge/airways/short_read'
CATALOGUE = 'example_input_data/cami_challenge/airways/short_read/gsa.fasta.gz'
CATALOGUE_INDEX = 'benchmarking_data/airways/catalogue.mmi'
OUTPUT_BAM_DIR = 'benchmarking_data/airways'

#### Step 1: Locate the FASTA Contig Catalogue and Index It

In [33]:
!~/miniconda3/envs/vamb_env/bin/minimap2 -d $CATALOGUE_INDEX $CATALOGUE

[M::mm_idx_gen::28.042*1.70] collected minimizers
[M::mm_idx_gen::33.851*1.92] sorted minimizers
[M::main::37.692*1.83] loaded/built the index for 1971278 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1971278
[M::mm_idx_stat::38.453*1.81] distinct minimizers: 79724225 (44.64% are singletons); average occurrences: 4.459; average spacing: 5.452
[M::main] Version: 2.17-r941
[M::main] CMD: /home/pathinformatics/miniconda3/envs/vamb_env/bin/minimap2 -d benchmarking_data/airways/catalogue.mmi example_input_data/cami_challenge/airways/short_read/gsa.fasta.gz
[M::main] Real time: 40.017 sec; CPU: 70.304 sec; Peak RSS: 11.618 GB


#### Step 2: Map Each Sample's Reads Back Onto the Catalogue

In [45]:
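# One reads file per sample, laid out as <STUDY>/<sample>/reads/<file>.fq[.gz]
input_fastq_files = sorted(glob.glob(os.path.join(STUDY,'*','reads','*.fq*')))

for reads_file in input_fastq_files:
    # Name each BAM after the sample directory, two levels above the reads file
    sample = os.path.basename(os.path.dirname(os.path.dirname(reads_file)))
    output_bam = os.path.join(OUTPUT_BAM_DIR, sample + '.bam')
    # -F 3584 drops QC-failed, duplicate and supplementary alignments
    !~/miniconda3/envs/vamb_env/bin/minimap2 \
        -t $(nproc) \
        -N 50 -ax sr $CATALOGUE_INDEX \
        $reads_file | samtools view -F 3584 -b --threads 8 > $output_bam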

[M::main::3.904*0.88] loaded/built the index for 1971278 target sequence(s)
[M::mm_mapopt_update::3.904*0.88] mid_occ = 1000
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1971278
[M::mm_idx_stat::4.665*0.90] distinct minimizers: 79724225 (44.64% are singletons); average occurrences: 4.459; average spacing: 5.452
[M::worker_pipeline::14.342*22.50] mapped 333334 sequences
[M::worker_pipeline::16.581*24.13] mapped 333334 sequences
[M::worker_pipeline::21.158*26.51] mapped 333334 sequences
[M::worker_pipeline::25.463*27.97] mapped 333334 sequences
[M::worker_pipeline::30.135*29.09] mapped 333334 sequences
[M::worker_pipeline::34.898*29.93] mapped 333334 sequences
[M::worker_pipeline::38.986*30.47] mapped 333334 sequences
[M::worker_pipeline::43.622*30.97] mapped 333334 sequences
[M::worker_pipeline::48.158*31.37] mapped 333334 sequences
[M::worker_pipeline::52.475*31.68] mapped 333334 sequences
[M::worker_pipeline::57.511*31.99] mapped 333334 sequences
[M::worker_pipeline::62.
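
The `-F 3584` argument passed to `samtools view` in the mapping cell excludes any alignment that has at least one of the given flag bits set. The sketch below decodes that number using the bit values from the SAM specification; the `decode_flag` helper and the short labels are ours, for illustration.

```python
# Flag bits as defined in the SAM specification (labels are informal).
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary",
    0x200: "QC fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """Return the labels of every flag bit set in `value`."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


print(decode_flag(3584))  # ['QC fail', 'duplicate', 'supplementary']
```

Since 3584 = 512 + 1024 + 2048, the filter removes QC-failed, duplicate, and supplementary alignments while keeping secondary ones.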

#### Run VAMB

Running VAMB with its default hyperparameters produces a `clusters.tsv` file, which we can use to establish a baseline benchmark for performance.
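
As a sketch of how `clusters.tsv` might be consumed downstream, the snippet below groups contigs by cluster. It assumes two tab-separated columns (cluster name, then contig name) with no header row; that layout is an assumption to verify against the VAMB version in use. The `read_clusters` helper and the example names are hypothetical.

```python
import io
from collections import defaultdict


def read_clusters(handle):
    """Map each cluster name to the set of contig names assigned to it."""
    clusters = defaultdict(set)
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cluster, contig = line.split("\t")
        clusters[cluster].add(contig)
    return dict(clusters)


# Hypothetical file contents, for illustration only.
example = io.StringIO("cluster_1\tcontigA\ncluster_1\tcontigB\ncluster_2\tcontigC\n")
clusters = read_clusters(example)
for name, contigs in sorted(clusters.items()):
    print(name, len(contigs))
```

With a real run, `open(os.path.join(vamb_output, 'clusters.tsv'))` would replace the in-memory example.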

In [58]:
#!rm -r $vamb_output

In [60]:
vamb_output = os.path.join(STUDY, 'vamb_output')
bamfiles = os.path.join(OUTPUT_BAM_DIR, '*.bam')

!~/miniconda3/envs/vamb_env_gpu/bin/vamb --cuda --outdir $vamb_output --fasta $CATALOGUE --bamfiles $bamfiles -o C --minfasta 200000

[E::idx_find_and_load] Could not retrieve index file for 'benchmarking_data/airways/2017.12.04_18.56.22_sample_10.bam'
[E::idx_find_and_load] Could not retrieve index file for 'benchmarking_data/airways/2017.12.04_18.56.22_sample_4.bam'
[E::idx_find_and_load] Could not retrieve index file for 'benchmarking_data/airways/2017.12.04_18.56.22_sample_27.bam'
[E::idx_find_and_load] Could not retrieve index file for 'benchmarking_data/airways/2017.12.04_18.56.22_sample_12.bam'
[E::idx_find_and_load] Could not retrieve index file for 'benchmarking_data/airways/2017.12.04_18.56.22_sample_23.bam'
[E::idx_find_and_load] Could not retrieve index file for 'benchmarking_data/airways/2017.12.04_18.56.22_sample_26.bam'
[E::idx_find_and_load] Could not retrieve index file for 'benchmarking_data/airways/2017.12.04_18.56.22_sample_11.bam'
[E::idx_find_and_load] Could not retrieve index file for 'benchmarking_data/airways/2017.12.04_18.56.22_sample_7.bam'
[E::idx_find_and_load] Could not retrieve index fi

In [63]:
f"Completes training in roughly {round((12170/60)/60, 2)} hours"

'Completes training in roughly 3.38 hours'

#### Run CheckM on Outputs

In [88]:
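raw_bins = os.path.join(vamb_output, 'bins')  # FASTA bins written by VAMB
mod_bins = os.path.join(vamb_output, 'bins_mod')  # cleaned copies for CheckM
os.makedirs(mod_bins, exist_ok=True)
for inputfile in glob.glob(os.path.join(raw_bins, '*')):
    outputfile = os.path.join(mod_bins, os.path.basename(inputfile))
    !sed -e 's/\r$//' $inputfile > $outputfile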
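
The `sed` expression above removes a trailing carriage return from each line, converting Windows-style line endings to Unix ones. A pure-Python equivalent, as a minimal sketch (the `strip_carriage_returns` helper is hypothetical):

```python
def strip_carriage_returns(text):
    """Remove a single carriage return from the end of every line."""
    lines = text.split("\n")
    return "\n".join(line[:-1] if line.endswith("\r") else line for line in lines)


# Hypothetical FASTA record with Windows-style line endings.
windows_fasta = ">contig_1\r\nACGT\r\n"
cleaned = strip_carriage_returns(windows_fasta)
print(repr(cleaned))  # '>contig_1\nACGT\n'
```

Carriage returns that are not at the end of a line are left untouched, matching the `\r$` anchor in the `sed` pattern.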

In [106]:
#!wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz -O util/checkm_data_2015_01_16.tar.gz
#!tar xzvf util/checkm_data_2015_01_16.tar.gz -C util
data_root = os.path.join(os.getcwd(), 'util')

!~/miniconda3/envs/vamb_env/bin/checkm data setRoot $data_root

[2021-01-27 15:47:22] INFO: CheckM v1.1.3
[2021-01-27 15:47:22] INFO: checkm data setRoot /home/pathinformatics/jupyter_projects/vamb/stanford_cs230_project/util
[2021-01-27 15:47:22] INFO: [CheckM - data] Check for database updates. [setRoot]

Path [/home/pathinformatics/jupyter_projects/vamb/stanford_cs230_project/util] exists and you have permission to write to this folder.
(re) creating manifest file (please be patient).


In [125]:
outfile = os.path.join(vamb_output, 'checkm_outfile')
bins = os.path.join(vamb_output, 'bins_mod')
log = os.path.join(vamb_output, 'log')
outdir = os.path.join(vamb_output, 'checkm_outdir')

!~/miniconda3/envs/vamb_env/bin/checkm lineage_wf --tab_table -f $outfile -t $(nproc) -x fna $bins $outdir 2>$log

[2021-01-27 16:52:22] INFO: CheckM v1.1.3
[2021-01-27 16:52:22] INFO: checkm lineage_wf --tab_table -f example_input_data/cami_challenge/airways/short_read/vamb_output/checkm_outfile -t 36 -x fna example_input_data/cami_challenge/airways/short_read/vamb_output/bins_mod example_input_data/cami_challenge/airways/short_read/vamb_output/checkm_outdir
[2021-01-27 16:52:22] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2021-01-27 16:52:26] INFO: Identifying marker genes in 435 bins with 36 threads:
[2021-01-27 16:55:45] INFO: Saving HMM info to file.
[2021-01-27 16:55:45] INFO: Calculating genome statistics for 435 bins with 36 threads:
[2021-01-27 16:55:49] INFO: Extracting marker genes to align.
[2021-01-27 16:55:49] INFO: Parsing HMM hits to marker genes:
[2021-01-27 16:56:01] INFO: Extracting 43 HMMs with 36 threads:
[2021-01-27 16:56:01] INFO: Aligning 43 marker genes with 36 threads:
[2021-01-27 16:56:02] INFO: Reading marker alignment files.
[2021-01-27 16:56:02] INFO:

In [126]:
pd.read_csv('example_input_data/cami_challenge/airways/short_read/vamb_output/checkm_outfile', sep ='\t')

Unnamed: 0,Bin Id,Marker lineage,# genomes,# markers,# marker sets,0,1,2,3,4,5+,Completeness,Contamination,Strain heterogeneity
0,PC1000009,root (UID1),5656,56,24,56,0,0,0,0,0,0.0,0.0,0.0
1,PC100045,root (UID1),5656,56,24,56,0,0,0,0,0,0.0,0.0,0.0
2,PC1018414,root (UID1),5656,56,24,56,0,0,0,0,0,0.0,0.0,0.0
3,PC1023906,root (UID1),5656,56,24,56,0,0,0,0,0,0.0,0.0,0.0
4,PC1024126,root (UID1),5656,56,24,56,0,0,0,0,0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
430,PC988286,root (UID1),5656,56,24,56,0,0,0,0,0,0.0,0.0,0.0
431,PC99,root (UID1),5656,56,24,56,0,0,0,0,0,0.0,0.0,0.0
432,PC99075,root (UID1),5656,56,24,56,0,0,0,0,0,0.0,0.0,0.0
433,PC991817,root (UID1),5656,56,24,56,0,0,0,0,0,0.0,0.0,0.0
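
One way to use this table is to keep only bins that pass quality thresholds. The sketch below reads a tab-separated CheckM table with the standard library and filters on the `Completeness` and `Contamination` columns. The thresholds (at least 90% complete, at most 5% contaminated) are a common convention that we assume here, and the `high_quality_bins` helper and example rows are hypothetical.

```python
import csv
import io


def high_quality_bins(handle, min_completeness=90.0, max_contamination=5.0):
    """Return the 'Bin Id' of every row that passes both thresholds."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Bin Id"]
        for row in reader
        if float(row["Completeness"]) >= min_completeness
        and float(row["Contamination"]) <= max_contamination
    ]


# Hypothetical rows, for illustration only.
example = io.StringIO(
    "Bin Id\tCompleteness\tContamination\n"
    "bin_a\t97.5\t1.2\n"
    "bin_b\t95.0\t12.0\n"
    "bin_c\t40.0\t0.0\n"
)
selected = high_quality_bins(example)
print(selected)  # ['bin_a']
```

Note that every row shown in the table above reports 0.0 completeness, so none of those bins would pass such a filter.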


In [71]:
clusters_tsv = os.path.join(vamb_output, 'clusters.tsv')
refpath = ''

!python ~/vamb_env_gpu_utils/vamb/src/cmd_benchmark.py $vamb_output $clusters_tsv $refpath

usage: cmd_benchmark.py [--tax TAXPATH] [-m] [-s SEPARATOR] [--disjoint]
                        vambpath clusterspath refpath

Command-line benchmark utility.

positional arguments:
  vambpath       Path to vamb directory
  clusterspath   Path to clusters.tsv
  refpath        Path to reference file

optional arguments:
  --tax TAXPATH  Path to taxonomic maps
  -m             Minimum size of bins [200000]
  -s SEPARATOR   Binsplit separator
  --disjoint     Enforce disjoint clusters
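
The usage text above suggests that the script did not receive all three positional arguments; in the cell, `refpath` is still an empty string. A small pre-flight check along these lines could catch that before the script is launched. This is a sketch only, and the `missing_benchmark_inputs` helper is hypothetical rather than part of VAMB.

```python
import os


def missing_benchmark_inputs(vambpath, clusterspath, refpath):
    """Return the names of positional arguments that are empty or do not exist."""
    args = {"vambpath": vambpath, "clusterspath": clusterspath, "refpath": refpath}
    return [name for name, path in args.items() if not path or not os.path.exists(path)]


# The current directory exists; the two empty strings are reported as missing.
problems = missing_benchmark_inputs(os.getcwd(), "", "")
print(problems)  # ['clusterspath', 'refpath']
```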
