# Preparing Data for VAMB

Metagenome binning is a complex and challenging problem, and there is no absolute definition of what makes a binner good or bad in every case. We therefore start by listing the assumptions the VAMB authors make, which we also assume to hold here. In general, the authors' workflow is as follows:

* Assemble each sample's reads into contigs, one sample at a time, using a dedicated metagenomic assembler. The authors recommend metaSPAdes.

* Concatenate the contigs/scaffolds to a single FASTA file, making sure that the FASTA headers are all unique.

* Remove contigs < 2000 bp from the FASTA file.

* Map the reads from each sample to the FASTA file. Set the minimum accepted mapping identity threshold to be similar to the identity threshold at which you want to bin. Do not sort the BAM files; if they are already sorted, sort them again by read name. Do not filter on MAPQ, and output all secondary alignments.

* Run Vamb with default parameters.

* After binning, split the bins according to the sample they originated from. In this way, you can bin using co-abundance across samples, while still seeing microdiversity from sample to sample.
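
The concatenation, filtering and splitting steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the helper names `concatenate_samples` and `split_bin_by_sample` are hypothetical, and the `S<sample>C<contig>` header scheme is an assumption inferred from the contig names that appear later in this notebook (for example `S0C11`).

```python
from collections import defaultdict


def concatenate_samples(samples, min_length=2000):
    """Build one catalogue from several per-sample assemblies.

    `samples` maps a sample name to a list of (contig_id, sequence) pairs.
    Contigs shorter than `min_length` are dropped. Each kept contig gets a
    header of the form S<sample index>C<contig id>, so headers stay unique
    across samples. Hypothetical helper, not part of VAMB.
    """
    catalogue = []
    for sample_index, records in enumerate(samples.values()):
        for contig_id, sequence in records:
            if len(sequence) < min_length:
                continue
            catalogue.append((f"S{sample_index}C{contig_id}", sequence))
    return catalogue


def split_bin_by_sample(bin_contigs, separator="C"):
    """Split one bin's contig names by the sample prefix before `separator`."""
    by_sample = defaultdict(list)
    for name in bin_contigs:
        sample, _, _ = name.partition(separator)
        by_sample[sample].append(name)
    return dict(by_sample)


samples = {
    "sample_a": [("1", "A" * 2500), ("2", "A" * 100)],  # second contig is too short
    "sample_b": [("1", "C" * 3000)],
}
catalogue = concatenate_samples(samples)
print([header for header, _ in catalogue])            # ['S0C1', 'S1C1']
print(split_bin_by_sample(["S0C1", "S1C1", "S0C7"]))  # {'S0': ['S0C1', 'S0C7'], 'S1': ['S1C1']}
```

Encoding the sample in each header is what makes the final split possible: contigs from all samples are binned together, and the prefix recovers which sample each one came from.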

In [6]:
import glob
import os

import numpy as np
import pysam
import vamb

In [7]:
BASE_DIR = os.getcwd()

# Step 1: Calculate Tetranucleotide Frequencies for Input Sequence Data

1a: Filter contigs by size using vamb.vambtools.filtercontigs, keeping only those > 2000 bp

1b: Map reads to contigs to obtain a BAM file

1c: Calculate the TNF of the contigs using vamb.parsecontigs

### 1a: Filter contigs by size

In [38]:
EXAMPLE_FASTA_FILE = '2021.01.26_15.46.45_sample_0'

In [39]:
input_contigs_fasta = os.path.join(BASE_DIR, f"example_input_data/new_simulations/camisim_outputs/{EXAMPLE_FASTA_FILE}/contigs/anonymous_gsa.fasta.gz")
output_filtered_contigs_fasta = os.path.join(BASE_DIR, f"example_input_data/new_simulations/camisim_outputs/{EXAMPLE_FASTA_FILE}/contigs/anonymous_gsa_filtered.fasta.gz")

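# Reader opens plain or gzipped files; the input must be opened in binary mode.
# Note: the output is written as plain text even though its name ends in .gz.
with vamb.vambtools.Reader(input_contigs_fasta, 'rb') as inputfile:
    with open(output_filtered_contigs_fasta, 'w') as outputfile:
        vamb.vambtools.filtercontigs(inputfile, outputfile, minlength=2000)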

In [40]:
!head -n 10 $output_filtered_contigs_fasta

>S0C11
ACACAAAACTTTTTTTAAGATATCACGTTAAGAAAAATGCTAGGCTGTCCGAGTATAGGC
AAGCATCAAGGTTTGGTAATTTGCTCAAATGATTGTCAACAGAGTGTCTAGGACCAGGTT
TAAAAGATTATAATTCTTAAGGTCTTATTCGTTTATAAAAAATGAATACTCTTTTAAAAT
CTTAATTGAAATGAATAGGAGAAGTTTTCGTGAAAAAGTTAATCATTATTCCTGCTTACA
ATGAAAGCAGTAATATTGTCAATACTATACGTACTATTGAATCAGATGCCCCGGATTTTG
ACTATATCATTATTGATGATTGCTCAACGGATAATACGTTAGCAATATGTCAAAAACAGG
GGTTCAATGTTATTTCTTTGCCCATTAACCTGGGAATTGGCGGTGCGGTGCAAACTGGCT
ATCGTTATGCACAAAGATGTGGATATGACGTTGCAGTTCAAGTAGATGGAGATGGTCAGC
ACAATCCATGCTATTTGGAAAAAATGGTTGAGGTATTAGTTCAATCTTCAGTAAATATGG


### 1b: Map reads back onto FASTA catalogue

In [41]:
!~/miniconda3/envs/vamb_env/bin/minimap2 -d example_input_data/new_simulations/catalogue.mmi $output_filtered_contigs_fasta # make index

[M::mm_idx_gen::0.226*1.08] collected minimizers
[M::mm_idx_gen::0.329*1.65] sorted minimizers
[M::main::0.395*1.54] loaded/built the index for 1342 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1342
[M::mm_idx_stat::0.412*1.52] distinct minimizers: 1744629 (97.45% are singletons); average occurrences: 1.042; average spacing: 5.363
[M::main] Version: 2.17-r941
[M::main] CMD: /home/pathinformatics/miniconda3/envs/vamb_env/bin/minimap2 -d example_input_data/new_simulations/catalogue.mmi /home/pathinformatics/jupyter_projects/vamb/stanford_cs230_project/example_input_data/new_simulations/camisim_outputs/2021.01.26_15.46.45_sample_0/contigs/anonymous_gsa_filtered.fasta.gz
[M::main] Real time: 0.440 sec; CPU: 0.639 sec; Peak RSS: 0.109 GB


In [42]:
minimap2_input = f"example_input_data/new_simulations/camisim_outputs/{EXAMPLE_FASTA_FILE}/reads/anonymous_reads.fq.gz"
minimap2_output = f"example_input_data/new_simulations/camisim_outputs/{EXAMPLE_FASTA_FILE}/re_mapped.bam"

!~/miniconda3/envs/vamb_env/bin/minimap2 \
    -t 8 \
    -N 50 \
    -ax sr example_input_data/new_simulations/catalogue.mmi \
    $minimap2_input | samtools view -F 3584 -b --threads 8 > $minimap2_output

[M::main::0.103*1.02] loaded/built the index for 1342 target sequence(s)
[M::mm_mapopt_update::0.103*1.02] mid_occ = 1000
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1342
[M::mm_idx_stat::0.120*1.01] distinct minimizers: 1744629 (97.45% are singletons); average occurrences: 1.042; average spacing: 5.363
[M::worker_pipeline::1.638*4.54] mapped 333334 sequences
[M::worker_pipeline::2.446*4.75] mapped 333314 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: /home/pathinformatics/miniconda3/envs/vamb_env/bin/minimap2 -t 8 -N 50 -ax sr example_input_data/new_simulations/catalogue.mmi example_input_data/new_simulations/camisim_outputs/2021.01.26_15.46.45_sample_0/reads/anonymous_reads.fq.gz
[M::main] Real time: 2.456 sec; CPU: 11.632 sec; Peak RSS: 0.354 GB
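
The `-F 3584` argument to `samtools view` in the cell above is a bitmask of SAM flags to exclude. The sketch below decodes it; the bit meanings follow the SAM specification, while the helper names `decode_flag` and `is_excluded` are hypothetical and exist only for this illustration.

```python
# SAM FLAG bits and their meanings, in ascending bit order.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary",
    0x200: "QC fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all SAM flag bits set in `flag`."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_excluded(read_flag, exclude_mask):
    """Mimic `samtools view -F`: drop a read if any masked bit is set."""
    return bool(read_flag & exclude_mask)


print(decode_flag(3584))        # ['QC fail', 'duplicate', 'supplementary']
print(is_excluded(256, 3584))   # False: secondary alignments are kept
print(is_excluded(2048, 3584))  # True: supplementary alignments are dropped
```

The mask leaves the secondary bit (256) alone, which is consistent with the workflow's requirement to keep all secondary alignments.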


### 1c: Calculate TNFs

In [43]:
# Use Reader to open plain or zipped files. File must be opened in binary mode
with vamb.vambtools.Reader(output_filtered_contigs_fasta, 'rb') as inputfile:
    tnfs, contignames, lengths = vamb.parsecontigs.read_contigs(inputfile)

In [44]:
print('Type of tnfs:', type(tnfs), 'of dtype', tnfs.dtype)
print('Shape of tnfs:', tnfs.shape, end='\n\n')

print('Type of contignames:', type(contignames))
print('Length of contignames:', len(contignames), end='\n\n')

print('First 5 elements of contignames:')
for i in range(5):
    print(contignames[i])

print('\nType of lengths:', type(lengths), 'of dtype', lengths.dtype)
print('Length of lengths:', len(lengths), end='\n\n')

print('First 5 elements of lengths:')
for i in range(5):
    print(lengths[i])

Type of tnfs: <class 'numpy.ndarray'> of dtype float32
Shape of tnfs: (1342, 103)

Type of contignames: <class 'list'>
Length of contignames: 1342

First 5 elements of contignames:
S0C11
S0C61
S0C133
S0C177
S0C201

Type of lengths: <class 'numpy.ndarray'> of dtype int64
Length of lengths: 1342

First 5 elements of lengths:
2537
4228
2008
2299
2476
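
To make the idea of a tetranucleotide frequency concrete, here is a minimal sketch that counts all 256 possible 4-mers in a sequence and normalises the counts. It is an illustration only: `raw_tetranucleotide_frequencies` is a hypothetical helper, and the 103-column representation returned by `vamb.parsecontigs.read_contigs` (see the shape printed above) is a further transformation that this sketch does not reproduce.

```python
from itertools import product

# All 256 possible tetranucleotides in a fixed (alphabetical) order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
TETRAMER_INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def raw_tetranucleotide_frequencies(sequence):
    """Return 256 frequencies, one per tetramer.

    Windows containing anything other than A, C, G or T are skipped.
    The result sums to 1 when at least one valid window exists, and is
    all zeros otherwise. Hypothetical helper, not part of VAMB.
    """
    counts = [0] * len(TETRAMERS)
    sequence = sequence.upper()
    for start in range(len(sequence) - 3):
        index = TETRAMER_INDEX.get(sequence[start:start + 4])
        if index is not None:
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [count / total for count in counts]


# "ACGTACGT" has five windows: ACGT, CGTA, GTAC, TACG, ACGT.
freqs = raw_tetranucleotide_frequencies("ACGTACGT")
print(len(freqs))                     # 256
print(freqs[TETRAMER_INDEX["ACGT"]])  # 0.4
```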


In [45]:
all_sample_bamfiles = glob.glob(f"example_input_data/new_simulations/camisim_outputs/{EXAMPLE_FASTA_FILE}/re_mapped.bam")

In [46]:
# for bamfile in all_sample_bamfiles:
#     sorted_bamfile = bamfile.replace('.bam','.sorted.bam').replace('/bam','/sorted_bam')

#     test_head = pysam.AlignmentFile(bamfile, 'rb')
#     indexer = test_head.header['HD']['SO']

#     if indexer != 'queryname':
#         print('sorting bam file')
#         if not os.path.exists(os.path.dirname(sorted_bamfile)):
#             os.mkdir(os.path.dirname(sorted_bamfile))
#         pysam.sort("-n",  "-o", sorted_bamfile, bamfile)

#     test_head.close()

# all_sorted_sample_bamfiles = glob.glob('example_input_data/new_simulations/camisim_outputs/2021.01.26_04.04.06_sample_0/sorted_bam/*.bam')

# Step 2: Calculate RPKM from BAM files

In [47]:
rpkms = vamb.parsebam.read_bamfiles(all_sample_bamfiles) 
print('Type of rpkms:', type(rpkms), 'of dtype', rpkms.dtype)
print('Shape of rpkms', rpkms.shape)

Type of rpkms: <class 'numpy.ndarray'> of dtype float32
Shape of rpkms (1342, 1)
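
RPKM stands for reads per kilobase of contig per million mapped reads. The sketch below implements the textbook definition for a single sample, to show what each entry of the array represents. The function name `rpkm` is hypothetical, and `vamb.parsebam.read_bamfiles` may differ in details (for example in how multi-mapping reads are weighted) that are not reproduced here.

```python
def rpkm(read_counts, contig_lengths):
    """Textbook RPKM for one sample.

    RPKM_i = reads_i * 1e9 / (length_i * total_mapped_reads),
    i.e. reads per kilobase of contig per million mapped reads.
    Hypothetical helper, not VAMB's implementation.
    """
    total = sum(read_counts)
    if total == 0:
        return [0.0] * len(read_counts)
    return [
        count * 1e9 / (length * total)
        for count, length in zip(read_counts, contig_lengths)
    ]


# Two contigs with equal read counts: the shorter one has the higher RPKM.
print(rpkm([500_000, 500_000], [2000, 4000]))  # [250000.0, 125000.0]
```

Dividing by contig length and sequencing depth makes abundances comparable across contigs of different sizes and samples of different depths, which is what co-abundance binning relies on.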


# Write out assets for use in VAMB

In [52]:
vamb_inputs_base = os.path.join(BASE_DIR, 'example_input_data/new_simulations/camisim_outputs/vamb_inputs')

# Create the output directory (and any missing parents) if it does not exist.
os.makedirs(vamb_inputs_base, exist_ok=True)


with open(os.path.join(vamb_inputs_base, 'contignames.npz'), 'wb') as file:
    vamb.vambtools.write_npz(file, np.array(contignames))

with open(os.path.join(vamb_inputs_base, 'lengths.npz'), 'wb') as file:
    vamb.vambtools.write_npz(file, lengths)

with open(os.path.join(vamb_inputs_base, 'tnfs.npz'), 'wb') as file:
    vamb.vambtools.write_npz(file, tnfs)
    
with open(os.path.join(vamb_inputs_base, 'rpkms.npz'), 'wb') as file:
    vamb.vambtools.write_npz(file, rpkms)
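
Since the four arrays must describe the same contigs in the same order, a quick consistency check can catch mismatched inputs. The following is a minimal sketch; `check_vamb_inputs` is a hypothetical helper, not part of VAMB, and it only relies on `len()`, so it works on lists as well as NumPy arrays.

```python
def check_vamb_inputs(contignames, lengths, tnfs, rpkms):
    """Return the number of contigs if all inputs agree, else raise ValueError.

    Checks that every input has one row per contig name and that the
    contig names are unique. Hypothetical helper, not part of VAMB.
    """
    n_contigs = len(contignames)
    for label, array in (("lengths", lengths), ("tnfs", tnfs), ("rpkms", rpkms)):
        if len(array) != n_contigs:
            raise ValueError(f"{label} has {len(array)} rows, expected {n_contigs}")
    if len(set(contignames)) != n_contigs:
        raise ValueError("contig names are not unique")
    return n_contigs


# Toy inputs standing in for the real arrays.
n = check_vamb_inputs(
    ["S0C11", "S0C61"],          # contig names
    [2537, 4228],                # lengths
    [[0.1, 0.2], [0.3, 0.4]],    # one TNF row per contig
    [[1.0], [2.0]],              # one RPKM row per contig
)
print(n)  # 2
```

In this notebook the call would be `check_vamb_inputs(contignames, lengths, tnfs, rpkms)`.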