In [1]:
import glob

import pandas as pd

import os

# Acquiring Data

As the authors mention in the paper, they used 4 training and 2 hold-out datasets.

### 4 Training datasets:

* MetaHIT error-free (n=264)

* Sample-specific assemblies for CAMI2 Airways (n=10), Oral (n=10), and Urogenital (n=9)

### 2 Hold-out datasets:

* CAMI2 Skin (n=10), Gastrointestinal (n=10)

## MetaHIT Dataset

As the authors describe, they obtained the MetaHIT assembly from https://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/, specifically downloading the files named `depth.txt.gz` and `assembly-filtered.fa.gz`. These have been placed in the MetaHIT_data folder.

In [2]:
metahit_urls = ['https://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/depth.txt.gz',
                'https://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/assembly-filtered.fa.gz']


for metahit_url in metahit_urls:
    !wget $metahit_url -P example_input_data/MetaHIT_data/

--2021-02-25 12:59:18--  https://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/depth.txt.gz
Resolving portal.nersc.gov (portal.nersc.gov)... 128.55.206.28, 128.55.206.24
Connecting to portal.nersc.gov (portal.nersc.gov)|128.55.206.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 142739748 (136M) [application/x-gzip]
Saving to: ‘example_input_data/MetaHIT_data/depth.txt.gz.2’

depth.txt.gz.2        9%[>                   ]  12.41M  3.34MB/s    eta 48s    ^C
--2021-02-25 12:59:24--  https://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/assembly-filtered.fa.gz
Resolving portal.nersc.gov (portal.nersc.gov)... 128.55.206.28, 128.55.206.24
Connecting to portal.nersc.gov (portal.nersc.gov)|128.55.206.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 333460075 (318M) [application/x-gzip]
Saving to: ‘example_input_data/MetaHIT_data/assembly-filtered.fa.gz.2’

     assembly-filte   0%[                    ]   1.65M  1.50MB/s      

Note that the abundance table used for VAMB was obtained from the Kang et al. MetaBAT publication (https://peerj.com/articles/1165/). It appears to be the `depth.txt` file that we have here, which is typically generated from BAM files using the MetaBAT2 script `jgi_summarize_bam_contig_depths`.

In [3]:
#!gzip -dk example_input_data/MetaHIT_data/depth.txt.gz

contig_depths = pd.read_csv('example_input_data/MetaHIT_data/depth.txt', sep='\t')
contig_depths.head(3)

Unnamed: 0,contigName,contigLen,totalAvgDepth,all_291-shred_2500_31_rand_94521.fa-ERR011087.fastq.fix.fastq.sam.bam,all_291-shred_2500_31_rand_94521.fa-ERR011087.fastq.fix.fastq.sam.bam-var,all_291-shred_2500_31_rand_94521.fa-ERR011088.fastq.fix.fastq.sam.bam,all_291-shred_2500_31_rand_94521.fa-ERR011088.fastq.fix.fastq.sam.bam-var,all_291-shred_2500_31_rand_94521.fa-ERR011089.fastq.fix.fastq.sam.bam,all_291-shred_2500_31_rand_94521.fa-ERR011089.fastq.fix.fastq.sam.bam-var,all_291-shred_2500_31_rand_94521.fa-ERR011090.fastq.fix.fastq.sam.bam,...,all_291-shred_2500_31_rand_94521.fa-ERR011346.fastq.fix.fastq.sam.bam,all_291-shred_2500_31_rand_94521.fa-ERR011346.fastq.fix.fastq.sam.bam-var,all_291-shred_2500_31_rand_94521.fa-ERR011347.fastq.fix.fastq.sam.bam,all_291-shred_2500_31_rand_94521.fa-ERR011347.fastq.fix.fastq.sam.bam-var,all_291-shred_2500_31_rand_94521.fa-ERR011348.fastq.fix.fastq.sam.bam,all_291-shred_2500_31_rand_94521.fa-ERR011348.fastq.fix.fastq.sam.bam-var,all_291-shred_2500_31_rand_94521.fa-ERR011349.fastq.fix.fastq.sam.bam,all_291-shred_2500_31_rand_94521.fa-ERR011349.fastq.fix.fastq.sam.bam-var,all_291-shred_2500_31_rand_94521.fa-ERR011350.fastq.fix.fastq.sam.bam,all_291-shred_2500_31_rand_94521.fa-ERR011350.fastq.fix.fastq.sam.bam-var
0,gi|224815735|ref|NZ_ACGB01000001.1|_[Acidamino...,5871,97.1261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.42251,2.25269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,gi|224815735|ref|NZ_ACGB01000001.1|_[Acidamino...,2500,98.2171,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.51658,5.72159,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,gi|224815735|ref|NZ_ACGB01000001.1|_[Acidamino...,2500,102.339,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.11267,1.98925,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
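
The depth table interleaves one mean-depth column per BAM file with a matching `-var` (variance) column, after the `contigName`, `contigLen`, and `totalAvgDepth` columns. A minimal sketch of separating the two groups is shown below. It runs on a toy table: the contig names, the shortened sample names, and the helper `split_depth_columns` are hypothetical and only illustrate the column layout.

```python
import pandas as pd

# Toy stand-in for depth.txt; contig and sample names are hypothetical.
toy_depths = pd.DataFrame({
    "contigName": ["contig_1", "contig_2"],
    "contigLen": [5871, 2500],
    "totalAvgDepth": [97.1261, 98.2171],
    "sampleA.bam": [0.0, 2.5],
    "sampleA.bam-var": [0.0, 5.7],
    "sampleB.bam": [1.4, 0.0],
    "sampleB.bam-var": [2.3, 0.0],
})


def split_depth_columns(depths):
    """Return (mean_depths, variances), each indexed by contig name."""
    summary_cols = {"contigLen", "totalAvgDepth"}
    indexed = depths.set_index("contigName")
    # Variance columns carry the "-var" suffix; the rest are per-sample means.
    var_cols = [c for c in indexed.columns if c.endswith("-var")]
    mean_cols = [
        c for c in indexed.columns
        if c not in summary_cols and c not in var_cols
    ]
    return indexed[mean_cols], indexed[var_cols]


mean_depths, variances = split_depth_columns(toy_depths)
print(mean_depths)
```

The same function should apply to `contig_depths` loaded above, assuming every per-sample column follows this naming pattern.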


## CAMI Datasets

As described on the CAMI page [here](https://data.cami-challenge.org/participate), there are 5 URLs from which we can procure CAMI datasets:

Airways: https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Airways

Gastrointestinal tract: https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Gastrointestinal_tract

Oral cavity: https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Oral

Skin: https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Skin

Urogenital tract: https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Urogenital_tract


### A note about how data is organized within CAMI:

* Sample folders start with the date of creation and end with the sample number: yyyy.mm.dd_hh.mm.ss_sample_#

* In every sample folder there are three subfolders, bam, contigs and reads. The bam folder contains the mapping of all the created reads to the input genomes:

* Inside this folder is a bam file for every genome for which at least one read was produced, which is uniquely indicated by a combination of OTU and a running ID counter for the number of genomes included in that OTU in the sample: OTU_ID.bam. 

* The contigs folder contains the gold standard assembly for that particular sample. It contains two files: the gold standard in fasta format, anonymous_gsa.fasta.gz, and the mapping of each contig to its genome/taxon id and position in this genome, gsa_mapping.tsv.gz.

* The reads folder contains the created reads for that sample. It contains two files, one with the fq reads themselves, containing both ends for paired end sequencing and with anonymised names, anonymous_reads.fq.gz, and the second one is a mapping of every single read to the genome it originated from and the original read ID (pre anonymisation), reads_mapping.tsv.gz.

* Every data set contains one abundance file per sample mapping OTUs to genomes: abundance#.tsv

* Every data set contains the pooled gold standard assembly over all samples in the folder: anonymous_gsa_pooled.fasta.gz
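
The sample folder naming convention above can be parsed to recover the creation time and the sample number. The sketch below assumes folder names match `yyyy.mm.dd_hh.mm.ss_sample_#` exactly; the pattern and the helper `parse_sample_dir` are illustrative and not part of any CAMI tooling.

```python
import re
from datetime import datetime

# Matches names such as 2017.12.04_18.56.22_sample_12
SAMPLE_DIR_PATTERN = re.compile(
    r"^(?P<stamp>\d{4}\.\d{2}\.\d{2}_\d{2}\.\d{2}\.\d{2})_sample_(?P<number>\d+)$"
)


def parse_sample_dir(name):
    """Return (creation datetime, sample number), or None if the name does not match."""
    match = SAMPLE_DIR_PATTERN.match(name)
    if match is None:
        return None
    created = datetime.strptime(match.group("stamp"), "%Y.%m.%d_%H.%M.%S")
    return created, int(match.group("number"))


created, sample_number = parse_sample_dir("2017.12.04_18.56.22_sample_12")
print(created, sample_number)
```

Folders that are not sample folders (for example `genomes`) return `None`, which makes it easy to filter a directory listing down to samples.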

In [4]:
cami_urls = {
    'Airways': 'https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Airways',
    'Gastrointestinal tract': 'https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Gastrointestinal_tract',
    'Oral cavity': 'https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Oral',
    'Skin': 'https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Skin',
    'Urogenital tract': 'https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Urogenital_tract'
}

In [5]:
# List Files
url = cami_urls['Airways']

#!java -jar example_input_data/cami_challenge/camiClient.jar -l $url

In [None]:
!java -jar util/camiClient.jar -d $url example_input_data/cami_challenge/airways -p short_read

Downloading example_input_data/cami_challenge/airways/short_read/genomes/GCA_000466785.3_ASM46678v3.fa
Downloading example_input_data/cami_challenge/airways/short_read/2017.12.04_18.56.22_sample_12/bam/OTU_97.45281.0.bam
Downloading example_input_data/cami_challenge/airways/short_read/2017.12.04_18.56.22_sample_9/bam/OTU_97.34725.0.bam.bai
Downloading example_input_data/cami_challenge/airways/short_read/2017.12.04_18.56.22_sample_27/bam/OTU_97.25219.0.bam
Downloading example_input_data/cami_challenge/airways/short_read/2017.12.04_18.56.22_sample_10/bam/OTU_97.40239.1.bam
Downloading example_input_data/cami_challenge/airways/short_read/2017.12.04_18.56.22_sample_26/bam/OTU_97.196.0.bam
Downloading example_input_data/cami_challenge/airways/short_read/2017.12.04_18.56.22_sample_4/bam/OTU_97.38699.0.bam
Downloading example_input_data/cami_challenge/airways/short_read/2017.12.04_18.56.22_sample_4/bam/OTU_97.30815.0.bam.bai
Downloading example_input_data/cami_challenge/airways/short_read/201

## General Workflow for Adding New Data

We can use CAMISIM to generate new synthetic metagenome files.

First, pull the CAMISIM Docker image: `docker pull cami/camisim:latest`

CAMISIM accepts many types of simulation parameters, but we keep it simple for now: we simulate single-timepoint examples, and we do it *de novo*.

### Steps for genome simulation:

* Data processing: CAMISIM will remove any sequences from the provided assemblies that are shorter than 1000 bases and will check that sequences contain only valid characters: A, C, T, G, and certain ambiguous base encodings (RYWSMKHBVDN).

* The input we need to provide is a taxonomic profile in .biom format or .ini format
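
The data-processing step described above can be approximated before running the simulation, to catch problem genomes early. This is a minimal sketch of the two stated rules (length threshold and allowed characters), not CAMISIM's own implementation. The function names, the case-insensitive comparison, and the choice to raise an error on invalid characters are assumptions made for illustration.

```python
# Allowed characters as listed above: the four bases plus ambiguity codes.
VALID_BASES = set("ACTG" + "RYWSMKHBVDN")
MIN_LENGTH = 1000


def is_valid_sequence(sequence):
    """True if every character is an allowed base code (case-insensitive, by assumption)."""
    return set(sequence.upper()) <= VALID_BASES


def filter_sequences(records, min_length=MIN_LENGTH):
    """Drop sequences shorter than min_length and reject invalid characters.

    `records` maps sequence names to sequence strings.
    """
    kept = {}
    for name, sequence in records.items():
        if len(sequence) < min_length:
            continue  # too short: removed, as described above
        if not is_valid_sequence(sequence):
            raise ValueError(f"{name} contains characters outside {sorted(VALID_BASES)}")
        kept[name] = sequence
    return kept


example_records = {
    "short_contig": "ACGT" * 10,    # 40 bases, dropped
    "long_contig": "ACGTN" * 250,   # 1250 bases, kept
}
kept_records = filter_sequences(example_records)
print(list(kept_records))
```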

In [12]:
input_folder = os.path.join(os.getcwd(), 'example_input_data/new_simulations/camisim_inputs')
output_folder = os.path.join(os.getcwd(), 'example_input_data/new_simulations/camisim_outputs')

input_directory = f"{input_folder}:/input:rw"
output_directory = f"{output_folder}:/output:rw"

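# Docker is run through sudo here, with the password piped to `sudo -S`.
# Avoid committing a real password; adding the user to the `docker` group removes the need for sudo.
!echo UTMBpath123 | sudo -S docker run --rm -v $input_directory -v $output_directory  \
    cami/camisim:latest  metagenomesimulation.py /input/mini_config.ini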

[sudo] password for pathinformatics: 2021-02-25 19:02:04 INFO: [MetagenomeSimulationPipeline] Metagenome simulation starting
2021-02-25 19:02:04 INFO: [MetagenomeSimulationPipeline] Validating Genomes
2021-02-25 19:02:04 INFO: [MetadataReader] Reading file: '/input/genome_to_id.tsv'
2021-02-25 19:02:25 INFO: [MetagenomeSimulationPipeline] Design Communities
2021-02-25 19:02:25 INFO: [CommunityDesign] Drawing strains.
2021-02-25 19:02:25 INFO: [MetadataReader 82380310288] Reading file: '/input/metadata.tsv'
2021-02-25 19:02:25 INFO: [MetadataReader 28547833133] Reading file: '/input/genome_to_id.tsv'
2021-02-25 19:02:25 INFO: [CommunityDesign] Validating raw sequence files!
2021-02-25 19:02:47 INFO: [NcbiTaxonomy] Building taxonomy tree...
2021-02-25 19:02:47 INFO: [NcbiTaxonomy] Reading 'nodes' file:	'/tmp/tmpsBgzMr/NCBI/nodes.dmp'
2021-02-25 19:02:58 INFO: [NcbiTaxonomy] Reading 'names' file:	'/tmp/tmpsBgzMr/NCBI/names.dmp'
2021-02-25 19:02:59 INFO: [NcbiTaxonomy] Reading 'merged' fil

# Notes on CAMISIM Inputs in this example

## Simulation Input Files
CAMISIM expects the config file to point it to a set of .fasta files and a genome_to_id.tsv file. Here, we define those tables elsewhere and copy them into the CAMISIM inputs folder for simulation; otherwise, we would have to generate these files ourselves. Similarly, it requires a metadata.tsv file, which we also bring in here, as shown in the log output:

```
[sudo] password for pathinformatics: 2021-02-25 19:02:04 INFO: [MetagenomeSimulationPipeline] Metagenome simulation starting
2021-02-25 19:02:04 INFO: [MetagenomeSimulationPipeline] Validating Genomes
2021-02-25 19:02:04 INFO: [MetadataReader] Reading file: '/input/genome_to_id.tsv'
2021-02-25 19:02:25 INFO: [MetagenomeSimulationPipeline] Design Communities
2021-02-25 19:02:25 INFO: [CommunityDesign] Drawing strains.
2021-02-25 19:02:25 INFO: [MetadataReader 82380310288] Reading file: '/input/metadata.tsv'
2021-02-25 19:02:25 INFO: [MetadataReader 28547833133] Reading file: '/input/genome_to_id.tsv'
2021-02-25 19:02:25 INFO: [CommunityDesign] Validating raw sequence files!
2021-02-25 19:02:47 INFO: [NcbiTaxonomy] Building taxonomy tree...
2021-02-25 19:02:47 INFO: [NcbiTaxonomy] Reading 'nodes' file:	'/tmp/tmpsBgzMr/NCBI/nodes.dmp'
2021-02-25 19:02:58 INFO: [NcbiTaxonomy] Reading 'names' file:	'/tmp/tmpsBgzMr/NCBI/names.dmp'
2021-02-25 19:02:59 INFO: [NcbiTaxonomy] Reading 'merged' file:	'/tmp/tmpsBgzMr/NCBI/merged.dmp'
2021-02-25 19:02:59 INFO: [NcbiTaxonomy] Done (13.0s)
2021-02-25 19:02:59 INFO: [MetagenomeSimulationPipeline] Move Genomes
```

Without these prepared tables, we would have to define them from a collection of new strain genomes.

## Running Reads Simulation

Second, read simulation is, like many bioinformatics tools, fastidious about its dependencies, so we run it in a Docker container with the following command:

```
docker run --rm -v $input_directory -v $output_directory  \
    cami/camisim:latest  metagenomesimulation.py /input/mini_config.ini
```

Note that we map the host folder for `camisim_inputs` to the internal folder `/input` in Docker and we also map the host folder `camisim_outputs` to the internal folder `/output`:

```
input_folder = os.path.join(os.getcwd(), 'example_input_data/new_simulations/camisim_inputs')
output_folder = os.path.join(os.getcwd(), 'example_input_data/new_simulations/camisim_outputs')

input_directory = f"{input_folder}:/input:rw"
output_directory = f"{output_folder}:/output:rw"
```

And this is why the config file references `/input` folders for the genome_to_id and metadata files.
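
As a sketch of the same volume mapping outside of notebook shell magic, the command can be assembled as an argument list and passed to `subprocess.run`. The helper `build_camisim_command` is hypothetical; the image name, script name, and mount points are the ones used above. The block only builds and prints the command, it does not start Docker.

```python
import os


def build_camisim_command(input_folder, output_folder, config_name="mini_config.ini"):
    """Assemble the docker invocation with host folders mapped to /input and /output."""
    return [
        "docker", "run", "--rm",
        "-v", f"{os.path.abspath(input_folder)}:/input:rw",
        "-v", f"{os.path.abspath(output_folder)}:/output:rw",
        "cami/camisim:latest",
        "metagenomesimulation.py", f"/input/{config_name}",
    ]


command = build_camisim_command(
    "example_input_data/new_simulations/camisim_inputs",
    "example_input_data/new_simulations/camisim_outputs",
)
print(" ".join(command))
```

To execute it, one could call `subprocess.run(command, check=True)`, assuming the current user is allowed to run Docker.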

## Final Note on Input Files

Note the structure of the genome_to_id.tsv file. Whatever genomes we select, the path column must start with the correct path inside the container, /input/genomes. The file must also be written WITHOUT headers or indexes, or CAMISIM will throw an error.
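
A minimal sketch of writing such a table with pandas follows. The column names `genome_ID` and `path` are placeholders chosen for this example (the file itself carries no header), and the table is written to an in-memory buffer rather than to the real inputs folder.

```python
import io

import pandas as pd

# Two example rows; the paths use the container-side /input/genomes prefix.
genome_to_id = pd.DataFrame({
    "genome_ID": ["Genome15.0", "Genome14.0"],
    "path": [
        "/input/genomes/GCA_000227705.3_ASM22770v3.fa",
        "/input/genomes/GCA_000006785.2_ASM678v2.fa",
    ],
})

# header=False and index=False keep the file to two bare tab-separated columns.
buffer = io.StringIO()
genome_to_id.to_csv(buffer, sep="\t", header=False, index=False)
tsv_text = buffer.getvalue()
print(tsv_text)

# Reading it back requires header=None so the first genome is not taken as a header.
round_trip = pd.read_csv(
    io.StringIO(tsv_text), sep="\t", header=None, names=["genome_ID", "path"]
)
```

To write the real file, pass a file path to `to_csv` in place of the buffer.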

The simulation produces many output files, but we only need the contigs file and the reads files. Note that it also produces an alignment BAM file for each of the input genomes used.

In [19]:
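# The file has no header row, so pandas uses the first record as column names here;
# pass header=None to read every row as data.
pd.read_csv('example_input_data/new_simulations/camisim_inputs/genome_to_id.tsv', sep='\t').head(3)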

Unnamed: 0,Genome15.0,/input/genomes/GCA_000227705.3_ASM22770v3.fa
0,Genome14.0,/input/genomes/GCA_000006785.2_ASM678v2.fa
1,Genome11.0,/input/genomes/GCA_000255115.3_ASM25511v3.fa
2,Genome24.0,/input/genomes/GCA_000025905.1_ASM2590v1.fa


In [20]:
pd.read_csv('example_input_data/new_simulations/camisim_inputs/metadata.tsv', sep='\t').head(3)

Unnamed: 0,genome_ID,OTU,NCBI_ID,novelty_category
0,Genome15.0,377615,717605,known_strain
1,Genome14.0,1314,160490,known_strain
2,Genome11.0,885581,646529,known_strain


In [22]:
!cat .gitignore

.DS_Store
benchmark_data/
example_input_data/cami_challenge/
example_input_data/MetaHIT_data/
example_input_data/mosaic_challenge/
example_input_data/new_simulations/camisim_outputs/
example_input_data/new_simulations/catalogue.mmi


