Copangraph is designed to represent the pan-metagenomic content across multiple metagenomic samples as a sequence graph. By doing so, it provides the foundation of a "comparative metagenomic" framework for associating microbial genomic signatures with host (or microbiome) phenotypes.
- Anaconda-2025-06
Copangraph is implemented in C++20. To download, clone copangraph with
git clone https://github.com/korem-lab/copangraph.git
Installation depends on conda; all other dependencies are installed in the copangraph conda environment.
To construct the copangraph conda environment, first enter the repository directory:
cd copangraph
From here, run
conda env create --file environments/copangraph.yaml
which creates the copangraph environment, installing all dependencies
required for compilation and runtime into it.
Please consult environments/copangraph.yaml for a complete list of
all dependencies with version numbers.
To compile copangraph, activate the conda environment
conda activate copangraph
and then compile with
snakemake -c 1 -s compile_code.smk
This runs a snakemake routine that compiles copangraph from source. On a standard laptop, compilation takes under 5 minutes. Successful compilation produces four executables in the following directory structure:
./bin/
    release/
        copangraph
        extension
    debug/
        copangraph
        extension
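As a quick sanity check after compilation, the sketch below tests that each of the four executables listed above exists and is executable. The function name check_build is hypothetical and not part of copangraph; it assumes only the bin/ layout shown above, relative to the repository root.

```shell
# Hypothetical helper (not part of copangraph): verify the compiled binaries.
# $1 is the repository root; the bin/ layout is the one listed in this README.
check_build() {
    root=$1
    status=0
    for exe in release/copangraph release/extension debug/copangraph debug/extension; do
        if [ -x "$root/bin/$exe" ]; then
            echo "ok       bin/$exe"
        else
            echo "missing  bin/$exe"
            status=1
        fi
    done
    # Non-zero exit status if any executable is missing.
    return "$status"
}
```

From the repository root, this would be called as check_build . and returns a non-zero status if any executable is missing.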
Copangraph requires paired-end extended contigs, one set constructed for each sample, as input. Paired-end extended contigs are constructed for each sample as follows:
# First assemble each sample with a metagenomic assembler. Currently, we support MEGAHIT:
megahit -t 4 -1 sampleA_1.fastq.gz -2 sampleA_2.fastq.gz -o sampleA_asm
# Next, construct an index from the sample's assembly and map the sample's reads to the assembly, writing the mappings to bam format.
bowtie2-build --threads 4 sampleA_asm/final.contigs.fa sampleA_asm/sampleA_idx
bowtie2 --threads 4 -x sampleA_asm/sampleA_idx -1 sampleA_1.fastq.gz -2 sampleA_2.fastq.gz | samtools view -@ 2 -bS -h - > sampleA_asm/sampleA_mapping.bam
# Next, sort the mappings by read name.
samtools sort -n -@ 4 -o sampleA_asm/sampleA_sorted_mapping.bam sampleA_asm/sampleA_mapping.bam
# Then run paired-end extension.
copangraph/bin/release/extension -t 4 -i sampleA_asm/final.contigs.fa -b sampleA_asm/sampleA_sorted_mapping.bam --pe-only -o extended_contigs -n sampleA
# This will result in a file, ./extended_contigs/sampleA.pe_ext.fasta, which can be input into copangraph.
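The steps above are run once per sample. As a minimal sketch for handling several samples, the hypothetical function below prints the same five commands with the sample name substituted, so they can be reviewed before being run. It assumes each sample's reads are named <sample>_1.fastq.gz and <sample>_2.fastq.gz, as in the sampleA example, and that sample names contain no whitespace.

```shell
# Hypothetical helper (not part of copangraph): print the per-sample commands
# from this README for the sample named in $1. Nothing is executed here.
emit_sample_commands() {
    s=$1
    cat <<EOF
megahit -t 4 -1 ${s}_1.fastq.gz -2 ${s}_2.fastq.gz -o ${s}_asm
bowtie2-build --threads 4 ${s}_asm/final.contigs.fa ${s}_asm/${s}_idx
bowtie2 --threads 4 -x ${s}_asm/${s}_idx -1 ${s}_1.fastq.gz -2 ${s}_2.fastq.gz | samtools view -@ 2 -bS -h - > ${s}_asm/${s}_mapping.bam
samtools sort -n -@ 4 -o ${s}_asm/${s}_sorted_mapping.bam ${s}_asm/${s}_mapping.bam
copangraph/bin/release/extension -t 4 -i ${s}_asm/final.contigs.fa -b ${s}_asm/${s}_sorted_mapping.bam --pe-only -o extended_contigs -n ${s}
EOF
}
```

For example, redirecting the output of emit_sample_commands sampleA and emit_sample_commands sampleB into a script file gives a plain list of commands that can be inspected and then run with sh -e.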
We have included a small demo file, which we use to show how to run copangraph and to explain its output.
This demo runs quickly (under 5 minutes). Note that, to construct a multi-sample copangraph, -s should not be given the path
of a single paired-end extended fasta file. Instead, pass a file listing the absolute paths of the paired-end extended fasta files,
one file path per line.
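One way to build such a list file is sketched below. The function name make_sample_list and the output file name are hypothetical; the sketch assumes all paired-end extended fasta files sit in one directory (such as extended_contigs from the steps above) and end in .pe_ext.fasta.

```shell
# Hypothetical helper (not part of copangraph): list every *.pe_ext.fasta file
# in a directory by absolute path, one per line, for use with -s.
make_sample_list() {
    ext_dir=$1
    out_file=$2
    # Resolve the directory to an absolute path; this runs in a subshell, so
    # the caller's working directory is unchanged.
    abs_dir=$(cd -- "$ext_dir" >/dev/null && pwd) || return 1
    for f in "$abs_dir"/*.pe_ext.fasta; do
        # Skip the unexpanded pattern when no file matches.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done > "$out_file"
}
```

After running make_sample_list extended_contigs sample_list.txt, the resulting sample_list.txt would be passed to -s in place of the single fasta path used in the demo command.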
To run copangraph on the demo, execute
./bin/release/copangraph -s demo/simple_test.pe_ext.fasta -g demo -o demo/ -t 2 -d 0.02
This runs copangraph with two threads (-t 2), collapsing homologous sequences at 98% sequence identity
(-d 0.02; 1 - 0.02 = 0.98). Once the demo completes, the following output files will be written to the demo folder:
- demo.gfa : the copangraph in gfa format.
- demo.fasta : a multi-fasta file, each element being a sequence assigned to a copangraph node.
- demo.ncolor.gfa : the node (color) occurrence file, describing the sample occurrence in each node.
- demo.ecolor.gfa : the edge (color) occurrence file, describing the sample occurrence in each edge.
- demo.log : a log file recording copangraph's output.
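To get a first impression of the resulting graph, the sketch below counts the nodes and edges in a gfa file. It assumes demo.gfa follows the standard GFA convention of tab-separated records in which S lines are segments (nodes) and L lines are links (edges); this README does not specify the exact gfa dialect, so treat that as an assumption. The function name gfa_summary is hypothetical.

```shell
# Hypothetical helper (not part of copangraph): count segment (S) and
# link (L) records in a GFA file given as $1.
# Assumes standard tab-separated GFA records.
gfa_summary() {
    awk -F '\t' '
        $1 == "S" { nodes++ }
        $1 == "L" { edges++ }
        END { printf "nodes=%d edges=%d\n", nodes, edges }
    ' "$1"
}
```

For the demo, this would be called as gfa_summary demo/demo.gfa.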
Coleman I. et al. Comparative metagenomics using pan-metagenomic graphs. bioRxiv (2025). https://doi.org/10.1101/2025.09.07.674724