This program generates core-gene alignments from a list of assemblies. It downloads the genomic sequences from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ and re-annotates them using Prokka. It then uses Roary to generate the pan-genome and extracts the core genome, which is the set of genes that appear in all the assemblies. The protein sequences of each core gene are aligned with MUSCLE, and the alignments are then back-translated to DNA sequences.
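The back-translation step can be illustrated with a minimal Python sketch. It is not the program's own implementation: the function name `back_translate` is hypothetical, and the sketch assumes each protein was translated codon by codon from its coding sequence, with an optional trailing stop codon in the DNA.

```python
def back_translate(aligned_protein: str, dna: str, gap: str = "-") -> str:
    """Thread an unaligned coding sequence onto its aligned protein sequence.

    Hypothetical helper for illustration: every residue in the aligned
    protein is replaced by its original codon, every gap by three gaps.
    """
    if len(dna) % 3 != 0:
        raise ValueError("DNA length is not a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    # Accept one extra codon: a stop codon that is absent from the protein.
    if len(codons) not in (n_residues, n_residues + 1):
        raise ValueError("DNA length does not match the protein sequence")
    out = []
    k = 0  # index of the next unused codon
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)


if __name__ == "__main__":
    # Toy input, not real data.
    print(back_translate("M-KV", "ATGAAAGTT"))
```

Because the original codons are reused rather than re-derived from the amino acids, synonymous differences between assemblies are preserved in the DNA alignment.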
The program was written in Bash, Go and Python. It requires the following programs:
and Python libraries:
pip install --user tqdm biopython
and Go libraries:
go get -u github.com/cheggaaa/pb
go get -u github.com/mattn/go-sqlite3
go get -u gopkg.in/alecthomas/kingpin.v2
go get -u github.com/kussell-lab/biogo/seq
A Dockerfile is also provided for building a Docker image (see https://docs.docker.com/ for how to use Docker). The Dockerfile also shows how to install this program on Ubuntu 17.10.
AssemblyAlignmentGenerate <assembly summary file> <accession list file> <output directory> <output prefix>
<assembly summary file> can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt;
<accession list file> contains a list of assembly accessions;
<output directory> contains the results;
<output prefix> is the prefix of the results.
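To show how the two input files relate, here is a hedged Python sketch that looks up the download location of each listed accession in the assembly summary. It is not the program's own code. It assumes the summary is tab-separated with a `#`-prefixed header line naming the columns `assembly_accession` and `ftp_path`; the function name `ftp_paths` and the sample rows are made up for illustration.

```python
def ftp_paths(summary_lines, accessions):
    """Map each wanted accession to its ftp_path from the assembly summary.

    Assumes a '#'-prefixed, tab-separated header line that contains the
    column names 'assembly_accession' and 'ftp_path'.
    """
    wanted = set(accessions)
    header = None
    paths = {}
    for line in summary_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields and "ftp_path" in fields:
                header = fields
            continue
        if header is None or not line:
            continue
        row = dict(zip(header, line.split("\t")))
        if row.get("assembly_accession") in wanted:
            paths[row["assembly_accession"]] = row["ftp_path"]
    return paths


if __name__ == "__main__":
    # Made-up rows with only two columns; the real file has many more.
    summary = [
        "# comment line\n",
        "# assembly_accession\tftp_path\n",
        "GCF_000000001.1\tftp://example.invalid/GCF_000000001.1\n",
        "GCF_000000002.1\tftp://example.invalid/GCF_000000002.1\n",
    ]
    print(ftp_paths(summary, ["GCF_000000002.1"]))
```

Locating the columns by name rather than by position keeps the sketch independent of the exact column order in the summary file.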
The output is an XMFA file containing the final DNA alignments of the core genes. It can be found at <output directory>/<output prefix>_core.xmfa.
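For downstream use, the output can be read with a few lines of Python. The sketch below assumes the common XMFA layout: FASTA-style records (`>` header followed by sequence lines) grouped into alignment blocks, each block terminated by a line starting with `=`. The exact header format this program writes is not described here, so headers are treated as opaque strings, and the function name `read_xmfa` is hypothetical.

```python
def read_xmfa(lines):
    """Yield alignment blocks as lists of (header, sequence) pairs."""
    block = []      # records of the block being read
    header = None   # header of the record being read
    chunks = []     # sequence lines of the record being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">") or line.startswith("="):
            # Both markers end the current record, if there is one.
            if header is not None:
                block.append((header, "".join(chunks)))
            header, chunks = None, []
            if line.startswith(">"):
                header = line[1:].strip()
            elif block:
                # '=' closes the current alignment block.
                yield block
                block = []
        elif line.startswith("#"):
            continue  # comment or format line
        else:
            chunks.append(line)
    # Handle a final record or block that lacks a closing '='.
    if header is not None:
        block.append((header, "".join(chunks)))
    if block:
        yield block


if __name__ == "__main__":
    # Toy input, not real output of the program.
    text = ">g1 a\nATG\nAAA\n>g2 a\nATG\nAAG\n=\n>g1 b\nCCC\n>g2 b\nCCT\n=\n"
    for records in read_xmfa(text.splitlines()):
        print(records)
```

With a real file, pass an open file handle instead of `text.splitlines()`; each yielded block then corresponds to one aligned core gene under the assumptions above.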