Merge pull request #33 from neherlab/feat/finalize-paper
## Changes:

- added [mmseqs2](https://github.com/soedinglab/MMseqs2) as an alternative alignment kernel that guarantees higher sensitivity at the expense of longer computational time, see [#33](#33).
- updated Docker file to include mmseqs2 in the container.
- updated the documentation, including discussion of alignment kernel sensitivities and examples of application of PanGraph to plasmids by [@liampshaw](https://github.com/neherlab/pangraph/commits?author=liampshaw).
- errors that occur in worker threads are now emitted on the main thread, see [#25](#25).
- fixed a bug in detransitive, see this [commit](a965132)
- added snakemake pipeline in the `script` folder to perform the analysis published in our [paper](https://github.com/neherlab/pangraph#citing).
- added `-K` option to the `build` command to control kmer length for mmseqs aligner, see this [commit](0857c36).
mmolari committed Oct 12, 2022
2 parents 3ea0fa1 + b2fc96b commit c544bab
Showing 101 changed files with 10,224 additions and 1,604 deletions.
2 changes: 2 additions & 0 deletions .dockerignore
@@ -15,3 +15,5 @@
/pangraph
/pangraph.tar.gz
/vendor
/playgrounds
/script
1 change: 0 additions & 1 deletion .gitattributes
@@ -1,4 +1,3 @@
*.fa filter=lfs diff=lfs merge=lfs -text
*.fna* filter=lfs diff=lfs merge=lfs -text
*.json filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
6 changes: 6 additions & 0 deletions .github/workflows/build.yml
@@ -52,6 +52,12 @@ jobs:
make docker
- name: 'Run tests'
run: |
set -euxo pipefail
make docker-test
- name: 'Login to DockerHub'
if: ${{ github.ref_type == 'tag' }}
run: |
12 changes: 12 additions & 0 deletions .gitignore
@@ -13,6 +13,18 @@ pangraph
pangraph.tar.gz
bin
tutorial
playgrounds
.vscode
script/synthetic_data
script/panx_data
script/.snakemake
script/projections
script/size-benchmark
script/incremental_size
script/panx_results
script/local_scripts
__pycache__
docker_test

deps/minimap2/build
deps/minimap2/products
52 changes: 52 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,52 @@
# PanGraph Changelog

## v0.6.0

- added [mmseqs2](https://github.com/soedinglab/MMseqs2) as an alternative alignment kernel that guarantees higher sensitivity at the expense of longer computational time, see [#33](https://github.com/neherlab/pangraph/pull/33).
- updated Docker file to include mmseqs2 in the container.
- updated the documentation, including discussion of alignment kernel sensitivities and examples of application of PanGraph to plasmids by [@liampshaw](https://github.com/neherlab/pangraph/commits?author=liampshaw).
- errors that occur in worker threads are now emitted on the main thread, see [#25](https://github.com/neherlab/pangraph/pull/25).
- fixed a bug when using the `mash` option, see this [commit](https://github.com/neherlab/pangraph/commit/2167c2e9f72b2962ef2e2b9ec1fbe0e16fe0f568).
- fixed a bug in detransitive, see this [commit](https://github.com/neherlab/pangraph/commit/a9651323aba2822d1b1c380a086fae4216c8030d).
- added snakemake pipeline in the `script` folder to perform the analysis published in our [paper](https://github.com/neherlab/pangraph#citing).
- added `-K` option to the `build` command to control kmer length for mmseqs aligner, see this [commit](https://github.com/neherlab/pangraph/commit/0857c36c7c8d11d53e8efab91cf5d18c35685a6e).

## v0.5.0

- fix: error with gfa export of fully duplicated paths by @mmolari in [#19](https://github.com/neherlab/pangraph/pull/19)
- GFA export bug fixes by @nnoll in [#28](https://github.com/neherlab/pangraph/pull/28)
- chore: add docker container by @ivan-aksamentov in [#27](https://github.com/neherlab/pangraph/pull/27)
- fix: deal with zero length blocks getting added to segment by @nnoll in [#20](https://github.com/neherlab/pangraph/pull/20)

[Full changelog](https://github.com/neherlab/pangraph/compare/v0.4.1...0.5.0)

## v0.4.1

- Smaller binaries: Artifacts now pulled in as needed.

## v0.4.0

- Marginalize command now (optionally) takes list of strains to project onto.
- Command line arguments and flags can now be mixed in order.
- Export can filter out any duplications.

## v0.3.0

Added command line options:
- Build: `-u` => force sequences to uppercase letters
- Polish: `-c` => preserve case (uses MAFFT command line flag)
Additionally, removed bug associated with sequences mapping as empty intervals to blocks.

Lastly, large improvement to the algorithm's multicore usage by balancing the initial guide tree.

## v0.2.1

Modified CLI to accept input on standard input for all subcommands. This allows for a nicer chaining of pangraph functions from the shell. Additionally, many small bugs were fixed.

## v0.1-alpha

Source code bundled as a relocatable application. Currently only for Linux-based operating systems but intend to release for MacOSX as well.

## Citing

[^1]
47 changes: 28 additions & 19 deletions Dockerfile
@@ -5,29 +5,36 @@ FROM debian:11 as builder

SHELL ["bash", "-c"]


RUN set -euxo pipefail \
&& export DEBIAN_FRONTEND=noninteractive \
&& apt-get update -qq --yes \
&& apt-get install -qq --no-install-recommends --yes \
&& export DEBIAN_FRONTEND=noninteractive \
&& apt-get update -qq --yes \
&& apt-get install -qq --no-install-recommends --yes \
build-essential \
ca-certificates \
curl \
mafft \
make \
mash \
>/dev/null \
&& apt-get autoremove --yes >/dev/null \
&& apt-get clean autoclean >/dev/null \
&& rm -rf /var/lib/apt/lists/*
>/dev/null \
&& apt-get autoremove --yes >/dev/null \
&& apt-get clean autoclean >/dev/null \
&& rm -rf /var/lib/apt/lists/*

RUN set -euxo pipefail >/dev/null \
&& curl -sSL -o "mmseqs-linux.tar.gz" "https://github.com/soedinglab/MMseqs2/releases/download/13-45111/mmseqs-linux-sse2.tar.gz" \
&& tar xf "mmseqs-linux.tar.gz" -C . 2>/dev/null \
&& mv "mmseqs/bin/mmseqs" "/usr/bin/mmseqs" \
&& chmod +x "/usr/bin/mmseqs" \
&& rm "mmseqs-linux.tar.gz" \
&& rm -r "mmseqs"

ENV PATH="/build_dir/bin:/build_dir/vendor/julia/bin:$PATH"

COPY . /build_dir/

RUN set -euxo pipefail \
&& cd /build_dir \
&& make
&& cd /build_dir \
&& make


# Stage: production image
@@ -42,22 +49,24 @@ COPY --from=builder /root/.julia/artifacts /root/.julia/artifacts
COPY --from=builder /root/.julia/conda/3/bin /root/.julia/conda/3/bin
COPY --from=builder /root/.julia/conda/3/lib /root/.julia/conda/3/lib

COPY --from=builder /usr/bin/mmseqs /usr/bin/mmseqs

SHELL ["bash", "-c"]

RUN set -euxo pipefail \
&& export DEBIAN_FRONTEND=noninteractive \
&& apt-get update -qq --yes \
&& apt-get install -qq --no-install-recommends --yes \
&& export DEBIAN_FRONTEND=noninteractive \
&& apt-get update -qq --yes \
&& apt-get install -qq --no-install-recommends --yes \
mafft \
mash \
>/dev/null \
&& apt-get autoremove --yes >/dev/null \
&& apt-get clean autoclean >/dev/null \
&& rm -rf /var/lib/apt/lists/*
>/dev/null \
&& apt-get autoremove --yes >/dev/null \
&& apt-get clean autoclean >/dev/null \
&& rm -rf /var/lib/apt/lists/*

# Allows non-root users to read dependencies
RUN set -euxo pipefail \
&& chmod -R +r /root/ \
&& chmod +x /root/
&& chmod -R +r /root/ \
&& chmod +x /root/

CMD ["/usr/bin/pangraph"]
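
The MMseqs2 installation step added to the builder stage above (unpack the release tarball, move the binary into place, make it executable, clean up) can also be reproduced outside Docker. The following is only an illustrative sketch: the function name `install_mmseqs` and its destination-directory argument are hypothetical and not part of the repository, and the tarball is assumed to have been downloaded already.

```bash
# Install the mmseqs binary from an already-downloaded release tarball.
# Hypothetical helper: name and arguments are illustrative only.
install_mmseqs() {
    tarball="$1"   # e.g. mmseqs-linux.tar.gz
    dest="$2"      # e.g. "$HOME/.local/bin"

    # Unpack into a scratch directory so the current directory stays clean.
    workdir="$(mktemp -d)" || return 1
    tar xf "$tarball" -C "$workdir" || return 1

    # The release archive is expected to contain mmseqs/bin/mmseqs,
    # as in the Dockerfile step above.
    mkdir -p "$dest" || return 1
    mv "$workdir/mmseqs/bin/mmseqs" "$dest/mmseqs" || return 1
    chmod +x "$dest/mmseqs"

    rm -r "$workdir"
}

# Possible usage (not executed here because it needs network access):
#   curl -sSL -o mmseqs-linux.tar.gz \
#     "https://github.com/soedinglab/MMseqs2/releases/download/13-45111/mmseqs-linux-sse2.tar.gz"
#   install_mmseqs mmseqs-linux.tar.gz "$HOME/.local/bin"
```

Unlike the Dockerfile, which moves the binary to `/usr/bin`, this sketch takes the destination as an argument so that no root permissions are needed.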
13 changes: 11 additions & 2 deletions Makefile
@@ -67,8 +67,6 @@ release:
clean:
rm -rf pangraph pangraph.tar.gz

include script/rules.mk


export CONTAINER_NAME=neherlab/pangraph

@@ -87,6 +85,17 @@ docker:

docker build --target prod $${DOCKER_TAGS} .

docker-test:
set -euxo pipefail

docker run -i --rm \
--volume="$$(pwd):/workdir" \
--workdir="/workdir" \
--user="$$(id -u):$$(id -g)" \
--ulimit core=0 \
"$${CONTAINER_NAME}:latest" \
bash docs/dev/docker_test.sh

docker-push:
set -euxo pipefail

64 changes: 38 additions & 26 deletions README.md
@@ -6,70 +6,83 @@

> a bioinformatic toolkit to align large sets of closely related genomes into a graph data structure

## Overview

**pangraph** provides both a command line interface, as well as a Julia library, to find homology amongst large collections of closely related genomes.
The core of the algorithm partitions each genome into _pancontigs_ that represent a sequence interval related by vertical descent.
Each genome is then an ordered walk along _pancontigs_; the collection of all genomes form a graph that captures all observed structural diversity.
**pangraph** is a standalone tool useful to parsimoniously infer horizontal gene transfer events within a community; perform comparative studies of genome gain, loss, and rearrangement dynamics; or simply to compress many related genomes.


## Installation

The core algorithm and command line tools are self-contained and require no additional dependencies.
The library is written in and thus requires Julia to be installed on your machine.
Julia binaries for all operating systems can be found [here](https://julialang.org/downloads/).
The library is written in Julia and thus requires [Julia](https://julialang.org/downloads/) to be installed on your machine.

### Library
pangraph is available:
- as a **Julia library**
- as a **Docker container**
- as a relocatable **binary** that can be compiled locally

#### Local Environment
For more extended instructions on installation please refer to the [documentation](https://neherlab.github.io/pangraph/#Installation).

Clone the repository
### Julia Library

To install pangraph as a julia library in a local environment:
```bash
# clone the repository
git clone https://github.com/neherlab/pangraph.git && cd pangraph
# build the package
julia --project=. -e 'using Pkg; Pkg.build()'
```

Build the package. This will create a separate Julia environment for **pangraph**
The library can be accessed directly by entering the REPL:
```bash
julia --project=. -e 'using Pkg; Pkg.build()'
julia --project=.
```

Enter the REPL
Alternatively, command-line functionalities can be accessed by running the main `src/PanGraph.jl` script:
```bash
julia --project=.
# example: build a graph from E.coli genomes
julia --project=. src/PanGraph.jl build -c example_datasets/ecoli.fa.gz > graph.json
```

#### Global Package
Note that to access the complete set of functionalities, the [optional dependencies](#optional-dependencies) must be installed and available in your `$PATH`.


**Important** please do not mix this method with that described above.
Instead of creating a _local_ PanGraph specific environment, this method will install into the Julia base environment.
We recommend, unless for a specific reason, to default to installing within a local environment.
However, if needed, global installation can be achieved by running
### Docker container

PanGraph is available as a Docker container:

```bash
julia -e 'using Pkg; Pkg.add(url="https://github.com/nnoll/minimap2_jll.jl"); Pkg.add(url="https://github.com/neherlab/pangraph.git")'
docker pull neherlab/pangraph:latest
```

The PanGraph package is available globally within the Julia REPL.
See the [documentation](https://neherlab.github.io/pangraph/#Installation) for extended instructions on its usage.


### Relocatable binary
Releases can be obtained from [GitHub](https://github.com/neherlab/pangraph/releases)

Alternatively, **pangraph** can be built locally on your machine by running (inside the cloned repo)
**pangraph** can be built locally on your machine by running (inside the cloned repo)
```bash
export jc="path/to/julia/executable" make pangraph && make install
```
This will build the executable and place a symlink into `bin/`.
**Importantly,** if `jc ` is not explicitly set, it will default to vendor/julia-$VERSION/bin/julia.
If this file does not exist, we will download automatically for the user, provided the host system is Linux or MacOSX.
**Note,** it is recommended by the PackageCompiler.jl documentation to utilize the officially distributed binaries, not those distributed by your Linux distribution.
As such, it may not work if you attempt to do so.
**Importantly,** if `jc` is not explicitly set, it will default to `vendor/julia-$VERSION/bin/julia`. If this file does not exist, it is downloaded automatically, provided the host system is Linux or MacOSX.
Moreover, for the compilation to work, it is necessary to have [MAFFT](https://mafft.cbrc.jp/alignment/software/) and [mmseqs2](https://github.com/soedinglab/MMseqs2) available in your `$PATH`, see [optional dependencies](#optional-dependencies).

**Note,** it is [recommended by the PackageCompiler.jl documentation](https://julialang.github.io/PackageCompiler.jl/stable/#Installation-instructions) to utilize the officially distributed binaries for Julia, not those distributed by your Linux distribution. As such, compilation may not work with a Julia installation provided by your distribution.


### Optional dependencies
**pangraph** can _optionally_ use both [mash](https://github.com/marbl/Mash) and [MAFFT](https://mafft.cbrc.jp/alignment/software/).

**pangraph** can _optionally_ use [mash](https://github.com/marbl/Mash), [MAFFT](https://mafft.cbrc.jp/alignment/software/) or [mmseqs2](https://github.com/soedinglab/MMseqs2), as explained in [the documentation](https://neherlab.github.io/pangraph/#Optional-dependencies).
For full functionality, it is recommended to install these tools and have them available on `$PATH`.

Alternatively, a script `bin/setup-pangraph` is provided to install these tools into `bin/` for Linux-based operating systems.
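
A quick way to see which of the optional tools are visible on `$PATH` is sketched below. The helper name `check_tools` is hypothetical and not part of the repository; the binary names `mash`, `mafft` and `mmseqs` are assumed to match the ones installed by the Dockerfile and the setup script.

```bash
# Report which tools are reachable through $PATH.
# Returns 0 if all of them are found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools mash mafft mmseqs || echo "install the missing tools to enable full functionality"
```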


## Examples

Please refer to the tutorials within the [documentation](https://neherlab.github.io/pangraph/) for an in-depth usage guide.
@@ -80,7 +93,7 @@ Align a multi-fasta `sequence.fa` and realign each _pancontig_ with MAFFT
pangraph build sequence.fa | pangraph polish > graph.json
```

Export a graph `graph.json` to GFA for visualization
Export a graph `graph.json` into `export/pangraph.gfa` as GFA for visualization
```bash
pangraph export graph.json
```
@@ -91,13 +104,12 @@ Output all computed data to directory `pairs`
pangraph marginalize -d pairs graph.json
```

See [Makefile](Makefile) for more real-world examples.

## Citing
PanGraph: scalable bacterial pan-genome graph construction
Nicholas Noll, Marco Molari, Richard Neher
bioRxiv 2022.02.24.481757; doi: https://doi.org/10.1101/2022.02.24.481757


## License

[MIT License](LICENSE)
24 changes: 20 additions & 4 deletions bin/setup-pangraph
@@ -1,15 +1,15 @@
#!/bin/sh -e

os="Linux64"
root=$(realpath ".local")

mkdir -p $root

download()
downloadMash()
{
name="$1"; shift 1
url="$1"; shift 1
tag="$1"; shift 1
os="$1"; shift 1

cd $root

@@ -18,6 +18,21 @@ download()
mv "$name-$os-$tag"/$name "../bin/$name"
}

downloadMMseqs()
{
name="$1"; shift 1
url="$1"; shift 1
tagrel="$1"; shift 1
tagarch="$1"; shift 1
os="$1"; shift 1

cd $root

curl -L -o "$name-$os.tar.gz" "$url/$tagrel/$name-$os-$tagarch.tar.gz"
tar xf "$name-$os.tar.gz" -C . 2>/dev/null
mv "$name/bin/$name" "../bin/$name"
}

build()
{
name="$1"; shift 1
@@ -38,6 +53,7 @@ build()
make install
}

(download "mash" "https://github.com/marbl/Mash/releases/download" "v2.2")
(downloadMash "mash" "https://github.com/marbl/Mash/releases/download" "v2.2" "Linux64")
(downloadMMseqs "mmseqs" "https://github.com/soedinglab/MMseqs2/releases/download" "13-45111" "sse2" "linux")
(build "mafft" "https://mafft.cbrc.jp/alignment/software" "7.490")
# rm -r $root
# rm -r $root
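
To make the argument order of `downloadMMseqs` easier to follow, the sketch below isolates only the URL composition from that function. The helper name `mmseqs_url` is hypothetical; the argument order and the URL pattern are copied from the script above.

```bash
# Compose the MMseqs2 release URL the same way downloadMMseqs does:
#   $url/$tagrel/$name-$os-$tagarch.tar.gz
mmseqs_url() {
    name="$1"      # binary name, e.g. mmseqs
    url="$2"       # base URL of the release downloads
    tagrel="$3"    # release tag, e.g. 13-45111
    tagarch="$4"   # instruction-set variant, e.g. sse2
    os="$5"        # operating system label, e.g. linux
    echo "$url/$tagrel/$name-$os-$tagarch.tar.gz"
}

mmseqs_url "mmseqs" "https://github.com/soedinglab/MMseqs2/releases/download" "13-45111" "sse2" "linux"
# prints https://github.com/soedinglab/MMseqs2/releases/download/13-45111/mmseqs-linux-sse2.tar.gz
```

With the arguments used in the script, the result is the same URL that the Dockerfile in this commit downloads.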
10 changes: 0 additions & 10 deletions docs/Makefile

This file was deleted.
