Merge pull request #391 from pangenome/doc
refresh documentation [skip ci]
AndreaGuarracino committed Apr 11, 2024
2 parents de71ee5 + cd02c0d commit 9c2fd91
Showing 9 changed files with 58 additions and 77 deletions.
36 changes: 18 additions & 18 deletions README.md
Expand Up @@ -36,7 +36,7 @@ Then, you can run `pggb` on each community (set of sequences) independently (see
```bash
pggb -i in.fa \ # input file in FASTA format
-o output \ # output directory
-n 9 \ # number of haplotypes
-n 9 \ # number of haplotypes (optional with PanSN-spec)
-t 16 \ # number of threads
-p 90 \ # minimum average nucleotide identity for segments
-s 5k \ # segment length for scaffolding the graph
Expand All @@ -55,7 +55,7 @@ In the above example, to partition your sequences into communities, execute:
```bash
partition-before-pggb -i in.fa \ # input file in FASTA format
-o output \ # output directory
-n 9 \ # number of haplotypes
-n 9 \ # number of haplotypes (optional with PanSN-spec)
-t 16 \ # number of threads
-p 90 \ # minimum average nucleotide identity for segments
-s 5k \ # segment length for scaffolding the graph
Expand Down Expand Up @@ -93,11 +93,10 @@ A goal of `pggb` is to reduce the complexity of making these alignments, which a

### key parameters

The overall structure of `pggb`'s output graph is defined by three parameters: genome number (`-n`), segment length (`-s`), and pairwise identity (`-p`).
Genome number is a given, but varies in ways that are difficult to infer and is thus left up to the user.
The overall structure of `pggb`'s output graph is defined by two parameters: segment length (`-s`) and pairwise identity (`-p`).
Segment length defines the seed length used by the "MashMap3" homology mapper in `wfmash`.
The pairwise identity is the minimum allowed pairwise identity between seeds, which is estimated using a mash-type approximation based on k-mer Jaccard.
Mappings are initiated from collinear chains of around 5 seeds (`-l, --block-length`), and extended greedily as far as possible, allowing up to `-n` minus 1 mappings at each query position.
Mappings are initiated from collinear chains of around 5 seeds (`-l, --block-length`), and extended greedily as far as possible, allowing up to `-n` minus 1 mappings at each query position. Genome number (`-n`) is automatically computed if sequence names follow [PanSN-spec](https://github.com/pangenome/PanSN-spec).
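To make the relationship between PanSN-spec names and `-n` concrete, here is a minimal sketch of how the haplotype count could be derived by hand. It is not `pggb`'s own implementation: the helper name `count_pansn_haplotypes` is hypothetical, and it assumes an uncompressed FASTA whose sequence names follow the `sample#haplotype#contig` form with `#` as the delimiter.

```bash
# Hypothetical helper (not part of pggb): count distinct sample#haplotype
# prefixes among the FASTA headers of an uncompressed file.
# Assumes PanSN-style names of the form sample#haplotype#contig.
count_pansn_haplotypes() {
  grep '^>' "$1" \
    | sed 's/^>//' \
    | cut -d ' ' -f 1 \
    | cut -d '#' -f 1,2 \
    | sort -u \
    | wc -l \
    | tr -d ' '
}
```

If the names do follow this convention, the result could be passed explicitly, for example `pggb -i in.fa -o output -n "$(count_pansn_haplotypes in.fa)"`, although in that case `-n` can simply be omitted.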

An additional parameter, `-k`, can also greatly affect graph structure by pruning matches shorter than a given threshold from the initial graph model.
In effect, `-k N` removes any match shorter than `N`bp from the initial alignment.
Expand All @@ -113,7 +112,7 @@ Finally, we apply `gfaffix` to remove forks where both alternatives have the sam
### bringing your pangenome into focus

We suggest using default parameters for initial tests.
For instance `pggb -i in.fa.gz -o out1 -t 16 -n 100` would be a minimal build command for a 100-genome pangenome from `in.fa.gz`.
For instance, `pggb -i in.fa.gz -o out` would be a minimal build command for a pangenome from `in.fa.gz`.
The default parameters provide a good balance between runtime and graph quality for small-to-medium (1kbp-100Mbp) problems.

However, we find that parameters may still need to be adjusted to fine-tune `pggb` to a given problem.
Expand All @@ -125,13 +124,13 @@ These parameters must be tuned so that the graph resolves structures of interest
In preparation of a manuscript on `pggb`, we have developed a [set of example pangenome builds for a collection of diverse species](https://github.com/pangenome/pggb-paper/blob/main/workflows/AllSpecies.md#all-species).
(These also use cross-validation against [`nucmer`](https://mummer4.github.io/) to evaluate graph quality.)

Examples:
Examples (`-n` is optional if sequence names follow [PanSN-spec](https://github.com/pangenome/PanSN-spec)):

- Human, whole genome, 90 haplotypes: `pggb -p 98 -s 50k -n 90 -k 79 ...`
- 15 helicobacter genomes, 5% divergence: `pggb -n 15 -k 79`, and 15 at higher (10%) divergence `pggb -n 15 -k 19 -P asm20 ...`
- Yeast genomes, 5% divergence: `pggb`'s defaults should work well, just set `-n`.
- Aligning 9 MHC class II assemblies from vertebrate genomes (5-10% divergence): `pggb -n 9 -k 29 ...`
- A few thousand bacterial genomes `pggb -x auto -n 2146 ...`. In general mapping sparsification (`-x auto`) is a good idea when you have many hundreds to thousands of genomes.
- Human, whole genome, 90 haplotypes: `pggb -p 98 -s 10k -k 47 [-n 90]...`
- 15 helicobacter genomes, 5% divergence: `pggb -k 47 [-n 15]`, and 15 at higher (10%) divergence `pggb -k 23 [-n 15] ...`
- Yeast genomes, 5% divergence: `pggb`'s defaults should work well.
- Aligning 9 MHC class II assemblies from vertebrate genomes (5-10% divergence): `pggb -k 29 [-n 9] ...`
- A few thousand bacterial genomes `pggb -x auto [-n 2000] ...`. In general mapping sparsification (`-x auto`) is a good idea when you have many hundreds to thousands of genomes.

`pggb` defaults to using the number of threads as logical processors on the system (the thread count given by `getconf _NPROCESSORS_ONLN`).
Use `-t` to set an appropriate level of parallelism if you can't use all the processors on your system.
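As a rough illustration of choosing a value for `-t`, the sketch below reads the same processor count mentioned above and optionally caps it. The function name `pick_threads` and the idea of a cap are assumptions made for this example, not something `pggb` provides.

```bash
# Hypothetical helper: print the number of online processors,
# capped at an optional maximum given as the first argument.
pick_threads() {
  local max="${1:-0}"
  local n
  # Fall back to 1 if getconf is unavailable on this system.
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  if [ "$max" -gt 0 ] && [ "$n" -gt "$max" ]; then
    n="$max"
  fi
  echo "$n"
}
```

On a shared machine one might then run something like `pggb -i in.fa.gz -o out -t "$(pick_threads 16)"` to use at most 16 threads.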
Expand All @@ -147,7 +146,7 @@ cd pggb
./pggb -i data/HLA/DRB1-3123.fa.gz -p 70 -s 500 -n 10 -t 16 -V 'gi|568815561' -o out -M
```

This yields a variation graph in GFA format, a multiple sequence alignment in MAF format (`-M`), and several diagnostic images (all in the directory `out/`).
This yields a variation graph in GFA format, a multiple sequence alignment in MAF format (`-M`), and several diagnostic images (all in the directory `out/`). We specify `-n` because the sequences do not follow [PanSN-spec](https://github.com/pangenome/PanSN-spec), so the number of haplotypes cannot be automatically computed.
We also call variants with `-V` with respect to the reference `gi|568815561`.

### 1D graph visualization
Expand Down Expand Up @@ -217,9 +216,10 @@ also rectifies issues with the initial wfa-based alignment.

### manual-mode

You'll need `wfmash`, `seqwish`, `smoothxg`, `odgi`, `gfaffix`, and `vg` in your shell's `PATH`.
You'll need `wfmash`, `seqwish`, `smoothxg`, `odgi`, and `gfaffix` in your shell's `PATH`.
These can be individually built and installed.
Then, put the `pggb` bash script in your path to complete installation.
Optionally, install `bcftools`, `vcfbub`, `vcfwave`, and `vg` for calling and normalizing variants, `MultiQC` for generating summarized statistics in a MultiQC report, or `pigz` to compress the output files of the pipeline.
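Before a first run it can help to confirm that the required tools are actually reachable. The following is a small sketch using the POSIX `command -v` builtin; the function name `missing_tools` is hypothetical and the check only looks at `PATH`, not at tool versions.

```bash
# Hypothetical helper: print each named tool that cannot be found in PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

For the required tools listed above, `missing_tools wfmash seqwish smoothxg odgi gfaffix` prints nothing when everything is installed, and one name per line otherwise.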

### Docker

Expand Down Expand Up @@ -252,7 +252,7 @@ cd pggb
you can run the container using the [human leukocyte antigen (HLA) data](data/HLA) provided in this repo:

```bash
docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
```

The `-v` argument of `docker run` always expects a full path.
Expand Down Expand Up @@ -281,7 +281,7 @@ docker build --target binary -t ${USER}/pggb:latest .
Staying in the `pggb` directory, we can run `pggb` with the locally built image:

```bash
docker run -it -v ${PWD}/data/:/data ${USER}/pggb /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
docker run -it -v ${PWD}/data/:/data ${USER}/pggb /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
```
A script that handles the whole building process automatically can be found at https://github.com/nf-core/pangenome#building-a-native-container.

Expand All @@ -308,7 +308,7 @@ Finally, run `pggb` from the Singularity image.
For Singularity, to be able to read and write files to a directory on the host operating system, we need to 'bind' that directory using the `-B` option and pass the `pggb` command as an argument.

```bash
singularity run -B ${PWD}/data:/data ../pggb_latest.sif pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -V 'gi|568815561' -o /data/out
singularity run -B ${PWD}/data:/data ../pggb_latest.sif pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out
```

A script that handles the whole building process automatically can be found at https://github.com/nf-core/pangenome#building-a-native-container.
Expand Down Expand Up @@ -359,7 +359,7 @@ The docker image already contains v1.11 of `MultiQC`.

## authors

*Garrison E., *Guarracino A., Heumos S., Villani F., Bao Z., Tattini L., Hagmann J., Vorbrugg S., Ashbrook D. G., Thorell K., Chen H., Sudmant P. H., Liti G., Colonna V., Prins P.
Erik Garrison*, Andrea Guarracino*, Simon Heumos, Flavia Villani, Zhigui Bao, Lorenzo Tattini, Jörg Hagmann, Sebastian Vorbrugg, Santiago Marco-Sola, Christian Kubica, David G. Ashbrook, Kaisa Thorell, Rachel L. Rusholme-Pilcher, Gianni Liti, Emilio Rudbeck, Sven Nahnsen, Zuyu Yang, Mwaniki N. Moses, Franklin L. Nobrega, Yi Wu, Hao Chen, Joep de Ligt, Peter H. Sudmant, Nicole Soranzo, Vincenza Colonna, Robert W. Williams, Pjotr Prins, Building pangenome graphs, bioRxiv 2023.04.05.535718; doi: https://doi.org/10.1101/2023.04.05.535718

## license

Expand Down
2 changes: 1 addition & 1 deletion docs/index.rst
Expand Up @@ -34,7 +34,7 @@ Core packages

+ `mashmap <https://github.com/marbl/MashMap>`_ variant for approximate mappings
+ `wavefront-guided <https://github.com/ekg/wflign>`_ global alignment for long sequences
+ `wavefront <https://github.com/smarco/WFA>`_ algorithm for base-level alignment
+ `wavefront <https://github.com/smarco/WFA2-lib>`_ algorithm for base-level alignment
+ Pairwise alignments in `PAF <https://github.com/lh3/miniasm/blob/master/PAF.md>`_ format

* - |seqwish|
Expand Down
19 changes: 10 additions & 9 deletions docs/rst/essential_parameters.rst
Expand Up @@ -15,27 +15,28 @@ Each pangenome is different. We may require different settings to obtain useful
Mapping
-------------------------

``pggb`` requires that the user sets a mapping identity minimum ``-p``, a segment length ``-s``, and a number of secondary mappings ``-n`` per segment.
In ``pggb``, the main parameters shaping the pangenome graph structure are the mapping identity minimum ``-p`` and the segment length ``-s``.
These three parameters passed to ``wfmash`` are essential for establishing the basic structure of the pangenome:

- ``-s[N], --segment-length=[N]`` length of the mapped and aligned segment
- ``-p[%], --map-pct-id=[%]`` percentage identity minimum in the mapping step
- ``-n[N], --n-mappings=[N]`` maximum number of mappings and alignments reported for each segment

These parameters can be set using some prior information about the sequences that you're using.
Crucially, ``--segment-length`` provides a kind of minimum alignment length filter. The ``mashmap`` step in ``wfmash`` will only consider segments of this size,
and require them to have an approximate pairwise identity of at least ``--map-pct-id``. For small pangenome graphs, or where there are few repeats, ``--segment-length``
can be set low ( for example 3000 as in :ref:`quick_start_example`).
However, for larger contexts, with repeats, it can be very important to set this high (for instance 100000 in the case of human genomes).
can be set low (for example 3000 as in :ref:`quick_start_example` for the MHC pangenome graph).
However, for larger contexts, with repeats, it can be useful to set this high (for instance, 10-20 kbp in the case of human genomes).
A long segment length ensures that we represent long collinear regions of the input sequences in the structure of the graph.
In general, this should at least be larger than transposon and other common repeats in your pangenome.
By default, ``wfmash`` only keeps mappings with at least 3 times the size of a segment.
By default, ``wfmash`` only keeps mappings with at least 5 times the size of a segment.
This can be adjusted with ``-l, --block-length BLOCK``.

Although the defaults (``-p 95 -s 10k``) should work for most pangenome contexts, it is recommended to set suitable minimum mapping identity ``-p`` and segment length ``-s``.
In particular, for high divergence problems (e.g. models built from separate species) it can be necessary to set ``-p`` and ``-s`` to different levels.
Increasing ``-p`` and ``-s`` will increase the stringency of the initial alignment, while reducing them will make this more sensitive.
Moreover, ``pggb`` requires that the user set a number of mappings ``-n`` per segment.
``-n`` represents the number of haplotypes in the input pangenome.
This is automatically computed if sequence names follow `PanSN-spec <https://github.com/pangenome/PanSN-spec>`_.


-------------------------
Expand All @@ -49,9 +50,9 @@ Convert this to an approximate percent identity and provide it as ``-p, --map-pc
Target number of alignment
-------------------------

The ``pggb graph`` is defined by the number of mappings per segment of each genome ``-n, --n-mappings N``.
Ideally, you should set this to equal the number of haplotypes in the pangenome.
The ``pggb`` graph is defined by the number of mappings per segment of each genome, ``-c, --n-mappings N``.
Ideally, you should set this to 1 to have each genome mapped against all others once.
Because that's the maximum number of secondary mappings and alignments that we expect.
However, for pangenomes with copy number variation, you may want to set this to a higher number.
Keep in mind that the total work of alignment is proportional to ``N*N``, and these multimappings can be highly redundant.
If you provide a ``N`` that is not equal to the number of haplotypes, provide the actual number of haplotypes to ``-H``.
This helps ``smoothxg`` to determine the right POA problem size.
In general, it is recommended to set this to 1, and only increase it if you have a good reason to do so.
15 changes: 0 additions & 15 deletions docs/rst/faqs.rst
Expand Up @@ -12,18 +12,3 @@ How are non-common nucleotide sequences treated?

The supported canonical bases are `A`, `C`, `T`, `G`. All other `nucleotide symbols <http://www.hgmd.cf.ac.uk/docs/nuc_lett.html>`_ are treated as `N`.
In particular, in ``wfmash``, non-canonical bases are treated as mismatches when guiding the local base-level alignments.

1. Point 1
---------------------------------------------------------------------------------------------

Point 1

2. Point 2
--------------------------------------------------------------------------------------------

Point 2

Question 2?
=========================================================================

Answer 2
12 changes: 5 additions & 7 deletions docs/rst/installation.rst
Expand Up @@ -12,11 +12,9 @@ Manual-mode
<br />

You'll need `wfmash <https://github.com/waveygang/wfmash>`_, `seqwish <https://github.com/ekg/seqwish>`_, `smoothxg <https://github.com/pangenome/smoothxg>`_,
`odgi <https://github.com/pangenome/odgi>`_, `gfaffix <https://github.com/marschall-lab/GFAffix>`_, `bcftools <https://github.com/samtools/bcftools>`_ and `vg <https://github.com/vgteam/vg>`_
in your shell's ``PATH``. They can be build from source, or installed via Bioconda.
Then, add the ``pggb`` bash script to your ``PATH`` to complete the installation.
`odgi <https://github.com/pangenome/odgi>`_, and `gfaffix <https://github.com/marschall-lab/GFAffix>`_ in your shell's ``PATH``. They can be built from source or installed via Bioconda. Then, add the ``pggb`` bash script to your ``PATH`` to complete the installation.
`How to add a binary to my path? <https://zwbetz.com/how-to-add-a-binary-to-your-path-on-macos-linux-windows/>`_ |br|
Optionally, install `MultiQC <https://multiqc.info/>`_ for reporting or `pigz <https://zlib.net/pigz/>`_ to compress the output files of the pipeline.
Optionally, install `bcftools <https://github.com/samtools/bcftools>`_, `vcfbub <https://github.com/pangenome/vcfbub>`_, `vcfwave <https://github.com/vcflib/vcflib>`_, and `vg <https://github.com/vgteam/vg>`_ for calling and normalizing variants, `MultiQC <https://multiqc.info/>`_ for generating summarized statistics in a MultiQC report, or `pigz <https://zlib.net/pigz/>`_ to compress the output files of the pipeline.


Docker
Expand Down Expand Up @@ -56,7 +54,7 @@ you can run the container using the human leukocyte antigen (HLA) data provided

.. code-block:: bash
docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -v -V 'gi|568815561:#' -o /data/out -M -C cons,100,1000,10000 -m"
docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
The ``-v`` argument of ``docker run`` always expects a full path.
Expand Down Expand Up @@ -93,7 +91,7 @@ Staying in the ``pggb`` directory, we can run ``pggb`` with the locally build im

.. code-block:: bash
docker run -it -v ${PWD}/data/:/data ${USER}/pggb /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -v -V 'gi|568815561:#' -o /data/out -M -C cons,100,1000,10000 -m"
docker run -it -v ${PWD}/data/:/data ${USER}/pggb /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
A script that handles the whole building process automatically can be found at `https://github.com/nf-core/pangenome#building-a-native-container <https://github.com/nf-core/pangenome#building-a-native-container>`_.

Expand Down Expand Up @@ -124,7 +122,7 @@ Finally, run `pggb` from the Singularity image.
For Singularity to be able to read and write files to a directory on the host operating system, we need to 'bind' that directory using the `-B` option and pass the `pggb` command as an argument.

.. code-block:: bash
singularity run -B ${PWD}/data:/data ../pggb_latest.sif "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -v -V 'gi|568815561:#' -o /data/out -M -m"
singularity run -B ${PWD}/data:/data ../pggb_latest.sif "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
A script that handles the whole building process automatically can be found at `https://github.com/nf-core/pangenome#building-a-native-container <https://github.com/nf-core/pangenome#building-a-native-container>`_.

Expand Down
