Merge pull request #391 from pangenome/doc

refresh documentation [skip ci]
pangenome · Apr 11, 2024 · 9c2fd91 · 9c2fd91
2 parents de71ee5 + cd02c0d
commit 9c2fd91

Showing 9 changed files with 58 additions and 77 deletions.
diff --git a/README.md b/README.md
@@ -36,7 +36,7 @@ Then, you can run `pggb` on each community (set of sequences) independently (see
 ```bash
 pggb -i in.fa \       # input file in FASTA format
      -o output \      # output directory
-     -n 9 \           # number of haplotypes
+     -n 9 \           # number of haplotypes (optional with PanSN-spec)
      -t 16 \          # number of threads
      -p 90 \          # minimum average nucleotide identity for segments
      -s 5k \          # segment length for scaffolding the graph
@@ -55,7 +55,7 @@ In the above example, to partition your sequences into communities, execute:
 ```bash
 partition-before-pggb -i in.fa \       # input file in FASTA format
                       -o output \      # output directory
-                      -n 9 \           # number of haplotypes
+                      -n 9 \           # number of haplotypes (optional with PanSN-spec)
                       -t 16 \          # number of threads
                       -p 90 \          # minimum average nucleotide identity for segments
                       -s 5k \          # segment length for scaffolding the graph
@@ -93,11 +93,10 @@ A goal of `pggb` is to reduce the complexity of making these alignments, which a
 
 ### key parameters
 
-The overall structure of `pggb`'s output graph is defined by three parameters: genome number (`-n`), segment length (`-s`), and pairwise identity (`-p`).
-Genome number is a given, but varies in ways that are difficult to infer and is thus left up to the user.
+The overall structure of `pggb`'s output graph is defined by two parameters: segment length (`-s`) and pairwise identity (`-p`).
 Segment length defines the seed length used by the "MashMap3" homology mapper in `wfmash`.
 The pairwise identity is the minimum allowed pairwise identity between seeds, which is estimated using a mash-type approximation based on k-mer Jaccard.
-Mappings are initiated from collinear chains of around 5 seeds (`-l, --block-length`), and extended greedily as far as possible, allowing up to `-n` minus 1 mappings at each query position.
+Mappings are initiated from collinear chains of around 5 seeds (`-l, --block-length`), and extended greedily as far as possible, allowing up to `-n` minus 1 mappings at each query position. Genome number (`-n`) is automatically computed if sequence names follow [PanSN-spec](https://github.com/pangenome/PanSN-spec).
 
 An additional parameter, `-k`, can also greatly affect graph structure by pruning matches shorter than a given threshold from the initial graph model.
 In effect, `-k N` removes any match shorter than `N`bp from the initial alignment.
@@ -113,7 +112,7 @@ Finally, we apply `gfaffix` to remove forks where both alternatives have the sam
 ### bringing your pangenome into focus
 
 We suggest using default parameters for initial tests.
-For instance `pggb -i in.fa.gz -o out1 -t 16 -n 100` would be a minimal build command for a 100-genome pangenome from `in.fa.gz`.
+For instance `pggb -i in.fa.gz -o out` would be a minimal build command for a pangenome from `in.fa.gz`.
 The default parameters provide a good balance between runtime and graph quality for small-to-medium (1kbp-100Mbp) problems.
 
 However, we find that parameters may still need to be adjusted to fine-tune `pggb` to a given problem.
@@ -125,13 +124,13 @@ These parameters must be tuned so that the graph resolves structures of interest
 In preparation of a manuscript on `pggb`, we have developed a [set of example pangenome builds for a collection of diverse species](https://github.com/pangenome/pggb-paper/blob/main/workflows/AllSpecies.md#all-species).
 (These also use cross-validation against [`nucmer`](https://mummer4.github.io/) to evaluate graph quality.)
 
-Examples:
+Examples (`-n` is optional if sequence names follow [PanSN-spec](https://github.com/pangenome/PanSN-spec)):
 
-- Human, whole genome, 90 haplotypes: `pggb -p 98 -s 50k -n 90 -k 79 ...`
-- 15 helicobacter genomes, 5% divergence: `pggb -n 15 -k 79`, and 15 at higher (10%) divergence `pggb -n 15 -k 19 -P asm20 ...`
-- Yeast genomes, 5% divergence: `pggb`'s defaults should work well, just set `-n`.
-- Aligning 9 MHC class II assemblies from vertebrate genomes (5-10% divergence): `pggb -n 9 -k 29 ...`
-- A few thousand bacterial genomes `pggb -x auto -n 2146 ...`. In general mapping sparsification (`-x auto`) is a good idea when you have many hundreds to thousands of genomes.
+- Human, whole genome, 90 haplotypes: `pggb -p 98 -s 10k -k 47 [-n 90] ...`
+- 15 helicobacter genomes, 5% divergence: `pggb -k 47 [-n 15]`, and 15 at higher (10%) divergence: `pggb -k 23 [-n 15] ...`
+- Yeast genomes, 5% divergence: `pggb`'s defaults should work well.
+- Aligning 9 MHC class II assemblies from vertebrate genomes (5-10% divergence): `pggb -k 29 [-n 9] ...`
+- A few thousand bacterial genomes: `pggb -x auto [-n 2000] ...`. In general, mapping sparsification (`-x auto`) is a good idea when you have many hundreds to thousands of genomes.
 
 `pggb` defaults to using as many threads as there are logical processors on the system (the thread count given by `getconf _NPROCESSORS_ONLN`).
 Use `-t` to set an appropriate level of parallelism if you can't use all the processors on your system.
@@ -147,7 +146,7 @@ cd pggb
 ./pggb -i data/HLA/DRB1-3123.fa.gz -p 70 -s 500 -n 10 -t 16 -V 'gi|568815561' -o out -M
 ```
 
-This yields a variation graph in GFA format, a multiple sequence alignment in MAF format (`-M`), and several diagnostic images (all in the directory `out/`).
+This yields a variation graph in GFA format, a multiple sequence alignment in MAF format (`-M`), and several diagnostic images (all in the directory `out/`). We specify `-n` because the sequences do not follow [PanSN-spec](https://github.com/pangenome/PanSN-spec), so the number of haplotypes cannot be automatically computed.
 We also call variants with `-V` with respect to the reference `gi|568815561`.
 
 ### 1D graph visualization
@@ -217,9 +216,10 @@ also rectifies issues with the initial wfa-based alignment.
 
 ### manual-mode
 
-You'll need `wfmash`, `seqwish`, `smoothxg`, `odgi`, `gfaffix`, and `vg` in your shell's `PATH`.
+You'll need `wfmash`, `seqwish`, `smoothxg`, `odgi`, and `gfaffix` in your shell's `PATH`.
 These can be individually built and installed.
 Then, put the `pggb` bash script in your path to complete installation.
+Optionally, install `bcftools`, `vcfbub`, `vcfwave`, and `vg` for calling and normalizing variants, `MultiQC` for generating summarized statistics in a MultiQC report, or `pigz` to compress the output files of the pipeline.
 
 ### Docker
 
@@ -252,7 +252,7 @@ cd pggb
 you can run the container using the [human leukocyte antigen (HLA) data](data/HLA) provided in this repo:
 
 ```bash
-docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
+docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
 ```
 
 The `-v` argument of `docker run` always expects a full path.
@@ -281,7 +281,7 @@ docker build --target binary -t ${USER}/pggb:latest .
 Staying in the `pggb` directory, we can run `pggb` with the locally built image:
 
 ```bash
-docker run -it -v ${PWD}/data/:/data ${USER}/pggb /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
+docker run -it -v ${PWD}/data/:/data ${USER}/pggb /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
 ```
 A script that handles the whole building process automatically can be found at https://github.com/nf-core/pangenome#building-a-native-container.
 
@@ -308,7 +308,7 @@ Finally, run `pggb` from the Singularity image.
 For Singularity, to be able to read and write files to a directory on the host operating system, we need to 'bind' that directory using the `-B` option and pass the `pggb` command as an argument.
 
 ```bash
-singularity run -B ${PWD}/data:/data ../pggb_latest.sif pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -V 'gi|568815561' -o /data/out
+singularity run -B ${PWD}/data:/data ../pggb_latest.sif pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out
 ```
 
 A script that handles the whole building process automatically can be found at https://github.com/nf-core/pangenome#building-a-native-container.
@@ -359,7 +359,7 @@ The docker image already contains v1.11 of `MultiQC`.
 
 ## authors
 
-*Garrison E., *Guarracino A., Heumos S., Villani F., Bao Z., Tattini L., Hagmann J., Vorbrugg S., Ashbrook D. G., Thorell K., Chen H., Sudmant P. H., Liti G., Colonna V., Prins P.
+Erik Garrison*, Andrea Guarracino*, Simon Heumos, Flavia Villani, Zhigui Bao, Lorenzo Tattini, Jörg Hagmann, Sebastian Vorbrugg, Santiago Marco-Sola, Christian Kubica, David G. Ashbrook, Kaisa Thorell, Rachel L. Rusholme-Pilcher, Gianni Liti, Emilio Rudbeck, Sven Nahnsen, Zuyu Yang, Mwaniki N. Moses, Franklin L. Nobrega, Yi Wu, Hao Chen, Joep de Ligt, Peter H. Sudmant, Nicole Soranzo, Vincenza Colonna, Robert W. Williams, Pjotr Prins, Building pangenome graphs, bioRxiv 2023.04.05.535718; doi: https://doi.org/10.1101/2023.04.05.535718
 
 ## license
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -34,7 +34,7 @@ Core packages
 
         + `mashmap <https://github.com/marbl/MashMap>`_ variant for approximate mappings
         + `wavefront-guided <https://github.com/ekg/wflign>`_ global alignment for long sequences
-        + `wavefront <https://github.com/smarco/WFA>`_ algorithm for base-level alignment
+        + `wavefront <https://github.com/smarco/WFA2-lib>`_ algorithm for base-level alignment
         + Pairwise alignments in `PAF <https://github.com/lh3/miniasm/blob/master/PAF.md>`_ format
 
     * - |seqwish|

diff --git a/docs/rst/essential_parameters.rst b/docs/rst/essential_parameters.rst
@@ -15,27 +15,28 @@ Each pangenome is different. We may require different settings to obtain useful
 Mapping
 -------------------------
 
-``pggb`` requires that the user sets a mapping identity minimum ``-p``, a segment length ``-s``, and a number of secondary mappings ``-n`` per segment.
+In ``pggb``, the main parameters shaping the pangenome graph structure are the mapping identity minimum ``-p`` and the segment length ``-s``.
 These two parameters passed to ``wfmash`` are essential for establishing the basic structure of the pangenome:
 
     - ``-s[N], --segment-length=[N]`` length of the mapped and aligned segment
     - ``-p[%], --map-pct-id=[%]`` percentage identity minimum in the mapping step
-    - ``-n[N], --n-mappings=[N]`` maximum number of mappings and alignments reported for each segment
 
 These parameters can be set using some prior information about the sequences that you're using.
 Crucially, ``--segment-length`` provides a kind of minimum alignment length filter. The ``mashmap`` step in ``wfmash`` will only consider segments of this size, 
 and require them to have an approximate pairwise identity of at least ``--map-pct-id``. For small pangenome graphs, or where there are few repeats, ``--segment-length``
-can be set low ( for example 3000 as in :ref:`quick_start_example`).
-However, for larger contexts, with repeats, it can be very important to set this high (for instance 100000 in the case of human genomes).
+can be set low (for example 3000 as in :ref:`quick_start_example` for the MHC pangenome graph).
+However, for larger contexts, with repeats, it can be useful to set this high (for instance, 10 or 20 kbp in the case of human genomes).
 A long segment length ensures that we represent long collinear regions of the input sequences in the structure of the graph.
 In general, this should at least be larger than transposon and other common repeats in your pangenome.
-By default, ``wfmash`` only keeps mappings with at least 3 times the size of a segment.
+By default, ``wfmash`` only keeps mappings with at least 5 times the size of a segment.
 This can be adjusted with ``-l, --block-length BLOCK``.
 
 Although the defaults (``-p 95 -s 10k``) should work for most pangenome contexts, it is recommended to set suitable minimum mapping identity ``-p`` and segment length ``-s``.
 In particular, for high divergence problems (e.g. models built from separate species) it can be necessary to set ``-p`` and ``-s`` to different levels.
 Increasing ``-p`` and ``-s`` will increase the stringency of the initial alignment, while reducing them will make this more sensitive.
 Moreover, ``pggb`` requires that the user set a number of mappings ``-n`` per segment.
+``-n`` represents the number of haplotypes in the input pangenome.
+This is automatically computed if sequence names follow `PanSN-spec <https://github.com/pangenome/PanSN-spec>`_.
 
 
 -------------------------
@@ -49,9 +50,9 @@ Convert this to an approximate percent identity and provide it as ``-p, --map-pc
 Target number of alignment
 -------------------------
 
-The ``pggb graph`` is defined by the number of mappings per segment of each genome ``-n, --n-mappings N``.
-Ideally, you should set this to equal the number of haplotypes in the pangenome.
+The ``pggb`` graph is defined by the number of mappings per segment of each genome, ``-c, --n-mappings N``.
+Ideally, you should set this to 1 to have each genome mapped against all others once.
 Because that's the maximum number of secondary mappings and alignments that we expect.
+However, in the case of pangenomes with copy number variation, you may want to set this to a higher number.
 Keep in mind that the total work of alignment is proportional to ``N*N``, and these multimappings can be highly redundant. 
-If you provide a ``N`` that is not equal to the number of haplotypes, provide the actual number of haplotypes to ``-H``.
-This helps  ``smoothxg`` to determine the right POA problem size.
+In general, it is recommended to set this to 1, and only increase it if you have a good reason to do so.
diff --git a/docs/rst/faqs.rst b/docs/rst/faqs.rst
@@ -12,18 +12,3 @@ How are non-common nucleotide sequences treated?
 
 The supported canonical bases are `A`, `C`, `T`, `G`. All other `nucleotide symbols <http://www.hgmd.cf.ac.uk/docs/nuc_lett.html>`_ are treated as `N`.
 In particular, in ``wfmash``, non-canonical bases are treated as mismatches when guiding the local base-level alignments.
-
-1. Point 1
----------------------------------------------------------------------------------------------
-
-Point 1
-
-2. Point 2
---------------------------------------------------------------------------------------------
-
-Point 2
-
-Question 2?
-=========================================================================
-
-Answer 2
diff --git a/docs/rst/installation.rst b/docs/rst/installation.rst
@@ -12,11 +12,9 @@ Manual-mode
    <br />
 
 You'll need `wfmash <https://github.com/waveygang/wfmash>`_, `seqwish <https://github.com/ekg/seqwish>`_, `smoothxg <https://github.com/pangenome/smoothxg>`_,
-`odgi <https://github.com/pangenome/odgi>`_, `gfaffix <https://github.com/marschall-lab/GFAffix>`_, `bcftools <https://github.com/samtools/bcftools>`_ and `vg <https://github.com/vgteam/vg>`_ 
-in your shell's ``PATH``. They can be build from source, or installed via Bioconda.
-Then, add the ``pggb`` bash script to your ``PATH`` to complete the installation. 
+`odgi <https://github.com/pangenome/odgi>`_, and `gfaffix <https://github.com/marschall-lab/GFAffix>`_ in your shell's ``PATH``. They can be built from source or installed via Bioconda. Then, add the ``pggb`` bash script to your ``PATH`` to complete the installation.
 `How to add a binary to my path? <https://zwbetz.com/how-to-add-a-binary-to-your-path-on-macos-linux-windows/>`_ |br|
-Optionally, install `MultiQC <https://multiqc.info/>`_ for reporting or `pigz <https://zlib.net/pigz/>`_ to compress the output files of the pipeline.
+Optionally, install `bcftools <https://github.com/samtools/bcftools>`_, `vcfbub <https://github.com/pangenome/vcfbub>`_, `vcfwave <https://github.com/vcflib/vcflib>`_, and `vg <https://github.com/vgteam/vg>`_ for calling and normalizing variants, `MultiQC <https://multiqc.info/>`_ for generating summarized statistics in a MultiQC report, or `pigz <https://zlib.net/pigz/>`_ to compress the output files of the pipeline.
 
 
 Docker
@@ -56,7 +54,7 @@ you can run the container using the human leukocyte antigen (HLA) data provided
 
 .. code-block:: bash
 
-    docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -v -V 'gi|568815561:#' -o /data/out -M -C cons,100,1000,10000 -m"
+    docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
 
 
 The ``-v`` argument of ``docker run`` always expects a full path.
@@ -93,7 +91,7 @@ Staying in the ``pggb`` directory, we can run ``pggb`` with the locally build im
 
 .. code-block:: bash
 
-    docker run -it -v ${PWD}/data/:/data ${USER}/pggb /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -v -V 'gi|568815561:#' -o /data/out -M -C cons,100,1000,10000 -m"
+    docker run -it -v ${PWD}/data/:/data ${USER}/pggb /bin/bash -c "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
 
 A script that handles the whole building process automatically can be found at `https://github.com/nf-core/pangenome#building-a-native-container <https://github.com/nf-core/pangenome#building-a-native-container>`_.
 
@@ -124,7 +122,7 @@ Finally, run `pggb` from the Singularity image.
 For Singularity to be able to read and write files to a directory on the host operating system, we need to 'bind' that directory using the `-B` option and pass the `pggb` command as an argument.
 
 .. code-block:: bash
-    singularity run -B ${PWD}/data:/data ../pggb_latest.sif "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -v -V 'gi|568815561:#' -o /data/out -M -m"
+    singularity run -B ${PWD}/data:/data ../pggb_latest.sif "pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o /data/out"
 
 A script that handles the whole building process automatically can be found at `https://github.com/nf-core/pangenome#building-a-native-container <https://github.com/nf-core/pangenome#building-a-native-container>`_.
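
The documentation changes above state that ``-n`` is computed automatically when sequence names follow PanSN-spec. As a rough illustration of what such a count could look like, the sketch below counts the distinct ``sample#haplotype`` prefixes among the headers of an uncompressed FASTA file. It assumes the ``#`` delimiter and the ``sample#haplotype#contig`` layout; the function name ``count_haplotypes`` is hypothetical, and this is not necessarily how ``pggb`` itself derives the value.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of pggb): count distinct haplotypes in a FASTA
# file whose sequence names are assumed to follow sample#haplotype#contig.
count_haplotypes() {
    local fasta="$1"
    grep '^>' "$fasta" |     # keep header lines only
        sed 's/^>//' |       # drop the leading '>'
        cut -f 1 -d ' ' |    # drop any description after the sequence name
        cut -f 1,2 -d '#' |  # keep the sample#haplotype prefix
        sort -u |            # one line per distinct haplotype
        wc -l |
        tr -d ' '            # some wc implementations pad the count with spaces
}
```

For example, a file with the headers ``>HG002#1#chr1``, ``>HG002#2#chr1``, and ``>HG003#1#chr1`` contains three distinct prefixes, so ``count_haplotypes in.fa`` would print ``3``. Names without a ``#`` are counted as one haplotype each, which is why ``-n`` is still given explicitly for the HLA example.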