Table of contents
- Mandatory parameters
- Specify steps and threads
- Cluster post-processing
- Metadata processing
- Core genome cutoff
- Association inference
- Other options
Usage example: ./panX -fn data/TestSet -sl TestSet 1>TestSet.log 2>TestSet.err For help: ./panX -h
the absolute path for project folder
species name (e.g.: S_aureus)
Specify steps and threads
panX runs through all steps by default, which contain pan-genome analysis and data preparation for visualization. Alternatively, user can run specific steps.
e.g.: -st 5 6 (It requires that steps before 5 have been already finished.)
number of threads (default:1)
panX generates orthologous gene clusters by itself and can alternatively use clustering output from other pan-genome tools. To ingest the output from roary, orthofinder, or other tools, one can specify the path of output from these tools via:
the absolute path of roary result (e.g.: /path/roary.out) ,
Example for using roary output without panX post-processing:
./panX.py -st 1 3 4 5 6 7 8 9 10 11 -np -rp ./data/yourSpecies/clustered_proteins -fn ...
Example for using roary output with panX post-processing:
./panX.py -st 1 3 4 5 6 7 8 9 10 11 -rp ./data/yourSpecies/clustered_proteins -fn ...
the absolute path of orthofinder result (e.g.: /path/orthofinder.out)
For other tools, panX expects a tab-separated values (TSV) file containing each gene cluster per row and each gene separated by | between strain name and locus_tag (strain01|gene01)
the absolute path of result from other orthology inference tool (e.g.: /path/other_tool.out)
Parameters used to run diamond
DIAMOND [(Buchfink et al. 2015 Nature Methods)] (http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3176.html)
alternative diamond path provided by user
-dme --diamond_evalue (default:0.001)
-dmt --diamond_max_target_seqs (default:600)
maximum number of target sequences per query estimation: #strain * #max_duplication (50*10=500, max_duplication is the maximum of the counts of gene duplication in all strains)
-dmi --diamond_identity (default:0)
sequence identity threshold to report an alignment
-dmqc --diamond_query_cover (default:0)
query sequence coverage threshold to report an alignment
-dmsc --diamond_subject_cover (default:0)
subject sequence coverage threshold to report an alignment
Running diamond with divide-and-conquer algorithm for large dataset
using divide-and-conquer(DC) algorithm
-dcs --subset_size (default:50)
subset size (number of strains in a subset) used in divide-and-conquer(DC) algorithm
-dmsi --diamond_identity_subproblem (default:90)
subproblem alignment: sequence identity threshold to report an alignment
-dmsqc --diamond_query_cover_subproblem (default:90)
subproblem alignment: query sequence coverage threshold to report an alignment
-dmssc --diamond_subject_cover_subproblem (default:90)
subproblem alignment: subject sequence coverage threshold to report an alignment
panX can also use a matrix of all-against-all blast hits to generate clusters with MCL.
the absolute path for blast result (e.g.: /path/blast.out)
Other clustering parameters
- -imcl --mcl_inflation (default:1.5)
inflation parameter (this parameter affects granularity)
- -rna --enable_RNA_clustering cluster rRNAs
Running Blastn on RNAs
-bmt --blastn_RNA_max_target_seqs (default:100)
the maximum number of target sequences per query; estimation: #strain * #max_duplication
Not apply postprocessing
Split long branches
For each gene tree, panX scans the branches longer than the diversity cutoff and splits the tree from those branches into sub-clusters. The diversity cutoff is defined by the formula: (0.1+fcdcore_diversity)/(1+fcdcore_diversity)
-fcd --factor_core_diversity (default:2.0)
factor used to define the diversity cutoff for splitting long branches on gene tree of a cluster
Alternatively, the diversity cutoff can be specified explicitly with the argument -slb, which is mutually exclusive from -fcd . If this value is not given (by default), panX uses the diversity cutoff calculated from core genome diversity as described in above-mentioned formula.
-slb --split_long_branch_cutoff (default: '')
using the diversity cutoff specified by user
Relatively recent gene duplications (paralogs) are not considered when only applying long branch splitting. panX searches paralogy evidence via two tree traversals by taking both paralogy score and branch length into consideration. In order to identify these paralogs and split them into orthologs, a paralogy score is computed for each branch on each gene tree, which is the count of strains having genes on both sides of the branch.
The splitting procedure can be summarized in these conditions: branch_length/paralog_branch_cutoff + paralog_score/(1.5 * nstrains)>1 and paralog_score > 1 panX uses the defined diversity cutoff as paralog branch_cutoff by default. The paralogy splitting is iteratively conducted until no paralogs can be found.
Alternatively, paralog branch_cutoff can be specified with the argument -pbc instead of using the defined diversity cutoff.
-pbc --paralog_branch_cutoff' (default:'')
branch_length cutoff for paralogy splitting provided by user
Resolve unclustered genes
(peaks in gene length distribution)
From our experience, there can be sometimes a small number of unclustered genes, despite of their sequence similarity. panX spots those sequences by inspecting the peaks in the gene length distribution (average length of genes from each cluster), statistically speaking, by comparing the length distribution to smoothed length distribution.
Smoothed length distribution is calculated based on the convolution with defined sliding window by formula: smoothed_length_distribution = convolve(cluster_length_distribution, window) The signal is detected when: (cluster_length_distribution - smoothed_length_distribution) > maximum(strain_proportion*nstrains, sigma_scale*squareRoot(smoothed_length_distribution)) The related parameters are:
-ws --window_size_smoothed (default:5)
postprocess_unclustered_genes: window size for smoothed cluster length distribution
-spr --strain_proportion (default:0.3)
postprocess_unclustered_genes: strain proportion
-ss --sigma_scale (default:3)
postprocess_unclustered_genes: sigma scale For each peak, all genes in these clusters are aligned, with corresponding phylogeny being inferred. Sub-clusters are then split using long branch splitting principle as described above.
panX parses metadata from GenBank file by default. To link customized metadata to the strains, a TSV table needs to be provided. (example) This will only use the meta-information given by user.
the absolute path for meta_information file (e.g.: /path/meta.out)
For metadata such as drug concentration measurements, which user might want to have the branch and presence/absence association to be visualized, an extra TSV table is required to specify what metadata types are used for above-mentioned association inference. (example)
the absolute path for pre-defined metadata structure (discrete/continuous data type, etc.)
Core genome cutoff
Core genome can be defined either as strictly core or soft core. Strictly core genome contain genes present in all strains; soft core genome includes genes shared by the majority of strains (e.g.: 80% of strains). Correspondingly, setting core genome threshold affects core genome diversity and strain tree based on core gene SNPs, specially when incomplete genomes or outliers are involved in the dataset.
-cg --core_genome_threshold (default:1.0)
percentage of strains used to decide whether a gene is core (1.0 for strictly core gene; < 1.0 for soft core genes)
Core gene strain constraint If incomplete genomes are included in the analysis, user can provide an extract list to ensure that soft core genes must be found in all strains in the list.
the absolute path for user-provided subset of strains for restricting core genes
infer branch association
E.g.: ./panX.py -iba -mtf ./data/yourSpecies/meta_config.tsv -fn ...
absolute path of raxml
remove all temporary files
simple tree: does not apply treetime for ancestral inference
store locus_tags in a separate file instead of saving locus_tags in gene cluster json for large dataset
use raw locus_tag from GenBank instead of strain_ID + locus_tag
add customized column in gene cluster json file for visualization
not split gene presence/absence and gain/loss pattern into separate files for each cluster
use nucleotide/amino acid sequence files (fna/faa) when no genBank files given (this option does not consider annotations)
RAxML tree optimization: maximal running time (minutes, default:30min)
disable enable gene gain and loss inference (not recommended)