GraphPart

Biological sequence dataset partitioning method

Graph-Part is a Python package for generating partitions (i.e. train-test splits, or splits for cross-validation) of biological sequence datasets. It ensures minimal homology between different partitions, while balancing partitions for labels or other desired criteria.

Preprint: https://www.biorxiv.org/content/10.1101/2023.04.14.536886v1

Installation

GraphPart relies on needleall from the EMBOSS package for Needleman-Wunsch alignments of sequences. Please refer to the official EMBOSS documentation for installation methods. Additionally, GraphPart supports MMseqs2 for alignments. To use other algorithms that compute pairwise similarity measures, please refer to the precomputed mode.

We recommend to install GraphPart in a conda environment, and install EMBOSS from bioconda. The same goes for MMseqs2.

conda install -c bioconda emboss

# If you also want to use the MMseqs2 mode
conda install -c conda-forge -c bioconda mmseqs2

Alternatively, on Ubuntu, EMBOSS is available directly via sudo apt-get install emboss .

To install GraphPart, run

pip install graph-part

The command graphpart will now be available on your command line.

Alternatively, you can install GraphPart from source:

conda install -c bioconda emboss
git clone https://github.com/graph-part/graph-part.git
cd graph-part
pip install .

Instructions

Command line interface for FASTA data

As an example, this is a basic command for partitioning a dataset at a maximum pairwise cross-partition identity of 30% into 5 folds. The resulting partitions are balanced for equal frequencies of the class labels specified in label= in the FASTA headers. --threads can be adapted according to your system and has no effect on the partitioning itself, it only affects the resource usage and processing speed.

graphpart needle --fasta-file netgpi_dataset.fasta --threshold 0.3 --out-file graphpart_assignments.csv --labels-name label --partitions 5 --threads 12

Alternatively, a train-validation-test split of the data can be made instead of folds:

graphpart needle --fasta-file netgpi_dataset.fasta --threshold 0.3 --out-file graphpart_assignments.csv --labels-name label --test-ratio 0.1 --val-ratio 0.05 --threads 12

Python API

A tutorial notebook showcasing how to use GraphPart from within Python is included at tutorial.ipynb. The tutorial also covers partitioning of small molecule data.

Input format

GraphPart works on FASTA files with a custom header format, e.g.

>P42098|label=CLASSA|priority=0
MAPSWRFFVCFLLWGGTELCSPQPVWQDEGQRLRPSKPPTVMVECQEAQLVVIVSKDLFGTGKLIRPADL
>P0CL66|label=CLASSB|priority=1
MKKYLLGIGLILALIACKQNVSSLDEKNSVSVDLPGEMKVLVSKEKNKDGKYDLIATVDKLELKGTSDKN

Alternatively , ":" and " - " (note there are spaces on either side of the -) can be used as separators instead of "|". It should be taken care that the sequence identifiers themselves contain no separator symbols. The keywords label and priority can be customized by specifying the --labels-name and --priority-name arguments. Both elements of the header are optional, GraphPart can also just partition based on sequences alone, without any class balancing.
You can find a script to convert .csv datasets into the custom .fasta format at csv_to_fasta.py

Output format

GraphPart produces a .csv file that contains the cluster assignment for each sequence. Column cluster contains the partition number. Removed sequences are not contained in the output file.

AC,priority,label-val,between_connectivity,cluster
P42098,False,0,0,0.0
Q6LEM5,False,0,0,0.0
Q9JI81,False,0,0,1.0

API

Minimal command:

graphpart ALIGNMENT_MODE -ff FASTAFILE.fasta -th 0.3

Supported aligners

Alignment mode	Description
`needle`	Use EMBOSS needleall to compute exact pairwise global Needleman-Wunsch (NW) identities for all sequences.
`mmseqs2`	Use MMseqs2 to compute fast identities from local alignments. Use with caution for nucleotides, as there it cannot be guaranteed that MMseqs2 computes all pairwise alignments.
`precomputed`	Use a list of precomputed identities or other similarity/distance metrics.
`mmseqs2needle`	Uses MMseqs2 to compute fast identities, followed by recomputation of NW identities of all identities between the separation threshold and an user-defined lower threshold.

Arguments

Shared

Long	Short	Description
`--fasta-file`	`-ff`	Path to the input fasta file, formatted according to the input format.
`--out-file`	`-of`	Path at which to save the partition assignments as `.csv`. Defaults to `graphpart_result.csv`.
`--threshold`	`-th`	The desired partitioning threshold, should be within the bounds defined by the metric.
`--partitions`	`-pa`	Number of partitions to generate. Defaults to 5.
`--transformation`	`-tf`	Transformation to apply to the similarity/distance metric. GraphPart operates on distances, therefore similarity metrics need to be transformed. Can be any of `one-minus`, `inverse`, `square`, `log`, `None`. See the source for definitions. As an example, when operating with sequence identities ranging from 0 to 1, the transformation `one-minus` yields corresponding distances. Defaults to `one-minus`.
`--priority-name`	`-pn`	The name of the retention priority in the fasta headers. If specified, the algorithm first tries to reach the treshold by removing/moving low-priority (`0`) samples before proceeding to `1` samples. Defaults to `None`. See the the input format section for an example.
`--labels-name`	`-ln`	The name of the label in the fasta headers. Used for balancing partitions. Defaults to `None`.
`--initialization-mode`	`-im`	Use either slow or fast restricted nearest neighbor linkage or no initialization. Can be any of `slow-nn`, `fast-nn`, `simple`. Defaults to `slow-nn`.
`--no-moving`	`-nm`	By default, the removing procedure tries to relocate sequences to another partition if it finds more within-threshold neighbours in any. This flag disallows moving. In high-redundancy datasets, moving can lead to imbalanced partitions and should be disabled.
`--save-checkpoint-path`	`-sc`	Optional path to save the computed identities above the chosen threshold as an edge list. Can be used to quickstart runs in the `precomputed` mode. Defaults to `None` with no file saved.
`--test-ratio`	`-te`	Make a train-val-test split instead of partitions for cross-validation. Overrides `--partitions` when specified. Defaults to 0. Needs to be a multiple of 0.05.
`--val-ratio`	`-va`	Make a train-val-test split instead of partitions for cross-validation. Overrides `--partitions` when specified. Defaults to 0. Needs to be a multiple of 0.05.

needle

Long	Short	Description
`--denominator`	`-dn`	Denominator to use for percent sequence identity computation. The number of perfect matching positions is divided by the result of this operation. Can be any of `shortest`, `longest`, `mean`, `full`, `no_gaps`. The first three options are computed from the original lengths of the aligned sequences. `full` refers to the full length of the alignment, including gaps, and is the default. `no_gaps` subtracts gaps from the full alignment length.
`--threads`	`-nt`	The number of threads to run in parallel. If `-1`, will use all available resources. Defaults to 1.
`--chunks`	`-nc`	The number of chunks into which to split the fasta file for multithreaded alignment. Defaults to 10.
`--parallel-mode`	`-pm`	The Python parallelization strategy to use. `multithread` or `multiprocess`. Multiprocessing is potentially faster (especially for short sequences), but increases memory usage. Defaults to `multithread`
`--nucleotide`	`-nu`	Use this flag if the input contains nucleotide sequences. By default, assumes proteins.
`--triangular`	`-tr`	Only compute triangular of the full distance matrix. Twice as fast, but can yield slightly different results if an alignment has two different solutions with the same score, but different identities. In some cases, a pairwise identity can be slightly above the threshold in one solution, and slightly below in another (e.g A:B = 29.8%, B:A = 30.4%).
`--gapopen`	`-gapopen`	[10.0 for any sequence] The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix. (Floating point number from 1.0 to 100.0)
`--gapextend`	`-gapextend`	[0.5 for any sequence] The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to a very low value and using the gap extension penalty to control gap scoring. (Floating point number from 0.0 to 10.0)
`--endextend`	`-endextend`	[0.5 for any sequence] The end gap extension, penalty is added to the end gap penalty for each base or residue in the end gap. This is how long end gaps are penalized. (Floating point number from 0.0 to 10.0)
`--endweight`	`-endweight`	Flag. Apply end gap penalties. By default, no end gap penalties are applied.
`--endopen`	`-endopen`	[10.0 for any sequence] The end gap open penalty is the score taken away when an end gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences. (Floating point number from 1.0 to 100.0)
`--matrix`	`-datafile`	This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62'. These files are found in the 'data' directory of the EMBOSS installation. If `--nucleotide`, the default is 'EDNAFULL'.

mmseqs2

Long	Short	Description
`--denominator`	`-dn`	Denominator to use for percent sequence identity computation. The number of perfect matching positions is divided by the result of this operation. Can be any of `shortest`, `longest`, `n_aligned`. `n_aligned` is the length of the alignment. Use this with caution, as GraphPart doesn't use coverage controls in the mmseqs2 mode. Defaults to `shortest`.
`--nucleotide`	`-nu`	Use this flag if the input contains nucleotide sequences. By default, assumes proteins. Use with caution! Not guaranteed to compute all pairwise alignments.
`--prefilter`	`-pr`	Use MMseqs2 prefiltering at the highest sensitivity instead of forcing computation of all-vs-all alignments.

precomputed

Long	Short	Description
`--edge-file`	`-ef`	Path to a comma separated file containing precomputed pairwise metrics, the first two columns should contain sequence identifiers specified in the `--fasta-file`. This is can be used to run GraphPart with an alignment tool different from the default `needleall` and `mmseqs`.
`--metric-column`	`-mc`	Specifies in which column the metric is found. Indexing starts at 0, defaults to 2 when left unspecified.

mmseqs2needle

This mode combines needle and mmseqs2. After a first run of mmseqs2, all pairwise alignments with an indentity below recompute-threshold are computed using needle. Only then the partitioning is performed. Note that if recompute-threshold is very low, it can be faster to just run in needle mode. All arguments for the two modes are reused here, with the following exceptions:

Long	Short	Description
`--recompute-threshold`	`-re`	The threshold for MMseqs2 above which alignments should be recomputed using needleall. Has to be a number lower than `--threshold`.
`--denominator-mmseqs`	`-dnm`	Replaces `--denominator`. Applies to the mmseqs2 alignment step.
`--denominator-needle`	`-dnn`	Replaces `--denominator`. Applies to the needle alignment step.

Citation

GraphPart: Homology partitioning for biological sequence analysis
Felix Teufel, Magnús Halldór Gíslason, José Juan Almagro Armenteros, Alexander Rosenberg Johansen, Ole Winther, Henrik Nielsen
bioRxiv 2023.04.14.536886; doi: https://doi.org/10.1101/2023.04.14.536886

FAQ

How should I pick chunks ?
chunks should be picked so that all threads are utilized. Each chunk is aligned to each other chunk, so threads <= chunks*chunks results in full utilization.
I want to test multiple thresholds and partitioning parameters - How can I do this efficiently ?
When constructing the graph, we only retain identities that are larger than the selected threshold, as only those form relevant edges for partitioning the data. All other similarities are discarded as they are computed. To test multiple thresholds, the most efficient way is to first try the lowest threshold to be considered and save the edge list by specifying --save-checkpoint-path EDGELIST.csv. In the next run, use graphpart precomputed -ef EDGELIST.csv to start directly from the previous alignment result.
GraphPart starts with nicely balanced partitions, but after homology removal the sizes are very imbalanced.
By default, GraphPart tries to retain as many sequences as possible. In cases where the initialization clustering is far away from a valid solution (this happens when there are a lot of classes, with potentially small counts, and when there is high overall sequence similarity in the data), moving sequences between partitions will cause some partitions to grow large at the expense of others. You can try --no-moving to prevent this behaviour.
I want to use a different similarity metric than sequence identity. GraphPart can be used with any similarity or distance metric. To do so, you need to provide a list of precomputed pairwise similarities in the precomputed mode. The first two columns of the file should contain the sequence identifiers specified in the --fasta-file. The third column should contain the similarity metric. The --metric-column argument can be used to specify the column index. If you want to use a similarity metric, you need to specify a transformation using --transformation. See the source for definitions. As an example, when operating with sequence identities ranging from 0 to 1, the transformation one-minus yields corresponding distances. If your metric is a distance, you can use --transformation None to skip the transformation step.
I want to use a different alignment tool than EMBOSS needleall or MMseqs. This is supported by the precomputed mode. Please refer to the answer above.
How does the moving step decide to which partition to move a sequence to? After initialization of the partitions, GraphPart iteratively moves sequences between partitions and removes sequences from the data to achieve homology separation. For each sequence, we compute how many connections it has to sequences in each other partition. If there are partitions with more connections than the current partition, the sequence is moved to the partition with the maximum number of connections. If there is a tie in the number of connections, the sequence is moved to the partition that appeared first when iterating the underlying graph data structure. We do not explicitly control this order.

Name		Name	Last commit message	Last commit date
Latest commit History 188 Commits
.github/workflows		.github/workflows
benchmarking		benchmarking
deprecated		deprecated
graph_part		graph_part
webserver		webserver
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
csv_to_fasta.py		csv_to_fasta.py
requirements.txt		requirements.txt
setup.py		setup.py
tutorial.ipynb		tutorial.ipynb

License

graph-part/graph-part

Folders and files

Latest commit

History

Repository files navigation

GraphPart

Installation

Instructions

Command line interface for FASTA data

Python API

Input format

Output format

API

Supported aligners

Arguments

Shared

needle

mmseqs2

precomputed

mmseqs2needle

Citation

FAQ

About

Topics

Resources

License

Stars

Watchers

Forks

Languages