USUM: Plotting sequence similarity embeddings using USEARCH & UMAP

USUM uses USEARCH and UMAP (or t-SNE) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.

Installation

Install the USEARCH dependency manually: https://drive5.com/usearch/download.html
(consider supporting the author by buying the 64-bit license)
Install usum using pip:

pip install usum

Usage

Use usum to plot input protein or DNA sequences in FASTA format.

Show all available options using usum --help

Minimal example

usum example.fa --maxdist 0.2 --termdist 0.3 --output example

Multiple input files with labels

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output example

This will produce a PNG plot and an interactive Bokeh HTML plot.

Using t-SNE instead of UMAP

You can also produce a t-SNE plot using the --tsne flag.

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example

This will produce a PNG plot.

Plotting random subset

You can use --limit to extract and plot a random subset of the input sequences.

# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output example

You can control randomness and reproducibility using the --seed option.
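To illustrate what a seeded random subset means, here is a minimal sketch in plain Python. It is not usum's own implementation: the helper names `read_fasta` and `sample_records` are hypothetical, and the sketch only assumes that a fixed seed should yield the same subset on every run.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_records(records: List[Record], limit: int, seed: int) -> List[Record]:
    """Return at most `limit` records, chosen reproducibly for a given seed."""
    if len(records) <= limit:
        return list(records)
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(records, limit)


fasta_text = """>seq1
ACGTACGT
>seq2
ACGTTCGT
>seq3
ACGAACGT
>seq4
TTGTACGA
""".splitlines()

records = list(read_fasta(fasta_text))
subset_a = sample_records(records, limit=2, seed=42)
subset_b = sample_records(records, limit=2, seed=42)  # same seed, same subset
```

Sampling with the same seed twice returns identical subsets, which is the property that makes a plot of a random subset repeatable.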

Plotting options

See usum --help for all plotting options.

See the UMAP API Guide for more information about the UMAP options.

Use --limit to plot a random subset of records
Use --width and --height to control plot size in pixels
Use --resume to reuse previous distance matrix from the output folder
Use --tsne to produce a t-SNE embedding instead of UMAP (you can use this with --resume)
Use --umap-spread to control how close together the embedded points are in the UMAP embedding
Use --umap-min-dist to control minimum distance between points in UMAP embedding
Use --neighbors to control number of neighbors in UMAP graph
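When usum is driven from a script, it can help to assemble the command line in one place. The sketch below builds an argument list using only flags mentioned in this README; the function name `build_usum_command` is hypothetical, and the value types (for example, integers for --width and --height) are assumptions.

```python
import shlex
from typing import List, Optional, Sequence


def build_usum_command(
    inputs: Sequence[str],
    output: str,
    labels: Optional[Sequence[str]] = None,
    maxdist: Optional[float] = None,
    termdist: Optional[float] = None,
    limit: Optional[int] = None,
    seed: Optional[int] = None,
    width: Optional[int] = None,
    height: Optional[int] = None,
    neighbors: Optional[int] = None,
    tsne: bool = False,
    resume: bool = False,
) -> List[str]:
    """Assemble a usum command line as a list of arguments."""
    cmd = ["usum", *inputs]
    if labels:
        cmd += ["--labels", *labels]
    # Options that take a value; None means "leave the default alone".
    valued = {
        "--maxdist": maxdist,
        "--termdist": termdist,
        "--limit": limit,
        "--seed": seed,
        "--width": width,
        "--height": height,
        "--neighbors": neighbors,
    }
    for flag, value in valued.items():
        if value is not None:
            cmd += [flag, str(value)]
    # Boolean switches.
    if tsne:
        cmd.append("--tsne")
    if resume:
        cmd.append("--resume")
    cmd += ["--output", output]
    return cmd


full_run = build_usum_command(
    ["first.fa", "second.fa"],
    "example",
    labels=["First", "Second"],
    maxdist=0.2,
    termdist=0.3,
    width=600,
    height=600,
)
replot = build_usum_command([], "example", width=600, height=600, resume=True)

# Shell-quoted form, handy for logging what is about to be run.
printable = shlex.join(full_run)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once usum and USEARCH are installed.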

Reusing previous results

When changing just the plot options, you can use --resume to reuse previous results from the output folder.

Warning: This will reuse the previous distance matrix, so changes to limits or USEARCH args won't take effect.

# Reuse results from the example output directory
usum --resume --output example --width 600 --height 600 --theme fire

Programmatic use

from usum import usum

# Show help
help(usum)

# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)

How it works

A sparse distance matrix is calculated using the USEARCH calc_distmx command.
The distances are based on % identity, so the method is agnostic to sequence type (DNA or protein).
The distance matrix is embedded using UMAP with a precomputed metric.
The embedding is plotted using umap.plot.
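The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not usum's actual code: the distance is taken as 1 minus the fraction of identical aligned columns (USEARCH's exact identity definition may differ), and the sparse matrix is assumed to arrive as tab-separated lines of the form label1, label2, distance. The function names are hypothetical, and the dense list-of-lists is used only for readability.

```python
from typing import Dict, Iterable, List, Tuple


def identity_distance(aligned_a: str, aligned_b: str) -> float:
    """Return 1 - fractional identity for two aligned, equal-length sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # skip columns that are gaps in both sequences
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        raise ValueError("no aligned columns to compare")
    return 1.0 - matches / columns


def dense_from_pairs(
    lines: Iterable[str], fill: float = 1.0
) -> Tuple[List[str], List[List[float]]]:
    """Build a symmetric matrix from 'label1<TAB>label2<TAB>distance' lines.

    Pairs missing from the sparse input get the `fill` distance.
    """
    index: Dict[str, int] = {}
    labels: List[str] = []
    pairs: Dict[Tuple[str, str], float] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        a, b, dist = fields[0], fields[1], float(fields[2])
        for label in (a, b):
            if label not in index:
                index[label] = len(labels)
                labels.append(label)
        pairs[(a, b)] = dist
    n = len(labels)
    matrix = [[0.0 if i == j else fill for j in range(n)] for i in range(n)]
    for (a, b), dist in pairs.items():
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = dist
    return labels, matrix


d12 = identity_distance("ACGTACGT", "ACGTTCGT")  # 7 of 8 columns identical
labels, matrix = dense_from_pairs(["seq1\tseq2\t0.125", "seq2\tseq3\t0.25"])
```

A matrix like this is the kind of input that UMAP accepts when it is constructed with `metric="precomputed"`; for large inputs a sparse representation would be the more practical choice.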
