jbethune/rust-twobit

twobit

Efficient 2bit file reader, implemented in pure Rust.

Minimum supported Rust version: rustc 1.51+. License: MIT.

The 2bit file format is used to store genomic sequences on disk. It allows for fast access to specific parts of the genome.
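The fast access comes from the fixed-width encoding: each base takes two bits, so the byte holding base `i` sits at offset `i / 4` within the packed DNA. The following is a minimal sketch of that decoding step, assuming the encoding described in the UCSC 2bit format (T = 00, C = 01, A = 10, G = 11, first base in the most significant bits). The function name `decode_packed` is hypothetical and is not part of this crate's API.

```rust
/// Decode `len` bases starting at base index `start` from 2bit-packed bytes.
/// Illustrative helper only; not taken from the twobit crate.
fn decode_packed(packed: &[u8], start: usize, len: usize) -> String {
    // Two-bit codes 0..=3 map to T, C, A, G in that order.
    const BASES: [char; 4] = ['T', 'C', 'A', 'G'];
    (start..start + len)
        .map(|i| {
            // Four bases per byte, so base `i` lives in byte `i / 4`.
            let byte = packed[i / 4];
            // The first base of a byte occupies the two highest bits.
            let shift = 6 - 2 * (i % 4);
            BASES[((byte >> shift) & 0b11) as usize]
        })
        .collect()
}
```

For example, the byte `0b10_01_11_00` holds the four bases `ACGT`, and a read may start at any base without decoding what comes before it.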

This crate is inspired by py2bit and aims to offer similar functionality with no C dependency, no external crate dependencies, and great performance. It follows version 0 of the 2bit specification.

Examples

```rust
use twobit::TwoBitFile;

let mut tb = TwoBitFile::open("assets/foo.2bit")?;
assert_eq!(tb.chrom_names(), &["chr1", "chr2"]);
assert_eq!(tb.chrom_sizes(), &[150, 100]);
let expected_seq = "NNACGTACGTACGTAGCTAGCTGATC";
assert_eq!(tb.read_sequence("chr1", 48..74)?, expected_seq);
```

All sequence-related methods expect a range argument; pass `..` (the unbounded range) to query the entire sequence:

```rust
assert_eq!(tb.read_sequence("chr1", ..)?.len(), 150);
```
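To illustrate how a single method can accept `48..74`, `100..`, `..=9`, and `..` alike, here is a minimal sketch built on the standard `RangeBounds` trait. The helper `resolve_range` is hypothetical, and clamping to the sequence length is an assumption made for this sketch; the crate may handle out-of-bounds ranges differently.

```rust
use std::ops::{Bound, RangeBounds};

/// Resolve any range expression against a sequence of length `len`
/// into a half-open `(start, end)` pair.
/// Assumption for this sketch: out-of-range bounds are clamped to `len`.
fn resolve_range<R: RangeBounds<usize>>(range: R, len: usize) -> (usize, usize) {
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s.saturating_add(1),
        Bound::Unbounded => 0, // `..x` starts at the first base
    };
    let end = match range.end_bound() {
        Bound::Included(&e) => e.saturating_add(1),
        Bound::Excluded(&e) => e,
        Bound::Unbounded => len, // `x..` runs to the end of the sequence
    };
    let end = end.min(len);
    (start.min(end), end)
}
```

With a chromosome of length 150, `resolve_range(.., 150)` yields `(0, 150)` and `resolve_range(48..74, 150)` yields `(48, 74)`.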

Files can be fully cached in memory in order to provide fast random access and avoid any IO operations when decoding:

```rust
let mut tb_mem = TwoBitFile::open_and_read("assets/foo.2bit")?;
let expected_seq = tb.read_sequence("chr1", ..)?;
assert_eq!(tb_mem.read_sequence("chr1", ..)?, expected_seq);
```
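One common way to get this behaviour in Rust is to write the reading logic against `Read + Seek` and swap the backing store between a file and an in-memory `Cursor`. The sketch below shows that general pattern using only the standard library; the helpers `cache_in_memory` and `read_at` are hypothetical, and this is not a description of how `open_and_read` is implemented internally.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Load an entire source into memory and wrap it in a seekable `Cursor`.
/// After this call, later reads never touch the original source.
fn cache_in_memory<R: Read>(mut src: R) -> io::Result<Cursor<Vec<u8>>> {
    let mut bytes = Vec::new();
    src.read_to_end(&mut bytes)?;
    Ok(Cursor::new(bytes))
}

/// Read exactly `len` bytes at byte offset `pos` from any seekable source.
/// Works unchanged for `std::fs::File` and for `Cursor<Vec<u8>>`.
fn read_at<R: Read + Seek>(src: &mut R, pos: u64, len: usize) -> io::Result<Vec<u8>> {
    src.seek(SeekFrom::Start(pos))?;
    let mut buf = vec![0u8; len];
    src.read_exact(&mut buf)?;
    Ok(buf)
}
```

The trade-off is memory for speed: the whole file is held in RAM, and every subsequent seek and read becomes a slice operation instead of a system call.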

2bit files offer two types of masks: N masks (aka hard masks) for unknown or arbitrary nucleotides, and soft masks for lower-case nucleotides (e.g. "t" instead of "T").

Hard masks are always enabled; soft masks are disabled by default, but can be enabled manually:

```rust
let mut tb_soft = tb.enable_softmask(true);
let expected_seq = "NNACGTACGTACGTagctagctGATC";
assert_eq!(tb_soft.read_sequence("chr1", 48..74)?, expected_seq);
```
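Conceptually, both masks are lists of position ranges applied on top of the decoded bases. The sketch below shows that idea with a hypothetical helper, `apply_masks`; it is not the crate's implementation. The block coordinates used in the usage note are made up so that the result matches the strings shown above, and are not read from `assets/foo.2bit`.

```rust
use std::ops::Range;

/// True if `pos` falls inside any of the given blocks.
fn covered(blocks: &[Range<usize>], pos: usize) -> bool {
    blocks.iter().any(|b| b.contains(&pos))
}

/// Apply hard (N) and soft (lower-case) masks to a decoded window.
/// `offset` is the chromosome position of the first base in `seq`;
/// mask blocks are given in chromosome coordinates.
fn apply_masks(
    seq: &str,
    offset: usize,
    hard: &[Range<usize>],
    soft: &[Range<usize>],
    softmask_enabled: bool,
) -> String {
    seq.chars()
        .enumerate()
        .map(|(i, base)| {
            let pos = offset + i;
            if covered(hard, pos) {
                'N' // hard masks always apply
            } else if softmask_enabled && covered(soft, pos) {
                base.to_ascii_lowercase() // soft masks only when enabled
            } else {
                base
            }
        })
        .collect()
}
```

Assuming an N block over positions `0..50` and a soft-mask block over `62..70`, calling `apply_masks("TTACGTACGTACGTAGCTAGCTGATC", 48, &[0..50], &[62..70], true)` returns `"NNACGTACGTACGTagctagctGATC"`; with the last argument set to `false`, the lower-case run stays upper-case.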