I read this literature review:
It points to another paper, which it claims achieves a 60x-300x speedup over the "GK01" algorithm implemented in this crate and has a state-of-the-art space upper bound:
It would be awesome to implement the ZhangWang algorithm for this crate! The algorithm outlined in the paper looks kind of appealing. (I was going to implement it, then got sidetracked by benchmarks #32 and never got any further...)