Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

fast string searching #484

Open
UnixJunkie opened this Issue Dec 19, 2013 · 23 comments

Comments

Projects
None yet
4 participants
Member

UnixJunkie commented Dec 19, 2013

I think we still miss a Bayer-Moore kind of string search in batteries.

Member

UnixJunkie commented Dec 25, 2013

There is an implementation of the Knuth-Morris-Pratt string searching algorithm there in OCaml:
http://gallium.inria.fr/blog/kmp/

Member

UnixJunkie commented Dec 25, 2013

Some interesting literature for Christmas:
"Experimental Results on String Matching Algorithms"
www-igm.univ-mlv.fr/~lecroq/articles/spe95.pdf‎

Member

UnixJunkie commented Dec 25, 2013

It's Boyer-Moore by the way, I always misspell it.

Member

UnixJunkie commented Dec 27, 2013

There is a rotting pull request here: #377
For a different algorithm.

@ghost ghost assigned UnixJunkie Jan 6, 2014

Contributor

314eter commented Jun 1, 2016

I'm interested in implementing faster string searching. Is this still useful?

I can implement different algorithms and compare their performance, but every algorithm has its advantages. I don't think we can afford too much preprocessing in String.find, because this will degrade performance on short patterns. Two-way search or a simple hash (e.g. the sum of the characters) seems promising.

For more optimized algorithms, another function will be necessary (e.g. String.fast_find pattern text), such that the preprocessing can be done separately (let finder = String.fast_find pattern) from the searching (finder text). Possible candidates are Knuth-Morris-Pratt, Boyer-Moore, Horspool, and all other variants, but I would prefer to keep it simple.

Another possibility is to choose an algorithm based on the size of the pattern and the text.

Owner

gasche commented Jun 1, 2016

I think that this could be a useful thing to have, but also suspect that it does not need to be inside Batteries: you could probably have more users for the library without a Batteries dependency, and you don't really need the support of the existing Batteries functionality. You could have a library that works on string, bytes and maybe char Bigarray, and provide an interruptible interface (if you reach the end of the string and need more input, return the current search state reified as a value to let users restart search after they get more input) to let people devise their own streaming/buffering strategies.

Contributor

314eter commented Jun 1, 2016

A new library for fast string searching would be nice too. Maybe I will do that if I find the time 😛

But there already is a function String.find in Batteries, and it's slow. So would you consider replacing it if there is a simple alternative that is faster on every input?

Member

c-cube commented Jun 1, 2016

I think there already are libraries on opam for string searching (e.g. "xstr", probably "stringext" — not sure —, "astring"…). You can also try to adapt my code from containers that uses KMP in general, and String.find for patterns of length one.

Contributor

314eter commented Jun 1, 2016

All those libraries (except containers) use the naive algorithm.

Owner

gasche commented Jun 1, 2016 edited

I think that a library dedicated to string searching algorithms could be an occasion to have this specialized feature in a central place, independently from a "big library" (Batteries, Containers, Core, you name it), and compare algorithms.

@c-cube: do you have any idea of what is the performance difference between using your KMP implementation and ocaml-re's string combinator? (For example, on a program that reads a file line by line, uses the search algorithm to tell if the string " let " appears in the string, and report the number of matching lines.)

Member

c-cube commented Jun 1, 2016

I have no idea, though it would be interesting. I have a few benchmarks inside containers for KMP vs naive algorithm, though. A separate library would make sense (there is also a functorized KMP in containers, btw — for searching in bigarrays, arrays of scalars, etc.).

@314eter If you make a separate library for string searching on github, under a reasonably permissive license (not GPL; LGPL, MIT, BSD, etc. ok) I'd be happy to contribute my code.

Member

UnixJunkie commented Jun 2, 2016

I think for batteries we need two functions:

val compile: string -> t (* to prepare the string for fast search *)
val find: ~offset:int -> t -> string -> int (* to find it in another string *)

Boyer-Moore looks good enough for most people I think.

Member

UnixJunkie commented Jun 2, 2016

@314eter let's keep it simple ;)

Owner

gasche commented Jun 2, 2016

@UnixJunkie the interface you propose is okay, but I think it would also be important to have a function that makes search incremental/interruptible: if at the end of the string there is no match, but there is a potential match in progress, and then I start searching in a new string (morally the continuation of the previous input), how can I make the algorithm start exactly in the state it was in?

This would call for an additional interface about as follows:

type search_state
val init : search_state
type find_result = Found of int | Eof of search_state
val find_incremental : ~offset:int -> t -> string -> search_state -> find_result
Contributor

314eter commented Jun 2, 2016

@UnixJunkie Boyer-Moore is not simple ;)

My Horspool implementation is faster than KMP on every input I tested and faster than naive on everything except for very short patterns. Boyer-Moore has a better worst-case complexity, but I suspect that it will be slower in most cases.

Owner

gasche commented Jun 2, 2016

If you are interested to get your implementation in Batteries, feel free to open a pull request. We have a benchsuite folder in which we include performance comparisons of various implementations, for others to reproduce and future reference, it would be nice to have a file for string searching in there. See for example benchsuite/bench_nreplace for inspiration.

If there is a quick test you can perform in the String.{,r}find{,_from} implementation to dynamically switch to a less naive implementation (I'd guess testing both the pattern length and the string length), it is good -- users get faster code without upgrading. You should also explicitly expose the algorithm, I think, for people that want to access the precomputation interface.

Contributor

314eter commented Jul 1, 2016

I implemented and benchmarked a few string searching algorithms. Unfortunately, there is no clear winner.

  • KMP is a decent performer on all input
  • The hash algorithm is always better than or comparable to naive, except for rare cases with a lot of hash collisions
  • Horspool performs good for longer patterns, but there are exceptions
Member

UnixJunkie commented Nov 9, 2016

There are ocaml implementations of Boyer-Moore in there:
https://github.com/TWal/ENS_boyermoore/tree/master/ocaml

Member

c-cube commented Nov 9, 2016

I have a decent implementation of KMP in containers (with special case for 1-char patterns), and would be interested in benchmarking it against other implementations. It definitely goes faster than the naive string search in my benchmarks.

Contributor

314eter commented Nov 9, 2016

I compared the Boyer-Moore implementation (moore) and the KMP of containers (containers) with my implementations (boyermoore and kmp).
benchmark

They perform very comparable. Containers is slower, but that's probably because it's more flexible (searching in two directions).

Owner

gasche commented Nov 9, 2016

What is the vertical axis measuring (is a higher position better or worse?). What are the length of the searched texts? We are interested in the behavior on large texts (are those functions usable for real-world data processing?), but also in the behavior on fairly small texts (length 10 or so), as it naturally occurs when looking for a word or (unicode?) symbol in, say, URLs or filesystem paths. Which of those benchmarks are indicative of the performance on small texts?

Contributor

314eter commented Nov 9, 2016

An explanation, the code, and more algorithms can be found here.

The vertical axis measures the time to count the number of occurrences of the pattern in the text. Lower is better.

The benchmarks are large texts, but all the algorithms are linear in the length of the text. For very small texts, the relative overhead of the pattern compilation will be too high (but not if you're looking for the same pattern in a lot of different small texts), but I didn't benchmark it. The problem is that it's difficult to find good benchmarks that cover all possible cases. Suggestions for more benchmarks are very welcome.

Member

c-cube commented Nov 9, 2016

Thanks for the benchmarks! There is also the one-char case which can be optimized by falling back to String.index, btw.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment