Performance Issue with small alphabets and long texts #11
Hi Stefan, thanks for benchmarking byteseek - I'm very interested to see how it stacks up. I will have a look at the issue and get back to you. Regards, Matt. On 9 August 2016 at 07:16, Stefan Mandel notifications@github.com wrote:
Hi Stefan, one thing you may be interested in. I've been doing some benchmarking of […] For small alphabets and pattern sizes, Shift-OR works extremely well. For […] There are implementations in the SMART tool. I've ported these to my jmh […] There are some even faster algorithms out there, like EPSM, but they take […] Happy to share any of the benchmarking code or new algorithms with you. Regards, Matt. On 10 August 2016 at 09:38, Matt Palmer mattpalms@gmail.com wrote:
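For reference, the Shift-OR technique mentioned above can be sketched in a few lines. This is a generic textbook illustration, not byteseek's or SMART's actual code; the class and method names are invented for the example. The pattern is encoded as per-byte bitmasks and the search advances one machine word per text byte, which is why it does so well on small alphabets and short patterns.

```java
public class ShiftOr {
    // Returns the index of the first occurrence of pattern in text, or -1.
    // Pattern length is limited to 63 so the state fits in one long.
    static int search(byte[] text, byte[] pattern) {
        int m = pattern.length;
        if (m == 0 || m > 63) throw new IllegalArgumentException("pattern length must be 1..63");
        // masks[c] has bit i CLEARED where pattern[i] == c (Shift-OR convention).
        long[] masks = new long[256];
        java.util.Arrays.fill(masks, ~0L);
        for (int i = 0; i < m; i++) {
            masks[pattern[i] & 0xFF] &= ~(1L << i);
        }
        long state = ~0L;               // all ones: no partial match yet
        long matchBit = 1L << (m - 1);  // a zero here means a full match
        for (int j = 0; j < text.length; j++) {
            state = (state << 1) | masks[text[j] & 0xFF];
            if ((state & matchBit) == 0) return j - m + 1;
        }
        return -1;
    }
}
```

The entire inner loop is one shift, one OR and one test per byte, independent of the pattern content, which matches the observation that Shift-OR is hard to beat when the word-length limit is acceptable.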
Hello Matt, thanks for pointing me to the paper. I will read it when I find the time. I already knew the SMART web site, yet the results are not (or are no longer) available there. I remember a paper comparing many (single-pattern) algorithms; maybe this was related to the web site. Yet it is not that relevant to me, because I have turned my focus to searching multiple patterns in a text. If you come across some resources on this subject, let me know. Regards, Stefan
Hi Stefan, yes, multi pattern searches are also interesting... If I come across […] I think I found the source of the massive slowdown in the byteseek tests. Regards, Matt On 10 August 2016 at 19:40, Stefan Mandel notifications@github.com wrote:
Hi Stefan, I've confirmed that a simple bug exists in the StringReader class of […] I'll figure out when to do another byteseek release and what will be […] Many thanks for reporting the issue to me, and I hope to see byteseek […] Regards, Matt. On 10 August 2016 at 20:54, Matt Palmer mattpalms@gmail.com wrote:
Hi Stefan, one more thing occurs to me. Why use the (buggy) StringReader, or any […] Using the byte array searchForwards() methods on the searchers is probably […] Regards, Matt On 11 August 2016 at 19:54, Matt Palmer mattpalms@gmail.com wrote:
If we modify the ByteSeekBenchmark class to search over byte arrays, then […]

@Override
List indexes = new ArrayList<>();
while (searchIterator.hasNext()) { […]

On 11 August 2016 at 20:05, Matt Palmer mattpalms@gmail.com wrote:
OK, I will integrate this code. Just to prevent misunderstandings: your code will not scale on large texts, right? So if I decide to extend the benchmark by some large text sample, I will have to fall back to the WindowReader?
OK, good :) You are right that it won't scale using the current StringReader, but only […] Over byte arrays of any size, or any other WindowReader, it should be […] Matt On 11 August 2016 at 21:45, Stefan Mandel notifications@github.com wrote:
I guess it will always incur the penalty of transforming the string into a […] Regards, Matt. On 11 August 2016 at 21:45, Stefan Mandel notifications@github.com wrote:
Sorry, my question was a bit confusing. Converting a large text file to bytes will consume heap for the whole text file, which is not scalable. I suppose that WindowReader processes chars in a stream (or buffer/window), not in an array, so it will be scalable. I was mistaken to believe that this is a problem of your benchmark implementation, but after reviewing it I see that it is a problem of my benchmark interface. So if there is something to rework, it is on my part. By the way: I have tested your benchmark successfully (and will commit it in the near future). I do not know when I will start a new benchmark (it depends on my free time and bugfixes provided by third parties). Regards, Stefan
OK, good news! No worries on when it will happen. Like yourself, my progress on byteseek […] Regards, Matt. On 11 August 2016 at 22:51, Stefan Mandel notifications@github.com wrote:
I had to adjust the benchmark to filter overlapping results. It now looks like this: […]
If there is a more elegant solution with byteseek, let me know. Testing only non-overlapping matches was a design decision: […]
Hi Stefan, at this point, it's probably easier just to ditch the […]

@Override
List indexes = new ArrayList<>();
int searchPos = 0;
[…]
return indexes;

On 12 August 2016 at 17:38, Stefan Mandel notifications@github.com wrote:
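The collection loop sketched in the fragment above (indexes list, a searchPos cursor, return indexes) can be fleshed out along these lines. This is a hedged reconstruction, not byteseek's actual API: String.indexOf stands in for the real searcher. The point is only the restart rule for non-overlapping matches, i.e. resume one full pattern length past each hit.

```java
import java.util.ArrayList;
import java.util.List;

public class NonOverlapping {
    // Collect non-overlapping match positions by restarting the search
    // one pattern-length past each hit.
    static List<Integer> findNonOverlapping(String text, String pattern) {
        List<Integer> indexes = new ArrayList<>();
        int searchPos = 0;
        while (true) {
            int hit = text.indexOf(pattern, searchPos);
            if (hit < 0) break;
            indexes.add(hit);
            searchPos = hit + pattern.length(); // skip past the whole match
        }
        return indexes;
    }
}
```

With an overlapping searcher you would instead advance by one (`searchPos = hit + 1`), which is exactly the difference Stefan had to filter out.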
Note that you will only ever get zero or one search results returned by a […] The next version of byteseek will support getting the match positions […] Matt. On 12 August 2016 at 18:41, Matt Palmer mattpalms@gmail.com wrote:
Now a new benchmark is available; look at result-2016-08-13.txt. Comparing your Horspool with the StringsAndChars Horspool is quite confusing: with alphabet size/pattern length 2/2 your algorithm is twice as fast as StringsAndChars; at any other combination it is slower. There seems to be a pattern: SCHorspool gets faster on larger alphabets, BSBoyerMooreHorspool does not. I can think of two possible reasons: […]
Another question: […]
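The alphabet-size dependence observed here is inherent to Horspool: the average shift per alignment grows with the number of distinct symbols, so on a 2-symbol alphabet the algorithm degrades toward naive scanning, while large alphabets allow shifts close to the full pattern length. A minimal textbook Horspool over byte arrays (an illustration only, not either library's actual implementation):

```java
public class Horspool {
    // Returns the index of the first occurrence of pattern in text, or -1.
    static int search(byte[] text, byte[] pattern) {
        int m = pattern.length, n = text.length;
        if (m == 0 || n < m) return -1;
        // Bad-character shift table: default shift is the full pattern length;
        // bytes occurring in the pattern (except the last position) shift less.
        int[] shift = new int[256];
        java.util.Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }
        int pos = 0;
        while (pos <= n - m) {
            int i = m - 1;
            while (i >= 0 && text[pos + i] == pattern[i]) i--;
            if (i < 0) return pos;
            // Shift by the table entry of the byte under the pattern's last cell.
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return -1;
    }
}
```

With only two symbols, almost every text byte occurs near the end of the pattern, so most shifts are 1 and the worst case is O(nm); with 256 symbols most shifts equal m, which is the regime where SCHorspool speeds up.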
Hi Stefan, I was pretty sure byteseek would be slower than the others, but mostly […] Can you summarise how bad it is for byteseek? I'm surprised if it is […] As to which searchers should be included, good question! I would not bother […] Likewise, the SequenceMatcherSearcher can also be ignored. That's just me […] The Final Flag variants are probably worth benchmarking. I invented the […] Regards, Matt. On 13 August 2016 at 22:54, Stefan Mandel notifications@github.com wrote:
Yet the benchmark contains […]
All other libraries are on hold because they fail to pass the benchmark (timeouts or verification errors). A summary of all tested algorithms is at the bottom of the referenced document. Next to this file is a CSV file that can be analyzed in Calc/Excel. To get an impression of how your algorithm compares to StringsAndChars, compare the lines of BSBoyerMooreHorspoolBenchmark and SCHorspoolBenchmark: quite comparable for alphabet sizes of 2 and 4, but not competitive for all others. Comparing it with the naive search via String.indexOf at 1024/1024 (alphabet size/pattern size), it is even 30 times slower, whereas SCHorspool is 5 times faster (actually the benchmark is biased towards naive search, see almondtools/stringbench#1, but that does not explain the lack of performance compared with other Horspool implementations).
Hi Stefan, thanks, I'll have a look. The byteseek library is used in an intensive […] Matt. On 14 August 2016 at 06:38, Stefan Mandel notifications@github.com wrote:
It's very peculiar. My own benchmarks all show shift-table-based […] It may suggest that the byteseek benchmarks are being dominated by some […] Regards, Matt On 15 August 2016 at 12:08, Matt Palmer mattpalms@gmail.com wrote:
In fact, byteseek appears to be roughly flat for each alphabet size. Out of interest, what does an alphabet size greater than 256 mean? I guess […] Matt. On 15 August 2016 at 12:35, Matt Palmer mattpalms@gmail.com wrote:
One thing I would comment on about your jmh benchmark structure. I notice […]

@Benchmark […]

So you are both getting the pattern to test and the sample to test with […] In this circumstance, setup costs are going to dominate the much faster […] I've done this in my own benchmarks using parameterised values for pattern […]

@Benchmark […]

Happy to share any of my benchmarking code with you if it helps - I can put […] Regards, Matt On 15 August 2016 at 12:56, Matt Palmer mattpalms@gmail.com wrote:
Hi Stefan, rather than putting the benchmarking code up on GitHub, as it's not really […] It does mean I have to define different benchmark classes to accommodate […] Search preparation (calculating shift tables, etc.) is not included in the […] Anyway - not saying you should do what I'm doing here - but I do think it's […] Regards, Matt On 15 August 2016 at 13:20, Matt Palmer mattpalms@gmail.com wrote:
A good rule of thumb for jmh benchmarking is that there should be no loops […] On 15 August 2016 at 13:43, Matt Palmer mattpalms@gmail.com wrote:
One thing I initially struggled with was wanting to record the results to […] Validating that the algorithms give the correct results does not need to be […] Cheers, Matt. On 15 August 2016 at 13:49, Matt Palmer mattpalms@gmail.com wrote:
Hello Matt, your hints for benchmarking are correct, but they do not apply. Calling a getter on an already set-up sample will cost only nanoseconds - and is, according to jmh, the best practice to benchmark with variable tuples. The setup of the sample is properly separated from the benchmark; it can be found in the method annotated with […] It is also correct that validation is not needed in the benchmark - but since it is done in its own […] I think you should not concentrate on the algorithm (this probably works), but on the result collection, because this is my code and maybe I used your API in a wrong way ... ... indeed, after reviewing the code of the […] Yet I recommend that you review the […] Regards and thanks for your feedback, Stefan
Hi Stefan, OK - I still prefer to only include the code I want to test directly in the […] To eliminate the problem of having to convert the string to a byte array on […] Regards, Matt. On 15 August 2016 at 18:31, Stefan Mandel notifications@github.com wrote:
Hi, just having a look at your latest changes. That seems to solve the problem. […] One difference I have noted is that your search classes do pre-processing […] So the results are not comparable right now, since the byteseek results […] Still playing with it so we can get some good reliable results! Regards, Matt On 15 August 2016 at 18:51, Matt Palmer mattpalms@gmail.com wrote:
Hi Stefan, I've run a new mini benchmark using your latest code, modified to use the […] The results are interesting - CSV file attached. Byteseek does […] However, I'm still highly suspicious of these benchmark results. The […] I would expect longer patterns to perform faster for these algorithms. More investigation needed! Regards, Matt. On 16 August 2016 at 12:36, Matt Palmer mattpalms@gmail.com wrote:
I will include the preparation at setup time, as you recommended. Where did you attach the files? Maybe you replied by email, but GitHub filters file attachments? Or can I find them in your project space? To get back to the original problem: it would be best if StringReader were fixed so that I can use this class. This would be more comparable to the other algorithms. The modification that is active at this time preprocesses the text (converting to bytes) in the setup phase - which was thought to be part of the benchmarking phase.
Hi Stefan, I didn't attach the modified code. In ByteSeekBenchmark, I just modified […]

@Override
[…]

I'm sure there's a more elegant way using maps or something... I think you have to be able to pass in whatever data source a particular […] Using a StringReader will still incur the penalty of converting the text to […] Regards, Matt. On 16 August 2016 at 15:56, Stefan Mandel notifications@github.com wrote:
After some thought, I'm coming round to the idea that the benchmarks with […] They aren't in fact flat across alphabets, although the difference […] I modified your benchmarks to just run with a single parameter, rather than […] Anyway, I can't see anything hugely weird going on, but I shall probably keep […] Regards, Matt. On 16 August 2016 at 16:20, Matt Palmer mattpalms@gmail.com wrote:
Hello Matt, the idea of StringBench was to test […] Do not misunderstand me - your lib is byteseek - so nobody should be surprised at this. But I made the mistake of expecting your classes to be performant on Strings, and others possibly do so too.
I identified StringReader as the class where the byte[] conversion could be optimized out. Maybe I underestimate this task, but doing so would have two benefits: […]
Hi Stefan, the StringReader doesn't do on-the-fly conversion of text into bytes. It […] I agree that the purpose of StringBench is to benchmark string searching. I guess the question for you is: what is the purpose of StringBench? If the […] Regards, Matt. On 16 August 2016 at 18:49, Stefan Mandel notifications@github.com wrote:
Hello Matt, I thought of a possible future StringReader ;). Yet I am not aware of the effort it would take to make it independent from the byte array. Concerning the purpose of StringBench: […]
The algorithm performance would be interesting to find possible implementation bugs. So maybe the next version of StringBench will benchmark both algorithm performance and String performance.
Hello Stefan, those seem like mostly reasonable assumptions. I would challenge your last […]
I think you just have to be clear what is being measured in your […] Regards, Matt. Strings are the one core type byteseek isn't native to; the others it […] On 16 August 2016 at 20:16, Stefan Mandel notifications@github.com wrote:
Algorithm performance and String performance sound like good ways to differentiate […] Cheers, Matt. On 16 August 2016 at 21:59, Matt Palmer mattpalms@gmail.com wrote:
What do you think about testing: […]
I found out that SC performs quite well with strings, but very poorly with files (yet beaten by your implementation). Naive string search seems to work better, but the current implementation will not scale because it first converts the file contents to a long string (which would fail if the file is some GB large). I also found out that converting to byte[] in the benchmark is not critical compared with IO.
Hi Stefan, File and String look like two use cases people would be interested in. I […] Of course, you have to be stream-friendly to do that, i.e. it has to work […] Byteseek was built to search binary data, so it should perform very well on […] Searching large amounts of text data sounds like an interesting problem. Do […] Cheers, Matt On 18 Aug 2016 00:39, "Stefan Mandel" notifications@github.com wrote:
Hello Matt, I thought that my implementation was stream-friendly, but I had never tested reading from a file. It works after all, yet is not really performant. My approach reads buffers with a Reader and searches in these buffers. Variable-length chars limit the implementation options. I am considering reimplementing this approach, or finding an approach with RandomAccessFile. Regards, Stefan
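The buffer-based approach described here can be sketched as follows. This is a hypothetical illustration, not StringBench's actual code: it searches an InputStream block by block and carries the last patternLength-1 bytes of each block forward, so matches straddling a buffer boundary are still found. It works on bytes, sidestepping the variable-length-char issue mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class StreamSearch {
    // Returns the stream offsets of all (possibly overlapping) matches.
    static List<Long> search(InputStream in, byte[] pattern, int bufSize) throws IOException {
        List<Long> hits = new ArrayList<>();
        int m = pattern.length;
        byte[] buf = new byte[bufSize + m - 1]; // room for carried-over tail
        int carried = 0;   // bytes kept from the previous block
        long offset = 0;   // stream position of buf[0]
        int read;
        while ((read = in.read(buf, carried, buf.length - carried)) > 0) {
            int len = carried + read;
            // Check every start position that fits entirely in this block.
            for (int pos = 0; pos + m <= len; pos++) {
                int i = 0;
                while (i < m && buf[pos + i] == pattern[i]) i++;
                if (i == m) hits.add(offset + pos);
            }
            // Carry the last m-1 bytes: their start positions were not checked yet.
            carried = Math.min(m - 1, len);
            System.arraycopy(buf, len - carried, buf, 0, carried);
            offset += len - carried;
        }
        return hits;
    }
}
```

The naive inner scan could be swapped for Horspool or Shift-OR; the carry-over logic is what makes the result independent of the buffer size.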
I recently started a new benchmark with the following results: […]
Unfortunately the benchmark for files does not work above an alphabet size of 255. All reader/stream-based algorithms (not only yours) fail to find the correct number of matches. Reading all bytes and searching naively or with regexes yet seems to work (consequently the chart shows Java-Regex-Searching to be the best on files with alphabets larger than 255 chars). Probably an encoding problem.
Hi Stefan, yes, the multi pattern searchers are completely untested right now - it […]
As far as alphabets greater than 255 (256?) go, the maximum alphabet […] I believe some multi-byte text encodings have different byte sequences […] Regards, On 28 Aug 2016 16:18, "Stefan Mandel" notifications@github.com wrote:
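The point about multi-byte encodings is easy to demonstrate with the plain JDK (nothing byteseek-specific). In UTF-8, only code points up to U+007F occupy a single byte; anything beyond expands to two or more bytes, so a byte-oriented searcher sees a different "alphabet" than a char-oriented one, and an alphabet larger than 255 can no longer map one character to one byte.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Width {
    // Number of bytes a string occupies in UTF-8; characters past U+007F
    // expand to multiple bytes, so char count and byte count diverge.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, "\u00e9" (é) is one char but two UTF-8 bytes, and "\u20ac" (€) is three. A byte-level search for such a character must match the whole byte sequence, which is consistent with the reported mismatch being an encoding bug rather than an algorithm bug.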
Hello Matt, the problem with larger alphabets was indeed a bug in reading with another encoding. It should be fixed on the trunk. I will inform you when newer benchmarks are available. Regards, Stefan
Hi Stefan, cool, thanks. As far as my multi-pattern searches go, I did actually have tests for them, […] So - they may well work for you. You are using byte arrays, not the […] Regards, Matt. On 28 August 2016 at 21:19, Stefan Mandel notifications@github.com wrote:
Hello Matt, now a working benchmark is online; here is an excerpt of the performance for an alphabet size of 32 and a pattern size of 32:
name,family,number,alphabet,pattern,time […]
If you want to generate such CSVs for other params, use the ProcessResults class: […]
Regards, Stefan
Great, I'll check it out! Cheers, Matt On 11 Sep 2016 19:17, "Stefan Mandel" notifications@github.com wrote:
Performance issue solved, and some interesting benchmarks!
I tried to benchmark your library with StringBench. Yet there seems to be a performance issue with all of your implementations of AbstractSequenceSearcher. You can reproduce this by removing the @Ignore annotation from the corresponding benchmark. I did not yet test it with any other parameters, so perhaps this performance issue is related to other scenarios, too. Maybe this issue is related to an incorrect test setup. If so, please show me how to correct it.