Performance Issue with small alphabets and long texts #11
Hi Stefan, thanks for benchmarking byteseek - I'm very interested to see how it stacks up. I will have a look at the issue and get back to you. Regards, Matt. On 9 August 2016 at 07:16, Stefan Mandel notifications@github.com wrote:
Hi Stefan, one thing you may be interested in. I've been doing some benchmarking of […] For small alphabets and pattern sizes, Shift-OR works extremely well. For […] There are implementations in the SMART tool. I've ported these to my jmh […] There are some even faster algorithms out there, like EPSM, but they take […] Happy to share any of the benchmarking code or new algorithms with you. Regards, Matt. On 10 August 2016 at 09:38, Matt Palmer mattpalms@gmail.com wrote:
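For reference, the Shift-OR technique mentioned above can be sketched in a few lines. This is a generic textbook illustration, not byteseek's or SMART's actual code; the class and method names are invented for the example. The pattern is encoded as per-byte bitmasks and the search advances one machine word per text byte, which is why it does so well on small alphabets and short patterns.

```java
public class ShiftOr {
    // Returns the index of the first occurrence of pattern in text, or -1.
    // Pattern length is limited to 63 so the state fits in one long.
    static int search(byte[] text, byte[] pattern) {
        int m = pattern.length;
        if (m == 0 || m > 63) throw new IllegalArgumentException("pattern length must be 1..63");
        // masks[c] has bit i CLEARED where pattern[i] == c (Shift-OR convention).
        long[] masks = new long[256];
        java.util.Arrays.fill(masks, ~0L);
        for (int i = 0; i < m; i++) {
            masks[pattern[i] & 0xFF] &= ~(1L << i);
        }
        long state = ~0L;               // all ones: no partial match yet
        long matchBit = 1L << (m - 1);  // a zero here means a full match
        for (int j = 0; j < text.length; j++) {
            state = (state << 1) | masks[text[j] & 0xFF];
            if ((state & matchBit) == 0) return j - m + 1;
        }
        return -1;
    }
}
```

The entire inner loop is one shift, one OR and one test per byte, independent of the pattern content, which matches the observation that Shift-OR is hard to beat when the word-length limit is acceptable.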
Hello Matt, thanks for pointing me to the paper. I will read it when I find the time. I already knew the SMART web site, yet the results are not (or are no longer) available there. I remember a paper comparing many (single-pattern) algorithms; maybe this was related to the web site. Yet it is not that relevant to me, because I have turned my focus to searching multiple patterns in a text. If you come across some resources on this subject, let me know. Regards, Stefan
Hi Stefan, yes, multi pattern searches are also interesting... If I come across […] I think I found the source of the massive slowdown in the byteseek tests. Regards, Matt On 10 August 2016 at 19:40, Stefan Mandel notifications@github.com wrote:
Hi Stefan, I've confirmed that a simple bug exists in the StringReader class of […] I'll figure out when to do another byteseek release and what will be […] Many thanks for reporting the issue to me, and I hope to see byteseek […] Regards, Matt. On 10 August 2016 at 20:54, Matt Palmer mattpalms@gmail.com wrote:
Hi Stefan, one more thing occurs to me. Why use the (buggy) StringReader, or any […] Using the byte array searchForwards() methods on the searchers is probably […] Regards, Matt On 11 August 2016 at 19:54, Matt Palmer mattpalms@gmail.com wrote:
If we modify the ByteSeekBenchmark class to search over byte arrays, then […]

@Override
List indexes = new ArrayList<>();
while (searchIterator.hasNext()) { […]

On 11 August 2016 at 20:05, Matt Palmer mattpalms@gmail.com wrote:
OK, I will integrate this code. Just to prevent misunderstandings: your code will not scale on large texts, right? So if I decide to extend the benchmark by some large text sample, I will have to fall back to the WindowReader?
OK, good :) You are right that it won't scale using the current StringReader, but only […] Over byte arrays of any size, or any other WindowReader, it should be […] Matt On 11 August 2016 at 21:45, Stefan Mandel notifications@github.com wrote:
I guess it will always incur the penalty of transforming the string into a […] Regards, Matt. On 11 August 2016 at 21:45, Stefan Mandel notifications@github.com wrote:
Sorry, my question was a bit confusing. Converting a large text file to bytes will consume heap for the whole text file, which is not scalable. I suppose that WindowReader processes chars in a stream (or buffer/window), not in an array, so it will be scalable. I was mistaken to believe that this is a problem of your benchmark implementation, but after reviewing it I see that it is a problem of my benchmark interface. So if there is something to rework, it is on my part. By the way: I have tested your benchmark successfully (and will commit it in the near future). I do not know when I will start a new benchmark (it depends on my free time and bugfixes provided by third parties). Regards, Stefan
OK, good news! No worries on when it will happen. Like yourself, my progress on byteseek […] Regards, Matt. On 11 August 2016 at 22:51, Stefan Mandel notifications@github.com wrote:
I had to adjust the benchmark to filter overlapping results. It now looks like this: […]
If there is a more elegant solution with byteseek, let me know. Testing only non-overlapping matches was a design decision: […]
Hi Stefan, at this point, it's probably easier just to ditch the […]

@Override
List indexes = new ArrayList<>();
int searchPos = 0;
[…]
return indexes;

On 12 August 2016 at 17:38, Stefan Mandel notifications@github.com wrote:
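The collection loop sketched in the fragment above (indexes list, a searchPos cursor, return indexes) can be fleshed out along these lines. This is a hedged reconstruction, not byteseek's actual API: String.indexOf stands in for the real searcher. The point is only the restart rule for non-overlapping matches, i.e. resume one full pattern length past each hit.

```java
import java.util.ArrayList;
import java.util.List;

public class NonOverlapping {
    // Collect non-overlapping match positions by restarting the search
    // one pattern-length past each hit.
    static List<Integer> findNonOverlapping(String text, String pattern) {
        List<Integer> indexes = new ArrayList<>();
        int searchPos = 0;
        while (true) {
            int hit = text.indexOf(pattern, searchPos);
            if (hit < 0) break;
            indexes.add(hit);
            searchPos = hit + pattern.length(); // skip past the whole match
        }
        return indexes;
    }
}
```

With an overlapping searcher you would instead advance by one (`searchPos = hit + 1`), which is exactly the difference Stefan had to filter out.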
Note that you will only ever get zero or one search results returned by a […] The next version of byteseek will support getting the match positions […] Matt. On 12 August 2016 at 18:41, Matt Palmer mattpalms@gmail.com wrote:
Now a new benchmark is available; look at result-2016-08-13.txt. Comparing your Horspool with the StringsAndChars Horspool is quite confusing: with alphabet size/pattern length 2/2 your algorithm is twice as fast as StringsAndChars; at any other combination it is slower. There seems to be a pattern: SCHorspool gets faster on larger alphabets, BSBoyerMooreHorspool does not. I can think of two possible reasons: […]
Another question: […]
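The alphabet-size dependence observed here is inherent to Horspool: the average shift per alignment grows with the number of distinct symbols, so on a 2-symbol alphabet the algorithm degrades toward naive scanning, while large alphabets allow shifts close to the full pattern length. A minimal textbook Horspool over byte arrays (an illustration only, not either library's actual implementation):

```java
public class Horspool {
    // Returns the index of the first occurrence of pattern in text, or -1.
    static int search(byte[] text, byte[] pattern) {
        int m = pattern.length, n = text.length;
        if (m == 0 || n < m) return -1;
        // Bad-character shift table: default shift is the full pattern length;
        // bytes occurring in the pattern (except the last position) shift less.
        int[] shift = new int[256];
        java.util.Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }
        int pos = 0;
        while (pos <= n - m) {
            int i = m - 1;
            while (i >= 0 && text[pos + i] == pattern[i]) i--;
            if (i < 0) return pos;
            // Shift by the table entry of the byte under the pattern's last cell.
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return -1;
    }
}
```

With only two symbols, almost every text byte occurs near the end of the pattern, so most shifts are 1 and the worst case is O(nm); with 256 symbols most shifts equal m, which is the regime where SCHorspool speeds up.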
Hi Stefan, I was pretty sure byteseek would be slower than the others, but mostly […] Can you summarise how bad it is for byteseek? I'm surprised if it is […] As to which searchers should be included, good question! I would not bother […] Likewise, the SequenceMatcherSearcher can also be ignored. That's just me […] The Final Flag variants are probably worth benchmarking. I invented the […] Regards, Matt. On 13 August 2016 at 22:54, Stefan Mandel notifications@github.com wrote:
Yet the benchmark contains […]
All other libraries are on hold because they fail to pass the benchmark (timeouts or verification errors). A summary of all tested algorithms is at the bottom of the referenced document. Next to this file is a CSV file that can be analyzed in Calc/Excel. To get an impression of how your algorithm compares to StringsAndChars, compare the lines of BSBoyerMooreHorspoolBenchmark and SCHorspoolBenchmark: quite comparable for alphabet sizes of 2 and 4, but not competitive for all others. Comparing it with the naive search via String.indexOf at 1024/1024 (alphabet size/pattern size), it is even 30 times slower, whereas SCHorspool is 5 times faster (actually the benchmark is biased towards naive search, see almondtools/stringbench#1, but that does not explain the lack of performance compared with other Horspool implementations).
Hi Stefan, thanks, I'll have a look. The byteseek library is used in an intensive […] Matt. On 14 August 2016 at 06:38, Stefan Mandel notifications@github.com wrote:
It's very peculiar. My own benchmarks all show shift-table-based […] It may suggest that the byteseek benchmarks are being dominated by some […] Regards, Matt On 15 August 2016 at 12:08, Matt Palmer mattpalms@gmail.com wrote:
In fact, byteseek appears to be roughly flat for each alphabet size. Out of interest, what does an alphabet size greater than 256 mean? I guess […] Matt. On 15 August 2016 at 12:35, Matt Palmer mattpalms@gmail.com wrote:
One thing I would comment on about your jmh benchmark structure. I notice […]

@Benchmark […]

So you are both getting the pattern to test and the sample to test with […] In this circumstance, setup costs are going to dominate the much faster […] I've done this in my own benchmarks using parameterised values for pattern […]

@Benchmark […]

Happy to share any of my benchmarking code with you if it helps - I can put […] Regards, Matt On 15 August 2016 at 12:56, Matt Palmer mattpalms@gmail.com wrote:
Hi Stefan, rather than putting the benchmarking code up on GitHub, as it's not really […] It does mean I have to define different benchmark classes to accommodate […] Search preparation (calculating shift tables, etc.) is not included in the […] Anyway - not saying you should do what I'm doing here - but I do think it's […] Regards, Matt On 15 August 2016 at 13:20, Matt Palmer mattpalms@gmail.com wrote:
A good rule of thumb for jmh benchmarking is that there should be no loops […] On 15 August 2016 at 13:43, Matt Palmer mattpalms@gmail.com wrote:
One thing I initially struggled with was wanting to record the results to […] Validating that the algorithms give the correct results does not need to be […] Cheers, Matt. On 15 August 2016 at 13:49, Matt Palmer mattpalms@gmail.com wrote:
Hello Matt, your hints for benchmarking are correct, but they do not apply. Calling a getter on an already set-up sample will cost only nanoseconds - and is, according to jmh, the best practice to benchmark with variable tuples. The setup of the sample is properly separated from the benchmark; it can be found in the method annotated with […] It is also correct that validation is not needed in the benchmark - but since it is done in its own […] I think you should not concentrate on the algorithm (this probably works), but on the result collection, because this is my code and maybe I used your API in a wrong way ... ... indeed, after reviewing the code of the […] Yet I recommend that you review the […] Regards and thanks for your feedback, Stefan
Hi Stefan, OK - I still prefer to only include the code I want to test directly in the […] To eliminate the problem of having to convert the string to a byte array on […] Regards, Matt. On 15 August 2016 at 18:31, Stefan Mandel notifications@github.com wrote:
Hi, just having a look at your latest changes. That seems to solve the problem. […] One difference I have noted is that your search classes do pre-processing […] So the results are not comparable right now, since the byteseek results […] Still playing with it so we can get some good reliable results! Regards, Matt On 15 August 2016 at 18:51, Matt Palmer mattpalms@gmail.com wrote:
Hi Stefan, I've run a new mini benchmark using your latest code, modified to use the […] The results are interesting - CSV file attached. Byteseek does […] However, I'm still highly suspicious of these benchmark results. The […] I would expect longer patterns to perform faster for these algorithms. More investigation needed! Regards, Matt. On 16 August 2016 at 12:36, Matt Palmer mattpalms@gmail.com wrote:
I will include the preparation at setup time, as you recommended. Where did you attach the files? Maybe you replied by email, but GitHub filters file attachments? Or can I find them in your project space? To get back to the original problem: it would be best if StringReader were fixed so that I can use this class. This would be more comparable to the other algorithms. The modification that is active at this time preprocesses the text (converting to bytes) in the setup phase - which was thought to be part of the benchmarking phase.
Hi Stefan, I didn't attach the modified code. In ByteSeekBenchmark, I just modified […]

@Override
[…]

I'm sure there's a more elegant way using maps or something... I think you have to be able to pass in whatever data source a particular […] Using a StringReader will still incur the penalty of converting the text to […] Regards, Matt. On 16 August 2016 at 15:56, Stefan Mandel notifications@github.com wrote:
After some thought, I'm coming round to the idea that the benchmarks with […] They aren't in fact flat across alphabets, although the difference […] I modified your benchmarks to just run with a single parameter, rather than […] Anyway, I can't see anything hugely weird going on, but I shall probably keep […] Regards, Matt. On 16 August 2016 at 16:20, Matt Palmer mattpalms@gmail.com wrote:
Hello Matt, the idea of StringBench was to test […] Do not misunderstand me - your lib is byteseek - so nobody should be surprised at this. But I made the mistake of expecting your classes to be performant on Strings, and others possibly do so too.
I identified StringReader as the class where the byte[] conversion could be optimized out. Maybe I underestimate this task, but doing so would have two benefits: […]
Hi Stefan, the StringReader doesn't do on-the-fly conversion of text into bytes. It […] I agree that the purpose of StringBench is to benchmark string searching. I guess the question for you is: what is the purpose of StringBench? If the […] Regards, Matt. On 16 August 2016 at 18:49, Stefan Mandel notifications@github.com wrote:
Hello Matt, I thought of a possible future StringReader ;). Yet I am not aware of the effort it would take to make it independent from the byte array. Concerning the purpose of StringBench: […]
The algorithm performance would be interesting to find possible implementation bugs. So maybe the next version of StringBench will benchmark both algorithm performance and String performance.
Hello Stefan, those seem like mostly reasonable assumptions. I would challenge your last […]
I think you just have to be clear what is being measured in your […] Regards, Matt. Strings are the one core type byteseek isn't native to; the others it […] On 16 August 2016 at 20:16, Stefan Mandel notifications@github.com wrote:
Algorithm performance and String performance sound like good ways to differentiate […] Cheers, Matt. On 16 August 2016 at 21:59, Matt Palmer mattpalms@gmail.com wrote:
What do you think about testing: […]
I found out that SC performs quite well with strings, but very poorly with files (yet beaten by your implementation). Naive string search seems to work better, but the current implementation will not scale because it first converts the file contents to a long string (which would fail if the file is some GB large). I also found out that converting to byte[] in the benchmark is not critical compared with IO.
Hi Stefan, File and String look like two use cases people would be interested in. I […] Of course, you have to be stream-friendly to do that, i.e. it has to work […] Byteseek was built to search binary data, so it should perform very well on […] Searching large amounts of text data sounds like an interesting problem. Do […] Cheers, Matt On 18 Aug 2016 00:39, "Stefan Mandel" notifications@github.com wrote:
Hello Matt, I thought that my implementation was stream-friendly, but I had never tested reading from a file. It works after all, yet is not really performant. My approach reads buffers with a Reader and searches in these buffers. Variable-length chars limit the implementation options. I am considering reimplementing this approach, or finding an approach with RandomAccessFile. Regards, Stefan
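The buffer-based approach described here can be sketched as follows. This is a hypothetical illustration, not StringBench's actual code: it searches an InputStream block by block and carries the last patternLength-1 bytes of each block forward, so matches straddling a buffer boundary are still found. It works on bytes, sidestepping the variable-length-char issue mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class StreamSearch {
    // Returns the stream offsets of all (possibly overlapping) matches.
    static List<Long> search(InputStream in, byte[] pattern, int bufSize) throws IOException {
        List<Long> hits = new ArrayList<>();
        int m = pattern.length;
        byte[] buf = new byte[bufSize + m - 1]; // room for carried-over tail
        int carried = 0;   // bytes kept from the previous block
        long offset = 0;   // stream position of buf[0]
        int read;
        while ((read = in.read(buf, carried, buf.length - carried)) > 0) {
            int len = carried + read;
            // Check every start position that fits entirely in this block.
            for (int pos = 0; pos + m <= len; pos++) {
                int i = 0;
                while (i < m && buf[pos + i] == pattern[i]) i++;
                if (i == m) hits.add(offset + pos);
            }
            // Carry the last m-1 bytes: their start positions were not checked yet.
            carried = Math.min(m - 1, len);
            System.arraycopy(buf, len - carried, buf, 0, carried);
            offset += len - carried;
        }
        return hits;
    }
}
```

The naive inner scan could be swapped for Horspool or Shift-OR; the carry-over logic is what makes the result independent of the buffer size.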
I recently started a new benchmark with the following results: […]
Unfortunately the benchmark for files does not work above an alphabet size of 255. All reader/stream-based algorithms (not only yours) fail to find the correct number of matches. Reading all bytes and searching naively or with regexes yet seems to work (consequently the chart shows Java-Regex-Searching to be the best on files with alphabets larger than 255 chars). Probably an encoding problem.
Hi Stefan, yes, the multi pattern searchers are completely untested right now - it […]
As far as alphabets greater than 255 (256?) go, the maximum alphabet […] I believe some multi-byte text encodings have different byte sequences […] Regards, On 28 Aug 2016 16:18, "Stefan Mandel" notifications@github.com wrote:
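The point about multi-byte encodings is easy to demonstrate with the plain JDK (nothing byteseek-specific). In UTF-8, only code points up to U+007F occupy a single byte; anything beyond expands to two or more bytes, so a byte-oriented searcher sees a different "alphabet" than a char-oriented one, and an alphabet larger than 255 can no longer map one character to one byte.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Width {
    // Number of bytes a string occupies in UTF-8; characters past U+007F
    // expand to multiple bytes, so char count and byte count diverge.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, "\u00e9" (é) is one char but two UTF-8 bytes, and "\u20ac" (€) is three. A byte-level search for such a character must match the whole byte sequence, which is consistent with the reported mismatch being an encoding bug rather than an algorithm bug.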
Hello Matt, the problem with larger alphabets was indeed a bug in reading with another encoding. It should be fixed on the trunk. I will inform you when newer benchmarks are available. Regards, Stefan
Hi Stefan, cool, thanks. As far as my multi-pattern searches go, I did actually have tests for them, […] So - they may well work for you. You are using byte arrays, not the […] Regards, Matt. On 28 August 2016 at 21:19, Stefan Mandel notifications@github.com wrote:
Hello Matt, now a working benchmark is online; here is an excerpt of the performance for an alphabet size of 32 and a pattern size of 32:
name,family,number,alphabet,pattern,time […]
If you want to generate such CSVs for other params, use the ProcessResults class: […]
Regards, Stefan
Great, I'll check it out! Cheers, Matt On 11 Sep 2016 19:17, "Stefan Mandel" notifications@github.com wrote:
Performance issue solved, and some interesting benchmarks!
I tried to benchmark your library with StringBench. Yet there seems to be a performance issue with all of your implementations of AbstractSequenceSearcher. You can reproduce this by removing the @Ignore annotation from the corresponding benchmark. I did not yet test it with any other parameters, so perhaps this performance issue is related to other scenarios, too. Maybe this issue is related to an incorrect test setup. If so, please show me how to correct it.