-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Future work #17
Comments
Can you rebase your changes on top of the current master branch? It would be nice to see the benchmarks running in caliper. |
I will rebase that on Monday. Currently there is only NFA, so the performance of matching UTF8 bytes is worse then matching runes because algorithm fans out on Threads for UTF8 byte ranges (just like RE2 does). Our next step is to add DFA. We expect to see significant performance improvements then. |
|
Caliper benchmarks currently convert String inputs to UTF8 Slices, which causes overhead. We have internally run JMH benchmarks and worst case (wildcard .) cases took ~2-2.5 times compared to rune matching. Simple patterns (no fanout) were slightly faster. We also have POC Presto prototype with RE2J on Slices but with rune matching and our test queries had noticeable improvement compared to String RE2J. I will port benchmarks such that inputs are static UTF8 Slices. This will give more comparable results. |
Here are initial caliper results with byte matching: |
On 27 November 2015 at 11:04, Karol Sobczak notifications@github.com
RE2J has users. How do you propose to change both the API and the regular It seems to me you are describing a fork of RE2J, not a modification of it. |
Yes, currently it is a fork, but it is possible to support String API again. In order to do so a program generation for UTF16 bytes is required. Additionally, String data will be copied to Slice and then matched. In current master UTF8 byte data needs to be converted to String, which is rather slow. |
Here are the newest Caliper results from using DFA from https://github.com/Teradata/re2j/tree/re2j_on_bytes branch: |
Karol, let me be clear: first, you must not make incompatible changes to this package. It has many existing clients, some of which you cannot see, and if you make the changes describe above, those programs will break. Second, I am very uneasy about adding a dependency on the almost totally undocumented and terrifyingly unsafe-looking airlift/slice. What is the slice abstraction? Why not have RE2J define an interface that abstracts and input stream and make airlift/slice implement that interface? |
@alandonovan that pull request was created by mistake (it goes to TD fork), I have closed it already. I understand that you are reluctant to use Slice, but interface is also not that great idea. This is because reading input takes considerable amount of CPU %. When you have an interface with multiple implementations, then JIT won't be able to inline method calls and will have to use virtual calls. This will cause performance degradation, which can be avoided by having just one concrete class. |
Slice is convenient for us since it is used in Presto, therefore data doesn't have to be converted. Of course it is possible to replace Slice with other structure, but Slice gives some nice functionalities like String<->Slice conversion, builders, can work with off-heap memory, etc. It aims for efficiency. |
OK, never mind.
I understand that dynamic calls come at a cost relative to static calls to a single concrete type, but if you intend this package for use outside Teradata, or perhaps for other purposes even within your company, you could make the API more general and useful by expressing it in terms of (byte[], start, length) triples rather than a complex (and unsafe) third-party package. I have my doubts that the cost of making a single bulk copy of the input is even measurable relative to the cost of running the matcher. BTW, my concerns about airlift/slice's safety are not academic; see Paul Sandoz of Oracle's "Safety Not Guaranteed" slide deck (http://cr.openjdk.java.net/~psandoz/dv14-uk-paul-sandoz-unsafe-the-situation.pdf). |
Unsafe is still up in the air, especially with Java 9 on the way (despite delays). It's been a hot topic due to Jigsaw. Lots of projects relying on Unsafe has to make changes. There are talks of a public API, but that again is a change, uncertainty and not the way forward. |
I though about using triples, however you would have to use them in multiple places. This would make the code more complex and less readable. Additionally, we need to have support for slice/substring type of operation. I agree that
While
Imagine large documents where matching pattern is at the beginning of a text. You could also have large documents for which you could quickly determine that they don't match pattern by looking at first few bytes. By my knowledge those could be the use cases at Facebook. We ran benchmarks and without copying query runs for few seconds compared to few minutes for queries that do make copy of each input row. |
Dear Community,
As some of you know we at Teradata are planning to incorporate RE2J into Presto database (https://github.com/facebook/presto) since it is based on lighting fast RE2 and has potential for huge performance improvements. As part of our research on RE2J we found that it currently lacks many algorithms and optimizations that make RE2j so fast. Those optimizations require byte matching and not rune matching like in RE2J. Therefore as part of our work we would like to make RE2J internally work on UTF8 bytes. Additionally, we would like an input data to be represented by Slice (https://github.com/airlift/slice). Slice is basically a wrapper around byte[] array that can be sliced. It would allow us to directly match raw bytes. We believe that our changes will make RE2J one of the fastest matching solutions for Java in big data high performance scenarios. However, our approach might break current functionality of matching Java UTF16 Strings. We might add support for it later, but it will be less efficient than raw bytes matching because conversion from Strings to Slices is required. Our primary focus would be on matching UTF8 sequences. Our POC branch with RE2J on Slices is here: https://github.com/Teradata/re2j/tree/re2j_on_bytes. What do you think about our plan?
Here are some useful FAQs that additionally explain our motivations:
Slices are effectively a wrapper around a memory fragment. Since we would like RE2J to match UTF8 bytes instead of runes it is logical to use a structure that provides fast access to bytes and allows slicing. It happens that Slice is such a structure and is also used in Presto.
Because raw byte[] array cannot be sliced without memory copying. We aim for maximum performance.
Because ByteBuffer is an interface and will introduce virtual method calls when RE2J reads input bytes. This will hurt performance.
Having MachineInput abstracted with multiple implementations will introduce virtual method calls and will hurt RE2J performance.
We could compile matching program specifically for UTF16 byte sequences and convert Strings to Slices. Matching will be efficient but would require one memory copying (String to Slice).
Presto uses UTF8. Additionally I would risk saying that most of the text in big data is either stored as ANSI or UTF8. That makes UTF8 our primary target. See http://utf8everywhere.org/.
Original RE2 matches bytes. Byte matching allows us to port algorithms and optimizations from RE2. Without those algorithms we won’t be able to match RE2 performance.
If you don’t care about performance then Java Regular Expressions is probably your number one choice. If you want high performance matching then you won't choose Java Regular Expressions. RE2J is not a drop in replacement anyway since it doesn't support backreferences. Therefore I don’t think we should target Java Regular Expressions users.
Absolutely. Slices are small airlift subproject so there is not a lot of dependency. Additionally, I think big data and high performance community will care more about fast UTF8 bytes matching solution.
The text was updated successfully, but these errors were encountered: