-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Q] Is it possible to improve parse speed of the Joni regexp library. #27
Comments
@hiroyuki-sato A difference in speed is in your joni code more work is being done than in the java version outside the regexp. Java regexp works on char whereas Joni works on byte. When you read your data into a String Java will transcode what it reads into UTF16-LE. When you enter your loop you are taking an UTF16-LE String and transcoding it back to UTF-8 then getting a byte[]. We have seen this transcoding cost end up being as costly as performing regexp matches (although it depends on the regexp so I am not saying that of your case neccesarily). If you make your LineDecoder actually work with original bytes then that cost will go away. Assuming your source you are reading is already UTF-8. |
Hello, @enebo Thank you for the comment. Could I have you confirm whether my understanding is correct or not? Improve plan.
byte[] line_bytes = byteReader.read(); // read byte[] instead of String.
Matcher matcher = regex.matcher(line_bytes);
int result = matcher.search(0, line_bytes.length, Option.DEFAULT); Current code
String line = lineDecoder.poll();
byte[] line_bytes = line.getBytes(StandardCharsets.UTF_8);
Matcher matcher = regex.matcher(line_bytes);
int result = matcher.search(0, line_bytes.length, Option.DEFAULT); Is this correct? Thank you for your advice. Kind regards. |
The other thing is laziness, if you dont want to materialize the whole buffer, some extra reading machinery will be needed (also to deal with newlines). |
@hiroyuki-sato yes you understood what I said but as @lopex mentions there may be more work. I will also explain @lopex comment a bit more since I think that needs more explanation. You will read in data as byte[] (some amount from a buffer size to entire file...). If you need your regexp to be single-line regexp then you need to figure out where '\n' are. So you need to read your data into a byte[]. Walk the byte[] calculating each utf-8 char until you find a '\n'. At this point, you can make a new byte[] containing only the line. Or you can keep track of the line with two ints and use joni method like: https://github.com/jruby/joni/blob/master/src/org/joni/Regex.java#L113. This will eliminate constructing additional byte[] array per line. @lopex can you give him a simple jcodings code snippet for finding a newline region in a byte[]? |
@hiroyuki-sato: https://github.com/jruby/jcodings/blob/master/src/org/jcodings/Encoding.java#L200. |
Thank you for the advice. I realized that the following code is not good.
So I'll try to use
I'll try to use ByteByffer or something class for calculating newline position. I'll close this issue within a week by myself. Again, I appreciate your advice. |
Hello, team.
First of all. Thank you very much for making great JRuby software.
Now, I'm making embulk-parser-joni_regexp.
I wanted to use Oniguruma compatible regular expression library.
That's why I'm using Joni.
Currently, My Joni code over three times slower than java.util.regex library.
My original code is here.
https://github.com/hiroyuki-sato/embulk-parser-joni_regexp/blob/master/src/main/java/org/embulk/parser/joni_regexp/JoniRegexpParserPlugin.java#L91-L119
I made test code regexp_test for compare regex speed.
The main part is the following.
It it possible to improve parse speed of the Joni regex library?
Thank you for your advice.
Kind regards.
format string
Joni
java.util.regex
The text was updated successfully, but these errors were encountered: