Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Q] Is it possible to improve parse speed of the Joni regexp library. #27

Closed
hiroyuki-sato opened this issue Feb 23, 2017 · 7 comments
Closed

Comments

@hiroyuki-sato
Copy link

hiroyuki-sato commented Feb 23, 2017

Hello, team.

First of all. Thank you very much for making great JRuby software.

Now, I'm making embulk-parser-joni_regexp.
I wanted to use Oniguruma compatible regular expression library.
That's why I'm using Joni.

Currently, My Joni code over three times slower than java.util.regex library.

My original code is here.
https://github.com/hiroyuki-sato/embulk-parser-joni_regexp/blob/master/src/main/java/org/embulk/parser/joni_regexp/JoniRegexpParserPlugin.java#L91-L119

I made test code regexp_test for compare regex speed.

The main part is the following.

It it possible to improve parse speed of the Joni regex library?

Thank you for your advice.

Kind regards.

format string

        String format = "^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \\[(?<time>[^\\]]*)\\] \"(?<method>\\S+)(?: +(?<path>[^ ]*) +\\S*)?\" (?<code>[^ ]*) (?<size>[^ ]*)(?: \"(?<referer>[^\\\"]*)\" \"(?<agent>[^\\\"]*)\")?$";

Joni

        byte[] pattern = format.getBytes(StandardCharsets.UTF_8);
        Regex regexp = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE);
// ...
            while (true) {
// ...
                byte[] line_bytes = line.getBytes(StandardCharsets.UTF_8);
                Matcher matcher = regexp.matcher(line_bytes);
                int result = matcher.search(0, line_bytes.length, Option.DEFAULT);
// ...
            }

java.util.regex

        Pattern pattern = Pattern.compile(format);

            while (true) {
// ...
                Matcher matcher = pattern.matcher(line);
                if (matcher.matches()) {
// ...
            }
@enebo
Copy link
Member

enebo commented Feb 24, 2017

@hiroyuki-sato A difference in speed is in your joni code more work is being done than in the java version outside the regexp. Java regexp works on char whereas Joni works on byte. When you read your data into a String Java will transcode what it reads into UTF16-LE. When you enter your loop you are taking an UTF16-LE String and transcoding it back to UTF-8 then getting a byte[]. We have seen this transcoding cost end up being as costly as performing regexp matches (although it depends on the regexp so I am not saying that of your case neccesarily).

If you make your LineDecoder actually work with original bytes then that cost will go away. Assuming your source you are reading is already UTF-8.

@hiroyuki-sato
Copy link
Author

Hello, @enebo

Thank you for the comment.

Could I have you confirm whether my understanding is correct or not?
(This is because for my English skill.)

Improve plan.

  • Read a data as byte[] instead of String.
byte[] line_bytes = byteReader.read(); // read byte[] instead of String. 
Matcher matcher = regex.matcher(line_bytes);
int result = matcher.search(0, line_bytes.length, Option.DEFAULT);

Current code

  • Don't read data as String.
  • It need to convert byte[] if I use Joni library.
String line = lineDecoder.poll(); 
byte[] line_bytes = line.getBytes(StandardCharsets.UTF_8);
Matcher matcher = regex.matcher(line_bytes);
int result = matcher.search(0, line_bytes.length, Option.DEFAULT);

Is this correct?

Thank you for your advice.

Kind regards.

@lopex
Copy link
Member

lopex commented Feb 24, 2017

The other thing is laziness, if you dont want to materialize the whole buffer, some extra reading machinery will be needed (also to deal with newlines).

@enebo
Copy link
Member

enebo commented Feb 24, 2017

@hiroyuki-sato yes you understood what I said but as @lopex mentions there may be more work.

I will also explain @lopex comment a bit more since I think that needs more explanation.

You will read in data as byte[] (some amount from a buffer size to entire file...). If you need your regexp to be single-line regexp then you need to figure out where '\n' are. So you need to read your data into a byte[]. Walk the byte[] calculating each utf-8 char until you find a '\n'. At this point, you can make a new byte[] containing only the line. Or you can keep track of the line with two ints and use joni method like: https://github.com/jruby/joni/blob/master/src/org/joni/Regex.java#L113. This will eliminate constructing additional byte[] array per line.

@lopex can you give him a simple jcodings code snippet for finding a newline region in a byte[]?

@lopex
Copy link
Member

lopex commented Feb 24, 2017

@hiroyuki-sato: https://github.com/jruby/jcodings/blob/master/src/org/jcodings/Encoding.java#L200.
Trying to come up with general (using jcodings apis) and performant lazy solution...

@hiroyuki-sato
Copy link
Author

Hello @enebo and @lopex

Thank you for the advice.

I realized that the following code is not good.
Because, the "line_bytes" isn't permanent size.
(I don't write Java code much.)

byte[] line_bytes = line.getBytes(StandardCharsets.UTF_8);

So I'll try to use (byte[], int start, int end) method which is you explained.

public Regex(byte[] bytes, int p, int end)
public abstract boolean isNewLine(byte[]bytes, int p, int end);

I'll try to use ByteByffer or something class for calculating newline position.

I'll close this issue within a week by myself.

Again, I appreciate your advice.

@hiroyuki-sato
Copy link
Author

Hello @enebo and @lopex

Thank you for your advice.
I forgot to close this issue.

Thank you again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants