[Q] Is it possible to improve parse speed of the Joni regexp library. #27

hiroyuki-sato · 2017-02-23T09:25:58Z

Hello, team.

First of all. Thank you very much for making great JRuby software.

Now, I'm making embulk-parser-joni_regexp.
I wanted to use Oniguruma compatible regular expression library.
That's why I'm using Joni.

Currently, My Joni code over three times slower than java.util.regex library.

My original code is here.
https://github.com/hiroyuki-sato/embulk-parser-joni_regexp/blob/master/src/main/java/org/embulk/parser/joni_regexp/JoniRegexpParserPlugin.java#L91-L119

I made test code regexp_test for compare regex speed.

The main part is the following.

It it possible to improve parse speed of the Joni regex library?

Thank you for your advice.

Kind regards.

format string

        String format = "^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \\[(?<time>[^\\]]*)\\] \"(?<method>\\S+)(?: +(?<path>[^ ]*) +\\S*)?\" (?<code>[^ ]*) (?<size>[^ ]*)(?: \"(?<referer>[^\\\"]*)\" \"(?<agent>[^\\\"]*)\")?$";

Joni

        byte[] pattern = format.getBytes(StandardCharsets.UTF_8);
        Regex regexp = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE);
// ...
            while (true) {
// ...
                byte[] line_bytes = line.getBytes(StandardCharsets.UTF_8);
                Matcher matcher = regexp.matcher(line_bytes);
                int result = matcher.search(0, line_bytes.length, Option.DEFAULT);
// ...
            }

java.util.regex

        Pattern pattern = Pattern.compile(format);

            while (true) {
// ...
                Matcher matcher = pattern.matcher(line);
                if (matcher.matches()) {
// ...
            }

enebo · 2017-02-24T14:38:12Z

@hiroyuki-sato A difference in speed is in your joni code more work is being done than in the java version outside the regexp. Java regexp works on char whereas Joni works on byte. When you read your data into a String Java will transcode what it reads into UTF16-LE. When you enter your loop you are taking an UTF16-LE String and transcoding it back to UTF-8 then getting a byte[]. We have seen this transcoding cost end up being as costly as performing regexp matches (although it depends on the regexp so I am not saying that of your case neccesarily).

If you make your LineDecoder actually work with original bytes then that cost will go away. Assuming your source you are reading is already UTF-8.

hiroyuki-sato · 2017-02-24T16:35:20Z

Hello, @enebo

Thank you for the comment.

Could I have you confirm whether my understanding is correct or not?
(This is because for my English skill.)

Improve plan.

Read a data as byte[] instead of String.

byte[] line_bytes = byteReader.read(); // read byte[] instead of String. 
Matcher matcher = regex.matcher(line_bytes);
int result = matcher.search(0, line_bytes.length, Option.DEFAULT);

Current code

Don't read data as String.
It need to convert byte[] if I use Joni library.

String line = lineDecoder.poll(); 
byte[] line_bytes = line.getBytes(StandardCharsets.UTF_8);
Matcher matcher = regex.matcher(line_bytes);
int result = matcher.search(0, line_bytes.length, Option.DEFAULT);

Is this correct?

Thank you for your advice.

Kind regards.

lopex · 2017-02-24T16:42:24Z

The other thing is laziness, if you dont want to materialize the whole buffer, some extra reading machinery will be needed (also to deal with newlines).

enebo · 2017-02-24T17:01:37Z

@hiroyuki-sato yes you understood what I said but as @lopex mentions there may be more work.

I will also explain @lopex comment a bit more since I think that needs more explanation.

You will read in data as byte[] (some amount from a buffer size to entire file...). If you need your regexp to be single-line regexp then you need to figure out where '\n' are. So you need to read your data into a byte[]. Walk the byte[] calculating each utf-8 char until you find a '\n'. At this point, you can make a new byte[] containing only the line. Or you can keep track of the line with two ints and use joni method like: https://github.com/jruby/joni/blob/master/src/org/joni/Regex.java#L113. This will eliminate constructing additional byte[] array per line.

@lopex can you give him a simple jcodings code snippet for finding a newline region in a byte[]?

lopex · 2017-02-24T17:37:45Z

@hiroyuki-sato: https://github.com/jruby/jcodings/blob/master/src/org/jcodings/Encoding.java#L200.
Trying to come up with general (using jcodings apis) and performant lazy solution...

hiroyuki-sato · 2017-02-27T01:29:31Z

Hello @enebo and @lopex

Thank you for the advice.

I realized that the following code is not good.
Because, the "line_bytes" isn't permanent size.
(I don't write Java code much.)

byte[] line_bytes = line.getBytes(StandardCharsets.UTF_8);

So I'll try to use (byte[], int start, int end) method which is you explained.

public Regex(byte[] bytes, int p, int end)

public abstract boolean isNewLine(byte[]bytes, int p, int end);

I'll try to use ByteByffer or something class for calculating newline position.

I'll close this issue within a week by myself.

Again, I appreciate your advice.

hiroyuki-sato · 2017-03-12T09:03:58Z

Hello @enebo and @lopex

Thank you for your advice.
I forgot to close this issue.

Thank you again.

hiroyuki-sato closed this as completed Mar 12, 2017

sonots mentioned this issue May 18, 2017

Use joni to improve REGEXP performance sonots/embulk-filter-row#16

Merged

hiroyuki-sato mentioned this issue Jun 8, 2017

Performance improvement hiroyuki-sato/embulk-parser-joni_regexp#2

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Q] Is it possible to improve parse speed of the Joni regexp library. #27

[Q] Is it possible to improve parse speed of the Joni regexp library. #27

hiroyuki-sato commented Feb 23, 2017 •

edited

enebo commented Feb 24, 2017 •

edited

hiroyuki-sato commented Feb 24, 2017

lopex commented Feb 24, 2017

enebo commented Feb 24, 2017

lopex commented Feb 24, 2017

hiroyuki-sato commented Feb 27, 2017

hiroyuki-sato commented Mar 12, 2017

[Q] Is it possible to improve parse speed of the Joni regexp library. #27

[Q] Is it possible to improve parse speed of the Joni regexp library. #27

Comments

hiroyuki-sato commented Feb 23, 2017 • edited

format string

Joni

java.util.regex

enebo commented Feb 24, 2017 • edited

hiroyuki-sato commented Feb 24, 2017

Improve plan.

Current code

lopex commented Feb 24, 2017

enebo commented Feb 24, 2017

lopex commented Feb 24, 2017

hiroyuki-sato commented Feb 27, 2017

hiroyuki-sato commented Mar 12, 2017

hiroyuki-sato commented Feb 23, 2017 •

edited

enebo commented Feb 24, 2017 •

edited