Fix segmentation fault #64

ksss · 2017-03-22T03:15:08Z

"".split(1)
#=> segmentation fault

I think, it should delegate to core if argument is none OnigRegexp object.

mattn · 2017-03-22T03:21:59Z

I want to make fallback. please add #ifndef MRUBY_UTF8_STRING and #endif. if "あ".len == 1 to the test.

ksss · 2017-03-22T08:27:34Z

Sorry, I couldn't understand what that means.
I think, It's a duplicated implementation with mruby-core and mruby-onig-regexp.

# without MRB_UTF8_STRING
$ mruby -e 'p "あいうえお".string_split("")'
["\343", "\201", "\202", "\343", "\201", "\204", "\343", "\201", "\206", "\343", "\201", "\210", "\343", "\201", "\212"]

# with MRB_UTF8_STRING
$ mruby -e 'p "あいうえお".string_split("")'
["あ", "い", "う", "え", "お"]

This is String#string_split behavior.
If you want to test this behavior, we should write a test on the core.

mattn · 2017-03-22T14:46:37Z

I'm thinking this possibly make user confusing because return latin-1 letters even though OnigRegexp can handle unicode. How do you think?

mattn · 2017-03-22T14:51:02Z

If string_split return latin-1 letters, regexp_split also return latin-1 lettres.

#ifdef MRUBY_UTF8_STRING
      err = onig_new(&rp, (OnigUChar*)ptr, (OnigUChar*)ptr + len, ONIG_OPTION_DEFAULT,
                     ONIG_ENCODING_UTF8, OnigDefaultSyntax, NULL);
#else
      err = onig_new(&rp, (OnigUChar*)ptr, (OnigUChar*)ptr + len, ONIG_OPTION_DEFAULT,
                     ONIG_ENCODING_ASCII, OnigDefaultSyntax, NULL);
#endif

ksss · 2017-03-23T01:59:07Z

OK, I have to confess I didn't understand the unicode.
And I understood that you are considering how users feel without the implementation. aren't you?

I think, Users are not thinking about using OnigRegexp in this case.

"あいうえお".split("")

Because, It don't seems to be using Regexp object anywhere.

mattn · 2017-03-23T02:37:52Z

Because, It don't seems to be using Regexp object anywhere.

Then, we will have to make error like below.

#ifndef MRB_UTF8_STRING
#error mruby-onig-regexp require MRB_UTF8_STRING
#endif

I just don't like contradiction between string(without utf8) and onig-regexp(can handle utf8).

MRB_UTF8_STRING	"あ"[0].bytesize	"あ".split("").size	"あいう".split(/い/)
defined	3	1	works well
not defined	1	3	works well

ksss · 2017-03-23T05:25:27Z

Thank you for comment.

Is it intentional?

(It installed mruby-onig-regexp)

MRB_UTF8_STRING	"あ".onig_regexp_split("")	"あ".string_split("")
defined	["あ"]	["あ"]
not defined	["\343\201\202"]	["\343", "\201", "\202"]

I don't know why result should be ["\343\201\202"] if "あ".onig_regexp_split("") with MRB_UTF8_STRING is not defined.

This test case seems to be passing both MRB_UTF8_STRING defined or not.

mruby-onig-regexp/test/mruby_onig_regexp.rb

Line 331 in 071fc4e

    
           assert_equal ['あ', 'い', 'う', 'え', 'お'], 'あいうえお'.onig_regexp_split('')

mattn · 2017-03-23T06:12:03Z

Is it intentional?

Yes, I know current implementation have contradiction. So I want to fix.

I don't know why result should be ["\343\201\202"] if "あ".onig_regexp_split("") with MRB_UTF8_STRING is not defined.

Because current implementation do this at https://github.com/mattn/mruby-onig-regexp/blob/master/src/mruby_onig_regexp.c#L833-L840

Let's fix like below.

remove https://github.com/mattn/mruby-onig-regexp/blob/master/src/mruby_onig_regexp.c#L833-L840
deligate straight forward
add #ifdef MRB_UTF8_STRING and #endif to https://github.com/mattn/mruby-onig-regexp/blob/master/src/mruby_onig_regexp.c#L733-L742
initialize onigumo engnine with ONIG_ENCODING_ASCII if MRB_UTF8_STRING is not defined.
send pull-request to mruby core for adding something flag (like String#encoding_name?) to detect whether mruby support utf-8 strings.

ksss · 2017-03-23T08:33:38Z

1 => I agree.
2, 3, 4 => I don't have a motivation.

I don't understand the meaning of problem.

My goal is this.( and fix segv )

MRB_UTF8_STRING	"あ".split("") with mruby-onig-regexp	"あ".split("") without mruby-onig-regexp
defined	["あ"]	["あ"]
not defined	["\343", "\201", "\202"]	["\343", "\201", "\202"]

But, It's not your goal. right?
Then, I concentrate only on fixing segv at this PR.

mattn · 2017-03-23T10:04:40Z

When MRB_UTF8_STRING is not defined, mruby core and mruby-onig-regexp should NOT handle utf-8 string. Agree? So /./ MUST NOT be matched to あ.

ksss · 2017-03-24T00:24:14Z

Oh, I noticed that mruby-onig-regexp already support UTF-8 not just String#split when MRB_UTF8_STRING not defined.

"あいうえお".gsub(/./, "a")
#=> "aaaaa"

"あ".scan(/./)
#=> ["\343\201\202"]

"あ".split(//)
#=> ["\343\201\202"]

This is mruby-onig-regexp's spec. I respect this.

So, I understood this behavior.

"あ".split("")
#=> ["\343\201\202"]

This is a unified result.

ksss · 2017-03-24T00:52:11Z

I rewrote a patch to fix segv.

mattn · 2017-03-24T00:57:43Z

Thanks!

ksss force-pushed the segv branch from f6a8ba7 to a01294f Compare March 24, 2017 00:38

Fix segmentation fault

712c264

ksss force-pushed the segv branch from a01294f to 712c264 Compare March 24, 2017 00:42

mattn mentioned this pull request Mar 24, 2017

/./ should match to latin-1 letter when MRB_UTF8_STRING is not defined. #66

Closed

mattn merged commit be2ed0d into mattn:master Mar 24, 2017

ksss deleted the segv branch March 24, 2017 01:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix segmentation fault #64

Fix segmentation fault #64

ksss commented Mar 22, 2017

mattn commented Mar 22, 2017 •

edited

ksss commented Mar 22, 2017

mattn commented Mar 22, 2017

mattn commented Mar 22, 2017

ksss commented Mar 23, 2017

mattn commented Mar 23, 2017

ksss commented Mar 23, 2017

mattn commented Mar 23, 2017

ksss commented Mar 23, 2017

mattn commented Mar 23, 2017

ksss commented Mar 24, 2017

ksss commented Mar 24, 2017

mattn commented Mar 24, 2017

Fix segmentation fault #64

Fix segmentation fault #64

Conversation

ksss commented Mar 22, 2017

mattn commented Mar 22, 2017 • edited

ksss commented Mar 22, 2017

mattn commented Mar 22, 2017

mattn commented Mar 22, 2017

ksss commented Mar 23, 2017

mattn commented Mar 23, 2017

ksss commented Mar 23, 2017

mattn commented Mar 23, 2017

ksss commented Mar 23, 2017

mattn commented Mar 23, 2017

ksss commented Mar 24, 2017

ksss commented Mar 24, 2017

mattn commented Mar 24, 2017

mattn commented Mar 22, 2017 •

edited