Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segmentation fault #64

Merged
merged 1 commit into from
Mar 24, 2017
Merged

Fix segmentation fault #64

merged 1 commit into from
Mar 24, 2017

Conversation

ksss
Copy link
Contributor

@ksss ksss commented Mar 22, 2017

"".split(1)
#=> segmentation fault

I think, it should delegate to core if argument is none OnigRegexp object.

@mattn
Copy link
Owner

mattn commented Mar 22, 2017

I want to make fallback. please add #ifndef MRUBY_UTF8_STRING and #endif. if "あ".len == 1 to the test.

@ksss
Copy link
Contributor Author

ksss commented Mar 22, 2017

Sorry, I couldn't understand what that means.
I think, It's a duplicated implementation with mruby-core and mruby-onig-regexp.

# without MRB_UTF8_STRING
$ mruby -e 'p "あいうえお".string_split("")'
["\343", "\201", "\202", "\343", "\201", "\204", "\343", "\201", "\206", "\343", "\201", "\210", "\343", "\201", "\212"]

# with MRB_UTF8_STRING
$ mruby -e 'p "あいうえお".string_split("")'
["あ", "い", "う", "え", "お"]

This is String#string_split behavior.
If you want to test this behavior, we should write a test on the core.

@mattn
Copy link
Owner

mattn commented Mar 22, 2017

I'm thinking this possibly make user confusing because return latin-1 letters even though OnigRegexp can handle unicode. How do you think?

@mattn
Copy link
Owner

mattn commented Mar 22, 2017

If string_split return latin-1 letters, regexp_split also return latin-1 lettres.

#ifdef MRUBY_UTF8_STRING
      err = onig_new(&rp, (OnigUChar*)ptr, (OnigUChar*)ptr + len, ONIG_OPTION_DEFAULT,
                     ONIG_ENCODING_UTF8, OnigDefaultSyntax, NULL);
#else
      err = onig_new(&rp, (OnigUChar*)ptr, (OnigUChar*)ptr + len, ONIG_OPTION_DEFAULT,
                     ONIG_ENCODING_ASCII, OnigDefaultSyntax, NULL);
#endif

@ksss
Copy link
Contributor Author

ksss commented Mar 23, 2017

OK, I have to confess I didn't understand the unicode.
And I understood that you are considering how users feel without the implementation. aren't you?

I think, Users are not thinking about using OnigRegexp in this case.

"あいうえお".split("")

Because, It don't seems to be using Regexp object anywhere.

@mattn
Copy link
Owner

mattn commented Mar 23, 2017

Because, It don't seems to be using Regexp object anywhere.

Then, we will have to make error like below.

#ifndef MRB_UTF8_STRING
#error mruby-onig-regexp require MRB_UTF8_STRING
#endif

I just don't like contradiction between string(without utf8) and onig-regexp(can handle utf8).

MRB_UTF8_STRING "あ"[0].bytesize "あ".split("").size "あいう".split(/い/)
defined 3 1 works well
not defined 1 3 works well

@ksss
Copy link
Contributor Author

ksss commented Mar 23, 2017

Thank you for comment.

Is it intentional?

(It installed mruby-onig-regexp)

MRB_UTF8_STRING "あ".onig_regexp_split("") "あ".string_split("")
defined ["あ"] ["あ"]
not defined ["\343\201\202"] ["\343", "\201", "\202"]

I don't know why result should be ["\343\201\202"] if "あ".onig_regexp_split("") with MRB_UTF8_STRING is not defined.

This test case seems to be passing both MRB_UTF8_STRING defined or not.

assert_equal ['あ', 'い', 'う', 'え', 'お'], 'あいうえお'.onig_regexp_split('')

@mattn
Copy link
Owner

mattn commented Mar 23, 2017

Is it intentional?

Yes, I know current implementation have contradiction. So I want to fix.

I don't know why result should be ["\343\201\202"] if "あ".onig_regexp_split("") with MRB_UTF8_STRING is not defined.

Because current implementation do this at https://github.com/mattn/mruby-onig-regexp/blob/master/src/mruby_onig_regexp.c#L833-L840

Let's fix like below.

  1. remove https://github.com/mattn/mruby-onig-regexp/blob/master/src/mruby_onig_regexp.c#L833-L840
    deligate straight forward
  2. add #ifdef MRB_UTF8_STRING and #endif to https://github.com/mattn/mruby-onig-regexp/blob/master/src/mruby_onig_regexp.c#L733-L742
  3. initialize onigumo engnine with ONIG_ENCODING_ASCII if MRB_UTF8_STRING is not defined.
  4. send pull-request to mruby core for adding something flag (like String#encoding_name?) to detect whether mruby support utf-8 strings.

@ksss
Copy link
Contributor Author

ksss commented Mar 23, 2017

1 => I agree.
2, 3, 4 => I don't have a motivation.

I don't understand the meaning of problem.

My goal is this.( and fix segv )

MRB_UTF8_STRING "あ".split("") with mruby-onig-regexp "あ".split("") without mruby-onig-regexp
defined ["あ"] ["あ"]
not defined ["\343", "\201", "\202"] ["\343", "\201", "\202"]

But, It's not your goal. right?
Then, I concentrate only on fixing segv at this PR.

@mattn
Copy link
Owner

mattn commented Mar 23, 2017

When MRB_UTF8_STRING is not defined, mruby core and mruby-onig-regexp should NOT handle utf-8 string. Agree? So /./ MUST NOT be matched to .

@ksss
Copy link
Contributor Author

ksss commented Mar 24, 2017

Oh, I noticed that mruby-onig-regexp already support UTF-8 not just String#split when MRB_UTF8_STRING not defined.

"あいうえお".gsub(/./, "a")
#=> "aaaaa"

"あ".scan(/./)
#=> ["\343\201\202"]

"あ".split(//)
#=> ["\343\201\202"]

This is mruby-onig-regexp's spec. I respect this.

So, I understood this behavior.

"あ".split("")
#=> ["\343\201\202"]

This is a unified result.

@ksss
Copy link
Contributor Author

ksss commented Mar 24, 2017

I rewrote a patch to fix segv.

@mattn
Copy link
Owner

mattn commented Mar 24, 2017

Thanks!

@ksss ksss deleted the segv branch March 24, 2017 01:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants