modified char_splitter to support UTF-8 (implements #299) #304

kmaehashi · 2016-07-20T10:59:03Z

Implements #299

TkrUdagawa · 2016-07-21T02:03:00Z

jubatus/core/fv_converter/char_splitter.cpp

    if (end == std::string::npos) {
      size_t len = string.size() - begin;
-      bounds.push_back(std::make_pair(begin, len));
+      size_t len_bytes = ustring_to_string(target.substr(begin, len)).size();


I think target.substr(begin, len) is equivalent to target.substr(begin) here, because this code takes the substring of target from begin to the end of target.
The variable len is also unnecessary in this branch.

Exactly! Fixed.
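
The equivalence raised in the review can be checked with plain std::string. This is a minimal sketch, not code from the patch: tail_forms_agree is a hypothetical helper, and std::string stands in for the ustring type used in char_splitter.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: compares the two-argument and one-argument forms of
// substr for a tail that starts at begin and runs to the end of s.
inline bool tail_forms_agree(const std::string& s, std::size_t begin) {
  assert(begin <= s.size());
  const std::size_t len = s.size() - begin;  // the length the review calls unnecessary
  return s.substr(begin, len) == s.substr(begin);
}
```

Because the count argument of substr defaults to npos and is clamped to the remaining length, both forms return the same tail for any valid begin.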

kmaehashi · 2016-07-21T04:25:06Z

As future work, we can speed up this feature extraction by counting the bytes of a ustring without converting it into std::string. Such a feature could be implemented in jubatus::util::data::string::ustring.
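
One way the byte counting suggested above might look is sketched here. It is not part of this pull request or of the ustring API: codepoints, utf8_width, and utf8_byte_length are hypothetical names, and the sketch assumes each element holds one Unicode code point no greater than U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a ustring: one 32-bit value per Unicode code point.
typedef std::vector<unsigned int> codepoints;

// Bytes needed to encode one code point in UTF-8 (assumes cp <= U+10FFFF).
inline std::size_t utf8_width(unsigned int cp) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4;
}

// UTF-8 byte length of up to len code points starting at begin,
// clamped to the end of s. No intermediate std::string is built.
inline std::size_t utf8_byte_length(const codepoints& s,
                                    std::size_t begin, std::size_t len) {
  if (begin >= s.size()) return 0;
  const std::size_t avail = s.size() - begin;
  if (len > avail) len = avail;
  std::size_t bytes = 0;
  for (std::size_t i = begin; i < begin + len; ++i) {
    bytes += utf8_width(s[i]);
  }
  return bytes;
}
```

For the code points U+0061, U+00E9, and U+3042 this yields 1 + 2 + 3 = 6 bytes, the same size the converted std::string would have, without allocating it.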

TkrUdagawa · 2016-07-21T04:26:18Z

👍

Tobe Yutaro and others added 3 commits July 20, 2016 18:39

modified char_split to support UTF-8

c49ac58

revise implementation

4a11147

fix code style

8ae1a83

kmaehashi added this to the 0.3.3 milestone Jul 20, 2016

fix code style

db7ff40

kmaehashi mentioned this pull request Jul 20, 2016

modified char_split to support UTF-8 (implements #299) #301

Closed

TkrUdagawa self-assigned this Jul 20, 2016

TkrUdagawa reviewed Jul 21, 2016

remove unused length calculation

a36d542

TkrUdagawa merged commit 7bd4473 into develop Jul 21, 2016

kmaehashi deleted the support_utf8_work branch July 21, 2016 04:27

kmaehashi mentioned this pull request Jul 21, 2016

Support UTF-8 strings in char_splitter #299

Closed
