-
Notifications
You must be signed in to change notification settings - Fork 682
use unicode codepoints instead of byte arrays #269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @@ -39,7 +40,7 @@ def message (self, buf, font, msg, data, _x_what_is_this): | |||
| return True | |||
| debugger = Debugger() | |||
| hb.buffer_set_message_func (buf, debugger.message, 1, 0) | |||
| hb.buffer_add_utf8 (buf, text.encode('utf-8'), 0, -1) | |||
| hb.buffer_add_codepoints (buf, codepoints, 0, -1) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hb_buffer_add_codepoints() does not validate the text and i’m not sure if Python strings can never contain invalid Unicode characters, it is better to use the _utf32 method as it will work with the same input but HarfBuzz will also make sure the input is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think python can ever create an invalid string can it?
>>> a = "ab\xfgXa"
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
position 2-4: truncated \xXX escape
On Fri, Jun 17, 2016 at 9:31 AM, Khaled Hosny notifications@github.com
wrote:
In src/sample.py
#269 (comment):@@ -39,7 +40,7 @@ def message (self, buf, font, msg, data, _x_what_is_this):
return True
debugger = Debugger()
hb.buffer_set_message_func (buf, debugger.message, 1, 0)
-hb.buffer_add_utf8 (buf, text.encode('utf-8'), 0, -1)
+hb.buffer_add_codepoints (buf, codepoints, 0, -1)hb_buffer_add_codepoints() does not validate the text and i’m not sure if
Python strings can never contain invalid Unicode characters, it is better
to use the _utf32 method as it will work with the same input but HarfBuzz
will also make sure the input is valid.—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/behdad/harfbuzz/pull/269/files/0a9c03bc45e82c95377431576ddc92e8cd32ddb4#r67510092,
or mute the thread
https://github.com/notifications/unsubscribe/ACcEOkY9WMMIaMY0EA6WQyuTmASwA0WQks5qMqG9gaJpZM4I3_8R
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It accepts surrogate characters for one, which are invalid in UTF-32:
a = "\uD800\uDFFF"This is a valid string in Python, but not valid UTF-32 string and if you pass such code units to _add_utf32() it will replace them with the replacement character but _add_codepoints() won’t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it work if I just changed the function to add_utf32? and still use
ord() ? I don’t understand unicode encoding that well sorry
On Fri, Jun 17, 2016 at 2:30 PM, Khaled Hosny notifications@github.com
wrote:
In src/sample.py
#269 (comment):@@ -39,7 +40,7 @@ def message (self, buf, font, msg, data, _x_what_is_this):
return True
debugger = Debugger()
hb.buffer_set_message_func (buf, debugger.message, 1, 0)
-hb.buffer_add_utf8 (buf, text.encode('utf-8'), 0, -1)
+hb.buffer_add_codepoints (buf, codepoints, 0, -1)It accepts surrogate characters for one, which are invalid in UTF-32:
a = "\uD800\uDFFF"
This is a valid string in Python, but not valid UTF-32 string and if you
pass such code units to _add_utf32() it will replace them with the
replacement character but _add_codepoints() won’t.—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/behdad/harfbuzz/pull/269/files/0a9c03bc45e82c95377431576ddc92e8cd32ddb4#r67555245,
or mute the thread
https://github.com/notifications/unsubscribe/ACcEOvPW5yIbNA13B-ENuXNE56RLxGLpks5qMufEgaJpZM4I3_8R
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated pull request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if add_utf32() does everything add_codepoints() does and better, why are
there two functions?
On Fri, Jun 17, 2016 at 7:09 PM, Khaled Hosny notifications@github.com
wrote:
In src/sample.py
#269 (comment):@@ -39,7 +40,7 @@ def message (self, buf, font, msg, data, _x_what_is_this):
return True
debugger = Debugger()
hb.buffer_set_message_func (buf, debugger.message, 1, 0)
-hb.buffer_add_utf8 (buf, text.encode('utf-8'), 0, -1)
+hb.buffer_add_codepoints (buf, codepoints, 0, -1)It should.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/behdad/harfbuzz/pull/269/files/0a9c03bc45e82c95377431576ddc92e8cd32ddb4#r67586781,
or mute the thread
https://github.com/notifications/unsubscribe/ACcEOvaipSrr2_lGpxJRClCrgkES_Xq_ks5qMykdgaJpZM4I3_8R
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See bcba8b4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it does the exact same thing except not validate? is it worth pointing
that out in the/any documentation or something since 99.9% of the time
you’re not going to get invalid strings in python at least
On Fri, Jun 17, 2016 at 7:59 PM, Khaled Hosny notifications@github.com
wrote:
In src/sample.py
#269 (comment):@@ -39,7 +40,7 @@ def message (self, buf, font, msg, data, _x_what_is_this):
return True
debugger = Debugger()
hb.buffer_set_message_func (buf, debugger.message, 1, 0)
-hb.buffer_add_utf8 (buf, text.encode('utf-8'), 0, -1)
+hb.buffer_add_codepoints (buf, codepoints, 0, -1)—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/behdad/harfbuzz/pull/269/files/0a9c03bc45e82c95377431576ddc92e8cd32ddb4#r67589679,
or mute the thread
https://github.com/notifications/unsubscribe/ACcEOtYnYMrKZbfRGkXRkioHLPRkzfByks5qMzTkgaJpZM4I3_8R
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation mentions this, do you think it is not clear enough?
| @@ -39,7 +40,7 @@ def message (self, buf, font, msg, data, _x_what_is_this): | |||
| return True | |||
| debugger = Debugger() | |||
| hb.buffer_set_message_func (buf, debugger.message, 1, 0) | |||
| hb.buffer_add_utf8 (buf, text.encode('utf-8'), 0, -1) | |||
| hb.buffer_add_utf32 (buf, codepoints, 0, -1) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 should be len(codepoints) since your list is not NULL-terminated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so that’s what the -1 does…what does the 0 do then?
On Sat, Jun 18, 2016 at 10:25 AM, Khaled Hosny notifications@github.com
wrote:
In src/sample.py
#269 (comment):@@ -39,7 +40,7 @@ def message (self, buf, font, msg, data, _x_what_is_this):
return True
debugger = Debugger()
hb.buffer_set_message_func (buf, debugger.message, 1, 0)
-hb.buffer_add_utf8 (buf, text.encode('utf-8'), 0, -1)
+hb.buffer_add_utf32 (buf, codepoints, 0, -1)-1 should be len(codepoints) since your list is not NULL-terminated.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/behdad/harfbuzz/pull/269/files/88f9f6da65ce5187bb43ebd0cf470955bf8f84da#r67601692,
or mute the thread
https://github.com/notifications/unsubscribe/ACcEOi__X9y4lNDo4oAxizYQgVL-PANoks5qM__QgaJpZM4I3_8R
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and do you need len(codepoints) for the add_codepoints() function too?
On Sat, Jun 18, 2016 at 11:22 AM, Kelvin Ma kelvinsthirteen@gmail.com
wrote:
so that’s what the -1 does…what does the 0 do then?
On Sat, Jun 18, 2016 at 10:25 AM, Khaled Hosny notifications@github.com
wrote:In src/sample.py
#269 (comment):@@ -39,7 +40,7 @@ def message (self, buf, font, msg, data, _x_what_is_this):
return True
debugger = Debugger()
hb.buffer_set_message_func (buf, debugger.message, 1, 0)
-hb.buffer_add_utf8 (buf, text.encode('utf-8'), 0, -1)
+hb.buffer_add_utf32 (buf, codepoints, 0, -1)-1 should be len(codepoints) since your list is not NULL-terminated.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/behdad/harfbuzz/pull/269/files/88f9f6da65ce5187bb43ebd0cf470955bf8f84da#r67601692,
or mute the thread
https://github.com/notifications/unsubscribe/ACcEOi__X9y4lNDo4oAxizYQgVL-PANoks5qM__QgaJpZM4I3_8R
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Can you rebase on master (to fix Travis failure) and squash the commits, so that I can merge this? |
|
@khaledhosny like this #271 ? |
|
You can update the same pull request (by pulling, rebasing, then forced push), but a new pull request is fine. Closing this then. |
No description provided.