Skip to content

Conversation

tayloraswift
Copy link
Contributor

No description provided.

@khaledhosny khaledhosny merged commit 07461d0 into harfbuzz:master Jun 18, 2016
@@ -20,6 +20,7 @@ def tounicode(s, encoding='utf-8'):

fontdata = open (sys.argv[1], 'rb').read ()
text = tounicode(sys.argv[2])
codepoints = list(map(ord, text))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on narrow Python 2 builds, ord() will fail if input text contains unicode characters beyond 0xFFFF.
The fontTools.misc.py23 has a workaround for that.

@behdad
Copy link
Member

behdad commented Jun 29, 2016

Actually I'm not even sure this is a good idea. The encode(utf-8) approach is A LOT more efficient than creating a Pythonic list of integers and then passing that to HarfBuzz. I don't see the benefit of this. Plus, what @anthrotype said.

@behdad behdad mentioned this pull request Jun 29, 2016
@tayloraswift
Copy link
Contributor Author

if i remember right the reason i said to change this to ord() is because
encode('utf-8') has some issues with special chars like ā. Maybe utf-32
will work though the last time I used it it was acting real weird, see
https://lists.freedesktop.org/archives/harfbuzz/2016-June/005634.html .
Apparently you need [int.from_bytes(c.encode("utf-16be"), byteorder='big')
for c in string] to make it work, which nothing but a really long and ugly
way of writing list(map(ord, string)).

On Wed, Jun 29, 2016 at 12:12 AM, Behdad Esfahbod notifications@github.com
wrote:

Actually I'm not even sure this is a good idea. The encode(utf-8) approach
is A LOT more efficient than creating a Pythonic list of integers and then
passing that to HarfBuzz. I don't see the benefit of this. Plus, what
@anthrotype https://github.com/anthrotype said.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#271 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/ACcEOhvHFheNMjavUSwf4a8HCRXtmeM6ks5qQfCrgaJpZM4I5C96
.

@behdad
Copy link
Member

behdad commented Jun 29, 2016

if i remember right the reason i said to change this to ord() is because encode('utf-8') has some issues with special chars like ā.

Then let's debug that. Tell us your original problem, not your proposed solution. UTF-8 approach definitely works just fine for non-ASCII for me. What was your problem?

which nothing but a really long and ugly way of writing list(map(ord, string)).

As was pointed out by @anthrotype, that's wrong on some builds of Python and causes hard-to-debug bugs.

@tayloraswift
Copy link
Contributor Author

everything about this is on this mailing list thread: https://lists.freedesktop.org/archives/harfbuzz/2016-June/005633.html

UTF-8 doesn’t work because the cluster values don’t line up with the character indexes any more. ā causes the clusters to jump by 2 when it should only count for 1.

On Jun 29, 2016, at 12:42 AM, Behdad Esfahbod notifications@github.com wrote:

if i remember right the reason i said to change this to ord() is because encode('utf-8') has some issues with special chars like ā.

Then let's debug that. Tell us your original problem, not your proposed solution. UTF-8 approach definitely works just fine for non-ASCII for me. What was your problem?

which nothing but a really long and ugly way of writing list(map(ord, string)).

As was pointed out by @anthrotype, that's wrong on some builds of Python and causes hard-to-debug bugs.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.

@anthrotype
Copy link
Member

By the way, sys.argv is a list of Unicode strings on Python3, but on Python2 it's a list of bytes strings encoded using the console's encoding, sys.stdin.encoding, which may not be UTF-8, especially on non-Unix.

@behdad
Copy link
Member

behdad commented Jun 29, 2016

UTF-8 doesn’t work because the cluster values don’t line up with the character indexes any more. ā causes the clusters to jump by 2 when it should only count for 1.

Ok I see. In that case the correct way to do this is to use utf16 or utf32 based on whether the python build is narrow or wide.

@anthrotype or @khaledhosny can you put that together? I don't have a narrow build to test. We should still have the utf-8 code in the sample as well.

@tayloraswift
Copy link
Contributor Author

utf16 just makes the bug harder to catch, it has the same problem because
constant index jumps is not guaranteed there either. utf-32 should work in
theory but last time I checked it wasn't working in harfbuzz

On Wed, Jun 29, 2016 at 3:03 PM, Behdad Esfahbod notifications@github.com
wrote:

UTF-8 doesn’t work because the cluster values don’t line up with the
character indexes any more. ā causes the clusters to jump by 2 when it
should only count for 1.

Ok I see. In that case the correct way to do this is to use utf16 or utf32
based on whether the python build is narrow or wide.

@anthrotype https://github.com/anthrotype or @khaledhosny
https://github.com/khaledhosny can you put that together? I don't have
a narrow build to test. We should still have the utf-8 code in the sample
as well.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#271 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/ACcEOsX4V85KsgXKMVnjIJmVQermWKJSks5qQsGFgaJpZM4I5C96
.

@behdad
Copy link
Member

behdad commented Jun 29, 2016

utf16 just makes the bug harder to catch, it has the same problem because constant index jumps

No. If the python build is narrow, then you want the index to jump.

@tayloraswift
Copy link
Contributor Author

and if the build is not narrow?

On Wed, Jun 29, 2016 at 3:37 PM, Behdad Esfahbod notifications@github.com
wrote:

utf16 just makes the bug harder to catch, it has the same problem because
constant index jumps

No. If the python build is narrow, then you want the index to jump.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#271 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/ACcEOjb5Zy0H4AgDfQvhGfJjIaDxj4V-ks5qQsmUgaJpZM4I5C96
.

@behdad
Copy link
Member

behdad commented Jun 29, 2016

and if the build is not narrow?

Then you want the utf-32 option.

@behdad
Copy link
Member

behdad commented Jun 29, 2016

This way the clusters will match Python string indices.

@tayloraswift
Copy link
Contributor Author

except utf-32 isn’t working

'coefficient'

gid 0 : 0 @ 748 + (0, 0)
gid 0 : 1 @ 748 + (0, 0)
gid 0 : 2 @ 748 + (0, 0)
gid 0 : 3 @ 748 + (0, 0)
gid 70 : 4 @ 846 + (0, 0)
gid 0 : 5 @ 748 + (0, 0)
gid 0 : 6 @ 748 + (0, 0)
gid 0 : 7 @ 748 + (0, 0)
gid 82 : 8 @ 1018 + (0, 0)
gid 0 : 9 @ 748 + (0, 0)
gid 0 : 10 @ 748 + (0, 0)
gid 0 : 11 @ 748 + (0, 0)
gid 72 : 12 @ 860 + (0, 0)
gid 0 : 13 @ 748 + (0, 0)
gid 0 : 14 @ 748 + (0, 0)
gid 0 : 15 @ 748 + (0, 0)
gid 73 : 16 @ 616 + (0, 0)
gid 0 : 17 @ 748 + (0, 0)
gid 0 : 18 @ 748 + (0, 0)
gid 0 : 19 @ 748 + (0, 0)
gid 73 : 20 @ 616 + (0, 0)
gid 0 : 21 @ 748 + (0, 0)
gid 0 : 22 @ 748 + (0, 0)
gid 0 : 23 @ 748 + (0, 0)
gid 76 : 24 @ 540 + (0, 0)
gid 0 : 25 @ 748 + (0, 0)
gid 0 : 26 @ 748 + (0, 0)
gid 0 : 27 @ 748 + (0, 0)
gid 70 : 28 @ 846 + (0, 0)
gid 0 : 29 @ 748 + (0, 0)
gid 0 : 30 @ 748 + (0, 0)
gid 0 : 31 @ 748 + (0, 0)
gid 76 : 32 @ 540 + (0, 0)
gid 0 : 33 @ 748 + (0, 0)
gid 0 : 34 @ 748 + (0, 0)
gid 0 : 35 @ 748 + (0, 0)
gid 72 : 36 @ 860 + (0, 0)
gid 0 : 37 @ 748 + (0, 0)
gid 0 : 38 @ 748 + (0, 0)
gid 0 : 39 @ 748 + (0, 0)
gid 81 : 40 @ 1064 + (0, 0)
gid 0 : 41 @ 748 + (0, 0)
gid 0 : 42 @ 748 + (0, 0)
gid 0 : 43 @ 748 + (0, 0)
gid 87 : 44 @ 622 + (0, 0)
gid 0 : 45 @ 748 + (0, 0)
gid 0 : 46 @ 748 + (0, 0)
gid 0 : 47 @ 748 + (0, 0)

@behdad
Copy link
Member

behdad commented Jun 30, 2016

except utf-32 isn’t working

You need to tell us a bit more than "isn't working".

@tayloraswift
Copy link
Contributor Author

it just isn’t working idk what to tell you it makes harfbuzz spit out a bunch of “0” glyphs. It looks like the output pattern in the comment before yours with 3 zeros for every actual glyph

behdad added a commit that referenced this pull request Jun 30, 2016
@behdad
Copy link
Member

behdad commented Jun 30, 2016

Fixed properly.

@HinTak
Copy link
Contributor

HinTak commented Oct 2, 2018

Has anybody actually looked at the result before and after this change? Somehow gid3 (space) is added at the cluster zero.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants