handle unicode string #30

tjwei · 2015-08-05T04:01:19Z

Fix a few regular expressions for PDF strings.
My project, https://github.com/tjwei/translatePDF, uses pdfrw and handles a lot of Unicode PDF strings.
This patch works for the various Chinese strings I have encountered in PDFs.
It fixes the following issue:
(\0160) should be parsed as \016 followed by the character 0, not as oct(0160), because an octal escape in a PDF string takes at most three digits. It should therefore decode to \x0e0 (the byte 0x0E followed by "0"), not max(int(1600, 8), 127).
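
To illustrate the parsing rule described above, here is a minimal sketch, not the actual patch in this pull request. The names `OCTAL_ESCAPE` and `decode_octal_escapes` are hypothetical and do not exist in pdfrw. It assumes the input is the raw bytes between the parentheses of a literal string and handles only octal escapes; other escapes such as `\\` or `\n` are left out.

```python
import re

# Hypothetical pattern, not pdfrw's own: a backslash followed by one to
# three octal digits. Any digit after the third is an ordinary character.
OCTAL_ESCAPE = re.compile(rb"\\([0-7]{1,3})")


def decode_octal_escapes(raw: bytes) -> bytes:
    """Replace each \\ddd octal escape in raw with the single byte it encodes."""
    def replace(match):
        # Mask to one byte so values above 0o377 drop the high-order overflow.
        value = int(match.group(1), 8) & 0xFF
        return bytes([value])

    return OCTAL_ESCAPE.sub(replace, raw)


# \016 is consumed as the escape; the trailing "0" stays a literal character.
print(decode_octal_escapes(rb"\0160"))
```

Because the quantifier `{1,3}` is greedy but capped at three digits, `\0160` yields the byte 0x0E followed by the character "0", which is the behaviour this pull request asks for. A full parser would also need to skip escaped backslashes before looking for octal digits.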

pmaupin · 2015-08-05T04:52:03Z

Thanks!

pmaupin · 2015-08-05T04:56:03Z

Do you have a PDF that fails on the old code and works on the new code? I could add it to the tests.

tjwei · 2015-08-05T05:43:05Z

I have a few, but unfortunately they are all copyrighted. I tried to generate one with XeTeX, but without success. I will send you one when I find it.

pmaupin · 2015-08-05T14:20:43Z

It really wants a unittest in here anyway. I'll add an issue for that.

Thanks,
Pat

handle unicode string

66271ce

pmaupin merged commit 66271ce into pmaupin:master Aug 5, 2015

pmaupin mentioned this pull request Aug 5, 2015

Add more unittests for string encoding #32

Closed
