Conversion performance #8

Open
ScottPJones opened this issue Jan 14, 2016 · 6 comments

@ScottPJones
Contributor

I've written some simple functions that create conversion tables using iconv.jl and then do the conversions in pure Julia code instead of calling iconv. I've also compared the performance of:

  1. converting from an 8-bit character set to UTF-8 via iconv.jl
  2. converting from an 8-bit character set to UTF-16 via iconv.jl
  3. converting from an 8-bit character set to UTF-16 via https://github.com/nolta/ICU.jl
  4. converting from an 8-bit character set to UTF-8 via my convert code
  5. converting from an 8-bit character set to UTF-16 via my convert code

I've made a Gist with benchmark results (using https://github.com/johnmyleswhite/Benchmarks.jl), along with the conversion code and the benchmarking code, at:
https://gist.github.com/ScottPJones/fcd12f675edb3d79b5ce.
The tables created are also very small, at most a couple hundred bytes per character set
(the maximum is 256 bytes if the character set is ASCII compatible, 192 bytes if it is an ANSI character set, and only 64 bytes for CP1252, which would probably be the most-used conversion).
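To make the size claim concrete, here is a minimal sketch of what such a table could look like for CP1252 (names and layout are illustrative, not the code from the Gist): only bytes 0x80-0x9f differ from their Unicode code points, so 32 UInt16 entries (64 bytes) suffice.

```julia
# Illustrative sketch only (not the Gist code): a CP1252 lookup table.
# Bytes 0x00-0x7f and 0xa0-0xff map to identical code points, so only the
# 32 bytes 0x80-0x9f need table entries: 32 * 2 bytes = 64 bytes.
const CP1252_80_9F = UInt16[
    0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
    0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
    0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
    0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
]

# Map one CP1252 byte to a Char via the table (identity everywhere else).
cp1252_to_char(b::UInt8) =
    0x80 <= b <= 0x9f ? Char(CP1252_80_9F[b - 0x7f]) : Char(b)
```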

Should we move towards using this approach at least for the 8-bit character set conversions?
It would also make it easy to add all of the options that Python 3 has for handling invalid characters: error, remove, replace with a fixed replacement character (default 0xfffd) or string, insert a quoted XML escape sequence, or insert quoted as \uxxxx or \u{xxxx}.
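A rough sketch of how those policies might be expressed (purely illustrative; none of these names exist in the package):

```julia
# Purely illustrative -- not an existing API. One way the Python-3-style
# policies for an invalid byte could be spelled out for the pure-Julia path.
function handle_invalid(b::UInt8; policy::Symbol = :replace,
                        replacement::Char = '\ufffd')
    hex = string(b, base = 16, pad = 2)
    policy === :error   && throw(ArgumentError("invalid byte 0x$hex"))
    policy === :remove  && return ""                  # drop the byte
    policy === :replace && return string(replacement) # default U+FFFD
    policy === :xml     && return "&#x$hex;"          # quoted XML escape
    policy === :escape  && return "\\u{$hex}"         # \u{xxxx}-style escape
    throw(ArgumentError("unknown policy: $policy"))
end
```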

@nalimilan
Member

Could you write a summary of the results? Also, I think it would make sense to measure the performance of converting many strings using the same iconv/ICU handler, since this is a more reasonable scenario.
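Something like the following could capture that scenario (a sketch only; decode() is the call mentioned in this thread, and everything else is an assumption about how one might structure the measurement):

```julia
# Sketch of the "many strings" measurement being suggested. decode() is the
# call discussed in this thread; the comparison against one big decode over
# the concatenated data is just a way to estimate per-call setup cost.
function bench_many(datas::Vector{Vector{UInt8}}, enc)
    t_each = @elapsed for d in datas
        decode(d, enc)               # fresh conversion handle per string
    end
    t_once = @elapsed decode(vcat(datas...), enc)  # one handle, all the data
    return (per_string = t_each, concatenated = t_once)
end
```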

I think progressively adding pure-Julia converters, starting with the most common encodings, is a good idea. That would justify changing the name of the package. One difficult point is to get relatively consistent behavior regarding invalid characters, given that iconv will be less flexible than your Julia code; I need to think about it. Anyway, if you can find a consistent plan to implement this, that would be nice.

@ScottPJones
Contributor Author

The test in the gist used different sizes: 1, 16, 32, 64, 256, 1024, and 5120 bytes.
The strings were created with random bytes 0-255, with any bytes invalid for that character set then changed to '?'.
iconv.jl was quite a bit slower: more than 70x slower even on larger strings converting to UTF-16, and more than 45x slower converting to UTF-8.
ICU.jl was about twice as slow in general as the pure Julia conversion code.
With the pure Julia converter, converting to UTF-8 is about 57% slower than converting to UTF-16 (which is to be expected; dealing with UTF-8 is generally significantly slower than UTF-16).
This was on large strings, 5120 bytes, which should have reduced the effect of the setup of the StringEncoder for the iconv.jl tests.
ICU.jl was pretty fast, but only supports UTF-16 (there is support for UTF-8 now in ICU, but it isn't available in ICU.jl).
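For context on the UTF-8 vs UTF-16 gap: with UTF-16 the inner loop is one table lookup and one store per byte, while UTF-8 output needs a branch on the code point and one to three output bytes. A minimal sketch of that branchy step (hypothetical helper, not the Gist code):

```julia
# Hypothetical helper (not the Gist code): append the UTF-8 encoding of a
# BMP code point coming from an 8-bit character-set table. The 1/2/3-byte
# branching is what makes the UTF-8 path slower than the flat UTF-16 loop.
function push_utf8!(out::Vector{UInt8}, c::UInt16)
    if c < 0x80                                   # 1 byte (ASCII)
        push!(out, c % UInt8)
    elseif c < 0x800                              # 2 bytes
        push!(out, 0xc0 | (c >> 6) % UInt8)
        push!(out, 0x80 | (c & 0x3f) % UInt8)
    else                                          # 3 bytes
        push!(out, 0xe0 | (c >> 12) % UInt8)
        push!(out, 0x80 | ((c >> 6) & 0x3f) % UInt8)
        push!(out, 0x80 | (c & 0x3f) % UInt8)
    end
    return out
end
```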

I'm not sure exactly what you mean about measuring the conversion of many strings.
I suppose I could call StringEncoder directly, instead of using decode().
Is that what you were thinking of?

About the pure-Julia converters, I think it will be pretty easy for me to add all 8-bit encodings, with all of the invalid-character behaviors.
I was also thinking about using the Encoding types (if you looked at the code in the Gist), having the tables loadable on demand with a Dict to keep track of loaded encodings, and finally implementing next(), start(), and done(), so that it would be possible to get a character at a time from an IOBuffer, a Vector{UInt8}, a pointer, or maybe even MMap'ed memory, based on a loaded encoding.
That would also make transcoding (using Unicode as the intermediary, like iconv does) easy in pure Julia.
We could still use iconv for multi-byte character sets until we have time to move those to pure Julia.
(I've done it in the past, pretty efficiently, in a table-driven fashion, so I'm confident I could add that part; it's just a matter of finding the time to do it [not too long, but more than a Saturday afternoon] given my "paying work" priorities: we need 8-bit support now, but not multi-byte character sets yet.)
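A minimal sketch of the on-demand table loading described above (all names hypothetical; in the Julia of that era the iteration part would be spelled with start()/next()/done()):

```julia
# Hypothetical sketch: tables loaded on demand and cached in a Dict, plus a
# char-at-a-time walk over any byte source using a loaded table.
const LOADED_TABLES = Dict{Symbol, Vector{UInt16}}()

# Load (or fetch from the cache) the byte -> code point table for `enc`.
# `load_table` is a hypothetical loader for the stored table data.
get_table(enc::Symbol) = get!(() -> load_table(enc), LOADED_TABLES, enc)

# Walk a byte source one character at a time through a 256-entry table; the
# same loop works for a Vector{UInt8}, an IOBuffer's data, or mmapped memory.
function each_char(f, table::Vector{UInt16}, bytes::AbstractVector{UInt8})
    for b in bytes
        f(Char(table[b + 1]))     # 1-based index into the 256-entry table
    end
end
```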

Big takeaway, which I learned when doing the Unicode conversion code in Base after being pushed by Tony et al. to do it in pure Julia: pure Julia rocks! 😀

@nalimilan
Member

> The test in the gist used different sizes: 1, 16, 32, 64, 256, 1024, and 5120 bytes.
> The strings were created with random bytes 0-255, with any bytes invalid for that character set then changed to '?'.
> iconv.jl was quite a bit slower: more than 70x slower even on larger strings converting to UTF-16, and more than 45x slower converting to UTF-8.
> ICU.jl was about twice as slow in general as the pure Julia conversion code.
> With the pure Julia converter, converting to UTF-8 is about 57% slower than converting to UTF-16 (which is to be expected; dealing with UTF-8 is generally significantly slower than UTF-16).
> This was on large strings, 5120 bytes, which should have reduced the effect of the setup of the StringEncoder for the iconv.jl tests.

5120 bytes still isn't that much to offset the cost of creating a handle. But indeed, in many use cases that overhead is an issue, so that's fair.

> ICU.jl was pretty fast, but only supports UTF-16 (there is support for UTF-8 now in ICU, but it isn't available in ICU.jl).

Why do you say that? UnicodeExtras.jl supports UTF8String too, though it would likely be slow because I think ICU uses UTF-16 as an intermediate.
https://github.com/nolta/UnicodeExtras.jl#file-encoding

> I'm not sure exactly what you mean about measuring the conversion of many strings.
> I suppose I could call StringEncoder directly, instead of using decode().
> Is that what you were thinking of?

Yeah, though I don't expect this to change the results too much. I wonder why iconv (and ICU to some extent) are so slow. Note that I haven't done any optimization on iconv.jl, so there might be significant performance issues to fix before treating these results as definitive.

> About the pure-Julia converters, I think it will be pretty easy for me to add all 8-bit encodings, with all of the invalid-character behaviors.
> I was also thinking about using the Encoding types (if you looked at the code in the Gist), having the tables loadable on demand with a Dict to keep track of loaded encodings, and finally implementing next(), start(), and done(), so that it would be possible to get a character at a time from an IOBuffer, a Vector{UInt8}, a pointer, or maybe even MMap'ed memory, based on a loaded encoding.
> That would also make transcoding (using Unicode as the intermediary, like iconv does) easy in pure Julia.
> We could still use iconv for multi-byte character sets until we have time to move those to pure Julia.
> (I've done it in the past, pretty efficiently, in a table-driven fashion, so I'm confident I could add that part; it's just a matter of finding the time to do it [not too long, but more than a Saturday afternoon] given my "paying work" priorities: we need 8-bit support now, but not multi-byte character sets yet.)

> Big takeaway, which I learned when doing the Unicode conversion code in Base after being pushed by Tony et al. to do it in pure Julia: pure Julia rocks! 😀

Makes sense.

@ScottPJones
Contributor Author

> Why do you say that? UnicodeExtras.jl supports UTF8String too, though it would likely be slow because I think ICU uses UTF-16 as an intermediate.

Ah, we are using ICU.jl at work, so that's what I benchmarked. I'll have to look at UnicodeExtras.jl.

> Yeah, though I don't expect this to change the results too much. I wonder why iconv (and ICU to some extent) are so slow. Note that I haven't done any optimization on iconv.jl, so there might be significant performance issues to fix before treating these results as definitive.

Yes, I understand iconv.jl isn't optimized yet. We'd really need to benchmark these separately (a rough sketch follows below):

  1. calling iconv_open
  2. the rest of the overhead of creating a StringEncoder (this might show that we need to cache encoders, just like I want to cache the tables for pure Julia)
  3. the actual call to iconv!
  4. the overhead of doing things through IOBuffer / write

Part of the problem might be that the transcoding iconv call might not be optimizing or simplifying the case where the source or destination is UTF-8 or UTF-16 (internally, all transcoding goes via Unicode, I think UTF-16 or UTF-32).
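A rough sketch of how that breakdown might be measured from the Julia side (StringDecoder and decode are the names used in this thread; the exact constructor arguments are an assumption and may not match the package):

```julia
# Rough, illustrative breakdown of the costs listed above. StringDecoder and
# decode come from this thread; constructor arguments are assumed and may
# differ in the actual package.
function bench_components(bytes::Vector{UInt8}, enc)
    t_setup = @elapsed begin        # items 1-2: iconv_open plus wrapper creation
        d = StringDecoder(IOBuffer(bytes), enc)
        close(d)
    end
    t_total = @elapsed decode(bytes, enc)   # items 3-4 included: conversion + IO
    return (setup = t_setup, total = t_total, conversion_and_io = t_total - t_setup)
end
```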

I like to benchmark as many possibilities as possible - so I don't get caught out by some case I didn't benchmark that turns out to be common.

@sambitdash
Contributor

julia-iconv performance will be the platform-native performance plus a small marshalling overhead from the Julia-to-C communication layer. iconv is a very mature library: no significant code additions have gone in since 2011, which makes it ultra-stable. Any newer library may lack that maturity. Unless proven otherwise, it may not be a good idea to move away from iconv.

@nalimilan
Member

The problem is not the cost of communication between Julia and C (that cost should be essentially zero); it's just that iconv is said to be relatively slow. Julia makes it possible to generate very efficient code for common conversions on the fly, which should be worth it at least for simple cases. Anyway, we'd only do this if benchmarks show it's really faster.
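As one illustration of what generating specialized code on the fly could look like (purely a sketch, not anything in the package): build a converter per encoding so its table is fixed for that method and the loop compiles down to a plain lookup and store.

```julia
# Purely illustrative: a converter specialized for one encoding's table.
# Passing the table as an NTuple lets the compiler treat its length and
# element type as constants for the generated method.
function make_to_utf16(table::NTuple{256, UInt16})
    return function (bytes::AbstractVector{UInt8})
        out = Vector{UInt16}(undef, length(bytes))
        @inbounds for (i, b) in enumerate(bytes)
            out[i] = table[b + 1]      # one lookup, one store per input byte
        end
        return out
    end
end
```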
