Conversion performance #8

Open
ScottPJones opened this issue Jan 14, 2016 · 6 comments

@ScottPJones
Contributor

I've written some simple functions that create conversion tables using iconv.jl and then do the conversions in pure Julia code instead of calling iconv. I've also compared the performance of:

  1. converting from an 8-bit character set to UTF-8 via iconv.jl
  2. converting from an 8-bit character set to UTF-16 via iconv.jl
  3. converting from an 8-bit character set to UTF-16 via https://github.com/nolta/ICU.jl
  4. converting from an 8-bit character set to UTF-8 via my convert code
  5. converting from an 8-bit character set to UTF-16 via my convert code

I've made a Gist with benchmark results (using https://github.com/johnmyleswhite/Benchmarks.jl), along with the conversion code and the benchmarking code, at:
https://gist.github.com/ScottPJones/fcd12f675edb3d79b5ce.
The tables created are also very small, at most a couple hundred bytes per character set
(the maximum is 256 bytes if the character set is ASCII compatible, 192 bytes if it is an ANSI character set, and only 64 bytes for CP1252, which would probably be the most-used conversion).
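To make the size claim concrete, here is a minimal sketch of what such a table could look like for CP1252 (names and layout are illustrative, not the code from the Gist): only bytes 0x80-0x9f differ from their Unicode code points, so 32 UInt16 entries (64 bytes) suffice.

```julia
# Illustrative sketch only (not the Gist code): a CP1252 lookup table.
# Bytes 0x00-0x7f and 0xa0-0xff map to identical code points, so only the
# 32 bytes 0x80-0x9f need table entries: 32 * 2 bytes = 64 bytes.
const CP1252_80_9F = UInt16[
    0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
    0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
    0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
    0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
]

# Map one CP1252 byte to a Char via the table (identity everywhere else).
cp1252_to_char(b::UInt8) =
    0x80 <= b <= 0x9f ? Char(CP1252_80_9F[b - 0x7f]) : Char(b)
```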

Should we move towards using this approach at least for the 8-bit character set conversions?
It would also make it easy to add all of the options that Python 3 has for handling invalid characters: error, remove, replace with a fixed replacement character (default 0xfffd) or string, insert a quoted XML escape sequence, or insert quoted as \uxxxx or \u{xxxx}.
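A rough sketch of how those policies might be expressed (purely illustrative; none of these names exist in the package):

```julia
# Purely illustrative -- not an existing API. One way the Python-3-style
# policies for an invalid byte could be spelled out for the pure-Julia path.
function handle_invalid(b::UInt8; policy::Symbol = :replace,
                        replacement::Char = '\ufffd')
    hex = string(b, base = 16, pad = 2)
    policy === :error   && throw(ArgumentError("invalid byte 0x$hex"))
    policy === :remove  && return ""                  # drop the byte
    policy === :replace && return string(replacement) # default U+FFFD
    policy === :xml     && return "&#x$hex;"          # quoted XML escape
    policy === :escape  && return "\\u{$hex}"         # \u{xxxx}-style escape
    throw(ArgumentError("unknown policy: $policy"))
end
```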

@nalimilan
Member

Could you write a summary of the results? Also, I think it would make sense to measure the performance of converting many strings using the same iconv/ICU handler, since this is a more reasonable scenario.
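Something like the following could capture that scenario (a sketch only; decode() is the call mentioned in this thread, and everything else is an assumption about how one might structure the measurement):

```julia
# Sketch of the "many strings" measurement being suggested. decode() is the
# call discussed in this thread; the comparison against one big decode over
# the concatenated data is just a way to estimate per-call setup cost.
function bench_many(datas::Vector{Vector{UInt8}}, enc)
    t_each = @elapsed for d in datas
        decode(d, enc)               # fresh conversion handle per string
    end
    t_once = @elapsed decode(vcat(datas...), enc)  # one handle, all the data
    return (per_string = t_each, concatenated = t_once)
end
```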

I think progressively adding pure-Julia converters, starting with the most common encodings, is a good idea. That would justify changing the name of the package. One difficult point is to get relatively consistent behavior regarding invalid characters, given that iconv will be less flexible than your Julia code; I need to think about it. Anyway, if you can find a consistent plan to implement this, that would be nice.

@ScottPJones
Contributor Author

The test in the gist used different sizes: 1, 16, 32, 64, 256, 1024, and 5120 bytes.
The strings were created with random bytes 0-255, with any bytes invalid for that character set then changed to '?'.
iconv.jl was quite a bit slower: more than 70x slower even on larger strings converting to UTF-16, and more than 45x slower converting to UTF-8.
ICU.jl was about twice as slow in general as the pure Julia conversion code.
With the pure Julia converter, converting to UTF-8 is about 57% slower than converting to UTF-16 (which is to be expected; dealing with UTF-8 is generally significantly slower than UTF-16).
This was on large strings, 5120 bytes, which should have reduced the effect of the setup of the StringEncoder for the iconv.jl tests.
ICU.jl was pretty fast, but only supports UTF-16 (there is support for UTF-8 now in ICU, but it isn't available in ICU.jl).
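For context on the UTF-8 vs UTF-16 gap: with UTF-16 the inner loop is one table lookup and one store per byte, while UTF-8 output needs a branch on the code point and one to three output bytes. A minimal sketch of that branchy step (hypothetical helper, not the Gist code):

```julia
# Hypothetical helper (not the Gist code): append the UTF-8 encoding of a
# BMP code point coming from an 8-bit character-set table. The 1/2/3-byte
# branching is what makes the UTF-8 path slower than the flat UTF-16 loop.
function push_utf8!(out::Vector{UInt8}, c::UInt16)
    if c < 0x80                                   # 1 byte (ASCII)
        push!(out, c % UInt8)
    elseif c < 0x800                              # 2 bytes
        push!(out, 0xc0 | (c >> 6) % UInt8)
        push!(out, 0x80 | (c & 0x3f) % UInt8)
    else                                          # 3 bytes
        push!(out, 0xe0 | (c >> 12) % UInt8)
        push!(out, 0x80 | ((c >> 6) & 0x3f) % UInt8)
        push!(out, 0x80 | (c & 0x3f) % UInt8)
    end
    return out
end
```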

I'm not sure exactly what you mean about measuring the conversion of many strings.
I suppose I could call StringEncoder directly, instead of using decode().
Is that what you were thinking of?

About the pure-Julia converters, I think it will be pretty easy for me to add all 8-bit encodings, with all of the invalid-character behaviors.
I was also thinking about using the Encoding types (if you looked at the code in the Gist), having the tables loadable on demand with a Dict to keep track of loaded encodings, and finally implementing next(), start(), and done(), so that it would be possible to get a character at a time from an IOBuffer, a Vector{UInt8}, a pointer, or maybe even MMap'ed memory, based on a loaded encoding.
That would also make transcoding (using Unicode as the intermediary, like iconv does) easy in pure Julia.
We could still use iconv for multi-byte character sets until we have time to move those to pure Julia.
(I've done it in the past, pretty efficiently, in a table-driven fashion, so I'm confident I could add that part; it's just a matter of finding the time to do it [not too long, but more than a Saturday afternoon] given my "paying work" priorities: we need 8-bit support now, but not multi-byte character sets yet.)
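A minimal sketch of the on-demand table loading described above (all names hypothetical; in the Julia of that era the iteration part would be spelled with start()/next()/done()):

```julia
# Hypothetical sketch: tables loaded on demand and cached in a Dict, plus a
# char-at-a-time walk over any byte source using a loaded table.
const LOADED_TABLES = Dict{Symbol, Vector{UInt16}}()

# Load (or fetch from the cache) the byte -> code point table for `enc`.
# `load_table` is a hypothetical loader for the stored table data.
get_table(enc::Symbol) = get!(() -> load_table(enc), LOADED_TABLES, enc)

# Walk a byte source one character at a time through a 256-entry table; the
# same loop works for a Vector{UInt8}, an IOBuffer's data, or mmapped memory.
function each_char(f, table::Vector{UInt16}, bytes::AbstractVector{UInt8})
    for b in bytes
        f(Char(table[b + 1]))     # 1-based index into the 256-entry table
    end
end
```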

Big takeaway, which I learned when doing the Unicode conversion code in Base after being pushed by Tony et al. to do it in pure Julia: pure Julia rocks! 😀

@nalimilan
Member

> The test in the gist used different sizes: 1, 16, 32, 64, 256, 1024, and 5120 bytes.
> The strings were created with random bytes 0-255, with any bytes invalid for that character set then changed to '?'.
> iconv.jl was quite a bit slower: more than 70x slower even on larger strings converting to UTF-16, and more than 45x slower converting to UTF-8.
> ICU.jl was about twice as slow in general as the pure Julia conversion code.
> With the pure Julia converter, converting to UTF-8 is about 57% slower than converting to UTF-16 (which is to be expected; dealing with UTF-8 is generally significantly slower than UTF-16).
> This was on large strings, 5120 bytes, which should have reduced the effect of the setup of the StringEncoder for the iconv.jl tests.

5120 bytes still isn't that much to offset the cost of creating a handle. But indeed, in many use cases that overhead is an issue, so that's fair.

> ICU.jl was pretty fast, but only supports UTF-16 (there is support for UTF-8 now in ICU, but it isn't available in ICU.jl).

Why do you say that? UnicodeExtras.jl supports UTF8String too, though it would likely be slow because I think ICU uses UTF-16 as an intermediate.
https://github.com/nolta/UnicodeExtras.jl#file-encoding

> I'm not sure exactly what you mean about measuring the conversion of many strings.
> I suppose I could call StringEncoder directly, instead of using decode().
> Is that what you were thinking of?

Yeah, though I don't expect this to change the results too much. I wonder why iconv (and ICU to some extent) are so slow. Note that I haven't done any optimization on iconv.jl, so there might be significant performance issues to fix before treating these results as definitive.

> About the pure-Julia converters, I think it will be pretty easy for me to add all 8-bit encodings, with all of the invalid-character behaviors.
> I was also thinking about using the Encoding types (if you looked at the code in the Gist), having the tables loadable on demand with a Dict to keep track of loaded encodings, and finally implementing next(), start(), and done(), so that it would be possible to get a character at a time from an IOBuffer, a Vector{UInt8}, a pointer, or maybe even MMap'ed memory, based on a loaded encoding.
> That would also make transcoding (using Unicode as the intermediary, like iconv does) easy in pure Julia.
> We could still use iconv for multi-byte character sets until we have time to move those to pure Julia.
> (I've done it in the past, pretty efficiently, in a table-driven fashion, so I'm confident I could add that part; it's just a matter of finding the time to do it [not too long, but more than a Saturday afternoon] given my "paying work" priorities: we need 8-bit support now, but not multi-byte character sets yet.)

> Big takeaway, which I learned when doing the Unicode conversion code in Base after being pushed by Tony et al. to do it in pure Julia: pure Julia rocks! 😀

Makes sense.

@ScottPJones
Contributor Author

> Why do you say that? UnicodeExtras.jl supports UTF8String too, though it would likely be slow because I think ICU uses UTF-16 as an intermediate.

Ah, we are using ICU.jl at work, so that's what I benchmarked. I'll have to look at UnicodeExtras.jl.

> Yeah, though I don't expect this to change the results too much. I wonder why iconv (and ICU to some extent) are so slow. Note that I haven't done any optimization on iconv.jl, so there might be significant performance issues to fix before treating these results as definitive.

Yes, I understand iconv.jl isn't optimized yet. We'd really need to benchmark these separately (a rough sketch follows below):

  1. calling iconv_open
  2. the rest of the overhead of creating a StringEncoder (this might show that we need to cache encoders, just like I want to cache the tables for pure Julia)
  3. the actual call to iconv!
  4. the overhead of doing things through IOBuffer / write

Part of the problem might be that the transcoding iconv call might not be optimizing or simplifying the case where the source or destination is UTF-8 or UTF-16 (internally, all transcoding goes via Unicode, I think UTF-16 or UTF-32).
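A rough sketch of how that breakdown might be measured from the Julia side (StringDecoder and decode are the names used in this thread; the exact constructor arguments are an assumption and may not match the package):

```julia
# Rough, illustrative breakdown of the costs listed above. StringDecoder and
# decode come from this thread; constructor arguments are assumed and may
# differ in the actual package.
function bench_components(bytes::Vector{UInt8}, enc)
    t_setup = @elapsed begin        # items 1-2: iconv_open plus wrapper creation
        d = StringDecoder(IOBuffer(bytes), enc)
        close(d)
    end
    t_total = @elapsed decode(bytes, enc)   # items 3-4 included: conversion + IO
    return (setup = t_setup, total = t_total, conversion_and_io = t_total - t_setup)
end
```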

I like to benchmark as many possibilities as possible - so I don't get caught out by some case I didn't benchmark that turns out to be common.

@sambitdash
Contributor

julia-iconv performance will be the platform-native performance plus a small marshalling overhead from the Julia-to-C communication layer. iconv is a very mature library: no significant code additions have gone in since 2011, which makes it ultra-stable. Any newer library may lack that maturity. Unless proven otherwise, it may not be a good idea to move away from iconv.

@nalimilan
Member

The problem is not the cost of communication between Julia and C (that cost should be essentially zero); it's just that iconv is said to be relatively slow. Julia makes it possible to generate very efficient code for common conversions on the fly, which should be worth it at least for simple cases. Anyway, we'd only do this if benchmarks show it's really faster.
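As one illustration of what generating specialized code on the fly could look like (purely a sketch, not anything in the package): build a converter per encoding so its table is fixed for that method and the loop compiles down to a plain lookup and store.

```julia
# Purely illustrative: a converter specialized for one encoding's table.
# Passing the table as an NTuple lets the compiler treat its length and
# element type as constants for the generated method.
function make_to_utf16(table::NTuple{256, UInt16})
    return function (bytes::AbstractVector{UInt8})
        out = Vector{UInt16}(undef, length(bytes))
        @inbounds for (i, b) in enumerate(bytes)
            out[i] = table[b + 1]      # one lookup, one store per input byte
        end
        return out
    end
end
```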
