Speed up encoding handling in streams. #601

Merged: jtv merged 11 commits into master from faster-stream_from on Oct 7, 2022
Conversation

jtv (Owner) commented on Sep 20, 2022

Speeds up scanning of text in various encodings in stream_to and stream_from. (Even stream_to needs to be able to do that, because it escapes data for use with COPY.)

The optimisations are:

  1. Inline glyph scanning function in the search loop.
  2. For "ASCII-safe" encodings, use the "monobyte" search loop.

The inlining optimisation works as follows. Previously the stream classes kept a pointer to a function that figures out glyph boundaries (the byte where the next character begins in a byte string). The class looked up the function specialised for the current encoding: UTF-8, GBK, SJIS, and so on, or "monobyte" for single-byte encodings. In libpqxx I call those functions glyph scanners. But this way of working is painfully slow: the stream calls that function pointer for every single character it tries to read. Here, I rewrite the loop to use a different specialised function pointer which works at a higher level: "find any one of these special characters." That means the inner loop now lives inside that function, rather than on the outside calling in, which gives the compiler a much better chance to optimise the loop.
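To make that concrete, here's a minimal sketch of the before and after, with made-up names and signatures rather than the actual libpqxx API:

```cpp
#include <cstddef>
#include <string_view>

// A glyph scanner returns the offset of the byte just past the
// character that starts at offset "here".
using glyph_scanner = std::size_t (*)(
  char const buffer[], std::size_t size, std::size_t here);

// Old approach: one indirect call per character scanned.  Stepping
// glyph by glyph means we only ever inspect lead bytes, so e.g. an
// SJIS trail byte that happens to equal '\\' is never misread.
std::size_t find_special_old(
  glyph_scanner scan, std::string_view text, std::string_view specials)
{
  std::size_t here{0};
  while (here < text.size())
  {
    if (specials.find(text[here]) != std::string_view::npos) return here;
    here = scan(text.data(), text.size(), here);  // Indirect call.
  }
  return here;  // Not found.
}

// New approach: the function pointer moves up a level, to "find any one
// of these special characters".  The glyph scanning is baked into the
// loop (here via a template argument), so the compiler can inline it
// and optimise the whole inner loop.
template<glyph_scanner SCAN>
std::size_t find_quote_or_backslash(std::string_view text, std::size_t start)
{
  std::size_t here{start};
  while (here < text.size() and text[here] != '"' and text[here] != '\\')
    here = SCAN(text.data(), text.size(), here);
  return here;
}
```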

The other change is based on the fact that many encodings have two basic kinds of characters: ASCII ones, in the 0..127 range, and non-ASCII ones, whose bytes all have the high bit set to 1. In such an encoding we can never have the "SJIS" situation where an ASCII byte value (such as that of a backslash character) also occurs inside a multibyte character. When we know we're in an encoding where that can't ever happen (and UTF-8 is one of those!) we don't need the glyph scanner for that encoding at all. We can just use the simpler "monobyte" glyph scanner, which always returns offset + 1.
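For illustration (a hypothetical helper, not code from this PR): UTF-8 puts every byte of a multibyte character in the 0x80..0xFF range, so a plain byte loop can never mistake the middle of a character for a backslash. In Shift-JIS that assumption fails, because 0x5C can be the second byte of a two-byte character.

```cpp
#include <cstddef>

// Correct for ASCII-safe encodings such as UTF-8: continuation bytes
// are all 0x80 or above, so none of them can compare equal to '\\'
// (0x5C).  In SJIS this loop would be wrong, because 0x5C can occur as
// the trailing byte of a two-byte character.
std::size_t find_backslash(char const buffer[], std::size_t size)
{
  for (std::size_t here = 0; here < size; ++here)
    if (buffer[here] == '\\') return here;
  return size;  // Not found.
}
```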

Neither of these optimisations is particularly powerful on its own. Inlining UTF-8 scanning (for instance) will probably be a bit faster than calling through a function pointer, but it won't be a huge difference. And calling a simpler glyph scanner won't do us much good either, especially if it means calling it 3 times for a 3-byte character. But the two changes work well together: with both in place, the monobyte scanner can be as simple as an offset++.
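Concretely, plugging a trivial monobyte scanner into the find_quote_or_backslash sketch above collapses the whole search into a plain byte loop, with no indirect calls left in the way of the optimiser:

```cpp
#include <cstddef>
#include <string_view>

// The trivial "monobyte" glyph scanner: every byte is one glyph.
inline std::size_t next_byte(char const[], std::size_t, std::size_t here)
{
  return here + 1;
}

// What find_quote_or_backslash<next_byte> boils down to after inlining:
// a tight byte loop the compiler is free to unroll or vectorise.
std::size_t find_quote_or_backslash_monobyte(
  std::string_view text, std::size_t start)
{
  std::size_t here{start};
  while (here < text.size() and text[here] != '"' and text[here] != '\\')
    here = next_byte(text.data(), text.size(), here);
  return here;
}
```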

Unfortunately this is an ABI-breaking change. We're replacing a function pointer field with a pointer to a different type of function. This is needed because future changes to array parsing require an ability for callers to specify the characters that the finder looks for.

@jtv changed the title from "Speed up stream_from decoding." to "Speed up encoding handling in streams." on Sep 23, 2022
@jtv merged commit e809439 into master on Oct 7, 2022
@jtv deleted the faster-stream_from branch on October 7, 2022 at 20:43