Speed up encoding handling in streams. #601

Merged: jtv merged 11 commits into master from faster-stream_from on Oct 7, 2022
Conversation

jtv (Owner) commented on Sep 20, 2022

Speeds up scanning of text in various encodings in stream_to and stream_from. (Even stream_to needs to be able to do that, because it escapes data for use with COPY.)

The optimisations are:

  1. Inline glyph scanning function in the search loop.
  2. For "ASCII-safe" encodings, use the "monobyte" search loop.

The inlining optimisation works as follows. Previously the stream classes kept a pointer to a function that figures out glyph boundaries (the byte where the next character begins in a byte string). The class looked up the function specialised for the current encoding: UTF-8, GBK, SJIS, and so on, or "monobyte" for single-byte encodings. In libpqxx I call those functions glyph scanners. But this way of working is painfully slow: the stream calls that function pointer for every single character it tries to read. Here, I rewrite the loop to use a different specialised function pointer which works at a higher level: "find any one of these special characters." That means the inner loop now lives inside that function, rather than on the outside calling in, which gives the compiler a much better chance to optimise the loop.
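To make that concrete, here's a minimal sketch of the before and after, with made-up names and signatures rather than the actual libpqxx API:

```cpp
#include <cstddef>
#include <string_view>

// A glyph scanner returns the offset of the byte just past the
// character that starts at offset "here".
using glyph_scanner = std::size_t (*)(
  char const buffer[], std::size_t size, std::size_t here);

// Old approach: one indirect call per character scanned.  Stepping
// glyph by glyph means we only ever inspect lead bytes, so e.g. an
// SJIS trail byte that happens to equal '\\' is never misread.
std::size_t find_special_old(
  glyph_scanner scan, std::string_view text, std::string_view specials)
{
  std::size_t here{0};
  while (here < text.size())
  {
    if (specials.find(text[here]) != std::string_view::npos) return here;
    here = scan(text.data(), text.size(), here);  // Indirect call.
  }
  return here;  // Not found.
}

// New approach: the function pointer moves up a level, to "find any one
// of these special characters".  The glyph scanning is baked into the
// loop (here via a template argument), so the compiler can inline it
// and optimise the whole inner loop.
template<glyph_scanner SCAN>
std::size_t find_quote_or_backslash(std::string_view text, std::size_t start)
{
  std::size_t here{start};
  while (here < text.size() and text[here] != '"' and text[here] != '\\')
    here = SCAN(text.data(), text.size(), here);
  return here;
}
```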

The other change is based on the fact that many encodings have two basic kinds of characters: ASCII ones, in the 0..127 range, and non-ASCII ones, whose bytes all have the high bit set to 1. In such an encoding we can never have the "SJIS" situation where an ASCII byte value (such as that of a backslash character) also occurs inside a multibyte character. When we know we're in an encoding where that can't ever happen (and UTF-8 is one of those!) we don't need the glyph scanner for that encoding at all. We can just use the simpler "monobyte" glyph scanner, which always returns offset + 1.
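For illustration (a hypothetical helper, not code from this PR): UTF-8 puts every byte of a multibyte character in the 0x80..0xFF range, so a plain byte loop can never mistake the middle of a character for a backslash. In Shift-JIS that assumption fails, because 0x5C can be the second byte of a two-byte character.

```cpp
#include <cstddef>

// Correct for ASCII-safe encodings such as UTF-8: continuation bytes
// are all 0x80 or above, so none of them can compare equal to '\\'
// (0x5C).  In SJIS this loop would be wrong, because 0x5C can occur as
// the trailing byte of a two-byte character.
std::size_t find_backslash(char const buffer[], std::size_t size)
{
  for (std::size_t here = 0; here < size; ++here)
    if (buffer[here] == '\\') return here;
  return size;  // Not found.
}
```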

Neither of these optimisations is particularly powerful on its own. Inlining UTF-8 scanning (for instance) will probably be a bit faster than calling through a function pointer, but it won't be a huge difference. And calling a simpler glyph scanner won't do us much good either, especially if it means calling it 3 times for a 3-byte character. But the two changes work well together: with both in place, the monobyte scanner can be as simple as an offset++.
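Concretely, plugging a trivial monobyte scanner into the find_quote_or_backslash sketch above collapses the whole search into a plain byte loop, with no indirect calls left in the way of the optimiser:

```cpp
#include <cstddef>
#include <string_view>

// The trivial "monobyte" glyph scanner: every byte is one glyph.
inline std::size_t next_byte(char const[], std::size_t, std::size_t here)
{
  return here + 1;
}

// What find_quote_or_backslash<next_byte> boils down to after inlining:
// a tight byte loop the compiler is free to unroll or vectorise.
std::size_t find_quote_or_backslash_monobyte(
  std::string_view text, std::size_t start)
{
  std::size_t here{start};
  while (here < text.size() and text[here] != '"' and text[here] != '\\')
    here = next_byte(text.data(), text.size(), here);
  return here;
}
```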

Unfortunately this is an ABI-breaking change. We're replacing a function pointer field with a pointer to a different type of function. This is needed because future changes to array parsing require an ability for callers to specify the characters that the finder looks for.

@jtv changed the title from "Speed up stream_from decoding." to "Speed up encoding handling in streams." on Sep 23, 2022
@jtv merged commit e809439 into master on Oct 7, 2022
@jtv deleted the faster-stream_from branch on October 7, 2022 at 20:43