-
Notifications
You must be signed in to change notification settings - Fork 338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Small parameter encoding improvement #1296
Conversation
Do you have a benchmark that quantifies the size of the improvement? Note that |
@@ -126,6 +129,30 @@ public void Write(ReadOnlySpan<char> chars, bool flush) | |||
} | |||
} | |||
|
|||
public void WriteAscii(ReadOnlySpan<char> chars, bool flush) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this ever called with flush: false
? If not, the parameter could be removed and the code simplified.
Reallocate(chars.Length); | ||
while (chars.Length > 0) | ||
{ | ||
encoder.Convert(chars, m_output.Span, false, out var charsUsed, out var bytesUsed, out var completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flush
parameter could very likely be true
here, because we're assuming the input string contains only ASCII characters, which convert one-to-one to bytes.
var encoder = Encoding.ASCII.GetEncoder(); | ||
if (m_output.Length < chars.Length) | ||
Reallocate(chars.Length); | ||
while (chars.Length > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this loop should be unnecessary, because ASCII characters convert to bytes one-to-one and the output buffer is resized to be large enough on the line above.
FWIW, you can |
I ran the following BenchmarkDotNet test: [Params("short", "a long ASCII string that's longer")]
public string Text { get; set; }
static Encoder s_utf8 = System.Text.Encoding.UTF8.GetEncoder();
static Encoder s_ascii = System.Text.Encoding.ASCII.GetEncoder();
static byte[] s_bytes = new byte[200];
[Benchmark(Baseline = true)]
public int CachedUtf8() => s_utf8.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);
[Benchmark]
public int CachedAscii() => s_ascii.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);
[Benchmark]
public int Ascii()
{
var encoder = System.Text.Encoding.ASCII.GetEncoder();
return encoder.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);
}
It looks like using ASCII encoding would be 5-10% faster for the encoding itself, and likely even faster if the complicated looping logic (for UTF-8 surrogate characters) was omitted from the ASCII version. However, this only occurs if the |
I tested another approach of just converting the chars to bytes: [Benchmark]
public int Loop()
{
var span = Text.AsSpan();
var output = s_bytes.AsSpan();
var i = 0;
foreach (var ch in span)
output[i++] = ch is >= (char)0x20 and <= (char)0x7F ? (byte) ch : (byte) '?';
return i;
} It's significantly faster for short strings, but surprisingly slow for longer ones:
If we expected that most of the strings would be short, then perhaps it would be better on average? |
If we avoid the range check (by assuming the input is truly all ASCII), then performance improves: [Benchmark]
public int Loop()
{
var span = Text.AsSpan();
var output = s_bytes.AsSpan();
var i = 0;
foreach (var ch in span)
output[i++] = unchecked((byte) ch);
return i;
}
|
Right, there is indeed some corrections to add :
btw, i've check that using latin1 can make even better performance : benchmark result
(tested with code :
In this specific case, it might be even a better solution, so I'll change methods name and implementation accordingly. This will be use mostly for decimal and datetime, so i think ASCII/latin1 encoder is preferable to loop. |
The ASCII encoder also appears to be vectorised: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs#L1155 Since the timing is close (I don't know why Latin1 benchmarks as slightly faster?) I would prefer to use the ASCII encoder as it matches the expected range of the input data better (i.e., there won't be any non-ASCII characters that need to be encoded as Latin1). |
change submitted. (for info: latin1 difference compare to ASCII https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Latin1Utility.cs#L537 ) |
@@ -226,7 +237,8 @@ private void Reallocate(int additional = 0) | |||
m_output = new(m_buffer, usedLength, m_buffer.Length - usedLength); | |||
} | |||
|
|||
private Encoder? m_encoder; | |||
private static Encoder m_utf8Encoder = Encoding.UTF8.GetEncoder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type isn't thread-safe and can't be shared across multiple instances of ByteBufferWriter
. In particular, "An Encoder object maintains state information between successive calls to GetBytes or Convert methods so that it can correctly encode character sequences that span blocks."
Data would very likely be corrupted if m_utf8Encoder
was used simultaneously by multiple threads.
@@ -226,7 +237,8 @@ private void Reallocate(int additional = 0) | |||
m_output = new(m_buffer, usedLength, m_buffer.Length - usedLength); | |||
} | |||
|
|||
private Encoder? m_encoder; | |||
private static Encoder m_utf8Encoder = Encoding.UTF8.GetEncoder(); | |||
private static Encoder m_asciiEncoder = Encoding.ASCII.GetEncoder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is used to perform a "one-shot" conversion (from UTF-16 to ASCII) I don't think the documentation provides any guarantee that it's safe to do that simultaneously from multiple threads. Therefore, this should also not be static
.
public static void WriteLengthEncodedAsciiString(this ByteBufferWriter writer, ReadOnlySpan<char> value) | ||
{ | ||
writer.WriteLengthEncodedInteger((ulong) value.Length); | ||
writer.Write(value, flush: true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writer.Write(value, flush: true); | |
writer.WriteAscii(value); |
Did you mean WriteAscii
here?
private static Encoder m_utf8Encoder = Encoding.UTF8.GetEncoder(); | ||
private static Encoder m_asciiEncoder = Encoding.ASCII.GetEncoder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private static Encoder m_utf8Encoder = Encoding.UTF8.GetEncoder(); | |
private static Encoder m_asciiEncoder = Encoding.ASCII.GetEncoder(); | |
private Encoder? m_utf8Encoder; | |
private Encoder? m_asciiEncoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, m_asciiEncoder
probably isn't necessary... the Encoding.ASCII.GetBytes
static method is probably sufficient since we're encoding the whole string at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encoding.ASCII.GetBytes
is also much faster than a cached Encoder
object:
Method | Text | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
CachedUtf8 | 1234 | 16.584 ns | 0.3541 ns | 0.4349 ns | 1.00 | 0.00 |
CachedAscii | 1234 | 14.951 ns | 0.1972 ns | 0.1845 ns | 0.91 | 0.03 |
Ascii | 1234 | 5.975 ns | 0.1181 ns | 0.0986 ns | 0.36 | 0.01 |
Loop | 1234 | 3.002 ns | 0.0853 ns | 0.0798 ns | 0.18 | 0.01 |
CachedUtf8 | 1970-(...)00000 [26] | 19.331 ns | 0.4160 ns | 0.3891 ns | 1.00 | 0.00 |
CachedAscii | 1970-(...)00000 [26] | 17.843 ns | 0.2783 ns | 0.2324 ns | 0.92 | 0.02 |
Ascii | 1970-(...)00000 [26] | 9.683 ns | 0.1879 ns | 0.1569 ns | 0.50 | 0.01 |
Loop | 1970-(...)00000 [26] | 20.023 ns | 0.4302 ns | 0.4955 ns | 1.03 | 0.03 |
all goods remarks... I wouldn't have think a simple change like this would have taken so much exchanges. |
…re guaranteed to be ASCII. Using the ASCII encoder provides a significant performance improvement. Signed-off-by: rusher <diego.dupin@gmail.com>
Co-authored-by: Bradley Grainger <bgrainger@gmail.com>
Thanks for the improvement! |
Some types of parameters encoded as "text" or "length encoded text" are guaranteed to be ASCII. Using the ASCII encoder provides a significant performance improvement.