-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialize Tpkt, Data and Header using Unsafe.WriteUnaligned #38
Conversation
{ | ||
public static ref byte GetOffset(this ref byte destination, int offset) | ||
{ | ||
return ref Unsafe.Add(ref destination, offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return ref Unsafe.Add(ref destination, offset); | |
return ref Unsafe.Add(ref destination, (uint)offset); |
to squeeze out a little bit more perf (with a quite easy change).
See sharplab.
The movsxd
(sign extending move) is eliminated, and the mov
shown is either not needed (after inlining) or the cpu's register renaming handles it w/o actually being executed in cpu's backend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think you actually made a similar comment last week on AdsClient, I'll try to keep this in mind!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since offset will always be positive anyway, as will all the lengths returned from WireFormatting
methods, I could just as well move to uint
in all those places I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, uint
works too.
As said in the AdsClient, it depends on where the cast to uint
happens. Once here in this method, or when getting the uint
-casted length of the span (which is int
). I prefer having only one cast.
But you'll see when you make the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bummer, this is net6, 7 and 8 only and based on conversion to UIntPtr
/ nuint
. I wasn't even targeting net6 yet. So now I'm wondering whether I want to go uint
all around and cast it to int
on < net6 or the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, then something like Unsafe.Add(ref src, (IntPtr)(uint)idx);
works when the language given mapping UIntPtr <-> nuint
isn't available (and the implicit cast from uint -> nuint
).
So I'd keep the parameters as int
and do the casts in that method.
Edit: I should have read all notifications, where I should have seen the new targets 😉
Sally7/Internal/WireFormatting.cs
Outdated
WriteInt32(ref destination, JobRequestHeader1); | ||
// Legacy, the | ||
WriteInt16(ref destination.GetOffset(4), 1 << 8); // PDU ref | ||
WriteInt16(ref destination.GetOffset(6), (short)paramLength); | ||
WriteInt16(ref destination.GetOffset(8), (short)dataLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is nice and very readable.
Perf-wise we have 4 memory writes -- which may or may not be coalesced by cpu's store buffer.
So maybe combine these writes into a long
+ short
(sure taking byte order into account), so we reduced that to 2 memory writes.
I'm not sure if a micro-benchmark will show any difference here. More likely a tool like Intel VTune can show differences in memory operations.
Further I believe that the code can be structed in a way that's still readable and understandable.
Maybe we should look into this in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a benchmark. I also have a separate repository for benchmarks, I'll see if I can put it into a usable form there as well. Either way I definitely agree, less writes is faster so it probably helps to combine it into a long + short write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice repo -- I'll dig into this next week 😃.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, lunch done and I gave it a try:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
{
ulong tmp;
if (BitConverter.IsLittleEndian)
{
tmp = (ulong)BinaryPrimitives.ReverseEndianness(JobRequestHeader1);
tmp |= (ulong)BinaryPrimitives.ReverseEndianness((short)(1 << 8)) << 32;
tmp |= (ulong)(uint)BinaryPrimitives.ReverseEndianness((ushort)paramLength) << 48;
}
else
{
tmp = JobRequestHeader1;
tmp |= (ulong)(1 << 8) << 32;
tmp |= (ulong)(uint)paramLength << 48;
}
Unsafe.WriteUnaligned(ref destination, tmp);
WriteUInt16(ref destination.GetOffset(8), (ushort)dataLength);
return 10;
}
- code is not so pretty
- I'm not sure if the big-endian part is correct (this is where I'm always struggling)
- as this is the only place where this long+short is used, I don't think a helper is needed (similar to the
NetworkOrderSerializer
-class)
Maybe you have a better idea.
Codegen-wise it's only two memory writes at the expense of more register chaining. So hard to tell (w/o measuring) what's faster.
Out of interest LLVM-MCA shows this and this -- the long+short has better port usage, thus in that iterations in total less cycles are spent. But just looking at a single iteration it looks like the current code is a bit better.
In doubt I'd keep the code as is now, because with the NetworkOrderSerializer it's so beautiful to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is hard to get your head around (still hurts my head as well), but reversing the endianess of the entire long suffices. It does make some sense, because for big endian all of the data is in the order that we want it on the network, so for little endian it's all in the incorrect order. That reduces the code complexity as well, and it's actually really fast (see benchmarks). Will definitely use long+short writes for this.
Sally7/Internal/WireFormatting.cs
Outdated
return 3; | ||
} | ||
|
||
public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength) | |
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |
public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength) |
My gut tells me that it's worth it here.
For the JIT this method seems probably too big to inline, but the actual code is quite little, so we should give the JIT a hint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know an easy way to verify if it will be inlined? I could at least benchmark the method call with and without AggressiveInlining
to see if it makes a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I do either
- use https://sharplab.io/ and call the method from a dummy method to see what machine code gets emitted
- create a simple console app with a dummy method and set the proper env-variables (see [release/7.0-rc1] Improve DOTNET_JitDisasm and introduce DOTNET_JitDisasmSummary dotnet/runtime#74392 for more info)
- create a benchmark and use the
[DisassemblyDiagnoser]
to the see the machine code
Here with sharplab (some copy & pasting needed) one can see the forcing the inline will help.
W/o the inline it's a tail-call (note the jmp
as last instruction of method Foo.Bar
, whilst w/ inline it's nice code in method Foo.Bar_Inlined
.
Thanks again @gfoidl. Very nice that you jumped straight in, will keep the comments in mind for additional changes I intend to make as well. |
Sally7/Internal/WireFormatting.cs
Outdated
public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength) | ||
{ | ||
WriteInt32(ref destination, JobRequestHeader1); | ||
// Legacy, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, needs to be fixed.
|
||
internal static class NetworkOrderSerializer | ||
{ | ||
public static void WriteInt16(ref byte destination, short value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static void WriteInt16(ref byte destination, short value) | |
public static void WriteInt16(ref byte destination, ushort value) |
Found while inspecting the machine code (see other comment).
This was also sign-extended which isn't needed, as signed/unsigned does not make a difference in writing the bytes. So a little perf-boost here.
If the parameter is changed, the method name should be updated too to WriteUInt16
.
I'd do the same for WriteUInt32
then, to be consistent.
Either make the parameter unsigned, or cast to ushort
in the two places used in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final codegen should look like this, so no more sign-extending moves -- they are slow and not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments. FYI: I'll be back on Monday -- wish you a nice weekend.
@@ -5,7 +5,7 @@ namespace Sally7.Internal; | |||
|
|||
internal static class WireFormatting | |||
{ | |||
private const int Tpkt = 0x03_00_00_00; | |||
private const uint Tpkt = 0x03_00_00_00; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change is fine, just as little side-note: the JIT is fine with the int
constant too, codegen is the same.
What I find nice, is that reverse endianness from the BinaryPrimitives is an intrinsic, so the JIT will create a correct constant and use that, so no reversing at runtime.
{ | ||
return WriteInt32(ref destination, Tpkt | length); | ||
return WriteUInt32(ref destination, (uint) (Tpkt | length)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return WriteUInt32(ref destination, (uint) (Tpkt | length)); | |
return WriteUInt32(ref destination, Tpkt | (uint)length); |
Tpkt
is already a uint
, besides that the JIT will do the right thing here.
Sally7/S7ConnectionHelpers.cs
Outdated
var fnParameterLength = (ushort) (dataItems.Length * 12 + 2); | ||
ushort dataLength = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd keep these as int
, because
- in Span.Slice the parameter is
int
, so there's amovzx
to convert fromushort -> int
- for arithmetics the cpu works on 32/64-bit registers, so it
ushort -> int
at the cpu-level
we can avoid that conversions by using int
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change this back, I had my doubts around this part as well.
Sally7/S7ConnectionHelpers.cs
Outdated
private static int BuildS7JobRequest(Span<byte> buffer, ushort parameterLength, ushort dataLength) | ||
{ | ||
ref var start = ref MemoryMarshal.GetReference(buffer); | ||
var len = parameterLength + dataLength + 17; // Error omitted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len
here will be int
, so the parameters should remain int
. See sharplab.
That's why -- from my experience -- it's quite often better to keep the parameters CLS compliant or use uint
(as it's a native size that cpu registers can understand), and do the cast / type interpretion only where it's needed.
Squeezing out perf unfortunately comes with some traps too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this one, makes sense but I didn't do any checking when changing to ushort...
A nice weekend to you as well! No need to hurry with reviewing, I'm not in a hurry to make these changes either. I'm aiming for GitHub activity every day, just so that I'll keep working on my projects and perhaps start some new projects too. I don't necessarily expect you to join my effort 😉 |
Sally7/Internal/WireFormatting.cs
Outdated
{ | ||
var header = JobRequestHeader1 | (ushort) paramLength; | ||
|
||
Unsafe.WriteUnaligned(ref destination, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you are not using WriteUint32 here? (assuming that is the correct type). Would encapsulate the operation bettter, in particular the conditional endian inversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to have something to fix for the next day, don't I? 😜 It's actually UInt64
/ ulong
, but yes, that's a definite improvement to make just like the other WireFormatting
calls.
Sally7/Internal/WireFormatting.cs
Outdated
public static uint WriteUInt16(ref byte destination, ushort value) | ||
{ | ||
NetworkOrderSerializer.WriteUInt16(ref destination, value); | ||
return sizeof(short); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something to access the compile time type of a variable in C# - which has no overhead?
Then you could just return sizeof(decltype(value));
- making the code more generic with less chance of errors.
(I don't know why you currently don't use sizeof(ushort)
- instead of adding the additional mental burden to know that sizeof for the same signed and unsigned type is the same)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This started out as short value
, thus return sizeof(short)
. Forgot to clean these up as well. Unfortunately it's not possible to do sizeof(value)
, but I bet there's quite some downsides to having that support in C#.
Interesting PR. While I am usually quite fond of doing optimizations, I am wondering if this really make a difference for such heavy I/O bound applications like you have with Sally7, even if you have tons of concurrent connections. It also is sad that you have to go through these extra steps and call |
That's a fair question. Performance wise there's quite some gain in the processing time, that doesn't change the I/O bounds. Still it could result in overall lower response times. Let's say it's better for the environment because we waste less CPU cycles. 💚
I'm not sure if Unsafe operations is the right definition. It's actually different from using the And yes, one of the features of the library was the fact that everything was mapped to
I'm going to divert from that approach. Basically because I'm dictating the direction of Sally7 and it's up to me to chose, but also because I value the performance gains. For me personally it's also unsellable that I'm writing more performant code in the next library (AdsClient) while performance is the main feature of Sally7. However (and that's a BIG however), I don't want to turn this into a project that is unreadable, unmaintainable and in the end unusable. This should be clean code and still have a relatively low barrier of entrance, it just won't be using structs for that in the long term. I also invested a bit of time in unit tests, I want to expand on that. I'd like to add server side support as well at some point, not sure yet how I can do that while sharing as much of the code as possible. Maybe I will end up using Either way, thanks for your comments, again dearly appreciated! ❤️ |
Processing time improvement can definitely be worthwhile - it just never was that big of a deal for the projects I worked on back in the days, since one side talked to a database and the other to a S7 PLC :-) And yeah it is both interesting in how you can squeeze more performance out of .NET, and sad that you have to know these special tricks at the same time. The AdsClient looks quite interesting. I worked on a small project using TwinCat at some point, and the library from Beckhoff that was already in place seemed to be okay. But that was with local communication only anyway. |
Yeah I guess Viscon is by far the biggest user of Sally7 if you factor in the number of requests. Then again our abstraction layer still needs some serious work as well, that's not allocation free at all...
I especially love all the details Günther is providing. I'm just using API's that are there, Günther is actually giving advice that squeezes every last bit of performance out of the CLR.
Let's call it ignorance 😄 I'm not familiar with Beckhoff PLC's, but we had another supplier at a customer that uses Beckhoff PLC's and we needed to do some communication. We asked their preferred form of communication and it was ADS. Searching for solutions I found the AdsClient library (I didn't write this from scratch). I think I did see NuGet packages for the Beckhoff library, but because I couldn't find any more information (should've just clicked the link on nuget.org though) I was a bit uncertain about using it. So then I started by contacting the maintainer of the AdsClient library and got the GitHub repository transferred to me. The NuGet package was still linked to the original author though. He no longer had control due to account changes on nuget.org, but with a lot of communicating back and forth with NuGet admins I got that transferred too (not actually using it now though). With all that out of the way I started doing some initial communication tests. That didn't go as planned at first, but soon enough we figured it was a similar issue to Siemens where both the IP and identifiers have to match. I could borrow a PLC though, so that allowed me to test locally and refactor the library into what it is now. There's still a lot of room for improvement there as well and just like Sally7 it's a bit short on unit tests, but it was fun to do and it works pretty well. Of course I still need to compare it against Beckhoff's library to see which one performs better and if it's the Beckhoff library I've got some work to do. Final thing I need to handle is multiple connections, which apparently is constrained to a single connection per IP, so I will be adding a proxy that solves that problem and is actually different from the router as provided by Beckhoff as well. Long story short, I'm not sure I needed to write the library. But I did, and while doing that I even fixed a bug in the NodeJS ads-client package as well (which I also used for reference). And I got to know about |
@gfoidl care to give this your final verdict? I preferably want to merge this PR before I start refactoring other parts of the code. |
No description provided.