
Serialize Tpkt, Data and Header using Unsafe.WriteUnaligned #38

Merged
merged 16 commits into main from serialization
Jun 30, 2023

Conversation

mycroes
Owner

@mycroes mycroes commented Jun 22, 2023

No description provided.

@mycroes mycroes changed the title from "refactor: Serialize Tpkt, Data and Header using Unsafe.WriteUnaligned" to "Serialize Tpkt, Data and Header using Unsafe.WriteUnaligned" on Jun 22, 2023
{
public static ref byte GetOffset(this ref byte destination, int offset)
{
return ref Unsafe.Add(ref destination, offset);
Collaborator

Suggested change
return ref Unsafe.Add(ref destination, offset);
return ref Unsafe.Add(ref destination, (uint)offset);

to squeeze out a little bit more perf (with quite an easy change).

See sharplab.
The movsxd (sign-extending move) is eliminated, and the mov that remains is either not needed (after inlining) or handled by the CPU's register renaming without actually being executed in the CPU's backend.

Owner Author

Thanks, I think you actually made a similar comment on AdsClient last week; I'll try to keep this in mind!

Owner Author

Also, since offset will always be positive anyway, as will all the lengths returned from WireFormatting methods, I could just as well move to uint in all those places I guess.

Collaborator

Yep, uint works too.
As said in AdsClient, it depends on where the cast to uint happens: either here in this method, or when getting the uint-cast length of the span (which is int). I prefer having only one cast.
But you'll see when you make the change.

Owner Author

Bummer, this is net6, 7 and 8 only and based on conversion to UIntPtr / nuint. I wasn't even targeting net6 yet. So now I'm wondering whether I want to go uint all around and cast it to int on < net6 or the other way around.

Collaborator

@gfoidl gfoidl Jun 24, 2023

Ah, then something like Unsafe.Add(ref src, (IntPtr)(uint)idx); works when the language-provided mapping UIntPtr <-> nuint (and the implicit cast from uint -> nuint) isn't available.
So I'd keep the parameters as int and do the casts inside that method.

Edit: I should have read all notifications, where I should have seen the new targets 😉
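
For reference, a minimal sketch of how the helper could be multi-targeted; the container class name here is assumed, since the PR only shows the GetOffset method itself:

    using System;
    using System.Runtime.CompilerServices;

    // Hypothetical container class; the diff only shows GetOffset itself.
    internal static class ByteRefExtensions
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ref byte GetOffset(this ref byte destination, int offset)
        {
    #if NET6_0_OR_GREATER
            // uint converts implicitly to nuint, so the JIT zero-extends (no movsxd).
            return ref Unsafe.Add(ref destination, (uint)offset);
    #else
            // Older TFMs lack the nuint overload / implicit conversion, so go through IntPtr as suggested above.
            return ref Unsafe.Add(ref destination, (IntPtr)(uint)offset);
    #endif
        }
    }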

Comment on lines 47 to 51
WriteInt32(ref destination, JobRequestHeader1);
// Legacy, the
WriteInt16(ref destination.GetOffset(4), 1 << 8); // PDU ref
WriteInt16(ref destination.GetOffset(6), (short)paramLength);
WriteInt16(ref destination.GetOffset(8), (short)dataLength);
Collaborator

This code is nice and very readable.

Perf-wise we have 4 memory writes -- which may or may not be coalesced by the CPU's store buffer.
So maybe combine these writes into a long + short (taking byte order into account, of course), which would reduce that to 2 memory writes.

I'm not sure if a micro-benchmark will show any difference here. More likely a tool like Intel VTune can show differences in memory operations.

Further, I believe the code can be structured in a way that's still readable and understandable.
Maybe we should look into this in a follow-up PR.

Owner Author

I'll add a benchmark. I also have a separate repository for benchmarks; I'll see if I can put it into a usable form there as well. Either way I definitely agree: fewer writes are faster, so it probably helps to combine this into a long + short write.

Collaborator

Nice repo -- I'll dig into this next week 😃.

Collaborator

@gfoidl gfoidl Jun 23, 2023

So, lunch done and I gave it a try:

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
    {
        ulong tmp;

        if (BitConverter.IsLittleEndian)
        {
            tmp  = (ulong)BinaryPrimitives.ReverseEndianness(JobRequestHeader1);
            tmp |= (ulong)BinaryPrimitives.ReverseEndianness((short)(1 << 8)) << 32;
            tmp |= (ulong)(uint)BinaryPrimitives.ReverseEndianness((ushort)paramLength) << 48;
        }
        else
        {
            tmp  = JobRequestHeader1;
            tmp |= (ulong)(1 << 8) << 32;
            tmp |= (ulong)(uint)paramLength << 48;
        }

        Unsafe.WriteUnaligned(ref destination, tmp);

        WriteUInt16(ref destination.GetOffset(8), (ushort)dataLength);

        return 10;
    }
  • code is not so pretty
  • I'm not sure if the big-endian part is correct (this is where I'm always struggling)
  • as this is the only place where this long+short is used, I don't think a helper is needed (similar to the NetworkOrderSerializer-class)

Maybe you have a better idea.
Codegen-wise it's only two memory writes, at the expense of more register chaining. So it's hard to tell (w/o measuring) what's faster.
Out of interest, LLVM-MCA shows this and this -- the long+short version has better port usage, so across many iterations fewer cycles are spent in total. But looking at just a single iteration, the current code looks a bit better.

In doubt I'd keep the code as is now, because with the NetworkOrderSerializer it's so beautiful to read.

Owner Author

I guess this is hard to get your head around (it still hurts my head as well), but reversing the endianness of the entire long suffices. It does make some sense: for big endian all of the data is already in the order we want it on the network, so for little endian it's all in reverse order. That reduces the code complexity as well, and it's actually really fast (see benchmarks). Will definitely use long+short writes for this.
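
To illustrate the idea, a sketch only (not necessarily the exact code that ends up in the PR), reusing the existing JobRequestHeader1, WriteUInt16 and GetOffset helpers: pack the fields so that, read as a big-endian number, the bytes are already in wire order, then reverse the whole value once on little-endian machines.

    using System;
    using System.Buffers.Binary;
    using System.Runtime.CompilerServices;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
    {
        // Header constant in the top 32 bits, PDU ref next, paramLength in the low 16 bits.
        ulong tmp = ((ulong)(uint)JobRequestHeader1 << 32)
                  | ((ulong)(1u << 8) << 16)
                  | (ushort)paramLength;

        if (BitConverter.IsLittleEndian)
        {
            // One reversal of the whole 64-bit value flips every byte into wire order.
            tmp = BinaryPrimitives.ReverseEndianness(tmp);
        }

        Unsafe.WriteUnaligned(ref destination, tmp);
        WriteUInt16(ref destination.GetOffset(8), (ushort)dataLength);

        return 10;
    }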

return 3;
}

public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
Collaborator

Suggested change
public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)

My gut tells me that it's worth it here.
To the JIT this method probably looks too big to inline, but the actual code is quite small, so we should give the JIT a hint.

Owner Author

Do you know an easy way to verify if it will be inlined? I could at least benchmark the method call with and without AggressiveInlining to see if it makes a difference.

Collaborator

In general I look at the resulting codegen.

Here with sharplab (some copy & pasting needed) one can see that forcing the inline will help.

W/o the inline it's a tail-call (note the jmp as the last instruction of method Foo.Bar), whilst w/ inline it's nice code in method Foo.Bar_Inlined.
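
Another option, if you don't mind a dependency, is BenchmarkDotNet's DisassemblyDiagnoser, which dumps the JIT'ed assembly of the benchmark (and its callees), so a surviving call/jmp is easy to spot. A sketch, assuming the benchmark project can see the internal WireFormatting class (e.g. via InternalsVisibleTo); the benchmark names and arguments are made up:

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Diagnosers;
    using BenchmarkDotNet.Running;
    using Sally7.Internal;

    [DisassemblyDiagnoser(maxDepth: 2)] // include callees, so an un-inlined call shows up in the report
    public class JobRequestHeaderBenchmarks
    {
        private readonly byte[] _buffer = new byte[32];

        [Benchmark]
        public int WriteHeader() =>
            WireFormatting.WriteJobRequestHeader(ref _buffer[0], paramLength: 14, dataLength: 0);
    }

    public static class Program
    {
        public static void Main() => BenchmarkRunner.Run<JobRequestHeaderBenchmarks>();
    }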

@mycroes
Owner Author

mycroes commented Jun 23, 2023

Thanks again @gfoidl. Very nice that you jumped straight in; I'll keep the comments in mind for the additional changes I intend to make as well.

public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
{
WriteInt32(ref destination, JobRequestHeader1);
// Legacy, the
Owner Author

Whoops, needs to be fixed.


internal static class NetworkOrderSerializer
{
public static void WriteInt16(ref byte destination, short value)
Collaborator

@gfoidl gfoidl Jun 23, 2023

Suggested change
public static void WriteInt16(ref byte destination, short value)
public static void WriteInt16(ref byte destination, ushort value)

Found while inspecting the machine code (see the other comment).
This was also sign-extended, which isn't needed, as signed/unsigned makes no difference when writing the bytes. So a little perf boost here.


If the parameter is changed, the method name should be updated to WriteUInt16 too.
I'd do the same for WriteUInt32 then, to be consistent.
Either make the parameter unsigned, or cast to ushort in the two places where this method is used.
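
For completeness, a sketch of what the ushort variant could look like; the body is assumed here, the diff only shows the signature:

    using System;
    using System.Buffers.Binary;
    using System.Runtime.CompilerServices;

    internal static class NetworkOrderSerializer
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static void WriteUInt16(ref byte destination, ushort value)
        {
            if (BitConverter.IsLittleEndian)
            {
                // No sign extension involved: the ushort is reversed and written as-is.
                value = BinaryPrimitives.ReverseEndianness(value);
            }

            Unsafe.WriteUnaligned(ref destination, value);
        }
    }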

Collaborator

Final codegen should look like this, so no more sign-extending moves -- they are slow and not needed.

Collaborator

@gfoidl gfoidl left a comment

Left some comments. FYI: I'll be back on Monday -- wish you a nice weekend.

@@ -5,7 +5,7 @@ namespace Sally7.Internal;

internal static class WireFormatting
{
private const int Tpkt = 0x03_00_00_00;
private const uint Tpkt = 0x03_00_00_00;
Collaborator

The change is fine, just a little side note: the JIT is fine with the int constant too, the codegen is the same.
What I find nice is that ReverseEndianness from BinaryPrimitives is an intrinsic, so the JIT will create the correct constant and use that, so there's no reversing at runtime.

{
return WriteInt32(ref destination, Tpkt | length);
return WriteUInt32(ref destination, (uint) (Tpkt | length));
Collaborator

Suggested change
return WriteUInt32(ref destination, (uint) (Tpkt | length));
return WriteUInt32(ref destination, Tpkt | (uint)length);

Tpkt is already a uint; besides that, the JIT will do the right thing here.

Comment on lines 77 to 78
var fnParameterLength = (ushort) (dataItems.Length * 12 + 2);
ushort dataLength = 0;
Collaborator

I'd keep these as int, because

  • in Span.Slice the parameter is int, so there's a movzx to convert from ushort -> int
  • for arithmetic the CPU works on 32/64-bit registers, so ushort is widened to int at the CPU level anyway

We can avoid those conversions by using int here.

Owner Author

Will change this back, I had my doubts around this part as well.

Comment on lines 115 to 118
private static int BuildS7JobRequest(Span<byte> buffer, ushort parameterLength, ushort dataLength)
{
ref var start = ref MemoryMarshal.GetReference(buffer);
var len = parameterLength + dataLength + 17; // Error omitted
Collaborator

len here will be int, so the parameters should remain int. See sharplab.

That's why -- from my experience -- it's quite often better to keep the parameters CLS-compliant or use uint (as it's a native size that CPU registers understand), and do the cast / type reinterpretation only where it's needed.
Squeezing out perf unfortunately comes with some traps too.

Owner Author

Thanks for this one, makes sense but I didn't do any checking when changing to ushort...

@mycroes
Owner Author

mycroes commented Jun 24, 2023

> Left some comments. FYI: I'll be back on Monday -- wish you a nice weekend.

A nice weekend to you as well! No need to hurry with reviewing, I'm not in a hurry to make these changes either. I'm aiming for GitHub activity every day, just so that I'll keep working on my projects and perhaps start some new projects too. I don't necessarily expect you to join my effort 😉

{
var header = JobRequestHeader1 | (ushort) paramLength;

Unsafe.WriteUnaligned(ref destination,
Contributor

Is there a reason you are not using WriteUInt32 here (assuming that is the correct type)? It would encapsulate the operation better, in particular the conditional endian inversion.

Owner Author

I need to have something to fix for the next day, don't I? 😜 It's actually UInt64 / ulong, but yes, that's a definite improvement to make just like the other WireFormatting calls.

public static uint WriteUInt16(ref byte destination, ushort value)
{
NetworkOrderSerializer.WriteUInt16(ref destination, value);
return sizeof(short);
Contributor

Is there something to access the compile-time type of a variable in C# which has no overhead?

Then you could just return sizeof(decltype(value)); making the code more generic with less chance of errors.

(I don't know why you don't currently use sizeof(ushort), instead of adding the additional mental burden of knowing that sizeof is the same for the signed and unsigned variants of a type.)

Owner Author

This started out as short value, thus return sizeof(short). I forgot to clean these up as well. Unfortunately it's not possible to do sizeof(value), but I bet there would be quite some downsides to having that support in C#.
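
The cleanup itself is trivial; a sketch of the quoted wrapper with the return expression aligned to the parameter type:

    public static uint WriteUInt16(ref byte destination, ushort value)
    {
        NetworkOrderSerializer.WriteUInt16(ref destination, value);
        return sizeof(ushort); // same size as sizeof(short), but matches the parameter type
    }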

@scamille
Contributor

Interesting PR. While I am usually quite fond of doing optimizations, I am wondering if this really makes a difference for such heavily I/O-bound applications as you have with Sally7, even if you have tons of concurrent connections.

It is also sad that you have to go through these extra steps and call Unsafe operations to squeeze out that extra performance :-) In particular since it seems to sacrifice some of the nice Span interface in favor of decoupled pointer/length arguments.

@mycroes
Owner Author

mycroes commented Jun 26, 2023

> Interesting PR. While I am usually quite fond of doing optimizations, I am wondering if this really makes a difference for such heavily I/O-bound applications as you have with Sally7, even if you have tons of concurrent connections.

That's a fair question. Performance-wise there's quite some gain in processing time; that doesn't change the I/O bounds. Still, it could result in overall lower response times. Let's say it's better for the environment because we waste fewer CPU cycles. 💚

> It is also sad that you have to go through these extra steps and call Unsafe operations to squeeze out that extra performance :-) In particular since it seems to sacrifice some of the nice Span interface in favor of decoupled pointer/length arguments.

I'm not sure "Unsafe operations" is the right term. It's actually different from using the unsafe keyword, though I guess the effects are roughly the same in the sense that you could have buffer overflows / underflows.

And yes, one of the features of the library was the fact that everything was mapped to structs and enums. I actually still have the following sentence in the README:

> All protocols are mapped to structs and enums with the intent to create a project that is easy to comprehend and extend.

I'm going to diverge from that approach. Basically because I'm dictating the direction of Sally7 and it's up to me to choose, but also because I value the performance gains. For me personally it's also hard to sell that I'm writing more performant code in the next library (AdsClient) while performance is the main feature of Sally7. However (and that's a BIG however), I don't want to turn this into a project that is unreadable, unmaintainable and in the end unusable. This should be clean code and still have a relatively low barrier to entry; it just won't be using structs for that in the long term.

I also invested a bit of time in unit tests, and I want to expand on that. I'd like to add server-side support as well at some point; I'm not sure yet how I can do that while sharing as much of the code as possible. Maybe I will end up using structs and code generation to get the best of both worlds. 😆

Either way, thanks for your comments, again dearly appreciated! ❤️

@scamille
Contributor

scamille commented Jun 27, 2023

> That's a fair question. Performance-wise there's quite some gain in processing time; that doesn't change the I/O bounds. Still, it could result in overall lower response times. Let's say it's better for the environment because we waste fewer CPU cycles. 💚

Processing time improvements can definitely be worthwhile - it just never was that big of a deal for the projects I worked on back in the day, since one side talked to a database and the other to an S7 PLC :-)

And yeah, it is both interesting how you can squeeze more performance out of .NET and sad that you have to know these special tricks at the same time.

The AdsClient looks quite interesting. I worked on a small project using TwinCat at some point, and the library from Beckhoff that was already in place seemed to be okay. But that was with local communication only anyway.
Would you mind sharing (and maybe adding to the Readme) what the motivation was for creating your own library?

@mycroes
Owner Author

mycroes commented Jun 27, 2023

>> That's a fair question. Performance-wise there's quite some gain in processing time; that doesn't change the I/O bounds. Still, it could result in overall lower response times. Let's say it's better for the environment because we waste fewer CPU cycles. 💚

> Processing time improvements can definitely be worthwhile - it just never was that big of a deal for the projects I worked on back in the day, since one side talked to a database and the other to an S7 PLC :-)

Yeah, I guess Viscon is by far the biggest user of Sally7 if you factor in the number of requests. Then again, our abstraction layer still needs some serious work as well; that's not allocation-free at all...

> And yeah, it is both interesting how you can squeeze more performance out of .NET and sad that you have to know these special tricks at the same time.

I especially love all the details Günther is providing. I'm just using APIs that are there; Günther is actually giving advice that squeezes every last bit of performance out of the CLR.

> The AdsClient looks quite interesting. I worked on a small project using TwinCat at some point, and the library from Beckhoff that was already in place seemed to be okay. But that was with local communication only anyway. Would you mind sharing (and maybe adding to the Readme) what the motivation was for creating your own library?

Let's call it ignorance 😄 I'm not familiar with Beckhoff PLCs, but we had another supplier at a customer that uses Beckhoff PLCs and we needed to do some communication. We asked for their preferred form of communication and it was ADS. Searching for solutions, I found the AdsClient library (I didn't write it from scratch). I think I did see NuGet packages for the Beckhoff library, but because I couldn't find any more information (I should've just clicked the link on nuget.org though) I was a bit uncertain about using it.

So then I started by contacting the maintainer of the AdsClient library and got the GitHub repository transferred to me. The NuGet package was still linked to the original author though. He no longer had control due to account changes on nuget.org, but with a lot of communicating back and forth with NuGet admins I got that transferred too (not actually using it now though).

With all that out of the way I started doing some initial communication tests. That didn't go as planned at first, but soon enough we figured out it was a similar issue to Siemens, where both the IP and identifiers have to match. I could borrow a PLC though, so that allowed me to test locally and refactor the library into what it is now. There's still a lot of room for improvement there as well, and just like Sally7 it's a bit short on unit tests, but it was fun to do and it works pretty well. Of course I still need to compare it against Beckhoff's library to see which one performs better, and if it's the Beckhoff library I've got some work to do. The final thing I need to handle is multiple connections, which apparently are constrained to a single connection per IP, so I will be adding a proxy that solves that problem and is actually different from the router provided by Beckhoff as well.

Long story short, I'm not sure I needed to write the library. But I did, and while doing that I even fixed a bug in the NodeJS ads-client package (which I also used for reference). And I got to know about Unsafe.WriteUnaligned, MemoryMarshal.GetReference and sneaky Unsafe.As usage. So a big win for me and a good reason to start some new work on Sally7!

@mycroes
Owner Author

mycroes commented Jun 29, 2023

@gfoidl care to give this your final verdict? Preferably I want to merge this PR before I start refactoring other parts of the code.

@mycroes mycroes merged commit 9464e82 into main Jun 30, 2023
@mycroes mycroes deleted the serialization branch June 30, 2023 08:56