
Serialize Tpkt, Data and Header using Unsafe.WriteUnaligned #38

Merged
merged 16 commits into main from serialization
Jun 30, 2023

Conversation

mycroes
Owner

@mycroes mycroes commented Jun 22, 2023

No description provided.

@mycroes mycroes changed the title from "refactor: Serialize Tpkt, Data and Header using Unsafe.WriteUnaligned" to "Serialize Tpkt, Data and Header using Unsafe.WriteUnaligned" on Jun 22, 2023
{
public static ref byte GetOffset(this ref byte destination, int offset)
{
return ref Unsafe.Add(ref destination, offset);
Collaborator

Suggested change
return ref Unsafe.Add(ref destination, offset);
return ref Unsafe.Add(ref destination, (uint)offset);

to squeeze out a little bit more perf (with quite an easy change).

See sharplab.
The movsxd (sign-extending move) is eliminated, and the mov that remains is either not needed (after inlining) or handled by the CPU's register renaming without actually being executed in the CPU's backend.

Owner Author

Thanks, I think you actually made a similar comment on AdsClient last week; I'll try to keep this in mind!

Owner Author

Also, since offset will always be positive anyway, as will all the lengths returned from WireFormatting methods, I could just as well move to uint in all those places I guess.

Collaborator

Yep, uint works too.
As said in AdsClient, it depends on where the cast to uint happens: either here in this method, or when getting the uint-cast length of the span (which is int). I prefer having only one cast.
But you'll see when you make the change.

Owner Author

Bummer, this is net6, 7 and 8 only and based on conversion to UIntPtr / nuint. I wasn't even targeting net6 yet. So now I'm wondering whether I want to go uint all around and cast it to int on < net6 or the other way around.

Collaborator

@gfoidl gfoidl Jun 24, 2023

Ah, then something like Unsafe.Add(ref src, (IntPtr)(uint)idx); works when the language-provided mapping UIntPtr <-> nuint (and the implicit cast from uint -> nuint) isn't available.
So I'd keep the parameters as int and do the casts inside that method.

Edit: I should have read all notifications, where I should have seen the new targets 😉
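
For reference, a minimal sketch of how the helper could be multi-targeted; the container class name here is assumed, since the PR only shows the GetOffset method itself:

    using System;
    using System.Runtime.CompilerServices;

    // Hypothetical container class; the diff only shows GetOffset itself.
    internal static class ByteRefExtensions
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ref byte GetOffset(this ref byte destination, int offset)
        {
    #if NET6_0_OR_GREATER
            // uint converts implicitly to nuint, so the JIT zero-extends (no movsxd).
            return ref Unsafe.Add(ref destination, (uint)offset);
    #else
            // Older TFMs lack the nuint overload / implicit conversion, so go through IntPtr as suggested above.
            return ref Unsafe.Add(ref destination, (IntPtr)(uint)offset);
    #endif
        }
    }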

Comment on lines 47 to 51
WriteInt32(ref destination, JobRequestHeader1);
// Legacy, the
WriteInt16(ref destination.GetOffset(4), 1 << 8); // PDU ref
WriteInt16(ref destination.GetOffset(6), (short)paramLength);
WriteInt16(ref destination.GetOffset(8), (short)dataLength);
Collaborator

This code is nice and very readable.

Perf-wise we have 4 memory writes -- which may or may not be coalesced by the CPU's store buffer.
So maybe combine these writes into a long + short (taking byte order into account, of course), which would reduce that to 2 memory writes.

I'm not sure if a micro-benchmark will show any difference here. More likely a tool like Intel VTune can show differences in memory operations.

Further, I believe the code can be structured in a way that's still readable and understandable.
Maybe we should look into this in a follow-up PR.

Owner Author

I'll add a benchmark. I also have a separate repository for benchmarks; I'll see if I can put it into a usable form there as well. Either way I definitely agree: fewer writes are faster, so it probably helps to combine this into a long + short write.

Collaborator

Nice repo -- I'll dig into this next week 😃.

Collaborator

@gfoidl gfoidl Jun 23, 2023

So, lunch done and I gave it a try:

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
    {
        ulong tmp;

        if (BitConverter.IsLittleEndian)
        {
            tmp  = (ulong)BinaryPrimitives.ReverseEndianness(JobRequestHeader1);
            tmp |= (ulong)BinaryPrimitives.ReverseEndianness((short)(1 << 8)) << 32;
            tmp |= (ulong)(uint)BinaryPrimitives.ReverseEndianness((ushort)paramLength) << 48;
        }
        else
        {
            tmp  = JobRequestHeader1;
            tmp |= (ulong)(1 << 8) << 32;
            tmp |= (ulong)(uint)paramLength << 48;
        }

        Unsafe.WriteUnaligned(ref destination, tmp);

        WriteUInt16(ref destination.GetOffset(8), (ushort)dataLength);

        return 10;
    }
  • code is not so pretty
  • I'm not sure if the big-endian part is correct (this is where I'm always struggling)
  • as this is the only place where this long+short is used, I don't think a helper is needed (similar to the NetworkOrderSerializer-class)

Maybe you have a better idea.
Codegen-wise it's only two memory writes, at the expense of more register chaining. So it's hard to tell (w/o measuring) what's faster.
Out of interest, LLVM-MCA shows this and this -- the long+short version has better port usage, so across many iterations fewer cycles are spent in total. But looking at just a single iteration, the current code looks a bit better.

In doubt I'd keep the code as is now, because with the NetworkOrderSerializer it's so beautiful to read.

Owner Author

I guess this is hard to get your head around (it still hurts my head as well), but reversing the endianness of the entire long suffices. It does make some sense: for big endian all of the data is already in the order we want it on the network, so for little endian it's all in reverse order. That reduces the code complexity as well, and it's actually really fast (see benchmarks). Will definitely use long+short writes for this.
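
To illustrate the idea, a sketch only (not necessarily the exact code that ends up in the PR), reusing the existing JobRequestHeader1, WriteUInt16 and GetOffset helpers: pack the fields so that, read as a big-endian number, the bytes are already in wire order, then reverse the whole value once on little-endian machines.

    using System;
    using System.Buffers.Binary;
    using System.Runtime.CompilerServices;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
    {
        // Header constant in the top 32 bits, PDU ref next, paramLength in the low 16 bits.
        ulong tmp = ((ulong)(uint)JobRequestHeader1 << 32)
                  | ((ulong)(1u << 8) << 16)
                  | (ushort)paramLength;

        if (BitConverter.IsLittleEndian)
        {
            // One reversal of the whole 64-bit value flips every byte into wire order.
            tmp = BinaryPrimitives.ReverseEndianness(tmp);
        }

        Unsafe.WriteUnaligned(ref destination, tmp);
        WriteUInt16(ref destination.GetOffset(8), (ushort)dataLength);

        return 10;
    }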

return 3;
}

public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
Collaborator

Suggested change
public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)

My gut tells me that it's worth it here.
To the JIT this method probably looks too big to inline, but the actual code is quite small, so we should give the JIT a hint.

Owner Author

Do you know an easy way to verify if it will be inlined? I could at least benchmark the method call with and without AggressiveInlining to see if it makes a difference.

Collaborator

In general I look at the resulting codegen.

Here with sharplab (some copy & pasting needed) one can see that forcing the inline will help.

W/o the inline it's a tail-call (note the jmp as the last instruction of method Foo.Bar), whilst w/ inline it's nice code in method Foo.Bar_Inlined.
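
Another option, if you don't mind a dependency, is BenchmarkDotNet's DisassemblyDiagnoser, which dumps the JIT'ed assembly of the benchmark (and its callees), so a surviving call/jmp is easy to spot. A sketch, assuming the benchmark project can see the internal WireFormatting class (e.g. via InternalsVisibleTo); the benchmark names and arguments are made up:

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Diagnosers;
    using BenchmarkDotNet.Running;
    using Sally7.Internal;

    [DisassemblyDiagnoser(maxDepth: 2)] // include callees, so an un-inlined call shows up in the report
    public class JobRequestHeaderBenchmarks
    {
        private readonly byte[] _buffer = new byte[32];

        [Benchmark]
        public int WriteHeader() =>
            WireFormatting.WriteJobRequestHeader(ref _buffer[0], paramLength: 14, dataLength: 0);
    }

    public static class Program
    {
        public static void Main() => BenchmarkRunner.Run<JobRequestHeaderBenchmarks>();
    }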

@mycroes
Owner Author

mycroes commented Jun 23, 2023

Thanks again @gfoidl. Very nice that you jumped straight in; I'll keep the comments in mind for the additional changes I intend to make as well.

public static int WriteJobRequestHeader(ref byte destination, int paramLength, int dataLength)
{
WriteInt32(ref destination, JobRequestHeader1);
// Legacy, the
Owner Author

Whoops, needs to be fixed.


internal static class NetworkOrderSerializer
{
public static void WriteInt16(ref byte destination, short value)
Collaborator

@gfoidl gfoidl Jun 23, 2023

Suggested change
public static void WriteInt16(ref byte destination, short value)
public static void WriteInt16(ref byte destination, ushort value)

Found while inspecting the machine code (see the other comment).
This was also sign-extended, which isn't needed, as signed/unsigned makes no difference when writing the bytes. So a little perf boost here.


If the parameter is changed, the method name should be updated to WriteUInt16 too.
I'd do the same for WriteUInt32 then, to be consistent.
Either make the parameter unsigned, or cast to ushort in the two places where this method is used.
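
For completeness, a sketch of what the ushort variant could look like; the body is assumed here, the diff only shows the signature:

    using System;
    using System.Buffers.Binary;
    using System.Runtime.CompilerServices;

    internal static class NetworkOrderSerializer
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static void WriteUInt16(ref byte destination, ushort value)
        {
            if (BitConverter.IsLittleEndian)
            {
                // No sign extension involved: the ushort is reversed and written as-is.
                value = BinaryPrimitives.ReverseEndianness(value);
            }

            Unsafe.WriteUnaligned(ref destination, value);
        }
    }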

Collaborator

Final codegen should look like this, so no more sign-extending moves -- they are slow and not needed.

Collaborator

@gfoidl gfoidl left a comment

Left some comments. FYI: I'll be back on Monday -- wish you a nice weekend.

@@ -5,7 +5,7 @@ namespace Sally7.Internal;

internal static class WireFormatting
{
private const int Tpkt = 0x03_00_00_00;
private const uint Tpkt = 0x03_00_00_00;
Collaborator

The change is fine, just a little side note: the JIT is fine with the int constant too, the codegen is the same.
What I find nice is that ReverseEndianness from BinaryPrimitives is an intrinsic, so the JIT will create the correct constant and use that, so there's no reversing at runtime.

{
return WriteInt32(ref destination, Tpkt | length);
return WriteUInt32(ref destination, (uint) (Tpkt | length));
Collaborator

Suggested change
return WriteUInt32(ref destination, (uint) (Tpkt | length));
return WriteUInt32(ref destination, Tpkt | (uint)length);

Tpkt is already a uint; besides that, the JIT will do the right thing here.

Comment on lines 77 to 78
var fnParameterLength = (ushort) (dataItems.Length * 12 + 2);
ushort dataLength = 0;
Collaborator

I'd keep these as int, because

  • in Span.Slice the parameter is int, so there's a movzx to convert from ushort -> int
  • for arithmetic the CPU works on 32/64-bit registers, so ushort is widened to int at the CPU level anyway

We can avoid those conversions by using int here.

Owner Author

Will change this back, I had my doubts around this part as well.

Comment on lines 115 to 118
private static int BuildS7JobRequest(Span<byte> buffer, ushort parameterLength, ushort dataLength)
{
ref var start = ref MemoryMarshal.GetReference(buffer);
var len = parameterLength + dataLength + 17; // Error omitted
Collaborator

len here will be int, so the parameters should remain int. See sharplab.

That's why -- from my experience -- it's quite often better to keep the parameters CLS-compliant or use uint (as it's a native size that CPU registers understand), and do the cast / type reinterpretation only where it's needed.
Squeezing out perf unfortunately comes with some traps too.

Owner Author

Thanks for this one, makes sense but I didn't do any checking when changing to ushort...

@mycroes
Owner Author

mycroes commented Jun 24, 2023

> Left some comments. FYI: I'll be back on Monday -- wish you a nice weekend.

A nice weekend to you as well! No need to hurry with reviewing, I'm not in a hurry to make these changes either. I'm aiming for GitHub activity every day, just so that I'll keep working on my projects and perhaps start some new projects too. I don't necessarily expect you to join my effort 😉

{
var header = JobRequestHeader1 | (ushort) paramLength;

Unsafe.WriteUnaligned(ref destination,
Contributor

Is there a reason you are not using WriteUInt32 here (assuming that is the correct type)? It would encapsulate the operation better, in particular the conditional endian inversion.

Owner Author

I need to have something to fix for the next day, don't I? 😜 It's actually UInt64 / ulong, but yes, that's a definite improvement to make just like the other WireFormatting calls.

public static uint WriteUInt16(ref byte destination, ushort value)
{
NetworkOrderSerializer.WriteUInt16(ref destination, value);
return sizeof(short);
Contributor

Is there something to access the compile-time type of a variable in C# which has no overhead?

Then you could just return sizeof(decltype(value)); making the code more generic with less chance of errors.

(I don't know why you don't currently use sizeof(ushort), instead of adding the additional mental burden of knowing that sizeof is the same for the signed and unsigned variants of a type.)

Owner Author

This started out as short value, thus return sizeof(short). I forgot to clean these up as well. Unfortunately it's not possible to do sizeof(value), but I bet there would be quite some downsides to having that support in C#.
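
The cleanup itself is trivial; a sketch of the quoted wrapper with the return expression aligned to the parameter type:

    public static uint WriteUInt16(ref byte destination, ushort value)
    {
        NetworkOrderSerializer.WriteUInt16(ref destination, value);
        return sizeof(ushort); // same size as sizeof(short), but matches the parameter type
    }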

@scamille
Contributor

Interesting PR. While I am usually quite fond of doing optimizations, I am wondering if this really makes a difference for such heavily I/O-bound applications as you have with Sally7, even if you have tons of concurrent connections.

It is also sad that you have to go through these extra steps and call Unsafe operations to squeeze out that extra performance :-) In particular since it seems to sacrifice some of the nice Span interface in favor of decoupled pointer/length arguments.

@mycroes
Owner Author

mycroes commented Jun 26, 2023

> Interesting PR. While I am usually quite fond of doing optimizations, I am wondering if this really makes a difference for such heavily I/O-bound applications as you have with Sally7, even if you have tons of concurrent connections.

That's a fair question. Performance-wise there's quite some gain in processing time; that doesn't change the I/O bounds. Still, it could result in overall lower response times. Let's say it's better for the environment because we waste fewer CPU cycles. 💚

> It is also sad that you have to go through these extra steps and call Unsafe operations to squeeze out that extra performance :-) In particular since it seems to sacrifice some of the nice Span interface in favor of decoupled pointer/length arguments.

I'm not sure "Unsafe operations" is the right term. It's actually different from using the unsafe keyword, though I guess the effects are roughly the same in the sense that you could have buffer overflows / underflows.

And yes, one of the features of the library was the fact that everything was mapped to structs and enums. I actually still have the following sentence in the README:

> All protocols are mapped to structs and enums with the intent to create a project that is easy to comprehend and extend.

I'm going to diverge from that approach. Basically because I'm dictating the direction of Sally7 and it's up to me to choose, but also because I value the performance gains. For me personally it's also hard to sell that I'm writing more performant code in the next library (AdsClient) while performance is the main feature of Sally7. However (and that's a BIG however), I don't want to turn this into a project that is unreadable, unmaintainable and in the end unusable. This should be clean code and still have a relatively low barrier to entry; it just won't be using structs for that in the long term.

I also invested a bit of time in unit tests, and I want to expand on that. I'd like to add server-side support as well at some point; I'm not sure yet how I can do that while sharing as much of the code as possible. Maybe I will end up using structs and code generation to get the best of both worlds. 😆

Either way, thanks for your comments, again dearly appreciated! ❤️

@scamille
Contributor

scamille commented Jun 27, 2023

> That's a fair question. Performance-wise there's quite some gain in processing time; that doesn't change the I/O bounds. Still, it could result in overall lower response times. Let's say it's better for the environment because we waste fewer CPU cycles. 💚

Processing time improvements can definitely be worthwhile - it just never was that big of a deal for the projects I worked on back in the day, since one side talked to a database and the other to an S7 PLC :-)

And yeah, it is both interesting how you can squeeze more performance out of .NET and sad that you have to know these special tricks at the same time.

The AdsClient looks quite interesting. I worked on a small project using TwinCat at some point, and the library from Beckhoff that was already in place seemed to be okay. But that was with local communication only anyway.
Would you mind sharing (and maybe adding to the Readme) what the motivation was for creating your own library?

@mycroes
Owner Author

mycroes commented Jun 27, 2023

>> That's a fair question. Performance-wise there's quite some gain in processing time; that doesn't change the I/O bounds. Still, it could result in overall lower response times. Let's say it's better for the environment because we waste fewer CPU cycles. 💚

> Processing time improvements can definitely be worthwhile - it just never was that big of a deal for the projects I worked on back in the day, since one side talked to a database and the other to an S7 PLC :-)

Yeah, I guess Viscon is by far the biggest user of Sally7 if you factor in the number of requests. Then again, our abstraction layer still needs some serious work as well; that's not allocation-free at all...

> And yeah, it is both interesting how you can squeeze more performance out of .NET and sad that you have to know these special tricks at the same time.

I especially love all the details Günther is providing. I'm just using APIs that are there; Günther is actually giving advice that squeezes every last bit of performance out of the CLR.

> The AdsClient looks quite interesting. I worked on a small project using TwinCat at some point, and the library from Beckhoff that was already in place seemed to be okay. But that was with local communication only anyway. Would you mind sharing (and maybe adding to the Readme) what the motivation was for creating your own library?

Let's call it ignorance 😄 I'm not familiar with Beckhoff PLCs, but we had another supplier at a customer that uses Beckhoff PLCs and we needed to do some communication. We asked for their preferred form of communication and it was ADS. Searching for solutions, I found the AdsClient library (I didn't write it from scratch). I think I did see NuGet packages for the Beckhoff library, but because I couldn't find any more information (I should've just clicked the link on nuget.org though) I was a bit uncertain about using it.

So then I started by contacting the maintainer of the AdsClient library and got the GitHub repository transferred to me. The NuGet package was still linked to the original author though. He no longer had control due to account changes on nuget.org, but with a lot of communicating back and forth with NuGet admins I got that transferred too (not actually using it now though).

With all that out of the way I started doing some initial communication tests. That didn't go as planned at first, but soon enough we figured out it was a similar issue to Siemens, where both the IP and identifiers have to match. I could borrow a PLC though, so that allowed me to test locally and refactor the library into what it is now. There's still a lot of room for improvement there as well, and just like Sally7 it's a bit short on unit tests, but it was fun to do and it works pretty well. Of course I still need to compare it against Beckhoff's library to see which one performs better, and if it's the Beckhoff library I've got some work to do. The final thing I need to handle is multiple connections, which apparently are constrained to a single connection per IP, so I will be adding a proxy that solves that problem and is actually different from the router provided by Beckhoff as well.

Long story short, I'm not sure I needed to write the library. But I did, and while doing that I even fixed a bug in the NodeJS ads-client package (which I also used for reference). And I got to know about Unsafe.WriteUnaligned, MemoryMarshal.GetReference and sneaky Unsafe.As usage. So a big win for me and a good reason to start some new work on Sally7!

@mycroes
Owner Author

mycroes commented Jun 29, 2023

@gfoidl care to give this your final verdict? Preferably I want to merge this PR before I start refactoring other parts of the code.

@mycroes mycroes merged commit 9464e82 into main Jun 30, 2023
@mycroes mycroes deleted the serialization branch June 30, 2023 08:56