Segmentation fault in libreswan 3.32 #390
kernel: pluto[7958]: segfault at 0 ip 00005562dc0fa6f6 sp 00007ffe9fc46a50 error 4 in pluto[5562dc067000+18c000]
Unfortunately, I don't have a core dump but I do have the location:
It looks like 'sadb' is null?
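The kernel line gives both the instruction pointer and pluto's load base, so the faulting source line can in principle be recovered from a matching debug build; a sketch, assuming the usual RHEL binary path (the path is a guess, not from the report):

```
# fault offset = ip - load base:
#   0x5562dc0fa6f6 - 0x5562dc067000 = 0x936f6
addr2line -f -e /usr/libexec/ipsec/pluto 0x936f6
```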
I managed to get some logs. Luckily they are short for this node in the cluster:
Do you have the configuration file? And possibly that of the remote endpoint? I don't see anything useful in the logs. The sadb should not be null, obviously.
Unfortunately I don't have the exact configuration at that point, but it would look something like the following (and would be similar on both ends). This is from a different cluster:
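The attached config did not survive in this copy of the thread; what follows is a minimal sketch of what such a connection plausibly looks like, assuming the ikev2=insist and ike= settings cagney mentions later in the thread. Names, addresses, and algorithms are illustrative, not the reporter's actual values:

```
conn ovn-b486b9-0-in-1
    # illustrative endpoints; the real config uses the cluster
    # nodes' addresses and credentials
    left=10.0.0.1
    right=10.0.0.2
    authby=rsasig
    # settings inferred from the discussion later in the thread
    ikev2=insist
    ike=aes256-sha2_256
    auto=route
```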
Hmm that looks regular enough.
I guess we would really need a gdb backtrace, as the logs you shared show no issue whatsoever.
The core dump is in IKEv1's Main Mode proposal code, but the config file seems to have ikev2=insist (the ike= line won't work with IKEv1 either).
@cagney does that mean it is unexpectedly entering IKEv1 mode for some reason?
Ah, yes. The logs would suggest that ovn-cc3669-0-out-1 has a very complicated life, and at some point it flips from IKEv2 to IKEv1:
Jan 8 07:59:50.519668: "ovn-cc3669-0-out-1" #5: STATE_PARENT_R2: received v2I2, PARENT SA established
@cagney what can cause that?
I will try to do this. It is a little tricky because I cannot reproduce it every time, so I will need to run the CI in a loop and figure out a way to get a coredump or backtrace from the CI.
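One possible way to capture that on a node, sketched here assuming systemd-coredump is collecting dumps; these commands are a suggestion, not something from this thread:

```
# allow core dumps in case systemd-coredump is not handling them
ulimit -c unlimited

# after a crash, look for and open the pluto dump
coredumpctl list pluto
coredumpctl gdb pluto

# inside gdb:
#   bt full
#   print *st->st_connection   # the policy and ike_version fields asked about below
```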
Does this help?
This looks useful as well:
So definitely fallen back to IKEv1:
Could you print *st->st_connection (I'm interested in policy and ike_version)? At the end is a running commentary on the log. I noticed two things:
4.x's code is significantly different from 3.x for the above:
Outgoing connection ovn-b486b9-0-in-1 starts establishing IKE#2 and CHILD#3.
The second outgoing connection, ovn-b486b9-0-out-1, is queued up waiting for ovn-b486b9-0-in-1 to establish:
ovn-b486b9-0-in-1 establishes IKE#2 and CHILD#3,
and ovn-b486b9-0-out-1 initiates CHILD#4 under ovn-b486b9-0-in-1's IKE#2:
So far so good. But then there's an incoming IKE#9 for ovn-b486b9-0-out-1 and it starts to establish (if it establishes, it will take over; the assumption is that the other end feels the need to replace):
But then the remote end initiates CHILD#10 for ovn-b486b9-0-out-1 under IKE#2 (I suspect it has noticed that IKE#2 is established and usable, so it starts establishing the child). Should the CHILD#4 it replaced have been deleted? Perhaps it is, with a timeout:
Time passes (this end times out): the incomplete IKE#9 times out (presumably the remote end abandoned it). This triggers a "must-remain-up" revival (I suspect it shouldn't have got that far; it was already up):
which it does, but since IKE#2 is still available it grabs that:
Time passes (I suspect the other end timed out) and the remote sends a delete for ovn-b486b9-0-out-1 CHILD#10, which triggers a call to event_force(EVENT_SA_REPLACE, st):
But then it goes on and wipes out IKE#2 and all other CHILD SAs.
Finally, the attempt to revive ovn-b486b9-0-out-1 has flipped IKE versions:
Note this was a different run from the original log, so the connection names could be different:
Thanks @cagney
I did this in a separate comment above
I have added my additional commentary.
This is really difficult to do because it is running in an OpenShift cluster and failing intermittently. I have no way to reproduce it other than waiting, so I don't think I can easily upgrade libreswan and then get it to reproduce to test it.
Each of these pairs represents a connection between this node in the cluster and another node in the cluster.
As the configuration on each node is the same, the remote node is trying to establish the same connection with this node. This happens at cluster install, so all the nodes come up at about the same time.
Should it do that if it is shared?
One other thing to note is that we found a bug in our code where, while the cluster is starting, IPsec connections are deleted.
This still appears to be happening, even with the previously mentioned "fix". Do you have any idea what could cause this segfault? @cagney
Were you testing mainline from git? v3.32's pick_initiator() has the check:
In current sources that code has been replaced by the more robust:
Thanks @cagney, I am using the RHEL 3.32-6 package. I had a quick look at the 3.32-7 code and I see the following in
What would cause the fallback to IKEv1?
The policy can contain either IKEV1_ALLOW or IKEV2_ALLOW, but not both. However, the check fails to take into consideration that the policy could have neither set; then it mistakenly falls back to IKEv1. Of course, the code should never be called with a policy that contains neither: we expect all connections to have one of these set.
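Neither quoted snippet survived in this copy of the thread, so here is a self-contained C sketch of the failure mode as just described. It is not the verbatim libreswan source; the flag values and function names are illustrative:

```c
#include <stdio.h>

/* Illustrative stand-ins for the libreswan policy bits discussed
 * above; the values are made up for this sketch. */
#define POLICY_IKEV1_ALLOW (1u << 0)
#define POLICY_IKEV2_ALLOW (1u << 1)

/* Roughly the 3.32-era shape of the check: anything that is not
 * explicitly IKEv2 is treated as IKEv1, so a policy with NEITHER
 * bit set silently falls back to IKEv1, which is the bug. */
static int pick_version_332(unsigned policy)
{
    return (policy & POLICY_IKEV2_ALLOW) ? 2 : 1;
}

/* Roughly the more robust current shape: the "neither" case is
 * caught instead of being misread as IKEv1. */
static int pick_version_current(unsigned policy)
{
    if (policy & POLICY_IKEV2_ALLOW)
        return 2;
    if (policy & POLICY_IKEV1_ALLOW)
        return 1;
    fprintf(stderr, "connection allows no IKE version\n");
    return 0; /* refuse rather than guess */
}

int main(void)
{
    unsigned broken = 0; /* neither flag set, as described above */
    printf("3.32-style check picks IKEv%d\n", pick_version_332(broken));      /* IKEv1, wrongly */
    printf("current-style check returns %d\n", pick_version_current(broken)); /* 0, rejected */
    return 0;
}
```

The point is only that a two-way test silently maps "neither" onto IKEv1, while an explicit three-way test can reject it.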
This should be fixed in libreswan 4.3.
Please re-open if this is still an issue, but we believe it has been addressed successfully.
Thanks. When this gets integrated into OpenShift, I will have a better idea of whether it is resolved. Thanks for your help.