-
Notifications
You must be signed in to change notification settings - Fork 349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition when creating/deleting namespace address set #4382
Conversation
// do not delete the address set if the namespace was added back | ||
// during the delay | ||
nsInfo := bnc.namespaces[ns] | ||
if nsInfo != nil && nsInfo.addressSet != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be racy: if the namespace didn't exist on getNamespaceLocked
, but was added before bnc.namespacesMutex.Lock()
, then nsInfo that we get on line 556 is not locked, and the creation may be in progress, so checking nsInfo.addressSet
is not safe.
So we may have to lock nsInfo here while holding namespacesMutex
, which is never done otherwise to not block all other namespace handlers, but I think it is deadlock-safe.
But really we should just use syncmap for namespaces and get rid of this locking spaghetti.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm we can't get to line 556 while the creation is in progress because namespacesMutex
is held during all that time in ensureNamespaceLockedCommon
https://github.com/jcaamano/ovn-kubernetes/blob/f5e92013f07e83823c71e004918a09238439b877/go-controller/pkg/ovn/base_network_controller_namespace.go#L259
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the gist of your code is OK, but I would suggest some changes. A common pattern we follow for getting an nsinfo is (lets assume M is the namespacesMutex, and N is an nsinfo mutex):
- Lock M
- Get N
- Unlock M
- Lock N
- Lock M
- Get N again (call it O)
- if O !=N, then unlock N, return; else return N
- unlock M
This pattern is consistent in both getNamespaceLocked and deleteNamespaceLocked, and the bug here is that pattern is not followed in the goroutine. However, with your fix you are explicitly calling getNamespaceLocked, and then locking and checking the namespacesMutex again, which is a waste of operations, also nsInfo is held with a read lock, when you are really deleting stuff that belongs to nsInfo:
- Lock M
- Get N
- Unlock M
- Lock N
- Lock M
- Get N again (call it O)
- if O !=N, then unlock N, return; else return N
- unlock M
- your code-> lock M
- Get N again and see if it is nil or address set is nil, do stuff
- unlock M
I would do:
addressSet := nsInfo.addressSet
go func() {
select {
case <-bnc.stopChan:
return
case <-time.After(20 * time.Second):
bnc.namespacesMutex.Lock()
nsInfo := bnc.namespaces[ns]
if nsInfo != nil {
// no nsInfo exists we are good to delete
defer bnc.namespacesMutex.Unlock()
} else {
// nsInfo exists, means someone added it
// only delete if the new NS's AddressSet shouldn't exist.
bnc.namespacesMutex.Unlock()
nsInfo.Lock()
bnc.namespacesMutex.Lock()
defer bnc.namespacesMutex.Unlock()
if nsInfo != bnc.namespaces[ns] {
nsInfo.Unlock()
return
}
defer nsInfo.Unlock()
if nsInfo.addressSet != nil {
klog.V(5).Infof("Skipping deferred deletion of AddressSet for NS %s: re-created", ns)
return
}
}
klog.V(5).Infof("Finishing deferred deletion of AddressSet for NS %s", ns)
if err := addressSet.Destroy(); err != nil {
klog.Errorf("Failed to delete AddressSet for NS %s: %v", ns, err.Error())
}
}
}()
Although, I have more questions about this code:
- This code:
if nsInfo.addressSet != nil {
klog.V(5).Infof("Skipping deferred deletion of AddressSet for NS %s: re-created", ns)
return
}
If something else added the namespace back, why is this previous goroutine that handled a delete trying to now go and check if that new nsInfo's address set is nil? Does it really need to do that?
- In the code we store the original addressSet object before the goroutine is called:
addressSet := nsInfo.addressSet
but later in the goroutine from the code in 1 we are looking at the latest nsInfo's address set. If we see the latest addressSet is nil, then we delete the old nsInfos addressSet. Maybe this is a given in OVN that they will have the same ID or whatever, but from a code perspective this is kind of weird. Shouldn't we be comparing if addressSet == nsInfo.addressSet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm we can't get to line 556 while the creation is in progress because
namespacesMutex
is held during all that time inensureNamespaceLockedCommon
https://github.com/jcaamano/ovn-kubernetes/blob/f5e92013f07e83823c71e004918a09238439b877/go-controller/pkg/ovn/base_network_controller_namespace.go#L259
that was unexpected :D Some more things I found in the code: nsInfo.addressSet
is never nil for nsInfo that you got from bnc.namespaces
, as it won't be stored on failed create https://github.com/jcaamano/ovn-kubernetes/blob/f5e92013f07e83823c71e004918a09238439b877/go-controller/pkg/ovn/base_network_controller_namespace.go#L276 (which also means it won't be cleaned up, but that is another problem) and it is not set to nil on delete here.
Also both create and delete namespace ops are happening with the bnc.namespacesMutex.Lock()
, meaning that as soon as you lock it, namespaces can't be created or deleted, so it is safe to read it and potentially you could do just
bnc.namespacesMutex.Lock()
defer bnc.namespacesMutex.Unlock()
nsInfo := bnc.namespaces[ns]
if nsInfo != nil {return}
addressSet.Destroy()
I think this part of the code requires some more comments, no matter what version you end up with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think that is safe. Just because we hold the namespacesMutex lock does not mean the nsInfo is not being used by anything else. getNamespaceLocked is used in other places in the code to get the nsInfo returned and locked (read or write) to modify things with the nsInfo and not holding the namespacesMutex lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern is consistent in both getNamespaceLocked and deleteNamespaceLocked, and the bug here is that pattern is not followed in the goroutine. However, with your fix you are explicitly calling getNamespaceLocked, and then locking and checking the namespacesMutex again, which is a waste of operations
Yeah, I went for reusing code and minimizing the risk of introducing any other issue and the burden of maintaining yet more of it, and I was not concerned with any excessive lack of efficiency here. But if you lay down the code for me then I guess I can take it.
nsInfo is held with a read lock, when you are really deleting stuff that belongs to nsInfo:
I don't think that's an issue. The namespace would have to exist for me to get an RLock, and then would have to be removed (and added back) by someone else for me to end up deleting the address set inappropriately, but that can't happen while I have that RLock. Other situations... if the namespace just keeps existing normally, I won't do anything. If the namespace does not exist, and keeps not existing or is then added and removed, then I delete the address set, maybe sooner than expected with no major consequences. If the namespace does not exist and is then added, then I won't delete the address set.
Although, I have more questions about this code:
- This code:
if nsInfo.addressSet != nil { klog.V(5).Infof("Skipping deferred deletion of AddressSet for NS %s: re-created", ns) return }
If something else added the namespace back, why is this previous goroutine that handled a delete trying to now go and check if that new nsInfo's address set is nil? Does it really need to do that?
Are you asking me? This code was already there :P My best guess is the code does not assume that all namespaces have to have an address set nor on what that might depend. I did not have enough reasoning against that so I kept it.
- In the code we store the original addressSet object before the goroutine is called:
addressSet := nsInfo.addressSet
but later in the goroutine from the code in 1 we are looking at the latest nsInfo's address set. If we see the latest addressSet is nil, then we delete the old nsInfos addressSet. Maybe this is a given in OVN that they will have the same ID or whatever, but from a code perspective this is kind of weird. Shouldn't we be comparing if addressSet == nsInfo.addressSet?
Yes, the address set is just a pass-through to OVN DB with no meaningful state. It was already like that so again my best guess is the same as before, the code does not assume that all namespaces have to have an address set nor on what that might depend. Comparing a weird situation where it was added and removed behind our backs so technically we could end up removing the address set before the 20s delay from the last time it was removed. Not too concerning addressSet == nsInfo.addressSet
can be useful to avoid deleting the address set sooner than intended onbut we can add the check.
So lets recap what we want to do...I took the gamble that changing the least possible things would be faster but I see now that we don't know why some of this things are how they are...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My best guess is the code does not assume that all namespaces have to have an address set nor on what that might depend.
I guess we can add this assumption as a comment and upkeep the code equivalence to what was already there or get rid of the assumption altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed the version of the code shared by @trozet
I am keeping he assumption that a namespace info can exist without an associated address set unless we come to a different understanding.
f5e9201
to
65129d6
Compare
Changes unknown |
852cd43
to
d516608
Compare
Fix race condition: 1. deleteNamespaceLocked, after the 20s delay, checks that the current nsInfo for the namespace is nil 2. ensureNamespaceLockedCommon adds a new nsInfo referencing the existing address set 3. deleteNamespaceLocked destroys the existing address set Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
d516608
to
b1bb8ed
Compare
should work now :) |
Fix race condition:
Fixes: https://issues.redhat.com/browse/OCPBUGS-33005