
sync: Delete the chassis records during deleteNode from sbdb #1113

Merged

4 commits merged into ovn-org:master from chassis_del on Mar 23, 2020

Conversation

@abhat (Contributor) commented Mar 5, 2020

This pull request is to delete the chassis record from the sbdb during sync when the nodes get deleted from the ovnkube-master's perspective.

Signed-off-by: Aniket Bhat <anbhat@redhat.com>

@abhat (Contributor, Author) commented Mar 5, 2020

/assign @girishmg @dcbw
/cc @russellb

@girishmg (Member) commented Mar 5, 2020

@abhat @dcbw this will not work, since when the deleteNode event handler gets called in the master there is no guarantee that the ovnkube-node Pod has been deleted on that node.

I initially had this code and was planning to submit the PR. However, during testing I found that if deleteNode() gets called before the ovn-controller container is killed, the chassis will come back up. You will see a message like:

ovn-controller.log-20200304.gz:2020-03-03T17:19:10.624Z|00012|chassis|WARN|Could not find Chassis : stored (11c0eead-da11-402c-9973-1c37f62bb4f6) ovs (11c0eead-da11-402c-9973-1c37f62bb4f6)

@abhat (Contributor, Author) commented Mar 6, 2020

> @abhat @dcbw this will not work, since when the deleteNode event handler gets called in the master there is no guarantee that the ovnkube-node Pod has been deleted on that node.
>
> I initially had this code and was planning to submit the PR. However, during testing I found that if deleteNode() gets called before the ovn-controller container is killed, the chassis will come back up. You will see a message like:
>
> ovn-controller.log-20200304.gz:2020-03-03T17:19:10.624Z|00012|chassis|WARN|Could not find Chassis : stored (11c0eead-da11-402c-9973-1c37f62bb4f6) ovs (11c0eead-da11-402c-9973-1c37f62bb4f6)

Understood, but then the next time syncNodes() is called, it should still have the chassis in the chassisMap that I am building, and as long as the node isn't passed in foundNodes, it should get deleted, right?

@girishmg (Member) commented Mar 6, 2020

> Understood, but then the next time syncNodes() is called, it should still have the chassis in the chassisMap that I am building, and as long as the node isn't passed in foundNodes, it should get deleted, right?

@abhat syncNodes() runs only once, when the ovnkube-master daemon starts; thereafter it never runs. So we cannot depend on an ovnkube-master restart to delete the chassis, right?

BTW, your code in syncNodes() is still valid, as a separate PR, to handle the case of deleting the stale chassis upon ovnkube-master restart.

@trozet (Contributor) commented Mar 6, 2020

>> Understood, but then the next time syncNodes() is called, it should still have the chassis in the chassisMap that I am building, and as long as the node isn't passed in foundNodes, it should get deleted, right?
>
> @abhat syncNodes() runs only once, when the ovnkube-master daemon starts; thereafter it never runs. So we cannot depend on an ovnkube-master restart to delete the chassis, right?
>
> BTW, your code in syncNodes() is still valid, as a separate PR, to handle the case of deleting the stale chassis upon ovnkube-master restart.

@girishmg adding some more context here... we need to handle the case where the cluster is scaled down, so we need a goroutine that keeps checking so it can remove the stale entry. If syncNodes only runs once, we either need to make it run more often or use a separate syncChassis thread or something.

@abhat (Contributor, Author) commented Mar 6, 2020

>> Understood, but then the next time syncNodes() is called, it should still have the chassis in the chassisMap that I am building, and as long as the node isn't passed in foundNodes, it should get deleted, right?
>
> @abhat syncNodes() runs only once, when the ovnkube-master daemon starts; thereafter it never runs. So we cannot depend on an ovnkube-master restart to delete the chassis, right?
>
> BTW, your code in syncNodes() is still valid, as a separate PR, to handle the case of deleting the stale chassis upon ovnkube-master restart.

I see what you are saying. I didn't realize syncNodes was only called to process existing nodes during startup. 👍 to @trozet's comments. Alternatively, if we can ensure that ovn-controller checks some flag we set before it creates the chassis record, that would work too.

@dcbw (Contributor) commented Mar 12, 2020

> @girishmg adding some more context here... we need to handle the case where the cluster is scaled down, so we need a goroutine that keeps checking so it can remove the stale entry. If syncNodes only runs once, we either need to make it run more often or use a separate syncChassis thread or something.

Ok so yeah, we need a sync function since we don't know when ovn-controller will actually exit. So, 3 things:

  1. syncNodes() needs to clean up chassis records that do not have a corresponding Node object in the Kube API. This handles the case of a node being deleted while ovnkube-master is down.
  2. deleteNode() should also delete the chassis record. This will likely work a good chunk of the time, but will sometimes race against ovn-controller re-creating the chassis record.
  3. a sync goroutine (every 5 minutes?) that gets the cached node list from the WatchFactory, calls `ovn-sbctl --columns _uuid,hostname --format=table --data=bare --no-headings find chassis`, and calls `ovn-sbctl destroy chassis <uuid>` on anything that's not in the Kube API.
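The set difference in step 3 — chassis in the sbdb with no matching Node in the Kube API — can be sketched in Go. Names here (`staleChassis`, the two maps) are illustrative, not the PR's actual code; `chassisByHost` stands in for parsed `ovn-sbctl find chassis` output and `kubeNodes` for the WatchFactory's cached node list:

```go
package main

import (
	"fmt"
	"sort"
)

// staleChassis returns the _uuids of chassis records whose hostname has no
// matching Node in the Kube API. Both parameter names are hypothetical.
func staleChassis(chassisByHost map[string]string, kubeNodes map[string]bool) []string {
	var stale []string
	for host, uuid := range chassisByHost {
		if !kubeNodes[host] {
			stale = append(stale, uuid)
		}
	}
	sort.Strings(stale) // deterministic order for logging and tests
	return stale
}

func main() {
	chassis := map[string]string{
		"node-a": "uuid-a",
		"node-b": "uuid-b", // node-b was scaled down; its chassis is stale
	}
	nodes := map[string]bool{"node-a": true}
	// Each returned UUID would then be handed to `ovn-sbctl destroy chassis <uuid>`.
	fmt.Println(staleChassis(chassis, nodes)) // prints [uuid-b]
}
```

The same helper would serve both syncNodes() at startup and the periodic goroutine, since both reduce to this comparison.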

@girishmg (Member) commented

> Ok so yeah, we need a sync function since we don't know when ovn-controller will actually exit. So, 3 things:
>
>   1. syncNodes() needs to clean up chassis records that do not have a corresponding Node object in the Kube API. This handles the case of a node being deleted while ovnkube-master is down.
>   2. deleteNode() should also delete the chassis record. This will likely work a good chunk of the time, but will sometimes race against ovn-controller re-creating the chassis record.
>   3. a sync goroutine (every 5 minutes?) that gets the cached node list from the WatchFactory, calls `ovn-sbctl --columns _uuid,hostname --format=table --data=bare --no-headings find chassis`, and calls `ovn-sbctl destroy chassis <uuid>` on anything that's not in the Kube API.

@dcbw yes to all 3 of them.

@abhat abhat force-pushed the chassis_del branch 2 times, most recently from e547d5f to 26eb3c4 on March 18, 2020 00:24
@abhat abhat marked this pull request as ready for review March 18, 2020 00:24
@abhat abhat force-pushed the chassis_del branch 2 times, most recently from 4ac7e75 to ffc3b1a on March 18, 2020 01:30
@abhat abhat closed this Mar 18, 2020
@abhat abhat reopened this Mar 18, 2020
[5 review threads on go-controller/pkg/ovn/master.go — resolved]
In some cases the chassis may get created in the sbdb even though the node logical switch does not exist, and the node may then get deleted from the kube API. This commit handles cleaning up the chassis in such a corner case.
@abhat abhat closed this Mar 19, 2020
[Further review threads on go-controller/pkg/ovn/master.go and go-controller/pkg/ovn/ovn.go — resolved]
Currently this sync mechanism runs a sync-nodes job every 5 minutes, which removes stale chassis from the sbdb.
@abhat (Contributor, Author) commented Mar 23, 2020

@girishmg @trozet - addressed review comments and verified the fix in a dev setup. PTAL.

@girishmg girishmg merged commit 6c0fbaf into ovn-org:master Mar 23, 2020
@abhat abhat deleted the chassis_del branch May 21, 2020 04:15