Skip to content

Make the router data structures remove duplicates cleanly#10747

Merged
openshift-bot merged 1 commit intoopenshift:masterfrom
knobunc:bug/bz1371826-panic-in-haproxy
Sep 1, 2016
Merged

Make the router data structures remove duplicates cleanly#10747
openshift-bot merged 1 commit intoopenshift:masterfrom
knobunc:bug/bz1371826-panic-in-haproxy

Conversation

@knobunc
Copy link
Copy Markdown
Contributor

@knobunc knobunc commented Aug 31, 2016

The bug is caused by the modification of the route. In the code it
would add a new object to the list of routes by the host and not
remove them. When the route was deleted, it would remove all of the
routes with the same name, and then the list would be cleared.
Unfortunately, the code would then look at the number of items in the
list, and either delete it from the map, or clean items out of the
list. But since it was possible for there to be multiple duplicate
names, we could end up with an empty list. Later code would assume
that if there was a key in the map, then the associated list would
have items. That caused a panic.

This fix changes both places:

  • If there are duplicates, remove them all, and then if the list is
    empty, delete the key from the map
  • When a modification happens, remove the previous entry from the list

So, the first duplicate check is now just for protection, but the
proper fix is to clean out the lists before adding items to them.

Fixes bug 1371826

@knobunc
Copy link
Copy Markdown
Contributor Author

knobunc commented Aug 31, 2016

@rajatchopra @openshift/networking PTAL

@knobunc
Copy link
Copy Markdown
Contributor Author

knobunc commented Aug 31, 2016

[testextended][extended:core(router)]

@rajatchopra
Copy link
Copy Markdown
Contributor

LGTM. Ready to merge.

// Clean out any old form of this route
next := []*routeapi.Route{}
for i := range old {
if old[i].Name != route.Name {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need to check namespace as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already checked the namespace before it can get in the list. And the delete loop at the bottom does the same check I implemented. However, if you want, I can add the namespace check too.

Copy link
Copy Markdown
Contributor

@liggitt liggitt Aug 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I only saw the comparison against the oldest one, but the whole list is supposed to be from a single namespace? if so, that's ok

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, at line 122 it screens the list by the namespace of the oldest path. I saw that check and thought about it, but I think it is redundant.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use routeNameKey(old[i]) != routeNameKey(route) instead of old[i].Name != route.Name

@knobunc
Copy link
Copy Markdown
Contributor Author

knobunc commented Aug 31, 2016

@liggitt if it looks ok to you, can you chuck a merge tag on it please? Or approve it, and I'll merge. Thanks

@pravisankar
Copy link
Copy Markdown

LGTM

The bug is caused by the modification of the route.  In the code it
would add a new object to the list of routes by the host and not
remove them.  When the route was deleted, it would remove all of the
routes with the same name, and then the list would be cleared.
Unfortunately, the code would then look at the number of items in the
list, and either delete it from the map, or clean items out of the
list.  But since it was possible for there to be multiple duplicate
names, we could end up with an empty list.  Later code would assume
that if there was a key in the map, then the associated list would
have items.  That caused a panic.

This fix changes both places:
- If there are duplicates, remove them all, and then if the list is
  empty, delete the key from the map
- When a modification happens, remove the previous entry from the list

So, the first duplicate check is now just for protection, but the
proper fix is to clean out the lists before adding items to them.

Fixes bug 1371826
@knobunc knobunc force-pushed the bug/bz1371826-panic-in-haproxy branch from 8cf4f23 to 7ac6611 Compare September 1, 2016 00:54
@openshift-bot
Copy link
Copy Markdown
Contributor

Evaluated for origin testextended up to 7ac6611

@openshift-bot
Copy link
Copy Markdown
Contributor

continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/455/) (Extended Tests: core(router))

@knobunc
Copy link
Copy Markdown
Contributor Author

knobunc commented Sep 1, 2016

last one flaked [merge]

@liggitt
Copy link
Copy Markdown
Contributor

liggitt commented Sep 1, 2016

flake was #10765

@openshift-bot
Copy link
Copy Markdown
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Copy link
Copy Markdown
Contributor

Evaluated for origin test up to 7ac6611

@knobunc knobunc closed this Sep 1, 2016
@knobunc
Copy link
Copy Markdown
Contributor Author

knobunc commented Sep 1, 2016

Argh. Wrong button :-(

@knobunc knobunc reopened this Sep 1, 2016
@eparis
Copy link
Copy Markdown
Member

eparis commented Sep 1, 2016

do we need to say [merge] again?

@marun
Copy link
Copy Markdown
Contributor

marun commented Sep 1, 2016

where's the test?

edit: My point is that a clean test result only validates existing assumptions. This code fixes an oversight in the previous coverage, and without additional coverage, there is no guarantee the fix here won't be broken by future changes.

@openshift-bot
Copy link
Copy Markdown
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/8626/)

@knobunc
Copy link
Copy Markdown
Contributor Author

knobunc commented Sep 1, 2016

@marun, agreed, and I'm working on the test. But in interests of getting 3.3 out the door I was putting the fix in separately. I manually tested the PR for now.

@knobunc
Copy link
Copy Markdown
Contributor Author

knobunc commented Sep 1, 2016

[merge] flaked on #10765 again

@openshift-bot
Copy link
Copy Markdown
Contributor

Evaluated for origin merge up to 7ac6611

@knobunc
Copy link
Copy Markdown
Contributor Author

knobunc commented Sep 1, 2016

[merge] flaked on #10773

@openshift-bot
Copy link
Copy Markdown
Contributor

openshift-bot commented Sep 1, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/8642/) (Image: devenv-rhel7_4969)

@openshift-bot openshift-bot merged commit e4c00eb into openshift:master Sep 1, 2016
@spinolacastro
Copy link
Copy Markdown
Contributor

@knobunc I'm still able to reproduce it on v1.3.0-rc1 router image.
Creating a route and modifying to point to another service makes the router balances traffic.
Please, let me know if you need anything else to debug.

@knobunc
Copy link
Copy Markdown
Contributor Author

knobunc commented Sep 8, 2016

After talking to @spinolacastro over IRC. This seems like a new (but similar) issue, so a new bug will be opened.

@knobunc knobunc deleted the bug/bz1371826-panic-in-haproxy branch September 8, 2016 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants