Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TopologyAwareHints: Take lock in HasPopulatedHints #118189

Conversation

Miciah
Copy link
Contributor

@Miciah Miciah commented May 22, 2023


What type of PR is this?

/kind bug

What this PR does / why we need it:

Prevent potential concurrent map access by taking a lock before reading the topology cache's hintsPopulatedByService map.

  • staging/src/k8s.io/endpointslice/topologycache/topologycache.go (setHintsLocked, hasPopulatedHintsLocked): New helper functions. These are the same as the existing SetHints and HasPopulatedHints methods except that these helpers assume that a lock is already held.
    (SetHints): Use setHintsLocked.
    (HasPopulatedHints): Take a lock and use hasPopulatedHintsLocked.
    (AddHints): Take a lock and use setHintsLocked and hasPopulatedHintsLocked.
  • staging/src/k8s.io/endpointslice/topologycache/topologycache_test.go (TestTopologyCacheRace): Add a goroutine that calls HasPopulatedHints.

Which issue(s) this PR fixes:

Fixes #118188.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a concurrent map access in TopologyCache's `HasPopulatedHints` method.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 22, 2023
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 22, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 22, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @Miciah. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 22, 2023
@aojea
Copy link
Member

aojea commented May 23, 2023

was this not fixed by #117249?

if not, can we add a test like the one I suggested in #117249 (comment) to cover the regression?

@Miciah
Copy link
Contributor Author

Miciah commented May 23, 2023

was this not fixed by #117249?

It was not, but thanks for pointing me to that change! I was able to reproduce the race condition by modifying TestTopologyCacheRace.

Patch for TestTopologyCacheRace
--- a/pkg/controller/endpointslice/topologycache/topologycache_test.go
+++ b/pkg/controller/endpointslice/topologycache/topologycache_test.go
@@ -686,6 +686,9 @@ func TestTopologyCacheRace(t *testing.T) {
        go func() {
                cache.AddHints(sliceInfo)
        }()
+       go func() {
+               cache.HasPopulatedHints(sliceInfo.ServiceKey)
+       }()
 }
 
 // Test Helpers
Test output
go test -race ./pkg/controller/endpointslice/topologycache
I0523 11:26:16.453841 1794088 topologycache.go:96] "Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, removing hints" serviceKey="ns/svc" addressType="IPv4"
I0523 11:26:16.454071 1794088 topologycache.go:96] "Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, removing hints" serviceKey="ns/svc" addressType="IPv4"
I0523 11:26:16.454202 1794088 topologycache.go:96] "Insufficient number of endpoints (2 endpoints, 3 zones), removing hints" serviceKey="ns/svc" addressType="IPv4"
I0523 11:26:16.454341 1794088 topologycache.go:162] "Topology Aware Hints has been enabled, adding hints." serviceKey="ns/svc" addressType="IPv4"
I0523 11:26:16.454494 1794088 topologycache.go:162] "Topology Aware Hints has been enabled, adding hints." serviceKey="ns/svc" addressType="IPv4"
I0523 11:26:16.454663 1794088 topologycache.go:162] "Topology Aware Hints has been enabled, adding hints." serviceKey="ns/svc" addressType="IPv4"
I0523 11:26:16.455526 1794088 topologycache.go:96] "Insufficient number of endpoints (2 endpoints, 3 zones), removing hints" serviceKey="ns/svc" addressType="IPv4"
==================
WARNING: DATA RACE
Write at 0x00c000757470 by goroutine 119:
  runtime.mapdelete_faststr()
      /gnu/store/b6pil63ly86hfvmismwaa8dqmnk1s301-go-1.20.2/lib/go/src/runtime/map_faststr.go:301 +0x0
  k8s.io/apimachinery/pkg/util/sets.Set[...].Delete()
      /home/mmasters/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/sets/set.go:62 +0x307
  k8s.io/kubernetes/pkg/controller/endpointslice/topologycache.(*TopologyCache).RemoveHints()
      /home/mmasters/src/k8s.io/kubernetes/pkg/controller/endpointslice/topologycache/topologycache.go:200 +0x21b
  k8s.io/kubernetes/pkg/controller/endpointslice/topologycache.(*TopologyCache).AddHints()
      /home/mmasters/src/k8s.io/kubernetes/pkg/controller/endpointslice/topologycache/topologycache.go:99 +0x5aa
  k8s.io/kubernetes/pkg/controller/endpointslice/topologycache.TestTopologyCacheRace.func2()
      /home/mmasters/src/k8s.io/kubernetes/pkg/controller/endpointslice/topologycache/topologycache_test.go:687 +0x3e

Previous read at 0x00c000757470 by goroutine 120:
  runtime.mapaccess2_faststr()
      /gnu/store/b6pil63ly86hfvmismwaa8dqmnk1s301-go-1.20.2/lib/go/src/runtime/map_faststr.go:108 +0x0
  k8s.io/apimachinery/pkg/util/sets.Set[...].Has()
      /home/mmasters/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/sets/set.go:83 +0x8c
  k8s.io/kubernetes/pkg/controller/endpointslice/topologycache.(*TopologyCache).HasPopulatedHints()
      /home/mmasters/src/k8s.io/kubernetes/pkg/controller/endpointslice/topologycache/topologycache.go:266 +0x45
  k8s.io/kubernetes/pkg/controller/endpointslice/topologycache.TestTopologyCacheRace.func3()
      /home/mmasters/src/k8s.io/kubernetes/pkg/controller/endpointslice/topologycache/topologycache_test.go:690 +0x34

Goroutine 119 (running) created at:
  k8s.io/kubernetes/pkg/controller/endpointslice/topologycache.TestTopologyCacheRace()
      /home/mmasters/src/k8s.io/kubernetes/pkg/controller/endpointslice/topologycache/topologycache_test.go:686 +0x1424
  testing.tRunner()
      /gnu/store/b6pil63ly86hfvmismwaa8dqmnk1s301-go-1.20.2/lib/go/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /gnu/store/b6pil63ly86hfvmismwaa8dqmnk1s301-go-1.20.2/lib/go/src/testing/testing.go:1629 +0x47

Goroutine 120 (finished) created at:
  k8s.io/kubernetes/pkg/controller/endpointslice/topologycache.TestTopologyCacheRace()
      /home/mmasters/src/k8s.io/kubernetes/pkg/controller/endpointslice/topologycache/topologycache_test.go:689 +0x14ce
  testing.tRunner()
      /gnu/store/b6pil63ly86hfvmismwaa8dqmnk1s301-go-1.20.2/lib/go/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /gnu/store/b6pil63ly86hfvmismwaa8dqmnk1s301-go-1.20.2/lib/go/src/testing/testing.go:1629 +0x47
==================
--- FAIL: Test_redistributeHints (0.00s)
    testing.go:1446: race detected during execution of test
FAIL
FAIL	k8s.io/kubernetes/pkg/controller/endpointslice/topologycache	0.082s
FAIL

if not, can we add a test like the one I suggested in #117249 (comment) to cover the regression?

Is modifying TestTopologyCacheRace sufficient, or do you want a separate test?

@Miciah Miciah force-pushed the 118188-TopologyAwareHints-take-lock-in-HasPopulatedHints branch from 2e25130 to b322858 Compare May 23, 2023 15:30
@shaneutt
Copy link
Member

/cc @shaneutt @aojea @robscott

@robscott
Copy link
Member

Thanks @Miciah!

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 25, 2023
@atiratree
Copy link
Member

/cc

@atiratree
Copy link
Member

It was not, but thanks for pointing me to that change! I was able to reproduce the race condition by modifying TestTopologyCacheRace.

maybe it would be good to increase the chance by calling SetHints directly or add a few more goroutines?

if not, can we add a test like the one I suggested in #117249 (comment) to cover the regression?

Is modifying TestTopologyCacheRace sufficient, or do you want a separate test?

@aojea

@aojea
Copy link
Member

aojea commented May 31, 2023

modifying it LGTM, duplicating the test too ... is just to avoid regressions

@aojea
Copy link
Member

aojea commented Jun 1, 2023

I mean, modifying if it covers both races, otherwise just duplicate it

return t.hasPopulatedHintsLocked(serviceKey)
}

func (t *TopologyCache) hasPopulatedHintsLocked(serviceKey string) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the naming might be a bit misleading, what about hasPopulatedHintsWithoutLock or hasPopulatedHintsThreadUnsafe?

Copy link
Contributor Author

@Miciah Miciah Jun 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I see the naming pattern of foo (or Foo) that takes a lock and fooLocked that assumes a lock is held all over the Kubernetes codebase when I grep for func .*Locked(. Here's one example out of dozens:

// AddIfNotPresent inserts an item, and puts it in the queue. If an item with
// the key is present in the map, no changes is made to the item.
//
// This is useful in a single producer/consumer scenario so that the consumer can
// safely retry items without contending with the producer and potentially enqueueing
// stale items.
func (h *Heap) AddIfNotPresent(obj interface{}) error {
id, err := h.data.keyFunc(obj)
if err != nil {
return KeyError{obj, err}
}
h.lock.Lock()
defer h.lock.Unlock()
if h.closed {
return fmt.Errorf(closedMsg)
}
h.addIfNotPresentLocked(id, obj)
h.cond.Broadcast()
return nil
}
// addIfNotPresentLocked assumes the lock is already held and adds the provided
// item to the queue if it does not already exist.
func (h *Heap) addIfNotPresentLocked(key string, obj interface{}) {
if _, exists := h.data.items[key]; exists {
return
}
heap.Push(h.data, &itemKeyValue{key, obj})
}

I see one match for func.*WithoutLock, but it actually takes a lock:

// cacheWithoutLock acquires the store's (write) mutex and calls cacheWithLock.
func (d *Diskv) cacheWithoutLock(key string, val []byte) error {
d.mu.Lock()
defer d.mu.Unlock()
return d.cacheWithLock(key, val)
}

I see only a couple matches for func.*ThreadUnsafe, e.g.:

func (i *objectCacheItem) stop() bool {
i.lock.Lock()
defer i.lock.Unlock()
return i.stopThreadUnsafe()
}
func (i *objectCacheItem) stopThreadUnsafe() bool {
if i.stopped {
return false
}
i.stopped = true
close(i.stopCh)
if !i.immutable {
i.store.unsetInitialized()
}
return true
}

In sum:

  • fooWithoutLock seems to mean the opposite of what we want here.
  • fooThreadUnsafe is unusual but has precedent.
  • fooLocked seems to be the prevailing pattern.

However, if you strongly prefer "hasPopulatedHintsThreadUnsafe", I'll make the change. Let me know if you indeed want me to do that.

Copy link
Member

@atiratree atiratree Jun 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the prevailing pattern is fooLocked. It just seems odd to me that the we describe a precondition to a function in the function name instead of the intent what the function will do.

We also have the fooUnlocked variant:

func (npm *nominator) DeleteNominatedPodIfExists(pod *v1.Pod) {
npm.lock.Lock()
npm.deleteNominatedPodIfExistsUnlocked(pod)
npm.lock.Unlock()
}
func (npm *nominator) deleteNominatedPodIfExistsUnlocked(pod *v1.Pod) {
npm.delete(pod)
}

So we have no actual consistency in the codebase and it is probably the best to always check the implementation.

Since fooLocked is used the most, I am fine with doing it the old way, even though semantically fooThreadUnsafe would be better.

t.setHintsLocked(serviceKey, addrType, allocatedHintsByZone)
}

func (t *TopologyCache) setHintsLocked(serviceKey string, addrType discovery.AddressType, allocatedHintsByZone EndpointZoneInfo) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+here

@atiratree
Copy link
Member

/priority important-soon

@atiratree
Copy link
Member

@Miciah can you please rebase the PR?

Prevent potential concurrent map access by taking a lock before reading the
topology cache's hintsPopulatedByService map.

* staging/src/k8s.io/endpointslice/topologycache/topologycache.go
(setHintsLocked, hasPopulatedHintsLocked): New helper functions.  These are
the same as the existing SetHints and HasPopulatedHints methods except that
these helpers assume that a lock is already held.
(SetHints): Use setHintsLocked.
(HasPopulatedHints): Take a lock and use hasPopulatedHintsLocked.
(AddHints): Take a lock and use setHintsLocked and hasPopulatedHintsLocked.
* staging/src/k8s.io/endpointslice/topologycache/topologycache_test.go
(TestTopologyCacheRace): Add a goroutine that calls HasPopulatedHints.
@Miciah Miciah force-pushed the 118188-TopologyAwareHints-take-lock-in-HasPopulatedHints branch from b322858 to 43f8ccf Compare July 18, 2023 13:25
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2023
@Miciah
Copy link
Contributor Author

Miciah commented Jul 18, 2023

Rebased for #118953.

@atiratree
Copy link
Member

@robscott @aojea would you mind revisiting/approving this PR?

@atiratree
Copy link
Member

seems this is a flake
/test pull-kubernetes-e2e-gce

@robscott
Copy link
Member

Thanks @Miciah! Would like to get another set of eyes on this for final LGTM in case I'm missing anything here.

/cc @aojea @swetharepakula
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah, robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2023
@aojea
Copy link
Member

aojea commented Aug 7, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ed106aa738293b299a1b366e75e2874e9990fc26

@k8s-ci-robot k8s-ci-robot merged commit 10a252a into kubernetes:master Aug 15, 2023
12 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Aug 15, 2023
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Sep 2, 2023
k8s-ci-robot added a commit that referenced this pull request Sep 7, 2023
…89-origin-release-1.26

Automated cherry pick of #118189: TopologyAwareHints: Take lock in HasPopulatedHints
k8s-ci-robot added a commit that referenced this pull request Sep 7, 2023
…89-origin-release-1.28

Automated cherry pick of #118189: TopologyAwareHints: Take lock in HasPopulatedHints
k8s-ci-robot added a commit that referenced this pull request Sep 8, 2023
…89-origin-release-1.27

Automated cherry pick of #118189: TopologyAwareHints: Take lock in HasPopulatedHints
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Topology cache's HasPopulatedHints method can attempt concurrent map access
6 participants