NETOBSERV-1354: fix concurrent access on watches #458

Merged
merged 1 commit into from Oct 13, 2023

Conversation

jotak
Member

@jotak jotak commented Oct 12, 2023

Description

CI tests fail intermittently because of a race condition on the watcher's map of watches. This PR should fix it by guarding access to that map.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Oct 12, 2023

@jotak: This pull request references NETOBSERV-1354 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.15.0" version, but no target version was set.

@jotak jotak added the no-qe (This PR doesn't necessitate QE approval) and no-doc (This PR doesn't require documentation change on the NetObserv operator) labels Oct 12, 2023
@codecov

codecov bot commented Oct 12, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (707ab18) 54.83% compared to head (5f43bbe) 55.00%.
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #458      +/-   ##
==========================================
+ Coverage   54.83%   55.00%   +0.17%     
==========================================
  Files          47       47              
  Lines        6381     6394      +13     
==========================================
+ Hits         3499     3517      +18     
+ Misses       2640     2635       -5     
  Partials      242      242              
Flag Coverage Δ
unittests 55.00% <100.00%> (+0.17%) ⬆️

Files Coverage Δ
pkg/watchers/watcher.go 68.55% <100.00%> (+2.80%) ⬆️

... and 2 files with indirect coverage changes

Contributor

@jpinsonneau jpinsonneau left a comment

Looks good! Thanks

@@ -28,7 +29,8 @@ var (
 type Watcher struct {
 	ctrl    controller.Controller
 	cache   cache.Cache
-	watched map[string]interface{}
+	watches map[string]bool
+	wmut    sync.RWMutex
Contributor

Member Author

I actually started with sync.Map, but it isn't type-safe (it works with any) and its API is less convenient than the regular map API. Since I'm only using the mutex in a couple of places, all in all I find it easier to use a traditional map + mutex.
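
To illustrate the trade-off described here, the following is a minimal sketch, not the code from this PR. It contrasts a lookup through sync.Map, where values come back as any and need a type assertion, with a lookup in a typed map guarded by a sync.RWMutex. The function names (lookupSyncMap, lookupGuarded, compare) and the keys are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// lookupSyncMap reads a flag from a sync.Map. Load returns the value as
// `any`, so the caller must type-assert before using it.
func lookupSyncMap(m *sync.Map, key string) bool {
	v, ok := m.Load(key)
	if !ok {
		return false
	}
	b, isBool := v.(bool)
	return isBool && b
}

// lookupGuarded reads the same flag from a typed map under a read lock.
// The compiler guarantees that every value is a bool.
func lookupGuarded(mu *sync.RWMutex, m map[string]bool, key string) bool {
	mu.RLock()
	defer mu.RUnlock()
	return m[key]
}

// compare runs both lookups on hypothetical keys and also shows that
// sync.Map accepts a value of the wrong type without complaint.
func compare() (viaSyncMap, viaGuarded, wrongType bool) {
	var sm sync.Map
	sm.Store("ns/secret", true)
	sm.Store("ns/oops", "not a bool") // compiles: sync.Map stores any value

	var mu sync.RWMutex
	typed := map[string]bool{"ns/secret": true}

	return lookupSyncMap(&sm, "ns/secret"),
		lookupGuarded(&mu, typed, "ns/secret"),
		lookupSyncMap(&sm, "ns/oops")
}

func main() {
	a, b, c := compare()
	fmt.Println(a, b, c) // prints "true true false"
}
```

With the typed map, storing a string would be rejected at compile time, whereas the sync.Map variant only finds out at the type assertion.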

@jpinsonneau jpinsonneau added the ok-to-test label (To set manually when a PR is safe to test. Triggers image build on PR.) Oct 13, 2023
@jpinsonneau
Contributor

It seems that QE is also hitting this issue on OCP 4.14.0-0.nightly-2023-10-13-073537, using the network-observability-rhel9-operator@sha256:8af24598e648bd3956e64d9d6ed50439f87e4a58852cf18e8c7c466af31e756f image:

fatal error: concurrent map read and map write
goroutine 1061 [running]:
github.com/netobserv/network-observability-operator/pkg/watchers.(*Watcher).isWatched(...)
/remote-source/app/pkg/watchers/watcher.go:92
github.com/netobserv/network-observability-operator/pkg/watchers.(*Watcher).watch.func1({0xc0052f2ac0?, 0x44d1f4?}, {0x1e50500?, 0xc0021498c0?})
/remote-source/app/pkg/watchers/watcher.go:77 +0xec
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0xc000710668?, {0x1e3cc98?, 0xc00355f630?}, {0x1e444f8, 0xc0000a41a0}, {0x1e50500?, 0xc0021498c0?}, 0xc000710668?)
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:81 +0x5f
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0x1e3cc98?, {0x1e3cc98, 0xc00355f630}, {{0x1e50500?, 0xc0021498c0?}}, {0x1e444f8, 0xc0000a41a0})
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:58 +0xe8
sigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnAdd(0xc002e1f400, {0x1b29160?, 0xc0021498c0})
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:88 +0x296
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
/remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:239
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/remote-source/app/vendor/k8s.io/client-go/tools/cache/shared_informer.go:974 +0x148
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc002dca738?, {0x1e24e60, 0xc003560c00}, 0x1, 0xc004e3d140)
/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x1853b20?, 0x3b9aca00, 0x0, 0xc0?, 0x1853b20?)
/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc004e8d170)
/remote-source/app/vendor/k8s.io/client-go/tools/cache/shared_informer.go:968 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x85
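
The trace above shows isWatched reading the map from an informer event-handler goroutine while another goroutine writes to it. The Go runtime aborts the process when it detects that on a plain map. A minimal sketch of the guarded access pattern follows, assuming hypothetical names (watchSet, setWatched, hammer, and the keys); only isWatched, the watches map, and the sync.RWMutex field come from this PR, and the real method bodies may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// watchSet is a hypothetical stand-in for the Watcher's bookkeeping:
// a plain typed map guarded by a sync.RWMutex.
type watchSet struct {
	mu      sync.RWMutex
	watches map[string]bool
}

func newWatchSet() *watchSet {
	return &watchSet{watches: make(map[string]bool)}
}

// isWatched takes the read lock, so many event handlers can check at once.
func (w *watchSet) isWatched(key string) bool {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return w.watches[key]
}

// setWatched takes the write lock, excluding readers while the map changes.
func (w *watchSet) setWatched(key string, watched bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.watches[key] = watched
}

// hammer runs n concurrent writers and n concurrent readers against the set
// and returns how many keys end up marked as watched.
func hammer(w *watchSet, n int) int {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		key := fmt.Sprintf("ns/secret-%d", i)
		wg.Add(2)
		go func() { defer wg.Done(); w.setWatched(key, true) }()
		go func() { defer wg.Done(); _ = w.isWatched(key) }()
	}
	wg.Wait()

	w.mu.RLock()
	defer w.mu.RUnlock()
	count := 0
	for _, watched := range w.watches {
		if watched {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println(hammer(newWatchSet(), 100)) // prints 100
}
```

Running a program like this with the race detector enabled (go run -race) is one way to check that no unguarded access remains.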

@github-actions

New images:

  • quay.io/netobserv/network-observability-operator:ef480b3
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-ef480b3
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-ef480b3

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:ef480b3 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-ef480b3

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-ef480b3
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@jotak
Member Author

jotak commented Oct 13, 2023

Merging to unblock the failing tests, but @msherif1234, if you think I should not use map + mutex, we can continue discussing!
/approve

@openshift-ci

openshift-ci bot commented Oct 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

@openshift-ci openshift-ci bot merged commit 15c43ae into netobserv:main Oct 13, 2023
12 checks passed
@nathan-weinberg
Contributor

Tried this with the pre-merge image and it indeed fixed the issues I was having with a new flowcollector 👍

@jotak jotak deleted the race-watches branch November 1, 2023 08:35
Labels
  • approved
  • jira/valid-reference
  • lgtm
  • no-doc: This PR doesn't require documentation change on the NetObserv operator
  • no-qe: This PR doesn't necessitate QE approval
  • ok-to-test: To set manually when a PR is safe to test. Triggers image build on PR.
6 participants