
Add monitor store to resource watch #27492

Closed
wants to merge 2 commits into from

Conversation


@xueqzhan xueqzhan commented Oct 24, 2022

Main changes in this PR:

  1. Reorganize the existing monitor files (for node, pod, cluster operator, etc.) so that the code can be called both from within the monitor package and from outside it (e.g. for resource watch).
  2. Add a monitor store for resource watch. For resources supported by the monitor (currently node, pod, cluster operator, and cluster version), an effort has been made to keep the event formats the same. Events are not supported yet.
  3. The monitor store can be selected instead of the git store. In a real observer, both stores can be started with two different instances of the openshift-tests binary.

TRT-469


openshift-ci bot commented Oct 24, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xueqzhan
Once this PR has been reviewed and has the lgtm label, please assign spadgett for approval by writing /assign @spadgett in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xueqzhan

/retest-required


@dgoodwin dgoodwin left a comment


This one was challenging for me to review, but I'm starting to get a handle on it. It would be good to have some discussion sometime this week; that might help me clarify more. I was struggling with the dual entrypoints, and with whether or not we could simplify the process for adding a new resource type to monitor in the future, as right now it requires changes in several places. Hopefully this will become clearer with some discussion, if it's even feasible at all.

@@ -16,9 +16,19 @@ import (
"k8s.io/client-go/tools/cache"
)

func startEventMonitoring(ctx context.Context, m Recorder, client kubernetes.Interface) {
-	reMatchFirstQuote := regexp.MustCompile(`"([^"]+)"( in (\d+(\.\d+)?(s|ms)$))?`)
+var reMatchFirstQuote = regexp.MustCompile(`"([^"]+)"( in (\d+(\.\d+)?(s|ms)$))?`)

Could you godoc this regex variable, that's a tough one for a reader to understand what it does and how/where it's used at a glance.
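A godoc comment here could spell out the capture groups. The sketch below shows one possible wording; the sample event message is illustrative, not taken from the code:

```go
package main

import (
	"fmt"
	"regexp"
)

// reMatchFirstQuote extracts the first double-quoted substring from an event
// message, optionally followed by a duration such as ` in 3.2s` or ` in 150ms`
// at the end of the message. Capture groups: 1 = the quoted text,
// 3 = the duration string (empty when no duration is present).
var reMatchFirstQuote = regexp.MustCompile(`"([^"]+)"( in (\d+(\.\d+)?(s|ms)$))?`)

func main() {
	m := reMatchFirstQuote.FindStringSubmatch(`Created container "setup" in 2.5s`)
	fmt.Println(m[1]) // setup
	fmt.Println(m[3]) // 2.5s
}
```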

nodeInformer := informercorev1.NewNodeInformer(client, time.Hour, nil)
nodeInformer.AddEventHandler(
cache.ResourceEventHandlerFuncs{
-			AddFunc: func(obj interface{}) {},
+			AddFunc: NodeAddFunc,

NodeAddFunc remains empty, was there a need to do anything there that got missed?

@@ -21,7 +25,35 @@ import (
"github.com/openshift/origin/pkg/monitor/resourcewatch/storage"
)

var (
monitorStoreStr = "monitor"
gitStoreStr = "git"

With regards to running the watch twice in the observer, this does have memory implications, we mirror every watched resource in memory, and doing so twice means we double that. I don't know how significant it is though with the size of clusters we're dealing with, but it could be worth mentioning. Would it be non-trivial to compose the two stores into one, so we only listwatch once and dispatch to both?
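Composing the two stores could be as simple as a fan-out handler that dispatches each informer event to every registered store, so the cluster is list-watched only once. A self-contained sketch, where `ObserverStore`, `multiHandler`, and `printStore` are hypothetical names (the real git and monitor storages would need to share an interface mirroring client-go's `cache.ResourceEventHandler`):

```go
package main

import "fmt"

// ObserverStore is a stand-in for the interface both storages would implement.
type ObserverStore interface {
	OnAdd(obj interface{})
	OnUpdate(oldObj, newObj interface{})
	OnDelete(obj interface{})
}

// multiHandler fans a single informer's events out to several stores.
type multiHandler struct {
	stores []ObserverStore
}

func (m *multiHandler) OnAdd(obj interface{}) {
	for _, s := range m.stores {
		s.OnAdd(obj)
	}
}

func (m *multiHandler) OnUpdate(oldObj, newObj interface{}) {
	for _, s := range m.stores {
		s.OnUpdate(oldObj, newObj)
	}
}

func (m *multiHandler) OnDelete(obj interface{}) {
	for _, s := range m.stores {
		s.OnDelete(obj)
	}
}

// printStore is a toy store used only to demonstrate the fan-out.
type printStore struct {
	name string
	seen int // number of events this store has received
}

func (p *printStore) OnAdd(obj interface{}) {
	p.seen++
	fmt.Println(p.name, "add:", obj)
}

func (p *printStore) OnUpdate(_, newObj interface{}) {
	p.seen++
	fmt.Println(p.name, "update:", newObj)
}

func (p *printStore) OnDelete(obj interface{}) {
	p.seen++
	fmt.Println(p.name, "delete:", obj)
}

func main() {
	h := &multiHandler{stores: []ObserverStore{
		&printStore{name: "git"},
		&printStore{name: "monitor"},
	}}
	// Both stores see the event while the cluster is watched only once.
	h.OnAdd("pod/foo")
}
```

Registering a `multiHandler` as the single event handler would keep memory for the watch caches at roughly the current level, at the cost of coupling the two stores' lifecycles.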

flags := cmd.Flags()
flags.StringVar(&store, "store", store, "Store to use for resource watch. Currently supported values are git or monitor.")
flags.StringVar(&artifactDir, "artifact-dir", artifactDir, "The directory to write test reports to.")
}

This might be worth a conversation after scrum when David is there, but I've gotten confused about where we go with the monitor intervals.

Does this PR shut off the normal monitoring of pod/node resources while openshift-tests runs tests, in favor of only monitoring them from an observer? My concern is that we can't assume observers are used globally, can we? In that case we'd lose the old intervals whenever the observer wasn't running.

The one thing I want to very explicitly avoid is both processes catching the same events and then trying to deduplicate them when they probably have slightly different timestamps.

Getting them from the observer seems better because we have more timeline available to us. How do we balance rolling that out while keeping other jobs that don't use an observer still getting their events and avoiding duplication? Or do we try to ensure that everyone uses the observer?

func (s *monitorStorage) OnAdd(obj interface{}) {
objUnstructured, ok := obj.(*unstructured.Unstructured)
if !ok {
klog.Warningf("Object is not unstructured: %v", obj)

Looks like you want a return here.

if err != nil {
klog.Warningf("Decoding %s failed with error: %v", objUnstructured.GetName(), err)
}
monitor.NodeAddFunc(nodeObj)

This will get called when we fail to decode, looks like we should return here and for all examples like this below.

if err != nil {
return err
}
configStore, err = storage.NewMonitorStorage(artifactDir, eventsClient)

Would it make more sense to launch the startClusterOperatorMonitoring funcs instead of an actual separate store? This is a very loose question, just something to think about. Are there benefits to a separate store?

Then again, maybe the startX funcs die off, depending on the other question posed here about duplicated intervals.


openshift-ci bot commented Nov 4, 2022

@xueqzhan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 3665393 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-agnostic-ovn-cmd | 3665393 | link | false | /test e2e-agnostic-ovn-cmd |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 3665393 | link | false | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-metal-ipi-sdn | 3665393 | link | false | /test e2e-metal-ipi-sdn |
| ci/prow/e2e-aws-csi | 3665393 | link | false | /test e2e-aws-csi |
| ci/prow/e2e-gcp-csi | 3665393 | link | false | /test e2e-gcp-csi |
| ci/prow/e2e-aws-ovn-single-node-serial | 3665393 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-aws-ovn-image-registry | 3665393 | link | true | /test e2e-aws-ovn-image-registry |
| ci/prow/e2e-gcp-ovn-image-ecosystem | 3665393 | link | true | /test e2e-gcp-ovn-image-ecosystem |
| ci/prow/e2e-gcp-ovn-builds | 3665393 | link | true | /test e2e-gcp-ovn-builds |


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2022
@openshift-merge-robot

@xueqzhan: PR needs rebase.


@xueqzhan xueqzhan closed this Nov 9, 2022