Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Idling to v1alpha2 #19205

Closed

Conversation

DirectXMan12
Copy link
Contributor

@DirectXMan12 DirectXMan12 commented Apr 3, 2018

This updates idling to v1alpha2. It strips away the idling controller and API types (in favor of https://github.com/openshift/service-idler), and rewrites the proxy logic to use the Idler type
instead of annotations (and similarly with oc idle).

TODO:

  • Remove old controller
  • Update proxy
  • Update oc idle (simple toggle version)
  • Update router controller
  • Fix e2e tests
  • idling proxy metrics (follow-up PR, if needed)
  • oc create idler with previous discovery code (follow-up PR, if needed)
  • oc adm router support (follow-up PR)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2018
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 3, 2018
@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 3, 2018
@DirectXMan12
Copy link
Contributor Author

@knobunc ptal

@DirectXMan12
Copy link
Contributor Author

@sallyom FYI

Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Nice work Solly.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 3, 2018
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 4, 2018
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2018
@DirectXMan12
Copy link
Contributor Author

I haven't actually tested any of the new commits to see if they cause explosions or whatnot, and they certainly don't pass units tests, but it compiles. The router was a bit hairy. That took up a decent chunk of my evening trying to get a clean-ish solution.

@knobunc knobunc requested a review from pravisankar April 4, 2018 10:47
Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow... the router changes are rather large. This is going to need some time to review, and a change this large is a bit scary.

@pravisankar
Copy link

/retest

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 26, 2018
@DirectXMan12 DirectXMan12 force-pushed the feature/idling-v2alpha1 branch 2 times, most recently from 95e0a4d to 9137c9a Compare May 18, 2018 19:12
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 18, 2018
glog.Fatalf("error: Could not initialize Kubernetes Proxy. You must run this process as root (and if containerized, in the host network namespace as privileged) to use the service proxy: %v", err)
}
signaler := unidler.NewIdlerSignaler(c.IdlingClientset.IdlingV1alpha2(), lookup)
// NB: this must not be the masquerade bit (0x1 by default) or the drop bit (0x8000 by default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify what "this" applies to. And perhaps move it to where mark bit is defined, or set...

return res, nil
}

// IdlerServiceLookup knows how to look which idler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is weird and redundant with the one two lines later... perhaps just move that one up and get rid of this one?

glog.V(8).Infof("hybrid proxy: ignore unchanged idled state for %s/%s on update", newIdler.Namespace, newIdler.Name)
return
}
// TODO: deal with the case where someone adds a trigger service during idling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do this now?


func (h *idlerChangeHandler) OnAdd(obj interface{}) {
// the addition of an idler is always going to be either a no-op
// (if not idled) or a switch to idled (if idled)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording is a little confusing.

func (h *idlerChangeHandler) switchIdledOn(namespace string, svcNames []string) {
glog.V(6).Infof("hybrid proxy: switch idling on for %s/%v", namespace, svcNames)
for _, svcName := range svcNames {
// make sure this is before the endpoints, so that if we have missing endpoints, we can still switch endpoints on and off
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"services on and off"

}

func (p *UnidlerProxy) OnServiceSynced() {
// TODO: do we need to do anything here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enlighten me...

kcmdutil.AddFilenameOptionFlags(cmd, &opts.FilenameOptions, "Filename, directory, or URL to a file identifying the resource to get from a server.")
cmd.Flags().StringVarP(&opts.LabelSelector, "selector", "l", opts.LabelSelector, "Selector (label query) to filter on, supports '=', '==', and '!='.(e.g. -l key1=value1,key2=value2)")
cmd.Flags().BoolVar(&opts.AllNamespaces, "all-namespaces", opts.AllNamespaces, "If present, list the requested object(s) across all namespaces. Namespace in current context is ignored even if specified with --namespace.")
cmd.Flags().BoolVar(&opts.Unidle, "un", opts.Unidle, "If present, unidle the given idlers.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd got with --unidle... But I like --un more.

})

// TODO: Work out how to make this test work correctly when run on AWS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete this?

@@ -119,6 +127,7 @@ func (o *TemplateRouter) Bind(flag *pflag.FlagSet) {
flag.StringVar(&o.Ciphers, "ciphers", util.Env("ROUTER_CIPHERS", ""), "Specifies the cipher suites to use. You can choose a predefined cipher set ('modern', 'intermediate', or 'old') or specify exact cipher suites by passing a : separated list.")
flag.BoolVar(&o.StrictSNI, "strict-sni", isTrue(util.Env("ROUTER_STRICT_SNI", "")), "Use strict-sni bind processing (do not use default cert).")
flag.StringVar(&o.MetricsType, "metrics-type", util.Env("ROUTER_METRICS_TYPE", ""), "Specifies the type of metrics to gather. Supports 'haproxy'.")
flag.BoolVar(&o.EnableUnidling, "enable-unidling", false, "Whether or not to enable support for interacting with the idling API to properly wake up idled services.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this an env too please? We'll need to add 'oc adm' support and that's easier with env.

if err != nil {
return err
}

factory := o.RouterSelection.NewFactory(routeclient, projectclient.Project().Projects(), kc)
factory.RouteModifierFn = o.RouteUpdate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did I go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2018
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2018
@DirectXMan12 DirectXMan12 force-pushed the feature/idling-v2alpha1 branch 2 times, most recently from 83f6606 to b4f2a8a Compare June 19, 2018 23:49
@DirectXMan12
Copy link
Contributor Author

seems like the node proxy is now running under the sdn pod, but nobody bound the system:node-proxier role to the openshift-sdn:sdn service account...

@smarterclayton
Copy link
Contributor

/hold

@openshift/api-review the API needs review, I plan on looking at it but haven't yet.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 6, 2018

// NFQueueNumber is the number of the NFQueue used to intercept unidling traffic.
// It must be unique.
NFQueueNumber uint16 `json:"nfQueueNumber"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are incredibly specific config values - I'm not sure a human should ever set them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, specifically, it needs to never conflict with the two Kubernetes mark bits (one is the masquerade bit, and the other is similiar), and it needs to not conflict with any other rules people set on their systems -- that's why it's configurable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this is not something a human should have to set ever. Can you determine values for this for 99% of cases that won't conflict?

return fmt.Errorf("no idler found for service %s", service.String())
}

if idler.Spec.WantIdle == false {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed the API design yet, but unless this becomes a constant you should never be comparing to a boolean, just testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, agreed -- must have been tired when I wrote that :-P

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jul 6, 2018

@DirectXMan12: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/unit 9137c9a link /test unit
ci/openshift-jenkins/integration b4f2a8a link /test integration
ci/openshift-jenkins/verify b4f2a8a link /test verify
ci/openshift-jenkins/gcp b4f2a8a link /test gcp
ci/openshift-jenkins/launch-gcp 07324c8 link /test launch-gcp
ci/openshift-jenkins/cmd c1d7389 link /test cmd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

@DirectXMan12: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 6, 2018
@smarterclayton
Copy link
Contributor

I don't think this is going to make it for 3.11. There's too much unresolved in the install path, the API needs a thorough design review, and most of the reviewers have other critical path items right now.

@knobunc
Copy link
Contributor

knobunc commented Jul 9, 2018

@smarterclayton is there no way that we can get this for 3.11? We've been waiting for it so that we can fix the issues that openshift.io has uncovered with the existing unidling (e.g. events getting eaten due to rate limiting). I had hoped that since @DirectXMan12 had this PR open so early that we would be able to get it for this release.

I understand that the installation is effectively a manual process at the moment, but in 3.12 we would make an operator for it. Given that this is effectively for internal use, so that we can get some mileage on it, how much work do you need to have done for the installer in 3.11? And since the API is alpha and subject to change, how thorough does the review need to be?

@knobunc
Copy link
Contributor

knobunc commented Jul 10, 2018

@deads2k @liggitt -- Is there any chance either of you can review this API so we can get the unidler working for openshift.io? The current tech preview unidling uses events to pass messages and they get rate limited. This alpha object would be far better... but it needs API review.

@knobunc
Copy link
Contributor

knobunc commented Jul 10, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 10, 2018
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: DirectXMan12, knobunc
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: derekwaynecarr

If they are not already assigned, you can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@deads2k
Copy link
Contributor

deads2k commented Jul 11, 2018

@deads2k @liggitt -- Is there any chance either of you can review this API so we can get the unidler working for openshift.io? The current tech preview unidling uses events to pass messages and they get rate limited. This alpha object would be far better... but it needs API review.

Looks like it's coming from this repo: https://github.com/openshift/service-idler, but the whole repo has only ever had two pulls. Was the rest reviewed sometime? It's pretty unusual not to have an api in https://github.com/openshift/api when we want to use and support it. I think the only exceptions have been: 1) things originated in origin that we forgot, 2) a toy API we immediately mutated/removed for clusterup.

@deads2k
Copy link
Contributor

deads2k commented Jul 11, 2018

/hold

I don't think this is going to make it for 3.11. There's too much unresolved in the install path, the API needs a thorough design review, and most of the reviewers have other critical path items right now.

It looks like the main API controlling the behavior is coming in via https://github.com/openshift/service-idler . For APIs that are going to be exposed to end users and that we expect to support (and if I'm reading the API right, we would end up supporting it and at least providing some kind of migration instructions), we expect those to live in https://github.com/openshift/api . A good test is probably, "is this owned by openshift and does it have a client".

The idea of the API seems reasonable. A persistent object describing the idled/unidled state of a particular set of resources. Details about the, particularly around edge case behavior and some potential for conflicting information deserve more discussion.

@DirectXMan12
Copy link
Contributor Author

Looks like it's coming from this repo: https://github.com/openshift/service-idler, but the whole repo has only ever had two pulls. Was the rest reviewed sometime?

Yes @knobunc has reviewed it, and we've been testing it unofficially for a while now. Particularly, @sallyom has been using it to rewrite and simplify the OpenShift Online automated idling.

It's pretty unusual not to have an api in https://github.com/openshift/api when we want to use and support it. I think the only exceptions have been: 1) things originated in origin that we forgot, 2) a toy API we immediately mutated/removed for clusterup.

The CRD approach was originally suggested by @smarterclayton. I my mind, there's two reasons to keep this as a CRD, at least for the time being.

  1. It's easier to iterate -- the path of "iterate as CRD, move to API object when stable" seems pretty reasonable.

  2. It's easier to share around as a self-contained building block. Others upstream have expressed interest in building automated unidling, and having this in a separate repo means we can easily point them at this repo as a place to start.

The idea of the API seems reasonable. A persistent object describing the idled/unidled state of a particular set of resources. Details about the, particularly around edge case behavior and some potential for conflicting information deserve more discussion.

We have had a lot of that discussion, but I'd be happy to discuss further. Both when we (@knobunc and I) designed the original system (more on that below), and when we were iterating on the new system, based on what we learned from the existing system. We know the existing system doesn't work at the scales we need for online (in addition to not handling some corner cases we'd like it to handle), and we'd like to get some mileage on this system in Online before we fully support it (i.e. tech preview for a release or two), but we can't do that until we can deploy both the idling/unidling controller (the easy part -- it's a self-contained pod with a CRD) and an updated unidling proxy with router support (the harder part, since the router and proxy are part of OpenShift proper).

@deads2k
Copy link
Contributor

deads2k commented Jul 11, 2018

The CRD approach was originally suggested by @smarterclayton. I my mind, there's two reasons to keep this as a CRD, at least for the time being.

It's easier to iterate -- the path of "iterate as CRD, move to API object when stable" seems pretty reasonable.

It's easier to share around as a self-contained building block. Others upstream have expressed interest in building automated unidling, and having this in a separate repo means we can easily point them at this repo as a place to start.

Using a CRD is not a way to skip an API review. See https://github.com/openshift/api/blob/master/servicecertsigner/v1alpha1/types.go#L82-L88 an example of a CRD that is based on an API reviewed type.

openshift/api has a light footprint and openshift APIs should live there. Living in openshift/api does not mean that your code has to live in openshift/origin.

@sallyom
Copy link
Contributor

sallyom commented Jul 11, 2018

/test launch-gcp

@DirectXMan12
Copy link
Contributor Author

DirectXMan12 commented Jul 11, 2018

As per a discussion with @deads2k, I'm going to file a PR against openshift/api. The API will continue to be served as a custom resource, but the API definitions will live in openshift/api so they can be released together with other openshift APIs.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 9, 2018
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 8, 2018
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Copy link

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@DirectXMan12 DirectXMan12 deleted the feature/idling-v2alpha1 branch December 21, 2018 00:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/networking size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants