Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubeadm: fix offline and air-gapped support #67397

Merged
merged 3 commits into from
Sep 3, 2018

Conversation

neolit123
Copy link
Member

@neolit123 neolit123 commented Aug 14, 2018

What this PR does / why we need it:

Change the error output of getAllDefaultRoutes() so that it includes
information on which files were probed for the IP routing tables
even if such files are obvious.

Introduce a new error type which can be used to figure out of this
error is exactly of the "no routes" type.

If netutil.ChooseBindAddress() fails looking up IP route tables
it will fail with an error in which case the kubeadm config
code will hard stop.

This scenario is possible if the Linux user intentionally disables
the WiFi from the distribution settings. In such a case the distro
could empty files such files as /proc/net/route and ChooseBindAddress()
will return an error.

For improved offline support, don't error on such scenarios but instead
show a warning. This is done by using the NoRoutesError type.
Also default the address to 0.0.0.0.

While doing that, prevent some commands like init, join and also
phases like controlplane and certs from using such an invalid
address.

If there is no internet, label versions fail and this breaks
air-gapped setups unless the users pass an explicit version.

To work around that:

  • Remain using 'release/stable-x.xx' as the default version.
  • On timeout or any error different from status 404 return error
  • On status 404 fallback to using the version of the client via
    kubeadmVersion()

Add unit tests for kubeadmVersion().

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
refs kubernetes/kubeadm#1041

Special notes for your reviewer:
1st and second commits fix offline support.
3rd commit fixes air-gabbed support (as discussed in the linked issue)

the api-machinery change is only fmt.Errorf() related.

Release note:

kubeadm: fix air-gapped support and also allow some kubeadm commands to work without an available networking interface

/cc @kubernetes/sig-cluster-lifecycle-pr-reviews
/cc @kubernetes/sig-api-machinery-pr-reviews
/assign @kad
/assign @xiangpengzhao
/area UX
/area kubeadm
/kind bug

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Aug 14, 2018
@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/kubeadm kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 14, 2018
@@ -56,9 +56,12 @@ func SetInitDynamicDefaults(cfg *kubeadmapi.InitConfiguration) error {
// This is the same logic as the API Server uses
ip, err := netutil.ChooseBindAddress(addressIP)
if err != nil {
return err
defaultBind := "127.0.0.1"
Copy link
Contributor

@chuckha chuckha Aug 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only problem I see with this is that ChooseBindAddress has more than one error that it can return. We should really be returning a specific error type and only defaulting on the error type we expect otherwise we might ignore an actual error.

Copy link
Member Author

@neolit123 neolit123 Aug 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense.
but given it returns a string should we just string compare for "no default routes", like the unit tests are doing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like comparing strings for errors, I'd prefer type checking as described https://blog.golang.org/error-handling-and-go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i will add the error type check.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 14, 2018
@neolit123
Copy link
Member Author

neolit123 commented Aug 14, 2018

update:

  • included a custom error type so that we can verify if the error type is exactly about "no routes"

case *netutil.NoRoutesError:
defaultBind := "127.0.0.1"
glog.Warningf("could not obtain advertise address for the API Server: %v; using: %s", err, defaultBind)
cfg.API.AdvertiseAddress = defaultBind
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will lead to the error. net space inside container with apiserver will be not the same as host, thus if we can't get bind ip, init must fail. What we can do, is to try to keep AdvertiseAddress empty, and in places where it is important, do the checks for it to have some normal values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kad i was hesitant about adding 127.0.0.1.
i will give "" a try.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but please have a check that it is valid in the places where it is really used. e.g. with https://golang.org/pkg/net/#IP.IsGlobalUnicast

@neolit123
Copy link
Member Author

@kad
updated, but i went for "0.0.0.0" for a couple of reasons:

  • if we pass "" this will fail this validation:
advertiseAddress: Invalid value: "": ip address is not valid

https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/apis/kubeadm/validation/validation.go#L303
and none of the commands that we want to work offline would work

@xiangpengzhao
Copy link
Contributor

The solution SGTM 👍

@kad
Copy link
Member

kad commented Aug 15, 2018

@neolit123 Here is example of what I've meant with checking of validity of undetected AdvertiseAddress:

kad:/work/go/src/k8s.io/kubernetes # _output/bin/kubeadm alpha phase certs apiserver
W0815 17:24:01.245638   28981 masterconfig.go:62] could not obtain advertise address for the API Server: no default routes found in "/proc/net/route" or "/proc/net/ipv6_route"; using: 0.0.0.0
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [kad kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 0.0.0.0]
kad:/work/go/src/k8s.io/kubernetes # openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text | grep DNS
                DNS:kad, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, IP Address:10.96.0.1, IP Address:0.0.0.0
kad:/work/go/src/k8s.io/kubernetes # 

That's an error situation. If AdvertiseAddress is not detected nor specified manually in config, code that tries to use it (certificate creation, init) should fail.

@neolit123
Copy link
Member Author

neolit123 commented Aug 15, 2018

@kad

i will make the warning to be more clear:
WARNING: could not obtain advertise address...

That's an error situation. If AdvertiseAddress is not detected nor specified manually in config, code that tries to use it (certificate creation, init) should fail.

i will add the check to the cert phase.

but it becomes a question of what command should fail, because things like kubeadm config print-default should arguably not fail even if using 0.0.0.0, which is not exactly a valid address and can render a cluster unusable if someone uses the config as is.

@kad
Copy link
Member

kad commented Aug 15, 2018

In print defaults I think this field should be set to empty if it is 0, to force user to set proper value there.
As for which commands should fail: in my opinion, anything that does real actions and can render cluster to unusable state should fail if this IP is incorrect. This means certs and manifests generations as minimum.

@@ -56,9 +56,17 @@ func SetInitDynamicDefaults(cfg *kubeadmapi.InitConfiguration) error {
// This is the same logic as the API Server uses
ip, err := netutil.ChooseBindAddress(addressIP)
if err != nil {
return err
switch err.(type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we make it to a private function, since it is used in both masterconfig and nodeconfig.

@fedebongio
Copy link
Contributor

/assign @cheftako

@neolit123
Copy link
Member Author

neolit123 commented Aug 16, 2018

@kad @dixudx @xiangpengzhao updated and also included a patch for the air-gapped fix we talked about (on timeout, fallback to local client version).

@neolit123 neolit123 changed the title kubeadm: fix offline support in case of a missing IP route table kubeadm: fix offline and air-gapped support Aug 16, 2018
@neolit123
Copy link
Member Author

neolit123 commented Aug 16, 2018

/test pull-kubernetes-integration

^ needs a small fix:

init_test.go:107: 
	CmdInitKubernetesVersion test case "invalid version string is detected" failed with an error: <nil>

WIP

@kad
Copy link
Member

kad commented Aug 25, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 27, 2018
@neolit123
Copy link
Member Author

rebased after conflicts with master.

}
return nil, err
}
*valueToUpdate = ip.String()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics are weird of this signature, you already return the ip where any consumer can call ip.String()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd either call that externally or return the string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will update this.

// including the offset e.g. `alpha.0.206`. This is done to comply with GCR image tags.
pre := v.PreRelease()
patch := v.Patch()
if len(pre) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts don't we have utils for this stuff somewhere?


// VerifyAPIServerBindAddress can be used to verify if a bind address for the API Server is 0.0.0.0,
// in which case this address is not valid and should not be used.
func VerifyAPIServerBindAddress(address string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there negative tests for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are missing. will add them.

Change the error output of getAllDefaultRoutes() so that it includes
information on which files were probed for the IP routing tables
even if such files are obvious.

Introduce a new error type which can be used to figure out of this
error is exactly of the "no routes" type.
@timothysc
Copy link
Member

@neolit123 - Could you squash some of the commits?

neolit123 and others added 2 commits August 27, 2018 23:03
1) Do not fail in case a bind address cannot be obtained

If netutil.ChooseBindAddress() fails looking up IP route tables
it will fail with an error in which case the kubeadm config
code will hard stop.

This scenario is possible if the Linux user intentionally disables
the WiFi from the distribution settings. In such a case the distro
could empty files such files as /proc/net/route and ChooseBindAddress()
will return an error.

For improved offline support, don't error on such scenarios but instead
show a warning. This is done by using the NoRoutesError type.
Also default the address to 0.0.0.0.

While doing that, prevent some commands like `init`, `join` and also
phases like `controlplane` and `certs` from using such an invalid
address.

Add unit tests for the new function for address verification.

2) Fallback to local client version

If there is no internet, label versions fail and this breaks
air-gapped setups unless the users pass an explicit version.

To work around that:
- Remain using 'release/stable-x.xx' as the default version.
- On timeout or any error different from status 404 return error
- On status 404 fallback to using the version of the client via
kubeadmVersion()

Add unit tests for kubeadmVersion().

Co-authored-by: Alexander Kanevskiy <alexander.kanevskiy@intel.com>
@neolit123
Copy link
Member Author

@timothysc

Could you squash some of the commits?

squashed two - kubeadm related.
left the auto-generated commit separate and the one for api-machinery separate too.

p.s. i do not agree with the kubernetes commit squash policies. ;)

Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

LGTM , but I'd like @kad todo final lgtm.

@kad
Copy link
Member

kad commented Aug 27, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 27, 2018
@neolit123
Copy link
Member Author

looks like a verify timeout.

/test pull-kubernetes-verify
/test pull-kubernetes-e2e-gce

@timothysc
Copy link
Member

@kubernetes/sig-api-machinery-pr-reviews - this requires your approval.

Copy link
Member

@dixudx dixudx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks @neolit123

@neolit123
Copy link
Member Author

@sttts could you please take a look at the api machinery change for approval.
thank you.

@sttts
Copy link
Contributor

sttts commented Sep 3, 2018

/retest

@sttts
Copy link
Contributor

sttts commented Sep 3, 2018

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dixudx, kad, neolit123, sttts, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 3, 2018
@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 67397, 68019). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@k8s-github-robot k8s-github-robot merged commit d47a513 into kubernetes:master Sep 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet