API server fails to discover Kubernetes API on Azure #1295

Closed
ams0 opened this issue Jul 12, 2020 · 10 comments · Fixed by #1300

@ams0

ams0 commented Jul 12, 2020

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

On both AKS and aks-engine, the CDI API server fails to connect to the Kubernetes API with:

http: TLS handshake error from 10.240.0.68:47996: no valid subject specified

On AKS, communication to the control plane is mediated by the aks-link tunnel pod:

$> kgpo -A -o wide | grep 10.240.0.68
kube-system   aks-link-d66d9f786-mq92g                              2/2     Running   0          5h51m   10.240.0.68    aks-base-72121939-vmss000002     <none>           <none>

The CDI API server apparently autodiscovers the IP of the aks-link pod as that of the k8s API server, but the TLS handshake fails for 10.240.0.68. I suppose the certificate returned is valid for the full DNS name of the API server, which in my case is kubevirt-k8s-12c7e9-0604dc01.hcp.westeurope.azmk8s.io.

On aks-engine instead, the controller and api pods are running on the controller nodes (which are in the same vnet as the worker nodes) with hostNetwork: true and thus have the same IP as the master node:

$> kgno -l kubernetes.azure.com/role=master -o wide
NAME                    STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-master-30073033-0   Ready    master   32h   v1.18.4   10.240.255.5   <none>        Ubuntu 18.04.4 LTS   5.3.0-1031-azure   containerd://1.3.2+azure
$> kg po -n kube-system -l component=kube-apiserver -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP             NODE                    NOMINATED NODE   READINESS GATES
kube-apiserver-k8s-master-30073033-0   1/1     Running   0          32h   10.240.255.5   k8s-master-30073033-0   <none>           <none>
$> k logs -n cdi cdi-apiserver-c5d685df4-zjhsp
2020/07/12 07:34:39 http: TLS handshake error from 10.240.255.5:62069: no valid subject specified
2020/07/12 07:34:50 http: TLS handshake error from 10.240.255.5:15648: no valid subject specified
2020/07/12 07:34:50 http: TLS handshake error from 10.240.255.5:48778: no valid subject specified

I edited the cdi-apiserver deployment adding

      - args:
        - -v=1
        - -server=kubernetes.default.svc.cluster.local

but the operator reverts my changes (I think due to this line).

At this moment, CDI is unusable on Azure in both managed and unmanaged clusters.

What you expected to happen:

The CDI API server should be able to find the Kubernetes API endpoint correctly.

How to reproduce it (as minimally and precisely as possible):

  1. Create an AKS cluster
  2. Deploy CDI:
# Install CDI
VERSION=v1.20.0
kubectl apply -f https://github.com/kubevirt/containerized-data-importer/releases/download/$VERSION/cdi-operator.yaml
kubectl apply -f https://github.com/kubevirt/containerized-data-importer/releases/download/$VERSION/cdi-cr.yaml

Anything else we need to know?:

Environment:

  • CDI version (use kubectl get deployments cdi-deployment -o yaml): 1.20
  • Kubernetes version (use kubectl version): 1.18.4
  • Cloud provider or hardware configuration: Azure
  • Install tools: kubectl
  • Others:
@DanielQujun

I think the CN field might not match the domain of the URL requested by kube-apiserver.

@mhenriks
Member

cdi-apiserver uses the following CNs in its server cert:
cdi-api
cdi-api.<namespace>
cdi-api.<namespace>.svc

What is azure expecting?

kubernetes.default.svc.cluster.local does not seem correct as that is the CN of kube-apiserver

But maybe azure is expecting cdi-api.<namespace>.svc.cluster.local?
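
To make the name question concrete, here is a minimal Go sketch (not CDI's actual certificate code) that builds a throwaway self-signed certificate carrying the three names listed above as DNS SANs and checks which hostnames it would verify for. The helper name `selfSignedCert` and the namespace value `cdi` are assumptions made for illustration only.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSignedCert is a hypothetical helper: it returns a parsed, self-signed
// certificate whose DNS subject alternative names are dnsNames.
func selfSignedCert(cn string, dnsNames []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	// Template and parent are the same, so the certificate signs itself.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	ns := "cdi" // assumed namespace, for illustration
	cert, err := selfSignedCert("cdi-api", []string{
		"cdi-api",
		"cdi-api." + ns,
		"cdi-api." + ns + ".svc",
	})
	if err != nil {
		panic(err)
	}
	// VerifyHostname returns nil when the host matches one of the SANs
	// and an error describing the mismatch otherwise.
	for _, host := range []string{
		"cdi-api." + ns + ".svc",
		"cdi-api." + ns + ".svc.cluster.local",
	} {
		fmt.Printf("%s: %v\n", host, cert.VerifyHostname(host))
	}
}
```

If a client really dialed the `.svc.cluster.local` form, a certificate with only these three names would be rejected, which is the kind of mismatch being asked about here.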

@ams0
Author

ams0 commented Jul 13, 2020

Ah I see now, indeed, this is the k8s api trying to call the cdi apiserver and failing due to TLS handshake failure. It became clear when I did:

kubectl api-versions -v9

<truncate>
I0713 21:46:21.410598   27658 round_trippers.go:443] GET https://kubenetvir-k8s-12c7e9-1c1fcde3.hcp.westeurope.azmk8s.io:443/apis/upload.cdi.kubevirt.io/v1beta1?timeout=32s 503 Service Unavailable in 206 milliseconds
<truncate>

F0713 21:46:21.882784   27658 helpers.go:115] error: unable to retrieve the complete list of server APIs: upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request, upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request

Deploying the operator and the API in kube-system doesn't help either.

@mhenriks
Member

Take a look at the kube-apiserver log. There may be a message that looks something like this:

E0713 21:05:11.400995       1 controller.go:114] loading OpenAPI spec for "v1beta1.upload.cdi.kubevirt.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error trying to reach service: 'x509: certificate is valid for cdi-api, cdi-api.cdi, cdi-api.cdi.svc, not cdi-api.cdi.svc.cluster.local', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]

It tells you which CNs kube-apiserver was given and which one it expected.

@ams0
Author

ams0 commented Jul 13, 2020

Indeed, just checked with aks-engine, the error is:

E0713 21:33:14.628884       1 available_controller.go:420] v1beta1.upload.cdi.kubevirt.io failed with: failing or missing response from https://10.0.198.152:443/apis/upload.cdi.kubevirt.io/v1beta1: Get https://10.0.198.152:443/apis/upload.cdi.kubevirt.io/v1beta1: remote error: tls: bad certificate

where 10.0.198.152 is the service ClusterIP:

$> kg svc -n cdi cdi-api
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
cdi-api   ClusterIP   10.0.198.152   <none>        443/TCP   5m34s

It still baffles me why this should be the case.

@mhenriks
Member

Yeah, I am not sure why it is using the IP instead of the service name. That makes certificate generation much more difficult. Does the KubeVirt apiserver have a similar issue?

@ams0
Author

ams0 commented Jul 14, 2020

It seems not; the KubeVirt API server reports no error:

{"component":"virt-api","level":"info","msg":"certificate from /etc/virt-api/certificates with common name 'virt-api.kubevirt.pod.cluster.local' retrieved.","pos":"cert-manager.go:182","timestamp":"2020-07-14T09:02:09.856358Z"}

Interestingly, kubectl api-resources only reports errors for the upload.cdi.kubevirt.io group (v1alpha1 and v1beta1):

$> k api-resources --api-group=cdi.kubevirt.io
NAME          SHORTNAMES   APIGROUP          NAMESPACED   KIND
cdiconfigs                 cdi.kubevirt.io   false        CDIConfig
cdis          cdi,cdis     cdi.kubevirt.io   false        CDI
datavolumes   dv,dvs       cdi.kubevirt.io   true         DataVolume
error: unable to retrieve the complete list of server APIs: upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request, upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request

Thanks for the support, by the way! I'm running KubeVirt just fine on AKS/aks-engine using pre-populated PVCs and DVs, but I'd still like to use the uploadproxy for other use cases.

@mhenriks
Member

May I suggest trying CDI version 1.19.0? In 1.20.0, we added the v1beta1 API endpoints. I wonder if that is related.

@ams0
Author

ams0 commented Jul 14, 2020

Still the same:

$> k api-resources --api-group=cdi.kubevirt.io
NAME          SHORTNAMES   APIGROUP          NAMESPACED   KIND
cdiconfigs                 cdi.kubevirt.io   false        CDIConfig
cdis          cdi,cdis     cdi.kubevirt.io   false        CDI
datavolumes   dv,dvs       cdi.kubevirt.io   true         DataVolume
error: unable to retrieve the complete list of server APIs: upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
$> k get po -n cdi cdi-apiserver-67445cb6c4-zs96k -o yaml | grep image
    image: kubevirt/cdi-apiserver:v1.19.0

Happy to give you access to the cluster if you wish to troubleshoot. Ping me on Slack (alessandro on the k8s Slack).

mhenriks added a commit to mhenriks/containerized-data-importer that referenced this issue Jul 14, 2020
see kubevirt#1295

Signed-off-by: Michael Henriksen <mhenriks@redhat.com>
@mhenriks
Member

Issue traced down to the lack of a value for requestheader-allowed-names in extension-apiserver-authentication.
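
For readers hitting the same symptom, the following is a rough Go sketch of how an extension API server could treat that setting. It is not the change that was merged for this issue. It assumes the requestheader-allowed-names value arrives as a JSON-encoded list of strings, and it follows the documented kube-apiserver behaviour that an empty list allows any client certificate already validated by the request-header CA. The function names and the sample common names (`front-proxy-client`, `some-proxy`) are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseAllowedNames decodes the (assumed) JSON-encoded list stored under the
// requestheader-allowed-names key. A missing or empty value yields no names.
func parseAllowedNames(raw string) ([]string, error) {
	if raw == "" {
		return nil, nil
	}
	var names []string
	if err := json.Unmarshal([]byte(raw), &names); err != nil {
		return nil, err
	}
	return names, nil
}

// clientNameAllowed reports whether a client certificate with common name cn
// is acceptable. An empty allow-list means "any name", on the assumption that
// the certificate chain has already been verified against the CA.
func clientNameAllowed(cn string, allowed []string) bool {
	if len(allowed) == 0 {
		return true
	}
	for _, name := range allowed {
		if name == cn {
			return true
		}
	}
	return false
}

func main() {
	// Sample values: a populated list, an empty list, and a missing value.
	for _, raw := range []string{`["front-proxy-client"]`, `[]`, ""} {
		names, err := parseAllowedNames(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q -> allowed=%v\n", raw, clientNameAllowed("some-proxy", names))
	}
}
```

A server that instead treats an empty list as "no name is valid" would reject every caller on a cluster that leaves the value unset, which is consistent with the handshake errors reported above.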

Thanks for the help @ams0!!
