
Integrate with Calico #709

Closed
caseydavenport opened this issue Oct 20, 2016 · 29 comments

@caseydavenport
Member

caseydavenport commented Oct 20, 2016

CNI support has been added, hooray! https://github.com/kubernetes/kops/pull/621/files

With the above merged, it should be easy to add Calico.

This issue is to track the testing / documentation for Calico + kops.

@chrislovecnm
Contributor

@caseydavenport This issue is to track the testing & documentation for Calico + kops. 😀

@caseydavenport
Member Author

@chrislovecnm Yes, that's correct :) Updated.

@chrislovecnm
Contributor

@caseydavenport I am assigning this to you.

@chrislovecnm
Contributor

@caseydavenport you can coordinate with @razic on this. He is dropping in support for Weave; here is the related issue as well: #777.

@caseydavenport
Member Author

@chrislovecnm @razic happy to coordinate.

I'll also introduce @heschlie.

@chrislovecnm
Contributor

@caseydavenport who do you want this assigned to?

@Buzer

Buzer commented Nov 5, 2016

Currently the following was required when installing Calico on a multi-AZ kops cluster created with cni networking:

  1. Modify calico.yaml as follows

    1.1. Add annotations (this should be fixed by projectcalico/calico#163, "Allow scheduling on tainted masters (e.g. for kops)")
    1.2. Change the latest tags to actual versions
    1.3. Change etcd_endpoints to the list of etcd nodes (http://etcd-$AZ.internal.$NAME:4001)

  2. Look up a master's IP in the AWS console

  3. Copy the modified file to the master

  4. SSH to the master

    4.1. Run docker run --rm --net=host -e ETCD_ENDPOINTS=http://127.0.0.1:4001 calico/ctl pool add $NETWORK/$SIZE --ipip --nat-outgoing
    4.2. Apply the modified calico.yaml

I think 4.1 is likely the only part that's somewhat hard to do with just #777, assuming the manifests are templates and some variables (in this case the etcd endpoints and user-defined values like the pool CIDR) are provided to them. One possible way to do it purely within the manifest would be to run the pool configuration as a Job, have it write a key to etcd, and modify the Calico pod so that it waits until that key exists before starting Calico, but that's a pretty hacky solution.
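
For reference, the manual flow above amounts to roughly the following shell sketch ($NAME, $AZ, $NETWORK/$SIZE, $MASTER_IP, and the pinned image tag are placeholders, and the sed edits assume the stock hosted calico.yaml layout; the scheduling annotations of 1.1 are omitted, see projectcalico/calico#163):

# 1.2 / 1.3: pin image tags and point the manifest at the kops etcd nodes
sed -i 's|:latest|:v1.0.0|g' calico.yaml        # example tag only, not authoritative
sed -i "s|etcd_endpoints:.*|etcd_endpoints: \"http://etcd-$AZ.internal.$NAME:4001\"|" calico.yaml

# 2 / 3: find a master's IP in the AWS console and copy the manifest over
scp calico.yaml admin@$MASTER_IP:~/

# 4: on the master, create the IP pool and apply the manifest
ssh admin@$MASTER_IP
docker run --rm --net=host -e ETCD_ENDPOINTS=http://127.0.0.1:4001 \
  calico/ctl pool add $NETWORK/$SIZE --ipip --nat-outgoing
kubectl apply -f calico.yaml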

@caseydavenport
Member Author

@Buzer thanks for the detailed steps :)

Sounds like we need to get projectcalico/calico#163 merged and into a release to address 1.1 and 1.2 above.

For 1.3, etcd_endpoints will likely need to be templated, or we could set up a Kubernetes Service that fronts the etcd cluster with a well-known clusterIP, similar to kube-dns?

For 4.1, we do something similar for kubeadm - we tell Calico not to create a pool by default and use a Job to configure Calico. This seems to work nicely once it has the right annotations to run as a critical pod and to be allowed on the master.

For 4.2, I suspect we can do this as part of the install in some way so users don't need to SSH in manually?
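
To sketch the clusterIP idea for 1.3: a selector-less Service plus a manually maintained Endpoints object could give the etcd cluster a stable in-cluster address. This is only a sketch; the name calico-etcd, the clusterIP value, and the node IP are made up, and the fixed IP has to fall inside the cluster's service CIDR:

apiVersion: v1
kind: Service
metadata:
  name: calico-etcd            # hypothetical name
  namespace: kube-system
spec:
  clusterIP: 100.64.0.20       # hypothetical well-known IP inside the service CIDR
  ports:
  - port: 4001
    targetPort: 4001
---
# no selector on the Service, so the Endpoints are maintained by hand
apiVersion: v1
kind: Endpoints
metadata:
  name: calico-etcd
  namespace: kube-system
subsets:
- addresses:
  - ip: 172.20.0.10            # placeholder etcd/master node IP
  ports:
  - port: 4001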

@caseydavenport
Member Author

@chrislovecnm could you assign @heschlie? Thanks!

@chrislovecnm
Contributor

He is not a member of the kubernetes org, so alas I cannot. Will keep it assigned to you.

@caseydavenport
Member Author

Ah, right. Fine to keep assigned to me!

@chrislovecnm
Contributor

SGTM

@Buzer

Buzer commented Nov 6, 2016

I'm not too familiar with etcd's (or Kubernetes Services') internals, but how well would it deal with various error situations? And does etcd allow writing to any node (i.e. do non-leader nodes internally forward requests they cannot handle to the current leader, or is it the clients' responsibility)?

The Job approach sounds good. I was considering it initially, but couldn't find a way to disable automatic pool creation with a quick look.

I assume 4.2. will be handled by #777?

@stonith

stonith commented Nov 7, 2016

I've been trying to get this working and found that the latest (master) tag referenced in the k8s hosted calico.yaml doesn't seem to work for me; it can't route to the internet, but other versions do work, such as v1.0.0-beta-4-gfd4cf3c.

Also, with the v1 changes, updating the pool is slightly different:

sudo cat << EOF | calicoctl replace -f -
- apiVersion: v1
  kind: ipPool
  metadata:
    cidr: 192.168.0.0/16
  spec:
    ipip:
      enabled: true
    nat-outgoing: true
EOF

@chrislovecnm
Contributor

@caseydavenport I need a tested method. Can you reach out to me?

@chrislovecnm
Contributor

Another test failed.

  1m        1m      1   {default-scheduler }                            Normal      Scheduled   Successfully assigned dns-controller-844861676-aqetd to ip-172-20-157-53.us-west-2.compute.internal
  1m        1s      86  {kubelet ip-172-20-157-53.us-west-2.compute.internal}           Warning     FailedSync  Error syncing pod, skipping: failed to "SetupNetwork" for "dns-controller-844861676-aqetd_kube-system" with SetupNetworkError: "Failed to setup network for pod \"dns-controller-844861676-aqetd_kube-system(2fdf1feb-a910-11e6-9022-029c1f6e6435)\" using network plugins \"cni\": nodes \"ip-172-20-157-53\" not found; Skipping pod"

Created on a full private VPC cluster, using kops HEAD and the required nodeup.

https://raw.githubusercontent.com/projectcalico/calico/master/master/getting-started/kubernetes/installation/hosted/k8s-backend/calico.yaml

Install command

@caseydavenport
Member Author

@chrislovecnm will reach out on Monday. That looks like the wrong manifest.

@heschlie has been out sick.

@heschlie
Contributor

First, I'm currently using the following kops version, if it makes a difference:

$ kops version
Version git-18879f7

@chrislovecnm Here is the manifest I have been trying to get deployed; @caseydavenport might want to review it to make sure it is sane:

https://gist.github.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26

The Job in the above manifest never gets to run, but it is necessary, so for now I run the container manually (step 4 below) to get calicoctl to set up the networking.

I am deploying the cluster with the following command:

kops create cluster --zones us-west-2c $NAME --networking cni --master-size m4.large

I am using the m4.large at the suggestion of #728

Once the cluster master is online I need to ssh to it and do a couple things.

  1. scp calico.yaml admin@$MASTER_IP:/admin/home/
  2. ssh admin@$MASTER_IP to get the internal IP
  3. Add api.$NAME and api.internal.$NAME to Route53
  4. sudo docker run --rm --net=host -e ETCD_ENDPOINTS=http://127.0.0.1:4001 calico/ctl:v0.22.0 pool add 172.20.96.0/19 --ipip --nat-outgoing
  5. kubectl apply -f calico.yaml

I have not been able to get a working deployment. My main issue when running kops with --networking cni is that it doesn't seem to create the necessary DNS entries in Route53, and thus the nodes cannot connect to the master.
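
For reference, adding those entries by hand (step 3 above) can be done with the AWS CLI; a sketch, where the hosted zone ID, record name, and IP are placeholders for your cluster:

aws route53 change-resource-record-sets --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{
    "Name":"api.internal.k8s.testing.example.com","Type":"A","TTL":60,
    "ResourceRecords":[{"Value":"172.20.157.53"}]}}]}'
# repeat for api.$NAME, pointing at the master's public IP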

I've tried adding the DNS entries by hand and it gets the process further along, but Docker seems to be having trouble pulling all of the images, the master is struggling to create the kubedns pods, and I'm seeing this when I describe those pods:

container "kubedns" is unhealthy, it will be killed and re-created

Even after getting the DNS entries (api.$NAME, api.internal.$NAME) in Route53 the kube-dns-v20 pods were still not coming online. kubedns seems to be trying to hit the API at 100.64.0.1:443 but is not able to reach it:

$ kubectl logs -n kube-system kube-dns-v20-3531996453-hkrty kubedns
I1113 18:08:34.945961       1 server.go:94] Using https://100.64.0.1:443 for kubernetes master, kubernetes API: <nil>
I1113 18:08:34.946567       1 server.go:99] v1.5.0-alpha.0.1651+7dcae5edd84f06-dirty
I1113 18:08:34.946588       1 server.go:101] FLAG: --alsologtostderr="false"
I1113 18:08:34.946643       1 server.go:101] FLAG: --dns-port="10053"
I1113 18:08:34.946650       1 server.go:101] FLAG: --domain="cluster.local."
I1113 18:08:34.946667       1 server.go:101] FLAG: --federations=""
I1113 18:08:34.946673       1 server.go:101] FLAG: --healthz-port="8081"
I1113 18:08:34.946676       1 server.go:101] FLAG: --kube-master-url=""
I1113 18:08:34.946680       1 server.go:101] FLAG: --kubecfg-file=""
I1113 18:08:34.946744       1 server.go:101] FLAG: --log-backtrace-at=":0"
I1113 18:08:34.946752       1 server.go:101] FLAG: --log-dir=""
I1113 18:08:34.946771       1 server.go:101] FLAG: --log-flush-frequency="5s"
I1113 18:08:34.946811       1 server.go:101] FLAG: --logtostderr="true"
I1113 18:08:34.946823       1 server.go:101] FLAG: --stderrthreshold="2"
I1113 18:08:34.946827       1 server.go:101] FLAG: --v="0"
I1113 18:08:34.946832       1 server.go:101] FLAG: --version="false"
I1113 18:08:34.946848       1 server.go:101] FLAG: --vmodule=""
I1113 18:08:34.946928       1 server.go:138] Starting SkyDNS server. Listening on port:10053
I1113 18:08:34.946977       1 server.go:145] skydns: metrics enabled on : /metrics:
I1113 18:08:34.946993       1 dns.go:166] Waiting for service: default/kubernetes
I1113 18:08:34.949742       1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I1113 18:08:34.949813       1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I1113 18:09:04.948943       1 dns.go:172] Ignoring error while waiting for service default/kubernetes: Get https://100.64.0.1:443/api/v1/namespaces/default/services/kubernetes: dial tcp 100.64.0.1:443: i/o timeout. Sleeping 1s before retrying.
E1113 18:09:04.949791       1 reflector.go:214] pkg/dns/dns.go:155: Failed to list *api.Service: Get https://100.64.0.1:443/api/v1/services?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E1113 18:09:04.949862       1 reflector.go:214] pkg/dns/dns.go:154: Failed to list *api.Endpoints: Get https://100.64.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout

I'm still learning the intricacies of Kubernetes, but this doesn't look right to me; maybe someone can chime in with some info so I can understand what I might be doing wrong, or what I need to do to get this operational.

The kube-dns pods seem to be the last remaining issue at the moment, but I can't establish whether Calico is working until they are up, so there could still be more issues.

@Buzer

Buzer commented Nov 13, 2016

The DNS entries are created by the dns-controller, and it should start after the CNI is configured (i.e. after Calico starts). Judging from your kubedns logs, you likely have something similar in the dns-controller logs (or errors about accessing Route53). You can also try to exec into the dns-controller, if it has started, to see whether you have network connectivity there.

A few things you might want to check (issues I have run into):

  • Check that /etc/cni/net.d/calico-kubeconfig exists and that the CNI network config's kubeconfig setting points to an existing file
  • Confirm that there are NAT rules in place (iptables-save | grep felix-masq-ipam-pools)
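
Roughly, on the master (a sketch; the exact CNI config file name can differ between installs):

ls /etc/cni/net.d/
# the Calico CNI network config should reference a kubeconfig file that actually exists
grep -r kubeconfig /etc/cni/net.d/
sudo iptables-save | grep felix-masq-ipam-pools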

@chrislovecnm
Contributor

Does Calico not allow for a full manifest install? Do I have to run another Docker container? How is that managed by k8s?

@caseydavenport
Member Author

Calico can be installed entirely through a k8s manifest - it's basically just a DaemonSet and a ReplicaSet. An example of that can be found here.

It's also possible to use Jobs to provide arbitrary configuration options to Calico, and a Secret for any certificates you might want to provide (e.g. for etcd).

As was hinted at in a few places above, the only things we should need to do are:

  • Set ETCD_ENDPOINTS in the linked manifest to point at the etcd node(s).
  • Configure the Calico IP pool used for allocating pod addresses (can be done via a Job - see the sketch below).
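
A rough sketch of such a Job, reusing the calico/ctl image and pool command already shown in this thread (the image tag, ETCD_ENDPOINTS value, and pool CIDR are placeholders, and whatever tolerations/critical-pod annotations are discussed above would still need to be added):

apiVersion: batch/v1
kind: Job
metadata:
  name: configure-calico
  namespace: kube-system
spec:
  template:
    metadata:
      name: configure-calico
    spec:
      hostNetwork: true               # run before pod networking is up
      restartPolicy: OnFailure
      containers:
      - name: configure-calico
        image: calico/ctl:v0.22.0     # version used earlier in this thread; adjust as needed
        env:
        - name: ETCD_ENDPOINTS
          value: "http://etcd-us-west-2c.internal.k8s.testing.example.com:4001"   # placeholder
        command: ["calicoctl"]
        args: ["pool", "add", "172.20.96.0/19", "--ipip", "--nat-outgoing"]        # placeholder CIDR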

I'll sync with @heschlie on this early tomorrow and let you know @chrislovecnm.

@heschlie
Contributor

@Buzer It looks like the dns-controller is starting before I get a chance to deploy the calico.yaml manifest. I've checked that the proper config files exist in /etc/cni/net.d/ and appear to be set up correctly, that the CNI binaries are in /opt/cni/bin, and that the NAT rules are in place (though the Job in the yaml still isn't running, so I still need to run calicoctl via docker run), but the dns-controller never seems to be able to talk to the API.

I also tried restarting Docker and the kubelet with no luck, and tried rebooting the master as well, just in case that got it to come online.

It seems as though the CNI plugin just isn't being used on the containers. I also double checked that the kubelet was set to use CNI, and it appears to be as well.

Will sync with @caseydavenport tomorrow AM; I just wanted to confirm that I still couldn't get it up and running even with the extra info.

@heschlie
Contributor

@caseydavenport and I found that the pods necessary to get CNI up and running weren't able to run on the tainted master. We nailed down the proper annotations to add and applied them to the calico-node DaemonSet, the configure-calico Job, and the calico-policy-controller ReplicaSet. After that the cluster came up properly, and policies were being enforced appropriately.

Here is the manifest that works to get the cluster online; the etcd_endpoints will need to be changed before deploying it.

https://gist.github.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26

Here is the process to bring up the cluster:

  • Deploy the cluster with kops create cluster --zones $ZONES --master-size m4.large --networking cni $CLUSTER_NAME
  • Wait for the master to come online and get its IP from the AWS console
  • SSH to the master: ssh admin@$MASTER_IP
  • Download the calico.yaml manifest (TODO: get a more permanent location for this manifest): wget https://gist.githubusercontent.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26/raw/a9645c9f310b51837ff5a2769a66a2b1a3c24342/calico.yaml
  • Change etcd_endpoints in the ConfigMap to match your endpoint(s), typically something like http://etcd-$ZONE.internal.$NAME:4001, where $ZONE is one of the zones you picked and $NAME is the cluster name. Repeat for each zone, e.g.
etcd_endpoints: "http://etcd-us-west-2c.internal.k8s.testing.example.com:4001,http://etcd-us-east-2c.internal.k8s.testing.example.com:4001"
  • Deploy the manifest: kubectl apply -f calico.yaml (the edit-and-apply steps are also sketched as a script after this list)
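
The download/edit/apply steps can be scripted on the master; a rough sketch (the endpoint list is a placeholder, and the sed assumes etcd_endpoints sits on a single line in the ConfigMap):

ENDPOINTS="http://etcd-us-west-2c.internal.k8s.testing.example.com:4001"
wget -O calico.yaml https://gist.githubusercontent.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26/raw/a9645c9f310b51837ff5a2769a66a2b1a3c24342/calico.yaml
sed -i "s|etcd_endpoints:.*|etcd_endpoints: \"$ENDPOINTS\"|" calico.yaml
kubectl apply -f calico.yaml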

There are still two lingering problems one can see above:

  1. etcd_endpoints has to be populated manually. As @caseydavenport mentioned above, could kops template this somehow?
  2. The need to manually SSH into the master to deploy the manifest.

I know there is an issue about making deploying a CNI provider as simple as --networking calico, which could end up resolving both of those problems.

@kris-nova @chrislovecnm I'd like to leave those last two steps in your hands if that is alright.

@stonith

stonith commented Nov 15, 2016

This is what I've done to get kops working with Calico:

  1. create cluster: kops create cluster --cloud=aws --master-zones=<master_zone> --zones=<zone_A>,<zone_B> --master-size=t2.large --ssh-public-key=~/.ssh/id_rsa.pub --kubernetes-version=1.4.5 --networking=cni <cluster_name> --yes
  2. ssh to master, wget https://raw.githubusercontent.com/projectcalico/calico/master/v2.0/getting-started/kubernetes/installation/hosted/calico.yaml
  3. edit etcd_endpoints to the master's private IP in calico.yaml
  4. edit cidr to 100.64.0.0/10 and add ipip: enabled: true in calico.yaml (see the sketch after this list)
  5. manually add the api.internal.<cluster_name> entry to R53 pointing to the master's private IP (because of projectcalico/calico#163, "Allow scheduling on tainted masters (e.g. for kops)")
  6. kubectl apply -f calico.yaml
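
For step 4, the edited pool ends up looking roughly like the v1 ipPool example shown earlier in this thread; a sketch only, since whether this lives in calico.yaml or is applied separately with calicoctl depends on the manifest you start from:

- apiVersion: v1
  kind: ipPool
  metadata:
    cidr: 100.64.0.0/10
  spec:
    ipip:
      enabled: true
    nat-outgoing: true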

Network policies work, but the kube-dns service doesn't seem to. It defaults to 100.64.0.10 in kops, but it doesn't answer DNS traffic from pods.

UPDATE: Something changed between when I originally tested this workflow and yesterday, but now the kube-dns service is reachable for me.
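
For anyone hitting the same thing, a quick way to check the kube-dns service from inside the pod network (a sketch; 100.64.0.10 is the kops default mentioned above, and busybox is just a convenient test image):

kubectl run -i -t dns-test --image=busybox --restart=Never -- nslookup kubernetes.default 100.64.0.10
kubectl delete pod dns-test   # clean up afterwards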

@jayv

jayv commented Nov 16, 2016

Supposedly IPIP is not required on AWS and is bad for performance; do this for all instances:
aws ec2 modify-instance-attribute --instance-id $INSTANCE_ID --source-dest-check "{\"Value\": false}"

See section 3 on: http://docs.projectcalico.org/v1.5/getting-started/kubernetes/installation/aws
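
To apply that to every instance in a kops cluster, something like the following loop works (a sketch; the KubernetesCluster tag filter is an assumption about how kops tags its instances, so adjust it to match your cluster):

for id in $(aws ec2 describe-instances \
    --filters "Name=tag:KubernetesCluster,Values=$CLUSTER_NAME" \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  aws ec2 modify-instance-attribute --instance-id "$id" --source-dest-check '{"Value": false}'
done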

@stonith

stonith commented Nov 16, 2016

@jayv IPIP is required for cross-AZ communication, that example is in a single AZ.

@jayv

jayv commented Nov 16, 2016

Ah bummer, why can't we have nice things :(

@caseydavenport
Member Author

@stonith @jayv yeah, it is a bummer.

Ideally we'd use IPIP only across AZ boundaries. See this issue: https://github.com/projectcalico/calico-containers/issues/1310

@chrislovecnm
Contributor

Closing as the PR is completed
