
Calico v3 upgrade #5102

Merged: 4 commits into kubernetes:master, Nov 13, 2018
Conversation

@tmjd (Contributor) commented May 3, 2018:

Addresses #5101

I have been able to install with Kops 1.9.0 and then, using my changes here, upgrade to Calico v3.

Note: This PR currently uses unreleased changes from calico/upgrade that exist as projectcalico/calico-upgrade#35. Before this PR is merged, I expect a release to be made based on those changes, and the images named tmjd/calico-upgrade will then be replaced.

TODO:

  • Add docs for using the newly added field
  • Validate, if possible, that the cluster is running an appropriate etcd version when Calico APIVersion v3 is specified.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 3, 2018
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 8, 2018
@tmjd tmjd changed the title [WIP] Calico v3 upgrade Calico v3 upgrade May 8, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 8, 2018
@tmjd (Contributor, Author) commented May 9, 2018:

/assign @geojaz

@chrislovecnm (Contributor):

/assign @blakebarnett @caseydavenport

We may need to hold this till we get the etcd2 -> etcd3 upgrade working. @blakebarnett thoughts?

@blakebarnett:

As long as it only applies if the cluster is already running etcd3, it should be fine. It might be confusing for those still on etcd2, but 1.10 is supposed to support upgrading, right?
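For reference, the etcd version in play is whatever the kops cluster spec declares; a minimal excerpt (a sketch only: field names assumed from the kops API of this era, values purely illustrative):

```yaml
# kops cluster spec excerpt (illustrative, not taken from this PR)
etcdClusters:
- etcdMembers:
  - instanceGroup: master-us-east-1a
    name: a
  name: main
  version: 3.2.24   # an etcd3 release; calico apiVersion: v3 assumes this
```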

You will need to change that block, and add an additional field, to look like this:

```yaml
networking:
  calico:
    apiVersion: v3
```
@caseydavenport (Member) commented May 14, 2018:

should this just be majorVersion?

@tmjd (Contributor, Author) replied:

Couldn't Calico release a v4.x and still be on the same API version? Then the field would be mis-named if we upgraded the manifests in Kops. Or it could be preferable to make that upgrade require an explicit change.

@caseydavenport (Member) replied:

Let's leave it as is. If anything calicoAPIVersion might be less ambiguous, if you're up for making that change.

@caseydavenport (Member) commented Sep 24, 2018:

Looking at this now, it might be best represented as majorVersion? @tmjd

@caseydavenport (Member):

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 31, 2018
@caseydavenport (Member):

@tmjd this LGTM, though I'd suggest using the recently released Calico v3.1.3.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2018
@justinsb justinsb added this to the 1.10 milestone Jun 2, 2018
@caseydavenport (Member):

/assign @geojaz

@maver1ck:

@tmjd Do you also plan to upgrade the Canal version?

@tmjd (Contributor, Author) commented Jul 16, 2018:

@maver1ck I didn't have any concrete plans, though I might get to it eventually. I believe the upgrade for Canal should be easier than for Calico, because the Canal installation uses Kubernetes as the datastore (instead of etcd), and that upgrade path is much simpler. I'd be happy to review a PR for Canal if one is submitted.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2018
@justinsb justinsb modified the milestones: 1.10, 1.11 Jul 20, 2018
@justinsb (Member):

I think really solid etcd upgrades are going to be in 1.11, so moving this to 1.11

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2018
@caseydavenport (Member):

/test pull-kops-e2e-kubernetes-aws

@caseydavenport (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2018
@caseydavenport (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2018
@caseydavenport (Member):

@geojaz could you PTAL? I'd very much like to get this merged.

@caseydavenport (Member):

Also @justinsb

@geojaz (Member) left a review:

/lgtm

thanks to those who helped push this to completion @tmjd and @caseydavenport

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caseydavenport, geojaz, tmjd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2018
@k8s-ci-robot k8s-ci-robot merged commit bac89b8 into kubernetes:master Nov 13, 2018
@tmjd tmjd deleted the calico-v3-upgrade branch November 14, 2018 16:32
@ottoyiu (Contributor) commented Jan 16, 2019:

We are currently running Kubernetes 1.10 using etcd2 and Calico 2.6.8 in production, and are looking to upgrade to etcd3 when we switch over to Kubernetes 1.11, since kops 1.11 has been released.

@caseydavenport , @tmjd How do we do the calico-upgrade data migration if we no longer have an etcd2 cluster to rely on, but the schema has not been converted from v1 to v3 yet? calico-upgrade does not seem to allow you to specify an etcd3 cluster with the v1 apiVersion as source. Is running a temporary etcd2 cluster using a backup the only way we can achieve this?

The steps we're trying to do as part of our upgrade:

  1. upgrade etcd2 to etcd3 using kops, etcd2 will no longer exist after this step
  2. calico-upgrade to do schema migration
  3. switch to calico 3.3.1.

Your expertise and advice is highly appreciated. Thank you!

@sstarcher (Contributor):

@ottoyiu I believe I'm trying to do the same thing you are - #6261

We have etcd3 already running, but we are running calico 2 and want to upgrade to calico 3.

@ottoyiu (Contributor) commented Jan 16, 2019:

@sstarcher glad I'm not the only one with this problem.

The approach I came up with is to download a backup that etcd-manager made of the etcd2 cluster prior to the upgrade, and run my own local etcd cluster:

docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -v /home/oyiu/Downloads/member:/var/etcd/data -p 4001:4001 -p 2380:2380 -p 2379:2379 --name etcd quay.io/coreos/etcd:v2.3.8 -name etcd0 -advertise-client-urls http://${HostIP}:2379,http://${HostIP}:4001 -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 -initial-advertise-peer-urls http://${HostIP}:2380 -listen-peer-urls http://0.0.0.0:2380 -initial-cluster-token etcd-cluster-1 -initial-cluster etcd0=http://${HostIP}:2380 -data-dir /var/etcd/data --force-new-cluster
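To sanity-check that the restored local cluster actually serves the old data before pointing calico-upgrade at it, something along these lines should work (a sketch; assumes the etcd v2 etcdctl and that ${HostIP} is set as above):

```
# list the top-level Calico keys via the etcd v2 API
etcdctl --endpoints=http://${HostIP}:2379 ls /calico
```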

let's see if I can get calico-upgrade to run now.

@ottoyiu (Contributor) commented Jan 16, 2019:

seems to work with my local etcd2 cluster:

 ./calico-upgrade dry-run --output-dir=temp --apiconfigv1 calico-upgrade-v1.yaml --apiconfigv3 calico-upgrade-v3.yaml --ignore-v3-data
Preparing reports directory
 * creating report directory if it does not exist
 * validating permissions and removing old reports
Checking Calico version is suitable for migration
 * determined Calico version of: v2.6.7
 * the v1 API data can be migrated to the v3 API
Validating conversion of v1 data to v3
 * handling FelixConfiguration (global) resource
 * handling ClusterInformation (global) resource
 * handling FelixConfiguration (per-node) resources
 * handling BGPConfiguration (global) resource
 * handling Node resources
 * handling BGPPeer (global) resources
 * handling BGPPeer (node) resources
 * handling HostEndpoint resources
 * handling IPPool resources
 * handling GlobalNetworkPolicy resources
 * handling Profile resources
 * handling WorkloadEndpoint resources
 * data conversion successful
Data conversion validated successfully
Validating the v3 datastore
 * the v3 datastore is not empty

-------------------------------------------------------------------------------

Successfully validated v1 to v3 conversion.
See report(s) below for details of the conversion.
Reports:
- name conversion: temp/convertednames

then I had to empty the etcd3 cluster of old Calico keys:

> ETCDCTL_API=3 etcdctl --endpoints=http://xxx:4001 del /calico/ --prefix
> ./calico-upgrade start --output-dir=temp --apiconfigv1 calico-upgrade-v1.yaml --apiconfigv3 calico-upgrade-v3.yaml
Preparing reports directory
 * creating report directory if it does not exist
 * validating permissions and removing old reports
Checking Calico version is suitable for migration
 * determined Calico version of: v2.6.7
 * the v1 API data can be migrated to the v3 API
Validating conversion of v1 data to v3
 * handling FelixConfiguration (global) resource
 * handling ClusterInformation (global) resource
 * handling FelixConfiguration (per-node) resources
 * handling BGPConfiguration (global) resource
 * handling Node resources
 * handling BGPPeer (global) resources
 * handling BGPPeer (node) resources
 * handling HostEndpoint resources
 * handling IPPool resources
 * handling GlobalNetworkPolicy resources
 * handling Profile resources
 * handling WorkloadEndpoint resources
 * data conversion successful
Data conversion validated successfully
Validating the v3 datastore
 * the v3 datastore is empty

-------------------------------------------------------------------------------

Successfully validated v1 to v3 conversion.

You are about to start the migration of Calico v1 data format to Calico v3 data
format. During this time and until the upgrade is completed Calico networking
will be paused - which means no new Calico networked endpoints can be created.
No Calico configuration should be modified using calicoctl during this time.

Type "yes" to proceed (any other input cancels): yes
Pausing Calico networking
 * successfully paused Calico networking in the v1 configuration
Calico networking is now paused - waiting for 15s
Querying current v1 snapshot and converting to v3
 * handling FelixConfiguration (global) resource
 * handling ClusterInformation (global) resource
 * handling FelixConfiguration (per-node) resources
 * handling BGPConfiguration (global) resource
 * handling Node resources
 * handling BGPPeer (global) resources
 * handling BGPPeer (node) resources
 * handling HostEndpoint resources
 * handling IPPool resources
 * handling GlobalNetworkPolicy resources
 * handling Profile resources
 * handling WorkloadEndpoint resources
 * data converted successfully
Storing v3 data
 * Storing resources in v3 format
 * success: resources stored in v3 datastore
Migrating IPAM data
 * listing and converting IPAM allocation blocks
 * listing and converting IPAM affinity blocks
 * listing IPAM handles
 * storing IPAM data in v3 format
 * IPAM data migrated successfully
Data migration from v1 to v3 successful
 * check the output for details of the migrated resources
 * continue by upgrading your calico/node versions to Calico v3.x

-------------------------------------------------------------------------------

Successfully migrated Calico v1 data to v3 format.
Follow the detailed upgrade instructions available in the release documentation
to complete the upgrade. This includes:
 * upgrading your calico/node instances and orchestrator plugins (e.g. CNI) to
   the required v3.x release
 * running 'calico-upgrade complete' to complete the upgrade and resume Calico
   networking

See report(s) below for details of the migrated data.
Reports:
- name conversion: temp/convertednames

@blakebarnett:

@ottoyiu awesome, I was getting ready to start this process as well; thanks for posting your work. Would this conversion be best run on etcd2 before upgrading to etcd3, or would the changes that happen in the meantime throw it off too much? I worry about doing it on a busy production cluster like this...

@sstarcher (Contributor):

@blakebarnett let me know what you end up doing. I might need to spend some time figuring out the least disruptive way to upgrade from Calico v2 on etcd3 to Calico v3.

@tmjd (Contributor, Author) commented Jan 16, 2019:

I don't know what the kops etcd2 to etcd3 conversion does specifically. Does it retain the data that existed in etcd2? In case you didn't know, the etcd v3 server supports both the v2 and v3 etcd APIs, and the data and users of the two APIs are completely separate. So it is possible that the Calico data that lived in the etcd v2 store still exists after the kops upgrade (which updates the etcd server from v2 to v3), and that it is then just a matter of following the process described in the docs: https://github.com/kubernetes/kops/blob/master/docs/calico-v3.md#upgrading-an-existing-cluster.
When I added the upgrade and wrote the docs, the etcd v2 to v3 upgrade was not yet available, so I was not able to test that.

@ottoyiu you said

> calico-upgrade does not seem to allow you to specify an etcd3 cluster with the v1 apiVersion as source

I guess it is perhaps not obvious from the docs, but an etcd v3 server can be configured as the v1 apiVersion source, as described at https://docs.projectcalico.org/v3.1/getting-started/kubernetes/upgrade/setup#configuring-calico-upgrade-to-connect-to-the-etcdv2-datastore. The datastoreType of etcdv2 actually refers to the etcd API version, so it will work against an etcd v3 server (assuming the v2 API hasn't been disabled).
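For example, a v1-side apiconfig along these lines (a sketch only: the endpoint is a placeholder, and the layout assumes the Calico v2.6 calicoctl config format) would speak the etcdv2 API to an etcd v3 server:

```yaml
# calico-upgrade-v1.yaml (sketch; endpoint is a placeholder)
apiVersion: v1
kind: calicoApiConfig
metadata:
spec:
  datastoreType: etcdv2   # the etcd *API* version, not the server version
  etcdEndpoints: http://etcd.example.internal:2379
```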

@ottoyiu (Contributor) commented Jan 16, 2019:

@tmjd it's good to know that datastoreType is merely the API version of etcd; I will try running the migration on an etcd3 cluster with compatibility mode enabled and will report back.
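If the v2 API turns out to be disabled on the etcd version we run, it presumably has to be switched back on first; a sketch, assuming etcd's --enable-v2 flag applies to the release in use (newer etcd releases default it off, and later ones drop v2 emulation entirely, so check the version first):

```
# expose the legacy v2 API so calico-upgrade's etcdv2 datastoreType can reach the data
etcd --enable-v2=true ...
```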
