
Minor usability issue when you try to restore from a scheduled backup #634

Closed
sokada1221 opened this issue Jul 4, 2019 · 16 comments
Labels: type/bug (Something isn't working)

@sokada1221 (Contributor) commented Jul 4, 2019

Bug Report

What version of Kubernetes are you using?

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-19T22:12:47Z", GoVersion:"go1.12.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using?
latest master

$ kubectl exec -n tidb-admin tidb-controller-manager-545d6c854d-xhrzx -- tidb-controller-manager -V
TiDB Operator Version: version.Info{TiDBVersion:"2.1.0", GitVersion:"v1.0.0-beta.3", GitCommit:"6257dfaad68f55f745f20f6f5d19b10bea2b0bea", GitTreeState:"clean", BuildDate:"2019-06-06T09:51:04Z", GoVersion:"go1.12", Compiler:"gc", Platform:"linux/amd64"}

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

$ kubectl get sc
NAME            PROVISIONER                    AGE
ebs-gp2         kubernetes.io/aws-ebs          20h
gp2 (default)   kubernetes.io/aws-ebs          20h
local-storage   kubernetes.io/no-provisioner   20h
$ kubectl get pvc -n shinno-cluster
NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
pd-shinno-cluster-pd-0            Bound    pvc-f3bd7c65-9dd5-11e9-a2dd-0a8e4e0a1ba8   1Gi        RWO            ebs-gp2         20h
pd-shinno-cluster-pd-1            Bound    pvc-f3bf7154-9dd5-11e9-a2dd-0a8e4e0a1ba8   1Gi        RWO            ebs-gp2         20h
pd-shinno-cluster-pd-2            Bound    pvc-f3c186e2-9dd5-11e9-a2dd-0a8e4e0a1ba8   1Gi        RWO            ebs-gp2         20h
shinno-cluster-monitor            Bound    pvc-f2c948ca-9dd5-11e9-81c4-026779400d00   100Gi      RWO            ebs-gp2         20h
shinno-cluster-scheduled-backup   Bound    pvc-ea9f71fb-9e7e-11e9-81c4-026779400d00   100Gi      RWO            ebs-gp2         4m23s
tikv-shinno-cluster-tikv-0        Bound    local-pv-2ced23ed                          366Gi      RWO            local-storage   20h
tikv-shinno-cluster-tikv-1        Bound    local-pv-6935efbf                          366Gi      RWO            local-storage   20h
tikv-shinno-cluster-tikv-2        Bound    local-pv-facd00f4                          366Gi      RWO            local-storage   20h

What's the status of the TiDB cluster pods?

$ kubectl get po -n shinno-cluster -o wide
NAME                                               READY   STATUS      RESTARTS   AGE   IP            NODE                                        NOMINATED NODE
shinno-cluster-discovery-d6c4df7f-m5ht6            1/1     Running     0          20h   10.0.54.124   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-monitor-55f87b9755-djq2h            2/2     Running     0          20h   10.0.58.189   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-pd-0                                1/1     Running     0          20h   10.0.52.76    ip-10-0-52-61.us-east-2.compute.internal    <none>
shinno-cluster-pd-1                                1/1     Running     1          20h   10.0.30.162   ip-10-0-27-179.us-east-2.compute.internal   <none>
shinno-cluster-pd-2                                1/1     Running     0          20h   10.0.46.121   ip-10-0-45-73.us-east-2.compute.internal    <none>
shinno-cluster-scheduled-backup-1562260500-bfwbl   0/1     Completed   0          21m   10.0.51.241   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-scheduled-backup-1562261400-g67bx   0/1     Completed   0          15m   10.0.48.123   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-scheduled-backup-1562262300-cr7hj   0/1     Completed   0          65s   10.0.48.123   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-tidb-0                              1/1     Running     0          20h   10.0.61.93    ip-10-0-53-222.us-east-2.compute.internal   <none>
shinno-cluster-tidb-1                              1/1     Running     0          20h   10.0.43.87    ip-10-0-40-191.us-east-2.compute.internal   <none>
shinno-cluster-tikv-0                              1/1     Running     1          20h   10.0.24.56    ip-10-0-19-9.us-east-2.compute.internal     <none>
shinno-cluster-tikv-1                              1/1     Running     1          20h   10.0.55.241   ip-10-0-48-170.us-east-2.compute.internal   <none>
shinno-cluster-tikv-2                              1/1     Running     0          20h   10.0.46.237   ip-10-0-34-248.us-east-2.compute.internal   <none>

What did you do?

  1. Deploy with terraform apply from deploy/aws
  2. Enable scheduled backup
  3. Try to restore from the S3 bucket according to the doc

What did you expect to see?
Restore to work according to the doc without any error.

What did you see instead?
Seems like we're hitting Kubernetes naming limits: the generated Job name exceeds the 63-character label limit, and its uppercase 'T' also fails DNS-1123 validation. The scheduled backup name should probably be shortened.

$ helm install charts/tidb-backup --namespace=shinno-cluster
Error: release lopsided-magpie failed: Job.batch "shinno-cluster-restore-scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg" is invalid: [metadata.name: Invalid value: "shinno-cluster-restore-scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.labels: Invalid value: "shinno-cluster-restore-scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg": must be no more than 63 characters]
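
For reference, the generated name is 104 characters, well over the 63-character label limit:

$ echo -n shinno-cluster-restore-scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg | wc -c
104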

Workaround
Copy the contents of the scheduled backup to a new folder with a shorter name.
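
A rough sketch of the workaround, assuming a pod that mounts the backup PVC at /data (the pod and target directory names are illustrative):

$ kubectl exec -n shinno-cluster <pod-with-backup-pvc> -- \
    cp -r /data/scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg \
          /data/backup-20190703

The restore chart can then be pointed at the shorter directory name.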

@weekface (Contributor) commented Jul 5, 2019

@shinnosuke-okada We have removed the long pod name from the scheduled backup dir in this PR: #576

https://github.com/pingcap/tidb-operator/pull/576/files#diff-87269cdf14e9a2fb6ca87a9e728057a5R7

Could you use the latest charts/tidb-cluster chart and try again?
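
(With that change, judging from the backup logs later in this thread, the backup directory name is derived from the schedule timestamp alone, e.g. scheduled-backup-20190705-203000, which stays well under the limit.)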

@sokada1221 (Contributor Author) commented:

@weekface Is it possible to make the latest Helm charts available via a dev tag or something? The refactored AWS deployment now takes Helm charts from the pingcap repository, and only the following versions are available:

$ helm search tidb-cluster -l
NAME                    CHART VERSION   APP VERSION     DESCRIPTION                  
pingcap/tidb-cluster    v1.0.0-beta.3                   A Helm chart for TiDB Cluster
pingcap/tidb-cluster    v1.0.0-beta.2                   A Helm chart for TiDB Cluster

@gregwebs (Contributor) commented Jul 5, 2019

You can probably change the Helm repository to point to this GitHub repo or your local folder instead.

It is also always possible to run helm install as a local-exec.

You can also disable the Helm provisioner part of the Terraform config and run the helm install manually for now.
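
For instance, a minimal local-exec sketch (the resource name and chart path here are illustrative, not the repo's actual config):

resource "null_resource" "helm-install-tidb-cluster" {
  provisioner "local-exec" {
    # install the chart from a local checkout instead of the pingcap repo
    command = "helm install ./charts/tidb-cluster --name shinno-cluster --namespace shinno-cluster"
  }
}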

There is definitely a problem with flexibly selecting the proper version. I opened an issue for this: #640

@sokada1221 (Contributor Author) commented:

Thanks @gregwebs!

Yes, I was actually trying to upgrade from a local dir, but it somehow failed. After that, terraform could neither fix nor destroy the deployment, so I'm currently redeploying a fresh cluster. I'll post the result as soon as I have it ready. Thanks.

@gregwebs (Contributor) commented Jul 5, 2019

Sorry, we are currently making Terraform changes that are not backwards-compatible. I am pushing toward proper usage as a module; after that, there will be more options for dealing with Terraform changes.

@sokada1221 (Contributor Author) commented:

Sounds good. No problem - I understand it's a WIP :)

Just a quick update. I deployed with the latest code from the master branch, but the scheduled backup is hitting a segmentation fault. Also, I cannot SSH into the bastion node for some reason.

$ kubectl logs shinno-cluster-scheduled-backup-1562358600-wb57r -n shinno-cluster
+ set -euo pipefail
+ getent hosts shinno-cluster-tidb
+ head
+ awk '{print $1}'
+ host=172.20.67.254
+ echo shinno-cluster-scheduled-backup-1562358600-wb57r
+ awk -F- '{print $(NF-1)}'
+ timestamp=1562358600
+ date -u -d @1562358600 '+%Y%m%d-%H%M%S'
+ backupName=scheduled-backup-20190705-203000
+ backupPath=/data/scheduled-backup-20190705-203000
+ echo 'making dir /data/scheduled-backup-20190705-203000'
making dir /data/scheduled-backup-20190705-203000
+ mkdir -p /data/scheduled-backup-20190705-203000
+ /usr/bin/mysql -h172.20.67.254 -P4000 -uroot -p -Nse 'select variable_value from mysql.tidb where variable_name='"'"'tikv_gc_life_time'"'"';'
Segmentation fault (core dumped)
+ gc_life_time=

@aylei (Contributor) commented Jul 8, 2019

I will take a look at the bastion issue.

@LinuxGit Can you take a look at the segmentation fault listed above?

@LinuxGit (Contributor) commented Jul 9, 2019

@shinnosuke-okada I've submitted a new issue: #643.
When the password in backup-secret is empty, a segmentation fault occurs.
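
A minimal sketch of the kind of guard that avoids the crash (illustrative only; the actual fix is in #649, and the password variable name is assumed):

# only pass -p to mysql when a password is actually set (variable name assumed)
if [ -n "${TIDB_PASSWORD:-}" ]; then
  password_arg="-p${TIDB_PASSWORD}"
else
  password_arg=""
fi
/usr/bin/mysql -h${host} -P4000 -uroot ${password_arg} \
  -Nse 'select variable_value from mysql.tidb where variable_name="tikv_gc_life_time";'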

@weekface (Contributor) commented:

The segmentation fault was fixed by #649. @shinnosuke-okada, can you give it a try?

@sokada1221 (Contributor Author) commented:

Checked out the latest master and merged my changes today. I somehow ended up with an incomplete cluster.

$ kubectl get pods -n shinno-cluster
NAME                                      READY   STATUS    RESTARTS   AGE
shinno-cluster-discovery-d6c4df7f-dgkl9   1/1     Running   0          3m35s
shinno-cluster-monitor-55f87b9755-f528k   2/2     Running   0          3m35s
shinno-cluster-pump-0                     1/1     Running   0          3m35s
shinno-cluster-pump-1                     1/1     Running   0          2m14s
$ helm history shinno-cluster
REVISION	UPDATED                 	STATUS  	CHART           	DESCRIPTION
1       	Thu Jul 11 15:02:09 2019	DEPLOYED	tidb-cluster-dev	Install complete

Will investigate further tomorrow. Thanks.

@aylei (Contributor) commented Jul 12, 2019

The deployment of tidb-operator failed for some reason. Could you please provide the output of "helm ls" and "kubectl get po -n tidb-admin"?

@sokada1221 (Contributor Author) commented:

@aylei Looks like this:

$ helm ls
NAME            REVISION        UPDATED                         STATUS          CHART                   APP VERSION     NAMESPACE     
shinno-cluster  1               Thu Jul 11 15:02:09 2019        DEPLOYED        tidb-cluster-dev                        shinno-cluster
tidb-operator   1               Thu Jul 11 15:02:06 2019        DEPLOYED        tidb-cluster-dev                        tidb-admin    
$ kubectl get po -n tidb-admin
NAME                                       READY   STATUS    RESTARTS   AGE
tidb-operator-discovery-567549cf4f-cnf8n   1/1     Running   0          7h27m
tidb-operator-monitor-865bdf479c-f4xqj     2/2     Running   0          7h27m

@aylei (Contributor) commented Jul 12, 2019

@shinnosuke-okada It turns out the chart used for the tidb-operator release is wrong; it's actually a tidb-cluster chart.
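
For reference, a sketch of what the operator release should look like (attribute values here are illustrative):

resource "helm_release" "tidb-operator" {
  name      = "tidb-operator"
  chart     = "tidb-operator"   # not the tidb-cluster chart
  namespace = "tidb-admin"
}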

@aylei (Contributor) commented Jul 12, 2019

Did your merge change this resource?

resource "helm_release" "tidb-operator" {

@sokada1221 (Contributor Author) commented:

Oh yes, you're right - it was a careless mistake on my end. Sorry, and thank you! Let me quickly verify this issue again.

@sokada1221 (Contributor Author) commented:

Verified with the latest master that the scheduled backup name is short enough. For example:
scheduled-backup-20190712-045000

Thanks everyone!
