
Minor usability issue when you try to restore from a scheduled backup #634

Closed
sokada1221 opened this issue Jul 4, 2019 · 16 comments
Labels: type/bug (Something isn't working)

@sokada1221 (Contributor) commented Jul 4, 2019

Bug Report

What version of Kubernetes are you using?

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-19T22:12:47Z", GoVersion:"go1.12.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using?
latest master

$ kubectl exec -n tidb-admin tidb-controller-manager-545d6c854d-xhrzx -- tidb-controller-manager -V
TiDB Operator Version: version.Info{TiDBVersion:"2.1.0", GitVersion:"v1.0.0-beta.3", GitCommit:"6257dfaad68f55f745f20f6f5d19b10bea2b0bea", GitTreeState:"clean", BuildDate:"2019-06-06T09:51:04Z", GoVersion:"go1.12", Compiler:"gc", Platform:"linux/amd64"}

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

$ kubectl get sc
NAME            PROVISIONER                    AGE
ebs-gp2         kubernetes.io/aws-ebs          20h
gp2 (default)   kubernetes.io/aws-ebs          20h
local-storage   kubernetes.io/no-provisioner   20h
$ kubectl get pvc -n shinno-cluster
NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
pd-shinno-cluster-pd-0            Bound    pvc-f3bd7c65-9dd5-11e9-a2dd-0a8e4e0a1ba8   1Gi        RWO            ebs-gp2         20h
pd-shinno-cluster-pd-1            Bound    pvc-f3bf7154-9dd5-11e9-a2dd-0a8e4e0a1ba8   1Gi        RWO            ebs-gp2         20h
pd-shinno-cluster-pd-2            Bound    pvc-f3c186e2-9dd5-11e9-a2dd-0a8e4e0a1ba8   1Gi        RWO            ebs-gp2         20h
shinno-cluster-monitor            Bound    pvc-f2c948ca-9dd5-11e9-81c4-026779400d00   100Gi      RWO            ebs-gp2         20h
shinno-cluster-scheduled-backup   Bound    pvc-ea9f71fb-9e7e-11e9-81c4-026779400d00   100Gi      RWO            ebs-gp2         4m23s
tikv-shinno-cluster-tikv-0        Bound    local-pv-2ced23ed                          366Gi      RWO            local-storage   20h
tikv-shinno-cluster-tikv-1        Bound    local-pv-6935efbf                          366Gi      RWO            local-storage   20h
tikv-shinno-cluster-tikv-2        Bound    local-pv-facd00f4                          366Gi      RWO            local-storage   20h

What's the status of the TiDB cluster pods?

$ kubectl get po -n shinno-cluster -o wide
NAME                                               READY   STATUS      RESTARTS   AGE   IP            NODE                                        NOMINATED NODE
shinno-cluster-discovery-d6c4df7f-m5ht6            1/1     Running     0          20h   10.0.54.124   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-monitor-55f87b9755-djq2h            2/2     Running     0          20h   10.0.58.189   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-pd-0                                1/1     Running     0          20h   10.0.52.76    ip-10-0-52-61.us-east-2.compute.internal    <none>
shinno-cluster-pd-1                                1/1     Running     1          20h   10.0.30.162   ip-10-0-27-179.us-east-2.compute.internal   <none>
shinno-cluster-pd-2                                1/1     Running     0          20h   10.0.46.121   ip-10-0-45-73.us-east-2.compute.internal    <none>
shinno-cluster-scheduled-backup-1562260500-bfwbl   0/1     Completed   0          21m   10.0.51.241   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-scheduled-backup-1562261400-g67bx   0/1     Completed   0          15m   10.0.48.123   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-scheduled-backup-1562262300-cr7hj   0/1     Completed   0          65s   10.0.48.123   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-tidb-0                              1/1     Running     0          20h   10.0.61.93    ip-10-0-53-222.us-east-2.compute.internal   <none>
shinno-cluster-tidb-1                              1/1     Running     0          20h   10.0.43.87    ip-10-0-40-191.us-east-2.compute.internal   <none>
shinno-cluster-tikv-0                              1/1     Running     1          20h   10.0.24.56    ip-10-0-19-9.us-east-2.compute.internal     <none>
shinno-cluster-tikv-1                              1/1     Running     1          20h   10.0.55.241   ip-10-0-48-170.us-east-2.compute.internal   <none>
shinno-cluster-tikv-2                              1/1     Running     0          20h   10.0.46.237   ip-10-0-34-248.us-east-2.compute.internal   <none>

What did you do?

  1. Deploy with terraform apply from deploy/aws
  2. Enable scheduled backup
  3. Try to restore from the S3 bucket according to the doc

What did you expect to see?
Restore to work according to the doc without any error.

What did you see instead?
Seems like we're hitting Kubernetes naming limits: the generated Job name exceeds the 63-character label limit, and its uppercase 'T' also fails DNS-1123 validation. The scheduled backup name should probably be shortened.

$ helm install charts/tidb-backup --namespace=shinno-cluster
Error: release lopsided-magpie failed: Job.batch "shinno-cluster-restore-scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg" is invalid: [metadata.name: Invalid value: "shinno-cluster-restore-scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.labels: Invalid value: "shinno-cluster-restore-scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg": must be no more than 63 characters]
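
For reference, the generated name is 104 characters, well over the 63-character label limit:

$ echo -n shinno-cluster-restore-scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg | wc -c
104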

Workaround
Copy the contents of the scheduled backup to a new folder with a shorter name.
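
A rough sketch of the workaround, assuming a pod that mounts the backup PVC at /data (the pod and target directory names are illustrative):

$ kubectl exec -n shinno-cluster <pod-with-backup-pvc> -- \
    cp -r /data/scheduled-backup-2019-07-03T000020-tidb-cluster-scheduled-backup-1562112000-dvfqg \
          /data/backup-20190703

The restore chart can then be pointed at the shorter directory name.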

@weekface (Contributor) commented Jul 5, 2019

@shinnosuke-okada We have removed the long pod name from the scheduled backup dir in this PR: #576

https://github.com/pingcap/tidb-operator/pull/576/files#diff-87269cdf14e9a2fb6ca87a9e728057a5R7

Could you use the latest charts/tidb-cluster chart and try again?
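
(With that change, judging from the backup logs later in this thread, the backup directory name is derived from the schedule timestamp alone, e.g. scheduled-backup-20190705-203000, which stays well under the limit.)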

@sokada1221 (Contributor Author) commented:

@weekface Is it possible to make the latest Helm charts available via a dev tag or something? The refactored AWS deployment now takes Helm charts from the pingcap repository, and only the following versions are available:

$ helm search tidb-cluster -l
NAME                    CHART VERSION   APP VERSION     DESCRIPTION                  
pingcap/tidb-cluster    v1.0.0-beta.3                   A Helm chart for TiDB Cluster
pingcap/tidb-cluster    v1.0.0-beta.2                   A Helm chart for TiDB Cluster

@gregwebs (Contributor) commented Jul 5, 2019

You can probably change the Helm repository to point to this GitHub repo or your local folder instead.

It is also always possible to run helm install as a local-exec.

You can also disable the Helm provisioner part of the Terraform config and run the helm install manually for now.
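
For instance, a minimal local-exec sketch (the resource name and chart path here are illustrative, not the repo's actual config):

resource "null_resource" "helm-install-tidb-cluster" {
  provisioner "local-exec" {
    # install the chart from a local checkout instead of the pingcap repo
    command = "helm install ./charts/tidb-cluster --name shinno-cluster --namespace shinno-cluster"
  }
}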

There is definitely a problem with flexibly selecting the proper version. I opened an issue for this: #640

@sokada1221 (Contributor Author) commented:

Thanks @gregwebs!

Yes, I was actually trying to upgrade from a local dir, but it somehow failed. After that, terraform could neither fix nor destroy the deployment, so I'm currently redeploying a fresh cluster. I'll post the result as soon as I have it ready. Thanks.

@gregwebs (Contributor) commented Jul 5, 2019

Sorry, we are currently making Terraform changes that are not backwards-compatible. I am pushing toward proper usage as a module; after that, there will be more options for dealing with Terraform changes.

@sokada1221 (Contributor Author) commented:

Sounds good. No problem - I understand it's a WIP :)

Just a quick update. I deployed with the latest code from the master branch, but the scheduled backup is hitting a segmentation fault. Also, I cannot SSH into the bastion node for some reason.

$ kubectl logs shinno-cluster-scheduled-backup-1562358600-wb57r -n shinno-cluster
+ set -euo pipefail
+ getent hosts shinno-cluster-tidb
+ head
+ awk '{print $1}'
+ host=172.20.67.254
+ echo shinno-cluster-scheduled-backup-1562358600-wb57r
+ awk -F- '{print $(NF-1)}'
+ timestamp=1562358600
+ date -u -d @1562358600 '+%Y%m%d-%H%M%S'
+ backupName=scheduled-backup-20190705-203000
+ backupPath=/data/scheduled-backup-20190705-203000
+ echo 'making dir /data/scheduled-backup-20190705-203000'
making dir /data/scheduled-backup-20190705-203000
+ mkdir -p /data/scheduled-backup-20190705-203000
+ /usr/bin/mysql -h172.20.67.254 -P4000 -uroot -p -Nse 'select variable_value from mysql.tidb where variable_name='"'"'tikv_gc_life_time'"'"';'
Segmentation fault (core dumped)
+ gc_life_time=

@aylei (Contributor) commented Jul 8, 2019

I will take a look at the bastion issue.

@LinuxGit Can you take a look at the segmentation fault listed above?

@LinuxGit (Contributor) commented Jul 9, 2019

@shinnosuke-okada I've submitted a new issue: #643.
When the password in backup-secret is empty, a segmentation fault occurs.
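
A minimal sketch of the kind of guard that avoids the crash (illustrative only; the actual fix is in #649, and the password variable name is assumed):

# only pass -p to mysql when a password is actually set (variable name assumed)
if [ -n "${TIDB_PASSWORD:-}" ]; then
  password_arg="-p${TIDB_PASSWORD}"
else
  password_arg=""
fi
/usr/bin/mysql -h${host} -P4000 -uroot ${password_arg} \
  -Nse 'select variable_value from mysql.tidb where variable_name="tikv_gc_life_time";'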

@weekface (Contributor) commented:

The segmentation fault was fixed by #649. @shinnosuke-okada, can you give it a try?

@sokada1221 (Contributor Author) commented:

Checked out the latest master and merged my changes today. I somehow ended up with an incomplete cluster.

$ kubectl get pods -n shinno-cluster
NAME                                      READY   STATUS    RESTARTS   AGE
shinno-cluster-discovery-d6c4df7f-dgkl9   1/1     Running   0          3m35s
shinno-cluster-monitor-55f87b9755-f528k   2/2     Running   0          3m35s
shinno-cluster-pump-0                     1/1     Running   0          3m35s
shinno-cluster-pump-1                     1/1     Running   0          2m14s
$ helm history shinno-cluster
REVISION	UPDATED                 	STATUS  	CHART           	DESCRIPTION
1       	Thu Jul 11 15:02:09 2019	DEPLOYED	tidb-cluster-dev	Install complete

Will investigate further tomorrow. Thanks.

@aylei (Contributor) commented Jul 12, 2019

The deployment of tidb-operator failed for some reason. Could you please provide the output of "helm ls" and "kubectl get po -n tidb-admin"?

@sokada1221 (Contributor Author) commented:

@aylei Looks like this:

$ helm ls
NAME            REVISION        UPDATED                         STATUS          CHART                   APP VERSION     NAMESPACE     
shinno-cluster  1               Thu Jul 11 15:02:09 2019        DEPLOYED        tidb-cluster-dev                        shinno-cluster
tidb-operator   1               Thu Jul 11 15:02:06 2019        DEPLOYED        tidb-cluster-dev                        tidb-admin    
$ kubectl get po -n tidb-admin
NAME                                       READY   STATUS    RESTARTS   AGE
tidb-operator-discovery-567549cf4f-cnf8n   1/1     Running   0          7h27m
tidb-operator-monitor-865bdf479c-f4xqj     2/2     Running   0          7h27m

@aylei (Contributor) commented Jul 12, 2019

@shinnosuke-okada It turns out the chart used for the tidb-operator release is wrong; it's actually a tidb-cluster chart.
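
For reference, a sketch of what the operator release should look like (attribute values here are illustrative):

resource "helm_release" "tidb-operator" {
  name      = "tidb-operator"
  chart     = "tidb-operator"   # not the tidb-cluster chart
  namespace = "tidb-admin"
}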

@aylei (Contributor) commented Jul 12, 2019

Did your merge change this resource?

resource "helm_release" "tidb-operator" {

@sokada1221 (Contributor Author) commented:

Oh yes, you're right - it was a careless mistake on my end. Sorry, and thank you! Let me quickly verify this issue again.

@sokada1221 (Contributor Author) commented:

Verified with the latest master that the scheduled backup name is short enough. For example:
scheduled-backup-20190712-045000

Thanks everyone!
