spurious crashes in moov sbx environment #214

Closed
adamdecaf opened this issue Dec 15, 2019 · 1 comment
@adamdecaf
Member

Over the weekend, watchman crashed and got stuck in a loop: it kept failing Kubernetes health checks, was restarted, and stayed unavailable.

Logs:

$ kubectl logs -n apps watchman-798b4b7dd7-hbxwd -c watchman 
ts=2019-12-15T21:22:08.95746364Z caller=main.go:56 startup="Starting watchman server version v0.13.0-rc5"
ts=2019-12-15T21:22:08.957600559Z caller=database.go:18 database="looking for mysql database provider"
ts=2019-12-15T21:22:08.99747607Z caller=download.go:153 download="Starting refresh of data"
ts=2019-12-15T21:22:08.99940274Z caller=main.go:123 admin="listening on [::]:9090"
ts=2019-12-15T21:22:48.617066911Z caller=download.go:205 download="Finished refresh of data"
ts=2019-12-15T21:22:48.795282067Z caller=main.go:150 main="data refreshed 178.180365ms ago" SDNs=8076 AltNames=10559 Addresses=12662 SSI=289 DPL=576 BISEntities=1268
ts=2019-12-15T21:22:48.795388321Z caller=main.go:230 main="Setting data refresh interval to 12h0m0s (default)"
ts=2019-12-15T21:22:48.812958924Z caller=main.go:204 exit=terminated
ts=2019-12-15T21:22:48.813957644Z caller=main.go:194 startup="binding to :8080 for HTTP server"
ts=2019-12-15T21:22:48.814387141Z caller=main.go:196 exit="http: Server closed"

Output of kubectl describe pod:

$ kubectl describe pods -n apps watchman-798b4b7dd7-hbxwd 
Name:               watchman-798b4b7dd7-hbxwd
Namespace:          apps
Priority:           10
PriorityClassName:  normal-priority
Node:               gke-sbx-sbx-permanent-nodes-236c1737-wvds/10.128.0.55
Start Time:         Sat, 14 Dec 2019 20:23:05 -0800
Labels:             app=watchman
                    pod-template-hash=798b4b7dd7
Annotations:        <none>
Status:             Running
IP:                 10.60.2.93
Controlled By:      ReplicaSet/watchman-798b4b7dd7
Containers:
  mysql:
    Container ID:   docker://9673ec47e7f109566a32a7d6b28e36cff3008ae025de22a3b45ad84a100a49d2
    Image:          mysql:8.0
    Image ID:       docker-pullable://mysql@sha256:c93ba1bafd65888947f5cd8bd45deb7b996885ec2a16c574c530c389335e9169
    Port:           3306/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 14 Dec 2019 20:23:25 -0800
    Ready:          True
    Restart Count:  0
    Environment:
      MYSQL_DATABASE:              watchman
      MYSQL_USER:                  watchman
      MYSQL_RANDOM_ROOT_PASSWORD:  yes
      MYSQL_PASSWORD:              <set to the key 'password' in secret 'watchman-mysql-password'>  Optional: false
    Mounts:
      /var/lib/mysql from watchman-mysql-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xp9n8 (ro)
  watchman:
    Container ID:  docker://819114a1280b05e39f95f8c3b0c5010f7be482362d20ea3d31de9460ce06c974
    Image:         moov/watchman:v0.13.0-rc5
    Image ID:      docker-pullable://moov/watchman@sha256:6d89bf3583a0abeb8d6451fe28e672814f5365d0ca2c79cd9b7b84ef0e9f50ac
    Ports:         8080/TCP, 9090/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      -http.addr=:8080
      -admin.addr=:9090
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 15 Dec 2019 13:22:08 -0800
      Finished:     Sun, 15 Dec 2019 13:22:48 -0800
    Ready:          False
    Restart Count:  314
    Limits:
      cpu:     200m
      memory:  250Mi
    Requests:
      cpu:      25m
      memory:   100Mi
    Liveness:   http-get http://:8080/ping delay=5s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:8080/ping delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      LOG_FORMAT:        plain
      TRADEGOV_API_KEY:  <set to the key 'tradegov' in secret 'watchman-api-keys'>  Optional: false
      DATABASE_TYPE:     mysql
      MYSQL_ADDRESS:     tcp(localhost:3306)
      MYSQL_DATABASE:    watchman
      MYSQL_USER:        watchman
      MYSQL_PASSWORD:    <set to the key 'password' in secret 'watchman-mysql-password'>  Optional: false
    Mounts:
      /opt/moov/watchman/ from watchman-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xp9n8 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  watchman-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  watchman-data
    ReadOnly:   false
  watchman-mysql-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  watchman-mysql-data
    ReadOnly:   false
  default-token-xp9n8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-xp9n8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From                                                Message
  ----     ------     ----                 ----                                                -------
  Warning  Unhealthy  18m (x922 over 17h)  kubelet, gke-sbx-sbx-permanent-nodes-236c1737-wvds  Liveness probe failed: Get http://10.60.2.93:8080/ping: dial tcp 10.60.2.93:8080: connect: connection refused
  Warning  BackOff    3m (x3777 over 17h)  kubelet, gke-sbx-sbx-permanent-nodes-236c1737-wvds  Back-off restarting failed container
@adamdecaf adamdecaf added the bug Something isn't working label Dec 15, 2019
@adamdecaf adamdecaf self-assigned this Dec 15, 2019
@adamdecaf adamdecaf added this to In progress in Current Work Dec 15, 2019
adamdecaf added a commit to adamdecaf/watchman that referenced this issue Dec 16, 2019
adamdecaf added a commit to moov-io/infra that referenced this issue Dec 16, 2019
@adamdecaf
Member Author

The container got a SIGTERM from Kubernetes, and I've noticed it can take a while to restart because it re-downloads its data on startup. Over the weekend, watchman cycled through being unavailable because of slow startups (the downloads) combined with Kubernetes backoff. I've bumped up the timeouts and the memory/CPU for the container.
