This repository has been archived by the owner on Jun 14, 2019. It is now read-only.

test results are reused when build gets OOM-killed #237

Closed
0xmichalis opened this issue Jan 7, 2019 · 7 comments · Fixed by #240
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@0xmichalis
Contributor

0xmichalis commented Jan 7, 2019

When a build pod gets OOM-killed, /retest stops working: ci-operator reuses the result of the OOM-killed run instead of starting a new build. See https://openshift-gce-devel.appspot.com/pr/openshift_openshift-azure/984 (build IDs 1725 to 1727). I had to go into the namespace and manually delete the build to restart the test successfully.

$ oc get po -owide
NAME        READY     STATUS      RESTARTS   AGE       IP               NODE                  NOMINATED NODE
bin-build   0/1       OOMKilled   0          38m       172.16.142.7     origin-ci-ig-n-n2gj   <none>
src-build   0/1       Completed   0          41m       172.16.101.161   origin-ci-ig-n-c194   <none>

Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    openshift.io/build.name: bin
    openshift.io/scc: privileged
  creationTimestamp: 2019-01-07T09:52:03Z
  labels:
    openshift.io/build.name: bin
  name: bin-build
  namespace: ci-op-lidk3nqq
  ownerReferences:
  - apiVersion: build.openshift.io/v1
    controller: true
    kind: Build
    name: bin
    uid: e292b624-1261-11e9-8771-42010a8e0002
  resourceVersion: "45134124"
  selfLink: /api/v1/namespaces/ci-op-lidk3nqq/pods/bin-build
  uid: e298e9a1-1261-11e9-8223-42010a8e0003
spec:
  containers:
  - args:
    - --loglevel=0
    command:
    - openshift-docker-build
    env:
    - name: BUILD
      value: |
        {"kind":"Build","apiVersion":"build.openshift.io/v1","metadata":{"name":"bin","namespace":"ci-op-lidk3nqq","selfLink":"/apis/build.openshift.io/v1/namespaces/ci-op-lidk3nqq/builds/bin","uid":"e292b624-1261-11e9-8771-42010a8e0002","resourceVersion":"45133383","creationTimestamp":"2019-01-07T09:52:03Z","labels":{"build-id":"1725","created-by-ci":"true","creates":"bin","job":"pull-ci-openshift-openshift-azure-master-e2e-azure","persists-between-builds":"false","prow.k8s.io/id":"7822e157-1261-11e9-aa36-0a58ac106442"},"annotations":{"ci.openshift.io/job-spec":"{\"type\":\"presubmit\",\"job\":\"pull-ci-openshift-openshift-azure-master-e2e-azure\",\"buildid\":\"1725\",\"prowjobid\":\"7822e157-1261-11e9-aa36-0a58ac106442\",\"refs\":{\"org\":\"openshift\",\"repo\":\"openshift-azure\",\"base_ref\":\"master\",\"base_sha\":\"8bcbff9c0ae982005dae2e5a1ce9683a9ec164b2\",\"pulls\":[{\"number\":984,\"author\":\"y-cote\",\"sha\":\"7a0b60659b065592bec0298af81a32bc520cfa6b\"}]}}"},"ownerReferences":[{"apiVersion":"image.openshift.io/v1","kind":"ImageStream","name":"pipeline","uid":"94da95a4-1261-11e9-8771-42010a8e0002","controller":true}]},"spec":{"serviceAccount":"builder","source":{"type":"Dockerfile","dockerfile":"FROM pipeline:src\nRUN [\"/bin/bash\", \"-c\", \"set -o errexit; umask 0002; make 
all\"]"},"strategy":{"type":"Docker","dockerStrategy":{"from":{"kind":"DockerImage","name":"docker-registry.default.svc:5000/ci-op-lidk3nqq/pipeline@sha256:2bb5dd5fa431f9c38a98c887dd89ad4a84c741129cb5336bd89b0c90a6f8acf8"},"pullSecret":{"name":"builder-dockercfg-fqtcl"},"noCache":true,"forcePull":true,"imageOptimizationPolicy":"SkipLayers"}},"output":{"to":{"kind":"DockerImage","name":"docker-registry.default.svc:5000/ci-op-lidk3nqq/pipeline:bin"},"pushSecret":{"name":"builder-dockercfg-fqtcl"}},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"100m","memory":"200Mi"}},"postCommit":{},"nodeSelector":null,"triggeredBy":null},"status":{"phase":"New","outputDockerImageReference":"docker-registry.default.svc:5000/ci-op-lidk3nqq/pipeline:bin","output":{}}}
    - name: PUSH_DOCKERCFG_PATH
      value: /var/run/secrets/openshift.io/push
    - name: PULL_DOCKERCFG_PATH
      value: /var/run/secrets/openshift.io/pull
    image: docker.io/openshift/origin-docker-builder:v3.11.0
    imagePullPolicy: IfNotPresent
    name: docker-build
    resources:
      limits:
        cpu: "2"
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 200Mi
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /tmp/build
      name: buildworkdir
    - mountPath: /var/run/docker.sock
      name: docker-socket
    - mountPath: /var/run/crio/crio.sock
      name: crio-socket
    - mountPath: /var/run/secrets/openshift.io/push
      name: builder-dockercfg-fqtcl-push
      readOnly: true
    - mountPath: /var/run/secrets/openshift.io/pull
      name: builder-dockercfg-fqtcl-pull
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: builder-token-nkzjj
      readOnly: true
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: builder-dockercfg-fqtcl
  initContainers:
  - args:
    - --loglevel=0
    command:
    - openshift-manage-dockerfile
    env:
    - name: BUILD
      value: |
        {"kind":"Build","apiVersion":"build.openshift.io/v1","metadata":{"name":"bin","namespace":"ci-op-lidk3nqq","selfLink":"/apis/build.openshift.io/v1/namespaces/ci-op-lidk3nqq/builds/bin","uid":"e292b624-1261-11e9-8771-42010a8e0002","resourceVersion":"45133383","creationTimestamp":"2019-01-07T09:52:03Z","labels":{"build-id":"1725","created-by-ci":"true","creates":"bin","job":"pull-ci-openshift-openshift-azure-master-e2e-azure","persists-between-builds":"false","prow.k8s.io/id":"7822e157-1261-11e9-aa36-0a58ac106442"},"annotations":{"ci.openshift.io/job-spec":"{\"type\":\"presubmit\",\"job\":\"pull-ci-openshift-openshift-azure-master-e2e-azure\",\"buildid\":\"1725\",\"prowjobid\":\"7822e157-1261-11e9-aa36-0a58ac106442\",\"refs\":{\"org\":\"openshift\",\"repo\":\"openshift-azure\",\"base_ref\":\"master\",\"base_sha\":\"8bcbff9c0ae982005dae2e5a1ce9683a9ec164b2\",\"pulls\":[{\"number\":984,\"author\":\"y-cote\",\"sha\":\"7a0b60659b065592bec0298af81a32bc520cfa6b\"}]}}"},"ownerReferences":[{"apiVersion":"image.openshift.io/v1","kind":"ImageStream","name":"pipeline","uid":"94da95a4-1261-11e9-8771-42010a8e0002","controller":true}]},"spec":{"serviceAccount":"builder","source":{"type":"Dockerfile","dockerfile":"FROM pipeline:src\nRUN [\"/bin/bash\", \"-c\", \"set -o errexit; umask 0002; make 
all\"]"},"strategy":{"type":"Docker","dockerStrategy":{"from":{"kind":"DockerImage","name":"docker-registry.default.svc:5000/ci-op-lidk3nqq/pipeline@sha256:2bb5dd5fa431f9c38a98c887dd89ad4a84c741129cb5336bd89b0c90a6f8acf8"},"pullSecret":{"name":"builder-dockercfg-fqtcl"},"noCache":true,"forcePull":true,"imageOptimizationPolicy":"SkipLayers"}},"output":{"to":{"kind":"DockerImage","name":"docker-registry.default.svc:5000/ci-op-lidk3nqq/pipeline:bin"},"pushSecret":{"name":"builder-dockercfg-fqtcl"}},"resources":{"limits":{"cpu":"2","memory":"4Gi"},"requests":{"cpu":"100m","memory":"200Mi"}},"postCommit":{},"nodeSelector":null,"triggeredBy":null},"status":{"phase":"New","outputDockerImageReference":"docker-registry.default.svc:5000/ci-op-lidk3nqq/pipeline:bin","output":{}}}
    image: docker.io/openshift/origin-docker-builder:v3.11.0
    imagePullPolicy: IfNotPresent
    name: manage-dockerfile
    resources:
      limits:
        cpu: "2"
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 200Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /tmp/build
      name: buildworkdir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: builder-token-nkzjj
      readOnly: true
  nodeName: origin-ci-ig-n-n2gj
  nodeSelector:
    node-role.kubernetes.io/compute: "true"
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: builder
  serviceAccountName: builder
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - emptyDir: {}
    name: buildworkdir
  - hostPath:
      path: /var/run/docker.sock
      type: ""
    name: docker-socket
  - hostPath:
      path: /var/run/crio/crio.sock
      type: ""
    name: crio-socket
  - name: builder-dockercfg-fqtcl-push
    secret:
      defaultMode: 384
      secretName: builder-dockercfg-fqtcl
  - name: builder-dockercfg-fqtcl-pull
    secret:
      defaultMode: 384
      secretName: builder-dockercfg-fqtcl
  - name: builder-token-nkzjj
    secret:
      defaultMode: 420
      secretName: builder-token-nkzjj
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-01-07T09:52:12Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-01-07T09:54:19Z
    message: 'containers with unready status: [docker-build]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'containers with unready status: [docker-build]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-01-07T09:52:03Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://265dce80a1cfd77facdd8932ca9aff0ae44885869a7cef5182633613f738a37b
    image: docker.io/openshift/origin-docker-builder:v3.11.0
    imageID: docker-pullable://docker.io/openshift/origin-docker-builder@sha256:fc838f97f05142dbacfb52f7f46d4d62255ed817d8122d54a11dccea9eb2951a
    lastState: {}
    name: docker-build
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://265dce80a1cfd77facdd8932ca9aff0ae44885869a7cef5182633613f738a37b
        exitCode: 137
        finishedAt: 2019-01-07T09:54:18Z
        message: |2

          Pulling image docker-registry.default.svc:5000/ci-op-lidk3nqq/pipeline@sha256:2bb5dd5fa431f9c38a98c887dd89ad4a84c741129cb5336bd89b0c90a6f8acf8 ...
          Pulled 2/3 layers, 93% complete
          Pulled 3/3 layers, 100% complete
          Extracting
          --> FROM docker-registry.default.svc:5000/ci-op-lidk3nqq/pipeline@sha256:2bb5dd5fa431f9c38a98c887dd89ad4a84c741129cb5336bd89b0c90a6f8acf8 as 0
          --> RUN ["/bin/bash","-c","set -o errexit; umask 0002; make all"]
          rm -f coverage.out azure-controllers etcdbackup sync recoveretcdcluster metricsbridge e2e
          go generate ./...
          bindata.go
          bindata.go
          go build ./...
        reason: OOMKilled
        startedAt: 2019-01-07T09:52:13Z
  hostIP: 10.142.0.6
  initContainerStatuses:
  - containerID: docker://795a0345af7bee6c4c6ba41e4a54ef01619a6df9df3ef73f64d5979dfeb62172
    image: docker.io/openshift/origin-docker-builder:v3.11.0
    imageID: docker-pullable://docker.io/openshift/origin-docker-builder@sha256:fc838f97f05142dbacfb52f7f46d4d62255ed817d8122d54a11dccea9eb2951a
    lastState: {}
    name: manage-dockerfile
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: docker://795a0345af7bee6c4c6ba41e4a54ef01619a6df9df3ef73f64d5979dfeb62172
        exitCode: 0
        finishedAt: 2019-01-07T09:52:11Z
        reason: Completed
        startedAt: 2019-01-07T09:52:11Z
  phase: Failed
  podIP: 172.16.142.7
  qosClass: Burstable
  startTime: 2019-01-07T09:52:07Z

Build manifest (from a different test):

apiVersion: build.openshift.io/v1
kind: Build
metadata:
  annotations:
    ci.openshift.io/job-spec: '{"type":"presubmit","job":"pull-ci-openshift-openshift-azure-master-images","buildid":"1381","prowjobid":"4563641c-1343-11e9-bf48-0a58ac104a03","refs":{"org":"openshift","repo":"openshift-azure","base_ref":"master","base_sha":"4e8dfdaa95b77ddfead8ccb18a55fe5ecfa30add","pulls":[{"number":995,"author":"kargakis","sha":"d6667a9ca4b967409915bff9c3f523408598dae7"}]}}'
    openshift.io/build.pod-name: bin-build
  creationTimestamp: 2019-01-08T12:46:37Z
  labels:
    build-id: "1381"
    created-by-ci: "true"
    creates: bin
    job: pull-ci-openshift-openshift-azure-master-images
    persists-between-builds: "false"
    prow.k8s.io/id: 4563641c-1343-11e9-bf48-0a58ac104a03
  name: bin
  namespace: ci-op-0i9gkyc8
  ownerReferences:
  - apiVersion: image.openshift.io/v1
    controller: true
    kind: ImageStream
    name: pipeline
    uid: 50fcb7ee-1343-11e9-8223-42010a8e0003
  resourceVersion: "45858638"
  selfLink: /apis/build.openshift.io/v1/namespaces/ci-op-0i9gkyc8/builds/bin
  uid: 7043d64d-1343-11e9-a7de-42010a8e0004
spec:
  nodeSelector: null
  output:
    pushSecret:
      name: builder-dockercfg-kwfdd
    to:
      kind: ImageStreamTag
      name: pipeline:bin
      namespace: ci-op-0i9gkyc8
  postCommit: {}
  resources:
    limits:
      cpu: "2"
      memory: 4Gi
    requests:
      cpu: 100m
      memory: 200Mi
  serviceAccount: builder
  source:
    dockerfile: |-
      FROM pipeline:src
      RUN ["/bin/bash", "-c", "set -o errexit; umask 0002; make all"]
    type: Dockerfile
  strategy:
    dockerStrategy:
      forcePull: true
      from:
        kind: ImageStreamTag
        name: pipeline:src
        namespace: ci-op-0i9gkyc8
      imageOptimizationPolicy: SkipLayers
      noCache: true
    type: Docker
  triggeredBy: null
status:
  completionTimestamp: 2019-01-08T12:48:44Z
  duration: 126000000000
  logSnippet: |-
    rm -f coverage.out azure-controllers etcdbackup sync recoveretcdcluster metricsbridge e2e
    go generate ./...
    bindata.go
    bindata.go
    go build ./...
  message: The build pod was killed due to an out of memory condition.
  output: {}
  outputDockerImageReference: docker-registry.default.svc:5000/ci-op-0i9gkyc8/pipeline:bin
  phase: Failed
  reason: OutOfMemoryKilled
  startTimestamp: 2019-01-08T12:46:38Z

/kind bug @droslean @bbguimaraes @petr-muller @stevekuznetsov

@smarterclayton
Contributor

Interesting, this could be a template instance bug as well.

@stevekuznetsov
Contributor

We need to update this:

isFailed := func(b *buildapi.Build) bool {
	return b.Status.Phase == buildapi.BuildPhaseFailed ||
		b.Status.Phase == buildapi.BuildPhaseCancelled ||
		b.Status.Phase == buildapi.BuildPhaseError
}
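
For illustration only, here is a minimal sketch of how an OOM-specific check could sit next to the predicate quoted above. `Build`, `BuildStatus`, and the phase constants are simplified local stand-ins, not the real `buildapi` types, and `isOOMKilled` is a hypothetical helper. The reason string `OutOfMemoryKilled` is taken from the Build manifest earlier in this issue. This is not necessarily how the eventual fix was implemented.

```go
package main

import "fmt"

// BuildPhase, BuildStatus and Build are simplified, hypothetical stand-ins
// for the OpenShift build API types; only the fields used here are modeled.
type BuildPhase string

const (
	PhaseComplete  BuildPhase = "Complete"
	PhaseFailed    BuildPhase = "Failed"
	PhaseCancelled BuildPhase = "Cancelled"
	PhaseError     BuildPhase = "Error"
)

type BuildStatus struct {
	Phase  BuildPhase
	Reason string
}

type Build struct {
	Name   string
	Status BuildStatus
}

// isFailed mirrors the predicate quoted in the comment above.
func isFailed(b *Build) bool {
	return b.Status.Phase == PhaseFailed ||
		b.Status.Phase == PhaseCancelled ||
		b.Status.Phase == PhaseError
}

// isOOMKilled (hypothetical helper) reports whether a failed build carries
// the reason seen in the Build manifest from this issue.
func isOOMKilled(b *Build) bool {
	return isFailed(b) && b.Status.Reason == "OutOfMemoryKilled"
}

func main() {
	b := &Build{
		Name:   "bin",
		Status: BuildStatus{Phase: PhaseFailed, Reason: "OutOfMemoryKilled"},
	}
	fmt.Println(isFailed(b), isOOMKilled(b)) // prints: true true
}
```

Note that the OOM-killed build already satisfies `isFailed`, since its phase is `Failed`; a separate check would only matter if OOM failures were to be handled differently from other failures.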

@bbguimaraes
Contributor

I don't think it is that simple. Unless the original build is removed, we are going to be stuck in that loop.

ci-operator reuses the build (conceptually) because the inputs have not changed. If that is the case, we are likely to hit the OOM scenario again. Isn't the proper solution here to increase the resource requirements of the job?

@0xmichalis
Contributor Author

0xmichalis commented Jan 9, 2019

> Isn't the proper solution here to increase the resource requirements of the job?

Unless the OOM error is transient, probably yes. But an OOM kill does not happen only because we didn't request enough memory. You can be OOM'd if you end up on a busy node with higher-priority pods, or there may be an underlying issue on the node, which seems to be the case here.

EDIT

> You can be OOM'd if you end up in a busy node with higher priority pods

This is probably wrong; I think you would get Evicted rather than OOM'd in that situation. Either way, /retest should work in all cases.
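
To make the Evicted versus OOM'd distinction concrete, here is a rough sketch of how the two could be told apart from a pod's status. The types are simplified local stand-ins for the Kubernetes pod status types, and `classify` is a hypothetical helper. It assumes that an evicted pod reports `reason: Evicted` at the pod level, while an OOM kill shows up as `reason: OOMKilled` on the terminated container, as in the pod manifest at the top of this issue.

```go
package main

import "fmt"

// Terminated, ContainerStatus and PodStatus are simplified, hypothetical
// stand-ins for the Kubernetes pod status types.
type Terminated struct {
	ExitCode int
	Reason   string
}

type ContainerStatus struct {
	Name       string
	Terminated *Terminated // nil while the container is waiting or running
}

type PodStatus struct {
	Phase             string
	Reason            string // pod-level reason, assumed "Evicted" for evictions
	ContainerStatuses []ContainerStatus
}

// classify (hypothetical helper) returns "Evicted", "OOMKilled", "Failed"
// for any other failure, or "" when the pod has not failed.
func classify(s PodStatus) string {
	if s.Phase != "Failed" {
		return ""
	}
	if s.Reason == "Evicted" {
		return "Evicted"
	}
	for _, c := range s.ContainerStatuses {
		if c.Terminated != nil && c.Terminated.Reason == "OOMKilled" {
			return "OOMKilled"
		}
	}
	return "Failed"
}

func main() {
	// Values copied from the bin-build pod manifest in this issue.
	pod := PodStatus{
		Phase: "Failed",
		ContainerStatuses: []ContainerStatus{
			{Name: "docker-build", Terminated: &Terminated{ExitCode: 137, Reason: "OOMKilled"}},
		},
	}
	fmt.Println(classify(pod)) // prints: OOMKilled
}
```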

@mjudeikis
Contributor

mjudeikis commented Jan 10, 2019

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_openshift-azure/1008/pull-ci-openshift-openshift-azure-master-e2e-azure/1802

bin-build   0/1       OOMKilled   0          6m        172.16.74.51   origin-ci-ig-n-x7lk
src-build   0/1       Completed   0          9m        172.16.24.11   origin-ci-ig-n-38wf

@droslean
Member

This seems to apply to every phase. ci-operator treats an OOM-killed build as failed and just prints the logs. Whatever state the build is in, ci-operator won't re-create it; it creates a build only if one does not already exist. See here.
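
The behavior described above can be sketched with a small in-memory model. Everything here is hypothetical: `Namespace`, `ensureReuse`, and `ensureRecreate` are illustrative names, not ci-operator code. `ensureReuse` models "create only if it does not exist", so a failed build keeps being returned on every /retest; `ensureRecreate` models one possible alternative that deletes a failed build before creating a new one.

```go
package main

import "fmt"

// Build is a simplified, hypothetical stand-in for an OpenShift build.
type Build struct {
	Name    string
	Phase   string
	Attempt int // which creation produced this build
}

// Namespace is a hypothetical in-memory stand-in for the CI namespace.
type Namespace struct {
	builds  map[string]*Build
	created int
}

func newNamespace() *Namespace {
	return &Namespace{builds: map[string]*Build{}}
}

func (n *Namespace) create(name string) *Build {
	n.created++
	b := &Build{Name: name, Phase: "New", Attempt: n.created}
	n.builds[name] = b
	return b
}

func isFailed(b *Build) bool {
	return b.Phase == "Failed" || b.Phase == "Cancelled" || b.Phase == "Error"
}

// ensureReuse creates the build only when none exists, so an existing
// failed build is returned unchanged.
func ensureReuse(n *Namespace, name string) *Build {
	if b, ok := n.builds[name]; ok {
		return b
	}
	return n.create(name)
}

// ensureRecreate deletes an existing failed build and creates a fresh one.
func ensureRecreate(n *Namespace, name string) *Build {
	if b, ok := n.builds[name]; ok {
		if !isFailed(b) {
			return b
		}
		delete(n.builds, name)
	}
	return n.create(name)
}

func main() {
	ns := newNamespace()
	first := ensureReuse(ns, "bin")
	first.Phase = "Failed" // simulate the OOM-killed build

	retest := ensureReuse(ns, "bin")
	fmt.Println(retest.Attempt, retest.Phase) // prints: 1 Failed

	fresh := ensureRecreate(ns, "bin")
	fmt.Println(fresh.Attempt, fresh.Phase) // prints: 2 New
}
```

As noted earlier in the thread, re-creating the build only helps when the failure is transient; if the inputs are unchanged and the memory limit is genuinely too low, the new build is likely to be OOM-killed again.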
