Tests failing; Looks like run_e2e_workflow.py is not waiting for workflows to finish #125

Closed
jlewi opened this issue May 10, 2018 · 4 comments

Comments

@jlewi
Contributor

jlewi commented May 10, 2018

See for example
kubeflow/kubeflow#787

Logs here:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/787/kubeflow-presubmit/1462/?log#log

Relevant bit:

INFO|2018-05-10T18:32:32|/src/kubeflow/testing/py/kubeflow/testing/util.py|58| Updating workflows kubeflow-test-infra.jlewi-kubeflow-gke-deploy-test-4-3a8b
INFO|2018-05-10T18:32:33|/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py|161| URL for workflow: http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-presubmit-kubeflow-gke-deploy-787-f893ce6-1462-040e?tab=workflow
INFO|2018-05-10T18:32:33|/src/kubeflow/testing/py/kubeflow/testing/argo_client.py|21| Workflow kubeflow-presubmit-kubeflow-e2e-gke-787-f893ce6-1462-b429 in namespace kubeflow-test-infra; phase=Running
INFO|2018-05-10T18:32:33|/src/kubeflow/testing/py/kubeflow/testing/argo_client.py|21| Workflow kubeflow-presubmit-kubeflow-e2e-minikube-787-f893ce6-1462-b59b in namespace kubeflow-test-infra; phase=Running
INFO|2018-05-10T18:32:33|/src/kubeflow/testing/py/kubeflow/testing/argo_client.py|21| Workflow kubeflow-presubmit-tf-serving-image-787-f893ce6-1462-0913 in namespace kubeflow-test-infra; phase=Running
INFO|2018-05-10T18:32:34|/src/kubeflow/testing/py/kubeflow/testing/util.py|512| Writing gs://kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/787/kubeflow-presubmit/1462/finished.json
INFO|2018-05-10T18:32:34|/src/kubeflow/testing/py/kubeflow/testing/util.py|522| Uploading file /tmp/tmpRunE2eWorkflowbjXGWwlog to gs://kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/787/kubeflow-presubmit/1462/build-log.txt.

So looks like we ended while workflows were still running.

@jlewi
Contributor Author

jlewi commented May 10, 2018

Doesn't look like run_e2e_workflow.py has changed in a while. But I do think we upgraded Argo recently.

@jlewi
Contributor Author

jlewi commented May 10, 2018

My conjecture is that wait_for_workflows raises an exception (after calling log_status).
This doesn't show up in the log uploaded to prow because we don't catch it, and in the finally block we upload the current log to prow. So the exception gets propagated and causes the process to exit, but it is never captured in the log uploaded to prow.

  try:
    results = argo_client.wait_for_workflows(api_client, NAMESPACE,
                                             workflow_names,
                                             status_callback=argo_client.log_status)
    for r in results:
      phase = r.get("status", {}).get("phase")
      name = r.get("metadata", {}).get("name")
      workflow_phase[name] = phase
      if phase != "Succeeded":
        success = False
      logging.info("Workflow %s/%s finished phase: %s", NAMESPACE, name, phase)
  except util.TimeoutError:
    success = False
    logging.error("Time out waiting for Workflows %s to finish", ",".join(workflow_names))
  finally:
    success = prow_artifacts.finalize_prow_job(args.bucket, success, workflow_phase, ui_urls)

    # Upload logs to GCS. No logs after this point will appear in the
    # file in gcs
    file_handler.flush()
    util.upload_file_to_gcs(
      file_handler.baseFilename,
      os.path.join(prow_artifacts.get_gcs_dir(args.bucket), "build-log.txt"))
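
A minimal sketch of one way to make such an exception show up in the uploaded log: catch it inside the try, log it together with its traceback, and only then upload. This is an illustration under assumptions, not the actual change made in the repository. `FakeApiException`, the local `wait_for_workflows`, and the file copy that stands in for `util.upload_file_to_gcs` are all hypothetical.

```python
import logging
import os
import shutil
import tempfile


class FakeApiException(Exception):
  """Hypothetical stand-in for kubernetes.client.rest.ApiException."""


def wait_for_workflows():
  """Hypothetical stand-in for argo_client.wait_for_workflows; always fails."""
  raise FakeApiException("(404) Reason: Not Found")


def run(log_path, upload_path):
  """Run the wait step, then copy the log file to upload_path.

  Returns True on success and False if any exception was raised.
  """
  logger = logging.getLogger("run_e2e_sketch")
  logger.setLevel(logging.INFO)
  file_handler = logging.FileHandler(log_path)
  logger.addHandler(file_handler)

  success = True
  try:
    wait_for_workflows()
  except Exception:  # pylint: disable=broad-except
    # logger.exception records the message plus the traceback at ERROR
    # level, so the failure is in the file before the upload happens.
    success = False
    logger.exception("Exception occurred while waiting for workflows")
  finally:
    # Assumption: a local copy stands in for util.upload_file_to_gcs.
    file_handler.flush()
    shutil.copyfile(log_path, upload_path)
    logger.removeHandler(file_handler)
    file_handler.close()
  return success


def demo():
  """Run the sketch in a temp directory; return (success, uploaded text)."""
  tmp_dir = tempfile.mkdtemp()
  try:
    log_path = os.path.join(tmp_dir, "run.log")
    upload_path = os.path.join(tmp_dir, "build-log.txt")
    success = run(log_path, upload_path)
    with open(upload_path) as f:
      return success, f.read()
  finally:
    shutil.rmtree(tmp_dir)
```

Calling `demo()` returns the success flag and the contents of the copied log. Whether to swallow the exception after logging it, as this sketch does, or to re-raise it is a design choice; either way the traceback is written before the log is uploaded.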

@jlewi
Contributor Author

jlewi commented May 10, 2018

Yup, that's the problem.

Looking at the logs for the pod run by prow (see here):

INFO|2018-05-10T18:32:34|/src/kubeflow/testing/py/kubeflow/testing/util.py|522| Uploading file /tmp/tmpRunE2eWorkflowbjXGWwlog to gs://kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/787/kubeflow-presubmit/1462/build-log.txt.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 281, in <module>
    final_result = main()
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 271, in main
    return run(args, file_handler)
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 168, in run
    status_callback=argo_client.log_status)
  File "/src/kubeflow/testing/py/kubeflow/testing/argo_client.py", line 51, in wait_for_workflows
    GROUP, VERSION, namespace, PLURAL, n)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/custom_objects_api.py", line 697, in get_namespaced_custom_object
    (data) = self.get_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/custom_objects_api.py", line 797, in get_namespaced_custom_object_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 321, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 155, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 342, in request
    headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Thu, 10 May 2018 18:32:33 GMT', 'Audit-Id': 'd6977031-5f33-4bfa-8548-c5f3c1276ca2', 'Content-Length': '332', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"workflows.argoproj.io \"kubeflow-presubmit-kubeflow-gke-deploy-787-f893ce6-1462-040e\" not found","reason":"NotFound","details":{"name":"kubeflow-presubmit-kubeflow-gke-deploy-787-f893ce6-1462-040e","group":"argoproj.io","kind":"workflows"},"code":404}

jlewi added a commit to jlewi/testing that referenced this issue May 10, 2018
show up in build-log.txt reported to Gubernator.

Related to kubeflow#125
jlewi added a commit that referenced this issue May 10, 2018
…#126)

show up in build-log.txt reported to Gubernator.

Related to #125
@jlewi
Contributor Author

jlewi commented May 14, 2018

/close
