
pkg/payload/task_graph: Handle node pushing and result collection without a goroutine #264

Merged

Commits on Jun 9, 2020

  1. pkg/payload/task_graph: Handle node pushing and result collection without a goroutine
    
    The RunGraph implementation had been unchanged since it landed in
    cb4e037 (payload: Create a task graph that can split a payload into
    chunks, 2019-01-17, openshift#88), with the exception of later
    logging, c2ac20f (status: Report the operators that have not yet
    deployed, 2019-04-09, openshift#158) with its adjusted return type,
    and eaa3d19 (pkg/payload/task_graph: RunGraph error when canceled,
    2019-10-21, openshift#255) with its "I was canceled" error backstop.
    
    The old code launched a goroutine for the pushing/reaping, which was
    unnecessary and made error reporting on any outstanding tasks more
    complicated.  Dropping this goroutine let me get rid of errCh, which
    had been used to pass errors back to the main goroutine.  I also
    dropped the wg.Wait() goroutine, which used to be part of supporting
    the 'range errCh' used in the main goroutine to collect failed jobs.
    But because errCh was filled directly from the worker goroutines,
    nothing was collecting the completeness of graph coverage from the
    pushing/reaping goroutine.
    
    I've also restructured pushing/reaping to be more resistant to locking
    and spinning.  The old implementation had a node-pushing loop which
    attempted non-blocking pushes, then a result-reaping loop, then some
    condition checking, then a blocking result reaping attempt (after "we
    did nothing this round, so we have to wait for more").  The idea with
    the blocking reap seems to have been that if the pusher/reaper didn't
    push anything (because workCh was full) and the earlier 'for
    len(completeCh) > 0' reaping didn't pull anything in (because workers
    were busy working), the pusher/reaper should block until it collected
    a result, in the hope that the worker which returned the result would
    have cleared a job out of workCh, giving the pusher/reaper space to
    push a new job on the next loop iteration.  But if Go decides to give
    the pusher/reaper more time in the scheduler, it might attempt the
    next workCh push before the worker that just returned a result gets
    around to being scheduled and popping its next job from workCh.  And
    then the pusher/reaper might trickle down to
    the blocking reap and wait for another worker (hopefully
    maxParallelism > 1) to return a result to unblock the pusher/reaper
    and give it another shot at a workCh push.  During this time, the
    worker that returned earlier is idling with nothing to do.
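
    As a rough Go sketch of that old shape (illustrative only; the names
    pending and handleResult, and the channel element types, are
    assumptions rather than the original code, and in-flight/completion
    bookkeeping is elided):

      // Old shape (sketch): a non-blocking push attempt, a non-blocking
      // reap pass, then a blocking reap when neither made progress.
      func oldPushReap(pending []string, workCh chan string,
          completeCh chan error, handleResult func(error)) {
          for len(pending) > 0 {
              pushed := false
              select {
              case workCh <- pending[0]: // non-blocking push attempt
                  pending = pending[1:]
                  pushed = true
              default: // workCh is full this round
              }

              reaped := false
              for len(completeCh) > 0 { // drain finished work without blocking
                  handleResult(<-completeCh)
                  reaped = true
              }

              if !pushed && !reaped {
                  // "we did nothing this round, so we have to wait for more":
                  // block for a result, even though the worker sending it may
                  // have freed space in workCh long before we loop back up to
                  // the push attempt.
                  handleResult(<-completeCh)
              }
          }
      }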
    
    With this commit, when we have a next node to push, we have a single
    switch statement that blocks until we are either able to push the node
    or to reap a result.  So there's no need for a non-blocking push, and
    no chance of spinning, although it does mean we need to recalculate
    the next node after each channel action.  When we've been canceled, we
    stop pushing into workCh, but continue to reap from resultCh until we
    have no in-flight jobs left.  And if we have nothing to push, and
    there's nothing in-flight to reap, we're obviously done, so that
    choice is a lot simpler now.
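
    A simplified Go sketch of that new push/reap loop (placeholder names;
    the real RunGraph tracks graph dependencies and richer results, and
    this assumes the standard "context" package):

      // In the real code, finishing a result can make new nodes ready, so
      // the next node to push is recalculated after every channel action;
      // here ready is just a fixed queue for brevity.
      func pushReap(ctx context.Context, ready []string, workCh chan string,
          resultCh chan error, handleResult func(error)) {
          inflight := 0
          for {
              switch {
              case len(ready) > 0 && ctx.Err() == nil:
                  // Block until we can either push the next node or reap a result.
                  select {
                  case workCh <- ready[0]:
                      ready = ready[1:]
                      inflight++
                  case err := <-resultCh:
                      inflight--
                      handleResult(err)
                  }
              case inflight > 0:
                  // Canceled, or nothing left to push, but jobs are still
                  // running: keep reaping until the in-flight work drains.
                  handleResult(<-resultCh)
                  inflight--
              default:
                  // Nothing to push and nothing in flight: we're done.
                  return
              }
          }
      }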
    
    I've dropped the "Waiting for workers to complete" log line, because
    the wg.Wait() call should block for much less time now.  And because the
    main RunGraph goroutine is doing the waiting, we no longer need the
    'runTask, ok := <-workCh' read to protect against workCh being closed
    early.  With the wg.Wait() now getting called after we have drained
    all in-flight jobs (previously we launched it immediately after
    launching workers), there is less call for the "Waiting for..." line.
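
    As an illustrative guess (not the commit's code), the worker side can
    then be as simple as ranging over workCh:

      // Assumed worker shape: once the main goroutine controls when workCh
      // is closed, a plain range (no ok check) is enough to exit cleanly.
      func worker(ctx context.Context, wg *sync.WaitGroup,
          workCh <-chan func(context.Context) error, resultCh chan<- error) {
          defer wg.Done()
          for runTask := range workCh { // loop ends when workCh is closed
              resultCh <- runTask(ctx)
          }
      }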
    
    But the most externally noticeable change is that now, if we run
    without any failing jobs to give us errors, I'm filling in a new
    "incomplete task nodes" error so folks don't have to look in the CVO
    logs to see how far the CVO got before hanging.  It also allows us to
    not return the "I was canceled" error in cases where the cancellation
    happened late enough that we were still able to successfully process
    the full graph.
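
    A minimal sketch of that backstop (the names are placeholders; only
    the error text matches the behavior described here):

      // incompleteError reports nodes that never ran; anything with a
      // String() method works for the %s verb.
      func incompleteError(incomplete []fmt.Stringer) error {
          if len(incomplete) == 0 {
              return nil
          }
          return fmt.Errorf("%d incomplete task nodes, beginning with %s",
              len(incomplete), incomplete[0])
      }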
    
    The unit-test change fills in a less-stubby Kubernetes object, avoiding:
    
      --- FAIL: TestRunGraph/cancelation_without_task_errors_is_reported (1.00s)
          task_graph_test.go:914: error 0 "1 incomplete task nodes, beginning with %!s(PANIC=String method: runtime error: invalid memory address or nil pointer dereference)" doesn't contain "1 incomplete task nodes, beginning with b"
    
    when Task.String calls Unstructured.GetName, which explodes on the
    lack of the expected Kubernetes-object metadata.
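
    For illustration, a manifest of the "less-stubby" sort described above,
    using k8s.io/apimachinery/pkg/apis/meta/v1/unstructured (the exact
    object the test uses is an assumption here; only the name "b" comes
    from the test output):

      // Enough Kubernetes-object metadata that Unstructured.GetName (and
      // therefore Task.String) has a real name to return.
      manifest := unstructured.Unstructured{
          Object: map[string]interface{}{
              "apiVersion": "v1",
              "kind":       "ConfigMap",
              "metadata": map[string]interface{}{
                  "name": "b",
              },
          },
      }
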
    wking committed 55ef3d3 on Jun 9, 2020