# Troubleshooting

This tutorial steps through techniques for identifying errors and pipeline failures, and
for avoiding common pitfalls.


## Things to check if Pydra gets stuck

There are a number of common gotchas, related to running multi-process code, that can
cause Pydra workflows to get stuck and not execute correctly. If using the concurrent
futures worker (e.g. `worker="cf"`), check these issues first before filing a bug report
or reaching out for help.

### Applying `nest_asyncio` when running within a notebook

When using the concurrent futures worker within a Jupyter notebook, you need to apply
`nest_asyncio` with the following lines:

```python
# This is needed to run parallel workflows in Jupyter notebooks
import nest_asyncio
nest_asyncio.apply()
```

### Enclosing multi-process code within `if __name__ == "__main__"`

If running a script that executes a workflow with the concurrent futures worker
(i.e. `worker="cf"`) on macOS or Windows, then the submission/execution call needs to
be enclosed within an `if __name__ == "__main__"` block, e.g.

```python
from pydra.tasks.testing import UnsafeDivisionWorkflow
from pydra.engine.submitter import Submitter

# Set up the workflow (the denominator is non-zero here, so no
# division-by-zero error is expected)
wf = UnsafeDivisionWorkflow(a=10, b=5, denominator=2)

# Only submit when the file is run as a script, not when it is re-imported
# by a newly started worker process
if __name__ == "__main__":
    with Submitter(worker="cf") as sub:
        result = sub(wf)
```

### Remove stray lockfiles

During the execution of a task, a lockfile is generated to signify that the task is running.
These lockfiles are released after a task completes, either successfully or with an error,
within a *try/finally* block. However, if a task/workflow is terminated by an interactive
debugger, the finally block may not be executed, leaving stray lockfiles behind. This
can cause Pydra to hang while waiting for the lock to be released. If you suspect this to be
the issue, and there are no other jobs running, then simply remove all lock files from your
cache directory (e.g. `rm <your-run-cache-dir>/*.lock`) and re-submit your job.
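
The same clean-up can be done from Python, which is handy inside a notebook or on Windows where `rm` is not available. The following is a minimal sketch using only the standard library; the helper name `remove_lockfiles` and its `dry_run` flag are hypothetical and not part of Pydra, and it assumes that the lock files sit directly inside the cache directory, as the `rm` command above does.

```python
from pathlib import Path
from typing import List


def remove_lockfiles(cache_dir: Path, dry_run: bool = True) -> List[Path]:
    """Find (and optionally delete) every `*.lock` file directly inside `cache_dir`.

    Hypothetical helper, not part of Pydra. Returns the lock files that were
    found. With `dry_run=True` (the default) nothing is deleted.
    """
    locks = sorted(Path(cache_dir).glob("*.lock"))
    if not dry_run:
        for lock in locks:
            lock.unlink()
    return locks
```

Calling `remove_lockfiles(Path("<your-run-cache-dir>"))` first lists what would be removed; pass `dry_run=False` once you are sure that no other jobs are using the cache.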

If the `clean_stale_locks` flag is set (as it is by default when using the *debug* worker), locks that
were created before the outer task was submitted are removed before the task is run.
However, since these locks could have been created by separate submission processes, `clean_stale_locks`
is not switched on by default when using production workers (e.g. `cf`, `slurm`, etc.).

## Finding errors

### Running in *debug* mode

By default, Pydra will run with the *debug* worker, which executes each task serially
within a single process without use of `async/await` blocks, to allow raised exceptions
to propagate gracefully to the calling code. If you are having trouble with a pipeline,
ensure that `worker="debug"` is passed to the submission/execution call (the default).

### Reading error files

When a task raises an error, it is captured and saved in a pickle file named `_error.pklz`
within the task's cache directory. For example, when calling the toy `UnsafeDivisionWorkflow`
with `denominator=0`, the task will fail.

```python
from pydra.tasks.testing import UnsafeDivisionWorkflow
from pydra.engine.submitter import Submitter

# This workflow will fail because one of the split denominators is 0
wf = UnsafeDivisionWorkflow(a=10, b=5).split(denominator=[3, 2, 0])

with Submitter(worker="cf") as sub:
    result = sub(wf)

if result.errored:
    print("Workflow failed with errors:\n" + str(result.errors))
else:
    print("Workflow completed successfully :)")
```
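
To inspect the saved error files directly, you can search the cache directory for `_error.pklz` files and unpickle them. The sketch below is a hedged illustration rather than Pydra's own API: the helper `load_error_files` is hypothetical, and it assumes that each `_error.pklz` is an uncompressed pickle stream that the standard `pickle` module can read. If loading fails, `cloudpickle.load` may be needed in place of `pickle.load`. Only unpickle files from a cache directory that you trust.

```python
import pickle
from pathlib import Path
from typing import Any, Dict


def load_error_files(cache_root: Path) -> Dict[Path, Any]:
    """Load every `_error.pklz` found under `cache_root`.

    Hypothetical helper, not part of Pydra. The result maps each task's
    cache directory to the unpickled contents of its error file.
    """
    errors: Dict[Path, Any] = {}
    for error_file in sorted(Path(cache_root).rglob("_error.pklz")):
        with error_file.open("rb") as f:
            # Assumption: the file is a plain (uncompressed) pickle stream
            errors[error_file.parent] = pickle.load(f)
    return errors
```

For example, looping over `load_error_files(Path("<your-run-cache-dir>")).items()` and printing each entry shows which task directories contain an error and what was recorded for each.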

## Tracing upstream issues

Failures are common in scientific analysis, even for well-tested workflows, due to
the novel nature of scientific experiments and the known artefacts that can occur.
Therefore, it is always worth sanity-checking the results produced by workflows. When a problem
occurs in a multi-stage workflow, it can be difficult to identify the stage at which the
issue occurred.

Currently in Pydra, you need to step backwards through the tasks of the workflow, load
the saved task object and inspect its inputs to find the preceding nodes. If any of the
inputs that were generated by previous nodes are not OK, then you should check the
tasks that generated them in turn.
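
The backwards search described above can be sketched in a library-agnostic way. The example below does not use Pydra's API: the workflow is represented as a plain dictionary mapping each node name to the nodes that feed it, and `output_ok` stands in for whatever manual or automated check you apply to a node's saved output. All names here are hypothetical and serve only to illustrate the procedure.

```python
from typing import Callable, Dict, List, Set


def find_root_causes(
    node: str,
    upstream: Dict[str, List[str]],
    output_ok: Callable[[str], bool],
) -> Set[str]:
    """Walk backwards from a node with bad output to the earliest bad nodes.

    `upstream` maps each node name to the names of the nodes that feed it.
    `output_ok` returns True if a node's output passes your sanity check.
    """
    bad_parents = [p for p in upstream.get(node, []) if not output_ok(p)]
    if not bad_parents:
        # All inputs look fine, so the problem was introduced at this node
        return {node}
    causes: Set[str] = set()
    for parent in bad_parents:
        causes |= find_root_causes(parent, upstream, output_ok)
    return causes


# Hypothetical five-node workflow in which "smooth" introduced the problem
upstream = {
    "report": ["stats"],
    "stats": ["smooth", "mask"],
    "smooth": ["load"],
    "mask": ["load"],
    "load": [],
}
bad_outputs = {"report", "stats", "smooth"}
print(find_root_causes("report", upstream, lambda n: n not in bad_outputs))
```

Starting from the final `report` node, the search follows only the inputs that fail the check and stops at `smooth`, the first node whose inputs are all fine but whose output is not.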