Actor Error Handling

Failure - the gift nobody wants

In the DocumentSetWorker, Akka actors perform various tasks. An actor can fail in one of three ways:

1. An expected failure, handled by our code, which leads to a helpful error message for the user.
2. An unexpected failure, which our code should recover from and log as an error (the log entry generates an email to us).
3. A catastrophic failure, such as an Out of Memory Error, after which we can make no guarantees about the consistency of the JVM and the whole process should be restarted.

Our current approach is to also retry the failing operation in the last two cases. We retry each top-level job a fixed number of times, as configured by the max_job_retry_attempts key, and then set the job to a permanent error state.

Expected Failures

When executing code that may throw an exception, actors catch the exception and convert it to an error message. For example, if PdfBox throws an exception during text extraction, the actor catches the exception, writes a DocumentProcessingError entry to the database, and reports the text extraction task as done. From the system's perspective, this failure counts as successful completion of the task.
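The pattern can be sketched in plain Java, without Akka or PdfBox. Every name below (TextExtractor, ProcessingError, ExtractionTask, the in-memory errorLog) is a hypothetical stand-in for illustration, not the project's actual code. The point it shows is that the failing branch still returns a result marked as done.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a DocumentProcessingError row.
final class ProcessingError {
    final String fileName;
    final String message;

    ProcessingError(String fileName, String message) {
        this.fileName = fileName;
        this.message = message;
    }
}

// Hypothetical result type: "done" is true whether or not text was extracted.
final class ExtractionResult {
    final String text;
    final boolean done;

    ExtractionResult(String text, boolean done) {
        this.text = text;
        this.done = done;
    }
}

// Hypothetical extractor; a PdfBox-backed implementation could throw here.
interface TextExtractor {
    String extract(String fileName) throws Exception;
}

class ExtractionTask {
    // Stands in for the database table of recorded errors.
    final List<ProcessingError> errorLog = new ArrayList<>();

    ExtractionResult run(String fileName, TextExtractor extractor) {
        try {
            return new ExtractionResult(extractor.extract(fileName), true);
        } catch (Exception e) {
            // Expected failure: record it for the user, then report the task as done.
            errorLog.add(new ProcessingError(fileName, String.valueOf(e.getMessage())));
            return new ExtractionResult("", true);
        }
    }
}
```

Because the exception never leaves run, the supervisor is not involved and nothing is retried.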

Unexpected Failures

Akka defines a supervision hierarchy in which parent actors are responsible for handling uncaught exceptions in their child actors. Since we want our system to keep running, the supervisors in the DocumentSetWorker simply log the exception and restart any failing child actor (using Akka's SupervisorStrategy). The job queue used to manage Text Extraction for File Import jobs detects when a worker actor dies, removes it from the worker pool, and reschedules the task that worker was processing.
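The rescheduling behaviour of the job queue can be illustrated with a small, single-threaded sketch. TaskQueue and its methods are hypothetical names. How the real queue learns that a worker died is not described here, so workerDied is simply assumed to be called at that moment.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TaskQueue {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Set<String> idleWorkers = new HashSet<>();
    // Maps each busy worker to the task it is currently processing.
    private final Map<String, String> inFlight = new HashMap<>();

    void submit(String task) {
        pending.addLast(task);
    }

    void workerReady(String worker) {
        idleWorkers.add(worker);
    }

    // Gives the next pending task to an idle worker and remembers the assignment.
    // Returns null if nothing is pending or the worker is not in the idle pool.
    String assign(String worker) {
        if (pending.isEmpty() || !idleWorkers.remove(worker)) {
            return null;
        }
        String task = pending.removeFirst();
        inFlight.put(worker, task);
        return task;
    }

    void taskDone(String worker) {
        inFlight.remove(worker);
        idleWorkers.add(worker);
    }

    // Assumed to be called when the queue notices that a worker died.
    void workerDied(String worker) {
        idleWorkers.remove(worker);            // drop the worker from the pool
        String task = inFlight.remove(worker); // find what it was working on
        if (task != null) {
            pending.addFirst(task);            // reschedule that task first
        }
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Putting the orphaned task back at the front of the queue is a design choice made for this sketch; the real queue may order rescheduled tasks differently.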

Catastrophic Failures

Failures such as Out of Memory errors leave the JVM in a state where our code is not guaranteed to run properly. Instead of trying to handle Errors, we try to force the process to stop, which triggers a restart. Our Akka SupervisorStrategies therefore escalate any Errors (non-Exception Throwables) until they cause the JVM to exit. Errors are logged when possible, but this logging cannot be relied on, since the error may be severe enough that the JVM simply dies.
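The split between restarting and escalating comes down to one type check on the failure. The sketch below expresses that decision in plain Java; the Directive enum and FailurePolicy class are hypothetical stand-ins for Akka's supervision directives, not the Akka API itself.

```java
// Hypothetical stand-in for the two supervision outcomes discussed in the text.
enum Directive { RESTART, ESCALATE }

class FailurePolicy {
    // Exceptions are treated as recoverable: the failing child is restarted.
    // Anything else (java.lang.Error and other non-Exception Throwables)
    // is passed up the hierarchy so that the process eventually exits.
    static Directive decide(Throwable failure) {
        if (failure instanceof Exception) {
            return Directive.RESTART;
        }
        return Directive.ESCALATE;
    }
}
```

For example, FailurePolicy.decide(new IllegalStateException()) yields RESTART, while FailurePolicy.decide(new OutOfMemoryError()) yields ESCALATE.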

Optimistic Retry

Whether an error led to restarting a single actor or the entire JVM, the system attempts to retry the task that was in progress when the failure occurred. If the failure was due to some random event, the task may succeed when it is retried. However, if the failure occurs consistently with the same input, an unbounded retry would fail and retry forever. This is why we retry each job at most max_job_retry_attempts times.
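A bounded retry can be sketched as a simple loop. RetryingJobRunner and JobState are hypothetical names, and the parameter maxJobRetryAttempts stands in for the max_job_retry_attempts key. The sketch assumes the key counts retries after the first attempt, and it keeps the counter in memory. A counter that must survive a JVM restart would have to be stored outside the process, which this example does not show.

```java
// Hypothetical terminal states for a top-level job.
enum JobState { SUCCEEDED, PERMANENT_ERROR }

class RetryingJobRunner {
    // Runs the job once, then retries up to maxJobRetryAttempts more times.
    static JobState run(Runnable job, int maxJobRetryAttempts) {
        for (int attempt = 0; attempt <= maxJobRetryAttempts; attempt++) {
            try {
                job.run();
                return JobState.SUCCEEDED;
            } catch (RuntimeException e) {
                // Unexpected failure: log it and let the loop try again.
                System.err.println("Attempt " + (attempt + 1) + " failed: " + e.getMessage());
            }
        }
        // Every allowed attempt failed, so stop retrying.
        return JobState.PERMANENT_ERROR;
    }
}
```

A job that fails on every attempt with maxJobRetryAttempts set to 2 is run three times in total and then ends in PERMANENT_ERROR.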
