# Error Handling

## Holds

In previous tutorials we mentioned that HTMap is able to track the status of your components and inform you about something called a "hold".
A hold occurs when HTCondor notices something wrong about your map component.
Perhaps an input file that HTCondor needs to transfer to the component is missing.

A missing input file is easy to force, so let's do it and see what happens:

In [1]:
import htmap

@htmap.mapped
def foo(_):
    return "I didn't get held!"

In [2]:
htmap.remove('will-get-held')
will_get_held = foo.map(
    'will-get-held',
    [None],
    map_options = htmap.MapOptions(
        input_files = ['this-file-does-not-exist']
    ),
)

In [3]:
list(will_get_held)

MapComponentHeld: component 0 of map will-get-held is held. Reason: [13] Error from slot1@galain: SHADOW at 10.0.1.6 failed to send file(s) to <10.0.1.6:64868>: error reading from D:/GitHubProjects/htmap/docs/source/tutorials/this-file-does-not-exist: (errno 2) No such file or directory; STARTER failed to receive file(s) from <10.0.1.6:9618>

Yikes!
HTMap has raised an exception to inform us that a component of our map got held.
It also tells us why HTCondor held the component: `Error from slot1@galain: SHADOW at 10.0.1.6 failed to send file(s) to <10.0.1.6:64868>: error reading from D:/GitHubProjects/htmap/docs/source/tutorials/this-file-does-not-exist: (errno 2) No such file or directory; STARTER failed to receive file(s) from <10.0.1.6:9618>`.

This time around the hold reason is pretty clear: a local file that HTCondor expected to exist didn't.
We could fix the problem by creating the file, and then releasing the map, which tells HTCondor to try again:

In [4]:
from pathlib import Path

path = Path('this-file-does-not-exist')
path.touch()  # this creates an empty file

Now the map will run successfully:

In [5]:
will_get_held.release()
print(list(will_get_held))

["I didn't get held!"]


And, of course, clean up:

In [6]:
path.unlink()

Unfortunately, holds will often not be so easy to resolve.
Sometimes they are simply ephemeral errors that can be resolved by releasing the map without changing anything.
But sometimes you'll need to talk to your HTCondor pool administrator to figure out what's going wrong.
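
The missing-file hold shown above is one of the few that can be caught before submitting anything. The following is a minimal sketch, not part of HTMap: `find_missing_input_files` is a hypothetical helper that checks a list of paths locally, on the assumption that the same list is what you would pass as `input_files`.

```python
from pathlib import Path


def find_missing_input_files(input_files):
    """Return the paths from input_files that do not exist locally."""
    return [Path(p) for p in input_files if not Path(p).exists()]


# Hypothetical pre-flight check, run before creating the map.
missing = find_missing_input_files(['this-file-does-not-exist'])
if missing:
    print('Missing input files:', [str(p) for p in missing])
```

A check like this only covers files that are absent on the submit machine. It does nothing for holds caused by the pool itself.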

## Execution Errors

HTMap can also detect Python exceptions that occur during component execution.
To see this in action, let's define a function where a component will have a problem:

In [7]:
@htmap.mapped
def inverse(x):
    return 1 / x

When `x = 0`, `inverse(x)` will fail with a `ZeroDivisionError`.
If we run it locally, the error will halt execution and drop a traceback into our laps:

In [8]:
inverse(0)

ZeroDivisionError: division by zero

The traceback has a lot of critically useful information in it. In fact, it tells us exactly which line raised the error (remember that tracebacks should be read in reverse: the last block of source code is where the error began).

HTMap is able to transport this kind of information back from an executing component but, like the regular output of a map, we won't see it until we try to load the output of the failed component.
We'll make a one-component map to demonstrate what happens:

In [9]:
bad_map = inverse.transient_map([0])
list(bad_map)

MapComponentError: component 0 of map tmp-1541480851-0 encountered error while executing. Error report:
=========  Start error report for component 0 of map tmp-1541480851-0  =========
Landed on execute node galain.eau.wi.charter.com (10.0.1.6) at 2018-11-06 05:07:33.948272

Python executable is C:\Program Files\Python36\python.exe (version 3.6.5 final)
with installed packages
  cloudpickle==0.5.3
  numpy==1.15.2

Working directory contents are
  C:\condor\execute\dir_18632\.chirp.config
  C:\condor\execute\dir_18632\.job.ad
  C:\condor\execute\dir_18632\.machine.ad
  C:\condor\execute\dir_18632\0.in
  C:\condor\execute\dir_18632\condor_exec.py
  C:\condor\execute\dir_18632\func
  C:\condor\execute\dir_18632\_condor_stderr
  C:\condor\execute\dir_18632\_condor_stdout

Exception and traceback (most recent call last):
  File "<ipython-input-7-769ac4dfb4b6>", line 3, in inverse
    return 1 / x

    Local variables:
      x = 0

  ZeroDivisionError: division by zero

==========  End error report for component 0 of map tmp-1541480851-0  ==========

Neat!
This traceback is, unfortunately, harder to read than the other one.
We need to ignore everything above `MapComponentError: component 0 of map tmp-1541480851-0 encountered error while executing. Error report:` - it's just about the internal error that HTMap is raising to propagate the error to us.
The real error is the stuff below `=========  Start error report for component 0 of map tmp-1541480851-0  =========`.

Since we're trying to debug remotely, HTMap has gathered some metadata about the HTCondor "execute node" where the component was running.
First it tells us where it is and when the component started executing.
Next, the report tells us about the Python environment that was used to execute your function, including a list of installed packages.
We also get a listing of the contents of the working directory - in this example, because we didn't add any extra input files, it's just a bunch of files that HTCondor and HTMap are using.

The meat of the error is the last thing in the error report.
We get roughly the same information that we got in the local traceback, but we also get a printout of the local variables in each stack frame.
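
To get a feel for what a traceback with local variables looks like, something similar can be produced locally with the standard library. This is a rough sketch and not how HTMap builds its error reports: `format_error_with_locals` is a hypothetical helper, and `inverse` is redefined here as a plain, undecorated function.

```python
import traceback


def inverse(x):
    return 1 / x


def format_error_with_locals(func, *args):
    """Call func(*args); return a traceback string that includes each
    frame's local variables, or None if the call succeeds."""
    try:
        func(*args)
    except Exception as exc:
        # capture_locals=True records the local variables of every frame.
        tb = traceback.TracebackException.from_exception(exc, capture_locals=True)
        return ''.join(tb.format())
    return None


report = format_error_with_locals(inverse, 0)
print(report)
```

The printed report lists each stack frame followed by its local variables, so the frame for `inverse` includes a line showing `x = 0`.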

Since the local HTMap error is raised as soon as it finds a bad component, you may find it convenient to look at _all_ of the error reports for your map (hopefully not too many!).
[htmap.Map.error_reports](../api.rst#htmap.Map.error_reports) provides exactly this functionality:

In [10]:
htmap.remove('worse')
worse_map = inverse.map('worse', [0, 0, 0])
for report in worse_map.error_reports():
    print(report + '\n')

Landed on execute node galain.eau.wi.charter.com (10.0.1.6) at 2018-11-06 05:07:44.016664

Python executable is C:\Program Files\Python36\python.exe (version 3.6.5 final)
with installed packages
  cloudpickle==0.5.3
  numpy==1.15.2

Working directory contents are
  C:\condor\execute\dir_15820\.chirp.config
  C:\condor\execute\dir_15820\.job.ad
  C:\condor\execute\dir_15820\.machine.ad
  C:\condor\execute\dir_15820\0.in
  C:\condor\execute\dir_15820\condor_exec.py
  C:\condor\execute\dir_15820\func
  C:\condor\execute\dir_15820\_condor_stderr
  C:\condor\execute\dir_15820\_condor_stdout

Exception and traceback (most recent call last):
  File "<ipython-input-7-769ac4dfb4b6>", line 3, in inverse
    return 1 / x

    Local variables:
      x = 0

  ZeroDivisionError: division by zero


Landed on execute node galain.eau.wi.charter.com (10.0.1.6) at 2018-11-06 05:07:44.024664

Python executable is C:\Program Files\Python36\python.exe (version 3.6.5 final)
with installed packages
  cloudpic

Unlike held components, components that hit errors are generally not worth re-running (they'll just fail again).
The one common exception is a dependency problem that you can fix by changing something in your delivery method - for example, updating the Docker image that your map is using.

Instead, an error is generally a signal that you've got a bug in your own code.
Remove your map, debug the error locally, then create a new map.
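
One way to do the local debugging step is to run the function over the same inputs on your own machine and collect the ones that fail. The sketch below assumes the function can be called locally, as `inverse(0)` was earlier. `find_failing_inputs` is a hypothetical helper, not an HTMap function, and `inverse` is redefined without the decorator so the block stands alone.

```python
def inverse(x):
    return 1 / x


def find_failing_inputs(func, inputs):
    """Run func on each input locally.

    Returns a list of (index, input, exception) tuples, one per failure.
    """
    failures = []
    for index, value in enumerate(inputs):
        try:
            func(value)
        except Exception as exc:
            failures.append((index, value, exc))
    return failures


failures = find_failing_inputs(inverse, [1, 0, 2, 0])
for index, value, exc in failures:
    print(f'input {index} ({value!r}) failed: {type(exc).__name__}: {exc}')
```

Once the list of failures is empty, the inputs should be safe to submit as a new map, at least as far as bugs in your own code are concerned.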