## Fault-Tolerance

One of the most important aspects of the MapReduce distributed system is fault-tolerance.

If <ins>at all times at least one</ins> *master* service is alive, we guarantee fault tolerance for the following scenarios:
1. *Worker* failures at any time.
2. *Master* failures at any time.

Our system not only guarantees fault tolerance, but also that after a *worker* or *master* failure the computation continues exactly where it left off, as if nothing had happened. Nothing is recomputed unless it is necessary.

This notebook aims to simulate an extreme crisis situation in our MapReduce system: a perfect storm. Imagine a scenario where all the workers, in the midst of executing a MapReduce job, abruptly fail. As if this isn't catastrophic enough, the master responsible for coordinating that very job also shuts down unexpectedly, leaving a lone master as the sole survivor in our distributed system.

Despite such a seemingly apocalyptic event, our system is designed to rise like a phoenix from the ashes. We will revive the system, scaling it back to full operation, and watch as it picks up the computation exactly where it left off, continuing towards successful completion as if the disaster never occurred. This notebook serves to demonstrate the robustness of our system, even when faced with the most severe challenges imaginable.

Note: Addressing *worker* failures alone was comparatively simple. Overcoming *master* failures presented its own challenges, which forced us to revise certain parts of our system. However, managing simultaneous failures of both *workers* and *masters*, especially within the same job, is an entirely different level of complexity. This scenario demands an extremely robust and meticulous implementation, because numerous intricate edge cases open the door to race conditions. The system must not only handle each kind of failure on its own, but also stay coordinated and synchronized when they occur concurrently.

### Authenticate

Use the `Auth` intermediary interface for quick authentication.

In [None]:
from mapreduce.authentication.auth import Auth

In [None]:
auth = Auth(username='admin', password='admin')

In [None]:
auth.is_authenticated()

### Initialize the docker-compose network 

In [None]:
from mapreduce.cluster.local_cluster import LocalCluster

In [None]:
cluster = LocalCluster(
    auth=auth,
    n_workers=4,
    n_masters=2,
    initialize=True,
    verbose=False
)

As in the previous notebook, let's assume that our objective is to count how many times each character appears in a list of words.

In [None]:
def map_func(data):
    result = []
    for string in data:
        for char in string:
            result.append((char, 1))
    return result

def reduce_func(data):
    return sum(data)
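
To make the semantics of these two functions concrete before running them on the cluster, here is a minimal single-process sketch of the map, group-by-key, and reduce steps. `local_mapreduce` is a hypothetical helper written for illustration only; it is not part of the `mapreduce` package, and the result format returned by the cluster may differ.

```python
from collections import defaultdict

def map_func(data):
    # Emit a (char, 1) pair for every character of every string.
    result = []
    for string in data:
        for char in string:
            result.append((char, 1))
    return result

def reduce_func(data):
    # Sum all the 1s collected for a single character.
    return sum(data)

def local_mapreduce(data, map_func, reduce_func):
    # Hypothetical single-process reference: map, group by key, reduce.
    grouped = defaultdict(list)
    for key, value in map_func(data):
        grouped[key].append(value)
    return {key: reduce_func(values) for key, values in grouped.items()}

sample_counts = local_mapreduce(["abc", "aab"], map_func, reduce_func)
print(sample_counts)  # {'a': 3, 'b': 2, 'c': 1}
```

A local reference like this is handy later on: once the job survives the failures, its output can be compared against the counts computed here on the same input.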

Create a reasonably large amount of data.

In [None]:
import random
import string

def generate_random_string(str_len):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(str_len))

def generate_random_string_list(n, str_len):
    return [generate_random_string(str_len) for _ in range(n)]

data = generate_random_string_list(n=500_000, str_len=20)

In [None]:
data[:10]

### Kill workers amidst computation

This part (and the next one as well) will not be entirely sequential. We have configured `docker-compose` so that the *workers* and *masters* use the *container-id* as their *hostname*. Hence, grab the 4 *worker* container-ids from the output below and prepare the `docker kill` statement.

In [None]:
cluster.local_monitoring.print_zoo()

In [None]:
cluster.local_monitoring.print_hdfs('jobs')

Note: if `requested_n_workers` is not provided, it defaults to `None`, which the underlying computation interprets as "use as many workers as possible".
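
As a rough illustration of that default, the sketch below shows one way a `None` request could be resolved into a concrete worker count. This is an assumption made for illustration, not the package's actual code: `resolve_n_workers` and `n_alive_workers` are hypothetical names, and capping an explicit request at the number of alive workers is also assumed behaviour.

```python
from typing import Optional

def resolve_n_workers(requested_n_workers: Optional[int], n_alive_workers: int) -> int:
    # Hypothetical helper, not part of the mapreduce package.
    if requested_n_workers is None:
        # None means "use as many workers as possible".
        return n_alive_workers
    # Assumption: an explicit request cannot exceed the alive workers.
    return min(requested_n_workers, n_alive_workers)

print(resolve_n_workers(None, 4))  # 4
print(resolve_n_workers(2, 4))     # 2
```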

By now, you should have replaced the ids in the `docker kill` command with the appropriate ones from above (loaded our gun). We are ready to submit the job and kill the 4 workers on the spot.
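
To avoid typos when assembling that command by hand, it can be built from a list of ids. This is a minimal sketch using only the standard library. `build_docker_kill` is a hypothetical helper, not part of the `mapreduce` package, and the ids shown are the ones from this particular run, so they will differ on yours.

```python
import shlex

def build_docker_kill(container_ids):
    # Hypothetical helper: turn a list of container ids into one shell command.
    ids = [cid.strip() for cid in container_ids if cid.strip()]
    if not ids:
        raise ValueError("no container ids given")
    return shlex.join(["docker", "kill", *ids])

# Example ids from this run; replace them with the ones printed on your machine.
worker_ids = ["a140e0bbeaf1", "b09d2243f538", "9e79739d768c", "2b8234e85d64"]
kill_cmd = build_docker_kill(worker_ids)
print(kill_cmd)  # docker kill a140e0bbeaf1 b09d2243f538 9e79739d768c 2b8234e85d64
```

Inside a notebook, the resulting string can be run with IPython's shell escape, for example `!{kill_cmd}`.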

In [None]:
import time

future = cluster.mapreduce(
    data=data, 
    map_func=map_func, 
    reduce_func=reduce_func
)

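# Approximate job timeline: ~1s map tasks not yet received, ~2-4s in map,
# ~4.5-21s in shuffle, ~21.5s just before reduce (no reduce task created yet),
# ~22s job completes. Sleeping 3s should land the kill during the map phase.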
time.sleep(3)

# Shell command (IPython `!` syntax): kill the 4 workers
!docker kill a140e0bbeaf1 b09d2243f538 9e79739d768c 2b8234e85d64

In [None]:
future

We have now killed the 4 workers. Let's take a look at the state of the computation.

In [None]:
cluster.local_monitoring.print_zoo()

### Kill the master handling the job

Observe that our cluster is currently bereft of workers, which brings the computation to a standstill. But we're not stopping there. Let's intensify this scenario further. We'll seek out the master overseeing the job and kill it. In doing so, we'll reduce our system to the minimal state our fault-tolerance guarantees require: a single surviving master in the midst of chaos.

In [None]:
!docker kill 4d2517722e41

In [None]:
cluster.local_monitoring.print_zoo()

Notice that all the *responsibilities* of the dead master (the tasks of the dead *workers* and the job) have been transferred to the last surviving master. For now, our system is frozen.

### Scale the system

Let's jumpstart the system by scaling it and get the computation back on track. We will add 10 workers to the system so the computation terminates quickly (remember that we passed `requested_n_workers` as `None` - use as many workers as possible).

In [None]:
cluster.scale(n_workers=10)

In [None]:
cluster.local_monitoring.print_zoo()

### Shutdown

In [None]:
cluster.shutdown_cluster(cleanup=True)