### Config

Add to `/etc/hosts` the following:
```
127.0.0.1       datanode
```

We do this because the Hadoop *namenode* (which we talk to for HDFS) returns the hostname of the datanode (i.e., `datanode`), but that hostname only resolves inside the docker-compose network. Mapping it to `127.0.0.1` lets the host machine resolve it as well. The lookup happens internally in the `kazoo` library, hence this is the most straightforward solution.
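To confirm the entry is in place, a small check along these lines can help. This is only a sketch: `hosts_maps` is a hypothetical helper (not part of the project), and the commented-out part assumes the standard Linux location of the hosts file.

```python
def hosts_maps(hosts_text: str, hostname: str, ip: str) -> bool:
    """Return True if hosts_text maps `hostname` to `ip`."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        entry = line.split("#", 1)[0].split()
        # A hosts entry has the form: <ip> <name> [<alias> ...]
        if len(entry) >= 2 and entry[0] == ip and hostname in entry[1:]:
            return True
    return False


sample = """
127.0.0.1       localhost
127.0.0.1       datanode   # HDFS datanode inside docker-compose
"""
print(hosts_maps(sample, "datanode", "127.0.0.1"))

# On the real machine (path assumed):
# with open("/etc/hosts") as f:
#     print(hosts_maps(f.read(), "datanode", "127.0.0.1"))
```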

# Ecosystem

Please use `light` theme.

![system-architecture](images/docker_compose.png)

We have a `docker-compose` network of many different services. They are all managed by `mapreduce.cluster.LocalCluster`, which issues different `docker-compose` commands for different purposes (for instance, `.scale`, `.clear`, `.shutdown_cluster`, etc.). Moreover, as users we submit jobs to the system using `.mapreduce` and get back `concurrent.futures` futures, but more on that later... In order to use `LocalCluster` one must authenticate with `Auth` (just use `username`='admin', `password`='admin' 😊).

The only requirement for the MapReduce system to work is an accessible (`IP`, `PORT`) for `Zookeeper` and for `HDFS` (in order to submit a job), and that's it!

Services:

1. `HDFS` : The Hadoop Distributed Filesystem, as the name suggests, works as a distributed filesystem. We chose `HDFS` because we wanted to be able to deploy the system effortlessly in a real cluster (assumptions about a shared filesystem would make deployment more difficult).
2. `Zookeeper` : The `Zookeeper` service (replicated 3 times) is mainly used for its extremely helpful recipes. We found the following quite handy:
   1. Distributed Mutual Exclusion recipe
   2. Setting up `Watcher` callbacks for different purposes (`ChildrenWatch` - watches for z-node children updates, `DataWatch` - watches for specific z-node data updates).
   3. Distributed sequential ID generator recipe
3. `Worker` : The workers perform the Tasks. We have three endpoints, i.e., `/map-task`, `/shuffle-task` and `/reduce-task`.
4. `Master` : The masters, each able to handle MapReduce jobs in parallel (`threading`), are responsible for the proper execution of the Jobs. They asynchronously send `POST` requests to the workers' endpoints and wait for results. Moreover, they are responsible for the fault tolerance of the distributed system (handling `worker` deaths and other `master` deaths). Notably, the code for the masters is <ins>entirely</ins> `callbacks`. <ins>Things happen, and the master handles them accordingly</ins> (in coordination with other masters through Zookeeper recipes)!

(Note: Yes, the `GIL` exists, but each master is I/O bound, i.e., it spends most of its time waiting for things to happen. Hence, `threading` is totally fine, if not better than the alternatives.)
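The I/O-bound argument can be illustrated with a small standard-library sketch. It is not the masters' actual code: `fake_worker_request` is a hypothetical stand-in for a blocking `POST` to a worker, simulated here with `time.sleep`, which releases the GIL just like a real network wait.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_worker_request(task_id: int, delay: float = 0.2) -> int:
    """Stand-in for a blocking request to a worker endpoint."""
    time.sleep(delay)  # the thread waits here without holding the GIL
    return task_id


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fake_worker_request, range(8)))
elapsed = time.perf_counter() - start

# The eight 0.2 s waits overlap, so the total stays far below 8 * 0.2 s.
print(results, f"{elapsed:.2f}s")
```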

### ?

In [None]:
import sys
sys.path.append('/home/miketheologitis/MapReduce-Implementation')

### Authenticate

Use the `Auth` in-between interface for fast authentication.

In [None]:
from mapreduce.authentication.auth import Auth

In [None]:
auth = Auth(username='admin', password='admin')

In [None]:
auth.is_authenticated()

### Initialize the docker-compose network 

As mentioned before, we use `LocalCluster` as the docker-compose network management and job-submission service.

In [None]:
from mapreduce.cluster.local_cluster import LocalCluster

In [None]:
cluster = LocalCluster(
    auth=auth,
    n_workers=4,
    n_masters=1,
    initialize=True,
    verbose=False
)

As a sanity check, we can run `docker ps` in a terminal.

The `LocalCluster` has a `LocalMonitoring` instance with methods that print the state of the cluster in a readable format.

Let's use it to print the current state of the `Zookeeper` z-node tree.

In [None]:
cluster.local_monitoring.print_zoo()

For HDFS (will contain nothing right now):

In [None]:
cluster.local_monitoring.print_hdfs('jobs')

## MapReduce first job submission

We first need to define the input data, the map function and the reduce function. The map and reduce functions are assumed to have the following signatures:

`map([x1, x2, ...]) -> [(k1, v1), (k2, v2), ...]`

`reduce([v1, v2, ...]) -> y`

where every element is arbitrary (any data structure).

Let's assume that our objective is to count how many times each character appears in a list of words.

In [None]:
data = ['dasdsagf', 'mike', 'george', 'gertretr123', 'dsadsajortriojtiow']

def map_func(data):
    result = []
    for string in data:
        for char in string:
            result.append((char, 1))
    return result

def reduce_func(data):
    return sum(data)

For a quick sanity check:

In [None]:
map_func(['dasdsagf'])
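To see what the distributed run is expected to compute, the whole pipeline can be mimicked in a single process. The sketch below is a local reference only, not the system's implementation: `local_mapreduce`, `map_chars` and `sum_values` are hypothetical names (the latter two mirror `map_func` and `reduce_func`), and the sorted list of `(key, value)` pairs is an assumed output shape chosen for readability.

```python
from collections import defaultdict


def map_chars(data):
    # Emit (character, 1) for every character of every string.
    return [(char, 1) for string in data for char in string]


def sum_values(values):
    return sum(values)


def local_mapreduce(data, map_func, reduce_func, n_chunks=2):
    """Single-process reference: map per chunk, group by key, reduce per key."""
    # Split the input into chunks, one per hypothetical map task.
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    # Map phase: each chunk independently yields (key, value) pairs.
    pairs = [pair for chunk in chunks for pair in map_func(chunk)]
    # Shuffle phase: collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return sorted((key, reduce_func(vals)) for key, vals in grouped.items())


print(local_mapreduce(['mike', 'george'], map_chars, sum_values))
# [('e', 3), ('g', 2), ('i', 1), ('k', 1), ('m', 1), ('o', 1), ('r', 1)]
```

Because the reduce step here is a plain sum, the result does not depend on how the input is chunked, which is what allows the map tasks to be spread over several workers.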

We are now ready to submit the job to the MapReduce distributed system. We will use `.mapreduce` from `LocalCluster`. Note that it returns a `concurrent.futures` future object, which represents a computation that hasn't necessarily completed yet. It's essentially a <ins>promise</ins> to hold the result of a computation that might still be ongoing, hence the name "future".

In [None]:
future = cluster.mapreduce(
    data=data, 
    map_func=map_func, 
    reduce_func=reduce_func, 
    requested_n_workers=4
)

In [None]:
future

Let's inspect what is happening behind the scenes.

In [None]:
cluster.local_monitoring.print_zoo()

In [None]:
cluster.local_monitoring.print_hdfs('jobs')

We can get the result of the computation using `.result()`, which blocks until the job has finished:

In [None]:
future.result()
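For readers unfamiliar with futures, the following standard-library sketch shows the general behaviour of `.done()` and `.result()`. It uses a local thread pool and a hypothetical `slow_job` function instead of the cluster, so it says nothing about how `LocalCluster` produces its futures internally.

```python
import concurrent.futures
import time


def slow_job():
    time.sleep(0.3)  # stand-in for a job running on the cluster
    return "done"


with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    job = pool.submit(slow_job)
    started_done = job.done()          # the job has only just started
    try:
        job.result(timeout=0.05)       # give up waiting after 50 ms
        timed_out = False
    except concurrent.futures.TimeoutError:
        timed_out = True
    value = job.result()               # blocks until the job finishes

print(started_done, timed_out, value)
```

Passing a `timeout` to `.result()` is useful in a notebook when you want to peek at a job without blocking the cell indefinitely.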

## Heavy-Load Computation

Let's subject the system to a lot of concurrent jobs and tasks. But first, we must scale the cluster. (Services that are already alive are not impacted by this; they continue their tasks as normal.)

In [None]:
cluster.scale(
    n_masters=3,
    n_workers=10
)

In [None]:
cluster.local_monitoring.print_zoo()

Let's create the same kind of `data` list of strings, but considerably larger.

In [None]:
import random
import string

def generate_random_string(str_len):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(str_len))

def generate_random_string_list(n, str_len):
    return [generate_random_string(str_len) for _ in range(n)]

# Generate 100k random strings of length 20
data = generate_random_string_list(n=100_000, str_len=20)

In [None]:
len(data)

In [None]:
data[:10]

Now it is time to submit a few jobs. For simplicity, we will reuse the same `data`, `map_func` and `reduce_func`. We will submit 10 such jobs and we expect the system to handle them concurrently. Note that we, as the `host`, must upload the job inputs (*map func*, *reduce func*, and *data*) to HDFS, which is why the following cell will not finish immediately.

In [None]:
futures = []

for i in range(10):
    future = cluster.mapreduce(data, map_func, reduce_func, requested_n_workers=2)
    futures.append(future)
    # Report progress only after the submission call has returned.
    print(f"Job {i+1} submitted.", end='\r')

We will print the futures to see if they are `running` or `finished`.

In [None]:
for future in futures:
    print(future)
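Instead of polling, results can also be consumed as they become available. The sketch below assumes the objects returned by `.mapreduce` behave like standard `concurrent.futures` futures, as described earlier; it uses a local thread pool and a hypothetical `fake_job` function so that it runs without the cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_job(job_id, delay):
    time.sleep(delay)  # stand-in for a MapReduce job of varying duration
    return job_id


delays = [0.3, 0.1, 0.2]
with ThreadPoolExecutor(max_workers=3) as pool:
    # Map each future back to the index of the job it belongs to.
    pending = {pool.submit(fake_job, i, d): i for i, d in enumerate(delays)}
    completion_order = []
    for finished in as_completed(pending):
        completion_order.append(finished.result())
        print(f"Job {pending[finished]} finished")

# Shorter jobs typically show up first, regardless of submission order.
print(completion_order)
```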

In [None]:
cluster.local_monitoring.print_zoo()

Let's inspect some random results.

In [None]:
futures[1].result()

In [None]:
#cluster.clear()

See all results per row.

In [None]:
for row in zip(*[future.result() for future in futures]):
    print(row)
    print()
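Since all ten jobs received the same inputs, their results should agree with each other and with a local count. A sketch of such a check follows; `all_equal` and `expected_char_counts` are hypothetical helpers, and the list-of-`(key, value)`-pairs shape of each result is an assumption, so adapt the comparison to whatever `future.result()` actually returns.

```python
from collections import Counter


def expected_char_counts(strings):
    """Character frequencies computed locally, for comparison."""
    return Counter(char for s in strings for char in s)


def all_equal(results):
    """True if every result in the list equals the first one."""
    return all(result == results[0] for result in results[1:])


# Hypothetical results standing in for [f.result() for f in futures].
sample_data = ['abca', 'bc']
fake_results = [[('a', 2), ('b', 2), ('c', 2)] for _ in range(3)]

print(all_equal(fake_results))
print(dict(fake_results[0]) == expected_char_counts(sample_data))
```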

### Shutdown

Shut down the cluster and clean up the persistent HDFS.

In [None]:
cluster.shutdown_cluster(cleanup=True)