### Cluster Init

Each cluster worker has 4 CPU cores, so by default Dask will try to run 4 tasks in parallel on a single worker. This is a problem for two reasons. First, Dask does not know that a single task (a TensorFlow simulation) will likely use all 4 cores on its own. More importantly, Dask does not take into account the very limited RAM (~3 GB) of each worker. Unless we do something about this, the workers will run out of memory.

We can use Dask's `resources` feature to define custom resources on our workers. We define a resource named `PROCESS` and give each worker one unit of it. When we later `.submit` tasks, we tell Dask that a single task consumes all of a worker's `PROCESS` resource, i.e., `{"PROCESS": 1}`, so that Dask will not assign another task to that worker while it is busy. See the [Resources](https://distributed.dask.org/en/stable/resources.html) docs and the related *Stack Overflow* question [one task per worker](https://stackoverflow.com/questions/45052535/dask-distributed-how-to-run-one-task-per-worker-making-that-task-running-on-a).

Note: Dask does not attach any meaning to the `PROCESS` resource; it is purely conceptual. Dask only tracks that each worker has one unit of an arbitrary resource named `PROCESS` (it could just as well stand for a GPU, CPU, RAM, or whatever we decide).
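
The bookkeeping described above can be pictured with a toy model. The sketch below is not Dask's scheduler code; `max_concurrent_tasks` is a hypothetical helper that only mirrors the rule that a task may start on a worker when a thread is free and every resource it requests is still available.

```python
def max_concurrent_tasks(threads, worker_resources, task_resources):
    """Toy model: how many identical tasks fit on one worker at once.

    Hypothetical helper for illustration only, not part of Dask.
    """
    limit = threads  # without resource constraints: one task per thread
    for name, needed in task_resources.items():
        if needed <= 0:
            continue
        available = worker_resources.get(name, 0)
        limit = min(limit, int(available // needed))
    return limit


# A worker as configured in this notebook: 4 threads, PROCESS=1, ~3 GB RAM.
unrestricted = max_concurrent_tasks(4, {"PROCESS": 1}, {})
restricted = max_concurrent_tasks(4, {"PROCESS": 1}, {"PROCESS": 1})

print(unrestricted, "tasks would share 3 GB:", 3 / unrestricted, "GB each")
print(restricted, "task gets the full 3 GB")
```

With the `PROCESS` request attached, the toy worker admits one task at a time, which is the behaviour we rely on to stay within the RAM budget.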

In [19]:
!rm -rf slurm-*.out

In [20]:
from dask_jobqueue import SLURMCluster
import dask

cluster = SLURMCluster(
    local_directory='/storage/tuclocal/mtheologitis',
    processes=1,
    cores=4,
    memory='3 GB',
    queue="aTUC",
    walltime="240:00:00", # Set the walltime to 10 days (240 hours)
    interface='enp2s3',
    scheduler_options={'interface': 'eno1'},
    worker_extra_args=['--resources PROCESS=1']
)

In [21]:
NUMBER_OF_WORKERS = 8

cluster.scale(NUMBER_OF_WORKERS)

### Client Init

In [22]:
from dask.distributed import Client

client = Client(cluster)

In [23]:
client.wait_for_workers(n_workers=NUMBER_OF_WORKERS)

In [24]:
client

Client
Connection method: Cluster object
Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.0.0.100:8787/status

Cluster: 8 workers, 32 total threads, 22.32 GiB total memory

Scheduler
Comm: tcp://10.0.0.100:39613
Dashboard: http://10.0.0.100:8787/status
Started: Just now

Workers (each: 4 threads, 2.79 GiB memory, 0 B spilled, last seen just now, task counters empty; local directories are under `/storage/tuclocal/mtheologitis/dask-worker-space/`)

| Comm | Dashboard | Nanny | Local directory | CPU usage | Memory usage | Read bytes | Write bytes |
|---|---|---|---|---|---|---|---|
| tcp://10.0.0.105:42965 | http://10.0.0.105:42137/status | tcp://10.0.0.105:41093 | worker-ngv8rr34 | 47.1% | 107.52 MiB | 6.30 MiB | 258.79 kiB |
| tcp://10.0.0.106:42731 | http://10.0.0.106:34417/status | tcp://10.0.0.106:39843 | worker-ldwpwbe4 | 4.0% | 107.08 MiB | 0.0 B | 0.0 B |
| tcp://10.0.0.101:45483 | http://10.0.0.101:36335/status | tcp://10.0.0.101:35413 | worker-zvofz7cb | 2.0% | 107.79 MiB | 1.34 MiB | 71.33 kiB |
| tcp://10.0.0.103:42701 | http://10.0.0.103:39753/status | tcp://10.0.0.103:34605 | worker-f7nzfr0r | 4.0% | 107.50 MiB | 0.0 B | 0.0 B |
| tcp://10.0.0.108:43483 | http://10.0.0.108:37733/status | tcp://10.0.0.108:33791 | worker-nj9syzoz | 0.0% | 56.15 MiB | 0.0 B | 0.0 B |
| tcp://10.0.0.102:41613 | http://10.0.0.102:33735/status | tcp://10.0.0.102:45693 | worker-ah2zq_cr | 4.0% | 106.97 MiB | 0.0 B | 0.0 B |
| tcp://10.0.0.107:41827 | http://10.0.0.107:36791/status | tcp://10.0.0.107:34747 | worker-i_nltotu | 4.0% | 106.63 MiB | 4.01 MiB | 222.20 kiB |
| tcp://10.0.0.104:39397 | http://10.0.0.104:33879/status | tcp://10.0.0.104:43639 | worker-qtz532b4 | 47.2% | 105.64 MiB | 5.23 MiB | 215.59 kiB |
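
The summary reports 2.79 GiB per worker although the cluster was created with `memory='3 GB'`. A likely explanation, assuming the requested value is read as decimal gigabytes and displayed in binary gibibytes, is a plain unit conversion:

```python
requested_bytes = 3 * 10**9       # '3 GB' read as decimal gigabytes (assumption)
gib = requested_bytes / 2**30     # the same amount expressed in gibibytes

print(f"{gib:.2f} GiB per worker")
```

So the workers do not have less memory than requested; the two numbers describe the same budget in different units.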


In [25]:
cluster


### Load Data Lazily

In [26]:
import tensorflow as tf
from dask import delayed

@delayed
def load_data():
    (X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
    X_train, X_test = X_train / 255.0, X_test / 255.0

    return X_train, y_train, X_test, y_test

In [27]:
data_delayed = load_data()

### Upload the simulation module

This code must run **only after** all the workers have been started by the `cluster`. Workers created later will not have the file.

For future needs: if necessary, we can register a callback so that new workers get this file uploaded to them automatically.
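
One possible way to implement such a callback is the built-in `UploadFile` worker plugin from `distributed`. This is a minimal sketch, not something this notebook has run: `register_upload_plugin` is a hypothetical helper, and it assumes the installed `distributed` version ships `UploadFile` and exposes either `Client.register_plugin` (newer releases) or `Client.register_worker_plugin` (older releases).

```python
def register_upload_plugin(client, path="TF_Simulation_FDA_CNN.py"):
    """Register a plugin that ships `path` to current and future workers.

    Hypothetical helper; assumes `distributed` provides `UploadFile`.
    """
    # Imported lazily so that defining the helper needs no running cluster.
    from distributed.diagnostics.plugin import UploadFile

    plugin = UploadFile(path)
    # Use whichever registration method the installed client offers.
    register = getattr(client, "register_plugin", None) or client.register_worker_plugin
    register(plugin)
    return plugin
```

Calling `register_upload_plugin(client)` once after creating the client would then take the place of the manual upload, under the assumptions stated above.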

In [28]:
client.upload_file('TF_Simulation_FDA_CNN.py')

{'tcp://10.0.0.101:45483': {'status': 'OK'},
 'tcp://10.0.0.102:41613': {'status': 'OK'},
 'tcp://10.0.0.103:42701': {'status': 'OK'},
 'tcp://10.0.0.104:39397': {'status': 'OK'},
 'tcp://10.0.0.105:42965': {'status': 'OK'},
 'tcp://10.0.0.106:42731': {'status': 'OK'},
 'tcp://10.0.0.107:41827': {'status': 'OK'},
 'tcp://10.0.0.108:43483': {'status': 'OK'}}

### Training Simulation

We import `TF_Simulation_FDA_CNN.py`, which corresponds to `10_TF_Simulation_FDA_CNN.ipynb` in the `progress_notebooks` directory.

We made a few modifications to reduce RAM usage as much as possible.
1. In `prepare_federated_data_for_test`:
    1. changed `shuffle_size` to `2*batch_size` (we have to be careful with this because data randomness is important)
    2. removed `.prefetch(tf.data.AUTOTUNE)`.
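
The caveat about randomness can be made concrete with a pure-Python sketch of a fixed-size shuffle buffer. It is not TensorFlow code. It assumes that `shuffle_size` ends up as the buffer size of a `tf.data` shuffle, and that the buffer behaves as the `tf.data.Dataset.shuffle` documentation describes: fill the buffer, emit a random element, replace it with the next element from the stream. `buffered_shuffle` is a hypothetical helper.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Pure-Python sketch of a fixed-size shuffle buffer (not TensorFlow code)."""
    rng = random.Random(seed)
    iterator = iter(items)
    buffer = []
    output = []
    # Fill the buffer with the first `buffer_size` elements.
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered element and refill its slot from the stream.
    for item in iterator:
        slot = rng.randrange(len(buffer))
        output.append(buffer[slot])
        buffer[slot] = item
    # Drain what is left once the stream is exhausted.
    rng.shuffle(buffer)
    output.extend(buffer)
    return output


batch_size = 32
shuffled = buffered_shuffle(range(1000), buffer_size=2 * batch_size)

# How far ahead of its original position was any element emitted?
max_advance = max(value - position for position, value in enumerate(shuffled))
print(max_advance, "<=", 2 * batch_size - 1)
```

In this model an element can never be emitted more than `buffer_size - 1` positions early, so a small buffer only mixes samples locally. If the input data is ordered (for example by label), that limited mixing is exactly the risk the note above refers to.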

In [29]:
def worker_single_fda_simulation(data_delayed, fda_name, num_clients, batch_size, num_steps_until_rtc_check, 
                                 theta, num_epochs, sketch_width=-1, sketch_depth=-1, bench_test=False):
    
    import TF_Simulation_FDA_CNN as sim
    import gc
    
    X_train, y_train, X_test, y_test = data_delayed.compute()
    
    train_dataset, test_dataset = sim.convert_to_tf_dataset(X_train, y_train, X_test, y_test)
    
    del X_train, y_train, X_test, y_test
    
    epoch_metrics, round_metrics = sim.single_simulation(
        fda_name, num_clients, train_dataset, test_dataset, batch_size, num_steps_until_rtc_check,
        theta, num_epochs, sketch_width=sketch_width, sketch_depth=sketch_depth, bench_test=bench_test
    )
    
    del train_dataset, test_dataset
    
    gc.collect()  # force garbage collection
    sim.tf.keras.backend.clear_session()  # Clear TensorFlow session
    
    return epoch_metrics, round_metrics

### Dask Tasks

In [30]:
num_clients_list = [50, 53, 55, 60, 74]
batch_size_list = [32]
num_steps_until_rtc_check_list = [1]
theta_list = [1.]
num_epochs = 1

sketch_width = 500
sketch_depth = 7

In `TF_Simulation_FDA_CNN.py` we have general methods for testing all combinations of the lists above. In the cluster environment we want to break the tests up into smaller tasks. Given the limited RAM of each worker, we chose the finest granularity: each task is a single simulation (*naive*, *linear* or *sketch*) with fixed parameters. This has time-cost drawbacks: TensorFlow graphs are recomputed (specifically, in `AmsSketch`), a new `AmsSketch` instance is created for each *sketch* test, the `delayed` dataset is computed for each test, and so on. But it is definitely the safest choice given our RAM requirements.

In [31]:
futures = []

for num_clients in num_clients_list:
    for batch_size in batch_size_list:
        for num_steps_until_rtc_check in num_steps_until_rtc_check_list:
            for theta in theta_list:
                
                for fda_name in ["naive", "linear", "sketch"]:
        
                    future = client.submit(
                        worker_single_fda_simulation,
                        data_delayed=data_delayed, 
                        fda_name=fda_name,
                        num_clients=num_clients, 
                        batch_size=batch_size, 
                        num_steps_until_rtc_check=num_steps_until_rtc_check,
                        theta=theta, 
                        num_epochs=num_epochs,
                        sketch_width=sketch_width if fda_name == "sketch" else -1,
                        sketch_depth=sketch_depth if fda_name == "sketch" else -1,
                        bench_test=True,
                        resources={'PROCESS': 1}  # Tell Dask that the resource `PROCESS` is consumed in one task!
                    ) 

                    futures.append(future)
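
As a quick sanity check on the number of submitted tasks, the nested loops above can be flattened with `itertools.product`. This sketch only enumerates the parameter tuples and submits nothing; `fda_names` and `grid` are names introduced here for illustration.

```python
from itertools import product

num_clients_list = [50, 53, 55, 60, 74]
batch_size_list = [32]
num_steps_until_rtc_check_list = [1]
theta_list = [1.0]
fda_names = ["naive", "linear", "sketch"]

# One tuple per task, in the same order as the nested loops.
grid = list(product(num_clients_list, batch_size_list,
                    num_steps_until_rtc_check_list, theta_list, fda_names))

print(len(grid))   # 15
print(grid[0])     # (50, 32, 1, 1.0, 'naive')
```

Five client counts times three FDA variants give 15 tasks, which matches the total shown by the progress counter when results are gathered.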

Since we later call `.release` on completed futures so that workers free the memory associated with them (intermediate results held in their RAM), there is no point in using a progress bar.

### Gather and Save results

Because the workers have little RAM, we need to be careful: caching results in their RAM until all tasks have completed is not the way to go. If a worker dies we lose its results and must recompute them, and the cached results add memory overhead on the workers. Thus, we use `as_completed` to collect each result as soon as its task completes.

To go a step further, we also save each test's metrics in temporary `.parquet` files, which we will combine when the time comes.
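
Dask's futures interface extends the standard library's `concurrent.futures`, so the consume-as-they-finish pattern can be tried locally without a cluster. In the sketch below, `fake_simulation` is a hypothetical stand-in for a real task and the sleep durations are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_simulation(task_id, seconds):
    """Hypothetical stand-in for a long-running simulation."""
    time.sleep(seconds)
    return task_id


durations = [0.03, 0.01, 0.02]
finished = []

with ThreadPoolExecutor(max_workers=3) as pool:
    pending = [pool.submit(fake_simulation, i, d) for i, d in enumerate(durations)]
    # Results arrive in completion order, not submission order.
    for done in as_completed(pending):
        finished.append(done.result())  # this is where a result would be saved

print(finished)
```

Each result is handled the moment it is ready, so nothing has to wait for the slowest task before being written to disk.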

In [32]:
import os

def save_result_to_parquet(df, directory, file_prefix):
    # Create the output directory if it does not exist yet
    os.makedirs(directory, exist_ok=True)

    # Number the file by how many entries the directory already holds
    file_name = f"{file_prefix}_{len(os.listdir(directory))}.parquet"
    file_path = os.path.join(directory, file_name)
    
    df.to_parquet(file_path)

In [None]:
from dask.distributed import as_completed
import pandas as pd

tmp_epoch_metrics_dir = 'results/tmp_epoch_metrics'
tmp_round_metrics_dir = 'results/tmp_round_metrics'

total_futures = len(futures)
num_completed = 0

for future, result in as_completed(futures, with_results=True):

    epoch_metrics, round_metrics = result

    epoch_metrics_df = pd.DataFrame(epoch_metrics)
    save_result_to_parquet(epoch_metrics_df, tmp_epoch_metrics_dir, 'epoch_metrics')

    round_metrics_df = pd.DataFrame(round_metrics)
    save_result_to_parquet(round_metrics_df, tmp_round_metrics_dir, 'round_metrics')

    num_completed += 1
    print(f"\rProgress on Gathered-Saved Results: {num_completed} / {total_futures}", end="", flush=True)  # Print progress

    future.release()  # Do not store results in worker's memory anymore, we have saved it

Progress on Gathered-Saved Results: 12 / 15

### Combine & Save Results

TODO: maybe leave this to `~/metrics/combine_metrics.py`

In [None]:
# Read multiple Parquet files and combine them
all_epoch_metrics_df = pd.read_parquet(tmp_epoch_metrics_dir)
all_round_metrics_df = pd.read_parquet(tmp_round_metrics_dir)

all_epoch_metrics_df.to_parquet('results/epoch_metrics.parquet')
all_round_metrics_df.to_parquet('results/round_metrics.parquet')

### Terminate `Client` and `Cluster`

In [18]:
client.close()
cluster.close()