In [1]:
# imports
import sys
sys.path.append("../") # go to parent dir
from cluster_simulator.cluster import Cluster, Tier, bandwidth_share_model, compute_share_model, get_tier, convert_size
from cluster_simulator.phase import DelayPhase, ComputePhase, IOPhase
from cluster_simulator.application import Application
import simpy
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from cluster_simulator.analytics import display_cluster, display_apps, display_run
import numpy as np
from itertools import groupby
from operator import itemgetter
from loguru import logger
import itertools
import time

## Content of the workshop
Demo and review of the progress done on the recommendation system since april'2022:
1. Implemented features:
   1. Phases: Compute phase and I/O phases (there is also a delay phase)
   2. Application is a sequence of phases
   3. Cluster: a set of compute nodes with attached tiers
   4. Compute nodes, tiers capacities and bandwidths are globally shared resources
2. Not yet implemented features:
   1. Ephemeral tiers (burst buffer) with dedicated resources (datanode) and destaging/eviction mecanism (WIP)
   2. Workflow as a graph (dag) of phases
   3. Placement optimization heuristics (WIP)
3. What will be presented in this workshop:
   1. How application are described/represented, limitations and remarks
   2. Metrics consequent to running apps:
      1. individually
      2. in parallel
      3. concurrent I/O and bandwidth consumption
   3. Wiring any optimization heuristics with the simulator
      1. Principles/How to
      2. Example with black box optimizer (BBO)
      3. Results and discussion



---
![image](recommendation_system_diagram.png)


- performance model to be updated in time
- probability of tiers failure : how to use this in the model? Using bbo?
- real apps read (identified group of data) that could have been place in a specific tier, so read operation can only be done from this specific tier
- access latencies for each tier
- duplication as penalty to take into account changing tier for another IO

### 1.Implemented features 
The simulation environment is based on simpy, a discrete event simulation library in python.
An application is a sequence of:
- Compute phases:
  - duration: in seconds, as it may run with 1 core
  - cores: number of cores dedicated to the phase, at least 1, cores are shared and limited
  - a function to simulate parallelization (sqrt(1+cores)/sqrt(2))
  - if 10s with 1 core, it may take 10s/1.22 with 2 cores (bad //)
 
- IOPhases:
  - operation : 'read' or 'write'
  - volume: in bytes
  - pattern: 1 for pure sequential, 0 for random, 0.2: 20% seq and 80% random
  - (not implemeted yet): the blocksize, so the bandwidth is considered in the asymptotic part
  
- Application:
  - a linear sequence of phases
  - next phase cannot be executed until previous has succeeded
  - if an app starts by reading from an empty tier, tier level is automatically adjusted
     

In [2]:
# preparing execution environment variables
env = simpy.Environment()
data = simpy.Store(env)
app1 = Application(env, name="read1GB->Comp15s->Write10GB", # name of the app in the display
                   compute=[0, 15],  # two events, first at 0 and second at 15, and compute between them
                   read=[1e9, 0],    # read 1GB at 0, before compute phase, at the end do nothing (0)
                   write=[0, 10e9],  # write 0GB at first event, and 10GB at the second, after compute phase                   
                   data=data)    

#### App formalism
![image](app_formalism.png)

A cluster is a set of:
- compute nodes as shared resources:
   - cores: number of units of computing (can be replaced by CPU or cores depending on the app)
- storage tiers:
   - list of storage tiers with their characteristics:
      - name
      - capacity in GB, also a shared resource
      - a bandwidth (described below)
      - bandwidth share model:
         - tier hw gives a max_bandwidth value
         - I/O processes shares equally bandwidth when concurrent
         - examples, **2** processes are doing I/O on nvram tier, one writing seq (**50%** of 515 MB/s) and other read random (**50%** of 760MB/s) 
      

In [3]:
nvram_bandwidth = {'read':  {'seq': 780, 'rand': 760},   # throughput for read ops in MB/s
                   'write': {'seq': 515, 'rand': 505}}   # throughput for write ops in MB/s
ssd_bandwidth =   {'read':  {'seq': 210, 'rand': 190},
                   'write': {'seq': 100, 'rand': 100}}   # data is taken from IEEE'2013
hdd_bandwidth =   {'read':  {'seq': 80, 'rand': 80},
                   'write': {'seq': 40, 'rand': 40}}

# we register the tiers
hdd_tier = Tier(env, 'HDD', bandwidth=hdd_bandwidth, capacity=1e12)
ssd_tier = Tier(env, 'SSD', bandwidth=ssd_bandwidth, capacity=200e9)
nvram_tier = Tier(env, 'NVRAM', bandwidth=nvram_bandwidth, capacity=80e9)
# we attach the tiers to a cluster:
cluster = Cluster(env, compute_nodes=3,   # number of physical nodes
                       cores_per_node=2,  # available cores per node
                       tiers=[hdd_tier, nvram_tier]) # associate storage tiers to the cluster
                    #          ^tier 0,    ^tier 1, tier...
                    
logger.remove()


cluster = Cluster(env, compute_nodes=3, cores_per_node=2, tiers=[ssd_tier, nvram_tier])
# registring the app in the simulation env
env.process(app1.run(cluster, placement=[1, 1]))



<Process(run) object at 0x2db88caad48>

In [4]:
start_time = time.time()
# execution the simulation env
env.run()
print(f"Execution time = {time.time()-start_time} seconds")

def print_app_data(data):
    for item in data.items:
        print(item)
print_app_data(app1.data)
fig = display_run(app1.data, cluster, width=800, height=800)
fig.show()

Execution time = 0.0010004043579101562 seconds
{'app': 'read1GB->Comp15s->Write10GB', 'type': 'read', 'cpu_usage': 1, 't_start': 0, 't_end': 1.2820512820512822, 'bandwidth_concurrency': 1, 'bandwidth': 780.0, 'phase_duration': 1.2820512820512822, 'volume': 1000000000.0000001, 'tiers': ['SSD', 'NVRAM'], 'data_placement': {'placement': 'NVRAM'}, 'init_level': {'SSD': 0, 'NVRAM': 0}, 'tier_level': {'SSD': 0, 'NVRAM': 1000000000.0}}
{'app': 'read1GB->Comp15s->Write10GB', 'type': 'compute', 'cpu_usage': 1, 't_start': 1.2820512820512822, 't_end': 16.28205128205128, 'bandwidth': 0, 'phase_duration': 15.0, 'volume': 0, 'tiers': ['SSD', 'NVRAM'], 'data_placement': None, 'init_level': {'SSD': 0, 'NVRAM': 1000000000.0}, 'tier_level': {'SSD': 0, 'NVRAM': 1000000000.0}}
{'app': 'read1GB->Comp15s->Write10GB', 'type': 'write', 'cpu_usage': 1, 't_start': 16.28205128205128, 't_end': 35.69952701020662, 'bandwidth_concurrency': 1, 'bandwidth': 515.0, 'phase_duration': 19.41747572815534, 'volume': 1000000

#### Concurrent I/O
nvram_bandwidth :
-  'read':  {'seq': 780, 'rand': 760},   # throughput for read ops in MB/s
-  'write': {'seq': 515, 'rand': 505}}   # throughput for write ops in MB/s

ssd_bandwidth :
-   'read':  {'seq': 210, 'rand': 190}
-   'write': {'seq': 100, 'rand': 100}

In [5]:
logger.remove()
env = simpy.Environment()
data = simpy.Store(env)

# we register the tiers
ssd_tier = Tier(env, 'SSD', bandwidth=ssd_bandwidth, capacity=200e9)
nvram_tier = Tier(env, 'NVRAM', bandwidth=nvram_bandwidth, capacity=80e9)
cluster = Cluster(env, compute_nodes=1, cores_per_node=2, tiers=[ssd_tier, nvram_tier])
# defining two apps
app1 = Application(env, name="#read1G->comp2s->write3G", compute=[0, 2],
                           read=[1e9, 0], write=[0, 3e9], data=data)
app2 = Application(env, name="#read2G->comp1s->write1G", compute=[0, 4],
                           read=[2e9, 0], write=[0, 1e9], data=data)
# registring apps
env.process(app1.run(cluster, placement=[1, 1]))
env.process(app2.run(cluster, placement=[1, 1]))
env.run()
# display
fig = display_run(data, cluster, width=800, height=900)
fig.show()

### 3. Wiring any optimization heuristics with the simulator
refer to https://github.com/bds-ailab/shaman and https://shaman-app.readthedocs.io/en/latest/user-guide/launching/

![image](bbo.png)

In [6]:
import simpy
from loguru import logger
import numpy as np
import pandas as pd
import math
from cluster_simulator.cluster import Cluster, Tier, bandwidth_share_model, compute_share_model, get_tier, convert_size
from cluster_simulator.phase import DelayPhase, ComputePhase, IOPhase
from cluster_simulator.utils import name_app
import copy
import time
import cluster_simulator.analytics
from cluster_simulator.application import Application

# imports for surrogate models
from sklearn.gaussian_process import GaussianProcessRegressor
from bbo.optimizer import BBOptimizer
# from bbo.optimizer import timeit
from bbo.heuristics.surrogate_models.next_parameter_strategies import expected_improvement

# imports for genetic algorithms
from bbo.heuristics.genetic_algorithm.selections import tournament_pick
from bbo.heuristics.genetic_algorithm.crossover import double_point_crossover
from bbo.heuristics.genetic_algorithm.mutations import mutate_chromosome_to_neighbor
from loguru import logger
import warnings
warnings.filterwarnings("ignore")

In [7]:

class ClusterBlackBox:
    def __init__(self):
        self.env = simpy.Environment()
        self.data = simpy.Store(self.env)

        self.nvram_bandwidth = {'read':  {'seq': 780, 'rand': 760},
                                'write': {'seq': 515, 'rand': 505}}
        self.ssd_bandwidth = {'read':  {'seq': 210, 'rand': 190},
                              'write': {'seq': 100, 'rand': 100}}

        self.nvram_tier = Tier(self.env, 'NVRAM', bandwidth=self.nvram_bandwidth, capacity=80e9)
        self.ssd_tier = Tier(self.env, 'SSD', bandwidth=self.ssd_bandwidth, capacity=200e9)
        
        self.cluster = Cluster(self.env,  compute_nodes=1, cores_per_node=5,
                               tiers=[self.nvram_tier, self.ssd_tier])

        app1 = Application(self.env,
                           compute=[0, 10],
                           read=[1e9, 0],
                           write=[0, 5e9],
                           data=self.data)
        app2 = Application(self.env,
                           compute=[0, 20, 30],
                           read=[3e9, 0, 0],
                           write=[0, 5e9, 10e9],
                           data=self.data)
        app3 = Application(self.env,
                           compute=[0, 10],
                           read=[4e9, 0],
                           write=[0, 7e9],
                           data=self.data)

        self.apps = [app1, app2, app3]
        self.ios = self.get_io_nbr()
        self.n_tiers = len(self.cluster.tiers)
        self.parameter_space = np.array([np.arange(0, self.n_tiers, 1)]*sum(self.ios))

    def get_io_nbr(self):
        io_app = []
        for app in self.apps:
            io_app.append(len([io for io in app.read if io > 0]) +
                          len([io for io in app.write if io > 0]))
        return io_app

    def compute(self, placement=None):  # np.array([[0, 1], [0, 1]])
        self.__init__()  
        # https://stackoverflow.com/questions/45061369/simpy-how-to-run-a-simulation-multiple-times
        start_index = 0
        #print(placement)
        for i_app, app in enumerate(self.apps):
            place_tier = placement[start_index: start_index + self.ios[i_app]]
            start_index = self.ios[i_app]
            self.env.process(app.run(self.cluster, placement=place_tier))
        # run the simulation
        self.env.run()
        return app.get_fitness()
    
    def display_placement(self, placement):
        self.__init__()  
        # https://stackoverflow.com/questions/45061369/simpy-how-to-run-a-simulation-multiple-times
        start_index = 0
        #print(placement)
        for i_app, app in enumerate(self.apps):
            place_tier = placement[start_index: start_index + self.ios[i_app]]
            start_index = self.ios[i_app]
            self.env.process(app.run(self.cluster, placement=place_tier))
        # run the simulation
        self.env.run()
        fig = display_run(self.data, self.cluster, width=800, height=900)
        fitness = app.get_fitness()
        appslist = ", ".join([app.name for app in self.apps])
        print(f"The apps {appslist} lasts {round(fitness, 3)} seconds when placement = {placement}")
        return fig
    
        

In [8]:
logger.remove()
cbb = ClusterBlackBox()
PARAMETER_SPACE = cbb.parameter_space
# combinations are self.n_tiers ** sum(self.ios)
NBR_ITERATION = 120  # cbb.n_tiers ** sum(cbb.ios)

np.random.seed(5)
bbopt = BBOptimizer(black_box=cbb,
                    heuristic="surrogate_model",
                    max_iteration=NBR_ITERATION,
                    initial_sample_size=20,
                    parameter_space=PARAMETER_SPACE,
                    next_parameter_strategy=expected_improvement,
                    regression_model=GaussianProcessRegressor)
start_time = time.time()
bbopt.optimize()
print("-----------------")
print(NBR_ITERATION)
print(f"Running {NBR_ITERATION}/{cbb.n_tiers**sum(cbb.ios)} iterations on BBO "
      f"take {round(time.time() - start_time, 3)} seconds")
bbopt.summarize()
#print(bbopt.history["fitness"])
#print(bbopt.best_parameters_in_grid)

-----------------
120
Running 120/128 iterations on BBO take 3.782 seconds
------ Optimization loop summary ------
Number of iterations: 140
Elapsed time: 195.80292391777039
Best parameters: [1 1 0 0 0 0 1]
Best fitness value: 71.69280557630073
Percentage of explored space: 16.40625
Percentage of static moves: 85.0
Cost of global exploration: 7735.942423272503
Mean fitness gain per iteration: -2.2761319377230538
--- Heuristic specific summary ---
Final RMSE: 1.0
None


### Let now observe the solutions

In [9]:
fig = cbb.display_placement(placement=[0, 1, 1, 0, 0, 1, 1]) # solution found by bbopt
fig = cbb.display_placement(placement=[0, 0, 0, 0, 0, 0, 0]) # most intuitive solution

fig.show()

The apps A9, N6, U0 lasts 75.134 seconds when placement = [0, 1, 1, 0, 0, 1, 1]
The apps H4, L1, M8 lasts 77.809 seconds when placement = [0, 0, 0, 0, 0, 0, 0]


#### We could go further and explore if it is interesting to use lower tier

In [17]:

class ClusterBlackBox:
    def __init__(self):
        self.env = simpy.Environment()
        self.data = simpy.Store(self.env)

        self.nvram_bandwidth = {'read':  {'seq': 780, 'rand': 760},
                                'write': {'seq': 515, 'rand': 505}}
        self.ssd_bandwidth = {'read':  {'seq': 210, 'rand': 190},
                              'write': {'seq': 100, 'rand': 100}}
        self.hdd_bandwidth = {'read':  {'seq': 80, 'rand': 80},
                         'write': {'seq': 40, 'rand': 40}}        

        self.nvram_tier = Tier(self.env, 'NVRAM', bandwidth=self.nvram_bandwidth, capacity=80e9)
        self.ssd_tier = Tier(self.env, 'SSD', bandwidth=self.ssd_bandwidth, capacity=200e9)
        self.hdd_tier = Tier(self.env, 'HDD', bandwidth=hdd_bandwidth, capacity=1e12)
        self.cluster = Cluster(self.env,  compute_nodes=1, cores_per_node=5,
                               tiers=[self.nvram_tier, self.ssd_tier, self.hdd_tier])

        app1 = Application(self.env,
                           compute=[0, 10],
                           read=[1e9, 0],
                           write=[0, 5e9],
                           data=self.data)
        app2 = Application(self.env,
                           compute=[0, 20, 30],
                           read=[3e9, 0, 0],
                           write=[0, 5e9, 10e9],
                           data=self.data)
        app3 = Application(self.env,
                           compute=[0, 10],
                           read=[4e9, 0],
                           write=[0, 7e9],
                           data=self.data)

        self.apps = [app1, app2, app3]
        self.ios = self.get_io_nbr()
        self.n_tiers = len(self.cluster.tiers)
        self.parameter_space = np.array([np.arange(0, self.n_tiers, 1)]*sum(self.ios))

    def get_io_nbr(self):
        io_app = []
        for app in self.apps:
            io_app.append(len([io for io in app.read if io > 0]) +
                          len([io for io in app.write if io > 0]))
        return io_app

    def compute(self, placement=None):  # np.array([[0, 1], [0, 1]])
        self.__init__()  
        # https://stackoverflow.com/questions/45061369/simpy-how-to-run-a-simulation-multiple-times
        start_index = 0
        #print(placement)
        for i_app, app in enumerate(self.apps):
            place_tier = placement[start_index: start_index + self.ios[i_app]]
            start_index = self.ios[i_app]
            self.env.process(app.run(self.cluster, placement=place_tier))
        # run the simulation
        self.env.run()
        return app.get_fitness()
    
    def display_placement(self, placement):
        self.__init__()  
        # https://stackoverflow.com/questions/45061369/simpy-how-to-run-a-simulation-multiple-times
        start_index = 0
        #print(placement)
        for i_app, app in enumerate(self.apps):
            place_tier = placement[start_index: start_index + self.ios[i_app]]
            start_index = self.ios[i_app]
            self.env.process(app.run(self.cluster, placement=place_tier))
        # run the simulation
        self.env.run()
        fig = display_run(self.data, self.cluster, width=800, height=900)
        fitness = app.get_fitness()
        appslist = ", ".join([app.name for app in self.apps])
        print(f"The apps {appslist} lasts {round(fitness, 3)} seconds when placement = {placement}")
        return fig
   
logger.remove()
cbb = ClusterBlackBox()
PARAMETER_SPACE = cbb.parameter_space
# combinations are self.n_tiers ** sum(self.ios)
NBR_ITERATION = 100  # cbb.n_tiers ** sum(cbb.ios)

np.random.seed(5)
bbopt = BBOptimizer(black_box=cbb,
                    heuristic="surrogate_model",
                    max_iteration=NBR_ITERATION,
                    initial_sample_size=10,
                    parameter_space=PARAMETER_SPACE,
                    next_parameter_strategy=expected_improvement,
                    regression_model=GaussianProcessRegressor)
start_time = time.time()
bbopt.optimize()
print(NBR_ITERATION)
print(f"Running {NBR_ITERATION}/{cbb.n_tiers**sum(cbb.ios)} on BBO take {(time.time() - start_time)}")
bbopt.summarize()
#print(bbopt.history["fitness"])
#print(bbopt.best_parameters_in_grid) 
        

100
Running 100/2187 on BBO take 41.929540157318115
------ Optimization loop summary ------
Number of iterations: 140
Elapsed time: 2576.902407169342
Best parameters: [2 1 0 0 0 1 1]
Best fitness value: 72.5
Percentage of explored space: 2.652034750800183
Percentage of static moves: 58.57142857142858
Cost of global exploration: 9069.952167573529
Mean fitness gain per iteration: -0.25865022267899973
--- Heuristic specific summary ---
Final RMSE: 1.0
None


In [18]:
fig1 = cbb.display_placement(placement=bbopt.best_parameters_in_grid) # solution found by bbopt
fig2 = cbb.display_placement(placement=[0, 0, 0, 0, 0, 0, 0]) # most intuitive solution
fig1.show()


The apps B0, Q8, J9 lasts 72.5 seconds when placement = [2 1 0 0 0 1 1]
The apps O2, O4, R0 lasts 77.809 seconds when placement = [0, 0, 0, 0, 0, 0, 0]


In [19]:
fig2.show()

Conclusions:
1. Placing all I/O on most performant tier is not always the best solution, even though capacities of tiers is not reached.
2. The more you send I/O on the same tier, the more we have to share bandwidth and deteriorate the transfer speed.
3. Simulation allows us to roughly predict what would happen with a given placement, and to optimize accordingly. Otherwise, we should build a dataset for each usecase that could cost many (100 ?) runs.

##### Next steps:
1. Still need to improve many aspects of the Execution Simulator
2. Add support for ephemeral tier (buffering and destaging) (WIP)
3. Open support for many other heuristics (interfacing Simulator more properly with heuristics)
4. Implementing other parts of the Recommendation System.
5. Is application sequential formalism enough to describe a usecase workflow?

