# `mle-monitor`: Lightweight Resource Monitoring
### Author: [@RobertTLange](https://twitter.com/RobertTLange) [Last Update: December 2021] [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mle-infrastructure/mle-monitor/blob/main/examples/getting_started.ipynb)

"Did I already run this experiment before? How many resources are currently available on my cluster?" If you regularly ask yourself these questions as a researcher, then `mle-monitor` is made for you. It provides a lightweight API for tracking your experiments using a pickle protocol database (e.g. for hyperparameter searches and/or multi-configuration/multi-seed runs). Furthermore, it comes with built-in resource monitoring on Slurm/Grid Engine clusters and local machines/servers. Finally, it leverages [`rich`](https://github.com/willmcgugan/rich) to provide a terminal dashboard that is updated live with newly protocolled experiments and the current state of resource utilization. Here is an example of the dashboard on a Grid Engine cluster:

![](../docs/monitor-promo-gif.gif)

In [18]:
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

try:
    import mle_monitor
except ImportError:
    # Install the package on the fly if it is missing (e.g. on Colab)
    !pip install -q mle-monitor
    import mle_monitor

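`mle-monitor` comes with three core functionalities:

- **`MLEProtocol`**: Experiment management with a protocol database.
- **`MLEResource`**: Resource monitoring on Slurm/Grid Engine clusters and local machines/servers.
- **`MLEDashboard`**: A `rich`-based terminal dashboard that visualizes protocolled experiments and resource utilization.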

<img src="https://github.com/mle-infrastructure/mle-monitor/blob/main/docs/mle_monitor_structure.png?raw=true" alt="drawing" width="900"/>


Finally, `mle-monitor` is part of the [`mle-infrastructure`](https://github.com/mle-infrastructure) and comes with a set of handy built-in synergies. We will wrap up by outlining a full workflow.

# Experiment Management with `MLEProtocol`

In [19]:
from mle_monitor import MLEProtocol

# Load the protocol from a local file (create new if it doesn't exist yet)
protocol = MLEProtocol(protocol_fname="mle_protocol.db")

In order to add a new experiment to the protocol database you have to provide a dictionary containing the experiment meta data:

| Key | Description | Default |
|----------------------- | ----------- | --------------- |
| `purpose` | Purpose of experiment | 'None provided' |
| `project_name` | Project name of experiment | 'default' |
| `exec_resource` | Resource jobs are run on | 'local' |
| `experiment_dir` | Experiment log storage directory | 'experiments' |
| `experiment_type` | Type of experiment to run | 'single' |
| `base_fname` | Main code script to execute | 'main.py' |
| `config_fname` | Config file path of experiment | 'base_config.yaml' |
| `num_seeds` | Number of evaluation seeds | 1 |
| `num_total_jobs` | Number of total jobs to run | 1 |
| `num_job_batches` | Number of sequential job batches | 1 |
| `num_jobs_per_batch` | Number of jobs in single batch | 1 |
| `time_per_job` | Expected duration: days-hours-minutes | '00:01:00' |
| `num_cpus` | Number of CPUs used in job | 1 |
| `num_gpus` | Number of GPUs used in job | 0 |

In [9]:
meta_data = {
    "purpose": "Test Protocol",  # Purpose of experiment
    "project_name": "MNIST",  # Project name of experiment
    "exec_resource": "local",  # Resource jobs are run on
    "experiment_dir": "log_dir",  # Experiment log storage directory
    "experiment_type": "hyperparameter-search",  # Type of experiment to run
    "base_fname": "main.py",  # Main code script to execute
    "config_fname": "base_config.json",  # Config file path of experiment
    "num_seeds": 5,  # Number of evaluation seeds
    "num_total_jobs": 10,  # Number of total jobs to run
    "num_jobs_per_batch": 5,  # Number of jobs in single batch
    "num_job_batches": 2,  # Number of sequential job batches
    "time_per_job": "00:05:00",  # Expected duration: days-hours-minutes
    "num_cpus": 2,  # Number of CPUs used in job
    "num_gpus": 1,  # Number of GPUs used in job
}
e_id = protocol.add(meta_data, save=False)
protocol.get(e_id)

{'purpose': 'Test Protocol',
 'project_name': 'MNIST',
 'exec_resource': 'local',
 'experiment_dir': 'log_dir',
 'experiment_type': 'hyperparameter-search',
 'base_fname': 'main.py',
 'config_fname': 'base_config.json',
 'num_seeds': 5,
 'num_total_jobs': 10,
 'num_jobs_per_batch': 5,
 'num_job_batches': 2,
 'time_per_job': '00:05:00',
 'num_cpus': 2,
 'num_gpus': 1,
 'git_hash': 'f7da2f18e74dce53f72b9561baa29ef3c9dd161e',
 'loaded_config': [{'train_config': {'lrate': 0.1},
   'model_config': {'num_layers': 5},
   'log_config': {'time_to_track': ['step_counter'],
    'what_to_track': ['loss'],
    'time_to_print': ['step_counter'],
    'what_to_print': ['loss'],
    'print_every_k_updates': 10,
    'overwrite_experiment_dir': 1}}],
 'e-hash': 'fd42ea4263abe3a5f952239371153643',
 'retrieved_results': False,
 'stored_in_cloud': False,
 'report_generated': False,
 'job_status': 'running',
 'start_time': '12/04/2021 10:22:07',
 'duration': '00:10:00',
 'stop_time': '12/04/2021 20:22:07'}
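
The record above carries an `e-hash` field next to the loaded configuration. The notebook does not say how this hash is computed or used. One plausible use is spotting an experiment that was already run. The following is a minimal sketch under that assumption: the helper name `config_hash` and the choice of MD5 over sorted-key JSON are illustrative, not the documented behaviour of `mle-monitor`.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hypothetical helper: hash a config dict independent of key order."""
    # Sorting keys yields a canonical string, so equal configs hash equally.
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


config_a = {"train_config": {"lrate": 0.1}, "model_config": {"num_layers": 5}}
config_b = {"model_config": {"num_layers": 5}, "train_config": {"lrate": 0.1}}
config_c = {"train_config": {"lrate": 0.2}, "model_config": {"num_layers": 5}}

# Hashes of configurations that were already protocolled
seen = {config_hash(config_a)}
print(config_hash(config_b) in seen)  # same settings, different key order
print(config_hash(config_c) in seen)  # different learning rate
```

Comparing hashes instead of full dictionaries keeps the lookup cheap even when the protocol holds many experiments.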

In [10]:
# Print a summary of the last experiments
sub_df = protocol.summary()

# ... and a more detailed version
sub_df = protocol.summary(full=True)

In [11]:
# Update some element in the database
protocol.update(e_id, "exec_resource", "slurm-cluster", save=False)

# Abort the experiment - changes status
protocol.abort(e_id, save=False)
sub_df = protocol.summary()

In [12]:
# Get the status of the experiment
protocol.status(e_id)

'aborted'

In [13]:
# Get the monitoring data - used later in dashboard
total_data, last_data, time_data, protocol_table = protocol.monitor()
total_data, last_data, time_data

({'total': '4',
  'run': '2',
  'done': '1',
  'aborted': '1',
  'sge': '0',
  'slurm': '1',
  'gcp': '0',
  'local': '3',
  'report_gen': '0',
  'gcs_stored': '1',
  'retrieved': '0'},
 {'e_id': '4',
  'job_status': 'aborted',
  'e_dir': 'log_dir',
  'e_type': 'hyperparameter-search',
  'e_script': 'main.py',
  'e_config': 'base_config.json',
  'report_gen': False},
 {'num_seeds': 5,
  'total_jobs': 10,
  'total_batches': 2,
  'jobs_per_batch': 5,
  'time_per_batch': '00:05:00',
  'start_time': '12/04/2021 10:22:07',
  'stop_time': '12/04/2021 20:22:07',
  'est_duration': '00:10:00'})
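
The timing fields in the output above are consistent with a simple calculation: the estimated duration is the time per batch multiplied by the number of sequential batches, and the stop time is the start time plus that duration. The sketch below reproduces the numbers shown, assuming the strings are in `days:hours:minutes` format and the timestamps in month/day/year order. The helpers `parse_duration` and `format_duration` are hypothetical and not part of `mle-monitor`.

```python
from datetime import datetime, timedelta


def parse_duration(text: str) -> timedelta:
    """Parse a 'days:hours:minutes' string (assumed format)."""
    days, hours, minutes = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes)


def format_duration(delta: timedelta) -> str:
    """Format a timedelta back into 'days:hours:minutes'."""
    total_minutes = int(delta.total_seconds()) // 60
    days, rest = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rest, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}"


time_per_batch = parse_duration("00:05:00")
est_duration = time_per_batch * 2  # total_batches = 2 in the output above

start = datetime.strptime("12/04/2021 10:22:07", "%m/%d/%Y %H:%M:%S")
stop = start + est_duration

print(format_duration(est_duration))       # 00:10:00
print(stop.strftime("%m/%d/%Y %H:%M:%S"))  # 12/04/2021 20:22:07
```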

## More Flexible Storage of Additional Meta-Data

You can also store other data specific to an experiment using a dictionary of data as follows:

In [15]:
extra_data = {"extra_config": {"lrate": 3e-04}}
e_id = protocol.add(meta_data, extra_data, save=False)
protocol.get(e_id)["extra_config"]

{'lrate': 0.0003}
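
Conceptually, the stored entry behaves like a layered dictionary: default values first, then the meta data, then any extra data. The sketch below illustrates that layering with plain dictionaries. It is an assumption about the resulting record, not the internal implementation of `MLEProtocol.add`. The helper `build_record` is hypothetical, and `DEFAULTS` only lists a subset of the defaults from the table above.

```python
from typing import Optional

# Subset of the default values listed in the meta data table
DEFAULTS = {
    "purpose": "None provided",
    "project_name": "default",
    "exec_resource": "local",
    "experiment_dir": "experiments",
    "experiment_type": "single",
    "num_seeds": 1,
    "num_cpus": 1,
    "num_gpus": 0,
}


def build_record(meta_data: dict, extra_data: Optional[dict] = None) -> dict:
    """Hypothetical helper: layer defaults, meta data and extra data."""
    # Later dictionaries win, so user-provided values override the defaults.
    return {**DEFAULTS, **meta_data, **(extra_data or {})}


record = build_record(
    {"purpose": "Test Protocol", "num_seeds": 5},
    {"extra_config": {"lrate": 3e-04}},
)
print(record["num_seeds"], record["num_gpus"], record["extra_config"])
```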

## Syncing your Protocol DB with a GCS Bucket

You can also automatically sync your protocol database with a Google Cloud Storage (GCS) bucket. This requires an existing GCP project and GCS bucket. Furthermore, you will have to provide the path to your `.json` authentication key. If you don't have one yet, have a look [here](https://cloud.google.com/docs/authentication/getting-started). Alternatively, just make sure that the environment variable `GOOGLE_APPLICATION_CREDENTIALS` is set to the right path.
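
Before enabling the sync, it can help to verify that a key file is actually reachable. The following is a minimal sketch using only the standard library: the helper `find_gcp_key` is hypothetical and not part of `mle-monitor`. It prefers an explicitly passed path and otherwise falls back to the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

```python
import os
from pathlib import Path
from typing import Optional


def find_gcp_key(explicit_path: Optional[str] = None) -> Optional[Path]:
    """Hypothetical helper: return a usable key path, or None."""
    candidate = explicit_path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if candidate is None:
        return None
    path = Path(candidate).expanduser()
    # Only accept an existing file with a .json suffix.
    if path.is_file() and path.suffix == ".json":
        return path
    return None


key_path = find_gcp_key()
if key_path is None:
    print("No GCP key found; skipping cloud sync.")
else:
    print(f"Using GCP credentials at {key_path}")
```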

In [18]:
# Sync your protocol with a GCS bucket
cloud_settings = {
    "project_name": "mle-toolbox",        # Name of your GCP project
    "bucket_name": "mle-protocol",        # Name of your GCS bucket
    "protocol_fname": "mle_protocol.db",  # Name of DB file in GCS bucket
    "use_protocol_sync": True,            # Whether to sync the protocol
    "use_results_storage": False          # Whether to upload zipped dir at completion
}
protocol = MLEProtocol(protocol_fname="mle_protocol.db",
                       cloud_settings=cloud_settings,
                       verbose=True)

In [20]:
e_id = protocol.add(meta_data)
success = protocol.gcs_send()

# Resource Monitoring with `MLEResource`

In [22]:
from mle_monitor import MLEResource

resource = MLEResource(resource_name="local")
user_data, host_data, util_data = resource.monitor()

In [23]:
user_data.keys()

dict_keys(['pid', 'p_name', 'mem_util', 'cpu_util', 'cmdline', 'total_cpu_util', 'total_mem_util'])

You can also monitor Slurm or Grid Engine clusters:
```python
resource = MLEResource(
    resource_name="slurm-cluster",
    monitor_config={"partitions": ["partition-1", "partition-2"]},
)
resource = MLEResource(
    resource_name="sge-cluster",
    monitor_config={"queues": ["queue-1", "queue-2"]}
)
```

# Dashboard Visualization with `MLEDashboard`

In [None]:
from mle_monitor import MLEDashboard

dashboard = MLEDashboard(protocol, resource)

In [None]:
# Get a static snapshot of the protocol & resource utilisation
dashboard.snapshot()

In [None]:
# Run monitoring in while loop - dashboard
dashboard.live()
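
The live mode is described above as monitoring in a while loop. As a generic illustration of that pattern (not the actual implementation of `MLEDashboard.live`), the sketch below repeatedly fetches a state, renders it and sleeps. The function `poll` and its parameters `fetch`, `render`, `interval` and `max_updates` are hypothetical names chosen for this example.

```python
import time
from typing import Callable


def poll(
    fetch: Callable[[], dict],
    render: Callable[[dict], None],
    interval: float = 1.0,
    max_updates: int = 3,
) -> int:
    """Hypothetical refresh loop: fetch state, render it, sleep, repeat."""
    updates = 0
    try:
        while updates < max_updates:
            render(fetch())
            updates += 1
            time.sleep(interval)
    except KeyboardInterrupt:
        # Ctrl+C ends the loop cleanly instead of printing a traceback.
        pass
    return updates


snapshots = []
count = poll(
    fetch=lambda: {"timestamp": time.time()},  # stand-in for monitoring data
    render=snapshots.append,                   # stand-in for drawing a dashboard
    interval=0.01,
)
print(count, len(snapshots))
```

The cap on updates only keeps the example finite. A real dashboard loop would typically run until interrupted.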


# Integration with the MLE-Infrastructure Ecosystem 🔺

In [None]:
try:
    from mle_hyperopt import RandomSearch
    from mle_scheduler import MLEQueue
    from mle_logging import load_meta_log
except ImportError:
    # pip expects space-separated package names
    !pip install -q mle-hyperopt mle-scheduler mle-logging
    from mle_hyperopt import RandomSearch
    from mle_scheduler import MLEQueue
    from mle_logging import load_meta_log

We again start by adding an experiment to the protocol at launch time.

In [None]:
# Load (existing) protocol database and add experiment data
protocol_db = MLEProtocol("mle_protocol.db")
meta_data = {
    "purpose": "random search",  # Purpose of experiment
    "project_name": "surrogate",  # Project name of experiment
    "exec_resource": "local",  # Resource jobs are run on
    "experiment_dir": "logs_search",  # Experiment log storage directory
    "experiment_type": "hyperparameter-search",  # Type of experiment to run
    "base_fname": "train.py",  # Main code script to execute
    "config_fname": "base_config.json",  # Config file path of experiment
    "num_seeds": 2,  # Number of evaluation seeds
    "num_total_jobs": 4,  # Number of total jobs to run
    "num_jobs_per_batch": 4,  # Number of jobs in single batch
    "num_job_batches": 1,  # Number of sequential job batches
    "time_per_job": "00:00:02",  # Expected duration: days-hours-minutes
}
new_experiment_id = protocol_db.add(meta_data)

Afterwards, we leverage `mle-hyperopt` to instantiate a random search strategy with its parameter space. We then ask for two configurations and store them as `.yaml` files in our working directory:

In [None]:
# Instantiate random search class
strategy = RandomSearch(
    real={"lrate": {"begin": 0.1, "end": 0.5, "prior": "log-uniform"}},
    integer={"batch_size": {"begin": 1, "end": 5, "prior": "uniform"}},
    categorical={"arch": ["mlp", "cnn"]},
    verbose=True,
)

# Ask for configurations to evaluate & run parallel eval of seeds * configs
configs, config_fnames = strategy.ask(2, store=True)
configs

Next, we can use an `MLEQueue` from `mle-scheduler` to run our training script `train.py` for our two configurations and two different random seeds. Afterwards, we merge the resulting logs into a single `meta_log.hdf5` and retrieve the mean (over seeds) test loss score for both configurations.

In [None]:
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=config_fnames,
    random_seeds=[1, 2],
    experiment_dir="logs_search",
    protocol_db=protocol_db,
)
queue.run()

# Merge logs of random seeds & configs -> load & get final scores
queue.merge_configs(merge_seeds=True)
meta_log = load_meta_log("logs_search/meta_log.hdf5")
test_scores = [meta_log[r].stats.test_loss.mean[-1] for r in queue.mle_run_ids]
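
To make the "mean over seeds" aggregation concrete, here is a small sketch with the standard library. The loss values are made up for illustration and the helper `mean_over_seeds` is hypothetical. In the workflow above, `mle-logging` performs this aggregation when the logs are merged.

```python
from statistics import mean

# Hypothetical final test losses, keyed by (config name, random seed)
final_losses = {
    ("config_0", 1): 0.5,
    ("config_0", 2): 0.75,
    ("config_1", 1): 0.25,
    ("config_1", 2): 0.5,
}


def mean_over_seeds(losses: dict) -> dict:
    """Group losses by configuration and average across seeds."""
    grouped = {}
    for (config, _seed), loss in losses.items():
        grouped.setdefault(config, []).append(loss)
    return {config: mean(values) for config, values in grouped.items()}


scores = mean_over_seeds(final_losses)
print(scores)  # {'config_0': 0.625, 'config_1': 0.375}
```

One score per configuration is what a search strategy needs in order to be updated with the results.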

In [None]:
# Update the hyperparameter search strategy
strategy.tell(configs, test_scores)

# Wrap up experiment (store completion time, etc.)
protocol_db.complete(new_experiment_id)