# `mle-monitor`: Lightweight Resource Monitoring
### Author: [@RobertTLange](https://twitter.com/RobertTLange) [Last Update: November 2021] [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mle-infrastructure/mle-monitor/blob/main/examples/getting_started.ipynb)

<img src="https://github.com/mle-infrastructure/mle-monitor/blob/main/docs/mle_monitor_structure.png?raw=true" alt="drawing" width="900"/>

In [53]:
from datetime import datetime
start_time = datetime.now().strftime("%m/%d/%y %I:%M %p")
time_t = datetime.now().strftime("%m/%d/%y %I:%M %p")
# Parse with %I (12-hour clock) so that the %p AM/PM marker is honoured
stop_time = datetime.strptime(time_t, "%m/%d/%y %I:%M %p")
start_time = datetime.strptime(start_time, "%m/%d/%y %I:%M %p")

In [62]:
t = datetime.now().strftime("%m/%d/%y %I:%M %p")
datetime.strptime(t, "%m/%d/%y %I:%M %p")

datetime.datetime(2021, 12, 4, 4, 10)
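
When parsing, the `%p` directive only changes the hour if it is paired with the 12-hour directive `%I`; combined with `%H`, the AM/PM marker is read but ignored. A minimal sketch of the difference, using a fixed, illustrative timestamp (the variable names are not part of `mle-monitor`):

```python
from datetime import datetime

fmt_12h = "%m/%d/%y %I:%M %p"    # 12-hour clock plus AM/PM marker
fmt_mixed = "%m/%d/%y %H:%M %p"  # 24-hour directive combined with AM/PM

# Format an afternoon timestamp with the 12-hour format
stamp = datetime(2021, 12, 4, 16, 10).strftime(fmt_12h)

round_trip = datetime.strptime(stamp, fmt_12h)  # AM/PM is applied
lossy = datetime.strptime(stamp, fmt_mixed)     # AM/PM is ignored

print(round_trip.hour)  # 16
print(lossy.hour)       # 4
```

Formatting and parsing with the same format string avoids this silent 12-hour shift.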

In [7]:
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

try:
    import mle_monitor
except ImportError:
    !pip install -q mle-monitor
    import mle_monitor

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Experiment Management with `MLEProtocol`

In [8]:
from mle_monitor import MLEProtocol

# Load the protocol from a local file (create new if it doesn't exist yet)
protocol = MLEProtocol(protocol_fname="mle_protocol.db")

In [9]:
meta_data = {
    "purpose": "Test Protocol",  # Purpose of experiment
    "project_name": "MNIST",  # Project name of experiment
    "exec_resource": "local",  # Resource jobs are run on
    "experiment_dir": "log_dir",  # Experiment log storage directory
    "experiment_type": "hyperparameter-search",  # Type of experiment to run
    "base_fname": "main.py",  # Main code script to execute
    "config_fname": "base_config.json",  # Config file path of experiment
    "num_seeds": 5,  # Number of evaluation seeds
    "num_total_jobs": 10,  # Number of total jobs to run
    "num_jobs_per_batch": 5,  # Number of jobs in a single batch
    "num_job_batches": 2,  # Number of sequential job batches
    "time_per_job": "00:05:00",  # Expected duration per job (days:hours:minutes)
    "num_cpus": 2,  # Number of CPUs used in job
    "num_gpus": 1,  # Number of GPUs used in job
}
e_id = protocol.add(meta_data, save=False)
protocol.get(e_id)

{'purpose': 'Test Protocol',
 'project_name': 'MNIST',
 'exec_resource': 'local',
 'experiment_dir': 'log_dir',
 'experiment_type': 'hyperparameter-search',
 'base_fname': 'main.py',
 'config_fname': 'base_config.json',
 'num_seeds': 5,
 'num_total_jobs': 10,
 'num_jobs_per_batch': 5,
 'num_job_batches': 2,
 'time_per_job': '00:05:00',
 'num_cpus': 2,
 'num_gpus': 1,
 'git_hash': 'f7da2f18e74dce53f72b9561baa29ef3c9dd161e',
 'loaded_config': [{'train_config': {'lrate': 0.1},
   'model_config': {'num_layers': 5},
   'log_config': {'time_to_track': ['step_counter'],
    'what_to_track': ['loss'],
    'time_to_print': ['step_counter'],
    'what_to_print': ['loss'],
    'print_every_k_updates': 10,
    'overwrite_experiment_dir': 1}}],
 'e-hash': 'fd42ea4263abe3a5f952239371153643',
 'retrieved_results': False,
 'stored_in_cloud': False,
 'report_generated': False,
 'job_status': 'running',
 'start_time': '12/04/2021 10:22:07',
 'duration': '00:10:00',
 'stop_time': '12/04/2021 20:22:07'}
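
In the output above, `stop_time` equals `start_time` plus `duration`, with `'00:10:00'` read as days:hours:minutes (consistent with two sequential batches of `'00:05:00'` each). The sketch below reproduces that arithmetic with the standard library only. `add_duration` is a hypothetical helper, not part of `mle-monitor`, and the library's internal computation may differ.

```python
from datetime import datetime, timedelta

# Format of start_time/stop_time as shown in the protocol entry
TIME_FMT = "%m/%d/%Y %H:%M:%S"


def add_duration(start: str, duration: str) -> str:
    """Add a 'dd:hh:mm' duration string to a formatted start time."""
    days, hours, minutes = (int(part) for part in duration.split(":"))
    stop = datetime.strptime(start, TIME_FMT) + timedelta(
        days=days, hours=hours, minutes=minutes
    )
    return stop.strftime(TIME_FMT)


print(add_duration("12/04/2021 10:22:07", "00:10:00"))  # 12/04/2021 20:22:07
```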

In [10]:
# Print a summary of the last experiments
sub_df = protocol.summary()

# ... and a more detailed version
sub_df = protocol.summary(full=True)

In [11]:
# Update some element in the database
protocol.update(e_id, "exec_resource", "slurm-cluster", save=False)

# Abort the experiment - changes status
protocol.abort(e_id, save=False)
sub_df = protocol.summary()

In [12]:
# Get the status of the experiment
protocol.status(e_id)

'aborted'

In [13]:
# Get the monitoring data - used later in dashboard
total_data, last_data, time_data, protocol_table = protocol.monitor()
total_data, last_data, time_data

({'total': '4',
  'run': '2',
  'done': '1',
  'aborted': '1',
  'sge': '0',
  'slurm': '1',
  'gcp': '0',
  'local': '3',
  'report_gen': '0',
  'gcs_stored': '1',
  'retrieved': '0'},
 {'e_id': '4',
  'job_status': 'aborted',
  'e_dir': 'log_dir',
  'e_type': 'hyperparameter-search',
  'e_script': 'main.py',
  'e_config': 'base_config.json',
  'report_gen': False},
 {'num_seeds': 5,
  'total_jobs': 10,
  'total_batches': 2,
  'jobs_per_batch': 5,
  'time_per_batch': '00:05:00',
  'start_time': '12/04/2021 10:22:07',
  'stop_time': '12/04/2021 20:22:07',
  'est_duration': '00:10:00'})
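
Note that the counts in `total_data` are strings, so they need converting before any arithmetic. A small sketch, assuming the dictionary has the shape shown above (the values are copied from that output and `status_share` is an illustrative name):

```python
# Subset of the `total_data` output above; counts are stored as strings
total_data = {"total": "4", "run": "2", "done": "1", "aborted": "1"}

total = int(total_data["total"])

# Share of experiments per status (empty if the protocol has no entries)
status_share = {
    status: int(total_data[status]) / total
    for status in ("run", "done", "aborted")
} if total > 0 else {}

print(status_share)  # {'run': 0.5, 'done': 0.25, 'aborted': 0.25}
```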

# More Flexible Storage of Additional Meta-Data

You can also store additional experiment-specific data by passing a second dictionary when adding the experiment:

In [15]:
extra_data = {"extra_config": {"lrate": 3e-04}}
e_id = protocol.add(meta_data, extra_data, save=False)
protocol.get(e_id)["extra_config"]

{'lrate': 0.0003}

## Syncing your Protocol DB with a GCS Bucket

You can also automatically sync your protocol database with a Google Cloud Storage (GCS) bucket. This requires an existing GCP project and GCS bucket, as well as the path to your `.json` authentication key. If you don't have a key yet, have a look [here](https://cloud.google.com/docs/authentication/getting-started). Alternatively, make sure that the environment variable `GOOGLE_APPLICATION_CREDENTIALS` points to the key file.
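
Before enabling the sync, it can help to check that the credentials variable is actually set. The following is a minimal sketch using only the standard library; `gcs_credentials_available` is a hypothetical helper and not part of `mle-monitor`.

```python
import os
from pathlib import Path


def gcs_credentials_available() -> bool:
    """Return True if GOOGLE_APPLICATION_CREDENTIALS points to an existing file."""
    key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(key_path) and Path(key_path).is_file()


if not gcs_credentials_available():
    print("GOOGLE_APPLICATION_CREDENTIALS is unset or does not point to a file.")
```

This only verifies that the file exists; it does not validate the key or test access to the bucket.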

In [18]:
# Sync your protocol with a GCS bucket
cloud_settings = {
    "project_name": "mle-toolbox",        # Name of your GCP project
    "bucket_name": "mle-protocol",        # Name of your GCS bucket
    "protocol_fname": "mle_protocol.db",  # Name of DB file in GCS bucket
    "use_protocol_sync": True,            # Whether to sync the protocol
    "use_results_storage": False          # Whether to upload zipped dir at completion
}
protocol = MLEProtocol(protocol_fname="mle_protocol.db",
                       cloud_settings=cloud_settings,
                       verbose=True)

In [20]:
e_id = protocol.add(meta_data)
success = protocol.gcs_send()

# Resource Monitoring with `MLEResource`

In [22]:
from mle_monitor import MLEResource

resource = MLEResource(resource_name="local")
user_data, host_data, util_data = resource.monitor()

In [23]:
user_data.keys()

dict_keys(['pid', 'p_name', 'mem_util', 'cpu_util', 'cmdline', 'total_cpu_util', 'total_mem_util'])

# Dashboard Visualization with `MLEDashboard`

In [None]:
from mle_monitor import MLEDashboard

dashboard = MLEDashboard(protocol, resource)

In [None]:
# Get a static snapshot of the protocol & resource utilisation
dashboard.snapshot()

In [None]:
# Run the live dashboard, which keeps monitoring in a while loop
dashboard.live()