# MLOps Parsl workflow

This notebook is the stand-alone companion to the Parsl MLOps workflow in `main.py` in this repository. The notebook is designed to run directly on an HPC resource, whereas `main.py` uses `parsl_utils` to launch MLOps applications from a central coordinating node (e.g. a laptop or the Parallel Works platform). The workflow simulates a typical MLOps situation with the following tasks:
1. start an MLflow tracking server
2. start DVC tracking within an archive repository + remote
3. download and preprocess training data
4. run the training loop and store results on-the-fly with MLflow
5. commit and push the resulting models with DVC to repo + remote
6. reuse the model for inference and generate figures


## Installs

In [None]:
# Conda does not install monitoring, so use pip.
#! conda install -y -c conda-forge parsl

! pip install 'parsl[monitoring, visualization]'

## Imports

Based on the instructions in the [Parsl Tutorial](https://parsl.readthedocs.io/en/latest/1-parsl-introduction.html)

In [None]:
import parsl
import os
from parsl.app.app import python_app, bash_app
from parsl.config import Config

# We want to use monitoring, so we must use HTEX
from parsl.executors import HighThroughputExecutor
from parsl.monitoring.monitoring import MonitoringHub
from parsl.addresses import address_by_hostname
import logging

#=================================================
# Uncomment the line below to log everything to
# stdout (it ends up in pink boxes in the notebook).
# The same information is logged anyway in
# ./runinfo/<run_id>/parsl.log
#parsl.set_stream_logger()
#==================================================

print(parsl.__version__)

## Configure Parsl

This configuration must use the `HighThroughputExecutor` (HTEX) because we also want to enable [Parsl monitoring](https://parsl.readthedocs.io/en/latest/userguide/monitoring.html).

In [None]:
config = Config(
   executors=[
       HighThroughputExecutor(
           label="local_htex",
           cores_per_worker=1,
           max_workers_per_node=2,
           address=address_by_hostname(),
       )
   ],
   monitoring=MonitoringHub(
       hub_address=address_by_hostname(),
       hub_port=55055,
       monitoring_debug=False,
       resource_monitoring_interval=10,
   ),
   strategy='none'
)

# Loading the configuration starts a Parsl DataFlowKernel
dfk = parsl.load(config)

## Start Parsl monitoring - Option 1 - direct shell invocation to background

This step can be done at any point, provided that a database file exists. By default this file is `./runinfo/monitoring.db`, and it is created when the Parsl configuration is loaded. When the notebook kernel is restarted, information from subsequent Parsl workflow runs is appended to the monitoring information in `./runinfo`. This information can also be viewed "offline", i.e. with no Parsl workflow running (see Option 3 at the end of this notebook).

This launch is commented out here because `parsl-visualize` can also be launched from a Parsl app within the workflow, which is done below. The command is retained as a functional example. The advantage of running `parsl-visualize` as a Parsl app is that the visualization server is up while the workflow is running and is shut down when the workflow is cleaned up. By contrast, when `parsl-visualize` is launched via `os.system`, the child process can persist even after workflow shutdown or a notebook kernel restart.

In [None]:
# Launch parsl-visualize in the background (Option 1)
#os.system('parsl-visualize 1> parsl_vis.stdout 2> parsl_vis.stderr &')
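
If the background launch is used anyway, keeping a handle on the child process makes it possible to stop it explicitly. The following is a minimal sketch that uses the standard library's `subprocess.Popen` instead of `os.system`. The helper names `start_background` and `stop_background` are hypothetical, and a sleeping Python process stands in for `parsl-visualize` so that the example runs without Parsl installed.

```python
import os
import subprocess
import sys
import tempfile


def start_background(cmd, stdout_path, stderr_path):
    """Start cmd (a list of arguments) without blocking and return its Popen handle."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        # The child gets its own copies of the file descriptors, so ours
        # can be closed as soon as Popen returns.
        return subprocess.Popen(cmd, stdout=out, stderr=err)


def stop_background(proc, timeout=5):
    """Ask the process to terminate; kill it if it does not exit in time."""
    if proc.poll() is None:  # still running
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


# Stand-in for ["parsl-visualize"]: a process that would otherwise run for a minute.
demo_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]

with tempfile.TemporaryDirectory() as tmp:
    proc = start_background(
        demo_cmd,
        os.path.join(tmp, "vis.stdout"),
        os.path.join(tmp, "vis.stderr"),
    )
    was_running = proc.poll() is None
    exit_code = stop_background(proc)

print("was running:", was_running)
print("stopped:", exit_code is not None)
```

Calling `stop_background(proc)` before restarting the kernel avoids leaving an orphaned server behind, which is the failure mode described above.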

## Define Parsl apps

Parsl workflows are built from apps, the smallest unit of execution. There are two types of Parsl apps:
1. Python apps are useful for launching pure Python code (e.g. TensorFlow)
2. Bash apps are useful for launching tasks on the command line (e.g. starting the MLflow server)

Here, the applications are *defined* but not run.

In [None]:
@python_app
def slow_hello():
    import time
    time.sleep(5)
    return 'Hello World from slow Python app!'

@bash_app
def echo_hello(stdout='echo-hello.stdout', stderr='echo-hello.stderr'):
    return 'echo "Hello World from fast Bash app!"'

@bash_app
def start_parsl_visualize(stdout='parsl_vis_app.stdout', stderr='parsl_vis_app.stderr'):
    return 'parsl-visualize'

## Start Parsl monitoring - Option 2 - Monitoring as a Parsl app

This approach is helpful if we want Parsl Monitoring processes to be cleaned up after the workflow is complete.

In [None]:
# Start Parsl visualization in a
# separate cell since we only want
# to run this app one time. This
# invocation of parsl_visualize is
# technically part of the workflow.
future = start_parsl_visualize()

## Run the workflow

The workflow code below runs the applications.

In [None]:
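# Example Python app (the call returns a future immediately)
hello_future = slow_hello()

# Example Bash app
echo_future = echo_hello()

# Block until each app has finished before using its output
print(hello_future.result())
echo_future.result()

# The Bash app's stdout file is only guaranteed to exist once the app is done
with open('echo-hello.stdout', 'r') as f:
    print(f.read())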
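
Calling an app does not run it in place: the call returns a future right away, and `.result()` blocks until the task has finished. Parsl's app futures build on Python's `concurrent.futures.Future`, so the behaviour can be sketched without Parsl. The example below is an illustration only. It uses a standard-library `ThreadPoolExecutor` as a stand-in for the Parsl executor, and `slow_hello_plain` is a hypothetical undecorated copy of `slow_hello` with a shorter sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_hello_plain(delay=0.5):
    """Undecorated stand-in for slow_hello with a shorter sleep."""
    time.sleep(delay)
    return 'Hello World from slow Python app!'


with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.monotonic()
    future = pool.submit(slow_hello_plain)   # returns immediately
    submit_time = time.monotonic() - start

    done_right_away = future.done()          # the task is still sleeping
    message = future.result()                # blocks until the task finishes
    total_time = time.monotonic() - start

print(f"submit took {submit_time:.3f}s, result available after {total_time:.3f}s")
print(message)
```

This is why output files written by a Bash app should only be read after `.result()` has returned.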

## Stop Parsl

The cells above can be rerun any number of times; each rerun simply sends more apps to Parsl. When the workflow is truly complete, call `dfk.cleanup()`. This command runs implicitly when a `main.py` script finishes executing, but it is *not* run in a notebook unless it is called explicitly, as it is below.

In [None]:
dfk.cleanup()
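
Because cleanup is easy to forget in a notebook, one option is to wrap the workflow in a context manager so that cleanup runs even if the workflow raises. This is a minimal sketch and not part of the original workflow. `cleaning_up` is a hypothetical helper that assumes only that the object it receives has a `cleanup()` method, as `dfk` does above, and `FakeDFK` is a stand-in so that the example runs without Parsl.

```python
from contextlib import contextmanager


@contextmanager
def cleaning_up(kernel):
    """Yield kernel and always call kernel.cleanup() afterwards."""
    try:
        yield kernel
    finally:
        kernel.cleanup()


class FakeDFK:
    """Stand-in for the DataFlowKernel; records whether cleanup() ran."""

    def __init__(self):
        self.cleaned = False

    def cleanup(self):
        self.cleaned = True


fake = FakeDFK()
try:
    with cleaning_up(fake):
        raise RuntimeError("simulated failure inside the workflow")
except RuntimeError as exc:
    print(f"workflow failed: {exc}")

print("cleanup ran:", fake.cleaned)
```

In the notebook this would wrap the workflow code as `with cleaning_up(parsl.load(config)) as dfk:`, at the cost of keeping the whole workflow in a single cell.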

## Clean up some log files

In [None]:
# Application logs (-f: do not complain if a file does not exist)
! rm -f echo-hello.stdout
! rm -f echo-hello.stderr

# Remove log files if parsl-visualize is started from os.system (Option 1)
! rm -f parsl_vis.stdout
! rm -f parsl_vis.stderr

# Remove log files if parsl-visualize is started from Parsl app (Option 2)
! rm -f parsl_vis_app.stdout
! rm -f parsl_vis_app.stderr

# This directory contains Parsl monitoring along with other logs
! rm -rf runinfo
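
The same cleanup can be written in Python, which skips files that were never created (for example, the Option 1 logs when only Option 2 was used). This is a minimal sketch that assumes Python 3.8 or later for `missing_ok`. `remove_logs` is a hypothetical helper, and the demonstration runs in a temporary directory rather than the real working directory.

```python
import shutil
import tempfile
from pathlib import Path

LOG_FILES = [
    "echo-hello.stdout", "echo-hello.stderr",        # application logs
    "parsl_vis.stdout", "parsl_vis.stderr",          # Option 1
    "parsl_vis_app.stdout", "parsl_vis_app.stderr",  # Option 2
]


def remove_logs(workdir, files=LOG_FILES, run_dir="runinfo"):
    """Delete the listed log files and the run directory, skipping missing ones."""
    workdir = Path(workdir)
    for name in files:
        (workdir / name).unlink(missing_ok=True)
    shutil.rmtree(workdir / run_dir, ignore_errors=True)


# Demonstration in a scratch directory with only some of the files present.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "echo-hello.stdout").write_text("Hello\n")
    (tmp / "runinfo").mkdir()
    (tmp / "runinfo" / "monitoring.db").write_text("")
    remove_logs(tmp)
    leftovers = sorted(p.name for p in tmp.iterdir())

print("files left after cleanup:", leftovers)
```

In the notebook, `remove_logs(".")` would clean the current working directory.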

## Start Parsl Monitoring - Option 3 - Post workflow manual invocation

Once the Parsl `./runinfo/monitoring.db` is created, it is possible to start Parsl Monitoring and browse the results of the workflow offline. In this scenario, `parsl-visualize` can be started on the command line, provided that a Conda env with `parsl[visualization]` installed is activated. For example:
```
source pw/.miniconda3/etc/profile.d/conda.sh
conda activate base
parsl-visualize sqlite:////${HOME}/mlops-parsl-workflow/runinfo/monitoring.db
```
(You may need to adjust the path to the Conda environment, its name, and the path to `monitoring.db`.)
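
Since `monitoring.db` is an SQLite file (as the `sqlite:///` URL indicates), it can also be inspected directly with Python's built-in `sqlite3` module. The sketch below lists whatever tables a database contains without assuming anything about Parsl's schema. `list_tables` is a hypothetical helper, and the demonstration builds a throwaway database with a made-up `demo` table instead of reading a real `monitoring.db`.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path, sorted."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demonstration with a scratch database; the 'demo' table is illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "demo.db")
    with closing(sqlite3.connect(demo_db)) as conn:
        conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, note TEXT)")
        conn.commit()
    tables = list_tables(demo_db)

print(tables)
```

Against a real run, `list_tables('runinfo/monitoring.db')` shows which tables are available to query.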