# MLOps Parsl workflow

This notebook is the stand-alone companion to the Parsl MLOps workflow in `main.py` in this repository. The notebook is designed to run directly on an HPC resource, whereas `main.py` uses `parsl_utils` to launch MLOps applications from a central coordinating node (e.g. a laptop or the Parallel Works platform). The workflow simulates a typical MLOps situation with the following tasks:
1. start an MLflow tracking server
2. start DVC tracking within an archive repository + remote
3. download and preprocess training data
4. run the training loop and store results on-the-fly with MLflow
5. commit and push resulting models with DVC to repo + remote
6. reuse the model for inference and generate figures


## Installs

In [None]:
# conda installs
! conda install -y -c conda-forge tensorflow
! conda install -y -c conda-forge matplotlib
! conda install -y -c conda-forge pandas
! conda install -y -c conda-forge dvc 

# pip installs
! pip install mlflow
! pip install 'parsl[monitoring, visualization]' # Conda does not install monitoring, so use pip

## Imports

Based on the instructions in the [Parsl Tutorial](https://parsl.readthedocs.io/en/latest/1-parsl-introduction.html)

In [1]:
import os
import numpy as np
import pandas as pd

os.environ["KERAS_BACKEND"] = "tensorflow"

# ml dependencies
import tensorflow as tf
import keras
from keras import ops
from keras import layers

# mlflow dependencies
import mlflow
from mlflow import MlflowClient

# parsl dependencies
import parsl
import logging
from parsl.app.app import python_app, bash_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor # we want to use monitoring, so we must use HTEX
from parsl.monitoring.monitoring import MonitoringHub
from parsl.addresses import address_by_hostname

#=================================================
# Log everything to stdout (ends up in pink boxes 
# in the notebook). This information is logged anyway
# in ./runinfo/<run_id>/parsl.log
#parsl.set_stream_logger() # <-- log everything to stdout
#==================================================

print(parsl.__version__)

2024-09-17 17:39:11.272432: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-17 17:39:11.289098: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-17 17:39:11.294269: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-17 17:39:11.306749: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


2024.09.16


## Configure Parsl

This configuration must use the `HighThroughputExecutor` (HTEX) since we also want to enable [Parsl monitoring](https://parsl.readthedocs.io/en/latest/userguide/monitoring.html).

In [2]:
config = Config(
   executors=[
       HighThroughputExecutor(
           label="local_htex",
           cores_per_worker=1,
           max_workers_per_node=2,
           address=address_by_hostname(),
       )
   ],
   monitoring=MonitoringHub(
       hub_address=address_by_hostname(),
       hub_port=55055,
       monitoring_debug=False,
       resource_monitoring_interval=10,
   ),
   strategy='none'
)

# Loading the configuration starts a Parsl DataFlowKernel
dfk = parsl.load(config)

## Start Parsl monitoring - Option 1 - direct shell invocation to background

This step can be done at any point provided that a database file exists. By default this file is `./runinfo/monitoring.db`, and it is created when the Parsl configuration is loaded. When the notebook kernel is restarted, information about additional Parsl workflow runs is appended to the monitoring information in `./runinfo`. It is also possible to view this information "offline", i.e. with no Parsl workflow running (see Option 3 at the end of this notebook).

This launch is commented out here since it is also possible to launch `parsl-visualize` from a Parsl app within the workflow, which is done below. This command is retained as a functional example. The advantage to running `parsl-visualize` as a Parsl app is that the visualization server is up and running while the workflow is running and then is shut down when the workflow is cleaned up. Otherwise, when `parsl-visualize` is launched via `os.system` the running child process can persist even after workflow shut down or notebook kernel restart.

In [3]:
# Launch parsl-visualize in the background (retained as an example)
#os.system('parsl-visualize 1> parsl_vis.stdout 2> parsl_vis.stderr &')
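
One way to avoid the lingering child process described above is to keep a handle on it. The following is a minimal sketch using the standard-library `subprocess` module rather than `os.system`; the helper names `start_background` and `stop_background` are hypothetical and are not part of this repository or of Parsl.

```python
import subprocess


def start_background(cmd, stdout_path, stderr_path):
    """Start cmd (a list of arguments) in the background and return its Popen handle."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        # The child process receives its own copies of the file descriptors,
        # so the parent can close its copies right after the launch.
        return subprocess.Popen(cmd, stdout=out, stderr=err)


def stop_background(proc, timeout=10):
    """Ask the process to exit; kill it if it has not exited after timeout seconds."""
    if proc.poll() is None:          # still running
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode
```

For example, `vis = start_background(["parsl-visualize"], "parsl_vis.stdout", "parsl_vis.stderr")` would start the server, and `stop_background(vis)` would shut it down before the kernel is restarted.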

## Define Parsl apps

Parsl workflows are divided into the smallest unit of execution, the app. There are two types of Parsl apps:
1. Python apps are useful when launching pure Python code (e.g. TensorFlow)
2. Bash apps are useful when launching tasks on the command line (e.g. starting the MLflow server)

Here, the applications are *defined* but not run.
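
The difference between defining and running a task can be illustrated with the standard library alone. The sketch below is an analogy, not Parsl code: it uses `concurrent.futures`, whose `Future` interface (`.result()`, `.done()`) is the one Parsl's app futures also expose. The function `square` is a made-up example.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # Defining the function runs nothing.
    return x * x


with ThreadPoolExecutor(max_workers=2) as pool:
    # Submitting returns a future immediately; the work happens in the background.
    futures = [pool.submit(square, i) for i in range(4)]
    # .result() blocks until the corresponding task has finished.
    results = [f.result() for f in futures]

print(results)  # [0, 1, 4, 9]
```

With Parsl, calling a decorated app plays the role of `pool.submit(...)`: it returns a future right away, and `.result()` is what waits for the value.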

### Python Apps

In [4]:
# create the model directory for saving outputs
env_name = "digits_env" # <name of your env>
model_dir = './model-dir' 

@python_app
def make_dir(model_dir):
    import os
    # create the directory passed in by the caller (no-op if it already exists)
    os.makedirs(model_dir, exist_ok=True)
    return model_dir

In [5]:
@python_app
def build_model(inputs=[], outputs=[]):

    # imports ---------------------------------------------------------------------------------------------
    
    import os
    import numpy as np
    import pandas as pd

    os.environ["KERAS_BACKEND"] = "tensorflow"

    # ml dependencies
    import tensorflow as tf
    import keras
    from keras import ops
    from keras import layers

    # mlflow dependencies
    import mlflow
    from mlflow import MlflowClient
    
    # definition library
    import sys
    sys.path.append('/home/lobielodan/mlops-parsl-workflow')
    from definitions import Sampling, VAE
    
    # -----------------------------------------------------------------------------------------------------

    # build encoder ---------------------------------------------------------------------------------------
    
    latent_dim = 2
    encoder_inputs = keras.Input(shape=(28, 28, 1))
    
    x = layers.Conv2D(32, 3, activation="relu", strides=2, padding="same")(encoder_inputs)
    x = layers.Conv2D(64, 3, activation="relu", strides=2, padding="same")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(16, activation="relu")(x)
    
    z_mean = layers.Dense(latent_dim, name="z_mean")(x)
    z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
    z = Sampling()([z_mean, z_log_var])
    
    encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder")
    with open('model_build_out.txt', 'a') as f:
        encoder.summary(print_fn=lambda x: f.write(x + '\n'))
    
    # -----------------------------------------------------------------------------------------------------
    
    # build decoder ---------------------------------------------------------------------------------------
    
    latent_dim = 2
    latent_inputs = keras.Input(shape=(latent_dim,))
    
    x = layers.Dense(7 * 7 * 64, activation="relu")(latent_inputs)
    x = layers.Reshape((7, 7, 64))(x)
    x = layers.Conv2DTranspose(64, 3, activation="relu", strides=2, padding="same")(x)
    x = layers.Conv2DTranspose(32, 3, activation="relu", strides=2, padding="same")(x)
    
    decoder_outputs = layers.Conv2DTranspose(1, 3, activation="sigmoid", padding="same")(x)
    decoder = keras.Model(latent_inputs, decoder_outputs, name="decoder")
    with open('model_build_out.txt', 'a') as f:
        decoder.summary(print_fn=lambda x: f.write(x + '\n'))
    
    # -----------------------------------------------------------------------------------------------------
    
    # build model -----------------------------------------------------------------------------------------    
    
    vae = VAE(encoder, decoder)
    vae.compile(optimizer=keras.optimizers.Adam())
    
    return vae

In [6]:
@python_app
def train_model(inputs=[], outputs=[]): # inputs = [num, model, data, experiment]
    # imports live inside the app because HTEX runs it in a separate worker process
    import os
    os.environ["KERAS_BACKEND"] = "tensorflow"
    import keras
    import mlflow
    import pandas as pd

    model_dir = './model-dir'
    early_stopping_cb = keras.callbacks.EarlyStopping(patience = 5, restore_best_weights = True)

    # if the model has already been trained at least once, load that model
    if os.path.exists(os.path.join(model_dir, 'vae.weights.h5')):
        inputs[1].load_weights(os.path.join(model_dir, 'vae.weights.h5'))

    # point this process at the tracking server (same URI as the notebook client)
    mlflow.set_tracking_uri("http://127.0.0.1:8081")
    mlflow.autolog() # start autologging

    run_name = f"{inputs[0]}_test" # define a run name for this iteration of training
    artifact_path = f"{inputs[0]}"  # define an artifact path that the model will be saved to

    # initiate the MLflow run context
    # - training needs to happen inside of the mlflow run or you will run into problems with double logging
    with mlflow.start_run(run_name = run_name, experiment_id = inputs[3]) as run:

        history = inputs[1].fit(inputs[2], epochs=30, batch_size=128, callbacks = [early_stopping_cb])
        inputs[1].save_weights(os.path.join(model_dir, 'vae.weights.h5')) # save model weights after training

        hist_pd = pd.DataFrame(history.history)
        hist_pd.to_csv(os.path.join(model_dir, f'history_{inputs[0]}.csv'), index = False)

### Bash Apps

In [7]:
@bash_app
def start_parsl_visualize(stdout='parsl_vis_app.stdout', stderr='parsl_vis_app.stderr'):
    return 'parsl-visualize --listen 127.0.0.1 --port 8080'

@bash_app
def start_mlflow(stdout='mlflow_app.stdout', stderr='mlflow_app.stderr'):
    return 'mlflow server --host 127.0.0.1 --port 8081'

## Start Parsl monitoring - Option 2 - Monitoring as a Parsl app

This approach is helpful if we want Parsl Monitoring processes to be cleaned up after the workflow is complete.

In [8]:
# Start the servers in a separate cell
# since we only want to run these apps
# one time. The parsl-visualize launch
# is left commented out; uncomment it
# to make it part of the workflow.

# parsl_future = start_parsl_visualize()
mlflow_future = start_mlflow()
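
Calling `start_mlflow()` returns a future immediately, so the server may not be accepting connections yet when the next cells try to reach it. The following is a minimal sketch of a readiness check using only the standard library; `wait_for_port` is a hypothetical helper, not part of Parsl or MLflow, and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the server started above, `wait_for_port("127.0.0.1", 8081)` could be called before creating the `MlflowClient`.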

## Run the workflow

The workflow code below runs the applications.

In [9]:
future = make_dir(model_dir)

In [10]:
# utilize and set up the initialized server for tracking 
client = MlflowClient(tracking_uri = "http://127.0.0.1:8081")
mlflow.set_tracking_uri("http://127.0.0.1:8081")

In [None]:
# provide an experiment description that will appear in the UI
experiment1_description = (
    "This is the digits forecasting project. "
    "This experiment contains the digit model for randomized numbers (0-9) trained separately."
)

# provide searchable tags for the experiment
experiment1_tags = {
    "project_name": "digit-forecasting",
    "model_type": "randomized",
    "team": "digit-ml",
    "project_quarter": "Q3-2024",
    "mlflow.note.content": experiment1_description,
}

# create the experiment and give it a unique name
digit_experiment1 = client.create_experiment(
    name="Randomize_Model", tags=experiment1_tags
)

digit_experiment1 = mlflow.set_experiment("Randomize_Model")
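
Re-running the cell above against a tracking server that already holds an experiment with this name makes `create_experiment` raise an error. One possible workaround is a get-or-create pattern, sketched below. The helper name `get_or_create_experiment` is hypothetical; it relies only on the `MlflowClient` methods `get_experiment_by_name` and `create_experiment`, and it takes the client as an argument so it can be tried out without a running server.

```python
def get_or_create_experiment(client, name, tags=None):
    """Return the ID of the experiment called name, creating it if it does not exist."""
    existing = client.get_experiment_by_name(name)  # returns None if there is no match
    if existing is not None:
        return existing.experiment_id
    return client.create_experiment(name=name, tags=tags)
```

In this notebook the call would look like `get_or_create_experiment(client, "Randomize_Model", experiment1_tags)`.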

In [13]:
build = build_model()

NameError: name 'vae' is not defined

In [12]:
build

<AppFuture at 0x14c3792633d0 state=pending>

Exception in thread HTEX-Result-Queue-Thread:
Traceback (most recent call last):
  File "/home/lobielodan/pw/.miniconda3c/envs/digits_env/lib/python3.9/site-packages/keras/src/models/model.py", line 537, in from_config
    return cls(**config)
TypeError: __init__() missing 2 required positional arguments: 'encoder' and 'decoder'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lobielodan/pw/.miniconda3c/envs/digits_env/lib/python3.9/site-packages/keras/src/saving/serialization_lib.py", line 718, in deserialize_keras_object
    instance = cls.from_config(inner_config)
  File "/home/lobielodan/pw/.miniconda3c/envs/digits_env/lib/python3.9/site-packages/keras/src/models/model.py", line 539, in from_config
    raise TypeError(
TypeError: Unable to revive model from config. When overriding the `get_config()` method, make sure that the returned config contains all items used as arguments in the  constructor to <class 'defi

In [None]:
model = build.result()

In [None]:
# retrain the model n times, once per chunk of the data
n = 5

(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
mnist_digits = np.expand_dims(np.concatenate([x_train, x_test], axis=0), -1).astype("float32") / 255

for count, arr in enumerate(np.array_split(mnist_digits, n), start=1):
    future = train_model(inputs=[f"rand_{count}", model, arr, digit_experiment1.experiment_id])
    future.result() # wait, so the next round can load the weights saved by this one

## Stop Parsl

The cells above can be rerun any number of times; this will simply send more and more apps to be run by Parsl. When the workflow is truly complete, call `dfk.cleanup()`. This cleanup runs implicitly when a `main.py` script finishes executing, but in a notebook it is *not* run unless it is called explicitly, as it is below.

In [None]:
dfk.cleanup()

## Clean up some log files

In [None]:
# maybe write a script for this

In [None]:
# Application logs
! rm mlflow_app.stdout
! rm mlflow_app.stderr

# Remove log files if parsl-visualize is started from os.system (Option 1)
! rm parsl_vis.stdout
! rm parsl_vis.stderr

# Remove log files if parsl-visualize is started from Parsl app (Option 2)
! rm parsl_vis_app.stdout
! rm parsl_vis_app.stderr

# This directory contains Parsl monitoring along with other logs
! rm -rf runinfo
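
The same cleanup could be collected into a small script. The following is a sketch under the assumption that all log files sit in the notebook's working directory; `clean_logs` is a hypothetical helper, and the file names are the ones used for `stdout`/`stderr` elsewhere in this notebook. Unlike plain `rm`, it skips files that do not exist.

```python
import shutil
from pathlib import Path

# Log files written by Option 1, Option 2, and the MLflow bash app.
LOG_FILES = (
    "parsl_vis.stdout",
    "parsl_vis.stderr",
    "parsl_vis_app.stdout",
    "parsl_vis_app.stderr",
    "mlflow_app.stdout",
    "mlflow_app.stderr",
)


def clean_logs(base_dir=".", files=LOG_FILES, dirs=("runinfo",)):
    """Delete the listed files and directories under base_dir; return what was removed."""
    base = Path(base_dir)
    removed = []
    for name in files:
        path = base / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    for name in dirs:
        path = base / name
        if path.is_dir():
            shutil.rmtree(path)  # removes the monitoring database as well
            removed.append(name)
    return removed
```

Calling `clean_logs()` from the notebook's directory returns the list of names that were actually deleted.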

## Start Parsl Monitoring - Option 3 - Post workflow manual invocation

Once the Parsl `./runinfo/monitoring.db` is created, it is possible to start Parsl Monitoring and browse the results of the workflow offline. In this scenario, `parsl-visualize` can be started on the command line provided that a Conda env with `parsl[visualization]` installed is activated. For example:
```
source pw/.miniconda3/etc/profile.d/conda.sh
conda activate base
parsl-visualize sqlite:////${HOME}/mlops-parsl-workflow/runinfo/monitoring.db
```
(You may need to adjust the path to the Conda environment, its name, and the path to `monitoring.db`.)
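
Because `monitoring.db` is an SQLite file, it can also be inspected without starting `parsl-visualize`. The sketch below uses the standard-library `sqlite3` module to list every table with its row count; `summarize_db` is a hypothetical helper, and no assumption is made about which tables Parsl creates.

```python
import sqlite3
from pathlib import Path


def summarize_db(db_path):
    """Return {table_name: row_count} for every table in an SQLite database file."""
    if not Path(db_path).is_file():
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(str(db_path))
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names are quoted in case one collides with an SQL keyword.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

For example, `summarize_db("runinfo/monitoring.db")` gives a quick check that a run was recorded before launching the web interface.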