# Parameterization


Notebooks often need values injected from outside the notebook at execution time. These might be service credentials, specific endpoints, or per-run variations (e.g. a notebook running a hyperparameter grid search may be executed many times in parallel with only the external variables changing). Today, there are two primary ways to parameterize notebooks: with a dedicated tool (like Papermill) or with code generation directly into the notebook (e.g. regex replacement against fields like the ones below).

__Today:__ Values are hard-coded and must be changed by hand.

In [None]:
epochs = 200
data_source = "http://data.contoso.com/blob/important_data.csv"
postgresql_credentials = ""
test_cert_store_root = "/var/opt/secrets/test-certificates" # maps to local file system, not uri
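
One way to avoid editing the cell by hand is to let the environment override the defaults. The following is a minimal sketch, not a description of how Papermill works: the `NB_` prefix, the `DEFAULTS` table and the `load_parameters` helper are all assumptions made for illustration.

```python
import os

# Hypothetical defaults, copied from the hard-coded cell above.
DEFAULTS = {
    "epochs": "200",
    "data_source": "http://data.contoso.com/blob/important_data.csv",
}


def load_parameters(env=None, prefix="NB_"):
    """Return notebook parameters, letting NB_<NAME> variables override defaults."""
    env = os.environ if env is None else env
    params = {
        name: env.get(prefix + name.upper(), default)
        for name, default in DEFAULTS.items()
    }
    params["epochs"] = int(params["epochs"])  # convert after the override is applied
    return params


params = load_parameters()
print(params)
```

Passing `env` explicitly keeps the helper testable, and the same notebook can then run unchanged in a grid search where only `NB_EPOCHS` differs between runs.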

# Lack of Environment Description

Import behavior is one of the most challenging parts of a notebook, and one of the most common to get wrong. Typically, people import several packages at the top of a file. They are unlikely to pin specific versions and may use constructs which are hard to introspect (e.g. 'from foo import bar as qaz'). Moreover, the libraries may be installed outside the notebook itself, via inline bash commands or from the command line (Jupyter notebooks execute inside the environment they were launched from, so if packages were installed there, the notebook will run normally). This often leads to mismatched environments when the notebook is deployed to another environment or containerized. Because deploying a complex pipeline takes so long, it could be 10 minutes or more before anyone notices that something is wrong.

__Today:__ Requires that the packages are already installed, with no versions specified.

In [None]:
import numpy
import matplotlib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
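
As a rough illustration of what an environment description could capture, the sketch below records the installed version of each distribution the cell above depends on, using the standard library's `importlib.metadata`. The `installed_versions` helper is hypothetical, and the list of distributions is an assumption based on the imports shown.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names can differ from import names (scikit-learn vs sklearn).
report = installed_versions(["numpy", "matplotlib", "scikit-learn"])
for name, version in report.items():
    print(f"{name}=={version}" if version else f"# {name} is not installed")
```

The printed lines follow the `name==version` form used in requirements files, so the output could be saved alongside the notebook and compared against the target environment before deployment.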

# Use of Inline Local Commands

In addition to setting up complicated imports, data scientists often run inline bash to get (or change) system information. This is particularly pernicious because it is hard to introspect, makes errors hard to capture, and may change the environment altogether.

__Today__: Use an escape sequence to execute locally.

In [None]:
# Upgrade pip
!python -m pip install --upgrade pip

# Install a package
!pip install tensorflow

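# Execute a command using a CLI - manually parsing output, regexing for errors, and using p.wait()
# NOTE: cmd, cmd_actual, no_output, return_output and output are not defined in this excerpt
import re
from subprocess import Popen, PIPE
from IPython.display import Markdown, display

wait = True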
try:
    if no_output:
        p = Popen(cmd_actual)
    else:
        p = Popen(cmd_actual, stdout=PIPE, stderr=PIPE, bufsize=1)
        with p.stdout:
            for line in iter(p.stdout.readline, b''):
                line = line.decode()
                if return_output:
                    output = output + line
                else:
                    if cmd.startswith("azdata notebook run"): # Hyperlink the .ipynb file
                        regex = re.compile(r'  "(.*)": "(.*)"')
                        match = regex.match(line)
                        if match:
                            if match.group(1).find("HTML") != -1:
                                display(Markdown(f' - "{match.group(1)}": "{match.group(2)}"'))
                            else:
                                display(Markdown(f' - "{match.group(1)}": "[{match.group(2)}]({match.group(2)})"'))

                                wait = False
                                break # otherwise infinite hang, have not worked out why yet.
                    else:
                        print(line, end='')

    if wait:
        p.wait()
except FileNotFoundError as e:
    raise FileNotFoundError(f"Executable '{cmd_actual[0]}' not found in path (where/which)") from e
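
For comparison, much of the manual bookkeeping above (decoding lines, deciding whether to wait, detecting failure) can be delegated to `subprocess.run`. This is a minimal sketch under the assumption that the output does not need to be streamed line by line; `run_command` is a hypothetical helper, not part of the original notebook.

```python
import subprocess
import sys


def run_command(args, timeout=60):
    """Run a command, returning its stdout; raise if it fails or times out."""
    completed = subprocess.run(
        args,
        capture_output=True,  # collect stdout and stderr instead of streaming them
        text=True,            # decode bytes to str
        check=True,           # non-zero exit raises CalledProcessError
        timeout=timeout,      # a hung process raises TimeoutExpired
    )
    return completed.stdout


print(run_command([sys.executable, "-c", "print('hello')"]), end="")
```

Because failures surface as exceptions (`CalledProcessError`, `TimeoutExpired`, `FileNotFoundError`), an automation layer can catch and report them without regex matching on the output.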

# Exception Handling and Live Tracing

One of the most powerful features of Jupyter notebooks is the ability to print inputs and outputs between cells as a tool to debug and understand what is going on. However, as notebooks move to production, exceptions and printouts can get lost because they are surfaced through tools that are not logged. For ordinary observability this is problematic; for errors (particularly unhandled ones) it can lead to business-critical reliability issues.

__Today__: Use print statements to understand processes, and stdout for reporting exceptions.

In [None]:
# For loading a Kubernetes client for use
import os
from IPython.display import Markdown

# ISSUE: Try-except block for loading libraries
try:
    from kubernetes import client, config
    from kubernetes.stream import stream
except ImportError: 

    # Install the Kubernetes module
    import sys
    !{sys.executable} -m pip install kubernetes    
    
    try:
        from kubernetes import client, config
        from kubernetes.stream import stream
    except ImportError:
        display(Markdown(f'HINT: Use [SOP059 - Install Kubernetes Python module](../install/sop059-install-kubernetes-module.ipynb) to resolve this issue.'))
        raise

if "KUBERNETES_SERVICE_PORT" in os.environ and "KUBERNETES_SERVICE_HOST" in os.environ:
    config.load_incluster_config()
else:
    # ISSUE: Try-except block for loading Kubeconfig
    try:
        config.load_kube_config()
    except:
        display(Markdown(f'HINT: Use [TSG118 - Configure Kubernetes config](../repair/tsg118-configure-kube-config.ipynb) to resolve this issue.'))
        raise

# ISSUE: No try-except block for loading the API (could fail silently)
api = client.CoreV1Api()

# ISSUE: Print out for notifying that client loaded correctly (should be auto-logged/traceable)
print('Kubernetes client instantiated')

# Idea - check or wrap? See AST syntax tracing here - https://engineering.soroco.com/abstract-syntax-tree-for-patching-code-and-assessing-code-quality/
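
A possible direction for the print-based tracing above is to route the same messages through the standard `logging` module, so that they carry a timestamp and level and can be collected when the notebook runs unattended. In this sketch the logger name and the `load_client` wrapper are assumptions, and `dict` merely stands in for `client.CoreV1Api`.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("notebook.kubernetes")  # hypothetical logger name


def load_client(factory):
    """Call `factory` and log the outcome instead of printing it."""
    try:
        instance = factory()
    except Exception:
        logger.exception("Failed to instantiate client")  # records the traceback
        raise
    logger.info("Client instantiated: %s", type(instance).__name__)
    return instance


demo_client = load_client(dict)  # `dict` stands in for client.CoreV1Api here
```

The exception is still re-raised, so an unhandled failure stops the run, but the traceback is also written to the log where a monitoring system can find it.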

In [None]:
# ISSUE: Functions and tracing often cross cells
def important_math(a, b):
    return (a * 10, b / 20)

a = 20
b = 40
print(f"A: {a}")
print(f"B: {b}")

In [None]:
# ISSUE: Second cell for execution - works when executing monolithically, does not work when cells auto-split

a, b = important_math(a, b)

print(f"A: {a}")
print(f"B: {b}")
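
One way to make the two cells above safe to run separately is to pass state explicitly instead of rebinding the globals `a` and `b`. This is only a sketch: `step_scale` and the dictionary-based hand-off are assumptions, not an existing convention in the notebook.

```python
def important_math(a, b):
    return (a * 10, b / 20)


def step_scale(inputs):
    """Pure step: reads only `inputs` and returns new values without touching globals."""
    a, b = important_math(inputs["a"], inputs["b"])
    return {"a": a, "b": b}


initial = {"a": 20, "b": 40}
result = step_scale(initial)
print(initial)  # unchanged, so re-running this step gives the same result
print(result)
```

Because `initial` is never modified, the step can be re-executed or moved to another process without the values compounding on each run.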

# Not Built to Execute Headlessly or Declaratively

In addition to the parameterization issues described at the top, scripts are often written to pick up input from users interactively during the run (via input prompts or hand-changed parameters) and are not built to execute idempotently. This means that notebooks cannot be dropped into automation without rewriting significant portions of code, often without tests. Ideally, the system would be more aware of the environment it is executing in, and would treat overrides for inputs (via environment variables or file injection) and "execute once" semantics as first-class concepts.

Further, especially when executing headlessly, a very common pattern is to poll indefinitely until something fails, which is another poor fit for automation environments.

__Today__: Fragile notebooks that have to be hand run and checked at each step.


In [None]:
# ISSUE: Inject values for a given function via hand changed variables
import os
import random
import re

from azure.common.client_factory import get_client_from_cli_profile
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.rdbms.mysql import MySQLManagementClient
from azure.mgmt.rdbms.mysql.models import ServerForCreate, ServerPropertiesForDefaultCreate, ServerVersion

pod = None # All
container = "app-service-proxy"
expressions_to_analyze = [
    re.compile(r".{23}\[error\]")  # brackets escaped so the pattern matches the literal text "[error]"
]

RESOURCE_GROUP_NAME = 'SampleDB-Resource-Group'
LOCATION = "westus"

# ISSUE: Creation of resources without checking to see if they have already been created (relying on the service to fail nicely or not create duplicates)
resource_client = get_client_from_cli_profile(ResourceManagementClient)
rg_result = resource_client.resource_groups.create_or_update(RESOURCE_GROUP_NAME,
    { "location": LOCATION })

# ISSUE: More magic constants
db_server_name = os.environ.get("DB_SERVER_NAME", f"SampleDB-MySQL-{random.randint(1,100000):05}")
db_admin_name = os.environ.get("DB_ADMIN_NAME", "azureuser")
db_admin_password = os.environ.get("DB_ADMIN_PASSWORD", "ChangePa$$w0rd24")

# ISSUE: No checking to see if the client was provisioned properly, or raised exceptions
mysql_client = get_client_from_cli_profile(MySQLManagementClient)

# ISSUE: Provision the server and wait for the result - just polling
poller = mysql_client.servers.create(RESOURCE_GROUP_NAME,
    db_server_name,
    ServerForCreate(
        location=LOCATION,
        properties=ServerPropertiesForDefaultCreate(
            administrator_login=db_admin_name,
            administrator_login_password=db_admin_password,
            version=ServerVersion.five_full_stop_seven
        )
    )
)

server = poller.result()

# ISSUE: Block of code for additional services (firewall and database provisioning) - no global concept of complete or undo

RULE_NAME = "allow_ip"

# ISSUE: Relying on os specifics for IP address 
ip_address = os.environ["PUBLIC_IP_ADDRESS"]


poller = mysql_client.firewall_rules.create_or_update(RESOURCE_GROUP_NAME,
    db_server_name, RULE_NAME,
    ip_address,  # Start ip range
    ip_address   # End ip range
)
firewall_rule = poller.result()

# ISSUE: Another magic constant, inline with code
DB_NAME = "example-db1"

# ISSUE: Another polling example, no structured reporting
poller = mysql_client.databases.create_or_update(RESOURCE_GROUP_NAME,
    db_server_name, DB_NAME)

db_result = poller.result()
print(f"Provisioned MySQL database {db_result.name} with ID {db_result.id}")
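
Two of the issues flagged above, creating resources without checking for them first and polling without a limit, can be illustrated independently of any cloud SDK. The sketch below uses an in-memory dictionary as a stand-in for the service; `ensure_resource`, `wait_until`, `fake_get` and `fake_create` are hypothetical names introduced for illustration only.

```python
import time


def ensure_resource(name, get, create):
    """Return the existing resource if `get` finds one; otherwise create it once."""
    existing = get(name)
    return existing if existing is not None else create(name)


def wait_until(is_done, timeout=30.0, interval=1.0):
    """Poll `is_done` until it returns True, raising TimeoutError at the deadline."""
    deadline = time.monotonic() + timeout
    while not is_done():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(interval)


# In-memory stand-in for a cloud service, used only to exercise the helpers.
store = {}


def fake_get(name):
    return store.get(name)


def fake_create(name):
    store[name] = {"name": name, "state": "ready"}
    return store[name]


first = ensure_resource("example-db1", fake_get, fake_create)
second = ensure_resource("example-db1", fake_get, fake_create)
print(first is second)  # the second call reuses the resource instead of recreating it
wait_until(lambda: store["example-db1"]["state"] == "ready", timeout=5, interval=0.1)
```

With a real service, `get` and `create` would wrap the corresponding SDK calls, and the deadline would turn a silent hang into an error that automation can act on.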

In [None]:
# No caching for common functions


In [None]:
# No parallel execution
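
A minimal sketch of running independent variations in parallel with the standard library's `concurrent.futures` follows. The `run_variation` function is a hypothetical stand-in for one parameterized run, and a thread pool is assumed to be adequate because real runs would mostly wait on external services.

```python
from concurrent.futures import ThreadPoolExecutor


def run_variation(epochs):
    """Hypothetical stand-in for one notebook run; returns a small result record."""
    return {"epochs": epochs, "score": epochs / 100}


variations = [50, 100, 200]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_variation, variations))  # map preserves input order

print(results)
```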



In [None]:
# Run step statelessly with High Mem


In [None]:
# Run step with GPU


In [None]:
# Out of order execution (to prevent blocking) - DAG follows from step early on (skipping a bunch of long running steps)

cmd = """echo "CPU %\t MEM %\t MEM\t PROCESS" &&
ps aux |
awk '
    {mem[$11] += int($6/1024)};
    {cpuper[$11] += $3};
    {memper[$11] += $4};
END {
    for (i in mem) {
        print cpuper[i] "%\t", memper[i] "%\t", mem[i] "MB\t", i
    }
}' |
sort -k3nr
"""

pod_list = api.list_namespaced_pod(namespace)
pod_names = [pod.metadata.name for pod in pod_list.items]

for pod in pod_list.items:
    container_names = [container.name for container in pod.spec.containers]

    for container in container_names:
        print (f"CONTAINER: {container} / POD: {pod.metadata.name}")
        try:
            print(stream(api.connect_get_namespaced_pod_exec, pod.metadata.name, namespace, command=['/bin/sh', '-c', cmd], container=container, stderr=True, stdout=True))
        except Exception:
            print (f"Failed to get CPU/Memory for container: {container} in POD: {pod.metadata.name}")

In [None]:
# No retry automation for external service
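
Retrying a call to an external service could be sketched as a small wrapper with exponential backoff. The `retry` helper, its defaults and the `flaky` function below are assumptions for illustration; which exceptions are safe to retry depends on the service being called.

```python
import time


def retry(operation, attempts=3, base_delay=0.5, retry_on=(ConnectionError, TimeoutError)):
    """Call `operation`, retrying on the listed exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))


calls = []


def flaky():
    """Hypothetical operation that fails twice before succeeding."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry(flaky, base_delay=0.01))
```

Exceptions outside `retry_on` propagate immediately, so programming errors are not hidden behind repeated attempts.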


In [None]:
# Save to common storage location (with metadata) - future debugging, and means everything here can be torn down
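
Saving results together with their metadata might look like the sketch below, which writes a single JSON record to a directory. The `save_run` helper, the `run.json` file name and the record layout are assumptions; a temporary directory stands in for the common storage location.

```python
import json
import tempfile
import time
from pathlib import Path


def save_run(output_dir, results, parameters):
    """Write results plus the parameters and timestamp that produced them."""
    record = {
        "saved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "parameters": parameters,
        "results": results,
    }
    path = Path(output_dir) / "run.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = save_run(tmp, {"score": 2.0}, {"epochs": 200})
    loaded = json.loads(saved.read_text())
    print(loaded["parameters"], loaded["results"])
```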