# Simple and Robust ML Pipeline

With this notebook, I want to highlight the fundamentals of a robust ML pipeline.
They are:
- Steps are [pure functions](https://en.wikipedia.org/wiki/Pure_function)
- Each step does one small thing
- Outputs are persisted between the steps
- Abstraction layers are nested like an onion
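
To make the first three points concrete, here is a minimal sketch of one small step. The step name `double_values`, the JSON file format and the directory layout are assumptions picked for this illustration; they are not part of the pipeline further down. The step is "pure" in the sense that its result depends only on its arguments: it reads from a path, writes into its own subdirectory, and returns where the output lives.

```python
import json
import tempfile
from pathlib import Path


def double_values(input_path: str, output_dir: str) -> dict:
    """Hypothetical step: read numbers, double them, persist the result."""
    step_output_dir = Path(output_dir) / "double_values"
    step_output_dir.mkdir(parents=True, exist_ok=True)

    values = json.loads(Path(input_path).read_text())
    doubled = [2 * value for value in values]

    doubled_path = step_output_dir / "doubled.json"
    doubled_path.write_text(json.dumps(doubled))
    # Return where to find the output instead of the data itself.
    return {"doubled_path": str(doubled_path)}


# Usage with a throwaway directory and artificial input.
run_dir = tempfile.mkdtemp()
input_path = Path(run_dir) / "values.json"
input_path.write_text(json.dumps([1, 2, 3]))

result = double_values(input_path=str(input_path), output_dir=run_dir)
doubled = json.loads(Path(result["doubled_path"]).read_text())
print(doubled)
```

Because the step only talks to the outside world through paths, the next step can be developed and rerun without recomputing this one.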

ML Code is just a small part when doing Machine Learning

Original Source: *Sculley, David, et al. "Hidden technical debt in machine learning systems." Advances in neural information processing systems 28 (2015): 2503-2511.*

![ML Code is just a small part when doing Machine Learning](https://cloud.google.com/architecture/images/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning-1-elements-of-ml.png)

![What a data pipeline looks like](https://cloud.google.com/architecture/images/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning-2-manual-ml.svg)

# This is about workflow and tooling

- We write this code for humans, not for machines
- We write the code for our coworkers
- Each component should be independent
- Persist all outputs and metadata!
- Verbosity is your friend
- Use typed Python
- Use CSV for serialization between steps only when the data is small; Parquet is the better option for larger data
- Lego variable naming (names built from reusable blocks, such as `raw_data`, `raw_data_path`, `raw_data_statistics_path`) improves traceability
- Be consistent with variable and file naming
- After each step, return where to find the outputs
- Consistency is very important: there should be only one way to do each thing
- It is about bringing habits to code
- Write unit tests for every component
- Write an end-to-end test with artificial data and verify the output
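
As a small illustration of "use typed Python", lego naming and "return where to find the outputs", a step's return value could be described with a `TypedDict`. This is only a sketch: `ExtractResult` and `describe_extract_outputs` are hypothetical names, and the pipeline in this notebook returns plain dicts.

```python
from typing import TypedDict


class ExtractResult(TypedDict):
    """Hypothetical return type for an extract-like step."""

    raw_data_path: str
    raw_data_statistics_path: str


def describe_extract_outputs(output_dir: str) -> ExtractResult:
    """Build the output locations of an extract-like step without touching disk."""
    extract_output_dir = f"{output_dir}/extract"
    return {
        "raw_data_path": f"{extract_output_dir}/raw_data.parquet",
        "raw_data_statistics_path": f"{extract_output_dir}/raw_data_statistics.csv",
    }


extract_outputs = describe_extract_outputs(output_dir="first_pipeline_run")
print(extract_outputs["raw_data_path"])
```

A type checker can then flag a misspelled key such as `raw_data_pth` before the pipeline ever runs.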

This is your small pipeline.

In [7]:
import uuid

In [8]:
def short_hash():
    # uuid4() returns a UUID object, so convert it to a string before splitting.
    return str(uuid.uuid4()).split("-")[0]

In [1]:
import os

In [2]:
pipeline_run_id = "first_pipeline_run"
# In the future: pipeline_run_id = str(uuid.uuid4())
output_dir = pipeline_run_id
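
One possible way to generate run IDs later on, sketched here as an assumption rather than what this notebook does: combine a UTC timestamp with a short random suffix, so run directories sort by time and still do not collide. `make_run_id` and its `prefix` parameter are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def make_run_id(prefix: str = "run") -> str:
    """Hypothetical helper: sortable timestamp plus a short random suffix."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    suffix = str(uuid.uuid4()).split("-")[0]
    return f"{prefix}_{timestamp}_{suffix}"


example_run_id = make_run_id()
print(example_run_id)
```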

### We create a directory for the pipeline run outputs

In [3]:
os.makedirs(output_dir, exist_ok=True)

### [Methods] First develop interactively

In [7]:
# load data
import os
import pandas as pd

extract_output_dir = f"{output_dir}/extract"
os.makedirs(extract_output_dir)

raw_data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
raw_data_path = f"{extract_output_dir}/raw_data.parquet"
raw_data.to_parquet(raw_data_path)

raw_data_statistics = raw_data.describe()
raw_data_statistics_path = f"{extract_output_dir}/raw_data_statistics.csv"
raw_data_statistics.to_csv(raw_data_statistics_path)

In [10]:
# Delete the interactive outputs so that the function below can recreate the directory
!rm -r $extract_output_dir

### [Methods] Then turn it into a self-contained function

In [51]:
def extract(output_dir: str) -> dict:
    params = locals()
    from loguru import logger
    import os
    import pandas as pd
    import uuid
    logger.info(f"extract started with {params}.")

    extract_output_dir = f"{output_dir}/extract"
    os.makedirs(extract_output_dir)
    
    raw_data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    raw_data_path = f"{extract_output_dir}/raw_data.parquet"
    raw_data.to_parquet(raw_data_path)
    
    raw_data_statistics = raw_data.describe()
    raw_data_statistics_path = f"{extract_output_dir}/raw_data_statistics.csv"
    raw_data_statistics.to_csv(raw_data_statistics_path)
    logger.info("extract finished.")
    return {
        "raw_data_path": raw_data_path,
        "raw_data_statistics_path": raw_data_statistics_path
    }

### Test the function

In [13]:
extract_result = extract(output_dir=output_dir)
extract_result

2021-06-17 00:03:54.393 | INFO     | __main__:extract:5 - Load data started.
2021-06-17 00:03:54.522 | INFO     | __main__:extract:16 - Load data finished.


{'raw_data_path': 'first_pipeline_run/extract/raw_data.parquet',
 'raw_data_statistics_path': 'first_pipeline_run/extract/raw_data_statistics.csv'}

In [66]:
def prepare(raw_data_path: str, features: list, standardize: bool, output_dir: str) -> dict:
    """Prepare the selected features and standardize if wanted."""
    params = locals()
    from loguru import logger
    import os
    import pandas as pd
    logger.info(f"prepare started with {params}")

    prepare_output_dir = f"{output_dir}/prepare"
    os.makedirs(prepare_output_dir)

    raw_data = pd.read_parquet(raw_data_path)
    X = raw_data[features]
    y = raw_data["species"]

    mean = X.mean()
    std = X.std()

    if standardize:
        logger.info("Standardize X.")
        X = (X - mean) / std

    X_path = f"{prepare_output_dir}/X.parquet"
    y_path = f"{prepare_output_dir}/y.parquet"

    X.to_parquet(X_path)
    y.to_frame().to_parquet(y_path)
    logger.info("prepare finished.")
    return {
        "mean": mean.to_dict(),
        "std": std.to_dict(),
        "X_path": X_path,
        "y_path": y_path
    }

In [30]:
prepare_result = prepare(
    raw_data_path=extract_result["raw_data_path"],
    features=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    standardize=True,
    output_dir=pipeline_run_id,
)

2021-06-17 00:10:12.673 | INFO     | __main__:prepare:7 - prepare started with {'raw_data_path': 'first_pipeline_run/extract/raw_data.parquet', 'features': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'standardize': True, 'output_dir': 'first_pipeline_run'}
2021-06-17 00:10:12.692 | INFO     | __main__:prepare:26 - prepare finished.


In [39]:
def train(X_path: str, y_path: str, output_dir: str, clf_params: dict) -> dict:
    params = locals()
    from joblib import dump
    from loguru import logger
    import os
    import pandas as pd
    from sklearn.metrics import classification_report as clf_report
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    logger.info(f"train started with {params}.")
    train_output_dir = f"{output_dir}/train"
    os.makedirs(train_output_dir)

    X = pd.read_parquet(X_path)
    y = pd.read_parquet(y_path)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.8)
    
    clf = DecisionTreeClassifier(random_state=42, **clf_params)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    classification_report = clf_report(y_true=y_test, y_pred=y_pred, output_dict=True)
    model_path = f"{train_output_dir}/model.joblib"
    dump(clf, model_path)
    logger.info("train finished.")
    return {
        "model_path": model_path,
        "classification_report": classification_report
    }

In [42]:
train_result = train(
    X_path=prepare_result["X_path"],
    y_path=prepare_result["y_path"],
    output_dir=pipeline_run_id,
    clf_params={"max_depth": 2},
)
train_result

2021-06-17 00:15:08.638 | INFO     | __main__:train:10 - train started with {'X_path': 'first_pipeline_run/prepare/X.parquet', 'y_path': 'first_pipeline_run/prepare/y.parquet', 'output_dir': 'dudu', 'clf_params': {'max_depth': 2}}.
2021-06-17 00:15:08.659 | INFO     | __main__:train:25 - train finished.


{'model_path': 'dudu/train/model.joblib',
 'classification_report': {'setosa': {'precision': 1.0,
   'recall': 1.0,
   'f1-score': 1.0,
   'support': 10},
  'versicolor': {'precision': 1.0,
   'recall': 0.8888888888888888,
   'f1-score': 0.9411764705882353,
   'support': 9},
  'virginica': {'precision': 0.9166666666666666,
   'recall': 1.0,
   'f1-score': 0.9565217391304348,
   'support': 11},
  'accuracy': 0.9666666666666667,
  'macro avg': {'precision': 0.9722222222222222,
   'recall': 0.9629629629629629,
   'f1-score': 0.9658994032395567,
   'support': 30},
  'weighted avg': {'precision': 0.9694444444444444,
   'recall': 0.9666666666666667,
   'f1-score': 0.9664109121909632,
   'support': 30}}}

In [72]:
def validate(classification_report: dict, macro_avg_f1_score_min: float) -> dict:
    params = locals()
    from loguru import logger
    logger.info("validate started.")
    macro_avg_f1_score = classification_report["macro avg"]["f1-score"]

    # The run passes when the macro-averaged F1 score reaches the required minimum.
    passed = macro_avg_f1_score >= macro_avg_f1_score_min
    logger.info("validate finished.")
    return {
        "passed": passed
    }

In [49]:
validate_result = validate(classification_report=train_result["classification_report"], macro_avg_f1_score_min=0.95)
validate_result

{'passed': True}

In [50]:
validate_result = validate(classification_report=train_result["classification_report"], macro_avg_f1_score_min=0.99)
validate_result

{'passed': False}
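
Since the validate step is just a threshold check, it is an easy candidate for a unit test. The sketch below uses a stand-in function, `passes_threshold` (a hypothetical name), that repeats the comparison without the logging, and feeds it an artificial classification report.

```python
def passes_threshold(classification_report: dict, macro_avg_f1_score_min: float) -> bool:
    """Stand-in for the validate step's check, without logging."""
    return classification_report["macro avg"]["f1-score"] >= macro_avg_f1_score_min


def test_passes_threshold() -> None:
    artificial_report = {"macro avg": {"f1-score": 0.96}}
    assert passes_threshold(artificial_report, macro_avg_f1_score_min=0.95)
    assert not passes_threshold(artificial_report, macro_avg_f1_score_min=0.99)
    # A score exactly at the minimum passes, as in the validate step.
    assert passes_threshold(artificial_report, macro_avg_f1_score_min=0.96)


test_passes_threshold()
```

In a real project this test would live in a test module and run under a test runner instead of being called by hand.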

In [78]:
import uuid

def pipeline(
    output_dir: str,
    prepare_features: list,
    prepare_standardize: bool,
    train_clf_params: dict,
    validate_macro_avg_f1_score_min: float
) -> dict:
    """Run extract, prepare, train and validate; return the result of every step."""
    # Fall back to a random run directory when none is configured.
    if not output_dir:
        output_dir = str(uuid.uuid4())
    extract_result = extract(output_dir=output_dir)

    # TODO: json.dumps
    prepare_result = prepare(
        raw_data_path=extract_result["raw_data_path"],
        features=prepare_features,
        standardize=prepare_standardize,
        output_dir=output_dir,
    )
    train_result = train(
        X_path=prepare_result["X_path"],
        y_path=prepare_result["y_path"],
        output_dir=output_dir,
        clf_params=train_clf_params,
    )
    validate_result = validate(
        classification_report=train_result["classification_report"],
        macro_avg_f1_score_min=validate_macro_avg_f1_score_min,
    )
    return {
        "extract": extract_result,
        "prepare": prepare_result,
        "train": train_result,
        "validate": validate_result,
    }

In [79]:
import yaml

In [80]:
with open("config/run1.yaml") as f:
    config = yaml.safe_load(f)
config

{'output_dir': None,
 'prepare': {'features': ['sepal_length',
   'sepal_width',
   'petal_length',
   'petal_width'],
  'standardize': False},
 'train': {'clf_params': {'criterion': 'gini', 'max_depth': 2}},
 'validate': {'macro_avg_f1_score_min': 0.95}}
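
The nested config maps onto the pipeline parameters by a simple rule: the step name becomes a prefix. A small helper could do that mapping in one place. `config_to_pipeline_kwargs` is a hypothetical name, and the inline `example_config` only mimics the structure shown above.

```python
def config_to_pipeline_kwargs(config: dict) -> dict:
    """Hypothetical helper: map the nested config onto pipeline() keyword names."""
    return {
        "output_dir": config["output_dir"],
        "prepare_features": config["prepare"]["features"],
        "prepare_standardize": config["prepare"]["standardize"],
        "train_clf_params": config["train"]["clf_params"],
        "validate_macro_avg_f1_score_min": config["validate"]["macro_avg_f1_score_min"],
    }


# Same structure as the loaded config, written inline for the example.
example_config = {
    "output_dir": None,
    "prepare": {"features": ["sepal_length", "sepal_width"], "standardize": False},
    "train": {"clf_params": {"criterion": "gini", "max_depth": 2}},
    "validate": {"macro_avg_f1_score_min": 0.95},
}
pipeline_kwargs = config_to_pipeline_kwargs(example_config)
print(sorted(pipeline_kwargs))
```

With such a helper, a run would reduce to `pipeline(**config_to_pipeline_kwargs(config))`, and a missing config key would fail early with a `KeyError`.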

In [83]:
pipeline(
    output_dir=config["output_dir"],
    prepare_features=config["prepare"]["features"],
    prepare_standardize=config["prepare"]["standardize"],
    train_clf_params=config["train"]["clf_params"],
    validate_macro_avg_f1_score_min=config["validate"]["macro_avg_f1_score_min"],
)

2021-06-17 00:57:18.385 | INFO     | __main__:extract:7 - extract started with {'output_dir': '57494766-001f-4e2b-86cd-d9d9e8685f54'}.
2021-06-17 00:57:18.524 | INFO     | __main__:extract:20 - extract finished.
2021-06-17 00:57:18.525 | INFO     | __main__:prepare:7 - prepare started with {'raw_data_path': '57494766-001f-4e2b-86cd-d9d9e8685f54/extract/raw_data.parquet', 'features': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'standardize': False, 'output_dir': '57494766-001f-4e2b-86cd-d9d9e8685f54'}
2021-06-17 00:57:18.533 | INFO     | __main__:prepare:28 - prepare finished.
2021-06-17 00:57:18.533 | INFO     | __main__:train:10 - train started with {'X_path': '57494766-001f-4e2b-86cd-d9d9e8685f54/prepare/X.parquet', 'y_path': '57494766-001f-4e2b-86cd-d9d9e8685f54/prepare/y.parquet', 'output_dir': '57494766-001f-4e2b-86cd-d9d9e8685f54', 'clf_params': {'criterion': 'gini', 'max_depth': 2}}.
2021-06-17 00:57:18.546 | INFO     | __main__:train:25 - train finished.
202

In [None]:
with open("config/run2.yaml") as f:
    config = yaml.safe_load(f)
config

In [84]:
pipeline(
    output_dir=config["output_dir"],
    prepare_features=config["prepare"]["features"],
    prepare_standardize=config["prepare"]["standardize"],
    train_clf_params=config["train"]["clf_params"],
    validate_macro_avg_f1_score_min=config["validate"]["macro_avg_f1_score_min"],
)

2021-06-17 00:57:21.838 | INFO     | __main__:extract:7 - extract started with {'output_dir': 'ba941bc9-5877-4be0-b94d-e5d17f77ddb3'}.
2021-06-17 00:57:22.010 | INFO     | __main__:extract:20 - extract finished.
2021-06-17 00:57:22.011 | INFO     | __main__:prepare:7 - prepare started with {'raw_data_path': 'ba941bc9-5877-4be0-b94d-e5d17f77ddb3/extract/raw_data.parquet', 'features': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 'standardize': False, 'output_dir': 'ba941bc9-5877-4be0-b94d-e5d17f77ddb3'}
2021-06-17 00:57:22.021 | INFO     | __main__:prepare:28 - prepare finished.
2021-06-17 00:57:22.022 | INFO     | __main__:train:10 - train started with {'X_path': 'ba941bc9-5877-4be0-b94d-e5d17f77ddb3/prepare/X.parquet', 'y_path': 'ba941bc9-5877-4be0-b94d-e5d17f77ddb3/prepare/y.parquet', 'output_dir': 'ba941bc9-5877-4be0-b94d-e5d17f77ddb3', 'clf_params': {'criterion': 'gini', 'max_depth': 2}}.
2021-06-17 00:57:22.036 | INFO     | __main__:train:25 - train finished.
202