# Learning Objectives

Present an overview of tests to be conducted before a containerized model is released for deployment.

# Setup

In [1]:
!pip install -q gradio_client


In [2]:
from gradio_client import Client

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tqdm import tqdm

In [3]:
client = Client("pgurazada1/machine-failure-predictor-mlops-demo")

Loaded as API: https://pgurazada1-machine-failure-predictor-mlops-demo.hf.space ✔


# Baseline Checks

Test Data

In [4]:
dataset = fetch_openml(data_id=42890, as_frame=True, parser="auto")

data_df = dataset.data

target = 'Machine failure'
numeric_features = [
    'Air temperature [K]',
    'Process temperature [K]',
    'Rotational speed [rpm]',
    'Torque [Nm]',
    'Tool wear [min]'
]
categorical_features = ['Type']

X = data_df[numeric_features + categorical_features]
y = data_df[target]

Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

Xtest_sample = Xtest.sample(100, random_state=42)  # fixed seed so the sample is reproducible
ytest_sample = ytest.loc[Xtest_sample.index]

Xtest_sample_rows = list(Xtest_sample.itertuples(index=False, name=None))

Predictions on the test data

In [5]:
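baseline_test_predictions = []
failed_positions = []  # positions of rows whose request failed

for position, row in enumerate(tqdm(Xtest_sample_rows)):
    try:
        job = client.submit(
            air_temperature=row[0],
            process_temperature=row[1],
            rotational_speed=row[2],
            torque=row[3],
            tool_wear=row[4],
            type=row[5],
            api_name="/predict"
        )

        prediction = job.result()['label']

        baseline_test_predictions.append(int(prediction))

    except Exception as e:
        # Record the failure so labels and predictions stay aligned
        failed_positions.append(position)
        print(f"Row {position} failed: {e}")

# Drop the labels of failed rows before computing metrics
ytest_sample = ytest.loc[Xtest_sample.index].drop(Xtest_sample.index[failed_positions])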

100%|██████████| 100/100 [01:15<00:00,  1.33it/s]


Classification report (precision, recall, F1-score) on the test sample.

In [6]:
print(classification_report(ytest_sample, baseline_test_predictions))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99        96
           1       1.00      0.75      0.86         4

    accuracy                           0.99       100
   macro avg       0.99      0.88      0.93       100
weighted avg       0.99      0.99      0.99       100



If the F1-score exceeds the existing baseline (a human benchmark or a previous model version), we move on to unit tests.
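The comparison against the baseline can be written as a small gate. The sketch below is a dependency-free illustration: `f1_for_class` and `passes_baseline` are hypothetical helper names, the labels are toy data shaped like the report above, and the baseline value of 0.80 is a placeholder assumption, not a figure from this notebook. `sklearn.metrics.f1_score` could be used in place of the hand-rolled helper.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """F1-score for one class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def passes_baseline(y_true, y_pred, baseline_f1, positive=1):
    """True when the candidate's F1 on the positive class exceeds the baseline."""
    return f1_for_class(y_true, y_pred, positive) > baseline_f1


# Toy labels mirroring the report above: 4 failures, 3 of them detected.
y_true = [1, 1, 1, 1] + [0] * 96
y_pred = [1, 1, 1, 0] + [0] * 96

BASELINE_F1 = 0.80  # assumed placeholder; substitute the real baseline
candidate_f1 = f1_for_class(y_true, y_pred)
print(f"failure-class F1 = {candidate_f1:.2f}")
print("proceed to unit tests:", passes_baseline(y_true, y_pred, BASELINE_F1))
```

Gating on the failure class rather than on overall accuracy is a deliberate choice here: with 96 of 100 samples in the majority class, accuracy stays high even if most failures are missed.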

# Unit Tests

## Perturbation tests

Perturbation analysis involves introducing deliberate changes or perturbations to the input data and observing the corresponding impact on model predictions. This task helps evaluate the stability and robustness of the model. By systematically perturbing variables or introducing simulated variations, organizations can assess how sensitive the model is to different inputs and determine if it responds in an expected manner. For instance, in a credit risk assessment model, perturbation analysis could involve altering individual features such as income or credit utilization ratios to observe how the model's predictions change. This analysis helps identify potential vulnerabilities or inconsistencies in the model's behavior and informs the need for recalibration or retraining.

*Baseline*

In [7]:
job = client.submit(
    air_temperature=300.8,
    process_temperature=310.3,
    rotational_speed=1538,
    torque=36.1,
    tool_wear=198,
    type="L",
    api_name="/predict"
)

In [8]:
print(job.result()['label'])

0


*Test (perturbed baseline)*

In [9]:
job = client.submit(
    air_temperature=301.8,
    process_temperature=310.3,
    rotational_speed=1538,
    torque=36.1,
    tool_wear=198,
    type="L",
    api_name="/predict"
)

In [10]:
print(job.result()['label'])

0


The prediction is unchanged after a 1 K increase in air temperature, which indicates that, for this input, the model is robust to a minor variation in air temperature.
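The single comparison above could be generalized into a sweep over several perturbation sizes. The following is a minimal sketch under stated assumptions: `perturbation_flips` is a hypothetical helper, and `stub_predict` is a stand-in rule used only so the example runs offline; it is not the deployed model.

```python
def perturbation_flips(predict, baseline, feature, deltas):
    """Return the deltas for which the prediction differs from the baseline prediction."""
    reference = predict(**baseline)
    flips = []
    for delta in deltas:
        perturbed = dict(baseline)
        perturbed[feature] = baseline[feature] + delta
        if predict(**perturbed) != reference:
            flips.append(delta)
    return flips


def stub_predict(air_temperature, process_temperature, rotational_speed,
                 torque, tool_wear, type):
    """Stand-in for the deployed model (assumption): flags failure on high torque only."""
    return 1 if torque > 60 else 0


baseline = dict(
    air_temperature=300.8, process_temperature=310.3, rotational_speed=1538,
    torque=36.1, tool_wear=198, type="L",
)

# Small shifts in air temperature leave the stub's prediction unchanged.
print(perturbation_flips(stub_predict, baseline, "air_temperature", [-1.0, -0.5, 0.5, 1.0]))
# A large torque shift crosses the stub's threshold and is reported.
print(perturbation_flips(stub_predict, baseline, "torque", [1.0, 30.0]))
```

To run the sweep against the deployed endpoint, `predict` could be a small wrapper that calls `client.submit(..., api_name="/predict")` and returns `job.result()['label']`.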

## Known edge cases (critical subgroups)

In some applications, it is important to monitor the performance of machine learning models specifically for critical subgroups or segments of the population. These subgroups may be defined by demographic characteristics, geographic location, or other relevant factors. For example, in healthcare, it is crucial to ensure that a medical diagnosis model performs well across different demographic groups to avoid bias or disparities in patient care. If significant disparities or performance gaps are detected, further investigation can be conducted to understand the root causes and take necessary corrective actions, such as retraining the model.

In this case, we could test the model by presenting the edge cases of critical equipment failures and ascertaining that the model is able to detect these crucial failures.
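Subgroup performance can be checked by computing a metric separately for each group. The sketch below assumes product type is the grouping variable and recall on the failure class is the metric of interest; `recall_by_group` is a hypothetical helper and the labels are illustrative, not drawn from the test sample.

```python
from collections import defaultdict


def recall_by_group(groups, y_true, y_pred, positive=1):
    """Recall on the positive class, computed separately for each group value."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        if truth == positive:
            totals[group] += 1
            if pred == positive:
                hits[group] += 1
    # Groups with no positive samples are omitted (recall is undefined there).
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative data only (assumption).
types  = ["L", "L", "M", "M", "H", "H", "L"]
y_true = [1,   1,   1,   0,   1,   0,   0]
y_pred = [1,   0,   1,   0,   0,   0,   0]

for group, recall in sorted(recall_by_group(types, y_true, y_pred).items()):
    print(f"type {group}: failure recall = {recall:.2f}")
```

A large gap between groups would be a signal to investigate before release.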

*Critical equipment state (known failure test case)*

In this scenario, a known constraint is that when manufacturing products of type 'M', the tool wear should stay below 1010 minutes. The test case below sets tool wear to 1010 minutes, so the model should predict a failure. Let us see if the model can recognize this failure state.

In [11]:
job = client.submit(
    air_temperature=303.6,
    process_temperature=311.8,
    rotational_speed=1421,
    torque=44.8,
    tool_wear=1010,
    type="M",
    api_name="/predict"
)

In [12]:
print(job.result()['label'])

1


The output of the above cell indicates that the model correctly predicts a failure for this known edge case.

More instances of such unit tests could be facilitated by presenting a simple interface to the tester like so:

In [13]:
# @title Unit Test Interface

air_temperature=303.6 # @param
process_temperature=311.8 # @param
rotational_speed=1421 # @param
torque=44.8 # @param
tool_wear=1010 # @param
type="M" # @param ['L', 'M', 'H']

job = client.submit(
    air_temperature=air_temperature,
    process_temperature=process_temperature,
    rotational_speed=rotational_speed,
    torque=torque,
    tool_wear=tool_wear,
    type=type,
    api_name="/predict"
)

failure_expected = 'Yes' if job.result()['label'] == '1' else 'No'
print(f"Failure expected?: {failure_expected}")

Failure expected?: Yes


If the unit tests pass, the model is ready to be tagged for release to staging and production.
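The checks in this section could also be collected into a table of cases that is run as a single release gate. This is a sketch only: `run_edge_cases` is a hypothetical helper, and `stub_predict` is a stand-in rule (an assumption) so that the example runs without calling the deployed endpoint.

```python
def run_edge_cases(predict, cases):
    """Run each (inputs, expected_label) case and return the ones that fail."""
    failures = []
    for inputs, expected in cases:
        actual = predict(**inputs)
        if actual != expected:
            failures.append((inputs, expected, actual))
    return failures


def stub_predict(tool_wear, type, **other_features):
    """Stand-in for the deployed model (assumption): fails type 'M' at high tool wear."""
    return 1 if type == "M" and tool_wear >= 1010 else 0


cases = [
    # Known failure state from the edge-case test above.
    (dict(air_temperature=303.6, process_temperature=311.8, rotational_speed=1421,
          torque=44.8, tool_wear=1010, type="M"), 1),
    # Baseline operating point from the perturbation test.
    (dict(air_temperature=300.8, process_temperature=310.3, rotational_speed=1538,
          torque=36.1, tool_wear=198, type="L"), 0),
]

failures = run_edge_cases(stub_predict, cases)
print("ready for release" if not failures else f"{len(failures)} case(s) failed")
```

Keeping the cases in a list means a newly discovered failure mode becomes one more entry rather than a new notebook cell.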