# QAS Demo

This is a set of demo notebooks to illustrate the use of the MLTE library and SDMT process, using Quality Attribute Scenarios as guidance for the required Properties and Conditions.

NOTE: this demo has additional requirements beyond those of MLTE. You can install them with the command:

`poetry install --with demo`


## 0. Quality Attribute Scenarios

The following are the QASs that we want to validate through the use of MLTE. The examples below relate to a hypothetical system used by visitors to a botanical garden to identify flowers in the different gardens and learn more about them. The system uses an ML model that was trained on the flower category dataset [Nilsback 2008](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/).

* **Fairness - Model Impartial to Photo Location**
  * The model receives a picture taken at the garden and, regardless of the garden location, can identify the correct flower at least 90% of the time. Test data needs to include pictures of the flowers from the different gardens, grouped by the garden where the image was taken. The quantity of flower images should be representative of the garden population they are taken from. The model's accuracy on each garden population should be greater than or equal to 0.9.
* **Robustness - Model Robust to Noise (Image Blur)**
  * The model receives a picture taken at a garden by a member of the general public, and it is a bit blurry. The model should still be able to successfully identify the flower at the same rate as non-blurry images. Test data needs to include blurred flower images. Blurred images will be created using ImageMagick. Three datasets will be generated, each with a different amount of blur: minimal blur, maximum blur, and in between minimal and maximum blur. Blurry images are successfully identified at rates equal to that of non-blurred images. This will be measured using the Wilcoxon Rank-Sum test, with significance at p-value <= 0.05.
* **Robustness - Model Robust to Noise (Channel Loss)**
  * The model receives a picture taken at a garden using a loaned device. These devices are known to sometimes lose a channel (i.e., an RGB channel). The model should still be able to successfully identify the flower at the same rate as full images. Test data needs to include images with a missing channel. Test images will be generated by removing the R, G and B channels in the original test data using ImageMagick, therefore producing three data sets. Images with a missing channel are successfully identified at rates equal to that of original images. This will be measured using the Wilcoxon Rank-Sum test, with significance at p-value <= 0.05.
* **Performance on Operational Platform**
  * The model will need to run on the devices loaned out by the garden centers to visitors. These are small, inexpensive devices with limited CPU power, as well as limited memory and disk space (512 MB and 150 MB, respectively). The original test dataset can be used. (1) Executing the model on the loaned platform will not exceed a maximum CPU usage of 30%, to ensure reasonable response time. CPU usage will be measured using ps. (2) Memory usage at inference time will not exceed the available memory of 512 MB. This will be measured using pmap. (3) Disk usage will not exceed the available disk space of 150 MB. This will be measured by adding the size of each file in the path for the model code.
* **Interpretability - Understanding Model Results**
  * The application that runs on the loaned device should indicate the main features that were used to recognize the flower, as part of the educational experience. The app will display the image highlighting the most informative features in flower identification, in addition to the flower name. The original test data set can be used. The model needs to return evidence, in this case a heat map implementing the Integrated Gradients algorithm, showing the pixels that were most informative in the classification decision. This evidence should be returned with each inference. 

* **Functional Correctness - Accuracy**
  * The model receives a picture taken at the garden and can identify it correctly at least 90% of the time during normal operation.

* **Functional Correctness - Input and Output Specification**
  * The model reads inputs and provides outputs according to established input and output specifications during normal operation. During test execution all data in the test dataset produces an output that conforms to the output specification.

* **Resilience - Input Validation**
  * During normal operation, if the ML pipeline receives an input that does not conform to the input specification, it will generate the output "N/A", which the app will interpret as an error. The ML pipeline will create a log entry with the tag "Model - Input Validation Error - [Input]."

* **Monitorability - Detect OOD inputs**
  * During normal operation, the ML pipeline will log errors when out of distribution data is observed. The ML pipeline will create a log entry with the tag "Model - Input OOD Error - [Input]."

* **Monitorability - Monitor shifts in output (confidence) distribution**
  * During normal operation, the ML pipeline will log errors when the output distribution changes. The ML pipeline will create a log entry with the tag "Model - Output Confidence Error - [Output]."

* **Performance - Inference Time on Operational Platform**
  * During normal operation, running on the operational platform, the model returns an output within two seconds.
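
The Resilience scenario above can be sketched in a few lines. This is only an illustration of the described behavior, not the demo's pipeline code: the function name `validate_and_predict`, the logger name, and the `EXPECTED_SHAPE` input specification are all hypothetical; only the "N/A" output and the log tag come from the scenario text.

```python
import logging

logger = logging.getLogger("ml_pipeline")  # hypothetical logger name

# Hypothetical input specification: an RGB image of a fixed size.
EXPECTED_SHAPE = (224, 224, 3)


def validate_and_predict(image_shape, predict):
    """Return the model output, or "N/A" if the input violates the spec."""
    if tuple(image_shape) != EXPECTED_SHAPE:
        # Tag format taken from the Resilience scenario.
        logger.error("Model - Input Validation Error - %s", image_shape)
        return "N/A"
    return predict(image_shape)


# A stand-in for the real model, used only for this illustration.
def fake_model(shape):
    return "rose"


good = validate_and_predict((224, 224, 3), fake_model)  # passes validation
bad = validate_and_predict((224, 224, 1), fake_model)   # one channel only
```

A test could then look for the tag in the captured log output, which is the same idea the scenario relies on when the app interprets "N/A" as an error.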

# 1. Requirements

## 1.1 Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces.

In [None]:
import os
from mlte.session import set_context, set_store

store_path = os.path.join(os.getcwd(), "store")
os.makedirs(
    store_path, exist_ok=True
)  # Ensure we are creating the folder if it is not there.

set_context("OxfordFlower", "0.0.1")
set_store(f"local://{store_path}")
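
As a side note, the store URI passed to `set_store` is a plain string made of a `local://` prefix and a folder path. The sketch below shows the same construction with `pathlib`; it does not call MLTE, the helper name `local_store_uri` is hypothetical, and it runs in a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path


def local_store_uri(base_dir: Path, folder: str = "store") -> str:
    """Create `folder` under `base_dir` if needed and return a local:// URI."""
    store_path = base_dir / folder
    store_path.mkdir(parents=True, exist_ok=True)
    return f"local://{store_path}"


# Demonstrated in a temporary directory instead of the working directory.
with tempfile.TemporaryDirectory() as tmp:
    uri = local_store_uri(Path(tmp))
    created = (Path(tmp) / "store").is_dir()
    print(uri)
```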

## 1.2 Build a `NegotiationCard`

In MLTE, we negotiate requirements with the help of a `NegotiationCard`, which includes explicit Quality Attribute Scenarios with the requirements. This can be done manually through code, but it is easier to use the MLTE UI to do so. Below we are copying a pre-built one that applies to this scenario.

In [None]:
!sh copy_nc.sh

## 1.3 Define a TestSuite

In the first phase of SDMT, we define a `TestSuite` that represents the tests the completed model must pass in order to be acceptable for use in the system into which it will be integrated.

In MLTE, we define the tests that will be required for the different requirements in a `TestSuite`. Note that a new `Evidence` type (`MultipleRanksums`) had to be created in this case to handle the data and `Validator` for that case, and two stand-alone `Validator`s were defined in `validators.py` to validate data using existing `Evidence` types.
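
For readers unfamiliar with the rank-sum test named in the Robustness scenarios, here is a minimal, standard-library sketch of a two-sided Wilcoxon rank-sum p-value using the normal approximation, with no tie correction. It is not the implementation behind `MultipleRanksums`; in practice one would more likely call `scipy.stats.ranksums`. The function names and the sample data (per-image correctness, 1 = correct) are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def ranksum_p_value(x, y):
    """Two-sided rank-sum p-value via the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first sample
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / std
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Hypothetical per-image results for original and blurred test sets.
original = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
blurred = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
print(ranksum_p_value(original, blurred))
```

A large p-value means the test found no evidence that the two sets are identified at different rates, which is why the validators in the suite require p-values at or above a threshold.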

First we need to load our `NegotiationCard` so we can get the list of ids of its quality attribute scenarios, which will be added to the `TestCase`s here. Those ids are the way to link the `TestCase`s to their quality attribute requirements.

In [None]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
card.save(force=True)
card.print_quality_scenarios()

Now we can create our `TestSuite`, consisting of a list of `TestCase`s, each of them addressing one or more Quality Attribute Scenarios from our `NegotiationCard`. When defining the `TestCase`s below, we need to set the id of the corresponding Quality Attribute Scenario we want to test in its `quality_scenarios` attribute.

In [None]:
from mlte.tests.test_case import TestCase
from mlte.tests.test_suite import TestSuite

# The Evidence types we will use to validate each condition.
from mlte.measurement.storage import LocalObjectSize
from mlte.measurement.cpu import LocalProcessCPUUtilization
from mlte.measurement.memory import LocalProcessMemoryConsumption
from mlte.evidence.types.image import Image
from mlte.evidence.types.real import Real
from demo.scenarios import validators
from demo.scenarios.evidence.multiple_ranksums import MultipleRanksums
from mlte.evidence.types.string import String


# The full test suite.
test_suite = TestSuite(
    test_cases=[
        # Fairness QAS test cases.
        TestCase(
            identifier="accuracy across gardens",
            goal="Check if model performs well across different populations",
            quality_scenarios=["qas_id_001"],
            validator=validators.all_accuracies_more_or_equal_than(0.9),
        ),
        # Robustness QAS test cases.
        TestCase(
            identifier="ranksums blur2x8",
            goal="Check blur and noise for 2x8 case",
            quality_scenarios=["qas_id_002"],
            validator=validators.p_value_greater_or_equal_to(0.05 / 3),
        ),
        TestCase(
            identifier="ranksums blur5x8",
            goal="Check blur and noise for 5x8 case",
            quality_scenarios=["qas_id_002"],
            validator=validators.p_value_greater_or_equal_to(0.05 / 3),
        ),
        TestCase(
            identifier="ranksums blur0x8",
            goal="Check blur and noise for 0x8 case",
            quality_scenarios=["qas_id_002"],
            validator=validators.p_value_greater_or_equal_to(0.05 / 3),
        ),
        TestCase(
            identifier="multiple ranksums for clade2",
            goal="Check consistency in clade 2",
            quality_scenarios=["qas_id_003"],
            validator=MultipleRanksums.all_p_values_greater_or_equal_than(0.05),
        ),
        TestCase(
            identifier="multiple ranksums between clade2 and 3",
            goal="Check consistency between clades",
            quality_scenarios=["qas_id_003"],
            validator=MultipleRanksums.all_p_values_greater_or_equal_than(0.05),
        ),
        # Resource QASs test cases.
        TestCase(
            identifier="model size",
            goal="Check storage consumption",
            quality_scenarios=["qas_id_004"],
            validator=LocalObjectSize.get_output_type().less_than(150000000),
        ),
        TestCase(
            identifier="predicting memory",
            goal="Check memory used while predicting",
            quality_scenarios=["qas_id_004"],
            validator=LocalProcessMemoryConsumption.get_output_type().average_consumption_less_than(
                512000.0
            ),
        ),
        TestCase(
            identifier="predicting cpu",
            goal="Check cpu % used while predicting",
            quality_scenarios=["qas_id_004"],
            validator=LocalProcessCPUUtilization.get_output_type().max_utilization_less_than(
                30.0
            ),
        ),
        TestCase(
            identifier="predicting cpu time",
            goal="Check cpu time used while predicting",
            quality_scenarios=["qas_id_011"],
            validator=Real.less_than(2.0),
        ),
        # Interpretability QAS test case.
        TestCase(
            identifier="image attributions",
            goal="Check what the model is doing",
            quality_scenarios=["qas_id_005"],
            validator=Image.register_info("Inspect the image."),
        ),
        # Accuracy QAS test cases.
        TestCase(
            identifier="overall accuracy",
            goal="Measure the overall accuracy of your end to end pipeline",
            quality_scenarios=["qas_id_006"],
            validator=Real.greater_than(0.9),
        ),
        # Functional Correctness: I/O spec QAS test cases.
        TestCase(
            identifier="input format validation success",
            goal="Model input format must conform to specified format",
            quality_scenarios=["qas_id_007"],
            validator=String.contains("Model - Input Validation Pass"),
        ),
        TestCase(
            identifier="output format validation success",
            goal="Model output format must conform to specified format",
            quality_scenarios=["qas_id_007"],
            validator=String.contains("Model - Output Validation Pass"),
        ),
        # Resilience: Input Validation QAS test cases.
        TestCase(
            identifier="input format validation error",
            goal="Model inputs must conform to specified format",
            quality_scenarios=["qas_id_008"],
            validator=String.contains("Model - Input Validation Error"),
        ),
        # Monitorability QAS test cases.
        TestCase(
            identifier="detect ood inputs",
            goal="Monitor inputs for OOD data and unexpected shifts",
            quality_scenarios=["qas_id_009"],
            validator=String.contains("Model - Input OOD Error"),
        ),
        TestCase(
            identifier="monitor output confidence shift",
            goal="Monitor outputs for unexpected shifts in confidence",
            quality_scenarios=["qas_id_010"],
            validator=String.contains("Model - Output Confidence Error"),
        ),
    ]
)
test_suite.save(parents=True, force=True)
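
The numeric thresholds in the suite can be related back to the scenarios with some simple arithmetic. The sketch below assumes decimal units and assumes that the memory measurement is reported in kilobytes (which is how pmap reports usage); neither is stated explicitly above. Reading `0.05 / 3` as a Bonferroni-style adjustment for the three blur datasets is likewise an interpretation, and all variable names are hypothetical.

```python
MB = 1_000_000  # decimal megabyte, in bytes (assumption)
KB = 1_000      # decimal kilobyte, in bytes (assumption)

# "model size": 150 MB of disk space expressed in bytes.
disk_limit_bytes = 150 * MB

# "predicting memory": 512 MB expressed in KB (assumed measurement unit).
memory_limit_kb = 512 * MB / KB

# "ranksums blur*": overall significance level split across three tests.
alpha = 0.05
num_blur_datasets = 3
per_test_alpha = alpha / num_blur_datasets

print(disk_limit_bytes, memory_limit_kb, per_test_alpha)
```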