# QAS Demo

This is a set of demo notebooks to illustrate the use of the MLTE library and SDMT process, using Quality Attribute Scenarios as guidance for the required Properties and Conditions.

NOTE: this demo has additional requirements beyond those of MLTE. You can install them with the command:

`poetry install --with demo`


## 0. Quality Attribute Scenarios

The following are the QASs that we want to validate through the use of MLTE. The examples below relate to a hypothetical system used by visitors to a botanical garden to identify flowers in the different gardens and learn more about them. The system uses an ML model that was trained on the flower category dataset [Nilsback 2008](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/).

* **Fairness - Model Impartial to Photo Location**
The Garden Buddy application receives a picture taken at the garden and, regardless of the garden location, can correctly identify the flowers at least 90% of the time. Test data needs to include pictures of the flowers from the different gardens, grouped by the garden that the image was taken at. The quantity of the flower images should be representative of the garden population they are taken from. The total accuracy of the model across each garden population should be greater than or equal to 0.9.
    
* **Robustness - Model Robust to Noise (Image Blur)**
The Garden Buddy application receives a picture taken at a garden by a member of the general public, and it is a bit blurry. The model should still be able to successfully identify the flower at the same rate as non-blurry images. Test data needs to include blurred flower images. Blurred images will be created using ImageMagick. Three datasets will be generated, each with a different amount of blur: minimal blur, maximum blur, and a level in between. Blurry images are successfully identified at rates equal to that of non-blurred images. This will be measured using the Wilcoxon Rank-Sum test, with significance at p-value <= 0.05.
    
* **Resilience - Model Resilient to Hardware Sensor Failure (Channel Loss)**
The Garden Buddy application receives a picture taken at a garden using a loaned device. These devices are known to sometimes lose a channel (i.e., an RGB channel). The model should still be able to successfully identify the flower at the same rate as full images. Test data needs to include images with a missing channel. Test images will be generated by removing the R, G, and B channels, respectively, from the original test data using ImageMagick, thereby producing three datasets. Images with a missing channel are successfully identified at rates equal to that of original images. This will be measured using the Wilcoxon Rank-Sum test, with significance at p-value <= 0.05.
    
* **Resource Utilization - Performance on Operational Platform**
The Garden Buddy application will need to run on the devices loaned out by the garden centers to visitors. These are small, inexpensive devices with limited CPU power, as well as limited memory and disk space (512 MB and 150 MB, respectively). The original test dataset can be used. (1) Executing the model on the loaned platform will not exceed a maximum CPU usage of 30%, to ensure reasonable response time. CPU usage will be measured using ps. (2) Memory usage at inference time will not exceed the available memory of 512 MB. This will be measured using pmap. (3) Disk usage will not exceed the available disk space of 150 MB. This will be measured by adding the size of each file in the path for the model code.
    
* **Explainability - Understanding Model Results**
The application that runs on the loaned device should indicate the main features that were used to recognize the flower, as part of the educational experience. The app will display the image highlighting the most informative features in flower identification, in addition to the flower name. The original test data set can be used. The model needs to return evidence, in this case a heat map implementing the Integrated Gradients algorithm, showing the pixels that were most informative in the classification decision. This evidence should be returned with each inference. 

* **Functional Correctness - Accuracy**
The Garden Buddy application receives a picture taken at the garden and can identify the flower correctly at least 90% of the time during normal operation.

* **Functional Correctness - Input and Output Specification**
The Garden Buddy application reads inputs and provides outputs according to established input and output specifications during normal operation. During test execution all data in the test dataset produces an output that conforms to the output specification.

* **Reliability - Input Validation** 
During normal operation, if the ML pipeline receives an input that does not conform to the input specification, it will generate the output "N/A", which the app will interpret as an error. The ML pipeline will create a log entry with the tag "Model - Input Validation Error - [Input]."

* **Analyzability - Detect OOD inputs**
During normal operation, the ML pipeline will log errors when out of distribution data is observed. The ML pipeline will create a log entry with the tag "Model - Input OOD Error - [Input]."

* **Monitorability - Monitor Shifts in Output (Confidence) Distribution**
During normal operation, the ML pipeline will log errors when the output distribution changes. The ML pipeline will create a log entry with the tag "Model - Output Confidence Error - [Output]."

* **Time Behavior - Inference Time on Operational Platform**
During normal operation, running on the operational platform, the model returns an output within two seconds.
 
* **Repeatability - Produce Statistically Similar Output Distribution upon Repeated Sampling of Input Test Data**
The ML component receives a picture taken at the garden from the Garden Buddy application during normal operations. The ML component should demonstrate consistently similar performance for pictures from the same garden from day to day such that garden visitors and garden employees have consistent experiences using the app.

* **Reproducibility - Train Statistically Similar ML Components upon Repeated Sampling of Training Data**
Garden Buddy stakeholders want to switch to an automated retraining process and want assurances that the results will be similar to the manual training process. The ML model should produce the same results on the test data set regardless of the training data used to train the ML component during development.
 
* **Domain Adaptability - Using ML Component in new Operational Environment without Loss of Functionality**
System stakeholders would like the Garden Buddy app to work in a VR garden environment that is under development. The performance of the ML model in the new VR environment should be similar to that in the current garden environment.
 
* **Testability - ML Component Has Same Results in Operational and Development Environments**
The developed ML component is handed over to operations for integration into the ML-enabled system after development of the ML component is complete. The delivery of the ML component will include a test data set and corresponding component outputs. During integration testing of the ML component, the differences between provided outputs and actual outputs should be less than 0.25%.
 
* **Understandability - ML Component documentation and design choices are easy to understand**
A new model developer joins the team and is asked to review the ML component code as part of their onboarding process. The new developer is able to fully understand the ML component design and implementation choices in 2 person/days.
 
* **Maintainability - Using ML Component in new Operational Environment without Loss of Functionality**
Given that new flowers will be added to the garden, the ML component will need to be repeatedly updated and redeployed to keep up with changes. As new flowers are added to the garden, developers can update the ML component to recognize the new flowers in less than 1 person/day.
 
* **Modifiability - Using ML Component in new Operational Environment without Loss of Functionality**
As the loaned devices are replaced over time due to wear or obsolescence, changes to the ML component to integrate new inputs or input formats should not take more than 1/2 person/day.
 
* **Replaceability - Using ML Component in new Operational Environment without Loss of Functionality**
Given how fast the machine learning field moves, system stakeholders would want to take advantage of new algorithms in a timely manner. The effort to incorporate a new trained model should be localized to a minimal number of modules and take no longer than 16 hours of development time.

* **Retrainability - Using ML Component in new Operational Environment without Loss of Functionality**
Given that the ML component will need to be periodically retrained to keep pace with the shifts in appearance of the garden flowers due to growth and changes in seasons, system stakeholders want to limit the amount of development time and resources needed to retrain the ML component. The effort required to retrain the ML component on an updated flower population, while not changing the model architecture or outputs, should be no more than 2 hours.
 
* **Reusability - Using ML Component in new Operational Environment without Loss of Functionality**
System stakeholders would like to use the ML component in an upcoming mobile app version of Garden Buddy that members who are home gardeners can use to identify flowers in their gardens. Integrating the same ML component in the Garden Buddy app should be done in 4 hours of development time.
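
The Robustness and Resilience scenarios above both rely on the Wilcoxon Rank-Sum test. As a rough illustration of what that comparison involves, here is a minimal standard-library sketch using the normal approximation with average ranks for ties. The function names and the sample data are hypothetical and are not part of the demo or of MLTE; in practice a library routine such as `scipy.stats.ranksums` would normally be used instead.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns (z, p_value). No tie or continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    std = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (rank_sum_x - expected) / std
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical per-image outcomes (1 = correct, 0 = wrong); not real results.
original = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
blurred = [1, 1, 0, 0, 1, 1, 1, 0, 0, 1]

z, p_value = rank_sum_test(original, blurred)
# This sketch treats p >= 0.05 as "no significant difference detected".
no_significant_difference = p_value >= 0.05
print(f"z={z:.3f}, p={p_value:.3f}, pass={no_significant_difference}")
```

With only ten images per group the normal approximation is coarse, so the sample sizes here are for illustration only.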

# 1. Requirements

## 1.1 Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces.

In [4]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from demo.scenarios.session import *


Creating initial custom lists at URI: local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte/demo/scenarios/../store
Loaded 7 qa_categories for initial list
Loaded 30 quality_attributes for initial list
Creating sample catalog at URI: StoreType.LOCAL_FILESYSTEM:local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte/demo/scenarios/../store
Loading sample catalog entries.
Loaded 9 entries for sample catalog.


## 1.2 Build a `NegotiationCard`

In MLTE, we negotiate requirements with the help of a `NegotiationCard`, which includes explicit Quality Attribute Scenarios that capture those requirements. This can be done manually through code, but it is easier to use the MLTE UI. Below we copy a pre-built card that applies to this scenario.

In [9]:
##!sh ../setup_store.sh

cp: ./sample_store/: No such file or directory
/bin/ls: /bin/ls: cannot execute binary file


## 1.3 Define a TestSuite

In the first phase of SDMT, we define a `TestSuite` that represents the tests the completed model must pass in order to be acceptable for use in the system into which it will be integrated.

In MLTE, we define the tests that will be required for the different requirements in a `TestSuite`. Note that a new `Evidence` type (`MultipleRanksums`) had to be created in this case to handle the data and `Validator` for that case, and two stand-alone `Validator`s were defined in `validators.py` to validate data using existing `Evidence` types.
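
To make the idea of a stand-alone validation check more concrete, the following is a minimal plain-Python sketch of two threshold checks built as closures. It is not the code in `validators.py` and it does not use MLTE's `Validator` class; the function names, the garden names, and the accuracy values are all hypothetical.

```python
from typing import Callable, Mapping


def all_accuracies_at_least(
    threshold: float,
) -> Callable[[Mapping[str, float]], bool]:
    """Build a check that passes only if every group's accuracy meets the threshold."""

    def check(accuracy_by_group: Mapping[str, float]) -> bool:
        return all(acc >= threshold for acc in accuracy_by_group.values())

    return check


def p_value_at_least(threshold: float) -> Callable[[float], bool]:
    """Build a check that passes when a p-value is at or above the threshold."""

    def check(p_value: float) -> bool:
        return p_value >= threshold

    return check


# Hypothetical per-garden accuracies; not measured results.
accuracy_by_garden = {"rose garden": 0.94, "herb garden": 0.91, "alpine garden": 0.88}

fairness_check = all_accuracies_at_least(0.9)
print(fairness_check(accuracy_by_garden))  # one garden is below 0.9
print(p_value_at_least(0.05)(0.45))
```

Building each check from a threshold keeps the pass criterion next to the test definition, which is the same shape the demo's test cases use when they pass a threshold to a validator.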

First we need to load our `NegotiationCard` so we can get the list of ids of its quality attribute scenarios, which will be added to the `TestCase`s here. Those ids link the `TestCase`s to their quality attribute requirements.

In [12]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
card.save(force=True)
card.print_quality_scenarios()

default.card-qas_001 (Fairness): The model receives a picture taken at the garden from the garden buddy application while in normal operations, regardless of the location in the garden, the model can correctly identify the correct flowers, at least 90% of the time
default.card-qas_002 (Robustness): The model receives a picture taken at the garden, and it is a bit blurry. imagemagick will be used with different levels of blur introduced to the origional flower test data set to simulate the operational images taken from the by the garden buddy application while in normal operations, the model successfully identifies flowers, at the same rate as non-blurry images, as measured using the wilcoxon rank-sum test, with significance at p-value <=0.05.
default.card-qas_003 (Resilience): The model receives a picture taken at the garden, and it is a bit blurry. imagemagick will be used with different channel loss introduced to the origional flower test data set to simulate the operational images t

Now we can create our `TestSuite`, consisting of a list of `TestCase`s, each addressing one or more Quality Attribute Scenarios from our `NegotiationCard`. When defining the `TestCase`s below, we need to set the id of the corresponding Quality Attribute Scenario we want to test in its `quality_scenarios` attribute.
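
Because the link between a test case and a scenario is just a shared id, it can be useful to check which scenarios have no test case yet. The sketch below illustrates that idea with plain tuples rather than MLTE objects; the helper names are hypothetical, the id pattern is inferred from the printed scenario list, and only a handful of ids and test identifiers are included to keep it short.

```python
from collections import defaultdict

# Scenario ids in the style printed by card.print_quality_scenarios().
# Assumption: ids follow the pattern "default.card-qas_NNN".
scenario_ids = [f"default.card-qas_{n:03d}" for n in range(1, 6)]

# Stand-ins for test cases: (identifier, quality_scenarios) pairs.
test_cases = [
    ("accuracy across gardens", ["default.card-qas_001"]),
    ("ranksums blur2x8", ["default.card-qas_002"]),
    ("ranksums blur5x8", ["default.card-qas_002"]),
    ("ranksums channel loss R", ["default.card-qas_003"]),
    ("model size", ["default.card-qas_004"]),
]


def group_by_scenario(cases):
    """Map each scenario id to the identifiers of the test cases that cite it."""
    grouped = defaultdict(list)
    for identifier, quality_scenarios in cases:
        for scenario_id in quality_scenarios:
            grouped[scenario_id].append(identifier)
    return dict(grouped)


def uncovered_scenarios(ids, grouped):
    """Return the scenario ids that no test case refers to."""
    return [scenario_id for scenario_id in ids if scenario_id not in grouped]


grouped = group_by_scenario(test_cases)
missing = uncovered_scenarios(scenario_ids, grouped)
print(missing)
```

A check like this makes it easy to spot scenarios whose test cases are still commented out or not yet written.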

In [48]:
from mlte.tests.test_case import TestCase
from mlte.tests.test_suite import TestSuite

# The Measurements, Evidence types, and Validators we will use to validate each condition.
# Image and String are only used by the commented-out test cases below.
from mlte.measurement.storage import LocalObjectSize
from mlte.measurement.cpu import LocalProcessCPUUtilization
from mlte.measurement.memory import LocalProcessMemoryConsumption
from mlte.measurement.units import Units
from mlte.evidence.types.image import Image
from mlte.evidence.types.string import String
from mlte.evidence.types.real import Real
from mlte.validation.validator import Validator
from demo.scenarios import validators
from demo.scenarios.evidence.multiple_ranksums import MultipleRanksums


# The full test suite.
test_suite = TestSuite(
    test_cases=[
        # Fairness QAS test cases.
        TestCase(
            identifier="accuracy across gardens",
            goal="Check if model performs well accross different populations",
            quality_scenarios=["default.card-qas_001"],
            validator=validators.all_accuracies_more_or_equal_than(0.9),
        ),
        # Robustness QAS test cases.
        TestCase(
            identifier="ranksums blur2x8",
            goal="Check blur and noise for 2x8 case",
            quality_scenarios=["default.card-qas_002"],
            validator=validators.p_value_greater_or_equal_to(0.05 / 3),
        ),
        TestCase(
            identifier="ranksums blur5x8",
            goal="Check blur and noise for 5x8 case",
            quality_scenarios=["default.card-qas_002"],
            validator=validators.p_value_greater_or_equal_to(0.05 / 3),
        ),
        TestCase(
            identifier="ranksums blur0x8",
            goal="Check blur and noise for 0x8 case",
            quality_scenarios=["default.card-qas_002"],
            validator=validators.p_value_greater_or_equal_to(0.05 / 3),
        ),
        TestCase(
            identifier="effect of blur across families",
            goal="Check consistency in families",
            quality_scenarios=["default.card-qas_002"],
            validator=MultipleRanksums.all_p_values_greater_or_equal_than(0.05 / 141),
        ),
        # Resilience QAS test cases.
        TestCase(
            identifier="ranksums channel loss R",
            goal="Check consistency between channel loss",
            quality_scenarios=["default.card-qas_003"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        TestCase(
            identifier="ranksums channel loss G",
            goal="Check consistency between channel loss",
            quality_scenarios=["default.card-qas_003"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        TestCase(
            identifier="ranksums channel loss B",
            goal="Check consistency between channel loss",
            quality_scenarios=["default.card-qas_003"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        # Resource Utilization QAS test cases.
        TestCase(
            identifier="model size",
            goal="Check storage consumption",
            quality_scenarios=["default.card-qas_004"],
            validator=LocalObjectSize.get_output_type().less_than(
                150, Units.megabyte
            ),
        ),
        TestCase(
            identifier="predicting memory",
            goal="Check memory used while predicting",
            quality_scenarios=["default.card-qas_004"],
            validator=LocalProcessMemoryConsumption.get_output_type().average_consumption_less_than(
                512.0, unit=Units.megabyte
            ),
        ),
        TestCase(
            identifier="predicting cpu",
            goal="Check cpu % used while predicting",
            quality_scenarios=["default.card-qas_004"],
            validator=LocalProcessCPUUtilization.get_output_type().max_utilization_less_than(
                30.0, unit=Units.percent
            ),
        ), 
        # Explainability QAS test case.
        #TestCase(
        #    identifier="image attributions",
        #    goal="Check what the model is doing",
        #    quality_scenarios=["default.card-qas_005"],
        #    validator=Image.register_info("Inspect the image."),
        #),
        # Functional Correctness - Accuracy QAS test cases.
        TestCase(
            identifier="overall model accuracy",
            goal="Measure the overall accuracy of your end to end pipeline",
            quality_scenarios=["default.card-qas_006"],
            validator=Real.greater_than(0.9),
        ),
        # Functional Correctness - I/O spec QAS test cases.
        #TestCase(
        #    identifier="input format validation success",
        #    goal="Model input format must conform to specified format",
        #    quality_scenarios=["default.card-qas_007"],
        #    validator=String.contains("Model - Input Validation Pass"),
        #),
        #TestCase(
        #    identifier="output format validation success",
        #    goal="Model output format must conform to specified format",
        #    quality_scenarios=["default.card-qas_007"],
        #    validator=String.contains("Model - Output Validation Pass"),
        #),
        # Reliability: Input Validation QAS test cases.
        #TestCase(
        #    identifier="input format validation error",
        #    goal="Model inputs must conform to specified format",
        #    quality_scenarios=["default.card-qas_008"],
        #    validator=String.contains("Model - Input Validation Error"),
        #),
        # Analyzability QAS test case.
        #TestCase(
        #    identifier="detect ood inputs",
        #    goal="Monitor inputs for OOD data and unexpected shifts",
        #    quality_scenarios=["default.card-qas_009"],
        #    validator=String.contains("Model - Input OOD Error"),
        #), 
        # Monitorability QAS test case.
        #TestCase(
        #    identifier="monitor output confidence shift",
        #    goal="Monitor inputs for OOD data and unexpected shifts",
        #    quality_scenarios=["default.card-qas_010"],
        #    validator=String.contains("Model - Output Confidence Error"),
        #),
        # Time Behavior QAS test case.
        #TestCase(
        #    identifier="predicting cpu time",
        #    goal="Check cpu time used while predicting",
        #    quality_scenarios=["default.card-qas_011"],
        #    validator=Real.less_than(2.0, Units.second),
        #),
        # Repeatability QAS test case.
        TestCase(
            identifier="repeated results sampling",
            goal="Repeatedly sampling results gives same results",
            quality_scenarios=["default.card-qas_012"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        # Reproducibility QAS test case.
        TestCase(
            identifier="repeated training on training samples",
            goal="Repeatedly training on different samples of training data gives same results on test data set",
            quality_scenarios=["default.card-qas_013"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        # Domain Adaptability QAS test case.
        TestCase(
            identifier="running in new domain",
            goal="Repeatedly training on different samples of training data gives same results on test data set",
            quality_scenarios=["default.card-qas_014"],
            validator=MultipleRanksums.all_p_values_greater_or_equal_than(0.05),
        ),
        # Testability QAS test case.
        TestCase(
            identifier="test results from dev and op env",
            goal="alignment of test results from dev and op environments",
            quality_scenarios=["default.card-qas_015"],
            validator=Validator.build_info_validator("Inspect the alignment of results for no more than 0.25% difference."),
        ),
        # Understandability QAS test case.
        TestCase(
            identifier="understanding design choices",
            goal="understanding design and implementation choices",
            quality_scenarios=["default.card-qas_016"],
            validator=Validator.build_info_validator("Inspect project code and documentation."),
        ),
        # Maintainability QAS test case.
        TestCase(
            identifier="keep ML component up to date",
            goal="keep trained ML component up to date with op environment changes",
            quality_scenarios=["default.card-qas_017"],
            validator=Validator.build_info_validator("Validate work time less than 8hrs."),
        ),
        # Modifiability QAS test case.
        TestCase(
            identifier="update data pipelines",
            goal="alignment of test results from dev and op environments",
            quality_scenarios=["default.card-qas_018"],
            validator=Validator.build_info_validator("Validate work time less than 4hrs."),
        ),
        # Replaceability QAS test case.
        TestCase(
            identifier="update ML training algorithm",
            goal="update ML component training algorithm",
            quality_scenarios=["default.card-qas_019"],
            validator=Validator.build_info_validator("Validate train time less than 16hrs."),
        ),
        # Retrainability QAS test case.
        TestCase(
            identifier="retrain ML model",
            goal="update ML component training algorithm",
            quality_scenarios=["default.card-qas_020"],
            validator=Validator.build_info_validator("Model takes less than 1 hr to retrain."),
        ),
        # Reusability QAS test case.
        TestCase(
            identifier="reuse ML component",
            goal="reuse ML component in new app",
            quality_scenarios=["default.card-qas_021"],
            validator=Validator.build_info_validator("Validate work time less than 4hrs."),
        ),
    ]
)
test_suite.save(parents=True, force=True)

ArtifactModel(header=ArtifactHeaderModel(identifier='suite.default', type='suite', timestamp=1759248081, creator=None, level='model'), body=TestSuiteModel(artifact_type=<ArtifactType.TEST_SUITE: 'suite'>, test_cases=[TestCaseModel(identifier='accuracy across gardens', goal='Check if model performs well accross different populations', qas_list=['default.card-qas_001'], measurement=None, validator=ValidatorModel(bool_exp='gASVfgMAAAAAAACMCmRpbGwuX2RpbGyUjBBfY3JlYXRlX2Z1bmN0aW9ulJOUKGgAjAxfY3JlYXRlX2NvZGWUk5QoQwgWARj/EAIo/pRLAUsASwBLAUsESxNDbJUBlwB0AQAAAAAAAAAAiAFmAWQBhAh8AGoCAAAAAAAAAAAAAAAAAAAAAAAARACrAAAAAAAAAKsBAAAAAAAAdAUAAAAAAAAAAHwAagIAAAAAAAAAAAAAAAAAAAAAAACrAQAAAAAAAGsoAABTAJROaAQoQwIKAZRLAUsASwBLAksDSzNDKJUBSwABAJcAfABdCQAAfQF8AYkCa1wAAJYBlwEBAIwLBAB5AK0DdwGUToWUKYwCLjCUjAFnlIaUjGgvVXNlcnMvcmJyb3dlcnNpbm5pbmcvRG9jdW1lbnRzL1Jlc2VhcmNoRm9sZGVycy9Db250aW51dW1fTFRQL0dpdFJlcG9zL21sdGUvZGVtby9zY2VuYXJpb3MvdmFsaWRhdG9ycy5weZSMCTxnZW5leHByPpSMRmFsbF9hY2N1cmFjaWVzX21vcmVfb3JfZXF1YWxfdGhh