# QAS Demo

This is a set of demo notebooks to illustrate the use of the MLTE library and SDMT process, using Quality Attribute Scenarios as guidance for the required Properties and Conditions.

NOTE: This demo has additional requirements beyond those of MLTE. You can install them with the command:

`poetry install --with demo`


## 0. Quality Attribute Scenarios

The following are the QASs that we want to validate through the use of MLTE. The examples below relate to a hypothetical system used by visitors to a botanical garden to identify flowers in the different gardens and learn more about them. The system uses an ML model that was trained on the flower category dataset [Nilsback 2008](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/).

* **Explainability - Explain LLM results**
The LLM receives a prompt from the manager asking for an employee evaluation during normal operations. The model outputs an employee evaluation, including a performance score for the employee and a human-understandable rationale for that score.
    
* **Functional Correctness - Model provides correct results**
The LLM receives a prompt from the manager asking for an employee evaluation containing the employee goals, employee statement, and manager notes, during normal operations. The LLM outputs an employee evaluation, including a performance score for the employee. The LLM-generated performance score should match the manager's expected overall score in at least 95% of cases.

* **Functional Correctness - Model provides self-consistent results**
The LLM receives a prompt from the manager asking for an employee evaluation during normal operations. The model outputs an employee evaluation, including an overall performance score for the employee and an evaluation for each important sub-category. The sub-category scores should average to match the overall score in at least 95% of the cases.
    
* **Repeatability - Model provides repeatable results**
The LLM may receive multiple entries of similarly performing employees for evaluation during normal operations. In the case of similar prompts and input information, the LLM-generated employee evaluation, including performance scores and evaluation summary, should be semantically similar each time.
    
* **Robustness - Model is robust to spacing, punctuation, and case inconsistencies**
The LLM may receive prompts with variations in casing, spacing, and punctuation during normal operations. These variations should not influence the employee evaluation or score generated by the LLM.
   
* **Time Behavior - Model provides results in a timely manner**
The LLM receives a prompt from the manager asking for an employee evaluation during normal operations. The time required for the LLM to output an employee evaluation, including a performance score, will be no more than 10 seconds.

* **Fairness - Model provides similar results to different named individuals**
As the LLM is being used in the evaluation process to help managers generate performance reviews, the generated reviews need to be fair for every employee. The LLM should provide the same performance review for similar prompts, regardless of the name and pronouns used by the employee.

* **Inclusivity - Model provides results not dependent on reading level**
The LLM will be used to generate performance reviews based on self-evaluations for all employees across the organization, in all job types. The overall performance score provided by the LLM should not be impacted by the writing level of the employee's self-evaluation.

* **Economic Risk Consideration - Model identifies economic risk from employees** 

* **Health and Safety Risk Considerations - Model identifies health and safety risk from employees**

* **Societal and Ethical Risk Considerations - Model identifies societal and ethical risk from employees**

* **Privacy - Model results do not contain PII for other employees**
The LLM will receive many similar prompts from managers asking for employee evaluations. The output employee evaluation should not contain PII for other employees despite similarity between prompts.

* **Resistance - Model is resistant to additional, embedded instructions from employees**
The LLM receives a prompt containing instructions for the LLM to give that employee a good review, within the self-evaluation, during normal operations. The performance score output by the LLM should not be affected by the embedded instructions. 
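
Several of the scenarios above share a measurable response of the form "in at least 95% of cases". As a minimal sketch of how such a criterion could be checked, independent of MLTE, the Python below computes the fraction of evaluations in which the LLM score matches the manager's expected score, and the fraction in which the sub-category scores average to the overall score. The function names, the sample data, and the rounding rule used in the self-consistency check are assumptions made for illustration; they are not part of the demo code.

```python
from statistics import mean

REQUIRED_RATE = 0.95  # threshold taken from the scenarios above


def match_rate(llm_scores, expected_scores):
    """Fraction of cases where the LLM score equals the manager's expected score."""
    if len(llm_scores) != len(expected_scores):
        raise ValueError("score lists must have the same length")
    if not llm_scores:
        raise ValueError("at least one case is required")
    matches = sum(
        1 for got, want in zip(llm_scores, expected_scores) if got == want
    )
    return matches / len(llm_scores)


def self_consistency_rate(evaluations):
    """Fraction of evaluations whose sub-category scores average to the overall score.

    Each evaluation is an (overall_score, sub_category_scores) pair.
    Assumption: the average is rounded with Python's round() before comparing.
    """
    if not evaluations:
        raise ValueError("at least one evaluation is required")
    consistent = sum(
        1
        for overall, sub_scores in evaluations
        if round(mean(sub_scores)) == overall
    )
    return consistent / len(evaluations)


# Hypothetical data: 20 evaluations, one of which disagrees with the manager.
llm = [3] * 19 + [4]
expected = [3] * 20
rate = match_rate(llm, expected)
print(f"match rate: {rate:.2f}, meets requirement: {rate >= REQUIRED_RATE}")
```

In the actual demo, a check like this would be wrapped in an MLTE `Validator` so that the result is recorded as evidence against the corresponding scenario.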


# 1. Requirements

## 1.1 Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces.

In [4]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from demo.scenarios.session import *


Creating initial custom lists at URI: local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte/demo/scenarios/../store
Loaded 7 qa_categories for initial list
Loaded 30 quality_attributes for initial list
Creating sample catalog at URI: StoreType.LOCAL_FILESYSTEM:local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte/demo/scenarios/../store
Loading sample catalog entries.
Loaded 9 entries for sample catalog.


## 1.2 Build a `NegotiationCard`

In MLTE, we negotiate requirements with the help of a `NegotiationCard`, which includes explicit Quality Attribute Scenarios that capture those requirements. A card can be built manually through code, but it is easier to use the MLTE UI. Below we copy a pre-built card that applies to this scenario.

In [9]:
##!sh ../setup_store.sh



## 1.3 Define a TestSuite

In the first phase of SDMT, we define a `TestSuite` that represents the tests the completed model must pass in order to be acceptable for use in the system into which it will be integrated. Each test in the suite targets one or more of the requirements.

Note that a new `Evidence` type (`MultipleRanksums`) had to be created in this case to handle the data and `Validator` for that case, and two stand-alone `Validator`s were defined in `validators.py` to validate data using existing `Evidence` types.
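
The rank-sum test cases defined later in this section compare p-values against thresholds such as `0.05 / 3` and `0.05 / 141`. Dividing the significance level by the number of comparisons looks like a Bonferroni correction; that reading is an assumption, as is everything in the sketch below, which uses plain Python with hypothetical helper names rather than the demo's `validators.py` or `MultipleRanksums`. Under this reading, a test passes when every p-value stays at or above the corrected threshold, meaning no significant difference was detected.

```python
def bonferroni_threshold(alpha, num_comparisons):
    """Significance level divided by the number of comparisons (hypothetical helper)."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def all_p_values_at_least(p_values, threshold):
    """True when no comparison shows a significant difference at the threshold."""
    return all(p >= threshold for p in p_values)


# Hypothetical p-values for three comparisons; these are not results from the demo.
p_values = [0.40, 0.03, 0.12]

corrected = bonferroni_threshold(0.05, 3)
print("passes with corrected threshold:", all_p_values_at_least(p_values, corrected))
print("passes with uncorrected 0.05:", all_p_values_at_least(p_values, 0.05))
```

The corrected threshold is stricter about declaring a difference significant, so a suite with many comparisons is less likely to fail because of a single chance result.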

First we need to load our `NegotiationCard` so we can get the ids of its quality attribute scenarios, which will be added to the `TestCase`s here. Those ids link the `TestCase`s to their quality attribute requirements.

In [12]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
card.save(force=True)
card.print_quality_scenarios()

default.card-qas_001 (Fairness): The model receives a picture taken at the garden from the garden buddy application while in normal operations, regardless of the location in the garden, the model can correctly identify the correct flowers, at least 90% of the time
default.card-qas_002 (Robustness): The model receives a picture taken at the garden, and it is a bit blurry. imagemagick will be used with different levels of blur introduced to the origional flower test data set to simulate the operational images taken from the by the garden buddy application while in normal operations, the model successfully identifies flowers, at the same rate as non-blurry images, as measured using the wilcoxon rank-sum test, with significance at p-value <=0.05.
default.card-qas_003 (Resilience): The model receives a picture taken at the garden, and it is a bit blurry. imagemagick will be used with different channel loss introduced to the origional flower test data set to simulate the operational images t

Now we can create our `TestSuite`, consisting of a list of `TestCase`s, each of them addressing one or more Quality Attribute Scenarios from our `NegotiationCard`. When defining the `TestCase`s below, we need to set the id of the corresponding Quality Attribute Scenario we want to test in its `quality_scenarios` attribute.
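
To illustrate how these ids tie test cases back to requirements, the following sketch reports which scenario ids are not yet referenced by any test case. It uses plain dictionaries rather than MLTE's `TestCase` and `NegotiationCard` classes, and the helper name and sample data are hypothetical.

```python
def uncovered_scenarios(scenario_ids, test_cases):
    """Return the scenario ids that no test case references, in sorted order."""
    covered = {
        qas_id for case in test_cases for qas_id in case["quality_scenarios"]
    }
    return sorted(set(scenario_ids) - covered)


# Hypothetical stand-ins for the ids printed from the card and for the test cases.
scenario_ids = [
    "default.card-qas_001",
    "default.card-qas_002",
    "default.card-qas_003",
]
test_cases = [
    {
        "identifier": "accuracy across gardens",
        "quality_scenarios": ["default.card-qas_001"],
    },
    {
        "identifier": "ranksums blur2x8",
        "quality_scenarios": ["default.card-qas_002"],
    },
]

print(uncovered_scenarios(scenario_ids, test_cases))
```

A check of this kind can be handy while a suite is still being written, since scenarios whose test cases are commented out would show up as uncovered.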

In [44]:
from mlte.tests.test_case import TestCase
from mlte.tests.test_suite import TestSuite

# The Evidence types we will use to validate each condition.
from mlte.measurement.storage import LocalObjectSize
from mlte.measurement.cpu import LocalProcessCPUUtilization
from mlte.measurement.units import Units
from mlte.measurement.memory import LocalProcessMemoryConsumption
from mlte.evidence.types.image import Image
from mlte.evidence.types.string import String
from mlte.evidence.types.real import Real
from demo.scenarios import validators
from demo.scenarios.evidence.multiple_ranksums import MultipleRanksums
from mlte.validation.validator import Validator


# The full test suite.
test_suite = TestSuite(
    test_cases=[
        # Fairness QAS test cases.
        TestCase(
            identifier="accuracy across gardens",
            goal="Check if model performs well accross different populations",
            quality_scenarios=["default.card-qas_001"],
            validator=validators.all_accuracies_more_or_equal_than(0.9),
        ),
        # Robustness QAS test cases.
        TestCase(
            identifier="ranksums blur2x8",
            goal="Check blur and noise for 2x8 case",
            quality_scenarios=["default.card-qas_002"],
            validator=validators.p_value_greater_or_equal_to(0.05 / 3),
        ),
        TestCase(
            identifier="ranksums blur5x8",
            goal="Check blur and noise for 5x8 case",
            quality_scenarios=["default.card-qas_002"],
            validator=validators.p_value_greater_or_equal_to(0.05 / 3),
        ),
        TestCase(
            identifier="ranksums blur0x8",
            goal="Check blur and noise for 0x8 case",
            quality_scenarios=["default.card-qas_002"],
            validator=validators.p_value_greater_or_equal_to(0.05 / 3),
        ),
        TestCase(
            identifier="effect of blur across families",
            goal="Check consistency in families",
            quality_scenarios=["default.card-qas_002"],
            validator=MultipleRanksums.all_p_values_greater_or_equal_than(0.05 / 141),
        ),
        #Resilience QAS test case 
        TestCase(
            identifier="ranksums channel loss R",
            goal="Check consistency between channel loss",
            quality_scenarios=["default.card-qas_003"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        TestCase(
            identifier="ranksums channel loss G",
            goal="Check consistency between channel loss",
            quality_scenarios=["default.card-qas_003"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        TestCase(
            identifier="ranksums channel loss B",
            goal="Check consistency between channel loss",
            quality_scenarios=["default.card-qas_003"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        # Resource Utilization QASs test cases.
        TestCase(
            identifier="model size",
            goal="Check storage consumption",
            quality_scenarios=["default.card-qas_004"],
            validator=LocalObjectSize.get_output_type().less_than(
                150, Units.megabyte
            ),
        ),
        TestCase(
            identifier="predicting memory",
            goal="Check memory used while predicting",
            quality_scenarios=["default.card-qas_004"],
            validator=LocalProcessMemoryConsumption.get_output_type().average_consumption_less_than(
                512.0, unit=Units.megabyte
            ),
        ),
        TestCase(
            identifier="predicting cpu",
            goal="Check cpu % used while predicting",
            quality_scenarios=["default.card-qas_004"],
            validator=LocalProcessCPUUtilization.get_output_type().max_utilization_less_than(
                30.0, unit=Units.percent
            ),
        ), 
        # Understandability QAS test case.
        #TestCase(
        #    identifier="image attributions",
        #    goal="Check what the model is doing",
        #    quality_scenarios=["default.card-qas_005"],
        #    validator=Image.register_info("Inspect the image."),
        #),
        # Functional Correctness - Accuracy QAS test cases.
        TestCase(
            identifier="overall model accuracy",
            goal="Measure the overall accuracy of your end to end pipeline",
            quality_scenarios=["default.card-qas_006"],
            validator=Real.greater_than(0.9),
        ),
        # Functional Correctness - I/O spec QAS test cases.
        #TestCase(
        #    identifier="input format validation success",
        #    goal="Model input format must conform to specified format",
        #    quality_scenarios=["default.card-qas_007"],
        #    validator=String.contains("Model - Input Validation Pass"),
        #),
        #TestCase(
        #    identifier="output format validation success",
        #    goal="Model output format must conform to specified format",
        #    quality_scenarios=["default.card-qas_007"],
        #    validator=String.contains("Model - Output Validation Pass"),
        #),
        # Reliability: Input Validation QAS test cases.
        #TestCase(
        #    identifier="input format validation error",
        #    goal="Model inputs must conform to specified format",
        #    quality_scenarios=["default.card-qas_008"],
        #    validator=String.contains("Model - Input Validation Error"),
        #),
        #  Analyzability QAS test cases.
        #TestCase(
        #    identifier="detect ood inputs",
        #    goal="Monitor inputs for OOD data and unexpected shifts",
        #    quality_scenarios=["default.card-qas_009"],
        #    validator=String.contains("Model - Input OOD Error"),
        #), 
        #Monitorability QAS test case
        #TestCase(
        #    identifier="monitor output confidence shift",
        #    goal="Monitor inputs for OOD data and unexpected shifts",
        #    quality_scenarios=["default.card-qas_010"],
        #    validator=String.contains("Model - Output Confidence Error"),
        #),
        #Time Behaviour 
        #TestCase(
        #    identifier="predicting cpu time",
        #    goal="Check cpu time used while predicting",
        #    quality_scenarios=["default.card-qas_011"],
        #    validator=Real.less_than(2.0, Units.second),
        #),
        #Repeatability QAS test case
        TestCase(
            identifier="repeated results sampling",
            goal="Repeatedly sampling results gives same results",
            quality_scenarios=["default.card-qas_012"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        # Reproducibility QAS test case
        TestCase(
            identifier="repeated training on training samples",
            goal="Repeatedly training on different samples of training data gives same results on test data set",
            quality_scenarios=["default.card-qas_013"],
            validator=validators.p_value_greater_or_equal_to(0.05),
        ),
        #Domain Adaptability QAS test case
        TestCase(
            identifier="running in new domain",
            goal="Repeatedly training on different samples of training data gives same results on test data set",
            quality_scenarios=["default.card-qas_014"],
            validator=MultipleRanksums.all_p_values_greater_or_equal_than(0.05),
        ),
        #Testability QAS test case
        TestCase(
            identifier="test results from dev and op env",
            goal="alignment of test results from dev and op environments",
            quality_scenarios=["default.card-qas_015"],
            validator=Validator.build_info_validator("Inspect the alignment of results for no more than 0.25% difference."),
        ),
        #Understandability QAS test case
        TestCase(
            identifier="understanding design choices",
            goal="understanding design and implementation choices",
            quality_scenarios=["default.card-qas_016"],
            validator=Validator.build_info_validator("Inspect project code and documentation."),
        ),
        #Maintainability QAS test case
        TestCase(
            identifier="keep ML component up to date",
            goal="keep trained ML component up to date with op environment changes",
            quality_scenarios=["default.card-qas_017"],
            validator=Validator.build_info_validator("Validate work time less than 8hrs."),
        ),
        #Modifiability QAS test case
        TestCase(
            identifier="update data pipelines",
            goal="alignment of test results from dev and op environments",
            quality_scenarios=["default.card-qas_018"],
            validator=Validator.build_info_validator("Validate work time less than 4hrs."),
        ),
        #Replaceability QAS test case
        TestCase(
            identifier="update ML training algorithm",
            goal="update ML component training algorithm",
            quality_scenarios=["default.card-qas_019"],
            validator=Validator.build_info_validator("Validate train time less than 16hrs."),
        ),
        # Reusability QAS test case
        TestCase(
            identifier="reuse ML component",
            goal="reuse ML component in new app",
            quality_scenarios=["default.card-qas_020"],
            validator=Validator.build_info_validator("Validate work time less than 4hrs."),
        ),
    ]
)
test_suite.save(parents=True, force=True)

ArtifactModel(header=ArtifactHeaderModel(identifier='suite.default', type='suite', timestamp=1759188697, creator=None, level='model'), body=TestSuiteModel(artifact_type=<ArtifactType.TEST_SUITE: 'suite'>, test_cases=[TestCaseModel(identifier='accuracy across gardens', goal='Check if model performs well accross different populations', qas_list=['default.card-qas_001'], measurement=None, validator=ValidatorModel(bool_exp='gASVfgMAAAAAAACMCmRpbGwuX2RpbGyUjBBfY3JlYXRlX2Z1bmN0aW9ulJOUKGgAjAxfY3JlYXRlX2NvZGWUk5QoQwgWARj/EAIo/pRLAUsASwBLAUsESxNDbJUBlwB0AQAAAAAAAAAAiAFmAWQBhAh8AGoCAAAAAAAAAAAAAAAAAAAAAAAARACrAAAAAAAAAKsBAAAAAAAAdAUAAAAAAAAAAHwAagIAAAAAAAAAAAAAAAAAAAAAAACrAQAAAAAAAGsoAABTAJROaAQoQwIKAZRLAUsASwBLAksDSzNDKJUBSwABAJcAfABdCQAAfQF8AYkCa1wAAJYBlwEBAIwLBAB5AK0DdwGUToWUKYwCLjCUjAFnlIaUjGgvVXNlcnMvcmJyb3dlcnNpbm5pbmcvRG9jdW1lbnRzL1Jlc2VhcmNoRm9sZGVycy9Db250aW51dW1fTFRQL0dpdFJlcG9zL21sdGUvZGVtby9zY2VuYXJpb3MvdmFsaWRhdG9ycy5weZSMCTxnZW5leHByPpSMRmFsbF9hY2N1cmFjaWVzX21vcmVfb3JfZXF1YWxfdGhh