## 3. Validation and Report Generation

The final phase of SDMT aggregates the evidence we collected, validates it against the conditions defined in the test suite, and presents the results in a report.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. The import below also sets up global constants for the folders and the model used in this demo.

In [1]:
# Set up the MLTE context for the model and define constants for the folders and model data used below.
from session import *

Creating initial custom lists at URI: local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loaded 7 qa_categories for initial list
Loaded 30 quality_attributes for initial list
Creating sample catalog at URI: StoreType.LOCAL_FILESYSTEM:local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loading sample catalog entries.
Loaded 9 entries for sample catalog.


### Validate Evidence and Get an Updated `TestResults` with `Result`s

Now that our `TestSuite` is ready and we have collected enough evidence, we create a `TestSuiteValidator` for the `TestSuite` and load all of the `Evidence` we have. We can then validate the test cases and produce a `TestResults` artifact containing the validation results.

In [4]:
from mlte.validation.test_suite_validator import TestSuiteValidator

# Load validator for default TestSuite id
test_suite_validator = TestSuiteValidator()

# Load all Evidence and validate TestCases
test_results = test_suite_validator.load_and_validate()

# Print the validation results in the notebook, whether or not they are saved.
test_results.print_results()

# TestResults also supports persistence
test_results.save(force=True)

 > Test Case: LLM provides evidence, result: Info, details: Inspect the explinations.
 > Test Case: evaluation is correct, result: Failure, details: Real magnitude is below threshold 0.95 - values: ["0.7142857142857143"]
 > Test Case: eval is consistent, result: Failure, details: Real magnitude is below threshold 0.95 - values: ["0.2857142857142857"]
 > Test Case: repeatable review, result: Failure, details: Real magnitude is below threshold 0.95 - values: ["0.625"]
 > Test Case: LLM is robsust to format, result: Failure, details: Real magnitude is below threshold 0.95 - values: ["0.9285714285714286"]
 > Test Case: results returned promptly, result: Failure, details: One or more numbers are above 10s - values: ["[16.129170894622803]"]
 > Test Case: fair eval, result: Success, details: All p-values provided are not signifigant at a threshold of p < 0.05 - values: ["[0.9149604289819329, 0.3370165123526908]"]
 > Test Case: eval not dependent on writing level, result: Success, details: All

ArtifactModel(header=ArtifactHeaderModel(identifier='results.default', type='results', timestamp=1761930560, creator=None, level='version'), body=TestResultsModel(artifact_type=<ArtifactType.TEST_RESULTS: 'results'>, test_suite_id='suite.default', test_suite=TestSuiteModel(artifact_type=<ArtifactType.TEST_SUITE: 'suite'>, test_cases=[TestCaseModel(identifier='LLM provides evidence', goal='Check that LLM provided SHAP score showing what parts of the prompt influenced the review', qas_list=['card.default-qas_001'], measurement=None, validator=ValidatorModel(bool_exp=None, bool_exp_str=None, thresholds=[], success=None, failure=None, info='Inspect the explinations.', input_types=[], creator_entity='mlte.validation.validator.Validator', creator_function='build_info_validator', creator_args=['Inspect the explinations.'])), TestCaseModel(identifier='evaluation is correct', goal="LLM eval matches the manager's evaluation of employee", qas_list=['card.default-qas_002'], measurement=None, valid

Here we see some of the results of the validation.

For example, `evaluation is correct` fails because its measured value (about 0.71) is below the 0.95 threshold, and `results returned promptly` fails because one response took about 16.1 seconds against a 10-second limit. `fair eval` passes: neither p-value is significant at p < 0.05.
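
Because `print_results()` emits one line per test case in a regular `> Test Case: <name>, result: <outcome>, details: <text>` shape, the outcomes can be tallied with a few lines of standard-library Python. The sketch below is not part of MLTE: the `parse_result_line` and `tally` helpers are hypothetical, the line pattern is inferred from the output above rather than from any documented format, and the sample lines are copied verbatim from that output.

```python
import re
from collections import Counter

# Pattern inferred from the printed output above; it is an assumption,
# not a format documented by MLTE.
LINE_RE = re.compile(
    r"^\s*>\s*Test Case:\s*(?P<name>.+?),\s*result:\s*(?P<result>\w+),"
    r"\s*details:\s*(?P<details>.*)$"
)


def parse_result_line(line):
    """Return (name, result, details) for one printed line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("result"), match.group("details")


def tally(lines):
    """Count outcomes (Success, Failure, Info, ...) across the printed lines."""
    parsed = (parse_result_line(line) for line in lines)
    # filter(None, ...) drops lines that did not match the pattern.
    return Counter(result for _, result, _ in filter(None, parsed))


# Four lines copied verbatim from the notebook output above.
sample = [
    ' > Test Case: LLM provides evidence, result: Info, details: Inspect the explinations.',
    ' > Test Case: evaluation is correct, result: Failure, details: Real magnitude is below threshold 0.95 - values: ["0.7142857142857143"]',
    ' > Test Case: results returned promptly, result: Failure, details: One or more numbers are above 10s - values: ["[16.129170894622803]"]',
    ' > Test Case: fair eval, result: Success, details: All p-values provided are not signifigant at a threshold of p < 0.05 - values: ["[0.9149604289819329, 0.3370165123526908]"]',
]

counts = tally(sample)
print(dict(counts))
```

For the four sample lines this counts one `Info`, two `Failure`, and one `Success`. Parsing printed text is fragile; where the saved `TestResults` artifact is available, reading the results from it would be the more robust route.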

### Generate a Report

The final step of SDMT is generating a report that communicates the results of the model evaluation.

In [3]:
from mlte.report.artifact import (
    Report,
    CommentDescriptor,
)

# Create a report with the default NegotiationCard, TestSuite and TestResults in this store.
report = Report(
    comments=[
        CommentDescriptor(
            content="This model should not be used for nefarious purposes."
        )
    ],
)

report.save(force=True, parents=True)

ArtifactModel(header=ArtifactHeaderModel(identifier='report.default-20251031-130755', type='report', timestamp=1761930475, creator=None, level='version'), body=ReportModel(artifact_type=<ArtifactType.REPORT: 'report'>, negotiation_card_id='card.default', negotiation_card=NegotiationCardModel(artifact_type=<ArtifactType.NEGOTIATION_CARD: 'card'>, system=SystemDescriptor(goals=[GoalDescriptor(description='Consistent ratings across all subordinates based on predetermined manager criteria', metrics=[MetricDescriptor(description='matching to manager scores', baseline=''), MetricDescriptor(description='with-in review score consistancy', baseline='')]), GoalDescriptor(description='Decrease amount of time taken by managers to generate performance reviews.', metrics=[MetricDescriptor(description='Total effort required to complete reviews', baseline='')])], problem_type=<ProblemType.CONTENT_GENERATION: 'content_generation'>, task='Generate performance review and ratings', usage_context='The LLM 
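
The `timestamp` values in the artifact headers appear to be Unix epoch seconds: decoding the report's value gives a time that agrees with the `20251031-130755` suffix of its identifier under a UTC-4 offset. A minimal sketch of that check, using only the standard library, follows. Reading the header field as epoch seconds and the identifier suffix as local time are assumptions inferred from the output in this section, not documented MLTE behavior, and the `to_utc` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the artifact headers printed in this section.
REPORT_TIMESTAMP = 1761930475
RESULTS_TIMESTAMP = 1761930560


def to_utc(epoch_seconds):
    """Interpret an integer as Unix epoch seconds and return an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


report_time = to_utc(REPORT_TIMESTAMP)
results_time = to_utc(RESULTS_TIMESTAMP)

# Assumed local offset (UTC-4), chosen because it reproduces the identifier suffix.
local_report_time = report_time.astimezone(timezone(timedelta(hours=-4)))

print(report_time.isoformat())
print(local_report_time.strftime("%Y%m%d-%H%M%S"))
print((results_time - report_time).total_seconds())
```

Under these assumptions the results artifact is stamped 85 seconds after the report, which suggests the report cell ran before the last validation run. If the report should reflect the latest `TestResults`, re-running the report cell after validation would be the safer order.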