## 3. Validation and Report Generation

The final phase of SDMT involves aggregating evidence, validating the metrics reflected by the evidence we collected, and displaying this information in a report.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. This import also sets up global constants for the folders and the model to use.

In [89]:
# Set up the context for the model being used, plus constants for the folders and model data.
from demo.scenarios.session import *

### Validate Evidence and get an updated `TestResults` with `Result`s

Now that our `TestSuite` is ready and we have enough evidence, we create a `TestSuiteValidator` with our `TestSuite` and add all the `Evidence` we have. With that, we can validate our tests and generate an output `TestResults` containing the validation results. The cell below loads the stored `TestResults` and prints each result.

In [91]:
from mlte.results.test_results import TestResults

# Load the test results.

test_results = TestResults.load()

test_results.print_results()

 > Test Case: accuracy across gardens, result: Success, details: All accuracies are equal to or over threshold 0.9 - values: ["[0.946, 0.956, 0.913]"]
 > Test Case: ranksums blur2x8, result: Success, details: P-Value is greater or equal to 0.016666666666666666 - values: ["[0.07946178703316073, 0.9366653249981838]"]
 > Test Case: ranksums blur5x8, result: Success, details: P-Value is greater or equal to 0.016666666666666666 - values: ["[0.4032389192727559, 0.6867724711187835]"]
 > Test Case: ranksums blur0x8, result: Failure, details: P-Value is less than threshold 0.016666666666666666 - values: ["[2.908064206049404, 0.003636736621916332]"]
 > Test Case: effect of blur across families, result: Success, details: All p-values are equal to or over threshold 0.0003546099290780142 - values: ["{'array': [{'evidence.ranksums Order Apiales-Apiales blurdelta_2x8': [0.0, 1.0]}, {'evidence.ranksums Order Apiales-Alismatales blurdelta_2x8': [0.0, 1.0]}, {'evidence.ranksums Order Apiales-Asterales b

Here we see some of the results of the validation.

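For example, the `ranksums blur0x8` test fails: its p-value (about 0.0036) is below the threshold of about 0.0167, so there is a significant difference between the original model with no blur and blur 0x8. The `blur2x8` and `blur5x8` tests pass, with p-values of about 0.94 and 0.69. Model accuracy therefore drops as blur increases, but aside from the maximum blur (0x8), the fall-off is not statistically significant at this threshold.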
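To make the pass/fail logic concrete, here is a minimal sketch that re-applies the threshold to the rank-sum values printed above. It does not use MLTE; the names `RANKSUMS`, `two_sided_p`, and `passes` are hypothetical helpers for illustration. Two assumptions are worth flagging: each printed pair is read as `[statistic, p-value]`, with the p-value taken as the two-sided normal tail of the statistic, and the threshold looks like 0.05 divided over three comparisons (a Bonferroni-style correction), which the output does not state explicitly.

```python
import math

# Threshold copied from the printed results.
# Assumption: it corresponds to 0.05 / 3; the output does not say so.
THRESHOLD = 0.016666666666666666

# (statistic, p-value) pairs copied from the printed results.
RANKSUMS = {
    "blur2x8": (0.07946178703316073, 0.9366653249981838),
    "blur5x8": (0.4032389192727559, 0.6867724711187835),
    "blur0x8": (2.908064206049404, 0.003636736621916332),
}


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a standard normal statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def passes(p_value: float, threshold: float = THRESHOLD) -> bool:
    """A test case succeeds when the p-value is at or above the threshold."""
    return p_value >= threshold


outcomes = {name: passes(p) for name, (_, p) in RANKSUMS.items()}

for name, (z, p) in RANKSUMS.items():
    status = "Success" if outcomes[name] else "Failure"
    # The recomputed p-value should closely match the reported one.
    print(f"{name}: reported p={p:.4f}, recomputed p={two_sided_p(z):.4f}, {status}")
```

Only `blur0x8` falls below the threshold, which matches the single `Failure` in the printed results.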

### Generate a Report

The final step of SDMT involves the generation of a report to communicate the results of model evaluation.

In [92]:
from mlte.report.artifact import (
    Report,
    CommentDescriptor,
)

# Create a report with the default NegotiationCard, TestSuite and TestResults in this store.
report = Report(
    comments=[
        CommentDescriptor(
            content="This model should not be used for nefarious purposes."
        )
    ],
)

report.save(force=True, parents=True)

ArtifactModel(header=ArtifactHeaderModel(identifier='report.default-20250930-120505', type='report', timestamp=1759248305, creator=None, level='version'), body=ReportModel(artifact_type=<ArtifactType.REPORT: 'report'>, negotiation_card_id='card.default', negotiation_card=NegotiationCardModel(artifact_type=<ArtifactType.NEGOTIATION_CARD: 'card'>, system=SystemDescriptor(goals=[GoalDescriptor(description='Correct identification of flowers', metrics=[MetricDescriptor(description='Accuracy > 0.9', baseline='Paper that describes the base model')]), GoalDescriptor(description='Increased number of visits to the garden', metrics=[MetricDescriptor(description='40% growth in repeat visits', baseline='Strategic plan'), MetricDescriptor(description='40% new visits', baseline='Strategic plan')])], problem_type=<ProblemType.CLASSIFICATION: 'classification'>, task='Identify flowers in pictures taken at the garden', usage_context='The model will be part of an application, the Garden Buddy app, that ru