## 2l. Evidence - Reproducibility QAS Measurements

Evidence collected in this section checks for the Reproducibility scenario defined in the previous step. Note that some functions will be loaded from external Python files.

The cell below must contain JSON data about this evidence that will be used to automatically populate the sample test catalog.

In [None]:
{
    "tags": ["Computer Vision", "Object detection"],
    "quality_attribute": "reproducibility",
    "description": "testing the ability of the ML model to produce similar model outputs when training multiple models on different random samples of the training data",
    "inputs": "model results from different models, trained on different random sample sets of the training data",
    "output": "p-value from the Friedman test"
}

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. This import also sets up global constants for the folders and the model to use.

In [None]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from session import *

### Set up scenario test case

In [None]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 12
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)

**A specific test case generated from the scenario:**

**Data and Data Source:**	Ten class-balanced sets will be generated by sampling the training data set. The ML algorithm will then be trained on each of those sets, producing 10 ML models. The test data will be used to evaluate and compare the performance of the 10 models.

**Measurement and Condition:**	ML components will be compared for each data label class. A Friedman test will be used to evaluate the similarity of the models' per-class results, with p<0.05.

**Context:**	Normal Operation
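
The test case above does not say how the class-balanced sets are drawn. As a minimal sketch only, assuming sampling without replacement and an equal number of items per class, the sets could be generated as shown below. The helper `balanced_sample`, its parameters, and the toy labels are hypothetical and are not part of MLTE or of the demo's `utils` package.

```python
import random
from collections import defaultdict


def balanced_sample(labels, per_class, seed):
    """Return sorted indices of a class-balanced random sample (hypothetical helper)."""
    rng = random.Random(seed)  # Seeded so each sample set can be regenerated.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < per_class:
            raise ValueError(f"class {label!r} has only {len(indices)} items")
        # Assumption: sample without replacement within each class.
        chosen.extend(rng.sample(indices, per_class))
    return sorted(chosen)


# Toy labels, made up for illustration only.
labels = ["rose"] * 6 + ["tulip"] * 4 + ["daisy"] * 5

# One sample set per seed; each set would be used to train one model.
sample_sets = [balanced_sample(labels, per_class=3, seed=s) for s in range(10)]
```

Using a distinct seed per set keeps the 10 sets different from each other while still allowing any one of them to be recreated later.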

### Helper Functions
General functions and external imports.

In [None]:
# General functions.

from utils import garden
import pandas as pd
from scipy import stats
from os import path


def load_data(data_folder: str):
    """Loads all garden data results and taxonomy categories."""
    df_results = garden.load_base_results(data_folder, "predictions_test.csv")

    # Load the taxonomic data and merge with results.
    df_info = garden.load_taxonomy(data_folder)
    df_results.rename(columns={"label": "Label"}, inplace=True)
    df_all = garden.merge_taxonomy_with_results(df_results, df_info)

    return df_info, df_all


def load_results(data_folder: str):
    """Loads the results of the reproducibility test runs."""
    df_results = pd.read_csv(
        path.join(data_folder, "ReproducibilityDataSet_CV.csv")
    )

    return df_results

In [None]:
# Prepare the data. For this section, instead of executing the model, we will use CSV files containing the results of an already executed run of the model.

df = load_results(DATASETS_DIR)
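
The cells that follow read ten columns named `Accuracy_r0` through `Accuracy_r9` from this CSV. As a small optional sketch, the column names can be generated and checked before running the test. The helpers `accuracy_columns` and `missing_columns` and the example header are hypothetical and are not part of the notebook's utilities.

```python
def accuracy_columns(n_runs=10):
    """Names of the per-run accuracy columns, e.g. Accuracy_r0 .. Accuracy_r9."""
    return [f"Accuracy_r{i}" for i in range(n_runs)]


def missing_columns(available, n_runs=10):
    """Return the expected accuracy columns that are absent from `available`."""
    present = set(available)
    return [name for name in accuracy_columns(n_runs) if name not in present]


# With the notebook's DataFrame this would be called as missing_columns(df.columns).
# Hypothetical header that lacks the column for run 9:
example_header = ["Class"] + accuracy_columns(9)
print(missing_columns(example_header))  # prints ['Accuracy_r9']
```

Since `stats.friedmanchisquare` takes its samples as positional arguments, the same list could also be used to pass the columns, for example `stats.friedmanchisquare(*(df[name] for name in accuracy_columns()))`.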

In [None]:
results = stats.friedmanchisquare(
    df.Accuracy_r0,
    df.Accuracy_r1,
    df.Accuracy_r2,
    df.Accuracy_r3,
    df.Accuracy_r4,
    df.Accuracy_r5,
    df.Accuracy_r6,
    df.Accuracy_r7,
    df.Accuracy_r8,
    df.Accuracy_r9,
)
results.pvalue
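
To make the statistic behind this p-value more concrete, the sketch below computes the Friedman chi-square statistic in plain Python. Each argument is one treatment (here, one model's accuracies) and each position is one block (here, one row of the CSV). Values are ranked within each block, and the rank sums per treatment are compared. This is an illustration under those assumptions, not SciPy's implementation; the function names are hypothetical, and returning 0.0 when every block is fully tied is a choice made for this sketch.

```python
def average_ranks(values):
    """Rank values from 1..k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # Ranks are 1-based.
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def friedman_statistic(*samples):
    """Friedman chi-square statistic with tie correction (illustrative sketch)."""
    k = len(samples)       # Number of treatments (models).
    n = len(samples[0])    # Number of blocks (rows).
    if k < 3 or any(len(s) != n for s in samples):
        raise ValueError("need at least 3 equally sized samples")

    rank_sums = [0.0] * k
    tie_term = 0.0
    for block in zip(*samples):
        for j, rank in enumerate(average_ranks(list(block))):
            rank_sums[j] += rank
        for value in set(block):
            t = block.count(value)
            tie_term += t**3 - t  # Zero for untied values.

    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        return 0.0  # Sketch-only choice: every block is fully tied.
    return q / correction


# Three treatments that are ranked identically in all four blocks.
print(friedman_statistic([1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]))  # prints 8.0
```

The p-value is then obtained by comparing the statistic against a chi-square distribution with `k - 1` degrees of freedom, which is an approximation that becomes more reliable as the number of blocks grows.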

### Measurements

In this example, we evaluate the output from `stats.friedmanchisquare` using an `ExternalMeasurement` class, and store the result.

In [None]:
from mlte.evidence.types.array import Array
from mlte.measurement.external_measurement import ExternalMeasurement


# Wrap the Friedman test so that MLTE can record its result as evidence.
friedman_measurement = ExternalMeasurement(
    "repeated training on training samples", Array, stats.friedmanchisquare
)

# Evaluate.
friedman_res = friedman_measurement.evaluate(
    df.Accuracy_r0,
    df.Accuracy_r1,
    df.Accuracy_r2,
    df.Accuracy_r3,
    df.Accuracy_r4,
    df.Accuracy_r5,
    df.Accuracy_r6,
    df.Accuracy_r7,
    df.Accuracy_r8,
    df.Accuracy_r9,
)

# Inspect values
print(friedman_res)

# Save to artifact store
friedman_res.save(force=True)