## 2b. Evidence - Robustness QAS Measurements

Evidence collected in this section checks for the Robustness scenario defined in the previous step. Note that some functions will be loaded from external Python files.

In [None]:
{
    "tags": ["Computer Vision"],
    "quality_attribute": "Robustness to Noise (Image Blur)",
    "description": "The model receives a picture taken at a garden by a member of the general public, and it is a bit blurry.  The model should still be able to successfully identify the flower at the same rate as non-blurry images. Test data needs to include blurred flower images.  Blurred images will be created using ImageMagick. Three datasets will be generated, each with different amounts of blur: minimal blur, maximum blur, and in between minimal and maximum blur. Blurry images are successfully identified at rates equal to that of non-blurred images. This will be measured using the Wilcoxon Rank-Sum test, with significance at p-value <=0.05.",
    "inputs": "three garden populations, model results on Oxford garden data",
    "output": "robustness to noise",
}

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. This import will also set up global constants related to folders and model to use.

In [1]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from session import *

Creating initial custom lists at URI: local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/GardenBuddy/../store
Loaded 7 qa_categories for initial list
Loaded 30 quality_attributes for initial list
Creating sample catalog at URI: StoreType.LOCAL_FILESYSTEM:local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/GardenBuddy/../store
Loading sample catalog entries.
Loaded 9 entries for sample catalog.


### Set up scenario test case

In [8]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 1
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)

card.default-qas_002
Robustness
The model receives a picture that is a bit blurry from  the Garden Buddy application  during  normal operation .  he model successfully identifies flowers at the same rate as non-blurry images


**A Specific test case generated from the scenario:**

**Data and Data Source:**	Test data needs to include blurred flower images.  Test blurred images will be created using ImageMagick. Three datasets will be generated, each with different amounts of blur: minimal blur, maximum blur, and in between minimal and maximum blur.

**Measurement and Condition:**	Blurry images are successfully identified at rates equal to that of non-blurred images. This will be measured using the Wilcoxon Rank-Sum test, with significance at p-value <=0.05.

**Context:**	Normal Operation

### Helper Functions

General functions and external imports.

In [None]:
# General functions.
import utils.garden as garden
import pandas as pd


def calculate_base_accuracy(df_results: pd.DataFrame) -> pd.DataFrame:
    # Calculate the base model accuracy result per data label
    df_pos = (
        df_results[df_results["model correct"] == True].groupby("label").count()
    )
    # df_pos.drop(columns=["predicted_label"], inplace=True)
    df_neg = (
        df_results[df_results["model correct"] == False]
        .groupby("label")
        .count()
    )
    # df_neg.drop(columns=["predicted_label"], inplace=True)
    df_neg.rename(columns={"model correct": "model incorrect"}, inplace=True)
    df_res = df_pos.merge(
        df_neg, right_on="label", left_on="label", how="outer"
    )
    df_res.fillna(0, inplace=True)
    df_res["model acc"] = df_res["model correct"] / (
        df_res["model correct"] + df_res["model incorrect"]
    )
    df_res["count"] = df_res["model correct"] + df_res["model incorrect"]
    df_res.drop(columns=["model correct", "model incorrect"], inplace=True)
    df_res.head()

    return df_res


def calculate_accuracy_per_set(
    data_folder: str, df_results: pd.DataFrame, df_res: pd.DataFrame
) -> pd.DataFrame:
    # Calculate the model accuracy per data label for each blurred data set
    base_filename = "predictions_test"
    ext_filename = ".csv"
    set_filename = ["_blur2x8", "_blur5x8", "_blur0x8"]

    col_root = "model acc"

    for fs in set_filename:
        filename = os.path.join(data_folder, base_filename + fs + ext_filename)
        colname = col_root + fs

        df_temp = pd.read_csv(filename)
        df_temp = df_temp[["model correct", "label"]]

        df_pos = (
            df_temp[df_temp["model correct"] == True].groupby("label").count()
        )
        df_neg = (
            df_results[df_results["model correct"] == False]
            .groupby("label")
            .count()
        )
        df_neg.rename(
            columns={"model correct": "model incorrect"}, inplace=True
        )
        df_res2 = df_pos.merge(
            df_neg,
            right_on="label",
            left_on="label",
            how="outer",
        ).fillna(0)
        df_res2.fillna(0, inplace=True)

        df_res2[colname] = df_res2["model correct"] / (
            df_res2["model correct"] + df_res2["model incorrect"]
        )
        df_res2.drop(columns=["model correct", "model incorrect"], inplace=True)

        df_res = df_res.merge(
            df_res2, right_on="label", left_on="label", how="outer"
        ).fillna(0)

    return df_res


def print_model_accuracy(df_res: pd.DataFrame, key: str, name: str):
    model_acc = sum(df_res[key] * df_res["count"]) / sum(df_res["count"])
    print(name, model_acc)

In [12]:
# Prepare all data. Same as the case above, we will use CSV files that contain results of a previous execution of the model.
df_results = garden.load_base_results(DATASETS_DIR, "predictions_test.csv")
df_results = df_results[["model correct", "label"]]
df_res = calculate_base_accuracy(df_results)
df_res = calculate_accuracy_per_set(DATASETS_DIR, df_results, df_res)
df_info = garden.load_taxonomy(DATASETS_DIR)
df_all = garden.merge_taxonomy_with_results(df_res, df_info, "label", "Label")

102 102 102


### Measurements

Now do the actual measurements. First simply see the model accuracy across blurs.

In [13]:
# view changes in model accuracy
print_model_accuracy(df_res, "model acc", "base model accuracy")
print_model_accuracy(
    df_res, "model acc_blur2x8", "model accuracy with 2x8 blur"
)
print_model_accuracy(
    df_res, "model acc_blur5x8", "model accuracy with 5x8 blur"
)
print_model_accuracy(
    df_res, "model acc_blur0x8", "model accuracy with 0x8 blur"
)

base model accuracy 0.947265625
model accuracy with 2x8 blur 0.9457940876397908
model accuracy with 5x8 blur 0.9395827696608947
model accuracy with 0x8 blur 0.7439894903273809


Measure the ranksums (p-value) for all blur cases, using `scipy.stats.ranksums` and the `ExternalMeasurement` wrapper.

In [14]:
import scipy.stats

from mlte.evidence.types.array import Array
from mlte.measurement.external_measurement import ExternalMeasurement


my_blur = ["2x8", "5x8", "0x8"]
for i in range(len(my_blur)):
    # Define measurements.
    ranksum_measurement = ExternalMeasurement(
        f"ranksums blur{my_blur[i]}", Array, scipy.stats.ranksums
    )

    # Evaluate.
    ranksum: Array = ranksum_measurement.evaluate(
        df_res["model acc"], df_res[f"model acc_blur{my_blur[i]}"]
    )
    print(f"blur {my_blur[i]}: {ranksum}")

    # Inspect values
    print(ranksum)

    # Save to artifact store
    ranksum.save(force=True)

blur 2x8: RanksumsResult(statistic=np.float64(0.07946178703316073), pvalue=np.float64(0.9366653249981838))
RanksumsResult(statistic=np.float64(0.07946178703316073), pvalue=np.float64(0.9366653249981838))
blur 5x8: RanksumsResult(statistic=np.float64(0.4032389192727559), pvalue=np.float64(0.6867724711187835))
RanksumsResult(statistic=np.float64(0.4032389192727559), pvalue=np.float64(0.6867724711187835))
blur 0x8: RanksumsResult(statistic=np.float64(2.908064206049404), pvalue=np.float64(0.003636736621916332))
RanksumsResult(statistic=np.float64(2.908064206049404), pvalue=np.float64(0.003636736621916332))


Now to next part of the question- is this equal across the phylogenic groups?

To do that, we will check for differences of the effect of the blur between families, using the phylohentic grouping of the plant pictures to stratify the data

In [15]:
from typing import List

from evidence.multiple_ranksums import MultipleRanksums

# use the initial result, blur columns to anaylze effect of blur
df_all["delta_2x8"] = df_all["model acc"] - df_all["model acc_blur2x8"]
df_all["delta_5x8"] = df_all["model acc"] - df_all["model acc_blur5x8"]
df_all["delta_0x8"] = df_all["model acc"] - df_all["model acc_blur0x8"]

pops = df_all["Order"].unique().tolist()
blurs = [
    "delta_2x8",
    "delta_5x8",
    "delta_0x8",
]


def run_ranksum(samp1, samp2):
    res = scipy.stats.ranksums(samp1, samp2)
    float_list = [float(x) for x in res]
    # print(float(res))
    return float_list


def calculate_multiple_ranksums(df_all, pops, blurs):
    ranksums: List = []
    for i in range(len(blurs)):
        for p1 in range(len(pops)):  # pop1 in pops:
            pop1 = pops[p1]
            for p2 in range(p1, len(pops)):  # pop2 in pops:
                pop2 = pops[p2]
                ranksum_measurement = ExternalMeasurement(
                    f"ranksums Order {pop1}-{pop2} blur{blurs[i]}",
                    Array,
                    run_ranksum,  # scipy.stats.ranksums,
                )
                ranksum: Array = ranksum_measurement.evaluate(
                    df_all[df_all["Order"] == pop1][blurs[i]],
                    df_all[df_all["Order"] == pop2][blurs[i]],
                )
                # print(f"blur {blurs[i]}: {ranksum}")

                ranksums.append({ranksum.identifier: ranksum.array})
    return ranksums


multiple_ranksums_meas = ExternalMeasurement(
    f"effect of blur across families",
    MultipleRanksums,
    calculate_multiple_ranksums,
)
multiple_ranksums: MultipleRanksums = multiple_ranksums_meas.evaluate(
    df_all, pops, blurs
)
multiple_ranksums.num_pops = len(pops)
multiple_ranksums.save(force=True)

ArtifactModel(header=ArtifactHeaderModel(identifier='evidence.effect of blur across families', type='evidence', timestamp=1762179256, creator=None, level='version'), body=EvidenceModel(artifact_type=<ArtifactType.EVIDENCE: 'evidence'>, metadata=EvidenceMetadata(test_case_id='effect of blur across families', measurement=MeasurementMetadata(measurement_class='mlte.measurement.external_measurement.ExternalMeasurement', output_class='evidence.multiple_ranksums.MultipleRanksums', additional_data={'function': '__main__.calculate_multiple_ranksums'})), evidence_class='evidence.multiple_ranksums.MultipleRanksums', value=OpaqueValueModel(evidence_type=<EvidenceType.OPAQUE: 'opaque'>, data={'array': [{'evidence.ranksums Order Apiales-Apiales blurdelta_2x8': [0.0, 1.0]}, {'evidence.ranksums Order Apiales-Alismatales blurdelta_2x8': [0.0, 1.0]}, {'evidence.ranksums Order Apiales-Asterales blurdelta_2x8': [-0.1091089451179962, 0.9131160800723744]}, {'evidence.ranksums Order Apiales-Ericales blurdel