## 2e. Evidence - Robustness QAS Measurements

This section collects evidence for the robustness QAS defined in the previous step. Some functions and data are loaded from external Python files.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. This import also sets up global constants for the folders and model to use.

In [None]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from session import *
from session_LLMinfo import *

### Set up test case from scenario

In [None]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 4
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)

### Specific test case generated from the scenario

**Data and Data Source:**	The original test data set can be used. The test data will be augmented by altering entries: changing case, adding whitespace padding, and removing punctuation.

**Measurement and Condition:**	The LLM output will be analyzed to determine whether the scores generated for each series of prompts, which differ only by the alterations detailed above (case, whitespace padding, and punctuation), are the same at least 95% of the time.

**Context:**	Normal Operation
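
The augmentation described under "Data and Data Source" could be sketched as follows. This is a minimal illustration, not the code that produced the CSV files used below: the helper names (`pad_whitespace`, `strip_punctuation`, `make_variants`), the padding width, and the example sentence are all assumptions made for the sketch.

```python
import string


def pad_whitespace(text, width=3):
    # Hypothetical helper: add `width` spaces on both sides of the text.
    return " " * width + text + " " * width


def strip_punctuation(text):
    # Hypothetical helper: remove every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


def make_variants(text):
    # Build one prompt "series": the original entry plus its perturbed copies.
    return {
        "original": text,
        "upper": text.upper(),
        "lower": text.lower(),
        "padded": pad_whitespace(text),
        "no_punctuation": strip_punctuation(text),
    }


# Illustrative input only; real entries would come from the test data set.
variants = make_variants("Met all goals, on time!")
for name, prompt in variants.items():
    print(f"{name!r}: {prompt!r}")
```

Each variant in a series would then be sent to the LLM, and the resulting scores compared within the series.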

### Gather evidence

In [None]:
# Import necessary packages
import os

import pandas as pd

from evaluation_helpers import *

In [None]:
# Read the files with the necessary input data and LLM evaluation results
input_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "5e_llm_input_robustness.csv")
)
response_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "5e_llm_output_robustness.csv")
)
response_df.drop(columns=["Unnamed: 0"], inplace=True)
input_df.drop(columns=["Unnamed: 0"], inplace=True)

response_df

In [None]:
### Save evidence to the specified scenario

In [None]:
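# Compute the share of responses that received the most common score
def all_scores_equal(response_df):
    """Return the fraction of rows whose rating equals the most frequent rating."""
    # value_counts() ignores missing ratings, so they never count as a match
    counts = response_df.extractedOverallRating.value_counts()
    mx = counts.max() if len(counts) else 0
    max_val_pcent = mx / len(response_df)

    return float(max_val_pcent)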


max_val_pcent = all_scores_equal(response_df)
if max_val_pcent >= 0.95:
    print(
        f"test passes with {max_val_pcent} of evaluation scores being the same"
    )
else:
    print(f"test fails with only {max_val_pcent} being the same")
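
The check above measures how often the single most common score appears across all responses. If the data also recorded which prompt series each response belongs to, agreement could instead be measured within each series. The sketch below assumes a hypothetical `series_id` column, which is not known to exist in the CSV files, and uses made-up demo data; only `extractedOverallRating` comes from the notebook.

```python
import pandas as pd


def share_of_consistent_series(df, group_col, score_col):
    # A series is consistent when all of its variants received one score.
    # dropna=False makes a missing score count as a distinct value.
    distinct_scores = df.groupby(group_col)[score_col].nunique(dropna=False)
    consistent = distinct_scores == 1
    return float(consistent.mean())


# Made-up demo data: series 0 agrees on every variant, series 1 does not.
demo_df = pd.DataFrame(
    {
        "series_id": [0, 0, 0, 1, 1, 1],  # hypothetical grouping column
        "extractedOverallRating": [4, 4, 4, 3, 3, 5],
    }
)

share = share_of_consistent_series(
    demo_df, "series_id", "extractedOverallRating"
)
print(f"{share:.0%} of prompt series received a single score")
```

Under this reading, the 95% threshold would apply to the share of series whose variants all received the same score.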

In [None]:
from mlte.evidence.types.real import Real
from mlte.measurement.external_measurement import ExternalMeasurement

# Evaluate robustness; the identifier has to be the same one defined in the TestSuite.
repeatable_measurement = ExternalMeasurement(
    "LLM is robsust to format", Real, all_scores_equal
)
repeated_pcent = repeatable_measurement.evaluate(response_df)

# Inspect value
print(repeated_pcent)

# Save to artifact store
repeated_pcent.save(force=True)