## 2e. Evidence - Robustness QAS Measurements

The evidence collected in this section tests the robustness quality attribute scenario (QAS) defined in the previous step. Some functions and data are loaded from external Python files.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. The import below also sets up global constants for the folders and the model to use.

In [5]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from session import *
from session_LLMinfo import *

### Set up test case from scenario

In [6]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 4
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)

card.default-qas_005
Robustness
ReviewPro may receive prompts with different variations, such as casing, spacing, and punctuation from  the manager  during  normal operation .  The employee evaluation  should not be influenced by these input variations


### A specific test case generated from the scenario

**Data and Data Source:** The original test data set can be used. The test data will be augmented by altering entries: changing case, adding whitespace padding, and removing punctuation.

**Measurement and Condition:** The LLM output will be analyzed to determine whether the scores generated for a series of prompts that differ only in case, whitespace, and punctuation, as detailed above, are the same at least 95% of the time.

**Context:** Normal Operation
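The augmentation described in the test case can be sketched as follows. This is a minimal illustration, not the code that produced `2e_llm_input_robustness.csv`; the helper name `make_variants` and the sample prompt are hypothetical.

```python
import string


def make_variants(prompt: str) -> dict[str, str]:
    """Return format-only variations of a prompt (hypothetical helper)."""
    # Remove every ASCII punctuation character; the words themselves are untouched.
    no_punct = prompt.translate(str.maketrans("", "", string.punctuation))
    return {
        "original": prompt,
        "upper": prompt.upper(),  # case change
        "lower": prompt.lower(),  # case change
        "padded": f"   {prompt}   ",  # whitespace padding
        "no_punctuation": no_punct,  # punctuation removal
    }


# Hypothetical sample prompt, used only for illustration.
variants = make_variants("Kate was on time, and made great drinks!")
for name, text in variants.items():
    print(f"{name}: {text!r}")
```

Each variant carries the same information as the original prompt, so a robust model would be expected to return the same scores for all of them.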

### Gather evidence

In [7]:
# Import necessary packages
import os

import pandas as pd

In [8]:
# Read the files with the necessary input data and LLM evaluation results
input_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2e_llm_input_robustness.csv")
)
response_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2e_llm_output_robustness.csv")
)
response_df.drop(columns=["Unnamed: 0"], inplace=True)
input_df.drop(columns=["Unnamed: 0"], inplace=True)

response_df

Unnamed: 0,evaluationOutput,prompt,extractedOverallRating,extractedDrinks,extractedTimeliness,extractedCustomerSatisfaction,extractedStoreOperations,extractedOnTime,extractedName,modelCalled,averageScore
0,**Employee Evaluation**\n\n**Employee:** Kate ...,System: You are an assistant to the manager of...,0.0,3.0,0.0,0.0,0.0,0.0,** Kate,client=<openai.resources.chat.completions.comp...,1.0
1,Employee: Kate \nDate and history: [Insert da...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,Kate,client=<openai.resources.chat.completions.comp...,0.0
2,**Employee Evaluation**\n\n**Employee:** Kate ...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,** Kate,client=<openai.resources.chat.completions.comp...,0.0
3,**Employee:** Kate \n**Date and History:** Oc...,System: You are an assistant to the manager of...,3.0,3.0,0.0,3.0,0.0,0.0,** Kate,client=<openai.resources.chat.completions.comp...,1.0
4,**Employee:** Kate \n**Date and history:** [I...,System: You are an assistant to the manager of...,0.0,3.0,0.0,3.0,0.0,0.0,** Kate,client=<openai.resources.chat.completions.comp...,1.0
5,Employee: Kate \nDate and history: [Insert Da...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,Kate,client=<openai.resources.chat.completions.comp...,0.0
6,**Employee:** Kate \n**Date and history:** Oc...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,** Kate,client=<openai.resources.chat.completions.comp...,0.0
7,**Employee: Kate** \n**Date and History: [Ins...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,Kate**,client=<openai.resources.chat.completions.comp...,0.0
8,**Employee:** Kate \n**Date and history:** [I...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,** Kate,client=<openai.resources.chat.completions.comp...,0.0
9,**Employee:** Kate \n**Date and history:** [I...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,** Kate,client=<openai.resources.chat.completions.comp...,0.0


### Save evidence to the specified scenario

In [9]:
# Compute the fraction of responses that share the most common overall rating.
# A value of 1.0 means all scores are equal.
def all_scores_equal(response_df):
    mx = 0
    for s in response_df.extractedOverallRating.unique():
        n = len(response_df[response_df.extractedOverallRating == s])
        if n > mx:
            mx = n
    max_val_pcent = mx / len(response_df)

    return float(max_val_pcent)


max_val_pcent = all_scores_equal(response_df)
if max_val_pcent >= 0.95:
    print(
        f"test passes with {max_val_pcent} of evaluation scores being the same"
    )
else:
    print(f"test fails with only {max_val_pcent} being the same")

test fails with only 0.9285714285714286 being the same
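For comparison, the same "share of the most common score" idea can be written with pandas' `value_counts`. This is a sketch on toy data, not the notebook's results; the name `modal_fraction` is hypothetical. One caveat: `value_counts` drops missing values by default, so the denominator differs from `len(response_df)` whenever a rating is NaN.

```python
import pandas as pd


def modal_fraction(scores: pd.Series) -> float:
    """Fraction of non-missing rows that share the most common score."""
    # normalize=True returns relative frequencies instead of raw counts.
    return float(scores.value_counts(normalize=True).max())


# Toy data (an assumption for illustration): 13 identical ratings and 1 outlier.
toy = pd.Series([0.0] * 13 + [3.0])
fraction = modal_fraction(toy)

# 13/14 is below the 0.95 threshold, so this toy case would fail the check.
print(fraction, fraction >= 0.95)
```

With 14 responses, a single differing score is already enough to drop below the 95% condition, which is worth keeping in mind when interpreting results on small test sets.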


In [10]:
from mlte.evidence.types.real import Real
from mlte.measurement.external_measurement import ExternalMeasurement

# Evaluate robustness; the identifier must match the one defined in the TestSuite.
repeatable_measurement = ExternalMeasurement(
    "LLM is robsust to format", Real, all_scores_equal
)
repeated_pcent = repeatable_measurement.evaluate(response_df)

# Inspect value
print(repeated_pcent)

# Save to artifact store
repeated_pcent.save(force=True)

0.9285714285714286


ArtifactModel(header=ArtifactHeaderModel(identifier='evidence.LLM is robsust to format', type='evidence', timestamp=1761930155, creator=None, level='version'), body=EvidenceModel(artifact_type=<ArtifactType.EVIDENCE: 'evidence'>, metadata=EvidenceMetadata(test_case_id='LLM is robsust to format', measurement=MeasurementMetadata(measurement_class='mlte.measurement.external_measurement.ExternalMeasurement', output_class='mlte.evidence.types.real.Real', additional_data={'function': '__main__.all_scores_equal'})), evidence_class='mlte.evidence.types.real.Real', value=RealValueModel(evidence_type=<EvidenceType.REAL: 'real'>, real=0.9285714285714286, unit=None)))