## 2d. Evidence - Repeatability QAS Measurements

The evidence collected in this section tests the repeatability QAS scenario defined in the previous step. Some functions and data are loaded from external Python files.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. The import below also sets up global constants for the folders and the model to use.

In [1]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from session import *
from session_LLMinfo import *

Creating initial custom lists at URI: local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loaded 7 qa_categories for initial list
Loaded 30 quality_attributes for initial list
Creating sample catalog at URI: StoreType.LOCAL_FILESYSTEM:local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loading sample catalog entries.
Loaded 9 entries for sample catalog.


### Set up scenario test case

In [2]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 3
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)

card.default-qas_004
Repeatability
ReviewPro may receive multiple entries of similarly performing employees for evaluation from  the manager  during  normal operation .  n the case of similar prompts and input information, the LLM generated employee evaluation, including performance scores and evaluation summary, should be semantically similar each time. 


### A specific test case generated from the scenario

**Data and Data Source:**	The original test data set can be used, but will be augmented to contain repeated instances of the prompts.

**Measurement and Condition:**	The LLM output will be analyzed to determine whether the scores generated for the series of identical prompts are the same at least 95% of the time.

**Context:**	Normal Operation
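
The augmentation described under "Data and Data Source" could be carried out along the following lines. This is only a sketch under stated assumptions: the helper `repeat_prompts`, the column names `prompt` and `repeat_id`, and the sample prompts are hypothetical and are not taken from the ReviewPro data set.

```python
import pandas as pd


def repeat_prompts(df: pd.DataFrame, n_repeats: int) -> pd.DataFrame:
    """Return a copy of df in which every row appears n_repeats times."""
    # Index.repeat duplicates each index label n_repeats times,
    # so copies of the same prompt stay next to each other.
    repeated = df.loc[df.index.repeat(n_repeats)].copy()
    # Number the copies of each original row 0..n_repeats-1.
    repeated["repeat_id"] = repeated.groupby(level=0).cumcount().to_numpy()
    return repeated.reset_index(drop=True)


# Hypothetical prompts, used only for illustration.
prompts = pd.DataFrame(
    {"prompt": ["Evaluate employee A", "Evaluate employee B"]}
)
augmented = repeat_prompts(prompts, n_repeats=4)
print(augmented)
```

Keeping a `repeat_id` column makes it possible to group the LLM responses by original prompt later, which matters once the data set contains more than one distinct prompt.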

### Gather evidence

In [3]:
# Import necessary packages
import os  # used below to build the dataset paths

import pandas as pd

In [4]:
# Read the files with the necessary input data and LLM evaluation results
input_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2d_llm_input_repeatability.csv")
)
response_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2d_llm_output_repeatability.csv")
)
response_df.drop(columns=["Unnamed: 0"], inplace=True)

# Preview the cleaned dataframe
print(response_df.extractedOverallRating)

0    3.0
1    0.0
2    3.0
3    0.0
4    0.0
5    3.0
6    0.0
7    0.0
Name: extractedOverallRating, dtype: float64


### Save evidence to the specified scenario

In [5]:
# Compute the fraction of responses that share the most common score
def all_scores_equal(response_df):
    mx = 0
    for s in response_df.extractedOverallRating.unique():
        n = len(response_df[response_df.extractedOverallRating == s])
        if n > mx:
            mx = n
    max_val_pcent = mx / len(response_df)

    return float(max_val_pcent)


max_val_pcent = all_scores_equal(response_df)
if max_val_pcent >= 0.95:
    print(
        f"test passes with {max_val_pcent} of evaluation scores being the same"
    )
else:
    print(f"test fails with only {max_val_pcent} being the same")

test fails with only 0.625 being the same
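
The 0.625 figure follows from the eight ratings previewed in the "Gather evidence" step: five of them are 0.0 and three are 3.0, so the most common score accounts for 5/8 of the responses. The sketch below reproduces that arithmetic with `value_counts`, assuming the ratings are copied by hand from the preview rather than loaded from the CSV; `ratings` and `share_of_most_common` are illustrative names only.

```python
import pandas as pd

# Ratings copied from the printed preview of extractedOverallRating.
ratings = pd.Series([3.0, 0.0, 3.0, 0.0, 0.0, 3.0, 0.0, 0.0])

# Count how often each distinct score occurs, most frequent first.
counts = ratings.value_counts()

# Share of responses that carry the most common score.
share_of_most_common = counts.iloc[0] / len(ratings)

print(counts.to_dict())
print(share_of_most_common)
```

Since 0.625 is below the 0.95 threshold, the test case fails for this set of responses.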


In [6]:
from mlte.evidence.types.real import Real
from mlte.measurement.external_measurement import ExternalMeasurement

# Evaluate repeatability; the identifier has to be the same one defined in the TestSuite.
repeatable_measurement = ExternalMeasurement(
    "repeatable review", Real, all_scores_equal
)
repeated_pcent = repeatable_measurement.evaluate(response_df)

# Inspect value
print(repeated_pcent)

# Save to artifact store
repeated_pcent.save(force=True)

0.625


ArtifactModel(header=ArtifactHeaderModel(identifier='evidence.repeatable review', type='evidence', timestamp=1761930134, creator=None, level='version'), body=EvidenceModel(artifact_type=<ArtifactType.EVIDENCE: 'evidence'>, metadata=EvidenceMetadata(test_case_id='repeatable review', measurement=MeasurementMetadata(measurement_class='mlte.measurement.external_measurement.ExternalMeasurement', output_class='mlte.evidence.types.real.Real', additional_data={'function': '__main__.all_scores_equal'})), evidence_class='mlte.evidence.types.real.Real', value=RealValueModel(evidence_type=<EvidenceType.REAL: 'real'>, real=0.625, unit=None)))