## 2c. Evidence - Functional Correctness QAS Measurements

Evidence collected in this section checks the second functional correctness QAS scenario defined in the previous step. Note that some functions and data are loaded from external Python files.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. The import below also sets up global constants for the folders and the model to use.

In [1]:
# Set up the context for the model being used, plus constants for the folders and model data.
from session import *
from session_LLMinfo import *

Creating initial custom lists at URI: local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loaded 7 qa_categories for initial list
Loaded 30 quality_attributes for initial list
Creating sample catalog at URI: StoreType.LOCAL_FILESYSTEM:local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loading sample catalog entries.
Loaded 9 entries for sample catalog.


### Set up scenario test case

In [2]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 2
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)

card.default-qas_003
Functional Correctness
ReviewPro receives a prompt asking for an employee review from  the manager  during  normal operation .  The model outputs an employee evaluation, including an overall performance score for the employee and an evaluation for each important sub-category.  The sub-category scores should average to match the overall score in at least 95% of the cases.


### A specific test case generated from the scenario

**Data and Data Source:**	The LLM receives a prompt, containing the employee goals, employee statement, and manager notes, for an employee evaluation and performance score. The original test data set can be used to simulate this request.

**Measurement and Condition:**	The LLM-generated scores will be self-consistent: after rounding, the average of the sub-category scores will match the overall score in at least 95% of the samples.

**Context:**	Normal Operation
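
As a minimal sketch of the condition above, the check for a single evaluation could be written as follows. The names `round_half_up` and `is_self_consistent`, the half-up rounding rule, and the sample scores are assumptions made for illustration; the scenario does not state how ties are rounded.

```python
import math
from statistics import mean


def round_half_up(value: float) -> int:
    """Round to the nearest integer, with .5 rounding up (assumed rule)."""
    return math.floor(value + 0.5)


def is_self_consistent(sub_scores: list[float], overall: int) -> bool:
    """Return True when the rounded mean of the sub-category scores equals the overall score."""
    return round_half_up(mean(sub_scores)) == overall


# Hypothetical scores, not taken from the test data set.
print(is_self_consistent([4, 3, 4, 5, 4], 4))  # mean is 4.0
print(is_self_consistent([2, 3, 2, 3, 2], 4))  # mean is 2.4
```

The scenario is then satisfied when this check holds for at least 95% of the evaluated samples.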

### Gather evidence

In [3]:
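# Import necessary packages
# os may already be in scope via the star imports above; importing it here keeps the cells self-contained.
import os

import pandas as pd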

In [4]:
# Read the files with the necessary input data and LLM evaluation results
input_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2abc_llm_input_functional_correctness.csv")
)
results_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2abc_llm_output_functional_correctness.csv")
)
results_df.drop(columns=["Unnamed: 0"], inplace=True)

# Preview the columns of both dataframes
print(input_df.columns)
print(results_df.columns)

Index(['employeeSelfEval', 'managerComments', 'goalsAndObjectives',
       'EmployeeName', 'correctEvalScore'],
      dtype='object')
Index(['evaluationOutput', 'prompt', 'extractedOverallRating',
       'extractedDrinks', 'extractedTimeliness',
       'extractedCustomerSatisfaction', 'extractedStoreOperations',
       'extractedOnTime', 'extractedName', 'modelCalled', 'averageScore'],
      dtype='object')
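
Since `averageScore` is later compared against `extractedOverallRating`, it can help to sanity-check how an average relates to the sub-category scores. The sketch below is an assumption about that relationship, not the method used to build the CSV: it treats the five `extracted*` score columns listed above as the sub-categories, rounds their row-wise mean half up, and uses a toy dataframe with made-up values in place of `results_df`.

```python
import numpy as np
import pandas as pd

# Assumed sub-category columns, taken from the column listing above.
SUB_CATEGORY_COLUMNS = [
    "extractedDrinks",
    "extractedTimeliness",
    "extractedCustomerSatisfaction",
    "extractedStoreOperations",
    "extractedOnTime",
]

# Toy stand-in for results_df; the values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "extractedDrinks": [4, 2],
        "extractedTimeliness": [3, 3],
        "extractedCustomerSatisfaction": [4, 2],
        "extractedStoreOperations": [5, 3],
        "extractedOnTime": [4, 2],
        "extractedOverallRating": [4, 4],
    }
)

# Row-wise mean of the sub-category scores, rounded half up (assumed rule).
row_means = toy_df[SUB_CATEGORY_COLUMNS].mean(axis=1)
toy_df["recomputedAverage"] = np.floor(row_means + 0.5).astype(int)

print(toy_df[["recomputedAverage", "extractedOverallRating"]])
```

Running the same recomputation on the real `results_df` and comparing it with the stored `averageScore` column would show whether the two agree before the evidence is collected.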


### Save evidence to the specified scenario

In [5]:
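# Compute the fraction (0 to 1, not a percentage) of inconsistent results
def evaluate_inconsistent_pcent(results_df):
    """Return the fraction of rows where averageScore differs from extractedOverallRating."""
    # Boolean series: True where the average score does not match the overall rating
    mismatches = (
        results_df["averageScore"] != results_df["extractedOverallRating"]
    )
    print(mismatches)
    mismatch_count = mismatches.sum()
    data_size = len(results_df)
    # Kept as a fraction so it can be compared directly against the 0.05 threshold
    mismatch_val = mismatch_count / data_size
    return float(mismatch_val)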


mismatch_val = evaluate_inconsistent_pcent(results_df)
print(mismatch_val)
if mismatch_val <= 0.05:  # "at least 95% consistent" allows up to 5% mismatches
    print(f"test passes with {mismatch_val} failures")
else:
    print(f"test fails with {mismatch_val} failures")

0    False
1     True
2    False
3    False
4    False
5    False
6     True
dtype: bool
0.2857142857142857
test fails with 0.2857142857142857 failures
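
The reported value follows from the boolean series above: two of the seven rows are flagged `True`, so the mismatch fraction is 2/7 ≈ 0.2857, well above the 0.05 allowed by the scenario. A minimal sketch of that arithmetic, using a series rebuilt from the printed output rather than from the original CSV:

```python
import pandas as pd

# Mismatch flags copied from the printed output above.
mismatches = pd.Series([False, True, False, False, False, False, True])

# The mean of a boolean series equals the fraction of True values.
mismatch_fraction = float(mismatches.mean())
print(mismatch_fraction)

# Indices of the inconsistent rows, useful for inspecting the raw LLM output.
print(mismatches[mismatches].index.tolist())
```

Filtering the results dataframe by the same boolean mask is one way to pull up the specific evaluations whose scores disagree.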


In [6]:
from mlte.evidence.types.real import Real
from mlte.measurement.external_measurement import ExternalMeasurement

# Evaluate the mismatch fraction; the identifier has to be the same one defined in the TestSuite.
mismatch_measurement = ExternalMeasurement(
    "eval is consistent", Real, evaluate_inconsistent_pcent
)
mismatch_pcent = mismatch_measurement.evaluate(results_df)

# Inspect value
print(mismatch_pcent)

# Save to artifact store
mismatch_pcent.save(force=True)

0    False
1     True
2    False
3    False
4    False
5    False
6     True
dtype: bool
0.2857142857142857


ArtifactModel(header=ArtifactHeaderModel(identifier='evidence.eval is consistent', type='evidence', timestamp=1761930099, creator=None, level='version'), body=EvidenceModel(artifact_type=<ArtifactType.EVIDENCE: 'evidence'>, metadata=EvidenceMetadata(test_case_id='eval is consistent', measurement=MeasurementMetadata(measurement_class='mlte.measurement.external_measurement.ExternalMeasurement', output_class='mlte.evidence.types.real.Real', additional_data={'function': '__main__.evaluate_inconsistent_pcent'})), evidence_class='mlte.evidence.types.real.Real', value=RealValueModel(evidence_type=<EvidenceType.REAL: 'real'>, real=0.2857142857142857, unit=None)))