## 2c. Evidence - Functional Correctness QAS Measurements

Evidence collected in this section checks the second functional correctness QAS scenario defined in the previous step. Note that some functions and data are loaded from external Python files.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. The imports below also set up global constants for the folders and model to use.

In [None]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from session import *
from session_LLMinfo import *

### Set up scenario test case

In [None]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 2
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)

### A specific test case generated from the scenario

**Data and Data Source:** The LLM receives a prompt for an employee evaluation and performance score, containing the employee goals, employee statement, and manager notes. The original test data set can be used to simulate this request.

**Measurement and Condition:** The LLM-generated scores will be self-consistent: after rounding, the average of the sub-category scores will match the overall score for 95% of samples.

**Context:** Normal Operation
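The condition above can be illustrated with a minimal sketch. The sub-category column names, the toy data, and the round-half-up convention are all assumptions made for illustration; the source does not state how `averageScore` is derived in the real output file.

```python
import numpy as np
import pandas as pd

# Hypothetical sub-category columns; the real dataset may use other names.
SUBCATEGORY_COLS = ["communication", "teamwork", "delivery"]


def add_average_score(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the rounded mean of the sub-category scores."""
    out = df.copy()
    mean_score = out[SUBCATEGORY_COLS].mean(axis=1)
    # Assumed convention: round half up. Python's built-in round() and
    # pandas' .round() use round-half-to-even instead.
    out["averageScore"] = np.floor(mean_score + 0.5).astype(int)
    return out


# Toy data, not taken from the real test set.
sample = pd.DataFrame(
    {
        "communication": [4, 3, 5],
        "teamwork": [4, 3, 2],
        "delivery": [5, 3, 2],
        "extractedOverallRating": [4, 3, 4],
    }
)

scored = add_average_score(sample)
# A row is self-consistent when the rounded average equals the overall rating.
consistent = scored["averageScore"] == scored["extractedOverallRating"]
print(consistent.mean())  # share of self-consistent rows
```

In this toy data the third row is inconsistent: its sub-category scores average to 3 while the overall rating is 4.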

### Gather evidence

In [None]:
# Import necessary packages (os is used below to build file paths)
import os

import pandas as pd

In [None]:
# Read the files with the necessary input data and LLM evaluation results
input_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "5abc_llm_input_functional_correctness.csv")
)
results_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "5abc_llm_output_functional_correctness.csv")
)
results_df.drop(columns=["Unnamed: 0"], inplace=True)

# Show the column names of both dataframes
print(input_df.columns)
print(results_df.columns)
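The measurement in the next step reads the `averageScore` and `extractedOverallRating` columns, so it can help to confirm they are present before going further. The following is a minimal sketch; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical helpers, not part of MLTE or of the session files.

```python
import pandas as pd

# Columns the consistency measurement reads from the results dataframe.
REQUIRED_COLUMNS = ("averageScore", "extractedOverallRating")


def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df."""
    return [name for name in required if name not in df.columns]


# Toy dataframes standing in for results_df.
complete_df = pd.DataFrame(
    {"averageScore": [4, 3], "extractedOverallRating": [4, 2]}
)
partial_df = pd.DataFrame({"averageScore": [4, 3]})

print(missing_columns(complete_df))
print(missing_columns(partial_df))
```

Calling `missing_columns(results_df)` right after loading would surface a renamed or absent column before the measurement raises a `KeyError`.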

### Save evidence to the specified scenario

In [None]:
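# Compute the fraction (0.0 to 1.0) of inconsistent results
def evaluate_inconsistent_pcent(results_df):
    """Return the fraction of rows where the averaged score differs from the overall rating."""
    mismatches = (
        results_df["averageScore"] != results_df["extractedOverallRating"]
    )
    # Show the per-row mismatch flags for inspection
    print(mismatches)
    mismatch_count = mismatches.sum()
    data_size = len(results_df)
    # Kept as a fraction, not multiplied by 100, to match the 0.05 threshold
    mismatch_val = mismatch_count / data_size
    return float(mismatch_val)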


mismatch_val = evaluate_inconsistent_pcent(results_df)
print(mismatch_val)
if mismatch_val < 0.05:
    print(f"test passes with a mismatch rate of {mismatch_val:.2%}")
else:
    print(f"test fails with a mismatch rate of {mismatch_val:.2%}")
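One property of this comparison is worth keeping in mind: element-wise `!=` in pandas treats `NaN` as unequal to every value, so a row whose overall rating could not be extracted counts as a mismatch. The sketch below uses toy data (not the real results) to show the effect.

```python
import numpy as np
import pandas as pd

# Toy data: the third row has no extracted overall rating.
toy = pd.DataFrame(
    {
        "averageScore": [4.0, 3.0, 5.0, 2.0],
        "extractedOverallRating": [4.0, 2.0, np.nan, 2.0],
    }
)

# Same comparison as in evaluate_inconsistent_pcent.
mismatches = toy["averageScore"] != toy["extractedOverallRating"]
mismatch_fraction = float(mismatches.mean())
print(mismatches.tolist(), mismatch_fraction)
```

Here two of the four rows are flagged: one genuine disagreement and one missing rating. If missing ratings should be reported separately, they could be counted with `isna()` before the comparison.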

In [None]:
from mlte.evidence.types.real import Real
from mlte.measurement.external_measurement import ExternalMeasurement

# Evaluate consistency; the identifier has to be the same one defined in the TestSuite.
mismatch_measurement = ExternalMeasurement(
    "eval is consistent", Real, evaluate_inconsistent_pcent
)
# Pass the dataframe loaded above (testing_df is not defined in this notebook)
mismatch_pcent = mismatch_measurement.evaluate(results_df)

# Inspect value
print(mismatch_pcent)

# Save to artifact store
mismatch_pcent.save(force=True)