## 2a. Evidence - Explainability QAS Measurements

The evidence collected in this section tests the Explainability quality attribute scenario (QAS) defined in the previous step. Some functions and data are loaded from external Python files.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. The import below also sets up global constants for the folders and the model to use.

In [1]:
# Set up the context for the model being used, plus constants for the folders and model data.

from session import *
from session_LLMinfo import *

Creating initial custom lists at URI: local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loaded 7 qa_categories for initial list
Loaded 30 quality_attributes for initial list
Creating sample catalog at URI: StoreType.LOCAL_FILESYSTEM:local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loading sample catalog entries.
Loaded 9 entries for sample catalog.


### Set up scenario test case 

In [2]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 0
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)

card.default-qas_001
Explainability
ReviewPro receives a prompt asking for an explanation of a previously provided employee evaluation from  the manager  during  normal operation .  The application will provide a rationale for the employee review that is human understandable text 


**A specific test case generated from the scenario:**

**Data and Data Source:** The LLM receives a prompt from the manager asking for an employee evaluation, and the original test data set can be used to mimic this request.

**Measurement and Condition:** When queried for an explanation of the score, the LLM will return an explanation of how the score is supported by the evidence: in this case, the employee's self-review, the goals and objectives, and the manager's notes.

**Context:**	Normal Operation 
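
One rough way to screen explanations against the condition above is to measure how much of each evidence source is echoed in the explanation text. The sketch below is only an illustration under stated assumptions: `evidence_overlap` is a hypothetical helper, not part of MLTE or this notebook, the example row is invented, and word overlap is a crude proxy that does not replace human review of whether the rationale is understandable.

```python
import re


def _words(text):
    """Return the lowercase alphabetic tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def evidence_overlap(explanation, evidence):
    """For each evidence source, return the fraction of its words found in the explanation."""
    explanation_words = _words(explanation)
    overlap = {}
    for name, text in evidence.items():
        source_words = _words(text)
        shared = source_words & explanation_words
        overlap[name] = len(shared) / len(source_words) if source_words else 0.0
    return overlap


# Hypothetical row; the keys mirror the input columns used in this notebook.
evidence = {
    "employeeSelfEval": "I make drinks quickly and keep the counter clean.",
    "managerComments": "Kate is often late for her shift.",
    "goalsAndObjectives": "Improve customer satisfaction scores.",
}
explanation = (
    "The rating reflects that Kate makes drinks quickly "
    "but is often late for her shift."
)
scores = evidence_overlap(explanation, evidence)
```

A source with a score of zero, such as `goalsAndObjectives` in this invented row, would flag an explanation that may be ignoring part of the evidence and is worth a closer manual look.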


### Gather evidence

In [3]:
import os

import pandas as pd

In [4]:
# Load the CSV files for this scenario.
# Initial input to ReviewPro
input_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2abc_llm_input_functional_correctness.csv")
)

# initial output of ReviewPro
output_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2abc_llm_output_functional_correctness.csv")
)

# response when asked to explain evaluation score
response_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2a_llm_output_explainability.csv")
)
# Drop the unnamed index column left over from the CSV export.
output_df.drop(columns=["Unnamed: 0"], inplace=True)

# Preview the cleaned dataframe
print(input_df.columns)
output_df.columns

Index(['employeeSelfEval', 'managerComments', 'goalsAndObjectives',
       'EmployeeName', 'correctEvalScore'],
      dtype='object')


Index(['evaluationOutput', 'prompt', 'extractedOverallRating',
       'extractedDrinks', 'extractedTimeliness',
       'extractedCustomerSatisfaction', 'extractedStoreOperations',
       'extractedOnTime', 'extractedName', 'modelCalled', 'averageScore'],
      dtype='object')

### Save evidence (the LLM explanation of the responses) to the specific scenario

In [5]:
def pull_explination(filename):
    """Reads the explainability CSV and returns the LLM responses as a list."""
    print(filename)
    response_df = pd.read_csv(filename)
    print(response_df.columns)

    return response_df.response.tolist()
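
The helper above assumes that the CSV contains a `response` column and that every row has a value in it. A more defensive variant could fail early when the column is missing and skip blank responses. The sketch below uses the standard library `csv` module instead of pandas so that it runs without dependencies; `load_responses` and the sample data are hypothetical and not part of the notebook.

```python
import csv
import io


def load_responses(csv_file, column="response"):
    """Read one text column from an open CSV file, failing early if it is missing."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(
            f"column {column!r} not found; available: {reader.fieldnames}"
        )
    # Skip rows where the column is empty so blank explanations are not saved.
    return [row[column] for row in reader if row[column] and row[column].strip()]


# Invented sample data standing in for the explainability CSV.
sample = io.StringIO(
    "prompt,response,model\n"
    "Explain the score,The rating of 3.0 reflects average performance.,demo-model\n"
    "Explain the score,,demo-model\n"
)
responses = load_responses(sample)
```

Here the second row is dropped because its `response` field is empty, so only one explanation would be passed on as evidence.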

In [None]:
from mlte.measurement.external_measurement import ExternalMeasurement
from mlte.evidence.types.array import Array


# Save to MLTE store.
evi_collector = ExternalMeasurement(
    "LLM provides evidence", Array, pull_explination
)

evi = evi_collector.evaluate("data/2a_output_explainability.csv")
evi.save(force=True)

data/2a_output_explainability.csv
Index(['Unnamed: 0', 'prompt', 'response', 'model', 'employeeSelfEval',
       'goalsAndObjectives', 'managerComments'],
      dtype='object')


ArtifactModel(header=ArtifactHeaderModel(identifier='evidence.LLM provides evidence', type='evidence', timestamp=1761930028, creator=None, level='version'), body=EvidenceModel(artifact_type=<ArtifactType.EVIDENCE: 'evidence'>, metadata=EvidenceMetadata(test_case_id='LLM provides evidence', measurement=MeasurementMetadata(measurement_class='mlte.measurement.external_measurement.ExternalMeasurement', output_class='mlte.evidence.types.array.Array', additional_data={'function': '__main__.pull_explination'})), evidence_class='mlte.evidence.types.array.Array', value=ArrayValueModel(evidence_type=<EvidenceType.ARRAY: 'array'>, data=["Based on the inputs provided, the overall rating of 3.0 reflects several factors:\n\n1. **Basic Job Performance**: Kate performs the fundamental tasks required by her role, such as making drinks and collecting her paycheck. This suggests that she's meeting the essential duties of her job, which is a baseline expectation for an average performance rating.\n\n2. **