## 2a. Evidence - Explainability QAS Measurements

The evidence collected in this section tests the Explainability QAS scenario defined in the previous step. Some functions and data are loaded from external Python files.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. The import below also sets up global constants for the folders and the model to use.

In [None]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
# from demo.EvalPro_demo.session import *
from session import *
from session_LLMinfo import *

### Set up scenario test case 

In [None]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 0
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)
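Passing the scenario fields to `print` as separate arguments adds a space between every argument, so the output carries doubled spaces around "from" and "during". A minimal sketch of an alternative using an f-string is shown here. The `Scenario` dataclass and its values are hypothetical stand-ins for one entry of `card.quality_scenarios`, not part of MLTE.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical stand-in for one entry of card.quality_scenarios."""

    stimulus: str
    source: str
    environment: str
    response: str
    measure: str


def scenario_sentence(s) -> str:
    """Join the scenario fields into one sentence with single spaces."""
    return (
        f"{s.stimulus} from {s.source} during {s.environment}. "
        f"{s.response} {s.measure}"
    )


# Placeholder values for illustration only, not taken from the negotiation card.
example = Scenario(
    stimulus="A request",
    source="a user",
    environment="normal operation",
    response="The system responds",
    measure="within the agreed measure.",
)
print(scenario_sentence(example))
```

The function only reads attributes, so it should also accept any object exposing the same five field names.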

**A specific test case generated from the scenario:**

**Data and Data Source:** The LLM receives a prompt from the manager asking for an employee evaluation; the original test data set can be used to mimic this request.

**Measurement and Condition:** When queried for an explanation of the score, the LLM returns an explanation of how the score is supported by the evidence, in this case the employee's self-review, the goals and objectives, and the manager's notes.

**Context:** Normal operation
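One rough way to approximate the measurement above is to check whether an explanation shares vocabulary with each evidence field. The sketch below is a simple word-overlap heuristic and is not how this notebook evaluates explanations. The function names, the stop-word list, the `min_overlap` threshold, and the sample texts are all assumptions made for illustration.

```python
import re

# Small, hypothetical stop-word list; a real check would likely use a fuller one.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "was", "for", "on", "with"}


def content_words(text: str) -> set[str]:
    """Lower-case the text and keep alphabetic words that are not stop words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS}


def cited_sources(
    explanation: str, evidence: dict[str, str], min_overlap: int = 2
) -> dict[str, bool]:
    """Report, per evidence field, whether the explanation shares at least
    min_overlap content words with that field."""
    explained = content_words(explanation)
    return {
        name: len(explained & content_words(text)) >= min_overlap
        for name, text in evidence.items()
    }


# Made-up sample data, keyed like the dataset columns used in this notebook.
evidence = {
    "goalsAndObjectives": "Improve latte art and reduce customer wait times",
    "employeeSelfEval": "I trained two new baristas this quarter",
    "managerComments": "Always punctual and friendly with customers",
}
explanation = (
    "The rating reflects reduced wait times and the two baristas the employee trained."
)
print(cited_sources(explanation, evidence))
```

Exact word matching misses paraphrases and inflected forms (for example "reduce" versus "reduced"), so a result of `False` here is a prompt for manual review and not proof that the evidence was ignored.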


### Gather evidence

In [None]:
from evaluation_helpers import *

import itertools
import os

import pandas as pd

In [None]:
# Load the LLM input and output datasets from CSV
input_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "5abc_llm_input_functional_correctness.csv")
)
output_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "5abc_llm_output_functional_correctness.csv")
)
# Drop the unnamed leftover index column
output_df.drop(columns=["Unnamed: 0"], inplace=True)

# Preview the columns of both dataframes
print(input_df.columns)
output_df.columns

In [None]:
chain = prompt_template | llm

In [None]:
combo_df = pd.merge(input_df, output_df, left_index=True, right_index=True)
combo_df.columns
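Merging on the index pairs rows by index label, and the default inner join silently drops any label that appears in only one frame. The sketch below shows a quick alignment check before merging. The toy frames and their values are made up and only stand in for `input_df` and `output_df`.

```python
import pandas as pd

# Toy frames standing in for input_df and output_df; the values are made up.
inputs = pd.DataFrame({"employeeSelfEval": ["a", "b", "c"]})
outputs = pd.DataFrame({"extractedOverallRating": [3, 4, 5]})

# Confirm both frames carry the same index labels before pairing rows.
assert inputs.index.equals(outputs.index), "row indexes differ"

combined = pd.merge(inputs, outputs, left_index=True, right_index=True)
print(combined.shape)

# With a mismatched index, the default inner join keeps only shared labels.
partial = pd.merge(inputs, outputs.iloc[1:], left_index=True, right_index=True)
print(len(partial))
```

If the two CSV files were filtered or re-sorted separately, the index check fails early instead of producing a merged frame with fewer or misaligned rows.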

Create a prompt asking the LLM to explain the employee's overall evaluation score.

In [None]:
prompt_template2 = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an assistant to the manager of a small coffee shop.",
        ),
        (
            "human",
            """
Assistant, you provided an overall rating of {extracted_overall_rating} based on the following inputs:

Goals/objectives
{goals_and_objectives}

Employee self evaluation

{self_eval}

Manager comments

{manager_comments}

Can you explain how you arrived at that rating?
        
""",
        ),
    ]
)

In [None]:
chain = prompt_template2 | llm

# Collect one record per row; converted to a DataFrame in the next cell
response_df2 = []

for row_num, row in combo_df.iterrows():
    # Values that fill the placeholders in prompt_template2
    pii_data = {
        "extracted_overall_rating": row.extractedOverallRating,
        "goals_and_objectives": row.goalsAndObjectives,
        "self_eval": row.employeeSelfEval,
        "manager_comments": row.managerComments,
    }
    # Render the full prompt text so it can be stored alongside the response
    prompt = prompt_template2.format(**pii_data)
    response = chain.invoke(pii_data)

    pii_data["response"] = response.content
    pii_data["prompt"] = prompt
    pii_data["model"] = llm

    response_df2.append(pii_data)

In [None]:
response_df2 = pd.DataFrame(response_df2)

In [None]:
# Rename the columns back to the dataset's camelCase names, then save the responses
response_df2.rename(
    columns={
        "goals_and_objectives": "goalsAndObjectives",
        "self_eval": "employeeSelfEval",
        "manager_comments": "managerComments",
    },
    inplace=True,
)

response_df2[
    [
        "prompt",
        "response",
        "model",
        "employeeSelfEval",
        "goalsAndObjectives",
        "managerComments",
    ]
].to_csv("data/5a_output_explainability.csv")
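By default `to_csv` also writes the row index as an unlabeled first column, which `read_csv` later loads as `Unnamed: 0`. That is one likely origin of the column dropped from `output_df` earlier, though the source files may have been produced differently. The sketch below compares both behaviors using an in-memory buffer and made-up data.

```python
import io

import pandas as pd

# Made-up frame standing in for the saved responses.
frame = pd.DataFrame({"prompt": ["p1", "p2"], "response": ["r1", "r2"]})

# Default behavior: the index is written as an unlabeled first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out, so the columns round-trip unchanged.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Passing `index=False` when saving would remove the need to drop the extra column on load, provided the index carries no information that downstream cells rely on.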

### Save evidence to the specific scenario

In [None]:
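import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit an OLS model with an interaction term, then collect p-values from a type-II ANOVA.
# `my_df2` must already be in scope.
model = ols(
    "overallRating ~ C(PromptGroupNum) + FReadingScore + C(PromptGroupNum):FReadingScore",
    data=my_df2,
).fit()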


def run_anova_lm(model):
    res = sm.stats.anova_lm(model, typ=2)
    return res


res = run_anova_lm(model)
print(res)
if res["PR(>F)"].loc["FReadingScore"] < 0.05:
    print("fail test")
else:
    print("pass test")

In [None]:
def pull_explination(filename):
    """Loads the saved responses from a CSV file and returns the explanations as a list."""
    print(filename)
    response_df = pd.read_csv(filename)
    print(response_df.columns)

    return response_df.response.tolist()

In [None]:
from mlte.measurement.external_measurement import ExternalMeasurement
from mlte.evidence.types.array import Array


# Save to MLTE store.
evi_collector = ExternalMeasurement(
    "LLM provides evidence", Array, pull_explination
)
# input_df = pd.read_csv(os.path.join(DATASETS_DIR, '5bc_llm_input_functional_correctness.csv'))
evi = evi_collector.evaluate("data/5a_output_explainability.csv")
evi.save(force=True)