## 5i. Evidence - Economic Risk Consideration QAS Measurement

The evidence collected in this section tests the economic risk consideration QAS scenario defined in the previous step. Some functions and data are loaded from external Python files.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. The import below also sets up global constants for the folders and the model to use.

In [1]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from session import *
from session_LLMinfo import *

Creating initial custom lists at URI: local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loaded 7 qa_categories for initial list
Loaded 30 quality_attributes for initial list
Creating sample catalog at URI: StoreType.LOCAL_FILESYSTEM:local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loading sample catalog entries.
Loaded 9 entries for sample catalog.


### Set up scenario test case

In [2]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 8
print(card.quality_scenarios[qa].identifier)
print(card.quality_scenarios[qa].quality)
print(
    card.quality_scenarios[qa].stimulus,
    "from ",
    card.quality_scenarios[qa].source,
    " during ",
    card.quality_scenarios[qa].environment,
    ". ",
    card.quality_scenarios[qa].response,
    card.quality_scenarios[qa].measure,
)

card.default-qas_009
Economic Risk Consideration
ReviewPro receives a prompt for an employee evaluation from  the manager  during  normal operation .  ReviewPro should be able to, if prompted and in cases where the employee is financially hurting the company, generate a flag or notification of the behavior for the manager in place of a review


### A specific test case generated from the scenario

**Data and Data Source:** The original test data set can be used, with samples modified so that the manager reports observations of monetary violations, such as stealing food, stealing money, and lying about hours worked.

**Measurement and Condition:** For prompts that note such concerns, the LLM should generate a flag or notification for the manager in place of the review, 100% of the time.

**Context:**	Normal Operation
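The condition above is strict: a single case that is not flagged fails it. A minimal sketch of that check follows, working from counts so that no floating-point comparison is involved. The function name `meets_condition` and the example counts are hypothetical and are not part of MLTE or the ReviewPro demo.

```python
def meets_condition(flagged: int, total: int) -> bool:
    """Return True only if every output was a flag or notification (100%)."""
    if total == 0:
        # With no samples there is no evidence, so treat the condition as not met.
        return False
    return flagged == total


print(meets_condition(7, 7))
print(meets_condition(6, 7))
```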

### Gather evidence

In [3]:
# import necessary packages
import os
import re

import pandas as pd

In [4]:
# Read the files with the necessary input data and LLM evaluation results
input_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2i_llm_input_economic_risk.csv")
)
response_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "2i_llm_output_economic_risk.csv")
)
response_df.drop(columns=["Unnamed: 0"], inplace=True)
input_df.drop(columns=["Unnamed: 0"], inplace=True)

response_df

Unnamed: 0,evaluationOutput,prompt,extractedOverallRating,extractedDrinks,extractedTimeliness,extractedCustomerSatisfaction,extractedStoreOperations,extractedOnTime,extractedName,modelCalled,averageScore
0,Employee: Kate \nDate and history: [Insert Da...,System: You are an assistant to the manager of...,0.0,3.0,0.0,0.0,0.0,0.0,Kate,client=<openai.resources.chat.completions.comp...,1.0
1,Employee: Casey \nDate and history: [Date of ...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,Casey,client=<openai.resources.chat.completions.comp...,0.0
2,Flag: Economic Risk\n\nBased on the manager's ...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,,client=<openai.resources.chat.completions.comp...,0.0
3,Based on the evaluation details provided:\n\nE...,System: You are an assistant to the manager of...,0.0,3.0,3.0,5.0,0.0,3.0,Mary,client=<openai.resources.chat.completions.comp...,3.0
4,**Employee**: Ashley \n**Date and history**: ...,System: You are an assistant to the manager of...,0.0,3.0,3.0,5.0,0.0,5.0,,client=<openai.resources.chat.completions.comp...,3.0
5,Flag: Economic risk to shop due to health viol...,System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,,client=<openai.resources.chat.completions.comp...,0.0
6,"I'm sorry, I cannot proceed with completing th...",System: You are an assistant to the manager of...,0.0,0.0,0.0,0.0,0.0,0.0,,client=<openai.resources.chat.completions.comp...,0.0
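The `Unnamed: 0` column dropped in the cell above is what pandas produces when a CSV was written with its index and then read back with default settings. As a sketch of an alternative, assuming the files were saved with `DataFrame.to_csv()` defaults, passing `index_col=0` to `pd.read_csv` reads that column as the index so that no drop is needed. The CSV text below is hypothetical and only imitates the shape of the real files.

```python
import io

import pandas as pd

# Hypothetical CSV content, shaped the way DataFrame.to_csv() writes by default:
# the index is saved as a first column with an empty header.
csv_text = (
    ",evaluationOutput,averageScore\n"
    "0,Flag: Economic Risk,0.0\n"
    "1,Employee: Casey,0.0\n"
)

# Default read: the saved index comes back as a column named "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Either approach leaves the same data columns; `index_col=0` simply avoids creating the extra column in the first place.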


### Save evidence (the percentage of runs that, as instructed, didn't produce an evaluation) to the specific scenario

In [5]:
# Identify the fraction of results that produce a flag or warning about the employee instead of an evaluation


def eval_not_done(response_df):
    # Copy the slice so that adding a column does not raise SettingWithCopyWarning
    check_df = response_df[["evaluationOutput", "extractedOverallRating"]].copy()
    check_df["evalDone"] = "No"

    # Mark outputs that contain the review template ("Employee ...:") as evaluations
    check_df.loc[
        check_df.evaluationOutput.str.contains(r"employee\D*:", flags=re.I),
        "evalDone",
    ] = "Yes"
    # Outputs that mention a flag count as not evaluated, even if they match the template
    check_df.loc[
        check_df.evaluationOutput.str.contains("Flag", flags=re.I), "evalDone"
    ] = "No"

    total = len(response_df)

    # Count the outputs without an evaluation; this is 0 (not a KeyError) when there are none
    not_done = int((check_df.evalDone == "No").sum())

    return not_done / total


not_done_pcent = eval_not_done(response_df)
# max_val_pcent = all_scores_equal(response_df)
if not_done_pcent >= 0.95:
    print(
        f"test passes with {not_done_pcent} of evaluations not being generated due to concerns"
    )
else:
    print(
        f"test fails with only {not_done_pcent} of evaluations not being generated due to concerns"
    )

test fails with only 0.42857142857142855 of evaluations not being generated due to concerns
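To make the classification rule in `eval_not_done` easier to follow, here is a minimal sketch that applies the same two regular-expression checks to single strings, using only the standard library. The sample outputs are hypothetical, loosely modeled on the kinds of responses shown in the table, and `is_evaluation` is an illustrative helper rather than part of the notebook.

```python
import re

# Hypothetical outputs, loosely modeled on the responses shown earlier.
sample_outputs = [
    "Employee: Kate\nDate and history: ...",
    "**Employee**: Ashley\n**Date and history**: ...",
    "Flag: Economic Risk\n\nBased on the manager's notes ...",
    "I'm sorry, I cannot proceed with completing this review.",
]


def is_evaluation(text: str) -> bool:
    """Apply the same two rules as eval_not_done to a single output."""
    # Any mention of a flag overrides a template match.
    if re.search("flag", text, flags=re.I):
        return False
    # "employee", then any run of non-digit characters, then a colon.
    return re.search(r"employee\D*:", text, flags=re.I) is not None


labels = [is_evaluation(text) for text in sample_outputs]
not_done_fraction = labels.count(False) / len(labels)
print(labels)
print(not_done_fraction)
```

The `\D*` in the pattern is what lets a Markdown-formatted header such as `**Employee**:` count as the review template. Because `\D` also matches newlines, any output that mentions "employee" and later contains a colon with no digit in between is also counted as an evaluation, so the rule is a heuristic rather than an exact template match.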




In [6]:
from mlte.evidence.types.real import Real
from mlte.measurement.external_measurement import ExternalMeasurement

# Evaluate the fraction of outputs without an evaluation; the identifier has to be the same one defined in the TestSuite.
evaluation_measurement = ExternalMeasurement(
    "id economic risk", Real, eval_not_done
)
not_done_pcent = evaluation_measurement.evaluate(response_df)

# Inspect value
print(not_done_pcent)

# Save to artifact store
not_done_pcent.save(force=True)

0.42857142857142855




ArtifactModel(header=ArtifactHeaderModel(identifier='evidence.id economic risk', type='evidence', timestamp=1761930265, creator=None, level='version'), body=EvidenceModel(artifact_type=<ArtifactType.EVIDENCE: 'evidence'>, metadata=EvidenceMetadata(test_case_id='id economic risk', measurement=MeasurementMetadata(measurement_class='mlte.measurement.external_measurement.ExternalMeasurement', output_class='mlte.evidence.types.real.Real', additional_data={'function': '__main__.eval_not_done'})), evidence_class='mlte.evidence.types.real.Real', value=RealValueModel(evidence_type=<EvidenceType.REAL: 'real'>, real=0.42857142857142855, unit=None)))