## Data Generation Notebook 2

This notebook shows how the LLM was queried to explain the evaluation scores generated by the EvalPro application.

### Gather evidence

In [None]:
import itertools
import os

import pandas as pd

from session import *

from demo.ReviewPro.utils.evaluation_helpers import *

Creating initial custom lists at URI: local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loaded 7 qa_categories for initial list
Loaded 30 quality_attributes for initial list
Creating sample catalog at URI: StoreType.LOCAL_FILESYSTEM:local:///Users/rbrowersinning/Documents/ResearchFolders/Continuum_LTP/GitRepos/mlte_llm/demo/ReviewPro/../store
Loading sample catalog entries.
Loaded 9 entries for sample catalog.


In [None]:
# Read the input and output CSV files used to evaluate functional correctness.
input_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "5abc_llm_input_functional_correctness.csv")
)
output_df = pd.read_csv(
    os.path.join(DATASETS_DIR, "5abc_llm_output_functional_correctness.csv")
)
output_df.drop(columns=["Unnamed: 0"], inplace=True)

# Preview the column names of both dataframes
print(input_df.columns)
output_df.columns

Index(['employeeSelfEval', 'managerComments', 'goalsAndObjectives',
       'EmployeeName', 'correctEvalScore'],
      dtype='object')


Index(['evaluationOutput', 'prompt', 'extractedOverallRating',
       'extractedDrinks', 'extractedTimeliness',
       'extractedCustomerSatisfaction', 'extractedStoreOperations',
       'extractedOnTime', 'extractedName', 'modelCalled', 'averageScore'],
      dtype='object')

In [6]:
combo_df = pd.merge(input_df, output_df, left_index=True, right_index=True)
combo_df.columns

Index(['employeeSelfEval', 'managerComments', 'goalsAndObjectives',
       'EmployeeName', 'correctEvalScore', 'evaluationOutput', 'prompt',
       'extractedOverallRating', 'extractedDrinks', 'extractedTimeliness',
       'extractedCustomerSatisfaction', 'extractedStoreOperations',
       'extractedOnTime', 'extractedName', 'modelCalled', 'averageScore'],
      dtype='object')
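
The merge above pairs rows purely by index position, so it assumes both CSV files list employees in the same order. A minimal sketch of a sanity check follows, comparing the `EmployeeName` and `extractedName` columns shown above. The toy values and the helper name `check_alignment` are hypothetical, and exact string equality may be too strict if the extracted names are formatted differently.

```python
import pandas as pd

# Toy stand-ins for input_df and output_df (values are invented).
toy_input = pd.DataFrame(
    {"EmployeeName": ["Kate", "Sam"], "correctEvalScore": [3, 4]}
)
toy_output = pd.DataFrame(
    {"extractedName": ["Kate", "Sam"], "extractedOverallRating": [3.0, 4.0]}
)


def check_alignment(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Merge on the index and return the rows whose names disagree."""
    merged = pd.merge(left, right, left_index=True, right_index=True)
    return merged[merged["EmployeeName"] != merged["extractedName"]]


# An empty result means every input row was paired with its own output row.
mismatches = check_alignment(toy_input, toy_output)
print(len(mismatches))
```

If the check returns any rows, merging on a shared key column would be safer than merging on the index.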

### Create a prompt asking the LLM to explain the employee's overall evaluation score

In [7]:
# the prompt template to ask the LLM to explain its evaluation
prompt_template2 = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an assistant to the manager of a small coffee shop.",
        ),
        (
            "human",
            """
Assistant, you provided an overall rating of {extracted_overall_rating} based on the following inputs:

Goals/objectives
{goals_and_objectives}

Employee self evaluation

{self_eval}

Manager comments

{manager_comments}

Can you explain how you arrived at that rating?
        
""",
        ),
    ]
)
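
The human message has four placeholders, and each one must be supplied when the prompt is filled. A minimal sketch in plain Python (a stand-in, not the LangChain API) lists the placeholders in a template string and checks that a data dictionary covers them. `HUMAN_TEMPLATE` is a condensed copy of the message above; the helper `placeholders` and the sample values are hypothetical.

```python
from string import Formatter

# Condensed copy of the human message used in the notebook.
HUMAN_TEMPLATE = (
    "Assistant, you provided an overall rating of {extracted_overall_rating} "
    "based on the following inputs:\n\n"
    "Goals/objectives\n{goals_and_objectives}\n\n"
    "Employee self evaluation\n{self_eval}\n\n"
    "Manager comments\n{manager_comments}\n\n"
    "Can you explain how you arrived at that rating?"
)


def placeholders(template: str) -> list[str]:
    """Return the placeholder names in the order they appear."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


# Invented sample values, for illustration only.
sample = {
    "extracted_overall_rating": 3.0,
    "goals_and_objectives": "Serve drinks on time.",
    "self_eval": "I make the drinks.",
    "manager_comments": "Often late.",
}

# Any name listed here would raise a KeyError when the template is filled.
missing = set(placeholders(HUMAN_TEMPLATE)) - set(sample)
filled = HUMAN_TEMPLATE.format(**sample)
print(sorted(missing))
```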

In [8]:
# Query the LLM once per employee and collect the responses.

chain = prompt_template2 | llm

# List of per-row dicts; converted to a DataFrame in a later cell.
response_df2 = []

for row_num, row in combo_df.iterrows():
    # Values for the placeholders in the prompt template
    pii_data = {
        "extracted_overall_rating": row.extractedOverallRating,
        "goals_and_objectives": row.goalsAndObjectives,
        "self_eval": row.employeeSelfEval,
        "manager_comments": row.managerComments,
    }
    # Keep the rendered prompt text alongside the response
    prompt = prompt_template2.format(**pii_data)
    response = chain.invoke(pii_data)

    pii_data["response"] = response.content
    pii_data["prompt"] = prompt
    # Note: this stores the llm object itself, not a model-name string
    pii_data["model"] = llm

    response_df2.append(pii_data)
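
The loop above stores the `llm` object itself under `model`, so the saved file will typically hold that object's string representation rather than a clean model name. A minimal sketch of the same loop structure follows, with a stub in place of the real chain and a plain string recorded as the model. `StubChain`, `StubResponse`, `MODEL_NAME`, and the sample row are hypothetical stand-ins, not part of the notebook or of LangChain.

```python
from dataclasses import dataclass

MODEL_NAME = "stub-model"  # hypothetical label for the model being called


@dataclass
class StubResponse:
    content: str


class StubChain:
    """Stand-in for `prompt_template2 | llm`; returns a canned reply."""

    def invoke(self, data: dict) -> StubResponse:
        rating = data["extracted_overall_rating"]
        return StubResponse(content=f"Explanation for rating {rating}")


# One invented row using the column names from combo_df.
rows = [
    {
        "extractedOverallRating": 3.0,
        "goalsAndObjectives": "Serve drinks on time.",
        "employeeSelfEval": "I make the drinks.",
        "managerComments": "Often late.",
    }
]

chain = StubChain()
records = []
for row in rows:
    data = {
        "extracted_overall_rating": row["extractedOverallRating"],
        "goals_and_objectives": row["goalsAndObjectives"],
        "self_eval": row["employeeSelfEval"],
        "manager_comments": row["managerComments"],
    }
    response = chain.invoke(data)
    # Record a plain string for the model so it survives a CSV round trip.
    records.append({**data, "response": response.content, "model": MODEL_NAME})

print(records[0]["response"])
```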

In [9]:
response_df2 = pd.DataFrame(response_df2)

In [10]:
# Rename the columns to match the input files, then save the responses
response_df2.rename(
    columns={
        "goals_and_objectives": "goalsAndObjectives",
        "self_eval": "employeeSelfEval",
        "manager_comments": "managerComments",
    },
    inplace=True,
)

response_df2[
    [
        "prompt",
        "response",
        "model",
        "employeeSelfEval",
        "goalsAndObjectives",
        "managerComments",
    ]
].to_csv("data/5a_output_explainability.csv")
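
By default, `to_csv` writes the DataFrame index as an unnamed first column, which is a likely origin of the `Unnamed: 0` column dropped earlier in this notebook. A minimal sketch with toy data shows the effect and the `index=False` alternative. The toy values are invented, and an in-memory buffer stands in for the file on disk.

```python
import io

import pandas as pd

# Toy frame standing in for response_df2 (values are invented).
toy = pd.DataFrame({"prompt": ["p1"], "response": ["r1"]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False, only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Passing `index=False` when saving would remove the need for a later `drop(columns=["Unnamed: 0"])` on files written this way.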

In [None]:
# Print each LLM explanation response
for i in response_df2.response.tolist():
    print(i)
    print("\n______________________\n")

Based on the inputs provided, the overall rating of 3.0 reflects several factors:

1. **Basic Job Performance**: Kate performs the fundamental tasks required by her role, such as making drinks and collecting her paycheck. This suggests that she's meeting the essential duties of her job, which is a baseline expectation for an average performance rating.

2. **Lack of Engagement**: Kate's self-evaluation indicates a lack of passion and engagement in her role. She does not express interest in going beyond the basic requirements or engaging with customers more deeply. This affects the potential for a higher rating, as enthusiasm and initiative are often valued in service roles.

3. **Punctuality and Attention to Detail**: Manager comments highlight issues with punctuality (showing up late) and negligence (failing to clean machines). These are significant drawbacks as they directly impact the efficiency and effectiveness of the coffee shop operations. This negatively influences her overall 