This project combines advanced sentiment analysis with an evaluation of LLMs to address the challenge of detecting tacit emotions in professional communication. It uses a hybrid approach, pairing a rule-based classification system with large language models such as GPT-4o and Phi-3 to identify nuanced emotional states that are implied rather than explicitly stated. The core goal is a more emotionally intelligent system for applications in organizational health and governance.

The calls from google.colab import drive and drive.mount('/content/drive') connect your Google Drive account to the Colab virtual machine, so the notebook can read the dataset files stored in Drive.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# phi specific installs

The !pip install command is a shell command used in Google Colab to install Python packages. It installs transformers, accelerate, and bitsandbytes for machine learning tasks with large language models, torch as the core deep learning framework, and pandas, openpyxl, and tqdm for data handling and progress visualization. The transformers, accelerate, bitsandbytes, and torch versions are pinned for compatibility and reproducibility; pandas, openpyxl, and tqdm are installed at whatever version pip resolves.

In [None]:
# Install exact versions known to work with Phi-3
!pip install transformers==4.40.2 accelerate==0.27.2 bitsandbytes==0.41.1 \
             torch==2.1.2 pandas openpyxl tqdm
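
Because reproducibility depends on the pins taking effect, it can help to confirm what is installed after the install cell runs. The following is a minimal sketch, not part of the original notebook: the helper name check_pins is hypothetical, and it assumes that some torch builds report a local suffix (for example "+cu121") that should be ignored when comparing.

```python
from importlib import metadata

# Pins copied from the install cell above.
PINNED = {
    "transformers": "4.40.2",
    "accelerate": "0.27.2",
    "bitsandbytes": "0.41.1",
    "torch": "2.1.2",
}

def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Assumption: compare only the public version, dropping any "+local" suffix.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches

for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package matches; anything printed is worth resolving before loading Phi-3.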

In [None]:
# Import modules and functions
import gc
import os
import random
import re
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import cohen_kappa_score, jaccard_score
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm import tqdm
# pipeline is needed later to build the GoEmotions text classifier
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

Read the three datasets: the phrases mapped to rule components, the rule sheet, and the main file of situations and prompts.

In [None]:
phrases_df = pd.read_excel('/content/drive/MyDrive/BA 890/datasets for project/phrases.xlsx')
rules_df = pd.read_excel('/content/drive/MyDrive/BA 890/datasets for project/rule_sheet.xlsx')
sitn_prompts_df = pd.read_excel('/content/drive/MyDrive/BA 890/datasets for project/Situation_Prompts.xlsx')

print("Phrases Head:\n", phrases_df.head())
print("\nRules Head:\n", rules_df.head())
print("\nPrompts Head:\n", sitn_prompts_df.head())

In [None]:
# 2. Prepare Phrase Map
# Create a dictionary mapping Component names to a list of their associated phrases and types.
component_phrases_map = {}
for index, row in phrases_df.iterrows():
    component = str(row['Component']).strip()
    phrase = str(row['Phrase']).lower().strip()
    phrase_type = str(row['Include']).strip()

    if component not in component_phrases_map:
        component_phrases_map[component] = []
    component_phrases_map[component].append({'phrase': phrase, 'type': phrase_type})

print("\n--- Component-Phrases Map Created ---")
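
To make the shape of the resulting map concrete, here is a small sketch that applies the same cleaning steps to a few in-memory rows instead of phrases.xlsx. The component names, phrases, and the "Inclusion" value are hypothetical placeholders; only the column names (Component, Phrase, Include) come from the cell above.

```python
from collections import defaultdict

# Hypothetical rows standing in for phrases.xlsx; real names and phrases will differ.
rows = [
    {"Component": "apology_indicator ", "Phrase": " My Apologies", "Include": "Inclusion"},
    {"Component": "apology_indicator", "Phrase": "sorry", "Include": "Inclusion"},
    {"Component": "minimizer_indicator", "Phrase": "No big deal", "Include": "Inclusion"},
]

def build_component_phrases_map(rows):
    """Group cleaned phrases under their cleaned component name."""
    phrase_map = defaultdict(list)
    for row in rows:
        component = str(row["Component"]).strip()      # trailing spaces are removed
        phrase = str(row["Phrase"]).lower().strip()    # phrases are stored lowercase
        phrase_map[component].append(
            {"phrase": phrase, "type": str(row["Include"]).strip()}
        )
    return dict(phrase_map)

component_map = build_component_phrases_map(rows)
for component, entries in component_map.items():
    print(component, [entry["phrase"] for entry in entries])
```

Stripping the component name matters here: without it, "apology_indicator " and "apology_indicator" would become two separate keys, and rules that reference the clean name would miss half the phrases.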


Rule based classifier

In [None]:
# Implement the Rule-Based Classifier Function

def classify_emotion_with_rules(text, rules_df, component_phrases_map):
    """
    Classifies emotions in a given text based on predefined rules and phrase mappings.

    Args:
        text (str): The input text (e.g., from Prompt_Text or Generated_Response).
        rules_df (pd.DataFrame): DataFrame containing the emotion rules.
        component_phrases_map (dict): Dictionary mapping components to their phrases.

    Returns:
        list: A list of detected compound emotion tags (e.g., ['Appeasement', 'Guilt_downplay']),
              or ['Neutral'] if no specific emotion is detected.
    """
    if pd.isna(text) or not str(text).strip():  # Handle NaN or empty text inputs
        return ['Neutral']

    text_lower = str(text).lower()  # Lowercase for case-insensitive matching

    # Dictionary to store detected components in the current text

    detected_components_in_text = {}
    for component_name, phrase_list in component_phrases_map.items():
        is_component_found = False
        for phrase_data in phrase_list:
            phrase = phrase_data['phrase']
            # Use regex for whole word matching to avoid partial matches

            if re.search(r'\b' + re.escape(phrase) + r'\b', text_lower):
                is_component_found = True
                break
        detected_components_in_text[component_name] = is_component_found

    # Dictionaries to store potential inclusions and exclusions
    # {emotion_tag: True/False} indicating if an inclusion/exclusion rule fired for this emotion
    potential_inclusions = {}
    potential_exclusions = {}

    # Iterate through each rule to evaluate
    for index, rule in rules_df.iterrows():
        rule_id = str(rule['Rule_ID']).strip()
        rule_type = str(rule['Rule_Type']).strip()
        emotion_tag = str(rule['Compound_Emotion']).strip()

        # Check if 'Condition_Logic' exists in the rule and get its value, otherwise default to 'AND'
        condition_logic = 'AND' # Default value
        if 'Condition_Logic' in rule and pd.notna(rule['Condition_Logic']):
            condition_logic = str(rule['Condition_Logic']).strip().upper()


        # Collect components for the current rule
        rule_components = []
        for i in range(1, 5): # Check Component_1 to Component_4
            comp_col_name = f'Component_{i}'
            if comp_col_name in rule and pd.notna(rule[comp_col_name]):
                rule_components.append(str(rule[comp_col_name]).strip())

        if not rule_components: # Skip rules with no components defined
            continue

        # Check if the rule's components are present in the text based on detected_components_in_text
        components_match_rule = []
        for comp_name in rule_components:
            components_match_rule.append(detected_components_in_text.get(comp_name, False))

        rule_fires = False
        if condition_logic == 'AND':
            rule_fires = all(components_match_rule)
        elif condition_logic == 'OR':
            rule_fires = any(components_match_rule)

        # Store the rule's outcome based on its type
        if rule_fires:
            if rule_type == 'Inclusion':
                potential_inclusions[emotion_tag] = True
            elif rule_type == 'Exclusion':
                potential_exclusions[emotion_tag] = True

    # --- Step 4: Apply Priority and Resolve Conflicts ---
    final_emotions = []

    # Handle 'All' exclusion first (if it exists and fires)
    # Assuming 'All' exclusion, if present, is a strong override
    if 'All' in potential_exclusions and potential_exclusions['All']:
        return ['Neutral']  # Global override to Neutral

    # Iterate through potential inclusions and check against exclusions
    for emotion, is_included in potential_inclusions.items():
        if is_included: # Only consider emotions that were marked for inclusion
            is_excluded = False
            if emotion in potential_exclusions and potential_exclusions[emotion]:
                # Default behavior: If a specific exclusion rule fires for an emotion, it overrides inclusion
                is_excluded = True

            if not is_excluded:
                final_emotions.append(emotion)

    # If no specific emotions were detected after all rules, default to Neutral
    if not final_emotions:
        return ['Neutral']

    return sorted(list(set(final_emotions))) # Return unique, sorted list for multi-label consistency
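
The two core steps of the classifier, whole-word phrase matching and AND/OR rule firing, can be tried in isolation. This is a simplified sketch rather than the function above: the helper names, component names, and phrases are hypothetical, and it skips the inclusion/exclusion bookkeeping.

```python
import re

def component_present(text, phrases):
    """True if any phrase occurs in text as a whole word or whole phrase."""
    text_lower = text.lower()
    return any(
        re.search(r"\b" + re.escape(phrase) + r"\b", text_lower)
        for phrase in phrases
    )

def rule_fires(flags, logic="AND"):
    """Combine per-component flags the way the AND/OR branch does."""
    if logic == "AND":
        return all(flags)
    if logic == "OR":
        return any(flags)
    return False  # any other logic value never fires

# Hypothetical components and phrases, for illustration only.
phrases = {
    "apology_indicator": ["sorry", "my apologies"],
    "minimizer_indicator": ["just a small", "no big deal"],
}
text = "Sorry about the delay, it was just a small oversight."
flags = [
    component_present(text, phrases[name])
    for name in ("apology_indicator", "minimizer_indicator")
]

print(rule_fires(flags, "AND"))                                     # both components found
print(component_present("This is related to the audit", ["late"]))  # "late" inside "related" does not count
```

One caveat of the \b approach: a phrase that starts or ends with punctuation (for example "sorry!") may never match, because \b needs a word character on one side of the boundary. If the phrase sheet contains such entries, stripping the punctuation before matching is one option.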

In [None]:
# 5. Apply the function to prompts

# Create a new column for the detected emotion labels
sitn_prompts_df['Surrogate_Rater_Rule_Based_Label'] = sitn_prompts_df['Prompt_Text'].apply(
    lambda x: classify_emotion_with_rules(x, rules_df, component_phrases_map)
)

print("\n--- Emotion Classification Complete ---")
print(sitn_prompts_df[['Prompt_ID', 'Prompt_Text', 'Surrogate_Rater_Rule_Based_Label']].head())

# You can save this updated DataFrame for your next steps (e.g., IRR calculation)
sitn_prompts_df.to_csv('prompts_with_rule_based_labels.csv', index=False)
print("\nSaved prompts_with_rule_based_labels.csv")




--- Emotion Classification Complete ---
   Prompt_ID                                        Prompt_Text  \
0          1  Hi Team, the probation review  highlights the ...   
1          2  Hello Team, the exit interviews highlighted co...   
2          3  Panel, the interview score adjustments show ho...   
3          4  Hey, the missed DEI training is a reminder tha...   
4          5  Team, the recent concerns raised during the br...   

  Surrogate_Rater_Rule_Based_Label  
0                        [Neutral]  
1                        [Neutral]  
2                        [Neutral]  
3                        [Neutral]  
4                        [Neutral]  

Saved prompts_with_rule_based_labels.csv


Next, we try a Transformer model.

Setting up the Transformer model to read implicit emotions from situations and prompts according to the rules
This part initializes the pretrained GoEmotions classifier that does the heavy lifting of emotion detection.

In [None]:
#  Initialize the Transformer Model
print("\n--- Initializing monologg/bert-base-cased-goemotions-original model ---")
try:
    emotion_classifier = pipeline(
        "text-classification",
        model="monologg/bert-base-cased-goemotions-original",
        top_k=None # Get scores for all labels
    )
    print("Model initialized successfully.")
except Exception as e:
    print(f"Error initializing model: {e}")
    raise  # Re-raise so later cells do not run without emotion_classifier


--- Initializing monologg/bert-base-cased-goemotions-original model ---


Device set to use cpu


Model initialized successfully.


The Translator's Dictionary: Defining the Emotion Mappings
This section defines the rules for translating the fine-grained GoEmotions labels into the broader Plutchik categories, and then defines the Plutchik categories themselves.

GOEMOTIONS_TO_PLUTCHIK: This is a Python dictionary (like a lookup table).

On the left (the "key") are the GoEmotions labels that the model outputs, plus a few extra labels marked "Added". The model scores 28 labels in total, including neutral, which is handled separately rather than mapped. On the right (the "value") is the Plutchik primary emotion that each label maps to. The design decision here was to group the finer GoEmotions labels into the broader Plutchik categories.

PLUTCHIK_PRIMARY_EMOTIONS: This is a simple list of the 8 core Plutchik emotions. It is used later to initialize scores.

In [None]:
# Define the GoEmotions to Plutchik's emotions Mapping

GOEMOTIONS_TO_PLUTCHIK = {
    'admiration': 'joy', 'amusement': 'joy', 'approval': 'joy', 'caring': 'joy',
    'desire': 'joy', 'excitement': 'joy', 'gratitude': 'joy', 'joy': 'joy',
    'love': 'joy', 'optimism': 'joy', 'pride': 'joy', 'relief': 'joy',
    'serenity': 'trust', # Added: Serenity aligns with peace and acceptance
    'disappointment': 'sadness', 'embarrassment': 'sadness', 'grief': 'sadness',
    'remorse': 'sadness', 'sadness': 'sadness',
    'shame': 'sadness', # Added: Shame often involves feelings of sorrow and self-reproach
    'pensiveness': 'sadness', # Added: Pensive thought can be associated with melancholic reflection
    'boredom': 'sadness', # Added: Boredom can be a low-energy form of discontent or sadness
    'fear': 'fear', 'nervousness': 'fear',
    'anxiety': 'fear', # Added: Anxiety is a form of apprehension or fear
    'apprehension': 'fear', # Added: Apprehension is worry or unease about something negative
    'anger': 'anger', 'annoyance': 'anger',
    'surprise': 'surprise',
    'anticipation': 'anticipation',
    'confusion': 'anticipation', # Confusion can lead to anticipation of clarity/resolution
    'curiosity': 'anticipation', # Curiosity implies a desire to know what's next
    'interest': 'anticipation', # Added: Interest implies engagement and expectation
    'disapproval': 'disgust', 'disgust': 'disgust',
    'apathy': 'disgust', # Added: Apathy can represent a strong disengagement or repulsion from a situation
    'realization': 'surprise'  # Realization is treated as a form of surprise (one entry per key: a repeated key keeps only its last value)
}

# Plutchik's 8 Primary Emotions
PLUTCHIK_PRIMARY_EMOTIONS = [
    'joy', 'sadness', 'trust', 'disgust', 'fear', 'anger', 'surprise', 'anticipation']
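
A quick consistency check on the mapping can show which Plutchik primaries the model is able to reach at all. The sketch below is an illustration, not part of the original pipeline: it restates the mapping grouped by primary, takes the 28 model labels from the raw score listing printed further down in this notebook, and uses hypothetical variable names.

```python
# Mapping from the cell above, grouped by Plutchik primary for brevity.
GROUPS = {
    "joy": ["admiration", "amusement", "approval", "caring", "desire", "excitement",
            "gratitude", "joy", "love", "optimism", "pride", "relief"],
    "trust": ["serenity"],
    "sadness": ["disappointment", "embarrassment", "grief", "remorse", "sadness",
                "shame", "pensiveness", "boredom"],
    "fear": ["fear", "nervousness", "anxiety", "apprehension"],
    "anger": ["anger", "annoyance"],
    "surprise": ["surprise", "realization"],
    "anticipation": ["anticipation", "confusion", "curiosity", "interest"],
    "disgust": ["disapproval", "disgust", "apathy"],
}
mapping = {label: primary for primary, labels in GROUPS.items() for label in labels}

# The 28 labels the classifier scores, as listed in the raw score output.
MODEL_LABELS = {
    "neutral", "approval", "realization", "curiosity", "optimism", "confusion",
    "caring", "annoyance", "surprise", "disappointment", "fear", "desire",
    "excitement", "nervousness", "relief", "disapproval", "disgust", "admiration",
    "pride", "anger", "gratitude", "sadness", "amusement", "embarrassment",
    "grief", "remorse", "love", "joy",
}

never_emitted = sorted(set(mapping) - MODEL_LABELS)   # mapping keys the model cannot produce
unmapped = sorted(MODEL_LABELS - set(mapping))        # model labels with no mapping entry
reachable = {mapping[label] for label in MODEL_LABELS if label in mapping}
unreachable = sorted(set(GROUPS) - reachable)         # primaries that can never get a score

print("Never emitted by the model:", never_emitted)
print("Model labels without a mapping:", unmapped)
print("Unreachable primaries:", unreachable)
```

With this mapping, trust is reached only through serenity, which is not among the model's labels, so trust never receives a score and any compound rule that requires trust cannot fire. Mapping an existing model label to trust would be one way to address that, though which label fits is a design decision for the rule author.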


This is where we define the custom functions that perform the multi-step emotional analysis.

In [None]:
#Define the Classification Functions and Custom Rules

def classify_and_map_to_plutchik(text, classifier, score_threshold=0.05):
    """
    Classifies emotion in a single text using the GoEmotions Transformer model,
    then maps the results to Plutchik's primary emotions, keeping the highest
    GoEmotions score for each primary.
    """
    if pd.isna(text) or not text.strip():
        return {'neutral': 1.0}

    try:
        results = classifier(text)[0]
        plutchik_scores = {p_emo: 0.0 for p_emo in PLUTCHIK_PRIMARY_EMOTIONS}

        for res in results:
            go_label = res['label']
            go_score = res['score']

            if go_score >= score_threshold:
                if go_label in GOEMOTIONS_TO_PLUTCHIK:
                    plutchik_label = GOEMOTIONS_TO_PLUTCHIK[go_label]
                    plutchik_scores[plutchik_label] = max(plutchik_scores[plutchik_label], go_score)

        if all(score == 0.0 for score in plutchik_scores.values()):
            neutral_go_score = next((res['score'] for res in results if res['label'] == 'neutral'), 0)
            if neutral_go_score >= score_threshold:
                return {'neutral': neutral_go_score}
            else:
                return {'neutral': 1.0}

        final_plutchik_scores = {k: v for k, v in plutchik_scores.items() if v > 0.0}
        return final_plutchik_scores

    except Exception as e:
        print(f"Error classifying text '{text[:50]}...': {e}")
        return {'error_classification': 1.0}
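
The aggregation step can be followed by hand with made-up classifier output. This sketch mirrors the thresholding, max-per-primary, and neutral fallback logic of the function above, without loading a model; the scores are invented for illustration and the mapping is only an excerpt.

```python
PRIMARY = ["joy", "sadness", "trust", "disgust", "fear", "anger", "surprise", "anticipation"]
# Excerpt of the GoEmotions-to-Plutchik mapping, enough for this example.
MAPPING = {"approval": "joy", "optimism": "joy", "annoyance": "anger", "surprise": "surprise"}

def map_scores(results, mapping, threshold=0.05):
    """Keep the highest above-threshold score per Plutchik primary."""
    scores = {primary: 0.0 for primary in PRIMARY}
    for item in results:
        target = mapping.get(item["label"])
        if target is not None and item["score"] >= threshold:
            scores[target] = max(scores[target], item["score"])
    kept = {k: v for k, v in scores.items() if v > 0.0}
    if kept:
        return kept
    # Nothing mapped: fall back to the model's own neutral score, or 1.0.
    neutral = next((r["score"] for r in results if r["label"] == "neutral"), 0.0)
    return {"neutral": neutral if neutral >= threshold else 1.0}

# Invented scores in the shape the classifier returns for one text.
fake_results = [
    {"label": "approval", "score": 0.41},
    {"label": "optimism", "score": 0.22},
    {"label": "annoyance", "score": 0.03},   # below the 0.05 threshold, so dropped
    {"label": "neutral", "score": 0.30},
]
print(map_scores(fake_results, MAPPING))
```

Because the scores are combined with max rather than summed, two moderate joy-type labels (0.41 and 0.22) yield a joy score of 0.41, not 0.63. Whether max or sum is the better choice depends on how the downstream thresholds were tuned.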


In [None]:
# Load the rules sheet here
CUSTOM_COMPOUND_RULES = {}
try:
    # Load the rules from the Excel file in Google Drive using the confirmed path
    rules_df = pd.read_excel('/content/drive/MyDrive/BA 890/datasets for project/rule_sheet.xlsx')
    inclusion_rules = rules_df[rules_df['Rule_Type'] == 'Inclusion']

    for index, row in inclusion_rules.iterrows():
        compound_emotion = str(row['Compound_Emotion']).strip()
        components = []
        # Check Component_1 to Component_4 columns - adjust range if more components exist
        for i in range(1, 5):
            comp_col_name = f'Component_{i}'
            if comp_col_name in row and pd.notna(row[comp_col_name]):
                component = str(row[comp_col_name]).strip().lower()
                # Add component if it does not contain '_indicator'
                if '_indicator' not in component:
                    # Convert 'acceptance' to 'trust' for consistency with Plutchik primaries
                    processed_comp = component.replace('acceptance', 'trust')
                    components.append(processed_comp)

        if compound_emotion not in CUSTOM_COMPOUND_RULES:
            CUSTOM_COMPOUND_RULES[compound_emotion] = []
        if components: # Only add if there are valid components for this rule
            CUSTOM_COMPOUND_RULES[compound_emotion].append(tuple(components))

    print("Custom Compound Rules loaded successfully.")
    # print("Loaded Rules:", CUSTOM_COMPOUND_RULES) # Uncomment to inspect the loaded rules

except FileNotFoundError:
    print("Error: 'rule_sheet.xlsx' not found at the specified path. Custom compound rules will be empty.")
except Exception as e:
    print(f"Error loading custom compound rules: {e}")


def infer_compound_emotion(plutchik_scores, compound_rules_dict, primary_threshold=0.3, compound_min_confidence=0.1):
    """
    Infers custom compound emotions based on Plutchik primary emotion scores and predefined rules.
    This version processes rules from the loaded CUSTOM_COMPOUND_RULES dictionary.

    Args:
        plutchik_scores (dict): Dictionary of Plutchik primary emotion scores.
        compound_rules_dict (dict): A dictionary where keys are compound emotion names
                               and values are lists of component tuples (e.g., {'Appeasement': [('fear', 'trust')]}).
        primary_threshold (float): Minimum confidence for a primary emotion to contribute to a compound.
        compound_min_confidence (float): Minimum calculated confidence for a compound emotion to be included.

    Returns:
        dict: A dictionary of detected compound emotions with their calculated confidence scores.
              Returns an empty dict if no compounds are detected or {'error': True} on error.
    """
    detected_compounds = {}

    try:
        for compound_name, list_of_component_tuples in compound_rules_dict.items():

            for components_tuple in list_of_component_tuples:
                can_form_compound = True
                contributing_scores = []


                for component in components_tuple:
                    # Components that are not Plutchik primaries have no score in
                    # plutchik_scores, so they default to 0.0 and the rule cannot fire.
                    score = plutchik_scores.get(component, 0.0)

                    if score >= primary_threshold:
                        contributing_scores.append(score)
                    else:
                        can_form_compound = False
                        break
                if can_form_compound and contributing_scores:
                    if len(contributing_scores) == len(components_tuple):
                        compound_confidence = sum(contributing_scores) / len(contributing_scores)
                        if compound_confidence >= compound_min_confidence:

                            detected_compounds[compound_name] = max(detected_compounds.get(compound_name, 0.0), compound_confidence)

    except Exception as e:
        print(f"Error inferring compound emotion: {e}")
        return {'error': True}

    return detected_compounds
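
A worked example helps show how the two thresholds interact. The sketch below is a condensed restatement of the inference logic, assuming the example rule from the docstring ({'Appeasement': [('fear', 'trust')]}); the scores are invented.

```python
def infer_compounds(scores, rules, primary_threshold=0.3, min_confidence=0.1):
    """Return {compound: confidence} for rules whose components all pass the threshold."""
    detected = {}
    for name, component_tuples in rules.items():
        for components in component_tuples:
            values = [scores.get(component, 0.0) for component in components]
            if values and all(value >= primary_threshold for value in values):
                confidence = sum(values) / len(values)   # mean of the component scores
                if confidence >= min_confidence:
                    detected[name] = max(detected.get(name, 0.0), confidence)
    return detected

rules = {"Appeasement": [("fear", "trust")]}   # example rule from the docstring

both_present = infer_compounds({"fear": 0.5, "trust": 0.4}, rules)
weak_trust = infer_compounds({"fear": 0.5, "trust": 0.2}, rules)

print(both_present)   # Appeasement with a confidence of about 0.45
print(weak_trust)     # empty: trust is below the 0.3 primary threshold
```

With the default values, every contributing score is at least 0.3, so their mean is also at least 0.3 and the compound_min_confidence check of 0.1 never filters anything. It only becomes active if it is set above primary_threshold.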


--- Loading Custom Compound Rules from rule_sheet.xlsx ---
Custom Compound Rules loaded successfully.


In [None]:
#  Apply the Transformer Model and Plutchik Mapping to the prompts

sitn_prompts_df['Plutchik_Emotion_Scores'] = sitn_prompts_df['Prompt_Text'].apply(
    lambda x: classify_and_map_to_plutchik(x, emotion_classifier)
)

sitn_prompts_df['Custom_Compound_Emotions'] = sitn_prompts_df['Plutchik_Emotion_Scores'].apply(
    lambda x: infer_compound_emotion(x, CUSTOM_COMPOUND_RULES)
)

def extract_top_5_emotions(text):
    if pd.isna(text) or not text.strip():
        return {}
    try:
        results = emotion_classifier(text)[0]
        sorted_scores = sorted(results, key=lambda x: x['score'], reverse=True)
        return {item['label']: round(item['score'], 4) for item in sorted_scores[:5]}
    except Exception as e:
        print(f"Error extracting top 5 emotions for text: {text[:50]}... → {e}")
        return {'error': True}

sitn_prompts_df['Top_5_GoEmotions'] = sitn_prompts_df['Prompt_Text'].apply(extract_top_5_emotions)

print("\n--- Emotion Classification and Mapping Complete ---")
print(sitn_prompts_df[['Prompt_ID', 'Prompt_Text', 'Plutchik_Emotion_Scores',
                       'Custom_Compound_Emotions', 'Top_5_GoEmotions']].head())

sitn_prompts_df.to_csv('prompts_with_plutchik_and_custom_emotion_scores3.csv', index=False)
print("\nSaved prompts_with_plutchik_and_custom_emotion_scores3.csv")


--- Classifying prompts, mapping to Plutchik, and inferring compounds ---

--- Emotion Classification and Mapping Complete ---
   Prompt_ID                                        Prompt_Text  \
0          1  Hi Team, the probation review  highlights the ...   
1          2  Hello Team, the exit interviews highlighted co...   
2          3  Panel, the interview score adjustments show ho...   
3          4  Hey, the missed DEI training is a reminder tha...   
4          5  Team, the recent concerns raised during the br...   

                             Plutchik_Emotion_Scores Custom_Compound_Emotions  \
0                     {'neutral': 0.999985933303833}                       {}   
1                    {'neutral': 0.9999814033508301}                       {}   
2                    {'neutral': 0.9966273903846741}                       {}   
3                    {'neutral': 0.9986237287521362}                       {}   
4  {'joy': 0.406469464302063, 'surprise': 0.07701...            

In [None]:
#Save the Updated DataFrame
sitn_prompts_df.to_csv('prompts_with_plutchik_and_custom_emotion_scores3.csv', index=False)
print("\nSaved prompts_with_plutchik_and_custom_emotion_scores3.csv")


Saved prompts_with_plutchik_and_custom_emotion_scores3.csv


In [None]:
# Analyze Raw GoEmotion Scores for Sample Prompts
sample_prompts = [
    "Please check the updated handbook for now, I’m juggling with a recent request and also that compliance things at the moment, but ping me if you can’t find what you need.",
    "If leadership wants these numbers, we’ll report them.Although I am sick of it, can you imagine they actually asked this?",
    "I keep replaying today's meeting, we can't even hire 3 more people, Wand on top of that we’ll have to work with last year’s materials. Email Jessica that we are literally just processing what's happened today."
]

print("\n--- Analyzing Raw GoEmotion Scores for Sample Prompts ---")
for i, text in enumerate(sample_prompts):
    print(f"\nSample Prompt {i+1}: \"{text}\"")
    raw_scores = emotion_classifier(text) # This is the crucial line

    if raw_scores and isinstance(raw_scores[0], list):
        for emotion_list in raw_scores:
            print("Raw GoEmotion Scores:")
            for item in emotion_list:
                label = item['label']
                score = item['score']
                print(f"  - {label}: {score:.4f}")
    else:
        print("Could not retrieve raw scores for this prompt.")


--- Analyzing Raw GoEmotion Scores for Sample Prompts ---

Sample Prompt 1: "Please check the updated handbook for now, I’m juggling with a recent request and also that compliance things at the moment, but ping me if you can’t find what you need."
Raw GoEmotion Scores:
  - neutral: 0.9999
  - approval: 0.0001
  - realization: 0.0000
  - curiosity: 0.0000
  - optimism: 0.0000
  - confusion: 0.0000
  - caring: 0.0000
  - annoyance: 0.0000
  - surprise: 0.0000
  - disappointment: 0.0000
  - fear: 0.0000
  - desire: 0.0000
  - excitement: 0.0000
  - nervousness: 0.0000
  - relief: 0.0000
  - disapproval: 0.0000
  - disgust: 0.0000
  - admiration: 0.0000
  - pride: 0.0000
  - anger: 0.0000
  - gratitude: 0.0000
  - sadness: 0.0000
  - amusement: 0.0000
  - embarrassment: 0.0000
  - grief: 0.0000
  - remorse: 0.0000
  - love: 0.0000
  - joy: 0.0000

Sample Prompt 2: "If leadership wants these numbers, we’ll report them.Although I am sick of it, can you imagine they actually asked this?"
Raw

## Setting up Phase 2


Here we set up: the neutral prompt source and handling, loading and randomizing the dataset, constructing the detailed LLM prompt, and the LLM interaction loop with data collection.

Neutral prompts list

In [None]:
neutral_prompts_list = [
    "What is 2+2?",
    "What is the capital of Canada?",
    "What is the largest planet in our solar system?",
    "What is the chemical symbol for water?",
    "How many continents are there?",
    "What is the freezing point of water in Celsius?",
    "What is the capital of France?",
    "How many days are in a leap year?",
    "What is the chemical symbol for gold?",
    "Which planet is known as the Red Planet?",
    "What is the largest mammal on Earth?"
]

In [None]:
prompts_df = pd.read_csv('/content/drive/MyDrive/BA 890/datasets for project/Situation_Prompts.csv', encoding='latin-1')

In [None]:
# Need to shuffle the order of prompt_id, situation and prompt and add neutral prompts to set the baseline

df = prompts_df

# combine Context and Prompt_Text to form the full emotional prompt for the LLM
df['Full_Emotional_Prompt'] = "Situation: " + df['Context'] + "\nCommunication Prompt: " + df['Prompt_Text']

# convert DataFrame rows to a list of dictionaries
# Include Prompt_ID, Context, Prompt_Text, Implied_Emotion_Designer_Rashi and the full prompt
emotional_data_records = df[['Prompt_ID', 'Context', 'Prompt_Text', 'Implied_Emotion_Designer_Rashi', 'Full_Emotional_Prompt']].to_dict(orient='records')

# Randomize the order of entire dataset records
random.shuffle(emotional_data_records)
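
Since random.shuffle draws from the global generator, each run of the cell above produces a different order. If a repeatable order is wanted, for example to compare models on the same sequence, a seeded generator is one option. This is a sketch under that assumption: the seed value and the helper name are hypothetical, and the records are stand-ins.

```python
import random

def shuffled_copy(records, seed):
    """Return a shuffled copy using a private, seeded generator."""
    rng = random.Random(seed)   # does not touch the global random state
    shuffled = list(records)    # leave the caller's list in its original order
    rng.shuffle(shuffled)
    return shuffled

# Stand-in records; the real ones carry Context, Prompt_Text, and so on.
records = [{"Prompt_ID": i} for i in range(1, 9)]

SEED = 42   # hypothetical value; any fixed integer works
first_run = shuffled_copy(records, SEED)
second_run = shuffled_copy(records, SEED)

print([r["Prompt_ID"] for r in first_run])
print(first_run == second_run)   # same seed, same order
```

Recording the seed alongside the saved results makes it possible to reconstruct the exact prompt order later.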

In [None]:
# Reload and reshuffle with error handling; this version labels the situation as "Context: "
try:
    prompts_df = pd.read_csv('/content/drive/MyDrive/BA 890/datasets for project/Situation_Prompts.csv', encoding='latin-1')
    df = prompts_df

    df['Full_Emotional_Prompt'] = "Context: " + df['Context'] + "\nCommunication Prompt: " + df['Prompt_Text']
    emotional_data_records = df[['Prompt_ID', 'Context', 'Prompt_Text', 'Implied_Emotion_Designer_Rashi', 'Full_Emotional_Prompt']].to_dict(orient='records')
    random.shuffle(emotional_data_records)
except FileNotFoundError:
    print("Error: 'Situation_Prompts.csv' not found at the specified path.")
    raise


In [None]:
# Emotion labels offered to the LLM: Plutchik's eight primary emotions plus additional labels
emotions_list = [
    "Joy",
    "Trust",
    "Fear",
    "Surprise",
    "Sadness",
    "Disgust",
    "Anger",
    "Anticipation",
    "Frustration",
    "Disappointment",
    "Anxiety",
    "Satisfaction",
    "Excitement",
    "Gratitude",
    "Confusion",
    "Neutral",
    "Guilt",
    "Pensiveness",
    "Boredom",
    "Shame",
    "Anxiousness",
    "Apprehension",
    "Professional/Formal"
]

In [None]:
emotions_str_for_prompt = ", ".join(emotions_list)

In [None]:
def create_llm_query(original_context, original_prompt_text, emotions_str_for_prompt):
    """
    Creates the LLM query string to analyze emotional content in a given context and communication prompt.
    It instructs the LLM to select emotions only from a predefined list and return structured output.
    """

    if not original_context:
        original_context = "[No situation provided]"
    if not original_prompt_text:
        original_prompt_text = "[No communication prompt provided]"

    return f"""
You are an expert in emotion recognition and concise response generation. Your analysis must be based **solely** on the provided 'Context' and 'Communication Prompt'. Do NOT infer emotions from external knowledge or make assumptions not directly supported by the text.

For 'Selected Emotions', you **MUST** choose *only* from the following list: {emotions_str_for_prompt}. Do **not** include this list in your response for the 'Selected Emotions' field.

Your task is to analyze the provided situation and communication prompt, then extract the implied emotions. You should provide both a free-form interpretation and select from the provided list.

--- Context ---
{original_context}

--- Communication Prompt ---
{original_prompt_text}

Provide your analysis in the following exact format. Your response **MUST start immediately** with 'Implied Emotion(s) (Free-Form):' and contain **NO other text, headers, or conversational elements before this first line**:

Implied Emotion(s) (Free-Form): [Insert at least one emotion label here]
Reasoning for Free-Form Implied Emotions (up to 50 words):

Selected Emotions: [Choose only from the provided list]
Reasoning for Selected Emotions (up to 50 words):
""".strip()


LLM Interaction Loop and Data Collection: This part sets up the actual model calls. Phi-3 is loaded and queried locally through Hugging Face Transformers.

Initial setup for loading the Phi-3 model

In [None]:
# Clear GPU memory before loading a model
import torch
import gc

def clear_gpu_memory():
    if torch.cuda.is_available():
        gc.collect()  # Release unreferenced Python objects first so their GPU tensors can be freed
        torch.cuda.empty_cache()  # Then return cached GPU blocks to the driver
        print("GPU memory cleared.")

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

# Load Phi-3 Mini model
print("Loading Phi-3 Mini model with eager attention...")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation='eager'
)
print("Phi-3 model loaded successfully with eager attention!")

CUDA available: True
GPU device: Tesla T4
Loading Phi-3 Mini model with eager attention...




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Phi-3 model loaded successfully with eager attention!


Calling LLM - Phi-3 model API

In [None]:
def call_phi3_api(prompt_text, temperature=0.1, max_tokens=350):
    """
    Call Phi-3 Mini Instruct 4K using Hugging Face Transformers
    Optimized for GPU usage.
    """
    # Apply chat template
    chat = [{"role": "user", "content": prompt_text}]
    prompt = tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

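    # Pass temperature only when sampling; greedy decoding (temperature == 0) does not use it
    if temperature > 0:
        gen_kwargs = {"do_sample": True, "temperature": temperature}
    else:
        gen_kwargs = {"do_sample": False}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            return_dict_in_generate=True,
            **gen_kwargs
        )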

    # Get only the newly generated tokens
    generated_tokens = outputs.sequences[0][inputs['input_ids'].shape[1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    return response.strip()

In [None]:


def parse_llm_response(resp: str):
    """Return a dict with four clean fields—even if labels sit on the same line."""
    result = {
        "Model_FreeForm_Implied_Emotion": "",
        "Model_Reasoning_FreeForm_Emotion": "",
        "Model_Selected_Emotions": [],
        "Model_Reasoning_Selected_Emotions": ""
    }

    lines = [l.strip() for l in resp.strip().splitlines() if l.strip()]
    n = len(lines)

    def grab_emotions(text):
        # look for capitalized words separated by commas or ‘and’
        return re.findall(r"\b[A-Z][a-zA-Z/]+\b", text)

    # main loop
    i = 0
    while i < n:
        line = lines[i]

        # 1) free‑form label(s)
        if line.startswith("Implied Emotion(s) (Free-Form):"):
            payload = line.split(":", 1)[1].strip()
            if payload:
                result["Model_FreeForm_Implied_Emotion"] = payload
            elif i+1 < n and not lines[i+1].startswith("Reasoning"):
                result["Model_FreeForm_Implied_Emotion"] = lines[i+1]
                i += 1

        # 2) free-form reasoning
        elif line.startswith("Reasoning for Free-Form"):
            # partition() yields "" rather than raising when no colon is present
            result["Model_Reasoning_FreeForm_Emotion"] = line.partition(":")[2].strip()

        # 3) selected labels
        elif line.startswith("Selected Emotions"):
            payload = line.partition(":")[2].strip()
            if payload:
                result["Model_Selected_Emotions"] = [e.strip() for e in payload.split(",") if e.strip()]
            elif i+1 < n and not lines[i+1].startswith("Reasoning"):
                result["Model_Selected_Emotions"] = [e.strip() for e in lines[i+1].split(",") if e.strip()]
                i += 1

        # 4) reasoning for selected labels
        elif line.startswith("Reasoning for Selected Emotions"):
            result["Model_Reasoning_Selected_Emotions"] = line.partition(":")[2].strip()

        i += 1

    return result
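
The prompt asks the model to pick 'Selected Emotions' only from a fixed list, but the parser keeps whatever text follows the label. A minimal sketch of a post-parse check is given here. The helper name `split_allowed` and the shortened `ALLOWED` set are illustrative assumptions and are not part of the original pipeline.

```python
# Hypothetical helper: separates parsed labels into those on the allowed list
# and those that are not. ALLOWED is a shortened, illustrative subset.
ALLOWED = {"joy", "trust", "fear", "anxiety", "neutral", "professional/formal"}

def split_allowed(selected, allowed=ALLOWED):
    """Return (valid, invalid) lists, comparing case-insensitively."""
    valid, invalid = [], []
    for label in selected:
        cleaned = label.strip()
        if not cleaned:
            continue  # skip empty entries left by trailing commas
        (valid if cleaned.lower() in allowed else invalid).append(cleaned)
    return valid, invalid

valid, invalid = split_allowed(["Professional/Formal", "Anxiety", "Overwhelmed"])
print(valid)    # ['Professional/Formal', 'Anxiety']
print(invalid)  # ['Overwhelmed']
```

Counting the `invalid` entries per model would show how often each model steps outside the list before any agreement metric is computed.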


Processing Loop for Phi-3

In [None]:
# --- Processing Loop ---
NUM_PROMPTS_TO_PROCESS = 250
batch_size = 5

# Temperatures
MAIN_QUERY_TEMPERATURE = 0.1
NEUTRAL_QUERY_TEMPERATURE = 0

phi3_results = []

# print(f"\n{'='*50}")
# print(f"STARTING PHASE 1: PROCESSING FIRST
# print(f"{'='*50}")

for i in range(0, min(NUM_PROMPTS_TO_PROCESS, len(emotional_data_records)), batch_size):
    batch = emotional_data_records[i:i+batch_size]
    print(f"\n{'='*20} Phi-3 Batch {i//batch_size + 1} {'='*20}")

    for record in batch:
        prompt_id = record.get('Prompt_ID', f"ID_{i}")
        full_situation_for_llm = record.get('Full_Emotional_Prompt', '')
        original_context = record.get('Context', '')
        original_prompt_text = record.get('Prompt_Text', '')
        implied_emotion_designer = record.get('Implied_Emotion_Designer_Rashi', '')

        print(f"  Processing Prompt ID: {prompt_id} (Phi-3)")

        # Neutral Prompt Reset for Phi-3 before each emotional query
        selected_neutral_prompt_phi3 = random.choice(neutral_prompts_list)
        print(f"    Sending Neutral Prompt: '{selected_neutral_prompt_phi3}'")
        _ = call_phi3_api(
            selected_neutral_prompt_phi3,
            max_tokens=50,
            temperature=NEUTRAL_QUERY_TEMPERATURE
        )
        print(f"    Phi-3 neutral reset sent for {prompt_id}.")

        # Combined Emotional Prompt for Phi-3
        llm_query_phi3 = create_llm_query(original_context, original_prompt_text, emotions_str_for_prompt)
        print(f"\n    Full Emotional Query for {prompt_id}:\n---START QUERY---\n{llm_query_phi3}\n---END QUERY---\n")

        llm_raw_response_phi3 = call_phi3_api(
            llm_query_phi3,
            max_tokens=350, # Max tokens for the combined response
            temperature=MAIN_QUERY_TEMPERATURE
        )

        print(f"\n    Raw LLM Response from call_phi3_api for {prompt_id}:\n---START RAW RESPONSE---\n{llm_raw_response_phi3}\n---END RAW RESPONSE---\n")

        # Parse Phi-3 Response using the combined parser
        parsed_data_phi3 = parse_llm_response(llm_raw_response_phi3)

        phi3_results.append({
            'Prompt_ID': prompt_id,
            'Context': original_context,
            'Prompt_Text': original_prompt_text,
            'Implied_Emotion_Designer': implied_emotion_designer,
            'FreeForm_Implied_Emotion': parsed_data_phi3.get('Model_FreeForm_Implied_Emotion', ''),
            'Reasoning_FreeForm_Emotion': parsed_data_phi3.get('Model_Reasoning_FreeForm_Emotion', ''),
            'Selected_Emotion': ", ".join(parsed_data_phi3.get('Model_Selected_Emotions', [])),
            'Reasoning_Selected_Emotion': parsed_data_phi3.get('Model_Reasoning_Selected_Emotions', '')
        })

        print(f"  ✓ Phi-3 data collected for {prompt_id}. Parsed Data: {parsed_data_phi3}")
        print(f"  Appended to results: {phi3_results[-1]}")


print(f"\n{'='*50}")
print("FINISHED PHASE 1: PHI-3 PROCESSING COMPLETE")
print(f"{'='*50}")


  Processing Prompt ID: P_196 (Phi-3)
    Sending Neutral Prompt: 'What is the largest mammal on Earth?'




Streaming output truncated to the last 5000 lines.
You are an expert in emotion recognition and concise response generation. Your analysis must be based **solely** on the provided 'Context' and 'Communication Prompt'. Do NOT infer emotions from external knowledge or make assumptions not directly supported by the text.

For 'Selected Emotions', you **MUST** choose *only* from the following list: Joy, Trust, Fear, Surprise, Sadness, Disgust, Anger, Anticipation, Frustration, Disappointment, Anxiety, Satisfaction, Excitement, Gratitude, Confusion, Neutral, Guilt, Pensiveness, Boredom, Shame, Anxiousness, Apprehension, Professional/Formal. Do **not** include this list in your response for the 'Selected Emotions' field.

Your task is to analyze the provided situation and communication prompt, then extract the implied emotions. You should provide both a free-form interpretation and select from the provided list.

--- Context ---
A company veteran HR business partner in Seoul ac

In [None]:
# Save Results to CSV
df_phi3_long_format_1 = pd.DataFrame(phi3_results)
df_phi3_long_format_1['Model'] = 'Phi-3'
print("\nPhi-3 results converted to long format DataFrame.")

output_filename_long_format_1 = 'llm_emotional_analysis_results_phi3_combined_debug.csv'
df_phi3_long_format_1.to_csv(output_filename_long_format_1, index=False)

print(f"\nFinal Phi-3 results in LONG FORMAT (first {NUM_PROMPTS_TO_PROCESS} prompt) saved to {output_filename_long_format_1}")
print("\nExample of final sorted Phi-3 data (first 1 row):")
print(df_phi3_long_format_1.head(1).to_string())


Phi-3 results converted to long format DataFrame.

Final Phi-3 results in LONG FORMAT (first 250 prompt) saved to llm_emotional_analysis_results_phi3_combined_debug.csv

Example of final sorted Phi-3 data (first 1 row):
  Prompt_ID                                                                                                              Context                                                                                                    Prompt_Text         Implied_Emotion_Designer    FreeForm_Implied_Emotion                                                                                              Reasoning_FreeForm_Emotion            Selected_Emotion                                                                                                                                  Reasoning_Selected_Emotion  Model
0     P_196  Early-career female marketer in Asia, overwhelmed by campaign performance pressure, replies to optimization request  Analyzing the data now, once I am don

Formatting the results in readable form

In [None]:
# reordering
# Save Results to CSV
df_phi3_long_format_1= pd.DataFrame(phi3_results)
df_phi3_long_format_1['Model'] = 'Phi-3'
print("\nPhi-3 results converted to long format DataFrame.")

# Start of Reordering for prompts+ ID.

# Handle potential NaN values by converting to string first and dropping NaNs
df_phi3_long_format_1['Prompt_ID'] = df_phi3_long_format_1['Prompt_ID'].astype(str, errors='ignore')
df_phi3_long_format_1['Prompt_Order'] = df_phi3_long_format_1['Prompt_ID'].str.replace('P_', '', regex=False)

# Convert to numeric, coercing errors to NaN, then drop rows with NaN in this column
df_phi3_long_format_1['Prompt_Order'] = pd.to_numeric(df_phi3_long_format_1['Prompt_Order'], errors='coerce')
df_phi3_long_format_1.dropna(subset=['Prompt_Order'], inplace=True)

# Convert to integer after dropping NaNs
df_phi3_long_format_1['Prompt_Order'] = df_phi3_long_format_1['Prompt_Order'].astype(int)


# 2. Sort the DataFrame by this new numerical column
df_phi3_long = df_phi3_long_format_1.sort_values(by='Prompt_Order', ascending=True)

# 3. Drop the temporary 'Prompt_Order' column for final CSV
df_phi3_long_1= df_phi3_long.drop(columns=['Prompt_Order'])
# End of Reordering

output_filename = 'llm_emotional_analysis_results_phi3_reordered.csv'
df_phi3_long_1.to_csv(output_filename, index=False)

print(f"\nFinal Phi-3 results in LONG FORMAT (first {NUM_PROMPTS_TO_PROCESS} prompts) saved to {output_filename}")
print("\nExample of final sorted Phi-3 data (first 1 row):")
print(df_phi3_long_1.head(1).to_string())


Phi-3 results converted to long format DataFrame.

Final Phi-3 results in LONG FORMAT (first 250 prompt) saved to llm_emotional_analysis_results_phi3_combined_debug.csv

Example of final sorted Phi-3 data (first 1 row):
  Prompt_ID                                                                                                              Context                                                                                                    Prompt_Text         Implied_Emotion_Designer    FreeForm_Implied_Emotion                                                                                              Reasoning_FreeForm_Emotion            Selected_Emotion                                                                                                                                  Reasoning_Selected_Emotion  Model  Prompt_Order
0     P_196  Early-career female marketer in Asia, overwhelmed by campaign performance pressure, replies to optimization request  Analyzing the data now,

It took 2 hours 30 minutes to process 500 prompts.

We repeat the process with GPT-4o.

## OpenAI API: Response Generation

Secure Your OpenAI API Key

In [None]:

import os
os.environ["OPENAI_API_KEY"] = "codenotmentionedhere"
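
Writing the key as a string literal leaves it in the saved notebook. One hedged alternative is to read it from a hidden prompt at run time. The helper name `ensure_api_key` and its `prompt_fn` parameter are assumptions made for this sketch. Only `os.environ` and the standard-library `getpass` are used.

```python
import os
from getpass import getpass

def ensure_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Set env_var from a hidden prompt only if it is not already set."""
    if not os.environ.get(env_var):
        os.environ[env_var] = prompt_fn(f"Enter {env_var}: ").strip()
    return bool(os.environ.get(env_var))

# In the notebook, call this once before creating the client:
# ensure_api_key()
```

Because the value is typed at run time, it does not appear in the cell source or in the recorded output.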

In [None]:
# Install OpenAI Python library
!pip install openai
# Import and initialize
from openai import OpenAI
# Initialize OpenAI client (automatically uses OPENAI_API_KEY environment variable)
client = OpenAI()
print("OpenAI client initialized successfully!")

OpenAI client initialized successfully!


In [None]:
import time

In [None]:

# Helper that sends a single prompt to the OpenAI chat completions API

def call_llm_api(prompt_text, model_name="gpt-4o", temperature=0.1, max_tokens=500):
    """
    Call OpenAI API with specified parameters
    Supports temperature and max_tokens as requested
    """
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "user", "content": prompt_text}
            ],
            temperature=temperature,
            max_tokens=max_tokens
        )
        return response.choices[0].message.content.strip()

    except Exception as e:
        print(f"Error calling OpenAI API: {e}")
        return ""
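
`call_llm_api` catches every exception and returns an empty string, so a transient failure silently becomes an empty row in the results. A minimal retry-with-backoff sketch follows. `call_with_retries` and `flaky` are hypothetical names, and the wrapped function is assumed to raise on failure rather than return `""`.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Demo with a stand-in function that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

print(call_with_retries(flaky, sleep=lambda seconds: None))  # ok
```

The `sleep` parameter exists only so the demo and tests can skip real waiting.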


Processing loop

In [None]:
# --- Processing Loop ---
NUM_PROMPTS_TO_PROCESS = 250
batch_size = 5

# Temperatures
MAIN_QUERY_TEMPERATURE = 0.1
NEUTRAL_QUERY_TEMPERATURE = 0.0

# Define the OpenAI model name
OPENAI_MODEL_NAME = "gpt-4o"

llm_results = []


for i in range(0, min(NUM_PROMPTS_TO_PROCESS, len(emotional_data_records)), batch_size):
    batch = emotional_data_records[i:i+batch_size]

    for record in batch:
        prompt_id = record.get('Prompt_ID', f"ID_{i}")
        full_situation_for_llm = record.get('Full_Emotional_Prompt', '')
        original_context = record.get('Context', '')
        original_prompt_text = record.get('Prompt_Text', '')
        implied_emotion_designer = record.get('Implied_Emotion_Designer_Rashi', '')

        # Neutral Prompt
        selected_neutral_prompt = random.choice(neutral_prompts_list)
        # print(f"Sending Neutral Prompt: '{selected_neutral_prompt}'")
        _ = call_llm_api(
            selected_neutral_prompt,
            model_name=OPENAI_MODEL_NAME,
            max_tokens=50,
            temperature=NEUTRAL_QUERY_TEMPERATURE
        )

        # Main Query
        llm_raw_response = call_llm_api(
            create_llm_query(original_context, original_prompt_text, emotions_str_for_prompt),
            model_name=OPENAI_MODEL_NAME,
            max_tokens=350,
            temperature=MAIN_QUERY_TEMPERATURE
        )
        # Parse LLM Response
        parsed_data = parse_llm_response(llm_raw_response)

        llm_results.append({
            'Prompt_ID': prompt_id,
            'Context': original_context,
            'Prompt_Text': original_prompt_text,
            'Implied_Emotion_Designer': implied_emotion_designer,
            'FreeForm_Implied_Emotion': parsed_data.get('Model_FreeForm_Implied_Emotion', ''),
            'Reasoning_FreeForm_Emotion': parsed_data.get('Model_Reasoning_FreeForm_Emotion', ''),
            'Selected_Emotion': ", ".join(parsed_data.get('Model_Selected_Emotions', [])),
            'Reasoning_Selected_Emotion': parsed_data.get('Model_Reasoning_Selected_Emotions', '')
        })

        print(f"  OpenAI data collected for {prompt_id}. Parsed Data: {parsed_data}")
        print(f"  Appended to results: {llm_results[-1]}")

print("FINISHED PHASE 1: OpenAI PROCESSING COMPLETE")
# print(f"{'='*50}")

  ✓ OpenAI data collected for P_196. Parsed Data: {'Model_FreeForm_Implied_Emotion': 'Professional/Formal, Anxiety', 'Model_Reasoning_FreeForm_Emotion': 'The response is structured and focused on task completion, indicating professionalism. The context of being overwhelmed by performance pressure suggests underlying anxiety about meeting expectations.', 'Model_Selected_Emotions': ['Professional/Formal', 'Anxiety'], 'Model_Reasoning_Selected_Emotions': 'The communication is formal and task-oriented, reflecting professionalism. The context of pressure and the need to analyze data imply anxiety about achieving desired campaign outcomes.'}
  Appended to results: {'Prompt_ID': 'P_196', 'Context': 'Early-career female marketer in Asia, overwhelmed by campaign performance pressure, replies to optimization request', 'Prompt_Text': "Analyzing the data now, once I am done with this, will propose adjustments once I've identified key patterns.", 'Implied_Emotion_Designer': 'Unconscious_Emotional D

In [None]:
# Save Results to CSV
df_openai_long_format = pd.DataFrame(llm_results)
df_openai_long_format['Model'] = 'GPT-4o'
print("\nOpenaiGPT-4o results converted to long format DataFrame.")

output_gpt4olong_format = 'llm_emotional_analysis_results_openaigpt4o.csv'
df_openai_long_format.to_csv(output_gpt4olong_format, index=False)

print(f"\nOpenaiGPT-4o results in LONG FORMAT (first {NUM_PROMPTS_TO_PROCESS} prompt) saved to {output_gpt4olong_format}")
print("\nExample of final sorted Openai data (first 1 row):")
print(df_openai_long_format.head(1).to_string())


OpenaiGPT-4o results converted to long format DataFrame.

OpenaiGPT-4o results in LONG FORMAT (first 250 prompt) saved to llm_emotional_analysis_results_openaigpt4o.csv

Example of final sorted Openai data (first 1 row):
  Prompt_ID                                                                                                              Context                                                                                                    Prompt_Text         Implied_Emotion_Designer      FreeForm_Implied_Emotion                                                                                                                                                                               Reasoning_FreeForm_Emotion              Selected_Emotion                                                                                                                                                                Reasoning_Selected_Emotion   Model
0     P_196  Early-career female marketer in Asia,

In [None]:
# re ordering the prompt id and getting responses

out_path = "llm_emotional_analysis_openai_reordered.csv"

df_gpt4o = (
    pd.DataFrame(llm_results)
      .assign(Model="GPT-4o",
              Prompt_Order=lambda d: pd.to_numeric(
                  d["Prompt_ID"].astype(str).str.removeprefix("P_"),
                  errors="coerce"))
      .dropna(subset=["Prompt_Order"])
      .sort_values("Prompt_Order")
      .drop(columns="Prompt_Order")
)

if df_gpt4o.empty:
    raise ValueError("No valid P_### IDs found in Prompt_IDs.")

df_gpt4o.to_csv(out_path, index=False)
print(f"✅ Saved to {out_path}\n", df_gpt4o.head(1).to_string())

✅ Saved to llm_emotional_analysis_openai_reordered.csv
    Prompt_ID                                                                                                                                                                                                            Context                                                                                                                                                                              Prompt_Text Implied_Emotion_Designer          FreeForm_Implied_Emotion                                                                                                                                                                                                                                Reasoning_FreeForm_Emotion                  Selected_Emotion                                                                                                                                                                                   Reasoning_Sele

After both processing phases are complete, convert collected results into DataFrames and merge them based on Prompt_ID. Finally, sort the combined DataFrame for a clean output.
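
A minimal sketch of that merge-and-sort step is shown here on toy data. The two small frames and their values are illustrative only. The suffixes are an assumption chosen to match the column names that appear in the inter-rater file.

```python
import pandas as pd

# Toy stand-ins for the two result tables (values are illustrative only)
phi3 = pd.DataFrame({"Prompt_ID": ["P_10", "P_2"],
                     "Selected_Emotion": ["Anxiety", "Joy"]})
gpt4o = pd.DataFrame({"Prompt_ID": ["P_2", "P_10"],
                      "Selected_Emotion": ["Joy, Trust", "Fear"]})

# Outer merge keeps prompts that only one model answered
merged = phi3.merge(gpt4o, on="Prompt_ID", how="outer",
                    suffixes=("_Phi3", "_GPT4-O"))

# Sort by the numeric part of the ID so P_2 comes before P_10
order = pd.to_numeric(merged["Prompt_ID"].str.replace("P_", "", regex=False),
                      errors="coerce")
merged = (merged.assign(_order=order)
                .sort_values("_order")
                .drop(columns="_order")
                .reset_index(drop=True))
print(merged)
```

Sorting on the raw string would place `P_10` before `P_2`, which is why the numeric part is extracted first.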

Phase 3 - Analysis

Analysis of Results: Inter-rater agreement

In [None]:
interrater_df = pd.read_excel('/content/drive/MyDrive/BA 890/datasets for project/inter_rater_copy.xlsx')

In [None]:
interrater_df.columns.tolist()

['Prompt_ID',
 'Context',
 'Prompt_Text',
 'Implied_Emotion_Designer_Rashi',
 'Hf_goemotions_label_3',
 'Label_1',
 'Label_2',
 'FreeForm_Implied_Emotion_Phi3',
 'Selected_Emotion_Phi3',
 'FreeForm_Implied_Emotion_GPT4-O',
 'Selected_Emotion_GPT4-O']

# Using the Hugging Face transformer as a rater

In [None]:
import ast
def extract_top_two_from_dict(row):
    if pd.isna(row):
        return ""

    try:
        emotion_dict = ast.literal_eval(row)  # Safely parse string to dict
        sorted_emotions = sorted(emotion_dict.items(), key=lambda x: x[1], reverse=True)[:2]
        return ", ".join([emotion for emotion, _ in sorted_emotions])
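    except (ValueError, SyntaxError, TypeError, AttributeError):
        # Unparseable cell or non-dict value: treat as no label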
        return ""

# Apply the function to the raw GoEmotions column
interrater_df['Label_HF_Goemotions_Cleaned'] = interrater_df['Hf_goemotions_label_3'].apply(extract_top_two_from_dict)

# Display the first 5 cleaned rows
interrater_df['Label_HF_Goemotions_Cleaned'].head()

Unnamed: 0,Label_HF_Goemotions_Cleaned
0,"neutral, approval"
1,"neutral, approval"
2,"neutral, approval"
3,"neutral, optimism"
4,"neutral, approval"


In [None]:
def clean_label_column_all(entry):
    if pd.isna(entry):
        return ""

    # Split on commas, semicolons, or 'and', strip whitespace, lowercase
    labels = re.split(r'[;,]| and ', str(entry).lower())
    labels = [label.strip() for label in labels if label.strip()]

    # Remove duplicates while preserving order
    unique_labels = list(dict.fromkeys(labels))

    return ", ".join(unique_labels)

# Apply to both columns
interrater_df['Label_1_Cleaned'] = interrater_df['Label_1'].apply(clean_label_column_all)
interrater_df['Label_2_Cleaned'] = interrater_df['Label_2'].apply(clean_label_column_all)


interrater_df[['Label_1', 'Label_1_Cleaned', 'Label_2', 'Label_2_Cleaned']].head()

Unnamed: 0,Label_1,Label_1_Cleaned,Label_2,Label_2_Cleaned
0,"disgust, fear","disgust, fear",shame and dismissive,"shame, dismissive"
1,"shame, disgust, regret","shame, disgust, regret","embarassed, numb/withdrawn, indifferent","embarassed, numb/withdrawn, indifferent"
2,"shame, regret, disgust","shame, regret, disgust",bitter and distant/dismissive,"bitter, distant/dismissive"
3,"disgust, frustation, and optimism","disgust, frustation, optimism","remorseful, hopeful,","remorseful, hopeful"
4,"disgust, shame,annoyance","disgust, shame, annoyance","annoyed, disrespected, bitter","annoyed, disrespected, bitter"


In [None]:
from sklearn.metrics import jaccard_score
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

# Label pairs to compare. Human labels use the cleaned columns; every column
# is lowercased and split on commas inside the loop below.
label_sets = {
    "Rashi_vs_Nadira": ('Label_1_Cleaned', 'Label_2_Cleaned'),
    "Rashi_vs_GPT4O": ('Label_1_Cleaned', 'Selected_Emotion_GPT4-O'),
    "Rashi_vs_Phi3": ('Label_1_Cleaned', 'Selected_Emotion_Phi3'),
    "Nadira_vs_GPT4O": ('Label_2_Cleaned', 'Selected_Emotion_GPT4-O'),
    "Nadira_vs_Phi3": ('Label_2_Cleaned', 'Selected_Emotion_Phi3'),
    "GPT4O_vs_Phi3": ('Selected_Emotion_GPT4-O', 'Selected_Emotion_Phi3'),
    "HFGoEmotions_vs_GPT4O": ('Label_HF_Goemotions_Cleaned', 'Selected_Emotion_GPT4-O'),
    "HFGoEmotions_vs_Phi3": ('Label_HF_Goemotions_Cleaned', 'Selected_Emotion_Phi3')
}

results = {}
mlb = MultiLabelBinarizer()

for label_name, (col1, col2) in label_sets.items():
    list1 = interrater_df[col1].fillna("").apply(lambda x: [i.strip().lower() for i in str(x).split(",") if i.strip()])
    list2 = interrater_df[col2].fillna("").apply(lambda x: [i.strip().lower() for i in str(x).split(",") if i.strip()])

    all_labels = pd.concat([list1, list2])
    mlb.fit(all_labels)

    bin1 = mlb.transform(list1)
    bin2 = mlb.transform(list2)

    try:
        #  vectorized Jaccard
        avg_jaccard = jaccard_score(bin1, bin2, average='samples')
    except ValueError as e:
        print(f"Error computing Jaccard score for {label_name}: {e}. Setting Jaccard to 0.")
        avg_jaccard = 0

    results[label_name] = avg_jaccard

# Display results as a sorted DataFrame
results_df = pd.DataFrame.from_dict(results, orient='index', columns=["Jaccard Similarity"])
results_df.sort_values(by="Jaccard Similarity", ascending=False, inplace=True)

display(results_df)

Unnamed: 0,Jaccard Similarity
GPT4O_vs_Phi3,0.130333
Rashi_vs_Nadira,0.036871
Rashi_vs_GPT4O,0.0184
Rashi_vs_Phi3,0.013667
HFGoEmotions_vs_GPT4O,0.011667
HFGoEmotions_vs_Phi3,0.005
Nadira_vs_GPT4O,0.003133
Nadira_vs_Phi3,0.003133


In [None]:
# Drop the designer-label column and update the DataFrame in place
interrater_df.drop(columns=["Implied_Emotion_Designer_Rashi"], inplace=True)

In [None]:
def to_list(cell):
    tokens = [t.strip().lower() for t in str(cell)
              .replace("{","").replace("}","").replace("'","").split(",")]
    return [t for t in tokens if t]

# Define the pairs requested
pairs = {
    "Rashi_vs_Nadira (Label_1 vs Label_2)": ("Label_1", "Label_2"),
    "FreeForm GPT‑4o vs FreeForm Phi‑3": ("FreeForm_Implied_Emotion_GPT4-O", "FreeForm_Implied_Emotion_Phi3"),
    "Selected GPT‑4o vs Selected Phi‑3": ("Selected_Emotion_GPT4-O", "Selected_Emotion_Phi3"),
}

results = {}

for name, (col_a, col_b) in pairs.items():
    a_lists = interrater_df[col_a].fillna("").apply(to_list)
    b_lists = interrater_df[col_b].fillna("").apply(to_list)

    mlb = MultiLabelBinarizer()
    mlb.fit(pd.concat([a_lists, b_lists]))

    A_bin = mlb.transform(a_lists)
    B_bin = mlb.transform(b_lists)

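    # average="samples" scores each row separately and averages the row scores;
    # this differs from pooled micro-averaging despite the variable and column name.
    micro_jaccard = jaccard_score(A_bin, B_bin, average="samples")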
    results[name] = micro_jaccard

results_df = pd.DataFrame.from_dict(results, orient="index", columns=["Micro Jaccard"])
print(results_df.sort_values(by="Micro Jaccard", ascending=False))


                                      Micro Jaccard
FreeForm GPT‑4o vs FreeForm Phi‑3          0.184267
Selected GPT‑4o vs Selected Phi‑3          0.130333
Rashi_vs_Nadira (Label_1 vs Label_2)       0.036405
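
The scores above come from `average="samples"`, which averages a per-prompt Jaccard score, whereas a pooled (micro) score divides the total intersection by the total union. A small pure-Python sketch of the difference is shown here. The function names and toy label lists are illustrative assumptions, not part of the notebook.

```python
def jaccard(a, b):
    """Jaccard similarity of two label sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def samples_jaccard(rows_a, rows_b):
    """Mean of the per-row Jaccard scores."""
    scores = [jaccard(a, b) for a, b in zip(rows_a, rows_b)]
    return sum(scores) / len(scores)

def micro_jaccard(rows_a, rows_b):
    """Pooled Jaccard: total intersection size over total union size."""
    inter = sum(len(set(a) & set(b)) for a, b in zip(rows_a, rows_b))
    union = sum(len(set(a) | set(b)) for a, b in zip(rows_a, rows_b))
    return inter / union if union else 0.0

rater_a = [["anxiety"], ["joy", "trust", "gratitude"]]
rater_b = [["anxiety"], ["joy", "fear", "anger"]]
print(samples_jaccard(rater_a, rater_b))          # 0.6
print(round(micro_jaccard(rater_a, rater_b), 3))  # 0.333
```

The sample-averaged score weights every prompt equally, so a prompt with one matching label counts as much as a prompt with many mismatched labels.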


Per-emotion Cohen's Kappa values


In [None]:
def to_list(cell):
    tokens = [t.strip().lower() for t in str(cell)
              .replace("{", "").replace("}", "").replace("'", "").split(",")]
    return [t for t in tokens if t]

# Pairs to compare
pairs = {
    "Rashi_vs_Nadira": ("Label_1", "Label_2"),
    "FreeForm_GPT4o_vs_Phi3": ("FreeForm_Implied_Emotion_GPT4-O", "FreeForm_Implied_Emotion_Phi3"),
    "Selected_GPT4o_vs_Phi3": ("Selected_Emotion_GPT4-O", "Selected_Emotion_Phi3"),
}

saved_files = []

for pair_name, (col_a, col_b) in pairs.items():
    a_lists = interrater_df[col_a].fillna("").apply(to_list)
    b_lists = interrater_df[col_b].fillna("").apply(to_list)

    mlb = MultiLabelBinarizer()
    mlb.fit(pd.concat([a_lists, b_lists]))

    A_bin = mlb.transform(a_lists)
    B_bin = mlb.transform(b_lists)
    label_names = mlb.classes_

    # Per‑emotion κ
    kappa_scores = {
        label: cohen_kappa_score(A_bin[:, idx], B_bin[:, idx])
        for idx, label in enumerate(label_names)
    }

    kappa_df = (pd.Series(kappa_scores, name="Cohen_Kappa")
                  .sort_values(ascending=False)
                  .reset_index()
                  .rename(columns={"index": "Emotion"})
                  .round(3))

    # Clean pair name for filename
    safe_pair = re.sub(r'[^\w\-]+', '_', pair_name)
    # Modified path to save to Google Drive
    out_path = f"/content/drive/MyDrive/BA 890/datasets for project/Dataset/per_emotion_kappa_{safe_pair}.csv"
    kappa_df.to_csv(out_path, index=False)
    saved_files.append(out_path)

print(saved_files)

['/content/drive/MyDrive/BA 890/datasets for project/Dataset/per_emotion_kappa_Rashi_vs_Nadira.csv', '/content/drive/MyDrive/BA 890/datasets for project/Dataset/per_emotion_kappa_FreeForm_GPT4o_vs_Phi3.csv', '/content/drive/MyDrive/BA 890/datasets for project/Dataset/per_emotion_kappa_Selected_GPT4o_vs_Phi3.csv']


Across all comparisons, the Cohen's Kappa scores are generally low, which is consistent with emotion labeling from text being a complex and often subjective task. GPT-4o and Phi-3 show their highest per-emotion agreement when they generate free-form responses. Agreement drops when they must choose from a fixed list, revealing differences in how each model interprets specific emotion labels. Even where the models do not match the human raters, their agreement with each other might represent a form of computational consensus, suggesting that they process the text in somewhat similar ways to reach their emotional conclusions.
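
For reading the per-emotion values, it helps to see how kappa corrects raw agreement for chance on a single, rarely used label. The sketch below computes kappa by hand for one binary column. `cohen_kappa_binary` and the toy vectors are illustrative assumptions.

```python
def cohen_kappa_binary(a, b):
    """Cohen's kappa for two equal-length 0/1 sequences."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0, if they labelled independently
    p_expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if p_expected == 1:
        return float("nan")  # both raters constant: kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)

# One emotion, ten prompts: 1 = label assigned, 0 = not assigned
rater_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rater_b = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
raw = sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)
print(raw)                                             # 0.8
print(round(cohen_kappa_binary(rater_a, rater_b), 3))  # -0.111
```

The two raters agree on 80% of prompts, yet kappa is slightly negative because they never assign the label to the same prompt. Sparse labels can therefore produce low kappa even when raw agreement looks high.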

Per Emotion Interpretation for all comparisons:

F-1 scores


In [None]:
from sklearn.metrics import f1_score, precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd, re, os

def to_list(cell):
    return [t.strip().lower() for t in str(cell)
            .replace("{","").replace("}","").replace("'","").split(",") if t.strip()]

pairs = {
    "Rashi_vs_Nadira": ("Label_1", "Label_2"),
    "FreeForm_GPT4o_vs_Phi3": ("FreeForm_Implied_Emotion_GPT4-O",
                                "FreeForm_Implied_Emotion_Phi3"),
    "Selected_GPT4o_vs_Phi3": ("Selected_Emotion_GPT4-O",
                                "Selected_Emotion_Phi3")
}

for pair_name, (cA, cB) in pairs.items():
    A = interrater_df[cA].fillna("").apply(to_list)
    B = interrater_df[cB].fillna("").apply(to_list)

    mlb = MultiLabelBinarizer()
    mlb.fit(pd.concat([A, B]))

    A_bin = mlb.transform(A)
    B_bin = mlb.transform(B)

    micro_f1 = f1_score(A_bin, B_bin, average="micro", zero_division=0)
    macro_f1 = f1_score(A_bin, B_bin, average="macro", zero_division=0)

    # --- per‑emotion precision / recall / F1 ---
    p, r, f, _ = precision_recall_fscore_support(
        A_bin, B_bin, average=None, zero_division=0)
    per_label_df = (pd.DataFrame({
                        "Emotion": mlb.classes_,
                        "Precision": p.round(3),
                        "Recall": r.round(3),
                        "F1": f.round(3)
                     }).sort_values("F1", ascending=False))

    # save both summaries
    safe_pair = re.sub(r'[^\w\-]+', '_', pair_name)
    # Modified path to save to Google Drive
    summary_path = f"/content/drive/MyDrive/BA 890/datasets for project/Dataset/f1_summary_{safe_pair}.csv"
    per_label_path = f"/content/drive/MyDrive/BA 890/datasets for project/Dataset/f1_per_emotion_{safe_pair}.csv"


    pd.DataFrame({"Metric": ["Micro_F1", "Macro_F1"],
                  "Score": [round(micro_f1, 3), round(macro_f1, 3)]
                 }).to_csv(summary_path, index=False)

    per_label_df.to_csv(per_label_path, index=False)
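
To make the micro and macro summaries easier to compare, the following pure-Python sketch computes both from label lists. The function names and the toy rows are assumptions for illustration. The first rater is treated as the reference, mirroring the argument order used above.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(rows_true, rows_pred):
    labels = sorted({l for row in rows_true + rows_pred for l in row})
    if not labels:
        return 0.0, 0.0
    counts = {l: [0, 0, 0] for l in labels}  # tp, fp, fn per label
    for t, p in zip(rows_true, rows_pred):
        t, p = set(t), set(p)
        for l in labels:
            if l in t and l in p:
                counts[l][0] += 1
            elif l in p:
                counts[l][1] += 1
            elif l in t:
                counts[l][2] += 1
    # Macro: average the per-label F1 scores, each label weighted equally
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts over all labels, then compute one F1
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1_from_counts(*totals)
    return micro, macro

rows_a = [["anxiety"], ["anxiety"], ["anxiety", "joy"], ["shame"]]
rows_b = [["anxiety"], ["anxiety"], ["anxiety"], ["guilt"]]
micro, macro = micro_macro_f1(rows_a, rows_b)
print(round(micro, 3), round(macro, 3))  # 0.667 0.25
```

One frequent label with perfect agreement lifts the micro score, while the three rare labels with no agreement pull the macro score down.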

F-1 Score interpretation:

In [None]:
# Hamming Loss

In [None]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import hamming_loss, accuracy_score
import os

# Helper: convert cell to list of labels
def to_list(cell):
    return [t.strip().lower()
            for t in str(cell).replace("{","").replace("}","").replace("'", "").split(",")
            if t.strip()]

# Pairs to evaluate
pairs = {
    "Rashi_vs_Nadira": ("Label_1", "Label_2"),
    "FreeForm_GPT4o_vs_Phi3": ("FreeForm_Implied_Emotion_GPT4-O", "FreeForm_Implied_Emotion_Phi3"),
    "Selected_GPT4o_vs_Phi3": ("Selected_Emotion_GPT4-O", "Selected_Emotion_Phi3"),
}

summary_rows = []

for pair_name, (colA, colB) in pairs.items():
    A_lists = interrater_df[colA].fillna("").apply(to_list)
    B_lists = interrater_df[colB].fillna("").apply(to_list)

    mlb = MultiLabelBinarizer()
    mlb.fit(pd.concat([A_lists, B_lists]))

    A_bin = mlb.transform(A_lists)
    B_bin = mlb.transform(B_lists)

    # Hamming Loss
    ham_loss = hamming_loss(A_bin, B_bin)

    # Label-based Accuracy: compute accuracy per label then average
    accuracies = []
    for i in range(A_bin.shape[1]):
        accuracies.append(accuracy_score(A_bin[:, i], B_bin[:, i]))
    label_based_accuracy = sum(accuracies) / len(accuracies) if accuracies else float('nan')

    summary_rows.append({
        "Pair": pair_name,
        "Hamming_Loss": round(ham_loss, 3),
        "Label_Based_Accuracy": round(label_based_accuracy, 3)
    })

summary_df = pd.DataFrame(summary_rows)
# Modified path to save to Google Drive
out_path = "/content/drive/MyDrive/BA 890/datasets for project/Dataset/hamming_label_accuracy_summary.csv"
summary_df.to_csv(out_path, index=False)

out_path

'/content/drive/MyDrive/BA 890/datasets for project/Dataset/hamming_label_accuracy_summary.csv'

Interpretation of the Hamming Loss:
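
As a reading aid, Hamming loss is the share of (prompt, label) cells on which two raters differ, and the label-based accuracy computed above is its complement because every label column has the same number of rows. The sketch below is a pure-Python illustration. `hamming_loss_sets` and the toy rows are assumptions.

```python
def hamming_loss_sets(rows_a, rows_b):
    """Fraction of (prompt, label) cells on which the two raters differ."""
    labels = {l for row in rows_a + rows_b for l in row}
    if not labels:
        return 0.0
    # Symmetric difference = labels assigned by exactly one rater
    diffs = sum(len(set(a) ^ set(b)) for a, b in zip(rows_a, rows_b))
    return diffs / (len(rows_a) * len(labels))

rows_a = [["shame"], ["regret"], ["fear"], ["bitter"]]
rows_b = [["guilt"], ["numb"], ["anger"], ["hope"]]
print(hamming_loss_sets(rows_a, rows_b))  # 0.25
```

Here the raters never agree on a single label, yet the loss is only 0.25 because most cells are "absent" for both. With a large, sparse label vocabulary, a low Hamming loss should not be read as strong agreement on its own.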

Free-form vs List-based Alignment: Analyze when the model's free-form identification differs from its constrained selection.

In [None]:
import json

# Convert a cell into a list of lowercase emotion strings
def to_list(cell):
    return [t.strip().lower()
            for t in str(cell).replace("{","").replace("}","").replace("'", "").split(",")
            if t.strip()]

models = {
    "GPT4o": ("FreeForm_Implied_Emotion_GPT4-O", "Selected_Emotion_GPT4-O"),
    "Phi3":  ("FreeForm_Implied_Emotion_Phi3",  "Selected_Emotion_Phi3")
}

summary_rows = []

for model_name, (free_col, sel_col) in models.items():
    free_lists = interrater_df[free_col].fillna("").apply(to_list)
    sel_lists  = interrater_df[sel_col].fillna("").apply(to_list)

    # union of both columns for consistent binarization
    mlb = MultiLabelBinarizer()
    mlb.fit(pd.concat([free_lists, sel_lists]))

    free_bin = mlb.transform(free_lists)
    sel_bin  = mlb.transform(sel_lists)
    labels = mlb.classes_

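    # Sample-averaged Jaccard between free-form and selected labels
    # (average="samples" scores each prompt and averages; it is not a pooled micro-average)
    # Handle potential ValueError if there are no labels to compare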
    try:
        micro_j = jaccard_score(free_bin, sel_bin, average="samples")
    except ValueError:
        micro_j = 0.0 # Set Jaccard to 0 if no labels

    # Per-prompt mismatch flag
    mismatches_idx = [i for i in range(len(free_lists)) if set(free_lists[i]) != set(sel_lists[i])]
    mismatch_df = interrater_df.loc[mismatches_idx, ["Prompt_ID", free_col, sel_col]].copy()
    # Modified path to save to Google Drive
    mismatch_path = f"/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_mismatches_{model_name}.csv"
    mismatch_df.to_csv(mismatch_path, index=False)

    # Per-emotion gains/losses
    gains = {lab:0 for lab in labels}
    losses = {lab:0 for lab in labels}
    for fr, sl in zip(free_lists, sel_lists):
        s_fr, s_sl = set(fr), set(sl)
        for lab in s_fr - s_sl:
            losses[lab] += 1  # lost in selection
        for lab in s_sl - s_fr:
            gains[lab] += 1   # appeared only in selection

    diff_df = pd.DataFrame({
        "Emotion": labels,
        "Lost_when_selecting": [losses.get(l, 0) for l in labels], # Use .get for safety
        "Gained_only_in_selection": [gains.get(l, 0) for l in labels] # Use .get for safety
    }).sort_values("Lost_when_selecting", ascending=False)
    # Modified path to save to Google Drive
    diff_path = f"/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_emotion_diff_{model_name}.csv"
    diff_df.to_csv(diff_path, index=False)

    summary_rows.append({
        "Model": model_name,
        "Micro_Jaccard_Free_vs_Selected": round(micro_j, 3),
        "Total_Prompts": len(free_lists),
        "Prompts_with_Mismatch": len(mismatches_idx)
    })

# Summary CSV
summary_df = pd.DataFrame(summary_rows)
# Modified path to save to Google Drive
summary_path = "/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_alignment_summary.csv"
summary_df.to_csv(summary_path, index=False)

import json  # used below to pretty-print the output paths

print(json.dumps({
    "summary": summary_path,
    "mismatch_files": [f"/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_mismatches_{m}.csv" for m in models],
    "emotion_diff_files": [f"/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_emotion_diff_{m}.csv" for m in models]
}, indent=2))

{
  "summary": "/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_alignment_summary.csv",
  "mismatch_files": [
    "/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_mismatches_GPT4o.csv",
    "/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_mismatches_Phi3.csv"
  ],
  "emotion_diff_files": [
    "/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_emotion_diff_GPT4o.csv",
    "/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_emotion_diff_Phi3.csv"
  ]
}
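
The alignment score written to the summary file depends on how the Jaccard score is averaged. The sketch below contrasts the per-prompt ("samples") average with the pooled ("micro") average in plain Python. The helper names and the toy label sets are hypothetical, and returning 0.0 for an empty union is an assumption made here for simplicity.

```python
def jaccard_samples(a_sets, b_sets):
    """Mean of per-prompt Jaccard scores (the "samples" style of averaging)."""
    scores = []
    for a, b in zip(a_sets, b_sets):
        union = a | b
        # Assumption: a prompt with no labels on either side scores 0.0
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

def jaccard_micro(a_sets, b_sets):
    """Pooled Jaccard: total intersection size over total union size."""
    inter = sum(len(a & b) for a, b in zip(a_sets, b_sets))
    union = sum(len(a | b) for a, b in zip(a_sets, b_sets))
    return inter / union if union else 0.0

# Hypothetical toy data: free-form vs selected labels for two prompts
free_sets = [{"trust"}, {"frustration", "concern", "disappointment"}]
sel_sets = [{"trust"}, {"concern"}]

samples_j = jaccard_samples(free_sets, sel_sets)
micro_j_demo = jaccard_micro(free_sets, sel_sets)
print(f"samples-averaged Jaccard: {samples_j:.3f}")
print(f"micro (pooled) Jaccard:   {micro_j_demo:.3f}")
```

The per-prompt average weights every prompt equally, while the pooled version gives more weight to prompts with many labels, so the two can differ noticeably when label-set sizes vary.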


The Micro F1 score of 0.260 indicates a modest level of overall alignment between GPT-4o and Phi-3 when they are allowed to generate free-form emotional labels. However, the substantially lower Macro F1 score (0.061) reveals that this agreement is not evenly distributed: the models align primarily on more frequent or generic emotions, while their recognition of less common or subtler emotional states diverges significantly. This gap between Micro and Macro F1 highlights the sensitivity of free-form emotion annotation to vocabulary variation and rare label mismatches, even between otherwise capable LLMs.
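
To illustrate why vocabulary variation pulls Macro F1 down much faster than Micro F1, here is a minimal pure-Python sketch on hypothetical toy labels (not project data). The function name `micro_macro_f1` is an assumption for illustration, and labels with no true or predicted instances are scored 0.0 here.

```python
def micro_macro_f1(a_sets, b_sets):
    """Return (micro F1, macro F1) treating a_sets as reference and b_sets as prediction."""
    labels = sorted(set().union(*a_sets, *b_sets))
    tp = {lab: 0 for lab in labels}
    fp = {lab: 0 for lab in labels}
    fn = {lab: 0 for lab in labels}
    for a, b in zip(a_sets, b_sets):
        for lab in a & b:
            tp[lab] += 1   # both models used the label
        for lab in b - a:
            fp[lab] += 1   # only the second model used it
        for lab in a - b:
            fn[lab] += 1   # only the first model used it

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute a single F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    return micro, macro

# Hypothetical toy data: agreement on a frequent label, a synonym mismatch on a rare one
model_a = [{"concern"}, {"concern"}, {"concern"}, {"disillusionment"}]
model_b = [{"concern"}, {"concern"}, {"concern"}, {"letdown"}]

micro, macro = micro_macro_f1(model_a, model_b)
print(f"Micro F1: {micro:.3f}, Macro F1: {macro:.3f}")
```

A single near-synonym mismatch adds two labels that each score zero, which barely moves the pooled Micro F1 but cuts the Macro F1 to a third in this toy case.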

In [None]:
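# Load the Hamming loss summary from the Google Drive path it was saved to earlier
df = pd.read_csv('/content/drive/MyDrive/BA 890/datasets for project/Dataset/hamming_label_accuracy_summary.csv')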

# Create the bar chart for Hamming Loss
plt.figure(figsize=(10, 6))
plt.bar(df['Pair'], df['Hamming_Loss'], color=['skyblue', 'lightcoral', 'lightgreen'])
plt.xlabel('Comparison Pair')
plt.ylabel('Hamming Loss')
plt.title('Hamming Loss Comparison Across Different Pairs')
plt.ylim(0, df['Hamming_Loss'].max() * 1.1) # Set y-axis limit slightly above max for better visualization
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping

# Save the plot
plt.savefig('hamming_loss_bar_chart.png')

print("Bar chart for Hamming Loss Comparison Across Pairs saved as 'hamming_loss_bar_chart.png'")

In [None]:
free_vs_selected_emotion_diff_GPT4o = pd.read_csv('/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_emotion_diff_GPT4o.csv')

In [None]:
free_vs_selected_emotion_diff_Phi3 = pd.read_csv('/content/drive/MyDrive/BA 890/datasets for project/Dataset/free_vs_selected_emotion_diff_Phi3.csv')

In [None]:
#Load the data for GPT-4o
df_gpt4o = pd.read_csv(drive_path + 'free_vs_selected_emotion_diff_GPT4o.csv')

# Load the data for Phi3
df_phi3 = pd.read_csv(drive_path + 'free_vs_selected_emotion_diff_Phi3.csv')

# Data Cleaning and Filtering
def clean_emotion_label(label):
    # str() guards against non-string cells (e.g. labels that read_csv parsed as NaN)
    return str(label).replace('[', '').replace(']', '').strip().lower()

df_phi3['Emotion'] = df_phi3['Emotion'].apply(clean_emotion_label)
df_gpt4o['Emotion'] = df_gpt4o['Emotion'].apply(clean_emotion_label)


# Emotions to remove (case-insensitive and exact match after cleaning)
emotions_to_remove = ['professional/formal', 'professionalism', 'professional']


df_gpt4o_filtered = df_gpt4o[~df_gpt4o['Emotion'].isin(emotions_to_remove)].copy()
df_phi3_filtered = df_phi3[~df_phi3['Emotion'].isin(emotions_to_remove)].copy()

# Visualization for GPT-4o
# Sort by total impact (Lost + Gained) for top emotions
df_gpt4o_filtered['Total_Impact'] = df_gpt4o_filtered['Lost_when_selecting'] + df_gpt4o_filtered['Gained_only_in_selection']
df_gpt4o_sorted = df_gpt4o_filtered.sort_values(by='Total_Impact', ascending=False).head(15)

fig, ax = plt.subplots(figsize=(12, 7))

bar_width = 0.35
index = range(len(df_gpt4o_sorted))

bar1 = ax.bar(index, df_gpt4o_sorted['Lost_when_selecting'], bar_width, label='Lost when Selecting', color='salmon')
bar2 = ax.bar([i + bar_width for i in index], df_gpt4o_sorted['Gained_only_in_selection'], bar_width, label='Gained only in Selection', color='lightgreen')

ax.set_xlabel('Emotion')
ax.set_ylabel('Count')
ax.set_title('GPT-4o: Per-Emotion Gains and Losses (Free-form vs. Selected) - "Professional" Removed')
ax.set_xticks([i + bar_width / 2 for i in index])
ax.set_xticklabels(df_gpt4o_sorted['Emotion'], rotation=45, ha='right')
ax.legend()
plt.tight_layout()
plt.savefig('gpt4o_emotion_gains_losses_no_professional.png')
plt.close()

# --- Visualization for Phi3 (filtered) ---
df_phi3_filtered['Total_Impact'] = df_phi3_filtered['Lost_when_selecting'] + df_phi3_filtered['Gained_only_in_selection']
df_phi3_sorted = df_phi3_filtered.sort_values(by='Total_Impact', ascending=False).head(15)

fig, ax = plt.subplots(figsize=(12, 7))

index = range(len(df_phi3_sorted))

bar1 = ax.bar(index, df_phi3_sorted['Lost_when_selecting'], bar_width, label='Lost when Selecting', color='salmon')
bar2 = ax.bar([i + bar_width for i in index], df_phi3_sorted['Gained_only_in_selection'], bar_width, label='Gained only in Selection', color='lightgreen')

ax.set_xlabel('Emotion')
ax.set_ylabel('Count')
ax.set_title('Phi3: Per-Emotion Gains and Losses (Free-form vs. Selected) - "Professional" Removed')
ax.set_xticks([i + bar_width / 2 for i in index])
ax.set_xticklabels(df_phi3_sorted['Emotion'], rotation=45, ha='right')
ax.legend()
plt.tight_layout()
plt.savefig('phi3_emotion_gains_losses_no_professional.png')
plt.close()

print("Updated per-emotion gains and losses charts (without 'professional/formal' related emotions) saved as 'gpt4o_emotion_gains_losses_no_professional.png' and 'phi3_emotion_gains_losses_no_professional.png'.")

Updated per-emotion gains and losses charts (without 'professional/formal' related emotions) saved as 'gpt4o_emotion_gains_losses_no_professional.png' and 'phi3_emotion_gains_losses_no_professional.png'.


Micro and Macro F1 score chart

In [None]:
import matplotlib.pyplot as plt

# Data for the chart
formats = ['Free-Form', 'Selected-List']
micro_f1 = [0.260, 0.198]
macro_f1 = [0.061, 0.027]

# Bar width and positions
bar_width = 0.35
x = range(len(formats))

# Create the plot with custom colors
fig, ax = plt.subplots(figsize=(8, 5))
bars1 = ax.bar([i - bar_width/2 for i in x], micro_f1, width=bar_width, label='Micro F1', color='lightblue')
bars2 = ax.bar([i + bar_width/2 for i in x], macro_f1, width=bar_width, label='Macro F1', color='lightpink')

# Add labels and title
ax.set_xlabel('Format')
ax.set_ylabel('F1 Score')
ax.set_title('Micro vs Macro F1 Scores for Free-Form vs Selected-List (GPT-4o vs Phi-3)')
ax.set_xticks(x)
ax.set_xticklabels(formats)
ax.set_ylim(0, 0.3)
ax.legend()

# Show value labels on bars
for bar in bars1 + bars2:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval + 0.005, f'{yval:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()



For the selected-list format, the Micro F1 score (0.198) is lower than for free-form generation (0.260), indicating that constraining the models to select from a fixed list reduces their alignment. The even lower Macro F1 (0.027) reinforces that disagreement is especially pronounced on infrequent or nuanced emotions in the list. These results suggest that rigid label mapping may distort underlying emotion interpretations, introducing friction even between otherwise semantically aligned LLMs.


GPT-4o vs Phi-3 (free-form): When allowed to express emotions in their own words, the models show low but reliable alignment:
•	Micro F1 = 0.260
•	Macro F1 = 0.061
•	Micro Jaccard = 0.194
The gap between Micro and Macro F1 indicates that agreement is concentrated in a few broad, frequent signals (captured by Micro F1), while overlap on rare or fine-grained emotions is minimal (very small Macro F1). This pattern is stable rather than noise: the models systematically converge on common high-prevalence cues yet diverge in how they articulate subtler states.

GPT-4o vs Phi-3 (selected emotions): Constraining outputs to a predefined list reduces alignment further:
•	Micro F1 = 0.198
•	Macro F1 = 0.027
•	Micro Jaccard = 0.157
Although the label space is standardized, the mapping pressure appears to push the models toward different neighbouring categories for ambiguous cases, preserving modest overlap on common labels (Micro F1) but yielding very poor uniform agreement across the full label set (Macro F1). In short, the predefined taxonomy adds discretization friction, lowering consistency relative to free-form.




Qualitative analysis:
Cases of high disagreement between GPT-4o and Phi-3 in how they interpreted the same response:
GPT-4o demonstrates a distinctive ability to detect low-arousal, institutionally framed emotions, such as disappointment in professional situations, frustration in formal situations, and managerial trust, which are often embedded subtly within corporate or procedural language. For instance, in Prompts P_19 and P_20, GPT-4o interprets phrases like “reminds us of the importance” and “reflects the need for more focus” as carrying emotional weight, suggesting disappointment or frustration cloaked in a professional tone.

Phi-3, on the other hand, tends to gravitate toward more affectively explicit labels such as concern, frustration, and anxiety, and frequently misses these subtler, forward-looking emotional inferences for the same prompts. In Prompt P_18, GPT-4o captures trust, recognizing a managerial tone of belief in the intern’s potential, while Phi-3 narrows in on the supportive framing of the message. This comparison reveals a key difference: GPT-4o reads emotional content through structural cues and compositional intent, whereas Phi-3 relies more on direct affective language, making GPT-4o particularly adept at decoding emotions that are conveyed through strategic, role-specific, or formally composed communication in workplace settings.
