# Product Insight Validation Using LLMs: Copilot Recommendations🔍

## Overview

This notebook evaluates different prompting strategies and prompts for validating Copilot Recommendations using a Large Language Model (LLM). The goal is a sandbox where we can refine prompts against the same cross-validation sets.

## Objectives

- **Compare Prompting Strategies:** Test multiple prompts and strategies to determine which yields the best classification results.
- **Evaluate Performance:** Measure the effectiveness of each strategy using precision, recall, and F1 score.
- **Cross-Validation Approach:** Utilize a labeled dataset containing:
  - **True Positives (TP):** Correctly identified valid insights.
  - **True Negatives (TN):** Correctly identified invalid insights.
  - **False Positives (FP):** Incorrectly marked invalid insights as valid.
  - **False Negatives (FN):** Incorrectly marked valid insights as invalid.

## Methodology

1. **Load Product Insights**  
   - Import CSV files containing the Copilot recommendations to validate.

2. **Apply LLM-Based Validation**  
   - Build the blocks for calling the LLM to validate entries and for cleaning its responses.

3. **Evaluate Performance**  
   - Compute precision, recall, and F1 score to assess classification accuracy.
   - Compare the effectiveness of different strategies based on their performance metrics.
   - 3.1 Zero Shot Prompting
   - 3.2 Few Shot Prompting
   - 3.3 Multi-pass w/ Few Shot prompting

4. **Optimize for Accuracy**  
   - Identify the best-performing prompt and strategy for product insight validation.

## Tech Stack

- **LLM Provider:** Azure OpenAI  
- **Model:** ChatGPT 4.0  
- **Data Processing:** Python (pandas, numpy)  
- **Evaluation Metrics:** precision, recall, F1 score  

## Expected Outcomes

- A clear understanding of which prompting strategy yields the best results.
- A methodology/workflow that can be iteratively improved and scaled for future product insight validation tasks.



In [1]:
# let's import the packages we will need for this project

import requests # for connecting with Azure Open AI
import json # for parsing responses
import csv # for data processing
import pandas as pd # for data analysis 

# let's also import the config we will need to interact with the Azure Open AI API

from config import config_endpoint, config_key



# 1 - Load Copilot Recommendations

Let's take a glimpse at the data we have. All these Copilot recommendations were reviewed by human validators during April '25.

These datasets act as makeshift cross-validation sets that we can use to test the performance of different prompting strategies and approaches.


We have two datasets:

- A small set of 20 cases (10 valid, 10 invalid) for quick experimentation
- A bigger set of 200 cases (100 valid, 100 invalid) for testing and measuring results

In [2]:
# We load all the data 

copilot_recommendation_20 = pd.read_csv('./cases/copilot_recommendation_cv20.csv')
copilot_recommendation_200 = pd.read_csv('./cases/copilot_recommendation_cv200.csv')

# Now let's print one of the datasets to see its shape

copilot_recommendation_20[:5]

Unnamed: 0,ID,Case Number,UPN,Line Of Business,Recommendation Details,Recommendation Details Validation,Recommendation Details Comment
0,28155,2503310040000583,gig_wfh_micsi@microsoftsupport.com,Business Advisor Reactive,\nInformed customer that Copilot is integrate...,1,Relevant in customer's key areas of usage in M365
1,27414,2504041420002774,gig_wfh_abelr@microsoftsupport.com,Business Advisor Reactive,\n- Given the administrator's objective of ut...,1,Valid
2,26653,2504031420003187,gig_wfh_hotab@microsoftsupport.com,Trials Nurturing Reactive,\nCopilot in Outlook assists with email manag...,1,Valid recommendation
3,27703,2504040030008181,gig_wfh_feasu@office365support.com,Business Assist,\n To save time I recommended summarizing ema...,1,Copilot Outlook to summarize email thread was ...
4,26036,2503280040006869,gig_wfh_maond@microsoftsupport.com,Business Assist,Copilot recommendation details: \nThe customer...,1,Valid
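
Before relying on these sets, it can help to confirm that the labels are balanced as described. The sketch below is only illustrative: `label_balance` is a hypothetical helper, and the small `sample` frame stands in for the real CSV, reusing only the `Recommendation Details Validation` column name from the data.

```python
import pandas as pd

LABEL_COLUMN = "Recommendation Details Validation"

def label_balance(df, label_column=LABEL_COLUMN):
    """Return the number of rows per label (1 = valid, 0 = invalid)."""
    counts = df[label_column].value_counts()
    return {int(label): int(n) for label, n in counts.items()}

# Stand-in data (an assumption); with the real files,
# pass copilot_recommendation_20 or copilot_recommendation_200 instead.
sample = pd.DataFrame({LABEL_COLUMN: [1, 1, 0, 1, 0, 0]})
print(label_balance(sample))
```

For the 20-case set, the expectation from the description above would be 10 rows per label.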


In [3]:
# Column explanation
data = [
    ["Recommendation Details", "Raw Copilot Recommendation Details captured by the ambassador as they figure in the tracker"],
    ["Recommendation Details Validation", "Validation done by a human validator - 0 is invalid, 1 is valid"],
    ["Recommendation Details Comment", "Comment/Explanation provided by a human validator"]
]
column_data = pd.DataFrame(data, columns=["Column Name", "Explanation"])

pd.set_option("display.max_colwidth", None) 
column_data

Unnamed: 0,Column Name,Explanation
0,Recommendation Details,Raw Copilot Recommendation Details captured by the ambassador as they figure in the tracker
1,Recommendation Details Validation,"Validation done by a human validator - 0 is invalid, 1 is valid"
2,Recommendation Details Comment,Comment/Explanation provided by a human validator


## 1.1 Defining Performance

These two datasets have already been evaluated by human validators.

This means we can use the existing labels to calculate sensitivity (recall), precision and F1 for each dataset, which gives us performance metrics we can analyse and optimise. We will be looking at the following metrics.

# Performance Metrics

## Sensitivity (Recall)
Sensitivity, also known as **recall**, measures the ability to correctly identify positive cases:

$$
\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

## Precision
Precision measures how many of the predicted positive cases were actually correct:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

## F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
$$
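
As a worked example of the three formulas, here is a minimal sketch that computes them from raw counts. `classification_metrics` is a hypothetical helper, and the counts are illustrative values, not results from the notebook.

```python
def classification_metrics(tp, fp, fn):
    """Compute sensitivity (recall), precision and F1 from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = precision + sensitivity
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return sensitivity, precision, f1

# Illustrative counts: 8 true positives, 1 false positive, 2 false negatives.
sensitivity, precision, f1 = classification_metrics(tp=8, fp=1, fn=2)
print(round(sensitivity, 6), round(precision, 6), round(f1, 6))
# prints: 0.8 0.888889 0.842105
```

True negatives do not appear in any of the three formulas, so a prompt that rejects almost everything can still score a high precision.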


We will create a dataframe where we will store the results of our tests as we run them.

In [4]:
test_results = pd.DataFrame([], columns=["test_name", "sensitivity", "precision", "f1_score"])


test_results


Unnamed: 0,test_name,sensitivity,precision,f1_score


## 2. Setting Up Logic for LLM Validation and Analysis

### 2.1 Validation

Let's start this section by defining a function that calls Azure OpenAI with a system prompt and an input provided by the user.

The system prompt will contain the criteria to validate an insight, and the user input will be the entry registered by our agents. 

In [5]:
HEADERS = {
    "Content-Type": "application/json",
    "api-key": config_key
}

def send_prompt(system_prompt, user_prompt, max_tokens=200):
    """Send a prompt to Azure OpenAI and return the response."""
    url = config_endpoint
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens
    }
    
    try:
        # a timeout prevents a stalled request from blocking the whole run
        response = requests.post(url, headers=HEADERS, json=data, timeout=60)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"
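
When comparing prompts, run-to-run variation in the model's output can blur small differences. One possible mitigation is to lower the sampling temperature. The sketch below only builds the request body. `build_payload` is a hypothetical helper, and `temperature` is a standard chat-completions parameter that `send_prompt` above leaves at its default. Setting it to 0 tends to make outputs more repeatable, but it does not guarantee identical responses.

```python
def build_payload(system_prompt, user_prompt, max_tokens=200, temperature=0):
    """Build a chat-completions request body with an explicit temperature."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        # Assumption: a low temperature is preferred for classification runs.
        "temperature": temperature,
    }

payload = build_payload("You validate entries.", "Example entry")
print(payload["temperature"], len(payload["messages"]))
```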

Let's test it with a deliberately simple example to make sure it works.

In [6]:
res = send_prompt(
    "You are system dedicated to validate product feedback. You will only declare as valid feedback that has to do usability issues, anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }",
    "I could not use the app at all, the menu was very convoluted and crowded with icons. Very hard to read"
)

print(res)
    

```json
{
  "valid": true,
  "reason": "The feedback addresses a usability issue related to the app's menu being convoluted and crowded, which directly impacts the user's ability to use the product."
}
```


The model returns JSON wrapped in a Markdown code fence. Let's create a function to clean it.

In [7]:
def clean_llm_response(res):
    return res.replace("json", "").replace(r'\n', '').replace(r"\'", "'").replace("`", "").strip()

In [8]:
print(clean_llm_response(res))

{
  "valid": true,
  "reason": "The feedback addresses a usability issue related to the app's menu being convoluted and crowded, which directly impacts the user's ability to use the product."
}


Great! We now have the basic building block for testing different validation prompts.   
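
One caveat with the string-replacement approach: it removes every occurrence of `json` and every backtick, including any that appear inside the reasoning text. A more conservative variant, sketched below under the assumption that the fence only appears at the very start and end of the response, strips just the fence and parses the rest. `parse_llm_json` is a hypothetical helper and not part of the notebook.

```python
import json
import re

# Matches an opening fence (optionally tagged "json") at the start,
# or a closing fence at the end of the text.
FENCE_PATTERN = re.compile(r"^```(?:json)?\s*|\s*```$", re.IGNORECASE)

def parse_llm_json(raw):
    """Strip an optional Markdown code fence and parse the remainder as JSON.

    Returns the parsed object, or None when the text is not valid JSON.
    """
    text = FENCE_PATTERN.sub("", raw.strip())
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

raw = '```json\n{"valid": true, "reason": "Mentions json and `code`."}\n```'
print(parse_llm_json(raw))
```

Returning `None` for unparseable text keeps the failure case explicit for the caller.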

###  2.2 Analysis of prompt performance

Now we need a function that allows us to do the following:

- 1. Iterate through the rows of one of our datasets.
- 2. For each row:
    - 1. Ask the LLM to validate the entry
    - 2. Compare the LLM verdict with the human label
          - LLM => Valid, Human => Valid, then *true positive*
          - LLM => Invalid, Human => Invalid, then *true negative*
          - LLM => Valid, Human => Invalid, then *false positive*
          - LLM => Invalid, Human => Valid, then *false negative*
    - 3. Store this information
- 3. Calculate sensitivity (recall), precision and F1 for this prompt
- 4. Add the results to our log in the `test_results` variable we created before
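
The comparison in step 2 can be sketched as a small mapping from a verdict pair to a confusion-matrix cell. `classify_outcome` is a hypothetical helper, and the `pairs` list is made-up data for illustration.

```python
from collections import Counter

def classify_outcome(llm_valid, human_label):
    """Map an (LLM verdict, human label) pair to a confusion-matrix cell."""
    human_valid = human_label == 1
    if llm_valid and human_valid:
        return "tp"
    if not llm_valid and not human_valid:
        return "tn"
    if llm_valid and not human_valid:
        return "fp"
    return "fn"

# Made-up (LLM verdict, human label) pairs.
pairs = [(True, 1), (False, 0), (True, 0), (False, 1), (True, 1)]
counts = Counter(classify_outcome(llm, human) for llm, human in pairs)
print(counts["tp"], counts["tn"], counts["fp"], counts["fn"])
# prints: 2 1 1 1
```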

In [9]:
import time

def analyse_test_prompt(test_name, prompt, results_store, dataset):
    '''
    Evaluates the performance of a prompt 

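    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt: system prompt passed to the LLM to validate product feedback
      - results_store: dataframe where we can store the results
      - dataset: dataframe with the human-labelled cases to evaluate against

    Returns:
      - the updated results_store dataframe
    '''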
    start_time = time.time()

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    row_counter = 0
    for index, row in dataset.iterrows():
        # pause after every 10 requests to stay under the API rate limit
        print(row_counter)
        if row_counter >= 10:
            print("Rate limit is close, continuing in 61 seconds...")
            time.sleep(61)
            row_counter = 0
                
        llm_res = send_prompt(prompt, row['Recommendation Details'])
        llm_res = clean_llm_response(llm_res)
        row_counter += 1
        print(row['Recommendation Details'],llm_res)

        try:
            llm_res = json.loads(llm_res)
        except json.JSONDecodeError as e:
            print(f"[WARN] Failed to parse LLM response as JSON: {e}")
            continue

        human_validation = row['Recommendation Details Validation']
        if llm_res['valid'] == True and human_validation == 1:
            tp += 1
        elif llm_res['valid'] == False and human_validation == 0:
            tn += 1
        elif llm_res['valid'] == True and human_validation == 0:
            fp += 1
        elif llm_res['valid'] == False and human_validation == 1:
            fn += 1

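    # guard against empty denominators (e.g. no positive predictions at all)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_1 = 2 * (precision * sensitivity) / (precision + sensitivity) if (precision + sensitivity) else 0.0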
    
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store
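
Note that rows whose response cannot be parsed are skipped by the `continue` in the loop, so they drop out of every metric without leaving a trace in the results. Below is a minimal sketch of how the skipped rows could be surfaced. `coverage_report` and its arguments are hypothetical, and the counts are illustrative.

```python
def coverage_report(tp, tn, fp, fn, total_rows):
    """Report how many rows were scored and how many were skipped."""
    scored = tp + tn + fp + fn
    skipped = total_rows - scored
    coverage = scored / total_rows if total_rows else 0.0
    return {"scored": scored, "skipped": skipped, "coverage": coverage}

# Illustrative counts for a 20-row dataset where one response failed to parse.
report = coverage_report(tp=8, tn=9, fp=1, fn=1, total_rows=20)
print(report)
```

Reporting coverage next to the metrics makes it easier to tell a weak prompt apart from a prompt whose responses often fail to parse.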

Ok, our building blocks are now ready. Let's start prompting.

## 3. Testing Prompting Approaches

In this section we will test several prompting approaches to see which one performs best. Let's go!


### 3.1 Loose Zero Shot Prompt

Zero-shot prompting is a technique used with large language models (LLMs) where the model is asked to perform a task without being given any specific examples of how to do it. We're relying entirely on the model's pre-existing knowledge and understanding to generate a response.  

In the prompt below, we describe high-level criteria that human validators frequently cite when marking insights as valid or invalid. They are drawn from an analysis of the validators' stated reasons.


In [10]:
loose_zero_shot_prompt_Test1 = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Set 1: Valid Copilot Recommendation Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Relevance: The recommendation is clearly aligned with the customer's specific needs or how they use Microsoft 365

    - B) Copilot Focus: The recommendation includes specific mention of Copilot and its features

    - C) Demonstrated Benefit: The recommendation explicitly outlines the benefits or expected impact for the customer

    - D) Workflow Impact: The recommendation includes details about Microsoft 365 features and how they will improve the customer's workflow

    - E) Specificity: The recommendation targets a specific Microsoft 365 application (e.g., Outlook, Teams, Excel, etc.)

## Set 2: Invalid Copilot Recommendation Criteria

Meeting any of the criteria below is enough for the entry to be considered invalid.

    - F) Generic/Vague: The recommendation lacks specific detail about Copilot features or their benefits

    - G) Misaligned: The recommendation does not align with any known customer need or business goal

    - H) No Copilot Mention: The recommendation does not mention any specific Copilot functionality

    - I) Template-based: The recommendation appears copy/pasted or overly generic, lacking personalization

    - J) No Demonstrated Benefit: The recommendation does not explain how it would help the customer or impact their workflow


## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [11]:
# Test Loose Zero Shot Prompt here 
test_results = analyse_test_prompt('loose_zero_shot_20_Test1', loose_zero_shot_prompt_Test1, test_results, copilot_recommendation_20)


0
 
Informed customer that Copilot is integrated into the apps like Outlook, Word, Excel, Teams, and more. It helps save time, enhance productivity, and make smarter decisions.
Draft and summarize email thread in Outlook
Draft and edit documents faster in Word.
Analyze data effortlessly in Excel.
Collaborate seamlessly and boost meeting productivity in Teams.
Automate repetitive tasks and free up time for strategic work.
Provided helpful article {
  "valid": false,
  "reasoning": "The entry lacks specificity and does not provide detailed information about Copilot features and their benefits. While Copilot is mentioned, the recommendation is generic and fails to demonstrate a clear alignment with specific customer needs (criterion F). It also lacks any demonstrated impact on workflow (criterion J). Furthermore, the use of multiple applications without specific needs or workflow improvements makes the recommendation appear overly template-based and vague (criterion I)."
}
1
 
- Given the



In [12]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20_Test1,0.8,0.888889,0.842105


In [13]:
loose_zero_shot_prompt_Test2 = '''
What You Do
You check Copilot Value-Add recommendations to see if they really help customers—not just solve a small problem, but make their work faster, easier, or more successful.

Good Recommendations ✅
    - Fits the customer's needs → It helps with how they already use Microsoft 365.

    - Mentions Copilot features → It's clearly about Copilot, not just general advice.

    - Explains the benefits → Shows why it's useful.

    - Gives details on features & workflow impact → How does it make work better?

    - Talks about specific apps (Outlook, Teams, Excel, etc.) → Targeted tips are better.

Bad Recommendations ❌
    - Too vague → Doesn't really say how Copilot helps.

    - Not related to customer needs or business goals → Doesn't fit their actual work.

    - No mention of Copilot features → Must be about Copilot, not just Microsoft 365.

    - Copy-paste or template responses → Needs to be personal and useful.

    - Doesn't explain impact → Just saying something is good isn't enough—how does it help?

AI Output Format:
    json
    {
        "valid": make it true if the entry is considered valid, false if invalid,
        "reasoning": add your reasoning based on the criteria set above
    }
'''

In [14]:
# Test Loose Zero Shot Prompt here 
test_results = analyse_test_prompt('loose_zero_shot_20_Test2', loose_zero_shot_prompt_Test2, test_results, copilot_recommendation_20)


0
 
Informed customer that Copilot is integrated into the apps like Outlook, Word, Excel, Teams, and more. It helps save time, enhance productivity, and make smarter decisions.
Draft and summarize email thread in Outlook
Draft and edit documents faster in Word.
Analyze data effortlessly in Excel.
Collaborate seamlessly and boost meeting productivity in Teams.
Automate repetitive tasks and free up time for strategic work.
Provided helpful article {
    "valid": false,
    "reasoning": "While the recommendation mentions some Copilot features related to different apps, it remains too vague and generic. It does not go into detail about specific workflows or scenarios where Copilot can uniquely help the customer. For example, how exactly does Copilot draft emails in Outlook, analyze data in Excel, or automate tasks? There is no explanation of the impact or benefits tailored to the customer's needs, and the advice seems copy-pasted or templated rather than personalized. Additionally, while p

In [15]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20_Test1,0.8,0.888889,0.842105
1,loose_zero_shot_20_Test2,0.8,0.8,0.8


In [16]:
loose_zero_shot_no_negative_criteria_Test1 = """
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Valid Copilot Recommendation Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Relevance: The recommendation is clearly aligned with the customer's specific needs or how they use Microsoft 365

    - B) Copilot Focus: The recommendation includes specific mention of Copilot and its features

    - C) Demonstrated Benefit: The recommendation explicitly outlines the benefits or expected impact for the customer

    - D) Workflow Impact: The recommendation includes details about Microsoft 365 features and how they will improve the customer's workflow

    - E) Specificity: The recommendation targets a specific Microsoft 365 application (e.g., Outlook, Teams, Excel, etc.)

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above
"""


In [17]:
test_results = analyse_test_prompt('loose_zero_shot_20_no_negative_criteria_Test1', loose_zero_shot_no_negative_criteria_Test1, test_results, copilot_recommendation_20)

0
 
Informed customer that Copilot is integrated into the apps like Outlook, Word, Excel, Teams, and more. It helps save time, enhance productivity, and make smarter decisions.
Draft and summarize email thread in Outlook
Draft and edit documents faster in Word.
Analyze data effortlessly in Excel.
Collaborate seamlessly and boost meeting productivity in Teams.
Automate repetitive tasks and free up time for strategic work.
Provided helpful article {
  "valid": false,
  "reasoning": "The entry does not meet all the required criteria. While it mentions Copilot and outlines its integration into various Microsoft 365 apps, it lacks specificity about how the recommendations align with the customer's specific needs (Criteria A). The benefits and workflow impacts mentioned are generic and do not provide enough detail about how Copilot improves workflows for the customer (Criteria C, D). Furthermore, while Microsoft 365 applications are named, there is insufficient specificity around the unique 

In [18]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20_Test1,0.8,0.888889,0.842105
1,loose_zero_shot_20_Test2,0.8,0.8,0.8
2,loose_zero_shot_20_no_negative_criteria_Test1,0.8,1.0,0.888889


In [19]:
loose_zero_shot_no_negative_criteria_Test2 = '''
What You Do
You check Copilot Value-Add recommendations to see if they really help customers—not just solve a small problem, but make their work faster, easier, or more successful.

Good Recommendations ✅
    - Fits the customer's needs → It helps with how they already use Microsoft 365.

    - Mentions Copilot features → It's clearly about Copilot, not just general advice.

    - Explains the benefits → Shows why it's useful.

    - Gives details on features & workflow impact → How does it make work better?

    - Talks about specific apps (Outlook, Teams, Excel, etc.) → Targeted tips are better.

AI Output Format:
    json
    {
        "valid": make it true if the entry is considered valid, false if invalid,
        "reasoning": add your reasoning based on the criteria set above
    }
'''

In [20]:
test_results = analyse_test_prompt('loose_zero_shot_20_no_negative_criteria_Test2', loose_zero_shot_no_negative_criteria_Test2, test_results, copilot_recommendation_20)


0
 
Informed customer that Copilot is integrated into the apps like Outlook, Word, Excel, Teams, and more. It helps save time, enhance productivity, and make smarter decisions.
Draft and summarize email thread in Outlook
Draft and edit documents faster in Word.
Analyze data effortlessly in Excel.
Collaborate seamlessly and boost meeting productivity in Teams.
Automate repetitive tasks and free up time for strategic work.
Provided helpful article {
    "valid": true,
    "reasoning": "This recommendation clearly outlines Copilot features integrated within specific Microsoft 365 apps (Outlook, Word, Excel, Teams, etc.). It explains how Copilot helps users save time, enhance productivity, and make smarter decisions while highlighting specific workflows like drafting emails, editing documents, analyzing data, collaborating in Teams, and automating tasks. The inclusion of a helpful article further supports the recommendation with detailed information. This meets the criteria of being benefi

In [21]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20_Test1,0.8,0.888889,0.842105
1,loose_zero_shot_20_Test2,0.8,0.8,0.8
2,loose_zero_shot_20_no_negative_criteria_Test1,0.8,1.0,0.888889
3,loose_zero_shot_20_no_negative_criteria_Test2,0.9,0.818182,0.857143
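
To compare the runs so far, the results can be ranked by F1 score. The sketch below re-creates the four rows shown above in a standalone frame. `results` is a stand-in; in the notebook, `test_results` could be sorted directly.

```python
import pandas as pd

# Values copied from the test_results output above.
results = pd.DataFrame(
    {
        "test_name": [
            "loose_zero_shot_20_Test1",
            "loose_zero_shot_20_Test2",
            "loose_zero_shot_20_no_negative_criteria_Test1",
            "loose_zero_shot_20_no_negative_criteria_Test2",
        ],
        "sensitivity": [0.8, 0.8, 0.8, 0.9],
        "precision": [0.888889, 0.8, 1.0, 0.818182],
        "f1_score": [0.842105, 0.8, 0.888889, 0.857143],
    }
)

# Highest F1 first.
ranked = results.sort_values("f1_score", ascending=False).reset_index(drop=True)
best = ranked.loc[0, "test_name"]
print(ranked[["test_name", "f1_score"]])
print(best)
```

With only 10 valid cases in the 20-case set, a single case shifts sensitivity by 0.1, so this ranking is best treated as provisional until it is confirmed on the 200-case set.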


### 3.2 Detailed Zero Shot Prompt

In the prompt below, we provide a detailed, tiered list of criteria based on the latest version of the insights framework available at GigPlus.


In [22]:
loose_detailed_zero_shot_prompt = '''
You are an AI assistant that validates **Copilot Value Add** recommendations in customer support conversations. These recommendations highlight how **Microsoft 365 Copilot** (the AI assistant integrated into apps like Teams, Outlook, Word, etc.) can provide **additional value** beyond solving the customer's immediate issue. A Copilot Value Add is essentially a **personalized suggestion** that goes the extra mile – it's tailored to the customer's unique situation and shows how Copilot can help them achieve their broader goals or improve an aspect of their work.

**What does a great Copilot Value Add look like?**  
It's **customer-centric**: The ambassador (support agent) clearly links a Copilot feature to a **specific business benefit** for that customer. It demonstrates how using Copilot will contribute to the customer's business or personal objectives (e.g., saving time, improving accuracy, boosting team collaboration). It also usually involves **showing Copilot in action** (like giving a quick demo or guiding the customer through using Copilot right there, so they experience its value first-hand). Throughout, it's **tailored to the customer's use case**: it addresses the customer's own pain points and workflows, with context about their environment, rather than being a generic pitch. In short, a valid value-add recommendation feels like helpful coaching or insight, not a sales script.

If an ambassador is an external consultant working on behalf of the customer, or if the recommendation is part of a promotional campaign, it's still crucial that they **define a relevant use case** so the customer can see the practical value. (For example, instead of just saying “Try Copilot's new feature X!”, they should say something like “Since you often have to do Y, Copilot's feature X could save you Z hours by...”). 

## Validation Criteria:

To mark the entry **Valid**, ensure the conversation's recommendation meets most of the following:

  - **Aligns with Needs/Goals:** It clearly connects a Copilot feature or capability with the customer's stated **business needs or goals**. (e.g., “Using Copilot in Excel will help you analyze these sales figures faster so you can meet your deadline.”)

  - **Provides Value-Add Insight:** It goes beyond the original support issue to offer something extra that contributes to the customer’s success. The ambassador demonstrates an understanding of the customer's context and gives a relevant, helpful suggestion.

  - **Hands-On or Clearly Described:** A strong recommendation either demonstrates the Copilot feature or clearly describes how it works and what benefit it brings to the customer. A live demo is ideal, but not required if the explanation is vivid and practical.

  - **Tailored to Use Case (Flexible):** The suggestion should ideally reference the customer’s specific scenario or pain point. However, clear and helpful recommendations that are relevant to common Microsoft 365 usage scenarios may still be considered valid.

  - **Copilot Mention (Flexible):** If the recommendation clearly describes a capability associated with Copilot (e.g. summarizing emails, drafting responses, generating insights), it can be valid even if the word "Copilot" is not mentioned—provided the feature is accurately described.

  - **Context and Impact:** The recommendation should help the customer understand when and why they might use the feature. Mentioning expected benefits (e.g. saving time, improving accuracy, reducing effort) is sufficient — precise metrics are not required.

If the recommendation is vague, generic, not aligned with a business goal or customer need, fails to explain the benefit of the feature, or sounds like a template without personalization — it should be marked **Invalid**.

**Output Format:**

Respond in JSON with:
```json
{
  "valid": true or false,
  "reasoning": "your explanation based on the validation criteria"
}
'''

In [23]:
test_results = analyse_test_prompt('loose_detailed_zero_shot_20', loose_detailed_zero_shot_prompt, test_results, copilot_recommendation_20)


0
 
Informed customer that Copilot is integrated into the apps like Outlook, Word, Excel, Teams, and more. It helps save time, enhance productivity, and make smarter decisions.
Draft and summarize email thread in Outlook
Draft and edit documents faster in Word.
Analyze data effortlessly in Excel.
Collaborate seamlessly and boost meeting productivity in Teams.
Automate repetitive tasks and free up time for strategic work.
Provided helpful article {
  "valid": false,
  "reasoning": "The recommendation is generic and does not connect specific Copilot features to the customer's unique business goals or pain points. While it lists various capabilities of Copilot—such as drafting emails, editing documents, analyzing data, and automating tasks—it does not demonstrate how these features would specifically benefit the customer based on their needs or workflows. Additionally, there is no example or detailed explanation of how Copilot works, nor a tailored use case that would make the suggestions

In [24]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20_Test1,0.8,0.888889,0.842105
1,loose_zero_shot_20_Test2,0.8,0.8,0.8
2,loose_zero_shot_20_no_negative_criteria_Test1,0.8,1.0,0.888889
3,loose_zero_shot_20_no_negative_criteria_Test2,0.9,0.818182,0.857143
4,loose_detailed_zero_shot_20,0.285714,1.0,0.444444


### 3.3 Few-Shot Prompt

Few-shot prompting is a technique in prompt engineering that aims to augment LLMs by providing a small number of examples within the prompt itself. This allows the model to learn and adapt to a specific task without requiring extensive fine-tuning.

In the prompt below, we provide a few positive and negative examples for each of the categories and see their impact on performance.

In [25]:
few_shot_prompt = '''

You are an AI assistant that validates **Copilot Value Add** recommendations in customer support conversations. These recommendations highlight how **Microsoft 365 Copilot** (the AI assistant integrated into apps like Teams, Outlook, Word, etc.) can provide **additional value** beyond solving the customer’s immediate issue. A Copilot Value Add is essentially a **personalized suggestion** that goes the extra mile – it’s tailored to the customer’s unique situation and shows how Copilot can help them achieve their broader goals or improve an aspect of their work.

**What does a great Copilot Value Add look like?**  
It’s **customer-centric**: The ambassador (support agent) clearly links a Copilot feature to a **specific business benefit** for that customer. It demonstrates how using Copilot will contribute to the customer’s business or personal objectives (e.g., saving time, improving accuracy, boosting team collaboration). It also usually involves **showing Copilot in action** (like giving a quick demo or guiding the customer through using Copilot right there, so they experience its value first-hand). Throughout, it’s **tailored to the customer’s use case**: it addresses the customer’s own pain points and workflows, with context about their environment, rather than being a generic pitch. In short, a valid value-add recommendation feels like helpful coaching or insight, not a sales script.

If an ambassador is an external consultant working on behalf of the customer, or if the recommendation is part of a promotional campaign, it’s still crucial that they **define a relevant use case** so the customer can see the practical value. (For example, instead of just saying “Try Copilot’s new feature X!”, they should say something like “Since you often have to do Y, Copilot’s feature X could save you Z hours by...”). 

## Validation Criteria:

To mark the entry **Valid**, ensure the conversation’s recommendation:

  - **Aligns with Needs/Goals:** It explicitly connects a Copilot feature or capability with the customer’s stated **business needs or goals**. (Does it say *which* Copilot feature, and *how* it helps the customer reach their goal or solve a problem? e.g. “Using Copilot in Excel will help you analyze these sales figures 10x faster, so you can make your quarterly report deadline.”)

  - **Provides Value-Add Insight:** It goes beyond the original support issue’s resolution to offer something extra. The ambassador is clearly focused on the customer’s broader success: they show they **understand the customer’s business/technical context** and give a suggestion that genuinely benefits the customer or their business (not just pushing a random feature). It should feel like they’re building a relationship or trust, not just closing a ticket.

  - **Hands-On Interaction:** Whenever possible, the ambassador actually **demonstrates or walks the customer through the Copilot feature** in real-time. (For example, “Let’s have Copilot draft an email for you right now so you see how it works.”) This tangible experience helps the customer immediately grasp the feature’s value. *(If a live demo isn’t possible, at least the conversation should vividly describe how to use the feature or what it would do for them.)*

  - **Tailored to the Use Case:** The suggestion is not one-size-fits-all; it’s personalized. It references the customer’s unique scenario, data, or pain point and shows **how Copilot addresses that specific situation**. (It should sound like, “Given you mentioned [pain point/need], Copilot can [specific solution].” If it’s generic or irrelevant to what the customer cares about, that’s not good.)

  - **Context and Impact:** It provides enough context about the feature and how it will fit into or improve the customer’s **workflow or satisfaction**. It might mention how it would save time in their daily routine, reduce errors in their process, improve their team’s communication, etc. The customer should be able to imagine exactly when and why they’d use Copilot from this explanation.

If any of these elements are missing (e.g. the ambassador just says “Try Copilot, it’s cool!” without context, or suggests a feature that doesn’t match the customer’s goals), then the recommendation is **Invalid**. Also, if the ambassador fails to show or explain the feature, or the customer is left confused about why they’d use it, that’s not a successful value-add.

###EXAMPLES

Example 1 (Valid):
Conversation excerpt: After resolving a Teams call quality issue, the ambassador says: “Since you run many large meetings, have you tried Copilot in Teams? It can generate a quick summary of your meetings and highlight action items. For instance, in your last all-hands, Copilot could have pulled out the three next steps everyone agreed on – saving you the time of writing up notes. Shall I show you how to do that?” They then walk the customer through using Teams Copilot to summarize a recent meeting recording, and the customer reacts positively.
Analysis: The ambassador specifically links a Copilot feature (Teams meeting summaries) to the customer’s goal (not having to manually summarize large meetings). They provided a hands-on demonstration (“Shall I show you…”) and clearly tailored it to the customer’s scenario (their last all-hands meeting). The value (time saved, clearer action items) was explicit.

Example 2 (Invalid):
Conversation excerpt: The support issue was about a SharePoint file permission error, which the ambassador fixed. Then the ambassador adds: “By the way, you should check out Copilot – it’s awesome. It can do a lot of cool things with your data!” The customer asks if it can help with file permissions, and the ambassador just says, “Not exactly, but it’s still worth exploring. Lots of our clients love it.” The customer murmurs “okay…,” sounding uncertain, and the call ends.
Analysis: This recommendation is vague and not tied to the customer’s issue or goals. The ambassador didn’t specify any feature (just “Copilot” broadly) or show how it would actually help this customer. There was no demonstration or concrete example relevant to the customer’s context (file management). It comes off as a superficial pitch. From a PLG perspective, it’s not useful – it doesn’t educate or excite the user, so it likely won’t drive adoption.

**Output Format:**

Respond in JSON with:
```json
{
  "valid": true or false,
  "reasoning": "your explanation based on the validation criteria"
}
```
'''

In [27]:
test_results = analyse_test_prompt('few shot', few_shot_prompt, test_results, copilot_recommendation_20)

print(test_results)

0
 
Informed customer that Copilot is integrated into the apps like Outlook, Word, Excel, Teams, and more. It helps save time, enhance productivity, and make smarter decisions.
Draft and summarize email thread in Outlook
Draft and edit documents faster in Word.
Analyze data effortlessly in Excel.
Collaborate seamlessly and boost meeting productivity in Teams.
Automate repetitive tasks and free up time for strategic work.
Provided helpful article {
  "valid": false,
  "reasoning": "The recommendation is too general and lacks personalization to the customer's specific needs or goals. While it mentions several Copilot features and their benefits, it does not tie any of these features explicitly to the customer's own pain points or workflows. For example, it does not provide context about why the customer would benefit from drafting emails faster in Outlook, analyzing data in Excel, or automating tasks. There is no demonstration or direct engagement to show the customer how Copilot works i

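The prompt asks for a JSON object with `valid` and `reasoning`, but a reply can arrive wrapped in a Markdown code fence or fail to parse at all. Below is a minimal sketch of a defensive parser. `parse_validation_response` and `FENCE` are hypothetical names introduced for illustration; this is not the notebook's `clean_llm_response`, whose implementation is not shown in this section.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker (hypothetical constant)


def parse_validation_response(raw_response):
    """Return (valid, reasoning), or None when the reply is not usable.

    Hypothetical helper, assuming the reply should be a JSON object
    with a boolean "valid" and an optional "reasoning" string.
    """
    text = raw_response.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.strip()
        # Drop the closing fence if present.
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Accept only the shape the prompt asks for.
    if not isinstance(payload, dict) or not isinstance(payload.get("valid"), bool):
        return None
    return payload["valid"], payload.get("reasoning", "")
```

Returning `None` instead of raising lets the evaluation loop decide whether to skip the row or count it separately.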
In [None]:
def analyse_multipass_prompt(test_name, prompt_product, prompt_deployment, results_store, true_positives, true_negatives, false_positives, false_negatives):
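    '''
    Evaluates a two-pass prompting strategy against the labeled validation sets.

    Args:
      - test_name: name of the test, used as an identifier in results_store
      - prompt_product: system prompt passed to the LLM to validate product feedback
      - prompt_deployment: system prompt passed to the LLM to validate deployment blockers
      - results_store: dataframe where the results are stored
      - true_positives, true_negatives, false_positives, false_negatives:
        labeled dataframes with a 'Feedback' column

    Returns:
      - results_store with the sensitivity, precision, and F1 score for test_name added or updated
    '''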
    start_time = time.time()
    
    dataframes = [
        [true_positives, True], # first value contains the data, the second what we would like the model to return for every row
        [true_negatives, False], # for instance, the llm should evaluate all true positives as valid to have 100% accuracy 
        [false_positives, False],
        [false_negatives, True]
    ]

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    llm_call_counter = 0
    for dataframe in dataframes:
        data, expectation = dataframe
        
        for index, row in data.iterrows():
            # pause after every 10 LLM calls to stay under the rate limit
            print(llm_call_counter)
            if llm_call_counter >= 10:
                print("Rate limit is close, continuing in 61 seconds...")
                time.sleep(61)
                llm_call_counter = 0
                
                
            llm_res = send_prompt(prompt_product, row['Feedback'])
            llm_res = clean_llm_response(llm_res)
            llm_call_counter += 1

            print(llm_res)
            # if we did not get a TP or a TN, we fall back to the other prompt
            # (this check uses the expected label, so it only works on labeled data)
            llm_res = json.loads(llm_res)
            if expectation != llm_res['valid']:
                llm_res = send_prompt(prompt_deployment, row['Feedback'])
                llm_res = clean_llm_response(llm_res)
                llm_call_counter += 1  # count the call even if parsing fails below
                try:
                    llm_res = json.loads(llm_res)
                except json.JSONDecodeError as e:
                    print(f"Error parsing JSON: {e}, moving to next case")  # Log the error
                    continue  # skip this row; a bare `next` is a no-op here
                
            
            
            print(llm_res)
            
            if expectation == True and llm_res['valid'] == True:
                tp += 1
            elif expectation == True and llm_res['valid'] == False:
                fn += 1
            elif expectation == False and llm_res['valid'] == True:
                fp += 1
            elif expectation == False and llm_res['valid'] == False:
                tn += 1

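    # sensitivity is the same quantity as recall; guard against empty denominators
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_1 = 2 * (precision * sensitivity) / (precision + sensitivity) if (precision + sensitivity) else 0.0
    new_results_row = pd.DataFrame({"test_name": [test_name], "sensitivity": [sensitivity], "precision": [precision], "f1_score": [f_1]})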

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store
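
The pause-every-10-calls logic in the loop above can also be isolated into a small helper so that both prompts share one counter. This is a minimal sketch, not part of the notebook: `CallThrottle` and its parameters are hypothetical, with defaults mirroring the 10 calls and 61 seconds used above. The `sleep` argument is injectable so the pause can be replaced in tests.

```python
import time


class CallThrottle:
    """Pause for `pause_seconds` after every `max_calls` calls (hypothetical helper)."""

    def __init__(self, max_calls=10, pause_seconds=61, sleep=time.sleep):
        self.max_calls = max_calls
        self.pause_seconds = pause_seconds
        self._sleep = sleep  # injectable so tests do not actually wait
        self._calls = 0

    def wait_if_needed(self):
        """Call right before each LLM request."""
        if self._calls >= self.max_calls:
            # The budget for this window is used up: wait, then start a new window.
            self._sleep(self.pause_seconds)
            self._calls = 0
        self._calls += 1
```

Calling `throttle.wait_if_needed()` immediately before every `send_prompt` call keeps the counting in one place, so a skipped row cannot leave the counter out of sync.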


Now, let's break down the prompts with their examples

## Results

This section reports the performance of several prompting strategies on our makeshift cross-validation sample.
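
To sanity-check a reported score by hand, the three metrics can be recomputed from raw confusion counts. The sketch below is a standalone illustration under one assumption that is not stated in the notebook: a ratio with an empty denominator is reported as 0.0. The function name `classification_metrics` is hypothetical, and the counts in the usage line are made-up inputs, not results from this notebook.

```python
def classification_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from confusion counts.

    Sensitivity is another name for recall. Undefined ratios
    (empty denominators) are reported as 0.0 by assumption.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    total = precision + sensitivity
    f1 = 2 * precision * sensitivity / total if total else 0.0
    return sensitivity, precision, f1


# Illustrative counts only: 1 true positive, 1 false positive, 0 false negatives.
sensitivity, precision, f1 = classification_metrics(tp=1, fp=1, fn=0)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f} f1={f1:.2f}")
```

True negatives do not appear in any of the three formulas, which is why a strategy that rejects almost everything can still score poorly on sensitivity and F1.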

In [None]:
# print test results store

4
5
