# Product Insight Validation Using LLMs: Copilot Insights🔍

## Overview

This notebook evaluates different prompting strategies and prompts for validating Copilot Insights using a Large Language Model (LLM). The goal is to have a sandbox where we can tune prompts against the same cross-validation sets.

## Objectives

- **Compare Prompting Strategies:** Test multiple prompts and strategies to determine which yields the best classification results.
- **Evaluate Performance:** Measure the effectiveness of each strategy using precision, recall, and F1 score.
- **Cross-Validation Approach:** Utilize a labeled dataset containing:
  - **True Positives (TP):** Correctly identified valid insights.
  - **True Negatives (TN):** Correctly identified invalid insights.
  - **False Positives (FP):** Incorrectly marked invalid insights as valid.
  - **False Negatives (FN):** Incorrectly marked valid insights as invalid.

## Methodology

1. **Load Product Insights**  
   - Import CSV files containing Copilot Insights that human validators have already labeled.

2. **Apply LLM-Based Validation**  
   - Set up the building blocks for calling the LLM as a validator and for cleaning its responses.

3. **Evaluate Performance**  
   - Compute precision, recall, and F1 score to assess classification accuracy.
   - Compare the effectiveness of different strategies based on their performance metrics.
   - 3.1 Zero Shot Prompting
   - 3.2 Few Shot Prompting
   - 3.3 Multi-pass w/ Few Shot prompting

4. **Optimize for Accuracy**  
   - Identify the best-performing prompt and strategy for product insight validation.

## Tech Stack

- **LLM Provider:** Azure OpenAI  
- **Model:** ChatGPT 4.0  
- **Data Processing:** Python (pandas, numpy)  
- **Evaluation Metrics:** precision, recall, F1 score  

## Expected Outcomes

- A clear understanding of which prompting strategy yields the best results.
- A methodology/workflow that can be iteratively improved and scaled for future product insight validation tasks.



In [4]:
# let's import the packages we will need for this project

import requests # for connecting with Azure Open AI
import json # for parsing responses
import csv # for data processing
import pandas as pd # for data analysis 

# let's also import the config we will need to interact with the Azure Open AI API

from config import config_endpoint, config_key


# 1 - Load Copilot Insights

Let's take a look at the data we have. All of these Copilot Insights were validated by human validators during April '25.

These datasets act as makeshift cross-validation sets that we can use to test the performance of different prompting strategies and approaches.


We have two datasets:

- A small set of 20 cases (10 valid, 10 invalid) for quick experimentation
- A bigger set of 200 cases (100 valid, 100 invalid) for testing and measuring results

In [5]:
# We load all the data 

copilot_insights_20 = pd.read_csv('./cases/copilot_insights_cv20.csv')
copilot_insights_200 = pd.read_csv('./cases/copilot_insights_cv200.csv')

# Now let's print one of the datasets to see its shape

copilot_insights_20[:5]

Unnamed: 0,ID,Case Number,UPN,Line Of Business,Copilot Insights,Copilot Insights Validation,Copilot Insights Comment
0,28342,2503270010002818,gig_wfh_kuman@microsoftsupport.com,Business Advisor Reactive,Based on my conversation it would benefit the ...,1,Validate Copilot Insight
1,28799,2504080010002339,gig_wfh_joqui@microsoftsupport.com,Business Advisor Reactive,The customer mentioned Copilot for outlook can...,1,Valid
2,25766,"2,50329E+15",gig_wfh_shtal@microsoftsupport.com,Business Assist,"The customer, an engineering firm specializing...",1,"Valid, customer experiences regarding the usab..."
3,27427,2504060040000137,gig_wfh_naric@microsoftsupport.com,Business Advisor Reactive,He mentioned that likes using Copilot to compa...,1,Valid copilot insight
4,26643,2504040030001743,gig_wfh_rapan@microsoftsupport.com,Business Assist,The user highlighted that they are unable to u...,1,Insight is valid


In [3]:
# Column explanation
data = [
    ["Copilot Insights", "Raw Copilot Insights captured by the ambassador as they figure in the tracker"],
    ["Copilot Insights Validation", "Validation done by a human validator - 0 is invalid, 1 is valid"],
    ["Copilot Insights comment", "Comment/Explanation provided by a human validator"]
]
column_data = pd.DataFrame(data, columns=["Column Name", "Explanation"])

pd.set_option("display.max_colwidth", None) 
column_data

Unnamed: 0,Column Name,Explanation
0,Copilot Insights,Raw Copilot Insights captured by the ambassador as they figure in the tracker
1,Copilot Insights Validation,"Validation done by a human validator - 0 is invalid, 1 is valid"
2,Copilot Insights comment,Comment/Explanation provided by a human validator


## 1.1 Defining Performance

These two datasets have already been evaluated by human validators.

This means we can use the human labels as ground truth to calculate Sensitivity (Recall), Precision and F1 for each dataset, which gives us performance metrics we can analyse and optimise. We will be looking at the following metrics.

# Performance Metrics

## Sensitivity (Recall)
Sensitivity, also known as **recall**, measures the ability to correctly identify positive cases:

$$
\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

## Precision
Precision measures how many of the predicted positive cases were actually correct:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

## F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
$$
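
As a quick illustration of the three formulas above, here is a minimal sketch that computes them from confusion-matrix counts. The helper name `classification_metrics` and the counts are hypothetical, chosen only to show the arithmetic; the zero-denominator fallback to `0.0` is an assumption about how empty cases should be reported.

```python
def classification_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from confusion-matrix counts."""
    # Fall back to 0.0 when a denominator is empty (assumed convention)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1

# Hypothetical counts: 7 true positives, 2 false positives, 3 false negatives
sensitivity, precision, f1 = classification_metrics(tp=7, fp=2, fn=3)
print(round(sensitivity, 3), round(precision, 3), round(f1, 3))
# prints: 0.7 0.778 0.737
```

Note that true negatives do not appear in any of the three formulas, so a prompt that rejects everything scores zero on all of them.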


We will create a dataframe where we will store the results of our tests as we run them.

In [6]:
test_results = pd.DataFrame([], columns=["test_name", "sensitivity", "precision", "f1_score"])


test_results


Unnamed: 0,test_name,sensitivity,precision,f1_score


## 2. Setting Up Logic for LLM Validation and Analysis

### 2.1 Validation

Let's start this section by defining a function that calls Azure OpenAI with a system prompt and an input provided by the user.

The system prompt will contain the criteria to validate an insight, and the user input will be the entry registered by our agents. 

In [8]:
HEADERS = {
    "Content-Type": "application/json",
    "api-key": config_key
}

def send_prompt(system_prompt, user_prompt, max_tokens=200):
    """Send a prompt to Azure OpenAI and return the response."""
    url = config_endpoint
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens
    }
    
    try:
        # requests has no default timeout, so set one to avoid hanging on a stalled call
        response = requests.post(url, headers=HEADERS, json=data, timeout=60)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"

Let's test it out with a very naive example to make sure it works.

In [9]:
res = send_prompt(
    "You are system dedicated to validate product feedback. You will only declare as valid feedback that has to do usability issues, anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }",
    "I could not use the app at all, the menu was very convoluted and crowded with icons. Very hard to read"
)

print(res)
    

```json
{ 
  "valid": true, 
  "reason": "The feedback highlights a usability issue—the app's menu is convoluted, crowded with icons, and hard to read. This directly impacts the ease of use and navigability of the application." 
}
```


The model returns JSON wrapped in a Markdown code fence. Let's create a function to clean it.

In [10]:
def clean_llm_response(res):
    return res.replace("json", "").replace(r'\n', '').replace(r"\'", "'").replace("`", "").strip()

In [11]:
print(clean_llm_response(res))

{ 
  "valid": true, 
  "reason": "The feedback highlights a usability issue—the app's menu is convoluted, crowded with icons, and hard to read. This directly impacts the ease of use and navigability of the application." 
}


Great! We now have the basic building block for testing different validation prompts.   
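
One caveat of the string-replacement approach is that it also removes the word `json` and any backticks that appear inside the reasoning text itself. A minimal alternative sketch, assuming the reply contains a single JSON object, is to extract the `{...}` span and parse it directly. The helper name `extract_json` is hypothetical and not part of the notebook.

```python
import json
import re

def extract_json(text):
    """Parse the outermost {...} object found in an LLM reply, or return None."""
    # Greedy match from the first "{" to the last "}", across newlines
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# Build a fenced reply similar to the one the model returned above
fence = "`" * 3
reply = fence + 'json\n{"valid": true, "reason": "Mentions json export and `code` ticks."}\n' + fence
parsed = extract_json(reply)
print(parsed["valid"], parsed["reason"])
```

Returning `None` for unparseable replies (including the `"Error: ..."` strings that `send_prompt` returns on failure) lets the caller skip the row explicitly.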

###  2.2 Analysis of prompt performance

Now we need a function that allows us to do the following:

- 1. Iterate through the rows of one of our datasets.
- 2. For each row in the dataset
    - 1. Ask the LLM to validate the entry
    - 2. Evaluate whether the LLM agreed with the human validator
          - LLM => Valid, Human => Valid, then *true positive*
          - LLM => Invalid, Human => Invalid, then *true negative*
          - LLM => Valid, Human => Invalid, then *false positive*
          - LLM => Invalid, Human => Valid, then *false negative*
    - 3. Store this information
- 3. Calculate Sensitivity (Recall), Precision and F1 for this prompt
- 4. Add the results to our log in the `test_results` variable we created before

In [12]:
import time

def analyse_test_prompt(test_name, prompt, results_store, dataset):
    '''
    Evaluates the performance of a prompt 

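    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt: system prompt passed to the LLM to validate product feedback
      - results_store: dataframe where we can store the results
      - dataset: dataframe of labeled cases to evaluate the prompt against

    Returns:
      - the updated results_store dataframe
    '''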
    start_time = time.time()

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    row_counter = 0
    for index, row in dataset.iterrows():
        # pause after every 10 calls to stay under the API rate limit
        print(row_counter)
        if row_counter >= 10:
            print("Rate limit is close, continuing in 61 seconds...")
            time.sleep(61)
            row_counter = 0
                
        llm_res = send_prompt(prompt, row['Copilot Insights'])
        llm_res = clean_llm_response(llm_res)
        row_counter += 1
        print(row['Copilot Insights'],llm_res)

        try:
            llm_res = json.loads(llm_res)
        except json.JSONDecodeError as e:
            print(f"[WARN] Failed to parse LLM response as JSON: {e}")
            continue

        human_validation = row['Copilot Insights Validation']
        if llm_res['valid'] == True and human_validation == 1:
            tp += 1
        elif llm_res['valid'] == False and human_validation == 0:
            tn += 1
        elif llm_res['valid'] == True and human_validation == 0:
            fp += 1
        elif llm_res['valid'] == False and human_validation == 1:
            fn += 1

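    # guard against division by zero when the model predicts no positives (or none are parsed)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_1 = 2 * (precision * sensitivity) / (precision + sensitivity) if (precision + sensitivity) else 0.0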
    
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store
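
The fixed pause every 10 rows is one way to stay under the rate limit. An alternative sketch is to retry a failed call with exponential backoff. This assumes the call raises an exception when throttled, which is not how `send_prompt` behaves as written (it returns an `"Error: ..."` string), so it would need adapting. The helper `call_with_backoff` and the simulated failure are hypothetical.

```python
import time

def call_with_backoff(fn, max_retries=4, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for the API call: fails twice, then succeeds
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated 429 Too Many Requests")
    return '{"valid": true}'

delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
# prints: {"valid": true} [2.0, 4.0]
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in the notebook the default `time.sleep` would be used.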

Ok, our building blocks of logic are now ready. Let's start with some prompting

## 3. Testing Prompting Approaches

In this section we will test the performance of several prompting approaches to see which one performs best. Let's go!


### 3.1 Loose Zero Shot Prompt

Zero-shot prompting is a technique used with large language models (LLMs) where the model is asked to perform a task without being given any specific examples of how to do it. We're relying entirely on the model's pre-existing knowledge and understanding to generate a response.  

In the prompt below, we describe high-level criteria that human validators frequently mention when marking insights as valid or invalid. These are drawn from analysing the reasons validators gave for their decisions.


In [20]:
loose_zero_shot_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Set 1: Valid Business Goals Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Copilot Relevance: The feedback must clearly relate to the Copilot product—its performance, features, or experience.

    - B) Specificity: The entry should include specific and concrete observations, experiences, or suggestions related to Copilot.

    - C) Actionability: The feedback should be useful for improving Copilot, such as identifying enhancement opportunities, accuracy issues, pricing concerns, or deployment barriers.

## Set 2: Invalid Business Goal Criteria

Meeting any of the criteria below is enough for the entry to be considered invalid Copilot feedback.

    - D) No Copilot Experience: The customer has not used Copilot or is unaware of it.

    - E) Vague or Generic Feedback: The feedback lacks detail or does not describe specific aspects of Copilot.

    - F) Ambassador-Originated: The feedback comes from an ambassador, not the customer themselves.

    - G) Not Actionable: The entry does not provide enough information to derive improvements or insights.

    - H) No Real Feedback: The entry does not include any actual feedback, feature request, or deployment blocker.

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [23]:
# Test Loose Zero Shot Prompt here (test_name, prompt, results_store, dataset)
test_results = analyse_test_prompt('loose_zero_shot_20', loose_zero_shot_prompt, test_results, copilot_insights_20)

0
Based on my conversation it would benefit the customer if copilot comes free of usage for the first week of a new account to get users hooked on using AI with their work flow. {
  "valid": true,
  "reasoning": "The entry is valid because it clearly relates to the Copilot product (A: Copilot Relevance), provides a specific suggestion—introducing a free usage trial for a week (B: Specificity), and the suggestion is actionable as it identifies an enhancement opportunity aimed at improving adoption and user engagement (C: Actionability)."
}
1
The customer mentioned Copilot for outlook can have a feature for end users where they can have text to image so the emails can be personalized with the ideas of the moment {
  "valid": true,
  "reasoning": "The entry meets all of the valid business goals criteria. It is clearly related to Copilot (Criteria A), specifically identifies a feature suggestion related to email personalization (Criteria B), and provides actionable feedback by suggesting t

In [24]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,0.7,0.777778,0.736842
1,loose_zero_shot_20_no_negative_criteria,0.9,0.75,0.818182


This produced the following result:

sensitivity: 0.8
precision: 0.888889
f1_score: 0.842105


In [None]:
loose_zero_shot_no_negative_criteria = '''

You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Valid Copilot Insights Criteria

Meeting any of the criteria below is enough for the entry to be considered invalid Copilot feedback.

    - A) No Copilot Experience: The customer has not used Copilot or is unaware of it.

    - B) Vague or Generic Feedback: The feedback lacks detail or does not describe specific aspects of Copilot.

    - C) Ambassador-Originated: The feedback comes from an ambassador, not the customer themselves.

    - D) Not Actionable: The entry does not provide enough information to derive improvements or insights.

    - E) No Real Feedback: The entry does not include any actual feedback, feature request, or deployment blocker.

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''


In [None]:
test_results = analyse_test_prompt('loose_zero_shot_20_no_negative_criteria', loose_zero_shot_no_negative_criteria, test_results, copilot_insights_20)


0
Based on my conversation it would benefit the customer if copilot comes free of usage for the first week of a new account to get users hooked on using AI with their work flow. {
  "valid": true,
  "reasoning": "The entry includes a specific feature request related to Copilot—making it free for the first week of a new account which could benefit user adoption. This provides actionable feedback relevant to Copilot insights."
}
1
The customer mentioned Copilot for outlook can have a feature for end users where they can have text to image so the emails can be personalized with the ideas of the moment {
  "valid": true,
  "reasoning": "The entry includes a specific feature request (text-to-image functionality) that could improve Copilot for Outlook by enabling personalized emails. It provides actionable feedback and is specific enough to offer insights for improvement."
}
2
The customer, an engineering firm specializing in human-machine interfaces, uses Copilot AI to streamline email draf

In [None]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,0.8,0.888889,0.842105
1,loose_zero_shot_20_no_negative_criteria,0.9,0.75,0.818182


This produced the following result:

sensitivity: 0.9
precision: 0.75
f1_score: 0.818182


### 3.2 Detailed Zero Shot Prompt

In the prompt below, we provide a detailed list of criteria based on the latest version of the insights framework available at GigPlus. This has a detailed set of tiered criteria. 


In [13]:
detailed_zero_shot_prompt = '''
You're an AI assistant that reviews Microsoft 365 Copilot Product Feedback to ensure it provides actionable insights for product improvement. Valid feedback should be specific, relevant to Copilot, and help Microsoft refine the product based on real user experience.
 
What Makes Strong Copilot Feedback?
✅ Mentions a specific Copilot feature or aspect - Feedback should focus on a particular functionality, not general comments. ✅ Explains how the feature impacts business goals - How does it support or fail to meet customer needs? ✅ Describes user experience or workflow impact - Does it affect productivity, efficiency, or satisfaction? ✅ Provides actionable details - Can the product team investigate or improve based on this feedback?
 
When is Feedback Invalid?
❌ Customer hasn't used Copilot or is unaware of it - Feedback should come from real experience. ❌ Too generic or vague - It doesn't specify Copilot features or effects. ❌ Ambassador recommendation, not customer feedback - Feedback should come from users, not internal suggestions. ❌ Lack of actionable details - If it doesn't help Microsoft refine Copilot, it's invalid. ❌ No actual feedback or request - Entries must provide a real issue, feature request, or usability concern.

## Response ##

You will respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [14]:
test_results = analyse_test_prompt('detailed_zero_shot', detailed_zero_shot_prompt, test_results, copilot_insights_20)


0
Based on my conversation it would benefit the customer if copilot comes free of usage for the first week of a new account to get users hooked on using AI with their work flow. {
  "valid": false,
  "reasoning": "This feedback does not come from a real user experience or actual usage of Copilot. Instead, it is a recommendation about pricing strategy or onboarding incentives, which falls outside of actionable feedback related to refining specific features in Copilot. Additionally, it lacks details about functional aspects, impact on workflows, or productivity."
}
1
The customer mentioned Copilot for outlook can have a feature for end users where they can have text to image so the emails can be personalized with the ideas of the moment {
  "valid": true,
  "reasoning": "The feedback is specific and mentions a Copilot feature for Outlook, suggesting a text-to-image capability to enhance email personalization. It identifies a potential improvement that can be implemented and relates to us


In [16]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,detailed_zero_shot,0.5,0.833333,0.625


### 3.3 Few-Shot Prompt

Few-shot prompting is a technique in prompt engineering that aims to augment LLMs by providing a small number of examples within the prompt itself. This allows the model to learn and adapt to a specific task without requiring extensive fine-tuning.

In the prompt below, we will provide a few positive and negative examples for each of the categories, and see its impact on performance. 
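
To make the idea concrete, here is a minimal sketch of assembling a few-shot prompt from a list of labeled examples. The helper `build_few_shot_prompt` and its layout are hypothetical, not the way the prompt in this notebook was written. The example texts are taken from the prompts in this notebook, and the short comments attached to them are illustrative.

```python
import json

def build_few_shot_prompt(instructions, examples):
    """Append labeled examples to the task instructions."""
    parts = [instructions.strip(), "", "### Examples:"]
    for i, (text, is_valid, comment) in enumerate(examples, start=1):
        label = "Valid" if is_valid else "Invalid"
        parts.append(f"Example {i} ({label})")
        parts.append(f'Input: "{text}"')
        # Show the expected output in the same JSON shape the response should use
        parts.append(json.dumps({"valid": is_valid, "reasoning": comment}))
        parts.append("")
    return "\n".join(parts)

examples = [
    ("Customer requested Copilot in Teams to summarize long chat threads more accurately.",
     True, "Specific feature and actionable request."),
    ("Our team hates Copilot. It's just too annoying.",
     False, "Vague; no specific feature."),
]
few_shot_sketch = build_few_shot_prompt("Validate Copilot product feedback.", examples)
print(few_shot_sketch)
```

Generating the example outputs with `json.dumps` keeps their field names identical to the ones requested in the response format, which avoids showing the model two different schemas.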

In [21]:
few_shot_prompt = '''

You are an AI assistant that validates Microsoft 365 Copilot product feedback (suggestions, bug reports, usability challenges, etc.) based on four criteria:
 
(1) Specificity - Identifies a specific feature or aspect of Copilot
(2) Alignment With Objectives - Explains how that feature supports or fails the customer's business goals
(3) Impact on Workflows or Satisfaction - Describes the effect on the user's experience or processes
(4) Actionability - Offers enough detail for the product team to investigate or make improvements

If the feedback meets these criteria, respond:
{
  "CopilotPFvalid": "Valid",
  "CopilotPFcomment": ""
}
Otherwise, respond:
{
  "CopilotPFvalid": "Invalid",
  "CopilotPFcomment": ""
}
 
### Examples:
 
• Example 1 (Valid)
“Copilot's 'Summarize Document' feature doesn't handle spreadsheets well. Our finance team needs accurate summaries for weekly budget reports. Because the data can't be summarized, we have to do it manually, wasting hours. We need Copilot to parse tables in Excel files more accurately.”
- Specific feature named.
- States the business objective (weekly budget reports).
- Describes workflow impact (manual process).
- Actionable improvement (parse Excel tables better).
 
Valid Output:
{
  "valid": true,
  "CopilotPFcomment": "Valid — Identifies feature, aligns with business needs, describes the problem's impact and calls for a fix. This is actionable."
}
 
• Example 2 (Invalid)
“Our team hates Copilot. It's just too annoying.”
- No specific feature.
- No tie to any business objective.
- Not actionable.
 
Invalid Output:
{
  "valid": false,
  "CopilotPFcomment": "Invalid — Vague feedback. Lacks specifics on what's annoying or how it hinders business goals. Not actionable from a PLG perspective; needs more detail."
}
 
## Response ##

You will always respond in JSON format with only the following fields and no more:

* valid - make it true or false
* reasoning - add your reasoning based on the criteria set above

'''

In [22]:
test_results = analyse_test_prompt('detailed_few_shot_20', few_shot_prompt, test_results, copilot_insights_20)


0
Based on my conversation it would benefit the customer if copilot comes free of usage for the first week of a new account to get users hooked on using AI with their work flow. {
  "valid": false,
  "reasoning": "Invalid — While the feedback suggests a promotional idea (free trial for a week) and hints at enhancing user adoption, it does not identify a specific feature or aspect of Copilot, tie the suggestion explicitly to a business objective, describe the impact on workflows or satisfaction, nor offer enough detail to make it actionable for the product team. It requires clearer alignment with product functionality and objectives."
}
1
The customer mentioned Copilot for outlook can have a feature for end users where they can have text to image so the emails can be personalized with the ideas of the moment {
  "valid": false,
  "reasoning": "Invalid — While the feedback suggests a potential feature (text-to-image for personalized emails), it does not identify a specific existing featu

In [23]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,detailed_zero_shot,0.5,0.833333,0.625
1,detailed_few_shot_20,0.3,1.0,0.461538


In [24]:
detailed_few_shot_v2_cot = '''
You are an AI assistant that reviews **Microsoft 365 Copilot Product Feedback** to ensure it provides actionable insights for product improvement. Valid feedback should reflect real customer experience with Copilot and help Microsoft refine the product based on specific issues, needs, or limitations.

---

## ✅ What Makes Valid Copilot Feedback?

A **Valid** entry should include:
1. **Specificity:** Clearly mentions a **specific Copilot feature or functionality** (not just “Copilot in general”).
2. **Business Relevance:** Explains how the Copilot feature supports or fails to support the **customer’s business goals or use case**.
3. **Impact:** Describes how it affects productivity, satisfaction, workflow, or decision-making.
4. **Actionability:** Gives Microsoft product teams something they could **investigate, improve, or build upon**.

---

## When Is Feedback Invalid?

Mark the entry as **Invalid** if:
- The entry says that the customer hasn’t used Copilot or isn’t aware of it
- The feedback is **generic**, vague, or lacks any specific feature
- It’s a **suggestion from an internal ambassador**, not customer feedback
- It’s missing real insight (e.g., no issue, request, or pain point)
- It lacks **actionable detail**

---

## Valid Examples

- Customer mentioned they struggle with generating high-quality email drafts using Copilot in Outlook and would like more control over tone and formality.
- Customer requested Copilot in Teams to summarize long chat threads more accurately, as sometimes key action points are missing.
- Customer reported that Copilot in Word does not always align with their company’s writing style guide when generating proposals.
- Customer asked if Copilot in PowerPoint could provide more creative slide layouts instead of just reformatting existing content to improve quality of produced content.

---

## Invalid Examples

- Copilot is an amazing tool that simplifies tasks across all Office apps
- A small business owner expressed interest in AI tools but provided vague suggestions.Copilot should be able to browse the web in real-time and fetch live news updates automatically.
- "ChatGPT provides better responses than Copilot, which feels less human-like.

---

## 🧠 Chain of Thought (How to Think Before Deciding)

- Is the feedback clearly tied to a **specific part** of Copilot?
- Does it describe the **customer's experience**, not just general praise or confusion?
- Does it show **impact** on the user’s workflow or goals?
- Could a Microsoft engineer or PM reasonably take action based on this?

If the answer is mostly “yes,” mark as **valid**. If mostly “no,” mark as **invalid**.

---

## 📦 Output Format

Do not include any of the reasoning in your response. Respond in **JSON only**, with the following fields:

```json
{
  "valid": true,
  "reasoning": "A short explanation of your reasoning."
}

'''

In [25]:
test_results = analyse_test_prompt('detailed_few_shot_v2_cot_20', detailed_few_shot_v2_cot , test_results, copilot_insights_20)


0
Based on my conversation it would benefit the customer if copilot comes free of usage for the first week of a new account to get users hooked on using AI with their work flow. {
  "valid": false,
  "reasoning": "This feedback is a suggestion regarding pricing or onboarding strategy rather than a specific issue or feature related to Copilot functionality or experience."
}
1
The customer mentioned Copilot for outlook can have a feature for end users where they can have text to image so the emails can be personalized with the ideas of the moment {
  "valid": true,
  "reasoning": "The feedback suggests a specific enhancement for Copilot in Outlook, proposing text-to-image capabilities to improve email personalization. This is actionable and tied to a particular business use case."
}
2
The customer, an engineering firm specializing in human-machine interfaces, uses Copilot AI to streamline email drafting and enhance efficiency. They are curious about the sources of information Copilot rel

In [26]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,detailed_zero_shot,0.5,0.833333,0.625
1,detailed_few_shot_20,0.3,1.0,0.461538
2,detailed_few_shot_v2_cot_20,0.4,1.0,0.571429


In [None]:
def analyse_multipass_prompt(test_name, prompt_product, prompt_deployment, results_store, true_positives, true_negatives, false_positives, false_negatives):
    '''
    Evaluates the performance of a prompt 

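    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt_product: system prompt passed to the LLM to validate product feedback
      - prompt_deployment: system prompt passed to the LLM to validate deployment blockers
      - results_store: dataframe where we can store the results
      - true_positives, false_negatives: dataframes of cases the model is expected to mark as valid
      - true_negatives, false_positives: dataframes of cases the model is expected to mark as invalid
    '''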
    start_time = time.time()
    
    dataframes = [
        [true_positives, True], # first value contains the data, the second what we would like the model to return for every row
        [true_negatives, False], # for instance, the llm should evaluate all true positives as valid to have 100% accuracy 
        [false_positives, False],
        [false_negatives, True]
    ]

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    llm_call_counter = 0
    for dataframe in dataframes:
        data, expectation = dataframe
        
        for index, row in data.iterrows():
            # avoid token limit if needed every 10 rows 
            print(llm_call_counter)
            if llm_call_counter >= 10:
                print(f"Rate limit is close, continuing in {60} seconds...")
                time.sleep(61)
                llm_call_counter = 0
                
                
            llm_res = send_prompt(prompt_product, row['Feedback'])
            llm_res = clean_llm_response(llm_res)
            llm_call_counter += 1

            print(llm_res)
            # if the first pass disagrees with the expected label, we use the other prompt
            try:
                llm_res = json.loads(llm_res)
            except json.JSONDecodeError as e:
                print(f"Error parsing JSON: {e}, moving to next case")
                continue
            if expectation != llm_res['valid']:
                llm_res = send_prompt(prompt_deployment, row['Feedback'])
                llm_res = clean_llm_response(llm_res)
                llm_call_counter += 1
                try:
                    llm_res = json.loads(llm_res)
                except json.JSONDecodeError as e:
                    print(f"Error parsing JSON: {e}, moving to next case")  # Log the error
                    continue  # skip this row (a bare `next` is a no-op in Python)
                
            
            
            print(llm_res)
            
            if expectation == True and llm_res['valid'] == True:
                tp += 1
            elif expectation == True and llm_res['valid'] == False:
                fn += 1
            elif expectation == False and llm_res['valid'] == True:
                fp += 1
            elif expectation == False and llm_res['valid'] == False:
                tn += 1

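    # guard against division by zero when the model predicts no positives (or none are parsed)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_1 = 2 * (precision * sensitivity) / (precision + sensitivity) if (precision + sensitivity) else 0.0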
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store
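
Note that `analyse_multipass_prompt` decides whether to run the second prompt by comparing the first answer with the human label, which is not available for unlabeled data. One possible label-free variant, offered as an assumption about the intended design rather than as this notebook's method, is to accept an entry as soon as any pass marks it valid. The function `multipass_validate` and the keyword-based stand-ins below are hypothetical.

```python
def multipass_validate(entry, validators):
    """Return (is_valid, pass_index): valid as soon as any pass accepts the entry."""
    for i, validate in enumerate(validators):
        if validate(entry):
            return True, i
    return False, None

# Stand-ins for the two LLM passes (simple keyword checks, not real prompts)
def product_pass(entry):
    return "feature" in entry.lower()

def deployment_pass(entry):
    return "blocker" in entry.lower()

outcome = multipass_validate("Licensing is a deployment blocker", [product_pass, deployment_pass])
print(outcome)
# prints: (True, 1)
```

In a real run each stand-in would wrap a call to `send_prompt` with the corresponding system prompt and return the parsed `valid` field.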


Now, let's break down the prompts with their examples

## Results

Here we have the performance of several prompting strategies on our makeshift cross validation sample. 

In [None]:
# print test results store

