# Product Insight Validation Using LLMs: M365 Product Feedback🔍

## Overview

This notebook evaluates different prompting strategies and prompts for validating M365 Product Feedback using a Large Language Model (LLM). The goal is to have a sandbox where we can refine prompts against the same cross-validation sets.

## Objectives

- **Compare Prompting Strategies:** Test multiple prompts and strategies to determine which yields the best classification results.
- **Evaluate Performance:** Measure the effectiveness of each strategy using precision, recall, and F1 score.
- **Cross-Validation Approach:** Utilize a labeled dataset containing:
  - **True Positives (TP):** Correctly identified valid insights.
  - **True Negatives (TN):** Correctly identified invalid insights.
  - **False Positives (FP):** Incorrectly marked invalid insights as valid.
  - **False Negatives (FN):** Incorrectly marked valid insights as invalid.

## Methodology

1. **Load Product Insights**  
   - Import CSV files containing human-validated M365 product feedback.

2. **Apply LLM-Based Validation**  
   - Build the functions that send each entry to the LLM for validation and clean its responses.

3. **Evaluate Performance**  
   - Compute precision, recall, and F1 score to assess classification accuracy.
   - Compare the effectiveness of different strategies based on their performance metrics.
   - 3.1 Zero Shot Prompting
   - 3.2 Few Shot Prompting
   - 3.3 Multi-pass w/ Few Shot prompting

4. **Optimize for Accuracy**  
   - Identify the best-performing prompt and strategy for product insight validation.

## Tech Stack

- **LLM Provider:** Azure OpenAI  
- **Model:** ChatGPT 4.0  
- **Data Processing:** Python (pandas, numpy)  
- **Evaluation Metrics:** precision, recall, F1 score  

## Expected Outcomes

- A clear understanding of which prompting strategy yields the best results.
- A methodology/workflow that can be iteratively improved and scaled for future product insight validation tasks.



In [1]:
# let's import the packages we will need for this project

import requests # for connecting with Azure Open AI
import json # for parsing responses
import csv # for data processing
import pandas as pd # for data analysis 

# let's also import the config we will need to interact with the Azure Open AI API

from config import config_endpoint, config_key


## 1. Load M365 Product Feedback

Let's take a glimpse at the data we have. All of these feedback entries were validated by human validators during April '25.

These datasets will act as makeshift cross-validation sets that we can use to test the performance of different prompting strategies and approaches.


We have two datasets:

- A small set of 20 cases (10 valid, 10 invalid) for quick experimentation
- A bigger set of 200 cases (100 valid, 100 invalid) for testing and measuring results

In [13]:
# We load all the data 

m365_product_feedback_20 = pd.read_csv('./cases/m365_feedback_cv20.csv') # a validation run on this set takes ~2-3 min
m365_product_feedback_200 = pd.read_csv('./cases/m365_feedback_cv200.csv') # a validation run on this set takes ~25-30 min

# Now let's print one of the datasets to see its shape

m365_product_feedback_20[:5]

Unnamed: 0,ID,Case Number,UPN,Line Of Business,Product Feedback and Limitations,Product Feedback and Limitations Validation,Product Feedback and Limitations Comment
0,27636,2504050040000885,gig_wfh_udmem@office365support.com,Business Advisor Reactive,"\nCustomer complained that the domain verification is for the users with have IT skills, and not al of the users are , Microsoft should create a system when adding the domain to the Microsoft 365 will not have to be manually, it it can be done by just clicking on a button so that the domain can be looked into.",1,Feedback is valid
1,25781,2504010040001913,gig_wfh_alass@microsoftsupport.com,Proactive Grace,Feedback and limitations: The customer noted that the admin portal is vague and not straightforward because it is hard to find what the customer's looking for in the long menus. The customer justified his feedback by a situation he experienced which that he has been trying for 2 years to change and cease the subscription renewal cycle from the billing section in the portal but couldn't figure it out himself. \n\nProduct Feedback and Limitations:,1,Valid
2,27911,2503310010001432,gig_wfh_hariv@microsoftsupport.com,Business Assist,"\nThe customer seeks to restrict entry-level employees' usage of M365 resources and control their access without subscribing to additional add-in licenses. They require a solution similar to Apple Business Manager, which allows them to limit device functionality and user experience without incurring extra costs for user-based licenses, focusing solely on device subscriptions.",1,valid
3,26153,2504020030001649,gig_wfh_jibal@microsoftsupport.com,Business Advisor Reactive,"M365 Product Feedback: \nIt would be beneficial for the customer if, when verifying their business domain from a third-party domain host, Microsoft would simply ask for a sign-in page. This would make it easier to add the DNS records instantly and save time on verification.",1,Valid
4,26494,2503310040002168,gig_wfh_avadv@microsoftsupport.com,Trials Nurturing Proactive,"Feedback and limitations: The customer emphasized that integrating Forms and SharePoint for appointment scheduling greatly enhances their shop's operations. Forms collects detailed booking information accurately, while SharePoint securely organizes data for staff access and scheduling. They particularly value the simplicity of using a QR code to direct clients to the system, streamlining the process and improving overall efficiency.\n\nProduct Feedback and Limitations:",1,Valid


In [3]:
# Column explanation
data = [
    ["Product Feedback and Limitations", "Raw M365 product feedback captured by the ambassador as they figure in the tracker"],
    ["Product Feedback and Limitations Validation", "Validation done by a human validator - 0 is invalid, 1 is valid"],
    ["Product Feedback and Limitations Comment", "Comment/Explanation provided by a human validator"]
]
column_data = pd.DataFrame(data, columns=["Column Name", "Explanation"])

pd.set_option("display.max_colwidth", None) 
column_data

Unnamed: 0,Column Name,Explanation
0,Product Feedback and Limitations,Raw M365 product feedback captured by the ambassador as they figure in the tracker
1,Product Feedback and Limitations Validation,"Validation done by a human validator - 0 is invalid, 1 is valid"
2,Product Feedback and Limitations Comment,Comment/Explanation provided by a human validator


### 1.1 Defining Performance

These two datasets have already been evaluated by human validators.

This means we can use the existing labels to calculate sensitivity (recall), precision and F1 for each prompt, which gives us performance metrics we can analyse and optimise. We will be looking at the following metrics.

# Performance Metrics

## Sensitivity (Recall)
Sensitivity, also known as **recall**, measures the ability to correctly identify positive cases:

$$
\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

## Precision
Precision measures how many of the predicted positive cases were actually correct:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

## F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
$$
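
As a quick sanity check, the three formulas above can be sketched in a few lines of Python. The helper name `compute_metrics` and the counts below are illustrative assumptions, not values taken from a recorded run; the zero-denominator guards are also an addition for safety.

```python
def compute_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from confusion-matrix counts."""
    # Guard against empty denominators, e.g. when nothing is predicted valid
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1


# Illustrative counts only: 7 true positives, 7 false positives, 3 false negatives
sensitivity, precision, f1 = compute_metrics(tp=7, fp=7, fn=3)
print(f"sensitivity={sensitivity:.3f} precision={precision:.3f} f1={f1:.3f}")
# prints: sensitivity=0.700 precision=0.500 f1=0.583
```

Note that true negatives do not appear in any of the three formulas, so a prompt that marks everything as valid can still reach a sensitivity of 1.0; precision is what exposes that behaviour.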


We will create a dataframe where we will store the results of our tests as we run them.

In [4]:
test_results = pd.DataFrame([], columns=["test_name", "sensitivity", "precision", "f1_score"])


test_results


Unnamed: 0,test_name,sensitivity,precision,f1_score


## 2. Setting Up Logic for LLM Validation and Analysis

### 2.1 Validation

Let's start this section by defining a function that calls Azure OpenAI with a system prompt and an input provided by the user.

The system prompt will contain the criteria to validate an insight, and the user input will be the entry registered by our agents. 

In [5]:
HEADERS = {
    "Content-Type": "application/json",
    "api-key": config_key
}

def send_prompt(system_prompt, user_prompt, max_tokens=200):
    """Send a prompt to Azure OpenAI and return the response."""
    url = config_endpoint
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens
    }
    
    try:
        response = requests.post(url, headers=HEADERS, json=data, timeout=60)  # timeout so a stalled request cannot hang the run
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"

Let's test it out with a very naive example to make sure it works.

In [6]:
res = send_prompt(
    "You are system dedicated to validate product feedback. You will only declare as valid feedback that has to do usability issues, anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }",
    "I could not use the app at all, the menu was very convoluted and crowded with icons. Very hard to read"
)

print(res)
    

```json
{
  "valid": true,
  "reason": "The feedback highlights a usability issue. A convoluted menu and difficulty in reading due to crowded icons directly impact the user's ability to navigate and use the app effectively."
}
```


The model is giving us back JSON wrapped in a Markdown code fence. Let's create a function to clean it.

In [7]:
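import re
def clean_llm_response(res):
    # Strip only the surrounding Markdown fence (``` or ```json); a blanket replace("json", "") would also alter the word inside the reply text
    return re.sub(r"^```(?:json)?\s*|\s*```$", "", res.strip()).replace(r"\'", "'").strip()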

In [8]:
print(clean_llm_response(res))

{
  "valid": true,
  "reason": "The feedback highlights a usability issue. A convoluted menu and difficulty in reading due to crowded icons directly impact the user's ability to navigate and use the app effectively."
}


Great! We now have the basic building block for testing different validation prompts.   
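
Since the cleaned string still has to be turned into a Python object, here is a minimal sketch of a parsing step that tolerates malformed replies. `parse_llm_json` is a hypothetical helper, not part of this notebook; it assumes the reply is either bare JSON or JSON wrapped in a single Markdown fence, and returns `None` for anything else (for example the `"Error: ..."` string that `send_prompt` returns on a failed request).

```python
import json
import re


def parse_llm_json(raw):
    """Return the parsed dict, or None when the reply is not a JSON object."""
    # Remove an optional leading ``` / ```json fence and a trailing ``` fence
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Only accept JSON objects, since later code reads fields by key
    return parsed if isinstance(parsed, dict) else None


fenced = '```json\n{"valid": true, "reason": "mentions json export"}\n```'
print(parse_llm_json(fenced))            # {'valid': True, 'reason': 'mentions json export'}
print(parse_llm_json("Error: timeout"))  # None
```

Returning `None` rather than raising lets the caller decide whether an unparseable reply should be skipped, retried, or counted as a failure.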

###  2.2 Analysis of prompt performance

Now we need a function that allows us to do the following:

1. Iterate through the rows of one of our datasets.
2. For each row:
    1. Ask the LLM to validate the entry.
    2. Evaluate whether the LLM agreed with the human validator:
        - LLM => Valid, Human => Valid, then *true positive*
        - LLM => Invalid, Human => Invalid, then *true negative*
        - LLM => Valid, Human => Invalid, then *false positive*
        - LLM => Invalid, Human => Valid, then *false negative*
    3. Store this information.
3. Calculate sensitivity (recall), precision and F1 for this prompt.
4. Add the results to our log in the `test_results` variable we created before.
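
The four outcome branches listed above can be sketched in isolation before wiring them into the full loop. `classify_outcome` is a hypothetical helper used only for illustration; it assumes the LLM verdict is already a Python boolean and the human label is `0` or `1`, as in the datasets loaded earlier.

```python
from collections import Counter


def classify_outcome(ai_valid: bool, human_label: int) -> str:
    """Map one (LLM verdict, human label) pair to a confusion-matrix cell."""
    if ai_valid:
        return "True Positive" if human_label == 1 else "False Positive"
    return "False Negative" if human_label == 1 else "True Negative"


# Made-up (LLM verdict, human label) pairs, for illustration only
pairs = [(True, 1), (False, 0), (True, 0), (False, 1), (True, 1)]
counts = Counter(classify_outcome(ai, human) for ai, human in pairs)
print(counts["True Positive"], counts["False Positive"])  # 2 1
```

Tallying the labels with a `Counter` keeps the counting separate from the classification, which makes each piece easy to check on its own.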

In [15]:
import time

def analyse_test_prompt(test_name, prompt, results_store, dataset):
    '''
    Returns a list with
    - The results store, updated with the performance of the prompt
    - A comparison dataframe that helps compare AI and validator scores and comments.

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt: system prompt passed to the LLM to validate product feedback
      - results_store: dataframe where we can store the results
      - dataset: dataframe of human-validated feedback entries to run the prompt against
    '''
    comparison_columns = [
        "Insight",
        "AI validation",
        "human validation",
        "AI comment",
        "human comment",
        "Result type"
    ]

    comparison_dataframe = pd.DataFrame(columns=comparison_columns)
    
    start_time = time.time()

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    row_counter = 0
    for index, row in dataset.iterrows():
        # pause after every 10 requests to stay under the API rate limit
        print(row_counter)
        if row_counter >= 10:
            print("Rate limit is close, continuing in 61 seconds...")
            time.sleep(61)
            row_counter = 0
                
        llm_res = send_prompt(prompt, row['Product Feedback and Limitations'])
        llm_res = clean_llm_response(llm_res)
        row_counter += 1
        print(row['Product Feedback and Limitations'],llm_res)

        try:
            llm_res = json.loads(llm_res)
        except json.JSONDecodeError as e:
            print(f"[WARN] Failed to parse LLM response as JSON: {e}")
            continue

        human_validation = row['Product Feedback and Limitations Validation']
        ai_valid = llm_res.get('valid')
        if ai_valid is True and human_validation == 1:
            tp += 1
            result_type = "True Positive"
        elif ai_valid is False and human_validation == 0:
            tn += 1
            result_type = "True Negative"
        elif ai_valid is True and human_validation == 0:
            fp += 1
            result_type = "False Positive"
        elif ai_valid is False and human_validation == 1:
            fn += 1
            result_type = "False Negative"
        else:
            # skip replies whose 'valid' field is missing or not a boolean
            print(f"[WARN] Unexpected 'valid' value: {ai_valid!r}")
            continue

        new_comparison_row = {
            "Insight": row['Product Feedback and Limitations'],
            "AI validation": llm_res['valid'],
            "human validation": human_validation,
            "AI comment": llm_res.get('reasoning', llm_res.get('m365PFcomment', '')),  # prompts in this notebook use either key
            "human comment": row['Product Feedback and Limitations Comment'],
            "Result type": result_type
        }

        comparison_dataframe.loc[len(comparison_dataframe)] = new_comparison_row


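    # guard against empty denominators (e.g. the model never predicts "valid")
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_1 = 2 * (precision * sensitivity) / (precision + sensitivity) if (precision + sensitivity) else 0.0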
    
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return [results_store, comparison_dataframe]
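
The loop above pauses for a fixed 61 seconds after every 10 requests. One alternative, shown here only as a sketch of a different technique, is to retry a failed call with exponential backoff. `call_with_backoff` is a hypothetical helper and assumes the wrapped function raises an exception on failure; `send_prompt` as written returns an error string instead, so it would need to re-raise for this pattern to apply.

```python
import time


def call_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake call that fails twice, recording delays instead of sleeping
attempts = {"count": 0}
recorded_delays = []

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate-limit error")
    return "ok"

print(call_with_backoff(flaky_call, sleep=recorded_delays.append))  # ok
print(recorded_delays)  # [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in a notebook run the default `time.sleep` would be used.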

Ok, our building blocks of logic are now ready. Let's start with some prompting.

## 3. Testing Prompting Approaches

In this section we will test several prompting approaches to see which one performs best. Let's go!


### 3.1 Loose Zero Shot Prompt

Zero-shot prompting is a technique used with large language models (LLMs) where the model is asked to perform a task without being given any specific examples of how to do it. We're relying entirely on the model's pre-existing knowledge and understanding to generate a response.  

In the prompt below, we describe the high-level criteria that human validators most frequently cite when marking insights as valid or invalid. These criteria are drawn from an analysis of the validators' comments.


In [16]:
loose_zero_shot_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Set 1: Valid Business Goals Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Clarity: the entry mentions a business goal or need that is clear and easy to understand

    - B) Specificity: The entry should clearly refer to a specific, concrete business goal or need

    - C) Actionable: The entry should focus on the practical applicability of Microsoft 365 products to address the business need described

## Set 2: Invalid Business Goal Criteria

Meeting any of the criteria below is enough for the entry to be considered invalid.

    - D) Focus on tools: The entry just lists the M365 applications being used, but there is no business goal or need mentioned

    - E) Vague business goal or need: The entry does not include any details nor actionable business goals/needs

    - F) Technical issue: The entry only describes a technical issue experienced by the customer and there is no business goal or need

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [17]:
# Test Loose Zero Shot Prompt here 

In [18]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score


In [None]:
loose_zero_shot_no_negative_criteria = """
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Valid Business Goals Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Clarity: the entry mentions a business goal or need that is clear and easy to understand

    - B) Specificity: The entry should clearly refer to a specific, concrete business goal or need

    - C) Actionable: The entry should focus on the practical applicability of Microsoft 365 products to address the business need described

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

"""

In [None]:
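# analyse_test_prompt returns [results_store, comparison_dataframe], so unpack both
test_results, comparison_dataframe = analyse_test_prompt('loose_zero_shot_200_no_negative_criteria', loose_zero_shot_no_negative_criteria, test_results, m365_product_feedback_200)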


In [36]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,1.0,0.615385,0.761905
1,loose_zero_shot_20_no_negative_criteria,1.0,0.642857,0.782609


### 3.2 Detailed Zero Shot Prompt

In the prompt below, we provide a detailed list of criteria based on the latest version of the insights framework available at GigPlus. This has a detailed set of tiered criteria. 


In [11]:
detailed_zero_shot_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

**Important:** *“Limitations”* in this context are specific product constraints or deficiencies identified by customers – for example, a technical glitch, a missing capability, or a UX issue that hindered the product’s effectiveness or user satisfaction. Recognizing and describing such limitations clearly (especially when they recur or affect broader usage) is crucial for product teams to address them.

## Validation Criteria:

A Microsoft 365 product feedback entry is **Valid** if it meets **all** the following:

- **Specificity:** The feedback clearly refers to a **specific product feature or element**. (It should name the feature or function in question – e.g. “SharePoint file version history” rather than a vague statement like “the system is slow.”)

- **Support of Objectives:** It connects that feature to the **customer’s business objectives or use case**. (It should explain *why* this feature matters or what goal isn’t being met – e.g. “This limitation prevents our compliance team from auditing changes, affecting our regulatory workflow.”)

- **Impact on Customer Journey:** It provides detailed context on **how this issue affects the customer’s workflow, productivity, or satisfaction**. (Does it slow them down? Cause frustration? Occur at a critical stage in their process? The feedback should make that clear.)

- **Actionability:** The insight is presented in a way that could lead to a specific improvement. (It’s something that Microsoft’s product/engineering teams can investigate or act on – not an ambiguous “it’s bad” comment with no details.)

- **(For Support Cases)** If the feedback arises from a support ticket, it should describe a **product flaw beyond that one incident**. (For example, if a user had an issue that support fixed by proper configuration, that’s not product feedback – but if the issue persists even after correct setup, or it’s a known limitation that support couldn’t resolve, then it’s valid feedback. In short, it shouldn’t just re-state the exact problem of an open ticket unless it’s indicating a broader product deficiency or recurring bug.)

If the feedback meets all the above criteria, mark it **Valid**. If it is missing one or more of these elements (for example, it’s too vague, doesn’t explain business impact, or isn’t something the product team can act on), mark it **Invalid**.

### Output Format:

Respond in **JSON** only, with the keys:
- `"valid"`: true or false.
- `"m365PFcomment"`: A brief explanation with your reasoning

'''

In [19]:
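# analyse_test_prompt returns [results_store, comparison_dataframe], so unpack both
test_results, comparison_dataframe = analyse_test_prompt('detailed_zero_shot_20', detailed_zero_shot_prompt, test_results, m365_product_feedback_20)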


0
 
Customer complained that the domain verification is for the users with have IT skills, and not al of the users are , Microsoft should create a system when adding the domain to the Microsoft 365 will not have to be manually, it it can be done by just clicking on a button so that the domain can be looked into. {
  "valid": false,
  "m365PFcomment": "The feedback is too vague and lacks specific details about the current domain verification process in Microsoft 365, the business impact, and how the manual steps affect customer workflows or objectives. While it suggests an improvement, it doesn't clarify the limitation well enough for it to be actionable by the product team."
}
1
Feedback and limitations: The customer noted that the admin portal is vague and not straightforward because it is hard to find what the customer's looking for in the long menus. The customer justified his feedback by a situation he experienced which that he has been trying for 2 years to change and cease the su

In [20]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,detailed_zero_shot_20,0.4,0.5,0.444444


### 3.3 Few-Shot Prompt

Few-shot prompting is a technique in prompt engineering that aims to augment LLMs by providing a small number of examples within the prompt itself. This allows the model to learn and adapt to a specific task without requiring extensive fine-tuning.

In the prompt below, we will provide a few positive and negative examples for each of the categories, and see its impact on performance. 
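
To make the structure of a few-shot prompt concrete, here is a minimal sketch that assembles one from a criteria string and two lists of labelled examples. `build_few_shot_prompt` and its section headings are assumptions made for illustration; the prompts in this notebook are written by hand, and the two sample entries are borrowed from the example lists used later in this section.

```python
def build_few_shot_prompt(criteria, valid_examples, invalid_examples):
    """Assemble a system prompt: criteria, labelled examples, output format."""
    lines = [criteria.strip(), "", "## Valid Examples", ""]
    lines += [f"- {example}" for example in valid_examples]
    lines += ["", "## Invalid Examples", ""]
    lines += [f"- {example}" for example in invalid_examples]
    lines += [
        "",
        "## Output Format",
        "",
        'Respond in JSON only, with the keys "valid" (true or false) and "reasoning".',
    ]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "You are an AI assistant that validates Microsoft 365 Product Feedback entries.",
    valid_examples=[
        "Teams users need the ability to bulk delete messages in chat for better conversation management",
    ],
    invalid_examples=[
        "customer is satisfied with Microsoft products especially OneDrive because it allows them to store and access files from any device seamlessly.",
    ],
)
print(prompt)
```

Keeping the examples in lists makes it easy to swap them between runs and measure how each set changes the metrics, while the criteria and output format stay fixed.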

In [23]:
few_shot_prompt = '''

You are an AI assistant that validates **Microsoft 365 Product Feedback** entries. These entries capture customer-reported **suggestions, bug reports, opinions (with concrete examples), shortcomings, technical issues, missing features, or usability challenges** related to Microsoft 365 (excluding Copilot). Product feedback should provide **actionable insights** that help Microsoft understand how well the product meets customer needs and where it falls short, so the product team can consider improvements.

**Important:** *“Limitations”* in this context are specific product constraints or deficiencies identified by customers – for example, a technical glitch, a missing capability, or a UX issue that hindered the product’s effectiveness or user satisfaction. Recognizing and describing such limitations clearly (especially when they recur or affect broader usage) is crucial for product teams to address them.

## Validation Criteria:

A Microsoft 365 product feedback entry is **Valid** if it meets **all** the following:

- **Specificity:** The feedback clearly refers to a **specific product feature or element**. (It should name the feature or function in question – e.g. “SharePoint file version history” rather than a vague statement like “the system is slow.”)

- **Support of Objectives:** It connects that feature to the **customer’s business objectives or use case**. (It should explain *why* this feature matters or what goal isn’t being met – e.g. “This limitation prevents our compliance team from auditing changes, affecting our regulatory workflow.”)

- **Impact on Customer Journey:** It provides detailed context on **how this issue affects the customer’s workflow, productivity, or satisfaction**. (Does it slow them down? Cause frustration? Occur at a critical stage in their process? The feedback should make that clear.)

- **Actionability:** The insight is presented in a way that could lead to a specific improvement. (It’s something that Microsoft’s product/engineering teams can investigate or act on – not an ambiguous “it’s bad” comment with no details.)

- **(For Support Cases)** If the feedback arises from a support ticket, it should describe a **product flaw beyond that one incident**. (For example, if a user had an issue that support fixed by proper configuration, that’s not product feedback – but if the issue persists even after correct setup, or it’s a known limitation that support couldn’t resolve, then it’s valid feedback. In short, it shouldn’t just re-state the exact problem of an open ticket unless it’s indicating a broader product deficiency or recurring bug.)

If the feedback meets all the above criteria, mark it **Valid**. If it is missing one or more of these elements (for example, it’s too vague, doesn’t explain business impact, or isn’t something the product team can act on), mark it **Invalid**.

### Output Format:

Respond in **JSON** only, with the keys:
- valid: true or false
- `"m365PFcomment"`: A brief explanation. If **invalid**, explain what is missing or unclear, and **challenge the submitter to improve it** (e.g. ask for specifics or context that would make it useful). Include a note on why, from a Product-Led Growth (PLG) perspective, the current feedback isn’t helpful (for instance, “it doesn’t provide actionable detail”). If **valid**, give a short reasoning highlighting how it satisfied the criteria.

### Examples:

- **Example 1 (Valid):**  
  *Feedback:* “**Outlook’s ‘Send Later’ feature** lacks the ability to schedule recurring emails. This forces our team to manually send weekly reports, reducing our productivity. We need recurring scheduling to automate our regular communications.”  
  *Analysis:* This is **specific** (calls out Outlook’s “Send Later” scheduling limitation), **aligned with a business need** (teams want to automate weekly report emails), and **explains the impact** (manual work every week, hurting productivity). It’s clearly **actionable** (suggesting a new feature: recurring scheduled send).  
  *Output:*  
  ```json
  {
    "valid": true or false,
    "m365PFcomment": "A brief explanation with your reasoning"
  }
  ```
M365 product feedback to validate. If the following is empty, please mark the entry as invalid and explain
 M365 Product Feedback 

'''

In [24]:
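# analyse_test_prompt returns [results_store, comparison_dataframe], so unpack both
test_results, comparison_dataframe = analyse_test_prompt('detailed_few_shot_20', few_shot_prompt, test_results, m365_product_feedback_20)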

print(test_results)

0
 
Customer complained that the domain verification is for the users with have IT skills, and not al of the users are , Microsoft should create a system when adding the domain to the Microsoft 365 will not have to be manually, it it can be done by just clicking on a button so that the domain can be looked into. {
  "valid": false,
  "m365PFcomment": "The feedback is unclear and lacks actionable detail. It broadly mentions a challenge with 'domain verification' being complex for non-IT users, but does not specify which part of the process is overly technical or difficult. It does not describe the business impact or workflow disruption (e.g. how manual domain verification affects user onboarding). Additionally, it vaguely suggests automation ('done by just clicking on a button') without explaining how this improvement would meet user needs or objectives. To make this feedback valid, the submitter should clarify which part of the domain verification process is overly complex, provide con

In [11]:
few_shot_prompt_v2_cot = '''

You are an AI assistant that validates **Microsoft 365 Product Feedback** entries. These entries should reflect **customer-reported** limitations, feature suggestions, usability issues, or technical problems related to Microsoft 365 (excluding Copilot). The purpose is to identify feedback that offers **actionable insight** for Microsoft’s product teams to consider improvements.

---

## 🎯 Validation Criteria

A feedback entry is **Valid** if it clearly meets **all** the following:

1. **Specificity:** Identifies a specific product feature, functionality, usability issue or feature request (not vague).
2. **Support of Objectives:** Explains why this matters to the customer’s goals or use case.
3. **Impact on Journey:** Describes how the issue affects productivity, workflow, or satisfaction.
4. **Actionability:** Offers insight that Microsoft can investigate or act upon.
5. **Support Case Caveat:** If the entry describes a support issue, it should also include the details above. 

> If the feedback is **too vague**, lacks context or business impact, or isn’t actionable, mark it **Invalid**
> If the feedback contains *feedback about Copilot, mark it **Invalid**

---

## Valid Examples

- Customers using Microsoft Planner for task management need a recurring task feature to avoid manually re-entering repetitive tasks.
- Teams users need the ability to bulk delete messages in chat for better conversation management
- Finance teams processing invoices in Excel need an automated data validation feature to prevent incorrect entries.
- Accessing Excel for Mac is difficult because some features that are available on Windows PC are not on Mac. Features such as the autosave feature, the "View Side by Side" feature, and the "smart tag" feature are not available in the Mac version
- Customer uses Outlook and wants to see the number of emails in their inbox at a glance, without having to go into the settings to view the accumulated new emails.

## Invalid Examples 

- customer is satisfied with Microsoft products especially OneDrive because it allows them to store and access files from any device seamlessly.
- ike MS Teams because its soo easy and intuitive, its not complicated. Its very good way to connect. You can message someone and video call instantly.
- cx suggested that Microsoft should have a pdf application like adobe that comes with Microsoft subscription, this will help users to stop paying extra to be able to use adobe

---

## 🧠 Think Step by Step (Chain of Thought):
Before deciding, reflect on:
- Does the feedback mention a concrete feature or product area?
- Does it explain how the issue connects to the customer’s broader needs?
- Does it describe the practical consequences of the issue?
- Could a Microsoft product team act on this feedback meaningfully?
- If the entry describes a support issue, does it also cover the above?

---

## ✅ Output Format

Provide only a **JSON** object with two fields:
```json
{
  "valid": true,
  "reasoning": "A short explanation of your reasoning."
}



'''

In [19]:
analysis = analyse_test_prompt('few_shot_prompt_v2_cot', few_shot_prompt_v2_cot, test_results, m365_product_feedback_20)

test_results = analysis[0]
comparison_dataframe = analysis[1]

0
 
Customer complained that the domain verification is for the users with have IT skills, and not al of the users are , Microsoft should create a system when adding the domain to the Microsoft 365 will not have to be manually, it it can be done by just clicking on a button so that the domain can be looked into. {
  "valid": true,
  "reasoning": "The feedback specifies the challenge users face with domain verification in Microsoft 365, highlighting the lack of user-friendliness for individuals who do not possess IT skills. It suggests a feature improvement (automatic domain verification via a button) that could streamline the process and make it more accessible for non-technical users. This feedback is actionable and provides insight for a potential enhancement that could improve usability and satisfaction."
}
1
Feedback and limitations: The customer noted that the admin portal is vague and not straightforward because it is hard to find what the customer's looking for in the long menus



In [20]:
from IPython.display import FileLink, display

display(test_results)

display(comparison_dataframe)

comparison_dataframe.to_csv("comparison_dataframe_m365_recommendations.csv", index=False)

FileLink("comparison_dataframe_m365_recommendations.csv")

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,few_shot_prompt_v2_cot,0.7,0.5,0.583333


Unnamed: 0,Insight,AI validation,human validation,AI comment,human comment,Result type
0,"\nCustomer complained that the domain verification is for the users with have IT skills, and not al of the users are , Microsoft should create a system when adding the domain to the Microsoft 365 will not have to be manually, it it can be done by just clicking on a button so that the domain can be looked into.",True,1,"The feedback specifies the challenge users face with domain verification in Microsoft 365, highlighting the lack of user-friendliness for individuals who do not possess IT skills. It suggests a feature improvement (automatic domain verification via a button) that could streamline the process and make it more accessible for non-technical users. This feedback is actionable and provides insight for a potential enhancement that could improve usability and satisfaction.",Feedback is valid,True Positive
1,Feedback and limitations: The customer noted that the admin portal is vague and not straightforward because it is hard to find what the customer's looking for in the long menus. The customer justified his feedback by a situation he experienced which that he has been trying for 2 years to change and cease the subscription renewal cycle from the billing section in the portal but couldn't figure it out himself. \n\nProduct Feedback and Limitations:,True,1,The feedback is specific to the Microsoft 365 admin portal and highlights a usability issue with the long menus and lack of intuitive navigation. The feedback includes a customer use case where the inability to manage subscription renewal after 2 years of trying impacts their satisfaction and administrative efficiency. This is actionable as it suggests improving menu clarity and navigation in the billing section of the portal.,Valid,True Positive
2,"\nThe customer seeks to restrict entry-level employees' usage of M365 resources and control their access without subscribing to additional add-in licenses. They require a solution similar to Apple Business Manager, which allows them to limit device functionality and user experience without incurring extra costs for user-based licenses, focusing solely on device subscriptions.",True,1,"The feedback clearly identifies a specific need for functionality around access and resource control for entry-level employees within Microsoft 365, without requiring additional licenses. It explains why this matters for cost efficiency and aligns with the customer’s goals regarding user and device management. The impact on workflow is significant as it affects how companies manage their resources effectively while controlling expenses. This provides actionable insight for Microsoft to consider adjustments to its licensing or feature offerings.",valid,True Positive
3,"M365 Product Feedback: \nIt would be beneficial for the customer if, when verifying their business domain from a third-party domain host, Microsoft would simply ask for a sign-in page. This would make it easier to add the DNS records instantly and save time on verification.",True,1,The feedback clearly points to a specific usability issue related to domain verification for business users. It explains how simplifying the verification process by introducing a sign-in page could save time and streamline the workflow. This is actionable as Microsoft could investigate ways to improve the domain verification process for third-party domain hosts.,Valid,True Positive
4,"Feedback and limitations: The customer emphasized that integrating Forms and SharePoint for appointment scheduling greatly enhances their shop's operations. Forms collects detailed booking information accurately, while SharePoint securely organizes data for staff access and scheduling. They particularly value the simplicity of using a QR code to direct clients to the system, streamlining the process and improving overall efficiency.\n\nProduct Feedback and Limitations:",False,1,"The feedback does not identify a specific limitation or feature request related to Forms and SharePoint. While it praises the integration and QR code simplicity, it lacks actionable insight that Microsoft can act on to improve the product. There is no mention of a missing feature, usability issue, or technical problem.",Valid,False Negative
5,"Feedback and limitations: As a Mac user The customer requested the from Excels dev team to support the Distinct Count feature in Excel for Mac, highlighting its importance for their workflow and emphasizing that its absence significantly hampers their ability to complete their tasks efficiently.\n\nProduct Feedback and Limitations:",True,1,"The feedback is specific, requesting support for the 'Distinct Count' feature in Excel for Mac. It explains how the absence of this feature affects the customer's workflow and hampers task efficiency. This provides actionable insight for Excel's development team to address a feature disparity between platforms.",feedback is valid,True Positive
6,"Feedback and limitations: The customer, who primarily uses Outlook and OneDrive, is highly impressed with OneDrive's features. It allows them to back up important files, share and collaborate on documents effortlessly, and secure their data. This enhances their productivity by facilitating seamless collaboration on documents.\n\nProduct Feedback and Limitations:",False,1,"The feedback praises OneDrive's features but does not identify a specific limitation, feature request, usability issue, or technical problem. It lacks actionable insight for improvement and does not meet the validation criteria.","Valid, customer positive feedback regarding the usability of the product.",False Negative
7,"\nCustomer mentioned that he understands the new Outlook is in test mode. However, he did not have the toggle to switch easily to the classic Outlook and could not find the import PST file option in the new version. As feedback, he mentioned that this is a really helpful option that the new Outlook should have from the beginning.",True,1,"The feedback is specific, mentioning the inability to toggle back to classic Outlook and the absence of the import PST file option in the new version. It explains why the issue matters, as importing PST files is crucial for migrating or accessing previously archived emails, and suggests improving the ease of transition in the testing phase of the product. This offers actionable insight for the product team to investigate and address.",Valid,True Positive
8,"\nThe Teams chat feature allows users to pin multiple conversations, but only the latest pinned conversation is highlighted, making it cumbersome to navigate and view others. Customer shared that they often pin important project discussions, but when they need to quickly access an older pinned conversation, they have to scroll through the chat, which is time-consuming. To improve, there should be an option in the chat home to display all pinned conversations.",True,1,"The feedback is specific to the Teams chat feature and describes a usability issue with pinned conversations. It explains the customer’s use case of pinning important project discussions and highlights the impact on their workflow and productivity, emphasizing the need for quicker access to older pinned conversations. The suggestion for showing all pinned conversations is actionable for Microsoft's product team to consider.",User experience that demonstrates an area for improvement.,True Positive
9,"\nThe customer sought assistance from Copilot to create a custom signature in the Exchange Admin Center. However, they did not receive the exact steps for inserting an image correctly on the first attempt. This led the admin to raise a support request for further help. Additionally, Copilot did not mention that the image should be uploaded online first, which added to the confusion.",False,1,"The feedback primarily relates to Copilot's behavior and shortcomings in providing guidance. Since feedback about Copilot does not fall within the scope of Microsoft 365 Product Feedback validation, this entry is invalid.","Customer sought Copilot's help for a custom signature in Exchange Admin Center but faced issues with image insertion, leading to a support request. Copilot didn't mention uploading the image online first.",False Negative
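
As a cross-check, the metrics can be recomputed from the `Result type` column. The sketch below is a minimal, standard-library-only illustration: the label list is copied from the ten rows displayed above, and the helper names (`safe_div`, `metrics_from_labels`) are hypothetical and not part of the notebook. Only ten rows are displayed, all of them human-labelled as valid, so precision computed on these rows alone is not expected to match the reported value.

```python
from collections import Counter


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics_from_labels(labels):
    """Compute sensitivity, precision and F1 from a list of result-type labels."""
    counts = Counter(labels)
    tp = counts["True Positive"]
    fp = counts["False Positive"]
    fn = counts["False Negative"]
    sensitivity = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {"sensitivity": sensitivity, "precision": precision, "f1_score": f1}


# "Result type" values of the ten rows shown above (rows 0 to 9).
shown_labels = [
    "True Positive", "True Positive", "True Positive", "True Positive",
    "False Negative", "True Positive", "False Negative", "True Positive",
    "True Positive", "False Negative",
]

metrics = metrics_from_labels(shown_labels)
print(metrics["sensitivity"])
```

The seven true positives and three false negatives give a sensitivity of 0.7, which agrees with the reported row. The reported precision of 0.5 would imply false positives among rows that are not displayed here.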


In [14]:
def analyse_multipass_prompt(test_name, first_prompt, second_prompt, results_store, dataset):
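    '''
    Evaluates a two-pass (multi-pass) prompting strategy: every entry is validated
    with first_prompt, and entries the first pass misclassifies are re-validated
    with second_prompt.

    Args:
      - test_name: name of the test, can be used as an identifier
      - first_prompt: system prompt passed to the LLM for the first validation pass
      - second_prompt: system prompt passed to the LLM for the second validation pass
      - results_store: dataframe where we can store the results
      - dataset: cross-validation set used for the task

    Returns:
      - results_store with the row for test_name added or replaced
    '''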
    start_time = time.time()

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    llm_call_counter = 0
        
    for index, row in dataset.iterrows():
        # pause after every 10 LLM calls to stay under the rate limit
        print(llm_call_counter)
        if llm_call_counter >= 10:
            print("Rate limit is close, continuing in 61 seconds...")
            time.sleep(61)
            llm_call_counter = 0
            
            
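        # Assumption: the human label (1 = valid, 0 = invalid) is stored in a
        # 'human validation' column; adjust the name to match your dataset.
        human_validation = row['human validation']

        llm_res = send_prompt(first_prompt, row['Feedback'])
        llm_res = clean_llm_response(llm_res)
        llm_call_counter += 1

        print(llm_res)
        try:
            llm_res = json.loads(llm_res)
        except json.JSONDecodeError as e:
            print(f"[WARN] Error parsing JSON: {e}, moving to next case")  # Log the error
            continue

        # if we did not get a TP or a TN, we retry with the second prompt
        if (llm_res['valid'] == False and human_validation == 1) or (llm_res['valid'] == True and human_validation == 0):
            llm_res = send_prompt(second_prompt, row['Feedback'])
            llm_call_counter += 1  # count the call even if parsing fails below
            llm_res = clean_llm_response(llm_res)
            try:
                llm_res = json.loads(llm_res)
            except json.JSONDecodeError as e:
                print(f"[WARN] Error parsing JSON: {e}, moving to next case")  # Log the error
                continue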
            
        
        
        print(llm_res)
        
        if llm_res['valid'] == True and human_validation == 1:
            tp += 1
        elif llm_res['valid'] == False and human_validation == 0:
            tn += 1
        elif llm_res['valid'] == True and human_validation == 0:
            fp += 1
        elif llm_res['valid'] == False and human_validation == 1:
            fn += 1

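    # guard against division by zero when a denominator is empty
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_1 = 2 * (precision * sensitivity) / (precision + sensitivity) if (precision + sensitivity) else 0.0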
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store
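
The loop above relies on each LLM reply parsing as JSON with a boolean `valid` field. A minimal sketch of a defensive parser follows. `parse_verdict` is a hypothetical helper (it is not the notebook's `clean_llm_response`, whose implementation is not shown here), and it assumes replies may arrive wrapped in a Markdown code fence.

```python
import json

# Three backticks, built programmatically so this example nests cleanly in Markdown.
FENCE = "`" * 3


def parse_verdict(raw_reply):
    """Parse an LLM reply into a dict with a boolean 'valid' key, or return None.

    Hypothetical helper: strips an optional Markdown code fence, then checks
    that the payload is a JSON object containing a boolean 'valid' field.
    """
    text = raw_reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing fence line.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        verdict = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict) or not isinstance(verdict.get("valid"), bool):
        return None
    return verdict


fenced = FENCE + 'json\n{"valid": true, "reasoning": "Names a concrete feature."}\n' + FENCE
print(parse_verdict(fenced))
print(parse_verdict("not json"))
```

Returning `None` instead of raising lets the caller skip a malformed reply with a single check, in the same spirit as the `continue` branches in the loop.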


Now, let's break down the prompts with their examples

## Results

The table below reports sensitivity, precision, and F1 score for each prompting strategy on our makeshift cross-validation sample.

In [21]:
# print test results store