# Product Insight Validation Using LLMs: M365 Recommendations🔍

## Overview

This notebook evaluates different prompting strategies and prompts for validating M365 Recommendations using a Large Language Model (LLM). The goal is a sandbox where we can refine prompts against the same cross-validation sets.

## Objectives

- **Compare Prompting Strategies:** Test multiple prompts and strategies to determine which yields the best classification results.
- **Evaluate Performance:** Measure the effectiveness of each strategy using precision, recall, and F1 score.
- **Cross-Validation Approach:** Utilize a labeled dataset containing:
  - **True Positives (TP):** Correctly identified valid insights.
  - **True Negatives (TN):** Correctly identified invalid insights.
  - **False Positives (FP):** Incorrectly marked invalid insights as valid.
  - **False Negatives (FN):** Incorrectly marked valid insights as invalid.

## Methodology

1. **Load Product Insights**  
   - Import CSV files containing business goals for validation.

2. **Apply LLM-Based Validation**  
   - Define the building blocks for calling the LLM to validate insights and for cleaning its responses.

3. **Evaluate Performance**  
   - Compute precision, recall, and F1 score to assess classification accuracy.
   - Compare the effectiveness of different strategies based on their performance metrics.
   - 3.1 Zero Shot Prompting
   - 3.2 Few Shot Prompting
   - 3.3 Multi-pass w/ Few Shot prompting

4. **Optimize for Accuracy**  
   - Identify the best-performing prompt and strategy for product insight validation.

## Tech Stack

- **LLM Provider:** Azure OpenAI  
- **Model:** ChatGPT 4.0  
- **Data Processing:** Python (pandas, numpy)  
- **Evaluation Metrics:** precision, recall, F1 score  

## Expected Outcomes

- A clear understanding of which prompting strategy yields the best results.
- A methodology/workflow that can be iteratively improved and scaled for future product insight validation tasks.



In [1]:
# let's import the packages we will need for this project

import requests # for connecting with Azure OpenAI
import json # for parsing responses
import csv # for data processing
import pandas as pd # for data analysis

# let's also import the config we will need to interact with the Azure OpenAI API

from config import config_endpoint, config_key


# 1 - Load M365 Value Add Insights

Let's take a glimpse at the data we have. All of these M365 Value Add Insights were validated by human validators during April '25.

These datasets will act as makeshift cross-validation sets that we can use to test the performance of different prompting strategies and approaches.


We have two datasets:

- A small set of 20 cases (10 valid, 10 invalid) for quick experimentation
- A bigger set of 200 cases (100 valid, 100 invalid) for testing and measuring results

In [12]:
# We load all the data 

m365_recommendation_20 = pd.read_csv('./cases/m365_value_add_cv20.csv')
m365_recommendation_200 = pd.read_csv('./cases/m365_value_add_cv200.csv')

# Now let's print one of the datasets to see its shape

m365_recommendation_20[:5]

Unnamed: 0,ID,Case Number,UPN,Line Of Business,Product Led Growth Conversation,PLG Conversation Validation,PLG Conversation Comment
0,29059,2504090030004589,gig_wfh_balpa@microsoftsupport.com,Business Advisor Reactive,"\nProvided an overview of the Teams application, highlighting how it enables to schedule appointments, collaborate on meetings, chat, and manage files all in one place. This solution can significantly enhance productivity and help build your brand.",1,Valid: Addressed features related to business goal and mentioned how it will impact the customer.
1,28991,2503210040015077,gig_wfh_ayato@microsoftsupport.com,Business Advisor Reactive,"\nConnect business domain to Microsoft 365 - Based on customer's need to build their brand, I recommended connecting their company domain to Microsoft 365 and I assisted customer to purchase business domain from Squarespace. Guided customer to set up the domain on Microsoft 365 and changed the customer's username and primary email from onmicrosoft.com to their company domain name.",1,valid
2,29031,"2,5031E+15",gig_wfh_ahade@microsoftsupport.com,Trials Nurturing Migrations,\nI recommended using Bookings and shared calendar features to streamline scheduling for the company. These tools can help manage appointments and daily schedules efficiently.,1,Valid PLG
3,29019,"2,50407E+15",gig_whi_masin@office365support.com,Business Assist,"\nI suggested customer to purchase a premium Azure Active Directory license to enhance their security and identity management. It provides advanced features like Conditional Access, seamless single sign-on, detailed reporting, and improved device management, helping them safeguard their organization and streamline operations.",1,Valid: Addressed features related to business goal and mentioned how it will impact the customer.
4,28909,2504060040001378,gig_wfh_revil@microsoftsupport.com,Business Advisor Reactive,\nI suggested utilizing calendar sharing within Microsoft 365 to synchronize schedules and streamline daily tasks. Share calendar and contacts in Microsoft 365 - Outlook | Microsoft Learn,1,Valid: Addressed features related to business goal and mentioned how it will impact the customer.


In [3]:
# Column explanation
data = [
    ["Product Led Growth Conversation", "Raw Product Led Growth Conversation captured by the ambassador as they figure in the tracker"],
    ["PLG Conversation Validation", "Validation done by a human validator - 0 is invalid, 1 is valid"],
    ["PLG Conversation Comment", "Comment/Explanation provided by a human validator"]
]
column_data = pd.DataFrame(data, columns=["Column Name", "Explanation"])

pd.set_option("display.max_colwidth", None) 
column_data

| | Column Name | Explanation |
|---|---|---|
| 0 | Product Led Growth Conversation | Raw Product Led Growth Conversation captured by the ambassador as they figure in the tracker |
| 1 | PLG Conversation Validation | Validation done by a human validator - 0 is invalid, 1 is valid |
| 2 | PLG Conversation Comment | Comment/Explanation provided by a human validator |


## 1.1 Defining Performance

Both datasets have already been evaluated by human validators.

This means we can use the human labels to calculate Sensitivity (Recall), Precision and F1 for each dataset, giving us performance metrics we can analyse and optimise. We will be looking at the following metrics.

# Performance Metrics

## Sensitivity (Recall)
Sensitivity, also known as **recall**, measures the ability to correctly identify positive cases:

$$
\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

## Precision
Precision measures how many of the predicted positive cases were actually correct:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

## F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
$$
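
As a minimal sketch, the three formulas above can be computed directly from the confusion counts. The function name `compute_metrics` and the counts in the usage lines are illustrative assumptions, not part of the notebook; reporting 0.0 when a denominator is zero is also an assumed convention.

```python
def compute_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from confusion counts.

    Assumption: a metric whose denominator is zero is reported as 0.0.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1


# Hypothetical counts: 10 true positives, 7 false positives, 0 false negatives
sensitivity, precision, f1 = compute_metrics(tp=10, fp=7, fn=0)
print(round(sensitivity, 3), round(precision, 3), round(f1, 3))  # prints: 1.0 0.588 0.741
```

Note that true negatives do not appear in any of the three formulas, so a prompt that marks everything as valid can still reach a sensitivity of 1.0; precision is what penalises that behaviour.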


We will create a dataframe where we will store the results of our tests as we run them.

In [4]:
test_results = pd.DataFrame([], columns=["test_name", "sensitivity", "precision", "f1_score"])


test_results


| | test_name | sensitivity | precision | f1_score |
|---|---|---|---|---|


## 2. Setting Up Logic for LLM Validation and Analysis

### 2.1 Validation

Let's start this section by defining a function that calls Azure OpenAI with a system prompt and an input provided by the user.

The system prompt will contain the criteria to validate an insight, and the user input will be the entry registered by our agents. 

In [5]:
HEADERS = {
    "Content-Type": "application/json",
    "api-key": config_key
}

def send_prompt(system_prompt, user_prompt, max_tokens=200):
    """Send a prompt to Azure OpenAI and return the response."""
    url = config_endpoint
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens
    }
    
    try:
        # the timeout keeps a stalled request from hanging the whole run
        response = requests.post(url, headers=HEADERS, json=data, timeout=60)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"

Let's test it out with a very naive example to make sure it works.

In [6]:
res = send_prompt(
    "You are system dedicated to validate product feedback. You will only declare as valid feedback that has to do usability issues, anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }",
    "I could not use the app at all, the menu was very convoluted and crowded with icons. Very hard to read"
)

print(res)
    

```json
{
  "valid": true,
  "reason": "The feedback highlights a usability issue related to the menu being convoluted and cluttered, making it difficult to navigate and use the app effectively."
}
```


The model returns JSON wrapped in a Markdown code fence. Let's create a function to strip it.

In [7]:
import re

def clean_llm_response(res):
    # strip only the surrounding Markdown fence, leaving any "json" inside the text intact
    return re.sub(r"^```(?:json)?\s*|\s*```$", "", res.strip()).replace(r"\'", "'")

In [8]:
print(clean_llm_response(res))

{
  "valid": true,
  "reason": "The feedback highlights a usability issue related to the menu being convoluted and cluttered, making it difficult to navigate and use the app effectively."
}


Great! We now have the basic building block for testing different validation prompts.   
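
Before relying on a cleaned reply, it can help to confirm that it parses and carries the expected field. The sketch below is one possible check, not the notebook's own method: the helper name `parse_validation` is hypothetical, and it assumes a usable reply is a JSON object whose `valid` field is a boolean.

```python
import json


def parse_validation(cleaned):
    """Parse a cleaned LLM reply; return the dict, or None if it is unusable.

    Assumption: a usable reply is a JSON object whose "valid" field is a boolean.
    """
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not isinstance(parsed.get("valid"), bool):
        return None
    return parsed


print(parse_validation('{"valid": true, "reason": "Usability issue."}'))
print(parse_validation("Error: request timed out"))  # not JSON, so this prints None
```

Returning `None` rather than raising lets a caller skip or log the row and keep processing the rest of the dataset.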

###  2.2 Analysis of prompt performance

Now we need a function that allows us to do the following:

1. Iterate through the rows of one of our datasets.
2. For each row:
    1. Ask the LLM to validate the entry.
    2. Evaluate whether the LLM agreed with the human validator:
          - LLM => Valid, Human => Valid, then *true positive*
          - LLM => Invalid, Human => Invalid, then *true negative*
          - LLM => Valid, Human => Invalid, then *false positive*
          - LLM => Invalid, Human => Valid, then *false negative*
    3. Store this information.
3. Calculate Sensitivity (Recall), Precision and F1 for this prompt.
4. Add the results to our log in the `test_results` variable we created before.
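
The comparison step in the list above can be sketched as a small pure function. The name `classify_outcome` and the sample pairs are illustrative assumptions; the labels follow the dataset convention of 1 for valid and 0 for invalid.

```python
from collections import Counter


def classify_outcome(llm_valid, human_label):
    """Map one (LLM verdict, human label) pair to a confusion-matrix cell.

    llm_valid: bool returned by the model; human_label: 1 (valid) or 0 (invalid).
    """
    if llm_valid and human_label == 1:
        return "True Positive"
    if not llm_valid and human_label == 0:
        return "True Negative"
    if llm_valid and human_label == 0:
        return "False Positive"
    return "False Negative"


# Hypothetical verdict/label pairs, tallied with a Counter
pairs = [(True, 1), (True, 0), (False, 1), (False, 0), (True, 1)]
counts = Counter(classify_outcome(verdict, label) for verdict, label in pairs)
print(counts["True Positive"], counts["False Positive"])  # prints: 2 1
```

Keeping this logic separate from the API call makes it easy to check against hand-labelled pairs without spending any tokens.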

In [10]:
import time

def analyse_test_prompt(test_name, prompt, results_store, dataset):
    '''
    Returns a list with
    - The results store, updated with the performance of the prompt
    - A comparison dataframe that helps compare AI and validator scores and comments.

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt: system prompt passed to the LLM to validate product feedback
      - results_store: dataframe where we can store the results
      - dataset: dataframe of human-validated insights to run the prompt against
    '''
    comparison_columns = [
        "Insight",
        "AI validation",
        "human validation",
        "AI comment",
        "human comment",
        "Result type"
    ]

    comparison_dataframe = pd.DataFrame(columns=comparison_columns)
    
    start_time = time.time()

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    row_counter = 0
    for index, row in dataset.iterrows():
        # pause after every 10 rows to stay under the API rate limit
        print(row_counter)
        if row_counter >= 10:
            print("Rate limit is close, continuing in 61 seconds...")
            time.sleep(61)
            row_counter = 0
                
        llm_res = send_prompt(prompt, row['Product Led Growth Conversation'])
        llm_res = clean_llm_response(llm_res)
        row_counter += 1
        print(row['Product Led Growth Conversation'],llm_res)

        try:
            llm_res = json.loads(llm_res)
        except json.JSONDecodeError as e:
            print(f"[WARN] Failed to parse LLM response as JSON: {e}")
            continue

        human_validation = row['PLG Conversation Validation']
        if llm_res['valid'] == True and human_validation == 1:
            tp += 1
            result_type = "True Positive"
        elif llm_res['valid'] == False and human_validation == 0:
            tn += 1
            result_type = "True Negative"
        elif llm_res['valid'] == True and human_validation == 0:
            fp += 1
            result_type = "False Positive"
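        elif llm_res['valid'] == False and human_validation == 1:
            fn += 1
            result_type = "False Negative"
        else:
            # skip rows where 'valid' is not a boolean or the human label is not 0/1
            print(f"[WARN] Unexpected validation values: {llm_res['valid']!r}, {human_validation!r}")
            continue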

        new_comparison_row = {
            "Insight": row['Product Led Growth Conversation'],
            "AI validation": llm_res['valid'],
            "human validation": human_validation,
            # the prompts in this notebook ask for "m365PFcomment"; fall back to "reasoning" if present
            "AI comment": llm_res.get('m365PFcomment', llm_res.get('reasoning')),
            "human comment": row['PLG Conversation Comment'],
            "Result type": result_type
        }

        comparison_dataframe.loc[len(comparison_dataframe)] = new_comparison_row


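    # guard against empty denominators (e.g. the prompt marked nothing as valid)
    sensitivity = tp / ( tp + fn ) if ( tp + fn ) else 0.0
    precision = tp / ( tp + fp ) if ( tp + fp ) else 0.0
    f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity ) if ( precision + sensitivity ) else 0.0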
    
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return [results_store, comparison_dataframe]

OK, our building blocks are now ready. Let's start with some prompting.

## 3. Testing Prompting Approaches

In this section we will test several prompting approaches to see which one performs best. Let's go!


### 3.1 Loose Zero Shot Prompt

Zero-shot prompting is a technique used with large language models (LLMs) where the model is asked to perform a task without being given any specific examples of how to do it. We're relying entirely on the model's pre-existing knowledge and understanding to generate a response.  

In the prompt below, we describe high-level criteria that human validators frequently cite when marking insights as valid or invalid. These criteria are drawn from an analysis of the validators' stated reasons.


In [10]:
loose_zero_shot_prompt = '''
You are a classifier for Microsoft 365 Product Led Growth (PLG) recommendations. Your task is to decide if a given conversation qualifies as a valid PLG recommendation, using a balanced and moderately lenient approach. Use a silent chain-of-thought internally, but do not include it in your output.

Evaluation Guidelines:
- A conversation is considered VALID if it:
   • Contains either an explicit or implied reference to a Microsoft 365 product, service, or feature.
   • Mentions benefits or improvements (such as enhanced collaboration, improved efficiency, or better user experience) that address a customer need, even if described in general terms.
   • Shows a clear intention to promote a product-led growth recommendation, even if not every detail is exhaustively specified.

- A conversation is considered INVALID if:
   • It has no reference (explicit or implied) to any Microsoft 365 product or feature.
   • It is overly generic or off-topic, focusing only on technical support, troubleshooting, billing, licensing, or subjects unrelated to a product-led growth context.
   • It primarily mentions unrelated tools (such as Copilot) without a clear connection to the customer benefit or Microsoft 365 context.

Additional Instructions:
- If a conversation is borderline but includes an indication of benefits related to Microsoft 365, lean towards marking it as VALID.
- Your response must be solely a JSON object with exactly two keys:
    • "valid": a boolean value (true if valid, false if invalid).
    • "m365PFcomment": a concise, one-sentence justification that captures the main reason for your decision.
- Do not include any extra text or chain-of-thought details in your output.
- The answer must be entirely in English and follow the exact JSON format below:

{
  "valid": true,
  "m365PFcomment": "Your explanation here."
}

Evaluate the provided conversation accordingly and output only the JSON result.
'''


In [11]:
# Test Loose Zero Shot Prompt here 
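# analyse_test_prompt returns [results_store, comparison_dataframe], so unpack both
test_results, loose_zero_shot_comparison = analyse_test_prompt('loose_zero_shot_20', loose_zero_shot_prompt, test_results, m365_recommendation_20)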


0
 
Provided an overview of the Teams application, highlighting how it enables to schedule appointments, collaborate on meetings, chat, and manage files all in one place. This solution can significantly enhance productivity and help build your brand. {
  "valid": true,
  "m365PFcomment": "The conversation explicitly references Microsoft Teams and highlights its benefits like scheduling, collaboration, chat, and file management to enhance productivity and branding, aligning with PLG principles."
}
1
 
Connect business domain to Microsoft 365 - Based on customer's need to build their brand, I recommended connecting their company domain to Microsoft 365 and I assisted customer to purchase business domain from Squarespace. Guided customer to set up the domain on Microsoft 365 and changed the customer's username and primary email from onmicrosoft.com to their company domain name. {
  "valid": true,
  "m365PFcomment": "The recommendation explicitly connects the customer's business domain to 



In [12]:
test_results

| | test_name | sensitivity | precision | f1_score |
|---|---|---|---|---|
| 0 | loose_zero_shot_20 | 1.0 | 0.588235 | 0.740741 |


In [13]:
loose_zero_shot_no_negative_criteria = """
What You Do
You check Microsoft 365 feedback to see if it helps improve the product. Good feedback is clear, useful, and related to Microsoft 365 (not Copilot).

- A conversation is considered VALID if it:
   • Contains either an explicit or implied reference to a Microsoft 365 product, service, or feature.
   • Mentions benefits or improvements (such as enhanced collaboration, improved efficiency, or better user experience) that address a customer need, even if described in general terms.
   • Shows a clear intention to promote a product-led growth recommendation, even if not every detail is exhaustively specified.

Additional Instructions:
- If a conversation is borderline but includes an indication of benefits related to Microsoft 365, lean towards marking it as VALID.
- Your response must be solely a JSON object with exactly two keys:
    • "valid": a boolean value (true if valid, false if invalid).
    • "m365PFcomment": a concise, one-sentence justification that captures the main reason for your decision.
- Do not include any extra text or chain-of-thought details in your output.
- The answer must be entirely in English and follow the exact JSON format below:

{
  "valid": true,
  "m365PFcomment": "Your explanation here."
}

Evaluate the provided conversation accordingly and output only the JSON result.

"""

In [14]:
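# analyse_test_prompt returns [results_store, comparison_dataframe], so unpack both
test_results, no_negative_criteria_comparison = analyse_test_prompt('loose_zero_shot_20_no_negative_criteria', loose_zero_shot_no_negative_criteria, test_results, m365_recommendation_20)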


0
 
Provided an overview of the Teams application, highlighting how it enables to schedule appointments, collaborate on meetings, chat, and manage files all in one place. This solution can significantly enhance productivity and help build your brand. {
  "valid": true,
  "m365PFcomment": "The feedback highlights multiple benefits of Microsoft Teams, including productivity enhancement and brand building."
}
1
 
Connect business domain to Microsoft 365 - Based on customer's need to build their brand, I recommended connecting their company domain to Microsoft 365 and I assisted customer to purchase business domain from Squarespace. Guided customer to set up the domain on Microsoft 365 and changed the customer's username and primary email from onmicrosoft.com to their company domain name. {
  "valid": true,
  "m365PFcomment": "The feedback explicitly references Microsoft 365 and provides a clear user benefit of improved branding and customization for business emails and domains."
}
2
 
I r

In [None]:
test_results

| | test_name | sensitivity | precision | f1_score |
|---|---|---|---|---|
| 0 | loose_zero_shot_20 | 1.0 | 0.714286 | 0.833333 |
| 1 | loose_zero_shot_20_no_negative_criteria | 0.9 | 0.5625 | 0.692308 |


### 3.2 Detailed Zero Shot Prompt

In the prompt below, we provide a detailed, tiered list of criteria based on the latest version of the insights framework available at GigPlus.


In [18]:
detailed_zero_shot_prompt = '''
You are an AI assistant that validates **Microsoft 365 Value Add** recommendations within customer conversations. These recommendations should be **customer-centric** – meaning they tie Microsoft 365 features to the customer’s unique needs in a way that clearly adds value beyond the initial support issue.

**What is a “Value Add” recommendation?**  
It’s a personalized suggestion in a support conversation that goes beyond solving the original problem, showing the customer how an additional Microsoft 365 feature or product can help them achieve their business/personal goals or improve their experience. For example, after resolving a customer’s email issue, an ambassador might recommend using Microsoft Teams for better collaboration, explaining how it can streamline their team’s communication (that’s a value-add recommendation).

## Validation Criteria:

A **valid** M365 Value Add conversation will **clearly do all or most of the following**:

1. **Align with Customer’s Goals:** Explicitly link the Microsoft 365 feature/product to the customer’s specific business or personal objectives. (Does the conversation show how this feature helps achieve the customer’s stated goals or solve a stated pain point?)

2. **Provide Additional Value:** Go beyond the original issue to deliver extra benefit. The ambassador builds rapport and provides insights or best practices, not just a fix. (Is the ambassador focusing on the customer’s broader needs, and offering guidance that adds value to their business or workflow?)

3. **Hands-On Demonstration:** Where possible, involve a **show-and-tell**. The ambassador might walk the customer through the feature or offer a demo/trial so the customer can see the benefit in real-time. (Does the conversation include actually demonstrating the feature or describing a concrete example scenario? This isn’t always possible, but it strongly enhances the value if done.)

4. **Tailored to the Use Case:** The recommendation is customized to the customer’s situation. (Is it clear the ambassador considered the customer’s industry, use case, or environment? The advice should address **their** specific pain points, not just be a generic sales pitch.)

5. **Context and Impact:** The conversation provides context for how the feature fits into the customer’s workflow and how it will improve their experience or outcomes. (Does it say **why** this feature matters for them? e.g. “By using this, your team can save 2 hours a week on X,” or “This will make it easier for you to collaborate when working remotely.”)

If the conversation’s **value-add recommendation** strongly meets the above points, mark it **Valid**. If it’s missing one or more key elements (for example, it’s generic, not clearly tied to the customer’s goals, or lacks any context), mark it **Invalid**. 

> **Note:** Even when suggesting a Microsoft 365 feature as part of a broader campaign or agenda, it should still be framed in terms of the customer’s own use case and benefits. If a recommendation is given without linking to a customer need (e.g. just pushing a product feature with no context), it’s not truly customer-centric and should be considered **Invalid** because it doesn’t show the customer why that feature matters to them.

### Output Format:

- Your response must be solely a JSON object with exactly two keys:
    • "valid": a boolean value (true if valid, false if invalid).
    • "m365PFcomment": a concise, one-sentence justification that captures the main reason for your decision.
- Do not include any extra text or chain-of-thought details in your output.
- The answer must be entirely in English and follow the exact JSON format below:

{
  "valid": true,
  "m365PFcomment": "Your explanation here."
}


'''

In [19]:
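# analyse_test_prompt returns [results_store, comparison_dataframe], so unpack both
test_results, detailed_zero_shot_comparison = analyse_test_prompt('detailed_zero_shot_20', detailed_zero_shot_prompt, test_results, m365_recommendation_20)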


0
 
Provided an overview of the Teams application, highlighting how it enables to schedule appointments, collaborate on meetings, chat, and manage files all in one place. This solution can significantly enhance productivity and help build your brand. {
  "valid": false,
  "m365PFcomment": "The recommendation is too generic and does not directly align with the customer's specific goals or use case, lacking clear context for why Teams specifically matters to them."
}
1
 
Connect business domain to Microsoft 365 - Based on customer's need to build their brand, I recommended connecting their company domain to Microsoft 365 and I assisted customer to purchase business domain from Squarespace. Guided customer to set up the domain on Microsoft 365 and changed the customer's username and primary email from onmicrosoft.com to their company domain name. {
  "valid": true,
  "m365PFcomment": "The recommendation clearly aligns with the customer's goal of building their brand by connecting their co

In [20]:
test_results

| | test_name | sensitivity | precision | f1_score |
|---|---|---|---|---|
| 0 | loose_zero_shot_20 | 1.0 | 0.588235 | 0.740741 |
| 1 | loose_zero_shot_20_no_negative_criteria | 0.9 | 0.529412 | 0.666667 |
| 2 | detailed_zero_shot_20 | 0.3 | 0.5 | 0.375 |


### 3.3 Few-Shot Prompt

Few-shot prompting is a technique in prompt engineering that aims to augment LLMs by providing a small number of examples within the prompt itself. This allows the model to learn and adapt to a specific task without requiring extensive fine-tuning.

In the prompt below, we provide a few positive and negative examples for each of the categories and observe their impact on performance.
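
As a rough sketch of the mechanics, a few-shot prompt can be assembled by appending labelled examples to a criteria prompt. The helper `build_few_shot_prompt`, its tuple layout, and the sample examples are hypothetical and only illustrate the idea; the prompt used in this notebook is written out by hand.

```python
def build_few_shot_prompt(criteria, examples):
    """Append labelled examples to a criteria prompt.

    examples: list of (conversation, is_valid, explanation) tuples.
    """
    lines = [criteria.strip(), "", "### Examples:"]
    for number, (conversation, is_valid, explanation) in enumerate(examples, start=1):
        label = "Valid" if is_valid else "Invalid"
        lines.append(f"- Example {number} ({label}): {conversation}")
        lines.append(f"  Why: {explanation}")
    return "\n".join(lines)


# Hypothetical examples, one per class
examples = [
    ("Recommended Teams channels to speed up project collaboration.", True,
     "Ties a feature to the customer's stated goal."),
    ("Mentioned that Microsoft 365 has many apps.", False,
     "Generic, with no link to a customer need."),
]
print(build_few_shot_prompt("You validate M365 Value Add recommendations.", examples))
```

Building the prompt from a list keeps the examples easy to swap, which helps when testing whether a particular example is skewing the model towards one label.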

In [21]:
few_shot_prompt = '''

You are an AI assistant that validates **Microsoft 365 Value Add** recommendations within customer conversations. These recommendations should be **customer-centric** – meaning they tie Microsoft 365 features to the customer’s unique needs in a way that clearly adds value beyond the initial support issue.

**What is a “Value Add” recommendation?**  
It’s a personalized suggestion in a support conversation that goes beyond solving the original problem, showing the customer how an additional Microsoft 365 feature or product can help them achieve their business/personal goals or improve their experience. For example, after resolving a customer’s email issue, an ambassador might recommend using Microsoft Teams for better collaboration, explaining how it can streamline their team’s communication (that’s a value-add recommendation).

## Validation Criteria:

A **valid** M365 Value Add conversation will **clearly do all or most of the following**:

1. **Align with Customer’s Goals:** Explicitly link the Microsoft 365 feature/product to the customer’s specific business or personal objectives. (Does the conversation show how this feature helps achieve the customer’s stated goals or solve a stated pain point?)

2. **Provide Additional Value:** Go beyond the original issue to deliver extra benefit. The ambassador builds rapport and provides insights or best practices, not just a fix. (Is the ambassador focusing on the customer’s broader needs, and offering guidance that adds value to their business or workflow?)

3. **Hands-On Demonstration:** Where possible, involve a **show-and-tell**. The ambassador might walk the customer through the feature or offer a demo/trial so the customer can see the benefit in real-time. (Does the conversation include actually demonstrating the feature or describing a concrete example scenario? This isn’t always possible, but it strongly enhances the value if done.)

4. **Tailored to the Use Case:** The recommendation is customized to the customer’s situation. (Is it clear the ambassador considered the customer’s industry, use case, or environment? The advice should address **their** specific pain points, not just be a generic sales pitch.)

5. **Context and Impact:** The conversation provides context for how the feature fits into the customer’s workflow and how it will improve their experience or outcomes. (Does it say **why** this feature matters for them? e.g. “By using this, your team can save 2 hours a week on X,” or “This will make it easier for you to collaborate when working remotely.”)

If the conversation’s **value-add recommendation** strongly meets the above points, mark it **Valid**. If it’s missing one or more key elements (for example, it’s generic, not clearly tied to the customer’s goals, or lacks any context), mark it **Invalid**. 

> **Note:** Even when suggesting a Microsoft 365 feature as part of a broader campaign or agenda, it should still be framed in terms of the customer’s own use case and benefits. If a recommendation is given without linking to a customer need (e.g. just pushing a product feature with no context), it’s not truly customer-centric and should be considered **Invalid** from a Product-Led Growth (PLG) perspective, because it doesn’t show the customer why that feature matters to them.

### Output Format:

Provide the answer as a JSON object with two keys:  
- **"m365VAvalid"** – `"Valid"` if the conversation meets the criteria, or `"Invalid"` if it does not.  
- **"m365VAcomment"** – A brief explanation. If **invalid**, explain what is missing or how it fell short, and **challenge the ambassador** to improve it (e.g. “what could be added or done better?”). Include a note on why, from a PLG standpoint, the current conversation isn’t as useful (perhaps it’s too generic, not aligned to needs, etc.). If **valid**, give a short reason citing how it met the key criteria.

### Examples:

- **Example 1 – Valid:**  
  *Conversation Summary:* The customer’s goal was to improve team collaboration. After helping with their SharePoint site issue, the ambassador suggests using Microsoft Teams for project discussions. The ambassador says, “Since your goal is faster team collaboration, let’s set up a Teams channel for your project. For example, you mentioned keeping track of file versions – in Teams, files are automatically shared and version-controlled. I can walk you through setting up your first channel now.” They then guide the customer through creating a Teams channel and share best practices.  
  *Why It’s Valid:* This clearly **aligns with the customer’s goal** (faster team collaboration). It **adds value** beyond the original SharePoint issue by introducing Teams. The ambassador gives a **hands-on walkthrough** (setting up a channel in real-time). It’s **tailored** to the customer’s project context (mentioning file version concerns the customer had). And it provides **context** on how Teams will fit into their workflow (centralizing files and discussions for the project).  


### Output Format:

- Your response must be solely a JSON object with exactly two keys:
    • "valid": a boolean value (true if valid, false if invalid).
    • "m365PFcomment": a concise, one-sentence justification that captures the main reason for your decision.
- Do not include any extra text or chain-of-thought details in your output.
- The answer must be entirely in English and follow the exact JSON format below:

{
  "valid": true,
  "m365PFcomment": "Your explanation here."
}

These are the business goals and Microsoft 365 value-add conversations captured. If either of the following is empty, mark the entry as invalid and explain why:
 Business Goals or Needs
 M365 Value Add 
'''
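
The output format requested by the prompt above can be checked mechanically before a result is counted. The following is a minimal sketch, assuming the model returns the JSON object as plain text. `parse_validation_response` is a hypothetical helper name and is not part of this notebook.

```python
import json


def parse_validation_response(raw, comment_key="m365PFcomment"):
    """Parse an LLM reply and check it matches the two-key JSON format.

    Hypothetical helper for illustration; returns None when the reply
    does not follow the requested format.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None

    # the prompt asks for a JSON object with exactly two keys
    if not isinstance(parsed, dict) or set(parsed) != {"valid", comment_key}:
        return None
    # "valid" must be a real boolean, and the comment must be text
    if not isinstance(parsed["valid"], bool) or not isinstance(parsed[comment_key], str):
        return None
    return parsed


reply = '{"valid": false, "m365PFcomment": "Too generic."}'
print(parse_validation_response(reply))
```

Returning `None` instead of raising lets the caller decide whether to retry the call or skip the row.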

In [22]:
test_results = analyse_test_prompt('few_shot_20', few_shot_prompt, test_results, m365_recommendation_20)

print(test_results)

0
 
Provided an overview of the Teams application, highlighting how it enables to schedule appointments, collaborate on meetings, chat, and manage files all in one place. This solution can significantly enhance productivity and help build your brand. {
  "valid": false,
  "m365PFcomment": "The recommendation is too generic and lacks alignment with specific customer goals or pain points. It needs to clearly tie the features of Teams to the customer's unique objectives or workflow to effectively demonstrate value."
}
1
 
Connect business domain to Microsoft 365 - Based on customer's need to build their brand, I recommended connecting their company domain to Microsoft 365 and I assisted customer to purchase business domain from Squarespace. Guided customer to set up the domain on Microsoft 365 and changed the customer's username and primary email from onmicrosoft.com to their company domain name. {
  "valid": true,
  "m365PFcomment": "The recommendation aligns with the customer's goal of 

In [24]:
test_results

| Unnamed: 0 | test_name | sensitivity | precision | f1_score |
|---|---|---|---|---|
| 0 | loose_zero_shot_20 | 1.0 | 0.588235 | 0.740741 |
| 1 | loose_zero_shot_20_no_negative_criteria | 0.9 | 0.529412 | 0.666667 |
| 2 | detailed_zero_shot_20 | 0.3 | 0.5 | 0.375 |
| 3 | few_shot_20 | 0.2 | 0.666667 | 0.307692 |


In [14]:
few_shot_prompt_v2_w_cot = '''

You are an AI assistant that validates **Microsoft 365 Value Add** recommendations within customer conversations.

Your task is to assess whether a conversation contains a **Valid** value-add recommendation based on the following four criteria. Think through them step by step before deciding:

1. **Provide Value** – Does the suggestion help the customer achieve a goal or improve their experience?
2. **Feature Recommendation** – Is there a recommendation of a feature or product in the Microsoft 365 ecosystem?
3. **Tailored to the Use Case** – Is the recommendation clearly customized to the customer’s context, industry, or situation?
4. **Context and Impact** – Does the ambassador explain *why* the recommendation matters or how it will improve the customer’s outcomes?


Use the following **Chain of Thought** approach:
- Think through each of the four criteria above.
- If **most or all are met**, mark the entry as **Valid**.
- If any **key element is missing** or the recommendation feels generic, mark it as **Invalid**.

> **Notes:** 
- A generic, unspecific entry without alignment to customer goals is **Invalid**, even if it technically mentions a Microsoft 365 feature. True value-add means showing the **why it matters** for that customer.
- An entry that recommends Copilot is automatically invalid.

--- 

### Valid Examples 

- I recommended Onedrive for business because it gives the user the platform to save her personal files with the use of 1 TB of cloud storage. I chose Onedrive for business because it can be synced to different device so she can access it anywhere. I demo on how to use Onedrive for business.

- I recommended using Designer. This tool will help him generate creative photos based on descriptions, making his ideas clearer. Additionally, Designer will save time, enhance creativity with various design options, improve clarity with visually appealing ideas, and increase social media engagement with high-quality images.

- I recommended that the customer use Power BI, which can consolidate their Excel reports into interactive dashboards, giving them real-time insights into customer trends.

- I recommended that the customer use Microsoft Purview to protect sensitive legal contracts, allowing them to encrypt and restrict access to confidential files.

### Invalid Examples

- Explained about Win 11 and ran a check, processor doesn't meet. User said - .I will have to go out and buy a new computer

- I recommended the customer to use Excel to organize data

- I recommended the customer to use Teams to collaborate better

- I recommended the customer to use Microsoft security tools

---

### Evaluation Instructions:

Evaluate the conversation using the chain of thought steps above.


### Response

- Your response must be solely a JSON object with exactly two keys:
    • "valid": a boolean value (true if valid, false if invalid).
    • "reasoning": a concise, one-sentence justification that captures the main reason for your decision.
- Do not include any extra text or chain-of-thought details in your output.
- The answer must be entirely in English and follow the exact JSON format below:

{
  "valid": true,
  "reasoning": "Your explanation here."
}

'''
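
The valid and invalid example sections in the prompt above could also be generated from Python lists, which makes it easier to swap examples between experiments. This is a minimal sketch, not how the notebook builds its prompts. `build_examples_block` is a hypothetical helper, and the two example strings are taken from the prompt above.

```python
def build_examples_block(valid_examples, invalid_examples):
    """Render few-shot examples as the markdown sections used in the prompt."""
    lines = ["### Valid Examples", ""]
    for example in valid_examples:
        lines += [f"- {example}", ""]
    lines += ["### Invalid Examples", ""]
    for example in invalid_examples:
        lines += [f"- {example}", ""]
    # join with newlines and end the block with exactly one trailing newline
    return "\n".join(lines).rstrip() + "\n"


block = build_examples_block(
    ["I recommended that the customer use Power BI, which can consolidate their Excel reports into interactive dashboards, giving them real-time insights into customer trends."],
    ["I recommended the customer to use Excel to organize data"],
)
print(block)
```

Keeping the examples in lists also makes it straightforward to record which examples were used for a given test name.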

In [15]:
analysis = analyse_test_prompt('few_shot_prompt_v2_w_cot', few_shot_prompt_v2_w_cot, test_results, m365_recommendation_20)

test_results = analysis[0]
comparison_dataframe = analysis[1]

0
 
Provided an overview of the Teams application, highlighting how it enables to schedule appointments, collaborate on meetings, chat, and manage files all in one place. This solution can significantly enhance productivity and help build your brand. {
  "valid": false,
  "reasoning": "The recommendation lacks specific tailoring to the customer's context and does not explain why Teams is uniquely relevant for their goals or industry."
}
1
 
Connect business domain to Microsoft 365 - Based on customer's need to build their brand, I recommended connecting their company domain to Microsoft 365 and I assisted customer to purchase business domain from Squarespace. Guided customer to set up the domain on Microsoft 365 and changed the customer's username and primary email from onmicrosoft.com to their company domain name. {
  "valid": true,
  "reasoning": "The recommendation directly helps the customer build their brand, involves connecting a business domain to Microsoft 365, is tailored to t



In [34]:
from IPython.display import FileLink, display

display(test_results)

display(comparison_dataframe)

comparison_dataframe.to_csv("comparison_dataframe_copilot_insights.csv", index=False)

FileLink("comparison_dataframe_copilot_insights.csv")


| Unnamed: 0 | test_name | sensitivity | precision | f1_score |
|---|---|---|---|---|
| 0 | loose_zero_shot_20 | 1.0 | 0.588235 | 0.740741 |
| 1 | loose_zero_shot_20_no_negative_criteria | 0.9 | 0.529412 | 0.666667 |
| 2 | detailed_zero_shot_20 | 0.3 | 0.5 | 0.375 |
| 3 | few_shot_20 | 0.2 | 0.666667 | 0.307692 |
| 4 | few_shot_v2_cot_20 | 0.6 | 0.6 | 0.6 |
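
As a quick consistency check on the results above, F1 is the harmonic mean of precision and sensitivity (recall), so it can be recomputed from the other two columns. The sketch below uses only the values reported above. `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, sensitivity):
    """Harmonic mean of precision and sensitivity; 0.0 if both are zero."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# (test_name, sensitivity, precision, reported f1_score) copied from the results above
reported = [
    ("loose_zero_shot_20", 1.0, 0.588235, 0.740741),
    ("loose_zero_shot_20_no_negative_criteria", 0.9, 0.529412, 0.666667),
    ("detailed_zero_shot_20", 0.3, 0.5, 0.375),
    ("few_shot_20", 0.2, 0.666667, 0.307692),
    ("few_shot_v2_cot_20", 0.6, 0.6, 0.6),
]

for name, sensitivity, precision, f1_reported in reported:
    f1 = f1_from(precision, sensitivity)
    print(f"{name}: recomputed={f1:.6f} reported={f1_reported}")
```

Small differences in the last digit are expected, because the inputs are already rounded to six decimals.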


In [None]:
def analyse_multipass_prompt(test_name, prompt_product, prompt_deployment, results_store, true_positives, true_negatives, false_positives, false_negatives):
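    '''
    Evaluates the performance of a two-pass (multi-pass) prompting strategy

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt_product: system prompt passed to the LLM to validate product feedback
      - prompt_deployment: system prompt passed to the LLM to validate deployment blockers
      - results_store: dataframe where we can store the results
      - true_positives, true_negatives, false_positives, false_negatives:
        labeled dataframes, each with a 'Feedback' column

    Returns:
      - results_store with the row for test_name added or replaced
    '''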
    start_time = time.time()
    
    dataframes = [
        [true_positives, True], # first value contains the data, the second what we would like the model to return for every row
        [true_negatives, False], # for instance, the llm should evaluate all true positives as valid to have 100% accuracy 
        [false_positives, False],
        [false_negatives, True]
    ]

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    llm_call_counter = 0
    for dataframe in dataframes:
        data, expectation = dataframe
        
        for index, row in data.iterrows():
            # pause after every 10 LLM calls to stay under the rate limit
            print(llm_call_counter)
            if llm_call_counter >= 10:
                print("Rate limit is close, continuing in 61 seconds...")
                time.sleep(61)
                llm_call_counter = 0
                
                
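            llm_res = send_prompt(prompt_product, row['Feedback'])
            llm_res = clean_llm_response(llm_res)
            llm_call_counter += 1

            print(llm_res)
            try:
                llm_res = json.loads(llm_res)
            except json.JSONDecodeError as e:
                print(f"Error parsing JSON: {e}, moving to next case")  # Log the error
                continue

            # if we did not get a TP or a TN, we use the other prompt
            # (this check compares against the expected label, so the second pass only runs on misclassified rows)
            if expectation != llm_res['valid']:
                llm_res = send_prompt(prompt_deployment, row['Feedback'])
                llm_res = clean_llm_response(llm_res)
                llm_call_counter += 1  # count the call before a possible skip
                try:
                    llm_res = json.loads(llm_res)
                except json.JSONDecodeError as e:
                    print(f"Error parsing JSON: {e}, moving to next case")  # Log the error
                    continue  # skip this row; a bare `next` would do nothing here

            print(llm_res)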
            
            if expectation == True and llm_res['valid'] == True:
                tp += 1
            elif expectation == True and llm_res['valid'] == False:
                fn += 1
            elif expectation == False and llm_res['valid'] == True:
                fp += 1
            elif expectation == False and llm_res['valid'] == False:
                tn += 1

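    # fall back to 0.0 when a denominator is empty, to avoid a ZeroDivisionError
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_1 = 2 * (precision * sensitivity) / (precision + sensitivity) if (precision + sensitivity) else 0.0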
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store
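
The counting and metric steps in the function above can be separated from the LLM calls so they can be checked without spending API quota. The sketch below is an illustration under that assumption. `score_predictions` is a hypothetical name, and the fallback to 0.0 for an empty denominator is a choice made here, not something the notebook specifies.

```python
def score_predictions(pairs):
    """Compute sensitivity, precision and F1 from (expected, predicted) booleans."""
    tp = fn = fp = tn = 0
    for expected, predicted in pairs:
        if expected and predicted:
            tp += 1
        elif expected and not predicted:
            fn += 1
        elif not expected and predicted:
            fp += 1
        else:
            tn += 1

    # fall back to 0.0 when a denominator is empty (assumption of this sketch)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = precision + sensitivity
    f_1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f1_score": f_1}


# three expected-valid rows (two caught) and two expected-invalid rows (one wrongly accepted)
pairs = [(True, True), (True, True), (True, False), (False, True), (False, False)]
print(score_predictions(pairs))
```

With the scoring isolated like this, the LLM loop only needs to collect `(expectation, llm_res['valid'])` pairs.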


Now, let's break down the prompts with their examples.

## Results

Here is the performance of the prompting strategies on our makeshift cross-validation sample.

In [None]:
# print test results store