# Product Insight Validation Using LLMs: Business Goals🔍

## Overview

This notebook evaluates different prompting strategies and prompts for validating Business Goals using a Large Language Model (LLM). The goal is to have a sandbox where we can refine prompts against the same cross-validation sets.

## Objectives

- **Compare Prompting Strategies:** Test multiple prompts and strategies to determine which yields the best classification results.
- **Evaluate Performance:** Measure the effectiveness of each strategy using precision, recall, and F1 score.
- **Cross-Validation Approach:** Utilize a labeled dataset containing:
  - **True Positives (TP):** Correctly identified valid insights.
  - **True Negatives (TN):** Correctly identified invalid insights.
  - **False Positives (FP):** Incorrectly marked invalid insights as valid.
  - **False Negatives (FN):** Incorrectly marked valid insights as invalid.

## Methodology

1. **Load Product Insights**  
   - Import CSV files containing business goals for validation.

2. **Apply LLM-Based Validation**  
   - Set up the building blocks for calling the LLM and cleaning its responses.

3. **Evaluate Performance**  
   - Compute precision, recall, and F1 score to assess classification accuracy.
   - Compare the effectiveness of different strategies based on their performance metrics.
   - 3.1 Zero Shot Prompting
   - 3.2 Few Shot Prompting
   - 3.3 Multi-pass w/ Few Shot prompting

4. **Optimize for Accuracy**  
   - Identify the best-performing prompt and strategy for product insight validation.

## Tech Stack

- **LLM Provider:** Azure OpenAI  
- **Model:** ChatGPT 4.0  
- **Data Processing:** Python (pandas, numpy)  
- **Evaluation Metrics:** precision, recall, F1 score  

## Expected Outcomes

- A clear understanding of which prompting strategy yields the best results.
- A methodology/workflow that can be iteratively improved and scaled for future product insight validation tasks.



In [2]:
# let's import the packages we will need for this project

import requests # for connecting with Azure Open AI
import json # for parsing responses
import csv # for data processing
import pandas as pd # for data analysis 

# let's also import the config we will need to interact with the Azure Open AI API

from config import config_endpoint, config_key


# 1 - Load Business Goals

Let's take a glimpse at the data we have. All these business goals were validated by human validators during April '25.

These datasets act as makeshift cross-validation sets that we can use to test the performance of different prompting strategies and approaches.


We have two datasets:

- A small set of 20 cases (10 valid, 10 invalid) for quick experimentation
- A bigger set of 200 cases (100 valid, 100 invalid) for testing and measuring results

In [3]:
# We load all the data 

business_goals_20 = pd.read_csv('./cases/business_goals_cv20.csv')
business_goals_200 = pd.read_csv('./cases/business_goals_cv200.csv')

# Now let's print one of the datasets to see its shape

business_goals_20[:5]

Unnamed: 0,ID,Case Number,UPN,Line Of Business,Business Goals and Needs,Business Goal Validation,Business Goal Comment
0,28380,2504040040013287,gig_wfh_ossal@microsoftsupport.com,Business Advisor Reactive,Save time and money\nThe customer's business i...,1,BG is clear.
1,26505,2504020030000826,gig_wfh_kuman@microsoftsupport.com,Business Advisor Reactive,After discussing it became clear that the cust...,1,It details the need of the customer
2,26155,2504030030000687,gig_wfh_jibal@microsoftsupport.com,Business Advisor Reactive,Simplify everyday task\n\nThe business needs t...,1,Valid: Business goals is clearly stated
3,27892,2503310040001537,gig_wfh_sirai@microsoftsupport.com,Proactive Grace,The customer is a tech company offering web an...,1,Valid
4,26026,2503270040000491,gig_wfh_jewan@microsoftsupport.com,Trials Nurturing Proactive,Business offers a delightful range of homemade...,1,BG is clear.


In [4]:
# Column explanation
data = [
    ["Business Goals and Needs", "Raw Business Goals and Needs captured by the ambassador as they figure in the tracker"],
    ["Business Goal Validation", "Validation done by a human validator - 0 is invalid, 1 is valid"],
    ["Business Goal comment", "Comment/Explanation provided by a human validator"]
]
column_data = pd.DataFrame(data, columns=["Column Name", "Explanation"])

pd.set_option("display.max_colwidth", None) 
column_data

Unnamed: 0,Column Name,Explanation
0,Business Goals and Needs,Raw Business Goals and Needs captured by the ambassador as they figure in the tracker
1,Business Goal Validation,"Validation done by a human validator - 0 is invalid, 1 is valid"
2,Business Goal comment,Comment/Explanation provided by a human validator


## 1.1 Defining Performance

These 2 datasets have already been evaluated by human validators.

This means we can use the existing labels to calculate Sensitivity (Recall), Precision and F1 for each dataset, which gives us performance metrics we can analyse and optimise. We will be looking at the following metrics.

# Performance Metrics

## Sensitivity (Recall)
Sensitivity, also known as **recall**, measures the ability to correctly identify positive cases:

$$
\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

## Precision
Precision measures how many of the predicted positive cases were actually correct:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

## F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
$$
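
As a quick sanity check of the three formulas, the sketch below computes them from raw counts. The helper name `classification_metrics` and the counts are illustrative assumptions, not part of the notebook; mapping an empty denominator to 0.0 is a convention chosen here.

```python
def classification_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from confusion-matrix counts."""
    # Map an empty denominator to 0.0 instead of raising ZeroDivisionError.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1


# Illustrative counts: 9 true positives, 5 false positives, 1 false negative.
sensitivity, precision, f1 = classification_metrics(tp=9, fp=5, fn=1)
print(round(sensitivity, 3), round(precision, 3), round(f1, 3))  # prints 0.9 0.643 0.75
```

True negatives do not appear in any of the three formulas, so a prompt that rejects everything correctly still scores 0 on all of them.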


We will create a dataframe where we will store the results of our tests as we run them.

In [5]:
test_results = pd.DataFrame([], columns=["test_name", "sensitivity", "precision", "f1_score"])


test_results


Unnamed: 0,test_name,sensitivity,precision,f1_score


## 2. Setting Up Logic for LLM Validation and Analysis

### 2.1 Validation

Let's start this section by defining a function that calls Azure OpenAI with a system prompt and an input provided by the user.

The system prompt will contain the criteria to validate an insight, and the user input will be the entry registered by our agents. 

In [6]:
HEADERS = {
    "Content-Type": "application/json",
    "api-key": config_key
}

def send_prompt(system_prompt, user_prompt, max_tokens=200):
    """Send a prompt to Azure OpenAI and return the response."""
    url = config_endpoint
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens
    }
    
    try:
        response = requests.post(url, headers=HEADERS, json=data)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"

Let's test it out with a very naive example to make sure it works

In [7]:
res = send_prompt(
    "You are system dedicated to validate product feedback. You will only declare as valid feedback that has to do usability issues, anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }",
    "I could not use the app at all, the menu was very convoluted and crowded with icons. Very hard to read"
)

print(res)
    

```json
{
  "valid": true,
  "reason": "The feedback addresses a usability issue related to the app's menu design being convoluted and difficult to navigate, which impacts the user's ability to use the app effectively."
}
```


The model is giving us back a JSON wrapped in Markdown. Let's create a function to clean it 

In [8]:
def clean_llm_response(res):
    return res.replace("json", "").replace(r'\n', '').replace(r"\'", "'").replace("`", "").strip()

In [9]:
print(clean_llm_response(res))

{
  "valid": true,
  "reason": "The feedback addresses a usability issue related to the app's menu design being convoluted and difficult to navigate, which impacts the user's ability to use the app effectively."
}


Great! We now have the basic building block for testing different validation prompts.   
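
One caveat: the string replacements in `clean_llm_response` also strip the word "json" and any backticks that appear inside the reasoning text itself. A possible alternative, sketched here under the assumption that the response contains exactly one JSON object, is to extract the object with a regular expression and parse it directly. The helper name `extract_json` is hypothetical.

```python
import json
import re


def extract_json(res):
    """Return the JSON object embedded in an LLM response, or None."""
    # Greedy match from the first "{" to the last "}", across newlines.
    match = re.search(r"\{.*\}", res, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# The fenced wrapper is ignored and the word "json" in the value survives.
sample = '```json\n{"valid": true, "reason": "mentions json parsing"}\n```'
parsed = extract_json(sample)
print(parsed)
```

Returning `None` on failure lets the caller skip a row without wrapping every call in its own `try`/`except`.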

###  2.2 Analysis of prompt performance

Now we need a function that allows us to do the following:

- 1. Iterate through the rows of one of our datasets.
- 2. For each row:
    - 1. Ask the LLM to validate the entry
    - 2. Compare the LLM's verdict with the human validation
          - LLM => Valid, Human => Valid, then *true positive*
          - LLM => Invalid, Human => Invalid, then *true negative*
          - LLM => Valid, Human => Invalid, then *false positive*
          - LLM => Invalid, Human => Valid, then *false negative*
    - 3. Store this information
- 3. Calculate Sensitivity (Recall), Precision and F1 for this prompt
- 4. Add the results to our log in the `test_results` variable we created before

In [14]:
import time

def analyse_test_prompt(test_name, prompt, results_store, dataset):
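    '''
    Evaluates the performance of a prompt against a labelled dataset

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt: system prompt passed to the LLM to validate business goals
      - results_store: dataframe where we can store the results
      - dataset: dataframe with 'Business Goals and Needs' and 'Business Goal Validation' columns

    Returns the updated results_store
    '''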
    start_time = time.time()

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    row_counter = 0
    for index, row in dataset.iterrows():
        # pause every 10 calls to stay under the API rate limit
        print(row_counter)
        if row_counter >= 10:
            print("Rate limit is close, continuing in 61 seconds...")
            time.sleep(61)
            row_counter = 0
                
        llm_res = send_prompt(prompt, row['Business Goals and Needs'])
        llm_res = clean_llm_response(llm_res)
        row_counter += 1
        print(row['Business Goals and Needs'],llm_res)

        try:
            llm_res = json.loads(llm_res)
        except json.JSONDecodeError as e:
            print(f"[WARN] Failed to parse LLM response as JSON: {e}")
            continue

        human_validation = row['Business Goal Validation']
        if llm_res['valid'] == True and human_validation == 1:
            tp += 1
        elif llm_res['valid'] == False and human_validation == 0:
            tn += 1
        elif llm_res['valid'] == True and human_validation == 0:
            fp += 1
        elif llm_res['valid'] == False and human_validation == 1:
            fn += 1

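    # guard against empty denominators (e.g. when every response failed to parse)
    sensitivity = tp / ( tp + fn ) if ( tp + fn ) else 0.0
    precision = tp / ( tp + fp ) if ( tp + fp ) else 0.0
    f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity ) if ( precision + sensitivity ) else 0.0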
    
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store
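
The loop above pauses for a fixed 61 seconds after every 10 calls. An alternative worth considering is to retry only when a call actually fails, waiting longer after each failure. The sketch below is a generic version of that idea: `call_with_backoff` and `flaky_call` are hypothetical names, and using it with `send_prompt` would require that function to raise on errors instead of returning an `"Error: ..."` string.

```python
import time


def call_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as error:  # in the notebook this would be a requests error
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay}s")
            sleep(delay)


# Stand-in for a rate-limited endpoint: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return '{"valid": true}'


# Record the waits instead of sleeping so the example runs instantly.
waits = []
result = call_with_backoff(flaky_call, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the helper testable without real delays; in the notebook the default `time.sleep` would be used.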

Ok, our building blocks of logic are now ready. Let's start with some prompting

## 3. Testing Prompting Approaches

In this section we will test the performance of several prompting approaches to see which one performs best. Let's go!


### 3.1 Loose Zero Shot Prompt

Zero-shot prompting is a technique used with large language models (LLMs) where the model is asked to perform a task without being given any specific examples of how to do it. We're relying entirely on the model's pre-existing knowledge and understanding to generate a response.  

In the prompt below, we describe high-level criteria that human validators frequently mention when marking insights as valid or invalid. These are drawn from analysing the reasons validators give for their decisions.


In [13]:
loose_zero_shot_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Set 1: Valid Business Goals Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Clarity: the entry mentions a business goal or need that is clear and easy to understand

    - B) Specificity: The entry should clearly refer to a specific, concrete business goal or need

    - C) Actionable: The entry should focus on the practical applicability of Microsoft 365 products to address the business need described

## Set 2: Invalid Business Goal Criteria

Meeting any of the criteria below is enough for the entry to be considered invalid.

    - D) Focus on tools: The entry just lists the M365 applications being used, but there is no business goal or need mentioned

    - E) Vague business goal or need: The entry does not include any details nor actionable business goals/needs

    - F) Technical issue: The entry only describes a technical issue experienced by the customer and there is no business goal or need

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [29]:
# Test Loose Zero Shot Prompt here 
# done - test_results = analyse_test_prompt('loose_zero_shot_20', loose_zero_shot_prompt, test_results, business_goals_20)
# done - test_results = analyse_test_prompt('loose_zero_shot_200', loose_zero_shot_prompt, test_results, business_goals_200)


In [30]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,0.9,0.642857,0.75
1,loose_zero_shot_20_no_negative_criteria,0.777778,0.538462,0.636364
2,loose_zero_shot_200,0.9,0.638298,0.746888


In [31]:
loose_zero_shot_no_negative_criteria = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Valid Business Goals Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Clarity: the entry mentions a business goal or need that is clear and easy to understand

    - B) Specificity: The entry should clearly refer to a specific, concrete business goal or need

    - C) Actionable: The entry should focus on the practical applicability of Microsoft 365 products to address the business need described

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [21]:
test_results = analyse_test_prompt('loose_zero_shot_20_no_negative_criteria', loose_zero_shot_no_negative_criteria, test_results, business_goals_20)


0
Save time and money
The customer's business imports and sells car parts from Europe to LATAM. Their goal is to implement a collaboration platform to enhance engagement and improve daily operations. {
  "valid": true,
  "reasoning": "The entry meets all the validity criteria. (A) It is clear and easy to understand, as it specifies the customer's goal of enhancing engagement and improving daily operations. (B) It is specific, as it identifies the business context—importing and selling car parts—and the goal to implement a collaboration platform. (C) It is actionable, as it focuses on leveraging a Microsoft 365 product (e.g., Teams or SharePoint) to address collaboration and operational improvements."
}
1
After discussing it became clear that the customer aims to enhance efficiency in daily operations by establishing a generic email address connected to a centralized mailbox for their company. This setup would allow multiple users to access the mailbox simultaneously and receive notific

In [41]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,0.9,0.642857,0.75
1,loose_zero_shot_20_no_negative_criteria,0.777778,0.538462,0.636364
2,loose_zero_shot_200,0.9,0.638298,0.746888
3,detailed_zero_shot_20,0.555556,0.625,0.588235
4,detailed_few_shot_20,0.4,0.5,0.444444


### 3.2 Detailed Zero Shot Prompt

In the prompt below, we provide a detailed list of criteria based on the latest version of the insights framework available at GigPlus. This has a detailed set of tiered criteria. 


In [35]:
detailed_zero_shot_prompt = '''
You are an AI assistant that validates **Business Goals or Needs** by checking if they are well-defined and actionable for Microsoft 365 (including Copilot) solutions. A **Business Goal or Need** should describe a **specific technical need** tied to a business outcome that Microsoft 365 can achieve, rather than a vague goal.

**Example of a well-defined Business Goal/Need:**  
*A customer service team wants to respond to customers faster and more personally. To do this, they **need an AI-based email assistant** that can draft personalized replies and provide real-time customer sentiment analysis.*  
(This example clearly states a technical need – an AI email assistant – and the desired outcome of faster, more personal responses.)

## Validation Criteria:

**Core Requirements (must have all):**  
1. **Technical Need & Outcome** – Does it clearly specify the **technical solution or capability needed** *and* the **business outcome** desired? (e.g. “need X tool to achieve Y result”).  
2. **M365 Actionable** – Is it **actionable with Microsoft 365 and Copilot**? (i.e. the need can be addressed by a specific Microsoft 365 product, service, or feature).  
3. **Clarity & Feasibility** – Is it **specific, achievable, and clear**? (Avoids vagueness or unrealistic goals. It should be realistic and well-defined enough to guide solution design.)

**Additional Characteristics (at least one is typically present in a valid need):**  
- **User-Centric Pain Point** – Reflects a real user or team problem that Microsoft 365 can solve.  
- **Collaboration Improvement** – Aims to enhance communication or teamwork (a common strength of M365 tools).  
- **Customer Experience** – Targets improving how the business serves or engages customers (via M365 capabilities).  
- **Process Optimization** – Involves streamlining workflows or business processes (which M365 features can support).  
- **Integration** – Involves connecting multiple M365 services or integrating M365 with other systems.  
- **Data Management** – Focuses on better organizing, accessing, or analyzing data (using M365 analytics, databases, etc.).  
- **Mobile/Remote Support** – Addresses needs for remote work or mobile productivity (leveraging M365’s cloud/mobile tools).  
- **Cost/ROI Focus** – Indicates potential for cost savings or high return on investment using M365.  
- **Time-Saving Automation** – Centers on automating tasks or reducing manual effort (a key benefit of many M365/Copilot solutions).

*A Business Goal/Need that meets all core requirements and is aligned with Microsoft 365 capabilities (and ideally touches on one or more additional characteristics) should be marked **Valid**. If it misses any core requirement (e.g., too vague, not tied to M365, etc.), mark it **Invalid**.*

## Response ##

You will respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [37]:
test_results = analyse_test_prompt('detailed_zero_shot_20', detailed_zero_shot_prompt, test_results, business_goals_20)


0
Save time and money
The customer's business imports and sells car parts from Europe to LATAM. Their goal is to implement a collaboration platform to enhance engagement and improve daily operations. {
  "valid": false,
  "reasoning": "The Business Goal/Need is too vague and lacks clarity on the specific technical solution or capability needed, which violates the 'Technical Need & Outcome' criterion. While it mentions a general goal to 'save time and money' and improve collaboration, it does not specify how Microsoft 365 (or Copilot) tools would be used to achieve this result. It also does not provide a clear or actionable technical need tied to a business outcome, such as automation, workflow optimization, or specific M365 features. Without these details, it is not actionable within Microsoft 365."
}
1
After discussing it became clear that the customer aims to enhance efficiency in daily operations by establishing a generic email address connected to a centralized mailbox for their co

In [38]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,0.9,0.642857,0.75
1,loose_zero_shot_20_no_negative_criteria,0.777778,0.538462,0.636364
2,loose_zero_shot_200,0.9,0.638298,0.746888
3,detailed_zero_shot_20,0.555556,0.625,0.588235


### 3.3 Few-Shot Prompt

Few-shot prompting is a technique in prompt engineering that aims to augment LLMs by providing a small number of examples within the prompt itself. This allows the model to learn and adapt to a specific task without requiring extensive fine-tuning.

In the prompt below, we will provide a few positive and negative examples for each of the categories, and see its impact on performance. 
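
To make the mechanics concrete, here is a minimal sketch of how an examples section could be assembled from labelled cases and appended to a system prompt. The helper `build_few_shot_block` and the tuple layout are assumptions for illustration; the prompts in this notebook were written by hand.

```python
def build_few_shot_block(examples):
    """Format (text, is_valid, note) tuples as a Markdown examples section."""
    lines = ["### Examples:", ""]
    for number, (text, is_valid, note) in enumerate(examples, start=1):
        label = "Valid" if is_valid else "Invalid"
        lines.append(f'- **Example {number} ({label}):** *"{text}"*')
        lines.append(f"  - {note}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical examples; in practice they would come from validated cases.
examples = [
    ("The sales team needs a real-time dashboard to track daily sales.", True, "Specific need and outcome."),
    ("Improve productivity with better tools.", False, "Too vague."),
]
few_shot_block = build_few_shot_block(examples)
print(few_shot_block)
```

Examples used in the prompt should not also appear in the evaluation set, otherwise the scores for that prompt would be inflated.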

In [39]:
few_shot_prompt = '''
You are an AI assistant that validates **Business Goals or Needs** by checking if they are well-defined and actionable for Microsoft 365 (including Copilot) solutions. A **Business Goal or Need** should describe a **specific technical need** tied to a business outcome that Microsoft 365 can achieve, rather than a vague goal.

**Example of a well-defined Business Goal/Need:**  
*A customer service team wants to respond to customers faster and more personally. To do this, they **need an AI-based email assistant** that can draft personalized replies and provide real-time customer sentiment analysis.*  
(This example clearly states a technical need – an AI email assistant – and the desired outcome of faster, more personal responses.)

## Validation Criteria:

**Core Requirements (must have all):**  
1. **Technical Need & Outcome** – Does it clearly specify the **technical solution or capability needed** *and* the **business outcome** desired? (e.g. “need X tool to achieve Y result”).  
2. **M365 Actionable** – Is it **actionable with Microsoft 365/Copilot**? (i.e. the need can be addressed by a specific Microsoft 365 product, service, or feature).  
3. **Clarity & Feasibility** – Is it **specific, achievable, and clear**? (Avoids vagueness or unrealistic goals. It should be realistic and well-defined enough to guide solution design.)

**Additional Characteristics (at least one is typically present in a valid need):**  
- **User-Centric Pain Point** – Reflects a real user or team problem that Microsoft 365 can solve.  
- **Collaboration Improvement** – Aims to enhance communication or teamwork (a common strength of M365 tools).  
- **Customer Experience** – Targets improving how the business serves or engages customers (via M365 capabilities).  
- **Process Optimization** – Involves streamlining workflows or business processes (which M365 features can support).  
- **Integration** – Involves connecting multiple M365 services or integrating M365 with other systems.  
- **Data Management** – Focuses on better organizing, accessing, or analyzing data (using M365 analytics, databases, etc.).  
- **Mobile/Remote Support** – Addresses needs for remote work or mobile productivity (leveraging M365’s cloud/mobile tools).  
- **Cost/ROI Focus** – Indicates potential for cost savings or high return on investment using M365.  
- **Time-Saving Automation** – Centers on automating tasks or reducing manual effort (a key benefit of many M365/Copilot solutions).

*A Business Goal/Need that meets all core requirements and is aligned with Microsoft 365 capabilities (and ideally touches on one or more additional characteristics) should be marked **Valid**. If it misses any core requirement (e.g., too vague, not tied to M365, etc.), mark it **Invalid**.*

### Examples:

- **Example 1 (Valid):** *“The sales team needs a real-time analytics dashboard (Power BI) to track daily sales and quickly adjust strategies to meet their targets.”*  
  – Meets 1 (specific need: real-time dashboard; outcome: adjust strategies to meet targets), 2 (M365 solution: Power BI), 3 (clear and achievable). Also aligns with Data Management and Time-Saving.  

- **Example 2 (Invalid):** *“Improve productivity with better tools.”*  
  – Fails 1 (no specific tool or outcome mentioned) and 3 (too vague). We don’t know what “better tools” or which outcome. Not actionable for PLG because it’s unclear what to build or measure.  

## Response ##

You will always respond in JSON format with only the following fields and no more:

* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [40]:
test_results = analyse_test_prompt('detailed_few_shot_20', few_shot_prompt, test_results, business_goals_20)

test_results

0
Save time and money
The customer's business imports and sells car parts from Europe to LATAM. Their goal is to implement a collaboration platform to enhance engagement and improve daily operations. {
  "valid": false,
  "reasoning": "The entry is too vague and does not specify a clear technical need or business outcome tied to Microsoft 365 solutions. While it mentions a broad goal of implementing a collaboration platform to enhance engagement and improve operations, it does not identify specific tools, capabilities, or outcomes (e.g., 'streamline communication through Teams for faster decision-making' or 'use SharePoint to centralize inventory tracking'). Without such detail, it fails core requirements for clarity, technical need, and actionable M365 capability."
}
1
After discussing it became clear that the customer aims to enhance efficiency in daily operations by establishing a generic email address connected to a centralized mailbox for their company. This setup would allow mult

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,0.9,0.642857,0.75
1,loose_zero_shot_20_no_negative_criteria,0.777778,0.538462,0.636364
2,loose_zero_shot_200,0.9,0.638298,0.746888
3,detailed_zero_shot_20,0.555556,0.625,0.588235
4,detailed_few_shot_20,0.4,0.5,0.444444


### Few Shot V2

This prompt features more positive and negative examples, stricter language, and less redundancy. It has also been optimised by an AI.

In [12]:
few_shot_prompt_v2 = '''
You are an AI assistant that validates **Business Goals or Needs** by checking if they are well-defined and actionable for Microsoft 365 (including Copilot) solutions. A **Business Goal or Need** should describe a **specific technical need** tied to a business outcome that Microsoft 365 can achieve, rather than a vague goal.

## Validation Criteria:

**1. Core Criteria**

The entry must match all of the following core criteria: 

A: **Technical Need & Outcome** – The entry specifies the **technical solution or capability needed** *and* the **business outcome** desired? (e.g. “need X technical solution to achieve Y result”).  
B: **M365 Actionable** – The entry describes a need, goal or pain point that is *actionable with Microsoft 365 and/or Copilot products**
C: **Clarity & Specificty ** – The entry is **specific and clear**, it does not contain vague business jargon and showcases a concrete job-to-be-done for the customer

**2. Additional Criteria :**  

The entry must match at least one of the following criteria to be valid:

D: - **Collaboration Improvement** – The entry describes a need to enhance communication or teamwork (a common strength of M365 tools).  
E: - **Improving Customer Experience** – The entry describes a need related to better serve customers via M365 capabilities.  
F: - **Process Optimization** –The entry describes a need or goal related to automating business processes, autating tasks and reducing manual effort(which M365 features can support).  
G: - **Integration** – The entry describes a need related to connecting multiple M365 services or integrating M365 with other systems.  
H: - **Data Management** – The entry describes a need or goal related to organizing, accessing, or analyzing data (using M365 analytics, databases, etc.).  
I: - **Mobile/Remote Support** – The entry descrbes needs related to remote work or mobile productivity (leveraging M365’s cloud/mobile tools).  
J: - **Cost/ROI Focus** – Indicates potential for cost savings or high return on investment using M365.  

*An entry that meets all core requirements and is aligned with Microsoft 365 capabilities (and touches on one or more additional characteristics) should be marked **Valid**. If it misses any core requirement (e.g., too vague, not tied to M365, etc.), mark it **Invalid**.*

### Valid Examples:

- "A financial services company wants to reduce manual expense approvals by automating workflows within the next quarter"
    - Meets A, B, C and F

- "An independend consulstant managing client contracts is looking for a way to secure sensitive data and comply with GPDR"
    - Meets A, B, C and H

- “The sales team needs a real-time analytics dashboard to track daily sales and quickly adjust strategies to meet their targets.”
  – Meets A, B, C and F

- "The customer needs a solution that enables efficient project planning and real-time collaboration among team members".
 - Meets A, B, C and D

### Invalid Examples:

- “Improve productivity with better tools.”
  – Fails A and C

- “the customer needs a security suite to secure his business.”
  – Fails C

- "The customer's goal is to reduce costs by 50%, so it recommends cheaper plans like Microsoft Basic." 
 - Fails A and B

## Response ##

You will always respond in JSON format with only the following fields and no more:

* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [15]:
test_results = analyse_test_prompt('detailed_few_shot_20', few_shot_prompt_v2, test_results, business_goals_20)

0
Save time and money
The customer's business imports and sells car parts from Europe to LATAM. Their goal is to implement a collaboration platform to enhance engagement and improve daily operations. {
  "valid": false,
  "reasoning": "The entry is too vague and lacks specificity regarding the technical need and business outcome (Criteria A). While it mentions a general goal to implement a collaboration platform, it does not specify what capabilities are required or how Microsoft 365 will address their specific operational challenges. Additionally, 'Save time and money' is overly broad and does not clarify the concrete job-to-be-done (Criteria C)."
}
1
After discussing it became clear that the customer aims to enhance efficiency in daily operations by establishing a generic email address connected to a centralized mailbox for their company. This setup would allow multiple users to access the mailbox simultaneously and receive notifications when customers send inquiries, thereby improvi



In [16]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,detailed_few_shot_20,0.7,0.583333,0.636364


In [14]:
def analyse_multipass_prompt(test_name, prompt_product, prompt_deployment, results_store, true_positives, true_negatives, false_positives, false_negatives):
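    '''
    Evaluates the performance of a two-pass prompting strategy

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt_product: system prompt passed to the LLM to validate product feedback
      - prompt_deployment: system prompt passed to the LLM to validate deployment blockers
      - results_store: dataframe where we can store the results
      - true_positives, true_negatives, false_positives, false_negatives: dataframes with a 'Feedback' column, grouped by expected outcome (see `dataframes` below)
    '''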
    start_time = time.time()
    
    dataframes = [
        [true_positives, True], # first value contains the data, the second what we would like the model to return for every row
        [true_negatives, False], # for instance, the llm should evaluate all true positives as valid to have 100% accuracy 
        [false_positives, False],
        [false_negatives, True]
    ]

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    llm_call_counter = 0
    for dataframe in dataframes:
        data, expectation = dataframe
        
        for index, row in data.iterrows():
            # pause every 10 calls to stay under the API rate limit
            print(llm_call_counter)
            if llm_call_counter >= 10:
                print(f"Rate limit is close, continuing in {60} seconds...")
                time.sleep(61)
                llm_call_counter = 0
                
                
            llm_res = send_prompt(prompt_product, row['Feedback'])
            llm_res = clean_llm_response(llm_res)
            llm_call_counter += 1

            print(llm_res)
            try:
                llm_res = json.loads(llm_res)
            except json.JSONDecodeError as e:
                print(f"Error parsing JSON: {e}, moving to next case")  # Log the error
                continue

            # if we did not get a TP or a TN, we use the other prompt
            if expectation != llm_res['valid']:
                llm_res = send_prompt(prompt_deployment, row['Feedback'])
                llm_res = clean_llm_response(llm_res)
                llm_call_counter += 1
                try:
                    llm_res = json.loads(llm_res)
                except json.JSONDecodeError as e:
                    print(f"Error parsing JSON: {e}, moving to next case")  # Log the error
                    continue
                
            
            
            print(llm_res)
            
            if expectation == True and llm_res['valid'] == True:
                tp += 1
            elif expectation == True and llm_res['valid'] == False:
                fn += 1
            elif expectation == False and llm_res['valid'] == True:
                fp += 1
            elif expectation == False and llm_res['valid'] == False:
                tn += 1

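    # guard against empty denominators (e.g. when every response failed to parse)
    sensitivity = tp / ( tp + fn ) if ( tp + fn ) else 0.0
    precision = tp / ( tp + fp ) if ( tp + fp ) else 0.0
    f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity ) if ( precision + sensitivity ) else 0.0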
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store
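
In `analyse_multipass_prompt` the second prompt is called when the first answer disagrees with `expectation`, which is the human label. That label would not be available when validating new entries, so scores measured this way may be optimistic. Below is a minimal label-free sketch, assuming (this is an assumption, not the notebook's design) that the second pass runs whenever the first pass rejects an entry. The toy validators stand in for two `send_prompt` calls with different system prompts.

```python
def multipass_validate(entry, first_pass, second_pass):
    """Run a second validator only when the first one rejects the entry."""
    first = first_pass(entry)
    if first["valid"]:
        return first
    return second_pass(entry)


# Toy validators standing in for two LLM calls with different prompts.
def strict(entry):
    return {"valid": "dashboard" in entry, "reasoning": "strict pass"}


def lenient(entry):
    return {"valid": len(entry.split()) > 5, "reasoning": "lenient pass"}


accepted = multipass_validate("The team needs a dashboard", strict, lenient)
escalated = multipass_validate(
    "We want to share files with external partners securely", strict, lenient
)
rejected = multipass_validate("Better tools", strict, lenient)
```

With this rule the second pass can only turn rejections into acceptances, so it would tend to raise sensitivity at the possible cost of precision.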


Now, let's break down the prompts with their examples

## Results

Here we have the performance of several prompting strategies on our makeshift cross validation sample. 
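
To compare strategies at a glance, the runs can be ranked by F1. The sketch below uses plain Python with the scores copied from the five-row `test_results` table shown earlier, so it runs without the notebook state; with the dataframe itself, `test_results.sort_values("f1_score", ascending=False)` would give the same ordering.

```python
# Scores copied from the five-row `test_results` table shown earlier.
results = [
    {"test_name": "loose_zero_shot_20", "sensitivity": 0.9, "precision": 0.642857, "f1_score": 0.75},
    {"test_name": "loose_zero_shot_20_no_negative_criteria", "sensitivity": 0.777778, "precision": 0.538462, "f1_score": 0.636364},
    {"test_name": "loose_zero_shot_200", "sensitivity": 0.9, "precision": 0.638298, "f1_score": 0.746888},
    {"test_name": "detailed_zero_shot_20", "sensitivity": 0.555556, "precision": 0.625, "f1_score": 0.588235},
    {"test_name": "detailed_few_shot_20", "sensitivity": 0.4, "precision": 0.5, "f1_score": 0.444444},
]

# Highest F1 first.
ranked = sorted(results, key=lambda row: row["f1_score"], reverse=True)
for row in ranked:
    print(f'{row["test_name"]:<42} f1={row["f1_score"]:.3f}')
best = ranked[0]["test_name"]
```

Most of these scores come from the 20-case set, so small differences between runs should be read with caution.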

In [21]:
# print test results store