# Product Insight Validation Using LLMs: M365 Product Feedback🔍

## Overview

This notebooks aims to evaluate different prompting strategies and prompts for validating M365 Product Feedback using a Large Language Model (LLM). The goal is to have a sandbox where we can fine-tune prompts using the same cross validation sets. 

## Objectives

- **Compare Prompting Strategies:** Test multiple prompts and strategies to determine which yields the best classification results.
- **Evaluate Performance:** Measure the effectiveness of each strategy using precision, recall, and F1 score.
- **Cross-Validation Approach:** Utilize a labeled dataset containing:
  - **True Positives (TP):** Correctly identified valid insights.
  - **True Negatives (TN):** Correctly identified invalid insights.
  - **False Positives (FP):** Incorrectly marked invalid insights as valid.
  - **False Negatives (FN):** Incorrectly marked valid insights as invalid.

## Methodology

1. **Load Product Insights**  
   - Import CSV files containing business goals for validation.

2. **Apply LLM-Based Validation**  
   - Building blocks for using LLM to validate, and cleaning inputs

3. **Evaluate Performance**  
   - Compute precision, recall, and F1 score to assess classification accuracy.
   - Compare the effectiveness of different strategies based on their performance metrics.
   - 3.1 Zero Shot Prompting
   - 3.2 Few Shot Prompting
   - 3.3 Multi-pass w/ Few Shot prompting

4. **Optimize for Accuracy**  
   - Identify the best-performing prompt and strategy for product insight validation.

## Tech Stack

- **LLM Provider:** Azure OpenAI  
- **Model:** ChatGPT 4.0  
- **Data Processing:** Python (pandas, numpy)  
- **Evaluation Metrics:** precision, recall, F1 score  

## Expected Outcomes

- A clear understanding of which prompting strategy yields the best results.
- A methodology/workflow that can be iteratively improved and scaled for future product insight validation tasks.



In [2]:
# let's import the packages we will need for this project

import requests # for connecting with Azure Open AI
import json # for parsing responses
import csv # for data processing
import pandas as pd # for data analysis 

# let's also import the config we will need to interact with the Azure Open AI API

from config import config_endpoint, config_key


# 1 - Load M365 Product Feedback

Let's take a glimpse at the data we have. All these business goals have been validated by human validators during the month of April '25 

These datasets will act as makeshift cross-validation sets, that we can use to test the performance of different prompting strategies and approaches. 


We have two datasets:

- A small set of 20 cases ( 10 valid, 10 invalids ) for quick experimentation
- A bigger set of 200 cases ( 100 valid, 100 invalids ) for testing and measuring results 

In [6]:
# We load all the data 

m365_product_feedback_20 = pd.read_csv('./cases/m365_feedback_cv20.csv') # 2-3 min
m365_product_feedback_200 = pd.read_csv('./cases/m365_feedback_cv200.csv') # 25-30 min

# Now let's print one of the datasets to see its shape

m365_product_feedback_20[:5]

Unnamed: 0,ID,Case Number,UPN,Line Of Business,Product Feedback and Limitations,Product Feedback and Limitations Validation,Product Feedback and Limitations Comment
0,27636,2504050040000885,gig_wfh_udmem@office365support.com,Business Advisor Reactive,"\nCustomer complained that the domain verification is for the users with have IT skills, and not al of the users are , Microsoft should create a system when adding the domain to the Microsoft 365 will not have to be manually, it it can be done by just clicking on a button so that the domain can be looked into.",1,Feedback is valid
1,25781,2504010040001913,gig_wfh_alass@microsoftsupport.com,Proactive Grace,Feedback and limitations: The customer noted that the admin portal is vague and not straightforward because it is hard to find what the customer's looking for in the long menus. The customer justified his feedback by a situation he experienced which that he has been trying for 2 years to change and cease the subscription renewal cycle from the billing section in the portal but couldn't figure it out himself. \n\nProduct Feedback and Limitations:,1,Valid
2,27911,2503310010001432,gig_wfh_hariv@microsoftsupport.com,Business Assist,"\nThe customer seeks to restrict entry-level employees' usage of M365 resources and control their access without subscribing to additional add-in licenses. They require a solution similar to Apple Business Manager, which allows them to limit device functionality and user experience without incurring extra costs for user-based licenses, focusing solely on device subscriptions.",1,valid
3,26153,2504020030001649,gig_wfh_jibal@microsoftsupport.com,Business Advisor Reactive,"M365 Product Feedback: \nIt would be beneficial for the customer if, when verifying their business domain from a third-party domain host, Microsoft would simply ask for a sign-in page. This would make it easier to add the DNS records instantly and save time on verification.",1,Valid
4,26494,2503310040002168,gig_wfh_avadv@microsoftsupport.com,Trials Nurturing Proactive,"Feedback and limitations: The customer emphasized that integrating Forms and SharePoint for appointment scheduling greatly enhances their shop's operations. Forms collects detailed booking information accurately, while SharePoint securely organizes data for staff access and scheduling. They particularly value the simplicity of using a QR code to direct clients to the system, streamlining the process and improving overall efficiency.\n\nProduct Feedback and Limitations:",1,Valid


In [5]:
# Column explanation
data = [
    ["Product Feedback and Limitations", "Raw M365 product feedback captured by the ambassador as they figure in the tracker"],
    ["Product Feedback and Limitations Validation", "Validation done by a human validator - 0 is invalid, 1 is valid"],
    ["Product Feedback and Limitations Comment", "Comment/Explanation provided by a human validator"]
]
column_data = pd.DataFrame(data, columns=["Column Name", "Explanation"])

pd.set_option("display.max_colwidth", None) 
column_data

Unnamed: 0,Column Name,Explanation
0,Product Feedback and Limitations,Raw M365 product feedback captured by the ambassador as they figure in the tracker
1,Product Feedback and Limitations Validation,"Validation done by a human validator - 0 is invalid, 1 is valid"
2,Product Feedback and Limitations Comment,Comment/Explanation provided by a human validator


## 1.1 Defining Performance

These 2 datasets have already been evaluated by human validators.

This means we can use the previous labels to calculate Sensitivity, Recall and F1 for this dataset, which will give us performance metrics we can analyse and optimise. We will be looking at the following metrics.

# Performance Metrics

## Sensitivity (Recall)
Sensitivity, also known as **recall**, measures the ability to correctly identify positive cases:

$$
\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

## Precision
Precision measures how many of the predicted positive cases were actually correct:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

## F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
$$


We will create a dataframe where we will store the results of our tests as we run them.

In [6]:
test_results = pd.DataFrame([], columns=["test_name", "sensitivity", "precision", "f1_score"])


test_results


Unnamed: 0,test_name,sensitivity,precision,f1_score


## 2. Setting Up Logic for LLM Validation and Analysis

### 2.1 Validation

Let's start this section by defining a function that calls Azure Open AI with a system prompt, and an input provided by the user. 

The system prompt will contain the criteria to validate an insight, and the user input will be the entry registered by our agents. 

In [7]:
HEADERS = {
    "Content-Type": "application/json",
    "api-key": config_key
}

def send_prompt(system_prompt, user_prompt, max_tokens=200):
    """Send a prompt to Azure OpenAI and return the response."""
    url = config_endpoint
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens
    }
    
    try:
        response = requests.post(url, headers=HEADERS, json=data)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"

Let's test it out with a very naive example to make sure it works

In [8]:
res = send_prompt(
    "You are system dedicated to validate product feedback. You will only declare as valid feedback that has to do usability issues, anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }",
    "I could not use the app at all, the menu was very convoluted and crowded with icons. Very hard to read"
)

print(res)
    

```json
{
  "valid": true,
  "reason": "The feedback highlights a usability issue where the menu design is convoluted and overcrowded, leading to difficulties in navigation and readability. This directly affects the user's ability to use the app effectively."
}
```


The model is giving us back a JSON wrapped in Markdown. Let's create a function to clean it 

In [9]:
def clean_llm_response(res):
    return res.replace("json", "").replace(r'\n', '').replace(r"\'", "'").replace("`", "").strip()

In [10]:
print(clean_llm_response(res))

{
  "valid": true,
  "reason": "The feedback highlights a usability issue where the menu design is convoluted and overcrowded, leading to difficulties in navigation and readability. This directly affects the user's ability to use the app effectively."
}


Great! We now have the basic building block for testing different validation prompts.   

###  2.2 Analysis of prompt performance

Now we need a function that allows us to do the following:

- 1. Iterate through the rows of one of our datasets.
- 2. For each of the rows in each of the datasets
    - 1. Ask the LLM to validate the entry
    - 2. Evaluate if the LLM did a good job or not
          - LLM => Valid, Human => Valid, then *true positive*
          - LLM => Invalid, Human => Invalid, then *true negative*
          - LLM => Valid, Human => Invalid, then *false positive*
          - LLM => Invalid, Human => Valid, then *false negative*
    - 3. Store this information
- 4. Calculate Sensitivity, Recall and F1 for this prompt
- 5. Add the results to our log in the `test_results` variable we created before

In [11]:
import time 
import ipdb;

def analyse_test_prompt(test_name, prompt, results_store, dataset):
    '''
    Evaluates the performance of a prompt 

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt: system prompt passed to the LLM to validate product feedback
      - results_store: dataframe where we can store the results
    '''
    start_time = time.time()

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    row_counter = 0
    for index, row in dataset.iterrows():
        # avoid token limit if needed every 10 rows 
        print(row_counter)
        if row_counter >= 10:
            print(f"Rate limit is close, continuing in {60} seconds...")
            time.sleep(61)
            row_counter = 0
                
        llm_res = send_prompt(prompt, row['Product Led Growth Conversation'])
        llm_res = clean_llm_response(llm_res)
        row_counter += 1
        print(row['Product Led Growth Conversation'],llm_res)

        try:
            llm_res = json.loads(llm_res)
        except json.JSONDecodeError as e:
            print(f"[WARN] Failed to parse LLM response as JSON: {e}")
            continue

        human_validation = row['PLG Conversation Validation']
        if llm_res['valid'] == True and human_validation == 1:
            tp += 1
        elif llm_res['valid'] == False and human_validation == 0:
            tn += 1
        elif llm_res['valid'] == True and human_validation == 0:
            fp += 1
        elif llm_res['valid'] == False and human_validation == 1:
            tn += 1

    sensitivity = tp / ( tp + fn )
    precision = tp / ( tp + fp )
    f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity )
    
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store

Ok, our building blocks of logic are now ready. Let's start with some prompting

## 3. Testing Prompting Approaches

In this section we will test the performance of several prompting approaches to see which one seems performs better. Let's go!


### 3.1 Loose Zero Shot Prompt

Zero-shot prompting is a technique used with large language models (LLMs) where the model is asked to perform a task without being given any specific examples of how to do it. We're relying entirely on the model's pre-existing knowledge and understanding to generate a response.  

In the prompt below, we describe high level criteria that is frequently mentioned by human validators to mark insights as valid or invalid. These are drawn from analysing the reasons as to why validators mark insights as valid or invalid. 


In [38]:
loose_zero_shot_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Set 1: Valid Business Goals Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Clarity: the entry mentions a business goal or need that is clear and easy to understand

    - B) Specificity: The entry should clearly refer to a specific, concrete business goal or need

    - C) Actionable: The entry should focus on the practical applicability of Microsoft 365 products to address the business need described

## Set 2: Invalid Business Goal Criteria

Meeting any of the criteria below is enough for the entry to be considered invalid.

    - D) Focus on tools: The entry just lists the M365 applications being used, but there is no business goal or need mentioned

    - E) Vague business goal or need: The entry does not include any details nor actionable business goals/needs

    - F) Technical issue: The entry only describes a technical issue experienced by the customer and there is no business goal or need

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [12]:
# Test Loose Zero Shot Prompt here 

In [40]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,1.0,0.615385,0.761905
1,loose_zero_shot_20_no_negative_criteria,1.0,0.642857,0.782609
2,loose_zero_shot_200,1.0,0.72,0.837209


In [None]:
loose_zero_shot_no_negative_criteria = 
"""
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Valid Business Goals Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Clarity: the entry mentions a business goal or need that is clear and easy to understand

    - B) Specificity: The entry should clearly refer to a specific, concrete business goal or need

    - C) Actionable: The entry should focus on the practical applicability of Microsoft 365 products to address the business need described

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''


"""

In [None]:
test_results = analyse_test_prompt('loose_zero_shot_200_no_negative_criteria', loose_zero_shot_no_negative_criteria, test_results, business_goals_200)


In [36]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,1.0,0.615385,0.761905
1,loose_zero_shot_20_no_negative_criteria,1.0,0.642857,0.782609


### 3.2 Detailed Zero Shot Prompt

In the prompt below, we provide a detailed list of criteria based on the latest version of the insights framework available at GigPlus. This has a detailed set of tiered criteria. 


In [20]:
loose_detailed_zero_shot_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

## PROMPT MISSING ##

## Response ##

You will respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

### 3.1 Few-Shot Prompt

Few-shot prompting is a technique in prompt engineering that aims to augment LLMs by providing a small number of examples within the prompt itself. This allows the model to learn and adapt to a specific task without requiring extensive fine-tuning.

In the prompt below, we will provide a few positive and negative examples for each of the categories, and see its impact on performance. 

In [13]:
few_shot_prompt = '''

You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## PROMPT MISSING ##

* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [None]:
test_results = analyse_test_prompt('few shot', few_shot_prompt, test_results)

print(test_results)

In [14]:
def analyse_multipass_prompt(test_name, first_prompt, second_prompt, results_store, dataset):
    '''
    Evaluates the performance of a prompt 

    Args:
      - test_name: name of the test, can be used as an identifier
      - first_prompt: system prompt passed to the first validation LLM
      - second_prompt: system prompt passed for the second validation using LLM
      - results_store: dataframe where we can store the results
      - dataset: Cross validation set used for the task
    '''
    start_time = time.time()

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    llm_call_counter = 0
        
    for index, row in dataset.iterrows():
            # avoid token limit if needed every 10 rows 
        print(llm_call_counter)
        if llm_call_counter >= 10:
            print(f"Rate limit is close, continuing in {60} seconds...")
            time.sleep(61)
            llm_call_counter = 0
            
            
        llm_res = send_prompt(first_prompt, row['Feedback'])
        llm_res = clean_llm_response(llm_res)
        llm_call_counter += 1

        print(llm_res)
        # if we did not get a TP or an TN, we use the other prompt
        llm_res = json.loads(llm_res)
        if (llm_res['valid'] == False and human_validation == 1) or (llm_res['valid'] == True and human_validation == 0):
            llm_res = send_prompt(prompt_deployment, row['Feedback'])
            llm_res = clean_llm_response(llm_res)
            try:
                llm_res = json.loads(llm_res)
            except json.JSONDecodeError as e:
                print(f"[WARN] eError parsing JSON: {e}, moving to next case")  # Log the error
                continue
                
            llm_call_counter += 1
            
        
        
        print(llm_res)
        
        if llm_res['valid'] == True and human_validation == 1:
            tp += 1
        elif llm_res['valid'] == False and human_validation == 0:
            tn += 1
        elif llm_res['valid'] == True and human_validation == 0:
            fp += 1
        elif llm_res['valid'] == False and human_validation == 1:
            tn += 1

    sensitivity = tp / ( tp + fn )
    precision = tp / ( tp + fp )
    f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity )
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store


Now, let's break down the prompts with their examples

## Results

Here we have the performance of several prompting strategies on our makeshift cross validation sample. 

In [21]:
# print test results store