# Product Insight Validation Using LLMs: Business Goals🔍

## Overview

This notebook evaluates different prompting strategies and prompts for validating Business Goals using a Large Language Model (LLM). The goal is to have a sandbox where we can refine prompts against the same cross-validation sets.

## Objectives

- **Compare Prompting Strategies:** Test multiple prompts and strategies to determine which yields the best classification results.
- **Evaluate Performance:** Measure the effectiveness of each strategy using precision, recall, and F1 score.
- **Cross-Validation Approach:** Use a human-labeled dataset so that every LLM verdict falls into one of four outcomes:
  - **True Positives (TP):** Valid insights correctly marked as valid.
  - **True Negatives (TN):** Invalid insights correctly marked as invalid.
  - **False Positives (FP):** Invalid insights incorrectly marked as valid.
  - **False Negatives (FN):** Valid insights incorrectly marked as invalid.

## Methodology

1. **Load Product Insights**  
   - Import CSV files containing business goals for validation.

2. **Apply LLM-Based Validation**  
   - Define the building blocks for calling the LLM to validate entries and for cleaning its responses.

3. **Evaluate Performance**  
   - Compute precision, recall, and F1 score to assess classification accuracy.
   - Compare the effectiveness of different strategies based on their performance metrics.
   - 3.1 Zero Shot Prompting
   - 3.2 Few Shot Prompting
   - 3.3 Multi-pass w/ Few Shot prompting

4. **Optimize for Accuracy**  
   - Identify the best-performing prompt and strategy for product insight validation.

## Tech Stack

- **LLM Provider:** Azure OpenAI  
- **Model:** ChatGPT 4.0  
- **Data Processing:** Python (pandas, numpy)  
- **Evaluation Metrics:** precision, recall, F1 score  

## Expected Outcomes

- A clear understanding of which prompting strategy yields the best results.
- A methodology/workflow that can be iteratively improved and scaled for future product insight validation tasks.



In [1]:
# let's import the packages we will need for this project

import requests # for connecting with Azure Open AI
import json # for parsing responses
import csv # for data processing
import pandas as pd # for data analysis 

# let's also import the config we will need to interact with the Azure Open AI API

from config import config_endpoint, config_key


# 1 - Load Business Goals

Let's take a glimpse at the data we have. All these business goals were reviewed by human validators during April 2025.

These datasets will act as makeshift cross-validation sets that we can use to test the performance of different prompting strategies and approaches.


We have two datasets:

- A small set of 20 cases (10 valid, 10 invalid) for quick experimentation
- A bigger set of 200 cases (100 valid, 100 invalid) for testing and measuring results

In [5]:
# We load all the data 

business_goals_20 = pd.read_csv('./cases/business_goals_cv20.csv')
business_goals_200 = pd.read_csv('./cases/business_goals_cv200.csv')

# Now let's print one of the datasets to see its shape

business_goals_20[:5]

Unnamed: 0,ID,Case Number,UPN,Line Of Business,Business Goals and Needs,Business Goal Validation,Business Goal Comment
0,28380,2504040040013287,gig_wfh_ossal@microsoftsupport.com,Business Advisor Reactive,Save time and money\nThe customer's business i...,1,BG is clear.
1,26505,2504020030000826,gig_wfh_kuman@microsoftsupport.com,Business Advisor Reactive,After discussing it became clear that the cust...,1,It details the need of the customer
2,26155,2504030030000687,gig_wfh_jibal@microsoftsupport.com,Business Advisor Reactive,Simplify everyday task\n\nThe business needs t...,1,Valid: Business goals is clearly stated
3,27892,2503310040001537,gig_wfh_sirai@microsoftsupport.com,Proactive Grace,The customer is a tech company offering web an...,1,Valid
4,26026,2503270040000491,gig_wfh_jewan@microsoftsupport.com,Trials Nurturing Proactive,Business offers a delightful range of homemade...,1,BG is clear.


In [6]:
# Column explanation
data = [
    ["Business Goals and Needs", "Raw Business Goals and Needs captured by the ambassador as they figure in the tracker"],
    ["Business Goal Validation", "Validation done by a human validator - 0 is invalid, 1 is valid"],
    ["Business Goal comment", "Comment/Explanation provided by a human validator"]
]
column_data = pd.DataFrame(data, columns=["Column Name", "Explanation"])

pd.set_option("display.max_colwidth", None) 
column_data

Unnamed: 0,Column Name,Explanation
0,Business Goals and Needs,Raw Business Goals and Needs captured by the ambassador as they figure in the tracker
1,Business Goal Validation,"Validation done by a human validator - 0 is invalid, 1 is valid"
2,Business Goal comment,Comment/Explanation provided by a human validator


## 1.1 Defining Performance

These 2 datasets have already been evaluated by human validators.

This means we can use the human labels to calculate sensitivity (recall), precision and F1 for each prompt, which gives us performance metrics we can analyse and optimise. We will be looking at the following metrics.

# Performance Metrics

## Sensitivity (Recall)
Sensitivity, also known as **recall**, measures the ability to correctly identify positive cases:

$$
\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

## Precision
Precision measures how many of the predicted positive cases were actually correct:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

## F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
$$
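
As a worked example, the sketch below computes the three metrics from raw confusion-matrix counts. The function name `compute_metrics` and the counts are illustrative assumptions, not results from this notebook. Reporting a metric as 0.0 when its denominator is zero is also an assumed convention.

```python
def compute_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from confusion-matrix counts.

    Metrics with a zero denominator are reported as 0.0 (assumed convention).
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1


# Hypothetical counts, for illustration only
sensitivity, precision, f1 = compute_metrics(tp=8, fp=5, fn=2)
print(f"sensitivity={sensitivity:.3f} precision={precision:.3f} f1={f1:.3f}")
```

With these counts, sensitivity is 8/10, precision is 8/13, and F1 is 16/23, which shows how false positives pull F1 down even when most valid entries are caught.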


We will create a dataframe where we will store the results of our tests as we run them.

In [8]:
test_results = pd.DataFrame([], columns=["test_name", "sensitivity", "precision", "f1_score"])


test_results


Unnamed: 0,test_name,sensitivity,precision,f1_score


## 2. Setting Up Logic for LLM Validation and Analysis

### 2.1 Validation

Let's start this section by defining a function that calls Azure OpenAI with a system prompt and an input provided by the user.

The system prompt will contain the criteria to validate an insight, and the user input will be the entry registered by our agents. 

In [9]:
HEADERS = {
    "Content-Type": "application/json",
    "api-key": config_key
}

def send_prompt(system_prompt, user_prompt, max_tokens=200):
    """Send a prompt to Azure OpenAI and return the response."""
    url = config_endpoint
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens
    }
    
    try:
        response = requests.post(url, headers=HEADERS, json=data)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"

Let's test it out with a very naive example to make sure it works

In [10]:
res = send_prompt(
    "You are system dedicated to validate product feedback. You will only declare as valid feedback that has to do usability issues, anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }",
    "I could not use the app at all, the menu was very convoluted and crowded with icons. Very hard to read"
)

print(res)
    

```json
{
  "valid": true,
  "reason": "This feedback highlights a usability issue, as the user is unable to effectively use the app due to the overcrowded and difficult-to-read menu interface."
}
```


The model is giving us back a JSON wrapped in Markdown. Let's create a function to clean it 

In [11]:
def clean_llm_response(res):
    return res.replace("json", "").replace(r'\n', '').replace(r"\'", "'").replace("`", "").strip()

In [12]:
print(clean_llm_response(res))

{
  "valid": true,
  "reason": "This feedback highlights a usability issue, as the user is unable to effectively use the app due to the overcrowded and difficult-to-read menu interface."
}


Great! We now have the basic building block for testing different validation prompts.   
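
Because `clean_llm_response` removes the substring `json` wherever it occurs, it would also alter a reasoning text that happens to contain that word. Below is a minimal sketch of a stricter alternative, assuming the model wraps its answer in at most one Markdown fence with the fence markers on their own lines. `parse_llm_json` is a hypothetical helper name and is not part of the original notebook.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_llm_json(raw):
    """Strip an optional Markdown fence and parse the remaining JSON.

    Returns the parsed object, or None if the text is not valid JSON.
    """
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the whole opening fence line (the marker plus an optional "json" tag)
        text = text.split("\n", 1)[1] if "\n" in text else ""
    text = text.rstrip()
    if text.endswith(FENCE):
        # Drop the closing fence marker
        text = text[: -len(FENCE)]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


fenced = FENCE + 'json\n{"valid": true, "reason": "Mentions a json export issue"}\n' + FENCE
parsed = parse_llm_json(fenced)
print(parsed["valid"], parsed["reason"])
```

Returning `None` instead of raising lets the caller decide whether to skip the row or retry. An error string returned by `send_prompt` would also come back as `None`.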

###  2.2 Analysis of prompt performance

Now we need a function that allows us to do the following:

- 1. Iterate through the rows of one of our datasets.
- 2. For each row:
    - 1. Ask the LLM to validate the entry
    - 2. Compare the LLM verdict with the human label
          - LLM => Valid, Human => Valid, then *true positive*
          - LLM => Invalid, Human => Invalid, then *true negative*
          - LLM => Valid, Human => Invalid, then *false positive*
          - LLM => Invalid, Human => Valid, then *false negative*
    - 3. Store this information
- 3. Calculate sensitivity (recall), precision and F1 for this prompt
- 4. Add the results to our log in the `test_results` variable we created before
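
The comparison in step 2 can be sketched as a small pure function, which makes the four outcomes easy to check in isolation. The helper name `classify_outcome` and the sample pairs are assumptions for illustration. Labels follow the dataset convention of 1 for valid and 0 for invalid.

```python
from collections import Counter


def classify_outcome(llm_valid, human_label):
    """Map an (LLM verdict, human label) pair to a confusion-matrix cell.

    llm_valid: bool taken from the "valid" field of the LLM response.
    human_label: 1 if the human validator marked the entry valid, else 0.
    """
    if llm_valid and human_label == 1:
        return "tp"
    if not llm_valid and human_label == 0:
        return "tn"
    if llm_valid and human_label == 0:
        return "fp"
    return "fn"  # LLM said invalid, human said valid


# Hypothetical (LLM verdict, human label) pairs
pairs = [(True, 1), (False, 0), (True, 0), (False, 1), (True, 1)]
counts = Counter(classify_outcome(llm, human) for llm, human in pairs)
print(dict(counts))
```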

In [37]:
import time

def analyse_test_prompt(test_name, prompt, results_store, dataset):
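    '''
    Evaluates the performance of a prompt against a human-labeled dataset

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt: system prompt passed to the LLM to validate business goals
      - results_store: dataframe where we can store the results
      - dataset: dataframe with 'Business Goals and Needs' and 'Business Goal Validation' columns

    Returns:
      - the updated results_store dataframe
    '''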
    start_time = time.time()

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    row_counter = 0
    for index, row in dataset.iterrows():
        # pause after every 10 calls to stay under the API rate limit
        print(row_counter)
        if row_counter >= 10:
            print(f"Rate limit is close, continuing in {60} seconds...")
            time.sleep(61)
            row_counter = 0
                
        llm_res = send_prompt(prompt, row['Business Goals and Needs'])
        llm_res = clean_llm_response(llm_res)
        row_counter += 1
        print(row['Business Goals and Needs'],llm_res)

        try:
            llm_res = json.loads(llm_res)
        except json.JSONDecodeError as e:
            print(f"[WARN] Failed to parse LLM response as JSON: {e}")
            continue

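        human_validation = row['Business Goal Validation']
        if llm_res['valid'] == True and human_validation == 1:
            tp += 1
        elif llm_res['valid'] == False and human_validation == 0:
            tn += 1
        elif llm_res['valid'] == True and human_validation == 0:
            fp += 1
        elif llm_res['valid'] == False and human_validation == 1:
            fn += 1  # false negative: the LLM rejected an entry the human marked valid

    # guard against empty denominators (e.g. no positive predictions at all)
    sensitivity = tp / ( tp + fn ) if ( tp + fn ) else 0.0
    precision = tp / ( tp + fp ) if ( tp + fp ) else 0.0
    f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity ) if ( precision + sensitivity ) else 0.0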
    
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store
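
The evaluation loop above stays under the API rate limit by sleeping 61 seconds after every 10 calls. The same idea can be sketched as a small reusable helper. `CallPacer` and its parameters are hypothetical names, and the sleep function is injectable only so the behaviour can be checked without actually waiting.

```python
import time


class CallPacer:
    """Pause for `pause_seconds` after every `calls_per_window` calls."""

    def __init__(self, calls_per_window=10, pause_seconds=61, sleep=time.sleep):
        self.calls_per_window = calls_per_window
        self.pause_seconds = pause_seconds
        self._sleep = sleep
        self._calls = 0

    def wait_if_needed(self):
        """Call this right before each LLM request."""
        if self._calls >= self.calls_per_window:
            self._sleep(self.pause_seconds)
            self._calls = 0
        self._calls += 1


# Usage with a fake sleep that records pauses, so the example runs instantly
pauses = []
pacer = CallPacer(calls_per_window=10, pause_seconds=61, sleep=pauses.append)
for _ in range(25):
    pacer.wait_if_needed()
print(pauses)
```

Keeping the counter in one object means a multi-pass loop that makes two calls per row can share the same pacing logic.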

Ok, our building blocks of logic are now ready. Let's start with some prompting

## 3. Testing Prompting Approaches

In this section we will test several prompting approaches to see which one performs best. Let's go!


### 3.1 Loose Zero Shot Prompt

Zero-shot prompting is a technique used with large language models (LLMs) where the model is asked to perform a task without being given any specific examples of how to do it. We're relying entirely on the model's pre-existing knowledge and understanding to generate a response.  

In the prompt below, we describe high-level criteria that human validators frequently cite when marking insights as valid or invalid. They are drawn from an analysis of the validators' comments.


In [38]:
loose_zero_shot_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Set 1: Valid Business Goals Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Clarity: the entry mentions a business goal or need that is clear and easy to understand

    - B) Specificity: The entry should clearly refer to a specific, concrete business goal or need

    - C) Actionable: The entry should focus on the practical applicability of Microsoft 365 products to address the business need described

## Set 2: Invalid Business Goal Criteria

Meeting any of the criteria below is enough for the entry to be considered invalid.

    - D) Focus on tools: The entry just lists the M365 applications being used, but there is no business goal or need mentioned

    - E) Vague business goal or need: The entry does not include any details nor actionable business goals/needs

    - F) Technical issue: The entry only describes a technical issue experienced by the customer and there is no business goal or need

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [None]:
# Test Loose Zero Shot Prompt here 
test_results = analyse_test_prompt('loose_zero_shot_200', loose_zero_shot_prompt, test_results, business_goals_200)


0
Save time and money
The customer's business imports and sells car parts from Europe to LATAM. Their goal is to implement a collaboration platform to enhance engagement and improve daily operations. {
  "valid": true,
  "reasoning": "The entry is valid because it satisfies all the criteria under Set 1: Valid Business Goals Criteria. It clearly mentions a business goal (importing and selling car parts from Europe to LATAM), is specific (the need for a collaboration platform to enhance engagement and improve operations), and is actionable by pointing to the implementation of such a platform through Microsoft 365 products."
}
1
After discussing it became clear that the customer aims to enhance efficiency in daily operations by establishing a generic email address connected to a centralized mailbox for their company. This setup would allow multiple users to access the mailbox simultaneously and receive notifications when customers send inquiries, thereby improving communication and enabli

In [33]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,1.0,0.615385,0.761905
1,loose_zero_shot_20_no_negative_criteria,1.0,0.642857,0.782609


In [None]:
loose_zero_shot_no_negative_criteria = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Valid Business Goals Criteria

Meeting all criteria below is a must have for the entry to be considered valid, otherwise it will be invalid

    - A) Clarity: the entry mentions a business goal or need that is clear and easy to understand

    - B) Specificity: The entry should clearly refer to a specific, concrete business goal or need

    - C) Actionable: The entry should focus on the practical applicability of Microsoft 365 products to address the business need described

## Response ##

You will always respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [None]:
test_results = analyse_test_prompt('loose_zero_shot_200_no_negative_criteria', loose_zero_shot_no_negative_criteria, test_results, business_goals_200)


In [36]:
test_results

Unnamed: 0,test_name,sensitivity,precision,f1_score
0,loose_zero_shot_20,1.0,0.615385,0.761905
1,loose_zero_shot_20_no_negative_criteria,1.0,0.642857,0.782609


### 3.2 Detailed Zero Shot Prompt

In the prompt below, we provide a detailed list of criteria based on the latest version of the insights framework available at GigPlus. This has a detailed set of tiered criteria. 


In [20]:
loose_detailed_zero_shot_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

## PROMPT MISSING ##

## Response ##

You will respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

### 3.3 Few-Shot Prompt

Few-shot prompting is a technique in prompt engineering that aims to augment LLMs by providing a small number of examples within the prompt itself. This allows the model to learn and adapt to a specific task without requiring extensive fine-tuning.

In the prompt below, we will provide a few positive and negative examples for each of the categories, and see its impact on performance. 
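
One way to keep few-shot examples maintainable is to store them as data and render them into the prompt text. The sketch below assumes a hypothetical `build_few_shot_block` helper and made-up example entries. It is not the prompt used in this notebook.

```python
def build_few_shot_block(examples):
    """Render labeled examples as bullet lines for a few-shot prompt.

    examples: list of (entry_text, is_valid, note) tuples.
    """
    lines = []
    for text, is_valid, note in examples:
        verdict = "Valid" if is_valid else "Invalid"
        lines.append(f'    - "{text}" - [ {verdict}, {note} ]')
    return "\n".join(lines)


# Made-up examples, for illustration only
examples = [
    ("The customer wants a shared mailbox so staff can answer inquiries faster", True, "Clear and specific"),
    ("They use Teams and Outlook", False, "Lists tools only"),
]
prompt = "## Examples ##\n\n" + build_few_shot_block(examples) + "\n\n## Examples end ##"
print(prompt)
```

Storing examples as tuples makes it straightforward to swap examples in and out between runs while keeping the surrounding instructions fixed.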

In [60]:
few_shot_prompt = '''

You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Set 1: Product Feedback and Limitations Criteria

Meeting criteria A) and B) is a must have for the entry to be considered valid

    - A) Actionability: the entry mentions product feedback, limitations that are specific, actionable and valuable for a product team.

    - B) Specificity: The entry should clearly refer to a specific product feature or a product limitation
    
Meeting at least one of the following criteria C), D) and E) is enough to check the entry as valid

    - C) Support of Objectives: The entry should explain how the feedback aligns customer business objectives or business case, regardless of whether the feedback is positive or negative. 

    - D) Impact on Customer Experience: The entry must explain how it impacts customer workflows, satisfaction, or any stage of the customer's experience
    
    - E) Positive Feedback: The entry provides positive feedback about a feature or aspect of the product

##Set 1 end##
      
##Set 2: Valid Deployment Blockers## 

Meeting any of the following criteria is enough to check the entry as valid

    - Technical Barriers: The entry contains concrete obstacles that prevents or limits the successful implementation, adoption, or performance of a technology, system or product.  

    - Organizational Readiness: The entry refers to a shortage of trained personnel or expertise to adopt, implement or maintain a the product.

    - Compatibility:  The entry explains clearly how the product cannot be adopted or used due to lack of compatibility, outdated systems, or proprietary formats 

    - Support and Documentation: The entry explains how poor documentation prevents the deployment, adoption or use of the product 

    - Security and Compliance: The entry explains risks related to data protection, cybersecurity threats, or compliance with privacy laws that prevent deployment, adoption or use of the product.

## Set 2 end ## 

## Examples of valid Entries for Set 1 ## 

    - “The Product seems very hard to use, doing basic actions like managing the calendar requires many clicks and it's confusing.” – [ Valid, Meets Criteria A), B) and D) ]

    - “The Product is not able to perform scheduled updates, forcing the customer to do manual work and waste time” – [ Valid, Meets Criteria A), B) and C) ]

    - “The customer mentioned how happy he was with the new data visualization suite and the FabricView feature. It has really helped his team make better decisions” – [ Valid, Meets Criteria A), B) and E) ]

    - “The customer was frustrated because the product is unstable when running alongside another application” – [ Valid - Meets criteria A) and B) and D) ]

## Examples of valid Entries for Set 1 end ##

## Examples of invalid Entries for Set 1 ## 

    - “The Product seems slow sometimes.” – [ Invalid - Does not meet criteria A) and B) ]

    - “The customer does not like the product, he prefers the older version. Also thinks the competition is better” – [ Invalid - Does not meet criteria A) and B) ]

    - “We heard from another company that they had issues with the product.” – [ Invalid, Does not meet criteria A) and B) ]  

    - “We really love the product, the new functionalities are really cool and helps us make more money which is what we want” – [ Invalid, Does not meet criteria A) and B) even though meets C) and E) ] 

    - “We need time to adjust to new workflows.” - [ Invalid, Does not meet criteria A) and B) ]

    - "Product is a bit expensive, should rethink the price point" - [ Invalid, Does not meet criteria A) and B) ]

## Examples of invalid Entries for Set 1 end ##

## Examples of valid Entries for Set 2 ## 

    - “The customer raised a concern about the lack of multi-tenancy support. They need a way to manage multiple teams and departments separately within the product.” – [ Valid, Technical Barrier ]

    - “The customer said that while they see the value in our solution, they can't deploy because their team would need extensive training to use it effectively.” – [ Valid, Organizational Readiness ]

    - “The customer mentioned that they were excited to deploy the product, but they discovered it's not compatible with their existing infrastructure. Their systems run on Linux, while the software only supports Windows, making it impossible for them to implement” – [ Valid, Compatibility ]

    - “The customer said that when they encountered an issue, they couldn’t find sufficient troubleshooting guides or FAQs to resolve it on their own, making them overly reliant on support.” – [ Valid, Support and Documentation ]

    - "The product only provides US-based data hosting but the customer requires GDPR, so legally they cannot use it." - [Valid, Security and Compliance]

## Examples of valid Entries for Set 2 end ##

## Examples of invalid Entries for Set 2 ## 

    - “Our office is moving next month, so we can’t focus on deployment right now.” – [ Invalid, Does not meet any of the criteria ]

    - “The system doesn't seem to work as expected in our environment.” – [ Invalid, Does not meet any of the criteria ]

    - “The customer mentioned that they are facing some challenges with the new system.” – [ Invalid, Does not meet any of the criteria ]

    - “The customer said that they are not sure how to proceed with the migration.” – [ Invalid, Does not meet any of the criteria ]

    - "They have concerns about security that need to be cleared before they proceed with the deployment" - [Invalid, Does not meet any of the criteria]

## Examples of invalid Entries for Set 2 end ##

Meeting the criteria of one of the sets is enough to consider an entry as valid. You must emit a final single judgement 

## Response ##

You will always respond in JSON format with only the following fields and no more:

* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [None]:
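# analyse_test_prompt requires a dataset argument; the 20-case set is assumed here
test_results = analyse_test_prompt('few shot', few_shot_prompt, test_results, business_goals_20)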

print(test_results)

In [14]:
def analyse_multipass_prompt(test_name, prompt_product, prompt_deployment, results_store, true_positives, true_negatives, false_positives, false_negatives):
    '''
    Evaluates the performance of a prompt 

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt_product: system prompt passed to the LLM to validate product feedback
      - prompt_deployment: system prompt passed to the LLM to validate deployment blockers
      - results_store: dataframe where we can store the results
    '''
    start_time = time.time()
    
    dataframes = [
        [true_positives, True], # first value contains the data, the second what we would like the model to return for every row
        [true_negatives, False], # for instance, the llm should evaluate all true positives as valid to have 100% accuracy 
        [false_positives, False],
        [false_negatives, True]
    ]

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    llm_call_counter = 0
    for dataframe in dataframes:
        data, expectation = dataframe
        
        for index, row in data.iterrows():
            # pause after every 10 calls to stay under the API rate limit
            print(llm_call_counter)
            if llm_call_counter >= 10:
                print(f"Rate limit is close, continuing in {60} seconds...")
                time.sleep(61)
                llm_call_counter = 0
                
                
            llm_res = send_prompt(prompt_product, row['Feedback'])
            llm_res = clean_llm_response(llm_res)
            llm_call_counter += 1

            print(llm_res)
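            # if we did not get a TP or a TN, we use the other prompt
            try:
                llm_res = json.loads(llm_res)
            except json.JSONDecodeError as e:
                print(f"Error parsing JSON: {e}, moving to next case")  # Log the error
                continue  # skip this row
            if expectation != llm_res['valid']:
                llm_res = send_prompt(prompt_deployment, row['Feedback'])
                llm_res = clean_llm_response(llm_res)
                llm_call_counter += 1  # count the second call even if parsing fails
                try:
                    llm_res = json.loads(llm_res)
                except json.JSONDecodeError as e:
                    print(f"Error parsing JSON: {e}, moving to next case")  # Log the error
                    continue  # skip this row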
                
            
            
            print(llm_res)
            
            if expectation == True and llm_res['valid'] == True:
                tp += 1
            elif expectation == True and llm_res['valid'] == False:
                fn += 1
            elif expectation == False and llm_res['valid'] == True:
                fp += 1
            elif expectation == False and llm_res['valid'] == False:
                tn += 1

    sensitivity = tp / ( tp + fn )
    precision = tp / ( tp + fp )
    f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity )
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store


Now, let's break down the prompts with their examples

## Results

Here we have the performance of several prompting strategies on our makeshift cross validation sample. 

In [21]:
# print test results store