# Product Insight Validation Using LLMs 🔍

## Overview

This notebooks aims to evaluate different prompting strategies for validating product insights using a Large Language Model (LLM). The goal is to determine the most effective prompting approach for distinguishing between valid and invalid insights based on predefined criteria.

## Objectives

- **Compare Prompting Strategies:** Test multiple prompts and strategies to determine which yields the best classification results.
- **Evaluate Performance:** Measure the effectiveness of each strategy using precision, recall, and F1 score.
- **Cross-Validation Approach:** Utilize a labeled dataset containing:
  - **True Positives (TP):** Correctly identified valid insights.
  - **True Negatives (TN):** Correctly identified invalid insights.
  - **False Positives (FP):** Incorrectly marked invalid insights as valid.
  - **False Negatives (FN):** Incorrectly marked valid insights as invalid.

## Methodology

1. **Load Product Insights**  
   - Import CSV files containing product insights for validation.

2. **Apply LLM-Based Validation**  
   - Use different prompts and prompting strategies to classify insights.

3. **Evaluate Performance**  
   - Compute precision, recall, and F1 score to assess classification accuracy.
   - Compare the effectiveness of different strategies based on their performance metrics.

4. **Optimize for Accuracy**  
   - Identify the best-performing prompt and strategy for product insight validation.

## Tech Stack

- **LLM Provider:** Azure OpenAI  
- **Model:** ChatGPT 4.0  
- **Data Processing:** Python (pandas, numpy)  
- **Evaluation Metrics:** precision, recall, F1 score  

## Expected Outcomes

- A clear understanding of which prompting strategy yields the best results.
- A methodology/workflow that can be iteratively improved and scaled for future product insight validation tasks.


---

_This notebook has memes every once in a while. Jupyter notebooks are very nice but also can be a bit dry. The memes are not particularly good, don't judge me.

In [1]:
# let's import the packages we will need for this project

import requests # for connecting with Azure Open AI
import json # for parsing responses
import csv # for data processing
import pandas as pd # for data analysis 

# let's also import the config we will need to interact with the Azure Open AI API

from config import config_endpoint, config_key


# 1 - Load Product Insights

Let's take a glimpse at the data we have. All this data has been validated with an LLM with a custom prompt and then reviewed by human validators. This explains why we have true and false positives and negatives. 

In [2]:
# We load all the data 

true_positives = pd.read_csv('true_positive_sample.csv')
true_negatives = pd.read_csv('true_negative_sample.csv')
false_positives = pd.read_csv('false_positive_sample.csv')
false_negatives = pd.read_csv('false_negative_sample.csv')

# Now let's print one of the datasets to see its shape

true_positives[:5]

Unnamed: 0,Feedback,Product Feedback and Limitations validation_status,Product Feedback and Limitations comment,Product Feedback and Limitations_human_review,Product Feedback and Limitations_human_comment
0,Feedback and limitations The customer expresse...,1,The feedback is specific as it refers to the r...,Agree,Impact on the customer's workflow stated. Acti...
1,Feedback and limitations The limitations of th...,1,The feedback is specific about compatibility i...,Agree,Valid feedback
2,Feedback and limitations Customer face difficu...,1,The feedback is valid as it specifies compatib...,Agree,Valid
3,Feedback and limitations Cx was frustrated sin...,1,The feedback is valid as it specifies a limita...,Agree,
4,Feedback and limitations Product Limitation \n...,1,The feedback is specific as it refers to the d...,Agree,Product feedback is specific and clear


In [3]:
# Column explanation
data = [
    ["Feedback", "Raw feedback notes captured by the agent and stored on Gigplus Trackers"],
    ["Product Feedback and Limitations validation_status", "Validation done by the LLM - 0 is invalid, 1 is valid"],
    ["Product Feedback and Limitations comment", "Explanation provided by the LLM"],
    ["Product Feedback and Limitations_human_review", "Human review, agreeing or disagreeing with the model"],
    ["Product Feedback and Limitations_human_comment", "Comment left by the human validator"]
]
column_data = pd.DataFrame(data, columns=["Column Name", "Explanation"])

pd.set_option("display.max_colwidth", None) 
column_data

Unnamed: 0,Column Name,Explanation
0,Feedback,Raw feedback notes captured by the agent and stored on Gigplus Trackers
1,Product Feedback and Limitations validation_status,"Validation done by the LLM - 0 is invalid, 1 is valid"
2,Product Feedback and Limitations comment,Explanation provided by the LLM
3,Product Feedback and Limitations_human_review,"Human review, agreeing or disagreeing with the model"
4,Product Feedback and Limitations_human_comment,Comment left by the human validator


## 1.1 Baselining Performance

These 4 datasets have already been evaluated by an LLM as well as been reviewed by a human.

This means we can calculate Sensitivity, Recall and F1 for this dataset, which will give us target performance metrics to iterate on. Let's refresh on how these are calculated

# Performance Metrics

## Sensitivity (Recall)
Sensitivity, also known as **recall**, measures the ability to correctly identify positive cases:

$$
\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

## Precision
Precision measures how many of the predicted positive cases were actually correct:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

## F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
$$


With that, let's calculate sensitivity, precision and F1 score for our current dataset


In [4]:
tp = true_positives.shape[0]  
tn = true_negatives.shape[0]
fp = false_positives.shape[0]
fn = false_negatives.shape[0]

sensitivity = tp / ( tp + fn )
precision = tp / ( tp + fp )
f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity )

baseline_eval_metrics = pd.DataFrame([
    ["Sensitivity", sensitivity],
    ["Precision", precision],
    ["F1_Score", f_1]
    ],
    columns=["Metric", "Value"]
)

baseline_eval_metrics



Unnamed: 0,Metric,Value
0,Sensitivity,0.5
1,Precision,0.909091
2,F1_Score,0.645161


The metrics are very low and in principle "easy to beat", but this is only because the sample size is very small for true positives and true negatives. 

In reality, the previous model performed better than this - nevertheless, this gives us a compass for our exercise.

**New prompts/prompt strategies should be able to have a better ability to catch false positives and false negatives while maintaining accuracy with true positives and negatives**

We'll store the result of all our tests into a dataframe table. This will allow us to contrast and compare approaches and make a final selection.



In [5]:
test_results = pd.DataFrame([], columns=["test_name", "sensitivity", "precision", "f1_score"])

test_results = pd.concat([test_results, pd.DataFrame({"test_name": ["original"] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })], ignore_index=True)

test_results


  test_results = pd.concat([test_results, pd.DataFrame({"test_name": ["original"] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })], ignore_index=True)


Unnamed: 0,test_name,sensitivity,precision,f1_score
0,original,0.5,0.909091,0.645161


## 2. Setting Up Logic for LLM Validation and Analysis

### 2.1 Validation

Let's start this section by defining a function that calls Azure Open AI with a system prompt, and an input provided by the user. 

The system prompt will contain the criteria to validate an insight, and the user input will be the entry registered by our agents. 

In [11]:
HEADERS = {
    "Content-Type": "application/json",
    "api-key": config_key
}

def send_prompt(system_prompt, user_prompt, max_tokens=200):
    """Send a prompt to Azure OpenAI and return the response."""
    url = config_endpoint
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens
    }
    
    try:
        response = requests.post(url, headers=HEADERS, json=data)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"

Let's test it out with a very naive example to make sure it works

In [12]:
res = send_prompt(
    "You are system dedicated to validate product feedback. You will only declare as valid feedback that has to do usability issues, anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }",
    "I could not use the app at all, the menu was very convoluted and crowded with icons. Very hard to read"
)

print(res)
    

```json
{
  "valid": true,
  "reason": "The feedback directly addresses a usability issue related to the app's menu being convoluted, crowded with icons, and hard to read. These factors impact the ability to use the app effectively."
}
```


The model is giving us back a string formatted in Markdown. Let's create a function to clean it 

In [14]:
def clean_llm_response(res):
    return res.replace("json", "").replace(r'\n', '').replace(r"\'", "'").replace("`", "").strip()

In [15]:
print(clean_llm_response(res))

{
  "valid": true,
  "reason": "The feedback directly addresses a usability issue related to the app's menu being convoluted, crowded with icons, and hard to read. These factors impact the ability to use the app effectively."
}


Great! We now have the basic building block for testing different validation prompts.   

###  2.2 Analysis of prompt performance

Now we need a function that allows us to do the following:

- 1. Iterate through our TP, TN, FP, FN datasets.
- 2. For each of the rows in each of the datasets
    - 1. Ask the LLM to validate the product feedback entry
    - 2. Evaluate if the LLM did a good job or not
    - 3. Store this information
- 4. Calculate Sensitivity, Recall and F1 for this prompt
- 5. Add the results to our log in the `test_results` variable we created before

In [34]:
import time 
import ipdb;

def analyse_test_prompt(test_name, prompt, results_store):
    '''
    Evaluates the performance of a prompt 

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt: system prompt passed to the LLM to validate product feedback
      - results_store: dataframe where we can store the results
    '''
    start_time = time.time()
    
    dataframes = [
        [true_positives, True], # first value contains the data, the second what we would like the model to return for every row
        [true_negatives, False], # for instance, the llm should evaluate all true positives as valid to have 100% accuracy 
        [false_positives, False],
        [false_negatives, True]
    ]

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    row_counter = 0
    for dataframe in dataframes:
        data, expectation = dataframe
        
        for index, row in data.iterrows():
            # avoid token limit if needed every 10 rows 
            print(row_counter)
            if row_counter >= 10:
                print(f"Rate limit is close, continuing in {60} seconds...")
                time.sleep(61)
                row_counter = 0
                
                
            llm_res = send_prompt(prompt, row['Feedback'])
            llm_res = clean_llm_response(llm_res)
            row_counter += 1
            print(llm_res)
            
            llm_res = json.loads(llm_res)
            if expectation == True and llm_res['valid'] == True:
                tp += 1
            elif expectation == True and llm_res['valid'] == False:
                fn += 1
            elif expectation == False and llm_res['valid'] == True:
                fp += 1
            elif expectation == False and llm_res['valid'] == False:
                tn += 1

    sensitivity = tp / ( tp + fn )
    precision = tp / ( tp + fp )
    f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity )
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store

test_prompt = "You are system dedicated to validate product feedback. You will only declare as valid feedback that has to do usability issues, anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }"

# test_results = analyse_test_prompt('testy test', test_prompt, test_results)

            

Ok, our building blocks of logic are now ready. Let's start with some prompting

## 3. Testing Prompting Approaches

In this section we will test the performance of several prompting approaches to see which one seems performs better. Let's go!


### 3.1 Zero Shot Prompt

Zero-shot prompting is a technique used with large language models (LLMs) where the model is asked to perform a task without being given any specific examples of how to do it. We're relying entirely on the model's pre-existing knowledge and understanding to generate a response.  

In the prompt below, we describe 2 sets of criteria, one for Product Feedback and Limitations, and another one for Deployment Blockers. We instruct the model to validate the entries when 1 of the criteria sets are met.


In [7]:
zero_shot_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Set 1: Product Feedback and Limitations Criteria

Meeting criteria A) and B) is a must have for the entry to be considered valid

    - A) Actionability: the entry mentions product feedback, limitations that are specific, actionable and valuable for a product team.

    - B) Specificity: The entry should clearly refer to a specific product feature
    
Meeting at least one of the following criteria C), D) and E) is enough to check the entry as valid

    - C) Support of Objectives: The entry should explain how the feedback aligns customer business objectives or business case, regardless of whether the feedback is positive or negative. 

    - D) Impact on Customer Experience: The entry must explain how it impacts customer workflows, satisfaction, or any stage of the customer's experience

    - E) Usability: The entry explains how a feature of the product is difficult to use
    
    - F) Positive Feedback: The entry provides positive feedback about a feature or aspect of the product

##Set 1 end##
      
##Set 2: Valid Deployment Blockers ## 

Meeting any of the following criteria is enough to check the entry as valid

    - Technical Barriers: The entry contains obstacles that prevents or limits the successful implementation, adoption, or performance of a technology, system or product.  

    - Organizational Readiness: The entry refers to a shortage of trained personnel or expertise to adopt, implement or maintain a the product.

    - Compatibility:  The entry explains how the product cannot be adopted or used due to lack of compatibility, outdated systems, or proprietary formats 

    - Support and Documentation: The entry explains how poor documentation prevents the deployment, adoption or use of the product 

    - Security and Compliance: The entry explains risks related to data protection, cybersecurity threats, or compliance with privacy laws that prevent deployment, adoption or use of the product.

## Set 2 end ## 

Meeting the criteria of one of the sets is enough to consider an entry as valid. 

## Response ##

You will respond in JSON format with the following fields:
* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [38]:
test_results = analyse_test_prompt('zero shot', zero_shot_prompt, test_results)

84.3721399307251


In [39]:
print(test_results)

    test_name  sensitivity  precision  f1_score
0    original          0.5   0.909091  0.645161
1  testy test          0.7   1.000000  0.823529
2   zero shot          1.0   0.714286  0.833333


### 3.1 Few-Shot Prompt

Few-shot prompting is a technique in prompt engineering that aims to augment LLMs by providing a small number of examples within the prompt itself. This allows the model to learn and adapt to a specific task without requiring extensive fine-tuning.

In the prompt below, we will provide a few positive and negative examples for each of the categories, and see its impact on performance. 

In [40]:
few_shot_prompt = '''

You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following sets of criteria.

## Set 1: Product Feedback and Limitations Criteria

Meeting criteria A) and B) is a must have for the entry to be considered valid

    - A) Actionability: the entry mentions product feedback, limitations that are specific, actionable and valuable for a product team.

    - B) Specificity: The entry should clearly refer to a specific product feature or a product limitation
    
Meeting at least one of the following criteria C), D) and E) is enough to check the entry as valid

    - C) Support of Objectives: The entry should explain how the feedback aligns customer business objectives or business case, regardless of whether the feedback is positive or negative. 

    - D) Impact on Customer Experience: The entry must explain how it impacts customer workflows, satisfaction, or any stage of the customer's experience
    
    - E) Positive Feedback: The entry provides positive feedback about a feature or aspect of the product

##Set 1 end##
      
##Set 2: Valid Deployment Blockers## 

Meeting any of the following criteria is enough to check the entry as valid

    - Technical Barriers: The entry contains concrete obstacles that prevents or limits the successful implementation, adoption, or performance of a technology, system or product.  

    - Organizational Readiness: The entry refers to a shortage of trained personnel or expertise to adopt, implement or maintain a the product.

    - Compatibility:  The entry explains clearly how the product cannot be adopted or used due to lack of compatibility, outdated systems, or proprietary formats 

    - Support and Documentation: The entry explains how poor documentation prevents the deployment, adoption or use of the product 

    - Security and Compliance: The entry explains risks related to data protection, cybersecurity threats, or compliance with privacy laws that prevent deployment, adoption or use of the product.

## Set 2 end ## 

## Examples of valid Entries for Set 1 ## 

    - “The Product seems very hard to use, doing basic actions like managing the calendar requires many clicks and it's confusing.” – [ Valid, Meets Criteria A), B) and D) ]

    - “The Product is not able to perform scheduled updates, forcing the customer to do manual work and waste time” – [ Valid, Meets Criteria A), B) and C) ]

    - “The customer mentioned how happy he was with the new data visualization suite and the FabricView feature. It has really helped his team make better decisions” – [ Valid, Meets Criteria A), B) and E) ]

    - “The customer was frustrated because the product is unstable when running alongside another application” – [ Valid - Meets criteria A) and B) and D) ]

## Examples of valid Entries for Set 1 end ##

## Examples of invalid Entries for Set 1 ## 

    - “The Product seems slow sometimes.” – [ Invalid - Does not meet criteria A) and B) ]

    - “The customer does not like the product, he prefers the older version. Also thinks the competition is better” – [ Invalid - Does not meet criteria A) and B) ]

    - “We heard from another company that they had issues with the product.” – [ Invalid, Does not meet criteria A) and B) ]  

    - “We really love the product, the new functionalities are really cool and helps us make more money which is what we want” – [ Invalid, Does not meet criteria A) and B) even though meets C) and E) ] 

    - “We need time to adjust to new workflows.” - [ Invalid, Does not meet criteria A) and B) ]

    - "Product is a bit expensive, should rethink the price point" - [ Invalid, Does not meet criteria A) and B) ]

## Examples of invalid Entries for Set 1 end ##

## Examples of valid Entries for Set 2 ## 

    - “The customer raised a concern about the lack of multi-tenancy support. They need a way to manage multiple teams and departments separately within the product.” – [ Valid, Technical Barrier ]

    - “The customer said that while they see the value in our solution, they can't deploy because their team would need extensive training to use it effectively.” – [ Valid, Organizational Readiness ]

    - “he customer mentioned that they were excited to deploy the product, but they discovered it's not compatible with their existing infrastructure. Their systems run on Linux, while the software only supports Windows, making it impossible for them to implement” – [ Valid, Compatibility ]

    - “The customer said that when they encountered an issue, they couldn’t find sufficient troubleshooting guides or FAQs to resolve it on their own, making them overly reliant on support.” – [ Valid, Support and Documentation ]

    - "The product only provides US-based data hosting but the customer requires GDPR, so legally they cannot use it." - [Valid, Security and Compliance]

## Examples of valid Entries for Set 2 end ##

## Examples of invalid Entries for Set 2 ## 

    - “Our office is moving next month, so we can’t focus on deployment right now.” – [ Invalid, Does not meet any of the criteria ]

    - “The system doesn't seem to work as expected in our environment.” – [ Invalid, Does not meet any of the criteria ]

    - “The customer mentioned that they are facing some challenges with the new system.” – [ Invalid, Does not meet any of the criteria ]

    - “The customer said that they are not sure how to proceed with the migration.” – [ Invalid, Does not meet any of the criteria ]

    - "They have concerns about security that need to be cleared before they proceed with the deployment" - [Invalid, Does not meet any of the criteria]

## Examples of valid Entries for Set 2 end ##

Meeting the criteria of one of the sets is enough to consider an entry as valid. You must emit a final single judgement 

## Response ##

You will always respond in JSON format with only the following fields and no more:

* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [35]:
test_results = analyse_test_prompt('few shot', few_shot_prompt, test_results)

print(test_results)

0
ERROR! Session/line number was not unique in database. History logging moved to new session 2
{
  "valid": true,
  "reasoning": "The entry is valid under Set 1 criteria. It satisfies A) Actionability, as the feedback is actionable and valuable to the product team, and B) Specificity, as it specifically refers to the removal of the add-on for email encryption. Furthermore, it meets D) Impact on Customer Experience, as it explains how the removal complicates the customer's workflow and forces them to take additional steps for secure communication. The entry does not fit the criteria for deployment blockers."
}
1
{
  "valid": false,
  "reasoning": "The entry touches upon compatibility issues and user experience differences between app versions and desktop versions. However, it lacks clear actionability and specificity related to a specific feature or limitation, which are required for Set 1 criteria A) and B) to be met. While it mentions compatibility issues, it does not explicitly expl

Interestingly, the few shot prompt did not perform better than the few shot prompt on the cross validation set. I wonder if there's anything that could be done to optimise this 

### 3.2 Multi-pass prompt

A multi-pass prompt is a prompt engineering technique where an AI model is guided through multiple stages or iterations to refine its response. 

For this particular task, the product insights are considered valid whenever they are product feedback or deployment blockers - however, these have very different definitions.

We could pass 2 different prompts to the model. 

1. We first check if it's product insight
2. If it is, we're done, the entry is valid
3. If not, we check if it's a deployment blocker
4. If it is, the entry is valid
5. If it's not, the entry is invalid.

We're going to need a different method for analysing the prompt that implements this logic

In [37]:
def analyse_multipass_prompt(test_name, prompt_product, prompt_deployment, results_store):
    '''
    Evaluates the performance of a prompt 

    Args:
      - test_name: name of the test, can be used as an identifier
      - prompt_product: system prompt passed to the LLM to validate product feedback
      - prompt_deployment: system prompt passed to the LLM to validate deployment blockers
      - results_store: dataframe where we can store the results
    '''
    start_time = time.time()
    
    dataframes = [
        [true_positives, True], # first value contains the data, the second what we would like the model to return for every row
        [true_negatives, False], # for instance, the llm should evaluate all true positives as valid to have 100% accuracy 
        [false_positives, False],
        [false_negatives, True]
    ]

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    llm_call_counter = 0
    for dataframe in dataframes:
        data, expectation = dataframe
        
        for index, row in data.iterrows():
            # avoid token limit if needed every 10 rows 
            print(llm_call_counter)
            if llm_call_counter >= 10:
                print(f"Rate limit is close, continuing in {60} seconds...")
                time.sleep(61)
                llm_call_counter = 0
                
                
            llm_res = send_prompt(prompt_product, row['Feedback'])
            llm_res = clean_llm_response(llm_res)
            llm_call_counter += 1

            print(llm_res)
            # if we did not get a TP or an TN, we use the other prompt
            llm_res = json.loads(llm_res)
            if expectation != llm_res['valid']:
                llm_res = send_prompt(prompt_deployment, row['Feedback'])
                llm_res = clean_llm_response(llm_res)
                llm_call_counter += 1
                
            
            
            print(llm_res)
            
            llm_res = json.loads(llm_res)
            if expectation == True and llm_res['valid'] == True:
                tp += 1
            elif expectation == True and llm_res['valid'] == False:
                fn += 1
            elif expectation == False and llm_res['valid'] == True:
                fp += 1
            elif expectation == False and llm_res['valid'] == False:
                tn += 1

    sensitivity = tp / ( tp + fn )
    precision = tp / ( tp + fp )
    f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity )
    new_results_row = pd.DataFrame({"test_name": [test_name] ,"sensitivity": [sensitivity] , "precision": [precision] , "f1_score": [f_1]  })

    test_values = results_store["test_name"].values
    
    if test_name in test_values:
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0] 
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)
    end_time = time.time()
    print(end_time - start_time) 
    return results_store


Now, let's break down the prompts with their examples

In [42]:
product_prompt = '''

You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following set of criteria.

## Product Feedback and Limitations Criteria

Meeting criteria A) and B) is a must have for the entry to be considered valid

    - A) Actionability: the entry mentions product feedback, limitations that are specific, actionable and valuable for a product team.

    - B) Specificity: The entry should clearly refer to a specific product feature or a product limitation
    
Meeting at least one of the following criteria C), D) and E) is enough to check the entry as valid

    - C) Support of Objectives: The entry should explain how the feedback aligns customer business objectives or business case, regardless of whether the feedback is positive or negative. 

    - D) Impact on Customer Experience: The entry must explain how it impacts customer workflows, satisfaction, or any stage of the customer's experience
    
    - E) Positive Feedback: The entry provides positive feedback about a feature or aspect of the product

##Criteria End##

## Examples of valid Entries ## 

    - “The Product seems very hard to use, doing basic actions like managing the calendar requires many clicks and it's confusing.” – [ Valid, Meets Criteria A), B) and D) ]

    - “The Product is not able to perform scheduled updates, forcing the customer to do manual work and waste time” – [ Valid, Meets Criteria A), B) and C) ]

    - “The customer mentioned how happy he was with the new data visualization suite and the FabricView feature. It has really helped his team make better decisions” – [ Valid, Meets Criteria A), B) and E) ]

    - “The customer was frustrated because the product is unstable when running alongside another application” – [ Valid - Meets criteria A) and B) and D) ]

## Examples of valid Entries end ##

## Examples of invalid Entries## 

    - “The Product seems slow sometimes.” – [ Invalid - Does not meet criteria A) and B) ]

    - “The customer does not like the product, he prefers the older version. Also thinks the competition is better” – [ Invalid - Does not meet criteria A) and B) ]

    - “We heard from another company that they had issues with the product.” – [ Invalid, Does not meet criteria A) and B) ]  

    - “We really love the product, the new functionalities are really cool and helps us make more money which is what we want” – [ Invalid, Does not meet criteria A) and B) even though meets C) and E) ] 

    - “We need time to adjust to new workflows.” - [ Invalid, Does not meet criteria A) and B) ]

    - "Product is a bit expensive, should rethink the price point" - [ Invalid, Does not meet criteria A) and B) ]

## Examples of invalid Entries for Set 1 end ##

## Response ##

You will always respond in JSON format with only the following fields and no more:

* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above
'''


In [43]:
deployment_blocker_prompt = '''
You are an AI assistant that validates entries based on specific criteria. 

Your job is to mark any entries given to you as valid or invalid. An input will be valid whenever it conforms to any of the following set of criteria.

##Valid Deployment Blockers Criteria## 

Meeting any of the following criteria is enough to check the entry as valid

    - Technical Barriers: The entry contains concrete obstacles that prevents or limits the successful implementation, adoption, or performance of a technology, system or product.  

    - Organizational Readiness: The entry refers to a shortage of trained personnel or expertise to adopt, implement or maintain a the product.

    - Compatibility:  The entry explains clearly how the product cannot be adopted or used due to lack of compatibility, outdated systems, or proprietary formats 

    - Support and Documentation: The entry explains how poor documentation prevents the deployment, adoption or use of the product 

    - Security and Compliance: The entry explains risks related to data protection, cybersecurity threats, or compliance with privacy laws that prevent deployment, adoption or use of the product.

## Valid Deployment Blockers Criteria end ## 

## Examples of valid Entries for Set 2 ## 

    - “The customer raised a concern about the lack of multi-tenancy support. They need a way to manage multiple teams and departments separately within the product.” – [ Valid, Technical Barrier ]

    - “The customer said that while they see the value in our solution, they can't deploy because their team would need extensive training to use it effectively.” – [ Valid, Organizational Readiness ]

    - “he customer mentioned that they were excited to deploy the product, but they discovered it's not compatible with their existing infrastructure. Their systems run on Linux, while the software only supports Windows, making it impossible for them to implement” – [ Valid, Compatibility ]

    - “The customer said that when they encountered an issue, they couldn’t find sufficient troubleshooting guides or FAQs to resolve it on their own, making them overly reliant on support.” – [ Valid, Support and Documentation ]

    - "The product only provides US-based data hosting but the customer requires GDPR, so legally they cannot use it." - [Valid, Security and Compliance]

## Examples of valid Entries for Set 2 end ##

## Examples of invalid Entries for Set 2 ## 

    - “Our office is moving next month, so we can’t focus on deployment right now.” – [ Invalid, Does not meet any of the criteria ]

    - “The system doesn't seem to work as expected in our environment.” – [ Invalid, Does not meet any of the criteria ]

    - “The customer mentioned that they are facing some challenges with the new system.” – [ Invalid, Does not meet any of the criteria ]

    - “The customer said that they are not sure how to proceed with the migration.” – [ Invalid, Does not meet any of the criteria ]

    - "They have concerns about security that need to be cleared before they proceed with the deployment" - [Invalid, Does not meet any of the criteria]

## Examples of valid Entries for Set 2 end ##

## Response ##

You will always respond in JSON format with only the following fields and no more:

* valid - make it true if the entry is considered valid, false if invalid
* reasoning - add your reasoning based on the criteria set above

'''

In [45]:
test_results = analyse_multipass_prompt('multipass', product_prompt, deployment_blocker_prompt, test_results)

test_results

0
{
  "valid": true,
  "reasoning": "The entry meets criteria A) Actionability because it discusses specific, actionable product feedback. It meets B) Specificity because it clearly refers to the removal of an email encryption add-on as a limitation. It also meets D) Impact on Customer Experience because it explains how this change complicates the workflow and impacts secure communication, affecting customer satisfaction."
}


TypeError: string indices must be integers