# Product Insight Validation Using LLMs 🔍

## Overview

This notebook evaluates different prompting strategies for validating product insights with a Large Language Model (LLM). The goal is to find the prompting approach that most effectively distinguishes valid from invalid insights based on predefined criteria.

## Objectives

- **Compare Prompting Strategies:** Test multiple prompts and strategies to determine which yields the best classification results.
- **Evaluate Performance:** Measure the effectiveness of each strategy using precision, recall, and F1 score.
- **Cross-Validation Approach:** Utilize a labeled dataset containing:
  - **True Positives (TP):** Correctly identified valid insights.
  - **True Negatives (TN):** Correctly identified invalid insights.
  - **False Positives (FP):** Incorrectly marked invalid insights as valid.
  - **False Negatives (FN):** Incorrectly marked valid insights as invalid.

## Methodology

1. **Load Product Insights**  
   - Import CSV files containing product insights for validation.

2. **Apply LLM-Based Validation**  
   - Use different prompts and prompting strategies to classify insights.

3. **Evaluate Performance**  
   - Compute precision, recall, and F1 score to assess classification accuracy.
   - Compare the effectiveness of different strategies based on their performance metrics.

4. **Optimize for Accuracy**  
   - Identify the best-performing prompt and strategy for product insight validation.

## Tech Stack

- **LLM Provider:** Azure OpenAI  
- **Model:** ChatGPT 4.0  
- **Data Processing:** Python (pandas, numpy)  
- **Evaluation Metrics:** precision, recall, F1 score  

## Expected Outcomes

- A clear understanding of which prompting strategy yields the best results.
- A methodology/workflow that can be iteratively improved and scaled for future product insight validation tasks.


---

_This notebook has memes every once in a while. Jupyter notebooks are very nice but can also be a bit dry. The memes are not particularly good, so don't judge me._

In [24]:
# let's import the packages we will need for this project

import requests # for connecting with Azure OpenAI
import json # for parsing responses
import csv # for data processing
import pandas as pd # for data analysis

# let's also import the config we will need to interact with the Azure OpenAI API

from config import config_endpoint, config_key


## 1. Load Product Insights

Let's take a glimpse at the data we have. Every row was first validated by an LLM with a custom prompt and then reviewed by human validators. Comparing the two verdicts is what lets us split the data into true and false positives and negatives.

In [26]:
# We load all the data 

true_positives = pd.read_csv('true_positive_sample.csv')
true_negatives = pd.read_csv('true_negative_sample.csv')
false_positives = pd.read_csv('false_positive_sample.csv')
false_negatives = pd.read_csv('false_negative_sample.csv')

# Now let's print one of the datasets to see its shape

true_positives[:5]

| | Feedback | Product Feedback and Limitations validation_status | Product Feedback and Limitations comment | Product Feedback and Limitations_human_review | Product Feedback and Limitations_human_comment |
|---|---|---|---|---|---|
| 0 | Feedback and limitations - **Details ** Custom... | 1 | The feedback is valid as it specifically addre... | Agree | Valid concern |
| 1 | Feedback and limitations Product Limitation \n... | 1 | The feedback is specific as it refers to the d... | Agree | |
| 2 | Feedback and limitations Customer appreciates ... | 1 | The feedback is specific, mentioning the centr... | Agree | Specific, actionable product feedback with cle... |
| 3 | Feedback and limitations Customer have mention... | 1 | The feedback is specific, mentioning the activ... | Agree | |
| 4 | Feedback and limitations After deleting the ... | 1 | The feedback is specific, mentioning the issue... | Agree | |


In [36]:
# Column explanation
data = [
    ["Feedback", "Raw feedback notes captured by the agent and stored on Gigplus Trackers"],
    ["Product Feedback and Limitations validation_status", "Validation done by the LLM - 0 is invalid, 1 is valid"],
    ["Product Feedback and Limitations comment", "Explanation provided by the LLM"],
    ["Product Feedback and Limitations_human_review", "Human review, agreeing or disagreeing with the model"],
    ["Product Feedback and Limitations_human_comment", "Comment left by the human validator"]
]
column_data = pd.DataFrame(data, columns=["Column Name", "Explanation"])

pd.set_option("display.max_colwidth", None) 
column_data

| | Column Name | Explanation |
|---|---|---|
| 0 | Feedback | Raw feedback notes captured by the agent and stored on Gigplus Trackers |
| 1 | Product Feedback and Limitations validation_status | Validation done by the LLM - 0 is invalid, 1 is valid |
| 2 | Product Feedback and Limitations comment | Explanation provided by the LLM |
| 3 | Product Feedback and Limitations_human_review | Human review, agreeing or disagreeing with the model |
| 4 | Product Feedback and Limitations_human_comment | Comment left by the human validator |
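Each of the four files implies a ground-truth label: the human reviewer agreed with the LLM in the TP and TN files and disagreed in the FP and FN files. A minimal sketch of how that label could be attached to every row is shown below. It uses the standard `csv` module rather than pandas so that it runs on its own; the helper name `load_labeled_rows`, the `EXPECTED_LABEL` mapping and the in-memory demo data are assumptions made for illustration and are not part of the original notebook.

```python
import csv
import io

# Ground truth implied by each file, following the TP/TN/FP/FN definitions above.
EXPECTED_LABEL = {
    "true_positive": True,    # LLM said valid, human agreed
    "true_negative": False,   # LLM said invalid, human agreed
    "false_positive": False,  # LLM said valid, human disagreed
    "false_negative": True,   # LLM said invalid, human disagreed
}

def load_labeled_rows(sources):
    """Return a list of {"feedback", "expected_valid", "source"} dicts.

    `sources` maps a source name (a key of EXPECTED_LABEL) to an open
    file-like object holding CSV data with a "Feedback" column.
    """
    rows = []
    for name, handle in sources.items():
        for record in csv.DictReader(handle):
            rows.append({
                "feedback": record["Feedback"],
                "expected_valid": EXPECTED_LABEL[name],
                "source": name,
            })
    return rows

# Hypothetical in-memory stand-ins for the real CSV files
demo_sources = {
    "true_positive": io.StringIO("Feedback\nExport button fails on large reports\n"),
    "false_positive": io.StringIO("Feedback\nCustomer was happy with the call\n"),
}
labeled = load_labeled_rows(demo_sources)
```

With the real data, each `io.StringIO` would be replaced by an opened CSV file such as `open('true_positive_sample.csv', newline='')`.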


## 1.1 Baselining Performance

Let's calculate sensitivity (recall), precision and F1 for this dataset, which will give us baseline performance metrics to iterate on. First, a refresher on how these are calculated.

# Performance Metrics

## Sensitivity (Recall)
Sensitivity, also known as **recall**, measures the ability to correctly identify positive cases:

$$
\text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

## Precision
Precision measures how many of the predicted positive cases were actually correct:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

## F1 Score
F1 Score is the harmonic mean of precision and recall, balancing both metrics:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
$$
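The three formulas can be wrapped in a small helper. The sketch below is one possible version, not the notebook's own code: the function name `classification_metrics` is hypothetical, the counts in the example are made up purely to illustrate the arithmetic, and returning `0.0` when a denominator is zero is a design assumption (the formulas themselves are undefined in that case).

```python
def classification_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 instead of raising.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1

# Hypothetical counts, chosen only to illustrate the arithmetic
demo_sensitivity, demo_precision, demo_f1 = classification_metrics(tp=6, fp=2, fn=4)
print(round(demo_sensitivity, 3), round(demo_precision, 3), round(demo_f1, 3))
```

Guarding the denominators matters here because a prompt that rejects every insight would produce no predicted positives at all.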


With that, let's calculate sensitivity, precision and F1 score for our current dataset


In [39]:
tp = true_positives.shape[0]  
tn = true_negatives.shape[0]
fp = false_positives.shape[0]
fn = false_negatives.shape[0]

sensitivity = tp / ( tp + fn )
precision = tp / ( tp + fp )
f_1 = 2 * ( precision * sensitivity ) / ( precision + sensitivity )

baseline_eval_metrics = pd.DataFrame([
    ["Sensitivity", sensitivity],
    ["Precision", precision],
    ["F1_Score", f_1]
    ],
    columns=["Metric", "Value"]
)

baseline_eval_metrics



| | Metric | Value |
|---|---|---|
| 0 | Sensitivity | 0.571429 |
| 1 | Precision | 0.666667 |
| 2 | F1_Score | 0.615385 |


The metrics are very low and in principle "easy to beat", but this is only because the sample size is very small for true positives and true negatives. 

In reality, the previous model performed better than this - nevertheless, this gives us a compass for our exercise.

**New prompts and prompting strategies should correctly reclassify more of the previous false positives and false negatives while still classifying the true positives and true negatives correctly.**

We'll store the result of all our tests into a dataframe table. This will allow us to contrast and compare approaches and make a final selection.



In [48]:
# Build the results table directly from the baseline row; concatenating onto an
# empty DataFrame triggers a pandas FutureWarning
test_results = pd.DataFrame(
    [{"test_name": "original", "sensivity": sensitivity, "precision": precision, "f1_score": f_1}]
)

test_results




| | test_name | sensivity | precision | f1_score |
|---|---|---|---|---|
| 0 | original | 0.571429 | 0.666667 | 0.615385 |


## 2. Apply LLM Validation

Let's start this section by defining a function that calls Azure OpenAI with a system prompt and an input provided by the user.

The system prompt will contain the criteria to validate an insight, and the user input will be the entry registered by our agents. 

In [51]:
HEADERS = {
    "Content-Type": "application/json",
    "api-key": config_key
}

def send_prompt(system_prompt, user_prompt, max_tokens=200):
    """Send a prompt to Azure OpenAI and return the response."""
    url = config_endpoint
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens
    }
    
    try:
        # timeout keeps a stalled connection from hanging the notebook indefinitely
        response = requests.post(url, headers=HEADERS, json=data, timeout=30)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"

Let's test it out with a very naive example to make sure it works

In [54]:
res = send_prompt(
    "You are a system dedicated to validating product feedback. You will only declare as valid feedback that has to do with usability issues; anything else will be invalid. Always return json with two fields: { valid: can only be true or false, reason: your reasoning as to why the insight is valid or invalid }",
    "I could not use the app at all, the menu was very convoluted and crowded with icons. Very hard to read"
)

print(res)
    

```json
{
  "valid": true,
  "reason": "The feedback highlights a usability issue regarding the app's menu being convoluted and crowded with icons, which makes it hard to use and read. This directly relates to how users interact with the app."
}
```


The model is giving us back a string formatted in Markdown. Let's create a function to clean it 

In [57]:
def clean_llm_response(res):
    # Drop only the surrounding Markdown fence so the JSON payload itself is untouched (needs Python 3.9+)
    return res.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()

In [58]:
print(clean_llm_response(res))

{
  "valid": true,
  "reason": "The feedback highlights a usability issue regarding the app's menu being convoluted and crowded with icons, which makes it hard to use and read. This directly relates to how users interact with the app."
}


Great! We now have the basic building block for testing different validation prompts.   
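The cleaned response is still a string, so it has to be parsed before it can be compared with the human labels. The sketch below shows one way this might be done with the standard `json` module. The helper name `parse_validation` and the choice to return `None` for unusable replies are assumptions, not part of the original notebook. Returning `None` also covers the case where `send_prompt` returns its `"Error: ..."` string instead of a model reply.

```python
import json

def parse_validation(cleaned):
    """Parse a cleaned LLM reply into (valid, reason), or return None if unusable."""
    try:
        payload = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    # Accept only a JSON object whose "valid" field is a real boolean
    if not isinstance(payload, dict) or not isinstance(payload.get("valid"), bool):
        return None
    return payload["valid"], payload.get("reason", "")

sample = '{"valid": true, "reason": "Mentions a usability issue in the menu."}'
print(parse_validation(sample))                         # (True, 'Mentions a usability issue in the menu.')
print(parse_validation("Error: connection timed out"))  # None
```

Keeping unparseable replies separate from `False` avoids silently counting a formatting failure as an "invalid" verdict.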

## 3. Analysing prompting approaches

Now we need a function that allows us to do the following:

1. Iterate through our TP, TN, FP and FN datasets.
2. For each row:
   1. Ask the LLM to validate it.
   2. Evaluate whether the LLM classified it correctly.
   3. Store the result.
3. Calculate sensitivity (recall), precision and F1.
4. Add the results to our test results table.
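The steps above can be sketched as a single evaluation function. This is a rough outline under stated assumptions rather than the notebook's final implementation: `evaluate_validator` and `keyword_validator` are hypothetical names, the rows are assumed to arrive as `(feedback, expected_valid)` pairs, and the LLM call is replaced by a trivial keyword check so the example runs offline.

```python
def evaluate_validator(validate, labeled_rows):
    """Run `validate` over labeled rows and return counts plus metrics.

    `validate` is any callable mapping feedback text to True, False or None
    (None meaning the reply could not be parsed).
    `labeled_rows` is an iterable of (feedback, expected_valid) pairs.
    """
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0, "unparsed": 0}
    for feedback, expected_valid in labeled_rows:
        predicted = validate(feedback)
        if predicted is None:
            counts["unparsed"] += 1
        elif predicted and expected_valid:
            counts["tp"] += 1
        elif predicted and not expected_valid:
            counts["fp"] += 1
        elif not predicted and expected_valid:
            counts["fn"] += 1
        else:
            counts["tn"] += 1

    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {**counts, "sensitivity": sensitivity, "precision": precision, "f1_score": f1}

# Hypothetical stand-in for the LLM call, so the sketch runs offline
def keyword_validator(feedback):
    return "menu" in feedback.lower()

demo_rows = [
    ("The menu is crowded and hard to read", True),
    ("Customer asked about pricing", False),
    ("Export fails on large files", True),
    ("Nice menu colours, no issue", False),
]
result = evaluate_validator(keyword_validator, demo_rows)
```

In the notebook, `validate` would instead wrap `send_prompt` and `clean_llm_response` with the system prompt under test, and the returned metrics would become a new row in the results table.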