## Prompt Testing Techniques

In [1]:
!pip install huggingface_hub==0.24.7
!pip install rouge-score
!pip install transformers torch

Collecting huggingface_hub==0.24.7
  Downloading huggingface_hub-0.24.7-py3-none-any.whl.metadata (13 kB)
Downloading huggingface_hub-0.24.7-py3-none-any.whl (417 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.29.3
    Uninstalling huggingface-hub-0.29.3:
      Successfully uninstalled huggingface-hub-0.29.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
peft 0.14.0 requires huggingface-hub>=

In [2]:
from huggingface_hub import InferenceClient
from transformers import pipeline
import torch
import os
import pandas as pd
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
from rouge_score import rouge_scorer
from google.colab import userdata
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet=True)
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [3]:
hf_token = userdata.get('HF_API_TOKEN')

deepseek_model_client = InferenceClient(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    token=hf_token,
    # headers={"X-Use-Cache": "false"}
)

In [4]:
def generate_response(model, system_role, user_query, temperature = 0.1, top_p = 0.1):

    response = model.chat_completion(
    messages=[{"role": "system", "content": system_role},
        {"role": "user", "content": user_query}],
    max_tokens=4000,
    temperature = temperature,
    top_p = top_p
    )

    return response.choices[0].message.content

### Pointwise vs. Pairwise Testing

#### Pointwise Testing

In [None]:
system_role = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet:
I like the movie but it was a bit too long."""

output = generate_response(deepseek_model_client,
               system_role,
               user_query)

## only retrieve the response not the thought process
response = output.strip().split("</think>")[-1].strip()
response

'The sentiment of the tweet is mixed. The user expresses a positive sentiment towards liking the movie but also a negative sentiment regarding its length. This combination of both positive and negative elements results in an overall mixed sentiment. \n\n**Answer:** The sentiment is mixed.'
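The `</think>` split above recurs in the cells that follow, so it can be wrapped in a small helper. This is a minimal sketch; the name `strip_reasoning` and the sample string are illustrative and not part of the original notebook.

```python
def strip_reasoning(output: str, end_tag: str = "</think>") -> str:
    """Return only the text after the last reasoning end tag.

    If the tag is absent, the whole (stripped) output is returned.
    """
    return output.strip().split(end_tag)[-1].strip()


# Illustrative raw output in the style of a reasoning model.
raw = "<think>The tweet has praise and a complaint.</think>\n\nThe sentiment is mixed."
print(strip_reasoning(raw))
```

Because `str.split` returns the whole string when the separator is missing, the helper also works for models that emit no reasoning block.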

#### Pairwise Testing

In [None]:
system_role = "You are an expert tweet sentiment analyzer."

user_query1 = f"""What is the sentiment expressed in the following tweet:
I like the movie but it was a bit too long."""


user_query2 = f"""What is the sentiment expressed in the following tweet.
Your response must be one word: positive, negative, or mixed.
I like the movie but it was a bit too long."""

prompts = {"Prompt 1": user_query1,
           "Prompt 2": user_query2}

for prompt, user_query in prompts.items():
  output = generate_response(deepseek_model_client,
               system_role,
               user_query)

  response = output.strip().split("</think>")[-1].strip()
  print(f"Response from {prompt}: {response}")



Response from Prompt 1: The sentiment of the tweet is mixed. The user expresses a positive sentiment towards liking the movie but also a negative sentiment regarding its length. This combination of both positive and negative elements results in an overall mixed sentiment. 

**Answer:** The sentiment is mixed.
Response from Prompt 2: mixed
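A pairwise comparison is easier to act on when the difference between the two prompts is turned into a measurable property. The sketch below checks whether each response is exactly one allowed label; the helper name `is_single_label` is hypothetical, and the Prompt 1 text is shortened from the run above.

```python
ALLOWED_LABELS = {"positive", "negative", "mixed"}


def is_single_label(response: str) -> bool:
    """True if the response is exactly one allowed label (case-insensitive)."""
    return response.strip().lower().rstrip(".") in ALLOWED_LABELS


# Responses taken from the run above (Prompt 1 shortened for brevity).
responses = {
    "Prompt 1": "The sentiment of the tweet is mixed.\n\n**Answer:** The sentiment is mixed.",
    "Prompt 2": "mixed",
}

format_ok = {name: is_single_label(text) for name, text in responses.items()}
print(format_ok)  # {'Prompt 1': False, 'Prompt 2': True}
```

Both prompts reach the same verdict here, but only Prompt 2 produces output that downstream code could consume without further parsing.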


### Reference-Free vs. Reference-Based Testing

#### Reference-based Testing

In [None]:
target_label = "mixed"

system_role = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet.
Your response must be one word: positive, negative, or mixed.
I like the movie but it was a bit too long."""

output = generate_response(deepseek_model_client,
               system_role,
               user_query)

response = output.strip().split("</think>")[-1].strip()
print(response)

if response == target_label:
  print("Correct")
else:
  print("Incorrect")

mixed
Correct
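A single comparison only says whether one example passed. In practice, reference-based testing is usually run over a labeled set and summarized as accuracy. The following is a minimal sketch under that assumption; the predictions and targets are made up for illustration, and the light normalization step is a design choice rather than anything prescribed by the notebook.

```python
import string


def normalize(label: str) -> str:
    """Lowercase and drop surrounding whitespace and punctuation."""
    return label.strip().strip(string.punctuation).lower()


def accuracy(predictions, targets):
    """Fraction of predictions that match their target after normalization."""
    matches = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return matches / len(targets)


# Hypothetical model outputs and gold labels, for illustration only.
predictions = ["mixed", "Positive.", " negative", "positive"]
targets = ["mixed", "positive", "negative", "negative"]
print(accuracy(predictions, targets))  # 0.75
```

Normalizing before comparing avoids counting `"Positive."` as wrong when the strict `==` check used above would reject it.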


#### Reference-free Testing

In [None]:
#Llama 3.3 endpoint
#https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
llama_model_client = InferenceClient(
    "meta-llama/Llama-3.3-70B-Instruct",
    token=hf_token
)

system_role = "You are an expert LLM response evaluator."
user_query = f"""Given the following input to an LLM:{user_query},
and the following response {response}. Do you think the response is accurate?"""


output = generate_response(llama_model_client,
               system_role,
               user_query)
output

'Yes. \n\nThe tweet expresses both a positive sentiment ("I like the movie") and a negative sentiment ("it was a bit too long"), which makes the overall sentiment "mixed". The response accurately captures this nuanced sentiment.'
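The judge's free-text reply has to be reduced to a verdict before it can be aggregated across many examples. One possible approach, sketched here, is to read a leading "Yes" or "No"; the function `parse_verdict` is hypothetical, and it assumes the judge begins its reply with the verdict, as it did in the run above.

```python
import re


def parse_verdict(judge_output: str):
    """Map a judge reply to True (yes), False (no), or None (unclear)."""
    match = re.match(r"\s*(yes|no)\b", judge_output, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


# Shortened version of the judge reply shown above.
judge_output = "Yes. \n\nThe response accurately captures this nuanced sentiment."
print(parse_verdict(judge_output))  # True
```

Returning `None` for unparseable replies keeps ambiguous judgments visible instead of silently counting them as failures.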

## Factors Affecting Prompt Response

### System Instructions

In [None]:
system_role = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet.
Your response must be positive, negative, or mixed.
I like the movie but it was a bit too long."""

output = generate_response(deepseek_model_client,
               system_role,
               user_query)

response = output.strip().split("</think>")[-1].strip()
response

'The sentiment expressed in the tweet is mixed. \n\nStep-by-step explanation:\n1. The tweet begins with a positive statement: "I like the movie."\n2. It then contrasts with a negative point: "it was a bit too long."\n3. The presence of both positive and negative sentiments indicates a mixed overall sentiment.\n\nAnswer: mixed'

In [None]:
system_role = "You are an expert tweet sentiment analyzer. You respond in a single word."
user_query = f"""What is the sentiment expressed in the following tweet.
Your response must be positive, negative, or mixed.
I like the movie but it was a bit too long."""

output = generate_response(deepseek_model_client,
               system_role,
               user_query)

response = output.strip().split("</think>")[-1].strip()
response

'mixed'

### Temperature Settings

In [None]:
system_prompt = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet:
I liked the movie but it was a bit too long."""

output = generate_response(deepseek_model_client,
               system_prompt,
               user_query,
               temperature = 0)

response = output.strip().split("</think>")[-1].strip()
response

'The sentiment of the tweet is mixed. The user expresses a positive sentiment by stating they liked the movie, but also includes a negative aspect by mentioning that it was too long. This combination of positive and negative elements results in an overall mixed sentiment.'

In [None]:
output = generate_response(deepseek_model_client,
               system_prompt,
               user_query,
               temperature = 0.8)

response = output.strip().split("</think>")[-1].strip()
response

'The sentiment expressed in the tweet is mixed. The user acknowledges a positive aspect by stating they liked the movie, while also expressing a mild negative criticism about its length. This combination of positive and negative sentiments results in an overall mixed sentiment.'
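The two runs differ only slightly in wording, which fits how temperature works: logits are divided by the temperature before the softmax, so low values concentrate probability on the top token and higher values flatten the distribution. The sketch below illustrates that standard formulation with made-up logits. It says nothing about how the hosted endpoint implements sampling, and a temperature of exactly 0 (which would divide by zero here) is typically handled as greedy decoding.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.1, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```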

### The Effect of Top-P

In [None]:
system_prompt = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet:
I liked the movie but it was a bit too long."""

output = generate_response(deepseek_model_client,
               system_prompt,
               user_query,
               temperature = 0.5,
               top_p = 0.1)

response = output.strip().split("</think>")[-1].strip()
response


'The sentiment of the tweet is mixed. The user expresses a positive sentiment by stating they liked the movie, but also includes a negative aspect by mentioning that it was a bit too long. This combination of positive and negative elements results in an overall mixed sentiment.'

In [None]:
system_prompt = "You are an expert tweet sentiment analyzer."
user_query = f"""What is the sentiment expressed in the following tweet:
I liked the movie but it was a bit too long."""

output = generate_response(deepseek_model_client,
               system_prompt,
               user_query,
               temperature = 0.5,
               top_p = 0.9)

response = output.strip().split("</think>")[-1].strip()
response

'The sentiment of the tweet is mixed. The person expresses a positive sentiment towards liking the movie but also a negative sentiment regarding its length. Therefore, the overall sentiment is mixed.'
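Top-p (nucleus) sampling restricts sampling to the smallest set of tokens whose cumulative probability reaches `top_p`. A minimal sketch of that filtering step follows; the token distribution is invented for illustration and the function `top_p_filter` is not part of any library used in this notebook.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches
    top_p, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution.
probs = {"mixed": 0.6, "positive": 0.25, "negative": 0.1, "neutral": 0.05}
print(top_p_filter(probs, 0.1))  # only "mixed" survives
print(top_p_filter(probs, 0.9))  # "mixed", "positive" and "negative" survive
```

With `top_p = 0.1` only the most likely token remains, so sampling is effectively deterministic whatever the temperature; with `top_p = 0.9` less likely tokens stay in play, which is one plausible reason the second response above is worded differently.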

## Patronus AI Platform

### LLM As a Judge

In [5]:
!pip install patronus

Collecting patronus
  Downloading patronus-0.1.3-py3-none-any.whl.metadata (6.1 kB)
Collecting httpx<0.28.0,>=0.27.0 (from patronus)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting opentelemetry-exporter-otlp<2.0.0,>=1.31.0 (from patronus)
  Downloading opentelemetry_exporter_otlp-1.31.1-py3-none-any.whl.metadata (2.5 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from patronus)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc==1.31.1 (from opentelemetry-exporter-otlp<2.0.0,>=1.31.0->patronus)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.31.1-py3-none-any.whl.metadata (2.5 kB)
Collecting opentelemetry-exporter-otlp-proto-http==1.31.1 (from opentelemetry-exporter-otlp<2.0.0,>=1.31.0->patronus)
  Downloading opentelemetry_exporter_otlp_proto_http-1.31.1-py3-none-any.whl.metadata (2.4 kB)
Collecting opentelemetry-exporter-otlp-proto-common==1.31.1 (from opentelemetry-exporter-otlp

In [24]:
import patronus
PATRONUS_API_KEY = userdata.get('PATRONUS_API_KEY')

In [25]:
patronus.init(
    api_key=PATRONUS_API_KEY,
)

# Create a simple tracer
@patronus.traced()
def test_function():
    return "Installation successful!"

# Call the function to test tracing
result = test_function()
print(result)

Installation successful!




In [26]:
# dataset download link: https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx
summaries = pd.read_excel(r'/content/summary_datasets.xlsx')
summaries.head()

Unnamed: 0.1,Unnamed: 0,id,human_summary,publication,author,date,year,month,theme,content
0,0,17283,In successfully seeking a temporary halt in th...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,politics,WASHINGTON — Congressional Republicans have...
1,0,17284,Officers put her in worse danger some months l...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,crime,"After the bullet shells get counted, the blood..."
2,0,17285,The film striking appearance had been created ...,New York Times,Margalit Fox,2017-01-06,2017.0,1.0,entertainment,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,0,17286,The year was only days old when the news came ...,New York Times,William McDonald,2017-04-10,2017.0,4.0,entertainment,"Death may be the great equalizer, but it isn’t..."
4,0,17287,If North Korea conducts a test in coming month...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,politics,"SEOUL, South Korea — North Korea’s leader, ..."


In [27]:
content = summaries["content"].iloc[10]
summary = summaries["human_summary"].iloc[10]
summary

'How much money you make how much time you spend with your friends and family how well your body functions years from now all of these in many ways are products of the habits you are building today.when you woke up this morning what did you do first did you hop in the shower check your email or grab a doughnut what did you say to your roommates on the way out the door salad or hamburger for lunch when you got home did you put on your sneakers and go for a run or eat dinner in front of the television most of the choices we make each day may feel like the products of decision making but they’re not.if you want to start running each morning it essential that you choose a simple cue and a clear reward .want more you might also like • the scientific workout • no time to workout try exercising on the job • how to pick a health insurance planand though each habit means relatively little on its own over time the meals we eat how we spend our evenings and how often we exercise have enormous imp

In [28]:
system_role = "You are an expert in text summarization. Summarize the articles like human."
user_query = f"""Generate a summary of the following article in 1000 characters:\n{content}"""
output = generate_response(deepseek_model_client,
                           system_role,
                           user_query)

## only retrieve the response not the thought process
response1 = output.strip().split("</think>")[-1].strip()
response1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


"The article discusses how daily choices often become habits, significantly impacting long-term health, productivity, and happiness, especially in one's 20s. It explains that habits form through a neurological loop involving a cue, routine, and reward. A study on exercisers found that habits develop due to specific cues and rewards. To create good habits, like exercise, one must establish clear cues (e.g., a specific time) and rewards (e.g., a sense of accomplishment or a treat). Initially, external rewards like chocolate may be needed to train the brain to associate the cue with the routine and inherent reward. Over time, the brain learns to enjoy the natural rewards, making the habit automatic. The article emphasizes understanding habit formation to build positive routines."

In [None]:
from patronus.evals import RemoteEvaluator

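def evaluate_summarization_patronus(reference, candidate):
  """Run Patronus' exact-match and fuzzy-match evaluators on a summary pair."""
  # Note: `reference` is passed as task_output and `candidate` as gold_answer,
  # so the evaluators treat the candidate as the gold answer.
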
  exact_match = RemoteEvaluator("exact-match")
  fuzzy_match = RemoteEvaluator("judge","patronus:fuzzy-match")

  exact_match_result  = exact_match.evaluate(
      task_output=reference,
      gold_answer=candidate
      )

  fuzzy_score_result = fuzzy_match.evaluate(
      task_output=reference,
      gold_answer=candidate
  )

  results = {
      'Exact Match': exact_match_result,
      'Fuzzy Match': fuzzy_score_result
  }

  return results

result_1 = evaluate_summarization_patronus(summary, response1)
result_1

{'Exact Match': EvaluationResult(score=0.0, pass_=False, text_output=None, metadata={'positions': None, 'extra': None, 'confidence_interval': None}, explanation=None, tags={}, dataset_id=None, dataset_sample_id=None, evaluation_duration=datetime.timedelta(0), explanation_duration=datetime.timedelta(0)),
 'Fuzzy Match': EvaluationResult(score=1.0, pass_=True, text_output=None, metadata={'positions': [[167, 173], [996, 1002], [516, 523]], 'extra': None, 'confidence_interval': None}, explanation='- The pass criteria requires the output to be similar in meaning to the gold answer.\n- The gold answer discusses the impact of daily choices becoming habits, the neurological loop of habit formation, and the importance of cues and rewards in forming good habits.\n- The output also discusses the impact of daily habits on long-term outcomes, the importance of cues and rewards in habit formation, and provides examples of daily choices that become habits.\n- Both the gold answer and the output empha

In [None]:
for key, value in result_1.items():
  print(f"Results for {key}")
  print(f"Pass: {value.pass_}")
  print(f"Score Raw: {value.score}")
  print(f"Explanation: {value.explanation}")
  print("=================================")

Results for Exact Match
Pass: False
Score Raw: 0.0
Explanation: None
Results for Fuzzy Match
Pass: True
Score Raw: 1.0
Explanation: - The pass criteria requires the output to be similar in meaning to the gold answer.
- The gold answer discusses the impact of daily choices becoming habits, the neurological loop of habit formation, and the importance of cues and rewards in forming good habits.
- The output also discusses the impact of daily habits on long-term outcomes, the importance of cues and rewards in habit formation, and provides examples of daily choices that become habits.
- Both the gold answer and the output emphasize the significance of habits in shaping future outcomes and the process of habit formation.
- The output, while not identical, captures the essence of the gold answer by discussing the role of habits, cues, and rewards in shaping long-term outcomes.
- Therefore, the output is similar in meaning to the gold answer, fulfilling the pass criteria.
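Exact match scores 0.0 for any paraphrase, while the LLM-based fuzzy match passes on meaning alone. A lexical-overlap score sits between the two and is cheap to compute locally. The sketch below is a simplified ROUGE-1-style unigram F1 written from scratch. It is a stand-in for illustration, not the `rouge_score` package imported earlier and not Patronus' evaluator, and the two sentences are invented.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """ROUGE-1-style F1 over lowercase whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented sentences for illustration.
reference = "habits form through a cue and a reward"
candidate = "a cue and a reward build habits"
print(round(unigram_f1(reference, candidate), 3))  # 0.8
```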


### Class Based Evaluators

In [29]:
from transformers import BertTokenizer, BertModel
from patronus import StructuredEvaluator, EvaluationResult
from patronus.experiments import run_experiment
import numpy as np


class BERTScore(StructuredEvaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")

    def evaluate(
        self, *, task_output: str, gold_answer: str, **kwargs
    ) -> EvaluationResult:
        output_toks = self.tokenizer(
            task_output, return_tensors="pt", padding=True, truncation=True
        )
        gold_answer_toks = self.tokenizer(
            gold_answer, return_tensors="pt", padding=True, truncation=True
        )

        output_embeds = (
            self.model(**output_toks).last_hidden_state.mean(dim=1).detach().numpy()
        )
        gold_answer_embeds = (
            self.model(**gold_answer_toks)
            .last_hidden_state.mean(dim=1)
            .detach()
            .numpy()
        )

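        # Cosine similarity of the mean-pooled embeddings; .item() converts
        # the resulting (1, 1) array into a plain Python float.
        score = (
            np.dot(output_embeds, gold_answer_embeds.T)
            / (np.linalg.norm(output_embeds) * np.linalg.norm(gold_answer_embeds))
        ).item()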

        return EvaluationResult(
            score=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )
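Despite its name, the class above averages the token embeddings of each text and takes the cosine similarity of the two pooled vectors. The BERTScore metric as usually defined matches tokens pairwise instead. The dependency-free sketch below reproduces the pooled-cosine calculation on tiny made-up vectors so the arithmetic can be checked by hand; the 3-dimensional embeddings are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional token embeddings for two short texts.
output_tokens = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
gold_tokens = [[1.0, 1.0, 0.0], [1.0, 1.0, 2.0], [1.0, 1.0, 1.0]]

score = cosine_similarity(mean_pool(output_tokens), mean_pool(gold_tokens))
pass_threshold = 0.8
print(round(score, 3), score >= pass_threshold)  # 0.943 True
```

Cosine similarity is symmetric, so swapping the two texts does not change the score.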




In [32]:
experiment = await run_experiment(
    api_key = PATRONUS_API_KEY,
    dataset=[
        {
            "task_output": summary,
            "gold_answer": response1,
        }
    ],
    evaluators=[BERTScore(pass_threshold=0.8)],
)

experiment.to_dataframe().to_dict()



Experiment  Global/1743493074: 100%|██████████| 1/1 [00:00<00:00,  2.81sample/s]



BERTScore [link_idx=0]
----------------------
Count     : 1
Pass rate : 1
Mean      : 0.831
Min       : 0.831
25%       : 0.831
50%       : 0.831
75%       : 0.831
Max       : 0.831

Score distribution
Score Range          Count      Histogram
0.00 - 0.17          0          
0.17 - 0.33          0          
0.33 - 0.50          0          
0.50 - 0.66          0          
0.66 - 0.83          1          ####################


{'link_idx': {0: 0},
 'task.name': {0: None},
 'evaluator_id': {0: 'BERTScore'},
 'criteria': {0: 'None'},
 'task.output': {0: None},
 'task.metadata': {0: None},
 'task.tags': {0: None},
 'eval.score': {0: 0.8306820392608643},
 'eval.pass': {0: True},
 'eval.text_output': {0: None},
 'eval.metadata': {0: None},
 'eval.explanation': {0: None},
 'eval.tags': {0: {'pass_threshold': '0.8'}},
 'eval.evaluation_duration': {0: None},
 'eval.explanation_duration': {0: None},
 'task_output': {0: 'How much money you make how much time you spend with your friends and family how well your body functions years from now all of these in many ways are products of the habits you are building today.when you woke up this morning what did you do first did you hop in the shower check your email or grab a doughnut what did you say to your roommates on the way out the door salad or hamburger for lunch when you got home did you put on your sneakers and go for a run or eat dinner in front of the television mo

In [None]:
result = experiment.result()
result.to_dataframe().to_dict()

{'link_idx': {0: 0},
 'task.name': {0: None},
 'evaluator_id': {0: 'BERTScore'},
 'criteria': {0: 'None'},
 'task.output': {0: None},
 'task.metadata': {0: None},
 'task.tags': {0: None},
 'eval.score': {0: 0.792450487613678},
 'eval.pass': {0: False},
 'eval.text_output': {0: None},
 'eval.metadata': {0: None},
 'eval.explanation': {0: None},
 'eval.tags': {0: {'pass_threshold': '0.8'}},
 'eval.evaluation_duration': {0: None},
 'eval.explanation_duration': {0: None},
 'task_output': {0: 'How much money you make how much time you spend with your friends and family how well your body functions years from now all of these in many ways are products of the habits you are building today.when you woke up this morning what did you do first did you hop in the shower check your email or grab a doughnut what did you say to your roommates on the way out the door salad or hamburger for lunch when you got home did you put on your sneakers and go for a run or eat dinner in front of the television mo

### Patronus AI Experiments with Glider LLM

In [33]:
## dataset download link
## here I upload my custom dataset

dataset = pd.read_csv("/content/validation-squad.csv.zip")

random_records = dataset.sample(n=50)

random_records.to_csv("qa_records.csv", index=False)

print(random_records.shape)

random_records.head()

(50, 6)


Unnamed: 0.1,Unnamed: 0,context,question,id,answer_start,text
3884,220,Internet2 is a not-for-profit United States co...,what is Internet2,5726472bdd62a815002e8042,13,a not-for-profit United States computer networ...
6875,68,Inflammation is one of the first responses of ...,What compounds are released by injured or infe...,572900f73f37b31900477f6b,228,eicosanoids and cytokines
3700,36,Packet mode communication may be implemented w...,In cases of shared physical medium how are the...,5726219489a1e219009ac2d0,541,multiple access scheme
9333,190,There are 13 natural reserves in Warsaw – amon...,What animals does the Vistula river's ecosyste...,57337ddc4776f41900660bbc,287,"otter, beaver and hundreds of bird species"
6622,174,Western musical instruments were introduced to...,What type of practices did the Yuan reintroduc...,572879574b864d1900164a17,432,Confucian governmental practices and examinations


In [34]:
from patronus.datasets import read_csv, read_jsonl

dataset = read_csv(
    "/content/qa_records.csv",
    task_input_field="question",
    task_context_field="context",
    )

dataset

Dataset(dataset_id=None, df=    Unnamed: 0                                            context  \
0          220  Internet2 is a not-for-profit United States co...   
9           96  After the revocation of the Edict of Nantes, t...   
10         171  Some theories of civil disobedience hold that ...   
11          78  After Malaysia's independence in 1957, the gov...   
12         253  The concept of prime number is so important th...   
13         287  Along the same lines, co-NP is the class conta...   
14         200  However, already in quantum mechanics there is...   
15          81  Jacksonville, like most large cities in the Un...   
16         124  On the other hand, higher economic inequality ...   
17         124  The university runs a number of academic insti...   
18          38  The first fortified settlements on the site of...   
1           68  Inflammation is one of the first responses of ...   
19         126  The Presiding Officer (or Deputy Presiding Off...   
20    

In [35]:
! pip -q install openai

from openai import OpenAI

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

oai_client = OpenAI(
    api_key = OPENAI_API_KEY
)

In [36]:
from patronus.datasets import Row
from patronus.experiments.types import TaskResult
from patronus import evaluator, RemoteEvaluator


In [37]:
def gpt_4o_mini_basic(row: Row, **kwargs) -> TaskResult:
    """Answer the question from the context using a basic prompt."""

    system_prompt = "Based on the context, answer the user's question."

    query = f"""
    Answer the following question based on the context.
    Question: {row.task_input}
    Context: {row.task_context}
    """


    evaluated_model_output = (
        oai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": system_prompt,
                },
                {
                    "role": "user",
                    "content": query
                },
            ],
            temperature = 0.0
        )
        .choices[0]
        .message.content
    )

    return evaluated_model_output

In [38]:
def gpt_4o_mini_cot(row: Row, **kwargs) -> TaskResult:
    """Answer the question from the context using a chain-of-thought prompt."""

    system_prompt = """You will receive a user's question and the context.
    Based on the context, answer the user's question.
    Only include information from the context and do not generate text inconsistent with the context.
    Think step by step to generate your final response."""

    query = f"""
    Answer the following question based on the context.
    Question: {row.task_input}
    Context: {row.task_context}
    """


    evaluated_model_output = (
        oai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": system_prompt,
                },
                {
                    "role": "user",
                    "content": query
                },
            ],
            temperature = 0.0
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output


In [39]:

small_hallucination_evaluator = RemoteEvaluator("glider", "small-hallucination-check")
assistants = [
    (gpt_4o_mini_basic, "gpt_4o_mini_basic"),
    (gpt_4o_mini_cot, "gpt_4o_mini_cot"),
]


results = []

for assistant_func, assistant_name in assistants:
    experiment_results = await run_experiment(
        api_key = PATRONUS_API_KEY,
        dataset=dataset,
        task = assistant_func,
        evaluators=[small_hallucination_evaluator],
        tags={"dataset_type": "qa RAG", "model": "gpt-4o-mini"},
        experiment_name= assistant_name,
        project_name = "Compare RAG Prompts with Custom Evaluator",
        )
    results.append(experiment_results)





Experiment  Compare RAG Prompts with Custom Evaluator/gpt_4o_mini_basic-1743493383: 100%|██████████| 50/50 [00:26<00:00,  1.90sample/s]



small-hallucination-check (glider) [link_idx=0]
-----------------------------------------------
Count     : 50
Pass rate : 0.6
Mean      : 2.42
Min       : 1.0
25%       : 2.0
50%       : 3.0
75%       : 3.0
Max       : 3.0

Score distribution
Score Range          Count      Histogram
1.00 - 1.40          9          ######
1.40 - 1.80          0          
1.80 - 2.20          11         #######
2.20 - 2.60          0          
2.60 - 3.00          30         ####################


Experiment  Compare RAG Prompts with Custom Evaluator/gpt_4o_mini_cot-1743493409: 100%|██████████| 50/50 [00:21<00:00,  2.28sample/s]


small-hallucination-check (glider) [link_idx=0]
-----------------------------------------------
Count     : 50
Pass rate : 0.58
Mean      : 2.44
Min       : 1.0
25%       : 2.0
50%       : 3.0
75%       : 3.0
Max       : 3.0

Score distribution
Score Range          Count      Histogram
1.00 - 1.40          7          ####
1.40 - 1.80          0          
1.80 - 2.20          14         #########
2.20 - 2.60          0          
2.60 - 3.00          29         ####################
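The two prompts differ by a single passing sample (pass rates of 0.6 and 0.58 on 50 questions, that is 30 versus 29 passes). A rough way to judge whether such a gap is meaningful is a two-proportion z statistic. The sketch below treats the two runs as independent samples, which is a simplifying assumption: both prompts were evaluated on the same 50 questions, so a paired test would be more appropriate for a careful analysis.

```python
import math


def two_proportion_z(passes_a, n_a, passes_b, n_b):
    """z statistic for the difference between two pass rates (pooled variance)."""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / standard_error


# 30/50 passes for the basic prompt, 29/50 for the chain-of-thought prompt.
z = two_proportion_z(30, 50, 29, 50)
print(round(z, 2))  # 0.2
```

A value this far below the conventional 1.96 cutoff suggests the observed difference is within sampling noise at this sample size.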



